ATA/SATA kernel issues

ATA/SATA other issues

ATA/SATA DMA timeout issues

SATA disk troubleshooting

Understanding what you're dealing with

A substantial number of FreeBSD users report SATA disk problems. It is difficult to determine the source of these problems, due to the complex nature of hard disks and all related pieces (cabling and power, disk mechanics, disk firmware, protocol/transport, kernel driver, etc.). Even with a thorough understanding of how SATA disks work, there is a decent chance that the most skilled system administrator won't be able to determine the root cause. To make matters worse, many system administrators do not have fail-over systems in place, which makes thorough analysis and troubleshooting impossible ("This system has to be up and running 24x7, I can't afford the downtime for others to look at it"). And then there's the issue of finances: sometimes cash is required to work around issues ("Do we know if this Adaptec SATA controller even works? Maybe we should switch to Areca or 3ware or Promise..."), and not everyone has such funds available.

But back to the disks themselves. When it comes to bad block management, SCSI disks behave quite differently from SATA disks. A SCSI disk will report any disk error with a sense key, ASC, and ASCQ, and may even automatically mark that block as a "grown defect" (a user-manageable list of bad blocks). A SATA disk, by contrast, will silently attempt to remap bad blocks, keeping track of such defects internally, and will not report to the transport layer (e.g. the operating system) that anything happened. For example, if the block was remapped successfully, SMART statistics are not guaranteed to change (although many drives do increment attribute 5, Reallocated_Sector_Ct); if the remapping fails, SMART attribute 198 (Offline_Uncorrectable) may get incremented.

In the case of SATA, such remapping can take time, and how long depends greatly upon the type of error. Some errors (such as soft errors) may take under a second to recover from, while others (hard errors) may take much longer. Some may cause the disk to lock up entirely, requiring the disk to be power-cycled and the SATA channel reattached. FreeBSD expects every ATA command (that includes SATA!) sent to a device to receive a response within 5 seconds. The timeout is hard-coded and entirely arbitrary; it has no implied meaning. It was chosen by sos@freebsd.org, probably as a matter of personal preference.

What FreeBSD has to say

So what happens when a disk operation is executed, but takes longer than 5 seconds to complete? (ATA devices return a status and an error register, not a SCSI-style sense code.) Well, FreeBSD spits out quite a lot of crap to the kernel console (see dmesg or /var/log/console.log), such as:

ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=XXXXX
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=YYYYY
ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=YYYYY
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=YYYYY

This tells you a few things, most of which are low-level:

The IDNF ("ID not found") bit, which is error register value 10 (hex) and is printed as NID_NOT_FOUND in the FAILURE line above, seems to indicate that a particular LBA on the disk was inaccessible. I interpret this to mean "the LBA you're trying to access is within an invalid LBA range" (which would strongly indicate a bug in FreeBSD), but there's a good chance I'm reading the description wrong. This needs some further research/clarification, particularly by those more familiar with the ATA protocol semantics than I am.
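To make the status= and error= fields a little less opaque, here is a minimal Python sketch that decodes them into bit names. It is not part of FreeBSD. The bit names are the ones I recall from the ATA specification's status and error registers (FreeBSD prints its own labels, such as READY for DRDY and NID_NOT_FOUND for IDNF), and the helper names decode_bits and decode_failure are made up for illustration. Some of these bits have been renamed or made obsolete across ATA revisions, so treat the table as approximate.

```python
import re

# ATA status register bits, highest bit first (names per the ATA spec;
# FreeBSD's log uses slightly different labels, e.g. READY for DRDY).
STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

# ATA error register bits (several have changed meaning between ATA
# revisions; this is an approximate table, not an authoritative one).
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNC",
    0x20: "MC",
    0x10: "IDNF",
    0x08: "MCR",
    0x04: "ABRT",
    0x02: "NM",
    0x01: "AMNF",
}

# Matches the "status=51<...> error=10<...>" part of a FAILURE line.
# The numbers are hexadecimal.
FAILURE_RE = re.compile(
    r"status=([0-9a-fA-F]+)<[^>]*>\s+error=([0-9a-fA-F]+)<[^>]*>"
)


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in names.items() if value & mask]


def decode_failure(line):
    """Decode a FAILURE log line into (status_names, error_names).

    Returns None if the line has no status=/error= fields.
    """
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return decode_bits(status, STATUS_BITS), decode_bits(error, ERROR_BITS)


if __name__ == "__main__":
    line = ("ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> "
            "error=10<NID_NOT_FOUND> LBA=YYYYY")
    print(decode_failure(line))
```

Fed the FAILURE line from the log above, this shows that status 51 is three bits (device ready, seek complete, error) and that error 10 is the single IDNF bit.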

None of this is very helpful though, is it? To a system administrator, this means "there's something wrong, possibly around 48-bit LBA YYYYY... or maybe 28-bit LBA XXXXX". Most administrators know that if the errors always hit the same LBA, the disk itself is likely the cause. But what if the LBA is random?
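One way to answer the "same LBA or random LBA?" question is to tally the messages per device and LBA. The following is a rough Python sketch, assuming log lines shaped like the ones shown above. The function names count_lbas and repeated_lbas, the threshold parameter, and the sample LBA values are all invented for illustration.

```python
import re
from collections import Counter

# Matches TIMEOUT and FAILURE lines like the ones FreeBSD logs for ad(4)
# devices. search() is used so a syslog prefix (date, host, "kernel:")
# in front of the device name does not matter.
EVENT_RE = re.compile(
    r"\b(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\S+) "
    r".*LBA=(?P<lba>\d+)"
)


def count_lbas(lines):
    """Count TIMEOUT/FAILURE messages per (device, LBA) pair."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("lba")))] += 1
    return counts


def repeated_lbas(counts, threshold=2):
    """Return (device, LBA, count) tuples seen at least threshold times."""
    hits = [(dev, lba, n) for (dev, lba), n in counts.items() if n >= threshold]
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical sample data; the LBA values are made up.
    sample = [
        "ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=1000",
        "ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=1000",
        "ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> "
        "error=10<NID_NOT_FOUND> LBA=1000",
        "ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=2000",
    ]
    for dev, lba, n in repeated_lbas(count_lbas(sample)):
        print(f"{dev}: LBA {lba} seen {n} times")
```

Keep in mind that a single failing command already produces several lines for the same LBA (one per retry, plus the final FAILURE), so a count of two or three does not by itself mean the problem recurred. It is the same LBA showing up across separate incidents that points at the disk.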

...more to follow...

What disks have to say

Rather than trying to decipher what FreeBSD says, a more logical approach is to examine the disk itself and see whether it logged any sort of error in SMART.

I need to make something clear: SMART is not a guaranteed way to determine the current state of a disk, or past events on a disk. SMART is entirely dependent upon the level of pedantry of the disk firmware programmer him/herself. :-) Some SMART implementations don't even bother to log real errors; others increment counters only when offline SMART tests are run. The trick is knowing how to interpret SMART stats for each disk vendor (Western Digital, Seagate, Fujitsu, etc.). Sometimes it gets even more granular than that (different models of disks behaving differently when it comes to SMART).
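As a starting point, the handful of attributes related to bad blocks can be pulled out of the attribute table. The sketch below is Python and assumes the table comes from smartctl -A (part of smartmontools, which is my assumption; this page does not prescribe a tool) in its usual ten-column layout. The choice of attributes 5, 197 and 198 as "worth watching" is also mine, and per the caveats above a zero raw value proves nothing on firmware that doesn't bother to log. The function names and the sample table are hypothetical.

```python
# Attribute IDs commonly associated with bad blocks: 5 (reallocated
# sectors), 197 (pending sectors), 198 (offline uncorrectable).
# Which attributes a drive actually maintains is vendor-dependent.
WATCHED = {5, 197, 198}


def parse_attributes(text):
    """Parse a smartctl -A style table into {id: (name, raw_value)}.

    Expects the usual ten columns: ID#, ATTRIBUTE_NAME, FLAG, VALUE,
    WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE. Rows whose
    raw value does not start with a plain integer are skipped.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header, blank line, or something else entirely
        try:
            raw = int(fields[9])
        except ValueError:
            continue
        attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def nonzero_watched(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {
        attr_id: value
        for attr_id, value in attrs.items()
        if attr_id in WATCHED and value[1] > 0
    }


if __name__ == "__main__":
    # Hypothetical table for illustration; not taken from a real disk.
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
"""
    for attr_id, (name, raw) in sorted(nonzero_watched(parse_attributes(sample)).items()):
        print(f"{attr_id} {name}: raw value {raw}")
```

The text passed to parse_attributes would normally be the captured output of something like smartctl -A /dev/ad0. How to interpret a non-zero raw value still depends on the vendor and even the disk model.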

...more to follow...

JeremyChadwick/ATA_issues_and_troubleshooting (last edited 2008-11-05 12:15:28 by JeremyChadwick)