ATA/SATA kernel issues
ATA subsystem causes the kernel to lock up (no panic) if atacontrol detach <channel> is executed without first unmounting the relevant filesystems
ATA subsystem acts erratically/incorrectly when a SATA disk is removed from the system without doing atacontrol detach <channel> prior to the removal.
- Easily reproducible on any hardware sporting a commercial-grade hot-swap SATA backplane.
Intel MatrixRAID: New ar(4) device created when bad disk in RAID-1 array replaced with new disk
- Patch available in PR.
- Intel MatrixRAID: Array goes incorrectly into READY state when rebooting machine in the middle of an array rebuild
- Patch available in PR 102210, and has been available since 2006.
- Intel MatrixRAID: Kernel panics when a disk is lost and reattached
- Numerous problems with embedded LSI v3 MegaRAID
- Open PRs:
- Patches available in PR 92786 and PR 101819. Patches have been available since 2006.
ServerWorks HT1000 chipsets causing SATA data corruption
Known to affect at least Dell PowerEdge SC1435 systems
Troubleshooting details: http://lists.freebsd.org/pipermail/freebsd-stable/2008-September/045549.html
- Supposedly fixed in January 2008 on RELENG_7 and HEAD.
- Adaptec 1420SA support
ATA/SATA other issues
SMART monitoring: Using the -s flag in smartd.conf to run periodic short/long offline tests results in DMA timeouts
Workaround: Stop using this feature. I explain why in this post.
I am in the process of communicating with Bruce Allen (author of smartmontools) to discuss why this feature exists, why it's advocated in the man page and example smartd.conf, and why one would want to perform these tests on a regular basis.
ATA/SATA DMA timeout issues
- Symptom: messages similar to those below appear in the kernel output. Sometimes harmless, sometimes fatal. LBAs listed are scattered, and SMART statistics for the disk in question show no sign of increased error rates or sector issues:
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54112319
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887
ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764596887
ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=453849407
ad0: FAILURE - READ_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=453849407
ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)
ad0: TIMEOUT - FLUSHCACHE retrying (0 retries left)
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: TIMEOUT - READ_DMA retrying (1 retry left) LBA=193407827
PATA only: Set hw.ata.ata_dma=0 in /boot/loader.conf. This will disable use of ATA DMA. NOTE: This workaround greatly decreases I/O performance. You have been warned...
Volker Theile of the FreeNAS project informs me that they have solved most of the DMA problems by increasing a hard-coded arbitrary timeout value of 5 (seconds) in the ATA code to 10 or 15, while simultaneously making the timeout value adjustable via sysctl. Volker submitted patches to sos@ over a year ago, but never received a response.
- As of 2008/02/27, Scott Long has offered to help track this problem down. Those who are able to reproduce the problem reliably should get in contact with Scott; serial console access will very likely be mandatory.
SATA disk troubleshooting
Understanding what you're dealing with
A substantial number of FreeBSD users report SATA disk problems. It is difficult to determine the source of these problems, due to the complex nature of hard disks and all related pieces (cabling and power, disk mechanics, disk firmware, protocol/transport, kernel driver, etc.). Even with a thorough understanding of how SATA disks work, there is a decent chance that even the most skilled system administrator won't be able to determine the root cause. To make matters worse, many system administrators do not have fail-over systems in place, which makes thorough analysis and troubleshooting impossible ("This system has to be up and running 24x7, I can't afford the downtime for others to look at it"). And then there's the issue of finances: sometimes cash is required to work around issues ("Do we know if this Adaptec SATA controller even works? Maybe we should switch to Areca or 3ware or Promise..."), while not everyone has such funds available.
But back to the disks themselves. Comparatively, with regards to bad block management, SCSI disks behave quite differently than SATA. SCSI will report any disk errors with sense code, ASC, and ASCQ, and may even automatically mark that block as a "grown defect" (a user-manageable list of bad blocks) -- while SATA disks will silently attempt to remap bad blocks, keeping track of such defects internally, and will not report to the transport layer (e.g. operating system) that anything had happened. For example, assuming the block was remapped successfully, even SMART statistics are usually left untouched; while in the case of a remapping failure, SMART attribute 198 (Offline_Uncorrectable) may get incremented.
In the case of SATA, such a scenario can take time, and depends greatly upon the type of error. Some errors (such as soft errors) may take under a second to recover from, while others (hard errors) may take longer periods -- and some may cause the disk to lock up entirely, requiring the disk to be power-cycled and the SATA channel reattached. FreeBSD expects that all ATA commands (that includes SATA!) sent to a device receive a response within 5 seconds. The timeout is hard-coded, and is entirely arbitrary; it has no implied meaning. It was chosen by firstname.lastname@example.org probably based on personal choice.
What FreeBSD has to say
So what happens when a disk operation is executed, but takes longer than 5 seconds to return a sense code? Well, FreeBSD spits out quite a lot of crap to the kernel console (see dmesg or /var/log/console.log), such as:
ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=XXXXX
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=YYYYY
ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=YYYYY
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=YYYYY
This tells you a few things, most of which are low-level:
The disk which experienced the problem was ad0
- A time-out occurred when attempting a write operation
- FreeBSD attempted to write data to LBA XXXXX via standard DMA (which uses 28-bit LBA addressing), experienced a time-out, and attempted a write retry once
- FreeBSD attempted to write data to LBA YYYYY via 48-bit DMA, experienced a time-out, and attempted a write retry twice
- FreeBSD deemed the write operation a failure
The ATA status result is value 0x51 (bits 6 (DRDY), 4 (not applicable), and 0 (ERR) set)
The ATA error result is value 0x10 (bit 4 set), which according to ATA-7 specification, Section 6.59.6 is: "IDNF shall be set to one if a user-accessible address could not be found. IDNF shall be set to one if an address outside of the range of user-accessible addresses is requested if command aborted is not returned." FreeBSD labels this bit as NID_NOT_FOUND
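To make the bit arithmetic above concrete, here is a minimal Python sketch that breaks the two register values from the log into their set bits. The table and function names (STATUS_BITS, ERROR_BITS, decode) are my own, not anything from the FreeBSD source, and the tables list only a handful of bits using the ATA mnemonics; anything not listed is printed as a bare bit number rather than guessed at. FreeBSD prints DRDY as READY, ERR as ERROR, and IDNF as NID_NOT_FOUND.

```python
# Partial bit tables using ATA mnemonics. These are illustrative and
# deliberately incomplete; unlisted bits are reported as "bitN".
STATUS_BITS = {7: "BSY", 6: "DRDY", 4: "DSC", 3: "DRQ", 0: "ERR"}
ERROR_BITS = {6: "UNC", 4: "IDNF", 2: "ABRT"}


def decode(value, names):
    """Return (bit, name) pairs for every bit set in an 8-bit register value,
    highest bit first."""
    return [
        (bit, names.get(bit, "bit%d" % bit))
        for bit in range(7, -1, -1)
        if value & (1 << bit)
    ]


# Values are printed in hex by the kernel: status=51 is 0x51, error=10 is 0x10.
print(decode(0x51, STATUS_BITS))  # [(6, 'DRDY'), (4, 'DSC'), (0, 'ERR')]
print(decode(0x10, ERROR_BITS))   # [(4, 'IDNF')]
```

In other words, 0x51 is 0x40 + 0x10 + 0x01, and 0x10 on its own is just bit 4.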
The IDNF bit seems to indicate that a particular LBA on the disk was inaccessible; I interpret this to mean "the LBA you're trying to access is within an invalid LBA range" (which would strongly indicate a bug in FreeBSD), but there's a good chance I'm reading the description wrong. This needs some further research/clarification, particularly by those more familiar with the ATA protocol semantics than I am.
None of these are very helpful though, are they? To a system administrator, this means "there's something wrong, possibly around 48-bit LBA YYYYY... or maybe 28-bit LBA XXXXX". Most administrators know that if the same LBA shows errors every time, the disk itself is the likely cause, but what if the LBA is random?
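One way to answer the "same LBA or random?" question is to pull the LBAs out of the kernel messages and look at them side by side. The following is a rough Python sketch, not a polished tool: the regular expression is written against the message format shown on this page only, and the names LOG_LINE, distinct_lbas and fits_lba28 are my own. The 28-bit check assumes plain DMA commands are used for LBAs below 2^28 and the 48-bit variants above that, which is at least consistent with the messages quoted here.

```python
import re

# Matches lines such as:
#   ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887
# Lines without an LBA (e.g. FLUSHCACHE timeouts) are skipped.
LOG_LINE = re.compile(
    r"^(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\S+) .*?LBA=(?P<lba>\d+)"
)

MAX_LBA28 = 2 ** 28  # 28-bit addressing covers LBAs 0 .. 2**28 - 1


def distinct_lbas(lines):
    """Map each device name to the sorted list of distinct LBAs it reported."""
    seen = {}
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            seen.setdefault(m.group("dev"), set()).add(int(m.group("lba")))
    return {dev: sorted(lbas) for dev, lbas in seen.items()}


def fits_lba28(lba):
    """True if the LBA is addressable with 28 bits."""
    return lba < MAX_LBA28


# Messages copied from the examples earlier on this page.
sample = [
    "ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=54112319",
    "ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=764596887",
    "ad0: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=764596887",
    "ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=764596887",
    "ad0: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=453849407",
    "ad0: TIMEOUT - FLUSHCACHE retrying (1 retry left)",
]

for dev, lbas in distinct_lbas(sample).items():
    for lba in lbas:
        print(dev, lba, "28-bit" if fits_lba28(lba) else "48-bit")
```

Retries of a single request repeat the same LBA, so the sketch collapses them into one entry. If the list for a device stays short and stable across days of logs, suspect the disk; if it keeps growing with unrelated addresses, the cause is more likely elsewhere.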
...more to follow...
What disks have to say
Rather than try to decipher what FreeBSD says, a more logical approach is to examine the disk to see if it logged any sort of error in SMART.
I need to make something clear: SMART is not a guaranteed way to determine the current state of a disk, or past events on a disk. SMART is entirely dependent upon the level of pedantry of the disk firmware programmer him/herself. Some SMART implementations don't even bother to log real errors; others increment counters only when offline SMART tests are run. The trick is knowing how to interpret SMART stats for each disk vendor (Western Digital, Seagate, Fujitsu, etc.). Sometimes it gets even more granular than that (different models of disks behaving differently when it comes to SMART).
...more to follow...