akawoz Posted May 27, 2017 (edited)

OK, after running unraid for years I've found something to stump me.

I had a drive fail recently with write errors (drive 12/sdq), giving the red cross in the drive list. I removed the drive from the array and ran an extended SMART test on it, which finished without error. I next ran a full pre-clear cycle (full read, full zero, full read), which also ran without error. I added the drive back into the array, thinking the drive controller had maybe remapped some bad sectors and I was OK again (for now), only to have the rebuild fail on the drive with more write errors.

I decided this drive really was bad, so I went to the store and purchased an identical WD Red 3TB drive. I ran a full pre-clear cycle, which it passed with flying colours. I added the drive to the array and it failed during rebuild with write errors. This replacement drive was attached to a different controller (on-board vs LSI 9211-8i) with different SATA and power cables, and mounted in a different chassis location.

So now I'm completely stumped. Two drives, both of which pre-clear just fine, both fail during rebuild in around the same place (~5%) with write errors (different sectors listed). The only commonality is that they are both in drive position 12 and seem to fail very early in the rebuild process.

Anyone got any idea where I should start troubleshooting this? No other issues with my array and unraid implementation; it's been very stable for the last couple of years and normally has uptimes measured in many months at a time.

Diagnostics attached: preston-diagnostics-20170527-1427.zip

Edited May 28, 2017 by akawoz: marking solved
Squid Posted May 27, 2017

The errors that precede the write errors all imply bad or poor cabling, power, or HBA. Reseat everything. Minimize power splitters, etc.
akawoz (Author) Posted May 27, 2017 (edited)

Interesting - I've changed only one thing (with anything remotely to do with cabling, power, or devices) in the last 6 months: I plugged a NiMH battery charger into the same outlet that the server is plugged into. Did this about a week ago. Will try a rebuild again with that removed. Feels a bit like voodoo, but it is a cheap one sourced from Aliexpress.

UPDATE: OK, that wasn't the problem. I start getting write errors immediately when I start the array. Will power down and try to reseat everything tomorrow and report back.

Edited May 27, 2017 by akawoz
akawoz (Author) Posted May 27, 2017

Haven't done the reseat process yet, but I'm wondering why the whole rest of the array is running just fine, except when I rebuild drive 12. Lots of reads and writes are going to the other 11 drives without problems. Remember, the second drive I tried was connected using completely different cabling, to a different HBA (motherboard based).
JorgeB Posted May 27, 2017

Logs don't show what happened with the other disk, but with this one there was trouble from the start:

    May 27 13:04:27 Preston kernel: ata1: softreset failed (1st FIS failed)
    May 27 13:04:27 Preston kernel: ata1: SATA link down (SStatus 0 SControl 300)
    May 27 13:04:27 Preston kernel: ata1: exception Emask 0x10 SAct 0x0 SErr 0x40d0002 action 0xe frozen
    May 27 13:04:27 Preston kernel: ata1: irq_stat 0x00400040, connection status changed
    May 27 13:04:27 Preston kernel: ata1: SError: { RecovComm PHYRdyChg CommWake 10B8B DevExch }
    May 27 13:04:27 Preston kernel: ata1: hard resetting link

It failed to identify multiple times, and it ended up succeeding with speed limited to SATA2:

    May 27 13:05:12 Preston kernel: ata1: softreset failed (1st FIS failed)
    May 27 13:05:12 Preston kernel: ata1: limiting SATA link speed to 3.0 Gbps
    May 27 13:05:12 Preston kernel: ata1: hard resetting link
    May 27 13:05:12 Preston kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
    May 27 13:05:12 Preston kernel: ata1.00: ATA-9: WDC WD30EFRX-68EUZN0, WD-WCC4N2ZUES3F, 82.00A82, max UDMA/133
    May 27 13:05:12 Preston kernel: ata1.00: 5860533168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
    May 27 13:05:12 Preston kernel: ata1.00: configured for UDMA/133
    May 27 13:05:12 Preston kernel: ata1: EH complete

But errors started again immediately when trying to rebuild. It's clearly a hardware issue: assuming the disk is fine, it's the cables or the controller/port.
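For anyone wanting to check their own syslog for the same pattern, here is a rough Python sketch that counts link-level events per ATA port. It is only an illustration: the function name `summarize_ata_events` and the category labels are made up for this example, and the patterns only cover the message types quoted in the log excerpts above, so other kernel versions or drivers may word things differently.

```python
import re
from collections import defaultdict

# Patterns are based only on the kernel messages quoted in this thread.
# The category names on the left are hypothetical labels for this sketch.
PATTERNS = {
    "softreset_failed": re.compile(r"softreset failed"),
    "link_down": re.compile(r"SATA link down"),
    "hard_reset": re.compile(r"hard resetting link"),
    "speed_limited": re.compile(r"limiting SATA link speed"),
}

# Matches "kernel: ata1: ..." as well as "kernel: ata1.00: ...".
PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(lines):
    """Count link-level events per ATA port in an iterable of syslog lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = PORT.search(line)
        if not match:
            continue  # not an ATA kernel message
        port, message = match.groups()
        for name, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[port][name] += 1
    # Convert nested defaultdicts to plain dicts for easier printing.
    return {port: dict(events) for port, events in counts.items()}


# Sample lines copied from the excerpts above.
sample = [
    "May 27 13:04:27 Preston kernel: ata1: softreset failed (1st FIS failed)",
    "May 27 13:04:27 Preston kernel: ata1: SATA link down (SStatus 0 SControl 300)",
    "May 27 13:04:27 Preston kernel: ata1: hard resetting link",
    "May 27 13:05:12 Preston kernel: ata1: limiting SATA link speed to 3.0 Gbps",
    "May 27 13:05:12 Preston kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)",
]

if __name__ == "__main__":
    for port, events in summarize_ata_events(sample).items():
        print(port, events)
```

To run it against a real log, pass an open file object instead of `sample`. A port that shows repeated resets or a speed downgrade is a reasonable place to start checking cabling and power, but the counts alone don't say which component is at fault.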
akawoz (Author) Posted May 27, 2017

Thanks @johnnie.black - I rebooted between drive attempts. Agreed, definitely hardware. Will pick this up again in about 12 hours. Thanks for looking at my logs, cheers!
akawoz (Author) Posted May 28, 2017

@Squid was correct - the problem was a faulty SATA power cable, in this case a Molex to SATA converter. Thanks for your help!