Same drive intermittently disabled - cables/backplane/sata port changed each time - General Support

January 25, 2019

Greetings,
I wonder if anyone has seen anything like this; I would welcome suggestions on how to proceed. I can upload diagnostics, but because the problem is very intermittent they likely won't contain logs from a failure.
The issue:
Starting 4 months ago, and on 3 occasions so far, I get the error "Array has 1 disk with read errors" and the disk becomes disabled. It's always the same disk, which is an identical model to another one I have (two 4TB IronWolf disks, produced/purchased about 4 months apart). The disk in question is just under a year old.
Each time this has occurred I have tried to run smartctl from the command line, but the disk doesn't respond: smartctl prints its startup banner and then exits. The same command works fine on the other 'identical' disk. I then reboot, and the server gets stuck at 'Detecting hard drives' during initial startup. Interestingly, the drive cage also shows a red diagnostic light for the affected slot.
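Since the kernel log from the moment of failure is gone after the reboot, one option is to save the disk-related lines before restarting. The following is only a rough sketch of that idea: the function names, the output path, and the match pattern are illustrative assumptions, not anything taken from the server's own tooling, and the pattern will need tuning to whatever the kernel log shows.

```python
import re
import subprocess

# Illustrative pattern (an assumption): ATA link messages such as "ata3.00:",
# block device names such as "sdc", and generic "I/O error" lines.
ATA_PATTERN = re.compile(r"\bata\d+(\.\d+)?:|\bsd[a-z]{1,2}\b|I/O error", re.IGNORECASE)


def filter_ata_lines(log_text):
    """Return the lines of log_text that look related to ATA links or disks."""
    return [line for line in log_text.splitlines() if ATA_PATTERN.search(line)]


def save_ata_log(out_path):
    """Dump the kernel ring buffer and keep only the disk-related lines."""
    # dmesg prints the kernel ring buffer; run this BEFORE rebooting.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    lines = filter_ata_lines(result.stdout)
    with open(out_path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `save_ata_log("ata-errors.txt")` with a path on storage that survives a reboot would leave something to attach to a support thread or an RMA request.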
The first two times, I did this:
Turn off the server, remove the drive, switch the cable, and move the drive to a different backplane slot, so that the cable, the SATA port and the backplane slot are all ones that had been working fine with other drives. Turn on the server - the drive is recognized fine. Run extended SMART tests - no errors found. So I then just reassign the disk and it happily rebuilds from parity.
The SMART report on the drive doesn't show any errors.
A month or so passes and then the same thing occurs - typically 2 read errors and the drive is disabled.
This happened again two days ago, so this time I pulled the drive and ran SeaTools on Windows 10, doing the generic long test twice - which in theory reads the entire disk to check for errors. I can't get it to produce an error.
I'd like to RMA the drive, but it seems that if I cannot demonstrate an error the RMA probably won't be successful.
At this point the only thing I can think of is: could this be a spin-down timing/reporting issue? I can envision a situation where the disk is commanded to spin up but doesn't respond in a timely manner. That would be strange, though, since I'd expect it to be a firmware bug, and both of the 'identical' drives have the same firmware and appear identical in hdparm or SeaTools disk info.
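One way to probe the spin-up theory would be to force each drive into standby and time the first read afterwards, then compare the two 'identical' disks. This is a minimal sketch under assumptions: the function name, the settle time, and the block offset are made up for illustration, and it relies on `hdparm -y` (enter standby now) and GNU `dd` with `iflag=direct`. It needs root, and the drive may answer some reads from its own cache without spinning up, so the offset may need varying between runs.

```python
import subprocess
import time


def time_first_read(device, settle_s=15.0, skip_blocks=1_000_000,
                    run=subprocess.run, sleep=time.sleep, clock=time.monotonic):
    """Force the drive into standby, then time one 4 KiB direct read.

    run, sleep and clock are injectable so the logic can be exercised
    without touching real hardware.
    """
    # hdparm -y asks the drive to enter standby (spin down) immediately.
    run(["hdparm", "-y", device], check=True)
    sleep(settle_s)  # give the platters time to actually stop
    start = clock()
    # iflag=direct bypasses the page cache so the read has to reach the drive.
    run(["dd", f"if={device}", "of=/dev/null", "bs=4096", "count=1",
         f"skip={skip_blocks}", "iflag=direct"], check=True)
    return clock() - start
```

Running this a number of times against both disks (for example `time_first_read("/dev/sdX")`, with the device name filled in) and finding the suspect drive consistently much slower to wake, or failing the read outright, would at least support the timing theory.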
I can also report that the temperatures have not been unusual at any point since this issue started - in the summer the disks might have seen some heat in the 50°C+ range, but around the failure times I haven't seen a disk above 30°C. Also, the disk that isn't failing always sits above the one that is, and thus experiences a little more heat than the failing one. Initially I suspected subtle vibration was doing something to the cable/backplane, but three times on the same disk in different slots with different cables? Seems doubtful.
Any ideas or suggestions are welcome.
Yrs,
Del
