Need Help: regular failure of all four HDDs

November 6, 2009

I've been having regular errors on my unRAID server and I would really appreciate some expert help on what I should be doing to make this system work. The errors manifest regularly and show up as a large number of reads, writes and errors across all four drives. A zipped syslog from right after the most recent crash is attached. Typically the status screen looks like this after the errors occur:

Device  Temperature  Size         Free         Reads          Writes         Errors
parity  *            976,762,552  -            3,086,974,176  3,218,038,968  210
disk1   *            976,762,552  183,050,940  3,086,974,176  3,218,038,968  210
disk2   *            976,762,552  89,215,740   3,086,974,176  3,218,038,968  210
disk3   *            976,762,552  694,360,664  3,086,974,176  3,218,038,968  210

Occasionally, the errors will occur on only one or two disks. On at least two occasions the whole server would not respond (i.e. I couldn't telnet into it and had to hard reset). The errors occur regularly (I haven't been able to go for 48 hours without errors) and during different activities (while copying data to the unraid server, while watching a movie from the server, overnight while the server is inactive, etc).

My build is as follows:

Supermicro C2SEA w/ 2gb ram and e5200 CPU

PC Power & Cooling Silencer 750

Norco 4220 case

4 x samsung f1 1tb drive (3xdata & 1 parity)

As for the troubleshooting I've tried so far: I first found that the wall outlet I was using didn't have a proper ground, so I switched the power bar to a new outlet, which didn't resolve the issue. Next I suspected a problem with the mobo, so I RMA'd it, and Supermicro returned it a month later saying everything tested OK. I've run memtest with no errors. I'm using SATA-to-SAS cables and I've switched cables once, with no effect. I've tried using a different backplane in the Norco case, with no effect. I've tried different combinations of Molex connectors (using x1 and x2 cables) to power the backplane, also with no effect.

I set up the unRAID server in July from all brand new parts and had it up and running problem-free for about a month. Then I registered a Pro key and added a fourth HDD. It was about 2-3 weeks after this that the problems started occurring.

I'm at a loss for what to try next. I expect that one or more pieces of hardware are probably toast and will need to be replaced, but rather than randomly purchasing components I'd appreciate any help in identifying what the problem is. Thanks in advance for any suggestions.

November 6, 2009

I've been having regular errors on my unraid server and I would really appreciate some expert help on what I should be doing to make this system work. The errors manifest regularly and show up as a large number of reads, writes and errors across all four drives. I tried posting a syslog but it says the file is too large

You can zip the syslog and attach it here.

I first found that the wall outlet I was using didn't have a proper ground so ...

It may not be what's causing your problem, but why would you run your server without a UPS?

You can get a decent (APC) UPS for fifty bucks these days, and that will save you a lot of trouble.

Purko

November 6, 2009

The primary error codes reported are "RecovComm PHYRdyChg CommWake DevExch". I can't claim to be an expert, but to the best of my knowledge, these are all consistent with a brief drive disconnection, either a physical disconnection, perhaps due to vibration or not well seated in a backplane, or a brief loss of power, perhaps a loose power cable or splitter.
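To see which ports report these codes and how often, something like the following rough sketch could be run over the syslog. It assumes the kernel prints the flags between braces after "SError:" on a line that starts with the port name; the three sample lines are made up for illustration and are not taken from the attached syslog.

```python
import re
from collections import Counter

# Assumed line shape (not copied from the poster's syslog):
#   "... kernel: ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }"
SERROR_RE = re.compile(r"(ata\d+(?:\.\d+)?): SError: \{([^}]*)\}")

def tally_serror_flags(lines):
    """Count how often each SError flag name appears, grouped by ATA port."""
    counts = {}
    for line in lines:
        match = SERROR_RE.search(line)
        if not match:
            continue
        port = match.group(1)
        flags = match.group(2).split()
        counts.setdefault(port, Counter()).update(flags)
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    "Nov  6 10:40:30 Tower kernel: ata1: SError: { RecovComm PHYRdyChg CommWake DevExch }",
    "Nov  6 10:40:30 Tower kernel: ata2: SError: { PHYRdyChg CommWake }",
    "Nov  6 11:03:43 Tower kernel: ata1: SError: { PHYRdyChg DevExch }",
]

flag_counts = tally_serror_flags(sample)
for port, flags in sorted(flag_counts.items()):
    print(port, dict(flags))
```

For a real log, pass an open file object instead of `sample`. A port that shows far more PHYRdyChg events than the others would be the first place to check cabling and seating.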

Unfortunately, your SATA drives are configured in your BIOS settings to emulate IDE drives. One of the flaws of the IDE system is that there are 2 drives per channel, and since some operations are performed on the channel instead of the individual drive, a problem with one drive can seriously affect the other drive assigned to that same channel. And that is what happened to you, when the SATA link went down. Because of a 'disconnection' of one drive, both drives were disabled.
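As a small illustration of the channel sharing, here is a sketch that groups drives by IDE channel. The device-to-drive mapping is the one shown in the kernel lines quoted further down in this post; the helper name `drives_on_same_channel` is just made up for the example.

```python
# Device-to-drive mapping as it appears in the kernel lines quoted in this post.
# In IDE emulation, "ataN.00" and "ataN.01" are the two devices on channel N.
DEVICES = {
    "ata1.00": "sda",
    "ata1.01": "sdb",
    "ata2.00": "sdc",
    "ata2.01": "sdd",
}

def drives_on_same_channel(device_id):
    """Return every drive that shares a channel with the given device."""
    channel = device_id.split(".")[0]
    return sorted(
        drive for dev, drive in DEVICES.items() if dev.split(".")[0] == channel
    )

# A link-down on the channel carrying sdd (ata2.01) also takes sdc with it.
print(drives_on_same_channel("ata2.01"))  # ['sdc', 'sdd']
```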

At the first occurrence, sda and sdc simultaneously and equivalently lost contact, and were both quickly recovered without issue. Resets were sent to both channels so drives sda through sdd were all reset. This pair of drives repeated this simultaneous contact loss several more times, at random points in time. Then later, sdd repeatedly dropped out and returned, until it did not return! Here is a log of the dropouts and drives involved:

10:40:30 sda, sdc
11:03:43 sda, sdc
12:10:51 sdc
12:47:12 sdb, sdc
15:54:23 sdd
16:19:08 sdd
16:28:24 sdd
16:55:27 sdd
17:03:27 sdd  (but this time, SATA link goes down too, resulting in both drives disabled)
17:03:27 Tower kernel: ata2: SATA link down
17:03:38 Tower kernel: ata2.01: disabled   (sdd)
17:03:54 Tower kernel: ata2.00: disabled   (sdc)
17:04:54 sda  (but SATA link also goes down, resulting in both drives disabled)
17:04:55 Tower kernel: ata1: SATA link down
17:05:06 Tower kernel: ata1.01: disabled   (sdb)
17:05:17 Tower kernel: ata1.00: disabled   (sda)

Once a drive is disabled, you can completely ignore all further errors associated with that drive, whether low or higher level, whether Linux device or unRAID drive related. That includes all of the apparent errors reported on the unRAID web page.
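If you want to find the cut-off points yourself, a minimal sketch like the one below pulls the "SATA link down" and "disabled" lines out of a syslog and records when each device was first disabled. The regular expressions assume the same line shape as the kernel lines quoted above (time, host name, "kernel:", then the message); a full syslog line may carry a date prefix, which the search simply skips over.

```python
import re

# Patterns modelled on the kernel lines quoted above, e.g.
#   "17:03:27 Tower kernel: ata2: SATA link down"
#   "17:03:38 Tower kernel: ata2.01: disabled"
LINK_DOWN_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) \S+ kernel: (ata\d+): SATA link down")
DISABLED_RE = re.compile(r"(\d{2}:\d{2}:\d{2}) \S+ kernel: (ata\d+\.\d+): disabled")

def link_down_events(lines):
    """Return (port, time) pairs for every 'SATA link down' line."""
    events = []
    for line in lines:
        match = LINK_DOWN_RE.search(line)
        if match:
            events.append((match.group(2), match.group(1)))
    return events

def first_disable_times(lines):
    """Map each device to the time it was first reported as disabled."""
    disabled = {}
    for line in lines:
        match = DISABLED_RE.search(line)
        if match:
            disabled.setdefault(match.group(2), match.group(1))
    return disabled

# The six kernel lines from the log above.
log = [
    "17:03:27 Tower kernel: ata2: SATA link down",
    "17:03:38 Tower kernel: ata2.01: disabled",
    "17:03:54 Tower kernel: ata2.00: disabled",
    "17:04:55 Tower kernel: ata1: SATA link down",
    "17:05:06 Tower kernel: ata1.01: disabled",
    "17:05:17 Tower kernel: ata1.00: disabled",
]

downs = link_down_events(log)
disabled_at = first_disable_times(log)
for device, when in disabled_at.items():
    print(f"{device} disabled at {when}; later errors for it can be ignored")
```

Anything logged for a device after its time in `disabled_at` is fallout from the disable, not a new fault.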

In my (inexpert) opinion, you have no faulty hardware components, just some bad connections. Besides checking, locking, and vibration-proofing all of the relevant connections as well as you can, I would recommend that you also change the BIOS settings to a native SATA or AHCI choice.

November 7, 2009

Author

Hi, thanks for the helpful replies. You're right, I should've had a UPS from the beginning. I guess I didn't realize it was essential and thought I could make do with a surge protector for a few months. I have a quality UPS on order and hopefully that will do the trick. At the very least it will eliminate another variable.

Thanks for bringing the IDE BIOS setting to my attention. I actually remember switching the drives from AHCI to IDE mode in a blind, desperate attempt to get things working, and it looks like it only made things worse. I'll change it back in the BIOS, recheck all of my connections and try again.

Is there any chance that the faulty connections could be due to the backplane on the Norco 4220? I've read stories of faulty backplanes, but generally in those cases the drives wouldn't work at all. Is it possible that a faulty backplane could cause the drives to drop out periodically like I'm experiencing?

November 8, 2009

I can't say much about the Norco cases (except that I wish I could afford one!), but there are others here with first-hand experience with them. Hopefully, one of them will advise you. What I can say is that, in monitoring the forums for quite a while, there have definitely been a number of cases where a user had problems with a backplane, including brief disconnections such as you have had. Since your drives are in backplanes, that would easily top the list of suspects. Try the drives in different backplanes, in order to determine if a particular one is defective. Then try making sure that all drives are firmly seated in the backplane, with no chance of power or data connections coming loose AT ALL, and no chance that any vibrations could possibly affect them. Check all of the power connections and cables too, especially splitters. Make sure that each wire into each connector is firm, not even slightly loose.
