Unraid Issues. Losing My F'ing Mind

January 27, 2017

Greetings Awesome People!

I have an issue that is driving me nuts to the point where I'm almost ready to office space this server.

Background: I recently started having issues on my old hardware, ranging from general instability (freezing) to memory errors at boot (which magically resolved themselves after another reboot). All these issues led me to believe the motherboard was bad, since it was a refurb and I picked it up cheap. I decided to dive in with some new hardware, and that brings us to my current issue.

I'm getting read errors after a while, and they seem to move between different drives. I've tried changing SATA ports, backplane channels, power cables, and even switching drives. It all leads to the same outcome: read/write errors that end in a red ball. When it first happened, I checked the drive with an extended SMART test, but it came back clean. No reallocated sectors, no pending sectors, nothing. I rebuilt that drive (disk 7) and thought I was done. Nope. Now other drives are giving me a shit fit.

Hardware:

CPUs: Dual E5-2670s

Memory: 96GB

Motherboard: ASRock EP2C602-4L/D16

Power Supply: EVGA SuperNOVA 850 G2

Looking at my syslog, I see:

Jan 26 22:50:23 Phoenix kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Jan 26 22:50:23 Phoenix kernel: ata10.00: failed command: SMART

Jan 26 22:50:23 Phoenix kernel: ata10.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 7 pio 512 in

Jan 26 22:50:23 Phoenix kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan 26 22:50:23 Phoenix kernel: ata10.00: status: { DRDY }

Jan 26 22:50:23 Phoenix kernel: ata10: hard resetting link

Jan 26 22:50:23 Phoenix kernel: ata10: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

Jan 26 22:50:23 Phoenix kernel: ata10.00: configured for UDMA/133

Jan 26 22:50:23 Phoenix kernel: ata10: EH complete

Jan 26 22:50:44 Phoenix kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen

Jan 26 22:50:44 Phoenix kernel: ata9.00: failed command: SMART

Jan 26 22:50:44 Phoenix kernel: ata9.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 18 pio 512 in

Jan 26 22:50:44 Phoenix kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

Jan 26 22:50:44 Phoenix kernel: ata9.00: status: { DRDY }

I know the status of DRDY signals a bad cable, but no matter what I do (switching between SATA ports, backplane locations, etc.), these errors always seem to happen after a while.

Please help! I'm losing my mind and ready to office space it.

phoenix-diagnostics-20170126-2255.zip

January 27, 2017

Community Expert

Move all devices from the Marvell controller to the Intel PCH ports. The Marvell controllers are notoriously problematic on those boards. If the Intel ports are all in use, get a PCIe HBA; just don't use those 4 Marvell SATA ports.

January 27, 2017

Author

Thank you, I will try that!

January 27, 2017

I know the status of DRDY signals a bad cable, but no matter what I do (switching between SATA ports, backplane locations, etc.), these errors always seem to happen after a while.

DRDY just means Device ReaDY. It has nothing at all to do with cabling. It basically means 'all OK': no error flags or unusual status flags to report. In this case, for 2 different drives, a request for SMART info was ignored by the drives: no response at all, not even an error flag raised. The only thing I can think of is an incompatibility with SMART requests on that controller. You may want to make sure it's flashed with the latest firmware, and that it's IT-mode (JBOD-like) firmware. Another, extremely unlikely, cause: both drives crashed their firmware simultaneously! I'm not sure that's even possible.
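To make the point about DRDY concrete, here is a minimal Python sketch that decodes the status byte from a libata `res` line like the ones posted above. It assumes the first hex byte after `res` is the ATA status register and the second is the error register, which is how the kernel prints these lines. The bit names follow common ATA usage; the names of the lower bits have changed between ATA revisions, so treat them as approximate. The function names `decode_status` and `parse_res_line` are made up for this illustration.

```python
import re

# ATA status register bits, highest bit first.
# Names follow common usage; 0x10, 0x04 and 0x02 vary by ATA revision.
STATUS_BITS = [
    (0x80, "BSY"),    # device busy
    (0x40, "DRDY"),   # device ready
    (0x20, "DF"),     # device fault
    (0x10, "DSC"),
    (0x08, "DRQ"),    # data request
    (0x04, "CORR"),
    (0x02, "SENSE"),
    (0x01, "ERR"),    # error register holds details
]

# Matches the "res" line of a libata exception report.
RES_RE = re.compile(
    r"res ([0-9a-f]{2})/([0-9a-f]{2}):.*Emask (0x[0-9a-f]+) \((\w+)\)"
)


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


def parse_res_line(line):
    """Extract status, error, and Emask fields from a 'res' syslog line."""
    match = RES_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    return {
        "status": status,
        "flags": decode_status(status),
        "error": int(match.group(2), 16),
        "emask": int(match.group(3), 16),
        "reason": match.group(4),
    }


if __name__ == "__main__":
    # Line copied from the syslog excerpt in the first post.
    line = ("Jan 26 22:50:23 Phoenix kernel: res "
            "40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)")
    print(parse_res_line(line))
```

For the posted line, the only status flag set is DRDY and the error register is zero, so the drive reported nothing wrong; the `(timeout)` reason is what actually marks the failure.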

I'm getting read errors after a while, and they seem to move between different drives. I've tried changing SATA ports, backplane channels, power cables, and even switching drives. It all leads to the same outcome: read/write errors that end in a red ball. When it first happened, I checked the drive with an extended SMART test, but it came back clean. No reallocated sectors, no pending sectors, nothing. I rebuilt that drive (disk 7) and thought I was done. Nope.

In general, for future reference, you always want to check your diagnostics (and post them for us) when you have issues like this. The higher-level functions aren't monitoring the low-level kernel access to the drives, and aren't aware of the exceptions being handled. So if a read request isn't successful, they report it as a read error, even though the cause could be so many other things (cables, controller, ports, drive, power), many of which have nothing to do with the physical drive itself.
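As a rough illustration of checking the low-level messages yourself, the Python sketch below counts kernel ATA exceptions per port in a list of syslog lines. If the exceptions cluster on a few ports, that points at whatever those ports share (controller, cable run, power) rather than at one drive. The function name `count_exceptions` is hypothetical, and the sample lines are copied from the syslog excerpt in the first post.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: ata10.00: exception Emask 0x0 SAct 0x0 ..."
EXCEPTION_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: exception Emask")


def count_exceptions(lines):
    """Count libata exception reports per ATA port (e.g. 'ata9')."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines taken from the syslog excerpt posted at the top of the thread.
SAMPLE = [
    "Jan 26 22:50:23 Phoenix kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Jan 26 22:50:23 Phoenix kernel: ata10.00: failed command: SMART",
    "Jan 26 22:50:23 Phoenix kernel: ata10: hard resetting link",
    "Jan 26 22:50:44 Phoenix kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Jan 26 22:50:44 Phoenix kernel: ata9.00: failed command: SMART",
]

if __name__ == "__main__":
    for port, count in count_exceptions(SAMPLE).most_common():
        print(port, count)
```

On a real server you would feed it the lines of the syslog file from the diagnostics zip instead of `SAMPLE`; where that file lives depends on your setup.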

January 28, 2017

Author

Hi RobJ, thanks for the reply! I really appreciate it.

DRDY just means Device ReaDY. It has nothing at all to do with cabling. It basically means 'all OK': no error flags or unusual status flags to report. In this case, for 2 different drives, a request for SMART info was ignored by the drives: no response at all, not even an error flag raised. The only thing I can think of is an incompatibility with SMART requests on that controller. You may want to make sure it's flashed with the latest firmware, and that it's IT-mode (JBOD-like) firmware. Another, extremely unlikely, cause: both drives crashed their firmware simultaneously! I'm not sure that's even possible.

I got to the cabling conclusion from this link: https://lime-technology.com/wiki/index.php/The_Analysis_of_Drive_Issues#Drive_interface_issue_.234 since the message I was getting was very similar. I may have gone down a rabbit hole by the time I found this, so I just jumped on it :-\

I'm using the SATA ports directly on the board, and it shipped from the factory with the latest BIOS and BMC updates, judging by a comparison of ASRock's website with what's on the board.

In general, for future reference, you always want to check your diagnostics (and post them for us) when you have issues like this. The higher-level functions aren't monitoring the low-level kernel access to the drives, and aren't aware of the exceptions being handled. So if a read request isn't successful, they report it as a read error, even though the cause could be so many other things (cables, controller, ports, drive, power), many of which have nothing to do with the physical drive itself.

Yeah, I'm working through all the possibilities. I have obtained another power supply to see whether the supplied power was flaky; after that I will try an HBA card to test johnnie.black's recommendation.

January 28, 2017

Community Expert

Yeah, I'm working through all the possibilities. I have obtained another power supply to see whether the supplied power was flaky; after that I will try an HBA card to test johnnie.black's recommendation.

Forgot to say: make sure you get one without a Marvell chipset.

PS: If you don't use VT-d, you can also try disabling it and see if that helps with your Marvell issues.

I bet the disks causing issues are on the Marvell controller, but you can check for yourself. Both of the ones you posted about are:

ATA9 = WDC_WD60EFRX-68MYMN1_WD-WX11D35C8VDK -> Marvell port 3

ATA10 = WDC_WD20EFRX-68EUZN0_WD-WCC4M6RN5NT3 -> Marvell port 4

January 28, 2017

Author

Forgot to say: make sure you get one without a Marvell chipset.

PS: If you don't use VT-d, you can also try disabling it and see if that helps with your Marvell issues.

I bet the disks causing issues are on the Marvell controller, but you can check for yourself. Both of the ones you posted about are:

ATA9 = WDC_WD60EFRX-68MYMN1_WD-WX11D35C8VDK -> Marvell port 3

ATA10 = WDC_WD20EFRX-68EUZN0_WD-WCC4M6RN5NT3 -> Marvell port 4

Sounds about right. Whenever it happened, it was on ATA7-ATA10, and all of those are on the Marvell controller. So far the PSU swap seems to be working: I've been up on the Marvell controller for about 7 hours now and haven't experienced the same issues yet, but who knows what will happen in an hour. I'm crossing my fingers that it was the PSU and not the motherboard or an incompatibility with the Marvell controller, because that would SUCK lol

January 28, 2017

Author

The errors happened again on the same ports: ATA7-ATA10, which are the devices plugged into the Marvell controller.

I have a non-Marvell card coming in, which has an ASM1061 chipset. According to the hardware compatibility page for Unraid, this will work out of the box. Not sure if I'll encounter the same issues, but it's worth a quick test until I can find an appropriate HBA card.

I have another card, but it has a Marvell chipset. Supposedly it will work out of the box according to the hardware compatibility page, so I will give it a test, just to see if the errors only occur on the devices hooked up to it.

Power supply ruled out, on to the next endeavor.

January 28, 2017

Community Expert

The Asmedia should work fine and it's plug n play.

January 28, 2017

Author

The Asmedia should work fine and it's plug n play.

Good to know. I'll keep at it. Thanks for your help!

January 28, 2017

Author

I don't want to jump the gun too much, but I have been up for 4 1/2 hours with the onboard Marvell ports removed from the equation, and it seems to have helped.

I'm using an add-in card with a different Marvell controller to handle my cache drives at the moment, because I'm 2 drives over what I can put on the board. I also found a forum post here: https://forums.tweaktown.com/asrock/56191-c2750d4i-marvel-9230-sata-port.html which shows other people on Ubuntu getting error messages very similar to what I was getting with Unraid.

That makes me feel better about the board's status to where I won't have to RMA it. Once my testing completes, I'll post an update.

Thank you so much to johnnie.black and RobJ for the assistance thus far!

January 29, 2017

Author

Removing the drives off the onboard Marvell controller did the trick. No errors in 24 hours.

The only issue now is my parity drive has sync errors due to the drives dropping and all my troubleshooting. Not a big deal, I can rebuild it/correct sync errors and deal with any data loss that has already occurred. I'm just happy to have working hardware!

Thanks again!
