Disk Failure... Bad Drive? Bad Raid Controller? Need Help Please

September 25, 2016

I am incredibly new to unRAID, 9 days into my trial version of 6.2, and today one of my disks crashed on me. I'm really uncertain what the problem could be. Here's what happened:

I was copying data from some older network drives to my new unRAID server (I can include specs if needed) when the rsync command died, saying it could no longer see the folder. I ran ls and sure enough, only one folder showed up where there were previously hundreds. I checked the console and one of my disks had the dreaded red X beside it and 214 errors. Just a note: I have 3 drives, 1 cache (SSD) and 2 8TB Seagate drives (1 parity, 1 data). While clicking around, it appeared that my cache drive had also gone offline. At that point I was in freak-out mode, so I stopped the array. When I did this, the bottom of my Main tab showed "Array Offline Configuration is Invalid" (or something similar), and at that point I restarted the server.

When it came back on, the cache drive was back but the red X remained beside the one failed disk. I found the FAQ page regarding failed disks and pulled the SMART diagnostics, although I think I should have pulled them directly after the failure. The SMART data was completely clean: the SMART overall health assessment shows as PASSED, and Reallocated_Sector_Ct, Current_Pending_Sector and UDMA_CRC_Error_Count were all 0. Temp looked fine as well. At this point I tried to add the drive back and began the parity sync. The initial build took over 14 hours, so I was prepared for a long sync time. I just got back from dinner and checked, and the drive is showing down again with 214 errors.
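
For anyone repeating the SMART check described above, here is a minimal Python sketch that scans an attribute table for the three counters mentioned in this post. It assumes the text comes from `smartctl -A` and follows the usual ten-column attribute layout; the function name `nonzero_watched` is made up for this illustration. A clean result only means these counters are zero, not that the drive or the controller is healthy.

```python
# Counters named in the post; a non-zero raw value on any of them
# is usually worth a closer look.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")


def nonzero_watched(smart_text, watched=WATCHED):
    """Return {attribute: raw_value} for watched attributes whose raw value is not 0."""
    flagged = {}
    for line in smart_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        # Only plain integer raw values are interpreted here.
        if name in watched and raw.isdigit() and int(raw) != 0:
            flagged[name] = int(raw)
    return flagged
```

An empty dictionary corresponds to the "completely clean" result described in the post.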

I pulled the SMART logs again and it says this:

smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.4.19-unRAID] (local build)
=== START OF INFORMATION SECTION ===
Vendor: /10:0:0:
Product: 0
Compliance: SPC-5
User Capacity: 600,332,565,813,390,450 bytes [600 PB]
Logical block size: 774843950 bytes
Physical block size: 3166222336 bytes
Lowest aligned LBA: 12346
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

These drives are brand new and have only been running for 9 or so days and about 1TB of data has been copied over without any issues until today.

I have both diagnostic zip files, from the first and second crash; I'm just not sure what is helpful to see in this situation. I am completely out of my depth at this point and have no clue what to do. Can someone please help a newbie out? Thank you all for any help that you can provide!
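
The second smartctl report in this post contains values that are not plausible for an 8TB drive (a 600 PB capacity and block sizes in the hundreds of millions of bytes), which suggests the response itself was garbled rather than that the drive reported them. A rough Python sketch of that sanity check follows. The function name, the 100 TB ceiling and the list of accepted block sizes are assumptions chosen for illustration, and the parser only understands the field labels as they appear in the output above.

```python
import re


def implausible_fields(info_text, max_capacity_bytes=100 * 10**12):
    """Return a list of identity fields whose values look like garbage."""
    problems = []

    cap = re.search(r"User Capacity:\s*([\d,]+)\s*bytes", info_text)
    if cap:
        capacity = int(cap.group(1).replace(",", ""))
        # Assumed ceiling: no single drive in this build is anywhere near 100 TB.
        if capacity > max_capacity_bytes:
            problems.append(f"User Capacity: {capacity} bytes")

    for label in ("Logical block size", "Physical block size"):
        match = re.search(label + r":\s*(\d+)\s*bytes", info_text)
        if match:
            size = int(match.group(1))
            # Assumption: ordinary SATA drives report 512 or 4096 bytes.
            if size not in (512, 4096):
                problems.append(f"{label}: {size} bytes")

    return problems
```

Run against the report above, all three fields would be flagged, which points at the path between the drive and the OS rather than at the drive's own health data.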

September 25, 2016

Just post the entire zip file

September 25, 2016

Author

Both zip files have been attached, not sure if they show different data or not. Thanks!

FirstCrash_-_diagnostics-20160924-1610.zip

SecondCrash_-_diagnostics-20160924-1930.zip

September 25, 2016

Looks like both diagnostics were after a reboot, but I'm rather confused as to this line that's spamming the log after the start

Sep 24 16:09:28 Tower root: ACPI group processor / action LNXCPU:0e is not defined

I'm going to leave this alone til the drive guys pipe in.

September 25, 2016

Author

Could it be rebooting on its own? That second diagnostic was right after I got home and realized that the disk had died again.

September 25, 2016

I'm sorry... on a closer look, the 2nd has the same start point as the first, so it didn't restart.

And since I was looking at it again, I ignored the spammed lines and found this right before drive errors began:

Sep 24 17:05:29 Tower kernel: ata7: link is slow to respond, please be patient (ready=0)
Sep 24 17:05:34 Tower kernel: ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 24 17:05:34 Tower kernel: ata9.00: failed command: WRITE DMA EXT
Sep 24 17:05:34 Tower kernel: ata9.00: cmd 35/00:40:e0:b3:f3/00:05:3b:00:00/e0 tag 14 dma 688128 out
Sep 24 17:05:34 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 24 17:05:34 Tower kernel: ata9.00: status: { DRDY }
Sep 24 17:05:34 Tower kernel: ata9: hard resetting link
Sep 24 17:05:35 Tower kernel: ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)

This went on a couple of times before the driver finally gave up and dropped the drive. Errors like this are usually related to bad or loose cabling (usually SATA, rarely power). Reseat all the cables at both ends and try again.
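
To see how often this pattern repeats in a syslog, something like the following Python sketch could be used. It counts the exception, failed command and hard reset messages per ATA port, using only the message wording visible in the excerpt above; the function name `summarize_ata_events` is hypothetical, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches "kernel: ata9: ..." and "kernel: ata9.00: ..." and captures the port.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(log_text):
    """Count exception, failed-command and hard-reset messages per ATA port."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("exception"):
            counts[(port, "exception")] += 1
        elif message.startswith("failed command"):
            counts[(port, "failed command")] += 1
        elif message.startswith("hard resetting link"):
            counts[(port, "hard reset")] += 1
    return counts
```

Passing the full syslog text from the diagnostics zip to `summarize_ata_events` gives a quick view of which ports are resetting repeatedly. It does not say why; cabling, power and the controller all remain candidates.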

You will have to rebuild disk 1 onto itself, however, since unRAID had a write error to it.

The SMART report for the drive from the first diagnostics looks clean, however.

Still no clue about all that ACPI spam.

September 25, 2016

Author

Thanks Squid! I'll unplug the cabling and give it another shot ASAP.

September 25, 2016

Author

I double checked all of the cables, unplugged and reseated them all. Parity Sync is running now and will take a long time so I'm going to let it sit overnight. Will report back status in the morning. Thanks again for the help!

September 25, 2016

Author

Well, the rebuild completed, both disks show online, and the menu states that parity is valid. However, over 1.7 billion errors were reported on my parity drive and I can't access the majority of my folders.

History shows this:

Last check completed on Sun 25 Sep 2016 12:10:47 AM EDT (today), finding 1783683479 errors.

I get this error in terminal when running the ls command:

/bin/ls: reading directory '.': Input/output error

I haven't restarted anything and I've attached the latest diagnostic log hoping that it may point to what may be the problem. I appreciate any help that can be provided.

ThirdDownload-diagnostics-20160925-0832.zip

September 25, 2016

Community Expert

Looks like you're using the Marvell controller. Switch all your disks to the onboard Intel PCH.

Marvell has issues, especially with VT-d on. I believe there's also updated firmware for it on the ASRock website.

September 25, 2016

Author

Thanks johnnie.black. Is there an easy way to determine which SATA ports are which? Are there known issues with the Marvell controller? Will I be unable to use this controller on the board, even though it was working without issue for almost 10 days?

September 26, 2016

You have a group of 6 good SATA ports on the motherboard, and you have 8 ports on the Marvell 9230. Can't tell if that chipset is on a card or on the motherboard. If it's on the motherboard, then you have a separate group of 8 ports on the motherboard for it.

All 3 drives are hooked to the ports on the Marvell 9230, and they worked for a while. But partway into the drive rebuild of Disk 1, both the Kingston and the Parity drive were dropped, quite suddenly, implying a failure of part of that chipset. Disk 1 continued to work fine, but I do notice that there are 4 errors associated with it too, probably read errors.

Look for a BIOS upgrade for the motherboard, which hopefully will fix the LNXCPU ACPI errors, as well as upgrade the Marvell firmware if it's on the board. If it's on a card, then look for a firmware update for the card.

The fact that the drives work at all is good, probably means you may not have to worry about the VT-d issue.
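
On the question of which controller each disk is actually attached to, one software-side approach on Linux is to resolve each `/sys/block/sdX` entry and read the PCI address out of the resulting path. The sketch below is a rough illustration under stated assumptions: it assumes the standard sysfs layout, the helper names are made up, and the vendor table covers only the two PCI vendor IDs relevant to this thread (0x8086 for Intel, 0x1b4b for Marvell).

```python
import glob
import os
import re

PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
ATA_PORT = re.compile(r"/(ata\d+)/")
# Only the two vendors discussed in this thread; anything else is shown as a raw ID.
VENDOR_NAMES = {"0x8086": "Intel", "0x1b4b": "Marvell"}


def parse_sysfs_path(path):
    """Return (pci_address, ata_port) parsed from a resolved /sys/block/<disk> path."""
    addrs = PCI_ADDR.findall(path)
    port = ATA_PORT.search(path)
    # The last PCI address in the path is the controller the disk hangs off.
    return (addrs[-1] if addrs else None, port.group(1) if port else None)


def list_disk_controllers(sys_block="/sys/block"):
    """Return (disk, ata_port, pci_address, vendor) for every sd* device."""
    rows = []
    for link in sorted(glob.glob(os.path.join(sys_block, "sd*"))):
        pci, port = parse_sysfs_path(os.path.realpath(link))
        vendor = None
        if pci:
            try:
                with open(f"/sys/bus/pci/devices/{pci}/vendor") as fh:
                    vendor = fh.read().strip()
            except OSError:
                pass  # leave vendor unknown if sysfs does not expose it
        rows.append((os.path.basename(link), port, pci, VENDOR_NAMES.get(vendor, vendor)))
    return rows


if __name__ == "__main__":
    for disk, port, pci, vendor in list_disk_controllers():
        print(disk, port, pci, vendor)
```

This reports the controller for disks that are already connected; it does not label empty ports, so the board manual is still needed to find the physical connectors.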

September 26, 2016

Author

Thanks RobJ. I checked and I'm running the latest BIOS for the board, so I'm not sure what to do in this situation. It's an ASRock Rack EP2C602-4L/D16, which I thought was a popular mobo for these kinds of builds. I've had multiple problems since this post, so I'm trying to pull all data off the drives and start from scratch. Not sure what else to do at this point.

September 26, 2016

I would avoid the Marvell 9230 controller that is built into the MB. Just use the other 10 connectors and avoid those 4 on the Marvell. I passed mine to a VM, but even then the controller will drop the drives periodically (every 2-3 weeks).

September 26, 2016

Author

I guess I'm confused and I can't tell from the guide, are there 4 Marvell ports or 8 and is there an easy way to distinguish them by looking?

September 26, 2016

There are only 4 on my EP2C602-4L/D16. I'll look in the manual, as it was hard to tell and I got mine backwards the first time. If I remember correctly, they are the four ports right next to the 2 6Gb/s MB ports at the back of the MB. I expected them to be the 4 separate from the others along the side, but those were the 3Gb/s MB connectors. Just like I would figure the dedicated IPMI LAN port is the one separate from the other 4 LAN ports, but it isn't. Very confusing in my opinion.

Actually, it was where I said, but there were 6 other connectors, not two. See the graphics. One graphic lists the number of each port on the MB and the other highlights the ports on the MB itself.

September 26, 2016

Author

OMG you may have just fixed another problem of mine about the IPMI port... I have all 3 disks plugged into the top 3 ports on the right-hand side, but if you could somehow confirm I'd really appreciate it. Did you have problems with the Marvell chipset? I'm running 1.80, which I believe to be the latest firmware for the board.

September 26, 2016

I never upgraded mine, as it looked like it was the latest on the ASRock site. But I've always been meaning to upgrade it anyway just in case; I just haven't had time. The attached graphic shows the dedicated IPMI port highlighted. Note that you can share IPMI with a regular LAN port, but you cannot use the IPMI port as a LAN port. Not sure why they did that, except maybe to reduce the number of network connections by one.

September 26, 2016

Author

That definitely explains my network issue, thanks for that! If I can figure out the Marvell/Intel ports I'll be ready to move the disks (still copying off data). You'd think they'd have this documented in a similar kind of picture but I can't find one.

September 26, 2016

You did see my previous post where I edited and added graphics for the Marvell ports?

September 26, 2016

Author

I didn't! You are freaking awesome and those are the ports I'm using as well which explains a lot. Thanks so much for your help with this! I'm hoping to have my data off and ready to try again by tomorrow. I'm blowing away everything because I'm also having weird folders that I have no access to and can't delete, probably corrupted during the crash.

Thanks again everyone for all of your help and BobPhoenix for knowing way more about this board than I!

September 29, 2016

Author

Just wanted to follow up. I was able to get my data off, back up my configs, and move the disks to the Intel controllers, and I've been back up for a little over a day now with no problems at all. Thanks again everyone for the great help!

September 29, 2016

Yep, Marvell controllers are crap tbh. Never had any success with them.

October 21, 2016

Thank you for posting this!

You've saved me a fortune! I was ready to pull the trigger on 3 new drives!
