[SOLVED] Disk write errors multiple drives - Advice on rebuild options



I'm hoping someone can please give me some guidance on the best way to recover from this catastrophic failure.

 

This happened a few days ago, I downloaded diagnostics then (attached), and shut it down so I could inspect hardware, cables, and order replacement parts. The array has not been restarted since.

 

Based on the number and location of the drives with errors, I believe this was an LSI HBA controller card failure. Parity 1 & 2, Cache 1 & 2, and Disks 1 & 2 are all connected directly to the motherboard. Disks 3 to 8 are connected via an LSI 9217 flashed to IT (HBA) mode. The disks with errors are spread across both 4-port SAS-to-SATA breakout cables.
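
For anyone trying to confirm which disks sit behind which controller, something like this from the console will show the PCI path for each block device (the exact sysfs paths vary by system):

lspci | grep -i lsi                           # confirm the HBA is still detected
for d in /sys/block/sd?; do
  echo "$d -> $(readlink -f $d/device)"       # PCI path shows which controller each disk hangs off
done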

 

It looks like Disk 3 and Disk 4 failed first and were disabled while still protected by dual parity. The problem is the mass of errors on Disk 5 and Disk 7, which won't be parity protected. This all started at 5am when the Mover kicked in and moved data from cache to the array, so I wasn't aware of what was happening until I woke up later in the day.

 

I've been looking at whatever documentation and forum posts I can find from people who have been in similar situations, but I've only found examples of controller card failures with read errors only, or where the array was still protected by parity.

https://wiki.unraid.net/Troubleshooting#Re-enable_the_drive

https://wiki.unraid.net/Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily

 

If I know which files were moved to the array during the failure, can I ignore the parity data and delete the corrupted files, then rebuild parity based on what's left?

So,

Unassign parity drives

Start array (unprotected)

Delete any files written in the 24 hours preceding the failure (see the find sketch after this list)

Stop array

Assign parity drives and rebuild parity.
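
A rough sketch of how those recently written files could be listed, assuming the shares live under /mnt/user:

find /mnt/user -type f -mtime -1 -print                      # everything modified in the last 24 hours
find /mnt/user -type f -newermt "2020-06-11 05:00" -print    # or everything modified since a specific time (adjust the date)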

 

Will this work? Or is the data on Disks 5 and 7 toast? 

 

Any advice or additional options would be greatly appreciated.

Unfortunately because the syslog is spammed with these:

Jun 11 20:18:17 YUKI nginx: 2020/06/11 20:18:17 [error] 10423#10423: *1116981 connect() to unix:/var/tmp/Influxdb.sock failed (111: Connection refused) while connecting to upstream, client: 10.0.0.125, server: , request: "GET /dockerterminal/Influxdb/token HTTP/2.0", upstream: "http://unix:/var/tmp/Influxdb.sock:/token", host: "hash.unraid.net:9001", referrer: "https://hash.unraid.net:9001/dockerterminal/Influxdb/"
Jun 11 20:18:17 YUKI nginx: 2020/06/11 20:18:17 [error] 10423#10423: *1119386 conn

 

There's time missing and it didn't catch the start of the problem, but this was likely a controller issue. Despite how it looks, this is usually not a catastrophic failure and is easy to recover from. It looks like Unraid lost contact with all 5 disks; when this happens it disables as many disks as there are parity drives, and which disks get disabled is a crap-shoot. Since one of the disks dropped offline there's no SMART report, so power back on, start the array with the emulated disks, and post new diags. If all looks good you can either rebuild on top or do a new config and re-sync parity.
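
Once it's powered back on, if you want to check a disk by hand you can grab a SMART report from the console (replace sdX with the actual device):

smartctl -a /dev/sdX    # full SMART report; look at reallocated/pending sector counts
smartctl -H /dev/sdX    # just the overall health assessment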


Thank you @johnnie.black for the information and for getting back to me.

 

I disabled Docker and VMs to stop that log spam and to limit disk activity when starting the array.

 

The array has been started with the disks in emulated mode as instructed. Diagnostics are attached and a screenshot is below.

 

Thank you in advance.

The disks look fine, and since the emulated disks are unmountable the best option here is IMHO doing a new config and re-syncing parity. Still, it's a good idea to make sure the old disks mount correctly before doing the new config, so:

 

-stop array

-unassign disks 3 and 4

-start array

-now use UD (Unassigned Devices) to mount the old disks in read-only mode (a console alternative is sketched after this list)

-if they mount correctly and data looks correct do a new config and re-sync parity.
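
If UD gives you trouble, a rough console equivalent, assuming the data partition is the first one and the filesystem is XFS (sdX is a placeholder):

mkdir -p /mnt/olddisk
mount -o ro /dev/sdX1 /mnt/olddisk                     # plain read-only mount
# if XFS refuses because of a dirty log, norecovery lets it mount without replaying:
mount -t xfs -o ro,norecovery /dev/sdX1 /mnt/olddisk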

 

If you need help with the new config please ask.

Link to comment

@johnnie.black Thank you so much! Followed all of the above and it worked. As far as I can tell I don't have any data issues. I've deleted anything written in the 24 hours preceding the fault to be on the cautious side, and the parity resync completed a short while ago.

 

I came from years of using RAID5 setups before switching to Unraid a few months ago. I don't think those setups could have recovered in the same way from errors on more drives than parity, such as with this controller card failure.

 

Lessons learned:

-RAID is not a backup

-Set up an external syslog server

-Install one of the file integrity plugins, Squid's Checksum or Bonienl's Bunker (a manual approximation is sketched below)
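
For anyone who wants the integrity idea without a plugin, a crude version can be done with standard tools (paths are just examples):

find /mnt/user/important -type f -exec sha256sum {} + > /boot/checksums.sha256   # create checksums
sha256sum -c /boot/checksums.sha256                                              # verify against them later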

 

 

