4 disks with read errors, of which 2 are disabled. How to proceed??



The server has been running 24/7 for a long time now, without any (real) issues. I replaced my motherboard, CPU and cache drive last week; no issues. Yesterday I thought: why not save a few bucks on power and put the server to sleep at night? I used my old settings in the sleep plugin (I'd used it before without any problems) and enabled it.

 

Today I woke up to a nightmare, thinking the plugin had somehow (almost) destroyed my array. Thankfully I have 2 parity drives, but still...

 

The server was set to wake at 07:00.

About 30 minutes after that, disk 3 was disabled.

After 1 hour 40 minutes, disk 2 was disabled.

These are the messages I received:

Quote

Unraid Server, [19/01/2022 07:26] Server-UR: Alert [SERVER-UR] - Disk 3 in error state (disk dsbl)
ST5000LM000-2AN170_WCJ2NMBJ (sdh)

Unraid Server, [19/01/2022 07:27] Server-UR: Warning [SERVER-UR] - array has errors
Array has 1 disk with read errors

Unraid Server, [19/01/2022 08:40] Server-UR: Alert [SERVER-UR] - Disk 2 in error state (disk dsbl)
ST5000LM000-2AN170_WCJ2DNLC (sdg)

Unraid Server, [19/01/2022 08:40] Server-UR: Warning [SERVER-UR] - array has errors
Array has 2 disks with read errors

Unraid Server, [19/01/2022 09:19] Server-UR: Warning [SERVER-UR] - array has errors
Array has 4 disks with read errors

 

So, what I did after seeing all of this:

- downloaded diagnostics, see attachment

- disabled Docker

- stopped the array

- ran a short SMART self-test on disks 2 and 3 > both passed

- Status as of now:

  • Disk 1 - read errors
  • Disk 2 - disabled, emulated - short self-test passed
  • Disk 3 - disabled, emulated - short self-test passed
  • Disk 4 - read errors

 

Looking for theories as to what happened:

- The sleep plugin is the last thing that changed. I attached a screenshot of the sleep settings I used. So either the plugin is not working correctly, I used a bad setting, or it's just random bad luck. I'm almost 100% sure it's one of the latter two. Although it did sound like it shut down rather harshly, or maybe I'm just not used to the sound of all drives stopping at the same time when the server goes to sleep.

- I opened the case and I can rule out a power issue: the 4 drives aren't connected to the same power cable, or to the end of one.

- But they are all connected to the same breakout cable and the same LSI card (IBM M1015 > SFF-8087 cable). That's not suspicious at all...

- It's the only cable connected to that IBM M1015 in the server (I have two of those cards, the other one is full)

- So, either:

  1. some error with the sleep plugin? Although the server was set to wake at 07:00 and the errors started (way) later...
  2. the sleep/wake cycle resulted in a faulty cable or a faulty IBM M1015 card?

 

Spare parts (that I know of)

- IBM M1015 > not yet flashed

- Multiple drives (same as in array) ready to go > already pre-cleared

- Not sure about a spare SFF-8087 cable. I know I have at least one lying around, because I replaced one when a disk was having read errors. I'm not sure if that one was connected to the same IBM card we're talking about now... Maybe play it safe and order a new one?

 

Next steps:

Honestly, I'm not sure what to do now; this is my first time dealing with such a catastrophic failure, so before I do anything I would like some advice. Should I buy a new cable and flash the spare LSI card, so all the hardware that could be faulty is replaced? And then follow https://wiki.unraid.net/Manual/Storage_Management#Checking_a_File_System for disks 2 and 3?

 

What would you guys recommend I do and in which order?


Edit: changed title from "Sleep plugin almost destroyed my array? 4 disks with read errors, of which 2 are disabled. How to proceed??" to "4 disks with read errors, of which 2 are disabled. How to proceed??" because it's probably not the plugin's fault and in hindsight it reads a little bit sensational.


sleep settings.PNG

server-ur-diagnostics-20220119-0906.zip

34 minutes ago, JorgeB said:

The LSI didn't like waking up; there's even a driver crash besides the many timeout errors. You should reboot first: that will clear the errors on the two still-enabled disks and the LSI issue. Then start the array to see if the emulated disks are mounting, and post new diags.

 

Is that a 'thing', that LSI cards don't like sleep? I didn't realize it, but after reading your comment I googled a bit, and there are quite a few topics where people say about LSI cards that "server parts are not meant for sleep mode", etc. I hadn't even thought about this for a second... dumbdumbdumb.

 

So to sum things up and see if I understand you correctly:

  1. reboot
  2. start the array - in Maintenance mode, I presume?
  3. check everything
  4. download diagnostics

Correct?

7 minutes ago, JorgeB said:

Yes, that was inevitable. Both emulated disks are mounting, so you can rebuild on top (with dual parity you can rebuild both at the same time):

 

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

 

 

 

Okay, and because the disks are mounting there is no need to check the filesystem, correct?

 

I have done a rebuild in the past (probably also caused by this sleep issue), but never 2 drives at the same time. Is there more risk involved when rebuilding 2 drives at the same time?

 

I mean, both drives are still connected to the same cable and LSI card. Are you sure this was caused by sleep mode causing (temporary) issues with the LSI card?

 

Would it be wise to swap disks 2 and 3 with new (pre-cleared) drives, and then start the rebuild with the new drives? In case something does go wrong with the rebuild, I'd still have disks 2 and 3 lying around, so I could recover the data by just copying it over to the new disks. Or am I overthinking this, and should I do as you say and rebuild onto the existing drives?

5 hours ago, trurl said:

But don't leave things as they are for too long. I would probably skip the preclears on the new disks, for example. Better if you don't use your server until you are ready to rebuild.

 

Of course. Whenever something like this happens I just disable all services. Too bad for my Plex users, but better safe than sorry. I am skipping the pre-clears, because both drives were already pre-cleared about a month ago. Thankfully I bought some extras on Black Friday. So yeah, no need to pre-clear them again.

 

5 hours ago, trurl said:

It is always safer to rebuild to spares if you have them.

 

Thanks, I just shucked 2 drives and will be replacing them tonight so the server can rebuild overnight and during the rest of the day.

 

So, to sum things up (I don't want to screw this up): I can follow the "replacing failed/disabled disk(s)" section from https://wiki.unraid.net/Manual/Storage_Management#Replacing_disks

 

To translate that to my situation, and just to be 100% sure that what I'm about to do is right:

  1. Stop the array.
  2. Power down the unit.
  3. Replace disk 2 and 3 with the spares.
  4. Double check that all cables are connected properly.
  5. Power up the unit.
  6. Assign the spares to disk 2 and 3 spots.
  7. Click the checkbox that says Yes I want to do this.
  8. Click the checkbox Maintenance mode
  9. Click Start.
  10. Click Sync to trigger the rebuild.
  11. Fingers crossed and report back with any problems or success ;) 

Maintenance mode seems like the safest option to me. 

 

Can you confirm that these are the right steps? I'm not missing anything?

 

EDIT:

Successfully replaced disks 2 and 3, and the array is now being rebuilt. See you in ~14 hours, hopefully with some good news :)


@JorgeB  @trurl

 

(sorry, somehow pressing enter posted right away...)

 

Success! What a relief.

 

Disk 2 returned to normal operation
Disk 3 returned to normal operation 
Parity sync / Data rebuild finished - Finding 0 errors 
Duration: 13 hours, 44 minutes, 39 seconds. Average speed: 101.1 MB/sec
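
(As a sanity check, since these are 5 TB drives: 5,000,000 MB ÷ 101.1 MB/s ≈ 49,500 seconds ≈ 13 hours 44 minutes, so the reported duration is exactly what a full-disk rebuild at that speed should take.)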

 

Don't think another parity check is necessary, right?

 

Lesson learned: never let the server go to sleep again when using an LSI card, that's for sure, haha.

 

But I still wonder, how does Unraid handle a failing LSI card? I was really lucky this time to have dual parity. But my other LSI card has 8 drives connected to it... I hate to think what would have happened if that one had failed. Because what are the odds of "just" 2 drives getting disabled in such a case? RIP array? Or how does Unraid handle this? I know from the past that having 'boot array at startup' enabled AND a faulty cable is a combination sure to result in a disabled disk and thus a rebuild. For that reason alone I disabled array boot at startup a while back.

1 hour ago, FreakyUnraid said:

Don't think another parity check is necessary, right?

No, I would wait for the next scheduled one.

 

1 hour ago, FreakyUnraid said:

But I still wonder, how does Unraid handle a failing LSI card? I was really lucky this time to have dual parity. But my other LSI card has 8 drives connected to it... I hate to think what would have happened if that one had failed. Because what are the odds of "just" 2 drives getting disabled in such a case? RIP array? Or how does Unraid handle this?

 

Unraid will only disable one disk with single parity, or two disks with dual parity. If there are errors on more disks, due for example to a controller issue, you just need to fix the issue and reboot/power back on. The disabled disk(s) will need to be rebuilt, like you had to do; the other ones will recover immediately after boot.
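
To illustrate that rule, here's a toy sketch (my own simplification, not Unraid's actual md driver logic; all names are made up):

```python
# Toy model of the disable rule: a disk that fails a write is only marked
# disabled while enough parity remains to emulate it; once the limit is
# reached, further failing disks just report errors and stay enabled.

def process_write_failures(failing_disks, parity_count):
    disabled = []
    for disk in failing_disks:
        if len(disabled) < parity_count:
            disabled.append(disk)  # disabled -> contents are now emulated
        # else: errors are logged, but the disk is not disabled
    return disabled

# This thread's scenario: 4 disks erroring behind one flaky controller,
# with dual parity. Only the first two failures get disabled; the other
# two recover after a reboot.
print(process_write_failures(["disk3", "disk2", "disk1", "disk4"], 2))
# -> ['disk3', 'disk2']
```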

 

15 minutes ago, JorgeB said:

No, I would wait for the next scheduled one.

 

Great, back to normal operation it is. Really, really appreciate the help! Thank you!

 

15 minutes ago, JorgeB said:

Unraid will only disable one disk with single parity, or two disks with dual parity. If there are errors on more disks, due for example to a controller issue, you just need to fix the issue and reboot/power back on. The disabled disk(s) will need to be rebuilt, like you had to do; the other ones will recover immediately after boot.

 

 

I'm not sure if I understand this correctly. Are you saying Unraid will never disable more than 2 disks (with dual parity)? How does that work? (If there is a wiki page about this, a link will suffice, of course.)

6 hours ago, JorgeB said:

And in case I wasn't clear, that's what happened to you: you had issues with 4 disks, but only 2 got disabled because you have dual parity. With single parity, just one would get disabled.

 

Okay, but why do they get disabled? The two other disks had read errors too, but came back online after the reboot. I don't understand what made disks 2 and 3 different. And why does Unraid (seemingly always?) disable disks in such a scenario? Is it something preemptive? And what is it preventing by disabling those disks?

7 hours ago, JorgeB said:

Unraid disables a disk every time a write to it fails

because it is no longer in sync. The disabled disk is emulated from the parity calculation by reading parity and all other disks. The emulated disk is written by updating parity. The initial failed write, and any subsequent writes to the emulated disk, can be recovered by rebuilding.
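
For anyone curious how the emulation works mechanically, here's a minimal single-parity sketch in Python (my own illustration of the XOR scheme described above, not Unraid's actual code; dual parity adds a second, independent equation so two disks can be emulated at once):

```python
# Single parity: the parity byte is the XOR of the same byte position
# across all data disks, so any ONE missing disk can be reconstructed
# from parity plus the remaining disks.

def xor_bytes(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def emulate_read(parity, other_disks):
    """Reconstruct the missing (disabled) disk from parity + all other disks."""
    return xor_bytes([parity] + other_disks)

def emulate_write(parity, old_emulated, new_data):
    """A write to the emulated disk only touches parity:
    new_parity = parity XOR old_data XOR new_data."""
    return xor_bytes([parity, old_emulated, new_data])

# Three data disks, one parity.
d1, d2, d3 = b"\x0a\x0b", b"\x10\x20", b"\xff\x00"
parity = xor_bytes([d1, d2, d3])

# Disk 2 gets disabled: its contents are still readable via emulation.
assert emulate_read(parity, [d1, d3]) == d2

# A write to the emulated disk 2 is applied by updating parity only;
# rebuilding later writes this data back onto a physical disk.
new_d2 = b"\xaa\xbb"
parity = emulate_write(parity, d2, new_d2)
assert emulate_read(parity, [d1, d3]) == new_d2
```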
