big trouble with my array


caplam


Today I think I made a big mistake.

 

I was playing around with powertop and I think I did something I shouldn't have.

Three minutes after playing with it, I had 2 disks with read errors that were disconnected from the array.

Before that all was fine. 

A third disk had read errors but was removed.

I tried to stop the array without success. I also couldn't grab a diagnostics file. The server was unresponsive and I had to do a cold reboot.

I started the rebuild procedure for one disk, but one of my parity drives now has read errors and the rebuild is slow (350 KB/s).

I don't know what to do next.

I suspect the disk that is offline is actually good.

Do you have any suggestions?
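
For a sense of scale, here is a rough estimate of how long a rebuild would take at that speed. It is only a sketch under two assumptions that the post does not state: the disk is 4 TB (the WD40EFRX model named later in the thread is a 4 TB drive) and the reported speed means 350,000 bytes per second. `rebuild_days` is a made-up helper name.

```shell
# Rough rebuild time estimate. Both inputs are assumptions:
#   disk size = 4 TB   (4 * 10^12 bytes)
#   speed     = 350 KB/s (350 * 10^3 bytes per second)
rebuild_days() {
    # $1 = disk size in bytes, $2 = speed in bytes per second
    awk -v size="$1" -v speed="$2" 'BEGIN { printf "%.1f\n", size / speed / 86400 }'
}

rebuild_days 4000000000000 350000
```

Under those assumptions the result is a bit over 132 days, which is why a rebuild at this speed is not practical.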

godzilla-diagnostics-20201023-1530.zip

Edited by caplam
Link to comment

You can try the invalid slot command. Follow the instructions below carefully and ask if there's any doubt.

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Check all assignments and assign any missing disk(s) if needed, don't assign parity2, if you have a spare to rebuild disk2 (same size or larger) use it since it will leave you with more options if this doesn't work
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 2 29

-Back on the GUI and without refreshing the page, just start the array, do not check the "parity is already valid" box (GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the invalid slot command, but they won't be as long as the procedure was correctly done), disk2 will start rebuilding, disk should mount immediately but if it's unmountable don't format, wait for the rebuild to finish (or cancel it)  and then run a filesystem check.
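
On the "don't copy/paste" warning: one way to see whether a line picked up hidden characters is to check that it is plain printable ASCII before running it. This is a minimal sketch, not part of the procedure itself. `is_plain_ascii` is a made-up helper name, and it only inspects the text, it never runs mdcmd.

```shell
# Hypothetical helper: returns 0 only if the text is plain printable ASCII.
is_plain_ascii() {
    # In the C locale, [^ -~] matches any byte outside the range space..tilde,
    # which catches things like non-breaking spaces pasted from a web page.
    if printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'; then
        return 1
    fi
    return 0
}

cmd='mdcmd set invalidslot 2 29'
if is_plain_ascii "$cmd"; then
    echo "looks clean: $cmd"
else
    echo "hidden or non-ASCII characters found, retype it by hand" >&2
fi
```

Typing the command by hand, as suggested above, avoids the problem entirely; the check is only useful if you pasted it anyway.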

Link to comment

Not sure I understand.

My present situation is:

disks 1&3 ok

disk4 disabled

disk2 rebuilding (paused at 2%)

parity1 ok

parity2 failing, lots of errors

 

If I understand correctly:

8 minutes ago, JorgeB said:

-Tools -> New Config -> Retain current configuration: All -> Apply

is for reassigning drives

9 minutes ago, JorgeB said:

-Check all assignments and assign any missing disk(s) if needed, don't assign parity2, if you have a spare to rebuild disk2 (same size or larger) use it since it will leave you with more options if this doesn't work

I unassign parity2 (as it's failing, I suppose it's useless for rebuilding)

I unplug disk2 and replace it with a precleared one

11 minutes ago, JorgeB said:

-Important - After checking the assignments leave the browser on that page, the "Main" page.

clear enough 

11 minutes ago, JorgeB said:

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):

mdcmd set invalidslot 2 29

It seems mdcmd has no built-in help.

15 minutes ago, JorgeB said:

Back on the GUI and without refreshing the page, just start the array, do not check the "parity is already valid" box (GUI will still show that data on parity disk(s) will be overwritten,

At this stage I think I have only one parity disk in the array (parity2 has been unassigned).

 

16 minutes ago, JorgeB said:

this is normal as it doesn't account for the invalid slot command, but they won't be as long as the procedure was correctly done), disk2 will start rebuilding, disk should mount immediately but if it's unmountable don't format, wait for the rebuild to finish (or cancel it)  and then run a filesystem check.

For this step I have a new precleared disk as my disk2.

 

So I suppose all these steps are meant to get disk4 back in the array.

Link to comment
28 minutes ago, caplam said:

i ran xfs_repair on disk2 in maintenance mode. 

I then restarted the array but no luck : disk2 is disabled.

xfs_repair will not clear a disabled state; it is intended to fix a disk that is unmountable. If the drive is disabled, then it is the emulated disk that is being fixed. The standard way to clear the disabled state is to rebuild the disk.

Link to comment

I've just realised that xfs_repair had been run (from the GUI) on /dev/md2, which is the emulated disk, so it's logical that the drive is still disabled.

I shouldn't have let the rebuild run. It was useless.

So a new rebuild is running, next stage in 7 hours.

In the meantime, can I re-enable Docker and VMs?

Link to comment
1 hour ago, caplam said:

i've just realised that xfs_repair had been run (from gui) on /dev/md2 which is the emulated disk;

If it had not run against the mdX device it would have invalidated parity which would not be a good idea.

 

What are you rebuilding?   If it is the disabled disk you will end up with whatever showed on the emulated drive before the rebuild.
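
To illustrate the point about the mdX device: a filesystem check on an array disk is aimed at /dev/mdN (N being the disk number) rather than at the raw sdX device, so parity stays in sync with whatever the repair changes. The sketch below only prints a read-only check command and runs nothing. `print_fs_check_cmd` is a hypothetical helper, `-n` is xfs_repair's no-modify mode, and the /dev/mdN naming is taken from this thread (it may differ on other Unraid versions).

```shell
# Hypothetical helper: prints a read-only check command, it runs nothing.
print_fs_check_cmd() {
    case $1 in
        ''|*[!0-9]*)
            # Refuse anything that is not a plain disk number, e.g. "sdr"
            echo "expected a disk number, got '$1'" >&2
            return 1
            ;;
    esac
    # -n is xfs_repair's no-modify mode: report problems, change nothing.
    echo "xfs_repair -n /dev/md$1"
}

print_fs_check_cmd 2
```

As mentioned earlier in the thread, the array would be started in maintenance mode before running a check like this.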

 

Link to comment

I followed the re-enable procedure:
Stop array
Unassign disk2
Start array
Stop array
Assign disk2
Start array
At this point the rebuild starts.
Before that I ran xfs_repair from the GUI (so on the emulated disk).
The rebuild is on its way. I think I'll start the VMs and dockers.
Normally, it should be good.
If not, I guess I'll have to find out how to start with a blank disk2 and restore from backup.
But I also have the old disk2. From what I saw, the corrupted data wasn't important (temporary download files).

Link to comment

During the rebuild I was able to browse disk2.

Now the rebuild is done, but it ended with disk2 disabled again.

What can I do now?

 

Edit: there is something strange. I see the disk as disk2, which is disabled, as device sdr,

and I also see it as an unassigned device, as device sdq.
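
One way to check which device name the drive currently has is to look it up by serial number. This is a minimal sketch, assuming `lsblk` is available on the server. `devs_for_serial` is a made-up helper and the sample input is invented for illustration; only the serial is real, taken from the SMART notification posted further down. If the serial only shows up under sdq, the drive may have dropped and been re-detected under a new name, but that is a guess, not a diagnosis.

```shell
# Hypothetical helper: reads "NAME SERIAL" pairs on stdin and prints
# every device name that currently reports the given serial.
devs_for_serial() {
    awk -v want="$1" '$2 == want { print $1 }'
}

# Illustrative sample, not taken from the real server.
sample='sdb WD-EXAMPLE0001
sdq WD-WCC4E1KN5L9R'

printf '%s\n' "$sample" | devs_for_serial WD-WCC4E1KN5L9R

# On the live system the input would come from lsblk instead:
#   lsblk -dn -o NAME,SERIAL | devs_for_serial WD-WCC4E1KN5L9R
```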

 

Edited by caplam
Link to comment
6 minutes ago, caplam said:

during rebuilding i was able  to browse disk2. 

Now rebuilding is done but end with disk2 disabled again.

What can i do now ?

I can only suggest that you post your diagnostics again.      If the drive is disabled after the rebuild process then that suggests that a write to it failed during the rebuild process.   There might be something in the diagnostics to give a clue as to what exactly happened.

Link to comment

From the log I see many SAS errors.

I have no SAS disks in the array. All array disks are in my main case on SATA ports.

On the SAS controller I have disks in an external case. For now they are all unassigned.

 

I also see write errors and, before that, "link is slow to respond" on a SATA port.

What does it mean? A bad cable? (That would be bad luck.)
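
To see which SATA port the link messages belong to, the syslog can be summarised per ATA port. A rough sketch only: `count_link_errors` is a hypothetical helper, the sample lines are made up for illustration (they are not from the attached diagnostics), and /var/log/syslog is assumed to be where the live log sits.

```shell
# Hypothetical helper: counts link-related kernel messages per ATA port.
count_link_errors() {
    awk '
        /link is slow to respond|COMRESET failed|hard resetting link/ {
            # Pull out the port name, e.g. "ata5" from "ata5:" or "ata5.00:"
            if (match($0, /ata[0-9]+/))
                count[substr($0, RSTART, RLENGTH)]++
        }
        END { for (port in count) print port, count[port] }
    ' | sort
}

# Invented sample lines, for illustration only.
sample='Oct 24 00:00:01 host kernel: ata5: link is slow to respond, please be patient (ready=0)
Oct 24 00:00:06 host kernel: ata5: COMRESET failed (errno=-16)
Oct 24 00:00:06 host kernel: ata5: hard resetting link
Oct 24 00:01:30 host kernel: ata7: hard resetting link'

printf '%s\n' "$sample" | count_link_errors

# On the live system:
#   count_link_errors < /var/log/syslog
```

If one port collects most of the messages, swapping the cable or moving the disk to another port is a cheap way to tell a cabling problem from a failing drive.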

 

Link to comment

When I try to spin it up I get:

Unraid Disk 2 SMART health [1]: 24-10-2020 19:16
Warning [GODZILLA] - raw read error rate is 132
WDC_WD40EFRX-68WT0N0_WD-WCC4E1KN5L9R (sdr)

Unraid Disk 2 SMART health [200]: 24-10-2020 19:16
Warning [GODZILLA] - multi zone error rate is 1
WDC_WD40EFRX-68WT0N0_WD-WCC4E1KN5L9R (sdr)

Can I rebuild onto another disk?

I have others as spares.
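
The two attributes in those notifications (IDs 1 and 200) can also be read straight from the drive's SMART table. A minimal sketch, assuming `smartctl` is installed: `show_smart_attrs` is a made-up helper, and in the sample table only the two raw values (132 and 1) come from the notifications above, everything else is invented for illustration.

```shell
# Hypothetical helper: filters an attribute table (as printed by
# "smartctl -A") down to attributes 1 and 200.
show_smart_attrs() {
    # Field 1 is the attribute ID, field 2 its name, field 10 the raw value.
    awk '$1 == 1 || $1 == 200 { print $1, $2, $10 }'
}

# Invented sample table; only the raw values 132 and 1 are from the thread.
sample='ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       132
  9 Power_On_Hours          0x0032   050   050   000    Old_age   Always       -       36000
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1'

printf '%s\n' "$sample" | show_smart_attrs

# On the live system:
#   smartctl -A /dev/sdr | show_smart_attrs
```

Checking whether the raw values keep climbing between runs says more than a single reading.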

Link to comment
