[SOLVED] Best way out of this jam


Posted (edited)

So I had a disk that was giving SMART errors and I wanted to replace it before it went belly up. While swapping it out, a drive possibly shorted and actually melted the SATA data connector (one of those drives that need Kapton tape over the 3.3V pins). The worst part is that I only have one parity disk. I still have the old but functioning drive, which I can pull data off of, but what would be the best sequence of steps to get back up and running? My last off-site backup was about a month ago, so I shouldn't lose too much data either way, but I'd obviously prefer to lose none.

Edited by skinsvpn
Resolution
Posted
8 minutes ago, skinsvpn said:

So I had a disc that was giving SMART errors and I wanted to replace it before it went belly up. While swapping it out

Can we assume you never started the array while that disk was removed? If that is true, then the only missing disk will be the one you fried, so you should be able to rebuild the fried disk to a new disk assuming the disk you were intending to replace works well enough.

Posted

No, the array has not been started. Are you saying I should place the potentially failing drive back in the array and rebuild once I've replaced the fried drive with a new one? Then replace the failing drive? Thanks, that makes sense to me.

 

Unfortunately I powered down once I saw that the drive that later turned out to be fried was missing. I assumed I had bumped a SATA cable or something simple. This means the important info in the logs is gone, correct? I've attached the latest diagnostics at least.

tower-diagnostics-20200506-1653.zip

Posted
28 minutes ago, skinsvpn said:

One thing I should add is that the dying drive was placed into a Disabled state by unraid. 

Well, that is important, since that is the disk Unraid wants to rebuild. But it is possible, though a bit more complicated, to get it to rebuild a different disk instead.

 

1 hour ago, skinsvpn said:

This means the important info in the logs are gone correct?

The most important thing I wanted to look at is the SMART report for the disabled disk, which would have been in the diagnostics if it was still connected. Looks like that was disk1, which is disabled and not present. Disk7 is also not present; I assume that is the fried disk.

 

Do you have a SMART report for disk1, a screenshot of its SMART attributes, or any older diagnostics that I might look at?

Posted
1 hour ago, trurl said:

What specifically were these SMART errors and where did you see them?

Looking at my previous notifications, I got a message saying the reallocated sector count is 17, followed by a message saying the drive is now disabled. I would have to connect the drive to execute this plan, so if I connected it now and got another diagnostic, would it contain the necessary info? To complicate things, I had to order a new SATA breakout cable (since that connector melted...), so until that arrives I can't reconnect it without removing another drive. Not sure if that would make finding that specific drive's info in the logs more difficult.

Posted
26 minutes ago, skinsvpn said:

recent diagnostics file

That disk1 SMART didn't have reallocated, but it did have a number of pending, which was actually more concerning.

48 minutes ago, skinsvpn said:

got a message saying the reallocated sector count is 17

Probably the reallocated you have now were some of those pending, which is actually a good thing if pending has decreased.

49 minutes ago, skinsvpn said:

message followed saying the drive is now disabled

Unraid disables a disk when a write to it fails. The failed write and all subsequent writes to that disk still update parity, so those writes can be recovered, but the actual disk isn't used any more (disabled) and won't be used again until rebuilt, because it is no longer in sync with parity. When a disk is disabled, Unraid emulates the disk (for both read and write) using the parity calculation.

 

A failed read can sometimes cause a failed write, because if the data can't be read from the disk, Unraid will get the data from the parity calculation instead, and then try to write it back to the disk.
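
To make the emulation idea concrete, here is a minimal Python sketch. It assumes single parity is a plain XOR across the data disks (consistent with the "extra bit" description, but still an assumption here), and the block contents and helper names are hypothetical, not anything from Unraid itself.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def emulated_read(parity, other_disks):
    """Read a disabled disk's block from parity plus every other data disk."""
    return xor_blocks([parity, *other_disks])


def emulated_write(new_data, other_disks):
    """Write to a disabled disk: only parity is recomputed, the disk is untouched."""
    return xor_blocks([new_data, *other_disks])


# Hypothetical 4-byte blocks for a small array where disk1 has been disabled.
disk2 = bytes([0x10, 0x20, 0x30, 0x40])
disk3 = bytes([0x01, 0x02, 0x03, 0x04])
stale_disk1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])  # what is physically on disk1
parity = xor_blocks([stale_disk1, disk2, disk3])

# A write aimed at disk1 arrives after it was disabled: parity absorbs it.
new_disk1 = bytes([0xDE, 0xAD, 0xBE, 0xEF])
parity = emulated_write(new_disk1, [disk2, disk3])

# The emulated disk1 shows the new data; the physical disk no longer matches.
print(emulated_read(parity, [disk2, disk3]) == new_disk1)    # True
print(emulated_read(parity, [disk2, disk3]) == stale_disk1)  # False
```

That mismatch between the emulated contents and the physical disk is why a disabled disk has to be rebuilt before it is used again.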

 

33 minutes ago, skinsvpn said:

I would have to connect the drive to execute this plan so if I connected it now and got another diagnostic would it contain the necessary info? To complicate things I  had to order a new sata breakout cable (since that connector melted...) so until that arrives I don't have the capability of reconnecting without removing another drive. Not sure if that would make finding that specific drives info in the logs more difficult.

The way things stand now, it isn't possible to start the array since there are 2 missing disks. And it won't even let you start it with that disk1 installed again without jumping through some hoops, since that disk has to be rebuilt which can't happen with missing disk7.

 

No problem removing another drive to get disk1 reconnected. But it isn't that important either, at least for now. We can get current SMART for it when you are able to get everything connected again. And there isn't really any choice but to rely on disk1 for the rebuild of disk7.

 

Not sure if you know how parity works, but in order to rebuild a disk, parity PLUS ALL other disks must be read to calculate the data for the missing disk. Parity is just an extra bit that allows a missing bit to be calculated from all the other bits.
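
As a rough illustration of why the rebuild of disk7 depends on disk1 reading correctly, here is a small Python sketch. It assumes XOR parity, and the 4-bit "disk contents" and function names are made up for the example.

```python
from functools import reduce
from operator import xor


def parity_of(disks):
    """Parity is the XOR of the same bit positions on every data disk."""
    return reduce(xor, disks.values())


def rebuild(parity, surviving_disks):
    """Recompute the missing disk from parity PLUS all surviving data disks."""
    return reduce(xor, surviving_disks.values(), parity)


# Hypothetical 4-bit contents at one position on each data disk.
disks = {"disk1": 0b1011, "disk2": 0b0110, "disk7": 0b1100}
parity = parity_of(disks)

# disk7 is gone; everything else reads correctly.
surviving = {name: bits for name, bits in disks.items() if name != "disk7"}
print(rebuild(parity, surviving) == disks["disk7"])  # True

# Same rebuild, but disk1 returns one wrong bit: the error lands in disk7.
bad_read = dict(surviving, disk1=surviving["disk1"] ^ 0b0001)
print(bin(rebuild(parity, bad_read)))  # 0b1101 instead of 0b1100
```

Every bit that disk1 returns wrong ends up wrong on the rebuilt disk7, so the health of disk1 matters for the whole rebuild.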

  • 2 weeks later...
Posted
3 hours ago, skinsvpn said:

Do I start with the 'Trust My Array' procedure in the wiki to get the disabled drive back? 

No. In this situation, you would use the invalidslot command to tell Unraid to rebuild disk7 instead of rebuilding disk1. But unfortunately disk1 doesn't look very healthy. I am going to try to get another opinion from @johnnie.black and see if he has a better idea. Maybe trying to clone disk1 and using the clone for the rebuild? He is probably asleep now so wait a few hours.

Posted
5 hours ago, johnnie.black said:

Though SMART looks bad I would first confirm disk1 is really failing by running an extended SMART test.

Extended test in progress. Would a screenshot be sufficient after or is a diagnostic file still best? 

Posted

Extended test passed

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     38832         -
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   190   189   051    -    690643
  3 Spin_Up_Time            POS--K   154   147   021    -    9275
  4 Start_Stop_Count        -O--CK   097   097   000    -    3583
  5 Reallocated_Sector_Ct   PO--CK   187   187   140    -    187
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   047   047   000    -    38832
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    234
192 Power-Off_Retract_Count -O--CK   200   200   000    -    175
193 Load_Cycle_Count        -O--CK   175   175   000    -    77941
194 Temperature_Celsius     -O---K   113   108   000    -    39
196 Reallocated_Event_Count -O--CK   120   120   000    -    80
197 Current_Pending_Sector  -O--CK   200   198   000    -    31
198 Offline_Uncorrectable   ----CK   200   198   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   190   000    -    2

I would assume that this builds confidence in using disk1 in its current state to rebuild disk7? Then move on and replace disk1.
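
For anyone who wants to pull the sector-health numbers out of a pasted attribute table like the one above, here is a minimal Python sketch. The parsing is based only on the column layout shown in this post, and the choice of which attributes to watch is an assumption that follows the thread's focus on reallocated and pending sectors. Raw values can be vendor-specific, so treat the output as a hint, not a verdict.

```python
# A few rows copied from the attribute table in this post.
SMART_TABLE = """\
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  5 Reallocated_Sector_Ct   PO--CK   187   187   140    -    187
196 Reallocated_Event_Count -O--CK   120   120   000    -    80
197 Current_Pending_Sector  -O--CK   200   198   000    -    31
198 Offline_Uncorrectable   ----CK   200   198   000    -    0
"""

# Assumed watch list, not an official one.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart(table):
    """Map attribute name -> normalized value, threshold and raw value."""
    attrs = {}
    for line in table.splitlines():
        fields = line.split()
        # Skip the header and anything that is not an eight-column attribute row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": int(fields[7]),
        }
    return attrs


attrs = parse_smart(SMART_TABLE)
for name in WATCHED:
    a = attrs[name]
    below = a["value"] <= a["thresh"]
    print(f"{name}: raw={a['raw']} normalized={a['value']} thresh={a['thresh']} at_or_below_thresh={below}")
```

Saving the output before and after a rebuild makes it easy to see whether pending sectors are going down or reallocated sectors are going up.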

 

Posted
6 hours ago, skinsvpn said:

I would assume that this builds confidence in using disk1 in its current state to rebuild disk7?

Yes, it should be fine for the rebuild, then probably a good idea to replace it.

Posted

I've been looking for the correct usage for invalidslot and found this recent post of yours
 

Quote

 

-Tools -> New Config -> Retain current configuration: All -> Apply
-Assign any missing disk(s) if needed
-Important - After checking the assignments leave the browser on that page, the "Main" page.

-Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):


mdcmd set invalidslot 1 29

-Back on the GUI and without refreshing the page, just start the array, do not check the "parity is already valid" box (GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the invalid slot command, but they won't be as long as the procedure was correctly done), disk1 will start rebuilding, disk should mount immediately but if it's unmountable don't format, wait for the rebuild to finish and then run a filesystem check

 

Just so I don't botch anything: my understanding is that in the command above, the 1 is the disk being rebuilt and 29 is the parity, correct? Since I need to rebuild disk7 and then get disk1 enabled, I would run:

mdcmd set invalidslot 7 29

 

Posted
16 minutes ago, skinsvpn said:

mdcmd set invalidslot 7 29

Yes, that's it. 29 is for parity2 when it's not installed, which is also your case, so it's also invalid. But type the command, don't copy/paste it from the forum.
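
If a command did get pasted somewhere, one way to spot the kind of invisible extra characters mentioned above is a quick check like this Python sketch. The function name and the sample strings are hypothetical; it simply flags anything outside printable ASCII.

```python
import unicodedata


def suspicious_characters(command):
    """Return (index, character name) for each character outside printable ASCII."""
    found = []
    for index, char in enumerate(command):
        if not (32 <= ord(char) <= 126):
            found.append((index, unicodedata.name(char, f"U+{ord(char):04X}")))
    return found


typed = "mdcmd set invalidslot 7 29"
# Hypothetical paste that picked up a no-break space and a zero-width space.
pasted = "mdcmd set invalidslot\u00a07 29\u200b"

print(suspicious_characters(typed))   # []
print(suspicious_characters(pasted))
```

Both strings look identical in most terminals, which is the reason for typing the command by hand.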

Posted

I am getting the following error immediately when running the mdcmd cmd

Warning: Division by zero in /usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php on line 62

This is a copy from my terminal for what I ran exactly

root@Tower:~# mdcmd set invalidslot 7 29

 

Posted (edited)
May 19 08:07:44 Tower emhttpd: import 30 cache device: (sdc) SPCC_Solid_State_Disk_AA000000000000000807

Did I need to run invalidslot with a 30 for parity?

 

Edit: why did i think the cache was my parity? not enough coffee maybe...

Edited by skinsvpn
Mistaken information
Posted

Yes, no output is normal, but pay attention to this:

1 hour ago, skinsvpn said:

Back on the GUI and without refreshing the page

After the command is entered the GUI can't be refreshed or else it won't take, and the array will begin syncing parity instead of rebuilding the disk.

Posted

Ok, just for my sanity, which is a slow process so thank you for your patience here: my GUI was on the Plug-Ins tab, to uninstall the plugin that was giving an error, when I ran the invalidslot command. Is going to the Main tab going to trigger the same effect as a refresh? Should I go to the Main tab and rerun the command?
