skinsvpn Posted May 6, 2020 (edited)
So I had a disc that was giving SMART errors and I wanted to replace it before it went belly up. While swapping it out, a drive possibly shorted and actually melted its SATA data connector (one of those drives with Kapton tape over the 3.3V pins). The worst part is that I only have one parity drive. I still have the old but functioning drive that I can pull data off of, but what would be the best sequence of steps to get back up and running? My last off-site backup was about a month ago, so I shouldn't lose too much data either way, but I'd obviously prefer to lose none.
Edited May 21, 2020 by skinsvpn (Resolution)
trurl Posted May 6, 2020
Go to Tools - Diagnostics and attach the complete diagnostics ZIP file to your NEXT post.
trurl Posted May 6, 2020
8 minutes ago, skinsvpn said: So I had a disc that was giving SMART errors and I wanted to replace it before it went belly up. While swapping it out
Can we assume you never started the array while that disk was removed? If that is true, then the only missing disk will be the one you fried, so you should be able to rebuild the fried disk to a new disk, assuming the disk you were intending to replace works well enough.
skinsvpn (Author) Posted May 6, 2020
No, the array has not been started. Are you saying I should place the potentially failing drive back in the array and rebuild once I've replaced the fried drive with a new one? Then replace the failing drive? Thanks, that makes sense to me. Unfortunately I powered down once I saw that the drive (later discovered to be fried) was missing. I assumed I had bumped a SATA cable or something simple. This means the important info in the logs are gone correct? I've attached the latest at least. tower-diagnostics-20200506-1653.zip
skinsvpn (Author) Posted May 6, 2020
One thing I should add is that the dying drive was placed into a Disabled state by unraid.
trurl Posted May 6, 2020
28 minutes ago, skinsvpn said: One thing I should add is that the dying drive was placed into a Disabled state by unraid.
Well, that is important, since that is the disk Unraid wants to rebuild. But it is possible, though a bit more complicated, to get it to rebuild a different disk instead.

1 hour ago, skinsvpn said: This means the important info in the logs are gone correct?
The most important thing I wanted to look at is the SMART report for the disabled disk, which would have been in the diagnostics if it was still connected. Looks like that was disk1, which is disabled and not present. Disk7 is also not present; I assume that is the fried disk. Do you have a SMART report for disk1, a screenshot of its SMART attributes, or any older diagnostics that I might look at?
trurl Posted May 6, 2020
1 hour ago, skinsvpn said: disc that was giving SMART errors
What specifically were these SMART errors and where did you see them?
skinsvpn (Author) Posted May 6, 2020
1 hour ago, trurl said: What specifically were these SMART errors and where did you see them?
Looking at my previous notifications I got a message saying the reallocated sector count is 17 then a message followed saying the drive is now disabled.

I would have to connect the drive to execute this plan so if I connected it now and got another diagnostic would it contain the necessary info? To complicate things I had to order a new sata breakout cable (since that connector melted...) so until that arrives I don't have the capability of reconnecting without removing another drive. Not sure if that would make finding that specific drives info in the logs more difficult.
skinsvpn (Author) Posted May 6, 2020
I actually do have a fairly recent diagnostics file, from April 29. tower-diagnostics-20200429-0800.zip
trurl Posted May 7, 2020
26 minutes ago, skinsvpn said: recent diagnostics file
That disk1 SMART report didn't show any reallocated sectors, but it did show a number of pending sectors, which was actually more concerning.

48 minutes ago, skinsvpn said: got a message saying the reallocated sector count is 17
The reallocated sectors you have now were probably some of those pending ones, which is actually a good thing if the pending count has decreased.

49 minutes ago, skinsvpn said: message followed saying the drive is now disabled
Unraid disables a disk when a write to it fails. The failed write and all subsequent writes to that disk still update parity, so those writes can be recovered, but the actual disk isn't used any more (disabled) and won't be used again until it is rebuilt, because it is no longer in sync with parity. When a disk is disabled, Unraid emulates the disk (for both read and write) using the parity calculation. A failed read can sometimes cause a failed write, because if the data can't be read from the disk, Unraid will get the data from the parity calculation instead, and then try to write it back to the disk.

33 minutes ago, skinsvpn said: I would have to connect the drive to execute this plan so if I connected it now and got another diagnostic would it contain the necessary info? To complicate things I had to order a new sata breakout cable (since that connector melted...) so until that arrives I don't have the capability of reconnecting without removing another drive. Not sure if that would make finding that specific drives info in the logs more difficult.
The way things stand now, it isn't possible to start the array since there are 2 missing disks. And it won't even let you start it with that disk1 installed again without jumping through some hoops, since that disk has to be rebuilt, which can't happen while disk7 is missing. No problem removing another drive to get disk1 reconnected. But it isn't that important either, at least for now. We can get a current SMART report for it when you are able to get everything connected again. And in order to rebuild disk7 there isn't really any choice but to rely on disk1. Not sure if you know how parity works, but in order to rebuild a disk, parity PLUS ALL other disks must be read to calculate the data for the missing disk. Parity is just an extra bit that allows a missing bit to be calculated from all the other bits.
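To make the "parity PLUS ALL other disks" point concrete, here is a minimal sketch of single-parity reconstruction using XOR. The 4-byte "disks" and the helper name `xor_blocks` are made up for illustration, and this is not Unraid's actual code; it only shows why every surviving disk has to be readable for the rebuild to come out right.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; real disks are just much longer runs of bytes.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
disk7 = bytes([0x0F, 0xF0, 0x55, 0x00])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk7])

# disk7 is lost: rebuild it from parity plus every surviving data disk.
rebuilt_disk7 = xor_blocks([parity, disk1, disk2])

# If a surviving disk returns wrong data (one flipped bit on disk1 here),
# the rebuilt disk is wrong in exactly the same position.
bad_disk1 = bytes([disk1[0] ^ 0x01]) + disk1[1:]
corrupted_rebuild = xor_blocks([parity, bad_disk1, disk2])

print(rebuilt_disk7 == disk7)
print(corrupted_rebuild == disk7)
```

The second comparison is the reason disk1's health matters here: any sector it cannot return correctly during the rebuild ends up as bad data at the same offset on the rebuilt disk7.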
skinsvpn (Author) Posted May 17, 2020
So I finally got my replacement drive and replacement cables in. If I understood correctly, I need to first get my disabled drive back online and then rebuild the fried drive. Do I start with the 'Trust My Array' procedure in the wiki to get the disabled drive back? My latest diagnostics: tower-diagnostics-20200517-1733.zip
trurl Posted May 18, 2020
3 hours ago, skinsvpn said: Do I start with the 'Trust My Array' procedure in the wiki to get the disabled drive back?
No. In this situation, you would use the invalidslot command to tell Unraid to rebuild disk7 instead of rebuilding disk1. But unfortunately disk1 doesn't look very healthy. I am going to try to get another opinion from @johnnie.black and see if he has a better idea. Maybe trying to clone disk1 and using the clone for the rebuild? He is probably asleep now so wait a few hours.
JorgeB Posted May 18, 2020
Though SMART looks bad I would first confirm disk1 is really failing by running an extended SMART test.
skinsvpn (Author) Posted May 18, 2020
5 hours ago, johnnie.black said: Though SMART looks bad I would first confirm disk1 is really failing by running an extended SMART test.
Extended test in progress. Would a screenshot be sufficient afterwards, or is a diagnostics file still best?
JorgeB Posted May 18, 2020
You just need to check the result; if in doubt, post the SMART report.
skinsvpn (Author) Posted May 19, 2020
Extended test passed

Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error  00%        38832            -

ID#  ATTRIBUTE_NAME           FLAGS   VALUE  WORST  THRESH  FAIL  RAW_VALUE
  1  Raw_Read_Error_Rate      POSR-K  190    189    051     -     690643
  3  Spin_Up_Time             POS--K  154    147    021     -     9275
  4  Start_Stop_Count         -O--CK  097    097    000     -     3583
  5  Reallocated_Sector_Ct    PO--CK  187    187    140     -     187
  7  Seek_Error_Rate          -OSR-K  200    200    000     -     0
  9  Power_On_Hours           -O--CK  047    047    000     -     38832
 10  Spin_Retry_Count         -O--CK  100    100    000     -     0
 11  Calibration_Retry_Count  -O--CK  100    100    000     -     0
 12  Power_Cycle_Count        -O--CK  100    100    000     -     234
192  Power-Off_Retract_Count  -O--CK  200    200    000     -     175
193  Load_Cycle_Count         -O--CK  175    175    000     -     77941
194  Temperature_Celsius      -O---K  113    108    000     -     39
196  Reallocated_Event_Count  -O--CK  120    120    000     -     80
197  Current_Pending_Sector   -O--CK  200    198    000     -     31
198  Offline_Uncorrectable    ----CK  200    198    000     -     0
199  UDMA_CRC_Error_Count     -O--CK  200    200    000     -     0
200  Multi_Zone_Error_Rate    ---R--  200    190    000     -     2

I would assume that this builds confidence in using disk1 in its current state to rebuild disk7? Then move on and replace disk1.
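For anyone reading a report like the one above, here is a small sketch that pulls out the attributes discussed in this thread (reallocated and pending sectors) and lists the ones with a non-zero raw value. The column layout is assumed to match the attribute table posted here (ID, name, flags, value, worst, thresh, fail, raw), the sample rows are copied from that post, and the watch list and function names are illustrative choices rather than any official rule.

```python
# A few rows copied from the SMART attribute table posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK 187 187 140 - 187
197 Current_Pending_Sector  -O--CK 200 198 000 - 31
198 Offline_Uncorrectable   ----CK 200 198 000 - 0
199 UDMA_CRC_Error_Count    -O--CK 200 200 000 - 0
"""

# Attribute IDs treated as worth a closer look (an assumption for this sketch).
WATCH = {5, 196, 197, 198}

def parse_attributes(text):
    """Return {id: (name, raw)} for rows shaped like the table above."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Expect: ID NAME FLAGS VALUE WORST THRESH FAIL RAW
        if len(parts) < 8 or not parts[0].isdigit():
            continue
        # Some attributes carry composite raw values; keep only plain integers.
        raw = int(parts[7]) if parts[7].isdigit() else None
        rows[int(parts[0])] = (parts[1], raw)
    return rows

def flag_watched(rows):
    """Names and raw values of watched attributes with a non-zero raw count."""
    return {name: raw for attr_id, (name, raw) in rows.items()
            if attr_id in WATCH and raw}

flagged = flag_watched(parse_attributes(SAMPLE))
print(flagged)
```

A non-zero count here is only a prompt to look closer. As the replies in this thread show, the extended self-test result is what was used to decide whether disk1 could be trusted for the rebuild.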
JorgeB Posted May 19, 2020
6 hours ago, skinsvpn said: I would assume that this builds confidence in using disk1 in its current state to rebuild disk7?
Yes, it should be fine for the rebuild, but it's probably a good idea to replace it afterwards.
skinsvpn (Author) Posted May 19, 2020
I've been looking for the correct usage for invalidslot and found this recent post of yours:

Quote:
- Tools -> New Config -> Retain current configuration: All -> Apply
- Assign any missing disk(s) if needed
- Important - After checking the assignments leave the browser on that page, the "Main" page.
- Open an SSH session/use the console and type (don't copy/paste directly from the forum, as sometimes it can insert extra characters):
  mdcmd set invalidslot 1 29
- Back on the GUI and without refreshing the page, just start the array, do not check the "parity is already valid" box (GUI will still show that data on parity disk(s) will be overwritten, this is normal as it doesn't account for the invalid slot command, but they won't be as long as the procedure was correctly done), disk1 will start rebuilding, disk should mount immediately but if it's unmountable don't format, wait for the rebuild to finish and then run a filesystem check

Just so I don't botch anything: my understanding is that in the command above, the 1 is the disk being rebuilt and 29 is the parity, correct? Since I need to rebuild disk7 and then re-enable disk1, I would run:
mdcmd set invalidslot 7 29
JorgeB Posted May 19, 2020
16 minutes ago, skinsvpn said: mdcmd set invalidslot 7 29
Yes, that's it. 29 is for parity2; when it's not installed, which is also your case, it's also invalid. But type the command, don't copy/paste it from the forum.
skinsvpn (Author) Posted May 19, 2020
I am getting the following error immediately when running the mdcmd command:
Warning: Division by zero in /usr/local/emhttp/plugins/parity.check.tuning/parity.check.tuning.php on line 62
This is a copy from my terminal of exactly what I ran:
root@Tower:~# mdcmd set invalidslot 7 29
skinsvpn (Author) Posted May 19, 2020 (edited)
May 19 08:07:44 Tower emhttpd: import 30 cache device: (sdc) SPCC_Solid_State_Disk_AA000000000000000807
Did I need to run invalidslot with a 30 for parity?
Edit: why did I think the cache was my parity? Not enough coffee maybe...
Edited May 19, 2020 by skinsvpn (Mistaken information)
JorgeB Posted May 19, 2020
The plugin error should be harmless, but you could just uninstall it for now.
skinsvpn (Author) Posted May 19, 2020
Does the invalidslot command work immediately? There is no output from the command in the terminal.
JorgeB Posted May 19, 2020
Yes, no output is normal, but pay attention to this:
1 hour ago, skinsvpn said: Back on the GUI and without refreshing the page
After the command is entered the GUI can't be refreshed, or else the command won't take effect and the array will begin syncing parity instead of rebuilding the disk.
skinsvpn (Author) Posted May 19, 2020
Ok, just for my sanity (which is a slow process, so thank you for your patience here): my GUI was on the Plug-Ins tab, to uninstall the plugin that was giving an error, when I ran the invalidslot command. Is going to the Main tab going to have the same effect as a refresh? Should I go to the Main tab and rerun the command?