January 9, 2024

Hi all,

I have a Dell PowerEdge 310 (Xeon X3440, 16GB ECC DDR3, 10Gb Asus NIC, 9207-8e SAS interface) connected to three Dell PowerVault MD1200 12-drive units in series, all with 6TB Seagate SAS drives. Everything has been running well for the past 2 years under moderate load; recently I started moving large amounts of data over to the main array.

In the past 2 weeks, one parity drive and the first data drive would randomly go invalid, forcing me to rebuild. Then it would happen again a few days later in the same locations. Then one time it was both parity drives. Note, I'm assigning new disks each time; only once did I reuse them, not fully understanding what was going on. Then today, the worst-case scenario: while doing the parity rebuild (takes 3 days), I lost parity 1, parity 2 and the first data drive at 10% complete.

1) I'm not really sure what to do next to attempt to recover what I can from the first data disk.
2) Something is faulty/broken/etc. causing this to happen so frequently. I know the SMART data on some of the disks is concerning, and I've pulled those out of rotation, but the ones with an OK status are still causing this to happen as well.

No other changes have been made, other than that I'm starting to put the system under more load as I migrate data over. Still, the amount and frequency of what I'm moving over is nothing insane.

Diagnostics are attached from just after the 3 drives failed during the parity rebuild (exported without restarting the system first).

Any ideas or advice are greatly appreciated. Let me know as well if any additional information could be helpful. Thanks.

tower-diagnostics-20240108-2229.zip
January 9, 2024 Community Expert
Is the enclosure on the same UPS? Disks appear to have dropped after a power failure.
January 9, 2024 Author
6 hours ago, JorgeB said:
Is the enclosure on the same UPS? Disks appear to have dropped after a power failure.

Ohhhh that's interesting. Which log/line indicates that? And no, the computer is on direct wall power, and the MD1200s are on two Eaton UPSes, which have been inconsistent in delivering power.
January 9, 2024 Community Expert Solution
Jan 8 08:09:44 Tower kernel: scsi 5:0:25:0: _scsih_block_io_device skip device_block for SES handle(0x0025)
Jan 8 08:09:44 Tower kernel: scsi 5:0:38:0: _scsih_block_io_device skip device_block for SES handle(0x0032)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jan 8 08:09:45 Tower apcupsd[4706]: Power failure.
Jan 8 08:09:46 Tower kernel: sd 5:0:1:0: device_unblock and setting to running, handle(0x000b)
Jan 8 08:09:46 Tower kernel: sd 5:0:2:0: device_unblock and setting to running, handle(0x000c)

Connection with the disks is lost at the same time as the power fails.
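The correlation described above (disk block/unblock messages within a second or two of the apcupsd power-failure message) can be checked mechanically. The following is only a minimal sketch, not an Unraid tool: the function names, the 5-second window and the assumed year 2024 (the syslog timestamps carry no year) are all illustrative choices, and the sample text is just the lines quoted in the post.

```python
import re
from datetime import datetime, timedelta

# Syslog lines exactly as quoted in the post above.
SAMPLE = """\
Jan 8 08:09:44 Tower kernel: scsi 5:0:25:0: _scsih_block_io_device skip device_block for SES handle(0x0025)
Jan 8 08:09:44 Tower kernel: scsi 5:0:38:0: _scsih_block_io_device skip device_block for SES handle(0x0032)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jan 8 08:09:45 Tower apcupsd[4706]: Power failure.
Jan 8 08:09:46 Tower kernel: sd 5:0:1:0: device_unblock and setting to running, handle(0x000b)
Jan 8 08:09:46 Tower kernel: sd 5:0:2:0: device_unblock and setting to running, handle(0x000c)
"""

# "<Mon> <day> <HH:MM:SS> <host> <rest of message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+(.*)$")


def parse(line, year=2024):
    """Return (timestamp, message) or None for lines that are not syslog entries."""
    m = LINE_RE.match(line)
    if not m:
        return None
    # The year is an assumption; classic syslog timestamps do not include it.
    stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, m.group(2)


def disk_events_near_power_failure(text, window_seconds=5):
    """For each apcupsd power failure, list kernel block/unblock messages nearby in time."""
    events = [p for p in map(parse, text.splitlines()) if p]
    window = timedelta(seconds=window_seconds)
    failures = [t for t, msg in events if "apcupsd" in msg and "Power failure" in msg]
    hits = []
    for t_fail in failures:
        nearby = [
            msg
            for t, msg in events
            if msg.startswith("kernel:")
            and ("device_block" in msg or "device_unblock" in msg)
            and abs(t - t_fail) <= window
        ]
        hits.append((t_fail, nearby))
    return hits


if __name__ == "__main__":
    for t_fail, nearby in disk_events_near_power_failure(SAMPLE):
        print(f"Power failure at {t_fail}: {len(nearby)} disk event(s) within the window")
```

To run it against a real syslog from a diagnostics zip, read the file's text and pass it in place of `SAMPLE`.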
January 9, 2024 Author
6 minutes ago, JorgeB said:
Jan 8 08:09:44 Tower kernel: scsi 5:0:25:0: _scsih_block_io_device skip device_block for SES handle(0x0025)
Jan 8 08:09:44 Tower kernel: scsi 5:0:38:0: _scsih_block_io_device skip device_block for SES handle(0x0032)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jan 8 08:09:45 Tower apcupsd[4706]: Power failure.
Jan 8 08:09:46 Tower kernel: sd 5:0:1:0: device_unblock and setting to running, handle(0x000b)
Jan 8 08:09:46 Tower kernel: sd 5:0:2:0: device_unblock and setting to running, handle(0x000c)
Connection with the disks is lost at the same time as the power fails.

OK, looks like I'm going to have to take the UPSes out of the picture until I can get them figured out. Now, short term, are there any tips or tricks for bringing those 3 disks back online? I'm confident the disks are fine and have no data loss. Thanks!!
January 9, 2024 Community Expert
Reboot and post new diags after array start. You won't be able to emulate the disks. If the disks are assumed OK, doing a new config is probably the best option; you can check "parity is already valid" and then run a parity check.
January 9, 2024 Author
So, I just started the array back up after removing the UPSes and restarting. It has started a parity sync that it says will take an hour, but it says disk 1 is unmountable and needs to be formatted.

Edit: added diagnostics in case they are of any use.
tower-diagnostics-20240109-1137.zip
Edited January 9, 2024 by bradgoldring
January 9, 2024 Community Expert
10 minutes ago, bradgoldring said:
disk 1 is unmountable and needs to be formatted.

It doesn't say you NEED to format it. It will allow you to format if you check the box. DO NOT format any disk in the array that is supposed to have your data.
January 9, 2024 Community Expert
Format is a write operation. It writes an empty filesystem to the disk. Unraid treats this write operation exactly as it does any other, by updating parity so the array stays in sync. If you format a disk in the array, parity agrees the disk is empty, so an empty disk is the only thing parity could produce if you rebuild.
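The point about format can be illustrated with a toy model. This is a minimal sketch assuming simple XOR single parity over tiny 4-byte "disks", with all-zero bytes standing in for an empty filesystem; it is not Unraid's code, and the names `xor_blocks` and `demo` are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def demo():
    disk1, disk2 = b"DATA", b"MORE"
    parity = xor_blocks(disk1, disk2)

    # Before any format: parity plus the other disk reconstructs disk1's data.
    rebuilt_before = xor_blocks(parity, disk2)

    # "Format" disk1: an ordinary write, so parity is updated along with it
    # (old contents XORed out, new contents XORed in).
    empty_fs = b"\x00" * len(disk1)  # stand-in for an empty filesystem
    parity = xor_blocks(parity, disk1, empty_fs)
    disk1 = empty_fs

    # After the format: a rebuild can only reproduce the empty contents.
    rebuilt_after = xor_blocks(parity, disk2)
    return rebuilt_before, rebuilt_after


if __name__ == "__main__":
    before, after = demo()
    print("rebuilt before format:", before)
    print("rebuilt after format: ", after)
```

In this model the data is recoverable from parity right up until the format is written, and not afterwards, which is why formatting a disk that should hold data is not a recovery step.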
January 9, 2024 Community Expert
There are 3 invalid disks, so disk1 cannot be emulated. Since we think disk1 is OK, you can do a new config instead.
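The counting argument here is that an array with two parity devices can stand in for at most two invalid devices at once, parity disks included. A minimal sketch of that check follows; the function name and the device labels are illustrative only, not anything Unraid exposes.

```python
def can_emulate(invalid_devices: set[str], parity_slots: int = 2) -> bool:
    """True if the number of invalid devices does not exceed the parity count.

    Invalid parity disks count too: each one removes a source of redundancy.
    This only checks the count; it says nothing about whether parity is correct.
    """
    return len(invalid_devices) <= parity_slots


if __name__ == "__main__":
    # The situation in this thread: both parity disks and disk1 are invalid.
    print(can_emulate({"parity1", "parity2", "disk1"}))
    # With only two devices invalid, the count alone would allow emulation.
    print(can_emulate({"parity2", "disk1"}))
```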
January 9, 2024 Community Expert
No way 6TB parity can be done in an hour. 12+ hours is more likely. Not sure how you can have 2 disks disabled and the other parity invalid though.

As mentioned, New Config is likely the way forward. Then we can see what else might be needed. Syslog seems to indicate multiple disk problems still.
January 9, 2024 Author
50 minutes ago, trurl said:
No way 6TB parity can be done in an hour. 12+ hours is more likely. Not sure how you can have 2 disks disabled and the other parity invalid though. As mentioned, New Config is likely the way forward. Then we can see what else might be needed. Syslog seems to indicate multiple disk problems still.

Normally it's 3 days for a full parity check. So, now that this weird mini 1hr parity check has completed, parity disk 1 has come back online and appears fine, so I just have parity disk 2 and data disk 1 disabled with contents emulated. I'm going to swap disk 1 for a new disk and start the glorious 3-day parity check once again, but without the UPS concerns we should be good, I hope. Any concerns or comments on that approach before I start it?

The number of errors from this weird 1hr parity check is concerning though:

Thank you again as well!!
January 9, 2024 Community Expert
Don't do anything yet. You will be rebuilding an unmountable disk1. Post new diagnostics.
January 9, 2024 Community Expert
And it's unlikely parity is valid anyway, so you definitely don't want to rebuild disk1 like that.
January 9, 2024 Community Expert
As mentioned, New Config is the likely way forward. This means you have to keep current disk1 and all other disks assigned as is, and rebuild both parity disks. Parity in its current state can't rebuild disk1, so you have to hope for the best with its current contents. If it really is unmountable we can try to repair its filesystem after the parity rebuild.

You must have port multipliers if parity takes so long on 6TB.
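For a sense of scale on the durations mentioned in this thread (3 days reported, 12+ hours expected), the average per-disk throughput each one implies can be worked out directly. This is plain arithmetic, assuming decimal terabytes (1 TB = 10^12 bytes) and a constant rate over the whole 6TB parity disk; the function name is illustrative.

```python
def avg_mb_per_s(capacity_tb: float, hours: float) -> float:
    """Average throughput in MB/s needed to cover capacity_tb in the given hours."""
    total_bytes = capacity_tb * 1e12  # decimal TB, as drive vendors count it
    return total_bytes / (hours * 3600) / 1e6


if __name__ == "__main__":
    print(f"6 TB in 12 h: {avg_mb_per_s(6, 12):.1f} MB/s")   # 138.9 MB/s
    print(f"6 TB in 72 h: {avg_mb_per_s(6, 72):.1f} MB/s")   # 23.1 MB/s
    print(f"6 TB in 1 h:  {avg_mb_per_s(6, 1):.1f} MB/s")    # 1666.7 MB/s
```

A 3-day pass therefore averages roughly a sixth of the rate a 12-hour pass would, which is consistent with a bottleneck somewhere between the controller and the disks, and a 1-hour pass over 6TB is far beyond what a single spinning disk can deliver.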
January 9, 2024 Community Expert
And parity rebuild will tell us if things are working correctly without affecting any data disks. Probably you still have multiple connection problems.

1 hour ago, trurl said:
Syslog seems to indicate multiple disk problems still.

Did you do anything about that?
January 9, 2024 Author
1 minute ago, trurl said:
As mentioned, New Config is the likely way forward. This means you have to keep current disk1 and all other disks assigned as is, and rebuild both parity. Parity in its current state can't rebuild disk1, so you have to hope for the best with its current contents. If it really is unmountable we can try to repair its filesystem after parity rebuild. You must have port multipliers if parity takes so long on 6TB.

Regarding a new config and rebuilding both parity disks, how do I go about that with the existing Disk 1 being unmountable?

Regarding the 3-day rebuild: is it because the three MD1200s are connected in series? The 9207-8e has 2 ports, so I could connect one of the 3 drive units directly. Would that help at all?

9 minutes ago, trurl said:
And parity rebuild will tell us if things are working correctly without affecting any data disks. Probably you still have multiple connection problems. Did you do anything about that?

I have not done anything about this because I am not sure what is wrong here or which disks are affected.
January 9, 2024 Community Expert
8 minutes ago, bradgoldring said:
Regarding a new config and rebuilding both parity disks, how do I go about that with the existing Disk 1 being unmountable?

New config will use the actual disk1 rather than trying to emulate it. It cannot be emulated, and the actual disk1 is hopefully fine.
January 9, 2024 Community Expert
New Config accepts all assigned disks into the array exactly as they are, unmountable or not. The only thing it will do is make them all enabled again exactly as they are, and (optionally, by default) rebuild parity based on the contents of all the assigned disks. And you do want to let it rebuild parity.

It's not clear that physical disk1 is actually unmountable anyway. It is being emulated by parity, and that is likely not working well since parity is probably not valid. If physical disk1 is unmountable we can try to repair its filesystem after successfully rebuilding parity.

There are probably still some hardware problems to work through before we successfully rebuild parity, but trying will make that apparent and give us some idea of what needs to be fixed.
January 9, 2024 Author
Just to confirm before I click "Apply", this is correct:

Also, Disk 1 was not being emulated by parity when I just had the array started, so I assume the Parity 1 disk is no good, going by that error count. Thanks!
January 9, 2024 Community Expert
Yes, click apply and then check "parity is already valid" before starting the array.
January 9, 2024 Community Expert
5 minutes ago, bradgoldring said:
Disk 1 was not being emulated by parity

If it was disabled then it was being emulated by parity1, which was almost certainly incorrect. So, the fact that it was showing as unmountable was referring to the emulated disk1.
January 9, 2024 Community Expert
2 minutes ago, JorgeB said:
check "parity is already valid" before start the array.

Why?
January 9, 2024 Community Expert
It should be mostly valid. Then run a check; if there are many errors, he can always re-sync instead.