Drive was just randomly disabled - is there a way to compare the emulated drive to the physical drive?



Ok, I just had a drive randomly throw errors and get disabled in Unraid. I will be rebooting the server shortly and checking the SMART report, but I'm guessing it is a cable issue since I changed some things around in the case the other day.

 

Here is the log from when it failed. The parity check tuning messages went on for a bit before the failure, and naturally the drive errors went on a lot longer. Is it possible the parity check tuning caused an issue? I have not seen that message spammed before.

 

Oct 25 13:05:01 NAS parity.check.tuning.php: DEBUG: No array operation currently in progress
Oct 25 13:05:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Oct 25 13:10:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Oct 25 13:10:01 NAS parity.check.tuning.php: DEBUG: No array operation currently in progress
Oct 25 13:10:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Oct 25 13:15:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Oct 25 13:15:01 NAS parity.check.tuning.php: DEBUG: No array operation currently in progress
Oct 25 13:15:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Oct 25 13:20:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR start------
Oct 25 13:20:01 NAS parity.check.tuning.php: DEBUG: No array operation currently in progress
Oct 25 13:20:01 NAS parity.check.tuning.php: DEBUG: -----------MONITOR end-------
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#37 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#37 CDB: opcode=0x88 88 00 00 00 00 00 38 26 11 50 00 00 00 20 00 00
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942018896 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018832
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#33 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018840
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#33 CDB: opcode=0x88 88 00 00 00 00 00 38 26 0f 50 00 00 02 00 00 00
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018848
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#91 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018856
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942018384 op 0x0:(READ) flags 0x0 phys_seg 64 prio class 0
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#91 CDB: opcode=0x88 88 00 00 00 00 00 38 26 0d 50 00 00 02 00 00 00
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942017872 op 0x0:(READ) flags 0x0 phys_seg 64 prio class 0
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018320

My question is: I don't want to rebuild the 12TB drive, as that will take 24 hours. I know what data was being worked with when it was disabled, and it can be replaced; I just need to know exactly which files need to be replaced.

 

Is it possible to compare the contents of the emulated drive to the physical drive?

Link to comment

The fact that it is out of sync is why it is disabled.

 

Unraid disables a disk when a write to it fails. It won't use the disk again until it is rebuilt. The disk is emulated from all of the other disks, and this emulation includes writes to the emulated disk. The initial failed write, and any subsequent writes to the emulated disk, update parity just as if the disk had been written, so all of those writes can be recovered by rebuilding. So, as you can see, the disk is out of sync with parity from the moment it is disabled.
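To make that concrete, here is a toy single-parity (XOR) sketch you can paste into a shell; the byte values are made up purely for illustration:

# Toy single-parity (XOR) example with made-up byte values
d1=$(( 0xA5 )); d2=$(( 0x3C )); d3=$(( 0x5F ))
p=$(( d1 ^ d2 ^ d3 ))                      # parity as stored on the parity disk
d3_emulated=$(( p ^ d1 ^ d2 ))             # disk3 disabled: its data is recomputed from parity plus the other data disks
printf 'original d3=0x%02X  emulated d3=0x%02X\n' "$d3" "$d3_emulated"
d3_new=$(( 0x11 ))                         # a write to the emulated disk3 only updates parity...
p=$(( p ^ d3 ^ d3_new ))
printf 'rebuilt d3 from updated parity: 0x%02X\n' "$(( p ^ d1 ^ d2 ))"   # ...so the rebuild recovers the new data

The real array does this per block across all the disks, but the principle is the same.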

Link to comment
2 minutes ago, trurl said:

Since you were apparently having some hardware issue that caused it to be disabled, diagnostics might give a better idea of what needs to be fixed so the rebuild will be successful.

I sent a PM with the post-reboot diagnostics file. I have the server set up to save the diagnostics file on shutdown, but does it anonymize it when it does this?

 

Got to run an errand, but I'm pretty sure it was just a cable issue, as I made quite a few changes inside the case a few days ago (added a 3.5" drive cage, and it needed a bunch of changes in the case to make it fit).

Link to comment

SMART looks OK but doesn't look like an extended SMART test has ever been done on that disk.
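If you want to run one from the console, it would be something along these lines (assuming the disk is still sde after the reboot; substitute the correct device):

smartctl -t long /dev/sde        # start the extended self-test; it runs inside the drive and takes many hours on a 12TB disk
smartctl -l selftest /dev/sde    # check progress and the result once it has had time to finish

The array can stay in use while it runs, though heavy IO will slow the test down.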

8 minutes ago, TexasUnraid said:

pretty sure it was just a cable issue as I made quite a few changes inside the case a few days ago

This seems likely. From syslog it looks like a connection problem to me. You should always double check all connections, all disks, power and SATA, including splitters, any time you are mucking about inside.

 

Here are some relevant excerpts from syslog:

Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#37 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#37 CDB: opcode=0x88 88 00 00 00 00 00 38 26 11 50 00 00 00 20 00 00
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942018896 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018832
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#33 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018840
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#33 CDB: opcode=0x88 88 00 00 00 00 00 38 26 0f 50 00 00 02 00 00 00
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018848
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#91 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018856
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942018384 op 0x0:(READ) flags 0x0 phys_seg 64 prio class 0
Oct 25 13:21:44 NAS kernel: sd 1:1:3:0: [sde] tag#91 CDB: opcode=0x88 88 00 00 00 00 00 38 26 0d 50 00 00 02 00 00 00
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942017872 op 0x0:(READ) flags 0x0 phys_seg 64 prio class 0
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018320
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942017808
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018328
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942017816
Oct 25 13:21:44 NAS kernel: blk_update_request: I/O error, dev sde, sector 942017840 op 0x0:(READ) flags 0x0 phys_seg 4 prio class 0
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942018336
...
Oct 25 13:21:44 NAS kernel: md: disk3 read error, sector=942011224
Oct 25 13:21:44 NAS rc.diskinfo[12723]: SIGHUP received, forcing refresh of disks info.
Oct 25 13:21:44 NAS kernel: md: disk3 write error, sector=942022904
Oct 25 13:21:44 NAS kernel: md: disk3 write error, sector=942022912
Oct 25 13:21:44 NAS kernel: md: disk3 write error, sector=942022920
Oct 25 13:21:44 NAS kernel: md: disk3 write error, sector=942022928

It looks like it may have been read failures that really started it. When Unraid can't read a disk it will try to write the emulated data back to it and if that write fails then the disk gets disabled.

 

It would be mostly guesswork, but the disk may not be very far out of sync if nothing was really writing to it and the emulated data is just what was already on the disk.

 

If you want to take a chance, you could unassign the disk and then mount it read-only as an Unassigned Device to check its contents. If it seems to be OK you could New Config / Trust Parity. If it doesn't look OK then you could reassign and rebuild. In any case a parity check should be done, either to confirm it wasn't out of sync or to confirm the rebuild went well.
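As a rough sketch of the comparison itself, once the physical disk is mounted read-only by Unassigned Devices (the mount point under /mnt/disks/ depends on the disk's label, so treat that path as a placeholder):

rsync -rvnc /mnt/disks/DISK_LABEL/ /mnt/disk3/   # -n dry run: nothing is copied, it only lists differences
# The output lists files that are missing or different on the emulated disk3 compared to the physical disk.
# -c compares by checksum, which reads every file on both disks and is slow on a 12TB disk;
# drop -c to compare by size and timestamp only, which is much faster and usually enough here.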

 

If you really want to take a chance you could even postpone that so you can use the server, but if anything is out-of-sync then rebuilding a real failure could be compromised.

 

Do you have good (enough) backups?

Link to comment

You should have been notified that the disk had been disabled, as long as you have notifications enabled - did this happen?
 

Just an FYI: the Parity Tuning plugin will not directly cause an issue in a scenario like this, as it will only attempt to pause or resume an operation that was already in progress. In fact, in the log snippet posted it was idling and taking no action.

 

Thinking about it, I could consider adding a feature to the Parity Tuning plugin where it halts any array operation if an array disk becomes disabled while the operation is running. I would be interested to hear if anyone thinks this might be of use. If so, should the operation be cancelled, or merely paused indefinitely so that the user can cancel it? Feedback is welcomed. This would be quite easy to implement once I have a clear idea of exactly what is wanted.

Link to comment
54 minutes ago, trurl said:

SMART looks OK but doesn't look like an extended SMART test has ever been done on that disk.

This seems likely. From syslog it looks like a connection problem to me. You should always double check all connections, all disks, power and SATA, including splitters, any time you are mucking about inside.

 


It looks like it may have been read failures that really started it. When Unraid can't read a disk it will try to write the emulated data back to it and if that write fails then the disk gets disabled.

 

It would be mostly guesswork, but the disk may not be very far out of sync if nothing was really writing to it and the emulated data is just what was already on the disk.

 

If you want to take a chance, you could unassign the disk and then mount it read-only as an Unassigned Device to check its contents. If it seems to be OK you could New Config / Trust Parity. If it doesn't look OK then you could reassign and rebuild. In any case a parity check should be done, either to confirm it wasn't out of sync or to confirm the rebuild went well.

 

If you really want to take a chance you could even postpone that so you can use the server, but if anything is out-of-sync then rebuilding a real failure could be compromised.

 

Do you have good (enough) backups?

Thanks for the info.

 

Yeah, I didn't run the extended test as it took overnight to run the last time I did it on a 12TB disk. These large disks are great on one hand, but it takes forever to do any kind of big task.

 

I did check all of the connections when I put it back together, and even now I didn't find any obvious cable issues with this drive (I did find a semi-loose cable on another drive that threw a UDMA error the other day, though). I just unplugged and replugged all the cables and am hoping it is good to go.

 

It has been running with this drive 24/7 for the past 5 months or so without an issue, so I'm pretty sure it is a cable issue, as I had to re-wire most of the case.

 

Yes, you are spot on, I was in the process of moving some data from disk 3 to disk 4 when the error happened. The data is just media files; I can easily replace them, I just have to know which ones need to be replaced. That's why I was hoping I could compare the emulated disk to the physical disk, but rebuilding might be best anyway since I would have to do a parity check either way.

 

This particular disk is basically only backed up to parity since it is just media that is replaceable (although not something I would want to have to do). I have all my important data on other drives, and that is backed up to a few different places, but I can't afford the cost of backup drives for the media.

Link to comment
4 minutes ago, itimpi said:

You should have been notified that the disk had been disabled, as long as you have notifications enabled - did this happen?
 

Just an FYI: the Parity Tuning plugin will not directly cause an issue in a scenario like this, as it will only attempt to pause or resume an operation that was already in progress. In fact, in the log snippet posted it was idling and taking no action.

 

Thinking about it, I could consider adding a feature to the Parity Tuning plugin where it halts any array operation if an array disk becomes disabled while the operation is running. I would be interested to hear if anyone thinks this might be of use. If so, should the operation be cancelled, or merely paused indefinitely so that the user can cancel it? Feedback is welcomed. This would be quite easy to implement once I have a clear idea of exactly what is wanted.

Yep, got an email about the issue and checked it right away.

 

I didn't think the parity tuning plugin was an issue, just found it odd that message was spammed just before the issue as I had not seen that happen before.

 

Given the choice, I would like to have the array set up so that if a disk was going to be disabled, all IO would be immediately halted and any pending operations written to some kind of log file.

 

The idea being that in a situation like this, where it was just a glitch in the system, things would not get out of sync and I could simply restart the server and, if the disk checked out, start up the array like normal. The log/temporary file could then be used to finish the operation that was pending when the drive was disabled. You can then run a parity check, of course.

 

Short of that, I still think I would prefer to have the array disabled completely should a disk be disabled with the option to re-enable it if I want.

 

For example, today this disk was disabled during a file move process; that process kept going after the drive was disabled and was thrashing the drives as it tried to emulate the missing disk and write to another disk. It had another ~400GB to go in the move; that is a lot of strain to put on the disks just before a rebuild operation, had I not been here to stop it.

 

Most important for me would be to stop any activity to the disks, and possibly shut down docker, until I can manually get on the server to sort things out. The last thing I would want to do is thrash the drives just before counting on them to rebuild a drive.
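Something along these lines is roughly what I had in mind, just a rough sketch run from cron or the User Scripts plugin (I'm assuming the disk states show up in /var/local/emhttp/disks.ini with DISK_DSBL for a disabled disk; I have not verified that on my version, so treat it as an assumption):

#!/bin/bash
# Rough sketch: stop all Docker containers if any array disk shows as disabled.
# Assumption: /var/local/emhttp/disks.ini lists per-disk status, with DISK_DSBL meaning disabled.
if grep -q 'DISK_DSBL' /var/local/emhttp/disks.ini 2>/dev/null; then
    logger -t disk-watch "Disabled array disk detected, stopping Docker containers"
    running=$(docker ps -q)
    [ -n "$running" ] && docker stop $running    # unquoted on purpose so each container ID is passed as a separate argument
fi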

Link to comment
1 minute ago, TexasUnraid said:

I didn't think the parity tuning plugin was an issue, just found it odd that message was spammed just before the issue as I had not seen that happen before.

Ok, but you only got those messages because you had the debug logging level active.   If that had not been the case none of the messages would have appeared.

 

In terms of the array behaviour you want, this is something that is way beyond what this plugin can achieve, and it may even be beyond what Limetech could achieve in any realistic manner.

Link to comment
Just now, itimpi said:

Ok, but you only got those messages because you had the debug logging level active.   If that had not been the case none of the messages would have appeared.

 

In terms of the array behaviour you want, this is something that is way beyond what this plugin can achieve, and it may even be beyond what Limetech could achieve in any realistic manner.

That is strange since I just checked and debug logging is disabled in the settings?

[screenshot of the Parity Check Tuning settings showing debug logging disabled]

 

Yeah, I didn't expect the plugin to be able to do the first part.

 

The second part, just straight up disabling all IO to the array (and maybe docker) on a drive failure, is what I was thinking might be possible with a plugin?

Link to comment

Ok, I guess I will start the rebuild process. Although I have not run a parity check in ~2 months and have added 2 drives in that time, is there any reason to suspect parity would not be in sync?

 

The process for a rebuild is still to:

 

remove drive from array

start array

stop array

add drive to array

start array?

 

Anything I should be aware of or do first? Still pretty new to unraid.

Link to comment

Thanks for the help everyone. I decided to just go ahead and rebuild the drive; I realized that, in theory, if there are any issues with the rebuild a BTRFS scrub should find them.
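(For anyone reading later, by a scrub I just mean the standard one, assuming disk3 is the BTRFS-formatted disk:)

btrfs scrub start /mnt/disk3    # starts in the background and verifies checksums on every block
btrfs scrub status /mnt/disk3   # shows progress and any checksum errors found so far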

 

It is rebuilding now. I did notice that it does not seem to be maxing out the speed of my drives like a parity check does; is that normal? Usually they start off around 200MB/s for a parity check, but the rebuild is starting off around 150MB/s. CPU usage is not the issue at ~20% usage.

 

Looking at the netdata stats, the IO is going up and down a lot; I assumed it would be like a parity check, with pretty much stable speeds slowly dropping over time.

 

Not a big deal but was just curious if that is normal or if there is an issue?

 

Edit: speeds seem to be getting better as things progress; guessing it was just some kind of disk access left over from the startup.

Link to comment
1 minute ago, Frank1940 said:

Depending on your hardware, 'watching' parity operations can slow parity operations down.  (It seems like the GUI can soak up quite a few CPU cycles...)

lol, a watched pot never boils, I get it.

 

I think it was just leftover boot up tasks that were slowing it down, after 15 mins or so it settled down and has been going at the expected speeds since.

 

The hardware is not fast and is about 8 years old, but it's fast enough. It is an old Dell OptiPlex 7010 motherboard I shoehorned into an ATX case. It's sitting at around ~15% usage for the rebuild alone, ~20-25% with the dockers running.

Link to comment
4 hours ago, TexasUnraid said:

I would like to have the array set up so that if a disk was going to be disabled, all IO would be immediately halted

As already explained, the disk is disabled because it is out of sync. Typically a failed write will update parity anyway so no data is lost. The whole point of parity is so things can continue to operate when a disk gets "kicked out". 

Link to comment

Ok, so I rebuilt the drive and everything worked fine for a few days. Then I decided to do another file transfer like the one I was doing last time, and the same disk dropped out again.

 

In both cases, I noticed that it seemed to happen during a file transfer from disk to disk while other stuff was also going on with the array.

 

It is really strange since this drive has been working fine for the last 6 months in exactly the same setup, same cable, same card, same port.

 

The error is the same thing as before but with different sectors:

 

Oct 28 19:48:20 NAS kernel: sd 1:1:3:0: [sde] tag#938 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=9s
Oct 28 19:48:20 NAS kernel: sd 1:1:3:0: [sde] tag#938 CDB: opcode=0x88 88 00 00 00 00 03 da 3d 15 a8 00 00 00 68 00 00
Oct 28 19:48:20 NAS kernel: blk_update_request: I/O error, dev sde, sector 16546338216 op 0x0:(READ) flags 0x0 phys_seg 13 prio class 0
Oct 28 19:48:20 NAS kernel: md: disk3 read error, sector=16546338152
Oct 28 19:48:20 NAS kernel: md: disk3 read error, sector=16546338160

I can RMA the drive (WD 12TB white-label shucked drive), but that takes time and there is zero evidence that there is something wrong with the drive, so I'm not even sure an RMA claim would be approved.

Link to comment
12 minutes ago, trurl said:

This suggests some problem with power or maybe controller.

It has a 320w gold power supply and rarely pulls over 150w from the wall according to the watt meter. The cable that this drive plugs into also powers 2 other array drives; I suppose it could just be this connector, but that seems kind of strange. I suppose I could swap the power splitter from those drives and see if things improve; it is the good kind without the molded ends.

 

I am using an HBA and the breakout cable connects to 4 array drives. It has also never given me any issues in the past and the cable was new 6 months ago.

 

The drive seems to have completely disappeared from Unraid, which is interesting since I can't even get a SMART report (same as before); it should come back when I reboot though.

 

Thinking I will replace the power splitter and swap this drive with another one, and if the issue happens again I will know whether it is the drive or the connections.

Link to comment

Is the +12V bus a single rail or a double rail? 320W sounds a bit on the low side. Remember, PSUs don't often have problems with the average load; it is the instantaneous peaks which cause the problems, so your watt-hour meter is relatively useless in this case. Hard disks will pull 2-to-3 amperes of surge current when they start up. Modern PSUs will shut down if the load on any bus exceeds the specified maximum or if the total pull exceeds the PSU rating. (As a point of reference, both of my systems -- as listed in my signature -- will run around 100 watts when everything is running full bore. The Folding at Home Docker will usually load the CPU pretty much to the limit.)
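As a rough back-of-the-envelope example (the drive count and surge figure here are assumptions, not measurements from your system):

awk 'BEGIN { drives=6; amps=2.5; volts=12; printf "~%d W on the +12V bus during a simultaneous spin-up\n", drives*amps*volts }'

That is on top of whatever the board and CPU are pulling at the same moment, which is why the rating matters even if the wall meter never shows it.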

Link to comment

It is a dual rail, but I have the drives split up between both rails, and the peak I have seen is around 175w with a handbrake encode + parity check going, IIRC.

 

It is highly unlikely it is a PSU problem though, since I have no idea how it could affect only 1 drive while everything else works fine when that same cable feeds 3 other drives.

 

To be honest, I have never found a need for oversized power supplies. My gaming rigs have run 400-450w units for years and would pull 300-350w from the wall and never gave me an issue. They have a lot more transient loads than this server as well. Of course I tend to buy gold or better power supplies, so that helps.

Link to comment
