Drive dropped out of BTRFS raid5 cache pool but Unraid doesn't know it and keeps trying to use it, giving errors?



Ran into a strange issue today. I am running beta 25 in case anyone cares.

 

I was checking the server after doing some work with it. Everything seemed fine, but then I noticed that one of my cache drives had many, many times the reads and writes of the other drives and was not reporting a temperature, as if it were asleep.

 

The pool still functioned fine and all the files seemed to be OK, but I then tried running a scrub and it found a bunch of uncorrectable errors, so I stopped it after a few seconds.

 

I checked the log and this seemed to be where it started:

Aug 21 19:33:59 NAS kernel: sd 7:1:5:0: [sdg] tag#934 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s
Aug 21 19:33:59 NAS kernel: sd 7:1:5:0: [sdg] tag#934 CDB: opcode=0x28 28 00 01 48 e2 a0 00 00 78 00
Aug 21 19:33:59 NAS kernel: blk_update_request: I/O error, dev sdg, sector 21553824 op 0x0:(READ) flags 0x80700 phys_seg 15 prio class 0

This is also spammed in the log:

Aug 21 19:34:04 NAS kernel: btrfs_end_super_write: 16 callbacks suppressed
Aug 21 19:34:04 NAS kernel: BTRFS warning (device dm-3): lost page write due to IO error on /dev/mapper/sdg1 (-5)
Aug 21 19:34:04 NAS kernel: BTRFS error (device dm-3): bdev /dev/mapper/sdg1 errs: wr 62, rd 35, flush 26, corrupt 0, gen 0
Aug 21 19:34:04 NAS kernel: BTRFS error (device dm-3): error writing primary super block to device 3

This all seems to have started about an hour ago. I am copying the data to another drive now; luckily everything could be replaced if needed, but I would really prefer not to.

 

I also noticed that the sdg drive suddenly showed up in unassigned devices with a different drive letter, even though the same drive is still listed as being part of the pool?

 

I am confused; it is set up in RAID 5, so shouldn't it auto-correct for errors with a drive? I am guessing that is what is happening and why I am still able to read the data?

 

Along those lines, why is the scrub failing? I would think it would fix errors using another drive? It is like it is not detecting a "failed drive" (I'm pretty darn sure the drive is fine and it was just a hiccup of some kind, since it is back in unassigned devices).

 

I had a few files on the pool that had known MD5 values and they still check out now, but they were also on here for a long time; I am not sure about the files that were put on it since this issue started.

 

I am going to copy the data to the array tonight; I am worried that if I stop the array it might not restart. In the morning I will check all the cables and restart the server.

 

What are the next steps in this situation?

Link to comment
9 hours ago, TexasUnraid said:

I also noticed that the sdg drive suddenly showed up in unassigned devices with a different drive letter, even though the same drive is still listed as being part of the pool?

 

I am confused; it is set up in RAID 5, so shouldn't it auto-correct for errors with a drive? I am guessing that is what is happening and why I am still able to read the data?

 

Along those lines, why is the scrub failing? I would think it would fix errors using another drive? It is like it is not detecting a "failed drive" (I'm pretty darn sure the drive is fine and it was just a hiccup of some kind, since it is back in unassigned devices).

 

I had a few files on the pool that had known MD5 values and they still check out now, but they were also on here for a long time; I am not sure about the files that were put on it since this issue started.

 

I am going to copy the data to the array tonight; I am worried that if I stop the array it might not restart. In the morning I will check all the cables and restart the server.

 

What are the next steps in this situation?

Some pointers:

  • That is a typical symptom of a drive dropping offline and then reconnecting. That is usually a cable problem. Check your power and data cables.
  • You misunderstand a few things.
    • A device that disconnects from the pool will not be automatically re-added (because trying to do so blindly risks corrupting your data beyond repair). A reboot is the easiest way for the dropped device to be added back automatically (assuming you haven't wiped the "unassigned" one).
    • There is no such thing as "just a hiccup". If a drive drops offline, the pool is degraded. It's kinda similar to the array: if a drive drops offline, you can't just add it back; you have to rebuild it.
    • The list of devices in a pool is a configuration list and not a connected device list. Unraid will alert you something is wrong with the pool but won't just remove the device from the list.
    • Functioning is different from repairing. You seem to lump both into "auto-correct".
      • A RAID-5 pool will function (i.e. all data is readable and writable, including data written after the failure) with a single failed device but in a degraded state.
      • It can't repair while in degraded state (because there is insufficient data to repair). Scrub = repair.
  • My tried-and-tested procedure when a disk drops offline from the pool is as follows. I have had to do this several times now due to a bad power cable, and it has always worked.
    • Shut down as soon as possible to reduce risk of additional failure
    • Check cable and reboot
    • Once booted, check that the dropped device is no longer unassigned but in the pool i.e. online (e.g. easiest way is to see if temperature shows up)
    • Run scrub (with correction; see the command sketch after this list)
  • Besides what Johnnie linked above, you can also have a look at my post here, to which I added a few more tips.
    • Before you run any balance, make sure scrub has completed successfully first.
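
As a rough illustration (not an official Unraid procedure, and assuming the pool is mounted at /mnt/cache; substitute your own pool path), the post-reboot check and correcting scrub boil down to something like:

btrfs filesystem show /mnt/cache    # every pool member should be listed, none missing
btrfs device stats /mnt/cache       # per-device error counters accumulated before the reboot
btrfs scrub start /mnt/cache        # scrub repairs correctable errors by default
btrfs scrub status /mnt/cache       # progress and the final error summary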
Edited by testdasi
  • Like 1
Link to comment

Thanks for the info.

 

To clarify, I actually am not surprised the drive was not added back to the pool; that makes sense (also, it only showed up in unassigned devices when I was almost done writing the OP).

 

What surprised me was that I did not get any kind of notification/email, or even an error on the dashboard/array tab of Unraid, that something was wrong. If I had not just happened to notice that one drive was not showing the temperature and investigated, it could have stayed this way forever, leaving the data at risk should another drive fail.

 

That is worrying to say the least. It seems like an issue like this should trigger some kind of warning that would be emailed to me, or at least a notification?

 

The other surprise was that it could not correct for the missing drive during the scrub. I guess that makes sense; I am guessing I would need to run a balance to correct things if a drive truly died? I think I was confusing the roles that scrub and balance would play in this scenario.

 

About to reboot the server now, I also think things will be ok once rebooted.

Edited by TexasUnraid
Link to comment
7 minutes ago, TexasUnraid said:

What surprised me was that I did not get any kind of notification/email, or even an error on the dashboard/array tab of Unraid, that something was wrong. If I had not just happened to notice that one drive was not showing the temperature and investigated, it could have stayed this way forever, leaving the data at risk should another drive fail.

 

The other surprise was that it could not correct for the missing drive during the scrub. I guess that makes sense; I am guessing I would need to run a balance to correct things if a drive truly died?

I received a notification in my case (and a red box on the GUI saying the device had failed), so I think you should check your notification settings.

 

You have to have all devices connected during scrub for correction to happen (i.e. you will have to reboot).

 

If a drive truly died, I would suggest posting here with diagnostics and waiting for Johnnie to respond with the next steps.

Link to comment

Yeah, that's what I would have expected, but in my case the drive was still green in the GUI with no notification or warning of any sort except in the log file. That is what concerns me the most.

 

I have all the notifications enabled in the settings menu, and warnings/errors are set to email me.

 

The drive appears to be OK in this case; outside of the CRC error count increasing by 2 (it looks like the SATA cable is to blame), the SMART check is good. I was asking more for future reference, in case a drive truly died. It was added back to the pool and it started up fine, but when I scrubbed it, it still found uncorrectable errors?

 

I think I am going to move the data off the cache, then reformat and recreate the cache pool and see what happens, since I only have a little data left on it at this point.

Link to comment

It was giving me some funny messages in the logs but after formatting all the drives and recreating the cache pool, everything seems to be working properly again. Going to move the data back onto the cache and run a scrub now.

 

Kinda sad, I was trying to see how long I could get the uptime too; I was at 20 days and aiming for a month, as Windows would generally need a reboot before a month lol. Oh well, time for round 2.

Edited by TexasUnraid
Link to comment
16 hours ago, TexasUnraid said:

Yeah, that's what I would have expected, but in my case the drive was still green in the GUI with no notification or warning of any sort except in the log file. That is what concerns me the most.

You won't get a notification for most issues with a pool; that's why it's important to run a monitoring script like the one explained in the link above.
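
Not the exact script from the linked post, just a minimal sketch of the same idea: read the pool's btrfs error counters and send a notification if any are non-zero. The mount point, the notify helper path, and its flags below are assumptions; adjust them for your setup and Unraid version, and run the script on a schedule (e.g. via the User Scripts plugin).

#!/bin/bash
# Count btrfs device-stat lines whose error counter is not zero (pool assumed at /mnt/cache).
errors=$(btrfs device stats /mnt/cache | grep -vc ' 0$')
if [ "$errors" -gt 0 ]; then
    # Assumption: the notify helper path and flags can differ between Unraid versions.
    /usr/local/emhttp/webGui/scripts/notify -i warning -s "btrfs pool errors" -d "btrfs device stats reports $errors non-zero error counter(s) on /mnt/cache"
fi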

  • Thanks 1
Link to comment

Yeah, I just assumed this would be a core feature of a storage-based operating system. It really shocked me that it wasn't; I never even considered needing to set up a manual monitoring script to handle this.

 

I do have that script running now and it appears to be working. If this will not be included with Unraid, it should really be made clear that it is up to the end user to find a way to monitor for issues, IMHO.

Link to comment

I use BTRFS for all my drives, and I thought that if any errors were found they would be reported automatically, but I now know that is not the case.

 

Will that script also detect or report errors found on a single BTRFS drive from something like bit rot or a general read error?

 

I noticed when I was scrubbing/testing the cache after rebooting that the log showed some files that were not read correctly/had errors. This error detection is one of the main reasons I use btrfs for all my drives (the main reason being snapshots).

 

I would like to be notified if this happens, or any other error for that matter. Is it possible to expand the script to also monitor the array drives / extra cache pools for these types of errors? Or will they show up with the existing script, assuming I edit the path to the array?

Link to comment
15 hours ago, TexasUnraid said:

I use BTRFS for all my drives, and I thought that if any errors were found they would be reported automatically, but I now know that is not the case.

Any read/write errors for array devices will be reported by the standard Unraid notifications, but won't be reported for any pool member.

 

15 hours ago, TexasUnraid said:

Will that script also detect or report errors found on a single BTRFS drive from something like bit rot or a general read error?

Yes.

 

15 hours ago, TexasUnraid said:

I noticed when I was scrubbing/testing the cache after rebooting that the log showed some files that were not read correctly/had errors. This error detection is one of the main reasons I use btrfs for all my drives (the main reason being snapshots).

If a device drops from a redundant pool and you reboot (bringing the device back online), you can see corrected errors in the log, since btrfs will auto-correct any error it detects during a read. It's still best to run a scrub so that all corrections are done ASAP, in case another device drops/fails, and also to check that all errors were corrected.

 

15 hours ago, TexasUnraid said:

Is it possible to expand the script to also monitor the array drives / extra cache pools for these types of errors?

Yes, just add the other mount paths; I have a script monitoring all my pools. For array devices I don't see the need, since you'll be warned about any read/write error by Unraid, though the script would also warn if any corruption error is found; that would also be noticed as an I/O error (when trying to access/copy that file).
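
For example, a sketch of the same check looped over several pools (the mount points below are hypothetical; substitute your own):

for pool in /mnt/cache /mnt/cache2 /mnt/downloads; do
    # Skip anything not currently mounted, then count non-zero btrfs error counters.
    if mountpoint -q "$pool" && [ "$(btrfs device stats "$pool" | grep -vc ' 0$')" -gt 0 ]; then
        echo "btrfs errors detected on $pool"   # hook your preferred notification here
    fi
done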

Link to comment

Wait, so the mechanism for monitoring BTRFS filesystems is present in Unraid but simply not active for cache pools? 🤨🤨

 

Thanks for the information, think I have a better handle on this now.

 

The main syslog is where any errors would show up, correct? Including the list of any particular files that had errors?

Link to comment
9 minutes ago, TexasUnraid said:

Wait, so the mechanism for monitoring BTRFS filesystems is present in Unraid but simply not active for cache pools?

No. Any read/write errors on array devices are reported by the md driver and can be monitored independent of the filesystem used. Cache devices and pools don't use the md driver, so any errors are not monitored/reported in the GUI; if you notice, there is also an error column for cache devices, but it always shows 0 even if there were errors, and again this is filesystem independent. I made a request some years ago that the error column could be used to at least report errors for btrfs devices/pools, since those can easily be obtained by reading the output of "btrfs dev stats", but it has not been implemented so far.
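
For reference, that output looks something like the below; the device name and the non-zero counters here simply mirror the ones from the log excerpt earlier in this thread:

# btrfs device stats /mnt/cache
[/dev/mapper/sdg1].write_io_errs    62
[/dev/mapper/sdg1].read_io_errs     35
[/dev/mapper/sdg1].flush_io_errs    26
[/dev/mapper/sdg1].corruption_errs  0
[/dev/mapper/sdg1].generation_errs  0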

Link to comment

I see, so the array errors are monitored via a different method, but it would still catch the btrfs errors from, say, bit rot?

 

I am really shocked that the cache does not have any kind of error monitoring; that seems like such a core feature to leave out. It seems like that should be pushed up the priorities a bit, with multiple cache pools being a thing now. Until then, I am going to get that script running on all my pools.

 

Looking forward to having multiple arrays at some point as well; I have already figured out a use for that. I have some old drives I would like to use for unimportant things but don't want them added to the main pool, as I want to keep it "clean" since it holds my important data, and rebuilding parity is a real pain when adding/removing drives regularly like I would with a second array.

 

For now I have resorted to using a separate cache pool for each drive, as they are so mixed in sizes and speeds that it does not make sense to combine them into one pool.

Link to comment
7 minutes ago, TexasUnraid said:

I see, so the array errors are monitored via a different method, but it would still catch the btrfs errors from, say, bit rot?

Not directly. If you want to monitor that you can also use the script; I don't, because bit rot is extremely rare and it would be detected on the next parity sync (by there being some unexpected sync errors, which you could then investigate) or when you next tried to access that file. If you try to access any file and btrfs detects it is corrupt, you'll get an I/O error, i.e., the file will fail to read/copy. That is normal behavior with any checksummed filesystem, so you're not unknowingly fed corrupt data but are alerted that there is a problem; the checksum failure would be logged in the syslog.

Link to comment

OK, that is what I was wanting to know. I am not so much worried about bit rot, but that type of error can also present itself with many other issues, basically any time the data reaching the system is not what is supposed to be on the disk. The cable, controller, RAM, CPU, etc. could all cause this.

 

So you are saying that Unraid would catch these errors (I am not worried about scanning the drive for bit rot, just the checksum not matching, basically)?

 

How does the md? subsystem handle checksum matching if it is not using the BTRFS stats to find the errors?

Link to comment
19 minutes ago, TexasUnraid said:

So you are saying that Unraid would catch these errors (I am not worried about scanning the drive for bit rot, just the checksum not matching, basically)?

Btrfs itself would catch that and generate an I/O error, not Unraid or the md driver.

 

19 minutes ago, TexasUnraid said:

How does the md? subsystem handle checksum matching if it is not using the BTRFS stats to find the errors?

It doesn't; it's a btrfs feature. You'd get an I/O error, i.e., the file would fail to copy/read, and the reason would be logged in the syslog, e.g.:

 

Oct 6 09:27:42 Test kernel: BTRFS warning (device md1): csum failed ino 266 off 739328000 csum 1517421248 expected csum 4056982075
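
If you ever need to know which file a csum error refers to, the inode number in that line ("ino 266") can be resolved to a path on the mounted filesystem; a minimal example, with the mount point being illustrative for the md1 device above:

btrfs inspect-internal inode-resolve 266 /mnt/disk1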

 

  • Like 1
Link to comment

Ah, ok I think I am getting the picture now. Sorry for all the questions.

 

In the above example, on a single-drive BTRFS setup, would it allow you to still access whatever is left of the file, or does it flat-out refuse access to it?

 

For example, say a picture/video doesn't match the checksum; there is still a good chance it is viewable, just distorted in some way.

Link to comment
8 minutes ago, TexasUnraid said:

In the above example, on a single-drive BTRFS setup, would it allow you to still access whatever is left of the file, or does it flat-out refuse access to it?

If it was, for example, a media file, it would read normally until the error and then stop. But if you still wanted to keep it, say you didn't have a backup and there would probably just be a small glitch on playback, you could use btrfs restore to copy the file, since it doesn't validate checksums.
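
A minimal sketch of that, assuming the damaged filesystem is on /dev/sdg1 and is not mounted (btrfs restore reads directly from the unmounted device; the destination path is just an example):

mkdir -p /mnt/recovered                      # destination on another, healthy disk (example path)
btrfs restore -v /dev/sdg1 /mnt/recovered    # copies files out without validating checksums

The --path-regex option can limit the restore to specific files rather than the whole filesystem.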

Link to comment
Just now, jonathanm said:

Best option is to recover from your backup. For family pictures, you do back up, right?

Of course. My backup plan for personal stuff is something like this:

 

Recycle bin (this has already saved me a few times lol) > daily snapshots (already saved me some headache here as well) > weekly cold storage backups > parity > every so often a backup to offsite > having to use corrupted files from a filesystem

 

I am just new to Linux filesystems and not sure what the recovery options are. I have had to dive into recovery software on Windows a few times over the years to get data back, and had good luck, so I was curious what options there are for btrfs.

  • Like 1
Link to comment
