[6.10.0-rc2] Parity errors and a missing disk



I'm having some problems with my server! I posted a couple of weeks ago about errors that occurred when running a parity check, but the situation has evolved a bit since then. Apologies for the bullet points, but I wanted to try and keep things clear and as brief as possible. I've attached the latest logs, although the server has been shut down since the last parity check.

  • Probably irrelevant, but in the few days prior to the problems being detected I have moved a lot of relatively large (1-3 GB) files around, mostly using MC through the terminal
  • About two weeks ago a parity check ran after I shut down my server to remove two drives (unassigned devices)
  • The check found lots of parity errors (about 650,000). The previous parity check (February) found no errors
  • All drives passed extended SMART checks with no errors. 2 drives have CRC errors from several months ago, none since changing cables/HBA
  • Unraid’s memtest wouldn’t run (when selected the machine restarted), but I downloaded memtest86 (free version, limited to 4 passes) to a different USB stick and it completed 4 passes with no errors
  • No power loss or crash-induced hard shutdowns since the last successful parity check, although I now realise Unraid’s ‘clean’ shutdown may not have been clean due to issues stopping the array - but it seems unlikely that a hard shutdown would cause so many parity issues
  • I copied the most important data off the array when parity check started showing errors
  • I used FreeFileSync to compare the copied data to partial backups – mostly identical. Some large (GoPro) video files were different, with the copy on the array corrupted. I looked at a few files on the array to see which disk they were on, and all I checked were on disk 1
  • Removed disk 1, started array with disk 1 emulated, copied off some of the files that were identified in the last step – ran a file comparison and they were identical to the files copied off after the parity errors started, which I assume means they were corrupt before the parity issues started (from a quick binary comparison it looks as though the files go blank about halfway through)
  • Reinstalled disk 1 but it didn’t come back into the array – I can assign it to the array, but it appears as a new disk, i.e. needs to be initialised
  • I didn’t make any changes to the array while I had disk 1 removed, but docker was running and may have made changes (although I have a cache disk and I think all shares on the array use it)
  • I’ve mounted disk 1 as an unassigned device and I’m copying all the data off it to a spare hard drive
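
The comparison steps above (checking copied files against a backup, then finding where a corrupted file starts to differ) can be sketched in Python. This is only an illustration and not what FreeFileSync does internally; the function names `compare_trees` and `first_difference`, and whatever paths you pass in, are hypothetical.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte videos need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(reference: Path, candidate: Path) -> dict[str, list[str]]:
    """Check every file under `reference` against the same relative path in `candidate`."""
    report: dict[str, list[str]] = {"identical": [], "different": [], "missing": []}
    for ref_file in sorted(p for p in reference.rglob("*") if p.is_file()):
        relative = ref_file.relative_to(reference)
        other = candidate / relative
        if not other.is_file():
            report["missing"].append(relative.as_posix())
        elif sha256_of(ref_file) == sha256_of(other):
            report["identical"].append(relative.as_posix())
        else:
            report["different"].append(relative.as_posix())
    return report


def first_difference(a: Path, b: Path, chunk_size: int = 1 << 20) -> int | None:
    """Return the byte offset where two files first differ, or None if identical."""
    offset = 0
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            chunk_a, chunk_b = fa.read(chunk_size), fb.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # One file ended early: they differ where the shorter one stops.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended without a mismatch
            offset += len(chunk_a)
```

Running `compare_trees` with the backup as `reference` and the array copy as `candidate` lists which files differ; `first_difference` on one of those pairs then shows roughly where the array copy stops matching, e.g. about halfway through the file.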

 

Questions:

  • What could have caused the parity errors? How can I prevent this occurring in the future?
  • Why is disk 1 seen as new? Is it because data was written to the array while it was simulated, or just because the array was started while it was missing?
  • Do I have any other sensible options other than assigning disk 1 to the array again then rebuilding?

 

Hardware

  • Intel i5-3570K
  • Gigabyte Z77-D3H
  • 9 GB RAM (Odd, I know, but I had a 1 GB stick of ECC RAM which I installed last year to see if the motherboard supports ECC and I never got round to removing it!)
  • 4x 8 TB Western Digital HDs for the array (1x parity)
  • 1x 500 GB Samsung Evo SSD (cache drive)
  • 2x Blu-ray drives
  • 1x Marvell HBA. I understand this can cause problems, but the array drives are all connected to the motherboard SATA controller. The Blu-ray drives and an unassigned drive (now removed) are connected to the HBA
  • I think that's everything!

 

tower-diagnostics-20220521-1015.zip

3 minutes ago, CurlyBen said:

What could have caused the parity errors?

The main suspect is usually the RAM, but it can also be a disk. From your description, it looks like all the corrupt files you found were on disk1?

 

If that's correct, I would start by doing a new config without it, then sync parity and run a couple of parity checks. Of course, if there's any new data on the emulated disk1 vs. the actual disk, you need to copy that first. If there are no errors, do another new config with disk1 back and re-sync parity; if the errors return, you've found the problem.

 

16 minutes ago, CurlyBen said:
  • Why is disk 1 seen as new? Is it because data was written to the array while it was simulated, or just because the array was started while it was missing?
  • Do I have any other sensible options other than assigning disk 1 to the array again then rebuilding?

Once a disk gets disabled it must be rebuilt, but if you do the above it's not needed for now.

43 minutes ago, JorgeB said:

The main suspect is usually the RAM, but it can also be a disk. From your description, it looks like all the corrupt files you found were on disk1?

  Yes, although with the caveat that I didn't check many files and it may have been coincidence (all the files were copied to the array at the same time and haven't been accessed since). I assume that whenever they were corrupted was prior to the last successful parity check, as the emulated files match the ones copied from the array before disk1 was removed. I'll do a little more digging and see if the errors are limited to disk1. It's perhaps also relevant to say that disk1 shouldn't have seen much write activity for the last few months, as my array was at about 90% capacity until I added another drive a few months ago. I didn't rebalance but files have slowly been removed from disk1 (and disk2) as I've moved stuff around.

43 minutes ago, JorgeB said:

If that's correct, I would start by doing a new config without it, then sync parity and run a couple of parity checks. Of course, if there's any new data on the emulated disk1 vs. the actual disk, you need to copy that first. If there are no errors, do another new config with disk1 back and re-sync parity; if the errors return, you've found the problem.

My concern with this - and it might be an issue with my understanding - is that, assuming disk1 is failing in some way which is causing the parity errors, those errors are potentially limited to disk1. At the moment, I can rebuild disk1 using the remaining disks in the array, but if I rebuild parity then I lose this ability. Is that correct?

1 minute ago, CurlyBen said:

At the moment, I can rebuild disk1 using the remaining disks in the array, but if I rebuild parity then I lose this ability. Is that correct?

Yes, but the emulated disk1 should have the same data as the actual disk1, assuming the parity check was correct and there's no new data added/changed since you unassigned it.

 

Alternatively, and if you have one, rebuild to a spare, then do a couple of parity checks to look for sync errors.

3 minutes ago, JorgeB said:

Yes, but the emulated disk1 should have the same data as the actual disk1, assuming the parity check was correct and there's no new data added/changed since you unassigned it.

 

Alternatively, and if you have one, rebuild to a spare, then do a couple of parity checks to look for sync errors.

I think I've misunderstood something - I thought your suggestion was to do a new config without disk1, i.e. parity, disk2, disk3, and rebuild parity - which would then lose the emulated disk1? Then add back disk1 after confirming there are no parity issues with those 3 disks? I don't have a spare disk at the moment, although I could buy one. I'm sure it will get filled at some point!

9 minutes ago, JorgeB said:

First please confirm if the actual disk1 has the same data as the emulated disk1, from what I understood it has, correct?

I believe so, but I haven't done a full comparison. I'm in the process of making a full copy of the data from the actual disk1 and I can probably find enough space to copy the data from the emulated disk1, so I'll do that and compare - it'll take a while though!


OK, assuming the data is the same, if you do a new config and keep the old disk1 intact you won't lose any data. Of course, the array will be unprotected until the sync is done, and disk1 won't be parity protected initially. Then, if there are no more sync errors, you could do another new config, this time including disk1, and re-sync parity again.

38 minutes ago, JorgeB said:

OK, assuming the data is the same, if you do a new config and keep the old disk1 intact you won't lose any data. Of course, the array will be unprotected until the sync is done, and disk1 won't be parity protected initially. Then, if there are no more sync errors, you could do another new config, this time including disk1, and re-sync parity again.

Thanks. Is it possible to browse the emulated disk1? I've just opened up mc and I can see disk2, disk3, user etc. but I'd like to be able to take a copy of just the emulated disk1. Is there a way to do that or do I have to figure it out from what's in user but not disk2 or disk3?

Just now, CurlyBen said:

Thanks. Is it possible to browse the emulated disk1? I've just opened up mc and I can see disk2, disk3, user etc. but I'd like to be able to take a copy of just the emulated disk1. Is there a way to do that or do I have to figure it out from what's in user but not disk2 or disk3?

If disk1 is being emulated then you should be able to see its contents just as if it were physically present. What does the Main tab show for disk1?

1 hour ago, itimpi said:

In that case you should be following this process to try and repair the file system on the emulated disk.

Thanks, I did that. Now I can see disk1 (emulated), but I have a new "lost+found" folder and some of the mount directories that are on the physical disk1 are missing on the emulated disk1.
This is turning into a bit of a nightmare, I'm starting to regret having a parity drive! At least that way I wouldn't know there's anything wrong...

29 minutes ago, CurlyBen said:

Thanks, I did that. Now I can see disk1 (emulated), but I have a new "lost+found" folder and some of the mount directories that are on the physical disk1 are missing on the emulated disk1.
This is turning into a bit of a nightmare, I'm starting to regret having a parity drive! At least that way I wouldn't know there's anything wrong...

That gives you the choice of using the emulated disk or the physical disk going forward (or some mixture of the two).


I've now copied all the data off the physical disk1 (although comparing it with what I copied off the array when problems started there seem to be some discrepancies... joy...). Is there any reason not to re-assign it to the array and start a rebuild? It potentially gives me a little redundancy while I try and figure what data is good and what's bad. I'll potentially then run a file integrity plugin so at least that way if anything else goes bad I can see what it is, as I still don't have any good indication of where the problem is.


I've rebuilt disk1 and I've now started a parity check. It looks as though approx. 2tb of data has been lost since disk1 was removed though, including some entire mounts. I think all that data is replicated elsewhere, but I'm more than a little nervous as I have absolutely no idea what's causing these problems!

On 5/23/2022 at 12:26 PM, CurlyBen said:

I've rebuilt disk1 and I've now started a parity check. It looks as though approx. 2tb of data has been lost since disk1 was removed though

 

This was expected from what you wrote above:

 

On 5/21/2022 at 5:18 PM, CurlyBen said:

I did that. Now I can see disk1 (emulated), but I have a new "lost+found" folder and some of the mount directories that are on the physical disk1 are missing on the emulated disk1.

 

That's why I asked if you copied everything you needed from the actual disk1 before rebuilding on top. If yes, you just need to copy the data over, though you still need to find what's making parity go out of sync and/or possibly corrupting data.

18 hours ago, AndrewZ said:

Is there now a Lost & Found directory on the rebuilt drive?

There is, but there's also about 2TB more free space than there was previously on the drive

 

17 hours ago, JorgeB said:

This was expected from what you wrote above:

Sorry, which bit? I'm not really clear why data loss was expected

17 hours ago, JorgeB said:

That's why I asked if you copied everything you needed from the actual disk1 before rebuilding on top. If yes, you just need to copy the data over, though you still need to find what's making parity go out of sync and/or possibly corrupting data.

Yes, I copied everything off the actual disk1. At this stage I don't think I've suffered any real unrecoverable data loss, but my array has lost 2tb of data and possibly corrupted more without any indication as to a cause. A server I can't trust is more or less useless. Can you suggest next steps for identifying the cause?

1 hour ago, CurlyBen said:

Sorry, which bit?

The part where the emulated disk had filesystem corruption, likely the result of parity not being 100% in sync.
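
To illustrate why out-of-sync parity can surface as corruption on an emulated disk: with single parity, a missing disk is reconstructed from the parity disk plus all the remaining data disks. The toy sketch below assumes a plain byte-wise XOR parity model and three made-up 10-byte "disks"; the names and data are invented for illustration and this is not Unraid's actual code.

```python
from __future__ import annotations

from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three made-up data "disks" of equal size (hypothetical contents).
disk1 = b"video-data"
disk2 = b"other-file"
disk3 = b"more-files"

parity = xor_parity([disk1, disk2, disk3])

# In-sync parity: the missing disk1 is reconstructed exactly from the rest.
emulated = xor_parity([parity, disk2, disk3])
assert emulated == disk1

# Out-of-sync parity: flip a single bit in the first parity byte ...
stale_parity = bytes([parity[0] ^ 0x01]) + parity[1:]
emulated_bad = xor_parity([stale_parity, disk2, disk3])

# ... and the same byte of the emulated disk1 comes back wrong,
# while the bytes covered by correct parity are still fine.
assert emulated_bad != disk1
assert emulated_bad[0] == disk1[0] ^ 0x01
assert emulated_bad[1:] == disk1[1:]
```

Under this model, every position where parity is wrong yields a wrong byte on the emulated disk. If some of those positions hold filesystem metadata rather than file contents, the emulated disk can end up needing a filesystem repair even though the physical disk was readable.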

 

1 hour ago, CurlyBen said:

Can you suggest next steps for identifying the cause?

As mentioned, this is usually a RAM problem. Unfortunately, memtest is only definitive if errors are found, not the other way around. But as mentioned, it could also be, for example, a disk issue. Those are much more difficult to diagnose, since basically you need to remove/replace one disk from the array at a time and run a couple of parity checks to see if it helped or not. That's why I suggested:

 

On 5/21/2022 at 10:52 AM, JorgeB said:

I would start by doing a new config without it, then sync parity and run a couple of parity checks. Of course, if there's any new data on the emulated disk1 vs. the actual disk, you need to copy that first. If there are no errors, do another new config with disk1 back and re-sync parity; if the errors return, you've found the problem.

 

On 5/25/2022 at 9:48 AM, JorgeB said:

The part where the emulated disk had filesystem corruption, likely the result of parity not being 100% in sync.

I'm about 95% sure the filesystem corruption didn't occur immediately - I think I was able to load some files from the emulated disk1 when I first started the array with disk1 removed. I could be wrong but I'm fairly confident that was the case

On 5/25/2022 at 9:48 AM, JorgeB said:

As mentioned, this is usually a RAM problem. Unfortunately, memtest is only definitive if errors are found, not the other way around. But as mentioned, it could also be, for example, a disk issue. Those are much more difficult to diagnose, since basically you need to remove/replace one disk from the array at a time and run a couple of parity checks to see if it helped or not. That's why I suggested:

 

I've run one parity check with all four drives in the array, no errors, and I'm about 75% through a second check with no errors so far. Are there any other steps that can help in the meantime? I'm considering running the Dynamix File Integrity plugin, although I've not yet read enough to fully understand how it works.

I may also bring forward my plans to upgrade my server's hardware. I don't really have the budget for it at the moment but I don't have time to be dealing with data loss either! Does ECC memory prevent issues like this? (Assuming it is the memory)

Thanks for all your help!

12 hours ago, CurlyBen said:

I've run one parity check with all four drives in the array, no errors, and I'm about 75% through a second check with no errors so far.

That's good news, but on the other hand might make it more difficult to find what the problem was before, if you can't reproduce it.

 

12 hours ago, CurlyBen said:

I'm considering running the Dynamix File Integrity plugin, although I've not yet read enough to fully understand how it works.

It's always a good idea to have checksums of your files. The plugin is a good option; you can also use, for example, corz to create them over the shares from Windows.
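
A minimal sketch of the checksum idea, assuming nothing about how the Dynamix File Integrity plugin or corz actually store their hashes: walk a share, record a SHA-256 per file, and later re-hash to see which files changed or vanished. The function names and the idea of keeping the manifest as a dict (which could be saved as JSON somewhere off the array) are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match their digest."""
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return problems
```

Building the manifest while the data is known to be good and verifying it after the next parity scare would at least narrow down which files were affected. Note that a mismatch alone cannot tell silent corruption apart from a legitimate edit to the file.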

 

12 hours ago, CurlyBen said:

Does ECC memory prevent issues like this?

I recommend ECC for anyone who cares about data integrity. IMHO, RAM issues are definitely the number one reason for data corruption.

