Random Unclean Shutdown today, now cache drives unmountable

omartian · June 4, 2021

I've been having some issues w/random parity sync errors over the last few weeks. I get anywhere from 1 to 9 parity errors on my non-correcting checks. I've re-seated the SATA cables and RAM and run memtest for > 24 hrs w/o errors, but I still get parity errors on checks.

I just finished a non-correcting check this morning and it found 1 error (see attached). The consensus on this forum was to keep my drives spinning at all times bc that might be what's contributing to the random errors I get on each parity check.

When I tried to access my server later today to do this, I couldn't reach it. I couldn't initiate a clean shutdown via the Linux terminal, so I had to reboot the system.

A non-correcting parity check just started, but I noticed an issue w/my dual cache drives. The Main page shows both cache drives as "unmountable" and asks if I want to format them and create a file system.

Is this a mobo issue, and could it be the underlying cause of the sync errors I've had over the last few weeks?

1 error noncorrect 6042021.zip
unclean shutdown, cache drives issue.zip

Edited June 5, 2021 by omartian

omartian · June 5, 2021

bump

JorgeB · June 5, 2021

9 hours ago, omartian said:

The consensus on this forum was to keep my drives spinning at all times bc that might be what's contributing to the random errors I get on each parity check.

I don't remember ever seeing that mentioned. Random sync errors are usually a hardware problem, most of the time RAM related, so start by running memtest.

Ideally after that is fixed you can see here for some recovery options for the cache pool.

omartian · June 5, 2021

6 hours ago, JorgeB said:

I don't remember ever seeing that mentioned. Random sync errors are usually a hardware problem, most of the time RAM related, so start by running memtest.

Ideally after that is fixed you can see here for some recovery options for the cache pool.

I ran a 10 pass memtest prior to the cache drive becoming unmountable with 0 errors.

Not sure why both cache drives went down at the exact same time. I reseated their cables and they are visible in bios.

I'll look at the recovery options but this has never happened in the 2 years I've been using unraid. I feel that this might be related to my parity errors.

Anything in the diagnostics that can identify the problem?

JorgeB · June 5, 2021

24 minutes ago, omartian said:

Not sure why both cache drives went down at the exact same time.

It's not a device problem, it's a filesystem problem. The corruption likely happened for the same reason you're getting sync errors.

26 minutes ago, omartian said:

I feel that this might be related to my parity errors.

Most likely.

26 minutes ago, omartian said:

Anything in the diagnostics that can identify the problem?

Not that I can see. You could try with just one DIMM at a time and see if the sync errors stop that way. Note that the first check after the problem is fixed may still find errors.

omartian · June 5, 2021

43 minutes ago, JorgeB said:

It's not a device problem, it's a filesystem problem. The corruption likely happened for the same reason you're getting sync errors.

Most likely.

Not that I can see. You could try with just one DIMM at a time and see if the sync errors stop that way. Note that the first check after the problem is fixed may still find errors.

Cache currently set to btrfs, data is xfs. Should I reformat Cache to xfs?

JorgeB · June 5, 2021

3 minutes ago, omartian said:

Should I reformat Cache to xfs?

You can, if you don't need a pool, but that's not the main issue. The main issue is finding out what is corrupting data.

omartian · June 5, 2021

27 minutes ago, JorgeB said:

You can, if you don't need a pool, but that's not the main issue. The main issue is finding out what is corrupting data.

Ok. So run memtest w 1 dimm for 24 hrs then run a non correcting check. If I get more than 1 error, repeat with other dimm?

If ram not the culprit, where do I look next?

JorgeB · June 5, 2021

1 minute ago, omartian said:

So run memtest w 1 dimm for 24 hrs then run a non correcting check. If I get more than 1 error, repeat with other dimm?

No. If memtest didn't detect errors with both DIMMs, it's very unlikely it will with just one. Remove one DIMM and run two consecutive parity checks. If the second one still finds errors, do the same with the other DIMM. If there are still errors with either one alone, I would try a different board/CPU next.
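
The isolation sequence above can be sketched as a small decision helper. This is only an illustration in Python: the function name `next_step`, the "A"/"B" DIMM labels, and the result strings are hypothetical, and the conclusion drawn from a clean result is an assumption rather than something stated in this thread.

```python
from typing import Optional


def next_step(dimm_a_errors: int, dimm_b_errors: Optional[int] = None) -> str:
    """Suggest what to do next from single-DIMM parity check results.

    Each argument is the sync-error count reported by the SECOND of two
    consecutive parity checks run with only that DIMM installed.
    Pass None for a DIMM that has not been tested alone yet.
    """
    if dimm_a_errors == 0:
        # Assumption (not stated in the thread): a clean second check
        # points toward the DIMM that was removed.
        return "clean with DIMM A alone: suspect the removed DIMM B"
    if dimm_b_errors is None:
        return "still errors: repeat the two checks with DIMM B alone"
    if dimm_b_errors == 0:
        # Same assumption as above, with the DIMMs swapped.
        return "clean with DIMM B alone: suspect the removed DIMM A"
    return "errors with either DIMM alone: try a different board/CPU"


if __name__ == "__main__":
    print(next_step(3))      # only DIMM A tested so far
    print(next_step(3, 2))   # both DIMMs tested alone
```

Only the second check of each pair is used as input, since the first check after a change may still find leftover errors.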

trurl · June 5, 2021

You didn't mention ever running a correcting check. When parity errors exist, you must run a correcting check or you will still have parity errors since they haven't been corrected.

The usual advice is to run non-correcting checks and then when you discover you have parity errors, and you are confident there isn't some other problem causing them, to run a correcting check to fix the parity errors, and then follow that with a non-correcting check to confirm that there are no longer any parity errors.

An unclean shutdown will often result in some parity errors, and those need to be corrected.
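
As a rough sketch of that usual advice, the follow-up after a non-correcting check could be written like this in Python. The function name `checks_to_run` and the step strings are hypothetical and only mirror the wording above.

```python
from typing import List


def checks_to_run(errors_found: int, other_problem_suspected: bool) -> List[str]:
    """Return the follow-up steps after a non-correcting parity check."""
    if errors_found == 0:
        # Nothing was reported, so there is nothing to correct.
        return []
    if other_problem_suspected:
        # The advice is to be confident nothing else is causing the
        # errors before running a correcting check.
        return ["resolve the underlying problem first"]
    return [
        "correcting check",      # fixes the parity errors
        "non-correcting check",  # confirms there are no longer any errors
    ]


if __name__ == "__main__":
    print(checks_to_run(errors_found=1, other_problem_suspected=False))
```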

omartian · June 5, 2021

12 minutes ago, trurl said:

You didn't mention ever running a correcting check. When parity errors exist, you must run a correcting check or you will still have parity errors since they haven't been corrected.

The usual advice is to run non-correcting checks and then when you discover you have parity errors, and you are confident there isn't some other problem causing them, to run a correcting check to fix the parity errors, and then follow that with a non-correcting check to confirm that there are no longer any parity errors.

An unclean shutdown will often result in some parity errors, and those need to be corrected.

It's weird bc w every non correcting check, I get a different # of parity errors.

Now w both cache drives down, I'm thinking it's hardware.

So, if I understand this right, my course of action should be:

1. Get ssd back online

2. Run a non correcting check to get a baseline # of errors w/both dimm of ram in (since I never ran a correcting check)

3. Then run a correcting check

4. Make sure the # of errors corrected in step 3 matches up w/the # identified in step 2

5. Run scheduled non correcting checks monthly

6. if more errors identified, try taking out one dimm and repeat step 2&3

Does that sound ok?

Also what do you think about keeping disks spinning at all times?

Edited June 5, 2021 by omartian

omartian · June 5, 2021

1 hour ago, JorgeB said:

No. If memtest didn't detect errors with both DIMMs, it's very unlikely it will with just one. Remove one DIMM and run two consecutive parity checks. If the second one still finds errors, do the same with the other DIMM. If there are still errors with either one alone, I would try a different board/CPU next.

What are your thoughts on my post about the steps I should take above?

JorgeB · June 6, 2021

15 hours ago, omartian said:

2. Run a non correcting check to get a baseline # of errors w/both dimm of ram in (since I never ran a correcting check)

3. Then run a correcting check

You can either run a correcting check followed by a non-correcting one, in which case the 2nd one should result in 0 errors, or alternatively run two non-correcting checks and see if you get the same results. If the number of errors is the same but low, it's good to check that they are the same blocks.
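
To check whether two non-correcting checks hit the same blocks, something along these lines could work. It is a minimal Python sketch that assumes each sync error appears in the log as a line containing `sector=<number>`; that pattern, the sample lines, and the function names are assumptions for illustration and should be adjusted to whatever your syslog actually shows.

```python
import re

# Assumption: each sync error is logged on a line containing "sector=<number>".
SECTOR_RE = re.compile(r"sector=(\d+)")


def error_sectors(log_text):
    """Return the set of sector numbers mentioned in a check's log text."""
    return {int(m.group(1)) for m in SECTOR_RE.finditer(log_text)}


def compare_checks(first_log, second_log):
    """Compare the error sectors reported by two non-correcting checks."""
    first = error_sectors(first_log)
    second = error_sectors(second_log)
    return {
        "same_blocks": first == second,
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
    }


if __name__ == "__main__":
    # Hypothetical log excerpts, one per check run.
    run1 = "md: recovery thread: P incorrect, sector=1024\n"
    run2 = "md: recovery thread: P incorrect, sector=2048\n"
    print(compare_checks(run1, run2))
```

Matching sectors on both runs would be consistent with real parity errors that simply haven't been corrected yet, while different sectors on each run would fit the random corruption discussed above.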
