Thousands of errors on parity check

tr3bjockey · March 26, 2023

For the past month, I've been seeing thousands of errors reported whenever I click the parity check button.

Some history...

The IO CREST Internal 5 Port Non-Raid SATA III 6GB/s JMB585 SI-PEX40139 is installed, but no cables are plugged into it. (I stopped using it before 12/1/22 because I was still getting UDMA CRC errors pointing to bad cables, which were all replaced, and I never bothered removing it from the PCIe slot.)

I instead use the JEYI NVMe M.2 to 5 Sata Adapter, Internal 5 Port Non-RAID SATA III 6GB/s, plugged into a GLOTRENDS M.2 PCIe X1 Adapter. It was installed before 12/1/22 and it fixed the UDMA CRC errors I was seeing with the IO CREST card above. I even did a complete (100%) parity check on 12/23/22 and there were 0 errors. On 1/2/23, a parity check had started, but I cancelled it after 4 minutes. On 1/9/23, I ran another parity check for 10 hours (about 90% done), but I also cancelled it accidentally (I shut down without knowing a parity check was running). Still 0 errors. On 2/12/23 I ran a full parity check and got 2252 errors, which I did not notice in the parity check history. On 3/12/23, I noticed the 2252 errors in the history (this has never happened before in past years) and started a parity check again. Within 39 minutes, 672 errors were showing and I cancelled it.

I only use the server about once or twice a week for a couple of hours, so the SMART power-on total is about 7 months:

SMART attribute 9 (Power on hours): flag 0x0012, value 100, worst 100, threshold 000, type Old age, updated Always, failed Never, raw value 5159 (7m, 2d, 23h)

I recently upgraded Unraid from 6.9.2 to 6.10.3, and the errors might have started happening after that. If there's some log that has the date of the upgrade, it would be good to know if there's any correlation. Yes, I know correlation is not causation, lol.

I have attached tower diagnostics.

Questions:

1. Does running further parity checks with "write corrections to parity" enabled corrupt my currently stored data files further, or does it only affect the parity data stored for recovery? I.e., should I hold off on this until a cause is found?

2. Are the errors happening on the actual drives, or is something else failing?

3. Should I not be adding any more files to these hard drives?

4. Any suggestions for troubleshooting this?

tower-diagnostics-20230326-1515.zip

Edited March 26, 2023 by tr3bjockey
clarify udma errors, thought of additional questions

JorgeB · March 27, 2023

Run a correcting check, then a non correcting one, all without rebooting, then post new diags.

tr3bjockey · March 28, 2023

On 3/27/2023 at 2:01 AM, JorgeB said:

Run a correcting check, then a non correcting one, all without rebooting, then post new diags.

Thank you very much for your assistance in this. I appreciate it. I ran a correcting check last night, shows 0 errors. Have not shutdown or rebooted. Now running a non correcting check that completes in 10 hours from now. Regardless of the result, I will post another diag as instructed.

I'm very puzzled that the correcting check that finished this morning now shows 0 errors. I'm guessing this might be what you were expecting to happen, or are you as puzzled as I am without further analysis of the new diag after completion?

tr3bjockey · March 29, 2023

Here's the final diagnostic. Also 0 errors. Any idea what happened?

tower-diagnostics-20230328-2335.zip

JorgeB · March 29, 2023

Was there any unclean shutdown? In any case everything appears to be fine for now, just keep monitoring.

tr3bjockey · April 2, 2023

On 3/29/2023 at 1:23 AM, JorgeB said:

Was there any unclean shutdown? In any case everything appears to be fine for now, just keep monitoring.

No unclean shutdowns. The only thing that comes to mind is that I did a clean shutdown during a parity check without first cancelling the parity check.

I'm not sure how a clean shutdown goes about stopping a process like a parity check in progress. Could the shutdown process timeouts be too short and not allow enough time for the parity check process to end cleanly? Should I cancel the parity check first, then do a clean shutdown?

Again, thank you very much for taking the time to assist me with this issue. 🙂

JonathanM · April 3, 2023

5 hours ago, tr3bjockey said:

Should I cancel parity check first, then do a clean shutdown?

I'd go a step further, personally I stop the array before doing a shutdown if at all possible. If the array doesn't stop fairly quickly, it's likely that the shutdown process by itself won't complete cleanly.

tr3bjockey · October 29, 2023

On 4/2/2023 at 6:26 PM, JonathanM said:

I'd go a step further, personally I stop the array before doing a shutdown if at all possible. If the array doesn't stop fairly quickly, it's likely that the shutdown process by itself won't complete cleanly.

On 3/29/2023 at 1:23 AM, JorgeB said:

Was there any unclean shutdown? In any case everything appears to be fine for now, just keep monitoring.

The same issue that happened 7 months ago is happening again. I've been making sure there are clean shutdowns by clicking the spin-down button first, waiting for all the disks to show that they've spun down, and then clicking the shutdown button. I still suddenly get 4355 errors on a parity check. The only thing that has changed since the last parity check on 9/18/23 is that after that parity check, I upgraded from 6.10.3 to 6.12.4. So this is the first parity check I've done after upgrading the OS.

What do you suggest I do for troubleshooting?

FYI, I don't use VM's. I just use unraid for plex. I don't leave the server up 24x7, I only turn it on to backup my PC, add movies to plex, and watch movies on plex. My disks have power on hours ranging from 2 to 6 months, but load cycles vary from 611, to 25000.

tower-diagnostics-20231029-1102.zip

Edited October 29, 2023 by tr3bjockey

itimpi · October 29, 2023

Spinning down the disks will not guarantee a clean shutdown. You would need to stop the array before the shutdown to guarantee this.

Frank1940 · October 29, 2023

@tr3bjockey, if you watch what is happening when you stop the array, all of the array drives will spin up. Then the Linux drive-write cache (I don't know for sure if this is the proper name...) will be flushed to the physical hard drives to be written onto the platters. This assures that the parity drive contents will match the parity calculation of the data drives. (From what has been reported over the past 12 years that I have been a forum member, when there is a parity mismatch after an unclean shutdown, the parity data is always wrong and the data on the data drive is correct! This is strictly an observation, as I have never seen a case where data corruption of a file was found later.) Once this process is complete, the array will stop. (If the drive-write cache can't be emptied, usually because some process is actively writing, the array will not stop until the cache is empty!)
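
To illustrate why it is the parity that ends up stale rather than the data, here is a minimal sketch in Python. It assumes plain byte-wise XOR parity across the data drives, which is how a single parity drive is usually described. The function names and the toy four-byte "drives" are made up for illustration; this is not Unraid's actual code.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_mismatches(blocks, parity):
    """Count byte positions where stored parity differs from computed parity."""
    return sum(1 for p, q in zip(xor_parity(blocks), parity) if p != q)


# Three toy "data drives", four bytes each (hypothetical values).
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = xor_parity(disks)
print(count_mismatches(disks, parity))  # parity and data agree

# Simulate a data write that reached the disk while the parity update was lost,
# for example because the write cache was never flushed before power-off.
disks[1] = bytes([5, 6, 99, 8])
print(count_mismatches(disks, parity))  # parity is now stale at one position

# A correcting check recomputes parity from the data; the data is not modified.
parity = xor_parity(disks)
print(count_mismatches(disks, parity))  # back in agreement
```

In this model the mismatch count only says that parity and data disagree at some position. It cannot say which drive or file was involved, and the correcting step only ever rewrites the parity.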

tr3bjockey · October 30, 2023

7 hours ago, itimpi said:

Spinning down the disks will not guarantee a clean shutdown. You would need to stop the array before the shutdown to guarantee this.

Next time I'll stop the array before shutting down.

1. Is there anything in the diagnostics that point to anything being wrong?

2. Should I perform another parity check? If yes, should I check/uncheck the box?

3. How do I know when it's safe to shut down after I stop the array?

4. Is there a setting that I can turn on to tell Unraid to flush the cache instantly without needing to shutdown the array to force it?

5. Is there a version of unraid, that is later than my current version, that automatically flushes the cache when a shutdown command is given, and does not power off the server until it's done writing?

Edited October 30, 2023 by tr3bjockey

Frank1940 · October 30, 2023

There is a setting designed to do this:

This setting is one that you can determine by timing the actual time in seconds after you manually click on the STOP button to take the array offline. (I consider this setting to be the last thing that can prevent an unclean shutdown. As I understand it, if the array is still not stopped when this timer expires, Unraid will then force a shutdown and it probably will be an unclean one!)

There is this section in the Manual:

https://docs.unraid.net/unraid-os/manual/troubleshooting/#unclean-shutdowns

In that documentation, you will find a number of settings to force things to stop depending on what features of Unraid you are using. Notice the 'warning' about Bash scripts and ssh sessions and how to terminate them!

Edited October 30, 2023 by Frank1940

tr3bjockey · October 30, 2023

Is there a way to find out which drive/folder/files were affected by the parity error?

itimpi · October 30, 2023

2 minutes ago, tr3bjockey said:

Is there a way to find out which drive/folder/files were affected by the parity error?

No.

tr3bjockey · October 31, 2023

Should I perform another parity check? If yes, should I check/uncheck the box?

JorgeB · November 1, 2023

If the previous check was a correcting one, do a non-correcting one now.

tr3bjockey · November 3, 2023

On 10/30/2023 at 5:36 AM, Frank1940 said:

There is a setting designed to do this:

This setting is one that you can determine by timing the actual time in seconds after you manually click on the STOP button to take the array offline. (I consider this setting to be the last thing that can prevent an unclean shutdown. As I understand it, if the array is still not stopped when this timer expires, Unraid will then force a shutdown and it probably will be an unclean one!)

There is this section in the Manual:

https://docs.unraid.net/unraid-os/manual/troubleshooting/#unclean-shutdowns

In that documentation, you will find a number of settings to force things to stop depending on what features of Unraid you are using. Notice the 'warning' about Bash scripts and ssh sessions and how to terminate them!

Thanks for the tip. My shutdown time-out was set to 90. I bumped it up to 120. I timed stopping the array and it took 45 seconds (my cache stats showed 10+ GB). What's weird is that it had been over an hour since I copied a large file to the array (which gets copied first to the SSD RAID 1 cache drive; then I manually press Move to move it to the mechanical drives). I assume that when I stopped the array, it finally dumped the cache from the move to the mechanical drives. When I restarted the array, the cache stats showed less than 4 GB. Any ideas why it holds the data hostage in cache instead of writing it to the array immediately?

I don't have the server on a UPS, so now I'm even more paranoid about corruption in case of power failure.

Please correct me if I'm wrong, but to make sure that there are no corruption issues, every time I copy data to the array I will have to:

1. after copying, press the move to move it from the SSD cache to the mechanical drives.

2. after the move is done, stop all dockers.

3. once all dockers are stopped, then stop the array and wait for the array operation section to tell me "stopped. configuration valid"

4. If I'm not done for the day using the server, restart the array.

5. If I'm done for the day, then go to the power menu and do a shutdown.

On 11/1/2023 at 2:17 AM, JorgeB said:

If the previous check was a correcting one, do a non-correcting one now.

Thank you for responding, JorgeB. I did a non-correcting one and got no errors. Is there a command besides "stop array" that I can run to flush the cache without needing to stop the array, to save me some of the steps detailed above?

JorgeB · November 3, 2023

10 minutes ago, tr3bjockey said:

Is there a command besides "stop array" that I can run to flush the cache without needing to stop the array, to save me some steps as detailed above?

Not sure what you mean, do you mean to avoid an unclean shutdown? An array stop should always do that, a reboot/shutdown might not if the timeouts are not enough.
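
On the narrower question of flushing the Linux write cache by hand: the standard `sync` command (or `os.sync()` from Python) asks the kernel to write out pending data. The sketch below is only an illustration under that assumption. It reads the `Dirty` and `Writeback` counters from `/proc/meminfo` before and after the flush; the helper names are made up, and nothing in this thread confirms that a manual sync is a substitute for stopping the array before a shutdown.

```python
import os


def dirty_kib(meminfo_text):
    """Return the Dirty and Writeback values (in KiB) from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Dirty", "Writeback"):
            values[key] = int(rest.split()[0])
    return values


def flush_write_cache(meminfo_path="/proc/meminfo"):
    """Report pending writes, ask the kernel to flush them, then report again."""
    with open(meminfo_path) as f:
        before = dirty_kib(f.read())
    os.sync()  # same request as running `sync` in a terminal
    with open(meminfo_path) as f:
        after = dirty_kib(f.read())
    return before, after


# Parsing demo on made-up sample text (not taken from the diagnostics).
sample = "MemTotal: 32768000 kB\nDirty: 10485760 kB\nWriteback: 0 kB\n"
print(dirty_kib(sample))
```

On the server itself you would call `flush_write_cache()` from a terminal session and compare the two readings. A large `Dirty` value before the call would be consistent with the multi-gigabyte cache figure mentioned above.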

Frank1940 · November 3, 2023

1 hour ago, tr3bjockey said:

1. after copying, press the move to move it from the SSD cache to the mechanical drives

You don't have to do this before you stop the array. In fact, I would think that starting a move operation would actually prevent the array from stopping!

As long as no active user process has a file open on the array, it can be stopped. Any data in any cache will have to be flushed before the array will start stopping, but the OS will handle that if you have the time setting high enough before it forces a shutdown/reboot, which will cause an unclean shutdown. The reason that I usually stop the Dockers is to terminate any file operations that one of them might be doing. (Folks set up Dockers to do a wide, wide variety of things that involve files without knowing when they might be uploading or transferring files to and from the array.) The Tips and Tweaks plugin can force shutdown of any bash scripts or ssh sessions. Unfortunately, I don't know of any way to terminate file operations that Share Access Users might be doing. (Outside of you loudly hollering to "Get off the server NOW!!!") VMs could be doing a file operation (I don't use any), so you should beware of them.

Most of the problems that cause threads on the forum are caused by an external event (such as a power outage). Some of those are just unavoidable and you have to clean up the mess that they cause. However, if you are deliberately doing something that requires stopping the array for any reason, taking a few simple steps to make sure that the array is actually stopped before you shut down or reboot the server will prevent an unclean shutdown. (Probably 90% of these shutdowns don't cause a problem. But I, personally, want to avoid the possible hassle of a ~24 hour non-correcting parity check and then a possible 24 hour correcting one!)

One more thing, The Tips and Tweaks plugin has two settings that control the RAM cache that Linux uses for write operations. This is a screenshot of those settings with the 'Help' for them:

The more memory that you have the more data there will be to be flushed. (Those defaults were set when 2MB was the typical amount of RAM in a Linux computer. That 20% represents 400KB for 2MB of RAM. With 32GB, that becomes 6.4GB. Stop and think how long it takes to write that much data to an HD.)
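
The arithmetic above can be sketched in a few lines of Python. This is a rough estimate only: the 100 MB/s write speed is an assumed figure, the function names are made up, and the kernel applies the percentage to available memory rather than total RAM, so treat the result as an upper bound.

```python
def dirty_limit_bytes(ram_bytes, dirty_ratio_percent):
    """Upper bound on unwritten data the RAM cache may hold at a given ratio."""
    return ram_bytes * dirty_ratio_percent // 100


def flush_seconds(pending_bytes, write_bytes_per_second):
    """Time needed to write the pending data at a steady write speed."""
    return pending_bytes / write_bytes_per_second


GB = 10**9
limit = dirty_limit_bytes(32 * GB, 20)      # 32 GB of RAM, 20% ratio
print(limit / GB)                           # 6.4 (GB)
print(flush_seconds(limit, 100 * 10**6))    # 64.0 seconds at an assumed 100 MB/s
```

If the real write speed to the array is lower than the assumed figure, the flush takes proportionally longer, which is the reason to compare the shutdown time-out against how long a manual array stop actually takes.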
