
Thousands of errors on parity check



For the past month, I've been seeing thousands of errors reported when I click on the parity check button.

Some history...

The IO CREST Internal 5 Port Non-RAID SATA III 6GB/s JMB585 SI-PEX40139 is installed but no cables are plugged into it.  (I stopped using it before 12/1/22 because I was still getting UDMA CRC errors pointing to bad cables, which were all replaced, and I never bothered removing it from the PCIe slot.)

 

I instead use the JEYI NVMe M.2 to 5 SATA Adapter (Internal 5 Port Non-RAID SATA III 6GB/s) plugged into a GLOTRENDS M.2 PCIe X1 Adapter. It was installed before 12/1/22 and fixed the UDMA CRC errors I was seeing with the IO CREST card above.  I even did a complete (100%) parity check on 12/23/22 and there were 0 errors.  On 1/2/23, a parity check started, but I cancelled it after 4 minutes.  On 1/9/23, I ran another parity check for 10 hours (about 90% done), but I accidentally cancelled it (I shut down without knowing a parity check was running).  Still 0 errors.  On 2/12/23 I ran a full parity check and got 2252 errors, which I did not notice in the parity check history at the time.  On 3/12/23, I noticed the 2252 errors in the history (this had never happened in past years) and started a parity check again.  Within 39 minutes, 672 errors were showing and I cancelled it.

 

I only use the server about once or twice a week for a couple of hours, so the SMART power-on total is about 7 months:

SMART attribute 9 (Power on hours): flag 0x0012, value 100, worst 100, threshold 000, type Old age, updated Always, failed Never, raw value 5159 (7m, 2d, 23h)

 

I recently upgraded Unraid from 6.9.2 to 6.10.3, and the errors might have started happening after that?  If there's some log that has the date of the upgrade, it would be good to know if there's any correlation.  Yes, I know correlation is not causation, lol.

 

 

I have attached tower diagnostics.

 

Questions:

1. Does running further parity checks with "write corrections to parity" enabled cause my currently stored data files to corrupt further, or does it only affect the parity data stored for recovery?  I.e., should I avoid doing this until a cause is found?

 

2.  Are the errors happening on the actual drives, or is something else failing?

 

3. Should I not be adding any more files to these hard drives?

 

4.  Suggestions for troubleshooting this?

 

 

 

1574718895_unraidmain.thumb.jpg.e1ea6201292c525c3f53f124dbe37c2a.jpg

 

591970922_paritycheckhistorylog.thumb.jpg.50748cd5f01ae93aae49d5de612209a7.jpg

tower-diagnostics-20230326-1515.zip

Edited by tr3bjockey
clarify udma errors, thought of additional questions
On 3/27/2023 at 2:01 AM, JorgeB said:

Run a correcting check, then a non correcting one, all without rebooting, then post new diags.

Thank you very much for your assistance in this.  I appreciate it.  I ran a correcting check last night, shows 0 errors.  Have not shutdown or rebooted.  Now running a non correcting check that completes in 10 hours from now.  Regardless of the result, I will post another diag as instructed.

 

I'm very puzzled that the correcting check that finished this morning now shows 0.  I'm guessing this might be what you were expecting to happen, or are you as puzzled as I am without further analysis of the new diag after completion?

On 3/29/2023 at 1:23 AM, JorgeB said:

Was there any unclean shutdown? In any case everything appears to be fine for now, just keep monitoring.

No unclean shutdowns.  The only thing that comes to mind is that I did a clean shutdown during a parity check without first cancelling the parity check. 

 

I'm not sure how a clean shutdown proceeds to stop a process like a parity check in progress.  Could the shutdown process timeouts be too short and not allow enough time for the parity check process to end cleanly?  Should I cancel the parity check first, then do a clean shutdown?

 

Again, thank you very much for taking the time to assist me with this issue. 🙂

5 hours ago, tr3bjockey said:

Should I cancel parity check first, then do a clean shutdown?

I'd go a step further, personally I stop the array before doing a shutdown if at all possible. If the array doesn't stop fairly quickly, it's likely that the shutdown process by itself won't complete cleanly.

  • 6 months later...
On 4/2/2023 at 6:26 PM, JonathanM said:

I'd go a step further, personally I stop the array before doing a shutdown if at all possible. If the array doesn't stop fairly quickly, it's likely that the shutdown process by itself won't complete cleanly.

 

On 3/29/2023 at 1:23 AM, JorgeB said:

Was there any unclean shutdown? In any case everything appears to be fine for now, just keep monitoring.

The same issue that happened 7 months ago is happening again.  I've been making sure there are clean shutdowns by clicking the spin-down button first, waiting for all the disks to show that they've spun down, and then clicking the shutdown button.  I still suddenly get 4355 errors on a parity check.  The only thing that has changed since the last parity check on 9/18/23 is that, after that parity check, I upgraded from 6.10.3 to 6.12.4.  So this is the first parity check I've done after upgrading the OS.

 

What do you suggest I do for troubleshooting?

 

FYI, I don't use VM's.  I just use unraid for plex.  I don't leave the server up 24x7, I only turn it on to backup my PC, add movies to plex, and watch movies on plex.  My disks have power on hours ranging from 2 to 6 months, but load cycles vary from 611, to 25000. 

1758207713_parityhistory.thumb.jpg.0b7089463014eb54f28d6c4bc1c27206.jpg

tower-diagnostics-20231029-1102.zip

Edited by tr3bjockey

@tr3bjockey,  If you watch what is happening when you stop the array, all of the array drives will spin up.  Then the Linux drive-write cache (I don't know for sure if this is the proper name...) will be flushed to the physical hard drives to be written onto the platters.  This assures that the parity drive contents will match the parity calculation of the data drives.  (From what has been reported over the past 12 years that I have been a forum member, when there is a parity mismatch after an unclean shutdown, the parity data is always wrong and the data on the data drive is correct!  This is strictly an observation, as I have never seen a case where data corruption of a file was found later.)  Once this process is complete, the array will stop.  If the drive-write cache can't be emptied (usually because some process is actively writing), the array will not be stopped until the cache is empty!
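To illustrate why a correcting check touches only parity and not the data drives, here is a toy sketch in Python. It assumes single parity computed as a byte-wise XOR across the data disks; the function names, the three 4-byte "disks", and the simulated lost write are all made up for illustration, and the real check works on whole disks inside Unraid's driver, not like this.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def check_parity(data_blocks, stored_parity, correct=False):
    """Count mismatching parity bytes and return the parity to keep.

    The data blocks are only read. With correct=True the parity is
    replaced by the value recomputed from the data (hypothetical model
    of a correcting check); otherwise the stored parity is returned as-is.
    """
    expected = xor_parity(data_blocks)
    errors = sum(1 for e, s in zip(expected, stored_parity) if e != s)
    return errors, (expected if correct else stored_parity)

# Hypothetical array: three data "disks" of 4 bytes each.
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
parity = xor_parity(disks)

# Simulate a parity write that never reached the platter (unclean shutdown).
stale = bytes([parity[0] ^ 0x01]) + parity[1:]

errors, fixed = check_parity(disks, stale, correct=True)    # correcting pass
errors_after, _ = check_parity(disks, fixed, correct=False)  # verifying pass
```

In this model the first pass reports the mismatch and rewrites parity, and the second pass finds nothing left to fix, which matches the "correcting check, then non-correcting check" sequence suggested earlier in the thread.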

7 hours ago, itimpi said:

Spinning down the disks will not guarantee a clean shutdown.    You would need to stop the array before the shutdown to guarantee this.

For next time I'll stop the array before the next shutdown.

 

1.  Is there anything in the diagnostics that point to anything being wrong?

2.  Should I perform another parity check? If yes, should I check/uncheck the box?

3.  How do I know when it's safe to shut down after I stop the array?

4.  Is there a setting that I can turn on to tell Unraid to flush the cache instantly without needing to shutdown the array to force it?

5.  Is there a version of unraid, that is later than my current version, that automatically flushes the cache when a shutdown command is given, and does not power off the server until it's done writing?

Edited by tr3bjockey

There is a setting designed to do this:

image.thumb.png.f58c7ccbbae7de31594f87a098b03ab7.png

 

You can determine the value for this setting by timing, in seconds, how long the array takes to go offline after you manually click the STOP button.  (I consider this setting to be the last thing that can prevent an unclean shutdown.  As I understand it, if the array is still not stopped when this timer expires, Unraid will force a shutdown, and it will probably be an unclean one!)

 

There is this section in the Manual:

https://docs.unraid.net/unraid-os/manual/troubleshooting/#unclean-shutdowns

 

In that documentation, you will find a number of settings to force things to stop, depending on which features of Unraid you are using.  Notice the 'warning' about Bash scripts and ssh sessions and how to terminate them!

Edited by Frank1940
On 10/30/2023 at 5:36 AM, Frank1940 said:

There is a setting designed to do this:

image.thumb.png.f58c7ccbbae7de31594f87a098b03ab7.png

 

You can determine the value for this setting by timing, in seconds, how long the array takes to go offline after you manually click the STOP button.  (I consider this setting to be the last thing that can prevent an unclean shutdown.  As I understand it, if the array is still not stopped when this timer expires, Unraid will force a shutdown, and it will probably be an unclean one!)

 

There is this section in the Manual:

https://docs.unraid.net/unraid-os/manual/troubleshooting/#unclean-shutdowns

 

In that documentation, you will find a number of settings to force things to stop, depending on which features of Unraid you are using.  Notice the 'warning' about Bash scripts and ssh sessions and how to terminate them!

Thanks for the tip.  My shutdown time-out was set to 90.  I bumped it up to 120.  I timed stopping the array and it took 45 seconds (my cache stats showed 10+ GB).  What's weird is that it had been over an hour since I copied a large file to the array (which gets copied first to the SSD RAID 1 cache drive; then I manually press Move to move it to the mechanical drives).  I assumed that when I stopped the array, it finally dumped the cached data from the move to the mechanical drives.  When I restarted the array, the cache stats showed less than 4 GB.  Any ideas why it holds the data hostage in cache instead of writing it to the array immediately?

 

I don't have the server on a UPS, so now I'm even more paranoid about corruption in case of power failure.

Please correct me if I'm wrong, but to make sure that there are no corruption issues, every time I copy data to the array I will have to:

1.  after copying, press the move to move it from the SSD cache to the mechanical drives.

2.  after the move is done, stop all dockers.

3.  once all dockers are stopped, then stop the array and wait for the array operation section to tell me "stopped. configuration valid"

4.  If I'm not done for the day using the server, restart the array.

5.  If I'm done for the day, then go to the power menu and do a shutdown.

 

 

On 11/1/2023 at 2:17 AM, JorgeB said:

If the previous check was a correcting one, do a non-correcting one now.

Thank you for responding, JorgeB.  I did a non-correcting one and got no errors.  Is there a command besides "stop array" that I can run to flush the cache without needing to stop the array, to save me some of the steps detailed above?

10 minutes ago, tr3bjockey said:

Is there a command besides "stop array" that I can run to flush the cache without needing to stop the array, to save me some steps as detailed above?

Not sure what you mean, do you mean to avoid an unclean shutdown? An array stop should always do that, a reboot/shutdown might not if the timeouts are not enough.

1 hour ago, tr3bjockey said:

1.  after copying, press the move to move it from the SSD cache to the mechanical drives

 

You don't have to do this before you stop the array.  In fact, I would think that starting a move operation would actually prevent the array from stopping! 

 

As long as no active user process has a file open on the array, it can be stopped.  Any data in any cache will have to be flushed first before the array will start stopping but the OS will handle that if you have the time setting high enough before it forces a shutdown/reboot which will cause an unclean shutdown.  The reason that I usually stop the Dockers is to terminate any file operations that one of them might be doing.  (Folks set up Dockers to do a wide, wide variety of things that involve files without knowing when they might be uploading or transferring of files to-and-from the array.)  The Tips and Tweaks plugin can force shutdown of any bash scripts or ssh sessions.   Unfortunately, I don't know of any way to terminate file operations that Share Access Users might be doing.  (Outside of you loudly hollering to "Get off the server NOW!!!")  VM's could be doing a file operation (I don't use any) but you should beware of them.
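For anyone curious how much data is still waiting in the Linux RAM write cache at a given moment, the kernel exposes `Dirty` and `Writeback` counters (in kB) in `/proc/meminfo`, and Python's `os.sync()` asks the kernel to write dirty pages out. The sketch below is only an illustration under assumptions: it assumes a Python 3 interpreter is available on the server, the helper names and the sample text are made up, and flushing this way is not a substitute for stopping the array before a shutdown, nor does it move files from the SSD cache pool to the array.

```python
import os

def parse_dirty_kb(meminfo_text):
    """Extract the Dirty and Writeback counters (in kB) from /proc/meminfo text."""
    wanted = {"Dirty", "Writeback"}
    result = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted:
            result[name] = int(rest.split()[0])
    return result

def pending_writes_kb(path="/proc/meminfo"):
    """Read the live counters from the running kernel (Linux only)."""
    with open(path) as f:
        return parse_dirty_kb(f.read())

def flush_and_report():
    """Show pending writes, ask the kernel to flush, then show them again."""
    before = pending_writes_kb()
    os.sync()  # request that dirty pages be written to disk
    after = pending_writes_kb()
    return before, after

# Made-up sample so the parser can be tried without a Linux host.
SAMPLE = """MemTotal:       32768000 kB
Dirty:             10240 kB
Writeback:             0 kB
WritebackTmp:          0 kB
"""

sample_counts = parse_dirty_kb(SAMPLE)
```

On the server itself, calling `flush_and_report()` would return the counters before and after the flush; a large `Dirty` value long after a copy finished would suggest writes are still being held in RAM.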

 

Most of the problems that cause threads on the forum are caused by an external event (such as a power outage).  Some of those are just unavoidable and you have to clean up the mess that they cause.  However, if you are deliberately doing something that will require stopping the array for any reason, a few simple steps to make sure that the array is actually stopped before you shut down or reboot the server will prevent an unclean shutdown.  (Probably 90% of these shutdowns don't cause a problem.  But I, personally, want to avoid the possible hassle of a ~24-hour non-correcting parity check and then a possible 24-hour correcting one!)

 

One more thing, The Tips and Tweaks plugin has two settings that control the RAM cache that Linux uses for write operations.  This is a screenshot of those settings with the 'Help' for them:

image.thumb.png.6b0c5785a57b6869236c7089cfb2e74d.png

 

The more memory that you have the more data there will be to be flushed.  (Those defaults were set when 2MB was the typical amount of RAM in a Linux computer.  That 20% represents 400KB for 2MB of RAM.  With 32GB, that becomes 6.4GB.  Stop and think how long it takes to write that much data to an HD.) 
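The arithmetic above can be sketched in Python. This is a rough estimate under stated assumptions: it applies the 20% figure to total RAM as the post does (the kernel actually applies the ratio to available memory, so the real limit is somewhat lower), and the 150 MB/s write speed is an assumed number for illustration, not a measurement from this server.

```python
def dirty_limit_bytes(ram_bytes, dirty_ratio_percent):
    """Upper bound on cached writes if the ratio applied to all of RAM."""
    return ram_bytes * dirty_ratio_percent // 100

def flush_seconds(pending_bytes, write_bytes_per_second):
    """Time to write the pending data at a steady write speed."""
    return pending_bytes / write_bytes_per_second

GB = 10**9
MB = 10**6

limit = dirty_limit_bytes(32 * GB, 20)      # 20% of 32 GB
seconds = flush_seconds(limit, 150 * MB)    # assumed 150 MB/s sustained write

print(f"up to {limit / GB:.1f} GB cached, about {seconds:.0f} s to flush")
```

With these assumed numbers the flush takes roughly 43 seconds, the same order of magnitude as the 45-second array stop measured earlier in the thread; a slower effective write speed would stretch that toward or past the shutdown time-out.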
