Large number of parity errors during checking after upgrading hardware - what is best action to take?


Solved by JorgeB

I recently replaced the motherboard and CPU of my UnRAID server. Afterwards I wanted to check that parity was OK, but contrary to what I expected/hoped I already have over 500 errors with less than 10% of the check done. Sadly, my last check was a few months ago, so I can't say for sure whether the problem is with the new hardware or existed before the switch (let's call it a lesson for the next hardware change: run a check just before). I have not enabled error correction, so that I can better investigate if this type of problem occurs. With the new hardware I initially had some problems getting UnRAID to start properly and had to do a few unclean shutdowns, but I do not think data was written at those times (the system most likely never started the array). The new motherboard has ECC memory, and I ran a memory test for a number of hours without any errors before trying it with UnRAID. I also installed Linux on an SSD that I connected in turn to each of the six SATA channels I use for UnRAID (two are regular separate SATA ports and four go through a "slim SAS" cable with four SATA connectors at the other end), and they at least worked reliably enough to run Linux without any apparent errors.

For what it is worth, here are the diagnostics. What would be the best action to take in this situation? Run some diag

nas-diagnostics-20230520-2051.zip

Edited by NAS-newbie

I disabled all containers and VMs (to avoid more writes until I have done some more tests) and let the check continue running. Late at night I started receiving a lot of messages like this in the syslog (possibly when "mover" ran). A memory allocation error seems strange as the box has 64GB of memory, so it must be coming from a size-limited pool...

May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: nchan: Out of shared memory while allocating message of size 9059. Increase nchan_max_reserved_memory.
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: *274131 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: MEMSTORE:00: can't create shared message for channel /devices
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [crit] 6892#6892: ngx_slab_alloc() failed: no memory
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: shpool alloc failed
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: nchan: Out of shared memory while allocating message of size 277. Increase nchan_max_reserved_memory.
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: *274140 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/update1?buffer_length=1 HTTP/1.1", host: "localhost"
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: MEMSTORE:00: can't create shared message for channel /update1
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [crit] 6892#6892: ngx_slab_alloc() failed: no memory
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [error] 6892#6892: shpool alloc failed
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [error] 6892#6892: nchan: Out of shared memory while allocating message of size 3603. Increase nchan_max_reserved_memory.
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [error] 6892#6892: *274142 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/var?buffer_length=1 HTTP/1.1", host: "localhost"
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [error] 6892#6892: MEMSTORE:00: can't create shared message for channel /var
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [crit] 6892#6892: ngx_slab_alloc() failed: no memory
May 20 23:50:23 NAS nginx: 2023/05/20 23:50:23 [error] 6892#6892: shpool alloc failed

and eventually the test was paused at about 35% (with no more errors found than at ~8% interestingly).
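
For anyone wanting to gauge how widespread these nginx messages are, here is a rough Python sketch that tallies them. It only pattern-matches the message text seen in the excerpt above; the function name `summarize_nchan_errors` and the `SAMPLE` constant are made up for illustration, and nothing here is an UnRAID or nginx tool.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt in this post (not a full log).
SAMPLE = """\
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: nchan: Out of shared memory while allocating message of size 9059. Increase nchan_max_reserved_memory.
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: MEMSTORE:00: can't create shared message for channel /devices
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [crit] 6892#6892: ngx_slab_alloc() failed: no memory
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: nchan: Out of shared memory while allocating message of size 277. Increase nchan_max_reserved_memory.
May 20 23:50:22 NAS nginx: 2023/05/20 23:50:22 [error] 6892#6892: MEMSTORE:00: can't create shared message for channel /update1
"""

# Patterns based purely on the wording of the messages shown above.
SIZE_RE = re.compile(r"Out of shared memory while allocating message of size (\d+)")
CHANNEL_RE = re.compile(r"can't create shared message for channel (/\S+)")


def summarize_nchan_errors(text):
    """Count nchan shared-memory failures in a block of syslog text."""
    sizes = [int(s) for s in SIZE_RE.findall(text)]
    channels = Counter(CHANNEL_RE.findall(text))
    slab_failures = text.count("ngx_slab_alloc() failed: no memory")
    return {"sizes": sizes, "channels": channels, "slab_failures": slab_failures}


if __name__ == "__main__":
    summary = summarize_nchan_errors(SAMPLE)
    print("message sizes that failed:", summary["sizes"])
    print("affected channels:", dict(summary["channels"]))
    print("slab allocation failures:", summary["slab_failures"])
```

To run it against a real log, read the file contents into a string and pass that in instead of `SAMPLE`. The failed message sizes in the excerpt are only a few KB, which fits the guess that a small fixed-size pool is exhausted rather than system RAM.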

New diagnostics are attached if anybody can have a look. I have now restarted the parity check to see whether more errors are encountered all over the disks or whether the "only" ones are the initial ~500...

nas-diagnostics-20230521-0645.zip

Edited by NAS-newbie

I have now run about 60% of the parity check and no more errors have been discovered beyond the 513 that were found in the first 8% of the array. My array consists of three "enterprise grade" disks (20TB parity, 20TB data and 10TB data) and three smaller, older "consumer grade" disks that remain from when I first tried out UnRAID (3TB and 2x2TB). The old disks do not contain any data (I moved what little was on them to the new drives a long time ago, and the new ones have not filled up enough for the small disks to be used since).

Assuming no more parity errors are found, I feel quite sure that one or more of these old drives (which I probably should have removed when adding the new, much larger disks) are to blame for the parity errors.
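
As a sanity check on that reasoning, here is a small Python sketch that works out which data disks overlap a given point of the check. It assumes the progress percentage is measured against the 20TB parity disk and that sizes are decimal TB; both are assumptions, as are the disk labels and the helper name `disks_at`.

```python
# Disk sizes in TB as listed in this post; the labels are invented for this sketch.
PARITY_TB = 20
DATA_DISKS_TB = {
    "data 20TB": 20,
    "data 10TB": 10,
    "old 3TB": 3,
    "old 2TB (a)": 2,
    "old 2TB (b)": 2,
}


def disks_at(progress_percent, parity_tb=PARITY_TB, disks=DATA_DISKS_TB):
    """Return the offset in TB and the data disks that extend past that offset.

    Assumes progress is reported relative to the parity disk size.
    """
    offset_tb = parity_tb * progress_percent / 100
    return offset_tb, [name for name, size in disks.items() if size > offset_tb]


if __name__ == "__main__":
    for pct in (8, 35, 60):
        offset, names = disks_at(pct)
        print(f"{pct}% ~ {offset:.1f} TB into parity, covered by: {', '.join(names)}")
```

Under those assumptions, 8% is about 1.6TB in, a region that all five data disks cover, while at 35% and 60% only the large disks are still involved. That is consistent with the old drives being involved, but it does not rule out the large disks, since they cover the first 1.6TB as well.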

As they add relatively little capacity to the array and seem likely to cause these serious problems, I am planning to remove them from the array.

Given that I did not select "correct parity" for the ongoing check, what is the best way to do this operation?

As I have backups of all critical data, I am thinking of taking the risk of going the "new configuration" route and letting that build new parity. This leaves a window of about 36 hours in which I have no parity protection for the array and may have to restore from backups if there is a failure.
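
For reference, the 36-hour figure can be turned into an average speed with a few lines of Python. This is plain arithmetic on the numbers in this post (20TB parity, decimal units assumed); the function name `implied_throughput_mb_s` is just for illustration.

```python
def implied_throughput_mb_s(capacity_tb, hours):
    """Average MB/s needed to process capacity_tb (decimal TB) in the given hours."""
    total_bytes = capacity_tb * 1e12
    seconds = hours * 3600
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    rate = implied_throughput_mb_s(20, 36)
    print(f"20 TB in 36 h needs about {rate:.0f} MB/s on average")
```

So a 36-hour window corresponds to roughly 154 MB/s averaged over the whole 20TB parity disk; a slower average would make the unprotected window longer.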

Any suggestions for a faster/safer/better procedure are appreciated!

Edited by NAS-newbie