(Solved) 6.3.5 Unraid server becomes unresponsive – CPU at 100% – cold reboot only option


optiman


I have been running into an issue where my server becomes unresponsive.  This can happen at any time, with no obvious pattern as to what is causing it.  When this happens, I'm unable to do much at all.  I have not made any recent changes to hardware or apps (plugins and Dockers).  All of that said, I suspect the Mover may be involved – it was running almost every time the server became unresponsive.

 

I'm running Unraid Pro 6.3.5; my signature has current hardware specs.  I only run two Docker containers, sabnzbd and sickrage.  For plugins, I have attached a screenshot showing what's installed.  This issue started back with Unraid 6.2 and happens every 2 or 3 weeks.  The webgui dashboard shows that my CPU and memory are barely used at all during normal operations.  I also have green balls on all drives, and the syslog doesn't show anything that helps me identify the cause.

 

When this occurs, the following happens:

- SMB shares become unresponsive – no access.

- The webgui is unresponsive, even on the server console.

- Command line commands do not execute; powerdown, shutdown, etc. will not run.

- CPU goes to 100%.

- A cold restart is the only way to shut down or reboot the server.  This causes a parity check to start, which in most cases finds zero errors.

- If a manual copy is in progress (or the Mover is running), the active directory becomes inaccessible.  The only way I can get back into that directory is to run "chmod -R 777" on it.

- HTOP shows the process that is eating up the CPU – something about "shfs ... noatime,big_writes,allow_other".  I've had no luck searching the forums, and I am unable to kill this process.

 

What still works when this happens:

- I can use PuTTY or the server console to log in to the command line, although I can't do much.

- I can copy the syslog to the flash drive.

- I can log out and back in at the server console, but the webgui will not come up – it just hangs.

- Docker containers continue to run normally.

 

So far I have used the server console to view HTOP, and that is how I know the CPU is at 100%.  It shows a process that I do not recognize or understand – see the attached HTOP screenshot.  I cannot kill the process, not from htop and not from the command line; the system simply ignores the commands.  I also tried just leaving the server alone to see if it would recover – no luck.  After 3 days it was still screwed up and unresponsive.
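A process that sits at 100% CPU but ignores kill is usually stuck inside the kernel, and it won't handle signals until it returns to userspace.  A minimal sketch of what can be checked from the still-working shell (the PID 12345 is a placeholder; /proc/<pid>/stack only exists on kernels built with stack-trace support):

# Show process state (R = running, D = uninterruptible sleep)
# and the kernel function it is currently waiting in, if any
ps -o pid,stat,wchan:30,cmd -p 12345

# If available, dump the kernel stack of the stuck process
cat /proc/12345/stack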

 

I'm still using reiserfs (I need to convert when I have time), so I ran file system checks and reviewed SMART data on all drives – no errors.  For my cache pool (btrfs) I ran the SCRUB command.  I also ran memtest – no issues.  After a fresh reboot I'm able to run tower-diagnostics, and I can PM that and my syslog files to LT support if it helps.
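For anyone repeating these checks, they look roughly like this (a sketch – device names are examples; on Unraid the md device number matches the disk slot, and the array should be started in maintenance mode for the reiserfs check):

# Read-only consistency check of a reiserfs array disk
reiserfsck --check /dev/md1

# Scrub the btrfs cache pool and report the result
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache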

 

This WILL happen again, and I'm concerned about having to do a COLD restart.  Because I'm unable to kill the process that is pushing the CPU to 100%, I have no option but to cut the power each time this happens – NOT GOOD.

 

My main question: what can I do the next time this happens to help identify what is causing it?
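One generic trick, since SSH still works during the hang, is to ask the kernel itself for traces (a sketch that assumes Magic SysRq support is compiled into the Unraid kernel – worth verifying first):

# Enable all SysRq functions for this boot
echo 1 > /proc/sys/kernel/sysrq

# Dump stack traces of all blocked (D-state) tasks to the kernel log
echo w > /proc/sysrq-trigger

# Dump what every CPU is currently executing
echo l > /proc/sysrq-trigger

# Save the kernel log to the flash drive for later analysis
dmesg > /boot/hang-trace.txt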

 

Please help if you can.

 

Thanks!

htop.jpg

memtest screenshot.jpg

plugins.JPG

nerdpack.JPG

Link to comment
36 minutes ago, Switchblade said:

I’m still using reiserfs

I know you don't want to hear this, but you will continue to have these lockups until you convert.  This is almost expected now: reiserfs disks cause lockups after a period of time.  It seems to be a little better with a fresh reiserfs format filled only about two-thirds of the way, but eventually the file system will start locking up when writes are requested.

 

Seen it personally, and time and time again on this forum.

 

Convert before one of your hard lockups causes file system corruption and data loss.

Link to comment

Thanks for the replies, guys.  As pwm points out, I too had not had any reiserfs issues until the recent release updates.  Does the htop information agree that reiserfs is the cause?

 

How can I confirm my issue is definitely connected to reiserfs, and that converting my drives will solve it?

 

Thanks again and cheers!

 

Link to comment

Not easy to be 100% sure.

 

But quite a lot of people have had issues with Reiserfs – definitely in combination with extended attributes.  And it isn't impossible that the RFS code has also picked up some bad change in newer kernels.

 

But I have seen 100% CPU load from a different process when writing to RFS, so somewhere in the file system code there is a bug that can result in a busy loop.

Link to comment
3 minutes ago, Switchblade said:

How can I confirm my issue is definitely connected to reiserfs, and that converting my drives will solve it?

Eliminate all writes to reiserfs volumes, and see if the lockups continue.

 

This can mostly be accomplished by converting one drive to XFS, or adding a new drive formatted as XFS, and directing all new material there temporarily.
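To see at a glance which disks are still reiserfs, something along these lines works from the console (a sketch; mount points follow Unraid's /mnt/diskN layout):

# Print mount point and file system type for each array disk and the cache
grep -E '/mnt/(disk|cache)' /proc/mounts | awk '{print $2, $3}'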

 

Other than that, I can only refer you to the multitude of threads on this forum that outline your exact issue.  It's not a new issue; it's been ongoing since the move to 64-bit Unraid in 6.x.

It doesn't happen to everyone, but in every instance where it does, conversion to XFS is the solution.

Link to comment
14 hours ago, pwm said:

I have never had a single Reiserfs lockup on quite a number of servers, with one single exception. Reiserfs + extended attributes does not work well.

 

Correct, but unRAID uses extended attributes to store share information.  It is highly recommended for anybody running unRAID 6 to move away from ReiserFS.
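For the curious, extended attributes on a share can be inspected with getfattr from the attr package (a sketch – the share name is an example, getfattr may need to be installed via NerdPack, and the exact attribute names unRAID sets may differ):

# Dump all extended attributes on a share's root directory
getfattr -d -m - /mnt/user/Movies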

 

Link to comment
3 hours ago, bonienl said:

 

Correct, but unRAID uses extended attributes to store share information.  It is highly recommended for anybody running unRAID 6 to move away from ReiserFS.

 

Ok - didn't know that. I only knew that the Dynamix File Integrity plugin uses extended attributes.

 

For file systems without extended attributes, I have accumulated maybe 50-100 years of total usage time with zero issues on Reiserfs volumes.  With extended attributes, on the other hand, I have had hanging processes at 100% CPU.

Link to comment

Thanks guys, point well made.  I just wanted to be sure there wasn't another issue going on here that I should address first.

 

I have new drives to preclear, and I'll start working on a plan to convert.  I assume the process hasn't changed much since last year, unless somebody has created an even easier one.  That's why I deferred this task – it seemed it would take a long time.  I guess it's time to just get 'er done.

 

Thanks

 

 

Link to comment

If you care which disk number contains specific data, it's a little complex.  If not, all you need is one empty disk big enough to hold the contents of your fullest reiserfs disk, and you can start copying.  Stop the array, change the desired format type on the empty disk, start the array, verify that only the empty disk wants formatting, format the disk, COPY (not move) the fullest reiserfs disk to the freshly formatted disk, verify the copy if desired, stop the array, change the desired format type of the disk you just copied from, start the array, verify only one disk wants formatting, and so on.
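A quick way to find the fullest reiserfs disk, and to confirm the empty disk can hold its contents (a sketch):

# Show size, used and available space for every array disk
df -h /mnt/disk*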

 

You do NOT want to move data from the reiserfs disks, only copy.  For one thing, moving slows down the process greatly; for another, it's prone to errors if you have to stop in the middle.  It's much cleaner to do a full copy so you can verify all the data is identical, then format the source drive, which is WAY faster than deleting the files from reiserfs.

 

Parity stays intact the entire time – no need to do a New Config or risk a drive failure while parity is being rebuilt.

Link to comment

I have no share inclusions or exclusions, and I don't care where the data ends up.  All 8 of my data drives are 6TB.  I have a spare 8TB that I want to use for this process, and it can remain in my array afterwards.  I will just have a spare 6TB drive at the end of the conversion.

 

In your steps, would I still need to do the New Config part? 

 

I read it's best to boot into SAFE mode – no plugins, no Docker, no Mover, etc.  I don't care if my server is offline the entire time; I'm more interested in the fastest process that keeps parity.  I was planning to log in at my server console and use MC to "copy" and verify, instead of the command line option.  As long as I'm copying from disk share to disk share, this process should work.  Do you agree this is OK, or should I reconsider using unbalance or the command line options?

Link to comment
2 hours ago, pwm said:

I think I would prefer rsync, since it can verify the copy and supports restart.

 

It's obviously good to turn off anything that modifies the source disk when doing the copying.

+1

Here's what I use.

rsync -arv /mnt/disk(source#)/ /mnt/disk(destination#) >/boot/copy.txt
rsync -narcv /mnt/disk(source#)/ /mnt/disk(destination#) >/boot/verify.txt

Pay special attention to the slashes: the source has a trailing /, the destination does NOT.

If the empty disk was disk4 and the source was disk6, the copy command would be: rsync -arv /mnt/disk6/ /mnt/disk4 >/boot/copy.txt

Run these either in a "screen" session or at the physical console.  The >/boot/copy.txt and >/boot/verify.txt parts put the list of files that have been acted on into a text file on the root of the flash drive.  copy.txt should contain a full list of the files on the drive; verify.txt should list NO files.  If verify.txt shows a list of files, those files were different after the copy run and you need to figure out why.
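If you haven't used screen before, a minimal sketch (the session name is arbitrary; on stock Unraid, screen typically comes from the NerdPack plugin):

# Start a named session and run the rsync command inside it
screen -S convert

# Detach with Ctrl-A then d; the copy keeps running
# Reattach later to check on it
screen -r convert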

 

No New Config or parity re-sync is needed with this.

 

If you want to use MC to do the copy part, the verify command will still work on its own just fine.

Link to comment

OK, I will use rsync.  I'll boot into SAFE mode and run the two rsync commands at the console.

 

Thanks for the rsync commands, jonathanm!  So: run the first rsync command to copy; when that finishes, run the second rsync command to verify, and confirm the output file lists no files.

 

Preclearing the new drive now

 

Thanks!

 

 

 

Link to comment
3 minutes ago, Switchblade said:

So the first rsync command is to copy and the second line is to verify - correct?

Yes.  If you use the >text.txt part, all normal screen output goes to the file, so the command line will just jump to the next line and stay blank, then return to a prompt when it's done.  You can check on the progress of the copy by comparing the used amount on the target disk to what's on the source disk, and also by opening the text file to see what files have been copied so far.  The easy way is to click on the "view" icon on the far right of the flash line in the Main GUI, then click on copy.txt, assuming that's what you typed on the command line.

 

Figure several hours for each operation, but at least it runs in the background with no interaction needed.
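From a second session you can also watch the copy directly (a sketch, reusing the disk4/disk6 example from earlier):

# Follow the file list as rsync appends to it
tail -f /boot/copy.txt

# Compare used space on the source and destination disks
df -h /mnt/disk6 /mnt/disk4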

Link to comment

I have a new drive ready to go, and I have created my plan using the feedback from you guys.  Before I start, I am sharing the plan so you can tell me if I am missing anything.  I also have a couple of questions.

 

I have dual parity drives.  Do I need to do anything extra, like removing the 2nd parity drive before I start and putting it back at the end?

 

I have 8 data drives to convert, all of them 6TB.

 

I have added an 8TB drive (disk9), which will remain in the array when the conversion is complete.  This means I will have an extra 6TB drive at the end of the procedure, so the very last step for me is to remove the empty disk8.  I don't have any more room in my case for another data drive, so one must be removed at the end of this process.

 

Because I am on Unraid 6.3.5, I am using the instructions (with help from this thread) at https://lime-technology.com/wiki/File_System_Conversion

 

Share based, no inclusions, preserving parity

*Use physical server Console

*No New Config needed for this process

*Boot into SAFE mode, no plugins and no dockers running, no mover

 

1. Start copying from the console – from disk1 to the new XFS disk9:

 

rsync -arv /mnt/disk1/ /mnt/disk9 >/boot/copy.txt

 

2. After the copy completes, run the verify command and check the output file (if successful, no files should be listed):

 

rsync -narcv /mnt/disk1/ /mnt/disk9 >/boot/verify.txt

 

3. Stop the array, change the format type of the now-empty disk1 to XFS, start the array, verify that only the empty disk wants formatting, then format it.

 

4. COPY (not move) the next reiserfs disk to the freshly formatted XFS disk, verify the copy, stop the array, change the desired format type of the disk you just copied from, start the array, verify that only one disk wants formatting.  Repeat until all drives are converted.

 

Copy Plan:

disk1 to disk9
disk2 to disk1
disk3 to disk2
disk4 to disk3
disk5 to disk4
disk6 to disk5
disk7 to disk6
disk8 to disk7

 

disk8 is now empty and can be removed from the array.

After removing the old 6TB disk8, do I need to move disk9 to the disk8 slot, or does it matter?

 

Thanks!

Link to comment
What about my dual Parity drives?  Is that an issue?

No

 

I guess the rest of my instructions / Plan look ok? 

Yes, but you'll need to do a New Config and re-sync parity when you remove disk8; you can take the opportunity to move disk9 to disk8 at that time.  Alternatively, you can clear (zero) the disk before removing it and maintain parity.

 

 

Link to comment
1 minute ago, Switchblade said:

ok good, so I can leave disk9 as is at the end, even though I won't have a disk8

 

No need to move the disk physically; you have the optional choice to "rename" disk9 to disk8.

 

2 minutes ago, Switchblade said:

What about my dual Parity drives?  Is that an issue?

 

Same story.  Physical placement of the parity drives is not relevant, but you CANNOT change parity 1 to parity 2 or vice versa – the two parity disks contain different information.

Also keep in mind that changing the logical disk order (e.g. swapping two disks) requires a rebuild of parity 2.  Parity 1 is a plain XOR across the data disks, so their order doesn't affect it, but parity 2 is computed using each disk's slot position, so reordering changes it.

 

6 minutes ago, Switchblade said:

I guess the rest of my instructions / Plan look ok? 

 

Your plan is sound.

Link to comment

Great, thanks guys!  Kids are watching a movie right now, but I will reboot into Safe mode later today and start the conversion.

 

I think I will just erase or clear the contents of disk8 at the end to keep parity. 

 

For the New Config – it's been a while since I did that.  I stop the array, remove disk8, move disk9 to slot 8, then do the New Config and tick the box to keep the existing configuration, then start the array – is that correct?

Link to comment

Thanks for the info and link.  I only want to remove the old disk8 (which will still have all the files on it, as I only COPY the data).  I don't care about zeroing it unless that is necessary; I can just delete the files, or perhaps reformat disk8 to clear it.  My goal is to remove disk8 and keep parity.  I will preclear the old disk8 later on and save it as a spare.

 

In the shrink array instructions, why do I need steps 7 and 8?  If I have already deleted all the files, the drive should be empty and parity good.

 

  1. Make sure that the drive you are removing has been removed from any inclusions or exclusions for all shares, including in the global share settings.
  2. Make sure the array is started, with the drive assigned and mounted.
  3. Make sure you have a copy of your array assignments, especially the parity drive. You may need this list if the "Retain current configuration" option doesn't work correctly.
  4. It is highly recommended to turn on reconstruct write, as the write method (sometimes called 'Turbo write'). With it on, the script can run 2 to 3 times as fast, saving hours!

    In Settings -> Disk Settings, change Tunable (md_write_method) to reconstruct write.

  5. Make sure ALL data has been copied off the drive; drive MUST be completely empty for the clearing script to work.
  6. Double check that there are no files or folders left on the drive.

    Note: one quick way to clean a drive is reformat it! (once you're sure nothing of importance is left of course!)

  7. Create a single folder on the drive with the name clear-me - exactly 7 lowercase letters and one hyphen (see the sketch after this list)
  8. Run the clear an array drive script from the User Scripts plugin (or run it standalone, at a command prompt).
    • If you prepared the drive correctly, it will completely and safely zero out the drive. If you didn't prepare the drive correctly, the script will refuse to run, in order to avoid any chance of data loss.
    • If the script refuses to run, indicating it did not find a marked and empty drive, then very likely there are still files on your drive. Check for hidden files. ALL files must be removed!
    • Clearing takes a loooong time! Progress info will be displayed, in 6.2 or later. Prior to 6.2, nothing will show until it finishes.
    • If running in User Scripts, the browser tab will hang for the entire clearing process.
    • While the script is running, the Main screen may show invalid numbers for the drive, ignore them. Important! Do not try to access the drive, at all!
  9. When the clearing is complete, stop the array
  10. Go to Tools then New Config
  11. Click on the Retain current configuration box (says None at first), click on the box for All, then click on close
  12. Click on the box for Yes I want to do this, then click Apply then Done
  13. Return to the Main page, and check all assignments. If any are missing, correct them. Unassign the drive(s) you are removing. Double check all of the assignments, especially the parity drive(s)!
  14. Click the check box for Parity is already valid, make sure it is checked!
  15. Start the array! Click the Start button, then the Proceed button on the warning popup
  16. Parity should still be valid, however it's highly recommended to do a Parity Check
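
As an illustration of steps 6 and 7, checking that the drive is really empty (including hidden files) and creating the marker folder might look like this (a sketch – disk8 is the drive being removed in this thread):

# List anything left on the drive, including hidden files; no output means empty
find /mnt/disk8 -mindepth 1 | head

# Create the marker folder the clearing script looks for
mkdir /mnt/disk8/clear-me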

 

Thanks again for your help.  I'm trying to do good planning so I don't f*ck this up.  :)

Link to comment
