Recent troubles shutting down/rebooting.



The last 3 times I've done a reboot, the system hangs and is unable to complete the process.  I'm forced to perform a hard power off with the power button.  My monitor attached to the server displaying the syslog shows it doing the process and creating the log zip.  After that, it gets stuck with a stop error.  I will attempt to capture a photo of it next time.  I'm just curious if anybody has had any issues of late?  It seems like I've had this problem ever since I installed the new "My Servers" plugin to the UnRaid host and my server, but I'm not sure if it's something else going on.  Would appreciate some assistance if possible.

 

Attached are the most recent captured logs.

kyber-diagnostics-20210427-1935.zip


I'm really struggling now. I've been running parity checks day and night (literally, each one takes about 1.5 days). Every time I run one, I get sync errors. The first time I had this problem was after using Unbalance to scatter some folders off a drive that was 100% full. After the transfer, I started a parity check. It returned over 18k sync errors. So I rebooted (always a good troubleshooting step with computers, right?). That's when I started this thread, because the reboot failed to shut down.

Once back up, it started running another parity check. When I last checked it, I had about 8 hours to go, and almost 12k sync errors again. I left to go visit my brother, then came home. The server was unresponsive, and the "My Servers" tab in the forums showed it was offline. I turned on the display connected to my server, and all I had was a black screen. So... something crashed? I forced a reboot with the power switch again.

It rebooted and started another parity check, but this time it's reading at around 35-40 MB/s compared to the 150-180 MB/s I'm used to seeing. UGH, what is going on? I paused the check, thinking I could reboot "safely" and resume it. Well, I just rebooted, successfully, and when it came back up all I saw was the popup saying the last parity check finished successfully but with errors. It also said there was no log or something to display the errors, and that's it.

What should I do now? I'm tempted to start another parity check. I want to make sure my data is safe. I just don't understand what's happened. Would really appreciate some expert advice.


Important to consider -- if you don't have SMART data coming back from your drives, you don't know if they're failing. If they're failing and you're running long-stretch operations like parity sync checks over and over, you're thrashing a failing drive, which might make it fail faster, and destroy your data.

 

Maybe throttle back for a moment, and let's do some forensics. First, do you see valid SMART data for your disks? I have a few non-default UI options, but go to the Main tab, click on a disk's label (e.g. Disk 1) and locate the Attributes and Identity information.

 

image.thumb.png.f023827036160ef467469241ed89e6da.png

 

If that sort of information is missing, you're going to need to set up SMART settings that are arcane and difficult and the source of much frustration, OR connect your devices to a good quality USB adapter (or directly to another PC, etc) and examine their SMART data with a valid tool.

 

Windows - CrystalDiskInfo is free: https://crystalmark.info/en/software/crystaldiskinfo/

Linux - smartctl from the terminal, or just ssh to your unRAID box, since it has all the same tools.

 

If the data is there, check for phrases like Pending Sector, Uncorrectable, Bad Block -- look for FAILING_NOW or IN_THE_PAST anywhere on the left, as some error triggers can be persistently flagged. Ignore HUGE numbers on attributes like "Raw Read Error Rate" -- it's actually a set of values encoded into one number.
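If you'd rather do that check from the terminal, here is a rough sketch. It filters the attribute table printed by `smartctl -A`, keeping rows where the WHEN_FAILED column is set, or where one of a few sector-health attributes has a non-zero raw value. The function name `smart_flags`, the list of attributes, and the `/dev/sdX` placeholder are my own assumptions for illustration; attribute names vary between drive vendors, so treat an empty result as "nothing obvious", not as a clean bill of health.

```shell
# Read `smartctl -A` output on stdin and print suspicious attribute rows
# as: NAME WHEN_FAILED RAW_VALUE
# (smart_flags and the watched attribute list are illustrative assumptions.)
smart_flags() {
  awk '
    # Attribute rows start with a numeric ID and have at least 10 columns.
    $1 ~ /^[0-9]+$/ && NF >= 10 {
      name = $2; failed = $9; raw = $10
      watch = (name ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Reported_Uncorrect)$/)
      # Flag anything that has ever failed, or a watched counter above zero.
      if (failed != "-" || (watch && raw + 0 > 0))
        print name, failed, raw
    }'
}

# Usage on the server (replace sdX with the real device):
#   smartctl -A /dev/sdX | smart_flags
```

A non-zero Reallocated_Sector_Ct on an old drive is not automatically fatal, but a count that keeps climbing between checks is worth acting on.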

 

From just a cursory glance at your logs, I see nzbget segfaulting CONSTANTLY, which is a concern. It could be disk corruption, bad RAM, etc., but let's stop thrashing those disks and verify they're alive first, and I'll try to help from there.


Thanks for the responses. I’m at work now but when I get home I’ll repost diagnostic logs. All of the drives are showing good SMART data, at least since I last looked. I also looked again this morning before leaving and saw that the speed is back where I’m used to seeing it.

Edited by hansolo77
6 hours ago, hansolo77 said:

the drives are showing good smart data

I apologize - I'm just starting out trying to help out here, and I missed the SMART data in the diagnostics you attached. It's there, and seems okay.

 

 

 

Regarding crashing when you're not around - one thing I've done in the past is open a terminal window directly attached to the server in question, and run:
 

setterm -blank 0
dmesg -w

 

This will disable blanking the terminal (usually) and start a process to watch kernel messages and print them to you. Sometimes a crash happens too quickly for the logs to be written, and often the last message is the important one but never gets flushed to disk. If your server is headless, an ssh session also works but is less likely to catch a "dying gasp".

 

Another thing you might try, though it isn't likely to do much in your specific case, is disabling processor C-States. I've seen a few random smatterings of intermittent crashes caused by C-States, and it doesn't hurt to try. Run this command on the server, either terminal or ssh. It will disable all C-States on all processors, but only until you reboot. This way it's safe to test, unlikely to hurt, and you don't have to do anything to undo it.

for cpus in /sys/devices/system/cpu/cpu*/cpuidle/state*/disable; do echo 1 > $cpus; done
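To confirm the loop above took effect, something like this sketch can list each idle state of the first CPU along with its `disable` flag (1 means disabled). The function name `list_cstates` and the optional base-directory argument are my own additions for illustration; it reads the standard `name` and `disable` files under the same sysfs path, and only looks at cpu0 on the assumption that all cores were changed the same way.

```shell
# Print "<state name> disable=<0|1>" for every cpuidle state of cpu0.
# $1 is an optional base directory (defaults to the real sysfs location).
list_cstates() {
  base="${1:-/sys/devices/system/cpu}"
  for s in "$base"/cpu0/cpuidle/state*; do
    # Skip silently if the glob matched nothing or the files are unreadable.
    [ -r "$s/name" ] || continue
    printf '%s disable=%s\n' "$(cat "$s/name")" "$(cat "$s/disable")"
  done
}

# Usage on the server:
#   list_cstates
```

If nothing is printed at all, the kernel is probably not exposing cpuidle for that CPU, and the disable loop would have had nothing to write to either.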

 


About to come home. Only 6 sync errors so far. I checked the drives' SMART status on break. Disk 13 has some reallocated sectors but I don’t know how long ago that happened; it’s an old drive and the warranty has expired on it. It could be C-states, does that relate to sleep? I had tried to use sleep once but turned it off a while ago, long before this. Should I stop the parity check and post the diagnostics now, or wait till it’s done? It’s about 68%. Here’s a quick screenshot from my phone of their status.

46B28AC4-55BC-4172-ACF3-AA66FDDFD066.png


You can find when the last few SMART errors happened by checking in the webUI under the disk's Self-Test section, click Show under SMART error log. If you're not confident interpreting it, we'll see the same data in the diagnostics.zip once it's uploaded.

 

C-States are low-power states that cores cycle in and out of very rapidly during normal use, to save power and reduce heat. Sleep is a different thing, and comes with a whole bundle of problems -- enough that I find myself often wishing they'd (Linux, specifically) just do away with it in non-portable contexts.

 

As for when to upload - the only case waiting would serve is if a disk has undiscovered errors towards its end, but given what I've heard so far I'm confident there's no explicit need to wait.


Here are the current diagnostics. The parity check is still running. It was at 6 sync errors, then I was looking over the syslog in the diagnostics zip to see if I saw anything that might be a culprit. While doing that, it looks like it had another sync correction, so I'll just add that line here:

 

Apr 29 20:42:09 Kyber kernel: md: recovery thread: P corrected, sector=16930232816

 

It's too bad the log doesn't indicate what drive that sector is on.  Would help identify if the errors are all being detected on the same drive or not.
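The sector numbers themselves can at least be pulled out of the syslog to see how many corrections there were and what range they span. Here's a rough sketch that matches lines in the same format as the one above; the function names `parity_sectors` and `parity_summary` are made up for illustration, and I'm assuming every correction is logged in that "corrected, sector=" form. It doesn't identify a disk, it only shows where on the array the corrections landed.

```shell
# Print the sector number from every "corrected, sector=N" recovery line on stdin.
parity_sectors() {
  sed -n 's/.*md: recovery thread: .*corrected, sector=\([0-9][0-9]*\).*/\1/p'
}

# Summarise: how many corrections, and the lowest and highest sector seen.
parity_summary() {
  parity_sectors | sort -n | awk '
    NR == 1 { min = $1 }
    { max = $1 }
    END {
      if (NR) printf "count=%d first=%s last=%s\n", NR, min, max
      else print "count=0"
    }'
}

# Usage on the server (assuming the live log is at /var/log/syslog):
#   parity_summary < /var/log/syslog
```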

kyber-diagnostics-20210429-2036.zip

 

Also, I checked Drive 13's SMART error history, and it doesn't show anything: "No Errors Logged"

Edited by hansolo77
14 minutes ago, hansolo77 said:

It's too bad the log doesn't indicate what drive that sector is on.  Would help identify if the errors are all being detected on the same drive or not.

No way to know. Parity is just an extra bit that can be used to figure out if all the other bits are in agreement, or it allows a missing bit to be calculated from all the other bits. No way to know which disk is out of sync so parity is the one you have to correct so it is in sync with all the other disks again.
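To make that concrete, here is a toy sketch of the single-parity idea using one byte per "disk" and XOR. The values and the function name `xor_parity` are purely hypothetical, and this is a simplified model, not how the md driver is actually implemented (a second parity disk uses a different calculation). It shows both points: a missing value can be rebuilt from the others, and a mismatch looks identical whether a data disk or the parity disk is the one that's wrong.

```shell
# XOR all arguments together; with parity included, 0 means "in sync".
xor_parity() {
  p=0
  for b in "$@"; do
    p=$(( p ^ b ))
  done
  echo "$p"
}

# Three hypothetical data bytes, one per disk.
d1=0xA5; d2=0x3C; d3=0x0F
parity=$(xor_parity "$d1" "$d2" "$d3")

# Rebuild a "missing" disk 2 from the remaining disks plus parity.
rebuilt_d2=$(xor_parity "$d1" "$d3" "$parity")

# Flip one bit on disk 1, or one bit on parity: the check result is the same.
check_bad_data=$(xor_parity $(( d1 ^ 1 )) "$d2" "$d3" "$parity")
check_bad_parity=$(xor_parity "$d1" "$d2" "$d3" $(( parity ^ 1 )))

printf 'parity=0x%02X rebuilt_d2=0x%02X bad_data=%d bad_parity=%d\n' \
  "$parity" "$rebuilt_d2" "$check_bad_data" "$check_bad_parity"
```

Both checks come out non-zero and identical, which is why the only thing a correcting check can do is rewrite parity to match the data disks.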

 

If you have some reason to suspect a particular disk, then a sync error might be a further reason to suspect that disk.

 

We usually suggest running non-correcting parity checks until you determine you actually have sync errors, then if you think there may be some problem causing those you can try to correct it and run another non-correcting check. In the absence of any other information, all you can do in the end is correct parity so the array is all in sync again.

 

And in the case of unclean shutdowns, the parity check you get will be non-correcting. And if there are parity errors from unclean shutdown it is usually reasonable to expect that parity is the disk out-of-sync and a correcting check needs to be done.

 

 

 


So what you're saying is.. I should stop this check (which says "Sync errors corrected: xxxx") because it's correcting the errors, and restart another check that's NON-Correcting?  That seems unhelpful to me.  If it's correcting errors, then that means there ARE errors, right?  Doing a non-correcting scan would just say there are errors and I need to run another scan that IS correcting.  Maybe I'm just confused here, it's been a long day.  I just don't see what the benefit of a non-correcting scan is.

 

As it is right now, I'm at 75.4% completed... 20-hours in, 6 hours to go, and I now have 6189 sync errors corrected.

 

I suspected the nzbget docker before @codefaux pointed it out, so I've actually had all of my dockers disabled except for PLEX.  I don't know what that problem is, but none of my dockers are on the array.. they're on a dedicated unassigned device.  But nzbget does interact with the array.. so I've disabled it.  PLEX is also on its own dedicated unassigned device, but it should only be reading media from the array, not creating new files, etc.  I even think the DVR is set to use the plex drive and not the array, although I could be wrong on that.  I can completely disable it as well if you think that's potentially related to the problem.


From just skimming your logs: Since you are using VMs, are you using PCIe passthrough?

 

If no, a good candidate to try is always to turn off VT-d (AMD-Vi, IOMMU, or whatever your BIOS calls it) in the BIOS and see if that improves your situation any.

Some hardware just really hates being part of the IOMMU.

 

If you are using PCIe passthrough, you can try editing your boot options (Main tab, click Flash) for the "Unraid OS" boot option. After the append initrd=/bzroot, try adding "iommu=pt". (See attached image, you can just edit in that text field). Click apply and reboot. iommu=pt will only put devices in the IOMMU that you actually pass through, so hopefully not the problematic device (your NIC, it seems, from the one fault).
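With that change the line in the boot option should end up reading `append initrd=/bzroot iommu=pt`. After the reboot you can check that the kernel actually picked it up by looking at /proc/cmdline. A small sketch, where the helper name `has_kernel_param` is just something made up for this post:

```shell
# Return success if parameter $1 appears as a whole word in command line $2.
has_kernel_param() {
  case " $2 " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on the server after rebooting:
#   has_kernel_param iommu=pt "$(cat /proc/cmdline)" && echo "iommu=pt is active"
```

Matching on whole words avoids a false positive from a different option that merely ends in the same text.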

 

//EDIT: I did this on mine because I know my NIC shows instability with the iommu on every device.

 

d0395550_2021-04-29_18-48-19.png

Edited by Doridian

Go ahead and complete the correcting check.

 

The point of a non-correcting check, and the reason automatic checks for unclean shutdown, or scheduled parity checks, should be non-correcting, is in case there is a problem like a bad disk or even a bad connection during the check. This can be determined by looking at other information like SMART reports or syslog, for example, or even just looking at the webUI to see if there are any I/O errors on any disk. If that happens and the parity check is correcting, then it is actually changing parity which may have been valid, when what you really need to do is fix the other problem and leave parity alone.

12 minutes ago, Doridian said:

From just skimming your logs: Since you are using VMs, are you using PCIe passthrough?

 

If no, a good candidate to try is always to turn off VT-d (AMD-Vi, IOMMU, or whatever your BIOS calls it) in the BIOS and see if that improves your situation any.

Some hardware just really hates being part of the IOMMU.

 

If you are using PCIe passthrough, you can try editing your boot options (Main tab, click Flash) for the "Unraid OS" boot option. After the append initrd=/bzroot, try adding "iommu=pt". (See attached image, you can just edit in that text field). Click apply and reboot. iommu=pt will only put devices in the IOMMU that you actually pass through, so hopefully not the problematic device (your NIC, it seems, from the one fault).

 

//EDIT: I did this on mine because I know my NIC shows instability with the iommu on every device.

 

d0395550_2021-04-29_18-48-19.png

 

 

I had hoped to one day do a PCIe passthrough to allow my Windows 10 VM to be a gaming setup so I could stream with something like Moonlight to my Pi, etc.  I've NEVER set it up actually, because I was under the impression my video card (nVidia Quadro P2200) wouldn't provide much 3D capability.  The only other passthrough I've ever set up was a USB port so I could quickly attach a flash drive and dump some files onto the Windows VM.  I do have multiple NICs.. the motherboard has 2 - rj45 and wifi.  The wifi is disabled in bios.  The other NIC I have is a dedicated 10G card that is direct attached between the server and another computer, but the other computer never works right with the NIC unless I disable/re-enable the NIC.  That card isn't being passed through either.  I just checked my VM config for Windows 10 (the only VM installed ATM) and it doesn't have anything set as a passthrough.  So I will try your suggestion of the `iommu=pt`.

 

 

12 minutes ago, trurl said:

Go ahead and complete the correcting check.

 

The point of a non-correcting check, and the reason automatic checks for unclean shutdown, or scheduled parity checks, should be non-correcting, is in case there is a problem like a bad disk or even a bad connection during the check. This can be determined by looking at other information like SMART reports or syslog, for example, or even just looking at the webUI to see if there are any I/O errors on any disk. If that happens and the parity check is correcting, then it is actually changing parity which may have been valid, when what you really need to do is fix the other problem and leave parity alone.

 

 

That makes a little sense.  I think I have my scheduled monthly checks set to auto correct though.  Maybe I should disable that.  I'll let it finish the current correcting scan, then tomorrow I'll do another non-correcting scan.  Thanks for the help guys.  Hopefully this will all get sorted out.


I did a little bit of log checking.  The NIC that is throwing up logs is ETH1..  that corresponds to the 10G NIC I have.  It looks like the log is just indicating when my other PC turns on, and the link is established.  I don't think that would have any impact on things.  I know my original post for this was about the hang ups during reboot, but I think I have a bigger problem on my hands with this parity issue.  Should I create a new thread to support that, or is this one still ok?

 

I wonder if it would be a good idea to just remove the parity drives from the array, then re-add them so the parity is just completely rebuilt.  I've not done anything with the array in days, so it's not like I've got anything recent to lose.

 

Something else I noticed was back when I started using Unbalance to scatter files off a drive that was full.  After all that was done, and this parity trouble started creeping up, I did a "Fix Common Problems" extended scan.  It reported that there are actually a lot of duplicated files on multiple drives.  Might that be why the parity is struggling?

8 minutes ago, hansolo77 said:

It reported that there are actually a lot of duplicated files on multiple drives.  Might that be why the parity is struggling?

No, as parity has no concept of files - it works at the raw sector level.   Parity is only affected by problems reading or writing sectors on the drives.

 

You DO, however, want to sort this out as you can get strange behaviour.    When multiple copies of a file exist, unRaid shows in a User Share the first one it finds (it searches pools and then drives in order).   This can lead to a case where you delete a file and nothing appears to happen (and potentially the file contents change) as the deletion just ‘uncovers’ the next copy found.
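If you want to see the duplicates for yourself from the terminal, a rough sketch like the one below lists every relative file path that exists under more than one of the directories you give it. The function name `find_duplicate_paths` is made up for illustration, and it assumes no file names contain newlines. It only compares paths, not contents, so it can't tell you which copy is the one you want to keep.

```shell
# Print relative file paths that appear under two or more of the given roots.
find_duplicate_paths() {
  for root in "$@"; do
    # Run in a subshell so the cd does not leak; list files relative to root.
    ( cd "$root" 2>/dev/null && find . -type f )
  done | sort | uniq -d | sed 's|^\./||'
}

# Usage on the server (the [0-9] keeps /mnt/disks out of the match;
# add pool paths to the list if they also hold share data):
#   find_duplicate_paths /mnt/disk[0-9]*
```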

 


Did you have no shutdown problems previously?

Do you ever copy between user share and disk share manually (or perform other manual operations)? This may cause problems if not handled properly (the number of parity sync errors is abnormally high).

Maybe start by checking the main hardware first: run UEFI MemTest86 (enable testing on all CPUs).

 

https://www.memtest86.com/download.htm

Edited by Vr2Io
2 hours ago, Vr2Io said:

copy between user share and disk share manually

Since user shares are just different views of the same files as on the disks, this can cause data loss due to Linux trying to overwrite the file it is trying to read. It shouldn't cause any parity sync errors though.

2 hours ago, Vr2Io said:

Do you ever copy between user share and disk share manually (or perform other manual operations)? This may cause problems if not handled properly (the number of parity sync errors is abnormally high).

 

24 minutes ago, trurl said:

Since user shares are just different views of the same files as on the disks, this can cause data loss due to Linux trying to overwrite the file it is trying to read. It shouldn't cause any parity sync errors though.

 

I don't do this.  I always access files through the shares, never via /mnt/disk**.  The only exception was when I ran the Unbalance plugin/scatter.  Maybe that screwed me all up.
