SOLVED - 2 issues | Unresponsive server + disks falling out during reboot

Magmanthe · May 5, 2021

Hey,

So I have this weird issue that usually happens around the parity check and/or a reboot of the server.

Yesterday this happened again. I upgraded the server to 6.9.2 (but this issue has been with the machine for a while, so it has nothing to do with 6.9.x), and after that I rebooted the server.

Server spins up again nicely, and logs in as normal. Then after a few minutes (maybe an hour), it becomes totally unresponsive.

Playing files from the server does not work, accessing it via file explorer does not work, the WebGUI is unresponsive, and the locally connected screen/mouse/keyboard also becomes completely unresponsive.

Only "solution" I have found is a hard reboot, pressing the reset button on the chassis.

Then once the server turns back on, what happened previously was that one of the disks fell out of the array. Red X in the Main-screen with "Device is disabled, Contents emulated". This happened to my smallest Disk 4 and my workaround to deal with this was just to stop array, take out the disk, start the array. Stop the array again, include the disk and rebuild it.

Now yesterday this happened again after the update, but when I reboot the server now, it is no longer my Disk4 that fell out, but my Parity 1 and Disk 1 doing this, which is weird, cause that's new.

I have activated the syslog-server from Settings, but that log is worthless, it only says

May 4 20:39:38 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May 4 20:39:45 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May 5 11:37:10 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May 5 11:37:16 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May 5 13:33:42 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May 5 13:33:48 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock

Any help for how to find out what is actually happening here would be much appreciated. Where can I find more detailed logging of what is going on?

//magmanthe

Edited June 8, 2021 by Magmanthe

trurl · May 5, 2021

Go to Tools-Diagnostics and attach the complete Diagnostics ZIP file to your NEXT post in this thread.

42 minutes ago, Magmanthe said:

syslog-server

Might tell us something if you get it after crash.

Magmanthe · May 5, 2021

I know you said to add the full zip, but I see there are plaintext PW's in there in the Go-file, so you ain't getting that file from the config-folder.

Any other files that might have plaintext-identifiable info that I will remove?

//Magmanthe

trurl · May 5, 2021

7 minutes ago, Magmanthe said:

plaintext PW's in there in the Go-file

Something wrong there. Stock go file only contains

#!/bin/bash
# Start the Management Utility
/usr/local/sbin/emhttp &

Magmanthe · May 5, 2021

Well that depends on your setup, doesn't it?
Cause I'm using auto-login with FTP and a keyfile, and the full FTP address/username/PW is in there as well.. So you know, not overly enthusiastic for the whole world to see that

I'll delete the Go-file and add the ZIP

magnas-diagnostics-20210505-1455.zip

trurl · May 5, 2021

Nothing obvious in those. Have you done memtest?

May  5 13:43:01 magnas root: Fix Common Problems: Warning: Share AudioBooks is set for both included (disk3) and excluded (disk1) disks ** Ignored
May  5 13:43:01 magnas root: Fix Common Problems: Warning: Share TV is set for both included (disk1) and excluded (disk2,disk3,disk4) disks ** Ignored

Include means ONLY, Exclude means EXCEPT. Never any reason to set both. Your AudioBooks share Includes/Excludes isn't even consistent. Don't ignore FCP unless you know why you should.

trurl · May 5, 2021

15 minutes ago, Magmanthe said:

Well that depends on your setup, doesn't it?

Obviously you knew why. I didn't have any way to know. Several people have been hacked and had malicious code inserted in their go so I was concerned for you.

Magmanthe · May 5, 2021

Yeah, I am aware of the include/exclude thing, but that is also a bit of my OCD: I am forcing specific data onto specific disks without any possible spill-over. So I will respectfully disagree; there are scenarios where you use both include and exclude. But this is also beside the point. Disk inclusion or exclusion for certain shares does not have anything to do with a locked-up system with disks falling out of the array? (assumption)

No, have not run memtest.. maybe I give it a whirl and see what it spits out.

I'll come back in a few days after MemTest is finished.

//Magmanthe

Edited May 5, 2021 by Magmanthe

Magmanthe · May 5, 2021

quick update..

I know to check sticks separately, but did a quick check with all 4 (just as a sanity check), and it spit out errors..

So there is for sure something wrong with one (or more), so I'll pull the sticks and check them one-by-one..
Thanks for the tips, didn't even consider it to be a RAM-error...

//Magmanthe

Edited May 5, 2021 by Magmanthe

trurl · May 5, 2021

14 minutes ago, Magmanthe said:

there are scenarios where you use both include and exclude

No there aren't. You are misunderstanding the settings and how they work.

Your setting for AudioBooks means use ONLY disk3 and also use everything EXCEPT disk1. If you only want it to use disk3 then just saying Include disk3 is all that is needed, nothing needs to be excluded.
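To make the ONLY/EXCEPT logic concrete, here is a minimal Python sketch. It is only an illustration under an assumed model, not Unraid's actual code: an empty Include list is treated as "all disks", and Exclude is then subtracted from that. The function name `eligible_disks` and the disk names are made up for the example.

```python
def eligible_disks(all_disks, include=(), exclude=()):
    """Return the disks a share may write new files to.

    Assumed model (not Unraid source code): Include means ONLY these
    disks (empty = all disks), Exclude means all disks EXCEPT these.
    """
    candidates = [d for d in all_disks if not include or d in include]
    return [d for d in candidates if d not in exclude]


disks = ["disk1", "disk2", "disk3", "disk4"]

# Include alone already restricts the share to disk3.
print(eligible_disks(disks, include=["disk3"]))

# Adding Exclude disk1 on top changes nothing in this model.
print(eligible_disks(disks, include=["disk3"], exclude=["disk1"]))
```

In this model both calls give the same single-disk result, which is why setting Exclude in addition to Include adds nothing.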

trurl · May 5, 2021

20 minutes ago, Magmanthe said:

without any possible spill-over

Nothing ever spills over with regard to these settings. The only scenario where "overflow" happens is with cache yes or prefer shares overflowing to the array when cache has less than Minimum Free.

Of course, these settings only control what happens to new files. They won't have any effect on where existing files are already.

trurl · May 5, 2021

Include disk3 already excludes disk1 and all other disks that are not disk3, except of course for caching.

If you think something else happened at some time it may have had to do with trying to move files. Include/exclude won't have any effect on that situation.

See #2 here:

Magmanthe · May 5, 2021

Okay.

But for all intents and purposes there is nothing wrong with doing it, as long as the include and exclude rules don't overlap.

but let me ask you then.

If audiobook is set to include disk 3.

and disk 3 is full, what happens when I shove in MORE audiobooks? Which disk does it begin to put that data into?
Chronologically the first disk?
the next disk in line after Disk 3?
None, as I set include disk 3 and thus if full, it just discards the data?

//Magmanthe

trurl · May 5, 2021

4 hours ago, Magmanthe said:

None, as I set include disk 3 and thus if full, it just discards the data?

The write will fail

Magmanthe · May 5, 2021

Hi again,

So I've now run MemTest on all 4 sticks, and wouldn't you know... 1 of them was really bad.

3 passed fine without any errors, but the last is garbage as you can see from the pix.. Luckily I did have a spare so no biggie..

So thanks for pointing me in that direction much appreciated.. Didn't think of that at all..

However, when starting the server now, the Parity 1 disk and Disk 1 are still marked as disabled for some reason.. Anything else I can do?

//Magmanthe

Edited May 5, 2021 by Magmanthe

trurl · May 5, 2021

7 hours ago, Magmanthe said:

my workaround to deal with this was just to stop array, take out the disk, start the array. Stop the array again, include the disk and rebuild it.

This isn't a workaround. It is the way you get disks enabled again.

Unraid disables a disk when a write to it fails, because the disk is then no longer in sync with the array. Once a disk is disabled, Unraid does not use the physical disk again. Instead, it uses the emulated disk from the parity calculation: reads are emulated by reading all other disks, and writes are emulated by updating parity. (A disabled parity disk just isn't used, since it holds no data and can't be accessed directly.)

So, the disks have to be rebuilt to get them back in sync.
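A toy Python sketch of that emulation, assuming plain single XOR parity (a second parity disk uses a different calculation that is not shown here). The disk contents and the helper name `xor_blocks` are invented for illustration; this is not how Unraid is implemented internally.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


# Three toy data disks, each holding a single 4-byte block.
disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3 = b"\xaa\xbb\xcc\xdd"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk1 gets disabled: a read is emulated from parity plus all other disks.
emulated_disk1 = xor_blocks([parity, disk2, disk3])
assert emulated_disk1 == disk1

# A write to the disabled disk is emulated by updating parity only.
new_disk1 = b"\xff\x00\xff\x00"
parity = xor_blocks([new_disk1, disk2, disk3])

# Rebuilding runs the same calculation and stores the result on the disk.
rebuilt_disk1 = xor_blocks([parity, disk2, disk3])
print(rebuilt_disk1 == new_disk1)
```

The physical disk1 never sees the emulated write, so its old contents are stale; only a rebuild brings it back in line with parity.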

If by "take out the disk" you mean physically remove it, that isn't necessary. You just need to start the array with the disk unassigned, then reassigning it will rebuild to that disk.

Magmanthe · May 5, 2021

Yeah, no I don't take them physically out of the server..

I remove them in software from the array while it is stopped, then start the array. Then I stop the array again and add them back in.

Okay, so that's how I have been doing it before too, so I know how to do it..
However, earlier it was always only 1 disk that was "lost" (disabled). and it was also not a parity disk, so rebuilding it was not an issue.

Now it is a parity disk and a regular disk.
Do I build them back into the array at the same time, or take 1 disk at a time?

//Magmanthe

Edited May 5, 2021 by Magmanthe

trurl · May 5, 2021

You can do both at once. Rebuilding parity is just like rebuilding data, it is the same parity calculation.

Magmanthe · May 5, 2021

Rebuild in progress.. Now we play the waiting game

Thanks a bunch for the input and help, @trurl much appreciated...

Edited May 5, 2021 by Magmanthe

Magmanthe · May 6, 2021

So I'm back with an update..

I came back from work today to check the rebuild progress, and lo and behold, the "unresponsiveness" is back.

System is all locked up, both from the WebGUI and from the directly connected mouse/keyboard/screen.

However, instead of hard-resetting and power-cycling the system right now, I will leave it on until tomorrow morning (Friday), just in case the actual parity and disk rebuild is still going on in the background. From experience I know it should take around 1 day and ~3-5 hours, and I started it at around 22 yesterday.. So around midnight tonight (if it is still going) it will be done.

However it does not seem that the RAM-sticks are at fault then, as I now have 4 fully functional sticks from what I tested yesterday.

Any particular BIOS settings that might interfere? (I remember reading about AMD and C-states that might make the system unresponsive, but I'm running Intel so it shouldn't be that.)

Other HW-issues like a MB error perhaps?

I turned CPU graphics on in the BIOS because of the locally connected screen via HDMI (or maybe it was DP, but anyway, directly to the I/O of the MB), with a mouse/keyboard. The screen there is also "stuck" on the login screen (as you can see); after the system locked up (sometime during the night, or during the daytime when I was at work), that too became unresponsive..

Any help is appreciated

//Magmanthe

trurl · May 7, 2021

Is Syslog Server still setup? If so zip it up and attach

Magmanthe · May 7, 2021

Yeah, bunch of stuff happening now.. (all bad)..

I know UNRAID is not a backup solution per se.. but compared to what I used to have (everything on a disk in my computer), being in the server "protected" by 2 parity disks, I did consider the data somewhat more secure than having it in my PC.

However, with what is happening now, I'm copying out as much data as I can just in case... Attaching latest syslog (without go-file).

Woke up this morning, and there was no change, so I cut the power to the server, and then turned it back on.

During boot it came to the login screen, and after that it reported lots of XFS errors. I tried restarting, and the same happened again.

Restarted again; this time I chose to start in safe mode without any plugins and it started, but on the array screen every disk was missing from the overview 😱

I pressed the reboot-button (in gui) and this time it started "somewhat normally".

However the Parity 1 disk is still not valid and my Disk 1 is being emulated. My disk 1 says "unmountable: not mounted".

So before I do anything now I'm copying out the data to some external drives..

//Magmanthe

magnas-diagnostics-20210507-1717.zip

JorgeB · May 7, 2021

You can either try to repair the filesystem on the emulated disk1 or see if the actual disk mounts correctly with UD; it should, since it looks healthy.

Magmanthe · May 7, 2021

I'm guessing UD means Unassigned devices?

As you can see, in the system I have
1 x 6 TB
2 x 4 TB
1 x 1 TB

I also have 2 x 4 TB (that was intended as replacements for the 2x4TB in the system, as they are old and have like 50k hours on them).

So for now the plan is to copy out data from the 2 x 4TB and the 1 TB (this is the most crucial data). The 6TB only has TV so nothing important.

Once that is done I will try any and all mitigations to fix the issue.

I am somewhat unsure how, but I'll try Rebuilding first and if that does not work, maybe some other ideas..

Last resort is just wiping everything and starting a new build which would suck due to all the nice customization I've done (hours of tinkering and following SpaceInvader-guides)..

My suspicion is that it might be some HW-error that is not being collected by the syslog (mb gone haywire, PSU or cables fucking things up? HBA-card?)..

Once (like 15 years ago) I had a MB in my first computer, where the transistors on it started bulging from the printboard, so I've seen MB's just crapping out before..

//Magmanthe

JorgeB · May 7, 2021

28 minutes ago, Magmanthe said:

I am somewhat unsure how, but I'll try Rebuilding first and if that does not work, maybe some other ideas..

First fix the filesystem, no point in rebuilding if it can't be fixed or there's evident data loss, especially when rebuilding on top of the old disk.
