SOLVED - 2 issues | Unresponsive server + disks falling out during reboot



Hey,

 

So I have this weird issue that usually happens around a parity check and/or a reboot of the server.


Yesterday this happened again. I upgraded the server to 6.9.2 (but the problem has been with the machine for a while, so it has nothing to do with 6.9.x) and rebooted it afterwards.

Server spins up again nicely, and logs in as normal. Then after a few minutes (maybe an hour), it becomes totally unresponsive.

Playing files from the server does not work, accessing it via File Explorer does not work, the WebGUI is unresponsive, and the locally connected screen/mouse/keyboard also become completely unresponsive.

The only "solution" I have found is a hard reboot, pressing the reset button on the chassis.

 

Then once the server turns back on, what happened previously was that one of the disks fell out of the array. Red X in the Main-screen with "Device is disabled, Contents emulated". This happened to my smallest Disk 4 and my workaround to deal with this was just to stop array, take out the disk, start the array. Stop the array again, include the disk and rebuild it.

 

 

Now yesterday this happened again after the update, but when I rebooted the server this time, it was no longer my Disk 4 that fell out but my Parity 1 and Disk 1, which is weird, because that's new.

I have activated the syslog server from Settings, but that log is worthless; it only says:

May  4 20:39:38 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  4 20:39:45 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 11:37:10 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 11:37:16 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 13:33:42 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 13:33:48 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
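
One way to read a log like this is to look for the silent stretches rather than the messages themselves. Below is a minimal Python sketch, not an Unraid tool, that parses the six lines above and lists every gap longer than an hour. The year (2021) and the one-hour threshold are assumptions, since the timestamps carry no year; the function names are made up for illustration.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2021  # assumption: these syslog timestamps do not include a year

LOG = """\
May  4 20:39:38 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  4 20:39:45 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 11:37:10 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 11:37:16 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 13:33:42 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
May  5 13:33:48 magnas /dev/log [info]: Server is up! /var/run/unraid-api.sock
"""

def parse_stamp(line):
    # The first 15 characters hold the "Mon dd HH:MM:SS" timestamp.
    # %b expects English month names in the default (C/English) locale.
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")

def silent_gaps(lines, threshold=timedelta(hours=1)):
    # Return (previous, next) timestamp pairs that are further apart than threshold.
    stamps = [parse_stamp(line) for line in lines if line.strip()]
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > threshold]

gaps = silent_gaps(LOG.splitlines())
for start, end in gaps:
    print(f"nothing logged between {start} and {end} ({end - start})")
```

The gaps only show when nothing was written; they do not say why, so a log like this marks the boots but not the cause of the lock-up.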

 

 

Any help for how to find out what is actually happening here would be much appreciated. Where can I find more detailed logging of what is going on?

 

 

//magmanthe
 

 

Edited by Magmanthe
Link to comment

Nothing obvious in those. Have you done memtest?

 

May  5 13:43:01 magnas root: Fix Common Problems: Warning: Share AudioBooks is set for both included (disk3) and excluded (disk1) disks ** Ignored
May  5 13:43:01 magnas root: Fix Common Problems: Warning: Share TV is set for both included (disk1) and excluded (disk2,disk3,disk4) disks ** Ignored

Include means ONLY, Exclude means EXCEPT. Never any reason to set both. Your AudioBooks share Includes/Excludes isn't even consistent. Don't ignore FCP unless you know why you should.

Link to comment

Yeah, I am aware of the include/exclude, but that is also a bit of my OCD: I am forcing specific data onto specific disks without any possible spill-over. So I will respectfully disagree; I think there are scenarios where you use both include and exclude. But this is also beside the point. Disk inclusion or exclusion for certain shares does not have anything to do with a locked-up system with disks falling out of the array? (assumption)

 

 

No, I have not run memtest. Maybe I'll give it a whirl and see what it spits out.

I'll come back in a few days after MemTest is finished.

 

//Magmanthe

Edited by Magmanthe
Link to comment

quick update..

I know to check the sticks separately, but I did a quick check with all 4 (just as a sanity check), and it spit out errors.


So there is for sure something wrong with one (or more), so I'll pull the sticks and check them one-by-one..
Thanks for the tips, didn't even consider it to be a RAM-error...

 

//Magmanthe

Edited by Magmanthe
Link to comment
14 minutes ago, Magmanthe said:

there are scenarios where you use both include and exclude

No there aren't. You are misunderstanding the settings and how they work.

 

Your setting for AudioBooks means use ONLY disk3 and also use everything EXCEPT disk1. If you only want it to use disk3 then just saying Include disk3 is all that is needed, nothing needs to be excluded.
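
To make the ONLY/EXCEPT reading concrete, here is a toy Python model of the rule as described in this thread. It is a sketch, not Unraid's actual allocation code: `allowed_disks` is a made-up helper, and treating an empty include list as "all disks" is an assumption.

```python
def allowed_disks(all_disks, include=(), exclude=()):
    """Toy model: include means ONLY these disks, exclude means all EXCEPT these."""
    # Assumption: an empty include list stands for "every disk in the array".
    base = set(include) if include else set(all_disks)
    return sorted(base - set(exclude))

ARRAY = ["disk1", "disk2", "disk3", "disk4"]

# AudioBooks as configured in the thread: include disk3 and exclude disk1.
audiobooks_both = allowed_disks(ARRAY, include=["disk3"], exclude=["disk1"])

# AudioBooks with only the include set.
audiobooks_include_only = allowed_disks(ARRAY, include=["disk3"])

# TV as configured in the thread: include disk1, exclude disk2, disk3 and disk4.
tv_both = allowed_disks(ARRAY, include=["disk1"], exclude=["disk2", "disk3", "disk4"])

print(audiobooks_both, audiobooks_include_only, tv_both)
```

In this model the exclude list adds nothing once an include list is set: both AudioBooks variants resolve to the same single disk.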

Link to comment
20 minutes ago, Magmanthe said:

without any possible spill-over

Nothing ever spills over with regard to these settings. The only scenario where "overflow" happens is with cache yes or prefer shares overflowing to the array when cache has less than Minimum Free.

 

Of course, these settings only control what happens to new files. They won't have any effect on where existing files are already.

Link to comment

Include disk3 already excludes disk1 and all other disks that are not disk3, except of course for caching.

 

If you think something else happened at some time it may have had to do with trying to move files. Include/exclude won't have any effect on that situation.

 

See #2 here:

 

 

Link to comment

Okay.

 

But for all intents and purposes there is nothing wrong with doing it, as long as the include and exclude rules don't overlap.

 

But let me ask you then.

If AudioBooks is set to include disk 3, and disk 3 is full, what happens when I shove in MORE audiobooks? Which disk does it begin to put that data onto?
Chronologically the first disk?
the next disk in line after Disk 3?
None, as I set include disk 3 and thus if full, it just discards the data?

 

//Magmanthe

Link to comment

Hi again,

 

So I've now run MemTest on all 4 sticks, and wouldn't you know... 1 of them was really bad.

3 passed fine without any errors, but the last is garbage as you can see from the pix.. Luckily I did have a spare so no biggie..

So thanks for pointing me in that direction :) much appreciated.. Didn't think of that at all..

 

 

However, when starting the server now, the Parity 1 disk and Disk 1 are still marked as disabled for some reason. Anything else I can do?

 

 

//Magmanthe

unraid_ram_error.png

unraid_disbaled_disk.png

Edited by Magmanthe
Link to comment
7 hours ago, Magmanthe said:

my workaround to deal with this was just to stop array, take out the disk, start the array. Stop the array again, include the disk and rebuild it.

This isn't a workaround. It is the way you get disks enabled again.

 

Unraid disables a disk when a write to it fails, because the disk is then no longer in sync with the array. Once a disk is disabled, Unraid stops using the physical disk. Instead, it uses the emulated disk: reads are emulated by reading parity and all the other disks, and writes are emulated by updating parity. (A disabled parity disk simply isn't used. It can't be accessed directly and holds no data of its own, so there is nothing to emulate.)
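
The emulation idea can be sketched in a few lines of Python. This is only an illustration under the assumption of a single XOR parity disk; a second parity disk uses a different calculation that is not modelled here, and the tiny three-byte "disks" and the helper name `xor_blocks` are made up for the example.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equally sized byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three tiny data "disks" (hypothetical contents).
disk1 = bytes([0x12, 0x34, 0x56])
disk2 = bytes([0xAB, 0xCD, 0xEF])
disk3 = bytes([0x00, 0xFF, 0x10])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# disk1 is disabled: a read is emulated from parity plus all the other disks.
emulated_read = xor_blocks([parity, disk2, disk3])

# A write to the disabled disk is emulated by updating parity only.
new_disk1 = bytes([0x99, 0x88, 0x77])
parity = xor_blocks([new_disk1, disk2, disk3])

# A later rebuild recomputes the disk contents from parity and the other disks.
rebuilt_disk1 = xor_blocks([parity, disk2, disk3])

print(emulated_read == disk1, rebuilt_disk1 == new_disk1)
```

This is why the physical disk has to be rebuilt afterwards: the emulated writes exist only in parity, so the old contents of the disabled disk are stale.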

 

So, the disks have to be rebuilt to get them back in sync.

 

If by "take out the disk" you mean physically remove it, that isn't necessary. You just need to start the array with the disk unassigned, then reassigning it will rebuild to that disk.

Link to comment

Yeah, no I don't take them physically out of the server..

 

I remove them "in software" from the array while it is stopped, then start the array. Then I stop the array again and add them back in.


Okay, so that's how I have been doing it before too, so I know how to do it..
However, earlier it was always only 1 disk that was "lost" (disabled), and it was also not a parity disk, so rebuilding it was not an issue.

 

Now it is a parity disk and a regular disk.
Do I rebuild them into the array at the same time, or one disk at a time?

 

//Magmanthe

Edited by Magmanthe
Link to comment

So I'm back with an update..


I came back from work today to check the rebuild progress and, lo and behold, the unresponsiveness is back.

The system is completely locked up, both from the WebGUI and from the directly connected mouse/keyboard/screen.

 

However, instead of hard-resetting the system right now, I will leave it on until tomorrow morning (Friday), just in case the parity and disk rebuild is still going on in the background. From experience I know it should take around 1 day and 3-5 hours, and I started it at around 22:00 yesterday. So around midnight tonight (if it is still going) it will be done.

 

However it does not seem that the RAM-sticks are at fault then, as I now have 4 fully functional sticks from what I tested yesterday.

 

Are there any particular BIOS settings that might interfere? (I remember reading about AMD and C-states that might make the system unresponsive, but I'm running Intel, so it shouldn't be that.)

Other HW-issues like a MB error perhaps?

 

 

I turned on CPU graphics in the BIOS because of the locally connected screen (via HDMI, or maybe it was DP; either way, directly to the I/O of the MB), along with a mouse/keyboard. As you can see, that screen is also stuck on the login screen: after the system locked up (sometime during the night, or during the day when I was at work), it became unresponsive too.

 

Any help is appreciated :)

 

 

//Magmanthe

 

 

unraid_no_ping.png

unraid_loging_local.png

Link to comment

Yeah, bunch of stuff happening now.. (all bad)..

 

I know UNRAID is not a backup solution per se, but compared to what I used to have (everything on a disk in my computer), with the data in the server "protected" by 2 parity disks, I did consider it somewhat more secure than having it in my PC.

However, with what is happening now, I'm copying out as much data as I can, just in case... Attaching the latest syslog (without go-file).

 


Woke up this morning and there was no change, so I cut the power to the server and then turned it back on.

During boot it came to the login screen, and after that it reported lots of XFS errors. I tried restarting, and the same thing happened again.

I restarted again, and this time I chose to start in safe mode without any plugins. It started, but on the array screen every disk was missing from the overview 😱


I pressed the reboot-button (in gui) and this time it started "somewhat normally".

However the Parity 1 disk is still not valid and my Disk 1 is being emulated. My disk 1 says "unmountable: not mounted".

 

So before I do anything now I'm copying out the data to some external drives..

 

//Magmanthe

magnas-diagnostics-20210507-1717.zip

Link to comment

I'm guessing UD means Unassigned devices?

 

As you can see, in the system I have
1 x 6 TB
2 x 4 TB
1 x 1 TB

I also have another 2 x 4 TB (intended as replacements for the 2 x 4 TB in the system, as those are old and have about 50k hours on them).

So for now the plan is to copy out data from the 2 x 4TB and the 1 TB (this is the most crucial data). The 6TB only has TV so nothing important.


Once that is done, I will try any and all mitigations to fix the issue.

I am somewhat unsure how, but I'll try Rebuilding first and if that does not work, maybe some other ideas..

Last resort is just wiping everything and starting a new build which would suck due to all the nice customization I've done (hours of tinkering and following SpaceInvader-guides)..

 

My suspicion is that it might be some HW-error that is not being collected by the syslog (mb gone haywire, PSU or cables fucking things up? HBA-card?).. 

Once (like 15 years ago) I had a MB in my first computer where the transistors on it started bulging from the circuit board, so I've seen MBs just crap out before.

 

//Magmanthe

Link to comment
28 minutes ago, Magmanthe said:

I am somewhat unsure how, but I'll try Rebuilding first and if that does not work, maybe some other ideas..

First fix the filesystem. There's no point in rebuilding if it can't be fixed or there's evident data loss, especially when rebuilding on top of the old disk.

Link to comment
  • Magmanthe changed the title to SOLVED - 2 issues | Unresponsive server + disks falling out during reboot
