
[Hardware Error]: event severity: fatal - Unraid 6.10rc2



So my server has been acting up lately, and I don't know what's wrong.
The logs show a bunch of hardware errors, but I still don't know exactly what is wrong, so could someone please help me decipher my logs so I can fix it!
Sometimes one thread gets pinned at 100% utilization, and it wanders around the whole CPU, switching threads every now and then. htop shows something about smbd, and the only way to get it to stop is by running "fuser -km /mnt/user".
I also have a really hard time just rebooting it. The array takes 7-10 minutes to stop, and when it boots up again it always runs a parity check, as if it had been an "unclean" shutdown, although it wasn't.
I haven't had to power cycle it by holding down the power button yet, but it feels like whatever is wrong is just getting worse.
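"fuser -km /mnt/user" kills every process that has something open on the filesystem mounted at /mnt/user (-m selects the mount, -k sends SIGKILL by default). Running "fuser -vm /mnt/user" without -k lists those processes instead. As a read-only way to see which processes (smbd or otherwise) are holding the share before killing anything, here is a minimal Python sketch. It assumes a Linux /proc filesystem and root privileges (otherwise other users' processes are skipped). open_file_holders is a hypothetical helper name, and it only checks open file descriptors whose path lies under the given directory, so it approximates rather than reproduces fuser -m (it ignores working directories, memory maps, etc.).

```python
import os


def open_file_holders(prefix: str) -> dict[int, str]:
    """Return {pid: command name} for processes with an open fd under prefix.

    Read-only approximation of `fuser -m`: only open file descriptors are
    checked, not cwd or mmaps. Needs root to inspect other users' processes.
    """
    prefix = os.path.realpath(prefix)
    holders: dict[int, str] = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        pid = int(entry)
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # fd closed while we were looking
            if target == prefix or target.startswith(prefix + os.sep):
                try:
                    with open(f"/proc/{pid}/comm") as f:
                        holders[pid] = f.read().strip()
                except OSError:
                    holders[pid] = "?"
                break  # one matching fd is enough for this process
    return holders


if __name__ == "__main__":
    # Hypothetical usage: list who is holding the user share open.
    for pid, name in sorted(open_file_holders("/mnt/user").items()):
        print(pid, name)
```

If the output consistently names smbd, that at least narrows the lockup down to Samba rather than, say, a Docker container or the backup job.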

It's also showing 70-90% CPU utilization on the dashboard, but only 5-12% in htop, and the temps agree with htop as well (near idle temps).
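Two tools can report very different CPU figures if they compute utilization differently. One common difference is whether time spent waiting on disk I/O (iowait) counts as busy or idle. This is only an assumption here: how the Unraid dashboard computes its number isn't stated in this thread. The Python sketch below reads the aggregate "cpu" line of /proc/stat twice and reports the figure both ways. The field order (user, nice, system, idle, iowait, irq, softirq, steal) is the standard Linux layout; the helper names are hypothetical.

```python
import time

# Standard Linux order of the aggregate "cpu" line in /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_times(path: str = "/proc/stat") -> list[int]:
    """Aggregate CPU counters (clock ticks) for user..steal."""
    with open(path) as f:
        fields = f.readline().split()
    if fields[0] != "cpu":
        raise ValueError("unexpected first line in " + path)
    # guest/guest_nice are already included in user/nice, so stop at steal.
    return [int(x) for x in fields[1:1 + len(FIELDS)]]


def utilization(before, after, count_iowait_as_busy=False) -> float:
    """Percent of the sampling interval the CPUs were busy."""
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    if total <= 0:
        return 0.0
    # idle is index 3, iowait is index 4 in FIELDS.
    idle = delta[3] if count_iowait_as_busy else delta[3] + delta[4]
    return 100.0 * (total - idle) / total


if __name__ == "__main__":
    t0 = read_cpu_times()
    time.sleep(1.0)
    t1 = read_cpu_times()
    print(f"iowait counted as idle: {utilization(t0, t1):.1f}%")
    print(f"iowait counted as busy: {utilization(t0, t1, True):.1f}%")
```

With synthetic counters where, over one interval, user and system each advance 10 ticks, idle 60 and iowait 120, the first mode reports 10% and the second 70%. That is the same kind of gap described above, but whether iowait is actually the cause on this server is untested.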

Please help! I'm at your mercy right now. The only option I have left other than you is doing a clean install of everything, and that would suck majorly.

belk-diagnostics-20220124-2251.zip

6 minutes ago, Squid said:

Just so we're on the same page, what hardware errors are you seeing that you're concerned about?  Only thing that really pops out to me on the diagnostics is that your Linux VM refused to shut down...

What?!
That VM wasn't even on when I tried to take the array offline.

I just skimmed through the log, saw a bunch of "hardware error - fatal" entries, and didn't understand what they were.
So nothing else is wrong?
(The Linux VM is an Android TV VM, and it doesn't respond to a normal "shut off this VM" command, so I have to force it off unless I'm using it: a prompt pops up asking whether I'm sure I want to power off, and it won't power off unless "yes" is selected.)
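To see how many of those hardware error entries there really are, and when they started, the severity lines can be tallied instead of skimmed. Below is a small Python sketch. It assumes classic syslog lines that begin with a 15-character "Mon DD HH:MM:SS" timestamp and the "[Hardware Error]: event severity: ..." wording from this thread's title. hardware_error_summary is a hypothetical helper, and the regex deliberately ignores anything printed before "[Hardware Error]".

```python
import re
import sys
from collections import Counter

# Matches the wording in this thread's title, e.g.
# "... kernel: [Hardware Error]: event severity: fatal"
SEVERITY_RE = re.compile(r"\[Hardware Error\]:\s+event severity:\s+(\w+)")


def hardware_error_summary(lines):
    """Count hardware-error events per severity and note when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = SEVERITY_RE.search(line)
        if m:
            sev = m.group(1).lower()
            counts[sev] += 1
            # Classic syslog lines start with a 15-char "Mon DD HH:MM:SS" stamp.
            first_seen.setdefault(sev, line[:15])
    return counts, first_seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python summarize.py syslog.txt
    with open(sys.argv[1], errors="replace") as f:
        counts, first_seen = hardware_error_summary(f)
    for sev, n in counts.most_common():
        print(f"{sev}: {n} event(s), first at {first_seen[sev]}")
```

Lining the "first at" timestamps up against the times of the freezes is one way to check whether the errors and the lockups are actually related.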

But why am I getting a parity check at every boot?


"deadbeef" was referenced in the syslog dump

Jan 24 22:42:48 BelK kernel: [Hardware Error]:   000000d0: deadbeef deadbeef deadbeef 0006c004  ................

It's a "joke" by the programmers

In your case, if there's nothing in the system event log etc. in the BIOS, I'm not sure where to go or what it means.
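0xDEADBEEF is a well-known "magic" hex value that programmers write into memory as a recognizable filler, so in a dump it typically marks unused or uninitialized space rather than meaningful error data. The short Python sketch below splits a hex-dump line like the one above into its offset and 32-bit words and reports the byte offsets that hold that filler. The helper names are hypothetical, and the line format is assumed from the single example quoted in this thread.

```python
FILLER = 0xDEADBEEF  # well-known "magic" filler value
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hexdump_line(line: str) -> tuple[int, list[int]]:
    """Split 'OFFSET: w0 w1 w2 w3  ascii' into (offset, [32-bit words])."""
    # Drop any "... kernel: [Hardware Error]:" prefix; keep only the dump part.
    body = line.split("]:", 1)[-1]
    offset_str, rest = body.split(":", 1)
    words = []
    for tok in rest.split():
        if len(tok) != 8 or not set(tok) <= HEX_DIGITS:
            break  # reached the ASCII column
        words.append(int(tok, 16))
    return int(offset_str.strip(), 16), words


def filler_offsets(line: str) -> list[int]:
    """Byte offsets of the 32-bit words equal to 0xDEADBEEF."""
    offset, words = parse_hexdump_line(line)
    return [offset + 4 * i for i, w in enumerate(words) if w == FILLER]


if __name__ == "__main__":
    sample = ("Jan 24 22:42:48 BelK kernel: [Hardware Error]:   000000d0: "
              "deadbeef deadbeef deadbeef 0006c004  ................")
    print([hex(o) for o in filler_offsets(sample)])  # ['0xd0', '0xd4', '0xd8']
```

In the quoted line only the last word (0x0006c004) is something other than filler, which fits the reading that the deadbeef entries themselves carry no diagnostic information.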

In reply to Squid (1/26/2022 at 3:30 PM):

I think I'm starting to home in on what's wrong: it's kind of a user-error, "perfect storm" situation.
I've got Duplicati running backups of "important files" from Unraid to my Synology NAS (in case I ever have to rebuild the whole thing). It was running fine when I set it up, but Duplicati was now showing a bunch of errors: it was trying to back up appdata while Docker was running, resulting in a weird catch-22 where Docker processes got killed and tried to restart while Duplicati was messing with everything at the same time.
I removed the appdata folder from Duplicati and ran a test (edit: it just finished with 0 errors).

So it was user error on my part for even selecting the appdata folder (must've had a brainfart), and the problems escalated as a result.


Update: ever since I set up the syslog server, nothing has happened... I did manage to induce an SMB lockup and had to use "fuser -km /mnt/user" to release it, then rebooted, and that was 5 days ago.

It's been running 24/7 since.
Same daily backups to my NAS, same weekly backups of appdata, and the same daily reboot of my network (routers, switches, etc.).

I'm starting to suspect that it's some kind of voltage drop on my 230 V mains line. I don't have a UPS yet (it's at the top of my to-get list), but I've had other things freeze at the same time.
My Synology NAS and main router have frozen at the same time as my Unraid server froze and stopped working, but that doesn't always happen either.

It could just as easily be that the Unraid server is throwing network errors and crashing other things.

I've been in this apartment for 6 years and never had this problem before, but they recently started up a major battery factory next door.

2 weeks later...

I had configured syslog wrong, but it's working now.
This is the output between the 2 latest crashes:

For some reason I'm unable to upload my syslog.
I'm getting "Sorry, an unknown server error occurred when uploading this file. (Error code: -200)"

I tried changing the name and adding .txt, and still got the same error.


No idea why it didn't work; I had to copy-paste from the running syslog into a new file to get it to upload.

Luckily I read the syslog (mirror to flash) from the flash drive (I turned off the server, put the flash drive into another PC, copied the syslog, and read it), so I knew where it stopped.
So syslog.belk.txt is an exact copy up to the point where it crashed this morning.
As usual, it took almost all of my LAN-connected devices down with it and locked them up until I rebooted the server...

syslog.belk.txt
