Frequent Crashing


Go to solution Solved by halorrr,

This has been happening on both 6.12.2 and now 6.12.3 as well. Sometimes it crashes in a way that it starts back up on its own; sometimes the physical power switch has to be pushed for it to come back on. Sometimes the crashes are very close together and sometimes they are far apart. I've attached the diagnostics as well as the syslogs that I was mirroring to the flash drive. There should be at least 3 crashes that I know about in the recorded syslogs.

 

Nothing really stood out to me when looking at the logs. The only thing I noticed that might be notable, though I'm not sure, is this line right before the unclean shutdown is reported:
 

Jul 15 12:43:27 Leviathan kernel: msr: Write to unrecognized MSR 0xc0010292 by zenstates (pid: 7992).
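
To pull out the lines logged just before each crash from a mirrored syslog, something like the following may help. It is a minimal Python sketch, assuming that every boot starts with the kernel's "Linux version" banner; the function name `lines_before_each_boot` and the `context` parameter are made up for illustration.

```python
from collections import deque
from typing import Iterable, List

# Assumption: the kernel logs a "Linux version ..." banner at the start of every boot.
BOOT_MARKER = "kernel: Linux version"


def lines_before_each_boot(lines: Iterable[str], context: int = 5) -> List[List[str]]:
    """For every boot banner that has earlier lines before it, return the
    last `context` lines that were logged ahead of that banner."""
    recent = deque(maxlen=context)
    result = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER in line and recent:
            # Whatever is in the buffer was written just before the reboot.
            result.append(list(recent))
            recent.clear()
        recent.append(line)
    return result


# Usage (the file name is a placeholder for wherever the syslog was mirrored):
# with open("syslog") as fh:
#     for block in lines_before_each_boot(fh, context=10):
#         print("\n".join(block), "\n---")
```

If the same message shows up before several crashes it is worth a closer look. A line that appears only once, like the MSR write above, may just be a coincidence.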

 

The day before the crashes started, the cache drive momentarily filled up, which locked things out for about an hour. I'm not sure if that is significant or a coincidence. It has since been resolved, and the cache is only at 84% now.

 

Let me know if there is more information needed to figure out what is going on here.

syslog leviathan-diagnostics-20230715-1627.zip

Link to comment

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Link to comment

I have been having the same problem over the last couple of days. There were no hardware changes and the system ran stable for months, so I do not suspect the hardware.
The system crashes to a point where it needs a hard reboot. My system is on 6.12.3, and yesterday it crashed 3 times. The Num Lock key on the attached keyboard does not respond and the monitor shows a "no signal" message, so it is clearly not only the Unraid WebGUI that has crashed.

 

The only 2 lines of concern in the syslog (to me) are:

  • docker0: port 1(vethc08d6dd) entered disabled state.    --> this occurs a lot; I do not know if I need to be concerned and, if so, how to find the responsible Docker container
  • nginx: 2023/07/19 14:01:54 [alert] 6707#6707: worker process xxxx exited on signal 6 --> as far as I have learned from other posts, this only concerns the Unraid WebGUI
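
On the question of which container owns an interface like `vethc08d6dd`: one common approach is to compare the `ifindex` of each host-side veth with the `iflink` that each container reports for its own `eth0`. The Python below is only a sketch of that idea. It assumes the containers use the default bridge network, have an `eth0`, and ship a `cat` binary; the function names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Dict


def match_veths(host_ifindex: Dict[str, int],
                container_iflink: Dict[str, int]) -> Dict[str, str]:
    """Map host veth name -> container name by pairing the host interface's
    ifindex with the iflink reported from inside the container."""
    by_index = {idx: name for name, idx in host_ifindex.items()}
    return {by_index[link]: container
            for container, link in container_iflink.items()
            if link in by_index}


def collect_host_ifindex() -> Dict[str, int]:
    """Read the ifindex of every veth* interface on the host from sysfs."""
    return {p.name: int((p / "ifindex").read_text())
            for p in Path("/sys/class/net").glob("veth*")}


def collect_container_iflink() -> Dict[str, int]:
    """Ask each running container for the iflink of its eth0.
    Containers without `cat` or without eth0 are silently skipped."""
    names = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    links = {}
    for name in names:
        proc = subprocess.run(
            ["docker", "exec", name, "cat", "/sys/class/net/eth0/iflink"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0 and proc.stdout.strip().isdigit():
            links[name] = int(proc.stdout)
    return links


# Usage on the server itself:
# print(match_veths(collect_host_ifindex(), collect_container_iflink()))
```

Note that the veth name may change whenever a container restarts, so the name in an old log line may no longer exist by the time you look.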

Yesterday evening I booted in safe mode, with VMs and Docker enabled (and a minimal set running). This morning the system is still up.
So at this point the plugins are the most likely place to look. Below is the list of plugins installed on my system. Of these, ZFS Master and User Scripts were installed last; the other 2 were updated.

 

As there is nothing in the log files, I think I need to find the root cause by disabling the yellow-marked plugins one by one. Is there a smarter (less time-consuming) way?

-rw------- 1 root root   3415 Feb 12 04:40 unbalance.plg
-rw------- 1 root root   4861 Feb 16 20:26 intel-gpu-top.plg
-rw------- 1 root root   5629 Mar  3 04:40 dynamix.system.temp.plg
-rw------- 1 root root   7287 Mar  3 04:40 dynamix.system.stats.plg
-rw------- 1 root root   4572 Mar  3 04:40 dynamix.system.info.plg
-rw------- 1 root root   4765 Mar  3 04:40 dynamix.password.validator.plg
-rw------- 1 root root   6376 Mar  3 04:40 dynamix.cache.dirs.plg
-rw------- 1 root root   5349 Apr 16 07:21 unassigned.devices-plus.plg
-rw------- 1 root root  24691 May  6 04:40 dynamix.unraid.net.plg
-rw------- 1 root root   9094 May 20 13:27 unassigned.devices.preclear.plg
-rw------- 1 root root   7486 Jun  1 20:30 dynamix.file.manager.plg
-rw------- 1 root root   6302 Jun  8 12:13 NerdTools.plg
-rw------- 1 root root   4520 Jun 12 19:07 open.files.plg
-rw------- 1 root root   5129 Jul  4 06:40 ca.mover.tuning.plg
-rw------- 1 root root 103495 Jul  4 12:41 community.applications.plg
-rw------- 1 root root   9814 Jul  5 21:51 tips.and.tweaks.plg
-rw------- 1 root root   7165 Jul  8 13:29 appdata.backup.plg
-rw------- 1 root root   3974 Jul 16 15:05 zfs.master.plg
-rw------- 1 root root   7360 Jul 16 15:12 user.scripts.plg
-rw------- 1 root root  20839 Jul 16 21:55 fix.common.problems.plg
-rw------- 1 root root  90705 Jul 16 21:55 unassigned.devices.plg

syslog

Link to comment

We do have some overlap in the plugins we use. I'm currently in the trial-and-error phase myself and will report back when I find the culprit on my system. I currently have 50% of my plugins uninstalled and am waiting to see if a crash occurs. If it doesn't, I'll reintroduce 50% of the remaining 50%, and so on.

When I choose which ones to try in the next batch though I'm going to try and target some of our overlapping plugins since that feels like it might get me an answer faster.
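
The halving approach described above is essentially a bisection. As a rough Python sketch, under the assumptions that exactly one plugin is responsible and that each test run gives a reliable crash/no-crash answer (with intermittent crashes that is a big assumption), it could look like this. `find_culprit` and `crashes_with` are hypothetical names; `crashes_with` stands in for "enable only these plugins, wait a few days, and see".

```python
from typing import Callable, List, Optional, Sequence


def find_culprit(plugins: Sequence[str],
                 crashes_with: Callable[[List[str]], bool]) -> Optional[str]:
    """Halve the candidate list until a single plugin remains.

    Returns None if the full set does not crash at all, which would
    mean the plugins are not the cause.
    """
    candidates = list(plugins)
    if not candidates or not crashes_with(candidates):
        return None
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # Keep whichever half still reproduces the crash.
        candidates = half if crashes_with(half) else candidates[len(half):]
    return candidates[0]
```

For a 21-plugin list this needs about five test rounds after the initial confirmation, instead of up to 21 when disabling plugins one by one. The sketch tests each half in isolation, whereas the procedure above keeps the already cleared plugins installed; the narrowing logic is the same.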

Link to comment

@halorrr thank you. If we combine our efforts we might find the root cause faster.
I have just rebooted the server with all plugins enabled except user.scripts and zfs.master, as these are the plugins I installed last.
I also updated NerdTools to 2023.07.18, as this update is now offered. I will keep you informed.
 

Update 1: running with all plugins except user.scripts and zfs.master for 5:30 hrs now. I will keep you informed.

Update 2: after almost 9 hrs the system crashed. It seems user.scripts and zfs.master are not to blame.

Now back to safe mode first. Tomorrow I will continue the search.

 

Edited by Richard Aarnink
update due to new info on same topic
Link to comment

Interestingly I just had a restart this morning with the first 50% of plugins turned on (which is the group of things that I thought were less likely to be culprits too). 

 

The plugins in the group that was still installed when it crashed are:

appdata.backup.plg
ca.update.applications.plg
community.applications.plg
customtab.plg
disklocation-master.plg
docker.folder.plg
dynamix.file.manager.plg
dynamix.unraid.net.plg
fix.common.problems.plg
unassigned.devices-plus.plg
unassigned.devices.preclear.plg
unlimited-width.plg
user.scripts.plg

 

Which leaves the ones we have in common as:

appdata.backup.plg
community.applications.plg
dynamix.file.manager.plg
dynamix.unraid.net.plg
unassigned.devices-plus.plg
unassigned.devices.preclear.plg

Which is already a much more trimmed down list. 

 

Community Applications feels unlikely to be the culprit, so I'm going to remove everything else in the overlapping group and see if the crashes still happen. If they don't, I can try re-adding them bit by bit.
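
For anyone repeating this comparison, the overlap between two plugin lists is a plain set intersection. The Python sketch below uses the two lists exactly as posted in this thread; the variable names are made up.

```python
# Plugins still installed on halorrr's system when it crashed (as posted above).
halorrr_plugins = [
    "appdata.backup.plg", "ca.update.applications.plg",
    "community.applications.plg", "customtab.plg",
    "disklocation-master.plg", "docker.folder.plg",
    "dynamix.file.manager.plg", "dynamix.unraid.net.plg",
    "fix.common.problems.plg", "unassigned.devices-plus.plg",
    "unassigned.devices.preclear.plg", "unlimited-width.plg",
    "user.scripts.plg",
]

# Plugins from Richard's directory listing earlier in the thread.
richard_plugins = [
    "unbalance.plg", "intel-gpu-top.plg", "dynamix.system.temp.plg",
    "dynamix.system.stats.plg", "dynamix.system.info.plg",
    "dynamix.password.validator.plg", "dynamix.cache.dirs.plg",
    "unassigned.devices-plus.plg", "dynamix.unraid.net.plg",
    "unassigned.devices.preclear.plg", "dynamix.file.manager.plg",
    "NerdTools.plg", "open.files.plg", "ca.mover.tuning.plg",
    "community.applications.plg", "tips.and.tweaks.plg",
    "appdata.backup.plg", "zfs.master.plg", "user.scripts.plg",
    "fix.common.problems.plg", "unassigned.devices.plg",
]

# Plugins present on both systems.
common = sorted(set(halorrr_plugins) & set(richard_plugins))
for name in common:
    print(name)
```

Run on the lists as posted, the raw intersection also contains fix.common.problems.plg and user.scripts.plg, which the shorter list above leaves out. user.scripts.plg had already been ruled out on Richard's system by the run without it.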

Link to comment
1 hour ago, Richard Aarnink said:

Update 3:


Currently running without these plugins:

  • appdata.backup.plg
  • dynamix.file.manager.plg
  • dynamix.unraid.net.plg
  • unassigned.devices.preclear.plg

It is too early to tell if one of these is causing the crashes. The system has been running for 2.5 hrs now.
@halorrr do you have any updates ? 

 

 

 Honestly I'm confused. I removed ALL of my plugins and the crashes continued. I just this very moment turned safe mode back on, to confirm that it still helps.

The documentation just says that safe mode means no plugins. I'm wondering if safe mode makes any other changes that might be helping, though.

Link to comment

I think it might also not load any packages that are in the ‘extras’ folder on the flash drive (if you have this folder at all).

Link to comment

@halorrr After 13hrs the system crashed again, so it seems I am in the same boat and we both are still having similar issues. 

I just started in safe mode. Now I am planning to wait at least 24hrs before I call it a victory and assume that the system is stable in safe mode.
Currently a minimal set of Docker containers (MQTT and Plex) and one VM (Home Assistant) are running.

Do you have any updates?

Link to comment

For some reason I'm now crashing in safe mode, which I wasn't before. So it's back to the drawing board on testing. I'm currently doing a 24-hour run of memtest86 to re-verify that there are no RAM issues; 14 hours in and all passing so far.

Link to comment

Yesterday evening I noticed that Unraid was reporting that the USB flash drive was disconnected.
I found errors using dosfsck. I had replaced the USB flash drive earlier this month with a new Philips 32 GB one when I upgraded to 6.12.
I did this because the previous one was rather old.

 

Now I know to keep away from the Philips-branded 32 GB USB flash drives.
@halorrr you might want to investigate the USB flash drive as well if the memory is not the culprit.

 

The server is still running, no crashes. All plugins are enabled and runtime is 23 hrs.

 

 

Edited by Richard Aarnink
Link to comment

I'm at a loss at this point as to what to try. I replaced the flash drive and still no luck. If it is related to some piece of my hardware, I have no idea how to narrow down which one.

Link to comment

I am still not out of the woods myself (if that is a saying). I replaced my flash drive with a Transcend JetFlash 600 8 GB. The system crashed within a couple of hours, so I am unsure whether the flash drive is the issue.

 

Now I am trying the Samsung Bar Pro 32 GB. I use a USB2 port; no other devices are connected via USB2. I have created a completely new install and did not copy anything over from the original USB flash drive. Currently there are no plugins active. The system is rebuilding parity now. I want to keep the system running for a couple of days before I add plugins.

I am curious whether this has solved it, but I am unsure. It certainly is not a good feeling that there is nothing in the logs.

 

 

 

 

 

Link to comment

Let me know how that goes. I'm thinking I'm going to run an even longer memtest than I did last time, because I'm running out of things I know how to test.

Link to comment

@halorrr thanks for letting me know. I created a new USB flash drive using a Samsung Bar Pro 32 GB and did not copy any files from the old USB device. Today I “downgraded” the cache drive from ZFS to btrfs. This should enable me to downgrade Unraid if needed.

The system has been up for 1 day and 23 hrs now, still on 6.12.3. I will keep you informed.

 

In your case, if the downgrade to 6.11.5, which ran fine in the past, still does not solve the problem, consider what is left to look at:

 

1. Any changes or upgrades to the BIOS?
2. The USB flash drive: is it still the same stick? Is it in a USB2 slot? (USB3 causes problems in some cases.)


Maybe do an extensive test using the tools from the Ultimate Boot CD: https://www.pcwdld.com/ultimate-boot-cd-ubcd-review/

 

 

 

Edited by Richard Aarnink
Link to comment

@halorrr with the new USB flash drive I have been running for 4 days now with no problems.
I will keep the system as-is for now, as I am off on vacation in a couple of days.

 

Not all plugins are running, but the Docker containers and the VM for Home Assistant are.
I can provide more details if you need more info.

In the logs I see nginx crashes, but this seems to be a known issue. The system, however, is stable for now.

 

So a fresh USB stick and a new installation (no copying of files from the old USB stick) seem to have solved the issue for me.

 

regards Richard

Link to comment

Interesting, I already am on a new USB stick. I'm going to use another USB just to get the untouched versions of files and replace them one by one to see if I can find the culprit. Would be nice if I can narrow that down for others.

Link to comment

OK, so I'm still finalizing testing, but it looks like the culprit was the `disk.cfg` file. I'm almost at 24 hours without crashes since restoring it to defaults. I'm now trying to understand what the differences between the files are.

Can anyone help me understand whether any of these values seem odd?

 

startArray:
Default: "no", old config: "yes"

poll_attributes:
Default: "30", old config: "1800"

 

nr_requests:
Default: "auto", old config: "128"

 

diskSpindownDelay.0:
Default: "-1", old config: "0"

 

diskExport.1 (this same one repeats for the other disks too):
Default: "-", old config: "e"

 

diskSecurity.1 (this same one repeats for some of the other disks too, but not all):
Default: "public", old config: "secure"

 

diskWriteList.1 (this same one repeats for some of the other disks too, but not all):
Default: "", old config: "<my username>"
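
A comparison like the one above can be produced mechanically. The Python below is a minimal sketch that assumes both files consist of simple `key="value"` lines; that format, the function names and the `<unset>` placeholder are assumptions for illustration, not something confirmed from the Unraid documentation.

```python
from typing import Dict, Tuple


def parse_cfg(text: str) -> Dict[str, str]:
    """Parse key="value" lines; blank lines, '#' comments and lines
    without '=' are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings


def diff_cfg(default: Dict[str, str],
             old: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {key: (default_value, old_value)} for every key whose value
    differs, including keys present in only one of the two files."""
    missing = "<unset>"
    return {key: (default.get(key, missing), old.get(key, missing))
            for key in sorted(set(default) | set(old))
            if default.get(key) != old.get(key)}


# Usage (both paths are placeholders):
# default = parse_cfg(open("fresh/disk.cfg").read())
# old = parse_cfg(open("backup/disk.cfg").read())
# for key, (d, o) in diff_cfg(default, old).items():
#     print(f'{key}: default "{d}", old config "{o}"')
```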

Edited by halorrr
Link to comment
  • 2 weeks later...
On 8/5/2023 at 4:21 PM, halorrr said:

Ok so still finalizing testing but it looks like the culprit was the `disk.cfg` file. I'm almost at 24 hours without crashes since restoring that to defaults. I'm now trying to understand what the differences between the files are.

How did you get on with this, @halorrr? Was resetting disk.cfg the solution? Although I don't know for certain, I would guess that the entries you ask about are the equivalents of the per-disk preferences you get when clicking on the disk name in the "Main" page of the web interface. How did you reset it? Copy the file over from a fresh install?

Link to comment

Hi, I am still struggling. The server ran for 17 days without issues on 6.12.3.

As I was on vacation, the Unraid WebGUI was not used.

When I got back and started using the WebGUI, the server died in less than 2 days.

I upgraded to 6.12.4 rc 18. It ran for 2.5 days and just died an hour ago.
There are no nginx crashes in the syslog; only the “Stop running nchan processes” message appears a couple of times.

I am reverting back to 6.11.

Link to comment

@ScottAS2 @Richard Aarnink The disk.cfg ended up being another dead end, but I did narrow it down at last. For me it was in fact a hardware issue, and it seemed to be mostly related to the iGPU on my Intel chip. I was able to replicate a crash the second any Plex stream that used the transcoder started. So I turned off hardware-accelerated transcoding and the crashes mostly stopped, but I was still getting the odd crash on occasion.

 

I then opened my server to get the serial number from my chip. While doing so I unplugged the monitor from the server and forgot to plug it back in afterward. For some reason, not having the monitor plugged in resolved all remaining crashes, and for the first time I went 7 days with no crashes.

 

My server is currently off as I RMAed the Intel chip and am waiting for a new one, but it seems, at least for me, that it was heavily tied to graphics usage. I'm curious to see, when I get the new chip, whether hardware acceleration and such will work again, or if there is something else besides the processor itself that causes the crashes when this is in use. In my reading, though, it looks like a lot of people have had crashes for various reasons when working with Intel's iGPU.

For what it's worth, my server uses the Intel i9-12900K.

Link to comment
