Frequent Crashing

July 15, 2023

This has been happening on both 6.12.2 and now 6.12.3 as well. Sometimes it crashes in a way that it starts back up on its own; sometimes it needs the physical power switch to be pushed to come back on. Sometimes the crashes are very close together and sometimes they are far apart. I've attached the diagnostics as well as the syslogs that I was mirroring to the flash drive. There should be at least 3 crashes that I know about in the recorded syslogs.

Not much stood out to me when looking at the logs. The only thing I noticed that may be notable, though I'm not sure, is this line right before the unclean shutdown is reported:

Jul 15 12:43:27 Leviathan kernel: msr: Write to unrecognized MSR 0xc0010292 by zenstates (pid: 7992).

The day before the crashes started, the cache drive momentarily filled up, which locked things out for about an hour. I'm not sure if that is significant or a coincidence; it has since been resolved and the cache is only at 84% now.

Let me know if there is more information needed to figure out what is going on here.

syslog leviathan-diagnostics-20230715-1627.zip

July 16, 2023

Community Expert

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning on the other services one by one.

July 19, 2023

Author

Thanks for the advice. It didn't crash when I had it in safe mode with everything disabled. I turned Docker on this morning, and so far still no crashes. I'll update here again when I find what the culprit is.

July 20, 2023

I have been having the same problem over the last couple of days. There were no hardware changes and the system ran stably for months, so I do not suspect the hardware.
The system crashes to a point where it needs a hard reboot. My system is on 6.12.3, and yesterday it crashed 3 times. The Num Lock key on the attached keyboard does not respond and the monitor shows a no-signal message, so it is clear that more than just the Unraid WebGUI has crashed.

The only 2 lines of concern in the syslog (to me) are:

docker0: port 1(vethc08d6dd) entered disabled state. --> this occurs a lot; I do not know whether I need to be concerned and, if so, how to find the responsible Docker container
nginx: 2023/07/19 14:01:54 [alert] 6707#6707: worker process xxxx exited on signal 6 --> which, as far as I have learned from other posts, only concerns the Unraid WebGUI
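
On the "which container" question: on Linux, a container's `eth0` is one end of a veth pair, and its `iflink` value equals the `ifindex` of the host-side `veth…` interface. Below is a minimal Python sketch of that lookup. It assumes the `docker` CLI is available on the server, that the containers use a bridge network with an `eth0` interface, and that each container image ships `cat`. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def host_veth_indexes(sys_net="/sys/class/net"):
    """Return {ifindex: name} for every veth* interface on the host."""
    indexes = {}
    for dev in Path(sys_net).glob("veth*"):
        try:
            indexes[int((dev / "ifindex").read_text())] = dev.name
        except (OSError, ValueError):
            continue  # interface vanished or file unreadable
    return indexes


def container_peer_indexes():
    """Return {container name: iflink of its eth0} for running containers."""
    names = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    peers = {}
    for name in names:
        # Fails harmlessly for containers without `cat` or without eth0.
        proc = subprocess.run(
            ["docker", "exec", name, "cat", "/sys/class/net/eth0/iflink"],
            capture_output=True, text=True,
        )
        value = proc.stdout.strip()
        if proc.returncode == 0 and value.isdigit():
            peers[name] = int(value)
    return peers


def match_veth_to_container(veths, peers):
    """Pair each host veth name with the container whose eth0 points at it."""
    return {veths[idx]: name for name, idx in peers.items() if idx in veths}
```

On the server this would be called as `match_veth_to_container(host_veth_indexes(), container_peer_indexes())`. A veth that has already been torn down because its container stopped will not show up, so the lookup only helps while the container is running.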

Yesterday evening I booted in safe mode, with VMs and Docker enabled (and a minimal set running). This morning the system is still up.
At this point the most likely place to look is the plugins. Below is the list of plugins installed on my system. Of these, ZFS Master and User Scripts were installed last; the other 2 were updated.

As there is nothing in the log files, I think I need to find the root cause by disabling the yellow-marked plugins one by one. Is there a smarter (less time-consuming) way?

-rw------- 1 root root 3415 Feb 12 04:40 unbalance.plg
-rw------- 1 root root 4861 Feb 16 20:26 intel-gpu-top.plg
-rw------- 1 root root 5629 Mar 3 04:40 dynamix.system.temp.plg
-rw------- 1 root root 7287 Mar 3 04:40 dynamix.system.stats.plg
-rw------- 1 root root 4572 Mar 3 04:40 dynamix.system.info.plg
-rw------- 1 root root 4765 Mar 3 04:40 dynamix.password.validator.plg
-rw------- 1 root root 6376 Mar 3 04:40 dynamix.cache.dirs.plg
-rw------- 1 root root 5349 Apr 16 07:21 unassigned.devices-plus.plg
-rw------- 1 root root 24691 May 6 04:40 dynamix.unraid.net.plg
-rw------- 1 root root 9094 May 20 13:27 unassigned.devices.preclear.plg
-rw------- 1 root root 7486 Jun 1 20:30 dynamix.file.manager.plg
-rw------- 1 root root 6302 Jun 8 12:13 NerdTools.plg
-rw------- 1 root root 4520 Jun 12 19:07 open.files.plg
-rw------- 1 root root 5129 Jul 4 06:40 ca.mover.tuning.plg
-rw------- 1 root root 103495 Jul 4 12:41 community.applications.plg
-rw------- 1 root root 9814 Jul 5 21:51 tips.and.tweaks.plg
-rw------- 1 root root 7165 Jul 8 13:29 appdata.backup.plg
-rw------- 1 root root 3974 Jul 16 15:05 zfs.master.plg
-rw------- 1 root root 7360 Jul 16 15:12 user.scripts.plg
-rw------- 1 root root 20839 Jul 16 21:55 fix.common.problems.plg
-rw------- 1 root root 90705 Jul 16 21:55 unassigned.devices.plg

syslog

July 20, 2023

Author

We do have some overlap in the plugins we use. I'm currently in the trial-and-error phase myself and will report back when I find the culprit on my system. I currently have 50% of my plugins uninstalled and am waiting to see if a crash occurs. If it doesn't, I'll reintroduce 50% of the remaining 50%, and so on.

When I choose which ones to try in the next batch though I'm going to try and target some of our overlapping plugins since that feels like it might get me an answer faster.
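
The halving approach described here is a bisection. Below is a rough Python sketch of it, assuming a single plugin triggers the crash on its own and that the crash shows up reliably within one test run. `crashes_with` is a hypothetical stand-in for "run the server with only these plugins for a day or so and see if it crashes"; it is not an Unraid API.

```python
from typing import Callable, Optional, Sequence


def find_culprit(
    plugins: Sequence[str],
    crashes_with: Callable[[Sequence[str]], bool],
) -> Optional[str]:
    """Bisect a plugin list, assuming at most one plugin causes the crash."""
    # Baseline checks: if it crashes with no plugins at all, or does not
    # crash with all of them, the cause is not a single plugin in this list.
    if crashes_with([]) or not crashes_with(list(plugins)):
        return None

    candidates = list(plugins)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if crashes_with(half):
            candidates = half                    # culprit is in the enabled half
        else:
            candidates = candidates[len(half):]  # culprit is in the other half
    return candidates[0]
```

With N plugins this needs roughly log2(N) test runs after the two baseline runs instead of up to N, but it gives a misleading answer if crashes are intermittent or need two plugins together, which is why the baseline runs come first.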

July 20, 2023

@halorrr thank you. If we combine our efforts we might find the root cause faster.
I have just rebooted the server with all plugins enabled except user.scripts and zfs.master, as these are the plugins I installed last.
I also updated NerdTools to 2023.07.18, as this update is now offered. I will keep you informed.

Update 1: running with all plugins except user.scripts and zfs.master for 5:30 hrs now. I will keep you informed.

Update 2: after almost 9 hrs the system crashed. It seems user.scripts and zfs.master are not to blame.

Now back to safe mode first. Tomorrow I will continue the search.

Edited July 20, 2023 by Richard Aarnink
update due to new info on same topic

July 20, 2023

Author

Interestingly I just had a restart this morning with the first 50% of plugins turned on (which is the group of things that I thought were less likely to be culprits too).

The plugins in the group that was still installed when it crashed are:

appdata.backup.plg
ca.update.applications.plg
community.applications.plg
customtab.plg
disklocation-master.plg
docker.folder.plg
dynamix.file.manager.plg
dynamix.unraid.net.plg
fix.common.problems.plg
unassigned.devices-plus.plg
unassigned.devices.preclear.plg
unlimited-width.plg
user.scripts.plg

Which leaves the ones we have in common as:

appdata.backup.plg
community.applications.plg
dynamix.file.manager.plg
dynamix.unraid.net.plg
unassigned.devices-plus.plg
unassigned.devices.preclear.plg

Which is already a much more trimmed down list.
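
The overlap can also be computed rather than picked out by eye. Here is a small Python sketch using the two plugin lists exactly as posted in this thread; the variable names are only for illustration.

```python
# Plugins still installed on halorrr's server when it crashed (as posted).
halorrr_group = {
    "appdata.backup.plg", "ca.update.applications.plg",
    "community.applications.plg", "customtab.plg", "disklocation-master.plg",
    "docker.folder.plg", "dynamix.file.manager.plg", "dynamix.unraid.net.plg",
    "fix.common.problems.plg", "unassigned.devices-plus.plg",
    "unassigned.devices.preclear.plg", "unlimited-width.plg",
    "user.scripts.plg",
}

# Plugins installed on Richard's server (from his directory listing).
richard_installed = {
    "unbalance.plg", "intel-gpu-top.plg", "dynamix.system.temp.plg",
    "dynamix.system.stats.plg", "dynamix.system.info.plg",
    "dynamix.password.validator.plg", "dynamix.cache.dirs.plg",
    "unassigned.devices-plus.plg", "dynamix.unraid.net.plg",
    "unassigned.devices.preclear.plg", "dynamix.file.manager.plg",
    "NerdTools.plg", "open.files.plg", "ca.mover.tuning.plg",
    "community.applications.plg", "tips.and.tweaks.plg",
    "appdata.backup.plg", "zfs.master.plg", "user.scripts.plg",
    "fix.common.problems.plg", "unassigned.devices.plg",
}

# Set intersection: plugins present on both servers.
common = sorted(halorrr_group & richard_installed)
for name in common:
    print(name)
```

On the lists as posted, this also returns fix.common.problems.plg and user.scripts.plg, which the hand-picked list above leaves out; the thread does not say why.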

Community Applications feels unlikely to be the culprit, so I'm going to remove everything else in the overlapping group and see if the crashes still happen. If they don't, I can try re-adding them bit by bit.

July 21, 2023

Update 3:

Currently running without these plugins:

appdata.backup.plg
dynamix.file.manager.plg
dynamix.unraid.net.plg
unassigned.devices.preclear.plg

It is too early to tell if one of these is causing the crashes; the system has been running for 2.5 hrs now.
@halorrr do you have any updates?

July 21, 2023

Author

Honestly I'm confused. I removed ALL of my plugins and the crashes continued. I just this very moment turned safe mode back on, to confirm that it still helps.

The documentation just says that safe mode means no plugins. I'm wondering if there are any other changes that safe mode makes that might be helping, though.

July 21, 2023

Community Expert

I think it might also not load any packages that are in the ‘extras’ folder on the flash drive (if you have this folder at all).

July 22, 2023

@halorrr After 13hrs the system crashed again, so it seems I am in the same boat and we both are still having similar issues.

I just started in safe mode. Now I am planning to wait at least 24hrs before I call it a victory and assume that the system is stable in safe mode.
Currently a minimum of Docker containers (MQTT and Plex) and one VM (Home Assistant) are running.

Do you have any updates?

July 22, 2023

Author

For some reason I'm now crashing in safe mode, which I wasn't before. So it's back to the drawing board on testing. I'm currently doing a 24-hour run of memtest86 to re-verify that there are no RAM issues; 14 hours in and all passing so far.

July 23, 2023

Yesterday evening I noticed that Unraid was reporting that the USB flash drive was disconnected.
I found errors using dosfsck. I had replaced the USB flash drive earlier this month with a new Philips 32 GB one, when I upgraded to 6.12.
I did this as the previous one was rather old.

Now I know to keep away from the Philips-branded 32 GB USB flash drives.
@halorrr you might want to investigate the USB flash drive as well, if the memory is not the culprit.

The server is still running, no crashes. All plugins are enabled and runtime is 23 hrs.

Edited July 23, 2023 by Richard Aarnink

July 26, 2023

Author

I'm at a loss at this point as to what to try. I replaced the flash drive and still no luck. If it is related to some piece of my hardware, I have no idea how to narrow down which one.

July 26, 2023

I am still not out of the woods myself (if that is a saying). I replaced my flash drive with a Transcend JetFlash 600 8 GB. The system crashed within a couple of hours, so I am unsure whether the flash drive is the issue.

Now I am trying the Samsung Bar Pro 32 GB. I use a USB 2 port, and no other devices are connected via USB 2. I have created a completely new install and did not copy anything over from the original USB flash drive. Currently there are no plugins active. The system is rebuilding parity now. I want to keep the system running for a couple of days before I add plugins.

I am curious whether this has solved it, but I am unsure. It certainly is not a good feeling that there is nothing in the logs.

July 26, 2023

Author

Let me know how that goes, I'm thinking I'm going to run an even longer memtest than I did last time cause I'm running out of things I know how to test.

July 28, 2023

Author

@Richard Aarnink to add more data and rule more things out, I downgraded Unraid to 6.11.5 which did not fix the issue.

July 28, 2023

@halorrr thanks for letting me know. I created a new USB flash drive using a Samsung Bar Pro 32 GB and did not copy any files from the old USB device. Today I "downgraded" the cache drive from ZFS to btrfs. This should enable me to downgrade Unraid if needed.

The system has been up for 1 day and 23 hrs now, still on 6.12.3. I will keep you informed.

In your case, if the downgrade to 6.11.5, which ran fine in the past, still does not solve the problem, consider what is left to look at:

1. Any changes or upgrades to the BIOS?
2. The USB flash drive: is it still the same stick? Is it in a USB 2 slot? (USB 3 causes problems in some cases.)

Maybe do an extensive test using the tools from the Ultimate Boot CD: https://www.pcwdld.com/ultimate-boot-cd-ubcd-review/

Edited July 29, 2023 by Richard Aarnink

July 31, 2023

Author

@Richard Aarnink Any updates from you? I just came back from a weekend away myself. To answer your question, I haven't made any changes to the BIOS. The flash drive is still plugged in, but I'll have to check what type of USB port it is plugged into.

I'll have a look at the extensive test next as well.

Thanks

July 31, 2023

@halorrr with the new USB flash drive I have been running for 4 days now with no problems.
I will keep the system as-is for now, as I am off on vacation in a couple of days.

Not all plugins are running, but the Docker containers and the VM for Home Assistant are.
I can provide more details if you need more info.

In the logs I see nginx crashes, but this seems to be a known issue. The system, however, is stable for now.

So a fresh USB stick and a new installation (no copying of files from the old USB stick) seem to have solved the issue for me.

regards Richard

July 31, 2023

Author

Interesting, I already am on a new USB stick. I'm going to use another USB just to get the untouched versions of files and replace them one by one to see if I can find the culprit. Would be nice if I can narrow that down for others.

August 5, 2023

Author

OK, so I'm still finalizing testing, but it looks like the culprit was the `disk.cfg` file. I'm almost at 24 hours without crashes since restoring it to defaults. I'm now trying to understand what the differences between the files are.

Can anyone help me understand whether any of these values seem odd?

startArray:
Default: "no", old config: "yes"

poll_attributes:
Default: "30", old config: "1800"

nr_requests:
Default: "auto", old config: "128"

diskSpindownDelay.0:
Default: "-1", old config: "0"

diskExport.1 (this same one repeats for the other disks too):
Default: "-", old config: "e"

diskSecurity.1 (this same one repeats for some of the other disks too, but not all):
Default: "public", old config: "secure"

diskWriteList.1 (this same one repeats for some of the other disks too, but not all):
Default: "", old config: "<my username>"
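
A comparison like the one above can be produced mechanically. Below is a minimal Python sketch, assuming both files consist of simple `key="value"` lines, which is how the values are quoted in this post. The function names and the three-line sample strings are illustrative; the sample values are the ones listed above.

```python
def parse_cfg(text):
    """Parse key="value" lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def diff_cfg(default, old):
    """Return (key, default value, old value) for every key that differs."""
    keys = sorted(set(default) | set(old))
    return [
        (key, default.get(key), old.get(key))
        for key in keys
        if default.get(key) != old.get(key)
    ]


# Sample input built from three of the settings listed in the post.
default_text = 'startArray="no"\npoll_attributes="30"\nnr_requests="auto"\n'
old_text = 'startArray="yes"\npoll_attributes="1800"\nnr_requests="128"\n'

for key, default, old in diff_cfg(parse_cfg(default_text), parse_cfg(old_text)):
    print(f'{key}: default "{default}", old config "{old}"')
```

For real use, the two strings would be read from the fresh and the old `disk.cfg`. A key that exists in only one file shows up with `None` on the other side.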

Edited August 5, 2023 by halorrr

August 20, 2023

How did you get on with this, @halorrr? Was resetting disk.cfg the solution? Although I don't know for certain, I would guess that the entries you ask about are the equivalents of the per-disk preferences you get when clicking on the disk name in the "Main" page of the web interface. How did you reset it? Copy the file over from a fresh install?

August 25, 2023

Hi, I am still struggling. The server ran for 17 days without issues on 6.12.3.

As I was on vacation, the Unraid WebGUI was not used.

When I got back and started using the WebGUI, the server died in less than 2 days.

I upgraded to 6.12.4 rc 18. It ran for 2.5 days and died an hour ago.
There are no nginx crashes in the syslog; only the "Stop running nchan processes" message appears a couple of times.

I am reverting to 6.11.

August 25, 2023

Author

@ScottAS2 @Richard Aarnink The disk.cfg ended up being another dead end, but I did narrow it down at last. For me it was in fact a hardware issue. It seemed to be mostly related to the iGPU on my Intel chip. I was able to reproduce a crash the second any Plex stream that used the transcoder started. So I turned off hardware-accelerated transcoding and the crashes mostly stopped, but I was still getting the odd crash on occasion.

I then opened my server to get the serial number from my chip. When I did so, I unplugged the monitor from the server and forgot to plug it back in afterward. For some reason, not having the monitor plugged in resolved all remaining crashes, and for the first time I went 7 days with no crashes.

My server is currently off, as I RMAed the Intel chip and am waiting for a new one, but it seems, at least for me, that it was heavily tied to graphics usage. I'm curious to see, when I get the new chip, whether hardware acceleration and such will work again, or whether something besides the processor itself causes the crashes when this is in use. In my reading, though, it looks like a lot of people have crashes for various reasons when working with Intel's iGPU.

For what it's worth, my server uses the Intel i9-12900K.

Solved by mmenanno
