Regularly getting unclean crashes/reboots in random intervals

January 30, 2023

I regularly get unclean crashes/reboots. I have not been able to pin down why my Unraid system does this; sometimes it happens within hours of booting up, sometimes days apart. The last unclean crash/reboot was just hours ago, after Unraid had been running for 2-3 days.

Usually, by the time I notice, Unraid has already done an unclean reboot and I see scrubs/parity checks running. Sometimes I notice that the machine is unresponsive, and when I run to it and turn on the monitor, nothing is displayed, likely because it is crashing?

Some of the things I have tried doing:

This Saturday I replugged all the power cables I could find, and this did not fix the issue.
I tried replacing the memory sticks. The previous memory sticks passed MemTest86, but they would sometimes fail when I ran TM5 with the @anta777 configs, and at times even blue-screened the system. The current memory sticks passed MemTest86 and did not error out or blue screen when running TM5. Replacing the memory sticks did not fix the issue either.
This machine didn't seem to have issues when running Windows 10. I did, however, move the machine to a new chassis when I converted it to an Unraid machine. I still had to boot Windows to run TM5, and nothing seemed to happen while I ran the memory tests then. I should note that because the unclean crashes/reboots happen regularly but at random intervals, it is possible that the issue also existed under Windows and I simply did not give it enough runtime to show up.

Edit:

Also tried:

Connecting the machine to a UPS in case the issue was caused by dirty power. Even when connected to a line-interactive UPS, the issue still occurs, and I don't hear the UPS screaming at me the way it does when its input power gets cut during a blackout.

alagaesia-diagnostics-20230130-1416.zip

Edited January 30, 2023 by Percutio
added another try

January 31, 2023

Enable the syslog server and post that after a crash.

January 31, 2023

Author

I have had the syslog server running since around Jan 2. The most recent crash happened around Jan 30 12:00:22.

syslog-192.168.1.12.log

January 31, 2023

Nothing relevant was logged at that time, but there are several segfaults before it, which might suggest a RAM problem. When did you replace the RAM?

January 31, 2023

Author

I replaced them sometime in mid-December. I am hoping the current RAM sticks don't have any issues, because even once I get around to RMAing the previous sticks, those are not on the QVL, while the current ones are.

Edit:
Additional history behind all the RAM sticks:

The current 2x8GB sticks are the RAM that ran in this machine back when it was a Windows machine. I don't remember regular blue screens happening with this RAM. I also used to have another 2x8GB kit of the same SKU, but, likely because I moved the RAM around a lot while troubleshooting earlier RAM issues, one of the sticks had a capacitor break off, so I took that pair of sticks out.
The previous 2x16GB sticks were part of my current Windows machine. I had numerous blue screens with various error codes that I could only attribute to the RAM being faulty in some way. It didn't matter whether XMP was on or off; the blue screens still occurred. TM5 reported errors with these sticks on both my current Windows machine and the Unraid machine (booted into Windows), but MemTest86 never detected any errors.
My current Windows machine now uses 2x16GB sticks that I made sure were on the QVL, and it has been stable with no blue screens since. The QVL also lists support for up to 4 sticks of this SKU, so I am willing to consider buying more of this SKU and using 2 of them in the Unraid machine.

Edited January 31, 2023 by Percutio

January 31, 2023

Author

I would also like to note that the lsof segfaults could have something to do with how lsof had been pinning one of the CPU cores at 100%. Since I uninstalled a plugin that I suspected was running the lsof command, I don't think I have seen a CPU core pinned at 100% like that.

January 31, 2023

It's not just lsof, and they are from this year, so possibly there's still a hardware issue:

Jan  6 04:52:12 Alagaesia kernel: unraid-api[2174]: segfault at 1b ip 0000000001574a31 sp 00007fff0607c858 error 4 in unraid-api[91c000+167b000]
Jan  6 04:52:12 Alagaesia kernel: Code: 1f 48 8b 48 17 48 63 c6 48 69 c0 ab aa aa 2a 48 8b 79 17 48 c1 e8 20 29 d0 8d 14 85 10 00 00 00 8d 04 40 01 c0 48 63 d2 29 c6 <8b> 44 3a ff 8d 0c b6 d3 e8 83 e0 1f c3 66 90 48 8b 07 89 f2 c1 fa


Jan  8 18:39:07 Alagaesia kernel: plugin[17953]: segfault at 0 ip 00001522ca1253c7 sp 00007ffc0a4857d0 error 4 in libc-2.36.so[1522ca0ad000+16b000]
Jan  8 18:39:07 Alagaesia kernel: Code: 00 00 00 48 85 ff 0f 84 5f 02 00 00 48 83 3d ef 50 15 00 1f 0f 87 51 01 00 00 b8 80 00 00 00 41 bd 02 00 00 00 bb 20 00 00 00 <48> 01 e8 48 8b 48 08 48 8d 70 f0 48 39 ce 0f 84 10 fe ff ff 48 8b


Jan 22 20:12:44 Alagaesia kernel: php[13980]: segfault at 0 ip 0000148f659ae3c7 sp 00007ffd95d919f0 error 4 in libc-2.36.so[148f65936000+16b000]
Jan 22 20:12:44 Alagaesia kernel: Code: 00 00 00 48 85 ff 0f 84 5f 02 00 00 48 83 3d ef 50 15 00 1f 0f 87 51 01 00 00 b8 80 00 00 00 41 bd 02 00 00 00 bb 20 00 00 00 <48> 01 e8 48 8b 48 08 48 8d 70 f0 48 39 ce 0f 84 10 fe ff ff 48 8b
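
To summarize entries like these across a long syslog, something along the following lines could help. This is only a minimal Python sketch, assuming every segfault line follows exactly the kernel format quoted above; the names `SEGFAULT_RE`, `parse_segfaults` and `SAMPLE` are made up for illustration. It pulls out the process name, the error code and the faulting offset inside the mapped file (the `ip` value minus the mapping's base address). On x86, `error 4` means a user-mode read of a page that was not mapped.

```python
import re

# Matches kernel segfault lines in the format quoted above (an assumption:
# other kernels or architectures may word these lines differently).
SEGFAULT_RE = re.compile(
    r"kernel: (?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfaults(lines):
    """Yield one dict per segfault line found in an iterable of log lines."""
    for line in lines:
        m = SEGFAULT_RE.search(line)
        if m is None:
            continue  # not a segfault line (e.g. the "Code:" dump that follows)
        ip = int(m["ip"], 16)
        base = int(m["base"], 16) if m["base"] else None
        yield {
            "proc": m["proc"],
            "obj": m["obj"],
            "error": int(m["err"]),
            # Offset of the faulting instruction inside the mapped object.
            "offset": None if base is None else ip - base,
        }


# The three segfault lines quoted in this post.
SAMPLE = [
    "Jan  6 04:52:12 Alagaesia kernel: unraid-api[2174]: segfault at 1b ip 0000000001574a31 sp 00007fff0607c858 error 4 in unraid-api[91c000+167b000]",
    "Jan  8 18:39:07 Alagaesia kernel: plugin[17953]: segfault at 0 ip 00001522ca1253c7 sp 00007ffc0a4857d0 error 4 in libc-2.36.so[1522ca0ad000+16b000]",
    "Jan 22 20:12:44 Alagaesia kernel: php[13980]: segfault at 0 ip 0000148f659ae3c7 sp 00007ffd95d919f0 error 4 in libc-2.36.so[148f65936000+16b000]",
]

results = list(parse_segfaults(SAMPLE))
for r in results:
    off = "?" if r["offset"] is None else hex(r["offset"])
    print(f"{r['proc']}: error {r['error']} in {r['obj']} at offset {off}")
```

For the three lines above, this gives offset 0xc58a31 inside unraid-api and the same offset, 0x783c7, inside libc-2.36.so for both the plugin and php faults, which matches their identical `Code:` bytes. To run it on a whole log, pass an open file object to `parse_segfaults` instead of `SAMPLE`.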

January 31, 2023

Author

I'll try swapping the RAM of my current Windows machine with the RAM in the Unraid machine tomorrow and run TM5 on the RAM. The RAM of my current Windows machine is not in the QVL list for the Unraid machine's motherboard so I am hoping this won't cause issues either...

EDIT: I have had the new RAM running TM5 for a couple of days now and there were no errors. I even had the Extreme1 config running for around 24 hours. Now the machine is running Unraid again, and soon I'll be looking into doing a zfs replication to another pool, which will probably use a lot of RAM in the process and serve as a real-world test.

Edited February 3, 2023 by Percutio

February 7, 2023

Author

I am still getting crashes/dirty reboots with the new RAM. This RAM definitely has no problems whatsoever in my main Windows machine, and when I installed it in the Unraid machine, it did not show any errors when I ran TM5 tests with the anta777 configs. I even had the Extreme1 config running for 24+ hours and nothing happened.

This time around I caught the Unraid machine rebooting/crashing twice within around 30 minutes while I was looking into issues with the zfs replication. The crashes happened at around 9:00 and 9:30 am on Feb 7.

EDIT: It crashed/rebooted yet again at around 9:48, just as I was about to bring up all the kernel logs in the tools page.

syslog-192.168.1.12.log

Edited February 7, 2023 by Percutio

February 7, 2023

Nothing relevant logged at the time of the crashes, but there are still multiple apps recently segfaulting, so it still looks to me like a hardware problem.

February 7, 2023

Author

What other hardware troubleshooting steps can I try? At the very least, I doubt the memory sticks have any problems, unless I have been unlucky with 2 different memory vendors and Windows can't detect the problems.

Edited February 7, 2023 by Percutio

February 7, 2023

This type of issue is difficult to diagnose without starting to swap some parts around.

February 7, 2023

Author

There were some additional kernel log entries about the CPU around the 4th dirty reboot so far, so I could try reseating the CPU and using some air to clean the socket pins to see what happens. I'll try that tonight and just keep the machine shut down in the meantime.

syslog-192.168.1.12 (1).log

February 18, 2023

Author
Solution

So, that night, I had done a couple things:

Blew some air on the CPU pins. The socket and pins didn't look like they had any visible issues, though.
Retightened the CPU mounting system, since it seems SecuFirm2 prevents me from opening the retention bracket without taking it apart. Maybe the issue had something to do with contact pressure.

And ever since then, I have had no dirty reboots/crashes. The last few days I did have to restart or shut down the system to test a new UPS, but otherwise nothing that would trigger a ZFS scrub or similar. There haven't been any segfaults since either.

I'll just close this topic now and very much hope these dirty reboots/crashes are gone. In the coming weeks, I plan to set up some docker containers/services and overall just use the machine as a NAS. If the issue pops up again while doing that, I'll update this thread.

Solved by Percutio