• System unstable 6.12.0-rc4.1


    ChronoStriker1
    • Closed

    Right now I doubt the issue is due to being on the RC, but since I am using it I figure I should post this here.  I recently replaced most of the hardware in my Unraid server, but now it seems to be somewhat unstable.  The last "crash" happened probably an hour ago.  It's not really a crash: I have partial access, some Dockers will still be running, and my SSH session seems to stay up, but some applications (like htop) will just freeze.  I think I may lose /mnt/user at the time, but I've only been able to confirm it once.  Things are usually so bad that I can't run diagnostics or even the shutdown or reboot commands.  This was way more prevalent (happening almost daily, if not more often) when I first set the server up, but after swapping some hardware around (I had a few bad SATA cables and an HBA that seemed like it was having issues until I updated its firmware) it seemed to go away, but I guess that's not the case.  I had run memtest and the memory passed, so I'm kind of out of ideas.  Would appreciate anyone's help.

     





    Recommended Comments



    I have enabled it but hadn't set it to write to the USB (I have it doing that now).  During the same run the server crashed again, and I did get a picture of the screen.  I've also noticed that the server would crash on reboot; I think this is that error, or at the very least the same error.  [screenshot attached: 20230502_183946.jpg]

     

    Also in the logs I noticed it keeps saying "shfs: cache disk full". As far as I can tell all of the share floors are set low enough that I shouldn't be getting that message.
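
    For reference, one quick way to check whether the cache pool really is low on space is shown below (a sketch; the pool name "cache" is taken from the log message, and the commands assume it is a ZFS pool):

    # Overall pool capacity and health
    zpool list cache
    # Per-dataset usage and available space
    zfs list -o name,used,avail,refer -r cache
    # Free space as the rest of the system sees it
    df -h /mnt/cache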

    Link to comment
    17 minutes ago, ChronoStriker1 said:

    As far as I can tell all of the share floors are set low enough that I shouldn't be getting that message.

    Correct, post the log when you have it.

    Link to comment

    I will after it crashes again.  I do have a second question, since I've been trying to monitor the syslog myself: if I invoke the mover manually, I see this come up:

    May  3 09:09:50 Tower move: mover: started
    May  3 09:09:50 Tower shfs: /usr/sbin/zfs destroy -r cache/Downloads
    May  3 09:09:50 Tower shfs: error: retval 1 attempting 'zfs destroy'

    Is that expected?  I decided to go with ZFS to learn more about it, but I don't know why it would try to do a destroy.

    Edited by ChronoStriker1
    Link to comment

    Destroying the dataset after the mover is done is normal; not sure what the error is about, though.
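
    If it errors again, a dry run of the same command may show why it fails without actually removing anything (a sketch using the dataset name from the log above; -n is a no-op dry run, -v prints what would be destroyed):

    # List the dataset plus any snapshots under it
    zfs list -r -t all cache/Downloads
    # Dry-run the recursive destroy to surface the error
    zfs destroy -rnv cache/Downloads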

    Link to comment

    I ran a scrub on my cache and I received two errors.  One is for a file that I can redownload, so no issue there; the other was:
    zfs permanent error cache:<0x24fc8a>
    I do not know how to deal with that one.

    Link to comment
    47 minutes ago, ChronoStriker1 said:

    zfs permanent error cache:<0x24fc8a>

    That means metadata corruption.  You should destroy and recreate the pool, but before that it would be a good idea to run memtest.
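
    To see exactly which objects the pool considers damaged, something like the following should list them; errors shown as <0x...> refer to objects (often metadata) whose path can no longer be resolved:

    # Show pool health and the files/objects with permanent errors
    zpool status -v cache
    # After removing or restoring the affected files, scrub again and re-check
    zpool scrub cache
    zpool status -v cache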

    Link to comment

    After deleting the file and doing another scrub, that error went away.  I also manually moved things again, and this time around it looks like everything moved (it looked like it stopped partway through previously).  Currently I am not getting that "shfs: cache disk full" message.  Still waiting for it to crash again.

    Link to comment

    The system stopped letting me do some actions again.  One CPU looked like it was pegged at 100% by "/usr/src/app/vendor/bundle/ruby/3.1.0/bin/rake jobs:work"; I tried killing it but it wouldn't die.  I tried stopping things in order to reboot, but the web interface became unresponsive.  I attempted to reboot from the command line, but the last message I saw was "Tower init: Trying to re-exec init".  It did eventually reboot, but it had an unclean shutdown.

     

    Edited by ChronoStriker1
    Link to comment

    The syslog shows multiple apps segfaulting, which usually points to a hardware problem, most often RAM-related.  Just some examples:


     

    May  5 02:41:22 Tower kernel: thunar[9746]: segfault at cc0000003a ip 00001474c6bd0071 sp 00007ffcb97fc750 error 4 in libglib-2.0.so.0.7400.6[1474c6b7a000+8d000] likely on CPU 0 (core 0, socket 0)
    ...
    May  5 16:04:11 Tower kernel: ghb[31695]: segfault at fffffffffffff820 ip 000014c78ddc3458 sp 000014c78452fab8 error 5 in libx264.so.164[14c78dc05000+1f0000] likely on CPU 14 (core 28, socket 0)
    ...
    May  6 08:07:26 Tower kernel: python3[7940]: segfault at 10 ip 000014c592a55933 sp 00007ffcbebc8fc8 error 4 in ld-musl-x86_64.so.1[14c592a47000+48000] likely on CPU 12 (core 24, socket 0)

     

    You are also overclocking your RAM.  The first thing to try would be to disable XMP; if issues persist, try with just one stick of RAM at a time (with XMP disabled).
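
    For reference, the currently applied memory speed can be checked from the running system roughly like this (a sketch, assuming dmidecode is available; field names vary a little between BIOS versions, but a "Configured Memory Speed" below the rated "Speed" generally means XMP is off):

    # Rated vs. currently configured speed for each DIMM
    dmidecode --type memory | grep -Ei 'size|speed|part number'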

    Link to comment

    After another crash yesterday I ran a full 4-pass memtest86 run and my memory passed.  While it's still possible the issue is the memory, I need to be more specific so I can RMA parts.  Is there any better way to track down what's going on?
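
    One thing that can be pulled from the syslog itself is a tally of which programs are segfaulting, to see whether the failures are random or concentrated in the same few binaries (a rough sketch; the path assumes the default local syslog):

    # Count segfaults per program name
    grep 'segfault at' /var/log/syslog | awk '{print $6}' | cut -d'[' -f1 | sort | uniq -c | sort -rn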

    Link to comment


    On 5/7/2023 at 8:59 AM, JorgeB said:

    disable XMP, if issues persist try with just one stick of RAM at a time (with XMP disabled).

     

    Link to comment

    Keep in mind that the memory speed limitation may very well NOT be the memory DIMMs; it can be the memory controller on the CPU or the motherboard itself.

     

    Putting 200 MPH-rated tires on a Toyota Corolla is not going to allow the car to go that fast.

     

    All parts in the memory access path must be able to sustain the targeted speed.  Run the system at stock speed (no XMP) and see how it behaves.  Memtest can only prove the memory is bad; passing memtest doesn't mean the memory is working 100% under all conditions.

    Link to comment

    I have disabled XMP.  There has been at least one segfault that I noticed so far, but it hasn't crashed yet.  I will continue to keep an eye on it.

    Link to comment

    And it looks like it crashed again.  I can attempt running one stick at a time later today to see if there is any change, but is there anything else I can test other than just the memory?

    Link to comment

    RAM is the easiest thing to test; if it still crashes with either stick alone, the next suspect IMHO would be the board/CPU.

    Link to comment

    Welp, I tried one stick twice and had the same issue, and I was able to get another set and am having the same errors, so I think at this point I can say it's not the RAM.  So where would the next place to check be?

    Link to comment

    Well, that will be fun to RMA then.  Another question that is hopefully easy: even after reboots and swapping the RAM, it's always the same things that segfault.  Wouldn't random things segfault if it thinks there is an issue or it runs out of memory?  Looking at my syslog from yesterday:
     

    May  9 21:37:29 Tower kernel: unraid-api[16221]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe4d6f11a8 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May  9 21:45:25 Tower kernel: python3[11316]: segfault at 7 ip 00001504488506f3 sp 00007ffdc92cd5f0 error 4 in libpython3.10.so.1.0[15044873b000+1be000] likely on CPU 14 (core 28, socket 0)
    May  9 22:25:13 Tower kernel: Thunar[22347]: segfault at 600000003a ip 00001512f15d1f1c sp 00007ffc678c45b0 error 4 in libglib-2.0.so.0.6600.8[1512f157e000+88000] likely on CPU 12 (core 24, socket 0)
    May  9 22:25:18 Tower kernel: thunar[1956]: segfault at 600000003a ip 0000153cad537f1c sp 00007ffc9fc1f040 error 4 in libglib-2.0.so.0.6600.8[153cad4e4000+88000] likely on CPU 2 (core 4, socket 0)
    May  9 22:45:28 Tower kernel: python[24692]: segfault at 1 ip 00001507b28ac411 sp 00007fff2a585a50 error 6 in libpython3.11.so.1.0[1507b2799000+1bb000] likely on CPU 0 (core 0, socket 0)
    May  9 23:56:34 Tower kernel: unraid-api[8794]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007fff36d19508 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 01:09:03 Tower kernel: unraid-api[15814]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007fffca0cda58 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 02:37:20 Tower kernel: python[7637]: segfault at 8 ip 000014f6acd47ac9 sp 000014f6a84bba90 error 4 in libpython3.9.so.1.0[14f6acc13000+1b8000] likely on CPU 0 (core 0, socket 0)
    May 10 03:14:47 Tower kernel: python3[27555]: segfault at 0 ip 000014dc983af61b sp 000014dc95621998 error 6 in libpython3.8.so.1.0[14dc98273000+183000] likely on CPU 12 (core 24, socket 0)
    May 10 03:46:44 Tower kernel: python[4413]: segfault at 6 ip 0000151abbf715e6 sp 00007ffc2b350e40 error 6 in libpython3.11.so.1.0[151abbe5f000+1bb000] likely on CPU 0 (core 0, socket 0)
    May 10 05:24:00 Tower kernel: php7[4270]: segfault at 40 ip 00005585e53dd3a0 sp 00007ffc2b656380 error 4 in php7[5585e5200000+240000] likely on CPU 0 (core 0, socket 0)
    May 10 05:29:39 Tower kernel: unraid-api[30617]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe8bc8fca8 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 06:32:13 Tower kernel: unraid-api[21391]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe80db5a58 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)

    I know for a fact that unraid-api, python3, and thunar (or the libraries associated with them) are always the programs that seem to segfault.  Is it possible that some of the files have been damaged due to the crashes and that's why they are faulting?
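
    One rough way to test the "damaged files" theory is to checksum the binaries and libraries that keep faulting and compare the sums after a reboot: if the sums change, the on-disk copies are being corrupted; if they stay the same, the corruption is more likely happening in RAM at runtime.  The paths below are only illustrative, and /boot (the flash drive on Unraid) is used so the list survives reboots:

    # Record checksums of the frequently faulting binaries/libraries (example paths)
    sha256sum /usr/local/bin/unraid-api /usr/lib64/libglib-2.0.so.0* > /boot/segfault-sums.txt
    # After a reboot, compare against the recorded sums
    sha256sum -c /boot/segfault-sums.txt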

    Link to comment

    How are your CPU temperatures looking while testing?  You can also install the corefreq plugin, run corefreq-cli, go to Tools > Atomic Burn, and do a CPU burn test, similar to Prime95.
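
    If corefreq isn't convenient, a comparable load test can be improvised with stress-ng while watching temperatures, assuming the stress-ng and lm-sensors packages are available on the system:

    # Load all CPU threads for 10 minutes while reporting basic metrics
    stress-ng --cpu $(nproc) --timeout 10m --metrics-brief &
    # Watch core temperatures every few seconds while the test runs
    watch -n 5 sensors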

    You got a tough one there to track down, but good advice so far as to what to look at first.

    Edited by samsausages
    Link to comment

    CPU temps have been fine this entire time.  I'll install the plugin and run it after the latest parity check is complete.  I will also contact Intel and ASUS to see if I can RMA the processor and motherboard, since I'm outside my 30-day window with Amazon.

    Link to comment

    I would say 71°C is a bit on the high side, but it looks like you have a 13900K, so that isn't bad for that chip and well within spec.
    Good sign that it didn't crash, but it probably doesn't make your search for answers much easier.  My gut says it's still memory-related, be it the memory itself, the motherboard, or the memory controller on the CPU.
    This may be a long shot as well, but I had an older X99 system that didn't like my memory settings on Auto; I had to go in and set the timings and voltage manually in the BIOS for it to work properly.  It was tough to troubleshoot because the errors were rare and intermittent.
     

    Link to comment




