• System unstable 6.12.0-rc4.1


    ChronoStriker1
    • Closed

    Right now I doubt the issue is due to being on the RC, but since I am using it I figure I should post this here.  I recently replaced most of the hardware in my Unraid server, but now it seems to be somewhat unstable.  The last "crash" happened probably an hour ago.  It's not really a crash: I have partial access, some Dockers will still be running, and my SSH session seems to stay up, but some applications (like htop) will just freeze.  I think I may lose /mnt/user at the time, but I've only been able to confirm it once.  Things are usually so bad that I can't run diagnostics or even the shutdown or reboot commands.  This was way more prevalent (happening almost daily, if not more often) when I first set the server up, but after swapping some hardware around (I had a few bad SATA cables and an HBA that seemed like it was having issues until I updated its firmware) it seemed to go away, but I guess that's not the case.  I had run memtest and the memory passed, so I'm kind of out of ideas.  Would appreciate anyone's help.

     





    Recommended Comments



    I have enabled it but hadn't set it to write to the USB (I have it doing that now).  During the same run the server crashed again, and I did get a picture of the screen.  I've also noticed that the server would crash on reboot; I think this is that error, or at the very least the same error.  [screenshot attached: 20230502_183946.jpg]

     

    Also in the logs I noticed it keeps saying "shfs: cache disk full". As far as I can tell all of the share floors are set low enough that I shouldn't be getting that message.
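
    For reference, one quick way to check whether the cache pool really is low on space is shown below (a sketch; the pool name "cache" is taken from the log message, and the commands assume it is a ZFS pool):

    # Overall pool capacity and health
    zpool list cache
    # Per-dataset usage and available space
    zfs list -o name,used,avail,refer -r cache
    # Free space as the rest of the system sees it
    df -h /mnt/cache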

    Link to comment
    17 minutes ago, ChronoStriker1 said:

    As far as I can tell all of the share floors are set low enough that I shouldn't be getting that message.

    Correct, post the log when you have it.

    Link to comment

    I will after it crashes again.  I do have a second question, since I've been trying to monitor the syslog myself: if I invoke the mover manually, I see this come up:

    May  3 09:09:50 Tower move: mover: started
    May  3 09:09:50 Tower shfs: /usr/sbin/zfs destroy -r cache/Downloads
    May  3 09:09:50 Tower shfs: error: retval 1 attempting 'zfs destroy'

    Is that expected?  I decided to go with ZFS to learn more about it, but I don't know why it would try to do a destroy.

    Edited by ChronoStriker1
    Link to comment

    Destroying the dataset after the mover is done is normal; not sure what the error is about, though.
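
    If it errors again, a dry run of the same command may show why it fails without actually removing anything (a sketch using the dataset name from the log above; -n is a no-op dry run, -v prints what would be destroyed):

    # List the dataset plus any snapshots under it
    zfs list -r -t all cache/Downloads
    # Dry-run the recursive destroy to surface the error
    zfs destroy -rnv cache/Downloads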

    Link to comment

    I ran a scrub on my cache and I received two errors.  One is for a file that I can redownload, so no issue there; the other was:
    zfs permanent error cache:<0x24fc8a>
    I do not know how to deal with that one.

    Link to comment
    47 minutes ago, ChronoStriker1 said:

    zfs permanent error cache:<0x24fc8a>

    That means metadata corruption.  You should destroy and recreate the pool, but before that it would be a good idea to run memtest.
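
    To see exactly which objects the pool considers damaged, something like the following should list them; errors shown as <0x...> refer to objects (often metadata) whose path can no longer be resolved:

    # Show pool health and the files/objects with permanent errors
    zpool status -v cache
    # After removing or restoring the affected files, scrub again and re-check
    zpool scrub cache
    zpool status -v cache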

    Link to comment

    After deleting the file and doing another scrub, that error went away.  I also manually moved things again, and this time around it looks like everything moved (it looked like it stopped partway through previously).  Currently I am not getting that "shfs: cache disk full" message.  Still waiting for it to crash again.

    Link to comment

    The system stopped letting me do some actions again.  One CPU looked like it was pegged at 100% by "/usr/src/app/vendor/bundle/ruby/3.1.0/bin/rake jobs:work"; I tried killing it but it wouldn't die.  I tried stopping things in order to reboot, but the web interface became unresponsive.  I attempted to reboot from the command line, but the last message I saw was "Tower init: Trying to re-exec init".  It did eventually reboot, but it had an unclean shutdown.

     

    Edited by ChronoStriker1
    Link to comment

    The syslog shows multiple apps segfaulting, which usually points to a hardware problem, most often RAM-related.  Just some examples:


     

    May  5 02:41:22 Tower kernel: thunar[9746]: segfault at cc0000003a ip 00001474c6bd0071 sp 00007ffcb97fc750 error 4 in libglib-2.0.so.0.7400.6[1474c6b7a000+8d000] likely on CPU 0 (core 0, socket 0)
    ...
    May  5 16:04:11 Tower kernel: ghb[31695]: segfault at fffffffffffff820 ip 000014c78ddc3458 sp 000014c78452fab8 error 5 in libx264.so.164[14c78dc05000+1f0000] likely on CPU 14 (core 28, socket 0)
    ...
    May  6 08:07:26 Tower kernel: python3[7940]: segfault at 10 ip 000014c592a55933 sp 00007ffcbebc8fc8 error 4 in ld-musl-x86_64.so.1[14c592a47000+48000] likely on CPU 12 (core 24, socket 0)

     

    You are also overclocking your RAM.  The first thing to try would be to disable XMP; if issues persist, try with just one stick of RAM at a time (with XMP disabled).
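
    For reference, the currently applied memory speed can be checked from the running system roughly like this (a sketch, assuming dmidecode is available; field names vary a little between BIOS versions, but a "Configured Memory Speed" below the rated "Speed" generally means XMP is off):

    # Rated vs. currently configured speed for each DIMM
    dmidecode --type memory | grep -Ei 'size|speed|part number'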

    Link to comment

    After another crash yesterday I ran a full 4-pass memtest86 run and my memory passed.  While it's still possible the issue is the memory, I need to be more specific so I can RMA parts.  Is there any better way to track down what's going on?
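
    One thing that can be pulled from the syslog itself is a tally of which programs are segfaulting, to see whether the failures are random or concentrated in the same few binaries (a rough sketch; the path assumes the default local syslog):

    # Count segfaults per program name
    grep 'segfault at' /var/log/syslog | awk '{print $6}' | cut -d'[' -f1 | sort | uniq -c | sort -rn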

    Link to comment


    On 5/7/2023 at 8:59 AM, JorgeB said:

    disable XMP, if issues persist try with just one stick of RAM at a time (with XMP disabled).

     

    Link to comment

    Keep in mind that the memory speed limitation may very well NOT be the memory DIMMs; it can be the memory controller on the CPU or the motherboard itself.

     

    Putting 200 MPH-rated tires on a Toyota Corolla is not going to allow the car to go that fast.

     

    All parts in the memory access path must be able to sustain the targeted speed.  Run the system at stock speed (no XMP) and see how it behaves.  Memtest can only prove the memory is bad; passing memtest doesn't mean the memory is working 100% under all conditions.

    Link to comment

    I have disabled XMP.  There has been at least one segfault that I noticed so far, but it hasn't crashed yet.  I will continue to keep an eye on it.

    Link to comment

    And it looks like it crashed again.  I can attempt running one stick at a time later today to see if there is any change, but is there anything else I can test other than just the memory?

    Link to comment

    RAM is the easiest thing to test; if it still crashes with either stick alone, the next suspect IMHO would be the board/CPU.

    Link to comment

    Welp, I tried one stick twice and had the same issue, and I was able to get another set and am having the same errors, so I think at this point I can say it's not the RAM.  So where would the next place to check be?

    Link to comment

    Well, that will be fun to RMA then.  Another question that is hopefully easy: even after reboots and swapping the RAM, it's always the same things that segfault.  Wouldn't random things segfault if it thinks there is an issue or it runs out of memory?  Looking at my syslog from yesterday:
     

    May  9 21:37:29 Tower kernel: unraid-api[16221]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe4d6f11a8 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May  9 21:45:25 Tower kernel: python3[11316]: segfault at 7 ip 00001504488506f3 sp 00007ffdc92cd5f0 error 4 in libpython3.10.so.1.0[15044873b000+1be000] likely on CPU 14 (core 28, socket 0)
    May  9 22:25:13 Tower kernel: Thunar[22347]: segfault at 600000003a ip 00001512f15d1f1c sp 00007ffc678c45b0 error 4 in libglib-2.0.so.0.6600.8[1512f157e000+88000] likely on CPU 12 (core 24, socket 0)
    May  9 22:25:18 Tower kernel: thunar[1956]: segfault at 600000003a ip 0000153cad537f1c sp 00007ffc9fc1f040 error 4 in libglib-2.0.so.0.6600.8[153cad4e4000+88000] likely on CPU 2 (core 4, socket 0)
    May  9 22:45:28 Tower kernel: python[24692]: segfault at 1 ip 00001507b28ac411 sp 00007fff2a585a50 error 6 in libpython3.11.so.1.0[1507b2799000+1bb000] likely on CPU 0 (core 0, socket 0)
    May  9 23:56:34 Tower kernel: unraid-api[8794]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007fff36d19508 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 01:09:03 Tower kernel: unraid-api[15814]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007fffca0cda58 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 02:37:20 Tower kernel: python[7637]: segfault at 8 ip 000014f6acd47ac9 sp 000014f6a84bba90 error 4 in libpython3.9.so.1.0[14f6acc13000+1b8000] likely on CPU 0 (core 0, socket 0)
    May 10 03:14:47 Tower kernel: python3[27555]: segfault at 0 ip 000014dc983af61b sp 000014dc95621998 error 6 in libpython3.8.so.1.0[14dc98273000+183000] likely on CPU 12 (core 24, socket 0)
    May 10 03:46:44 Tower kernel: python[4413]: segfault at 6 ip 0000151abbf715e6 sp 00007ffc2b350e40 error 6 in libpython3.11.so.1.0[151abbe5f000+1bb000] likely on CPU 0 (core 0, socket 0)
    May 10 05:24:00 Tower kernel: php7[4270]: segfault at 40 ip 00005585e53dd3a0 sp 00007ffc2b656380 error 4 in php7[5585e5200000+240000] likely on CPU 0 (core 0, socket 0)
    May 10 05:29:39 Tower kernel: unraid-api[30617]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe8bc8fca8 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)
    May 10 06:32:13 Tower kernel: unraid-api[21391]: segfault at ffffffffffff3b28 ip 0000000001518f00 sp 00007ffe80db5a58 error 5 in unraid-api[91c000+167b000] likely on CPU 0 (core 0, socket 0)

    I know for a fact that unraid-api, python3, and thunar (or the libraries associated with them) are always the programs that seem to segfault.  Is it possible that some of the files have been damaged due to the crashes and that's why they are faulting?
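
    One rough way to test the "damaged files" theory is to checksum the binaries and libraries that keep faulting and compare the sums after a reboot: if the sums change, the on-disk copies are being corrupted; if they stay the same, the corruption is more likely happening in RAM at runtime.  The paths below are only illustrative, and /boot (the flash drive on Unraid) is used so the list survives reboots:

    # Record checksums of the frequently faulting binaries/libraries (example paths)
    sha256sum /usr/local/bin/unraid-api /usr/lib64/libglib-2.0.so.0* > /boot/segfault-sums.txt
    # After a reboot, compare against the recorded sums
    sha256sum -c /boot/segfault-sums.txt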

    Link to comment

    How are your CPU temperatures looking while testing?  You can also install the corefreq plugin, run corefreq-cli, go to Tools > Atomic Burn, and do a CPU burn test, similar to Prime95.
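
    If corefreq isn't convenient, a comparable load test can be improvised with stress-ng while watching temperatures, assuming the stress-ng and lm-sensors packages are available on the system:

    # Load all CPU threads for 10 minutes while reporting basic metrics
    stress-ng --cpu $(nproc) --timeout 10m --metrics-brief &
    # Watch core temperatures every few seconds while the test runs
    watch -n 5 sensors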

    You got a tough one there to track down, but good advice so far as to what to look at first.

    Edited by samsausages
    Link to comment

    CPU temps have been fine this entire time.  I'll install the plugin and run it after the latest parity check is complete.  I will also contact Intel and ASUS to see if I can RMA the processor and motherboard, since I'm outside my 30-day window with Amazon.

    Link to comment

    I would say 71°C is a bit on the high side, but it looks like you have a 13900K, so that isn't bad for that chip and well within spec.
    Good sign that it didn't crash, but it probably doesn't make your search for answers much easier.  My gut says it's still memory-related, be it the memory itself, the motherboard, or the memory controller on the CPU.
    This may be a long shot as well, but I had an older X99 system that didn't like my memory settings on Auto; I had to go in and set the timings and voltage manually in the BIOS for it to work properly.  It was tough to troubleshoot because the errors were rare and intermittent.
     

    Link to comment




