• [6.7.0] Server hangs after ~24 hours


    emmcee
    • Solved Annoyance

    Since upgrading to 6.7, the server has hung twice. There is no network activity and nothing on the console when I plug a monitor into HDMI. The power LED is still lit, but nothing else responds. When I restarted it last night it ran a parity check and sent an email report about 4 hours ago, but I just sat down to watch something on Plex and the server is down again.

     

    Is there any way to get diagnostics when there is no console and no network? I'm guessing not.




    User Feedback

    Recommended Comments

    Since you are on 6.7 you can go into Settings and enable the built-in syslog server and set it to save to the flash drive.

    Link to comment

    And right on cue it hangs again. Just before it hangs I see this in the syslog (I am reluctant to post it as it's not anonymised):

     

    May 22 20:26:38 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:38 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:26:38 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:38 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 315000 [fault reason 05] PTE Write access is not set
    May 22 20:26:38 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 316000 [fault reason 05] PTE Write access is not set
    May 22 20:26:38 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:38 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 317000 [fault reason 05] PTE Write access is not set
    May 22 20:26:44 Tower kernel: [drm] GPU HANG: ecode 8:2:0x9bd7cfff, in Plex Transcoder [4974], reason: hang on vcs0, action: reset
    May 22 20:26:44 Tower kernel: i915 0000:00:02.0: Resetting vcs0 for hang on vcs0
    May 22 20:26:52 Tower kernel: i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
    May 22 20:26:52 Tower kernel: dmar_fault: 47723 callbacks suppressed
    May 22 20:26:52 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:52 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 313000 [fault reason 23] Unknown
    May 22 20:26:52 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 314000 [fault reason 23] Unknown
    May 22 20:26:52 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:52 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 315000 [fault reason 23] Unknown
    May 22 20:26:52 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:26:52 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 316000 [fault reason 23] Unknown
    May 22 20:26:57 Tower kernel: dmar_fault: 31347 callbacks suppressed
    May 22 20:26:57 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:26:57 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:26:57 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:26:57 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 318000 [fault reason 05] PTE Write access is not set
    May 22 20:26:57 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:26:57 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 31b000 [fault reason 05] PTE Write access is not set
    May 22 20:26:57 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:02 Tower kernel: dmar_fault: 3344072 callbacks suppressed
    May 22 20:27:02 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:02 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 116d1a000 [fault reason 23] Unknown
    May 22 20:27:02 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:27:02 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 116d1d000 [fault reason 23] Unknown
    May 22 20:27:02 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:02 Tower kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 116d21000 [fault reason 23] Unknown
    May 22 20:27:02 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:14 Tower kernel: i915 0000:00:02.0: Resetting vcs0 for no progress on vcs0
    May 22 20:27:14 Tower kernel: dmar_fault: 3249556 callbacks suppressed
    May 22 20:27:14 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:14 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:14 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:14 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 315000 [fault reason 05] PTE Write access is not set
    May 22 20:27:14 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:14 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 316000 [fault reason 05] PTE Write access is not set
    May 22 20:27:14 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:27:30 Tower kernel: i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0, vcs0
    May 22 20:27:30 Tower kernel: i915 0000:00:02.0: Resetting vcs0 for no progress on rcs0, vcs0
    May 22 20:27:30 Tower kernel: dmar_fault: 1920976 callbacks suppressed
    May 22 20:27:30 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:30 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:30 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:30 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 315000 [fault reason 05] PTE Write access is not set
    May 22 20:27:30 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:30 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 316000 [fault reason 05] PTE Write access is not set
    May 22 20:27:30 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:35 Tower kernel: dmar_fault: 27784 callbacks suppressed
    May 22 20:27:35 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:27:35 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:35 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:35 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 317000 [fault reason 05] PTE Write access is not set
    May 22 20:27:35 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:35 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 319000 [fault reason 05] PTE Write access is not set
    May 22 20:27:35 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:42 Tower kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
    May 22 20:27:42 Tower kernel: dmar_fault: 1239 callbacks suppressed
    May 22 20:27:42 Tower kernel: DMAR: DRHD: handling fault status reg 3
    May 22 20:27:42 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:42 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:42 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 317000 [fault reason 05] PTE Write access is not set
    May 22 20:27:42 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:42 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 319000 [fault reason 05] PTE Write access is not set
    May 22 20:27:42 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 31a000 [fault reason 05] PTE Write access is not set
    May 22 20:27:48 Tower kernel: dmar_fault: 27637 callbacks suppressed
    May 22 20:27:48 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:48 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:48 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:48 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 317000 [fault reason 05] PTE Write access is not set
    May 22 20:27:48 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 318000 [fault reason 05] PTE Write access is not set
    May 22 20:27:48 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:48 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 319000 [fault reason 05] PTE Write access is not set
    May 22 20:27:54 Tower kernel: dmar_fault: 3722 callbacks suppressed
    May 22 20:27:54 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:54 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 313000 [fault reason 05] PTE Write access is not set
    May 22 20:27:54 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:54 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 317000 [fault reason 05] PTE Write access is not set
    May 22 20:27:54 Tower kernel: DMAR: DRHD: handling fault status reg 2
    May 22 20:27:54 Tower kernel: DMAR: [DMA Read] Request device [00:02.0] fault addr 319000 [fault reason 05] PTE Write access is not set
    May 22 20:27:54 Tower kernel: DMAR: DRHD: handling fault status reg 2

    I found this, which indicates it may be due to the GPU. I'm passing the iGPU from my i7-5775C through to the Plex Docker container. Could that be related?
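For anyone wanting to check their own saved syslog for the same pattern, here is a minimal Python sketch that counts the DMAR fault lines per device and picks out the `GPU HANG` entries. The regular expressions are written against the line format in the excerpt above only; the function name `summarize_syslog` is made up for illustration, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns based solely on the log excerpt in this thread (an assumption:
# other kernels may format these messages differently).
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) \[fault reason (?P<reason>\d+)\]"
)
GPU_HANG = re.compile(
    r"\[drm\] GPU HANG: ecode (?P<ecode>[^,]+), in (?P<proc>.+?) \[(?P<pid>\d+)\]"
)
SUPPRESSED = re.compile(r"dmar_fault: (?P<n>\d+) callbacks suppressed")


def summarize_syslog(lines):
    """Summarise DMAR faults and GPU hangs found in an iterable of syslog lines."""
    faults = Counter()   # (device, Read/Write, fault reason) -> logged count
    hangs = []           # (process name, pid) for each GPU HANG line
    suppressed = 0       # total of the "callbacks suppressed" counters
    for line in lines:
        m = DMAR_FAULT.search(line)
        if m:
            faults[(m["dev"], m["op"], m["reason"])] += 1
            continue
        m = GPU_HANG.search(line)
        if m:
            hangs.append((m["proc"], int(m["pid"])))
            continue
        m = SUPPRESSED.search(line)
        if m:
            suppressed += int(m["n"])
    return {"faults": faults, "hangs": hangs, "suppressed": suppressed}
```

Pass it the lines of wherever your syslog was saved, for example `summarize_syslog(open(path))` with `path` being your own file. If every fault is against `00:02.0` and the hang is reported in `Plex Transcoder`, as in the excerpt, that at least points at the iGPU and the transcoder rather than at something unrelated.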

    Link to comment

    And it looks like Plex doesn't feel so good. From the Plex container log:

     

    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so
    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so
    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so
    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so
    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so
    failed to open /data/jenkins/conan_build/290002784/conan/.conan/data/libva/2.1.0-40/plex/stable/package/81a2df5e16044d97d1b088b0e6c9598b5b17f233/lib/dri/hybrid_drv_video.so
    Failed to wrapper hybrid_drv_video.so

     

    Link to comment

    Sorry, I thought I had commented on the fix. I think this was related to that Plex error and iGPU passthrough. When I updated the Plex container the error went away, and the system has been up for 5 days. I think the 24-hour pattern was due to people sitting down to watch Plex around 8 PM.
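If you want to test the "around 8 PM" theory against your own logs, a rough sketch like the one below buckets the `GPU HANG` lines by hour of day. It assumes the classic syslog timestamp layout seen earlier in this thread (`May 22 20:26:44 ...`); the function name `hang_hours` and the `marker` parameter are hypothetical, chosen for illustration.

```python
import re
from collections import Counter

# Classic syslog timestamp, e.g. "May 22 20:26:44 Tower kernel: ..."
TIMESTAMP = re.compile(r"^\w{3}\s+\d+\s+(?P<hour>\d{2}):\d{2}:\d{2}\s")


def hang_hours(lines, marker="GPU HANG"):
    """Count lines containing `marker`, grouped by hour of day (0-23)."""
    hours = Counter()
    for line in lines:
        if marker not in line:
            continue
        m = TIMESTAMP.match(line.strip())
        if m:
            hours[int(m["hour"])] += 1
    return hours
```

If the hangs cluster in the evening hours when people start streaming, that supports the idea that the trigger is transcoding on the iGPU rather than anything tied to uptime.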

    Link to comment

    No problem, but I was kinda hoping for a more elaborate fix, because I'm having the same issues with iGPU passthrough using the Binhex PlexPass container on Unraid 6.7.0. It used to work perfectly, but even with an updated container the problem still arises.

    Link to comment

    I’m running linuxserver/plex with Plex Pass, if it helps. It might be worth installing it to see if it works.

    Link to comment



