• [6.9.0 beta 30] Server hard lock up


    nickp85
    • Urgent

    Updated to 6.9 beta 30 last night, and sometime overnight my whole server locked up. My Windows 10 VM was running along with four Docker containers (Plex, Sonarr, Radarr and NZBGet), nothing else. I have my logs going to a local syslog server, so they are on a share. There is nothing in the log between when I went to bed and when I had to force-reboot the machine.

     

    Couldn't ping the machine, the local console was blank, the keyboard was unresponsive, totally dead. Forcing the power off and on brought it back. After the reboot a parity check started automatically.

     

    The server had been working 100% fine on 6.8.3 for months. Since going to 6.9 beta 30, I reformatted my cache to the new 1MiB partition alignment and also changed Docker to use an XFS image to control the high writes I was getting to my cache SSDs. The machine was working fine just before bed; I was playing Overwatch with no issues.

     

    ***UPDATE*** It hasn't happened since the first night of being on 6.9 beta 30.

    nicknas2-diagnostics-20201011-1053.zip




    User Feedback

    Recommended Comments



    15 minutes ago, John_M said:

    No, you're not. That table was for @turnipisum's AMD hardware. In your earlier post you say you have a 9900K for which Intel specifies DDR4-2666.

    https://ark.intel.com/content/www/us/en/ark/products/186605/intel-core-i9-9900k-processor-16m-cache-up-to-5-00-ghz.html

    Ah, you are right, sorry. Either way, the server passes 4 days of memtest with no issues, so it's certainly not a problem with my RAM. Also, back on beta 25 it doesn't crash at all.

    • Like 1
    Link to comment

    Update! Looking in my BIOS again, I had the power supply idle setting on "low current idle", so I've now set it to "typical current idle" and have almost 4 days of uptime so far! 🤞

    Also, I did some logging of PSU usage just in case. The most I've seen while gaming on 2 VMs, each with a 2070 Super, was 760 watts, so I don't think I'm hitting the limit of the HX1000i. I am going to get a 1600i as soon as I can, but at £460 it's going to have to wait until the new year, as I will need a bigger UPS as well (another £700-1k) 🤪

    Link to comment

    I think I'm having this crash, or a very similar one. It now happens even when I don't have any VMs running.

    But how do you revert back to a specific beta? Or even to stable, for that matter?
    I go to the Update OS page and change "Next" to "Stable", then the page quickly flashes/reloads and the selection is back on "Next". So I don't even know how to revert back to stable. I'd also like to know how to revert back to a specific previous beta (not necessarily the previous one, but maybe a few back).
    Any help?


     

    Edited by Stupifier
    Link to comment

    Any idea if @limetech has looked at this? It's been quite a while and I don't see any attempts to look at this problem. It's affecting beta 35 as well. Presumably this issue would carry forward into the next release too...

    Link to comment
    57 minutes ago, sittingmongoose said:

    Any idea if @limetech has looked at this? It's been quite a while and I don't see any attempts to look at this problem. It's affecting beta 35 as well. Presumably this issue would carry forward into the next release too...

    Nothing new other than the C-states, typical current idle or memory speed tricks to try.

    Yes, I'm on beta 35 now and still having issues! I'm leaning towards a kernel issue, or maybe the Nvidia drivers on the two 2070 Supers in my VMs. I've tried a lot of tweaks; losing track now lol.

    I have seen posts on other forums about bare-metal Ryzen and Linux rigs having lockups as well.

    I'm hoping that the 6.9.0 release will solve it, but who knows.

    Link to comment

    Well, I updated my two Win10 VMs with the latest Nvidia drivers and haven't had a lockup since. I got to almost 5 days, but now I've updated to RC1 and done a few other things, like putting the memory back to 2666 MHz, changing some CPU pinning and swapping around some USB passthrough. So we will see how it goes 🤞

    Link to comment

    Hi,

     

    I also want to share my experience with this issue, because I am affected as well.

    Normally I only run stable releases on my main server, but I am running a Samsung 970 Evo SSD and was seeing excessive writes on the latest stable release (>1 TB per day). I therefore switched to 6.9.0-beta35 and reformatted my SSD, and my writes are fine now. When 6.9.0-rc1 was released I switched to that. I suffered from the lockups on both pre-releases.

    I had been running an Intel system (i7-6700K and MSI Z170 motherboard with 64 GB DDR4). I run the system without any OC, so the CPU was at stock speed and the RAM at 2133 MHz.

    Because my hardware was getting a bit old and I wanted to upgrade anyway, I purchased new hardware so I could exclude any hardware issues. I use Intel Quick Sync in my Plex Docker container, so I bought another Intel system: an i5-10600K in combination with an ASUS Z490 Creator motherboard. When I got the new components, I ran memtest86 for 4 days without any issues. Then I switched my Unraid server over to the new platform, without any success: just after 2 days, the first lockup (running Unraid 6.9.0-rc1).

    Then I wanted to do further hardware tests, so I switched my Unraid server over to the hardware of my main workstation, which is an Intel i9-9900K on an ASUS WS Z390 PRO (also memtest-stable and no OC). With this hardware I also faced a lockup after just 18 hours of uptime.

    I have attached the diagnostics after my latest crash on -rc1, but they are probably not that helpful, because they were captured after the lockup and a hard reset. Also, I run my Unraid server headless down in the basement, so I unfortunately cannot look at the monitor output at the time of the lockup.

    After my own diagnosis I would definitely exclude a hardware error, because I have now tested 3 completely different systems with the same results.


    I would appreciate any help I can get on this.

     

    Edited by greenflash24
    Link to comment

    I run a Threadripper 1950X with an Asus X399 Prime-A motherboard. Originally I didn't have to disable C-states and enable the "typical current idle" power supply setting; I did have to do that on my previous Ryzen 1700X system. Anyway, recent crashes made me revisit that. The other thing I did was adjust my VM to use the new NIC settings, which from previous testing seem to be much slower but more stable; I should check whether that's still the case. But my logs were filling up due to having virtio set instead of virtio-net. I assume having a full drive due to logs isn't great for stability either (maybe it's partitioned though, I haven't checked).
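
    A quick way to check whether the log filesystem really is filling up is to query its disk usage. The following is just a minimal Python sketch; it assumes logs live under /var/log (on a default Unraid install that is, I believe, a small tmpfs) and only reports usage:

    # Minimal sketch: report how full the filesystem holding /var/log is.
    # Assumes the default /var/log location; adjust the path if yours differs.
    import shutil

    total, used, free = shutil.disk_usage("/var/log")
    print(f"/var/log: {used // 1024**2} MiB used of {total // 1024**2} MiB "
          f"({used / total * 100:.1f}% full)")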

     

    Mostly the issues do seem to come when I'm gaming in a Windows 10 VM though; other VMs seem OK. So the main difference I can think of is GPU passthrough.

     

    Also, mine doesn't always crash per se, but dmesg shows kernel messages similar to those posted at the beginning, with kernel traces etc. (I wish I knew how to read those). My memory has also undergone extensive testing.

     

    Is any of this common to anyone else here with system crashes?

     

    Edited by Marshalleq
    Link to comment

    Update!

     

    I went to 6.9.0-rc1, updated the Nvidia drivers on both VMs and got to almost 9 days of uptime! Then I updated to rc2 and within 48 hours I had 2 lockups, so it's still plaguing me! 🤷‍♂️

     

    I have just redone the 2 VMs on new templates using Q35-5.1 (they were on i440fx) with the new VirtIO drivers, 0.1.190, so we will see if that makes any difference.

     

    But in all the lockups it seems to be an IOMMU issue in my case.

    Dec 22 21:30:45 SKYNET-UR kernel: RIP: 0010:__iommu_dma_unmap+0x7a/0xe8
    Dec 22 21:30:45 SKYNET-UR kernel: Code: 46 28 4c 8d 60 ff 48 8d 54 18 ff 49 21 ec 48 f7 d8 4c 29 e5 49 01 d4 49 21 c4 48 89 ee 4c 89 e2 e8 8f df ff ff 4c 39 e0 74 02 <0f> 0b 49 83 be 68 07 00 00 00 75 32 49 8b 45 08 48 8b 40 48 48 85
    Dec 22 21:30:45 SKYNET-UR kernel: RSP: 0018:ffffc900018239f8 EFLAGS: 00010206
    Dec 22 21:30:45 SKYNET-UR kernel: RAX: 0000000000002000 RBX: 0000000000001000 RCX: 0000000000000001
    Dec 22 21:30:45 SKYNET-UR kernel: RDX: ffff888100066e20 RSI: ffffffffffffe000 RDI: 0000000000000009
    Dec 22 21:30:45 SKYNET-UR kernel: RBP: 00000000fed7e000 R08: ffff888100066e20 R09: ffff8881596d6bf0
    Dec 22 21:30:45 SKYNET-UR kernel: R10: 0000000000000009 R11: ffff888000000000 R12: 0000000000001000
    Dec 22 21:30:45 SKYNET-UR kernel: R13: ffff888100066e10 R14: ffff88813da76000 R15: ffffffffa00e0640
    Dec 22 21:30:45 SKYNET-UR kernel: FS:  000014ebd85ae740(0000) GS:ffff889fdd180000(0000) knlGS:0000000000000000
    Dec 22 21:30:45 SKYNET-UR kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec 22 21:30:45 SKYNET-UR kernel: CR2: 000014ebd8740425 CR3: 000000015c5a2000 CR4: 0000000000350ee0
    Dec 22 21:30:45 SKYNET-UR kernel: Call Trace:
    Dec 22 21:30:45 SKYNET-UR kernel: iommu_dma_free+0x1a/0x2b

     

    Link to comment

    Mine's been running a lot longer since a) moving to virtio-net, b) updating to the RC series with the newer kernel, and c) probably more specifically, removing a rather taxing 24x7 compression routine I run (which I will turn back on at some point). I actually haven't had a crash yet; still monitoring.

     

    I also notice, for the first time ever for me on Unraid, the VM gets to the TianoCore boot screen in under 5 seconds. Previously that only ever happened on the first run after a reboot; after that it would take e.g. a minute to get there, or even longer.

     

    I still suspect something about Threadripper has been causing this for me, and I doubt it's gone, just reduced.

    Link to comment

    I'm still getting a crash/lockup about every 3-5 days on rc2. I have no clue what's going on; I've tried so many things I'm out of ideas now!

    Link to comment
    24 minutes ago, turnipisum said:

    I'm out of ideas now!

    What I do in similar circumstances is try to break the problem down into smaller ones. Try running with virtualisation disabled for a while and see if stability improves. It's not a solution but a troubleshooting method.

    Link to comment

    Update!

    Looks like I have finally found the fix for my lockups! It would appear to be a VM/QEMU issue. I changed my machine type from Q35-5.1 to Q35-4.2 and have not had an issue since. Now on 18 days of uptime.

    I had already changed from i440fx to Q35, but had both on 5.1, so I'm guessing that i440fx-4.2 would work fine in my case as well. I want to get 30 days of uptime to be sure, then I will try i440fx-4.2 and see what happens.

     

    Link to comment

    Yes, I don't know what the cause is, but a few days ago I changed to Q35-5.0 and that seemed to fix it. Before that it didn't lock up, but it was painfully slow, which looked like a lockup if you weren't prepared to wait 20 minutes for your VM to boot. :)

     

    So possibly you could go to 5.0 as well.

     

    Edit: 5.0 still had slowness issues, just less than 5.1, so I'm trying 4.2 again.

     

    Looking through the changes here, there's not a lot that's changed, so it shouldn't be too hard to pin down.

    Edited by Marshalleq
    Link to comment

    Update to my last post.

     

    It didn't fix it! ☹️ I got a random uptime of 47 days, but now I'm back to roughly 1-4 days and then a crash.

    I have just disabled the PCIe ACS override to see if that does anything.

     

    Same error as I always get in the logs.

     

    Mar  6 19:50:21 SKYNET-UR kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./TRX40 Creator, BIOS P1.70 05/29/2020
    Mar  6 19:50:21 SKYNET-UR kernel: RIP: 0010:__iommu_dma_unmap+0x7a/0xe8
    Mar  6 19:50:21 SKYNET-UR kernel: Code: 46 28 4c 8d 60 ff 48 8d 54 18 ff 49 21 ec 48 f7 d8 4c 29 e5 49 01 d4 49 21 c4 48 89 ee 4c 89 e2 e8 8f df ff ff 4c 39 e0 74 02 <0f> 0b 49 83 be 68 07 00 00 00 75 32 49 8b 45 08 48 8b 40 48 48 85
    Mar  6 19:50:21 SKYNET-UR kernel: RSP: 0018:ffffc9000468f9f8 EFLAGS: 00010206
    Mar  6 19:50:21 SKYNET-UR kernel: RAX: 0000000000002000 RBX: 0000000000001000 RCX: 0000000000000001
    Mar  6 19:50:21 SKYNET-UR kernel: RDX: ffff888102d55020 RSI: ffffffffffffe000 RDI: 0000000000000009
    Mar  6 19:50:21 SKYNET-UR kernel: RBP: 00000000fed7e000 R08: ffff888102d55020 R09: ffff8881525b2bf0
    Mar  6 19:50:21 SKYNET-UR kernel: R10: 0000000000000009 R11: ffff888000000000 R12: 0000000000001000
    Mar  6 19:50:21 SKYNET-UR kernel: R13: ffff888102d55010 R14: ffff88813d301000 R15: ffffffffa00da640
    Mar  6 19:50:21 SKYNET-UR kernel: FS:  0000148f90fb0740(0000) GS:ffff889fdd840000(0000) knlGS:0000000000000000
    Mar  6 19:50:21 SKYNET-UR kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar  6 19:50:21 SKYNET-UR kernel: CR2: 0000150fa8003340 CR3: 000000014cd8a000 CR4: 0000000000350ee0
    Mar  6 19:50:21 SKYNET-UR kernel: Call Trace:
    Mar  6 19:50:21 SKYNET-UR kernel: iommu_dma_free+0x1a/0x2b

     

    Edited by turnipisum
    Link to comment

    Can you try disabling the VM service entirely for a few days, for troubleshooting purposes?

    I think you are on Intel, or am I wrong?

    Link to comment

    AMD 3970X; full spec below. The issue is that the 3 VMs are in daily use, so I would have to build temp machines in order to kill them off for days.

     

    Case: Corsair Obsidian 750d | MB: Asrock Trx40 Creator | CPU: AMD Threadripper 3970X | Cooler: Noctua NH-U14S | RAM: Corsair LPX 128GB DDR4 C16 | GPU: 2 x MSI RTX 2070 Supers | Cache: Intel 660p Series 1TB M.2 x2 in 2TB Pool | Parity: Ironwolf 6TB | Array Storage: Ironwolf 6TB + Ironwolf 4TB | Unassigned Devices: Corsair 660p M.2 1TB + Kingston 480GB SSD + Skyhawk 2TB | NIC: Intel 82576 Chip, Dual RJ45 Ports, 1Gbit PCI | PSU: Corsair RM1000i

    Link to comment
    On 11/22/2020 at 8:13 PM, turnipisum said:

    Just found the posts about power supply idle and C-states, so I'm trying that to see what happens.

     

     

    Thanks @ich777, but I gave that a try months ago lol

    Link to comment

    Well, I'm hoping I have finally sorted it! I've never had an uptime that long since it was built.

    [Screenshot: server uptime, 2021-10-03]

     

    I found out that the 128GB of Corsair LPX 16GB DIMMs in the server have different version numbers, which correspond to different memory ICs! Luckily I had more DIMMs in another machine, so I have managed to put together a 128GB set with the same chips, and it looks like that has got me sorted at long last.

     

    Below is a quote from Reddit about the version numbers.

     

    Quote

    Corsair

    "Version Number"

    Corsair sticks identify the IC with a 'version number' on the label such as "ver4.31" - props to them for this as it helps even less knowledgeable users to match kits when adding more sticks retroactively. The DDR4 numbers aren't officially documented, but they follow the same pattern as DDR3.

    The numbers take the "ver X.YZ" format where
    * X is IC maker - 3 for Micron/Spectek, 4 for Samsung, 5 for Hynix, 8 for Nanya as with DDR3.
    * Y seems to be capacity per rank - 1 for 2GB, 2 for 4GB, 3 for 8GB, 4 for 16GB. Usually this translates directly to IC density (8GB/rank = 8Gbit), but ver4.14 which uses half as many double width "x16" 4Gbit chips is a special case.
    * Z is revision, usually starting from A=0 and usually counting up one letter per increment. Hynix's first revisions are lettered "M" which is numbered as X.Y9, Samsung now do this too and it will presumably be the same.

    Micron ICs seem to be numbered oddly with different "version numbers" for different JEDEC bins, and different revisions under the same "version number".

    The known and possible version numbers are as follows;

    Version | Vendor  | IC                                 | Confirmation?
    3.20    | Micron  | 4Gbit Rev.A                        | Presumed
    3.21    | Micron  | 4Gbit Rev.B                        | Confirmed
    3.22    | Micron  | 4Gbit Rev.E*                       | Speculated
    3.22    | Micron  | 4Gbit Rev.F*                       | Confirmed
    3.31    | Micron  | 8Gbit Rev.B                        | Confirmed
    3.31    | Micron  | 8Gbit Rev.D                        | Presumed
    3.31    | Micron  | 8Gbit Rev.E                        | Confirmed
    3.32    | Micron  | 8Gbit Rev.H                        | Confirmed
    3.32    | Micron  | ??????????                         | wk27 '17 2x8GB 2666 16-18-18-36 1.2V
    3.32    | Micron  | ??????????                         | wk46 '19 2x8GB 3000 15-17-17-35 1.35V
    3.40    | Micron  | 16Gbit Rev.B (2133 bin)            | Confirmed
    3.41    | Micron  | ??????????                         | wk44 '20 2x16GB 3600 18-22-22-42 1.35V
    3.43    | Micron  | ??????????                         | wk43 '20 2x16GB 3200 16-19-19-36 1.35V
    3.43    | Micron  | 16Gbit Rev.E??? (or bad bin Rev.B) | wk51 '20 2x16GB 3200 16-20-20-38 1.35V
    3.44    | Micron  | 16Gbit Rev.B (2666 bin)            | Confirmed
    4.14    | Samsung | 4Gbit D-die (4x16)                 | Confirmed
    4.23    | Samsung | 4Gbit D-die                        | Confirmed
    4.24    | Samsung | 4Gbit E-die                        | Confirmed
    4.21    | Samsung | 8Gbit B-die (4x16)                 | Presumed
    4.31    | Samsung | 8Gbit B-die                        | Confirmed
    4.31    | Samsung | 8Gbit C-die**                      | Presumed
    4.32    | Samsung | 8Gbit C-die                        | Confirmed
    4.33    | Samsung | 8Gbit D-die                        | Presumed
    4.34    | Samsung | 8Gbit E-die                        | Presumed
    4.49    | Samsung | 16Gbit M-die                       | Presumed
    4.40    | Samsung | 16Gbit A-die                       | Speculated
    5.29    | Hynix   | 4Gbit MFR                          | Confirmed
    5.20    | Hynix   | 4Gbit AFR                          | Confirmed
    5.21    | Hynix   | 4Gbit BJR                          | Speculated
    5.22    | Hynix   | 4Gbit CJR                          | Presumed
    5.39    | Hynix   | 8Gbit MFR                          | Confirmed
    5.30    | Hynix   | 8Gbit AFR                          | Confirmed
    5.31    | Hynix   | 8Gbit "BFR"???                     | Speculated
    5.32    | Hynix   | 8Gbit CJR                          | Confirmed
    5.33    | Hynix   | 8Gbit DJR                          | Presumed
    5.38    | Hynix   | 8Gbit JJR                          | Presumed
    5.49    | Hynix   | 16Gbit MJR                         | Presumed
    8.20    | Nanya   | 4Gbit Rev.A                        | Speculated
    8.21    | Nanya   | 4Gbit Rev.B***                     | Presumed
    8.23    | Nanya   | 4Gbit Rev.D***                     | Presumed
    8.30    | Nanya   | 8Gbit Rev.A                        | Presumed
    8.31    | Nanya   | 8Gbit Rev.B****                    | Confirmed

    Especially with Micron, Corsair version numbers are sometimes weird. Confirmed means an IC has been seen under a version number, not that it can't also cover something else.

    *Rev.F is confirmed to come in ver3.22 sticks, but that doesn't leave a gap for Rev.E. It's wildly guessed that they may both appear under 3.22.
    **TechPowerUp recently got a sample kit of Vengeance RGB Pro SL 2x8GB 3600c18 under this version; however, the chips had SAC marks on them (which by Corsair's IC labeling scheme would indicate C-die) and behaved like C-die in OCing.
    ***Version number seen in the wild, IC unconfirmed.
    ****Deduced from the NAB... Corsair code on the ICs, as well as a Corsair rep statement, acc. to one post from China.

    Date code

    The first 4 digits of a Corsair serial number are a date code in the form yyww, eg 1528 is week 28 2015.

    Corsair relabeled ICs

    Some ICs loaded into Corsair sticks have been shown to assume a marking with a Corsair logo and two text lines, the first presumably stating the IC configuration, and the second featuring an internal Corsair code that seems to correspond to the IC manufacturer and stepping, as well as a yyww format date at the end. Unfortunately, such kind of marking has only been confirmed in some ver5.xx (Hynix) and 8.xx (Nanya) sticks. Samsungs (ver4.xx) may have it too but as seen in the ver4.31 example, it may collide with the version number scheme.

    Version         | Code   | IC                   | Original partial mark
    4.31            | SAC... | Samsung 8Gbit C-die  | none, determined by OCing behaviour
    5.20            | HYA... | Hynix 4Gbit AFR      | DWMF...
    5.30            | HYA... | Hynix 8Gbit AFR      | DTCC...
    5.32            | HYC... | Hynix 8Gbit CJR      | DTBM... / none
    - (ValueSelect) | NAA... | Nanya 8Gbit A-die?   | arbitrary Nanya
    8.31            | NAB... | Nanya 8Gbit B-die?   | arbitrary Nanya
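
    To make the "ver X.YZ" scheme above concrete, here is a minimal Python sketch that decodes the maker and per-rank capacity digits, plus the yyww serial date code. The mappings come straight from the quoted post; the revision digit varies by vendor, so it is reported as-is rather than mapped to a die name, and the function names are just for illustration. Treat the output as a hint, not a definitive identification.

    # Minimal sketch (based on the quoted post): decode a Corsair "ver X.YZ"
    # label and a serial-number date code.
    IC_MAKERS = {"3": "Micron/Spectek", "4": "Samsung", "5": "Hynix", "8": "Nanya"}
    GB_PER_RANK = {"1": 2, "2": 4, "3": 8, "4": 16}

    def decode_version(label: str) -> str:
        """e.g. 'ver4.31' -> 'Samsung, 8GB per rank, revision digit 1'."""
        digits = label.lower().removeprefix("ver").strip()
        x, yz = digits.split(".")
        maker = IC_MAKERS.get(x, "unknown maker")
        cap = GB_PER_RANK.get(yz[0])
        cap_str = f"{cap}GB per rank" if cap else "unknown capacity"
        return f"{maker}, {cap_str}, revision digit {yz[1]}"

    def decode_date_code(serial: str) -> str:
        """First 4 digits of the serial are yyww, e.g. '1528...' -> week 28 of 2015."""
        return f"week {int(serial[2:4])} of {2000 + int(serial[:2])}"

    if __name__ == "__main__":
        print(decode_version("ver4.31"))      # Samsung, 8GB per rank, revision digit 1
        print(decode_date_code("1528xxxxx"))  # week 28 of 2015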

     

    • Like 2
    Link to comment




