[6.9.0-beta30] New unRAID crashes after adding cache and parity



OS Version: 6.9.0-beta25 and 6.9.0-beta29

Processor: AMD Ryzen 7 PRO 4750G

Motherboard: MSI MAG B550 MORTAR

Memory: G.SKILL TridentZ 16GB (2x8GB) DDR4 3866

PSU: Corsair RM550x

SSD (Cache): ADATA XPG SX8200 Pro 1TB

Drives: 2x WD Red 10TB CMR, 4x WD Red 4TB CMR (not installed yet)

 

Hey guys,

 

I'm brand new to unRAID, coming from Synology. I've really enjoyed what I have set up so far, but I am now running into an issue with my unRAID server crashing. I purchased brand-new hardware for this build, except for the 4x 4TB drives that I plan on moving over to unRAID once everything is stable. I brought the server up with only 1x 10TB and 1x 4TB in the array and was able to move everything off of my Synology NAS onto the new unRAID array without any issues. The system seemed stable for about 2 days with nothing to report, so I began installing a few Docker containers like Plex and Swag. Everything seemed to be working fine.

 

Once the transfer was complete, I hooked up my ADATA SSD as `cache` plus my other 10TB as parity and began the parity build process. Within about 3 hours the server went completely unresponsive out of nowhere: it would not respond to the webUI or SSH in any way, and the only way to get it back was a hard reset. Since then, the server has repeated this same crash behavior every couple of hours, sometimes quicker.

 

Steps I have tried:

  • Updated unRAID to 6.9.0-beta29 for the newer Linux kernel. This didn't seem to make a difference. I am unable to downgrade to the stable builds because they do not contain my motherboard's Ethernet driver, and the latest kernels are also much better for the recent Ryzen APUs. I haven't observed any qualitative difference in this behavior between 6.9.0-beta25 and 6.9.0-beta29.
  • Booted MemTest86 with Legacy boot and ran it overnight for ~11 hours without any errors.
  • Booted the server with a monitor attached and was able to see the following kernel panic:

 

[IMG-4794.png: photo of the kernel panic captured on the attached monitor]

 

  • Ran the 'Fix Common Problems' plugin: nothing of any real interest was found.
  • Used zenstates to disable C6 (see the sketch after this list). It didn't make a difference and the server still crashed.
  • Booted the array without a parity drive, but with the cache still selected. No parity build started for this test and the Docker containers were all running. Observed the same failure.
  • Booted the array without the cache drive, but with the parity drive re-selected. This is currently still working with an uptime of about 6 hours, which is longer than it has taken for previous failures to occur (usually about 1-2 hours). My goal is to complete the parity build before making any further changes. Docker is also disabled because the cache is disconnected.
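
For reference, here is roughly how I disabled C6 — just a minimal sketch of my own steps using the zenstates.py script from the ZenStates-Linux GitHub project, with python available via the NerdPack plugin (the /boot paths are simply where I chose to keep things):

# one-time download of the script to the flash drive
wget -O /boot/zenstates.py https://raw.githubusercontent.com/r4m0n/ZenStates-Linux/master/zenstates.py

# the script reads/writes MSRs, so the msr module has to be loaded first;
# I put these next two lines in /boot/config/go so they run on every boot
modprobe msr
python /boot/zenstates.py --c6-disable

# list P-states and confirm C6 now shows as disabled
python /boot/zenstates.py --list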

 

One thing to note: my docker.img is set up at `/mnt/cache/docker.img`, with my appdata folder at `/mnt/user/appdata/` set to Prefer Cache. This means Docker is not currently running, because I started the array without the cache. Perhaps something in my Docker config or containers is causing this, but I haven't really done anything out of the ordinary there. I am only running Plex-Media-Server, Swag, Tautulli, DuckDNS, and binhex-delugevpn at this point, so nothing too crazy, and I have left the settings mostly at the defaults / what SpaceInvader recommended in a few of his videos.
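
For what it's worth, once the cache is reattached this is roughly what I plan to check to confirm where docker.img and appdata actually live — just plain ls against the individual mount points, so the paths below reflect my own share layout:

# docker.img should exist only on the cache pool
ls -lh /mnt/cache/docker.img

# with appdata set to "Prefer: Cache", files should sit on the cache
# rather than being scattered across the array disks
ls /mnt/cache/appdata
ls -d /mnt/disk*/appdata 2>/dev/null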

 

Where do you guys suggest I go next? Once my parity build is complete, I am comfortable reconnecting the cache and doing some more debugging. I have attached my diagnostics from the current server boot (no cache) in case they are any help. I'm not sure whether this points to a hardware issue or a software issue, and I am not experienced enough with unRAID to know how to debug deeper than this.

 

P.S. I see `* If the system crashes completely and there is no way to capture a final syslog, then start a tail on the unRAID console or Telnet session (tail -f /var/log/syslog).` in the read-me for this forum. I will run this once I reattach the cache. I suspect it would show the same kernel panic above, but maybe with some better contextual information.
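
Since the box dies hard, my plan is to pipe that tail into a file on the flash drive so the last lines survive the reset — something like the sketch below (the /boot/logs path is just my choice; I believe the newer builds also have a "Mirror syslog to flash" option under Settings > Syslog Server that does much the same thing):

# keep a live copy of the syslog on the flash drive so it survives a hard reset
mkdir -p /boot/logs
tail -f /var/log/syslog | tee /boot/logs/syslog-live.txt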

tower-diagnostics-20201001-1718_noCacheAttached.zip


Thank you for the suggestion. When I originally configured the BIOS, the memory speed populated at 2133 MT/s. Since the sticks are sold as 3866, I figured XMP was required to get their intended performance. I wasn't familiar with XMP, so it never really occurred to me that this was a big overclock on the memory.

 

I went ahead and made two changes:

1. Updated the BIOS to a beta version from MSI (v1.44, dated 9/29/2020) that is intended to improve APU performance on B550 boards.

2. Reset my memory back to the default 2133 MT/s by turning XMP off.

 

I am happy to report that after about 36 hours, the system appears stable without a single crash. I am going to experiment with memory settings to try to reach the 3200 MT/s maximum supported by the 4750G processor, and hopefully I can find some rock-solid settings at that frequency. If not, I will just bite the bullet and keep the memory at the default 2133 MT/s.
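
For anyone following along, this is how I've been checking what speed the DIMMs are actually running at from within unRAID — standard dmidecode output, so nothing here is specific to my board:

# show the rated and currently configured speed of each DIMM
dmidecode --type 17 | grep -iE 'speed|part number'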

 

I appreciate the help.

  • 2 weeks later...

Just as an update: I have still been unable to fully solve this issue. I have played around with memory speeds ad nauseam at this point. While decreasing the memory speed to 2133 MT/s does seem to make the crashes less frequent, it has not solved the problem; I'm still getting one every 2 to 3 days.

 

I have also modified the C-state / idle behavior in the BIOS and found no improvement... I'm running out of things to try at this point.


Sorry to keep bumping my own post with no responses, but I found some more information that could help diagnose this properly. I noticed that my server occasionally does a sort of "hiccup" where it stops responding to pings and the webUI entirely for somewhere between 10 and 30 seconds. Usually when I notice this I assume it has crashed, but sometimes it comes back on its own. I checked the syslog after one of these and found this:
 

Oct 13 20:48:36 Athena kernel: WARNING: CPU: 14 PID: 14184 at drivers/iommu/iova.c:814 iova_magazine_free_pfns.part.0+0x37/0x5e
Oct 13 20:48:36 Athena kernel: Modules linked in: xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_filter iptable_nat nf_nat ip_tables xfs nfsd lockd grace sunrpc md_mod bonding edac_mce_amd kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof mpt3sas ahci i2c_piix4 raid_class rapl r8125(O) ccp nvme i2c_core libahci video scsi_transport_sas k10temp r8169 realtek backlight nvme_core acpi_cpufreq wmi button
Oct 13 20:48:36 Athena kernel: CPU: 14 PID: 14184 Comm: kworker/14:1 Tainted: G        W  O      5.8.13-Unraid #1
Oct 13 20:48:36 Athena kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C94/MAG B550M MORTAR (MS-7C94), BIOS 1.44 09/29/2020
Oct 13 20:48:36 Athena kernel: Workqueue: events rtl8125_reset_task [r8125]
Oct 13 20:48:36 Athena kernel: RIP: 0010:iova_magazine_free_pfns.part.0+0x37/0x5e
Oct 13 20:48:36 Athena kernel: Code: 89 fb 48 89 f7 e8 45 ec 29 00 49 89 c4 49 63 c5 48 3b 03 73 23 48 8b 74 c3 08 48 89 ef e8 6e fb ff ff 48 85 c0 48 89 c6 75 04 <0f> 0b eb 05 e8 4c ff ff ff 41 ff c5 eb d5 4c 89 e6 48 89 ef e8 a7
Oct 13 20:48:36 Athena kernel: RSP: 0018:ffffc900016afd20 EFLAGS: 00010046
Oct 13 20:48:36 Athena kernel: RAX: 0000000000000000 RBX: ffff888167624000 RCX: 000000008040003d
Oct 13 20:48:36 Athena kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8883fb4e9008
Oct 13 20:48:36 Athena kernel: RBP: ffff8883fb4e9008 R08: 0000000000000001 R09: ffffffff8141efc9
Oct 13 20:48:36 Athena kernel: R10: ffff8883fb38a6c0 R11: ffff8883fb38a6c0 R12: 0000000000000046
Oct 13 20:48:36 Athena kernel: R13: 0000000000000040 R14: ffff8883fb4e9008 R15: ffff8883fb4e9088
Oct 13 20:48:36 Athena kernel: FS:  0000000000000000(0000) GS:ffff8883ff380000(0000) knlGS:0000000000000000
Oct 13 20:48:36 Athena kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 13 20:48:36 Athena kernel: CR2: 000014e1d4f0c000 CR3: 000000035adec000 CR4: 0000000000340ee0
Oct 13 20:48:36 Athena kernel: Call Trace:
Oct 13 20:48:36 Athena kernel: free_iova_fast+0x167/0x186
Oct 13 20:48:36 Athena kernel: fq_ring_free+0x74/0x92
Oct 13 20:48:36 Athena kernel: queue_iova+0x74/0x104
Oct 13 20:48:36 Athena kernel: __iommu_dma_unmap+0xc6/0xe8
Oct 13 20:48:36 Athena kernel: rtl8125_rx_clear+0x4b/0x85 [r8125]
Oct 13 20:48:36 Athena kernel: rtl8125_reset_task+0x9d/0x114 [r8125]
Oct 13 20:48:36 Athena kernel: process_one_work+0x13c/0x1d5
Oct 13 20:48:36 Athena kernel: worker_thread+0x18b/0x22f
Oct 13 20:48:36 Athena kernel: ? process_scheduled_works+0x27/0x27
Oct 13 20:48:36 Athena kernel: kthread+0xe5/0xea
Oct 13 20:48:36 Athena kernel: ? kthread_unpark+0x52/0x52
Oct 13 20:48:36 Athena kernel: ret_from_fork+0x22/0x30
Oct 13 20:48:36 Athena kernel: ---[ end trace 2e5dccb3ecd7d581 ]---
Oct 13 20:48:36 Athena dhcpcd[1891]: br0: carrier acquired
Oct 13 20:48:36 Athena dhcpcd[1891]: br0: rebinding lease of 192.168.1.2
Oct 13 20:48:37 Athena kernel: r8125: eth0: link down
Oct 13 20:48:37 Athena kernel: bond0: (slave eth0): link status definitely down, disabling slave
Oct 13 20:48:37 Athena kernel: device eth0 left promiscuous mode
Oct 13 20:48:37 Athena kernel: bond0: now running without any active interface!
Oct 13 20:48:37 Athena kernel: br0: port 1(bond0) entered disabled state
Oct 13 20:48:38 Athena dhcpcd[1891]: br0: carrier lost
Oct 13 20:48:40 Athena kernel: r8125: eth0: link up
Oct 13 20:48:40 Athena dhcpcd[1891]: br0: carrier acquired
Oct 13 20:48:40 Athena kernel: bond0: (slave eth0): link status definitely up, 1000 Mbps full duplex
Oct 13 20:48:40 Athena kernel: bond0: (slave eth0): making interface the new active one
Oct 13 20:48:40 Athena kernel: device eth0 entered promiscuous mode
Oct 13 20:48:40 Athena kernel: bond0: active interface up!
Oct 13 20:48:40 Athena kernel: br0: port 1(bond0) entered blocking state
Oct 13 20:48:40 Athena kernel: br0: port 1(bond0) entered forwarding state
Oct 13 20:48:40 Athena dhcpcd[1891]: br0: rebinding lease of 192.168.1.2
Oct 13 20:48:41 Athena dhcpcd[1891]: br0: probing address 192.168.1.2/24
Oct 13 20:48:47 Athena dhcpcd[1891]: br0: leased 192.168.1.2 for 86400 seconds
Oct 13 20:48:47 Athena dhcpcd[1891]: br0: adding route to 192.168.1.0/24
Oct 13 20:48:47 Athena dhcpcd[1891]: br0: adding default route via 192.168.1.1
Oct 13 20:48:48 Athena ntpd[1952]: Listen normally on 14 br0 192.168.1.2:123

 

I attached my whole diagnostic here as well. This trace occurs multiple times in a row. It appears to be related to the rtl8125, the relatively new 2.5G NIC that is part of my motherboard; it's actually the reason I needed to use the beta unRAID versions in the first place...
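
For context, this is how I confirmed which driver the onboard NIC is bound to and that the AMD IOMMU is active — just lspci and dmesg, so treat the grep patterns as approximate:

# which kernel driver is bound to the onboard 2.5G NIC
lspci -nnk | grep -iA3 ethernet

# whether the AMD IOMMU (AMD-Vi) is enabled and in which mode
dmesg | grep -iE 'iommu|amd-vi'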

 

Has anyone seen this before, or does anybody have suggestions? I am going to try two things:

1. Disable br0 and just use the connection without bonding enabled (although I think I need bridging for Docker).

2. Buy a PCIe x1 network card and use that instead of the built-in Ethernet.
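
There is also a third option I've seen suggested for IOMMU-related traces like this: putting the IOMMU into passthrough mode with iommu=pt on the kernel command line. A minimal sketch of what that edit would look like in the default boot entry of /boot/syslinux/syslinux.cfg — I haven't actually tried it yet, so treat it as an idea rather than a confirmed fix:

label Unraid OS
  menu default
  kernel /bzimage
  append iommu=pt initrd=/bzroot

If that quiets the iova warnings, it would at least point the finger firmly at the r8125/IOMMU combination.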

 

I'll report back.

athena-diagnostics-20201013-2053.zip

