[SOLVED] Help Required Identifying Cause of Crash (Parity Lost)



Hey all,

 

I have been running unRAID for a while now. This morning, when I opened Plex to watch some content, my entire array crashed, and after restarting, one of my drives came back with a "Device Disabled, Content Emulated" status. I checked the drive, and its SMART report looks clean and healthy as far as I can tell, aside from the usual wear and tear. Since everything on the drive looks good, I have started rebuilding it, as it looks like the drive fell out of sync with the existing parity while things were in a weird state.

 

I have attached my diagnostics from when the array crashed (before restarting) as well as the SMART report for the drive in question. I am hoping someone with the expertise in these forums can point me in the right direction on this one.

 

For details, I am running unRAID 6.8.3 on a Dell R710 with 1 parity drive, 4 data drives, and 1 cache drive. Disk 1 is the drive that was showing the Device Disabled notification and is currently having its data rebuilt. Feel free to ask if you would like any additional information beyond the diagnostics and SMART report.

nuclear-winter-diagnostics-20200804-1236.zip nuclear-winter-smart-20200804-1655.zip

Edited by backlands
Issue Solved

HBA problems:

Aug  4 12:28:12 NUCLEAR-WINTER kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Unraid ended up losing connection with all disks. When this happens you'll get as many disabled disks as there are parity devices, and which disk(s) get disabled is a crap-shoot.

 

 

Issue started here:

 

Aug  4 12:28:12 NUCLEAR-WINTER kernel: pcieport 0000:00:05.0: AER: Uncorrected (Fatal) error received: 0000:00:05.0
Aug  4 12:28:12 NUCLEAR-WINTER kernel: pcieport 0000:00:05.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
Aug  4 12:28:12 NUCLEAR-WINTER kernel: pcieport 0000:00:05.0:   device [8086:340c] error status/mask=00000020/00318000
Aug  4 12:28:12 NUCLEAR-WINTER kernel: pcieport 0000:00:05.0:    [ 5] SDES                   (First)

This is a fatal error, but I can't tell whether the problem is with the board or the HBA. Try the HBA in a different slot if one is available, and make sure it's sufficiently cooled.
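
For anyone wanting to read the `error status/mask=00000020/00318000` line themselves, here is a rough sketch of decoding it. The bit table is an assumption on my part: it uses the short names I believe the Linux AER driver prints for the Uncorrectable Error Status register, so check it against the PCIe spec before relying on it. `decode_aer` is just a hypothetical helper name.

```python
# Assumed bit layout of the PCIe AER Uncorrectable Error Status register,
# with the short names the Linux kernel appears to print in AER messages.
AER_UNCORRECTABLE_BITS = {
    4: "DLP",         # Data Link Protocol Error
    5: "SDES",        # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
    21: "ACSViol",    # ACS Violation
}


def decode_aer(status_mask):
    """Split 'status/mask' hex text and name the bits set in each half."""
    status_hex, mask_hex = status_mask.split("/")

    def names(value):
        # Unknown bits are reported as 'bitN' rather than guessed at.
        return [
            AER_UNCORRECTABLE_BITS.get(bit, f"bit{bit}")
            for bit in range(32)
            if (value >> bit) & 1
        ]

    return names(int(status_hex, 16)), names(int(mask_hex, 16))


reported, masked = decode_aer("00000020/00318000")
print("reported:", reported)
print("masked:  ", masked)
```

With the values from the log, the only reported bit is bit 5, which matches the `[ 5] SDES` line the kernel printed. Under the table above that is a Surprise Down error, i.e. the link to the device dropped unexpectedly, which would fit the HBA going non-operational at the same second.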

 

Also note that multiple hardware errors are being logged, e.g.:

 

Aug  4 02:32:24 NUCLEAR-WINTER kernel: mce: [Hardware Error]: Machine check events logged
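
If you want to check a syslog for these same signatures without reading it line by line, something like the following could work. It is only a minimal sketch: the three patterns are taken from the lines quoted in this thread, and the category names and the `scan_syslog` helper are made up for illustration.

```python
import re
from collections import Counter

# Patterns copied from the log lines quoted in this thread.
SIGNATURES = {
    "hba_down": re.compile(r"mpt2sas_cm\d+: SAS host is non-operational"),
    "pcie_fatal": re.compile(r"AER: Uncorrected \(Fatal\) error received"),
    "machine_check": re.compile(r"mce: \[Hardware Error\]"),
}


def scan_syslog(lines):
    """Count matches per signature and remember when each first appeared."""
    hits = Counter()
    first_seen = {}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name] += 1
                # Classic syslog timestamps occupy the first 15 characters.
                first_seen.setdefault(name, line[:15])
    return hits, first_seen


sample = [
    "Aug  4 02:32:24 NUCLEAR-WINTER kernel: mce: [Hardware Error]: Machine check events logged",
    "Aug  4 12:28:12 NUCLEAR-WINTER kernel: pcieport 0000:00:05.0: AER: Uncorrected (Fatal) error received: 0000:00:05.0",
    "Aug  4 12:28:12 NUCLEAR-WINTER kernel: mpt2sas_cm0: SAS host is non-operational !!!!",
]
hits, first_seen = scan_syslog(sample)
for name in SIGNATURES:
    print(name, hits[name], first_seen.get(name))
```

To run it against a real log, pass an open file instead of `sample`, for example `scan_syslog(open(path))` with `path` pointing at the syslog from your diagnostics zip.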

 

 

 

 

 


Thanks for looking into that, Johnnie! I am not sure why the HBA crashed like that. I am moving it to another slot on the board and will keep an eye on it; hopefully nothing shows up again. I don't think it is a cooling issue, as I typically see temps across the board of around 40C at load and 30C at idle. This weekend I might bring the system down and do a deep clean of the fans and everything.

 

I also checked the iDRAC card on the system for any logged events related to the hardware errors you mentioned and found nothing useful there. Is there anywhere else I can find more details on which devices were reporting the errors? What I found when searching said to check the logging on the board (which, from my understanding, would be iDRAC for me), but let me know if I should look elsewhere.

 

Thanks again for your help. I really just want a stable system, and I hope I can get there with unRAID after some troubleshooting.


I double-checked to confirm: I do have mcelog installed, so it should be logging additional info. My current understanding is that mcelog output is included in the diagnostics export, going by the note "The output of mcelog (if installed) has been logged".

 

I also checked my previously posted diagnostics to confirm that it is installed and found the following in syslog2.txt:

Jul 25 22:11:04 NUCLEAR-WINTER nerdpack: Installing mcelog-161 package...
Jul 25 22:11:04 NUCLEAR-WINTER root: 
Jul 25 22:11:04 NUCLEAR-WINTER root: Installing mcelog-161 package...

I haven't noticed any machine check events this boot and will keep an eye on things unless you have additional steps I should take at this time. Are there any other packages I should grab from NerdPack that might help with this? I may also shut down and run a memtest this weekend. Do you think that is worthwhile at this point?


I have continued to monitor the system and was having a run of stability, but it has gone down once again. The error notice from iDRAC is "OS Stop: unknown event", and the syslog ends with the following messages and nothing after. Let me know if you would like the full syslog. Unfortunately I can't provide diagnostics, as the system was unreachable after this error, so I can't quickly anonymize the log file.

 

Aug 12 17:09:57 NUCLEAR-WINTER emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Aug 12 17:37:37 NUCLEAR-WINTER emhttpd: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Aug 13 04:00:01 NUCLEAR-WINTER Plugin Auto Update: Checking for available plugin updates
Aug 13 04:00:06 NUCLEAR-WINTER Plugin Auto Update: Community Applications Plugin Auto Update finished
Aug 13 04:40:01 NUCLEAR-WINTER root: Fix Common Problems Version 2020.08.02
Aug 13 04:40:01 NUCLEAR-WINTER kernel: BUG: unable to handle kernel paging request at 0000000000ffffa0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: PGD 80000011364ec067 P4D 80000011364ec067 PUD 11af050067 PMD 0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Oops: 0000 [#1] SMP PTI
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CPU: 12 PID: 672 Comm: curl Tainted: G        W I       4.19.107-Unraid #1
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Hardware name: Dell Inc. PowerEdge R710/00W9X3, BIOS 6.6.0 05/22/2018
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RIP: 0010:vma_interval_tree_remove+0x1d4/0x231
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Code: 80 e6 01 48 0f 45 fa 48 89 fd eb 4e 48 8b 50 b0 48 2b 50 a8 48 8b 48 40 48 c1 ea 0c 48 8d 54 0a ff 48 8b 48 10 48 85 c9 74 0b <48> 8b 49 18 48 39 ca 48 0f 42 d1 48 8b 48 08 48 85 c9 74 0b 48 8b
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RSP: 0018:ffffc900081d3c38 EFLAGS: 00010206
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RAX: ffff8891ac7af4fc RBX: ffff889085e6f800 RCX: 0000000000ffff88
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RDX: ffffffffffffffff RSI: ffff8891eb205aa0 RDI: ffff889085e6f800
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RBP: 0000000000000000 R08: ffffffff811011d3 R09: ffff889085e6e658
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R10: ffff889085e6f858 R11: ffff889085e6eaa0 R12: ffff8891eb205aa0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R13: ffff889085e6f858 R14: 0000000000000000 R15: 0000000000001000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: FS:  0000148db728e700(0000) GS:ffff8891f7b00000(0000) knlGS:0000000000000000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CR2: 0000000000ffffa0 CR3: 0000000175330000 CR4: 00000000000006e0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Call Trace:
Aug 13 04:40:01 NUCLEAR-WINTER kernel: __vma_adjust+0x273/0x58c
Aug 13 04:40:01 NUCLEAR-WINTER kernel: ? memcg_kmem_get_cache+0xb9/0x1a0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: __split_vma+0x10d/0x16f
Aug 13 04:40:01 NUCLEAR-WINTER kernel: do_munmap+0x159/0x2c0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: ? vma_link+0x6f/0x7c
Aug 13 04:40:01 NUCLEAR-WINTER kernel: mmap_region+0xfe/0x41b
Aug 13 04:40:01 NUCLEAR-WINTER kernel: do_mmap+0x403/0x459
Aug 13 04:40:01 NUCLEAR-WINTER kernel: vm_mmap_pgoff+0x91/0xde
Aug 13 04:40:01 NUCLEAR-WINTER kernel: ksys_mmap_pgoff+0x17c/0x1bb
Aug 13 04:40:01 NUCLEAR-WINTER kernel: do_syscall_64+0x57/0xf2
Aug 13 04:40:01 NUCLEAR-WINTER kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RIP: 0033:0x148db7f90512
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Code: eb aa 66 0f 1f 44 00 00 41 f7 c1 ff 0f 00 00 75 27 55 48 89 fd 53 89 cb 48 85 ff 74 33 41 89 da 48 89 ef b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 56 5b 5d c3 0f 1f 00 c7 05 16 ec 00 00 16 00
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RSP: 002b:0000148db728cad8 EFLAGS: 00000206 ORIG_RAX: 0000000000000009
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RAX: ffffffffffffffda RBX: 0000000000000812 RCX: 0000148db7f90512
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000148db7076000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RBP: 0000148db7076000 R08: 0000000000000005 R09: 0000000000005000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R10: 0000000000000812 R11: 0000000000000206 R12: 0000148db00015f0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R13: 0000148db728cec8 R14: 0000000000000004 R15: 0000000000000002
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Modules linked in: macvlan veth xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables ipmi_devintf ipmi_si md_mod xfs nfsd lockd grace sunrpc bonding bnx2 sr_mod cdrom intel_p>
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CR2: 0000000000ffffa0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: ---[ end trace 52838674856dc5cd ]---
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RIP: 0010:vma_interval_tree_remove+0x1d4/0x231
Aug 13 04:40:01 NUCLEAR-WINTER kernel: Code: 80 e6 01 48 0f 45 fa 48 89 fd eb 4e 48 8b 50 b0 48 2b 50 a8 48 8b 48 40 48 c1 ea 0c 48 8d 54 0a ff 48 8b 48 10 48 85 c9 74 0b <48> 8b 49 18 48 39 ca 48 0f 42 d1 48 8b 48 08 48 85 c9 74 0b 48 8b
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RSP: 0018:ffffc900081d3c38 EFLAGS: 00010206
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RAX: ffff8891ac7af4fc RBX: ffff889085e6f800 RCX: 0000000000ffff88
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RDX: ffffffffffffffff RSI: ffff8891eb205aa0 RDI: ffff889085e6f800
Aug 13 04:40:01 NUCLEAR-WINTER kernel: RBP: 0000000000000000 R08: ffffffff811011d3 R09: ffff889085e6e658
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R10: ffff889085e6f858 R11: ffff889085e6eaa0 R12: ffff8891eb205aa0
Aug 13 04:40:01 NUCLEAR-WINTER kernel: R13: ffff889085e6f858 R14: 0000000000000000 R15: 0000000000001000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: FS:  0000148db728e700(0000) GS:ffff8891f7b00000(0000) knlGS:0000000000000000
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 13 04:40:01 NUCLEAR-WINTER kernel: CR2: 0000000000ffffa0 CR3: 0000000175330000 CR4: 00000000000006e0
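
In case it helps anyone reading a trace like this, here is a small sketch that pulls out the parts that usually matter first: the BUG line, the task that was running, and the kernel function at the faulting RIP. `summarize_oops` is a hypothetical helper, and the patterns are only matched against the format seen in this log.

```python
import re


def summarize_oops(lines):
    """Collect the first BUG line, the faulting task, and the kernel RIP."""
    summary = {}
    for line in lines:
        # Drop the "date host kernel: " prefix if present.
        msg = line.split("kernel: ", 1)[-1].strip()
        if msg.startswith("BUG:"):
            summary.setdefault("bug", msg)
            continue
        task = re.match(r"CPU: (\d+) PID: (\d+) Comm: (\S+)", msg)
        if task:
            summary.setdefault("cpu", int(task.group(1)))
            summary.setdefault("pid", int(task.group(2)))
            summary.setdefault("comm", task.group(3))
            continue
        # Code segment 0010 marks the kernel-mode RIP; 0033 is user space.
        rip = re.match(r"RIP: 0010:(\S+)", msg)
        if rip:
            summary.setdefault("kernel_rip", rip.group(1))
    return summary


sample = [
    "Aug 13 04:40:01 NUCLEAR-WINTER kernel: BUG: unable to handle kernel paging request at 0000000000ffffa0",
    "Aug 13 04:40:01 NUCLEAR-WINTER kernel: CPU: 12 PID: 672 Comm: curl Tainted: G        W I       4.19.107-Unraid #1",
    "Aug 13 04:40:01 NUCLEAR-WINTER kernel: RIP: 0010:vma_interval_tree_remove+0x1d4/0x231",
    "Aug 13 04:40:01 NUCLEAR-WINTER kernel: RIP: 0033:0x148db7f90512",
]
summary = summarize_oops(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

One thing worth noting from the register dump: the faulting address in CR2 (`0xffffa0`) is exactly RCX (`0xffff88`) plus `0x18`, so the kernel appears to have followed a pointer that was not a valid kernel address. That does not say what corrupted it, but it is at least consistent with a hardware or memory problem rather than anything specific to `curl`.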

 

  • 3 weeks later...

Alright, I have been monitoring things and have not seen this issue again since cleaning the PCIe card for my backplane and reseating it in a new slot, along with the connectors at both ends. I think this was a random fault caused by something at one of those points. Thanks for the help on this; the issue is now solved.

 

 

