Server becoming unavailable overnight

February 20, 2023

Hi all,

My server is becoming unavailable overnight and I don't understand the logs well enough to work out what is causing the issue. I've attached the syslog which has been writing to flash. These are the last entries:

Feb 21 03:31:44 Tower kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 99)
Feb 21 03:31:44 Tower kernel: BUG: unable to handle page fault for address: ffffffff820df610
Feb 21 03:31:44 Tower kernel: #PF: supervisor instruction fetch in kernel mode
Feb 21 03:31:44 Tower kernel: #PF: error_code(0x0011) - permissions violation
Feb 21 03:31:44 Tower kernel: PGD 220e067 P4D 220e067 PUD 220f063 PMD 108da7063 PTE 80000000020df061
Feb 21 03:31:44 Tower kernel: Oops: 0011 [#2] PREEMPT SMP PTI
Feb 21 03:31:44 Tower kernel: CPU: 4 PID: 5157 Comm: Plex Media Scan Tainted: G D 5.19.17-Unraid #2
Feb 21 03:31:44 Tower kernel: Hardware name: MSI MS-7917/Z97 GAMING 5 (MS-7917), BIOS V1.9 12/23/2014
Feb 21 03:31:44 Tower kernel: RIP: 0010:SIGMA2+0x1e350/0x9ec60
Feb 21 03:31:44 Tower kernel: Code: 6b 65 74 5f 6f 72 64 65 72 3d 25 75 0a 00 6d 6d 2f 77 6f 72 6b 69 6e 67 73 65 74 2e 63 00 6b 73 6d 20 00 61 6e 6f 6e 20 00 20 <43> 4d 41 00 01 34 70 61 67 65 3a 25 70 20 69 73 20 75 6e 69 6e 69
Feb 21 03:31:44 Tower kernel: RSP: 0018:ffffc900087f7d08 EFLAGS: 00010282
Feb 21 03:31:44 Tower kernel: RAX: ffffffff820df610 RBX: ffffea000a332cc0 RCX: 000000000000777e
Feb 21 03:31:44 Tower kernel: RDX: dead000000000100 RSI: 0000000000000246 RDI: ffffea000a332cc0
Feb 21 03:31:44 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000a32a
Feb 21 03:31:44 Tower kernel: R10: 0000000000000003 R11: 0000151902248000 R12: ffff888044f5f000
Feb 21 03:31:44 Tower kernel: R13: ffffc900087f7d40 R14: 0000000000000001 R15: 0000000000000002
Feb 21 03:31:44 Tower kernel: FS: 0000151905bddb00(0000) GS:ffff88840fb00000(0000) knlGS:0000000000000000
Feb 21 03:31:44 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610 CR3: 0000000175d40005 CR4: 00000000001706e0
Feb 21 03:31:44 Tower kernel: Call Trace:
Feb 21 03:31:44 Tower kernel: <TASK>
Feb 21 03:31:44 Tower kernel: ? release_pages+0xe0/0x291
Feb 21 03:31:44 Tower kernel: ? tlb_flush_mmu+0x6b/0x99
Feb 21 03:31:44 Tower kernel: ? tlb_finish_mmu+0x2c/0x5b
Feb 21 03:31:44 Tower kernel: ? unmap_region+0xd3/0x101
Feb 21 03:31:44 Tower kernel: ? __do_munmap+0x275/0x2e2
Feb 21 03:31:44 Tower kernel: ? __vm_munmap+0x69/0xb7
Feb 21 03:31:44 Tower kernel: ? __x64_sys_munmap+0x17/0x1e
Feb 21 03:31:44 Tower kernel: ? do_syscall_64+0x6b/0x81
Feb 21 03:31:44 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
Feb 21 03:31:44 Tower kernel: </TASK>
Feb 21 03:31:44 Tower kernel: Modules linked in: tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables i915 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iosf_mbi drm_buddy i2c_algo_bit ttm drm_display_helper drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd mxm_wmi rapl drm intel_cstate mpt3sas intel_gtt agpgart ahci syscopyarea sysfillrect i2c_i801 i2c_smbus alx raid_class intel_uncore i2c_core sysimgblt libahci scsi_transport_sas mdio fb_sys_fops thermal fan video wmi backlight button acpi_pad unix
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610
Feb 21 03:31:44 Tower kernel: ---[ end trace 0000000000000000 ]---
Feb 21 03:31:44 Tower kernel: RIP: 0010:mutex_lock+0x1e/0x2e
Feb 21 03:31:44 Tower kernel: Code: 00 00 be 02 00 00 00 e9 a0 fc ff ff 0f 1f 44 00 00 51 48 89 3c 24 e8 7e f3 ff ff 31 c0 48 8b 3c 24 65 48 8b 14 25 c0 bb 01 00 <f0> 48 0f b1 17 74 03 5a eb c9 58 c3 cc cc cc cc 0f 1f 44 00 00 53
Feb 21 03:31:44 Tower kernel: RSP: 0000:ffffc90008237ef8 EFLAGS: 00010246
Feb 21 03:31:44 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000000
Feb 21 03:31:44 Tower kernel: RDX: ffff8881219f4ec0 RSI: ffffc90008237e80 RDI: 0000000008137f58
Feb 21 03:31:44 Tower kernel: RBP: 0000150830ff46b0 R08: 0000000000000000 R09: 0000000000000347
Feb 21 03:31:44 Tower kernel: R10: ffff8881573b0010 R11: 0000000000000000 R12: 0000000000000080
Feb 21 03:31:44 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 21 03:31:44 Tower kernel: FS: 0000151905bddb00(0000) GS:ffff88840fb00000(0000) knlGS:0000000000000000
Feb 21 03:31:44 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610 CR3: 0000000175d40005 CR4: 00000000001706e0
Feb 21 03:33:43 Tower emhttpd: spinning down /dev/sdm
Feb 21 03:35:51 Tower emhttpd: spinning down /dev/sdq
Feb 21 03:35:51 Tower emhttpd: spinning down /dev/sdo
Feb 21 03:40:01 Tower crond[1102]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Feb 21 03:42:33 Tower emhttpd: spinning down /dev/sdp
Feb 21 03:44:11 Tower emhttpd: read SMART /dev/sdp
Feb 21 03:45:32 Tower emhttpd: spinning down /dev/sdf
Feb 21 03:46:25 Tower emhttpd: spinning down /dev/sdj
Feb 21 03:55:09 Tower emhttpd: read SMART /dev/sdf
Feb 21 04:01:35 Tower emhttpd: spinning down /dev/sdp
Feb 21 04:04:39 Tower emhttpd: read SMART /dev/sdp

It didn't crash on Sunday night, I think because it does an app folder backup that night.

Any help in the right direction appreciated
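For anyone working through a syslog like this one, here is a minimal Python sketch that pulls out the kernel fault lines and the process named on the `Comm:` line. The marker strings are taken from the excerpt above; the function name `summarize_oops` and the matching rules are assumptions for illustration, not an Unraid tool.

```python
import re

# Marker strings copied from the log excerpt; extend as needed.
OOPS_MARKERS = ("BUG:", "Oops:", "kernel tried to execute NX-protected page")

# The kernel prints "Comm: <process name> Tainted: ..." or "... Not tainted ...".
COMM_RE = re.compile(r"Comm: (.+?) (?:Tainted|Not tainted)")


def summarize_oops(lines):
    """Return (events, procs): fault lines with timestamps, and process names seen."""
    events = []
    procs = set()
    for line in lines:
        # Expected shape: "Feb 21 03:31:44 Tower kernel: <message>"
        head, sep, msg = line.partition(" kernel: ")
        if not sep:
            continue  # not a kernel message
        msg = msg.strip()
        if any(marker in msg for marker in OOPS_MARKERS):
            timestamp = head.rsplit(" ", 1)[0]  # drop the hostname
            events.append((timestamp, msg))
        match = COMM_RE.search(msg)
        if match:
            procs.add(match.group(1))
    return events, procs
```

Passing an open file object for the saved syslog (the path depends on where your syslog is mirrored) returns the fault lines and the set of processes that were running when they fired. In the excerpt above the only process named is `Plex Media Scan`.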

syslog

February 20, 2023

Community Expert

Attach diagnostics to your NEXT post in this thread.

February 20, 2023

Author

tower-diagnostics-20230221-1005.zip

February 20, 2023

Community Expert

Not clearly related, but looks like disk7 is failing.

You should go to the settings page of each of your WD disks and add attributes 1, 200 for monitoring.

After doing that, do any of your disks show SMART warnings on the Dashboard page?

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?
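As a rough way to check attributes 1 and 200 outside the GUI, the sketch below scans the text of a `smartctl -A` report for those IDs and returns any with a non-zero raw value. It assumes the usual ten-column attribute table (ID first, raw value in the tenth column); the function name `nonzero_watched` is hypothetical, and raw values that are not plain integers are skipped.

```python
WATCHED_IDS = {1, 200}  # the attribute IDs named in the post above


def nonzero_watched(smartctl_output, watched=WATCHED_IDS):
    """Return {attribute_id: raw_value} for watched attributes with non-zero raw values."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        raw = fields[9]
        if attr_id in watched and raw.isdigit() and int(raw) != 0:
            found[attr_id] = int(raw)
    return found
```

An empty result means neither watched attribute has a non-zero raw value in that report. It does not mean the disk is healthy, so the SMART self-test results still need checking.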

February 20, 2023

Community Expert

Looks like several of your WD disks have non-zero for SMART attribute 1.

And SMART report for disk5 says it has failed. In fact, it looks like you (or someone) ran several extended SMART tests on disk5 and ignored the fact that they all failed.

Do you have backups of anything important and irreplaceable?

February 20, 2023

Author

Nothing critical/irreplaceable on the disks.

This wouldn't be causing the issue I'm facing, yeah?

February 20, 2023

Community Expert

Back to your original problem. Have you done memtest recently?

February 20, 2023

Community Expert

Just now, Swoodle said:

Nothing critical/irreplaceable on the disks.

You definitely need to replace at least disk5, probably disk7, and run extended SMART tests on all those WD disks with non-zero attribute 1.

Since you only have single parity, you can only rebuild one disk at a time, and would have no redundancy if another failed.

February 21, 2023

Author

44 minutes ago, trurl said:

Back to your original problem. Have you done memtest recently?

Haven't done a memtest, ahhhh, ever in Unraid. Does this need a monitor/keyboard attached to the server, or is there a way around that?

February 21, 2023

Community Expert

17 minutes ago, Swoodle said:

does this need a monitor/keyboard attached to the server

yes

February 21, 2023

Author

Ran a memtest for a couple of hours. Everything passed, no issues.

February 21, 2023

Community Expert

Disable Docker and VM Manager in Settings. Disable Autostart in Disk Settings. Reboot in SAFE mode.

Without starting the array, see if your server will run long enough to do the extended SMART self-tests I recommended earlier.

14 hours ago, trurl said:

run extended SMART tests on all those WD disks with non-zero attribute 1.

They need to be done anyway, and you need to replace several disks. No point in starting the array until you are ready to take care of all that.

And, if it doesn't crash while doing the extended tests, that might be a clue.
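If you would rather start the extended tests from a terminal than click through each disk's page, a sketch along these lines could do it. `smartctl -t long <device>` is the standard smartmontools command for starting an extended self-test. The helper names and the `dry_run` flag are assumptions for illustration, and the device names in the usage note are placeholders rather than a statement of which disks are the WD ones.

```python
import subprocess


def long_test_command(device):
    """Build the smartctl argv that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def start_long_tests(devices, dry_run=True):
    """Return the commands for each device; run them only when dry_run is False."""
    commands = [long_test_command(dev) for dev in devices]
    if not dry_run:
        for cmd in commands:
            # check=False: keep going even if one disk rejects the request
            subprocess.run(cmd, check=False)
    return commands
```

Calling `start_long_tests(["/dev/sdX", "/dev/sdY"])` only returns the commands so you can review them. Pass `dry_run=False` to actually start the tests. The tests run inside the drives themselves, so the results have to be read from each disk's SMART report once they finish.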

February 21, 2023

Community Expert

14 hours ago, Swoodle said:

Nothing critical/irreplaceable on the disks.

Another approach would be to New Config without all those questionable disks, remove them from the server, and see how your server works with only good disks.

February 21, 2023

Author

I'll replace disk 5, that won't be an issue. However, I'm curious whether a disk on the array can really cause an issue like this? That doesn't seem right.

February 22, 2023

Community Expert

No clear answer yet on what is causing the crash.

March 7, 2023

Author

Took a little while, as I replaced my parity disk, then transferred data and removed the problem disks. Long story short, the crash is still happening. I think it happens when Plex runs its scheduled tasks in the middle of the night, as the server hasn't crashed on nights when the Plex Docker container was stopped.

Attached the new tower diagnostics. tower-diagnostics-20230308-0847.zip
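One way to test the Plex theory is to check whether each crash timestamp from the syslog falls inside the scheduled-task window. The sketch below does that for syslog-style stamps. The 02:00 to 05:00 default is only an assumed window, so substitute whatever is configured in your Plex settings. The function name `in_window` is hypothetical.

```python
from datetime import datetime, time


def in_window(stamp, start=time(2, 0), end=time(5, 0)):
    """True if a stamp such as 'Feb 21 03:31:44' falls inside [start, end)."""
    clock = stamp.split()[-1]  # keep only the HH:MM:SS part
    t = datetime.strptime(clock, "%H:%M:%S").time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window wraps past midnight
```

With the assumed window, the 03:31:44 oops from the first post falls inside it. That fits the scheduled-task theory but doesn't prove it, since plenty of other jobs (mover, backups) also run overnight.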

November 25, 2023

Did you ever find the root cause? I'm running into the same lockups with very similar error messages.
