Server becoming unavailable overnight


Swoodle

Hi all,

 

My server is becoming unavailable overnight, and I don't understand the logs well enough to work out what is causing it. I've attached the syslog, which is being written to the flash drive. These are the last entries:

 

Feb 21 03:31:44 Tower kernel: kernel tried to execute NX-protected page - exploit attempt? (uid: 99)
Feb 21 03:31:44 Tower kernel: BUG: unable to handle page fault for address: ffffffff820df610
Feb 21 03:31:44 Tower kernel: #PF: supervisor instruction fetch in kernel mode
Feb 21 03:31:44 Tower kernel: #PF: error_code(0x0011) - permissions violation
Feb 21 03:31:44 Tower kernel: PGD 220e067 P4D 220e067 PUD 220f063 PMD 108da7063 PTE 80000000020df061
Feb 21 03:31:44 Tower kernel: Oops: 0011 [#2] PREEMPT SMP PTI
Feb 21 03:31:44 Tower kernel: CPU: 4 PID: 5157 Comm: Plex Media Scan Tainted: G      D           5.19.17-Unraid #2
Feb 21 03:31:44 Tower kernel: Hardware name: MSI MS-7917/Z97 GAMING 5 (MS-7917), BIOS V1.9 12/23/2014
Feb 21 03:31:44 Tower kernel: RIP: 0010:SIGMA2+0x1e350/0x9ec60
Feb 21 03:31:44 Tower kernel: Code: 6b 65 74 5f 6f 72 64 65 72 3d 25 75 0a 00 6d 6d 2f 77 6f 72 6b 69 6e 67 73 65 74 2e 63 00 6b 73 6d 20 00 61 6e 6f 6e 20 00 20 <43> 4d 41 00 01 34 70 61 67 65 3a 25 70 20 69 73 20 75 6e 69 6e 69
Feb 21 03:31:44 Tower kernel: RSP: 0018:ffffc900087f7d08 EFLAGS: 00010282
Feb 21 03:31:44 Tower kernel: RAX: ffffffff820df610 RBX: ffffea000a332cc0 RCX: 000000000000777e
Feb 21 03:31:44 Tower kernel: RDX: dead000000000100 RSI: 0000000000000246 RDI: ffffea000a332cc0
Feb 21 03:31:44 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000000000000a32a
Feb 21 03:31:44 Tower kernel: R10: 0000000000000003 R11: 0000151902248000 R12: ffff888044f5f000
Feb 21 03:31:44 Tower kernel: R13: ffffc900087f7d40 R14: 0000000000000001 R15: 0000000000000002
Feb 21 03:31:44 Tower kernel: FS:  0000151905bddb00(0000) GS:ffff88840fb00000(0000) knlGS:0000000000000000
Feb 21 03:31:44 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610 CR3: 0000000175d40005 CR4: 00000000001706e0
Feb 21 03:31:44 Tower kernel: Call Trace:
Feb 21 03:31:44 Tower kernel: <TASK>
Feb 21 03:31:44 Tower kernel: ? release_pages+0xe0/0x291
Feb 21 03:31:44 Tower kernel: ? tlb_flush_mmu+0x6b/0x99
Feb 21 03:31:44 Tower kernel: ? tlb_finish_mmu+0x2c/0x5b
Feb 21 03:31:44 Tower kernel: ? unmap_region+0xd3/0x101
Feb 21 03:31:44 Tower kernel: ? __do_munmap+0x275/0x2e2
Feb 21 03:31:44 Tower kernel: ? __vm_munmap+0x69/0xb7
Feb 21 03:31:44 Tower kernel: ? __x64_sys_munmap+0x17/0x1e
Feb 21 03:31:44 Tower kernel: ? do_syscall_64+0x6b/0x81
Feb 21 03:31:44 Tower kernel: ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
Feb 21 03:31:44 Tower kernel: </TASK>
Feb 21 03:31:44 Tower kernel: Modules linked in: tun veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables i915 x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iosf_mbi drm_buddy i2c_algo_bit ttm drm_display_helper drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd mxm_wmi rapl drm intel_cstate mpt3sas intel_gtt agpgart ahci syscopyarea sysfillrect i2c_i801 i2c_smbus alx raid_class intel_uncore i2c_core sysimgblt libahci scsi_transport_sas mdio fb_sys_fops thermal fan video wmi backlight button acpi_pad unix
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610
Feb 21 03:31:44 Tower kernel: ---[ end trace 0000000000000000 ]---
Feb 21 03:31:44 Tower kernel: RIP: 0010:mutex_lock+0x1e/0x2e
Feb 21 03:31:44 Tower kernel: Code: 00 00 be 02 00 00 00 e9 a0 fc ff ff 0f 1f 44 00 00 51 48 89 3c 24 e8 7e f3 ff ff 31 c0 48 8b 3c 24 65 48 8b 14 25 c0 bb 01 00 <f0> 48 0f b1 17 74 03 5a eb c9 58 c3 cc cc cc cc 0f 1f 44 00 00 53
Feb 21 03:31:44 Tower kernel: RSP: 0000:ffffc90008237ef8 EFLAGS: 00010246
Feb 21 03:31:44 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000000
Feb 21 03:31:44 Tower kernel: RDX: ffff8881219f4ec0 RSI: ffffc90008237e80 RDI: 0000000008137f58
Feb 21 03:31:44 Tower kernel: RBP: 0000150830ff46b0 R08: 0000000000000000 R09: 0000000000000347
Feb 21 03:31:44 Tower kernel: R10: ffff8881573b0010 R11: 0000000000000000 R12: 0000000000000080
Feb 21 03:31:44 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Feb 21 03:31:44 Tower kernel: FS:  0000151905bddb00(0000) GS:ffff88840fb00000(0000) knlGS:0000000000000000
Feb 21 03:31:44 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 03:31:44 Tower kernel: CR2: ffffffff820df610 CR3: 0000000175d40005 CR4: 00000000001706e0
Feb 21 03:33:43 Tower  emhttpd: spinning down /dev/sdm
Feb 21 03:35:51 Tower  emhttpd: spinning down /dev/sdq
Feb 21 03:35:51 Tower  emhttpd: spinning down /dev/sdo
Feb 21 03:40:01 Tower  crond[1102]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Feb 21 03:42:33 Tower  emhttpd: spinning down /dev/sdp
Feb 21 03:44:11 Tower  emhttpd: read SMART /dev/sdp
Feb 21 03:45:32 Tower  emhttpd: spinning down /dev/sdf
Feb 21 03:46:25 Tower  emhttpd: spinning down /dev/sdj
Feb 21 03:55:09 Tower  emhttpd: read SMART /dev/sdf
Feb 21 04:01:35 Tower  emhttpd: spinning down /dev/sdp
Feb 21 04:04:39 Tower  emhttpd: read SMART /dev/sdp
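
For anyone triaging a similar syslog, the fault lines in an excerpt like the one above can be pulled out mechanically. This is only a minimal Python sketch and was not part of the original post: the function name `summarize_kernel_faults` and the marker list are illustrative assumptions, and the list is not exhaustive. It collects the timestamp of each fault marker and the process named on the `Comm:` line.

```python
import re

# Strings that commonly open a kernel fault report (illustrative, not exhaustive).
FAULT_MARKERS = ("BUG:", "Oops:", "kernel tried to execute NX-protected page")
# The process name sits between "Comm: " and the taint status.
COMM_RE = re.compile(r"Comm: (.+?) (?:Tainted|Not tainted)")


def summarize_kernel_faults(lines):
    """Return ([(timestamp, message), ...], {process names}) for kernel fault lines."""
    faults, comms = [], set()
    for line in lines:
        # Lines look like: "Feb 21 03:31:44 Tower kernel: <message>"
        head, sep, message = line.partition(" kernel: ")
        if not sep:
            continue  # not a kernel message (e.g. emhttpd, crond)
        timestamp = " ".join(head.split()[:3])
        if any(marker in message for marker in FAULT_MARKERS):
            faults.append((timestamp, message.strip()))
        match = COMM_RE.search(message)
        if match:
            comms.add(match.group(1))
    return faults, comms
```

Run against the excerpt above, this would name `Plex Media Scan` as the process that was on the CPU at the time. The `[#2]` on the `Oops:` line is the kernel's oops counter, so an earlier oops had already occurred since boot and is worth looking for further up in the attached syslog.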

 

It didn't crash on Sunday night, I think because that is when it does an app folder backup.

 

Any pointers in the right direction would be appreciated.

 

 

syslog

Not clearly related, but it looks like disk7 is failing.

 

You should go to the settings page of each of your WD disks and add attributes 1 and 200 for monitoring.

 

After doing that, do any of your disks show SMART warnings on the Dashboard page?

 

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

Looks like several of your WD disks have non-zero raw values for SMART attribute 1.
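
If you prefer to check this from a shell rather than the GUI, the attribute table printed by `smartctl -A` can be scanned for the two attributes mentioned in this thread. The following is a rough Python sketch under stated assumptions: the function name `nonzero_watched_attributes` is hypothetical, the table is assumed to have the usual ten-column ATA layout with `RAW_VALUE` last, and raw values that are not plain integers are skipped rather than interpreted.

```python
WATCHED_IDS = {1, 200}  # the attribute IDs named in this thread


def nonzero_watched_attributes(smartctl_a_output, watched=WATCHED_IDS):
    """Map attribute ID -> (name, raw value) for watched attributes with a non-zero raw value."""
    flagged = {}
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # RAW_VALUE is the tenth column.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in watched:
            continue
        try:
            raw = int(fields[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip rather than guess
        if raw != 0:
            flagged[attr_id] = (fields[1], raw)
    return flagged
```

The text passed in would be the captured output of `smartctl -A` for one disk; an empty result means neither watched attribute has a non-zero raw value on that disk.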

 

And the SMART report for disk5 says it has failed. In fact, it looks like you (or someone) ran several extended SMART tests on disk5 and ignored the fact that they all failed.

 

Do you have backups of anything important and irreplaceable?

Just now, Swoodle said:

Nothing critical/irreplaceable on the disks. 

You definitely need to replace at least disk5, probably disk7, and run extended SMART tests on all those WD disks with non-zero attribute 1.

 

Since you only have single parity, you can only rebuild one disk at a time, and would have no redundancy if another failed.

 

Disable Docker and VM Manager in Settings. Disable Autostart in Disk Settings. Reboot in SAFE mode.

 

Without starting the array, see if your server will run long enough to do the extended SMART self-tests I recommended earlier.

14 hours ago, trurl said:

run extended SMART tests on all those WD disks with non-zero attribute 1.

They need to be done anyway, and you need to replace several disks. No point in starting the array until you are ready to take care of all that.

 

And, if it doesn't crash while doing the extended tests, that might be a clue.
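
The extended tests can be started from each disk's page in the GUI. If you would rather start them in one go from a shell, here is a minimal Python sketch, assuming `smartctl` is on the PATH and that you substitute your own device paths (the `/dev/sdX` style names below are placeholders, and both function names are hypothetical). It only starts the tests; they run on the drives in the background, and the results appear later in each disk's self-test log (`smartctl -l selftest`).

```python
import subprocess


def extended_test_command(device):
    """Build the smartctl invocation that starts an extended (long) self-test."""
    return ["smartctl", "-t", "long", device]


def start_extended_tests(devices, runner=subprocess.run):
    """Start an extended self-test on each device.

    Returns the devices for which smartctl exited non-zero, which is
    treated here simply as "needs a closer look".
    """
    needs_attention = []
    for device in devices:
        result = runner(extended_test_command(device), capture_output=True, text=True)
        if result.returncode != 0:
            needs_attention.append(device)
    return needs_attention
```

A call such as `start_extended_tests(["/dev/sdX", "/dev/sdY"])` would need to be run as root. The `runner` parameter exists only so the function can be exercised without touching real disks.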

 

 
