Server is randomly shutting down

Wgreen92 · April 10, 2021

Hello all, I have been pulling my hair out trying to figure out what's going on here. I'm a total noob, about 2 months in, so any help is appreciated.

Main problem: The server is randomly shutting down. The hardware stays powered and spun up, the fans start going nuts, there is no response from the web GUI on the local network, and Plex shuts off. When I had a graphics card in there, the screen would go black and unresponsive as well. I have to do a hard shutdown and restart via a long press of the power button.

Log included. I woke up to find it "shut down and powered up", turned it off, and pulled the log, so the last entries lead up to the crash.

Sub problem: The cache drive (Samsung 870 EVO 500GB) is reporting UDMA CRC errors (count up to 1240). I have swapped SATA ports and cables. Unfortunately, I did not preclear this disk; I put it in right out of the box, before learning that preclearing was best practice. I don't think it's causing the main problem.

Unraid 6.9.2 (also happened on 6.9.1) OS Plus

Specs:

Mobo: ASUSTeK COMPUTER INC. CROSSHAIR V FORMULA-Z

CPU: AMD FX™-9590 Eight-Core @ 4700 MHz

RAM: 4 x 8GB = 32GB DDR3

Parity: IronWolf Pro 7200 4TB

Data:

WD Blue 5200 4TB

WD Blue 5200 1TB

IronWolf 5900 4TB

Cache: Samsung 870 EVO 500GB

Flash: SanDisk Ultra 3.2 Gen 1 32GB

Flash attached via USB 3.0 20-pin motherboard header extension connector

Most glaring errors

Apr 10 06:39:50 Mainframe kernel: mce: [Hardware Error]: Machine check events logged
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: Corrected error, no action required.
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: CPU:0 (15:2:0) MC2_STATUS[Over|CE|MiscV|AddrV|-|CECC|-|-]: 0xdc25404000040136
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: Error Addr: 0x0000000401878cb8
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: MC2 Error: Fill ECC error on data fills.
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD
Apr 10 06:39:50 Mainframe kernel: mce: [Hardware Error]: Machine check events logged
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: Corrected error, no action required.
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: CPU:1 (15:2:0) MC2_STATUS[Over|CE|MiscV|AddrV|-|CECC|-|-]: 0xdc25409000040136
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: Error Addr: 0x000000059aebf238
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: MC2 Error: Fill ECC error on data fills.
Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: cache level: L2, tx: DATA, mem-tx: DRD

From searching, it seems people attribute this to a CPU problem: overheating, undervoltage, or outright failure.
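For anyone wanting to see how often these machine-check events occur and on which cores, here is a minimal Python sketch that tallies them from a saved syslog. The regular expression only assumes the line layout shown in the excerpt above; the function name `summarize_mce` and the `SAMPLE` lines are illustrative, not part of Unraid or any kernel tool.

```python
import re
from collections import Counter

# Matches kernel lines of the form shown above, e.g.
# "... [Hardware Error]: CPU:0 (15:2:0) MC2_STATUS[Over|CE|...]: 0xdc25404000040136"
MCE_LINE = re.compile(
    r"\[Hardware Error\]: CPU:(?P<cpu>\d+) \([0-9a-fA-F:]+\) "
    r"MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]: (?P<status>0x[0-9a-fA-F]+)"
)

def summarize_mce(lines):
    """Count machine-check status lines per (cpu, bank, error flag)."""
    counts = Counter()
    for line in lines:
        m = MCE_LINE.search(line)
        if m is None:
            continue
        flags = m.group("flags").split("|")
        # The excerpt shows "CE" alongside "Corrected error"; anything else
        # is reported under whatever flag is present, or "unknown".
        if "CE" in flags:
            kind = "CE"
        elif "UE" in flags:
            kind = "UE"
        else:
            kind = "unknown"
        counts[(int(m.group("cpu")), int(m.group("bank")), kind)] += 1
    return counts

# Two lines copied from the log excerpt above.
SAMPLE = [
    "Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: CPU:0 (15:2:0) "
    "MC2_STATUS[Over|CE|MiscV|AddrV|-|CECC|-|-]: 0xdc25404000040136",
    "Apr 10 06:39:50 Mainframe kernel: [Hardware Error]: CPU:1 (15:2:0) "
    "MC2_STATUS[Over|CE|MiscV|AddrV|-|CECC|-|-]: 0xdc25409000040136",
]

if __name__ == "__main__":
    for (cpu, bank, kind), n in sorted(summarize_mce(SAMPLE).items()):
        print(f"CPU {cpu} bank MC{bank} {kind}: {n}")
```

To run it against a full log, pass the open file instead of `SAMPLE`, for example `summarize_mce(open("syslog.txt"))`, where the file name is whatever the log was saved as. A count that keeps growing, or that is spread across several cores, would be consistent with the voltage or heat theories but does not prove either.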

Other glaring error

Apr 10 04:04:21 Mainframe dhcpcd[1678]: br0: failed to renew DHCP, rebinding
Apr 10 04:35:23 Mainframe kernel: ata1.00: exception Emask 0x10 SAct 0xcc000 SErr 0x0 action 0x6 frozen
Apr 10 04:35:23 Mainframe kernel: ata1.00: irq_stat 0x08000000, interface fatal error
Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Apr 10 04:35:23 Mainframe kernel: ata1.00: cmd 61/90:70:20:99:1e/07:00:04:00:00/40 tag 14 ncq dma 991232 out
Apr 10 04:35:23 Mainframe kernel: res 40/00:70:20:99:1e/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
Apr 10 04:35:23 Mainframe kernel: ata1.00: status: { DRDY }
Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Apr 10 04:35:23 Mainframe kernel: ata1.00: cmd 61/c8:78:b0:a0:1e/03:00:04:00:00/40 tag 15 ncq dma 495616 out
Apr 10 04:35:23 Mainframe kernel: res 40/00:70:20:99:1e/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
Apr 10 04:35:23 Mainframe kernel: ata1.00: status: { DRDY }
Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Apr 10 04:35:23 Mainframe kernel: ata1.00: cmd 61/68:90:78:a4:1e/00:00:04:00:00/40 tag 18 ncq dma 53248 out
Apr 10 04:35:23 Mainframe kernel: res 40/00:70:20:99:1e/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
Apr 10 04:35:23 Mainframe kernel: ata1.00: status: { DRDY }
Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Apr 10 04:35:23 Mainframe kernel: ata1.00: cmd 61/f0:98:e8:bb:f8/00:00:02:00:00/40 tag 19 ncq dma 122880 out
Apr 10 04:35:23 Mainframe kernel: res 40/00:70:20:99:1e/00:00:04:00:00/40 Emask 0x10 (ATA bus error)
Apr 10 04:35:23 Mainframe kernel: ata1.00: status: { DRDY }
Apr 10 04:35:23 Mainframe kernel: ata1: hard resetting link
Apr 10 04:35:23 Mainframe kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Apr 10 04:35:23 Mainframe kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 10 04:35:23 Mainframe kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 10 04:35:23 Mainframe kernel: ata1.00: configured for UDMA/133
Apr 10 04:35:23 Mainframe kernel: ata1: EH complete
Apr 10 04:35:23 Mainframe kernel: ata1.00: Enabling discard_zeroes_data

Sub problem?
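To check whether the ATA errors stay confined to one port (which would point at the cache SSD and its link rather than the main crash), a rough Python sketch like the following can count failed commands and link resets per port. It assumes only the line layout in the excerpt above; `summarize_ata` and `SAMPLE` are illustrative names.

```python
import re
from collections import Counter

# Patterns based on the kernel lines in the excerpt above.
FAILED_CMD = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+?)\s*$")
LINK_RESET = re.compile(r"kernel: (ata\d+): hard resetting link")

def summarize_ata(lines):
    """Return (failed-command counts, link-reset counts) keyed by ATA port."""
    failed = Counter()
    resets = Counter()
    for line in lines:
        m = FAILED_CMD.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
            continue
        m = LINK_RESET.search(line)
        if m:
            resets[m.group(1)] += 1
    return failed, resets

# A few lines copied from the log excerpt above.
SAMPLE = [
    "Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED",
    "Apr 10 04:35:23 Mainframe kernel: ata1.00: status: { DRDY }",
    "Apr 10 04:35:23 Mainframe kernel: ata1.00: failed command: WRITE FPDMA QUEUED",
    "Apr 10 04:35:23 Mainframe kernel: ata1: hard resetting link",
    "Apr 10 04:35:23 Mainframe kernel: ata1: EH complete",
]

if __name__ == "__main__":
    failed, resets = summarize_ata(SAMPLE)
    for (port, cmd), n in sorted(failed.items()):
        print(f"{port}: {n} x {cmd}")
    for port, n in sorted(resets.items()):
        print(f"{port}: {n} hard reset(s)")
```

If every failure lands on the same `ataN` port as the cache drive, that fits the CRC counter on the SSD and suggests this is the sub problem rather than the cause of the lockups.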

syslog

JorgeB · April 11, 2021

CRC errors are a known issue with some Samsung SSDs and those AMD chipsets. The other issue looks like a hardware problem, such as a bad PSU, CPU, board, etc.
