mrvnsk9 Posted January 5, 2020
Over the last few days my server has started hanging after being up for a few hours. I looked at the SMART reports in the diagnostics, and it looks like there are errors on disks 3 and 6. I'm not sure if this is the problem or if there is another cause for the lockups. I mirrored the syslog to flash before the last freeze happened. Both it and the diagnostics are attached. Any help would be greatly appreciated. Thanks in advance!
Attachments: dragon-diagnostics-20200104-1903.zip, syslog
trurl Posted January 5, 2020
The array isn't started in those, so I can't tell anything about filesystems or shares, though those wouldn't be expected to cause a crash. Does it in fact crash even if you never start the array?
mrvnsk9 Posted January 5, 2020
Apologies, I forgot my array wasn't started when I pulled the first diagnostics. Of course, that was the only time I pulled any diagnostics. It freezes when the array is started; it hasn't frozen with the array stopped. I've attached new diagnostics with the array started. Thanks!
Attachment: dragon-diagnostics-20200104-2108.zip
trurl Posted January 5, 2020
I don't notice anything in those, except it looks like you are using 11G of the 30G docker image with no dockers running. Have you had problems with the docker image filling up? Have you done a memtest?
mrvnsk9 Posted January 5, 2020
I haven't had issues with the docker image filling up. I guess the dockers hadn't finished starting when I pulled the diagnostics. I haven't done a memtest yet; I'll try that. Should I be concerned about the SMART errors on disks 3 and 6?
mrvnsk9 Posted January 5, 2020
I ran a memtest and it passed with no errors.
trurl Posted January 5, 2020
It doesn't seem like you have actually captured a syslog after a crash, since the one you attached was basically the same as the one in the diagnostics, and the array wasn't started.
mrvnsk9 Posted January 5, 2020
The log was still mirrored after the restart. If you scroll up to line 213 in the syslog file you should see a timestamp of "Jan 4 17:47:16". This is where the server became unresponsive.
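With the syslog mirrored to flash (on Unraid the mirrored file typically lives under /boot/logs/), the entries around the hang can be pulled out by timestamp. A minimal sketch; the two log entries below are made-up placeholders standing in for the real mirrored file, and only the "Jan 4 17:47:16" timestamp comes from this thread:

```shell
# Stand-in for the syslog mirrored to the flash drive; these two entries
# are illustrative placeholders, not lines from the actual log.
cat > /tmp/syslog.mirror <<'EOF'
Jan  4 17:46:59 Dragon kernel: example earlier entry
Jan  4 17:47:16 Dragon kernel: example last entry before the hang
EOF

# Show (with line numbers) the entries at the reported freeze time.
grep -n 'Jan  4 17:47' /tmp/syslog.mirror
```

The last timestamp that made it to flash marks roughly when the server stopped responding; anything after it was lost with the crash.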
JorgeB Posted January 5, 2020
Ryzen on Linux can lock up due to issues with C-states. Make sure the BIOS is up to date, then look for "Power Supply Idle Control" (or similar) and set it to "typical current idle" (or similar), or disable C-states completely. More info here: https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/
mrvnsk9 Posted January 5, 2020
The BIOS is up to date and C-states are already disabled. "Power Supply Idle Control" was not set to the suggested value, so I changed that. The odd thing is the server had been stable for a year and didn't start having issues until Jan 1. Probably a coincidence, but it's still odd to me. Is the "/usr/local/sbin/zenstates --c6-disable" line still required in the go file, or is it no longer needed? Also, I'm using "rcu_nocbs=0-7" in the syslinux configuration.
JorgeB Posted January 6, 2020
12 hours ago, mrvnsk9 said:
"Is the "/usr/local/sbin/zenstates --c6-disable" line still required in the go file or is it no longer needed?"
It should no longer be needed with the power supply idle control correctly set.
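For context, the workaround being discussed lives in the Unraid startup script on the flash drive. A sketch of what that looks like; every line is commented out so the snippet is inert outside an actual Unraid system, and the emhttp line is the stock content of a default go file:

```shell
#!/bin/bash
# Sketch of an Unraid go file (/boot/config/go), for reference only.
# Stock line that starts the Unraid management GUI:
#/usr/local/sbin/emhttp &
# Ryzen C6 workaround; per JorgeB above, this line can be removed once
# "Power Supply Idle Control" is set to "typical current idle" in the BIOS:
#/usr/local/sbin/zenstates --c6-disable
```

The BIOS setting fixes the idle-state issue at the firmware level, so the software workaround in the go file becomes redundant.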
mrvnsk9 Posted January 7, 2020
@johnnie.black After making the changes to the BIOS, the server stayed up for about 15 hours. I was using the unbalance plugin to move some files to disk6 and received the following errors before it locked up.

Jan 6 23:33:48 Dragon kernel: ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Jan 6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata1.00: cmd 61/70:00:08:9d:e0/01:00:be:00:00/40 tag 0 ncq dma 188416 out
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata1.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata1.00: cmd 61/40:08:90:7d:e0/05:00:be:00:00/40 tag 1 ncq dma 688128 out
Jan 6 23:33:48 Dragon kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata1.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata1: hard resetting link
Jan 6 23:33:48 Dragon kernel: ata3.00: exception Emask 0x0 SAct 0x780 SErr 0x0 action 0x6 frozen
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:38:78:9e:e0/05:00:be:00:00/40 tag 7 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:40:b8:a3:e0/05:00:be:00:00/40 tag 8 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:48:f8:a8:e0/05:00:be:00:00/40 tag 9 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: cmd 60/78:50:38:ae:e0/01:00:be:00:00/40 tag 10 ncq dma 192512 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata3: hard resetting link
Jan 6 23:33:48 Dragon kernel: ata4.00: exception Emask 0x0 SAct 0x3c003000 SErr 0x0 action 0x6 frozen
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:60:d0:82:e0/05:00:be:00:00/40 tag 12 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:68:10:88:e0/05:00:be:00:00/40 tag 13 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:d0:78:9e:e0/05:00:be:00:00/40 tag 26 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:d8:b8:a3:e0/05:00:be:00:00/40 tag 27 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:e0:f8:a8:e0/05:00:be:00:00/40 tag 28 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: cmd 60/78:e8:38:ae:e0/01:00:be:00:00/40 tag 29 ncq dma 192512 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:82:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata4: hard resetting link
Jan 6 23:33:48 Dragon kernel: ata8.00: exception Emask 0x0 SAct 0x3c00 SErr 0x0 action 0x6 frozen
Jan 6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:50:78:9e:e0/05:00:be:00:00/40 tag 10 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:58:b8:a3:e0/05:00:be:00:00/40 tag 11 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:60:f8:a8:e0/05:00:be:00:00/40 tag 12 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata8.00: cmd 60/78:68:38:ae:e0/01:00:be:00:00/40 tag 13 ncq dma 192512 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata8: hard resetting link
Jan 6 23:33:48 Dragon kernel: ata2.00: exception Emask 0x0 SAct 0x1e080 SErr 0x0 action 0x6 frozen
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:38:10:88:e0/05:00:be:00:00/40 tag 7 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:68:78:9e:e0/05:00:be:00:00/40 tag 13 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:70:b8:a3:e0/05:00:be:00:00/40 tag 14 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:78:f8:a8:e0/05:00:be:00:00/40 tag 15 ncq dma 688128 in
Jan 6 23:33:48 Dragon kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: cmd 60/78:80:38:ae:e0/01:00:be:00:00/40 tag 16 ncq dma 192512 in
Jan 6 23:33:48 Dragon kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan 6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan 6 23:33:48 Dragon kernel: ata2: hard resetting link

Would this indicate there is an issue with disk6? I was copying the files from disk5, if that is relevant information.
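One quick way to see that this is not a single-disk problem is to tally the "failed command" entries per ATA port. A sketch using a few of the log lines above as sample input (abbreviated; the real log has more entries per port):

```shell
# Abbreviated excerpt of the posted kernel log, one line per timed-out command.
cat > /tmp/ata.log <<'EOF'
Jan 6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan 6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
EOF

# Count failed commands per ATA port; several ports timing out in the same
# second points toward a shared controller or power problem, not one disk.
grep -o 'ata[0-9]*\.00: failed command' /tmp/ata.log | sort | uniq -c
```

In the full log, ata1, ata2, ata3, ata4, and ata8 all report timeouts at 23:33:48, which is what makes a single failing disk an unlikely explanation.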
I've attached new diagnostics taken after rebooting the server. It looks like there was an error on every drive in the array. Did unbalance cause this, do I have bad cables, or is there another issue causing it? For reference, these are the PCI devices for the drives.

/sys/bus/pci/devices/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb
/sys/bus/pci/devices/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc
/sys/bus/pci/devices/0000:01:00.1/ata3/host3/target3:0:0/3:0:0:0/block/sdd
/sys/bus/pci/devices/0000:01:00.1/ata4/host4/target4:0:0/4:0:0:0/block/sde
/sys/bus/pci/devices/0000:01:00.1/ata7/host7/target7:0:0/7:0:0:0/block/sdf
/sys/bus/pci/devices/0000:01:00.1/ata8/host8/target8:0:0/8:0:0:0/block/sdg
/sys/bus/pci/devices/0000:09:00.0/ata12/host12/target12:0:0/12:0:0:0/block/sdh

Attachment: dragon-diagnostics-20200106-2358.zip
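Those sysfs paths encode the controller (PCI address), ATA port, and block device, so they can be condensed into an ata-to-device map. A sketch that rewrites the listed paths with sed; the paths are the ones posted above:

```shell
# The sysfs paths from the post, one per line.
cat > /tmp/paths.txt <<'EOF'
/sys/bus/pci/devices/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb
/sys/bus/pci/devices/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc
/sys/bus/pci/devices/0000:01:00.1/ata3/host3/target3:0:0/3:0:0:0/block/sdd
/sys/bus/pci/devices/0000:01:00.1/ata4/host4/target4:0:0/4:0:0:0/block/sde
/sys/bus/pci/devices/0000:01:00.1/ata7/host7/target7:0:0/7:0:0:0/block/sdf
/sys/bus/pci/devices/0000:01:00.1/ata8/host8/target8:0:0/8:0:0:0/block/sdg
/sys/bus/pci/devices/0000:09:00.0/ata12/host12/target12:0:0/12:0:0:0/block/sdh
EOF

# Pull out the PCI address, ATA port, and device name from each path.
sed -E 's#^/sys/bus/pci/devices/([0-9a-f:.]+)/(ata[0-9]+)/.*/block/(sd[a-z]+)$#\2 -> \3 (controller \1)#' \
    /tmp/paths.txt | tee /tmp/ata_map.txt
```

The map shows ata1 through ata8 (sdb-sdg) all hang off the controller at 0000:01:00.1, while ata12 (sdh) is on 0000:09:00.0, which is useful when deciding whether the errors cluster on one controller.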
JorgeB Posted January 7, 2020
1 hour ago, mrvnsk9 said:
"Would this indicate there is an issue with disk6?"
To me it indicates a problem with the disk controller, since there are errors on almost all disks. It could also be a power-related problem.
mrvnsk9 Posted January 7, 2020
8 hours ago, johnnie.black said:
"To me it indicates a problem with the disk controller, since there are errors on almost all disks, could also be a power related problem."
I have a StarTech controller with a Marvell 88SE9230 chipset in the system; I have to disable the IOMMU or it drops drives. I'm going to remove that controller from the array and see if that improves things (I only have one drive attached to it anyway). I should probably replace it with an LSI 9300-8i or something similar.
mrvnsk9 Posted January 8, 2020
@johnnie.black Changing the BIOS to the correct setting seems to have done the trick. I'm also going to swap out the controller card for one that's actually supported by Unraid. I'll consider this solved. Thanks for your help!