
[SOLVED] Server Unresponsive After a Few Hours of Uptime



Over the last few days my server has started hanging after being up for a few hours. I looked at the SMART reports in the diagnostics, and it looks like there are errors on disks 3 and 6. I'm not sure whether this is the problem or whether something else is causing the lockups. I mirrored the syslog to flash before the last freeze happened. Both it and the diagnostics are attached. Any help would be greatly appreciated.

 

Thanks in advance!

dragon-diagnostics-20200104-1903.zip syslog


The BIOS is up to date and C-states are already disabled. The "Power Supply Idle Control" option was not set to the suggested value, so I changed that. The odd thing is that the server had been stable for a year and didn't start having issues until Jan 1. Probably a coincidence, but it's still odd to me.

 

Is the "/usr/local/sbin/zenstates --c6-disable" line still required in the go file or is it no longer needed?

 

Also, I'm using "rcu_nocbs=0-7" in the syslinux configuration.

Edited by mrvnsk9

@johnnie.black After making the changes in the BIOS, the server stayed up for about 15 hours. I was using the unbalance plugin to move some files to disk6 and received the following errors before it locked up.

Jan  6 23:33:48 Dragon kernel: ata1.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Jan  6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata1.00: cmd 61/70:00:08:9d:e0/01:00:be:00:00/40 tag 0 ncq dma 188416 out
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata1.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata1.00: cmd 61/40:08:90:7d:e0/05:00:be:00:00/40 tag 1 ncq dma 688128 out
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata1.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata1: hard resetting link
Jan  6 23:33:48 Dragon kernel: ata3.00: exception Emask 0x0 SAct 0x780 SErr 0x0 action 0x6 frozen
Jan  6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:38:78:9e:e0/05:00:be:00:00/40 tag 7 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:40:b8:a3:e0/05:00:be:00:00/40 tag 8 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata3.00: cmd 60/40:48:f8:a8:e0/05:00:be:00:00/40 tag 9 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata3.00: cmd 60/78:50:38:ae:e0/01:00:be:00:00/40 tag 10 ncq dma 192512 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata3.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata3: hard resetting link
Jan  6 23:33:48 Dragon kernel: ata4.00: exception Emask 0x0 SAct 0x3c003000 SErr 0x0 action 0x6 frozen
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:60:d0:82:e0/05:00:be:00:00/40 tag 12 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:68:10:88:e0/05:00:be:00:00/40 tag 13 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:d0:78:9e:e0/05:00:be:00:00/40 tag 26 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:d8:b8:a3:e0/05:00:be:00:00/40 tag 27 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/40:e0:f8:a8:e0/05:00:be:00:00/40 tag 28 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata4.00: cmd 60/78:e8:38:ae:e0/01:00:be:00:00/40 tag 29 ncq dma 192512 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:82:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata4.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata4: hard resetting link
Jan  6 23:33:48 Dragon kernel: ata8.00: exception Emask 0x0 SAct 0x3c00 SErr 0x0 action 0x6 frozen
Jan  6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:50:78:9e:e0/05:00:be:00:00/40 tag 10 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:58:b8:a3:e0/05:00:be:00:00/40 tag 11 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata8.00: cmd 60/40:60:f8:a8:e0/05:00:be:00:00/40 tag 12 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata8.00: cmd 60/78:68:38:ae:e0/01:00:be:00:00/40 tag 13 ncq dma 192512 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata8.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata8: hard resetting link
Jan  6 23:33:48 Dragon kernel: ata2.00: exception Emask 0x0 SAct 0x1e080 SErr 0x0 action 0x6 frozen
Jan  6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:38:10:88:e0/05:00:be:00:00/40 tag 7 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:68:78:9e:e0/05:00:be:00:00/40 tag 13 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:70:b8:a3:e0/05:00:be:00:00/40 tag 14 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:ff:ff:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata2.00: cmd 60/40:78:f8:a8:e0/05:00:be:00:00/40 tag 15 ncq dma 688128 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata2.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata2.00: cmd 60/78:80:38:ae:e0/01:00:be:00:00/40 tag 16 ncq dma 192512 in
Jan  6 23:33:48 Dragon kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jan  6 23:33:48 Dragon kernel: ata2.00: status: { DRDY }
Jan  6 23:33:48 Dragon kernel: ata2: hard resetting link

Would this indicate there is an issue with disk6? I was copying the files from disk5, if that is relevant. I've attached new diagnostics taken after rebooting the server.

 

It looks like there was an error on every drive in the array. Did unbalance cause this, do I have bad cables, or is something else causing it?
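One rough way to check which ports actually logged failures is to tally the "failed command" lines per ATA port. The sketch below is only an illustration: `SAMPLE` holds a handful of lines copied from the excerpt above, and the function name `count_failed_commands` is made up for this example. To run it on a real log, read the mirrored syslog file into a string instead.

```python
import re
from collections import Counter

# A few lines copied from the syslog excerpt above (not the full log).
SAMPLE = """\
Jan  6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata1: hard resetting link
Jan  6 23:33:48 Dragon kernel: ata3.00: failed command: READ FPDMA QUEUED
Jan  6 23:33:48 Dragon kernel: ata3: hard resetting link
"""

# Captures the port (e.g. "ata1") and the first word of the command (READ/WRITE).
FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (\w+)")


def count_failed_commands(log_text):
    """Return a Counter keyed by (port, command), e.g. ('ata1', 'WRITE')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (port, command), n in sorted(count_failed_commands(SAMPLE).items()):
        print(f"{port}: {n} failed {command} command(s)")
```

A tally like this only shows where the kernel reported timeouts. It does not by itself distinguish a bad disk from a cable, controller, or power problem.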

 

For reference, these are the PCI device paths for the drives.

/sys/bus/pci/devices/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb
/sys/bus/pci/devices/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc
/sys/bus/pci/devices/0000:01:00.1/ata3/host3/target3:0:0/3:0:0:0/block/sdd
/sys/bus/pci/devices/0000:01:00.1/ata4/host4/target4:0:0/4:0:0:0/block/sde
/sys/bus/pci/devices/0000:01:00.1/ata7/host7/target7:0:0/7:0:0:0/block/sdf
/sys/bus/pci/devices/0000:01:00.1/ata8/host8/target8:0:0/8:0:0:0/block/sdg
/sys/bus/pci/devices/0000:09:00.0/ata12/host12/target12:0:0/12:0:0:0/block/sdh
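To line the ataN ports in the log up with block devices and controllers, the paths above can be grouped by PCI address. This is a minimal sketch, assuming the path layout is exactly as listed; `group_by_controller` is a name made up for this example.

```python
import re
from collections import defaultdict

# The sysfs paths exactly as listed in the post.
SYSFS_PATHS = [
    "/sys/bus/pci/devices/0000:01:00.1/ata1/host1/target1:0:0/1:0:0:0/block/sdb",
    "/sys/bus/pci/devices/0000:01:00.1/ata2/host2/target2:0:0/2:0:0:0/block/sdc",
    "/sys/bus/pci/devices/0000:01:00.1/ata3/host3/target3:0:0/3:0:0:0/block/sdd",
    "/sys/bus/pci/devices/0000:01:00.1/ata4/host4/target4:0:0/4:0:0:0/block/sde",
    "/sys/bus/pci/devices/0000:01:00.1/ata7/host7/target7:0:0/7:0:0:0/block/sdf",
    "/sys/bus/pci/devices/0000:01:00.1/ata8/host8/target8:0:0/8:0:0:0/block/sdg",
    "/sys/bus/pci/devices/0000:09:00.0/ata12/host12/target12:0:0/12:0:0:0/block/sdh",
]

# Captures the PCI address, the ATA port name, and the block device name.
PATH = re.compile(r"/sys/bus/pci/devices/([0-9a-f:.]+)/(ata\d+)/.*/block/(\w+)$")


def group_by_controller(paths):
    """Return {pci_address: {ata_port: block_device}} for the given sysfs paths."""
    controllers = defaultdict(dict)
    for path in paths:
        match = PATH.match(path)
        if match:
            pci_address, port, device = match.groups()
            controllers[pci_address][port] = device
    return dict(controllers)


if __name__ == "__main__":
    for pci_address, ports in group_by_controller(SYSFS_PATHS).items():
        print(pci_address)
        for port, device in ports.items():
            print(f"  {port} -> {device}")
```

Grouped this way, the five ports that appear in the log excerpt (ata1, ata2, ata3, ata4, ata8) all sit under 0000:01:00.1, while ata7 and ata12 do not appear in the excerpt.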

dragon-diagnostics-20200106-2358.zip

Edited by mrvnsk9
8 hours ago, johnnie.black said:

To me it indicates a problem with the disk controller, since there are errors on almost all disks, could also be a power related problem.

I have a StarTech controller with a Marvell 88SE9230 chipset in the system, and I have to disable IOMMU or it drops drives. I'm going to remove that controller from the array and see if that improves things (I only have one drive attached to it anyway). I should probably replace it with an LSI 9300-8i or something similar.

  • mrvnsk9 changed the title to [SOLVED] Server Unresponsive After a Few Hours of Uptime
