unresponsive array and web management (5.0-rc5)

tjiddy · July 25, 2012

Long time listener, first time caller.

My unRAID server (5.0-rc5), which consists of parity, cache, and 6 drives (two of which were just added on Friday), became mostly unresponsive yesterday. I couldn't access any of the shares and the web interface was not responding, yet I didn't notice anything out of place in the syslog. I could telnet into the machine, but any attempt to do anything with the disks (even an ls) would result in a hung telnet session. I tried for hours to stop the array cleanly, but nothing worked. I eventually just pushed the reset button on the server.

The server started up OK and began its "unclean shutdown" parity check. I kept an eye on the logs this time and saw quite a few kernel: md: parity incorrect: 1942702520 messages. I assumed this was because of my hard reset. After some time, the system became unresponsive again. This time I noticed this in the logs:

Jul 24 19:51:40 JABBA kernel: md: parity incorrect: 1942702576
Jul 24 19:52:11 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED
Jul 24 19:52:11 JABBA kernel: sas: Enter sas_scsi_recover_host
Jul 24 19:52:11 JABBA kernel: sas: trying to find task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: task 0xf41a03c0 failed to abort
Jul 24 19:52:11 JABBA kernel: sas: task 0xf41a03c0 is not at LU: I_T recover
Jul 24 19:52:11 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
Jul 24 19:52:13 JABBA kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Jul 24 19:52:13 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Jul 24 19:52:13 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
Jul 24 19:52:13 JABBA kernel: sas: I_T 0000000000000000 recovered
Jul 24 19:52:13 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
Jul 24 19:52:13 JABBA kernel: ata5: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
Jul 24 19:52:13 JABBA kernel: ata5.00: failed command: READ FPDMA QUEUED
Jul 24 19:52:13 JABBA kernel: ata5.00: cmd 60/d8:00:38:4b:cb/01:00:73:00:00/40 tag 0 ncq 241664 in
Jul 24 19:52:13 JABBA kernel:          res 41/40:00:48:4c:cb/00:00:73:00:00/40 Emask 0x409 (media error) <F>
Jul 24 19:52:13 JABBA kernel: ata5.00: status: { DRDY ERR }
Jul 24 19:52:13 JABBA kernel: ata5.00: error: { UNC }
Jul 24 19:52:13 JABBA kernel: ata5.00: configured for UDMA/133
Jul 24 19:52:13 JABBA kernel: ata5: EH complete
Jul 24 19:52:13 JABBA kernel: ata6: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata7: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata8: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata9: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata10: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata11: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: sas: --- Exit sas_scsi_recover_host
Jul 24 19:52:44 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED
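
For anyone digging through a long syslog for the same pattern, here is a minimal Python sketch that tallies the libata "failed command" and "error: { ... }" lines per ATA device. The function name summarize_ata_errors and the two regular expressions are my own illustration, written against the line format in the excerpt above; they are not part of unRAID or any existing tool, and other kernel versions may word these lines differently.

```python
import re
from collections import Counter

# Matches lines such as "kernel: ata5.00: failed command: READ FPDMA QUEUED"
FAILED_RE = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")
# Matches lines such as "kernel: ata5.00: error: { UNC }"
ERROR_RE = re.compile(r"kernel: (ata\d+\.\d+): error: \{ (.+?) \}")


def summarize_ata_errors(lines):
    """Count failed commands and error flags per ATA device.

    Hypothetical helper: the patterns assume the syslog format
    shown in this thread.
    """
    failed = Counter()  # (device, command) -> count
    flags = Counter()   # (device, error flag) -> count
    for line in lines:
        line = line.rstrip()
        m = FAILED_RE.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
            continue
        m = ERROR_RE.search(line)
        if m:
            for flag in m.group(2).split():
                flags[(m.group(1), flag)] += 1
    return failed, flags


# Three lines copied from the excerpt above
sample = [
    "Jul 24 19:52:13 JABBA kernel: ata5.00: failed command: READ FPDMA QUEUED",
    "Jul 24 19:52:13 JABBA kernel: ata5.00: status: { DRDY ERR }",
    "Jul 24 19:52:13 JABBA kernel: ata5.00: error: { UNC }",
]
failed, flags = summarize_ata_errors(sample)
print(failed)
print(flags)
```

To run it over a whole log, pass an open file instead of the sample list, for example summarize_ata_errors(open("syslog.txt")). If every hit lands on one device (ata5.00 here), that at least narrows down which port to look at first.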

I also noticed that the system is only recognizing one of my two memory sticks.

root@JABBA:~# free -m
             total       used       free     shared    buffers     cached
Mem:          1770       1635        134          0         13       1473
-/+ buffers/cache:        148       1622
Swap:            0          0          0

root@JABBA:~# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
0 46      0 138436  14120 1508744    0    0   158     0  695  284  0 13 47 39

After that long-winded story, I have two questions:

1) Do the above errors mean I have a dead drive? If so, why would a dead drive bring the whole system to its knees? I saw some sas references in there; could my backplane have suddenly gone bad?

2) Is there a way to gracefully shut down that I haven't thought of? I've tried the powerdown script, /etc/rc.c/rc.unRaid stop, the stop script, even a shutdown now -r. I get REALLY nervous pushing that reset button.

I have attached my full syslog. Any help would be greatly appreciated.

syslog.txt

tjiddy · July 26, 2012

I pulled the drive that seemed to be causing the hang (it was one of those new 3TB WD Red drives), restarted the array, and started rebuilding the parity drive. Again, the system became unresponsive, with this message in the syslog:

Jul 25 22:56:30 JABBA kernel: md: recovery thread syncing parity disk ...
Jul 25 22:56:30 JABBA kernel: md: using 1536k window, over a total of 2930266532 blocks.
Jul 26 00:18:56 JABBA kernel: sas: command 0xf43d4e40, task 0xf74b3540, timed out: BLK_EH_NOT_HANDLED
Jul 26 00:18:56 JABBA kernel: sas: Enter sas_scsi_recover_host
Jul 26 00:18:56 JABBA kernel: sas: trying to find task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: task 0xf74b3540 failed to abort
Jul 26 00:18:56 JABBA kernel: sas: task 0xf74b3540 is not at LU: I_T recover
Jul 26 00:18:56 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
Jul 26 00:18:56 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Jul 26 00:18:58 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
Jul 26 00:18:58 JABBA kernel: sas: I_T 0000000000000000 recovered
Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
Jul 26 00:18:58 JABBA kernel: ata5: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
Jul 26 00:18:58 JABBA kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Jul 26 00:18:58 JABBA kernel: ata5.00: cmd 61/00:00:98:e7:c4/02:00:3a:00:00/40 tag 0 ncq 262144 out
Jul 26 00:18:58 JABBA kernel:          res 41/10:00:98:e7:c4/00:00:3a:00:00/40 Emask 0x481 (invalid argument) <F>
Jul 26 00:18:58 JABBA kernel: ata5.00: status: { DRDY ERR }
Jul 26 00:18:58 JABBA kernel: ata5.00: error: { IDNF }
Jul 26 00:18:58 JABBA kernel: ata5.00: configured for UDMA/133
Jul 26 00:18:58 JABBA kernel: ata5: EH complete
Jul 26 00:18:58 JABBA kernel: ata6: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata7: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata8: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata9: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata10: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: sas: --- Exit sas_scsi_recover_host
Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 2
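
The cmd and res lines in these excerpts carry the sector address of the failing request, so it can help to decode them. Below is a rough Python sketch, assuming the twelve hex bytes are printed in libata's usual taskfile order (command or status, feature or error, nsect, lbal, lbam, lbah, then the five "hob" high-order bytes, then device) and that, for NCQ commands such as READ/WRITE FPDMA QUEUED, the sector count sits in the feature bytes. The helper name decode_taskfile is hypothetical. As a sanity check on those assumptions, the decoded sector counts times 512 match the byte counts the kernel prints ("ncq 241664 in", "ncq 262144 out").

```python
import re

# "cmd" or "res" followed by twelve hex bytes separated by "/" or ":"
TF_RE = re.compile(r"(cmd|res) ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")


def decode_taskfile(line):
    """Decode a libata 'cmd' or 'res' taskfile line (hypothetical helper).

    Returns None if the line does not contain a taskfile dump.
    """
    m = TF_RE.search(line)
    if m is None:
        return None
    (first, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    # 48-bit LBA: the hob_* bytes are the high-order half
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {
        "kind": m.group(1),
        "first_byte": first,      # command for "cmd", status for "res"
        "feature_or_error": feature,
        "lba": lba,
        # Only meaningful on "cmd" lines of NCQ commands (assumption)
        "ncq_sectors": hob_feature << 8 | feature,
    }


# Lines copied from the two excerpts in this thread
read_cmd = decode_taskfile("ata5.00: cmd 60/d8:00:38:4b:cb/01:00:73:00:00/40 tag 0 ncq 241664 in")
read_res = decode_taskfile("res 41/40:00:48:4c:cb/00:00:73:00:00/40 Emask 0x409 (media error) <F>")
write_cmd = decode_taskfile("ata5.00: cmd 61/00:00:98:e7:c4/02:00:3a:00:00/40 tag 0 ncq 262144 out")

print(read_cmd["lba"], read_cmd["ncq_sectors"] * 512)    # 1942702904 241664
print(read_res["lba"], hex(read_res["feature_or_error"]))  # 1942703176 0x40
print(write_cmd["lba"], write_cmd["ncq_sectors"] * 512)  # 985982872 262144
```

Under those assumptions, the failed read started at sector 1942702904 and the drive reported the error at sector 1942703176, inside the requested range, while the failed write during the parity rebuild was at sector 985982872.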

I found a thread of people having very similar issues, so I'm going to link to this thread from that one.

derekos · September 22, 2012

Did you ever resolve this?
