tjiddy Posted July 25, 2012

Long time listener, first time caller.

My unRAID server (5.0-rc5), which consists of parity, cache, and 6 drives (two of which were just added on Friday), became mostly unresponsive yesterday. I couldn't access any of the shares and the web interface was not responding, yet I didn't notice anything out of place in the syslog. I could telnet into the machine, but any attempt to do anything with the disks (even an ls) would result in a hung telnet session. I tried for hours to stop the array cleanly, but nothing worked. I eventually just pushed the reset button on the server.

The server started up OK and began its "unclean shutdown" parity check. I kept an eye on the logs this time and saw quite a few

kernel: md: parity incorrect: 1942702520

messages. I assumed this was because of my hard reset. After some time, the system became unresponsive again. This time I noticed this in the logs:

Jul 24 19:51:40 JABBA kernel: md: parity incorrect: 1942702576
Jul 24 19:52:11 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED
Jul 24 19:52:11 JABBA kernel: sas: Enter sas_scsi_recover_host
Jul 24 19:52:11 JABBA kernel: sas: trying to find task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf41a03c0
Jul 24 19:52:11 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jul 24 19:52:11 JABBA kernel: sas: sas_scsi_find_task: task 0xf41a03c0 failed to abort
Jul 24 19:52:11 JABBA kernel: sas: task 0xf41a03c0 is not at LU: I_T recover
Jul 24 19:52:11 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
Jul 24 19:52:13 JABBA kernel: mvsas 0000:02:00.0: Phy0 : No sig fis
Jul 24 19:52:13 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Jul 24 19:52:13 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
Jul 24 19:52:13 JABBA kernel: sas: I_T 0000000000000000 recovered
Jul 24 19:52:13 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
Jul 24 19:52:13 JABBA kernel: ata5: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
Jul 24 19:52:13 JABBA kernel: ata5.00: failed command: READ FPDMA QUEUED
Jul 24 19:52:13 JABBA kernel: ata5.00: cmd 60/d8:00:38:4b:cb/01:00:73:00:00/40 tag 0 ncq 241664 in
Jul 24 19:52:13 JABBA kernel: res 41/40:00:48:4c:cb/00:00:73:00:00/40 Emask 0x409 (media error) <F>
Jul 24 19:52:13 JABBA kernel: ata5.00: status: { DRDY ERR }
Jul 24 19:52:13 JABBA kernel: ata5.00: error: { UNC }
Jul 24 19:52:13 JABBA kernel: ata5.00: configured for UDMA/133
Jul 24 19:52:13 JABBA kernel: ata5: EH complete
Jul 24 19:52:13 JABBA kernel: ata6: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata7: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata8: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata9: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata10: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: ata11: sas eh calling libata port error handler
Jul 24 19:52:13 JABBA kernel: sas: --- Exit sas_scsi_recover_host
Jul 24 19:52:44 JABBA kernel: sas: command 0xf29fc6c0, task 0xf41a03c0, timed out: BLK_EH_NOT_HANDLED

I also noticed that the system is only recognizing one of my two memory sticks:
root@JABBA:~# free -m
             total       used       free     shared    buffers     cached
Mem:          1770       1635        134          0         13       1473
-/+ buffers/cache:        148       1622
Swap:            0          0          0
root@JABBA:~# vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0 46      0 138436  14120 1508744    0    0   158     0  695  284  0 13 47 39

After that long-winded story, I have two questions:

1) Do the above errors mean I have a dead drive? If so, why would a dead drive bring the whole system to its knees? I saw some sas references in there; could my backplane have suddenly gone bad?

2) Is there a way to gracefully shut down that I haven't thought of? I've tried the powerdown script, /etc/rc.c/rc.unRaid stop, the stop script, even a shutdown now -r. I get REALLY nervous pushing that reset button.

I have attached my full syslog. Any help would be greatly appreciated.

syslog.txt
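A note for anyone reading the ata5.00 lines above: as far as I understand libata's error output, the "res" line is the drive's result taskfile, laid out as status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device. That layout is an assumption on my part and is not stated anywhere in this thread. Under that assumption, the rough Python sketch below (decode_res_lba is a made-up helper name) reassembles the 48-bit sector address the drive reported for the failed read.

```python
def decode_res_lba(res: str) -> int:
    """Reassemble the 48-bit LBA from a libata 'res' taskfile string.

    Assumed field layout (not confirmed by the thread):
      status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    parts = res.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected taskfile string: {res!r}")

    # Second group: error, sector count, then the three low-order LBA bytes.
    low = [int(b, 16) for b in parts[1].split(":")]
    # Third group: the "high order byte" copies of the same registers.
    high = [int(b, 16) for b in parts[2].split(":")]
    if len(low) != 5 or len(high) != 5:
        raise ValueError(f"unexpected taskfile string: {res!r}")

    _error, _nsect, lbal, lbam, lbah = low
    _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = high

    # Bytes ordered from least to most significant.
    lba_bytes = [lbal, lbam, lbah, hob_lbal, hob_lbam, hob_lbah]
    return sum(byte << (8 * i) for i, byte in enumerate(lba_bytes))


if __name__ == "__main__":
    # The 'res' value from the READ FPDMA QUEUED failure in the log above.
    print(decode_res_lba("41/40:00:48:4c:cb/00:00:73:00:00/40"))
```

For the res value in the log this gives 1942703176, which is close to the "parity incorrect: 1942702576" figure logged about 30 seconds earlier. Whether the two numbers use the same units is not something the thread establishes, so treat the match as a hint rather than proof.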
tjiddy (Author) Posted July 26, 2012

I pulled the drive that seemed to be causing the hang (it was one of those new 3TB WD Red drives), restarted the array, and started rebuilding the parity drive. Again, the system became unresponsive, with this message in the syslog:

Jul 25 22:56:30 JABBA kernel: md: recovery thread syncing parity disk ...
Jul 25 22:56:30 JABBA kernel: md: using 1536k window, over a total of 2930266532 blocks.
Jul 26 00:18:56 JABBA kernel: sas: command 0xf43d4e40, task 0xf74b3540, timed out: BLK_EH_NOT_HANDLED
Jul 26 00:18:56 JABBA kernel: sas: Enter sas_scsi_recover_host
Jul 26 00:18:56 JABBA kernel: sas: trying to find task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: aborting task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: querying task 0xf74b3540
Jul 26 00:18:56 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5
Jul 26 00:18:56 JABBA kernel: sas: sas_scsi_find_task: task 0xf74b3540 failed to abort
Jul 26 00:18:56 JABBA kernel: sas: task 0xf74b3540 is not at LU: I_T recover
Jul 26 00:18:56 JABBA kernel: sas: I_T nexus reset for dev 0000000000000000
Jul 26 00:18:56 JABBA kernel: sas: sas_form_port: phy0 belongs to port0 already(1)!
Jul 26 00:18:58 JABBA kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[0]:rc= 0
Jul 26 00:18:58 JABBA kernel: sas: I_T 0000000000000000 recovered
Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 8d
Jul 26 00:18:58 JABBA kernel: ata5: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0
Jul 26 00:18:58 JABBA kernel: ata5.00: failed command: WRITE FPDMA QUEUED
Jul 26 00:18:58 JABBA kernel: ata5.00: cmd 61/00:00:98:e7:c4/02:00:3a:00:00/40 tag 0 ncq 262144 out
Jul 26 00:18:58 JABBA kernel: res 41/10:00:98:e7:c4/00:00:3a:00:00/40 Emask 0x481 (invalid argument) <F>
Jul 26 00:18:58 JABBA kernel: ata5.00: status: { DRDY ERR }
Jul 26 00:18:58 JABBA kernel: ata5.00: error: { IDNF }
Jul 26 00:18:58 JABBA kernel: ata5.00: configured for UDMA/133
Jul 26 00:18:58 JABBA kernel: ata5: EH complete
Jul 26 00:18:58 JABBA kernel: ata6: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata7: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata8: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata9: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: ata10: sas eh calling libata port error handler
Jul 26 00:18:58 JABBA kernel: sas: --- Exit sas_scsi_recover_host
Jul 26 00:18:58 JABBA kernel: sas: sas_ata_task_done: SAS error 2

I found a thread of people having very similar issues; I'm going to link to this thread in that one.
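Both excerpts in this thread show a failed command on ata5.00: a read ending in UNC in the first post and a write ending in IDNF in the second. With a full syslog it can help to tally those lines instead of scanning by eye. The Python sketch below is only an illustration: the function name summarize_ata_failures is made up, and the regular expressions assume the "failed command:" and "error: { ... }" line shapes seen in the excerpts above, which may differ on other kernels.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread (an assumption,
# not a general syslog grammar).
FAILED_RE = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")
ERROR_RE = re.compile(r"kernel: (ata\d+\.\d+): error: \{ (.+?) \}")


def summarize_ata_failures(lines):
    """Count failed commands and error flags per ATA device.

    Returns two Counters keyed by (device, command) and (device, flags).
    """
    failed = Counter()
    errors = Counter()
    for raw in lines:
        line = raw.rstrip()
        match = FAILED_RE.search(line)
        if match:
            failed[(match.group(1), match.group(2))] += 1
            continue
        match = ERROR_RE.search(line)
        if match:
            errors[(match.group(1), match.group(2))] += 1
    return failed, errors


if __name__ == "__main__":
    sample = [
        "Jul 24 19:52:13 JABBA kernel: ata5.00: failed command: READ FPDMA QUEUED",
        "Jul 24 19:52:13 JABBA kernel: ata5.00: error: { UNC }",
        "Jul 26 00:18:58 JABBA kernel: ata5.00: failed command: WRITE FPDMA QUEUED",
        "Jul 26 00:18:58 JABBA kernel: ata5.00: error: { IDNF }",
    ]
    failed, errors = summarize_ata_failures(sample)
    for key, count in sorted(failed.items()):
        print("failed", key, count)
    for key, count in sorted(errors.items()):
        print("error", key, count)
```

To run it against a real log, pass an open file instead of the sample list, for example summarize_ata_failures(open("syslog.txt")). Mapping ata5 back to a physical drive or backplane slot still has to be done separately; the tally only shows which device names the kernel complained about.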
derekos Posted September 22, 2012

Did you ever resolve this?