bkastner Posted September 3, 2015

So while trying to help someone else diagnose an issue with the SAS2LP card I had started a parity check, and just left it running. I got an email notification after a few hours of a failed drive (even though it was fine on Sept 1st during the regular parity check).

In the GUI I see the drive X'd out with 'device is disabled, contents emulated', but looking at the disk I don't really see any issues:

SMART self-test history:
Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline     Completed without error  00%        12738            -

SMART error log:
No Errors Logged

  #  Attribute Name           Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
  1  Raw Read Error Rate      0x002f  200    200    051        Pre-fail  Always   -       0
  3  Spin Up Time             0x0027  178    176    021        Pre-fail  Always   -       8091
  4  Start Stop Count         0x0032  098    098    000        Old age   Always   -       2333
  5  Reallocated Sector Ct    0x0033  200    200    140        Pre-fail  Always   -       0
  7  Seek Error Rate          0x002e  100    253    000        Old age   Always   -       0
  9  Power On Hours           0x0032  083    083    000        Old age   Always   -       12739 (1y, 165d, 19h)
 10  Spin Retry Count         0x0032  100    100    000        Old age   Always   -       0
 11  Calibration Retry Count  0x0032  100    253    000        Old age   Always   -       0
 12  Power Cycle Count        0x0032  100    100    000        Old age   Always   -       29
192  Power-Off Retract Count  0x0032  200    200    000        Old age   Always   -       12
193  Load Cycle Count         0x0032  199    199    000        Old age   Always   -       4748
194  Temperature Celsius      0x0022  122    108    000        Old age   Always   -       30
196  Reallocated Event Count  0x0032  200    200    000        Old age   Always   -       0
197  Current Pending Sector   0x0032  200    200    000        Old age   Always   -       0
198  Offline Uncorrectable    0x0030  200    200    000        Old age   Offline  -       0
199  UDMA CRC Error Count     0x0032  200    200    000        Old age   Always   -       0
200  Multi Zone Error Rate    0x0008  200    200    000        Old age   Offline  -       0

I've tried rebooting, but the drive still shows as failed. What would be my next steps to try? Should I replace it right away, or is the drive likely okay and something else the issue?
Before I get asked, these drives are attached to a backplane on a Norco 4224, and I have not even been in the same room as the server in almost a week - nothing was jarred, jangled, nudged, bumped, or given a dirty look.
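Editor's sketch: the "I don't really see any issues" reading of the SMART table can be checked mechanically. The Python below is a minimal sketch, not an unRAID or smartctl feature. It assumes rows laid out like the table in the post (ID, name, flag, value, worst, threshold, type, updated, failed, raw value). The function name `smart_warnings` and the choice to watch raw values of attributes 5, 197 and 198 are illustrative assumptions, not something stated in the thread.

```python
import re

# One attribute row: ID, name, flag, value, worst, threshold, type,
# updated, failed, raw value (layout assumed from the post above).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>.+?)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?:Pre-fail|Old[ _]age)\s+\S+\s+(?P<failed>\S+)\s+(?P<raw>\d+)"
)

# Assumption: reallocated, pending and offline-uncorrectable sector counts
# are the raw values most worth watching for a failing disk.
WATCH_RAW = {5, 197, 198}


def smart_warnings(table_text):
    """Return a list of human-readable warnings for the given attribute rows."""
    warnings = []
    for line in table_text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, blank line, or an unrecognized layout
        attr_id = int(m["id"])
        value, thresh, raw = int(m["value"]), int(m["thresh"]), int(m["raw"])
        name = m["name"].strip()
        # A normalized value at or below a non-zero threshold means "failed".
        if thresh > 0 and value <= thresh:
            warnings.append(f"{attr_id} {name}: value {value} <= threshold {thresh}")
        if attr_id in WATCH_RAW and raw > 0:
            warnings.append(f"{attr_id} {name}: raw value {raw} is non-zero")
    return warnings


if __name__ == "__main__":
    rows = """
      5 Reallocated Sector Ct 0x0033 200 200 140 Pre-fail Always - 0
    197 Current Pending Sector 0x0032 200 200 000 Old age Always - 0
    198 Offline Uncorrectable 0x0030 200 200 000 Old age Offline - 0
    """
    for warning in smart_warnings(rows) or ["no warnings"]:
        print(warning)
```

An empty result only says the attributes look clean. As the rest of the thread shows, a disk can still be dropped from the array for reasons that have nothing to do with the disk itself.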
dgaschk Posted September 3, 2015

Tools->Diagnostics. Attach file.
bkastner Posted September 3, 2015

Here you go.

cydstorage-diagnostics-20150903-1603.zip
bkastner Posted September 4, 2015

Is anyone able to review the diagnostics and provide recommendations on next steps? Thanks
RobJ Posted September 4, 2015

I don't see any problems with the drives, including Disk 9, so I don't think it was the drive's fault. Unfortunately, you have rebooted, so there is no evidence of what happened. However, you have the Powerdown plugin, and it saves syslogs, so could you provide the previous syslog, the one that contains the failure of Disk 9?

Since you have a lot of apps, ones that download stuff, you should probably rebuild Disk 9 onto itself: unassign it, start and stop the array, then re-assign it and restart the array, so it can rebuild Disk 9 onto itself with any changes made since it was disabled.
bkastner Posted September 4, 2015

I've attached the log. I rebooted only because I've had something similar happen in the past, and a reboot brought the disks back online normally. I wanted to confirm this didn't fix the issue before submitting to the forum.

I will wait for a bit to give you (or whoever) a chance to review the previous log before I do the rebuild. I'm glad to hear the disk looks good, but am confused as to what happened. Hopefully the previous log shows something.

syslog-20150903-145649.zip
RobJ Posted September 4, 2015

Sep 3 11:04:18 CydStorage kernel: md: recovery thread checking parity...
...
Sep 3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFBFFFFF, slot [36].
Sep 3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723530288
Sep 3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723530296
...[many more parity errors snipped]...
Sep 3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723532224
Sep 3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F38A1670, slot [36].
Sep 3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F38A1660, slot [24].
Sep 3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F28A1660, slot [38].
Sep 3 14:36:16 CydStorage kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32
Sep 3 14:36:16 CydStorage kernel: sas: trying to find task 0xffff88040a7df600
Sep 3 14:36:16 CydStorage kernel: sas: sas_scsi_find_task: aborting task 0xffff88040a7df600
Sep 3 14:36:16 CydStorage kernel: sas: sas_scsi_find_task: task 0xffff88040a7df600 is aborted
Sep 3 14:36:16 CydStorage kernel: sas: sas_eh_handle_sas_errors: task 0xffff88040a7df600 is aborted
...[snipped many more sas tasks aborted]...
Sep 3 14:36:16 CydStorage kernel: ata16.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x6 frozen
Sep 3 14:36:16 CydStorage kernel: sas: ata17: end_device-8:3: dev error handler
Sep 3 14:36:16 CydStorage kernel: ata16.00: failed command: READ FPDMA QUEUED
Sep 3 14:36:16 CydStorage kernel: ata17.00: exception Emask 0x1 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
Sep 3 14:36:16 CydStorage kernel: sas: ata18: end_device-8:4: dev error handler
Sep 3 14:36:16 CydStorage kernel: ata16.00: cmd 60/20:00:38:c3:55/00:00:a2:00:00/40 tag 7 ncq 16384 in
Sep 3 14:36:16 CydStorage kernel: res 40/00:00:5f:2f:16/00:00:93:00:00/e0 Emask 0x4 (timeout)
Sep 3 14:36:16 CydStorage kernel: ata17.00: failed command: READ FPDMA QUEUED
Sep 3 14:36:16 CydStorage kernel: ata16.00: status: { DRDY }
Sep 3 14:36:16 CydStorage kernel: ata17.00: cmd 60/60:00:18:c9:55/00:00:a2:00:00/40 tag 0 ncq 49152 in
Sep 3 14:36:16 CydStorage kernel: res 41/04:00:47:c4:55/00:00:a2:00:00/40 Emask 0x5 (timeout)
Sep 3 14:36:16 CydStorage kernel: sas: ata19: end_device-8:5: dev error handler
Sep 3 14:36:16 CydStorage kernel: ata16: hard resetting link
Sep 3 14:36:16 CydStorage kernel: ata17.00: status: { DRDY ERR }
Sep 3 14:36:16 CydStorage kernel: ata17.00: error: { ABRT }
Sep 3 14:36:16 CydStorage kernel: ata17.00: failed command: READ FPDMA QUEUED
...[snipped much more like the above]...
Sep 3 14:37:08 CydStorage kernel: ata17.00: qc timeout (cmd 0xa1)
Sep 3 14:37:08 CydStorage kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Sep 3 14:37:08 CydStorage kernel: ata17.00: revalidation failed (errno=-5)
Sep 3 14:37:08 CydStorage kernel: ata17.00: disabled
...[snipped hundreds of lines with related errors for Disk 9 (sdm, ata17, sd 8:0:3:0)]...

A manual parity check was started at 11:04:18, then a SAS card malfunction occurs at 14:35:45. I believe this is a 9485 card. There are immediate parity errors (alarming!), then more SAS card malfunction, then 2 drives are unresponsive.
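Editor's sketch: the sequence described above (parity check start, mvsas driver messages, parity corrections, link resets, device disabled) can be pulled out of a saved syslog with a few regular expressions. The Python below is a rough sketch that assumes kernel lines shaped like the excerpt in this post. The category labels and the function name `summarize` are made up for illustration, and the patterns cover only the message types quoted here.

```python
import re

# "Sep 3 14:35:45 CydStorage kernel: <message>"
LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+(?P<msg>.*)$"
)

# Illustrative categories, matched in order; labels are not official terms.
CATEGORIES = [
    ("parity check started", re.compile(r"md: recovery thread checking parity")),
    ("mvsas driver message", re.compile(r"drivers/scsi/mvsas/")),
    ("parity corrected", re.compile(r"md: correcting parity")),
    ("SAS error recovery", re.compile(r"sas: Enter sas_scsi_recover_host")),
    ("link reset", re.compile(r"ata\d+: hard resetting link")),
    ("device disabled", re.compile(r"ata\d+\.\d+: disabled")),
]


def summarize(lines):
    """Map each category seen to its first timestamp and number of hits."""
    summary = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # snipped markers or non-kernel lines
        for label, pattern in CATEGORIES:
            if pattern.search(m["msg"]):
                entry = summary.setdefault(label, {"first": m["ts"], "count": 0})
                entry["count"] += 1
                break
    return summary


if __name__ == "__main__":
    sample = [
        "Sep 3 11:04:18 CydStorage kernel: md: recovery thread checking parity...",
        "Sep 3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFBFFFFF, slot [36].",
        "Sep 3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723530288",
        "Sep 3 14:36:16 CydStorage kernel: ata16: hard resetting link",
        "Sep 3 14:37:08 CydStorage kernel: ata17.00: disabled",
    ]
    for label, info in summarize(sample).items():
        print(f"{info['first']}  {label} (x{info['count']})")
```

Because the summary keeps insertion order, the output reads as a timeline. Seeing driver messages immediately before the first parity correction is the pattern that points at the controller rather than the disk.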
One drive at ata16 (without an ERR) responds later and is recovered. The other drive at ata17 (with an ERR of ABRT) does not, and less than a minute later is disabled. You can ignore all subsequent errors related to it. Because it can't be reached, unRAID drops it from the array.

The good news is the drive is fine, no issues with it at all. It's Disk 9, sdm.

The bad news is the SAS card crash. I have just seen this essentially identical behavior very recently, but can't remember where. Same error messages, same series of aborted tasks, and 2 drives behaved the same, and one was dropped. I don't know how to advise. The SAS error handler is really unhelpful, never actually says what went wrong with it. I suspect there's nothing special about the 2 drives, they were just the next 2 to be accessed. So the drive being dropped is probably a random thing.

The immediate series of parity errors would indicate that bad data was produced by the SAS card, and that caused parity 'corrections' that corrupted parity at that point. This is especially alarming now, because you want to rebuild Disk 9. It looks like you shut the system down fairly quickly after this, so you may want to consider accepting the current physical Disk 9 instead of a rebuilt one. That would mean doing a New Config and indicating parity is correct, then allowing the subsequent parity check to correct the parity corruptions.
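Editor's sketch: why bad reads during a correcting parity check make a later rebuild risky can be shown with a toy example. It assumes single parity computed as a bytewise XOR across the data disks. The disk names, byte values, and the choice of which disk returned bad data are all made up for illustration; the log does not say which reads were bad.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(parity, surviving_blocks):
    """Reconstruct the one missing block from parity and the surviving blocks."""
    return xor_blocks(parity, *surviving_blocks)


# Toy array: three data disks and one parity disk (hypothetical contents).
disk_a = bytes([0x11, 0x22, 0x33, 0x44])
disk_b = bytes([0x55, 0x66, 0x77, 0x88])
disk_9 = bytes([0x99, 0xAA, 0xBB, 0xCC])
parity = xor_blocks(disk_a, disk_b, disk_9)

# With good parity, a rebuild of Disk 9 reproduces its contents exactly.
assert rebuild_missing(parity, [disk_a, disk_b]) == disk_9

# During a correcting check the controller hands back garbage for disk_b.
# The check sees a mismatch and "corrects" parity to match the bad read.
glitched_read_of_b = bytes([0x00, 0x00, 0x00, 0x00])
corrupted_parity = xor_blocks(disk_a, glitched_read_of_b, disk_9)

# Later, Disk 9 is rebuilt from the corrupted parity and the real disks.
rebuilt_9 = rebuild_missing(corrupted_parity, [disk_a, disk_b])
assert rebuilt_9 != disk_9  # the rebuilt data no longer matches the original
```

In this model the physical Disk 9 still holds the correct bytes, which is the reasoning behind preferring New Config with the existing disk over a rebuild from parity that may have been "corrected" against bad reads.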
bkastner Posted September 4, 2015

Ahh... crap. So, I waited for a few hours, but not knowing how long it would take for someone to review my logs I got impatient and started the rebuild as per the original suggestion. I wasn't expecting this result. I am 15.6% into the rebuild as of this moment, so am guessing I am committed to it.

This card is actually the 9480 card (I had posted details in the SAS2LP thread on my cards and both were 9480s).

This sucks....
RobJ Posted September 4, 2015

I'm really sorry. I needed some sleep, and it takes me a while to pick things back up. If you feel I bear any responsibility for any damage that may occur, I do apologize.
bkastner Posted September 4, 2015

RobJ said:
I'm really sorry. I needed some sleep, and it takes me awhile to pick things back up. If you feel I bear any responsibility for any damage that may occur, I do apologize.

No, there is no one to blame but myself. I just get impatient. Honestly, I appreciate all that you and the other moderators and helpful forum members do.

So, since I am now 26.5% into the rebuild, should I just let it finish?

Also, if I was to look to replace the SAS2LP cards, what would you suggest? I've seen a lot of comments around the M1015 as a preferred card, but wanted to confirm if that would be the recommended replacement. I've had pretty good luck with the SAS2LP cards so far, but with other people's issues, and now my flakiness I think I want to look into another card that can manage 8 drives (per card). I'd appreciate any suggestions you may have.
RobJ Posted September 7, 2015

bkastner said:
Also, if I was to look to replace the SAS2LP cards, what would you suggest? I've seen a lot of comments around the M1015 as a preferred card, but wanted to confirm if that would be the recommended replacement. I've had pretty good luck with the SAS2LP cards so far, but with other people's issues, and now my flakiness I think I want to look into another card that can manage 8 drives (per card). I'd appreciate any suggestions you may have.

I'm sorry, I've long been financially limited, never had a SAS card, so no help at all to you. Hopefully others will have advice.

I'm also hoping damage was light or non-existent (as in only some unused disk space affected)?
bkastner Posted September 7, 2015

bkastner said:
Also, if I was to look to replace the SAS2LP cards, what would you suggest? I've seen a lot of comments around the M1015 as a preferred card, but wanted to confirm if that would be the recommended replacement. I've had pretty good luck with the SAS2LP cards so far, but with other people's issues, and now my flakiness I think I want to look into another card that can manage 8 drives (per card). I'd appreciate any suggestions you may have.

RobJ said:
I'm sorry, I've long been financially limited, never had a SAS card, so no help at all to you. Hopefully others will have advice. I'm also hoping damage was light or non-existent (as in only some unused disk space affected)?

Thanks Rob. I had 100MB free on a 4TB drive, so am doubtful I will be that lucky. Also, it was entirely TV episodes, so who knows when I will stumble across the corrupted files (if I ever do).

It's life, and mostly self-inflicted, so life just moves on. I appreciate your assistance and willingness to help.
This topic is now archived and is closed to further replies.