Failed Drive - I think...


bkastner

While trying to help someone else diagnose an issue with the SAS2LP card, I started a parity check and left it running. A few hours later I got an email notification of a failed drive, even though the drive was fine during the regular parity check on Sept 1st.

 

In the GUI I see the drive X'd out with 'device is disabled, contents emulated', but looking at the disk I don't really see any issues:

 


SMART self-test history:
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12738         -

SMART error log:
No Errors Logged

 

#    Attribute Name           Flag    Value  Worst  Threshold  Type      Updated  Failed  Raw Value
1    Raw Read Error Rate      0x002f  200    200    051        Pre-fail  Always   -       0
3    Spin Up Time             0x0027  178    176    021        Pre-fail  Always   -       8091
4    Start Stop Count         0x0032  098    098    000        Old age   Always   -       2333
5    Reallocated Sector Ct    0x0033  200    200    140        Pre-fail  Always   -       0
7    Seek Error Rate          0x002e  100    253    000        Old age   Always   -       0
9    Power On Hours           0x0032  083    083    000        Old age   Always   -       12739 (1y, 165d, 19h)
10   Spin Retry Count         0x0032  100    100    000        Old age   Always   -       0
11   Calibration Retry Count  0x0032  100    253    000        Old age   Always   -       0
12   Power Cycle Count        0x0032  100    100    000        Old age   Always   -       29
192  Power-Off Retract Count  0x0032  200    200    000        Old age   Always   -       12
193  Load Cycle Count         0x0032  199    199    000        Old age   Always   -       4748
194  Temperature Celsius      0x0022  122    108    000        Old age   Always   -       30
196  Reallocated Event Count  0x0032  200    200    000        Old age   Always   -       0
197  Current Pending Sector   0x0032  200    200    000        Old age   Always   -       0
198  Offline Uncorrectable    0x0030  200    200    000        Old age   Offline  -       0
199  UDMA CRC Error Count     0x0032  200    200    000        Old age   Always   -       0
200  Multi Zone Error Rate    0x0008  200    200    000        Old age   Offline  -       0
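
For anyone who wants to check a report like the one above without reading every row, here is a rough Python sketch. It is not an unRAID or smartmontools tool; the function name `flag_attributes`, the choice of which attribute IDs to watch, and the whitespace-separated sample rows are all my own assumptions for illustration. It flags a row when the normalized value has dropped to the threshold, or when one of the usual "bad sector / bad cable" counters has a non-zero raw value.

```python
import re

# One attribute row: ID, name, flag, value, worst, threshold, type, updated, failed, raw.
# Assumes whitespace-separated columns like the table in the post.
ROW = re.compile(
    r"^\s*(\d+)\s+(.+?)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(?:Pre-fail|Old age)\s+\w+\s+\S+\s+(\d+)"
)

# Attributes whose raw count is commonly watched (an assumption, not an official list):
# reallocated sectors/events, pending sectors, offline uncorrectable, CRC errors.
WATCH_RAW = {5, 196, 197, 198, 199}


def flag_attributes(report):
    """Return (id, name, reason) tuples for rows that look suspicious."""
    flagged = []
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header or unrelated line
        attr_id, name, value, _worst, thresh, raw = m.groups()
        attr_id, value, thresh, raw = int(attr_id), int(value), int(thresh), int(raw)
        if thresh > 0 and value <= thresh:
            flagged.append((attr_id, name, "normalized value at or below threshold"))
        elif attr_id in WATCH_RAW and raw > 0:
            flagged.append((attr_id, name, "non-zero raw count"))
    return flagged


# A few rows copied from the report in this post.
SAMPLE = """\
1 Raw Read Error Rate 0x002f 200 200 051 Pre-fail Always - 0
5 Reallocated Sector Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current Pending Sector 0x0032 200 200 000 Old age Always - 0
198 Offline Uncorrectable 0x0030 200 200 000 Old age Offline - 0
199 UDMA CRC Error Count 0x0032 200 200 000 Old age Always - 0
"""

print(flag_attributes(SAMPLE))  # prints []
```

An empty list for these rows matches the impression that the SMART data itself shows nothing wrong with the drive.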

 

I've tried rebooting, but the drive still shows as failed.

 

What should my next steps be? Should I replace it right away, or is the drive likely okay and something else is the issue?

 

Before I get asked, these drives are attached to a backplane on a Norco 4224, and I have not even been in the same room as the server in almost a week - nothing was jarred, jangled, nudged, bumped, or given a dirty look.

Link to comment

I don't see any problems with the drives, including Disk 9, so I don't think it was the drive's fault.  Unfortunately, you have rebooted, so the current syslog holds no evidence of what happened.  However, you have the Powerdown plugin, and it saves syslogs, so could you provide the previous syslog, the one that contains the failure of Disk 9?

 

Since you have a lot of apps, including ones that download stuff, you should probably rebuild Disk 9 onto itself so that it picks up any changes made since it was disabled.  To do that, unassign the disk, start and then stop the array, re-assign the disk, and start the array again.

Link to comment

I've attached the log.

 

I rebooted only because I've had something similar happen in the past, and a reboot brought the disks back online normally. I wanted to confirm that a reboot didn't fix the issue before posting to the forum.

 

I will wait for a bit to give you (or whoever) a chance to review the previous log before I do the rebuild. I'm glad to hear the disk looks good, but I'm confused as to what happened. Hopefully the previous log shows something.

syslog-20150903-145649.zip

Link to comment

Sep  3 11:04:18 CydStorage kernel: md: recovery thread checking parity...

...

Sep  3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active FFBFFFFF,  slot [36].

Sep  3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723530288

Sep  3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723530296

...[many more parity errors snipped]...

Sep  3 14:35:45 CydStorage kernel: md: correcting parity, sector=2723532224

Sep  3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F38A1670,  slot [36].

Sep  3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F38A1660,  slot [24].

Sep  3 14:35:45 CydStorage kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active F28A1660,  slot [38].

Sep  3 14:36:16 CydStorage kernel: sas: Enter sas_scsi_recover_host busy: 32 failed: 32

Sep  3 14:36:16 CydStorage kernel: sas: trying to find task 0xffff88040a7df600

Sep  3 14:36:16 CydStorage kernel: sas: sas_scsi_find_task: aborting task 0xffff88040a7df600

Sep  3 14:36:16 CydStorage kernel: sas: sas_scsi_find_task: task 0xffff88040a7df600 is aborted

Sep  3 14:36:16 CydStorage kernel: sas: sas_eh_handle_sas_errors: task 0xffff88040a7df600 is aborted

...[snipped many more sas tasks aborted]...

Sep  3 14:36:16 CydStorage kernel: ata16.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x6 frozen

Sep  3 14:36:16 CydStorage kernel: sas: ata17: end_device-8:3: dev error handler

Sep  3 14:36:16 CydStorage kernel: ata16.00: failed command: READ FPDMA QUEUED

Sep  3 14:36:16 CydStorage kernel: ata17.00: exception Emask 0x1 SAct 0x7fffffff SErr 0x0 action 0x6 frozen

Sep  3 14:36:16 CydStorage kernel: sas: ata18: end_device-8:4: dev error handler

Sep  3 14:36:16 CydStorage kernel: ata16.00: cmd 60/20:00:38:c3:55/00:00:a2:00:00/40 tag 7 ncq 16384 in

Sep  3 14:36:16 CydStorage kernel:        res 40/00:00:5f:2f:16/00:00:93:00:00/e0 Emask 0x4 (timeout)

Sep  3 14:36:16 CydStorage kernel: ata17.00: failed command: READ FPDMA QUEUED

Sep  3 14:36:16 CydStorage kernel: ata16.00: status: { DRDY }

Sep  3 14:36:16 CydStorage kernel: ata17.00: cmd 60/60:00:18:c9:55/00:00:a2:00:00/40 tag 0 ncq 49152 in

Sep  3 14:36:16 CydStorage kernel:        res 41/04:00:47:c4:55/00:00:a2:00:00/40 Emask 0x5 (timeout)

Sep  3 14:36:16 CydStorage kernel: sas: ata19: end_device-8:5: dev error handler

Sep  3 14:36:16 CydStorage kernel: ata16: hard resetting link

Sep  3 14:36:16 CydStorage kernel: ata17.00: status: { DRDY ERR }

Sep  3 14:36:16 CydStorage kernel: ata17.00: error: { ABRT }

Sep  3 14:36:16 CydStorage kernel: ata17.00: failed command: READ FPDMA QUEUED

...[snipped much more like the above]...

Sep  3 14:37:08 CydStorage kernel: ata17.00: qc timeout (cmd 0xa1)

Sep  3 14:37:08 CydStorage kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x5)

Sep  3 14:37:08 CydStorage kernel: ata17.00: revalidation failed (errno=-5)

Sep  3 14:37:08 CydStorage kernel: ata17.00: disabled

...[snipped hundreds of lines with related errors for Disk 9 (sdm, ata17, sd 8:0:3:0)]...

 

A manual parity check was started at 11:04:18, then a SAS card malfunction occurred at 14:35:45.  I believe this is a 9485 card.  There were immediate parity errors (alarming!), then more SAS card malfunctions, then 2 drives became unresponsive.  One drive at ata16 (without an ERR) responded later and was recovered.  The other drive at ata17 (with an ERR of ABRT) did not, and less than a minute later it was disabled.  You can ignore all subsequent errors related to it.  Because it couldn't be reached, unRAID dropped it from the array.  The good news is the drive is fine, no issues with it at all.  It's Disk 9, sdm.
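
The timeline above can be pulled out of a saved syslog with a few lines of Python. This is only a sketch: the function name `summarize` and the rule that "exception" or "failed" marks a port's first error are assumptions of mine, and it only understands the kernel line layout seen in the excerpt above.

```python
import re

# "Sep  3 14:36:16 CydStorage kernel: <message>"
LINE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")
# "<port>: ..." or "<port>.00: ..." at the start of the kernel message
ATA = re.compile(r"^(ata\d+)(?:\.\d+)?: (.*)$")


def summarize(log_text):
    """Map each ata port to its first error time and, if any, the time it was disabled."""
    events = {}
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue
        stamp, msg = m.groups()
        a = ATA.match(msg)
        if not a:
            continue  # mvsas/sas/md lines are skipped in this sketch
        port, detail = a.groups()
        info = events.setdefault(port, {"first_error": None, "disabled_at": None})
        if info["first_error"] is None and ("exception" in detail or "failed" in detail):
            info["first_error"] = stamp
        if detail == "disabled":
            info["disabled_at"] = stamp
    return events


# Lines copied from the syslog excerpt in this thread.
SAMPLE = """\
Sep  3 14:36:16 CydStorage kernel: ata16.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x6 frozen
Sep  3 14:36:16 CydStorage kernel: ata17.00: exception Emask 0x1 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
Sep  3 14:36:16 CydStorage kernel: ata16: hard resetting link
Sep  3 14:37:08 CydStorage kernel: ata17.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Sep  3 14:37:08 CydStorage kernel: ata17.00: disabled
"""

for port, info in summarize(SAMPLE).items():
    print(port, info)
```

On these sample lines both ports get a first error at 14:36:16, and only ata17 gets a disabled time (14:37:08), which is the port that corresponds to Disk 9.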

 

The bad news is the SAS card crash.  I saw essentially identical behavior very recently, but can't remember where.  Same error messages, same series of aborted tasks, 2 drives behaved the same way, and one was dropped.  I don't know how to advise.  The SAS error handler is really unhelpful; it never actually says what went wrong.  I suspect there's nothing special about the 2 drives, they were just the next 2 to be accessed.  So which drive was dropped is probably random.

 

The immediate series of parity errors would indicate that bad data was produced by the SAS card, and that caused parity 'corrections' that corrupted parity at that point.  This is especially alarming now, because you want to rebuild Disk 9.  It looks like you shut the system down fairly quickly after this, so you may want to consider accepting the current physical Disk 9 instead of a rebuilt one.  That would mean doing a New Config and indicating parity is correct, then allowing the subsequent parity check to correct the parity corruptions.
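
To show why a bad parity 'correction' matters for a rebuild, here is a toy Python sketch. It assumes plain single XOR parity and uses made-up one-byte "disks"; the names `xor_parity` and `rebuild_missing` are hypothetical and this is not how unRAID's md driver is actually written, just the arithmetic behind it.

```python
from functools import reduce


def xor_parity(blocks):
    """Parity is the XOR of the same block on every data disk."""
    return reduce(lambda a, b: a ^ b, blocks)


def rebuild_missing(parity, surviving_blocks):
    """A missing block is parity XORed with all surviving data blocks."""
    return reduce(lambda a, b: a ^ b, surviving_blocks, parity)


# Made-up contents of one block on three data disks.
disk_a, disk_b, disk9 = 0b10110010, 0b01101100, 0b11100001

good_parity = xor_parity([disk_a, disk_b, disk9])
print(rebuild_missing(good_parity, [disk_a, disk_b]) == disk9)  # prints True

# During a correcting parity check the controller returns bad data for disk_b
# (one flipped bit), so parity gets "corrected" to match the bad read.
bad_read_b = disk_b ^ 0b00000100
corrupted_parity = xor_parity([disk_a, bad_read_b, disk9])

# Later, Disk 9 is rebuilt from that parity plus the real contents of the other disks.
rebuilt = rebuild_missing(corrupted_parity, [disk_a, disk_b])
print(rebuilt == disk9)  # prints False
```

The rebuilt block differs from the real Disk 9 block in exactly the bit that was misread, which is why keeping the physical Disk 9 and letting a parity check fix parity is the safer direction in this situation.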

Link to comment

Ahh... crap.

 

So, I waited for a few hours, but not knowing how long it would take for someone to review my logs I got impatient and started the rebuild as per the original suggestion. I wasn't expecting this result. :(

 

I am 15.6% into the rebuild as of this moment, so I'm guessing I am committed to it.

 

This card is actually the 9480 card (I had posted details in the SAS2LP thread on my cards and both were 9480s).

 

This sucks....

Link to comment

I'm really sorry.  I needed some sleep, and it takes me a while to pick things back up.  If you feel I bear any responsibility for any damage that may occur, I do apologize.

 

No, there is no one to blame but myself. I just get impatient.

 

Honestly, I appreciate all that you and the other moderators and helpful forum members do.

 

So, since I am now 26.5% into the rebuild, should I just let it finish?

 

Also, if I were to replace the SAS2LP cards, what would you suggest? I've seen a lot of comments about the M1015 as a preferred card, but wanted to confirm that it would be the recommended replacement. I've had pretty good luck with the SAS2LP cards so far, but given other people's issues and now my own flakiness, I think I want to look into another card that can manage 8 drives (per card).

 

I'd appreciate any suggestions you may have.

Link to comment

I'm sorry, I've long been financially limited, never had a SAS card, so no help at all to you.  Hopefully others will have advice.

 

I'm also hoping damage was light or non-existent (as in only some unused disk space affected)?

Link to comment

 

Thanks Rob.

 

I had 100MB free on a 4TB drive, so I doubt I will be that lucky. Also, it was entirely TV episodes, so who knows when I will stumble across the corrupted files (if I ever do).

 

It's life, and mostly self-inflicted, so life just moves on. :)

 

I appreciate your assistance and willingness to help.

Link to comment
