[Resolved] Too many wrong and/or missing disks!



I am running Unraid 6.8.1. I think I've run into a situation where I might be royally screwed. Any assistance is appreciated.

 

I was in the process of replacing a 4 TB HDD with a 10 TB HDD. I powered down the array, swapped out the physical drive, and put in the new one. I powered up and all devices were present except for the intentionally missing disk 6. Then I changed disk 6 to point to the new HDD, /dev/sdo (same as before the swap). Unraid informed me that it would erase all data and I proceeded. According to the logs, before Unraid could even format disk 6 (sdo), disk 7 (sdh) threw errors and the array failed to start. A kernel panic then ensued, but I think that was from the ATA error. I suspect the Marvell SAS driver is at fault, as so many others have experienced. I've never had a problem with these controllers until now.

 

So I shut down Unraid, put the previous 4 TB HDD back as /dev/sdo, and booted. The BIOS sees all the devices.

 

Here is the entry where the array knows I'm missing disk 6:

2020-01-21T19:19:56-07:00 nas1 kernel: md: import disk6: (sdo) WDC_WD100EFAX-68LHPN0_JEKAM33N size: 9766436812 
2020-01-21T19:19:56-07:00 nas1 kernel: md: import_slot: 6 wrong
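
For anyone digging through a long syslog for the same thing, here is a minimal Python sketch that pulls out the `md: import` entries and the slots flagged `wrong`. The regular expressions are modeled only on the two lines quoted above, so treat them as an assumption; other Unraid versions may word these messages differently. The function name `parse_md_imports` is made up for this example.

```python
import re

# Patterns modeled on the two log lines quoted in this post (assumption:
# other Unraid releases may format these messages differently).
IMPORT_RE = re.compile(
    r"md: import disk(?P<slot>\d+): \((?P<dev>\w+)\) (?P<ident>\S+) size: (?P<size>\d+)"
)
WRONG_RE = re.compile(r"md: import_slot: (?P<slot>\d+) wrong")


def parse_md_imports(lines):
    """Return ({slot: info}, set of slot numbers flagged 'wrong')."""
    disks = {}
    wrong = set()
    for line in lines:
        m = IMPORT_RE.search(line)
        if m:
            disks[int(m.group("slot"))] = {
                "dev": m.group("dev"),
                "ident": m.group("ident"),
                "size_blocks": int(m.group("size")),
            }
            continue
        m = WRONG_RE.search(line)
        if m:
            wrong.add(int(m.group("slot")))
    return disks, wrong


# The two lines from this post, used as sample input.
sample = [
    "2020-01-21T19:19:56-07:00 nas1 kernel: md: import disk6: (sdo) "
    "WDC_WD100EFAX-68LHPN0_JEKAM33N size: 9766436812 ",
    "2020-01-21T19:19:56-07:00 nas1 kernel: md: import_slot: 6 wrong",
]

disks, wrong = parse_md_imports(sample)
for slot, info in sorted(disks.items()):
    flag = "WRONG" if slot in wrong else "ok"
    print(f"disk{slot}: {info['dev']} {info['ident']} [{flag}]")
```

To run it against a real log, pass an open file object instead of `sample`; the function only needs an iterable of lines.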

 

Imports all the disks:

2020-01-21T19:20:06-07:00 nas1 kernel: md6: running, size: 9766436812 blocks
2020-01-21T19:20:06-07:00 nas1 kernel: md7: running, size: 9766436812 blocks

 

When Unraid tries to mount the disk 7 (md7) filesystem, it blows up:

2020-01-21T19:20:11-07:00 nas1 kernel: sas: sas_ata_task_done: SAS error 8a
2020-01-21T19:20:12-07:00 nas1 kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata7: end_device-7:0: cmd error handler
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata7: end_device-7:0: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata8: end_device-7:1: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: failed command: READ DMA EXT
2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: cmd 25/00:08:98:ed:ee/00:00:e8:00:00/e0 tag 18 dma 4096 in
2020-01-21T19:20:12-07:00 nas1 kernel:         res 01/04:00:a7:97:1a/00:00:e9:00:00/40 Emask 0x12 (ATA bus error)
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata9: end_device-7:2: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: status: { ERR }
2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: error: { ABRT }
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata10: end_device-7:3: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: ata7: hard resetting link
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata11: end_device-7:4: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata12: end_device-7:5: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: sas: ata13: end_device-7:6: dev error handler
2020-01-21T19:20:12-07:00 nas1 kernel: sas: sas_ata_task_done: SAS error 8a
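
A wall of these messages is hard to read by eye, so here is a rough Python sketch that tallies failed commands, error descriptions, and link resets per ATA port. The patterns are based only on the excerpt above and are an assumption about the general format; the function name `summarize_ata_errors` is hypothetical. It does not map `ataN` ports to `/dev/sdX` names, which this excerpt does not show.

```python
import re
from collections import Counter

# Patterns taken from the excerpt in this post (assumption: libata
# messages in other kernels follow the same shape).
FAILED_RE = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")
EMASK_RE = re.compile(r"Emask (0x[0-9a-fA-F]+) \(([^)]+)\)")
RESET_RE = re.compile(r"(ata\d+): hard resetting link")


def summarize_ata_errors(lines):
    """Count failed commands, described error masks, and hard resets."""
    failed = Counter()   # (device, command) -> count
    emasks = Counter()   # error description -> count
    resets = Counter()   # port -> count
    for line in lines:
        line = line.rstrip()
        m = FAILED_RE.search(line)
        if m:
            failed[(m.group(1), m.group(2))] += 1
        m = EMASK_RE.search(line)
        if m:
            emasks[m.group(2)] += 1
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] += 1
    return failed, emasks, resets


# A few lines copied from the excerpt above as sample input.
sample = [
    "2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
    "2020-01-21T19:20:12-07:00 nas1 kernel: ata7.00: failed command: READ DMA EXT",
    "2020-01-21T19:20:12-07:00 nas1 kernel:         res 01/04:00:a7:97:1a/00:00:e9:00:00/40 Emask 0x12 (ATA bus error)",
    "2020-01-21T19:20:12-07:00 nas1 kernel: ata7: hard resetting link",
]

failed, emasks, resets = summarize_ata_errors(sample)
for (dev, cmd), n in failed.items():
    print(f"{dev}: {n} x failed {cmd}")
for desc, n in emasks.items():
    print(f"{n} x {desc}")
for port, n in resets.items():
    print(f"{port}: {n} hard reset(s)")
```

If the same error description shows up across several ports on one controller, that points more toward the controller or cabling than toward a single disk, which matches what I suspected here.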

 

smartctl thinks the disks are okay.

 

Now I have disk 6 (sdo) back, but Unraid says it is a new device and has disabled it, and disk 7 is not found. That makes two failed devices, so I am toast: the array will not start and will not rebuild from parity. Disk 6 is in perfect condition with all data still intact, but I don't know how to get Unraid to trust it so I can replace the failed disk 7 (sdh) with a new one and rebuild disk 7 from parity. My next step is to take out /dev/sdh and try looking at the filesystem on another host. I would like to get the array back to a working state first and then replace the controllers later (I first need to buy some).

 

Are there any suggestions on how I might escape this quandary? Thanks in advance.
 

nas1-diagnostics-20200121-2035.zip

Edited by argonaut
changed to resolved

I think it was a controller failure. I took out the two suspect HDDs and tested them on a separate host with fsck and smartctl; they had no errors. A friend down the street had a controller I could use, so I replaced my AOC-SASLP-MV8 controllers with an LSI 9305-24i. After ensuring the BIOS and Unraid saw all the devices, I followed the 'Trust My Array' procedure in the Wiki to trust all devices. I then started the array, which warned me the parity drive would be overwritten. Unraid forced a parity rebuild, which is currently in progress. Data is being written to the parity disk at 152.9 MB/sec and I'm a few percent complete. So things are looking good now.

 

The AOC-SASLP-MV8 controllers use Marvell's 88SE6480 Serial ATA Host Controller. All the negative comments about Marvell controllers seem warranted. Save yourself the headache and get rid of them before they bite you.

Edited by argonaut
typo
