One drive failed, then one parity drive dropped out, then another dropped out


jpimlott

The system was up and running fine. One drive (disk 2) failed with its drive light stuck on. Within a minute or so parity 1 dropped out. I removed the bad drive, reset the parity, and started rebuilding the array. It got to about 10% and then disk 3 dropped out.

I think it is still good, but now I can't rebuild because I have 3 disks missing. I think disk 3 is good; I would like to mark it as good and start again. The array is in that state right now and I have done nothing to it.

tower-diagnostics-20210910-0044.zip

There are issues with multiple disks on multiple controllers:

 

Sep  9 23:07:49 Tower kernel: ata4: hard resetting link
Sep  9 23:07:55 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Sep  9 23:07:59 Tower kernel: ata4: COMRESET failed (errno=-16)
Sep  9 23:07:59 Tower kernel: ata4: hard resetting link
Sep  9 23:08:01 Tower kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Sep  9 23:08:01 Tower kernel: ata4.00: configured for UDMA/33


Sep  9 23:08:42 Tower kernel: ata8: failed to read log page 10h (errno=-5)
Sep  9 23:08:42 Tower kernel: ata8.00: exception Emask 0x1 SAct 0x1000 SErr 0x0 action 0x6
Sep  9 23:08:42 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED
Sep  9 23:08:42 Tower kernel: ata8.00: cmd 60/00:00:c8:6a:14/01:00:96:00:00/40 tag 12 ncq dma 131072 in
Sep  9 23:08:42 Tower kernel:         res 01/04:60:c8:6a:14/00:00:96:00:00/40 Emask 0x3 (HSM violation)
Sep  9 23:08:42 Tower kernel: ata8.00: status: { ERR }
Sep  9 23:08:42 Tower kernel: ata8.00: error: { ABRT }


Sep  9 23:10:34 Tower kernel: ata13.00: exception Emask 0x1 SAct 0x1000 SErr 0x0 action 0x6
Sep  9 23:10:34 Tower kernel: ata13.00: failed command: WRITE FPDMA QUEUED
Sep  9 23:10:34 Tower kernel: ata13.00: cmd 61/00:00:f8:39:97/01:00:96:00:00/40 tag 12 ncq dma 131072 out
Sep  9 23:10:34 Tower kernel:         res 01/04:58:f8:38:97/00:00:96:00:00/40 Emask 0x3 (HSM violation)
Sep  9 23:10:34 Tower kernel: ata13.00: status: { ERR }
Sep  9 23:10:34 Tower kernel: ata13.00: error: { ABRT }
Sep  9 23:10:34 Tower kernel: ata13: hard resetting link


Sep  9 23:11:03 Tower kernel: sd 6:0:3:0: [sdi] tag#803 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00 cmd_age=0s
Sep  9 23:11:03 Tower kernel: sd 6:0:3:0: [sdi] tag#803 CDB: opcode=0x88 88 00 00 00 00 01 93 77 f4 70 00 00 00 a0 00 00
Sep  9 23:11:03 Tower kernel: blk_update_request: I/O error, dev sdi, sector 6769079408 op 0x0:(READ) flags 0x0 phys_seg 20 prio class 0

 

Two of the controllers are Marvell based and have known issues, but ata4 is the onboard SATA, so there might also be a power or connection issue. I would recommend replacing the SASLP/SAS2LP controllers with LSI ones anyway, then checking all connections and/or testing with a different PSU.
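If it helps to see how the errors spread across ports, here is a minimal Python sketch that counts trouble messages per `ataN` port in a syslog. The function name `tally_ata_trouble` and the `TROUBLE` keyword list are my own assumptions for illustration, not part of any Unraid tooling; the sample lines are copied from the excerpts above.

```python
import re
from collections import Counter

# Matches the "ataN:" or "ataN.MM:" prefix that libata puts on kernel messages.
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings treated as signs of trouble (an assumed, incomplete list).
TROUBLE = ("failed", "exception", "hard resetting link", "error:")

def tally_ata_trouble(lines):
    """Count trouble messages per ATA port (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = ATA_RE.search(line)
        if match and any(word in match.group(2) for word in TROUBLE):
            counts[match.group(1)] += 1
    return counts

sample = [
    "Sep  9 23:07:49 Tower kernel: ata4: hard resetting link",
    "Sep  9 23:07:59 Tower kernel: ata4: COMRESET failed (errno=-16)",
    "Sep  9 23:08:01 Tower kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)",
    "Sep  9 23:08:42 Tower kernel: ata8.00: failed command: READ FPDMA QUEUED",
    "Sep  9 23:08:42 Tower kernel: ata8.00: error: { ABRT }",
    "Sep  9 23:10:34 Tower kernel: ata13: hard resetting link",
]

for port, count in tally_ata_trouble(sample).most_common():
    print(port, count)
```

To run it against a full log, pass an open file instead of `sample`. A port that shows up repeatedly is worth checking for cabling or power problems first.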

 

When the drives stopped working, the drive light was stuck on.

I also tried reseating the drives, and removed them from the array, restarted, and re-added them.

After doing that, disk 2 had real issues writing. It would go slow, then stop, then go back to medium speed.

I later tried to rebuild just parity 1, and it was building fast at 140 MB/s, then disk 3 dropped off.

I am copying data off disk 2 now to a Linux machine, and so far so good.

According to those diagnostics, only missing disk2 is disabled, and emulated disk2 is mounted as are all other disks.

 

Parity1 is invalid because you were rebuilding it, and the disk3 problems, probably a connection issue, are interfering with the parity1 rebuild.

 

Parity2 will allow you to rebuild the emulated disk2 (and parity1) if you could get your connections fixed.
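To illustrate why a second parity is enough here, below is a rough Python sketch of rebuilding one data disk and the first parity from a second parity alone. It assumes a RAID-6 style scheme (P is a plain XOR, Q is a weighted sum over GF(2^8) with polynomial 0x11d and generator 2); the thread does not show how parity2 is actually computed, so treat the math and every function name as illustrative assumptions.

```python
from functools import reduce

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using the polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result

def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def gf_inv(a):
    """Inverse of a nonzero byte: a^254, since a^255 == 1 in GF(2^8)."""
    return gf_pow(a, 254)

def compute_p(disks):
    """P parity: byte-wise XOR of all data disks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*disks))

def compute_q(disks):
    """Q parity: XOR of 2^i * D_i over GF(2^8), i = 0-based disk index."""
    out = []
    for column in zip(*disks):
        q = 0
        for i, byte in enumerate(column):
            q ^= gf_mul(gf_pow(2, i), byte)
        out.append(q)
    return bytes(out)

def rebuild_data_from_q(disks, missing, q):
    """Recover disks[missing] from Q and the surviving data disks only."""
    inverse = gf_inv(gf_pow(2, missing))
    out = []
    for pos in range(len(q)):
        partial = q[pos]
        for i, disk in enumerate(disks):
            if i != missing:
                partial ^= gf_mul(gf_pow(2, i), disk[pos])
        out.append(gf_mul(inverse, partial))
    return bytes(out)

# Toy array: three data "disks" of equal length (contents are made up).
data = [b"disk-one", b"disk-two", b"disk-3!!"]
p, q = compute_p(data), compute_q(data)

# Lose the second data disk AND P, keep only Q and the other data disks.
survivors = [data[0], None, data[2]]
rebuilt = rebuild_data_from_q(survivors, 1, q)
rebuilt_p = compute_p([data[0], rebuilt, data[2]])
print(rebuilt == data[1], rebuilt_p == p)
```

The order matters: the missing data disk is recovered from Q first, and only then can P be recomputed from the complete set of data disks.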

 

Just to confirm, post a screenshot of Main - Array Devices

I redid all the power and SATA wiring, making sure that each 5in3 and 3in2 gets power from 2 different home cables.

I found one questionable port/cable and changed it to a port on one of the Marvell controllers.

Disk 3 is happy, and the array is rebuilding disk 2 and parity 1 at about 95 MB/s.

That is slower than normal, but I assume it is because one parity and one disk need building, and the disk's data needs to be generated before the parity can be calculated.
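For a rough idea of what that rate means for total rebuild time, here is a small Python sketch. The 4 TB disk size is purely a hypothetical figure (the actual size is not stated in this thread), and the rate is assumed to stay steady for the whole rebuild, which real disks usually do not manage.

```python
def rebuild_hours(disk_bytes, rate_mb_per_s):
    """Hours to process disk_bytes at a steady rate in MB/s (1 MB = 10**6 bytes)."""
    seconds = disk_bytes / (rate_mb_per_s * 10**6)
    return seconds / 3600

# Hypothetical 4 TB disk; substitute the real size of the largest disk.
size = 4 * 10**12

# 95 MB/s is the current rebuild rate, 140 MB/s the earlier parity-only rate.
for rate in (95, 140):
    print(f"{rate} MB/s -> {rebuild_hours(size, rate):.1f} h")
```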
