Lost two parity drives, then array drive?


Solved by JorgeB

My server has been working great for the past 7-8 months with no problems at all, mainly as a media server and Time Machine target. One morning the server was not accessible and I had to do a hard reboot with the power button. The array reported a bad Parity2 drive, which I removed from the array, and I then ran a parity rebuild that completed. I reformatted the failed parity drive (XFS) and its SMART attributes seemed OK. It's my newest drive, an 8 TB WD drive about a year old, maybe a bit more.

 

I rebooted at least once and now have a failed array drive which shows that the contents are emulated (the data seems to be there). Thinking the reformatted 8 TB drive was good, I substituted it for the failed disk 3 and rebooted, which initiated a rebuild that never finished. Now the syslog is full of disk0 read errors; the tail is below.

 

I have some new drives coming, but it's unclear whether the issue is actually drive failure. I tried the extended SMART test on the former P2 drive, but it stopped before completion. The short SMART test reports "no such device", and the drive attributes "could not be read".

 

Today I tried to get a diagnostics file, but the script starts and never completes. I've attached a diagnostics file from the time of the parity drive failure, though, and can try to get a current one if anyone has suggestions on how to do that.

 

I'll format the new drives as soon as I can, but it will be days before they're ready. At this point it's not clear how the array drive can be emulated if it can't read from the parity drive (!?).

 

Thanks for any help

Dennis

tail /var/log/syslog.1

Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325808
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325816
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325824
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325832
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325840
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325848
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325856
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325864
Nov 26 13:06:51 Tower kernel: md: disk0 read error, sector=56325872

 

 

 

Here's the hardware profile: HW profile.xml.zip

tower-smart-20221125-2237.zip tower-diagnostics-20221125-1150.zip

Edited by dtempleton
Replaced lengthy HW profile with zip file
Link to comment
  • Solution

In the earlier diags there are issues with multiple devices before parity drops offline:

 

Nov 25 07:36:55 Tower kernel: ata14.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 25 07:36:55 Tower kernel: ata14.00: irq_stat 0x48000001, interface fatal error
Nov 25 07:36:55 Tower kernel: ata14.00: failed command: READ DMA EXT
Nov 25 07:36:55 Tower kernel: ata14.00: cmd 25/00:40:48:0d:4a/00:05:f5:02:00/e0 tag 3 dma 688128 in
Nov 25 07:36:55 Tower kernel:         res 53/84:30:57:10:4a/00:02:f5:02:00/40 Emask 0x10 (ATA bus error)
Nov 25 07:36:55 Tower kernel: ata14.00: status: { DRDY SENSE ERR }
Nov 25 07:36:55 Tower kernel: ata14.00: error: { ICRC ABRT }
Nov 25 07:36:55 Tower kernel: ata14: hard resetting link
Nov 25 07:36:55 Tower kernel: ata11.00: exception Emask 0x10 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 25 07:36:55 Tower kernel: ata11.00: irq_stat 0x48000001, interface fatal error
Nov 25 07:36:55 Tower kernel: ata11.00: failed command: WRITE DMA EXT
Nov 25 07:36:55 Tower kernel: ata11.00: cmd 35/00:40:48:f8:49/00:05:f5:02:00/e0 tag 7 dma 688128 out
Nov 25 07:36:55 Tower kernel:         res 51/84:40:48:f8:49/00:05:f5:02:00/e0 Emask 0x10 (ATA bus error)
Nov 25 07:36:55 Tower kernel: ata11.00: status: { DRDY ERR }
Nov 25 07:36:55 Tower kernel: ata11.00: error: { ICRC ABRT }
Nov 25 07:36:55 Tower kernel: ata11: hard resetting link
Nov 25 07:36:55 Tower kernel: ata14: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 25 07:36:55 Tower kernel: ata14.00: configured for UDMA/133
Nov 25 07:36:55 Tower kernel: ata14: EH complete
Nov 25 07:36:55 Tower kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 25 07:36:55 Tower kernel: ata11.00: configured for UDMA/133
Nov 25 07:36:55 Tower kernel: ata11: EH complete
Nov 25 10:11:49 Tower kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Nov 25 10:11:49 Tower kernel: ata14.00: irq_stat 0x40000001
Nov 25 10:11:49 Tower kernel: ata14.00: failed command: READ DMA EXT
Nov 25 10:11:49 Tower kernel: ata14.00: cmd 25/00:40:a8:16:ed/00:05:6d:03:00/e0 tag 24 dma 688128 in
Nov 25 10:11:49 Tower kernel:         res 53/84:c0:27:17:ed/00:04:6d:03:00/40 Emask 0x10 (ATA bus error)
Nov 25 10:11:49 Tower kernel: ata14.00: status: { DRDY SENSE ERR }
Nov 25 10:11:49 Tower kernel: ata14.00: error: { ICRC ABRT }
Nov 25 10:11:49 Tower kernel: ata14: hard resetting link
Nov 25 10:11:54 Tower kernel: ata14: link is slow to respond, please be patient (ready=0)
Nov 25 10:11:59 Tower kernel: ata14: COMRESET failed (errno=-16)
Nov 25 10:11:59 Tower kernel: ata14: hard resetting link
Nov 25 10:12:02 Tower kernel: ata14: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Nov 25 10:12:02 Tower kernel: ata14.00: configured for UDMA/133
Nov 25 10:12:02 Tower kernel: ata14: EH complete
Nov 25 10:12:03 Tower kernel: ata11.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Nov 25 10:12:03 Tower kernel: ata11.00: irq_stat 0x80400000, PHY RDY changed
Nov 25 10:12:03 Tower kernel: ata11: SError: { PHYRdyChg }
Nov 25 10:12:03 Tower kernel: ata11.00: failed command: WRITE DMA EXT
Nov 25 10:12:03 Tower kernel: ata11.00: cmd 35/00:40:28:71:ee/00:05:6d:03:00/e0 tag 3 dma 688128 out
Nov 25 10:12:03 Tower kernel:         res 50/00:00:28:71:ee/00:00:6d:03:00/e0 Emask 0x10 (ATA bus error)
Nov 25 10:12:03 Tower kernel: ata11.00: status: { DRDY }
Nov 25 10:12:03 Tower kernel: ata11: hard resetting link
Nov 25 10:12:03 Tower kernel: ata11: SATA link down (SStatus 0 SControl 300)
Nov 25 10:12:09 Tower kernel: ata11: hard resetting link
Nov 25 10:12:09 Tower kernel: ata11: SATA link down (SStatus 0 SControl 300)
Nov 25 10:12:14 Tower kernel: ata11: hard resetting link
Nov 25 10:12:14 Tower kernel: ata19: SATA link down (SStatus 0 SControl 300)
Nov 25 10:12:14 Tower kernel: ata20: SATA link down (SStatus 0 SControl 300)
Nov 25 10:12:15 Tower kernel: ata11: SATA link down (SStatus 0 SControl 300)
Nov 25 10:12:15 Tower kernel: ata11.00: disable device

 

This is usually a power/connection problem, but it could also be a controller issue. Save the current syslog:

cp /var/log/syslog /boot/syslog.txt

then reboot and post new diags after array start.
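To see at a glance which ports are involved, something like the following rough Python sketch can tally the libata trouble lines per `ataN` port. It is only an illustration, not an Unraid tool: the function name `tally_ata_events`, the regex, and the list of event strings are assumptions based on the log format shown above, and the sample lines are copied from that excerpt.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... kernel: ata14.00: failed command: READ DMA EXT"
#   "... kernel: ata14: hard resetting link"
# The format is assumed from the log excerpt above.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Substrings treated as trouble events (an illustrative choice, not exhaustive).
EVENTS = ("failed command", "hard resetting link", "SATA link down", "disable device")


def tally_ata_events(lines):
    """Return {port: Counter({event: count})} for the events listed above."""
    per_port = {}
    for line in lines:
        match = ATA_LINE.search(line.rstrip("\n"))
        if not match:
            continue  # not an ataN line (e.g. md errors or "res ..." continuation lines)
        port, message = match.groups()
        for event in EVENTS:
            if event in message:
                per_port.setdefault(port, Counter())[event] += 1
    return per_port


# Sample lines taken from the excerpt above.
sample = [
    "Nov 25 07:36:55 Tower kernel: ata14.00: failed command: READ DMA EXT",
    "Nov 25 07:36:55 Tower kernel: ata14: hard resetting link",
    "Nov 25 10:12:15 Tower kernel: ata11.00: disable device",
]

for port, counts in sorted(tally_ata_events(sample).items()):
    print(port, dict(counts))
```

To run it against the saved copy, pass an open file instead of the sample list, for example `tally_ata_events(open("/boot/syslog.txt", errors="replace"))`. If several ports on the same card show errors at the same timestamps, that points more toward the controller, power, or cabling than toward the individual drives.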

Link to comment

Thanks; after the reboot it looks the same.

 

Here is the new syslog: syslog.zip

and the diagnostics file: tower-diagnostics-20221127-1441.zip

 

I realize that all of the drives showing errors are connected to one controller, this one:

https://www.amazon.com/gp/product/B07SZDK6CZ/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1

Ziyituod PCIe SATA Card, 4 Port with 4 SATA Cable, SATA Controller Expansion Card with Low Profile Bracket, Marvell 9215 Non-Raid, Boot as System Disk

It's a Marvell 9215 device, about 7 months old; I thought that was a usable one.

 

I'll go look at the list of usable controllers.

 

 

Link to comment
19 minutes ago, dirkinthedark said:

Yikes let me know, I just ordered the same card.

It looks like that Marvell 9215 controller is not on the approved list now, but the forum is full of questions.  I just ordered a different one:

 

https://www.amazon.com/gp/product/B08BHZQVP7/ref=ppx_yo_dt_b_asin_title_o00_s00?ie=UTF8&psc=1

PCIe SATA Card, Electop SATA III 6 Gbps Expansion Controller, JMB585/SATA 3.0 Non-Raid ,Support 5 Ports with 5 SATA Cables, Standard & Low Profile Bracket for Desktop PC

 

That one has the JMicron controller that is recommended here:

https://forums.unraid.net/topic/102010-recommended-controllers-for-unraid/

 

I won't know for a few days if it fixes things.

 

ChatNoir: Thanks, our messages crossed.

 

Edited by dtempleton
acknowledge ChatNoir contribution
Link to comment
1 hour ago, dirkinthedark said:

I don't understand; the controller you listed is the ASMedia ASM1062, which should be good.

You are right, I said that based on the description from dtempleton that explicitly mentioned Marvell.

 

Surprise, surprise, the Amazon seller kept the same link but changed the description!!!

https://web.archive.org/web/20200831194618/https://www.amazon.com/Ziyituod-Controller-Expansion-Profile-Non-Raid/dp/B07SZDK6CZ

 

Anyhow, the ASM1062 is a two-port controller. If the card offers more ports, there is something sketchy: either a port multiplier or a different chip.

And since the seller has shown he is clearly trustworthy ...

Link to comment
Ok perfect, so I will look for another one myself.  Tricky, tricky hehe.  

 

Link to comment
  • 2 weeks later...

My server is working now; thanks for your input, JorgeB and others. I bought this 5-port card:

https://www.amazon.com/dp/B08BHZQVP7?psc=1&ref=ppx_yo2ov_dt_b_product_details

and all seems to be OK, except that several of my drives report UDMA CRC errors (which seem to be permanent but unimportant).
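On the CRC count: SMART attribute 199 (UDMA_CRC_Error_Count) is generally a lifetime counter that does not reset, so the useful check is whether it keeps climbing after the controller swap. Below is a rough Python sketch of that comparison. It assumes the usual `smartctl -A` table layout (attribute ID in the first column, raw value in the tenth); the function names and the sample text are made up for illustration and are not readings from this server.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with the numeric ID;
        # 199 is UDMA_CRC_Error_Count, and the raw value is the last column.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def crc_is_growing(earlier, later):
    """True if the counter increased between two readings."""
    return later > earlier


# Illustrative text in the shape of `smartctl -A` output (values are invented).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       8760
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""

print(crc_error_count(SAMPLE))
```

Taking one reading now and another after a week or two of normal use, then passing both to `crc_is_growing`, shows whether the errors are old history from the previous card or still occurring on the new one.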

 

Regarding the previously purchased card from Ziyituod, it's a mess. It was originally listed on Amazon as an ASMedia controller; then a second version, clearly a Marvell controller, was listed under the same part number. When I pulled the controller, the board actually showed neither of the Ziyituod model numbers and no identifiers at all. In future, if I get a card that doesn't look like the one advertised, I'll send it right back.

 

Thanks for helping put this back together.

 

Dennis
