Red X On 1 Of 2 Parity Drives

Compass · August 19, 2021

Hi Folks,

My server is fairly basic and I never really touch or change anything, it just chugs along.

Today I was copying some data to my server (from a Win10 PC via TeraCopy) when I noticed that the transfer had stalled. When I logged into Tower/Main, there was a red X next to Parity 1. So I stopped the copy, shut down the server (sorry, I forgot to grab the syslog before shutting down), replaced the SATA cable, reseated the power cable to the HDD, then powered back up. Same result; see the attached syslog. SMART seems to be OK.

WD Blue 6TB: WDC WD60EZRZ-00GZ5B1 WD-WXB1HB4LU1ZN is the one to look for.

I'm thinking I should just replace it, or both parity drives.

The server is still running now. Any advice before I proceed, please?

tower-diagnostics-20210819-1210.zip

Compass · August 19, 2021

Not even an hour later and my 2nd parity has gone too, first with a red X. I took the array offline and now it's saying 'no device' for the 2nd parity. Eek.

Updated syslog attached.

tower-diagnostics-20210819-1313.zip

Edited August 19, 2021 by Compass

trurl · August 19, 2021

Why are you using such a very old version of Unraid? Diagnostics for those old versions make us do a lot more work trying to piece together all the information.

SMART attributes for both parity disks look OK, but no SMART tests have been done on them.

Looks like connection or maybe controller problems. Are both of those using the MARVELL controller?
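When reading the attached SMART reports, a few attributes carry most of the signal: 5, 197 and 198 relate to the disk surface, while 199 (UDMA_CRC_Error_Count) counts transfer errors between the disk and the controller, which usually points at cabling or the connection rather than the drive itself. Below is a minimal Python sketch that pulls those raw values out of `smartctl -A` text, assuming the standard ten-column attribute table; the function and table names are hypothetical, chosen for illustration.

```python
import re

# Attributes worth a closer look when a disk is disabled (red X).
# 5/197/198 relate to the disk surface; 199 usually points at the cable/link.
WATCH = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Return {attribute_id: raw_value} parsed from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # RAW_VALUE may carry extra text, e.g. "37 (Min/Max 20/45)"; keep the leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            attrs[int(fields[0])] = int(match.group())
    return attrs


def flag_attributes(attrs):
    """List the watched attributes whose raw value is non-zero, in ID order."""
    return [(WATCH[i], attrs[i]) for i in sorted(WATCH) if attrs.get(i, 0) > 0]
```

A non-zero 199 alongside clean 5/197/198 fits the "connection or controller" reading rather than a failing disk.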

Compass · August 19, 2021

I guess I'm using that version because normally I have no issues, so if it's not broken, I don't try to fix it.

Both drives are coming off the motherboard.

trurl · August 19, 2021

Just now, Compass said:

Both drives are coming off the motherboard.

But which controller is on the motherboard?

00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
01:00.0 RAID bus controller [0104]: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)

Check all disk connections, both ends, power and SATA, including splitters.
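One way to check which controller each disk is on, assuming a Linux system such as Unraid with sysfs mounted, is to resolve the /sys/block/sdX symlink. The resolved path runs through the PCI address of the controller the disk hangs off, and that address can be matched against the lspci lines above. Below is a minimal Python sketch under that assumption. The CONTROLLERS table is filled in from those two lspci lines, and the helper names are hypothetical.

```python
import os
import re

# PCI addresses from the lspci output in this thread; sysfs prefixes the domain 0000.
CONTROLLERS = {
    "0000:00:1f.2": "Intel ICH9R (motherboard)",
    "0000:01:00.0": "Marvell",
}

PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
ATA_PORT = re.compile(r"/(ata\d+)/")


def controller_for(sysfs_path):
    """Return (pci_address, ata_port) for a resolved /sys/block/<dev> path.

    The last PCI address in the path is the controller the disk is attached to;
    earlier ones are bridges. ata_port is None when the path has no ataN element.
    """
    pci = PCI_ADDR.findall(sysfs_path)
    ata = ATA_PORT.search(sysfs_path)
    return (pci[-1] if pci else None, ata.group(1) if ata else None)


def list_disks(sys_block="/sys/block"):
    """Print each sdX disk with its ATA port and controller (run on the server itself)."""
    for dev in sorted(os.listdir(sys_block)):
        if not dev.startswith("sd"):
            continue
        path = os.path.realpath(os.path.join(sys_block, dev))
        pci, ata = controller_for(path)
        print(dev, ata or "-", pci, CONTROLLERS.get(pci, "unknown"))
```

Calling list_disks() on the server and comparing the sdX names with those shown on the Main page would tell you which disks sit on the Marvell card.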

Compass · August 19, 2021

I would assume they are both using the Marvell controller, which they have done for years. What's the best way to check? I don't remember 'flashing' the MB, but I vaguely remember 'flashing' the SASLP.

checking cables now.

is it safe to shutdown?

Edited August 19, 2021 by Compass

trurl · August 19, 2021

14 minutes ago, Compass said:

is it safe to shutdown?

yes

Compass · August 19, 2021

'Intel ICH9R SATA 3.0Gbps Controller' is the only controller listed for the MB, as seen here: https://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA-HF-D525.cfm

All of the cables look fine, we haven't had any power outages or blackouts lately, and the server hasn't been moved or fiddled with; it's in a rack cabinet. The only thing that has been touched is the power button to turn it on (it turns off automatically).

Can I update the version of unRaid to make the diagnostics better/easier? If so, is going from 6.2.2 to 6.10 a problem? Steps to follow?

trurl · August 19, 2021

You should go ahead with checking connections then do the rebuilds on your current version. Do you have enough ports to avoid the Marvell?

We can discuss upgrading later.

Compass · August 19, 2021

I will order 2 new HDDs. It's Friday here now, so I won't get them till next week, plus a couple of days to preclear.

All of the SATA ports are full. This MB was highly recommended for unRaid when I put this server together nearly 8 years ago, so it's perplexing that it would suddenly be an issue.

Whilst I wait for the new HDDs to arrive, do you see an issue with using the parity swap procedure (https://wiki.unraid.net/The_parity_swap_procedure), but just re-assigning the old drives to their current positions and starting the array to see if they rebuild?

trurl · August 20, 2021

2 hours ago, Compass said:

order 2 new HDD

22 hours ago, trurl said:

SMART attributes for both parity looks OK

I don't think there is anything wrong with the disks, and no reason to not try rebuilding to the same disks, since these are both parity and parity has none of your data.

2 hours ago, Compass said:

using the parity swap procedure

You don't need the parity swap procedure to rebuild to the same disks or to rebuild to new disks. These are simple parity replacements either way.

From that wiki you linked on parity swap:

Why would you want to do this? To replace a data drive with a larger one, that is even larger than the Parity drive.

Not at all the situation you are in, even if you want to replace with larger parity disks.

To rebuild to the same disks:

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

The procedure is the same whether data or parity, and you can do 2 at the same time since you have dual parity.

Compass · August 20, 2021

OK, I'm running an extended SMART test anyway to see what it says, then I'll follow the procedure you linked. Thanks, I'll let you know the outcome.

Compass · August 20, 2021

Extended SMART results

WDC_WD60EZRZ-00GZ5B1_WD-WXB1HB4LU1ZN-20210820-2202.txt WDC_WD60EZRZ-00GZ5B1_WD-WXN1H849LEZS-20210821-0022.txt

trurl · August 20, 2021

Those look fine

Compass · August 21, 2021

Parity 2 Red X'd again. I have stopped the rebuild; diagnostics attached.

tower-diagnostics-20210821-1434.zip

Compass · August 21, 2021

Have started again with just Parity 1 disc this time

Compass · August 21, 2021

Have just noticed Disc 2 is 'unmountable'. Should I just let it run?

JorgeB · August 21, 2021

There are ATA errors in multiple disks, these are usually a power/connection problem.

Compass · August 21, 2021

2 hours ago, JorgeB said:

There are ATA errors in multiple disks, these are usually a power/connection problem.

Thanks, I'll get a new PSU. It's a few years old now, though it runs off a UPS, so the supply from the wall is stable. Anyway, the rebuild with Parity 1 has failed too; diagnostics attached. Before I shut it down, is there anything else I should try or be aware of?

tower-diagnostics-20210821-2025.zip

JorgeB · August 21, 2021

19 minutes ago, Compass said:

before I shut it down, anything else I should try or be aware of?

Not really. I just wouldn't try anything else before using a different PSU, or at least replacing/checking all the cables and connections. After that, check the syslog for errors like these; if they continue to appear, there's still a problem.

Aug 21 12:03:58 Tower kernel: ata3.00: status: { DRDY }
Aug 21 12:03:58 Tower kernel: ata3: hard resetting link
Aug 21 12:04:04 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:08 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:08 Tower kernel: ata3: hard resetting link
Aug 21 12:04:14 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:18 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:18 Tower kernel: ata3: hard resetting link
Aug 21 12:04:23 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:23 Tower kernel: ata3.00: configured for UDMA/133
Aug 21 12:04:23 Tower kernel: ata3: EH complete
Aug 21 12:04:46 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:04:46 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Aug 21 12:04:46 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
Aug 21 12:04:46 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 21 12:04:46 Tower kernel: ata4.00: status: { DRDY }
Aug 21 12:04:46 Tower kernel: ata4: hard resetting link
Aug 21 12:04:47 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:47 Tower kernel: ata4.00: configured for UDMA/133
Aug 21 12:04:47 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x4
Aug 21 12:04:47 Tower kernel: ata4: EH complete
Aug 21 12:05:54 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:05:54 Tower kernel: ata4.00: failed command: READ DMA EXT
Aug 21 12:05:54 Tower kernel: ata4.00: cmd 25/00:40:30:86:79/00:05:1e:00:00/e0 tag 29 dma 688128 in
Aug 21 12:05:54 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
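JorgeB suggests checking the syslog for errors like these. Below is a minimal Python sketch of that check. It counts error-type kernel messages per ATA port and decodes the SStatus value on "SATA link up" lines. It assumes the syslog is plain text in the format shown above, and the /var/log/syslog path is an assumption about where Unraid keeps it. The list of message patterns is an illustrative choice, not an exhaustive one.

```python
import re
from collections import Counter

# Message fragments treated here as signs of a link/power problem (illustrative, not exhaustive).
ERROR_PATTERNS = ("hard resetting link", "COMRESET failed", "exception Emask", "failed command")

PORT = re.compile(r"kernel: (ata\d+)(?:\.\d+)?:")
LINK_UP = re.compile(
    r"kernel: (ata\d+): SATA link up (\S+ Gbps) \(SStatus ([0-9A-Fa-f]+) SControl ([0-9A-Fa-f]+)\)"
)


def count_ata_errors(lines):
    """Count error-like kernel messages per ATA port; ata3 and ata3.00 are merged."""
    counts = Counter()
    for line in lines:
        match = PORT.search(line)
        if match and any(p in line for p in ERROR_PATTERNS):
            counts[match.group(1)] += 1
    return counts


def decode_sstatus(value):
    """Split a hex SStatus value (as printed by the kernel) into its SATA fields."""
    v = int(value, 16)
    return {
        "DET": v & 0xF,         # 3 = device present and PHY communication established
        "SPD": (v >> 4) & 0xF,  # 1 = 1.5 Gbps, 2 = 3.0 Gbps, 3 = 6.0 Gbps
        "IPM": (v >> 8) & 0xF,  # 1 = interface active
    }


def link_events(lines):
    """Return (port, speed, decoded SStatus) for every 'SATA link up' message."""
    events = []
    for line in lines:
        match = LINK_UP.search(line)
        if match:
            events.append((match.group(1), match.group(2), decode_sstatus(match.group(3))))
    return events


def scan_syslog(path="/var/log/syslog"):  # ASSUMPTION: syslog location on the server
    with open(path, errors="replace") as f:
        lines = f.readlines()
    return count_ata_errors(lines), link_events(lines)
```

In the excerpt above, SStatus 123 decodes to DET=3, SPD=2 (3.0 Gbps) and IPM=1. Each link therefore does come back after a reset, and the repeated resets and timeouts are the symptom to watch for. Note that ataN is a kernel port name, not the Unraid disk number.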

Compass · August 21, 2021

Righto, thanks. I will shut it down and check everything whilst I find a new PSU. Any suggestions?

Compass · August 22, 2021

I had a new PSU in my cupboard from another PC build that never went ahead, winning! So I fitted it and replaced all the power and SATA cables with new ones. Still the same outcome:

Parity 1 Red X

Parity 2 Orange triangle

Disc 2 unmountable (still green balled, not emulated)

Diagnostics attached. I have also pasted the result of xfs_repair -nv on Disk 2 below, as I wasn't sure whether it would be included in the diagnostics.

Please advise next step, I'm at a loss.

Phase 1 - find and verify superblock...
        - block cache size set to 284120 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 174487 tail block 174483
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sun Aug 22 13:36:37 2021

Phase           Start           End             Duration
Phase 1:        08/22 13:36:36  08/22 13:36:36
Phase 2:        08/22 13:36:36  08/22 13:36:37  1 second
Phase 3:        08/22 13:36:37  08/22 13:36:37
Phase 4:        08/22 13:36:37  08/22 13:36:37
Phase 5:        Skipped
Phase 6:        08/22 13:36:37  08/22 13:36:37
Phase 7:        08/22 13:36:37  08/22 13:36:37

Total run time: 1 second

tower-diagnostics-20210822-1344.zip

EDIT: I have also now tried another brand-new motherboard of the same model (Supermicro X7SPA-HF-D525-O), with the same result.

Edited August 22, 2021 by Compass

JorgeB · August 22, 2021

4 hours ago, Compass said:

Still the same outcome,

Parity 1 Red X

Parity 2 Orange triangle

Disc 2 unmountable(still green balled, not emulated)

This will never be fixed just by changing hardware. If there are no more ATA errors for now, run xfs_repair on disk2 without -n (with -n nothing will be changed), then try to re-sync parity, and keep monitoring the log for ATA errors.
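The -n flag is what makes xfs_repair read-only, which is why the earlier -nv run changed nothing. Per the xfs_repair man page, a run with -n exits with status 1 when it detects corruption and 0 when it finds none. Below is a minimal Python sketch of the two invocations. It assumes disk2 is exposed as /dev/md2 with the array started in maintenance mode. Unraid's usual guidance is to target the md device rather than the raw sdX partition so that parity is updated along with the repair, but the md device name differs between releases, so treat it as an assumption and confirm it first. The helper names are hypothetical.

```python
import subprocess

# ASSUMPTION: disk2 is /dev/md2 with the array started in maintenance mode.
# The md device name differs between Unraid releases; confirm it before running.
DISK2 = "/dev/md2"


def xfs_repair_cmd(device, check_only=True, verbose=True):
    """Build an xfs_repair command line.

    -n is "no modify": problems are reported but nothing is written, so the
    actual repair is the same command without -n.
    """
    cmd = ["xfs_repair"]
    if check_only:
        cmd.append("-n")
    if verbose:
        cmd.append("-v")
    cmd.append(device)
    return cmd


def run_xfs_repair(device=DISK2, check_only=True):
    """Run xfs_repair and return (exit status, combined output)."""
    result = subprocess.run(xfs_repair_cmd(device, check_only), capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr
```

Running run_xfs_repair(check_only=True) first and only then run_xfs_repair(check_only=False) mirrors the order described above: check, repair, then re-sync parity.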

Compass · August 22, 2021

Should I use PuTTY or just the webGUI for xfs_repair?

ChatNoir · August 22, 2021

The WebGUI should be simpler.
