Red X On 1 Of 2 Parity Drives


Hi Folks,

 

My server is fairly basic and I never really touch or change anything, it just chugs along.

 

Today I was copying some data to my server (from a Win10 PC via TeraCopy) when I noticed that the transfer had stalled. When I logged into Tower/Main, there was a red X next to parity 1. So I stopped the copy, shut down the server (sorry, I forgot to grab the syslog before shutting down), replaced the SATA cable, reseated the power cable to the HDD, then powered back up. Same result; see the attached syslog. SMART seems to be OK.

 

WD Blue 6TB: WDC WD60EZRZ-00GZ5B1 WD-WXB1HB4LU1ZN is the one to look for.

 

Thinking I should just replace it, or both the parity drives.

 

The server is still running now, advice before proceeding please.

 

tower-diagnostics-20210819-1210.zip

Link to comment

Why are you using such a very old version of Unraid? Diagnostics from those old versions make us do a lot more work trying to piece together all the information.

 

SMART attributes for both parity disks look OK, but no SMART tests have been run on them.

 

Looks like connection or maybe controller problems. Are both of those using the MARVELL controller?

Link to comment
Just now, Compass said:

Both drives are coming off the motherboard.

But which controller is on the motherboard?

 

00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
01:00.0 RAID bus controller [0104]: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)

 

Check all disk connections, both ends, power and SATA, including splitters.

Link to comment
18 minutes ago, trurl said:

But which controller is on the motherboard?

 


I would assume they are both using the Marvell controller, which they have done for years. What's the best way to check? I don't remember 'flashing' the MB, but I vaguely remember 'flashing' the SASLP.
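On the "best way to check" question, one possible approach (a sketch, not something taken from this thread) is to resolve the disk's sysfs link and read off the last PCI address in the path, then compare it with the lspci output above, where 00:1f.2 is the Intel ICH9R and 01:00.0 is the Marvell. The function names and both sample paths are hypothetical; on the real server the path would come from resolving /sys/block/sdX.

```python
import os
import re

# A PCI address as it appears in sysfs, e.g. "0000:00:1f.2" (domain:bus:device.function).
PCI_ADDRESS = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")

# Hypothetical example paths, made up for illustration only.
SAMPLE_ONBOARD = "/sys/devices/pci0000:00/0000:00:1f.2/ata3/host2/target2:0:0/2:0:0:0/block/sdc"
SAMPLE_ADDON = "/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0/host6/target6:0:0/6:0:0:0/block/sdd"


def controller_pci_address(sysfs_path):
    """Return the last PCI address found in a resolved sysfs path, or None."""
    matches = [part for part in sysfs_path.split("/") if PCI_ADDRESS.match(part)]
    return matches[-1] if matches else None


def controller_for_disk(dev):
    """Resolve /sys/block/<dev> on a Linux host and return its controller address."""
    return controller_pci_address(os.path.realpath(f"/sys/block/{dev}"))


if __name__ == "__main__":
    # Uses the sample paths only, so this runs on any machine.
    print(controller_pci_address(SAMPLE_ONBOARD))
    print(controller_pci_address(SAMPLE_ADDON))
```

On the server itself, controller_for_disk("sdc") (with whatever device name the parity disk has) would do the lookup against the live sysfs tree.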

 

Checking cables now.

Is it safe to shut down?

Edited by Compass
Link to comment

'Intel ICH9R SATA 3.0Gbps Controller' is the only controller mentioned for the MB, as seen here: https://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA-HF-D525.cfm

 

All of the cables look fine. We haven't had any power outages or blackouts lately, and the server hasn't been moved or fiddled with; it's in a rack cabinet. The only thing that has been touched is the power button to turn it on; it turns off automatically.

 

Can I update the version of unRaid to make the diagnostics better/easier? If so, is going from 6.2.2 to 6.10 a problem? Steps to follow?

Link to comment

I will order 2 new HDDs. It's Friday here now, so I won't get them till next week, plus a couple of days to preclear.

 

All of the SATA ports are full. This MB was highly recommended for unRaid when I put this server together nearly 8 years ago, so it's perplexing that it would suddenly be an issue.

 

Whilst I wait for the new HDDs to arrive, do you see an issue with using the parity swap procedure ( https://wiki.unraid.net/The_parity_swap_procedure ), but just re-assigning the old drives in their current positions and starting the array to see if they rebuild?

Link to comment
2 hours ago, Compass said:

order 2 new HDD

 

22 hours ago, trurl said:

SMART attributes for both parity disks look OK

I don't think there is anything wrong with the disks, and no reason to not try rebuilding to the same disks, since these are both parity and parity has none of your data.

 

2 hours ago, Compass said:

using the parity swap procedure

You don't need the parity swap procedure to rebuild to the same disks or to rebuild to new disks. These are simple parity replacements either way.

 

From that wiki you linked on parity swap:

Quote

Why would you want to do this? To replace a data drive with a larger one, that is even larger than the Parity drive.

Not at all the situation you are in, even if you want to replace with larger parity disks.

 

To rebuild to the same disks:

https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself

 

The procedure is the same whether data or parity, and you can do 2 at the same time since you have dual parity.

Link to comment
11 minutes ago, trurl said:

 

I don't think there is anything wrong with the disks, and no reason to not try rebuilding to the same disks, since these are both parity and parity has none of your data.

 


OK, I am running an extended SMART test anyway to see what it says, then I will do what you have linked to. Thanks, I will let you know the outcome.

Link to comment
19 minutes ago, Compass said:

before I shut it down, anything else I should try or be aware of?

Not really, I just wouldn't try anything else before using a different PSU, or at least replacing/checking all the cables/connections. After that, check the syslog for errors like these; there's still a problem if they continue to appear.

 

Aug 21 12:03:58 Tower kernel: ata3.00: status: { DRDY }
Aug 21 12:03:58 Tower kernel: ata3: hard resetting link
Aug 21 12:04:04 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:08 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:08 Tower kernel: ata3: hard resetting link
Aug 21 12:04:14 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:18 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:18 Tower kernel: ata3: hard resetting link
Aug 21 12:04:23 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:23 Tower kernel: ata3.00: configured for UDMA/133
Aug 21 12:04:23 Tower kernel: ata3: EH complete
Aug 21 12:04:46 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:04:46 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Aug 21 12:04:46 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
Aug 21 12:04:46 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 21 12:04:46 Tower kernel: ata4.00: status: { DRDY }
Aug 21 12:04:46 Tower kernel: ata4: hard resetting link
Aug 21 12:04:47 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:47 Tower kernel: ata4.00: configured for UDMA/133
Aug 21 12:04:47 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x4
Aug 21 12:04:47 Tower kernel: ata4: EH complete
Aug 21 12:05:54 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:05:54 Tower kernel: ata4.00: failed command: READ DMA EXT
Aug 21 12:05:54 Tower kernel: ata4.00: cmd 25/00:40:30:86:79/00:05:1e:00:00/e0 tag 29 dma 688128 in
Aug 21 12:05:54 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
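To make "check the syslog for errors like these" a bit more concrete, here is a minimal sketch that counts suspicious lines per ATA port. The marker list is an assumption drawn only from the excerpt above, not a complete catalogue of libata messages, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Substrings that mark link trouble in the excerpt above (assumed, not exhaustive).
ERROR_MARKERS = (
    "hard resetting link",
    "COMRESET failed",
    "failed command",
    "link is slow to respond",
    "frozen",
)

# Matches "kernel: ata4.00: ..." or "kernel: ata3: ..." and captures the port name.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# A few lines copied from the log excerpt in this post.
SAMPLE = [
    "Aug 21 12:04:08 Tower kernel: ata3: COMRESET failed (errno=-16)",
    "Aug 21 12:04:08 Tower kernel: ata3: hard resetting link",
    "Aug 21 12:04:23 Tower kernel: ata3: EH complete",
    "Aug 21 12:04:46 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Aug 21 12:04:46 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT",
]


def count_ata_errors(lines):
    """Return a Counter mapping each ATA port (e.g. 'ata3') to its error-line count."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if match and any(marker in match.group(2) for marker in ERROR_MARKERS):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    for port, hits in sorted(count_ata_errors(SAMPLE).items()):
        print(port, hits)
```

Fed the lines of the real syslog instead of SAMPLE, a count that keeps growing after the cable and PSU changes would suggest the problem is still there.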

 

Link to comment
13 minutes ago, JorgeB said:

Not really, I just wouldn't try anything else before using a different PSU, or at least replacing/checking all the cables/connections. After that, check the syslog for errors like these; there's still a problem if they continue to appear.

 


 

Righto, thanks. I will shut it down and check everything whilst I find a new PSU. Any suggestions?

Link to comment

I had a new PSU in my cupboard for another PC build that never went ahead, winning! So I replaced all the power cables and SATA cables with new ones. Still the same outcome:

Parity 1 Red X

Parity 2 Orange triangle

Disk 2 unmountable (still green-balled, not emulated)

 

Diagnostics attached. I have also copied and pasted the result of xfs_repair -nv on Disk 2 below, as I wasn't sure whether it would be included in the diagnostics.

 

Please advise next step, I'm at a loss.

 

Phase 1 - find and verify superblock...
        - block cache size set to 284120 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 174487 tail block 174483
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

XFS_REPAIR Summary    Sun Aug 22 13:36:37 2021

Phase       Start           End             Duration
Phase 1:    08/22 13:36:36  08/22 13:36:36
Phase 2:    08/22 13:36:36  08/22 13:36:37  1 second
Phase 3:    08/22 13:36:37  08/22 13:36:37
Phase 4:    08/22 13:36:37  08/22 13:36:37
Phase 5:    Skipped
Phase 6:    08/22 13:36:37  08/22 13:36:37
Phase 7:    08/22 13:36:37  08/22 13:36:37

Total run time: 1 second

 

tower-diagnostics-20210822-1344.zip

 

EDIT: I have also now tried another brand-new MB of the same model, Supermicro X7SPA-HF-D525-O, with the same result.

Edited by Compass
Link to comment
4 hours ago, Compass said:

Still the same outcome,

Parity 1 Red X

Parity 2 Orange triangle

Disk 2 unmountable (still green-balled, not emulated)

This will never be fixed by just changing hardware. If for now there are no more ATA errors, run xfs_repair without -n on disk2 (with -n nothing will be done), then try to re-sync parity, and keep monitoring the log for ATA errors.
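As a rough illustration of the difference the -n flag makes, the sketch below only builds and prints the two command lines; it does not run anything. The device path /dev/md2 is an assumption (the md device for disk2 with the array started in maintenance mode on Unraid releases of this era), and the helper name is hypothetical.

```python
# Assumed device for disk2; verify the correct path for your Unraid version first.
DEVICE = "/dev/md2"


def xfs_repair_command(device, dry_run=True, verbose=True):
    """Build the xfs_repair argument list; -n means check only and change nothing."""
    args = ["xfs_repair"]
    if dry_run:
        args.append("-n")  # no-modify mode: report problems, write nothing
    if verbose:
        args.append("-v")  # verbose output
    args.append(device)
    return args


if __name__ == "__main__":
    # The check already run earlier in the thread (equivalent to -nv):
    print(" ".join(xfs_repair_command(DEVICE, dry_run=True)))
    # The actual repair suggested here, without -n:
    print(" ".join(xfs_repair_command(DEVICE, dry_run=False)))
```

The printed lines can be typed into a terminal session or adapted to the options field of the webGUI's filesystem check, whichever is used.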

Link to comment
1 minute ago, JorgeB said:

This will never be fixed by just changing hardware. If for now there are no more ATA errors, run xfs_repair without -n on disk2 (with -n nothing will be done), then try to re-sync parity, and keep monitoring the log for ATA errors.

Should I use Putty or just the webGUI for xfs_repair?

Link to comment
