Compass Posted August 19, 2021
Hi folks, my server is fairly basic and I never really touch or change anything; it just chugs along. Today I was copying some data to my server (from a Win10 PC via TeraCopy) when I noticed that the transfer had stalled. When I logged into Tower/Main there was a red X next to parity 1. So I stopped the copy, shut down the server (sorry, I forgot to get the syslog before shutting down), replaced the SATA cable, reseated the power cable to the HDD, then powered back up. Same result; see the attached syslog. SMART seems to be OK. The drive to look for is the WD Blue 6TB: WDC WD60EZRZ-00GZ5B1 WD-WXB1HB4LU1ZN. I'm thinking I should just replace it, or both parity drives. The server is still running now; advice before proceeding, please.
tower-diagnostics-20210819-1210.zip
Compass Posted August 19, 2021 Author
Not even an hour later and my 2nd parity has gone: first it showed a red X, then I took the array offline and it's now saying 'no device' for the 2nd parity. Eek. Updated syslog attached.
tower-diagnostics-20210819-1313.zip
Edited August 19, 2021 by Compass
trurl Posted August 19, 2021
Why are you using such a very old version of Unraid? Diagnostics from those old versions make us do a lot more work trying to piece together all the information. SMART attributes for both parity disks look OK, but no SMART tests have been run on them. This looks like connection or maybe controller problems. Are both of those disks on the Marvell controller?
Compass Posted August 19, 2021 Author
I guess I'm using that version because normally I have no issues, so if it's not broken, I don't try to fix it. Both drives are coming off the motherboard.
trurl Posted August 19, 2021
Just now, Compass said: "Both drives are coming off the motherboard."
But which controller is on the motherboard?
00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
01:00.0 RAID bus controller [0104]: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)
Check all disk connections, both ends, power and SATA, including splitters.
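For anyone following along, here is a rough Python sketch of how the two controller lines quoted above could be picked apart. It assumes the lines are in `lspci -nn` style (slot, class name, class code in brackets, device name, vendor:device ID), which is what the excerpt looks like. The function name `storage_controllers` and the dictionary keys are made up for illustration. Note that this only lists which storage controllers exist; it does not show which disk hangs off which controller.

```python
import re

# The two controller lines quoted in the post above (lspci -nn style output).
LSPCI = """\
00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
01:00.0 RAID bus controller [0104]: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)
"""

# slot, class name, [class code]: device name [vendor:device]
LINE = re.compile(
    r"^(\S+) (.+?) \[([0-9a-f]{4})\]: (.+?) \[([0-9a-f]{4}):([0-9a-f]{4})\]"
)


def storage_controllers(text):
    """Return the mass-storage controllers (PCI base class 01) found in text."""
    found = []
    for line in text.splitlines():
        m = LINE.match(line)
        # PCI class codes starting with 01 are mass storage (0106 SATA, 0104 RAID).
        if m and m.group(3).startswith("01"):
            found.append({
                "slot": m.group(1),
                "class": m.group(2),
                "device": m.group(4),
                "vendor_id": m.group(5),
                "device_id": m.group(6),
            })
    return found


if __name__ == "__main__":
    for ctrl in storage_controllers(LSPCI):
        print(ctrl["slot"], ctrl["class"], "-", ctrl["device"])
```

Vendor ID 8086 is Intel and 11ab is Marvell, so both an onboard Intel controller and a Marvell controller are present in this system.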
Compass Posted August 19, 2021 Author
18 minutes ago, trurl said: "But which controller is on the motherboard?
00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] [8086:2922] (rev 02)
01:00.0 RAID bus controller [0104]: Marvell Technology Group Ltd. MV64460/64461/64462 System Controller, Revision B [11ab:6485] (rev 01)
Check all disk connections, both ends, power and SATA, including splitters."
I would assume they are both using the Marvell controller, which they have done for years. What's the best way to check? I don't remember flashing the MB, but I vaguely remember flashing the SASLP. Checking cables now. Is it safe to shut down?
Edited August 19, 2021 by Compass
trurl Posted August 19, 2021
14 minutes ago, Compass said: "is it safe to shutdown?"
Yes.
Compass Posted August 19, 2021 Author
'Intel ICH9R SATA 3.0Gbps Controller' is the only mention of a controller for the MB, as seen here: https://www.supermicro.com/products/motherboard/ATOM/ICH9/X7SPA-HF-D525.cfm
All of the cables look fine, we haven't had any power outages or blackouts lately, and the server hasn't been moved or fiddled with; it's in a rack cabinet. The only thing that has been touched is the power button to turn it on; it turns off automatically. Can I update the version of unRaid to make the diagnostics better/easier? If so, is going from 6.2.2 to 6.10 a problem? What steps should I follow?
trurl Posted August 19, 2021
You should go ahead with checking connections, then do the rebuilds on your current version. Do you have enough ports to avoid the Marvell? We can discuss upgrading later.
Compass Posted August 19, 2021 Author
I will order 2 new HDDs. It's Friday here now, so I won't get them till next week, plus a couple of days to preclear. All of the SATA ports are full. This MB was highly recommended for unRaid when I put this server together nearly 8 years ago, so it's perplexing that it would suddenly be an issue. Whilst I wait for the new HDDs to arrive, do you see an issue with using the parity swap procedure ( https://wiki.unraid.net/The_parity_swap_procedure ) but just re-assigning the old drives in their current positions and starting the array to see if they rebuild?
trurl Posted August 20, 2021
2 hours ago, Compass said: "order 2 new HDD"
22 hours ago, trurl said: "SMART attributes for both parity looks OK"
I don't think there is anything wrong with the disks, and there is no reason not to try rebuilding to the same disks, since these are both parity and parity has none of your data.
2 hours ago, Compass said: "using the parity swap procedure"
You don't need the parity swap procedure to rebuild to the same disks or to rebuild to new disks. These are simple parity replacements either way. From the wiki page you linked on parity swap:
"Why would you want to do this? To replace a data drive with a larger one, that is even larger than the Parity drive."
That is not at all the situation you are in, even if you want to replace with larger parity disks. To rebuild to the same disks: https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself
The procedure is the same whether data or parity, and you can do 2 at the same time since you have dual parity.
Compass Posted August 20, 2021 Author
11 minutes ago, trurl said: "I don't think there is anything wrong with the disks, and no reason to not try rebuilding to the same disks, since these are both parity and parity has none of your data. You don't need the parity swap procedure to rebuild to the same disks or to rebuild to new disks. These are simple parity replacements either way. From that wiki you linked on parity swap: Not at all the situation you are in, even if you want to replace with larger parity disks. To rebuild to the same disks: https://wiki.unraid.net/Manual/Storage_Management#Rebuilding_a_drive_onto_itself The procedure is the same whether data or parity, and you can do 2 at the same time since you have dual parity."
OK, I am running an extended SMART test anyway to see what it says, then I will do what you have linked to. Thanks, I will let you know the outcome.
Compass Posted August 20, 2021 Author
Extended SMART results:
WDC_WD60EZRZ-00GZ5B1_WD-WXB1HB4LU1ZN-20210820-2202.txt
WDC_WD60EZRZ-00GZ5B1_WD-WXN1H849LEZS-20210821-0022.txt
Compass Posted August 21, 2021 Author
Parity 2 red X'd again; I have stopped the rebuild. Diagnostics attached.
tower-diagnostics-20210821-1434.zip
Compass Posted August 21, 2021 Author
Compass said: "Parity 2 Red X'd again, have stopped the rebuild, dia attached"
I have started again with just the Parity 1 disk this time.
Compass Posted August 21, 2021 Author
11 minutes ago, Compass said: "Parity 2 Red X'd again, have stopped the rebuild, dia attached Have started again with just Parity 1 disc this time"
I have just noticed Disk 2 is 'unmountable'. Should I just let it run?
JorgeB Posted August 21, 2021
There are ATA errors on multiple disks; these usually indicate a power/connection problem.
Compass Posted August 21, 2021 Author
2 hours ago, JorgeB said: "There are ATA errors in multiple disks, these are usually a power/connection problem."
Thanks, I'll get a new PSU. It's a few years old now, though it runs off a UPS, so it has a stable supply from the wall. Anyway, the rebuild with parity 1 has failed too; diagnostics attached. Before I shut it down, is there anything else I should try or be aware of?
tower-diagnostics-20210821-2025.zip
JorgeB Posted August 21, 2021
19 minutes ago, Compass said: "before I shut it down, anything else I should try or be aware of?"
Not really. I just wouldn't try anything else before using a different PSU, or at least replacing/checking all the cables and connections. After that, check the syslog for errors like these; if they continue to appear, there's still a problem.
Aug 21 12:03:58 Tower kernel: ata3.00: status: { DRDY }
Aug 21 12:03:58 Tower kernel: ata3: hard resetting link
Aug 21 12:04:04 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:08 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:08 Tower kernel: ata3: hard resetting link
Aug 21 12:04:14 Tower kernel: ata3: link is slow to respond, please be patient (ready=0)
Aug 21 12:04:18 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:18 Tower kernel: ata3: hard resetting link
Aug 21 12:04:23 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:23 Tower kernel: ata3.00: configured for UDMA/133
Aug 21 12:04:23 Tower kernel: ata3: EH complete
Aug 21 12:04:46 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:04:46 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Aug 21 12:04:46 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 2
Aug 21 12:04:46 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Aug 21 12:04:46 Tower kernel: ata4.00: status: { DRDY }
Aug 21 12:04:46 Tower kernel: ata4: hard resetting link
Aug 21 12:04:47 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:47 Tower kernel: ata4.00: configured for UDMA/133
Aug 21 12:04:47 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x4
Aug 21 12:04:47 Tower kernel: ata4: EH complete
Aug 21 12:05:54 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:05:54 Tower kernel: ata4.00: failed command: READ DMA EXT
Aug 21 12:05:54 Tower kernel: ata4.00: cmd 25/00:40:30:86:79/00:05:1e:00:00/e0 tag 29 dma 688128 in
Aug 21 12:05:54 Tower kernel:          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
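To make "check the syslog for errors like these" a bit more concrete, here is a minimal Python sketch that counts ATA error lines per port. The sample lines are copied from the excerpt above. The marker list `ERROR_MARKERS` and the function name `count_ata_errors` are assumptions chosen for illustration, not an exhaustive or official list of libata error messages.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt in this thread.
SAMPLE = """\
Aug 21 12:04:08 Tower kernel: ata3: COMRESET failed (errno=-16)
Aug 21 12:04:08 Tower kernel: ata3: hard resetting link
Aug 21 12:04:23 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Aug 21 12:04:46 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Aug 21 12:04:46 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Aug 21 12:04:46 Tower kernel: ata4: hard resetting link
"""

# Assumed markers: substrings taken from the excerpt, not a complete set.
ERROR_MARKERS = (
    "exception Emask",
    "failed command",
    "hard resetting link",
    "COMRESET failed",
)

# Captures the port (ata3, ata4, ...) and the message; ignores the ".00" device suffix.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def count_ata_errors(lines):
    """Count lines per ATA port whose message contains one of ERROR_MARKERS."""
    counts = Counter()
    for line in lines:
        m = ATA_LINE.search(line)
        if m and any(marker in m.group(2) for marker in ERROR_MARKERS):
            counts[m.group(1)] += 1
    return counts


if __name__ == "__main__":
    for port, n in sorted(count_ata_errors(SAMPLE.splitlines()).items()):
        print(f"{port}: {n} error line(s)")
```

To run it against a live log instead of the sample, pass the lines of the syslog file (assumed here to be `/var/log/syslog`) to `count_ata_errors`. After swapping the PSU or cables, a count that keeps growing on the same port suggests the problem is still there.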
Compass Posted August 21, 2021 Author
13 minutes ago, JorgeB said: "Not really, just wouldn't try anything else before using a different PSU or at least replace/check all the cables/connections, after that check the syslog for errors like these, there's still a problem if they continue to appear."
Righto, thanks. I will shut it down and check everything whilst I find a new PSU. Any suggestions?
Compass Posted August 22, 2021 Author
I had a new PSU in my cupboard for another PC build that never went ahead, winning! So I replaced all power cables and SATA cables with new ones. Still the same outcome:
Parity 1: red X
Parity 2: orange triangle
Disk 2: unmountable (still green balled, not emulated)
Diagnostics attached. I have also copied and pasted the result of xfs_repair -nv on Disk 2 below, as I wasn't sure if that would come through in the diagnostics. Please advise on the next step, I'm at a loss.
Phase 1 - find and verify superblock...
        - block cache size set to 284120 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 174487 tail block 174483
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
XFS_REPAIR Summary    Sun Aug 22 13:36:37 2021
Phase      Start            End              Duration
Phase 1:   08/22 13:36:36   08/22 13:36:36
Phase 2:   08/22 13:36:36   08/22 13:36:37   1 second
Phase 3:   08/22 13:36:37   08/22 13:36:37
Phase 4:   08/22 13:36:37   08/22 13:36:37
Phase 5:   Skipped
Phase 6:   08/22 13:36:37   08/22 13:36:37
Phase 7:   08/22 13:36:37   08/22 13:36:37
Total run time: 1 second
tower-diagnostics-20210822-1344.zip
EDIT: I have also now tried another brand new MB, same model (Supermicro X7SPA-HF-D525-O), with the same result.
Edited August 22, 2021 by Compass
JorgeB Posted August 22, 2021
4 hours ago, Compass said: "Still the same outcome, Parity 1 Red X Parity 2 Orange triangle Disc 2 unmountable(still green balled, not emulated)"
This will never be fixed by just changing hardware. If there are no more ATA errors for now, run xfs_repair without -n on disk2 (with -n, nothing will be done), then try to re-sync parity, and keep monitoring the log for ATA errors.
Compass Posted August 22, 2021 Author
1 minute ago, JorgeB said: "This will never be fixed by just changing hardware, if for now there are no more ATA errors run xfs_repair without -n on disk2, or nothing will be done, then try to re-sync parity, keep monitoring the log for ATA errors."
Should I use PuTTY or just the webGUI for xfs_repair?
ChatNoir Posted August 22, 2021
The WebGUI should be simpler.