Homerr Posted November 30, 2019 (edited) [Solved] TL;DR - A SATA cable went bad on one of the parity drives. I had a previous issue I *just* fixed and now it's back. I had 3 drives drop out of the array per the thread below. I have 2 parity drives, so I was able to rebuild two from parity; the third had data loss. What I did: I had a new precleared WD 10TB that I put in as Disk 8 (WD10...48LDZ). I then precleared the replaced former Disk 8 (WD10...W43D) as Disk 14; that's running fine now. Disk 11 was ultimately reformatted and returned to the array after some checks per the other thread. Disks 8 and 14 finished rebuilding about a week ago and the array was all good for about 24 hours. Then a regularly scheduled parity check started Wednesday morning. I came home to an unresponsive server. I rebooted, and now Disks 8 and 13 are fubared. Disk 13 had not previously had an issue. I turned the server off for the last couple of days as I was out of town for the holiday. I can obviously rebuild from parity, but the reliability is majorly in question. Any ideas on troubleshooting this? I do have another motherboard, CPU, and PSU I could try if recommended. I'm running a pair of Supermicro AOC-SASLP-MV8 cards that I've used for a couple of years with no issues previously. I replaced the breakout cables per the previous thread. All running in an early-gen Norco RPC-2040 4U case. unraid-diagnostics-20191130-1651.zip Edited December 26, 2019 by Homerr
JorgeB Posted December 1, 2019 The diags are from after rebooting, but the most likely culprit is the SASLP controller; they are known to drop disks for no reason, and past reliability doesn't mean much since they can start giving problems at any time, usually after an Unraid release upgrade or some other hardware change. Since disks 8 and 13 should be on the same miniSAS cable, it might also be the cable, if you want to try that first, but IMHO you should replace these controllers with LSI HBAs ASAP. P.S.: parity2 needs a new SATA cable.
Homerr Posted December 1, 2019 (Author) Ok, thanks for the direction. In the midst of this I had ordered an LSI 9201-16i from eBay seller jiawen2018 and tried it, but it was wonky from the start and only ran at paltry kB/s speeds, so I went back to the SASLP controllers since they had previously worked. I've now ordered another LSI 9201-16i from Art of Server on eBay, and his pics look slightly different from jiawen2018's cards, so I wonder if the latter was genuine. I did put in new cables in August when I first attempted to address this, so I'm concerned that one of my backplanes is bad. I'm going to map out which drives are on which backplanes as well. If one is bad, then it would seem the ol' Norco might need to be replaced too, as I don't seem to be able to find replacement backplanes online.
Homerr Posted December 1, 2019 (Author) I've figured out which backplane each drive is on. Disks 13 and 8 were on the same backplane and the same breakout cable. I replaced the breakout cable with a different one and also moved the drives to my one unused slot, to no avail. Once drives are marked 'Unmountable: No file system', do they stay that way until a reformat and/or rebuild? Or should they show up healthy if there are no other issues?
itimpi Posted December 1, 2019
Quote: Once drives are marked 'Unmountable: No file system' are they then always until a reformat and/or rebuild? Or should they show up 'healthy' if there are no other issues?
The vast majority of the time this means there is file system corruption on the drive that can be fixed by clicking on the drive on the Main tab and using the option in the resulting dialog to check/repair the file system.
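For those who prefer the console, the GUI's check option corresponds roughly to running xfs_repair in no-modify mode against the array device (in maintenance mode, array disks appear as /dev/mdN; the device name below is an assumption matching this thread's Disk 8, not taken from the diagnostics):

```shell
# Read-only filesystem check; -n makes no changes to the disk.
# /dev/md8 is hypothetical -- substitute your disk's md device.
check_fs() {
    dev="$1"
    if [ -b "$dev" ] && command -v xfs_repair >/dev/null 2>&1; then
        xfs_repair -n "$dev"
    else
        echo "skipped: $dev or xfs_repair not available on this system"
    fi
}

check_fs /dev/md8
```

Running against /dev/mdN rather than /dev/sdX keeps parity in sync with any repairs made later without -n.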
Homerr Posted December 6, 2019 (Author) I got the new LSI card from Art of Server and installed it. I started in Maintenance mode, clicked on disks 8 and 13, and ran xfs_repair without -n. It then said it needed to mount the drives to get information. I stopped Maintenance mode and started the array in normal mode; I don't know if this is really what it was asking for. Stopped the array again, went into Maintenance mode, and re-ran xfs_repair. It didn't seem to be definitive about what happened, just 'done'. Stopped Maintenance mode and started the array as normal: disks 8 and 13 still having issues. Stopped, back to Maintenance mode. Ran xfs_repair with -L to zero the log on each drive. Stopped Maintenance mode, started the array as normal. No joy. What do I do next? In the jumble of all this, my hot spare is now smaller than the drive it would replace. Can I force an array rebuild on these two drives without reassigning a different drive?
JorgeB Posted December 6, 2019 Post the output of xfs_repair.
Homerr Posted December 6, 2019 (Author)

Disk 8:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan and clear agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- agno = 5
- agno = 6
- agno = 7
- agno = 8
- agno = 9
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 4
- agno = 6
- agno = 3
- agno = 5
- agno = 1
- agno = 7
- agno = 8
- agno = 9
- agno = 2
Phase 5 - rebuild AG headers and trees...
- reset superblock...
Phase 6 - check inode connectivity...
- resetting contents of realtime bitmap and summary inodes
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (4:2) is ahead of log (1:2).
Format log to cycle 7.
done

And Disk 13:
Phase 1 - find and verify superblock...
superblock read failed, offset 0, size 524288, ag 0, rval -1
fatal error -- Input/output error
JorgeB Posted December 6, 2019 And current diags please, grabbed after array start.
itimpi Posted December 6, 2019 I would expect disk8 to now mount fine, as that is the expected result when xfs_repair completes without error. It appears disk13 cannot be accessed, which suggests it is either offline or has really failed. Posting the current diagnostics might give a clue.
Homerr Posted December 8, 2019 (Author) Diagnostics attached. unraid-syslog-20191208-0001.zip
JorgeB Posted December 8, 2019 That's just the syslog; Tools -> Diagnostics.
Homerr Posted December 8, 2019 (Author) oops! unraid-diagnostics-20191208-1555.zip
JorgeB Posted December 9, 2019 There's a problem with parity2, and since 2 disks are disabled, both can't be emulated correctly:

Dec 7 16:34:55 unRAID kernel: ata6.00: status: { DRDY SENSE ERR }
Dec 7 16:34:55 unRAID kernel: ata6.00: error: { ICRC ABRT }
Dec 7 16:34:55 unRAID kernel: ata6: hard resetting link
Dec 7 16:34:56 unRAID kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 7 16:34:56 unRAID kernel: ata6.00: configured for UDMA/100
Dec 7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Dec 7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 Sense Key : 0xb [current]
Dec 7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 ASC=0x47 ASCQ=0x0
Dec 7 16:34:56 unRAID kernel: sd 6:0:0:0: [sdt] tag#7 CDB: opcode=0x8a 8a 00 00 00 00 00 47 f9 57 d0 00 00 05 40 00 00
Dec 7 16:34:56 unRAID kernel: print_req_error: I/O error, dev sdt, sector 1207523280
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523216
Dec 7 16:34:56 unRAID kernel: ata6: EH complete
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523224
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523232
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523240
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523248
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523256
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523264
Dec 7 16:34:56 unRAID kernel: md: disk29 write error, sector=1207523272

This looks more like a connection problem; replace both cables on that disk and post new diags after array start.
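Those ICRC/ABRT resets are the classic signature of a bad link rather than a bad platter. One way to corroborate that after a cable swap (a sketch, assuming smartmontools is installed and /dev/sdt is still parity2's device name) is to watch the drive's interface CRC counter:

```shell
# UDMA_CRC_Error_Count (SMART attribute 199) counts interface CRC errors.
# If it keeps rising after a cable swap, suspect the backplane or port
# rather than the drive itself.
crc_check() {
    dev="$1"
    if command -v smartctl >/dev/null 2>&1 && [ -b "$dev" ]; then
        smartctl -A "$dev" | grep -i -E 'crc|udma' \
            || echo "no CRC attribute reported for $dev"
    else
        echo "skipped: smartctl or $dev unavailable on this system"
    fi
}

crc_check /dev/sdt
```

Unlike reallocated-sector counts, attribute 199 implicates the path between controller and drive, which is why it is worth checking alongside a cable replacement.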
Homerr Posted December 14, 2019 (Author) (edited) I've been sick this week and am just getting back to this. I bought new SATA cables for all 4 drives running off the mobo, including the 2 parity drives. I started the array in Maintenance mode and ran xfs_repair with no modifiers. It said to start the array and then rerun, and to use -L if that didn't work. Here is the diagnostics file while the array is back up. I have not gone back into Maintenance mode yet to rerun anything. unraid-diagnostics-20191214-2113.zip Edited December 15, 2019 by Homerr
JorgeB Posted December 15, 2019 Post the xfs_repair output for both disks.
itimpi Posted December 15, 2019
Quote: It said to start the array and then rerun, use -L if that didn't work.
It is quite normal to be prompted to add the -L flag to an xfs_repair run, and if you got this prompt you need to do so to get the disk repaired.
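The usual escalation order described above can be sketched as a script (Maintenance mode assumed; /dev/md8 and the mountpoint are placeholders for illustration, not taken from this thread's diagnostics):

```shell
# 1) Try a plain repair.
# 2) If the log needs replaying, mount/unmount once, then retry.
# 3) Only as a last resort zero the log with -L; the most recent
#    metadata updates may be lost and orphans land in lost+found.
repair_xfs() {
    dev="$1"; mnt="$2"
    if ! [ -b "$dev" ] || ! command -v xfs_repair >/dev/null 2>&1; then
        echo "skipped: $dev or xfs_repair unavailable on this system"
        return 0
    fi
    xfs_repair "$dev" && return 0
    mount "$dev" "$mnt" && umount "$mnt"   # replay the journal
    xfs_repair "$dev" && return 0
    xfs_repair -L "$dev"                   # destroys the log -- last resort
}

repair_xfs /dev/md8 /mnt/tmp
```

In Unraid the mount/unmount step is what starting and stopping the array in normal mode accomplishes, which is why the GUI workflow bounces between modes.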
Homerr Posted December 16, 2019 (Author) (edited) I let a check of the array finish while the array was started; it ran all day yesterday and finished without errors. I'm attaching the diagnostics file from after that completed. Restarted in Maintenance mode, and here are the xfs_repair outputs. I reran with no modifier but did not yet run the -L option.

Disk 8:
Phase 1 - find and verify superblock...
sb root inode value 18446744073709551615 (NULLFSINO) inconsistent with calculated value 128
resetting superblock root inode pointer to 128
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

Disk 13:
Phase 1 - find and verify superblock...
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

unraid-diagnostics-20191216-1428.zip Edited December 16, 2019 by Homerr
Homerr Posted December 16, 2019 (Author) The outputs are long, so I put them in the attached text files. Disk 8.txt Disk 13.txt
Homerr Posted December 16, 2019 (Author) I started the array, and the disks now show up with Used and Free space (instead of Unmountable) but are still emulated.
JorgeB Posted December 16, 2019
Quote (3 minutes ago, Homerr): I started the array and the disk now show up with Used and Free space (instead of Unmountable) but are still emulated.
That's expected, since once a disk is disabled it needs to be rebuilt. You can rebuild on top of the old disks, but before doing that make sure the data on the emulated disks looks correct and there are no lost+found folders with data.
Homerr Posted December 16, 2019 (Author) Both disks have lost+found folders. Disk 8 just has 2 folders with files named 132, 133, 134, etc., with no file extensions. The files are large, multi-GB; probably movies...? Disk 13 has more recognizable folders and files. Dumb question: how do I recover each of them from the lost+found trees?
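Extension-less files like those numbered ones can usually be identified from their magic bytes with the `file` utility; a sketch (the path is hypothetical, following this thread's disk numbering):

```shell
# `file` inspects file content, so a recovered movie shows up as e.g.
# "Matroska data" or "ISO Media" even with no extension in the name.
identify_lf() {
    dir="$1"
    if [ -d "$dir" ] && command -v file >/dev/null 2>&1; then
        find "$dir" -type f -exec file {} +
    else
        echo "skipped: $dir or file(1) unavailable on this system"
    fi
}

identify_lf /mnt/disk8/lost+found
```

Once identified, the files can simply be renamed with the right extension and moved back into their shares.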
JorgeB Posted December 16, 2019 I would suggest rebuilding to new disks and keeping the old ones; they should mount fine with UD (after changing the UUID). Then compare the data against the rebuilt disks and copy anything missing. Alternatively, do a new config and resync parity, but there's more risk involved since the array will be unprotected if one of the disks is really bad.
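The UUID change is needed because the rebuilt disk is a sector-for-sector copy, so old and new carry the same XFS filesystem UUID and the second one will refuse to mount alongside the first. A sketch of the change (the device name is a placeholder, not a device from this thread):

```shell
# Give the old disk's partition a freshly generated UUID so it can be
# mounted (e.g. via Unassigned Devices) next to its rebuilt twin.
# /dev/sdX1 is hypothetical -- substitute the old disk's partition.
regen_uuid() {
    dev="$1"
    if [ -b "$dev" ] && command -v xfs_admin >/dev/null 2>&1; then
        xfs_admin -U generate "$dev"   # write a new random UUID
        xfs_admin -u "$dev"            # print it back to confirm
    else
        echo "skipped: $dev or xfs_admin unavailable on this system"
    fi
}

regen_uuid /dev/sdX1
```

Note that xfs_admin will refuse to change the UUID while the log is dirty, so the filesystem should be cleanly unmounted (or repaired) first.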
Homerr Posted December 17, 2019 (Author) Thanks! I just finished consolidating some backup files to free up a 10TB drive and swapped that in as a replacement for Disk 8. It's rebuilding now. I'll do as you suggested and check it against the old files. And then that drive will replace Disk 13.