JorgeB Posted December 18, 2017
You should try one thing at a time or you won't know what the problem was. If you had another complete server you could move all the disks there and troubleshoot this one later with test data instead of your real data.
Joseph Posted December 18, 2017
27 minutes ago, johnnie.black said: You should try one thing at a time or you won't know what the problem was. If you had another complete server you could move all the disks there and troubleshoot this one later with test data instead of your real data.
The irony is, I was contemplating the purchase of a beefier box and using the current box as a backup unRAID. This exercise has accelerated that consideration.
Joseph Posted December 20, 2017
On 12/18/2017 at 1:54 AM, johnnie.black said: If the array data was unchanged since the beginning of the rebuild (this includes no running dockers/VMs on the array) there's still a chance to rebuild disk6 again, assuming disk5 is OK, but you need to fix whatever is wrong with the server first.
RECAP:
* Disk3 was knocked offline and accidentally reformatted. The drive has been pulled for a data-recovery attempt; its (empty) contents are being emulated.
* Disk5 was knocked offline and rebuilt, but with an insane number of writes and errors.
* Disk6 was knocked offline... at this point I stopped the array.
* Memtests ran clean, without errors.
* Currently, unRAID is shut down until I get a replacement power supply.
BRIEF UPDATE:
* The data recovery on Disk3 is still a work in progress.
* I examined the contents of Disk5 and Disk6 via Ubuntu and things appear to be intact. However, I suppose that unless I find a damaged file manually, there's no way of ever really knowing whether there's actual data corruption or not. The error count on the main page suggests there will be, but it's a needle in a haystack.
After I replace the power supply and get a replacement for Disk3, what are your thoughts on rebuilding the array from parity (i.e. Disk3 and Disk6) vs. rebuilding the parity disks from all the data drives in the pool? Thanks for your help.
JorgeB Posted December 20, 2017
2 minutes ago, Joseph said: After I replace the power supply and get another disk for Disk3, what are your thoughts on rebuilding the array from parity (i.e. Disk3 and Disk6) vs. rebuilding the parity disks from all the data drives in the pool?
This can only work if the array data was 100% unchanged during the failed rebuild. If it was, you have nothing to lose by trying to rebuild disk6 onto a spare disk. If you don't have a spare, it would be best to back up the current disk6 before reusing it; it's corrupt for sure, but maybe some of the data is good.
Joseph Posted December 20, 2017
14 minutes ago, johnnie.black said: This can only work if the array data was 100% unchanged during the failed rebuild. If it was, you have nothing to lose by trying to rebuild disk6 onto a spare disk. If you don't have a spare, it would be best to back up the current disk6 before reusing it; it's corrupt for sure, but maybe some of the data is good.
So I'm a little confused. If I'm reading correctly, you're saying it's better to have the array attempt a rebuild onto a replacement Disk6 even though Disk5 had a ton of errors and I can see valid files on Disk6 (as well as Disk5, for that matter) via Ubuntu?
JorgeB Posted December 20, 2017
15 minutes ago, Joseph said: So I'm a little confused. If I'm reading correctly, you're saying it's better to have the array attempt a rebuild onto a replacement Disk6 even though Disk5 had a ton of errors and I can see valid files on Disk6 (as well as Disk5, for that matter)
Sorry, I meant rebuild disk5.
Joseph Posted December 20, 2017
20 minutes ago, johnnie.black said: Sorry, I meant rebuild disk5.
Oh, OK... makes sense. FWIW, I hope I didn't change anything on Disk6 when I launched a couple of items to see if they worked... I didn't save anything, but now I wonder if that changed the contents of the disk.
JorgeB Posted December 20, 2017
31 minutes ago, Joseph said: I hope I didn't change anything on Disk6 when I launched a couple of items to see if they worked... I didn't save anything, but now I wonder if that changed the contents of the disk.
It should be OK, but a read-only mount would be better in these situations.
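Outside the array, that kind of inspection could be done with a read-only mount; the device name and mount point below are assumptions, and `norecovery` keeps XFS from replaying its journal, which would itself write to the disk:

```shell
# Hypothetical device -- substitute the actual partition.
mkdir -p /mnt/inspect
mount -t xfs -o ro,norecovery /dev/sdX1 /mnt/inspect
ls /mnt/inspect      # browse without risk of modifying anything
umount /mnt/inspect
```

Note that `norecovery` requires `ro`; the disk is left exactly as it was found.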
Joseph Posted December 29, 2017
On 12/20/2017 at 1:09 PM, johnnie.black said: It should be OK, but a read-only mount would be better in these situations.
UPDATE: I found the original Disk5 HDD that I shelved about a month ago as a backup, and it might be intact. So hopefully it won't be a total loss if rebuilding from parity doesn't work. Still don't know what PSU to buy, so I posted in the PSU thread.
UPDATE2: OK, so my Seasonic PRIME 750W 80 Plus Gold PSU arrived today and I have everything back up and running. I had enough power cables, so I removed all splitters just to be safe. When I went to start the array, it did not 'see' Disk3 or Disk5 and wanted to format them to bring them online--WHICH I DID NOT DO THIS TIME. I stopped the array and ran diags (see attached). Any thoughts on how to proceed? I don't want to blow it this time.
JorgeB Posted December 29, 2017
Disk3 is empty, correct? Though not being correctly emulated is not a good sign, same for disk5. Before rebuilding, run xfs_repair on both emulated disks: unassign the disk5 you assigned just before grabbing the diags, start the array in maintenance mode, and run:

xfs_repair -v /dev/md3

When done, run the same on /dev/md5, then start the array with both disks still unassigned and see if the emulated disks mount.
Joseph Posted December 30, 2017
4 hours ago, johnnie.black said: Disk3 is empty, correct? Though not being correctly emulated is not a good sign, same for disk5. Before rebuilding, run xfs_repair on both emulated disks: unassign the disk5 you assigned just before grabbing the diags, start the array in maintenance mode, and run:
UPDATE: md3

Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
[...]
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 1471432 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 13799 tail block 13795
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

This message is confusing to me. Leaving it in maintenance mode for now. Thoughts on what to do next (before running xfs_repair on md5)?
JorgeB Posted December 30, 2017
Use -L:

xfs_repair -vL /dev/md3
Joseph Posted December 30, 2017
15 minutes ago, johnnie.black said: Use -L: xfs_repair -vL /dev/md3
Oh boy! Will update with results.
Joseph Posted December 30, 2017
9 hours ago, johnnie.black said: Use -L: xfs_repair -vL /dev/md3
UPDATE on the md3 repair (see below): is it OK to run the repair on md5, or do I need to do something else first? Thanks.

Phase 1 - find and verify superblock...
        - block cache size set to 1471432 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 97
resetting superblock realtime bitmap ino pointer to 97
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 98
resetting superblock realtime summary ino pointer to 98
Phase 2 - using internal log
        - zero log...
zero_log: head block 13799 tail block 13795
ALERT: The filesystem has valuable metadata changes in a log which is being destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
sb_icount 0, counted 1472
sb_ifree 0, counted 534
sb_fdblocks 976277683, counted 976273761
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 96
bad CRC for inode 99
bad CRC for inode 111
bad CRC for inode 116
bad CRC for inode 118
bad CRC for inode 119
bad CRC for inode 127
bad CRC for inode 128
bad CRC for inode 156
bad CRC for inode 96, will rewrite
cleared root inode 96
bad CRC for inode 99, will rewrite
bad CRC for inode 111, will rewrite
cleared inode 111
bad CRC for inode 116, will rewrite
cleared inode 116
bad CRC for inode 118, will rewrite
cleared inode 118
bad CRC for inode 119, will rewrite
cleared inode 119
bad CRC for inode 127, will rewrite
cleared inode 127
bad CRC for inode 128, will rewrite
cleared inode 128
bad CRC for inode 156, will rewrite
cleared inode 156
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 4298782817, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 99 nlinks from 2 to 3
Maximum metadata LSN (1:13791) is ahead of log (1:2).
Format log to cycle 4.

XFS_REPAIR Summary    Sat Dec 30 01:46:31 2017

Phase           Start           End             Duration
Phase 1:        12/30 01:40:25  12/30 01:40:25
Phase 2:        12/30 01:40:25  12/30 01:42:58  2 minutes, 33 seconds
Phase 3:        12/30 01:42:58  12/30 01:42:59  1 second
Phase 4:        12/30 01:42:59  12/30 01:42:59
Phase 5:        12/30 01:42:59  12/30 01:42:59
Phase 6:        12/30 01:42:59  12/30 01:42:59
Phase 7:        12/30 01:42:59  12/30 01:42:59

Total run time: 2 minutes, 34 seconds
done
root@Tower:/#
JorgeB Posted December 30, 2017
You can run xfs_repair on disk5 now. After both are done, start the array to check whether xfs_repair was successful, though if disk3 is still empty, the one that really matters is disk5.
Joseph Posted December 30, 2017
15 minutes ago, johnnie.black said: You can run xfs_repair on disk5 now. After both are done, start the array to check whether xfs_repair was successful, though if disk3 is still empty, the one that really matters is disk5.
It's taking a while to find the secondary superblock on md5. My guess is that once it's finished, it will instruct me to run the -L option on it too... if that's the case, I will do that and then start the array after it finishes.
JorgeB Posted December 30, 2017
Not a good sign. Let it finish, but you'll likely have to use the current disk5 or the previous disk5 to recover as much as possible.
Joseph Posted December 30, 2017
2 hours ago, johnnie.black said: Not a good sign. Let it finish, but you'll likely have to use the current disk5 or the previous disk5 to recover as much as possible.
DIFFERENT RESULTS!! This could be promising, no? So I guess the next step is to start the array (with disk5 not installed) and see what happens?

[...]
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 1471424 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 886339 tail block 886339
        - scan filesystem freespace and inode maps...
sb_icount 19968, counted 20352
sb_ifree 7680, counted 6889
sb_fdblocks 31812789, counted 192468030
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
bad CRC for inode 96
bad CRC for inode 99
bad CRC for inode 111
bad CRC for inode 116
bad CRC for inode 118
bad CRC for inode 119
bad CRC for inode 127
bad CRC for inode 128
bad CRC for inode 156
bad CRC for inode 96, will rewrite
cleared root inode 96
bad CRC for inode 99, will rewrite
cleared inode 99
bad CRC for inode 111, will rewrite
cleared inode 111
bad CRC for inode 116, will rewrite
cleared inode 116
bad CRC for inode 118, will rewrite
cleared inode 118
bad CRC for inode 119, will rewrite
cleared inode 119
bad CRC for inode 127, will rewrite
cleared inode 127
bad CRC for inode 128, will rewrite
cleared inode 128
bad CRC for inode 156, will rewrite
cleared inode 156
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
Phase 5 - rebuild AG headers and trees...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - reset superblock...
Phase 6 - check inode connectivity...
reinitializing root directory
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
disconnected dir inode 99, moving to lost+found
disconnected dir inode 2159698, moving to lost+found
disconnected dir inode 37065147, moving to lost+found
Phase 7 - verify and correct link counts...
resetting inode 96 nlinks from 2 to 3
resetting inode 52475801 nlinks from 2 to 5
Note - stripe unit (0) and width (0) were copied from a backup superblock.
Please reset with mount -o sunit=<value>,swidth=<value> if necessary

XFS_REPAIR Summary    Sat Dec 30 13:57:24 2017

Phase           Start           End             Duration
Phase 1:        12/30 12:40:26  12/30 13:56:59  1 hour, 16 minutes, 33 seconds
Phase 2:        12/30 13:56:59  12/30 13:57:00  1 second
Phase 3:        12/30 13:57:00  12/30 13:57:14  14 seconds
Phase 4:        12/30 13:57:14  12/30 13:57:14
Phase 5:        12/30 13:57:14  12/30 13:57:14
Phase 6:        12/30 13:57:14  12/30 13:57:15  1 second
Phase 7:        12/30 13:57:15  12/30 13:57:15

Total run time: 1 hour, 16 minutes, 49 seconds
done
JorgeB Posted December 30, 2017
6 minutes ago, Joseph said: This could be promising, no? So I guess the next step is to start the array (with disk5 not installed) and see what happens?
Difficult to guess. Start the array with no disk assigned for disk5.
Joseph Posted December 30, 2017
On 12/30/2017 at 4:06 PM, johnnie.black said: Difficult to guess. Start the array with no disk assigned for disk5.
There seems to be content on the emulated disk5... but is there any way to verify the validity of that content? All of it is in lost+found. I'm guessing the next step is to rebuild Disk3 & Disk5, and then, to ensure the validity of Disk5's contents, copy everything from the disk5 backup that I still have lying around... thoughts?
JorgeB Posted December 30, 2017
7 minutes ago, Joseph said: I'm guessing the next step is to rebuild Disk3 & Disk5, and then, to ensure the validity of Disk5's contents
You can see the contents of disk5 by browsing the emulated disk; whatever is there is what is going to be on the rebuilt disk. If you decide to rebuild, it would be best to use a new spare disk, so you can still access the old one if needed.
Joseph Posted December 31, 2017
1 hour ago, johnnie.black said: You can see the contents of disk5 by browsing the emulated disk
Maybe I don't understand what lost+found is... It seems to me that at this point there could be certain files (such as audio or video) which might seem OK, but there's no way to know whether parts of them are corrupted unless they are played all the way through... which is why I'm considering just restoring the files from the shelved disk5.
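The comparison Joseph is weighing - recovered contents versus the shelved disk5 - could be sketched with plain diff; the mount points below are assumptions:

```shell
# Hypothetical mount points for the shelved backup and the rebuilt disk.
# 'diff -rq' recurses both trees and reports files whose content differs,
# plus files that exist on only one side, without printing the data itself.
diff -rq /mnt/backup-disk5 /mnt/disk5
```

It exits 0 when the trees match and 1 when anything differs, so it works equally well in a script.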
pwm Posted December 31, 2017
3 minutes ago, Joseph said: Maybe I don't understand what lost+found is... which is why I'm considering just restoring the files from the shelved disk5.
Lost+found is basically the file chains found during the repair that xfs_repair could not figure out how to catalog. Modern file systems separate the file names seen in the directories from the metadata about the file, and normally also from the actual file data. The printout above mentions inodes - each inode represents one file, but without any file name or owning directory. The entries in the directory just point to the inode, and the inode records where the file data is stored. This separation is what allows hard links - multiple directory entries, with potentially completely different file names, pointing at the same inode and hence accessing the same file data. Lost+found just means that directory entries - or whole directories - have been lost, so the unreachable inodes were added to lost+found. This also means that it is never good to see "bad CRC for inode xxx".
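pwm's point about directory entries versus inodes can be seen directly from a shell: `ln` creates a second directory entry for the same inode, and the data stays reachable as long as any entry remains.

```shell
# Work in a scratch directory.
cd "$(mktemp -d)"
echo "data" > original

# Create a second directory entry pointing at the same inode.
ln original hardlink

# Both names report the same inode number (%i); link count (%h) is now 2.
stat -c '%i %h %n' original hardlink

# Removing one name does not remove the data: the inode survives
# until its last directory entry is gone.
rm original
cat hardlink
```

This is why xfs_repair can salvage an intact inode whose directory entry was destroyed - it simply gives it a new entry under lost+found.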
Joseph Posted December 31, 2017
10 minutes ago, pwm said: This also means that it is never good to see "bad CRC for inode xxx".
So in my case, the fact that there are 9 bad inode CRCs means there are 9 corrupted (or lost) files... but the rest of the files on the disk are 100% intact?
pwm Posted December 31, 2017
No, you can have more lost files. And you don't know whether other files are intact or not - XFS checksums the metadata but not the file data. That is a reason why it's good to keep hashes for all static files, allowing you to regularly validate the file content. And - of course - why it's also good to have a working backup scheme that takes file changes into account and doesn't just overwrite a good backup file with a corrupted copy.
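The hashing scheme pwm describes can be as simple as a sha256sum manifest; the disk path below is an example:

```shell
# One-time: record a checksum for every file on the disk
# (the path is an example -- point it at the share or disk to protect).
find /mnt/disk5 -type f -print0 | xargs -0 sha256sum > disk5.sha256

# Later, or against a rebuilt copy: re-hash everything and list
# only the files whose content no longer matches the manifest.
sha256sum -c --quiet disk5.sha256
```

`sha256sum -c` exits non-zero if any file fails verification, which makes it easy to run from a scheduled job.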