JorgeB Posted August 31, 2020
You are having multiple disk errors; you need to fix that first or it will make any rebuild difficult. It looks more like a power/connection problem. If you have spares, try another PSU and/or another controller, and also check/replace all cables.
geeksheikh Posted August 31, 2020
16 hours ago, trurl said: Little confused about your screenshots. They both seem to indicate rebuild of disk14 completed, but disk18 is now disabled. The 1st screenshot shows 2 unmountable disks, the rebuilt disk14, and the disabled disk18. The 2nd screenshot shows only 1 unmountable disk, the rebuilt disk14, with the disabled disk18 mounted.
I replaced disk 14 with a brand-new healthy disk and followed the process Johnnie Black noted above. The rebuild began and completed. During the rebuild, disk 18 started showing problems and wound up unmountable. After the rebuild, disk 14 shows as unmountable as well. Since you said BAD GUESS, I now realize I should be checking/fixing that disk. (I did not format, thank goodness.) So, disk 18 is a new set of failures, I believe, and disk 14 has never been remounted since I replaced and rebuilt it.
JorgeB Posted August 31, 2020
Run xfs_repair on disk14; you might also try it on the emulated disk18.
geeksheikh Posted August 31, 2020
12 minutes ago, johnnie.black said: Run xfs_repair on disk14
FS type is auto -- thus I don't see an xfs_repair option... how do I run it on the "auto" FS type?
trurl Posted August 31, 2020
27 minutes ago, srfnmnk said: FS type is auto -- thus I don't see xfs repair option...how do I run it on "auto" FS type?
I see you have luks:xfs as the default filesystem, so that is what "auto" means. If it doesn't give you the option to do the repair in the webUI, you can probably do it from the command line, but I have no experience with encrypted drives, so you should wait for someone else like @johnnie.black 😉
geeksheikh Posted August 31, 2020
Right, here are the screenshots for Disk 14 and Disk 18. Notice that Disk 14 doesn't have the Check FS block like Disk 18 does.
JorgeB Posted August 31, 2020
You can manually set the fs to xfs, or run on the command line:

xfs_repair -v /dev/mapper/mdX

Replace X with the correct disk #.
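On encrypted (luks:xfs) arrays the repair still targets the /dev/mapper device. A minimal sketch of how the command is put together (disk 14 is just this thread's example; the script only prints the command, since a real repair should be run deliberately with the array in maintenance mode):

```shell
#!/bin/sh
# Build the xfs_repair invocation for an Unraid array disk.
# On Unraid, /dev/mapper/mdX is the (decrypted, for luks:xfs) device
# for array disk X. DISK=14 is this thread's example disk number.
DISK=14
DEV="/dev/mapper/md${DISK}"
CMD="xfs_repair -v ${DEV}"

# Print rather than execute: repairs should be run deliberately.
echo "$CMD"
```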
geeksheikh Posted August 31, 2020
Done, this was the output.

root@pumbaa:/dev/mapper# xfs_repair -v /dev/mapper/md14
Phase 1 - find and verify superblock...
bad primary superblock - bad CRC in superblock !!!
attempting to find secondary superblock...
.found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
        - block cache size set to 3067592 entries
sb realtime bitmap inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 129
resetting superblock realtime bitmap ino pointer to 129
sb realtime summary inode 18446744073709551615 (NULLFSINO) inconsistent with calculated value 130
resetting superblock realtime summary ino pointer to 130
Phase 2 - using internal log
        - zero log...
zero_log: head block 69490 tail block 69486
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
trurl Posted August 31, 2020
6 minutes ago, srfnmnk said: use the -L option
geeksheikh Posted August 31, 2020
-L Force log zeroing. Do this as a last resort.
This is what you're wanting me to do, right?
trurl Posted August 31, 2020
4 minutes ago, srfnmnk said: -L Force log zeroing. Do this as a last resort. This is what you're wanting me to do, right?
Yes. The repair tool is generic for XFS filesystems on Linux, not specific to Unraid. It is just telling you that you should attempt to mount the disk first. But Unraid has already told you the disk is unmountable, so the log can't be replayed because the disk can't be mounted.
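trurl's reasoning can be sketched as a small helper that picks the right command: mount first to replay the log, and fall back to -L only when the disk is unmountable. Everything here is illustrative (the mount point included), and the helper only prints the command rather than running it:

```shell
#!/bin/sh
# Choose the repair path xfs_repair itself recommends: if the disk can
# be mounted, mount/unmount replays the log and a plain repair is safe;
# if not (as Unraid already reported here), -L zeroes the log as a last
# resort. This only prints the chosen command.
repair_cmd() {
    dev="$1"; mountable="$2"
    if [ "$mountable" = "yes" ]; then
        echo "mount $dev /mnt/tmp && umount /mnt/tmp && xfs_repair -v $dev"
    else
        echo "xfs_repair -vL $dev"
    fi
}

repair_cmd /dev/mapper/md14 no
```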
geeksheikh Posted August 31, 2020
Attached is the attempted repair output for Disk 14. disk_repair_output.txt
JorgeB Posted August 31, 2020
The disk should mount now. Check that the contents look correct, and also look for a lost+found folder and any data inside it.
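A quick sketch of checking lost+found after a forced repair: xfs_repair names orphaned files and directories after their inode numbers, so counting the entries gives a feel for how much was orphaned. A temporary directory stands in for a real path like /mnt/disk14/lost+found so the sketch is safe to run anywhere:

```shell
#!/bin/sh
# Count orphaned entries in lost+found. LF is a stand-in for a real
# mount path; the two touched files play the role of orphaned inodes.
LF=$(mktemp -d)
touch "$LF/1234" "$LF/5678"   # pretend inode-numbered orphans

count=$(find "$LF" -mindepth 1 | wc -l | tr -d ' ')
echo "lost+found entries: $count"
rm -rf "$LF"
```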
geeksheikh Posted August 31, 2020
Ok, that looks to have gotten it mounted and back into the array. I went to /mnt/disk14 and don't see a lost+found folder. So now the only issue appears to be disk18; I believe this is a net-new issue that occurred during the disk14 rebuild. I have attached the extended self-test results (failure). I'm assuming that since it's also unmountable I could do the same xfs_repair even while the array is started, right (i.e. I don't need to put it in maintenance mode)? But I'm also assuming that since SMART failed I'll likely need to just go ahead and replace it as well; now that the array is healthy, this can be a normal disk replacement with a drive of >= size, right? Last point: I have two drives showing up as "Historical Devices". The TOSHIBA is the original DISK19 that became unmountable, and the other appears to be my current disk 6. I'm guessing this happened when we did the force new config stuff, so should I just "remove" them with the red X?
JorgeB Posted August 31, 2020
5 minutes ago, srfnmnk said: I could do the same xfs_repair even while the array is started, right (i.e. I don't need to stick it in maintenance mode)?
Yes.
5 minutes ago, srfnmnk said: But, I'm also assuming that since SMART failed I'm likely to need to just go ahead and replace it as well, but, now that the array is healthy this can be a normal disk replace, of >= size, right?
Correct, but you should still run xfs_repair first on the emulated disk.
6 minutes ago, srfnmnk said: so should i just "remove" them with the red X?
That's from the Unassigned Devices (UD) plugin, but yes, you can remove them.
geeksheikh Posted August 31, 2020
xfs_repair complete on the emulated Disk18. Output was very similar to Disk14's. Should I just leave it emulated until the replacement arrives, or try to mount it? Thanks again for all the help. Whatta trip this has been.
JorgeB Posted August 31, 2020
8 minutes ago, srfnmnk said: or try to mount it?
No problem trying to mount the emulated disk; whatever you see there is what will be on the rebuilt disk later.
geeksheikh Posted August 31, 2020
Ok, and the best/only way to mount it is to restart the array, right?
geeksheikh Posted September 1, 2020
Ok, so now disk 18 is mounted, but it's still disabled and emulated. Hoping to receive the additional disks tomorrow.
geeksheikh Posted September 3, 2020
Hello, I thought this journey was coming to a close but the saga continues... Since the last posts I have received yet another new disk and replaced disk 18.

- I stopped the array, powered down, replaced the drive, and started the array with the new drive selected as disk 18. The rebuild began and everything completed.
- The disk was unmountable, so I ran xfs_repair -vL /dev/mapper/md18 as before; it completed and the disk could then be mounted.
- I restarted the array in maintenance mode and ran xfs_repair -nv from the GUI; everything looked good (completed as expected).

I thought that would be the end of it... BUT... that evening a wholly new disk became "disabled" (Disk 8 - SN: PL1331LAHGXLLH). So on I go:

- I noticed the emulated files from disk 8 were unavailable. I ssh'd in and verified that I could not copy any files on disk 8 from the console; I tried both /mnt/disk8/... and /mnt/user/<share>.
- xfs_repair in maintenance mode hit the same superblock issue, so I ran xfs_repair -vL on the emulated, just-disabled disk 8.
- I ran an extended SMART test and it passed... now xfs_repair -nv looks normal, but the drive is still disabled...
- Restarting the array shows the files are being emulated, but the files are not there. I looked (from ssh) in /mnt/disk8 and /mnt/user/<share>... and they're in neither place... so I checked lost+found and there are 2261 dirs with data in them, but they all have numeric IDs (screenshot attached). The data is spread throughout these folders...

What the devil is going on?! Ideas? Before I build a new config or do anything else I wanted to check in and get your thoughts. I'm worried that if I do another new config and note that disk 8 is good, we might be going in circles now... It seems I did have 2 bad drives -- both have been replaced and rebuilt -- but why do drives keep becoming disabled? How do I get disk 8 enabled again and the array healthy again?
Attaching some screenshots and a new diagnostics package of the current status. One other note: when the array is stopped (maybe other times too) I see this scrolling error in the log: "device /dev/sdae problem getting id" (screenshot attached). I don't see any sdae device anywhere in the Unraid GUI, but over ssh it does exist in /dev. pumbaa-diagnostics-20200903-1248.zip
JorgeB Posted September 3, 2020
19 minutes ago, srfnmnk said: I thought that would be the end of it...BUT
That's why I mentioned earlier:
On 8/31/2020 at 10:19 AM, JorgeB said: You are having multiple disk errors, you need to fix that first or it will make any rebuild difficult, looks more like a power/connection problem, if you have it try another PSU and/or another controller, also check/replace all cables.
geeksheikh Posted September 3, 2020
This is the same PSU that's been in there for 1.5 years. Are you suggesting that it might be intermittently going out or something? I have a UPS connected and am not seeing any issues/fluctuations on input/output power... The server also never turns off -- I just cannot fathom the randomness of the power failures that would be required to make this happen. The failures were across two different power rails, too; I did confirm that when you mentioned this earlier. The effort required to replace and rewire the PSU is significant; is there anything else it could be? I would like to rule out absolutely everything else first. There are some ongoing errors, such as the /dev/sdae "problem getting id" and whatnot...
JorgeB Posted September 3, 2020
I'm not saying it's the PSU; I'm saying there's a hardware problem, and until you fix it you'll continue to have trouble. It could be a cable, the controller, power, etc. It's very difficult to diagnose remotely or without swapping some hardware around.
JorgeB Posted September 3, 2020
One thing you can try now is to update the LSI controller to the latest firmware. Your log shows:

mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00)

The current release is 20.00.07.00.
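As a sketch, the running firmware version can be pulled straight out of the kernel log. The sample line below is the one quoted above; on a live system you would pipe dmesg (or the syslog from the diagnostics zip) through the same sed expression:

```shell
#!/bin/sh
# Extract the LSI HBA firmware version from an mpt2sas kernel-log line.
# The sample line is the one quoted in this thread; replace the
# here-string with `dmesg | grep FWVersion` on a real system.
line="mpt2sas_cm1: LSISAS2116: FWVersion(17.00.01.00)"
fw=$(printf '%s\n' "$line" | sed -n 's/.*FWVersion(\([0-9.]*\)).*/\1/p')
echo "$fw"
```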