Drumsk8 Posted June 5, 2023

Hello, I'm usually not one to post; I'll lurk for hours/days/weeks working through my problems (already done that). However, I've hit a bit of a wall and could really do with a little assistance if anyone is kind enough to help.

PART 1 - parity swap and sync errors

It all started when a disk died in my array. I have 2 parity drives and 5 data drives, all 2TB, and it was Disk 2 that died. I managed to get a good deal on some 3TB drives, and I can't afford anything else. Once they arrived I decided that, since I'd previously seen some errors on Disk 5 which I cleared and which never came back, it would be a wise move to perform a dual parity swap (it knocks my office offline for a while, but then I could stay up).

Unraid version: 6.11.5

D2 - replaced with 3TB drive [dead disk being replaced]
D5 - replaced with 3TB drive [forced failure to switch disk]
P1 <--> D2
P2 <--> D5

I followed the guide on the wiki and it seemed to go fine; in all it took over 30 hours. There were a couple of small CRC errors, but in the grand scheme of things, and knowing my side panel is off and cables are dangling, I didn't think much of that. What's a couple of CRCs in a sea of billions?

I then decided the best bet would be to run a parity check and make sure everything reported as fine. Towards the latter part of the check I suddenly got millions of errors: 244 million sync errors were shown. I tried making a diagnostics file and couldn't; it failed with out of memory. I then rebooted the server (I know, bad move at that point; I was tired and not thinking, and I wasn't able to make a diagnostics file anyway).

So, after that nasty scare of 244M errors, I decided to perform another check, this time without correction (after reading some posts). It found 244 errors. Not million, just 244! Great, that looked promising. These are sync errors from the parity check; I've had no errors, CRC or otherwise, reported on the disks.
I really want to expand the array and get some hot disks out, like Disk 1, which usually sits at 39C, but until parity is known good I can't do that. So I ran the parity check again, this time with corrections on; it found 269 errors, which it has supposedly corrected. Again, these are only parity sync errors; no disk errors are showing, not even in the terminal (I've not checked log files; I'm not sure what to check). The problem is, it keeps showing SOME kind of error.

I know the usual response is to create a diagnostics file and come here, which brings me to:

PART 2 - diagnostics creation failure

I've not been able to create a single diagnostics file; every single time it fails saying out of memory. I've even shut down all docker containers and reduced the memory footprint as far as it will go, and it still won't complete. The server has 8GB of RAM and the only VM running is pfSense with a 2GB limit on it. (The boot disk for Unraid is an 8GB pen stick with 7GB free.)

One thing I noticed while it was trying to create the diagnostics file was a massive, and I mean massive, number of 'sed' commands related to an old Plex Media Server. Thousands of them. Is there any way to clear that out so a diagnostics file can be made? I think it's this that's causing it to go OOM.

Right now the server hasn't been rebooted since the parity checks that showed 244 errors and then 269 errors.

Any thoughts would be welcomed on how I can fix Part 2 so I can produce the files for someone to help me with checking Part 1. Thank you very much in advance.
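Since the diagnostics run keeps going OOM, a first step is finding whatever is feeding it too much data. Here is a minimal sketch for spotting oversized files under a log directory before running diagnostics; the 1 MiB threshold and the /var/log default are my assumptions, not anything Unraid-specific, and it assumes GNU find/du/sort:

```shell
# list_big_logs: print files larger than 1 MiB under a directory,
# biggest first, so an oversized syslog stands out before the
# diagnostics collector tries to chew through it.
list_big_logs() {
    find "${1:-/var/log}" -type f -size +1M -exec du -h {} + 2>/dev/null \
        | sort -rh | head -n 10
}

list_big_logs /var/log
```

Anything in the tens of megabytes near the top of that list is a likely candidate for whatever the diagnostics script is choking on.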
JorgeB Posted June 5, 2023

4 minutes ago, Drumsk8 said:
Towards the latter part of the check I suddenly got millions of errors: 244 million sync errors were shown

This is likely a known issue where the new parity disk doesn't get correctly zeroed during the swap; errors would start after the 2TB mark. Still finding some errors after that, and not always the same number, is usually hardware related, most often RAM. Assuming no ECC, start by running memtest.
Drumsk8 Posted June 5, 2023

Awesome, thank you. I was hoping it was something like that with the sync errors, as they appeared after the 2TB mark. OK, I'll shut it down and perform a memory test ASAP.
Drumsk8 Posted June 7, 2023

Memtest didn't result in any errors. So I stripped the system, cleaned it, rewired all the drives, and rebuilt the disk with no problems. So far it's looking promising: full parity and all drives green. Thank you for the clarifications; that restored my confidence.

As for Part 2: I am still unable to create a diagnostics file, which could one day prove dire in a time of need. I see a ton of entries like the ones below, and left to its own devices the process eventually goes OOM every time. I took a look in /var/log and, except for a syslog at 23MB, all other logs looked to be well under 1MB.

Examples of the mass of entries on diagnostics creation:

sed -i 's/\/mnt\/disk3\/appdata\/Plex Media Server OLD\/Metadata\/Albums\/0\/34ae6e365d68df8b3efeeb368f7f6a5710c8e76.bundle\/Contents\/_combined\/posters\/com.plexapp.agents.localmedia_6522b776bf373a371f65cd411e5860bb763ca5a0/\/\/..0\/.../g' '/galaxy-diagnostics-20230607-0254/logs/syslog.txt' 2>/dev/null
sed -i 's/\/mnt\/disk3\/appdata\/Plex Media Server OLD\/Metadata\/Albums\/0\/34ae6e365d68df8b3efeeb368f7f6a5710c8e76.bundle\/Contents\/_combined\/posters\/com.plexapp.agents.localmedia_f49da28a7b1ce5c4f1264891a37c817e59040a81/\/\/..1\/.../g' '/galaxy-diagnostics-20230607-0254/logs/syslog.txt' 2>/dev/null
sed -i 's/\/mnt\/disk3\/appdata\/Plex Media Server OLD\/Metadata\/Albums\/0\/1a7c38862cfa6aa132705da4dac2e039c17c106.bundle\/Contents\/_combined\/tracks\/dafa2ccf401001ba77ade2684cc0db5250a4f74f\/lyrics\/com.plexapp.agents.lyricfind_20f6d18399ccbd0b8b6305e2df0987f0c7e88f7c/\/\/..c\/.../g' '/galaxy-diagnostics-20230607-0254/logs/syslog.txt' 2>/dev/null
sed -i 's/\/mnt\/disk3\/appdata\/Plex Media Server OLD\/Metadata\/Albums\/0\/1a7c38862cfa6aa132705da4dac2e039c17c106.bundle\/Contents\/_combined\/posters\/com.plexapp.agents.localmedia_406cb3ba28592029570a4bf95b64438ee3b56dc3/\/\/..3\/.../g' '/galaxy-diagnostics-20230607-0254/logs/syslog.txt' 2>/dev/null

I don't know why it's displaying these...
Could you enlighten me as to where I might look and safely purge to stop this? I am sure these are relics of the past; this system has been in service for almost a decade. Thank you very much.
itimpi Posted June 7, 2023

4 hours ago, Drumsk8 said:
Memtest didn't result in any errors. [...] I am still unable to create a diagnostics file [...] I see a ton of entries like this and left to its own devices it eventually goes OOM every time. I took a look in /var/log and, except for a syslog at 23MB, all other logs looked to be well under 1MB. [...]

Exactly which log is the 23MB one you mention? That sounds rather large for a log. The code looks like it is trying to anonymize entries, created by Plex Media Server, in the syslog included in the diagnostics. The solution would be to stop Plex generating all those entries in the first place; it sounds as if your Plex is set to do a lot more logging than is normal.
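For context on why there are thousands of sed commands: an anonymizer of this kind plausibly works like the sketch below, with one substitution pass over the copied syslog per recorded path. This is a guess at the general mechanism, not the actual Unraid diagnostics code; the function name and the REDACTED placeholder are made up for illustration.

```shell
# Rough sketch of a path anonymizer: one sed substitution per recorded
# path, run over a copy of the syslog. Thousands of stale paths mean
# thousands of sed invocations, which matches what the diagnostics
# run was printing.
anonymize_log() {   # $1 = file listing one path per line, $2 = log to scrub
    while IFS= read -r p; do
        # escape characters that are special in a sed pattern or delimiter
        escaped=$(printf '%s' "$p" | sed 's|[][\\/.*^$&]|\\&|g')
        sed -i "s/$escaped/REDACTED/g" "$2"
    done < "$1"
}
```

Under that model, purging the stale Plex paths from wherever they are recorded would directly shrink the number of sed passes.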
Drumsk8 Posted June 10, 2023

Hello, sorry for the delay; I am having to firefight a few different issues on different fronts and prioritise what gets my attention.

Prior to my last report I took a look at syslog, which had just rotated at 23MB. However, what I found in the latest log was disturbing: /dev/md4 is having problems (I've not looked at the old log file yet). If I perform an xfs_repair -n of /dev/md4 when mounted in maintenance mode, I get a lot of errors:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being ignored because the -n option was used. Expect spurious inconsistencies which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
Metadata CRC error detected at 0x44108d, xfs_cntbt block 0x10/0x1000
btree block 0/2 is suspect, error -74
invalid start block 214091987 in record 32 of cnt btree block 0/2
out-of-order cnt btree record 34 (45912643 1) block 0/2
out-of-order cnt btree record 35 (45923253 1) block 0/2
out-of-order cnt btree record 36 (45923265 1) block 0/2
out-of-order cnt btree record 37 (45923269 1) block 0/2
out-of-order cnt btree record 38 (45923276 1) block 0/2
out-of-order cnt btree record 39 (45923284 1) block 0/2
out-of-order cnt btree record 40 (45923316 1) block 0/2
....
....
block (3,58740174-58745177) multiply claimed by cnt space tree, state - 2
block (3,53539583-53540086) multiply claimed by cnt space tree, state - 2
block (3,58833129-58835080) multiply claimed by cnt space tree, state - 2
block (3,35192822-35194373) multiply claimed by cnt space tree, state - 2
block (3,52066506-52070095) multiply claimed by cnt space tree, state - 2
block (3,60157998-60161599) multiply claimed by cnt space tree, state - 2
block (3,60869497-60870526) multiply claimed by cnt space tree, state - 2
agf_freeblks 17916042, counted 17912735 in ag 3
agf_freeblks 17251174, counted 17251421 in ag 0
inode chunk claims used block, inobt block - agno 0, bno 45424960, inopb 8
sb_ifree 8643, counted 8495
sb_fdblocks 301636648, counted 301387426
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
found inodes not in the inode allocation tree
        - process known inodes and perform inode discovery...
        - agno = 0
data fork in ino 363388269 claims free block 45425173
data fork in ino 363388277 claims free block 45425040
data fork in ino 363393040 claims free block 45424984
        - agno = 1
        - agno = 2
        - agno = 3
data fork in ino 3656296609 claims free block 466087763
data fork in ino 3656296609 claims free block 466113263
data fork in ino 3656296610 claims free block 466153275
data fork in ino 3656296610 claims free block 466176739
data fork in ino 3656296612 claims free block 465895311
....
....
data fork in ino 3672161813 claims free block 465928067
data fork in ino 3672161813 claims free block 465953567
        - agno = 4
data fork in ino 4294967394 claims free block 536870932
imap claims in-use inode 4294967394 is free, would correct imap
data fork in ino 4294967420 claims free block 536871043
imap claims in-use inode 4294967420 is free, would correct imap
....
....
imap claims in-use inode 4294967443 is free, would correct imap
        - agno = 5
imap claims a free inode 5368709218 is in use, would correct imap and clear inode
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
free space (0,45424942-45424942) only seen by one free space btree
free space (0,45424960-45424983) only seen by one free space btree
free space (0,45425024-45425039) only seen by one free space btree
free space (0,45425172-45425172) only seen by one free space btree
free space (3,105029022-105031068) only seen by one free space btree
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 5
        - agno = 3
        - agno = 1
entry ".." at block 0 offset 80 in directory inode 2149391459 references non-existent inode 399881886
entry "c08248f161df429e8242b575f5e55e59.mkv" in shortform directory 5368709217 references free inode 5368709218
would have junked entry "c08248f161df429e8242b575f5e55e59.mkv" in directory inode 5368709217
would have corrected i8 count in directory 5368709217 from 1 to 0
        - agno = 4
entry "Music" in shortform directory 96 references non-existent inode 367278512
....
....
entry "Season 3" in shortform directory 1777277169 references non-existent inode 373591522
would have junked entry "Season 3" in directory inode 1777277169
entry "Audio Record" in shortform directory 1780371851 references non-existent inode 393556912
would have junked entry "Audio Record" in directory inode 1780371851
entry "Season 3" in shortform directory 1991665187 references non-existent inode 417382242
would have junked entry "Season 3" in directory inode 1991665187
No modify flag set, skipping phase 5
Inode allocation btrees are too corrupted, skipping phases 6 and 7
No modify flag set, skipping filesystem flush and exiting.

As you can see, there are a ton of mismatch problems showing. So far an extended SMART test of that drive indicates it's fine.
The drive was only just rebuilt, from a previous 2TB data drive onto a 3TB data drive, now that both parity drives are 3TB (it wasn't precleared beforehand, just an NTFS blank).

I'm not exactly sure what to do at this stage. I have the array in maintenance mode, but I daren't start it. It's a catch-22: I need to start the array to check that the data locations it's referencing are OK, but starting it fully will kick off a parity check, which with this XFS corruption could cause more problems.

I've not been able to sort out the over-reporting Plex logs, as I can't exactly start my array at the moment. Although trying to create a diagnostics file now, with docker down, still spits out this massive number of Plex items, so they've got to be stored somewhere and need purging. I am currently working on a grep search of all files for one of the patterns so I can find what needs purging.

Any advice from the unRAID sages would be greatly welcomed. Thank you.

Edited June 11, 2023 by Drumsk8
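The grep search described above could look something like this sketch. The directory and pattern in the example usage are taken from the thread but are only guesses at where the stale entries might live, not confirmed locations:

```shell
# find_pattern: list files under a directory tree that mention a fixed
# string, e.g. the stale "Plex Media Server OLD" paths, so whatever is
# feeding the diagnostics anonymizer can be located and purged.
find_pattern() {    # $1 = directory to search, $2 = fixed string to look for
    grep -rlF -- "$2" "$1" 2>/dev/null
}

# e.g. find_pattern /var/log 'Plex Media Server OLD'
```

-F treats the pattern as a fixed string, which matters here because the Plex paths are full of dots and slashes that would otherwise be interpreted as regex syntax.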
Drumsk8 Posted June 10, 2023

This is the latest syslog I've extracted; the other one, galaxy-syslog-cleaned.txt, is syslog.1, so prior to rotation. I've extracted out all but a few mover lines at the end and cut the rest, as it's unimportant and potentially sensitive. The second log shows where I had disk 4 removed, then populated on reboot with this new drive.

I've also attached the SMART report for disk 4. It shows as passed but with a ton of raw read errors; I have no idea if those are old or current. I don't have an LSI card, although I have one on order now, but who knows when that will turn up. If I need to check this drive's cable I could shut the server down and reseat it, but what's the best way to bring it back up safely (maintenance mode?) and what test should I perform, an extended SMART?

I've still not tracked down these spurious old Plex replace(s), so I can't get a sanitized diagnostics file out for you. Thank you.

syslog.txt
galaxy-syslog-cleaned.txt
ST3000DM001-1ER166_Z502TA2T-20230610-2051 disk4 (sdc).txt
Drumsk8 Posted June 11, 2023

Just to say that there have been zero errors shown within the unRAID GUI for any disk, including disk 4 during its rebuild. So this XFS corruption is a bit odd. Is it possible that it's caused by mover? I.e. the disk was rebuilding but mover kicked in and did its thing, so some of these inodes are now incorrect and I could safely flush the XFS journal. Or do you think it's safe enough to attempt re-mounting (turning off the automatic parity check on boot for good measure)?

Also, it's mover that's logging to syslog: it reported it couldn't move a silly high number of files, which ballooned the log to 23MB. Here's the end of galaxy-syslog-cleaned, aka syslog.1:

Jun 7 02:28:28 Galaxy move: file: /mnt/disk5/appdata/plex/Library/Application Support/Plex Media Server/Metadata/TV Shows/9/61742c53f8e7f308d780334e2436d270fcfe268.bundle/Contents/_combined/themes/com.plexapp.agents.plexthememusic_4726015063b6654c4c3f5211c86841e1ad7588de
Jun 7 02:28:28 Galaxy move: move_object: /mnt/disk5/appdata/plex/Library/Application Support/Plex Media Server/Metadata/TV Shows/9/61742c53f8e7f308d780334e2436d270fcfe268.bundle/Contents/_combined/themes/com.plexapp.agents.plexthememusic_4726015063b6654c4c3f5211c86841e1ad7588de No such file or directory
Jun 7 02:28:28 Galaxy root: mover: finished
Jun 7 02:50:08 Galaxy kernel: md: sync done. time=25834sec
Jun 7 02:50:08 Galaxy kernel: md: recovery thread: exit status: 0
Jun 7 03:38:11 Galaxy autofan: Highest disk temp is 36C, adjusting fan speed from: 74 (29% @ 906rpm) to: 52 (20% @ 609rpm)
Jun 7 03:38:12 Galaxy autofan: Highest disk temp is 36C, adjusting fan speed from: 79 (30% @ 906rpm) to: 57 (22% @ 609rpm)
Jun 7 03:59:19 Galaxy kernel: docker0: port 2(vethdeac530) entered blocking state
Jun 7 03:59:19 Galaxy kernel: docker0: port 2(vethdeac530) entered disabled state
Jun 7 03:59:19 Galaxy kernel: device vethdeac530 entered promiscuous mode
Jun 7 03:59:23 Galaxy kernel: docker0: port 3(veth5305e2e) entered blocking state
Jun 7 03:59:23 Galaxy kernel: docker0: port 3(veth5305e2e) entered disabled state
Jun 7 03:59:23 Galaxy kernel: device veth5305e2e entered promiscuous mode
Jun 7 03:59:23 Galaxy kernel: eth0: renamed from vetha3a7a6d
Jun 7 03:59:23 Galaxy kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethdeac530: link becomes ready
Jun 7 03:59:23 Galaxy kernel: docker0: port 2(vethdeac530) entered blocking state
Jun 7 03:59:23 Galaxy kernel: docker0: port 2(vethdeac530) entered forwarding state
Jun 7 03:59:28 Galaxy kernel: eth0: renamed from veth11d28ad
Jun 7 03:59:28 Galaxy kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth5305e2e: link becomes ready
Jun 7 03:59:28 Galaxy kernel: docker0: port 3(veth5305e2e) entered blocking state
Jun 7 03:59:28 Galaxy kernel: docker0: port 3(veth5305e2e) entered forwarding state
itimpi Posted June 11, 2023

8 hours ago, Drumsk8 said:
So this XFS corruption is a bit odd. Is it possible that it's caused by mover?

Not really. The rebuild process is clever enough to handle writes being made to the disk being rebuilt and keeps everything correctly in sync.
JorgeB Posted June 11, 2023

13 hours ago, Drumsk8 said:
Not exactly sure what to do at this stage

You need to run xfs_repair without -n to fix that filesystem.
Drumsk8 Posted June 11, 2023

root@Galaxy:~# xfs_repair /dev/md4
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_repair. If you are unable to mount the filesystem, then use the -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

The issue is the unwritten journal, so I have 4 options:

1) Start the array fully so the drives are properly mounted, giving md4 a chance to clear its journal (which itself could be corrupted). I will need to make sure a parity check doesn't start, as I need to clear the XFS corruption before any parity check.

2) Manually mount the drive at some other mount point, like a /dev/fixmd4, let it try to clear the journal, and then proceed to xfs_repair. (I've done this with a single disk, but for a disk that's part of an array I don't know if this could break things?)

3) Cut my losses, wipe the journal, and let xfs_repair do its job. Then start the array and do a full parity check.

4) Alternatively: I still have the original disk 4, and I have a couple of spare drives. I could switch out the drive with the bad filesystem and let unRAID rebuild a new drive from parity, giving me the original, the corrupted one, and a new version. If I end up having to go gathering data off disks to repair folders, I could do so based on the state of the previous two drives (given I clear or wipe the journal on the broken one).

If there's no harm in trying option 2, that would be my choice (or option 4); otherwise option 3 is probably the easiest path to recovery, and hopefully known-good parity can resolve issues afterwards. Whichever is least likely to cause data corruption.
At this stage, wouldn't option 4 turn out to be safer IF the parity is good? Worst case it just puts me back in the same position, but with an original disk 4 (known good) and two broken-filesystem ones. Thank you very much!

Edited June 11, 2023 by Drumsk8
Solution
JorgeB Posted June 12, 2023

Use -L; usually it's fine, and if you still have the original disk you can compare the data afterwards if needed.
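For anyone landing here later, the escalation discussed across these posts boils down to a short command sequence. This is a dry-run sketch that only prints the steps (the device name is the one from this thread); drop the echo to run them for real from the console with the array started in maintenance mode:

```shell
# repair_plan: print the xfs_repair escalation path for one md device.
# Step 3 (-L) zeroes the journal and is the last resort, as discussed
# above; only reach for it when the log cannot be replayed by mounting.
repair_plan() {     # $1 = array device, e.g. /dev/md4
    echo "xfs_repair -n $1   # 1) read-only check, makes no changes"
    echo "xfs_repair $1      # 2) real repair; needs a replayable log"
    echo "xfs_repair -L $1   # 3) zero the log first, then repair"
}

repair_plan /dev/md4
```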
Drumsk8 Posted June 13, 2023

OK, I dumped the journal and ran the repair, and that seemed to go fine. I've just done a parity check with correction and it repaired 6 items, so I'd say that's pretty successful. Given the heat I'd rather not push another parity check right away (some drives are at 44C); I'll run one again in a week or so, just to make sure everything is in sync.

As for Part 2, which was blocking me from creating diagnostics files: a 23MB syslog full of broken mover commands was likely the culprit.

I'll give this a few more days to settle and show no lingering issues, then update this as complete. Thank you both for helping me navigate my choices. I'm exceptionally grateful!
Drumsk8 Posted June 20, 2023

I've just done a full parity check and there were 0 errors. Again, thank you for your assistance!