Garbonzo Posted September 23

I am on the 6.12.4 release. First, I am using shucked Seagate 8TB drives from Costco (a bad habit I started years ago). I replaced Parity and Disk4 earlier this year. I had a bad cable off my HBA that led to some CRC errors on a few drives, but that resolved when the cable was replaced (I believe, since the counts haven't increased since). Now I am getting a 187 Reported_Uncorrect error, which IIRC can be a bad sign.

My plan is to start moving to larger/enterprise drives (or at least NOT more SMR drives), starting with parity, but I haven't had the time + money to sit down and research deals (so if anyone has GOOD info on that front, TIA!). So, for now I am trying to determine the best path forward. I have another 8TB or three sitting in an old Dell T7500 I am messing with TrueNAS Scale on. They are as old or older than the others and carry many of the CRC errors from the bad cables mentioned above, but the counts haven't increased since moving them, and they can certainly be pulled to replace the parity drive here (or add a second one)... whatever is the best course of action. That is what I am really asking, I guess: what to do, what to do...

As a second (related) question: I had a bad shutdown last night, and after about 20 minutes I hard reset and it started a parity check, which is all normal behavior. However, the last two times it has run a parity check it has found errors (today around 2476), and nowhere that I can see does it tell me whether the errors were fixed. Do I need to do anything (like start a parity check manually with the correction box checked)? But that error count of 55 confused me; I'm not sure if those are new errors since the 2476 or in addition to them. I just need some clarification on that part (it might be helpful if it were more explicit about whether errors are being repaired on the first pass OR whether you need to do something after the check completes). I guess I am wondering if I STILL have all of the errors because I didn't correct them the last time this happened (the first time I got errors on a parity check), or if these are NEW errors since that time.

And here are the diags: ezra-diagnostics-20230923-1552.zip

I really appreciate the help, and thanks in advance.

-G
Frank1940 Posted September 23

You have to run a correcting parity check to fix those 2476 errors. (To double-check that all is well, you might want to run a non-correcting one after that.) If the correcting one finds more than 2476 errors, I would be really concerned at that point.

I would not care that much if the drives are older by several hundred hours as long as they are healthy. (CRC errors are not an indication of disk health! 99.99% of all CRC errors are caused by something besides the hard drive.)

Personally, I would get that parity drive out of my server. I might want to test it by doing a couple of preclear cycles on it or a full SMART test on it. (You could do this with the Unassigned Devices plugin if you have space in your server for another drive.) Of course, it is over 3.5 years old...

26 minutes ago, Garbonzo said: But that error count of 55 confused me

I'm confused, too. What 55 errors?
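For what it's worth, the same checks can also be started from the command line. This is only a rough sketch, assuming Unraid's standard mdcmd interface; the webGUI "Check" button with the "Write corrections to parity" box ticked does the same thing, and the exact syntax is worth double-checking against your release:

    # Correcting parity check (writes corrections to parity as it goes)
    /usr/local/sbin/mdcmd check CORRECT

    # Non-correcting follow-up check, to confirm the error count comes back as zero
    /usr/local/sbin/mdcmd check NOCORRECT

    # Progress and sync-error counters while a check is running
    /usr/local/sbin/mdcmd status | egrep "mdResync|sbSync"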
JorgeB Posted September 24 (Solution)

Run an extended SMART test on parity; you also need to check the filesystem on disk4.
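For anyone following along, the command-line equivalents look roughly like this. It is a sketch, not gospel: sdX is a placeholder for whatever the parity drive really is, the md device name for disk4 varies between Unraid releases, and both steps can be run from the webGUI instead (the SMART test from the disk's attributes page, xfs_repair from the disk's "Check Filesystem" section with the array started in maintenance mode):

    # Extended (long) SMART self-test on the parity drive; replace sdX with the actual device
    smartctl -t long /dev/sdX
    # The test runs in the background for many hours; read the result afterwards with:
    smartctl -a /dev/sdX

    # Filesystem check on disk4 with the array in maintenance mode.
    # -n is a read-only dry run; drop the -n to let it actually repair.
    xfs_repair -n /dev/md4p1    # on older releases the device is /dev/md4
    xfs_repair /dev/md4p1

    # After a real repair, check for orphaned files once the disk is mounted again
    ls -l /mnt/disk4/lost+found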
Garbonzo Posted September 24

15 hours ago, Frank1940 said: You have to run a correcting parity check to fix those 2476 errors. [...] I'm confused, too. What 55 errors?

I was talking about the 55 listed in the error column on the Array Devices tab, but looking back at my email alerts from the previous month with errors, there were 3000+, so it was a different "event". I REALLY thought I had replaced parity when I pulled the parity2 drive that was in there...

It may be a week or two until I research (and get a little more coin together for) something like 16TB+ enterprise drives, at minimum a single drive to start with parity. SO, in the meantime, I think I will pull two of these 8TB drives from the TrueNAS box and get them running preclears in Unraid, then rebuild them as parity/parity2 (not sure if I can do that in one step)... unless replacing disk4 is a better option (although these are going to be in that 3+ year range as well), since it looks like it was disk5 I replaced earlier this year. I'd love to be on something like 4 x 20TB drives by end of year if possible, so I can feel SOLID about this media server and use the other drives to learn new stuff with.

I just have so many issues to resolve on this Unraid server. The macvlan-to-ipvlan change has been confusing for me: by the time I had read and re-read how to set up ipvlan with multiple NICs, it seemed to be fixed in 6.12.4 (yet Fix Common Problems still gripes about it). But mainly I need to track down why accessing this server has seemed SO LAGGY for the last few months, from both wired and wireless connections. It's possible that it's network related, but learning Wireshark at the same time has me moving VERY slowly in figuring out what is going on. Here is all I REALLY know: when I first upgraded/migrated from dual Xeon X5680s with 72GB RAM to an AMD Ryzen 7 1700 with 32GB, it seemed to perform with a similar "feel" to the old server for a few months, then got worse. I always planned to upgrade to a 5000-series CPU (I never even intended on the 1700, it just came with the motherboard I bought), but I don't want to throw money at a problem ATM if that isn't the main issue.

I will start the correcting parity check now and move on from there... oh, and the filesystem check on disk4 and the extended SMART test @JorgeB recommended as well, and I guess posting the extended SMART results would be required.

Again, thanks for your help. I will report back and/or mark solutions afterwards.

-G
Garbonzo Posted September 24 (edited)

6 hours ago, JorgeB said: Run an extended SMART test on parity, also need to check filesystem on disk4.

As a side question, was there anything (besides studying the logs on a regular basis) that would have clued me in to the disk4 situation? You have always been helpful with pointing out this kind of thing whenever I post diagnostics, but I feel like maybe I should have seen a warning (similar to HDD temps or something), though I could be wrong. Anyway, thanks for pointing it out; I think it's fixed, I just wanted to know how to catch these things if/as they occur. I am not sure if that is just diligence (scanning the log for a keyword or two on a regular basis) or some other plugin or setting I am missing... that type of question.

cheers, -g

EDIT: Ok, so not OK, disk4 still shows (after removing the -n):

Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 10:29:34: zeroing log - 521728 of 521728 blocks done
        - scan filesystem freespace and inode maps...
        - 10:29:36: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - 10:29:36: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 30 - agno = 15 - agno = 0 - agno = 16 - agno = 31 - agno = 1 - agno = 17 - agno = 18 - agno = 2 - agno = 19 - agno = 3 - agno = 20 - agno = 4 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14
        - 10:29:56: process known inodes and inode discovery - 134912 of 134816 inodes done
        - process newly discovered inodes...
        - 10:29:56: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 10:29:56: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0 - agno = 3 - agno = 8 - agno = 12 - agno = 15 - agno = 6 - agno = 7 - agno = 1 - agno = 10 - agno = 9 - agno = 11 - agno = 13 - agno = 14 - agno = 2 - agno = 5 - agno = 4 - agno = 16 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
        - 10:29:56: check for inodes claiming duplicate blocks - 134912 of 134816 inodes done
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
        - 10:30:13: verify and correct link counts - 32 of 32 allocation groups done
No modify flag set, skipping filesystem flush and exiting.

I really want to understand... I need to learn more about file systems (that aren't NTFS) as well. I need to find some good videos on this type of stuff, as reading about some of it takes me a lot longer to grok the situation.
Based on other threads, I'm unsure if there are additional flags I should use for xfs_repair OR if I need to rebuild the drive (that may only apply if the drive is disabled; this one is mounting fine ATM). So ALSO: ORDER OF OPERATIONS, considering that replacing the parity drive is playing into this game.

Again, a sincere thanks for the help. (As a side note, disk5 has some agno lines listed as well, but disks 1-3 seem fine.)

-g

Edited September 24 by Garbonzo: new info
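On the side question of catching this kind of thing sooner: besides the built-in notifications and the Fix Common Problems plugin, one low-tech option is to grep the syslog for filesystem and I/O complaints every so often. A rough sketch, assuming the standard Unraid syslog location; the exact messages vary, so the pattern below is just a starting point:

    # Show recent XFS / BTRFS / I/O error noise from the current boot's syslog
    egrep -i "xfs.*(corrupt|error)|btrfs.*(corrupt|error)|i/o error|call trace" /var/log/syslog | tail -n 50

The same line can be dropped into a User Scripts plugin job on a daily schedule so it does not depend on remembering to look.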
JorgeB Posted September 25

17 hours ago, Garbonzo said: Ok, so not OK, disk4 still shows (after removing the -n)

That output is with -n; post the output without -n.
Garbonzo Posted September 25

I had just run it without the -n; this was the first run after. Next time I can take it down I will re-run it and see what it looks like. Thanks, I will try to get to it tomorrow.

-g
Garbonzo Posted September 26 (edited)

On 9/25/2023 at 3:29 AM, JorgeB said: That output is with -n, post without -n.

Sorry about the formatting (or lack thereof):

Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 09:18:57: zeroing log - 521728 of 521728 blocks done
        - scan filesystem freespace and inode maps...
        - 09:19:00: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 09:19:00: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0 - agno = 15 - agno = 30 - agno = 31 - agno = 1 - agno = 16 - agno = 2 - agno = 17 - agno = 18 - agno = 3 - agno = 19 - agno = 4 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14
        - 09:19:22: process known inodes and inode discovery - 135040 of 134944 inodes done
        - process newly discovered inodes...
        - 09:19:22: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 09:19:22: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0 - agno = 2 - agno = 1 - agno = 12 - agno = 13 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 3 - agno = 4 - agno = 16 - agno = 14 - agno = 5 - agno = 15 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
        - 09:19:22: check for inodes claiming duplicate blocks - 135040 of 134944 inodes done
Phase 5 - rebuild AG headers and trees...
        - 09:19:27: rebuild AG headers and trees - 32 of 32 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 09:19:45: verify and correct link counts - 32 of 32 allocation groups done
done

So I can assume there are some other flags or something I need to include, but nothing sticks out to me looking at that... other than some things being ambiguous, like "verify" instead of "verifying" and "rebuild" instead of "rebuilding". It seems like it may be telling ME to do that, but since some lines have timestamps, it appears that is just what the process is doing.

-g

Edited September 26 by Garbonzo: formatting (and followup)
Frank1940 Posted September 26 (edited)

You can force the more traditional formatting of the output of CLI commands by selecting the text you have just copied into the edit window and clicking on the <Code> formatting tool. (The icon in the toolbar is the tool you are looking for...)

Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - 09:18:57: zeroing log - 521728 of 521728 blocks done
        - scan filesystem freespace and inode maps...
        - 09:19:00: scanning filesystem freespace - 32 of 32 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 09:19:00: scanning agi unlinked lists - 32 of 32 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0 - agno = 15 - agno = 30 - agno = 31 - agno = 1 - agno = 16 - agno = 2 - agno = 17 - agno = 18 - agno = 3 - agno = 19 - agno = 4 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 5 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 12 - agno = 13 - agno = 14
        - 09:19:22: process known inodes and inode discovery - 135040 of 134944 inodes done
        - process newly discovered inodes...
        - 09:19:22: process newly discovered inodes - 32 of 32 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 09:19:22: setting up duplicate extent list - 32 of 32 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0 - agno = 2 - agno = 1 - agno = 12 - agno = 13 - agno = 6 - agno = 7 - agno = 8 - agno = 9 - agno = 10 - agno = 11 - agno = 3 - agno = 4 - agno = 16 - agno = 14 - agno = 5 - agno = 15 - agno = 17 - agno = 18 - agno = 19 - agno = 20 - agno = 21 - agno = 22 - agno = 23 - agno = 24 - agno = 25 - agno = 26 - agno = 27 - agno = 28 - agno = 29 - agno = 30 - agno = 31
        - 09:19:22: check for inodes claiming duplicate blocks - 135040 of 134944 inodes done
Phase 5 - rebuild AG headers and trees...
        - 09:19:27: rebuild AG headers and trees - 32 of 32 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 09:19:45: verify and correct link counts - 32 of 32 allocation groups done
done

Edited September 26 by Frank1940
Garbonzo Posted September 26

Ok, these were just the only drives that showed the "agno = x" lines, so I wasn't 100% sure... but if that looks OK, I just wanted thoughts on adding a second parity drive (it would be another drive of similar age/use) to cover me a little more, unless there is some downside to doing that (other than my having to dismantle my TrueNAS test box, but that's fine). It should give me extra protection until I can physically replace the bad drive with a larger/better drive.

Thanks for everything guys, really!

-G
JorgeB Posted September 27

A 2nd parity drive always adds some extra redundancy.
Garbonzo Posted September 27

Right! I guess I just had to process the fact that having 2 parity drives protects against 2 failures, INCLUDING one of the parity drives (which could happen to me). SO, as long as it covers that case, it would be the upside I need right now.

Thanks again.
Garbonzo Posted November 16

Sorry to reuse a solved issue, but it is the conclusion (hopefully) and maybe it will help someone else. I have replaced a parity drive in the past; I followed a video from Spaceinvaderone and it went fine, as far as I recall.

Here is where I am today. I finally bought 2 x 14TB drives to start upgrading past the 8TB drives I am using. I am using shucked drives again, but I have my reasons for that, unfortunately. I was originally thinking I'd replace the parity drive and either add or replace one of the other drives (so I can maybe use the 8s I pull in my TrueNAS experiment). I realize that it is still basically "a spin of the wheel" as to which drive will die first, but if there are any insights the log can give as to which data drive to pull (besides the parity) from the array, it is over my head, so any assistance is welcomed. I hate to @ people, but @JorgeB has been helpful in the past, and I didn't want to PM in case this info helps someone else find their way to a solution of sorts.

It seems like I should be able to leave the 8TB parity in place and add one of the 14TBs as parity2, then remove the first parity drive (although if parity2 stays named that way, it will mess with my OCD for sure). But I don't recall if there was a reason that wasn't/isn't the way to go. Then I would either just add the other 14TB to the main array and leave it, or remove the WORST 8TB offender <- please advise which that "could" be.

I will re-watch the videos I used last time (or newer ones if they are out there) prior to attempting this, but since the two drives are going to finish preclearing today, I may have time to shuck them and put them in this evening. I just wanted to get some experienced perspectives on the best approach in case I am missing something obvious. Again, I am sorry to throw this out there in such a vague way, but I will be doing some more research prior to pulling the trigger. If I can't get to it tonight, it will probably be another week or more, so I was trying to preemptively get some assistance about order of operations, etc. TIA!

-G

ezra-diagnostics-20231116-0854.zip
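On the "worst offender" question, the SMART attributes most people weigh for a decision like this are Reallocated_Sector_Ct (5), Reported_Uncorrect (187), Current_Pending_Sector (197) and Offline_Uncorrectable (198); anything non-zero and still climbing is a reasonable candidate to pull first, while CRC errors (199) usually just point at cabling. A hedged sketch to dump those counters for every drive; the /dev/sd? names are examples and the list will differ per system:

    # Print the key health attributes for each SATA drive in the box
    for d in /dev/sd?; do
      echo "== $d =="
      smartctl -A "$d" | egrep "Reallocated_Sector|Reported_Uncorrect|Current_Pending|Offline_Uncorrectable|UDMA_CRC"
    done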
Frank1940 Posted November 16

Your Parity2 disk is not shown in the SMART directory...
itimpi Posted November 16

@Garbonzo parity1 and parity2 are not interchangeable, so you cannot later change it to parity1 without recalculating its contents.

If, as you say, your OCD could kick in with having parity2 and no parity1, an alternative might be to simply change parity1 to be a 14TB drive and keep the old 8TB parity drive intact until the parity build onto the new drive successfully completes, as that gives a regression path if another drive in the array fails while rebuilding parity. On completion of the rebuild of parity1 onto the 14TB drive you can proceed as normal with making changes to the array drives, knowing you are protected against any one drive failing while doing that.
Garbonzo Posted November 16

8 hours ago, itimpi said: @Garbonzo parity1 and parity2 are not interchangeable so you cannot later change it to parity1 without recalculating its contents. [...]

Yep, I actually had forgotten that, thank you... and the way you describe doing things seems the most practical as well. Last time I was swapping a dead drive; this time I will have that peace of mind from keeping the old drive. So thanks for the help on that. I was also thinking I may just add the second 14TB to the array and leave the other drives, so I have more room to move things around for the time being.

8 hours ago, Frank1940 said: Your Parity2 disk is not shown in the SMART directory...

There currently is no parity2. I had two in the past but had to pull one for another use; sorry if I was confusing.