AinzOolGown Posted September 24, 2023 (edited)

Hi there,

For the past few days (after 4 years of solid running), my Docker containers have been stopping one after another, and it's very hard to get everything running properly again (the same errors come back shortly afterwards :/).

When I try to restart them, Unraid shows an "Execution Error", so that's a dead end.

I searched and found some posts pointing to bad RAM, so I ran MemTest86 twice (parallel and non-parallel), and both runs PASSed.

I also found, for example, this topic that talks about a full cache, but that doesn't seem to be the case for me either, so I'm a little lost.

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/mapper/sdk1].write_io_errs 192912
[/dev/mapper/sdk1].read_io_errs 145891
[/dev/mapper/sdk1].flush_io_errs 1086
[/dev/mapper/sdk1].corruption_errs 628
[/dev/mapper/sdk1].generation_errs 0
[/dev/mapper/sdj1].write_io_errs 0
[/dev/mapper/sdj1].read_io_errs 0
[/dev/mapper/sdj1].flush_io_errs 0
[/dev/mapper/sdj1].corruption_errs 0
[/dev/mapper/sdj1].generation_errs 0
[/dev/mapper/sdl1].write_io_errs 0
[/dev/mapper/sdl1].read_io_errs 0
[/dev/mapper/sdl1].flush_io_errs 0
[/dev/mapper/sdl1].corruption_errs 0
[/dev/mapper/sdl1].generation_errs 0

I'm attaching diagnostics and some logs. Could anyone point me to where I should head next to fix this issue, please?

Thank you very much in advance!

logs.txt
tower-diagnostics-20230924-1631.zip

Edited October 29, 2023 by AinzOolGown (typo)
Squid Posted September 24, 2023

Not necessarily your problem, but corruption issues are generally related to bad RAM (as your searching has already indicated).

FWIW, your particular RAM is not on the motherboard's QVL for memory, and G.Skill doesn't list that motherboard on the memory's QVL either. Personally, I only ever buy RAM from the MB QVL for the most trouble-free experience.

But take it with a grain of salt. Just because neither the memory nor the MB vendor says that they are compatible with each other doesn't mean that they are not.
AinzOolGown Posted September 25, 2023 (Author)

Hi Squid, thank you.

Yep, I figured that out and did some testing when I built the server, but all the tests were successful, so I decided to keep the RAM.

It's been 4 years and I have never had a RAM issue since. If it really is a RAM problem, then I don't understand how it can have been fine for 4 years and suddenly not be fine anymore.

MemTest86 also tells me everything's fine 😭
JorgeB Posted September 25, 2023

20 hours ago, AinzOolGown said:
[/dev/mapper/sdk1].write_io_errs 192912
[/dev/mapper/sdk1].read_io_errs 145891
[/dev/mapper/sdk1].flush_io_errs 1086

These suggest the device dropped offline sometime in the past, see here for more info and better pool monitoring.
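The non-zero counters in the quoted output can be picked out with a small filter. This is only a minimal sketch, not something taken from the linked post: `nonzero_btrfs_stats` is a hypothetical helper name, and it assumes the `[device].counter value` layout shown in the first post.

```shell
# Hypothetical helper (not part of btrfs-progs or Unraid): read
# `btrfs dev stats` output on stdin and print only the non-zero counters.
nonzero_btrfs_stats() {
    awk 'NF == 2 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}

# Example use on the server, run as root against the pool mount point:
#   btrfs dev stats /mnt/cache | nonzero_btrfs_stats
```

Fed the output from the first post, this would print only the four non-zero sdk1 counters. The counters are cumulative, so old errors stay visible until they are reset with `btrfs dev stats -z`.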
AinzOolGown Posted September 25, 2023 (Author) (edited)

Thanks JorgeB,

That reminds me... Sorry, I forgot to tell you: just a few days before this behavior started, one disk of my array went offline for 2 or 3 days. Unraid emulated it, and I only noticed some hours later by chance. I changed a cable and rebuilt the array.

All seemed good to me, but could it in fact have corrupted the array?

If that is the cause, how can I correct it, please? I read about a "scrub" command; I have it for the cache pool but not for the data array.

Will it erase the whole cache? I won't lose any data, right?

Thank you

P.S.: I installed Squid's script from your link, thanks again!

I've included the most recent logs; it's starting to worry me. In the meantime, I got a notification: "Error on cache pool - No description".

LastLogs.txt

Edited September 25, 2023 by AinzOolGown
JorgeB Posted September 25, 2023

23 minutes ago, AinzOolGown said:
If this is it, how can i correct that please ?

See the link above.
AinzOolGown Posted September 30, 2023 (Author) (edited)

Hi,

So, I followed this process and changed all shares that use the cache to "Yes". VMs and Docker are disabled, but the mover has finished and the shares (appdata/system) remain on the cache.

I have opted to completely replace my cache drives and SATA cables.

Can I just save the remaining content elsewhere and move it back once the new cache pool is operational?

Is there a better way to move the remaining files to the array? And how do I do this the right way (to preserve permissions and such)? I have a Mac and Path Finder ready for that.

Also, I have a share with cache set to "No", but strangely the last files created on it are on the cache O_o

Thank you

Edited September 30, 2023 by AinzOolGown
JorgeB Posted September 30, 2023

1 hour ago, AinzOolGown said:
but mover have finished and shares (appdata/system) remain on it.

Enable mover logging, run the mover, post new diags.
AinzOolGown Posted September 30, 2023 (Author)

Thanks JorgeB, here it is:

tower-diagnostics-20230930-1227.zip
JorgeB Posted September 30, 2023

The filesystem is read-only, so the data cannot be moved; it can be copied, though.
AinzOolGown Posted September 30, 2023 (Author) (edited)

Can I just copy the content via SMB and do the opposite once the new drives are installed?

(Sorry, I prefer to double-check: there are some production Docker containers for my business on there, and it would be hell to lose anything.)

Edited September 30, 2023 by AinzOolGown
JorgeB Posted September 30, 2023

21 minutes ago, AinzOolGown said:
Can i just copy the content via SMB and do the opposite when the new drives will be installed ?

That should work; you can also use rsync, for example.
AinzOolGown Posted September 30, 2023 (Author) (edited)

Sorry JorgeB, rsync gives me errors too.

The command was:

rsync -a /mnt/cache/appdata/ /mnt/user/Backup-Saves/TowerDockers/Cache/appdata/

I tried a CA Backup and got the same result, it gives errors 😱

I ran a filesystem check on the cache; logs attached.

rsync errors.txt
Cache filesystem check.txt

Edited September 30, 2023 by AinzOolGown
JorgeB Posted October 1, 2023

Any corrupt files will fail to copy. You can try btrfs restore here: it will ignore the corruption and still copy those files, but they will still be corrupt.
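As a rough illustration of that suggestion, `btrfs restore` takes a source device and a destination folder. The sketch below only prints the command for review instead of running it. `print_restore_cmd` is a hypothetical helper, `/mnt/disk1/restore` is an assumed destination on a healthy array disk, and the device name is the one from the first post. It also assumes the pool is not mounted when the real command is run.

```shell
# Hypothetical dry-run helper: print the btrfs restore command for review
# instead of running it. Both arguments are placeholders to fill in.
print_restore_cmd() {
    src_dev="$1"    # e.g. /dev/mapper/sdk1, as named in the first post
    dest_dir="$2"   # assumed: an existing folder on a healthy array disk
    printf 'btrfs restore -v %s %s\n' "$src_dev" "$dest_dir"
}

print_restore_cmd /dev/mapper/sdk1 /mnt/disk1/restore
```

Printing first leaves room to double-check the device and destination before copying anything. Files recovered this way should still be compared against a known-good backup, since corrupt files are copied as they are.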
AinzOolGown Posted October 1, 2023 (Author)

Hi,

I copied what I could, deleted the cache pool, plugged in new SSDs, and recreated a new cache pool with the new SSDs assigned and new cables.

Then I switched appdata back to cache "Only" and ran a CA plugin restore.

The plugin said the restore was complete, but I got a notification saying an error occurred, and the syslog shows a lot of btrfs errors.

I switched to Discord somewhere in between, and it seems that my RAID controller card is the culprit.

Synd & Kilrah helped me there and advised me to replace my card with a 9300-8i, so I ordered one. Now I'm waiting for it to arrive so I can retry a CA restore.

Thank you JorgeB for your help and patience. When you're stressed, it's a big help to have support!

I will post the next steps when I try again with the new card.
AinzOolGown Posted October 13, 2023 (Author) (edited)

Hi,

So I received the HBA card and was wondering how to proceed in the cleanest way.

Do I scrub the SSDs as they are (while they're still connected to the RAID controller) before replacing the RAID controller with the HBA card, or:

1/ Replace the RAID controller with the HBA card
2/ Scrub the SSDs
3/ Delete and recreate the cache pool
4/ Restore the backup

Which one, please?

Edited October 13, 2023 by AinzOolGown
JorgeB Posted October 13, 2023

If you are going to reformat the pool, there's not much point in scrubbing, but you could do it before or after.
AinzOolGown Posted October 13, 2023 (Author)

Ok, thank you very much :)
AinzOolGown Posted October 21, 2023 (Author) (edited)

Hi,

So, I replaced the card just now and tried to re-enable VMs, but the list is empty.

Maybe it is linked to the cache settings of the shares. I changed them, but maybe I have to invoke the mover?

Thanks!

P.S. I have a backup of libvirt.img

EDIT: Never mind, I replaced libvirt.img and all the VMs reappeared.

Edited October 21, 2023 by AinzOolGown
AinzOolGown Posted October 21, 2023 (Author) (edited)

As a follow-up, I'm seeing more btrfs errors after invoking the mover.

I don't understand: the SSDs are new, the cables and the HBA card are new too. I'm in despair.

Here's the fresh diagnostic:

tower-diagnostics-20231021-1207.zip

When I deleted and recreated the cache pool, no reformatting was offered or done by the system. Do I need to force one?

The scrub reported "no error found" before I invoked the mover.

Edited October 21, 2023 by AinzOolGown
JorgeB Posted October 21, 2023 (Solution)

There are issues with both cache devices. Since they are MX500s, see if this helps:

https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/?do=findComment&comment=1255816
AinzOolGown Posted October 21, 2023 (Author) (edited)

Thank you very much JorgeB,

I think I'm cursed.

Snolly's post says that to view the firmware information, we need to "click on the drive's name and go to identity tab".

Mine looks like this:

Also, my Crucial SSDs are not listed in /dev/. Maybe that's because Unraid can't see past the HBA card? That doesn't make sense, though, since it can mount them successfully.

Edited October 21, 2023 by AinzOolGown
JorgeB Posted October 22, 2023

16 hours ago, AinzOolGown said:
Mine's look like this :

The devices dropped offline; power cycling the server should bring them back.
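Once the drives are visible again, the model and firmware can also be read from a terminal instead of the identity tab. This is a minimal sketch, assuming `smartctl -i` prints `Device Model:` and `Firmware Version:` lines as it typically does for SATA drives. `firmware_summary` is a hypothetical helper name and `/dev/sdX` is a placeholder.

```shell
# Hypothetical helper: read `smartctl -i` output on stdin and print
# only the model and firmware lines.
firmware_summary() {
    grep -E '^(Device Model|Firmware Version):'
}

# Example use on the server (replace sdX with the real device):
#   smartctl -i /dev/sdX | firmware_summary
```

If the drive has dropped offline again, `smartctl` will most likely fail to open the device, which is itself a useful hint that a power cycle is needed first.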
AinzOolGown Posted October 23, 2023 (Author)

Hi JorgeB,

Thanks for bearing with me ^^

I'm thinking of changing the SSDs. This time I'm searching for possible Unraid incompatibilities for each SSD I think could be good, but every model returns plenty of bad results...

Do you have a brand or model to recommend, please?
JorgeB Posted October 23, 2023

I've been happy with my MX500s; other models I've been using without issues so far are the 860 and 870 EVO.