[Solved] BTRFS Error (RAM seems ok)


Solved by JorgeB

Hi there :)

For the past few days (after 4 years of running solid), my Docker containers have been stopping one by one, and it's very hard to get everything running properly again (and the same errors come back shortly after :/)
When I try to restart them, Unraid shows an Execution Error, so it's a dead end.


I searched and found some posts pointing to bad RAM, so I ran MemTest86 twice (parallel and non-parallel), and both runs passed.

I also found, for example, this topic that talks about a full cache, but that doesn't seem to be the case for me either, so I'm a little lost :/

 

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/mapper/sdk1].write_io_errs    192912
[/dev/mapper/sdk1].read_io_errs     145891
[/dev/mapper/sdk1].flush_io_errs    1086
[/dev/mapper/sdk1].corruption_errs  628
[/dev/mapper/sdk1].generation_errs  0
[/dev/mapper/sdj1].write_io_errs    0
[/dev/mapper/sdj1].read_io_errs     0
[/dev/mapper/sdj1].flush_io_errs    0
[/dev/mapper/sdj1].corruption_errs  0
[/dev/mapper/sdj1].generation_errs  0
[/dev/mapper/sdl1].write_io_errs    0
[/dev/mapper/sdl1].read_io_errs     0
[/dev/mapper/sdl1].flush_io_errs    0
[/dev/mapper/sdl1].corruption_errs  0
[/dev/mapper/sdl1].generation_errs  0
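
In the output above, only sdk1 shows non-zero counters. A minimal sketch for filtering such output down to the non-zero lines, assuming the `[device].counter value` layout shown here; the function name `nonzero_stats` is hypothetical, and the sample is pasted (abbreviated) from this post rather than read from a live system:

```shell
#!/bin/sh
# Hypothetical helper: print only the non-zero counters from `btrfs dev stats` output.
# Reads the stats text on stdin; assumes lines shaped like "[dev].counter   N".
nonzero_stats() {
    awk '$2 + 0 > 0 { print $1, $2 }'
}

# Sample copied from the output in this post (abbreviated to two devices).
sample='[/dev/mapper/sdk1].write_io_errs    192912
[/dev/mapper/sdk1].read_io_errs     145891
[/dev/mapper/sdk1].flush_io_errs    1086
[/dev/mapper/sdk1].corruption_errs  628
[/dev/mapper/sdk1].generation_errs  0
[/dev/mapper/sdj1].write_io_errs    0
[/dev/mapper/sdj1].corruption_errs  0'

printf '%s\n' "$sample" | nonzero_stats
```

On a live system the same filter could presumably be fed directly, e.g. `btrfs dev stats /mnt/cache | nonzero_stats`.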

 

I'm attaching diagnostics and some logs in case anyone can point me to where I should look next to fix this issue, please :)

 

Thank you very much in advance!

logs.txt tower-diagnostics-20230924-1631.zip

Edited by AinzOolGown
typo
Link to comment

Not necessarily your problem, but corruption issues are generally related to bad RAM (as your searching has already indicated).  FWIW, your particular RAM is not on the motherboard's memory QVL, and G.Skill doesn't list that motherboard on the memory's QVL either.  Personally, I only ever buy RAM from the MB QVL for the most trouble-free experience.  But take it with a grain of salt.  Just because neither the memory nor the MB says that they are compatible with each other doesn't mean that they are not.

Link to comment

Hi Squid, thank you :)

Yep, I figured that out and ran tests when I built the server, but all the tests were successful, so I decided to keep it :)

It's been 4 years and I've never had a RAM issue since. If it really is a RAM problem, I don't understand how everything could be fine for 4 years and then suddenly not be :/

Memtest86 also tells me everything's fine 😭

Link to comment

Thanks JorgeB ;)

 

That reminds me... Sorry, I forgot to mention: just a few days before this behavior started, one disk of my array went offline for 2 or 3 days. Unraid emulated it and I only noticed some hours later, by chance. I changed a cable and rebuilt the array. Everything seemed fine to me, but could it actually have corrupted the array?

 

If that's it, how can I fix it, please?
I read about a "scrub" command; I have it for the cache pool but not for the data array.
Will it erase the whole cache? I won't lose any data, right?
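
For context on the scrub question: a btrfs scrub reads the data and metadata on the pool and verifies checksums; it does not erase the pool, and on a redundant profile it can repair a bad copy from a good one. A minimal sketch of how it might be run from the terminal, assuming the pool is mounted at /mnt/cache as in the first post. The `run` and `check_pool` helpers are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set:

```shell
#!/bin/sh
# Hypothetical wrapper: print each command instead of running it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: scrub a pool in the foreground, then show its error counters.
check_pool() {
    pool=$1
    # -B keeps the scrub in the foreground so it finishes before the next command.
    run btrfs scrub start -B "$pool" &&
    # Same per-device counters as in the first post.
    run btrfs dev stats "$pool"
}

# /mnt/cache is the mount point used in this thread; adjust if yours differs.
check_pool /mnt/cache
```

Running it as-is only echoes the two commands, which makes it easy to review them before executing anything against a pool that is already throwing errors.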

 

Thank you :)

 

P.S.: I installed Squid's script from your link, thanks again!
 

I've included the most recent logs; it's starting to worry me :/
In the meantime, I got a notification: "Error on cache pool - No description"

LastLogs.txt

Edited by AinzOolGown
Link to comment

Hi :)

 

So, I followed this process and changed all shares that use the cache to "Yes". VMs & Docker are disabled, but the mover has finished and some shares (appdata/system) still remain on the cache.

 

I've decided to completely replace my cache drives and SATA cables.

 

Can I just save the remaining content elsewhere and move it back once the new cache pool is operational?

Is there a better way to move the remaining files to the array ?

And how do I do this the right way (to preserve permissions and such)? I have a Mac and Path Finder ready for that :)
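
On preserving permissions: one common approach (not confirmed anywhere in this thread) is to copy from the server's own terminal rather than over a network share, since `cp -a` keeps modes, timestamps and symlinks, plus ownership when run as root. A minimal sketch using temporary directories as stand-ins; the helper name `backup_tree` is hypothetical, and real paths such as /mnt/cache/appdata are assumptions:

```shell
#!/bin/sh
# Hypothetical helper: copy the contents of a directory tree in archive mode,
# keeping mode, timestamps and symlinks (and ownership when run as root).
backup_tree() {
    src=$1
    dst=$2
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}

# Self-contained demo on temporary directories (stand-ins for the real
# source and destination, which are not specified in this thread).
demo_src=$(mktemp -d)
demo_dst=$(mktemp -d)
printf 'example\n' > "$demo_src/config.yml"
chmod 640 "$demo_src/config.yml"

backup_tree "$demo_src" "$demo_dst/appdata"
ls -l "$demo_dst/appdata"
```

The listing should show the copied file with the same restrictive mode as the original, which a drag-and-drop copy from a desktop file manager would not necessarily keep.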

 

Also, I have a share set to "No" for cache, but strangely the latest files created on it are on the cache O_o

 

Thank you :)

Edited by AinzOolGown
Link to comment

Hi  :)

I copied what I could.

I deleted the cache pool, plugged in the new SSDs, and created a new cache pool with the new SSDs and new cables.

 

Then I switched appdata back to cache "Only" and ran a CA plugin restore.
The plugin said the restore was complete, but I got a notification saying an error occurred.
The syslog shows a lot of btrfs errors.

I switched to Discord somewhere in between, and it seems my RAID controller card is the culprit.
Synd & Kilrah helped me and advised replacing my card with a 9300-8i, so I ordered one and am now waiting for it to arrive before retrying a CA restore.

Thank you JorgeB for your help and patience; when you're stressed, it's a big help to have support! :)

I'll post the next steps once I've tried with the new card.

Link to comment
  • 2 weeks later...

Hi :)

So I received the HBA card and was wondering how to proceed in the cleanest way.

Do I scrub the SSDs like this (while they're still connected to the RAID controller) before replacing the RAID controller with the HBA card?

Or:
1/ Replace the RAID controller with the HBA card
2/ Scrub the SSDs
3/ Delete and recreate the cache pool
4/ Restore the backup?

 

Please :)

Edited by AinzOolGown
Link to comment

Hi :)

So, I replaced the card just now and tried to re-enable VMs, but the list is empty :/


 

Maybe it's linked to the cache settings in the shares.

I changed them, but maybe I have to invoke the mover?

 

Thanks !

 

P.S. I have a backup of libvirt.img

 

EDIT: Never mind, I replaced the libvirt.img and all the VMs reappeared ;)

Edited by AinzOolGown
Link to comment

As a follow-up, I'm seeing more btrfs errors after invoking the mover :/

 

I don't understand: the SSDs are new, the cables and HBA card are new too. I'm in despair.

 

Here are the fresh diagnostics:

tower-diagnostics-20231021-1207.zip

 

When I deleted and recreated the cache pool, the system didn't offer or perform any reformatting.
Do I need to force one?

Scrub was reporting "no error found" before I invoked the mover.

Edited by AinzOolGown
Link to comment

Thank you very much JorgeB ;)

I think i'm damned :/

 

Snolly's post says that to view the firmware information, we need to "click on the drive's name and go to identity tab".
Mine looks like this:
[attached screenshot]

 

Also, my Crucial SSDs are not listed in /dev/

Maybe because Unraid can't see past the HBA card?
That doesn't make sense, since it can mount them successfully :/

 

Edited by AinzOolGown
Link to comment
  • AinzOolGown changed the title to [Solved] BTRFS Error (RAM seems ok)
