[Solved] BTRFS Error (RAM seems ok)

September 24, 2023

Hi there

For the past few days (after 4 years of running solid), my Docker containers have been stopping one by one, and it's very hard to get everything running properly again (the same errors come back shortly afterwards :/)
When I try to restart them, Unraid shows an Execution Error, so it's a dead end

I've searched and found some posts pointing to bad RAM, so I ran MemTest86 twice (in parallel and non-parallel mode), and both runs PASSed

I also found, for example, this topic about a full cache, but that doesn't seem to be the case for me either, so I'm a little lost

root@Tower:~# btrfs dev stats /mnt/cache
[/dev/mapper/sdk1].write_io_errs    192912
[/dev/mapper/sdk1].read_io_errs     145891
[/dev/mapper/sdk1].flush_io_errs    1086
[/dev/mapper/sdk1].corruption_errs  628
[/dev/mapper/sdk1].generation_errs  0
[/dev/mapper/sdj1].write_io_errs    0
[/dev/mapper/sdj1].read_io_errs     0
[/dev/mapper/sdj1].flush_io_errs    0
[/dev/mapper/sdj1].corruption_errs  0
[/dev/mapper/sdj1].generation_errs  0
[/dev/mapper/sdl1].write_io_errs    0
[/dev/mapper/sdl1].read_io_errs     0
[/dev/mapper/sdl1].flush_io_errs    0
[/dev/mapper/sdl1].corruption_errs  0
[/dev/mapper/sdl1].generation_errs  0

I'm attaching diagnostics and some logs in case anyone can point me to where I should look next to fix this issue, please

Thank you very much in advance!

logs.txt tower-diagnostics-20230924-1631.zip

Edited October 29, 2023 by AinzOolGown
typo

September 24, 2023

Not necessarily your problem, but corruption issues are generally related to bad RAM (as your searching has already indicated). FWIW, your particular RAM is not on the motherboard's memory QVL, and G-Skill doesn't list that motherboard on the memory's QVL either. Personally, I only ever buy RAM from the MB QVL for the most trouble-free experience. But take it with a grain of salt: just because neither the memory nor the MB says that they are compatible with each other doesn't mean that they are not.

September 25, 2023

Author

Hi Squid, thank you

Yep, I figured that out and ran tests when I built the server, but all the tests were successful, so I decided to keep them

It's been 4 years and I've never had a RAM issue since. If it really is a RAM problem, then I don't understand how everything could be OK for 4 years and suddenly not be OK anymore

Memtest86 also tells me everything's fine 😭

September 25, 2023

Community Expert

20 hours ago, AinzOolGown said:

[/dev/mapper/sdk1].write_io_errs    192912
[/dev/mapper/sdk1].read_io_errs     145891
[/dev/mapper/sdk1].flush_io_errs    1086

These suggest the device dropped offline at some point in the past; see here for more info and better pool monitoring.
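
As a rough illustration of that kind of pool monitoring, the sketch below filters the output of `btrfs dev stats` and flags every counter that is not zero. The function name `nonzero_stats` is made up for this example, and `/mnt/cache` is simply the mount point used earlier in this thread; it is not the script behind the link.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for this example): reads the output of
# `btrfs dev stats` on stdin, prints every counter whose value is not zero,
# and returns 1 if at least one such counter was seen.
nonzero_stats() {
    awk 'NF == 2 && $2 + 0 != 0 { print; found = 1 }
         END { exit (found ? 1 : 0) }'
}

# Intended use on the server (assumes the pool is mounted at /mnt/cache):
#   btrfs dev stats /mnt/cache | nonzero_stats || echo "cache pool has device errors"
```

The counters are cumulative, so old errors keep showing up until they are reset (for example with `btrfs dev stats -z /mnt/cache`); after a reset, any new non-zero value points to a fresh problem.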

September 25, 2023

Author

Thanks JorgeB

That reminds me... Sorry, I forgot to tell you: just a few days before this behavior started, one disk of my array went offline for 2 or 3 days. Unraid emulated it, and I only noticed some hours later, by chance. I changed a cable and rebuilt the array. All seemed good to me, but could it in fact have corrupted the array?

If this is it, how can I correct that, please?
I read about a "scrub" command; I have it for the cache pool but not for the data array
Will it erase the whole cache? I won't lose any data, right?

Thank you

P.S.: Installed Squid's script from your link, thanks again!

I've included the most recent logs; this is starting to worry me
In the meantime, I got a notification: "Error on cache pool - No description"

LastLogs.txt

Edited September 25, 2023 by AinzOolGown

September 25, 2023

Community Expert

23 minutes ago, AinzOolGown said:

If this is it, how can I correct that, please?

See the link above.

September 30, 2023

Author

Hi

So, I followed this process and changed all shares that use the cache to "Yes". VMs & Docker are disabled, but mover has finished and shares (appdata/system) remain on it.

I opted to completely replace my cache drives and SATA cables

Can I just save the remaining content elsewhere and move it back when the new cache pool is operational?

Is there a better way to move the remaining files to the array?

And how do I do this the right way (to preserve permissions and such)? I have a Mac and PathFinder ready for that

Also, I have a share set to "No" for cache, but strangely the last files created on it are on the cache O_o

Thank you

Edited September 30, 2023 by AinzOolGown

September 30, 2023

Community Expert

1 hour ago, AinzOolGown said:

but mover has finished and shares (appdata/system) remain on it.

Enable mover logging, run the mover, post new diags.

September 30, 2023

Author

Thanks JorgeB

Here it is

tower-diagnostics-20230930-1227.zip

September 30, 2023

Community Expert

The filesystem is read-only, so the data cannot be moved; it can be copied, though.

September 30, 2023

Author

Can I just copy the content via SMB and do the opposite when the new drives are installed? (Sorry, I prefer to double-check: there are some production Docker containers for my business on it, and it would be hell to lose anything)

Edited September 30, 2023 by AinzOolGown

September 30, 2023

Community Expert

21 minutes ago, AinzOolGown said:

Can I just copy the content via SMB and do the opposite when the new drives are installed?

That should work; you can also use rsync, for example.
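
A minimal sketch of what an rsync copy plus a verification pass could look like. The function names and the paths in the comments are only examples, not something taken from the diagnostics. `-a` keeps permissions, ownership and timestamps, and `-H` keeps hard links.

```shell
#!/bin/bash
# Hypothetical helpers (names chosen for this example).

# Copy a directory tree, keeping permissions, ownership and timestamps (-a)
# as well as hard links (-H). rsync normally carries on when a single file
# cannot be read and reports the failure through a non-zero exit code.
copy_tree() {
    src=$1; dst=$2
    mkdir -p "$dst" || return 1
    rsync -aH "$src"/ "$dst"/
}

# Compare source and copy without changing anything:
# -n = dry run, -c = compare by checksum, -i = itemize what differs.
verify_tree() {
    rsync -aHnci "$1"/ "$2"/
}

# Intended use (paths are examples only):
#   copy_tree   /mnt/cache/appdata /mnt/disk1/cache-backup/appdata
#   verify_tree /mnt/cache/appdata /mnt/disk1/cache-backup/appdata
```

If `verify_tree` prints nothing, source and copy match; lines starting with `>f` are files that are missing from the copy or differ from the source.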

September 30, 2023

Author

Sorry JorgeB, rsync gives me errors too
The command was:

rsync -a /mnt/cache/appdata/ /mnt/user/Backup-Saves/TowerDockers/Cache/appdata/

I tried a CA Backup and got the same result: it gives errors 😱

Ran a filesystem check on the cache, logs attached

rsync errors.txt

Cache filesystem check.txt

Edited September 30, 2023 by AinzOolGown

October 1, 2023

Community Expert

Any corrupt files will fail to copy. You can try btrfs restore here: it ignores the corruption and still copies those files, but they will still be corrupt.
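
For orientation, a hedged sketch of how such a `btrfs restore` run might be wrapped. The function name, the `RUN` preview switch and the destination path are assumptions made for this example; the device name is the one from the stats output earlier in the thread.

```shell
#!/bin/bash
# Hypothetical wrapper (name chosen for this example) around `btrfs restore`,
# which reads an unmounted btrfs device and copies out whatever it can.
#   -v  list files as they are restored
#   -i  ignore errors and keep going
#   -m  also restore owner, mode and timestamps
# Set RUN=echo to print the command instead of executing it.
restore_pool() {
    dev=$1; dest=$2
    # btrfs restore is meant for unmounted filesystems, so refuse otherwise.
    if grep -q "^$dev " /proc/mounts 2>/dev/null; then
        echo "$dev is mounted, unmount it (stop the array) first" >&2
        return 1
    fi
    ${RUN:-} btrfs restore -v -i -m "$dev" "$dest"
}

# Intended use (the destination is an example path; it must already exist
# and have enough free space):
#   RUN=echo restore_pool /dev/mapper/sdj1 /mnt/disk1/restore   # preview
#   restore_pool /dev/mapper/sdj1 /mnt/disk1/restore            # real run
```

The source filesystem is only read, never modified, so a failed attempt does not make things worse.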

October 1, 2023

Author

Hi

I copied what I could

Deleted the cache pool / plugged in new SSDs / recreated the cache pool and assigned the new SSDs, with new cables

Then I switched appdata back to cache "Only" and did a CA plugin restore
The plugin said the restore was complete, but I got a notification saying an error occurred
Syslog shows a lot of btrfs errors

I switched to Discord somewhere in between, and it seems that my RAID controller card is the culprit
Synd & Kilrah helped me and advised replacing my card with a 9300-8i, so I ordered one and am now waiting for it to arrive before retrying a CA restore

Thank you JorgeB for your help and patience; when you're stressed, it's a big help to have support!

I'll post the next steps when I try with the new card.

October 13, 2023

Author

Hi

So I received the HBA card and was wondering how to proceed in the cleanest way

Do I scrub the SSDs as they are (while they're still connected to the RAID controller) before replacing the RAID controller with the HBA card

or:
1/ Replace RAID controller with HBA card

2/ Scrub SSDs

3/ Delete + recreate cache pool

4/ Restore backup?

Please

Edited October 13, 2023 by AinzOolGown

October 13, 2023

Community Expert

If you are going to re-format the pool there's not much point in scrubbing, but you could do it before or after.
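
For reference, a scrub only reads the pool and verifies checksums; it does not erase anything. A minimal command-line sketch follows; the function name and the `RUN` preview switch are assumptions for this example, and `/mnt/cache` is the mount point used earlier in the thread.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for this example): scrub a mounted btrfs
# pool in the foreground, then show the per-device error counters.
# Set RUN=echo to print the commands instead of executing them.
scrub_pool() {
    mnt=$1
    # -B keeps the scrub in the foreground and prints a summary at the end.
    ${RUN:-} btrfs scrub start -B "$mnt"
    rc=$?
    # Show the counters even when the scrub reported errors.
    ${RUN:-} btrfs dev stats "$mnt"
    return "$rc"
}

# Intended use:
#   scrub_pool /mnt/cache
```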

October 13, 2023

Author

Ok, thank you very much :)

October 21, 2023

Author

Hi

So, I replaced the card just now and tried to re-enable VMs, but the list is empty

Maybe it is linked to the cache settings in the shares

I changed them, but maybe I have to invoke the mover?

Thanks!

P.S. I have a backup of libvirt.img

EDIT: Never mind, I replaced libvirt.img and all VMs reappeared

Edited October 21, 2023 by AinzOolGown

October 21, 2023

Author

As a follow-up, I see more btrfs errors after invoking the mover

I don't understand: the SSDs are new, the cables and HBA card are new too, I'm in despair

Here are the fresh diagnostics

tower-diagnostics-20231021-1207.zip

When I deleted/recreated the cache pool, no reformatting was offered or done by the system
Do I need to force one?

Scrub was saying "no error found" before invoking the mover

Edited October 21, 2023 by AinzOolGown

October 21, 2023

Community Expert
Solution

There are issues with both cache devices; since they are MX500s, see if this helps:

https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/?do=findComment&comment=1255816

October 21, 2023

Author

Thank you very much JorgeB

I think I'm damned

Snolly's post says that to view the firmware information, we need to "click on the drive's name and go to identity tab"
Mine looks like this:

Also, my Crucial SSDs are not listed in /dev/

~~maybe because Unraid can't see past the HBA card?~~
That doesn't make sense, since it can mount them successfully

Edited October 21, 2023 by AinzOolGown

October 22, 2023

Community Expert

16 hours ago, AinzOolGown said:

Mine looks like this:

The devices dropped offline, power cycling the server should bring them back.
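
Once the devices are back, the firmware revision can also be read from the command line instead of the identity tab. The sketch below assumes `smartctl` is available and that its `-i` output contains a `Firmware Version:` line; the function name is made up for this example and `sdX` is a placeholder for the real device name.

```shell
#!/bin/bash
# Hypothetical helper (name chosen for this example): reads `smartctl -i`
# output on stdin and prints the value of its "Firmware Version:" line.
firmware_version() {
    awk -F': +' '/^Firmware Version:/ { print $2 }'
}

# Intended use (sdX is a placeholder for the real device name):
#   ls -l /dev/disk/by-id/ | grep -i crucial    # find which sdX the SSDs are
#   smartctl -i /dev/sdX | firmware_version
```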

October 23, 2023

Author

Hi JorgeB,

Thanks for bearing with me ^^

I'm thinking of changing the SSDs. This time I'm searching for possible Unraid incompatibilities for each SSD I think could be good, but every model returns plenty of bad results...

Do you have a brand/model to recommend, please?

October 23, 2023

Community Expert

I've been happy with my MX500s; other models I've been using without issues so far are the 860 and 870 EVO.

Solved by JorgeB

Join the conversation

Account

Navigation

Search

Configure browser push notifications

Chrome (Android)

Chrome (Desktop)

Safari (iOS 16.4+)

Safari (macOS)

Edge (Android)

Edge (Desktop)

Firefox (Android)

Firefox (Desktop)