
CSUM Loop2 error




v6.12.10

 

I've been noticing intermittent appearances of the following error:

 

Apr 23 05:28:31 Node kernel: BTRFS warning (device loop2): csum failed root 13298 ino 10830 off 741376 csum 0x4458e96e expected csum 0x9c63b882 mirror 1
Apr 23 05:28:31 Node kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 232049, gen 0

 

Per recommendations here I deleted and recreated my Docker image, but barely a day later the error is back.

 

Apr 24 19:22:17 Node kernel: BTRFS warning (device loop2): csum failed root 382 ino 10035 off 83816448 csum 0xc0681546 expected csum 0x185344aa mirror 1
Apr 24 19:22:17 Node kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0

 

My docker.img file is on the cache and not in the array: /mnt/cache/docker.img
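
For reference, I've also been checking the error counters directly from the console. btrfs device stats is a standard btrfs-progs command; the only assumption here is that the docker image's btrfs volume is mounted at /var/lib/docker, which I believe is where Unraid mounts it:

btrfs device stats /var/lib/docker     # per-device write/read/flush/corruption/generation error counters
btrfs device stats -z /var/lib/docker  # same, but zeroes the counters so any new corruption stands out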

 

What else should I try?

node-diagnostics-20240425-0637.zip

Edited by weirdcrap
Link to comment
Posted (edited)
7 minutes ago, JorgeB said:

Start by running memtest; that could be a RAM problem.

That's going to be a problem, as it's a remote server 4 hours away and it would have to stay down for days to thoroughly memtest 64GB of RAM (3-4 passes, assuming I get no errors right away).

 

Anything else I can try before I have to pull the server offline for an extended period of time?

Edited by weirdcrap
Link to comment
Posted (edited)
27 minutes ago, Kilrah said:

You can do a scrub of the cache pool, and if there are no errors, recreate the docker image:

https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file

 

If the problem comes back it'll likely be a hardware issue and you'll want that memtest.

It's not a pool, it's a single NVMe drive. What are the possible consequences of just not addressing it, then?

 

I would like to exhaust all other avenues before I have to take multiple days off work or track someone else down to run and babysit memtest for me. It's not just a simple run-memtest-once-and-you're-done type of thing: since I have 4 sticks, if I get errors I then have to play the test-one-stick-at-a-time game to figure out which stick needs to be replaced. So I'm looking at multiple days of downtime while I try to track down which stick or sticks are the problem.

 

I rebuilt this server brand new less than two years ago because of RAM issues, when I could no longer find new sticks of DDR3 RAM. I'm understandably pretty mad that I'm already having to take time off work to go track down some RAM-related error again.

 

I'm desperately waiting for a fiber-to-the-home provider to build out in my neighborhood so I can bring this server home and stop having to deal with this crap. I can't run a Plex server for remote streaming when I'm out of the house on 10-15Mbps upload, though.

Edited by weirdcrap
Link to comment

It all depends on how it evolves; right now it's just a single corrupted block in the Docker image. Maybe it even fixes itself if that block gets deleted/rewritten when a container is updated, or maybe it gets worse. Could just leave it as is until it breaks more, I guess.

Link to comment
Posted (edited)
8 minutes ago, Kilrah said:

It all depends on how it evolves; right now it's just a single corrupted block in the Docker image. Maybe it even fixes itself if that block gets deleted/rewritten when a container is updated, or maybe it gets worse. Could just leave it as is until it breaks more, I guess.

If it's going to potentially corrupt data written to the array then I obviously need to address that. But if it's just some corrupted bits in the docker.img file, I don't really care, and it's not worth the time and headache to address right now.

 

I'll keep a close eye on the logs for the next few days and make a decision based on what I observe.
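
For now I'm just counting the csum warnings in the syslog from the console; a plain grep is enough for that:

grep -c 'csum failed' /var/log/syslog   # number of checksum failures logged since the last log rotation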

 

EDIT: How would I know if it fixes itself with a container update? Would the error just stop appearing?

Edited by weirdcrap
Link to comment
Posted (edited)
50 minutes ago, JorgeB said:

If it's a RAM problem, all data can be affected, but since you don't have any other btrfs or zfs filesystems, only the docker image is detecting data corruption.

This is what I figured. Great.

 

Do I need to use the memtest built into the Unraid USB, or can I just grab any old copy? I assume there is nothing "special" about the Unraid version?

 

If I can find someone local to do the testing for me, I'll have them remove the Unraid USB just so it doesn't accidentally boot and start the array if they miss the POST prompt or something.

 

I also just realized Kilrah probably meant scrub the docker image, not the cache itself (since it's XFS). I ran a scrub on loop2 and got no errors.
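
(For anyone following along, the command-line equivalent of that scrub should be something like the below, assuming the docker image's btrfs volume is mounted at /var/lib/docker:)

btrfs scrub start -B /var/lib/docker   # -B keeps the scrub in the foreground and prints a summary when done
btrfs scrub status /var/lib/docker     # shows the totals again afterwards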

 

EDIT: Oh wow, I didn't look at the previous syslogs in the diagnostics. Apparently this blew up fast: on April 8th and 9th, both syslog.1 and syslog.2 are nothing but repeated BTRFS corruption errors, thousands of them.

 

99% of them seem to be the same ino (inode?) with a few different off values (offsets?): root 13093 ino 10035 off 93798400. Even the single error from today is the same ino of 10035.

 

If it really is a RAM error, would ALL the errors be on the same inode? I would think RAM issues would return lots of different inodes and offsets as things are shifted around in memory.
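
In case it helps anyone, I believe the inode from the error can be mapped back to an actual file inside the docker image with btrfs inspect-internal. The subvolume path placeholder below comes from the output of the first command, and /var/lib/docker is again assumed to be the mount point:

btrfs inspect-internal subvolid-resolve 13093 /var/lib/docker                 # resolve the "root" id from the error to a subvolume path
btrfs inspect-internal inode-resolve 10035 /var/lib/docker/<subvolume-path>   # resolve ino 10035 to a file name inside that subvolume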

Edited by weirdcrap
Link to comment
1 hour ago, weirdcrap said:

If it really is a RAM error, would ALL the errors be on the same inode?

Likely a single location was corrupted, but since it can't be repaired it keeps reporting errors every time it's accessed...

 

  

1 hour ago, weirdcrap said:

I also just realized Kilrah probably meant scrub the docker image, not the cache itself (since it's XFS).

Sorry, make that a check filesystem on the cache then.

 

Edited by Kilrah
Link to comment
Posted (edited)
42 minutes ago, Kilrah said:

Likely a single location was corrupted, but since it can't be repaired it keeps reporting errors every time it's accessed...

 

  

Sorry, make that a check filesystem on the cache then.

 

OK, I've got a data transfer going, but once that's done I'll put the array in maintenance mode and check the filesystem on the cache disk.
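
(As I understand it, the GUI check in maintenance mode is equivalent to running xfs_repair in no-modify mode against the cache device; the device name below is just an example and will differ per system:)

xfs_repair -n /dev/nvme0n1p1   # -n = check only, report problems without writing any changes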

 

40 minutes ago, itimpi said:

It is always easy to recreate the docker.img file if it gets corrupted and restore the containers with settings intact via Apps->Previous Apps, so as long as the underlying XFS filesystem looks OK, I would probably recommend doing that.

Are you saying that you disagree that it is a RAM issue? Or just that if I don't want to address the RAM issue simply recreating the docker.img file is easy?

 

I do like my data uncorrupted, so if it is a RAM issue I'll address it; that doesn't mean I'm any less mad about having to do it on a server that's less than 2 years old. I've got someone who says they can memtest the sticks for me this weekend, so I'll have them do that as well just to rule the RAM out.

Edited by weirdcrap
Link to comment
1 hour ago, Kilrah said:

Likely a single location was corrupted, but since it can't be repaired it keeps reporting errors every time it's accessed...

 

  

Sorry, make that a check filesystem on the cache then.

 

I assume this means the FS is fine?

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

 

Link to comment
47 minutes ago, weirdcrap said:

Are you saying that you disagree that it is a RAM issue? Or just that if I don't want to address the RAM issue simply recreating the docker.img file is easy?

No idea if it is a RAM issue or not!   Just pointing out that it is very easy to get the docker.img file back into a known clean state.  If you continue to get corruption then either the device holding it has a problem or you quite likely do have a RAM issue.

Link to comment
Posted (edited)

The bad stick was located and the array is back up.

 

Things have taken a turn for the weird. Sonarr was throwing all kinds of "access denied to path" errors, I'm missing a bunch of shares, and there are strange errors in the log I've never seen before.

It's throwing the below error for most, if not all, of my shares:

Node emhttpd: error: malloc_share_locations, 7199: Operation not supported (95): getxattr: /mnt/user/CommunityApplicationsAppdataBackup

 

The data appears to still be there on the individual disks, and Unraid reported that the array configuration was valid before I started it.

 

I tried to reboot the server and am now stuck in an unresolvable "retry unmounting shares" loop. It claims /mnt/user is not empty, but I don't see any open files preventing Unraid from stopping the array.
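
(This is how I was looking for open files, in case I missed something; both tools should be on the stock install:)

lsof /mnt/user          # lists open files on the user share mount
fuser -vm /mnt/user     # lists processes accessing the mounted filesystem, with access type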

 

help!

 

EDIT: OK I managed to force a reboot. Waiting for it to come back up now.

 

EDIT2: F*CKKKKKK it's not coming back online. I'm going to have to get someone to go over there and see what the hell is going on. In the meantime, if anyone can shed some light on wtf happened to my shares, that would be great. I've never seen this happen before and I don't think testing and removing a single RAM stick would cause all this nonsense.

node-diagnostics-20240428-0615.zip

Edited by weirdcrap
Link to comment

You have something creating this folder before the array starts:

 

Apr 28 04:40:42 Node emhttpd: error: malloc_share_locations, 7199: Operation not supported (95): getxattr: /mnt/user/CommunityApplicationsAppdataBackup

 

You need to correct that, or the shares won't work correctly.

Link to comment
Posted (edited)
2 minutes ago, JorgeB said:

You have something creating this folder before the array starts:

 

Apr 28 04:40:42 Node emhttpd: error: malloc_share_locations, 7199: Operation not supported (95): getxattr: /mnt/user/CommunityApplicationsAppdataBackup

 

You need to correct that, or the shares won't work correctly.

But that's just a share on my array? It's not just a random folder I created. It's a share on my other server as well and it isn't causing me any problems.

 

It's from/for Squid's community app data backup plugin before it was taken over by the new dev.

Edited by weirdcrap
Link to comment
Posted (edited)
1 hour ago, JorgeB said:

No, because it already exists before array start, so something else is creating that folder.

After memtest was finished, my helper booted the server back up and it sat all night with the array stopped. When I went to start the array this morning, this is what happened.

 

I wonder if the appdata backup plugin maybe ran while the array was stopped and created that folder? I don't know how else it would have been created before array start.

 

If I can get the server back up, do I just need to delete that folder with the array stopped? Or how do I remove the "bad" version of the folder without affecting the actual share that contains my data?
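
(My understanding is that with the array stopped /mnt/user shouldn't exist at all, so anything sitting there is a stray directory on the rootfs rather than the real share. If that's right, something like this from the console should be safe; rmdir refuses to remove anything that isn't empty:)

ls -ld /mnt/user/CommunityApplicationsAppdataBackup    # confirm the stray directory exists while the array is stopped
rmdir /mnt/user/CommunityApplicationsAppdataBackup     # remove it only if it really is an empty leftover directory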

 

Will the folder persist in /mnt/user/ after the reboot if something isn't automatically recreating it at boot? I have no automated processes that touch /mnt/user/ (besides built-in Unraid stuff), so I would hope that a reboot is all I need to resolve this.

 

EDIT: I can't seem to find plugin parameters in the diagnostics. I was hoping to check the ca.backup.appdata plugin's schedule to see if it was set to run this morning.

Edited by weirdcrap
Link to comment
1 minute ago, itimpi said:

I have not checked the code, but I would expect the appdata backup plugin to include a check on whether the array is started.

You would think. I was hoping to figure it out by checking the schedule, but looking through the captured diagnostics I can't find anything about it. I didn't notice any evidence of the plugin running in the syslog capture either.

 

7 minutes ago, Kilrah said:

Possible yes. Just reboot.

I've got someone headed that way in a bit to force a reboot; hopefully that resolves it.

Link to comment
22 minutes ago, weirdcrap said:

You would think.

Just had a quick look at the code, and it appears that there IS a check on whether the array is already started, so unless that check is faulty in some way the backup process will not run without the array started.

Link to comment
Posted (edited)
26 minutes ago, Squid said:

The version that you're running is from the new dev.

Yeah, sorry, I wasn't trying to imply it was your version; I just wasn't very clear.

 

My buddy finally got over there to reboot the server. It sounds like I might have two bad sticks. The second one wouldn't POST/test yesterday, but after removing and reseating it, it worked, so we kept it in. Today, though, the server again wouldn't POST after rebooting until he removed this second stick. It sounds like G.Skill wants me to return the whole kit anyway as part of the RMA, so I guess it works out that I'm down two sticks instead of just one.

 

Server is back up and everything appears to be working now.

 

  

3 hours ago, itimpi said:

Just had a quick look at the code, and it appears that there IS a check on whether the array is already started, so unless that check is faulty in some way the backup process will not run without the array started.

 

 

I checked the schedule for the appdata backup plugin and it is set to run on Tuesday, so I have no idea how that CommunityApplicationsAppdataBackup folder would have been created in /mnt/user/ while my array was stopped.
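
(For anyone who wants to check from the console rather than the GUI, a plain grep over the cron and plugin config locations should surface the schedule; I'm not assuming any plugin-specific file names:)

grep -ri "appdata" /etc/cron.d/ /boot/config/plugins/ 2>/dev/null | grep -i backup   # look for any cron entry or config mentioning the appdata backup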

Edited by weirdcrap
Link to comment
