[SOLVED] BTRFS Corrupt Error on Cache Drive [6.9.0-rc2]


I ran into a kernel panic within the last 24 hours, and within the past hour or so my Unraid server became unresponsive. At the time, the only activity was a single user transcoding on Plex (Docker). I had to force the system to shut down (via the power button) and turn it back on again. As a precaution, since transcoding was using the cache drive, I moved it to RAM for the time being.

 

Looking at the log, I found one instance of BTRFS complaining about corruption:

Jan 28 20:47:26 WadeWilson kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
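
For anyone who wants to watch for this kind of line automatically, here is a minimal sketch in Python that pulls the counters out of the syslog entry above. The function name and the regular expression are my own assumptions, written only against the one line quoted in this post; other kernel versions may word the message differently. Note that the message names /dev/loop2, a loop device, rather than the NVMe partition itself.

```python
import re

# The syslog line quoted in the post above, copied verbatim.
LINE = (
    "Jan 28 20:47:26 WadeWilson kernel: BTRFS error (device loop2): "
    "bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0"
)

# Assumed pattern: matches only the message format shown in this thread.
PATTERN = re.compile(
    r"BTRFS error \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_error(line):
    """Return the device names and integer error counters, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    result = {"device": fields["device"], "bdev": fields["bdev"]}
    for name in COUNTERS:
        result[name] = int(fields[name])
    return result


if __name__ == "__main__":
    print(parse_btrfs_error(LINE))
```

Any non-zero counter is worth a closer look; here only the corrupt counter is non-zero.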

 

Since my cache drive is the only BTRFS formatted device in the system, I ran scrub on it:

Scrub device /dev/nvme0n1p1 (id 1) done
Scrub started:    Thu Jan 28 20:57:14 2021
Status:           finished
Duration:         0:00:25
Total to scrub:   133.02GiB
Rate:             2.77GiB/s
Error summary:    csum=1
  Corrected:      0
  Uncorrectable:  1
  Unverified:     0
ERROR: there are uncorrectable errors
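
If you save scrub results to a file, the error summary can be checked with a short script. This is only a sketch: it assumes the output layout pasted above (other btrfs-progs versions may print it differently), and the helper names are hypothetical.

```python
import re

# Scrub output copied from the post above.
SCRUB_OUTPUT = """\
Scrub device /dev/nvme0n1p1 (id 1) done
Scrub started:    Thu Jan 28 20:57:14 2021
Status:           finished
Duration:         0:00:25
Total to scrub:   133.02GiB
Rate:             2.77GiB/s
Error summary:    csum=1
  Corrected:      0
  Uncorrectable:  1
  Unverified:     0
ERROR: there are uncorrectable errors
"""

# Assumed layout: one "Name:  <integer>" line per counter.
COUNTER_LINE = re.compile(r"\s*(Corrected|Uncorrectable|Unverified):\s+(\d+)\s*$")


def scrub_summary(text):
    """Collect the Corrected/Uncorrectable/Unverified counts as integers."""
    summary = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            summary[match.group(1).lower()] = int(match.group(2))
    return summary


def needs_attention(text):
    """True when the scrub reported at least one uncorrectable error."""
    return scrub_summary(text).get("uncorrectable", 0) > 0


if __name__ == "__main__":
    print(scrub_summary(SCRUB_OUTPUT))
    print(needs_attention(SCRUB_OUTPUT))
```

With a single-device cache there is no second copy of the data to repair from, which would be consistent with the error being reported as uncorrectable rather than corrected.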

 

Anyone have any advice on what I should do next? Logs are attached.

wadewilson-diagnostics-20210128-2102.zip

Edited by MarkRMonaco

Just an update on this. After reading other posts on the forums and the wiki, I decided to reformat the drive. As a precaution, I first erased the drive, unassigned it, formatted it as XFS, then reassigned it and let Unraid automatically reformat it back to BTRFS. When rebuilding the Docker image file, I also opted for the XFS option.

 

Running appdata backup/restore as we speak...

Edited by MarkRMonaco
  • MarkRMonaco changed the title to BTRFS Corrupt Error on Cache Drive [6.9.0-rc2]

Just another update. I ran into some issues while restoring my Docker containers from my saved templates: the restore would occasionally cause the Docker service to stop. In most cases, I was able to stop the array and restart it, which got Docker running again. At some point, however, stopping the array hung at the cache drive. Thankfully, I was able to unmount it from the terminal with "umount -l /mnt/cache". I then rebooted the server and immediately ran a btrfs scrub again. Errors/corruption were detected, but corrected.

 

Scrub device /dev/nvme0n1p1 (id 1) done
Scrub started:    Thu Jan 28 21:51:42 2021
Status:           finished
Duration:         0:00:54
Total to scrub:   235.02GiB
Rate:             3.15GiB/s
Error summary:    no errors found
WARNING: errors detected during scrubbing, corrected

 

Unfortunately, I forgot to pull logs before I rebooted...

 

I'll continue to keep an eye on it and will report back (w/ logs) if anything changes.

Edited by MarkRMonaco

I agree, @Squid. I'm assuming there is a chance that the appdata already had some corruption in it when it was backed up before the reformat. At the moment, everything seems to be OK and any btrfs errors were corrected (according to the logs). Since the server is running a parity check after the forced reboot, I'm going to let it sit for the time being. If I see anything else pop up in the system log, or if any other abnormal activity occurs, I'll reply back here (hopefully with logs).

Edited by MarkRMonaco

Regarding BTRFS I would highly recommend not using it.  You will find there are two opinions on this, but if you search around the forums you will see quite a number of cases like yours, where there are BTRFS errors suddenly happening and they require a reformat to resolve.  With that kind of risk hanging over your file system it makes the benefits of BTRFS kinda pointless IMO.

 

Right now, the best bet is to use XFS for everything, which means you can't run a mirrored cache, but IMO you're far more likely to have a BTRFS failure than a disk failure.


Another update: I had a few more issues pop up last night. First, the WebGUI was complaining that the license file was missing/corrupted, so I downloaded a fresh copy of my key file and placed it on the USB drive. I then moved the drive to a different port on the computer (because I was also seeing "reset SuperSpeed Gen 1 USB device" in the syslog). Once the system was back up, it ran for several hours without issues (before I went to bed).

 

When I woke up this morning, I found the system unresponsive (hard-locked). I verified in the BIOS that Global C-States was already off and that typical current was already enabled, so I turned off spread spectrum (XMP was still enabled at that time). Once it was back up, I added "rcu_nocbs=0-15" to the syslinux config and rebooted (at the time I didn't realize that I had mistyped it as "cu_nocbs=0-15" in the config).
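
A typo like that is easy to miss because a mistyped parameter name will not match the option it was meant to set. As a minimal sketch, the Python below checks whether a parameter is present on an append line. The two sample append lines are hypothetical (the actual syslinux config was not posted in this thread), and find_param is a helper name made up for illustration.

```python
def find_param(append_line, name):
    """Return the value of `name=value` on a kernel append line, or None."""
    for token in append_line.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


# Hypothetical append lines, for illustration only.
INTENDED = "append rcu_nocbs=0-15 initrd=/bzroot"
MISTYPED = "append cu_nocbs=0-15 initrd=/bzroot"

if __name__ == "__main__":
    for line in (INTENDED, MISTYPED):
        value = find_param(line, "rcu_nocbs")
        status = f"set to {value}" if value is not None else "NOT present"
        print(f"{line!r}: rcu_nocbs {status}")
```

After a reboot, the same function could be pointed at the contents of /proc/cmdline to confirm what the running kernel actually received.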

 

After that, I went out to the store and came back a few hours later to another hard lock. This time, I went back into the BIOS and turned off XMP. Once the system was back up, I corrected the "rcu_nocbs=0-15" entry and rebooted. Then I opened PuTTY on my other computer and began a tail on the syslog. Note that I already have the syslog server enabled on the Unraid system, looping back to itself, but I have never been able to get anything to write to the share.

 

As for the USB drive itself, I have Unraid configured to use UEFI mode.

Edited by MarkRMonaco
27 minutes ago, jonathanm said:

You must change the desired format type to XFS if that's what you want to use. That option will not be available if you have more than 1 slot available for cache disks.

@jonathanm, that would make a lot more sense. I have two slots configured (as I was going to add a 2nd drive a few months ago, but never did). I'll worry about reformatting once I get my system stabilized.

Edited by MarkRMonaco

@Altheran, if you have not done it already, I would suggest moving Plex to its own dedicated SSD via Unassigned Devices. I have mine running on a second NVMe SSD formatted as XFS, with my Plex Docker container's appdata mapped to the unassigned drive's mount point. It has helped take load off Unraid's cache SSD (in both performance and storage used) and has not been affected by my recent stability issues.

 

You can use Krusader to move the Plex appdata when its container is not running.

Edited by MarkRMonaco
3 hours ago, JorgeB said:

This means data corruption, usually the result of bad RAM (or with Ryzen just overclocked RAM), note that xfs won't catch this, but it still might be happening.

Since XMP was the last thing that I turned off and the system has been fine since, I can at least rule out the RAM being bad. Just odd that this happened now (unless it was building over time).

  • MarkRMonaco changed the title to [SOLVED] BTRFS Corrupt Error on Cache Drive [6.9.0-rc2]
