
Unstable crashes and BTRFS cache corruption



I have had my Unraid server for a little over a year. In that time it has crashed repeatedly, at 7-14 day intervals, and the BTRFS cache pool has become corrupted again and again. I have moved the cache drives between different controllers (NVMe, add-in card and motherboard), and I replaced my original Samsung 970 NVMe cache drives with Crucial SATA SSDs. Each time, my docker.img becomes corrupted and I have to delete it and restore my containers. It is very frustrating. When I search for BTRFS corruption I don't find any specific issues; most answers say to reformat the cache and restore. Doing this every few weeks is maddening. I have attached my diagnostics and would really appreciate some help in working out how to troubleshoot this and make my Unraid server stable.

Thank you in advance for any assistance.

enterprise-diagnostics-20200803-0915.zip

Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

 

Data corruption on btrfs suggests a hardware problem, such as bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
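
For anyone who wants to check these counters across a longer syslog, here is a minimal Python sketch that pulls the per-device values out of the `bdev ... errs:` lines quoted above and reports only the non-zero ones. The function names `parse_btrfs_errs` and `nonzero_counters` are made up for illustration, and the regex assumes the exact line format shown in this thread; other kernel versions may print it differently.

```python
import re

# Assumed format: the per-device counter line exactly as it appears in this thread.
BDEV_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(lines):
    """Return {device: {counter: value}} for every bdev line found."""
    stats = {}
    for line in lines:
        m = BDEV_RE.search(line)
        if m:
            counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
            stats[m.group("dev")] = counters
    return stats


def nonzero_counters(stats):
    """Keep only devices with at least one counter above zero."""
    return {
        dev: {k: v for k, v in counters.items() if v > 0}
        for dev, counters in stats.items()
        if any(v > 0 for v in counters.values())
    }


# The two log lines posted in this thread.
sample = [
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
]

flagged = nonzero_counters(parse_btrfs_errs(sample))
for dev, counters in flagged.items():
    print(dev, counters)
```

On the two lines above, both pool members are flagged with only the `corrupt` counter set, which matches the picture of data corruption rather than read or write failures on one particular drive.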

7 minutes ago, johnnie.black said:

Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

 

Data corruption on btrfs suggests a hardware problem, such as bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.

 

Thank you very much for the response...

I ran a 24-hour memtest on the system a few months ago with zero errors, and the memory was on the QVL.


I also see these errors frequently on my syslog server:

2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: [ 6] BadTLP 

2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000

2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)

2020-08-03  06:58:45  Info  Enterprise  kern  kernel  pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
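
To get a feel for how often these show up, and on which port, a rough Python sketch like the one below can tally the corrected-error lines by port and error name. The name `count_aer_errors` is hypothetical, and the regex assumes the syslog layout pasted above (a `pcieport <address>: [ N] <ErrorName>` line per event); it is only a counting aid, not a diagnosis.

```python
import re
from collections import Counter

# Assumed format: the "[ 6] BadTLP" style line as it appears in this thread.
AER_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): \[\s*(?P<bit>\d+)\]\s*(?P<name>\w+)"
)


def count_aer_errors(lines):
    """Count AER error-name lines, keyed by (port address, error name)."""
    counts = Counter()
    for line in lines:
        m = AER_RE.search(line)
        if m:
            counts[(m.group("port"), m.group("name"))] += 1
    return counts


# The four syslog lines posted above.
sample = [
    "2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: [ 6] BadTLP ",
    "2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000",
    "2020-08-03  06:58:45  Error  Enterprise  kern  kernel  pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "2020-08-03  06:58:45  Info  Enterprise  kern  kernel  pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0",
]

counts = count_aer_errors(sample)
for (port, name), n in counts.most_common():
    print(port, name, n)
```

Run over a full day of syslog, this would show whether the BadTLP events are all coming from the same port (0000:00:01.1 here) and roughly how often.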

