Unstable crashes and BTRFS cache corruption

Grimjack · August 3, 2020

I have had my Unraid server for a little over a year and have had recurring crashes at 7-14 day intervals, as well as BTRFS corruption of my cache drive pool. I have moved the cache drives to different controllers (NVMe, add-in card and motherboard) and replaced my original Samsung 970 NVMe cache drives with Crucial SATA SSDs. My docker.img becomes corrupted and I have to delete it and restore my containers, etc. It is very frustrating, and when I search for BTRFS corruption I don't see any specific issues; most posts say to reformat the cache and restore. Doing this every few weeks is maddening. I have attached my diagnostics. I would really appreciate some help in determining how to troubleshoot this and make my Unraid server stable.

Thank you in advance for any assistance.

enterprise-diagnostics-20200803-0915.zip

JorgeB · August 3, 2020

Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

Data corruption on btrfs suggests a hardware problem, like bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
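
For anyone who wants to watch these counters over time, here is a minimal sketch that pulls the per-device numbers out of syslog lines like the two above. The function name `parse_btrfs_errs` and the regular expression are my own for illustration; the only thing taken from the log is the line format. The fields are the write, read, flush, corruption and generation error counters that btrfs keeps per device.

```python
import re

# Matches the per-device error summary that btrfs writes to the kernel log.
# The pattern is an assumption based on the two log lines quoted above.
STATS_RE = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for a btrfs 'errs:' log line, or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
    return m.group("dev"), counters


sample = [
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
]

for line in sample:
    parsed = parse_btrfs_errs(line)
    if parsed:
        dev, counters = parsed
        # Show only the counters that are non-zero.
        print(dev, {k: v for k, v in counters.items() if v})
```

Both pool members report only corruption errors here, with no read or write I/O errors, which is consistent with the drives themselves responding normally while the data they were handed was already bad.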

Grimjack · August 3, 2020

7 minutes ago, johnnie.black said:
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Data corruption on btrfs suggests a hardware problem, like bad RAM, start by running memtest, also make sure RAM is on the QVL for that board.

Thank you very much for the response...

I ran a 24 hour memtest on the system with zero errors a few months ago and the memory was on the list.

Grimjack · August 3, 2020

I also see these errors frequently on my syslog server:

2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
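
To get a feel for how often these occur and on which port, a rough sketch like the one below could tally the "PCIe Bus Error" lines from an exported syslog. The helper name `count_aer_errors` and the regular expression are hypothetical and only assume the line format shown above; other syslog servers may format the lines differently.

```python
import re
from collections import Counter

# Assumed format, taken from the "PCIe Bus Error" line quoted above.
AER_RE = re.compile(
    r"pcieport (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+)"
)


def count_aer_errors(lines):
    """Count AER bus errors keyed by (port address, severity, error type)."""
    counts = Counter()
    for line in lines:
        m = AER_RE.search(line)
        if m:
            counts[(m.group("bdf"), m.group("severity"), m.group("type"))] += 1
    return counts


log = [
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP",
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: "
    "PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
]

for key, n in count_aer_errors(log).items():
    print(key, n)
```

Only the summary line is counted, so each reported error is tallied once even though the kernel logs several lines per event.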

JorgeB · August 3, 2020

Still most likely a hardware issue. If you can, try other RAM, or another board/CPU/RAM combo.

Grimjack · August 10, 2020

I replaced the RAM with certified RAM, and also replaced my SAS controller and my video card (I was worried about those PCIe errors). After 3 days I had another crash. Attached is the log dump of the crash and a few hours of errors before it. I have never been good at making sense of these messages.

All_2020-8-10-9 22 39.xlsx
