Unstable crashes and BTRFS cache corruption

August 3, 2020

I have had my Unraid server for a little over a year. It has crashed repeatedly at 7-14 day intervals, and the BTRFS cache pool keeps getting corrupted. I have moved the cache drives between controllers (NVMe, add-in card, and motherboard) and replaced my original Samsung 970 NVMe cache drives with Crucial SATA SSDs. My docker.img becomes corrupted and I have to delete it and restore my containers. It is very frustrating. When I search for BTRFS corruption I don't find any specific cause; most answers say to reformat the cache and restore, and doing that every few weeks is maddening. I have attached my diagnostics. I would really appreciate help figuring out how to troubleshoot this and make my Unraid server stable.

Thank you in advance for any assistance.

enterprise-diagnostics-20200803-0915.zip


August 3, 2020

Community Expert

Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0

Data corruption on btrfs suggests a hardware problem, such as bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
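For anyone who wants to pull these counters out of a saved syslog rather than reading them by eye, here is a minimal Python sketch. It assumes the kernel message format shown in the two lines above; the function names `parse_btrfs_errs` and `devices_with_errors` are made up for illustration. The five counters are write, read, and flush I/O errors, checksum (corruption) errors, and generation errors. In the lines above only `corrupt` is non-zero, so the drives reported no I/O failures while the checksums did not match.

```python
import re

# Per-device error counters as btrfs prints them to the kernel log.
# The pattern is an assumption based on the two sample lines in this thread.
BTRFS_ERRS = re.compile(
    r"BTRFS info \(device (?P<fs>\S+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(lines):
    """Return {bdev: {counter: int}}, keeping the last line seen per device."""
    stats = {}
    for line in lines:
        m = BTRFS_ERRS.search(line)
        if m:
            stats[m.group("bdev")] = {k: int(m.group(k)) for k in COUNTERS}
    return stats


def devices_with_errors(stats):
    """Return the sorted list of devices with at least one non-zero counter."""
    return sorted(dev for dev, c in stats.items() if any(c.values()))


SAMPLE = [
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
    "Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
]

if __name__ == "__main__":
    stats = parse_btrfs_errs(SAMPLE)
    for dev in devices_with_errors(stats):
        print(dev, stats[dev])
```

Running it against the full syslog from the diagnostics zip, instead of `SAMPLE`, would show whether the counters keep growing between reboots.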


August 3, 2020

Author

7 minutes ago, johnnie.black said:
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug  2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Data corruption on btrfs suggests a hardware problem, like bad RAM, start by running memtest, also make sure RAM is on the QVL for that board.

Thank you very much for the response...

A few months ago I ran a 24-hour memtest on the system with zero errors, and the memory was on the QVL.


August 3, 2020

Author

I also see these errors frequently on my syslog server:

2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP

2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000

2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)

2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
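To get a sense of how often these appear and whether they always come from the same port, the messages can be tallied with a short script. This is only a sketch that assumes the syslog-server line format pasted above; `tally_aer` is a hypothetical helper name. Severity "Corrected" means the link recovered from the error on its own, but a steady stream from one port may still be worth tracking against whatever is plugged in behind it.

```python
import re
from collections import Counter

# "PCIe Bus Error" summary lines; the format is assumed from the lines above.
AER_EVENT = re.compile(
    r"pcieport (?P<port>[0-9a-fA-F:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)
# Detail lines such as "[ 6] BadTLP" that name the specific error bit.
AER_DETAIL = re.compile(
    r"pcieport (?P<port>[0-9a-fA-F:.]+): \[\s*\d+\] (?P<name>\w+)"
)


def tally_aer(lines):
    """Count AER events per (port, severity, layer) and details per (port, name)."""
    events = Counter()
    details = Counter()
    for line in lines:
        m = AER_EVENT.search(line)
        if m:
            key = (m.group("port"), m.group("severity"), m.group("layer"))
            events[key] += 1
            continue
        m = AER_DETAIL.search(line)
        if m:
            details[(m.group("port"), m.group("name"))] += 1
    return events, details


SAMPLE = [
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP",
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: "
    "device [1022:1453] error status/mask=00000040/00006000",
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: "
    "PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: "
    "AER: Corrected error received: 0000:00:00.0",
]

if __name__ == "__main__":
    events, details = tally_aer(SAMPLE)
    for key, count in events.most_common():
        print(count, *key)
    for key, count in details.most_common():
        print(count, *key)
```

Feeding it a full day of syslog would show whether everything comes from port 0000:00:01.1 or from several ports.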


August 3, 2020

Community Expert

This is still most likely a hardware issue. If you can, try other RAM, or another board/CPU/RAM combo.


August 10, 2020

Author

I replaced the RAM with certified RAM, and also replaced my SAS controller and my video card (I was worried about those PCIe errors). After 3 days I had another crash. Attached is the log dump of the crash and a few hours of errors before it. I have never been good at making sense of these messages.

All_2020-8-10-9 22 39.xlsx
