August 3, 2020
I have had my Unraid server for a little over a year and have had recurring crashes at 7-14 day intervals, as well as BTRFS corruption of my cache drive pool. I have moved the cache drives to different controllers (NVMe, add-in card and motherboard). I replaced my original Samsung 970 NVMe cache drives with Crucial SATA SSDs. My docker.img becomes corrupted and I have to delete it and restore my containers, etc. It is very frustrating, and when I search for BTRFS corruption I don't see any specific issues; most posts say to reformat the cache and restore. Doing this every few weeks is maddening. I have attached my diagnostics. I would really appreciate some help in determining how to troubleshoot this and make my Unraid server stable. Thank you in advance for any assistance.
enterprise-diagnostics-20200803-0915.zip
August 3, 2020 Community Expert
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Data corruption on btrfs suggests a hardware problem, like bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
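The two kernel lines above carry btrfs's per-device error counters (write, read, flush, corruption, generation). As a rough illustration only, the following Python sketch pulls those counters out of syslog text and lists the devices with any non-zero value. The function names `parse_btrfs_errors` and `devices_with_errors` are hypothetical helpers written for this example, and the regular expression assumes the exact line layout shown in the post.

```python
import re

# Assumes the "bdev <device> errs: wr N, rd N, flush N, corrupt N, gen N"
# layout seen in the syslog lines quoted in this thread.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

def parse_btrfs_errors(lines):
    """Return {device: {counter: value}} for every matching syslog line."""
    stats = {}
    for line in lines:
        m = BTRFS_ERRS.search(line)
        if m:
            counters = {k: int(v) for k, v in m.groupdict().items() if k != "dev"}
            stats[m.group("dev")] = counters
    return stats

def devices_with_errors(stats):
    """Return the devices whose counters are not all zero, sorted by name."""
    return sorted(dev for dev, c in stats.items() if any(c.values()))

sample = [
    "Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
    "Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): "
    "bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0",
]
stats = parse_btrfs_errors(sample)
print(devices_with_errors(stats))
```

Here both pool members report `corrupt 2` while the I/O counters stay at zero, which fits the reply's reading that the drives themselves are answering reads and writes normally and the damage is being introduced elsewhere.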
August 3, 2020 Author
7 minutes ago, johnnie.black said: Data corruption on btrfs suggests a hardware problem, like bad RAM, start by running memtest, also make sure RAM is on the QVL for that board.
Thank you very much for the response. I ran a 24-hour memtest on the system a few months ago with zero errors, and the memory is on the board's QVL.
August 3, 2020 Author
I also see these errors frequently on my syslog server:
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
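Since these PCIe messages are described as frequent, it may help to count them per port and error type before deciding which slot or card to look at. The sketch below is a minimal Python example under the assumption that the syslog export uses the line layout shown above; `summarize_aer` is a hypothetical helper name, not part of any Unraid or kernel tooling.

```python
import re
from collections import Counter

# PCI address in domain:bus:device.function form, e.g. 0000:00:01.1
PORT = r"(?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"

# "PCIe Bus Error: severity=..., type=..." summary line
BUS_ERROR = re.compile(
    r"pcieport " + PORT
    + r": PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)
# "[ 6] BadTLP" style line naming the specific error bit
ERROR_BIT = re.compile(r"pcieport " + PORT + r": \[\s*(?P<bit>\d+)\] (?P<name>\w+)")

def summarize_aer(lines):
    """Count bus-error summaries and named error bits per PCIe port."""
    bus_errors = Counter()
    error_names = Counter()
    for line in lines:
        m = BUS_ERROR.search(line)
        if m:
            bus_errors[(m.group("port"), m.group("severity"), m.group("layer"))] += 1
            continue
        m = ERROR_BIT.search(line)
        if m:
            error_names[(m.group("port"), m.group("name"))] += 1
    return bus_errors, error_names

sample = [
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP",
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000",
    "2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)",
    "2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0",
]
bus_errors, error_names = summarize_aer(sample)
for key, count in bus_errors.items():
    print(key, count)
```

Note that `severity=Corrected` means the link recovered from the error on its own, so a tally like this shows which port is noisy but does not by itself establish a connection to the crashes.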
August 3, 2020 Community Expert
Still most likely a hardware issue. If you can, try other RAM, or another board/CPU/RAM combo.
August 10, 2020 Author
I replaced the RAM with certified RAM, and also replaced my SAS controller and my video card (I was worried about those PCIe errors). After 3 days I had another crash. Attached is the log dump of the crash and a few hours of errors before it. I was never good at making sense of these messages.
All_2020-8-10-9 22 39.xlsx