Grimjack Posted August 3, 2020
I have had my Unraid server for a little over a year and have had recurring crashes at 7-14 day intervals, as well as BTRFS corruption of my cache drive pool. I have moved the cache drives to different controllers (NVMe, add-in card and motherboard), and I replaced my original Samsung 970 NVMe cache drives with Crucial SATA SSDs. My docker.img becomes corrupted and I have to delete it and restore my containers, etc. It is very frustrating, and when I search for BTRFS corruption I don't see any specific issues; most posts say to reformat the cache and restore. Doing this every few weeks is maddening. I have attached my diagnostics and would really appreciate some help in determining how to troubleshoot this and make my Unraid server stable. Thank you in advance for any assistance.
enterprise-diagnostics-20200803-0915.zip
JorgeB Posted August 3, 2020
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Data corruption on btrfs suggests a hardware problem, like bad RAM. Start by running memtest, and also make sure the RAM is on the QVL for that board.
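If it helps to track these counters over time, the per-device `errs:` lines quoted above can be pulled out of a syslog with a short script. This is only a minimal sketch: the function name `parse_btrfs_errs` and the regular expression are illustrative assumptions based on the two log lines in this thread, not part of Unraid or btrfs tooling.

```python
import re

# Assumed pattern, derived only from the two kernel log lines quoted in this thread.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(log_text):
    """Return {device: {counter_name: value}} for every btrfs 'errs:' line found.

    Hypothetical helper: if a device appears more than once, the last line wins.
    """
    result = {}
    for match in BTRFS_ERRS.finditer(log_text):
        counters = {
            name: int(value)
            for name, value in match.groupdict().items()
            if name != "dev"
        }
        result[match.group("dev")] = counters
    return result


# The two lines from the diagnostics, copied verbatim.
SAMPLE = """\
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
"""

if __name__ == "__main__":
    for dev, counters in parse_btrfs_errs(SAMPLE).items():
        # Show only the counters that are non-zero for each pool member.
        nonzero = {name: value for name, value in counters.items() if value}
        print(dev, nonzero)
```

Here both pool members report a non-zero `corrupt` counter and zero write, read and flush errors, which is the pattern that points at corrupted data rather than failed I/O.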
Grimjack (Author) Posted August 3, 2020
7 minutes ago, johnnie.black said:
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdj1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Aug 2 16:06:32 Enterprise kernel: BTRFS info (device sdj1): bdev /dev/sdi1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Data corruption on btrfs suggests a hardware problem, like bad RAM, start by running memtest, also make sure RAM is on the QVL for that board.
Thank you very much for the response. I ran a 24 hour memtest on the system with zero errors a few months ago, and the memory was on the list.
Grimjack (Author) Posted August 3, 2020
I also see these errors frequently on my syslog server:
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
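To get a feel for how often these PCIe errors occur and on which port, the `PCIe Bus Error` lines can be tallied from a syslog export. A rough sketch, assuming the log format shown above; the function name `count_aer_errors` and the pattern are hypothetical and were written against this one sample only.

```python
import re
from collections import Counter

# Assumed pattern, based on the single "PCIe Bus Error" line quoted in this thread.
AER_LINE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+),"
)


def count_aer_errors(log_text):
    """Count 'PCIe Bus Error' lines, keyed by (port address, severity, error type)."""
    counts = Counter()
    for match in AER_LINE.finditer(log_text):
        key = (match.group("port"), match.group("severity"), match.group("type"))
        counts[key] += 1
    return counts


# The four syslog lines from the post, copied verbatim.
SAMPLE = """\
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: [ 6] BadTLP
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
2020-08-03 06:58:45 Error Enterprise kern kernel pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
2020-08-03 06:58:45 Info Enterprise kern kernel pcieport 0000:00:01.1: AER: Corrected error received: 0000:00:00.0
"""

if __name__ == "__main__":
    for (port, severity, err_type), n in count_aer_errors(SAMPLE).most_common():
        print(f"{port}  {severity:<12} {err_type:<20} {n}")
```

Only the `PCIe Bus Error` line of each four-line group is counted, so one logged event adds one to the tally.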
JorgeB Posted August 3, 2020
Still most likely a hardware issue. If you can, try other RAM, or another board/CPU/RAM combo.
Grimjack (Author) Posted August 10, 2020
I replaced the RAM with certified RAM, and also replaced my SAS controller and my video card (I was worried about those PCIe errors). After 3 days I had another crash. Attached is the log dump of the crash and a few hours of errors before it. I have never been good at making sense of these messages.
All_2020-8-10-9 22 39.xlsx