March 25, 2018
Every few weeks I run into an issue with my cache drives that ends up taking the system offline. The disk log shows the following:
Mar 24 10:21:42 TheBeast kernel: BTRFS error (device nvme1n1p1): unable to find ref byte nr 505982324736 parent 0 root 5 owner 5166944 offset 0
Mar 24 10:21:42 TheBeast kernel: BTRFS: error (device nvme1n1p1) in __btrfs_free_extent:7073: errno=-2 No such entry
Mar 24 10:21:42 TheBeast kernel: BTRFS info (device nvme1n1p1): forced readonly
Mar 24 10:21:42 TheBeast kernel: BTRFS: error (device nvme1n1p1) in btrfs_run_delayed_refs:3089: errno=-2 No such entry
Mar 24 10:21:42 TheBeast kernel: BTRFS error (device nvme1n1p1): pending csums is 8060928
I also see BTRFS errors in the syslog around the same time. I've repaired and even completely re-created the cache. The cache pool has two drives, both Samsung SSD 960 256 GB, and the issue recurs every few weeks. What information can I share to help diagnose why this happens and how to keep it from happening again?
March 25, 2018 Community Expert
Please post your diagnostics and the output of:
btrfs dev stats /mnt/cache
March 25, 2018 Author
Thanks for the quick reply. Here is the output from the dev stats command:
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 159
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 159
[/dev/nvme0n1p1].generation_errs 0
March 25, 2018 Community Expert
Please also post your diagnostics: Tools -> Diagnostics
The same number of corruption errors on both devices suggests a RAM issue. Run memtest if you haven't yet.
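The check described above (identical corruption counts on every pool member pointing toward RAM rather than a single failing drive) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `parse_dev_stats`, `nonzero_counters`, and `same_corruption_on_all` are hypothetical helpers, not part of btrfs or Unraid, and the parser assumes the `[device].counter value` line format shown in the output posted in this thread. A match is a hint worth following up with memtest, not a diagnosis.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/nvme1n1p1].corruption_errs 159"
# (format assumed from the output posted in this thread).
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")

# Output of `btrfs dev stats /mnt/cache` as posted above.
SAMPLE = """\
[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 159
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 159
[/dev/nvme0n1p1].generation_errs 0
"""


def parse_dev_stats(text):
    """Return {device: {counter: value}} parsed from dev stats text."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group("dev")][match.group("counter")] = int(match.group("value"))
    return dict(stats)


def nonzero_counters(stats):
    """Keep only the counters that are above zero, per device."""
    return {
        dev: {name: value for name, value in counters.items() if value}
        for dev, counters in stats.items()
        if any(counters.values())
    }


def same_corruption_on_all(stats):
    """True when two or more devices report the same nonzero corruption count."""
    counts = {counters.get("corruption_errs", 0) for counters in stats.values()}
    return len(stats) > 1 and len(counts) == 1 and counts != {0}


if __name__ == "__main__":
    parsed = parse_dev_stats(SAMPLE)
    print(nonzero_counters(parsed))
    if same_corruption_on_all(parsed):
        print("Identical corruption counts on all devices: consider testing RAM.")
```

To use it on a live system, the text passed to `parse_dev_stats` would come from running `btrfs dev stats /mnt/cache` and capturing its output.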
March 25, 2018 Author
Attached are the diagnostic logs requested. I'm about to run a memtest.
unraid-diagnostics-20180325-1541.zip
March 25, 2018 Author
FYI: Unraid's built-in memtest would not boot, but I created bootable media for memtest and it came back clean.
March 25, 2018 Community Expert
You need to let it run for some time, ideally 24 hours. Even then, a run with no errors is not conclusive; only a run that finds errors is. If it's not memory, it's likely another hardware issue. Also look for a BIOS update.
April 26, 2018 Author
Thanks for the pointer. It took several 24-hour memtests before an error actually occurred, but I did finally find a bad memory DIMM. I've replaced the faulty memory and run a clean 24-hour memtest. Hopefully the system will be stable now. I appreciate the help!