Cache Drive BTRFS Error/Read-Only. Drive Problem?

May 29, 2016

Hi, I'm currently running v6.2beta21, but this has been a recurring issue for at least the past few beta versions, if I remember correctly. The machine runs fine for 12-24 hours, sometimes longer, but then I come back to watch a movie on Plex or start work on the VM I use as a development server, and everything is dead. The sites hosted on the VM all show an error in the browser that I don't see anywhere else, my SSH sessions to that machine have all errored out, and Plex shows "Media cannot be located" or something along those lines whenever I try to open a show or movie.

The problem, to me, appears to be a failed read or write operation on the cache drive (where all of the VMs, Plex, etc. are stored), which then results in the file system being put into read-only mode. This looks like a general hardware issue, but the drive shows 0 errors and always has. This has happened probably 100 times by now, and I have never seen anything other than 0 errors in the cache drive row or for any other drive. When the system is up and working, it's flawless.

My first thought was that it's an issue with the HBA card, the cables, or even the hot-swap bays. I moved the drives around in the bays so that the cache drive took the place of one of the 4TB data drives that had been in the system for months with no issues. When I restarted, everything was normal, but the issue still happened at what seems like the same rate. The problem doesn't show up as a drive error, even though that seems to be what is happening. If you guys don't think it's the drive, I could try taking the SSD out of the hot-swap bay and plugging it straight into a SATA cable. I just don't see how it could be I/O, because this always seems to happen when I'm not near my PC, and when I am at my PC I regularly run a heavy workload through that drive; it hosts databases, etc. What are the chances that it hits an I/O error when I'm not using it, rather than when it's doing thousands of reads/writes per minute while I'm working?

Here is what I think is the "important" part of the syslog (I've also included the full thing at the bottom because I could definitely be wrong). This is where it seems the error starts.

May 28 06:51:51 TOWER kernel: sd 1:0:3:0: [sde] tag#0 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	sas_address(0x4433221104000000), phy(4)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	enclosure_logical_id(0x500605b0079b2610),slot(4)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	handle(0x000b), ioc_status(success)(0x0000), smid(19)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	request_len(0), underflow(0), resid(0)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	tag(65535), transfer_count(0), sc->result(0x00000000)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
May 28 06:51:51 TOWER kernel: mpt2sas_cm0: 	[sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)
May 28 06:51:51 TOWER kernel: blk_update_request: 13 callbacks suppressed
May 28 06:51:51 TOWER kernel: blk_update_request: I/O error, dev sde, sector 0
May 28 06:51:51 TOWER kernel: btrfs_dev_stat_print_on_error: 13 callbacks suppressed
May 28 06:51:51 TOWER kernel: BTRFS error (device sde1): bdev /dev/sde1 errs: wr 788, rd 9, flush 1, corrupt 0, gen 0
May 28 06:51:51 TOWER kernel: BTRFS: error (device sde1) in write_all_supers:3620: errno=-5 IO failure (errors while submitting device barriers.)
May 28 06:51:51 TOWER kernel: BTRFS info (device sde1): forced readonly
May 28 06:51:51 TOWER kernel: ------------[ cut here ]------------
May 28 06:51:51 TOWER kernel: WARNING: CPU: 6 PID: 754 at fs/btrfs/tree-log.c:2936 btrfs_sync_log+0x7a3/0x9c5()
May 28 06:51:51 TOWER kernel: BTRFS: Transaction aborted (error -5)
May 28 06:51:51 TOWER kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net vhost macvtap macvlan xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod tun bonding coretemp kvm_intel kvm ata_piix mpt3sas bnx2 hpsa raid_class scsi_transport_sas ipmi_si pcc_cpufreq acpi_cpufreq
May 28 06:51:51 TOWER kernel: CPU: 6 PID: 754 Comm: qemu-system-x86 Tainted: G          I     4.4.6-unRAID #1
May 28 06:51:51 TOWER kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 03/30/2010
May 28 06:51:51 TOWER kernel: 0000000000000000 ffff880224e27ca0 ffffffff813688da ffff880224e27ce8
May 28 06:51:51 TOWER kernel: 0000000000000b78 ffff880224e27cd8 ffffffff8104a28a ffffffff812f65d3
May 28 06:51:51 TOWER kernel: ffff8807087d6800 ffff8800979b7800 00000000fffffffb ffff8801f95a9800
May 28 06:51:51 TOWER kernel: Call Trace:
May 28 06:51:51 TOWER kernel: [<ffffffff813688da>] dump_stack+0x61/0x7e
May 28 06:51:51 TOWER kernel: [<ffffffff8104a28a>] warn_slowpath_common+0x8f/0xa8
May 28 06:51:51 TOWER kernel: [<ffffffff812f65d3>] ? btrfs_sync_log+0x7a3/0x9c5
May 28 06:51:51 TOWER kernel: [<ffffffff8104a2e6>] warn_slowpath_fmt+0x43/0x4b
May 28 06:51:51 TOWER kernel: [<ffffffff812f65d3>] btrfs_sync_log+0x7a3/0x9c5
May 28 06:51:51 TOWER kernel: [<ffffffff812d3626>] btrfs_sync_file+0x23a/0x29e
May 28 06:51:51 TOWER kernel: [<ffffffff812d3626>] ? btrfs_sync_file+0x23a/0x29e
May 28 06:51:51 TOWER kernel: [<ffffffff8112d146>] vfs_fsync_range+0x87/0x99
May 28 06:51:51 TOWER kernel: [<ffffffff8112d16f>] vfs_fsync+0x17/0x19
May 28 06:51:51 TOWER kernel: [<ffffffff8112d19d>] do_fsync+0x2c/0x45
May 28 06:51:51 TOWER kernel: [<ffffffff8112d3ad>] SyS_fdatasync+0xe/0x12
May 28 06:51:51 TOWER kernel: [<ffffffff8161a0ae>] entry_SYSCALL_64_fastpath+0x12/0x6d
May 28 06:51:51 TOWER kernel: ---[ end trace 800bc7cd3c709081 ]---
May 28 06:51:51 TOWER kernel: BTRFS: error (device sde1) in btrfs_sync_log:2936: errno=-5 IO failure
May 28 06:51:53 TOWER shfs/user: shfs_create: open: /mnt/cache/Config/mongonew/diagnostic.data/metrics.interim.temp (30) Read-only file system
May 28 06:52:23 TOWER kernel: loop: Write error at byte offset 3397902336, length 4096.
May 28 06:52:23 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6636528
May 28 06:52:23 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
May 28 06:54:31 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 06:55:29 TOWER shfs/user: shfs_write: write: (30) Read-only file system
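For reference, the [sense_key,asc,ascq] triple in the mpt2sas_cm0 lines above can be decoded with a small lookup. In the standard SCSI tables, sense key 0x06 is UNIT ATTENTION, ASC/ASCQ 0x29/0x00 is "power on, reset, or bus device reset occurred", and CDB opcode 0x35 is SYNCHRONIZE CACHE(10). Read that way, the device reported a reset while the kernel was trying to flush its cache, which would fit the "errors while submitting device barriers" message, although it does not say what caused the reset. Below is a rough Python sketch; `decode_sense` and the two tables are illustrative names, and only the few codes relevant here are listed.

```python
import re

# Partial lookup tables (assumption: just enough entries for this log;
# the full SCSI sense key and ASC/ASCQ lists are much longer).
SENSE_KEYS = {
    0x00: "NO SENSE",
    0x02: "NOT READY",
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x05: "ILLEGAL REQUEST",
    0x06: "UNIT ATTENTION",
    0x0B: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
}

# Matches the bracketed triple as printed by the mpt2sas driver in the log.
SENSE_RE = re.compile(
    r"\[sense_key,asc,ascq\]: "
    r"\[(0x[0-9a-fA-F]+),(0x[0-9a-fA-F]+),(0x[0-9a-fA-F]+)\]"
)

def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text) for a log line, or None."""
    m = SENSE_RE.search(line)
    if m is None:
        return None
    key, asc, ascq = (int(x, 16) for x in m.groups())
    key_text = SENSE_KEYS.get(key, f"unknown sense key {key:#04x}")
    asc_text = ASC_ASCQ.get((asc, ascq), f"unlisted ASC/ASCQ {asc:#04x}/{ascq:#04x}")
    return key_text, asc_text
```

Feeding it the `[sense_key,asc,ascq]: [0x06,0x29,0x00], count(18)` line from the excerpt returns the UNIT ATTENTION / reset pair; any code not in the tables comes back labelled as unknown rather than guessed.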

Then there are a bunch of entries of that same "Read-only file system" error and then some more stuff:

May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515129856, length 4096.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865488
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515260416, length 512.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865743
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515390976, length 1024.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865998
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3516403712, length 4096.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6867976
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3516534272, length 512.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6868231
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3516664832, length 1024.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6868486
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515129856, length 4096.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865488
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515260416, length 512.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865743
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3515390976, length 1024.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6865998
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: loop: Write error at byte offset 3516403712, length 4096.
May 28 06:59:12 TOWER kernel: blk_update_request: I/O error, dev loop0, sector 6867976
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS error (device loop0): bdev /dev/loop0 errs: wr 12, rd 0, flush 0, corrupt 0, gen 0
May 28 06:59:12 TOWER kernel: BTRFS: error (device loop0) in btrfs_commit_transaction:2124: errno=-5 IO failure (Error while writing out transaction)
May 28 06:59:12 TOWER kernel: BTRFS info (device loop0): forced readonly
May 28 06:59:12 TOWER kernel: BTRFS warning (device loop0): Skipping commit of aborted transaction.
May 28 06:59:12 TOWER kernel: ------------[ cut here ]------------
May 28 06:59:12 TOWER kernel: WARNING: CPU: 0 PID: 10860 at fs/btrfs/transaction.c:1746 cleanup_transaction+0x8f/0x24c()
May 28 06:59:12 TOWER kernel: BTRFS: Transaction aborted (error -5)
May 28 06:59:12 TOWER kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4 ebtable_filter ebtables vhost_net vhost macvtap macvlan xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod tun bonding coretemp kvm_intel kvm ata_piix mpt3sas bnx2 hpsa raid_class scsi_transport_sas ipmi_si pcc_cpufreq acpi_cpufreq
May 28 06:59:12 TOWER kernel: CPU: 0 PID: 10860 Comm: btrfs-transacti Tainted: G        W I     4.4.6-unRAID #1
May 28 06:59:12 TOWER kernel: Hardware name: HP ProLiant DL380 G6, BIOS P62 03/30/2010
May 28 06:59:12 TOWER kernel: 0000000000000000 ffff8806f079fcd0 ffffffff813688da ffff8806f079fd18
May 28 06:59:12 TOWER kernel: 00000000000006d2 ffff8806f079fd08 ffffffff8104a28a ffffffff812bff7b
May 28 06:59:12 TOWER kernel: ffff880208c7f000 ffff88047b8b8a20 ffff88020855caa0 00000000fffffffb
May 28 06:59:12 TOWER kernel: Call Trace:
May 28 06:59:12 TOWER kernel: [<ffffffff813688da>] dump_stack+0x61/0x7e
May 28 06:59:12 TOWER kernel: [<ffffffff8104a28a>] warn_slowpath_common+0x8f/0xa8
May 28 06:59:12 TOWER kernel: [<ffffffff812bff7b>] ? cleanup_transaction+0x8f/0x24c
May 28 06:59:12 TOWER kernel: [<ffffffff8104a2e6>] warn_slowpath_fmt+0x43/0x4b
May 28 06:59:12 TOWER kernel: [<ffffffff812bff7b>] cleanup_transaction+0x8f/0x24c
May 28 06:59:12 TOWER kernel: [<ffffffff81075fd3>] ? wait_woken+0x6d/0x6d
May 28 06:59:12 TOWER kernel: [<ffffffff81075b24>] ? __wake_up+0x3f/0x46
May 28 06:59:12 TOWER kernel: [<ffffffff812c11cd>] btrfs_commit_transaction+0x9c6/0x9e1
May 28 06:59:12 TOWER kernel: [<ffffffff812bcbaa>] transaction_kthread+0xfa/0x1cd
May 28 06:59:12 TOWER kernel: [<ffffffff812bcbaa>] ? transaction_kthread+0xfa/0x1cd
May 28 06:59:12 TOWER kernel: [<ffffffff812bcab0>] ? btrfs_cleanup_transaction+0x45e/0x45e
May 28 06:59:12 TOWER kernel: [<ffffffff8105f870>] kthread+0xcd/0xd5
May 28 06:59:12 TOWER kernel: [<ffffffff8105f7a3>] ? kthread_worker_fn+0x137/0x137
May 28 06:59:12 TOWER kernel: [<ffffffff8161a3ff>] ret_from_fork+0x3f/0x70
May 28 06:59:12 TOWER kernel: [<ffffffff8105f7a3>] ? kthread_worker_fn+0x137/0x137
May 28 06:59:12 TOWER kernel: ---[ end trace 800bc7cd3c709082 ]---
May 28 06:59:12 TOWER kernel: BTRFS: error (device loop0) in cleanup_transaction:1746: errno=-5 IO failure
May 28 06:59:12 TOWER kernel: BTRFS info (device loop0): delayed_refs has NO entry
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:06:00 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:10:20 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:10:20 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:10:20 TOWER shfs/user: shfs_write: write: (30) Read-only file system
May 28 07:10:20 TOWER shfs/user: shfs_write: write: (30) Read-only file system
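To see the order in which things failed, it can help to pull just the "forced readonly" lines out of the syslog. The Python sketch below does that; `readonly_events` is a made-up helper name, and the regex assumes the exact line format shown in the excerpts above. In this log, sde1 goes read-only first and loop0 follows about seven minutes later. loop0 is presumably an image file stored on the cache drive, so its write errors look like a consequence of the first failure rather than a second fault, but that is an assumption.

```python
import re

# Assumes syslog lines of the form:
#   May 28 06:51:51 TOWER kernel: BTRFS info (device sde1): forced readonly
READONLY_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"BTRFS info \(device (?P<dev>\S+)\): forced readonly"
)

def readonly_events(lines):
    """Return [(timestamp, device)] in log order for 'forced readonly' lines."""
    events = []
    for line in lines:
        m = READONLY_RE.search(line)
        if m:
            events.append((m["ts"], m["dev"]))
    return events
```

To run it over a saved copy of the log, something like `readonly_events(open("syslog.txt"))` would work, where `syslog.txt` is a placeholder file name.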

I tried to attach the full log from boot to the error, but it's way too large, so I uploaded it here: https://mega.nz/#!XMtyWbRR!TO_UHT1biwgpugU4Vj99lXcCy-ropyEPR00eEWXta1U. I actually rebooted the array after this, but none of that shows in the log. Also, when I try to download the syslog, I just get a .zip with an empty txt file, so I copied this from the syslog page of the webui.

Some other info that might be useful:

- The cache is a single drive, a Samsung 830 Series SSD

- There are 4 array drives (3 data, 1 parity). All are 4TB WD Reds: 2 Pro, 2 non-Pro.

- The system itself is an HP ProLiant with 2 E5540s and 28GB of RAM

- The HBA is an LSI 9200-8e, which goes out via 2 SFF-8088 cables into a little SFF-8088 to SFF-8087 adapter mounted in an expansion slot of a separate chassis, and then to two of these hot-swap bays: http://www.amazon.com/Rosewill-5-25-Inch-3-5-Inch-Hot-swap-SATAIII/dp/B00DGZ42SM

- The SFF/SATA cables and adapters are cheap Chinese things, but again, what are the chances of these problems being related to I/O and yet not happening when an actual workload is going through them? Also, all of the data drives are in the same bays and have never had an issue.

Thanks for your time,

Joe


May 29, 2016

Please see Check Disk File systems. I recommend carefully reading the BTRFS command line section.


May 29, 2016

Author

Sorry, I forgot to mention this in the post. I have tried the scrub command in the webui. If the problem has not occurred yet, it completes the full scrub (about 50GB of data on the drive) without errors. If I try it after the problem has occurred (when everything is broken and the drive has gone into read-only mode), it just says aborted after 00:00:00, which I guess is to be expected.

If I simply copy all of the contents of the cache drive to some other location, then reformat it (either as BTRFS again or as XFS, as the documentation suggests), and copy everything back, what are the chances that all of my VMs and Dockers will be fine? The docs make it seem like that procedure is not super reliable.


May 29, 2016

If I simply copy all of the contents of the cache drive to some other location, then reformat it (either as BTRFS again or as XFS, as the documentation suggests), and copy everything back, what are the chances that all of my VMs and Dockers will be fine? The docs make it seem like that procedure is not super reliable.

I suppose the chances are as good as the files themselves, that is, the files of the VMs and Dockers. The only source of unreliability is how well you can retrieve the files from that file system. If you get good copies, the rest is straightforward.
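One way to check that the copies are good before reformatting is to compare checksums between the cache drive and the backup location. The following is a minimal Python sketch of that idea, not a tested recovery procedure; `sha256_of` and `compare_trees` are hypothetical helper names, and the paths in the usage note are placeholders.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large VM images do not need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def compare_trees(src, dst):
    """Return relative paths that are missing from dst or whose contents differ."""
    src, dst = Path(src), Path(dst)
    problems = []
    for s in sorted(p for p in src.rglob("*") if p.is_file()):
        rel = s.relative_to(src)
        d = dst / rel
        if not d.is_file() or sha256_of(s) != sha256_of(d):
            problems.append(str(rel))
    return problems
```

For example, `compare_trees("/mnt/cache", "/mnt/disk1/cache_backup")` (both paths assumed) returns an empty list if every file under the source also exists with identical contents in the backup. A read error on the source raises an `OSError`, which is itself a sign that the file could not be retrieved cleanly.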
