BTRFS Errors on Cache drive



I'm using the cache drive on both of my servers exclusively for VMs and Docker. Even on my backup server I make backup copies of the Docker/VM files from time to time. I recently saw errors when I was trying to copy two of my qcow files. After a parity check and a btrfs scrub via the excellent web GUI I see these log entries:

Apr 24 10:01:57 Tower2 kernel: scrub_handle_errored_block: 163 callbacks suppressed
Apr 24 10:01:57 Tower2 kernel: btrfs_dev_stat_print_on_error: 162 callbacks suppressed
Apr 24 10:01:57 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3491, gen 0
Apr 24 10:02:05 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3492, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3493, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3494, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3495, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3496, gen 0
Apr 24 10:02:09 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3497, gen 0
Apr 24 10:03:09 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3498, gen 0
Apr 24 10:03:23 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3499, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779938816 on dev /dev/sdb1, sector 182443808, root 5, inode 273, offset 10165972992, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3500, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779942912 on dev /dev/sdb1, sector 182443816, root 5, inode 273, offset 10165977088, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3501, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779947008 on dev /dev/sdb1, sector 182443824, root 5, inode 273, offset 10165981184, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3502, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3504, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3505, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3506, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3507, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3508, gen 0

scrub status for 15d1c7e5-9ab0-41da-9386-66aa4837f13d
scrub started at Fri Apr 24 09:58:35 2015 and finished after 291 seconds
total bytes scrubbed: 77.25GiB with 1136 errors
error details: csum=1136
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
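
For reference, the per-device counters in those kernel lines can also be read directly with btrfs-progs; roughly like this, assuming the pool is mounted at /mnt/cache:

btrfs device stats /mnt/cache        # wr/rd/flush/corrupt/gen counters per device
btrfs scrub status /mnt/cache        # summary of the most recent scrub
btrfs scrub start -B /mnt/cache      # re-run a scrub in the foreground (-B waits until it completes)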

 

As said, this is only my backup server, but I would like to know how to move forward since this can happen on the production system as well. So what is the guidance / best practice? Forget about these files, reformat the drive and start over? Hopefully not  :o

Link to comment

Here is what the web gui is showing:

SMART overall-health :	PASSED

Additionally:

Warning: ATA error count 0 inconsistent with error log pointer 4

ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40   at LBA = 0x005d40e8 = 6111464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  28d+08:56:25.408  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  28d+08:56:25.408  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error -1 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 e8 40 5d 40   at LBA = 0x005d40e8 = 6111464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 03 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  28d+08:56:25.408  SET FEATURES [Enable SATA feature]

Error -2 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40   at LBA = 0x005d40e8 = 6111464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

Error -3 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 e8 40 5d 40   at LBA = 0x005d40e8 = 6111464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

Error -4 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40   at LBA = 0x005d40e8 = 6111464

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 01 e9 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

 

Interesting detail: both VMs (the ones whose qcow files I can't copy) are running perfectly well. I have also updated to the latest firmware.

Link to comment

That's only a small part of the complete SMART report, and it appears that the SMART data has a little corruption.  A single error is shown, repeated five times, but without the actual cause of the error, just that a read of a specific LBA was requested when it happened.

 

The current webGui does not yet provide the full SMART report, so you will need to do it the old way and obtain a SMART report manually.

 

What you are looking for in the complete SMART report, or in the SMART Attributes in the webGui, is whether there are sector issues, and what the drive's operational age is, so you can determine whether the time of the errors (3067 hours) is recent or long ago.

Link to comment

@RobJ, I'm running a SMART report now using this command: smartctl -a -A -t long /dev/sdb >/boot/smart.txt

 

It should be quick as this is an SSD.
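
Side note: -t long only queues the drive's extended self-test, so the results have to be read once it finishes; roughly like this, assuming the drive stays at /dev/sdb:

smartctl -t long /dev/sdb                        # start the extended self-test (runs inside the drive)
smartctl -a /dev/sdb > /boot/smart.txt           # after it has finished, dump attributes plus the self-test log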

 

@Jon, here are the two messages:

root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2

 

root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdr/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdr/yavdr.qcow2

 

By the way, I was not aware that VMs can't reside on the cache drive, and my guess is that most cache drives are btrfs formatted (following your advice on how you are setting up the LT server).

Link to comment

By the way, I was not aware that VMs can't reside on the cache drive, and my guess is that most cache drives are btrfs formatted (following your advice on how you are setting up the LT server).

They can reside on the cache drive.    It was just that qcow2 files created on earlier betas on BTRFS-formatted disks can have a problem.  I believe that this issue is now resolved for VMs created via the KVM Manager that is now integrated into unRAID.
Link to comment

Enclosed you will find the SMART report.

 

@itimpi, you are right, I remember that we were asked to reformat the drives... which I did.

At least my memory is not failing me!

 

I switched my cache drive to XFS at the time.  I also switched my VMs to img (raw) files, as they seem to give much better performance and I was not particularly interested in the features that qcow2 supports, such as snapshots.

Link to comment

I too had changed my cache from a btrfs pool to xfs because of similar errors with qcow2 files.  Also, if I installed a mythbuntu VM I couldn't back it up to the array without an I/O error.

 

But yesterday I decided to try a cache pool again.  I backed up everything to the array, then copied everything back except the VMs with qcow2 images. I really don't need the snapshot or other features of qcow2 either, so instead of copying I used this to move my images to raw format on the cache pool:

 

qemu-img convert -O raw windows7.qcow2 /mnt/cache/domains/windows7/windows7.raw

 

There was also a thread about turning off the COW flag on directories in btrfs so you could use qcow. Maybe that's in the new VM manager.

Link to comment

@RobJ, I'm running a SMART report now using this command: smartctl -a -A -t long /dev/sdb >/boot/smart.txt

 

It should be quick as this is an SSD.

 

@Jon, here are the two messages:

root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2

 

root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdr/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdr/yavdr.qcow2

 

By the way, I was not aware that VMs can't reside on the cache drive, and my guess is that most cache drives are btrfs formatted (following your advice on how you are setting up the LT server).

It's not that they can't reside there. They absolutely can, but for qcow images you should disable copy-on-write on btrfs for those files, because otherwise you are duplicating the effort (COW at the filesystem level AND at the virtual disk level).

 

If you create qcow virtual disks using our VM manager, it should set the NOCOW bit automatically, but the image will lose this setting if you copy it off to a non-btrfs disk for backup.

 

You cannot add the NOCOW flag to an existing file on btrfs, but you can add it to a folder and then copy the image back in from a non-btrfs device.

 

Or you could make your lives easier and just use raw...

 

To set NOCOW on a directory:

 

chattr +C -R /mnt/cache/path/to/folder

 

The R and C above are case sensitive.
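
A rough sketch of the whole round-trip (the directory name and the backup location below are just examples):

mkdir /mnt/cache/.VMs/yavdr-nocow
chattr +C /mnt/cache/.VMs/yavdr-nocow                       # new files created in here inherit NOCOW
cp /path/to/non-btrfs/backup/yavdr.qcow2 /mnt/cache/.VMs/yavdr-nocow/
lsattr /mnt/cache/.VMs/yavdr-nocow/yavdr.qcow2              # the 'C' flag should now be set on the copy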

Link to comment

I created these files a while ago, so I was not using the latest VM Manager.

The easiest step might be to convert both files into raw format. Let's see whether that also produces I/O errors.

I'm also asking myself why the other two VM files were copied without any hassle. They are in qcow format as well...

 

Ah, one question: the SMART report looks OK, correct?

Link to comment

This is what I was afraid of. It works for the Win8 VM (which I was also able to copy into the backup share):

root@Tower2:/mnt/cache/.VMs/win8# qemu-img convert -O raw win8.qcow2 win8.raw
root@Tower2:/mnt/cache/.VMs/win8# v
total 44132716
-rw------- 1 root root 22832873472 Apr 24 14:44 win8.qcow2
-rw-r--r-- 1 root root 26843545600 Apr 26 11:26 win8.raw
-rw------- 1 root root        4416 Apr 18 11:37 win8.xml
-rw-rw-rw- 1 root root        3924 Apr 18 11:38 win8VNC.xml

 

...and doesn't work for the VM where I got I/O errors:

root@Tower2:/mnt/cache/.VMs/yavdr# qemu-img convert -O raw yavdr.qcow2 yavdr.raw
qemu-img: error while reading sector 19819008: Input/output error

 

Any idea if there is a way to get this VM file copied/converted? As a reminder: this is an Ubuntu VM that is working very well.

Link to comment

That's the same error I would get with mythbuntu even though it ran just fine.  I tried copying it various ways without success. I think I tried qemu-img check. You could try virt-clone or virsh migrate.  You may have to create another Ubuntu VM, then just SSH into both and copy your settings and data over to the new one.
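
If you just want to salvage as much of the image as possible, you could also try a plain dd that skips the unreadable blocks (only a sketch; the destination path is just an example, the bad 4K blocks come back as zeros, and the guest filesystem will likely need an fsck afterwards):

dd if=/mnt/cache/.VMs/yavdr/yavdr.qcow2 of=/mnt/disk1/backup/yavdr.qcow2 bs=4096 conv=noerror,sync
# conv=noerror keeps going past read errors, sync pads each failed block with zeros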

Link to comment

Thanks dmacias, that doesn't work; the filesystem might be completely screwed up. It doesn't matter, as this is just my test & backup server.

 

Anyhow, I'm now searching for how to start over with my two SSDs, which I would like to reformat in order to get rid of the errors. I have a 128GB and a 64GB SSD and would like to combine the capacities. Unfortunately the guide is outdated and the information is scattered all around the forum. I have to investigate how to do it.

Link to comment

I don't know that the VM filesystem is screwed up; it's just something weird. I tested it before on a fresh install of mythbuntu and got the same error with a qcow2 file: can't copy, I/O error. A quick way to format from the webGui would be to stop the array and change the filesystem of the drive to xfs. Start the array, format, stop the array, change back to btrfs, then start the array and format again; do this for each drive. Then add the drives to cache slots 1 and 2.  You may have to decrease drive slots to gain cache slots.

Link to comment

Thanks dmacias, that doesn't work; the filesystem might be completely screwed up. It doesn't matter, as this is just my test & backup server.

 

Anyhow, I'm now searching for how to start over with my two SSDs, which I would like to reformat in order to get rid of the errors. I have a 128GB and a 64GB SSD and would like to combine the capacities. Unfortunately the guide is outdated and the information is scattered all around the forum. I have to investigate how to do it.

Any luck? I just switched back from a cache pool to two separate xfs drives, one for cache and one for VMs. I started seeing errors even though I had raw images in a NOCOW directory. Scrub would show unrecoverable errors, and when I went to copy my mythbuntu VM to the array I got an I/O error, so I had to use a backup. I haven't had any hard resets.

Link to comment
Any luck? I just switched back from a cache pool to two separate xfs drives, one for cache and one for VMs. I started seeing errors even though I had raw images in a NOCOW directory. Scrub would show unrecoverable errors, and when I went to copy my mythbuntu VM to the array I got an I/O error, so I had to use a backup. I haven't had any hard resets.

 

Similar picture here... luckily no I/O errors, so I can copy the VM files, but a massive amount of csum errors. I also didn't have any hard crash of the server.

 

I have created the VMs completely from scratch, all in raw format, but scrub is again showing:

scrub status for e46946cf-15f4-4a47-9fe6-94a9ae7088d4
scrub started at Tue May  5 09:57:51 2015 and finished after 136 seconds
total bytes scrubbed: 72.38GiB with 1837 errors
error details: csum=1837
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and the log:

May 5 10:00:06 Tower2 kernel: btrfs_dev_stat_print_on_error: 1316 callbacks suppressed
May 5 10:00:06 Tower2 kernel: BTRFS: bdev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 3412, gen 0
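
To see which files the new csum errors land on, I can pull the affected paths out of the syslog (assuming the standard /var/log/syslog location), since the scrub logs a "checksum error at logical ... (path: ...)" line for each bad block:

grep 'checksum error at' /var/log/syslog | grep -o 'path: [^)]*' | sort | uniq -c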

Link to comment
