EdgarWallace Posted April 24, 2015

I'm using the cache drive on both of my servers exclusively for VMs and Docker. Even on my backup server I make backup copies of the Docker/VM files from time to time. I recently saw errors when I tried to copy two of my qcow2 files. After a parity check and a btrfs scrub via the excellent web GUI I see these log entries:

Apr 24 10:01:57 Tower2 kernel: scrub_handle_errored_block: 163 callbacks suppressed
Apr 24 10:01:57 Tower2 kernel: btrfs_dev_stat_print_on_error: 162 callbacks suppressed
Apr 24 10:01:57 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3491, gen 0
Apr 24 10:02:05 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3492, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3493, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3494, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3495, gen 0
Apr 24 10:02:08 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3496, gen 0
Apr 24 10:02:09 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3497, gen 0
Apr 24 10:03:09 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3498, gen 0
Apr 24 10:03:23 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3499, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779938816 on dev /dev/sdb1, sector 182443808, root 5, inode 273, offset 10165972992, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3500, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779942912 on dev /dev/sdb1, sector 182443816, root 5, inode 273, offset 10165977088, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3501, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: checksum error at logical 98779947008 on dev /dev/sdb1, sector 182443824, root 5, inode 273, offset 10165981184, length 4096, links 1 (path: .VMs/yavdr/yavdr.qcow2)
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3502, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3504, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3505, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3506, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3507, gen 0
Apr 24 10:03:25 Tower2 kernel: BTRFS: bdev /dev/sdb1 errs: wr 0, rd 0, flush 0, corrupt 3508, gen 0

scrub status for 15d1c7e5-9ab0-41da-9386-66aa4837f13d
        scrub started at Fri Apr 24 09:58:35 2015 and finished after 291 seconds
        total bytes scrubbed: 77.25GiB with 1136 errors
        error details: csum=1136
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

As said, this is only my backup server, but I would like to know how to move forward, since this could just as well happen on the production system. So what is the guidance/best practice? Forget about these files, reformat the drive and start over again? Hopefully not.
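For reference, the same check can be run from the console instead of the web GUI. A minimal sketch, assuming the btrfs cache is mounted at /mnt/cache (adjust the path to your own setup):

btrfs scrub start /mnt/cache     # kick off a scrub of the mounted cache filesystem
btrfs scrub status /mnt/cache    # progress and the csum/corrected/uncorrectable totals
btrfs device stats /mnt/cache    # per-device counters; corruption_errs matches the "corrupt" count in the kernel log

The "corrupt 3491 ... 3508" values in the syslog above are the running corruption_errs counter printed by the last command, not a count of errors found in this scrub alone.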
HellDiverUK Posted April 24, 2015

What does SMART say about the drive? In other words, is it a btrfs issue, or is the drive throwing bad sectors?
EdgarWallace Posted April 24, 2015

Here is what the web GUI is showing:

SMART overall-health: PASSED

Additionally:

Warning: ATA error count 0 inconsistent with error log pointer 4
ATA Error Count: 0
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 0 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40  at LBA = 0x005d40e8 = 6111464
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  28d+08:56:25.408  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00  28d+08:56:25.408  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error -1 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 e8 40 5d 40  at LBA = 0x005d40e8 = 6111464
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  60 03 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00  28d+08:56:25.408  SET FEATURES [Enable SATA feature]

Error -2 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40  at LBA = 0x005d40e8 = 6111464
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

Error -3 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 00 e8 40 5d 40  at LBA = 0x005d40e8 = 6111464
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  60 00 01 ee 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

Error -4 occurred at disk power-on lifetime: 3067 hours (127 days + 19 hours)
  When the command that caused the error occurred, the device was active or idle.
  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 01 e8 40 5d 40  at LBA = 0x005d40e8 = 6111464
  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ---------------  --------------------
  60 00 01 e9 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 03 01 ea 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 10 01 eb 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ec 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED
  60 00 01 ed 40 5d 40 00  28d+08:56:25.408  READ FPDMA QUEUED

Interesting detail: both VMs are running perfectly well (the very ones whose qcow2 files I can't copy). I have also updated to the latest firmware.
RobJ Posted April 24, 2015

That's only a small part of the complete SMART report, and it appears the SMART data has a little corruption. A single error is shown, repeated five times, but not the actual cause of the error, just that a read of a specific LBA was being attempted when it happened. The current webGui does not yet provide the full SMART report, so you will need to do it the old way: Obtain a SMART report. What you are looking for in the complete SMART report, or in the SMART attributes in the webGui, is whether there are sector issues, and what the drive's operational age is, so you can determine whether the time of the errors (3067 hours) is recent or long ago.
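A minimal sketch of grabbing the full report from the console, assuming the cache SSD is /dev/sdb (check your own device assignment first):

smartctl -a /dev/sdb > /boot/smart_sdb.txt

In the attribute table, the values worth comparing are Power_On_Hours (to see how long ago hour 3067 was) and the sector-health attributes such as Reallocated_Sector_Ct, Current_Pending_Sector and Offline_Uncorrectable.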
EdgarWallace Posted April 24, 2015

Thanks RobJ, I will follow your advice tomorrow and report back.
lionelhutz Posted April 24, 2015

I was getting similar checksum error lines. I could format the drive and start over, and it would be good for a few weeks before the errors started again. So I switched to XFS and it's been good ever since. Others have had the same thing happen.
jonp Posted April 25, 2015

QCOW vdisks on a btrfs formatted device? That could be the issue. Run this command:

lsattr /mnt/path/to/file

Do not use /mnt/user (specify the disk in the path). Copy and paste the output of that command back here, please.
EdgarWallace Posted April 25, 2015

@RobJ, I'm running a SMART report now using this command:

smartctl -a -A -t long /dev/sdb >/boot/smart.txt

It should be quick as this is an SSD.

@Jon, here are the two outputs:

root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdrHeadless/yavdr.qcow2
root@Tower2:/mnt/cache/.VMs/yavdrHeadless# lsattr /mnt/cache/.VMs/yavdr/yavdr.qcow2
---------------- /mnt/cache/.VMs/yavdr/yavdr.qcow2

By the way, I was not aware that VMs can't reside on the cache drive, and my guess is that most cache drives are btrfs formatted (following your advice on how to set up the LT server).
itimpi Posted April 25, 2015

Quoting EdgarWallace: "By the way, I was not aware that VMs can't reside on the cache drive and my guess is that most cache drives are btrfs formatted..."

They can reside on the cache drive. It was just that qcow2 files created on earlier betas on btrfs formatted disks can have a problem. I believe this issue is now resolved for VMs created via the KVM manager that is now integrated into unRAID.
EdgarWallace Posted April 25, 2015

Enclosed you'll find the SMART report. @itimpi, you are right, I remember that we were asked to reformat the drives, which I did.
itimpi Posted April 25, 2015

Quoting EdgarWallace: "@itimpi, you are right I remember that we were asked to reformat the drives.....which I did."

At least my memory is not failing me! I switched my cache drive to XFS at the time. I also switched my VMs to img (raw) files, as they seem to give much better performance and I was not particularly interested in the features that qcow2 supports, such as snapshots.
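For anyone starting fresh with raw vdisks rather than converting, a minimal sketch (the path and size below are just examples, not anything from this thread):

qemu-img create -f raw /mnt/cache/.VMs/example/example.img 20G   # allocate a new raw vdisk
qemu-img info /mnt/cache/.VMs/example/example.img                # confirm "file format: raw" and the virtual size

A raw file has no qcow2 metadata layer, which is part of why it sidesteps the copy-on-write duplication discussed further down in the thread.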
dmacias Posted April 25, 2015

I too had changed my cache from a btrfs pool to XFS because of similar errors with qcow2 files. Also, if I installed a Mythbuntu VM I couldn't back it up to the array without an I/O error. But yesterday I decided to try a cache pool again. I backed everything up to the array, then copied everything back except the VMs with qcow2 files. I really don't need the snapshot or other features of qcow2 either. So instead of copying, I used this to move my images to raw format and onto the cache pool:

qemu-img convert -O raw windows7.qcow2 /mnt/cache/domains/windows7/windows7.raw

There was also a thread about turning off the COW flag on directories in btrfs so you could use qcow2. Maybe that's handled in the new VM manager.
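If you go this route, it's worth sanity-checking the result and remembering that the VM definition still points at the old file. A hedged sketch, assuming a libvirt domain named windows7 (names and paths are examples only):

qemu-img info /mnt/cache/domains/windows7/windows7.raw   # should report "file format: raw"
virsh edit windows7                                      # update the <source file='...'/> path and set <driver name='qemu' type='raw'/>

Without changing the driver type in the XML, qemu will try to open the raw file as qcow2 and the VM will refuse to start.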
jonp Posted April 25, 2015

Quoting EdgarWallace: "By the way, I was not aware that VMs can't reside on the cache drive and my guess is that most cache drives are btrfs formatted..."

It's not that they can't reside there. They absolutely can, but for qcow images you should disable copy-on-write on btrfs for those files, because otherwise you are duplicating the effort (COW at the filesystem level AND at the virtual disk level). If you create qcow virtual disks using our VM manager, it should set the NOCOW bit automatically, but the file will lose this setting if you copy the image off to a non-btrfs disk for backup. You cannot add the NOCOW flag to an existing file on btrfs, but you can add it to a folder, then copy the image from a non-btrfs device back in. Or you could make your lives easier and just use raw...

To set NOCOW on a directory:

chattr +C -R /mnt/cache/path/to/folder

The R and C above are case sensitive.
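A minimal sketch of that folder-level workflow, with example paths only (the folder name and backup location are assumptions):

mkdir /mnt/cache/.VMs/nocow_example
chattr +C /mnt/cache/.VMs/nocow_example      # new files created in this directory inherit NOCOW
lsattr -d /mnt/cache/.VMs/nocow_example      # should now show a 'C' in the attribute column
cp /mnt/disk1/backup/yavdr.qcow2 /mnt/cache/.VMs/nocow_example/

The flag only applies to files created in the directory after it is set, which is why the image has to be copied in as a fresh file rather than renamed within the same btrfs volume.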
EdgarWallace Posted April 25, 2015

I created these files a while ago, so I was not using the latest VM manager. The easiest step might be to convert both files into raw format; let's see if that also produces I/O errors. I'm also asking myself why the other two VM files were copied without any hassle, as they are qcow2 format as well.

Ah, one question: the SMART report looks OK, correct?
RobJ Posted April 26, 2015

Quoting EdgarWallace: "Ah, one question: the SMART report looks OK, correct?"

The report looks fine so far. It indicated that you had started a 'long' test and that it would finish in 9 more minutes, so you should be able to get another report and see the results of that test.
EdgarWallace Posted April 26, 2015

This is what I was afraid of. It works for the Win8 VM (the one that I was also able to copy into the backup share):

root@Tower2:/mnt/cache/.VMs/win8# qemu-img convert -O raw win8.qcow2 win8.raw
root@Tower2:/mnt/cache/.VMs/win8# v
total 44132716
-rw------- 1 root root 22832873472 Apr 24 14:44 win8.qcow2
-rw-r--r-- 1 root root 26843545600 Apr 26 11:26 win8.raw
-rw------- 1 root root        4416 Apr 18 11:37 win8.xml
-rw-rw-rw- 1 root root        3924 Apr 18 11:38 win8VNC.xml

...and it doesn't work for the VM where I got the I/O errors:

root@Tower2:/mnt/cache/.VMs/yavdr# qemu-img convert -O raw yavdr.qcow2 yavdr.raw
qemu-img: error while reading sector 19819008: Input/output error

Any idea if there is a way to get this VM file copied or converted? As a reminder, this is an Ubuntu VM that is working very well.
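One hedged idea for salvaging the file, since the guest itself still boots: force a copy that skips the unreadable blocks, then convert the copy. This is only a sketch, the destination path is an example, and it assumes you accept a few zero-filled 4 KiB blocks in the result (GNU ddrescue, if installed, is the more careful tool for the same job):

dd if=yavdr.qcow2 of=/mnt/disk1/rescue/yavdr-rescued.qcow2 bs=4096 conv=noerror,sync   # continue past read errors, pad bad blocks with zeros
qemu-img convert -O raw /mnt/disk1/rescue/yavdr-rescued.qcow2 /mnt/disk1/rescue/yavdr-rescued.raw

Alternatively, since the VM is running fine, backing up the data from inside the guest avoids having to read the damaged vdisk from the host at all.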
dmacias Posted April 26, 2015

That's the same error I would get with Mythbuntu, even though it ran just fine. I tried copying it various ways without success; I think I also tried qemu-img check. You could try virt-clone or virsh migrate. You may have to create another Ubuntu VM, ssh into both, and copy your settings and data over to the new one.
EdgarWallace Posted April 26, 2015

Thanks dmacias, that doesn't work; the filesystem might be completely screwed up. It doesn't matter, as this is just my test and backup server. Anyhow, I'm now looking into how to start over with my two SSDs, which I would like to reformat in order to get rid of the errors. I have a 128GB and a 64GB SSD and would like to combine their capacities. Unfortunately the guide is outdated and the information is scattered all around the forum, so I'll have to investigate how to do it.
dmacias Posted April 26, 2015

I don't know that the VM filesystem is screwed up; it's just something weird. I tested it before on a fresh install of Mythbuntu and got the same error with a qcow2 file: can't copy, I/O error. A quick way to reformat from the webGui would be: stop the array and change the filesystem of the drive to XFS; start the array and format; stop the array and change it back to btrfs; then start and format again. Do that for each drive, then add the drives to cache slots 1 and 2. You may have to decrease the number of array slots to gain cache slots.
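Once both SSDs are back in the cache pool, a quick sanity check from the console, assuming the pool mounts at /mnt/cache:

btrfs filesystem show /mnt/cache   # lists both devices in the pool and their sizes
btrfs filesystem df /mnt/cache     # shows the data/metadata profiles (e.g. RAID1 vs single)

One hedged caveat on combining a 128GB and a 64GB device: in the default RAID1 profile the usable space is limited to roughly the smaller device, so the capacities will not simply add up unless the pool is using a single-data profile.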
dmacias Posted May 4, 2015

Quoting EdgarWallace: "Thanks dmacias, doesn't work, the filesystem might be completely screwed up..."

Any luck? I just switched back from a cache pool to two separate XFS drives, one for cache and one for VMs. I started seeing errors even though I had raw images in a NOCOW directory. Scrub would show uncorrectable errors, and when I went to copy my Mythbuntu VM to the array I got an I/O error, so I had to use a backup. I haven't had any hard resets.
EdgarWallace Posted May 5, 2015

Quoting dmacias: "Any luck? I just switched back from a cache pool to two separate XFS drives..."

Similar picture here... luckily no I/O errors, so I can copy the VM files, but a massive number of csum errors. I also didn't have any hard crash of the server. I have recreated the VMs completely from scratch, all in raw format, but scrub is again showing:

scrub status for e46946cf-15f4-4a47-9fe6-94a9ae7088d4
        scrub started at Tue May 5 09:57:51 2015 and finished after 136 seconds
        total bytes scrubbed: 72.38GiB with 1837 errors
        error details: csum=1837
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

and the log:

May 5 10:00:06 Tower2 kernel: btrfs_dev_stat_print_on_error: 1316 callbacks suppressed
May 5 10:00:06 Tower2 kernel: BTRFS: bdev /dev/sdg1 errs: wr 0, rd 0, flush 0, corrupt 3412, gen 0
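One small aside on reading those numbers: the "corrupt" counter in the kernel log is cumulative since the filesystem was created (or since the counters were last reset), so it can be useful to zero it and then watch whether fresh errors keep accumulating. A minimal sketch, assuming the pool is mounted at /mnt/cache:

btrfs device stats -z /mnt/cache   # print the current per-device counters, then reset them to zero

If the counter climbs again on freshly written raw images after a reset, the cause is more likely hardware (RAM, controller, cabling, or the SSD itself) than anything left over from the old qcow2 files.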