Primary disk in cache pool going bad, desperately attempting to save VM.


neztach

I have a two-device cache pool, /dev/sdj and /dev/sdk, and /dev/sdj is throwing errors, so it is probably going bad. For a short while after a restart I was able to get it recognized and online, but I stupidly missed my window to copy my Windows VM off it.

 

At this point, if I start the array normally, the cache pool says "no file system" seemingly no matter what. Whatever part of /dev/sdj1 is bad sits right where mover needs to read, so I can't move anything. I also can't run my Docker containers, as they were built to depend on the cache, so no Krusader. I found a link here which allows me to make the cache pool read-only, which lets me look at things via the console.

 

btrfs rescue zero-log /dev/sdj1

 

I've attempted multiple variations of rsync to copy vdisk1.img to a location on the array, all of which failed. The latest was:

 

rsync -avh --progress --sparse /mnt/user/domains/Windows\ 10/vdisk1.img /mnt/user/Archives/VMBackup/Win10/

 

which results in the following (bear in mind only 200G is allocated to that VM):

 

    214.75G 100%    63.13MB/s    0:54:04 (xfr#1, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

WARNING: vdisk.img failed verification -- update discarded (will try again).

    214.75G 100%    51.68MB/s    1:06:02 (xfr#2, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

ERROR: vdisk1.img failed verification -- update discarded.

 

I also tried

cp -av --sparse=always <source> <destination>

'/mnt/user/domain/Windows 10/vdisk1.img' -> '/mnt/user/Archives/VMBackup/Win10/vdisk1.img'

cp: error reading '/mnt/user/domain/Windows 10/vdisk1.img': Input/output error

 

I'm at a loss.  Any advice?  Honestly, if I could remove the pool, and put all of cache on /dev/sdk1 then I'm sure all would be well.

 

Link to comment

These suggest one of the devices dropped offline in the past:

 

Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdj1 errs: wr 29362, rd 138395, flush 0, corrupt 4, gen 0
Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

 

And it dropped offline again:

 

Nov 16 00:49:27 Eden kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x1)
Nov 16 00:49:27 Eden kernel: ata11.00: revalidation failed (errno=-5)
Nov 16 00:49:27 Eden kernel: ata11.00: disabled

 

Try replacing the cables on that device. Ideally, don't use the SASLP controller at all, since it's not recommended, or at least connect the SSDs to the onboard SATA ports, otherwise trim won't work either. Then run a correcting scrub and check that there are no uncorrectable errors.
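
For reference, the scrub and the counter check could look roughly like the sketch below. It assumes the pool is mounted at /mnt/cache and that the kernel log is at /var/log/syslog (adjust if yours differ), and the helper names scrub_and_report and btrfs_err_summary are made up for this example. The sed helper only condenses "bdev ... errs:" lines like the two quoted above.

```shell
#!/bin/bash
# Sketch only. Assumption: the cache pool is mounted at /mnt/cache.
POOL="${POOL:-/mnt/cache}"

# Run a correcting scrub in the foreground (-B), then print the
# per-device error counters btrfs keeps for the pool.
scrub_and_report() {
    btrfs scrub start -B "$1"
    btrfs device stats "$1"
}

# Condense kernel "bdev ... errs:" lines read from stdin into
# "<device> wr=<n> rd=<n> corrupt=<n>".
btrfs_err_summary() {
    sed -n 's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush [0-9]*, corrupt \([0-9]*\),.*/\1 wr=\2 rd=\3 corrupt=\4/p'
}

# Nothing runs until you call it, for example:
#   scrub_and_report "$POOL"
#   grep 'BTRFS info' /var/log/syslog | btrfs_err_summary
```

If the counters for one device keep climbing after the cables and ports have been changed, that would suggest the device itself rather than the connection.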

Link to comment

I guess I don't understand how one SSD appears to be connected differently than the other. It's a single backplane, and both are connected to it. Perhaps in my previous diagnostics one of the drives was hooked up to an LSI controller as well? If I knew which one was on the LSI controller both then and now, I could perhaps shift them around so they are both in slots previously recognized as LSI. If not, which part of the diagnostics would I look at to see LSI vs SASLP?

Link to comment

If it's a single backplane, it's a direct-connection one. You have devices connected to 3 different controllers:

 

These are on an SASLP:

 

[1:0:0:0]    disk    ATA      SanDisk SDSSDH31 70RL  /dev/sdj   /dev/sg9
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:0:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0]
[1:0:1:0]    disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdk   /dev/sg10
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:1:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0]

 

These are on the onboard SATA ports:

 

[2:0:0:0]    disk    ATA      ST10000VN0008-2P SC61  /dev/sdb   /dev/sg1
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/2:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata1/host2/target2:0:0/2:0:0:0]
[3:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdc   /dev/sg2
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/3:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata2/host3/target3:0:0/3:0:0:0]
[4:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdd   /dev/sg3
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/4:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata3/host4/target4:0:0/4:0:0:0]
[5:0:0:0]    disk    ATA      ST8000VN0022-2EL SC61  /dev/sde   /dev/sg4
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/5:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata4/host5/target5:0:0/5:0:0:0]
[6:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdf   /dev/sg5
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata5/host6/target6:0:0/6:0:0:0]
[7:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdg   /dev/sg6
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/7:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host7/target7:0:0/7:0:0:0]
[8:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdh   /dev/sg7
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/8:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata7/host8/target8:0:0/8:0:0:0]
[9:0:0:0]    disk    ATA      ST10000VN0004-1Z SC60  /dev/sdi   /dev/sg8
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/9:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata8/host9/target9:0:0/9:0:0:0]

 

And these on another SASLP:

 

[12:0:0:0]   disk    ATA      ST10000VN0008-2J SC60  /dev/sdl   /dev/sg11
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:0:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:0/end_device-12:0/target12:0:0/12:0:0:0]
[12:0:1:0]   disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdm   /dev/sg12
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:1:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:1/end_device-12:1/target12:0:1/12:0:1:0]
[12:0:2:0]   disk    ATA      SanDisk Ultra II 00RL  /dev/sdn   /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:2:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:2/end_device-12:2/target12:0:2/12:0:2:0]
[12:0:3:0]   disk    ATA      ST4000DM000-1F21 CC54  /dev/sdo   /dev/sg14
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:3:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:3/end_device-12:3/target12:0:3/12:0:3:0]
[12:0:4:0]   disk    ATA      ST10000VN0008-2P SC61  /dev/sdp   /dev/sg15
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:4:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:4/end_device-12:4/target12:0:4/12:0:4:0]
[12:0:5:0]   disk    ATA      WDC WD60EFRX-68L 0A82  /dev/sdq   /dev/sg16
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:5:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:5/end_device-12:5/target12:0:5/12:0:5:0]

 

So I see no reason why you can't connect the SSDs to the Intel ports, it's just a case of swapping things around.

 

 

Link to comment

You're absolutely right, and I'm perfectly willing to try. I'm just wondering which log you got that from, so I can look at the previous diagnostics and see if either cache drive was previously on an LSI controller. That would narrow down which ports I could try, and I could then verify for myself that both of them end up on an LSI controller as you advise.

 

Link to comment
11 minutes ago, neztach said:

so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try,

You can look, it's in the lsscsi.txt file in the diags, but it was on the same controller:

 

8 hours ago, JorgeB said:

Try replacing the cables on that device, and ideally don't use the SASLP controller

 

The other SSD device was also there, so that one is now on the LSI.

The SSD was on an SASLP, now is on the other one.
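
If you want to double-check this on the running server instead of in lsscsi.txt, sysfs has the same information. A rough sketch follows; the function names are invented for the example, and it assumes lspci is available. The last PCI address in a device's sysfs path is the controller it hangs off, which is the same address that appears in the dir: lines of the listing above.

```shell
#!/bin/bash
# Sketch only; helper names are hypothetical.

# Print the last PCI address (domain:bus:slot.func) found in a sysfs path.
pci_addr_of_path() {
    printf '%s\n' "$1" \
        | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
        | tail -n 1
}

# Show which controller a block device (e.g. sdj) is attached to.
controller_of() {
    local path addr
    path=$(readlink -f "/sys/block/$1") || return 1
    addr=$(pci_addr_of_path "$path")
    lspci -s "$addr"
}

# Example: controller_of sdj
```

Two devices that print the same controller line share a controller, so it is enough to compare the output for both cache devices against one of the drives known to be on onboard SATA.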

Link to comment

To reiterate,

 

I would love for there to be some kind of method to just change the pool back to using a single drive (the one of the two drives in the cache pool that is still good). Or I would be willing to buy a new 2TB SSD to replace both drives in the pool.

 

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

 

Is there anything I can do here?

Link to comment
12 hours ago, neztach said:

but which are my intel ports?

The ones listed above as onboard SATA.

 

8 hours ago, neztach said:

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

 

Is there anything I can do here?

By the looks of it you should be able to get most of the data, though there could be some data corruption. You first need to solve the device dropping issue.

Link to comment

I think the device dropping issue is probably symptomatic of the drive dying, in which case, what could I really do about it? Having said that, I'll shuffle the drives around again and send another set of diagnostics. In the meantime, if the device still drops after I move both cache drives to SATA, is there something I can do after that to attempt to copy data off?

Link to comment

[Attached screenshot: image.png]

 

It appears to have gotten further, and I believe I now have both cache drives in SATA slots. Diagnostics are attached to confirm.

 

Is there nothing I can do to limp it along and get a copy of the contents? Some of it? All of it?

 

Like I said, all I really want is my VMs, specifically one of them.

 

After that, I don't mind decommissioning the bad SSD and formatting the remaining one as cache. I already have a new 2TB en route.

 

eden-diagnostics-20211118-0919.zip

Link to comment
23 minutes ago, neztach said:

I believe I now have both cache drives in SATA slots

They are, and we can now see it's a device problem: there are uncorrectable errors (bad sectors), but unlike with the badly behaving SASLP, the SSD no longer drops. You can try btrfs restore; I'm not sure if it continues when there's an I/O error. If it doesn't, you'd need to first clone the SSD to a new device and then try again. Note that some data corruption is expected in any case.
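
In case it helps, here is roughly what those two routes could look like. This is a sketch under assumptions: the function names and example paths are made up, the -i flag asks btrfs restore to ignore errors (whether that is enough to get past bad sectors is exactly the open question), and ddrescue here means GNU ddrescue, which may need to be installed separately. Double-check device names before cloning, since the target device gets overwritten.

```shell
#!/bin/bash
# Sketch only; function names and example paths are hypothetical.

# Copy files out of an unmounted btrfs device into a directory on
# another filesystem. -v lists files, -i ignores errors where it can.
restore_files() {
    local src_dev="$1" dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    btrfs restore -v -i "$src_dev" "$dest_dir"
}

# Clone the failing SSD to a new device, skipping the slow scraping
# phase (-n). -f is required when the output is a block device.
# The mapfile lets an interrupted run resume where it stopped.
clone_device() {
    local src="$1" dst="$2" mapfile="$3"
    ddrescue -f -n "$src" "$dst" "$mapfile"
}

# Examples (adjust devices and paths first):
#   restore_files /dev/sdX1 /mnt/disk1/restore
#   clone_device /dev/sdX /dev/sdY /boot/ddrescue.map
```

The destination for restore_files should be on the array or another healthy disk, not on the pool being rescued.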

Link to comment
