Primary disk in cache pool going bad, desperately attempting to save VM.

neztach · November 15, 2021

I have a cache pool made of /dev/sdj and /dev/sdk, and /dev/sdj is throwing errors, so it is probably going bad. For a minute after a restart I was able to get it recognized and online, but I stupidly missed my window to copy my Windows VM off it.

At this point, if I spin up the array normally, the cache pool says "no file system" seemingly no matter what. Whatever part of /dev/sdj1 is bad sits right where mover needs to read, so I can't move anything. I also can't run dockers, as they were built dependent on cache, so no Krusader. I found a link here which allows me to make the cache pool read only, which lets me look at things via console.

btrfs rescue zero-log /dev/sdj1

I've attempted multiple variations of rsync to copy vdisk1.img to a location on the array, all of which fail. The latest was:

rsync -avh --progress --sparse /mnt/user/domains/Windows\ 10/vdisk1.img /mnt/user/Archives/VMBackup/Win10/

which results in the following (bear in mind only 200G is allocated to that VM):

214.75G 100% 63.13MB/s 0:54:04 (xfr#1, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

WARNING: vdisk1.img failed verification -- update discarded (will try again).

214.75G 100% 51.68MB/s 1:06:02 (xfr#2, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

ERROR: vdisk1.img failed verification -- update discarded.

I also tried

cp -av --sparse=always <source> <destination>

'/mnt/user/domains/Windows 10/vdisk1.img' -> '/mnt/user/Archives/VMBackup/Win10/vdisk1.img'

cp: error reading '/mnt/user/domains/Windows 10/vdisk1.img': Input/output error

I'm at a loss. Any advice? Honestly, if I could remove the pool, and put all of cache on /dev/sdk1 then I'm sure all would be well.
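rsync discards its copy and cp aborts as soon as a block cannot be read, which is why every attempt above ends in an Input/output error. A minimal sketch of a copy that keeps going instead, assuming GNU dd is available: `conv=noerror` continues past read errors and `conv=sync` pads each failed block with zeros so the rest of the image stays at the right offset. The function name and the paths in the usage comment are placeholders, not taken from this system, and the unreadable regions are still lost, so the guest filesystem may need a repair afterwards.

```shell
# Copy SRC to DST block by block, continuing past read errors.
# Unreadable 4 KiB blocks are written as zeros instead of aborting the copy.
copy_skipping_errors() {
    src=$1
    dst=$2
    # noerror: keep going after a failed read
    # sync:    pad short or failed reads to a full block, preserving offsets
    dd if="$src" of="$dst" bs=4096 conv=noerror,sync
}

# Hypothetical usage (both paths are assumptions, adjust to the real layout):
# copy_skipping_errors "/mnt/cache/domains/Windows 10/vdisk1.img" \
#     "/mnt/disk1/VMBackup/vdisk1.img"
```

A small block size keeps each zeroed hole small at the cost of speed. Note that `conv=sync` also pads the final block, so the copy only matches the source size exactly when that size is a multiple of 4096 bytes. GNU ddrescue, where installed, does the same job with retries and a map of the bad areas.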

JorgeB · November 16, 2021

Please post the diagnostics.

neztach · November 16, 2021

https://drive.google.com/drive/folders/1P4ck5oJ1ko4VGZyHPu7m6qGHRF6IMsc7?usp=sharing

hopefully that link works.

JonathanM · November 16, 2021

10 minutes ago, neztach said:

hopefully that link works.

It may be a valid link, but it's not what was desired.

Attach the diagnostics zip file to your next post in this thread. Clicking on random links to download an unknown file isn't ideal.

neztach · November 16, 2021

wow I'm such a rookie! Apologies, see attached.

eden-diagnostics-20211116-1542.zip

JorgeB · November 17, 2021

These suggest one of the devices dropped offline in the past:

Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdj1 errs: wr 29362, rd 138395, flush 0, corrupt 4, gen 0
Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

And it dropped offline again:

Nov 16 00:49:27 Eden kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x1)
Nov 16 00:49:27 Eden kernel: ata11.00: revalidation failed (errno=-5)
Nov 16 00:49:27 Eden kernel: ata11.00: disabled

Try replacing the cables on that device, and ideally don't use the SASLP controller, which is not recommended. At the very least connect the SSDs to the onboard SATA ports, otherwise trim won't work either. Then run a correcting scrub and check that there are no uncorrectable errors.
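The `errs:` lines quoted above are btrfs's per-device error counters (write, read, flush, corruption and generation errors), which persist until they are reset. As a small sketch, the helper below (a hypothetical name) pulls those counters out of kernel log lines so the two pool members can be compared at a glance. `btrfs device stats` and `btrfs scrub` are the stock commands for reading the same counters and running the scrub; the `/mnt/cache` mount point and the `/var/log/syslog` path are assumptions.

```shell
# Summarise BTRFS per-device error counters from kernel log lines on stdin.
btrfs_err_summary() {
    sed -n 's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 wr=\2 rd=\3 flush=\4 corrupt=\5 gen=\6/p'
}

# Hypothetical usage (log path and mount point are assumptions):
# grep 'BTRFS info' /var/log/syslog | btrfs_err_summary
# btrfs device stats /mnt/cache        # current counters for each pool member
# btrfs scrub start -B /mnt/cache      # -B: stay in the foreground until done
# btrfs scrub status /mnt/cache        # progress and error totals
```

Large write and read counts on only one member, as on /dev/sdj1 here, are what point to that device having dropped rather than to general filesystem damage.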

neztach · November 17, 2021

I have a backplane, so I can't really plug directly into the board, but I switched the slots for both cache drives, and the corrective btrfs scrub started fine, then aborted.

image.png.5d1fc64b28b22d4d271c5d6d5b09306b.png

eden-diagnostics-20211117-0845.zip

JorgeB · November 17, 2021

It dropped offline again.

neztach · November 17, 2021

I keep rebooting and scrubbing. Sometimes it gets farther, sometimes not. Shall I continue, or is there a better approach?

neztach · November 17, 2021

Also, I have a HUGE PlexMediaServer folder on cache that I wouldn't be upset about trashing altogether. Would deleting it help, make things worse, or change nothing?

JonathanM · November 17, 2021

Until you change the hardware path and figure out which part is responsible for the drive dropping, it's not productive to keep doing the same thing over and over.

JorgeB · November 17, 2021

If you can't connect it to the onboard SATA, at least connect it to the LSI controller, like the other SSD is. Still no trim, but at least it's a reliable controller.

neztach · November 17, 2021

I guess I don't understand how one SSD appears to be connected differently than another SSD. It's a single backplane, and both are connected to the same backplane. Perhaps in my previous diagnostic, one of the drives was hooked up to an LSI controller then as well? If I knew which one was on the LSI controller both then and now, I could perhaps shift them around so they are both in a slot previously recognized as LSI. If not, which part of the diagnostics would I look at to see LSI vs SASLP?

JorgeB · November 17, 2021

If it's a single backplane it's a direct connection. You have devices connected to 3 different controllers.

These are on an SASLP:

[1:0:0:0]    disk    ATA      SanDisk SDSSDH31 70RL  /dev/sdj   /dev/sg9
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:0:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0]
[1:0:1:0]    disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdk   /dev/sg10
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:1:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0]

These are on the onboard SATA ports:

[2:0:0:0]    disk    ATA      ST10000VN0008-2P SC61  /dev/sdb   /dev/sg1
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/2:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata1/host2/target2:0:0/2:0:0:0]
[3:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdc   /dev/sg2
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/3:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata2/host3/target3:0:0/3:0:0:0]
[4:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdd   /dev/sg3
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/4:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata3/host4/target4:0:0/4:0:0:0]
[5:0:0:0]    disk    ATA      ST8000VN0022-2EL SC61  /dev/sde   /dev/sg4
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/5:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata4/host5/target5:0:0/5:0:0:0]
[6:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdf   /dev/sg5
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata5/host6/target6:0:0/6:0:0:0]
[7:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdg   /dev/sg6
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/7:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host7/target7:0:0/7:0:0:0]
[8:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdh   /dev/sg7
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/8:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata7/host8/target8:0:0/8:0:0:0]
[9:0:0:0]    disk    ATA      ST10000VN0004-1Z SC60  /dev/sdi   /dev/sg8
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/9:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata8/host9/target9:0:0/9:0:0:0]

And these are on another SASLP:

[12:0:0:0]   disk    ATA      ST10000VN0008-2J SC60  /dev/sdl   /dev/sg11
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:0:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:0/end_device-12:0/target12:0:0/12:0:0:0]
[12:0:1:0]   disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdm   /dev/sg12
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:1:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:1/end_device-12:1/target12:0:1/12:0:1:0]
[12:0:2:0]   disk    ATA      SanDisk Ultra II 00RL  /dev/sdn   /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:2:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:2/end_device-12:2/target12:0:2/12:0:2:0]
[12:0:3:0]   disk    ATA      ST4000DM000-1F21 CC54  /dev/sdo   /dev/sg14
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:3:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:3/end_device-12:3/target12:0:3/12:0:3:0]
[12:0:4:0]   disk    ATA      ST10000VN0008-2P SC61  /dev/sdp   /dev/sg15
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:4:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:4/end_device-12:4/target12:0:4/12:0:4:0]
[12:0:5:0]   disk    ATA      WDC WD60EFRX-68L 0A82  /dev/sdq   /dev/sg16
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:5:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:5/end_device-12:5/target12:0:5/12:0:5:0]

So I see no reason why you can't connect the SSDs to the Intel ports, it's just a case of swapping things around.
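Each `dir:` line in the listing above ends in a sysfs path, and the last PCI address in that path identifies the controller the disk hangs off. A rough sketch of extracting it, assuming a `grep` that supports `-o`; the function name is made up for illustration, and `lspci -s` can then name the controller at that address.

```shell
# Print the PCI address of the controller from a sysfs device path.
# The last "dddd:bb:dd.f" element is the controller itself; any earlier
# ones are the bridges in front of it.
controller_pci_addr() {
    printf '%s\n' "$1" |
        grep -o '[0-9a-f]\{4\}:[0-9a-f]\{2\}:[0-9a-f]\{2\}\.[0-9a-f]' |
        tail -n 1
}

# Hypothetical usage on a live system:
# addr=$(controller_pci_addr "$(readlink -f /sys/block/sdj)")
# lspci -s "$addr"
```

Applied to the listing, /dev/sdj and /dev/sdk resolve to 0000:01:00.0, /dev/sdl through /dev/sdq to 0000:02:00.0, and /dev/sdb through /dev/sdi to 0000:00:11.4 or 0000:00:1f.2, which are the onboard ports.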

neztach · November 17, 2021

You're absolutely right, and I'm perfectly willing to try. I'm just wondering which log you got that from, so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try, then self-verify so I can follow your advice and get both of them on an LSI controller.

JorgeB · November 17, 2021

11 minutes ago, neztach said:

so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try,

You can look, it's in the lsscsi.txt file in the diags, but it was on the same controller:

8 hours ago, JorgeB said:

Try replacing the cables on that device, and ideally don't use the SASLP controller

~~The other SSD device was also there, so that one is now on the LSI.~~

The SSD was on an SASLP, now is on the other one.

JorgeB · November 17, 2021

Sorry, I must have confused your diags with another one. You have two SASLP controllers and there's no LSI; you can still connect the SSDs to the Intel ports though.

neztach · November 17, 2021

Well, based on your original response, I rearranged my cache drives to what I thought was the LSI. Can you confirm?

And I feel like I'm coming across as dense, but which are my Intel ports?

eden-diagnostics-20211117-1340.zip

neztach · November 17, 2021

To reiterate,

I would love for there to be some method to change the pool back to using a single drive (the one of the two drives in the cache pool that is still good). Or I would be willing to buy a new 2TB SSD to replace both drives in the pool.

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

Is there anything I can do here?

JorgeB · November 18, 2021

12 hours ago, neztach said:

but which are my Intel ports?

The ones listed above as onboard SATA.

8 hours ago, neztach said:

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

Is there anything I can do here?

By the looks of it you should be able to get most of the data, though there could be some data corruption, but you first need to solve the device dropping issue.

neztach · November 18, 2021

I think the device dropping issue is probably symptomatic of the drive dying, in which case, what could I really do about it? Having said that, I'll shuffle the drives around again and send another set of diagnostics. In the meantime, if the device still drops once both cache drives are on SATA, is there something I can do after that to attempt to copy data off?

JorgeB · November 18, 2021

12 minutes ago, neztach said:

is there something I can do after that to attempt to copy data off?

Not much you can do if the drive keeps dropping. You were using RAID0, so the pool is not redundant. I assume there are also no backups?
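Whether a pool can survive losing a member depends on its data profile, which `btrfs filesystem df` reports on the `Data,` line. A small sketch for reading it, with a hypothetical helper name and `/mnt/cache` assumed as the mount point. With RAID0 the data is striped across both SSDs, so neither device alone holds complete copies of larger files; with RAID1 each device would hold a full copy.

```shell
# Print the data profile (single, RAID0, RAID1, ...) from
# `btrfs filesystem df` output on stdin.
data_profile() {
    awk -F'[, :]+' '$1 == "Data" { print $2 }'
}

# Hypothetical usage (mount point is an assumption):
# btrfs filesystem df /mnt/cache | data_profile
```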

neztach · November 18, 2021

image.png.02cfd0a849a823608ac69054252aff86.png

it appeared to have gotten further, and I believe I now have both cache drives in SATA slots, diagnostics included to confirm.

Nothing I can do to possibly limp it along to get a copy of the contents? Some of it? all of it?

like I said, all I really want is my VMs ...specifically one of them.

After that, I don't mind decommissioning the bad SSD and formatting the remaining one as cache. I already have a new 2TB en route.

eden-diagnostics-20211118-0919.zip

JorgeB · November 18, 2021

23 minutes ago, neztach said:

I believe I now have both cache drives in SATA slots

They are, and we can now see it's a device problem: there are uncorrectable errors (bad sectors), but unlike with the badly behaving SASLP, the SSD no longer drops. You can try btrfs restore. I'm not sure whether it continues when it hits an I/O error; if it doesn't, you'd need to clone the SSD to a new device first and then try again. Note that some data corruption is expected in any case.
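`btrfs restore` reads files straight from the member devices of an unmounted filesystem and writes them to a directory on a different, mounted filesystem, so the pool itself does not have to be mountable. Its `--path-regex` option expects an awkward nested pattern in which every parent directory is matched as well; the sketch below builds that pattern from a plain path. The helper name, the device node and the destination directory are all placeholders, and directory names containing regex metacharacters would need escaping that this sketch does not do.

```shell
# Build the nested pattern that `btrfs restore --path-regex` expects,
# e.g. "a/b" becomes ^/(|a(|/b(|/.*)))$
restore_path_regex() {
    path=${1#/}        # drop a leading slash
    path=${path%/}     # drop a trailing slash
    open=''
    close=''
    sep=''
    old_ifs=$IFS
    IFS=/
    set -f             # no glob expansion while splitting on "/"
    for part in $path; do
        open="${open}(|${sep}${part}"
        close="${close})"
        sep='/'
    done
    set +f
    IFS=$old_ifs
    printf '^/%s(|/.*)%s$\n' "$open" "$close"
}

# Hypothetical usage; -D only lists what would be restored, -i asks restore
# to ignore errors, -v prints each file. Device and target are placeholders:
# btrfs restore -D -v -i \
#     --path-regex "$(restore_path_regex 'domains/Windows 10')" \
#     /dev/sdX1 /mnt/disk1/restore
```

Dropping `-D` once the listing looks right performs the actual copy. The pattern restores everything under the named directory, which suits a VM folder that holds little besides the vdisk.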

neztach · November 18, 2021

image.png.bafb1367592bcfbd45f1955c67f847da.png

It says it can't because the disk is mounted, but I can't get to the rest of the pool if it's unmounted.
