Primary disk in cache pool going bad, desperately attempting to save VM.

November 15, 2021

I have a cache pool made up of /dev/sdj and /dev/sdk, and /dev/sdj is throwing errors, so it's probably going bad. For a minute after a restart I was able to get it recognized and back online, and I stupidly missed my window to copy my Windows VM off it.

At this point, if I spin up the array normally, the cache pool says "no file system" seemingly no matter what. Whatever section of /dev/sdj1 is bad is right in the section that mover needs to move, so I can't move anything. I also can't run dockers, as they were built dependent on cache, so no Krusader. I found a link here which allows me to make the cache pool read only, which lets me look at things via console.

btrfs rescue zero-log /dev/sdj1

I've attempted multiple variations of rsync to copy vdisk1.img to a location on the array, all of which fail. The latest being

rsync -avh --progress --sparse /mnt/user/domains/Windows\ 10/vdisk1.img /mnt/user/Archives/VMBackup/Win10/

which results in the following (bear in mind only 200G is allocated to that VM):

214.75G 100% 63.13MB/s 0:54:04 (xfr#1, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

WARNING: vdisk1.img failed verification -- update discarded (will try again).

214.75G 100% 51.68MB/s 1:06:02 (xfr#2, to-chk=0/1)

rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)

ERROR: vdisk1.img failed verification -- update discarded.

I also tried

cp -av --sparse=always <source> <destination>

'/mnt/user/domain/Windows 10/vdisk1.img' -> '/mnt/user/Archives/VMBackup/Win10/vdisk1.img'

cp: error reading '/mnt/user/domain/Windows 10/vdisk1.img': Input/output error
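When a plain copy aborts on the first Input/output error, one option is a block-by-block copy that keeps going and zero-fills whatever it cannot read. A minimal sketch, assuming a hypothetical helper name `copy_past_errors` and a 64 KiB block size; the resulting image will have zeroed holes wherever the source was unreadable, so the guest filesystem may still need repair afterwards:

```shell
# Hypothetical helper: copy a file block by block, continuing past read errors.
# conv=noerror keeps dd going after a failed read; conv=sync pads every short
# or failed block with zeros so later data stays at its original offset.
# Note: sync also pads a final partial block, so the copy can come out up to
# one block longer than the source.
copy_past_errors() {  # usage: copy_past_errors <source file> <destination file>
  dd if="$1" of="$2" bs=65536 conv=noerror,sync
}
```

With the paths from this post, that would be called as `copy_past_errors "/mnt/user/domains/Windows 10/vdisk1.img" /mnt/user/Archives/VMBackup/Win10/vdisk1.img`. It will not help if the device itself drops offline mid-copy.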

I'm at a loss. Any advice? Honestly, if I could remove the pool, and put all of cache on /dev/sdk1 then I'm sure all would be well.

November 16, 2021

Community Expert

Please post the diagnostics.

November 16, 2021

Author

https://drive.google.com/drive/folders/1P4ck5oJ1ko4VGZyHPu7m6qGHRF6IMsc7?usp=sharing

hopefully that link works.

November 16, 2021

10 minutes ago, neztach said:

hopefully that link works.

It may be a valid link, but it's not what was desired.

Attach the diagnostics zip file to your next post in this thread. Clicking on random links to download an unknown file isn't ideal.

November 16, 2021

Author

wow I'm such a rookie! Apologies, see attached.

eden-diagnostics-20211116-1542.zip

November 17, 2021

Community Expert

These suggest one of the devices dropped offline in the past:

Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdj1 errs: wr 29362, rd 138395, flush 0, corrupt 4, gen 0
Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
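The `wr`, `rd`, `flush`, `corrupt` and `gen` fields in those lines are btrfs's per-device error counters (write, read, flush, corruption and generation errors). A minimal sketch for pulling them out of a saved syslog, using a hypothetical helper name `btrfs_errs`; on a live system `btrfs device stats <mount point>` reports the same counters:

```shell
# Hypothetical helper: read kernel log lines on stdin and print the btrfs
# per-device error counters from every "bdev ... errs:" line.
btrfs_errs() {
  sed -n 's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 wr=\2 rd=\3 flush=\4 corrupt=\5 gen=\6/p'
}

# Example (the syslog path is a placeholder for wherever the diagnostics
# zip was extracted):
#   btrfs_errs < logs/syslog.txt
```

Counters that are non-zero on only one device, as with /dev/sdj1 here, point at that device or its cabling and controller rather than at the filesystem as a whole.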

And it dropped offline again:

Nov 16 00:49:27 Eden kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x1)
Nov 16 00:49:27 Eden kernel: ata11.00: revalidation failed (errno=-5)
Nov 16 00:49:27 Eden kernel: ata11.00: disabled

Try replacing the cables on that device, and ideally don't use the SASLP controller, which is not recommended. At the very least connect the SSDs to onboard SATA, otherwise trim also won't work. Then run a correcting scrub and check whether there are any uncorrectable errors.

November 17, 2021

Author

I have a backplane, so I can't really plug directly into the board, but I switched the slots for both cache drives, and the corrective btrfs scrub started fine, then aborted.

image.png.5d1fc64b28b22d4d271c5d6d5b09306b.png

eden-diagnostics-20211117-0845.zip

November 17, 2021

Community Expert

It dropped offline again.

November 17, 2021

Author

I keep rebooting and scrubbing... sometimes it gets farther, sometimes not. Shall I continue, or is there a better approach?

November 17, 2021

Author

also, I have a HUGE PlexMediaServer folder on cache that I wouldn't be upset about if I trashed it altogether. Would it help - or make it worse - or change nothing?

November 17, 2021

Until you change the hardware path and figure out which part is responsible for the drive dropping, it's not productive to keep doing the same thing over and over.

November 17, 2021

Community Expert

If you can't connect it to the onboard SATA at least connect it to the LSI controller, like the other SSD is, still no trim but at least it's a reliable controller.

November 17, 2021

Author

I guess I don't understand how one SSD appears to be connected differently than another SSD. It's a single backplane, and both are connected to the same backplane. Perhaps in my previous diagnostic, one of the drives was hooked up to an LSI controller then as well? If I knew which one was on the LSI controller both then and now, I could perhaps shift them around so they are both in a slot previously recognized as LSI. If not, which part of the diagnostics would I look at to see LSI vs SASLP?

November 17, 2021

Community Expert

If it's a single backplane it's direct connection, you have devices connected to 3 different controllers:

These are on an SASLP:

[1:0:0:0]    disk    ATA      SanDisk SDSSDH31 70RL  /dev/sdj   /dev/sg9
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:0:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0]
[1:0:1:0]    disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdk   /dev/sg10
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:1:0  [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0]

These are on the onboard SATA ports:

[2:0:0:0]    disk    ATA      ST10000VN0008-2P SC61  /dev/sdb   /dev/sg1
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/2:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata1/host2/target2:0:0/2:0:0:0]
[3:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdc   /dev/sg2
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/3:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata2/host3/target3:0:0/3:0:0:0]
[4:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdd   /dev/sg3
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/4:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata3/host4/target4:0:0/4:0:0:0]
[5:0:0:0]    disk    ATA      ST8000VN0022-2EL SC61  /dev/sde   /dev/sg4
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/5:0:0:0  [/sys/devices/pci0000:00/0000:00:11.4/ata4/host5/target5:0:0/5:0:0:0]
[6:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdf   /dev/sg5
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata5/host6/target6:0:0/6:0:0:0]
[7:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdg   /dev/sg6
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/7:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host7/target7:0:0/7:0:0:0]
[8:0:0:0]    disk    ATA      WDC WD60EFRX-68M 0A82  /dev/sdh   /dev/sg7
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/8:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata7/host8/target8:0:0/8:0:0:0]
[9:0:0:0]    disk    ATA      ST10000VN0004-1Z SC60  /dev/sdi   /dev/sg8
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/9:0:0:0  [/sys/devices/pci0000:00/0000:00:1f.2/ata8/host9/target9:0:0/9:0:0:0]

And these on another SASLP:

[12:0:0:0]   disk    ATA      ST10000VN0008-2J SC60  /dev/sdl   /dev/sg11
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:0:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:0/end_device-12:0/target12:0:0/12:0:0:0]
[12:0:1:0]   disk    ATA      WDC WD100EMAZ-00 0A83  /dev/sdm   /dev/sg12
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:1:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:1/end_device-12:1/target12:0:1/12:0:1:0]
[12:0:2:0]   disk    ATA      SanDisk Ultra II 00RL  /dev/sdn   /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:2:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:2/end_device-12:2/target12:0:2/12:0:2:0]
[12:0:3:0]   disk    ATA      ST4000DM000-1F21 CC54  /dev/sdo   /dev/sg14
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:3:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:3/end_device-12:3/target12:0:3/12:0:3:0]
[12:0:4:0]   disk    ATA      ST10000VN0008-2P SC61  /dev/sdp   /dev/sg15
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:4:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:4/end_device-12:4/target12:0:4/12:0:4:0]
[12:0:5:0]   disk    ATA      WDC WD60EFRX-68L 0A82  /dev/sdq   /dev/sg16
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:5:0  [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:5/end_device-12:5/target12:0:5/12:0:5:0]

So I see no reason why you can't connect the SSDs to the Intel ports, it's just a case of swapping things around.
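The controller behind each disk can be read from the `dir:` line of the listing above: the last PCI address in the sysfs path is the controller, e.g. 0000:01:00.0 and 0000:02:00.0 for the two add-in cards and 0000:00:11.4 / 0000:00:1f.2 for the onboard ports. A minimal sketch that condenses such a listing into one line per disk, using a hypothetical helper name `controller_map`:

```shell
# Hypothetical helper: read verbose lsscsi output on stdin and print
# "<device> <PCI address of its controller>" for every disk.
controller_map() {
  awk '
    /^\[/ {                      # device line: remember the /dev/sdX node
      dev = ""
      for (i = 1; i <= NF; i++) if ($i ~ "^/dev/sd") dev = $i
    }
    /dir:/ {                     # sysfs line: keep the last PCI address in the path
      pci = ""
      n = split($0, part, "/")
      for (i = 1; i <= n; i++)
        if (part[i] ~ "^[0-9a-f]+:[0-9a-f]+:[0-9a-f]+[.][0-9a-f]+$") pci = part[i]
      if (dev != "") print dev, pci
    }
  '
}

# Example usage on a live system:
#   lsscsi -v | controller_map
# A printed address can then be looked up with, for example:
#   lspci -s 0000:01:00.0
```

Disks that share a PCI address share a controller, which makes it easier to check after a reshuffle whether both SSDs really ended up on the onboard ports.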

November 17, 2021

Author

you're absolutely right, and I'm perfectly willing to try. I'm just wondering which log you got that from, so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try, then self-verify so I can follow your advice and get both of them on an LSI controller.

November 17, 2021

Community Expert

11 minutes ago, neztach said:

so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try,

You can look, it's in the lsscsi.txt file in the diags, but it was on the same controller:

8 hours ago, JorgeB said:

Try replacing the cables on that device, and ideally don't use the SASLP controller

~~The other SSD device was also there, so that one is now on the LSI.~~

The SSD was on an SASLP, now is on the other one.

November 17, 2021

Community Expert

Sorry, I must have confused your diags with another one, you have two SASLP controllers, there's no LSI, you can still connect the SSDs to the Intel ports though.

November 17, 2021

Author

well based on your original response, I rearranged my cache drives to what I thought was the LSI. Can you confirm?

and I feel like I'm coming across as dense, but which are my intel ports?

eden-diagnostics-20211117-1340.zip

November 17, 2021

Author

To reiterate,

I would love for there to be some kind of method to just change the pool back to using a single drive (the one of the two drives in the cache pool that is still good). Or I would be willing to buy a new 2TB SSD to replace both drives in the pool.

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

Is there anything I can do here?

November 18, 2021

Community Expert

12 hours ago, neztach said:

but which are my intel ports?

The ones listed above as onboard SATA.

8 hours ago, neztach said:

At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs.

Is there anything I can do here?

By the looks of it you should be able to get most data but there could be some data corruption, but you first need to solve the device dropping issue.

November 18, 2021

Author

I think the device dropping issue is probably symptomatic of the drive dying, in which case, what could I really do about it? Having said that, I'll shuffle the drives around again and send another diagnostics. In the meantime, if the device still drops after I shuffle the cache drives to both be on SATA, is there something I can do after that to attempt to copy data off?

November 18, 2021

Community Expert

12 minutes ago, neztach said:

is there something I can do after that to attempt to copy data off?

Not much you can do if the drive keeps dropping, you were using RAID0 so the pool is not redundant, I assume also no backups?

November 18, 2021

Author

image.png.02cfd0a849a823608ac69054252aff86.png

it appeared to have gotten further, and I believe I now have both cache drives in SATA slots, diagnostics included to confirm.

Nothing I can do to possibly limp it along to get a copy of the contents? Some of it? all of it?

like I said, all I really want is my VMs ...specifically one of them.

after that, I don't mind decommissioning the bad SSD and formatting the remaining one as cache. I already have a new 2TB en route.

eden-diagnostics-20211118-0919.zip

November 18, 2021

Community Expert

23 minutes ago, neztach said:

I believe I now have both cache drives in SATA slots

They are, and we can now see it's a device problem: there are uncorrectable errors (bad sectors), but unlike with the badly behaving SASLP, the SSD no longer drops. You can try btrfs restore. I'm not sure if it continues when there's an i/o error; if it doesn't, you'd need to first clone the SSD to a new device and then try again. Note that some data corruption is expected in any case.
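A dry-run sketch of the two attempts described above, assuming btrfs-progs and GNU ddrescue are installed (ddrescue is not mentioned in this thread and is only one way to clone a failing disk). The helper names, the `/dev/sdX`-style device names and the paths are all placeholders, and nothing is executed unless `DRY_RUN=0` is set:

```shell
# Dry run by default: commands are printed, not executed. Set DRY_RUN=0 to run.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Attempt 1: pull files straight off the unmounted pool.
# -i = ignore errors and keep going, -v = verbose.
restore_from_pool() {  # usage: restore_from_pool <pool member> <destination dir>
  run btrfs restore -v -i "$1" "$2"
}

# Attempt 2: clone the failing SSD first, then run the restore against the clone.
# -f is required when the output is a block device; the map file records what
# has been read so an interrupted run can resume.
clone_device() {  # usage: clone_device <failing disk> <new disk> <map file>
  run ddrescue -f "$1" "$2" "$3"
}
```

For example, `restore_from_pool /dev/sdX1 /mnt/disk1/restore` prints the restore command for review before anything touches the disks. After a clone, it is generally safer to disconnect the original before working with the copy, since btrfs identifies pool members by UUID and two copies of the same member can confuse it.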

November 18, 2021

Author

image.png.bafb1367592bcfbd45f1955c67f847da.png

It says it can't because the disk is mounted, but I can't get to the rest of the pool if it's unmounted.
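`btrfs restore` reads the raw device and expects the filesystem to be unmounted, so the pool has to be offline while the destination stays mounted somewhere else. A small sketch for checking the first half of that, using a hypothetical helper name `pool_mounted` and assuming the pool's mount point is /mnt/cache (the usual Unraid location, not stated in this thread):

```shell
# Hypothetical helper: succeed if something is mounted at the given mount point.
# The optional second argument is a mounts table to read; it defaults to
# /proc/mounts, where the mount point is the second field of each line.
pool_mounted() {  # usage: pool_mounted <mount point> [mounts file]
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example:
#   if pool_mounted /mnt/cache; then
#     echo "pool is still mounted; btrfs restore will refuse to run"
#   fi
```

The destination passed to `btrfs restore` then needs to be a directory on a different, mounted filesystem with enough free space for the recovered files.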
