neztach Posted November 15, 2021

So I have a cache pool made of /dev/sdj and /dev/sdk, and /dev/sdj is throwing errors, so it is probably going bad. For a minute after a restart I was able to get it recognized and online, and I stupidly missed my window to copy my Windows VM off it. At this point, if I spin up the array normally, the cache pool says "no file system" seemingly no matter what. Whatever section of /dev/sdj1 is bad is right in the section that mover needs to move, so I can't move anything. I also can't run dockers, as they were built dependent on cache, so no Krusader.

I found a link here which allows me to make the cache pool read-only, which lets me look at things via the console.

btrfs rescue zero-log /dev/sdj1

I've attempted multiple variations of rsync to copy vdisk1.img to a location on the array, all of which fail. The latest being

rsync -avh --progress --sparse /mnt/user/domains/Windows\ 10/vdisk1.img /mnt/user/Archives/VMBackup/Win10/

which results in the following (bear in mind only 200G is allocated to that VM):

214.75G 100% 63.13MB/s 0:54:04 (xfr#1, to-chk=0/1)
rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)
WARNING: vdisk.img failed verification -- update discarded (will try again).
214.75G 100% 51.68MB/s 1:06:02 (xfr#2, to-chk=0/1)
rsync: [sender] read errors mapping "/mnt/user/domains/Windows 10/vdisk1.img": Input/output error (5)
ERROR: vdisk1.img failed verification -- update discarded.

I also tried

cp -av --sparse=always <source> <destination>
'/mnt/user/domain/Windows 10/vdisk1.img' -> '/mnt/user/Archives/VMBackup/Win10/vdisk1.img'
cp: error reading '/mnt/user/domain/Windows 10/vdisk1.img': Input/output error

I'm at a loss. Any advice? Honestly, if I could remove the pool and put all of cache on /dev/sdk1, then I'm sure all would be well.
JorgeB Posted November 16, 2021

Please post the diagnostics.
neztach Posted November 16, 2021

https://drive.google.com/drive/folders/1P4ck5oJ1ko4VGZyHPu7m6qGHRF6IMsc7?usp=sharing

Hopefully that link works.
JonathanM Posted November 16, 2021

10 minutes ago, neztach said: hopefully that link works.

It may be a valid link, but it's not what was desired. Attach the diagnostics zip file to your next post in this thread. Clicking on random links to download an unknown file isn't ideal.
neztach Posted November 16, 2021

Wow, I'm such a rookie! Apologies, see attached.

eden-diagnostics-20211116-1542.zip
JorgeB Posted November 17, 2021

These suggest one of the devices dropped offline in the past:

Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdj1 errs: wr 29362, rd 138395, flush 0, corrupt 4, gen 0
Nov 15 16:44:17 Eden kernel: BTRFS info (device sdk1): bdev /dev/sdk1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

And it dropped offline again:

Nov 16 00:49:27 Eden kernel: ata11.00: failed to IDENTIFY (I/O error, err_mask=0x1)
Nov 16 00:49:27 Eden kernel: ata11.00: revalidation failed (errno=-5)
Nov 16 00:49:27 Eden kernel: ata11.00: disabled

Try replacing the cables on that device, and ideally don't use the SASLP controller, which is not recommended; at least connect the SSDs to onboard SATA, or trim also won't work. Then run a correcting scrub and check whether there are any uncorrectable errors.
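For anyone reading along who wants to pull those per-device counters out of a syslog themselves, here is a minimal sketch. The function name `btrfs_err_summary` is made up for illustration and is not part of btrfs-progs; it only reformats kernel lines of the shape quoted above. On a running system, `btrfs device stats <mountpoint>` should report the same counters directly.

```shell
# Hypothetical helper (not part of btrfs-progs): read kernel log lines on stdin
# and print one summary line per BTRFS "bdev ... errs:" entry.
# Lines that do not match the pattern are silently skipped.
btrfs_err_summary() {
  sed -n 's/.*bdev \([^ ]*\) errs: wr \([0-9]*\), rd \([0-9]*\), flush \([0-9]*\), corrupt \([0-9]*\), gen \([0-9]*\).*/\1 wr=\2 rd=\3 flush=\4 corrupt=\5 gen=\6/p'
}

# Example usage (the syslog path is an assumption, adjust to your system):
#   grep 'BTRFS info' /var/log/syslog | btrfs_err_summary
```

A device with large wr/rd counts next to a pool partner showing zeros, as with /dev/sdj1 here, is the pattern that points to one device having dropped rather than to corruption spread across the whole pool.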
neztach Posted November 17, 2021

I have a backplane, so I can't really plug directly into the board, but I switched the slots for both cache drives, and the corrective btrfs scrub started fine, then aborted.

eden-diagnostics-20211117-0845.zip
JorgeB Posted November 17, 2021

It dropped offline again.
neztach Posted November 17, 2021

I keep rebooting and scrubbing... sometimes it gets farther, sometimes not. Shall I continue, or is there a better approach?
neztach Posted November 17, 2021

Also, I have a HUGE PlexMediaServer folder on cache that I wouldn't be upset about if I trashed it altogether. Would that help, make it worse, or change nothing?
JonathanM Posted November 17, 2021

Until you change the hardware path and figure out which part is responsible for the drive dropping, it's not productive to keep doing the same thing over and over.
JorgeB Posted November 17, 2021

If you can't connect it to the onboard SATA at least connect it to the LSI controller, like the other SSD is, still no trim but at least it's a reliable controller.
neztach Posted November 17, 2021

I guess I don't understand how one SSD appears to be connected differently than another SSD. It's a single backplane, and both are connected to the same backplane. Perhaps in my previous diagnostics, one of the drives was hooked up to an LSI controller then as well? If I knew which one was on the LSI controller both then and now, I could perhaps shift them around so they are both in a slot previously recognized as LSI. If not, which part of the diagnostics would I look at to see LSI vs SASLP?
JorgeB Posted November 17, 2021

If it's a single backplane it's direct connection, you have devices connected to 3 different controllers:

These are on an SASLP:

[1:0:0:0] disk ATA SanDisk SDSSDH31 70RL /dev/sdj /dev/sg9
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:0:0 [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:0/end_device-1:0/target1:0:0/1:0:0:0]
[1:0:1:0] disk ATA WDC WD100EMAZ-00 0A83 /dev/sdk /dev/sg10
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/1:0:1:0 [/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/host1/port-1:1/end_device-1:1/target1:0:1/1:0:1:0]

These on the onboard SATA ports:

[2:0:0:0] disk ATA ST10000VN0008-2P SC61 /dev/sdb /dev/sg1
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/2:0:0:0 [/sys/devices/pci0000:00/0000:00:11.4/ata1/host2/target2:0:0/2:0:0:0]
[3:0:0:0] disk ATA WDC WD60EFRX-68M 0A82 /dev/sdc /dev/sg2
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/3:0:0:0 [/sys/devices/pci0000:00/0000:00:11.4/ata2/host3/target3:0:0/3:0:0:0]
[4:0:0:0] disk ATA WDC WD60EFRX-68M 0A82 /dev/sdd /dev/sg3
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/4:0:0:0 [/sys/devices/pci0000:00/0000:00:11.4/ata3/host4/target4:0:0/4:0:0:0]
[5:0:0:0] disk ATA ST8000VN0022-2EL SC61 /dev/sde /dev/sg4
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/5:0:0:0 [/sys/devices/pci0000:00/0000:00:11.4/ata4/host5/target5:0:0/5:0:0:0]
[6:0:0:0] disk ATA WDC WD60EFRX-68M 0A82 /dev/sdf /dev/sg5
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/6:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata5/host6/target6:0:0/6:0:0:0]
[7:0:0:0] disk ATA WDC WD60EFRX-68M 0A82 /dev/sdg /dev/sg6
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/7:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata6/host7/target7:0:0/7:0:0:0]
[8:0:0:0] disk ATA WDC WD60EFRX-68M 0A82 /dev/sdh /dev/sg7
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/8:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata7/host8/target8:0:0/8:0:0:0]
[9:0:0:0] disk ATA ST10000VN0004-1Z SC60 /dev/sdi /dev/sg8
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/9:0:0:0 [/sys/devices/pci0000:00/0000:00:1f.2/ata8/host9/target9:0:0/9:0:0:0]

And these on another SASLP:

[12:0:0:0] disk ATA ST10000VN0008-2J SC60 /dev/sdl /dev/sg11
  state=running queue_depth=1 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:0:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:0/end_device-12:0/target12:0:0/12:0:0:0]
[12:0:1:0] disk ATA WDC WD100EMAZ-00 0A83 /dev/sdm /dev/sg12
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:1:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:1/end_device-12:1/target12:0:1/12:0:1:0]
[12:0:2:0] disk ATA SanDisk Ultra II 00RL /dev/sdn /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:2:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:2/end_device-12:2/target12:0:2/12:0:2:0]
[12:0:3:0] disk ATA ST4000DM000-1F21 CC54 /dev/sdo /dev/sg14
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:3:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:3/end_device-12:3/target12:0:3/12:0:3:0]
[12:0:4:0] disk ATA ST10000VN0008-2P SC61 /dev/sdp /dev/sg15
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:4:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:4/end_device-12:4/target12:0:4/12:0:4:0]
[12:0:5:0] disk ATA WDC WD60EFRX-68L 0A82 /dev/sdq /dev/sg16
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/12:0:5:0 [/sys/devices/pci0000:00/0000:00:01.1/0000:02:00.0/host12/port-12:5/end_device-12:5/target12:0:5/12:0:5:0]

So I see no reason why you can't connect the SSDs to the Intel ports, it's just a case of swapping things around.
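The controller grouping in the listing above can be read straight off the sysfs path in each `dir:` entry: the last PCI address before `hostN` is the controller the disk hangs off. A rough sketch of that lookup follows; the function names are invented for illustration, and it assumes the usual Linux sysfs layout shown in the listing.

```shell
# Hypothetical helper: print the PCI address of the controller for a sysfs
# device path. The last PCI address in the path is the controller itself;
# any earlier ones are the bridges or root ports it sits behind.
controller_of_path() {
  printf '%s\n' "$1" \
    | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' \
    | tail -n 1
}

# Same lookup for a live block device name such as "sdj"
# (assumes /sys/block/<name> resolves to a PCI-attached device).
controller_of_disk() {
  controller_of_path "$(readlink -f "/sys/block/$1")"
}

# Example usage on a live system:
#   controller_of_disk sdj            # prints a PCI address
#   lspci -s "$(controller_of_disk sdj)"   # names the controller at that address
```

Disks that print the same address share a controller, so two cache SSDs that should both be on the onboard ports ought to resolve to one of the onboard addresses (0000:00:11.4 or 0000:00:1f.2 in the listing above).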
neztach Posted November 17, 2021

You're absolutely right, and I'm perfectly willing to try. I'm just wondering which log you got that from, so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try, then self-verify so I can follow your advice and get both of them on an LSI controller.
JorgeB Posted November 17, 2021

11 minutes ago, neztach said: so I can look at the previous diagnostics and see if either cache drive was previously in an LSI to narrow down what ports I could try,

You can look, it's in the lsscsi.txt file in the diags, but it was on the same controller:

8 hours ago, JorgeB said: Try replacing the cables on that device, and ideally don't use the SASLP controller

The other SSD device was also there, so that one is now on the LSI. The SSD was on an SASLP, now is on the other one.
JorgeB Posted November 17, 2021

Sorry, I must have confused your diags with another one, you have two SASLP controllers, there's no LSI, you can still connect the SSDs to the Intel ports though.
neztach Posted November 17, 2021

Well, based on your original response, I rearranged my cache drives to what I thought was the LSI. Can you confirm? And I feel like I'm coming across as dense, but which are my Intel ports?

eden-diagnostics-20211117-1340.zip
neztach Posted November 17, 2021

To reiterate, I would love for there to be some method to just change the pool back to using a single drive (the one of the two drives in the cache pool that is still good). Or I would be willing to buy a new 2TB SSD to replace both drives in the pool. At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs. Is there anything I can do here?
JorgeB Posted November 18, 2021

12 hours ago, neztach said: but which are my intel ports?

The ones listed above as onboard SATA.

8 hours ago, neztach said: At this point, I would just like some way (ANY. way.) - that would allow me to at least copy off whatever files are essential - especially my VMs. Is there anything I can do here?

By the looks of it you should be able to get most data but there could be some data corruption, but you first need to solve the device dropping issue.
neztach Posted November 18, 2021

I think the device dropping issue is probably symptomatic of the drive dying, in which case, what could I really do about it? Having said that, I'll shuffle the drives around again and send another diagnostics. In the meantime, if the device still drops after I shuffle the cache drives to both be on SATA, is there something I can do after that to attempt to copy data off?
JorgeB Posted November 18, 2021

12 minutes ago, neztach said: is there something I can do after that to attempt to copy data off?

Not much you can do if the drive keeps dropping, you were using RAID0 so the pool is not redundant, I assume also no backups?
neztach Posted November 18, 2021

It appeared to have gotten further, and I believe I now have both cache drives in SATA slots, diagnostics included to confirm. Is there nothing I can do to limp it along and get a copy of the contents? Some of it? All of it? Like I said, all I really want is my VMs... specifically one of them. After that, I don't mind decommissioning the bad SSD and formatting the remaining one as cache. I already have a new 2TB en route.

eden-diagnostics-20211118-0919.zip
JorgeB Posted November 18, 2021

23 minutes ago, neztach said: I believe I now have both cache drives in SATA slots

They are, and we can now see it's a device problem: there are uncorrectable errors (bad sectors), but unlike with the badly behaving SASLP, the SSD no longer drops. You can try btrfs restore; I'm not sure if it continues when there's an I/O error. If it doesn't, you'd need to first clone the SSD to a new device and then try again. Note that some data corruption is expected in any case.
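As a rough illustration of the btrfs restore suggestion: `btrfs restore` reads from an unmounted filesystem and copies what it can find to another location, and its `-i` (`--ignore-errors`) option asks it to keep going after errors rather than stop at the first one. The wrapper below is only a sketch; the function names, the device /dev/sdX1 and the destination directory are placeholders, not values from this thread. It defaults to printing the command instead of running it.

```shell
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: copy files off an UNMOUNTED btrfs pool member.
#   $1 = one member device of the pool (placeholder: /dev/sdX1)
#   $2 = destination directory on a healthy disk with enough free space
restore_from_pool() {
  # -v: list files as they are restored; -i: ignore errors and continue
  run btrfs restore -v -i "$1" "$2"
}

# Example usage (placeholder paths):
#   restore_from_pool /dev/sdX1 /mnt/disk1/restore            # print only
#   DRY_RUN=0 restore_from_pool /dev/sdX1 /mnt/disk1/restore  # really run it
```

Because the pool was RAID0, files that span the bad sectors will most likely come back incomplete or corrupt even when the command finishes, which matches the warning above about expected data corruption.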
neztach Posted November 18, 2021

It says it can't because the disk is mounted, but I can't get to the rest of the pool if it's unmounted.