cache drive failure?



So I've been noticing a lot of errors like the following on my cache pool recently... Sometimes they appear for the first drive, and sometimes for the second.

Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
Nov 12 10:32:48 Tower kernel: BTRFS warning (device sdb1): lost page write due to IO error on /dev/sdd1
Nov 12 10:32:48 Tower kernel: BTRFS error (device sdb1): error writing primary super block to device 2
### [PREVIOUS LINE REPEATED 8 TIMES] ###
Nov 12 10:32:50 Tower kernel: print_req_error: 1288 callbacks suppressed
Nov 12 10:32:50 Tower kernel: print_req_error: I/O error, dev sdd, sector 31464
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Nov 12 10:32:50 Tower kernel: btrfs_dev_stat_print_on_error: 4120 callbacks suppressed
Nov 12 10:32:50 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdd1 errs: wr 40584686, rd 14313628, flush 2491, corrupt 0, gen 0
Nov 12 10:32:50 Tower kernel: print_req_error: I/O error, dev sdd, sector 2336

 

I've tried replacing the cables, but I still get these errors. Nothing seemed functionally wrong, though.
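For reference, the error counters in that last BTRFS line (`wr 40584686, rd 14313628, ...`) can also be queried directly with `btrfs device stats`; a rough sketch, assuming the pool is mounted at `/mnt/cache` (the live commands are left commented, and the syslog line below is copied from above):

```shell
# Hypothetical check, assuming the pool is mounted at /mnt/cache:
#   btrfs device stats /mnt/cache        # show per-device error counters
#   btrfs device stats -z /mnt/cache     # reset them after fixing cables
# Pulling the write-error count out of a syslog line like the one above:
line='Nov 12 10:32:50 Tower kernel: BTRFS error (device sdb1): bdev /dev/sdd1 errs: wr 40584686, rd 14313628, flush 2491, corrupt 0, gen 0'
echo "$line" | grep -o 'wr [0-9]*'
```

The counters persist across reboots, so resetting them after a cable swap makes it easy to see whether new errors are still accumulating.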

 

EDIT: (adding a small section for context I forgot to include...)

 

Today I noticed that one of my shares had data on the cache drive, even though the share was already set to use no cache. I started manually copying those files, but partway through, /mnt/user became unreachable with an error like `Transport endpoint is not connected`... Hence why I decided to take down the server for checks.
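For what it's worth, the usual way to clear files stranded on the cache for a no-cache share is to copy from the cache disk path to the array-only user path, never mixing disk paths and /mnt/user paths for the same files in one command. A rough sketch, where "MyShare" is a placeholder share name (not from this thread):

```shell
# Hypothetical sketch; "MyShare" is a placeholder share name.
# /mnt/cache/... addresses the cache disk directly, while /mnt/user0/...
# is the array-only view of the share (it excludes the cache), so copying
# between them never mixes the same file's disk and user paths:
src=/mnt/cache/MyShare
dst=/mnt/user0/MyShare
echo "rsync -a $src/ $dst/"   # print the command rather than run it
```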

 

end of Edit

 

I took my server down just to run a check, but then realized... my 2nd cache drive wouldn't show up. After reconnecting it, it appeared only under Unassigned Devices. On top of that, the syslog was showing a `/dev/sdx is size zero` error.

 

I rebooted once more, and this time the array auto-started with only one of the cache drives... and for some reason it reported something like 1.2 TB used / 800 GB free, even though my drives are 1 TB each...

 

Now this is the part where I'm worried I may have made things worse. I started the array again, and now it says Unmountable: No File System. The array is currently stuck stopping, endlessly retrying to unmount the user shares.

 

Did I permanently screw up? Or is this still recoverable?

 

tower-diagnostics-20201112-1841.zip

 

 

Edit 2: Adding a picture of how it looks now after another reboot.

I'm able to access the cache drive, so I'm backing it up for now.

Untitled.png

Edited by NinjaKitty

Cache device dropped offline:

Nov 12 10:30:37 Tower kernel: ata2: hard resetting link
Nov 12 10:30:37 Tower kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 12 10:30:37 Tower kernel: ata2.00: failed to set xfermode (err_mask=0x1)
Nov 12 10:30:37 Tower kernel: ata2.00: limiting speed to UDMA/100:PIO3
Nov 12 10:30:42 Tower kernel: ata2: hard resetting link
Nov 12 10:30:43 Tower kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 12 10:30:43 Tower kernel: ata2.00: failed to set xfermode (err_mask=0x1)
Nov 12 10:30:43 Tower kernel: ata2.00: disabled

With SSDs this is usually a connection problem.
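One way to tell cable/port trouble apart from a failing drive is SMART's CRC counter; a hedged sketch, where `/dev/sdd` is assumed from the syslog above and the attribute line shown is illustrative, not real output:

```shell
# Hypothetical check; /dev/sdd is assumed from the syslog above.
#   smartctl -a /dev/sdd | grep -i crc
# A climbing UDMA_CRC_Error_Count implicates the cable/port; reallocated
# or pending sectors implicate the drive itself. Sample attribute line
# (illustrative only) and how to read its raw value (the last field):
sample='199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 12'
echo "$sample" | awk '{print $NF}'
```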

 

21 hours ago, NinjaKitty said:

Now this is the part where I'm worried I may have made things worse. I started the array again, and now it says Unmountable: No File System. The array is currently stuck stopping, endlessly retrying to unmount the user shares.

If both devices have been dropping, it might be difficult to recover. There are some options here; also make sure to check this.
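In case it helps others who land on this thread: when a btrfs pool won't mount normally, it can often still be mounted read-only in degraded mode so data can be copied off first. A generic sketch (device, mountpoint, and backup path are placeholders, not taken from this thread):

```shell
# Hypothetical recovery sketch -- placeholders, adjust before use:
#   mkdir -p /mnt/recover
#   mount -o ro,degraded /dev/sdb1 /mnt/recover
#   rsync -a /mnt/recover/ /mnt/disk1/cache-backup/
# The important part is the mount option string:
opts="ro,degraded"
echo "mount -o $opts /dev/sdb1 /mnt/recover"
```

Copying everything out before attempting any repair keeps the repair attempts from being the thing that destroys the data.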


Thanks for the response.

 

I was trying to update my ticket yesterday, but I kept running into a 4xx error on the forums.

 

I was able to resolve my problem since I was able to access the data (using the same steps mentioned in that first link).

I just moved all the data over to the main array, then rebuilt the cache pool... Before rebuilding, I noticed one of the two drives acting odd; after several attempts at moving data around, it eventually stopped working entirely. I tried to format it with GParted, and it froze. Shortly afterwards it was no longer visible to `fdisk -l`, so I assume it died (it's still under warranty, and the pool was RAID 1, so not a big deal).
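A quick way to check whether the kernel still sees a drive at all, before reaching for `fdisk -l`: list block devices, or test for the device's sysfs entry. A sketch with `/dev/sdd` as a placeholder name:

```shell
# Hypothetical check; "sdd" is a placeholder device name.
# lsblk reads /sys/block, so a drive that has dropped off the bus
# simply won't be listed:
#   lsblk -d -o NAME,SIZE,MODEL
#   fdisk -l /dev/sdd
# A scriptable presence test via sysfs:
dev=sdd
if [ -e "/sys/block/$dev" ]; then echo "present"; else echo "gone"; fi
```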

 

So right now I'm running on the one drive that still works, and the cache is up and working fine. I've ordered some more SSDs and am going through the warranty process for the presumably dead one.

 

