GoldStig23 Posted December 5, 2023 Share Posted December 5, 2023 Two weeks ago, I had a hard lock up on 6.12.5 and rebooted the server. After that happened I turned on remote syslogging to see if I could catch the error again if it happened. I noticed another issue where the cache drive BTRFS was throwing errors BTRFS: error (device loop2: state EA) in cleanup_transaction:1964: errno=-5 IO failure BTRFS warning (device loop2: state E): Skipping commit of aborted transaction. BTRFS info (device loop2: state E): forced readonly BTRFS: error (device loop2) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction) BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0 I/O error, dev loop2, sector 16606560 op 0x1:(WRITE) flags 0x1800 phys_seg 16 prio class 2 BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0 I purchased a new ssd to replace the failing drive. I naively thought that replacement worked the same as replacing the parity drive, so I added the drive to the cache pool, and that seemed to have caused the issues to expound. I tried to backup the files on the cache drive by setting the shares to Cache>Array and invoking the mover. I was able to move most of the data, but it seems that the drive went read only. I captured diagnostics from Dec 3rd after that happened. I ran the following commands and it appeared to fail, so I stopped the array and started the array again. root@tower:~# root@tower:~# mount -o rescue=all,ro /dev/sdj1 /temp mount: /temp: /dev/sdj1 already mounted on /temp. dmesg(1) may have more information after failed mount system call. root@tower:~# btrfs restore -v /dev/sdj1 /mnt/cache-ssd/restore ERROR: /dev/sdj1 is currently mounted, cannot continue root@tower:~# After starting and stopping the array, my drive is now unmountable. I've run the same commands above, and the result is the same. See diagnostics from today after the drive went unmountable. What's my next step? Not all of the data made if off the cache drive, and while it's just plex and some *arr services, it would be nice to have to start from scratch. I do have a second ssd drive installed in the server, so I plan on getting this drive replaced and hopefully RMA'd since it's still under warranty. tower-diagnostics-20231204-2141.zip tower-diagnostics-20231203-2102.zip Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 5 hours ago, GoldStig23 said: I noticed another issue where the cache drive BTRFS was throwing errors That was the docker image, not the cache filesystem, and nothing I see suggests a problem with the SSD. Post the output of btrfs fi show Quote Link to comment
GoldStig23 Posted December 5, 2023 Author Share Posted December 5, 2023 Here's the output of btrfs fi show root@tower:~# btrfs fi show Label: none uuid: e90043d3-1088-44d0-909f-6758d9989b79 Total devices 2 FS bytes used 458.17GiB devid 1 size 465.76GiB used 465.76GiB path /dev/sdj1 devid 2 size 0 used 0 path MISSING Label: none uuid: 715addb3-54e3-48d1-9c22-42f1ba1444ca Total devices 1 FS bytes used 144.00KiB devid 1 size 465.76GiB used 2.02GiB path /dev/sdk1 root@tower:~# Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 Which device was the old pool? Quote Link to comment
GoldStig23 Posted December 5, 2023 Author Share Posted December 5, 2023 4 minutes ago, JorgeB said: Which device was the old pool? sdj1 was the drive in the old pool. Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 Any idea why there's a missing device? Quote Link to comment
GoldStig23 Posted December 5, 2023 Author Share Posted December 5, 2023 I added a new drive to the pool thinking I could just replace it after it replicated the data to it. Should I try adding it back to the old pool? Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 Not for now, the balance is crashing, so most likely it won't work as well, try this: mkdir /x mount -t btrfs -o skip_balance /dev/sdj1 /x Then see if you can access the data on /x and if yes copy it to the array or the new pool. Quote Link to comment
GoldStig23 Posted December 5, 2023 Author Share Posted December 5, 2023 No dice, here's the output: root@tower:~# mkdir /x root@tower:~# mount -t btrfs -o skip_balance /dev/sdj1 /x mount: /x: wrong fs type, bad option, bad superblock on /dev/sdj1, missing codepage or helper program, or other error. dmesg(1) may have more information after failed mount system call. root@tower:~# Here's the syslog output after running that command. BTRFS error (device sdj1: state EMA): Remounting read-write after error is not allowed Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 Try mount -t btrfs -o skip_balance,ro /dev/sdj1 /x Quote Link to comment
Solution GoldStig23 Posted December 5, 2023 Author Solution Share Posted December 5, 2023 Ok, that seemed to have worked. I was able to mount it to the X directory, copy everything over with MC to the array. I think I grabbed everything What's the recommended next step? Should I start a new cache pool with both SSD drives, format them both and go to zfs? Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 34 minutes ago, GoldStig23 said: Should I start a new cache pool with both SSD drives, format them both and go to zfs? You can do that, make sure you select zfs as the filesystem before so it does not attempt to mount the old pool, or erase the devices first. Quote Link to comment
GoldStig23 Posted December 5, 2023 Author Share Posted December 5, 2023 I deleted the pools and started over. Changed the filesystem to zfs and added both drives to the new pool. They both now show up as " Unmountable: Unsupported or no file system" When I select the format button on the main page, it tries to format, but it doesn't appear to work. It still remains unmountable. Here's the output of the logs: Dec 5 11:56:08 tower kernel: mdcmd (36): nocheck pause Dec 5 11:56:10 tower emhttpd: creating volume: cache (zfs) Dec 5 11:56:10 tower emhttpd: shcmd (8047): /sbin/wipefs -af /dev/sdj Dec 5 11:56:10 tower root: /dev/sdj: 2 bytes were erased at offset 0x000001fe (dos): 55 aa Dec 5 11:56:10 tower emhttpd: shcmd (8048): /sbin/blkdiscard /dev/sdj Dec 5 11:56:10 tower root: blkdiscard: cannot open /dev/sdj: Device or resource busy Dec 5 11:56:10 tower emhttpd: shcmd (8048): exit status: 1 Dec 5 11:56:10 tower emhttpd: continuing... Dec 5 11:56:10 tower emhttpd: writing MBR on disk (sdj) with partition 1 offset 2048, erased: 0 Dec 5 11:56:10 tower emhttpd: re-reading (sdj) partition table Dec 5 11:56:11 tower emhttpd: error: mkmbr, 2192: Device or resource busy (16): ioctl BLKRRPART: /dev/sdj Dec 5 11:56:11 tower emhttpd: shcmd (8049): udevadm settle Dec 5 11:56:12 tower emhttpd: shcmd (8050): /sbin/wipefs -af /dev/sdj1 Dec 5 11:56:12 tower emhttpd: shcmd (8051): /sbin/wipefs -af /dev/sdk Dec 5 11:56:12 tower root: /dev/sdk: 2 bytes were erased at offset 0x000001fe (dos): 55 aa Dec 5 11:56:12 tower kernel: ata8.00: Enabling discard_zeroes_data Dec 5 11:56:12 tower emhttpd: shcmd (8052): /sbin/blkdiscard /dev/sdk Dec 5 11:56:13 tower kernel: ata8.00: Enabling discard_zeroes_data Dec 5 11:56:13 tower emhttpd: writing MBR on disk (sdk) with partition 1 offset 2048, erased: 0 Dec 5 11:56:13 tower emhttpd: re-reading (sdk) partition table Dec 5 11:56:14 tower kernel: ata8.00: Enabling discard_zeroes_data Dec 5 11:56:14 tower emhttpd: shcmd (8053): udevadm settle Dec 5 11:56:14 tower kernel: sdk: sdk1 Dec 5 11:56:15 tower emhttpd: shcmd (8054): /sbin/wipefs -af /dev/sdk1 Dec 5 11:56:15 tower emhttpd: shcmd (8055): /usr/sbin/zpool create -f -o ashift=12 -o autotrim=on -O compression=off -O dnodesize=auto -O acltype=posixacl -O xattr=sa -O normalization=formD -m /mnt/cache cache mirror /dev/sdj1 /dev/sdk1 Dec 5 11:56:15 tower root: cannot open '/dev/sdj1': Device or resource busy Dec 5 11:56:15 tower root: cannot create 'cache': one or more devices is currently unavailable Dec 5 11:56:15 tower emhttpd: shcmd (8055): exit status: 1 Dec 5 11:56:15 tower emhttpd: mounting /mnt/cache Dec 5 11:56:15 tower emhttpd: shcmd (8056): mkdir -p /mnt/cache Dec 5 11:56:15 tower emhttpd: /usr/sbin/zpool import -d /dev/sdj1 2>&1 Dec 5 11:56:15 tower emhttpd: no pools available to import Dec 5 11:56:15 tower emhttpd: cache: no uuid Dec 5 11:56:15 tower emhttpd: /mnt/cache mount error: Unsupported or no file system Dec 5 11:56:15 tower emhttpd: shcmd (8057): rmdir /mnt/cache Dec 5 11:56:15 tower emhttpd: Starting services... Dec 5 11:56:16 tower emhttpd: shcmd (8060): /etc/rc.d/rc.samba restart Dec 5 11:56:16 tower wsdd2[12390]: 'Terminated' signal received. Dec 5 11:56:16 tower wsdd2[12390]: terminating. Dec 5 11:56:18 tower root: Starting Samba: /usr/sbin/smbd -D Dec 5 11:56:18 tower root: /usr/sbin/nmbd -D Dec 5 11:56:18 tower root: /usr/sbin/wsdd2 -d -4 Dec 5 11:56:18 tower root: /usr/sbin/winbindd -D Dec 5 11:56:18 tower wsdd2[13473]: starting. Dec 5 11:56:18 tower emhttpd: shcmd (8064): /etc/rc.d/rc.avahidaemon restart Dec 5 11:56:18 tower root: Stopping Avahi mDNS/DNS-SD Daemon: stopped Dec 5 11:56:18 tower avahi-daemon[12438]: Got SIGTERM, quitting. Dec 5 11:56:18 tower avahi-dnsconfd[12447]: read(): EOF Quote Link to comment
JorgeB Posted December 5, 2023 Share Posted December 5, 2023 Reboot and try again, it should format then. Quote Link to comment
GoldStig23 Posted December 6, 2023 Author Share Posted December 6, 2023 I think I got everything working again. SSD drives are in a new pool, with the original data moved over. Got docker going again and all of the data from plex, etc is back to normal. Thank you for your help Jorge! 1 Quote Link to comment
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.