2Piececombo Posted October 15, 2020
I logged into my server to find that one cache disk in the pool has a seemingly impossible number of writes. Has anyone else seen this? I attached my diagnostics in case it's helpful: flounraiddiag.zip
2Piececombo (Author) Posted October 15, 2020
Immediately after posting this I opened the system log to see it spamming this:
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707617, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707618, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707619, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707620, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707621, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707622, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707623, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707624, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707625, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 11707626, rd 7822225, flush 111006, corrupt 0, gen 0
Oct 15 10:19:52 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:52 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:52 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): error writing primary super block to device 1
Oct 15 10:19:53 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:53 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:53 floserver kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme1n1p1
Oct 15 10:19:53 floserver kernel: BTRFS error (device nvme1n1p1): error writing primary super block to device 1
Oct 15 10:19:53 floserver kernel: BTRFS error (device nvme1n1p1): error writing primary super block to device 1
Oct 15 10:20:00 floserver kernel: btrfs_dev_stat_print_on_error: 163 callbacks suppressed
Edited October 15, 2020 by 2Piececombo
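For anyone reading a log like this, the "errs:" lines carry the per-device error counters (write, read, flush, corruption, generation). Below is a rough Python sketch of pulling those counters out of one syslog line. The function name `parse_btrfs_errs` and the regular expression are my own illustration, written against the line format shown in this post, not part of any Unraid or btrfs tool.

```python
import re

# Matches the per-device error summary btrfs prints to the kernel log,
# based only on the line format quoted in this thread.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for a BTRFS 'errs:' log line, or None."""
    match = BTRFS_ERRS.search(line)
    if match is None:
        return None
    # Everything except the device path is an integer counter.
    counters = {k: int(v) for k, v in match.groupdict().items() if k != "dev"}
    return match.group("dev"), counters


# First error line from the log above.
sample = (
    "Oct 15 10:19:52 floserver kernel: BTRFS error (device nvme1n1p1): "
    "bdev /dev/nvme1n1p1 errs: wr 11707617, rd 7822225, flush 111006, "
    "corrupt 0, gen 0"
)
print(parse_btrfs_errs(sample))
```

Lines that are not "errs:" summaries (for example the "lost page write" warnings) simply return None, so the function can be run over a whole syslog.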
JorgeB Posted October 15, 2020
Crazy number of writes usually means a dropped device, and that was what happened here:
Oct 13 16:09:58 floserver kernel: nvme nvme1: I/O 102 QID 8 timeout, aborting
Oct 13 16:10:28 floserver kernel: nvme nvme1: I/O 102 QID 8 timeout, reset controller
Oct 13 16:10:59 floserver kernel: nvme nvme1: I/O 28 QID 0 timeout, reset controller
Oct 13 16:13:40 floserver kernel: nvme nvme1: Device not ready; aborting reset
Oct 13 16:13:40 floserver kernel: nvme nvme1: Abort status: 0x7
Oct 13 16:13:40 floserver kernel: print_req_error: I/O error, dev nvme1n1, sector 41730296
Oct 13 16:15:48 floserver kernel: nvme nvme1: Device not ready; aborting reset
Oct 13 16:15:48 floserver kernel: nvme nvme1: Removing after probe failure status: -19
Oct 13 16:17:56 floserver kernel: nvme nvme1: Device not ready; aborting reset
Recommend reading this for better pool monitoring.
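As a rough idea of what pool monitoring can look like, here is a minimal Python sketch that checks the output of `btrfs device stats` for non-zero error counters. This is not the script from the linked post. The helper names, the `/mnt/cache` mount point, and the sample text are assumptions for illustration, with the sample built from the counters in the log earlier in this thread.

```python
import re
import subprocess

# One line of `btrfs device stats` output looks like:
#   [/dev/nvme1n1p1].write_io_errs    11707626
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<value>\d+)$")


def nonzero_stats(stats_text):
    """Return {device: {counter: value}} for every counter above zero."""
    problems = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match and int(match.group("value")) > 0:
            dev = problems.setdefault(match.group("dev"), {})
            dev[match.group("name")] = int(match.group("value"))
    return problems


def read_pool_stats(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on the pool (mount point is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative input only, not real command output from this server.
sample = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme1n1p1].write_io_errs    11707626
[/dev/nvme1n1p1].read_io_errs     7822225
[/dev/nvme1n1p1].flush_io_errs    111006
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
"""
print(nonzero_stats(sample))
```

Run on a schedule, something like `nonzero_stats(read_pool_stats())` returning a non-empty dict would be the cue to send a notification, so a dropped device gets noticed before the counters reach the millions.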
2Piececombo (Author) Posted October 15, 2020
Okay, I'll set up the script and run a scrub. If uncorrectable errors are found, am I looking at a bad drive/port? (NVMe, so no cables)
JorgeB Posted October 15, 2020
1 minute ago, 2Piececombo said: im looking at a bad drive/port? (nvme, so no cables
Sometimes a newer kernel helps, like the one in -beta30. This can sometimes also help: some NVMe devices have issues with power states on Linux. To try it, on the main GUI page click on flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append" and before "initrd=/bzroot":
nvme_core.default_ps_max_latency_us=0
Reboot and see if it makes a difference.
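To make the placement concrete, here is a small Python sketch that puts the parameter after "append" and before the "initrd=" entry of a single boot line. The helper name `add_kernel_param` and the example line `append initrd=/bzroot` are assumptions for illustration. Your actual Syslinux configuration may carry other options, and editing it through the GUI as described above is the normal route.

```python
NVME_PARAM = "nvme_core.default_ps_max_latency_us=0"


def add_kernel_param(append_line, param=NVME_PARAM):
    """Insert `param` into an 'append ...' line, just before 'initrd=...'."""
    tokens = append_line.split()
    if param in tokens:
        return append_line  # already present, leave the line untouched
    # Keep whatever indentation the original line had.
    indent = append_line[: len(append_line) - len(append_line.lstrip())]
    for index, token in enumerate(tokens):
        if token.startswith("initrd="):
            tokens.insert(index, param)
            break
    else:
        # No initrd= entry found: fall back to adding it at the end.
        tokens.append(param)
    return indent + " ".join(tokens)


print(add_kernel_param("  append initrd=/bzroot"))
# prints "  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot"
```

Running it a second time on the result changes nothing, so the parameter is never added twice.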
2Piececombo (Author) Posted October 15, 2020
Okay, I'll give that a shot as well. Scrub is currently running. Reboot will have to wait a while, can't kick people off right now. Will post back later. Thanks for the help as usual!
2Piececombo (Author) Posted October 15, 2020
This is what I got from the scrub
JorgeB Posted October 15, 2020
Is the dropped NVMe back online? Did you reboot?
2Piececombo (Author) Posted October 15, 2020
It's showing up in the cache pool, if that's what you are asking. I have not been able to reboot yet; there is a critical VM running and I cannot kick the user off at the moment.
JorgeB Posted October 15, 2020
A dropped device won't come back online without at least a reboot, possibly a power cycle.
2Piececombo (Author) Posted October 15, 2020
I see. I noticed it dropped offline once a week ago, so I shut the server down and re-seated the drive. I wrote it off as a fluke, but I guess I may have a real problem here. I had the idea to shut down the server and swap the NVMe drives between the two ports, so I could see if the same device falls off or if it's the M.2 port on the board. Unless you have a better suggestion for testing each drive. Thanks again.
JorgeB Posted October 15, 2020
9 minutes ago, 2Piececombo said: so I could see if the same device falls off or if it's the m.2 port on the board.
It's worth a try.
2Piececombo (Author) Posted October 20, 2020
Update: I pulled the possibly bad drive and stuck a new one in, and it was fine for about 24 hours. Checked on it this morning and realized that now both NVMe disks are showing a crazy number of writes and reads. My VM page is now empty, showing there are none. Just did a reboot, and now neither NVMe is showing. Going over IPMI to the motherboard, it shows both NVMe drives, but Unraid sees neither of them. Included diagnostics. I'm not sure what to do at this point; at the very least I need to move the data off these cache drives to the array so I can get a very important VM back online. floserver-diagnostics-20201020-1114.zip
JorgeB Posted October 20, 2020
33 minutes ago, 2Piececombo said: I pulled the possibly bad drive
Unlikely to be a device problem. Is this the first boot after adding the nvme_core line? If yes, try removing it. It never caused NVMe devices to go undetected before, but I guess it could.
2Piececombo (Author) Posted October 20, 2020
It was not the first reboot since adding that, though I did take it out and perform another reboot; still nothing. I then rebooted once more and the devices showed up again, but the array wouldn't start, saying too many devices (I had plugged in a USB drive a while back, pushing me over my 6 device limit, though it never complained about it until now). Upgraded the licence and started the array. I find it hard to believe the extra device had something to do with it, though it is weird it never yelled at me about having too many devices until now. Not sure what to make of that. Things are okay (for the time being), though my VM is having some trouble booting, but I can sort that out.
JorgeB Posted October 20, 2020
You should upgrade to the latest beta; a newer kernel might help with the NVMe dropping issue.
2Piececombo (Author) Posted October 20, 2020
Okay, I'll look into that. For now I'm getting everything off the cache. Perhaps you can give me some guidance with my VM problem. When I start the VM (Windows Server 2019) it is not able to boot. I get to a command prompt and check diskpart, and it does not show the vdisk, even though the vdisk is assigned to the VM. It shows the virtio and ISO, but not the vdisk. Any ideas there?
JorgeB Posted October 20, 2020
8 minutes ago, 2Piececombo said: Any ideas there?
VMs are not my strong suit, best bet is to make a new post about that in the KVM forum.
2Piececombo (Author) Posted October 20, 2020
Will do, thanks.
EDIT: For anyone that may come across this in the future, check out this thread
Edited February 19, 2021 by 2Piececombo