BTRFS bdev /dev/nvme0n1p1 error


Solved by JorgeB

Hi guys, for the last several days my system log has been flooded with BTRFS errors. I used to have one cache device in my Unraid pool, and one day I decided to add a second one as BTRFS, expecting it to mirror the first. Everything seemed to run perfectly for a couple of weeks, until one day I noticed that the temperature of the new cache device was shown as "*". I then checked the log and found it flooded with BTRFS errors. At first I just unmounted the disk, reformatted it, and put it back in the pool, but the errors came back. I tried the other NVMe slots on the motherboard, but that didn't fix the issue. I also tried a different NVMe stick, but the issue was still there.

The other thing I noticed is that every time this happens, the second cache device appears to crash: I can't see its identity or capabilities details when I click on its icon. Also, after I stop the array, the other cache always goes missing until I reboot the system.

 

I need some guidance here, since I am very new to Unraid.

Thanks.

My most recent diagnostics are attached.

Example of the recent log entries:

 

Feb 3 00:49:25 ASAS kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme0n1p1 errs: wr 11126892, rd 1855, flush 30442, corrupt 0, gen 0
Feb 3 00:49:25 ASAS kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme0n1p1 (-5)
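
For anyone who wants to pull the counters out of a line like the one above, here is a minimal Python sketch. It assumes the exact "bdev ... errs:" layout shown in this log; the function name `parse_bdev_errs` is made up for illustration and is not part of any Unraid or btrfs tool.

```python
import re

# Matches the counter part of a kernel line of the form seen in this thread:
#   ... bdev /dev/nvme0n1p1 errs: wr N, rd N, flush N, corrupt N, gen N
BDEV_ERRS = re.compile(
    r"bdev (?P<bdev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_bdev_errs(line):
    """Return (device path, counters dict) for a 'bdev ... errs:' line, else None."""
    m = BDEV_ERRS.search(line)
    if m is None:
        return None
    counters = {k: int(v) for k, v in m.groupdict().items() if k != "bdev"}
    return m.group("bdev"), counters


# The line posted above, split only to keep the source readable.
line = ("Feb 3 00:49:25 ASAS kernel: BTRFS error (device nvme1n1p1): "
        "bdev /dev/nvme0n1p1 errs: wr 11126892, rd 1855, flush 30442, corrupt 0, gen 0")
bdev, counters = parse_bdev_errs(line)
print(bdev, counters)
```

In this line the write, read and flush counters are non-zero while corrupt and gen are 0, which usually points at I/O requests failing to reach the device rather than at checksum mismatches. That reading is a general rule of thumb, not a diagnosis.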

 

asas-diagnostics-20240203-0109.zip

Edited by febriantoarbi
Link to comment
  • mrpainnogain changed the title to BTRFS bdev /dev/nvme0n1p1 error

NVMe device dropped offline almost immediately after mounting the pool:

 

Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 134 (I/O Cmd) QID 5 timeout, aborting
Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 135 (I/O Cmd) QID 5 timeout, aborting
Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 136 (I/O Cmd) QID 5 timeout, aborting
Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 137 (I/O Cmd) QID 5 timeout, aborting
Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 138 (I/O Cmd) QID 5 timeout, aborting
Feb  2 18:47:41 ASAS kernel: nvme nvme0: I/O 134 QID 5 timeout, reset controller
Feb  2 18:47:41 ASAS kernel: nvme nvme0: I/O 19 QID 0 timeout, reset controller
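
To see at a glance which controller is timing out and how often, the kernel lines above can be tallied with a short script. This is only a sketch that assumes the message format shown in this log; `summarize_timeouts` is an illustrative name.

```python
import re
from collections import Counter

# Captures the controller name and the action the kernel took after the timeout.
TIMEOUT = re.compile(
    r"nvme (?P<ctrl>nvme\d+): I/O \d+ .*?timeout, (?P<action>aborting|reset controller)"
)


def summarize_timeouts(lines):
    """Count timeout messages per (controller, action) pair."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT.search(line)
        if m:
            counts[(m.group("ctrl"), m.group("action"))] += 1
    return counts


# Three of the lines from the log above.
SAMPLE = [
    "Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 134 (I/O Cmd) QID 5 timeout, aborting",
    "Feb  2 18:47:11 ASAS kernel: nvme nvme0: I/O 135 (I/O Cmd) QID 5 timeout, aborting",
    "Feb  2 18:47:41 ASAS kernel: nvme nvme0: I/O 134 QID 5 timeout, reset controller",
]
timeout_counts = summarize_timeouts(SAMPLE)
for (ctrl, action), n in sorted(timeout_counts.items()):
    print(ctrl, action, n)
```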

 

Try swapping the devices and see if the problem follows the device.

 

P.S. in case you're not aware the pool is missing another device, besides this one that dropped now:

 

Feb  2 18:45:12 ASAS emhttpd:  Total devices 3 FS bytes used 634.53GiB
Feb  2 18:45:12 ASAS emhttpd:  devid    1 size 953.87GiB used 651.03GiB path /dev/nvme1n1p1
Feb  2 18:45:12 ASAS emhttpd:  devid    4 size 953.87GiB used 6.00GiB path /dev/nvme0n1p1
Feb  2 18:45:12 ASAS emhttpd:  *** Some devices missing
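
The mismatch in the output above (3 devices expected, 2 listed) can also be checked mechanically. A minimal Python sketch, assuming the "Total devices" and "devid" line layout shown here; `missing_device_count` is an illustrative name.

```python
import re


def missing_device_count(lines):
    """Compare the 'Total devices N' figure with the number of devid lines listed."""
    total = None
    present = 0
    for line in lines:
        m = re.search(r"Total devices (\d+)", line)
        if m:
            total = int(m.group(1))
        elif re.search(r"devid\s+\d+\s+size", line):
            present += 1
    if total is None:
        raise ValueError("no 'Total devices' line found")
    return total - present


# The emhttpd lines quoted above.
SAMPLE = [
    "Feb  2 18:45:12 ASAS emhttpd:  Total devices 3 FS bytes used 634.53GiB",
    "Feb  2 18:45:12 ASAS emhttpd:  devid    1 size 953.87GiB used 651.03GiB path /dev/nvme1n1p1",
    "Feb  2 18:45:12 ASAS emhttpd:  devid    4 size 953.87GiB used 6.00GiB path /dev/nvme0n1p1",
    "Feb  2 18:45:12 ASAS emhttpd:  *** Some devices missing",
]
print(missing_device_count(SAMPLE))
```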

 

Link to comment
18 minutes ago, JorgeB said:

Try swapping the devices and see if the problem follows the device.

P.S. in case you're not aware the pool is missing another device, besides this one that dropped now


Do you mean I need to stop the array, assign nvme0n1 to cache 1 and nvme1n1 to cache 2, and start the array again? Will I get "wrong" cache notifications?

 

nvme_swap.thumb.png.3a03c95f507489e6318833e3744baa52.png

 

The 2L082xxx is my first cache device, the one I have had from the beginning.

Regarding the missing device: maybe it is the first NVMe that I thought was broken, but I am pretty sure that I unmounted that NVMe and formatted it. I also removed it from the motherboard.

Edited by febriantoarbi
Link to comment
4 minutes ago, JorgeB said:

Power cycle the server to see if the device comes back online, just rebooting will likely not be enough, if it does try mounting the pool again.

I already did that a couple of hours ago. The cache came back online and I re-mounted the pool, but the issue happened again.

Link to comment
8 minutes ago, JorgeB said:

The device may be failing, to confirm swap m.2 slots with the other one and see where the issue follows.

Are you saying my first cache is failing, or cache 2? Is a BTRFS scrub necessary in this case?

Just a friendly reminder: I did try different M.2 slots with the previous NVMe that I thought was broken, but I did not move the NVMe of the first cache at all.

Edited by febriantoarbi
Link to comment
4 hours ago, JorgeB said:

Device with serial ending in SYH is the one that dropped, swap slots and see if the same device drops again, if yes it may be failing.

Okay, I will try that later. In the meantime, would it be possible to just unmount the problem device and run the pool with only the healthy device? The last time I did this, I got the "too many missing disks" error when trying to start the array.

Link to comment
2 hours ago, JorgeB said:

If the pool is redundant it should mount with just one device, post new diags if it doesn't.

OK, I am currently trying to start the array with only the first cache device (the one I had been using long before I added another), and I am getting this error: "cache - too many missing/wrong devices". How do I resolve this?

image.thumb.png.242e9bbfcb399f73f7909b0fe3e96596.png

image.thumb.png.9d7bcbfb6340ee93beea880595f1a1cc.png
 

asas-diagnostics-20240204-1938.zip

Edited by mrpainnogain
Link to comment
4 minutes ago, JorgeB said:

Cache profile is set to single, edit /boot/config/pools/cache.cfg and change diskFsProfile="single" to diskFsProfile="raid1", then try again, not sure if you need to reboot first for the change to be seen.

Do I need to do this via the terminal?

I got "permission denied" when trying to access that file as root.
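
For reference, the change quoted above is a one-line edit of a `key="value"` file, so any text editor will do. As a scripted alternative, here is a minimal Python sketch. It assumes the file uses one `key="value"` pair per line, as the quoted `diskFsProfile="single"` line suggests; `set_cfg_value` is an illustrative helper, not an Unraid tool, and the second key in the sample text is made up.

```python
import re


def set_cfg_value(text, key, value):
    """Return text with key="..." at the start of a line replaced by key="value"."""
    pattern = re.compile(r"^(" + re.escape(key) + r')="[^"]*"', re.MULTILINE)
    new_text, n = pattern.subn(lambda m: f'{m.group(1)}="{value}"', text)
    if n == 0:
        raise KeyError(key)
    return new_text


# exampleOtherKey is a made-up neighbouring line, only there to show it is left alone.
sample = 'exampleOtherKey="x"\ndiskFsProfile="single"\n'
print(set_cfg_value(sample, "diskFsProfile", "raid1"))
```

To apply it to the real file, read /boot/config/pools/cache.cfg into a string, pass it through the function, and write the result back, ideally after keeping a copy of the original file.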

 

 

Link to comment
21 minutes ago, JorgeB said:

Cache profile is set to single, edit /boot/config/pools/cache.cfg and change diskFsProfile="single" to diskFsProfile="raid1", then try again, not sure if you need to reboot first for the change to be seen.

Somehow I still got the same warning after changing it to raid1 from the terminal with sudo nano ......

I also rebooted the system.

 

 

asas-diagnostics-20240204-1959.zip

Edited by mrpainnogain
Link to comment

Assuming you rebooted after the change, try reimporting the pool with the single device, stop array, unassign all pool devices, start array to reset the pool, stop array, reassign the single pool device (leave the slots set to 2 or how it was, don't set to 1), start array.

Link to comment
19 minutes ago, JorgeB said:

Assuming you rebooted after the change, try reimporting the pool with the single device, stop array, unassign all pool devices, start array to reset the pool, stop array, reassign the single pool device (leave the slots set to 2 or how it was, don't set to 1), start array.

I unassigned all devices, started the array, then stopped it again. I am about to reassign the cache device, and it says the FS is auto. Is it safe to start the array again? The FS was BTRFS previously.

Do I have to change the file system type to BTRFS manually?

 

image.thumb.png.425fa51e1ccb5955610c23d2c6b8269a.png

image.thumb.png.ffd2dd682429ae6d250f9034bb260ed2.png

Edited by mrpainnogain
Link to comment

Okay, somehow it managed to work again after I removed the other NVMe, the one considered to be "broken". Now I remember that the cache device with serial 2L028LXX used to be labeled "nvme0n1", and somehow it changed to nvme1n1 after the 2nd NVMe was added. That may be what caused the issue with the array. All good for now, but I am still figuring out how to set up BTRFS with the other cache.

Link to comment
18 hours ago, mrpainnogain said:

Okay, somehow it managed to work again after I removed the other NVMe, the one considered to be "broken"

Yep, sorry, forgot to mention that device would need to be disconnected or wiped.

 

18 hours ago, mrpainnogain said:

but I am still figuring out how to set up BTRFS with the other cache.

Do you mean add the other NVMe device back?

Link to comment
19 hours ago, JorgeB said:

Yep, sorry, forgot to mention that device would need to be disconnected or wiped.

 

Do you mean add the other NVMe device back?

Yes. I am afraid that in my case something is wrong with the way I added the 2nd NVMe to the cache pool to be configured as BTRFS. I already sent the NVMe back to the shop and they did not find anything wrong with it. In the meantime I want to test the NVMe as an "unassigned device" and install a VM on it, just to make sure the device is actually in good condition. Other than a BTRFS pool, is there any way I can mirror one cache to the other? Maybe by copying the contents of one cache to the other on a regular basis, while services like Docker containers and VMs are off?

Link to comment
21 hours ago, JorgeB said:

You can create a mirror with btrfs or zfs, you can also manually sync one to the other, with rsync and a user script for example.
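
As a rough illustration of the manual rsync approach mentioned in the quote, here is a minimal Python sketch of a sync script. The pool paths /mnt/cache and /mnt/cache2 are assumptions for the example, and the helper names are made up. It defaults to a dry run so nothing is copied until that is switched off.

```python
import subprocess


def build_rsync_command(src, dst, delete=False, dry_run=True):
    """Build an rsync argument list that copies the contents of src into dst."""
    cmd = ["rsync", "-a"]  # -a: archive mode, keeps permissions, times, symlinks
    if delete:
        cmd.append("--delete")  # also remove files from dst that no longer exist in src
    if dry_run:
        cmd.append("--dry-run")  # only report what would be transferred
    # A trailing slash on the source means "the contents of this directory".
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


def sync(src, dst, **options):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_rsync_command(src, dst, **options), check=True)


# Hypothetical pool mount points; adjust to the actual pool names.
print(build_rsync_command("/mnt/cache", "/mnt/cache2"))
```

Calling `sync("/mnt/cache", "/mnt/cache2", dry_run=False)` would perform the actual copy. Note that a periodic copy like this is a backup taken at intervals, not a live mirror.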

Hey, I have another question. I just put another NVMe in the system, but now the cache device that was supposed to be labelled nvme0n1 became "nvme1n1", which triggered the disk missing warning again, and the cache became "unmountable: unsupported file system". How do I force the first cache device to keep the name "nvme0n1"?

asas-diagnostics-20240207-1402.zip

Edited by mrpainnogain
Link to comment
14 minutes ago, itimpi said:

That is the device name and is not used by Unraid to identify drives (instead Unraid uses the drive serial number).    Something else must be going on.   
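
To illustrate the difference between the kernel name and a serial-based identifier: on Linux, the entries under /dev/disk/by-id are symlinks whose names normally include model and serial, and each one points at whatever kernel name (nvme0n1, nvme1n1, ...) the device received on this boot. The sketch below simply lists that mapping. It is a generic Linux illustration and makes no claim about how Unraid itself tracks drives internally; `serial_to_kernel_name` is an illustrative name.

```python
import os


def serial_to_kernel_name(by_id_dir="/dev/disk/by-id"):
    """Map each stable by-id entry to the kernel device name it currently points at."""
    mapping = {}
    for entry in sorted(os.listdir(by_id_dir)):
        path = os.path.join(by_id_dir, entry)
        if os.path.islink(path):
            # The link target looks like ../../nvme0n1; keep only the last component.
            mapping[entry] = os.path.basename(os.readlink(path))
    return mapping


if os.path.isdir("/dev/disk/by-id"):
    for stable_id, kernel_name in serial_to_kernel_name().items():
        print(f"{kernel_name:12} <- {stable_id}")
```

If a device shows up as nvme1n1 instead of nvme0n1 after adding hardware, its by-id entry stays the same and only the link target changes.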

 

 

 

Hmm, any idea what might be the issue here? With only one NVMe everything runs smoothly, but the issue appears every time I add another NVMe (no matter which slot on the motherboard).

Link to comment
