(SOLVED) FITRIM ioctl failed: Input/output error followed by Cache pool BTRFS missing device(s)


FQs19
Solved by JorgeB

If anyone has time to explain to me what happened on my server, I would really appreciate it. 

 

While at work last night, I received an email from my unRAID server that stated the following:

cron for user root /sbin/fstrim -a -v | logger &> /dev/null

fstrim: /mnt/arraycache: FITRIM ioctl failed: Input/output error
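(For anyone hitting the same thing: a minimal way to sanity-check TRIM support and reproduce the failure outside of cron, assuming the pool is mounted at /mnt/arraycache as in the message above:)

# Non-zero DISC-GRAN/DISC-MAX values mean the device reports discard support
lsblk --discard /dev/nvme0n1

# Trim just the one mount point by hand to see the error directly
fstrim -v /mnt/arraycache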

 

Immediately after that I received another email from my unRAID server that stated:

Unraid Status: Warning [THREADRIPPER19] - Cache pool BTRFS missing device(s)

Event: Unraid Arraycache disk message
Subject: Warning [THREADRIPPER19] - Cache pool BTRFS missing device(s)
Description: Force_MP600_20238284000130053009 (nvme0n1)
Importance: warning

 

I wasn't able to log in remotely to unRAID even though I had My Servers set up to do so, but that's a problem for another day.

When I got home and logged in, I found that the disk in question, nvme0n1, was still shown on the Main tab, and I was able to view it by clicking on the Arraycache link. I've had NVMe disks drop offline before, but those disappeared from the Main tab entirely. Why did unRAID report the disk as missing while it was still visible?

My Plex docker was still working perfectly fine, which I assume is because the other disk in the cache pool never dropped offline.

 

I did some googling and read that I shouldn't be using TRIM on NVMe devices, especially if they are in a btrfs pool. That's the first I'd ever heard of that, so I will turn off TRIM on my server.

 

I haven't had time to turn my unRAID server back on and check for any other problems. I just wanted to post here and get some feedback on why this happened, how I can keep it from happening again, and what steps I should take before rebooting to make sure no data is lost.

 

Attached is my Diagnostic file and a screenshot from after getting the Diagnostic file and before shutting my server down.

 

Thank you in advance for any help given.

 

1061420260_Screenshot2022-09-03050712.thumb.jpg.d9913c6014ab6f535272d0799f14152e.jpg

threadripper19-diagnostics-20220903-0504.zip


After starting my Unraid server, I started the array in maintenance mode, then ran a file system check on both of my cache pools. Neither of them had any errors.
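(For reference, the same check run by hand might look like the below; the device path is an assumption, and btrfs check must run against an unmounted device, so the array has to be stopped or in maintenance mode.)

# Read-only btrfs check on one member of the pool; --readonly guarantees
# nothing is written to the device
btrfs check --readonly /dev/nvme0n1p1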

 

I then stopped the array and started it normally. I ran a scrub with "repair corrupted blocks" checked on my arraycache pool. The scrub repaired a lot of corruption, since that's the pool that had the NVMe drop offline.
I ran a scrub on my plexcache pool and there were no errors.
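(The command-line equivalent, assuming the pools are mounted at /mnt/arraycache and /mnt/plexcache:)

# Start a scrub; run as root on a writable mount, it repairs correctable
# errors from the redundant copy
btrfs scrub start /mnt/arraycache

# Check progress and the error totals when it finishes
btrfs scrub status /mnt/arraycache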
 

I thought everything was good until a couple of hours ago. Every hour, at exactly the 47th minute, I get an email stating:
unraid status: errors on cache pool 

Event: Unraid Status
Subject: ERRORS on cache pool
Description: No description
Importance: warning

 

It doesn't tell me which cache pool has the errors, and I'm not seeing any errors on either one.

I've run another scrub on each cache pool and there are no errors.

I looked in my logs and there's no mention of cache pool errors, other than the system sending out an email.

 

What is going on here?

 

F1489919-1854-4577-830A-34449581B3EA.thumb.jpeg.b4b89c01b38f338470ea5647c4d709bd.jpeg

threadripper19-syslog-20220904-0003.zip


I found the cause of the Errors on Cache Pool.

 

I forgot I made a user script that runs a btrfs check hourly, and once an error has been seen it keeps notifying you every time it runs, which for me is every hour.

I had to force-reset the stats on the cache pool to stop the error. I had just assumed the stats would reset after correcting the errors.
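(Those counters are btrfs's per-device error stats, which persist until explicitly zeroed; a minimal sketch, assuming the pool's mount point:)

# Show the per-device error counters a monitoring script would read
btrfs device stats /mnt/arraycache

# Reset them to zero once the underlying errors have been corrected
btrfs device stats -z /mnt/arraycache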

 

I still haven't figured out why the nvme drive dropped offline in the first place after a scheduled TRIM. 

  • Solution
2 hours ago, FQs19 said:

I still haven't figured out why the nvme drive dropped offline in the first place

 

The below sometimes helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the Main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
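(A quick way to confirm the parameter took effect after the reboot:)

# Should print 0, i.e. the deepest NVMe power-saving states are disabled
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us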

10 hours ago, JorgeB said:

 

The below sometimes helps:

 

Some NVMe devices have issues with power states on Linux. Try this: on the Main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.

 

Thanks for the reply JorgeB.

I did have that line in my original unRAID server, but I don't remember why I didn't put it in my fresh install.

I'll try to find out why I left it out, then add it back, turn TRIM back on, and see what happens.

 

Thanks again. 


So one of my nvme drives (nvme0n1) went missing again. 

This happened right after the backup/restore appdata plugin finished running a backup. The last time, it went missing after running TRIM.

 

I'm at work, unable to do anything but grab the diagnostic zip. I'm attaching it to this post. 

 

Could someone please help me figure out why I keep losing an NVMe disk?

 

Do I need to wipe that cache pool and start over? 

Remove that NVMe disk, put it in a Windows machine, test it, then erase it, put it back in unRAID, and try again? (Or test it in place; see the sketch below.)

Just return it to Corsair if it's still under warranty?

Or is there a problem with this motherboard?
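(One in-place alternative to pulling the disk, assuming smartmontools and nvme-cli are available on the console and the drive enumerates as /dev/nvme0:)

# Read the drive's own SMART/health data without moving it to another machine
smartctl -a /dev/nvme0

# Alternative via nvme-cli; media_errors and num_err_log_entries are the
# fields to watch
nvme smart-log /dev/nvme0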

 

I just can't seem to win with unraid. It's been nothing but problem after problem after problem. I can't believe that I'm having this many hardware problems. 

In one year I've replaced the case, the power supply, the motherboard, the flash drive, the memory, the sata cables, the hba cables, and even changed out the nvme drives. 

 

Beyond frustrated at this point.

threadripper19-diagnostics-20220908-0253.zip


It's the other device this time, and since they are different brands, that suggests it's not a device issue.

 

59 minutes ago, FQs19 said:

Do I need to wipe that cache pool and start over? 

That won't help; this is a hardware-related issue. Look for a BIOS update for the board. I'd also usually suggest trying a different M.2/PCIe slot if available, but since both devices dropped, that's also unlikely to help.

 

 

8 hours ago, JorgeB said:

It's the other device this time, and since they are different brands, that suggests it's not a device issue.

 

That won't help; this is a hardware-related issue. Look for a BIOS update for the board. I'd also usually suggest trying a different M.2/PCIe slot if available, but since both devices dropped, that's also unlikely to help.

 

 

Thanks again for your help JorgeB.

 

I'm going to pull both NVMe disks out, make sure all contacts are clean, verify the heatsinks are secured and the thermal pads are in the correct location, then reinstall the disks. I might swap which M.2 slot each one goes back in. Would that affect the cache pool? I'm thinking it will change the name associated with each NVMe disk, e.g. nvme0n1 would become nvme1n1.

 

Also, to get my nvme disk back online, are these the steps to take (these are the steps I did the first time):

-reboot server

-verify all nvme disks are online

-start array in maintenance mode

-run a file system check

-if no errors are found, start array in normal mode, then run a correcting scrub.

 

After those steps, I'll shut my server down and do the steps I laid out earlier. 

I'm also going to change my memory speed to its default of 2666 MHz from the 3000 MHz it is set to now.

 

33 minutes ago, FQs19 said:

Would that affect the cachepool?

Nope.
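(btrfs records pool membership by the filesystem UUID stored in each device's superblock, not by the nvmeXn1 kernel name, so a slot swap changing the names is harmless. It can be confirmed with:)

# Lists each btrfs filesystem by UUID together with its member devices
btrfs filesystem show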

 

33 minutes ago, FQs19 said:

Also, to get my nvme disk back online, are these the steps to take (these are the steps I did the first time):

-reboot server

-verify all nvme disks are online

-start array in maintenance mode

-run a file system check

-if no errors are found, start array in normal mode, then run a correcting scrub.

A file system check shouldn't be needed, but it won't hurt as long as it's run in read-only mode.

