Azxiana Posted May 7, 2022

This is the second time in a week that this has happened with this drive, a Crucial P2 500GB. I thought it was an issue with the Hyper M.2 PCI-Express card, which had previously been suspect, so I ended up rebuilding the server with a fresh motherboard so the M.2 drives could sit on the motherboard itself. It has now failed like this on both the old hardware configuration and a fresh one. Is it time to replace this drive, or could there be something else that I am missing?

May 7 04:44:46 Emilia kernel: nvme nvme1: I/O 24 QID 5 timeout, aborting
May 7 04:45:17 Emilia kernel: nvme nvme1: I/O 24 QID 5 timeout, reset controller
May 7 04:45:47 Emilia kernel: nvme nvme1: I/O 10 QID 0 timeout, reset controller
May 7 04:48:29 Emilia kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 7 04:48:29 Emilia kernel: nvme nvme1: Abort status: 0x371
May 7 04:50:37 Emilia kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 7 04:50:37 Emilia kernel: nvme nvme1: Removing after probe failure status: -19
May 7 04:52:45 Emilia kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 7 04:52:45 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
May 7 04:52:45 Emilia kernel: blk_update_request: I/O error, dev nvme1n1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
May 7 04:52:45 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
May 7 04:52:45 Emilia kernel: BTRFS warning (device nvme1n1p1): chunk 393078636544 missing 1 devices, max tolerance is 0 for writable mount
May 7 04:52:45 Emilia kernel: BTRFS: error (device nvme1n1p1) in write_all_supers:3845: errno=-5 IO failure (errors while submitting device barriers.)
May 7 04:52:45 Emilia kernel: BTRFS info (device nvme1n1p1): forced readonly
May 7 04:52:45 Emilia kernel: BTRFS warning (device nvme1n1p1): Skipping commit of aborted transaction.
May 7 04:52:45 Emilia kernel: BTRFS: error (device nvme1n1p1) in cleanup_transaction:1942: errno=-5 IO failure
May 7 04:52:45 Emilia kernel: BTRFS warning (device nvme1n1p1): Skipping commit of aborted transaction.
May 7 04:52:45 Emilia kernel: BTRFS: error (device nvme1n1p1) in cleanup_transaction:1942: errno=-5 IO failure
May 7 04:52:45 Emilia kernel: nvme nvme1: failed to set APST feature (-19)
May 7 04:54:01 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 1, flush 1, corrupt 0, gen 0
May 7 04:54:01 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 2, flush 1, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 125920 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 650208 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 125664 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 125824 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 125984 op 0x1:(WRITE) flags 0x1800 phys_seg 48 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 649952 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 650112 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 650272 op 0x1:(WRITE) flags 0x1800 phys_seg 48 prio class 0
May 7 04:54:06 Emilia kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2377: errno=-5 IO failure (Error while writing out transaction)
May 7 04:54:06 Emilia kernel: BTRFS info (device loop2): forced readonly
May 7 04:54:06 Emilia kernel: BTRFS warning (device loop2): Skipping commit of aborted transaction.
May 7 04:54:06 Emilia kernel: BTRFS: error (device loop2) in cleanup_transaction:1942: errno=-5 IO failure
May 7 04:54:06 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 27288 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
May 7 04:54:07 Emilia kernel: btrfs_dev_stat_print_on_error: 11 callbacks suppressed
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 3, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de6a78 len 4096 err no 10
May 7 04:54:07 Emilia kernel: blk_update_request: I/O error, dev loop2, sector 1297496 op 0x0:(READ) flags 0x80700 phys_seg 3 prio class 0
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 4, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 5, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7228 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 6, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7230 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7238 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 7, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 8, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 9, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7300 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7308 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7310 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 10, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7310 len 4096 err no 10
May 7 04:54:07 Emilia kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 19, rd 1, flush 0, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme1n1p1 errs: wr 1, rd 11, flush 1, corrupt 0, gen 0
May 7 04:54:07 Emilia kernel: BTRFS warning (device nvme1n1p1): direct IO failed ino 263 rw 0,0 sector 0x3de7310 len 4096 err no 10
JorgeB Posted May 7, 2022

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on the flash device, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference.
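After rebooting, it's worth confirming the parameter actually reached the kernel. A minimal sketch — the boot line below is only a sample string; on the live server you would grep /proc/cmdline directly:

```shell
# Extract the NVMe power-state parameter from a boot command line.
# Sample string shown here; on the server itself use:
#   grep -o 'nvme_core\.default_ps_max_latency_us=[0-9]*' /proc/cmdline
cmdline='append initrd=/bzroot nvme_core.default_ps_max_latency_us=0'
echo "$cmdline" | grep -o 'nvme_core\.default_ps_max_latency_us=[0-9]*'
```

If nvme-cli happens to be installed, `nvme get-feature /dev/nvme1 -f 0x0c -H` can additionally show whether the APST feature is disabled (device name taken from the log above).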
Azxiana Posted May 7, 2022 (Author)

1 minute ago, JorgeB said:
Some NVMe devices have issues with power states on Linux. Try adding nvme_core.default_ps_max_latency_us=0 to your default boot option, after "append initrd=/bzroot". Reboot and see if it makes a difference.

I will give it a try, thanks! This drive and its companion have been in service in this server for almost two years without issue until this past week. I actually have to power the entire server off to get the drive to come back.
Azxiana Posted May 8, 2022 (Author)

I just got back from Best Buy with two replacement SSDs. It happened again. ¯\_(ツ)_/¯
Froberg Posted June 8, 2022

I'm having it too, with standard SSDs. Did you ever find a cause?

Jun 8 03:18:34 FortyTwo kernel: BTRFS error (device sdb1): block=455832535040 write time tree block corruption detected
Jun 8 03:18:34 FortyTwo kernel: BTRFS: error (device sdb1) in btrfs_commit_transaction:2438: errno=-5 IO failure (Error while writing out transaction)
Jun 8 03:18:34 FortyTwo kernel: BTRFS info (device sdb1): forced readonly
Jun 8 03:18:34 FortyTwo kernel: BTRFS warning (device sdb1): Skipping commit of aborted transaction.
Jun 8 03:18:34 FortyTwo kernel: BTRFS: error (device sdb1) in cleanup_transaction:2011: errno=-5 IO failure
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 29152 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 458944 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 983232 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 459392 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 983680 op 0x1:(WRITE) flags 0x1800 phys_seg 2 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 460448 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 984736 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS: error (device loop2) in free_log_tree:3451: errno=-5 IO failure
Jun 8 03:18:36 FortyTwo kernel: BTRFS info (device loop2): forced readonly
Jun 8 03:18:36 FortyTwo kernel: BTRFS warning (device loop2): Skipping commit of aborted transaction.
Jun 8 03:18:36 FortyTwo kernel: BTRFS: error (device loop2) in cleanup_transaction:2011: errno=-5 IO failure
Jun 8 03:18:36 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 32928 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 0
Jun 8 03:18:36 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 0, gen 0
Jun 8 03:28:38 FortyTwo root: Restoring original turbo write mode
Jun 8 03:28:38 FortyTwo kernel: mdcmd (129): set md_write_method auto
Jun 8 03:44:08 FortyTwo kernel: blk_update_request: I/O error, dev loop2, sector 29152 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 0
Jun 8 03:44:08 FortyTwo kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 11, rd 0, flush 0, corrupt 0, gen 0

BTRFS has been the single most unstable part of my Unraid experience thus far.
JorgeB Posted June 8, 2022

3 hours ago, Froberg said:
I'm having it too, with standard SSDs. Did you ever find a cause?

Please post the complete diagnostics.
Froberg Posted June 9, 2022

On 6/8/2022 at 12:37 PM, JorgeB said:
Please post the complete diagnostics.

It seems like it had another crash during the night.

fortytwo-diagnostics-20220609-1637.zip
Froberg Posted June 9, 2022 (edited)

Yeah, the BTRFS pool is completely unmountable now. I've never had it this bad before, and I even set up a script to regularly monitor for corruption. The damn thing just keeps screwing up. It now says the filesystem is unmountable, like it's all gone. I'm thinking I'll just switch to XFS from now on; BTRFS hasn't been stable for me despite changing SATA connections, cables, and power delivery, and even switching to new SSDs entirely. Annoying. Please advise; otherwise I think I'll just have to recover appdata from backup.

Edited June 9, 2022 by Froberg
JorgeB Posted June 9, 2022

The errors you're having suggest a possible RAM issue. Start by running memtest; after that, there are some recovery options here.
Froberg Posted June 9, 2022

50 minutes ago, JorgeB said:
The errors you're having suggest a possible RAM issue. Start by running memtest; after that, there are some recovery options here.

What makes you suspect a memory issue? It's ECC memory and it's been put through memtest before. I'll usually get 8-12 months out of the BTRFS cache before it implodes.
JorgeB Posted June 9, 2022

On 6/8/2022 at 8:19 AM, Froberg said:
write time tree block corruption detected

This means btrfs detected the corruption before writing the data to the disk; it usually indicates RAM or other kernel memory corruption.
Froberg Posted June 9, 2022 (edited)

2 hours ago, JorgeB said:
This means btrfs detected the corruption before writing the data to the disk; it usually indicates RAM or other kernel memory corruption.

Seems fine so far. I think it's BTRFS itself that self-corrupts. I've tried all the variables by now, including getting the motherboard replaced at one point. Eh, I think I'll just go for XFS and get started on recovery. I haven't tried recovering using backup/restore before, so it'll be a nice test at least.

edit: pass complete, no errors.

Edited June 9, 2022 by Froberg
ChatNoir Posted June 9, 2022

You might want to try a more recent memtest: https://www.memtest86.com/download.htm

The built-in one is older and does not work well with ECC memory.
Froberg Posted June 9, 2022

Surely other issues would have cropped up during six years of use, other than this BTRFS issue? Surely?
ChatNoir Posted June 9, 2022

8 minutes ago, Froberg said:
Surely other issues would have cropped up during six years of use, other than this BTRFS issue? Surely?

Things work until they don't anymore. Maybe it's not that, but I wouldn't want to run any computer on faulty RAM.
Froberg Posted June 9, 2022

Just now, ChatNoir said:
Things work until they don't anymore. Maybe it's not that, but I wouldn't want to run any computer on faulty RAM.

Yes, obviously. I probably just shouldn't have upgraded the OS; it was running fine until then. I'll try the other memtest once I'm done recovering. Plex takes literal ages.
JorgeB Posted June 10, 2022

13 hours ago, Froberg said:
I think it's BTRFS itself that self-corrupts.

Btrfs is very susceptible to RAM or any other hardware corruption issue, much more so than other filesystems, so if there's a problem it's where you'll see it first. But there are many users, not just on Unraid, running very large btrfs filesystems for years without issues. I myself have roughly 200 btrfs filesystems in use for about 5 or 6 years and have only had issues with one: it got trashed twice in a couple of months, and I traced it to a bad disk.
Froberg Posted June 11, 2022

23 hours ago, JorgeB said:
I myself have roughly 200 btrfs filesystems in use for about 5 or 6 years and have only had issues with one: it got trashed twice in a couple of months, and I traced it to a bad disk.

A bad disk shouldn't really cause the loss of an entire RAID setup though, ideally, surely?
JorgeB Posted June 11, 2022

39 minutes ago, Froberg said:
A bad disk shouldn't really cause the loss of an entire RAID setup though, ideally, surely?

It was a single-disk filesystem, an array disk.
Froberg Posted October 2, 2022

On 6/11/2022 at 10:09 AM, JorgeB said:
It was a single-disk filesystem, an array disk.

So thread necrophilia is a thing. I just switched to a new system, and I am putting the old one through its paces before deciding whether to use it as an upgrade for my backup server, which only runs intermittently. One thing I did notice recently was the log drive filling up quite rapidly, but I couldn't see any immediate issues. I switched to running a single-disk cache, and I'm now back to running BTRFS RAID1 in the new system. Here's the new system:

Uptime is close to five days. The old one would usually rise to 30% log usage within a day. With the BTRFS issue I usually found out when the log was full and related issues started to occur; rebooting fixed the log issue, and then I'd be able to tell that BTRFS was FUBAR'ed. It's been happening with varying frequency.

I did discover just now that since I'm using Dynamix I was supposed to increase the size of the log, so I changed it to 512 MB with mount -o remount,size=512m /var/log - maybe that will help with the issue I was having.

Can BTRFS corrupt if the system runs out of memory somehow? Memtest is currently running on my test bench and is showing no issues, running the latest version of memtest. I'm going to let it complete regardless. Any specific memtest config or anything else you want me to try to be sure? I don't want to rely on this board and memory for my backup box if they're the cause of my issues after all.
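Since a filling /var/log was the first visible symptom here, a small cron-able check can give a warning before the tmpfs fills completely. A sketch — the 90% threshold and the messages are arbitrary choices, not anything Unraid ships:

```shell
#!/bin/sh
# Report /var/log usage and warn past a threshold (90% is an arbitrary pick).
threshold=90
usage=$(df --output=pcent /var/log | tail -n 1 | tr -dc '0-9')
if [ "$usage" -gt "$threshold" ]; then
    echo "WARNING: /var/log is ${usage}% full"
else
    echo "OK: /var/log is ${usage}% full"
fi
```

Run from cron (or the User Scripts plugin) this gives a heads-up before logging stops, instead of finding out after the fact.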
JorgeB Posted October 2, 2022

A 24-hour memtest, while not definitive, will catch most issues.
Froberg Posted October 2, 2022

1 hour ago, JorgeB said:
A 24-hour memtest, while not definitive, will catch most issues.

Four hours to complete the test; all passed. Memtest free won't let you run more than four passes, and I'm not sure how I'd accomplish a 24-hour test from looking at the settings. I'm running another test now, though.
Froberg Posted October 3, 2022

22 hours ago, JorgeB said:
A 24-hour memtest, while not definitive, will catch most issues.

I've re-run the tests five times now, still going strong with zero errors. Is there anything more I can do to rule out a hardware fault?
JorgeB Posted October 3, 2022

Use the server normally and monitor the pool for errors.

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=700582
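"Monitor the pool for errors" can be scripted around `btrfs device stats`, which exposes the same wr/rd/flush/corrupt/gen counters seen in the syslog excerpts above. A sketch that flags any nonzero counter — the here-string is sample output, and /mnt/cache is an assumed mount point; in real use substitute `stats=$(btrfs device stats /mnt/cache)`:

```shell
#!/bin/sh
# Flag nonzero btrfs device-stat counters; exit nonzero if any are found.
# Sample stats shown below; for live use replace with:
#   stats=$(btrfs device stats /mnt/cache)
stats='[/dev/nvme1n1p1].write_io_errs   1
[/dev/nvme1n1p1].read_io_errs    11
[/dev/nvme1n1p1].flush_io_errs   1
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0'
echo "$stats" | awk '$2 > 0 { bad++; print "nonzero counter:", $1, $2 }
                     END { exit bad > 0 }'
```

The nonzero exit status makes it easy to hook into a scheduled job that only sends a notification when a counter has actually moved.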