Uncorrectable errors after scrub on NVME SSD cache drive


Hi,

 

I have run into a problem and would be very grateful for any help.

 

I was checking my syslog and saw BTRFS errors on my NVME SSD cache drive:

Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9417928704 csum 0x98f94189 expected csum 0x8d72db16 mirror 1
Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9417928704 csum 0x98f94189 expected csum 0x8d72db16 mirror 1
Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9418031104 csum 0x98f94189 expected csum 0xf896b432 mirror 1
Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9418031104 csum 0x98f94189 expected csum 0xf896b432 mirror 1
Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9418162176 csum 0x98f94189 expected csum 0x3ecb4cb8 mirror 1
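For anyone triaging similar logs, here is a minimal Python sketch that groups the "csum failed" lines by inode and collects the offsets and the computed checksums. The regular expression is written against the exact message layout shown above and may not match other kernel versions; the function name summarize_csum_failures is made up for this example.

```python
import re
from collections import defaultdict

# Pattern written against the kernel message layout quoted in this thread.
CSUM_RE = re.compile(
    r"csum failed root (?P<root>\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<got>0x[0-9a-f]+) expected csum (?P<want>0x[0-9a-f]+) "
    r"mirror (?P<mirror>\d+)"
)


def summarize_csum_failures(lines):
    """Group failing offsets and computed checksums by (root, inode)."""
    summary = defaultdict(lambda: {"offsets": set(), "computed": set()})
    for line in lines:
        m = CSUM_RE.search(line)
        if not m:
            continue  # not a "csum failed" line
        key = (int(m["root"]), int(m["ino"]))
        summary[key]["offsets"].add(int(m["off"]))
        summary[key]["computed"].add(m["got"])
    return dict(summary)


# Sample lines copied from the syslog excerpt above (one is a repeat).
sample = [
    "Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9417928704 csum 0x98f94189 expected csum 0x8d72db16 mirror 1",
    "Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9417928704 csum 0x98f94189 expected csum 0x8d72db16 mirror 1",
    "Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9418031104 csum 0x98f94189 expected csum 0xf896b432 mirror 1",
    "Dec  9 02:10:42 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 711847 off 9418162176 csum 0x98f94189 expected csum 0x3ecb4cb8 mirror 1",
]

result = summarize_csum_failures(sample)
for (root, ino), info in result.items():
    print(f"root {root} inode {ino}: {len(info['offsets'])} distinct offsets, "
          f"computed checksums seen: {sorted(info['computed'])}")
```

The inode number can then usually be mapped back to a file name with `btrfs inspect-internal inode-resolve 711847 /mnt/cache` (the mount point is the one that appears in the trim log later in this thread).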

Then I searched online and found that running a scrub on the drive could help. Here is the result:

scrub status for 5a462ed1-615f-4fdc-8c3a-0e1fd40b537c
	scrub started at Mon Dec  9 17:18:12 2019 and finished after 00:05:05
	total bytes scrubbed: 443.09GiB with 1024 errors
	error details: csum=1024
	corrected errors: 0, uncorrectable errors: 1024, unverified errors: 0
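To put the scrub summary in perspective, here is a small Python sketch that pulls the counters out of the status text and estimates how much data is affected. It assumes each checksum error covers one 4096-byte block, which is what the "length 4096" field in the kernel messages further down suggests; parse_scrub_status and BLOCK_SIZE are names invented for this example.

```python
import re

BLOCK_SIZE = 4096  # assumption: one csum error == one 4096-byte block


def parse_scrub_status(text):
    """Extract the error counters from `btrfs scrub status` style output."""
    def grab(pattern):
        return int(re.search(pattern, text).group(1))

    stats = {
        "errors": grab(r"with (\d+) errors"),
        "corrected": grab(r"corrected errors: (\d+)"),
        "uncorrectable": grab(r"uncorrectable errors: (\d+)"),
    }
    stats["bytes_affected"] = stats["errors"] * BLOCK_SIZE
    return stats


# Status text copied from the scrub result above.
status = """scrub status for 5a462ed1-615f-4fdc-8c3a-0e1fd40b537c
    scrub started at Mon Dec  9 17:18:12 2019 and finished after 00:05:05
    total bytes scrubbed: 443.09GiB with 1024 errors
    error details: csum=1024
    corrected errors: 0, uncorrectable errors: 1024, unverified errors: 0
"""

stats = parse_scrub_status(status)
print(stats)
print(f"about {stats['bytes_affected'] / 2**20:.0f} MiB of data failed verification")
```

Under that assumption, 1024 errors correspond to roughly 4 MiB of damaged data. The errors are presumably reported as uncorrectable because a single-device pool holds no second copy of the data to repair from.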

Now the syslog shows more warnings and errors:

Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573680128 on dev /dev/nvme0n1p1, physical 64441253888, root 5, inode 474170, offset 3762552832, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1025, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573680128 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551574106112 on dev /dev/nvme0n1p1, physical 64441679872, root 5, inode 474170, offset 3762978816, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1026, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551574106112 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573843968 on dev /dev/nvme0n1p1, physical 64441417728, root 5, inode 474170, offset 3762716672, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1027, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573975040 on dev /dev/nvme0n1p1, physical 64441548800, root 5, inode 474170, offset 3762847744, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573843968 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1028, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573975040 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573684224 on dev /dev/nvme0n1p1, physical 64441257984, root 5, inode 474170, offset 3762556928, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1029, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573684224 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551574110208 on dev /dev/nvme0n1p1, physical 64441683968, root 5, inode 474170, offset 3762982912, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1030, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551574110208 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551574237184 on dev /dev/nvme0n1p1, physical 64441810944, root 5, inode 474170, offset 3763109888, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573712896 on dev /dev/nvme0n1p1, physical 64441286656, root 5, inode 474170, offset 3762585600, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1032, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573712896 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551574237184 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573848064 on dev /dev/nvme0n1p1, physical 64441421824, root 5, inode 474170, offset 3762720768, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1033, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573848064 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1034, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1035, gen 0
Dec  9 17:18:57 MOZART kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 1551573688320 on dev /dev/nvme0n1p1
Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573979136 on dev /dev/nvme0n1p1, physical 64441552896, root 5, inode 474170, offset 3762851840, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)
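Since the scrub warnings include the file path, a short Python sketch like the one below can list which files are damaged and by how much, so they can be deleted or restored from another copy. The pattern is tailored to the message layout shown above and is an assumption for other kernels; damaged_bytes_per_file is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern written against the scrub warning layout quoted in this thread.
SCRUB_RE = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"physical (?P<physical>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>.*)\)$"
)


def damaged_bytes_per_file(lines):
    """Sum the reported damaged length per file path."""
    damaged = Counter()
    for line in lines:
        m = SCRUB_RE.search(line.rstrip())
        if m:
            damaged[m["path"]] += int(m["length"])
    return damaged


# Two warning lines copied from the syslog excerpt above.
sample = [
    "Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551573680128 on dev /dev/nvme0n1p1, physical 64441253888, root 5, inode 474170, offset 3762552832, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)",
    "Dec  9 17:18:57 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 1551574106112 on dev /dev/nvme0n1p1, physical 64441679872, root 5, inode 474170, offset 3762978816, length 4096, links 1 (path: downloads.cache/transmission/data_ygg/Les demoiselles de Rochefort (1967).mkv)",
]

damaged = damaged_bytes_per_file(sample)
for path, nbytes in damaged.most_common():
    print(f"{nbytes:>10} bytes damaged in {path}")
```

The greedy match on the path is deliberate: the file name itself contains parentheses, so the pattern only treats the last closing parenthesis on the line as the end of the path.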

Some more research on the internet suggests that I should reformat the cache drive entirely... and if the issue persists, replace it.

 

What are your thoughts?

 

Many thanks,

G

 

 

Link to comment
On 12/9/2019 at 6:19 PM, johnnie.black said:

It can, but it shouldn't cause data corruption: it should correct single-bit errors or halt the system if an uncorrectable error is detected. You can post the diagnostics; maybe something else is visible.

OK. Here it is. Thank you in advance.

 

Edited by Opawesome
Deleted diagnostics
Link to comment

The only thing out of the ordinary is these errors, which appear every time the device is trimmed:

 

Dec  1 01:00:33 MOZART kernel: dmar_fault: 20 callbacks suppressed
Dec  1 01:00:33 MOZART kernel: DMAR: DRHD: handling fault status reg 3
Dec  1 01:00:33 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f9590000 [fault reason 06] PTE Read access is not set
Dec  1 01:00:33 MOZART kernel: DMAR: DRHD: handling fault status reg 3
Dec  1 01:00:33 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f78ff000 [fault reason 06] PTE Read access is not set
Dec  1 01:00:40 MOZART kernel: DMAR: DRHD: handling fault status reg 3
Dec  1 01:00:40 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr ee08d000 [fault reason 06] PTE Read access is not set
Dec  1 01:00:41 MOZART kernel: DMAR: DRHD: handling fault status reg 3
Dec  1 01:00:41 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f7903000 [fault reason 06] PTE Read access is not set
Dec  1 01:00:42 MOZART root: /var/lib/docker: 8.1 GiB (8695169024 bytes) trimmed on /dev/loop2
Dec  1 01:00:42 MOZART root: /mnt/cache: 397.6 GiB (426866036736 bytes) trimmed on /dev/nvme0n1p1
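To check whether the DMAR faults really line up with the trim runs, a rough Python sketch can count the faults logged shortly before each "trimmed on" line. Syslog timestamps carry no year, so 2019 is assumed here (taken from the scrub output earlier in the thread); the 60-second window and the name faults_before_trims are arbitrary choices for this example.

```python
import re
from datetime import datetime, timedelta

ASSUMED_YEAR = 2019  # assumption: syslog lines do not include the year

FAULT_RE = re.compile(
    r"DMAR: \[DMA (?:Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)
TRIM_RE = re.compile(r"trimmed on (?P<dev>\S+)")


def stamp(line):
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp."""
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")


def faults_before_trims(lines, window=timedelta(seconds=60)):
    """For each trimmed device, count DMAR faults logged within `window` before it."""
    fault_times = [stamp(l) for l in lines if FAULT_RE.search(l)]
    report = {}
    for line in lines:
        m = TRIM_RE.search(line)
        if m:
            done = stamp(line)
            report[m["dev"]] = sum(
                1 for t in fault_times if timedelta(0) <= done - t <= window
            )
    return report


# Lines copied from the syslog excerpt above.
sample = [
    "Dec  1 01:00:33 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f9590000 [fault reason 06] PTE Read access is not set",
    "Dec  1 01:00:33 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f78ff000 [fault reason 06] PTE Read access is not set",
    "Dec  1 01:00:40 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr ee08d000 [fault reason 06] PTE Read access is not set",
    "Dec  1 01:00:41 MOZART kernel: DMAR: [DMA Read] Request device [06:00.0] fault addr f7903000 [fault reason 06] PTE Read access is not set",
    "Dec  1 01:00:42 MOZART root: /var/lib/docker: 8.1 GiB (8695169024 bytes) trimmed on /dev/loop2",
    "Dec  1 01:00:42 MOZART root: /mnt/cache: 397.6 GiB (426866036736 bytes) trimmed on /dev/nvme0n1p1",
]

report = faults_before_trims(sample)
print(report)
```

Running `lspci -s 06:00.0` on the server should confirm which device sits at the PCI address named in the faults.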

I believe there's a firmware update to fix that. I can't see it causing data corruption, but it should still be fixed to see if it makes any difference.

Link to comment

Thank you again.

 

According to https://downloadmirror.intel.com/28749/eng/Intel_SSD_Firmware_Update_Tool_3_0_7_Release_Notes-328292-030US.pdf, I have the latest firmware for my Intel 660p NVME drive (002C). Or maybe I did not understand your comment properly?

root@MOZART:~# smartctl -a /dev/nvme0
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.56-Unraid] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW010T8
Serial Number:                      BTNH925511J71P0B
Firmware Version:                   002C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Dec  9 19:37:39 2019 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.00W       -        -    0  0  0  0        0       0
 1 +     3.00W       -        -    1  1  1  1        0       0
 2 +     2.20W       -        -    2  2  2  2        0       0
 3 -   0.0300W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        25 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    28,535,988 [14.6 TB]
Data Units Written:                 18,216,684 [9.32 TB]
Host Read Commands:                 129,177,201
Host Write Commands:                85,548,913
Controller Busy Time:               2,026
Power Cycles:                       13
Power On Hours:                     1,731
Unsafe Shutdowns:                   1
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
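As a sanity check on the SMART numbers above: the NVMe health log counts "Data Units" in thousands of 512-byte units, so the raw counters can be converted to bytes and compared with the drive capacity. The sketch below does that arithmetic in Python with the values from this smartctl output; the helper name data_units_to_bytes is made up for the example.

```python
# One NVMe "data unit" is 1000 units of 512 bytes.
DATA_UNIT_BYTES = 512 * 1000


def data_units_to_bytes(units):
    """Convert an NVMe SMART 'Data Units' counter to bytes."""
    return units * DATA_UNIT_BYTES


# Values copied from the smartctl output above.
read_bytes = data_units_to_bytes(28_535_988)
written_bytes = data_units_to_bytes(18_216_684)
capacity_bytes = 1_024_209_543_168

drive_writes = written_bytes / capacity_bytes

print(f"read:    {read_bytes / 1e12:.2f} TB")
print(f"written: {written_bytes / 1e12:.2f} TB")
print(f"full drive writes so far: {drive_writes:.1f}")
```

That works out to roughly nine full drive writes, which fits with the 0% "Percentage Used" figure and does not by itself point to a worn-out drive.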

 

Link to comment
On 12/9/2019 at 8:11 PM, johnnie.black said:

In that case I would try disabling trim for a week or so, after fixing the filesystem, and see if it makes any difference.

OK. I moved the data off the cache drive, formatted it (keeping BTRFS as the file system), moved some non-sensitive data back onto it and disabled the scheduled TRIM of the cache drive.

 

I will keep the forum updated depending on the results.

 

Thanks again.

G

Link to comment
Hi again,

 

Unfortunately, I found these errors in my logs today:

Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73572352 csum 0x98f94189 expected csum 0xc8b9af1f mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73605120 csum 0x98f94189 expected csum 0xe0f25571 mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73572352 csum 0x98f94189 expected csum 0xc8b9af1f mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73588736 csum 0x98f94189 expected csum 0xf84934bf mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73621504 csum 0x98f94189 expected csum 0x56366ae7 mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73670656 csum 0x98f94189 expected csum 0xbdc0d0a7 mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73637888 csum 0x98f94189 expected csum 0xc43cef4b mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73703424 csum 0x98f94189 expected csum 0xa80c45c1 mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73719808 csum 0x98f94189 expected csum 0xdf3cd6df mirror 1
Dec 11 14:46:20 MOZART kernel: BTRFS warning (device nvme0n1p1): csum failed root 5 ino 1363 off 73654272 csum 0x98f94189 expected csum 0xe84e0f9d mirror 1
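One detail stands out in these lines, as in the ones from before the reformat: the computed checksum is always 0x98f94189, while the expected value changes every time. A constant computed checksum suggests that the failing blocks all read back with identical content. One hypothesis worth testing (an assumption on my part, nothing in this thread establishes it) is that they read back as all zeros, as discarded blocks might. The Python sketch below computes a standard CRC-32C (Castagnoli) over a 4096-byte zero block and compares it with the logged value. It assumes the filesystem uses the default crc32c checksum over 4096-byte blocks and that the kernel prints the value in this form.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the low bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


LOGGED_CSUM = 0x98F94189  # computed checksum repeated in the log lines above
BLOCK_SIZE = 4096         # assumption: checksum covers one 4096-byte block

zero_block_csum = crc32c(bytes(BLOCK_SIZE))
matches = zero_block_csum == LOGGED_CSUM

print(f"CRC-32C of {BLOCK_SIZE} zero bytes: {zero_block_csum:#010x}")
print("matches the logged checksum" if matches else "does not match the logged checksum")
```

If the two values match, the failing reads returned zero-filled blocks rather than random garbage, which would make the trim/discard path a more interesting suspect than worn flash. If they do not match, this hypothesis can simply be dropped.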

 

After a scrub (with "correct errors" not ticked), I get the following result:

scrub status for 7db22b62-0ba2-4436-baa2-c4553155fd03
	scrub started at Thu Dec 12 08:30:13 2019 and finished after 00:03:13
	total bytes scrubbed: 231.98GiB with 113664 errors
	error details: csum=113664
	corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

...and now the syslog also shows this kind of error:

Dec 12 08:30:38 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Dec 12 08:30:38 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Dec 12 08:30:38 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Dec 12 08:30:38 MOZART kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Dec 12 08:30:38 MOZART kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 71940702208 on dev /dev/nvme0n1p1, physical 64424509440, root 5, inode 1362, offset 246415360, length 4096, links 1 (path: downloads.cache2/[PATH TO SOME FILES])

 

Seems like formatting did not solve the problem.

 

Does it mean that my NVME SSD is dead, or even that my motherboard needs an RMA?

 

With thanks,

G

Edited by Opawesome
Added scrub result
Link to comment
On 12/12/2019 at 9:14 AM, johnnie.black said:

Memtest shouldn't detect any errors with ECC, since they would be corrected, unless something isn't working properly or ECC is disabled in the BIOS.

No errors after 10+ hours of memtest86. I will let it run for another 10 hours and, if still no errors show up, I guess I will try a new NVME SSD (hoping really hard that this is not the motherboard).

memtest.PNG

Link to comment

Hi all,

 

I have received and installed a new NVME SSD (WD blue 1TB).

 

Now it seems the NVME SSD does not even show up in Unraid (not even in the BIOS, I believe).

 

I am posting the diagnostics files in case someone has any idea.

 

I am feeling so frustrated 😞

 

Thank you in advance (I greatly appreciate it).

 

G

 

Edited by Opawesome
deleted attachment
Link to comment

Assuming you're using the board's NVMe slot: I think I remember reading somewhere that it doesn't work with x2 NVMe devices, only x4, and the WD Blue NVMe is x2. Also note that WD Blue SATA M.2 devices definitely won't work on that board.

Edit: the slot is x2, so I see no reason for it not to work with x2 devices. I think it's the Supermicro dual M.2 adapter that only works with x4 devices.

 

Link to comment
Hi @johnnie.black,

 

Thank you very much for your answer.

 

I feel stupid but indeed, my new "NVME SSD WD Blue 1TB" was in fact a "SATA SSD WD Blue 1TB"... I did type "NVME 1TB" in the Amazon search bar but I was obviously not careful enough. Lesson learnt.

 

Now buying a new "actual NVME" SSD to see if the issue can be fixed that way.

G

Link to comment

Update:

I reinstalled the supposedly defective NVME SSD but changed the "NVME Firmware Source" setting in the BIOS from "Vendor firmware" to "AMI Native Support". Then I completely filled and emptied the drive (twice) and could not reproduce the errors. Performing a scrub did not return any errors either.

 

The errors may have been caused by this BIOS setting or a bad connection on the M.2 slot after all.

 

Filling up the drive for the 3rd time now. We'll see...

Link to comment
On 12/20/2019 at 8:49 AM, johnnie.black said:

Please report back your findings, curious on what is the issue here.

Update:

So: (i) I switched the "NVME Firmware Source" setting in the BIOS back from "AMI Native Support" to "Vendor firmware" (thus returning the system to its original state); and (ii) I filled the supposedly defective NVME SSD with data and performed a scrub (several times). I still could not detect or reproduce any errors on the cache drive.
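For anyone who wants to repeat this kind of fill-and-check test in a more controlled way, here is a minimal Python sketch that writes files of random data, records their SHA-256 digests, and re-reads them to compare. The function name write_and_verify, the file names and the sizes are all invented for the example; on a real run the directory would be a folder on the cache drive and the sizes far larger.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def write_and_verify(directory, count=4, size=1 << 20):
    """Write `count` random files of `size` bytes, re-read them, and
    return the names of files whose content no longer matches."""
    expected = {}
    for i in range(count):
        data = os.urandom(size)
        path = Path(directory) / f"fill_{i:04d}.bin"
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
        expected[path] = hashlib.sha256(data).hexdigest()

    bad = []
    for path, digest in expected.items():
        try:
            actual = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            # A filesystem checksum failure typically surfaces as an I/O error.
            bad.append(path.name)
            continue
        if actual != digest:
            bad.append(path.name)
    return bad


# Demo in a throwaway temporary directory.
with tempfile.TemporaryDirectory() as demo_dir:
    bad = write_and_verify(demo_dir)

print("mismatched files:", bad)
```

One caveat: an immediate re-read may be served from the page cache rather than the SSD, so this only complements a scrub, it does not replace one. Re-reading after a reboot (or after the scheduled trim has run) is a stricter test.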

 

This is good news but, in a way, I was never able to diagnose my issue, and now I have this sword of Damocles hanging over my head.

 

Conclusion:

 

It may just have been that the connection had become loose or something, and installing a new drive and then putting the old one back in place may have actually solved the problem.

 

Also, I did notice a drop of viscous liquid on the back of the supposedly defective NVME SSD, which seemed to come from the thermal pad of the SSD's heatsink. I cleaned off this liquid before reinstalling the NVME SSD, so it may have been the cause as well.

Link to comment
