NVMe cache went bad in SMART test but I cannot see why?


Diggewuff


Hey guys,

My NVMe cache drive is reported as failed by SMART, but I cannot see why. Physically it is working absolutely fine.

It is not very old and not that many TB have been written to it.

Here are some screenshots and the smartctl output. Maybe someone can help me interpret what the issue is and whether I should be worried that the cache drive will fail physically soon.

[Screenshots: "SMART Status Failed"; "102 @ Low usage"]

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 500GB
Serial Number:                      S3EU*************
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            244,207,935,488 [244 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5371b0e432
Local Time is:                      Mon Jul 29 00:23:32 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    102%
Data Units Read:                    168,638,229 [86.3 TB]
Data Units Written:                 631,781,984 [323 TB]
Host Read Commands:                 2,498,997,755
Host Write Commands:                3,364,105,515
Controller Busy Time:               27,544
Power Cycles:                       211
Power On Hours:                     12,433
Unsafe Shutdowns:                   123
Media and Data Integrity Errors:    0
Error Information Log Entries:      135
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        135     0  0x0004  0x4202  0x028            0     0     -
  1        134     0  0x0004  0x4202  0x028            0     0     -
  2        133     0  0x0004  0x4202  0x028            0     0     -
  3        132     0  0x0004  0x4202  0x028            0     0     -
  4        131     0  0x0004  0x4202  0x028            0     0     -
  5        130     0  0x0004  0x4202  0x028            0     0     -
  6        129     0  0x0004  0x4202  0x028            0     0     -
  7        128     0  0x0004  0x4202  0x028            0     0     -
  8        127     0  0x0004  0x4202  0x028            0     0     -
  9        126     0  0x0004  0x4202  0x028            0     0     -
 10        125     0  0x0004  0x4202  0x028            0     0     -
 11        124     0  0x0004  0x4202  0x028            0     0     -
 12        123     0  0x0004  0x4202  0x028            0     0     -
 13        122     0  0x0004  0x4202  0x028            0     0     -
 14        121     0  0x0004  0x4202  0x028            0     0     -
 15        120     0  0x0004  0x4202  0x028            0     0     -
... (48 entries not shown)

Appreciate any help. The SSD is still in warranty but Samsung wants me to send it in for inspection first. No Advance RMA.
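For anyone reading the output above: the `Critical Warning: 0x04` field is a bit mask, and the line "NVM subsystem reliability has been degraded" is smartctl's text for bit 2. Below is a minimal sketch of decoding that byte, using the bit meanings from the NVMe specification's SMART/Health log as I understand them. The function name is made up for illustration, and the output does not show which internal condition made the firmware set the bit.

```python
# Meanings of the low bits of the NVMe "Critical Warning" byte
# (SMART/Health Information, log page 0x02).
CRITICAL_WARNING_BITS = {
    0: "available spare has fallen below the threshold",
    1: "temperature is outside the configured thresholds",
    2: "NVM subsystem reliability has been degraded",
    3: "media has been placed in read-only mode",
    4: "volatile memory backup device has failed",
}


def decode_critical_warning(value):
    """Return the description of every warning bit set in `value`."""
    return [
        text
        for bit, text in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


# Value reported by the 960 EVO in this thread.
for line in decode_critical_warning(0x04):
    print(line)
```

Only bit 2 is set here; the spare, temperature and read-only bits are all clear, which matches the `Available Spare: 100%` and zero media errors in the same log.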
 

 

20 minutes ago, johnnie.black said:

It's past the predicted life of the flash, doesn't mean it's failing or about to fail.

 

 

I also noticed that, but on what measure is this life prediction based? The drive was sold with an MTBF of 1.5 million hours and 400 TBW. Neither of them has been reached or exceeded.
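As a rough cross-check of the numbers in the smartctl output, the sketch below converts the `Data Units Written` counter into terabytes and compares it with a TBW rating. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes; the helper name is invented for this example. Note that `Percentage Used` is a vendor-specific estimate, so it does not have to track host writes divided by the rated TBW.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units):
    """Convert an NVMe data-unit counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


# Counters from the 960 EVO output in the first post.
written_tb = data_units_to_tb(631_781_984)
read_tb = data_units_to_tb(168_638_229)

print(f"written: {written_tb:.1f} TB, read: {read_tb:.1f} TB")
# Share of the 400 TBW figure mentioned above.
print(f"share of a 400 TBW rating: {written_tb / 400:.0%}")
```

The result agrees with the `[323 TB]` that smartctl prints, so the drive is below 400 TBW in host writes even though the firmware reports 102% used.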

6 minutes ago, Diggewuff said:

I'm not sure about that; this website says 400 TBW for all 3 sizes. But I have also read 200 in some places.

https://www.samsung.com/de/memory-storage/960-evo-nvme-m-2-ssd/MZ-V6E500BW/

The English datasheet says 200 TBW: https://www.samsung.com/semiconductor/global.semi.static/Samsung_SSD_960_EVO_Data_Sheet_Rev_1_2.pdf

You can always try to claim warranty based on the German site...

 


 

I found myself some cheap enterprise NVMe drives on eBay. I highly recommend them: power-loss protection, very consistent write speed and, above all else, 1366 TBW.

 

 

Edited by LammeN3rd
5 minutes ago, Diggewuff said:

You would say the failing SMART test is just because of the prediction and nothing else?

Yes.

 

5 minutes ago, Diggewuff said:

And the prediction is nothing I have to be worried about?

Not for now. SSDs are known to last way longer than their predicted life, but of course it can start failing at any time, so keep an eye on it.

2 minutes ago, johnnie.black said:

Yes.

 

Not for now. SSDs are known to last way longer than their predicted life, but of course it can start failing at any time, so keep an eye on it.

As long as you have anything but Intel SSDs. Those tend to go read-only when they reach their predicted life, all to protect the customer (or rather Intel's sales)!

Edited by LammeN3rd

The RMA Procedure of Samsung is pretty stressful. 🤯

Quote

As a first step towards initiating the RMA procedure for your drive, we will need the following information:

  • A detailed explanation of the problem you are facing with your drive. 
  • Please provide us with screenshots of error messages or error codes, or a small video clip showing the problem that you are having with your drive, if possible.
  • Model of your server.
  • What kind of use has this SSD in your server?
  • The SMART test result obtained by Samsung Magician.
      • Samsung Magician can be downloaded via the following link: https://www.samsung.com/semiconductor/minisite/ssd/download/tools/
      • Please note that the software is only compatible with Windows and the SSD must be connected directly to the motherboard.
      • To obtain the SMART result, simply click on the 'SMART' button on the Samsung Magician home page.
  • A screenshot of the Samsung Magician home page
  • A copy of your proof of purchase
  • What is your country of residence?
  • Photos of the front and backside of the entire SSD clearly indicating the serial number on the label on the drive.
  • Are you an end-user or reseller of this drive?

And no advance RMA, so quite a bit of downtime too.

Edited by Diggewuff

The TBW value also looks pretty high to me: so much written and only about 1/4 of that read. My whole array holds only about 1/13 of that amount of data. Does that sound plausible to you?

Additional info: I'm not moving huge amounts of data regularly. But I have quite a few Docker containers (35 of them) running, which log, write to databases and are regularly updated.
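To put the written volume into perspective, here is a small sketch that turns the counters from the first post into an average write rate. It assumes the `Power On Hours` value is accurate and that one NVMe data unit equals 512,000 bytes; the function name is only illustrative.

```python
def average_write_rate(data_units_written, power_on_hours):
    """Return (GB per hour, GB per day) averaged over the powered-on time."""
    total_bytes = data_units_written * 512_000  # 1000 blocks of 512 bytes
    gb_per_hour = total_bytes / power_on_hours / 1e9
    return gb_per_hour, gb_per_hour * 24


# Data Units Written and Power On Hours of the 960 EVO.
per_hour, per_day = average_write_rate(631_781_984, 12_433)
print(f"{per_hour:.0f} GB/hour, {per_day:.0f} GB/day on average")
```

That works out to roughly 26 GB per hour, around the clock, which is far more than occasional file transfers would produce and points towards something writing continuously.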

Edited by Diggewuff
2 minutes ago, Diggewuff said:

Does that sound plausible to you?

Is the SSD used for docker/VMs? If yes, they are known to write a lot to the SSD, much more than expected, but I never found a definite answer on the cause. You can see my device has similar stats. I'm also pretty sure the power-on hours are wrong, since I've been using it for more than 2 years.

 

=== START OF INFORMATION SECTION ===
Model Number:                       TOSHIBA-RD400
Serial Number:                      664S107XTPGV
Firmware Version:                   57CZ4102
PCI Vendor/Subsystem ID:            0x1b85
IEEE OUI Identifier:                0xe83a97
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            e83a97 02000018f5
Local Time is:                      Mon Jul 29 17:40:00 2019 BST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
Warning  Comp. Temp. Threshold:     78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.1600W       -        -    3  3  3  3     1000    1000
 4 -   0.0120W       -        -    4  4  4  4     5000   35000
 5 -   0.0060W       -        -    5  5  5  5   100000  110000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    54%
Data Units Read:                    131,651,195 [67.4 TB]
Data Units Written:                 537,286,812 [275 TB]
Host Read Commands:                 4,801,983,587
Host Write Commands:                6,914,780,808
Controller Busy Time:               22,109
Power Cycles:                       158
Power On Hours:                     5,551
Unsafe Shutdowns:                   38
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

 

3 minutes ago, johnnie.black said:

Is the SSD used for docker

yes 

 

11 minutes ago, Diggewuff said:

I'm not moving huge amounts of data regularly. But I have quite a few Docker containers (35 of them) running, which log, write to databases and are regularly updated.

I specifically decided on an NVMe cache because of the containers.

Given this experience, maybe that was a pretty costly decision. 😖

 

I think I'll use the drive until it fails. But from now on I won't get any warning before it fails, because SMART has already failed.
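Even with the overall status stuck at FAILED, the individual fields can still be watched. The sketch below checks a health log for anything other than the already-set 0x04 bit. It is only an illustration: the dictionary keys follow what I believe smartctl 7's `--json` output uses under `nvme_smart_health_information_log`, so verify them against your smartctl version, and the function name is hypothetical.

```python
def remaining_warnings(log, ignore_bits=0x04):
    """List health problems other than the warning bits in `ignore_bits`.

    `log` is a dict shaped like smartctl's JSON
    "nvme_smart_health_information_log" section (key names assumed).
    """
    problems = []
    other_bits = log["critical_warning"] & ~ignore_bits
    if other_bits:
        problems.append(f"new critical warning bits: {other_bits:#04x}")
    if log["available_spare"] <= log["available_spare_threshold"]:
        problems.append("available spare is at or below its threshold")
    if log["media_errors"] > 0:
        problems.append(f"{log['media_errors']} media/data integrity errors")
    return problems


# Values taken from the 960 EVO output in the first post.
current = {
    "critical_warning": 0x04,
    "available_spare": 100,
    "available_spare_threshold": 10,
    "media_errors": 0,
}
print(remaining_warnings(current) or "nothing new to report")
```

On a live system the dictionary could come from parsing the output of `smartctl -j -a` for the device (for example `/dev/nvme0`, a placeholder path) with `json.loads`.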

Would adding a second cache drive in RAID1 be a good idea?

My mainboard (Supermicro X11SSi-LN4F) only has one NVMe slot. Will RAID1 work with a PCIe NVMe adapter card?

30 minutes ago, Diggewuff said:

Is it possible to combine a 500 GB and a 1000 GB NVMe in RAID1 to expand it later?

Sure, but it will only give a total usable capacity of 500GB, and it will probably report more free space than it actually has. If you want to do this, make sure you keep up with the latest of @johnnie.black's recommendations on cache pool management.
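The 500GB figure follows from how a two-copy mirror places data. Here is a small sketch, assuming the pool keeps exactly two copies of every block on two different devices (as the btrfs raid1 profile does); the function name is made up for illustration.

```python
def raid1_usable_gb(device_sizes_gb):
    """Usable capacity of a pool that stores two copies of each block
    on two different devices."""
    total = sum(device_sizes_gb)
    largest = max(device_sizes_gb)
    # Every block needs a partner on another device, so the largest device
    # can never hold more than the rest of the pool combined.
    return min(total / 2, total - largest)


print(raid1_usable_gb([500, 1000]))       # the pairing asked about
print(raid1_usable_gb([500, 500, 1000]))  # adding a third device later
```

With only two devices the smaller one is the limit, so the extra 500GB on the larger drive stays unused until another device is added.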

12 hours ago, Diggewuff said:

Can anyone give me a hint for a good M.2 SSD for write intensive workloads?

I think you are approaching the issue from the wrong angle. If you look at the "write-intensive" enterprise SSDs on the market, they all do the same thing: they have hard over-provisioning. That's why a "read-intensive" model will have 4TB but a "write-intensive" one will have 3.84TB or something like that.

 

So a better approach would be to rethink your config. 512GB is not a lot of space, and once you add static data (e.g. your VM vdisk, docker img, docker appdata etc.), you don't have much left to over-provision for write activity.

For example, if you get a new 512GB SSD, you could mount your current 960 as an Unassigned Device (UD) and use it for temp data such as

  • Plex transcode
  • Download temp files

Such temp data does not need RAID1 protection. Moving it out to the 960 will reduce write activity on your new SSD and leave it more soft over-provisioning space for the writes that cannot be avoided, which prolongs its lifespan.

Also remember to trim your SSDs. Trim is good.

 

While I can't recommend a "good" SSD for write-intensive workloads, I can dis-recommend (is there such a word?) QLC SSDs. They are absolutely terrible for write-intensive workloads.


 

16 minutes ago, testdasi said:

So a better approach would be to rethink your config. 512GB is not a lot of space and when you add static data (e.g. your VM vdisk, docker img, docker appdata etc.), you don't have much left to over-provision for write activities.

Thanks for your detailed explanations. Downloads and Plex transcodes cannot be the main reason for the nearly 400 TBW.

18 hours ago, johnnie.black said:

Is the SSD used for docker/VMs? If yes, they are known to write a lot to the SSD, much more than expected

That seems more plausible to me. Or am I misunderstanding something here?

If not, over-provisioning seems like the most reasonable approach to me.

21 minutes ago, testdasi said:

you don't have much left to over-provision for write activities.

Is it possible to over-provision any SSD myself?

I don't have any VM vdisks and my Docker image is about 100GB, so 500GB of usable space is plenty for my use. Switching to 1TB, or let's say an over-provisioned 840GB, would already be a huge upgrade for me, but I don't want to lose the speed of NVMe.
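For reference, over-provisioning is usually quoted as the spare area relative to the user-visible capacity. The sketch below applies that common definition to the 1TB drive limited to 840GB mentioned above. It treats the advertised 1000GB as the full raw capacity, which ignores whatever spare area the drive already reserves internally, so take the result as a lower bound; the function name is illustrative.

```python
def over_provisioning_percent(physical_gb, user_gb):
    """Spare area as a percentage of the user-visible capacity."""
    return (physical_gb - user_gb) / user_gb * 100


# A 1 TB drive on which only 840 GB is ever partitioned and used.
print(f"{over_provisioning_percent(1000, 840):.0f}% over-provisioning")
```

The unpartitioned space only helps if the drive knows it is free, so it should be trimmed (or never written) before being left unused.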

I cannot find any fast NVMe drives that are rated for significantly higher TBW.
