Diggewuff Posted July 29, 2019
Hey guys, my NVMe cache drive failed its SMART test, but I cannot see why. Physically it is working absolutely fine. It is not very old and not that many TB have been written. Here are some screenshots and, in addition, the smartctl output. Maybe someone can help me interpret what the issue is here and whether I have to be afraid that the cache will fail physically soon.

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 500GB
Serial Number:                      S3EU*************
Firmware Version:                   2B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 500,107,862,016 [500 GB]
Unallocated NVM Capacity:           0
Controller ID:                      2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          500,107,862,016 [500 GB]
Namespace 1 Utilization:            244,207,935,488 [244 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5371b0e432
Local Time is:                      Mon Jul 29 00:23:32 2019 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size:         512 Pages
Warning Comp. Temp. Threshold:      77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
- NVM subsystem reliability has been degraded

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x04
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    102%
Data Units Read:                    168,638,229 [86.3 TB]
Data Units Written:                 631,781,984 [323 TB]
Host Read Commands:                 2,498,997,755
Host Write Commands:                3,364,105,515
Controller Busy Time:               27,544
Power Cycles:                       211
Power On Hours:                     12,433
Unsafe Shutdowns:                   123
Media and Data Integrity Errors:    0
Error Information Log Entries:      135
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               64 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        135     0  0x0004  0x4202  0x028            0     0     -
  1        134     0  0x0004  0x4202  0x028            0     0     -
  2        133     0  0x0004  0x4202  0x028            0     0     -
  3        132     0  0x0004  0x4202  0x028            0     0     -
  4        131     0  0x0004  0x4202  0x028            0     0     -
  5        130     0  0x0004  0x4202  0x028            0     0     -
  6        129     0  0x0004  0x4202  0x028            0     0     -
  7        128     0  0x0004  0x4202  0x028            0     0     -
  8        127     0  0x0004  0x4202  0x028            0     0     -
  9        126     0  0x0004  0x4202  0x028            0     0     -
 10        125     0  0x0004  0x4202  0x028            0     0     -
 11        124     0  0x0004  0x4202  0x028            0     0     -
 12        123     0  0x0004  0x4202  0x028            0     0     -
 13        122     0  0x0004  0x4202  0x028            0     0     -
 14        121     0  0x0004  0x4202  0x028            0     0     -
 15        120     0  0x0004  0x4202  0x028            0     0     -
... (48 entries not shown)

Appreciate any help. The SSD is still under warranty, but Samsung wants me to send it in for inspection first. No advance RMA.

JorgeB Posted July 29, 2019
Percentage Used: 102%
It's past the predicted life of the flash; that doesn't mean it's failing or about to fail.

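To illustrate how the "Critical Warning: 0x04" field in the log above maps to the "NVM subsystem reliability has been degraded" message, here is a minimal Python sketch. The bit meanings are taken from the NVMe SMART / Health Information log definition; the function and table names are made up for this illustration and are not part of smartctl.

```python
# Bit positions of the NVMe "Critical Warning" byte (SMART / Health log 0x02).
# The names CRITICAL_WARNING_BITS and decode_critical_warning are illustrative only.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the description of every warning bit that is set in value."""
    return [
        description
        for bit, description in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]


# 0x04 is the value reported by the 960 EVO in this thread.
print(decode_critical_warning(0x04))
```

Only bit 2 is set for 0x04, which is why the drive reports a failed health assessment even though the spare-capacity and temperature bits are clear.
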
Diggewuff (Author) Posted July 29, 2019
20 minutes ago, johnnie.black said: It's past the predicted life of the flash, doesn't mean it's failing or about to fail.
I also noticed that, but on what measure is this life prediction based? The drive was sold with an MTBF of 1.5 million hours and 400 TBW. Neither of them has been reached or exceeded.

JorgeB Posted July 29, 2019
Just now, Diggewuff said: 400 TBW.
200 TBW; 400 is for the 1TB model.

Diggewuff (Author) Posted July 29, 2019
I'm not sure about that; this website says 400 TBW for all 3 sizes. But I have also read 200 in some places. https://www.samsung.com/de/memory-storage/960-evo-nvme-m-2-ssd/MZ-V6E500BW/

Diggewuff (Author) Posted July 29, 2019
Anyway... You would say the failing SMART test is just because of the prediction and nothing else? And the prediction is nothing I have to be worried about?

LammeN3rd Posted July 29, 2019 (edited)
6 minutes ago, Diggewuff said: I'm not sure about that; this website says 400 TBW for all 3 sizes. But I have also read 200 in some places. https://www.samsung.com/de/memory-storage/960-evo-nvme-m-2-ssd/MZ-V6E500BW/
The English datasheet says 200 TBW: https://www.samsung.com/semiconductor/global.semi.static/Samsung_SSD_960_EVO_Data_Sheet_Rev_1_2.pdf
You can always try to claim warranty based on the German site...
I found myself some cheap enterprise NVMe drives on eBay. I highly recommend them: power-loss protection, very consistent write speed and, above all else, 1366 TBW.

JorgeB Posted July 29, 2019
5 minutes ago, Diggewuff said: You would say the failing SMART test is just because of the prediction and nothing else?
Yes.
5 minutes ago, Diggewuff said: And the prediction is nothing I have to be worried about?
Not for now. SSDs are known to last way longer than their predicted life, but of course it can start failing at any time, so keep an eye on it.

JorgeB Posted July 29, 2019
9 minutes ago, Diggewuff said: I'm not sure about that; this website says 400 TBW for all 3 sizes
Clearly a typo, since TBW always varies with size.

LammeN3rd Posted July 29, 2019 (edited)
2 minutes ago, johnnie.black said: Yes. Not for now, SSDs are known to last way more than predicted life, but of course it can start failing at any time, keep an eye on it.
As long as you have anything but Intel SSDs; those tend to go read-only when they reach their predicted life, all to protect the customer, or rather Intel's sales!

Diggewuff (Author) Posted July 29, 2019 (edited)
The RMA procedure of Samsung is pretty stressful. 🤯
Quote:
As a first step towards initiating the RMA procedure for your drive, we will need the following information:
- A detailed explanation of the problem you are facing with your drive. Please provide us with screenshots of error messages or error codes, or a small video clip showing the problem that you are having with your drive, if possible.
- Model of your server. What kind of use has this SSD in your server?
- The SMART test result obtained by Samsung Magician. Samsung Magician can be downloaded via the following link: https://www.samsung.com/semiconductor/minisite/ssd/download/tools/ Please note that the software is only compatible with Windows and the SSD must be connected directly to the motherboard. To obtain the SMART result, simply click on the 'SMART' button on the Samsung Magician home page.
- A screenshot of the Samsung Magician homepage
- A copy of your proof of purchase
- What is your country of residence?
- Photos of the front and backside of the entire SSD clearly indicating the serial number on the label on the drive.
- Are you an end-user or reseller of this drive?
And no advance RMA, so quite a lot of downtime too.

JorgeB Posted July 29, 2019
It will be a waste of time, since it's past the TBW, and on the DE site they also mention 200 TBW.

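The "past the TBW" remark can be checked against the smartctl numbers. The following Python sketch assumes the NVMe convention that one data unit is 1000 x 512 bytes and uses the 200 TBW rating quoted in this thread; the function name is invented for the example. Note that this ratio is not the same thing as the drive's own "Percentage Used" field, which is a vendor estimate.

```python
# One NVMe "data unit" is 1000 blocks of 512 bytes (assumption stated in the lead-in).
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


written_tb = data_units_to_tb(631_781_984)  # "Data Units Written" from the log
rated_tbw = 200                             # rating for the 500GB model per the thread

print(f"{written_tb:.0f} TB written, {written_tb / rated_tbw:.0%} of rated TBW")
```

The conversion reproduces the bracketed "[323 TB]" figure that smartctl prints next to the raw counter, which puts the drive well beyond the 200 TBW rating.
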
Diggewuff (Author) Posted July 29, 2019 (edited)
The TBW value also looks pretty high to me: so much written and only 1/4 of that read. My whole array holds only 1/13 of that amount of data (TBW). Does that sound plausible to you?
Additional info: I'm not moving huge amounts of data regularly. But I have quite a few Docker containers (35 of them) running, logging, writing to databases and being regularly updated.

Diggewuff (Author) Posted July 29, 2019
Just now, johnnie.black said: It will be a waste of time, since it's past the TBW, and on the DE site they also mention 200TBW:
I agree with that.

JorgeB Posted July 29, 2019
2 minutes ago, Diggewuff said: Does that sound plausible to you?
Is the SSD used for docker/VMs? If yes, they are known to write a lot to the SSD, much more than expected, but I never found a definite answer on the cause. You can see my device has similar stats. I'm also pretty sure the power-on hours are wrong, since I'm pretty sure I've been using it for more than 2 years.

=== START OF INFORMATION SECTION ===
Model Number:                       TOSHIBA-RD400
Serial Number:                      664S107XTPGV
Firmware Version:                   57CZ4102
PCI Vendor/Subsystem ID:            0x1b85
IEEE OUI Identifier:                0xe83a97
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            e83a97 02000018f5
Local Time is:                      Mon Jul 29 17:40:00 2019 BST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x000e):     Wr_Unc DS_Mngmt Wr_Zero
Warning Comp. Temp. Threshold:      78 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W       -        -    0  0  0  0        0       0
 1 +     2.40W       -        -    1  1  1  1        0       0
 2 +     1.90W       -        -    2  2  2  2        0       0
 3 -   0.1600W       -        -    3  3  3  3     1000    1000
 4 -   0.0120W       -        -    4  4  4  4     5000   35000
 5 -   0.0060W       -        -    5  5  5  5   100000  110000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    54%
Data Units Read:                    131,651,195 [67.4 TB]
Data Units Written:                 537,286,812 [275 TB]
Host Read Commands:                 4,801,983,587
Host Write Commands:                6,914,780,808
Controller Busy Time:               22,109
Power Cycles:                       158
Power On Hours:                     5,551
Unsafe Shutdowns:                   38
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning Comp. Temperature Time:     0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

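To put the two logs side by side, a rough average write rate can be derived from the reported terabytes written and the power-on hours. This is only a back-of-the-envelope Python sketch: the function name is hypothetical, and the result for the RD400 inherits whatever error there is in its power-on-hours counter, which is doubted above.

```python
def average_write_rate_gb_per_hour(tb_written: float, power_on_hours: int) -> float:
    """Average host writes per power-on hour, in decimal gigabytes."""
    return tb_written * 1000 / power_on_hours


# Values copied from the two smartctl logs in this thread.
drives = {
    "Samsung SSD 960 EVO 500GB": (323, 12_433),
    "TOSHIBA-RD400": (275, 5_551),
}

for name, (tb_written, hours) in drives.items():
    rate = average_write_rate_gb_per_hour(tb_written, hours)
    print(f"{name}: about {rate:.0f} GB/hour, {rate * 24:.0f} GB/day")
```

Both drives come out in the range of tens of gigabytes written per hour, far more than occasional file transfers would explain, which is consistent with the Docker/VM write behaviour described here.
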
LammeN3rd Posted July 29, 2019
Write amplification is *&%^$#& as well! https://en.wikipedia.org/wiki/Write_amplification

Diggewuff (Author) Posted July 29, 2019
3 minutes ago, johnnie.black said: Is the SSD used for docker
Yes.
11 minutes ago, Diggewuff said: I'm not moving huge amounts of data regularly. But I have quite a few Docker containers (35 of them) running, logging, writing to databases and being regularly updated.
I specifically decided on an NVMe cache because of the containers. Maybe that was a pretty costly decision, based on this experience. 😖
I think I'll use the drive until it fails. But from now on I won't have any warning before the point of failure, because SMART has already failed. Would adding a second cache drive in RAID1 be a good idea? My mainboard (Supermicro X11SSi-LN4F) only has one NVMe slot. Will RAID1 work with a PCI NVMe extension?

JorgeB Posted July 29, 2019
1 minute ago, Diggewuff said: Would adding a second cache drive in RAID1 be a good idea?
Redundancy is always good; alternatively, back up anything important to another device. I do it daily.

Diggewuff (Author) Posted July 29, 2019
3 minutes ago, Diggewuff said: My mainboard (Supermicro X11SSi-LN4F) only has one NVMe slot. Will RAID1 work with a PCI NVMe extension?
Is that a good idea?
8 minutes ago, LammeN3rd said: Write amplification is *&%^$#& as well! https://en.wikipedia.org/wiki/Write_amplification
Can that be reduced in any way?

JorgeB Posted July 29, 2019
1 minute ago, Diggewuff said: Will RAID1 work with a PCI NVMe extension?
Yes.

Diggewuff (Author) Posted July 29, 2019
Is it possible to combine a 500 GB and a 1000 GB NVMe in RAID1 to expand it later?

JonathanM Posted July 29, 2019
30 minutes ago, Diggewuff said: Is it possible to combine a 500 GB and a 1000 GB NVMe in RAID1 to expand it later?
Sure, but it will give a total usable capacity of 500GB, and it will probably report more free space than it actually has. If you want to do this, make sure you keep up with the latest of @johnnie.black's recommendations on cache pool management.

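The 500GB figure follows from how a mirrored pool has to place data. As a minimal Python sketch, assuming a two-copy RAID1 in which each block must live on two different devices (the behaviour of btrfs-style RAID1; the pool type is an assumption, as is the function name), usable capacity is limited both by half the total and by what the smaller devices can mirror:

```python
def raid1_usable_gb(device_sizes_gb: list[int]) -> float:
    """Usable space of a two-copy mirror whose copies sit on different devices.

    Every block needs a partner on another device, so capacity is capped by
    half of the total and by the space available outside the largest device.
    """
    total = sum(device_sizes_gb)
    largest = max(device_sizes_gb)
    return min(total / 2, total - largest)


print(raid1_usable_gb([500, 1000]))       # the combination asked about
print(raid1_usable_gb([500, 500, 1000]))  # hypothetical later expansion
```

With a 500 GB and a 1000 GB device, half of the larger drive has no partner to mirror onto, so only 500 GB is usable until more capacity is added on other devices.
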
Diggewuff (Author) Posted July 29, 2019
Can anyone give me a hint for a good M.2 SSD for write-intensive workloads?

testdasi Posted July 30, 2019
12 hours ago, Diggewuff said: Can anyone give me a hint for a good M.2 SSD for write-intensive workloads?
I think you are approaching the issue from the wrong angle. If you look at those "write-intensive" enterprise SSDs on the market, they all do the same thing: they have hard over-provisioning. That's why a "read-intensive" model will have 4TB but a "write-intensive" one will have 3.84TB or something like that.
So a better approach would be to rethink your config. 512GB is not a lot of space, and when you add static data (e.g. your VM vdisk, docker img, docker appdata etc.), you don't have much left to over-provision for write activities.
For example, if you get a new 512GB SSD, you can consider mounting your current 960 as UD and using it for temp data such as:
- Plex transcode
- Download temp files
Such temp data does not need RAID1 protection. Moving it out to the 960 will reduce write activity on your new SSD and give it more soft over-provisioning space for any inevitable writes that must be done, i.e. prolonging its lifespan.
Also remember to trim your SSDs. Trim is good.
While I can't recommend a "good" SSD for write-intensive workloads, I can dis-recommend (is there such a word?) QLC SSDs. They are absolutely terrible for write-intensive workloads.

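The over-provisioning idea can be expressed as a simple ratio of spare space to user-visible space. The Python sketch below uses the commonly quoted definition (spare divided by usable capacity) and the 4TB versus 3.84TB figures from the post above; real drives also hold hidden spare area that is not visible in these numbers, and the function name is hypothetical.

```python
def over_provisioning_percent(raw_gb: float, usable_gb: float) -> float:
    """Spare area as a percentage of the user-visible capacity."""
    return (raw_gb - usable_gb) / usable_gb * 100


# "Read-intensive" 4TB versus "write-intensive" 3.84TB, as quoted in the post.
hard_op = over_provisioning_percent(4000, 3840)
print(f"hard over-provisioning: {hard_op:.1f}%")
```

The same function describes soft over-provisioning: pass the drive's full capacity as raw_gb and the amount of space actually kept in use as usable_gb.
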
Diggewuff (Author) Posted July 30, 2019
16 minutes ago, testdasi said: So a better approach would be to rethink your config. 512GB is not a lot of space and when you add static data (e.g. your VM vdisk, docker img, docker appdata etc.), you don't have much left to over-provision for write activities.
Thanks for your detailed explanations. Downloads and Plex transcodes cannot be the main reason for the nearly 400 TBW.
18 hours ago, johnnie.black said: Is the SSD used for docker/VMs? If yes they are known to write a lot to the SSD, much more than expected
That seems more reasonable to me. Or am I misunderstanding something here? If not, over-provisioning seems like the most reasonable approach to me.
21 minutes ago, testdasi said: you don't have much left to over-provision for write activities.
Is it possible to over-provision any SSD myself? I don't have any VM vdisks and my Docker image is about 100 GB, so 500 GB of usable space is plenty for my use. Switching to 1 TB, or let's say an over-provisioned 840 GB, would already be a huge upgrade for me, but I don't want to lose the speed of NVMe. I cannot find any fast NVMe drives that are rated for significantly higher TBW.