RockDawg

SSD cache drive dying from excessive writes


I have a Crucial MX500 250GB SSD that I've been using as my cache drive. I installed it about a year ago and unRAID is reporting that it is failing. SMART data being reported:

 

Power on hours - 8646

Total LBAs Written - 158.34 TB

Percent Lifetime Remain  - 99 (Failing Now)

 

I do not use the cache drive for writing files to the array (no mover). I simply use it to run Dockers. I do run quite a few of them, and I have tried to troubleshoot which particular container is causing the issue, but I just can't seem to find it. The containers I'm running are:

 

Krusader

Bitwarden

Cloudflare-DDNS

Deluge

DiskSpeed

Emby

LetsEncrypt

MariaDb

Netdata

Nextcloud

NzbGet

Ombi

Radarr

Roon

Sonarr

 

I have tried stopping them all and starting each one by itself to see if one is the main culprit, but they all perform writes at least every minute or so (or more often). It just seems as though all of them running at the same time adds up. All of the media any of my containers download goes straight to another drive. But there has to be a way to keep them from killing an SSD in about a year, right?

 

My cache is a single drive (XFS) and I am running 6.8.3. Any ideas?

15 hours ago, RockDawg said:

My cache is a single drive (XFS) and I am running 6.8.3. Any ideas?

The new beta fixes this issue, though it mostly affected btrfs. How many GB is it writing per day?


And you're sure it's XFS? Either way, you can use iotop (install the Nerd Pack plugin) to try to find out which docker(s) are writing so much.
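
If iotop isn't handy, the same per-process write counters can be read from `/proc/<pid>/io` on Linux. Below is a rough sketch, assuming it is run as root on the host and that the `write_bytes` field is the number of interest. The helper names (`parse_proc_io`, `snapshot_write_bytes`, `top_writers`) are made up for illustration and are not part of any Unraid tool.

```python
import os
import time


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of int counters."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def snapshot_write_bytes():
    """Return {pid: write_bytes} for every process we are allowed to read."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/io") as fh:
                result[int(entry)] = parse_proc_io(fh.read()).get("write_bytes", 0)
        except OSError:
            # Process exited, or we lack permission (run as root to see all).
            continue
    return result


def top_writers(interval=60, count=10):
    """Sample twice and return the (pid, bytes) pairs that wrote the most."""
    before = snapshot_write_bytes()
    time.sleep(interval)
    after = snapshot_write_bytes()
    deltas = {pid: after[pid] - before.get(pid, 0) for pid in after}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:count]
```

Calling `top_writers(interval=60)` gives a list of PIDs sorted by bytes written during that minute. Mapping a PID back to its container is a separate step (for example by looking at `/proc/<pid>/cgroup`), which this sketch does not attempt.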


I am a bit confused now.  I have the DiskSpeed docker, and that's where I got the SMART data in my OP.  Right now it says Total LBAs Written - 158.90 TB.  But if I look at the SMART data in unRAID, it shows Total LBAs Written - 38794827600.  Each LBA is 512 bytes, right?  So 38,794,827,600 x 512 = 19,862,951,731,200 bytes, which is about 19.86 TB (or 18.07 TiB), right?  Or am I calculating wrong?

 

So where is DiskSpeed getting its 158.90 TB?  I read a review that said the 250 GB drive that I have is rated for 100 TBW.  So my calculation is way under that, and DiskSpeed is reporting over 50% more than that rating.
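
For anyone who wants to check the arithmetic, here is a small sketch of the LBA conversion. It assumes, as the post does, that the raw Total LBAs Written counter is in 512-byte units. The function name is only illustrative.

```python
SECTOR_BYTES = 512  # assumption: the raw counter is in 512-byte LBAs


def lbas_to_bytes(lbas, sector_bytes=SECTOR_BYTES):
    """Convert a raw Total_LBAs_Written count to bytes."""
    return lbas * sector_bytes


total = lbas_to_bytes(38_794_827_600)
print(total)                       # 19862951731200
print(round(total / 1000**4, 2))   # 19.86 (decimal terabytes, TB)
print(round(total / 1024**4, 2))   # 18.07 (binary tebibytes, TiB)
```

The two results differ only in the unit: drive endurance ratings (TBW) are normally quoted in decimal terabytes, so 19.86 TB is the figure to compare against the rating.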


Indeed!  Updated and now it reports 19.88TB.  So now why would the SMART data be saying that the drive is failing due to percent lifetime remain?
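
To put a number on the earlier "how many GB per day" question, a back-of-the-envelope estimate can be made from figures already posted in this thread: 19.88 TB written, 8646 power-on hours, and the 100 TBW rating quoted from a review. This is only a sketch; it assumes the drive was powered on the whole time and that the write rate stays constant, and the function names are hypothetical.

```python
def average_daily_writes_gb(total_tb_written, power_on_hours):
    """Average host writes per day in GB (decimal units)."""
    days = power_on_hours / 24
    return total_tb_written * 1000 / days


def years_until_rating(total_tb_written, rated_tbw, gb_per_day):
    """Years left before the rated TBW is reached at a constant write rate."""
    remaining_gb = (rated_tbw - total_tb_written) * 1000
    return remaining_gb / gb_per_day / 365


rate = average_daily_writes_gb(19.88, 8646)
print(f"{rate:.1f} GB/day")
print(f"{years_until_rating(19.88, 100, rate):.1f} years to reach 100 TBW")
```

That works out to roughly 55 GB per day, which would take several more years to reach the rated endurance rather than one.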


What does the SMART information say when read from within Unraid directly (obtained by clicking on the drive on the Main tab)?   If you post your system diagnostics zip file (obtained via Tools -> Diagnostics), then we could see for ourselves exactly what is being reported, as the SMART information for all drives is part of the content of that zip.

 

In principle, the SMART data is handled by the firmware built into the drive, so if it says the lifetime is running out, it is probably true.    It sounds as if the drive may not have the TBW value you thought it had :(

On 8/1/2020 at 4:42 PM, RockDawg said:

Percent Lifetime Remain  - 99 (Failing Now)

Is this the normalized value or the raw value? Please post the SMART report.

 


Does it actually say failing? Or just 99?

 

Usually the normalized SMART value for this starts at 100 and counts down toward 0 as the drive gets into worse shape.


Here are 2 screenshots: one shows the warning notification from the Main page, and the other shows the SMART data displayed when you click on "cache" on the Main page.  It's now up to 100.

 

 

Untitled.jpg

Untitled1.jpg


Looks like a firmware bug to me; 20 TBW is nowhere close to the expected life for that SSD. As a comparison, here's one from an MX500 500GB: with around 90TB written, it's at 50% expected life:

=== START OF INFORMATION SECTION ===
Model Family:     Crucial/Micron MX500 SSDs
Device Model:     CT500MX500SSD1
Serial Number:    1849E1DAED7F

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    11725
 12 Power_Cycle_Count       -O--CK   100   100   000    -    38
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   050   050   000    -    762
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    3
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    38
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   064   038   000    -    36 (Min/Max 0/62)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Bogus_Current_Pend_Sect -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   050   050   001    -    50
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    174404886673
247 Host_Program_Page_Count -O--CK   100   100   000    -    3003552877
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    4939020470

I would just ignore that.
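
As a quick cross-check of the "around 90TB" figure, the raw counters from the 500GB report above can be converted directly. The sketch below assumes 512-byte LBAs for attribute 246. The write-amplification line uses the formula commonly quoted for Crucial/Micron drives, (attribute 247 + attribute 248) / attribute 247; treat that formula as an assumption here rather than something confirmed in this thread.

```python
# Raw values copied from the CT500MX500SSD1 report above.
total_lbas_written = 174_404_886_673   # attribute 246
host_program_pages = 3_003_552_877     # attribute 247
ftl_program_pages = 4_939_020_470      # attribute 248

# Assumption: attribute 246 counts 512-byte LBAs.
tb_written = total_lbas_written * 512 / 1000**4

# Assumption: write amplification = (host pages + FTL pages) / host pages.
write_amplification = (host_program_pages + ftl_program_pages) / host_program_pages

print(f"{tb_written:.1f} TB written by the host")
print(f"write amplification about {write_amplification:.2f}")
```

With these inputs the host writes come to about 89.3 TB, in line with the roughly 90TB mentioned for that drive at 50% life.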


It is also very strange that it says there is 100% life remaining and that it has also been flagged as FAILING NOW.

24 minutes ago, itimpi said:

It is also very strange that it says there is 100% life remaining and that it has also been flagged as FAILING NOW.

That part is OK: the normalized value is remaining life and the raw value is spent life. This is from an almost new MX500:
 

ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   000    -    0
  5 Reallocate_NAND_Blk_Cnt -O--CK   100   100   010    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    267
 12 Power_Cycle_Count       -O--CK   100   100   000    -    4
171 Program_Fail_Count      -O--CK   100   100   000    -    0
172 Erase_Fail_Count        -O--CK   100   100   000    -    0
173 Ave_Block-Erase_Count   -O--CK   100   100   000    -    4
174 Unexpect_Power_Loss_Ct  -O--CK   100   100   000    -    0
180 Unused_Reserve_NAND_Blk PO--CK   000   000   000    -    28
183 SATA_Interfac_Downshift -O--CK   100   100   000    -    0
184 Error_Correction_Count  -O--CK   100   100   000    -    0
187 Reported_Uncorrect      -O--CK   100   100   000    -    0
194 Temperature_Celsius     -O---K   068   053   000    -    32 (Min/Max 0/47)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Bogus_Current_Pend_Sect -O--CK   100   100   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    0
202 Percent_Lifetime_Remain ----CK   100   100   001    -    0
206 Write_Error_Rate        -OSR--   100   100   000    -    0
210 Success_RAIN_Recov_Cnt  -O--CK   100   100   000    -    0
246 Total_LBAs_Written      -O--CK   100   100   000    -    2476107088
247 Host_Program_Page_Count -O--CK   100   100   000    -    19671852
248 FTL_Program_Page_Count  -O--CK   100   100   000    -    33437715
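
To make the normalized-versus-raw distinction concrete, here is a minimal sketch that parses attribute rows in the column layout shown above (ID, name, flags, VALUE, WORST, THRESH, FAIL, RAW_VALUE). The `SmartAttr` type and `parse_smart_row` helper are hypothetical, and the parser assumes exactly this layout; other smartctl output formats would need different handling.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized value
    worst: int
    thresh: int
    raw: str      # raw value, kept as text (it may contain extra notes)


def parse_smart_row(line):
    """Parse one attribute row; return None for headers or other lines."""
    parts = line.split(None, 7)
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    attr_id, name, _flags, value, worst, thresh, _fail, raw = parts
    return SmartAttr(int(attr_id), name, int(value), int(worst),
                     int(thresh), raw.strip())


rows = [
    "202 Percent_Lifetime_Remain ----CK   050   050   001    -    50",  # 500GB drive
    "202 Percent_Lifetime_Remain ----CK   100   100   001    -    0",   # almost new drive
]
for row in rows:
    attr = parse_smart_row(row)
    # Normalized value = life remaining, raw value = life used.
    print(attr.value, attr.raw, "above threshold:", attr.value > attr.thresh)
```

In both sample rows the normalized value and the raw value add up to 100, which matches the "remaining life" versus "spent life" reading.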

 
