Please Help, Dockers Stopped Abruptly, Cache now says "Unmountable No File System". AppData Share Gone :(



I do have a single cache drive, SSD.    

 

What happened is that after years of operating without issue, my dockers stopped, I believe because of the BTRFS kernel issue with balance space.  After that, I reformatted the cache drive as XFS and restored it to how it was.  It worked perfectly for 2 days, and then all of a sudden it just croaked: the dockers stopped, and after a reboot the cache showed as unmountable and the filesystem check could not do anything, which is why I was asked to run a memtest.

 

So that is where I am.  Nothing points at the cache drive being bad just yet, but somehow after being XFS for 2 days it just kinda quit, and I seem to have lost my whole appdata share.

 

So if anything I think I would only be doing this to the SSD Cache drive, which right now is unmountable anyways so it seems my data on that drive is lost to begin with.  

 

Or were you referring to some other drives in the array?  

Link to comment
disk 5 was in question, so at the least run fsck on it... 

 

You're running a single cache drive with no redundancy and no backups?  Then yeah, it's not a matter of IF you will lose data, just WHEN... and that when might be right now.  You can run badblocks on an SSD, but honestly that's not a perfect test because of the way SSDs work... you should run fsck on it too.  I'd probably go grab whatever tool the drive's manufacturer provides to test it; the only annoying thing is that it might be some stupid Windows utility...  My gut feeling is your SSD is dying.
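For the filesystem check suggested above, a minimal sketch of a read-only first pass. Note that on XFS the checking tool is xfs_repair rather than fsck, and its -n flag only reports problems without writing anything; btrfs check --readonly does the same for btrfs. The helper name fs_check_cmd and the device /dev/sdX1 are placeholders for illustration, not anything from this thread, and the filesystem should not be mounted while it is checked.

```shell
#!/bin/sh
# Print a no-modify check command for a filesystem type and a device.
# Neither command writes to the disk; both only report what they find.
#   $1 = filesystem type (xfs or btrfs)
#   $2 = partition to check (placeholder, e.g. /dev/sdX1)
fs_check_cmd() {
    case "$1" in
        xfs)   echo "xfs_repair -n $2" ;;
        btrfs) echo "btrfs check --readonly $2" ;;
        *)     echo "unsupported filesystem: $1" >&2; return 1 ;;
    esac
}

# Example (the type can be read with: blkid -o value -s TYPE /dev/sdX1):
#   fs_check_cmd xfs /dev/sdX1
```

Review the printed command before running it, and only move on to an actual repair once the read-only pass has shown what is wrong.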

 

also, you didn't answer when you last trim'd the ssd. 

 

PS: there is a plugin that will schedule a backup of your appdata to your array; you should at least do that in the future.

Link to comment

I do have the plugin that backs up appdata to the array; I just hope to god it's intact.  That's my only backup of my cache drive, but it was set up for exactly this reason, so I am just praying that it works.  So you think the SSD may be the one crapping out?  Of all my drives, I expected that one to go last.

 

I don't think I did any SSD trimming.  Is that something I should be doing regularly?

[Screenshot: 2019-06-16, 1:35:53 AM]

With array on, when I click the smart test nothing happens. tower-smart-20190616-0134.zip

 

Link to comment

I checked and I did have the Dynamix SSD Trim Plugin Installed, though what I cannot find is if I had the Cron job or if there was a need to activate it.   

 

Edit:  If I understand correctly, I clicked the scheduler and it was running weekly at 4am, I have changed that to Daily. 

Link to comment
6 hours ago, johnnie.black said:

No, that's normal, NVMe devices don't support SMART tests, since it's useless for flash devices.

that's not true.  Some SSD's do and some don't.  He has an Intel 600p which supports some of the tests.  Obviously things like amount of time it's been spun up is silly on an ssd, but keeping track of lifetime stats like data written/read is important and useful.

 

So going by the pics in your previous posts, you are about 2/3 of the way through the lifetime writes the SSD is rated for, but over at Tom's Hardware they had a 600p crap out at around 105TB (you are at 94.5TB in the pic)... so it might be that it's getting too beat up.

references -

Intel Ark - https://ark.intel.com/content/www/us/en/ark/products/94921/intel-ssd-600p-series-256gb-m-2-80mm-pcie-3-0-x4-3d1-tlc.html

Toms hardware - https://www.tomshardware.com/reviews/intel-ssd-600p-nvme-endurance-testing,4826.html
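To put numbers on the wear estimate above, a small sketch of the arithmetic. The 144 TB figure is an assumption: it is the endurance rating I believe Intel lists for the 256GB 600p and it matches the "2/3" estimate, but confirm it on the Ark page. The function name endurance_pct is made up for illustration.

```shell
#!/bin/sh
# Percentage of a write-endurance figure that has been consumed.
#   $1 = terabytes written so far
#   $2 = endurance figure in terabytes (rated TBW, or an observed failure point)
endurance_pct() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f\n", w / r * 100 }'
}

# endurance_pct 94.5 144   # against the assumed 144 TB rating
# endurance_pct 94.5 105   # against the ~105TB point where the review drive failed
```

Against the assumed rating the drive is roughly two thirds used, but against the point where the Tom's Hardware sample failed it is already at about 90%.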

 

Good that you were trimming occasionally; it helps a bit with wear leveling and maintaining performance.  Weekly or monthly is fine with your usage, though.  Daily just adds extra stress to the drive, so I'd probably put it back to weekly.  Another thing that is hard on NVMe drives is temperature; a lot of people put a heatsink on them, since they only cost a few dollars and can make things last longer.
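For reference, a one-off trim of a mounted filesystem is just fstrim -v followed by the mount point (/mnt/cache for the Unraid cache). The sketch below builds a crontab line for the schedules discussed here. It is only an illustration of the schedule syntax, not what the Dynamix plugin does internally; the function name trim_cron_line is made up, and the 04:00 time is taken from the schedule mentioned earlier in the thread.

```shell
#!/bin/sh
# Emit a crontab line that trims a mount point at 04:00 on a given schedule.
#   $1 = daily | weekly | monthly
#   $2 = mount point to trim (e.g. /mnt/cache)
trim_cron_line() {
    case "$1" in
        daily)   when="0 4 * * *" ;;   # every day
        weekly)  when="0 4 * * 0" ;;   # every Sunday
        monthly) when="0 4 1 * *" ;;   # first day of each month
        *) echo "unknown schedule: $1" >&2; return 1 ;;
    esac
    echo "$when fstrim -v $2"
}

# trim_cron_line weekly /mnt/cache
```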

 

On the plus side it has a 5 year warranty and it hasn't been sold that long, so worst case you gotta RMA the drive.

 

If I were in your shoes (I might be some day, since I have two 660p's in mine as cache), I'd probably try running the Intel toolbox on the drive, since it has Intel's diagnostics.  I've never run it on an NVMe drive, so I'm unsure of the results, but it seems prudent.  The only annoying thing is that they only release it for Windows... I have no Windows machines, and you might not either... which makes it irritating, but not impossible.

https://www.intel.com/content/www/us/en/support/articles/000005800/memory-and-storage.html

Link to comment

So I restored my appdata and reinstalled from previous apps but encountered the following issues:

 

I got at first this error on installing all apps for some of them:

[Screenshot: 2019-06-16, 12:14:55 PM]

 

Then whenever I tried to start those apps it gave a server error:

[Screenshot: 2019-06-16, 12:17:04 PM]

 

Then after I stopped the apps that successfully started and tried to start again, it gives a 403 error:

[Screenshot: 2019-06-16, 12:17:18 PM]

 

When I try to manually update one this is what I get:

[Screenshot]

 

 

Link to comment

OK, so now it's a real issue.

 

After that I rebooted, and when it came back up the Cache drive came up as Unmountable again, unfortunately I did not get diagnostics before a reboot, but I am sure I can recreate it.   Just let me know at what point we need diagnostics. 

[Screenshot]

Link to comment

Best bet is probably to go cacheless for now.

 

Yeah, read that Tom's Hardware link I posted: when the drive crapped out on them, it entered a read-only state...  Sure sounds similar to your issue.

 

I would find out what is needed for RMA, I think it just bit the dust. 

 

Also, just an FYI: some people make RAM drives for things like the Plex transcode folder when their appdata is on an SSD, just to help reduce writes.  Might be worth considering in the future.

Link to comment

Do you think it has anything to do with XFS?  Like, would formatting it as BTRFS help the cache drive at all?
After formatting BTRFS at least it survives a reboot without showing up as unmountable... 

 

I guess I have to order another one in the meantime and figure out the RMA.

 

I do have the transcode mapped to Ram by using the /tmp on Plex, is that what you are referring to?

Link to comment
11 minutes ago, Squid said:

If anything, with only a single cache device, you're far better off having it formatted as xfs

That is actually why I formatted it as XFS after I had the issues on my cache drive.  But being formatted xfs, every time I reboot it comes back up as unmountable: No File System and I lost all my appdata. 

- Actually it is after reboot once dockers have been installed that it does that.  After restoring appdata it was fine, it was just after the dockers were installed and ran that it just crapped out. 

 

I just randomly tried formatting it as btrfs, because, well, why not!  And all of a sudden it survives a reboot without saying Unmountable!

But I have not tried with dockers just yet. 

Link to comment
13 minutes ago, manolodf said:

Oh wow, I had no clue, that must have been the one that came with the Mobo at the time I got it. I will get on updating that asap.  Does Unraid have any specific steps for updating MB Bios?

No.   Process for updating a MB is determined by the MB manufacturer.

Link to comment
1 hour ago, manolodf said:

Oh wow, I had no clue, that must have been the one that came with the Mobo at the time I got it.

When buying a new mobo, after verifying its basic functionality, the first thing I do is update its BIOS to whatever is current.  It's criminal how often motherboards are sold with an extremely outdated BIOS.

 

55 minutes ago, manolodf said:

could this be a reason for some of my XFS problems on checks?

 

Doesn't look good.  Not sure where you're updating xfsprogs, but if there's a file within /boot/extra for it, you can delete it.

Link to comment
17 hours ago, Abzstrak said:

that's not true.  Some SSD's do and some don't. 

I said NVMe devices don't support SMART tests, and AFAIK no NVMe device supports short or long SMART tests, if that's not correct please post a SMART output showing otherwise, all SATA SSD still support SMART tests.

Link to comment
5 hours ago, johnnie.black said:

I said NVMe devices don't support SMART tests, and AFAIK no NVMe device supports short or long SMART tests, if that's not correct please post a SMART output showing otherwise, all SATA SSD still support SMART tests.

Sorry, I thought you said SSD, not NVMe, reading too fast :)    

 

However smartctl can output on an NVMe, info and example --> https://www.smartmontools.org/wiki/NVMe_Support

 

Most NVMe drives (though not all) don't support the SATA/SAS-type SMART commands, but it really would be nice if Unraid would just run smartctl -x on an NVMe drive and give us that.

 

example on my box-

 

# smartctl -x /dev/nvme1n1
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-4.19.41-Unraid] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW512G8
Serial Number:                      BTNH90920C3Y512A
Firmware Version:                   002C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          512,110,190,592 [512 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Mon Jun 17 08:36:10 2019 CDT
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     3.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.00W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    620,079 [317 GB]
Data Units Written:                 1,358,920 [695 GB]
Host Read Commands:                 3,438,919
Host Write Commands:                6,865,860
Controller Busy Time:               120
Power Cycles:                       6
Power On Hours:                     182
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged
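The lifetime figures in that output can be worked out by hand: one NVMe data unit is 1000 blocks of 512 bytes, so 512,000 bytes. A minimal sketch of pulling the counter out and converting it follows; the function names are made up for illustration, and the parsing assumes the line layout shown in the output above.

```shell
#!/bin/sh
# Convert an NVMe "Data Units" counter into whole decimal gigabytes.
# One data unit = 1000 x 512 bytes = 512,000 bytes.
nvme_units_to_gb() {
    echo $(( $1 * 512000 / 1000000000 ))
}

# Read smartctl output on stdin and print the raw "Data Units Written"
# counter with the thousands separators and the bracketed size removed.
nvme_units_written() {
    awk -F: '/^Data Units Written/ {
        sub(/\[.*/, "", $2)      # drop the "[695 GB]" part
        gsub(/[ ,]/, "", $2)     # drop spaces and commas
        print $2
    }'
}

# Example:
#   units=$(smartctl -x /dev/nvme1n1 | nvme_units_written)
#   nvme_units_to_gb "$units"
```

With the counters above, 1,358,920 units comes out to 695 GB written and 620,079 units to 317 GB read, matching what smartctl prints in brackets.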

 

Link to comment
