SSD cache drive no longer accessible.



It went down 12 hours ago and I'm not sure why. I only discovered it when the Docker containers weren't running and reported a read-only file system (cache).

This is a 2-month-old SSD, and it had been working fine all along.

I've attached a copy of the syslog (the red error entries are all the way at the bottom) and the diagnostics.

I have stopped my Docker service. Should I shut down my server as well?

tower-syslog-20190721-0314.zip tower-diagnostics-20190721-0317.zip

Edited by CyberMew

Apparently, it went missing about here:

Jul 20 15:52:05 Tower kernel: nvme nvme0: I/O 994 QID 14 timeout, aborting
Jul 20 15:52:07 Tower kernel: nvme nvme0: I/O 491 QID 5 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 465 QID 4 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 466 QID 4 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 467 QID 4 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 468 QID 4 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 469 QID 4 timeout, aborting
Jul 20 15:52:13 Tower kernel: nvme nvme0: I/O 470 QID 4 timeout, aborting
Jul 20 15:52:35 Tower kernel: nvme nvme0: I/O 994 QID 14 timeout, reset controller
Jul 20 15:53:06 Tower kernel: nvme nvme0: I/O 27 QID 0 timeout, reset controller
Jul 20 15:54:10 Tower kernel: nvme nvme0: Device not ready; aborting reset
Jul 20 15:54:10 Tower kernel: nvme nvme0: Abort status: 0x7
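
If the syslog is long, the first of these NVMe timeout/reset/abort lines can be picked out with a short script instead of scrolling. The following is a minimal Python sketch, not an Unraid tool; the `nvme_events` helper and its regular expression are hypothetical and assume the `Mon DD HH:MM:SS host kernel: ...` line prefix seen in the excerpt above.

```python
import re

# Hypothetical filter (assumption: standard syslog prefix as in the excerpt).
# Matches kernel lines such as:
# "Jul 20 15:52:05 Tower kernel: nvme nvme0: I/O 994 QID 14 timeout, aborting"
NVME_EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"nvme (?P<dev>nvme\d+):\s+"
    r"(?P<msg>.*(?:timeout|reset|not ready|Abort status).*)$"
)


def nvme_events(lines):
    """Yield (timestamp, device, message) for each NVMe timeout/reset/abort line."""
    for line in lines:
        m = NVME_EVENT.match(line.rstrip("\n"))
        if m:
            yield m.group("stamp"), m.group("dev"), m.group("msg")


if __name__ == "__main__":
    # Two lines copied from the log excerpt above
    sample = [
        "Jul 20 15:52:05 Tower kernel: nvme nvme0: I/O 994 QID 14 timeout, aborting",
        "Jul 20 15:54:10 Tower kernel: nvme nvme0: Device not ready; aborting reset",
    ]
    events = list(nvme_events(sample))
    stamp, dev, msg = events[0]
    print(f"first NVMe error: {stamp} {dev}: {msg}")
    print(f"matching lines: {len(events)}")
```

To scan a real log, pass an open file instead of the sample list, e.g. `with open("syslog.txt") as fh: events = list(nvme_events(fh))`, where `syslog.txt` is a placeholder for wherever the extracted syslog lives.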

It does not show up in the SMART reports in the diagnostics file, meaning that it is missing in action.

By the way, that Internet speed test is spamming your syslog. It would probably be wise to turn it off until this issue is resolved.

I am not an expert on exactly how to proceed at this point, but I would be tempted to reboot the server and see if the drive comes back online. (You might want to wait a few hours first to see whether anyone else has seen anything like this.) If it does come back, get a new diagnostics file for a new post; it may contain some information about the state of the drive.

Edited by Frank1940

Thanks. I'm pretty sure I screwed it down tight a couple of months ago, but I will still double-check the screw in case it somehow came loose.

So far it's been running for 24 hours and all seems fine 🤞 Let's see what happens during the next parity check next month; that's when the read/drive errors usually appear again.


Very interesting read.

I also experienced dropouts a couple of times, but in my case it was the RAM.

IIRC, RAM and NVMe drives use quite similar connection methods.

Reseating the RAM solved the issue, but it's very annoying, especially when I'm not at home and my family wants to use the media server. ¬¬

Have you guys ever had this kind of connection issue?

  • 10 months later...
On 7/24/2019 at 10:31 AM, CyberMew said:

Thanks. I'm pretty sure I screwed it down tight a couple of months ago, but I will still double-check the screw in case it somehow came loose.

So far it's been running for 24 hours and all seems fine 🤞 Let's see what happens during the next parity check next month; that's when the read/drive errors usually appear again.

Did you ever find out what was wrong? I'm having the same issue right now.
