Weird SSD Cache Issues (2 Random disconnects, and I/O Timeouts) (SOLVED)


Ryonez


So for a while I've noticed the server's UI has become unresponsive quite often, especially when updating dockers. Even the CPU stats on the dashboard vary wildly from what htop reports. I noticed this around the time I updated to 6.6.6.

Today, it got much worse. I went to update all the dockers at once from the Docker page and came back to find the cache had disappeared. I managed to stop the array and shut the machine down. After checking the cables, I booted up, the cache was there, so I tried the updates again. It didn't take long before the cache went walkies again. This time I couldn't stop the array and had to pull the power.

Once I had the server running again I checked the SMART information for the drive. It looked fine, even though the drive was clearly acting up.
I ended up updating the dockers one by one, often with the GUI timing out. Eventually I got them updated and figured I'd look into it more tomorrow.

After getting alerts for more updates, and with me still awake, I thought I'd update them. Annnnd it timed out again. So I thought I'd grab the diagnostic information and seek help here.
[Screenshot attached]

This shows me trying to update a docker. The upper left shows the drives connected to the system. The right shows atop with disk information, where the cache drive is clearly struggling. The bottom is the system log, with the GUI timing out during the docker update.
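(For anyone following along, the same picture can be had from a console. A minimal sketch, assuming the cache shows up as /dev/sdc and that the sysstat tools are installed, which they may not be on a stock install:)

# Extended per-device I/O stats every 2 seconds; high await/%util on the cache device's row
# points at the drive or its link rather than the CPU
iostat -xm 2

# At the same time, follow the system log for ATA resets or link errors on the cache device
tail -f /var/log/syslog | grep -Ei 'ata|sdc'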

Just in case, I threw the array into maintenance mode and checked the drive, receiving the following results:

Phase 1 - find and verify superblock...
        - block cache size set to 768472 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 59131 tail block 59131
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 0
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sat Dec 29 01:48:26 2018

Phase		Start		End		Duration
Phase 1:	12/29 01:48:10	12/29 01:48:10
Phase 2:	12/29 01:48:10	12/29 01:48:10
Phase 3:	12/29 01:48:10	12/29 01:48:19	9 seconds
Phase 4:	12/29 01:48:19	12/29 01:48:19
Phase 5:	Skipped
Phase 6:	12/29 01:48:19	12/29 01:48:26	7 seconds
Phase 7:	12/29 01:48:26	12/29 01:48:26

Total run time: 16 seconds
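(For reference, that was a check-only pass. The equivalent from a console would be something along these lines, with the array in maintenance mode so the filesystem is unmounted; the device name here is just an example, not necessarily my cache device:)

# Check-only: report problems without writing anything to the filesystem
xfs_repair -n /dev/sdb1

# Only if damage is reported: a real repair run drops the -n (filesystem must stay unmounted)
xfs_repair /dev/sdb1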

And below, the cache's SMART info and drive info.

 

ATTRIBUTES
#	ATTRIBUTE NAME	FLAG	VALUE	WORST	THRESHOLD	TYPE	UPDATED	FAILED	RAW VALUE
1	Raw read error rate	0x0032	095	095	050	Old age	Always	Never	1/122114931
5	Retired block count	0x0033	097	097	003	Pre-fail	Always	Never	0
9	Power on hours and msec	0x0032	073	073	000	Old age	Always	Never	24474h+59m+54.220s
12	Power cycle count	0x0032	100	100	000	Old age	Always	Never	580
171	Program fail count	0x000a	100	100	000	Old age	Always	Never	0
172	Erase fail count	0x0032	100	100	000	Old age	Always	Never	0
174	Unexpect power loss count	0x0030	000	000	000	Old age	Offline	Never	220
177	Wear range delta	0x0000	000	000	000	Old age	Offline	Never	74
181	Program fail count	0x000a	100	100	000	Old age	Always	Never	0
182	Erase fail count	0x0032	100	100	000	Old age	Always	Never	0
187	Reported uncorrect	0x0012	100	100	000	Old age	Always	Never	0
194	Temperature celsius	0x0022	037	045	000	Old age	Always	Never	37 (min/max -22/45)
195	ECC uncorr error count	0x001c	116	116	000	Old age	Offline	Never	1/122114931
196	Reallocated event count	0x0033	097	097	003	Pre-fail	Always	Never	0
201	Unc soft read err rate	0x001c	116	116	000	Old age	Offline	Never	1/122114931
204	Soft ECC correct rate	0x001c	116	116	000	Old age	Offline	Never	1/122114931
230	Life curve status	0x0013	100	100	000	Pre-fail	Always	Never	100
231	SSD life left	0x0013	073	073	010	Pre-fail	Always	Never	0
233	Sandforce internal	0x0032	000	000	000	Old age	Always	Never	104650
234	Sandforce internal	0x0032	000	000	000	Old age	Always	Never	72048
241	Lifetime writes gib	0x0032	000	000	000	Old age	Always	Never	72048
242	Lifetime reads gib	0x0032	000	000	000	Old age	Always	Never	49050

CAPABILITIES
FEATURE	VALUE	INFORMATION
Offline data collection status:	0x00	Offline data collection activity was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status:		The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection:		seconds.
Offline data collection capabilities:	0x7d	SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities:	0x0003	Saves SMART data before entering power-saving mode.
Supports SMART auto save timer.
Error logging capability:	0x01	Error logging supported.
General Purpose Logging supported.
Short self-test routine recommended polling time:	1	minutes.
Extended self-test routine recommended polling time:	48	minutes.
Conveyance self-test routine recommended polling time:	2	minutes.
SCT capabilities:	0x0025	SCT Status supported.
SCT Data Table supported.

IDENTITY
TITLE	INFORMATION
Model family:	SandForce Driven SSDs
Device model:	ADATA SP900
Serial number:	XXXXXXXXXXXXX
LU WWN device id:	5 707c18 000058b16
Firmware version:	5.6.0
User capacity:	128,035,676,160 bytes [128 GB]
Sector size:	512 bytes logical/physical
Rotation rate:	Solid State Device
Device:	In smartctl database [for details use: -P show]
ATA version:	ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA version:	SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local time:	Sat Dec 29 01:48:56 2018 NZDT
SMART support:	Available - device has SMART capability.
SMART support:	Enabled
SMART overall-health:	Passed
Manufacturing date:	
dd/mm/yyyy
Date of purchase:	
dd/mm/yyyy
Warranty period:	
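(The dump above is essentially what smartctl reports; the same data can be pulled from a console if the GUI is unresponsive. Device name assumed:)

# Full identity, capabilities and attribute table for the drive
smartctl -a /dev/sdb

# Kick off a short self-test, then check the result a few minutes later
smartctl -t short /dev/sdb
smartctl -l selftest /dev/sdb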

Does anyone see something I don't, or have any advice? Thank you.


This is still happening, and I'm at a complete loss.

The dashboard can report maxed-out CPU usage while, at the same time, htop shows next to nothing.
I'm thinking about downgrading to see if this stops. This is getting out of hand: I've been trying to update one docker for 20 minutes, the page timed out, and the docker has now just disappeared.


Just moved all the dockers onto the array and tested the docker filesystem, which returned no errors.
I just went to turn the docker service off, and the webUI didn't update:
 

Jan 4 22:23:04 Atlantis nginx: 2019/01/04 22:23:04 [error] 11886#11886: *86889 upstream timed out (110: Connection timed out) while reading upstream, client: 10.1.1.30, server: , request: "POST /update.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "98cfc7c68d1cb5a7ddfa9158c07b51ec370a5a02.unraid.net", referrer: "https://98cfc7c68d1cb5a7ddfa9158c07b51ec370a5a02.unraid.net/Settings/DockerSettings"
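(That 110 error is nginx giving up on php-fpm after its read timeout while update.php grinds away. The limit itself is an nginx fastcgi setting and could be raised with something like the snippet below, though that would only hide the symptom rather than explain why the request is so slow. This is an illustrative location block, not Unraid's actual config file:)

# Illustrative nginx PHP location block: give slow requests more time before the 110 timeout
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/var/run/php5-fpm.sock;
    fastcgi_read_timeout 300s;
}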

 

I'm just completely stumped. This can be replicated without the cache, and docker says its image is fine. Is unraid just dying?


Alright, I'm testing by copying just the 40GB docker image to the cache. This should be the only activity on the server, and I'm seeing this:
[Screenshot attached]

Even 5 minutes in, this is taking... Ahah, I've actually managed to trigger a failure during testing; let me attach the diagnostics to this.
Is this an issue with the controller, the drive, or the kernel?
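(The test was nothing fancier than copying the image onto the cache and watching it crawl; something like the lines below, with the paths being examples rather than my exact ones:)

# Copy the docker image onto the cache and time it
time cp /mnt/disk1/system/docker/docker.img /mnt/cache/system/docker/docker.img

# Or rsync with progress so a stall is immediately obvious
rsync -a --progress /mnt/disk1/system/docker/docker.img /mnt/cache/system/docker/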

atlantis-diagnostics-20190104-2243.zip

5 hours ago, johnnie.black said:

Errors at the end show what appears to be a connection problem with the cache device; replace cables.

Alright, I had a look.

I actually got myself a new SSD for my desktop for Christmas, and the old SSD was a sister to the one in the server. I popped it in and checked the cables; so far it's looking like it's not a cable failure but an SSD one. Even though the one from the desktop has 2 reallocated sectors, it works, and given the one that was in the server is failing without reporting an issue, I have to replace it now.

I'm shifting files back onto the cache now and will report back here with the results later.


And done.

So in the end, it was a faulty SSD. Even though SMART was reporting nothing wrong, it would fail under heavy write conditions.

I replaced the drive, copying the data that was on the old cache onto the new one. With the help of the appdata backup, I went through all the dockers and they look safe, with no errors being reported from them. Testing the services the dockers provide yields no problems.
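(The copy back was just a straight file copy from the temporary location on the array onto the new cache; a sketch, with the mount points and backup path assumed rather than being my exact layout:)

# With the new cache formatted and mounted, copy everything back from the array
rsync -avh --progress /mnt/disk1/cache_backup/ /mnt/cache/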

On a side note, the GUI seems to report the right CPU loads now; not sure why that was screwing up because of the SSD.

As everything seems good now, I'm going to mark this as solved.

  • 2 months later...
On 1/4/2019 at 7:53 PM, Ryonez said:

So in the end, it was a faulty SSD.

Can you say how you identified the SSD as faulty, other than by deduction? I'm having similar symptoms to yours (an 8-core Xeon, an E3-1220 v2, being maxed out by the usual dockers) and my SSD is one of the re-used components. This is one of the more useful threads I've seen on this matter.

4 hours ago, hpka said:

Can you say how you identified the SSD as faulty, other than by deduction? I'm having similar symptoms to yours (an 8-core Xeon, an E3-1220 v2, being maxed out by the usual dockers) and my SSD is one of the re-used components. This is one of the more useful threads I've seen on this matter.

It was just pure deduction.

All of the reporting tools were saying it was fine. But it kept having issues which meant I had to look into it further.

In the end, it was manually testing the device that allowed me to narrow it down to the point where I could make the SSD fail 100% of the time under one of my test conditions. At that point it doesn't matter what SMART says: I can reproduce a failure at will with something that is within its specs, proving the device is faulty.

Hardware failure can be tricky. Not everything is going to show up as a concrete red line somewhere. There will be times you have to just test things yourself to try and figure it out.
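(In my case the test condition was simply sustained heavy writes to the drive, like the 40GB image copy above. Something along these lines, against a scratch file on the suspect disk, is the kind of test I mean; the exact invocation and size are illustrative:)

# Sustained direct writes; a healthy SSD should finish without resetting or dropping off the bus
dd if=/dev/zero of=/mnt/cache/stress.bin bs=1M count=40960 oflag=direct status=progress

# Clean up afterwards
rm /mnt/cache/stress.bin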

