Ryonez Posted December 28, 2018 (edited)

So for a while I've noticed the server's UI has become unresponsive quite often, especially when updating dockers. Even the CPU stats on the dashboard vary wildly from what htop reports. I noticed this around the time I updated to 6.6.6.

Today, it got much worse. I went to update all the dockers at once under the docker page annnnnd, I came back to find the cache had disappeared. I managed to unmount the array and shut down the machine. After checking the cables, I booted up, the cache was there, so I tried the updates again. It didn't take long before the cache chose to go walkies again. This time I couldn't shut the array down and had to pull the power.

Once I had the server running again I checked the SMART information for the drive. It looked fine, even though the drive was acting off. I ended up trying to update the dockers one by one, often with the GUI timing out. Eventually I got them updated and figured I'd look into it more tomorrow. After getting alerts for more updates and with me still awake, I thought I'd update them. Annnnd it timed out again. So I thought I'd grab the diagnostic information and seek help here.

The screenshot shows me trying to update a docker. The upper left shows the connected drives in the system. The right shows atop with disk information, which shows the cache drive struggling. The bottom is the log info for the system, and the GUI timing out during the docker update.

Just in case, I threw the array into maintenance mode and checked the drive, receiving the following results:

Phase 1 - find and verify superblock...
        - block cache size set to 768472 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 59131 tail block 59131
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 0
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Sat Dec 29 01:48:26 2018

Phase           Start           End             Duration
Phase 1:        12/29 01:48:10  12/29 01:48:10
Phase 2:        12/29 01:48:10  12/29 01:48:10
Phase 3:        12/29 01:48:10  12/29 01:48:19  9 seconds
Phase 4:        12/29 01:48:19  12/29 01:48:19
Phase 5:        Skipped
Phase 6:        12/29 01:48:19  12/29 01:48:26  7 seconds
Phase 7:        12/29 01:48:26  12/29 01:48:26

Total run time: 16 seconds

And below, the cache's SMART info and drive info.

ATTRIBUTES
#    ATTRIBUTE NAME             FLAG    VALUE  WORST  THRESHOLD  TYPE      UPDATED  FAILED  RAW VALUE
1    Raw read error rate        0x0032  095    095    050        Old age   Always   Never   1/122114931
5    Retired block count        0x0033  097    097    003        Pre-fail  Always   Never   0
9    Power on hours and msec    0x0032  073    073    000        Old age   Always   Never   24474h+59m+54.220s
12   Power cycle count          0x0032  100    100    000        Old age   Always   Never   580
171  Program fail count         0x000a  100    100    000        Old age   Always   Never   0
172  Erase fail count           0x0032  100    100    000        Old age   Always   Never   0
174  Unexpect power loss count  0x0030  000    000    000        Old age   Offline  Never   220
177  Wear range delta           0x0000  000    000    000        Old age   Offline  Never   74
181  Program fail count         0x000a  100    100    000        Old age   Always   Never   0
182  Erase fail count           0x0032  100    100    000        Old age   Always   Never   0
187  Reported uncorrect         0x0012  100    100    000        Old age   Always   Never   0
194  Temperature celsius        0x0022  037    045    000        Old age   Always   Never   37 (min/max -22/45)
195  ECC uncorr error count     0x001c  116    116    000        Old age   Offline  Never   1/122114931
196  Reallocated event count    0x0033  097    097    003        Pre-fail  Always   Never   0
201  Unc soft read err rate     0x001c  116    116    000        Old age   Offline  Never   1/122114931
204  Soft ECC correct rate      0x001c  116    116    000        Old age   Offline  Never   1/122114931
230  Life curve status          0x0013  100    100    000        Pre-fail  Always   Never   100
231  SSD life left              0x0013  073    073    010        Pre-fail  Always   Never   0
233  Sandforce internal         0x0032  000    000    000        Old age   Always   Never   104650
234  Sandforce internal         0x0032  000    000    000        Old age   Always   Never   72048
241  Lifetime writes gib        0x0032  000    000    000        Old age   Always   Never   72048
242  Lifetime reads gib         0x0032  000    000    000        Old age   Always   Never   49050

CAPABILITIES
Offline data collection status:  0x00    Offline data collection activity was never started. Auto Offline Data Collection: Disabled.
Self-test execution status:              The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection:  seconds.
Offline data collection capabilities:  0x7d  SMART execute Offline immediate. No Auto Offline data collection support. Abort Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities:  0x0003  Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability:  0x01  Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time:  1 minutes.
Extended self-test routine recommended polling time:  48 minutes.
Conveyance self-test routine recommended polling time:  2 minutes.
SCT capabilities:  0x0025  SCT Status supported. SCT Data Table supported.
IDENTITY
Model family:         SandForce Driven SSDs
Device model:         ADATA SP900
Serial number:        XXXXXXXXXXXXX
LU WWN device id:     5 707c18 000058b16
Firmware version:     5.6.0
User capacity:        128,035,676,160 bytes [128 GB]
Sector size:          512 bytes logical/physical
Rotation rate:        Solid State Device
Device:               In smartctl database [for details use: -P show]
ATA version:          ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA version:         SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local time:           Sat Dec 29 01:48:56 2018 NZDT
SMART support:        Available - device has SMART capability.
SMART support:        Enabled
SMART overall-health: Passed
Manufacturing date:   dd/mm/yyyy
Date of purchase:     dd/mm/yyyy
Warranty period:

Does anyone see something I don't, or have any advice? Thank you.

Edited January 5, 2019 by Ryonez: Removed a serial number
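(A note for anyone eyeballing a SMART table like the one above: smartctl flags an attribute as failing when its normalized VALUE drops to or below its THRESHOLD, which is why attribute 231 with VALUE 073 against threshold 010 still "passes" even on a worn drive. A minimal shell sketch of that comparison, using a hypothetical sample line in `smartctl -A` column order rather than a live device:)

```shell
#!/bin/bash
# Sample attribute line in `smartctl -A` column order:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
line="231 SSD_Life_Left 0x0013 073 073 010 Pre-fail Always - 0"

value=$(echo "$line" | awk '{print $4}')   # normalized VALUE column
thresh=$(echo "$line" | awk '{print $6}')  # failure THRESHOLD column

# SMART considers the attribute failing when VALUE <= THRESHOLD.
# The 10# prefix forces base-10 so leading zeros aren't read as octal.
if [ "$((10#$value))" -le "$((10#$thresh))" ]; then
    echo "attribute 231: FAILING"
else
    echo "attribute 231: OK (value $value, threshold $thresh)"
fi
```

As the thread goes on to show, a "passing" comparison like this is no guarantee the drive is healthy under load.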
Ryonez Posted December 28, 2018 (Author)

Just got an unclean shutdown warning after taking the array out of maintenance mode. Not sure that's normal. But for now, I need sleep, so please excuse any late replies from me.
Ryonez Posted December 29, 2018 (Author)

Still having issues, updating dockers one at a time and crossing my fingers it'll go through without timing out. Is this an issue in the system, or is the ssd dying?
Ryonez Posted January 3, 2019 (Author)

This is still happening, and I'm at a complete loss. The dashboard can report max CPU usage while, at the same time, htop shows nothing. I'm thinking about downgrading to see if this stops. This is out of hand; I've been trying to update one docker for 20 mins. The site page timed out, and the docker has just disappeared now.
Ryonez Posted January 3, 2019 (Author, edited)

I tried going back to 6.6.5, no joy. Loaded 6.6.6 back now.

Edited January 3, 2019 by Ryonez
Ryonez Posted January 4, 2019 (Author)

Just moved all the dockers onto the array and tested the docker filesystem, which returned no errors. I just went to turn the docker service off, and the webUI didn't update:

Jan 4 22:23:04 Atlantis nginx: 2019/01/04 22:23:04 [error] 11886#11886: *86889 upstream timed out (110: Connection timed out) while reading upstream, client: 10.1.1.30, server: , request: "POST /update.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "98cfc7c68d1cb5a7ddfa9158c07b51ec370a5a02.unraid.net", referrer: "https://98cfc7c68d1cb5a7ddfa9158c07b51ec370a5a02.unraid.net/Settings/DockerSettings"

I'm just completely stumped. This can be replicated without the cache, and docker is saying its image is fine. Is unraid just dying?
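(One way to see whether timeouts like the nginx error above are piling up is to count them in the log. A minimal sketch; the heredoc writes a sample file standing in for the live log, which on Unraid would be /var/log/syslog:)

```shell
#!/bin/bash
# Create a sample log file (stand-in for /var/log/syslog on a live box).
cat > /tmp/sample_syslog <<'EOF'
Jan  4 22:23:04 Atlantis nginx: [error] upstream timed out (110: Connection timed out) while reading upstream
Jan  4 22:23:05 Atlantis kernel: unrelated message
EOF

# Count how many upstream-timeout errors appear.
grep -c 'upstream timed out' /tmp/sample_syslog
```

A rising count during docker operations, with the box otherwise idle, points at whatever the update is writing to rather than at nginx itself.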
Ryonez Posted January 4, 2019 (Author, edited)

Alright. Testing by copying just the 40GB docker image to the cache, this should be the only activity on the server, and I'm seeing this: even 5 minutes in this is taking... Ahah, I've actually managed to trigger a failure during testing, let me attach the diagnostics. Is this an issue with the controller, the drive, or the kernel?

atlantis-diagnostics-20190104-2243.zip

Edited January 4, 2019 by Ryonez: Spelling
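(The copy-a-big-file test above can be approximated with a write-then-verify pass. A minimal sketch; TARGET and the sizes are hypothetical — point TARGET at a path on the suspect drive, e.g. somewhere under /mnt/cache, and raise SIZE_MB toward the real image size. It defaults to /tmp for a dry run:)

```shell
#!/bin/bash
# Write random data to a scratch file on the suspect filesystem,
# flush it to the device, then read it back and compare.
TARGET="${TARGET:-/tmp/cache_stress.bin}"
SIZE_MB="${SIZE_MB:-4}"

dd if=/dev/urandom of=/tmp/stress_src.bin bs=1M count="$SIZE_MB" status=none
cp /tmp/stress_src.bin "$TARGET"
sync    # force the write out of the page cache and onto the device

if cmp -s /tmp/stress_src.bin "$TARGET"; then
    echo "PASS: readback matches"
else
    echo "FAIL: data differs after readback"
fi
```

On a drive that drops offline under write load, as described in this thread, the cp or sync step would error out or hang rather than merely miscompare, which is itself a useful symptom.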
JorgeB Posted January 4, 2019

Errors in the end show what appears to be a connection problem with the cache device; replace cables.
Ryonez Posted January 4, 2019 (Author)

5 hours ago, johnnie.black said:
    Errors in the end show what appears to be a connection problem with the cache device; replace cables.

Alright, I had a look. I actually got myself a new ssd for my desktop for Christmas, and the old ssd was a sister to the one I had in the server. Popping it in and checking the cables, so far it's looking like it's not a cable failure, but an ssd one. Even though the one from the desktop has 2 reallocated sectors, it works, and given the one in the server is failing without reporting an issue, I have to replace it now. I'm shifting files back onto the cache now, will report back here with the results later.
Ryonez Posted January 5, 2019 (Author)

And done. So in the end, it was a faulty ssd. Even though SMART was reporting nothing wrong, it'd fail under heavy write conditions. I replaced the drive, copying the data that was on the old cache onto the new one. With the help of the appdata backup, I went through all the dockers and they look safe, with no errors being reported from them. Testing the services the dockers provide yields no problems.

On a side note, the GUI seems to report the right loads for the CPU now; not sure why that was screwing up because of the drive. As everything seems good now, I'm going to mark this as solved.
JorgeB Posted January 5, 2019

Glad you found the problem.
Ryonez Posted January 5, 2019 (Author)

As am I. Blimmin' heck, that was confusing and frustrating >.<
hpka Posted March 12, 2019

On 1/4/2019 at 7:53 PM, Ryonez said:
    So in the end, it was a faulty ssd.

Can you say how you identified the SSD as faulty, other than deduction? I'm having similar symptoms to yours (my Xeon E3-1220 v2 being maxed out by the usual dockers) and my SSD is one of the re-used components. This is one of the more useful threads I've seen on this matter.
Ryonez Posted March 12, 2019 (Author)

4 hours ago, hpka said:
    Can you say how you identified the SSD as faulty, other than deduction?

It was just pure deduction. All of the reporting tools were saying it was fine, but it kept having issues, which meant I had to look into it further. In the end, it was manually testing the device that allowed me to narrow it down to a point where I could make the ssd fail 100% of the time under one of my test conditions. At that point it doesn't matter what SMART says: I can reproduce a failure at will with something that is within its specs, proving the device is faulty.

Hardware failure can be tricky. Not everything is going to be as easily visible as a concrete red line somewhere. There will be times you have to just test for things yourself to try and figure it out.