zAdok Posted April 21, 2017
System became unresponsive; I was able to reboot cleanly. The system booted, but the web interface was extremely slow and the cache SSDs were missing. Stopped the array and did another clean reboot. The cache drives returned, but Docker will not start. zadok-nas-diagnostics-20170421-2050.zip
JorgeB Posted April 21, 2017
Docker image is corrupt, delete and recreate.
zAdok Posted April 21, 2017
Yeah, that's what I was thinking. I've done this a few times in the past, and replaced the SSDs as I thought they might have been causing it. Any ideas as to what else could be causing this to happen?
JorgeB Posted April 21, 2017
These are usually the result of device errors; on SSDs they are mostly caused by cable issues. The logs are very new, so there's nothing else there, but you can use btrfs device stats to find out if there were previous problems: btrfs device stats -z /mnt/cache. All values should be 0; if not, there were previous issues. The -z flag shows the current values and then resets them all to 0, so keep monitoring for a few weeks (without -z), and if errors appear again, grab the diagnostics before rebooting.
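The counter check above can be scripted so any non-zero counter stands out. A minimal sketch, assuming a pool mounted at /mnt/cache; the device name and counter values below are hypothetical sample output, not from this system:

```shell
# In practice you would pipe the real command's output into awk:
#   btrfs device stats /mnt/cache | awk '$NF != 0 {print "non-zero:", $0}'
# Hypothetical sample output from `btrfs device stats` for illustration:
sample_output='[/dev/sdb1].write_io_errs    12
[/dev/sdb1].read_io_errs     3
[/dev/sdb1].flush_io_errs    5
[/dev/sdb1].corruption_errs  0
[/dev/sdb1].generation_errs  0'

# Print only the counters that are not zero (the last field is the count).
echo "$sample_output" | awk '$NF != 0 {print "non-zero:", $0}'
```

Run without -z this is safe to repeat daily; the counters are cumulative until explicitly reset.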
zAdok Posted April 21, 2017
Thanks, heaps of write, read, and flush IO errors on one of the SSDs. I'll replace that cable and keep an eye on it.
zAdok Posted April 24, 2017
OK, so this just happened again; this time I was able to run diagnostics. The GUI reported that one cache disk was missing and that the remaining cache disk was running hot. The one that was missing was NOT the one with the reported errors last time. btrfs device stats /mnt/cache reports no errors at all since I last cleared them a few days back. Docker didn't corrupt this time. zadok-nas-diagnostics-20170424-2019.zip
JorgeB Posted April 24, 2017 (edited)
The SSD dropped offline; on SSDs this is usually caused by a cable issue. Since the SSD is on an LSI controller that doesn't support trim, I would recommend swapping cables with one of the disks connected to the onboard controller. Edited April 24, 2017 by johnnie.black
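Whether discard (TRIM) actually passes through the current controller can be checked with `lsblk --discard`: a nonzero DISC-GRAN/DISC-MAX means the device accepts discards on that path. A minimal sketch; the device name and output below are hypothetical, not from this system:

```shell
# In practice you would run:
#   lsblk --discard /dev/sdX    # replace sdX with the cache SSD
# Hypothetical sample output for illustration:
sample='NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sdb         0      512B       2G        0'

# A device behind a controller that strips TRIM shows 0B in DISC-GRAN.
echo "$sample" | awk 'NR > 1 && $3 != "0B" && $3 != 0 {print $1 ": discard supported"}'
```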
zAdok Posted April 24, 2017
OK, cheers. Thanks for the helpful feedback.
zAdok Posted April 25, 2017
So this morning I moved the SSDs onto the onboard controller as advised. After a few hours of running I started getting temperature alarms from the SSDs again. I stopped the array and the temps immediately dropped 20 degrees Celsius. Any ideas why this is happening? zadok-nas-diagnostics-20170425-1055.zip
JorgeB Posted April 25, 2017
Just a quick look, but I don't see any issues in the logs. What temperature thresholds are the SSDs set to? It's normal for them to get over the 45C default for all devices when they are under heavy writes; it varies from model to model, but I usually set mine to 55C warn / 60C critical.
zAdok Posted April 25, 2017
Cache2 hit 71, Cache1 was at 63 I think.
JorgeB Posted April 25, 2017
That's too high, especially cache2; it's above the max temp for those models: Operating temperature: 32°F to 158°F (0°C to 70°C). That has nothing to do with the controller they're on, though, and may not be the reason for the earlier drop-outs, but you want to improve cooling to keep them below 60C.
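Drive temperature can also be watched from the command line via SMART attribute 194 and compared against the thresholds discussed above. A minimal sketch; the attribute line and the 61C reading below are hypothetical sample data, not from this system:

```shell
# In practice you would read the raw value with smartctl:
#   smartctl -A /dev/sdX | awk '/Temperature_Celsius/ {print $10}'
# Hypothetical SMART attribute line for illustration:
sample='194 Temperature_Celsius 0x0022   039   028   000    Old_age   Always       -       61'

# The raw temperature is the last field; warn above a 60C critical threshold.
temp=$(echo "$sample" | awk '{print $NF}')
if [ "$temp" -gt 60 ]; then
    echo "WARN: SSD at ${temp}C (above 60C critical)"
fi
```

The same check dropped into a cron job gives an alert channel independent of the web GUI, which is useful here since the GUI itself became slow during these incidents.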