January 7, 2017 Something must be going on... I have another dual drive failure. I will now dive into the diagnostics. Any help is appreciated. tower-diagnostics-20170107-1322.zip
January 7, 2017 Community Expert Looks to me like one of the SASLP controllers crashed, resulting in timeouts and 2 disks dropping offline. You should reboot to check the SMART reports, but they should be OK. I've seen this recently with another user, and it also happened to me once a couple of weeks ago with the same controller. Maybe some kernel/driver issue?
January 7, 2017 Author It really sounds like that. I need to finish a big copy and will then reboot. How do I re-add them at that point?
January 7, 2017 Community Expert After rebooting and checking SMART:
- stop array
- unassign parity and disk7
- start array
- stop array
- reassign parity and disk7
- start array to begin parity sync and data rebuild
January 7, 2017 Author Thanks, I will do so. I did some more checking and noticed that the errors started quite suddenly after an invocation of the OOM killer. Could it be that an OOM event stops a process that is key to unRAID functioning? Something like the driver for the card?
Jan 7 07:30:28 Tower kernel: Out of memory: Kill process 21500 (netdata) score 1001 or sacrifice child
Jan 7 07:30:28 Tower kernel: Killed process 21546 (python) total-vm:277060kB, anon-rss:12920kB, file-rss:0kB, shmem-rss:0kB
Jan 7 07:30:28 Tower kernel: oom_reaper: reaped process 21546 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Some Python process appears to have been reaped. I don't know what it is. I checked, and my first dual drive failure was not preceded by an OOM event, so maybe it is unrelated. Link to my first dual drive failure: http://lime-technology.com/forum/index.php?topic=55008.msg525059#msg525059
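For anyone wanting to pull these events out of a syslog quickly, here is a rough Python sketch. It only matches the two message shapes quoted in the post above; the wording of OOM messages differs between kernel versions, so the patterns are an assumption, and `find_oom_events` is a made-up helper name, not anything unRAID ships.

```python
import re

# Patterns for the two kernel messages quoted in the post above.
# Assumption: other kernel versions may word these lines differently.
SELECTED_RE = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")
KILLED_RE = re.compile(r"kernel: Killed process (\d+) \(([^)]+)\)")


def find_oom_events(lines):
    """Return (kind, pid, name) tuples for OOM selection and kill lines."""
    events = []
    for line in lines:
        for kind, pattern in (("selected", SELECTED_RE), ("killed", KILLED_RE)):
            match = pattern.search(line)
            if match:
                events.append((kind, int(match.group(1)), match.group(2)))
    return events


# The three log lines from the post, copied verbatim.
SAMPLE = [
    "Jan 7 07:30:28 Tower kernel: Out of memory: Kill process 21500 (netdata) score 1001 or sacrifice child",
    "Jan 7 07:30:28 Tower kernel: Killed process 21546 (python) total-vm:277060kB, anon-rss:12920kB, file-rss:0kB, shmem-rss:0kB",
    "Jan 7 07:30:28 Tower kernel: oom_reaper: reaped process 21546 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB",
]

if __name__ == "__main__":
    for kind, pid, name in find_oom_events(SAMPLE):
        print(kind, pid, name)
```

On these lines it reports netdata (21500) as the process the kernel selected and python (21546) as the process actually killed, which fits the "or sacrifice child" wording in the first message.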
January 7, 2017 Community Expert The controller crashed a few hours later, so I don't think it's related, but I could be wrong.
January 7, 2017 Author Think I found it: it's a netdata process, so indeed not related. But a controller crash by itself points towards a driver or controller issue, I would think, not towards a drive issue. I am very anxious to see the SMART values for the two drives. It's going to be a few hours before I can reboot.
January 7, 2017 Author
- Just rebooted (put VMs and Dockers on no-autostart to speed things up)
- Checked SMART values: no pending sectors or offline uncorrectables on the parity drive or the data drive
- Stopped array, unassigned both failed drives
- Started array (weird: the parity drive came up assigned and in working condition, no parity build was started!)
- Stopped array (weird: now the parity drive is unassigned again!)
- Reassigned both drives
- Started array (parity sync is starting and data rebuild is running on disk7)
Guess we'll wait! I'll leave the Dockers and VMs off for the night just to keep the array quiet.
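The SMART check described above (no pending sectors, no offline uncorrectables) could be scripted roughly like this. This is only a sketch: it assumes you pass it the text printed by `smartctl -A` for one drive, that the drive reports the attributes under the usual smartmontools names `Current_Pending_Sector` and `Offline_Uncorrectable` (some drives name them differently or omit them), and the helper names are hypothetical.

```python
# Attribute names to watch. Assumption: the drive reports them under these
# smartmontools names; some models use different names or omit them.
WATCHED = {"Current_Pending_Sector", "Offline_Uncorrectable"}


def watched_raw_values(smartctl_output):
    """Map each watched attribute name to its integer RAW_VALUE."""
    found = {}
    for line in smartctl_output.splitlines():
        cols = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(cols) >= 10 and cols[0].isdigit() and cols[1] in WATCHED:
            raw = cols[9]
            if raw.isdigit():
                found[cols[1]] = int(raw)
    return found


def looks_clean(smartctl_output):
    """True only if every watched attribute is present and its raw value is 0."""
    values = watched_raw_values(smartctl_output)
    return WATCHED.issubset(values) and all(v == 0 for v in values.values())
```

A drive whose output lacks either attribute is reported as not clean on purpose, so a missing value is never mistaken for a zero.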
January 7, 2017 Anytime the kernel starts killing things due to out of memory, bad things can happen. The OOM killer is NOT smart. It is just trying to keep the kernel running; anything in userland is fair game. It would be good to find out what is consuming memory and tune it or add memory.
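One rough way to see what is consuming memory is to rank processes by resident set size. The Python sketch below reads the `Name` and `VmRSS` fields from `/proc/<pid>/status` on Linux; `top_rss` and its `proc_root` parameter are hypothetical names used for illustration, and RSS is only an approximation of what the OOM killer weighs when it picks a victim.

```python
import os


def top_rss(proc_root="/proc", limit=5):
    """Return (rss_kb, pid, name) tuples for the largest resident processes."""
    if not os.path.isdir(proc_root):
        return []
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        fields = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                for line in fh:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            continue  # process exited while we were scanning
        rss = fields.get("VmRSS")  # kernel threads have no VmRSS line
        if rss is None:
            continue
        rows.append((int(rss.split()[0]), int(entry), fields.get("Name", "?")))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss_kb, pid, name in top_rss():
        print(f"{rss_kb:>10} kB  {pid:>7}  {name}")
```

Run periodically, something like this would show whether one process keeps growing before the kernel starts killing things.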
January 7, 2017 Author I know. It was Netdata. I already removed the Docker. It looks nice, but I wasn't really using it.