Another dual drive failure

January 7, 2017

Something must be going on: I have another dual drive failure. I'll dig into the diagnostics now. Any help is appreciated.

tower-diagnostics-20170107-1322.zip

January 7, 2017

Community Expert

Looks to me like one of the SASLP controllers crashed, resulting in timeouts and 2 disks dropping offline. You should reboot to check the SMART reports, but they should be OK.

I've seen this recently with another user, and it also happened to me once a couple of weeks ago with the same controller. Maybe some kernel/driver issue?

January 7, 2017

Author

It really sounds like that. I need to finish a big copy and will then reboot.

How do I re-add them at that point?

January 7, 2017

Community Expert

After rebooting and checking SMART:

stop array

unassign parity and disk7

start array

stop array

reassign parity and disk7

start array to begin parity sync and data rebuild

January 7, 2017

Author

Thanks. I will do so.

I did some more checking and noticed that the errors started quite suddenly after the OOM killer was invoked. Could an OOM event stop a process that is key to Unraid functioning, something like the driver for the card?

Jan  7 07:30:28 Tower kernel: Out of memory: Kill process 21500 (netdata) score 1001 or sacrifice child
Jan  7 07:30:28 Tower kernel: Killed process 21546 (python) total-vm:277060kB, anon-rss:12920kB, file-rss:0kB, shmem-rss:0kB
Jan  7 07:30:28 Tower kernel: oom_reaper: reaped process 21546 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

Some Python process appears to have been reaped. Dunno what it is.

I checked, and my first dual drive failure was not preceded by an OOM event, so maybe it is unrelated.
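For anyone reading along, here is a rough Python sketch of how those three kernel lines could be picked apart to see which process was selected and which one was actually killed. The regular expressions are only modeled on the lines quoted above (the wording can differ between kernel versions), and `summarize_oom` is a made-up helper name, not part of any tool.

```python
import re

# Patterns modeled on the kernel log lines quoted in this post (an assumption:
# other kernel versions may word these messages differently).
SELECTED = re.compile(r"Out of memory: Kill process (\d+) \((.+?)\) score (\d+)")
KILLED = re.compile(r"Killed process (\d+) \((.+?)\) total-vm:(\d+)kB, anon-rss:(\d+)kB")


def summarize_oom(lines):
    """Return one dict per log line that reports an OOM selection or kill."""
    events = []
    for line in lines:
        match = SELECTED.search(line)
        if match:
            events.append({
                "kind": "selected",
                "pid": int(match.group(1)),
                "name": match.group(2),
                "score": int(match.group(3)),
            })
            continue
        match = KILLED.search(line)
        if match:
            events.append({
                "kind": "killed",
                "pid": int(match.group(1)),
                "name": match.group(2),
                "anon_rss_kb": int(match.group(4)),
            })
    return events


# The three lines from the syslog excerpt above.
SAMPLE = [
    "Jan  7 07:30:28 Tower kernel: Out of memory: Kill process 21500 (netdata) score 1001 or sacrifice child",
    "Jan  7 07:30:28 Tower kernel: Killed process 21546 (python) total-vm:277060kB, anon-rss:12920kB, file-rss:0kB, shmem-rss:0kB",
    "Jan  7 07:30:28 Tower kernel: oom_reaper: reaped process 21546 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB",
]

if __name__ == "__main__":
    for event in summarize_oom(SAMPLE):
        print(event)
```

On this excerpt the selected process is netdata (PID 21500) while the process actually killed is python (PID 21546), which would fit the "or sacrifice child" wording if the Python process was a child of netdata.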

Link to my first dual drive failure:

http://lime-technology.com/forum/index.php?topic=55008.msg525059#msg525059

January 7, 2017

Community Expert

The controller crashed a few hours later, so I don't think it's related, but I could be wrong.

January 7, 2017

Author

Think I found it: it's a netdata process, so indeed not related. But a controller crash by itself points towards a driver or controller issue, I would think, not towards a drive issue. I am very anxious to see the SMART values for the two drives. It's going to be a few hours before I can reboot.

January 7, 2017

Author

- Just rebooted (set VMs and Dockers to no autostart to speed things up)

- Checked SMART values:

- no pending sectors or offline uncorrectables on the parity drive or the data drive

- stopped array, unassigned both failed drives

- started array (weird: the parity drive came up assigned and in working condition, and no parity build was started!)

- stopped array (weird: now the parity drive is unassigned again!)

- reassigned both drives

- started array (parity sync is starting and data rebuild is running on disk7)

Guess we'll wait! I'll leave the Dockers and VMs off for the night just to keep the array quiet.
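In case it helps someone doing the same SMART check from the command line, below is a small Python sketch that pulls the pending-sector and offline-uncorrectable raw values out of `smartctl -A` output. It assumes smartmontools is installed and that the drive reports the usual ATA attribute table; `raw_values` and `read_attributes` are hypothetical helper names, and the sample rows are invented for illustration, not taken from the drives in this thread.

```python
import subprocess

# The two attributes mentioned in the checklist above.
WATCHED = ("Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(attr_table, names=WATCHED):
    """Extract raw values for the named attributes from `smartctl -A` text."""
    found = {}
    for line in attr_table.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in names:
            try:
                found[parts[1]] = int(parts[9])
            except ValueError:
                found[parts[1]] = parts[9]  # keep non-numeric raw values as text
    return found


def read_attributes(device):
    """Run smartctl on a device such as /dev/sdb (needs root and smartmontools)."""
    result = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
    return result.stdout


# Made-up sample rows in the usual ATA layout, NOT from the drives in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(raw_values(SAMPLE))
```

On a live system something like `raw_values(read_attributes("/dev/sdb"))` would be the equivalent call, with the device name swapped for the actual parity or data disk. Nonzero raw values on either attribute would be a reason to look harder at the drive itself rather than the controller.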

January 7, 2017

Anytime the kernel starts killing things because it is out of memory, bad things can happen. The OOM killer is NOT smart. It is just trying to keep the kernel running, and anything in userland is fair game. It would be good to find out what is consuming memory and either tune it or add memory.
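One rough way to see what is consuming memory is to rank processes by resident set size. The Python sketch below reads `/proc/<pid>/status` on Linux and sorts by the `VmRSS` field; `parse_status` and `top_rss` are hypothetical helper names, and RSS is only an approximation of real memory pressure since shared pages are counted in every process that maps them.

```python
import glob


def parse_status(text):
    """Return (name, rss_kb) from the text of a /proc/<pid>/status file."""
    name, rss_kb = None, 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            # Format is "VmRSS:    12920 kB"; kernel threads have no VmRSS line.
            rss_kb = int(line.split()[1])
    return name, rss_kb


def top_rss(limit=10):
    """List the processes with the largest resident set size, biggest first."""
    rows = []
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as handle:
                name, rss_kb = parse_status(handle.read())
        except OSError:
            continue  # process exited while we were scanning
        pid = int(path.split("/")[2])
        rows.append((rss_kb, pid, name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss_kb, pid, name in top_rss():
        print(f"{rss_kb:>10} kB  {pid:>7}  {name}")
```

Running it a few times while the server is under its normal load should show whether one process keeps growing.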

January 7, 2017

Author

I know. It was Netdata. I already removed the Docker. It looks nice, but I wasn't really using it.

January 8, 2017

Author

I'm at 61% on the rebuild, so we are getting there.
