6.12.6 Instability root cause identified : HC550 parity drive. Anything to do?


Solved by Terebi

I have been having crashes for months now, and only at night. I have done all the standard things: turning off all dockers and plugins, reinstalling from scratch, trying old versions, 24-hour memtests, etc. Nothing worked. Then I started pulling drives, and BOOM, back to stability.

 

The drive causing the problems is a WD Ultrastar DC HC550 16TB SATA 7200RPM HDD. But I think lots of people use this drive or something similar. It's being used in an Asustor Lockerstor 4 Gen2 AS6704T.

 

The instability ONLY happens at night, generally between 11pm and 3am my time. It does NOT consistently line up with mover or plex jobs, etc. It feels like it's some sort of unraid task kicking off, or some kind of hibernate in the drive after NAS usage drops off for the night. It NEVER happens during the day, even if I'm heavily streaming via plex, downloading data, or manually kicking off mover.
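For anyone wanting to check a pattern like this against actual numbers rather than by feel, here is a minimal sketch that buckets crash times by hour and counts how many land in the 11pm to 3am window. The timestamps in `CRASH_TIMES` are made-up placeholders, not taken from my logs, and the function names are just for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical crash times, invented for illustration only.
# Replace with the times your uptime monitor or syslog reports.
CRASH_TIMES = [
    "2024-01-20 23:47",
    "2024-01-21 01:47",
    "2024-01-22 02:15",
    "2024-01-23 14:05",
]


def in_night_window(ts, start_hour=23, end_hour=3):
    """True if ts is at or after 23:00, or before 03:00 (window wraps midnight)."""
    return ts.hour >= start_hour or ts.hour < end_hour


def summarize(times):
    """Return (crashes per hour, crashes inside the night window, total crashes)."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in times]
    by_hour = Counter(ts.hour for ts in parsed)
    night = sum(1 for ts in parsed if in_night_window(ts))
    return by_hour, night, len(parsed)


if __name__ == "__main__":
    by_hour, night, total = summarize(CRASH_TIMES)
    for hour in sorted(by_hour):
        print(f"{hour:02d}:00  {'#' * by_hour[hour]}")
    print(f"{night} of {total} crashes fell between 23:00 and 03:00")
```

If nearly everything lands in the window, that supports a time-of-day trigger over a load trigger, though it says nothing about what the trigger actually is.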

 

Pulling the drive completely gives stability. 

 

Having the drive in but not mounted at all gives stability. (I need to confirm this with a longer test, but I got to 48 hours last time before I mounted it in unassigned devices; see next item.)

 

Having the drive mounted in unassigned devices or as a data disk gives the nightly crash, even if it's not being used (no data on the drive).

 

Having the drive mounted as a parity drive gives the nightly crash UNLESS a parity check is being performed. (This imo is more evidence towards some sort of hibernate, but I need to confirm this one again too.)

 

Turning off spin-down does NOT resolve the issue

I have C-states disabled, and am not using S3 sleep or anything like it.

 

 

If it's a bad disk, I'll be disappointed in the wasted $, but it's not the end of the world. But because of the pattern of instability, I'm concerned it's not a bad disk but some kind of incompatibility or configuration issue. That makes me worried that other disks I purchase may show the same issues.

 

Other drives I have in the system that are not causing issues : 

Sandisk 3.2Gen1 30gb (unraid flash)

2x Samsung 970 EVO Plus NVME (btrfs cache)

MaxDigitalData 14tb drive (xfs data)

Seagate Ironwolf Pro 16tb (xfs data)

 

Currently running without parity as I have no disks big enough other than the problem child. 

 

Instability comes with a variety of symptoms, and there does not seem to be a strong pattern:

Sometimes I get a runaway CPU. When this happens it is often aligned with X:47, which is when the hourly cron jobs run, but it happens even with no plugins/dockers, so I don't think the cron jobs are really doing anything other than "waking up" and then noticing a problem. Via netdata/newrelic, the problem processes are "system", often shfs. It's never in docker or my containers.

 

Sometimes OOM errors in syslog

 

Sometimes just random kernel dumps

 

Often (but not always) when it crashes,  the system becomes unavailable to new network connections (confirmed via inbound and outbound uptime monitors). But during the pattern of instability, quite often docker continues to run for some time (sometimes hours) because newrelic/netdata continue to collect metrics, and sometimes I can see the arr ecosystem has continued to process feeds and downloads. But other times it crashes hard and there is an immediate cutoff of everything at the same time. 
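To get a rough tally of which symptom shows up how often, something like the sketch below could be run over a saved syslog. It only looks for a few well-known kernel strings ("Out of memory", "oom-killer", "Kernel panic", "Call Trace:", "BUG:"); the `SAMPLE` lines are invented placeholders in a traditional syslog layout, not real output from my server, and the categories are my own assumption about what is worth counting.

```python
import re
from collections import Counter

# Substrings the Linux kernel commonly logs for these events.
# The grouping into two labels is an illustrative choice, not a standard.
PATTERNS = {
    "oom": re.compile(r"Out of memory|oom-killer", re.IGNORECASE),
    "kernel": re.compile(r"Kernel panic|Call Trace:|BUG:"),
}


def classify(line):
    """Return the first matching label for a log line, or None."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return None


def tally(lines):
    """Count how many lines fall under each label."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


# Made-up sample lines for illustration only.
SAMPLE = [
    "Jan 20 23:47:02 Tower kernel: example invoked oom-killer",
    "Jan 20 23:47:03 Tower kernel: Out of memory: Killed process 1234 (example)",
    "Jan 21 01:12:40 Tower kernel: Call Trace:",
    "Jan 21 01:13:00 Tower example: unrelated message",
]

if __name__ == "__main__":
    for label, count in sorted(tally(SAMPLE).items()):
        print(f"{label}: {count}")
```

To use it on a real log, pass an open file instead of `SAMPLE`, since `tally` accepts any iterable of lines. A hard lockup that never reaches the log will of course not be counted.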

 

Attached diagnostics are without the drive, as I'm confirming that I get multiple days of stability without it.

tower-diagnostics-20240126-0953.zip

 

4 hours ago, JorgeB said:

I would suggest testing a little longer without the drive to confirm that is really the problem, it would be a strange one, but I guess stranger things have happened.

Will do. The content on the server is replaceable, so I'm comfortable running without parity for a while. I'm at 48 hours of uptime now, and I'll let it run through at least the weekend before I touch it again. But the crashes prior were VERY reliable.

  • 2 weeks later...
  • Solution
On 1/26/2024 at 10:47 AM, JorgeB said:

I would suggest testing a little longer without the drive to confirm that is really the problem, it would be a strange one, but I guess stranger things have happened.

 

Well, you were right. I now think the problem was actually RAM. Not bad RAM, but I had (foolishly, I know) mixed the original 4GB stick with a 16GB add-on. In retrospect I know RAM should be matched; I was just being dumb. With either stick on its own (but not both together) the system is fine.

 

The weird part is still the pattern of instability, though: only that overnight crash. If the RAM was causing problems, I would expect the instability to show up at times of heaviest usage.
