Help! Disks kept busy and SMB stops working


I've recently put together this unRAID system which is intended to be a media and application server along with providing a virtual HTPC to my media room.

It all runs swimmingly for a little while, but after 1-3 days I will typically find my SMB shares, the unRAID web interface, one of the Docker web interfaces, or some combination of them in a non-responsive state.

 

There are some files being held open (usually by SMBD or SHFS), and while I can unmount some of the drives, there will typically be some which are held busy by one of those rogue processes, and I cannot kill the offending process.  This prevents me from shutting down nicely, even using the powerdown plugin, but so far I haven't lost any data (or even been forced to do a parity check on startup).  Powerdown -r says it's halting the system, but after that nothing happens. I can still log in to the system using SSH.  My only recourse is to hold down the power button to restart.
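
For anyone trying to narrow down which process is holding a disk busy, here is a minimal Python sketch that walks /proc and lists the processes with open files under a given mount point. It is only an illustration, not an unRAID tool: the function names are made up for this example, /mnt/disk1 is an assumed mount point, and it only looks at open file descriptors (not working directories or memory maps). It generally needs to run as root to see other users' processes.

```python
import os

def path_is_under(path, mount_point):
    """Return True if path equals mount_point or lies beneath it."""
    mount_point = mount_point.rstrip("/") or "/"
    if mount_point == "/":
        return path.startswith("/")
    # Require a "/" boundary so /mnt/disk10 does not match /mnt/disk1.
    return path == mount_point or path.startswith(mount_point + "/")

def open_files_under(mount_point, proc_root="/proc"):
    """Map (pid, command name) -> list of open paths under mount_point."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as f:
                name = f.read().strip()
            fd_names = os.listdir(os.path.join(pid_dir, "fd"))
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fd_names:
            try:
                target = os.readlink(os.path.join(pid_dir, "fd", fd))
            except OSError:
                continue  # descriptor was closed while we were looking
            if path_is_under(target, mount_point):
                holders.setdefault((int(entry), name), []).append(target)
    return holders

if __name__ == "__main__":
    # "/mnt/disk1" is an assumed mount point; substitute the busy disk.
    for (pid, name), paths in sorted(open_files_under("/mnt/disk1").items()):
        print(pid, name, len(paths), "open file(s), e.g.", paths[0])
```

If smbd or shfs shows up in the output, that at least confirms which files are pinning the disk, even if the process itself cannot be killed.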

 

The only thing I can see in the log is a couple of mce [Hardware Error]s at boot time.  But the problem usually occurs much later and doesn't crash the system at all.  I've run memtest on the system for over 24 hours and I'm running everything at normal clock rates (not overclocked).

 

I have 1 Windows 10 VM (with a video card and a USB bus passed through)

I have 3 dockers running:  Mariadb, SABNZBd and SickBeard

I have 2 plugins loaded:  Community Applications, Powerdown

 

The system is pretty beefy too: 

cpu: i7 5820K,

ram:32 GB

mobo: ASRock X99 Extreme 6

video: GTX 970

cache: 2x 240GB SSDs

parity: 1x 6TB

data: 2x 6TB, 2x 4TB, 1x 2TB

unraid-diagnostics-20160422-0749.zip

Link to comment

Adam64 -- how did you discover that it was an Intel i218V ethernet issue, and was there a fix (other than adding a new network card)? I too have a v6 system that becomes unusable every few days (opened a new post on it this morning), and I have an ASRock motherboard in it (like the original poster here), but I'm not sure what onboard ethernet it has.

Thanks.

Link to comment

Just checked and I have the Z77 Extreme 4 MB. Does that use the same chipset?

 

03:00.0 Ethernet controller [0200]: Broadcom Corporation NetLink BCM57781 Gigabit Ethernet PCIe [14e4:16b1] (rev 10)

Subsystem: ASRock Incorporation Z77 Extreme4 motherboard [1849:96b1]

Kernel driver in use: tg3

Kernel modules: tg3
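
For anyone else unsure which onboard NIC and driver they have, output like the above can be pulled apart with a short script. This is a rough Python sketch, assuming the listing came from something like `lspci -nnk` (the exact command is not stated in the post); the function name `ethernet_drivers` is made up for this example.

```python
import re
import subprocess

# A device line starts with a PCI slot such as "03:00.0" (optionally "0000:03:00.0").
SLOT = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s+(.*)$", re.I)

def ethernet_drivers(lspci_text):
    """Return a list of (device description, driver in use) for Ethernet controllers."""
    results = []
    current = None  # description of the Ethernet device being read, if any
    for raw in lspci_text.splitlines():
        line = raw.strip()
        match = SLOT.match(line)
        if match:
            desc = match.group(1)
            current = desc if desc.startswith("Ethernet controller") else None
        elif current and line.startswith("Kernel driver in use:"):
            results.append((current, line.split(":", 1)[1].strip()))
    return results

if __name__ == "__main__":
    # Assumes lspci is installed; -nn adds numeric IDs, -k shows kernel drivers.
    text = subprocess.run(["lspci", "-nnk"], capture_output=True, text=True).stdout
    for desc, driver in ethernet_drivers(text):
        print(driver, "<-", desc)
```

On the listing above this would report the tg3 driver for the Broadcom BCM57781, which is a different controller from the Intel i218V.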

Link to comment

I had similar issues when I was using my onboard Intel i218V ethernet.  Which one are you using?

 

So I just checked and saw that in fact I was using the top RJ45 jack - which is the Atheros one -- so I don't think this is my issue.

 

Also, it seemed pretty unlikely at any rate since the system was still responding just fine over ssh.  It's just that some of the disk devices were being kept busy by something and the web interface stopped responding (which also made SMB stop responding).

 

Still haven't resolved this issue, and I'm wondering if this is a v6 issue as I don't seem to be the only one.  I'm not sure exactly what I would lose by downgrading to v5, but I'm considering it as I'd like to resolve these issues.

Link to comment

No - your motherboard uses the Z77 chipset.  What CPU are you running?

Link to comment

I have the same problem on a beta 21 PC with an AMD config:

 

Gigabyte Technology Co., Ltd. - 990FXA-UD3

CPU: AMD FX-8350 Eight-Core @ 4000

HVM: Enabled

IOMMU: Enabled

Cache: 384 kB, 8192 kB, 8192 kB

Memory: 24576 MB (max. installable capacity 32 GB)

Network: bond0: fault-tolerance (active-backup), mtu 1500

eth0: 1000Mb/s, Full Duplex, mtu 1500

Kernel: Linux 4.4.6-unRAID x86_64

OpenSSL: 1.0.2g

 

Exact same symptoms.  I end up having to hard reset the box since it will not soft reset from the command line.  My NIC is a Realtek RTL8111E chip (10/100/1000 Mbit).

Link to comment

I had a similar issue as well.  The web GUI locked up, but I could still ssh in, so I did get my diagnostics.  There were several processes locked up that I couldn't kill.  I ended up trying to reboot, but that didn't work, so I did a hard reset.  Parity and array drives were fine after reboot.  It may have been coincidence, but the system froze shortly after spinning down hard drives.  I can post my diagnostics when I get home if you want, or start a new thread.  But it sounds related, as I was also downloading a decent amount via dockers: I was running couchpotato, sonarr, deluge, and plex.  And on top of that I was running the file integrity plugin.  I have since turned off the cache drive for my downloads folder, disabled file integrity auto hashing, and set spin delay to never.  I will monitor to see if that had anything to do with it.

 

My M/B uses Intel 82579LM and 82574L LAN.

 

Config

6.2.0-beta21

M/B: Supermicro - X9SCL/X9SCM

CPU: Intel® Xeon® CPU E31230 @ 3.20GHz

HVM: Enabled

IOMMU: Enabled

Cache: 256 kB

Memory: 16384 MB (max. installable capacity 32 GB)

Network: bond0: fault-tolerance (active-backup), mtu 1500

eth0: 1000Mb/s, Full Duplex, mtu 1500

eth1: 1000Mb/s, Full Duplex, mtu 1500

Kernel: Linux 4.4.6-unRAID x86_64

OpenSSL: 1.0.2g

Link to comment

What type of drives do you have?    I had stability issues until I set my WD 6TB Red drives to never spin down.  This setting does not seem to be necessary for the other drives in my system.
Link to comment

Hmmm, that's pretty interesting.  I have a mix of drives, and 3 of them are WD 6TB Red drives.  I've noticed that sometimes it's not the Red drives which are held 'busy' by some process, but I suppose since my parity drive IS a Red drive this *might* be my issue...

 

UPDATE:  I just checked, and the Spin Down Delay on all my drives is set to "Use Default", which is set to Never.  So that does not appear to be it.

Link to comment

I have an older 60GB OCZ SSD for a cache drive, 3x 1TB WD Blacks, and a couple of 500GB Seagates for the array.  I know 3TB usable is not much, but I am still testing unRAID while I wait for some sales on newer drives.  I have 5x 3TB Seagate drives but 3 of them died, so I am looking to replace them with something better.

 

I was surprised to see the default spindown delay was set to never.  I changed it to 2 hours and then the server froze, but I was also going heavy on the downloading trying to restore my collection.  I don't even know why the drives were spun down if there were still processes accessing them.

Link to comment

So I'm still having this issue.  If I'm just reading files off the file system (or doing light writes) everything seems fine for days (I had a previous uptime of 7 days).

However, it seems that large writes cause the problem (if I queue up some big downloads in SABnzbd, this happens).

Again,  one or more disks are held busy by a process (usually SMBD or SHFS) and I can't kill them.
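
A process that ignores kill is often sitting in uninterruptible sleep (state "D" on Linux), waiting on I/O that never completes. Whether that is what is happening here is only an assumption, but it is easy to check. Below is a minimal Python sketch that reads /proc/<pid>/stat and lists processes in that state; the function names are made up for this example.

```python
import os

def parse_stat(stat_text):
    """Return (pid, command name, state letter) from /proc/<pid>/stat contents."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    name = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, name, state

def uninterruptible_processes(proc_root="/proc"):
    """Return a sorted list of (pid, name) for processes in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, name, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the file was unreadable
        if state == "D":
            stuck.append((pid, name))
    return sorted(stuck)

if __name__ == "__main__":
    for pid, name in uninterruptible_processes():
        print(pid, name, "is in uninterruptible sleep (D)")
```

Processes pass through state D briefly during normal I/O, so it is worth running this a few times; something like smbd or shfs that stays in the list across runs would point toward a hung I/O path rather than a misbehaving application.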

 

Yesterday I restarted my server after such an occurrence and (from a Windows machine) moved a large directory (about 15GB) from one user share to another.  Halfway through the process it hung.  SMB stopped responding.  I was able to SSH into the machine, but I was unable to stop the move.  I ended up having to restart.

 

Could this be the problem?  Moving large files around?  SABnzbd does this when it completes a download...  so does CouchPotato...

 

Anyone?  I'm pretty desperate to resolve this.

Link to comment

Hmm... that sounds a bit similar to what is happening with me on 6.2.0 beta 21. I managed to pinpoint that it's due to something hanging in the array but couldn't figure out what. dAigo seems to have a similar issue too.

 

You might want to try 6.1.9 to see if it works.

Link to comment
