V6.1 Custom Build - Tuning and Upgrade advice please



Hi Folks,

 

I have been using unraid for a few years now.  Migrated from windows home server V1.

 

This server is primarily used for a photographic business, with some home media server and backup duties

 

I am getting some funny behavior that I can't nail down.

 

Every once in a while the server stops responding to SMB requests.  I can still reach the web GUI, but it stops responding about a minute after I start using it (normally the web GUI is fine).  I can then PuTTY into it, but any attempt to restart it from the command line fails; it just sits there.  If OpenELEC is playing a video via SMB during all this, it keeps playing without issue, but any attempt to browse the file shares just hangs (Windows 10 does not report an error when browsing from the laptop, it just sits there waiting).

 

The only way to get the server back is a hard boot.  Pressing the power button does not result in a proper shutdown (as it normally would for the server outside this scenario)

 

This same behaviour started on V5, at which point i suspected a hardware failure (8 year old machine) so i replaced everything but the drives.

 

I am hoping you can give me some insights, based on the attached diagnostic file and the description above, as to whether I have any suboptimal configuration at play.

 

Current Hardware:

 

ASRock H97 Performance

Enermax psu

Intel Pentium CPU G3260

8GB Ram

2X Addonics ad4sa6gpx2

Intel CT Gigabit network adaptor

80gb intel 520 ssd cache

240 gb intel 535 ssd cache

WD Red 3TB parity

WD Red 2TB data

2x WD Red 3TB data

SanDisk Cruzer USB

 

The other thing i'd like advice on is where to go next with this server.... cpu, nic, controller card?  I'm getting around 110MB/s on parity calc.

 

Current plan is a 240gb ssd for cache pool, and 5tb drive to replace current parity and shift that to storage pool.
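For a rough sense of what the 110MB/s figure means for the planned 5TB parity drive, here is a back-of-envelope sketch. It assumes the average speed holds for the whole pass (real drives slow down toward the inner tracks, so treat the result as a lower bound on time) and uses decimal units. The function name is just for illustration.

```python
def parity_check_hours(parity_size_tb: float, avg_speed_mb_s: float) -> float:
    """Estimate hours for one full pass over the parity drive (decimal TB and MB)."""
    total_bytes = parity_size_tb * 1e12
    seconds = total_bytes / (avg_speed_mb_s * 1e6)
    return seconds / 3600


# Current 3TB parity versus the planned 5TB parity, both at the quoted 110MB/s.
for size_tb in (3, 5):
    print(f"{size_tb}TB at 110MB/s: about {parity_check_hours(size_tb, 110):.1f} hours")
```

Under that assumption the parity drive size, not the controller, sets the floor on how long a check takes.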

 

 

Thanks, I know everyone's time is valuable; hopefully I have given the right info up front.

solar-diagnostics-20160409-1338.zip

Link to comment

 


 

 

 

There have been issues with Win10 and Samba recently.  (Win10 made some changes to their version of SMB and the Linux Samba team is playing catch-up.)  Apparently, the latest beta version has corrected them.  While I don't recommend going to a beta for a working server, there are some unRAID workarounds for earlier versions if you go looking for them.

Link to comment

Some comments, first in general, then from the syslog -

 

* You say it started in v5, which is interesting.  That would mean the problem has to be something that is common to both, but that's not easy as there's so little in common between v5 and v6 (32 bit vs 64bit).  In general, that would seem to eliminate the Linux kernel, the unRAID software, anything VM related, and anything Docker related.  And because you have replaced the hardware, you can eliminate the hardware too.  Even plugins are different, although I suppose it could be a plugin that is almost identical in function.  Or perhaps a faulty application (plugin before, but could be plugin, Docker, or VM now), that while different code and in a different environment, can misbehave in the same way.  Or as mentioned a faulty protocol, that behaves as badly in either version...

 

* Your syslog shows a strange motherboard, one I haven't seen before.  It appears at first as a typical Intel board, H97 based, but it has NO Intel SATA motherboard controller or ports!  The only disk SATA controllers are a pair of Marvell 9230's (which make me nervous lately), and they are a little odd too.  Each Marvell 9230 reports that it is an 8 port SATA controller, and all 8 ports are setup with 8 SCSI and ATA channels.  But the 8th port on each turns out to have a Marvell virtual console device attached, which causes a few initial issues.  Then you have connected drives only to the first 4 ports on each, 5th, 6th, and 7th are unconnected on each, which makes me wonder if it actually looks like 4 port cards, or maybe 6 port or 7 port?  Anyway, there are no other issues involved with these, so I'm off-topic, just curious.

 

* You have a dual Intel NIC, but the cable appears to be on the second network port, because eth0 is not working!  Only eth1 is working, which would normally mean you have no networking (unRAID uses eth0).  Bonding is off, but bridging is on, and apparently unRAID is using the bridge and passing all packets over eth1.  I wonder if it would work better if the cable was moved to the other connector.

 

* You have it set to use DHCP, so every 12 hours it renews the IP lease, which also resets the bridging.  But something with that isn't working correctly, because the first time it renews, something is not right and it begins spewing error messages to the syslog, a LOT of them.  A sample -

Apr  8 04:39:04 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

Apr  8 04:40:09 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

Apr  8 04:41:13 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

It repeats to the end of the syslog, several hundred KB of them.  They may be harmless, but a nuisance.  I believe that if you change to a static IP, the same 192.168.1.85 that you are already getting, then you would remove all of that from the syslog.  Will it fix other problems?  I doubt it, but who knows.
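To see how much of the syslog this one message accounts for, a small script along these lines could help. It is only a sketch: the regular expression is written against the sample lines above, and the file name in the final comment is hypothetical.

```python
import re
from collections import Counter

# Written against lines such as:
#   Apr  8 04:39:04 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr ...
DHCPCD = re.compile(r"dhcpcd\[\d+\]: (?P<iface>\w+): (?P<msg>.*)")


def summarize_dhcpcd(lines):
    """Count dhcpcd messages per (interface, message), ignoring timestamps."""
    counts = Counter()
    for line in lines:
        m = DHCPCD.search(line)
        if m:
            counts[(m.group("iface"), m.group("msg").strip())] += 1
    return counts


sample = [
    "Apr  8 04:39:04 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00",
    "Apr  8 04:40:09 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00",
    "Apr  8 04:41:13 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00",
]
dhcpcd_counts = summarize_dhcpcd(sample)
for (iface, msg), n in dhcpcd_counts.most_common():
    print(f"{n:5d}  {iface}  {msg}")

# For a real log (hypothetical file name):
# with open("syslog.txt", errors="replace") as fh:
#     print(summarize_dhcpcd(fh).most_common(5))
```

Running it before and after switching to a static IP would show whether the messages actually stop.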

 

* Something else that takes up an unusual amount of syslog space is the mover logging.  Recently, obfuscation was added for privacy concerns, so I can't tell what is actually happening.  It *looks* like it is trying to move the same file over and over and over, for almost 30 minutes.  And the next day, it takes about 15 minutes doing the same.  We'd have to see the original syslog to know if this is OK (all files with *almost* the same file name), or there's a problem here.  You should examine the syslog yourself, and see if the logged Mover lines look right.  This is one case where privacy concerns are hampering troubleshooting.

Apr  8 02:00:01 Solar logger: moving "E..y"

Apr  8 02:00:01 Solar logger: . file: /E..y/...

Apr  8 02:00:08 Solar logger: cd+++++++++ file: /E..y/...

Apr  8 02:00:08 Solar logger: >f+++++++++ file: /E..y/...

Apr  8 02:00:08 Solar logger: . file: /E..y/...

Apr  8 02:00:08 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:00:08 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:00:08 Solar logger: >f+++++++++ file: /E..y/...

Apr  8 02:00:20 Solar logger: . file: /E..y/...

Apr  8 02:00:20 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:00:20 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:00:20 Solar logger: >f+++++++++ file: /E..y/...

...[snipped]...

Apr  8 02:29:54 Solar logger: . file: /E..y/...

Apr  8 02:29:54 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:29:54 Solar logger: cd+++++++++ file: /E..y/...

Apr  8 02:29:54 Solar logger: >f+++++++++ file: /E..y/...

Apr  8 02:29:54 Solar logger: . file: /E..y/...

Apr  8 02:29:54 Solar logger: .d..t...... file: /E..y/...

Apr  8 02:29:54 Solar logger: . file: /E..y/...

Apr  8 02:29:54 Solar logger: . file: /E..y/...

Apr  8 02:29:54 Solar logger: .d..t...... file: /E..y/...

...[next day]...

Apr  9 02:00:01 Solar logger: mover started

Apr  9 02:00:01 Solar logger: moving "E..y"

Apr  9 02:00:01 Solar logger: . file: /E..y/...

Apr  9 02:00:01 Solar logger: .d..t...... ./

Apr  9 02:00:01 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:01 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:01 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:01 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:01 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:01 Solar logger: >f+++++++++ file: /E..y/...

Apr  9 02:00:08 Solar logger: . file: /E..y/...

Apr  9 02:00:08 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:08 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:00:08 Solar logger: >f+++++++++ file: /E..y/...

...[snipped]...

Apr  9 02:15:20 Solar logger: . file: /E..y/...

Apr  9 02:15:20 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:15:20 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:15:20 Solar logger: >f+++++++++ file: /E..y/...

Apr  9 02:15:21 Solar logger: . file: /E..y/...

Apr  9 02:15:21 Solar logger: . file: /E..y/...

Apr  9 02:15:21 Solar logger: .d..t...... file: /E..y/...

Apr  9 02:15:21 Solar logger: . file: /E..y/...

Apr  9 02:15:21 Solar logger: .d..t...... file: /E..y/...
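The codes in front of each `file:` entry look like rsync's `--itemize-changes` output, where `>f+++++++++` is a newly transferred file, `cd+++++++++` is a newly created directory, and `.d..t......` is a directory whose timestamp is being updated. Assuming that is what the mover is logging, a tally by code gives a feel for what happened even with the names obfuscated. The sketch below is not part of unRAID; the function names are made up for illustration.

```python
import re
from collections import Counter

# An rsync itemize string is 11 characters: update type, file type, 9 attribute flags.
ITEMIZE = re.compile(r"logger: ([<>ch.*][fdLDS][.+?a-zA-Z]{9})\s")


def classify(code):
    """Map an itemize string to a coarse, human-readable label."""
    if code[1] == "f" and code.endswith("+++++++++"):
        return "new file"
    if code[1] == "d" and code.endswith("+++++++++"):
        return "new directory"
    if code[1] == "d":
        return "directory attribute update"
    return "other"


def summarize_mover(lines):
    """Count mover log lines by what the itemize code says happened."""
    counts = Counter()
    for line in lines:
        m = ITEMIZE.search(line)
        if m:
            counts[classify(m.group(1))] += 1
    return counts


sample = [
    "Apr  8 02:00:01 Solar logger: . file: /E..y/...",
    "Apr  8 02:00:08 Solar logger: cd+++++++++ file: /E..y/...",
    "Apr  8 02:00:08 Solar logger: >f+++++++++ file: /E..y/...",
    "Apr  8 02:00:08 Solar logger: .d..t...... file: /E..y/...",
    "Apr  8 02:00:08 Solar logger: .d..t...... file: /E..y/...",
    "Apr  8 02:00:08 Solar logger: >f+++++++++ file: /E..y/...",
]
mover_counts = summarize_mover(sample)
print(dict(mover_counts))
```

If nearly every transfer line is a "new file", that points toward many distinct files being moved once each rather than one file being retried, though only the unobfuscated syslog can confirm it.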

Link to comment

Aside from the current SMB problem, do you have any performance problems you would like to address?  You mention wanting to figure out where to go next, but that looks like a pretty solid basic NAS build...

 

Thanks for your reply.

 

I don't have any performance issues as such; my wife would always like it faster, so I like to keep improving it, but I am not sure where to go from here.  I thought perhaps going dual NIC, but I wondered if there was a next logical step up from the Addonics controllers I am using.  I selected the Addonics only because it was on the unRAID compatibility list and seemed to be plug and play whilst supporting SATA3.

 

 

 

 

 

 


Thanks Frank, I'll do a search for Win10 SMB problems and implement any fixes recommended on the forum.  :)

Link to comment

Thanks for spending the time here:

 

* You say it started in v5, which is interesting.  That would mean the problem has to be something that is common to both, but that's not easy as there's so little in common between v5 and v6 (32 bit vs 64bit).  In general, that would seem to eliminate the Linux kernel, the unRAID software, anything VM related, and anything Docker related.  And because you have replaced the hardware, you can eliminate the hardware too.  Even plugins are different, although I suppose it could be a plugin that is almost identical in function.  Or perhaps a faulty application (plugin before, but could be plugin, Docker, or VM now), that while different code and in a different environment, can misbehave in the same way.  Or as mentioned a faulty protocol, that behaves as badly in either version...

Yes, I am rather perplexed too, although it's worth mentioning it happens much less on V6 with the old machine.  It's almost a rare occurrence, but it always seems to happen when my wife is working on it, so the impact (and the moaning) is quite high.  I am wondering if another network device could be causing some sort of storm, but I have not got the technical mastery to easily track this down.  I am slowly working through excluding particular things from the network to see if it comes right, but it's so intermittent it's hard to do focused troubleshooting.

 

* Your syslog shows a strange motherboard, one I haven't seen before.  It appears at first as a typical Intel board, H97 based, but it has NO Intel SATA motherboard controller or ports!  The only disk SATA controllers are a pair of Marvell 9230's (which make me nervous lately), and they are a little odd too.  Each Marvell 9230 reports that it is an 8 port SATA controller, and all 8 ports are setup with 8 SCSI and ATA channels.  But the 8th port on each turns out to have a Marvell virtual console device attached, which causes a few initial issues.  Then you have connected drives only to the first 4 ports on each, 5th, 6th, and 7th are unconnected on each, which makes me wonder if it actually looks like 4 port cards, or maybe 6 port or 7 port?  Anyway, there are no other issues involved with these, so I'm off-topic, just curious.

Good spotting.  The reason for this is that I have disabled anything in the BIOS that was not needed.  The Marvell 9230s are the two Addonics cards, I believe, and I am starting to worry about them.  They are only 4 port cards, so it's pretty weird.  This is a card recommended on the hardware compatibility list, but maybe it's starting to show issues.  I am open to ditching them if it's looking like they are causing a headache.

 

Apr  7 16:36:42 Solar kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6

Apr  7 16:36:42 Solar kernel: ata8.00: irq_stat 0x40000001

Apr  7 16:36:42 Solar kernel: ata8.00: cmd a0/01:00:00:00:01/00:00:00:00:00/a0 tag 2 dma 16640 in

Apr  7 16:36:42 Solar kernel:        opcode=0x12 12 01 00 00 ff 00res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x3 (HSM violation)

Apr  7 16:36:42 Solar kernel: ata8: hard resetting link

 

And i just spotted this:

 

Apr  7 16:36:42 Solar kernel: ACPI Warning: SystemIO range 0x000000000000F040-0x000000000000F05F conflicts with OpRegion 0x000000000000F040-0x000000000000F04F (\_SB_.PCI0.SBUS.SMBI) (20150410/utaddress-254)

 

Which I tried googling, but I think it is too specific to my system to return anything of value from a search.
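A rough way to see whether link resets like the ata8 one above cluster on particular ports is to tally them from the syslog. This is only a sketch: the pattern is written against the kernel lines quoted above, and the event labels are my own.

```python
import re
from collections import Counter

# Written against lines such as:
#   Apr  7 16:36:42 Solar kernel: ata8.00: exception Emask 0x0 ...
#   Apr  7 16:36:42 Solar kernel: ata8: hard resetting link
ATA = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata(lines):
    """Count libata exceptions and hard link resets per ATA port."""
    events = Counter()
    for line in lines:
        m = ATA.search(line)
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("exception"):
            events[(port, "exception")] += 1
        elif "hard resetting link" in msg:
            events[(port, "hard reset")] += 1
    return events


sample = [
    "Apr  7 16:36:42 Solar kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6",
    "Apr  7 16:36:42 Solar kernel: ata8.00: irq_stat 0x40000001",
    "Apr  7 16:36:42 Solar kernel: ata8: hard resetting link",
]
ata_events = summarize_ata(sample)
print(dict(ata_events))
```

Ports that show up repeatedly can then be matched to the physical card and drive they belong to.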

* You have a dual Intel NIC, but the cable appears to be on the second network port, because eth0 is not working!  Only eth1 is working, which would normally mean you have no networking (unRAID uses eth0).  Bonding is off, but bridging is on, and apparently unRAID is using the bridge and passing all packets over eth1.  I wonder if it would work better if the cable was moved to the other connector.

I've got an onboard Intel and an Intel add-in card.  I forgot to disable the onboard one in the BIOS.  It sounds like I have stuffed something up in bridging; I had a bridge configured to support VMs.

* You have it set to use DHCP, so every 12 hours it renews the IP lease, which also resets the bridging.  But something with that isn't working correctly, because the first time it renews, something is not right and it begins spewing error messages to the syslog, a LOT of them.  A sample -

Apr  8 04:39:04 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

Apr  8 04:40:09 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

Apr  8 04:41:13 Solar dhcpcd[1601]: br0: xid 0x7415714e is for hwaddr 08:ed:b9:7b:27:df:00:00:00:00:00:00:00:00:00:00

It repeats to the end of the syslog, several hundred KB of them.  They may be harmless, but a nuisance.  I believe that if you change to a static IP, the same 192.168.1.85 that you are already getting, then you would remove all of that from the syslog.  Will it fix other problems?  I doubt it, but who knows.

Good call, I have set it to static now.  Thanks.

* Something else that takes up an unusual amount of syslog space is the mover logging.  Recently, obfuscation was added for privacy concerns, so I can't tell what is actually happening.  It *looks* like it is trying to move the same file over and over and over, for almost 30 minutes.  And the next day, it takes about 15 minutes doing the same.  We'd have to see the original syslog to know if this is OK (all files with *almost* the same file name), or there's a problem here.  You should examine the syslog yourself, and see if the logged Mover lines look right.  This is one case where privacy concerns are hampering troubleshooting.

I'll have a look, but as my wife is moving a heap of files with practically identical names (uploading compact flash card contents after a shoot) I think it's OK.  I will check the mover after her next shoot and report back.  One thing I did notice is that my weird mix of 80GB and 240GB cache has most of the 240GB drive wasted, and my wife is filling up the cache, so I'll look to add another 240GB to bring the cache pool capacity to 280GB, although the 80GB is only SATA 2... which could be slowing up the works?

Link to comment

Thanks for your reply.

 

I don't have any performance issues as such, my wife would always like it faster, so i like to try to keep improving it, but I am not sure where to go from here.

Faster doing what?  Are writes to the system of camera files (from a single PC or laptop?) the primary thing you'd like to improve?

 

 

 

 

Link to comment


Ugh, Sorry, yeah i didn't really provide much info there.

 

At present she is editing files on the server, in Photoshop and Lightroom.  She was editing today and I noticed the CPU utilization creeping up to 75% at times, which is interesting, as I would have hoped CPU load would be low with the Addonics cards doing most of the work.

 

So, in answer to your question, creating and modifying files ranging from 10MB to 90MB in size, batch jobs at times too, so heaps of operations editing files where the server is presumably moving a file from the storage pool to the cache pool and writing changes.

 

The client machine is a Windows 10 Skylake box with an Intel NIC.  I would like to go to dual NICs at both ends as a next step, as I'm able to hit 125MB/s pretty easily when doing copies.
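For reference, 125MB/s is exactly the raw line rate of a single gigabit link (1000 megabits per second divided by 8 bits per byte), before any protocol overhead. A minimal sketch of the arithmetic, applied to the 10MB to 90MB file sizes mentioned above; the function names are just for illustration.

```python
def link_mb_per_s(gigabits_per_s: float) -> float:
    """Raw line rate in MB/s (decimal), ignoring Ethernet/TCP/SMB overhead."""
    return gigabits_per_s * 1000 / 8


def transfer_seconds(file_mb: float, gigabits_per_s: float) -> float:
    """Best-case time to push one file of the given size over the link."""
    return file_mb / link_mb_per_s(gigabits_per_s)


print(f"1 gigabit link: {link_mb_per_s(1):.0f} MB/s raw")
for size_mb in (10, 90):
    print(f"{size_mb}MB file: at least {transfer_seconds(size_mb, 1):.2f} s")
```

So a single file in that size range spends well under a second on the wire at full gigabit speed.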

 

Hope I have provided a bit more info this time, thanks for your patience.

 

PS. The reason she is editing on the server is that she cannot be trusted to copy stuff to the server if she edits on the local desktop.

Link to comment

Well, to provide some comparison, I run an older 4,000 Passmark CPU compared to your 3,300 Passmark CPU.  I run it with a single on-board 1Gb NIC.  Opening a ~28MB raw takes around 3-4 seconds off a spun-up unRAID network drive (to the Camera Raw preview screen) and another 3 seconds to open in Photoshop after I click Open Image.  It also takes about 3 seconds to save a corresponding 70MB psd file to an uncached user share.  Opening files off the local SSD also takes about 3 seconds, though saving is definitely faster, about a second.  Opening 70MB psd files off the network takes about 3 seconds, and opening those same psd files from my SSD is about a second.  My desktop is a Haswell Core i7.  I wonder if we're seeing anything similar, performance wise?

 

Based on the feedback provided by Frank and RobJ, it sounds like you have some things worth looking at.  If your performance is significantly worse than mine, I'd focus on those issues since we're running relatively similar class hardware.

 

Another thing - my server is rarely (never?) above 10% CPU on basic NAS operations like copy, etc.  Again I'd look at the config stuff Frank and RobJ identified - 75% doesn't sound right and maybe your server is fighting errors of some kind to push it that high.  (I do get high CPU utilization sometimes, but that is because of the Dockers I run).

Link to comment

Well, it's been quite a day.

 

Woke up to the wife telling me the server was acting strangely.  Bunch of stuff missing, i checked the web gui and I had lost parity drive and disk 1, wdred 3tb on parity and wdred 2tb on disk 1.

 

A reboot brought disk 1 back, but the parity drive was still toast.  I went and got another drive, and because the two drives in question were both on the same Addonics card, I moved the new parity drive to the onboard controller and kicked off a rebuild.

 

It ran for 6 or so hours and then things went really screwy.

 

I lost disk 1 again and got a heap of read errors on the other 2 drives on the other controller card (another Addonics).  I pulled the controller cards out and put everything on the onboard controller.

 

Disk 1 is now unmountable, fs check returns a faulty disk message and recommends replacement, a short smart test says everything is hunky dory.

 

My theory is that the addonics cards are junk, and have corrupted the file system on the disk.  I am running an extended smart test but may end up putting disk 1 back into service.

 

I will address all the points made above, but I am wondering if these Addonics cards have been causing issues all along...

Link to comment

Attach a diagnostics file  ('Tools' >>  'Diagnostics')  and someone may be able to give you some useful information...

Link to comment

Yeah good point, sorry!

 

Please find attached.

 

I have replaced the 2tb data disk with the brand new 3tb wd red.  And put the original 3tb parity back in as parity, rebuild underway.

 

Thankfully I have offsites of all the important stuff...  :)

 

PS. Onboard ethernet is now disabled as suggested above, which is why it's on eth0 now.  And it's now using a static IP as suggested.

solar-diagnostics-20160417-0048.zip

Link to comment
