Unable to load web interface or Hit Shares [SOLVED]



Ran reiserfsck tool against disk 5, looks clean to me.

 

Replaying journal: Done.

Reiserfs journal '/dev/md5' in blocks [18..8211]: 27 transactions replayed

Checking internal tree..  finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

        Leaves 555762

        Internal nodes 3557

        Directories 2158

        Other files 10230

        Data block pointers 560633744 (0 of them are zero)

        Safe links 0

###########

reiserfsck finished at Thu Aug 27 14:44:34 2015
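
(For reference, this is the read-only check run from the console with the array started in Maintenance mode, along the lines of:)

reiserfsck --check /dev/md5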

Link to comment

 

Two different drives showed indications of a faulty SATA cable, ST3000DM001-9YN166 Z1F12JLY twice (Disk 5) and WDC WD20EARS-00MVWB0 WD-WMAZA3638502 (Disk 1).  I would replace their SATA cables with better quality ones.

 

Just to rule this out I replaced all cables.  Thanks Rob for catching that.

 

I also ran a memtest for 24 hours, no errors.

 

I'm willing to try anything at this point to make my unRAID stable, so I am going to work through this post next: http://lime-technology.com/forum/index.php?topic=28484.0, as suggested by bonienl in another thread, even though I never ran Unraid 6 rc 13.

 

 

Link to comment

I just manually invoked the mover and it failed hard.  My unraid is locked up and I can't hit my shares.  Any clues?  Here is the relevant syslog data:

 

 

Aug 30 20:45:04 nas kernel: general protection fault: 0000 [#1] PREEMPT SMP

Aug 30 20:45:04 nas kernel: Modules linked in: md_mod xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat k10temp pata_atiixp i2c_piix4 ahci libahci r8169 sata_sil24 mii acpi_cpufreq [last unloaded: md_mod]

Aug 30 20:45:04 nas kernel: CPU: 0 PID: 25306 Comm: shfs Not tainted 4.1.5-unRAID #3

Aug 30 20:45:04 nas kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./HZ03-GT-V2, BIOS 080015  12/24/2010

Aug 30 20:45:04 nas kernel: task: ffff8800a117c300 ti: ffff8801163f8000 task.ti: ffff8801163f8000

Aug 30 20:45:04 nas kernel: RIP: 0010:[<ffffffff811533e4>]  [<ffffffff811533e4>] __discard_prealloc+0x98/0xb3

Aug 30 20:45:04 nas kernel: RSP: 0018:ffff8801163fbcd8  EFLAGS: 00010246

Aug 30 20:45:04 nas kernel: RAX: ffff8800947226a8 RBX: ffff880094722680 RCX: bdb5d18c95cbac9b

Aug 30 20:45:04 nas kernel: RDX: cb904b10fa85c8b5 RSI: ffff880094722680 RDI: ffff8801163fbe40

Aug 30 20:45:04 nas kernel: RBP: ffff8801163fbd08 R08: 00000000000004c5 R09: 00000000000201b9

Aug 30 20:45:04 nas kernel: R10: 00000000ffffffff R11: ffff88005ad4a0d0 R12: ffff8801163fbe40

Aug 30 20:45:04 nas kernel: R13: ffff880094722720 R14: ffff8801163fbe40 R15: 00000000804a3392

Aug 30 20:45:04 nas kernel: FS:  00002b7a0bc77700(0000) GS:ffff88011dc00000(0000) knlGS:0000000000000000

Aug 30 20:45:04 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033

Aug 30 20:45:04 nas kernel: CR2: 00002b7a0a638000 CR3: 00000000dd18f000 CR4: 00000000000006f0

Aug 30 20:45:04 nas kernel: Stack:

Aug 30 20:45:04 nas kernel: ffff8800dc586200 ffff8801163fbe40 ffffc90001e99000 ffffc90001eb91e8

Aug 30 20:45:04 nas kernel: ffff8801163fbe40 ffff880075b43800 ffff8801163fbd38 ffffffff81153463

Aug 30 20:45:04 nas kernel: ffff8801163fbe40 ffff8800a117c300 ffffc90001e99000 ffffc90001e99000

Aug 30 20:45:04 nas kernel: Call Trace:

Aug 30 20:45:04 nas kernel: [<ffffffff81153463>] reiserfs_discard_all_prealloc+0x44/0x4e

Aug 30 20:45:04 nas kernel: [<ffffffff8116fca4>] do_journal_end+0x4e7/0xc78

Aug 30 20:45:04 nas kernel: [<ffffffff81170994>] journal_end+0xae/0xb6

Aug 30 20:45:04 nas kernel: [<ffffffff811579c7>] reiserfs_mkdir+0x1d7/0x1fc

Aug 30 20:45:04 nas kernel: [<ffffffff8117349d>] ? reiserfs_permission+0x11/0x13

Aug 30 20:45:04 nas kernel: [<ffffffff81105b73>] vfs_mkdir+0x6e/0xa8

Aug 30 20:45:04 nas kernel: [<ffffffff8110a2cb>] SyS_mkdirat+0x6d/0xab

Aug 30 20:45:04 nas kernel: [<ffffffff8110a31d>] SyS_mkdir+0x14/0x16

Aug 30 20:45:04 nas kernel: [<ffffffff81615c6e>] system_call_fastpath+0x12/0x71

Aug 30 20:45:04 nas kernel: Code: 1c 75 bb 0f 0b 85 c0 74 12 48 8b 93 e8 00 00 00 4c 89 ee 4c 89 e7 e8 be 6e 00 00 48 8b 4b 28 44 89 7b 1c 48 8d 43 28 48 8b 53 30 <48> 89 51 08 48 89 0a 48 89 43 28 48 89 43 30 58 5b 41 5c 41 5d

Aug 30 20:45:04 nas kernel: RIP  [<ffffffff811533e4>] __discard_prealloc+0x98/0xb3

Aug 30 20:45:04 nas kernel: RSP <ffff8801163fbcd8>

Aug 30 20:45:04 nas kernel: ---[ end trace 6982e962bf2605e4 ]---

 

Link to comment

Attach diagnostics file. Tools->diagnostics

 

Thanks for looking into this.  Latest diagnostics attached.

 

I am pretty convinced these crashes are 100% related to the mover script.  Whenever it crashes, I see rsync processes running.  I am unable to kill these, even with kill -9 on the rsync PIDs.
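
For anyone following along, a rough way to confirm what state those rsync processes are in (commands are a sketch; a D in the STAT column means uninterruptible I/O wait, which is why kill -9 does nothing):

# Show the rsync processes with their state and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | grep '[r]sync'
# A process stuck in D state ignores every signal, including SIGKILL
kill -9 <PID of rsync>   # no effect until the pending I/O completes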

 

I have set my mover script to monthly to see if I can go more than 2 days without a crash and prove this theory.

nas-diagnostics-20150831-1050.zip

Link to comment

* Your SI3132 card appears to have the RAID firmware.  We've always been advised to flash it with the non-RAID firmware.  As this is the first time I've ever seen that, I have no idea what the ramifications are.

 

* Check your BIOS settings for the extra SATA controller, currently set to IDE mode.  Change it to a native SATA mode, preferably AHCI.  The SSD speed is almost certainly being limited.

 

* At 4GB, memory looks a little tight for what is loaded.  Is it possible that the rsync commands need more?  You might be OK in normal operation, but when memory demands are higher, it may get squeezed.  It certainly shouldn't crash, though.  It might be interesting to see how 8GB would perform, or a swap file (see the sketch after this list).

 

* Unfortunately, this syslog does not show any problems.  A parity check starts because of an unclean shutdown and is stopped, Mover is manually run without any issues, then another parity check is started and stopped.  No problems seen.  If at all possible, we need the syslog covering the period where the trouble occurs, but obviously that's not possible if the machine completely crashes or freezes.  Do you happen to have the syslog from which you extracted the 'general protection fault'?
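
Regarding the memory point above: purely as a sketch (the mount point and size are assumptions on my part), a temporary swap file would let you test whether extra headroom helps before buying RAM:

# Check current memory headroom first
free -m
# Create and enable a 4 GB swap file on the cache drive (path assumed)
dd if=/dev/zero of=/mnt/cache/swapfile bs=1M count=4096
chmod 600 /mnt/cache/swapfile
mkswap /mnt/cache/swapfile
swapon /mnt/cache/swapfile
# Remove it again later with: swapoff /mnt/cache/swapfile && rm /mnt/cache/swapfile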

Link to comment

Rob, I appreciate you taking a look.

 

Not sure how I feel about flashing the firmware on that card...

 

Good to know about that bios setting, I will change that.

 

I do not have a syslog capture from the time I had the fault.  It was completely locked up.

 

For the record I've never had any of these problems running unraid 4/5 with this same hardware for years.  4GB of ram has always been sufficient and I actually ran more plugins before (unless docker is more resource intensive).  I would happily buy extra ram if I knew that was the issue, but I'm not too sure.

 

 

Link to comment

Not sure how I feel about flashing the firmware on that card...

It appears to be working correctly, so I probably wouldn't change anything.  The thing is, I had always thought it wouldn't work at all, in RAID mode.

 

For the record I've never had any of these problems running unraid 4/5 with this same hardware for years.  4GB of ram has always been sufficient and I actually ran more plugins before (unless docker is more resource intensive).  I would happily buy extra ram if I knew that was the issue, but I'm not too sure.

And I can't say for sure that it's a lack of memory, I don't know.  It does look tight though, might work faster with more.

 

Differences in operation have been found between v5 and v6.  Memory is used differently in the 64-bit v6, and memory that was fine in v5 may, in rare cases, not work right in v6.  I've also added a section to the Upgrading to UnRAID v6 guide about disk controllers, found in the Troubleshooting section.  I don't think it applies to you, but it's another case where everything worked fine in v5 but not in v6.

Link to comment

Well, new record for days without a lockup: 6.  Did the upgrade to 6.1.1 today and it took quite a long time.  Not sure if that is relevant.

 

Anyway, I went to reboot and noticed I could not shut down the array.  Mover was still running, just like in all of my other lockups.  Full diagnostics attached.

 

Same symptoms: I cannot hit any of my shares or shut down the array cleanly.

 

rsync is still running and I am unable to kill it.

 

When I run ps ax, I have this line over 1,300 times:

 

32453      D      0:00 /usr/sbin/smbd -D
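
A quick count along these lines (a sketch) confirms the scale of the pile-up and shows whether they are all sitting in the D (uninterruptible wait) state:

# Count the hung smbd processes
ps ax | grep '[s]mbd -D' | wc -l
# Break them down by process state (D = uninterruptible I/O wait)
ps -eo stat,comm | grep '[s]mbd' | sort | uniq -c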

 

Any help would be greatly appreciated.

nas-diagnostics-20150907-1327.zip

Link to comment

 

* Check your BIOS settings for the extra SATA controller, currently set to IDE mode.  Change it to a native SATA mode, preferably AHCI.  The SSD speed is almost certainly being limited.

 

 

Since my machine was down from the most recent crash, I changed this setting in my BIOS to AHCI.  I also added another 2TB drive.  Everything was running smoothly until the mover script ran again last night, and it's the same situation: can't hit shares and "Mover is running" in the web GUI.  The latest syslog is a massive 124 MB.  Zipped and uploaded here: http://1drv.ms/1LV2irr

 

This is a new entry to me all over my syslog: Sep  8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)
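
To get a sense of scale, one-liners like these (log path assumed) show how often the error fires and which md devices it names:

# How many times the error appears
grep -c 'REISERFS error' /var/log/syslog
# Which array devices are being reported
grep 'REISERFS error' /var/log/syslog | grep -o 'device md[0-9]*' | sort | uniq -c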

Link to comment

This is a new entry to me all over my syslog: Sep  8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)

That indicates that there is file system level corruption on disk5 that can only be repaired using reiserfsck. 

 

Why this should have happened is not clear - that type of corruption is normally the result of a failed write.

Link to comment

This is a new entry to me all over my syslog: Sep  8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)

That indicates that there is file system level corruption on disk5 that can only be repaired using reiserfsck. 

 

Why this should have happened is not clear - that type of corruption is normally the result of a failed write.

 

Am I doing something wrong here?  I have already run reiserfsck twice now since I was told there is something wrong with this disk.  I put the array into maintenance mode and run:

reiserfsck --check /dev/md5

 

It comes back with no issues...

 

I have also run this on all of my other disks as well.
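
For completeness, a loop like this (device names assumed; adjust the list to your disk count, and only run it with the array started in Maintenance mode) performs the same read-only check on every data disk:

for dev in /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5; do
    echo "=== $dev ==="
    # --yes skips the interactive confirmation prompt; --check is read-only
    reiserfsck --check --yes "$dev"
done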

Link to comment

This is a new entry to me all over my syslog: Sep  8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)

That indicates that there is file system level corruption on disk5 that can only be repaired using reiserfsck. 

 

Why this should have happened is not clear - that type of corruption is normally the result of a failed write.

No idea what is going on then!    This is the first time I have heard of an error message like that in the log not resulting in reiserfsck reporting an error when run with the --check option.

 

Am I doing something wrong here?  I have already run reiserfsck twice now since I was told there is something wrong with this disk.  I put the array into maintenance mode and run:

reiserfsck --check /dev/md5

 

It comes back with no issues...

 

I have also run this on all of my other disks as well.

Link to comment

Upgraded to 6.1.2.  Hoping this magically improves my situation.

 

Can anyone help me with upgrading my SATA controller?

 

I am starting to wonder if Unraid 6 does not support my raid card.  Is there one out there someone can recommend that is certified to work with 6? 

 

 

I posted in the storage devices and controllers subforum looking for advice as well but no bites.  Does anyone know if this guy works with unraid 6.x?

 

http://www.amazon.com/IO-Crest-Controller-Non-Raid-SI-PEX40064/dp/B00AZ9T3OU
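
(For reference, I can check how the kernel identifies my current controllers with something like the following, which should also show whether the SiI3132 is presenting itself as a RAID device:)

lspci -nn | grep -iE 'sata|raid|ide'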

Link to comment

Shares went down again.  This time I noticed my nzbget docker stopped responding and I was unable to restart it.  The logs for that docker are below:

 

2015-09-07 14:28:09,831 DEBG fd 14 closed, stopped monitoring (stderr)>
2015-09-07 14:28:09,831 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2015-09-07 14:28:09,831 DEBG fd 9 closed, stopped monitoring (stdout)>
2015-09-07 14:28:09,831 INFO exited: start (exit status 0; expected)
2015-09-07 14:28:09,831 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:09,950 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:09,951 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:09,951 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:09,951 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:10,954 INFO spawned: 'nzbget' with pid 16
2015-09-07 14:28:10,967 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:10,967 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:10,967 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:10,967 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:10,975 INFO reaped unknown pid 17
2015-09-07 14:28:10,975 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:12,979 INFO spawned: 'nzbget' with pid 18
2015-09-07 14:28:12,994 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:12,994 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:12,994 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:12,994 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:13,001 INFO reaped unknown pid 19
2015-09-07 14:28:13,002 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:16,007 INFO spawned: 'nzbget' with pid 20
2015-09-07 14:28:16,020 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:16,020 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:16,020 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:16,020 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:16,024 INFO gave up: nzbget entered FATAL state, too many start retries too quickly
2015-09-07 14:28:16,024 INFO reaped unknown pid 21
2015-09-07 14:28:16,024 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:33,247 WARN received SIGTERM indicating exit request
2015-09-07 14:37:39,316 CRIT Set uid to user 0
2015-09-07 14:37:39,316 WARN Included extra file "/etc/supervisor/conf.d/nzbget.conf" during parsing
2015-09-07 14:37:39,343 INFO supervisord started with pid 1
2015-09-07 14:37:40,345 INFO spawned: 'nzbget' with pid 9
2015-09-07 14:37:40,359 INFO spawned: 'start' with pid 10
2015-09-07 14:37:40,375 DEBG 'start' stdout output:
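
(For reference, these are roughly the console commands I used to check on the container and try to restart it; the container name is assumed to match what's shown in my Docker tab:)

docker ps -a | grep -i nzbget
docker restart nzbget
docker logs --tail 50 nzbget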

 

Full diagnostics attached.

nas-diagnostics-20150909-1648.zip

Link to comment

Well, the good news is I don't see anything that's hardware related this time (unless there's a memory issue - have you tested your RAM lately?).

 

Everything was running fine until 4:34pm, when suddenly the Reiser file system on the Cache drive was corrupted and changed to read-only, which brought everything from nzbget to a stop.  That's probably what caused all those messages you saw.  So the real problem is why the Reiser file system was corrupted.  It just doesn't happen like that normally; even buggy software can't normally do that.  The two most likely reasons are that there was already hidden corruption on the Cache drive, or that you have bad memory chips.
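
A quick way to confirm that state from the console (mount point and log path assumed) is to look at the mount flags and the surrounding log entries:

# 'ro' in the flags means the filesystem has dropped to read-only
grep /mnt/cache /proc/mounts
# Show the ReiserFS errors around the time of the nzbget failures
grep -i 'reiserfs' /var/log/syslog | tail -n 20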

 

I recommend running Check Disk filesystems on the Cache drive (I know you're tired of that), and running several passes of Memtest on it (from the boot menu).

Link to comment

 

I recommend running Check Disk filesystems on the Cache drive (I know you're tired of that), and running several passes of Memtest on it (from the boot menu).

 

Thanks for continuing to look at my issue, Rob.  So I ran a filesystem check via the web GUI again on the cache drive with the array in maintenance mode and still got no errors (see below).  I have ruled out my SATA controller, as I borrowed one from a friend's backup system and it still crashed last night.  Let's say it is my cache drive... would corruption on that force my whole system to hang?  Besides buying a new one, would wiping it and re-adding it be beneficial?  I will try to run a memtest for more than 24 hours next...

 

 

 

Replaying journal: Done.

Reiserfs journal '/dev/sdg1' in blocks [18..8211]: 451 transactions replayed

Checking internal tree..  finished

Comparing bitmaps..finished

Checking Semantic tree:

finished

No corruptions found

There are on the filesystem:

Leaves 8290

Internal nodes 54

Directories 392

Other files 8266

Data block pointers 6856347 (2683087 of them are zero)

Safe links 0

###########

reiserfsck finished at Thu Sep 10 07:28:45 2015

###########

Link to comment

There's no way corruption can be detected then not found by reiserfsck, the way it's happening on your system.  So the only interesting question right now is what will Memtest find!  I rather think I already know!  And if the memory does check out, then you have a bad motherboard, because you have something that's corrupting internal data handling.

Link to comment

So the only interesting question right now is what will Memtest find!  I rather think I already know!  And if the memory does check out, then you have a bad motherboard, because you have something that's corrupting internal data handling.

 

So far 1 pass on memtest and no errors.  I'll let it keep running.

 

I just find it odd my motherboard would go bad the day I upgraded to 6.0  ???  ;D

Link to comment

Same sad story.  I decided to turn my mover script back on since I had gone 3 days without a crash.  Woke up this morning to "Mover is running".  Unable to hit shares.  Cannot cleanly reboot.  The syslog has no useful info but is attached anyway.  Also a screenshot of htop in case it's useful.  It looks like one of my CPU cores has spiked to 100%.  See attached.  Is it time to buy a new motherboard?

 

56H8TEQ.png

nas-diagnostics-20150915-0644.zip

Link to comment

As you said, nothing in the syslog.  Mover starts normally, transfers some files without issues, then nothing, no errors, and the drives spin down.

 

There was a case not too long ago where the memory tested fine on a long test with many passes, but someone (Tom, I think) said that Memtest doesn't catch everything, so the user replaced their memory sticks - and had no more problems!  Obviously, this is a shot in the dark, and an expensive one too.

 

htop says the CPU is stuck in the User Share file system.  I'm not sure, but there seem to be quite a few threads working on the User Share file system!  More than I would expect, but I don't have your Dockers, so I can't say if that's abnormal.
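
shfs is the User Share (FUSE) process, the same one named in the kernel trace earlier in this thread; a rough way to count its threads for comparison against another system is:

ps -eLf | grep '[s]hfs' | wc -l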

Link to comment
