trurl Posted August 27, 2015

Make sure you don't have a console or telnet session whose current working directory is on a user share or disk. That will also keep the drive(s) mounted.
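If you want to confirm what is actually holding a disk open, something along these lines should show it (this assumes fuser and lsof are available on your install - adjust the mount points to your own shares and disks):

# Anything with open files or a current working directory under the mount
# shows up here; a shell sitting in the directory is flagged with "c" (cwd)
fuser -vm /mnt/disk5
fuser -vm /mnt/user

# Slower, but lists the actual open files on a given mount
lsof +D /mnt/cache 2>/dev/null | head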
pickthenimp Posted August 27, 2015

Ran the reiserfsck tool against disk 5, looks clean to me.

Replaying journal: Done.
Reiserfs journal '/dev/md5' in blocks [18..8211]: 27 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree: finished
No corruptions found
There are on the filesystem:
Leaves 555762
Internal nodes 3557
Directories 2158
Other files 10230
Data block pointers 560633744 (0 of them are zero)
Safe links 0
###########
reiserfsck finished at Thu Aug 27 14:44:34 2015
pickthenimp Posted August 28, 2015

Two different drives showed indications of a faulty SATA cable, ST3000DM001-9YN166 Z1F12JLY twice (Disk 5) and WDC WD20EARS-00MVWB0 WD-WMAZA3638502 (Disk 1). I would replace their SATA cables with better quality ones.

Just to rule this out, I replaced all the cables - thanks Rob for catching that. I also ran a memtest for 24 hours with no errors. I am going to try anything at this point to make my unRAID stable. I am going to work through this post next, as suggested by bonienl in another thread, even though I never ran unRAID 6 RC13: http://lime-technology.com/forum/index.php?topic=28484.0
pickthenimp Posted August 31, 2015

I just manually invoked the mover and it failed hard. My unRAID is locked up and I can't hit my shares. Any clues? Here is the relevant syslog data:

Aug 30 20:45:04 nas kernel: general protection fault: 0000 [#1] PREEMPT SMP
Aug 30 20:45:04 nas kernel: Modules linked in: md_mod xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat k10temp pata_atiixp i2c_piix4 ahci libahci r8169 sata_sil24 mii acpi_cpufreq [last unloaded: md_mod]
Aug 30 20:45:04 nas kernel: CPU: 0 PID: 25306 Comm: shfs Not tainted 4.1.5-unRAID #3
Aug 30 20:45:04 nas kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./HZ03-GT-V2, BIOS 080015 12/24/2010
Aug 30 20:45:04 nas kernel: task: ffff8800a117c300 ti: ffff8801163f8000 task.ti: ffff8801163f8000
Aug 30 20:45:04 nas kernel: RIP: 0010:[<ffffffff811533e4>] [<ffffffff811533e4>] __discard_prealloc+0x98/0xb3
Aug 30 20:45:04 nas kernel: RSP: 0018:ffff8801163fbcd8 EFLAGS: 00010246
Aug 30 20:45:04 nas kernel: RAX: ffff8800947226a8 RBX: ffff880094722680 RCX: bdb5d18c95cbac9b
Aug 30 20:45:04 nas kernel: RDX: cb904b10fa85c8b5 RSI: ffff880094722680 RDI: ffff8801163fbe40
Aug 30 20:45:04 nas kernel: RBP: ffff8801163fbd08 R08: 00000000000004c5 R09: 00000000000201b9
Aug 30 20:45:04 nas kernel: R10: 00000000ffffffff R11: ffff88005ad4a0d0 R12: ffff8801163fbe40
Aug 30 20:45:04 nas kernel: R13: ffff880094722720 R14: ffff8801163fbe40 R15: 00000000804a3392
Aug 30 20:45:04 nas kernel: FS: 00002b7a0bc77700(0000) GS:ffff88011dc00000(0000) knlGS:0000000000000000
Aug 30 20:45:04 nas kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 30 20:45:04 nas kernel: CR2: 00002b7a0a638000 CR3: 00000000dd18f000 CR4: 00000000000006f0
Aug 30 20:45:04 nas kernel: Stack:
Aug 30 20:45:04 nas kernel: ffff8800dc586200 ffff8801163fbe40 ffffc90001e99000 ffffc90001eb91e8
Aug 30 20:45:04 nas kernel: ffff8801163fbe40 ffff880075b43800 ffff8801163fbd38 ffffffff81153463
Aug 30 20:45:04 nas kernel: ffff8801163fbe40 ffff8800a117c300 ffffc90001e99000 ffffc90001e99000
Aug 30 20:45:04 nas kernel: Call Trace:
Aug 30 20:45:04 nas kernel: [<ffffffff81153463>] reiserfs_discard_all_prealloc+0x44/0x4e
Aug 30 20:45:04 nas kernel: [<ffffffff8116fca4>] do_journal_end+0x4e7/0xc78
Aug 30 20:45:04 nas kernel: [<ffffffff81170994>] journal_end+0xae/0xb6
Aug 30 20:45:04 nas kernel: [<ffffffff811579c7>] reiserfs_mkdir+0x1d7/0x1fc
Aug 30 20:45:04 nas kernel: [<ffffffff8117349d>] ? reiserfs_permission+0x11/0x13
Aug 30 20:45:04 nas kernel: [<ffffffff81105b73>] vfs_mkdir+0x6e/0xa8
Aug 30 20:45:04 nas kernel: [<ffffffff8110a2cb>] SyS_mkdirat+0x6d/0xab
Aug 30 20:45:04 nas kernel: [<ffffffff8110a31d>] SyS_mkdir+0x14/0x16
Aug 30 20:45:04 nas kernel: [<ffffffff81615c6e>] system_call_fastpath+0x12/0x71
Aug 30 20:45:04 nas kernel: Code: 1c 75 bb 0f 0b 85 c0 74 12 48 8b 93 e8 00 00 00 4c 89 ee 4c 89 e7 e8 be 6e 00 00 48 8b 4b 28 44 89 7b 1c 48 8d 43 28 48 8b 53 30 <48> 89 51 08 48 89 0a 48 89 43 28 48 89 43 30 58 5b 41 5c 41 5d
Aug 30 20:45:04 nas kernel: RIP [<ffffffff811533e4>] __discard_prealloc+0x98/0xb3
Aug 30 20:45:04 nas kernel: RSP <ffff8801163fbcd8>
Aug 30 20:45:04 nas kernel: ---[ end trace 6982e962bf2605e4 ]---
dgaschk Posted August 31, 2015

Attach your diagnostics file: Tools -> Diagnostics.
pickthenimp Posted August 31, 2015

Attach your diagnostics file: Tools -> Diagnostics.

Thanks for looking into this. The latest diagnostics are attached. I am pretty convinced these crashes are 100% related to the mover script. Whenever it crashes, I see rsync processes running, and I am unable to kill these even using kill -9 on the rsync PID. I have set my mover script to monthly to see if I can go more than 2 days without a crash and prove this theory.

nas-diagnostics-20150831-1050.zip
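Next time it hangs I will also try to capture where those rsync processes are actually stuck. My understanding is that kill -9 cannot do anything while a process sits in uninterruptible sleep (state D) waiting on disk I/O, so something like this (plain procps, nothing unRAID-specific) should at least show the kernel function they are blocked in:

# List tasks in uninterruptible sleep plus the kernel wait channel;
# a stuck mover/rsync usually shows a reiserfs- or md-related wchan
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'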
RobJ Posted August 31, 2015

* Your SI3132 card appears to have the RAID firmware. We've always been advised to flash it with the non-RAID firmware. As this is the first time I've ever seen that, I have no idea what the ramifications are.
* Check your BIOS settings for the extra SATA controller, currently set to IDE mode. Change it to a native SATA mode, preferably AHCI. The SSD speed is almost certainly being limited.
* At 4GB, memory looks a little tight for what is loaded. Is it possible that the rsync commands need more? You might be OK in normal operation, but when the memory demands are higher, it may be squeezed. It certainly shouldn't crash, though. It might be interesting to see how 8GB would perform, or a swap file (rough sketch below).
* Unfortunately, this syslog does not show any problems. A parity check starts because of an unclean shutdown and is stopped, Mover is manually run without any issues, then another parity check is started and stopped. No problems seen. If at all possible, we need the syslog covering the period where the trouble occurs, but obviously that's not possible if the machine completely crashes or freezes. Do you happen to have the syslog from which you extracted the 'general protection fault'?
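If you want to experiment with a swap file before buying more RAM, a rough sketch would be something like the following. This is only an example: unRAID doesn't ship with swap enabled, the cache-drive path and 4GB size are just placeholders, and it won't survive a reboot unless you also add the swapon line to your go script.

# Create and enable a 4GB swap file (path and size are assumptions)
dd if=/dev/zero of=/mnt/cache/swapfile bs=1M count=4096
chmod 600 /mnt/cache/swapfile
mkswap /mnt/cache/swapfile
swapon /mnt/cache/swapfile
free -m    # the Swap line should now be non-zero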
pickthenimp Posted August 31, 2015

Rob, I appreciate you taking a look. Not sure how I feel about flashing the firmware on that card... Good to know about that BIOS setting; I will change that. I do not have a syslog capture from the time I had the fault - it was completely locked up. For the record, I've never had any of these problems running unRAID 4/5 with this same hardware for years. 4GB of RAM has always been sufficient, and I actually ran more plugins before (unless Docker is more resource intensive). I would happily buy extra RAM if I knew that was the issue, but I'm not too sure.
RobJ Posted August 31, 2015

Not sure how I feel about flashing the firmware on that card...

It appears to be working correctly, so I probably wouldn't change anything. The thing is, I had always thought it wouldn't work at all in RAID mode.

For the record, I've never had any of these problems running unRAID 4/5 with this same hardware for years. 4GB of RAM has always been sufficient, and I actually ran more plugins before (unless Docker is more resource intensive). I would happily buy extra RAM if I knew that was the issue, but I'm not too sure.

And I can't say for sure that it's a lack of memory - I don't know. It does look tight, though; it might work faster with more. There have been found to be differences in operation between v5 and v6. Memory is used differently in the 64-bit v6, and memory that was fine in v5 may in rare cases not work right in v6. I've also added a section to the Upgrading to unRAID v6 guide about disk controllers, found in the Troubleshooting section. I don't think it applies to you, but it's another case where everything worked fine in v5 but not in v6.
pickthenimp Posted September 7, 2015

Well, new record for days without a lockup: 6. I did the upgrade to 6.1.1 today and it took somewhat of a long time; not sure if that is relevant. Anyway, I went to reboot and noticed I could not shut down the array. The mover was still running, just like with all of my other lockups. Full diagnostics attached. Same symptoms: I cannot hit any of my shares or shut down the array cleanly. rsync is still running and I am unable to kill it. When I run ps ax, I have this line over 1,300 times:

32453 D 0:00 /usr/sbin/smbd -D

Any help would be greatly appreciated.

nas-diagnostics-20150907-1327.zip
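If it would help, next time it locks up I can also gather something like this before rebooting, to count how many smbd processes there really are and see whether Samba will still answer (smbstatus may well hang along with the shares):

ps ax | grep -c '[s]mbd'    # count the smbd processes without matching the grep itself
smbstatus -b                # brief per-connection listing, if Samba still responds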
pickthenimp Posted September 8, 2015

* Check your BIOS settings for the extra SATA controller, currently set to IDE mode. Change it to a native SATA mode, preferably AHCI. The SSD speed is almost certainly being limited.

Since my machine was down from the most recent crash, I changed this setting in my BIOS to AHCI. I also added another 2TB drive. Everything was running smoothly until the mover script ran again last night, and it is the same situation: I can't hit shares and "Mover is running" in the web GUI. The latest syslog is a massive 124 MB. Zipped and uploaded here: http://1drv.ms/1LV2irr

This is a new entry to me, all over my syslog:

Sep 8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)
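For what it's worth, a quick tally of which disks those errors point at can be pulled from the saved copy of the syslog with something like this (the file name is just wherever the zip was extracted):

grep -o 'REISERFS error (device md[0-9]*)' syslog | sort | uniq -c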
itimpi Posted September 8, 2015

This is a new entry to me, all over my syslog:
Sep 8 01:37:41 nas kernel: REISERFS error (device md5): vs-4010 is_reusable: block number is out of range 1228469026 (732566633)

That indicates that there is file system level corruption on disk5 that can only be repaired using reiserfsck. Why this should have happened is not clear - that type of corruption is normally the result of a failed write.
pickthenimp Posted September 8, 2015

That indicates that there is file system level corruption on disk5 that can only be repaired using reiserfsck. Why this should have happened is not clear - that type of corruption is normally the result of a failed write.

Am I doing something wrong here? I have already run reiserfsck twice now since I was told there is something wrong with this disk. I put the array into maintenance mode and run:

reiserfsck --check /dev/md5

It comes back with no issues... I have also run this on all of my other disks as well.
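For reference, this is roughly how I step through the other disks while the array is in maintenance mode - the slot numbers are specific to my array, and --yes only skips the confirmation prompt, it does not change what --check does:

# Read-only check of each array data disk in turn
for n in 1 2 3 4 5; do
    echo "=== /dev/md$n ==="
    reiserfsck --check --yes /dev/md$n
done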
itimpi Posted September 8, 2015

Am I doing something wrong here? I have already run reiserfsck twice now since I was told there is something wrong with this disk. I put the array into maintenance mode and run:

reiserfsck --check /dev/md5

It comes back with no issues... I have also run this on all of my other disks as well.

No idea what is going on then! This is the first time I have heard of an error message like that in the log not resulting in reiserfsck reporting an error when run with the --check option.
pickthenimp Posted September 8, 2015

I am starting to wonder if unRAID 6 does not support my RAID card. Is there one out there someone can recommend that is certified to work with 6? At this point I am willing to try anything to get this thing stable. I have a very angry household on my hands!
pickthenimp Posted September 9, 2015

Upgraded to 6.1.2. Hoping this magically improves my situation. Can anyone help me with upgrading my SATA controller? I am starting to wonder if unRAID 6 does not support my RAID card. Is there one out there someone can recommend that is certified to work with 6? I posted in the storage devices and controllers subforum looking for advice as well, but no bites. Does anyone know if this card works with unRAID 6.x? http://www.amazon.com/IO-Crest-Controller-Non-Raid-SI-PEX40064/dp/B00AZ9T3OU
pickthenimp Posted September 9, 2015

Shares went down again. This time I noticed my nzbget docker stopped responding and I was unable to restart it. The logs for that docker are below:

2015-09-07 14:28:09,831 DEBG fd 14 closed, stopped monitoring (stderr)>
2015-09-07 14:28:09,831 INFO success: start entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2015-09-07 14:28:09,831 DEBG fd 9 closed, stopped monitoring (stdout)>
2015-09-07 14:28:09,831 INFO exited: start (exit status 0; expected)
2015-09-07 14:28:09,831 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:09,950 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:09,951 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:09,951 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:09,951 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:10,954 INFO spawned: 'nzbget' with pid 16
2015-09-07 14:28:10,967 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:10,967 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:10,967 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:10,967 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:10,975 INFO reaped unknown pid 17
2015-09-07 14:28:10,975 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:12,979 INFO spawned: 'nzbget' with pid 18
2015-09-07 14:28:12,994 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:12,994 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:12,994 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:12,994 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:13,001 INFO reaped unknown pid 19
2015-09-07 14:28:13,002 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:16,007 INFO spawned: 'nzbget' with pid 20
2015-09-07 14:28:16,020 DEBG fd 8 closed, stopped monitoring (stderr)>
2015-09-07 14:28:16,020 DEBG fd 6 closed, stopped monitoring (stdout)>
2015-09-07 14:28:16,020 INFO exited: nzbget (exit status 0; not expected)
2015-09-07 14:28:16,020 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:16,024 INFO gave up: nzbget entered FATAL state, too many start retries too quickly
2015-09-07 14:28:16,024 INFO reaped unknown pid 21
2015-09-07 14:28:16,024 DEBG received SIGCLD indicating a child quit
2015-09-07 14:28:33,247 WARN received SIGTERM indicating exit request
2015-09-07 14:37:39,316 CRIT Set uid to user 0
2015-09-07 14:37:39,316 WARN Included extra file "/etc/supervisor/conf.d/nzbget.conf" during parsing
2015-09-07 14:37:39,343 INFO supervisord started with pid 1
2015-09-07 14:37:40,345 INFO spawned: 'nzbget' with pid 9
2015-09-07 14:37:40,359 INFO spawned: 'start' with pid 10
2015-09-07 14:37:40,375 DEBG 'start' stdout output:

Full diagnostics attached.

nas-diagnostics-20150909-1648.zip
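If anyone wants more context, I can also grab something like this the next time it happens, to see whether the cache/docker image has flipped to read-only and what state the container is in (the container name here is just a guess at what mine is called):

mount | grep -iE 'cache|docker'    # look for an unexpected "ro" flag
docker ps -a                       # is the container exited or restart-looping?
docker logs --tail 50 nzbget       # container name is an assumption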
RobJ Posted September 10, 2015

Well, the good news is I don't see anything that's hardware related this time (unless there's a memory issue - have you tested your RAM lately?). Everything was running fine until 4:34pm, when suddenly the Reiser file system on the Cache drive was corrupted and was changed to read-only, which brought everything from nzbget to a stop. That's what probably caused all those messages you saw. So the real problem is why the Reiser file system was corrupted. It just doesn't happen like that normally; even buggy software can't do that normally. The two possible reasons are that there was already hidden corruption on the Cache drive, or you have bad memory chips. I recommend running Check Disk Filesystems on the Cache drive (I know you're tired of that), and running several passes of Memtest (from the boot menu).
pickthenimp Posted September 10, 2015

I recommend running Check Disk Filesystems on the Cache drive (I know you're tired of that), and running several passes of Memtest (from the boot menu).

Thanks for continuing to look at my issue, Rob. So I ran a filesystem check via the web GUI again on the cache drive with the array in maintenance mode, and still no errors (see below). I have ruled out my SATA controller, as I borrowed one from a friend's backup system and it still crashed last night. Let's say it is my cache drive - would corruption on that force my whole system to hang? Besides buying a new one, would wiping it and re-adding it be beneficial? I will try to run a memtest for more than 24 hours next...

Replaying journal: Done.
Reiserfs journal '/dev/sdg1' in blocks [18..8211]: 451 transactions replayed
Checking internal tree.. finished
Comparing bitmaps..finished
Checking Semantic tree: finished
No corruptions found
There are on the filesystem:
Leaves 8290
Internal nodes 54
Directories 392
Other files 8266
Data block pointers 6856347 (2683087 of them are zero)
Safe links 0
###########
reiserfsck finished at Thu Sep 10 07:28:45 2015
###########
RobJ Posted September 10, 2015

There's no way corruption can be detected and then not found by reiserfsck, the way it's happening on your system. So the only interesting question right now is what Memtest will find! I rather think I already know! And if the memory does check out, then you have a bad motherboard, because something is corrupting internal data handling.
pickthenimp Posted September 10, 2015

So the only interesting question right now is what Memtest will find! I rather think I already know! And if the memory does check out, then you have a bad motherboard, because something is corrupting internal data handling.

So far, 1 pass of memtest and no errors. I'll let it keep running. I just find it odd that my motherboard would go bad the day I upgraded to 6.0.
RobJ Posted September 10, 2015

v6 is 64-bit with virtualization support. Your BIOS is from 2010; see if you can update that - it could make a difference.
pickthenimp Posted September 11, 2015

v6 is 64-bit with virtualization support. Your BIOS is from 2010; see if you can update that - it could make a difference.

Updated to the latest BIOS (now from 2011). 12 passes of memtest and no errors. Fingers crossed...
pickthenimp Posted September 15, 2015

Same sad story. I decided to turn my mover script back on since I had gone 3 days without a crash. I woke up this morning to "Mover is running". Unable to hit shares. Cannot cleanly reboot. The syslog has no useful info, but it is attached anyway, along with a screenshot of htop in case that is useful. It looks like one of my CPU cores has spiked to 100%. See attached. Is it time to buy a new motherboard?

nas-diagnostics-20150915-0644.zip
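Before I give up on the board: would it be worth capturing the kernel's view of the hang next time? My understanding is that something like the following would record where the stuck tasks are blocked even when the syslog is otherwise quiet (the PID is a placeholder for the spinning shfs/rsync process, and SysRq has to be enabled for the second line to do anything):

cat /proc/<PID>/stack            # kernel stack of the stuck task
echo w > /proc/sysrq-trigger     # dump all blocked (D-state) tasks to the syslog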
RobJ Posted September 15, 2015

As you said, nothing in the syslog. Mover starts normally, transfers some files without issues, then nothing - no errors - and the drives spin down. There was a case not too long ago where the memory tested fine on a long test with many passes, but someone (Tom, I think) said that Memtest doesn't catch everything, so the user replaced their memory sticks - and had no more problems! Obviously, this is a shot in the dark, and an expensive one too. htop says the CPU is stuck in the user shares file system. I'm not sure, but there seem to be quite a few threads working on the user share file system! More than I would expect, but I don't have your Dockers, so I can't say if that's abnormal.