WD20EARS "Freezing" on File copies during rebuild?



I was previously running with a WD 2TB EADS parity drive and Seagate 1.5TB Data drives. I had picked up a WD 2TB EARS drive to try out. I put the jumper on 7-8 and it is starting the rebuild.

 

I'm not sure if this is valid or not, but I'm also trying to copy some files to the drive being rebuilt. The copy hangs and never completes.

 

Is that even possible, or do I have to wait for the array to finish rebuilding before copying more files onto it? I've been able to copy some files over to the drive being rebuilt, but other times it just hangs. Since the files are exported while the drive still has to rebuild, I'm not sure what I'm doing is even valid.

 

Can I move back to the 1.5TB drive? If so, how do I do that? When I tried to put the old drive back in it said I had to have a bigger drive than the failed drive.

Link to comment

It is hanging up and not completing when I try to do so...

Abeta, you should know betta'

When asking such a question, always attach a syslog.

 

There wasn't anything in the syslog when the copy was freezing. No new messages at all. The last messages were the startup ones where it synchronized with the ftp log, etc. This has happened a few times.

 

TeraCopy will eventually just give up, without anything that looks wrong from unRAID's side that I can see. During the same rebuild it has worked sometimes and failed other times.

 

Here are the last lines of the log, with the full log attached:

 

Mar 11 20:01:06 Serenity kernel: mdcmd (18): check

Mar 11 20:01:06 Serenity kernel: md: recovery thread woken up ...

Mar 11 20:01:06 Serenity kernel: md: recovery thread rebuilding disk3 ...

Mar 11 20:01:06 Serenity emhttp: shcmd (12): mount -t reiserfs -o noacl,nouser_xattr,noatime,nodiratime /dev/md3 /mnt/disk3 >/dev/null 2>&1

Mar 11 20:01:06 Serenity kernel: md: using 1152k window, over a total of 1953514552 blocks.

Mar 11 20:01:06 Serenity kernel: REISERFS (device md3): found reiserfs format "3.6" with standard journal

Mar 11 20:01:06 Serenity kernel: REISERFS (device md3): using ordered data mode

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): found reiserfs format "3.6" with standard journal

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): using ordered data mode

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): found reiserfs format "3.6" with standard journal

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): using ordered data mode

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): found reiserfs format "3.6" with standard journal

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): using ordered data mode

Mar 11 20:01:06 Serenity kernel: REISERFS (device md3): journal params: device md3, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Mar 11 20:01:06 Serenity kernel: REISERFS (device md3): checking transaction log (md3)

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): journal params: device md4, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): checking transaction log (md4)

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): journal params: device md2, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): checking transaction log (md2)

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): journal params: device md1, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): checking transaction log (md1)

Mar 11 20:01:06 Serenity sshd[2215]: Server listening on 0.0.0.0 port 22.

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): replayed 3 transactions in 0 seconds

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): replayed 2 transactions in 0 seconds

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): replayed 2 transactions in 0 seconds

Mar 11 20:01:06 Serenity init: Re-reading inittab

Mar 11 20:01:06 Serenity kernel: REISERFS (device md1): Using r5 hash to sort names

Mar 11 20:01:06 Serenity kernel: REISERFS (device md2): Using r5 hash to sort names

Mar 11 20:01:06 Serenity kernel: REISERFS (device md4): Using r5 hash to sort names

Mar 11 20:01:07 Serenity emhttp: shcmd (15): cp /var/spool/cron/crontabs/root- /var/spool/cron/crontabs/root

Mar 11 20:01:07 Serenity emhttp: shcmd (16): echo '# Generated mover schedule:' >>/var/spool/cron/crontabs/root

Mar 11 20:01:07 Serenity emhttp: shcmd (17): echo '40 3 * * * /usr/local/sbin/mover 2>&1 | logger' >>/var/spool/cron/crontabs/root

Mar 11 20:01:07 Serenity emhttp: shcmd (18): crontab /var/spool/cron/crontabs/root

Mar 11 20:01:09 Serenity apcupsd[1677]: NIS server startup succeeded

Mar 11 20:01:09 Serenity apcupsd[1677]: apcupsd 3.14.3 (20 January 2008) slackware startup succeeded

Mar 11 20:01:10 Serenity kernel: REISERFS (device md3): replayed 4 transactions in 4 seconds

Mar 11 20:01:10 Serenity kernel: REISERFS (device md3): Using r5 hash to sort names

Mar 11 20:01:10 Serenity emhttp: shcmd (20): rm /etc/samba/smb-shares.conf >/dev/null 2>&1

Mar 11 20:01:10 Serenity emhttp: _shcmd: shcmd (20): exit status: 1

Mar 11 20:01:10 Serenity emhttp: shcmd (21): cp /etc/exports- /etc/exports

Mar 11 20:01:10 Serenity emhttp: shcmd (22): mkdir /mnt/user

Mar 11 20:01:10 Serenity emhttp: shcmd (23): /usr/local/sbin/shfs /mnt/user -o noatime,big_writes,allow_other,default_permissions

Mar 11 20:01:11 Serenity emhttp: shcmd (24): killall -HUP smbd

Mar 11 20:01:11 Serenity emhttp: shcmd (25): /etc/rc.d/rc.nfsd restart | logger

Mar 11 20:01:36 Serenity ntpd[1430]: synchronized to 209.237.247.192, stratum 3

Mar 11 20:01:36 Serenity ntpd[1430]: time reset -0.276106 s

Mar 11 20:02:38 Serenity sshd[3589]: error: Could not get shadow information for root

Mar 11 20:02:38 Serenity sshd[3589]: Accepted password for root from 192.168.1.4 port 51713 ssh2

Mar 11 20:02:38 Serenity sshd[3593]: lastlog_filetype: Couldn't stat /var/log/lastlog: No such file or directory

Mar 11 20:02:38 Serenity sshd[3593]: lastlog_openseek: /var/log/lastlog is not a file or directory!

Mar 11 20:02:38 Serenity sshd[3593]: lastlog_filetype: Couldn't stat /var/log/lastlog: No such file or directory

Mar 11 20:02:38 Serenity sshd[3593]: lastlog_openseek: /var/log/lastlog is not a file or directory!

Mar 11 20:09:57 Serenity ntpd[1430]: synchronized to 209.237.247.192, stratum 3

Mar 12 06:28:24 Serenity kernel: mdcmd (6537): spindown 4

 

Total lines: 773

unRAID-Syslog.txt

Link to comment

A quick glance at your syslog didn't reveal anything funny going on. 

 

Three suggestions come to mind though. First, switch to the CFQ scheduler. CFQ works much smarter than the default NOOP that you have there.

Second, stop disabling the NCQ. They fixed that bug many kernels ago.  And third, try setting 'max_sectors_kb' for the disks to 128:

for i in /sys/block/[hs]d? ; do echo 128 > $i/queue/max_sectors_kb ; done 2>/dev/null

With the above three tweaks, I've successfully eliminated similar freezes on my system.

They may or may not fix your problem, but it's worth the try. Please report back how it goes.
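
Before changing anything, it may help to see what each disk is currently using. The following is only a rough sketch, assuming the usual sysfs layout under /sys/block; the function name show_queue_settings and its optional directory argument are made up for illustration and are not part of unRAID.

```shell
# Print the I/O scheduler and max_sectors_kb for every sd?/hd? disk.
# The optional first argument (a hypothetical override, handy for testing)
# replaces the default /sys/block directory.
show_queue_settings() {
    local root="${1:-/sys/block}"
    local dev name
    for dev in "$root"/[hs]d?; do
        # Skip anything without a queue directory (also covers an unmatched glob)
        [ -d "$dev/queue" ] || continue
        name="${dev##*/}"
        # The kernel marks the active scheduler with [brackets]
        printf '%s scheduler=%s max_sectors_kb=%s\n' \
            "$name" \
            "$(cat "$dev/queue/scheduler" 2>/dev/null)" \
            "$(cat "$dev/queue/max_sectors_kb" 2>/dev/null)"
    done
}
```

Running show_queue_settings with no arguments from a telnet or ssh session lists one line per disk, so the values can be compared before and after the tweaks.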

 

Link to comment

These are the default settings in unRAID 4.5.3. How do I implement your suggestions? I think I can use the script in unmenu to do CFQ. How do I do the second two? I see a disk read-ahead 256 in unmenu? Thx!

 

Link to comment

The read-ahead in unmenu has nothing to do with what I'm talking about.

 

For #1: Add the following boot option to the 'append' line in your 'syslinux.cfg' file: elevator=cfq

The unRAID section in your 'syslinux.cfg' should look something like this:

label unRAID OS
 menu default
 kernel bzimage
 append  elevator=cfq  initrd=bzroot

NOTE: Make sure that you use a text editor that can do Linux-style line endings!  Read this:

http://lime-technology.com/wiki/index.php?title=FAQ#Why_do_my_scripts_have_problems_with_end-of-lines.3F
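
If you are not sure whether your editor left DOS-style line endings behind, something like the following can check a file and clean it up. This is a sketch only; the helper names has_crlf and strip_cr are hypothetical, and it assumes the standard grep, tr, and mv tools are available on the server.

```shell
# Return success (0) if the file contains at least one carriage return.
has_crlf() {
    local cr
    cr=$(printf '\r')          # a literal CR character, built portably
    grep -q "$cr" "$1"
}

# Remove every carriage return from the file, replacing it in place.
strip_cr() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}
```

For example, `has_crlf /boot/syslinux.cfg && strip_cr /boot/syslinux.cfg` would only rewrite the file when it actually needs it. The /boot path here is an assumption about where the flash drive is mounted; adjust it to wherever your 'syslinux.cfg' and 'go' script live.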

 

For #2: Go to the unRAID management web page, go to tab "Settings", and near the bottom of that page, in "Disk Settings" set the "Force NCQ disabled" to No.

 

For #3: Put that line in your 'go' script:

for i in /sys/block/[hs]d? ; do echo 128 > $i/queue/max_sectors_kb ; done 2>/dev/null

 

After you've done #1, #2, and #3, reboot your server.  Then see if you can reproduce the problem.  Let us know how it goes.
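
If you want the 'go' script to tell you whether tweak #3 actually took effect, a slightly more talkative variant of the one-liner might look like this. It is a sketch under the same assumption about the /sys/block layout; the function name set_max_sectors_kb and its second argument are invented for illustration.

```shell
# Write a new max_sectors_kb value to every sd?/hd? disk and report the result.
# $1 = value in KB, $2 = optional sysfs directory (hypothetical, for testing).
set_max_sectors_kb() {
    local value="$1" root="${2:-/sys/block}"
    local dev target
    for dev in "$root"/[hs]d?; do
        target="$dev/queue/max_sectors_kb"
        # Skip disks whose setting is missing or not writable
        [ -w "$target" ] || continue
        if echo "$value" > "$target" 2>/dev/null; then
            # Read the value back so the log shows what the kernel accepted
            echo "${dev##*/}: max_sectors_kb=$(cat "$target")"
        else
            echo "${dev##*/}: could not set max_sectors_kb to $value" >&2
        fi
    done
}
```

In the 'go' script you would define the function and then call `set_max_sectors_kb 128`, optionally piping the output to logger so it ends up in the syslog.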

 

Link to comment

Thanks for the steps. I can definitely try to configure those steps tonight...but my data rebuild finished this morning.

 

How do I "fake" out unRAID and convince it the drive is new again, so that it rebuilds and I can test? If this is a corner case and not really supported, it won't matter, I guess; I just need to make sure I wait for a complete rebuild before trying to copy files to the drive being rebuilt.

Link to comment

How do I "fake" out unRAID and convince it the drive is new again and make it rebuild for me to test? For example, if this is a corner case and not really supported it won't matter I guess

 

It's not like I've observed your particular "corner case" before.

So I would rather describe it somewhat more generally.

Something like, unRAID freezing on high volume I/O, on your hardware.

 

How to simulate that? You could start a couple of preclear scripts, start a large copy with 'mc' between disks within the server, and then on top of all that start a large Samba copy from a Windows computer to the unRAID server. While all that is going on, see how well you can browse your shared disks. If all that doesn't choke it, then you're good to go. But if it freezes on you, then apply the tweaks mentioned above and see if they make a difference.
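
For a scripted approximation of that kind of load, here is a rough sketch. It is not a substitute for preclear or a real Samba copy; it only starts a few background dd writers in a directory and times a directory listing while they run. The function name io_stress and its arguments are hypothetical.

```shell
# Start several background writers in a directory and time a listing meanwhile.
# $1 = target directory, $2 = number of writers (default 2),
# $3 = size of each file in MB (default 1024).
io_stress() {
    local dir="$1" writers="${2:-2}" mb="${3:-1024}"
    local i start end
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    i=1
    while [ "$i" -le "$writers" ]; do
        # Each writer streams zeros into its own scratch file (1 MiB blocks)
        dd if=/dev/zero of="$dir/stress_$i.bin" bs=1048576 count="$mb" 2>/dev/null &
        i=$((i + 1))
    done
    start=$(date +%s)
    ls -lR "$dir" > /dev/null      # stands in for "browsing" the disk
    end=$(date +%s)
    echo "listing took $((end - start))s with $writers writers running"
    wait                           # let the writers finish
    rm -f "$dir"/stress_*.bin      # remove the scratch files
}
```

Pointing it at a scratch directory on an array disk, with a file size large enough to run for several minutes, gives a repeatable load to compare before and after the tweaks. Make sure the disk has enough free space for the scratch files.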

 

Link to comment

Second, stop disabling the NCQ. They fixed that bug many kernels ago. 

 

The default in 4.5.3 is to have NCQ disabled. Should we be enabling this even if not seeing any issues? What benefits will we see with it enabled and are there any potential issues if using it?

Link to comment

Google NCQ, test it both ways, and figure it out.

 

Link to comment

I've been reading quite a bit about it, and it sounds like whether one would see benefits is not straightforward. I'm just curious whether there could be a benefit in an unRAID environment that mostly streams large files. I'll most certainly try some testing. Your response earlier in the thread recommending to stop disabling it got me wondering if it is something I should be considering.

Link to comment

Hello OP, your problem is not limited to your HDD. I am using a Hitachi Deskstar and getting the same problem during the initial parity resync: I start a copy, which slows down, then just stops and times out.

 

Hmmmm.....so in your case it isn't restricted to the drive being rebuilt. I'm pretty sure I've done this before either on a parity resync or to a drive other than the one being rebuilt but possibly not with unRAID 4.5.3.

Link to comment

Hmmmm.....so in your case it isn't restricted to the drive being rebuilt.

As I suggested earlier, the freezes and the time-outs are not restricted to your particular "corner case", as you called it.

 

Another useful fix (in addition to the three tweaks in my earlier post) is to increase Samba's "max open files" to 16500 as described in this post:

http://lime-technology.com/forum/index.php?topic=5004.msg46112#msg46112

 

Also, when you need to copy large amounts of data from a Windows machine to your unRAID server, TeraCopy does a much better job than Windows Explorer.

 

Link to comment

I've seen the Samba file problem, and it mostly goes away with TeraCopy, but in my specific case it's copying a single file that has been timing out for me :). I've put the changes you mentioned in but haven't been able to replicate the problem. I might start a parity check and try copying files as well to see how it behaves.

Link to comment

I've put the changes you mentioned in but haven't been able to replicate the problem.

Instead of 'but' you mean 'and', right?

 

They mean the same thing to me, but to clarify: I have not been able to replicate the problem at this time, but I also have not been able to do exactly what was causing my problem in the first place, namely copying files to the drive while it was being rebuilt.

Link to comment