NFS dies during extended rsync RC5 - RC6


Recommended Posts

I have a backup process that has operated successfully for several months on all flavors of beta v5.  Now suddenly after upgrading 2 backup servers to rc6-8168, nfs dies.  Server1 (v5-rc5) is backed up to tower1 (v5-rc6) and to tower2 (v5-rc6) via unmenu script from tower1 or tower2.  :-[

 

Here is the relevant log entries. 

 

Jul 14 07:58:56 Tower2 kernel: nfs: server server1 not responding, still trying
Jul 14 07:58:56 Tower2 last message repeated 18 times

 

The following unmenu userscript was being executed and fails about midway through.  It would always die on disk2, so I commented out disk2, and now it failed on disk4.  This repeated 4 times with both tower1 and tower2.  When it fails, unmenu locks up,  the server still responds via the normal web page, and I can still get putty to telnet in.  Only way to clean it up, is to use top to identify the 3 rsync pid, and then kill them via telnet.  Once I do that, I can reboot the server via the standard web interface.

 

Server1 (V5-rc5) Supermicro X9SCM-F / Intel i3-2100 and 4 gig RAM / Supermicro AOC-SASLP-MV8 8-Port

Norco 4224 / mix of 3tb and 2tb drives

 

Other 2 are identical AMD based : Tower1 and Tower2 (v5-rc6-8168)

Asus m4a 785 m / AMD 140 / 2 gig RAM

 

#define USER_SCRIPT_LABEL Sync to Server1
#define USER_SCRIPT_DESCR Once the server is up this should work.  Start the array first. 
#UNMENU_RELEASE $Revision: 7 $ $Date: 2010-03-29 23:12:23 -0400 (Mon, 29 Mar 2010) $
echo "Starting Sync to Server1"
cd /boot
# Backup from production server to 2nd level backup via cron and log results
# version .11

# Set up email header
cd /boot
echo To: [email protected] >> /boot/logs/cronlogs/t1summary.log
echo From: [email protected] >> /boot/logs/cronlogs/t1summary.log
echo Subject: Tower1 rsync summary >> /boot/logs/cronlogs/t1summary.log
echo   >> /boot/logs/cronlogs/t1summary.log
echo Log of rSync backup >> /boot/logs/cronlogs/t1summary.log

# Set up mounts
cd /boot
othermounts.sh

echo ++ >> /boot/logs/cronlogs/t1summary.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1summary.log
echo ++ >> /boot/logs/cronlogs/t1summary.log
# Backup Disk 1
echo ++++++  >> /boot/logs/cronlogs/t1disk1.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1disk1.log
echo ++++++  >> /boot/logs/cronlogs/t1disk1.log
date
echo ====== moving to Disk1 ===========
echo ====== moving to Disk1 =========== >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk1.log

rsync -av --stats --progress /mnt/s1disk1/ /mnt/disk1/  >> /boot/logs/cronlogs/t1disk1.log

date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk1.log

echo == >> /boot/logs/cronlogs/t1summary.log
echo ====== ending daily cron now Disk1 =========== >> /boot/logs/cronlogs/t1summary.log
echo == >> /boot/logs/cronlogs/t1summary.log

# Backup Disk 2
echo ++++++  >> /boot/logs/cronlogs/t1disk2.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1disk2.log
echo ++++++  >> /boot/logs/cronlogs/t1disk2.log
date
echo ====== moving to Disk2 ===========
echo ====== moving to disk2 =========== >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk2.log

# rsync -av --stats --progress /mnt/s1disk2/ /mnt/disk2/  >> /boot/logs/cronlogs/t1disk2.log

date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk2.log

echo == >> /boot/logs/cronlogs/t1summary.log
echo ====== ending daily cron now Disk2 =========== >> /boot/logs/cronlogs/t1summary.log
echo == >> /boot/logs/cronlogs/t1summary.log

# Backup Disk 3
echo ++++++  >> /boot/logs/cronlogs/t1disk3.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1disk3.log
echo ++++++  >> /boot/logs/cronlogs/t1disk3.log
date
echo ====== moving to Disk3 ===========
echo ====== moving to disk3 =========== >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk3.log

rsync -av --stats --progress /mnt/s1disk3/ /mnt/disk3/  >> /boot/logs/cronlogs/t1disk3.log

date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk3.log

echo == >> /boot/logs/cronlogs/t1summary.log
echo ====== ending daily cron now disk3 =========== >> /boot/logs/cronlogs/t1summary.log
echo == >> /boot/logs/cronlogs/t1summary.log

# Backup Disk 4
echo ++++++  >> /boot/logs/cronlogs/t1disk4.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1disk4.log
echo ++++++  >> /boot/logs/cronlogs/t1disk4.log
date
echo ====== moving to Disk4 ===========
echo ====== moving to disk4 =========== >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk4.log

rsync -av --stats --progress /mnt/s1disk4/ /mnt/disk4/  >> /boot/logs/cronlogs/t1disk4.log

date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk4.log

echo == >> /boot/logs/cronlogs/t1summary.log
echo ====== ending daily cron now disk4 =========== >> /boot/logs/cronlogs/t1summary.log
echo == >> /boot/logs/cronlogs/t1summary.log

# Backup Disk 5
echo ++++++  >> /boot/logs/cronlogs/t1disk5.log
echo ++++++ starting daily cron now ++++++++++ >> /boot/logs/cronlogs/t1disk5.log
echo ++++++  >> /boot/logs/cronlogs/t1disk5.log
date
echo ====== moving to Disk5 ===========
echo ====== moving to disk5 =========== >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk5.log

rsync -av --stats --progress /mnt/s1disk5/ /mnt/disk5/  >> /boot/logs/cronlogs/t1disk5.log

date >> /boot/logs/cronlogs/t1summary.log
date >> /boot/logs/cronlogs/t1disk5.log

echo == >> /boot/logs/cronlogs/t1summary.log
echo ====== ending daily cron now disk5 =========== >> /boot/logs/cronlogs/t1summary.log
echo == >> /boot/logs/cronlogs/t1summary.log

rsync -av --stats --progress /mnt/s1disk2/ /mnt/disk2/  >> /boot/logs/cronlogs/t1disk2.log

# Send email of summary of results
ssmtp [email protected] < /boot/logs/cronlogs/t1summary.log
cd /boot/logs/cronlogs  
mv t1summary.log "`date +%Y%m%d_%H%M`_t1summary.logarchive"

syslog-20120714-081009.zip

Link to comment

As soon as I updated server1 to rc6-8168 all problems went away.  Before this I hadn't had an interoperability problems, but had never run a 8168 release.  b12a, b13, b14, rc1, rc2, rc3, rc4, and rc5 all worked together fine with this same rsync script.

 

Wonder why this rc6-8168 version doesn't play well with rc5?

Link to comment

rsync tends to pretty well saturate a link but I don't see link problems in your log. Is this log from one of the times your system failed? Have you seen anything interesting on the console? A tail -f /var/log/syslog at the console can be very revealing if telnet sessions aren't showing you anything.

 

Is there a difference in the contents of the disks that might contribute? When I was still fighting realtek nics in early versions of 5 I often found differences in the failures depending on the files I was rsyncing. Many small files had a better chance than large files. Lots of small files require more work by the cpu & disk systems while few large files are mostly waiting on the network. Of course my issues were the basic dropped packets & overruns so YMMV.

Link to comment
Wonder why this rc6-8168 version doesn't play well with rc5?

 

In the same way as I have problems with nfs tcp, I suspect that these problems occur when one of the network devices is running faster, or more efficiently, than the other.  This would suggest that there is a timing problem somewhere within the system code.

Link to comment

Wonder why this rc6-8168 version doesn't play well with rc5?

 

In the same way as I have problems with nfs tcp, I suspect that these problems occur when one of the network devices is running faster, or more efficiently, than the other.  This would suggest that there is a timing problem somewhere within the system code.

 

And it seems to be related to this realtek 8168 version.  Server1 being intel based should be totally unaffected, but tower1 and tower2 are realtek based.

 

rsync tends to pretty well saturate a link but I don't see link problems in your log. Is this log from one of the times your system failed? Have you seen anything interesting on the console? A tail -f /var/log/syslog at the console can be very revealing if telnet sessions aren't showing you anything.

 

Is there a difference in the contents of the disks that might contribute? When I was still fighting realtek nics in early versions of 5 I often found differences in the failures depending on the files I was rsyncing. Many small files had a better chance than large files. Lots of small files require more work by the cpu & disk systems while few large files are mostly waiting on the network. Of course my issues were the basic dropped packets & overruns so YMMV.

 

Yes, this log was from one of the times the tower2 system failed.  Server1 remained active and available throughout.  No console activity was checked as tower1 was running headless.  I too was thinking it was running into a file that was causing the issue.  Moving the disk2 sync later in the process killed that thought.  Files being synced here vary between 2meg and 1gig.

 

This is not such a big issue for me as it is for Tom, and what is happening in the code behind the scenes....  Once all computers are running the rc6 version, all problems disappeared.

Link to comment

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.