NFS stale file handle


2) When writing to unRAID, nfs can report a 'stale nfs handle'.  I think that rsync does not generate the conditions which provoke this error.

 

I completely forgot about this.  I switched all my connectivity to SMB because nfs was so unreliable in the previous betas.

 

Just testing now and I'm still seeing the problem, though it's not as severe as it once was.  If I create a new file in an existing directory, then doing an ls causes the 'stale nfs handle' message.  Moving out of the directory and back in again corrects the issue.  Previously, ls always gave the stale nfs message.

Link to comment
  • 2 weeks later...

Copied from RC2 thread:

 

I don't know whether this provides any clues, but here is a sequence of events on my Ubuntu desktop machine, after I encountered a 'Stale nfs file handle':

 

While I was browsing around my Photos share, Nautilus became unresponsive (but there was no error message).

 

I used telnet to connect to my unRAID server and performed 'ls -l' on both the Photos directory and the 'user' parent:

root@Tower:~# ls -l /mnt/user/Photos
total 161248
drwxrwx--- 1 nobody users      4360 2011-06-20 01:11 100OLYMP1/
drwxrwx--- 1 nobody users       496 2003-10-26 15:42 100OLYMP2/
drwxrwx--- 1 nobody users       120 2012-01-27 18:00 101029/
drwxr-xr-x 1 nobody users        72 2012-01-27 14:00 110324/
drwxr-xr-x 1 nobody users        72 2012-01-27 14:00 110830/
drwxr-xr-x 1 nobody users        72 2012-01-27 14:00 111004/
drwxr-xr-x 1 nobody users        72 2012-01-27 14:00 111007/
drwxr-xr-x 1 nobody users       336 2011-10-17 20:07 111007YNL/
drwxrwxr-x 1 nobody users       592 2011-10-25 00:29 111022Baking/
drwxrwxr-x 1 nobody users        96 2012-01-27 18:00 111106/
drwxrwxr-x 1 nobody users       136 2012-01-27 14:00 111107_MYF_CrocPark/
drwxrwxr-x 1 nobody users       104 2012-01-27 14:00 111209_Bukidnon/
drwxrwxr-x 1 nobody users       136 2012-01-27 18:00 120108_CDO/
drwxrwxr-x 1 nobody users        48 2012-05-07 18:20 120507/
drwx------ 1 nobody users       720 2012-01-27 14:00 Ai-Ai\ Graduation/
drwxr-xr-x 1 nobody users      2224 2012-01-27 14:00 Import/
drwxr-xr-x 1 nobody users       240 2012-01-27 14:00 Methodist\ Youth\ Surigao/
drwxr-xr-x 1 nobody users       208 2011-06-20 01:11 Ruby\ in\ UK/
-rw-r--r-- 1 nobody users   3072000 2012-01-27 16:06 digikam4.db
-rw-r--r-- 1 nobody users 161874944 2012-01-27 16:06 thumbnails-digikam.db
root@Tower:~# ls -l /mnt/user
total 13024031
drwxr-xr-x 1 nobody              users         48 2011-10-17 19:24 111007YNL/
drwxrwx--- 1 nobody              users        424 2010-09-08 21:42 Athlon/
drwxr-xr-x 1 nobody              users      20352 2011-12-08 08:30 Downloaded\ Files/
-rw-rw---- 1 nobody              users 6542697514 2010-02-05 18:20 LoveStory_DVD.mkv
drwxrwx--- 1 nobody              users        128 2012-03-25 08:01 Maildir/
drwxrwx--- 1 nobody              users       6912 2012-05-06 16:14 Movies/
drwxrwx--- 1 nobody              users        384 2012-04-08 18:00 Music/
-rw-rw---- 1 nobody              users 2088899096 2010-02-07 01:20 NOTTING\ HILL.mkv
-rw-rw---- 1 nobody              users 4691562496 2010-06-13 21:05 National\ Treasure\ 2004\ 720p.avi
drwxrwxr-x 1 nobody              users        504 2011-12-27 08:50 Pete's\ N97/
drwxrwxr-x 1 nobody              users         72 2012-05-07 18:20 Photos/
drwxrwx--- 1 nobody              users        296 2011-09-14 08:08 Series/
drwxrwx--- 1 nobody              users        176 2012-04-08 07:55 Squeeze/
drwxr-xr-x 1 logitechmediaserver users       1392 2012-04-08 16:23 Squeeze-7.7.2/
drwxr-xr-x 1 root                root         480 2012-03-30 12:32 Temporary/
drwxrwxr-x 1 nobody              users        520 2011-11-12 19:58 UMC/
drwxrwx--- 1 nobody              users        600 2010-09-04 20:22 UMC2.07/
drwxrwx--- 1 nobody              users        600 2010-10-16 09:13 UMC2.08.1/
drwxrwxrwx 1 nobody              users        600 2011-06-07 10:42 UMC2.08.1x/
drwxr-xr-x 1 nobody              users        496 2011-09-13 08:41 UMC2.10/
drwxrwxr-x 1 nobody              users        584 2012-02-15 22:54 UMC2.11/
drwxrwx--- 1 nobody              users        696 2011-06-06 08:34 UMCold/
drwxrwx--- 1 nobody              users        504 2012-01-02 09:01 Videos/
drwxrwxr-x 1 nobody              users        168 2011-11-14 10:32 Wii/
drwxrwx--- 1 nobody              users        576 2011-10-18 13:54 Work/
drwxr-xr-x 1 root                root         504 2012-04-08 22:45 XFarG7/
drwxrwx--- 1 nobody              users         72 2010-10-29 02:18 ZA30/
drwxrwx--- 1 nobody              users        864 2010-10-29 01:21 ZK10/
drwxrwx--- 1 nobody              users        120 2010-10-25 15:36 ZVideo/
drwxrwx--- 1 nobody              users        184 2010-10-29 02:14 ZVideo2/
drwxrwx--- 1 nobody              users         72 2010-10-19 23:43 mediaserver/
drwxrwx--- 1 nobody              users        112 2011-02-26 13:12 mp3/
-rw-r--r-- 1 root                root       18020 2012-03-30 12:32 nolimetangere.odt
-rw-r--r-- 1 root                root      342926 2012-03-30 12:30 output.pdf
-rw-r--r-- 1 nobody              users      30312 2012-01-11 00:35 pro.odt
drwxrwx--- 1 nobody              users         80 2011-09-15 07:19 series/
root@Tower:~# 

 

I then looked at the same directories from Ubuntu:

peter@desktop:~$ ls -l /net/tower/mnt/user
ls: cannot access /net/tower/mnt/user/Photos: Stale NFS file handle
total 0
drwxr-xr-x 2 root root 0 May  7 18:19 Movies
drwxr-xr-x 2 root root 0 May  7 18:19 Music
d??? ? ?    ?    ?            ? Photos
drwxr-xr-x 2 root root 0 May  7 18:19 series
drwxr-xr-x 2 root root 0 May  7 18:19 Series
drwxr-xr-x 2 root root 0 May  7 18:19 UMC
drwxr-xr-x 2 root root 0 May  7 18:19 Videos
peter@desktop:~$ ls -l /net/tower/mnt/user/Photos
ls: cannot access /net/tower/mnt/user/Photos: Stale NFS file handle
peter@desktop:~$ sudo umount -f /net/tower/mnt/user/Photos
[sudo] password for peter: 
peter@desktop:~$ ls -l /net/tower/mnt/user
total 0
drwxr-xr-x 2 root root   0 May  7 18:19 Movies
drwxr-xr-x 2 root root   0 May  7 18:19 Music
drwxrwxr-x 1   99 users 72 May  7 18:20 Photos
drwxr-xr-x 2 root root   0 May  7 18:19 series
drwxr-xr-x 2 root root   0 May  7 18:19 Series
drwxr-xr-x 2 root root   0 May  7 18:19 UMC
drwxr-xr-x 2 root root   0 May  7 18:19 Videos
peter@desktop:~$ ls -l /net/tower/mnt/user
total 8
drwxrwx--- 1   99 users 6912 May  6 16:14 Movies
drwxrwx--- 1   99 users  384 Apr  8 18:00 Music
drwxrwxr-x 1   99 users   72 May  7 18:20 Photos
drwxrwx--- 1   99 users  296 Sep 14  2011 series
drwxr-xr-x 2 root root     0 May  7 18:19 Series
drwxrwxr-x 1   99 users  520 Nov 12 19:58 UMC
drwxrwx--- 1   99 users  504 Jan  2 09:01 Videos
peter@desktop:~$ 

 

I use autofs to mount nfs shares automatically, hence I don't have to issue the mount command.  Between the last two 'ls -l /net/tower/mnt/user' I had opened the Photos share in Nautilus - note that ownership of most folders has changed from 'root' to '99'.

 

Here is the line showing details of the 'Photos' share from the output of 'mount' from Ubuntu.

tower:/mnt/user/Photos on /net/tower/mnt/user/Photos type nfs (rw,nosuid,nodev,vers=3,hard,intr,nolock,udp,sloppy,addr=10.2.0.100)

 

and

I believe that the stale file handle problem still exists (copying to unRAID)

 

Edit:

 

Yes, I can confirm that the stale file handle problem still persists.

 

@Tom

Research on the net suggests that the reason for a stale file handle is that the contents of a directory have been changed without the modification time of the directory itself having been updated.  Is it possible that this might occur in unRAID - perhaps when a cache drive is in use?
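
One way to test this theory on the server is to snapshot a directory's modification time together with its list of entries, repeat the snapshot after a write, and flag the case where the entries differ but the timestamp does not. The following is only a minimal Python sketch; `DirSnapshot`, `snapshot` and `changed_without_mtime` are names invented for this illustration and are not part of unRAID or of any NFS tooling.

```python
import os
from typing import NamedTuple


class DirSnapshot(NamedTuple):
    """Directory state at one moment: mtime in nanoseconds plus entry names."""
    mtime_ns: int
    names: frozenset


def snapshot(path):
    """Record the directory's mtime and the set of names it contains."""
    st = os.stat(path)
    return DirSnapshot(st.st_mtime_ns, frozenset(os.listdir(path)))


def changed_without_mtime(before, after):
    """True if the entries differ while the directory mtime stayed the same."""
    return before.names != after.names and before.mtime_ns == after.mtime_ns
```

Taking `snapshot('/mnt/user/Photos')` before and after copying a file into the share, then passing both results to `changed_without_mtime`, would show whether that user share directory behaves in the suspected way.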

Link to comment

I don't have a cache drive, so in my case it can't be caused by the cache drive.  It could be made worse by a cache drive though.

 

I have a feeling it's a glitch in the way the user mount driver handles file listings (the way I understand this works is a memory map of which files exist on which drives).  Perhaps the driver needs to update the timestamp or possibly not update the timestamp.

Link to comment

I have never seen it happen on a disk share, but I spend more time writing large files to user shares!  The problem is not reproducible at will (unlike the hanging nfs reads), but I will try some repeated writes to a disk share over the weekend.

Link to comment

There have been reports of directory timestamps not properly updating in user shares.

 

Just remembered the context: A Plex or XBMC user was having a problem with the scraper ignoring newly added files because a directory timestamp was not being updated. The user share spanned multiple disks.

Link to comment

If that is happening, then it would explain exactly why we get stale handles.

Link to comment

As another datapoint in support of a problem with nfs + user shares...

 

I use nfs directly (heavily) to the cache drive's direct export. No problem at all.

 

I also more sparingly use nfs to all my user shares. These are what are giving problems.

 

This occurred using Ubuntu as a client, with symptoms similar to what everyone else is describing.

 

I gave it another go this weekend with a different client - Arch Linux running kernel 3.3.7 - and still had problems with stale file handles. The kernel log also showed:

 

[  723.909179] NFS: server 192.168.1.150 error: fileid changed
[  723.909180] fsid 0:14: expected fileid 0x10000018, got 0x5eb

 

This supports the 'files on user shares changing underneath nfs's feet' theory. I don't know whether newer kernels are simply less tolerant of this, or why it was never a problem in the past.
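
In NFSv3 the fileid attribute normally corresponds to the inode number the server reports for a file, so the kernel message above says that the same handle suddenly resolved to a different inode. A rough way to look for this on the server is to record `st_ino` for a path and compare it again later. This is a sketch only; `fileid` and `fileid_changed` are hypothetical helper names, and whether the user share layer reports inode numbers this way is an assumption.

```python
import os


def fileid(path):
    """Return the inode number that stat() reports for the path."""
    return os.stat(path).st_ino


def fileid_changed(path, previous):
    """True if the path now reports a different inode number than before."""
    return fileid(path) != previous
```

Recording `fileid('/mnt/user/Photos')` once and calling `fileid_changed` on the same path after some writes would show whether the number moves for what should be the same directory.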

Link to comment

I think I have a fix for this in next release  :)

Link to comment
  • 2 weeks later...

I've spent quite a bit of time just copying files to the server, as a test.  Past experience suggests that I might have seen 'stale file handle' three or four times with RC3, or earlier.  With RC4, everything appears to be behaving as it should.

 

It's looking extremely promising!  :)

 

Edit:  After some further testing, I am confident enough to mark this one as solved ... tentatively.  If I experience any more 'stale file handles' I will re-open.

Link to comment

I'm going to update to rc[4|5] over the weekend - I was looking forward to this fix too.

 

It's good to hear you've found it much improved - thanks for the info I'm much more confident about doing the update now!

 

For me, rc4 seems to be pretty well perfect.  I may try reverting to the tcp transport for nfs, but I don't see any change which would be expected to fix this so I still expect to see client hangs.

Link to comment

I recently upgraded from 5.0b14 to 5.0rc4, and I am still having Stale NFS Handle errors when connecting to my WD Live box.  This has been reported a number of times in the past, and the only solution has been to either switch to 4.7, or switch to SMB connections.  http://lime-technology.com/forum/index.php?topic=17679.0

 

My current workaround is a little janky - feed an SMB connection to a Windows 7 box, and re-export that as NFS using HaneWin NFS Server.  It looks like I still need to do this workaround with RC4.

Link to comment

... I am still having Stale NFS Handle errors when connecting to my WD Live box.

 

to the WD Live box, or from the WD Live box?

 

Sorry, you are absolutely correct.  I am getting those errors when connecting from the WD Live box to my NFS user share on my unRAID system.

 

I disabled all plugins on my unRAID system this morning, and I am attempting to recreate the issue.  This NFS error sometimes happens almost immediately, while other times takes hours.  As soon as I have recreated the error, I will be posting the appropriate unRAID system logs.
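
Because the error can take hours to appear, a small watcher on a Linux client that notes when a path first returns ESTALE can make it easier to line the event up with the server's syslog. The Python below is only a sketch under that assumption; `is_stale` and `wait_until_stale` are made-up names, and the `stat`, `sleep` and `max_checks` parameters exist purely so the loop can be exercised without a real NFS mount.

```python
import errno
import os
import time


def is_stale(path, stat=os.stat):
    """True if stat() on the path fails with ESTALE (stale file handle)."""
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False


def wait_until_stale(paths, interval=30.0, stat=os.stat,
                     sleep=time.sleep, max_checks=None):
    """Poll the paths; return (path, unix_time) for the first stale one.

    Returns None if max_checks rounds pass without seeing ESTALE.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        for path in paths:
            if is_stale(path, stat):
                return path, time.time()
        checks += 1
        sleep(interval)
    return None
```

Pointing `wait_until_stale` at the mounted share directories on a client and leaving it running would give a timestamp to compare against entries such as the spindown events in the unRAID log.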

Link to comment

I just got the Stale NFS Handle error again.  I believe this happened around 11am.  The only entries in my system log around that time are the spindown 4 and spindown 5 events at 10:58am.  I should note that when I was using 5.0b14, I disabled disk spin down, and this error still occurred.

 

Here is the error on my WD Live Box:

 

# ls -al
ls: ./zTemp: Stale NFS file handle
ls: ./Videos: Stale NFS file handle
drwxr-xr-x    7 root     root          140 Dec 31 16:45 .
drwxr-xr-x    3 root     root           60 Dec 31 16:00 ..
drwxrwxrwx    1 nobody   100           416 Jun  9  2012 .HD_Videos
d-wxr----t    3 root     root           60 Dec 31 16:45 .wd_tv
drwxr-xr-x    3 root     root           60 Dec 31 16:00 USB2

 

 

Here is my unRAID system log for the past ~2 hours

Jun 11 09:06:27 MrTower emhttp: Start NFS...
Jun 11 09:06:27 MrTower emhttp: shcmd (51): /etc/rc.d/rc.nfsd start |& logger
Jun 11 09:06:27 MrTower logger: Starting NFS server daemons:
Jun 11 09:06:27 MrTower logger:   /usr/sbin/exportfs -r
Jun 11 09:06:27 MrTower logger:   /usr/sbin/rpc.nfsd 8
Jun 11 09:06:27 MrTower logger:   /usr/sbin/rpc.mountd
Jun 11 09:06:27 MrTower mountd[1445]: Kernel does not have pseudo root support.
Jun 11 09:06:27 MrTower mountd[1445]: NFS v4 mounts will be disabled unless fsid=0
Jun 11 09:06:27 MrTower mountd[1445]: is specfied in /etc/exports file.
Jun 11 09:06:27 MrTower emhttp: shcmd (52): /usr/local/sbin/emhttp_event svcs_restarted
Jun 11 09:06:27 MrTower emhttp_event: svcs_restarted
Jun 11 09:08:54 MrTower mountd[1446]: authenticated mount request from 172.16.0.92:1001 for /mnt/user/Media/_isVideo/HD/.Playlists/Playlists (/mnt/user/Media)
Jun 11 09:08:57 MrTower mountd[1446]: authenticated mount request from 172.16.0.92:716 for /mnt/user/Media/_isVideo/HD (/mnt/user/Media)
Jun 11 09:09:00 MrTower mountd[1446]: authenticated mount request from 172.16.0.92:852 for /mnt/user/Media/_isVideo/zTemp (/mnt/user/Media)
Jun 11 09:21:58 MrTower kernel: mdcmd (39): spindown 0
Jun 11 09:23:59 MrTower kernel: mdcmd (40): spindown 1
Jun 11 09:24:00 MrTower kernel: mdcmd (41): spindown 3
Jun 11 09:24:02 MrTower kernel: mdcmd (42): spindown 4
Jun 11 09:24:05 MrTower kernel: mdcmd (43): spindown 5
Jun 11 09:24:08 MrTower kernel: mdcmd (44): spindown 2
Jun 11 09:54:45 MrTower in.telnetd[3637]: connect from 172.16.0.150 (172.16.0.150)
Jun 11 09:54:51 MrTower login[3638]: ROOT LOGIN  on '/dev/pts/0' from '172.16.0.150'
Jun 11 10:07:19 MrTower kernel: mdcmd (45): spindown 4
Jun 11 10:07:21 MrTower kernel: mdcmd (46): spindown 5
Jun 11 10:08:14 MrTower kernel: mdcmd (47): spindown 1
Jun 11 10:08:15 MrTower kernel: mdcmd (48): spindown 3
Jun 11 10:58:38 MrTower kernel: mdcmd (49): spindown 4
Jun 11 10:58:50 MrTower kernel: mdcmd (50): spindown 5

 

Link to comment

I'm looking at the unRAID Processes log, and I noticed the following:

 

root      1437     2  0 09:06 ?        00:00:01 [nfsd]
root      1438     2  0 09:06 ?        00:00:01 [nfsd]
root      1439     2  0 09:06 ?        00:00:02 [nfsd]
root      1440     2  0 09:06 ?        00:00:01 [nfsd]
root      1441     2  0 09:06 ?        00:00:02 [nfsd]
root      1442     2  0 09:06 ?        00:00:01 [nfsd]
root      1443     2  0 09:06 ?        00:00:01 [nfsd]
root      1444     2  0 09:06 ?        00:00:01 [nfsd]
root      1446     1  0 09:06 ?        00:00:00 /usr/sbin/rpc.mountd
nobody    2221  1314  1 09:24 ?        00:01:24 /usr/sbin/smbd -D
root      6259     2  0 10:55 ?        00:00:00 [kworker/0:0]
root      6260     2  0 10:55 ?        00:00:00 [flush-9:1]

 

I'm not certain if this is important information, but I do notice some processes logged at 10:55am, which is right around when I think the Stale NFS File Handle error occurred.

Link to comment

Okay, I've just had a number of 'stale file handle' errors but, admittedly, while running an application (digiKam) which claims that it cannot access data on network shares. :o

 

However, I discover that if I disable the use of the cache drive for the user share in question, then I have no more problems.

 

I now believe that there were two stale file handle faults - the symptoms of this one were a little different to the 'other' stale handle problem which I still believe to have been resolved.

Link to comment

Thinking about this problem, and trying to rationalise it against the "directory changing without a time stamp change" theory, I can see why there might be difficulties.  When a user share is split across different physical disks, there are different parent directories with, presumably, different time stamps.  Which disk does unRAID fetch the directory details from, to return to the nfs client?  The problem is exacerbated when a cache drive comes into play, because there is now an additional directory on yet another physical disk.

 

When I experienced the problem yesterday, I found that there was an empty 'user share' directory left on the cache drive.  It wasn't close to a time when mover would be invoked, and it is a situation which shouldn't occur in normal use - this is what led me to try disabling the cache drive for that user share.

Link to comment

It should, in my opinion, always return the most current of all the various time-stamps of the parallel directories involved.  It does mean they must be kept in memory for comparison (otherwise, you would need to spin up the drives to learn them).
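
The rule suggested here, reporting the newest modification time among all the per-disk copies of a directory, can be sketched in a few lines of Python. This is not how unRAID's user share driver is implemented; `merged_dir_mtime_ns` is a made-up name, and the per-disk roots in the usage note are assumed paths.

```python
import os


def merged_dir_mtime_ns(relative_dir, roots):
    """Return the newest mtime (ns) of relative_dir across the given roots.

    Roots that do not contain the directory are skipped; returns None if
    the directory exists under none of them.
    """
    newest = None
    for root in roots:
        candidate = os.path.join(root, relative_dir)
        try:
            st = os.stat(candidate)
        except FileNotFoundError:
            continue
        if newest is None or st.st_mtime_ns > newest:
            newest = st.st_mtime_ns
    return newest
```

For example, `merged_dir_mtime_ns('Photos', ['/mnt/disk1', '/mnt/disk2', '/mnt/cache'])` would return the newest timestamp among whichever of those directories exist. Note that this sketch stats each disk directly, which would spin the drives up; the in-memory copy of the timestamps mentioned above is left out.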
Link to comment
