Bug 415607

Summary: nfs-mounted directories hang
Product: [openSUSE] openSUSE 11.0
Reporter: Carl Benson <cbenson>
Component: Kernel
Assignee: Neil Brown <nfbrown>
Status: RESOLVED NORESPONSE
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P2 - High
CC: cbenson, cyberorg, david.roper, R.Vickers, ursula.brueckner
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 11.1
Whiteboard:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: tcpdump trace
excerpt from /var/log/messages
gzipped tcpdump file
kernel messages on client
rpcinfo run on client, reporting on server
tcpdump output showing all traffic between client and server
log messages and network trace
log messages and tcpdump trace

Description Carl Benson 2008-08-07 21:24:43 UTC
User home directories are automounted from a NetApp Filer to a server
named "copperhead".
The "df" command hangs on NFS mounts.
Root attempts to "su - <user>", but this also hangs.
It looks like the NFS server is not responding, but this is not
the case -- we have many other servers using the same directories
via NFS, and response is fine. Monitoring on the NetApp shows it
is running well.

/var/log/messages has this:
Aug  7 09:55:42 copperhead klogd: ------------[ cut here ]------------
Aug  7 09:55:42 copperhead klogd: kernel BUG at net/sunrpc/rpcb_clnt.c:290!
Aug  7 09:55:42 copperhead klogd: invalid opcode: 0000 [1] SMP
Aug  7 09:55:42 copperhead klogd: last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Aug  7 09:55:42 copperhead klogd: CPU 3
Aug  7 09:55:42 copperhead klogd: Modules linked in: lp parport_pc ppdev parport nls_utf8 nfs lockd nfs_acl sunrpc autofs4 binfmt_misc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 bonding microcode firmware_class fuse loop dm_mod rtc_cmos iTCO_wdt sg shpchp rtc_core i2c_i801 iTCO_vendor_support rtc_lib floppy container i2c_core i5000_edac usb_storage sr_mod pci_hotplug button cdrom joydev e1000e edac_core usbhid hid ff_memless ehci_hcd uhci_hcd sd_mod usbcore edd ext3 mbcache jbd fan ata_piix ahci libata dock 3w_9xxx scsi_mod thermal processor [last unloaded: parport_pc]
Aug  7 09:55:42 copperhead klogd: Pid: 14342, comm: xauth Tainted: G        N 2.6.25.11-0.1-default #1
Aug  7 09:55:42 copperhead klogd: RIP: 0010:[<ffffffff883ad55a>]  [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug  7 09:55:42 copperhead klogd: RSP: 0018:ffff810110cff938  EFLAGS: 00010287
Aug  7 09:55:42 copperhead klogd: RAX: ffffffff883d11d0 RBX: ffff81022c8d5c00 RCX: 0000000000000001
Aug  7 09:55:42 copperhead klogd: RDX: 0000000000000000 RSI: ffff81010a543e40 RDI: ffff81010a543e40
Aug  7 09:55:42 copperhead klogd: RBP: ffff810110cffa58 R08: ffff81010a543f28 R09: 0000000000000001
Aug  7 09:55:42 copperhead klogd: R10: ffff810001031280 R11: ffff810110cffc18 R12: ffff81018b047800
Aug  7 09:55:42 copperhead klogd: R13: ffff81018b047800 R14: ffff81010a543e40 R15: ffff810229d40800
Aug  7 09:55:42 copperhead klogd: FS:  00007fb4bd7ed6f0(0000) GS:ffff81022f1037c0(0000) knlGS:0000000000000000
Aug  7 09:55:42 copperhead klogd: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug  7 09:55:42 copperhead klogd: CR2: 00007f3992dfcf00 CR3: 00000001d4601000 CR4: 00000000000006e0
Aug  7 09:55:42 copperhead klogd: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  7 09:55:42 copperhead klogd: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug  7 09:55:42 copperhead klogd: Process xauth (pid: 14342, threadinfo ffff810110cfe000, task ffff810116954080)
Aug  7 09:55:42 copperhead klogd: Stack:  0000000000000246 ffff81022ad67370 ffff810110cff978 0000000000000246
Aug  7 09:55:42 copperhead klogd:  0000000000011200 ffff81022ad67340 ffff810110cff9a8 ffff810110cff9c0
Aug  7 09:55:42 copperhead klogd:  ffff810110cff988 ffffffff80276757 ffff810110cffa08 ffffffff8027685a
Aug  7 09:55:42 copperhead klogd: Call Trace:
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a018b>] :sunrpc:call_bind+0x6f/0x74
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a72d9>] :sunrpc:__rpc_execute+0x86/0x246
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a751e>] :sunrpc:rpc_execute+0x85/0x8e
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a14a5>] :sunrpc:rpc_run_task+0x52/0x5a
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883a1598>] :sunrpc:rpc_call_sync+0x3f/0x5d
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8840c0bb>] :nfs:nfs3_rpc_wrapper+0x22/0x5c
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8840c5b5>] :nfs:nfs3_proc_access+0xf3/0x15e
Aug  7 09:55:42 copperhead klogd:  [<ffffffff883ffde0>] :nfs:nfs_do_access+0x173/0x341
Aug  7 09:55:42 copperhead klogd:  [<ffffffff884000a3>] :nfs:nfs_permission+0xf5/0x166
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802a7497>] permission+0xc7/0x118
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802a7bbc>] lookup_hash+0x21/0x43
Aug  7 09:55:42 copperhead klogd:  [<ffffffff802ab6d8>] open_namei+0xe8/0x6a0
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029ec5b>] do_filp_open+0x28/0x4b
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029eccf>] do_sys_open+0x51/0xd2
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8029ed79>] sys_open+0x1b/0x1d
Aug  7 09:55:42 copperhead klogd:  [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f
Aug  7 09:55:42 copperhead klogd: DWARF2 unwinder stuck at system_call_after_swapgs+0x8a/0x8f
Aug  7 09:55:42 copperhead klogd:
Aug  7 09:55:42 copperhead klogd: Leftover inexact backtrace:
Aug  7 09:55:42 copperhead klogd:
Aug  7 09:55:42 copperhead syslog-ng[2774]: last message repeated 2 times
Aug  7 09:55:42 copperhead klogd: Code: 3b 88 44 8b 4b 44 44 8b 43 40 48 c7 c7 57 74 3b 88 48 8b 4b 50 89 04 24 31 c0 e8 ed 24 0a f8 4c 8b bb e8 00 00 00 49 39 df 74 04 <0f> 0b eb fe 49 8d 85 78 02 00 00 48 89 85 f8 fe ff ff f0 0f ba
Aug  7 09:55:42 copperhead klogd: RIP  [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug  7 09:55:42 copperhead klogd:  RSP <ffff810110cff938>
Aug  7 09:55:42 copperhead klogd: ---[ end trace 4b516acb183486f8 ]---
Comment 1 Carl Benson 2008-08-08 14:43:11 UTC
I left out some obvious stuff. "uname -a" yields:

Linux copperhead 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux
Comment 2 Neil Brown 2008-08-12 02:28:28 UTC
Thanks for the report and all the details.

Can I assume that you have rebooted 'copperhead' and it is working happily
again?
Has this only happened once, or multiple times?

I have looked at the code and I think I can see what is happening.
I will discuss this with some colleagues and try to find a solution.
Comment 3 Carl Benson 2008-08-12 15:21:13 UTC
Reset is necessary -- it won't reboot due to hang on NFS.

Reboot fixes the problem temporarily, but it happens again
eventually.

I can't be 100% certain, but I think this is a problem with
64-bit installations but not 32-bit. I've been using 11.0
32-bit for some time with no NFS problems. I have other users
with openSUSE 11.0 32-bit desktops and no NFS problems.

Thanks for looking at this. I have a lot of servers to upgrade
to 11.0 if/when this bug is fixed.

If there's any other info I can provide, please let me know.

--Carl Benson
Comment 4 Carl Benson 2008-08-13 15:28:42 UTC
I discovered that even with no users logged on, the NFS hang could be
triggered by updatedb, which is started by a cron job.  The defaults
for updatedb (in /etc/sysconfig/locate) allow it to traverse /home.

In past versions of Linux, NFS automounted /home was treated as a
"netpath", but now NFS-mounted home directories are treated as local
to the systems. The result (in my case) is that updatedb runs find
across every one of hundreds of home directories on the NetApp Filer,
and tries to mount home directories that are not exported to this
system. I think that's where the hang happens.

I killed find and updatedb, restarted nfs and autofs, and removed
/home from the paths covered by updatedb. This fixed the hang.
The underlying problem is still there, however.
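As a sketch of that workaround: the idea is to add /home to the paths
updatedb skips. The variable name below is an assumption (check the
comments in /etc/sysconfig/locate on your release; it may differ):

```
# /etc/sysconfig/locate -- variable name is an assumption, verify locally
UPDATEDB_PRUNEPATHS="/mnt /media /home"
```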
Comment 5 David Roper 2008-08-13 18:20:19 UTC
I have been working a case with this very same issue on a NetApp storage appliance. 

automounter contents:
vol00 -fstype=nfs4,rw,nosuid,rsize=32768,wsize=32768,hard,intr drive00.engr.uiowa.edu:/vol/vol1

uname -a:
Linux a-lnx001 2.6.25.5-1.1-default #1 SMP 2008-06-07 01:55:22 +0200 x86_64 x86_64 x86_64 

The hardware is a box with AMD64 Opteron processors (HP xw9300 workstation).

The filer does show a slow response time, as below from its messages, and we did confirm correct functional operation and configuration on the filer: 

Wed Jul 23 13:38:32 CDT [drive00: nfsap_process:warning]: NFSv4 response to client  128.255.18.37 for volume 0x1ff0eee was slow, 173 > 60 (in seconds) 

Appreciate that this has been documented and will be watching for any updates. 
Comment 6 Neil Brown 2008-08-14 02:37:52 UTC
It looks like the BUG_ON is simply wrong.
I have removed it in the SUSE CVS tree.
So you can:
  - compile the kernel yourself with the line removed
  - use a kotd (kernel of the day) from tomorrow
  - wait for the next kernel update to be released (no idea when
    that might be).

I won't close this yet though as we need to explore a bit deeper to
make sure we are 100% certain what is happening.
Comment 7 Carl Benson 2008-08-14 21:24:26 UTC
Is it true that this affects both the 64-bit and 32-bit default kernels?

I looked at /usr/src/linux/net/sunrpc/rpcb_clnt.c on both a 32-bit and
a 64-bit system, and they both have a BUG_ON at line 290.
Comment 8 Carl Benson 2008-08-15 21:04:48 UTC
I commented out line 290 of net/sunrpc/rpcb_clnt.c, recompiled the kernel,
and rebooted. So far, I have not seen that error in /var/log/messages.

Unfortunately, that did not help the incredibly slow response of NFS to our
NetApp Filer. That situation is as bad as before. I don't believe it's the
NetApp Filer. I have roughly equivalent servers running openSUSE 10.2, 10.3
and 11.0, all configured to automount home directories.

As root, if I simultaneously "su - user" on all three systems, where the
user has a home directory on the NetApp, the response on 10.2 and 10.3
are instant. The 11.0 system never returns the prompt. In /var/log/messages,
it claims the NetApp is not responding, still trying.
Comment 9 Neil Brown 2008-08-18 00:26:05 UTC
I did not realise you were getting slow performance.
You said that commands "hang", and I assumed this meant that they
never returned.  That would be consistent with the BUG_ON firing.

Can you provide more information about this slowness?  Are all
operations slow or just some?  Does performance steadily get worse?
Does it sometimes get better?

Comment 10 Carl Benson 2008-08-18 15:09:08 UTC
I would say performance steadily gets worse.

On an openSUSE 11.0 system left to itself, I see these messages in
/var/log/messages every 30 minutes:

Aug 18 07:08:38 diamondback klogd: nfs: server netappb1 not responding, still trying
Aug 18 07:10:38 diamondback klogd: nfs: server netappb1 OK

netappb1 itself is performing well at the time.

As root, I can try "su - <user>" for a number of users, and it works
for 4 users, but for the 5th user, "su - <user>" hangs and does not
come back. In this case, I see in /var/log/messages that netappb1
has "timed out". I can Ctrl-C out of the "su - <user>". After that,
if I do "df" it hangs, and has to be killed with "-9".

If I get su-ed to a user and do an "ls", it may wait a long time,
then respond. Other times, it will respond immediately.

Here is /proc/mounts for copperhead when it's stuck on "df":

rootfs / rootfs rw 0 0
udev /dev tmpfs rw,mode=755 0 0
/dev/sda2 / ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
debugfs /sys/kernel/debug debugfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/sda3 /loc ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw 0 0
fusectl /sys/fs/fuse/connections fusectl rw 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
cplserv:/vol/cpl_data /mnt/cpl/data nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
auto.home /home autofs rw,fd=5,pgrp=3305,timeout=600,minproto=5,maxproto=5,indirect 0 0
proc /var/lib/ntp/proc proc rw 0 0
cplserv:/vol/cpl_home/tholzman /home/tholzman nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0

Copperhead and diamondback are identical servers running openSUSE 11.0.

The problem does seem to be limited to home directories automounted from
the NetApp Filer. I have a few home directories on another server, and
copperhead and diamondback have no trouble with those.


Comment 11 David Roper 2008-08-20 17:05:17 UTC
I was also curious what version of Data ONTAP is running on the filer. For our issue, the customer has upgraded their version of ONTAP to 7.3 and the problem has decreased significantly. 
Comment 12 Carl Benson 2008-08-20 17:13:14 UTC
Our NetApp Filer is running Data ONTAP v7.2.4.

We are using NFS v3 for all mounts. I noticed your case involves v4.
Seems like v3 should be well-established code in all versions of
Linux by this time, so that's why it's surprising that anything
should be different for openSUSE 11.0 (vs 10.2 and 10.3).
Comment 13 Carl Benson 2008-08-29 14:53:37 UTC
Is there any more information or assistance I could provide that
would help this move along?

openSUSE 11.1 is going to be out soon -- is there any way to know
if this same problem is going to persist in 11.1?
Comment 14 Neil Brown 2008-09-09 06:35:16 UTC
Sorry for not getting back to this sooner .... I've been busy :-(

To summarise: openSUSE 11.0 sometimes (often) sees no response from a NetApp filer,
but does not see problems talking to other (non-NetApp) file servers.
openSUSE 10.2 and 10.3 don't have any problems talking to that same filer.

So the problem is specific to 11.0 and NetApp.  Quite strange.

I will need to get some tracing information to find out what is happening.
Could you collect two things for me please?

 1/ tcpdump -w /tmp/trace -s 0 host netappb1

start that running, then try to reproduce the problem, then interrupt the tcpdump
and collect the "/tmp/trace" file.  Compress it and attach it to the bugzilla
Hopefully it won't be too big.

 2/  echo 65535 > /proc/sys/sunrpc/nfs_debug
     echo 65535 > /proc/sys/sunrpc/rpc_debug

  This writes trace information to the kernel log (it will appear in /var/log/messages I think).
  After the 'echo's, reproduce the problem, then echo 0 to both files and
  collect the logs.

Probably best to do both of these at the same time: start tcpdump,
do the two 'echo' commands, then reproduce the problem, then collect the
logs.
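That combined procedure might be sketched as a small script. This is
only a sketch to adapt locally: the interface name (bond0), trace path,
and server name follow this report and are not universal.

```shell
#!/bin/sh
# Sketch: collect a full-size packet trace and rpc/nfs debug logging
# at the same time.  Run as root with the single argument "run";
# without arguments it only defines the helpers.

IFACE=bond0          # assumption: the NFS-facing interface
TRACE=/tmp/trace
SERVER=netappb1

tcpdump_cmd() {
    # Build the capture command; -s 0 avoids truncated packets.
    printf 'tcpdump -i %s -w %s -s 0 host %s' "$1" "$2" "$3"
}

set_rpc_debug() {
    # 65535 enables all debug bits; 0 turns tracing back off.
    echo "$1" > /proc/sys/sunrpc/nfs_debug
    echo "$1" > /proc/sys/sunrpc/rpc_debug
}

if [ "${1:-}" = "run" ]; then
    $(tcpdump_cmd "$IFACE" "$TRACE" "$SERVER") &
    TCPDUMP_PID=$!
    set_rpc_debug 65535
    echo "Reproduce the problem now; press Enter when done."
    read dummy
    set_rpc_debug 0
    kill "$TCPDUMP_PID"
    gzip "$TRACE"
    echo "Attach $TRACE.gz and the matching /var/log/messages excerpt."
fi
```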

Thanks.
Comment 15 Carl Benson 2008-09-09 19:39:18 UTC
Created attachment 238696 [details]
tcpdump trace

tcpdump -i bond0 -w /tmp/trace -s 0 host netappb1

During this window, as root, I su-ed to two user accounts with home dirs
automounted from netappb1, and the "su" hung until it timed out.
Comment 16 Carl Benson 2008-09-09 19:57:06 UTC
Created attachment 238706 [details]
excerpt from /var/log/messages

echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug
recreate problem
echo 0 >/proc/sys/sunrpc/nfs_debug
echo 0 >/proc/sys/sunrpc/rpc_debug

Darn. I missed your bit at the end about doing both at once.
At any rate, from the messages, it appears to loop once per
minute, so it's just the same thing over and over.
Comment 17 Forgotten User b5BnQSUi71 2008-09-24 06:13:59 UTC
*** Bug 416654 has been marked as a duplicate of this bug. ***
Comment 18 Bob Vickers 2008-09-24 08:28:28 UTC
After looking at both this bug and bug 416654 I realise there is a piece of information I did not supply: the NFS servers in my case were both running SuSE 10.2. One of them was 32-bit, the other 64-bit.

In all my cases the client was 64-bit SuSE 11.0, and I have not seen the problem on similarly configured 32-bit clients.

I believe a workaround is to put 'udp' in the fstab options; at any rate it has worked so far for me. This is not ideal, however, because the man page suggests a performance impact.
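For what it's worth, a UDP mount of this kind in /etc/fstab might look
like the following; the server name, export path, and mount point are
placeholders, not taken from this report:

```
server:/vol/home  /home/shared  nfs  rw,udp,hard,intr  0 0
```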
Comment 19 Neil Brown 2008-09-25 05:50:38 UTC
Thanks for the traces.  I can now see some of what is happening.

The tcpdump trace shows normal NFS startup handshake where the client
first checks with portmap to find mountd then talks to mountd to get
the root filehandle.  Then it checks with portmap again to get the
port for NFS (2049) and then connects to the NFS server.

You would expect to then see an FSINFO call, an FSINFO reply, then a
PATHCONF call and a PATHCONF reply.

Instead we see various other calls such as GETATTR, FSSTAT, and PATHCONF, all
for *other* filehandles.  These calls are made consecutively without waiting
for a reply.  As soon as the server sees the PATHCONF request (with a
different filehandle to the one the mount returned), it closes the TCP
connection.  The client then opens the connection again and submits several
requests, one of which fails.

What must be happening here is that several mounts are all being attempted
against the same server at the same time.  When this happens the client will
use the one TCP connection for all of the mount points.  The PATHCONF
for one of these causes the server to (incorrectly, I think) close the
connection.  This just causes it to be retried again and again, making
no progress and blocking any other mounts from the same fileserver.

So it seems like the NetApp filer is doing the wrong thing, closing the
connection.
However Bob's situation doesn't involve a NetApp filer.

So maybe Bob's problem is completely different and just looks similar,
or maybe I'm missing something.

Bob: are you able to collect a tcpdump trace of traffic when the problem
occurs?  That might be helpful.
Comment 20 Bob Vickers 2008-09-25 14:45:20 UTC
Sorry, but for the moment I cannot collect information because the machines that demonstrate it are all production machines, and I switched them all to UDP mounts so as to get the service ready for the new term. However I am expecting a couple of new 64-bit workstations in a week or two, so I will try to reproduce the problem on them. 
Comment 21 Carl Benson 2008-09-25 14:53:16 UTC
Regarding Neil's comment about various calls for other filehandles: could
automounter be generating these? I have noticed since openSUSE 10.3 that
automounter tries to mount every possible home directory in some cases. I have
seen this on 64-bit servers, but not on 32-bit desktops.

Since we have 155 possible home directories, that would be quite a few extra
calls.

I'll do some testing on an 11.0 server where home directories are statically
mounted, and see if it hangs up, too.
Comment 22 Bob Vickers 2008-10-07 08:49:57 UTC
New information: this problem has now occurred (twice) on a 32-bit client. I will attach a binary tcpdump taken on the server.

While tcpdump was running I typed (on the client):

umount -fl /rmt/csnewton/CS
mount /rmt/csnewton/CS
(waited a while)
Ctrl-C
umount -fl /rmt/csnewton/CS
(failed)
mount -o udp /rmt/csnewton/CS
(worked)
Comment 23 Bob Vickers 2008-10-07 08:59:28 UTC
Created attachment 243853 [details]
gzipped tcpdump file
Comment 24 Neil Brown 2008-10-21 00:18:01 UTC
Thanks for the tcpdump trace.  Unfortunately you didn't specify "-s 0" so
the packets are truncated.
However it looks like there is a fairly normal handshake between client
and server where the client finds out the port number and filehandle that it
needs.  But then it never even tries to establish a TCP connection to the server.

It seems likely that the NFS client thinks it already has a connection to
that server (for some other mount point) and is trying to reuse it, but
for some reason that connection isn't making any progress.

This is quite different to the other trace where the client kept trying
to make connections but they kept closing for some reason.

So it seems quite likely that we have (at least) two different bugs here.

To make any more progress,  I think I'll really need a 'tcpdump -s0' and
a kernel message log with 
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug

both collected at the same time.

Comment 25 Neil Brown 2008-11-07 04:25:17 UTC
Any updates here?  Is the bug still occurring?
Comment 26 Bob Vickers 2008-11-07 14:08:47 UTC
Sorry for not replying sooner, but I don't have much information to give. The machines where the problem happened before are all production machines and I don't want to destabilize them by moving back from udp to tcp. I have tried to reproduce the problem on a workstation, but so far without any joy (even when hammering NFS with a ferocious mixture of find and tar).

However I have some other avenues to pursue, so I will try them next week.
Comment 27 Carl Benson 2008-11-07 14:30:46 UTC
At our site, the bug seems to be related to automounter. Since openSUSE 11.1 will be out soon, and since I have a Service Level Agreement requiring me to install 11.1 on servers as soon as possible, I've decided to abandon automounter. All NFS mounts will be static, and automounts will be replaced by symbolic links in /home.

While this is an unfortunate waste of time, it also bypasses other long-standing problems with automounter.

We are currently unable to upgrade DataONTAP on our NetApp Filer (which may or may not fix this problem). But we will do so in the future.
Comment 28 Bob Vickers 2008-11-07 14:36:05 UTC
For what it's worth, we don't use automounter at this site. All our NFS mounts are made at boot-time.
Comment 29 Neil Brown 2008-11-10 00:43:32 UTC
Thanks.
I'll leave this bug open a while longer in case something comes up.
Just a pointer for anyone else who happens upon this bug: I need the
information described in comment #24 to make any progress.
Comment 30 Bob Vickers 2008-11-11 11:04:01 UTC
Yesterday I got permission to temporarily revert one of our servers to TCP rather than UDP mounting, and I tried really hard to provoke the bug, but without success. However, I don't want to risk destabilising the service at this time of year, so I have put all the servers back to UDP now.

I conclude that either some other change has fixed the bug, or else I failed to reproduce the conditions which provoke it.

Sorry I can't be more helpful.
Comment 31 Bob Vickers 2008-12-17 12:48:08 UTC
On Monday I upgraded our NFS server to SuSE 10.3, and I have just had a hang on a client which never saw the problem before.

The client is a 32-bit Athlon running 11.0, with kernel 2.6.25.16-0.1-pae.
The server is also 32-bit, and is running kernel  2.6.22.19-0.1-default.
The network connection is fine: I can still ssh from the client to the server, and rpcinfo looks good.

I will attach:
(1) kernel messages on the client
(2) rpcinfo run on the client (rpcinfo -p server)
(3) a binary tcpdump file produced on the server.

During this period I ran a mount command, which just hung.

I don't mind leaving the client in its current state for a few hours, so if you need any more info let me know.


Comment 32 Bob Vickers 2008-12-17 12:50:39 UTC
Created attachment 260570 [details]
kernel messages on client
Comment 33 Bob Vickers 2008-12-17 12:51:54 UTC
Created attachment 260572 [details]
rpcinfo run on client, reporting on server
Comment 34 Bob Vickers 2008-12-17 12:53:27 UTC
Created attachment 260574 [details]
tcpdump output showing all traffic between client and server
Comment 35 Neil Brown 2009-01-30 02:17:46 UTC
I've looked at this a few times now and haven't made much progress.

There is no NFS traffic between client and server at all.  Assuming
the capture is correct, that definitely shows something is wrong on the
client.

Based on the rpc tracing messages, it is conceivable that XPRT_CLOSING
has been set on the transport, but it isn't getting closed completely.
However I'm not really familiar enough with the code to know if that is
even possible.

If you feel up to rebuilding a kernel (which I suspect is unlikely)
it would be helpful to put a printk in xs_tcp_connect in net/sunrpc/xprtsock.c
like this:
	/* Exit if we need to wait for socket shutdown to complete */
	if (test_bit(XPRT_CLOSING, &xprt->state)) {
		printk("Still closing\n");
		return;
	}

and run the same experiment and see if that message gets printed out.
If it does, I would know where to start looking.
If not.. I'm still lost.

A less intrusive thing that might be helpful is to run
  netstat -nt

while the client is hanging and see if it mentions the connection 
to the server at all, and if so, what state the connection is in.
Do this a few times over several minutes to see if it changes at all.
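A tiny script for that sampling might look like the following; the
focus on port 2049 and the one-minute interval are assumptions, and
the loop only runs when invoked with "poll":

```shell
#!/bin/sh
# Sketch: sample the TCP connection state a few times over several
# minutes while the hang is in progress.

filter_nfs_conns() {
    # Keep only netstat lines that mention the NFS port (2049).
    # mountd and portmap use other ports and are not shown.
    grep ':2049' || true
}

if [ "${1:-}" = "poll" ]; then
    for i in 1 2 3 4 5; do
        date
        netstat -nt | filter_nfs_conns
        sleep 60
    done
fi
```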

That is all I can suggest at the moment.
Comment 36 Neil Brown 2009-06-10 23:14:34 UTC
This bug seems to be languishing with no obvious way forward...
Is this still a problem?  If it is I'll try harder, but it will probably
involve you needing to rebuild the kernel with a patch applied.
Comment 37 Bob Vickers 2009-06-11 08:46:28 UTC
I'm not the originator of the bug report, just a fellow sufferer, but
I'll give you the situation from my perspective.

The bug only ever manifested itself on production servers so the
desire to obtain information clashed with the need to keep a service
running. My solution was to change all the mounts to UDP and this was
successful, but it concerns me because of the dire warnings of data
loss in the NFS man page.

I am about to do my annual upgrade and over the next few weeks I will
upgrade all the production machines to 11.1. At the same time I will bravely
change the NFS mounts back to TCP and see whether the problem returns.

Can you confirm the information I should gather if the problem does
return?

In comment #24 you mention 
'tcpdump -s0' and a kernel message log with 
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug

Is that on the client that you want the message log? Is there anything
else that would help?

Thanks,
Bob
Comment 38 Neil Brown 2009-07-10 02:17:02 UTC
(In reply to comment #37)
> 
> Can you confirm the information I should gather if the problem does
> return?
> 
> In comment #24 you mention 
> 'tcpdump -s0' and a kernel message log with 
> echo 65535 >/proc/sys/sunrpc/nfs_debug
> echo 65535 >/proc/sys/sunrpc/rpc_debug

Yes, that is the information that would be useful.

> 
> Is that on the client that you want the message log? Is there anything
> else that would help?

Yes, on the client.  Nothing else that might be useful springs to mind.
Comment 39 Bob Vickers 2009-08-04 10:00:30 UTC
Created attachment 310287 [details]
log messages and network trace

OK, it's happened again. Both client and server are running Opensuse 11.1, and I've attached a gzipped tar file containing messages and a network trace.

Just to recap on what happens:

(1) Various things hang, including df. Running 'strace df' reveals which NFS file system it gets stuck on. At this point I collected the debugging info for a minute or two.

(2) I unmounted the file system with 'umount -fl'

(3) An attempt to mount it hung, so I interrupted that and tried 'mount -o udp' which worked.

(4) df now hangs on a different NFS file system so I repeated steps 1 to 3 until df ran to completion.

I hope this is useful. My quick analysis of the network trace suggests there is no NFS-related traffic going on...don't know what this signifies.

Bob
Comment 40 Neil Brown 2009-08-13 04:47:13 UTC
Thanks for the trace.
As you say, no NFS traffic.  It seems the sunrpc code is confused about the
status of the connection.  It knows it needs to open a new connection, but
it seems to think that is already happening, despite the fact that it
isn't.

Matching the trace to the code, it seems that either XPRT_CLOSING is set,
or XPRT_CONNECTING is set.

When XPRT_CONNECTING is set, it schedules a task for at most 5 minutes in
the future which is guaranteed to clear XPRT_CONNECTING again, so that
cannot stay set for very long.

The trace you captured was for less than 5 minutes so I cannot be sure
that the task didn't fire in that time, but it seems unlikely.

So that points to XPRT_CLOSING.
There is an upstream patch

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=670f94573104b4a25525d3fcdcd6496c678df172

which moves the setting of that flag.  I'm not at all certain that
this patch will fix the problem, but it is possible, and it is the only
credible possibility at the moment.

The important part of the patch is moving the "set_bit(XPRT_CLOSING" from
the TCP_CLOSE_WAIT case to the TCP_LAST_ACK case of the switch.

Are you in a position to compile your own kernel?  Could you apply
that patch and try it out?

I'll get that patch into the 10.1 kernel (it is already in 11.1) so it
will be in any future update, but I don't know when that is likely to be.
Comment 41 Bob Vickers 2009-08-13 09:40:08 UTC
Thanks for doing the analysis.

I should be able to compile a kernel, though it is quite a few years since I have 
done one so I would have to do some revision.

I'm a bit confused by your last sentence...are there some typos or am I misinterpreting the version numbers? Both client and server were running Opensuse 11.1 when I ran the latest trace.
Comment 42 Neil Brown 2009-08-13 12:11:40 UTC
There were typos, but still confusion.
I meant that I would get the patch in to 11.0, and that it is already
in 11.1.

I just checked and the patch went in to the 11.1 kernel on July 25th,
3 weeks ago.   I don't think we've released a kernel since then, so the
best thing to do would probably be grab the relevant kernel-of-the-day
from
    http://ftp.suse.com/pub/projects/kernel/kotd/SLE11_BRANCH/

(openSuSE 11.1 and SLE11 use the same kernel).
Comment 43 Bob Vickers 2009-08-21 08:32:27 UTC
Neil: I started following your advice, but then held off when I heard there was a new official kernel in the pipeline to fix the NULL pointer dereference bug. That has been released now and I have started deploying it: could you confirm my impression that it includes the XPRT_CLOSING patch?

The change log for kernel-default-base-2.6.27.29-0.1.1 includes 

* Fri Jul 31 2009 gregkh@suse.de
- Update to 2.6.27.29
  - lots of bugfixes
* Sat Jul 25 2009 gregkh@suse.de
- Update to 2.6.27.27 and 2.6.27.28:
  - bugfixes

so I am guessing that one of those includes the patch. I'll keep my fingers crossed that it either fixes the problem or provides more information.

Thanks,
Bob
Comment 44 Neil Brown 2009-08-21 09:58:05 UTC
Yes, the fix was in 2.6.27.28 so you do have the XPRT_CLOSING fix.

NeilBrown
Comment 45 Bob Vickers 2009-09-08 11:11:09 UTC
Created attachment 317175 [details]
log messages and tcpdump trace

Unfortunately the problem happened again, this time with the latest 11.1 kernel:

# uname -a
Linux ribosome 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux

I have attached (as a gzipped tar file):
/var/log/warn
/var/log/messages
a binary tcpdump file

Hope this is useful,
Bob
Comment 46 Neil Brown 2009-09-29 08:00:57 UTC
Thanks... it may well be useful..
Based on that trace I'm thinking that upstream commit
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=01d37c428ae080563c0a3bb8bdfa88c65a6891d3
might be the fix.

I'll try to reproduce the problem, and see if that fixes it.

The current reports seem to be very different from those which started
this bug off.  I wonder if we are now looking at a different bug that
has a similar symptom... not that it really matters, I guess.
Comment 47 Neil Brown 2009-10-01 01:35:19 UTC
I'm not having a lot of luck.  I'm no longer thinking of that patch...

I compared the trace with the kernel code and I'm fairly sure that
the trace is only possible if either XPRT_CLOSING is set, or 
XPRT_CONNECTING is set.

The first should only be set if the TCP connection is in one of the states:
FIN_WAIT_1 FIN_WAIT_2 CLOSING TIME_WAIT LAST_ACK

It is very unlikely the connection would stay in those states for more than
a couple of minutes providing the server was up and responding.  If it
happens again you could check if the TCP connection is still present and
what its state is with
   netstat -nt | grep 2049

Setting of XPRT_CONNECTING will be accompanied by a debug message mentioning
xs_connect, then it will be cleared again within at most 5 minutes which will
result in a "worker connecting ..." message.
We don't see either of those messages over a period longer than 5 minutes.

So my only hope is that "netstat -nt" might show something useful
if the problem happens again (assuming you are willing to try running
with TCP for a while again...).

So: setting to NEEDINFO for "Can I get 'netstat -nt' output while the
problem is happening?"
Comment 48 Bob Vickers 2009-10-01 09:09:03 UTC
I do still have some machines running with NFS over TCP so if it happens again I will provide the extra information.

Bob
Comment 49 Neil Brown 2009-11-24 03:09:02 UTC
As we unfortunately seem to be getting nowhere with this,
I'm going to close it as NORESPONSE (which sounds a bit harsh, as you
have responded as best you can... but it still seems the best fit),
and hope that it is fixed in 11.2, which is now out.

Thanks for your efforts and apologies for not finding the problem.