Bugzilla – Bug 415607
nfs-mounted directories hang
Last modified: 2009-11-24 03:09:02 UTC
User home directories are automounted from a NetApp Filer to a server named "copperhead". The "df" command hangs on NFS mounts. Root attempts to "su - <user>", but this also hangs. It looks like the NFS server is not responding, but this is not the case -- we have many other servers using the same directories via NFS, and response is fine. Monitoring on the NetApp shows it is running well. /var/log/messages has this:
Aug 7 09:55:42 copperhead klogd: ------------[ cut here ]------------
Aug 7 09:55:42 copperhead klogd: kernel BUG at net/sunrpc/rpcb_clnt.c:290!
Aug 7 09:55:42 copperhead klogd: invalid opcode: 0000 [1] SMP
Aug 7 09:55:42 copperhead klogd: last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
Aug 7 09:55:42 copperhead klogd: CPU 3
Aug 7 09:55:42 copperhead klogd: Modules linked in: lp parport_pc ppdev parport nls_utf8 nfs lockd nfs_acl sunrpc autofs4 binfmt_misc iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 bonding microcode firmware_class fuse loop dm_mod rtc_cmos iTCO_wdt sg shpchp rtc_core i2c_i801 iTCO_vendor_support rtc_lib floppy container i2c_core i5000_edac usb_storage sr_mod pci_hotplug button cdrom joydev e1000e edac_core usbhid hid ff_memless ehci_hcd uhci_hcd sd_mod usbcore edd ext3 mbcache jbd fan ata_piix ahci libata dock 3w_9xxx scsi_mod thermal processor [last unloaded: parport_pc]
Aug 7 09:55:42 copperhead klogd: Pid: 14342, comm: xauth Tainted: G N 2.6.25.11-0.1-default #1
Aug 7 09:55:42 copperhead klogd: RIP: 0010:[<ffffffff883ad55a>] [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug 7 09:55:42 copperhead klogd: RSP: 0018:ffff810110cff938 EFLAGS: 00010287
Aug 7 09:55:42 copperhead klogd: RAX: ffffffff883d11d0 RBX: ffff81022c8d5c00 RCX: 0000000000000001
Aug 7 09:55:42 copperhead klogd: RDX: 0000000000000000 RSI: ffff81010a543e40 RDI: ffff81010a543e40
Aug 7 09:55:42 copperhead klogd: RBP: ffff810110cffa58 R08: ffff81010a543f28 R09: 0000000000000001
Aug 7 09:55:42 copperhead klogd: R10: ffff810001031280 R11: ffff810110cffc18 R12: ffff81018b047800
Aug 7 09:55:42 copperhead klogd: R13: ffff81018b047800 R14: ffff81010a543e40 R15: ffff810229d40800
Aug 7 09:55:42 copperhead klogd: FS: 00007fb4bd7ed6f0(0000) GS:ffff81022f1037c0(0000) knlGS:0000000000000000
Aug 7 09:55:42 copperhead klogd: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Aug 7 09:55:42 copperhead klogd: CR2: 00007f3992dfcf00 CR3: 00000001d4601000 CR4: 00000000000006e0
Aug 7 09:55:42 copperhead klogd: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 7 09:55:42 copperhead klogd: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug 7 09:55:42 copperhead klogd: Process xauth (pid: 14342, threadinfo ffff810110cfe000, task ffff810116954080)
Aug 7 09:55:42 copperhead klogd: Stack: 0000000000000246 ffff81022ad67370 ffff810110cff978 0000000000000246
Aug 7 09:55:42 copperhead klogd: 0000000000011200 ffff81022ad67340 ffff810110cff9a8 ffff810110cff9c0
Aug 7 09:55:42 copperhead klogd: ffff810110cff988 ffffffff80276757 ffff810110cffa08 ffffffff8027685a
Aug 7 09:55:42 copperhead klogd: Call Trace:
Aug 7 09:55:42 copperhead klogd: [<ffffffff883a018b>] :sunrpc:call_bind+0x6f/0x74
Aug 7 09:55:42 copperhead klogd: [<ffffffff883a72d9>] :sunrpc:__rpc_execute+0x86/0x246
Aug 7 09:55:42 copperhead klogd: [<ffffffff883a751e>] :sunrpc:rpc_execute+0x85/0x8e
Aug 7 09:55:42 copperhead klogd: [<ffffffff883a14a5>] :sunrpc:rpc_run_task+0x52/0x5a
Aug 7 09:55:42 copperhead klogd: [<ffffffff883a1598>] :sunrpc:rpc_call_sync+0x3f/0x5d
Aug 7 09:55:42 copperhead klogd: [<ffffffff8840c0bb>] :nfs:nfs3_rpc_wrapper+0x22/0x5c
Aug 7 09:55:42 copperhead klogd: [<ffffffff8840c5b5>] :nfs:nfs3_proc_access+0xf3/0x15e
Aug 7 09:55:42 copperhead klogd: [<ffffffff883ffde0>] :nfs:nfs_do_access+0x173/0x341
Aug 7 09:55:42 copperhead klogd: [<ffffffff884000a3>] :nfs:nfs_permission+0xf5/0x166
Aug 7 09:55:42 copperhead klogd: [<ffffffff802a7497>] permission+0xc7/0x118
Aug 7 09:55:42 copperhead klogd: [<ffffffff802a7bbc>] lookup_hash+0x21/0x43
Aug 7 09:55:42 copperhead klogd: [<ffffffff802ab6d8>] open_namei+0xe8/0x6a0
Aug 7 09:55:42 copperhead klogd: [<ffffffff8029ec5b>] do_filp_open+0x28/0x4b
Aug 7 09:55:42 copperhead klogd: [<ffffffff8029eccf>] do_sys_open+0x51/0xd2
Aug 7 09:55:42 copperhead klogd: [<ffffffff8029ed79>] sys_open+0x1b/0x1d
Aug 7 09:55:42 copperhead klogd: [<ffffffff8020bffa>] system_call_after_swapgs+0x8a/0x8f
Aug 7 09:55:42 copperhead klogd: DWARF2 unwinder stuck at system_call_after_swapgs+0x8a/0x8f
Aug 7 09:55:42 copperhead klogd: Leftover inexact backtrace:
Aug 7 09:55:42 copperhead syslog-ng[2774]: last message repeated 2 times
Aug 7 09:55:42 copperhead klogd: Code: 3b 88 44 8b 4b 44 44 8b 43 40 48 c7 c7 57 74 3b 88 48 8b 4b 50 89 04 24 31 c0 e8 ed 24 0a f8 4c 8b bb e8 00 00 00 49 39 df 74 04 <0f> 0b eb fe 49 8d 85 78 02 00 00 48 89 85 f8 fe ff ff f0 0f ba
Aug 7 09:55:42 copperhead klogd: RIP [<ffffffff883ad55a>] :sunrpc:rpcb_getport_async+0x66/0x460
Aug 7 09:55:42 copperhead klogd: RSP <ffff810110cff938>
Aug 7 09:55:42 copperhead klogd: ---[ end trace 4b516acb183486f8 ]---
I left out some obvious stuff. "uname -a" yields:
Linux copperhead 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux
Thanks for the report and all the details. Can I assume that you have rebooted 'copperhead' and it is working happily again? Has this only happened once, or multiple times? I have looked at the code and I think I can see what is happening. I will discuss this with some colleagues and try to find a solution.
Reset is necessary -- it won't reboot due to hang on NFS. Reboot fixes the problem temporarily, but it happens again eventually. I can't be 100% certain, but I think this is a problem with 64-bit installations but not 32-bit. I've been using 11.0 32-bit for some time with no NFS problems. I have other users with openSUSE 11.0 32-bit desktops and no NFS problems. Thanks for looking at this. I have a lot of servers to upgrade to 11.0 if/when this bug is fixed. If there's any other info I can provide, please let me know. --Carl Benson
I discovered that even with no users logged on, the NFS hang could be triggered by updatedb, which is started by a cron job. The defaults for updatedb (in /etc/sysconfig/locate) allow it to traverse /home. In past versions of Linux, NFS automounted /home was treated as a "netpath", but now NFS-mounted home directories are treated as local to the systems. The result (in my case) is that updatedb runs find across every one of hundreds of home directories on the NetApp Filer, and tries to mount home directories that are not exported to this system. I think that's where the hang happens. I killed find and updatedb, restarted nfs and autofs, and removed /home from the paths covered by updatedb. This fixed the hang. The underlying problem is still there, however.
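For anyone applying the same workaround, the change amounts to adding /home to the prune list in /etc/sysconfig/locate so updatedb never descends into automounted home directories. The variable name below is an assumption (it differs between releases -- check the comments in the file itself); this is a sketch of the intent, not the exact stock file:

```
# /etc/sysconfig/locate (sketch -- variable name may differ by release).
# Adding /home to the prune list keeps updatedb's find(1) run from
# traversing, and thereby triggering, every automounted home directory.
UPDATEDB_PRUNEPATHS="/mnt /media /tmp /var/tmp /home"
```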
I have been working a case with this very same issue on a NetApp storage appliance.
Automounter map entry:
vol00 -fstype=nfs4,rw,nosuid,rsize=32768,wsize=32768,hard,intr drive00.engr.uiowa.edu:/vol/vol1
uname -a:
Linux a-lnx001 2.6.25.5-1.1-default #1 SMP 2008-06-07 01:55:22 +0200 x86_64 x86_64 x86_64
The hardware is a box with AMD64 Opteron processors (HP xw9300 workstation). The filer does show a slow response time, as below from messages, and we did confirm correct functional operation and configuration on the filer:
Wed Jul 23 13:38:32 CDT [drive00: nfsap_process:warning]: NFSv4 response to client 128.255.18.37 for volume 0x1ff0eee was slow, 173 > 60 (in seconds)
I appreciate that this has been documented and will be watching for any updates.
It looks like the BUG_ON is simply wrong. I have removed it in the SUSE CVS tree. So you can:
- compile the kernel yourself with the line removed
- use a kotd (kernel of the day) from tomorrow
- wait for the next kernel update to be released (no idea when that might be).
I won't close this yet, though, as we need to explore a bit deeper to make sure we are 100% certain what is happening.
Is it true that this affects both the 64-bit and 32-bit default kernels? I looked at /usr/src/linux/net/sunrpc/rpcb_clnt.c on both a 32-bit and a 64-bit system, and they both have a BUG_ON at line 290.
I commented out line 290 of net/sunrpc/rpcb_clnt.c, recompiled the kernel, and rebooted. So far, I have not seen that error in /var/log/messages. Unfortunately, that did not help the incredibly slow response of NFS to our NetApp Filer. That situation is as bad as before. I don't believe it's the NetApp Filer. I have roughly equivalent servers running openSUSE 10.2, 10.3 and 11.0, all configured to automount home directories. As root, if I simultaneously "su - user" on all three systems, where the user has a home directory on the NetApp, the response on 10.2 and 10.3 is instant. The 11.0 system never returns the prompt. In /var/log/messages, it claims the NetApp is not responding, still trying.
I did not realise you were getting slow performance. You said that commands "hang", and I assumed this meant that they never returned. That would be consistent with the BUG_ON firing. Can you provide more information about this slowness? Are all operations slow or just some? Does performance steadily get worse? Does it sometimes get better?
I would say performance steadily gets worse. On an openSUSE 11.0 system left to itself, I see these messages in /var/log/messages every 30 minutes:
Aug 18 07:08:38 diamondback klogd: nfs: server netappb1 not responding, still trying
Aug 18 07:10:38 diamondback klogd: nfs: server netappb1 OK
netappb1 is performing well at the time. As root, I can try "su - <user>" for a number of users; it works for 4 users, but for the 5th user, "su - <user>" hangs and does not come back. In this case, I see in /var/log/messages that netappb1 has "timed out". I can Ctrl-C out of the "su - <user>". After that, if I do "df" it hangs, and has to be killed with "-9". If I get su-ed to a user and do an "ls", it may wait a long time, then respond. Other times, it will respond immediately. Here is /proc/mounts for copperhead when it's stuck on "df":
rootfs / rootfs rw 0 0
udev /dev tmpfs rw,mode=755 0 0
/dev/sda2 / ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
debugfs /sys/kernel/debug debugfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/sda3 /loc ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw 0 0
fusectl /sys/fs/fuse/connections fusectl rw 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
cplserv:/vol/cpl_data /mnt/cpl/data nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
auto.home /home autofs rw,fd=5,pgrp=3305,timeout=600,minproto=5,maxproto=5,indirect 0 0
proc /var/lib/ntp/proc proc rw 0 0
cplserv:/vol/cpl_home/tholzman /home/tholzman nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0
Copperhead and diamondback are identical servers running openSUSE 11.0. The problem does seem to be limited to home directories automounted from the NetApp Filer.
I have a few home directories on another server, and copperhead and diamondback have no trouble with those.
I was also curious what version of Data ONTAP is running on the filer. For our issue, the customer has upgraded their version of ONTAP to 7.3 and the problem has decreased significantly.
Our NetApp Filer is running Data ONTAP v7.2.4. We are using NFS v3 for all mounts. I noticed your case involves v4. Seems like v3 should be well-established code for all versions of Linux by this time, so that's why it's surprising that anything should be different for openSUSE 11.0 (vs 10.2 and 10.3).
Is there any more information or assistance I could provide that would help this move along? openSUSE 11.1 is going to be out soon -- is there any way to know if this same problem is going to persist in 11.1?
Sorry for not getting back to this sooner... I've been busy :-(
To summarise: openSUSE 11.0 sometimes (often) sees no response from a NetApp filer, but does not see problems talking to other (non-NetApp) file servers. openSUSE 10.2 and 10.3 don't have any problems talking to that same filer. So the problem is specific to 11.0 and NetApp. Quite strange. I will need to get some tracing information to find out what is happening. Could you collect two things for me please?
1/ tcpdump -w /tmp/trace -s 0 host netappb1
Start that running, then try to reproduce the problem, then interrupt the tcpdump and collect the /tmp/trace file. Compress it and attach it to the bugzilla. Hopefully it won't be too big.
2/ echo 65535 > /proc/sys/sunrpc/nfs_debug
echo 65535 > /proc/sys/sunrpc/rpc_debug
This writes trace information to the kernel log (it will appear in /var/log/messages, I think). After the 'echo's, reproduce the problem, then echo 0 to both files and collect the logs.
Probably best to do both of these at the same time: start tcpdump, do the two 'echo' commands, then reproduce the problem, then collect the logs. Thanks.
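For anyone reproducing this later, the two collection steps above can be combined in one small script so the packet capture and the RPC debug flags cover the same window. This is only a sketch built from the commands in this comment: "netappb1" and the trace path are this bug's values (substitute your own), and by default it just prints each step (set DRY_RUN=0 and run as root to actually collect data).

```shell
#!/bin/sh
# Sketch of Neil's combined trace-collection procedure (see comment above).
# DRY_RUN=1 (the default) only prints each command; DRY_RUN=0 executes it.
DRY_RUN=${DRY_RUN:-1}
SERVER=${SERVER:-netappb1}   # assumption: replace with your NFS server

step() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        eval "$*"
    fi
}

step "tcpdump -w /tmp/trace -s 0 host $SERVER & echo \$! > /tmp/tcpdump.pid"
step "echo 65535 > /proc/sys/sunrpc/nfs_debug"   # verbose NFS tracing on
step "echo 65535 > /proc/sys/sunrpc/rpc_debug"   # verbose RPC tracing on
# ... reproduce the hang here, then turn everything off again ...
step "echo 0 > /proc/sys/sunrpc/nfs_debug"
step "echo 0 > /proc/sys/sunrpc/rpc_debug"
step "kill \$(cat /tmp/tcpdump.pid)"             # stop the capture
```

Afterwards, compress /tmp/trace and attach it together with the matching /var/log/messages excerpt.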
Created attachment 238696 [details] tcpdump trace
tcpdump -i bond0 -w /tmp/trace -s 0 host netappb1
During this window, as root, I su-ed to two user accounts with home dirs automounted from netappb1, and the "su" hung until it timed out.
Created attachment 238706 [details] excerpt from /var/log/messages
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug
(recreate problem)
echo 0 >/proc/sys/sunrpc/nfs_debug
echo 0 >/proc/sys/sunrpc/rpc_debug
Darn. I missed your bit at the end about doing both at once. At any rate, from the messages, it appears to loop once per minute, so it's just the same thing over and over.
*** Bug 416654 has been marked as a duplicate of this bug. ***
After looking at both this bug and bug 416654 I realise there is a piece of information I did not supply: the NFS servers in my case were both running SuSE 10.2. One of them was 32-bit, the other 64-bit. In all my cases the client was 64-bit SuSE 11.0, and I have not seen the problem on similarly configured 32-bit clients. I believe a workaround is to put 'udp' in the fstab options; at any rate it has worked so far for me. This is not ideal, however, because the man page suggests a performance impact.
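For reference, the workaround is just the 'udp' mount option in /etc/fstab. The server, export, and mount point below are placeholders, not values from this bug, and nfs(5) warns that NFS over UDP risks data corruption in some failure modes, so this is a stopgap rather than a recommendation:

```
# /etc/fstab -- illustrative entry only; "server:/export" and the mount
# point are placeholders. "udp" sidesteps the TCP connection hang at the
# cost of the UDP caveats described in nfs(5).
server:/export  /home/export  nfs  rw,hard,intr,udp  0 0
```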
Thanks for the traces. I can now see some of what is happening. The tcpdump trace shows a normal NFS startup handshake where the client first checks with portmap to find mountd, then talks to mountd to get the root filehandle. Then it checks with portmap again to get the port for NFS (2049) and then connects to the NFS server. You would expect to then see an FSINFO call, an FSINFO reply, then a PATHCONF call and a PATHCONF reply. Instead we see various other calls such as GETATTR, FSSTAT, PATHCONF, all for *other* filehandles. These calls are made consecutively without waiting for a reply. As soon as the server sees the PATHCONF request (with a different filehandle to what the mount returned), it closes the TCP connection. The client then opens the connection again and submits several requests, one of which fails. What must be happening here is that several mounts are all being attempted against the same server at the same time. When this happens the client will use the one TCP connection for all of the mount points. The PATHCONF for one of these causes the server to (incorrectly, I think) close the connection. This just causes it to be retried again and again, making no progress and blocking any other mounts from the same fileserver. So it seems like the NetApp filer is doing the wrong thing, closing the connection. However Bob's situation doesn't involve a NetApp filer. So maybe Bob's problem is completely different and just looks similar, or maybe I'm missing something. Bob: are you able to collect a tcpdump trace of traffic when the problem occurs? That might be helpful.
Sorry, but for the moment I cannot collect information because the machines that demonstrate it are all production machines, and I switched them all to UDP mounts so as to get the service ready for the new term. However I am expecting a couple of new 64-bit workstations in a week or two, so I will try to reproduce the problem on them.
Regarding Neil's comment about various calls for other filehandles: could automounter be generating those? I have noticed since openSUSE 10.3 that automounter tries to mount every possible home directory in some cases. I have seen this on 64-bit servers, but not on 32-bit desktops. Since we have 155 possible home directories, that would be quite a few extra calls. I'll do some testing on an 11.0 server where home directories are statically mounted, and see if it hangs up, too.
New information: this problem has now occurred (twice) on a 32-bit client. I will attach a binary tcpdump taken on the server. While tcpdump was running I typed (on the client):
umount -fl /rmt/csnewton/CS
mount /rmt/csnewton/CS
(waited a while)
Ctrl-C
umount -fl /rmt/csnewton/CS (failed)
mount -o udp /rmt/csnewton/CS (worked)
Created attachment 243853 [details] gzipped tcpdump file
Thanks for the tcpdump trace. Unfortunately you didn't specify "-s 0", so the packets are truncated. However it looks like there is a fairly normal handshake between client and server where the client finds out the port number and filehandle that it needs. But then it never even tries to establish a TCP connection to the server. It seems likely that the NFS client thinks it already has a connection to that server (for some other mount point) and is trying to reuse it, but for some reason that connection isn't making any progress. This is quite different from the other trace, where the client kept trying to make connections but they kept closing for some reason. So it seems quite likely that we have (at least) two different bugs here. To make any more progress, I think I'll really need a 'tcpdump -s 0' and a kernel message log with
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug
both collected at the same time.
Any updates here? Is the bug still occurring?
Sorry for not replying sooner, but I don't have much information to give. The machines where the problem happened before are all production machines and I don't want to destabilize them by moving back from udp to tcp. I have tried to reproduce the problem on a workstation, but so far without any joy (even when hammering NFS with a ferocious mixture of find and tar). However I have some other avenues to pursue, so I will try them next week.
At our site, the bug seems to be related to automounter. Since openSUSE 11.1 will be out soon, and since I have a Service Level Agreement requiring me to install 11.1 on servers as soon as possible, I've decided to abandon automounter. All NFS mounts will be static, and automounts will be replaced by symbolic links in /home. While this is an unfortunate waste of time, it also bypasses other long-standing problems with automounter. We are currently unable to upgrade Data ONTAP on our NetApp Filer (which may or may not fix this problem). But we will do so in the future.
For what it's worth, we don't use automounter at this site. All our NFS mounts are made at boot-time.
Thanks. I'll leave this bug open a while longer in case something comes up. Just a pointer for anyone else who happens upon this bug: I need the information described in comment #24 to make any progress.
Yesterday I got permission to temporarily revert one of our servers to TCP rather than UDP mounting, and I tried really hard to provoke the bug, but without success. However, I don't want to risk destabilising the service at this time of year, so I have put all the servers back to UDP now. I conclude that either some other change has fixed the bug, or else I failed to reproduce the conditions which provoke it. Sorry I can't be more helpful.
On Monday I upgraded our NFS server to SuSE 10.3, and I have just had a hang on a client which never saw the problem before. The client is a 32-bit Athlon running 11.0, with kernel 2.6.25.16-0.1-pae. The server is also 32-bit, and is running kernel 2.6.22.19-0.1-default. The network connection is fine: I can still ssh from the client to the server, and rpcinfo looks good. I will attach: (1) kernel messages on the client (2) rpcinfo run on the client (rpcinfo -p server) (3) a binary tcpdump file produced on the server. During this period I ran a mount command, which just hung. I don't mind leaving the client in its current state for a few hours, so if you need any more info let me know.
Created attachment 260570 [details] kernel messages on client
Created attachment 260572 [details] rpcinfo run on client, reporting on server
Created attachment 260574 [details] tcpdump output showing all traffic between client and server
I've looked at this a few times now and haven't made much progress. There is no NFS traffic between client and server at all. Assuming the capture is correct, that definitely shows something is wrong on the client. Based on the RPC tracing messages, it is conceivable that XPRT_CLOSING has been set on the transport, but it isn't getting closed completely. However I'm not really familiar enough with the code to know if that is even possible. If you feel up to rebuilding a kernel (which I suspect is unlikely) it would be helpful to put a printk in xs_tcp_connect in net/sunrpc/xprtsock.c like this:

	/* Exit if we need to wait for socket shutdown to complete */
	if (test_bit(XPRT_CLOSING, &xprt->state)) {
		printk("Still closing\n");
		return;
	}

and run the same experiment and see if that message gets printed out. If it does, I would know where to start looking. If not... I'm still lost. A less intrusive thing that might be helpful is to run
netstat -nt
while the client is hanging and see if it mentions the connection to the server at all, and if so, what state the connection is in. Do this a few times over several minutes to see if it changes at all. That is all I can suggest at the moment.
This bug seems to be languishing with no obvious way forward... Is this still a problem? If it is, I'll try harder, but it will probably involve you needing to rebuild the kernel with a patch applied.
I'm not the originator of the bug report, just a fellow sufferer, but I'll give you the situation from my perspective. The bug only ever manifested itself on production servers, so the desire to obtain information clashed with the need to keep a service running. My solution was to change all the mounts to UDP and this was successful, but it concerns me because of the dire warnings of data loss in the NFS man page. I am about to do my annual upgrade and over the next few weeks I will upgrade all the production machines to 11.1. At the same time I will bravely change the NFS mounts back to TCP and see whether the problem returns. Can you confirm the information I should gather if the problem does return? In comment #24 you mention 'tcpdump -s0' and a kernel message log with
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug
Is that on the client that you want the message log? Is there anything else that would help? Thanks, Bob
(In reply to comment #37)
> Can you confirm the information I should gather if the problem does
> return?
>
> In comment #24 you mention
> 'tcpdump -s0' and a kernel message log with
> echo 65535 >/proc/sys/sunrpc/nfs_debug
> echo 65535 >/proc/sys/sunrpc/rpc_debug
Yes, that is the information that would be useful.
> Is that on the client that you want the message log? Is there anything
> else that would help?
Yes, on the client. Nothing else that might be useful springs to mind.
Created attachment 310287 [details] log messages and network trace OK, it's happened again. Both client and server are running Opensuse 11.1, and I've attached a gzipped tar file containing messages and a network trace. Just to recap on what happens: (1) Various things hang, including df. Running 'strace df' reveals which NFS file system it gets stuck on. At this point I collected the debugging info for a minute or two. (2) I unmounted the file system with 'umount -fl' (3) An attempt to mount it hung, so I interrupted that and tried 'mount -o udp' which worked. (4) df now hangs on a different NFS file system so I repeated steps 1 to 3 until df ran to completion. I hope this is useful. My quick analysis of the network trace suggests there is no NFS-related traffic going on...don't know what this signifies. Bob
Thanks for the trace. As you say, no NFS traffic. It seems the sunrpc code is confused about the status of the connection. It knows it needs to open a new connection, but it seems to think that is already happening, despite the fact that it isn't. Matching the trace to the code, it seems that either XPRT_CLOSING is set or XPRT_CONNECTING is set. When XPRT_CONNECTING is set, it schedules a task for at most 5 minutes in the future which is guaranteed to clear XPRT_CONNECTING again, so that cannot stay set for very long. The trace you captured was for less than 5 minutes so I cannot be sure that the task didn't fire in that time, but it seems unlikely. So that points to XPRT_CLOSING. There is an upstream patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=670f94573104b4a25525d3fcdcd6496c678df172
which moves the setting of that flag. I'm not at all certain that this patch will fix the problem, but it is possible, and it is the only credible possibility at the moment. The important part of the patch is moving the "set_bit(XPRT_CLOSING" from the TCP_CLOSE_WAIT case to the TCP_LAST_ACK case of the switch. Are you in a position to compile your own kernel? Could you apply that patch and try it out? I'll get that patch into the 10.1 kernel (it is already in 11.1) so it will be in any future update, but I don't know when that is likely to be.
Thanks for doing the analysis. I should be able to compile a kernel, though it is quite a few years since I have done one so I would have to do some revision. I'm a bit confused by your last sentence...are there some typos or am I misinterpreting the version numbers? Both client and server were running Opensuse 11.1 when I ran the latest trace.
There were typos, but still confusion. I meant that I would get the patch in to 11.0, and that it is already in 11.1. I just checked and the patch went in to the 11.1 kernel on July 25th, 3 weeks ago. I don't think we've released a kernel since then, so the best thing to do would probably be grab the relevant kernel-of-the-day from http://ftp.suse.com/pub/projects/kernel/kotd/SLE11_BRANCH/ (openSuSE 11.1 and SLE11 use the same kernel).
Neil: I started following your advice, but then held off when I heard there was a new official kernel in the pipeline to fix the NULL pointer dereference bug. That has been released now and I have started deploying it: could you confirm my impression that it includes the XPRT_CLOSING patch? The change log for kernel-default-base-2.6.27.29-0.1.1 includes:
* Fri Jul 31 2009 gregkh@suse.de
- Update to 2.6.27.29
- lots of bugfixes
* Sat Jul 25 2009 gregkh@suse.de
- Update to 2.6.27.27 and 2.6.27.28:
- bugfixes
so I am guessing that one of those includes the patch. I'll keep my fingers crossed that it either fixes the problem or provides more information. Thanks, Bob
Yes, the fix was in 2.6.27.28 so you do have the XPRT_CLOSING fix. NeilBrown
Created attachment 317175 [details] log messages and tcpdump trace
Unfortunately the problem happened again, this time with the latest 11.1 kernel:
# uname -a
Linux ribosome 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux
I have attached (as a gzipped tar file):
/var/log/warn
/var/log/messages
a binary tcpdump file
Hope this is useful, Bob
Thanks... it may well be useful. Based on that trace I'm thinking that upstream commit
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=01d37c428ae080563c0a3bb8bdfa88c65a6891d3
might be the fix. I'll try to reproduce the problem and see if that fixes it. The current reports seem to be very different from those which started this bug off. I wonder if we are now looking at a different bug that has a similar symptom... Not that it really matters, I guess.
I'm not having a lot of luck. I'm no longer thinking of that patch... I compared the trace with the kernel code and I'm fairly sure that the trace is only possible if either XPRT_CLOSING is set or XPRT_CONNECTING is set. The first should only be set if the TCP connection is in one of the states: FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT, LAST_ACK. It is very unlikely the connection would stay in those states for more than a couple of minutes provided the server was up and responding. If it happens again you could check whether the TCP connection is still present, and what its state is, with
netstat -nt | grep 2049
Setting of XPRT_CONNECTING will be accompanied by a debug message mentioning xs_connect; then it will be cleared again within at most 5 minutes, which will result in a "worker connecting ..." message. We don't see either of those messages over a period longer than 5 minutes. So my only hope is that "netstat -nt" might show something useful if the problem happens again (assuming you are willing to try running with TCP for a while again...). So: setting to "NEEDINFO" for "Can I get 'netstat -nt' output while the problem is happening?".
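As a concrete sketch of that check: in the numeric TCP listing from netstat, the state is the sixth whitespace-separated field, so the state of the port-2049 connection can be pulled out with awk. The sample line below is fabricated to show the format (it is not from the affected machine); on the real client you would pipe live netstat output instead.

```shell
#!/bin/sh
# Sketch: extract the state of the TCP connection to the NFS port (2049).
# On the affected client you would run:  netstat -nt | awk '/:2049/ {print $6}'
# The sample line here is fabricated for illustration only.
sample='tcp        0      0 192.168.1.10:755        192.168.1.20:2049       FIN_WAIT_1'

# Field 6 of netstat's numeric TCP output is the connection state.
state=$(printf '%s\n' "$sample" | awk '/:2049/ {print $6}')
echo "$state"
```

Checking this a few times over several minutes would show whether the connection is stuck in one of the closing states listed above.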
I do still have some machines running with NFS over TCP so if it happens again I will provide the extra information. Bob
As we unfortunately seem to be getting nowhere with this, I'm going to close it as NORESPONSE (which sounds a bit harsh, as you have responded as best you can... but it still seems the best fit), and hope that it is fixed in 11.2, which is now out. Thanks for your efforts, and apologies for not finding the problem.