Bugzilla – Full Text Bug Listing
| Summary: | nfs-mounted directories hang | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.0 | Reporter: | Carl Benson <cbenson> |
| Component: | Kernel | Assignee: | Neil Brown <nfbrown> |
| Status: | RESOLVED NORESPONSE | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | | |
| Priority: | P2 - High | CC: | cbenson, cyberorg, david.roper, R.Vickers, ursula.brueckner |
| Version: | Final | | |
| Target Milestone: | --- | | |
| Hardware: | x86-64 | | |
| OS: | openSUSE 11.1 | | |
| Whiteboard: | | | |
| Found By: | Community User | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | tcpdump trace; excerpt from /var/log/messages; gzipped tcpdump file; kernel messages on client; rpcinfo run on client, reporting on server; tcpdump output showing all traffic between client and server; log messages and network trace; log messages and tcpdump trace | | |
Description
Carl Benson
2008-08-07 21:24:43 UTC
I left out some obvious stuff. "uname -a" yields:

Linux copperhead 2.6.25.11-0.1-default #1 SMP 2008-07-13 20:48:28 +0200 x86_64 x86_64 x86_64 GNU/Linux

Thanks for the report and all the details. Can I assume that you have rebooted 'copperhead' and it is working happily again? Has this only happened once, or multiple times? I have looked at the code and I think I can see what is happening. I will discuss this with some colleagues and try to find a solution.

A reset is necessary -- it won't reboot due to the hang on NFS. Rebooting fixes the problem temporarily, but it happens again eventually. I can't be 100% certain, but I think this is a problem with 64-bit installations but not 32-bit. I've been using 11.0 32-bit for some time with no NFS problems, and I have other users with openSUSE 11.0 32-bit desktops and no NFS problems. Thanks for looking at this. I have a lot of servers to upgrade to 11.0 if/when this bug is fixed. If there's any other info I can provide, please let me know. --Carl Benson

I discovered that even with no users logged on, the NFS hang could be triggered by updatedb, which is started by a cron job. The defaults for updatedb (in /etc/sysconfig/locate) allow it to traverse /home. In past versions of Linux, an NFS-automounted /home was treated as a "netpath", but now NFS-mounted home directories are treated as local to the system. The result (in my case) is that updatedb runs find across every one of hundreds of home directories on the NetApp Filer, and tries to mount home directories that are not exported to this system. I think that's where the hang happens. I killed find and updatedb, restarted nfs and autofs, and removed /home from the paths covered by updatedb. This fixed the hang. The underlying problem is still there, however.

I have been working a case with this very same issue on a NetApp storage appliance.
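[Editorial sketch, not from the thread.] The workaround described above — keeping updatedb away from automounted home directories — can also be expressed by pruning network file systems wholesale. The fragment below uses the widely used updatedb.conf form; the exact file location and variable names in /etc/sysconfig/locate on openSUSE of that era may differ, so treat this as illustrative:

```shell
# /etc/updatedb.conf (illustrative; check your distribution's location)
# Skip network file systems entirely, so the nightly updatedb run never
# triggers automount storms across hundreds of NFS home directories.
PRUNEFS="nfs nfs4 autofs"
# Belt and braces: also skip the automount point itself.
PRUNEPATHS="/home /media /tmp"
```

Either line alone is enough to stop updatedb's find from walking the filer.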
automounter contents:

vol00 -fstype=nfs4,rw,nosuid,rsize=32768,wsize=32768,hard,intr drive00.engr.uiowa.edu:/vol/vol1

uname -a: Linux a-lnx001 2.6.25.5-1.1-default #1 SMP 2008-06-07 01:55:22 +0200 x86_64 x86_64 x86_64

The hardware is a box with AMD-64 Opteron procs (an HP xw9300 workstation). The filer does show a slow response time, as below from messages, and we did confirm correct functional operation and configuration on the filer:

Wed Jul 23 13:38:32 CDT [drive00: nfsap_process:warning]: NFSv4 response to client 128.255.18.37 for volume 0x1ff0eee was slow, 173 > 60 (in seconds)

I appreciate that this has been documented and will be watching for any updates.

It looks like the BUG_ON is simply wrong.
I have removed it in the SUSE CVS tree.
So you can:
- compile the kernel yourself with the line removed
- use a kotd (kernel of the day) from tomorrow
- wait for the next kernel update to be released (no idea when
that might be).
I won't close this yet though as we need to explore a bit deeper to
make sure we are 100% certain what is happening.
Is it true that this affects both the 64-bit and 32-bit default kernels? I looked at /usr/src/linux/net/sunrpc/rpcb_clnt.c on both a 32-bit and a 64-bit system, and they both have a BUG_ON at line 290.

I commented out line 290 of net/sunrpc/rpcb_clnt.c, recompiled the kernel, and rebooted. So far, I have not seen that error in /var/log/messages. Unfortunately, that did not help the incredibly slow response of NFS to our NetApp Filer. That situation is as bad as before. I don't believe it's the NetApp Filer. I have roughly equivalent servers running openSUSE 10.2, 10.3 and 11.0, all configured to automount home directories. As root, if I simultaneously "su - user" on all three systems, where the user has a home directory on the NetApp, the response on 10.2 and 10.3 is instant. The 11.0 system never returns the prompt. In /var/log/messages, it claims the NetApp is not responding, still trying.

I did not realise you were getting slow performance. You said that commands "hang", and I assumed this meant that they never returned. That would be consistent with the BUG_ON firing. Can you provide more information about this slowness? Are all operations slow or just some? Does performance steadily get worse? Does it sometimes get better?

I would say performance steadily gets worse. On an openSUSE 11.0 system left to itself, I see these messages in /var/log/messages every 30 minutes:

Aug 18 07:08:38 diamondback klogd: nfs: server netappb1 not responding, still trying
Aug 18 07:10:38 diamondback klogd: nfs: server netappb1 OK

netappb1 is performing well at the time. As root, I can try "su - <user>" for a number of users, and it works for 4 users, but for the 5th user, "su - <user>" hangs and does not come back. In this case, I see in /var/log/messages that netappb1 has "timed out". I can Ctrl-C out of the "su - <user>". After that, if I do "df" it hangs, and has to be killed with "-9". If I get su-ed to a user and do an "ls", it may wait a long time, then respond. Other times, it will respond immediately.

Here is /proc/mounts for copperhead when it's stuck on "df":

rootfs / rootfs rw 0 0
udev /dev tmpfs rw,mode=755 0 0
/dev/sda2 / ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
proc /proc proc rw 0 0
sysfs /sys sysfs rw 0 0
debugfs /sys/kernel/debug debugfs rw 0 0
devpts /dev/pts devpts rw,gid=5,mode=620 0 0
/dev/sda3 /loc ext3 rw,errors=continue,user_xattr,acl,barrier=1,data=ordered 0 0
securityfs /sys/kernel/security securityfs rw 0 0
fusectl /sys/fs/fuse/connections fusectl rw 0 0
rpc_pipefs /var/lib/nfs/rpc_pipefs rpc_pipefs rw 0 0
cplserv:/vol/cpl_data /mnt/cpl/data nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
auto.home /home autofs rw,fd=5,pgrp=3305,timeout=600,minproto=5,maxproto=5,indirect 0 0
proc /var/lib/ntp/proc proc rw 0 0
cplserv:/vol/cpl_home/tholzman /home/tholzman nfs rw,vers=3,rsize=65536,wsize=65536,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=140.107.170.71 0 0

Copperhead and diamondback are identical servers running openSUSE 11.0. The problem does seem to be limited to home directories automounted from the NetApp Filer. I have a few home directories on another server, and copperhead and diamondback have no trouble with those.

I was also curious what version of Data ONTAP is running on the filer. For our issue the customer has upgraded their version of ONTAP to 7.3 and the problem has decreased significantly.

Our NetApp Filer is running Data ONTAP v7.2.4. We are using NFS v3 for all mounts. I noticed your case involves v4. It seems like v3 should be well-established code for all versions of Linux by this time, so that's why it's surprising that anything should be different for openSUSE 11.0 (vs 10.2 and 10.3). Is there any more information or assistance I could provide that would help this move along?
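[Editorial sketch, not from the thread.] A /proc/mounts listing like the one above can be filtered programmatically to pick out the NFS mounts — the ones "df" can hang on. A minimal Python sketch, assuming the standard six-field /proc/mounts layout:

```python
# Illustrative helper: list the NFS mounts in a /proc/mounts dump.

def nfs_mounts(proc_mounts_text):
    """Return (device, mountpoint, options) for each nfs/nfs4 mount."""
    found = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts lines: device mountpoint fstype options dump pass
        if len(fields) >= 4 and fields[2] in ("nfs", "nfs4"):
            found.append((fields[0], fields[1], fields[3]))
    return found

sample = """/dev/sda2 / ext3 rw 0 0
cplserv:/vol/cpl_data /mnt/cpl/data nfs rw,vers=3,hard,proto=tcp 0 0
auto.home /home autofs rw,fd=5 0 0
cplserv:/vol/cpl_home/tholzman /home/tholzman nfs rw,vers=3,hard,proto=tcp 0 0"""

for dev, mnt, opts in nfs_mounts(sample):
    print(mnt, "<-", dev)
```

In the listing above this would flag the two cplserv mounts, matching the observation that only home directories automounted from the filer are affected.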
openSUSE 11.1 is going to be out soon -- is there any way to know if this same problem is going to persist in 11.1?

Sorry for not getting back to this sooner .... I've been busy :-(
To summarise: openSUSE11.0 sometimes (often) sees no response from a NetApp filer,
but does not see problems talking to other (non-NetApp) file servers.
openSUSE10.2 and 10.3 don't have any problems talking to that same filer.
So the problem is specific to 11.0 and NetApp. Quite strange.
I will need to get some tracing information to find out what is happening.
Could you collect two things for me please?
1/ tcpdump -w /tmp/trace -s 0 host netappb1
start that running, then try to reproduce the problem, then interrupt the tcpdump
and collect the "/tmp/trace" file. Compress it and attach it to the bugzilla
Hopefully it won't be too big.
2/ echo 65535 > /proc/sys/sunrpc/nfs_debug
echo 65535 > /proc/sys/sunrpc/rpc_debug
This writes trace information to the kernel log (it will appear in /var/log/messages I think).
After the 'echo's, reproduce the problem, then echo 0 to both files and
collect the logs.
Probably best to do both of these at the same time: start tcpdump,
do the two 'echo' commands, then reproduce the problem, then collect the
logs.
Thanks.
Created attachment 238696 [details]
tcpdump trace
tcpdump -i bond0 -w /tmp/trace -s 0 host netappb1
During this window, as root, I su-ed to two user accounts with home dirs
automounted from netappb1, and the "su" hung until it timed out.
Created attachment 238706 [details]
excerpt from /var/log/messages
echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug
recreate problem
echo 0 >/proc/sys/sunrpc/nfs_debug
echo 0 >/proc/sys/sunrpc/rpc_debug
Darn. I missed your bit at the end about doing both at once.
At any rate, from the messages, it appears to loop once per
minute, so it's just the same thing over and over.
*** Bug 416654 has been marked as a duplicate of this bug. ***

After looking at both this bug and bug 416654 I realise there is a piece of information I did not supply: the NFS servers in my case were both running SuSE 10.2. One of them was 32-bit, the other 64-bit. In all my cases the client was 64-bit SuSE 11.0, and I have not seen the problem on similarly configured 32-bit clients. I believe a workaround is to put 'udp' in the fstab options; at any rate it has worked so far for me. This is not ideal, however, because the man page suggests a performance impact.

Thanks for the traces. I can now see some of what is happening. The tcpdump trace shows the normal NFS startup handshake where the client first checks with portmap to find mountd, then talks to mountd to get the root filehandle. Then it checks with portmap again to get the port for NFS (2049) and then connects to the NFS server. You would expect to then see an FSINFO call, an FSINFO reply, then a PATHCONF call and a PATHCONF reply. Instead we see various other calls such as GETATTR, FSSTAT, PATHCONF, all for *other* filehandles. These calls are made consecutively without waiting for a reply. As soon as the server sees the PATHCONF request (with a different filehandle to what the mount returned), it closes the TCP connection. The client then opens the connection again and submits several requests, one of which fails.

What must be happening here is that several mounts are all being attempted against the same server at the same time. When this happens the client will use the one TCP connection for all of the mount points. The PATHCONF for one of these causes the server to (incorrectly, I think) close the connection. This just causes it to be retried again and again, making no progress and blocking any other mounts from the same fileserver.

So it seems like the NetApp filer is doing the wrong thing, closing the connection. However Bob's situation doesn't involve a NetApp filer.
So maybe Bob's problem is completely different and just looks similar, or maybe I'm missing something. Bob: are you able to collect a tcpdump trace of traffic when the problem occurs? That might be helpful.

Sorry, but for the moment I cannot collect information because the machines that demonstrate it are all production machines, and I switched them all to UDP mounts so as to get the service ready for the new term. However I am expecting a couple of new 64-bit workstations in a week or two, so I will try to reproduce the problem on them.

Regarding Neil's comment about the various calls for other filehandles: could automounter be generating these? I have noticed since openSUSE 10.3 that automounter tries to mount every possible home directory in some cases. I have seen this on 64-bit servers, but not on 32-bit desktops. Since we have 155 possible home directories, that would be quite a few extra calls. I'll do some testing on an 11.0 server where home directories are statically mounted, and see if it hangs up, too.

New information: this problem has now occurred (twice) on a 32-bit client. I will attach a binary tcpdump taken on the server. While tcpdump was running I typed (on the client):

umount -fl /rmt/csnewton/CS
mount /rmt/csnewton/CS
(waited a while)
Ctrl-C
umount -fl /rmt/csnewton/CS (failed)
mount -o udp /rmt/csnewton/CS (worked)

Created attachment 243853 [details]
gzipped tcpdump file
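[Editorial sketch, not from the thread.] The failure mode Neil describes in the first trace — several mounts sharing one TCP connection, with the server closing that connection whenever it sees a PATHCONF for a filehandle it did not hand out — can be illustrated with a toy model. This is not kernel code; the class names and filehandle strings are invented for illustration:

```python
# Toy model of the shared-connection retry loop described above.

class Server:
    def __init__(self, known_fh):
        self.known_fh = known_fh

    def handle(self, call, fh):
        # Mimic the filer behaviour seen in the trace: close the TCP
        # connection on a PATHCONF for an unrecognised filehandle.
        if call == "PATHCONF" and fh not in self.known_fh:
            return "CLOSE"
        return "REPLY"

def attempt_mounts(server, mounts, max_retries=3):
    """All mounts share one connection; return the mounts that never complete."""
    pending = list(mounts)
    for _ in range(max_retries):
        dropped = False
        for fh in list(pending):
            for call in ("FSINFO", "PATHCONF"):
                if server.handle(call, fh) == "CLOSE":
                    dropped = True
                    break
            if dropped:
                break  # connection closed: reconnect and retry everything
            pending.remove(fh)
        if not pending:
            break
    return pending

# Two mounts race on one connection; the first uses a filehandle the
# server does not recognise, so every retry dies before either finishes.
server = Server(known_fh={"fh-root"})
print(attempt_mounts(server, ["fh-other", "fh-root"]))
```

The point of the toy: one bad mount poisons the shared connection, so even the mount with the correct filehandle makes no progress — matching "blocking any other mounts from the same fileserver".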
Thanks for the tcpdump trace. Unfortunately you didn't specify "-s 0" so the packets are truncated. However it looks like there is a fairly normal handshake between client and server where the client finds out the port number and filehandle that it needs. But then it never even tries to establish a TCP connection to the server. It seems likely that the NFS client thinks it already has a connection to that server (for some other mount point) and is trying to reuse it, but for some reason that connection isn't making any progress.

This is quite different to the other trace, where the client kept trying to make connections but they kept closing for some reason. So it seems quite likely that we have (at least) two different bugs here.

To make any more progress, I think I'll really need a 'tcpdump -s0' and a kernel message log with

echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug

both collected at the same time.

Any updates here? Is the bug still occurring?

Sorry for not replying sooner, but I don't have much information to give. The machines where the problem happened before are all production machines and I don't want to destabilize them by moving back from udp to tcp. I have tried to reproduce the problem on a workstation, but so far without any joy (even when hammering NFS with a ferocious mixture of find and tar). However I have some other avenues to pursue, so I will try them next week.

At our site, the bug seems to be related to automounter. Since openSUSE 11.1 will be out soon, and since I have a Service Level Agreement requiring me to install 11.1 on servers as soon as possible, I've decided to abandon automounter. All NFS mounts will be static, and automounts will be replaced by symbolic links in /home. While this is an unfortunate waste of time, it also bypasses other long-standing problems with automounter. We are currently unable to upgrade Data ONTAP on our NetApp Filer (which may or may not fix this problem).
But we will do so in the future. For what it's worth, we don't use automounter at this site. All our NFS mounts are made at boot-time.

Thanks. I'll leave this bug open a while longer in case something comes up. Just a pointer for anyone else who happens upon this bug: I need the information described in comment #24 to make any progress.

Yesterday I got permission to temporarily revert one of our servers to TCP rather than UDP mounting, and I tried really hard to provoke the bug, but without success. However, I don't want to risk destabilising the service at this time of year, so I have put all the servers back to UDP now. I conclude that either some other change has fixed the bug, or else I failed to reproduce the conditions which provoke it. Sorry I can't be more helpful.

On Monday I upgraded our NFS server to SuSE 10.3, and I have just had a hang on a client which never saw the problem before. The client is a 32-bit Athlon running 11.0, with kernel 2.6.25.16-0.1-pae. The server is also 32-bit, and is running kernel 2.6.22.19-0.1-default. The network connection is fine: I can still ssh from the client to the server, and rpcinfo looks good. I will attach:

(1) kernel messages on the client
(2) rpcinfo run on the client (rpcinfo -p server)
(3) a binary tcpdump file produced on the server

During this period I ran a mount command, which just hung. I don't mind leaving the client in its current state for a few hours, so if you need any more info let me know.

Created attachment 260570 [details]
kernel messages on client
Created attachment 260572 [details]
rpcinfo run on client, reporting on server
Created attachment 260574 [details]
tcpdump output showing all traffic between client and server
I've looked at this a few times now and haven't made much progress.
There is no NFS traffic between client and server at all. Assuming
the capture is correct, that definitely shows something is wrong on the
client.
Based on the rpc tracing messages, it is conceivable that XPRT_CLOSING
has been set on the transport, but it isn't getting closed completely.
However I'm not really familiar enough with the code to know if that is
even possible.
If you feel up to rebuilding a kernel (which I suspect is unlikely),
it would be helpful to put a printk in xs_tcp_connect in net/sunrpc/xprtsock.c
like this:

	/* Exit if we need to wait for socket shutdown to complete */
	if (test_bit(XPRT_CLOSING, &xprt->state)) {
		printk("Still closing\n");
		return;
	}
and run the same experiment and see if that message gets printed out.
If it does, I would know where to start looking.
If not.. I'm still lost.
A less intrusive thing that might be helpful is to run
netstat -nt
while the client is hanging and see if it mentions the connection
to the server at all, and if so, what state the connection is in.
Do this a few times over several minutes to see if it changes at all.
That is all I can suggest at the moment.
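[Editorial sketch, not from the thread.] The netstat sampling Neil suggests can be scripted. As an illustrative helper — assuming the classic Linux `netstat -nt` column layout, with hypothetical addresses — this pulls out the state of any TCP connection touching the NFS port:

```python
# Illustrative parser for `netstat -nt` output: report the state of
# every connection to or from the NFS port while the client is hanging.

NFS_PORT = 2049

def nfs_connection_states(netstat_output):
    """Return (local, remote, state) for each connection touching port 2049."""
    states = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: tcp  recvq  sendq  local:port  remote:port  STATE
        if len(fields) >= 6 and fields[0] == "tcp":
            local, remote, state = fields[3], fields[4], fields[5]
            if local.endswith(f":{NFS_PORT}") or remote.endswith(f":{NFS_PORT}"):
                states.append((local, remote, state))
    return states

sample = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 128.255.18.37:741       140.107.170.71:2049     FIN_WAIT_2
tcp        0      0 128.255.18.37:22        128.255.18.1:51234      ESTABLISHED"""

print(nfs_connection_states(sample))
```

Run the real command a few times over several minutes, as suggested above; a connection stuck in FIN_WAIT_1/2, CLOSING, TIME_WAIT, or LAST_ACK would be consistent with XPRT_CLOSING staying set.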
This bug seems to be languishing with no obvious way forward... Is this still a problem? If it is I'll try harder, but it will probably involve you needing to rebuild the kernel with a patch applied.

I'm not the originator of the bug report, just a fellow sufferer, but I'll give you the situation from my perspective. The bug only ever manifested itself on production servers, so the desire to obtain information clashed with the need to keep a service running. My solution was to change all the mounts to UDP, and this was successful, but it concerns me because of the dire warnings of data loss in the NFS man page. I am about to do my annual upgrade, and over the next few weeks I will upgrade all the production machines to 11.1. At the same time I will bravely change the NFS mounts back to TCP and see whether the problem returns.

Can you confirm the information I should gather if the problem does return? In comment #24 you mention 'tcpdump -s0' and a kernel message log with

echo 65535 >/proc/sys/sunrpc/nfs_debug
echo 65535 >/proc/sys/sunrpc/rpc_debug

Is that on the client that you want the message log? Is there anything else that would help? Thanks, Bob

(In reply to comment #37)
> Can you confirm the information I should gather if the problem does
> return?
>
> In comment #24 you mention
> 'tcpdump -s0' and a kernel message log with
> echo 65535 >/proc/sys/sunrpc/nfs_debug
> echo 65535 >/proc/sys/sunrpc/rpc_debug

Yes, that is the information that would be useful.

> Is that on the client that you want the message log? Is there anything
> else that would help?

Yes, on the client. Nothing else that might be useful springs to mind.

Created attachment 310287 [details]
log messages and network trace
OK, it's happened again. Both client and server are running Opensuse 11.1, and I've attached a gzipped tar file containing messages and a network trace.
Just to recap on what happens:
(1) Various things hang, including df. Running 'strace df' reveals which NFS file system it gets stuck on. At this point I collected the debugging info for a minute or two.
(2) I unmounted the file system with 'umount -fl'
(3) An attempt to mount it hung, so I interrupted that and tried 'mount -o udp' which worked.
(4) df now hangs on a different NFS file system so I repeated steps 1 to 3 until df ran to completion.
I hope this is useful. My quick analysis of the network trace suggests there is no NFS-related traffic going on... I don't know what this signifies.
Bob
Thanks for the trace. As you say, no NFS traffic. It seems the sunrpc code is confused about the status of the connection. It knows it needs to open a new connection, but it seems to think that is already happening, despite the fact that it isn't.

Matching the trace to the code, it seems that either XPRT_CLOSING is set, or XPRT_CONNECTING is set. When XPRT_CONNECTING is set, it schedules a task for at most 5 minutes in the future which is guaranteed to clear XPRT_CONNECTING again, so that cannot stay set for very long. The trace you captured was for less than 5 minutes so I cannot be sure that the task didn't fire in that time, but it seems unlikely. So that points to XPRT_CLOSING.

There is an upstream patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=670f94573104b4a25525d3fcdcd6496c678df172
which moves the setting of that flag. I'm not at all certain that this patch will fix the problem, but it is possible, and it is the only credible possibility at the moment. The important part of the patch is moving the "set_bit(XPRT_CLOSING" from the TCP_CLOSE_WAIT case to the TCP_LAST_ACK case of the switch.

Are you in a position to compile your own kernel? Could you apply that patch and try it out? I'll get that patch into the 10.1 kernel (it is already in 11.1) so it will be in any future update, but I don't know when that is likely to be.

Thanks for doing the analysis. I should be able to compile a kernel, though it is quite a few years since I have done one so I would have to do some revision. I'm a bit confused by your last sentence... are there some typos or am I misinterpreting the version numbers? Both client and server were running Opensuse 11.1 when I ran the latest trace.

There were typos, but still confusion.
I meant that I would get the patch in to 11.0, and that it is already
in 11.1.
I just checked and the patch went in to the 11.1 kernel on July 25th,
3 weeks ago. I don't think we've released a kernel since then, so the
best thing to do would probably be to grab the relevant kernel-of-the-day
from
http://ftp.suse.com/pub/projects/kernel/kotd/SLE11_BRANCH/
(openSuSE 11.1 and SLE11 use the same kernel).
Neil: I started following your advice, but then held off when I heard there was a new official kernel in the pipeline to fix the NULL pointer dereference bug. That has been released now and I have started deploying it: could you confirm my impression that it includes the XPRT_CLOSING patch? The change log for kernel-default-base-2.6.27.29-0.1.1 includes:

* Fri Jul 31 2009 gregkh@suse.de
- Update to 2.6.27.29
- lots of bugfixes
* Sat Jul 25 2009 gregkh@suse.de
- Update to 2.6.27.27 and 2.6.27.28:
- bugfixes

so I am guessing that one of those includes the patch. I'll keep my fingers crossed that it either fixes the problem or provides more information. Thanks, Bob

Yes, the fix was in 2.6.27.28 so you do have the XPRT_CLOSING fix. NeilBrown

Created attachment 317175 [details]
log messages and tcpdump trace
Unfortunately the problem happened again, this time with the latest 11.1 kernel:
# uname -a
Linux ribosome 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200 x86_64 x86_64 x86_64 GNU/Linux
I have attached (as a gzipped tar file):
/var/log/warn
/var/log/messages
a binary tcpdump file
Hope this is useful,
Bob
Thanks... it may well be useful. Based on that trace I'm thinking that upstream commit
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=01d37c428ae080563c0a3bb8bdfa88c65a6891d3
might be the fix. I'll try to reproduce the problem, and see if that fixes it.

The current reports seem to be very different from those which started this bug off. I wonder if we are now looking at a different bug that has a similar symptom... Not that it really matters, I guess.

I'm not having a lot of luck. I'm no longer thinking of that patch... I compared the trace with the kernel code and I'm fairly sure that the trace is only possible if either XPRT_CLOSING is set, or XPRT_CONNECTING is set.

XPRT_CLOSING should only be set if the TCP connection is in one of the states:
FIN_WAIT_1 FIN_WAIT_2 CLOSING TIME_WAIT LAST_ACK
It is very unlikely the connection would stay in those states for more than a couple of minutes providing the server was up and responding. If it happens again you could check whether the TCP connection is still present, and what its state is, with

netstat -nt | grep 2049

Setting of XPRT_CONNECTING will be accompanied by a debug message mentioning xs_connect; then it will be cleared again within at most 5 minutes, which will result in a "worker connecting ..." message. We don't see either of those messages over a period longer than 5 minutes. So my only hope is that "netstat -nt" might show something useful if the problem happens again (assuming you are willing to try running with TCP for a while again....)

So: setting to "NEEDINFO" for "Can I get 'netstat -nt' output while the problem is happening".

I do still have some machines running with NFS over TCP, so if it happens again I will provide the extra information. Bob

As we unfortunately seem to be getting nowhere with this I'm going to close it as NORESPONSE (which sounds a bit harsh, as you have responded as best you can... but it still seems the best fit), and hope that it is fixed in 11.2, which is now out.... Thanks for your efforts and apologies for not finding the problem.