Bug 578646 - NFS gets disrupted when transferring files
Status: VERIFIED NORESPONSE
Alias: None
Product: openSUSE 11.2
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: i686 openSUSE 11.2
Importance: P2 - High : Major with 5 votes
Target Milestone: ---
Assignee: Brandon Philips
QA Contact: E-mail List
Reported: 2010-02-10 10:19 UTC by Gruber Wolfgang
Modified: 2010-09-29 14:43 UTC

Description Gruber Wolfgang 2010-02-10 10:19:01 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; de; rv:1.9.1.7) Gecko/20091222 SUSE/3.5.7-1.1.1 Firefox/3.5.7

Nearly every time I copy a file over NFS on my Asus eeePC 1005HA, I get this message in /var/log/messages:

kernel: RPC: multiple fragments per record not supported

After this message, no NFS transfer is possible until I restart the computer.

Kernel is kernel-default 2.6.31.12

Reproducible: Always

Comment 1 Jeff Mahoney 2010-02-10 15:07:30 UTC
I ran into something similar, where NFS would hang, but I updated recently to 11.3 M1 and lost the logs.
Comment 2 Neil Brown 2010-02-10 23:19:28 UTC
NFS is a packet oriented protocol.  When these packets are sent over
a TCP connection (which is stream oriented) each packet is prefixed with
a short header which gives the length of the packet.

The header contains a flag which says that the current packet is only
part of the NFS packet and the receiver should gather packets until it
receives one with the flag clear.

No known NFS client ever sets this flag.  They all send the whole NFS
request in a single RPC packet.

The message you are getting "... multiple fragments per record not supported"
means that the NFS server has received an RPC packet which has this bit set.

The most likely explanation for this is that the stream has been corrupted
somehow and the bytes that are being interpreted as a header with the flag
set are meant to be something else entirely.  There was a client bug some time
ago that could cause that, but I believe it has been fixed.
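
To make the framing concrete, here is a minimal sketch of how such a header can be decoded. It is not the kernel's code; the function name is made up for illustration. It follows the RPC record-marking layout in RFC 5531: a 4-byte big-endian word whose top bit marks the last fragment of a record and whose low 31 bits give the fragment length. In that layout the bit is set on the final fragment, so a header with the bit clear is what announces that more fragments follow.

```python
import struct

LAST_FRAGMENT = 0x80000000  # top bit: this is the final fragment of the record
LENGTH_MASK = 0x7FFFFFFF    # low 31 bits: fragment length in bytes


def decode_record_marker(header: bytes):
    """Decode a 4-byte RPC record marker into (is_last, length).

    Hypothetical helper for illustration; not taken from the kernel.
    """
    if len(header) != 4:
        raise ValueError("record marker must be exactly 4 bytes")
    (word,) = struct.unpack(">I", header)
    return bool(word & LAST_FRAGMENT), word & LENGTH_MASK


# A well-formed marker: last fragment, followed by 0x74 bytes of RPC data.
print(decode_record_marker(bytes.fromhex("80000074")))

# Four arbitrary payload bytes misread as a marker: the "last fragment"
# bit is clear and the length is implausibly large.
print(decode_record_marker(b"data"))
```

If the byte stream slips out of alignment, the receiver ends up decoding arbitrary data this way, which is one way a server could come to believe it received a multi-fragment record.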

I assume that the kernel version you gave (2.6.31.12) is the kernel that is
running on the NFS server.
Please also report the OS and kernel version that is running on your eeePC.

Also if it is possible to get a tcpdump trace when the error occurs that could
be helpful.
On the server:
   tcpdump -s 0 -w /tmp/tcpdump host address-of-client

and let that run while you copy a file on your eeepc.
Comment 3 Gruber Wolfgang 2010-02-12 18:30:14 UTC
Thank you for helping. This bug causes really big problems for me, because I manage nearly everything over NFS.



No, kernel 2.6.31.12 is on the client (eeePC). I assumed the error must be on the client, because 5 other clients work with this NFS server. The server has kernel 2.6.27.42.

I first started the command:

# tcpdump -s 0 -w /tmp/tcpdump host 192.168.0.44 | tee /home/user/Documents/Tmp/tcpdump.log

Then I copied 3 files, and after the second file I again got the error

kernel: RPC: fragment too large: 0x3f525d39

in the server logs.

But the file /home/user/Documents/Tmp/tcpdump.log only shows these entries:

tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
0 packets captured
1 packets received by filter
0 packets dropped by kernel

Is there something wrong with the command?

Thanks!
Comment 4 Neil Brown 2010-02-17 00:57:19 UTC
The command looks right, however if the host that you ran the command
on has multiple network interfaces, and if the NFS traffic would go
over an interface other than 'eth0', then it won't have collected anything
useful.
The captured traffic goes in the file '/tmp/tcpdump', you wouldn't get
much going to stdout, but you would expect a larger number than
   '0 packets captured'

Maybe you need to specify the interface with "-i"
e.g.
   tcpdump -s 0 -i wlan0 -w /tmp/tcpdump host 192.168.0.44
Comment 5 Gruber Wolfgang 2010-02-17 21:15:34 UTC
> Maybe you need to specify the interface with "-i"

My fault :-/ Thanks, it was the wrong interface. Now it worked. I copied one big file (in init 3, with cp) and suddenly the error appeared and NFS was not working any more:

http://dl.dropbox.com/u/2393448/tcpdump

I also think that I did not make the following clear: NFS is only disrupted on this client. Other clients can access the NFS on the server without problems.

I also wanted to ask: could this be a hardware problem with my client's network card?
Comment 6 Neil Brown 2010-02-17 22:45:36 UTC
Thanks for the tcpdump trace - it is very helpful.

Everything looks fine up to packet 21.  Then it goes horribly wrong.

Packet 21 should be an NFS write request, or at least the beginning of one.
It appears that the 'wsize' is 128K so the whole write request would be slightly
more than 128K in length, so several packets.

The first 0x42 bytes of packet 21 are the TCP/IP headers exactly as you would
expect.  After that should come the RPC header, then NFS header, then WRITE
data.
However instead, the second 0x42 bytes are an exact duplicate of the first 0x42
bytes.  After that I can see the correct RPC header - only it is at the wrong place.
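
The symptom described above can be checked mechanically. The following is only a sketch on synthetic bytes, not on the real capture: the function name and the test frames are made up, and the header length 0x42 is simply the value quoted in this comment. It tests whether the bytes that follow the headers are an exact copy of the headers.

```python
HEADER_LEN = 0x42  # length of the leading headers in this trace, per the comment


def has_duplicated_header(frame: bytes, header_len: int = HEADER_LEN) -> bool:
    """Return True if the bytes right after the headers repeat the headers.

    Hypothetical helper for illustration only.
    """
    if len(frame) < 2 * header_len:
        return False
    return frame[:header_len] == frame[header_len:2 * header_len]


# Synthetic frames (made-up bytes, not taken from the attached tcpdump file).
headers = bytes(range(HEADER_LEN))
rpc = b"\x80\x00\x00\x74" + b"rpc-call"
good_frame = headers + rpc + bytes(HEADER_LEN)  # headers, then RPC data
bad_frame = headers + headers + rpc             # headers twice, RPC data shifted

print(has_duplicated_header(good_frame), has_duplicated_header(bad_frame))
```

In the bad case the RPC header is still present, but 0x42 bytes later than the receiver expects it, which matches the "correct RPC header at the wrong place" observation.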

So something is duplicating the IP/RPC headers.  I think it is very
likely that this is related to the particular network card,
either a hardware fault in the card or an error in the driver.
Also, I think it is very likely to be related to some aspect of 'offload'.
Probably TCP segmentation offload.

Could you please use "ethtool --show-offload" to see what offload features
are enabled, then use e.g. "ethtool --offload tso off" to disable any
offload features and then see if the error recurs.

If that does remove the NFS errors, then you can either accept that as a
work-around, or refile this bug against the driver for the particular
hardware.
Comment 7 Gruber Wolfgang 2010-02-18 10:00:18 UTC
I hope I did not do something stupid again, but I got no result from the command:

# ethtool --show-offload
Offload parameters for --show-offload:
Cannot get device rx csum settings: No such device
Cannot get device tx csum settings: No such device
Cannot get device scatter-gather settings: No such device
Cannot get device tcp segmentation offload settings: No such device
Cannot get device udp large send offload settings: No such device
Cannot get device generic segmentation offload settings: No such device
Cannot get device flags: No such device

I did this on the client. Is this correct? To be sure, I also tried it on the server, but the result was the same:

# ethtool --show-offload
Offload parameters for --show-offload:
Cannot get device rx csum settings: No such device
Cannot get device tx csum settings: No such device
Cannot get device scatter-gather settings: No such device
Cannot get device tcp segmentation offload settings: No such device
Cannot get device udp large send offload settings: No such device
Cannot get device generic segmentation offload settings: No such device
Cannot get device flags: No such device
no offload info available
Comment 8 Neil Brown 2010-02-18 10:49:30 UTC
"man ethtool" for the exact usage.
I think you need to give the name of the interface.
e.g.  ethtool --show-offload eth0
Comment 9 Gruber Wolfgang 2010-02-18 11:14:52 UTC
Arg, ok, thanks. Here is the output from the client:

# ethtool --show-offload eth0
Offload parameters for eth0:
Cannot get device rx csum settings: Operation not supported
Cannot get device flags: Operation not supported
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off

Which of these options should I deactivate, and how? Thanks for your help!
Comment 10 Neil Brown 2010-02-19 09:01:17 UTC
First try turning everything off that is on.

ethtool --offload eth0 tx off sg off tso off gso off lro off


Then if that fixes the problem, you might like to try turning them
back on one by one until the problem returns.
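
The procedure above (everything off first, then back on one at a time) can be sketched as a small script that only builds the command lines and does not run them. This is an illustration under assumptions: the helper names are made up, the sample text is the client output from comment 9, and the mapping covers only the short keywords used in the command above.

```python
# Map "ethtool --show-offload" labels to the short keywords of "ethtool --offload".
OFFLOAD_KEYS = {
    "tx-checksumming": "tx",
    "scatter-gather": "sg",
    "tcp segmentation offload": "tso",
    "generic segmentation offload": "gso",
    "large receive offload": "lro",
}

SAMPLE = """Offload parameters for eth0:
rx-checksumming: off
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: on
large receive offload: off
"""


def enabled_offloads(show_offload_output):
    """Return the short keywords of the known features reported as 'on'."""
    enabled = []
    for line in show_offload_output.splitlines():
        name, sep, value = line.partition(":")
        key = OFFLOAD_KEYS.get(name.strip())
        if sep and key and value.strip() == "on":
            enabled.append(key)
    return enabled


def bisect_commands(dev, keys):
    """Build one 'all off' command, then one 're-enable' command per feature."""
    if not keys:
        return []
    all_off = " ".join(key + " off" for key in keys)
    commands = ["ethtool --offload {} {}".format(dev, all_off)]
    commands += ["ethtool --offload {} {} on".format(dev, key) for key in keys]
    return commands


for command in bisect_commands("eth0", enabled_offloads(SAMPLE)):
    print(command)
```

The idea is to run the first command, repeat the NFS copy, and then apply the remaining commands one at a time with a copy test after each, so that the feature that brings the error back can be identified.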
Comment 11 Gruber Wolfgang 2010-02-19 10:39:14 UTC
Thank you, but this is a never-ending story :-/ For every setting I get this answer:

# ethtool --offload eth0 tso off
Cannot set device tcp segmentation offload settings: Operation not supported
Comment 12 Neil Brown 2010-02-21 22:55:28 UTC
I think we've just about exhausted my expertise here.  I'll need to
find someone who knows about network cards.
Please report details of your network hardware.

e.g.
  ethtool -i eth2
  lspci

then we'll try to re-assign to someone who knows about that stuff.
Comment 13 Gruber Wolfgang 2010-02-22 08:10:55 UTC
Thank you very much for trying everything. This is a big problem for me, because at the moment this computer is useless to me.

I thought it would be a good idea to check whether it is really eth0 that causes the problem. So I copied a big file over wlan0 and everything worked. So the problem really does seem to be eth0.

# ethtool -i eth0
driver: atl1c
version: 1.0.0.1-NAPI
firmware-version: N/A
bus-info: 0000:01:00.0

# lspci
01:00.0 Ethernet controller: Attansic Technology Corp. Atheros AR8132 / L1c Gigabit Ethernet Adapter (rev c0)
Comment 14 Neil Brown 2010-02-22 10:16:26 UTC
I am reassigning this bug to the default assignee for 'kernel' so
it can be reassigned to someone who knows about the driver for
Atheros Gigabit Ethernet Adapter (see previous comment).

There is strong evidence that this controller is sending NFS/TCP
packets badly, possibly because TSO is not working correctly.
Comment 15 Gruber Wolfgang 2010-03-09 07:27:37 UTC
Did something go wrong with reassigning the bug? I ask because this is causing really big problems for me.
Comment 16 Brandon Philips 2010-03-10 01:41:50 UTC
Can you try the kernel of the day to see if the issue is fixed upstream already?

1) Add and enable the following URL in your repository list using YaST: http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_Factory/
2) Install the latest kernel-default package.

Can you also attach the output of hwinfo?
Comment 17 Gruber Wolfgang 2010-03-10 11:33:01 UTC
Cool, thank you. With 2.6.33-29-default it works!

If you need it, here is the hwinfo output:

60: None 00.0: 10701 Ethernet
  [Created at net.124]
  Unique ID: usDW.ndpeucax6V1
  Parent ID: rBUF.Qk9ZRmN_Ab8
  SysFS ID: /class/net/eth0
  SysFS Device Link: /devices/pci0000:00/0000:00:1c.3/0000:01:00.0
  Hardware Class: network interface
  Model: "Ethernet network interface"
  Driver: "atl1c"
  Driver Modules: "atl1c"
  Device File: eth0
  HW Address: 90:e6:ba:6b:28:ed
  Link detected: yes
  Config Status: cfg=no, avail=yes, need=no, active=unknown
  Attached to: #28 (Ethernet controller)


In case others switch to this kernel: after the update I was not able to load eeepc_laptop. To load it, you must pass the option acpi_osi=Linux in GRUB at startup.
Comment 18 Brandon Philips 2010-03-10 22:03:17 UTC
I am guessing this patch is what fixed the issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=678b77e265f6d66f1e68f3d095841c44ba5ab112

I am building a kernel for you to test that includes this fix. I will post the RPMs once they are built.

Thanks for testing.
Comment 19 Brandon Philips 2010-03-10 23:34:52 UTC
Can you please test the kernel-default rpm for your platform:
 http://beta.suse.com/private/bphilips//578646/
Comment 20 Gruber Wolfgang 2010-03-19 17:00:25 UTC
Can you help me? How can I install the package? My way didn't work:

# rpm -i kernel-default-2.6.31.12-bnc578646.0.i586.rpm
        package kernel-default-2.6.33-29.1.i586 (which is newer than kernel-default-2.6.31.12-bnc578646.0.i586) is already installed
Comment 21 Pratap B S 2010-03-24 06:46:49 UTC
Hi Brandon,

One of our issues, https://bugzilla.novell.com/show_bug.cgi?id=589071, needs a similar fix. So instead of openSUSE, can we get a SLES 11 style kernel package that includes this fix?
Comment 22 Brandon Philips 2010-03-24 23:13:55 UTC
(In reply to comment #20)
> Can you help me? How can i install the package. My way didn't work:
> 
> # rpm -i kernel-default-2.6.31.12-bnc578646.0.i586.rpm
>         package kernel-default-2.6.33-29.1.i586 (which is newer than
> kernel-default-2.6.31.12-bnc578646.0.i586) is already installed

rpm -i --force kernel-default-2.6.31.12-bnc578646.0.i586.rpm

Sorry for the delay. Missed this message.
Comment 23 Brandon Philips 2010-09-29 14:43:38 UTC
Closing this bug as NORESPONSE, as Gruber didn't test the kernel in comment #22.