Bug 1160035

Summary: NFS-Client: Different behavior Leap 15.1 and Tumbleweed (Tumbleweed with issues)
Product: [openSUSE] openSUSE Tumbleweed Reporter: Sebastian Kuhne <sebastian.kuhne>
Component: Network        Assignee: Neil Brown <nfbrown>
Status: RESOLVED DUPLICATE QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: alynx.zhou, mkubecek, richard.martin, Sauerlandlinux
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
See Also: http://bugzilla.opensuse.org/show_bug.cgi?id=1006815
https://bugzilla.opensuse.org/show_bug.cgi?id=1060159
https://bugzilla.opensuse.org/show_bug.cgi?id=1151044
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Bug1160035_Sebastian.tar.gz
dmesg
mount -v
Results_Feb-15_1100am
nfs.pcap_10SecOnly
dmesg_OnServerLeap151_CalledFromClientTW_2020_02_18_0637am
ip6tables_OnServerLeap151_withoutFirewall_20200301
ethtool_OnServerLeap151_withoutFirewall_20200301
ip6tables_OnClientTW_withoutFirewall_20200301
ethtool_OnClientTW_withoutFirewall_20200301
dmesg_OnServerLeap151_CalledFromClientTW_withoutFirewall_20200301

Description Sebastian Kuhne 2020-01-03 07:03:34 UTC
Please also refer to (in German)
https://www.opensuse-forum.de/thread/43634-nfs-client-unterschiedliches-verhalten-leap-15-1-und-tumbleweed-tumbleweed-mit-p/

Potential related issues:
1006815
1060159
1151044

My setup:
- avahi installed and active on the server and all clients
- NFS server and client installed on all computers
- Firewalls configured all the same
- all systems up-to-date


NFS Server:
  openSUSE Leap 15.1
  Computername: linux-multimedia (linux-multimedia.local)
  User-name: multimedia
  NFS-Export: /home/multimedia

NFS-Client I + II
  NFS client I:
    - openSUSE Leap 15.1
    - /etc/fstab: linux-multimedia.local:/home/multimedia /mnt/linux-multimedia/home/multimedia nfs users,noauto,nfsvers=4 0 0
    ==> NFS directory is mounted without any issues :)

NFS client II:
   - openSUSE Tumbleweed
   - /etc/fstab: linux-multimedia.local:/home/multimedia /mnt/linux-multimedia/home/multimedia nfs users,noauto,nfsvers=4 0 0
   - showmount -e linux-multimedia.local results in /home/multimedia *
   ==> NFS directory NOT mounted


Since NFS client I (Leap 15.1) runs without any issues, I assume that the NFS server (Leap 15.1) is working and that its configuration is all right (incl. the firewall). I also assume that NFS client I (Leap 15.1) is configured correctly and that I basically know what I am doing.

The NFS-Client II (with Tumbleweed) is configured the same way, but mounting is problematic.

One more thing: during the evaluation in the openSUSE forum, it happened twice that the share was mounted correctly on client II (with Tumbleweed).
I have NO IDEA what the reason was - this behavior is completely unreproducible. That should be a big concern, since it looks like a race condition.
Comment 1 Sebastian Kuhne 2020-01-03 11:14:09 UTC
To add:

1) The error message in Dolphin (at mount time) is:
"Beim Zugriff auf „home/multimedia auf linux-multimedia.local“ ist ein Fehler aufgetreten, die Meldung lautet: mount.nfs: mount system call failed"
(English: "An error occurred while accessing 'home/multimedia on linux-multimedia.local'; the message was: mount.nfs: mount system call failed")

2) The error message in YaST - NFS client, after confirming and writing the NFS configuration:
"Die NFS-Verzeichnisse in /etc/fstab konnten nicht eingehängt werden."
(English: "The NFS directories in /etc/fstab could not be mounted.")
Comment 2 Richard Martin 2020-02-10 20:04:13 UTC
Are you still working on it, or is it already solved and I didn't notice it?

I can't mount the NFS share from the Synology via the entry in /etc/fstab. I get the error as described. Mounting from the terminal works fine.

/etc/fstab:
FS-Ritchie.fritz.box:/volume1/Daten /home/NFS-Data nfs auto,rw,sync,soft,user,intr,tcp  0  0

Output from journalctl:
Feb 10 21:00:06 richardlx systemd[1]: Condition check resulted in RPC security service for NFS server being skipped.
Feb 10 21:00:06 richardlx kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Feb 10 21:00:06 richardlx systemd[1]: Condition check resulted in RPC security service for NFS client and server being skipped.
Feb 10 21:00:06 richardlx systemd[1]: Reached target NFS client services.
Feb 10 21:00:07 richardlx systemd[1]: Starting Notify NFS peers of a restart...
Feb 10 21:00:07 richardlx systemd[1]: Started Notify NFS peers of a restart.
Feb 10 21:00:07 richardlx systemd[1]: Mounting /home/NFS-Data...
Feb 10 21:00:07 richardlx systemd[1]: home-NFS\x2dData.mount: Mount process exited, code=exited, status=32/n/a
Feb 10 21:00:07 richardlx systemd[1]: home-NFS\x2dData.mount: Failed with result 'exit-code'.
Feb 10 21:00:07 richardlx systemd[1]: Failed to mount /home/NFS-Data.
Feb 10 21:00:11 richardlx.fritz.box kernel: NFS: Registering the id_resolver key type

Manual mount works fine:
sudo mount -t nfs -o soft fs-ritchie.fritz.box:/volume1 /home/NFS-Data/

Output from journalctl:
Feb 10 21:02:08 richardlx.fritz.box sudo[3871]:  richard : TTY=pts/0 ; PWD=/home/richard ; USER=root ; COMMAND=/usr/bin/mount -t nfs -o soft fs-ritchie.fritz.box:/volume1 /home/NFS-Data/
Comment 3 Neil Brown 2020-02-10 20:39:48 UTC
It isn't clear to me what the problem is.

You say the mount options contain

  users,noauto

which means that the filesystem should *not* be mounted automatically, but that any user (including non-root users) is allowed to mount it manually.

You also say:
   Manual mount works fine:

So what is the problem?
Comment 4 Sebastian Kuhne 2020-02-11 06:36:44 UTC
(In reply to Neil Brown from comment #3)

Neil, Richard's report is not what I originally reported. My issue still persists, and I was hoping that my request had been acknowledged.

I am happy to answer all your questions. Again, I first tried to solve this with the help of the forum, unfortunately without success. Now I am here and hope for a solution.
Comment 5 Richard Martin 2020-02-11 15:16:49 UTC
(Also in reply to Neil Brown from comment #3)

It doesn't matter. I have a second device, same OS version, with this entry in /etc/fstab:
fs-ritchie.fritz.box:/volume1/Daten /home/NFS-Data nfs auto,rw,sync,soft,intr,user,tcp  0  0

Also when I reconfigure the connection with defaults it won't work.
/etc/fstab:
FS-Ritchie.fritz.box:/volume1/Daten        /home/NFS-Data          nfs    defaults                      0  0

It worked fine until the update to Tumbleweed 20190909 (https://bugzilla.opensuse.org/show_bug.cgi?id=1150807).

It's annoying to mount the share manually every time the devices restart. And the user needs to know the root password (a very bad idea...), or I have to write a script and configure sudo for every user using it.
Comment 6 Neil Brown 2020-02-11 22:10:57 UTC
Hi Sebastian,
 Sorry, I didn't notice that the comments were from different people.

Still, you say that the mount options contain "noauto", so the filesystem should not be mounted automatically.
Can you mount it manually?  If not, what error do you get?
So I repeat my question: what exactly is the problem?
Comment 7 Neil Brown 2020-02-11 22:14:16 UTC
Richard:
> It doesn't matter.
 What doesn't matter?

> Also when I reconfigure the connection with defaults it won't work.

What won't work?
Really, you need to be specific and precise; I cannot guess what you are thinking.

If you've removed the "noauto" option and the filesystem doesn't mount automatically at boot, then I need to see the journalctl logs, and the output of
  systemctl status 'home-NFS\x2dData.mount'
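As an aside, the exact unit name matters here: systemd derives it by escaping the mount path, turning each '/' into '-' and each literal '-' into \x2d, which is why the journal above shows home-NFS\x2dData.mount. A small sketch of computing that name (the sed pipeline mimics what `systemd-escape --path --suffix=mount /home/NFS-Data` would print):

```shell
# Derive the systemd mount unit name for /home/NFS-Data:
# 1. escape each literal '-' as \x2d, 2. turn '/' separators into '-'.
unit="$(printf '%s' 'home/NFS-Data' | sed -e 's/-/\\x2d/g' -e 's,/,-,g').mount"
printf '%s\n' "$unit"    # home-NFS\x2dData.mount
```

Quote the name when passing it to systemctl or journalctl (e.g. systemctl status 'home-NFS\x2dData.mount') so the shell does not swallow the backslash.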
Comment 8 Sebastian Kuhne 2020-02-12 05:33:07 UTC
(In reply to Neil Brown from comment #6)

Hi Neil,

no problem at all. However, the noauto issue is not *MY* problem - that is Richard's issue. My problem is described at the very beginning of the thread: I can't mount the Leap 15.1 server's export on a TW NFS client.

I have no problem mounting a Leap 15.1 client against the same Leap 15.1 server, so I assume there is a bug here.
Please refer to the beginning of the thread, and to the openSUSE forum link (which is in German, but the outputs of all commands are there, too).

I really need help, since I assume we have a race condition - sometimes the connection works, for whatever reason! Again, the issue is between TW and Leap - it works perfectly between Leap and Leap.
Comment 9 Neil Brown 2020-02-12 06:06:36 UTC
Hi Sebastian.
 I cannot help you unless you tell me what the problem is - and you haven't.

I cannot read German, and seeing the log output doesn't help me unless I know what you are trying to do.

The /etc/fstab entries in the initial description that you wrote mention "noauto", so I have to assume that you do not expect the filesystems to be mounted at boot.

You write:
> NFS client II:
>  ....
>   ==> NFS directory NOT mounted

So I assume that the directory is not mounted, but I don't know why you think that it should be mounted.  Did you try to mount it?  If so, what command did you use, and what error messages did you get?

You also write

>The NFS-Client II (with Tumbleweed) is configured the same way, but mounting is problematic.

but you don't tell me what the problem is.
Please Please Please be explicit.
Comment 10 Sebastian Kuhne 2020-02-12 07:09:49 UTC
(In reply to Neil Brown from comment #9)

Hi Neil,

really sorry for not being specific enough.

The following happens:
- NFS server Leap running
- NFS client TW running

The noauto option is well known to me, and I do not expect an automount.
BUT, manual mounting doesn't work either:
- in Dolphin: click on the external device. Nothing happens for some minutes, then the error message appears: "mount.nfs: mount system call failed"
- sudo mount -a: the command hangs; no message, nothing else
- in YaST --> NFS Client: after the setup, the following message appears: "Die NFS-Verzeichnisse in /etc/fstab konnten nicht eingehängt werden." (The NFS directories in /etc/fstab could not be mounted.)

HOWEVER, sometimes it works, but mostly it doesn't. That's why I assume that my basic configuration is all right, and that we have a bug here. Also, another Leap client always connects to the same server without any issues.

I really hope that I am specific enough now.
Comment 11 Richard Martin 2020-02-12 07:46:43 UTC
Ok - from scratch:

I run openSUSE Tumbleweed on 4 machines. They connect to a Synology via NFS during startup. All went well until the Tumbleweed update 20190909 was released. After that update, mounting during startup via the entry in /etc/fstab stopped working on these machines, and users have to mount the share manually to get access to the data.

From my perspective NFS basically works, because after logging on to these computers I can mount the share manually.

Actual state during boot extracted from journalctl:
Feb 10 21:00:06 richardlx systemd[1]: Condition check resulted in RPC security service for NFS server being skipped.
Feb 10 21:00:06 richardlx kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Feb 10 21:00:06 richardlx systemd[1]: Condition check resulted in RPC security service for NFS client and server being skipped.
Feb 10 21:00:06 richardlx systemd[1]: Reached target NFS client services.
Feb 10 21:00:07 richardlx systemd[1]: Starting Notify NFS peers of a restart...
Feb 10 21:00:07 richardlx systemd[1]: Started Notify NFS peers of a restart.
Feb 10 21:00:07 richardlx systemd[1]: Mounting /home/NFS-Data...
Feb 10 21:00:07 richardlx systemd[1]: home-NFS\x2dData.mount: Mount process exited, code=exited, status=32/n/a
Feb 10 21:00:07 richardlx systemd[1]: home-NFS\x2dData.mount: Failed with result 'exit-code'.
Feb 10 21:00:07 richardlx systemd[1]: Failed to mount /home/NFS-Data.
Feb 10 21:00:11 richardlx.fritz.box kernel: NFS: Registering the id_resolver key type

Actual entries in /etc/fstab:
FS-Ritchie.fritz.box:/volume1/Daten     /home/NFS-Data    nfs    auto,rw,sync,soft,user,intr,tcp    0  0

Command used to mount it manually:
sudo mount -t nfs -o soft fs-ritchie.fritz.box:/volume1 /home/NFS-Data/

I also tried to recreate the entry in /etc/fstab via YaST with defaults. The result is the same, and the share isn't mounted during startup.

There is no 'noauto' entry! The statement 'It doesn't matter' referred to the supposedly different entries in fstab. I've checked them, and they are identical on all machines.

I'm not so familiar with Bugzilla, so if I should create a new entry rather than extending an existing one, please let me know. Also tell me if I should provide more data.

Richard
Comment 12 Neil Brown 2020-02-12 08:55:06 UTC
Thanks to both of you for the extra details - it really helps a lot.

Firstly, please create /etc/nfs.conf.local as an empty file - if you haven't already done so.  This is extremely unlikely to fix the bug, but will silence some annoying warnings (which claim to be errors).

Secondly, I'll need some lower level tracing to work out what is happening.
If I understand correctly, Richard's experience is that it only fails during boot, while Sebastian reports that it sometimes fails for manual mounting.  In that case it is most likely Sebastian who will be able to get useful results.

What I would like you to do is:

 rpcdebug -m rpc -s all
 tcpdump -i INTERFACENAME -s 0 -w /tmp/nfs.pcap port 2049 &
 ATTEMPT TO MOUNT THE FILESYSTEM
 rpcdebug -m rpc -c all
 killall tcpdump
 dmesg > /tmp/dmesg

Then compress and attach /tmp/nfs.pcap and /tmp/dmesg

For INTERFACENAME you need to provide the name of the network interface that is used to talk to the server.  If you only have one, then leaving out "-i INTERFACENAME" will probably cause it to use the correct one as a default.

I need data from a mount attempt that fails.
Comment 13 Richard Martin 2020-02-12 20:29:30 UTC
How can I run these commands during boot?
Comment 14 Sebastian Kuhne 2020-02-12 20:48:42 UTC
(In reply to Neil Brown from comment #12)

Hi Neil,

attached the two files as requested. Many thanks for taking care.

Best regards
Sebastian
Comment 15 Sebastian Kuhne 2020-02-12 20:49:50 UTC
Created attachment 830021 [details]
Bug1160035_Sebastian.tar.gz
Comment 16 Neil Brown 2020-02-13 04:37:28 UTC
Richard:  It would be quite difficult to run them at the right time during boot. So if Sebastian can get useful results, I plan to focus there first.

Sebastian - the nfs.pcap file is empty (24 bytes of header), so no traffic was captured - which is strange, but not impossible.
Also, the dmesg file contains no rpc debugging.  The string "RPC:" should appear a lot, but there are only 4 standard messages during boot.

Maybe NFS isn't even trying to send any RPC traffic.
Please try again with

  rpcdebug -m nfs -s all
  rpcdebug -m rpc -s all

then try to mount the NFS filesystem, then

   rpcdebug -m nfs -c all
   rpcdebug -m rpc -c all

and collect the output of dmesg.

Also, when you try to mount the filesystem, give the "-v" option to mount, and report the output.  You did include that in the original opensuse-forum.de thread, which was useful, but I'd like to be sure I see the -v output that matches the rpcdebug output.
Comment 17 Sebastian Kuhne 2020-02-13 20:29:24 UTC
(In reply to Neil Brown from comment #16)

Hi Neil,
hope this is what you need. See attached.
Comment 18 Sebastian Kuhne 2020-02-13 20:30:01 UTC
Created attachment 830085 [details]
dmesg
Comment 19 Sebastian Kuhne 2020-02-13 20:30:25 UTC
Created attachment 830086 [details]
mount -v
Comment 20 Neil Brown 2020-02-13 22:12:40 UTC
That's certainly more useful, thanks.  Though now that we have the RPC: messages, I really need the tcpdump trace that goes with it.

The dmesg shows the NFS client attempting to connect to the server and very quickly failing.  To find the cause of the failure, it might help to see the ICMP messages.
So if you could run the experiment while
  tcpdump -i INTERFACENAME -s 0 -w /tmp/nfs.pcap port 2049 or icmp &

is running, and then provide /tmp/nfs.pcap, that might help.
Also collect the rpcdebug info at the same time.

Also, when I said "give the "-v" option to mount" I meant that when you issue the mount command to mount the filesystem, give the -v option as well. So:

 mount -v -o options server:/path /path

Something like that.
Comment 21 Sebastian Kuhne 2020-02-15 10:18:58 UTC
(In reply to Neil Brown from comment #20)

Hi Neil,

attached the tar with some more outputs as requested.
I have two observations about the mount command (see attached). The file "mount-command" contains two sessions. Here are my thoughts:
- you see NFS version 4.2, but I have forced NFS 4.0 via YaST (see image)
- you see two sessions (try 1 and try 2); both start with IPv6 and then switch to IPv4. Immediately after IPv4 became active, the connection was established and NFS access was possible.

So we may have two issues: NFS version 4.2 vs. 4.0, and IPv6 vs. IPv4.
On both the server and the client, IPv6 is activated via YaST.

Hope this helps.

Comment 22 Sebastian Kuhne 2020-02-15 10:21:26 UTC
Created attachment 830182 [details]
Results_Feb-15_1100am
Comment 23 Neil Brown 2020-02-16 22:29:22 UTC
Thanks for the new data.

The screenshot of YaST shows you setting the NFS version as "nfsvers=4".
This does *not* mean 4.0; it means the highest 4.x which is available.
If you really want 4.0, you need to use "nfsvers=4.0"
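For example, Sebastian's fstab entry from the description, pinned to 4.0 (only the nfsvers value differs from the original line):

```
linux-multimedia.local:/home/multimedia /mnt/linux-multimedia/home/multimedia nfs users,noauto,nfsvers=4.0 0 0
```

With plain "nfsvers=4" the client negotiates the highest 4.x minor version the server offers, which matches the 4.2 seen in the mount -v output.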

As you say, the "mount -v" output does strongly suggest that IPv6 isn't working and IPv4 is.  I wonder if this could be a firewall problem.
Can you try disabling any firewall on either client or server and try again?

It still might help to get a network trace taken during a failed mount attempt.

tcpdump -s 0 -w /tmp/nfs.pcap port 2049 or icmp &


Richard: are you using IPv6 at all?  Is it possible that IPv6 attempts fail for you, but IPv4 attempt work?
Comment 24 Richard Martin 2020-02-16 22:42:23 UTC
(In reply to Neil Brown from comment #23)

IPv6 is activated on the clients, yes. For the past two days it has worked, after the updates that came with zypper dup... I wonder why. But the mountpoint behaviour is different. On the local system there is a /home/NFS-Data folder, and this folder is used for mounting via /etc/fstab.

Previous behaviour:
/etc/fstab
FS-Ritchie.fritz.box:/volume1/Daten        /home/NFS-Data     nfs    auto,rw,sync,soft,user,intr,tcp  0  0  

Data available here: /home/NFS-Data/Daten

Now the data is directly in /home/NFS-Data

I had to change the NFS config to keep some software working without changing its config. New NFS settings:
Mountpoint new: /home/NFS-Data/Daten
Changed /etc/fstab:
FS-Ritchie.fritz.box:/volume1/Daten        /home/NFS-Data/Daten     nfs    auto,rw,sync,soft,user,intr,tcp  0  0

Wondering... was there an update?
Comment 25 Neil Brown 2020-02-16 23:12:05 UTC
> Wondering... was there an update?

The last update to nfs-utils would have been Dec 2019.  I don't follow exactly when things enter Tumbleweed, but it would have been after 24 Nov 2019, and not very much after - maybe 2 or 3 weeks, I guess.

The previous update would have been after 30 Sep 2019.

The kernel gets updated quite regularly, but I don't think there have been any interesting changes in kernel NFS recently.

It is very surprising that your data would have appeared in /home/NFS-Data/Daten on the client, unless it was in /volume1/Daten/Daten on the server.
The new behaviour you describe is definitely the correct behaviour.  I cannot think how that earlier behaviour could possibly happen.

(BTW I notice that you are using the 'soft' mount option.  Our standard recommendation is never to use that.  It can lead to silent data corruption.
A new mount option - softreval - is coming for Linux 5.6, which might be a better choice, depending on what your actual need is)
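Following that recommendation, Richard's entry with 'soft' simply dropped would look like this (hypothetical; 'hard' is the default behaviour, so nothing needs to be added in its place):

```
FS-Ritchie.fritz.box:/volume1/Daten    /home/NFS-Data    nfs    auto,rw,sync,user,intr,tcp    0  0
```

(The 'intr' option is accepted but has been a no-op in the kernel since 2.6.25, so it could be dropped as well.)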
Comment 26 Sebastian Kuhne 2020-02-17 05:55:47 UTC
(In reply to Neil Brown from comment #23)

I have disabled both firewalls, on the server and on the client. No change. The connection gets established after the three-minute timeout, via IPv4.
I am attaching the tcpdump in a second.
Comment 27 Sebastian Kuhne 2020-02-17 06:07:10 UTC
Sorry, the tcpdump is too large for submitting (131 MiB). Any hint what I should look for in the dump file?
Comment 28 Neil Brown 2020-02-17 06:26:28 UTC
Is that 131MiB compressed?  The file should compress quite well.

I really just need to see a single connection attempt.
The client should send a SYN on port 2049, the server replies with SYN-ACK then the client sends ACK.
Possible failure modes might be:
 - no reply at all from the server
 - RST from the server after the SYN, or after the ACK
I'd also like to see any ICMP messages that come back.  They might say port unreachable, host unreachable, or maybe something else.

I probably only need to see a few dozen packets, starting when the client sends SYN on port 2049.
There is a program 'editcap' in the wireshark package which can be used to select a range of packets from a larger pcap.  Maybe that will help.
Comment 29 Sebastian Kuhne 2020-02-17 06:40:25 UTC
(In reply to Neil Brown from comment #28)

I did the exercise for 10 seconds by interrupting the mount command. The dump file is now smaller of course. Please find attached.
Comment 30 Sebastian Kuhne 2020-02-17 06:41:26 UTC
Created attachment 830203 [details]
nfs.pcap_10SecOnly

The size of 130 MiB was compressed still 20 MiB.
Comment 31 Neil Brown 2020-02-18 03:07:51 UTC
Thanks for the shortened log, it was plenty.

The NFS client successfully connects to the server (SYN, SYN/ACK, ACK), then sends an NFSv4 NULL request and gets a TCP ACK, confirming that the server received the request.  The server should respond with a NULL reply, but instead it closes the connection.

There are very few circumstances where a Leap 15.1 server will close the connection in response to a well formed NULL request.  I think I can only find one - memory allocation failure.  And that would affect IPv4 the same as IPv6.

So I need some trace information from the server.

So on the server run
  rpcdebug -m rpc -s all

then try to mount the NFS filesystem from the client using IPv6, so that it fails.
Then on the server again:
  rpcdebug -m rpc -c all
  dmesg > /tmp/dmesg
and attach the (compressed) dmesg.  Hopefully that will tell me something.
I think we are getting closer, but it is still very strange.
Comment 32 Sebastian Kuhne 2020-02-18 06:13:09 UTC
Created attachment 830410 [details]
dmesg_OnServerLeap151_CalledFromClientTW_2020_02_18_0637am

Hi Neil,

attached the dmesg information on the Leap 15.1 server.
Hope this helps.

Best regards
Sebastian
Comment 33 Neil Brown 2020-02-26 23:44:34 UTC
Sorry for the delay - I suddenly got busy :-(

The log shows the server getting error -95: EOPNOTSUPP
This is coming from kernel_sendpage()

I don't know what would be causing that.
I can think of two things to try.

1/ What does
    ip6table -L
  report.  I know you said you disabled the firewall; this will help me confirm that it really is disabled.

2/ What does
    ethtool -k INTERFACE
  report for your network interface.  Give the interface name, not INTERFACE of course.  For me, that is "ethtool -k enp4s0"
 I'm particularly interested in any transmit-offload features that might be on.
It would be worth using "ethtool -K INTERFACE FEATURE off" to turn each one off, then see if that makes a difference.

If that doesn't get you anywhere, I'll have to ask a networking specialist for help.
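A sketch of how that trial could be scripted on the server - a dry run that only prints the ethtool commands to execute one at a time, retrying the IPv6 mount after each (enp3s0 is a placeholder interface name; tx/sg/tso/gso/gro are ethtool's short names for the checksum, scatter-gather, segmentation, and receive-offload features):

```shell
# Print one "ethtool -K ... off" command per offload feature.
# Run each printed command by hand, then retry the IPv6 mount
# before moving on to the next feature.
IF=enp3s0
cmds=$(for feat in tx sg tso gso gro; do
    echo "ethtool -K $IF $feat off"
done)
printf '%s\n' "$cmds"
```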
Comment 34 Sebastian Kuhne 2020-02-28 06:06:07 UTC
(In reply to Neil Brown from comment #33)

No problem at all. ;-)

Some questions before I can proceed:

(1) In terms of the firewall there is a misunderstanding. After I had proved that there was no improvement with the firewalls off, I re-activated the firewalls (on both client and server).
Does it mean the rpcdebug information is now useless for you? In other words, shall I repeat the tests
- rpcdebug -m rpc -s all
- try to mount the NFS filesystem from the client using IPv6, so that it fails.
- rpcdebug -m rpc -c all
on the server with disabled firewall (I assume for both, server and client)?

In any case I will disable both firewalls completely from now on.

(2) ip6table -L and ethtool -k INTERFACE (enp3s0): Where to run (on server or on client or both)?

(3) ip6table -L is not a known command on my computer (TW client). Which package do I have to install?

Best regards
Sebastian
Comment 35 Neil Brown 2020-02-28 07:58:04 UTC
> Does it mean the rpcdebug information is now useless for you?

Not necessarily.  The problem probably isn't being caused by a firewall, but I'm deeply suspicious and like to triple check anything that seems even close to where the problem might be.
I wouldn't object to a new rpcdebug trace with the firewall turned off, but I doubt it will show anything.  If you just turn off the firewall and run "ip6tables -L", that is probably sufficient.

> Where to run (on server or on client or both)?

Server.  The rpcdebug trace points to a problem on the server.  This doesn't entirely gel with your report that the issue started after you upgraded the client, but it is where the evidence is currently pointing, and that is what I like to follow.

> ip6table -L is not a known command

That is because I cannot type.  There is an 's' at the end.
/usr/sbin/ip6tables is part of the "iptables" package.
Comment 36 Sebastian Kuhne 2020-03-01 06:39:57 UTC
Created attachment 831641 [details]
ip6tables_OnServerLeap151_withoutFirewall_20200301
Comment 37 Sebastian Kuhne 2020-03-01 06:40:38 UTC
Created attachment 831642 [details]
ethtool_OnServerLeap151_withoutFirewall_20200301
Comment 38 Sebastian Kuhne 2020-03-01 06:41:09 UTC
Created attachment 831643 [details]
ip6tables_OnClientTW_withoutFirewall_20200301
Comment 39 Sebastian Kuhne 2020-03-01 06:42:00 UTC
Created attachment 831644 [details]
ethtool_OnClientTW_withoutFirewall_20200301
Comment 40 Sebastian Kuhne 2020-03-01 06:43:17 UTC
Created attachment 831645 [details]
dmesg_OnServerLeap151_CalledFromClientTW_withoutFirewall_20200301
Comment 41 Sebastian Kuhne 2020-03-01 06:48:41 UTC
Hi Neil,

I have attached five output files. All commands were run with the firewall off on both server and client.
--> ip6tables -L and ethtool -k INTERFACE on both server and client.
--> I also repeated the rpcdebug trace on the server with the firewall off.
The file names should be self-explanatory.

I hope all of this is of further help. It seems to be a tricky situation.
Many thanks for helping!!!
Comment 42 Neil Brown 2020-03-02 01:59:00 UTC
Hi Michal,
 could you have a look at this and see if you can help.

 NFS from a Tumbleweed client to a Leap 15.1 server is failing when the client tries IPv6.
The server gets the initial RPC-NULL request and tries to send a reply but gets EOPNOTSUPP from kernel_sendpage().  When the client retries with IPv4, it all works smoothly.

What might cause EOPNOTSUPP in that circumstance?  There is no firewall to get in the way, and I don't think there is any offload that might be causing problems.
Comment 43 Michal Kubeček 2020-03-02 04:33:48 UTC
This sounds like a duplicate of bsc#1144162. Please check whether scatter-gather
is enabled on the outgoing interface. The current openSUSE-15.1 KotD should
already have the fix.
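Checking the scatter-gather state Michal mentions can be sketched as below. The sketch parses a captured sample of ethtool -k style output so it runs anywhere; on the real server one would pipe `ethtool -k enp3s0` (an assumed interface name) into the same awk instead:

```shell
# Hedged sketch: extract the scatter-gather state from ethtool -k style
# output. A captured sample is used here; the interface name and exact
# output lines are assumptions for illustration.
sample='Features for enp3s0:
rx-checksumming: on
scatter-gather: off
tcp-segmentation-offload: on'
sg_state=$(printf '%s\n' "$sample" | awk -F': ' '$1 == "scatter-gather" {print $2}')
echo "scatter-gather is $sg_state"
```

If the state comes back "off", `ethtool -K enp3s0 sg on` would re-enable it (again with the assumed interface name).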
Comment 44 Sebastian Kuhne 2020-03-02 19:46:44 UTC
Hi Neil and Michal,

many thanks for the new ideas.
30 minutes ago I updated both the Leap 15.1 server (zypper up) and the TW client (zypper dup).

Unfortunately the issue still persists.

BUT: The message on the client is different now:

linux-sebastian:/home/sebastian # mount -v linux-multimedia.local:/home/multimedia /mnt/linux-multimedia/home/multimedia/
mount.nfs: timeout set for Mon Mar  2 20:34:14 2020
mount.nfs: trying text-based options 'vers=4.2,addr=2a02:810a:8340:2dc0:6c7a:9044:c416:1954,clientaddr=2a02:810a:8340:2dc0::5'
mount.nfs: mount(2): Connection reset by peer
mount.nfs: Connection reset by peer
linux-sebastian:/home/sebastian # 


I have never seen "reset by peer" before; this is new after the updates.
Does this give you any hint?

Just for information: the server is an Intel-based system, whereas the client is an AMD-based system.
Comment 45 Michal Kubeček 2020-03-02 20:09:53 UTC
(In reply to Sebastian Kuhne from comment #44)
> 30 minutes ago, I did an update for both, the Leap 15.1 server (zypper up)
> and the TW client (zypper dup).
> 
> Unfortunately the issue still persists.

It's not likely that the fix would be in a maintenance update already. Please
try the package from

  http://download.opensuse.org/repositories/Kernel:/openSUSE-15.1/standard/

> Just as information, the server is an Intel processor based system whereas
> the client is an AMD based system.

It has nothing to do with the CPU; the issue is triggered by scatter-gather
being disabled on the outgoing network device.
Comment 46 Sebastian Kuhne 2020-03-03 11:03:36 UTC
Neil, Michal,

I am happy to confirm that with the kernel update on the Leap 15.1 server the issue is solved!

I can now mount/umount reliably - many thanks for your help!

If you agree, I would change the status of this thread to RESOLVED.

Thanks again and best regards
Sebastian
Comment 47 Neil Brown 2020-03-03 23:10:14 UTC
Good news.  It does seem like your problem is fully resolved, and Richard's disappeared a while ago, so I'll close this now.
Comment 48 Michal Kubeček 2020-03-04 06:30:38 UTC
Let's mark it explicitly as a duplicate.

*** This bug has been marked as a duplicate of bug 1144162 ***