Bug 960153

Summary: Network manager and DNS forwarders don't work together
Product: [openSUSE] openSUSE Distribution Reporter: Jan Kara <jack>
Component: NetworkAssignee: Jonathan Kang <songchuan.kang>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P2 - High CC: jack, jdelvare, lnussel, mt, them4z
Version: Leap 42.1   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: My /etc/sysconfig/network/config file
Journal log from the resume
Debug messages from network manager
Log from program watching changes to /etc/ directory
NM rpm with fix included

Description Jan Kara 2015-12-23 10:05:49 UTC
Created attachment 660275 [details]
My /etc/sysconfig/network/config file

I'm using Networkmanager to manage my network connection including updating /etc/resolv.conf (since this is a laptop which I connect at different places). As I need to use openvpn to connect to SUSE internal network, I'm using dnsmasq to forward resolving of SUSE domain names to SUSE name servers and the rest is resolved through the local name server. This used to work fine with 13.2.

Now I have updated to 42.1 and my resolv.conf contains:
# Generated by NetworkManager
search suse.cz suse.de
nameserver 192.168.2.1

When 'netconfig update' is run, it complains that resolv.conf has been updated manually and so it won't touch it. I have to run 'netconfig update -f' to force updating /etc/resolv.conf and then my DNS setup starts working again.

So apparently some cooperation between NetworkManager and netconfig has been broken with 42.1...
Comment 1 Ludwig Nussel 2016-04-04 09:05:32 UTC
*** Bug 958108 has been marked as a duplicate of this bug. ***
Comment 3 Jonathan Kang 2017-01-16 07:48:26 UTC
NetworkManager uses dnsmasq to handle the resolving, and that's why it
overrides any dns entry in resolv.conf.

Check this link[1] for how to preserve your DNS settings.

*[1] https://wiki.archlinux.org/index.php/Resolv.conf#Preserve_DNS_settings

Closing this as RESOLVED INVALID.
Comment 4 Ludwig Nussel 2017-01-16 09:05:30 UTC
Unfortunately, that is not how it works in openSUSE. We have netconfig, NetworkManager is meant to to provide the information to netconfig which in turn takes the necessary steps to update e.g. /etc/resolv.conf. I guess some patch got lost that detects use of netconfig?
Comment 5 Jonathan Kang 2017-01-17 09:32:50 UTC
(In reply to Ludwig Nussel from comment #4)
> Unfortunately, that is not how it works in openSUSE. We have netconfig,
> NetworkManager is meant to to provide the information to netconfig which in
> turn takes the necessary steps to update e.g. /etc/resolv.conf. I guess some
> patch got lost that detects use of netconfig?

Ah. You are right.

(In reply to Jan Kara from comment #0)
> Created attachment 660275 [details]
> My /etc/sysconfig/network/config file
> 
> I'm using Networkmanager to manage my network connection including updating
> /etc/resolv.conf (since this is a laptop which I connect at different
> places). As I need to use openvpn to connect to SUSE internal network, I'm
> using dnsmasq to forward resolving of SUSE domain names to SUSE name servers
> and the rest is resolved through the local name server. This used to work
> fine with 13.2.
> 
> Now I have updated to 42.1 and my resolv.conf contains:
> # Generated by NetworkManager
> search suse.cz suse.de
> nameserver 192.168.2.1
> 
> When 'netconfig update' is run, it complains that resolv.conf has been
> updated manually and so it won't touch it. I have to run 'netconfig update
> -f' to force updating /etc/resolv.conf and then my DNS setup starts working
> again.
> 
> So apparently some cooperation between NetworkManager and netconfig has been
> broken with 42.1...

Try adding "rc-manager=netconfig" in the main section of NetworkManager.conf
to see if it fixes this problem.
Comment 6 Jan Kara 2017-01-17 16:33:22 UTC
Ah, yes. "rc-manager=netconfig" seems to have fixed the issue. Thanks!
Comment 8 Jan Kara 2017-01-18 07:50:08 UTC
Actually, the issue does not appear to be quite fixed. So when I changed the settings, rebooted the machine, correct /etc/resolv.conf has been created by netconfig - I'm using 'Auto eth0' for network connection. However this morning when I resumed the machine from suspend-to-ram, when network manager reconnected, /etc/resolv.conf has been suddenly replaced by the problematic one (i.e., the one directly generated by NM).
Comment 9 Jonathan Kang 2017-01-18 08:19:03 UTC
(In reply to Jan Kara from comment #8)
> Actually, the issue does not appear to be quite fixed. So when I changed the
> settings, rebooted the machine, correct /etc/resolv.conf has been created by
> netconfig - I'm using 'Auto eth0' for network connection. However this
> morning when I resumed the machine from suspend-to-ram, when network manager
> reconnected, /etc/resolv.conf has been suddenly replaced by the problematic
> one (i.e., the one directly generated by NM).

The new issue you mentioned seems similar with the one reported in RedHat
bugzilla[1].

*[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373485
Comment 10 Jan Kara 2017-01-18 09:04:13 UTC
Created attachment 710597 [details]
Journal log from the resume

Here are interesting system messages from the reboot (full messages attached). You can see NetworkManager reestablishing connections there and at one point:

Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info>  NetworkManager state is now CONNECTED_SITE
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info>  Policy set 'Auto Ethernet' (em1) as default for IPv4 routing and DNS.
Jan 18 08:24:39 quack2.suse.cz dns-resolver[23908]: ATTENTION: You have modified /etc/resolv.conf. Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz dns-resolver[23910]: You can find my version in /etc/resolv.conf.netconfig
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <13>Jan 18 08:24:39 dns-resolver: ATTENTION: You have modified /etc/resolv.conf. Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <13>Jan 18 08:24:39 dns-resolver: You can find my version in /etc/resolv.conf.netconfig
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: ATTENTION: You have modified /etc/resolv.conf.  Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: You can find my version in /etc/resolv.conf.netconfig ...
Jan 18 08:24:39 quack2.suse.cz nscd[968]: 968 monitored file `/etc/resolv.conf` was moved into place, adding watch
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info>  (em1): Activation: successful, device activated.

So at this point /etc/resolv.conf has apparently already been modified by the NetworkManager but NetworkManager apparently *also* tried to update the nameservers in resolv.conf via netconfig.

Also note that there is activation on wlan0 interface running in parallel, however we got DHCP response on that interface only at 08:24:41 so that does not seem directly responsible for this problem.
Comment 11 Jan Kara 2017-01-18 09:09:17 UTC
I'd also note that nscd is monitoring /etc/resolv.conf via inotify and it has noticed any change only at 08:24:39 (the previous change of resolv.conf noticed by nscd is at 08:24:32 when NM took interfaces down and began to reestablish connections). So likely the direct write to resolv.conf by NM and then via netconfig had to come in a very close succession (not very surprising but still a data point).
Comment 13 Jonathan Kang 2017-01-19 07:29:39 UTC
(In reply to Jan Kara from comment #10)
> Created attachment 710597 [details]
> Journal log from the resume
> 
> Here are interesting system messages from the reboot (full messages
> attached).

I suppose these logs are from a suspend rather than a reboot.


I used your /etc/sysconf/network/config file to replace the one in my test
machine, connected to vpn using vpnc and run "netconfig update". It didn't
annoyed me that /etc/resolv.conf has been updated manually. Did I miss
anything? Can you elaborate on how to reproduce this bug?
Comment 14 Jan Kara 2017-01-19 10:27:19 UTC
Yes, the logfile was from suspend.

So the issue did not happen this morning and what seems to make a difference is whether wlan0 interface is enabled or not. I.e.: When I suspend & resume with wlan0 enabled (i.e. the machine gets connected to network both vie ethernet em1 interface and wireless wlan0 interface), /etc/resolv.conf gets overwritten by NetworkManager. When I suspend & resume with wlan0 disabled (i.e., only em1 interface gets reactivated on resume), /etc/resolv.conf is fine.

However, when I just enable wlan0 interface after the machine is already resumed, /etc/resolv.conf stays correct. So it must be some race during resume when both em1 and wlan0 are being brought up at the same time...
Comment 15 Jan Kara 2017-01-19 10:33:39 UTC
And I did one more experiment: When I disconnect the machine from ethernet, then after suspend & resume /etc/resolv.conf is fine as well. So the only case where I'm able to see the problem is when the machine is reconnecting to network via both em1 and wlan0 at once. In that case it happened 3 out of 3 experiments.
Comment 16 Jonathan Kang 2017-01-23 08:14:39 UTC
@Jan

Could you please attach related debug information? How-to is described below:

Add 

> [logging]
> level=DEBUG

to /etc/NetworkManager/NetworkManager.conf

and then attach the output of "journalctl -b -u NetworkManager".

Thanks.
Comment 17 Jan Kara 2017-01-24 13:58:43 UTC
Created attachment 711392 [details]
Debug messages from network manager

Here are requested debug messages from NetworkManager. Last two resume attempts resulted in NM directly modifying /etc/resolv.conf. The attempt before did not result in directly modified /etc/resolv.conf.
Comment 18 Jan Kara 2017-01-24 14:02:37 UTC
Created attachment 711393 [details]
Log from program watching changes to /etc/ directory

In the last reproduction attempt, I have also run a program that watches all the changes in /etc directory (through inotify). There you can see that apparently NM has created its resolv.conf version (resolv.conf.tmp) at 14:52:25 and moved it to resolv.conf in the same second. At 14:52:26, netconfig has created its version of resolv.conf (resolv.conf.gANX1c - the attempts from NM and from netconfig can be recognized by a different name of the temporary file) and then at 14:52:27 renamed it to resolv.conf.netconfig because resolv.conf was already modified by NM.
Comment 19 Jonathan Kang 2017-02-06 02:45:50 UTC
(In reply to Jan Kara from comment #17)
> Created attachment 711392 [details]
> Debug messages from network manager
> 
> Here are requested debug messages from NetworkManager. Last two resume
> attempts resulted in NM directly modifying /etc/resolv.conf. The attempt
> before did not result in directly modified /etc/resolv.conf.

Sorry for missing a key point. You need restart NetworkManager to make sure
the new added config effective.
Comment 20 Jan Kara 2017-02-07 10:22:07 UTC
Ah, good point. I have actually rebooted couple of times since I set the debug option so I had enough NetworkManager logs accumulated. Looking into the logs a situation when resolv.conf is updated directly by NetworkManager is triggered by the following sequence of events:

Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.827341] [dns-manager/nm-dns-manager.c:635] update_dns(): updating resolv.conf
Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.827355] [dns-manager/nm-dns-manager.c:250] run_netconfig(): spawning '/sbin/netconfig modify --service NetworkManager' 
Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828307] [dns-manager/nm-dns-manager.c:267] write_to_netconfig(): writing to netconfig: INTERFACE='NetworkManager'
Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828338] [dns-manager/nm-dns-manager.c:267] write_to_netconfig(): writing to netconfig: DNSSEARCH='suse.cz'
Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828355] [NetworkManagerUtils.c:805] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): waiting up to 1000 milliseconds for process to terminate normally after sending no signal (0)...
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828418] [NetworkManagerUtils.c:821] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): sending SIGKILL...
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828620] [NetworkManagerUtils.c:855] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): after sending no signal (0) and SIGKILL, process 28627 exited by signal 9 (1000266 usec elapsed)
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828908] [devices/nm-device.c:6716] nm_device_set_ip6_config(): [0x1e72d90] (em1): clear IP6Config instance (/org/freedesktop/NetworkManager/IP6Config/10)
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828959] [dns-manager/nm-dns-manager.c:1040] nm_dns_manager_begin_updates(): (device_ip6_config_changed): queueing DNS updates (1)
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828979] [dns-manager/nm-dns-manager.c:1058] nm_dns_manager_end_updates(): (nm_dns_manager_end_updates): DNS configuration did not change
Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828986] [dns-manager/nm-dns-manager.c:1062] nm_dns_manager_end_updates(): (device_ip6_config_changed): no DNS changes to commit (0)

So 'netconfig modify' did not manage to finish in the allocated 1s, so NetworkManager killed it. I'm a bit puzzled that I don't see further debug messages showing that resolv.conf was updated directly but whatever. The code is pretty clearly showing that if updating using netconfig fails, it does the update directly.

So I dug a bit more into netconfig to see what it is doing. I've instrumented netconfig script with some debug messages and here is the output (messages should be relatively self-explanatory):

Entering netconfig  11:00:15:691579073
Before modify  11:00:15:697381985
Entering modify  11:00:15:698074098
Done reading modify input  11:00:15:700162938
Determining state  11:00:15:702411665
Exiting from modify  11:00:15:703101298
After modify  11:00:15:703873893
Before modules  11:00:15:704495286
Running module /etc/netconfig.d//dns-resolver  11:00:15:705215014
dns-resolver begins  11:00:15:707300709
dns-resolver parsed policy  11:00:15:717535930
dns-resolver before resolv.conf  11:00:15:718404372
dns-resolver file created  11:00:15:721682996
dns-resolver header  11:00:15:723014609
dns-resolver searchlist  11:00:15:723693154
dns-resolver nameserver  11:00:15:724343748
dns-resolver done  11:00:16:693133913
<killed>

We can see that we spent most of the 1s time limit in netconfig_check_md5_and_move() (that is the only thing happening between 'nameserver' and 'done' messages in dns-resolver. During successful resume attempts when resolv.conf gets properly updated, I can see netconfig finishing just barely under 1s. So that explains why the problem sometimes happens and sometimes not.

Looking into netconfig_check_md5_and_move() function, I'm not hugely surprised that it takes close to 1s. That spawning of subshells, gawk processes, md5sum processes is going to take up some time when everything is resuming and fighting for CPU and disk. That being said I can investigate that more closely if you think it would be worth it.

But overall it seems that the 1s timeout to run netconfig is just too short in some cases and bumping it up should fix my issues.
Comment 21 Jonathan Kang 2017-02-08 08:08:41 UTC
(In reply to Jan Kara from comment #20)
> 
> So 'netconfig modify' did not manage to finish in the allocated 1s, so
> NetworkManager killed it. I'm a bit puzzled that I don't see further debug
> messages showing that resolv.conf was updated directly but whatever. The
> code is pretty clearly showing that if updating using netconfig fails, it
> does the update directly.

Yes. There isn't any debug message when NM directly writes to resolv.conf. And
by reading those debug messagesk, it's obvious that netconfig was killed by
NM.
> 
> So I dug a bit more into netconfig to see what it is doing. I've
> instrumented netconfig script with some debug messages and here is the
> output (messages should be relatively self-explanatory):
> 
> Entering netconfig  11:00:15:691579073
> Before modify  11:00:15:697381985
> Entering modify  11:00:15:698074098
> Done reading modify input  11:00:15:700162938
> Determining state  11:00:15:702411665
> Exiting from modify  11:00:15:703101298
> After modify  11:00:15:703873893
> Before modules  11:00:15:704495286
> Running module /etc/netconfig.d//dns-resolver  11:00:15:705215014
> dns-resolver begins  11:00:15:707300709
> dns-resolver parsed policy  11:00:15:717535930
> dns-resolver before resolv.conf  11:00:15:718404372
> dns-resolver file created  11:00:15:721682996
> dns-resolver header  11:00:15:723014609
> dns-resolver searchlist  11:00:15:723693154
> dns-resolver nameserver  11:00:15:724343748
> dns-resolver done  11:00:16:693133913
> <killed>
> 
> We can see that we spent most of the 1s time limit in
> netconfig_check_md5_and_move() (that is the only thing happening between
> 'nameserver' and 'done' messages in dns-resolver. During successful resume
> attempts when resolv.conf gets properly updated, I can see netconfig
> finishing just barely under 1s. So that explains why the problem sometimes
> happens and sometimes not.
> 
> Looking into netconfig_check_md5_and_move() function, I'm not hugely
> surprised that it takes close to 1s. That spawning of subshells, gawk
> processes, md5sum processes is going to take up some time when everything is
> resuming and fighting for CPU and disk. That being said I can investigate
> that more closely if you think it would be worth it.

These info are should be sufficient and very helpful. Thanks.
> 
> But overall it seems that the 1s timeout to run netconfig is just too short
> in some cases and bumping it up should fix my issues.

Quite right. 1 second seems to be not long enough.

This is the documentation of nm_utils_kill_child_sync() function call:
> Kill a child process synchronously and wait. The function first checks if
> the child already terminated and if it did, return the exit status.
> Otherwise send one @sig signal. @sig  will always be sent unless the child
> already exited. If the child does not exit within @wait_before_kill_msec
> milliseconds, the function will send %SIGKILL and waits for the child
> indefinitly. If @wait_before_kill_msec is zero, no %SIGKILL signal will be
> sent.

So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set
@wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit
itself).
Comment 22 Jonathan Kang 2017-02-16 02:54:29 UTC
@Jan

Which version of openSUSE are you using now? I'm planning to generate a fix.
So I might need your help to test if it works.

Thanks.
Comment 23 Jan Kara 2017-02-20 08:58:18 UTC
I'm using openSUSE Leap 42.2.
Comment 24 Jonathan Kang 2017-02-20 09:43:21 UTC
Created attachment 714724 [details]
NM rpm with fix included

@Jan

Here is a rpm of NetworkManager with fix included. You can also use this link
to download the rpm if you prefer.
https://api.opensuse.org/build/home:JonathanKang:branches:openSUSE:Leap:42.2:Update/standard/x86_64/NetworkManager/NetworkManager-1.0.12-5.1.x86_64.rpm

After installing this rpm, the following steps needs to be done:
1. run "systemctl restart NetworkManager" to restart NetworkManager service
2. run "netconfig update --force" to update /etc/resolv.conf to a working one

I set the wait time from 1s to 2s, which should be long enough. Could you
please test if this rpm works for you?

Thanks.
Comment 25 Jan Kara 2017-02-20 14:50:55 UTC
Thanks for the test RPM. I have installed it, restarted NM, and the first test suspend/resume went fine. The problem did not trigger every time so I'll be running it for a few days and will let you know whether everything worked fine.
Comment 26 Jan Kara 2017-02-22 12:15:54 UTC
So far no problems. I'd consider the problem fixed...
Comment 27 Jonathan Kang 2017-02-23 06:32:57 UTC
(In reply to Jan Kara from comment #26)
> So far no problems. I'd consider the problem fixed...

Thanks for it. I'll submit the fix then.
Comment 29 Jonathan Kang 2017-03-08 02:46:45 UTC
Closing this as RESOLVED FIXED.
Comment 30 Swamp Workflow Management 2017-04-06 16:09:49 UTC
SUSE-RU-2017:0952-1: An update that has four recommended fixes can now be installed.

Category: recommended (low)
Bug References: 1009717,1021665,960153,966232
CVE References: 
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP2 (src):    NetworkManager-1.0.12-12.4
SUSE Linux Enterprise Software Development Kit 12-SP2 (src):    NetworkManager-1.0.12-12.4
SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src):    NetworkManager-1.0.12-12.4
SUSE Linux Enterprise Server 12-SP2 (src):    NetworkManager-1.0.12-12.4
SUSE Linux Enterprise Desktop 12-SP2 (src):    NetworkManager-1.0.12-12.4
Comment 31 Swamp Workflow Management 2017-04-18 19:10:31 UTC
openSUSE-RU-2017:1049-1: An update that has four recommended fixes can now be installed.

Category: recommended (low)
Bug References: 1009717,1021665,960153,966232
CVE References: 
Sources used:
openSUSE Leap 42.2 (src):    NetworkManager-1.0.12-4.3.1
Comment 32 Bernhard Wiedemann 2017-06-07 14:01:26 UTC
This is an autogenerated message for OBS integration:
This bug (960153) was mentioned in
https://build.opensuse.org/request/show/501753 Factory / NetworkManager
Comment 33 Olaf Hering 2017-06-13 06:57:33 UTC
(In reply to Jonathan Kang from comment #21)
> So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set
> @wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit
> itself).

As it can be seen from bugreports, calling netconfig is not an atomic operation.
A timeout may lead to corrupted config files.
Comment 34 Jonathan Kang 2017-06-22 02:39:55 UTC
(In reply to Olaf Hering from comment #33)
> 
> As it can be seen from bugreports, calling netconfig is not an atomic
> operation.
> A timeout may lead to corrupted config files.

Yes, calling netconfig is not an atomic operation. Maybe we should wait util
netconfig exits itself.
Comment 35 Bernhard Wiedemann 2017-09-25 14:01:16 UTC
This is an autogenerated message for OBS integration:
This bug (960153) was mentioned in
https://build.opensuse.org/request/show/528644 Factory / NetworkManager
Comment 37 Jonathan Kang 2017-09-29 01:38:43 UTC
RESOLVED FIXED.
Comment 39 Swamp Workflow Management 2018-10-24 13:18:03 UTC
SUSE-RU-2018:3382-1: An update that has two recommended fixes can now be installed.

Category: recommended (low)
Bug References: 1103477,960153
CVE References: 
Sources used:
SUSE Linux Enterprise Workstation Extension 12-SP3 (src):    NetworkManager-1.0.12-13.6.1
SUSE Linux Enterprise Software Development Kit 12-SP3 (src):    NetworkManager-1.0.12-13.6.1
SUSE Linux Enterprise Server 12-SP3 (src):    NetworkManager-1.0.12-13.6.1
SUSE Linux Enterprise Desktop 12-SP3 (src):    NetworkManager-1.0.12-13.6.1
Comment 40 Swamp Workflow Management 2018-10-25 13:21:11 UTC
openSUSE-RU-2018:3434-1: An update that has two recommended fixes can now be installed.

Category: recommended (low)
Bug References: 1103477,960153
CVE References: 
Sources used:
openSUSE Leap 42.3 (src):    NetworkManager-1.0.12-7.3.1