|
Bugzilla – Full Text Bug Listing |
| Summary: | Network manager and DNS forwarders don't work together | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Distribution | Reporter: | Jan Kara <jack> |
| Component: | Network | Assignee: | Jonathan Kang <songchuan.kang> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P2 - High | CC: | jack, jdelvare, lnussel, mt, them4z |
| Version: | Leap 42.1 | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
My /etc/sysconfig/network/config file
Journal log from the resume Debug messages from network manager Log from program watching changes to /etc/ directory NM rpm with fix included |
||
*** Bug 958108 has been marked as a duplicate of this bug. *** NetworkManager uses dnsmasq to handle the resolving, and that's why it overrides any dns entry in resolv.conf. Check this link[1] for how to preserve your DNS settings. *[1] https://wiki.archlinux.org/index.php/Resolv.conf#Preserve_DNS_settings Closing this as RESOLVED INVALID. Unfortunately, that is not how it works in openSUSE. We have netconfig, NetworkManager is meant to to provide the information to netconfig which in turn takes the necessary steps to update e.g. /etc/resolv.conf. I guess some patch got lost that detects use of netconfig? (In reply to Ludwig Nussel from comment #4) > Unfortunately, that is not how it works in openSUSE. We have netconfig, > NetworkManager is meant to to provide the information to netconfig which in > turn takes the necessary steps to update e.g. /etc/resolv.conf. I guess some > patch got lost that detects use of netconfig? Ah. You are right. (In reply to Jan Kara from comment #0) > Created attachment 660275 [details] > My /etc/sysconfig/network/config file > > I'm using Networkmanager to manage my network connection including updating > /etc/resolv.conf (since this is a laptop which I connect at different > places). As I need to use openvpn to connect to SUSE internal network, I'm > using dnsmasq to forward resolving of SUSE domain names to SUSE name servers > and the rest is resolved through the local name server. This used to work > fine with 13.2. > > Now I have updated to 42.1 and my resolv.conf contains: > # Generated by NetworkManager > search suse.cz suse.de > nameserver 192.168.2.1 > > When 'netconfig update' is run, it complains that resolv.conf has been > updated manually and so it won't touch it. I have to run 'netconfig update > -f' to force updating /etc/resolv.conf and then my DNS setup starts working > again. > > So apparently some cooperation between NetworkManager and netconfig has been > broken with 42.1... Try adding "rc-manager=netconfig" in the main section of NetworkManager.conf to see if it fixes this problem. Ah, yes. "rc-manager=netconfig" seems to have fixed the issue. Thanks! Actually, the issue does not appear to be quite fixed. So when I changed the settings, rebooted the machine, correct /etc/resolv.conf has been created by netconfig - I'm using 'Auto eth0' for network connection. However this morning when I resumed the machine from suspend-to-ram, when network manager reconnected, /etc/resolv.conf has been suddenly replaced by the problematic one (i.e., the one directly generated by NM). (In reply to Jan Kara from comment #8) > Actually, the issue does not appear to be quite fixed. So when I changed the > settings, rebooted the machine, correct /etc/resolv.conf has been created by > netconfig - I'm using 'Auto eth0' for network connection. However this > morning when I resumed the machine from suspend-to-ram, when network manager > reconnected, /etc/resolv.conf has been suddenly replaced by the problematic > one (i.e., the one directly generated by NM). The new issue you mentioned seems similar with the one reported in RedHat bugzilla[1]. *[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373485 Created attachment 710597 [details]
Journal log from the resume
Here are interesting system messages from the reboot (full messages attached). You can see NetworkManager reestablishing connections there and at one point:
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info> NetworkManager state is now CONNECTED_SITE
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info> Policy set 'Auto Ethernet' (em1) as default for IPv4 routing and DNS.
Jan 18 08:24:39 quack2.suse.cz dns-resolver[23908]: ATTENTION: You have modified /etc/resolv.conf. Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz dns-resolver[23910]: You can find my version in /etc/resolv.conf.netconfig
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <13>Jan 18 08:24:39 dns-resolver: ATTENTION: You have modified /etc/resolv.conf. Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <13>Jan 18 08:24:39 dns-resolver: You can find my version in /etc/resolv.conf.netconfig
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: ATTENTION: You have modified /etc/resolv.conf. Leaving it untouched...
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: You can find my version in /etc/resolv.conf.netconfig ...
Jan 18 08:24:39 quack2.suse.cz nscd[968]: 968 monitored file `/etc/resolv.conf` was moved into place, adding watch
Jan 18 08:24:39 quack2.suse.cz NetworkManager[1066]: <info> (em1): Activation: successful, device activated.
So at this point /etc/resolv.conf has apparently already been modified by the NetworkManager but NetworkManager apparently *also* tried to update the nameservers in resolv.conf via netconfig.
Also note that there is activation on wlan0 interface running in parallel, however we got DHCP response on that interface only at 08:24:41 so that does not seem directly responsible for this problem.
I'd also note that nscd is monitoring /etc/resolv.conf via inotify and it has noticed any change only at 08:24:39 (the previous change of resolv.conf noticed by nscd is at 08:24:32 when NM took interfaces down and began to reestablish connections). So likely the direct write to resolv.conf by NM and then via netconfig had to come in a very close succession (not very surprising but still a data point). (In reply to Jan Kara from comment #10) > Created attachment 710597 [details] > Journal log from the resume > > Here are interesting system messages from the reboot (full messages > attached). I suppose these logs are from a suspend rather than a reboot. I used your /etc/sysconf/network/config file to replace the one in my test machine, connected to vpn using vpnc and run "netconfig update". It didn't annoyed me that /etc/resolv.conf has been updated manually. Did I miss anything? Can you elaborate on how to reproduce this bug? Yes, the logfile was from suspend. So the issue did not happen this morning and what seems to make a difference is whether wlan0 interface is enabled or not. I.e.: When I suspend & resume with wlan0 enabled (i.e. the machine gets connected to network both vie ethernet em1 interface and wireless wlan0 interface), /etc/resolv.conf gets overwritten by NetworkManager. When I suspend & resume with wlan0 disabled (i.e., only em1 interface gets reactivated on resume), /etc/resolv.conf is fine. However, when I just enable wlan0 interface after the machine is already resumed, /etc/resolv.conf stays correct. So it must be some race during resume when both em1 and wlan0 are being brought up at the same time... And I did one more experiment: When I disconnect the machine from ethernet, then after suspend & resume /etc/resolv.conf is fine as well. So the only case where I'm able to see the problem is when the machine is reconnecting to network via both em1 and wlan0 at once. In that case it happened 3 out of 3 experiments. @Jan
Could you please attach related debug information? How-to is described below:
Add
> [logging]
> level=DEBUG
to /etc/NetworkManager/NetworkManager.conf
and then attach the output of "journalctl -b -u NetworkManager".
Thanks.
Created attachment 711392 [details]
Debug messages from network manager
Here are requested debug messages from NetworkManager. Last two resume attempts resulted in NM directly modifying /etc/resolv.conf. The attempt before did not result in directly modified /etc/resolv.conf.
Created attachment 711393 [details]
Log from program watching changes to /etc/ directory
In the last reproduction attempt, I have also run a program that watches all the changes in /etc directory (through inotify). There you can see that apparently NM has created its resolv.conf version (resolv.conf.tmp) at 14:52:25 and moved it to resolv.conf in the same second. At 14:52:26, netconfig has created its version of resolv.conf (resolv.conf.gANX1c - the attempts from NM and from netconfig can be recognized by a different name of the temporary file) and then at 14:52:27 renamed it to resolv.conf.netconfig because resolv.conf was already modified by NM.
(In reply to Jan Kara from comment #17) > Created attachment 711392 [details] > Debug messages from network manager > > Here are requested debug messages from NetworkManager. Last two resume > attempts resulted in NM directly modifying /etc/resolv.conf. The attempt > before did not result in directly modified /etc/resolv.conf. Sorry for missing a key point. You need restart NetworkManager to make sure the new added config effective. Ah, good point. I have actually rebooted couple of times since I set the debug option so I had enough NetworkManager logs accumulated. Looking into the logs a situation when resolv.conf is updated directly by NetworkManager is triggered by the following sequence of events: Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.827341] [dns-manager/nm-dns-manager.c:635] update_dns(): updating resolv.conf Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.827355] [dns-manager/nm-dns-manager.c:250] run_netconfig(): spawning '/sbin/netconfig modify --service NetworkManager' Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828307] [dns-manager/nm-dns-manager.c:267] write_to_netconfig(): writing to netconfig: INTERFACE='NetworkManager' Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828338] [dns-manager/nm-dns-manager.c:267] write_to_netconfig(): writing to netconfig: DNSSEARCH='suse.cz' Feb 05 15:18:22 quack2.suse.cz NetworkManager[1105]: <debug> [1486304302.828355] [NetworkManagerUtils.c:805] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): waiting up to 1000 milliseconds for process to terminate normally after sending no signal (0)... Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828418] [NetworkManagerUtils.c:821] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): sending SIGKILL... Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828620] [NetworkManagerUtils.c:855] nm_utils_kill_child_sync(): kill child process 'netconfig' (28627): after sending no signal (0) and SIGKILL, process 28627 exited by signal 9 (1000266 usec elapsed) Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828908] [devices/nm-device.c:6716] nm_device_set_ip6_config(): [0x1e72d90] (em1): clear IP6Config instance (/org/freedesktop/NetworkManager/IP6Config/10) Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828959] [dns-manager/nm-dns-manager.c:1040] nm_dns_manager_begin_updates(): (device_ip6_config_changed): queueing DNS updates (1) Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828979] [dns-manager/nm-dns-manager.c:1058] nm_dns_manager_end_updates(): (nm_dns_manager_end_updates): DNS configuration did not change Feb 05 15:18:23 quack2.suse.cz NetworkManager[1105]: <debug> [1486304303.828986] [dns-manager/nm-dns-manager.c:1062] nm_dns_manager_end_updates(): (device_ip6_config_changed): no DNS changes to commit (0) So 'netconfig modify' did not manage to finish in the allocated 1s, so NetworkManager killed it. I'm a bit puzzled that I don't see further debug messages showing that resolv.conf was updated directly but whatever. The code is pretty clearly showing that if updating using netconfig fails, it does the update directly. So I dug a bit more into netconfig to see what it is doing. I've instrumented netconfig script with some debug messages and here is the output (messages should be relatively self-explanatory): Entering netconfig 11:00:15:691579073 Before modify 11:00:15:697381985 Entering modify 11:00:15:698074098 Done reading modify input 11:00:15:700162938 Determining state 11:00:15:702411665 Exiting from modify 11:00:15:703101298 After modify 11:00:15:703873893 Before modules 11:00:15:704495286 Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014 dns-resolver begins 11:00:15:707300709 dns-resolver parsed policy 11:00:15:717535930 dns-resolver before resolv.conf 11:00:15:718404372 dns-resolver file created 11:00:15:721682996 dns-resolver header 11:00:15:723014609 dns-resolver searchlist 11:00:15:723693154 dns-resolver nameserver 11:00:15:724343748 dns-resolver done 11:00:16:693133913 <killed> We can see that we spent most of the 1s time limit in netconfig_check_md5_and_move() (that is the only thing happening between 'nameserver' and 'done' messages in dns-resolver. During successful resume attempts when resolv.conf gets properly updated, I can see netconfig finishing just barely under 1s. So that explains why the problem sometimes happens and sometimes not. Looking into netconfig_check_md5_and_move() function, I'm not hugely surprised that it takes close to 1s. That spawning of subshells, gawk processes, md5sum processes is going to take up some time when everything is resuming and fighting for CPU and disk. That being said I can investigate that more closely if you think it would be worth it. But overall it seems that the 1s timeout to run netconfig is just too short in some cases and bumping it up should fix my issues. (In reply to Jan Kara from comment #20) > > So 'netconfig modify' did not manage to finish in the allocated 1s, so > NetworkManager killed it. I'm a bit puzzled that I don't see further debug > messages showing that resolv.conf was updated directly but whatever. The > code is pretty clearly showing that if updating using netconfig fails, it > does the update directly. Yes. There isn't any debug message when NM directly writes to resolv.conf. And by reading those debug messagesk, it's obvious that netconfig was killed by NM. > > So I dug a bit more into netconfig to see what it is doing. I've > instrumented netconfig script with some debug messages and here is the > output (messages should be relatively self-explanatory): > > Entering netconfig 11:00:15:691579073 > Before modify 11:00:15:697381985 > Entering modify 11:00:15:698074098 > Done reading modify input 11:00:15:700162938 > Determining state 11:00:15:702411665 > Exiting from modify 11:00:15:703101298 > After modify 11:00:15:703873893 > Before modules 11:00:15:704495286 > Running module /etc/netconfig.d//dns-resolver 11:00:15:705215014 > dns-resolver begins 11:00:15:707300709 > dns-resolver parsed policy 11:00:15:717535930 > dns-resolver before resolv.conf 11:00:15:718404372 > dns-resolver file created 11:00:15:721682996 > dns-resolver header 11:00:15:723014609 > dns-resolver searchlist 11:00:15:723693154 > dns-resolver nameserver 11:00:15:724343748 > dns-resolver done 11:00:16:693133913 > <killed> > > We can see that we spent most of the 1s time limit in > netconfig_check_md5_and_move() (that is the only thing happening between > 'nameserver' and 'done' messages in dns-resolver. During successful resume > attempts when resolv.conf gets properly updated, I can see netconfig > finishing just barely under 1s. So that explains why the problem sometimes > happens and sometimes not. > > Looking into netconfig_check_md5_and_move() function, I'm not hugely > surprised that it takes close to 1s. That spawning of subshells, gawk > processes, md5sum processes is going to take up some time when everything is > resuming and fighting for CPU and disk. That being said I can investigate > that more closely if you think it would be worth it. These info are should be sufficient and very helpful. Thanks. > > But overall it seems that the 1s timeout to run netconfig is just too short > in some cases and bumping it up should fix my issues. Quite right. 1 second seems to be not long enough. This is the documentation of nm_utils_kill_child_sync() function call: > Kill a child process synchronously and wait. The function first checks if > the child already terminated and if it did, return the exit status. > Otherwise send one @sig signal. @sig will always be sent unless the child > already exited. If the child does not exit within @wait_before_kill_msec > milliseconds, the function will send %SIGKILL and waits for the child > indefinitly. If @wait_before_kill_msec is zero, no %SIGKILL signal will be > sent. So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set @wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit itself). @Jan Which version of openSUSE are you using now? I'm planning to generate a fix. So I might need your help to test if it works. Thanks. I'm using openSUSE Leap 42.2. Created attachment 714724 [details] NM rpm with fix included @Jan Here is a rpm of NetworkManager with fix included. You can also use this link to download the rpm if you prefer. https://api.opensuse.org/build/home:JonathanKang:branches:openSUSE:Leap:42.2:Update/standard/x86_64/NetworkManager/NetworkManager-1.0.12-5.1.x86_64.rpm After installing this rpm, the following steps needs to be done: 1. run "systemctl restart NetworkManager" to restart NetworkManager service 2. run "netconfig update --force" to update /etc/resolv.conf to a working one I set the wait time from 1s to 2s, which should be long enough. Could you please test if this rpm works for you? Thanks. Thanks for the test RPM. I have installed it, restarted NM, and the first test suspend/resume went fine. The problem did not trigger every time so I'll be running it for a few days and will let you know whether everything worked fine. So far no problems. I'd consider the problem fixed... (In reply to Jan Kara from comment #26) > So far no problems. I'd consider the problem fixed... Thanks for it. I'll submit the fix then. Closing this as RESOLVED FIXED. SUSE-RU-2017:0952-1: An update that has four recommended fixes can now be installed. Category: recommended (low) Bug References: 1009717,1021665,960153,966232 CVE References: Sources used: SUSE Linux Enterprise Workstation Extension 12-SP2 (src): NetworkManager-1.0.12-12.4 SUSE Linux Enterprise Software Development Kit 12-SP2 (src): NetworkManager-1.0.12-12.4 SUSE Linux Enterprise Server for Raspberry Pi 12-SP2 (src): NetworkManager-1.0.12-12.4 SUSE Linux Enterprise Server 12-SP2 (src): NetworkManager-1.0.12-12.4 SUSE Linux Enterprise Desktop 12-SP2 (src): NetworkManager-1.0.12-12.4 openSUSE-RU-2017:1049-1: An update that has four recommended fixes can now be installed. Category: recommended (low) Bug References: 1009717,1021665,960153,966232 CVE References: Sources used: openSUSE Leap 42.2 (src): NetworkManager-1.0.12-4.3.1 This is an autogenerated message for OBS integration: This bug (960153) was mentioned in https://build.opensuse.org/request/show/501753 Factory / NetworkManager (In reply to Jonathan Kang from comment #21) > So we can set @wait_before_kill_msec to a bigger value(> 1 second). Or set > @wait_before_kill_msec to 0(we don't try to kill netconfig, let it exit > itself). As it can be seen from bugreports, calling netconfig is not an atomic operation. A timeout may lead to corrupted config files. (In reply to Olaf Hering from comment #33) > > As it can be seen from bugreports, calling netconfig is not an atomic > operation. > A timeout may lead to corrupted config files. Yes, calling netconfig is not an atomic operation. Maybe we should wait util netconfig exits itself. This is an autogenerated message for OBS integration: This bug (960153) was mentioned in https://build.opensuse.org/request/show/528644 Factory / NetworkManager RESOLVED FIXED. SUSE-RU-2018:3382-1: An update that has two recommended fixes can now be installed. Category: recommended (low) Bug References: 1103477,960153 CVE References: Sources used: SUSE Linux Enterprise Workstation Extension 12-SP3 (src): NetworkManager-1.0.12-13.6.1 SUSE Linux Enterprise Software Development Kit 12-SP3 (src): NetworkManager-1.0.12-13.6.1 SUSE Linux Enterprise Server 12-SP3 (src): NetworkManager-1.0.12-13.6.1 SUSE Linux Enterprise Desktop 12-SP3 (src): NetworkManager-1.0.12-13.6.1 openSUSE-RU-2018:3434-1: An update that has two recommended fixes can now be installed. Category: recommended (low) Bug References: 1103477,960153 CVE References: Sources used: openSUSE Leap 42.3 (src): NetworkManager-1.0.12-7.3.1 |
Created attachment 660275 [details] My /etc/sysconfig/network/config file I'm using Networkmanager to manage my network connection including updating /etc/resolv.conf (since this is a laptop which I connect at different places). As I need to use openvpn to connect to SUSE internal network, I'm using dnsmasq to forward resolving of SUSE domain names to SUSE name servers and the rest is resolved through the local name server. This used to work fine with 13.2. Now I have updated to 42.1 and my resolv.conf contains: # Generated by NetworkManager search suse.cz suse.de nameserver 192.168.2.1 When 'netconfig update' is run, it complains that resolv.conf has been updated manually and so it won't touch it. I have to run 'netconfig update -f' to force updating /etc/resolv.conf and then my DNS setup starts working again. So apparently some cooperation between NetworkManager and netconfig has been broken with 42.1...