Bug 737636

Summary: network is started then stopped right away
Product: [openSUSE] openSUSE 12.1 Reporter: Forgotten User cI3FYhfYTs <forgotten_cI3FYhfYTs>
Component: BasesystemAssignee: Marius Tomaschewski <mt>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: fcrozat, info, mantel, suse-beta
Version: Final   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: systemd debugging output for the ssh/network problem
dmesg output of failure after updating to test version
dmesg output of failure after updating to fcrozat's systemd
dmesg output not that rpcbind is not started anymore

Description Forgotten User cI3FYhfYTs 2011-12-19 13:08:25 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:11.0a1) Gecko/20111211 Firefox/11.0a1

Even though ssh is set to `Yes' under System Services (Runlevel) the network isn't started, well to be precise, the network is started then stopped again.


Reproducible: Always

Steps to Reproduce:
1. Under System Services (Runlevel) set xdm to no and ssh to yes
2. boot system

Actual Results:  
Network is started, then stopped again, ssh is not running.

Expected Results:  
ssh should be up an running.

My network is configured statically (through yast).

I've found a work-around which I could live with, reluctantly, for the moment:
Enable XDM in System Services, the network appears to remain started if XDM is running.
Comment 1 Forgotten User cI3FYhfYTs 2011-12-19 13:09:19 UTC
Created attachment 468065 [details]
systemd debugging output for the ssh/network problem
Comment 2 Urs Mueller 2011-12-30 23:03:16 UTC
I got a similiar problem. I have to up eth0 manually by "ifup eth0". It runs until reboot.
I saw that postfix was not running but showing "dead". The log (/var/log/message) shows postmap complaining about a missing local interface ::1. 

I had deactivated IPv6 on setup due to no need. After that problem, I reactivated it and bingo. The rcnetwork or restart does start eth0.

Could it be, that there is a wrong dependency in the network start script forcing to shut it off when there is no IPv6?
Comment 3 Forgotten User cI3FYhfYTs 2011-12-30 23:09:53 UTC
I wish, I have the opposite problem, it's a v6 only network, so I have no need for IPv4.  But I doubt there's an option to turn it off and then on again, so this won't work for me.
Comment 4 Frederic Crozat 2012-01-27 13:14:16 UTC
could you test with updated systemd and sysconfig from 12.1 maintenance update ? Issue regarding network service going down just after being up should be fixed.
Comment 5 Forgotten User cI3FYhfYTs 2012-01-27 19:31:09 UTC
There is no update available.  Whats the version number supposed to be?
I've got systemd-37-3.6.1 and the bug is still in there.
Comment 6 Frederic Crozat 2012-01-30 10:17:12 UTC
my bad, I thought sysconfig update had been pushed. Please test with sysconfig package from https://build.opensuse.org/package/show?package=sysconfig&project=openSUSE%3A12.1%3AUpdate%3ATest
Comment 7 Forgotten User cI3FYhfYTs 2012-02-06 12:20:41 UTC
There must be something wrong with the link, I don't seem to be able to add the repo to my zypper repo list.
Comment 8 Frederic Crozat 2012-02-06 12:38:04 UTC
because it wasn't a url for adding a repository.

Try http://download.opensuse.org/repositories/home:/mtomaschewski:/branches:/openSUSE:/12.1:/Update:/Test/openSUSE_12.1/
Comment 9 Forgotten User cI3FYhfYTs 2012-02-06 13:02:07 UTC
Yep, looks like it's fixed.  I've rebooted 3 times now and the network stays up as expected.
Comment 10 Frederic Crozat 2012-02-06 14:43:39 UTC
marking closed as fixed, will be handled by pending maintenance update for 12.1
Comment 11 Forgotten User cI3FYhfYTs 2012-02-06 14:46:48 UTC
Created attachment 474575 [details]
dmesg output of failure after updating to test version

Spoke too soon.  On the 5th reboot it shows the original (erroneous) behaviour again, attached is the dmesg.
Comment 12 Forgotten User cI3FYhfYTs 2012-02-06 15:01:57 UTC
Don't know if it's good news or bad news, out of 10 reboots 2 succeeded, 8 failed, and I don't know what the schema behind that is; as I was saying the good news is that it's not always reproducible.
Comment 13 Hubert Mantel 2012-02-06 15:19:47 UTC
*** Bug 741588 has been marked as a duplicate of this bug. ***
Comment 14 Marius Tomaschewski 2012-02-06 16:41:25 UTC
We've added the "X-Systemd-RemainAfterExit: true" support to systemd
and added the flag to the LSB tags in network script to keep services,
that do not have any daemon, in "running" state instead to assume that
they're gone when there is no process after start.

[systemd update is released, sysconfig still in comment 8 repo]

But it seems, there is still some check for empty cgroup somewhere in
systemd causing systemd to execute the stop action for the service:

[   19.199283] systemd[1]: Accepted connection on private bus.
[   19.200140] systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Agent.Released() on /org/freedesktop/systemd1/agent
[   19.200235] systemd[1]: network.service: cgroup is empty
[   19.200344] systemd[1]: About to execute: /etc/init.d/network stop
[   19.205247] systemd[1]: Forked /etc/init.d/network as 1258
[   19.205481] systemd[1]: network.service changed running -> stop

See also deadlock problem in bug 724777 comment 4.

Further, AFAIS systemd never calls the "network status" action but always
the cached "start" return code only [that may have reported "in background"
when dhcp didn't get any lease yet, ipv6 dad is not finished yet, ...].
And because "status" isn't always correct, another action results may be
wrong too (e.g. "try-restart").

I were unable to catch the problem until now, because it works for me as
soon as I enable debugging :-/ what is usually a timing/locking problem.
Comment 15 Marius Tomaschewski 2012-02-06 16:57:56 UTC
(In reply to comment #11)
> Created an attachment (id=474575) [details]
> dmesg output of failure after updating to test version
> 
> Spoke too soon.  On the 5th reboot it shows the original (erroneous) behaviour
> again, attached is the dmesg.

Look at the wrong start order -- systemd also starts most network depending
services incl. remote-fs before network:

[    7.024754] systemd[1]: Installed new job remote-fs.target/start as 65
[    7.024757] systemd[1]: Installed new job syslog.service/start as 66
[    7.024760] systemd[1]: Installed new job nfs-secure.service/start as 69
[    7.024764] systemd[1]: Installed new job var-lib-nfs-rpc_pipefs.mount/start as 70
[    7.024767] systemd[1]: Installed new job rpcbind.service/start as 71
...
[    7.024833] systemd[1]: Installed new job network.service/start as 88
[    7.024836] systemd[1]: Installed new job network.target/start as 89

I don't think I can fix (work around) here something in the network script.

See also bug 744293 -- does the systemd dependency fix helps in your case?
Perhaps network simply gets stopped because of a try to resolve the deps?
Comment 16 Forgotten User cI3FYhfYTs 2012-02-07 10:30:47 UTC
Created attachment 474756 [details]
dmesg output of failure after updating to fcrozat's systemd

Not entirely sure what you mean by `systemd dependency fix'.  I took it you wanted me to try fcrozat's latest systemd version and I did.  Attached is the result.
Comment 17 Marius Tomaschewski 2012-02-16 14:10:02 UTC
Yes, I were speaking about systemd with a remote-fs dependency fix from:

http://download.opensuse.org/repositories/home:/fcrozat:/systemd/openSUSE_12.1/

Hmm... there still seems to be a deps issue. Diverse remotefs related service
are started before network:

[   20.803349] systemd[1]: Got SIGCHLD for process 1216 (ifup-route)
[   20.803468] systemd[1]: Child 1216 died (code=exited, status=0/SUCCESS)
[   20.804130] systemd[1]: Received SIGCHLD from PID 613 (network).
[   20.804208] systemd[1]: Got SIGCHLD for process 613 (network)
[   20.804320] systemd[1]: Child 613 died (code=exited, status=0/SUCCESS)
[   20.804325] systemd[1]: Child 613 belongs to network.service
[   20.804373] systemd[1]: network.service: control process exited, code=exited status=0
[   20.804377] systemd[1]: network.service got final SIGCHLD for state start
[   20.804542] systemd[1]: network.service changed start -> running
[   20.804559] systemd[1]: Job network.service/start finished, result=done
[   20.804708] systemd[1]: network.target changed dead -> active
[   20.804716] systemd[1]: Job network.target/start finished, result=done
[   20.804784] systemd[1]: About to execute: /etc/init.d/nfs start
[   20.809192] systemd[1]: Forked /etc/init.d/nfs as 1219
[   20.809359] systemd[1]: nfs.service changed dead -> start
[   20.815299] systemd[1]: Received SIGCHLD from PID 1217 (ifup-route).
[   20.815383] systemd[1]: Got SIGCHLD for process 1217 (ifup-route)
[   20.815498] systemd[1]: Child 1217 died (code=exited, status=1/FAILURE)
[   20.824242] systemd[1]: Accepted connection on private bus.
[   20.829008] systemd[1]: Got D-Bus request: org.freedesktop.systemd1.Agent.Released() on /org/freedesktop/systemd1/agent
[   20.829190] systemd[1]: network.service: cgroup is empty
[   20.829269] systemd[1]: About to execute: /etc/init.d/network stop
[   20.832227] systemd[1]: Forked /etc/init.d/network as 1221
[   20.832473] systemd[1]: network.service changed running -> stop

Interesting thing is PID 1217 (ifup-route) after network finished...
where is this from ... just logged late?

Please make sure you have the most recent systemd >= 37-3.151.1 from:
http://download.opensuse.org/repositories/home:/fcrozat:/systemd/openSUSE_12.1/

install the sysconfig package (changelog from today, > 0.75.4-2.8) from:

http://download.opensuse.org/repositories/home:/mtomaschewski:/branches:/openSUSE:/12.1:/Update:/Test/openSUSE_12.1/

------------------------------------------------------------------
Thu Feb 16 14:03:59 UTC 2012 - mt@suse.com
- Test patch: 0010-network-start-fails-under-systemd.test.patch
  bnc#734676: network bonding does not work when used with systemd
  bnc#737636: network is started then stopped right away
  bnc#745859: network reports failure at start under systemd
  Set DEBUG=EXTRA, reboot and attach tar of /dev/.sysconfig/network.
------------------------------------------------------------------

Then set DEBUG=EXTRA in /etc/sysconfig/network/config, reboot and
attach a tar.bz2 of /dev/.sysconfig/network when it occurs again.
Comment 18 Forgotten User cI3FYhfYTs 2012-02-17 14:30:38 UTC
Ok, the network seems to start reliably now.  However, rpcbind is now not started anymore, or to be precise, started too soon (before any networking).
Shall I file a new bug report?
Comment 19 Frederic Crozat 2012-02-17 14:41:20 UTC
hmm, since there is a dependency from rpcbind on $network, it would mean network.target has been activated, ie both NM-wait-network.service and network.service are activated.

network service might be giving its status too soon.

could you test with "systemd.log_level=debug systemd.log_target=kmsg" and attach dmesg output after booting, so see what happened exactly.
Comment 20 Forgotten User cI3FYhfYTs 2012-02-17 15:31:08 UTC
Created attachment 476707 [details]
dmesg output not that rpcbind is not started anymore
Comment 21 Frederic Crozat 2012-02-17 15:45:58 UTC
I don't see rpcbind being started at startup.

what is the output of systemctl status rpcbind.service ?
Comment 22 Forgotten User cI3FYhfYTs 2012-02-17 16:24:36 UTC
root@kegel:~ # systemctl status rpcbind.service
rpcbind.service - RPC Bind
          Loaded: loaded (/lib/systemd/system/rpcbind.service; disabled)
          Active: active (running) since Fri, 17 Feb 2012 14:57:44 +0000; 1h 25min ago
        Main PID: 1510 (rpcbind)
          CGroup: name=systemd:/system/rpcbind.service
                  └ 1510 /sbin/rpcbind -w -f
Comment 23 Frederic Crozat 2012-02-17 16:37:55 UTC
rpcbind.service is not enabled. Try 
"systemctl enable rpcbind.service"

it should enable it for next reboot.

but I don't understand, rpcbind doesn't ship with a .service on 12.1. Are you running on Factory ?

I think you should open a separate bug report for rpcbind on Factory.
Comment 24 Forgotten User cI3FYhfYTs 2012-02-17 16:47:07 UTC
Indeed it does.  Ok, I'm happy now.
No, it's a 12.1 but it has your systemd and mt's sysconfig, so I don't know where the service comes from.
Comment 25 Frederic Crozat 2012-02-17 16:49:54 UTC
check rpm -qi $(rpm -qf /lib/systemd/system/rpcbind.service)

I doubt it come from openSUSE:12.1 repository..
Comment 26 Forgotten User cI3FYhfYTs 2012-02-17 17:05:31 UTC
Oh indeed, it's a leftover from my autofs bug report.
Vendor: obs://build.opensuse.org/home:elvigia

It even works better than before now, as the rpcbind process was originally started before the network was up, and there were error messages about not being able to bind a port because the address did not exist.
Comment 27 Frederic Crozat 2012-02-20 09:49:01 UTC
Do you mean network issue is fixed for you ?
Comment 28 Forgotten User cI3FYhfYTs 2012-02-20 10:31:53 UTC
Yes.
Comment 29 Marius Tomaschewski 2012-02-21 10:47:37 UTC
OK, then let's resolve it as a duplicate of bnc#744293 where the
dependency issue has been fixed.

*** This bug has been marked as a duplicate of bug 744293 ***
Comment 30 Marius Tomaschewski 2012-02-21 10:49:15 UTC
Many thanks to everybody involved for reporting, all the tests and patience!
Comment 31 Forgotten User cI3FYhfYTs 2012-02-21 10:58:16 UTC
I disagree with the resolution, the bug is fixed as is.  There has been no dependency issue at all.

Like I said, the dependency problem was there BEFORE, probably because the rpcbind service file didn't `integrate' well.  After ENABLING it (it was disabled according to fcrozat) it's started as designed.
Comment 32 Marius Tomaschewski 2012-02-21 15:03:13 UTC
Well, I've closed it as DUP e.g. because of comment 26 where you write
"rpcbind process was originally started before the network was up"
=> the remote-fs-after-network dependency problem:

- Add remote-fs-after-network.patch and update insserv patch:
  ensure remote-fs-pre.target is enabled and started
  before network mount points (bnc#744293).

But anyway, I'm fine with FIXED as well :-)
Comment 33 Bernhard Wiedemann 2012-02-29 11:00:12 UTC
This is an autogenerated message for OBS integration:
This bug (737636) was mentioned in
https://build.opensuse.org/request/show/107529 12.1 / sysconfig