Bug 1024452

Summary: network-online.target seems insufficient for corosync
Product: [openSUSE] openSUSE Distribution
Reporter: Peter Wullinger <wullinger>
Component: High Availability
Assignee: Yan Gao <ygao>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
Version: Leap 42.2
Target Milestone: ---
Hardware: x86-64
OS: SLES 12
See Also: https://bugzilla.suse.com/show_bug.cgi?id=926835
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Peter Wullinger 2017-02-09 08:45:52 UTC
On systemd-based distributions, corosync.service and pacemaker.service have

[Unit]
After=network.target

Unfortunately, as noted (https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/), "network.target has very little meaning during start-up."

We currently experience intermittent failures of cluster nodes after boot when the cluster software is started before the (statically assigned) IP address on a bonded interface is brought up.

1) Our strongest gripe is that the node-local sshd is configured to listen only on the node's address (to avoid conflicts with cluster-assigned IP addresses). Since sshd.service depends only on network.target, the node-local sshd intermittently fails to come up after boot, making it impossible to reliably set ListenAddress in sshd_config on cluster nodes.
2) Some cluster services (e.g. dlm_controld) are started before proper networking is available, causing them to fail to start (and the node in question then commits suicide).

Corosync seems to cope with this problem (it detects when an interface becomes available after it has started), but other parts of the cluster do not. In particular, the booted node got a STONITH-signal straight away.

We use wicked to configure the networking.

With more complex network configurations, reaching "network.target" does not mean that all configured static IP addresses are assigned. All services that depend on a particular static IP address being assigned at start seem to break intermittently due to a startup race condition.

This has already been observed for sshd (https://bugzilla.suse.com/show_bug.cgi?id=926835) when ListenAddress is not the wildcard address.

The situation can be improved by setting

[Unit]
After=network-online.target

for all services that listen on specific IP addresses. However, the definition of network-online.target is equally vague (»the definition of "up" is defined by the network management software«) and I cannot say I trust the reliability of this solution.

- How does openSUSE define network-online.target? In particular, "all statically assigned IP addresses are available" would solve our problem.

- Make all services that depend on a particular IP address being assigned depend on network-online.target instead of network.target.
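
The ordering change proposed above could be tried locally with a systemd drop-in rather than by editing the packaged unit. This is a minimal sketch, assuming sshd.service is the affected unit; the drop-in file name is hypothetical. Wants= is included because After= only orders units and does not by itself pull network-online.target into the boot transaction. Whether this closes the race still depends on how the network management software (here wicked) defines "online".

```ini
# /etc/systemd/system/sshd.service.d/wait-online.conf
# (hypothetical file name; any *.conf file in this directory is read)
[Unit]
# Order sshd after the point the network stack reports as "online" ...
After=network-online.target
# ... and make sure that target is actually requested during boot.
Wants=network-online.target
```

After creating the file, run "systemctl daemon-reload" so that systemd picks up the drop-in; "systemctl cat sshd.service" shows the merged result.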
Comment 1 Peter Wullinger 2017-02-09 15:48:33 UTC
Let me correct myself:

corosync's unit has

After=network-online.target

but that still seems to be insufficient to ensure reliable operation after boot.

The sshd problem remains.
Comment 2 Tomáš Chvátal 2018-04-17 14:02:48 UTC
This is automated batch bugzilla cleanup.

openSUSE Leap 42.2 has reached end-of-life (EOL [1]) status. As such
it is no longer maintained, which means that it will not receive any
further security or bug fix updates.
As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
openSUSE, or you can still observe it under openSUSE Leap 15.0, please
feel free to reopen this bug against that version (see the "Version"
component in the bug fields), or alternatively open
a new ticket.

Thank you for reporting this bug and we are sorry it could not be fixed
during the lifetime of the release.

[1] https://en.opensuse.org/Lifetime