Bug 918507

Summary: System services unable to start or stop, no reboot possible, zombied sshd processes
Product: [openSUSE] openSUSE 13.1
Reporter: Forgotten User px8_fL11FF <forgotten_px8_fL11FF>
Component: Basesystem
Assignee: E-mail List <bnc-team-screening>
Status: RESOLVED DUPLICATE
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P1 - Urgent
CC: bwiedemann, forgotten_gV38hNAnhN, markus.zimmermann
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 13.1

Description Forgotten User px8_fL11FF 2015-02-18 22:38:39 UTC
Hi,

We're running 10 openSUSE 13.1 servers, all fully patched. These servers have been running for roughly a year, and no configuration files were changed recently.

After the update from last Monday:

--
The following NEW patch is going to be installed:
  openSUSE-2015-149 

The following 6 packages are going to be upgraded:
  libgudev-1_0-0 libudev1 systemd systemd-32bit systemd-sysvinit udev 
--

Two of our servers have been rebooted since. These two servers now show very peculiar behavior. About 12-16 hours after booting, all services (apache, mysql, ...) are still running normally, but the following commands fail:

<code>
 service apache2 status
 service apache2 stop
 service sshd status
 service mysql stop
</code>

Output is:

No such service/target!?

Trying to reboot or shut down also fails. Only things like

<code>
 echo 1 > /proc/sys/kernel/sysrq 
 echo b > /proc/sysrq-trigger
</code>

work. After the servers come back, all of the above-mentioned commands work fine again, and the boot log is clean (as far as I can tell). After about 12-16 hours, the problem shows up again.
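
The two writes above reboot immediately without flushing disks. A slightly gentler variant, sketched below, first asks the kernel to sync and remount filesystems read-only (SysRq keys s and u) before sending b. The function name and the SYSRQ_ENABLE, SYSRQ_TRIGGER and SYSRQ_DELAY variables are hypothetical and exist only so the sequence can be rehearsed against ordinary files; this is an illustration under those assumptions, not a tested recovery procedure.

```shell
# Emergency reboot through the kernel's magic SysRq interface.
# SYSRQ_ENABLE, SYSRQ_TRIGGER and SYSRQ_DELAY are hypothetical override
# variables so the sequence can be tried against plain files first.
sysrq_reboot() {
    enable="${SYSRQ_ENABLE:-/proc/sys/kernel/sysrq}"
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    delay="${SYSRQ_DELAY:-2}"

    echo 1 > "$enable" || return 1      # allow all SysRq functions
    for key in s u b; do                # s = sync, u = remount read-only, b = reboot
        echo "$key" > "$trigger" || return 1
        sleep "$delay"                  # give sync/remount a moment to finish
    done
}
```

Calling sysrq_reboot as root with no overrides set reboots the machine, so it should only be used when a normal reboot is impossible.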

All services are running fine, but they can't be stopped, nor can their status be queried. The exception is sshd: a couple of sshd processes are zombies (defunct).
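
One way to list such zombies is to filter ps output on the process state column. The helper below is a small sketch; filter_zombies is a hypothetical name, and it assumes input in the column order produced by `ps -eo pid,ppid,stat,comm`.

```shell
# Print PID, parent PID and command of every zombie (state Z) process.
# Expects `ps -eo pid,ppid,stat,comm` style output on stdin.
filter_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Usage on a live system:
#   ps -eo pid,ppid,stat,comm | filter_zombies
```

The parent PID column is included because a zombie stays in the process table until its parent reaps it, so the parent is usually the process worth looking at.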

This is in /var/log/messages when the servers are in this state and an ssh login occurs:

<code>
2015-02-18T20:16:56.001419+01:00 fs1 sshd[4091]: Accepted keyboard-interactive/pam for root from 217.251.***.*** port 48954 ssh2
2015-02-18T20:16:56.390724+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c35.scope: Activation of org.freedesktop.systemd1 timed out org.freedesktop.DBus.Error.TimedOut
2015-02-18T20:17:09.928009+01:00 fs1 sshd[25667]: pam_systemd(sshd:session): Failed to release session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T20:17:09.931181+01:00 fs1 systemd-cgroups-agent[4096]: Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
2015-02-18T20:17:21.028887+01:00 fs1 sshd[4091]: pam_systemd(sshd:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T20:17:21.391049+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c36.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:17:46.391342+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c37.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
</code>

Also, we have a couple of cron jobs that run every few minutes. This shows up in the log files:

<code>
2015-02-18T20:22:26.932169+01:00 fs1 /USR/SBIN/CRON[4231]: (root) CMD (/etc/ha.d/mysql_watcher3.php)
2015-02-18T20:22:26.932662+01:00 fs1 /USR/SBIN/CRON[4232]: (root) CMD (/etc/health/healthd.sh)
2015-02-18T20:22:26.933064+01:00 fs1 /USR/SBIN/CRON[4233]: (root) CMD (/etc/ha.d/watch_messages.php)
2015-02-18T20:22:31.308509+01:00 fs1 systemd-logind[611]: Failed to start session scope session-3387.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:22:56.308636+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c42.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:23:21.309211+01:00 fs1 systemd-logind[611]: Failed to start session scope session-c43.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.971793+01:00 fs1 systemd-logind[611]: Failed to start session scope session-3391.scope: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. org.freedesktop.DBus.Error.NoReply
2015-02-18T20:24:26.972637+01:00 fs1 /usr/sbin/cron[4243]: pam_systemd(crond:session): Failed to create session: Input/output error
</code>

If I log in via ssh (which still works, even while the problems described above are present) and try to list the currently logged-in users like this

<code>
 systemd-loginctl list-sessions
</code>

this usually works the first time (showing only my session; no other sessions should be present) but stops working after about 5 minutes of being logged in. Then it just hangs until killed with ^C.
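
To check for this hang without tying up a terminal, the command can be wrapped in timeout(1) from coreutils. The sketch below uses a hypothetical helper name, probe, and relies on timeout returning exit status 124 when the time limit expires.

```shell
# Run a command with a time limit and report whether it answered.
# Usage: probe SECONDS COMMAND [ARGS...]
probe() {
    limit="$1"
    shift
    timeout "$limit" "$@" > /dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then          # 124 is what timeout(1) returns on expiry
        echo "hung"
    elif [ "$rc" -eq 0 ]; then
        echo "ok"
    else
        echo "failed ($rc)"
    fi
}
```

For example, `probe 10 systemd-loginctl list-sessions` would print "hung" in the state described above and "ok" on a healthy system.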


Also, the log file is cluttered with messages like this:

<code>
2015-02-18T15:01:01.181226+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:04:01.267349+01:00 fs1 systemd-logind[611]: message repeated 6 times: [ Failed to store session release timer fd]
2015-02-18T15:05:01.520119+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:06:01.283288+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:10:01.479541+01:00 fs1 systemd-logind[611]: message repeated 9 times: [ Failed to store session release timer fd]
2015-02-18T15:10:02.064196+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:11:01.552170+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:14:01.702136+01:00 fs1 systemd-logind[611]: message repeated 6 times: [ Failed to store session release timer fd]
2015-02-18T15:15:01.609761+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
2015-02-18T15:15:01.666773+01:00 fs1 systemd-logind[611]: Failed to store session release timer fd
</code>
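
Because syslog collapses duplicates into "message repeated N times" lines, a plain grep count understates how often the message occurs. The sketch below adds N for each summary line and 1 for each ordinary line; count_message is a hypothetical helper name, and treating each summary as N occurrences is an assumption about how the summaries should be counted.

```shell
# Count occurrences of a syslog message, expanding
# "message repeated N times: [ ... ]" summaries. Reads log lines on stdin.
count_message() {
    awk -v msg="$1" '
        index($0, msg) == 0 { next }                  # line does not mention msg
        match($0, /message repeated [0-9]+ times/) {  # summary line: add N
            split(substr($0, RSTART, RLENGTH), f, " ")
            total += f[3]
            next
        }
        { total += 1 }                                # ordinary single line
        END { print total + 0 }
    '
}

# Usage:
#   count_message 'Failed to store session release timer fd' < /var/log/messages
```

For the ten lines quoted above this counts 28 occurrences: seven single lines plus summaries of 6, 9 and 6.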


There is plenty of space left on all hard disks. Here's the output of

<code>
 cat /proc/mdstat

Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] 
md0 : active raid1 sda1[0] sdb1[1] sdc1[3] sdd1[4](S)
      16779136 blocks super 1.0 [3/3] [UUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdd2[4](S) sdc2[3] sdb2[1] sda2[0]
      471606080 blocks super 1.0 [3/3] [UUU]
      bitmap: 1/4 pages [4KB], 65536KB chunk

unused devices: <none>
</code>


I really do need help with this; any input is greatly appreciated.


Yours


Paul
Comment 1 Forgotten User gV38hNAnhN 2015-02-19 16:24:00 UTC
Hello,

We also have a server running openSUSE, which has exhibited the same behaviour since the mentioned update.

All services run fine, but terminated SSH sessions show up in the process list as zombie processes, and logging in via SSH (or using su and sudo) takes much longer than usual. Moreover, I cannot start/stop/reload services. These commands fail with the following error message:

<code>
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Connection refused
</code>

Shutdown/reboot fails as well. I performed a hardware reset yesterday, which seemed to fix the issue for a while. The strange behaviour returned 10-12 hours after the reboot. While digging through /var/log/messages I happened upon the following lines:

<code>
2015-02-19T12:56:40.155489+01:00 openSUSE-131-64-minimal kernel: [42374.242520] systemd[1]: segfault at 7f8d9e6519f8 ip 00007f8d9e6519f8 sp 00007fff4cee9868 error 15 in libc-2.18.so[7f8d9e651000+2000]
2015-02-19T12:56:40.739580+01:00 openSUSE-131-64-minimal systemd[1]: Caught <SEGV>, dumped core as pid 27762.
2015-02-19T12:56:40.780896+01:00 openSUSE-131-64-minimal systemd[1]: Freezing execution.
</code>

This seems to be the point at which the strange behaviour started. A segfault in systemd?
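
To find that point on another affected machine, the log can be searched for the three PID 1 messages quoted above. This is a minimal sketch; first_pid1_crash is a hypothetical helper name, and the pattern covers only the message texts that appear in this report.

```shell
# Print the first log line showing PID 1 crashing or freezing.
# Reads syslog lines on stdin.
first_pid1_crash() {
    grep -E 'systemd\[1\]: (segfault at|Caught <SEGV>|Freezing execution)' | head -n 1
}

# Usage:
#   first_pid1_crash < /var/log/messages
```

The timestamp of the printed line can then be compared with the time the D-Bus timeouts begin.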
Comment 2 Forgotten User px8_fL11FF 2015-02-19 16:34:52 UTC
As Thomas wrote, this really seems to start with a segfault in systemd. This is our log content:


<code>
2015-02-18T08:16:43.892154+01:00 fs1-1 kernel: [404687.140461] systemd[1]: segfault at a8 ip 000000000047912e sp 00007fffd0db7110 error 4 in systemd[400000+ed000]
2015-02-18T08:16:44.352680+01:00 fs1-1 systemd[1]: Caught <SEGV>, dumped core as pid 19512.
2015-02-18T08:16:44.411008+01:00 fs1-1 systemd[1]: Freezing execution.
2015-02-18T08:16:01.955797+01:00 fs1-1 systemd-logind[12034]: message repeated 4 times: [ Failed to store session release timer fd]
2015-02-18T08:18:26.939222+01:00 fs1-1 systemd-logind[12034]: Failed to start unit user-0.slice: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T08:18:26.939762+01:00 fs1-1 systemd-logind[12034]: Failed to start user slice: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-18T08:18:26.940061+01:00 fs1-1 /usr/sbin/cron[19515]: pam_systemd(crond:session): Failed to create session: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-16T23:31:56.333429+01:00 fs1-1 dbus[592]: message repeated 3 times: [ [system] Reloaded configuration]
2015-02-18T08:18:26.940421+01:00 fs1-1 dbus[592]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
</code>


Does anybody know whether reverting to systemd 208-23.3 will help? Next time our servers start misbehaving, I'll give it a try and report back here.
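
A rough sketch of how such a downgrade might be attempted with zypper follows. The package list is the one from the update notice in the description. That all six packages share the version string 208-23.3, and that the old packages are still available in a configured repository, are assumptions. downgrade_command is a hypothetical helper that only prints the command so it can be reviewed before running it.

```shell
# Build (but do not run) a zypper command that pins the systemd package set
# to one older version. Assumption: all six packages use the same version.
downgrade_command() {
    version="$1"
    cmd="zypper install --oldpackage"
    for pkg in libgudev-1_0-0 libudev1 systemd systemd-32bit systemd-sysvinit udev; do
        cmd="$cmd $pkg=$version"
    done
    echo "$cmd"
}

# Usage:
#   downgrade_command 208-23.3
```

The running PID 1 is not replaced by a package change alone, so a reboot would presumably still be needed afterwards.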

Paul
Comment 3 Forgotten User gV38hNAnhN 2015-02-19 18:39:29 UTC
This seems to be a duplicate of this bug:
https://bugzilla.opensuse.org/show_bug.cgi?id=918226

One commenter suggests that downgrading systemd fixed the problem.
Comment 4 Bernhard Wiedemann 2015-02-21 08:05:42 UTC
I also think the crash in the newly broken systemd version
is the underlying problem.

*** This bug has been marked as a duplicate of bug 918226 ***