Bug 541802

Summary: syslog-ng needs 100% cpu usage on remote logging
Product: [openSUSE] openSUSE 11.1 Reporter: Martin Vogt <vogt>
Component: OtherAssignee: Marius Tomaschewski <mt>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: jreuter, maintenance
Version: Final   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard: maint:released:11.1:30566 maint:released:11.2:30566
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Martin Vogt 2009-09-24 14:23:42 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.12) Gecko/2009071200 SUSE/3.0.12-0.1.2 Firefox/3.0.12 Mnenhy/0.7.6.666

Hello,

today I noticed 100% CPU usage on my local syslog-ng daemon.
I straced it:

endless of these messages:

>events=POLLIN}, {fd=5, events=POLLIN}], 8, 1028000) = 1 ([{fd=10, >revents=POLLERR}])
>poll([{fd=7, events=0}, {fd=6, events=0}, {fd=8, events=0}, {fd=9, events=0}, >{fd=10, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}, {fd=5, >events=POLLIN}], 8, 1028000) = 1 ([{fd=10, revents=POLLERR}])
>poll([{fd=7, events=0}, {fd=6, events=0}, {fd=8, events=0}, {fd=9, events=0}, 
[...]

Further debugging revealed:

1. the local syslog is configured to forward the messages to a loghost:
I added locally in /etc/syslog-ng/syslog-ng.conf :

(*)
>destination a_udp { udp( "loghost" port(514) ); };
>log { source(src); destination(a_udp); };

This forwards the locally generated messages to "loghost".
But the remote syslog-ng was not running(!)

Thus the bug is:

- syslog-ng produces 100% cpu usage, if it send logging to a remote
location which is not availabe (eg crashed)

In the local log file you will find:

Sep 24 16:07:10  syslog-ng[6128]: I/O error occurred while writing; fd='6'
, error='Connection refused (111)'

top:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 6246 root      20   0 23204  808  516 R  100  0.0   0:03.32 syslog-ng  

The cpu load of 100% occurs for around five minutes, then it comes down to normal cpu usage for maybe a minute, and then it starts over again with 100%.

This is a bug IMHO.

The rpm -qi for syslog-ng says:

>This package contains Novell/SUSE specific modifications / extensions,
>please report problems using Bugzilla at https://bugzilla.novell.com/
>before you contact the authors.

and among the features:

* better network forwarding

:-) (not)

regards,

Martin

PS: I know that this bugs only shows up if something is wrong, but syslog-ng
does not handle this corner case fine.







Reproducible: Always

Steps to Reproduce:
1.add (*) to you local syslog-ng.conf 
2.change "loghost" to something which exists, but _not_ your local machine name
3.restart syslog
4. monitor with "top"
Actual Results:  
100 % cpu usage

Expected Results:  
syslog should ignore it if it cannot send udp messages or something
but should not eat up 100% cpu.
Comment 1 Marius Tomaschewski 2010-01-15 11:20:38 UTC
Hmm.. I've tried to reproduce this, but without any luck.

I've some error/time_reopen messages in the local log:

Jan 15 11:15:40 xanthos syslog-ng[20515]: syslog-ng starting up; version='2.0.9'
Jan 15 11:16:12 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='5', error='Connection refused (111)'
Jan 15 11:16:12 xanthos syslog-ng[20515]: Connection broken; time_reopen='60'
Jan 15 11:17:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)'
Jan 15 11:17:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60'
Jan 15 11:18:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)'
Jan 15 11:18:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60'
Jan 15 11:19:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)'
Jan 15 11:19:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60'
[...]
Jan 15 11:28:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)'
Jan 15 11:28:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60'

and when I run a loop with
  while true ; do logger ============== ; echo -n . ; done

something like 2-6% CPU usage:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
20515 root      20   0 10996 1380  604 S    3  0.0   1:00.51 syslog-ng          

this is IMO OK.

Otherwise (no input to process), 0% CPU usage:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
20515 root      20   0 10996 1380  604 S    0  0.0   1:03.93 syslog-ng
Comment 2 Marius Tomaschewski 2010-01-15 11:33:52 UTC
Above test was with syslog-ng-2.0.9-27.3.x86_64 on 11.1 (same results as
with syslog-ng-2.0.9-29.1.x86_64 on 11.2), that is with the RPM released
with 11.1 GA:

http://download.opensuse.org/distribution/11.1/repo/oss/suse/x86_64/syslog-ng-2.0.9-27.3.x86_64.rpm
Comment 3 Marius Tomaschewski 2010-01-15 11:39:20 UTC
Or are you using syslog-ng < 1.6.9 with libol < 0.3.17 on 11.1
instead of the 2.0.x series?
The 1.x series has had a bug fixed in libol-0.3.17 (bug 537314).
Comment 4 Martin Vogt 2010-01-15 11:49:43 UTC
Hello,

I still can reproduce the bug:

Swap:  1048568k total,      168k used,  1048400k free,  2796316k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
23438 root      20   0 23204  924  612 S  100  0.0   1:53.00 syslog-ng    

Have you used a remote host which is on (ping) but has not remote
syslog running.
If I have only a dns entry put the machine is off I could reproduce 
it either.

Martin

PS: I dont think thich should be closed, please give it another try.
  

root@pxe3[syslog-ng]>rpm -qa | grep syslog
syslog-ng-2.0.9-27.22
Comment 5 Marius Tomaschewski 2010-01-15 12:19:23 UTC
(In reply to comment #4)
> Hello,
> 
> I still can reproduce the bug:
> 
> Swap:  1048568k total,      168k used,  1048400k free,  2796316k cached
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
> 23438 root      20   0 23204  924  612 S  100  0.0   1:53.00 syslog-ng    
> 
> Have you used a remote host which is on (ping) but has not remote
> syslog running.

Yes.

> If I have only a dns entry put the machine is off I could reproduce 
> it either.

OK, I'll try now.

> PS: I dont think thich should be closed, please give it another try.

Sure.
Comment 6 Martin Vogt 2010-01-15 12:35:32 UTC

Hello,

only to correct my spelling mistakes in the previous mail:

- I can reproduce the bug if the remote machine is on (ping),
but has no syslogd with remote logging capability running.

- The bug is not reproducible if I only have a valid dns entry
but the machine is off (no ping)

And it takes around ~2 minutes before syslog-ng east up 100%

Martin


>> If I have only a dns entry put the machine is off I could reproduce 
>> it either.
Should be
>> If I have only a dns entry _b_ut the machine is off I could_n't_ reproduce 
>> it either.
Comment 7 Martin Vogt 2010-01-15 12:37:27 UTC
Well: there shouldn't be a firewall running too.....
Comment 8 Marius Tomaschewski 2010-01-15 13:42:38 UTC
OK, thanks!

Now I was able to reproduce it by putting a "loghost" dummy IP to DNS
instead of /etc/hosts and by playing with adding/removing the IP on
the log server interface and starting/stopping the daemon there
(don't know where exactly the difference to earlier test is, but...)

poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])^C
Process 21649 detached

And I found also fixes in the upstream git that looks like they would
address exactly this problem:

commit 8295be1303fab60405fe50cbcfbc5b0865fd34a4
Author: Balazs Scheidler <bazsi@balabit.hu>
Date:   Mon Oct 5 12:33:07 2009 +0200

    [logwriter] fixed a 100% CPU usage if a udp destination is down
    
    Sometimes when a UDP destination is down we get POLLERR from poll. This
    was not always propagated to the dispatch handler by GLib if
    pollfd.events was 0.
    
    This patch makes sure that at least G_IO_ERR is present in the "events"
    field in logreader/logwriter.
    
    Reported-By: Sandor Geller

[...]

commit 80d6f2147b7eaee118ca0573eb9e354170aa4362
Author: Balazs Scheidler <bazsi@balabit.hu>
Date:   Thu Nov 6 11:53:32 2008 +0100

    fixed 100% CPU spinning on HP-UX upon a connection break
    
    As I see the problem seems to be caused by the fact that HP-UX returns
    POLLERR only without the other bits (e.g. POLLHUP) syslog-ng would
    handle this gracefully if either the other bits would be set, or there'd
    be some pending messages to send, in which case a normal write() error
    would occur.

I'll prepare a package with this fix and provide an Url to a test RPM soon.
Comment 9 Martin Vogt 2010-01-15 14:27:05 UTC
Thanks.
I noticed the problem on my machines when the remote logging server
crashed. 

100% cpu load can be annoying on every machine in eg. a cluster.
I will then wait for the fix.

bye,

Martin
Comment 10 Marius Tomaschewski 2010-01-15 15:29:48 UTC
OK, a test package with applied patches will appear at:

http://download.opensuse.org/repositories/home:/mtomaschewski:/branches:/openSUSE:/11.1/openSUSE_11.1/

when it is built. Would you test it please?

Please verify, that the "rpm -qp --changelog syslog-ng*.rpm" changelog entry is
visible:

Fri Jan 15 15:31:16 CET 2010 - mt@suse.de
- Applied POLLERR handling fixes for logreader/logwriter causing
  100% CPU usage e.g. when a UDP destination is down (bnc#541802).
Comment 11 Martin Vogt 2010-01-21 14:03:42 UTC
Hello,

I can confirm that the 100% CPU load is now gone in my setup.

regards,

Martin
Comment 12 Marius Tomaschewski 2010-01-25 20:32:27 UTC
OK, same to me, thanks!!

submitting the package to SLE-11:Update:Test.
Comment 14 Swamp Workflow Management 2010-01-26 11:05:11 UTC
The SWAMPID for this issue is 30492.
Please submit the patch and patchinfo file using this ID.
(https://swamp.suse.de/webswamp/wf/30492)
Comment 17 Swamp Workflow Management 2010-03-23 14:12:47 UTC
Update released for: syslog-ng, syslog-ng-debuginfo, syslog-ng-debugsource
Products:
SLE-DEBUGINFO 11 (i386, ia64, ppc64, s390x, x86_64)
SLE-DESKTOP 11 (i386, x86_64)
SLE-SERVER 11 (i386, ia64, ppc64, s390x, x86_64)
Comment 18 Swamp Workflow Management 2010-03-23 14:13:36 UTC
Update released for: syslog-ng, syslog-ng-debuginfo, syslog-ng-debugsource
Products:
openSUSE 11.1 (debug, i586, ppc, x86_64)
openSUSE 11.2 (debug, i586, x86_64)
Comment 19 Marius Tomaschewski 2010-09-22 14:01:47 UTC
*** Bug 587508 has been marked as a duplicate of this bug. ***
Comment 20 Swamp Workflow Management 2019-02-08 08:41:22 UTC
This is an autogenerated message for OBS integration:
This bug (541802) was mentioned in
https://build.opensuse.org/request/show/672696 15.1 / syslog-ng