|
Bugzilla – Full Text Bug Listing |
| Summary: | syslog-ng needs 100% cpu usage on remote logging | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.1 | Reporter: | Martin Vogt <vogt> |
| Component: | Other | Assignee: | Marius Tomaschewski <mt> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | jreuter, maintenance |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | maint:released:11.1:30566 maint:released:11.2:30566 | ||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
|
Description
Martin Vogt
2009-09-24 14:23:42 UTC
Hmm.. I've tried to reproduce this, but without any luck. I've some error/time_reopen messages in the local log: Jan 15 11:15:40 xanthos syslog-ng[20515]: syslog-ng starting up; version='2.0.9' Jan 15 11:16:12 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='5', error='Connection refused (111)' Jan 15 11:16:12 xanthos syslog-ng[20515]: Connection broken; time_reopen='60' Jan 15 11:17:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)' Jan 15 11:17:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60' Jan 15 11:18:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)' Jan 15 11:18:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60' Jan 15 11:19:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)' Jan 15 11:19:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60' [...] Jan 15 11:28:45 xanthos syslog-ng[20515]: I/O error occurred while writing; fd='9', error='Connection refused (111)' Jan 15 11:28:45 xanthos syslog-ng[20515]: Connection broken; time_reopen='60' and when I run a loop with while true ; do logger ============== ; echo -n . ; done something like 2-6% CPU usage: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20515 root 20 0 10996 1380 604 S 3 0.0 1:00.51 syslog-ng this is IMO OK. Otherwise (no input to process), 0% CPU usage: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 20515 root 20 0 10996 1380 604 S 0 0.0 1:03.93 syslog-ng Above test was with syslog-ng-2.0.9-27.3.x86_64 on 11.1 (same results as with syslog-ng-2.0.9-29.1.x86_64 on 11.2), that is with the RPM released with 11.1 GA: http://download.opensuse.org/distribution/11.1/repo/oss/suse/x86_64/syslog-ng-2.0.9-27.3.x86_64.rpm Or are you using syslog-ng < 1.6.9 with libol < 0.3.17 on 11.1 instead of the 2.0.x series? The 1.x series has had a bug fixed in libol-0.3.17 (bug 537314). Hello, I still can reproduce the bug: Swap: 1048568k total, 168k used, 1048400k free, 2796316k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 23438 root 20 0 23204 924 612 S 100 0.0 1:53.00 syslog-ng Have you used a remote host which is on (ping) but has not remote syslog running. If I have only a dns entry put the machine is off I could reproduce it either. Martin PS: I dont think thich should be closed, please give it another try. root@pxe3[syslog-ng]>rpm -qa | grep syslog syslog-ng-2.0.9-27.22 (In reply to comment #4) > Hello, > > I still can reproduce the bug: > > Swap: 1048568k total, 168k used, 1048400k free, 2796316k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 23438 root 20 0 23204 924 612 S 100 0.0 1:53.00 syslog-ng > > Have you used a remote host which is on (ping) but has not remote > syslog running. Yes. > If I have only a dns entry put the machine is off I could reproduce > it either. OK, I'll try now. > PS: I dont think thich should be closed, please give it another try. Sure. Hello, only to correct my spelling mistakes in the previous mail: - I can reproduce the bug if the remote machine is on (ping), but has no syslogd with remote logging capability running. - The bug is not reproducible if I only have a valid dns entry but the machine is off (no ping) And it takes around ~2 minutes before syslog-ng east up 100% Martin >> If I have only a dns entry put the machine is off I could reproduce >> it either. Should be >> If I have only a dns entry _b_ut the machine is off I could_n't_ reproduce >> it either. Well: there shouldn't be a firewall running too..... OK, thanks!
Now I was able to reproduce it by putting a "loghost" dummy IP to DNS
instead of /etc/hosts and by playing with adding/removing the IP on
the log server interface and starting/stopping the daemon there
(don't know where exactly the difference to earlier test is, but...)
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])
poll([{fd=5, events=0}, {fd=6, events=0}, {fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 4, 1086000) = 1 ([{fd=5, revents=POLLERR}])^C
Process 21649 detached
And I found also fixes in the upstream git that looks like they would
address exactly this problem:
commit 8295be1303fab60405fe50cbcfbc5b0865fd34a4
Author: Balazs Scheidler <bazsi@balabit.hu>
Date: Mon Oct 5 12:33:07 2009 +0200
[logwriter] fixed a 100% CPU usage if a udp destination is down
Sometimes when a UDP destination is down we get POLLERR from poll. This
was not always propagated to the dispatch handler by GLib if
pollfd.events was 0.
This patch makes sure that at least G_IO_ERR is present in the "events"
field in logreader/logwriter.
Reported-By: Sandor Geller
[...]
commit 80d6f2147b7eaee118ca0573eb9e354170aa4362
Author: Balazs Scheidler <bazsi@balabit.hu>
Date: Thu Nov 6 11:53:32 2008 +0100
fixed 100% CPU spinning on HP-UX upon a connection break
As I see the problem seems to be caused by the fact that HP-UX returns
POLLERR only without the other bits (e.g. POLLHUP) syslog-ng would
handle this gracefully if either the other bits would be set, or there'd
be some pending messages to send, in which case a normal write() error
would occur.
I'll prepare a package with this fix and provide an Url to a test RPM soon.
Thanks. I noticed the problem on my machines when the remote logging server crashed. 100% cpu load can be annoying on every machine in eg. a cluster. I will then wait for the fix. bye, Martin OK, a test package with applied patches will appear at: http://download.opensuse.org/repositories/home:/mtomaschewski:/branches:/openSUSE:/11.1/openSUSE_11.1/ when it is built. Would you test it please? Please verify, that the "rpm -qp --changelog syslog-ng*.rpm" changelog entry is visible: Fri Jan 15 15:31:16 CET 2010 - mt@suse.de - Applied POLLERR handling fixes for logreader/logwriter causing 100% CPU usage e.g. when a UDP destination is down (bnc#541802). Hello, I can confirm that the 100% CPU load is now gone in my setup. regards, Martin OK, same to me, thanks!! submitting the package to SLE-11:Update:Test. The SWAMPID for this issue is 30492. Please submit the patch and patchinfo file using this ID. (https://swamp.suse.de/webswamp/wf/30492) Update released for: syslog-ng, syslog-ng-debuginfo, syslog-ng-debugsource Products: SLE-DEBUGINFO 11 (i386, ia64, ppc64, s390x, x86_64) SLE-DESKTOP 11 (i386, x86_64) SLE-SERVER 11 (i386, ia64, ppc64, s390x, x86_64) Update released for: syslog-ng, syslog-ng-debuginfo, syslog-ng-debugsource Products: openSUSE 11.1 (debug, i586, ppc, x86_64) openSUSE 11.2 (debug, i586, x86_64) *** Bug 587508 has been marked as a duplicate of this bug. *** This is an autogenerated message for OBS integration: This bug (541802) was mentioned in https://build.opensuse.org/request/show/672696 15.1 / syslog-ng |