|
Bugzilla – Full Text Bug Listing |
| Summary: | X.org busy-loops (100% CPU in a select() loop) while desktop is locked | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.1 | Reporter: | Michael Matz <matz> |
| Component: | X.Org | Assignee: | Egbert Eich <eich> |
| Status: | RESOLVED DUPLICATE | QA Contact: | E-mail List <xorg-maintainer-bugs> |
| Severity: | Normal | ||
| Priority: | P3 - Medium | CC: | captain.magnus, hub, lnussel, sndirsch |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | x86 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | Development | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
More data points: Ludwig seems to have the same problem (also KDE3, but intel graphics on his laptop). Not reproducible with just 'xset dpms ...' it seems, though on this particular bug it can also mean that we just didn't wait long enough for it to show up. I can reproduce it now. From a remote machine I simply do
xset dpms force standby
KDE3 will notice this, and starts kdesktop_lock and the kss module (the program
that paints something nice), immediately SIGSTOPs that kss module (because
due to DPMS it wouldn't be visible anyway), and then the countdown to the bug
starts.
stracing X then reveals that the select() loop starts with a 870 something
seconds timeout, which is reduced with each call to select, by the amount it
took for the select to return plus 20 milli-seconds. So, for the first
~870 seconds after the screensaver becomes active all is well, as there's
still a time-out so select only returns seldomly.
After these ~870 seconds the timeout value is 0, and _doesn't_ get reset to
something larger again. From then on X will take 100% CPU. I can reproduce
this reliably, but it takes some time to show up (exactly the ~870 seconds).
I've tried looking at the patches that went in after beta4. The only thing
which barely is related to screen-saving seems to be dpms_screensaver.diff,
from 2008-11-28. I think this was shortly after beta4 was released. The patch
itself looks innocent (just introduces a call to dixSaveScreens()), but
perhaps it triggers a bug somewhere else (in the timeout calculations for
select).
[probably irrelevant:
grepping for "select" in all diffs of xorg-x11-server reveals some occurences
in xorg-server-xf4vnc.patch and xorg-server-xf4vnc-busyloop.diff, where
at least some in xf4vnc.patch also explicitely use a {0,0} timeout. That
at least looks dubious, as that usually is not what one wants. But as this
is vnc, it's most probably unrelated.]
Man! WaitFor.c is terrible spaghetti code. I would see several reasons why the select timeout could be set to zero in WaitForSomething. In the SMART_SCHEDULE code, if there are any ClientsWithInput, or in general if there are timers with an expire time in the past. It might be that when the screen is off (DPMS on) that either the timers don't go away (even when they are expired) or that ClientsWithInput never goes empty. Presumably that would make sense when the DPMS code wants to ensure that when the display is off that as few CPU is used as possible, and due to a bug there's possibly too much pruning or early-out so that events which would normally remove expired timers or clients with input don't take action. The dpms_screensaver.diff introduces a call to dixSaveScreens() which wasn't there before (and without error handling) which does fiddle with timers a bit, so maybe that's a lead. Or possibly I'm completely off :-) I could reproduce using "xset dpms force standby" and patient waiting now. The behavior seen in strace is different for me though. select is called with one second timeout {1, 0} and then all of a sudden the timeout is lower {0, 726000} once and in all following calls zero. Just before the broken select call X does this:
writev(40, [{"B\1\232\0\1\0\200\3\0\0\0\0\221\274\r\0\0\0\0\0\217\274\r\0\340\212 \0\1\307\240\267", 32}], 1) = 32
fd 40 is one of the many fds on /tmp/.X11-unix/X0
So that means your initial select timeout is 1 second (where it is about 870 for me)? That's fully possible, but still would hint at the same problem. X has a list of timers (with an expire time). The select timeout is calculated in such a way that it times out at or before the next timer has to fire. The list of timers depend on the applications running on the display, so slight changes there are expected. But the general theme is still the same. My assumption is that this list of timers is not cleaned up in the bug case, so that even already expired timers stay in the list. Due to compensation code for negative offsets this would lead exactly to the picture of zero-timeout selects. (the other possibility for that to happen is with the SMART_SCHEDULE code as mentioned above) Btw. Ludwig: You can find out which application is connected to the socket corresponding to FD 40 in Xorg in the following way: # lsof -p $(pidof X) | grep '40u.*unix' X 6191 root 40u unix 0xf7c94280 0t0 20618 /tmp/.X11-unix/X0 # lsof | grep 20617 kopete 6598 matz 4u unix 0xf7c94080 0t0 20617 socket What you need to look out for is a socket filedescriptor with an inode of one less than the inode number of the socket on the X11 side (above 20618 on X side, 20617 on kopete side). This obviously depends on how the kernel invents these inode numbers, but for my kernel (32bit kernel-pae-2.6.27.7-9.1) it uses the above rules. Thanks for the hint. The receiving process was kded4. That suspicious write didn't show up again though. Yet if I kill kde4 stuff (kded4, kdeinit4) the busy loop stops. And you can't reproduce it anymore when kde{d,init}4 isn't running anymore?
Then this one triggers the bug in Xorg, maybe a nice hint.
This bug sounds related (if not identical) to https://bugzilla.novell.com/show_bug.cgi?id=472046 (I have the exact same trace as described in the description). The patch attached in bug#472046 fixes it for me. dup. *** This bug has been marked as a duplicate of bug 472046 *** |
This bug happens for me since I updated to 11.1 final from 11.1 beta4, and it's extremely annoying for my office mates because my CPU fan is very loud. The error mode is a bit strange. It happens after some time after the desktop is locked (in my case via kdesktop_lock). That first spawns a screen-saver and after some time kills it (to not uselessly use CPU) and switches off the monitor. After the monitor is switched off it takes some more time for the bug to show up. The symptom of it is that X takes 100% CPU without anyone doing anything on the system. The reason is a busy-loop constructed with zero timeout selects, as can be seen in an strace taken from a remote system (i.e. without unlocking the desktop). The relevant excerpt from that strace looks like so: 13:48:02.254152 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) 13:48:02.254289 clock_gettime(CLOCK_MONOTONIC, {174665, 645523431}) = 0 13:48:02.254383 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0 13:48:02.254483 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 13:48:02.254581 clock_gettime(CLOCK_MONOTONIC, {174665, 645815070}) = 0 13:48:02.254675 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) 13:48:02.254812 clock_gettime(CLOCK_MONOTONIC, {174665, 646046604}) = 0 13:48:02.254906 setitimer(ITIMER_REAL, {it_interval={0, 20000}, it_value={0, 20000}}, NULL) = 0 13:48:02.255006 setitimer(ITIMER_REAL, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 13:48:02.255105 clock_gettime(CLOCK_MONOTONIC, {174665, 646338413}) = 0 13:48:02.255198 select(256, [1 3 5 6 12 13 14 15 18 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66], NULL, NULL, {0, 0}) = 0 (Timeout) Note how select() is called with a zero timeout (i.e. return immediately no matter if some FD is active or not). This happens ad nauseam, and causes the 100% CPU load. It goes away as soon as the desktop is unlocked (then Xorg starts using reasonable timeout ala 175 seconds again). It also goes away if someone moves the mouse (which causes the monitor to be switched on, the graphical screensaver to be started again, hence we then _have_ an active X11 client again, which somehow breaks the busy loop). Unfortunately it's slightly annoying to reproduce as it needs some time after the monitor is switched off to materialize. I.e. it doesn't start immediately after the monitor is switched off, but only after some time (Richard tells me that he had to move the mouse like every 20 minutes which would translate to something like 10 minutes after the monitor is switched off again). I'm fairly sure this didn't happen in 11.1 beta4. xorg-x11-server-7.4-17.3 x11-video-nvidiaG02-180.22-4.1 nvidia-gfxG02-kmp-pae-180.22_2.6.27.7_9.1-4.1 (yes, nvidia)