Bug 403901 - Xvnc goes to 100% CPU.
Summary: Xvnc goes to 100% CPU.
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 11.0
Classification: openSUSE
Component: X.Org
Version: Final
Hardware: x86 openSUSE 11.0
Importance: P5 - None, Major (5 votes)
Target Milestone: ---
Assignee: Stefan Dirsch
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-06-25 20:15 UTC by Mobeen Azhar
Modified: 2008-10-25 14:33 UTC
CC List: 5 users

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Proposed fix. (767 bytes, patch)
2008-10-21 15:10 UTC, Egbert Eich
xorg-x11-server-extra-7.3-110.10.i586.rpm (4.32 MB, application/octet-stream)
2008-10-21 22:18 UTC, Stefan Dirsch
xorg-x11-server-extra-7.3-110.10.x86_64.rpm (4.60 MB, application/octet-stream)
2008-10-21 22:20 UTC, Stefan Dirsch

Description Mobeen Azhar 2008-06-25 20:15:20 UTC
The same setup under openSUSE 10.3, 10.2 and 10.0 never showed any issues.

The openSUSE 11.0 machines were built with the GNOME desktop. The machines boot to the console (runlevel 3 by default). I log in on the console (or via ssh) and run the following:

/usr/bin/vncserver -alwaysshared -depth 24 -geometry 1280x1024 -dpi 96 &

Then I log out.

~/.vnc/xstartup contains the following:

#!/bin/sh
[ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
xsetroot -solid grey
# eval `dbus-launch --sh-syntax --exit-with-session`
exec gnome-session &

Then I use a VNC viewer from other machines to connect to this machine on port 5901. The GNOME desktop is available and everything runs well for about 2 hours. After that, the VNC viewer stops updating. If I disconnect the viewer, I cannot connect back. If I ssh into the machine and run top, I see Xvnc running at 100% CPU. At that point, running strace on Xvnc shows the following:

select(16, NULL, [15], NULL, {0, 0})    = 0 (Timeout)
write(15, "????????????????????????????????"..., 648) = -1 EAGAIN (Resource temporarily unavailable)

This keeps on repeating at a very rapid rate.
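The strace output above is consistent with a busy-wait: select() is given a {0, 0} timeout, so it returns 0 immediately instead of sleeping, and the following write() fails with EAGAIN because the peer has stopped reading. A minimal sketch of such a loop follows, assuming Linux/POSIX. It is not the Xvnc sockets.c code: a local socketpair() stands in for the TCP connection to the viewer, and the helper names (fill_send_buffer, make_stalled_pair, spin_on_eagain) and the max_spins cap are hypothetical, the cap being there only so the demonstration terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: queue data on a non-blocking socket until write()
 * fails (EAGAIN once the send buffer is full). Returns bytes queued. */
static size_t fill_send_buffer(int sock)
{
    char chunk[4096] = {0};
    size_t queued = 0;
    for (;;) {
        ssize_t n = write(sock, chunk, sizeof chunk);
        if (n < 0)
            break;
        queued += (size_t)n;
    }
    return queued;
}

/* Hypothetical setup: sv[1] plays a viewer that never reads;
 * sv[0] (the "server" end) is non-blocking with a full send buffer. */
static int make_stalled_pair(int sv[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    int flags = fcntl(sv[0], F_GETFL, 0);
    if (flags < 0 || fcntl(sv[0], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    fill_send_buffer(sv[0]);
    return 0;
}

/* The busy-wait pattern: select() with a {0, 0} timeout returns 0 at once,
 * write() fails with EAGAIN, and the loop immediately tries again.
 * max_spins is NOT part of the pattern; it only lets this demo stop.
 * Returns the number of attempts that sent nothing, or -1 on a real error. */
static long spin_on_eagain(int sock, const char *buf, size_t len, long max_spins)
{
    size_t done = 0;
    long spins = 0;
    while (done < len && spins < max_spins) {
        fd_set fds;
        struct timeval tv = {0, 0};            /* zero timeout: poll, never sleep */
        FD_ZERO(&fds);
        FD_SET(sock, &fds);
        (void)select(sock + 1, NULL, &fds, NULL, &tv);

        ssize_t n = write(sock, buf + done, len - done);
        if (n > 0)
            done += (size_t)n;
        else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            spins++;                           /* nothing sent; go round again */
        else
            return -1;
    }
    return spins;
}
```

With a stalled pair, every iteration of spin_on_eagain makes two system calls (the same alternating select()/write() seen in the strace) and makes no progress, which is what would keep the CPU busy.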

The only way to recover VNC (the machine itself seems fine in the ssh session) is to kill the Xvnc process (thereby losing the desktop and all applications running on it) and then repeat the whole process.

This occurs with the "stock" xorg-x11-Xvnc RPM; the issue persists after applying all available updates, and also persists when using xorg-x11-Xvnc-7.3.117.2 from the cutting-edge X repository.
Comment 1 Stefan Dirsch 2008-06-25 22:15:25 UTC
This sounds like one of these issues, which are rather hard to reproduce. Mike, didn't you see exactly this issue some time ago on your machine?
Comment 2 Mike Fabian 2008-06-25 22:25:37 UTC
What I saw is described in bug #379202.
I’m not sure whether this is related or not.
Comment 3 Mobeen Azhar 2008-06-25 23:03:19 UTC
In my case, the bug appears to be independent of VNC viewer options. The bug manifests itself no matter which cursor options are specified in the VNC viewer and whether the viewer runs on *nix or Windows.
Comment 4 Stefan Dirsch 2008-06-26 02:55:42 UTC
Thanks, Mike. Mobeen, which options are you using for vncviewer on which hardware architecture? On which hardware architecture is Xvnc running?
Comment 5 Mobeen Azhar 2008-06-26 03:13:09 UTC
Xvnc is running on an AMD Athlon XP 3000+ CPU.

The VNC viewer was set up as follows:

VNC viewer from RealVNC on Windows XP SP3 on a Dell D600 laptop. Options used: Colour and Encoding set to Auto-select, low (64 colours). Input set to Send pointer events to server, Send keyboard events to server, Send clipboard changes to server, Accept clipboard changes from server. Miscellaneous options set to Full screen, Render cursor locally, Allow dynamic desktop resizing, Beep when requested by server, Offer to automatically reconnect.
Comment 6 Stefan Dirsch 2008-06-26 03:32:33 UTC
So Xvnc is running on openSUSE 11.0 x86_64?

Unfortunately I don't have Windows (XP) available for testing. You told me, that it also happens when running the client on Unix. Can you also reproduce it on openSUSE? Which version/architecture?
Comment 7 Mobeen Azhar 2008-06-26 14:19:00 UTC
Xvnc is running on openSUSE 11.0.  No 64 bit systems here, either for Xvnc or vncviewer.

Yes, it happens with vncviewer on Linux as well, running vncviewer on the same machine as Xvnc and connecting to the machine itself. Here is the vncviewer command line:

vncviewer -shared -encodings "tight zlib hextile corre rre copyrect raw" -bgr233 -compresslevel 9 -quality 0 localhost::5901

The error is intermittent.  I was having it on quite a regular basis (about once every 2 hours) but after this last start, it has been about 24 hours with no issue.  
Comment 8 Stefan Dirsch 2008-06-26 14:31:44 UTC
OK. So it should be possible to reproduce it on an openSUSE 11.0 i386 system with Xvnc and vncviewer running on the same machine, when trying long enough (2-24 hours). That's good.
Comment 9 Mobeen Azhar 2008-06-26 14:40:25 UTC
Correct, Stefan. Are there any diagnostics you would like me to run, or information to gather, in case the issue appears on my machine first?
In my case, the openSUSE machine in question runs the GNOME desktop 24/7 and is accessed almost all the time via VNC from Windows.
Comment 10 Stefan Dirsch 2008-06-26 14:44:01 UTC
Well, in case you're familiar with strace and/or gdb you could connect via strace/gdb to the Xvnc process (pid) to debug the issue. For gdb you should install the xorg-x11-server-debuginfo and xorg-x11-server-debugsource packages first.
Comment 11 Mobeen Azhar 2008-06-26 14:48:55 UTC
I posted the strace results in the beginning of this bug, but those were not taken with the debug packages.
I can try the debug packages, no promises since that will depend on time available to me to change out the packages to the debug ones.
Comment 12 Stefan Dirsch 2008-06-26 15:08:38 UTC
Sorry, I forgot about the strace output in the initial comment. It is not really useful for me, though. You would only need the additional debug packages for using gdb. They can still be installed while Xvnc is already running.
Comment 13 Mobeen Azhar 2008-06-29 02:31:46 UTC
Hang occurred again.  strace showed same output as in my initial posting.
I was looking around for the xorg-x11-server-debuginfo and xorg-x11-server-debugsource packages but could not find them :(

Keep in mind this is using the xorg RPMs from http://download.opensuse.org/repositories/X11:/XOrg/openSUSE_11.0/i586/.  Problem occurs with the "stock" packages as well, but I had upgraded to these thinking this may fix the issue.  Looking at the URL I mention, including "neighboring" URLs (under noarch, src etc), I could not find the debugsource and debuginfo packages.

The problem occurred while I was using vncviewer from a WinXP machine. I disconnected the hung VNC session, opened an ssh session and watched the Xvnc process eating up the CPU. While I was looking around for the debug packages, about 30 minutes later, I noticed that Xvnc's CPU usage had gone back down to normal. At that point I was able to VNC back into the machine with no issues, so the 100% CPU state lasted for about 30 minutes (at least this time; I had never waited this long before, so I cannot say whether that is typical).

Comment 14 Stefan Dirsch 2008-06-29 06:35:36 UTC
Unfortunately our build service neither builds nor provides -debuginfo/-debugsource packages. :-( I don't think the -debuginfo/-debugsource packages from the DVD will match the xorg-x11-Xvnc package from the build service. Therefore I suggest going back to the X packages from the DVD.
Comment 15 Mobeen Azhar 2008-07-01 22:51:16 UTC
Sorry, I have not had time to back-rev the packages to the standard ones and install the debug packages.
The hang occurred again. I stayed disconnected from VNC (while staying ssh'ed in), and after about 30 to 45 minutes the issue went away.
I did run GDB and attached it to the Xvnc process that was eating CPU. I am not sure what the following information is worth, since I do not have debug symbols in place yet, but here is what a backtrace in GDB shows:

(gdb) backtrace
#0  0xffffe424 in __kernel_vsyscall ()
#1  0xb7b51ebd in select () from /lib/libc.so.6
#2  0x08098223 in WriteExact ()
#3  0x0808f1b8 in rfbSendUpdateBuf ()
#4  0x0809005c in rfbSendFramebufferUpdate ()
#5  0x08094170 in rfbDisplayCursor ()
#6  0x083396f1 in AnimCurScreenBlockHandler ()
#7  0x080b6188 in BlockHandler ()
#8  0x0838a294 in WaitForSomething ()
#9  0x080b206e in Dispatch ()
#10 0x080c6265 in main ()
Comment 16 Stefan Dirsch 2008-07-02 05:44:10 UTC
Hmm. Without debug symbols this won't help much. Once you have debug symbols in place, you would need to stop Xvnc in gdb with Ctrl-C, step through the code, set breakpoints, etc. to see where it hangs.
Comment 17 Stefan Dirsch 2008-07-02 11:57:07 UTC
Hope it's ok to set it to NEEDINFO until you reproduced it with -debug packages installed.
Comment 18 Stefan Dirsch 2008-07-03 19:04:16 UTC
It's unlikely, but maybe that's another duplicate of Bug #389386. Updated packages are available (see Bug #389386, comment #51).
Comment 19 Stefan Dirsch 2008-07-10 01:58:54 UTC
Mobeen, any new results?
Comment 20 Mobeen Azhar 2008-07-11 14:55:22 UTC
Hi Stefan, sorry for the delay.  Things are a little bit hectic here at the moment and I have not had a chance to back-rev the Xorg rpms to the stock ones and install the debug ones as well - yet.  Since I needed the machine up and running this week, I temporarily switched to using FreeNX instead of vnc.  I hope to back rev the rpms this weekend, get xvnc to go to 100% cpu, and get you some gdb info.
Comment 21 Mobeen Azhar 2008-07-16 19:50:57 UTC
Hi Stefan, OK, I back-rev'ed xorg-x11-Xvnc to 7.3-110.4 (from the online update repository). Is there a debuginfo RPM for this version? I hope there is, but I cannot find one. Is there a way for me to build from the source RPM and get the debuginfo RPMs generated?
If not, I can (if I have to) back-rev to the xorg-x11-Xvnc RPM as shipped in the distro. Where would I find the debuginfo RPMs for that version?
Comment 22 Stefan Dirsch 2008-07-16 20:26:11 UTC
I'm not sure if we provide debuginfo RPMs for online update packages. Rebuilding xorg-x11-server is a non-trivial task. Instead, I suggest going back to the xorg-x11-Xvnc from the DVD. Hopefully the debuginfo RPMs for it are on the DVD.
Comment 23 Mobeen Azhar 2008-07-16 21:15:21 UTC
I went through the downloaded DVD and found no debug rpms for xorg-x11-Xvnc.  I cannot find the debug RPMS online either (in various repos).  I have the purchased media and I will make my way to it and search there, but I do not have high hopes at this time of being able to find the debug RPMs for xorg-x11-Xvnc.
Comment 24 Mobeen Azhar 2008-07-16 21:24:22 UTC
Ok, I went through the purchased DVD media and no debug RPMS on it either.  Not sure what exactly to do at this point, other than wait for Xvnc to go to 100% CPU (last time it took about 1.5 days to do that), then disconnect my vncviewer and wait approximately 30 or so mins for Xvnc's CPU usage to go down and then continue using it again :(
Comment 25 Stefan Dirsch 2008-07-16 21:25:20 UTC
This is really sad, that we don't provide the debuginfo RPMs at all. :-(
Coolo/Andreas/Marcus, why?
Comment 26 Marcus Meissner 2008-07-16 21:28:24 UTC
we do provide them for the 11.0 GA release. They are listed in the community repositories (look for DEBUG).

We do not provide them for online updates currently AFAIK.
Comment 27 Mobeen Azhar 2008-07-16 21:39:28 UTC
Looking at ftp://mirrors.kernel.org/opensuse/distribution/11.0/repo/oss/suse/i586/, I see a bunch of debug RPMS, but none for anything to do with xorg.  If my eyes are failing me (I ~was~ doing a search in the listing for debug), can you please provide me with the URL for the debug rpm for xorg-x11-Xvnc? 
Comment 29 Mobeen Azhar 2008-07-16 21:49:24 UTC
Thanks Marcus, mea culpa maximus - I saw that location just as bugzilla emailed me Marcus' comments.
I am going to back rev xorg-x11-Xvnc to the one in the original distro, and install the debug RPMs for xorg-x11-server.  Then I will see what GDB can tell me when Xvnc hits 100% cpu.
Comment 30 Stefan Dirsch 2008-07-17 05:10:50 UTC
Marcus, thanks for the link. I was wrong and apologize. Mobeen, thanks for your patience and efforts.
Comment 31 Mobeen Azhar 2008-07-17 17:43:47 UTC
OK, an Xvnc hang occurred after I had back-rev'ed to the stock xorg-x11-server RPMs. When Xvnc got to 100% CPU, I lost my VNC session (it was just frozen, so I disconnected; attempts to reconnect failed). I then ssh'ed into the box, and top indeed showed Xvnc at 97+% CPU.
I ran gdb --pid=<pid of Xvnc>. I am not too familiar with GDB and did not know what you wanted me to do in there. I did a bt; here are the results:

(gdb) bt
#0  0xffffe424 in __kernel_vsyscall ()
#1  0xb7c37ebd in select () from /lib/libc.so.6
#2  0x080981f3 in WriteExact (sock=9, buf=0x8448bd8 '?' <repeats 52 times>,
    len=124) at sockets.c:490
#3  0x0808f188 in rfbSendUpdateBuf (cl=0x874fc50) at rfbserver.c:1827
#4  0x0809002c in rfbSendFramebufferUpdate (pScreen=0x8454d70, cl=0x874fc50)
    at rfbserver.c:1601
#5  0x08094140 in rfbDisplayCursor (pScreen=0x8454d70, pCursor=0x8c65860)
    at sprite.c:2411
#6  0x083393b1 in AnimCurScreenBlockHandler (screenNum=0, blockData=0x0,
    pTimeout=0xbfd25d68, pReadmask=0x8452080) at animcur.c:190
#7  0x080b6198 in BlockHandler (pTimeout=0xbfd25d68, pReadmask=0x8452080)
    at dixutils.c:441
#8  0x08389d74 in WaitForSomething (pClientsReady=0xbfd25da0) at WaitFor.c:223
#9  0x080b207e in Dispatch () at dispatch.c:425
#10 0x080c6005 in main (argc=24, argv=0xbfd262e4, envp=Cannot access memory at address 0x8
) at main.c:452
Comment 32 Stefan Dirsch 2008-07-17 23:37:26 UTC
I'm afraid that's not easy to debug at all. So if you're not really familiar with gdb it doesn't make much sense.
Comment 33 Stefan Dirsch 2008-07-18 09:33:46 UTC
So it hangs in 

  n = select(sock+1, NULL, &fds, NULL, &tv);

(hw/vnc/sockets.c:WriteExact(...) )

Maybe Egbert can help debugging here. Egbert, sources for VNC are in /work/SRC/old-versions/11.0/all/xorg-x11-server/xorg-server-1.4-vnc.patch

Comment 34 Mobeen Azhar 2008-07-18 15:03:47 UTC
I will hit it again with GDB when it goes to 100% CPU. I have been trying to learn more about GDB, specifically with regard to troubleshooting processes that go to 100% CPU. Other than doing a backtrace at the moment the process is hung at 100% CPU, I am not sure what else to do inside GDB.
Grabbing a stack trace (bt) in GDB from a couple of such hangs may be enough, though, to confirm that the code is always stuck in that one particular function.
Comment 35 Mobeen Azhar 2008-07-22 18:33:11 UTC
Here are some more results from GDB after another "hang":

(gdb) bt
#0  0xffffe424 in __kernel_vsyscall ()
#1  0xb7bcc763 in write () from /lib/libc.so.6
#2  0x0809817b in WriteExact (sock=18, buf=0x8448d24 "??????", len=296)
    at sockets.c:456
#3  0x0808f188 in rfbSendUpdateBuf (cl=0x8792da0) at rfbserver.c:1827
#4  0x0809002c in rfbSendFramebufferUpdate (pScreen=0x8454c70, cl=0x8792da0)
    at rfbserver.c:1601
#5  0x08094140 in rfbDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at sprite.c:2411
#6  0x08339235 in AnimCurDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at animcur.c:234
#7  0x080b9d83 in ChangeToCursor (cursor=0x88acad0) at events.c:963
#8  0x080d4968 in ChangeWindowAttributes (pWin=0x893b420, vmask=16384,
    vlist=0x88bd758, client=0x88788c8) at window.c:1491
#9  0x080b1e4b in ProcChangeWindowAttributes (client=0x88788c8)
    at dispatch.c:610
#10 0x083167b4 in XaceCatchDispatchProc (client=0x88788c8) at xace.c:281
#11 0x080b230c in Dispatch () at dispatch.c:502
#12 0x080c6005 in main (argc=24, argv=0xbfac3c34, envp=0x3f3f3f3f)
    at main.c:452

(gdb) finish
Run till exit from #0  0xffffe424 in __kernel_vsyscall ()
[Switching to Thread 0xb7af46c0 (LWP 3069)]
0xb7bcc763 in write () from /lib/libc.so.6
(gdb)

(gdb) bt
#0  0xb7bcc763 in write () from /lib/libc.so.6
#1  0x0809817b in WriteExact (sock=18, buf=0x8448d24 "??????", len=296)
    at sockets.c:456
#2  0x0808f188 in rfbSendUpdateBuf (cl=0x8792da0) at rfbserver.c:1827
#3  0x0809002c in rfbSendFramebufferUpdate (pScreen=0x8454c70, cl=0x8792da0)
    at rfbserver.c:1601
#4  0x08094140 in rfbDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at sprite.c:2411
#5  0x08339235 in AnimCurDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at animcur.c:234
#6  0x080b9d83 in ChangeToCursor (cursor=0x88acad0) at events.c:963
#7  0x080d4968 in ChangeWindowAttributes (pWin=0x893b420, vmask=16384,
    vlist=0x88bd758, client=0x88788c8) at window.c:1491
#8  0x080b1e4b in ProcChangeWindowAttributes (client=0x88788c8)
    at dispatch.c:610
#9  0x083167b4 in XaceCatchDispatchProc (client=0x88788c8) at xace.c:281
#10 0x080b230c in Dispatch () at dispatch.c:502
#11 0x080c6005 in main (argc=24, argv=0xbfac3c34, envp=0x3f3f3f3f)
    at main.c:452

(gdb) next
Single stepping until exit from function write,
which has no line number information.
0xb7c15126 in ?? () from /lib/libc.so.6

(gdb) bt
#0  0xb7c15126 in ?? () from /lib/libc.so.6
#1  0x0808f188 in rfbSendUpdateBuf (cl=0x8792da0) at rfbserver.c:1827
#2  0x0809002c in rfbSendFramebufferUpdate (pScreen=0x8454c70, cl=0x8792da0)
    at rfbserver.c:1601
#3  0x08094140 in rfbDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at sprite.c:2411
#4  0x08339235 in AnimCurDisplayCursor (pScreen=0x8454c70, pCursor=0x88acad0)
    at animcur.c:234
#5  0x080b9d83 in ChangeToCursor (cursor=0x88acad0) at events.c:963
#6  0x080d4968 in ChangeWindowAttributes (pWin=0x893b420, vmask=16384,
    vlist=0x88bd758, client=0x88788c8) at window.c:1491
#7  0x080b1e4b in ProcChangeWindowAttributes (client=0x88788c8)
    at dispatch.c:610
#8  0x083167b4 in XaceCatchDispatchProc (client=0x88788c8) at xace.c:281
#9  0x080b230c in Dispatch () at dispatch.c:502
#10 0x080c6005 in main (argc=24, argv=0xbfac3c34, envp=0x3f3f3f3f)
    at main.c:452


So, from my limited knowledge, it looks like Xvnc may be stuck inside of rfbSendUpdateBuf.
Comment 36 Egbert Eich 2008-10-07 15:23:19 UTC
The strace output in the description actually points to the consumer at file handle 15: this doesn't seem to consume any data any more, that's why this thing is looping forever.
Since WriteExact() gets an EAGAIN it doesn't give up.
Mobeen: when this happens again would you be able to look at /proc/<pid>/fd what this file handle points to?
I'd assume it's a socket for the connection to the client. You could do the same on the client side to see if this socket somehow matches.
Please also do an strace on the client side to see if this is still doing syscalls.
Does this still run in this select loop when you disconnect the viewer? 
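For the /proc check suggested above: on Linux each entry in /proc/<pid>/fd is a symbolic link, and for a socket it reads socket:[<inode>]; the same information is visible with ls -l /proc/<pid>/fd/15. A minimal C sketch, assuming Linux procfs; fd_target and fd_is_socket are hypothetical helper names, not part of Xvnc.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: resolve the Linux symlink /proc/<pid>/fd/<fd>.
 * On success writes the target (e.g. "socket:[123456]") into out and
 * returns 0; returns -1 if the descriptor does not exist or access is denied. */
static int fd_target(pid_t pid, int fd, char *out, size_t outlen)
{
    char path[64];
    if (outlen == 0)
        return -1;
    snprintf(path, sizeof path, "/proc/%ld/fd/%d", (long)pid, fd);

    ssize_t n = readlink(path, out, outlen - 1);   /* readlink does not NUL-terminate */
    if (n < 0)
        return -1;
    out[n] = '\0';
    return 0;
}

/* Hypothetical convenience: is the descriptor a socket (e.g. a viewer connection)? */
static int fd_is_socket(pid_t pid, int fd)
{
    char target[256];
    if (fd_target(pid, fd, target, sizeof target) != 0)
        return 0;
    return strncmp(target, "socket:[", 8) == 0;
}
```

For this report it would be called as fd_target(<pid of Xvnc>, 15, buf, sizeof buf). Reading another process's /proc/<pid>/fd typically requires running as the same user as Xvnc, or as root.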
Comment 37 Egbert Eich 2008-10-21 15:10:41 UTC
Created attachment 246901
Proposed fix.

The code where it busy loops is indeed bogus: when the connection peer (the viewer) is no longer reading data (because the network cable was cut or the viewer is broken), Xvnc can indeed busy loop in this code. I have not been able to get Xvnc into this state, mainly because the TCP buffer accepts enough data that, before it stops accepting further writes, some function is called to read data back from the client; that function detects that the client is not responding and stops updating it.
I've re-enabled the timeout code, which will return an error after a while, which in turn will usually cause the connection to this viewer to be closed. This is better than busy looping forever trying to send out data. The user should then at least be able to start a new viewer and reconnect.
Re-enabling the timeout code should not have any negative side effects, as select() will return as soon as data can be written again, while the current code just spins over the select().
It's not the best solution; simply returning without an error after a while, pretending that everything has been sent, might also be OK.
Then higher-level code could handle a possible disconnect. Should the client come back, there may be some drawing artefacts due to dropped content, but it should be possible to fix this with a screen refresh.
Since I cannot trigger the error condition, I cannot test this.
I'm sure the cause of this problem is the viewer, but I cannot really debug it without being able to trigger it.
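The approach of the proposed fix, waiting with a real timeout instead of spinning and giving up with an error once the viewer has stopped accepting data, can be sketched as follows. This is not the attached patch: the function names write_exact_timeout and set_nonblocking, the use of poll() in place of select(), and the per-stall millisecond timeout are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; write_exact_timeout expects this. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    return 0;
}

/* Hypothetical sketch (not the attached patch): write all len bytes, but give
 * up if the peer accepts nothing for stall_timeout_ms milliseconds.
 * Returns 0 on success, or -1 with errno set (ETIMEDOUT for a stalled peer). */
static int write_exact_timeout(int fd, const char *buf, size_t len, int stall_timeout_ms)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            buf += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                          /* interrupted: just retry */
        if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                         /* genuine error, report it */

        /* Buffer full: sleep in poll() until the descriptor is writable
         * again or the stall timeout expires, instead of spinning. */
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int r = poll(&pfd, 1, stall_timeout_ms);
        if (r == 0) {
            errno = ETIMEDOUT;                 /* caller should drop this viewer */
            return -1;
        }
        if (r < 0 && errno != EINTR)
            return -1;
    }
    return 0;
}
```

Because the timeout restarts on every wait, a slow but live viewer that keeps draining its socket is not dropped; only a peer that accepts nothing for the whole interval gets ETIMEDOUT, after which the caller would close the connection. The alternative mentioned above, pretending the data was sent, would correspond to returning 0 instead of -1 in the r == 0 branch and relying on a later screen refresh.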
Comment 38 Stefan Dirsch 2008-10-21 22:18:24 UTC
Created attachment 247001
xorg-x11-server-extra-7.3-110.10.i586.rpm
Comment 39 Stefan Dirsch 2008-10-21 22:20:02 UTC
Created attachment 247002
xorg-x11-server-extra-7.3-110.10.x86_64.rpm
Comment 40 Stefan Dirsch 2008-10-21 22:21:01 UTC
Please test the RPMs, where Egbert's patch has been applied.
Comment 41 Mobeen Azhar 2008-10-22 14:37:40 UTC
Unable to reproduce the issue so far with the attached patch - dare I raise my hopes up!  Thanks,
Comment 42 Stefan Dirsch 2008-10-22 14:43:46 UTC
OK. I think we should wait about a week or so, until we can be sure the patch helps, before adding it to the package. So I suggest you confirm next week whether the issue has occurred again. Thanks.
Comment 43 Stefan Dirsch 2008-10-25 10:40:12 UTC
I think it should be safe to apply the patch now, so we have a fix for openSUSE 11.1/SLE11. Please let me know in case you run into this issue again.
Comment 44 Stefan Dirsch 2008-10-25 13:03:00 UTC
fixed for 11.1/SLE11.
Comment 45 Mobeen Azhar 2008-10-25 14:33:56 UTC
Excellent work people and many thanks.