Bug 589788

Summary: Xen hypervisor memory utilization does not reflect sum of domain allocations (leaks over time)
Product: openSUSE 11.2
Reporter: Frank Fejes <ffejes>
Component: Xen
Assignee: Jan Beulich <jbeulich>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P3 - Medium
CC: carnold, jdouglas, jfehlig, laurent.henry, vadim
Version: Final
Target Milestone: unspecified
Hardware: x86-64
OS: openSUSE 11.2
Whiteboard: maint:released:11.2:35929
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Xen Info output for a server exhibiting the behavior
console output after xm debug-key q
console output after xm debug-key H
xm info after the two debug commands
console output after xm debug-key m
sample domain config for memory leak test
tentative fix

Description Frank Fejes 2010-03-19 16:34:14 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2) Gecko/20100218 Fedora/3.6.1-2.fc14 Firefox/3.6

Over the past couple of months I've noticed that a few of our Xen servers are reporting (via xentop) a discrepancy between the sum of VM memory allocations and the memory the hypervisor reports as used.  Over time, free memory drops to the point where no new VMs can be started.  Below is a recent sample of xentop output with the extra columns trimmed.  The hypervisor reports that it is using 32 GB of RAM; however, the sum of the running VMs is 16 GB.  Ballooning is disabled and dom0_mem is set to 4 GB.

xentop - 15:38:20   Xen 3.4.1_19718_04-2.1
12 domains: 2 running, 10 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 33549248k total, 33241552k used, 307696k free    CPUs: 4 @ 2992MHz
     NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%)
xxxxxxxxxx --b---       1584    0.2     547840    1.6     547840       1.6
 Domain-0 -----r     165894   34.3    4192768   12.5   no limit       n/a
xxxxxxxxxx --b---      15166   22.3    3071916    9.2    3076096       9.2
xxxxxxxxxx -----r       1328   77.8    1028012    3.1    1028096       3.1
xxxxxxxxxx --b---      55433    4.7    1027880    3.1    1028096       3.1
xxxxxxxxxx --b---      67527    4.8    1027880    3.1    1028096       3.1
xxxxxxxxxx --b---      73019    6.8    1027880    3.1    1028096       3.1
xxxxxxxxxx --b---      61266    6.8    1027880    3.1    1028096       3.1
xxxxxxxxxx --b---      78880    6.2    1028012    3.1    1028096       3.1
xxxxxxxxxx --b---        969    7.7     516012    1.5     516096       1.5
xxxxxxxxxx --b---      10544    1.5    1027880    3.1    1028096       3.1
xxxxxxxxxx --b---      21001    5.8    1027880    3.1    1028096       3.1
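
As a rough cross-check of the figures above: summing the per-domain MEM(k) column and comparing it with the "used" value on the Mem: line reproduces the roughly 16 GB gap. A minimal shell sketch, with the values transcribed from the table (units are kB as reported by xentop):

```shell
# Sum the per-domain MEM(k) values from the xentop output above and
# compare with the hypervisor's "used" figure from the Mem: line.
total_k=0
for mem_k in 547840 4192768 3071916 1028012 1027880 1027880 \
             1027880 1027880 1028012 516012 1027880 1027880; do
    total_k=$((total_k + mem_k))
done
used_k=33241552
gap_mib=$(( (used_k - total_k) / 1024 ))   # ~16 GiB unaccounted for
echo "domains: ${total_k} kB, hypervisor used: ${used_k} kB, gap: ${gap_mib} MiB"
```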

This behavior is only seen on our "development" Xen servers, where we routinely create/clone/destroy a large number of (mostly HVM) VMs; however, when performing any of those operations I cannot reproduce the "leak" directly.  That is to say, when I create a VM, the hypervisor's free memory statistic drops by an amount corresponding to the size of the VM, and when the VM is destroyed, the free statistic increases back to where it was previously.

I have not been able to find any way to get the hypervisor to reclaim some of the lost memory and, as a result, I have been forced to reboot the server.  On systems with 20+ VMs where we don't have the option of live migration, this can be rather traumatic.

Thank you!

Reproducible: Couldn't Reproduce

Comment 1 Jan Beulich 2010-03-22 13:39:58 UTC
First of all, without a reproduction scenario we're unlikely to be able to do anything about this. We will have to rely on you to narrow down the conditions under which these leaks(?) occur.

Second, please provide the hypervisor's response to debug keys 'q' and 'H' (sent either from the serial console or through "xm debug-key <key>").

Third, please attach full hypervisor logs (ideally, per the above, annotated with when you start observing the bad behavior), obtained with "loglvl=all guest_loglvl=all", as well as "xm info" output.

Finally, please clarify whether you're observing this behavior on just a single system, or multiple (distinct) ones.
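
The requested collection can be sketched as a short shell sequence (assuming the classic xm toolstack; the debug-key output lands in the Xen log rather than on stdout, which is why "xm dmesg" is captured afterwards, and the file names here are arbitrary):

```shell
# Capture hypervisor state for the bug report. Debug-key output is
# written to the Xen log, so "xm dmesg" is saved after sending the keys.
# Degrades to a notice on machines without a Xen toolstack.
if command -v xm >/dev/null 2>&1; then
    xm debug-key q                    # dump domain info
    xm debug-key H                    # second requested debug key
    xm dmesg > xen-debug-keys.log     # hypervisor log, incl. key output
    xm info  > xm-info.txt
    status="collected"
else
    status="skipped (no xm toolstack found)"
fi
echo "collection: $status"
```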
Comment 2 Frank Fejes 2010-03-22 14:30:33 UTC
Thank you.  To clarify, this is occurring on multiple distinct systems.  The systems appear to be our "development" group of servers where VMs are more regularly started/stopped/created/deleted.  

The 'xm debug-key' command exits with a 0 status and no text output for either q or H.

Regarding logs, it's tricky to annotate them with bad behavior information since I have not yet narrowed down any particular time or activity that causes the memory discrepancies to occur.  For example, on the server for the logs I'll attach now, it has held pretty solid at around 61gb used for a couple weeks now, despite the fact that we've only had around 25-30gb worth of VM allocations running at any one time.  Is there any sort of periodic command I could run that would generate information that would be useful in tracking this down?

We haven't been running loglvl=all but I'll make those changes to our grub config now.  Thanks again.
Comment 3 Frank Fejes 2010-03-22 14:32:56 UTC
Created attachment 349791 [details]
Xen Info output for a server exhibiting the behavior
Comment 4 Jan Beulich 2010-03-22 17:00:06 UTC
(In reply to comment #2)
> The 'xm debug-key' command exits with a 0 status and no text output for either
> q or H.

The output goes to the Xen log (i.e. you have to run "xm dmesg" afterwards).
 
> Regarding logs, it's tricky to annotate them with bad behavior information
> since I have not yet narrowed down any particular time or activity that causes
> the memory discrepancies to occur.  For example, on the server for the logs
> I'll attach now, it has held pretty solid at around 61gb used for a couple
> weeks now, despite the fact that we've only had around 25-30gb worth of VM
> allocations running at any one time.  Is there any sort of periodic command I
> could run that would generate information that would be useful in tracking this
> down?

Precisely the actions described above (i.e. the two debug keys plus "xm info").

(In reply to comment #3)
> Created an attachment (id=349791) [details]
> Xen Info output for a server exhibiting the behavior

This isn't useful on its own; it ought to match up with the log and debug key output.
Comment 5 Frank Fejes 2010-03-22 17:20:12 UTC
Attached are the q and H outputs along with a current xm info.  Thanks again.
Comment 6 Frank Fejes 2010-03-22 17:21:17 UTC
Created attachment 349854 [details]
console output after xm debug-key q
Comment 7 Frank Fejes 2010-03-22 17:21:55 UTC
Created attachment 349855 [details]
console output after xm debug-key H
Comment 8 Frank Fejes 2010-03-22 17:22:26 UTC
Created attachment 349856 [details]
xm info after the two debug commands
Comment 9 Jan Beulich 2010-03-24 10:39:50 UTC
Re #6: So there are (many) domains that failed to die completely. We're currently observing a similar problem on SLE11 SP1 (Xen 4.0), but that issue is still under investigation. In any case, according to the provided output, you should be seeing this with more than half of your guests. What's possibly interesting is that while up to domain 100 not a single domain got fully destroyed, after that there are no zombie domains at all (all domains with IDs greater than 100 are still alive). Unless you have an explanation for this (e.g. all properly destroyed VMs having been of another type, such as PV), this might be a hint toward the root of the problem.

What I also note is that shadow-memory guests usually leak far more pages than HAP ones do (and, though statistically meaningless, there's one PV zombie with yet fewer pages left).

In Dom0, do you have more qemu-dm processes than there are guests?

Re #7: Actually, 'm' would probably be better than 'H' in case we need it again.

Would you be able to rebuild the hypervisor if we provided you with a debugging patch?
Comment 10 Frank Fejes 2010-03-24 13:21:56 UTC
That is very interesting.  There should be no difference between the first 100 domains and those created afterward.  On this particular server, each domU is an HVM guest built from the same libvirt XML base template.  Right now I see that dom0 has 17 qemu-dm processes, which corresponds to the number of running domUs.  I'll attach the 'm' output now.

Regarding rebuilding and running a debugging hypervisor, I would certainly be able to do that if it would help.  Finding the time for the eventual reboot could be tricky, though, so I may not be able to give timely feedback.

Thanks again.
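
The qemu-dm head count mentioned above can be scripted rather than eyeballed; a sketch (it assumes a Xen Dom0 with the classic xm toolstack and qemu-dm process name; on any other machine both counts simply come out as 0):

```shell
# Count qemu-dm device-model processes and running domUs; a surplus of
# qemu-dm processes would point at incompletely destroyed HVM guests.
qemu_count=$(ps -C qemu-dm --no-headers 2>/dev/null | wc -l)
# "xm list" prints a header line plus Domain-0 before the domUs.
domu_count=$(xm list 2>/dev/null | tail -n +3 | wc -l)
echo "qemu-dm processes: $qemu_count, running domUs: $domu_count"
```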
Comment 11 Frank Fejes 2010-03-24 13:22:41 UTC
Created attachment 350294 [details]
console output after xm debug-key m
Comment 12 Jan Beulich 2010-04-13 07:03:29 UTC
What block device protocol are you using for your guests? In particular, do you observe a difference in behavior between using tap:... and file: ones?
Comment 13 Frank Fejes 2010-04-16 13:29:14 UTC
(In reply to comment #12)
> What block device protocol are you using for your guests? In particular, do you
> observe a difference in behavior between using tap:... and file: ones?

The vast majority of these are tap:aio, though we have a few file: guests.  Right now we've begun creating all our guests with phy: devices in another environment (CentOS/Xen 3.4.2) and it would appear that we have not triggered this problem.  Is there a thought that perhaps this would be a workaround?  Thanks again.
Comment 14 Jan Beulich 2010-04-16 13:43:30 UTC
(In reply to comment #13)
> The vast majority of these are tap:aio, though we have a few file: guests. 
> Right now we've begun creating all our guests with phy: devices in another
> environment (CentOS/Xen 3.4.2) and it would appear that we have not triggered
> this problem.  Is there a thought that perhaps this would be a workaround? 
> Thanks again.

Depends on which part of your reply you mean by "this". Using file: instead of any of the tap: variants is likely a workaround, which is why we asked the question in comment #12 (i.e. we had hoped you could confirm this theory). Per further analysis done on SLE11 SP1, using tap:tapdisk:aio: might also be a workaround (but it requires one or more tools-side fixes to handle the tapdisk: part of the protocol specification).
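
For reference, the workaround boils down to the protocol prefix on the domain config's disk= line; a sketch with placeholder paths and device names (the tap:tapdisk:aio: form needs the tools-side fixes mentioned above):

```python
# Xen domU config fragment: disk protocol variants discussed in this bug.
# All paths and device names below are placeholders.

# Affected configuration (leaks memory on guest shutdown):
#   disk = [ 'tap:aio:/var/lib/xen/images/guest/disk0,xvda,w' ]

# Possible workarounds:
disk = [ 'file:/var/lib/xen/images/guest/disk0,xvda,w' ]    # loopback file
# disk = [ 'phy:/dev/vg0/guest_disk0,xvda,w' ]              # raw block device
# disk = [ 'tap:tapdisk:aio:/var/lib/xen/images/guest/disk0,xvda,w' ]
```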
Comment 15 Frank Fejes 2010-04-16 17:04:50 UTC
I apologize for the confusion.  I was wondering (which you've now confirmed) if you felt that using file: or phy: could be a possible workaround for our original memory allocation problem.  Thanks.
Comment 18 Vadim Ponomarev 2010-04-30 12:26:27 UTC
My case, with similar memory leaks:

0. openSUSE 11.2, xen 3.4 (distribution) or 4.0 (buildservice)

1. Get the Windows Server 2008 R2 "10-day trial" DVD ISO from http://download.microsoft.com/download/7/5/E/75EC4E54-5B02-42D6-8879-D8D3A25FBEF7/7600.16385.090713-1255_x64fre_server_eval_en-us-GRMSXEVAL_EN_DVD.iso

2. Install Windows using vm-install or a manual config (I will attach a sample config) with disk=[  'tap:qcow2:/path/to/disk/image ....' ]

3. xm info > xm-info-before
4. xm start windows-domu-name
5. shutdown windows normally
6. xm info > xm-info-after

7. diff -u xm-info-before xm-info-after:
...
-free_memory            : 9043
+free_memory            : 8434
 free_cpus              : 0
-max_free_memory        : 9043
-max_para_memory        : 9039
-max_hvm_memory         : 9009
+max_free_memory        : 8434
+max_para_memory        : 8430
+max_hvm_memory         : 8402
...

The leak does not appear with file: instead of tap:qcow2.
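
The cost per cycle follows directly from the diff above; a minimal shell sketch (values taken from the diff, and the exhaustion estimate assumes the leak stays constant per cycle):

```shell
# free_memory (MiB) reported by "xm info" before and after one
# start/stop cycle of the tap:qcow2 guest, from the diff above.
free_before=9043
free_after=8434
leak_per_cycle=$(( free_before - free_after ))
cycles_left=$(( free_after / leak_per_cycle ))
echo "leak per cycle: ${leak_per_cycle} MiB"
echo "roughly ${cycles_left} more cycles until free memory is exhausted"
```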
Comment 19 Vadim Ponomarev 2010-04-30 12:28:50 UTC
Created attachment 358189 [details]
sample domain config for memory leak test
Comment 20 Vadim Ponomarev 2010-04-30 13:25:52 UTC
(In reply to comment #9)
> Would you be able to rebuild the hypervisor if we provided you with a debugging
> patch?

A patch is welcome.
Comment 21 Jan Beulich 2010-04-30 14:39:28 UTC
We meanwhile appear to have a (kernel-side) fix for this, so please test that instead; I'll attach the patch in a second.
Comment 22 Jan Beulich 2010-04-30 14:40:16 UTC
Created attachment 358245 [details]
tentative fix
Comment 23 Vadim Ponomarev 2010-04-30 15:48:01 UTC
(In reply to comment #21)
> We meanwhile appear to have a (kernel-side) fix for this, so please test
> that instead; I'll attach the patch in a second.

The patch helped.  There is no longer any difference in free_memory between before domU start and after domU shutdown, for either tap:qcow2 or tap:aio.
Comment 24 Vadim Ponomarev 2010-04-30 17:42:22 UTC
Shouldn't "severity" be changed to "critical"?  It's IMHO definitely a severe memory leak (~600 MB each run in my case) and could be critical when start/stop cycles are regular (as in the reporter's situation).  Sorry if I'm missing something.
Comment 25 Vadim Ponomarev 2010-05-21 12:36:29 UTC
It looks like the fix has a side effect (dom0 hangs when swap is used after a domU stop).

0. openSUSE 11.2, xen 4.0 (buildservice)
1. kernel /boot/xen.gz vtd=1 iommu=pv dom0_mem=2048M dom0_max_vcpus=2 dom0_vcpus_pin
2. dom0 with a 32 GB swap partition and a kernel with the tentative fix
3. PV domU with some phy:/dev/.. disks (no blktap disks were used)
4. Run perl -n /dev/zero, wait until ~8 GB of swap is in use, then press Ctrl-C
5. xm shutdown <domU>
6. Run perl -n /dev/zero again; dom0 hangs within a few seconds

No problems with the 2.6.31.12-0.2-xen kernel without the tentative fix; I got the expected "out of memory".

Ideas/instructions/patches are welcome.
Comment 26 Jan Beulich 2010-05-21 12:53:09 UTC
How can a fix to code that doesn't get used in your test case be responsible for the bad outcome of the test? Please let's keep separate problems separate (i.e. open new bugs for unrelated problems).
Comment 27 Vadim Ponomarev 2010-05-21 13:25:45 UTC
Oops, it seems this problem is really not related to bug 589788, sorry (it also reproduces with the unmodified 2.6.31.12-0.2-xen dom0 kernel).
Comment 28 Jan Beulich 2010-05-28 06:57:15 UTC
The fix was just committed and will be available in a future kernel maintenance update.
Comment 29 Jan Beulich 2010-06-14 15:31:58 UTC
*** Bug 613490 has been marked as a duplicate of this bug. ***
Comment 30 Henry Laurent 2010-06-14 16:33:06 UTC
Did the situation get better (even just for a while) after a system reboot?
Comment 31 Swamp Workflow Management 2010-09-23 13:09:58 UTC
Update released for: kernel-debug, kernel-debug-base, kernel-debug-base-debuginfo, kernel-debug-debuginfo, kernel-debug-debugsource, kernel-debug-devel, kernel-debug-devel-debuginfo, kernel-default, kernel-default-base, kernel-default-base-debuginfo, kernel-default-debuginfo, kernel-default-debugsource, kernel-default-devel, kernel-default-devel-debuginfo, kernel-desktop, kernel-desktop-base, kernel-desktop-base-debuginfo, kernel-desktop-debuginfo, kernel-desktop-debugsource, kernel-desktop-devel, kernel-desktop-devel-debuginfo, kernel-pae, kernel-pae-base, kernel-pae-base-debuginfo, kernel-pae-debuginfo, kernel-pae-debugsource, kernel-pae-devel, kernel-pae-devel-debuginfo, kernel-source, kernel-source-vanilla, kernel-syms, kernel-trace, kernel-trace-base, kernel-trace-base-debuginfo, kernel-trace-debuginfo, kernel-trace-debugsource, kernel-trace-devel, kernel-trace-devel-debuginfo, kernel-vanilla, kernel-vanilla-base, kernel-vanilla-base-debuginfo, kernel-vanilla-debuginfo, kernel-vanilla-debugsource, kernel-vanilla-devel, kernel-vanilla-devel-debuginfo, kernel-xen, kernel-xen-base, kernel-xen-base-debuginfo, kernel-xen-debuginfo, kernel-xen-debugsource, kernel-xen-devel, kernel-xen-devel-debuginfo, preload-kmp-default, preload-kmp-desktop
Products:
openSUSE 11.2 (debug, i586, x86_64)