Bug 551695

Summary: Xen dom0 crashes when domU uses phy: lvm volume as disk - Xen unusable!
Product: [openSUSE] openSUSE 11.2 Reporter: flo gleixner <gleixner>
Component: Xen Assignee: Jan Beulich <jbeulich>
Status: RESOLVED DUPLICATE QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P2 - High CC: gleixner, jdouglas
Version: Final   
Target Milestone: unspecified   
Hardware: x86-64   
OS: openSUSE 11.2   
Whiteboard:
Found By: Community User Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: screen shot of crash dump
2x boot of domU

Description flo gleixner 2009-11-01 07:21:32 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.14) Gecko/2009090900 SUSE/3.0.14-0.1.2 Firefox/3.0.14

Since yesterday I have been getting these messages:


Bad page state in process 'syslog-ng'
Nov  1 08:37:13 willie kernel: page:ffff88000a8f0f00 flags:0x8000000000000800 mapping:0000000000000000 mapcount:0 count:0
Nov  1 08:37:13 willie kernel: Trying to fix it up, but a reboot is needed
Nov  1 08:37:13 willie kernel: Backtrace:
Nov  1 08:37:13 willie kernel: Pid: 3292, comm: syslog-ng Tainted: G    B     2.6.27.37-0.1-xen #1
Nov  1 08:37:13 willie kernel: 
Nov  1 08:37:13 willie kernel: Call Trace:
Nov  1 08:37:13 willie kernel:  [<ffffffff8020c597>] show_trace_log_lvl+0x41/0x58
Nov  1 08:37:13 willie kernel:  [<ffffffff80464df3>] dump_stack+0x69/0x6f
Nov  1 08:37:13 willie kernel:  [<ffffffff8027835e>] bad_page+0x90/0xbd
Nov  1 08:37:13 willie kernel:  [<ffffffff802787db>] free_hot_cold_page+0xa0/0x232
Nov  1 08:37:13 willie kernel:  [<ffffffff803f0495>] skb_release_data+0x6d/0xe9
Nov  1 08:37:13 willie kernel:  [<ffffffff803f0381>] __kfree_skb+0x9/0x6f
Nov  1 08:37:13 willie kernel:  [<ffffffff804220d1>] tcp_recvmsg+0x780/0xafc
Nov  1 08:37:13 willie kernel:  [<ffffffff803eba97>] sock_common_recvmsg+0x30/0x45
Nov  1 08:37:13 willie kernel:  [<ffffffff803e9adc>] sock_aio_read+0x12c/0x149
Nov  1 08:37:13 willie kernel:  [<ffffffff8029e974>] do_sync_read+0xce/0x113
Nov  1 08:37:13 willie kernel:  [<ffffffff8029f381>] vfs_read+0xbd/0x153
Nov  1 08:37:13 willie kernel:  [<ffffffff8029f4d3>] sys_read+0x45/0x6e
Nov  1 08:37:13 willie kernel:  [<ffffffff8020b3b8>] system_call_fastpath+0x16/0x1b
Nov  1 08:37:13 willie kernel:  [<00007f9b536c5ef0>] 0x7f9b536c5ef0


Kernel is
Linux willie 2.6.27.37-0.1-xen #1 SMP 2009-10-15 14:56:58 +0200 x86_64 x86_64 x86_64 GNU/Linux

The machine is a Xen dom0 running 4 domUs that all send their syslog output to dom0's syslog-ng. It ran for months without such a bug. The machine crashed tonight.
Recent changes: updated the kernel and added a budget DVB-S card.

Is this a hardware bug or a software bug?



Reproducible: Didn't try

Comment 1 flo gleixner 2009-11-01 15:11:29 UTC
Re-tried: At the moment I'm only getting this during boot, but then I get anywhere from 2 to 200 errors. From the logs I can say that the page is always different and the process is always syslog-ng.
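
For anyone wanting to check the same thing in their own logs (always the same process, always a different page), here is a minimal sketch. It assumes the two message formats quoted in this report; the function name and the sample list are only illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches both message formats quoted in this report:
#   Bad page state in process 'syslog-ng'                 (2.6.27)
#   BUG: Bad page state in process syslog-ng  pfn:1dc51d  (2.6.31)
BAD_PAGE = re.compile(r"Bad page state in process '?([^'\s]+)'?")
# The struct page address printed on the line after the header.
PAGE_ADDR = re.compile(r"\bpage:([0-9a-f]{16})\b")


def summarize(lines):
    """Return (Counter of affected process names, set of page addresses)."""
    processes = Counter()
    pages = set()
    for line in lines:
        match = BAD_PAGE.search(line)
        if match:
            processes[match.group(1)] += 1
        match = PAGE_ADDR.search(line)
        if match:
            pages.add(match.group(1))
    return processes, pages


# Sample lines copied from the messages in this report.
sample = [
    "Bad page state in process 'syslog-ng'",
    "Nov  1 08:37:13 willie kernel: page:ffff88000a8f0f00 flags:0x8000000000000800 mapping:0000000000000000 mapcount:0 count:0",
    "Dec  5 00:28:25 willie kernel: [  452.797839] BUG: Bad page state in process syslog-ng  pfn:1dc51d",
    "Dec  5 00:28:25 willie kernel: [  452.797842] page:ffff88000a177ed8 flags:8000000000000800 count:0 mapcount:0 mapping:(null) index:0",
]
processes, pages = summarize(sample)
```

To run it against a real log, pass an open file object to summarize() instead of the sample list; the log location (for example /var/log/messages) is an assumption and depends on the syslog configuration.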
Comment 2 flo gleixner 2009-11-07 09:18:31 UTC
Does no one else hit this bug? Today I am getting the messages during normal operation as well.
Comment 3 flo gleixner 2009-12-04 23:54:25 UTC
OK, I upgraded to openSUSE 11.2. The system crashed 15 minutes after the reboot that followed the upgrade, but there were no "Bad page state" messages in the log. So I reset the machine and got ~150 "Bad page state" messages.
Kernel is now: 2.6.31.5-0.1-xen x86_64

Error messages - all the same - read:

Dec  5 00:28:25 willie kernel: [  452.797839] BUG: Bad page state in process syslog-ng  pfn:1dc51d
Dec  5 00:28:25 willie kernel: [  452.797842] page:ffff88000a177ed8 flags:8000000000000800 count:0 mapcount:0 mapping:(null) index:0
Dec  5 00:28:25 willie kernel: [  452.797844] Pid: 2434, comm: syslog-ng Tainted: G    B      2.6.31.5-0.1-xen #1
Dec  5 00:28:25 willie kernel: [  452.797846] Call Trace:
Dec  5 00:28:25 willie kernel: [  452.797849]  [<ffffffff800119b9>] try_stack_unwind+0x189/0x1b0
Dec  5 00:28:25 willie kernel: [  452.797854]  [<ffffffff8000f466>] dump_trace+0xa6/0x1e0
Dec  5 00:28:25 willie kernel: [  452.797857]  [<ffffffff800114c4>] show_trace_log_lvl+0x64/0x90
Dec  5 00:28:25 willie kernel: [  452.797861]  [<ffffffff80011513>] show_trace+0x23/0x40
Dec  5 00:28:25 willie kernel: [  452.797865]  [<ffffffff8046af06>] dump_stack+0x81/0x9e
Dec  5 00:28:25 willie kernel: [  452.797868]  [<ffffffff800dacb5>] bad_page+0xf5/0x160
Dec  5 00:28:25 willie kernel: [  452.797872]  [<ffffffff800dcf94>] free_hot_cold_page+0xa4/0x2b0
Dec  5 00:28:25 willie kernel: [  452.797876]  [<ffffffff800dd26e>] free_hot_page+0x1e/0x40
Dec  5 00:28:25 willie kernel: [  452.797880]  [<ffffffff800e10d7>] put_page+0x57/0x150
Dec  5 00:28:25 willie kernel: [  452.797884]  [<ffffffff802ff7cb>] gnttab_page_free+0x3b/0x60
Dec  5 00:28:26 willie kernel: [  452.797888]  [<ffffffff800dcf47>] free_hot_cold_page+0x57/0x2b0
Dec  5 00:28:26 willie kernel: [  452.797892]  [<ffffffff800dd26e>] free_hot_page+0x1e/0x40
Dec  5 00:28:26 willie kernel: [  452.797896]  [<ffffffff800e10d7>] put_page+0x57/0x150
Dec  5 00:28:26 willie kernel: [  452.797900]  [<ffffffff803b5f1c>] skb_release_data+0x8c/0x100
Dec  5 00:28:26 willie kernel: [  452.797904]  [<ffffffff803b5848>] __kfree_skb+0x28/0xd0
Dec  5 00:28:26 willie kernel: [  452.797908]  [<ffffffff80400ea8>] sk_eat_skb+0x78/0x90
Dec  5 00:28:26 willie kernel: [  452.797911]  [<ffffffff804040a6>] tcp_recvmsg+0x8e6/0xda0
Dec  5 00:28:26 willie kernel: [  452.797915]  [<ffffffff803b0183>] sock_common_recvmsg+0x43/0x70
Dec  5 00:28:26 willie kernel: [  452.797919]  [<ffffffff803ace69>] sock_aio_read+0x169/0x180
Dec  5 00:28:26 willie kernel: [  452.797923]  [<ffffffff80118c52>] do_sync_read+0x102/0x160
Dec  5 00:28:26 willie kernel: [  452.797927]  [<ffffffff80119251>] vfs_read+0x1a1/0x1c0
Dec  5 00:28:26 willie kernel: [  452.797931]  [<ffffffff801198ab>] sys_read+0x5b/0xa0
Dec  5 00:28:26 willie kernel: [  452.797935]  [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
Dec  5 00:28:26 willie kernel: [  452.797940]  [<00007f5a1525cdc0>] 0x7f5a1525cdc0


The machine crashed again while I was writing these lines. I decided to plug in a keyboard and screen to have a chance to see why it dies ...
Comment 4 flo gleixner 2009-12-05 21:21:36 UTC
The machine still crashes 5-6 times a day. It ran without a crash for months on openSUSE 11.1 with the old kernel, for weeks on 11.1 with the new kernel, and only for hours on 11.2. I added a screenshot of the CTRL-F10 console after a crash. I just installed the kdump packages and am trying to get kdump running. Any hints on how to get a kernel crash dump?
Comment 5 flo gleixner 2009-12-05 21:24:57 UTC
Created attachment 331213 [details]
screen shot of crash dump
Comment 6 flo gleixner 2009-12-05 22:42:10 UTC
News, if anyone is interested ....
The crash can be reproduced by copying a big file from a domU machine via NFS (v4) to the dom0 machine. The crash dumps look different every time. Is this a bug in Xen or in the bridging code? Or in the AHCI drivers? What can I do?
Comment 7 flo gleixner 2009-12-05 23:33:41 UTC
Now I have gotten a crash message three times with:
mp bios bug: 8254 timer not connected to IO-APIC

Booting with noapic makes the kernel unbootable.
The board is a Biostar TA780G M2+, in case this has anything to do with the crash ...
Comment 8 flo gleixner 2009-12-05 23:52:10 UTC
Update: Boosting the virtual bridge has no effect. But when I read a file in a domU I get the crash again.
Setup:
dom0: openSUSE 11.2 64bit
domUs: openSUSE 11.1 64bit pvm
The disks are mirrored: md0 holds /boot and md1 is the PV of an LVM volume group. Root and swap of dom0 are LVM volumes. The disks of the domUs are also LVM volumes. No LVM in domU :-)
Is this setup OK? It worked as long as I had 11.1 in dom0.
Comment 9 flo gleixner 2009-12-06 12:11:38 UTC
I set up another machine with totally different hardware and openSUSE 11.2. I created an LVM volume and tried to use it as the physical disk for a new domU installation. The installation starts, but dom0 freezes shortly afterwards. dom0 has all patches applied. I suspect the same bug. The domU showed some kernel messages on its boot screen; they looked like the ones I have been seeing all along.

This makes Xen totally unusable in 11.2!
Comment 10 flo gleixner 2009-12-06 13:18:49 UTC
Installing domU on a file instead of an LVM volume works ...
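
For reference, the difference between the two setups comes down to the disk line of the domU config. The fragment below is only a sketch: xm config files use Python syntax, and the volume group, volume name, image path, and device name are made-up examples, not the ones from my machines. The file-backed variant is a workaround, not a fix.

```python
# Hypothetical domU config fragment (xm config files are parsed as Python).
# All paths and device names below are illustrative assumptions.

# Variant that led to the dom0 crash/freeze in this report:
# an LVM volume handed to the guest as a physical block device.
disk_phy = ['phy:/dev/vg0/domu1-disk,xvda,w']

# Variant reported to work in this comment: a file-backed disk image.
disk_file = ['file:/var/lib/xen/images/domu1/disk0,xvda,w']

# A real config assigns exactly one list to the name `disk`.
disk = disk_file
```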
Comment 11 Jan Beulich 2009-12-07 08:12:58 UTC
Very likely a duplicate of 553690 and 559047. Please report whether mem=4G also allows you to work around the issue (apart from your finding of using file:/).
Comment 12 flo gleixner 2009-12-07 10:36:51 UTC
I managed to install a PV guest using mem=4G, but the guest still spits out many kernel messages. I could also start this dom0 without "mem=4G", but dom0 crashed later.
Kernel messages from domU during boot:
There are 122 kernel errors: 118 from swapper followed by 4 from modprobe. For the swapper messages, see the attachment.
Comment 13 flo gleixner 2009-12-07 10:41:39 UTC
Created attachment 331326 [details]
2x boot of domU
Comment 14 Jan Beulich 2009-12-14 15:43:01 UTC
Please see bug 559047 in case you want to try out a potential fix for this.
Comment 15 Jan Beulich 2009-12-15 14:55:13 UTC
.

*** This bug has been marked as a duplicate of bug 553690 ***