Bug 553690

Summary: Xenified kernel crashes during F12 PV DomU's install packages deployment phase
Product: [openSUSE] openSUSE 11.2
Component: Xen
Status: RESOLVED FIXED
Severity: Major
Priority: P2 - High
Version: RC 2
Target Milestone: unspecified
Hardware: x86-64
OS: Other
Reporter: Boris Derzhavets <bderzhavets>
Assignee: Jan Beulich <jbeulich>
QA Contact: E-mail List <qa-bugs>
CC: g.w.kant, gleixner, kcobler, rob
Whiteboard:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Runtime snapshot F12 PV DomU
Serial log for F12 PV DomU crash install
Another serial on different drive for kernel crash
linux-2.6.31.5-0.1.99.21.446052c built with original .config for xenified kernel
debugging patch (kernel)
Serial log after applying the most recent patch
lspci -nn report attached
debugging patch (kernel, v2)
V1 reverted , V2 applied. Kernel has been rebuilt from scratch
debugging patch (kernel, v3)

Description Boris Derzhavets 2009-11-09 07:37:49 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.4) Gecko/20091027 Fedora/3.5.4-1.fc12 Firefox/3.5.4

The screen (and the system as a whole) freezes during an F12 PV DomU install via this profile:

[root@fedora12sda vms]# cat f12.install
name="F12PV"
memory=1024
disk = ['phy:/dev/sda8,xvda,w' ]
vif = [ 'bridge=br0' ]
vfb = [ 'type=vnc,vncunused=1']
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

where the installation vmlinuz and initrd.img were obtained via wget from an F12 HTTP mirror (local or remote makes no difference).

# vncviewer localhost:0

The installer starts fine and runs until the package deployment phase. Three attempts have been made with the same result.
The system freezes when the DomU install reaches the package deployment phase, with the three LEDs of the numeric pad blinking.

Message from syslogd@dhcppc5
kernel: [960.891531] Call trace
Code: 00 00 85 c0 75 ad 4d . . . . . .
. . . . . . .
kernel: [960.891593] CR2: ffffffffffffcb0

I've also tried a text mode install, to avoid VNC console output, with this profile:

# cat f12.instext
name="F12PV"
memory=1024
disk = ['phy:/dev/sda8,xvda,w' ]
vif = [ 'bridge=br0' ]
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

# xm create -c f12.instext

The system crashes in the package deployment phase again.
Setting up a serial console to capture the kernel trace might take several days.




Reproducible: Always

Steps to Reproduce:
Attempt to create F12 PV DomU.
Actual Results:  
Xenified kernel crashes.

Expected Results:  
PV DomU gets installed
Comment 1 Jan Beulich 2009-11-09 08:09:51 UTC
We'll need the full kernel/hypervisor log (or at the very least the full kernel register/stack dump, including the last few lines that precede it).
Comment 2 Boris Derzhavets 2009-11-09 18:29:20 UTC
I attempted three more times to install the F12 PV DomU, switching to the kernel
console right after the package deployment phase started. The install succeeded
twice and failed once.

In case of success: CTRL+ALT+F1, log in in text mode,
kill -9 pid_of_Xorg. The X server comes up again.
Loading the DomU via this profile:

dhcppc5:~/vms # cat f12.pyrun
name="VF12"
memory=2048
kernel="./vmlinuz-2.6.31.5-122.fc12.x86_64"
ramdisk="./initramfs-2.6.31.5-122.fc12.x86_64.img"
root="/dev/mapper/vg_fedora-lv_root"
disk = ['phy:/dev/sda9,xvda,w' ]
vif = [ 'bridge=br0' ]
vfb = [ 'type=vnc,vncunused=1']
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

Pygrub cannot load the F12 DomU (bootloader returns no data).
It seems Xen has been built without e2fsprogs-devel.

End of output of kernel console in case of crash :-

do_page_fault
page_fault
memcpy_c
swiotlb_balance
unmap_single
swiotlb_umap_sg_attrs
_ata_sg_clean
__ata_qc_comlete
_ata_qc_comlete
_ata_qc_comlete_multiple
ahci_interrupt
handle_IRQ_event
handle_level_irq
evtchn_do_upcall
do_hypercall_callback
0xfffff...f802063ac
xen_safe_halt
xen_idle
cpu_idle
rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel
Comment 3 Boris Derzhavets 2009-11-09 18:36:07 UTC
Created attachment 326328 [details]
Runtime snapshot F12 PV DomU
Comment 4 Boris Derzhavets 2009-11-09 19:04:13 UTC
A fourth attempt at the F12 PV DomU install failed. The end of the kernel console output is exactly the same as above.
Comment 5 Jan Beulich 2009-11-10 11:03:23 UTC
(In reply to comment #2)
> End of output of kernel console in case of crash :-
> 
> do_page_fault
> page_fault
> memcpy_c
> swiotlb_balance

swiotlb_bounce ???

> unmap_single
> swiotlb_umap_sg_attrs
> _ata_sg_clean
> __ata_qc_comlete
> _ata_qc_comlete
> _ata_qc_comlete_multiple
> ahci_interrupt
> handle_IRQ_event
> handle_level_irq
> evtchn_do_upcall
> do_hypercall_callback
> 0xfffff...f802063ac
> xen_safe_halt
> xen_idle
> cpu_idle
> rest_init
> start_kernel
> x86_64_start_reservations
> x86_64_start_kernel

Just the function names don't tell much, unfortunately. However, it seems inconsistent that you have unmap_single() and memcpy_c() on the stack: swiotlb_bounce() calls memcpy() only for DMA_TO_DEVICE, but do_unmap_single() passes DMA_FROM_DEVICE. This may indicate there's earlier corruption, and hence we'll get nowhere without seeing the full hypervisor and kernel logs, i.e. we need to wait for you to set up serial.

Btw., does this also occur for file:/ backed guest disks?
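
For the file:/ comparison asked about here, a guest profile might look like the sketch below. It is the f12.install profile from the description with only the disk line changed; xm profiles use Python syntax. The image path is hypothetical (the thread never names one), and the image file would have to be created before starting the guest.

```python
# Hypothetical file-backed variant of f12.install.
# Only the disk line differs from the profile in the description.
# The image path below is an assumption, not taken from the thread.
name = "F12PV"
memory = 1024
disk = ['file:/var/lib/xen/images/f12pv.img,xvda,w']
vif = ['bridge=br0']
vfb = ['type=vnc,vncunused=1']
kernel = "./vmlinuz"
ramdisk = "./initrd.img"
vcpus = 1
on_reboot = 'restart'
on_crash = 'restart'
```

If the crash disappears with a file-backed disk, that would point toward the phy:/ path through the Dom0 block and DMA code rather than the guest itself.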
Comment 6 Boris Derzhavets 2009-11-10 14:10:37 UTC
Created attachment 326595 [details]
Serial log for F12 PV DomU crash install

Serial log obtained during package deployment phase.
Comment 7 Boris Derzhavets 2009-11-10 15:06:41 UTC
Created attachment 326606 [details]
Another serial on different drive for kernel crash

One more serial log for the kernel crash, this time on /dev/sda9 (the first one was for /dev/sdb9).
It looks different: the output has no disk errors and is similar to the kernel console
output submitted yesterday.
Comment 8 Boris Derzhavets 2009-11-10 15:15:33 UTC
I believe the second serial log submitted is the valid one, not the first.
Comment 9 Jan Beulich 2009-11-10 15:21:37 UTC
I think these

>(XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9
>(XEN) mm.c:4206:d0 Global bit is set to kernel page 4736f480a0

are the indicators of the beginning problems (in the log this is followed by severe problems in the SATA driver, likely because of interrupts no longer arriving).

The frame numbers, however, are completely bogus, and I'm unaware of any code path in our kernel that could lead to the global bit being set on a kernel page.
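
To put "completely bogus" in numbers, here is a rough sanity check, not anything taken from Xen itself. It assumes the standard 4 KiB x86 page size and the 8 GiB of RAM (4x2GB) reported for this machine later in the thread. The function names and the slack factor are illustrative only.

```python
PAGE_SIZE = 4096  # x86 base page size in bytes (assumed)

def frame_to_address(frame):
    """Convert a page frame number to the byte address where that frame starts."""
    return frame * PAGE_SIZE

def is_plausible_frame(frame, ram_bytes, slack=2):
    """Rough check: the frame should start below slack * installed RAM.

    The slack factor allows for holes in the physical address map; it is
    an arbitrary choice for this sketch.
    """
    return frame_to_address(frame) < slack * ram_bytes

ram = 8 * 2**30  # 8 GiB, per the hardware description in the thread
for frame in (0xf6454555a9, 0x4736f480a0):
    print(hex(frame), hex(frame_to_address(frame)), is_plausible_frame(frame, ram))
```

Both frame numbers from the log correspond to addresses far beyond anything an 8 GiB machine could back, which is consistent with the page table entries having been overwritten with unrelated data.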

Unless you can assist with debugging this, I don't think we can do much here without reproducing this internally.

>Linux version 2.6.31.5-0.1-desktop (geeko@buildhost) (gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) ) #3 SMP Sat Nov 7 13:41:03 EST 2009

But - what kernel was this created with? Our Xen kernels should call themselves -xen, not -desktop. Did you build this yourself? We need you to use the provided kernel in order for the logs to be useful for analysis. And if rebuilding the kernel is indeed unavoidable, it'd be nice for the tag to clearly identify it as such (we do have a -desktop kernel flavor).

Finally, for any future logs, I'd like to ask you to avoid (if possible) making them as redundant as this one (most messages appear several times, which is likely a result of your use of the various command line options).
Comment 10 Jan Beulich 2009-11-10 15:28:51 UTC
Okay, the second log indeed comes closer to your previous description, and we see

>[ 1380.923876] Thread overran stack, or stack corrupted

Without knowing what kernel this is there's nothing we can do here.
Comment 11 Boris Derzhavets 2009-11-10 15:44:33 UTC
(In reply to comment #10)
> Okay, the second log indeed comes closer to your previous description, and we
> see
> 
> >[ 1380.923876] Thread overran stack, or stack corrupted
> 
> Without knowing what kernel this is there's nothing we can do here.

View bug:-

https://bugzilla.novell.com/show_bug.cgi?id=552492
  
-------  Comment #20 From Jan Beulich  2009-11-06 02:23:13 MST   (-) -------

(In reply to comment #18)
> Sorry, my experience with Suse is limited.
> I would be glad to test patch with step by step instruction.
> I need xen-kernel source installed on the machine , but don't know
> where to get kernel-source-???.x86_64.rpm ( i suspect kernel-xen-source ...)
*************************************
Patch for X-server suggested by you
*************************************
If you don't need the exact RC2 kernel, you could try
ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/
*****************
Go to this link
*****************
Index of ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/
Name                                                         Size      Last Modified
kernel-source-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm          68957 KB  11/09/2009 04:02:00 PM
kernel-source-vanilla-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm  67280 KB  11/09/2009 04:02:00 PM
kernel-source-vanilla.rpm                                              11/09/2009 11:04:00 PM
kernel-source.rpm                                                      11/09/2009 11:04:00 PM

I downloaded and installed kernel-source.rpm (as suggested by you):
total 12
lrwxrwxrwx  1 root root   32 Nov  7 04:59 linux -> linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x 25 root root 4096 Nov  8 08:10 linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x  8 root root 4096 Oct 27 20:11 packages
-rw-r--r--  1 root root 2140 Nov  7 05:01 v32.patch1
dhcppc2:~ # cd /usr/src/linux
I applied your patch and built the xenified kernel.
dhcppc2:/usr/src/linux # make menuconfig (tuned for Xen, as usual for rebased kernels)
# make -j4
# make modules_install install
It turned out to be named 2.6.31.5-0.1-desktop; it works under Xen and brings up the X server with no memory limit.
I don't think the name is important. The Xen patches and v32.patch1
obviously come from you. On my side there was only "make menuconfig" to
activate the Xen Dom0 kernel feature.
Comment 12 Jan Beulich 2009-11-10 15:57:52 UTC
But I never said to use this self-built kernel for reporting other problems, even more so if you didn't even use our .config. There's no way for us to tell whether your problems simply result from you using some config option that was never tested under Xen.

The easiest thing is probably going to be that you re-do this with our kernel (using mem=4G on the Xen command line if need be, although the two logs provided don't suggest that you're even loading drm, so I can't see why that other bug would matter).
Comment 13 Boris Derzhavets 2009-11-10 16:15:00 UTC
The kernel was simply rebuilt with drm disabled after I reported the first issue.
I believe it makes sense to wait for the final release with 2.6.31.5-xen.
If the problem persists, I will report it with a serial log from the very beginning.
Comment 14 Boris Derzhavets 2009-11-10 20:02:00 UTC
(In reply to comment #12)
> But I never said to use this self built kernel for reporting other problems,
> even more if you didn't even use our .config
How could I use your .config, which didn't exist under /usr/src/linux?
Comment 15 Jan Beulich 2009-11-11 07:59:55 UTC
(In reply to comment #14)
> How could i use your's .config which didn't exist under /usr/src/linux ?

By either installing the full set of kernel-* packages (it moved a number of times, so I'm not sure it's kernel-devel, but I'd guess it is), or more trivially by reading /proc/config.gz while our Xen kernel is running.
Comment 16 Boris Derzhavets 2009-11-11 10:32:39 UTC
> trivially by reading /proc/config.gz while our Xen kernel is running.
Thanks. If I understand you correctly, gunzip -c /proc/config.gz > /usr/src/linux/.config should give me the .config to build the kernel as you did.
Comment 17 Jan Beulich 2009-11-11 10:51:16 UTC
Yes.
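
As a small illustration of that step: the running kernel's configuration has to be decompressed to a separate destination, because nothing can be written next to /proc/config.gz. A minimal sketch in Python follows. The function name is made up for this example, and the default paths are simply the ones mentioned in this thread.

```python
import gzip
import shutil
from pathlib import Path

def extract_kernel_config(src="/proc/config.gz", dest="/usr/src/linux/.config"):
    """Decompress a gzip-compressed kernel config from src into dest.

    Returns the destination as a Path. The defaults are the paths discussed
    in the thread; writing under /usr/src usually requires root.
    """
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        # Stream the decompressed bytes straight into the build tree.
        shutil.copyfileobj(compressed, out)
    return Path(dest)

# Typical use on the running Xen kernel (not executed here):
#   extract_kernel_config()
```

After this, running "make oldconfig" in the source tree is the usual way to reconcile the copied .config with a slightly different source version.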
Comment 18 Boris Derzhavets 2009-11-14 08:22:03 UTC
Kernel linux-2.6.31.5-0.1.99.21.446052c has been rebuilt with the .config
obtained via gunzip of /proc/config.gz on a SUSE 11.2 final Xen instance
running on the same box, dual booting with the first one (mem=4G). It generates the same serial log for the crash.
Comment 19 Boris Derzhavets 2009-11-14 08:26:04 UTC
I mean mem=4G was applied to the newly installed SUSE 11.2 final so that I could bring
it up with the X server and pick up your .config for the xenified kernel.
Comment 20 Jan Beulich 2009-11-16 09:41:54 UTC
(In reply to comment #18)
> Kernel linux-2.6.31.5-0.1.99.21.446052c has been rebuilt with .config
> obtained via gunzip /proc/config.gz located on Suse 11.2 final xen instance
> running on the same box ,dual booting with first one (mem=4G) . It generates
> same serial log for crash.

Perhaps a similar one... We'll need the full log of this anyway, together with a pointer where the kernel binaries (in particular, vmlinux and any modules involved in the backtrace) used live, in order to be able to analyze it. Stack overrun/corruption unfortunately isn't the easiest thing to debug...
Comment 21 Boris Derzhavets 2009-11-16 11:22:45 UTC
1. Please confirm that you want me to submit the serial log of the crash for kernel
linux-2.6.31.5-0.1.99.21.446052c built with your .config for the xenified
kernel.
2. If you point me to any other kernel-source-xen.rpm, I can build
another kernel (probably more recent) with your .config for the xenified kernel
and the previously applied X server patch, and obtain a serial log for that kernel.

> together with a pointer where the kernel binaries (in particular, vmlinux and
> any modules involved in the backtrace) used live,

3. Does it mean that you want to run the following?
# gdb vmlinux
(gdb) disassemble particular_function_from_trace
and then search for certain [<fffff..XXXX>] addresses mentioned
in the stack trace of the serial log
Comment 22 Jan Beulich 2009-11-16 16:39:39 UTC
(In reply to comment #21)
> 1. Please confirm that you want me to submit serial log of crash kernel
> linux-2.6.31.5-0.1.99.21.446052c been built via your's .config for xenified
> kernel.

Of course I'd prefer you to use a pre-built kernel (in which case I could just retrieve the binaries I need for analysis myself), but short of that I'm indeed asking for some other consistent pair of (log,kernel).

> 2. If you point me to any other kernel-source-xen.rpm i can build 
> another kernel (probably more recent) with your's .config for xenified kernel
> previously applied patch for X-Server and obtain serial log the kernel.

Other than the KOTD I pointed you at above there's nothing I'm aware of until the first maintenance update kernel will eventually get released.
 
> 3.Does it mean, that you want to run ?
> # gdb vmlinux
> # dissamble particular_module_from_trace
> Search for certain [<fffff..XXXX>] mentioned
> is stack trace of serial log

Something along those lines, yes, but also things beyond that (like associating source level variables with registers or stack locations).
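
The first part of that analysis, turning raw addresses from a trace back into symbol names, is why the log and the kernel binary have to match. A rough sketch of the lookup is shown below, assuming a System.map-style symbol table ("address type name" per line) from the same build. The function names and the sample addresses in the usage note are made up for illustration.

```python
import bisect

def load_symbols(lines):
    """Parse System.map-style lines ('<hex addr> <type> <name>') into a sorted list."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols

def resolve(symbols, address):
    """Return 'name+0xoffset' for the closest symbol at or below address, else None."""
    addrs = [addr for addr, _ in symbols]
    i = bisect.bisect_right(addrs, address) - 1
    if i < 0:
        return None  # address lies below the first known symbol
    base, name = symbols[i]
    return f"{name}+{address - base:#x}"
```

For example, with a hypothetical table containing "ffffffff80200000 T func_a" and "ffffffff80200100 T func_b", resolve(symbols, 0xffffffff80200010) gives "func_a+0x10". With a symbol table from a different build the same address would resolve to the wrong function, which is the reason for asking for a consistent (log, kernel) pair.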

Btw., to be maximally useful here (given that we're at least suspecting stack overrun/corruption), it would be a good idea for you to include "kstack=1024" on the Dom0 (kernel) command line.
Comment 23 Boris Derzhavets 2009-11-18 14:40:34 UTC
(In reply to comment #22)
> (In reply to comment #21)
> > 1. Please confirm that you want me to submit serial log of crash kernel
> > linux-2.6.31.5-0.1.99.21.446052c been built via your's .config for xenified
> > kernel.
> 
> Of course I'd prefer you to use a pre-built kernel (in which case I could just
> retrieve the binaries I need for analysis myself), but short of that I'm indeed
> asking for some other consistent pair of (log,kernel).
> Btw., to be maximally useful here (given that we're at least suspecting stack
> overrun/corruption), it would be a good idea for you to include "kstack=1024"
> on the Dom0 (kernel) command line.
Serial log of the F12 DomU crash submitted as requested, with kstack=1024 included on the
Xen kernel command line. See the attachment.
Comment 24 Boris Derzhavets 2009-11-18 14:43:30 UTC
Created attachment 328220 [details]
linux-2.6.31.5-0.1.99.21.446052c built with original .config for xenified kernel

Serial log for linux-2.6.31.5-0.1.99.21.446052c built with the original .config for the xenified kernel
Comment 25 Jan Beulich 2009-11-18 16:10:51 UTC
So where can I pick up the corresponding binary?
Comment 26 Boris Derzhavets 2009-11-18 16:47:28 UTC
(In reply to comment #25)
> So where can I pick up the corresponding binary?
Sorry, I am just an independent consultant as far as this issue is concerned.
I can only try to upload vmlinux and vmlinux.o via ftp to a location of yours.
I don't have a personal site registered in DNS.
Comment 27 Jan Beulich 2009-11-18 16:54:15 UTC
Then try it via mail attachment (just vmlinux, perhaps compressed).
Comment 28 Boris Derzhavets 2009-11-18 17:37:29 UTC
vmlinux.bz2 (29 MB) doesn't go through Yahoo mail (limit is 25 MB).
I need time to look for a solution.
Comment 29 Boris Derzhavets 2009-11-18 19:03:58 UTC
(In reply to comment #27)
> The try it via mail attachment (just vmlinux, perhaps compressed).
Done.
http://free.mailbigfile.com/0982c32dc12fc361ae43d945fc43bdab/listFiles.php
Comment 30 Jan Beulich 2009-11-19 15:22:47 UTC
Created attachment 328458 [details]
debugging patch (kernel)
Comment 31 Jan Beulich 2009-11-19 15:33:19 UTC
This log together with one of the earlier provided ones makes it clear that swiotlb code is being instructed to write over a page table, due to running off the end of a valid buffer. It is not clear however whether the buffer was originally specified improperly, or whether stored data got corrupted e.g. during I/O. Since I can't reproduce the issue myself, I'm hoping that you would be able to rebuild your kernel with the debugging patch just attached, and then try and see whether it captures the problem any earlier (and of course doesn't have any adverse side effects).

Do you, btw, also run into this issue when using mem=4G on the Xen command line? Also I assume you're not having the machine do any other things while starting the guest? And from the last log I'm having the impression that only the third guest that got started actually crashed the machine - were the first two of different type, or does the problem not always occur?

It would also be nice if you attached "lspci -nn" output for the machine, unless you know the problem is present on two sufficiently different ones.
Comment 32 Boris Derzhavets 2009-11-19 16:47:25 UTC
(In reply to comment #31)
> This log together with one of the earlier provided ones makes it clear that
> swiotlb code is being instructed to write over a page table, due to running off
> the end of a valid buffer. It is not clear however whether the buffer was
> originally specified improperly, or whether stored data got corrupted e.g.
> during I/O. Since I can't reproduce the issue myself, I'm hoping that you would
> be able to rebuild your kernel with the debugging patch just attached, and then
> try and see whether it captures the problem any earlier (and of course doesn't
> have any adverse side effects).
   Will try. 
> Do you, btw, also run into this issue when using mem=4G on the Xen command
> line? 

   The installer hangs while downloading the installation image, so I cannot get that far.

> Also I assume you're not having the machine do any other things while
> starting the guest? 

   Sure.

>And from the last log I'm having the impression that only
> the third guest that got started actually crashed the machine - were the first
> two of different type, or does the problem not always occur?

   The problem always occurs. The first F12 guest that was installed crashed Dom0 when I attempted to load the DomU from the already built image, either via a pygrub profile (guest's /boot on ext3) or via a regular xm profile (guest's /boot on ext4).

  I reproduced it twice with the F12 final release guest. This time I got past the package deployment phase via the installation profile, shut down the DomU, then attempted to load it, and Dom0 crashed right away in both cases. I submitted only one serial log.
 
> It would also be nice if you attached "lspci -nn" output for the machine,
> unless you know the problem is present on two sufficiently different ones.

No  problem.
It's C2D E8400, ASUS P5Q3, 4x2GB Kingston 1333, SATA 250 GB Seagate Barracuda
Comment 33 Boris Derzhavets 2009-11-19 18:06:21 UTC
Created attachment 328502 [details]
Serial log after applying the most recent patch

The DomU install crashes at the very beginning, while attempting either to detect or to
partition the image device.
Comment 34 Boris Derzhavets 2009-11-19 18:10:37 UTC
Created attachment 328503 [details]
lspci -nn report attached

lspci -nn has been run.
Comment 35 Jan Beulich 2009-11-20 09:46:40 UTC
Created attachment 328617 [details]
debugging patch (kernel, v2)

Sorry, oversight on my part. Should be better now.
Comment 36 Boris Derzhavets 2009-11-20 12:37:28 UTC
(In reply to comment #35)
> Created an attachment (id=328617) [details]
> debugging patch (kernel, v2)
> 
> Sorry, oversight on my part. Should be better now.

Revert V1 and apply V2?
Comment 37 Jan Beulich 2009-11-20 12:40:41 UTC
Yes.
Comment 38 Boris Derzhavets 2009-11-20 14:21:59 UTC
Created attachment 328693 [details]
V1 reverted , V2 applied. Kernel has been rebuilt from scratch

Done with V2.
Comment 39 Jan Beulich 2009-11-23 07:52:07 UTC
Created attachment 328910 [details]
debugging patch (kernel, v3)

As I understand it, v2 still brought the machine down too early. I hope that v3 finally gets us forward. I'm sorry for not having spotted this earlier.
Comment 40 Boris Derzhavets 2009-11-23 10:32:59 UTC
(In reply to comment #39)
> Created an attachment (id=328910) [details]
> debugging patch (kernel, v3)
> 
> As I understand it, v2 still brought the machine down too early. I hope that v3
> finally gets us forward. I'm sorry for not having spotted this earlier.

I will be able to proceed with v3 on 11/23 or 11/24.
Comment 41 Boris Derzhavets 2009-11-23 19:19:02 UTC
> As I understand it, v2 still brought the machine down too early. I hope that v3
> finally gets us forward. I'm sorry for not having spotted this earlier.
The kernel patched with V3 crashes at the same point as with V2:
the attempt to format partitions on the image device. To get a serial log I need
to move the box again.
Comment 42 Jan Beulich 2009-11-24 07:56:25 UTC
Hmm, that's odd - I can't see anything wrong with the debug code anymore. But perhaps I will once I see the new log (which in any case should be different, as I swapped the probe and warning, and tightened the warning condition).
Comment 43 Jan Beulich 2009-12-07 09:06:34 UTC
Following the observation in https://bugzilla.novell.com/show_bug.cgi?id=551695#c10, did you ever try running a VM on file:/ rather than phy:/ (see also my similar question in #5)?
Comment 44 Boris Derzhavets 2009-12-07 10:33:41 UTC
(In reply to comment #43)
> Following the observation in
> https://bugzilla.novell.com/show_bug.cgi?id=551695#c10, did you ever try
> running a VM on file:/ rather than phy:/ (see also my similar question in #5)?
No, I didn't.
Comment 45 Jan Beulich 2009-12-14 15:44:24 UTC
Please see bug 559047 for an updated version of the debugging patch (and a potential fix).
Comment 46 Jan Beulich 2009-12-15 10:45:27 UTC
*** Bug 564427 has been marked as a duplicate of this bug. ***
Comment 47 Jan Beulich 2009-12-15 14:55:15 UTC
*** Bug 551695 has been marked as a duplicate of this bug. ***
Comment 48 Jan Beulich 2009-12-16 08:08:41 UTC
*** Bug 559047 has been marked as a duplicate of this bug. ***
Comment 49 Jan Beulich 2010-01-04 15:53:57 UTC
*** Bug 567306 has been marked as a duplicate of this bug. ***
Comment 50 Boris Derzhavets 2010-01-05 11:54:00 UTC
The issue is resolved after the most recent maintenance update for 11.2.
Comment 51 Jan Beulich 2010-01-05 12:11:34 UTC
Thanks.