|
Bugzilla – Full Text Bug Listing |
| Summary: | Xenified kernel crashes during F12 PV DomU's install packages deployment phase | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.2 | Reporter: | Boris Derzhavets <bderzhavets> |
| Component: | Xen | Assignee: | Jan Beulich <jbeulich> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P2 - High | CC: | g.w.kant, gleixner, kcobler, rob |
| Version: | RC 2 | ||
| Target Milestone: | unspecified | ||
| Hardware: | x86-64 | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | Community User | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
Runtime snapshot F12 PV DomU
Serial log for F12 PV DomU crash install
Another serial on different drive for kernel crash
linux-2.6.31.5-0.1.99.21.446052c built wit original .config for xenified kernel
debugging patch (kernel)
Serial log after applying the most recent patch
lspci -nn report attached
debugging patch (kernel, v2)
V1 reverted , V2 applied. Kernel has been rebuilt from scratch
debugging patch (kernel, v3) |
||
|
Description
Boris Derzhavets
2009-11-09 07:37:49 UTC
We'll need the full kernel/hypervisor log (or at the very least the full kernel register/stack dump, including the last few lines that precede it).

I attempted three more times to install the F12 PV DomU, switching to the kernel console right after the package deployment phase started. The install succeeded twice and failed once. In case of success: CTRL+ALT+F1, log in in text mode, kill -9 pid_of_Xorg. The X server comes up again.

Loading DomU via profile :-

dhcppc5:~/vms # cat f12.pyrun
name="VF12"
memory=2048
kernel="./vmlinuz-2.6.31.5-122.fc12.x86_64"
ramdisk="./initramfs-2.6.31.5-122.fc12.x86_64.img"
root="/dev/mapper/vg_fedora-lv_root"
disk = ['phy:/dev/sda9,xvda,w' ]
vif = [ 'bridge=br0' ]
vfb = [ 'type=vnc,vncunused=1']
vcpus=1
on_reboot = 'restart'
on_crash = 'restart'

Pygrub cannot load the F12 DomU (bootloader returns no data). It seems Xen has been built without e2fsprogs-devel.

End of output of kernel console in case of crash :-

do_page_fault
page_fault
memcpy_c
swiotlb_balance
unmap_single
swiotlb_umap_sg_attrs
_ata_sg_clean
__ata_qc_comlete
_ata_qc_comlete
_ata_qc_comlete_multiple
ahci_interrupt
handle_IRQ_event
handle_level_irq
evtchn_do_upcall
do_hypercall_callback
0xfffff...f802063ac
xen_safe_halt
xen_idle
cpu_idle
rest_init
start_kernel
x86_64_start_reservations
x86_64_start_kernel

Created attachment 326328 [details]
Runtime snapshot F12 PV DomU
4-th attempt of F12 PV DomU install failed. End of output of kernel console is exactly the same as above.

(In reply to comment #2)
> End of output of kernel console in case of crash :-
>
> do_page_fault
> page_fault
> memcpy_c
> swiotlb_balance

swiotlb_bounce ???

> unmap_single
> swiotlb_umap_sg_attrs
> _ata_sg_clean
> __ata_qc_comlete
> _ata_qc_comlete
> _ata_qc_comlete_multiple
> ahci_interrupt
> handle_IRQ_event
> handle_level_irq
> evtchn_do_upcall
> do_hypercall_callback
> 0xfffff...f802063ac
> xen_safe_halt
> xen_idle
> cpu_idle
> rest_init
> start_kernel
> x86_64_start_reservations
> x86_64_start_kernel

Just the function names don't tell much, unfortunately. However, it seems inconsistent that you have unmap_single() and memcpy_c() on the stack: swiotlb_bounce() calls memcpy() only for DMA_TO_DEVICE, but do_unmap_single() passes DMA_FROM_DEVICE. This may indicate there's earlier corruption, and hence we'll get nowhere without seeing the full hypervisor and kernel logs, i.e. we need to wait for you to set up serial.

Btw., does this also occur for file:/ backed guest disks?

Created attachment 326595 [details]
Serial log for F12 PV DomU crash install
Serial log obtained during package deployment phase.
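Since the question of phy:/ versus file:/ backed guest disks comes up in this report, here is a minimal sketch of how the backend type could be read out of a domain profile such as f12.pyrun. It relies on the fact that an xm profile is written as plain Python assignments. The helper name disk_backends is hypothetical (it is not part of xm or any Xen tool), and the profile text is abbreviated from the one quoted in the description.

```python
# Hypothetical helper for illustration; not part of xm or any Xen tool.
def disk_backends(profile_text):
    """Return (backend, target, guest_device, mode) for each disk entry."""
    settings = {}
    # An xm profile consists of ordinary Python assignments, so exec() can
    # evaluate it. Only do this with profiles you trust.
    exec(profile_text, {}, settings)
    result = []
    for spec in settings.get("disk", []):
        # Assumes the simple three-field form 'backend:target,device,mode'.
        source, device, mode = [part.strip() for part in spec.split(",")]
        backend, _, target = source.partition(":")
        result.append((backend, target, device, mode))
    return result


# Abbreviated copy of the profile from the description.
PROFILE = """
name="VF12"
memory=2048
disk = ['phy:/dev/sda9,xvda,w' ]
vif = [ 'bridge=br0' ]
vcpus=1
"""

for backend, target, device, mode in disk_backends(PROFILE):
    print(backend, target, device, mode)
```

A profile whose disk entry starts with file: instead of phy: would report "file" as the backend, which is the variation asked about above.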
Created attachment 326606 [details]
Another serial on different drive for kernel crash
One more serial log for the kernel crash, this time on /dev/sda9 (the first one was for /dev/sdb9).
It looks different. Output has no disk errors similar to kernel console
output submitted yesterday.
I believe the second serial log submitted is the fair one, not the first one.

I think these

>(XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9
>(XEN) mm.c:4206:d0 Global bit is set to kernel page 4736f480a0

are the indicators of the beginning problems (in the log this is followed by severe problems in the SATA driver, likely because of interrupts no longer arriving). The frame numbers, however, are completely bogus, and I'm unaware of any code path in our kernel that could lead to the global bit being set on a kernel page. Unless you can assist with debugging this, I don't think we can do much here without reproducing this internally.

>Linux version 2.6.31.5-0.1-desktop (geeko@buildhost) (gcc version 4.4.1 [gcc-4_4-branch revision 150839] (SUSE Linux) ) #3 SMP Sat Nov 7 13:41:03 EST 2009

But - what kernel was this created with? Our Xen kernels should call themselves -xen, not -desktop. Did you build this yourself? We need you to use the provided kernel in order for the logs to be useful for analysis. And if rebuilding the kernel is indeed unavoidable, it'd be nice for the tag to identify it clearly as such (we do have a -desktop kernel flavor).

Finally, for eventual future logs, I'd like to ask you to avoid (if possible) making the logs as redundant as this one (most messages are there several times, which likely is a result of your use of the various command line options).

Okay, the second log indeed comes closer to your previous description, and we see
>[ 1380.923876] Thread overran stack, or stack corrupted
Without knowing what kernel this is there's nothing we can do here.
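The "Global bit" hypervisor messages quoted above can be pulled out of a serial log mechanically. The following is a minimal sketch that extracts the reported values and flags those above a rough upper bound. It treats the values as frame numbers, as the comment above does, and assumes roughly 8 GiB of RAM with 4 KiB pages (an assumption based on the 4x2GB configuration mentioned in this report). The function name bogus_frames is hypothetical, and a real machine memory map has holes, so the bound is only approximate.

```python
import re

# Matches lines such as:
# (XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9
GLOBAL_BIT = re.compile(
    r"\(XEN\) mm\.c:\d+:d(\d+) Global bit is set to kernel page ([0-9a-f]+)"
)

# Assumed bound: 8 GiB of RAM divided into 4 KiB pages.
EIGHT_GIB_FRAMES = (8 << 30) // 4096


def bogus_frames(log_text, highest_valid_frame):
    """Return (domain, frame) pairs whose frame exceeds the given bound."""
    hits = []
    for match in GLOBAL_BIT.finditer(log_text):
        domain = int(match.group(1))
        frame = int(match.group(2), 16)
        if frame >= highest_valid_frame:
            hits.append((domain, frame))
    return hits


# The two lines quoted in this report.
SAMPLE = """\
(XEN) mm.c:4206:d0 Global bit is set to kernel page f6454555a9
(XEN) mm.c:4206:d0 Global bit is set to kernel page 4736f480a0
"""

for domain, frame in bogus_frames(SAMPLE, EIGHT_GIB_FRAMES):
    print("d%d: frame %#x is beyond the assumed bound" % (domain, frame))
```

Both quoted values lie far beyond the assumed bound, which is consistent with the remark that the frame numbers are bogus.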
(In reply to comment #10)
> Okay, the second log indeed comes closer to your previous description, and we
> see
>
> >[ 1380.923876] Thread overran stack, or stack corrupted
>
> Without knowing what kernel this is there's nothing we can do here.

View bug:- https://bugzilla.novell.com/show_bug.cgi?id=552492

------- Comment #20 From Jan Beulich 2009-11-06 02:23:13 MST (-) -------
(In reply to comment #18)
> Sorry, my experience with Suse is limited.
> I would be glad to test patch with step by step instruction.
> I need xen-kernel source installed on the machine , but don't know
> where to get kernel-source-???.x86_64.rpm ( i suspect kernel-xen-source ...)

*************************************
Patch for X-server suggested by you
*************************************
If you don't need the exact RC2 kernel, you could try
ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/

*****************
Go to this link
*****************
Index of ftp://ftp.suse.com/pub/projects/kernel/kotd/openSUSE-11.2/noarch/
Up to higher level directory
Name  Size  Last Modified
File:kernel-source-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm  68957 KB  11/09/2009 04:02:00 PM
File:kernel-source-vanilla-2.6.31.5-0.1.99.23.2e8a968.noarch.rpm  67280 KB  11/09/2009 04:02:00 PM
kernel-source-vanilla.rpm  11/09/2009 11:04:00 PM
kernel-source.rpm  11/09/2009 11:04:00 PM

I downloaded and installed kernel-source.rpm (suggested by you).

total 12
lrwxrwxrwx 1 root root 32 Nov 7 04:59 linux -> linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x 25 root root 4096 Nov 8 08:10 linux-2.6.31.5-0.1.99.21.446052c
drwxr-xr-x 8 root root 4096 Oct 27 20:11 packages
-rw-r--r-- 1 root root 2140 Nov 7 05:01 v32.patch1

dhcppc2:~ # cd /usr/src/linux

Applied your patch and built the xenified kernel.

dhcppc2:/usr/src/linux # make menuconfig ( tuned as xenified as usual for rebased ones)
# make -j4
# make modules_install install

It appears to be named 2.6.31.5-0.1-desktop, works under Xen, and brings up the X server with no memory limit.
I don't think it's important what name it has. The Xen patches and v32.patch1 obviously come from you. From my side there was only "make menuconfig" to activate the Xen Dom0 kernel feature.

But I never said to use this self built kernel for reporting other problems, even more if you didn't even use our .config. There's no way for us to tell whether your problems simply result from you using some config option that was never tested under Xen. The easiest thing is probably going to be that you re-do this with our kernel (using mem=4G on the Xen command line if need be, although the two logs provided don't suggest that you're even loading drm, so I can't see why that other bug would matter).

The kernel was rebuilt with drm disabled later, after reporting the first issue. I believe it makes sense to wait for the final release with 2.6.31.5-xen. If the problem persists, I will report it with a serial log from the very beginning.

(In reply to comment #12)
> But I never said to use this self built kernel for reporting other problems,
> even more if you didn't even use our .config

How could I use your .config, which didn't exist under /usr/src/linux?

(In reply to comment #14)
> How could i use your's .config which didn't exist under /usr/src/linux ?

By either installing the full set of kernel-* packages (it moved a number of times, so I'm not sure it's kernel-devel, but I'd guess it is), or more trivially by reading /proc/config.gz while our Xen kernel is running.

> trivially by reading /proc/config.gz while our Xen kernel is running.
Thanks. If I understand you right, gunzip -c /proc/config.gz > /usr/src/linux/.config should give the .config to build the kernel as you did.
Yes.

Kernel linux-2.6.31.5-0.1.99.21.446052c has been rebuilt with the .config obtained via gunzip /proc/config.gz on a Suse 11.2 final Xen instance running on the same box, dual booting with the first one (mem=4G). It generates the same serial log for the crash.

I mean mem=4G was applied to the newly installed Suse 11.2 final to be able to bring it up with the X server and pick up your .config for the xenified kernel.

(In reply to comment #18)
> Kernel linux-2.6.31.5-0.1.99.21.446052c has been rebuilt with .config
> obtained via gunzip /proc/config.gz located on Suse 11.2 final xen instance
> running on the same box ,dual booting with first one (mem=4G) . It generates
> same serial log for crash.

Perhaps a similar one... We'll need the full log of this anyway, together with a pointer where the kernel binaries (in particular, vmlinux and any modules involved in the backtrace) used live, in order to be able to analyze it. Stack overrun/corruption unfortunately isn't the easiest thing to debug...

1. Please confirm that you want me to submit the serial log of the crash for kernel
linux-2.6.31.5-0.1.99.21.446052c built with your .config for the xenified
kernel.
2. If you point me to any other kernel-source-xen.rpm I can build
another kernel (probably more recent) with your .config for the xenified kernel,
apply the previous patch for the X server, and obtain a serial log for that kernel.
> together with a pointer where the kernel binaries (in particular, vmlinux and
> any modules involved in the backtrace) used live,
3. Does it mean that you want to run the following?
# gdb vmlinux
# disassemble particular_module_from_trace
Search for certain [<fffff..XXXX>] mentioned
in the stack trace of the serial log
(In reply to comment #21)
> 1. Please confirm that you want me to submit serial log of crash kernel
> linux-2.6.31.5-0.1.99.21.446052c been built via your's .config for xenified
> kernel.

Of course I'd prefer you to use a pre-built kernel (in which case I could just retrieve the binaries I need for analysis myself), but short of that I'm indeed asking for some other consistent pair of (log,kernel).

> 2. If you point me to any other kernel-source-xen.rpm i can build
> another kernel (probably more recent) with your's .config for xenified kernel
> previously applied patch for X-Server and obtain serial log the kernel.

Other than the KOTD I pointed you at above there's nothing I'm aware of until the first maintenance update kernel will eventually get released.

> 3.Does it mean, that you want to run ?
> # gdb vmlinux
> # dissamble particular_module_from_trace
> Search for certain [<fffff..XXXX>] mentioned
> is stack trace of serial log

Something along those lines, yes, but also things beyond that (like associating source level variables with registers or stack locations).

Btw., to be maximally useful here (given that we're at least suspecting stack overrun/corruption), it would be a good idea for you to include "kstack=1024" on the Dom0 (kernel) command line.

(In reply to comment #22)
> (In reply to comment #21)
> > 1. Please confirm that you want me to submit serial log of crash kernel
> > linux-2.6.31.5-0.1.99.21.446052c been built via your's .config for xenified
> > kernel.
>
> Of course I'd prefer you to use a pre-built kernel (in which case I could just
> retrieve the binaries I need for analysis myself), but short of that I'm indeed
> asking for some other consistent pair of (log,kernel).
> Btw., to be maximally useful here (given that we're at least suspecting stack
> overrun/corruption), it would be a good idea for you to include "kstack=1024"
> on the Dom0 (kernel) command line.

Serial log of the F12 DomU crash submitted as requested, with kstack=1024 included in the Xen kernel command line. View attachment.

Created attachment 328220 [details]
linux-2.6.31.5-0.1.99.21.446052c built wit original .config for xenified kernel
Serial log for linux-2.6.31.5-0.1.99.21.446052c built wit original .config for xenified kernel
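Whether a self-built kernel really uses the distribution's configuration, the concern raised earlier in this thread, can be checked by comparing the two option sets. The following is a minimal sketch, assuming the reference config is read from /proc/config.gz on the running distribution kernel and the candidate from /usr/src/linux/.config. The function names read_config and config_diff are hypothetical.

```python
import gzip


def read_config(path):
    """Parse a kernel config file (plain or gzip-compressed) into a dict."""
    opener = gzip.open if path.endswith(".gz") else open
    options = {}
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith("# CONFIG_") and line.endswith(" is not set"):
                # "# CONFIG_FOO is not set" is recorded as CONFIG_FOO = "n".
                options[line[2:-len(" is not set")]] = "n"
            elif line.startswith("CONFIG_") and "=" in line:
                name, _, value = line.partition("=")
                options[name] = value
    return options


def config_diff(reference, candidate):
    """Return {option: (reference_value, candidate_value)} where they differ."""
    names = sorted(set(reference) | set(candidate))
    return {
        name: (reference.get(name), candidate.get(name))
        for name in names
        if reference.get(name) != candidate.get(name)
    }


# Example use on a running system (paths as discussed in this thread):
# diff = config_diff(read_config("/proc/config.gz"),
#                    read_config("/usr/src/linux/.config"))
# for name, (ours, theirs) in diff.items():
#     print(name, ours, theirs)
```

An empty result would indicate that both kernels were configured identically, at least as far as the recorded options go.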
So where can I pick up the corresponding binary?

(In reply to comment #25)
> So where can I pick up the corresponding binary?

Sorry, I am just an independent consultant with regard to the current issue. I can only try to upload vmlinux and vmlinux.o via ftp to your location. I don't have a personal site registered in DNS.

Then try it via mail attachment (just vmlinux, perhaps compressed).

vmlinux.bz2 - 29 MB doesn't go through yahoo (< 25 MB). I need time to look for a solution.

(In reply to comment #27)
> The try it via mail attachment (just vmlinux, perhaps compressed).

Done.
http://free.mailbigfile.com/0982c32dc12fc361ae43d945fc43bdab/listFiles.php

Created attachment 328458 [details]
debugging patch (kernel)
This log together with one of the earlier provided ones makes it clear that swiotlb code is being instructed to write over a page table, due to running off the end of a valid buffer. It is not clear however whether the buffer was originally specified improperly, or whether stored data got corrupted e.g. during I/O. Since I can't reproduce the issue myself, I'm hoping that you would be able to rebuild your kernel with the debugging patch just attached, and then try and see whether it captures the problem any earlier (and of course doesn't have any adverse side effects).

Do you, btw, also run into this issue when using mem=4G on the Xen command line? Also I assume you're not having the machine do any other things while starting the guest? And from the last log I'm having the impression that only the third guest that got started actually crashed the machine - were the first two of different type, or does the problem not always occur?

It would also be nice if you attached "lspci -nn" output for the machine, unless you know the problem is present on two sufficiently different ones.

(In reply to comment #31)
> This log together with one of the earlier provided ones makes it clear that
> swiotlb code is being instructed to write over a page table, due to running off
> the end of a valid buffer. It is not clear however whether the buffer was
> originally specified improperly, or whether stored data got corrupted e.g.
> during I/O. Since I can't reproduce the issue myself, I'm hoping that you would
> be able to rebuild your kernel with the debugging patch just attached, and then
> try and see whether it captures the problem any earlier (and of course doesn't
> have any adverse side effects).

Will try.

> Do you, btw, also run into this issue when using mem=4G on the Xen command
> line?

The installer hangs while downloading the installation image, so I cannot get that far.

> Also I assume you're not having the machine do any other things while
> starting the guest?

Sure.

>And from the last log I'm having the impression that only
> the third guest that got started actually crashed the machine - were the first
> two of different type, or does the problem not always occur?

The problem always occurs. The first F12 guest, once installed, crashed Dom0 either via the pygrub profile (guest's /boot of ext3fs type) or via the regular xm profile (guest's /boot of ext4fs type) when attempting to load the DomU from the already built image. I reproduced it twice with F12 (final release guest). This time I got past the package deployment phase via the installation profile, shut down the DomU, then attempted to load it, and it crashed right away in both cases. I submitted only one serial log.

> It would also be nice if you attached "lspci -nn" output for the machine,
> unless you know the problem is present on two sufficiently different ones.

No problem. It's C2D E8400, ASUS P5Q3, 4x2GB Kingston 1333, SATA 250 GB Seagate Barracuda

Created attachment 328502 [details]
Serial log after applying the most recent patch
DomU's install crashes at the very beginning, while attempting either to detect or
to partition the image device.
Created attachment 328503 [details]
lspci -nn report attached
lspci -nn has been run.
Created attachment 328617 [details]
debugging patch (kernel, v2)
Sorry, oversight on my part. Should be better now.
(In reply to comment #35)
> Created an attachment (id=328617) [details]
> debugging patch (kernel, v2)
>
> Sorry, oversight on my part. Should be better now.

Revert V1 and apply V2 ?

Yes.

Created attachment 328693 [details]
V1 reverted , V2 applied. Kernel has been rebuilt from scratch
Done with V2.
Created attachment 328910 [details]
debugging patch (kernel, v3)
As I understand it, v2 still brought the machine down too early. I hope that v3 finally gets us forward. I'm sorry for not having spotted this earlier.
(In reply to comment #39)
> Created an attachment (id=328910) [details]
> debugging patch (kernel, v3)
>
> As I understand it, v2 still brought the machine down too early. I hope that v3
> finally gets us forward. I'm sorry for not having spotted this earlier.

I will be able to proceed with v3 on 11/23 or 11/24.

> As I understand it, v2 still brought the machine down too early. I hope that v3
> finally gets us forward. I'm sorry for not having spotted this earlier.

The kernel patched with V3 crashes at the same point as with V2:
the attempt to format partitions on the image device. To get a serial log I need
to move the box again.
Hmm, that's odd - I can't see anything wrong with the debug code anymore. But perhaps I will once I see the new log (which in any case should be different, as I swapped the probe and warning, and tightened the warning condition).

Following the observation in https://bugzilla.novell.com/show_bug.cgi?id=551695#c10, did you ever try running a VM on file:/ rather than phy:/ (see also my similar question in #5)?

(In reply to comment #43)
> Following the observation in
> https://bugzilla.novell.com/show_bug.cgi?id=551695#c10, did you ever try
> running a VM on file:/ rather than phy:/ (see also my similar question in #5)?

No, I didn't.

Please see bug 559047 for an updated version of the debugging patch (and a potential fix).

*** Bug 564427 has been marked as a duplicate of this bug. ***
*** Bug 551695 has been marked as a duplicate of this bug. ***
*** Bug 559047 has been marked as a duplicate of this bug. ***
*** Bug 567306 has been marked as a duplicate of this bug. ***

The issue is resolved after the most recent maintenance update for 11.2. Thanks. |