Bug 254208

Summary: Kernel panic with 2.6.18.8-0.1-xen
Product: [openSUSE] openSUSE 10.2
Reporter: Greg Riedesel <greg>
Component: Xen
Assignee: Jan Beulich <jbeulich>
Status: RESOLVED FIXED
QA Contact: Jason Douglas <jdouglas>
Severity: Normal
Priority: P5 - None
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: Other
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: serial console capture of the panic
Another serial-console capture of the panic
Output of 'lspci' under the non-Xen kernel (2.6.18.8-0.1-default)
Output of "lspci -n" in the non-Xen kernel (2.6.18.8-0.1-default)
boot.msg file for the standard kernel boot.
The "lsmod" output from the default/standard kernel

Description Greg Riedesel 2007-03-13 21:20:45 UTC
After upgrading to the 2.6.18.8-0.1-xen kernel from 2.6.18.2-23-xen the machine would throw a kernel panic. I've attached a serial-console capture of the panic.

Adding "agp=off" to the Additional Options line (not the Boot Options line) allowed the machine to boot past the panic. "agpgart" is in the /etc/modprobe.d/blacklist file.

The hardware is an Asus P5B Deluxe, with a Core2 CPU, running x86-64.
Comment 1 Greg Riedesel 2007-03-13 21:21:40 UTC
Created attachment 124157 [details]
serial console capture of the panic
Comment 2 Greg Riedesel 2007-03-13 21:29:17 UTC
Created attachment 124159 [details]
Another serial-console capture of the panic
Comment 3 Greg Riedesel 2007-03-13 21:39:47 UTC
This bug does not affect the 'default' kernel, only the 'xen' kernel.
Comment 4 Greg Riedesel 2007-03-15 23:16:01 UTC
Doing a diff of defconfig.default and defconfig.xen gives this tidbit:

1914c1845
< CONFIG_AGP_INTEL=m
---
> CONFIG_AGP_INTEL=y

I don't know enough about Xen kernel configuration to know why Intel AGP is being hard-loaded into the kernel, but this would explain why the default kernel doesn't show the problem.
Comment 5 Lynn Bendixsen 2007-03-29 23:01:02 UTC
Jan, this is an openSUSE 10.2 bug entry.
Comment 6 Jan Beulich 2007-03-30 11:05:55 UTC
While this config selection wasn't intended to be that way, it also wasn't changed after the original release, so you had the driver built in there, too. I'm surprised this worked for you. (Any chance you have a boot.msg obtained with the old kernel?)

Jason/Lynn, any chance we have a machine (Intel chipset driven by intel-agp and 4Gb+ of memory) in the lab this can be reproduced on?

Regardless of that I think I found two issues with the code:
- the use of GFP_DMA32, assuming the machine address will result in memory below 4G (which isn't true under Xen)
- arithmetic extending across page boundaries on values returned from virt_to_gart() (the physical<->machine relationship isn't contiguous under Xen)
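
The two points above can be illustrated with a small model. This is only a sketch, not kernel code and not part of the original comment: the P2M table, its frame numbers, and the helper names are hypothetical, chosen to show why a low pseudo-physical address does not imply a machine address below 4G, and why address arithmetic done after translation breaks at a page boundary.

```python
PAGE_SIZE = 4096
FOUR_GIB = 1 << 32

# Hypothetical pseudo-physical -> machine frame table for one guest.
# Under Xen, consecutive guest frames need not be consecutive machine
# frames, and a low guest frame may live above 4G in machine memory.
P2M = {0: 0x1A2, 1: 0x07F, 2: 0x300, 3: 0x140000}


def phys_to_machine(paddr):
    """Translate a pseudo-physical address to a machine address."""
    pfn, offset = divmod(paddr, PAGE_SIZE)
    return P2M[pfn] * PAGE_SIZE + offset


def machine_above_4g(paddr):
    """Point 1: a low pseudo-physical address may still be above 4G."""
    return phys_to_machine(paddr) >= FOUR_GIB


def naive_next_page(paddr):
    """Point 2, the faulty pattern: translate once, then add PAGE_SIZE."""
    return phys_to_machine(paddr) + PAGE_SIZE


def correct_next_page(paddr):
    """Add in pseudo-physical space, then translate the new page."""
    return phys_to_machine(paddr + PAGE_SIZE)


print(machine_above_4g(3 * PAGE_SIZE))  # guest frame 3 is far below 4G
print(hex(naive_next_page(0)))          # assumes contiguous machine frames
print(hex(correct_next_page(0)))        # looks up guest frame 1 explicitly
```

In this model the naive and the per-page results differ whenever the neighbouring guest frames are not adjacent in machine memory, which is the general case under Xen.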
Comment 7 Jan Beulich 2007-03-30 12:13:34 UTC
Please also provide output of lspci and lspci -n (obtained from the native kernel).
Comment 8 Lynn Bendixsen 2007-03-30 15:59:15 UTC
(In reply to comment #6)

> Jason/Lynn, any chance we have a machine (Intel chipset driven by intel-agp and
> 4Gb+ of memory) in the lab this can be reproduced on?

We probably have one, but as this is for openSUSE it is a low priority for us right now. We may have a chance to get to it in the middle of next week.
Comment 10 Greg Riedesel 2007-03-30 16:12:53 UTC
Created attachment 127892 [details]
Output of 'lspci' under the non-Xen kernel (2.6.18.8-0.1-default)
Comment 11 Greg Riedesel 2007-03-30 16:13:45 UTC
Created attachment 127893 [details]
Output of "lspci -n" in the non-Xen kernel (2.6.18.8-0.1-default)
Comment 12 Greg Riedesel 2007-03-30 16:21:21 UTC
> While this config selection wasn't intended to be that way, it also wasn't
> changed after the original release, so you had the driver built in there, too.
> I'm surprised this worked for you. (Any chance you have a boot.msg obtained
> with the old kernel?)

Bug #227324 describes some of the problems I had with the Final kernel (2.6.18.2-34) series. In that case "agp=off" also seemed to bypass the problems, though I did have luck using the modprobe blacklist. It was the agpgart problems that had me keep the 2.6.18.2-23-Xen kernel after 10.2 released, as that kernel didn't seem to have the same problem.
Comment 15 Jan Beulich 2007-04-02 08:01:16 UTC
So with native not working (without agp=off or blacklisting intel-agp, as the referenced bug #227324 described), this is not really a Xen bug but a generic issue; it just happens that under the Xen kernel, due to intel-agp inadvertently being built in, you can't use the blacklisting method but have to use agp=off.

Nevertheless, I believe looking closely at this code has revealed a number of weaknesses on the Xen side.
Comment 16 Greg Riedesel 2007-06-15 16:56:45 UTC
Bug 271573 is a duplicate of this one, with newer code.

Kernel 2.6.18.8-0.3 still has this issue. The standard kernel does not show the problem, but the xen kernel does. As with the earlier one, adding agp=off to the kernel options bypasses this bug.
Comment 17 Greg Riedesel 2007-06-15 16:59:58 UTC
Created attachment 146580 [details]
boot.msg file for the standard kernel boot.

This is the boot.msg file for a standard kernel boot of 2.6.18.8-0.3-default. This had no issues.
Comment 18 Jan Beulich 2007-06-18 06:55:21 UTC
*** Bug 271573 has been marked as a duplicate of this bug. ***
Comment 19 Jan Beulich 2007-06-18 07:07:20 UTC
That would mean bug 227324 is no longer applicable. Based on that bug's history, however, I would think that you just happen to (not) see the problem in different kernel versions depending on other characteristics of the respective kernel build.

Please clarify whether intel-agp is being loaded in the native kernel (via lsmod output), as the boot.msg provided seems to indicate that it is not being loaded at all (missing the "Detected an Intel ... Chipset." message), which makes me assume that its loading is still being suppressed by some means.
Comment 20 Greg Riedesel 2007-07-16 21:53:38 UTC
The problem appears to still exist in the 2.6.18.8-0.5-xen build. Once again, using the "agp=off" option in the Boot Options gets past the Kernel Panic.
Comment 21 Greg Riedesel 2007-07-16 22:09:33 UTC
Created attachment 151330 [details]
The "lsmod" output from the default/standard kernel
Comment 22 Greg Riedesel 2007-07-16 22:14:51 UTC
The 2.6.18.8-0.5 "defconfig.xen" and "defconfig.default" files have the same difference I mentioned in comment 4. Specifically, when I diff the two I get this in the stream:
[/usr/src/linux/arch/x86_64 # diff defconfig.default defconfig.xen]

1914c1845
< CONFIG_AGP_INTEL=m
---
> CONFIG_AGP_INTEL=y

Which tells me that the default kernel has intel_agp as a module, and the Xen kernel has intel_agp static in the kernel. The lsmod output for the default kernel does not show "intel_agp" loaded.
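
The distinction matters because a modprobe blacklist only stops a module from being loaded; a driver compiled into the kernel image is not affected by it. As a minimal sketch (the function name and the sample config strings are hypothetical, not taken from the attached files), the setting of an option in a kernel config can be classified like this:

```python
def config_mode(config_text, option="CONFIG_AGP_INTEL"):
    """Report how a kernel config option is set: built-in, module, or not set."""
    for line in config_text.splitlines():
        line = line.strip()
        # Lines such as "# CONFIG_FOO is not set" start with '#' and are skipped.
        if line.startswith(option + "="):
            value = line.split("=", 1)[1]
            return {"y": "built-in", "m": "module"}.get(value, value)
    return "not set"


# Sample fragments mirroring the diff shown above.
default_cfg = "CONFIG_AGP=m\nCONFIG_AGP_INTEL=m\n"
xen_cfg = "CONFIG_AGP=y\nCONFIG_AGP_INTEL=y\n"

print(config_mode(default_cfg))  # driver built as a loadable module
print(config_mode(xen_cfg))      # driver compiled into the kernel image
```

With "module", the blacklist file can keep the driver out; with "built-in", only a kernel parameter such as agp=off can disable it.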
Comment 23 Jan Beulich 2007-07-17 10:37:02 UTC
So in order for intel-agp to do anything, it must find matching hardware in your system, and hence the same matching hardware would be found during a native kernel boot. If intel-agp isn't loaded in the latter case, then it means you're suppressing its loading by some means. If such is necessary for your system to work, then agp=off is the way to go in my opinion; you'd have to live with the fact that the disabling needs to be done differently for the Xen and the native kernels. 10.3 will have intel-agp as a module.