Bug 551598

Summary: openSUSE 11.2 rc2 fails to boot in fully virtualized Xen VM on SLES 10 SP2 or SLES 11 GM
Product: [openSUSE] openSUSE 11.2
Reporter: Jared Hudson <jared.hudson>
Component: Kernel
Assignee: Ky Srinivasan <ksrinivasan>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P2 - High
CC: agraf, jbeulich, jeffm, jfehlig, nettings
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 11.2
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
  xm console output
  opensuse11.2_ignore_loglevel.txt
  Preserve compatibility

Description Jared Hudson 2009-10-30 21:42:23 UTC
The OS was installed with the Base Pattern only to save download time. LVM was used. Mostly default settings were used throughout the install. After the first reboot, openSUSE starts booting but ends up getting SCSI errors and stops. I then transferred the image to a SLES 11 system, thinking that perhaps openSUSE 11.2 was not compatible with SLES 10. The VM running on a SLES 11 virtual host behaves the same way.

I set up the console on serial and connected to the Xen console so I could retrieve the errors. Here's a snippet from when the problem starts. I'll attach the entire log after this message.

[    6.125143] ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 5
[    6.127971] xen-platform-pci 0000:00:03.0: PCI INT A -> Link[LNKD] -> GSI 5 (level, low) -> IRQ 5
[    6.133236] Xen version 3.3.
[    6.134127] Hypercall area is 1 pages.
[    6.157113] IRQ 5/xen-platform-pci: IRQF_DISABLED is not guaranteed on shared IRQs
[    6.213582] suspend: event channel 4
[   36.704474] ata1: lost interrupt (Status 0x0)
[   36.707490] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[   36.710785] ata1.00: cmd c8/00:20:09:63:5a/00:00:00:00:00/e0 tag 0 dma 16384 in
[   36.710787]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[   36.718208] ata1.00: status: { DRDY }
[   36.720860] ata1: soft resetting link
[   36.874597] ata1.00: revalidation failed (errno=-2)
[   41.873881] ata1: soft resetting link
[   42.028607] ata1.00: revalidation failed (errno=-2)
[   47.027861] ata1: soft resetting link
[   47.182568] ata1.00: revalidation failed (errno=-2)
[   47.185088] ata1.00: disabled
[   47.187126] ata1.00: device reported invalid CHS sector 0
[   47.190596] ata1: soft resetting link
[   47.344321] ata1: EH complete
[   47.345593] sd 0:0:0:0: [sda] Unhandled error code
[   47.347853] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[   47.352065] end_request: I/O error, dev sda, sector 5923593
Comment 1 Jared Hudson 2009-10-30 21:44:33 UTC
Created attachment 324958
xm console output
Comment 3 Tejun Heo 2009-11-23 02:29:29 UTC
Hmmm... First, the command timed out and then revalidation failed with -ENOENT, which means that libata is failing to read IDENTIFY data off the QEMU emulated drive.  Can you please boot with ignore_loglevel and post the console output?  It will show us why IDENTIFY reading is failing.  Given that the failure is on QEMU emulated devices, I don't think this has much to do with the ata_piix driver itself.  It looks like IRQ delivery failed for some reason under Xen (which seems somewhat common) and then QEMU disk emulation just freaked out on the recovery sequence.

Thanks.
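
The sequence described above (a long stall, then repeated revalidation failures) can be pulled out of a console log mechanically. The following is only a minimal sketch, not something used in this bug: `summarize_ata_failures` is a hypothetical helper, and it assumes the log uses the `[ seconds.micros] message` timestamp format shown in the description. It reports the largest jump between consecutive timestamps and translates the `errno=-N` values into symbolic names.

```python
import errno
import re

# Matches lines such as "[   36.704474] ata1: lost interrupt (Status 0x0)".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Matches the libata message quoted in the description.
ERRNO_RE = re.compile(r"revalidation failed \(errno=-(\d+)\)")


def summarize_ata_failures(log_text):
    """Hypothetical helper: return (largest_gap, revalidation_errnos).

    largest_gap is a (seconds, message) pair for the biggest jump between
    consecutive timestamped lines, or None if fewer than two lines parse.
    revalidation_errnos lists the symbolic errno names that were seen.
    """
    previous = None
    largest_gap = None
    errnos = []
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip continuation lines without a timestamp
        stamp, message = float(match.group(1)), match.group(2)
        if previous is not None:
            gap = stamp - previous
            if largest_gap is None or gap > largest_gap[0]:
                largest_gap = (gap, message)
        previous = stamp
        failed = ERRNO_RE.search(message)
        if failed:
            code = int(failed.group(1))
            # errno.errorcode maps numbers to names, e.g. 2 -> "ENOENT".
            errnos.append(errno.errorcode.get(code, str(code)))
    return largest_gap, errnos
```

Run against the snippet in the description, the largest gap should land on the "lost interrupt" line, and the errno list should contain only ENOENT entries.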
Comment 4 Jared Hudson 2009-11-24 23:22:07 UTC
I just did a fresh install with openSUSE 11.2 final. Now it does boot but still produces errors. They're just no longer fatal.
Comment 5 Jared Hudson 2009-11-24 23:22:58 UTC
Created attachment 329340
opensuse11.2_ignore_loglevel.txt
Comment 6 Tejun Heo 2009-11-25 00:48:17 UTC
Those HSM failures are from QEMU emulated CD-ROMs and are different from the original ones you reported.  ISTR a similar problem with qemu-kvm.  cc'ing Alex.  Alex, does xen-qemu work about the same as qemu-kvm?  Jared is reporting HSM violations on the QEMU cdrom device and IIRC there was a similar issue with qemu-kvm, right?
Comment 7 Alexander Graf 2009-11-25 07:24:07 UTC
Xen uses its own fork of qemu for HVM. So chances are pretty good that it's similar.
The issue with qemu-kvm was/is only triggered on eject though.

This looks more related to pv-ops or something similar. It seems like openSUSE 11.2 knows it's running inside Xen and loses interrupts? (rough guess)

So let's ask Jan and Jim if they know anything here.
Comment 8 Jan Beulich 2009-11-25 09:06:59 UTC
That seems to be the 11.2 incarnation of a previously reported bug (and I thought we wouldn't repeat the same mistake): once the pv drivers are installed, the "native" ones (i.e. libata and friends) shouldn't be loaded anymore, as the pv drivers disable their respective PCI devices when they load. KY should have the best overview of what was done where to accommodate that behavior, and hence who should make which change.
Comment 9 Alexander Graf 2009-12-20 21:02:09 UTC
KY, mind shedding some light on this?
Comment 10 Ky Srinivasan 2009-12-21 19:05:56 UTC
On SLES 11, if I remember correctly, the problem we had was that disks would appear both as an IDE disk (managed by the PV driver) and as a SCSI disk managed by the libata/scsi driver stack. The way we dealt with this problem was to ensure that the PV drivers were loaded prior to loading libata. Look at the file xen_pvdrivers under /etc/modprobe.d/. We could try a similar solution here. We would need to get the installation team involved to make the necessary changes.
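
Whether the load order described above actually holds on a running guest can be checked from /proc/modules. This is a rough sketch only: `loaded_before` is a hypothetical helper, the module name `xen_vbd` in the usage note is an assumed name for the PV block driver (it is not given in this report), and the check relies on the assumption that /proc/modules lists the most recently loaded module first.

```python
def loaded_before(proc_modules_text, first, second):
    """Hypothetical helper: return True if `first` was loaded before `second`.

    Assumes /proc/modules lists the most recently loaded module first, so
    a module that was loaded earlier appears on a later line. Returns None
    if either module is not listed (for example, if it is built into the
    kernel or was never loaded).
    """
    names = [
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    ]
    if first not in names or second not in names:
        return None
    return names.index(first) > names.index(second)
```

A possible use, with the assumed module name, would be `loaded_before(open("/proc/modules").read(), "xen_vbd", "libata")`; a result of False would suggest libata claimed the disk before the PV driver loaded.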
Comment 11 Ky Srinivasan 2010-01-20 15:41:45 UTC
Created attachment 337683
Preserve compatibility
Comment 12 Ky Srinivasan 2010-01-20 15:44:23 UTC
The problem appears to be in the new PV drivers that we picked up. I have submitted a patch for this problem on the SLE 11 SP1 code base. This patch should address the problem here as well. The patch is attached (comment #11). Charles, if there is any update planned for 11.2, could you include this patch?
Comment 13 Ky Srinivasan 2010-02-08 15:21:45 UTC
I am going to close this bug, since a patch has been submitted.