Bug 971386

Summary: Xen crashes with "invalid opcode" while booting in UEFI mode
Product: [openSUSE] openSUSE Distribution
Reporter: Anton Samsonov <avsco>
Component: Xen
Assignee: Xen Virtualization <xen-bugs>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: avsco, carnold, jbeulich
Version: Leap 42.1
Flags: jbeulich: needinfo? (avsco)
Target Milestone: ---
Hardware: x86-64
OS: openSUSE 42.1
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
  Xen boot log: crash in EFI mode
  Xen boot log: success in CSM mode, early messages
  Xen boot log: success in CSM mode, late messages (services starting)
  Sysinfo: "xl info" output when booted in CSM mode
  Sysinfo: "/proc/cpuinfo" contents
  Sysinfo: "cpuid" output
  Sysinfo: "dmidecode" output

Description Anton Samsonov 2016-03-16 11:57:48 UTC
Created attachment 669292 [details]
Xen boot log: crash in EFI mode

After installing openSUSE Leap 42.1 from scratch in UEFI mode, I found out that Xen cannot boot: after displaying some messages (until the screen fills to the bottom, approximately at "Dom0 has maximum 8 VCPUs"), the display goes blank and, some seconds later, the computer reboots.

By enabling serial console output and connecting the serial port to another computer, I was able to capture the actual messages that precede the reboot:


[    0.000000] 	Offload RCU callbacks from all CPUs
[    0.000000] 	Offload RCU callbacks from CPUs: 0-7.
(XEN) ----[ Xen-4.5.2_04-9  x86_64  debug=n  Not tainted ]----
.....
(XEN) Xen call trace:
(XEN)    [<0000000000000008>] 0000000000000008
(XEN)    [<ffff82d0802298ca>] efi_rs_enter+0xfa/0x120
(XEN)    [<ffff82d08012acd9>] _spin_lock_irqsave+0x9/0x10
(XEN)    [<ffff82d08022a358>] efi_runtime_call+0x4e8/0x870
(XEN)    [<ffff82d080186baf>] flush_area_mask+0x6f/0x130
(XEN)    [<ffff82d08012d50b>] add_entry+0x4b/0xb0
(XEN)    [<ffff82d080167c4f>] do_platform_op+0xeff/0x1840
(XEN)    [<ffff82d080142494>] do_console_io+0x3b4/0x3f0
(XEN)    [<ffff82d080226199>] syscall_enter+0xa9/0xae
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL TRAP: vector = 6 (invalid opcode)
(XEN) ****************************************


This looks strange, as I have been using Xen on this machine (Core i7-2600) for quite a while already.

To check whether the current Xen version is viable at all, I then added an entry to the GRUB2-classic config of my old installation on a separate drive, and was able to boot the new Xen successfully in CSM mode:


[    0.000000] 	Offload RCU callbacks from all CPUs
[    0.000000] 	Offload RCU callbacks from CPUs: 0-7.
[    0.000000] Xen reported: 3392.394 MHz processor.
.....
[    0.000000] Linux version 4.1.15-8-xen (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux) ) #1 SMP Wed Jan 20 16:41:00 UTC 2016 (0e3b3ab)
.....
Welcome to openSUSE Leap 42.1 - Kernel 4.1.15-8-xen (xvc0).
samsonov login:


Before that, I also tried to boot the new Xen with an older Linux kernel, such as 3.16.7 from openSUSE 13.2 (because that trick helped me on another machine with classic BIOS firmware, and that is another story), but had no luck this time.

PS. No idea whether my issue is related to #936418 or not. Did not test the "blacklist efivarfs" solution from #912566 either.
Comment 1 Anton Samsonov 2016-03-16 11:59:20 UTC
Created attachment 669293 [details]
Xen boot log: success in CSM mode, early messages
Comment 2 Anton Samsonov 2016-03-16 12:00:06 UTC
Created attachment 669295 [details]
Xen boot log: success in CSM mode, late messages (services starting)
Comment 3 Anton Samsonov 2016-03-16 12:02:37 UTC
Created attachment 669298 [details]
Sysinfo: "xl info" output when booted in CSM mode
Comment 4 Anton Samsonov 2016-03-16 12:03:38 UTC
Created attachment 669300 [details]
Sysinfo: "/proc/cpuinfo" contents
Comment 5 Anton Samsonov 2016-03-16 12:04:12 UTC
Created attachment 669303 [details]
Sysinfo: "cpuid" output
Comment 6 Anton Samsonov 2016-03-16 12:04:37 UTC
Created attachment 669304 [details]
Sysinfo: "dmidecode" output
Comment 7 Jan Beulich 2016-03-16 15:42:13 UTC
(In reply to Anton Samsonov from comment #0)
> This looks strange, as I have been using Xen on this machine (Core i7-2600)
> for quite a while already.

Please be more specific here: Did you successfully use Xen in EFI mode on this machine before? If so, something must have changed. Did you perhaps update firmware after the last successful run?

In any event this is a firmware issue, and from the looks of it the only way around it would be to suppress use of runtime services: "efi=no-rs" on the hypervisor command line (i.e. the "options=" one in xen.cfg).
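For illustration only, here is a minimal Python sketch of the edit described above: appending "efi=no-rs" to the "options=" line of xen.cfg. The helper name add_hypervisor_option and the sample file contents (section name, console settings, kernel and ramdisk names) are hypothetical and not taken from the reporter's system; only the "options=" key and the "efi=no-rs" option come from this comment.

```python
def add_hypervisor_option(cfg_text, option="efi=no-rs"):
    """Append `option` to every `options=` line that does not already have it."""
    out = []
    for line in cfg_text.splitlines():
        if line.startswith("options="):
            current = line[len("options="):].split()
            if option not in current:
                current.append(option)
            line = "options=" + " ".join(current)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical xen.cfg contents; adjust to the real file on the EFI partition.
SAMPLE_CFG = """\
[global]
default=xen

[xen]
options=console=vga,com1 com1=115200,8n1
kernel=vmlinuz-xen root=/dev/sda2
ramdisk=initrd-xen
"""

if __name__ == "__main__":
    print(add_hypervisor_option(SAMPLE_CFG), end="")
```

The function is idempotent, so running it twice on the same text leaves a single "efi=no-rs" entry. Editing the file by hand achieves the same result.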
Comment 8 Anton Samsonov 2016-03-16 16:23:07 UTC
(In reply to Jan Beulich from comment #7)


> Did you successfully use Xen in EFI mode on this machine before?

Unfortunately, I had no previous experience in EFI mode with Xen on that machine, although I did install openSUSE in EFI mode there.


> Did you perhaps update firmware after the last successful run?

Yes, I did update the firmware, as the openSUSE release notes suggest (although no "last [successful] run" ever took place). I am not considering downgrading the firmware to check Xen with an older firmware version, as that process is even more dangerous than upgrading.


> In any event this is a firmware issue.

Indeed, adding "efi=no-rs" allowed me to boot successfully in EFI mode (thank you!), though with fewer messages being output to the screen, so I assume you are right. Could you please give any hint on how to pinpoint the firmware bug so I report that issue to Dell?
Comment 9 Jan Beulich 2016-03-16 16:37:39 UTC
(In reply to Anton Samsonov from comment #8)
> Could you please give any hint on how to pinpoint the firmware bug so
> I report that issue to Dell?

Well, I expect the location in efi_rs_enter that the stack trace shows to point just past the call instruction invoking the firmware function. Anything further down the call tree is a firmware bug. (The exception would be if we called it with bogus arguments, but we pretty certainly don't, as things work on good firmware. Even with bad arguments I'd much rather expect an error indicator to be returned, or at worst a page fault from accessing bad data, but surely not a branch to NULL [or something very close to NULL].) Also the address at the top of the stack points into firmware - I would guess that's pointing past a call instruction which does the actual transfer to (almost) NULL.
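
As a rough aid for reading such a trace, the following Python sketch separates frames that Xen resolved to one of its own symbols from frames it printed as a bare address. It assumes the "(XEN)    [<address>] label" line format seen in the log from comment #0; the names FRAME_RE and parse_frames are hypothetical. It only classifies the lines and cannot by itself identify which firmware routine was running.

```python
import re

# One call-trace line: "(XEN)    [<16 hex digits>] symbol+off/size" or a bare address.
FRAME_RE = re.compile(r"\(XEN\)\s+\[<([0-9a-f]{16})>\]\s+(\S+)")


def parse_frames(log_text):
    """Return (address, symbol) pairs; symbol is None when Xen printed no name."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        addr_hex, label = match.groups()
        # An unresolved frame repeats the raw address instead of a symbol name.
        symbol = None if label == addr_hex else label
        frames.append((int(addr_hex, 16), symbol))
    return frames


# First frames of the trace quoted in comment #0.
TRACE = """\
(XEN) Xen call trace:
(XEN)    [<0000000000000008>] 0000000000000008
(XEN)    [<ffff82d0802298ca>] efi_rs_enter+0xfa/0x120
(XEN)    [<ffff82d08012acd9>] _spin_lock_irqsave+0x9/0x10
(XEN)    [<ffff82d08022a358>] efi_runtime_call+0x4e8/0x870
"""

if __name__ == "__main__":
    for addr, symbol in parse_frames(TRACE):
        where = symbol or "no Xen symbol (outside the hypervisor image?)"
        print(f"{addr:#018x}  {where}")
```

In this trace the first frame is the only one without a symbol, which is consistent with control having left Xen's own code right after efi_rs_enter.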