Bug 1187245

Summary: Rebooting KVM virtual machine gives a black screen
Product: [openSUSE] openSUSE Distribution    Reporter: Neil Rickert <nwr10cst-oslnx>
Component: KVM    Assignee: Joey Lee <jlee>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Minor    
Priority: P2 - High CC: dfaggioli, jcheung, jlee, jose.ziviani, mchang, nwr10cst-oslnx, predivan
Version: Leap 15.3   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1192126
Whiteboard:
Attachments: Output from "virsh dumpxml ubuntu19"
OVMF debug log
VM domain XML
libvirt domain configuration
OVMF debug log('systemctl reboot')
Attaching "dmesg.log" as requested.
ovmf-bsc1192126-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch

Description Neil Rickert 2021-06-11 20:49:57 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0
Build Identifier: 

I am using several virtual machines.  The host system is running Leap 15.3.  The virtual machines run a variety (15.3, Tumbleweed, 15.2, Solus).

On rebooting the VM, I often finish up with a black screen.  Based on what I see on the screen, it looks as if the VM is rebooting correctly, but is failing to connect to the firmware on reboot.  It seems to just loop.

My workaround is to force the machine off, and then restart.

I have also checked that the file systems were cleanly unmounted (by doing next boot to the DVD installer, and running "fsck" from the rescue system).

This is probably an OVMF issue.  It did not happen when the host machine was running Leap 15.2.  It often (but not always) happens with the host machine at Leap 15.3.

This does not happen when the VM is using traditional BIOS.
This also does not happen when the VM is using "ovmf-ia32-code.bin".  But it does happen when using "ovmf-x86_64-ms-4m-code.bin" or "ovmf-x86_64-smm-ms-code.bin" or "ovmf-x86_64-code.bin".

The problem is most likely to occur after doing a significant update to the software running in the VM.

Reproducible: Sometimes
Comment 2 Neil Rickert 2021-07-29 00:22:20 UTC
A couple of additional comments:

(1) I am pretty sure that this is not a grub problem.  It looks like an ovmf firmware problem.  On a normal reboot, I should see a prompt to hit ESC if I want the boot options menu.  But when I see this issue, it does not even get that far.

(2) There was an "ovmf" update a few weeks ago.  After that update, the problem went away on virtual machines that are using "/usr/share/qemu/ovmf-x86_64-smm-ms-code.bin".  But the problem does still show up on a virtual machine using "/usr/share/qemu/ovmf-x86_64-ms-4m-code.bin" (but not on every reboot).
Comment 4 Michael Chang 2021-07-29 04:07:47 UTC
(In reply to Neil Rickert from comment #2)
> A couple of additional comments:
> 
> (1) I am pretty sure that this is not a grub problem.  It looks like an ovmf
> firmware problem.  On a normal reboot, I should see a prompt to hit ESC if I
> want the boot options menu.  But when I see this issue, it does not even get
> that far.
> 
> (2) There was an "ovmf" update a few weeks ago.  After that update, the
> problem went away on virtual machines that are using
> "/usr/share/qemu/ovmf-x86_64-smm-ms-code.bin".  But the problem does still
> show up on a virtual machine using
> "/usr/share/qemu/ovmf-x86_64-ms-4m-code.bin" (but not on every reboot).

Hi Neil,

Are you using libvirt?  If so, could you please attach the "virsh dumpxml ..." output from your guest?  You can also grab the OVMF log via:

  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=ovmf,path=/tmp/myvm-ovmf-debug.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=ovmf'/>
  </qemu:commandline>

Please refer to 
  https://libvirt.org/kbase/qemu-passthrough-security.html
for setting up the qemu command-line passthrough for libvirt.
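Putting the pieces together, a minimal sketch of how the relevant parts of the domain XML would look (the xmlns:qemu namespace declaration on the root element is required for the qemu: elements to be accepted; the log path is a placeholder example):

```xml
<!-- Sketch only, not a complete domain definition. The namespace
     declaration is required; the log path is a placeholder. -->
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
  <!-- ... existing devices, os, memory elements ... -->
  <qemu:commandline>
    <qemu:arg value='-chardev'/>
    <qemu:arg value='file,id=ovmf,path=/tmp/myvm-ovmf-debug.log'/>
    <qemu:arg value='-device'/>
    <qemu:arg value='isa-debugcon,iobase=0x402,chardev=ovmf'/>
  </qemu:commandline>
</domain>
```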

If you are running qemu directly, just attach your qemu command line here, and add the options above to grab the OVMF log when the problem occurs ...
Thanks.
Comment 5 Neil Rickert 2021-09-18 20:00:23 UTC
Created attachment 852631 [details]
Output from "virsh dumpxml ubuntu19"

I am attaching the dumpxml output.

I'm not sure how to get those ovmf logs.  I configured virt-manager to allow editing xml.  Then I copied the lines you suggested to just above "</domain>" and saved those changes.  But the resulting xml file still did not have those changes.  Maybe I am doing something wrong.  Or perhaps I have to directly edit the file in "/etc/libvirt/qemu".
Comment 6 Predrag Ivanović 2021-12-12 13:35:14 UTC
I've seen the same thing for a while now, across different versions of
the OVMF package (I am using the Virtualization repo for qemu/libvirt),
currently 202108-195.1

Two VMs I recently saw behaving like that both use ovmf-x86_64-ms-4m-code.bin, 
and I can't test with the smm ones; the host CPU is way too old :)

Attached are xml for 15.3 VM domain and OVMF debug log from when I upgraded the VM to 15.4 alpha and the freeze happened.
'virsh destroy $domain --graceful' is how I dealt with that so far, with, AFAICT,
no side-effects to the VM running afterwards.
Comment 7 Predrag Ivanović 2021-12-12 13:36:12 UTC
Created attachment 854502 [details]
OVMF debug log
Comment 8 Predrag Ivanović 2021-12-12 13:38:29 UTC
Created attachment 854503 [details]
VM domain XML
Comment 9 Dario Faggioli 2021-12-15 18:20:34 UTC
Ok, it still seems like a potential OVMF issue to me, so I'm assigning it to Joey, to see what he thinks about it.

That said, Michael, I think we now have the OVMF logs you wanted to see (albeit, from a different user)... Is that the case?
Comment 10 Michael Chang 2021-12-16 06:54:51 UTC
(In reply to Dario Faggioli from comment #9)
> Ok, it still seems a potential OVMF issue to me, so I'm trying to assign to
> Joey, to see what he thinks about it.
> 
> That said, Michael, I think we now have the OVMF logs you wanted to see
> (albeit, from a different user)... Is that the case?

The OVMF was trapped in this loop, likely being reset over and over again, as SecCoreStartupWithStack() is the entry point for the C code. This is certainly way over my head.

Yes, Joey would have a better idea about the OVMF internals. 

> SecCoreStartupWithStack(0xFFFCC000, 0x820000)
> Register PPI Notify: DCD0BE23-9586-40F4-B643-06522CED4EDE
> Install PPI: 8C8CE578-8A3D-4F1C-9935-896185C32DD3
> Install PPI: 5473C07A-3DCB-4DCA-BD6F-1E9689E7349A
> The 0th FV start address is 0x00000820000, size is 0x000E0000, handle is 0x820000
> Register PPI Notify: 49EDB1C1-BF21-4761-BB12-EB0031AABB39
> Register PPI Notify: EA7CA24B-DED5-4DAD-A389-BF827E8F9B38
> Install PPI: B9E0ABFE-5979-4914-977F-6DEE78C278A6
> Install PPI: DBE23AA9-A345-4B97-85B6-B226F1617389
> DiscoverPeimsAndOrderWithApriori(): Found 0xB PEI FFS files in the 0th FV
> Loading PEIM 9B3ADA4F-AE56-4C24-8DEA-F03B7558AE50
> Loading PEIM at 0x0000082C140 EntryPoint=0x0000082F58A PcdPeim.efi
> Install PPI: 06E81C58-4AD7-44BC-8390-F10265F72480
> Install PPI: 01F34D25-4DE2-23AD-3FF3-36353FF323F1
> Install PPI: 4D8B155B-C059-4C8F-8926-06FD4331DB8A
> Install PPI: A60C6B59-E459-425D-9C69-0BCC9CB27D81
> Register PPI Notify: 605EA650-C65C-42E1-BA80-91A52AB618C6
> Loading PEIM A3610442-E69F-4DF3-82CA-2360C4031A23
> Loading PEIM at 0x000008313C0 EntryPoint=0x00000832814 ReportStatusCodeRouterPei.efi
> Install PPI: 0065D394-9951-4144-82A3-0AFC8579C251
> Install PPI: 229832D3-7A30-4B36-B827-F40CB7D45436
> Loading PEIM 9D225237-FA01-464C-A949-BAABC02D31D0
> Loading PEIM at 0x00000833440 EntryPoint=0x00000834704 StatusCodeHandlerPei.efi
> Loading PEIM 222C386D-5ABC-4FB4-B124-FBB82488ACF4
> Loading PEIM at 0x00000835440 EntryPoint=0x0000083AAE0 PlatformPei.efi
> Select Item: 0x0
> FW CFG Signature: 0x554D4551
> Select Item: 0x1
> FW CFG Revision: 0x3
Comment 11 Michael Chang 2021-12-16 07:06:23 UTC
(In reply to Neil Rickert from comment #5)
> Created attachment 852631 [details]
> Output from "virsh dumpxml ubuntu19"
> 
> I am attaching the dumpxxml output.
> 
> I'm not sure how to get those ovmf logs.  I configured virt-manager to allow
> editing xml.  Then I copied the lines you suggested to just above
> "</domain>" and saved those changes.  But the resulting xml file still did
> not have those changes.  Maybe I am doing something wrong.  Or perhaps I
> have to directly edit the file in "/etc/libvirt/qemu".

Sorry somehow this fell through the cracks. :(

You could try `virsh edit ...` to change the domain XML if virt-manager somehow discards your changes. Please also remember to add the custom namespace, or the added qemu elements will be rejected:

 <domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
Comment 12 Michael Chang 2021-12-17 07:52:01 UTC
I am confused. There seems to be no file associated with your cdrom device in the VM domain XML (attachment 854503 [details]):

  <disk type="file" device="cdrom">
    <driver name="qemu" type="raw"/>
    <target dev="sda" bus="sata"/>
    <readonly/>
    <address type="drive" controller="0" bus="0" target="0" unit="0"/>
  </disk>

But the OVMF debug log (attachment 854502 [details]) shows that a cdrom was booting. Could you please help check why they don't match?

Btw, I tried several times to reproduce this on my Leap 15.3 with ovmf-x86_64-ms-4m-code.bin fully updated. This is also the default VM setup I run daily, but I still can't reproduce or see the problem ever ...

[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0004: UEFI QEMU DVD-ROM QM00001  		 0x0001
    Boot0001: opensuse-secureboot 		 0x0001
    Boot0002: UEFI Misc Device 		 0x0001
    Boot0000: UiApp 		 0x0109
    Boot0003: EFI Internal Shell 		 0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery 		 0x0001
[Bds]=============End Load Options Dumping=============

[Bds]Booting UEFI QEMU DVD-ROM QM00001 
 BlockSize : 2048 
 LastBlock : 1ED1FF 
PartitionDxe: El Torito standard found on handle 0x7E5B0C18.
 BlockSize : 2048 
 LastBlock : 3 
FatDiskIo: Cache Page OutBound occurred! 
FSOpen: Open '\EFI\BOOT\BOOTX64.EFI' Success
Comment 13 Predrag Ivanović 2021-12-17 21:11:47 UTC
Created attachment 854681 [details]
libvirt domain configuration
Comment 14 Predrag Ivanović 2021-12-17 21:29:47 UTC
(In reply to Michael Chang from comment #12)
> I am confused. There seems to be no file associated to your cdrom device in
> VM domain XML (attachment 854503 [details]) 
> 
>   <disk type="file" device="cdrom">
>     <driver name="qemu" type="raw"/>
>     <target dev="sda" bus="sata"/>
>     <readonly/>
>     <address type="drive" controller="0" bus="0" target="0" unit="0"/>
>   </disk>
> 
> But OVMF debug log (attachment 854502 [details]) has shown that cdrom was
> booting. Could you please help to check why they didn't match ?

Apologies for that, I have attached the proper file now (hopefully).

> Btw, I tried several times to reproduced on my leap15.3 with
> ovmf-x86_64-ms-4m-code.bin fully updated. This is also my default vm setup I
> am running daily, but still I can't reproduce or see the problem ever ...

I haven't figured out how to reliably reproduce it yet,
but, based on past experience and the fact that I've been seeing it occasionally
since 15.3 beta-ish, I *think* that it is triggered either by
1. a system update that requires an initrd rebuild, or 
2. a distribution upgrade (which I was doing in this case, 15.3 with XFCE to 15.4-alpha).
No hard evidence to back that up, I'm afraid.
Comment 15 Predrag Ivanović 2021-12-20 15:26:02 UTC
Created attachment 854709 [details]
OVMF debug log('systemctl reboot')
Comment 16 Predrag Ivanović 2021-12-20 15:30:11 UTC
> I haven't figured out how to reliably reproduce it yet,
> but, based on past experience, and the fact I've been seeing it occasionally
> since 15.3 beta-ish, I *think* that it is triggered either
> 1. by system update that requires the initrd rebuild, or 
> 2.distribution upgrade (which I was doing in this case, 15.3 with XFCE to
> 15.4-alpha).

3. Heavy-ish VM disk use?

The VM disk was getting low on free space, so I removed some packages and old snapshots, ~4 GB worth, then ran 'btrfs scrub', which completed without errors.
When I tried 'systemctl reboot', the screen went blank, with the cursor indicating
some activity.
OVMF debug log attached.
Comment 17 Joey Lee 2021-12-21 04:54:19 UTC
(In reply to Neil Rickert from comment #0)
> User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
> Firefox/78.0
> Build Identifier: 
> 
> I am using several virtual machines.  The host system is running Leap 15.3. 
> The virtual machines run a variety (15.3, Tumbleweed, 15.2, Solus).
> 
> On rebooting the VM, I often finish up with a black screen.  Based on what I
> see on the screen, it looks is if the VM is rebooting correctly, but is
> failing to connect to the firmware on reboot.  It seems to just loop.
> 

I just fought with bsc#1193315, and the symptom is an unlimited reboot loop. 

Could you (or anyone) try this OVMF in my home branch?

https://build.opensuse.org/package/show/home:joeyli:branches:SUSE:SLE-15-SP3:Update/ovmf

My workaround patch can fix the bsc#1193315. Maybe it also works here.
Comment 18 Joey Lee 2021-12-21 04:58:20 UTC
(In reply to Joey Lee from comment #17)
> (In reply to Neil Rickert from comment #0)
> > User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
> > Firefox/78.0
> > Build Identifier: 
> > 
> > I am using several virtual machines.  The host system is running Leap 15.3. 
> > The virtual machines run a variety (15.3, Tumbleweed, 15.2, Solus).
> > 
> > On rebooting the VM, I often finish up with a black screen.  Based on what I
> > see on the screen, it looks is if the VM is rebooting correctly, but is
> > failing to connect to the firmware on reboot.  It seems to just loop.
> > 
> 
> I just fought with the bsc#1193315 and the symptom is unlimited reboot.
                        ^^^^^^^^^^^^
                           bsc#1192126, sorry for my typo!
 
> 
> Could you (or anyone) try this OVMF in my home branch?
> 
> https://build.opensuse.org/package/show/home:joeyli:branches:SUSE:SLE-15-SP3:
> Update/ovmf
> 
> My workaround patch can fix the bsc#1193315. Maybe it also works here.
Comment 19 Neil Rickert 2021-12-21 06:24:52 UTC
Responding to Joey Lee c#18

I downloaded ovmf-202008-10.13.1.x86_64.rpm and installed that.  But the same misbehavior occurs.

I'm not sure how all of this works.  I would have thought that "qemu-ovmf-x86_64" needed to be updated, but I did not see an update for that.
Comment 20 Joey Lee 2021-12-21 07:21:22 UTC
(In reply to Neil Rickert from comment #19)
> Responding to Joey Lee c#18
> 
> I downloaded ovmf-202008-10.13.1.x86_64.rpm and installed that.  But the
> same misbehavior occurs.

The ovmf-202008 package only includes some EFI tools, not the OVMF binary.

> 
> I'm not sure how all of this works.  I would have thought that
> "qemu-ovmf-x86_64" needed to be updated, but I did not see an update for
> that.

Please download qemu-ovmf-x86_64-202008-10.13.1.noarch.rpm from here:

https://build.opensuse.org/package/binaries/home:joeyli:branches:SUSE:SLE-15-SP3:Update/ovmf/pool-leap-15.3

Then install and test.

Or you may want to add my branch to your zypper repo list:

https://download.opensuse.org/repositories/home:/joeyli:/branches:/SUSE:/SLE-15-SP3:/Update/pool-leap-15.3/

Thanks!
Comment 21 Neil Rickert 2021-12-21 15:24:10 UTC
Responding to c#20

I added your repo, and updated "qemu-ovmf-x86_64" to 202008-10.13.1

But still the same misbehavior.

From what I see, it does look like OVMF crashing/resetting in a loop, as in bug 1187245 .  Perhaps I need to rebuild that virtual machine:

(1) delete the VM but retain the disk image
(2) create new VM importing the disk image
(3) configure that VM to use ovmf-x86_64-ms-4m-code.bin (the one currently being used).
Comment 22 Joey Lee 2021-12-21 15:48:32 UTC
(In reply to Neil Rickert from comment #21)
> Responding to c#20
> 
> I added your repo, and updated "qemu-ovmf-x86_64" to 202008-10.13.1
> 
> But still the same misbehavior.
> 
> From what I see, it does look like OVMF crashing/resetting in a loop, as in
> bug 1187245 .  Perhaps I need to rebuild that virtual machine:
> 
> (1) delete the VM but retain the disk image
> (2) create new VM importing the disk image
> (3) configure that VM to use ovmf-x86_64-ms-4m-code.bin (the one currently
> being used).

You do not need to rebuild the VM image. 

Could you please attach the guest's dmesg log to bugzilla after booting with the new OVMF? Please add the following kernel parameter in /boot/grub2/grub.cfg:

efi=debug

Then boot to the console, run "dmesg > dmesg.log", and attach dmesg.log. If you run with the _right_ OVMF image, then we should see the following region in the EFI memory map:

[    0.000000] efi: mem06: [ACPI Mem NVS|   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000080b000-0x000000000080bfff] (0MB) 

0x80b000 is the start address of PcdSevEsWorkArea.
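As a quick sanity check, that reservation can be grepped out of the captured log. This is a sketch that embeds the sample line from above as test data; in practice you would grep the guest's real dmesg output (e.g. `grep 'ACPI Mem NVS' dmesg.log`) instead:

```shell
# Sketch: check whether the SEV-ES work area shows up as an ACPI NVS
# region in the EFI memory map. The sample line is copied from this
# comment; with a patched OVMF, grep the real dmesg output instead.
sample='[    0.000000] efi: mem06: [ACPI Mem NVS|   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x000000000080b000-0x000000000080bfff] (0MB)'
if echo "$sample" | grep -q 'ACPI Mem NVS.*80b000'; then
  echo "SEV-ES work area reserved"
fi
```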
Comment 23 Joey Lee 2021-12-21 16:41:00 UTC
(In reply to Joey Lee from comment #22)
> (In reply to Neil Rickert from comment #21)
> > Responding to c#20
> > 
> > I added your repo, and updated "qemu-ovmf-x86_64" to 202008-10.13.1
> > 
> > But still the same misbehavior.
> > 
> > From what I see, it does look like OVMF crashing/resetting in a loop, as in
> > bug 1187245 .  Perhaps I need to rebuild that virtual machine:
> > 
> > (1) delete the VM but retain the disk image
> > (2) create new VM importing the disk image
> > (3) configure that VM to use ovmf-x86_64-ms-4m-code.bin (the one currently
> > being used).
> 
> You do not need to rebuild the VM image. 
> 
> Could you please attach guest's dmesg log on bugzilla after booting with new
> OVMF? Please add the following kernel parameter in /boot/grub2/grub.cfg:
> 
> efi=debug
> 
> Then boot to console and run "dmesg > dmesg.log" then attach dmesg.log. If
> you run with the _right_ OVMF image, then we should see the following region
> in EFI memory map:
> 
> [    0.000000] efi: mem06: [ACPI Mem NVS|   |  |  |  |  |  |  |  |  |  
> |WB|WT|WC|UC] range=[0x000000000080b000-0x000000000080bfff] (0MB) 
> 
> The 0x80b000 is the start address of PcdSevEsWorkArea.

On the other hand, please add the "-d cpu_reset" qemu parameter when reproducing the issue. It prints the CPU registers when the system resets.
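For libvirt users, the same debug flag can be passed through the qemu:commandline mechanism shown earlier in this bug. A sketch (the qemu namespace must be declared on the domain root; the -D log path is a placeholder example):

```xml
<!-- Sketch: pass "-d cpu_reset" through libvirt. Requires the
     xmlns:qemu namespace on <domain>; the log path is a placeholder. -->
<qemu:commandline>
  <qemu:arg value='-d'/>
  <qemu:arg value='cpu_reset'/>
  <qemu:arg value='-D'/>
  <qemu:arg value='/tmp/myvm-cpu-reset.log'/>
</qemu:commandline>
```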
Comment 24 Neil Rickert 2021-12-21 19:46:04 UTC
Created attachment 854744 [details]
Attaching "dmesg.log" as requested.

I am not seeing the line that you expected in that "dmesg" output.

I tried rebuilding the VM (I first cloned, then rebuilt the clone).  It did not help.

Perhaps the "ovmf" image that I am using was not patched.

I should perhaps mention that when I first reported this, I was seeing the issue  with both "ovmf-x86_64-ms-4m-code.bin" and "ovmf-x86_64-smm-ms-code.bin".  Then at some time there was an "ovmf" update, and after that update I only saw the issue with "ovmf-x86_64-ms-4m-code.bin".
Comment 25 Joey Lee 2021-12-22 04:08:51 UTC
(In reply to Neil Rickert from comment #24)
> Created attachment 854744 [details]
> Attaching "dmesg.log" as requested.
> 
> I am not seeing the line that you expected in that "dmesg" output.
> 
> I tried rebuilding the VM (I first cloned, then rebuilt the clone).  It did
> not help.
> 
> Perhaps the "ovmf" image that I am using was not patched.
> 
> I should perhaps mention that when I first reported this, I was seeing the
> issue  with both "ovmf-x86_64-ms-4m-code.bin" and
> "ovmf-x86_64-smm-ms-code.bin".  Then at some time there was an "ovmf"
> update, and after that update I only saw the issue with
> "ovmf-x86_64-ms-4m-code.bin".

Thanks! The dmesg log is useful. I know what the problem is now.

When S3 is disabled in the VM, the PcdSevEsWorkArea is reserved as an EfiBootServicesData region. That region is marked as usable after booting to the OS, so the kernel can still write to it. When the system reboots, the kernel's writes to the region trigger the unlimited reset loop.

I am preparing a new workaround patch for this problem.
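In edk2 terms, the kind of change being described would look roughly like the following. This is a pseudocode-style sketch, not the actual patch (attached later in this bug); the PCD accessor names are assumptions following the PcdSevEsWorkArea naming used in this bug:

```c
// Pseudocode-style sketch (not the actual patch): in OvmfPkg/PlatformPei,
// always reserve the SEV-ES work area as ACPI NVS, so the OS never hands
// the range back to the kernel as usable RAM.
BuildMemoryAllocationHob (
  (EFI_PHYSICAL_ADDRESS)(UINTN)FixedPcdGet32 (PcdSevEsWorkArea),
  (UINT64)FixedPcdGet32 (PcdSevEsWorkAreaSize),
  EfiACPIMemoryNVS   // previously EfiBootServicesData when S3 is disabled
  );
```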
Comment 26 Joey Lee 2021-12-22 04:39:25 UTC
Created attachment 854757 [details]
ovmf-bsc1192126-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch

Updated workaround patch: always reserve the SEV-ES work area as an ACPI NVS region.
Comment 27 Joey Lee 2021-12-22 06:07:20 UTC
Hi Neil, 

(In reply to Joey Lee from comment #26)
> Created attachment 854757 [details]
> ovmf-bsc1192126-OvmfPkg-PlatformPei-Always-reserve-the-SEV-ES-work-a.patch
> 
> Updated workaround patch. Always reserved the SEV-ES work area as a ACPI NVS
> region.

I just built the new workaround patch into the ovmf package in my home branch:

https://build.opensuse.org/package/binaries/home:joeyli:branches:SUSE:SLE-15-SP3:Update/ovmf/pool-leap-15.3

Could you please help test qemu-ovmf-x86_64-202008-10.14.1.noarch.rpm again?
Comment 28 Neil Rickert 2021-12-22 14:43:38 UTC
I updated both "ovmf" and "qemu-ovmf-x86_64" to 202008-10.14.1 using your repo.

It now seems to behave the way it should (rebooting normally).

Thanks.
Comment 29 Joey Lee 2021-12-23 02:11:24 UTC
(In reply to Neil Rickert from comment #28)
> I updated both "ovmf" and "qemu-ovmf-x86_64" to 202008-10.14.1 using your
> repo.
> 
> It now seems to behave the way it should (rebooting normally).
> 
> Thanks.

Thanks for your testing!

The workaround patch will be pushed to SLE15-SP3 in IBS and then duplicated to Leap 15.3.
Comment 30 Joey Lee 2021-12-24 05:23:39 UTC
(In reply to Joey Lee from comment #29)
> (In reply to Neil Rickert from comment #28)
> > I updated both "ovmf" and "qemu-ovmf-x86_64" to 202008-10.14.1 using your
> > repo.
> > 
> > It now seems to behave the way it should (rebooting normally).
> > 
> > Thanks.
> 
> Thanks for your testing!
> 
> The workaround patch will be pushed to SLE15-SP3 in IBS then duplicate to
> Leap 15.3

The patch has been merged to SLE15-SP3/ovmf. Waiting for the change to be duplicated to Leap 15.3 in OBS.
Comment 31 Neil Rickert 2022-01-04 00:59:58 UTC
The OVMF update for 15.3 showed up today.  And everything seems to be working as it should.

I'll close this as fixed.  Thanks.