Bug 1162320

Summary: grub2-arm64-efi upgrade 2.02-lp151.21.6.1 -> 2.02-lp151.21.9.1 causes system to fail reboot with "error: symbol `grub_efi_allocate_any_pages' not found."
Product: [openSUSE] openSUSE Distribution Reporter: Oliver Kurz <okurz>
Component: Bootloader Assignee: Michael Chang <mchang>
Status: RESOLVED FIXED QA Contact: Jiri Srain <jsrain>
Severity: Major    
Priority: P5 - None CC: afaerber, arvidjaar, fvogt, hluo, iforster, igonzalezsosa, jcheung, mchang, okurz, rw
Version: Leap 15.1   
Target Milestone: ---   
Hardware: aarch64   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Oliver Kurz 2020-01-31 08:12:21 UTC
## Observation

On the machine "aarch64" within the openqa.opensuse.org system, which is a transactional system, the system failed to boot after a nightly system upgrade on 2020-01-14. I could pinpoint that the upgrade of grub2-arm64-efi 2.02-lp151.21.6.1 -> 2.02-lp151.21.9.1 triggers
error: symbol `grub_efi_allocate_any_pages' not found.
when trying to load the kernel and initrd image after grub.

See
https://progress.opensuse.org/issues/62102
for details where I originally recorded the problem.


## Workaround

So far I did `zypper al grub2-arm64-efi`.
Comment 1 Michael Chang 2020-01-31 08:58:46 UTC
Hi Oliver

Would you please check whether /boot/grub2/arm64-efi is a btrfs subvolume? You can also refer to bug 1122591, comment 2 for more details.
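
One possible way to run that check is sketched below. It assumes GNU coreutils `stat` is available and relies on the btrfs convention that the top directory of a subvolume has inode number 256. The function name `is_btrfs_subvolume` is made up for this illustration.

```shell
#!/bin/sh
# Sketch: report whether a directory is the root of a btrfs subvolume.
# Assumption: GNU coreutils `stat` is installed.
is_btrfs_subvolume() {
    path=$1
    [ -d "$path" ] || return 1
    # Type of the filesystem that holds the path, e.g. "btrfs" or "msdos".
    fstype=$(stat -f -c %T "$path") || return 1
    # On btrfs the top directory of every subvolume has inode number 256.
    inode=$(stat -c %i "$path") || return 1
    [ "$fstype" = "btrfs" ] && [ "$inode" -eq 256 ]
}

if is_btrfs_subvolume /boot/grub2/arm64-efi; then
    echo "/boot/grub2/arm64-efi is a btrfs subvolume"
else
    echo "/boot/grub2/arm64-efi is not a btrfs subvolume"
fi
```

Note that the path to check is /boot/grub2/arm64-efi itself, not /boot/efi.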
Comment 2 Oliver Kurz 2020-02-03 15:45:58 UTC
No, it's vfat:

```
/dev/sda1 on /boot/efi type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
```

So I guess I can apply the same workaround as mentioned in
https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c28
though no real fix is available yet. Thanks so far.
Comment 3 Oliver Kurz 2020-03-10 09:52:36 UTC
@mchang may I ask, what are your plans regarding a fix? What ETA should I expect? days/weeks/months/years?
Comment 4 Michael Chang 2020-03-10 10:38:26 UTC
(In reply to Oliver Kurz from comment #3)
> @mchang may I ask, what are your plans regarding a fix? What ETA should I
> expect? days/weeks/months/years?

Hi Oliver,

No, it is not grub. The installer is in charge of setting up the subvolumes, and it seems that no subvolumes were proposed for the system initially.
Comment 5 Michael Chang 2020-03-10 10:41:06 UTC
Please see bug 1122591, comment 6
Comment 6 Michael Chang 2020-03-10 10:58:31 UTC
Admittedly I am not much into openQA, but bsc#1097235 claims that the problem has been fixed, so why does openQA still (constantly) report the failure? Is there anything in the test case that should be checked more closely?
Comment 7 Oliver Kurz 2020-03-10 11:23:54 UTC
If you really think https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c6 is the right solution, which I would find very bad as users can unknowingly run into unbootable systems, then please at least update this and the other bug accordingly to communicate this intention. This bug here is still in "NEW" and you are the bug assignee, but I guess it should be in "CONFIRMED" or "IN_PROGRESS" with an update on the actual plan, right? Or even set it to "RESOLVED" with the according resolution, e.g. pointing to the release notes that cover this for Leap and other distributions.

I just checked again on the machine "aarch64" and just upgrading the package grub2-arm64-efi still breaks the boot with the same error as originally reported. Please keep in mind that this bug here is about an upgrade on a physical machine, so neither "migration" between different versions of products nor an openQA test itself.

Also, the workaround as mentioned in https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c28 did not work for me. I assume that the instructions are incomplete, e.g. if the btrfs subvolume needs to be mounted or so. There is also an unanswered question about this in https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c34

> Admittedly I am not much into openQA, but sourcing from bsc#1097235 claimed that the problem has been fixed, why openQA still (constantly) notify the failure result ? Is there anything from the test case has to check for a more close result ?

Regarding openQA test failure reminders, as https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c49 states:

> This bug is still referenced in a failing openQA test: 
> migration_media+scc_sle15_ha_alpha_node01
> https://openqa.suse.de/tests/3886491
>
> To prevent further reminder comments one of the following options should be followed:
> 1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
> 2. The openQA job group is moved to "Released"
> 3. The label in the openQA scenario is removed

https://openqa.suse.de/tests/3886491#step/patch_sle/106 is the location in test results which seems to record the soft failure which comes from https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/2e020725b70964a3465730411694e2849b3d487a/lib/utils.pm#L1583 after checking that the necessary btrfs subvolume is not already there. So I consider the products still affected because either the bugfix is not effective or never reached the affected product(s).
Comment 8 Michael Chang 2020-03-11 06:28:29 UTC
(In reply to Oliver Kurz from comment #7)
> If you really think https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c6
> is the right solution which I would find very bad as users can unknowingly
> run into unbootable systems then please at least update this and other other
> bug accordingly to communicate this intention.

It seems to me there is no other way out either, and manual intervention has always been necessary to deal with such issues. CCed Andreas here in case he has anything to comment.

@ Andreas,

Do you have any update on the documentation or bsc#1122591 in general? From bug 1122591, comment 29 it's not clear whether the workaround was eventually picked up by the release notes (cf. fate#327771).


> This bug here is still in
> "NEW" and you are the bug assignee but I guess it should be in "CONFIRMED"
> or "IN_PROGRESS" with an update on the actual plan, right? Or even set it to
> "RESOLVED" with the according resolution, e.g. pointing to the release notes
> that cover this for Leap and other distributions.

My apologies, I misinterpreted your comment#2 as agreement with the workaround steps and thought the issue could perhaps be closed. I should have made that obvious and clear, and should have followed through better.

> 
> I just checked again on the machine "aarch64" and just upgrading the package
> grub2-arm64-efi still breaks the boot with the same error as originally
> reported. Please keep in mind that this bug here is about an upgrade on a
> physical machine, so neither "migration" between different versions of
> products nor an openQA test itself.

Is it on a transactional server? I think a grub version upgrade is enough to trigger the problem sooner or later, as older/newer modules in the btrfs root tree can be exposed to grub if they are not exempted from the rollback operation via a separate subvolume. The way a transactional server performs its updates makes that easier to happen.

> Also, the workaround as mentioned in
> https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c28 did not work for
> me. I assume that the instructions are incomplete, e.g. if the btrfs
> subvolume needs to be mounted or so. There is also an unanswered question
> about this in https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c34

Ah, yes, I think an fstab mount point is necessary. Just creating the subvolume would only work once; after a rollback the new root won't have the subvolume in place, so we need an fstab entry that mounts the missing subvolume where it belongs ...
 
> > Admittedly I am not much into openQA, but sourcing from bsc#1097235 claimed that the problem has been fixed, why openQA still (constantly) notify the failure result ? Is there anything from the test case has to check for a more close result ?
> 
> Regarding openQA test failure reminders, as
> https://bugzilla.opensuse.org/show_bug.cgi?id=1122591#c49 states:
> 
> > This bug is still referenced in a failing openQA test: 
> > migration_media+scc_sle15_ha_alpha_node01
> > https://openqa.suse.de/tests/3886491
> >
> > To prevent further reminder comments one of the following options should be followed:
> > 1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
> > 2. The openQA job group is moved to "Released"
> > 3. The label in the openQA scenario is removed
> 
> https://openqa.suse.de/tests/3886491#step/patch_sle/106 is the location in
> test results which seems to record the soft failure which comes from
> https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/
> 2e020725b70964a3465730411694e2849b3d487a/lib/utils.pm#L1583 after checking
> that the necessary btrfs subvolume is not already there. So I consider the
> products still affected because either the bugfix is not effective or never
> reached the affected product(s).

Thanks a lot for pointing me to this, as I was struggling to find what's going wrong in openQA. After all, I think it is good to begin with fixing the openQA test case so that we can verify the problem has really been understood and fixed. Please give me some time to digest the test script; I will get back to you if I find anything that can be improved in this regard.
Comment 9 Michael Chang 2020-03-11 06:32:47 UTC
Add Andreas and Huajian to the CC list.

Hi Andreas and Huajian,

Would you please help to have a look at comment#8, as it relates to bsc#1122591 ...
Thanks.
Comment 10 Huajian Luo 2020-03-11 06:51:35 UTC
Yes, while running the ARM migration test we hit this issue, and the developer said to update the documentation so that users manually create the subvolume to avoid this bug.
Andreas suggested creating a subvolume for the snapshot (btrfs subvolume create /path), not a new snapshot.
Comment 11 Michael Chang 2020-03-11 07:27:22 UTC
(In reply to Michael Chang from comment #8)
> (In reply to Oliver Kurz from comment #7)

> And please bear me some time to digest the test script and will get
> back to you if I find anything can be improved in this regard.

It turns out I have misinterpreted the problem again. The openQA script is fine: it records a soft failure because the subvolume for /boot/grub2/arm64-efi has not been created, not because the workaround failed and needs improvement ..
   
@ Huajian,

Is my understanding correct? If so, I think we have to CC Fabian to check the status of bsc#1097235, as yast seems to have fixed that.
Comment 12 Huajian Luo 2020-03-11 07:29:29 UTC
Yes, you can. Thanks for the update.
Comment 13 Michael Chang 2020-03-11 08:03:51 UTC
Hi Fabian,

Would you please shed some light on the status of the fix for grub's btrfs subvolume on aarch64? According to the openSUSE openQA test result it is missing in Factory. Thanks in advance.
Comment 14 Fabian Vogt 2020-03-11 08:17:00 UTC
(In reply to Michael Chang from comment #13)
> Hi Fabian,
> 
> Would you please shed some light on the status of fixing grub's btrfs
> subvolume for aarch64 ? According to openSUSE openQA test result it is
> missing in factory. Thanks in advance.

It was added two years ago: https://github.com/yast/skelcd-control-openSUSE/pull/138

Apparently to the wrong section though, it's only for the "serverro" system role.
Comment 15 Fabian Vogt 2020-03-11 16:09:28 UTC
AFAICT it's possible to have the grub2 package do the migration to a new subvolume in its %post script.

Only if /boot/grub2/arm64-efi is a subvolume does grub2-install add "btrfs-mount-subvol ($root) /boot/grub2/arm64-efi @/boot/grub2/arm64-efi" to the embedded config of bootaa64.efi. This means that if it was not a subvolume at grub2-install time, it'll use the modules that are part of the default subvol, like before.

So if the %post script creates the subvolume (wouldn't even need to copy the modules), the next grub2-install call will make use of it and the mere presence of the subvolume won't break rollbacks or upgrades.
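
As a rough illustration of the behaviour described above (a sketch only, not grub2-install's actual code): the embedded config gains the `btrfs-mount-subvol` line only when the module directory is a subvolume. The helper name `embedded_config_for` and its yes/no argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of grub2-install: prints the extra embedded
# config line for a module directory, or nothing if it is not a subvolume.
# $1: module directory, $2: "yes" if that directory is a btrfs subvolume
embedded_config_for() {
    dir=$1
    is_subvol=$2
    if [ "$is_subvol" = "yes" ]; then
        # Same shape as the line quoted in the comment above.
        printf 'btrfs-mount-subvol ($root) %s @%s\n' "$dir" "$dir"
    fi
}

# With a subvolume the line is emitted; without one the config stays unchanged.
embedded_config_for /boot/grub2/arm64-efi yes
embedded_config_for /boot/grub2/arm64-efi no
```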

I wonder why this subvolume is needed for EFI though, for secure boot grub.efi is unable to load any modules anyway (there is no embedded config), so it has to boot without those already. On x86_64 it seems like shim is used by default even in situations without secure boot, so module loading isn't possible there either (I checked, ls $prefix/x86_64-efi is empty).
Comment 16 Michael Chang 2020-03-12 03:21:41 UTC
(In reply to Fabian Vogt from comment #14)
> (In reply to Michael Chang from comment #13)

> It was added two years ago:
> https://github.com/yast/skelcd-control-openSUSE/pull/138
> 
> Apparently to the wrong section though, it's only for the "serverro" system
> role.

May I ask a dumb question: what is the "serverro" role? Do you think it is reasonable to ask the yast team to have a look? Thanks in advance.
Comment 17 Michael Chang 2020-03-12 04:04:24 UTC
(In reply to Fabian Vogt from comment #15)
> AFAICT it's possible to have the grub2 package do the migration to a new
> subvolume in its %post script.

It is possible, but I really don't want to go down that road unless "nothing else helps" in the end ...
 
> Only if /boot/grub2/arm64-efi is a subvolume, grub2-install adds
> "btrfs-mount-subvol ($root) /boot/grub2/arm64-efi @/boot/grub2/arm64-efi" to
> the embedded config to bootaa64.efi. This means that if was not a subvolume
> at grub2-install time, it'll use the modules part of the default subvol,
> like before.

Even though grub-install can handle dynamic detection of subvolumes, creating the subvolume itself can have more unexpected outcomes ...

For example, we cannot create a subvolume outside the reach of the default root tree, so we would have to mount the target root tree somewhere to create the subvolume ...

The problems here are:
1. The target root tree for subvolumes, and the mount points, are decided not by the package but by the system-wide configuration ..
2. Is mounting permissible in %post?
 
> So if the %post script creates the subvolume (wouldn't even need to copy the
> modules), the next grub2-install call will make use of it and the mere
> presence of the subvolume won't break rollbacks or upgrades.

The sanest fix should also add the subvolume to /etc/fstab, so that a rollback or upgrade (i.e. changing the default root) re-mounts it at the desired place. Otherwise the workaround would be applied to the new root tree every time and waste space (and time) ..

Then again, I don't know whether it is permissible to modify /etc/fstab from %post ...

> I wonder why this subvolume is needed for EFI though, for secure boot
> grub.efi is unable to load any modules anyway (there is no embedded config),
> so it has to boot without those already. On x86_64 it seems like shim is
> used by default even in situations without secure boot, so module loading
> isn't possible there either (I checked, ls $prefix/x86_64-efi is empty).

Yes. With secure boot turned on it is not necessary. But we still care about the case where secure boot is turned off, and give users the ability to load modules for their own benefit ..
Comment 18 Fabian Vogt 2020-03-12 08:07:54 UTC
(In reply to Michael Chang from comment #16)
> (In reply to Fabian Vogt from comment #14)
> > (In reply to Michael Chang from comment #13)
> 
> > It was added two years ago:
> > https://github.com/yast/skelcd-control-openSUSE/pull/138
> > 
> > Apparently to the wrong section though, it's only for the "serverro" system
> > role.
> 
> May I ask dumb question : what is "serverro" role ?

"Transactional Server"

> Did you think it is
> reasonable to include yast team to have a look ? Thanks in advance.

Yes, unless you want to create the PR itself - it should only involve copying
a few lines of XML.

(In reply to Michael Chang from comment #17)
> (In reply to Fabian Vogt from comment #15)
> > AFAICT it's possible to have the grub2 package do the migration to a new
> > subvolume in its %post script.
> 
> It is possible, but I really don't want to enter it unless "nothing else
> helps" in the end ...

Currently it's an error-prone (by not reading the release notes...) manual task,
so I'd say that doing it automated is much better.

> > Only if /boot/grub2/arm64-efi is a subvolume, grub2-install adds
> > "btrfs-mount-subvol ($root) /boot/grub2/arm64-efi @/boot/grub2/arm64-efi" to
> > the embedded config to bootaa64.efi. This means that if was not a subvolume
> > at grub2-install time, it'll use the modules part of the default subvol,
> > like before.
> 
> Even though grub-install can handle dynamic detection of subvolumes, the
> creation of subvolume itself can bring more unexpected outcome ...
> 
> For example, we cannot create subvolume out of reach by default root tree,
> therefore we have to mount target root tree somewhere to create subvolume ...
>
> The problem here is
> 1. Deciding of the target root tree for subvolumes, and mount points, is not
> by the package but the system wide configuration .. 
> 2. Is mount permissible in %post ?

There are already other packages (kuberneted, read-only-root-fs, systemd) which
use mksubvolume in %post, so I'd say it's tested meanwhile.

Mounting wouldn't even be necessary, as grub2-install would handle both situations
properly AFAICT. The combination of mounting + forced grub2-install would allow
deleting the then unused modules from the snapshot itself though, which would save
a few MiB of space.
  
> > So if the %post script creates the subvolume (wouldn't even need to copy the
> > modules), the next grub2-install call will make use of it and the mere
> > presence of the subvolume won't break rollbacks or upgrades.
> 
> The sanest fix should consider adding the subvolume to /etc/fstab, so that
> rollback or upgrade (ie changing the default root) could re-mount it to the
> desired place. Otherwise the workaround would apply to new root tree
> everytime and waste space (and time) ..
>
> Yet again I didn't know it is permissible to modify /etc/fstab from %post ...

mksubvolume does that automatically AFAIK.

> > I wonder why this subvolume is needed for EFI though, for secure boot
> > grub.efi is unable to load any modules anyway (there is no embedded config),
> > so it has to boot without those already. On x86_64 it seems like shim is
> > used by default even in situations without secure boot, so module loading
> > isn't possible there either (I checked, ls $prefix/x86_64-efi is empty).
> 
> Yes. For secure boot turned on it is not necessary. But we still care about
> secure boot turned off and provide user the capability to load modules for
> their own benefits ..

Which modules? If they are needed for booting/normal use, they should be
included in the .efi binary and if they're not needed they wouldn't have
to be loaded anyway.
Comment 19 Andreas Färber 2020-03-12 11:09:56 UTC
Fabian, I thought the GRUB sub-packages got converted to noarch, so that the aarch64 binaries can be installed on an x86_64 PXE server, too. Creating aarch64-specific volumes in %post on other architectures does not sound like a good idea.
Comment 21 Michael Chang 2020-03-12 11:35:03 UTC
(In reply to Fabian Vogt from comment #18)
> (In reply to Michael Chang from comment #16)
> > (In reply to Fabian Vogt from comment #14)
> > > (In reply to Michael Chang from comment #13)
> > 
> > > It was added two years ago:
> > > https://github.com/yast/skelcd-control-openSUSE/pull/138
> > > 
> > > Apparently to the wrong section though, it's only for the "serverro" system
> > > role.
> > 
> > May I ask dumb question : what is "serverro" role ?
> 
> "Transactional Server"

Unfortunately this bug is reported on Transactional Server ...

> Yes, unless you want to create the PR itself - it should only involve copying
> a few lines of XML.

OK. CCing Imobach Gonzalez Sosa.

@Imobach

Would you please check the PR https://github.com/yast/skelcd-control-openSUSE/pull/138 and why it does not seem to be in effect for Transactional Server here?

> Currently it's an error-prone (by not reading the release notes...) manual
> task,
> so I'd say that doing it automated is much better.

Sure, the user experience is better; on the other hand, just like with any other workaround, we care about whether the risk is worth the effort. Also, the root cause is not yet known, as it seems yast should have taken care of this during installation already ...

> There are already other packages (kuberneted, read-only-root-fs, systemd)
> which
> use mksubvolume in %post, so I'd say it's tested meanwhile.

Thanks for the info. I had come across the mksubvolume tool before but didn't know that any package used it in %post ..
 
> Mounting wouldn't even be necessary, as grub2-install would handle both
> situations
> properly AFAICT. The combination of mounting + forced grub2-install would
> allow
> deleting the then unused modules from the snapshot itself though, which
> would save
> a few MiB of space.

I don't think grub-install is smart enough to know when and how to mount the subvolume, or am I missing something here ?

> > Yet again I didn't know it is permissible to modify /etc/fstab from %post ...
> 
> mksubvolume does that automatically AFAIK.

Then it looks more and more sweet. 

> Which modules? If they are needed for booting/normal use, they should be
> included in the .efi binary and if they're not needed they wouldn't have
> to be loaded anyway.

I think we have included most modules, but there are still many that are not included. It is hard to tell in advance what is needed, so we still allow loading modules with secure boot disabled to fill the gap; afterwards users can report what they needed and we'll see whether it is a reasonable request or not.
Comment 22 Fabian Vogt 2020-03-12 13:14:43 UTC
(In reply to Andreas Färber from comment #19)
> Fabian, I thought the GRUB sub-packages got converted to noarch, so that
> they aarch64 binaries can be installed on an x86_64 PXE server, too.
> Creating aarch64-specific volumes in %post on other architectures does not
> sound like a good idea.

Good point, %ifarch aarch64 would be enough to fix that though.
YaST creates the subvolume even if grub isn't used, so it wouldn't have to check for the default bootloader in %post either,
that would be an inconsistency otherwise.

(In reply to Michael Chang from comment #21)
> (In reply to Fabian Vogt from comment #18)
> > (In reply to Michael Chang from comment #16)
> > > (In reply to Fabian Vogt from comment #14)
> > > > (In reply to Michael Chang from comment #13)
> > > 
> > > > It was added two years ago:
> > > > https://github.com/yast/skelcd-control-openSUSE/pull/138
> > > > 
> > > > Apparently to the wrong section though, it's only for the "serverro" system
> > > > role.
> > > 
> > > May I ask dumb question : what is "serverro" role ?
> > 
> > "Transactional Server"
> 
> Unfortunately this bug is reported on Transactional Server ...

Maybe the system was installed using Leap 15.0 and then upgraded to 15.1 at some point?
15.0 doesn't have the arm64-efi subvolume at all.

> > Currently it's an error-prone (by not reading the release notes...) manual
> > task,
> > so I'd say that doing it automated is much better.
> 
> Sure user's experience is better, on the othe hand just like any other
> workaround we care about the risk are worthwhile the effort. Also the root
> cause has yet to know, as it seems yast should take care during installation
> already ...

It's entirely missing from control.xml on Leap <= 15.0 and for
>= 15.1 it's only part of serverro. So there definitely needs to be a control.xml
fix anyway. This would also be the root cause, unless the affected system was
installed from a known fixed Leap 15.1.

> > Mounting wouldn't even be necessary, as grub2-install would handle both
> > situations
> > properly AFAICT. The combination of mounting + forced grub2-install would
> > allow
> > deleting the then unused modules from the snapshot itself though, which
> > would save
> > a few MiB of space.
> 
> I don't think grub-install is smart enough to know when and how to mount the
> subvolume, or am I missing something here ?

If /boot/grub2/arm64-efi is not a subvolume, it builds a grub.efi image without
"btrfs-mount-subvol" in the embedded config. Which means that the installed modules
would be found in the right place.

> > Which modules? If they are needed for booting/normal use, they should be
> > included in the .efi binary and if they're not needed they wouldn't have
> > to be loaded anyway.
> 
> I think we have included most modules, but still there are many not
> included. It is hard to tell in advance what is needed, so that we still
> allow loading modules via disabling secure boot to fill the gap and after
> they can report and we'll see if that's reasonable request or not.

Which seems to be the case for this report here, as otherwise grub wouldn't
have attempted to load a mismatching module at all, right?
So by adding whatever module grub failed to load here, the upgrade issue would also
be worked around (at least temporarily).
Comment 24 Andreas Färber 2020-03-12 13:39:36 UTC
(In reply to Fabian Vogt from comment #22)
> (In reply to Andreas Färber from comment #19)
> > Fabian, I thought the GRUB sub-packages got converted to noarch, so that
> > they aarch64 binaries can be installed on an x86_64 PXE server, too.
> > Creating aarch64-specific volumes in %post on other architectures does not
> > sound like a good idea.
> 
> Good point, %ifarch aarch64 would be enough to fix that though.

No, we can't use %ifarch in a noarch package.
One could use uname though, for instance.
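
A minimal sketch of such a runtime check in a scriptlet, assuming plain POSIX shell. The function name `should_create_arm64_subvol` is hypothetical and the subvolume creation itself is left as a placeholder.

```shell
#!/bin/sh
# Hypothetical %post fragment for a noarch package: decide at run time,
# not at build time, whether the host is aarch64.
# $1 (optional): machine name; defaults to the running system's `uname -m`.
should_create_arm64_subvol() {
    machine=${1:-$(uname -m)}
    [ "$machine" = "aarch64" ]
}

if should_create_arm64_subvol; then
    # Placeholder only: the real scriptlet would create the subvolume here.
    echo "aarch64 host: would set up the /boot/grub2/arm64-efi subvolume"
fi
```

The optional argument exists only so the check can be exercised on a non-aarch64 machine.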
Comment 25 Fabian Vogt 2020-03-12 13:40:46 UTC
(In reply to Andreas Färber from comment #24)
> (In reply to Fabian Vogt from comment #22)
> > (In reply to Andreas Färber from comment #19)
> > > Fabian, I thought the GRUB sub-packages got converted to noarch, so that
> > > they aarch64 binaries can be installed on an x86_64 PXE server, too.
> > > Creating aarch64-specific volumes in %post on other architectures does not
> > > sound like a good idea.
> > 
> > Good point, %ifarch aarch64 would be enough to fix that though.
> 
> No, we can't use %ifarch in a noarch package.
> One could use uname though, for instance.

Oops, of course.
Comment 26 Imobach Gonzalez Sosa 2020-03-12 16:35:48 UTC
(In reply to Michael Chang from comment #21)

[..]

> > Yes, unless you want to create the PR itself - it should only involve copying
> > a few lines of XML.
> 
> OK. CCing Imobach Gonzalez Sosa.
> 
> @Imobach
> 
> Would you please check PR for
> https://github.com/yast/skelcd-control-openSUSE/pull/138 and why it seems
> not in transactional server here ?

Hi Michael,

That PR adds the subvolume only for the Transactional Server role. If we want the subvolume to be present no matter the role the user has selected, we need to add the specification to the general subvolumes list. This PR does it for openSUSE 15.2: https://github.com/yast/skelcd-control-openSUSE/pull/197. Could you have a look, please?

[..]
Comment 27 Michael Chang 2020-03-13 08:14:36 UTC
(In reply to Fabian Vogt from comment #22)
> (In reply to Andreas Färber from comment #19)

> Maybe the system was installed using Leap 15.0 and then upgraded to 15.1 at
> some point?
> 15.0 doesn't have the arm64-efi subvolume at all.

> It's entirely missing from control.xml on Leap <= 15.0 and for
> >= 15.1 it's only part of serverro. So there definitely needs to be a control.xml
> fix anyway. This would also be the root cause, unless the affected system was
> installed from a known fixed Leap 15.1.

This information is definitely missing from the bug report. I think Oliver would be able to provide it ...

@Oliver:

Would you please help to clarify that ?

> 
> > > Mounting wouldn't even be necessary, as grub2-install would handle both
> > > situations
> > > properly AFAICT. The combination of mounting + forced grub2-install would
> > > allow
> > > deleting the then unused modules from the snapshot itself though, which
> > > would save
> > > a few MiB of space.
> > 
> > I don't think grub-install is smart enough to know when and how to mount the
> > subvolume, or am I missing something here ?
> 
> If /boot/grub2/arm64-efi is not a subvolume, it builds an grub.efi image
> without
> "btrfs-mount-subvol" in the embedded config. Which means that the installed
> modules
> would be found in the right place.

The "Mounting wouldn't even be necessary" in front of "as grub2-install would handle both ..." made me think that grub could intelligently do the mounting for us. Anyway, I believe we understand each other now.

> Which seems to be the case for this report here, as otherwise grub wouldn't
> have attempted to load a mismatching module at all, right?
> So by adding whatever module grub failed to load here, the upgrade issue
> would also
> be worked around (at least temporarily).

That is already covered by `grub2-install --suse-force-signed ...`.
Comment 28 Michael Chang 2020-03-13 08:22:45 UTC
(In reply to Imobach Gonzalez Sosa from comment #26)
> (In reply to Michael Chang from comment #21)
> 
> [..]
> 
> 
> Hi Michael,
> 
> That PR adds the subvolume only for the Transactional Server role. If we
> want the subvolume to be present no matter the role the user has selected,
> we need to add the specification to the general subvolumes list. This PR
> does it for openSUSE 15.2:
> https://github.com/yast/skelcd-control-openSUSE/pull/197. Could you have a
> look, please?

It looks good to me. We have to enable the subvolume to cover the case of snapshot rollback, which is needed by "many roles".
Comment 29 Imobach Gonzalez Sosa 2020-03-13 11:18:45 UTC
(In reply to Michael Chang from comment #28)
 
> IT looks good to me. We have to enable subvolume to cover the case of
> snapshot rollback which would be needed by "many roles".

OK. I have merged a fix for openSUSE Leap 15.2[1] and another one for Tumbleweed[2]. Moreover, I have checked other skelcd-control-* repositories and, unless I have overlooked something, they are all fine (SLE_RT does not contain that subvolume, but AFAIK only x86_64 is supported).

Please, let me know if you need anything else from the YaST team.

[1] https://github.com/yast/skelcd-control-openSUSE/pull/197
[2] https://github.com/yast/skelcd-control-openSUSE/pull/198
Comment 32 Swamp Workflow Management 2020-03-13 17:00:07 UTC
This is an autogenerated message for OBS integration:
This bug (1162320) was mentioned in
https://build.opensuse.org/request/show/784689 15.1 / skelcd-control-openSUSE
https://build.opensuse.org/request/show/784692 15.1 / skelcd-control-openSUSE
Comment 33 Oliver Kurz 2020-03-16 20:43:41 UTC
(In reply to Michael Chang from comment #27)
> (In reply to Fabian Vogt from comment #22)
> > (In reply to Andreas Färber from comment #19)
> 
> > Maybe the system was installed using Leap 15.0 and then upgraded to 15.1 at
> > some point?
> > 15.0 doesn't have the arm64-efi subvolume at all.
> 
> > It's entirely missing from control.xml on Leap <= 15.0 and for
> > >= 15.1 it's only part of serverro. So there definitely needs to be a control.xml
> > fix anyway. This would also be the root cause, unless the affected system was
> > installed from a known fixed Leap 15.1.
> 
> These information is definitely missing from the bug report. I think Oliver
> would be able to provide ...
> 
> @Oliver:
> 
> Would you please help to clarify that ?

AFAIK, yes, the system was upgraded from Leap 15.0 so any changes in the installer system would not fix the issue. I guess in general any "changes" to the installer can also induce a need for corresponding changes for the upgrade case which can effectively only mean the mentioned %post script additions, right?

So far the machine is still in the pre-update state with a zypper lock added on the old package version. I have not yet understood what entry to /etc/fstab I would need to add or what path explicitly I should pass to `mksubvolume`.
Comment 34 Michael Chang 2020-03-18 06:34:14 UTC
Hi Oliver,

The steps you'd need to take are:

> mksubvolume /boot/grub2/arm64-efi
> update-bootloader --reinit

`mksubvolume` takes care of the following:

1. Create the subvolume <FS_TREE>/@/boot/grub2/arm64-efi
2. Mount the newly created subvolume at /boot/grub2/arm64-efi
3. Add an fstab entry so the mount persists across reboots, analogous to this i386-pc example:

> UUID=... /boot/grub2/i386-pc btrfs subvol=@/boot/grub2/i386-pc 0 0

Let me know whether this works for you.
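
For the arm64-efi case, the resulting fstab line would presumably follow the same pattern as the i386-pc example above. A minimal sketch that only formats such a line: the helper name `fstab_entry_for_subvol` is hypothetical, the layout is copied from the example rather than from what `mksubvolume` actually writes, and the UUID has to be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print an fstab line for a btrfs subvolume below "@",
# following the pattern of the i386-pc example. It only formats text and
# does not touch /etc/fstab.
fstab_entry_for_subvol() {
    uuid=$1  # filesystem UUID, supplied by the caller
    mnt=$2   # absolute mount path such as /boot/grub2/arm64-efi
    printf 'UUID=%s %s btrfs subvol=@%s 0 0\n' "$uuid" "$mnt" "$mnt"
}
```

Calling `fstab_entry_for_subvol <uuid> /boot/grub2/arm64-efi` prints a line that can be compared against what ends up in /etc/fstab after `mksubvolume` has run.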
Comment 35 Swamp Workflow Management 2020-03-24 17:14:35 UTC
openSUSE-RU-2020:0372-1: An update that has one recommended fix can now be installed.

Category: recommended (moderate)
Bug References: 1162320
CVE References: 
Sources used:
openSUSE Backports SLE-15-SP1 (src):    skelcd-control-openSUSE-15.1.19-bp151.2.3.1, skelcd-control-openSUSE-promo-15.1.19-bp151.2.3.1
Comment 36 Oliver Kurz 2020-03-24 18:56:49 UTC
(In reply to Michael Chang from comment #34)
> Hi Oliver,
> 
> The steps you'd need to take are:
> 
> > mksubvolume /boot/grub2/arm64-efi
> > update-bootloader --reinit
> 
> The `mksubvolume` would take care to
> 
> 1. Create subvolume <FS_TREE>/@/boot/grub2/arm64-efi
> 2. Mount the new created subvolume to /boot/grub2/arm64-efi
> 3. Add the fstab entry to have it persist permanently
> 
> > UUID=... /boot/grub2/i386-pc btrfs subvol=@/boot/grub2/i386-pc 0 0
> 
> Let me know is it workable for you ?

Unfortunately, all my attempts failed.

```
# mount -o rw,remount /
# mksubvolume /boot/grub2/arm64-efi
failure (target exists)
# mv /boot/grub2/arm64-efi{,.old}
mv: cannot move 'arm64-efi' to 'arm64-efi.old': Read-only file system
# mount | grep ' / '
/dev/sda2 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=933,subvol=/@/.snapshots/583/snapshot)
```

Within `transactional-update shell` I unfortunately get similar results, and all attempts to reproduce the steps manually have failed as well. Any further hints?
Comment 37 Fabian Vogt 2020-03-25 16:00:14 UTC
I had a debugging session with iforster on the affected machine and we analyzed the issue.
These are actually three separate issues now:

1. The system does not boot after update of grub2

This is caused by a mismatch of the grub EFI binary on the EFI partition and the modules it tries to load.
As it fails even directly after the grub2 package got updated and called update-bootloader --reinit,
this is not related to the missing subvolume though. The grub2 modules in the right version are part of the
default subvolume, so whether /boot/grub2/arm64-efi is a subvolume or not does not matter at this point.

Checking /boot/efi/EFI/opensuse/grubaa64.efi, it was obvious that something was broken: The binary was
last modified in 2018, probably during the initial installation. Indeed, running "update-bootloader --reinit"
in the transactional-update shell did not update it either.
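
A stale binary like this can be spotted by comparing modification times. The following is only a sketch: the function name `efi_binary_is_stale` is hypothetical, and which reference file to compare against (for example a module shipped by the currently installed grub2 package) is left to the caller.

```shell
#!/bin/sh
# Hypothetical check: succeed if the installed EFI binary is older than a
# reference file. Both paths are parameters; no paths are assumed here.
efi_binary_is_stale() {
    installed=$1
    reference=$2
    # "find -newer" prints the reference only if it was modified after the
    # installed binary, so non-empty output means the binary is stale.
    [ -n "$(find "$reference" -newer "$installed" 2>/dev/null)" ]
}
```

If the installed binary does not exist at all, `find` reports an error and the function returns false, so a missing file would need a separate check.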

The cause for this is that update-bootloader passed the "--no-nvram --removable" options to grub2-install,
which changes the target path to EFI/BOOT/BOOTAA64.EFI. This binary isn't used for booting though, as the
boot order in the nvram lists opensuse/grubaa64.efi first. The result is that the ancient grubaa64.efi was
booted, which loaded modules from the new grub2 in the default subvolume.

update-bootloader detects which options to use based on the existence and content of /sys/firmware/efi/efivars,
which is a mountpoint. It installs grub in the right location only if that directory is non-empty. transactional-update
doesn't know about that mount though, so it simply wasn't mounted in its environment. I recommend using a bind mount of the host's /sys here.
This bug also affects x86_64, where the EFI binary currently does not get updated by update-bootloader either.

Fixing this should be enough to get a working system even if grub2 is updated.
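
The detection described above can be sketched roughly as follows. This is not the actual update-bootloader code: the function name `efi_install_opts` is hypothetical, and the only behaviour modelled is "empty or missing efivars directory leads to --no-nvram --removable".

```shell
#!/bin/sh
# Hypothetical sketch of the option selection: print the extra grub2-install
# options that would be used for a given efivars directory.
efi_install_opts() {
    efivars_dir=${1:-/sys/firmware/efi/efivars}
    # A non-empty directory means efivarfs is mounted and NVRAM is reachable.
    if [ -d "$efivars_dir" ] && [ -n "$(ls -A "$efivars_dir" 2>/dev/null)" ]; then
        # Normal case: no extra options, the binary goes to its regular path.
        printf '\n'
    else
        # Fallback: only the removable-media path gets written.
        printf '%s\n' '--no-nvram --removable'
    fi
}
```

Run inside a chroot where efivarfs is not mounted, `efi_install_opts` would print the fallback options, which matches the behaviour observed on the affected machine.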

2a. The /boot/grub2/arm64-efi subvolume is missing

This is what this bug report has mostly been about. It is now fixed in the skelcd, so the subvolume gets created during the installation.
As existing systems won't benefit from that, this should ideally be handled automatically by grub at some point.

This subvolume is only needed in the case that the default subvolume is not the one the grub2 efi binary is from,
which is the case after a rollback, for instance.
In this particular case it actually wouldn't help to have this subvolume at all, because it would only contain the new modules
anyway, which don't match the ancient grub2 in the EFI partition. It would actually make the issue worse, as it would also fail
to boot after a rollback.

2b. mksubvolume /boot/grub2/arm64-efi is broken

The migration could be done like this:
# transactional-update shell
transactional-update # mv /boot/grub2/arm64-efi{,-old}
transactional-update # mksubvolume /boot/grub2/arm64-efi
transactional-update # mv /boot/grub2/arm64-efi-old/* /boot/grub2/arm64-efi/
transactional-update # rmdir /boot/grub2/arm64-efi-old/
transactional-update # exit
# reboot

However, mksubvolume has a bug and fails with:

failure (mkdir failed, errno:2 (No such file or directory))

This is because it tries to create the parent directory for the subvolume first, but not recursively:
mount("/dev/sda2", "/tmp/mksubvolume-CyZ04B", "btrfs", 0, "subvol=@") = 0
mkdirat(AT_FDCWD, "/tmp/mksubvolume-CyZ04B/boot/grub2", 0777) = -1 ENOENT (No such file or directory)

It would have to mkdir /tmp/mksubvolume-CyZ04B/boot/ first for this to work.
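
The difference between the two behaviours can be reproduced in a scratch directory. This is only an illustration with plain `mkdir`, not mksubvolume's code; both function names are hypothetical, and "$1" stands in for the temporary mount point of subvol=@.

```shell
#!/bin/sh
# Illustration in a scratch directory; "$1" plays the role of the temporary
# mount point, which does not contain a "boot" directory yet.
create_parent_plain() {
    # Like a single mkdirat(): fails if "$1/boot" does not exist yet.
    mkdir "$1/boot/grub2" 2>/dev/null
}

create_parent_recursive() {
    # Creates "$1/boot" first, then "$1/boot/grub2".
    mkdir -p "$1/boot/grub2"
}
```

With an empty scratch directory, `create_parent_plain` fails in the same way as the mkdirat() call in the trace, while `create_parent_recursive` succeeds.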


Fixing 1 is the most important part for now, as it's pretty much a time bomb for all EFI systems using transactional-update for updating.
After 2b. is fixed, grub can learn to do the migration in %post, which would fix 2a.
Comment 38 Ignaz Forster 2020-03-26 08:40:35 UTC
Problem 1 from Fabian's list has been fixed in transactional-update 2.20.4 by mounting `efivarfs` into the transactional-update environment on EFI systems.
Comment 39 Swamp Workflow Management 2020-03-26 09:10:14 UTC
This is an autogenerated message for OBS integration:
This bug (1162320) was mentioned in
https://build.opensuse.org/request/show/788446 Factory / transactional-update
Comment 40 Fabian Vogt 2020-03-26 10:47:28 UTC
On x86_64, the issue that efivarfs was not mounted in the chroot led to perl-Bootloader calling shim-install --no-nvram --removable, putting everything into /boot/efi/EFI/BOOT/. It also calls grub-install --no-nvram unconditionally, which means that it always overwrites /boot/efi/EFI/opensuse/grubx64.efi, even though --removable is specified. AFAICT this also happens when --efi-directory is specified, which does not sound right.

@mchang: Is ^ intentional?
Comment 41 Michael Chang 2020-03-27 05:40:04 UTC
(In reply to Fabian Vogt from comment #40)
> On x86_64 the issue that efivarfs was not mounted in the chroot lead to
> perl-Bootloader calling shim-install --no-nvram --removable, putting
> everything into /boot/efi/EFI/BOOT/. It also calls grub-install --no-nvram
> unconditionally, which means that it always overwrites
> /boot/efi/EFI/opensuse/grubx64.efi, even though --removable is specified.
> AFAICT this also happens with --efi-directory is specified, which does not
> sound right.
> 
> @mchang: Is ^ intentional?

It is intended: here we want the "side effects" of "grub-install --no-nvram", which does a great deal to get the /boot/grub2 directory ready to work, such as copying modules, translation files and whatever else grub.cfg needs. Certainly having /boot/efi/EFI/opensuse/grubx64.efi updated is not good, but does it cause any problem?
Comment 42 Michael Chang 2020-03-27 05:43:51 UTC
(In reply to Ignaz Forster from comment #38)
> Problem 1 from Fabian's list has been fixed in transactional-update 2.20.4
> by mounting `efivarfs` into the transactional-update environment on EFI
> systems.

Thanks a lot. I always have to learn from you and Fabian about the transactional server. Comment #37 is also very insightful. Good job.
Comment 43 Fabian Vogt 2020-03-27 07:11:24 UTC
(In reply to Michael Chang from comment #41)
> (In reply to Fabian Vogt from comment #40)
> > On x86_64 the issue that efivarfs was not mounted in the chroot lead to
> > perl-Bootloader calling shim-install --no-nvram --removable, putting
> > everything into /boot/efi/EFI/BOOT/. It also calls grub-install --no-nvram
> > unconditionally, which means that it always overwrites
> > /boot/efi/EFI/opensuse/grubx64.efi, even though --removable is specified.
> > AFAICT this also happens with --efi-directory is specified, which does not
> > sound right.
> > 
> > @mchang: Is ^ intentional?
> 
> It is intended, as here we want the "side effects" of "grub-install
> --no-nvram" which does a great deal of things for setup /boot/grub2
> directory ready to work, things like copying modules, translation files and
> so on whichever materials needed by grub.cfg. Certainly having
> /boot/efi/EFI/opensuse/grubx64.efi updated is not good, but does it cause
> any problem ?

Unless --efi-directory is used for shim-install, AFAICT currently not.
Comment 44 Michael Chang 2020-04-14 07:51:08 UTC
(In reply to Fabian Vogt from comment #43)
> (In reply to Michael Chang from comment #41)
> > (In reply to Fabian Vogt from comment #40)

> Unless --efi-directory is used for shim-install, AFAICT currently not.

If so, please open a new bug to track the problem and move our discussion there.
Thanks.
Comment 45 Michael Chang 2020-04-14 07:53:41 UTC
Hi Oliver,

Should we close the issue or are we waiting for other fixes to arrive ?
Thanks in advance.
Comment 46 Oliver Kurz 2020-04-15 13:24:54 UTC
Well, the issue is not fixed for the particular machine, as it is still running the old, locked packages, and I am not aware of fixes for 2a or 2b (see https://bugzilla.suse.com/show_bug.cgi?id=1162320#c37). https://build.opensuse.org/request/show/788446 has been accepted into Factory, but no maintenance update for Leap has been mentioned. You probably also need to ensure the fixed packages end up in Leap 15.2.
Comment 47 Ignaz Forster 2020-04-15 13:42:53 UTC
Recursively bind mounting all system mounts revealed a bug in util-linux (bug 1168389); the fix has to be backported to Leap before I can submit the fixed transactional-update version to Leap. I wasn't sure who would be doing the backport, so I'll do this shortly.
Comment 52 Swamp Workflow Management 2020-09-02 13:14:37 UTC
SUSE-RU-2020:2448-1: An update that has one recommended fix can now be installed.

Category: recommended (important)
Bug References: 1162320
CVE References: 
JIRA References: 
Sources used:
SUSE Linux Enterprise Module for Transactional Server 15-SP2 (src):    transactional-update-2.20.3-3.3.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 53 Swamp Workflow Management 2020-09-04 16:40:53 UTC
SUSE-RU-2020:2490-1: An update that has one recommended fix can now be installed.

Category: recommended (important)
Bug References: 1162320
CVE References: 
JIRA References: 
Sources used:
SUSE Linux Enterprise Module for Transactional Server 15-SP1 (src):    transactional-update-2.15-3.6.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 54 Swamp Workflow Management 2020-09-07 13:26:14 UTC
openSUSE-RU-2020:1373-1: An update that has one recommended fix can now be installed.

Category: recommended (important)
Bug References: 1162320
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.2 (src):    transactional-update-2.20.3-lp152.2.3.1
Comment 55 Swamp Workflow Management 2020-09-07 22:15:05 UTC
openSUSE-RU-2020:1377-1: An update that has one recommended fix can now be installed.

Category: recommended (important)
Bug References: 1162320
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.1 (src):    transactional-update-2.15-lp151.2.6.1
Comment 56 Ignaz Forster 2020-09-08 06:26:47 UTC
The fix has been released for all supported platforms.
Comment 57 Oliver Kurz 2020-09-08 12:42:58 UTC
As the originally affected machine was reinstalled in the meantime, I have no easy means to verify this. I will have to trust you :)
Comment 59 OBSbugzilla Bot 2021-11-11 15:40:20 UTC
This is an autogenerated message for OBS integration:
This bug (1162320) was mentioned in
https://build.opensuse.org/request/show/930877 15.2 / transactional-update
Comment 60 Swamp Workflow Management 2021-11-15 14:26:13 UTC
openSUSE-RU-2021:1476-1: An update that has 5 recommended fixes can now be installed.

Category: recommended (moderate)
Bug References: 1133891,1149131,1162320,1168389,1192078
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.2 (src):    transactional-update-2.22-lp152.2.6.1