Bug 1192665 - Boot UEFI with broken RAID1 mirror fails to load
Summary: Boot UEFI with broken RAID1 mirror fails to load
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Bootloader
Version: Current
Hardware: 64bit Other
Priority: P5 - None
Severity: Major (5 votes)
Target Milestone: ---
Assignee: dracut maintainers
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-11-13 23:23 UTC by Cedric Simon
Modified: 2021-12-23 07:27 UTC
CC List: 4 users

See Also:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Dracut loop error before entering emergency mode (2.79 MB, image/jpeg)
2021-11-13 23:23 UTC, Cedric Simon
Copy of journal entries (163.67 KB, text/plain)
2021-11-15 21:28 UTC, Cedric Simon
RD SOS report (241.36 KB, text/plain)
2021-11-15 21:28 UTC, Cedric Simon
RD SOS report with debug (3.85 MB, text/plain)
2021-11-16 15:07 UTC, Cedric Simon
Journal with rd.debug (134.44 KB, text/plain)
2021-11-16 15:07 UTC, Cedric Simon
Grub command line without quiet and with rd.debug (173.44 KB, image/jpeg)
2021-11-16 15:09 UTC, Cedric Simon

Description Cedric Simon 2021-11-13 23:23:22 UTC
Created attachment 853725 [details]
Dracut loop error before entering emergency mode.

openSUSE Tumbleweed installation with UEFI boot and a RAID1 partition for root.

Testcase: 
1) Have synchronized RAID1 partition(s) on the latest Tumbleweed version (zypper dup).
2) Shut down normally 
3) Remove one disk used by RAID1 partition
4) Boot system

Result: dracut is unable to degrade the RAID1. System boots in emergency mode.

Workaround:
a) Remove the failed disk from the RAID array(s) while the system is running (the system will then boot degraded without problems).
b) Boot in rescue mode (using a DVD), remove the failed disk from the array, mount the RAID partition(s), and reboot.

Adding a screen capture for now. Next week I will add system logs.

Notes:
a) If the system is not using UEFI, the array degrades normally.
b) I did not test with other RAID configurations.
c) Whether the UEFI partition is itself on a RAID1 partition or not makes no difference.
d) I have reproduced this bug several times. It is already present in the openSUSE DVD from August 29th, 2021, so it might have been there for a while.
Comment 1 Cedric Simon 2021-11-15 21:27:13 UTC
Adding journal and rdsosreport.txt
Comment 2 Cedric Simon 2021-11-15 21:28:00 UTC
Created attachment 853759 [details]
Copy of journal entries
Comment 3 Cedric Simon 2021-11-15 21:28:30 UTC
Created attachment 853760 [details]
RD SOS Report
Comment 4 Thomas Blume 2021-11-16 08:45:37 UTC
(In reply to Cedric Simon from comment #3)
> Created attachment 853760 [details]
> RD SOS Report

Unfortunately, the logs don't show what happens within the dracut scripts.

Could you please retry with the boot option 'quiet' removed and 'rd.debug' added and attach these logs?
Comment 5 Cedric Simon 2021-11-16 15:07:01 UTC
Created attachment 853781 [details]
RD SOS report with debug
Comment 6 Cedric Simon 2021-11-16 15:07:36 UTC
Created attachment 853782 [details]
Journal with rd.debug
Comment 7 Cedric Simon 2021-11-16 15:08:40 UTC
Files without quiet and with rd.debug added.
Comment 8 Cedric Simon 2021-11-16 15:09:42 UTC
Created attachment 853783 [details]
Grub command line without quiet and with rd.debug
Comment 9 Antonio Feijoo 2021-11-17 08:47:04 UTC
We've created three Tumbleweed VMs (BIOS, UEFI, and UEFI after zypper dup), but we couldn't reproduce the issue; all of them boot degraded without problems after removing one disk.

The logs show that the RAID setup is somewhat broken compared to what it should be: /proc/mdstat displays md0 and md1 as inactive, and blkid does not list them either.
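For reference, a correctly degraded two-disk RAID1 normally shows up as active in /proc/mdstat with one missing member, roughly like this (illustrative output only; device names and block counts are hypothetical):

```
md1 : active raid1 sda2[0]
      52395008 blocks super 1.2 [2/1] [U_]
```

The `[2/1] [U_]` markers indicate an array designed for two devices that is currently running on one; the `inactive` state seen in the attached logs means assembly never got that far.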

We can also see that dracut starts the assembly of the RAID device:

-->
[    2.377096] install systemd-udevd[399]: sda2: Process '/sbin/initqueue --timeout --name 50-mdraid_start --onetime --unique /sbin/mdraid_start' succeeded.
[    2.377114] install systemd-udevd[399]: sda2: Running command "/sbin/mdadm -I /dev/sda2"
[    2.377130] install systemd-udevd[399]: sda2: Starting '/sbin/mdadm -I /dev/sda2'
[    2.377224] install systemd-udevd[399]: Successfully forked off '(spawn)' as PID 552.
[....]
[    2.396366] install systemd-udevd[404]: sda1: Process '/sbin/initqueue --timeout --name 50-mdraid_start --onetime --unique /sbin/mdraid_start' succeeded.
[    2.396386] install systemd-udevd[404]: sda1: Running command "/sbin/mdadm -I /dev/sda1"
[    2.396402] install systemd-udevd[404]: sda1: Starting '/sbin/mdadm -I /dev/sda1'
[    2.396497] install systemd-udevd[404]: Successfully forked off '(spawn)' as PID 565.
--<

But then mdadm throws an error:

-->
[    2.401345] install systemd-udevd[399]: sda2: '/sbin/mdadm -I /dev/sda2'(err) 'mdadm: /dev/sda2 attached to /dev/md/md1, not enough to start safely.'
[    2.401379] install systemd-udevd[399]: sda2: Process '/sbin/mdadm -I /dev/sda2' succeeded.
[...]
[    2.402445] install systemd-udevd[404]: sda1: '/sbin/mdadm -I /dev/sda1'(err) 'mdadm: /dev/sda1 attached to /dev/md0, not enough to start safely.'
[    2.402474] install systemd-udevd[404]: sda1: Process '/sbin/mdadm -I /dev/sda1' succeeded.
--<

So it isn't dracut; it is mdadm that refuses to start the arrays.

Neil, any hint of what might be going on here?
Comment 10 Neil Brown 2021-11-21 22:42:00 UTC
It is expected that "mdadm -I" will refuse to start the array.  mdadm needs to wait until all devices have been discovered, just in case some other device appears which can complete the array.

That is what the mdadm-last-resort@.timer systemd unit is for.  That timer starts when udev (via rules/64-md-raid-assembly.rules) first begins assembling the array.
After 30 seconds, if the array hasn't been started yet but can be started degraded, mdadm-last-resort@.service will start it.
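For context, the upstream mdadm units implementing this mechanism look roughly as follows (a sketch from memory, not the exact openSUSE files; unit options and the mdadm path may differ in detail):

```ini
# mdadm-last-resort@.timer -- armed when assembly of %i begins
[Timer]
OnActiveSec=30

# mdadm-last-resort@.service -- force-runs the array degraded
[Service]
Type=oneshot
ExecStart=/sbin/mdadm --run /dev/%i
```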

But... there is also mdraid_start.sh, which is run after a timeout and tries to do the same thing.  I suspect the last-resort timer shouldn't be used in dracut.  It was added for bug 1017695 (also reported by Cedric)

Anyway, mdraid_start isn't doing the right thing.
From the dracut log:

[  137.632402] install dracut-initqueue[2031]: + _MD_UUID='fd889b76:c16f476d:3cb2241c:9ab1ca44
[  137.632402] install dracut-initqueue[2031]: 40b8cb3d:150cdfb8:11dc4f3c:51c116ec'

Note that there is a newline character rather than a space separating the two uuids.

A bit later

[  137.745164] install dracut-initqueue[2031]: + strstr ' fd889b76c16f476d3cb2241c9ab1ca44
[  137.745164] install dracut-initqueue[2031]: 40b8cb3d150cdfb811dc4f3c51c116ec ' ' 40b8cb3d150cdfb811dc4f3c51c116ec '

This fails.  The code expects the UUIDs in _MD_UUID to be space-separated, but they aren't.
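The failure can be reproduced in isolation. Below is a sketch using a strstr helper equivalent to the one in dracut's shell library (reproduced from memory), with the UUID list once newline-separated, as mdraid_start.sh built it, and once space-separated, as the code expects:

```shell
# strstr: succeeds if $2 occurs inside $1
# (equivalent to dracut-lib.sh's helper; reproduced from memory).
strstr() {
    [ "${1##*"$2"*}" != "$1" ]
}

# UUIDs joined by a newline, as in the failing log:
nl_list=' fd889b76c16f476d3cb2241c9ab1ca44
40b8cb3d150cdfb811dc4f3c51c116ec '
# The same UUIDs joined by a space, as the code expects:
sp_list=' fd889b76c16f476d3cb2241c9ab1ca44 40b8cb3d150cdfb811dc4f3c51c116ec '

# The needle is padded with spaces, so the newline before the
# second UUID prevents a match:
needle=' 40b8cb3d150cdfb811dc4f3c51c116ec '

strstr "$nl_list" "$needle" && echo "newline list: found" || echo "newline list: missing"
strstr "$sp_list" "$needle" && echo "space list: found" || echo "space list: missing"
```

The newline-separated list reports "missing" for the second array's UUID while the space-separated list reports "found", which matches the behavior seen in the rd.debug trace.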

I suspect that in Cedric's config, where this doesn't work, there are 2 md arrays that dracut needs to work with.  Cedric: please confirm that and describe the configuration.

In the tests that you (Antonio) tried, you presumably only had one md array.
Please try with 2 md arrays.
Comment 11 Antonio Feijoo 2021-11-22 07:17:02 UTC
(In reply to Neil Brown from comment #10)
> But... there is also mdraid_start.sh, which is run after a timeout and tries
> to do the same thing.  I suspect the last-resort timer shouldn't be used in
> dracut.  It was added for bug 1017695 (also reported by Cedric)
> 
> Anyway, mdraid_start isn't doing the right thing.
> From the dracut log:
> 
> [  137.632402] install dracut-initqueue[2031]: +
> _MD_UUID='fd889b76:c16f476d:3cb2241c:9ab1ca44
> [  137.632402] install dracut-initqueue[2031]:
> 40b8cb3d:150cdfb8:11dc4f3c:51c116ec'
> 
> Note that there is a newline character rather than a space separating the
> two uuids.
> 
> A bit later
> 
> [  137.745164] install dracut-initqueue[2031]: + strstr '
> fd889b76c16f476d3cb2241c9ab1ca44
> [  137.745164] install dracut-initqueue[2031]:
> 40b8cb3d150cdfb811dc4f3c51c116ec ' ' 40b8cb3d150cdfb811dc4f3c51c116ec '
> 
> This fails.  The code expects the uuids in _MD_UUID to be space separated,
> but they aren't.

Thank you for your extensive analysis, Neil. This sheds some light on the problem, as it seems to be related to this pending upstream PR:

https://github.com/dracutdevs/dracut/pull/1633
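The gist of such a fix is to normalize the separator before the substring test. A minimal sketch of the idea (not the literal patch from the PR, and the variable name is taken from the log above):

```shell
# UUID list as mdraid_start.sh collected it, newline-separated:
_MD_UUID='fd889b76:c16f476d:3cb2241c:9ab1ca44
40b8cb3d:150cdfb8:11dc4f3c:51c116ec'

# Normalize newlines to spaces so a later space-delimited
# substring match can see every UUID in the list:
_MD_UUID=$(printf '%s' "$_MD_UUID" | tr '\n' ' ')

printf '%s\n' "$_MD_UUID"
```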
Comment 12 Cedric Simon 2021-11-22 13:13:46 UTC
Indeed, I do have 2 RAID arrays: one for the UEFI (/dev/md0 --> /boot/efi) and one for root (/dev/md1 --> /).

Sorry not to have mentioned this before :(
Comment 13 Cedric Simon 2021-11-22 13:40:26 UTC
Double RAID is my "default" configuration.

I did test with a non-RAID /boot/efi and saw the same issue, but maybe /dev/md0 remained defined, I'm not sure...
Comment 14 Antonio Feijoo 2021-11-23 08:32:23 UTC
(In reply to Antonio Feijoo from comment #11)
> [...]
> this issue seems to be related to this pending upstream PR:
> 
> https://github.com/dracutdevs/dracut/pull/1633

I've been able to reproduce the error using the reported settings (2 RAID1 arrays + 2 rd.md.uuid parameters), and I can confirm that this pending fix solves the issue.
Comment 18 Antonio Feijoo 2021-12-23 07:27:43 UTC
The fix for this bug has been included since dracut-055+suse.150.g24986370, so we can close it.