Bug 832501

Summary: boot on raid device is not started if degraded; fix provided
Product: [openSUSE] openSUSE Distribution Reporter: Forgotten User VB1HhTwhLY <forgotten_VB1HhTwhLY>
Component: Other    Assignee: Neil Brown <nfbrown>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: arvidjaar, hpj, jjletho67-esus, mchang, mmarek, ohering, trenn
Version: 13.2   
Target Milestone: 13.2 RC 1   
Hardware: x86-64   
OS: openSUSE 13.2   
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: patch for /var/mkinitrd/scripts/, maybe not src files
partition layout
mdadm RPM for testing
rdosreport.txt After a failed boot, taken from the emergency shell (initramfs phase)
disk layout
patch that helps.

Description Forgotten User VB1HhTwhLY 2013-07-31 10:56:21 UTC
Created attachment 550422 [details]
patch for /var/mkinitrd/scripts/, maybe not src files

User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:21.0) Gecko/20100101 Firefox/21.0

If your /boot is on a separate raid device from your /, mkinitrd does not add any information in the initrd to start the raid device, so boot will fail.

I don't know why booting works if the RAID is clean. Perhaps systemd is starting it in this case.

Ubuntu 12.04 (grub 1.99) can boot with a degraded raid as long as you manually fix the metadata version of the device (change it to 0.90, possibly 1.0, but not 1.2, which is the default on the CLI and in the Ubuntu installer), so I was sad to see that the latest openSUSE does not work (even though previous versions did). But I was happy to see that openSUSE works with my fix and without changing the metadata, because openSUSE uses grub 2.00 and the installer uses metadata 1.0 instead of 1.2.

I have fixed the problem on my machine by editing the mkinitrd scripts. I don't know if I did a nice clean job that will work on other systems, so please validate it. I have also added some extra output in verbose mode.

In my solution, I check whether mdadm.conf exists and, if not, generate one. This is because the openSUSE installer did not generate one for me in my most hackish of tests. This seems like a good way to prevent some problems, even if they are the user's fault.

I am not sure whether my solution causes a problem when you have no mdadm.conf, or when your mdadm.conf has entries for arrays you don't want to be required for boot, because then the initrd will try to start them too. I check /sys/devices/virtual/block/ to see whether any md devices exist before trying to handle them, and if there are devices but no mdadm.conf, I read the output of <(mdadm -D --scan) instead of the file.
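
A simplified sketch of that logic (the attached patch is the real thing and differs in detail):

    # only do anything if md devices exist at all
    if ls /sys/devices/virtual/block/md* >/dev/null 2>&1; then
        if [ -e /etc/mdadm.conf ]; then
            grep '^ARRAY' /etc/mdadm.conf    # use the existing config
        else
            mdadm -D --scan                  # no mdadm.conf: scan the running arrays
        fi
        # the resulting ARRAY lines are what gets carried into the initrd
    fi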

Reproducible: Always

Steps to Reproduce:
Set up a test machine:

2 x 16 GB virtual disks

    md0 is raid1, sda1 and sdb1, and mounted on /boot as ext4
    md1 is raid1, sda2 and sdb2, and is a LVM PV
    /dev/suse is the LVM VG containing PV /dev/md1
    /dev/suse/root is from VG /dev/suse, and mounted on / as ext4
    /dev/suse/swap is from VG /dev/suse, and is swap

On the command line, you could create the devices like this:
    mdadm --create /dev/md0 -n 2 -x 0 -l 1 -e 1.0 missing /dev/sdb1
    mdadm --create /dev/md1 -n 2 -x 0 -l 1 -e 1.0 missing /dev/sdb2

    mkfs.ext4 -L boot /dev/md0
    pvcreate /dev/md1
    vgcreate suse /dev/md1
    lvcreate -L 4GB -n swap suse
    lvcreate -l 100%FREE -n root suse
    mkfs.ext4 -L root /dev/suse/root
    mkswap /dev/suse/swap


After the machine is up, run this to ensure the machine is ready to boot with either disk missing:
    grub2-install /dev/sda
    grub2-install /dev/sdb
    mkinitrd
    grub2-mkconfig -o /boot/grub2/grub.cfg


Then shut it down and remove a disk (I removed the 2nd for most of my tests, because VirtualBox snapshots mess up if you boot from the one you add afterwards).

Then boot it up
Actual Results:  
You get a very long wait (at least 60 seconds) and then you get emergency mode.

Normal startup was blocked because fsck could not open /dev/md0; it could not open it because /dev/md0 exists and is assembled, but is not running (as if --run was not used when assembling).

Expected Results:  
You get a successful boot with degraded arrays.

The systemd log shows you something like this:

Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.device/start timed out.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.device.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for /boot.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for Local File Systems.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for Remote File Systems (Pre).
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job remote-fs-pre.target/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job local-fs.target/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Triggering OnFailure= dependencies of local-fs.target.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job boot.mount/start failed with result 'dependency'.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Dependency failed for File System Check on /dev/disk/by-uuid/a16b10b0-d038-4946-ad88-97c0617bbf8c.
Jul 30 12:29:11 peterrouter.bc.local systemd[1]: Job systemd-fsck@dev-disk-by\x2duuid-a16b10b0\x2dd038\x2d4946\x2dad88\x2d97c0617bbf8c.service/start failed with result 'dependency'.
Comment 1 Forgotten User VB1HhTwhLY 2013-07-31 11:02:54 UTC
Oh and by the way, maybe my fix would also fix this bug too: https://bugzilla.novell.com/show_bug.cgi?id=823125
Comment 2 Forgotten User VB1HhTwhLY 2013-07-31 11:15:50 UTC
Also, for unknown reasons, if in the rescue shell you run "mdadm --run /dev/md0" then "systemctl default", then reboot, it will boot normally again, even though it is still degraded.
Comment 3 Olaf Hering 2013-08-30 18:53:39 UTC
Thanks for the report. I'm trying to reproduce and understand the issue.
Comment 4 Olaf Hering 2013-09-04 08:04:24 UTC
I cannot reproduce it.
I installed a VM with two disks and the layout described above. After install, the VM was shut down, one disk removed, and the system started OK with just that one disk. I ran mkinitrd and compared the old and new initrd. There are no differences. mdadm.conf exists and did not change.

In case you want to poke around, hammer175 is the VM, password is root.
Comment 5 Forgotten User VB1HhTwhLY 2013-09-04 11:07:45 UTC
Can I look at the VM? How do I connect to the machine?
Comment 6 Neil Brown 2013-11-03 20:41:26 UTC
(In reply to comment #0)

> If your /boot is on a separate raid device from your /, mkinitrd does not add
> any information in the initrd to start the raid device, so boot will fail.

Hi Peter,
 I'm a bit confused by this part of the problem description.

As I understand it, the initrd does not need to access the /boot filesystem at all.  The boot loader (e.g. grub) does of course, so that it can load the kernel and the initrd.  But all the initrd needs access to is the root filesystem and the swap partition.

Once it mounts root, the scripts in there will take over to mount /boot and anything else.

Clearly you are having a problem and it does seem to be related to the md device containing /boot, but I think it needs to be fixed in the regular boot scripts, not in the initrd.

Handling freshly degraded arrays at boot is somewhat tricky with the dependency driven boot sequence that systemd uses.

As devices are discovered, udev runs "mdadm -I $DEVICE" and mdadm incrementally assembles the arrays.  Once all components are there the array is started.

But if all components never arrive, the array will never be started with just that mechanism.

To address this you can run "mdadm -IRs", which essentially says "all devices have arrived; time to start any remaining md arrays which are degraded".
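
For illustration, the sequence is roughly this (device names are just examples):

    # run by udev as each component device appears:
    mdadm -I /dev/sda1
    mdadm -I /dev/sdb1     # the array only starts once all members have arrived
    # after a timeout, start anything still unassembled, degraded if necessary:
    mdadm -IRs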

systemd needs to do this when it times out waiting for a device.  But I don't know how to tell it that (not that I have really looked recently).

The initrd does have a call to "mdadm -IRs" halfway through timing out for the root device.  This is why your boot works if you tell the initrd to assemble the boot device.  But that isn't really the right fix.

I'll do some reading about systemd and see if I can figure out how to give it an action to perform on timeout.
Comment 7 Marco M. 2013-11-28 09:31:41 UTC
Hi,
I think I'm facing an almost identical problem. The main difference is that on my system the emergency shell never appears.

I'm attaching a file in which you can find (in this order): partition layout, mdstat, mdadm --detail on all raid devices, vgdisplay, lvdisplay, fstab and the active mount points.

When I start the system with /dev/sdb pulled out, systemd complains about the missing /boot with these error messages on the console:

"
Timed out waiting for device dev-disk-by\x2dlabel-bootFS.device
Dependency failed for /boot
Dependency failed for Local File Systems
Welcome to emergency mode! After Logging in type "journalctl -xb" to view system logs "systemctl reboot" to reboot, "systemctl default" to try again to boot into default mode
"

The last line is repeated on the screen about every 60 seconds, but no keyboard input is accepted and no login prompt appears. I can only switch between the Alt-F1 and Alt-F7 consoles.
Comment 8 Marco M. 2013-11-28 09:32:37 UTC
Created attachment 569482 [details]
partition layout
Comment 9 Neil Brown 2013-11-28 10:27:57 UTC
Yes, that looks like the same problem.
I have a fix nearly worked out.  I should have something for you to test early next week.
I don't know why you don't get a login prompt, but it might be worth trying to boot without plymouth - that might be confusing things.

I think if you press 'e' at the grub menu it puts you in a simple editor.  Find the kernel command line and add
   plymouth.enable=0
to the end.   See if that provides a password prompt in emergency mode.
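
For example, the edited line would end up looking something like this (kernel version and root device are placeholders):

    linux /vmlinuz-<version> root=<root device> ... plymouth.enable=0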
Comment 10 Andrei Borzenkov 2013-11-28 13:15:02 UTC
(In reply to comment #9)

> I don't know why you don't get a login prompt

See bnc#852021
Comment 11 Marco M. 2013-11-28 14:26:17 UTC
> I don't know why you don't get a login prompt, but it might be worth trying to
> boot with plymouth - that might be confusing things.
> 
> I think you try 'e' to the grub menu and it puts you in a simple editor.  Find
> the kernel command line and add
>    plymouth.enable=0
> to the end.   See if that provides a password prompt in emergency mode.

I added plymouth.enable=0 but the emergency shell is still not working
Comment 12 Marco M. 2013-11-28 14:28:58 UTC
(In reply to comment #10)
> (In reply to comment #9)
> 
> > I don't know why you don't get a login prompt
> 
> See bnc#852021

It looks very similar! I'm going to try the suggested patch as soon as possible and I'll let you know the result, thank you.
Comment 13 Marco M. 2013-11-28 17:47:35 UTC
(In reply to comment #12)
> (In reply to comment #10)
> > (In reply to comment #9)
> > 
> > > I don't know why you don't get a login prompt
> > 
> > See bnc#852021
> 
> It looks very similar! I'm going to try the suggested patch as soon as possible
> and I'll let you know the result, thank you.

OK, the emergency shell problem is the same one described in bnc#852021, and the proposed patch worked for me (in the sense that it solved the login prompt problem, but of course I still have the main problem we are facing here).

I'm of course available to test a patch
Comment 14 Neil Brown 2013-12-02 04:29:51 UTC
Created attachment 569764 [details]
mdadm RPM for testing

Please test this rpm and confirm that it fixes the problem.

If you boot without all expected devices present there will be a 30 second delay waiting for devices to appear, then any md arrays which can be started degraded will be.  On subsequent boots the 30 second delay will not be needed, as the other devices are no longer expected.
Comment 15 Marco M. 2013-12-03 18:46:19 UTC
(In reply to comment #14)
> Created an attachment (id=569764) [details]
> mdadm RPM for testing
> 
> Please test this rpm and confirm that it fixes the problem.
> 
>
I installed the rpm with this command:

 rpm -ivh --replacepkgs --force

--force was necessary, otherwise rpm complains that the already installed package is newer than the one I was installing.

I added the nofail option to all mounted standard (non-RAID) partitions on the disk that I was about to remove (the absence of a partition which fstab lists as automounted triggers the emergency shell).
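
For reference, a nofail entry looks roughly like this (device and mount point are just placeholders):

    /dev/sdXn  /data  ext4  defaults,nofail  0 2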

I pulled out the sdb disk and the system booted as expected! So the patch is working fine for me!
Comment 16 Neil Brown 2013-12-05 00:25:17 UTC
Hi Peter,
I've been approaching this as an openSUSE-13.1 problem. However I just noticed that the affected "Product" at the top says "openSUSE-12.3".  12.3 works quite differently in this area and I think that it works correctly.

Could you please confirm whether you were seeing this in 12.3 or in a 13.1 beta?

Thanks.
Comment 17 Bernhard Wiedemann 2013-12-05 01:00:15 UTC
This is an autogenerated message for OBS integration:
This bug (832501) was mentioned in
https://build.opensuse.org/request/show/209450 Factory / mdadm
Comment 18 Forgotten User VB1HhTwhLY 2013-12-05 02:46:20 UTC
> Could you please confirm whether you we seeing this in 12.3 or in a 13.1 beta?

@Neil

It was 12.3; I have not tested 13.1 at all yet.
Comment 19 Marco M. 2013-12-05 18:28:11 UTC
I confirm the bug is present in 13.1 and NOT present in 12.3.
I tested the patch in 13.1
Comment 20 Neil Brown 2013-12-12 03:44:53 UTC
Hi Peter,
 I've tried to reproduce this on 12.3 and I cannot.
It always boots with /boot found and mounted happily.

I can only assume that there is some detail in your configuration which differs from the way the installer sets things up.

I would definitely advise having all the arrays listed in /etc/mdadm.conf, though I find it works even without that.

Also it is best if /etc/fstab gives /dev/disk/by-id/md-uuid-..... devices rather than e.g. /dev/md127.  Doing the latter could possibly confuse things.
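
For example, something along these lines (the UUID part is whatever "ls /dev/disk/by-id/" shows for the array):

    /dev/disk/by-id/md-uuid-<array uuid>  /boot  ext4  defaults  1 2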

So I'm going to focus on getting a good fix for this into 13.1 and leave it at that.
Comment 21 Benjamin Brunner 2013-12-16 13:31:54 UTC
Update released for openSUSE 13.1. Resolved fixed.
Comment 22 Swamp Workflow Management 2013-12-16 14:05:42 UTC
openSUSE-RU-2013:1883-1: An update that has two recommended fixes can now be installed.

Category: recommended (low)
Bug References: 832501,851993
CVE References: 
Sources used:
openSUSE 13.1 (src):    mdadm-3.3-4.4.1
Comment 23 Marco M. 2014-10-15 10:33:16 UTC
I think the bug is present again in 13.2 RC1. 
I was unable to boot with a degraded raid 1 array (both boot and root were on raid).

I'm attaching the rdosreport.txt.

The strange thing is that, from the very limited emergency shell that comes up in the initramfs phase, I can correctly see the degraded array and the LVM logical volume on which the root filesystem is.
file:///boot/rdsosreport.txt
Comment 24 Marco M. 2014-10-15 10:35:25 UTC
Created attachment 610152 [details]
rdosreport.txt After a failed boot, taken from the emergency shell (initramfs phase)
Comment 25 Andrei Borzenkov 2014-10-15 17:51:47 UTC
(In reply to Marco M. from comment #23)
> I was unable to boot with a degraded raid 1 array (both boot and root where
> on raid)
> 

I cannot reproduce this using boot on MD RAID with current Factory. Your log shows

[  143.781288] linux-m61d dracut-initqueue[270]: Warning: Cancelling resume operation. Device not found.
[  144.094926] linux-m61d systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-1f3d0926\x2da660\x2d45a6\x2da2b0\x2dbfdf165d64b5.device.
[  144.096300] linux-m61d systemd[1]: Dependency failed for /sysroot.
[  144.098235] linux-m61d systemd[1]: Dependency failed for Initrd Root File System.
[  144.099596] linux-m61d systemd[1]: Dependency failed for Reload Configuration from the Real Root.
[  144.100150] linux-m61d systemd[1]: Dependency failed for File System Check on /dev/disk/by-uuid/1f3d0926-a660-45a6-a2b0-bfdf165d64b5.
[  144.553192] linux-m61d dracut-initqueue[270]: Scanning devices md2  for LVM logical volumes SystemVG/rootLV

So it is actually a problem of LVM on RAID, not RAID itself.

/dev/mapper/SystemVG-rootLV: LABEL="rootFS" UUID="1f3d0926-a660-45a6-a2b0-bfdf165d64b5" TYPE="ext4"

Please provide your initrd that fails as already requested.
Comment 26 Marco M. 2014-10-16 08:18:09 UTC
Created attachment 610281 [details]
disk layout

Just to clarify my partition layout, I attach a text file with:

disk partition tables
mdadm --detail of all raid arrays
pvdisplay, vgdisplay and lvdisplay
fstab

I can't directly attach the initrd file because it exceeds the maximum size allowed. I'm going to share the file via some other file sharing service.
Comment 27 Marco M. 2014-10-16 08:30:23 UTC
https://copy.com/zDQdfUM0L4lJz4fo

with this link you can download the initrd file
Comment 28 Neil Brown 2014-11-05 05:05:45 UTC
Strange...

The rdosreport.txt log reports:

[  144.094926] linux-m61d systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-1f3d0926\x2da660\x2d45a6\x2da2b0\x2dbfdf165d64b5.device.

But also shows

/dev/disk/by-uuid:
...
lrwxrwxrwx 1 root 0 10 Oct 15 12:12 1f3d0926-a660-45a6-a2b0-bfdf165d64b5 -> ../../dm-0


So it clearly finds the device, but maybe not soon enough for systemd.

This dm-0 is an LVM partition in md2, which is a degraded raid1.

As it is degraded it isn't started immediately - not until /sbin/mdraid_start is called.
That is scheduled to run at a timeout by a udev rules file:

etc/udev/rules.d/65-md-incremental-imsm.rules:RUN+="/sbin/initqueue --timeout --name 50-mdraid_start --onetime --unique /sbin/mdraid_start"

This will be after the normal lvm scan, but there is a secondary lvm scan similarly scheduled by a timeout.

etc/udev/rules.d/64-lvm.rules:RUN+="/sbin/initqueue --timeout --name 51-lvm_scan --onetime --unique /sbin/lvm_scan --partial"


So the md device should be assembled by "50-mdraid_start" and the lvm device by "51-lvm_scan", and as "51" comes after "50", this should happen in the right order.

The log confirms this:

[  143.256336] linux-m61d kernel: md/raid1:md2: active with 1 out of 2 mirrors
...
[  144.553192] linux-m61d dracut-initqueue[270]: Scanning devices md2  for LVM logical volumes SystemVG/rootLV

The problem is that systemd complains between these two:

[  144.094926] linux-m61d systemd[1]: Timed out waiting for device dev-disk-by\x2duuid-1f3d0926\x2da660\x2d45a6\x2da2b0\x2dbfdf165d64b5.device.

I guess the systemd timeout must be only a tiny bit longer than the initqueue timeout.

The initqueue timeout is  2*$RDRETRY/3 where RDRETRY is 180 (by default).  So 120 seconds.

The default timeout for devices seems to be 90 seconds - set by the DefaultTimeoutStartSec value.
But that doesn't really seem to line up, so I must be missing something.


Can you try booting with
   rd.timeout=180

added to the command line args?  If that works, it will at least hint that I'm on the right track.

I might try to set something up myself to test ... if I find time.
Comment 29 Neil Brown 2014-11-06 06:00:23 UTC
I can now reproduce this bug.
It appears to be a problem in dracut, though I still don't fully understand it.

There seem to be two timeouts.
After one, dracut gives up waiting for enough devices for non-degraded arrays to appear and accepts degraded arrays.
This defaults to 120 seconds (2/3 of RDRETRY).  If I boot with
   rd.retry=80

then boot with a removed device is successful.

The other timeout controls when systemd will give up waiting for the root device to appear.  This seems to default to 90 seconds, but I guess it starts from a different moment than the other timeout.

This timeout seems to fire immediately *after* the md raid devices have been assembled degraded, but just *before* an lvm device is found in the md raid device  (which happens about 1 second later).

If I edit /etc/systemd/system.conf, uncomment
 #DefaultTimeoutStartSec=90s

and change it to 300s, and run "mkinitrd" then again the boot succeeds, but without the need to set rd.retry=80
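
So in /etc/systemd/system.conf the relevant section ends up as (300s is simply the value that worked here):

    [Manager]
    DefaultTimeoutStartSec=300s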

Trying a smaller number, like 120s, should be sufficient.  But it isn't.
I don't understand that.
I didn't binary-search to find where the cut-off is .... it takes too long to run a test.

So you can make your system work by making the above change to /etc/systemd/system.conf.

Thomas: are you the maintainer for dracut?  Do you know anything about the
timeout for the root device to appear and how it is supposed to compare with RDRETRY?
Comment 30 Marco M. 2014-11-10 11:24:26 UTC
> 
> If I edit /etc/systemd/system.conf, uncomment
>  #DefaultTimeoutStartSec=90s
> 
> and change it to 300s, and run "mkinitrd" then again the boot succeeds, but
> without the need to set rd.retry=80
> 

This workaround has worked fine for me too. I did these two tests:

1) degraded raid1 device with root on it plus one missing data filesystem (a simple partition with a file system on it, no raid involved): the root filesystem is mounted and the emergency shell appears.

2) degraded raid1 device with root on it, no other missing filesystem (there was a missing filesystem, but I marked it with the "nofail" option in /etc/fstab): the system boots correctly into runlevel 5 (well, the systemd equivalent of runlevel 5...).
Comment 31 Neil Brown 2015-02-05 01:55:48 UTC
*** Bug 811830 has been marked as a duplicate of this bug. ***
Comment 32 Neil Brown 2015-03-12 00:07:03 UTC
Created attachment 626448 [details]
patch that helps.

I think this is a more robust fix for the problem than fiddling with timeouts.

If you
 cd /usr/lib/dracut
 patch -p1 < /path/to/dracut.diff
 mkinitrd

then you should have better luck booting with a missing device.

I've sent an email discussing the problem to the initramfs mailing list:
 
http://comments.gmane.org/gmane.linux.kernel.initramfs/4075

but it has not received a reply yet.
Comment 33 Marco M. 2015-06-15 17:08:57 UTC
Hi,
I have just removed the needinfo flag for my user, because I think I have provided the information requested. If you need anything else I'm of course available, so please let me know.

@Neil: on the initramfs mailing list they have answered you and are waiting for more info.
Comment 34 Neil Brown 2015-06-16 21:20:57 UTC
Thanks...
I did see that reply, sent the patches properly (in a different thread), they were eventually applied, and an updated 'dracut' was released for openSUSE about a week ago.

Can you install  dracut-037-17.12.1  and confirm that it fixes the problem?

Thanks.
Comment 35 Marco M. 2015-07-08 16:07:13 UTC
Hi,
I installed the latest dracut version, removed this line from /etc/systemd/system.conf:

DefaultTimeoutStartSec=300s

and I re-ran mkinitrd.

Then I rebooted and repeated the same tests I described in my post on 2014-11-10 (the VM I used today was exactly the same).

All worked as expected; the only thing I noticed is that the boot (in both cases) requires a very long time (some minutes more than normal).

The only open question now is why the yast installer is not able to install grub on both raid members during installation (see the duplicate bug here: https://bugzilla.novell.com/show_bug.cgi?id=811830).

Thank you all!
Comment 36 Neil Brown 2015-07-08 22:47:33 UTC
> some minutes more than normal

When booting with a newly degraded array I would expect an extra delay of 2 minutes - there is a default timeout of 180 seconds and a magic factor of 2/3 applied.

If you are seeing a longer delay, or a delay when the array was not newly degraded, then that might be a bug.  Otherwise it is acting as expected.

The yast installer/grub install issue is quite separate.  Your best bet would be to open a new bug focusing on just that issue.  Feel free to add me to 'cc', but I'm not likely to be the one to push it to resolution (I hope).

Thanks for the positive report - I'll close this bug now on the assumption that the delays you see match expectations.  If they don't and you want to pursue that issue, please re-open.
Comment 37 Marco M. 2015-07-09 09:55:29 UTC
> If you are seeing a longer delay, or a delay when the array was not newly
> degraded, then that might be a bug.  Otherwise it is acting as expected.

It is working as expected.

> 
> The yast installer/grub install issue is quite separate.  Your best bet
> would be to open a new bug focusing on just that issue.  Feel free to add me
> to 'cc', but I'm not likely to be the one to push it to resolution (I hope).
> 
I'll build a new clean  test environment and I'll open a new bug as you suggested.
Thank you very much