Bug 445490 - boot stops on mounting a software RAID - Invalid root filesystem
Summary: boot stops on mounting a software RAID - Invalid root filesystem
Status: RESOLVED FIXED
Duplicates: 433980 460917 484897 490008
Alias: None
Product: openSUSE 11.1
Classification: openSUSE
Component: Installation
Version: Final
Hardware: i386 Other
Priority: P2 - High    Severity: Critical with 29 votes
Target Milestone: ---
Assignee: Michal Marek
QA Contact: Jiri Srain
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-16 13:27 UTC by Michael McCarthy
Modified: 2011-10-20 14:49 UTC (History)
26 users

See Also:
Found By: Beta-Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
save_y2log (388.28 KB, application/x-compressed-tar)
2008-11-16 13:44 UTC, Michael McCarthy
Details
This is the boot.msg file (28.43 KB, application/octet-stream)
2008-11-16 13:49 UTC, Michael McCarthy
Details
GBs boot.msg (42.99 KB, text/plain)
2009-01-24 13:57 UTC, Forgotten User E_KYbzzvNl
Details
the fix as a patch (409 bytes, patch)
2009-01-27 21:02 UTC, Hans-Peter Jansen
Details | Diff
md-restart-uevent (385 bytes, patch)
2009-06-09 09:47 UTC, Hannes Reinecke
Details | Diff
Script that waits until MD-array is started (for UDEV) (161 bytes, text/plain)
2009-06-11 07:35 UTC, Andreas Osterburg
Details

Description Michael McCarthy 2008-11-16 13:27:26 UTC
Created a single RAID1 partition with root "/" filesystem.  After software installation during initial boot, the boot stops with error "Invalid root filesystem -- exiting to /bin/sh".  Typing <CTRL-D> allows boot to proceed.  It appears to mount /dev/md0 as root immediately after the <CTRL-D>.
Comment 1 Michael McCarthy 2008-11-16 13:44:24 UTC
Created attachment 252501 [details]
save_y2log

This is the save_y2log
Comment 2 Michael McCarthy 2008-11-16 13:49:03 UTC
Created attachment 252502 [details]
This is the boot.msg file

NOTE: The system was stopped at the /bin/sh prompt when I inserted a USB flash drive.  The following lines in the boot.msg file indicate the time when the system was stopped and the USB stick inserted, prior to resuming with <CTRL-D>.

<6>usb 1-1: new high speed USB device using ehci_hcd and address 2
<6>usb 1-1: configuration #1 chosen from 1 choice
<6>usb 1-1: New USB device found, idVendor=0930, idProduct=6534
<6>usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
<6>usb 1-1: Product: USB Flash Memory
<6>usb 1-1: Manufacturer: M-Sys
<6>usb 1-1: SerialNumber: 087043505130A995
Comment 3 Michael McCarthy 2008-11-16 13:57:43 UTC
The installation was completed, but any time the system is rebooted, the stop occurs.  These are the last couple lines from the boot screen:

PM: Starting manual resume from disk
Waiting for device /dev/md0 to appear:   ok
invalid root filesystem - exiting to /bin/sh
$


After <CTRL-D> the boot resumes and the following lines appear

$ exit
mounting root on /dev/md0

and the boot continues.  It appears that the check on the root filesystem is incorrect or is done at the wrong time.  Possibly it is checking the wrong device.
Comment 4 Jan Kupec 2008-11-18 13:35:48 UTC
Looks like a kernel problem then.
Comment 5 Jan Kupec 2008-11-18 13:36:45 UTC
adjusting summary: every boot stops
Comment 6 Neil Brown 2008-11-21 21:11:39 UTC
I think this might be the same as bug 433980 and bug 435778.
I'm guessing some udev raciness but I'm no expert there.
Comment 7 Kay Sievers 2008-12-08 21:35:59 UTC
Sounds like an initramfs problem.

The initramfs cannot just wait for /dev/md* devices to appear. Because of md's weird lifetime rules and its legacy device-creation interface, the device node is always there; it does not show up only once the device is available.

I guess it needs some special handling to check when the device becomes "online". I cannot tell for sure; I do not have any RAID rootfs setup.
Comment 8 Bernhard Walle 2008-12-11 12:44:17 UTC
Neil, can you explain what initramfs should do here? I don't know about DM-RAID.
Comment 9 Neil Brown 2008-12-11 19:38:04 UTC
According to bug 435778, this has been fixed.
It needs a 'udevadm settle' before asking udev what the filesystem
type is.
Hannes added 'wait_for_events' in mkinitrd-boot.sh for mdadm, just after
assembling the root device.
Comment 10 Joe Morris 2008-12-23 00:00:37 UTC
Since bug 435778 is only internally accessible, could you spell out here what the resolution is?  At present I have to hit Ctrl-D to boot every time, and if this has been resolved (but no updates have been made available), I would gladly tweak whatever is needed until there is an update.  This bit me hard on updating from 10.3 to 11.1.  What file needs 'udevadm settle' added to it?  I tried adding 'wait_for_events' at the bottom of /lib/mkinitrd/scripts/boot-md.sh, but it didn't fix it here.  Perhaps it is a different file.  I would appreciate knowing what to do to fix mine, since this is resolved.  Thanks.
Comment 11 Neil Brown 2008-12-23 00:20:17 UTC
That is the correct file and (close enough to) the correct fix.
We put "wait_for_events" just before the final "fi" of that file.

Once you do that, you need to recreate the initrd.  Simply running

   mkinitrd

as root should do this.
Then try to reboot.
Comment 12 Joe Morris 2008-12-23 11:01:04 UTC
Sorry to be a pain in the neck, but adding wait_for_events to /lib/mkinitrd/scripts/boot-md.sh before the last fi and running mkinitrd still did not fix it.  You also mention in Comment #9 that there is a 'udevadm settle' also added.  I cannot see anywhere in that file that I can figure out where it should go.  Perhaps in another file?  I just checked boot-udev.sh, and it looks to me like wait_for_events is a variable meaning udevadm settle with a timeout set to 30 in setup-udev.sh.  Is it possible the timeout is too quick for my system?  I will test and let you know.
Comment 13 Joe Morris 2008-12-23 11:59:11 UTC
Well, changing the timeout to 60 didn't change anything.  The message "invalid root filesystem -- exiting to /bin/sh" comes from line 88 of boot-mount.sh, in the filesystem-checking area.  I suppose it cannot do an fsck, even though it looks like it has assembled the RAID1 (and not found the resume device) before that message.  If wait_for_events is the fix, it isn't working here.  Just to triple-check, these are the last few lines of my boot-md.sh:

	if [ "$md_dev" ] ; then
	    /sbin/mdadm $mdconf --auto=md $md_dev || /sbin/mdadm -Ac partitions $mdarg --auto=md $md_dev
	fi
wait_for_events
fi

(beware of word wrapping, that is only 5 lines).  Thanks for any ideas.  Should I reopen this bug?
Comment 14 Neil Brown 2008-12-23 21:11:20 UTC
I've added you to cc for  bug 435778 so you should be able to read it.
Comment #7 might be an interesting starting point.
If you see ID_FS_TYPE=whatever then it seems likely that it is the same race.
If you do not, the problem must be elsewhere.

Comment 15 Joe Morris 2008-12-24 05:47:21 UTC
Thanks for that.  This is what I found.

udevadm info -q path -n /dev/md0
gave me the correct
/devices/virtual/block/md0

udevadm info -q env -p /devices/virtual/block/md0
gave me "no record for '/devices/virtual/block/md0' in the database" (note: you had a typo)
after some testing, I tried 
udevadm info -q all -n /dev/md0
That gave me some output, but ID_FS_TYPE was missing.  I entered
echo change > /sys/block/md0/uevent
and repeated
udevadm info -q all -n /dev/md0
This time I did have ID_FS_TYPE as well as the others.  Not quite sure what this reveals yet, but it gives me more things to check.  Either hitting Ctrl-D or entering exit continues the boot, but boot.md shows as failed, and it also shows failed (but apparently succeeds) when it later tries to mount my RAID 1 home partition md1.  So I will look again at the mkinitrd scripts.  Thanks for your help so far.
Comment 16 Joe Morris 2008-12-24 08:27:59 UTC
OK, still haven't found the problem.  I just found mdadm-3.0-12.1.x86_64.rpm in CTiel's home 11.1 hotfix directory.  It improved some things (boot.md no longer shows as failed), but I still get the invalid root filesystem error and boot stops.  I tried adding 'echo change > /sys/block/$md_dev/uevent' just before wait_for_events, and saw that the variable worked, but got an error that the file was not valid or something like that, basically showing me that md0 was not there yet.  It ran before the message "waiting for device /dev/md0 to appear : ok", so I am assuming the bug has to be in a different script.  10.3 worked great.  I did test again: 'udevadm info -q env -n /dev/md0' after a failed boot still does not contain ID_FS_TYPE, but after echo change > /sys/block/md0/uevent, it does.
Comment 17 Joe Morris 2008-12-24 09:53:41 UTC
Well, after looking around and figuring out there was a boot folder that determined the order in which the scripts execute, I decided (since it failed right after "resume device not found, ignoring") it had to be something in boot-mount.sh.  I compared it with the one from 10.3, but too much has changed for me to understand what I could change.  So I added the echo line (a subsequent test still failed, but I did have the ID_FS_TYPE this time), then added a sleep line after it to give it a bit more time.  Now it will boot, but this is at best a kludge or desperate workaround.  Here is what I changed:

# And now for the real thing
if ! discover_root ; then
    echo "not found -- exiting to /bin/sh"
    cd /
    PATH=$PATH PS1='$ ' /bin/sh -i
fi

echo change > /sys/block/md0/uevent
sleep 1

sysdev=$(/sbin/udevadm info -q path -n $rootdev)
# Fallback if rootdev is not controlled by udev
if [ $? -ne 0 ] && [ -b "$rootdev" ] ; then
    devn=$(devnumber $rootdev)
    maj=$(devmajor $devn)
    min=$(devminor $devn)
    if [ -e /sys/dev/block/$maj:$min ] ; then
	sysdev=$(cd -P /sys/dev/block/$maj:$min ; echo $PWD)
    fi
  
It surely isn't a fix, but at least now it will boot without me needing to hit Ctrl-D.  Hope this helps to track down a real fix.
Comment 18 Petr Matula 2008-12-24 14:50:15 UTC
This problem looks similar.
https://bugzilla.novell.com/show_bug.cgi?id=460917
Comment 19 Joe Morris 2008-12-24 15:26:06 UTC
Until the real developers are able to track this down, I thought of a slightly less kludgy fix.  I took out what I added to boot-mount.sh, and added these 2 lines, in the opposite order, just before the wait_for_events in boot-md.sh.  This is also working, and since it uses the md variable it should work a bit better than the hard-coded md0, which is why this fix belongs in boot-md.sh and not boot-mount.sh.  Here is what the end of the file looks like now (watch the wrapping):

	
	if [ "$md_dev" ] ; then
	    /sbin/mdadm $mdconf --auto=md $md_dev || /sbin/mdadm -Ac partitions $mdarg --auto=md $md_dev
	fi
	sleep 1
	echo change > /sys/block/md$md_minor/uevent
	wait_for_events
fi


Hope this will help you Petr until a real fix comes along.  This is a better workaround than earlier, and it allows mine to boot right up.
Comment 20 Joe Morris 2008-12-25 04:31:48 UTC
I tried undoing the workaround and applied the patch from https://bugzilla.novell.com/show_bug.cgi?id=460917 but it did not help me at all.  My problem is definitely that ID_FS_TYPE was missing.  I just checked, after booting with the workaround, and on the booted system I get

jmorris:/ # udevadm info -q env -n /dev/md0
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=ffb096e5:5d1a78ab:71771454:6b84c526
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_UUID_ENC=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_LABEL=root
ID_FS_LABEL_ENC=root
ID_FS_LABEL_SAFE=root
jmorris:/ # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
MD_DEVNAME=1

So even after booting, the echo command would seem to be needed to give md1 the FS info.

jmorris:/ # echo change > /sys/block/md1/uevent
jmorris:/ # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=44d4e1ac-8ce7-49ec-b64f-2c084a817515
ID_FS_UUID_ENC=44d4e1ac-8ce7-49ec-b64f-2c084a817515
ID_FS_LABEL=home
ID_FS_LABEL_ENC=home
ID_FS_LABEL_SAFE=home
FSTAB_NAME=/dev/md1
FSTAB_DIR=/home
FSTAB_TYPE=ext3
FSTAB_OPTS=acl,user_xattr
FSTAB_FREQ=1
FSTAB_PASSNO=2

Interestingly enough, KDE popped up a dialog box after entering the echo command.  It sure seems to be a difficult bug to locate.
Comment 21 Hannes Reinecke 2009-01-07 08:59:51 UTC
It looks as if either md doesn't generate change events when it's ready, or the udev rules need to be modified to update ID_FS_XXX for md.
Kay? Neil?
Comment 22 Bernhard Walle 2009-01-07 12:40:18 UTC
Per comment 21 NEEDINFO.
Comment 23 Hans-Peter Jansen 2009-01-07 13:17:03 UTC
FWIW, modification according to https://bugzilla.novell.com/show_bug.cgi?id=445490#19 fixed it on 3 different systems for me.

Joe, thanks a lot. 

BTW, I would raise the severity to critical or even blocker, since it definitely inhibits common install scenarios for a lot of folks.
Comment 24 David Davey 2009-01-07 21:56:52 UTC
I have had the same result.  https://bugzilla.novell.com/show_bug.cgi?id=445490#19
fixed the problem.
Comment 25 Michael McCarthy 2009-01-08 10:12:55 UTC
*** Bug 460917 has been marked as a duplicate of this bug. ***
Comment 26 Kay Sievers 2009-01-08 14:33:33 UTC
We see this in comment#3:
  Waiting for device /dev/md0 to appear:   ok

As said in comment#7, we cannot just wait for the /dev/md* device node to appear; it is always there.  We need to loop until the device is usable.

The lifetime of md devices is "broken" from udev's view; they need static device nodes because of their legacy behavior.  They do not fit into today's kernel/udev device model and need to be special-cased in the initramfs.
Comment 27 Bernhard Walle 2009-01-08 15:40:16 UTC
Michal, that needs changes in mkinitrd-boot.sh in mdadm package. You're listed as maintainer of that package. Can you help here? I'm not really familiar with md, so you're the better person here ... :)
Comment 28 Kay Sievers 2009-01-08 15:49:58 UTC
If we booted by-uuid instead of using the questionable md kernel device name, it would just work, right?
Comment 29 Joe Morris 2009-01-08 21:50:01 UTC
I do not believe so.  For example, this is from my running system:
joe@jmorris:~> ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 801c2029-6789-4117-a461-da745a19f062 -> ../../sda1
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 a2d3f2bf-eaec-45a0-b843-55b15f037d83 -> ../../md0
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 a97bf970-827c-458d-8d59-764fc0381da8 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 e3e86c6b-4445-4611-90a8-fb7b2df6e5af -> ../../sda7

Even though my home is on md1, as you can see above it is still absent in by-uuid.  I believe md0 is only there because of the added 
        sleep 1
        echo change > /sys/block/md$md_minor/uevent
that I added to boot-md.sh (see Comment #19).

But, looking around at /dev/disk/by-id, I see:
joe@jmorris:~> ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part6 -> ../../sdb6
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part7 -> ../../sda7
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 edd-int13_dev80 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part7 -> ../../sda7
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 edd-int13_dev81 -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part6 -> ../../sdb6
lrwxrwxrwx 1 root root  9 2009-01-09 05:21 md-uuid-50317547:316ba81e:fc3a8342:011169ec -> ../../md1
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 md-uuid-ffb096e5:5d1a78ab:71771454:6b84c526 -> ../../md0
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part6 -> ../../sdb6
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part7 -> ../../sda7
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 usb-IC_USB_Storage-CFC_20020509145305401-0:0 -> ../../sdc
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 usb-IC_USB_Storage-MMC_20020509145305401-0:2 -> ../../sde
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 usb-IC_USB_Storage-MSC_20020509145305401-0:3 -> ../../sdf
lrwxrwxrwx 1 root root  9 2009-01-09 13:20 usb-IC_USB_Storage-SMC_20020509145305401-0:1 -> ../../sdd

That at least recognizes md1 as well as md0, though I am not sure how it would change the fact that the filesystem info is still not output, which is what is causing this part to fail, i.e.

jmorris:/home/joe # udevadm info -q env -n /dev/md0
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=ffb096e5:5d1a78ab:71771454:6b84c526
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_UUID_ENC=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_LABEL=root
ID_FS_LABEL_ENC=root
ID_FS_LABEL_SAFE=root
jmorris:/home/joe # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
MD_DEVNAME=1

Without the ID_FS_TYPE info the fsck cannot work and stops booting.
Comment 30 Joe Morris 2009-01-09 12:13:48 UTC
I may need to eat my words.  I checked our 10.3 server earlier with udevinfo (10.3 doesn't have udevadm) and found that only md0 on that box had ID_FS_XXXX info (it has 4 RAID 1 arrays; I am waiting for this bug to be fixed before I upgrade it to 11.1).  It also only had the link to md0 in by-uuid, but by-id had them all.  I would be willing to test whether booting by-id would work.  What all would need to be changed?  menu.lst?  fstab?  Rebuild the initrd, anything else?

Just tried the YaST Partitioner; by-id is greyed out, so it is not an option.  UUID is an option, but I cannot imagine trying to remember a name like /dev/disk/by-uuid/a2d3f2bf-eaec-45a0-b843-55b15f037d83 for any commands.  From the above, though, it looks like it uses ID_FS_UUID to get it, not MD_UUID.  Without the echo change > /sys/block/md0/uevent there is no ID_FS_UUID present; there is only MD_UUID.
Comment 31 Michal Marek 2009-01-12 14:35:08 UTC
BTW, I haven't been able to reproduce even bug #460917 yet, so no progress so far. Any help from the udev experts would be appreciated :).
Comment 32 Hans-Peter Jansen 2009-01-12 22:30:50 UTC
Michal, I bet that installing 10.2 involving an md and then upgrading to 11.1 will suffice.
Comment 33 Joe Morris 2009-01-14 13:02:38 UTC
Since I can easily reproduce this problem (my RAID was originally built back around 9.2, with upgrades to 9.3, 10.1, 10.2, 10.3, and now 11.1, if something needs to be changed), I would be happy to try booting by-uuid as mentioned in Comment #28 if I had some info as to what needs to be changed, and if it does not depend on the ID_FS_UUID info, which I already know does not work.
Comment 34 Cristian Rodriguez 2009-01-15 18:33:41 UTC
got bitten by this one recently.. comment19 seems to help.
Comment 35 Forgotten User E_KYbzzvNl 2009-01-24 13:57:22 UTC
Created attachment 267459 [details]
GBs boot.msg
Comment 36 Forgotten User E_KYbzzvNl 2009-01-24 13:59:51 UTC
The same here with my DELL Dimension 8200.
The relevant lines in my boot.msg read as follows:


Trying manual resume from /dev/md0
resume device /dev/md0 not found (ignoring)
Trying manual resume from /dev/md0
resume device /dev/md0 not found (ignoring)
Waiting for device /dev/md1 to appear:  ok
invalid root filesystem -- exiting to /bin/sh
$ 



Here I have to hit Ctrl+D (or type "exit" followed by <ENTER>) to continue booting.
Comment 37 Joe Morris 2009-01-27 10:25:08 UTC
Any progress with this bug?  This bug is holding me back from upgrading our office server, which has 4 md RAID 1 partitions, and even though the workaround is working OK, I know it is a kludge and not a fix.  Just asking; if there is anything I could do to help...  Does anyone know if a newly created md RAID 1 partition with 11.1 will boot right up without the workaround?  Just grasping at straws.
Comment 38 Hans-Peter Jansen 2009-01-27 20:55:33 UTC
Joe, it escapes me why this glitch holds you back from doing anything with 11.1.  Well yes, it's a nuisance to boot into the rescue system, mount the devices by hand, edit the offender, call mkinitrd, and reboot...

Even a newly created RAID 1 will fail without the fix. 

But being as alarmed as you are, how about doing the upgrade, waiting for the "will reboot in 10s" message, cancelling it, applying the attached patch, and calling mkinitrd; then all should be well.

OTOH, even once a fix is committed and rolled out, you will only avoid this issue during the upgrade/install if you manually add the correct update repo at that point.

Thus the former approach is still more appealing in my book.
Comment 39 Hans-Peter Jansen 2009-01-27 21:02:41 UTC
Created attachment 268090 [details]
the fix as a patch

Just switch to a text console, change to the install root path during install (find it with df -h), and apply with patch -p0 < mkinitrd-boot-md-fix.diff

It's a lot more difficult to describe the procedure than to do it...
Comment 40 Michal Marek 2009-01-28 16:05:48 UTC
Kay, I'm probably not the right person for this bug :(. Could you have a look?
Comment 41 Kay Sievers 2009-01-28 18:32:05 UTC
See comment #7, it's nothing udev could fix. Md needs special workarounds.
Comment 42 Kay Sievers 2009-01-28 19:10:45 UTC
It needs to be checked that the kernel md code sends the "change" event at the proper time, i.e. when the device is ready to be read from userspace, if we are not already sure that is correct.

Also, the initramfs needs special handling for md devices: it must loop until the device is readable before investigating it; it cannot just wait for the device node to appear.
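Kay's "loop until the device is readable" idea could look roughly like the sketch below: instead of waiting for /dev/mdN to appear (the node always exists), poll the array's sysfs state until it reports a started array.  This is only an illustration, not the shipped mkinitrd code; the function name, the timeout, and the use of /sys/block/mdN/md/array_state are assumptions.

```shell
# Sketch only: loop until an md array is actually usable, rather than
# waiting for its device node.  Path, state names, and timeout are assumptions.
wait_md_usable() {
    state_file="$1"          # e.g. /sys/block/md0/md/array_state
    timeout="${2:-30}"       # seconds to wait before giving up
    while [ "$timeout" -gt 0 ]; do
        case "$(cat "$state_file" 2>/dev/null)" in
            clean|active|readonly|read-auto|write-pending|active-idle)
                return 0     # array started; udev/vol_id can now probe it
                ;;
        esac
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1                 # still not started
}
```

An initramfs script could then call something like `wait_md_usable /sys/block/md0/md/array_state` before asking udev for ID_FS_TYPE.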
Comment 44 Hannes Reinecke 2009-02-19 14:48:56 UTC
And this looks like the ideal bug to get Milan started ...
Comment 45 Gordon Schumacher 2009-03-11 20:59:41 UTC
So... I'm seeing this very issue on a brand-new, clean install of 11.1 (so re: #32, an upgrade has nothing to do with it).

I have two SATA disks, each of which has three partitions.  All partitions are RAID autodetect, configured as RAID 1 - the first set for /boot, the second for swap, and the third for the root partition.  I did all this at install time, and hit this bug when the system did its first reboot to complete the installation.

So I believe the requirements for reproducing this are 1) having your root partition handled by MD RAID, and possibly 2) having it not be /dev/md0.

(In reply to comment #29)

> Without the ID_FS_TYPE info the fsck cannot work and stops booting.

I'll second that; the init dies in /boot/83-mount.sh, lines 79-90:

if [ -z "$rootfstype" -a -x /sbin/udevadm -a -n "$sysdev" ]; then
    eval $(/sbin/udevadm info -q env -p $sysdev | sed -n '/ID_FS_TYPE/p')
    rootfstype=$ID_FS_TYPE
    [ -n "$rootfstype" ] && [ "$rootfstype" = "unknown" ] && $rootfstype=
    ID_FS_TYPE=
fi

# check filesystem if possible
if [ -z "$rootfstype" ]; then
    echo "invalid root filesystem -- exiting to /bin/sh"
    cd /
    PATH=$PATH PS1='$ ' /bin/sh -i



The patch in #39 worked perfectly, though.
Comment 46 Joe Morris 2009-03-11 23:49:59 UTC
Thanks for the info; it does happen on a new install and not just an upgrade.  As for which device / is on: it failed on mine with / on md0, so that isn't it.  I suspect the only relevant fact is that the root partition is on an md RAID device, which by default does not carry the ID_FS_TYPE info, and that causes the fsck to fail.  I have been using my (IMHO) rather kludgy fix for over 2 months with no apparent ill effects, and it does allow the system to work effectively normally.  Thanks for your input and info, Gordon.
Comment 47 Petr Matula 2009-03-25 14:35:35 UTC
New SLES/SLED 11: the same problem.
Comment 48 Martin Schaub 2009-04-03 19:53:54 UTC
I have a fix for this problem from a different direction.  Instead of fixing it in udev, I changed the /lib/mkinitrd/setup/91-mount.sh file to save the rootfstype variable at the end.

I am new to openSUSE, so I don't have a deep understanding of the system.  But I think this line should stay even when this bug is solved in udev: in my opinion it makes no sense to save the name of the fsck command in the configuration file and yet try to determine the file system type at boot time.
Comment 49 Petr Matula 2009-04-04 15:38:45 UTC
openSUSE Factory:
The boot resumes without this error.
Comment 50 Forgotten User E_KYbzzvNl 2009-04-10 07:33:53 UTC
(In reply to comment #19)

This fix worked nicely for many weeks, but with some of the latest patches I installed some days ago the fix was undone.  Booting stops again with the error "invalid root filesystem".  When I looked into my boot-md.sh today, the first two of the three lines added were gone.

GONE:            sleep 1
GONE:            echo change > /sys/block/md$md_minor/uevent
STILL THERE:     wait_for_events

After adding the two missing lines again and calling mkinitrd (from a root console), the problem was gone.
Comment 51 David Davey 2009-04-15 13:14:33 UTC
I can confirm the patches undid the fix on the machine involved in my report #24.  But in the meantime I have installed 11.1 on two other machines with md root filesystems where this problem did not arise.  This correlates with the problem machine having a 64-bit AMD Athlon processor and the two problem-free machines having Intel Core 2 Quad processors.
Comment 52 Cheng Shun Xia 2009-05-04 06:51:12 UTC
*** Bug 433980 has been marked as a duplicate of this bug. ***
Comment 53 Forgotten User bKE5XLoalW 2009-06-01 16:51:46 UTC
I got bitten too while upgrading to 11.1.

Patch from #24 fixed it.
Comment 54 Milan Vančura 2009-06-09 09:03:01 UTC
Thank you, Joe, for the patch from Comment #19.  I am handling it for *SUSE*10 in another bug (502714; I can't make it public, sorry about that).  For *SUSE*11 this is part of the mdadm package, so I'm reassigning this bug to the mdadm maintainer.
Comment 55 Hannes Reinecke 2009-06-09 09:10:48 UTC
Again, why can't md send out 'change' events once it detects that the device is usable?
That would require a kernel patch, but would save us quite a lot of hassle in userspace.
Comment 56 Hannes Reinecke 2009-06-09 09:14:42 UTC
drivers/md/md.c:md_new_event() seems like an ideal place for this ...
Comment 57 Hannes Reinecke 2009-06-09 09:32:17 UTC
Actually we have this:

	kobject_uevent(&mddev->gendisk->dev.kobj, KOBJ_CHANGE);

at the end of drivers/md/md.c:do_md_run(), so we should be getting CHANGE events for this case.
Comment 58 Hannes Reinecke 2009-06-09 09:47:33 UTC
Created attachment 296894 [details]
md-restart-uevent

Send 'change' uevent when array is restarted.
Comment 59 Hannes Reinecke 2009-06-09 09:48:29 UTC
Maybe the above helps.  Can you test with it while keeping 'udevadm monitor --env' running?  That will show us all the events being generated.
Comment 60 Andreas Osterburg 2009-06-11 07:32:46 UTC
After upgrading to the latest kernel on openSUSE 11.1 the same problem occurred.
The problem behind it is very simple, and so is the solution.

The main problem is that all udev rules are fired when a new MD array is assembled, but the array has not necessarily been started at that moment.

The critical section is within the udev-rules-file "64-md-raid.rules" (/lib/udev/rules.d/)

==> HERE IS THE PROBLEMATIC RULE:

IMPORT{program}="vol_id --export $tempnode"

vol_id will be called on an array that is assembled but not started,
so the determination of the filesystem type fails (env var ID_FS_TYPE),
and then the remaining parts of the initramfs fail.
You can see this if you call "mdadm --detail $tempnode" and pipe it to /dev/console from within these rules (a separate script is needed):
it says "status: clean, Not Started".
The solution is simple: Wait until the array is started.

My quick solution is the following:

A new udev helper called "md-stat.sh" which waits until the array is ready.
Run it before vol_id is called (the file is attached).

So add the following rule before the vol_id rule:

IMPORT{program}="md-stat.sh $tempnode"

(the dirtiest hack is to wait 3 seconds [IMPORT{program}="/bin/sleep 3"])

Four steps are needed to apply the fix:

1. Put md-stat.sh to /lib/udev
2. Modify /lib/udev/rules.d/64-md-raid.rules
3. run "mkinitrd"
4. reboot ;-)
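The attached md-stat.sh is not reproduced in this report, so the following is only a hypothetical reconstruction of the idea described above: keep calling mdadm --detail on the node passed in by udev until the output no longer reports "Not Started".  The retry count and the exact state matching are assumptions.

```shell
#!/bin/sh
# Hypothetical reconstruction of the md-stat.sh helper described above
# (the actual attachment is not included in this report).

# True when the `mdadm --detail` output passed in $1 no longer contains
# the "Not Started" state quoted in the comment above.
md_started() {
    case "$1" in
        *"Not Started"*) return 1 ;;
        *)               return 0 ;;
    esac
}

# Invoked from the udev rule as:  IMPORT{program}="md-stat.sh $tempnode"
if [ -n "$1" ]; then
    tries=30
    while [ "$tries" -gt 0 ]; do
        md_started "$(mdadm --detail "$1" 2>/dev/null)" && exit 0
        sleep 1
        tries=$((tries - 1))
    done
    exit 1      # give up; the subsequent vol_id rule will likely still fail
fi
```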
Comment 61 Andreas Osterburg 2009-06-11 07:35:31 UTC
Created attachment 297396 [details]
Script that waits until MD-array is started (for UDEV)
Comment 62 Peter Schlaf 2009-06-25 17:33:24 UTC
Had the same problem ("invalid root filesystem") during boot with openSUSE 11.1 and an AMD Athlon 2100+.  My Intel Core i7 920 and Pentium III computers have no such problem.  I installed your patch on my AMD system, however, and it worked!  No hang at boot time!  No [Ctrl + D] anymore!  Thank you very much!!
Comment 63 Michal Marek 2009-07-03 21:17:19 UTC
I built 11.1 test kernels with the patch from comment #57. Those of you who can reproduce the bug, please try them _without_ the mkinitrd / udev workarounds. The kernels are here: http://labs.suse.cz/mmarek/bnc445490/
Comment 64 Neil Brown 2009-07-03 22:03:26 UTC
Hi Michal,
Could you please do the same with the patch from comment #18 of bug #509495.
I'm fairly sure that will fix the problem.

Thanks.

I don't think that bug is world-readable, so I'll paste it in below
for others to see.


From: NeilBrown <neilb@suse.de>
References: bnc#509495
Subject: Update size of md device as soon as it is successfully assemble.

It is important that we get the size of an md device 'right' before
generating the KOBJ_CHANGE message.  If we don't, and a program
runs as a result of that message and opens the md device before the
mdadm which created the device closes it, the new program will see a
size of zero, which will be confusing.

So call revalidate_disk, which checks and updates the size.

Signed-off-by: NeilBrown <neilb@suse.de>

---
 drivers/md/md.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-2.6.27-SLE11_BRANCH.orig/drivers/md/md.c
+++ linux-2.6.27-SLE11_BRANCH/drivers/md/md.c
@@ -3809,6 +3809,7 @@ static int do_md_run(mddev_t * mddev)
 	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
 
+	revalidate_disk(mddev->gendisk);
 	mddev->changed = 1;
 	md_new_event(mddev);
 	sysfs_notify(&mddev->kobj, NULL, "array_state");
Comment 66 Michal Marek 2009-07-08 11:42:57 UTC
OK, I finally built packages with Neil's patch: http://labs.suse.cz/mmarek/bnc445490/neil/

The top of rpm changelog is
* Tue Jul 07 2009 mmarek@suse.cz
- patches.fixes/md-update-size: Update size of md device as soon
  as it is successfully assemble. (bnc#509495).

Please install these packages and revert all the previous workarounds, hopefully this is the real fix.
Comment 67 Joe Morris 2009-07-08 21:03:39 UTC
Just to let you know: I undid the workaround from Comment #19, installed the kernel packages for my machine from Comment #66, and it rebooted with no problems.  Looks like this finally fixes this bug.  Thanks!!!  Will this be rolled into a kernel update via the update repo?
Comment 68 Neil Brown 2009-07-09 06:43:24 UTC
(In reply to comment #67)
> Looks like this finally fixes this bug.  Thanks!!!  Will this be
> rolled into a kernel update via the update repo?


Thanks for testing.

Yes, I have checked the fix in, so it should be in any future update.
I don't know when that is scheduled to be.
Comment 69 Donald Haselwood 2009-10-30 01:08:52 UTC
This bug appears to still be present.  I did a NET install on 10/27/2009 and on boot it failed exactly as described above.  At first I thought the install was bad, but with a little luck I stumbled onto getting it to continue booting.  Following the suggestion in comment #16 fixed the problem.

Kernel: 2.6.27.2-9-pae
/md0 - Suse 10.3
/md1 - swap
/md2 - Suse 11.1

Maybe 11.1 on the 3rd RAID1 partition is the cause(?)
Comment 70 Joe Morris 2009-10-30 02:49:58 UTC
Were you installing 11.1?  If so, you will need to update to the latest kernel.
Comment 71 Michal Marek 2009-10-30 08:25:53 UTC
Right, the fix went into the 2.6.27.29-0.1 update kernel, watch for the following in the rpm changelog:
* Thu Jul 09 2009 nfbrown@suse.de
- patches.fixes/md-update-size: Update size of md device as soon
  as it is successfully assemble. (bnc#509495).
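To check whether a running kernel already carries the fix, comparing its version against 2.6.27.29 is a quick test (a sketch; the version-splitting assumes the usual openSUSE 2.6.27.x-y.z-flavor naming):

```shell
# kernel_at_least VER MIN -- true if dot-separated version VER >= MIN
kernel_at_least() {
    first=$(printf '%s\n%s\n' "$2" "$1" | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n | head -n1)
    [ "$first" = "$2" ]
}

# Strip the "-0.1-default" release/flavor suffix before comparing
running=$(uname -r | cut -d- -f1)
if kernel_at_least "$running" "2.6.27.29"; then
    echo "kernel $running should contain the md-update-size fix"
else
    echo "kernel $running predates the fix; install the update kernel"
fi
```

Checking the rpm changelog for the "patches.fixes/md-update-size" entry, as described above, remains the authoritative test.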
Comment 72 Donald Haselwood 2009-10-30 19:54:40 UTC
Yes, I was installing 11.1 (from the network).  When I rebooted, I encountered the problem.  I assumed that with the network install the kernel would be recent.  I ran the update and it went from 2.6.27.7-9 to 2.6.27.37-0.1, and it now boots OK.  So it looks like the bug will be with us until the kernel in the initial install is at least 2.6.27.29-0.1.
Comment 73 Michal Marek 2009-11-09 09:48:36 UTC
*** Bug 490008 has been marked as a duplicate of this bug. ***
Comment 74 Michal Marek 2010-01-22 12:56:45 UTC
*** Bug 484897 has been marked as a duplicate of this bug. ***
Comment 75 Shawn Perkins 2011-10-19 23:26:21 UTC
Guys, I am new to SUSE Linux.  I don't want to seem like a bonehead here, but I have been fighting this issue for days now.  I have a brand-new all-Intel server with RAID 10 across four 2 TB hard drives, and it runs great until I install updates and reboot; then it exits to the $ prompt, and I'm lost as to what to do next to fix it.  I see the solution above but am not sure how to implement it.  By the way, the OS is SUSE 11 sp1.
Comment 76 Neil Brown 2011-10-19 23:48:41 UTC
Do you mean "openSUSE-11.1" or "SLES11 SP1" or "SLED11 SP1" ??
"SUSE 11 sp1" isn't a valid name.

What kernel are you running?   Type
   uname -a
at the "$" prompt and report the output.
Comment 77 Shawn Perkins 2011-10-20 13:25:47 UTC
SLES-11 sp1

says

Linux (none)  2.6.27.54-0.2-default #1 SMP 2010-10-19 18:40:07 +0200 x86_64 x86_64 x86_64 GNU/Linux