Bug 721905

Summary: random /dev/md* device names
Product: [openSUSE] openSUSE 12.1 Reporter: Christian Boltz <suse-beta>
Component: Basesystem Assignee: Neil Brown <nfbrown>
Status: VERIFIED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None    
Version: Factory   
Target Milestone: ---   
Hardware: x86-64   
OS: Other   
Whiteboard:
Found By: Beta-Customer Services Priority:
Business Priority: Blocker: Yes
Marketing QA Status: --- IT Deployment: ---
Bug Depends on:    
Bug Blocks: 720919    
Attachments: Replacement /lib/mkinitrd/scripts/setup-md.sh
new mdadm RPM for x86_64 with fix.

Description Christian Boltz 2011-10-03 21:46:58 UTC
After updating from 11.4 to Factory (slightly newer than 12.1 beta1), I get more or less random device numbers for my raid devices.

Only /dev/md2 (my / partition) has a stable name.

The other partitions get a number starting at /dev/md124. To make it more interesting[tm], the numbers change at nearly every boot (but always start at 124).
Now combine that with an encrypted /home (including symlinked /var and /tmp) - this makes booting really funny :-/

I'm not sure which logs or details are useful here - please tell me what you need.


# cat /etc/mdadm.conf
DEVICE partitions
ARRAY /dev/md0 level=raid1 UUID=41b8971f:df6c3d81:40a999bc:490996f9
ARRAY /dev/md1 level=raid1 UUID=fa768d7e:b3eed134:3bf3c2ce:70ca362f
ARRAY /dev/md2 level=raid1 UUID=caa75fa1:5e0a6fcf:c456a754:896617d0
ARRAY /dev/md3 level=raid1 UUID=3b988094:5c379c09:cb3773cc:a44c243c
ARRAY /dev/md4 level=raid1 UUID=dfc46d0f:cf98c90a:163cbca9:c954a7c6


# cat /proc/mdstat 
Personalities : [raid1] [raid0] [raid10] [raid6] [raid5] [raid4] 
md124 : active raid1 sdb3[1] sda3[0]
      133162712 blocks super 1.0 [2/2] [UU]
      bitmap: 6/254 pages [24KB], 256KB chunk

md125 : active raid1 sdb2[1] sda2[0]
      200800 blocks super 1.0 [2/2] [UU]
      bitmap: 0/7 pages [0KB], 16KB chunk

md2 : active raid1 sda5[0] sdb5[1]
      10482308 blocks super 1.0 [2/2] [UU]
      bitmap: 0/160 pages [0KB], 32KB chunk

md126 : active (auto-read-only) raid1 sda6[0] sdb6[1]
      10482308 blocks super 1.0 [2/2] [UU]
      bitmap: 2/160 pages [8KB], 32KB chunk

md127 : active (auto-read-only) raid1 sda7[0] sdb7[1]
      1839396 blocks super 1.0 [2/2] [UU]
      bitmap: 0/8 pages [0KB], 128KB chunk

unused devices: <none>
Comment 1 Neil Brown 2011-10-04 21:55:31 UTC
This implies that when mdadm assembled the array it couldn't reliably determine which number it should use so it arbitrarily assigned one starting at 127 and working downwards.
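
A minimal sketch for spotting such fallback names in /proc/mdstat-style output. The helper name list_high_md and the default threshold of 124 are assumptions taken from the numbers seen in this report, not anything mdadm itself defines:

```shell
# Print md device names from mdstat-style text whose number is at or
# above a threshold. The default of 124 is an assumption based on the
# device numbers seen in this report.
list_high_md() {
    file=$1
    min=${2:-124}
    # Lines of interest look like "md124 : active raid1 sdb3[1] sda3[0]".
    awk -v min="$min" '/^md[0-9]+ :/ {
        n = substr($1, 3) + 0      # numeric part after "md"
        if (n >= min) print $1
    }' "$file"
}
```

Running "list_high_md /proc/mdstat" on an affected system should print the arrays that did not get their configured name.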

As the arrays appear to be listed in /etc/mdadm.conf, the possible explanations that come to mind are:

1/ the uuids in /etc/mdadm.conf don't match the uuids of the arrays.  So please
  report
  mdadm -Ds
  so we can compare uuids.

2/ The arrays are getting assembled from the initrd and the mdadm.conf on the
  initrd does not have correct information.  So please:
  mkdir /tmp/i
  cd /tmp/i
  zcat /boot/initrd-KERNEL_VERSION |cpio -idv
  cat etc/mdadm.conf
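
The UUID comparison asked for above can be sketched as follows. The helper names extract_uuids and compare_uuids are hypothetical, and the sketch assumes both inputs are files in mdadm.conf format (for example /etc/mdadm.conf and the saved output of "mdadm -Ds"):

```shell
# Print the UUID= values found in an mdadm.conf-style file,
# one per line, sorted. Works whether UUID= sits on the ARRAY
# line or on an indented continuation line.
extract_uuids() {
    grep -o 'UUID=[0-9a-fA-F:]*' "$1" | sed 's/^UUID=//' | sort
}

# Report whether two files list the same set of array UUIDs.
compare_uuids() {
    a=$(extract_uuids "$1")
    b=$(extract_uuids "$2")
    if [ "$a" = "$b" ]; then
        echo "match"
    else
        echo "differ"
    fi
}
```

For example: save the scan with "mdadm -Ds > /tmp/scan.conf", then run "compare_uuids /etc/mdadm.conf /tmp/scan.conf". The same helpers can be pointed at the etc/mdadm.conf extracted from the initrd.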

Thanks.
Comment 2 Christian Boltz 2011-10-04 22:32:26 UTC
(In reply to comment #1)
>   mdadm -Ds
>   so we can compare uuids.

ARRAY /dev/md/linux:0 metadata=1.0 name=linux:0 
  UUID=41b8971f:df6c3d81:40a999bc:490996f9
ARRAY /dev/md/linux:3 metadata=1.0 name=linux:3 
  UUID=3b988094:5c379c09:cb3773cc:a44c243c
ARRAY /dev/md/linux:4 metadata=1.0 name=linux:4 
  UUID=dfc46d0f:cf98c90a:163cbca9:c954a7c6
ARRAY /dev/md2 metadata=1.0 name=linux:2 
  UUID=caa75fa1:5e0a6fcf:c456a754:896617d0
ARRAY /dev/md/linux:1 metadata=1.0 name=linux:1 
  UUID=fa768d7e:b3eed134:3bf3c2ce:70ca362f

The UUIDs are the same as in /etc/mdadm.conf (unless I overlooked a minor difference).

>   zcat /boot/initrd-KERNEL_VERSION |cpio -idv
>   cat etc/mdadm.conf

ARRAY /dev/md2 metadata=1.0 name=linux:2 
  UUID=caa75fa1:5e0a6fcf:c456a754:896617d0

Looks like the initrd contains only the information for the / partition, not for my other /dev/md* arrays.

I just checked my backups - the initrd from openSUSE 11.4 also has only /dev/md2 in its mdadm.conf. The boot/51-md.sh scripts in the 11.4 and 12.1 initrds are exactly the same. However, there are some differences in
lib/udev/rules.d/64-md-raid.rules
Comment 3 Neil Brown 2011-10-04 22:48:38 UTC
Thanks.
I can see what is happening now.  I'll have to give it a bit of thought to figure
out how best to fix it.
In the meantime you could edit 64-md-raid.rules and remove (or comment out)
the line that mentions "--incremental" and then remake your initrd (mkinitrd).
Then the arrays should look right on the next boot.
Comment 4 Neil Brown 2011-10-05 03:31:35 UTC
Created attachment 454530 [details]
Replacement /lib/mkinitrd/scripts/setup-md.sh

Ok, I've had a think about this and tried some things out.

Please don't make the change to the udev/rules.d file - or reverse it if you have made it, and instead replace /lib/mkinitrd/scripts/setup-md.sh with the attached file.
The change is just this:
@@ -82,7 +82,7 @@
 
 if [ -n "$root_md" ] ; then
     need_mdadm=1
-    echo -n "" > $tmp_mnt/etc/mdadm.conf
+    echo "AUTO -all" > $tmp_mnt/etc/mdadm.conf
     for md in $md_devs; do
         eval echo -e \"\$md_conf_$md\" >> $tmp_mnt/etc/mdadm.conf
     done

It should cause the initrd environment to only try to assemble arrays listed in mdadm.conf.  Once boot completes the rest of the arrays will then be assembled based on the /etc/mdadm.conf in the root filesystem.
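
A stand-alone sketch of what the patched lines do, for trying the logic outside mkinitrd. The function name write_initrd_mdadm_conf and its arguments are hypothetical; the real script builds the ARRAY lines from its own md_conf_* variables:

```shell
# Hypothetical helper mirroring the patched logic: start the file with
# "AUTO -all" (disable auto-assembly of arrays not listed), then append
# one ARRAY line per remaining argument.
write_initrd_mdadm_conf() {
    out=$1
    shift
    echo "AUTO -all" > "$out"
    for line in "$@"; do
        printf '%s\n' "$line" >> "$out"
    done
}
```

Called with the root array's ARRAY line as the only extra argument, this writes a two-line file: "AUTO -all" followed by that ARRAY line.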

Please confirm that this works for you and I'll submit an update.

Thanks for the report.
Comment 5 Christian Boltz 2011-10-05 11:33:18 UTC
(In reply to comment #4)
> Replacement /lib/mkinitrd/scripts/setup-md.sh
> Please confirm that this works for you and I'll submit an update.

etc/mdadm.conf in the initrd now has:

AUTO -all
ARRAY /dev/md2 metadata=1.0 name=linux:2 
  UUID=caa75fa1:5e0a6fcf:c456a754:896617d0

Unfortunately I still get /dev/md12* devices, so you'll need another fix...
Comment 6 Neil Brown 2011-10-06 02:28:57 UTC
Thanks for testing.
It seems that "AUTO -all" is broken in the latest mdadm :-(

I've fixed it and submitted an update - it should filter through the system in a couple of days.  I'm highly confident that this will really fix it.

I'll attach an RPM that I have built so you can test with that if you like.

Setting Needinfo for confirmation, either from this RPM or when an 'official' one gets out.

Thanks,
NeilBrown
Comment 7 Neil Brown 2011-10-06 02:30:13 UTC
Created attachment 454797 [details]
new mdadm RPM for x86_64 with fix.
Comment 8 Bernhard Wiedemann 2011-10-06 03:00:10 UTC
This is an autogenerated message for OBS integration:
This bug (721905) was mentioned in
https://build.opensuse.org/request/show/86854 Factory / mdadm
Comment 9 Christian Boltz 2011-10-07 20:01:53 UTC
Tested with mdadm-3.2.2-3.1.x86_64 from Factory. Good news: I get the correct device names again :-)

Thanks for fixing this issue!
Comment 10 Neil Brown 2011-10-07 21:10:32 UTC
Thanks for the confirmation - and the original report.