Bug 793954

Summary: raid1 array sometimes started with one of two disks at boot (again)
Product: [openSUSE] openSUSE Tumbleweed Reporter: Christian Boltz <suse-beta>
Component: Basesystem    Assignee: Neil Brown <nfbrown>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: bugz57, fcrozat, forgotten_5bOUleMVRM, jslaby
Version: 13.1 Milestone 1   
Target Milestone: ---   
Hardware: Other   
OS: openSUSE 12.2   
Whiteboard:
Found By: Beta-Customer    Services Priority:
Business Priority:    Blocker: ---
Marketing QA Status: ---    IT Deployment: ---
Attachments: boot.log
dmesg output
alternate mdadm to test
Updated RPM which prints messages when it gets EBUSY and is forced to wait
boot log received via journalctl - md0 breaking
new rpm with extra EBUSY loop
dmesg output (initscript changed according to comment 21)
init script 1
init script 2 (boot.md-retry)
sysconfig addition
strace of multiple reboots
reboot 1 degraded
reboot 2 ok
Log files of reboot 2 (comment #51)
Log files for comment #55

Description Christian Boltz 2012-12-11 18:25:00 UTC
Created attachment 516596 [details]
boot.log

(factory from 2012-12-03, mdadm-3.2.6-2.1)

It seems the random "start raid array with only one of two disks" problem is back in Factory :-(

Details:
I have two harddisks in my laptop. Each of them has several partitions, which are paired as RAID 1 (mirroring) arrays.

In the last days, my /boot partition (md0) was started with one of two disks (IIRC this happened twice). Today my /testroot partition (md3) started with only one active disk. dmesg does not show any sign of broken disks.

At least for /boot, re-adding the second disk with mdadm worked without any complaints (and in the next boots, it came up with both disks), and I expect the same for /testroot.

I'll attach boot.log and dmesg output with md3 starting with only one disk. (Sorry, no /v/l/messages because my syslog setup is broken :-( )
Comment 1 Christian Boltz 2012-12-11 18:25:44 UTC
Created attachment 516598 [details]
dmesg output
Comment 2 Neil Brown 2012-12-12 05:27:16 UTC
If you are referring to md3 then the logs suggest:

[   13.478064] md: bind<sda6>
[   13.478999] md: kicking non-fresh sdb6 from array!
[   13.479017] md: unbind<sdb6>
[   13.481045] md: export_rdev(sdb6)
[   13.483721] md/raid1:md3: active with 1 out of 2 mirrors


that sdb6 had failed previously and was not included for that reason - perfectly correct.

Why it failed I cannot tell without /var/log/messages.

So it all appears to be working correctly -> invalid.
Comment 3 Christian Boltz 2012-12-12 12:00:00 UTC
(In reply to comment #2)
> that sdb6 had failed previously and was not included for that reason -
> perfectly correct.

Then it probably failed on the previous boot.

> Why it failed I cannot tell without /var/log/messages.

And that's something I can't provide because syslog is broken on my system :-(

Anyway - thanks for checking this. If this happens again and I have more useful logs, I'll report back.
Comment 4 Christian Boltz 2013-01-10 17:16:39 UTC
Happened again (Factory from 2012-12-25, mdadm-3.2.6-2.2.x86_64) - and this time I have logs :-)

The relevant part from /var/log/messages are (from booting):

2013-01-09T12:08:18.137440+01:00 geeko kernel: [   14.418724] md: md1 stopped.
2013-01-09T12:08:18.137442+01:00 geeko kernel: [   14.608502] md: bind<sdb3>
2013-01-09T12:08:18.137444+01:00 geeko kernel: [   14.608549] md: could not open unknown-block(8,3).
2013-01-09T12:08:18.137446+01:00 geeko kernel: [   14.608552] md: md_import_device returned -16
2013-01-09T12:08:18.137448+01:00 geeko kernel: [   14.638234] md/raid1:md1: active with 1 out of 2 mirrors
2013-01-09T12:08:18.137449+01:00 geeko kernel: [   14.638664] created bitmap (254 pages) for device md1
2013-01-09T12:08:18.137451+01:00 geeko kernel: [   14.646312] md1: bitmap initialized from disk: read 16 pages, set 0 of 520167 bits
2013-01-09T12:08:18.137456+01:00 geeko kernel: [   14.666434] md1: detected capacity change from 0 to 136358617088
2013-01-09T12:08:18.137459+01:00 geeko kernel: [   14.668951]  md1: unknown partition table

On the previous boot, md1 was assembled with both disks, and the log does not contain anything that could explain why it broke.

BTW: Do you need something from the missing disk or can I re-integrate it into the array?
Comment 5 Neil Brown 2013-02-07 03:27:05 UTC
Sorry, I've been very distracted throughout January.
Hopefully you didn't wait for me and have already re-integrated the missing disk into the array.  If not, feel free to do so.

The "unknown-block(8,3)" is strange.  8,3 is /dev/sda3.  If Linux doesn't know that device, it suggests that sda has just been removed in some way.  It must have been present previously, else mdadm wouldn't have tried to use it in the array.

The "return -16" means EBUSY.  I'm surprised that a disappeared drive would return EBUSY rather than EIO or ENODEV or similar, so I might be misinterpreting something.

However it is clear that when md tries to access /dev/sda3 it cannot, either because it doesn't exist any more, or because it is being used by something else.

It might be possible that mdadm is racing with itself and another mdadm gets an O_EXCL open on the device just as md tries to open it.  That wouldn't explain the "unknown-block(8,3)", but would explain the rest.

mdadm wasn't originally designed with concurrent access in mind and there have been a few issues in that area of late.  I'll see if I can come up with a credible race scenario, then see if I can provide some locking against it.
Comment 6 Christian Boltz 2013-02-09 18:12:19 UTC
(In reply to comment #5)
> Hopefully you didn't wait for me and have already re-integrated the missing
> disk into the array.  If not, feel free to do so.

I'm just doing it _again_ :-( - this time sdb3 was missing from md1
(are you interested in fresh logs? - note to myself: 2013-02-09T11:39:33)

Besides your ideas what could go wrong, would an unclear umount/shutdown be an explanation for what I'm seeing? md1 is my /home partition and also contains /var (/var is a symlink to /home/sys-var/). I'm quite sure I've seen umounting md1 failing at shutdown because it's still in use, but I don't know if this happens always or only sometimes (and it's probably too late to be logged).

BTW: The resync is usually extremely fast (120 G in 2 minutes) - is there a "list of changed sectors" that avoids a full resync?
Comment 7 Neil Brown 2013-02-11 01:52:16 UTC
An unclean shutdown wouldn't cause "md_import_device returned -16" so I don't think that is a cause.
I don't need extra logs, thanks.
Your array has a 'write-intent bitmap' which provides exactly the "list of changed sectors" that you suggest.  So it only tries to recover the bits that need to be recovered - 2 minutes sounds about right.
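
For reference, a quick way to confirm the bitmap exists and inspect it (a sketch; the device names here are examples, and the mdadm commands assume a reasonably recent mdadm):

```shell
# Arrays with an internal write-intent bitmap show a "bitmap: ..." line
# in /proc/mdstat; print the md sections if any arrays are active here.
status=$(grep -s -A3 '^md' /proc/mdstat || echo "no active md arrays on this machine")
echo "$status"
# Examine the bitmap superblock on a member device (example device):
#   mdadm --examine-bitmap /dev/sda3
# An internal bitmap can be added to an array that lacks one:
#   mdadm --grow /dev/md1 --bitmap=internal
```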
Comment 8 Neil Brown 2013-02-19 03:42:24 UTC
Created attachment 525153 [details]
alternate mdadm to test

I looked at the code again, and "unknown-block()" is exactly what is expected, so that doesn't tell us anything.
So it is simply an EBUSY problem, meaning a race with mdadm somehow.

Closing the race doesn't seem easy without a lot of code change (which will happen in the next upstream release, but isn't suitable for openSUSE yet), but retrying should be just as successful and a lot easier.
So I've modified the code to retry if it gets EBUSY.  It only retries 5 times, but that should be plenty.
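
The actual fix is in mdadm's C code, but the retry idea can be sketched in shell (a hypothetical `retry_busy` helper; 5 attempts with a short pause, mirroring the behaviour described above):

```shell
# Retry a command up to 5 times while it keeps failing, sleeping briefly
# between attempts -- the same shape as the EBUSY retry added to mdadm.
retry_busy() {
    tries=5
    until "$@"; do
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] || return 1   # give up after 5 attempts
        sleep 0.2                        # brief pause before retrying
    done
    return 0
}
```

Usage would be e.g. `retry_busy open_device /dev/sdb3`, where `open_device` stands in for whatever operation returned EBUSY.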

Could you please install that attached rpm:

  rpm -U mdadm.....

and see if the problem recurs.  I suspect we'll have to wait several weeks before we can be sure it is fixed...
Comment 9 Christian Boltz 2013-02-19 12:11:42 UTC
Thanks for the test package! I just installed it - let's see what happens...
Comment 10 Christian Boltz 2013-02-20 20:44:08 UTC
BTW: Does your patched test package log anything if it retries because of EBUSY? If not, please add it - grepping for something like "EBUSY - retrying $device" is easier than just waiting for a random bug not to appear ;-)
Comment 11 Neil Brown 2013-02-25 05:16:10 UTC
Created attachment 526231 [details]
Updated RPM which prints messages when it gets EBUSY and is forced to wait

Thanks for the suggestion.  This version prints those messages.
Comment 12 Christian Boltz 2013-02-26 20:03:45 UTC
Just happened again :-(

The previous boot looks good (I'm logging the content of /proc/mdstat from boot.local):

2013-02-25T17:48:01.742859+01:00 geeko boot.local[3338]: md1 : active raid1 sda3[0] sdb3[2]
2013-02-25T17:48:01.743058+01:00 geeko boot.local[3338]: 133162712 blocks super 1.0 [2/2] [UU]

and there's also no sign of problems in the log later.

FYI: I grepped for "md1|sda3|sdb3" + context lines

On the next boot, only sda3 was used for md1:

2013-02-26T12:02:46.684946+01:00 geeko kernel: [    1.704779]  sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 >
2013-02-26T12:02:46.684948+01:00 geeko kernel: [    1.705234] sd 0:0:0:0: [sda] Attached SCSI disk
2013-02-26T12:02:46.684950+01:00 geeko kernel: [    1.706459]  sdb: sdb1 sdb2 sdb3 sdb4 < sdb5 sdb6 sdb7 >
2013-02-26T12:02:46.684952+01:00 geeko kernel: [    1.706884] sd 1:0:0:0: [sdb] Attached SCSI disk
[...]
2013-02-26T12:02:46.685507+01:00 geeko kernel: [   17.016099] md: md1 stopped.
2013-02-26T12:02:46.685510+01:00 geeko kernel: [   17.018210] md: bind<sda3>
2013-02-26T12:02:46.685512+01:00 geeko kernel: [   17.044953] md/raid1:md1: active with 1 out of 2 mirrors
2013-02-26T12:02:46.685517+01:00 geeko kernel: [   17.045083] created bitmap (254 pages) for device md1
2013-02-26T12:02:46.685520+01:00 geeko kernel: [   17.052109] md1: bitmap initialized from disk: read 16 pages, set 0 of 520167 bits
2013-02-26T12:02:46.685522+01:00 geeko kernel: [   17.134417] md1: detected capacity change from 0 to 136358617088
2013-02-26T12:02:46.685524+01:00 geeko kernel: [   17.335114]  md1: unknown partition table

No sign of an error message :-(  but the line "bind<sdb3>" is missing.
This still leaves the question why sdb3 was not added - any idea where you could add/log another error message? ;-)

Another boot later, I get an error message (as expected, because sdb3 is outdated in the meantime):
2013-02-26T16:54:09.493860+01:00 geeko kernel: [   11.950498] md: bind<sda3>
2013-02-26T16:54:09.493865+01:00 geeko kernel: [   11.951345] md: kicking non-fresh sdb3 from array!
2013-02-26T16:54:09.493867+01:00 geeko kernel: [   11.951353] md: unbind<sdb3>
2013-02-26T16:54:09.493869+01:00 geeko kernel: [   11.952442] md: export_rdev(sdb3)
2013-02-26T16:54:09.493871+01:00 geeko kernel: [   11.953904] md/raid1:md1: active with 1 out of 2 mirrors
Comment 13 Neil Brown 2013-02-27 06:44:15 UTC
The messages from mdadm would appear in the output of
  journalctl -b

but I can see that it didn't wait at all from the lack of any delay between "md: md1 stopped" and "md: bind <sda3>".  So it didn't wait and give up - it just didn't see sdb3 at all.

Could you add "-vv" to the mdadm command in /etc/init.d/boot.md
Both
                $mdadm_BIN -IRs
                $mdadm_BIN -A -s -c $mdadm_CONFIG

could have "-vv" added at the end.

Then when it happens again, 
  journalctl -b | grep mdadm

and see what it says.
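
For reference, the edit can be scripted (a sketch, demonstrated here on a temporary copy of the two lines rather than on the live init script):

```shell
# Append "-vv" to both mdadm assembly calls as they appear in boot.md.
f=$(mktemp)
printf '%s\n' '$mdadm_BIN -IRs' '$mdadm_BIN -A -s -c $mdadm_CONFIG' > "$f"
sed -i 's/^\$mdadm_BIN .*/& -vv/' "$f"
cat "$f"
```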

I've tried it myself with the latest Factory and I cannot reproduce the problem, but then it is quite possibly dependent on exact timing, and that will vary with different hardware.

Am I correct in thinking that this never happens to the "root" array, only to other arrays that are assembled once the system has booted?
Comment 14 Christian Boltz 2013-02-27 12:38:11 UTC
(In reply to comment #13)
> Could you add "-vv" to the mdadm command in /etc/init.d/boot.md

Done.

> Am I correct in thinking that this never happens to the "root" array, only to
> other arrays what would be assembled once the system has booted?

IIRC you are correct (but I'm not 100% sure). When first reporting the problem, it affected /boot and /testroot, but in the last months it's typically my encrypted /home (which also contains /var via symlinks) that breaks.
Comment 15 Christian Boltz 2013-03-01 18:36:49 UTC
Created attachment 527783 [details]
boot log received via journalctl - md0 breaking

This time my /boot (md0, sd*2) broke.

Note: /home (md1, sd*3) was broken on one of the previous boots already, before I added the -vv option.
Comment 16 Neil Brown 2013-03-03 23:55:44 UTC
Created attachment 527872 [details]
new rpm with extra EBUSY loop

I missed a spot where it can fail with EBUSY.
please "rpm -U" this rpm and try again.
Thanks for your patience.
Comment 17 Christian Boltz 2013-03-04 12:02:59 UTC
(In reply to comment #16)
> I missed a spot where it can fail with EBUSY.

No problem ;-)  I just installed the updated package.
Comment 18 Christian Boltz 2013-03-08 00:12:34 UTC
With the latest package, I get several
    boot.md[335]: EBUSY - retrying dev_open of /dev/sda5
    boot.md[335]: EBUSY - retrying dev_open of /dev/sdb5
which is not really surprising - sd*5 is part of md2, which is my root partition and already mounted by initrd. (Nevertheless, wouldn't it make sense to skip checking partitions that are already in use by an active raid array?)
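
For illustration, one way to see which partitions are already claimed (a sketch using sysfs; whether mdadm could safely rely on this is a separate question):

```shell
# List member devices already claimed by active md arrays; an assembly
# pass could skip these instead of probing them and hitting EBUSY.
members=$(ls -d /sys/block/md*/md/dev-* 2>/dev/null | sed 's|.*/dev-||' | tr '\n' ' ')
echo "in use by md: ${members:-none}"
```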

It also looks like your changes prevented a failure:

Mar 05 12:13:33 geeko.boltz boot.md[316]: mdadm: looking for devices for /dev/md1
[...]
Mar 05 12:13:35 geeko.boltz boot.md[316]: EBUSY - retrying dev_open of /dev/sdb3
[...]
Mar 05 12:13:36 geeko.boltz boot.md[316]: mdadm: /dev/sdb3 is identified as a member of /dev/md1, slot 1.
Mar 05 12:13:36 geeko.boltz boot.md[316]: mdadm: /dev/sda3 is identified as a member of /dev/md1, slot 0.
Mar 05 12:13:36 geeko.boltz boot.md[316]: mdadm: added /dev/sdb3 to /dev/md1 as 1
Mar 05 12:13:36 geeko.boltz boot.md[316]: mdadm: added /dev/sda3 to /dev/md1 as 0
Mar 05 12:13:37 geeko.boltz boot.md[316]: mdadm: /dev/md1 has been started with 2 drives.

and also helped two days later (Mar 07 12:14:51) with similar log messages.
Comment 19 Neil Brown 2013-03-12 01:58:46 UTC
That is good news - thanks.

You are right that mdadm should skip partitions that are already in use by another RAID, but the way it does that produces false positives, which is the root of the problem we've been seeing.

So now you have an effective work-around and I have a clear understanding of the problem.  I'll try to sort out a more effective fix and get it submitted.
Comment 20 Neil Brown 2013-03-12 02:46:31 UTC
*** Bug 805415 has been marked as a duplicate of this bug. ***
Comment 21 Neil Brown 2013-03-13 01:03:20 UTC
Actually I don't have a completely clear understanding.

I can see the races and the need to avoid them... but they still shouldn't be happening.

The race can only be between "mdadm -I" run by udev, and "mdadm -As" run by boot.md.
However boot.md calls 'udevadm settle', which should wait for all udev activity to complete. So "mdadm -I" shouldn't run concurrently with "mdadm -As".

I've found a case where "udevadm settle" won't actually wait for the queue to settle, but it seems an unlikely case.  It is worth checking though.

So: could you please edit your "/etc/init.d/boot.md" and replace the line

            /sbin/udevadm settle --timeout="$MDADM_DEVICE_TIMEOUT"

with

            /sbin/udevadm --debug settle --timeout="$MDADM_DEVICE_TIMEOUT" > /dev/kmsg 2>&1

This should cause "calling: settle" to appear in the "dmesg" output.
If "timeout waiting for udev queue" appears as well then that points to the problem.  However whatever appears there might be useful.
Comment 22 Neil Brown 2013-03-13 04:06:31 UTC
*** Bug 760859 has been marked as a duplicate of this bug. ***
Comment 23 Christian Boltz 2013-03-13 18:58:54 UTC
Created attachment 529591 [details]
dmesg output (initscript changed according to comment 21)

This is just from one boot. If you need more (or need one with EBUSY), I'll have to write a small log collector in boot.local ;-)
Comment 24 Volker Kuhlmann 2013-03-14 02:33:27 UTC
I've just had a major filesystem corruption because of this. Back in SuSE 9.2 Linux used to just work...

Anyway it's essential to re-attach the missing disks and wait for the sync to finish before rebooting; otherwise next time the *other* disk might be the sole member of the array, and then it's up the creek very quickly. My mistake was to not notice the degraded array for a week.

All packages are current (oS 12.1, but same problem).

I've put some more logging into place, and the symptoms are always that the md1 raid1 array (md0 for the root fs is never a problem) is randomly created with one disk, the other disk, both disks, or no disk. In the last case there's a huge timeout (for mounting the non-existing disk) before entering emergency mode.
Comment 25 Volker Kuhlmann 2013-03-14 02:39:51 UTC
For those frustrated users who just want their system to work because they have work to do, I attach two init scripts which should take care of the problem.

After copying the files into place, run

chkconfig -a boot.md-check
chkconfig -a boot.md-retry

and append the -append file to /etc/sysconfig/mdadm, AND EDIT it for your system.
Comment 26 Volker Kuhlmann 2013-03-14 02:41:13 UTC
Created attachment 529614 [details]
init script 1
Comment 27 Volker Kuhlmann 2013-03-14 02:42:09 UTC
Created attachment 529615 [details]
init script 2 (boot.md-retry)
Comment 28 Volker Kuhlmann 2013-03-14 02:42:42 UTC
Created attachment 529616 [details]
sysconfig addition
Comment 29 Neil Brown 2013-03-14 05:07:22 UTC
All that is rather complex....
The easiest way to avoid the problem is to comment out the lines in /usr/lib/udev/rules.d/64-md-raid.rules (though that file might be in /lib/udev/rules.d, depending on version) which mention "/sbin/mdadm --incremental".

That should make it work reliably.
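
A sketch of that edit, demonstrated on a temporary copy (the rule text is the one quoted elsewhere in this bug; the exact path and wording vary by release):

```shell
rules=$(mktemp)
# The incremental-assembly rule as it appears in 64-md-raid.rules:
echo 'ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"' > "$rules"
# Comment it out so udev no longer races the boot.md assembly:
sed -i '/mdadm --incremental/ s/^/# /' "$rules"
cat "$rules"
```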

The problem seems to be that the "udevadm settle" call in boot.md doesn't always wait until udev has settled, but I cannot find any possible reason why.

(Thanks for the logs Christian - unfortunately it doesn't show anything useful, which suggests I was on the wrong track).
Comment 30 Neil Brown 2013-03-14 07:10:36 UTC
Ahh.. I might have it at last

This bug is very similar to bug 772286 which we fixed by adding
# Should-Start: udev-trigger
to /etc/init.d/boot.md.

However it seems that udev-trigger isn't a service any more.  Rather there is a 

  systemd-udev-trigger.service

So maybe we just need to change "udev-trigger" in boot.md to "systemd-udev-trigger".  A bit of experimentation suggests that changing

# Should-Start: boot.scsidev boot.multipath udev-trigger

in /etc/init.d/boot.md to 

# Should-Start: boot.scsidev boot.multipath systemd-udev-trigger

does seem to change the behaviour as I would expect.  I don't have any problems with assembling arrays (as the problem is very sensitive to particular timing of various events) so I cannot be certain that this fixes the problem.  If someone who does experience the problem could make this change and see if it fixes the problem, that would be great.

Frederic: apart from wanting to let you know that I'm a bit grumpy about this name change (not your fault exactly, but I wanted to complain to someone) I notice that the /usr/lib/systemd/system/systemd-udev-trigger.service file looks wrong based on the man page (and a quick look at the code).
It contains:

ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add ; /usr/bin/udevadm trigger --type=devices --action=add

However the body of "ExecStart" is not a shell script, but a command and some args, so this passes ";" and "/usr/bin/udevadm" etc as extra args to /usr/bin/udevadm.
For "Type=oneshot" services you are allowed multiple ExecStart, so this should be

ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add
ExecStart=/usr/bin/udevadm trigger --type=devices --action=add

This change might also be needed for md arrays to be started properly but I'm not 100% sure.

The upstream code has this bug.  Would you follow it up with upstream, or would you rather that I did?
Comment 31 Frederic Crozat 2013-03-14 09:22:37 UTC
(In reply to comment #30)
> Ahh.. I might have it at last
> 
> This bug is very similar to bug 772286 which we fixed by adding
> # Should-Start: udev-trigger
> to /etc/init.d/boot.md.
> 
> However it seems that udev-trigger isn't a service any more.  Rather there is a 
> 
>   systemd-udev-trigger.service
> 
> So maybe we just need to change "udev-trigger" in boot.md to
> "systemd-udev-trigger".  A bit of experimentation suggests that changing
> 
> # Should-Start: boot.scsidev boot.multipath udev-trigger
> 
> in /etc/init.d/boot.md to 
> 
> # Should-Start: boot.scsidev boot.multipath systemd-udev-trigger
> 
> does seem to change the behaviour as I would expect.  I don't have any problems
> with assembling arrays (as the problem is very sensitive to particular timing
> of various events) so I cannot be certain that this fixes the problem.  If
> someone who does experience the problem could make this change and see if it
> fixes the problem, that would be great.
> 
> Frederic: apart from wanting to let you know that I'm a bit grumpy about this
> name change (not your fault exactly, but I wanted to complain to someone) I
> notice that the /usr/lib/systemd/system/systemd-udev-trigger.service file looks
> wrong based on the man page (and a quick look at the code).
> It contains:
> 
> ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add ;
> /usr/bin/udevadm trigger --type=devices --action=add
> 
> However the body of "ExecStart" is not a shell script, but a command and some
> args, so this passes ";" and "/usr/bin/udevadm" etc as extra args to
> /usr/bin/udevadm.

Indeed. Splitting with ; works fine as long as the first binary doesn't take an argument, otherwise, it is lost.


> For "Type=oneshot" services you are allowed multiple ExecStart, so this should
> be
> 
> ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add
> ExecStart=/usr/bin/udevadm trigger --type=devices --action=add

Looks better indeed.

> This change might also be needed for md arrays to be started properly but I'm
> not 100% sure.
> 
> The upstream code has this bug.  Would you follow it up with upstream, or would
> you rather that I did?

Feel free to send a patch upstream, I'll backport it once it is approved
Comment 32 Volker Kuhlmann 2013-03-14 10:17:20 UTC
Re comment #30
Unfortunately adding udev-trigger to 
  # Should-Start: udev-trigger
DOES NOT fix the problem.

My older SATA/PATA system still has the line
  ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
in /lib/udev/rules.d/64-md-raid.rules commented out. Otherwise it's up to all updates of 12.1, and it seems reliable.

My system with 2 SATA has all updates, no changes, and raid1 is @*^$ed (still 12.1 too).

My additional scripts may look excessive but have the deciding advantage that they are guaranteed to work. Their warning features will also be necessary for any future tests (I don't want to get screwed again).

Of about 8 reboots for testing today the problem showed in the first 3 and then disappeared. I hate that. Is there anything that could adapt to a changed system configuration (straws, grasp)?

Over the next few weeks I expect to upgrade to 12.3, until then I can test the change

-ExecStart=/sbin/udevadm trigger --type=subsystems --action=add ; /sbin/udevadm trigger --type=devices --action=add
+ExecStart=/sbin/udevadm trigger --type=subsystems --action=add
+ExecStart=/sbin/udevadm trigger --type=devices --action=add

in /lib/systemd/system/udev-trigger.service
Comment 33 Volker Kuhlmann 2013-03-14 11:08:57 UTC
I fixed the broken udev-trigger.service and rebooted 5 times.
The first four had no problems.
On the fifth reboot "Starting LSB MD RAID" failed, but my check script, running immediately after, passed.

Btw is it possible to get the boot console(!) log into a file? Setting
  ENFORCE_BLOGD="yes"
in sysconfig/boot does nothing. There is no boot.msg file.

So it looks a bit like you want to get that broken udev file (and its name in boot.md) fixed in 12.3 yesterday...
Comment 34 Frederic Crozat 2013-03-14 12:05:17 UTC
(In reply to comment #33)
> 
> Btw is it possible to get the boot console(!) log into a file? Setting
>   ENFORCE_BLOGD="yes"
> in sysconfig/boot does nothing. There is no boot.msg file.

Use journalctl -b to get all the logs from current boot.

> So it looks a bit like you want to get that broken udev file (and its name in
> boot.md) fixed in 12.3 yesterday...

We'll ship a maintenance update on relevant distributions once we are sure all the fixes are tested, of course :)
Comment 35 Volker Kuhlmann 2013-03-14 21:30:55 UTC
journalctl: Command not found.

Everyone's interested in the latest, but I have to test this on oS 12.1 for the time being. (Do I put this down to yet another case of half-baked banana software like systemd being let loose on the guinea pigs, ooops, openSUSE users...??? :-)
Comment 36 Frederic Crozat 2013-03-15 08:30:05 UTC
(In reply to comment #35)
> journalctl: Command not found.

sorry, I forgot you are on 12.2, use systemd-journalctl instead.


> Everyone's interested in the latest, but I have to test this on oS 12.1 for the
> time being. (Do I put this down to yet another case of half-baked banana
> software like systemd being let lose on the guinee pigs ooops opensuse
> users...??? :-)

These kinds of comments won't really help motivate people to help you..
Comment 37 Volker Kuhlmann 2013-03-15 10:13:44 UTC
systemd-journalctl: Command not found.

rpm -ql systemd|grep -i jour
Exit 1

12.*1* (you even quoted my text saying so), systemd version 37

As far as I am concerned, my comments on systemd are entirely appropriate. Sure it boots faster, but the time saved from that comes nowhere near making up for the time I have wasted on it so far. That help you mention goes both ways; I feel I have put considerable time into making openSUSE functional. Not having console logging makes my life more difficult and costs me time, and it appears to be absent because of systemd. I have nothing against replacing known functional software with better versions, but I have issues with wanna-be replacements that only half work. From other online comments I take it I'm not the only one being less than impressed. I'd pay money for an openSUSE that works well (raid isn't the only issue I'm having).

When I briefly asked how to get the console log into a file I was expecting the answer to be a 1-liner. Sorry for the OT question.
Comment 38 Frederic Crozat 2013-03-15 10:28:13 UTC
This bug is mixing a lot of various versions. For 12.1, there is no journal.
Comment 39 Volker Kuhlmann 2013-03-15 10:56:22 UTC
If anyone has info/suggestions for making the problem of raid assembly failure more probable, please speak up. My results were variable: sometimes no problem several times in a row, sometimes some problem 3 times in a row.
Comment 40 Frederic Crozat 2013-03-15 16:24:00 UTC
After more tests, there is no bug in systemd; ExecStart should correctly handle ";". But if "systemctl daemon-reload" is run at some point, the status output might become incomplete, which can give the impression that only one process was launched:

systemd-udev-trigger.service - udev Coldplug all Devices
	  Loaded: loaded (/usr/lib/systemd/system/systemd-udev-trigger.service; static)
	  Active: active (exited) since Fri, 2013-03-15 12:48:43 CET; 4h 29min ago
	    Docs: man:udev(7)
	          man:systemd-udevd.service(8)
	 Process: 308 ExecStart=/usr/bin/udevadm trigger --type=devices --action=add (code=exited, status=0/SUCCESS)
	 Process: 306 ExecStart=/usr/bin/udevadm trigger --type=subsystems --action=add (code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/system/systemd-udev-trigger.service

would become :

systemd-udev-trigger.service - udev Coldplug all Devices
	  Loaded: loaded (/usr/lib/systemd/system/systemd-udev-trigger.service; static)
	  Active: active (exited) since Fri, 2013-03-15 12:48:43 CET; 4h 29min ago
	    Docs: man:udev(7)
	          man:systemd-udevd.service(8)
	 Main PID: 308 (code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/system/systemd-udev-trigger.service
Comment 41 Neil Brown 2013-03-21 03:22:15 UTC
I agree that systemd doesn't have a problem.  The documentation for ExecCommand doesn't mention ';', but the doco for ExecCommandStart does, and I found where the code makes a special case of ';' so that is all good.

I may have been a bit hasty in combining all these bugs together but they do look very similar.  The core problem is that udev is running "mdadm -I" for various devices *after* "udevadm settle" has completed.

This can definitely happen if the "udev-trigger" systemd service starts after "udevadm settle" was called, so making boot.md depend on udev-trigger can close that hole - and it definitely needs to be closed for 12.3.

But it seems that there is another hole, and I cannot find it.  And as I cannot reproduce it, that makes it extra hard.

Maybe if you could capture an 'strace' of the "udevadm settle" call made by boot.md.  That might help - and it might help even for a case where no problem with raid arrays is seen, though a case where the problem does occur would be better.

i.e. modify /etc/init.d/boot.md so that  the "udevadm settle --timeout=...." line is
   strace -o /tmp/udevadm.trace -s 300 udevadm settle --timeout=....

i.e. just put "strace -o /tmp/udevadm.trace -s 300" in front of the command.
If you can do that, then attach the /tmp/udevadm.trace file it might help.
Comment 42 Volker Kuhlmann 2013-04-04 21:09:23 UTC
Thanks Neil, much appreciated.

I added that strace, with an -f as well. The output isn't very long. Of 4 reboots, 1 was OK, 1 was a total failure, 2 were degraded.

My diff, the strace output, and the md status are in files in the attached tar file. If you'd like me to run more and different tests no problem, and I'll be a bit faster.
Comment 43 Volker Kuhlmann 2013-04-04 21:12:20 UTC
Created attachment 533616 [details]
strace of multiple reboots
Comment 44 Neil Brown 2013-04-10 01:26:49 UTC
Thanks Volker.
Unfortunately it doesn't get me much closer.  "udevadm settle" seems to be behaving exactly correctly.  Yet something related to it isn't working.

So I'm still stuck.
Comment 45 Neil Brown 2013-04-10 01:29:09 UTC
Hi Mr Maintenance,
 a partial fix for this bug as it affects 12.3 is known (comment #30), and since a complete fix is still eluding me, we should probably queue an update for just that one.
It is: change "udev-trigger" to "systemd-udev-trigger" in boot.md

Could you make it a planned update for 12.3 please?
Comment 46 Volker Kuhlmann 2013-04-12 05:18:03 UTC
It occurred to me that stracing the mdadm run by boot.md would give only a small part of the picture, and logging *ALL* invocations of mdadm including location and arguments would be more interesting. So I put a wrapper script around /sbin/mdadm.
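
A hypothetical version of such a wrapper (it assumes the real binary has been moved aside to /sbin/mdadm.real; written to a temporary path here for illustration rather than installed over /sbin/mdadm):

```shell
wrapper=$(mktemp)
cat > "$wrapper" <<'EOF'
#!/bin/sh
# Log every mdadm invocation (caller PID and arguments) before delegating
# to the real binary, assumed renamed to /sbin/mdadm.real.
logger -t mdadm-wrap "ppid=$PPID args: $*"
exec /sbin/mdadm.real "$@"
EOF
chmod +x "$wrapper"
```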

The problem appears to be a race between udev and boot.md. The -IRs and -A -s... are run by boot.md, the others by udev.

This gives a degraded array:

+ : mdadm --detail --export /dev/md0
+ : mdadm -IRs
+ : mdadm --examine --export /dev/sdb5
+ : mdadm --examine --export /dev/sda5
+ : mdadm --examine --export /dev/sda3
+ : mdadm --examine --export /dev/sdb3
+ : mdadm -A -s -c /etc/mdadm.conf
+ : mdadm --incremental /dev/sdb3
+ : mdadm --incremental /dev/sdb5
+ : mdadm --incremental /dev/sda5
+ : mdadm --incremental /dev/sda3

This works fine:

+ : mdadm --detail --export /dev/md0
+ : mdadm -IRs
+ : mdadm -A -s -c /etc/mdadm.conf
+ : mdadm --examine --export /dev/sdb3
+ : mdadm --examine --export /dev/sda3
+ : mdadm --examine --export /dev/sdb5
+ : mdadm --examine --export /dev/sda5
+ : mdadm --incremental /dev/sdb3
+ : mdadm --incremental /dev/sdb5
+ : mdadm --incremental /dev/sda3
+ : mdadm --incremental /dev/sda5

So udev stolidly examines all partitions and then runs incremental. Meanwhile boot.md has already run incremental and at some point assembles the arrays.

What I don't understand is why both of them are doing the same job. How much do they interfere with each other? And where/when?
Comment 47 Volker Kuhlmann 2013-04-12 05:19:49 UTC
Created attachment 534870 [details]
reboot 1 degraded

process hierarchy and start time stamps
Comment 48 Volker Kuhlmann 2013-04-12 05:22:03 UTC
Created attachment 534871 [details]
reboot 2 ok

Remember that I am running boot.md-check and boot.md-retry after boot.md so I get a reliable system (except when it sinkholes itself into emergency mode because md1 is running with 0/2).

My boot.md still should-starts udev-trigger because it's 12.1.
Comment 49 Benjamin Brunner 2013-04-17 14:30:42 UTC
Neil, could you create a maintenance request with the fix for mdadm, please?
Comment 50 Neil Brown 2013-04-29 06:05:46 UTC
Hi Volker.

Your latest details seem to just confirm that udev is doing things when I believe that it shouldn't.  The call to "udevadm settle" earlier in boot.md should wait for all those udev events to happen, and it very much seems like it does.
Only they still come, which seems to suggest that something else is triggering new 'add' events, though I cannot imagine what.

Could you try something new for me?
In boot.md, add the line

 /sbin/udevadm control --stop-exec-queue
before 
                $mdadm_BIN -IRs
                $mdadm_BIN -A -s -c $mdadm_CONFIG
                # a status of 2 is not an error
                test $? -eq 0 -o $? -eq 2
                rc_status -v

and add
  /sbin/udevadm control --start-exec-queue

just after that section.

This should ensure that udev and boot.md don't race.  It might slow boot by 5 seconds but that might be better than races.
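For clarity, the resulting section of boot.md would then read roughly as follows (a sketch only: the middle lines are quoted from the stock init script above, and the exact indentation may differ):

```shell
/sbin/udevadm control --stop-exec-queue   # hold udev's event queue
$mdadm_BIN -IRs
$mdadm_BIN -A -s -c $mdadm_CONFIG
# a status of 2 is not an error
test $? -eq 0 -o $? -eq 2
rc_status -v
/sbin/udevadm control --start-exec-queue  # let udev resume processing
```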

Thanks.
Comment 51 Volker Kuhlmann 2013-05-03 06:56:09 UTC
Done. No improvement. Tried twice; both times the md1 array came up degraded, so I decided it's not the fix we're after.

first reboot with the two added lines from comment #50
Several slow points during boot.
Severe hang during boot at boot.md-retry (which possibly has some recursion),
with system entering emergency mode after something like a 1min timeout.
Array md1 was degraded. Manually re-added 2nd drive, then systemctl default
Disabled services boot.md-check, boot.md-retry

second reboot
Log files included.
Array md1 was degraded. Manually re-added 2nd drive.
Otherwise boot sequence as always.
mdadm commands issued (in this order):
(Some may be from interactive shells checking for degraded arrays.)
mdadm --detail /dev/md1
mdadm --detail /dev/md0
mdadm --detail --export /dev/md0
mdadm --examine --export /dev/sdb5
mdadm --examine --export /dev/sda5
mdadm --examine --export /dev/sdb3
mdadm --incremental /dev/sdb3
mdadm --examine --export /dev/sda3
mdadm --incremental /dev/sdb5
mdadm -IRs
mdadm -A -s -c /etc/mdadm.conf
mdadm --incremental /dev/sda5
mdadm --incremental /dev/sda3
mdadm --detail --export /dev/md1
mdadm --detail --export /dev/md1
mdadm /dev/md1 -a /dev/sda5
mdadm --detail --export /dev/md1
mdadm --detail /dev/md1
mdadm --detail /dev/md0
Comment 52 Volker Kuhlmann 2013-05-03 06:58:00 UTC
Created attachment 537803 [details]
Log files of reboot 2 (comment #51)
Comment 53 Christian Boltz 2013-05-11 18:32:14 UTC
*** Bug 818501 has been marked as a duplicate of this bug. ***
Comment 54 Neil Brown 2013-05-14 04:40:19 UTC
Thanks for all the logs.

It seems that the events for all the sda devices are starting *after* the "udevadm settle" call.  The dependency on udev-trigger should prevent that.

Could you try something else?
In boot.md, before the "udevadm settle" call, call

  udevadm control --log-priority=debug
  udevadm trigger --sysname-match='sd[ab]*'

That should trigger just the devices we want, and not waste time triggering everything else.

It should also put lots of 'udevd' messages in the /var/log/messages file, which might be useful.
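To make the placement concrete, the top of that section of boot.md would then read roughly like this (a sketch; the settle call already exists in the script, and the two lines above it are the additions):

```shell
/sbin/udevadm control --log-priority=debug       # verbose udevd logging
/sbin/udevadm trigger --sysname-match='sd[ab]*'  # re-trigger only the sda/sdb devices
/sbin/udevadm settle                             # existing wait in boot.md
```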

Thanks.
Comment 55 Volker Kuhlmann 2013-06-03 16:51:11 UTC
I added the two lines you suggested, and rebooted 5 times without issues, whereas two days ago without this change, the md1 array started degraded twice.
Do I need to try until it fails again, or does this give you the info you were after?
Kernel 3.1.10-1.19-desktop oS 12.1
Comment 56 Volker Kuhlmann 2013-06-03 16:55:19 UTC
Created attachment 542405 [details]
Log files for comment #55

README-3 with details about the log files, boot.md script used with diff from original, syslog, mdadm traces.
/var/log/messages had all hylafax-related lines removed (they're long, and some of it is private) but is otherwise complete.
Comment 57 Neil Brown 2013-06-10 23:37:54 UTC
Thanks.

This seems to confirm that the

# Should-Start: udev-trigger

isn't doing what it should; however, I pored over the source code and cannot find any reason why.

I suggest you just leave the 
  udevadm trigger --sysname-match='sd[ab]*'

in boot.md and leave it at that.

As much of this code is very different in 12.3 and future releases, there doesn't seem to be much point pushing on to a root-cause analysis.  If we can make it work for you, I think that will have to be enough for now.

So I'm going to close this as "works for me", and recommend you keep the 'udevadm trigger' line to ensure that it works for you.

Thanks for your persistence and testing.
Comment 58 Christian Boltz 2013-06-11 05:03:29 UTC
Please allow me to disagree ;-)

This happened to me on several 12.3 machines (with very different hardware), so I'd guess it's quite likely to hit more people. In other words: please create a maintenance update for 12.3.

The package from comment #16 fixes the problem for me - maybe you can just use this package (+ the initscript changes that helped Volker) for the 12.3 update.

BTW: Is this fixed in factory? I still have a lock on mdadm, but I don't want to keep it forever ;-)
Comment 59 Neil Brown 2013-06-11 05:39:03 UTC
Polite disagreement is always welcome :-)

I really thought I had arranged for an update for 12.3, and that it was just 12.2 that was unresolved.  But now that I look, I cannot see any 12.3 update.  Strange.

I'll try to get that sorted out, and Factory too.

Thanks for the poke.
Comment 60 Neil Brown 2013-06-12 22:34:50 UTC
I've submitted update requests for Factory and 12.3.
Comment 61 Forgotten User 5bOUleMVRM 2013-06-16 07:44:11 UTC
I just want to let you know that 5 weeks ago I applied the changes from comment 30 and since then had no further problems with partly assembled arrays in OS 12.3.
Comment 62 Benjamin Brunner 2013-06-24 06:37:00 UTC
Update released for openSUSE 12.3. Resolved fixed.
Comment 63 Swamp Workflow Management 2013-06-24 07:04:23 UTC
openSUSE-RU-2013:1071-1: An update that has two recommended fixes can now be installed.

Category: recommended (important)
Bug References: 793954,821366
CVE References: 
Sources used:
openSUSE 12.3 (src):    mdadm-3.2.6-3.4.1