Bugzilla – Full Text Bug Listing
| Summary: | openQA: yast fails to install on ext4 system (parted 3.1: no udev settle) | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Dominique Leuenberger <dimstar> |
| Component: | Basesystem | Assignee: | Sebastian Parschauer <sebastian.parschauer> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | CC: | aschnell, chcao, crrodriguez, dimstar, hare, linuxandlanguages, meissner, mvidner, puzel, sbrabec, systemd-maintainers, thomas.blume, tiwai |
| Version: | 201502* | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Other | | |
| See Also: | https://bugzilla.suse.com/show_bug.cgi?id=989751 | | |
| Whiteboard: | | | |
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Y2Los extracted from openQA | | |
Description
Dominique Leuenberger
2015-03-03 10:38:43 UTC
Created attachment 625179
Y2Los extracted from openQA
For reference: a retry of the test works, so it is not an always-appearing / reproducible test failure.

The YaST log shows that YaST creates partition vda2, but it does not show up in /proc/partitions.

This is latest systemd; does this happen with systemd-210 from Base:System:Legacy?

Assuming that parted ran into an EBUSY error on calling BLKRRPART, it is not reporting it in the y2log.

(In reply to Werner Fink from comment #4)
> This is latest systemd, does this happen with systemd-210 from
> Base:System:Legacy?

No idea; systemd was submitted to Factory and integrated. If systemd is the component to be blamed for not being ready, it has to be replaced. The diff between the 0227 and 0301 snapshots is long: https://openqa.opensuse.org/factory-package-news/diff/20150301. Feel free to download the temp. iso to reproduce as much as needed (but as stated in comment #2: a 2nd run of the same test did not show the issue anymore; thus: how do you prove the absence of a bug?)

(In reply to Dominique Leuenberger from comment #6)

I'm not willing to handle bugs caused by bleeding-edge upstream systemd, simply because I do not have the time for this. Besides this: if, as Jan stated, parted ran into an EBUSY error on calling BLKRRPART, this bug does not belong to systemd. As already mentioned elsewhere, systemd-210 does *not* use the BLKRRPART ioctl, but the new systemd/udev *does*. That is, parted has to be fixed to handle the flock() which is done by the new upstream udev. Btw: the flock() combined with the BLKRRPART ioctl in upstream udevd was one of my reasons not to switch to systemd greater than systemd-210. Also, the upstream patches regarding BLKRRPART had been disabled in Base:System:Legacy/systemd (compare with the rpm spec build condition blkrrpart). IMHO parted has to be fixed to be able to do its work even if udevd uses a flock() as well as the BLKRRPART ioctl.

A possible fix for parted could be:
a) cause udevd to stop handling the event queue:

   udevadm settle --timeout=5
   udevadm control --stop-exec-queue

b) do a

   flock(fd, LOCK_EX);

   on the fd of the device node

c) now the

   ioctl(fd, BLKRRPART, 0);

d) after closing the fd of the device node:

   udevadm control --start-exec-queue

Without any warranty, but IMHO the chain settle / stop queue / flock+ioctl / start queue avoids udevd ignoring any event on the device node.
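A minimal sketch of the proposed chain, expressed in Python for illustration. The function name, the injectable `run` callback, and the `do_ioctl` switch are inventions for this sketch, not parted code; the BLKRRPART value is assumed to be _IO(0x12, 95) = 0x125F as defined in linux/fs.h.

```python
import fcntl
import os
import subprocess

# Assumption: BLKRRPART = _IO(0x12, 95) = 0x125F, as defined in linux/fs.h.
BLKRRPART = 0x125F

def reread_partitions(dev_path, run=subprocess.check_call, do_ioctl=True):
    """Illustrative chain: settle / stop queue / flock + ioctl / start queue.

    `run` executes the udevadm commands and is injectable so the sequence
    can be exercised without a live udevd; `do_ioctl=False` skips BLKRRPART,
    which only succeeds on a real block device.
    """
    run(["udevadm", "settle", "--timeout=5"])          # a) drain pending events
    run(["udevadm", "control", "--stop-exec-queue"])   # a) pause udevd's queue
    fd = os.open(dev_path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)                 # b) take the lock udevd also takes
        if do_ioctl:
            fcntl.ioctl(fd, BLKRRPART, 0)              # c) ask the kernel to re-read
    finally:
        os.close(fd)                                   # d) closing releases the flock
        run(["udevadm", "control", "--start-exec-queue"])  # restart the queue
```

As noted later in the thread, the `--start-exec-queue` step would also belong in an exit/SIGINT handler, so an interrupted run does not leave udevd's queue stopped.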
Is it necessary to lock the device? libfdisk does not appear to do that, in comparison.

(In reply to Jan Engelhardt from comment #10)

Hmmm ... good question ... the only risk which could occur is that udevd (even with settle/stop queue) might still do a flock() and an ioctl(BLKRRPART) ... don't know what happens then. Without (f)locking, yast could also settle and stop the udev queue event handling *before* executing parted ;) But the built-in solution has the advantage that even on the normal command line parted would work flawlessly. The udevadm control --start-exec-queue could also be registered in an exit handler as well as in a SIGINT handler, to be sure that the event queue of udevd will be started again if parted is interrupted by Ctrl-C. This holds true for all programs/libraries doing the ioctl BLKRRPART.

With all this interesting discussion: maybe we should just accept that we still have parted 3.1 in the repos, and parted 3.2 contains this in the changelog:
> libparted: On multipath systems new partitions would sometimes not
> appear, reporting 'device-mapper: create ioctl failed: Device or
> resource busy' until the system was rebooted. Added dm_udev_wait
> calls to synchronize parted with udev.
Sounds very much like our issue here.
(In reply to Dominique Leuenberger from comment #13)
> libparted: On multipath systems new partitions would sometimes not
> appear, reporting 'device-mapper: create ioctl failed: Device or
> resource busy' until the system was rebooted. Added dm_udev_wait
> calls to synchronize parted with udev.

The respective upstream commit is http://git.savannah.gnu.org/cgit/parted.git/commit/?id=60fe959e1446337c3455656daad2c2b7114a1dcd

udevd reacts to calls of close(2) for write-opened device files. This seems reasonable. I can imagine there might be an issue if the partitioning utility calls BLKRRPART only after it has closed the fd (or a copy thereof).

Whee ... and now we've come full circle.

udevd calls BLKRRPART on close() of a write-opened device precisely to _detect_ any changes to the partition table. And this of course races with the BLKRRPART from parted.

-> strip off the BLKRRPART from parted and everything should be dandy ...

(In reply to Hannes Reinecke from comment #16)
> udevd calls BLKRRPART on close() a write-opened device precisely to _detect_
> any changes to the partition table. And this of course races with the
> BLKRRPART from parted.

Holding the udev queue, or using the DM cookie method with dm_udev_wait(), may also work ... or is this a naive fallacy?
So, why does parted close the device? fdisk does not do that:
open("/dev/sdb", O_RDWR|O_CLOEXEC) = 3
ioctl(3, BLKGETSIZE64, 8589934592) = 0
[...]
Device does not contain a recognized partition table.
open("/dev/urandom", O_RDONLY|O_CLOEXEC) = 4
close(4) = 0
Created a new DOS disklabel with disk identifier 0xd1ec9bdd.
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
ioctl(3, BLKRRPART, 0x7f3c9adb7658) = 0
close(3) = 0
Syncing disks.
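The fdisk sequence visible in the strace (one write-opened fd kept for the whole session, BLKRRPART issued on that fd, and only then close) can be mirrored in a short sketch. This is a hypothetical illustration, not fdisk's source; the function name and `do_ioctl` switch are inventions, and BLKRRPART is assumed to be _IO(0x12, 95) = 0x125F from linux/fs.h.

```python
import fcntl
import os

# Assumption: BLKRRPART = _IO(0x12, 95) = 0x125F, as defined in linux/fs.h.
BLKRRPART = 0x125F

def write_label_fdisk_style(dev_path, label_bytes, do_ioctl=True):
    """Mirror the strace above: write the new table through a single
    long-lived fd, issue BLKRRPART on that same fd, and only then close it.
    `do_ioctl=False` allows running against a plain file for testing."""
    fd = os.open(dev_path, os.O_RDWR)
    try:
        os.write(fd, label_bytes)          # "The partition table has been altered."
        os.fsync(fd)                       # "Syncing disks."
        if do_ioctl:
            fcntl.ioctl(fd, BLKRRPART, 0)  # "Calling ioctl() to re-read partition table."
    finally:
        os.close(fd)                       # udevd reacts to this close of a write-open
```

The design point under discussion: the re-read happens while the writer still holds the fd, rather than after a close that udevd itself reacts to.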
(In reply to Hannes Reinecke from comment #16)
> Whee ... an now we've come full circle.
>
> udevd calls BLKRRPART on close() a write-opened device precisely to _detect_
> any changes to the partition table. And this of course races with the
> BLKRRPART from parted.
>
> -> strip off the BLKRRPART from parted and everything should be dandy ...

Hm, except that parted calls BLKRRPART _only_ for DASDs, which is not the case here. Something else must be going on. I'll try to find time to look more into this. If someone can get an strace of the failing parted, that would certainly speed things up.

(In reply to Petr Uzel from comment #20)
> I'll try to find time to look more into this. If someone can get strace of
> the failing parted, that would certainly speed things up.

Bah, I had a closer look, and indeed there's no failure, just that kernel/systemd or whatever did not 'notice' /dev/vda2. If parted failed to inform the kernel (via BLKPG* ioctls()), it would report it and we would see it in the yast logs.

(In reply to Petr Uzel from comment #21)
> Bah, I had closer look and indeed there's no failure, just that
> kernel/systemd or whatever did not 'notice' /dev/vda2.

systemd is not running in the first installation stage, so it cannot be the culprit. However, dmesg shows that vda appears only after 225 seconds:

[ 224.869586] ISO 9660 Extensions: RRIP_1991A
[ 225.730031] vda: unknown partition table
[ 225.759743] vda: unknown partition table
[ 225.776050] vda:
[ 225.816347] vda:
[ 226.060602] vda: vda1
[ 226.084244] vda: vda1
[ 226.167916] vda: vda1
[ 226.272515] vda: vda1 vda2

For a VM this is a very long time; is your virtualization host heavily loaded?
If the device appears so late, I would suspect that yast gets a timeout during device discovery.

*** Bug 928549 has been marked as a duplicate of this bug. ***

> However, dmesg shows that vda appears only after 225 seconds
That is because YaST zeroes the table shortly before that (look for wipefs in y2log).
(In reply to Martin Vidner from comment #24)
> > However, dmesg shows that vda appears only after 225 seconds
>
> That is because YaST zeroes the table shortly before that (look for wipefs
> in y2log).

Ah, I see, vda appears earlier, but further detection is blocked by the cdrom discovery:

-->--
[ 90.651580] vda: unknown partition table
[ 91.592901] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 91.593584] ISO 9660 Extensions: RRIP_1991A
[ 91.634521] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 91.635304] ISO 9660 Extensions: RRIP_1991A
[ 91.669206] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 91.669913] ISO 9660 Extensions: RRIP_1991A
[ 92.779850] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 92.781642] ISO 9660 Extensions: RRIP_1991A
[ 92.929940] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 92.930721] ISO 9660 Extensions: RRIP_1991A
[ 93.768578] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 93.769318] ISO 9660 Extensions: RRIP_1991A
[ 93.986662] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 93.987402] ISO 9660 Extensions: RRIP_1991A
[ 224.868468] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 224.869586] ISO 9660 Extensions: RRIP_1991A
[ 225.730031] vda: unknown partition table
[ 225.759743] vda: unknown partition table
[ 225.776050] vda:
[ 225.816347] vda:
[ 226.060602] vda: vda1
[ 226.084244] vda: vda1
[ 226.167916] vda: vda1
[ 226.272515] vda: vda1 vda2
--<--

And yast shows some errors:

-->--
2015-03-03 04:01:27 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaHandler.cc(getDirInfo):1323 THROW: Path '/var/adm/mount/AP_0xGovM44/suse/setup/slide/txt/en_US' on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003' is not a directory.
2015-03-03 04:01:27 <0> (none)(2559) [Pkg] PkgFunctions.cc(ExceptionAsString):241 Error message: Path '/var/adm/mount/AP_0xGovM44/suse/setup/slide/txt/en_US' on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003' is not a directory.
2015-03-03 04:01:27 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaHandler.cc(getFile):1228 THROW: File '/suse/setup/slide/txt/en/directory.yast' not found on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003'
2015-03-03 04:01:28 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaHandler.cc(getFile):1228 THROW: File '/suse/setup/slide/pic/directory.yast' not found on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003'
2015-03-03 04:01:28 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaHandler.cc(getFile):1228 THROW: File '/docu/directory.yast' not found on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003'
2015-03-03 04:01:28 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaHandler.cc(getFile):1228 THROW: File '/docu/RELEASE-NOTES.en_US.rtf' not found on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003'
2015-03-03 04:01:28 <5> (none)(2559) [zypp] Exception.cc(log):137 MediaSetAccess.cc(provide):216 CAUGHT: File '/docu/RELEASE-NOTES.en_US.rtf' not found on medium 'cd:///?devices=/dev/disk/by-id/ata-QEMU_DVD-ROM_QM00003'
--<--

Seems your cdrom installation source has a problem.

I would like to confirm this bug, as I ran into the same issue with the latest "Tumbleweed" i586 ISO image which I copied to a USB stick using dd. The installation hangs at "Preparing the disks." I tried installing on an Asus eeePC which has only 4 GB of disk space.

Have you guys all checked your installation media - esp. the checksums after downloading?

Is this still a current issue with latest Tumbleweed? Who can still reproduce this? Or are we good to close this now?

Do you know that there was an issue with disk caches not being flushed when parted 3.1 opened a device (bsc#989751)? The patches for bsc#989751 were synced to Factory on 2016-08-09. The problem was that obsolete data was written back to disk by parted. I think this is exactly what happened here as well.
Upstream commit:
> http://git.savannah.gnu.org/cgit/parted.git/commit/?id=fb99ba5ebd0dc34204fc9f1014131d5d494805bc
I've checked the logs again. This does indeed seem to be a udev settle problem, addressed by:
> http://git.savannah.gnu.org/cgit/parted.git/commit/?id=60fe959e1446337c3455656daad2c2b7114a1dcd
Factory already contains parted 3.2, so this should be fixed for the latest Tumbleweed. It can't be fixed for existing installation media without an updated parted from a DUD. So closing as fixed for now.