Bugzilla – Full Text Bug Listing
| Summary: | lvremove fails sometimes | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE 12.1 | Reporter: | Arvin Schnell <aschnell> |
| Component: | Basesystem | Assignee: | E-mail List <bnc-team-screening> |
| Status: | RESOLVED NORESPONSE | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | | |
| Priority: | P5 - None | CC: | aschnell, bruce.edge, chcao, dmacvicar, forgotten_OfsFetVrzR, novell.admin, R.Vickers, sbhasker, stefan.schaefer, sweet_f_a |
| Version: | Final | | |
| Target Milestone: | Final | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Found By: | Development | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | script to trigger problem, lvm2.patch | | |
Description
Arvin Schnell
2010-09-28 13:50:58 UTC
Created attachment 391914 [details]
script to trigger problem
Seems to be caused by the monitor rule in udev.

Any progress here?

The monitoring rule generates additional udev events, which cannot be synchronized with the current device-mapper implementation yet. We may have to forward-port the old "sleep-try" hack to 11.4.

Any progress here?

I submitted the sleep-try patch to openSUSE:Factory. Fixed, closing.

*** Bug 668021 has been marked as a duplicate of this bug. ***

The problem with removing existing LVM logical volumes still persists in Milestone 6. When trying to reproduce bnc#668021 I created an LVM VG named "system" on the first disk, with two logical volumes "swap" and "root"; "root" was formatted with ext3. During an installation that needs to remove the LVM VG, it successfully removed LV "swap" but then failed to remove LV "root". The executed command was:

```shell
/sbin/lvremove -f -A n '/dev/system/root'
```

Its output was:

```
Unable to deactivate open system-root (253:1)
Unable to deactivate logical volume "root"
```

I tried to install RC1 today and the installer was unable to delete Windows partitions. It gave me some error message and I had to abort the installation entirely. Later I deleted all the partitions manually with fdisk, but the installer was then even unable to create new partitions. However, after creating the partitions manually, YaST was finally able to format and mount them and install RC1 onto those partitions.

Not sure if this is really the same. According to my investigations the problems are mostly caused by udev triggering the start of all sorts of processes on certain events. Those processes try to read from the device, so that removal, formatting, or partitioning (anything that needs exclusive access) is denied by the kernel with something like "device or resource busy". It would be interesting to know the exact error messages and to have the yast2 logs of a failed install.

Thomas: I retested with the script from comment #1 on the latest openSUSE 11.4, and the lvremove failure never happened. Could you please post the yast2 logs?
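The "Unable to deactivate open system-root (253:1)" error means the device-mapper node still has a non-zero open count. A small helper along these lines can wait for the holders to go away before removal. This is a sketch of ours, not part of the bug's attachments; it assumes `dmsetup info -c --noheadings -o open` prints only the open count:

```shell
#!/bin/bash
# Sketch: poll the device-mapper open count and only proceed once it is 0,
# i.e. once no udev helper (or anything else) holds the device open.
wait_open_count_zero() {
    local dmname=$1 tries=${2:-10} i count
    for ((i = 0; i < tries; i++)); do
        # "open" is one of the dmsetup info column names; strip padding.
        count=$(dmsetup info -c --noheadings -o open "$dmname" 2>/dev/null | tr -d ' ')
        [ "$count" = "0" ] && return 0   # nothing holds the device open
        sleep 1
    done
    return 1    # device still open after all tries
}
```

Usage would be something like `wait_open_count_zero system-root && lvremove -f /dev/system/root`.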
On my 11.3 system with updated lvm and device-mapper RPMs, the problem can be reproduced on the command line. I do not have the y2logs any more, and I also do not have an 11.4 system running right now, but there is no reason to assume the same problem would not also happen on 11.4 under the right conditions. Here is a cut-and-pasted dump of a command sequence I just typed:

```shell
Lagrange:~ # lvcreate -L 5G -n test2 system
  Logical volume "test2" created
Lagrange:~ # lvremove /dev/system/test2
  Can't remove open logical volume "test2"
Lagrange:~ #
```

The device-mapper package on my system contains your fix:

* Fri Jan 14 2011 lnussel@suse.de — the lsmod|grep in boot.device-mapper is superfluous, modprobe does that internally anyways
* Wed Dec 15 2010 cmeng@novell.com — sleep and try to wait for open_count to be 0 while lvremove. (Bug #642296)

The problem seems to be a race: it does not happen always, but in quite a high percentage of lvremove commands (and also when doing "dmsetup remove ..."). The machine is a quad-core with Hyper-Threading, so it has 8 logical cores. I got the quad-core last December; before that I had a dual-core machine where the problem was not present (at least not regularly).

When one looks at the processes udev starts (with "udevadm monitor"), it is quite clear that any operation needing exclusive access to a device may fail, because udev is starting helper processes that open the modified device in parallel. In my opinion, the only way to fix this reliably is to have a way to stop udev from processing any events while a critical sequence of block device manipulations is done, and to let udev resume event processing after the block device manipulation is finished.

If you need access to the machine via ssh, I can add your SSH key to /root/.ssh/authorized_keys.

Just added your public key to /root/.ssh/authorized_keys.

(In reply to comment #12)
> Lagrange:~ # lvremove /dev/system/test2
> Can't remove open logical volume "test2"
> Lagrange:~ #
>
> [..]
> The problem seems to be a race, it does not happen always but in quite a high percentage

Some time ago I had the feeling that it depends on how you specify the LogicalVolumePath argument. AFAIR I never got this race when I was using the LogicalVolumePath "system/test2" instead of the path to the device, "/dev/system/test2". Maybe there is some kind of "get lvm path from the given device" lookup which causes the device to still be in use when trying to remove it. But as I said, this is just a feeling; I don't know whether it makes sense at all.

I can reproducibly make the problem go away when I disable udev. So I assume it is one of the many scripts udev calls on certain kernel events that keeps the volume busy temporarily.

The problem still exists in 12.1 Beta 1.

The problem still exists in 12.1 final!

Any progress here?

I just upgraded a SLES 11 system from SP1 to SP2 and saw the same problem. During our backup run, 4 out of 9 snapshot volumes could not be removed despite successful unmounting.

(In reply to comment #20)

Further comment about SLES11 SP2: I added a `sync` and a `sleep 5` to the backup script (before the lvremove) and last night all the backups completed normally. Could be coincidence of course, but it looks promising as a workaround. Bob

Can you replace 'sleep 5' with 'udevadm settle'? That waits for all udev events to finish before the lvremove. I guess this would be another workaround for this too.

The 'udevadm settle' workaround worked for a while, but a couple of days ago I got failures again, so it is not guaranteed. I am currently including a sync, a sleep, and a settle in my backup job. Bob

We are seeing the same problem in SLES11 SP2. When can we get a permanent fix for this? We cannot use the workaround given here.

bnc#766187 mentioned the same problem; it was already fixed for SLES11-SP2. The patch is here: https://bugzilla.novell.com/attachment.cgi?id=495328. And I would make udev_sync the default in lvm, which is more reasonable, so users don't have to change lvm.conf.
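The sync / sleep / `udevadm settle` workarounds traded back and forth above can be folded into a single retry wrapper. A sketch (the function name, retry count, and delay are ours, not from the bug):

```shell
#!/bin/bash
# Sketch of the workaround discussed above: flush buffers, wait for the udev
# event queue to drain, then retry lvremove a few times before giving up.
safe_lvremove() {
    local lv=$1 maxtries=${2:-5} try
    for ((try = 1; try <= maxtries; try++)); do
        sync                # flush dirty buffers first
        udevadm settle      # wait until udev has processed all queued events
        if lvremove -f "$lv"; then
            return 0        # removal succeeded
        fi
        sleep 2             # give udev helpers time to close the device
    done
    return 1                # still failing after all attempts
}
```

Usage would be something like `safe_lvremove /dev/system/test2`. As the comments above note, even settle plus sleep is not guaranteed, which is why the wrapper retries.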
But the workaround is still needed: put 'udevadm settle' before the lvremove.

@Bob @Bhasker, if you're using SLE11-SP2, please update to the latest lvm, add udev_sync = 1 to the activation section of lvm.conf, and try again.

Mon Jun 18 05:39:51 UTC 2012 - dmzhang@suse.com
- bnc#766187, lvremove fails on deleting snapshot volume
- bnc#750496, boot stops because of a missing resume partition; add udevadm settle before activation
- bnc#770200, online update to SP2 with kernel 3.0.34-0.7-default causes LVM to report a memory leak; run mkinitrd on update

If you're using openSUSE 12.1, please try the patch above.

Any updates?

We did some initial testing in which we executed lvcreate and lvremove in iterations after setting udev_sync=1. We executed 20 iterations and encountered the error twice, with the same error message "Can't remove open logical volume", so it seems the issue is not completely addressed.

Created attachment 503200 [details]
lvm2.patch
1. Update to the latest lvm2, or apply the attached lvm2.patch.
2. Change to udev_sync=1.
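For reference, the udev_sync switch from step 2 lives in the activation section of /etc/lvm/lvm.conf; a minimal fragment might look like this (comment ours):

```
activation {
    # Make LVM wait for udev to finish processing its events before
    # commands such as lvremove return (workaround discussed in this bug).
    udev_sync = 1
}
```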
Try this script:

```shell
while true
do
    # snapshot of /dev/vg1/lv; the -s flag is our addition, since lvcreate
    # needs it when the positional argument is an existing LV
    lvcreate -s -l10%VG -nbak-lv /dev/vg1/lv
    lsof -t /dev/vg1/bak-lv | while read app
    do
        kill -9 "$app"
    done
    udevadm settle
    lvremove -f /dev/vg1/bak-lv
done
```
Can you try that? I guess this workaround could work.

Reason:
1. *lsof* ensures no other application is using the LV.
2. *udevadm settle* prevents the race condition.
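The two steps in that list can be wrapped into a single helper; a sketch (the function name is ours, and `kill -9` on whatever lsof reports is as drastic here as in the loop above):

```shell
#!/bin/bash
# Sketch: kill anything still holding the LV open, wait for udev to settle,
# then remove the volume. Mirrors the loop suggested in the comment above;
# SIGKILLing arbitrary holders is not a general recommendation.
force_lvremove() {
    local lv=$1 pid
    # Step 1: ensure no application is using the LV (lsof -t prints PIDs only).
    for pid in $(lsof -t "$lv" 2>/dev/null); do
        kill -9 "$pid"
    done
    # Step 2: drain the udev event queue to dodge the race condition.
    udevadm settle
    lvremove -f "$lv"
}
```

Usage would be something like `force_lvremove /dev/vg1/bak-lv`.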
Thanks for the update. But we are seeing additional issues, where a lot of stale entries are left behind. Our product just does this: create a LUN, discover it, create a partition, then pvcreate, vgcreate, create an lvol, and do mkfs. After we do this multiple times we see a lot of stale entries. Here is output from the "multipath -ll" command:

```
lnx197-170:/RRT # multipath -ll
360a980003246687a6824424353573859 dm-88
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357364f dm-61
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573959 dm-97 NETAPP,LUN
size=800M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:0 sdb 8:16 active ready running
  `- 4:0:0:0 sdc 8:32 active ready running
[... roughly 90 further entries in the same pattern trimmed here; only four
devices (dm-97, dm-93, dm-100, dm-89, all NETAPP,LUN) still had active
paths, the rest are stale single-line entries ...]
```

Here is another script which has worked well for me on SLES11 SP2. The combination of sleeps etc. is based on trial and error, not science!

```shell
#! /bin/bash
# Delete an LVM logical volume (typically used for a snapshot that won't go
# away). See
# https://bugzilla.novell.com/show_bug.cgi?id=642296

set -o nounset
export PATH=/sbin:/usr/sbin:/bin:/usr/bin

usage="
 deleteLV [-n maxTries] [-t timetowait] LVdevice...
   -n  max number of tries (default 10)
   -t  time to wait after sync (default 10 seconds)"

#
# Parse the command options
declare -i maxtries=10 delay=10
ERRFLAG=OK
OPTIND=1
while getopts n:t: OPTION
do
    case $OPTION in
        n) maxtries=$OPTARG;;
        t) delay=$OPTARG;;
        \?) ERRFLAG=BAD;;
    esac
done
shift `expr $OPTIND - 1`

#
# Exit with a usage message if the syntax is wrong.
#
if [ $ERRFLAG != OK -o $# -eq 0 ]
then
    echo >&2 "$0: usage:"
    echo >&2 "$usage"
    exit 1
fi

declare -i ntries
for lv in "$@"
do
    if lvs $lv >/dev/null
    then
        # We have a valid LVM logical volume, so repeatedly try to delete it.
        ntries=0
        while (( $ntries < $maxtries ))
        do
            ntries=$(( $ntries + 1 ))
            echo >&2 "$0: Removing $lv attempt $ntries"
            udevadm settle
            sync
            sleep $delay
            lvremove -f $lv && break
        done
        if lvs $lv >/dev/null 2>/dev/null
        then
            echo >&2 "$0: FAILED to remove $lv after $ntries attempts"
        else
            echo >&2 "$0: Successfully removed $lv after $ntries attempt(s)"
        fi
    else
        echo >&2 "$0: $lv is not an LVM logical volume"
    fi
done
```

(In reply to comment #31)
> Here is another script which has worked well for me on SLES11SP2. The
> combination of sleeps etc is based on trial and error, not science!

I think this script could fix the lvremove problem.

@Bhasker,
> Thanks for the update. But we are seeing additional issues, where a lot of stale entries are left behind. Our product just does this: create a LUN, discover it ...

Is that another bug?

This bug is still present in openSUSE 12.2, and at least for me it now does not only fail sometimes but fails always (trying to lvremove snapshots while the origin is mounted).

FYI: https://bugzilla.redhat.com/show_bug.cgi?id=577798

Hi, this product no longer gets updates. If this issue still exists and you can reproduce it on a newer openSUSE version, preferably 13.2, please open a new bug and provide a reference to this one. Thank you :)