Bug 642296

Summary: lvremove fails sometimes
Product: [openSUSE] openSUSE 12.1 Reporter: Arvin Schnell <aschnell>
Component: BasesystemAssignee: E-mail List <bnc-team-screening>
Status: RESOLVED NORESPONSE QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: aschnell, bruce.edge, chcao, dmacvicar, forgotten_OfsFetVrzR, novell.admin, R.Vickers, sbhasker, stefan.schaefer, sweet_f_a
Version: Final   
Target Milestone: Final   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: script to trigger problem
lvm2.patch

Description Arvin Schnell 2010-09-28 13:50:58 UTC
Sometimes lvremove fails with the error message that the logical
volume is open. But the logical volume is not open before the call
to lvremove (checked with dmsetup info).

Maybe lvcreate triggers udev events that open the device.
Comment 1 Arvin Schnell 2010-09-28 13:51:43 UTC
Created attachment 391914 [details]
script to trigger problem
Comment 2 Xin Wei Hu 2010-10-22 03:44:45 UTC
Seems to be caused by the monitor rule in udev.
Comment 3 Arvin Schnell 2010-11-05 14:07:35 UTC
Any progress here?
Comment 4 Xin Wei Hu 2010-11-06 05:07:12 UTC
Monitoring rule generate additional udev events, which cannot be synchronized with current device-mapper implementation yet.
We may have to forward-port old "sleep-try" hack to 11.4.
Comment 5 Arvin Schnell 2010-12-08 16:34:07 UTC
Any progress here?
Comment 6 Cong Meng 2010-12-15 08:34:59 UTC
I submitted the sleep-try patch to openSUSE:Factory. fixed. close.
Comment 7 Thomas Fehr 2011-01-31 12:57:07 UTC
*** Bug 668021 has been marked as a duplicate of this bug. ***
Comment 8 Thomas Fehr 2011-01-31 13:04:47 UTC
Problem with removing existing lvm logical volumes still persists in Milestone 6.
When trying to reproduce bnc#668021 I created a LVM VG named "system" on first disk 
with two logical volumes "swap" and "root". "root" was formatted with ext3.

When trying  a installation that needs to removce the LVM VG it succesfully removed
LV "swap" bug failed removing LV "root" afterwards.

Executed command was: /sbin/lvremove -f -A n  '/dev/system/root'
Output of this command was:
Unable to deactivate open system-root (253:1)
Unable to deactivate logical volume "root"
Comment 9 Zsolt Sági 2011-02-17 10:09:19 UTC
I tried to install RC1 today and the installer was unable to delete windows partitions. It gave me some error message and I had to abort installation entirely. Later I deleted all the partitions manually with fdisk but the installer was even unable to create new partitions. However, after creating the partitions manually YaST was finally able to format and mount them and install RC1 onto these partitions.
Comment 10 Thomas Fehr 2011-02-17 10:26:14 UTC
Not sure if this is really the same.

According to my investigations the problems are mostly caused by udev triggering start 
of all sort of processes on certain events. That processes try to read from the device so that 
a removal, formatting, partitioning (anything that needs exclusive access) of the device
is denied by kernel with something like "device or resource busy"

Would be interesting to know the exact error messages and have yast2 logs of a failed install.
Comment 11 Cong Meng 2011-02-18 08:23:44 UTC
Thomas: I retested with the script from comment #1 in the latest opensuse 11.4, the lvremove failure never happened. Could you please post the yast2 logs?
Comment 12 Thomas Fehr 2011-02-18 08:52:39 UTC
In my 11.3 with updated lvm and device-mapper rpms, the problem can be 
reproduced on the command line. I do not have the y2logs any more, I also
do not have an 11.4 system running right now. But there is no reason to
assume the same problem would not also happen with 11.4 under the right 
conditions.  

Here a cut and pasted dump of a command sequence I just typed:

Lagrange:~ # lvcreate -L 5G -n test2 system
  Logical volume "test2" created
Lagrange:~ # lvremove /dev/system/test2
  Can't remove open logical volume "test2"
Lagrange:~ #

The device-mapper package on my system contains your fix:

* Fri Jan 14 2011 lnussel@suse.de
- the lsmod|grep in boot.device-mapper is superfluous, modprobe does
  that internally anyways

* Wed Dec 15 2010 cmeng@novell.com
- sleep and try to wait opne_count to be 0 while lvremove.
  (Bug #642296)

The problem seems to be a race, it does not happen always but in a quite 
high percentages of lvremove commands (and also when doing "dmsetup remove ...")
The machines is a Quad-Core with Hyperthreading, so it has 8 logical cores.
I got the Quad-Core last December, before I had a dual core machine where
the problem was not present (at least not regularly). 

When one looks at the processes udev starts with "udevadm monitor" it is 
quite clear that any operation needing exclusive access to a device may 
fail because because udev is starting helper processes that open the
modified device in parallel. In my opinion the only way to fix this
reliable is to have a way to stop udev from processing any events while
a crtical sequence of block device manipulations is done and let udev 
resume the event processing after block device manipulation is finished.

If you need access to the machine via ssh I can add you ssh key to
/root/.ssh/authorized_keys.
Comment 14 Thomas Fehr 2011-03-01 10:05:22 UTC
Just added your public key to /root/.ssh/authorized_keys.
Comment 16 Ruediger Meier 2011-06-01 12:47:07 UTC
(In reply to comment #12)
> 
> Lagrange:~ # lvremove /dev/system/test2
>   Can't remove open logical volume "test2"
> Lagrange:~ #
> 
> [..]
> The problem seems to be a race, it does not happen always but in a quite 

Some time ago I had the feeling that it depends on how you specify the LogicalVolumePath arg.

AFAIR I never got this race again when I was really using the
LogicalVolumePath system/test2
instead of path to device /dev/system/test2

Maybe there is some kind of lookup "get lvm path from the given device" which causes the device still being used when trying to remove it.

But as I said this is just a feeling, don't know if this makes sense at all.
Comment 17 Thomas Fehr 2011-06-01 12:55:42 UTC
I can reproducibly make the problem go away when I disable udev. 
So I assume it is one of the many scripts udev calls on certain kernel events the keep the
volume busy temporarily.
Comment 18 Forgotten User OfsFetVrzR 2011-10-11 11:07:28 UTC
Problem still exists on 12.1 beta 1.
Comment 19 Stefan Schäfer 2012-04-22 10:04:55 UTC
Problem still exists on 12.1 final!!!!

Any progress here?
Comment 20 Bob Vickers 2012-06-19 14:38:17 UTC
I just upgraded a SLES 11 system from SP1 to SP2 and saw the same problem. During our backup run 4 out of 9 snapshot volumes could not be removed despite successful unmounting.
Comment 21 Bob Vickers 2012-06-20 08:21:48 UTC
(In reply to comment #20)
Further comment about SLES11SP2: I added
  sync
  sleep 5
to the backup script (before the lvremove) and last night all the backups completed normally.

Could be coincidence of course, but looks promising as a workaround.

Bob
Comment 22 Dong Mao Zhang 2012-06-21 09:08:08 UTC
Can you replace 'sleep 5' to 'udevadm settle', this is to wait for udev events all finished before lvremove.

I guess this would be another work around for this too.
Comment 23 Bob Vickers 2012-07-05 10:30:11 UTC
The 'udevadm settle' workround worked for a while, but a couple of days ago I got failures again, so it is not guranteed. I am currently including a sync, a sleep and a settle in my backup job.

Bob
Comment 24 Bhasker Sadasivan 2012-08-06 09:36:56 UTC
We are the same problem in SLES11 SP2. When can we get a permanent fix for this. We cannot use the workaround given here.
Comment 25 Dong Mao Zhang 2012-08-07 06:27:03 UTC
in bnc#766187, mentioned same problem, SLES11-SP2 was already fixed.
The patch was here,
https://bugzilla.novell.com/attachment.cgi?id=495328.

And I would change udev_sync to default of lvm, this is more reasonable. So user doesn't change lvm.conf.

But the workaround is still needed, put 'udevadm settle' before lvremove.
Comment 26 Dong Mao Zhang 2012-08-07 07:57:09 UTC
@Bob @Bhasker,

If you're using SLE11-SP2, please update to latest lvm.
add
udev_sync =1 to activation section of lvm.conf, and try again.
Mon Jun 18 05:39:51 UTC 2012 - dmzhang@suse.com

- bnc#766187, lvremove fails on deleting snapshot volume
- bnc#750496, Boot stops because of lack of resume partition,
  add udevadm settle before active
- bnc#770200: Online Update to SP2 with kernel 3.0.34-0.7-default
  cause LVM to report memory leak, run mkinitrd when update



If you're using opensuse 12.1, please try the patch above.
Comment 27 Dong Mao Zhang 2012-08-22 10:18:28 UTC
any updates?
Comment 28 Bhasker Sadasivan 2012-08-22 11:21:12 UTC
Did some initial testing where in we executed lvcreate and lvremove in iteration after setting udev_sync=1. We execute 20 iteration and we encountered error twice where in the same error message "Can't remove open logical volume"

so it seems like the issue is not complete addressed.
Comment 29 Dong Mao Zhang 2012-08-23 06:36:42 UTC
Created attachment 503200 [details]
lvm2.patch


1. update to lasted lvm2 or use patch lvm2.patch
2. change to udev_sync=1

try this script:
while true
do
        lvcreate -l10%VG -nbak-lv /dev/vg1/lv
        lsof -t /dev/vg1/bak-lv|while read app
        do
        kill -9 $app
        done
udevadm settle
lvremove -f /dev/vg1/bak-lv
done

Can you try that? I guess this work around could work.
REASON:
1.*lsof* to ensure no other application is using the lv.
2.*udevadm settle* to prevent from race condition.
Comment 30 Bhasker Sadasivan 2012-08-23 07:29:29 UTC
Thanks for the update. But we are seeing addition issues, where in lot of stale entries are left behind. Our product just does this create lun, discover it , then create partition, then pvcreate, vgcreate , create lvol and then do mkfs   . After we do this multiple times we see lot of stale entries. Here is output from "multipath -ll" command.

lnx197-170:/RRT # multipath -ll
360a980003246687a6824424353573859 dm-88
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357364f dm-61
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573373 dm-56
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357336a dm-53
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357322f dm-44
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573959 dm-97 NETAPP,LUN
size=800M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:0 sdb 8:16  active ready running
  `- 4:0:0:0 sdc 8:32  active ready running
360a980003246687a6824424353573961 dm-99
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a4d dm-0
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573858 dm-86
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573674 dm-77
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357314b dm-31
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357366b dm-58
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573068 dm-27
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573053 dm-19
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573757 dm-76
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573943 dm-94
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357342f dm-59
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a55 dm-9
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573153 dm-26
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573673 dm-75
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573572 dm-68
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573268 dm-43
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573253 dm-30
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a69 dm-13
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a54 dm-1
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573049 dm-20
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357384d dm-84
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573051 dm-16
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573939 dm-91
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357386a dm-78
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573755 dm-71
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573571 dm-66
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357394d dm-93 NETAPP,LUN
size=800M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 4:0:0:2 sdg 8:96  active ready running
  `- 3:0:0:2 sdf 8:80  active ready running
360a980003246687a6824424353573166 dm-33
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a70 dm-3
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573855 dm-83
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357384c dm-67
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573553 dm-62
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573249 dm-38
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573837 dm-81
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573668 dm-72
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573349 dm-45
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573670 dm-73
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573366 dm-42
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573937 dm-85
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573969 dm-100 NETAPP,LUN
size=800M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 3:0:0:3 sdh 8:112 active ready running
  `- 4:0:0:3 sdi 8:128 active ready running
360a980003246687a6824424353573954 dm-96
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a51 dm-5
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357394b dm-89 NETAPP,LUN
size=800M features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 4:0:0:1 sde 8:64  active ready running
  `- 3:0:0:1 sdd 8:48  active ready running
360a980003246687a6824424353573667 dm-54
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573566 dm-64
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573247 dm-35
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573651 dm-65
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573045 dm-17
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357367a dm-79
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573565 dm-50
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a47 dm-7
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357325a dm-34
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573145 dm-24
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573866 dm-90
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573378 dm-47
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573949 dm-80
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357306d dm-25
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573159 dm-32
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a63 dm-12
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573664 dm-69
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573161 dm-23
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573362 dm-52
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573043 dm-14
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573578 dm-70
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357336e dm-57
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357304f dm-6
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573276 dm-36
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573950 dm-82
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573158 dm-21
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a62 dm-10
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573847 dm-74
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573359 dm-49
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573361 dm-39
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573562 dm-60
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357336d dm-55
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573174 dm-28
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573863 dm-87
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573559 dm-48
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357334f dm-51
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573056 dm-11
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573257 dm-40
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a58 dm-8
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a43 dm-4
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357326b dm-46
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a4f dm-2
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573845 dm-63
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353567a6c dm-15
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357316a dm-37
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a682442435357302f dm-22
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573072 dm-18
size=1.0G features='0' hwhandler='0' wp=rw
360a980003246687a682442435357324d dm-41
size=800M features='0' hwhandler='0' wp=rw
360a980003246687a6824424353573155 dm-29
size=800M features='0' hwhandler='0' wp=rw
Comment 31 Bob Vickers 2012-08-23 08:33:20 UTC
Here is another script which has worked well for me on SLES11SP2.The combination of sleeps etc is based on trial and error, not science!

#! /bin/bash
# Delete an LVM logical volume (typically used for a snapshot that won't go
# away). See 
# https://bugzilla.novell.com/show_bug.cgi?id=642296

set -o nounset
export PATH=/sbin:/usr/sbin:/bin:/usr/bin

usage=\
" deleteLV  [-n maxTries] [-t timetowait] LVdevice...
    -n         max number of tries (default 10)
    -t         time to wait after sync (default 10 seconds)"

#
#       Parse the command options
declare -i maxtries=10 delay=10
ERRFLAG=OK
OPTIND=1
while getopts n:t: OPTION
do
    case $OPTION in
    n)  maxtries=$OPTARG;;
    t)  delay=$OPTARG;;
    \?) ERRFLAG=BAD;;
    esac
done
shift `expr $OPTIND - 1`
#
# Exit with a usage message if the syntax is wrong.
#
if [ $ERRFLAG != OK -o $# -eq 0 ]
then
    echo 2>&1 "$0: usage:"
    echo 2>&1 "$usage"
    exit 1
fi

declare -i ntries 

for lv in "$@"
do
    if lvs $lv >/dev/null
    then
        # We have a valid LVM logical volume, so repeatedly try to delete it.
        ntries=0
        while (( $ntries < $maxtries ))
        do
            ntries=$(( $ntries + 1 ))
            echo >&2 "$0: Removing $lv attempt $ntries" 
            udevadm settle
            sync
            sleep $delay
            lvremove -f $lv && break 
        done
        if lvs $lv >/dev/null 2>/dev/null
        then
            echo >&2  "$0: FAILED to remove $lv after $ntries attempts"
        else
            echo >&2  "$0: Successfully removed $lv after $ntries attempt(s)"
        fi
    else
         echo >&2  "$0: $lv is not an LVM logical volume"
    fi
done
Comment 32 Dong Mao Zhang 2012-08-30 06:42:42 UTC
(In reply to comment #31)
> Here is another script which has worked well for me on SLES11SP2.The
> combination of sleeps etc is based on trial and error, not science!
I think this script could fix the lvremove problem.

@Bhaske,
>Thanks for the update. But we are seeing addition issues, where in lot of stale
entries are left behind. Our product just does this create lun, discover it

Is that another bug?
Comment 33 Ruediger Meier 2012-11-21 12:58:33 UTC
This bug is still present in openSUSE 12.2
and at least for me now it does not only fails sometimes but always (Trying to lvremove snapshots while the origin is mounted).

FYI
https://bugzilla.redhat.com/show_bug.cgi?id=577798
Comment 35 Chenzi Cao 2014-11-27 08:41:46 UTC
Hi,this product does not longer get updates, if this issues still exist and you can reproduce it on a newer openSUSE version preferable 13.2, please open a new bug and provide a reference to this bug, thank you :)