Bug 1182313

Summary: KVM VirtualDomain starts twice in the cluster and corrupts the VM
Product: SUSE Linux Enterprise High Availability Extension 12 SP5
Reporter: Martin Caj <mcaj>
Component: Pacemaker
Assignee: Yan Gao <ygao>
Status: RESOLVED FIXED
QA Contact: SUSE Linux Enterprise High Availability Team <ha-bugs>
Severity: Critical
Priority: P1 - Urgent
CC: bwiedemann, ghe, heming.zhao, jfehlig, lma, mcaj, zzhou
Version: GM
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Martin Caj 2021-02-16 09:54:47 UTC
Hi,

We are running a cluster of nodes with SUSE Linux Enterprise High Availability Extension based on SLE-12-SP5.

The cluster runs the pacemaker services with a lot of VirtualDomain instances.
The VirtualDomain resources are defined there as `primitive` and allow migration between nodes in the cluster.

Once we set standby mode on one cluster node, its VMs are live-migrated to the rest of the cluster. Then there is no VM running on that node and we can do maintenance, for example a package update for a new kernel.
We can also safely reboot the node.
Afterwards, we switch the node back into online mode and the pacemaker services
live-migrate the VMs back to the node.

This setup worked very reliably for several years, until recently we found a VM running twice on the cluster.

For example, a VM icecream was, according to crm and the hawk dashboard, running on server 1. The VM was unstable and showed problems with file system integrity. When I checked all running VMs on all cluster nodes (via the command
`virsh list`), I found it also running on server 2.
Pacemaker reported that it was running only on server 1.

Do you have an idea how that happened?
How can we prevent that situation?
Comment 5 Roger Zhou 2021-02-24 10:19:00 UTC
Hi Martin, 

Normally, we would ask you to run hb_report to collect information for the whole picture of the cluster.

The second thing is a question: does the system implement libvirt's virtlockd? It is the libvirt community's native approach to prevent the same VM image from being initialized by multiple instances. Appropriate deployment of the locking mechanism could strengthen the use case you described above [1].



[1] https://libvirt.org/kbase/locking.html
Comment 6 Bernhard Wiedemann 2021-02-27 14:41:59 UTC
It happened again and I collected more details in
https://jira.suse.com/browse/ENGINFRA-523


virtlockd was not running and I enabled+started it now to be on the safe side.
However it seems to only protect a single node
and I could not find how to have it write its logs to our shared NFS dir.
Comment 7 Roger Zhou 2021-03-02 08:07:06 UTC
(In reply to Bernhard Wiedemann from comment #6)
> It happened again and I collected more details in
> https://jira.suse.com/browse/ENGINFRA-523
> 
> 
> virtlockd was not running and I enabled+started it now to be on the safe
> side.
> However it seems to only protect a single node
> and I could not find how to have it write its logs to our shared NFS dir.


An example configuration for virtlockd could be:

Step 1. /etc/libvirt/qemu.conf
        lock_manager = "lockd"

Step 2. /etc/libvirt/qemu-lockd.conf
        file_lockspace_dir = "/var/lib/libvirt/lockd/files"

Step 3. systemctl restart libvirtd

Now, libvirt locking should work. To verify it, once a VM starts you should see something like the following:

lslocks | grep lockd
virtlockd       27930  POSIX        WRITE 0          0          0 /var/lib/libvirt/lockd/files/2a48a4d85.....
Comment 8 Roger Zhou 2021-03-02 10:00:45 UTC
Forgot to mention: all cluster nodes are required to use the shared NFS, e.g.

mount -t nfs -o rw nfs_server_name_ip:/export/xxx /var/lib/libvirt/lockd
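
To make the shared lockspace survive reboots, the same mount can also go into /etc/fstab; a sketch reusing the placeholder server name and export path from the mount command above:

```
# /etc/fstab -- nfs_server_name_ip and /export/xxx are placeholders
nfs_server_name_ip:/export/xxx  /var/lib/libvirt/lockd  nfs  rw  0 0
```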
Comment 9 Yan Gao 2021-03-02 10:25:01 UTC
According to the logs, while VMs are right in the middle of live-migration (migrate_to operation is finished on nodeA, but it hasn't been confirmed by a migrate_from on node B), cluster transitions often get interrupted by setting/unsetting standby mode of nodes, failures of resources and so on ... And based on the changed config/status, cluster often changes the decisions to instead migrate the VMs to somewhere else or even to migrate them back.

It seems that there might be situations that pacemaker scheduler doesn't handle dangling state of live-migration well under some conditions. I need to look into the details and see what might be going wrong.

Meanwhile besides using virtlockd for safety, as temporary workaround, please:

* If any single action/change is triggering a cluster transition that involves live migration of resources, please always wait for the cluster transition to be settled before making any further action/change.

You could use `crm --wait`, which makes the command synchronous and returns only once everything has finished:
# crm --wait node standby <node>

Otherwise you could manually query the status of DC node and wait for it to become IDLE:
crmadmin -S <DC-node>
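
A minimal sketch of such a wait loop. The `wait_for_idle` helper is hypothetical, not part of crmsh; it assumes `crmadmin -S` prints a status line containing `S_IDLE` once the DC has settled:

```shell
#!/bin/sh
# Hypothetical helper: poll the DC until the cluster transition settles.
# Assumes `crmadmin -S <node>` prints e.g. "Status of crmd@talon2: S_IDLE (ok)".
wait_for_idle() {
    dc_node=$1
    until crmadmin -S "$dc_node" 2>/dev/null | grep -q S_IDLE; do
        sleep 5
    done
}

# Usage: put a node in standby, then block until the transition settles:
#   crm node standby talon1
#   wait_for_idle talon2    # talon2 being the current DC
```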

Also, some fixes and improvements to the configuration could be done here to reduce edge cases:

* ipmi fencing resources don't support live-migration. Given your global default `allow-migrate=true`, they should be explicitly set with meta `allow-migrate=false`.

* Preferably set `resource-stickiness` under `rsc_defaults` to prevent unnecessary shuffling of resources.
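
A sketch of those two configuration changes with crmsh; the fencing resource name and the stickiness value are illustrative, not taken from this cluster:

```
# Disable live migration for an ipmi fencing resource (name is illustrative)
crm resource meta ipmi-fence-talon1 set allow-migrate false

# Give all resources some default stickiness so they do not move without need
crm configure rsc_defaults resource-stickiness=100
```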
  
BTW, not really relevant to the issue but FYI, VD_backup has difficulty doing live-migration:

Feb 11 10:09:44 [9627] talon1       lrmd:   notice: operation_finished: VD_backup_migrate_to_0:15963:stderr [ error: Unsafe migration: Migration without shared storage is unsafe ]
Feb 11 10:09:44 [9627] talon1       lrmd:   notice: operation_finished: VD_backup_migrate_to_0:15963:stderr [ ocf-exit-reason:backup: live migration to talon2 failed: 1 ]
Comment 10 Yan Gao 2021-03-02 14:54:07 UTC
Martin/Bernhard, is it possible to collect a crm report covering the occurrences of the issue? The historical transition files contained in it will be very valuable for diagnosing and verifying any theories.
Comment 11 Bernhard Wiedemann 2021-03-03 08:56:01 UTC
https://w3.suse.de/~bwiedemann/temp/hb_report-Wed-03-Mar-2021.tar.bz2

I also noticed that locks get lost on virtlockd restart:

talon3:~ # lslocks | grep lockd
virtlockd        5467  POSIX      WRITE 0     0          0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365559
virtlockd        5467  POSIX   4B WRITE 0     0          0 /run/virtlockd.pid
virtlockd        5467  POSIX      WRITE 0     0          0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365664
virtlockd        5467  POSIX      WRITE 0     0          0 /kvm/vm/locks/scsi/3600a0980383032704a24474d6e37476e
virtlockd        5467  POSIX      WRITE 0     0          0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365673
virtlockd        5467  POSIX      WRITE 0     0          0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365576
talon3:~ # rcvirtlockd restart
talon3:~ # lslocks | grep lockd
virtlockd       31180  POSIX   5B WRITE 0     0          0 /run/virtlockd.pid
Comment 12 Yan Gao 2021-03-03 10:25:06 UTC
Thanks, Bernhard. But unfortunately it doesn't contain any historical cluster transitions.

Could you please redo it with the `crm report -f` option, covering the times when the issue occurred?
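
For example (the time values and destination path are placeholders):

```
# Collect a report, including PE/transition files, for a given time window
crm report -f "2021-02-11 09:00" -t "2021-02-11 12:00" /tmp/report-icecream
```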
Comment 13 Roger Zhou 2021-03-03 14:44:32 UTC
(In reply to Bernhard Wiedemann from comment #11)

> 
> I also noticed that locks get lost on virtlockd restart:
> 

This is because the VMs get terminated by design.

Can you elaborate on your use case and expectations a little more?

According to `man virtlockd`, it is possible for virtlockd to re-exec() to upgrade itself without impacting running VMs. Well, I never got `kill -USR1` to succeed. Let me loop in a virtualization expert for this part. Any hint, @Jim / @Lin?
Comment 14 Lin Ma 2021-03-04 12:55:52 UTC
From qemu's angle: qemu supports image locking since v2.10.0 (the qemu version in SLES 12 SP5 is v3.1.1). By default, as long as the mountpoint holding the file-backed images supports fcntl file locks, qemu uses the file lock to prevent launching multiple VMs backed by the same file-backed image.

According to the hb report, the talon cluster uses ocfs2. I'm not familiar with ocfs2; doesn't it support fcntl file locks? It's not a new feature, and the majority of modern file systems (e.g. nfs) already support it.

If
- ocfs2 supports fcntl file locks,
and
- the fcntl file lock feature isn't disabled on the mountpoint holding the file-backed VM images
  (I walked through the 'findmnt' output in the hb report; the file lock feature isn't disabled explicitly),
and
- the 'shareable' flag isn't enabled for the image in the VM config,
and
- qemu image locking isn't explicitly disabled (locking=off) in the VM config,

then this issue won't occur, because qemu prevents launching the multiple VMs.
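
For reference, the 'shareable' flag mentioned above lives in the disk section of the libvirt domain XML; a sketch with illustrative paths and device names:

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <!-- source path is illustrative -->
  <source file='/kvm/vm/images/icecream.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <!-- a <shareable/> element here would relax the exclusive image lock,
       so it must NOT be present for these VMs -->
</disk>
```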

From virtlockd's angle:
* With the out-of-the-box configuration, for file-backed images, virtlockd locks the image file directly, through a mechanism very similar to qemu's, to provide resource protection.

* With the configuration in comment #7 and comment #8, we can set up a lock file directory shared with all nodes to protect block-backed virtual storage as well as file-backed images.


We can use any of these ways, or all of them, to protect our resources.

(In reply to Bernhard Wiedemann from comment #11)
> https://w3.suse.de/~bwiedemann/temp/hb_report-Wed-03-Mar-2021.tar.bz2
> 
> I also noticed that locks get lost on virtlockd restart:
> 
> talon3:~ # lslocks | grep lockd
> virtlockd        5467  POSIX      WRITE 0     0          0
> /kvm/vm/locks/scsi/3600a0980383032702d24467a30365559
> virtlockd        5467  POSIX   4B WRITE 0     0          0 /run/virtlockd.pid
> virtlockd        5467  POSIX      WRITE 0     0          0
> /kvm/vm/locks/scsi/3600a0980383032702d24467a30365664
> virtlockd        5467  POSIX      WRITE 0     0          0
> /kvm/vm/locks/scsi/3600a0980383032704a24474d6e37476e
> virtlockd        5467  POSIX      WRITE 0     0          0
> /kvm/vm/locks/scsi/3600a0980383032702d24467a30365673
> virtlockd        5467  POSIX      WRITE 0     0          0
> /kvm/vm/locks/scsi/3600a0980383032702d24467a30365576
> talon3:~ # rcvirtlockd restart
> talon3:~ # lslocks | grep lockd
> virtlockd       31180  POSIX   5B WRITE 0     0          0 /run/virtlockd.pid

I noticed this behaviour as well, and this operation is not safe.
I'm not aware whether this is by design or not.
After doing so, the locking effect is lost, meaning another VM whose virtual disk is backed by the same file-backed image can be launched, and the image will possibly be corrupted.

You should restart libvirtd instead of restarting virtlockd.
Comment 16 Roger Zhou 2021-03-08 04:10:41 UTC
Great thanks to Lin for going over the picture and the details at the qemu level, the libvirt level, and virtlockd. Wonderful!

And it seems good enough to let libvirt run outside of the cluster if all VMs are managed by the pacemaker cluster.

(In reply to Lin Ma from comment #14)

[...]
> ocfs2, Doesn't it support fcntl file lock? It's not a new feature and
> majority of modern file systems(e.g. nfs) already support it.
> 

btw, ocfs2 does support fcntl() at the cluster level, though it is not used in this environment. That's fine.
Comment 17 James Fehlig 2021-03-09 22:54:59 UTC
(In reply to Roger Zhou from comment #13)
> According to `man virtlockd`, it is possible for virtlockd to do re-exec()
> for upgrade itself without impact the running VM. Well, I never get success
> with `kill -USR1`. Let me loop in Virtualization Expert for this part, any
> hint, @Jim / @Lin ?

As Lin mentioned, if virtlockd is used to protect resources, it should never be stopped or restarted. But as you note, re-exec'ing with USR1 is safe. E.g., when doing 'systemctl reload virtlockd' you'll notice the following output:

Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.
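
A quick way to check that held locks survive the re-exec (in contrast to the restart shown in comment #11) could be:

```
lslocks | grep lockd        # note the held image locks
systemctl reload virtlockd  # re-exec via SIGUSR1, should keep lock state
lslocks | grep lockd        # the same image locks should still be listed
```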
Comment 18 Roger Zhou 2021-03-15 08:46:26 UTC
(In reply to James Fehlig from comment #17)
> (In reply to Roger Zhou from comment #13)
> > According to `man virtlockd`, it is possible for virtlockd to do re-exec()
> > for upgrade itself without impact the running VM. Well, I never get success
> > with `kill -USR1`. Let me loop in Virtualization Expert for this part, any
> > hint, @Jim / @Lin ?
> 
> As Lin mentioned, if virtlockd is used to protect resources, it should never
> be stopped or restarted. 

Thanks Jim to double confirm.

> But as you note, re-exec'ing with USR1 is safe.
> E.g. when doing 'systemctl reload virtlockd' you'll notice the following
> output from virtlockd
> 
> Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
> Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.

Yeah, it looks like it works well for Xen.

Just FYI, a little off the original topic of this bug: for KVM in my environment, I can see the above two lines, but all VMs are gone too, which is different from my understanding. Well, I lack the visibility into KVM customers' needs to judge the importance of this.
Comment 19 James Fehlig 2021-03-15 21:48:47 UTC
(In reply to Roger Zhou from comment #18)
> > But as you note, re-exec'ing with USR1 is safe.
> > E.g. when doing 'systemctl reload virtlockd' you'll notice the following
> > output from virtlockd
> > 
> > Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
> > Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.
> 
> Yeah, it looks works well for Xen.

It should work the same for KVM.

> Just FYI, a little off the original topic of this bug. For KVM in my
> environment, I can see above two lines, but all VMs will be gone too which
> is different than my understanding. Well, I'm lack of visibility about the
> importance for KVM customers to judge this.

I checked my SLES15 SP2 test machine and found that virtlockd was crashing on re-exec. Some patches from bug #1183411 are needed:

https://bugzilla.suse.com/show_bug.cgi?id=1183411#c6

I have those queued for a future maintenance update of the SLE15 SP2 libvirt package. They can be pulled from our 15 SP2 devel repo if you are interested in testing

https://download.suse.de/ibs/Devel:/Virt:/SLE-15-SP2/SUSE_SLE-15-SP2_Update_standard/
Comment 20 Roger Zhou 2021-03-17 09:34:07 UTC
(In reply to James Fehlig from comment #19)

> https://download.suse.de/ibs/Devel:/Virt:/SLE-15-SP2/SUSE_SLE-15-
> SP2_Update_standard/

It works like a charm. Thanks Jim!
Comment 21 Yan Gao 2022-06-28 14:12:08 UTC
Pacemaker-wise, handling of partial live migration has been improved with this:
https://github.com/ClusterLabs/pacemaker/pull/2739
Comment 25 Yan Gao 2023-07-17 18:33:17 UTC
(In reply to Yan Gao from comment #21)
> Pacemaker-wise, handling of partial live migration has been improved with
> this:
> https://github.com/ClusterLabs/pacemaker/pull/2739

All the relevant issues and improvements on pacemaker for handling live migrations have been addressed and merged upstream as of:

https://github.com/ClusterLabs/pacemaker/pull/2739
https://github.com/ClusterLabs/pacemaker/pull/3020

Included in SLE15 SP5.