Bugzilla – Full Text Bug Listing
| Summary: | KVM VirtualDomain starts twice in the cluster and corrupts the VM | | |
|---|---|---|---|
| Product: | [SUSE Linux Enterprise High Availability Extension] SUSE Linux Enterprise High Availability Extension 12 SP5 | Reporter: | Martin Caj <mcaj> |
| Component: | Pacemaker | Assignee: | Yan Gao <ygao> |
| Status: | RESOLVED FIXED | QA Contact: | SUSE Linux Enterprise High Availability Team <ha-bugs> |
| Severity: | Critical | | |
| Priority: | P1 - Urgent | CC: | bwiedemann, ghe, heming.zhao, jfehlig, lma, mcaj, zzhou |
| Version: | GM | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
Description
Martin Caj
2021-02-16 09:54:47 UTC
Hi Martin,

Normally we would like to request that you run hb_report to collect information for the whole picture of the cluster.

The second thing is a question: does the system implement libvirt virtlockd? It is the libvirt community's native approach to prevent the same VM image from being initialized by multiple instances. An appropriate deployment of this locking mechanism could strengthen the use case you described above [1].

[1] https://libvirt.org/kbase/locking.html

---

It happened again and I collected more details in https://jira.suse.com/browse/ENGINFRA-523

virtlockd was not running, and I enabled and started it now to be on the safe side. However, it seems to protect only a single node, and I could not find out how to have it write its logs to our shared NFS dir.

---

(In reply to Bernhard Wiedemann from comment #6)
> It happened again and I collected more details in
> https://jira.suse.com/browse/ENGINFRA-523
>
> virtlockd was not running and I enabled+started it now to be on the safe side.
> However it seems to only protect a single node
> and I could not find how to have it write its logs to our shared NFS dir.

An example virtlockd configuration could be:

Step 1. In /etc/libvirt/qemu.conf:

    lock_manager = "lockd"

Step 2. In /etc/libvirt/qemu-lockd.conf:

    file_lockspace_dir = "/var/lib/libvirt/lockd/files"

Step 3. Restart libvirtd:

    systemctl restart libvirtd

Now libvirt locking should work. To verify it, once a VM is started you should see something like the following:

    lslocks | grep lockd
    virtlockd 27930 POSIX WRITE 0 0 0 /var/lib/libvirt/lockd/files/2a48a4d85.....

---

Forgot to mention: all cluster nodes are required to use the shared NFS, e.g.

    mount -t nfs -o rw nfs_server_name_ip:/export/xxx /var/lib/libvirt/lockd

---

According to the logs, while VMs are right in the middle of a live migration (the migrate_to operation has finished on node A, but it hasn't yet been confirmed by a migrate_from on node B), cluster transitions often get interrupted by nodes being put into or taken out of standby mode, by resource failures, and so on. Based on the changed configuration/status, the cluster then often changes its decisions, migrating the VMs somewhere else instead, or even migrating them back.

It seems there might be situations where the pacemaker scheduler doesn't handle the dangling state of a live migration well under some conditions. I need to look into the details and see what might be going wrong.

Meanwhile, besides using virtlockd for safety, as a temporary workaround please:

* If any single action/change triggers a cluster transition that involves live migration of resources, always wait for the cluster transition to settle before making any further action/change. You could use `crm --wait`, which makes the command synchronous so that it returns only once everything is finished:

        # crm --wait node standby <node>

  Otherwise, you could manually query the status of the DC node and wait for it to become IDLE:

        crmadmin -S <DC-node>

Also, some configuration fixes and improvements could be made here to reduce edge cases:

* IPMI fencing resources don't support live migration. Given your global default `allow-migrate=true`, they should be explicitly set with the meta attribute `allow-migrate=false`.
* Preferably set `resource-stickiness` under `rsc_defaults` to prevent unnecessary shuffling of resources.
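The configuration suggestions above could look like the following in crmsh. This is only a sketch: `fence-ipmi-talon1`, the stickiness value, and the node names are placeholders, not names taken from this cluster's CIB.

```shell
# Sketch of the suggested changes (resource name and value are placeholders).

# IPMI fencing resources cannot live-migrate; override the global default:
crm resource meta fence-ipmi-talon1 set allow-migrate false

# Prefer keeping resources where they are, to avoid needless shuffling:
crm configure rsc_defaults resource-stickiness=100

# Make transition-triggering commands synchronous:
crm --wait node standby talon1

# Or poll the DC until it reports an idle state:
crmadmin -S talon2
```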
BTW, not really relevant to the issue, but FYI: VD_backup has difficulty doing a live migration:

    Feb 11 10:09:44 [9627] talon1 lrmd: notice: operation_finished: VD_backup_migrate_to_0:15963:stderr [ error: Unsafe migration: Migration without shared storage is unsafe ]
    Feb 11 10:09:44 [9627] talon1 lrmd: notice: operation_finished: VD_backup_migrate_to_0:15963:stderr [ ocf-exit-reason:backup: live migration to talon2 failed: 1 ]

---

Martin/Bernhard, would it be possible to collect a crm report covering the occurrences of the issue? The historical transition files contained in it will be very valuable for diagnosis and for verifying any theories.

---

https://w3.suse.de/~bwiedemann/temp/hb_report-Wed-03-Mar-2021.tar.bz2

I also noticed that locks get lost on virtlockd restart:

    talon3:~ # lslocks | grep lockd
    virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365559
    virtlockd 5467 POSIX 4B WRITE 0 0 0 /run/virtlockd.pid
    virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365664
    virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032704a24474d6e37476e
    virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365673
    virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365576
    talon3:~ # rcvirtlockd restart
    talon3:~ # lslocks | grep lockd
    virtlockd 31180 POSIX 5B WRITE 0 0 0 /run/virtlockd.pid

---

Thanks, Bernhard. Unfortunately, it doesn't contain any historical cluster transitions. Could you please redo it with the `crm report -f` option, covering the times when the issue occurred?

---

(In reply to Bernhard Wiedemann from comment #11)
> I also noticed that locks get lost on virtlockd restart:

This is because the VM gets terminated by design. Can you elaborate on your use case and expectations a little more?

According to `man virtlockd`, it should be possible for virtlockd to re-exec() to upgrade itself without impacting the running VMs. Well, I never managed to get that to work with `kill -USR1`.
Let me loop in the Virtualization Experts for this part. Any hints, @Jim / @Lin?

---

From qemu's angle: qemu has supported image locking since v2.10.0 (the qemu version in SLES 12 SP5 is v3.1.1). By default, as long as the mountpoint holding the file-backed images supports fcntl file locks, qemu uses a file lock to prevent launching multiple VMs backed by the same file-backed image.

According to the hb_report, the talon cluster uses ocfs2. I'm not familiar with ocfs2; doesn't it support fcntl file locks? It's not a new feature, and the majority of modern file systems (e.g. nfs) already support it.

If:
- ocfs2 supports fcntl file locks, and
- the fcntl file lock feature isn't disabled on the mountpoint holding the file-backed VM images (I walked through the 'findmnt' output in the hb_report; the file lock feature isn't disabled explicitly), and
- the 'shareable' flag isn't enabled for the image in the VM config, and
- qemu image locking isn't explicitly disabled (locking=off) in the VM config,

then this issue won't occur, because qemu prevents launching multiple VMs.

From virtlockd's angle:

* With the out-of-the-box configuration, for file-backed images, virtlockd locks the image file directly, through a mechanism very similar to qemu's, to provide resource protection.
* With the configuration in comment #7 and comment #8, we can set up a lock file directory shared with all nodes, which protects block-backed virtual storage as well as file-backed images.

We can use any of these ways, or all of them, to protect our resources.
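The locking behaviour described above can be demonstrated with a small self-contained sketch. Note the hedge: qemu actually takes fcntl OFD locks on the image; `flock(1)` used here is only a close analogue of the same advisory-locking idea, and `$IMG` is a throwaway temp file standing in for a disk image.

```shell
IMG=$(mktemp)   # stand-in for a VM disk image (illustration only; qemu
                # really uses fcntl OFD locks, flock(1) is just an analogue)

# While the first "VM" holds an exclusive lock on the image, a second
# non-blocking locker on the same path is refused:
R1=$(flock -xn "$IMG" -c "flock -xn $IMG -c 'echo got' || echo refused")

# Once the holder exits (compare: restarting virtlockd), the lock is
# gone and a new holder succeeds immediately:
R2=$(flock -xn "$IMG" -c "echo got")

echo "$R1 $R2"    # refused got
rm -f "$IMG"
```

This is also why the earlier observation that locks vanish on virtlockd restart is so dangerous: advisory locks live only as long as the process (strictly, the open file description) that holds them.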
(In reply to Bernhard Wiedemann from comment #11)
> https://w3.suse.de/~bwiedemann/temp/hb_report-Wed-03-Mar-2021.tar.bz2
>
> I also noticed that locks get lost on virtlockd restart:
>
> talon3:~ # lslocks | grep lockd
> virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365559
> virtlockd 5467 POSIX 4B WRITE 0 0 0 /run/virtlockd.pid
> virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365664
> virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032704a24474d6e37476e
> virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365673
> virtlockd 5467 POSIX WRITE 0 0 0 /kvm/vm/locks/scsi/3600a0980383032702d24467a30365576
> talon3:~ # rcvirtlockd restart
> talon3:~ # lslocks | grep lockd
> virtlockd 31180 POSIX 5B WRITE 0 0 0 /run/virtlockd.pid

I noticed this behaviour as well, and this operation is not safe. I'm not aware of whether this is by design or not. After doing so, the locking effect is lost, meaning another VM whose virtual disk is backed by the same file-backed image can be launched, and the image will possibly be corrupted. You should restart libvirtd instead of restarting virtlockd.

---

Great thanks to Lin for going over the picture and details at the qemu level, the libvirt level, and virtlockd. Wonderful! It also seems good enough to let libvirt run outside of the cluster if all VMs are managed by the pacemaker cluster.

(In reply to Lin Ma from comment #14)
[...]
> ocfs2, Doesn't it support fcntl file lock? It's not a new feature and
> majority of modern file systems(e.g. nfs) already support it.

BTW, ocfs2 does support fcntl() at the cluster level, though it is not used in this environment. That's fine.

---

(In reply to Roger Zhou from comment #13)
> According to `man virtlockd`, it is possible for virtlockd to do re-exec()
> for upgrade itself without impact the running VM. Well, I never get success
> with `kill -USR1`. Let me loop in Virtualization Expert for this part, any
> hint, @Jim / @Lin ?

As Lin mentioned, if virtlockd is used to protect resources, it should never be stopped or restarted. But as you note, re-exec'ing with USR1 is safe. E.g., when doing 'systemctl reload virtlockd', you'll notice the following output from virtlockd:

    Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
    Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.

---

(In reply to James Fehlig from comment #17)
> (In reply to Roger Zhou from comment #13)
> > According to `man virtlockd`, it is possible for virtlockd to do re-exec()
> > for upgrade itself without impact the running VM. Well, I never get success
> > with `kill -USR1`. Let me loop in Virtualization Expert for this part, any
> > hint, @Jim / @Lin ?
>
> As Lin mentioned, if virtlockd is used to protect resources, it should never
> be stopped or restarted.

Thanks, Jim, for double-confirming.

> But as you note, re-exec'ing with USR1 is safe.
> E.g. when doing 'systemctl reload virtlockd' you'll notice the following
> output from virtlockd
>
> Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
> Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.

Yeah, it looks like it works well for Xen. Just FYI, a little off the original topic of this bug: for KVM in my environment, I can see the above two lines, but all VMs are gone afterwards too, which differs from my understanding. Well, I lack the visibility to judge how important this is for KVM customers.

---

(In reply to Roger Zhou from comment #18)
> > But as you note, re-exec'ing with USR1 is safe.
> > E.g. when doing 'systemctl reload virtlockd' you'll notice the following
> > output from virtlockd
> >
> > Mar 09 15:48:31 xenb1s1 systemd[1]: Reloading Virtual machine lock manager.
> > Mar 09 15:48:31 xenb1s1 systemd[1]: Reloaded Virtual machine lock manager.
>
> Yeah, it looks works well for Xen.

It should work the same for KVM.
> Just FYI, a little off the original topic of this bug. For KVM in my
> environment, I can see above two lines, but all VMs will be gone too which
> is different than my understanding. Well, I'm lack of visibility about the
> importance for KVM customers to judge this.

I checked my SLES 15 SP2 test machine and found that virtlockd was crashing on re-exec. Some patches from bug#1183411 are needed:

https://bugzilla.suse.com/show_bug.cgi?id=1183411#c6

I have those queued for a future maintenance update of the SLE 15 SP2 libvirt package. They can be pulled from our 15 SP2 devel repo if you are interested in testing:

https://download.suse.de/ibs/Devel:/Virt:/SLE-15-SP2/SUSE_SLE-15-SP2_Update_standard/

---

(In reply to James Fehlig from comment #19)
> https://download.suse.de/ibs/Devel:/Virt:/SLE-15-SP2/SUSE_SLE-15-
> SP2_Update_standard/

It works like a charm. Thanks, Jim!

---

Pacemaker-wise, handling of partial live migration has been improved with this:

https://github.com/ClusterLabs/pacemaker/pull/2739

---

(In reply to Yan Gao from comment #21)
> Pacemaker-wise, handling of partial live migration has been improved with
> this:
> https://github.com/ClusterLabs/pacemaker/pull/2739

All the relevant issues and improvements in pacemaker's handling of live migrations have been addressed and merged upstream as of:

https://github.com/ClusterLabs/pacemaker/pull/2739
https://github.com/ClusterLabs/pacemaker/pull/3020

Included in SLE15 SP5.
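For reference, the reload-versus-restart distinction discussed in this thread can be summarized as a short command sketch. It assumes a systemd-managed host running the fixed libvirt packages mentioned above; do not try this on a host where the re-exec crash is still present.

```shell
# Safe: ask virtlockd to re-exec itself in place, preserving held locks
# (this is what `kill -USR1 <virtlockd-pid>` triggers under the hood):
systemctl reload virtlockd

# Unsafe while VMs hold locks: a full restart drops every lock held on
# the image files, so a second instance of a VM could then be started.
# systemctl restart virtlockd    # <- do not do this

# Verify the locks survived the reload:
lslocks | grep lockd
```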