Bug 570443

Summary: CRASH @ attempt to manually remove vcpus from dom0 using vcpu-set
Product: [openSUSE] openSUSE 11.2 Reporter: mail ignored <0.bugs.only.0>
Component: Xen Assignee: Jan Beulich <jbeulich>
Status: RESOLVED DUPLICATE QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: carnold, jfehlig
Version: Final   
Target Milestone: ---   
Hardware: All   
OS: openSUSE 11.2   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description mail ignored 2010-01-13 18:55:45 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.0) Gecko/20100105 SUSE/3.6rc1-1.2 Firefox/3.6

When attempting to manually remove VCPUs from dom0 using vcpu-set:

 xm vcpu-list Domain-0
  Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
  Domain-0                             0     0     0   r--      27.7 0
  Domain-0                             0     1     1   -b-       9.2 1
  Domain-0                             0     2     2   -b-      14.1 2
  Domain-0                             0     3     3   -b-      14.3 3

 xm vcpu-set --help
  Usage: xm vcpu-set <Domain> <vCPUs>
  Set the number of active VCPUs for allowed for the domain.

 xm vcpu-set Domain-0 1
     ==> xen/xend.log <==
     [2010-01-13 10:48:16 4953] INFO (XendDomainInfo:1818) Set VCPU count on domain Domain-0 to 1
 xm vcpu-list Domain-0

This hangs the current session. Checking in Dom0:

 top
   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  6112 root      15  -5     0    0    0 R  100  0.0   0:29.92 xenwatch_cb

"xenwatch_cb" is hogging 100% CPU. Then,

 kill -9 6112

recovers.

Checking syslog:

==> messages <==
Jan 13 10:49:09 test kernel: [ 1568.737072] BUG: soft lockup - CPU#3 stuck for 61s! [xenwatch_cb:6112]
...
Jan 13 10:49:09 test kernel: [ 1569.275176] CPU 3:
...
Jan 13 10:49:09 test kernel: [ 1569.899143] Pid: 6112, comm: xenwatch_cb Not tainted 2.6.31.8-0.1-xen #1 System Product Name
Jan 13 10:49:09 test kernel: [ 1569.991141] RIP: e030:[<ffffffff8005ef0f>]  [<ffffffff8005ef0f>] lock_timer_base+0x7f/0x90
Jan 13 10:49:09 test kernel: [ 1570.087131] RSP: e02b:ffff88003f38dc10  EFLAGS: 00000246
Jan 13 10:49:09 test kernel: [ 1570.179128] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff8077a3b0
Jan 13 10:49:09 test kernel: [ 1570.275121] RDX: 0000000000000001 RSI: ffff88003f38dc50 RDI: ffffc90000015280
Jan 13 10:49:09 test kernel: [ 1570.367115] RBP: ffff88003f38dc40 R08: ffffffff807833f0 R09: 0000000000000000
Jan 13 10:49:09 test kernel: [ 1570.459110] R10: ffff88003f38dcf0 R11: 000000008141ce5c R12: ffffc90000015280
Jan 13 10:49:09 test kernel: [ 1570.551106] R13: ffff88003f38dc50 R14: 0000000000000000 R15: ffffffff8077a640
Jan 13 10:49:09 test kernel: [ 1570.639105] FS:  00007f8fefd696f0(0000) GS:ffffc90000030000(0000) knlGS:0000000000000000
Jan 13 10:49:09 test kernel: [ 1570.727098] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 13 10:49:09 test kernel: [ 1570.815091] CR2: 0000000000b781e8 CR3: 0000000000003000 CR4: 0000000000000660
Jan 13 10:49:09 test kernel: [ 1570.903088] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan 13 10:49:09 test kernel: [ 1570.991081] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan 13 10:49:09 test kernel: [ 1571.079077] Call Trace:
Jan 13 10:49:09 test kernel: [ 1571.167076]  [<ffffffff8005ef4c>] try_to_del_timer_sync+0x2c/0x90
Jan 13 10:49:09 test kernel: [ 1571.255067]  [<ffffffff8005efda>] del_timer_sync+0x2a/0x50
Jan 13 10:49:09 test kernel: [ 1571.339062]  [<ffffffff80467adf>] mce_cpu_callback+0x122/0x1aa
Jan 13 10:49:09 test kernel: [ 1571.423058]  [<ffffffff80472337>] notifier_call_chain+0x57/0xb0
Jan 13 10:49:09 test kernel: [ 1571.507055]  [<ffffffff8007585c>] __raw_notifier_call_chain+0x1c/0x40
Jan 13 10:49:09 test kernel: [ 1571.591050]  [<ffffffff8045be5f>] _cpu_down+0xaf/0x310
Jan 13 10:49:09 test kernel: [ 1571.671045]  [<ffffffff8045c147>] cpu_down+0x87/0xb0
Jan 13 10:49:09 test kernel: [ 1571.751040]  [<ffffffff8046a97c>] vcpu_hotplug+0xce/0x102
Jan 13 10:49:09 test kernel: [ 1571.831036]  [<ffffffff8046a9fb>] handle_vcpu_hotplug_event+0x4b/0x61
Jan 13 10:49:09 test kernel: [ 1571.907026]  [<ffffffff803070fc>] xenwatch_handle_callback+0x2c/0x80
Jan 13 10:49:09 test kernel: [ 1571.979030]  [<ffffffff8006f9d6>] kthread+0xb6/0xc0
Jan 13 10:49:09 test kernel: [ 1572.051024]  [<ffffffff8000d38a>] child_rip+0xa/0x20
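
The call trace above reads innermost frame first. A minimal sketch for turning such a trace into a caller chain, assuming the syslog line format shown above (the names FRAME_RE, call_chain and SAMPLE are hypothetical, and SAMPLE holds only the first three frames of the trace):

```python
import re

# Matches the symbol part of a kernel trace line, e.g.
# "[<ffffffff8005ef4c>] try_to_del_timer_sync+0x2c/0x90".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def call_chain(log_text):
    """Return the function names after "Call Trace:", outermost caller first."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if not match:
                break  # first non-frame line ends the trace
            frames.append(match.group(1))
    # The kernel prints the innermost frame first, so reverse the list.
    return list(reversed(frames))


# First three frames of the trace from this report.
SAMPLE = """\
Jan 13 10:49:09 test kernel: [ 1571.079077] Call Trace:
Jan 13 10:49:09 test kernel: [ 1571.167076]  [<ffffffff8005ef4c>] try_to_del_timer_sync+0x2c/0x90
Jan 13 10:49:09 test kernel: [ 1571.255067]  [<ffffffff8005efda>] del_timer_sync+0x2a/0x50
Jan 13 10:49:09 test kernel: [ 1571.339062]  [<ffffffff80467adf>] mce_cpu_callback+0x122/0x1aa
"""

print(" -> ".join(call_chain(SAMPLE)))
```

Run against the full trace, this shows the VCPU hotplug handler calling cpu_down, whose notifier chain reaches mce_cpu_callback and then del_timer_sync, where the RIP line reports the CPU stuck in lock_timer_base.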


Reproducible: Always

Comment 1 mail ignored 2010-01-13 19:40:12 UTC
Correction: once hung, even "kill -9" is ignored:

ps ax | grep 6112
 6112 ?        R<    47:48 [xenwatch_cb]
 6319 pts/0    R<+    0:00 grep 6112

kill -9 6112
ps ax | grep 6112
 6112 ?        R<    47:53 [xenwatch_cb]
 6321 pts/0    S<+    0:00 grep 6112

ps ax | grep xenwatch
   22 ?        S<     0:00 [xenwatch]
 6112 ?        R<    51:01 [xenwatch_cb]
 6113 ?        D<     0:00 [xenwatch_cb]
 6114 ?        D<     0:00 [xenwatch_cb]
 6331 pts/0    S<+    0:00 grep xenwatch

kill -9 22 6112 6113 6114
ps ax | grep xenwatch
   22 ?        S<     0:00 [xenwatch]
 6112 ?        R<    51:18 [xenwatch_cb]
 6113 ?        D<     0:00 [xenwatch_cb]
 6114 ?        D<     0:00 [xenwatch_cb]
 6334 pts/0    R<+    0:00 grep xenwatch

A reboot is required :-(
Comment 2 Jan Beulich 2010-01-14 08:19:52 UTC
Until a kernel update becomes available, the workaround is to specify mce=0 on the kernel command line.

*** This bug has been marked as a duplicate of bug 558663 ***
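
After rebooting with the workaround from Comment 2, it may be worth confirming that the option actually reached the kernel. A minimal sketch, assuming the running kernel's command line is readable from /proc/cmdline as on standard Linux (the helper name has_boot_param is hypothetical):

```python
def has_boot_param(cmdline, param):
    """Return True if param appears as a whole word on the kernel command line."""
    return param in cmdline.split()


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError:
        # /proc/cmdline is unavailable (e.g. not running on Linux).
        cmdline = ""
    print("mce=0 present:", has_boot_param(cmdline, "mce=0"))
```

The whole-word comparison avoids false positives from longer options that merely start with the same text.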