Bug 558663

Summary: dom0-cpus limit causes xenwatch_cb running 100% and xm command freeze and xend dead
Product: [openSUSE] openSUSE 11.2 Reporter: Udo Attila Fischer <udo1>
Component: Xen Assignee: Jan Beulich <jbeulich>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P2 - High CC: 0.bugs.only.0, carnold, jbeulich, jdouglas, jfehlig, novell.admin
Version: Final   
Target Milestone: unspecified   
Hardware: x86-64   
OS: openSUSE 11.2   
Whiteboard:
Found By: Community User Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Udo Attila Fischer 2009-11-26 10:11:55 UTC
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; hu; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)

If you limit the dom0 CPUs with dom0-cpus:
- [xenwatch_cb] runs at 100% CPU and a new entry appears in /var/log/messages every 65 sec: BUG: soft lockup - CPU#X stuck for 61s!
- xm commands do not work
- xend is dead

*****************************


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4532 root      15  -5     0    0    0 R  100  0.0  11:14.84 xenwatch_cb


# ps aux |grep xen
root        39  0.0  0.0      0     0 ?        S<   13:03   0:00 [xenwatch]
root        40  0.0  0.0      0     0 ?        S<   13:03   0:00 [xenbus]
root      3791  0.0  0.0  11300  1560 ?        S    13:04   0:00 /bin/bash /etc/init.d/xend start
root      4209  0.0  0.1 107504 13864 ?        S    13:04   0:00 /usr/bin/python2.6 /usr/sbin/xend start
root      4446  0.0  0.0   8488  1000 ?        S    13:04   0:00 xenstored --pid-file /var/run/xenstore.pid
root      4448  0.0  0.0      0     0 ?        Z    13:04   0:00 [xenconsoled] <defunct>
root      4450  0.0  0.0      0     0 ?        Zs   13:04   0:00 [xend] <defunct>
root      4451  0.0  0.1 107500 11500 ?        S    13:04   0:00 /usr/bin/python2.6 /usr/sbin/xend start
root      4453  0.0  0.0  22724   560 ?        Sl   13:04   0:00 xenconsoled
root      4455  0.0  0.2 148304 16652 ?        Sl   13:04   0:00 /usr/bin/python2.6 /usr/sbin/xend start
root      4532  100  0.0      0     0 ?        R<   13:04  40:35 [xenwatch_cb]
root      4533  0.0  0.0      0     0 ?        D<   13:04   0:00 [xenwatch_cb]
root      4534  0.0  0.0      0     0 ?        D<   13:04   0:00 [xenwatch_cb]
root      4535  0.0  0.0      0     0 ?        D<   13:04   0:00 [xenwatch_cb]
root      4536  0.0  0.0      0     0 ?        D<   13:04   0:00 [xenwatch_cb]



From /var/log/messages, repeated every 65 sec:

Nov 23 13:55:14 dom0-u2 kernel: [ 3112.781517] BUG: soft lockup - CPU#4 stuck for 61s! [xenwatch_cb:4532]
Nov 23 13:55:14 dom0-u2 kernel: [ 3112.781517] Modules linked in: sha1_generic hmac cryptomgr aead pcompress crypto_blkcipher crypto_hash crypto_algapi drbd netbk blkbk blkback_pagemap blktap xenbus_be binfmt_misc xt_tcpudp ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_physdev xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 bridge stp llc dummy fuse loop dm_mod mptctl iTCO_wdt iTCO_vendor_support i5k_amb sg i5000_edac ppdev 8250_pnp pcspkr sr_mod edac_core parport_pc shpchp e1000e dcdbas 8250 pci_hotplug tg3 parport serio_raw serial_core button usbhid hid uhci_hcd ehci_hcd xenblk cdrom xennet edd fan ide_pci_generic piix ide_core ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas thermal processor thermal_sys hwmon
Nov 23 13:55:14 dom0-u2 kernel: [ 3112.781517] CPU 4:
Nov 23 13:55:14 dom0-u2 kernel: [ 3112.781517] Modules linked in: sha1_generic hmac cryptomgr aead pcompress crypto_blkcipher crypto_hash crypto_algapi drbd netbk blkbk blkback_pagemap blktap xenbus_be binfmt_misc xt_tcpudp ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_physdev xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 bridge stp llc dummy fuse loop dm_mod mptctl iTCO_wdt iTCO_vendor_support i5k_amb sg i5000_edac ppdev 8250_pnp pcspkr sr_mod edac_core parport_pc shpchp e1000e dcdbas 8250 pci_hotplug tg3 parport serio_raw serial_core button usbhid hid uhci_hcd ehci_hcd xenblk cdrom xennet edd fan ide_pci_generic piix ide_core ata_generic ata_piix mptsas mptscsih mptbase scsi_transport_sas thermal processor thermal_sys hwmon
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] RIP: e030:[<ffffffff8005f07f>]  [<ffffffff8005f07f>] lock_timer_base+0x7f/0x90
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] RSP: e02b:ffff8801e8d0bc10  EFLAGS: 00000246
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffffff80778370
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] RDX: 0000000000000007 RSI: ffff8801e8d0bc50 RDI: ffffc90000075280
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] RBP: ffff8801e8d0bc40 R08: ffffffff807813b0 R09: 0000000000000000
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] R10: ffff8801e8d0bcf0 R11: 00000000e15cfb6d R12: ffffc90000075280
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] R13: ffff8801e8d0bc50 R14: 0000000000000000 R15: ffffffff80778600
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] FS:  00007f53d0abf6f0(0000) GS:ffffc90000040000(0000) knlGS:0000000000000000
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] CR2: 00007f53d0691260 CR3: 0000000000003000 CR4: 0000000000002660
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855] Call Trace:
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8005f0bc>] try_to_del_timer_sync+0x2c/0x90
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8005f14a>] del_timer_sync+0x2a/0x50
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8046758f>] mce_cpu_callback+0x122/0x1aa
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff80471de7>] notifier_call_chain+0x57/0xb0
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff80075a1c>] __raw_notifier_call_chain+0x1c/0x40
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8045b90f>] _cpu_down+0xaf/0x310
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8045bbf7>] cpu_down+0x87/0xb0
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8046a42c>] vcpu_hotplug+0xce/0x102
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8046a4ab>] handle_vcpu_hotplug_event+0x4b/0x61
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff80306c4c>] xenwatch_handle_callback+0x2c/0x80
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8006fb96>] kthread+0xb6/0xc0
Nov 23 13:54:09 dom0-u2 kernel: [ 3047.280855]  [<ffffffff8000d38a>] child_rip+0xa/0x20
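
To check whether a log contains the same kind of entry, a small filter along these lines may help. It is only a sketch: the function name `find_soft_lockups` is made up for illustration, and the pattern assumes each log entry sits on a single unwrapped line in the format shown above.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#4 stuck for 61s! [xenwatch_cb:4532]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return (cpu, seconds, task, pid) for every soft-lockup line found."""
    hits = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((
                int(match.group("cpu")),
                int(match.group("secs")),
                match.group("task"),
                int(match.group("pid")),
            ))
    return hits
```

Feeding it the contents of /var/log/messages and filtering for the task name xenwatch_cb would show whether a machine is hitting this particular lockup.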


Reproducible: Always

Steps to Reproduce:
1. Set dom0-cpus = X, where 0 < X < [number of CPUs in your system], in /etc/xen/xend-config.sxp
2. Reboot, or just run rcxend restart
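
The condition in step 1 can be checked with a short script. This is a sketch under assumptions: it expects the setting in the S-expression form `(dom0-cpus N)` used by xend-config.sxp, and the function names are invented for illustration.

```python
import re


def configured_dom0_cpus(config_text):
    """Return the dom0-cpus value from xend-config.sxp text, or None if unset."""
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # commented-out setting
        match = re.match(r"\(\s*dom0-cpus\s+(\d+)\s*\)", stripped)
        if match:
            return int(match.group(1))
    return None


def removes_cpus_at_runtime(config_text, total_cpus):
    """True when 0 < dom0-cpus < total_cpus, the range given in step 1."""
    value = configured_dom0_cpus(config_text)
    return value is not None and 0 < value < total_cpus
```

On the 8-CPU machine from this report, `removes_cpus_at_runtime("(dom0-cpus 2)", 8)` is true, while a value of 0 or 8 falls outside the reported range.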
Actual Results:  
- [xenwatch_cb] runs at 100% CPU and a new entry appears in /var/log/messages every 65 sec: BUG: soft lockup - CPU#X stuck for 61s!
- xm commands do not work
- xend is dead


Expected Results:  
No error

Hardware: Dell server, 2 x quad-core (8 CPUs), 8 GB RAM installed.
Comment 1 Jan Beulich 2009-12-01 11:44:56 UTC
This appears to also be a problem in native code (introduced in 2.6.30): If a CPU gets hot plugged while check_interval is zero (modifiable to zero via /sys, defaulting to zero on Xen), mce_timer will never get set up, and a subsequent del_timer() can't lock the timer as its base is NULL. Hence I'll get a patch submitted upstream first.
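
The failure mode described in this comment can be pictured with a simplified Python model. This is not kernel code: the names echo symbols from the trace, but the bodies are assumptions made for illustration, and the `max_spins` bound exists only so the model terminates (the loop seen in the trace has no such bound).

```python
class TimerBase:
    """Stand-in for the per-CPU timer base (hypothetical model)."""


class Timer:
    """Stand-in for a kernel timer; base stays None until the timer is set up."""

    def __init__(self):
        self.base = None


def cpu_online(timer, check_interval):
    # Models the described behaviour: with a zero check interval the
    # timer is never set up, so it never gets a base.
    if check_interval:
        timer.base = TimerBase()


def lock_timer_base(timer, max_spins=1000):
    """Retry until the timer has a base; return None if the bound is hit."""
    for _ in range(max_spins):
        base = timer.base
        if base is not None:
            return base
        # the kernel would relax the CPU here and try again
    return None


def cpu_offline(timer):
    # Models the timer deletion on the CPU-down path; False means the
    # lock attempt never succeeded (the "stuck" case from the log).
    return lock_timer_base(timer) is not None
```

With `check_interval` set to 0, `cpu_offline()` returns False in this model, which corresponds to the endless spin in lock_timer_base reported by the soft-lockup detector; with a non-zero interval it returns True.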
Comment 2 Jan Beulich 2009-12-01 11:47:49 UTC
Workaround for the time being would be "mce=off" on the kernel command line.
Comment 3 Udo Attila Fischer 2009-12-02 11:33:50 UTC
Another workaround is to define the number of dom0 VCPUs at boot time (submitted by Vladislav Karpenko on the xen-users list).

Add the boot option dom0_max_vcpus=1 to the Xen kernel parameter list in menu.lst, like this:

 kernel /xen.gz dom0_mem=512M dom0_vcpus_pin dom0_max_vcpus=1

This sets the CPU count at boot time ( coldplug :) ), so the hotplug issue does not happen.
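
For anyone applying this on several hosts, the edit to the kernel line could be scripted roughly as follows. This is a sketch only: `add_boot_option` is a hypothetical helper, and it handles a single line of text rather than reading or writing menu.lst itself.

```python
def add_boot_option(kernel_line, option):
    """Append option to a GRUB 'kernel' line unless its key is already set."""
    key = option.split("=", 1)[0]
    for word in kernel_line.split():
        if word.split("=", 1)[0] == key:
            return kernel_line  # already present; leave the line untouched
    return kernel_line.rstrip() + " " + option


line = "kernel /xen.gz dom0_mem=512M dom0_vcpus_pin"
print(add_boot_option(line, "dom0_max_vcpus=1"))
# prints: kernel /xen.gz dom0_mem=512M dom0_vcpus_pin dom0_max_vcpus=1
```

An existing dom0_max_vcpus setting is deliberately left alone, so a value chosen by the administrator is not overwritten.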

-----------------------------

But if I understand the problem correctly, it should also occur in openSUSE 11.2 DomUs when changing the number of VCPUs at runtime...
Comment 4 Jan Beulich 2009-12-10 07:42:34 UTC
*** Bug 561607 has been marked as a duplicate of this bug. ***
Comment 5 Jan Beulich 2009-12-21 10:46:47 UTC
This should now be fixed with the import of 2.6.31.9 (and, for HEAD/Factory, 2.6.32.2); the fix will be available with a future kernel maintenance update.
Comment 6 Jan Beulich 2010-01-14 08:19:56 UTC
*** Bug 570443 has been marked as a duplicate of this bug. ***