Bug 438610 - Call Trace with XEN-Kernel on Dell PowerEdge 2950 => onboard network "bnx2" is not working
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 11.1
Classification: openSUSE
Component: Xen
Version: Factory
Hardware: x86-64 Linux
Importance: P2 - High, Critical with 1 vote
Target Milestone: ---
Assignee: Jan Beulich
QA Contact: Jason Douglas
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-10-24 06:59 UTC by Oliver Mössinger
Modified: 2008-12-22 09:30 UTC
CC List: 4 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
coolo: SHIP_STOPPER-


Attachments
more kernel trace (72.30 KB, text/plain), 2008-10-24 11:11 UTC, Oliver Mössinger
Patch for this bug. (10.11 KB, patch), 2008-11-19 09:54 UTC, Wei Kong
process list at reboot (7.16 KB, text/plain), 2008-11-27 08:17 UTC, Oliver Mössinger
dmesg at reboot (86.21 KB, text/plain), 2008-11-27 08:17 UTC, Oliver Mössinger

Description Oliver Mössinger 2008-10-24 06:59:38 UTC
Hi,

I get a kernel "Call Trace" on my Dell PowerEdge 2950 when I use the Xen kernel. Sometimes my onboard network goes down. The two are probably related!

# less /var/log/messages
Oct 22 13:25:34 xensrv2 kernel: ------------[ cut here ]------------
Oct 22 13:25:34 xensrv2 kernel: WARNING: at arch/x86/mm/pageattr-xen.c:622 __change_page_attr+0x67/0x25b()
Oct 22 13:25:34 xensrv2 kernel: CPA: called for zero pte. vaddr = ffff8800f007b000 cpa->vaddr = ffff8800f007b000
Oct 22 13:25:34 xensrv2 kernel: Modules linked in: bridge stp netbk blkbk blktap xenbus_be ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NO
TRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack ip_tables ip6ta
ble_filter ip6_tables x_tables ipv6 microcode fuse loop dm_mod usbhid hid rtc_cmos rtc_core pcspkr ff_memless ide_cd_mod serio_raw rtc_lib
8250_pnp dcdbas(X) bnx2 ses iTCO_wdt button 8250 iTCO_vendor_support serial_core igb enclosure shpchp pci_hotplug i5000_edac edac_core sg e
hci_hcd uhci_hcd usbcore sd_mod crc_t10dif xenblk cdrom xennet edd reiserfs fan ide_pci_generic ata_generic ata_piix pata_acpi libata dock
piix ide_core lpfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod thermal processor thermal_sys hwmon
Oct 22 13:25:34 xensrv2 kernel: Supported: Yes, External
Oct 22 13:25:34 xensrv2 kernel: Pid: 4191, comm: X Tainted: G          2.6.27.1-2-xen #1
Oct 22 13:25:34 xensrv2 kernel:
Oct 22 13:25:34 xensrv2 kernel: Call Trace:
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff8020ba57>] show_trace_log_lvl+0x41/0x58
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff8045cc08>] dump_stack+0x69/0x6f
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff802312f1>] warn_slowpath+0xa9/0xd1
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff80218d4a>] __change_page_attr+0x67/0x25b
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff80218f5b>] __change_page_attr_set_clr+0x1d/0x53
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff802191c4>] change_page_attr_set_clr+0xd0/0x200
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff803df1e2>] pci_mmap_page_range+0xe5/0x149
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff802eb0c6>] mmap+0x5d/0x99
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff80288872>] mmap_region+0x2a1/0x4e8
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff80288da0>] do_mmap_pgoff+0x2e7/0x34b
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff8020e833>] sys_mmap+0x8c/0xc5
Oct 22 13:25:34 xensrv2 kernel:  [<ffffffff8020a878>] system_call_fastpath+0x16/0x1b
Oct 22 13:25:34 xensrv2 kernel:  [<00007fc4623eb88a>] 0x7fc4623eb88a
Oct 22 13:25:34 xensrv2 kernel:
Oct 22 13:25:34 xensrv2 kernel: ---[ end trace 67bf12ace8cdfb26 ]---


# uname -a
Linux xendmz2 2.6.27.1-2-xen #1 SMP 2008-10-16 20:35:15 +0200 x86_64 x86_64 x86_64 GNU/Linux

# lsmod | grep bnx2
bnx2                  182280  0

# lspci | grep -i Ethernet
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
07:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0c:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
0c:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
0d:00.0 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)
0d:00.1 Ethernet controller: Intel Corporation 82575GB Gigabit Network Connection (rev 02)

Thanks
Oliver Mössinger
Comment 1 Oliver Mössinger 2008-10-24 11:06:46 UTC
Hi,

Here is the next kernel trace on the same host:

Oct 24 12:55:17 xendmz2 kernel: Bridge firewalling registered
Oct 24 12:55:17 xendmz2 kernel: tmpbridge: Dropping NETIF_F_UFO since no NETIF_F_HW_CSUM feature.
Oct 24 12:55:17 xendmz2 kernel: eth0 renamed to peth0
Oct 24 12:55:17 xendmz2 kernel: tmpbridge renamed to eth0
Oct 24 12:55:17 xendmz2 kernel: igb 0000:0c:00.0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Oct 24 12:55:17 xendmz2 kernel: device peth0 entered promiscuous mode
Oct 24 12:55:17 xendmz2 kernel: ------------[ cut here ]------------
Oct 24 12:55:17 xendmz2 kernel: WARNING: at net/core/dev.c:1176 br_add_if+0xf3/0x1cd [bridge]()
Oct 24 12:55:17 xendmz2 kernel: Modules linked in: bridge stp fuse loop dm_mod dcdbas(X) rtc_cmos rtc_core iTCO_wdt rtc_lib serio_raw pcspk
r ide_cd_mod iTCO_vendor_support joydev i5000_edac edac_core igb ses enclosure 8250_pnp 8250 serial_core shpchp pci_hotplug button sg usbhi
d hid ff_memless uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif xenblk cdrom xennet edd reiserfs fan ide_pci_generic ata_generic ata_piix pata
_acpi libata dock piix ide_core lpfc scsi_transport_fc scsi_tgt megaraid_sas scsi_mod thermal processor thermal_sys hwmon
Oct 24 12:55:17 xendmz2 kernel: Supported: Yes, External
Oct 24 12:55:17 xendmz2 kernel: Pid: 9811, comm: brctl Tainted: G          2.6.27.1-2-xen #1
Oct 24 12:55:17 xendmz2 kernel:
Oct 24 12:55:17 xendmz2 kernel: Call Trace:
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff8020ba57>] show_trace_log_lvl+0x41/0x58
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff8045cc08>] dump_stack+0x69/0x6f
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff8023136a>] warn_on_slowpath+0x51/0x77
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffffa02fb4d5>] br_add_if+0xf3/0x1cd [bridge]
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffffa02fbc07>] add_del_if+0x48/0x65 [bridge]
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff803f078c>] dev_ioctl+0x400/0x4ab
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff803e1b09>] sock_ioctl+0x1ec/0x1f6
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff802a6609>] vfs_ioctl+0x21/0x6c
Oct 24 12:55:17 xendmz2 kernel:  [<ffffffff802a6893>] do_vfs_ioctl+0x23f/0x255
Oct 24 12:55:18 xendmz2 kernel:  [<ffffffff802a68fa>] sys_ioctl+0x51/0x73
Oct 24 12:55:18 xendmz2 kernel:  [<ffffffff8020a878>] system_call_fastpath+0x16/0x1b
Oct 24 12:55:18 xendmz2 kernel:  [<00007f8474e1b4e7>] 0x7f8474e1b4e7
Oct 24 12:55:18 xendmz2 kernel:
Oct 24 12:55:18 xendmz2 kernel: ---[ end trace 7a6c66e9ad895a74 ]---
Comment 2 Oliver Mössinger 2008-10-24 11:11:13 UTC
Created attachment 247766: more kernel trace
Comment 3 Wei Kong 2008-11-19 09:54:37 UTC
Created attachment 253302: Patch for this bug.

This patch may fix this bug. How would you like to test it: should I build a new kernel for you, or can you build it yourself?

--thanks
Comment 4 Oliver Mössinger 2008-11-20 08:05:23 UTC
Hi,

Thank you! Yes, I will test it on Saturday. Please build the kernel for me!

Oliver Mössinger
Comment 5 Wei Kong 2008-11-20 08:08:24 UTC
Thanks. This patch is for your comment #2.

http://www.brsbox.com/filebox/filegroup/fgid/ef61bf4b743957d4007e08cdfc987e08

kernel-xen-base 6M
kernel-xen      13M

--thanks a lot
Kong 
Comment 6 Wei Kong 2008-11-20 08:14:26 UTC
Btw, the above link is valid for 72 hours.
Comment 7 Oliver Mössinger 2008-11-20 08:38:38 UTC
Hi,

It was difficult to read the site, but I have the files ;-)

Thanks
Oliver Mössinger
Comment 8 Jan Beulich 2008-11-20 14:01:30 UTC
Re original comment: bnx2 intermittently ceasing to work is a duplicate of bug 429739. The call trace, however, is X related, and that is what we may really need to look at.

Re #1: This is a duplicate of bug 435551.

Re #3: Which of the call traces do you believe this patch addresses? I don't see how it is related to either.
Comment 9 Oliver Mössinger 2008-11-20 15:37:38 UTC
Hi Jan,

The traces I reported were all generated on the same host! These traces only appear with the Xen kernel in Dom0. I have not checked a DomU. The default kernel is stable! So I believe it must be a Xen kernel bug, not an X bug!

Oliver
Comment 10 Jan Beulich 2008-11-20 16:29:49 UTC
I didn't say it's an X bug, I said it's an issue with X (rather than with one of the network cards).

In any case, we'll need full hypervisor and kernel messages from that system, after making sure you run the latest bits.
Comment 11 Wei Kong 2008-11-21 01:37:50 UTC
Regarding comment #2: the attached messages contain only one WARNING call trace, shown below:

"Oct 24 13:03:50 xendmz2 kernel: WARNING: at net/core/dev.c:1516 skb_gso_segment+0x82/0x1a6()"

This warning happened because the bridge does not handle the relationship between NETIF_F_TSO/GSO/SG and NETIF_F_GEN_CSUM.

It seems Herbert Xu fixed it in this patch, so I need Oliver to test for this WARNING first.
Comment 12 Oliver Mössinger 2008-11-27 08:17:29 UTC
Created attachment 256043: process list at reboot
Comment 13 Oliver Mössinger 2008-11-27 08:17:59 UTC
Created attachment 256044: dmesg at reboot
Comment 14 Oliver Mössinger 2008-11-27 08:18:44 UTC
Sorry, it was not possible to run the test on Saturday. The test is done now! Here is the information I can give you:

First I updated the host to the current Factory. It now runs "openSUSE 11.1 Beta 5.2 (x86_64)" with this kernel:

xendmz2:~ # rpm -qa | grep kernel-xen
kernel-xen-extra-2.6.27.7-3.1
kernel-xen-base-2.6.27.7-3.1
kernel-xen-2.6.27.7-3.1

Most of the kernel call traces are gone, super :-) Only one still remains:

Nov 27 08:10:01 xendmz2 BLKTAPCTRL[18655]: blktapctrl.c:797: Found driver: [ioemu disk]
Nov 27 08:10:01 xendmz2 BLKTAPCTRL[18655]: blktapctrl.c:797: Found driver: [raw image (cdrom)]
Nov 27 08:10:01 xendmz2 BLKTAPCTRL[18655]: blktapctrl_linux.c:23: /dev/xen/blktap0 device already exists
Nov 27 08:10:02 xendmz2 kernel: vendor=8086 device=244e
Nov 27 08:10:02 xendmz2 kernel: pci 0000:14:0d.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
Nov 27 08:10:02 xendmz2 kernel: ------------[ cut here ]------------
Nov 27 08:10:02 xendmz2 kernel: WARNING: at arch/x86/mm/pageattr-xen.c:622 __change_page_attr_set_clr+0xa4/0xb7c()
Nov 27 08:10:02 xendmz2 kernel: CPA: called for zero pte. vaddr = ffff8800f0e39000 cpa->vaddr = ffff8800f0e39000
Nov 27 08:10:02 xendmz2 kernel: Modules linked in: netbk(N) blkbk(N) blktap(N) xenbus_be(N) dm_round_robin(N) dm_multipat
h(N) scsi_dh(N) ip6t_REJECT(N) nf_conntrack_ipv6(N) ip6table_raw(N) xt_NOTRACK(N) ipt_REJECT(N) xt_physdev(N) xt_state(N)
 iptable_raw(N) iptable_filter(N) ip6table_mangle(N) nf_conntrack_netbios_ns(N) nf_conntrack_ipv4(N) nf_conntrack(N) ip_t
ables(N) ip6table_filter(N) ip6_tables(N) x_tables(N) ipv6(N) microcode(N) bridge(N) stp(N) fuse(N) loop(N) dm_mod(N) bnx
2(N) 8250_pnp(N) 8250(N) rtc_cmos(N) iTCO_wdt(N) rtc_core(N) ide_cd_mod(N) serial_core(N) joydev(N) serio_raw(N) iTCO_ven
dor_support(N) e1000e(N) button(N) shpchp(N) rtc_lib(N) pcspkr(N) i5000_edac(N) dcdbas(N) pci_hotplug(N) ses(N) edac_core
(N) igb(N) enclosure(N) sg(N) usbhid(N) hid(N) ff_memless(N) uhci_hcd(N) ehci_hcd(N) usbcore(N) sd_mod(N) crc_t10dif(N) x
enblk(N) cdrom(N) xennet(N) edd(N) reiserfs(N) fan(N) ide_pci_generic(N) ata_generic(N) ata_piix(N) pata_acpi(N) libata(N
) dock(N) piix(N) ide_core(N) lpfc(N) scsi_transport_fc(N) s
Nov 27 08:10:02 xendmz2 kernel: csi_tgt(N) megaraid_sas(N) scsi_mod(N) thermal(N) processor(N) thermal_sys(N) hwmon(N)
Nov 27 08:10:02 xendmz2 kernel: Supported: No
Nov 27 08:10:02 xendmz2 kernel: Pid: 18651, comm: X Tainted: G          2.6.27.7-3-xen #1
Nov 27 08:10:02 xendmz2 kernel:
Nov 27 08:10:02 xendmz2 kernel: Call Trace:
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8020c547>] show_trace_log_lvl+0x41/0x58
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff80461408>] dump_stack+0x69/0x6f
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff80232bf5>] warn_slowpath+0xa9/0xd1
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8021a186>] __change_page_attr_set_clr+0xa4/0xb7c
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8021ad2e>] change_page_attr_set_clr+0xd0/0x200
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff803e2656>] pci_mmap_page_range+0xe5/0x149
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff802eea9a>] mmap+0x5d/0x99
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8028b9b2>] mmap_region+0x2a1/0x4e8
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8028bee0>] do_mmap_pgoff+0x2e7/0x34b
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8020f3a0>] sys_mmap+0x8c/0xc4
Nov 27 08:10:02 xendmz2 kernel:  [<ffffffff8020b368>] system_call_fastpath+0x16/0x1b
Nov 27 08:10:02 xendmz2 kernel:  [<00007fb3d34e5eea>] 0x7fb3d34e5eea
Nov 27 08:10:02 xendmz2 kernel:
Nov 27 08:10:02 xendmz2 kernel: ---[ end trace 344320c5fffdbd52 ]---
Nov 27 08:10:03 xendmz2 kernel: Not cloning cgroup for unused subsystem ns
Nov 27 08:10:03 xendmz2 SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...


All other call traces are gone, both with and without your patch! BUT the connection is LOST! More information about this:

The Dell host has two "bnx2" interfaces, eth0 and eth1. It was not possible to pass packets over eth0, while eth1 was still working. See the attachments "dmesg.txt" and "psauxf.txt". At that point I tried to reboot. The process "brctl delbr eth0" hung with the following message in dmesg:

"unregister_netdevice: waiting for eth0 to become free. Usage count = 3"

There was no firewall log for eth0 between "Nov 26 17:42:06" and "Nov 27 08:02:40" (08:02 reboot time, see "psauxf.txt"):

Nov 26 17:18:09 xendmz2 kernel: SFW2-INint-DROP-DEFLT IN=eth0 OUT= PHYSIN=peth0 MAC=01:00:5e:00:00:fb:00:50:56:87:7c:6e:08:00 SRC=172.16.2.27 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44
Nov 26 17:42:06 xendmz2 kernel: SFW2-INext-DROP-DEFLT IN=eth1 OUT= PHYSIN=peth1 MAC=01:00:5e:00:00:fb:00:16:3e:7e:14:6b:08:00 SRC=192.168.254.27 DST=224.0.0.251 LEN=64 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=44
...
Nov 27 08:02:40 xendmz2 kernel: SFW2-INext-DROP-DEFLT IN=eth1 OUT= MAC= SRC=192.168.255.245 DST=224.0.0.251 LEN=551 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=531
Nov 27 08:02:40 xendmz2 kernel: SFW2-INint-DROP-DEFLT IN=eth0 OUT= MAC= SRC=172.16.4.245 DST=224.0.0.251 LEN=534 TOS=0x00 PREC=0x00 TTL=255 ID=0 DF PROTO=UDP SPT=5353 DPT=5353 LEN=514

Note the missing PHYSIN in the last firewall log entries! I hope this information helps.
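
The check described above (when did a bridge last log traffic that came in through its physical port?) could be scripted roughly as follows. This is only a sketch: the function name last_physin is hypothetical, and it assumes SuSEfirewall2-style log lines like the ones quoted above, where entries that arrived via a physical port carry a PHYSIN= field.

```shell
# Hypothetical helper: print the timestamp of the last firewall log entry
# that was received on bridge IFACE and carries a PHYSIN= field.
# Usage: last_physin IFACE [LOGFILE]   (LOGFILE defaults to /var/log/messages)
last_physin() {
    iface="$1"
    log="${2:-/var/log/messages}"
    # Match " IN=<iface> " with surrounding spaces so that PHYSIN=... fields
    # do not count as a match for the bridge name itself.
    grep " IN=${iface} " "$log" | grep " PHYSIN=" | tail -n 1 | awk '{print $1, $2, $3}'
}
```

Calling it for both bridges, for example `last_physin eth0` and `last_physin eth1`, gives a quick view of which physical port went quiet first. An empty result means no matching entry was found in the file.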
Comment 15 Jan Beulich 2008-11-27 09:43:47 UTC
For this last remaining call trace I added a fix just half an hour ago, scheduled to go into whatever comes after RC1.

As to the bnx2 problem - I'm not certain the Xen you've got has the necessary fix; you could in any case try disabling MSI either just for that driver or globally.
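
For anyone wanting to try the MSI workaround, a minimal sketch follows. It assumes the bnx2 driver on the running kernel exposes a disable_msi module parameter (worth confirming with modinfo first) and that modprobe reads option files from /etc/modprobe.d/. The function name and the file name are hypothetical.

```shell
# Sketch only. Assumptions: bnx2 accepts a disable_msi module parameter and
# modprobe reads /etc/modprobe.d/*.conf on this system.
# Usage: write_bnx2_nomsi [CONF_FILE]
write_bnx2_nomsi() {
    conf="${1:-/etc/modprobe.d/bnx2-nomsi.conf}"
    # Overwrites CONF_FILE with a single options line for the driver.
    printf 'options bnx2 disable_msi=1\n' > "$conf"
}

# Typical use, as root (reloading the module drops the link for a moment):
#   modinfo -p bnx2 | grep -i msi    # confirm the parameter exists
#   write_bnx2_nomsi
#   rmmod bnx2 && modprobe bnx2
# Global alternative: boot the Dom0 kernel with pci=nomsi on its command line.
```

The per-driver option keeps MSI available to other devices, while pci=nomsi turns it off for every PCI device, so the per-driver route is probably the less invasive one to try first.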
Comment 16 Oliver Mössinger 2008-11-27 15:09:31 UTC
Thank you Jan,

Yes, there are some known problems with bnx2 and MSI. I have disabled MSI for bnx2.

I will report what happens!
Comment 17 Oliver Mössinger 2008-12-01 10:06:07 UTC
Now the host and the network are stable! I tested with and without the patch, and the network works in both configurations. For the moment I still need to disable MSI for bnx2.

Thank you
Comment 18 Stephan Kulow 2008-12-01 12:58:09 UTC
FIXED?
Comment 19 Oliver Mössinger 2008-12-01 13:29:59 UTC
Not really FIXED; a workaround is available (disable MSI for bnx2)!
Comment 20 Jan Beulich 2008-12-01 13:38:45 UTC
It *is* fixed, the fix may just not be externally available, yet. Charles?
Comment 21 Charles Arnold 2008-12-01 18:20:50 UTC
Is this the same issue as bug 429739?  That fix is in RC1.
Comment 22 Jan Beulich 2008-12-02 08:51:39 UTC
Yes, thanks. So Stephan/Oliver, the bnx2 issue *is* fixed. The GUI issue, however, will only be fixed after the next Xen patch commit to the kernel CVS.
Comment 24 Jan Beulich 2008-12-22 09:30:05 UTC
Kernel patches to fix the remaining issues here have been committed and will be available with a future kernel maintenance update.