Bug 702205 - RTL8111/8168B hard locking and rebooting machines when under heavy load
Summary: RTL8111/8168B hard locking and rebooting machines when under heavy load
Status: VERIFIED WONTFIX
Duplicates: 709886
Alias: None
Product: openSUSE 11.4
Classification: openSUSE
Component: Kernel
Version: Final
Hardware: x86-64 openSUSE 11.4
Priority: P5 - None  Severity: Critical with 10 votes
Target Milestone: ---
Assignee: Benjamin Poirier
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2011-06-25 23:08 UTC by Forgotten User 6pyoK8uj9i
Modified: 2012-03-22 19:00 UTC
CC List: 4 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Forgotten User 6pyoK8uj9i 2011-06-25 23:08:36 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1

When this network card (should apply to all cards from this family) is under heavy load (>20MB) it causes hard lockups and/or hard machine reboots.

This issue is extremely hard to localize. There are either no errors, or the machine reboots too fast to see them.

Plus the actual errors are not very informative ("eth0 link up" is one of the "fatal errors").

Upstream reference: https://bugzilla.kernel.org/show_bug.cgi?id=32962
Ubuntu forum reference: http://ubuntuforums.org/showpost.php?p=10774353&postcount=18

Solution is to use the driver provided by Realtek.

I spent two days hunting this problem.

Reproducible: Always

Steps to Reproduce:
1. do some network heavy stuff
2. watch your machine crash
Actual Results:  
Machine reboots or hard locks.

Expected Results:  
No problems should appear.

Please, even if you won't fix this soon, at least disable the problematic module so that users have some clue where to look, because this is a really, really, really annoying problem that is extremely hard to diagnose.
Comment 1 Forgotten User 6pyoK8uj9i 2011-06-25 23:10:07 UTC
Whoops, forgot, the other characteristic error is:
"NOHZ local_softirq_pending 08"
Comment 2 Sven Hartrumpf 2011-11-01 08:35:09 UTC
I can reproduce this problem on several independent machines.
My solution was to switch to a driver as distributed by www.realtek.com.

Why is this bug not addressed?
Comment 3 Forgotten User xRcrmyYBVX 2011-11-23 15:54:07 UTC
I'm having the same problem on an HP Pavilion DV6 notebook with this chip onboard. openSUSE 11.4 is UNUSABLE on this machine in this state!

Please fix this problem if possible!
Comment 4 Benjamin Poirier 2012-03-13 15:30:18 UTC
Simon, Sven, Joschi:

Many different rtl8111/8168 chip versions share the same PCI IDs and are
driven by the r8169 module. These chip versions are distinguished by their
so-called XID. Upon encountering an unknown XID, the r8169 driver tries one of
a few fallbacks. When using older kernels, such as the one found in openSUSE
11.4, it is often the case that these unknown XIDs belong to chip versions
newer than the ones supported by the driver. This can lead to half-working
devices like what you describe (I'm speaking from experience here ;). In
particular, openSUSE 11.4 is running a 2.6.37 kernel, and support for three
new chip versions has been introduced in r8169 since then:

01dc7fe net/r8169: support RTL8168E
v3.0-rc1

7009042 r8169: support RTL8111E-VL.
v3.1-rc1

c221892 r8169: support new chips of RTL8111F
v3.2-rc1

I would recommend that you upgrade to openSUSE 12.1, which runs a 3.1 kernel.
That will get you support for the E and EVL chips.
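
The version reasoning above can be sketched as a small lookup. This is only an illustration: the table is built from the three commits listed in this comment, the function names are made up for the example, and it assumes that a change tagged vX.Y-rc1 ships in mainline release X.Y. It does not account for fixes that a distribution may have backported into an older kernel.

```python
# First mainline release containing r8169 support for each chip, taken from
# the commit list in this comment (assumption: a change tagged vX.Y-rc1
# ships in release X.Y). Distribution backports are NOT considered.
FIRST_SUPPORTED = {
    "RTL8168E": (3, 0),
    "RTL8111E-VL": (3, 1),
    "RTL8111F": (3, 2),
}


def parse_version(release):
    """Turn '2.6.37' or '3.1.10-1.9-desktop' into a tuple of ints."""
    numeric = release.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))


def supports(release, chip):
    """Return True if a mainline kernel of this version supports the chip."""
    return parse_version(release) >= FIRST_SUPPORTED[chip]


if __name__ == "__main__":
    for kernel in ("2.6.37", "3.1.10"):
        for chip in FIRST_SUPPORTED:
            print(kernel, chip, supports(kernel, chip))
```

With this table, a 2.6.37 kernel reports no support for any of the three chips, while a 3.1 kernel covers the E and E-VL variants but not the F.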

If you'd rather stay on 11.4 (why?), you can install the kernel package alone
from 12.1:
zypper ar obs://Kernel:openSUSE-12.1/standard kotd12.1
vi /etc/zypp/zypp.conf
        # uncomment "multiversion = provides:multiversion(kernel)"
zypper dup -r kotd12.1

If you still experience issues with this network card after upgrading, please
attach your dmesg output. It should contain a line like the following, which
will help identify which chip revision you are running:
eth0: RTL8168evl/8111evl at 0xf9320000, 10:1f:74:ce:b0:17, XID 0c900800 IRQ 28

Let me know how things go, thank you.
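
For anyone collecting this information from several machines, a minimal sketch for pulling the XID line out of saved dmesg output follows. The regular expression is modelled only on the single example line in this comment, so other kernel versions may word the message differently; the function and field names are made up for the example.

```python
import re

# Pattern modelled on the one example line quoted in this comment; this is
# an assumption about the message layout, not a documented format.
XID_LINE = re.compile(
    r"\b(?P<iface>\w+): (?P<chip>\S+) at 0x[0-9a-fA-F]+, "
    r"(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}), "
    r"XID (?P<xid>[0-9a-fA-F]{8}) IRQ (?P<irq>\d+)"
)


def find_xid_lines(dmesg_text):
    """Return one dict per matching line, with iface, chip, mac, xid, irq."""
    return [match.groupdict() for match in XID_LINE.finditer(dmesg_text)]


if __name__ == "__main__":
    sample = (
        "eth0: RTL8168evl/8111evl at 0xf9320000, "
        "10:1f:74:ce:b0:17, XID 0c900800 IRQ 28"
    )
    for hit in find_xid_lines(sample):
        print(hit["iface"], hit["chip"], hit["xid"])
```

The text to scan can come from a saved log file or from the output of the dmesg command.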
Comment 5 Benjamin Poirier 2012-03-13 18:33:38 UTC
*** Bug 709886 has been marked as a duplicate of this bug. ***
Comment 6 Martin Seidler 2012-03-17 17:28:57 UTC
Compare:

1) In/against openSUSE 12.1:
[opensuse] Install help for Network driver
(Date: Wed, 14 Mar 2012 15:27:02 -0400)
http://lists.opensuse.org/opensuse/2012-03/msg00676.html
especially:
http://lists.opensuse.org/opensuse/2012-03/msg00765.html

2) Still in/against openSUSE 11.4
http://forums.opensuse.org/deutsch-german/hilfe-und-helfen/netzwerk/473306-opensuse-11-4-x86_64-realtek-onboard-nic-schafft-keinen-link-nach-dem-booten.html
(11-Mar-2012, German)
Comment 7 Benjamin Poirier 2012-03-19 15:45:38 UTC
(In reply to comment #6)
> Compare:
> 
> 1) In/against openSUSE 12.1:
> [opensuse] Install help for Network driver
> (Date: Wed, 14 Mar 2012 15:27:02 -0400)
> http://lists.opensuse.org/opensuse/2012-03/msg00676.html
> especially:
> http://lists.opensuse.org/opensuse/2012-03/msg00765.html

There is some confusion in that thread:

> As suspected your software system uses a r816*9* kernel module (...9)
> but your hardware is a Realtek [...] RTL8111/816*8*B [10ec:816*8*] (...8).

Despite its name, the r8169 module is meant to drive cards based on the Realtek 8168/8111 chips.

The difference between the two modules is that:
r8168 is a binary-only driver provided by realtek
r8169 is a community-developed and supported driver

While it is the case that r8168 usually supports newer chips first, the version of r8169 currently in openSUSE 12.1 supports all the chips I've seen in circulation so far. IMO, steering users towards r8168 is ill-advised, as it will be extremely difficult to find developers willing and able to provide support for it.

Secondly, lspci output is insufficient to determine the chip version, as many of them share a small set of PCI IDs. A first step in identifying the chip version is the (masked) XID line found in the kernel logs, as pointed out at the end of comment 4. The chip version is identified from the (unmasked) XID in rtl8169_get_mac_version():

http://lxr.linux.no/#linux+v3.1.10/drivers/net/r8169.c#L1724
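
To illustrate the kind of lookup involved, here is a rough sketch of identifying a chip by scanning a table of mask/value pairs and taking the first match, with broader entries acting as fallbacks. It assumes the driver's lookup works along these lines; the table contents and names below are hypothetical placeholders and are not copied from r8169.c, so consult the linked source for the real entries.

```python
# (mask, value, description) entries, most specific first.
# All numbers and names here are HYPOTHETICAL placeholders for illustration;
# they are not the values used by the r8169 driver.
MAC_TABLE = [
    (0xFFF00000, 0xAB100000, "example chip, revision B"),
    (0xFFF00000, 0xAB000000, "example chip, revision A"),
    # Broader mask: catches unlisted revisions of the same family.
    (0xFF000000, 0xAB000000, "example chip, unknown revision (fallback)"),
    # Catch-all: every XID matches a zero mask.
    (0x00000000, 0x00000000, "unknown chip"),
]


def identify(xid):
    """Return the description of the first table entry matching the XID."""
    for mask, value, description in MAC_TABLE:
        if (xid & mask) == value:
            return description
    return "unknown chip"  # unreachable while the catch-all entry exists


if __name__ == "__main__":
    for xid in (0xAB100123, 0xAB000800, 0xAB700000, 0x12345678):
        print(f"{xid:08x}: {identify(xid)}")
```

The ordering matters in a scheme like this: a newer revision missing from the table falls through to a broader entry for an older family member, which is one way a device can end up half-working.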
Comment 8 Benjamin Poirier 2012-03-20 13:49:13 UTC
(In reply to comment #7)
> The difference between the two modules is that:
> r8168 is a binary-only driver provided by realtek

Correction: the source for r8168 is in fact available. But it is an out-of-tree driver; it is not supported by the kernel community and it is not supported by SUSE (afaik). Thank you, Martin, for pointing this out.
Comment 9 Benjamin Poirier 2012-03-22 19:00:45 UTC
Since the release of openSUSE 12.1, openSUSE 11.4 is now getting important security and bug fixes only. From what I can tell, the problems reported here are related to support for new hardware on 11.4 so I'll go ahead and close this bug entry.

I'm sorry we could not address the reporter's problem earlier, but 12.1 has now been out for a few months and it has support for newer Realtek chip versions. If you experience kernel crashes related to the in-kernel r8169 module on 12.1, please do open a new Bugzilla entry (leave a comment here pointing to the new one if you wish) and make sure to include the XID line from dmesg in that new entry (as well as OOPS output or other relevant info).