Bug 415034

Summary: Yast network does not set up network correctly
Product: [openSUSE] openSUSE 11.1
Reporter: Thomas Renninger <trenn>
Component: YaST2
Assignee: Michal Zugec <mzugec>
Status: RESOLVED INVALID
QA Contact: Jiri Srain <jsrain>
Severity: Blocker
Priority: P5 - None
CC: bgeuken, coolo, okir, ug
Version: Factory
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments:
Yast2 logs after end of second stage. Yast2 network module show both links are down (wrong)
Yast2 logs after doing "ethtool eth0" and "ethtool eth1", entering yast network module then showed that one device has a link (correct)
Unrelated to this bug, but shortly after this error message things break -> because these packages cannot be installed, because network is broken
Extra packages needed for installation cannot be installed
The next error window
And here we have to break our automated installation...

Description Thomas Renninger 2008-08-06 13:03:17 UTC
This report is based on this latest build:
machcd2/CDs/openSUSE-11.1-Alpha1plus-DVD-x86_64-Build0016/DVD1

When installing with autoyast, the network is not set up which results in a totally broken system.

After second stage has finished, I tried to set up the network manually via yast.
But both network cards showed the status "not connected".

This is the time of my first save_y2logs, I am going to attach.

In the console I tried:
ethtool eth0
 -> link no
ethtool eth1
 -> link yes

Now I went into yast network module again and the one network card correctly showed a link.

I have no idea whether "ethtool eth1" triggered this and made it magically work, or whether there is some kind of timeout or something else.

This is the time of my second save_y2logs, I am going to attach.
This breaks our autoinstallation and thus we cannot install the latest software on our development machines. This blocks the work of the architecture team -> blocker.
Comment 1 Thomas Renninger 2008-08-06 13:09:13 UTC
Created attachment 232040 [details]
Yast2 logs after end of second stage. Yast2 network module show both links are down (wrong)
Comment 2 Thomas Renninger 2008-08-06 13:10:58 UTC
Created attachment 232041 [details]
Yast2 logs after doing "ethtool eth0" and "ethtool eth1", entering yast network module then showed that one device has a link (correct)
Comment 3 Michal Zugec 2008-08-06 13:23:42 UTC
I don't have this build, what's yast2-network version there?
Just try - this was tested and fixed in yast2-network-2.17.11:
http://mzugec.blogspot.com/2008/07/autoyast-network-device-names.html
Comment 4 Thomas Renninger 2008-08-06 14:13:08 UTC
yast2-network-2.17.14-2
Something still seems to not work.
Comment 6 Michal Zugec 2008-08-14 10:22:03 UTC
trying to reproduce
Comment 7 Michal Zugec 2008-08-14 11:03:13 UTC
No, it works fine (except for some known bootloader problems) with Alpha1. When the 2nd stage finishes, you're in the running system and the network devices are up/down according to your configuration (rcnetwork status). Please attach the <networking> section and the content of /etc/sysconfig/network/ifcfg-*.

Decreased severity to Major
Comment 8 Thomas Renninger 2008-08-14 15:06:19 UTC
This was not Alpha1, but a test build afterwards coolo asked us to test.
Let's see if things change in Alpha2, if not we have to increase severity again as we have to set up every machine by hand.
The tested build is:
/mounts/dist/machcd2/CDs/openSUSE-11.1-Alpha1plus-DVD-x86_64-Build0016/DVD1/

I'll attach some screenshots.
You still may want to log into *adalid*.
It's freshly installed -> therefore network is broken and you have to log in via serial console:
ssh root@sconsole1
cscreen
-> choose adalid
Comment 9 Thomas Renninger 2008-08-14 15:08:07 UTC
Created attachment 233488 [details]
Unrelated to this bug, but shortly after this error message things break -> because these packages cannot be installed, because network is broken
Comment 10 Thomas Renninger 2008-08-14 15:09:47 UTC
Created attachment 233489 [details]
Extra packages needed for installation cannot be installed
Comment 11 Thomas Renninger 2008-08-14 15:10:35 UTC
Created attachment 233490 [details]
The next error window
Comment 12 Thomas Renninger 2008-08-14 15:11:35 UTC
Created attachment 233491 [details]
And here we have to break our automated installation...
Comment 13 Michal Zugec 2008-08-18 09:04:15 UTC
Please test with the upcoming Alpha2.
Comment 14 Thomas Renninger 2008-08-20 17:30:35 UTC
This can still be reproduced on Alpha2.
Adalid's network now got set up by hand.

This might not be a Yast problem, but an ethtool problem. This should be found out first...
How does Yast find out whether a network cable is plugged in and the link is active?
Comment 15 Michal Zugec 2008-08-20 19:02:02 UTC
> How does Yast find out whether a network cable is plugged in and the link is active?

It uses /sys/class/net/*/carrier information
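
For anyone who wants to look at the same information by hand, here is a minimal sketch that prints the carrier value of every interface. The function name show_carrier and the SYSFS_NET variable are made up for this example and are not taken from YaST; the variable only exists so the loop can be pointed at a test directory.

```bash
#!/bin/bash
# Illustrative only: show_carrier and SYSFS_NET are made-up names, not YaST code.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print "<iface>: <carrier>" for every interface below $SYSFS_NET, or
# "<iface>: unreadable" when the carrier file cannot be read.
show_carrier() {
    local dir iface state
    for dir in "$SYSFS_NET"/*/; do
        [ -d "$dir" ] || continue
        iface=$(basename "$dir")
        if state=$(cat "$dir/carrier" 2>/dev/null); then
            echo "$iface: $state"
        else
            echo "$iface: unreadable"
        fi
    done
}
```

Calling show_carrier without further setup reads the real /sys/class/net tree.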
Comment 16 Thomas Renninger 2008-09-01 14:14:09 UTC
Hi Karsten, this is the bug where I suspect that network link detection does not work.
Machines known to be affected so far: *adalid*, *field*.
According to Karsten one cannot really trust /sys/class/net/*/carrier, or there may be delay issues (give the network driver some more time to detect a link?).

Bjorn is playing a bit with it on adalid (go ahead and double check on *field* -> same driver? same problem? ...)

BTW: We also saw problems with ethtool and therefore now activate network devices via ifup first and then use ethtool to detect the link.
I could imagine that ethtool and /sys/../carrier link detection are rather similar, or even the same?
Then yast probably also needs a workaround, like activating the network cards first?
Comment 17 Björn Geuken 2008-09-01 14:55:09 UTC
from dmesg:
NET: Registered protocol family 17
tg3: eth1: Link is up at 100 Mbps, full duplex.
tg3: eth1: Flow control is on for TX and on for RX.
NET: Registered protocol family 10

linux:~ # cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:03.1/net/eth1/carrier
cat: /sys/devices/pci0000:00/0000:00:02.0/0000:02:03.1/net/eth1/carrier: Invalid argument

output came from adalid.
Comment 18 Karsten Keil 2008-09-01 15:16:09 UTC
This seems to be normal as long as the interface has not been brought up:

gw:/usr/src/linux # ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:08:54:53:FD:03
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:221 Base address:0xa000

gw:/usr/src/linux # cat /sys/class/net/eth0/carrier
cat: /sys/class/net/eth0/carrier: Invalid argument
gw:/usr/src/linux # ifconfig eth0 up
gw:/usr/src/linux # cat /sys/class/net/eth0/carrier
0
gw:/usr/src/linux # ifconfig eth0 down
gw:/usr/src/linux # cat /sys/class/net/eth0/carrier
cat: /sys/class/net/eth0/carrier: Invalid argument
Comment 19 Thomas Renninger 2008-09-02 10:36:53 UTC
> This seems to be normal as long as the interface has not been brought up
Right, carrier is always zero on all machines if the interface is down, so I expect Yast brings the network interface up first. Maybe increasing the waiting time, to let the network driver settle down a bit, would help?

On adalid this test script shows that it takes rather long until the link
is detected (more than 1 sec, under high load possibly longer):

#!/bin/bash

for ((x=0;x<5;x++)); do
        ifconfig eth1 down;
        ifconfig eth1 up;
        sleep $x
        cat /sys/devices/pci0000:00/0000:00:02.0/0000:02:03.1/net/eth1/carrier;
done


/tmp/network_test.sh
0
0
1
1
1

Can the timeout Yast uses to let the network settle down before checking the link be increased, please?
Comment 20 Thomas Renninger 2008-09-02 11:38:24 UTC
Hmm, I mean how long should Yast wait...
IMO this should still be solved properly in the kernel.

IMO a read of the carrier sysfs file should block while a detection is in progress. Two caveats first:

1. I do not know anything about the network layer
2. I did not search for, or find, the real carrier sysfs method

But could something like this be the real solution?

static ssize_t carrier_show(struct class *cls, char *buf)
{
      unsigned long timeout = jiffies + HZ * 5;  /* 5s */

      /* netif_carrier_check() is hypothetical, xy stands for the device context */
      while (netif_carrier_check(xy->dev) &&
             timeout > jiffies) {
              /* I found netif_carrier_ok, netif_carrier_on and
                 netif_carrier_off...
                 NIC link detection in progress...
              */
              cond_resched();
      }
      return sprintf(buf, "%d", netif_carrier_ok(xy->dev));
}

Two things I am not sure about:
  - Is this allowed in a sysfs read at all, Kay?
  - The netif_carrier_check(xy->dev) is probably hard to implement?
    Maybe it could be done for the tg3 only for now?

Just an idea, but solving or working around this in Yast is probably really ugly:
Michal just confirmed: Waiting longer would block the whole application, not a real solution.
Comment 21 Bernhard Walle 2008-09-02 12:04:39 UTC
(In reply to comment #20 from Thomas Renninger)
>       while (netif_carrier_check(xy->dev) &&
>              timeout > jiffies) {

Does jiffies magic really work with 'nohz' any more (since 'jiffies' might not be updated)?
Comment 22 Karsten Keil 2008-09-02 12:21:08 UTC
No, that would be a bad idea. The driver cannot distinguish between
"no connection" and "carrier detection in progress", so it would wait forever if no cable is connected.
And the test loop from comment #19 is wrong: it restarts carrier detection in every iteration. Note that "ifconfig down" is the same as pulling the cable.
The important things are:
YaST should not access /sys/class/net/ethX/carrier before ifconfig up was done,
since this causes an error. Note that on some devices an ifconfig up does not take effect immediately in the driver; it may be delayed until the driver thread is running again. You can examine /sys/class/net/ethX/flags, where bit 0 shows the up/down status.
If YaST reads 0 for carrier it should retry for at least 3 seconds, but some cards (and switches) may need more time.
One idea would be to do ifconfig up on all found interfaces as early as possible,
do something else and then test the carrier state.
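
A rough sketch of the retry described above: poll carrier once per second for a limited time and never take the interface down in between, so detection is not restarted. The names wait_for_carrier and SYSFS_NET are invented for this example, and the caller is assumed to have run "ifconfig ethX up" already. This is only an illustration, not how YaST does it.

```bash
#!/bin/bash
# Illustrative only: wait_for_carrier and SYSFS_NET are made-up names.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# wait_for_carrier IFACE TIMEOUT_SECONDS
# Prints 1 and returns 0 as soon as carrier reads 1.
# Prints 0 and returns 1 if that does not happen within TIMEOUT_SECONDS.
# The interface is assumed to be up already and is never taken down here.
wait_for_carrier() {
    local iface=$1 timeout=$2 waited=0 state
    while :; do
        # The read fails with "Invalid argument" while the interface is down.
        state=$(cat "$SYSFS_NET/$iface/carrier" 2>/dev/null) || state=""
        if [ "$state" = "1" ]; then
            echo 1
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo 0
    return 1
}
```

For example, "wait_for_carrier eth1 3" would give the card up to three seconds, the minimum suggested above.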
Comment 23 Thomas Renninger 2008-09-02 12:36:16 UTC
That does not work.
Same script, extended to read flags and carrier:

/tmp/network_test.sh
Waiting for 0 seconds...
carrirer: 0
flags: 0x1003
Waiting for 1 seconds...
carrirer: 0
flags: 0x1003
Waiting for 2 seconds...
carrirer: 1
flags: 0x1003
Waiting for 3 seconds...
carrirer: 1
flags: 0x1003
Waiting for 4 seconds...
carrirer: 1
flags: 0x1003
Comment 24 Thomas Renninger 2008-09-02 12:40:03 UTC
> You can examine /sys/class/net/ethX/flags, Bit0 shows up/down status
One second... It says connected :)

cat /sys/class/net/eth1/flags
0x1003
linux:~ # cat /sys/class/net/eth0/flags
0x1002

So they should use /sys/class/net/ethX/flags instead of carrier?
Comment 25 Karsten Keil 2008-09-02 12:52:08 UTC
No.
Bit 0 of /sys/class/net/ethX/flags only gives the up/down status of the interface;
you could check it to verify that an ifconfig up was given and executed by the driver.
Detection of the carrier takes time after ifconfig up. Some cards/switches are quick (< 2 sec), some need > 10 sec, and you cannot do anything about that.
The issue is that newer hardware powers down the PHY interface until it is enabled with ifconfig up. Some other devices enable the PHY interface at driver load (I think these are the "quick" ones), but this is not acceptable
as a general solution because of power saving.
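
To make the flags check concrete, a small sketch that decodes bit 0 of the values seen in comment #24. The helper name iface_state is made up for this example. It only tells whether the interface is administratively up and says nothing about the carrier.

```bash
#!/bin/bash
# Illustrative only: iface_state is a made-up helper name.
# iface_state FLAGS: print "up" if bit 0 of FLAGS is set, otherwise "down".
iface_state() {
    local flags=$1
    if (( flags & 0x1 )); then
        echo up
    else
        echo down
    fi
}

iface_state 0x1003   # prints "up"   (eth1 in comment #24)
iface_state 0x1002   # prints "down" (eth0 in comment #24)
```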
Comment 26 Thomas Renninger 2008-09-02 13:39:10 UTC
Some workaround ideas:

    1) ifup all the network interfaces by another program earlier in the 2nd
       stage install/setup boot.
       Is not nice, because the yast lan module will still be broken stand alone

    2) Wait the same amount as currently if one or more carrier files show 1
       -> link detected. It is then expected that:
         a) We have something to use for installation -> not that bad if another
            link is not detected. Also cards of the same type should be ready.
         b) If no link is detected at all, at least wait 5 secs (should not be
            that often).

    3) Re-evaluate carrier link after Yast lan is fully started and adjust
       things to the user
       -> Probably very hard to implement in Yast?

Best would be 2+3, this would be fully satisfactory, but 3 may not be possible, as Yast could already have made assumptions and displayed things based on the result of the detection?
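
Idea 2 could look roughly like the following sketch: stop waiting as soon as any interface shows a link, and only use the full timeout when none does. The names count_links, wait_for_any_link and SYSFS_NET are invented for this example, and all interfaces are assumed to be up already. It is not based on the actual Yast code.

```bash
#!/bin/bash
# Illustrative only: count_links, wait_for_any_link and SYSFS_NET are made-up names.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print how many interfaces currently report carrier 1.
count_links() {
    local f n=0
    for f in "$SYSFS_NET"/*/carrier; do
        if [ "$(cat "$f" 2>/dev/null)" = "1" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# wait_for_any_link MAX_SECONDS
# Returns 0 as soon as at least one interface has a link,
# 1 if no link shows up within MAX_SECONDS.
wait_for_any_link() {
    local max=$1 waited=0
    while [ "$(count_links)" -eq 0 ]; do
        if [ "$waited" -ge "$max" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With "wait_for_any_link 5" the 5 second wait from 2b) is only paid in full when no cable is detected at all.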

I am off from discussion. I do not know enough in this area..., just some ideas.
But it seems that, besides our auto-installation not working on several systems, we hit a severe bug here (especially Karsten's power saving assumptions make me nervous: that would mean link detection could take even longer on upcoming cards if they take more care about power consumption?)
Comment 28 Thomas Renninger 2008-09-03 15:32:37 UTC
It turned out that Yast and everything else works rather well...

The additional network autoyast conf:
<keep_install_network config:type="boolean">true</keep_install_network>
seems to do the trick.

Thanks a lot to Michal Zugec for tracking it down and to Uwe Gansert for pointing to the above param.
Comment 29 Olaf Kirch 2008-09-03 15:46:05 UTC
Is this documented in some prominent place in the autoyast documentation?