Bug 325995

Summary: Recent HP workstations need pci=nommconf - Only needed with the internal graphics card; with an external graphics card added, the machines boot
Product: [openSUSE] openSUSE 11.0 Reporter: Thomas Renninger <trenn>
Component: Kernel Assignee: Greg Kroah-Hartman <gregkh>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: asadeghpour, bernard.delley, bjorn.helgaas, bryan.christ, coolo, forgotten_FOUTW3E5Ow, henry.su, otto.hase, ric, richard.zhao, sbahling, trenn
Version: Alpha 2   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: This change from 2.6.17 to 2.6.18 breaks things

Description Thomas Renninger 2007-09-18 13:16:22 UTC
Last message where it hangs (with already some debug enabled):

 nsutils-0869 [00] ns_get_node           : _SEG, AE_NOT_FOUND
   utils-0273 [00] evaluate_integer      : Evaluate [\_SB_.PCI0._SEG]: AE_NOT_FOUND
pci_root-0226 [00] pci_root_add          : Assuming segment 0 (no _SEG)
 nsutils-0869 [00] ns_get_nod   : Evaluate [\_SB_.PCI0._BBN]: AE_NOT_FOUND
pci_root-0247 [00] pci_root_add          e [PCI0] (0000:00)

Looks like the PCI rooting information is wrongly exported or wrongly parsed.
Problem: pci=noacpi also hangs (with an oops or something, will address this later). Therefore I have not found a way yet to bring the machine up at all.

I will search for a BIOS update first, before possibly doing work for nothing...
Then I will try to extract the ACPI tables via serial console. I hope that works, I have never done this before...
Comment 2 Thomas Renninger 2007-09-18 14:22:37 UTC
10.2 is also not booting, hanging at exactly the same point.
The machine continues booting when pci=nommconf is passed (I didn't test how far, but it looks good).
This very much smells like a BIOS issue.
If there are no objections, I am going to upgrade the BIOS to a much newer revision now. If we still have problems at this point, I am going to add Andi, he knows more about mmconf issues...
Comment 4 Thomas Renninger 2007-09-18 14:31:30 UTC
Adding Andi -> he might be interested
Reducing severity, this has nothing to do with the MSI boards, workaround is found...
Comment 10 Stephan Kulow 2007-11-30 10:17:31 UTC
should we close this bug?
Comment 11 Thomas Renninger 2007-11-30 10:30:48 UTC
It could be a duplicate of this one:
https://bugzilla.novell.com/show_bug.cgi?id=331027
I just did not find the time...

I close it for now.
Thomas, Christian or whoever will install the machine at some time:
  - A BIOS update certainly would be a good idea.
  - If the initial boot hangs, it should install fine with the pci=nommconf
    parameter.
  - Once the machine is installed and the kernel has been upgraded to the
    latest one, it would be worth trying without the boot option; if that
    works, this bug was a duplicate (see above).
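
Whether a given boot used the workaround can be read off the kernel command line. The following is only a minimal sketch in Python, assuming the usual convention that pci= takes a comma-separated option list. The helper name has_pci_nommconf and the example command line are made up for illustration; on a running system the string would come from /proc/cmdline.

```python
def has_pci_nommconf(cmdline: str) -> bool:
    """Return True if a pci= parameter on the command line lists nommconf."""
    for param in cmdline.split():
        key, sep, value = param.partition("=")
        # pci= may carry several options, e.g. pci=noacpi,nommconf
        if key == "pci" and sep and "nommconf" in value.split(","):
            return True
    return False


# Hypothetical command line, for illustration only
example = "root=/dev/sda2 pci=nommconf splash=silent"
print(has_pci_nommconf(example))
```

If this reports False on an installed system that boots fine with the latest kernel, the workaround is no longer needed there.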

If someone finds the time, it would be great to add the machine type to the bug's title. That would make it easier for people with the same machine and problem to get 10.3 running...
Comment 12 Thomas Renninger 2008-01-10 16:37:00 UTC
Stefan now also has an HP machine showing this issue.
Stefan, could you please search for a BIOS update for that machine and try again without pci=nommconf?

This looks important... we probably could also get some help from HP here.
Comment 13 Stefan Dirsch 2008-01-11 18:14:15 UTC
Unfortunately I can't provide any feedback as long as no BIOS update is available. :-( I will do once I can.
Comment 14 Stefan Dirsch 2008-01-12 05:20:05 UTC
I'll provide feedback once I can test a BIOS update.
Comment 18 Thomas Renninger 2008-01-22 11:00:07 UTC
Thanks, Scott, for the new BIOSes. Unfortunately they did not help; I will start digging into the mm config table, or wherever this leads, asap.
Comment 19 Stefan Dirsch 2008-01-22 13:48:22 UTC
Unfortunately I can't reproduce this issue any more on my machine, neither with the openSUSE 10.3-x86 default-kernel nor with the current STABLE-x86_64 default-kernel (2.6.24-rc8-git2-3-default). Latest changelog entry of 2.6.24-rc8-git2-3-default:

* Mo Jan 21 2008 aj@suse.de
- Remove unused config/s390/rt.
Comment 20 Thomas Renninger 2008-01-22 14:47:55 UTC
Stefan Dirsch discovered something important:
Adding an external PCI Express graphics card makes the machine boot fine.
I verified this on the second HP desktop machine with another PCI Express graphics card.
A network PCIe card does not make it boot.

I can try to extract ACPI tables (it should even be possible to extract them on a broken machine)...
But I expect now is the right time to get some HP developers on board.
Scott?

Bjorn->I am still on the pnp patches..., I updated and discovered yet another problem..., there will be a post really soon...
Comment 21 Scott Bahling 2008-01-22 15:05:24 UTC
Which system has the issue? The dc7800 or dc5800 or both?

The dc7800 is certified for SLE10 SP1 and I installed SLED10 SP1 several times without issues. Does this only show up with later kernels?
Comment 22 Thomas Renninger 2008-01-22 23:03:55 UTC
Yes, both.
No, it is only 10.3 and newer (up to 11.0 Alpha1). I tried a vanilla kernel with the same/similar config and it boots. So it seems to be a SUSE-specific patch, I will try to find it...
Comment 23 Thomas Renninger 2008-01-31 13:43:55 UTC
Not sure. I thought that by booting kernel-vanilla (SUSE rpm package) I had a working kernel..., but it seems I was wrong or it was something else...

I now went back from latest kernel 2.6.24 to 2.6.17.
2.6.17 was the first kernel booting.
I used plain vanilla sources (not the very latest Stable ones, hmm, should have done this, but it should not really matter). I mainly used SUSE derived kernel configs, but also used vanilla default kernel config on 2.6.18 (also not working).

2.6.17 in serial console shows:
ACPI: bus type pci registered
PCI: BIOS Bug: MCFG area is not E820-reserved
PCI: Not using MMCONFIG.

Which seems to be the reason why this one is booting.
So this very much smells like an HP bug.
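
The two 2.6.17 messages suggest a check of roughly this kind: MMCONFIG is used only if the address range announced in the ACPI MCFG table lies inside an E820 region marked as reserved. The Python sketch below illustrates that idea only and is not the kernel's code. The names E820Entry and mcfg_is_e820_reserved, the simplification to a single covering entry, and all addresses in the example are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

E820_RAM = 1        # usable memory
E820_RESERVED = 2   # reserved by the firmware


@dataclass
class E820Entry:
    start: int  # first byte of the region
    size: int   # length in bytes
    type: int   # E820_RAM, E820_RESERVED, ...


def mcfg_is_e820_reserved(e820: List[E820Entry], start: int, size: int) -> bool:
    """True if [start, start + size) lies entirely inside one reserved entry.

    Simplification: a range spanning several adjacent reserved entries
    is not recognised here.
    """
    end = start + size
    for entry in e820:
        if entry.type != E820_RESERVED:
            continue
        if entry.start <= start and end <= entry.start + entry.size:
            return True
    return False


# Hypothetical memory map and MCFG window (256 buses, 1 MiB each)
e820_map = [
    E820Entry(0x00000000, 0x0009F000, E820_RAM),
    E820Entry(0xFEC00000, 0x00100000, E820_RESERVED),
]
if not mcfg_is_e820_reserved(e820_map, 0xE0000000, 256 << 20):
    print("PCI: BIOS Bug: MCFG area is not E820-reserved")
    print("PCI: Not using MMCONFIG.")
```

With a check like this, a BIOS that announces an MCFG window without reserving it in the E820 map makes the kernel fall back to the conventional configuration access method, which matches the behaviour seen with pci=nommconf.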

I just wonder: (nearly?) every HP workstation with an internal graphics card really seems to be affected (I saw this on 3 of 3 systems...), and the bug in mainline goes back to kernel 2.6.17 (the first one that works). Can this really be?!?

Even if this seems to be a BIOS bug, I expect we want to have the check:
PCI: BIOS Bug: MCFG area is not E820-reserved
PCI: Not using MMCONFIG.
in our kernels?

I am going to dig it out..., hmm, this should be something for Arjan, who does not have a Novell Bugzilla account. I'll come back.
Comment 24 Scott Bahling 2008-01-31 14:18:39 UTC
Bryan, has HP seen this issue with later upstream and distro kernels?

(this is related to the firmware updates I requested - the updates did not help)
Comment 25 Thomas Renninger 2008-01-31 15:04:19 UTC
Created attachment 192558
This change from 2.6.17 to 2.6.18 breaks things
Comment 26 Bryan Christ 2008-01-31 15:33:25 UTC
Scott,

One of the engineers in ISS worked on a patch for this and I believe his work has been submitted upstream.  If not, maybe he could send you the patch.  Would you like for me to investigate?

Bryan
Comment 27 Scott Bahling 2008-01-31 16:25:31 UTC
yes, please let us know the status. Thanks.
Comment 28 Bryan Christ 2008-01-31 16:40:13 UTC
Scott,

Here is what Tony had to say:

"My patch did get pushed upstream, but it has not been accepted.

The submission of my patch generated a lot of discussion on LKML, and
several alternatives have been suggested, one which I must admit is
better than mine.

The maintainer, Greg Kroah-Hartman is leaning towards a patch that was
admitted into the -mm stream, but nobody seems to like that one. It is
awful, in that it requires drivers to make a kernel call to enable
MMCONFIG.

My patch was accepted into RH and is working as advertised.

If you give me a pointer to the kernel sources for Novell, I would be
happy to create a patch for it, if they are willing to take it while
waiting for the upstream resolution.

If you log onto the LKML and search for MMCONF since Dec 19, 2007, you
will see all the discussion about it."
Comment 29 Thomas Renninger 2008-01-31 17:42:40 UTC
Puhh, I really don't want to go through these hundreds of posts if not really necessary.

Greg, can you take care of this?
A two-sentence explanation/summary of the problem would be nice :)

Is there something we could/should add to 10.3? (pci=nommconf works, so this is not severe; not adding anything sounds better than risking breakage.)

I readjust the product to 11.0 now, this is more important IMO.
Comment 30 Greg Kroah-Hartman 2008-02-04 15:12:02 UTC
{sigh}

This is a long and complex path of patches, broken patches, broken hardware, and lots of other mess.

In short, this is not yet resolved upstream, and I would strongly hesitate to accept any patch into our kernel tree yet.

But, as this is an 11.0 issue, I think it will be resolved by then (hopefully), so I'll just take the bug and continue to work on the upstream issue.

And no, I don't want to take the Red Hat patch, because that is not a potential one for upstream, there are 2 others that might be used instead.
Comment 31 Alexey Starikovskiy 2008-02-04 16:22:27 UTC
*** Bug 246646 has been marked as a duplicate of this bug. ***
Comment 32 Alexey Starikovskiy 2008-02-06 16:29:16 UTC
*** Bug 328471 has been marked as a duplicate of this bug. ***
Comment 33 Greg Kroah-Hartman 2008-03-03 20:17:13 UTC
*** Bug 362588 has been marked as a duplicate of this bug. ***
Comment 34 Greg Kroah-Hartman 2008-03-03 20:17:51 UTC
This is now fixed in our kernel-of-the-day and will be in the next alpha.

If not, please reopen this bug.