Bug 197190

Summary: IA64: Xserver shutdown freezes machine w/ R420 + flat panels
Product: [openSUSE] openSUSE 10.2 Reporter: Andreas Schwab <schwab>
Component: X.Org Assignee: Stefan Dirsch <sndirsch>
Status: RESOLVED FIXED QA Contact: Stefan Dirsch <sndirsch>
Severity: Major    
Priority: P2 - High CC: edwardsg, eich, forgotten_JLAh78sutA, gp, jlim, sbahling, sndirsch, suse-beta, susegfx
Version: Alpha 2plus   
Target Milestone: ---   
Hardware: IA64   
OS: Other   
See Also: http://bugworks.engr.sgi.com/query.cgi/959965
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Bug Depends on: 197572    
Bug Blocks: 248195    
Attachments: Xorg.0.log
strace X
Fix INT10 compiler options.
salinfo oem rpm (decodes firmware blob in mca records)
xorg.conf for dual-X2 prism
MCA dump
MCA dump
bogus int10 patch
p_pci-off-by-one.diff
x86emu.diff

Description Andreas Schwab 2006-08-04 15:20:31 UTC
See log files.
Comment 1 Andreas Schwab 2006-08-04 15:20:56 UTC
Created attachment 95237 [details]
Xorg.0.log
Comment 2 Andreas Schwab 2006-08-04 15:22:30 UTC
Created attachment 95238 [details]
strace X
Comment 3 Andreas Schwab 2006-08-04 15:24:14 UTC
# ll /sys/bus/pci/devices/
total 0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:00:01.0 -> ../../../devices/pci0001:00/0001:00:01.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:00:02.0 -> ../../../devices/pci0001:00/0001:00:02.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:00:03.0 -> ../../../devices/pci0001:00/0001:00:03.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:00:04.0 -> ../../../devices/pci0001:00/0001:00:04.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:01:02.0 -> ../../../devices/pci0001:00/0001:00:02.0/0001:01:02.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:01:02.1 -> ../../../devices/pci0001:00/0001:00:02.0/0001:01:02.1
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0001:01:02.2 -> ../../../devices/pci0001:00/0001:00:02.0/0001:01:02.2
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0002:00:01.0 -> ../../../devices/pci0002:00/0002:00:01.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0003:00:01.0 -> ../../../devices/pci0003:00/0003:00:01.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0003:00:01.1 -> ../../../devices/pci0003:00/0003:00:01.1
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0005:00:00.0 -> ../../../devices/pci0005:00/0005:00:00.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0005:00:00.1 -> ../../../devices/pci0005:00/0005:00:00.1
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0006:00:00.0 -> ../../../devices/pci0006:00/0006:00:00.0
lrwxrwxrwx 1 root root 0 Aug  4 17:20 0006:00:00.1 -> ../../../devices/pci0006:00/0006:00:00.1
Comment 4 Stefan Dirsch 2006-08-04 15:53:48 UTC
> (II) Addressable bus resource ranges are
> (II) OS-reported resource ranges:
> (II) OS-reported resource ranges after removing overlaps with PCI:
> (II) All system resource ranges:

I think we're still missing some IA64 specific patches here. This is on my TODO list. Thanks for the reminder. :-)
Comment 5 Stefan Dirsch 2006-08-05 08:32:41 UTC
I wonder if it is this one? Not sure if brouwer is such a system.

-------------------------------------------------------------------
Fri Mar 24 16:48:46 CET 2006 - sndirsch@suse.de

- p_pci-ce-x.diff:
  * fixes PCI bus scanning on CE systems (pci-pci bridges)
    (Bug #147261)

/work/SAVE/oldpackages/stable/xorg-x11/xorg-x11-20060801/p_pci-ce-x.diff

Needs to be applied to xorg-x11-server package.

Otherwise you will need to wait until I have adjusted about 70 patches to the new X.Org packages. This will take some time ...
Comment 6 Stefan Dirsch 2006-08-23 20:19:21 UTC
I think I was wrong. Isn't it the domain support that was still missing?

Andreas, I've added you to the Cc list of Bug #197572, so you're up to date on the PCI/IA64 patches we're currently preparing. The PCI domain patch is still completely untested and not yet applied to our xorg-x11-server package.
Comment 7 Stefan Dirsch 2006-08-25 09:25:40 UTC
This PCI scan issue is resolved by now (by the PCI domain patch of Bug #197572) - I tried the new RPMs, but now we run into a different problem - at least on machine 'brouwer'.

(gdb) run
Starting program: /usr/bin/Xorg 

X Window System Version 7.1.0
Release Date: Fri Aug 25 08:50:06 UTC 2006
X Protocol Version 11, Revision 0, Release 7.1
Build Operating System: openSUSE SUSE LINUX
Current Operating System: Linux brouwer 2.6.18-rc4-2-default #1 SMP Tue Aug 8 09:58:49 UTC 2006 ia64
Build Date: 25 August 2006
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
Module Loader present
Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Aug 25 11:22:17 2006
(==) Using config file: "/etc/X11/xorg.conf"
[tcsetpgrp failed in terminal_inferior: Operation not permitted]
(EE) Failed to load module "rfbkeyb" (module does not exist, 0)
(EE) Failed to load module "rfbmouse" (module does not exist, 0)
(WW) RADEON: No matching Device section for instance (BusID PCI:1536:0:1) found
(WW) RADEON: No matching Device section for instance (BusID PCI:1280:0:1) found
(**) RADEON(0): RADEONPreInit

Program received signal SIGILL, Illegal instruction.
0x20000000005cfb80 in _inb (port=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/ia64/ioperm.c:156
156       ret = *addr;
(gdb) bt
#0  0x20000000005cfb80 in _inb (port=<value optimized out>)
    at ../sysdeps/unix/sysv/linux/ia64/ioperm.c:156
#1  0x40000000000dbab0 in inb (port=972) at ./../shared/ia64Pci.c:134
#2  0x2000000004e164f0 in stdReadMiscOut (hwp=<value optimized out>)
    at vgaHW.c:262
#3  0x2000000004e166b0 in vgaHWGetIOBase (hwp=0x6000000000182f10)
    at vgaHW.c:1810
#4  0x2000000000cf43c0 in RADEONPreInit ()
   from /usr/lib/xorg/modules/drivers//radeon_drv.so
#5  0x40000000000b5910 in InitOutput (pScreenInfo=0x600000000014b0a8, argc=1, 
    argv=0x607ffffffec3ed98) at xf86Init.c:596
#6  0x4000000000039ba0 in main (argc=1, argv=0x607ffffffec3ed98, 
    envp=0x607ffffffec3eda8) at main.c:369
(gdb) quit
Comment 8 Andreas Schwab 2006-09-01 09:47:57 UTC
hw/xfree86/os-support/shared/ia64Pci.c:
 * We use special in/out routines here since Altix platforms require the
 * use of the sysfs legacy_io interface.  The legacy_io file maps to the I/O
 * space of a given PCI domain; reads and writes are used to do port I/O.
 * The file descriptor for the file is stored in the upper bits of the
 * value passed in by the caller, and is created and populated by
 * xf86MapDomainIO.

hw/xfree86/os-support/bus/linuxPci.c:
 * xf86MapDomainIO - map I/O space in this domain
 *
 * Each domain has a legacy ISA I/O space.  This routine will try to
 * map it using the Linux sysfs legacy_io interface.  If that fails,
 * it'll fall back to using /proc/bus/pci.


Breakpoint 1, xf86MapDomainIO (ScreenNum=0xffffffff, Flags=0x2, Tag=0x5000000, 
    Base=0x0, Size=0x1) at linuxPci.c:723
723         int domain = xf86GetPciDomain(Tag);
(gdb) c
Continuing.

Breakpoint 3, linuxOpenLegacy (Tag=0x5000000, 
    name=0x4000000000330a48 "legacy_io") at linuxPci.c:636
636         if (!path)
(gdb) u 647
linuxOpenLegacy (Tag=<value optimized out>, 
    name=0x4000000000330a48 "legacy_io") at linuxPci.c:647
647             fd = open(path, O_RDWR);
(gdb) p path
$6 = 0x60000000001779e0 "/sys/class/pci_bus/0000:00/legacy_io"
                                            ^^^^^^^

# ll /sys/class/pci_bus/
total 0
drwxr-xr-x 2 root root 0 Sep  1 11:40 0001:00
drwxr-xr-x 2 root root 0 Aug 31 14:02 0001:01
drwxr-xr-x 2 root root 0 Aug 31 14:02 0002:00
drwxr-xr-x 2 root root 0 Aug 31 14:02 0003:00
drwxr-xr-x 2 root root 0 Aug 31 14:02 0004:00
drwxr-xr-x 2 root root 0 Aug 31 14:02 0005:00
drwxr-xr-x 2 root root 0 Aug 31 14:02 0006:00
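
For reference, the path format visible in the gdb session above can be reproduced with a small helper. This is a minimal sketch; legacy_path is a hypothetical name, not the function used in linuxPci.c, and the format string is inferred from the path printed by gdb and the directory listing.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Format the sysfs path of a per-bus legacy file (for example "legacy_io")
 * from a PCI domain and bus number.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int legacy_path(char *buf, size_t len, unsigned domain, unsigned bus,
                const char *name)
{
    int n = snprintf(buf, len, "/sys/class/pci_bus/%04x:%02x/%s",
                     domain, bus, name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With domain 0 and bus 0 this yields the nonexistent /sys/class/pci_bus/0000:00/legacy_io seen above; with domain 5 it yields a path under 0005:00, which does exist in the listing.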
Comment 9 Stefan Dirsch 2006-09-01 12:14:08 UTC
Obviously domain is wrong. Probably the domain support in linuxOpenLegacy() needs to be changed:

[...]
        domain = xf86GetPciDomain(Tag);
        bus = PCI_BUS_NO_DOMAIN(PCI_BUS_FROM_TAG(Tag));

Other functions use

       domain = PCI_DOM_FROM_TAG(tag);
       bus  = PCI_BUS_NO_DOMAIN(PCI_BUS_FROM_TAG(tag));

instead. Would be worth a try ...
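
The Tag value 0x5000000 from the gdb session in comment #8 and the BusIDs PCI:1536:0:1 and PCI:1280:0:1 from comment #7 are both consistent with the domain being stored directly above an 8-bit bus number. A minimal sketch under that assumption (the bit layout and the helper names are assumptions for illustration, not copied from the X server headers):

```c
#include <stdint.h>

/*
 * Assumed tag layout for this sketch:
 *   bits 24-31 domain, bits 16-23 bus, bits 11-15 device, bits 8-10 function
 */
unsigned tag_domain(uint32_t tag) { return (tag >> 24) & 0xffu; }
unsigned tag_bus(uint32_t tag)    { return (tag >> 16) & 0xffu; }
unsigned tag_device(uint32_t tag) { return (tag >> 11) & 0x1fu; }
unsigned tag_func(uint32_t tag)   { return (tag >> 8) & 0x07u; }

/* Split a combined "domain and bus" number as it appears in a BusID. */
unsigned busid_domain(unsigned dombus) { return dombus >> 8; }
unsigned busid_bus(unsigned dombus)    { return dombus & 0xffu; }
```

Decoded this way, the tag 0x5000000 refers to domain 5, bus 0, and the bus numbers 1536 and 1280 refer to bus 0 in domains 6 and 5, which matches the devices 0006:00:00.1 and 0005:00:00.1 listed in comment #3.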
Comment 10 Stefan Dirsch 2006-09-01 14:19:24 UTC
Andreas, how is comment #8 related to comment #7?
Comment 11 Andreas Schwab 2006-09-01 14:22:29 UTC
How about reading the comment? "Altix platforms require the use of the sysfs legacy_io interface".
Comment 12 Stefan Dirsch 2006-09-01 14:34:54 UTC
Are you sure you debug the issue in comment #7?
Comment 13 Andreas Schwab 2006-09-01 14:43:55 UTC
Of course I am.
Comment 14 Stefan Dirsch 2006-09-01 14:47:34 UTC
Ok. I'm asking since it looks completely different to me. It looks more like the initial issue to me and not the SIGILL.
Comment 15 Andreas Schwab 2006-09-01 14:51:53 UTC
They are obviously the same issue.  Just read the comments in the cited files.
Comment 16 Stefan Dirsch 2006-09-01 15:04:03 UTC
You're right. After setting domain to "1" in linuxOpenLegacy() via gdb, the X server successfully froze the machine in RADEONPreInit instead of terminating with SIGILL ...
Comment 17 Andreas Schwab 2006-09-01 15:08:41 UTC
Entered OS MCA handler. PSP=20000000fff211a0 cpu=0 monarch=1
cpu 0, MCA inconsistent r12 and r13, original stack not modified
OS MCA slave did not rendezvous on cpu 1
  1 out of 2 cpus in kdb, waiting for the rest, timeout in 10 second(s)
...1 cpu is not in kdb, its state is unknown

Entering kdb (current=0xe000003007138000, pid 0) on processor 0 due to KDB_ENTER()
Comment 18 Stefan Dirsch 2006-09-01 15:36:18 UTC
Ok. I tried my patch from comment #9. It didn't change anything.
Additionally, it crashed the machine again. I had better stop debugging for today. :-)
Comment 19 Stefan Dirsch 2006-09-11 12:33:07 UTC
Reassigning to our current PCI domain specialist. :-)
Comment 20 Matthias Hopf 2006-09-29 16:43:32 UTC
Have been testing IA64 support on chandrasekhar:

Ok, according to a first test at least the PCI domain support does work now. The cards are correctly detected.
However, the machine crashes deep down in RADEONPreInit() if I let xf86GetPciDomain return the apparently correct domain number. All of the domain patches have originally been created by SGI, right?

Egbert is definitely more fluent in this area...
Comment 22 Matthias Hopf 2006-10-05 17:28:27 UTC
FWIW xf86GetPciDomain() seems to be supposed to return the domain of the host bridge. Contrary to Stefan's opinion I'm not a PCI specialist at all, so I might need some help from Egbert to understand what xf86GetPciDomain() is *exactly* supposed to return.

(In reply to comment #8)
> (gdb) u 647
> linuxOpenLegacy (Tag=<value optimized out>, 
>     name=0x4000000000330a48 "legacy_io") at linuxPci.c:647
> 647             fd = open(path, O_RDWR);
> (gdb) p path
> $6 = 0x60000000001779e0 "/sys/class/pci_bus/0000:00/legacy_io"
>                                             ^^^^^^^

This is probably due to pciBusInfo_t not being filled out correctly. I need more meta information for that. Egbert?

Of course, this is only the first part of this issue. The machine freeze is yet to be debugged.
Comment 23 Matthias Hopf 2006-10-05 18:59:59 UTC
The machine finally freezes in xf86InitInt10() if xf86GetPciDomain() is instructed to return the domain of the Gfx hardware. I'm still not convinced that this is correct.

I think this basically happens because the special bridge scanning code for Altix isn't called, because the chipset is no longer detected.

Why? The scanning code explicitly asks for some device on bus 0, and invokes pciReadLong(), which explicitly does nothing if bus == 0.
As this function hasn't been changed since 2003, I wonder how this could ever have worked... I guess this is a long-standing bug.

Even if you skip that check it won't work, as the code tries to open the device in domain 0, which does not exist.

Ok, if you fix this check, it still segfaults due to not using legacy_io. If you fix a (probably) obvious bug in xf86GetPciDomain(), the machine freezes again in xf86InitInt10(). Only this time the PCI maps look much better...
Comment 24 Matthias Hopf 2006-10-10 16:36:15 UTC
Update: Machine freezes in xf86ReadBIOS(). So this is not an x86 emulator issue (yet).
Comment 25 Matthias Hopf 2006-10-12 17:18:00 UTC
Int vector is mapped with
    mmap (0, 0x4000, PROT_READ, MAP_SHARED, "/dev/mem", 0)

The machine freezes on the memcpy operation
    (void)memcpy(Buf, (void *)(ptr + Offset), Len);
directly after the mmap.
Buf = some memory location, ptr=return from mmap(), Offset = 0, Len=0x600.

This is probably something for SGI to investigate, as it seems to involve either kernel space or some wrong PCI initialization.
BTW - is mapping /dev/mem the right way on a domain based machine?
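
The access pattern described above, with error checking added, looks roughly like this. It is a sketch only: read_mapped is a hypothetical helper, not xf86ReadBIOS(), and the path is a parameter so the same code can be pointed at an ordinary file.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map map_len bytes of `path` read-only at file offset `base`, then copy
 * `len` bytes starting `offset` bytes into the mapping to `buf`.
 * Returns 0 on success, -1 on any failure.
 */
int read_mapped(const char *path, off_t base, size_t map_len,
                size_t offset, void *buf, size_t len)
{
    int fd;
    void *ptr;

    if (offset > map_len || len > map_len - offset)
        return -1;                 /* request does not fit in the mapping */

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ptr = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, base);
    close(fd);                     /* the mapping stays valid without the fd */
    if (ptr == MAP_FAILED)
        return -1;

    memcpy(buf, (const char *)ptr + offset, len);
    munmap(ptr, map_len);
    return 0;
}
```

The values reported above correspond to read_mapped("/dev/mem", 0, 0x4000, 0, Buf, 0x600). Note that none of these checks would prevent the freeze, since the mmap itself succeeds and the machine hangs during the copy.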

Stefan, are the patched xorg packages already available in stable? Have to test them first...
Comment 26 Stefan Dirsch 2006-10-12 17:42:10 UTC
(In reply to comment #25)
> Stefan, are the patched xorg packages already available in stable? Have to
> test them first...
Yes.


Comment 27 Matthias Hopf 2006-10-13 12:34:26 UTC
Ok, just verified that all patches are included.

Xorg from SLES10 works in a chroot environment, so it is not the kernel at fault.
Comment 28 Matthias Hopf 2006-10-18 15:51:16 UTC
Created attachment 101924 [details]
Fix INT10 compiler options.

Another patch needed for IA64 support. Up to now the INT10 module was compiled with -D_PC for *all* architectures (not only ix86 and x86_64).
Comment 29 Matthias Hopf 2006-10-19 16:49:03 UTC
The server works fine now if the dri and glx modules are not loaded. Strangely enough, Xorg 7.2 loads dri and glx even if no 'Load "dri"' is given in the config file, but this is a different bug.

The server still crashes the machine on shutdown, though. Will investigate this after PPC (bug #202133).
Comment 30 Greg Edwards 2006-10-24 15:59:00 UTC
Matthias, what hw were you seeing the crash on shutdown on?  What is the prom
version (cat /proc/sgi_prominfo/node*/version)?

I've been running xorg-x11-server-7.2-4 (factory) + the patch from comment #28
on my prism deskside box without any problems for a couple days now.  Great
work!

It'd be nice to get this stuff upstream, so we don't have to continually
struggle with this.  Sorry our guys dropped the ball on that one.  We have
an (old) xorg bug with the previous pci domain patches at:

Bug 5000 Domain support does not work for SGI Altix machines
https://bugs.freedesktop.org/show_bug.cgi?id=5000

Should we use this bug, and obsolete the old patches?  You did all the work
on this, so it's your call how you'd like to handle it.
Comment 31 Andreas Schwab 2006-10-24 16:18:09 UTC
I have recently upgraded the prom to 4.61 (was 4.56 before).
Comment 32 Greg Edwards 2006-10-24 16:23:19 UTC
Andreas, good - that's the latest and recommended one.  I was testing with 4.61
as well.
Comment 33 Andreas Schwab 2006-10-24 20:24:14 UTC
Yet it still crashes the system.
Comment 34 Greg Edwards 2006-10-24 20:30:52 UTC
Created attachment 102483 [details]
salinfo oem rpm (decodes firmware blob in mca records)

Andreas, is it getting an MCA?  If so, can you install
this rpm, which will decode the oem-specific MCA
information in the MCA record.
Comment 35 Greg Edwards 2006-10-24 20:32:24 UTC
Created attachment 102484 [details]
xorg.conf for dual-X2 prism

Here's the xorg.conf I'm using (successfully).  I've
got two X2 cards in my prism.  I think you have two X3?
Comment 36 Andreas Schwab 2006-10-25 08:54:52 UTC
Created attachment 102536 [details]
MCA dump
Comment 37 Matthias Hopf 2006-10-25 12:19:34 UTC
Greg, I will push the changes upstream of course, as soon as it is clear that we do not have any (read: any known) regressions with the patches. So far I'm quite confident.

I'd like to find out where the Xserver crashes on shutdown on our machine - though this will probably not affect the patches so far.

Regarding the dri module: If the module section is empty, the Xserver loads a default set, which includes dri. This change has only been introduced recently.
Comment 38 Greg Edwards 2006-10-25 17:19:37 UTC
Matthias/Andreas, the MCA is because of a PIO timeout to address

      prb address       state to rp DID  LEN BE   R
      31  0x0040230003c8  0x2  1  0 0x0c0  0 0x01 0
                               |
                               `timeout

which, on a prism deskside is a register on the ATI card in 0005:00:00.0.  If
this is like the last time we saw something like this (see bug #140420),
the card wedges, which causes a timeout on the bus, and an MCA is raised.

If I'm interpreting it correctly, that would make the register offset 0x03c8
which in radeon_reg.h looks like RADEON_FP_V2_SYNC_STRT_WID?

The MCA record shows an iip of 0xa000000100077a40.  What function is this
in your System.map?  This may tell us if it timed out on a read or write to
this location.  I think we can also tell it from info in the MCA record, but
I'll need to go find someone who knows where to look for this.
Comment 39 Andreas Schwab 2006-10-25 17:53:23 UTC
System.map-2.6.18.1-7-default

a000000100077900 t rebalance_tick
a0000001000782c0 T scheduler_tick
Comment 40 Greg Edwards 2006-10-25 18:28:49 UTC
I found out prbs are reads, wrbs are writes, so this is a read timeout.
Comment 41 Greg Edwards 2006-10-25 19:22:05 UTC
Ok, so it appears I got wrong info previously.  One of our platform guys told
me the following:

--------

PRB 31 shows:

prb address       state to rp DID  LEN BE   R
31  0x0040230003c8  0x2  1  0 0x0c0  0 0x01 0

..which means it was a write (R=0). The MCA was due to a BERR, which is used
by sw to trigger an MCA. If the time-out was due to a read, the MCA would
have triggered due to a bus check due to a hard fail (on shub1).

---------

So, if I got the register right, the only write to RADEON_FP_V2_SYNC_STRT_WID
is in RADEONRestoreCrtc2Registers(), and looks to only happen for a kind of
flat panel, which would explain why I don't see it (with 2 CRTs).
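
The reading of the PRB line used in comments #38 and #41 can be written down as a small decoder. This is a sketch: the struct and function names are hypothetical, taking the low 16 bits of the address as the register offset follows the (tentative) interpretation in comment #38, and the meaning of the R column follows the explanation quoted in comment #41.

```c
#include <stdbool.h>
#include <stdint.h>

/* The two fields of a PRB line that matter for this analysis. */
struct prb_entry {
    uint64_t address;   /* "address" column */
    unsigned r;         /* "R" column */
};

/* Assumption: the low 16 bits of the address are the register/port offset. */
unsigned prb_offset(const struct prb_entry *e)
{
    return (unsigned)(e->address & 0xffffu);
}

/* Per the explanation quoted in comment #41, R=0 marks a write. */
bool prb_is_write(const struct prb_entry *e)
{
    return e->r == 0;
}
```

For PRB 31 (address 0x0040230003c8, R=0) this gives offset 0x3c8 and a write. The same offset shows up again in comment #59 as VGA_PEL_IW.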
Comment 42 Greg Edwards 2006-10-30 17:23:54 UTC
Andreas/Matthias, did you figure out if that's where we were running into
problems in the radeon driver?
Comment 43 Andreas Schwab 2006-11-07 15:42:01 UTC
Current server is just hanging after
(**) RADEON(0):   Map Changed ! Applying ...
Comment 44 Andreas Schwab 2006-11-08 14:53:22 UTC
When I wait long enough it crashes the system.
Comment 45 Andreas Schwab 2006-11-08 14:59:44 UTC
Created attachment 104308 [details]
MCA dump
Comment 46 Greg Edwards 2006-11-08 17:12:03 UTC
Timeout on a write to the same register as before.  From mca:

      prb address       state to rp DID  LEN BE   R
      31  0x0040230003c8  0x2  1  0 0x0c0  0 0x01 0

What version of the xorg-x11-server package is this from?  factory is showing
xorg-x11-server-7.2-13 as the latest (which I'm running with success).  Andreas,
are you running something newer?
Comment 47 Andreas Schwab 2006-11-08 17:38:17 UTC
We already have -17 internally, which contains an update to current CVS.  Will probably appear on factory soon.

Comment 48 Greg Edwards 2006-11-08 17:41:37 UTC
Ok, thanks Andreas.  I'll take a look once that shows up on the external mirrors.
Comment 49 Greg Edwards 2006-11-20 20:30:05 UTC
Stefan, looks like Matthias' off by one fix got dropped by accident:

http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commitdiff;h=1b94c117e0f294ef2f89bf24d45ba7a8e45efe35

With this patch added back in, xorg-x11-server-7.2-23 works for me on an
SGI Prism.
Comment 50 Stefan Dirsch 2006-11-20 21:12:49 UTC
I've re-added this patch for STABLE/Factory/Buildservice. Unfortunately too late for RC1.
Comment 51 Andreas Schwab 2006-11-21 09:50:29 UTC
The MCA at shutdown remains anyway.
Comment 52 Greg Edwards 2006-11-21 15:06:58 UTC
Thanks for confirming, Andreas.  Have you been able to run it under a debugger
and see if RADEONRestoreCrtc2Registers() is where you hit it?
Comment 53 Andreas Schwab 2006-11-21 15:12:55 UTC
The last time I tried RADEONRestoreCrtc2Registers wasn't hit.
Comment 54 Greg Edwards 2006-11-21 21:46:01 UTC
I scrounged up some hardware today and can reproduce this now.  It only occurs
on the combination of X3 cards (R420) plus flat panels.  It doesn't occur on
X2 cards (R350) and flat panels, nor X3 cards and CRTs.  I'll dig into it more
tomorrow.
Comment 55 Greg Edwards 2006-11-28 15:28:44 UTC
This is where we're dying in shutdown.  The same thing happens if you try to
switch to one of the text consoles (ctrl-alt-f1) when the xserver is up.

hw/xfree86/os-support/linux/lnx_init.c:xf86CloseConsole()

    327     /* Back to text mode ... */
    328     if (ioctl(xf86Info.consoleFd, KDSETMODE, KD_TEXT) < 0)  <--
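
In isolation, the call that triggers the hang is just the standard Linux console ioctl for leaving graphics mode. A minimal sketch (console_to_text is a hypothetical wrapper; KDSETMODE and KD_TEXT come from <linux/kd.h>):

```c
#include <sys/ioctl.h>
#include <linux/kd.h>

/*
 * Ask the console driver to put the VT behind console_fd back into
 * text mode. Returns 0 on success, -1 on failure (errno is set by ioctl).
 */
int console_to_text(int console_fd)
{
    return ioctl(console_fd, KDSETMODE, KD_TEXT) < 0 ? -1 : 0;
}
```

Since the ioctl is handled entirely inside the kernel's VT layer, a hang at this point suggests looking at what the console driver does on the switch rather than at user-space code.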
Comment 56 Matthias Hopf 2006-11-28 16:15:48 UTC
So this is an issue in the kernel module? Seems so to me...
Comment 57 Greg Edwards 2006-11-28 16:19:20 UTC
Yeah, looks like it.  That's what I'm looking at now.
Comment 58 Matthias Hopf 2006-11-28 16:24:29 UTC
Great. Thanks for looking into this, Greg!

I'll set this bug to needinfo on you for now.
Comment 59 Greg Edwards 2006-12-01 00:41:25 UTC
We die in drivers/video/console/vgacon.c:vga_set_palette()

        vga_w(state.vgabase, VGA_PEL_MSK, 0xff);
        for (i = j = 0; i < 16; i++) {
                vga_w(state.vgabase, VGA_PEL_IW, table[i]);   <---

when we do the first write to VGA_PEL_IW (0x3c8) with a value of 0.  Since the
write times out, I suspect it causes the card to wedge.  We go through
vga_set_palette twice before successfully, once on bootup and once on X
startup.

We know this hardware combination works fine in SLES10, so to narrow it down, I
tried it with a sles10 kernel (required a change to
valid_mmap_phys_addr_range() so the /dev/mem mmap didn't fail) and the radeon 
driver at tag XORG-7_0 (which I believe is SLES10?).  This failed the same way,
so it's back looking like an xorg issue.

I'll probably need a kick in the right direction at where to start looking next.
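
To make the order of register accesses explicit, here is a user-space sketch of the palette load that records port writes instead of touching hardware. Only the first two writes are taken from the kernel excerpt above; the three data writes per entry to VGA_PEL_D, the logging helpers and all names other than the VGA_* constants are assumptions for illustration.

```c
#include <stddef.h>

#define VGA_PEL_MSK 0x3c6   /* DAC pixel mask */
#define VGA_PEL_IW  0x3c8   /* DAC write index */
#define VGA_PEL_D   0x3c9   /* DAC data */

struct port_write { unsigned port; unsigned char val; };

/* Stand-in for vga_w(): remembers each access in order. */
struct port_log { struct port_write w[128]; size_t n; };

static void log_write(struct port_log *log, unsigned port, unsigned char val)
{
    if (log->n < sizeof log->w / sizeof log->w[0]) {
        log->w[log->n].port = port;
        log->w[log->n].val = val;
        log->n++;
    }
}

/* Load 16 palette entries: mask first, then index plus R, G, B per entry. */
void set_palette_logged(struct port_log *log, const unsigned char table[16],
                        const unsigned char rgb[48])
{
    size_t i, j;

    log_write(log, VGA_PEL_MSK, 0xff);
    for (i = j = 0; i < 16; i++) {
        log_write(log, VGA_PEL_IW, table[i]);    /* the write that times out */
        log_write(log, VGA_PEL_D, rgb[j++] >> 2);
        log_write(log, VGA_PEL_D, rgb[j++] >> 2);
        log_write(log, VGA_PEL_D, rgb[j++] >> 2);
    }
}
```

In this sequence the failing access is the second write overall, directly after the write to VGA_PEL_MSK, so whatever wedges the card would have to happen before the palette load starts.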
Comment 60 Stefan Dirsch 2006-12-01 04:36:36 UTC
> We know this hardware combination works fine in SLES10, so to narrow it down, I
> tried it with a sles10 kernel (required a change to
> valid_mmap_phys_addr_range() so the /dev/mem mmap didn't fail) and the radeon
> driver at tag XORG-7_0 (which I believe is SLES10?).
Not at all. We had tons of patches on top of X.Org 6.9's radeon driver --> see xorg-x11-driver-video package of SLES10.
Comment 61 Matthias Hopf 2006-12-06 16:47:32 UTC
(In reply to comment #59)
> We die in drivers/video/console/vgacon.c:vga_set_palette()
> 
>         vga_w(state.vgabase, VGA_PEL_MSK, 0xff);
>         for (i = j = 0; i < 16; i++) {
>                 vga_w(state.vgabase, VGA_PEL_IW, table[i]);   <---
> 
> when we do the first write to VGA_PEL_IW (0x3c8) with a value of 0.  Since the
> write times out, I suspect it causes the card to wedge.  We go through
> vga_set_palette twice before successfully, once on bootup and once on X
> startup.

In that case I guess the *previous* access to some register is bogus. Can you capture a backtrace of the call, so one can estimate what happened before?
Comment 62 Greg Edwards 2006-12-11 19:21:44 UTC
To answer Stefan's previous question, I just compiled the radeon10b driver from
SLES10 (ati-1_0_branch) and this meets with the same demise, but works fine
under SLES10.

Matthias, the call chain in the kernel to vga_set_palette() is

vt_ioctl
    do_unblank_screen
        redraw_screen
            set_palette
                vgacon_set_palette
                    vga_set_palette
Comment 63 Greg Edwards 2006-12-12 05:12:30 UTC
Created attachment 109281 [details]
bogus int10 patch

By accident, I broke the loading of the int10 module with this bogus patch, but the machine didn't MCA on xserver shutdown!  Go figure.  I'll start looking here.  

Matthias, any idea why this would happen?
Comment 64 Stefan Dirsch 2006-12-18 17:42:21 UTC
(In reply to comment #49)
> Stefan, looks like Matthias' off by one fix got dropped by accident:
> http://gitweb.freedesktop.org/?p=xorg/xserver.git;a=commitdiff;h=1b94c117e0f294ef2f89bf24d45ba7a8e45efe35
> 
> With this patch added back in, xorg-x11-server-7.2-23 works for me on an
> SGI Prism.

Andreas made a different patch to address issues in Bug #229278. I'll attach
it.


Comment 65 Stefan Dirsch 2006-12-18 17:44:14 UTC
Created attachment 110177 [details]
p_pci-off-by-one.diff

-------------------------------------------------------------------
Mon Dec 18 17:08:00 CET 2006 - schwab@suse.de

- Fix off-by-one in pci multi-domain support [#229278].
Comment 66 Stefan Dirsch 2006-12-18 17:48:51 UTC
*** Bug 229278 has been marked as a duplicate of this bug. ***
Comment 67 Matthias Hopf 2006-12-19 12:30:16 UTC
Do not apply; Domain 0 is really considered special, not only in this piece of sh^H^Hcode but also in other parts of the Xserver.

I've already tested exactly this approach, it failed.
Comment 68 Stefan Dirsch 2006-12-19 13:44:18 UTC
Ok. I've reverted this change.
Comment 69 Matthias Hopf 2006-12-19 13:50:52 UTC
See Bug #197572, patch p_pci-domain.diff, compare comments 22 + 27.

I'm sorry, I don't remember the exact circumstances.
Comment 70 Matthias Hopf 2007-01-26 16:02:17 UTC
So what's the status of this bug?

Greg, have you had any success in finding out why disabling the int10 module removed the crash? Is the int10 module needed at all for this machine?
Comment 71 Greg Edwards 2007-01-26 16:30:29 UTC
I had to give back my loaned hardware around Christmas time, so I don't have the
equipment to reproduce this anymore, and hadn't been able to look at it much
beyond my last comments.

It sure doesn't appear we need the int10 module at all, though.  I just removed
it entirely on my system, and X comes up fine.  Mike or Jonathan, do you know
if we rely on the int10 module for setting anything up?
Comment 72 Matthias Hopf 2007-02-21 17:36:33 UTC
Ping?
Comment 73 John Hesterberg 2007-02-22 14:29:35 UTC
Greg hasn't had the time or hardware to look at this more,
so I've moved the needinfo to Jonathan Lim, who should be able
to start looking at this.
Comment 74 Jonathan Lim 2007-03-08 00:15:02 UTC
Regarding the following trace obtained when Xorg exits:

  ddxGiveUp, xf86Init.c
    xf86CloseConsole, lnx_init.c
      ioctl(xf86Info.consoleFd, KDSETMODE, KD_TEXT)
        vt_ioctl, vt_ioctl.c
          do_unblank_screen, vt.c
            update_screen(x) -> redraw_screen(x, 0)
              set_palette, vt.c
                vgacon_set_palette
                  vga_set_palette

I wasn't able to confirm that do_unblank_screen was called because the machine hung as soon as it went into vt_ioctl.

There are two other instances where vga_set_palette is called:

  1. During Xorg startup:

       console_callback, vt.c
         change_console
           complete_change_console, vt_ioctl.c
             switch_screen(x) -> redraw_screen(x, 1)
               set_palette
                 vgacon_set_palette
                   vga_set_palette

  2. When the blanked console comes back on following an input event:

       console_callback, vt.c
         poke_blanked_console, vt.c
           do_unblank_screen, vt.c
             vgacon_blank
               vga_set_palette

What version of openSUSE and Xorg worked with this hardware combination before?

Also, if I replace radeon with vga in xorg.conf, the machine panics when Xorg starts up.  The last thing written in Xorg.0.log is

  (II) VGA(0): initializing int10.
Comment 75 Matthias Hopf 2007-03-08 11:32:35 UTC
Greg, I think this is one last question to you ;-)
Do you remember the configuration that worked for you?
Comment 76 Greg Edwards 2007-03-08 15:50:35 UTC
I believe it worked up until openSUSE switched over to the modular xorg code
base (and rebased on current bits).  From the xorg-x11-server changelog, this 
looks like the end of June 2006.  SLES10 works fine, so that is probably your 
best stable point.
Comment 77 Jonathan Lim 2007-03-13 20:29:45 UTC
I've found the fix for this bug:

--- xorg-server-1.2.0/hw/xfree86/int10/Makefile.am.O    2007-01-22 21:39:16.000000000 -0800
+++ xorg-server-1.2.0/hw/xfree86/int10/Makefile.am      2007-03-13 04:13:54.000000000 -0700
@@ -29,6 +29,9 @@ endif
 if INT10_X86EMU
 AM_CFLAGS = $(I386_VIDEO_CFLAGS) -D_X86EMU -DNO_SYS_HEADERS \
            $(XORG_CFLAGS) $(EXTRA_CFLAGS)
+if LINUX_IA64
+AM_CFLAGS += -DNO_LONG_LONG
+endif
 INCLUDES = $(XORG_INCS) -I$(srcdir)/../x86emu
 libint10_la_SOURCES = \
        $(COMMON_SOURCES) \

--- xorg-server-1.2.0/hw/xfree86/int10/Makefile.in.O    2007-03-13 04:10:14.000000000 -0700
+++ xorg-server-1.2.0/hw/xfree86/int10/Makefile.in      2007-03-13 04:14:38.000000000 -0700
@@ -571,6 +571,7 @@ COMMON_SOURCES = \
 @INT10_VM86_TRUE@AM_CFLAGS = $(I386_VIDEO_CFLAGS) -D_VM86_LINUX $(XORG_CFLAGS) $(EXTRA_CFLAGS)
 @INT10_X86EMU_TRUE@AM_CFLAGS = $(I386_VIDEO_CFLAGS) -D_X86EMU -DNO_SYS_HEADERS \
 @INT10_X86EMU_TRUE@           $(XORG_CFLAGS) $(EXTRA_CFLAGS)
+@LINUX_IA64_TRUE@AM_CFLAGS += -DNO_LONG_LONG
 
 @INT10_VM86_TRUE@INCLUDES = $(XORG_INCS)
 @INT10_X86EMU_TRUE@INCLUDES = $(XORG_INCS) -I$(srcdir)/../x86emu

Can someone please verify?  Thanks.
Comment 78 Matthias Hopf 2007-03-14 11:40:31 UTC
OMFG. That looks like (again) something was lost in the autoconfigifying process...

Thanks a lot, Jonathan.

The implementations in hw/xfree86/x86emu/prim_ops.c for w/ and w/o long long types are vastly different. I don't see at once why the long long version should fail, but that code *could* have an endianness bug.

Stefan, please apply this patch, it certainly doesn't do any harm. We can test as soon as the package is built.
Comment 79 Stefan Dirsch 2007-03-14 17:36:05 UTC
fixed for STABLE/Factory.

xorg-x11-server.changes:
[...]
-------------------------------------------------------------------
Wed Mar 14 15:43:46 CET 2007 - sndirsch@suse.de

- bug197190-ia64.diff:
  * missing -DNO_LONG_LONG for IA64 (Bug #197190) 
Comment 80 Jonathan Lim 2007-03-14 18:55:49 UTC
> The implementations in hw/xfree86/x86emu/prim_ops.c for w/ and w/o long long
> types are vastly different. I don't see at once why the long long version
> should fail, but that code *could* have an endianness bug.

In .../x86emu/types.h, right before __HAS_LONG_LONG__ is defined, there's a comment indicating "Currently only for Linux/32bit".
Comment 81 Andreas Schwab 2007-03-15 09:36:15 UTC
Then why is x86-64 not defining it?
Comment 82 Andreas Schwab 2007-03-15 09:39:35 UTC
This can definitely not be the real bug.
Comment 83 Stefan Dirsch 2007-03-15 10:04:12 UTC
Hmm ... in X.Org 6.9 "-DNO_LONG_LONG" is even set for *all* architectures, so
__HAS_LONG_LONG__ is never set.

extras/x86emu/include/x86emu/types.h:
[...]
/* Currently only for Linux/32bit */
#undef  __HAS_LONG_LONG__
#if defined(__GNUC__) && !defined(NO_LONG_LONG)
#define __HAS_LONG_LONG__
#endif

programs/Xserver/hw/xfree86/int10/Imakefile:
[...]
X86EMUDEFINES = -D__DRIVER__ -DFORCE_POST -D_CEXPORT= -DNO_LONG_LONG ... 

programs/Xserver/hw/xfree86/os-support/linux/int10/x86emu/Imakefile:
[...]
X86EMUDEFINES = -D__DRIVER__ -DFORCE_POST -D_CEXPORT= -DNO_LONG_LONG ... 
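
To illustrate what such a switch selects between, here is a sketch of a 32x32 -> 64-bit multiply written both ways, with a guard modelled on the one quoted above. It is not the x86emu code: the function names and the SKETCH_HAS_LONG_LONG macro are made up for the example.

```c
#include <stdint.h>

/* Modelled on the quoted guard: use the wide type unless NO_LONG_LONG is set. */
#if defined(__GNUC__) && !defined(NO_LONG_LONG)
#define SKETCH_HAS_LONG_LONG 1
#else
#define SKETCH_HAS_LONG_LONG 0
#endif

/* Multiply using a 64-bit intermediate. */
void mul32_wide(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint64_t r = (uint64_t)a * b;

    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}

/* Same multiply using only 32-bit arithmetic on 16-bit halves. */
void mul32_split(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xffff, a_hi = a >> 16;
    uint32_t b_lo = b & 0xffff, b_hi = b >> 16;
    uint32_t p0 = a_lo * b_lo;      /* contributes to bits  0..31 */
    uint32_t p1 = a_lo * b_hi;      /* contributes to bits 16..47 */
    uint32_t p2 = a_hi * b_lo;      /* contributes to bits 16..47 */
    uint32_t p3 = a_hi * b_hi;      /* contributes to bits 32..63 */

    /* Sum of the 16..31 column; at most 3 * 0xffff, so it cannot overflow. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xffff) + (p2 & 0xffff);

    *lo = (mid << 16) | (p0 & 0xffff);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}

/* Compile-time selection between the two code paths. */
void mul32(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
#if SKETCH_HAS_LONG_LONG
    mul32_wide(a, b, hi, lo);
#else
    mul32_split(a, b, hi, lo);
#endif
}
```

Both paths should produce identical results for every input. If they do not on some platform, the defect lies in one of the implementations (or in the type definitions it relies on), not in the choice of macro.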
Comment 84 Andreas Schwab 2007-03-15 10:24:28 UTC
Created attachment 124590 [details]
x86emu.diff
Comment 85 Stefan Dirsch 2007-03-15 11:31:28 UTC
So this obsoletes the patch mentioned in comment #77?
Comment 86 Andreas Schwab 2007-03-15 11:34:17 UTC
I definitely hope so.
Comment 87 Stefan Dirsch 2007-03-15 11:38:46 UTC
So could you give it a try?
Comment 88 Matthias Hopf 2007-03-15 11:39:54 UTC
(In reply to comment #83)
> Hmm ... in X.Org 6.9 "-DNO_LONG_LONG" is even set for *all* architectures, so
> __HAS_LONG_LONG__ is never set.

And on i386 the emulator isn't used at all...

Andreas' patch looks very reasonable, so I will probably push it upstream. I'm unsure, though, whether __HAS_LONG_LONG__ will work after that patch, because the code path is very different, and I assume it hasn't been tested for a *long* time.

Did you test this, Andreas?
Stefan, maybe it's a good idea to disable Jonathan's patch and use Andreas' for the next round of testing. The long long code path is certainly much faster than the other one.
Comment 89 Andreas Schwab 2007-03-15 11:50:12 UTC
This also fixes bug 248195.  NO_LONG_LONG should definitely be removed.
Comment 90 Stefan Dirsch 2007-03-15 15:25:03 UTC
Patch applied. Fixed for STABLE/Factory.
Comment 91 Matthias Hopf 2007-03-15 15:58:32 UTC
Committed upstream.
Comment 92 Jonathan Lim 2007-03-15 18:58:14 UTC
Started and stopped Xorg without problems on my machine using the new fix.