Bug 1209005

Summary: intel [Skylake] GNOME shell 43.3.1 / X11 dumps core - Mesa 23.0.0 regression
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Martin Wilck <martin.wilck>
Component: X.Org
Assignee: Gfx Bugs <gfx-bugs>
Status: RESOLVED FIXED
QA Contact: Gfx Bugs <gfx-bugs>
Severity: Normal
Priority: P3 - Medium
CC: martin.wilck, mkoutny, vliaskovitis
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
URL: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8542
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Martin Wilck 2023-03-07 11:31:05 UTC
Since today's TW update, gnome shell (X11) dumps core repeatedly, roughly every 7s.
I've collected >200 core dumps in a few minutes. I needed to kill the systemd user instance because systemd would try to restart gnome-shell forever.

GNOME/wayland seems to work, but I can't use it because I need the barrier KVM switch software which doesn't work with wayland.

Note that neither gnome-shell, mutter, nor cogl was updated in the transaction that led to the issue. The only updated packages that look suspicious in the context shown below are Mesa, which was updated from 22.3.5-343.1 to 23.0.0-345.1, and the kernel, which was updated from 6.1.12-1 to 6.2.1-1.

Sample core (too large to attach to Bugzilla, but I can provide it on demand):

           PID: 30814 (gnome-shell)
           UID: 17326 (mwilck)
           GID: 50 (suse)
        Signal: 11 (SEGV)
     Timestamp: Tue 2023-03-07 10:05:12 CET (1h 50min ago)
  Command Line: /usr/bin/gnome-shell
    Executable: /usr/bin/gnome-shell
 Control Group: /user.slice/user-17326.slice/user@17326.service/session.slice/org.gnome.Shell@x11.service
          Unit: user@17326.service
     User Unit: org.gnome.Shell@x11.service
         Slice: user-17326.slice
     Owner UID: 17326 (mwilck)
       Boot ID: 725dacfb20e34ebaa2c5bb3384db7c25
    Machine ID: a0385656b74c9241b77c1bb6577a603b
      Hostname: apollon.suse.de
       Storage: /var/lib/systemd/coredump/core.gnome-shell.17326.725dacfb20e34ebaa2c5bb3384db7c25.30814.1678179912000000.zst (present)
  Size on Disk: 18.4M
       Message: Process 30814 (gnome-shell) of user 17326 dumped core.

(gdb) bt
#0  0x00007fd104389e3d in cogl_onscreen_glx_notify_swap_buffers (swap_event=0x7ffc351d7f00, onscreen=0x55655988d120 [CoglOnscreenGlx])
    at ../cogl/cogl/winsys/cogl-onscreen-glx.c:991
#1  notify_swap_buffers (context=<optimized out>, swap_event=0x7ffc351d7f00) at ../cogl/cogl/winsys/cogl-winsys-glx.c:184
#2  glx_event_filter_cb (xevent=0x7ffc351d7f00, data=<optimized out>) at ../cogl/cogl/winsys/cogl-winsys-glx.c:224
#3  0x00007fd104388f18 in _cogl_renderer_handle_native_event (renderer=<optimized out>, event=0x7ffc351d7f00) at ../cogl/cogl/cogl-renderer.c:636
#4  cogl_xlib_renderer_handle_event (renderer=<optimized out>, event=0x7ffc351d7f00) at ../cogl/cogl/cogl-xlib-renderer.c:579
#5  0x00007fd1048de110 in cogl_xlib_filter (xevent=<optimized out>, event=<optimized out>, data=<optimized out>) at ../src/backends/x11/meta-clutter-backend-x11.c:94
#6  0x00007fd1048e9d93 in meta_clutter_backend_x11_process_event_filters
    (clutter_backend_x11=0x5565596b0010 [MetaClutterBackendX11], event=0x55655dd7a2e0, native=0x7ffc351d7f00) at ../src/backends/x11/meta-clutter-backend-x11.c:329
#7  meta_clutter_backend_x11_translate_event (clutter_backend=0x5565596b0010 [MetaClutterBackendX11], native=0x7ffc351d7f00, event=0x55655dd7a2e0)
    at ../src/backends/x11/meta-clutter-backend-x11.c:363
#8  0x00007fd10498c090 in meta_x11_handle_event.isra.0 (backend=backend@entry=0x5565595f31d0 [MetaBackendX11Cm], xevent=xevent@entry=0x7ffc351d7f00)
    at ../src/backends/x11/meta-event-x11.c:82
#9  0x00007fd1048e576d in handle_host_xevent (event=0x7ffc351d7f00, backend=0x5565595f31d0 [MetaBackendX11Cm]) at ../src/backends/x11/meta-backend-x11.c:421
#10 x_event_source_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../src/backends/x11/meta-backend-x11.c:475
#11 0x00007fd1056a0a90 in g_main_dispatch (context=0x5565595de9f0) at ../glib/gmain.c:3454
#12 g_main_context_dispatch (context=context@entry=0x5565595de9f0) at ../glib/gmain.c:4172
#13 0x00007fd1056a0e48 in g_main_context_iterate (context=0x5565595de9f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib/gmain.c:4248
#14 0x00007fd1056a110f in g_main_loop_run (loop=0x55655b770f00) at ../glib/gmain.c:4448
#15 0x00007fd1048c28c5 in meta_context_run_main_loop (context=<optimized out>, error=error@entry=0x7ffc351d8160) at ../src/core/meta-context.c:465
#16 0x000055655892d904 in main (argc=<optimized out>, argv=<optimized out>) at ../src/main.c:582


Crashes here:

976	cogl_onscreen_glx_notify_swap_buffers (CoglOnscreen          *onscreen,
977	                                       GLXBufferSwapComplete *swap_event)
978	{
979	  CoglOnscreenGlx *onscreen_glx = COGL_ONSCREEN_GLX (onscreen);
980	  CoglFramebuffer *framebuffer = COGL_FRAMEBUFFER (onscreen);
981	  CoglContext *context = cogl_framebuffer_get_context (framebuffer);
982	  gboolean ust_is_monotonic;
983	  CoglFrameInfo *info;
984	
985	  /* We only want to notify that the swap is complete when the
986	     application calls cogl_context_dispatch so instead of immediately
987	     notifying we'll set a flag to remember to notify later */
988	  set_sync_pending (onscreen);
989	
990	  info = cogl_onscreen_peek_head_frame_info (onscreen);
991	  info->flags |= COGL_FRAME_INFO_FLAG_VSYNC;   // <====
992	

because info is NULL:

>   0x00007fd104389e31 <+417>:	mov    %rbp,%rdi
>   0x00007fd104389e34 <+420>:	call   0x7fd104352e80 <cogl_onscreen_peek_head_frame_info@plt>
>   0x00007fd104389e39 <+425>:	mov    0x30(%r13),%rsi
> => 0x00007fd104389e3d <+429>:	orl    $0x8,0x70(%rax)


(gdb) info reg
rax            0x0                 0
rbx            0x7ffc351d7f00      140721199611648
rcx            0x5565596b0fa0      93893780246432
rdx            0x5565596b0fa0      93893780246432
rsi            0x20000c            2097164
rdi            0x55655988d040      93893782196288
rbp            0x55655988d120      0x55655988d120
rsp            0x7ffc351d7c60      0x7ffc351d7c60
r8             0x28                40
r9             0x50                80
r10            0x0                 0
r11            0x1                 1
r12            0xfffffffffffffff0  -16
r13            0x55655988d120      93893782196512
r14            0x55655982f130      93893781811504
r15            0x5565595f6800      93893779482624
rip            0x7fd104389e3d      0x7fd104389e3d <glx_event_filter_cb+429>
eflags         0x10246             [ PF ZF IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0

The caller of notify_swap_buffers is handling a GLX_BufferSwapComplete event:

201	glx_event_filter_cb (XEvent *xevent, void *data)
...
217	#ifdef GLX_INTEL_swap_event
218	  glx_renderer = context->display->renderer->winsys;
219	
220	  if (xevent->type == (glx_renderer->glx_event_base + GLX_BufferSwapComplete))
221	    {
222	      GLXBufferSwapComplete *swap_event = (GLXBufferSwapComplete *) xevent;
223	
224	      notify_swap_buffers (context, swap_event);
(gdb) 
225	
226	      /* remove SwapComplete events from the queue */
227	      return COGL_FILTER_REMOVE;
228	    }
229	#endif /* GLX_INTEL_swap_event */

(gdb) p *swap_event
$12 = {
  type = 96,
  serial = 5387,
  send_event = 0,
  display = 0x5565595f6800,
  drawable = 2097163,
  event_type = 33153,
  ust = 1796404299,
  msc = 107467,
  sbc = 1
}
Comment 1 Martin Wilck 2023-03-07 11:32:51 UTC
Changed component to X.org.
Comment 2 Martin Wilck 2023-03-07 11:45:22 UTC
(In reply to Martin Wilck from comment #0)

> Note that neither gnome-shell, mutter, nor cogl was updated in the
> transaction that led to the issue. The only updated packages that look
> suspicious in the context shown below are Mesa, which was updated from
> 22.3.5-343.1 to 23.0.0-345.1, and the kernel, which was updated from
> 6.1.12-1 to 6.2.1-1.

It happens with 6.1.12-1, too, so the kernel is not the culprit.
Comment 3 Stefan Dirsch 2023-03-07 12:00:49 UTC
Ok. So try to downgrade Mesa.

Possible package list you need to downgrade:

Mesa
Mesa-KHR-devel
Mesa-devel
Mesa-dri
Mesa-dri-devel
Mesa-dri-nouveau
Mesa-dri-vc4
Mesa-gallium
Mesa-libEGL-devel
Mesa-libEGL1
Mesa-libGL-devel
Mesa-libGL1
Mesa-libGLESv1_CM-devel
Mesa-libGLESv2-devel
Mesa-libGLESv3-devel
Mesa-libOpenCL
Mesa-libRusticlOpenCL
Mesa-libd3d
Mesa-libd3d-devel
Mesa-libglapi-devel
Mesa-libglapi0
Mesa-libva
Mesa-vulkan-device-select
Mesa-vulkan-overlay
libOSMesa-devel
libOSMesa8
libgbm-devel
libgbm1
libvdpau_nouveau
libvdpau_r300
libvdpau_r600
libvdpau_radeonsi
libvdpau_virtio_gpu
libvulkan_broadcom
libvulkan_freedreno
libvulkan_intel
libvulkan_lvp
libvulkan_radeon
libxatracker-devel
libxatracker2
Comment 4 Stefan Dirsch 2023-03-07 12:01:10 UTC
And you need to restart Xserver after this.
Comment 5 Martin Wilck 2023-03-07 12:20:26 UTC
I just downgraded the following packages (sorry I started before I read your comment):

Mesa-libglapi0|22.3.5-343.1
Mesa-KHR-devel|22.3.5-343.1
Mesa-libEGL1|22.3.5-343.1
Mesa-libGL1|22.3.5-343.1
Mesa-gallium|22.3.5-343.1
Mesa|22.3.5-343.1
Mesa-dri|22.3.5-343.1
Mesa-libEGL-devel|22.3.5-343.1
Mesa-libGLESv2-devel|22.3.5-343.1
Mesa-libGLESv1_CM-devel|22.3.5-343.1
Mesa-dri-devel|22.3.5-343.1
Mesa-libGL-devel|22.3.5-343.1
libOSMesa8|22.3.5-343.1
libOSMesa-devel|22.3.5-343.1
Mesa-libglapi-devel|22.3.5-343.1
Mesa-gallium-32bit|22.3.5-343.1
Mesa-dri-32bit|22.3.5-343.1
Mesa-32bit|22.3.5-343.1
Mesa-libGL1-32bit|22.3.5-343.1
Mesa-devel|22.3.5-343.1
Mesa-libglapi0-32bit|22.3.5-343.1
Mesa-vulkan-device-select|22.3.5-343.1
Mesa-libva|22.3.5-343.1
Mesa-libEGL1-32bit|22.3.5-343.1

GNOME seems to work now. I noticed that GDM wouldn't offer me "GNOME/Xorg" any more. But when I start the "GNOME" session, it seems to start GNOME/Xorg. At least X is running.

So the original problem isn't observed.

However typing in this browser window feels sluggish. It seems that I'm not getting any acceleration any more.

I'll downgrade the other packages you recommended and see how it goes.
Comment 6 Martin Wilck 2023-03-07 12:32:12 UTC
All packages listed in comment 3 downgraded. I'm offered a GNOME/Xorg session now again, and it doesn't crash.
Comment 7 Martin Wilck 2023-03-07 12:33:08 UTC
> However typing in this browser window feels sluggish 

this was a different problem, related to bluetooth and my BT keyboard. Forget it.
Comment 8 Stefan Dirsch 2023-03-07 12:40:49 UTC
Thanks. So apparently it's a Mesa issue. Which graphics hardware is this?

glxinfo -B
inxi -aG

would be useful here.
Comment 9 Martin Wilck 2023-03-07 14:14:52 UTC
$ glxinfo  -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) HD Graphics 520 (SKL GT2) (0x1916)
    Version: 22.3.5
    Accelerated: yes
    Video memory: 7713MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) HD Graphics 520 (SKL GT2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 22.3.5
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 22.3.5
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 22.3.5
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

# inxi -aG
Graphics:
  Device-1: Intel Skylake GT2 [HD Graphics 520] vendor: Dell Latitude E7470 driver: i915 v: kernel
    arch: Gen-9 process: Intel 14n built: 2015-16 ports: active: DP-2,DP-4,eDP-1 empty: DP-1, DP-3,
    HDMI-A-1, HDMI-A-2 bus-ID: 00:02.0 chip-ID: 8086:1916 class-ID: 0300
  Device-2: Sunplus Innovation Integrated_Webcam_HD type: USB driver: uvcvideo bus-ID: 1-2:2
    chip-ID: 1bcf:28b8 class-ID: 0e02
  Display: server: X.org v: 1.21.1.7 with: Xwayland v: 22.1.8 compositor: gnome-shell v: 43.3
    driver: X: loaded: intel unloaded: fbdev,modesetting,vesa dri: i965 gpu: i915 tty: 135x30
  Monitor-1: DP-2 model: Dell P2414H serial: KKMMW62572JB built: 2016 res: 1920x1080 dpi: 93
    gamma: 1.2 size: 527x297mm (20.75x11.69") diag: 605mm (23.8") ratio: 16:9 modes: max: 1920x1080
    min: 720x400
  Monitor-2: DP-4 model: Fujitsu Siemens P24W-7 LED serial: YV8S000307 built: 2014
    res: 1920x1200 dpi: 94 gamma: 1.2 size: 518x324mm (20.39x12.76") diag: 611mm (24.1")
    ratio: 16:10 modes: max: 1920x1200 min: 720x400
  Monitor-3: eDP-1 model: LG Display 0x0490 built: 2014 res: 1920x1080 dpi: 158 gamma: 1.2
    size: 309x174mm (12.17x6.85") diag: 355mm (14") ratio: 16:9 modes: 1920x1080
  API: OpenGL Message: GL data unavailable in console for root.
Comment 10 Martin Wilck 2023-03-07 14:16:00 UTC
Graphics:
  Device-1: Intel Skylake GT2 [HD Graphics 520] vendor: Dell Latitude E7470
    driver: i915 v: kernel arch: Gen-9 process: Intel 14n built: 2015-16 ports:
    active: DP-2,DP-4,eDP-1 empty: DP-1, DP-3, HDMI-A-1, HDMI-A-2
    bus-ID: 00:02.0 chip-ID: 8086:1916 class-ID: 0300
  Device-2: Sunplus Innovation Integrated_Webcam_HD type: USB
    driver: uvcvideo bus-ID: 1-2:2 chip-ID: 1bcf:28b8 class-ID: 0e02
  Display: x11 server: X.Org v: 21.1.7 with: Xwayland v: 22.1.8
    compositor: gnome-shell v: 43.3 driver: X: loaded: intel
    unloaded: fbdev,modesetting,vesa dri: i965 gpu: i915 display-ID: :0
    screens: 1
  Screen-1: 0 s-res: 3648x1920 s-dpi: 96 s-size: 965x508mm (37.99x20.00")
    s-diag: 1091mm (42.93")
  Monitor-1: DP-2 mapped: DP1-1 pos: primary,top-center model: Dell P2414H
    serial: KKMMW62572JB built: 2016 res: 1080x1920 hz: 60 dpi: 91 gamma: 1.2
    size: 300x530mm (11.81x20.87") diag: 605mm (23.8") ratio: 16:9 modes:
    max: 1920x1080 min: 720x400
  Monitor-2: DP-4 mapped: DP1-3 pos: top-left model: Fujitsu Siemens P24W-7
    LED serial: YV8S000307 built: 2014 res: 1200x1920 hz: 60 dpi: 95
    gamma: 1.2 size: 320x520mm (12.6x20.47") diag: 611mm (24.1") ratio: 16:10
    modes: max: 1920x1200 min: 720x400
  Monitor-3: eDP-1 mapped: eDP1 pos: bottom-r model: LG Display 0x0490
    built: 2014 res: 1368x768 dpi: 112 gamma: 1.2 size: 310x170mm (12.2x6.69")
    diag: 355mm (14") ratio: 16:9 modes: 1920x1080
  API: OpenGL v: 4.6 Mesa 22.3.5 renderer: Mesa Intel HD Graphics 520 (SKL
    GT2) direct render: Yes
Comment 11 Vasilis Liaskovitis 2023-03-07 22:16:48 UTC
Seeing that the crash in mutter/gnome-shell is in code related to handling GLX_INTEL_swap_event, I wonder if reverting this Mesa 23.0 commit makes a difference:

"
From 19c57ea3bf6d77cf6f07f2a56e781f55b0e6013b Mon Sep 17 00:00:00 2001
From: Adam Jackson <ajax@redhat.com>
Date: Tue, 13 Dec 2022 12:26:58 -0500
Subject: [PATCH] glx: Remove pointless GLX_INTEL_swap_event paranoia

It's not our job to filter this out, it's the server's job to not send
events that haven't been selected for. We'll still throw the event away
if we don't have any client-side state for it though."

Debug package with this patch reverted:

https://build.opensuse.org/package/binaries/home:vliaskovitis:branches:X11:XOrg/Mesa/openSUSE_Tumbleweed

However as the upstream commit log implies, this is only for debug purposes: If the revert does fix things, it likely means the proper solution is to handle the event differently in either gnome's mutter/gnome-shell or xorg server, not in Mesa itself. So if the revert fixes things, it would just help us find the correct component to focus on.
Comment 12 Martin Wilck 2023-03-09 07:34:22 UTC
Side note: there seems to be a package dependency issue here. Spin-off bug 1209086 created.
Comment 13 Martin Wilck 2023-03-09 08:12:45 UTC
My first attempt to update from Vasilis' repo resulted in the following package mix from Vasilis' repo and Factory:

Factory:
Mesa-libglapi0-23.0.0-345.1
Mesa-KHR-devel-23.0.0-345.1
Mesa-libEGL1-23.0.0-345.1
Mesa-gallium-23.0.0-345.1
Mesa-23.0.0-345.1
Mesa-dri-23.0.0-345.1
Mesa-libEGL-devel-23.0.0-345.1
Mesa-libGLESv2-devel-23.0.0-345.1
Mesa-libGLESv1_CM-devel-23.0.0-345.1
Mesa-dri-devel-23.0.0-345.1
Mesa-libGL-devel-23.0.0-345.1
libOSMesa8-23.0.0-345.1
libOSMesa-devel-23.0.0-345.1
Mesa-libglapi-devel-23.0.0-345.1
Mesa-gallium-32bit-23.0.0-345.1
Mesa-dri-32bit-23.0.0-345.1
Mesa-32bit-23.0.0-345.1
Mesa-libGL1-32bit-23.0.0-345.1
Mesa-devel-23.0.0-345.1

Vasilis:
Mesa-libGL1-23.0.0-1453.1
Mesa-vulkan-device-select-23.0.0-1453.1
Mesa-libglapi0-32bit-23.0.0-1453.1
Mesa-libEGL1-32bit-23.0.0-1453.1
libvulkan_intel-23.0.0-1453.1
libgbm-devel-23.0.0-1453.1
libvdpau_virtio_gpu-23.0.0-1453.1
libvdpau_r600-23.0.0-1453.1
libvdpau_r300-23.0.0-1453.1
libvdpau_nouveau-23.0.0-1453.1
Mesa-libva-23.0.0-1453.1

Anyway, the issue is gone. As the commit mentioned in comment 11 affects Mesa-libGL1 (AFAICT), Vasilis' hypothesis is confirmed. Thanks!
Comment 18 Martin Wilck 2023-03-09 09:27:21 UTC
Upstream: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8542
Comment 19 Stefan Dirsch 2023-03-09 11:36:09 UTC
(In reply to Martin Wilck from comment #18)
> Upstream: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8542

Thanks a lot! Watching now ...
Comment 20 Stefan Dirsch 2023-03-12 10:47:23 UTC
@Martin: While the issue is being addressed upstream, is this a fatal issue? I mean, does it break the GNOME desktop completely, so that I should reverse-apply this patch ASAP for now?
Comment 21 Martin Wilck 2023-03-14 07:37:57 UTC
It breaks the desktop for me, because I have to use GNOME/X11 in order to use barrier, and GNOME/X11 crashes on every startup (not only once, but many times in a row - that's another bug actually, the number of desktop restarts by systemd should be limited somehow).

If there aren't a lot of other people affected, I can just set a lock on Mesa and keep the package that works for me. That should work for a limited amount of time.

But the issue doesn't seem to have got much upstream traction so far...
Comment 22 Stefan Dirsch 2023-03-14 12:10:56 UTC
Thanks. I can't imagine you being the only one affected by this. Probably you're just the first one reporting it. I'll do the following for the time being.

-------------------------------------------------------------------
Tue Mar 14 11:53:20 UTC 2023 - Stefan Dirsch <sndirsch@suse.com>

- U_glx-Remove-pointless-GLX_INTEL_swap_event-paranoia.patch
  * reverse apply this patch to fix a regression caused by this
    commit, which resulted in gnome-shell constantly crashing, which
    is making a GNOME/X11 session impossible (boo#1209005)

We'll see what upstream thinks about this ...
Comment 23 Stefan Dirsch 2023-03-14 12:14:13 UTC
https://build.opensuse.org/request/show/1071497

Lowering severity now that the regression commit has been reverted.
Comment 24 Michal Koutný 2023-03-20 12:29:54 UTC
*** Bug 1209203 has been marked as a duplicate of this bug. ***
Comment 25 Stefan Dirsch 2023-06-20 13:33:15 UTC
Now also reverted upstream. I'll close this one once I update to a Mesa version that supersedes our "revert" patch.
Comment 26 Stefan Dirsch 2023-06-23 07:52:55 UTC
Patch is now reverted in Mesa 23.1.3. Will be in TW soon. Closing ...
Comment 27 OBSbugzilla Bot 2023-06-23 08:45:03 UTC
This is an autogenerated message for OBS integration:
This bug (1209005) was mentioned in
https://build.opensuse.org/request/show/1094791 Factory / Mesa