Bug 971350

Summary: X with kms_swrast_dri causes SIGFPE when built with GCC 6
Product: [openSUSE] openSUSE Tumbleweed Reporter: Richard Biener <rguenther>
Component: X.Org Assignee: E-mail List <xorg-maintainer-bugs>
Status: RESOLVED FIXED QA Contact: E-mail List <xorg-maintainer-bugs>
Severity: Normal    
Priority: P5 - None CC: rguenther, schwab
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   

Description Richard Biener 2016-03-16 09:24:39 UTC
In openSUSE:Factory:Staging:Gcc6 on x86_64, launching X causes

[    26.429] (EE) Backtrace:
[    26.429] (EE) 0: /usr/bin/X (xorg_backtrace+0x4a) [0x59721a]
[    26.429] (EE) 1: /usr/bin/X (0x400000+0x19b569) [0x59b569]
[    26.429] (EE) 2: /lib64/libc.so.6 (0x7f2cdee91000+0x34b10) [0x7f2cdeec5b10]
[    26.429] (EE) 3: /lib64/libpthread.so.0 (pthread_barrier_destroy+0x9) [0x7f2cdec81c19]
[    26.429] (EE) 4: /usr/lib64/dri/kms_swrast_dri.so (0x7f2cd8a74000+0x63d45f) [0x7f2cd90b145f]
[    26.429] (EE) 5: /usr/lib64/dri/kms_swrast_dri.so (0x7f2cd8a74000+0x648b01) [0x7f2cd90bcb01]
[    26.429] (EE) 6: /usr/lib64/dri/kms_swrast_dri.so (0x7f2cd8a74000+0x2f8c5f) [0x7f2cd8d6cc5f]
[    26.429] (EE) 7: /usr/lib64/dri/kms_swrast_dri.so (0x7f2cd8a74000+0x2f8d05) [0x7f2cd8d6cd05]
[    26.429] (EE) 8: /usr/lib64/dri/kms_swrast_dri.so (0x7f2cd8a74000+0x2f71a2) [0x7f2cd8d6b1a2]
[    26.429] (EE) 9: /usr/lib64/libgbm.so.1 (0x7f2cda047000+0x328d) [0x7f2cda04a28d]
[    26.429] (EE) 10: /usr/lib64/xorg/modules/libglamoregl.so (0x7f2cda255000+0x746b) [0x7f2cda25c46b]
[    26.429] (EE) 11: /usr/lib64/xorg/modules/libglamoregl.so (glamor_egl_init+0x117) [0x7f2cda25d6a7]
[    26.429] (EE) 12: /usr/lib64/xorg/modules/drivers/modesetting_drv.so (0x7f2cdaa9d000+0x87ac) [0x7f2cdaaa57ac]
[    26.429] (EE) 13: /usr/bin/X (InitOutput+0xa6d) [0x47e7ed]
[    26.429] (EE) 14: /usr/bin/X (0x400000+0x3d106) [0x43d106]
[    26.429] (EE) 15: /lib64/libc.so.6 (__libc_start_main+0xf1) [0x7f2cdeeb1721]
[    26.429] (EE) 16: /usr/bin/X (_start+0x29) [0x4283e9]
[    26.429] (EE)
[    26.429] (EE) Floating point exception at address 0x7f2cdec81c19
[    26.429] (EE)
Fatal server error:
[    26.429] (EE) Caught signal 8 (Floating point exception). Server aborting

where glibc does

int
pthread_barrier_destroy (pthread_barrier_t *barrier)
{
  struct pthread_barrier *bar = (struct pthread_barrier *) barrier;

  /* Destroying a barrier is only allowed if no thread is blocked on it.
     Thus, there is no unfinished round, and all modifications to IN will
     have happened before us (either because the calling thread took part
     in the most recent round and thus synchronized-with all other threads
     entering, or the program ensured this through other synchronization).
     We must wait until all threads that entered so far have confirmed that
     they have exited as well.  To get the notification, pretend that we have
     reached the reset threshold.  */
  unsigned int count = bar->count;
  unsigned int max_in_before_reset = BARRIER_IN_THRESHOLD
                                   - BARRIER_IN_THRESHOLD % count;

Thus bar->count must somehow be zero, which makes the modulo a division by zero.


This is from within QEMU, using the Test-DVD for openQA.  Note that the project
has glibc from Base:System, which is at 2.23 (Factory is still at 2.22).
Comment 1 Richard Biener 2016-03-16 09:28:16 UTC
Possibly X doesn't check whether pthread_barrier_init succeeds and passes it
a zero count (upon which it returns with the barrier uninitialized and thus
likely all zeros).
Comment 2 Egbert Eich 2016-03-16 10:22:10 UTC
Any idea why this doesn't happen with gcc < 6?
Comment 3 Richard Biener 2016-03-16 10:48:06 UTC
No idea yet - trying to create a Test-DVD with gdb and the required
debuginfo/source packages to see what happens.

But certainly in ./src/gallium/auxiliary/os/os_thread.h

typedef pthread_barrier_t pipe_barrier;

static inline void pipe_barrier_init(pipe_barrier *barrier, unsigned count)
{
   pthread_barrier_init(barrier, NULL, count);
}

static inline void pipe_barrier_destroy(pipe_barrier *barrier)
{
   pthread_barrier_destroy(barrier);
}

static inline void pipe_barrier_wait(pipe_barrier *barrier)
{
   pthread_barrier_wait(barrier);
}

doesn't verify that pthread_barrier_init succeeds (nor does it propagate
the return value to the caller).
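
For illustration, a variant of these wrappers that does report failure could look like the sketch below. The checked_* names are hypothetical and this is not Mesa's actual code. It relies on the POSIX rule that pthread_barrier_init fails with EINVAL when count is zero, and additionally rejects a zero count up front so the behaviour does not depend on the libc.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

typedef pthread_barrier_t pipe_barrier;

/* Hypothetical checked variant of pipe_barrier_init.
   Returns 0 on success or an errno value such as EINVAL.  */
static inline int checked_pipe_barrier_init(pipe_barrier *barrier, unsigned count)
{
   assert(barrier != NULL);
   if (count == 0)
      return EINVAL;   /* a barrier for zero threads is never valid */
   return pthread_barrier_init(barrier, NULL, count);
}

/* Hypothetical checked variant of pipe_barrier_destroy.
   Must only be called on a barrier that was initialized successfully.  */
static inline int checked_pipe_barrier_destroy(pipe_barrier *barrier)
{
   assert(barrier != NULL);
   return pthread_barrier_destroy(barrier);
}
```

A caller would then skip the destroy call whenever the init call returned non-zero, instead of destroying an all-zero barrier.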

in ./src/gallium/drivers/llvmpipe/lp_rast.c I see

   for (i = 0; i < MAX2(1, num_threads); i++) {
      struct lp_rasterizer_task *task = &rast->tasks[i];
...
   rast->num_threads = num_threads;

   rast->no_rast = debug_get_bool_option("LP_NO_RAST", FALSE);

   create_rast_threads(rast);

   /* for synchronizing rasterization threads */
   pipe_barrier_init( &rast->barrier, rast->num_threads );

so the loop accounts for num_threads < 1, which suggests to me that it can be
zero, and this value is passed to pipe_barrier_init unmodified.

Looks like a non-GCC specific issue (I didn't try to see if it reproduces
with GCC 5).  Note that I have a new LLVM in Staging:Gcc6 as well (not
sure if that matters).

Caller has

   screen->num_threads = util_cpu_caps.nr_cpus > 1 ? util_cpu_caps.nr_cpus : 0;
#ifdef PIPE_SUBSYSTEM_EMBEDDED
   screen->num_threads = 0;
#endif
   screen->num_threads = debug_get_num_option("LP_NUM_THREADS", screen->num_threads);
   screen->num_threads = MIN2(screen->num_threads, LP_MAX_THREADS);

   screen->rast = lp_rast_create(screen->num_threads);

so there are certainly cases where num_threads == 0.  I suppose in that
case the pipe_barrier should not be used at all?  In fact, if nr_cpus is 1
(the default in qemu), num_threads will be zero.
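
The thread-count selection quoted above can be traced with a small sketch. The function name example_num_threads and the cap of 16 are assumptions made for illustration (the real LP_MAX_THREADS is defined in Mesa), and the LP_NUM_THREADS environment override is left out.

```c
#include <assert.h>

/* Assumed cap for illustration only; not taken from the Mesa sources. */
#define EXAMPLE_MAX_THREADS 16u

#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper mirroring the quoted caller logic:
   a single CPU (or none detected) yields zero rasterizer threads.  */
static unsigned example_num_threads(unsigned nr_cpus)
{
   unsigned num_threads = nr_cpus > 1 ? nr_cpus : 0;
   return MIN2(num_threads, EXAMPLE_MAX_THREADS);
}

/* Small self-check showing the single-CPU case discussed above. */
static void example_num_threads_check(void)
{
   assert(example_num_threads(1) == 0);   /* qemu default: one CPU */
   assert(example_num_threads(2) == 2);   /* e.g. qemu -smp cpus=2 */
}
```

The single-CPU result of zero is exactly the value that then reaches pipe_barrier_init as the barrier count.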
Comment 4 Richard Biener 2016-03-16 10:59:56 UTC
Hmm, running X in gdb just hangs the machine :/
Comment 5 Richard Biener 2016-03-16 11:05:07 UTC
Ok, probably "caused" by the new glibc as b02840ba introduced the modulo operation.

commit b02840bacdefde318d2ad2f920e50785b9b25d69
Author: Torvald Riegel <triegel@redhat.com>
Date:   Wed Jun 24 14:37:32 2015 +0200

    New pthread_barrier algorithm to fulfill barrier destruction requirements.
    
    The previous barrier implementation did not fulfill the POSIX requirements
    for when a barrier can be destroyed.  Specifically, it was possible that
    threads that haven't noticed yet that their round is complete still access
    the barrier's memory, and that those accesses can happen after the barrier
    has been legally destroyed.
    The new algorithm does not have this issue, and it avoids using a lock
    internally.


You need to fix Mesa to not call pthread_barrier_init with a zero count
(or to not destroy or wait on such a barrier).  You likely don't wait on it
already, as that has the same modulo operation.

Guard in ./src/gallium/drivers/llvmpipe/lp_rast.c

void lp_rast_destroy( struct lp_rasterizer *rast )
{
...
  /* for synchronizing rasterization threads */
   pipe_barrier_destroy( &rast->barrier );


with a check such as if (rast->num_threads >= 1).

The pipe_thread_wait in this function is already guarded by the loop

   for (i = 0; i < rast->num_threads; i++) {
#ifdef _WIN32
      pipe_semaphore_wait(&rast->tasks[i].work_done);
#else
      pipe_thread_wait(rast->threads[i]);
#endif
   }
Comment 6 Richard Biener 2016-03-16 11:06:16 UTC
Oh, and it is best not to call pipe_barrier_init with num_threads == 0 either.
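
A minimal sketch of guarding both the init and the destroy side, assuming a simplified stand-in for struct lp_rasterizer. The example_* names and the barrier_valid flag are hypothetical and this is not the patch that was submitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct lp_rasterizer; only the relevant fields. */
struct example_rasterizer {
   unsigned num_threads;
   pthread_barrier_t barrier;
   bool barrier_valid;   /* true only after a successful pthread_barrier_init */
};

static void example_rast_init(struct example_rasterizer *rast, unsigned num_threads)
{
   assert(rast != NULL);
   rast->num_threads = num_threads;
   rast->barrier_valid = false;

   /* A zero count is invalid for pthread_barrier_init, so skip it entirely. */
   if (num_threads > 0)
      rast->barrier_valid =
         pthread_barrier_init(&rast->barrier, NULL, num_threads) == 0;
}

static void example_rast_destroy(struct example_rasterizer *rast)
{
   assert(rast != NULL);

   /* Only destroy a barrier that was actually initialized. */
   if (rast->barrier_valid) {
      pthread_barrier_destroy(&rast->barrier);
      rast->barrier_valid = false;
   }
}
```

Tracking validity in a flag rather than re-testing num_threads also covers the case where pthread_barrier_init fails for some other reason.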
Comment 8 Richard Biener 2016-03-16 11:21:28 UTC
And X indeed works fine with qemu -smp cpus=2 ...
Comment 9 Stefan Dirsch 2016-03-16 11:28:55 UTC
Talked to Richard. I'll add the patch to Mesa in X11:XOrg and submit it to factory/TW.
Comment 10 Stefan Dirsch 2016-03-16 11:47:41 UTC
(In reply to Stefan Dirsch from comment #9)
> Talked to Richard. I'll add the patch to Mesa in X11:XOrg and submit it to
> factory/TW.

done (SR#373686)

Richard, feel free to linkpac to obs://X11:XOrg/Mesa for now. ;-)
Comment 11 Bernhard Wiedemann 2016-03-16 12:00:11 UTC
This is an autogenerated message for OBS integration:
This bug (971350) was mentioned in
https://build.opensuse.org/request/show/373686 Factory / Mesa
Comment 12 Bernhard Wiedemann 2016-03-16 20:00:15 UTC
This is an autogenerated message for OBS integration:
This bug (971350) was mentioned in
https://build.opensuse.org/request/show/373998 Factory / Mesa