Bug 1204607

Summary: Version 1.9.1-2.1 of irqbalance generates journal error
Product: [openSUSE] openSUSE Tumbleweed Reporter: Gene Snider <genes1122>
Component: OtherAssignee: Thomas Renninger <trenn>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: dmueller, genes1122, josef.moellers, opensuse, trenn
Version: Current   
Target Milestone: ---   
Hardware: x86-64   
OS: openSUSE Tumbleweed   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Gene Snider 2022-10-23 16:41:27 UTC
The update to irqbalancd-1.9.1-2.1 produced a journalctl error and a coredump at boot.

> journalctl -b -p err
Oct 23 09:31:50 Mobile-PC /usr/sbin/irqbalance[916]: thermal: socket bind failed.

> coredumpctl | grep irqbalance
Sat 2022-10-22 12:04:42 PDT  994    0    0 SIGABRT inaccessible /usr/sbin/irqbalance            n/a
Sun 2022-10-23 09:32:32 PDT  916    0    0 SIGABRT inaccessible /usr/sbin/irqbalance            n/a

This occurs for kernel 6.0 and kernel 6.1.

Thanks,

Gene
Comment 1 Andreas Stieger 2022-10-23 16:57:02 UTC
SR#1029930 enabled thermald support
Comment 2 Gene Snider 2022-10-23 23:12:24 UTC
Does that mean irqbalance is broken?  My understanding is that AMD CPUs don't support thermald.  What is the status of this bug report?

Thanks,

Gene
Comment 3 Dirk Mueller 2022-10-24 07:15:29 UTC
(In reply to Gene Snider from comment #2)

> Does that mean irqbalance is broken?  

You reported a coredump, unfortunately without much details so far. 

> My understanding is that AMD CPUs
> don't support thermald.  What is the status of this bug report?

You're right that it loggs an error on missing thermal, but that is harmless (I have it here as well on a ryzen cpu). it certainly is not the trigger of an abort as far as I can say. 

could you please systemctl stop irqbalance and then run "irqbalance -d --foreground" as root and print the output?

also if you can capture the core file, would you mind running "gdb /usr/sbin/irqbalance <corefilepath>" and then type "bt" and paste the result?



> 
> Thanks,
> 
> Gene
Comment 4 Gene Snider 2022-10-24 18:08:46 UTC
Request 1:

Since I have the old rpm, I diffed the outputs for brevity.

Old version:
# zypper -q se -sx irqbalance.x86_64

S  | Name       | Type    | Version   | Arch   | Repository
---+------------+---------+-----------+--------+----------------------
i+ | irqbalance | package | 1.9.1-1.1 | x86_64 | (System Packages)
v  | irqbalance | package | 1.9.1-2.1 | x86_64 | Main Repository (OSS)

# irqbalance -d --foreground > irqbalance.old

New version:
# zypper -q se -sx irqbalance.x86_64

S  | Name       | Type    | Version   | Arch   | Repository
---+------------+---------+-----------+--------+----------------------
i+ | irqbalance | package | 1.9.1-2.1 | x86_64 | Main Repository (OSS)

# irqbalance -d --foreground > irqbalance.new

Differences:
# diff -s irqbalance.old irqbalance.new

2a3
> Prevent irq assignment to these thermal-banned CPUs: 00000000
94a96
> thermal: received group id (3).

The first message looks reasonable for a CPU that doesn't support thermald, I'm not sure about the second message.  However, the message in the journal is confusing.  

Oct 24 08:35:16 Mobile-PC /usr/sbin/irqbalance[3546]: thermal: socket bind failed.

It sounds like something went wrong, when in fact, the action was normal for a non-Intel CPU.

Request 2:
Results of gdb backtrace for the irqbalance coredump:

# coredumpctl -1 debug
           PID: 943 (irqbalance)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 6 (ABRT)
     Timestamp: Mon 2022-10-24 08:49:03 PDT (1h 33min ago)
  Command Line: /usr/sbin/irqbalance --foreground
    Executable: /usr/sbin/irqbalance
 Control Group: /system.slice/irqbalance.service
          Unit: irqbalance.service
         Slice: system.slice
       Boot ID: 1fe3c0c81e464baab607f078ceddadd1
    Machine ID: 67007840ea464a92b139e093c5491dfd
      Hostname: Mobile-PC
       Storage: /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1.943.1666626543000000.zst (present)
     Disk Size: 58.6K
       Message: Process 943 (irqbalance) of user 0 dumped core.
#0  0x00007f9d49c3980c in __pthread_kill_implementation () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))]
Missing separate debuginfos, use: zypper install irqbalance-debuginfo-1.9.1-2.1.x86_64

...

(gdb) bt
#0  0x00007f9d49c3980c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f9d49be6846 in raise () from /lib64/libc.so.6
#2  0x00007f9d49bcf81c in abort () from /lib64/libc.so.6
#3  0x00007f9d49c2c9ae in __libc_message () from /lib64/libc.so.6
#4  0x00007f9d49c4414c in malloc_printerr () from /lib64/libc.so.6
#5  0x00007f9d49c46626 in _int_free () from /lib64/libc.so.6
#6  0x00007f9d49c48b13 in free () from /lib64/libc.so.6
#7  0x0000564f3cfa6f41 in ?? ()
#8  0x00007f9d49bd05b0 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007f9d49bd0679 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x0000564f3cfa81a5 in ?? ()
(gdb)
Comment 5 Dirk Mueller 2022-10-25 16:25:04 UTC
(In reply to Gene Snider from comment #4)

> > Prevent irq assignment to these thermal-banned CPUs: 00000000
> 94a96
> > thermal: received group id (3).
> 
> The first message looks reasonable for a CPU that doesn't support thermald,
> I'm not sure about the second message.  However, the message in the journal
> is confusing.  

I agree, will submit a ticket upstream to not make that an error for AMD cpus.

> Oct 24 08:35:16 Mobile-PC /usr/sbin/irqbalance[3546]: thermal: socket bind
> failed.
> 
> It sounds like something went wrong, when in fact, the action was normal for
> a non-Intel CPU.

correct. 

> Request 2:
> Results of gdb backtrace for the irqbalance coredump:
> 
> # coredumpctl -1 debug
>            PID: 943 (irqbalance)
>            UID: 0 (root)
>            GID: 0 (root)
>         Signal: 6 (ABRT)
>      Timestamp: Mon 2022-10-24 08:49:03 PDT (1h 33min ago)
>   Command Line: /usr/sbin/irqbalance --foreground
>     Executable: /usr/sbin/irqbalance
>  Control Group: /system.slice/irqbalance.service
>           Unit: irqbalance.service
>          Slice: system.slice
>        Boot ID: 1fe3c0c81e464baab607f078ceddadd1
>     Machine ID: 67007840ea464a92b139e093c5491dfd
>       Hostname: Mobile-PC
>        Storage:
> /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1.
> 943.1666626543000000.zst (present)
>      Disk Size: 58.6K
>        Message: Process 943 (irqbalance) of user 0 dumped core.
> #0  0x00007f9d49c3980c in __pthread_kill_implementation () from
> /lib64/libc.so.6
> [Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))]
> Missing separate debuginfos, use: zypper install
> irqbalance-debuginfo-1.9.1-2.1.x86_64
> 
> ...
> 
> (gdb) bt
> #0  0x00007f9d49c3980c in __pthread_kill_implementation () from
> /lib64/libc.so.6
> #1  0x00007f9d49be6846 in raise () from /lib64/libc.so.6
> #2  0x00007f9d49bcf81c in abort () from /lib64/libc.so.6
> #3  0x00007f9d49c2c9ae in __libc_message () from /lib64/libc.so.6
> #4  0x00007f9d49c4414c in malloc_printerr () from /lib64/libc.so.6
> #5  0x00007f9d49c46626 in _int_free () from /lib64/libc.so.6
> #6  0x00007f9d49c48b13 in free () from /lib64/libc.so.6
> #7  0x0000564f3cfa6f41 in ?? ()
> #8  0x00007f9d49bd05b0 in __libc_start_call_main () from /lib64/libc.so.6
> #9  0x00007f9d49bd0679 in __libc_start_main_impl () from /lib64/libc.so.6
> #10 0x0000564f3cfa81a5 in ?? ()
> (gdb)


okay awesome. can you install the debuginfo package also and then either do this exercise again, or run it manually via "valgrind irqbalance --debug --foreground" and paste the output.
Comment 6 Dirk Mueller 2022-10-25 16:25:34 UTC
I've tried my best to reproduce that on a ryzen/tumbleweed system, no luck so far.
Comment 7 Dirk Mueller 2022-10-25 16:27:57 UTC
do you get this sigabort after a wakeup from suspend/ resume?
Comment 8 Gene Snider 2022-10-25 17:25:23 UTC
I will install the debug package ASAP and run the back trace again.  For now, this SIGABORT occurs at boot.  The CPU is a Zen 3 AMD Ryzen 7 5825U with Radeon Graphics and sleep is broken.

Gene
Comment 9 Gene Snider 2022-10-25 18:22:38 UTC
I installed all the debug packages requested by gdb and got this result:

# coredumpctl -1 debug
           PID: 943 (irqbalance)
           UID: 0 (root)
           GID: 0 (root)
        Signal: 6 (ABRT)
     Timestamp: Mon 2022-10-24 08:49:03 PDT (1 day 1h ago)
  Command Line: /usr/sbin/irqbalance --foreground
    Executable: /usr/sbin/irqbalance
 Control Group: /system.slice/irqbalance.service
          Unit: irqbalance.service
         Slice: system.slice
       Boot ID: 1fe3c0c81e464baab607f078ceddadd1
    Machine ID: 67007840ea464a92b139e093c5491dfd
      Hostname: Mobile-PC
       Storage: /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1.943.1666626543000000.zst (present)
     Disk Size: 58.6K
       Message: Process 943 (irqbalance) of user 0 dumped core.

...

Core was generated by `/usr/sbin/irqbalance --foreground'.
Program terminated with signal SIGABRT, Aborted.

warning: Section `.reg-xstate/943' in core file too small.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, 
    no_tid=no_tid@entry=0) at pthread_kill.c:44
44	pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, 
    no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f9d49c39893 in __pthread_kill_internal (signo=6, threadid=<optimized out>)
    at pthread_kill.c:78
#2  0x00007f9d49be6846 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f9d49bcf81c in __GI_abort () at abort.c:79
#4  0x00007f9d49c2c9ae in __libc_message (action=action@entry=do_abort, 
    fmt=fmt@entry=0x7f9d49d5544f "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5  0x00007f9d49c4414c in malloc_printerr (
    str=str@entry=0x7f9d49d58088 "free(): double free detected in tcache 2") at malloc.c:5660
#6  0x00007f9d49c46626 in _int_free (av=0x7f9d49d8ec60 <main_arena>, p=0x564f3e856190, 
    have_lock=have_lock@entry=0) at malloc.c:4469
#7  0x00007f9d49c48b13 in __GI___libc_free (mem=<optimized out>) at malloc.c:3385
#8  0x00007f9d49e99938 in nl_socket_free (sk=<optimized out>) at lib/socket.c:250
#9  0x0000564f3cfacea0 in deinit_thermal () at /usr/src/debug/irqbalance-1.9.1/thermal.c:510
#10 0x0000564f3cfa6f41 in main (argc=<optimized out>, argv=<optimized out>)
    at /usr/src/debug/irqbalance-1.9.1/irqbalance.c:722
(gdb)

Gene
Comment 10 Michael Hirmke 2022-10-25 18:40:11 UTC
On my machine I have a similar (the same?) problem, but it is an Intel machine:

Operating System: openSUSE Tumbleweed 20221023
Kernel Version: 6.0.3-1-default (64-bit)
Processors: 8 × 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz
Memory: 15,0 GiB of RAM
Graphics Processor: Mesa Intel® Xe Graphics
Manufacturer: Dell Inc.
Product Name: XPS 13 9310 2-in-1

systemd-coredump[5213]: Process 1194 (irqbalance) of user 0 dumped core

Module linux-vdso.so.1 with build-id 579e18c134a1f616b0583ad0ffab87c51446b7d0
Module libpcre2-8.so.0 with build-id 17925aea3c556e532db80b000dc37241719a22e4
Module ld-linux-x86-64.so.2 with build-id 38764db6e54b75b0b7046cddf3536c594729a9de
Module libc.so.6 with build-id 8ae7094d0a249562d918dcb19926298cca072d72
Module libm.so.6 with build-id d61a4ea5dc902fdf548527f785b604362959dfe5
Module libnl-3.so.200 with build-id 52b7ded7122b78e7e6bb9874717c37ddc1ce75dc
Module libnl-genl-3.so.200 with build-id 8604e7cb8d147c2a31d29c6ea467dbc6e4edf93c
Module libnuma.so.1 with build-id bc3eb8df9ec9f6ceb791c5f2bacdd0d87009df6c
Module libglib-2.0.so.0 with build-id 1f2d864b71cf4b6bb70b21343eb155f01d78b482
Module libcap-ng.so.0 with build-id dfb546870d6b1921453de4a9650d88111514d723
Module irqbalance with build-id 894b14c3f20ce73cac24b81012e37c5164925988
Stack trace of thread 1194:
#0  0x00007f802e46a80c __pthread_kill_implementation (libc.so.6 + 0x9080c)
#1  0x00007f802e417846 raise (libc.so.6 + 0x3d846)
#2  0x00007f802e40081c abort (libc.so.6 + 0x2681c)  
#3  0x00007f802e40072b __assert_fail_base.cold (libc.so.6 + 0x2672b)
#4  0x00007f802e40fc46 __assert_fail (libc.so.6 + 0x35c46) 
#5  0x00007f802e6c61dd nl_cb_put (libnl-3.so.200 + 0xd1dd)
#6  0x00007f802e6ca92f nl_socket_free (libnl-3.so.200 + 0x1192f)
#7  0x0000556c6b718f41 n/a (irqbalance + 0x3f41)
#8  0x00007f802e4015b0 __libc_start_call_main (libc.so.6 + 0x275b0)
#9  0x00007f802e401679 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27679)
#10 0x0000556c6b71a1a5 n/a (irqbalance + 0x51a5)    
                                                    
Stack trace of thread 1210:
#0  0x00007f802e4e0a8f __poll (libc.so.6 + 0x106a8f)
#1  0x00007f802e74dd7e n/a (libglib-2.0.so.0 + 0x58d7e) 
#2  0x00007f802e74de9c g_main_context_iteration (libglib-2.0.so.0 + 0x58e9c)
#3  0x00007f802e74dee1 n/a (libglib-2.0.so.0 + 0x58ee1) 
#4  0x00007f802e7782a5 n/a (libglib-2.0.so.0 + 0x832a5) 
#5  0x00007f802e4689ad start_thread (libc.so.6 + 0x8e9ad)
#6  0x00007f802e4ef290 __clone3 (libc.so.6 + 0x115290)  
ELF object binary architecture: AMD x86-64
Comment 11 Dirk Mueller 2022-10-25 21:39:44 UTC
thanks for the detailed backtrace. I submitted a potential fix (pick it up from Base:System or wait until it enters the next factory snapshot).

Please let me know if it does resolve the issue you're seeing.
Comment 12 OBSbugzilla Bot 2022-10-25 22:15:03 UTC
This is an autogenerated message for OBS integration:
This bug (1204607) was mentioned in
https://build.opensuse.org/request/show/1031212 Factory / irqbalance
Comment 13 Gene Snider 2022-10-25 23:44:46 UTC
The double free core dump appears to be fixed by irqbalance-1.9.1-244.1.  Hopefully it works for Michael Hirmke as well.

Thanks!!

Gene
Comment 14 Dirk Mueller 2022-10-28 15:09:30 UTC
Michael, please reopen if there is still an issue for you. the new package has landed in factory.
Comment 16 Michael Hirmke 2022-10-29 09:39:00 UTC
(In reply to Dirk Mueller from comment #14)
> Michael, please reopen if there is still an issue for you. the new package
> has landed in factory.

After installing openSUSE Tumbleweed 20221027 I get:


/usr/sbin/irqbalance[1221]: thermal: socket bind failed.
/usr/sbin/irqbalance[1221]: Failed to initialize thermal events.
systemd[1]: irqbalance.service: Deactivated successfully.

Is that the expected behaviour?
Comment 17 Dirk Mueller 2022-11-10 08:32:29 UTC
the "deactivated successfully" means that the service exited with exit code 0. which means it is no longer coredumping, which was the issue in this bugreport. 

there are valid reasons where irqbalance thinks it has nothing to do, so the deactivation is not necessarily a problem. if it is new behavior then it is worth a separate bugreport for investigation.
Comment 18 Michael Hirmke 2022-11-10 20:54:42 UTC
(In reply to Dirk Mueller from comment #17)
> the "deactivated successfully" means that the service exited with exit code
> 0. which means it is no longer coredumping, which was the issue in this
> bugreport. 
> 
> there are valid reasons where irqbalance thinks it has nothing to do, so the
> deactivation is not necessarily a problem. if it is new behavior then it is
> worth a separate bugreport for investigation.

https://bugzilla.suse.com/show_bug.cgi?id=1205308