|
Bugzilla – Full Text Bug Listing |
| Summary: | Version 1.9.1-2.1 of irqbalance generates journal error | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Gene Snider <genes1122> |
| Component: | Other | Assignee: | Thomas Renninger <trenn> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | dmueller, genes1122, josef.moellers, opensuse, trenn |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE Tumbleweed | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
|
Description
Gene Snider
2022-10-23 16:41:27 UTC
SR#1029930 enabled thermald support Does that mean irqbalance is broken? My understanding is that AMD CPUs don't support thermald. What is the status of this bug report? Thanks, Gene (In reply to Gene Snider from comment #2) > Does that mean irqbalance is broken? You reported a coredump, unfortunately without much details so far. > My understanding is that AMD CPUs > don't support thermald. What is the status of this bug report? You're right that it loggs an error on missing thermal, but that is harmless (I have it here as well on a ryzen cpu). it certainly is not the trigger of an abort as far as I can say. could you please systemctl stop irqbalance and then run "irqbalance -d --foreground" as root and print the output? also if you can capture the core file, would you mind running "gdb /usr/sbin/irqbalance <corefilepath>" and then type "bt" and paste the result? > > Thanks, > > Gene Request 1: Since I have the old rpm, I diffed the outputs for brevity. Old version: # zypper -q se -sx irqbalance.x86_64 S | Name | Type | Version | Arch | Repository ---+------------+---------+-----------+--------+---------------------- i+ | irqbalance | package | 1.9.1-1.1 | x86_64 | (System Packages) v | irqbalance | package | 1.9.1-2.1 | x86_64 | Main Repository (OSS) # irqbalance -d --foreground > irqbalance.old New version: # zypper -q se -sx irqbalance.x86_64 S | Name | Type | Version | Arch | Repository ---+------------+---------+-----------+--------+---------------------- i+ | irqbalance | package | 1.9.1-2.1 | x86_64 | Main Repository (OSS) # irqbalance -d --foreground > irqbalance.new Differences: # diff -s irqbalance.old irqbalance.new 2a3 > Prevent irq assignment to these thermal-banned CPUs: 00000000 94a96 > thermal: received group id (3). The first message looks reasonable for a CPU that doesn't support thermald, I'm not sure about the second message. However, the message in the journal is confusing. Oct 24 08:35:16 Mobile-PC /usr/sbin/irqbalance[3546]: thermal: socket bind failed. It sounds like something went wrong, when in fact, the action was normal for a non-Intel CPU. Request 2: Results of gdb backtrace for the irqbalance coredump: # coredumpctl -1 debug PID: 943 (irqbalance) UID: 0 (root) GID: 0 (root) Signal: 6 (ABRT) Timestamp: Mon 2022-10-24 08:49:03 PDT (1h 33min ago) Command Line: /usr/sbin/irqbalance --foreground Executable: /usr/sbin/irqbalance Control Group: /system.slice/irqbalance.service Unit: irqbalance.service Slice: system.slice Boot ID: 1fe3c0c81e464baab607f078ceddadd1 Machine ID: 67007840ea464a92b139e093c5491dfd Hostname: Mobile-PC Storage: /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1.943.1666626543000000.zst (present) Disk Size: 58.6K Message: Process 943 (irqbalance) of user 0 dumped core. #0 0x00007f9d49c3980c in __pthread_kill_implementation () from /lib64/libc.so.6 [Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))] Missing separate debuginfos, use: zypper install irqbalance-debuginfo-1.9.1-2.1.x86_64 ... (gdb) bt #0 0x00007f9d49c3980c in __pthread_kill_implementation () from /lib64/libc.so.6 #1 0x00007f9d49be6846 in raise () from /lib64/libc.so.6 #2 0x00007f9d49bcf81c in abort () from /lib64/libc.so.6 #3 0x00007f9d49c2c9ae in __libc_message () from /lib64/libc.so.6 #4 0x00007f9d49c4414c in malloc_printerr () from /lib64/libc.so.6 #5 0x00007f9d49c46626 in _int_free () from /lib64/libc.so.6 #6 0x00007f9d49c48b13 in free () from /lib64/libc.so.6 #7 0x0000564f3cfa6f41 in ?? () #8 0x00007f9d49bd05b0 in __libc_start_call_main () from /lib64/libc.so.6 #9 0x00007f9d49bd0679 in __libc_start_main_impl () from /lib64/libc.so.6 #10 0x0000564f3cfa81a5 in ?? () (gdb) (In reply to Gene Snider from comment #4) > > Prevent irq assignment to these thermal-banned CPUs: 00000000 > 94a96 > > thermal: received group id (3). > > The first message looks reasonable for a CPU that doesn't support thermald, > I'm not sure about the second message. However, the message in the journal > is confusing. I agree, will submit a ticket upstream to not make that an error for AMD cpus. > Oct 24 08:35:16 Mobile-PC /usr/sbin/irqbalance[3546]: thermal: socket bind > failed. > > It sounds like something went wrong, when in fact, the action was normal for > a non-Intel CPU. correct. > Request 2: > Results of gdb backtrace for the irqbalance coredump: > > # coredumpctl -1 debug > PID: 943 (irqbalance) > UID: 0 (root) > GID: 0 (root) > Signal: 6 (ABRT) > Timestamp: Mon 2022-10-24 08:49:03 PDT (1h 33min ago) > Command Line: /usr/sbin/irqbalance --foreground > Executable: /usr/sbin/irqbalance > Control Group: /system.slice/irqbalance.service > Unit: irqbalance.service > Slice: system.slice > Boot ID: 1fe3c0c81e464baab607f078ceddadd1 > Machine ID: 67007840ea464a92b139e093c5491dfd > Hostname: Mobile-PC > Storage: > /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1. > 943.1666626543000000.zst (present) > Disk Size: 58.6K > Message: Process 943 (irqbalance) of user 0 dumped core. > #0 0x00007f9d49c3980c in __pthread_kill_implementation () from > /lib64/libc.so.6 > [Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))] > Missing separate debuginfos, use: zypper install > irqbalance-debuginfo-1.9.1-2.1.x86_64 > > ... > > (gdb) bt > #0 0x00007f9d49c3980c in __pthread_kill_implementation () from > /lib64/libc.so.6 > #1 0x00007f9d49be6846 in raise () from /lib64/libc.so.6 > #2 0x00007f9d49bcf81c in abort () from /lib64/libc.so.6 > #3 0x00007f9d49c2c9ae in __libc_message () from /lib64/libc.so.6 > #4 0x00007f9d49c4414c in malloc_printerr () from /lib64/libc.so.6 > #5 0x00007f9d49c46626 in _int_free () from /lib64/libc.so.6 > #6 0x00007f9d49c48b13 in free () from /lib64/libc.so.6 > #7 0x0000564f3cfa6f41 in ?? () > #8 0x00007f9d49bd05b0 in __libc_start_call_main () from /lib64/libc.so.6 > #9 0x00007f9d49bd0679 in __libc_start_main_impl () from /lib64/libc.so.6 > #10 0x0000564f3cfa81a5 in ?? () > (gdb) okay awesome. can you install the debuginfo package also and then either do this exercise again, or run it manually via "valgrind irqbalance --debug --foreground" and paste the output. I've tried my best to reproduce that on a ryzen/tumbleweed system, no luck so far. do you get this sigabort after a wakeup from suspend/ resume? I will install the debug package ASAP and run the back trace again. For now, this SIGABORT occurs at boot. The CPU is a Zen 3 AMD Ryzen 7 5825U with Radeon Graphics and sleep is broken. Gene I installed all the debug packages requested by gdb and got this result:
# coredumpctl -1 debug
PID: 943 (irqbalance)
UID: 0 (root)
GID: 0 (root)
Signal: 6 (ABRT)
Timestamp: Mon 2022-10-24 08:49:03 PDT (1 day 1h ago)
Command Line: /usr/sbin/irqbalance --foreground
Executable: /usr/sbin/irqbalance
Control Group: /system.slice/irqbalance.service
Unit: irqbalance.service
Slice: system.slice
Boot ID: 1fe3c0c81e464baab607f078ceddadd1
Machine ID: 67007840ea464a92b139e093c5491dfd
Hostname: Mobile-PC
Storage: /var/lib/systemd/coredump/core.irqbalance.0.1fe3c0c81e464baab607f078ceddadd1.943.1666626543000000.zst (present)
Disk Size: 58.6K
Message: Process 943 (irqbalance) of user 0 dumped core.
...
Core was generated by `/usr/sbin/irqbalance --foreground'.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/943' in core file too small.
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at pthread_kill.c:44
44 pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f9d49af2780 (LWP 943))]
(gdb) bt
#0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6,
no_tid=no_tid@entry=0) at pthread_kill.c:44
#1 0x00007f9d49c39893 in __pthread_kill_internal (signo=6, threadid=<optimized out>)
at pthread_kill.c:78
#2 0x00007f9d49be6846 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3 0x00007f9d49bcf81c in __GI_abort () at abort.c:79
#4 0x00007f9d49c2c9ae in __libc_message (action=action@entry=do_abort,
fmt=fmt@entry=0x7f9d49d5544f "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#5 0x00007f9d49c4414c in malloc_printerr (
str=str@entry=0x7f9d49d58088 "free(): double free detected in tcache 2") at malloc.c:5660
#6 0x00007f9d49c46626 in _int_free (av=0x7f9d49d8ec60 <main_arena>, p=0x564f3e856190,
have_lock=have_lock@entry=0) at malloc.c:4469
#7 0x00007f9d49c48b13 in __GI___libc_free (mem=<optimized out>) at malloc.c:3385
#8 0x00007f9d49e99938 in nl_socket_free (sk=<optimized out>) at lib/socket.c:250
#9 0x0000564f3cfacea0 in deinit_thermal () at /usr/src/debug/irqbalance-1.9.1/thermal.c:510
#10 0x0000564f3cfa6f41 in main (argc=<optimized out>, argv=<optimized out>)
at /usr/src/debug/irqbalance-1.9.1/irqbalance.c:722
(gdb)
Gene
On my machine I have a similar (the same?) problem, but it is an Intel machine:
Operating System: openSUSE Tumbleweed 20221023
Kernel Version: 6.0.3-1-default (64-bit)
Processors: 8 × 11th Gen Intel® Core™ i7-1165G7 @ 2.80GHz
Memory: 15,0 GiB of RAM
Graphics Processor: Mesa Intel® Xe Graphics
Manufacturer: Dell Inc.
Product Name: XPS 13 9310 2-in-1
systemd-coredump[5213]: Process 1194 (irqbalance) of user 0 dumped core
Module linux-vdso.so.1 with build-id 579e18c134a1f616b0583ad0ffab87c51446b7d0
Module libpcre2-8.so.0 with build-id 17925aea3c556e532db80b000dc37241719a22e4
Module ld-linux-x86-64.so.2 with build-id 38764db6e54b75b0b7046cddf3536c594729a9de
Module libc.so.6 with build-id 8ae7094d0a249562d918dcb19926298cca072d72
Module libm.so.6 with build-id d61a4ea5dc902fdf548527f785b604362959dfe5
Module libnl-3.so.200 with build-id 52b7ded7122b78e7e6bb9874717c37ddc1ce75dc
Module libnl-genl-3.so.200 with build-id 8604e7cb8d147c2a31d29c6ea467dbc6e4edf93c
Module libnuma.so.1 with build-id bc3eb8df9ec9f6ceb791c5f2bacdd0d87009df6c
Module libglib-2.0.so.0 with build-id 1f2d864b71cf4b6bb70b21343eb155f01d78b482
Module libcap-ng.so.0 with build-id dfb546870d6b1921453de4a9650d88111514d723
Module irqbalance with build-id 894b14c3f20ce73cac24b81012e37c5164925988
Stack trace of thread 1194:
#0 0x00007f802e46a80c __pthread_kill_implementation (libc.so.6 + 0x9080c)
#1 0x00007f802e417846 raise (libc.so.6 + 0x3d846)
#2 0x00007f802e40081c abort (libc.so.6 + 0x2681c)
#3 0x00007f802e40072b __assert_fail_base.cold (libc.so.6 + 0x2672b)
#4 0x00007f802e40fc46 __assert_fail (libc.so.6 + 0x35c46)
#5 0x00007f802e6c61dd nl_cb_put (libnl-3.so.200 + 0xd1dd)
#6 0x00007f802e6ca92f nl_socket_free (libnl-3.so.200 + 0x1192f)
#7 0x0000556c6b718f41 n/a (irqbalance + 0x3f41)
#8 0x00007f802e4015b0 __libc_start_call_main (libc.so.6 + 0x275b0)
#9 0x00007f802e401679 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x27679)
#10 0x0000556c6b71a1a5 n/a (irqbalance + 0x51a5)
Stack trace of thread 1210:
#0 0x00007f802e4e0a8f __poll (libc.so.6 + 0x106a8f)
#1 0x00007f802e74dd7e n/a (libglib-2.0.so.0 + 0x58d7e)
#2 0x00007f802e74de9c g_main_context_iteration (libglib-2.0.so.0 + 0x58e9c)
#3 0x00007f802e74dee1 n/a (libglib-2.0.so.0 + 0x58ee1)
#4 0x00007f802e7782a5 n/a (libglib-2.0.so.0 + 0x832a5)
#5 0x00007f802e4689ad start_thread (libc.so.6 + 0x8e9ad)
#6 0x00007f802e4ef290 __clone3 (libc.so.6 + 0x115290)
ELF object binary architecture: AMD x86-64
thanks for the detailed backtrace. I submitted a potential fix (pick it up from Base:System or wait until it enters the next factory snapshot). Please let me know if it does resolve the issue you're seeing. This is an autogenerated message for OBS integration: This bug (1204607) was mentioned in https://build.opensuse.org/request/show/1031212 Factory / irqbalance The double free core dump appears to be fixed by irqbalance-1.9.1-244.1. Hopefully it works for Michael Hirmke as well. Thanks!! Gene Michael, please reopen if there is still an issue for you. the new package has landed in factory. (In reply to Dirk Mueller from comment #14) > Michael, please reopen if there is still an issue for you. the new package > has landed in factory. After installing openSUSE Tumbleweed 20221027 I get: /usr/sbin/irqbalance[1221]: thermal: socket bind failed. /usr/sbin/irqbalance[1221]: Failed to initialize thermal events. systemd[1]: irqbalance.service: Deactivated successfully. Is that the expected behaviour? the "deactivated successfully" means that the service exited with exit code 0. which means it is no longer coredumping, which was the issue in this bugreport. there are valid reasons where irqbalance thinks it has nothing to do, so the deactivation is not necessarily a problem. if it is new behavior then it is worth a separate bugreport for investigation. (In reply to Dirk Mueller from comment #17) > the "deactivated successfully" means that the service exited with exit code > 0. which means it is no longer coredumping, which was the issue in this > bugreport. > > there are valid reasons where irqbalance thinks it has nothing to do, so the > deactivation is not necessarily a problem. if it is new behavior then it is > worth a separate bugreport for investigation. https://bugzilla.suse.com/show_bug.cgi?id=1205308 |