Bugzilla – Bug 1184339
select() no longer modifies timeout value
Last modified: 2023-12-04 12:20:51 UTC
Created attachment 847988 [details] YaST2 hwinfo pop-up Running "Hardware Information" from the YaST Control Center results in a YaST2 - hwinfo window saying "Probing Hardware..." and DSL with a progress of 1%, see attachment. What is bad is that stopping the process using the Abort button does not work. Did a bit of debugging, see also https://forums.opensuse.org/showthread.php/552284-YAST-Hardware-Information-Hangs I can reproduce the problem using (as root) /usr/bin/ruby.ruby2.7 -rdebug --encoding=utf-8 /usr/lib/YaST2/bin/y2start hwinfo qt -name YaST2 -icon yast I see that /usr/share/YaST2/clients/hwinfo.rb is executed but running that stand-alone is not possible as it needs some Yast::Client so there my debugging ended. Based on the forum thread at least 3 other users have the same problem. Executing hwinfo as root from the command prompt works fine.
Please attach YaST logs https://en.opensuse.org/openSUSE:Report_a_YaST_bug
Created attachment 847991 [details] Yast Debug log
Some more observations: If I run "/usr/bin/ruby.ruby2.7 --encoding=utf-8 /usr/lib/YaST2/bin/y2start hwinfo qt -name YaST2 -icon yast" as normal user, things work fine, the hardware is probe successful and the process completes with a report. If I run the same command as root the process hangs just like starting it from the GUI. The GUI likely also runs it as root as to run YaST2 I have to log in as root.
I can reproduce this in my system. It hangs when calling this with the "pat" variable being ".probe.dsl". > SCR.Read(pat) If I add this line just before that call, all the other probing works like a charm: > return if pat.to_s == ".probe.dsl" So the problem is indeed a libhd hang when probing DSL. And it seems to happen consistently in the current Tumbleweed. Steffen, any clue?
Strange. It hangs only in yast. 'hwinfo --dsl' works fine for me.
For a simpler test case: echo '`Read(.probe.dsl)' | /usr/lib/YaST2/bin/y2base stdio scr
This select call: https://github.com/openSUSE/hwinfo/blob/master/src/hd/pppoe.c#L463 runs into (strace output): 12184 pselect6(1024, [7], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = ? ERESTARTNOHAND (To be restarted if no handler) 12184 --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=240733328, ptr=0x7f980e594c90}} --- 12184 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) while it usually just times out: 12409 pselect6(1024, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 0 (Timeout) Happens only in yast in TW.
Ok, found the difference: in current TW, select() does not adjust the timeout value upon return, while previously it did.
Looking closer, the difference is that in TW, select() ends up as pselect6() while in Leap it's still select(). I've no idea whether it's a glibc or kernel change. But select(2) still says: "On Linux, select() also modifies timeout if the call is interrupted by a signal handler" So, it's either a bug or needs clarification how to handle select() in future.
A good find that select() ends up as pselect6() for Thumbleweed while in Leap it's still is select() Any clues on why running as root the hang is there but if you run as normal user, everything is fine?
As a normal user, the step is skipped.
Created attachment 848031 [details] test program FWIW, here's a small test program demonstrating the issue.
Basically: don't rely on the timeout argument after select() returns. I'll cite from current pselect6(2)/select(2): ------------------------------------ On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1 permits ei- ther behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. /Consider timeout to be undefined after se- lect() returns./ ... RETURN VALUE On success, select() and pselect() return the number of file descrip- tors contained in the three returned descriptor sets (that is, the to- tal number of bits that are set in readfds, writefds, exceptfds) which may be zero if the timeout expires before anything interesting happens. On error, -1 is returned, and errno is set to indicate the error; the file descriptor sets are unmodified, /and timeout becomes undefined./ ... The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function. The Linux pselect6() system call modifies its timeout argument. How- ever, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. /Thus, the glibc pselect() function does not modify its timeout argu- ment; this is the behavior required by POSIX.1-2001./ ------------------------------- So, the syscalls do modify timeout on return, but the glibc wrapper differs: for select(), when using select(2) it reflects the modifications, when using pselect6(2) it doesn't. And pselect() (which you aren't using, but for completeness) is even required to not modify it. All in all: don't rely on the value of timeout after return. That implies that you have to reset it before each call to select. (Random note on your testprogram, fprintf() might also return with errno modified, so for reliability you'd have to save errno before it for testing it in the loop condition) FWIW: this specific behaviour was a change in glibc (2433d39b) that made the update of the timeout argument be dependend on select (and pselect6) returning successfully, when formerly it was always updated after the syscall, but only when timespec64 isn't supported. That's arguably a change in behaviour that might have been unintended. Still, for future proofing: don't rely on the timeout argument after function return.
Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and my personal feeling is that here idiologies clash. It's fair to say it's a portability issue - but the issue here is that the behavior changed *on linux* - and it has been stable for ages. And even *if* you make a case for glibc adopting the new behavior you're still left with the task to update the documentation. So far it feels like some collateral damage.
(In reply to Steffen Winterfeldt from comment #14) > Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and > my personal feeling is that here idiologies clash. > > It's fair to say it's a portability issue - but the issue here is that the > behavior changed *on linux* - and it has been stable for ages. And even *if* > you make a case for glibc adopting the new behavior you're still left with > the > task to update the documentation. > > So far it feels like some collateral damage. Sure, no doubt about that. But given the realities that we live in I think it best if you'd adapt the client code of select to not rely on the timeout value. Even if glibc were changed back it would still be only in some next version. And if only the docu were changed you'd still have to change your code, so: rock and hard place :-)
FWIW, I've changed hwinfo to not rely on select() behavior: https://github.com/openSUSE/hwinfo/pull/95
This is an autogenerated message for OBS integration: This bug (1184339) was mentioned in https://build.opensuse.org/request/show/885031 Factory / glibc
*** Bug 1184708 has been marked as a duplicate of this bug. ***
Tested Tumbleweed version 20210420 and problem solved. Thanks!
Fixed.
SUSE-RU-2022:2678-1: An update that has three recommended fixes can now be installed. Category: recommended (important) Bug References: 1184339,1198043,1199948 CVE References: JIRA References: Sources used: openSUSE Leap 15.3 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Module for Basesystem 15-SP3 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Micro 5.2 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Micro 5.1 (src): hwinfo-21.82-150300.3.3.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
SUSE-RU-2022:2678-2: An update that has three recommended fixes can now be installed. Category: recommended (important) Bug References: 1184339,1198043,1199948 CVE References: JIRA References: Sources used: openSUSE Leap Micro 5.2 (src): hwinfo-21.82-150300.3.3.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.