|
Bugzilla – Full Text Bug Listing |
| Summary: | select() no longer modifies timeout value | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | M Fredericks <emfee> |
| Component: | YaST2 | Assignee: | Andreas Schwab <schwab> |
| Status: | RESOLVED FIXED | QA Contact: | Jiri Srain <jsrain> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | ancor, emfee, ioannis.bonatakis, matz, snwint |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE Tumbleweed | ||
| URL: | https://trello.com/c/utNuL3Pd | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
YaST2 hwinfo pop-up
Yast Debug log test program |
||
|
Description
M Fredericks
2021-04-05 14:56:50 UTC
Please attach YaST logs https://en.opensuse.org/openSUSE:Report_a_YaST_bug Created attachment 847991 [details]
Yast Debug log
Some more observations: If I run "/usr/bin/ruby.ruby2.7 --encoding=utf-8 /usr/lib/YaST2/bin/y2start hwinfo qt -name YaST2 -icon yast" as normal user, things work fine, the hardware is probe successful and the process completes with a report. If I run the same command as root the process hangs just like starting it from the GUI. The GUI likely also runs it as root as to run YaST2 I have to log in as root. I can reproduce this in my system. It hangs when calling this with the "pat" variable being ".probe.dsl". > SCR.Read(pat) If I add this line just before that call, all the other probing works like a charm: > return if pat.to_s == ".probe.dsl" So the problem is indeed a libhd hang when probing DSL. And it seems to happen consistently in the current Tumbleweed. Steffen, any clue? Strange. It hangs only in yast. 'hwinfo --dsl' works fine for me. For a simpler test case: echo '`Read(.probe.dsl)' | /usr/lib/YaST2/bin/y2base stdio scr This select call: https://github.com/openSUSE/hwinfo/blob/master/src/hd/pppoe.c#L463 runs into (strace output): 12184 pselect6(1024, [7], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = ? ERESTARTNOHAND (To be restarted if no handler) 12184 --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=240733328, ptr=0x7f980e594c90}} --- 12184 rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call) while it usually just times out: 12409 pselect6(1024, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 0 (Timeout) Happens only in yast in TW. Ok, found the difference: in current TW, select() does not adjust the timeout value upon return, while previously it did. Looking closer, the difference is that in TW, select() ends up as pselect6() while in Leap it's still select(). I've no idea whether it's a glibc or kernel change. But select(2) still says: "On Linux, select() also modifies timeout if the call is interrupted by a signal handler" So, it's either a bug or needs clarification how to handle select() in future. A good find that select() ends up as pselect6() for Thumbleweed while in Leap it's still is select() Any clues on why running as root the hang is there but if you run as normal user, everything is fine? As a normal user, the step is skipped. Created attachment 848031 [details]
test program
FWIW, here's a small test program demonstrating the issue.
Basically: don't rely on the timeout argument after select() returns. I'll cite from current pselect6(2)/select(2):
------------------------------------
On Linux, select() modifies timeout to reflect the amount of time not
slept; most other implementations do not do this. (POSIX.1 permits ei-
ther behavior.) This causes problems both when Linux code which reads
timeout is ported to other operating systems, and when code is ported
to Linux that reuses a struct timeval for multiple select()s in a loop
without reinitializing it. /Consider timeout to be undefined after se-
lect() returns./
...
RETURN VALUE
On success, select() and pselect() return the number of file descrip-
tors contained in the three returned descriptor sets (that is, the to-
tal number of bits that are set in readfds, writefds, exceptfds) which
may be zero if the timeout expires before anything interesting happens.
On error, -1 is returned, and errno is set to indicate the error; the
file descriptor sets are unmodified, /and timeout becomes undefined./
...
The pselect() interface described in this page is implemented by glibc.
The underlying Linux system call is named pselect6(). This system call
has somewhat different behavior from the glibc wrapper function.
The Linux pselect6() system call modifies its timeout argument. How-
ever, the glibc wrapper function hides this behavior by using a local
variable for the timeout argument that is passed to the system call.
/Thus, the glibc pselect() function does not modify its timeout argu-
ment; this is the behavior required by POSIX.1-2001./
-------------------------------
So, the syscalls do modify timeout on return, but the glibc wrapper differs:
for select(), when using select(2) it reflects the modifications, when using
pselect6(2) it doesn't. And pselect() (which you aren't using, but for
completeness) is even required to not modify it.
All in all: don't rely on the value of timeout after return. That implies
that you have to reset it before each call to select.
(Random note on your testprogram, fprintf() might also return with errno
modified, so for reliability you'd have to save errno before it for testing
it in the loop condition)
FWIW: this specific behaviour was a change in glibc (2433d39b) that made
the update of the timeout argument be dependend on select (and pselect6)
returning successfully, when formerly it was always updated after the syscall,
but only when timespec64 isn't supported. That's arguably a change in behaviour
that might have been unintended. Still, for future proofing: don't rely on
the timeout argument after function return.
Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and my personal feeling is that here idiologies clash. It's fair to say it's a portability issue - but the issue here is that the behavior changed *on linux* - and it has been stable for ages. And even *if* you make a case for glibc adopting the new behavior you're still left with the task to update the documentation. So far it feels like some collateral damage. (In reply to Steffen Winterfeldt from comment #14) > Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and > my personal feeling is that here idiologies clash. > > It's fair to say it's a portability issue - but the issue here is that the > behavior changed *on linux* - and it has been stable for ages. And even *if* > you make a case for glibc adopting the new behavior you're still left with > the > task to update the documentation. > > So far it feels like some collateral damage. Sure, no doubt about that. But given the realities that we live in I think it best if you'd adapt the client code of select to not rely on the timeout value. Even if glibc were changed back it would still be only in some next version. And if only the docu were changed you'd still have to change your code, so: rock and hard place :-) FWIW, I've changed hwinfo to not rely on select() behavior: https://github.com/openSUSE/hwinfo/pull/95 This is an autogenerated message for OBS integration: This bug (1184339) was mentioned in https://build.opensuse.org/request/show/885031 Factory / glibc *** Bug 1184708 has been marked as a duplicate of this bug. *** Tested Tumbleweed version 20210420 and problem solved. Thanks! Fixed. SUSE-RU-2022:2678-1: An update that has three recommended fixes can now be installed. Category: recommended (important) Bug References: 1184339,1198043,1199948 CVE References: JIRA References: Sources used: openSUSE Leap 15.3 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Module for Basesystem 15-SP3 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Micro 5.2 (src): hwinfo-21.82-150300.3.3.1 SUSE Linux Enterprise Micro 5.1 (src): hwinfo-21.82-150300.3.3.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination. SUSE-RU-2022:2678-2: An update that has three recommended fixes can now be installed. Category: recommended (important) Bug References: 1184339,1198043,1199948 CVE References: JIRA References: Sources used: openSUSE Leap Micro 5.2 (src): hwinfo-21.82-150300.3.3.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination. |