Bug 1184339

Summary: select() no longer modifies timeout value
Product: [openSUSE] openSUSE Tumbleweed Reporter: M Fredericks <emfee>
Component: YaST2Assignee: Andreas Schwab <schwab>
Status: RESOLVED FIXED QA Contact: Jiri Srain <jsrain>
Severity: Normal    
Priority: P5 - None CC: ancor, emfee, ioannis.bonatakis, matz, snwint
Version: Current   
Target Milestone: ---   
Hardware: x86-64   
OS: openSUSE Tumbleweed   
URL: https://trello.com/c/utNuL3Pd
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: YaST2 hwinfo pop-up
Yast Debug log
test program

Description M Fredericks 2021-04-05 14:56:50 UTC
Created attachment 847988 [details]
YaST2 hwinfo pop-up

Running "Hardware Information" from the YaST Control Center results in a YaST2 - hwinfo window saying "Probing Hardware..." and DSL with a progress of 1%, see attachment. What is bad is that stopping the process using the Abort button does not work.

Did a bit of debugging, see also https://forums.opensuse.org/showthread.php/552284-YAST-Hardware-Information-Hangs

I can reproduce the problem using (as root)

/usr/bin/ruby.ruby2.7 -rdebug --encoding=utf-8 /usr/lib/YaST2/bin/y2start hwinfo qt -name YaST2 -icon yast

I see that /usr/share/YaST2/clients/hwinfo.rb is executed but running that stand-alone is not possible as it needs some Yast::Client so there my debugging ended.

Based on the forum thread at least 3 other users have the same problem. Executing hwinfo as root from the command prompt works fine.
Comment 1 Ancor Gonzalez Sosa 2021-04-05 15:15:35 UTC
Please attach YaST logs https://en.opensuse.org/openSUSE:Report_a_YaST_bug
Comment 2 M Fredericks 2021-04-05 19:26:25 UTC
Created attachment 847991 [details]
Yast Debug log
Comment 3 M Fredericks 2021-04-05 19:27:21 UTC
Some more observations:

If I run "/usr/bin/ruby.ruby2.7 --encoding=utf-8 /usr/lib/YaST2/bin/y2start hwinfo qt -name YaST2 -icon yast" as normal user, things work fine, the hardware is probe successful and the process completes with a report.

If I run the same command as root the process hangs just like starting it from the GUI. The GUI likely also runs it as root as to run YaST2 I have to log in as root.
Comment 4 Ancor Gonzalez Sosa 2021-04-06 10:46:03 UTC
I can reproduce this in my system. It hangs when calling this with the "pat" variable being ".probe.dsl".

> SCR.Read(pat)

If I add this line just before that call, all the other probing works like a charm:

> return if pat.to_s == ".probe.dsl"

So the problem is indeed a libhd hang when probing DSL. And it seems to happen consistently in the current Tumbleweed.

Steffen, any clue?
Comment 5 Steffen Winterfeldt 2021-04-06 11:14:44 UTC
Strange. It hangs only in yast. 'hwinfo --dsl' works fine for me.
Comment 6 Steffen Winterfeldt 2021-04-06 11:20:20 UTC
For a simpler test case:

echo '`Read(.probe.dsl)' | /usr/lib/YaST2/bin/y2base stdio scr
Comment 7 Steffen Winterfeldt 2021-04-06 11:50:43 UTC
This select call:

https://github.com/openSUSE/hwinfo/blob/master/src/hd/pppoe.c#L463

runs into (strace output):

12184 pselect6(1024, [7], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = ? ERESTARTNOHAND (To be restarted if no handler)
12184 --- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=240733328, ptr=0x7f980e594c90}} ---
12184 rt_sigreturn({mask=[]})           = -1 EINTR (Interrupted system call)

while it usually just times out:

12409 pselect6(1024, [3], NULL, NULL, {tv_sec=3, tv_nsec=0}, NULL) = 0 (Timeout)

Happens only in yast in TW.
Comment 8 Steffen Winterfeldt 2021-04-06 14:55:50 UTC
Ok, found the difference: in current TW, select() does not adjust the timeout
value upon return, while previously it did.
Comment 9 Steffen Winterfeldt 2021-04-06 15:19:07 UTC
Looking closer, the difference is that in TW, select() ends up as pselect6()
while in Leap it's still select().

I've no idea whether it's a glibc or kernel change. But select(2) still
says:

"On Linux, select() also modifies timeout if the call is interrupted by a
signal handler"

So, it's either a bug or needs clarification how to handle select() in future.
Comment 10 M Fredericks 2021-04-07 05:25:45 UTC
A good find that select() ends up as pselect6() for Thumbleweed while in Leap it's still is select()

Any clues on why running as root the hang is there but if you run as normal user, everything is fine?
Comment 11 Steffen Winterfeldt 2021-04-07 07:07:15 UTC
As a normal user, the step is skipped.
Comment 12 Steffen Winterfeldt 2021-04-07 07:33:22 UTC
Created attachment 848031 [details]
test program

FWIW, here's a small test program demonstrating the issue.
Comment 13 Michael Matz 2021-04-07 12:38:39 UTC
Basically: don't rely on the timeout argument after select() returns.  I'll cite from current pselect6(2)/select(2):

------------------------------------
       On Linux, select() modifies timeout to reflect the amount of  time  not
       slept; most other implementations do not do this.  (POSIX.1 permits ei-
       ther behavior.)  This causes problems both when Linux code which  reads
       timeout  is  ported to other operating systems, and when code is ported
       to Linux that reuses a struct timeval for multiple select()s in a  loop
       without  reinitializing it.  /Consider timeout to be undefined after se-
       lect() returns./
...
RETURN VALUE
       On success, select() and pselect() return the number of  file  descrip-
       tors  contained in the three returned descriptor sets (that is, the to-
       tal number of bits that are set in readfds, writefds, exceptfds)  which
       may be zero if the timeout expires before anything interesting happens.
       On error, -1 is returned, and errno is set to indicate the  error;  the
       file descriptor sets are unmodified, /and timeout becomes undefined./
...
       The pselect() interface described in this page is implemented by glibc.
       The underlying Linux system call is named pselect6().  This system call
       has somewhat different behavior from the glibc wrapper function.

       The  Linux  pselect6() system call modifies its timeout argument.  How-
       ever, the glibc wrapper function hides this behavior by using  a  local
       variable  for  the  timeout argument that is passed to the system call.
       /Thus, the glibc pselect() function does not modify  its  timeout  argu-
       ment; this is the behavior required by POSIX.1-2001./
-------------------------------

So, the syscalls do modify timeout on return, but the glibc wrapper differs:
for select(), when using select(2) it reflects the modifications, when using
pselect6(2) it doesn't.  And pselect() (which you aren't using, but for
completeness) is even required to not modify it.

All in all: don't rely on the value of timeout after return.  That implies
that you have to reset it before each call to select.

(Random note on your testprogram, fprintf() might also return with errno
modified, so for reliability you'd have to save errno before it for testing
it in the loop condition)

FWIW: this specific behaviour was a change in glibc (2433d39b) that made
the update of the timeout argument be dependend on select (and pselect6)
returning successfully, when formerly it was always updated after the syscall,
but only when timespec64 isn't supported.  That's arguably a change in behaviour
that might have been unintended.  Still, for future proofing: don't rely on
the timeout argument after function return.
Comment 14 Steffen Winterfeldt 2021-04-07 13:06:56 UTC
Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and
my personal feeling is that here idiologies clash.

It's fair to say it's a portability issue - but the issue here is that the
behavior changed *on linux* - and it has been stable for ages. And even *if*
you make a case for glibc adopting the new behavior you're still left with the
task to update the documentation.

So far it feels like some collateral damage.
Comment 15 Michael Matz 2021-04-07 13:30:37 UTC
(In reply to Steffen Winterfeldt from comment #14)
> Yeah, I've also checked the docs on select vs. pselect vs. pselect6 and
> my personal feeling is that here idiologies clash.
> 
> It's fair to say it's a portability issue - but the issue here is that the
> behavior changed *on linux* - and it has been stable for ages. And even *if*
> you make a case for glibc adopting the new behavior you're still left with
> the
> task to update the documentation.
> 
> So far it feels like some collateral damage.

Sure, no doubt about that.  But given the realities that we live in I think it
best if you'd adapt the client code of select to not rely on the timeout value.
Even if glibc were changed back it would still be only in some next version.
And if only the docu were changed you'd still have to change your code, so: rock
and hard place :-)
Comment 16 Steffen Winterfeldt 2021-04-13 12:24:48 UTC
FWIW, I've changed hwinfo to not rely on select() behavior:

https://github.com/openSUSE/hwinfo/pull/95
Comment 17 OBSbugzilla Bot 2021-04-13 15:00:03 UTC
This is an autogenerated message for OBS integration:
This bug (1184339) was mentioned in
https://build.opensuse.org/request/show/885031 Factory / glibc
Comment 18 Steffen Winterfeldt 2021-04-14 16:21:10 UTC
*** Bug 1184708 has been marked as a duplicate of this bug. ***
Comment 19 M Fredericks 2021-04-24 06:59:41 UTC
Tested Tumbleweed version 20210420 and problem solved. Thanks!
Comment 20 Andreas Schwab 2021-05-04 12:20:42 UTC
Fixed.
Comment 23 Swamp Workflow Management 2022-08-05 07:16:01 UTC
SUSE-RU-2022:2678-1: An update that has three recommended fixes can now be installed.

Category: recommended (important)
Bug References: 1184339,1198043,1199948
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap 15.3 (src):    hwinfo-21.82-150300.3.3.1
SUSE Linux Enterprise Module for Basesystem 15-SP3 (src):    hwinfo-21.82-150300.3.3.1
SUSE Linux Enterprise Micro 5.2 (src):    hwinfo-21.82-150300.3.3.1
SUSE Linux Enterprise Micro 5.1 (src):    hwinfo-21.82-150300.3.3.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 24 Swamp Workflow Management 2022-09-01 15:37:05 UTC
SUSE-RU-2022:2678-2: An update that has three recommended fixes can now be installed.

Category: recommended (important)
Bug References: 1184339,1198043,1199948
CVE References: 
JIRA References: 
Sources used:
openSUSE Leap Micro 5.2 (src):    hwinfo-21.82-150300.3.3.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.