Bug 380751

Summary: k45.suse.de - pthread_cond_latency hangs while several runs
Product: [SUSE Linux Enterprise Real Time Extension] SUSE Linux Enterprise Real Time 10 SP2 (SLERT 10 SP2)
Reporter: Daniel Gollub <dgollub>
Component: kernel
Assignee: Sven Dietrich <sdietrich>
Status: RESOLVED NORESPONSE
QA Contact: Erik Hamera <erik.hamera>
Severity: Normal
Priority: P5 - None
CC: ihno
Version: BETA6
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: sysrq-t output

Description Daniel Gollub 2008-04-17 12:29:05 UTC
Kernel: SLERT 10 SP2 Beta 6 + NO_IRQ_HW_AFFINITY
ltp-realtime: ltp-realtime-20080229-1
Host: K45.suse.de

pthread_cond_latency hangs when it is run several times in a row. The test was looped at least three times.

The test was called in a loop from within the SLERT-ltp-realtime test suite. According to Felix, the entire SLERT-ltp-realtime run takes longer each time:
#1 approx. 3h
#2 approx. 4.5h
#3 hanging for 1 day 2h

-----

K45:~ # rpm -q kernel-rt --changelog
* Fri Apr 11 2008 - dgollub@suse.de
- patches.rt/shield-procs: got touched to apply smoothly for recent
  revert of CPU affinity for hardware IRQ threads. Avoid CPU affinity
  on hardware IRQ threads.

----

(gdb) thread apply all bt

Thread 2 (Thread 1082132800 (LWP 10093)):
#0  0x00002ac13c22e5f6 in poll () from /lib64/libc.so.6
#1  0x0000000000401f93 in childfunc (arg=<value optimized out>) at pthread_cond_latency.c:116
#2  0x00002ac13be04143 in start_thread () from /lib64/libpthread.so.0
#3  0x00002ac13c2368cd in clone () from /lib64/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 47009427580624 (LWP 10090)):
#0  0x00002ac13c21fb17 in sched_yield () from /lib64/libc.so.6
#1  0x0000000000401cb3 in test_signal (broadcast_flag=1, iter=4) at pthread_cond_latency.c:165
#2  0x0000000000401f03 in main (argc=2, argv=0x7fff6edc4eb8) at pthread_cond_latency.c:237
#3  0x00002ac13c193184 in __libc_start_main () from /lib64/libc.so.6
#4  0x0000000000401a49 in _start ()
#0  0x00002ac13c21fb17 in sched_yield () from /lib64/libc.so.6


----

ltp-realtime package is available at /suse/dgollub/SLERT/testpackages-20080409/ltp-realtime/
Comment 1 Daniel Gollub 2008-04-17 12:39:57 UTC
Felix, could you stop your test run and try to reproduce this by calling pthread_cond_latency several times in a row?

If pthread_cond_latency doesn't hang after 10 cycles, could you report back and try to find a way to reproduce the issue quickly?
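
A minimal sketch of such a reproduction loop, in Python: it runs a command a fixed number of times, applies a per-run timeout, and stops at the first run that does not finish in time. The function name `run_repeatedly`, the 10-cycle default and the 600-second timeout are illustrative assumptions, not part of ltp-realtime.

```python
import subprocess
import time


def run_repeatedly(cmd, cycles=10, timeout_s=600.0):
    """Run cmd up to `cycles` times; stop at the first run that times out.

    Returns a list of (cycle, status, elapsed_seconds) tuples, where status
    is "ok", "exit <code>" or "hang".
    """
    results = []
    for cycle in range(1, cycles + 1):
        start = time.monotonic()
        try:
            # subprocess.run() kills the child once the timeout expires.
            proc = subprocess.run(cmd, timeout=timeout_s)
            status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            status = "hang"
        elapsed = time.monotonic() - start
        results.append((cycle, status, elapsed))
        print(f"cycle {cycle}: {status} after {elapsed:.1f}s")
        if status == "hang":
            break
    return results
```

A call such as `run_repeatedly(["./pthread_cond_latency", "4"])` would exercise the test binary, assuming it sits in the current directory. Because the timed-out process is killed, this variant only records which cycle hung; to inspect the hung process with gdb or sysrq-t, the test has to be left running instead.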
Comment 2 Felix Foerster 2008-04-17 14:33:27 UTC
The first test run works fine, but further runs of the same test hang. Sometimes two runs work fine and the third run hangs. Even longer pauses between the runs of pthread_cond_latency don't help.
I called pthread_cond_latency with the parameter "4", so that each test run itself does 4 loops.
If I run the test with "1" as the parameter, the test works fine.
The original script from rt-tests/ltp-realtime also uses four loops.
Comment 3 Peter Morreale 2008-04-17 15:10:46 UTC
Felix,
Please capture a stack trace (sysrq-t) when the test hangs and attach it. I am looking into a similar hang in the pthread-detach test and would like to see whether there is a correlation.

thx
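
For reference, the sysrq-t task dump can also be requested without the keyboard combination: root can write the command character `t` to `/proc/sysrq-trigger`, and the kernel then prints the state of all tasks to the kernel log. A small Python sketch follows; the helper name `trigger_sysrq` and its `trigger_path` parameter are made up for illustration.

```python
def trigger_sysrq(key="t", trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the trigger file.

    With the default path this needs root; "t" asks the kernel to dump
    the state of all tasks to the kernel log.
    """
    if len(key) != 1:
        raise ValueError("SysRq commands are single characters")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
```

Calling `trigger_sysrq()` as root while the test is hung should leave the task list in the kernel log, where it can be read with `dmesg` or from the syslog file.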
Comment 4 Sven Dietrich 2008-04-17 15:19:17 UTC
This sounds like a futex is not properly getting unlocked.
Subsequent tests would sleep trying to lock that futex.
Comment 5 Felix Foerster 2008-04-17 15:43:28 UTC
Created attachment 208686
sysrq-t output

Attached to this comment is the sysrq-t output from /var/log/messages (unnecessary parts removed).
Comment 6 Ihno Krumreich 2008-10-20 17:15:10 UTC
Still valid?
Comment 8 Sven Dietrich 2008-11-13 22:17:30 UTC
Closing - no response from reporter.