Bug 612794 - process hanging in D state
Summary: process hanging in D state
Status: RESOLVED FIXED
Alias: None
Product: openSUSE 11.3
Classification: openSUSE
Component: Kernel
Version: Factory
Hardware: Other Other
Importance: P2 - High : Critical with 5 votes
Target Milestone: ---
Assignee: Neil Brown
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-06-09 09:44 UTC by Marcus Meissner
Modified: 2018-07-03 20:31 UTC
CC List: 3 users

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
sysrq-t.log (399.63 KB, text/plain)
2010-06-09 09:45 UTC, Marcus Meissner
syrq-t.log (323.48 KB, text/plain)
2010-06-17 14:30 UTC, Marcus Meissner
2.6.34-9-diskwait.log (227.24 KB, text/plain)
2010-06-24 15:19 UTC, Marcus Meissner

Description Marcus Meissner 2010-06-09 09:44:36 UTC
On my ppc64 machine with autobuild running I get "lsof" from seccheck entering "D" state.

Smells a bit NFS-related, or uname-related.

I will attach the sysrq-t output.
Comment 1 Marcus Meissner 2010-06-09 09:45:24 UTC
Created attachment 368051 [details]
sysrq-t.log

sysrq-t output
Comment 2 Marcus Meissner 2010-06-09 09:52:11 UTC
lsof currently has process 24325 open (autobuild)

ls -la /proc/27555/fd
total 0
dr-x------ 2 root root  0  9. Jun 11:45 .
dr-xr-xr-x 7 root root  0  9. Jun 11:37 ..
lr-x------ 1 root root 64  9. Jun 11:45 0 -> /dev/null
l-wx------ 1 root root 64  9. Jun 11:45 1 -> pipe:[5639109]
l-wx------ 1 root root 64  9. Jun 11:45 2 -> pipe:[5638514]
lr-x------ 1 root root 64  9. Jun 11:45 3 -> /proc
lr-x------ 1 root root 64  9. Jun 11:45 4 -> /proc/24325/fd


I can't really see what the autobuild script has open :(
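
A minimal sketch of one way to look at another process's descriptors without going through lsof. It only calls readlink on the entries under /proc/<pid>/fd, which is the same information the `ls -la` listing above prints. The function name `list_open_fds` and the `proc_root` parameter are made up for illustration, and whether this sidesteps the hang on an affected machine is untested.

```python
import os


def list_open_fds(pid, proc_root="/proc"):
    """Map each fd number of `pid` to the target of its /proc symlink.

    `proc_root` is a hypothetical parameter so the function can be
    pointed at a fake directory tree when trying it out.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue
        try:
            # readlink returns the link text only; this code never
            # calls stat() on the file the descriptor points to.
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets
```

Run as root, `list_open_fds(24325)` would return the descriptor table of the autobuild process mentioned above, assuming that PID still exists.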
Comment 3 Marcus Meissner 2010-06-16 14:50:37 UTC
also seen by Rudi
Comment 4 Marcus Meissner 2010-06-17 14:30:22 UTC
Created attachment 369785 [details]
syrq-t.log

hung yet again some hours after reboot.
Comment 5 Marcus Meissner 2010-06-22 10:59:26 UTC
also seen by Dirk I think.
Comment 6 Dirk Mueller 2010-06-22 21:32:10 UTC
For me "sync" also gets stuck (I can't really see where: strace does not do anything when attaching to the process, and the same goes for gdb, it just hangs).

This means that "reboot" without "-n" also hangs.

It frequently happens over the weekend, when autobuild is performing a lot of build jobs (Friday evening rebuild). How can I debug this? It is really getting annoying to reset the machine every Monday.
Comment 10 Jeff Mahoney 2010-06-22 23:08:50 UTC
(In reply to comment #6)
> for me also "sync" gets stuck (can't really see where, strace does not do
> anything when attaching to the process, same for gdb, it just hangs). 

Yep. If a process is in D state, it can't be attached to via ptrace, which is what both strace and gdb use.
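
Since ptrace-based tools are of no use here, one alternative is to read the state field straight out of /proc. The following is a rough sketch under that assumption: `find_d_state` and its `proc_root` parameter are hypothetical names, and it only looks at thread-group leaders (the numeric directories directly under /proc), not at the individual threads under /proc/<pid>/task.

```python
import os


def find_d_state(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes whose state is 'D'."""
    blocked = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # /proc/<pid>/stat looks like "pid (comm) state ...". comm may
        # contain spaces or ')' itself, so split on the last ')'.
        lpar, rpar = data.index("("), data.rindex(")")
        comm = data[lpar + 1:rpar]
        state = data[rpar + 1:].split()[0]
        if state == "D":
            blocked.append((int(name), comm))
    return sorted(blocked)
```

Calling `find_d_state()` periodically (for example from cron) would show when the first process gets stuck, which may help narrow down the trigger.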
Comment 11 Neil Brown 2010-06-22 23:28:24 UTC
autobuild wants to write to a file, is first flushing writes
that someone else made, and is waiting in writeback_single_inode
for some other thread to finish syncing the inode.

uname is closing a file while exiting and so (as this is NFS) is
syncing it out and is blocked in nfs_writepages (called from
writeback_single_inode, so uname is what autobuild is waiting for)
trying to get a lock on a page.

flush-0:23 wants to flush data and is also waiting for uname to
finish nfs_writepages.

lsof is performing a 'stat', which needs to flush writes so that it can
be sure the 'size' is correct, so it is in nfs_writepages waiting for
uname to finish so it can have a turn.

So the central problem is uname trying to lock a page.

My guess is that the problem file is a log file - connected to stdout
on uname.

The second trace shows a similar pattern, though 'w' is the central blocker.

I note that 'grape' is running 2.6.34-rc6-7-ppc64.  Is that right? An
-rc for 11.3?  Maybe it has changed since 2 weeks ago when the problem
was reported.

It looks a bit like the bug fixed by
  commit a6305ddb080fb483ca41ca56cacb6f96089f0c8e

which is in -rc6.  Do we know exactly what kernel was running on grape
at the time?
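
The reasoning in this comment can be written down as a small wait-for graph. The sketch below is purely illustrative: the `WAITS_ON` table is transcribed by hand from the first trace as described above (it is not parsed from the sysrq-t log), and `root_blockers` is a hypothetical helper name.

```python
# Who is waiting for whom, according to the analysis of the first trace.
# None means the task is blocked on something that is not another task
# (here: uname is trying to lock a page).
WAITS_ON = {
    "autobuild": "uname",
    "flush-0:23": "uname",
    "lsof": "uname",
    "uname": None,
}


def root_blockers(waits_on):
    """Return tasks that others wait on but that wait on no other task."""
    waited_on = {t for t in waits_on.values() if t is not None}
    return {t for t in waited_on if waits_on.get(t) is None}
```

With the table above, `root_blockers(WAITS_ON)` yields only uname, matching the conclusion that uname trying to lock a page is the central problem.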
Comment 12 Marcus Meissner 2010-06-24 15:17:50 UTC
I have just booted 2.6.34-9.

flushd was in "D" already.

I then typed "sync", which entered "D" state, and large parts of the usual suspects immediately followed.

The output of echo "t" > sysrq-trigger follows.
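
For scripted collection, the same task dump can be requested from a program. This is a minimal sketch assuming the standard /proc/sysrq-trigger file; `trigger_sysrq` is a hypothetical helper name. Writing to the real file requires root, and the dump itself lands in the kernel log rather than being returned to the caller.

```python
def trigger_sysrq(key="t", path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file.

    "t" asks the kernel to dump the state of all tasks to the kernel log.
    `path` is parameterised only so the function can be tried on a
    scratch file.
    """
    if len(key) != 1:
        raise ValueError("SysRq key must be a single character")
    with open(path, "w") as f:
        f.write(key)
```

After calling `trigger_sysrq("t")`, the output can be captured with dmesg or from the system log, as was done for the attached logs.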
Comment 13 Marcus Meissner 2010-06-24 15:19:50 UTC
Created attachment 371535 [details]
2.6.34-9-diskwait.log

2.6.34-9 diskwait log.

rpm -q --changelog kernel-ppc64-2.6.34-9.10.ppc |less

* Wed Jun 02 2010 bphilips@suse.de
- patches.drivers/e1000e-entropy-source.patch: Reintroduce IRQF_SHARED
  to fix non-MSI case (bnc#610362).
Comment 14 Neil Brown 2010-06-24 21:52:06 UTC
Thanks.  It looks like the same problem - lots of processes waiting on
writeback or occasionally the page lock.  Maybe the commit I mentioned
above didn't quite fix the problem.
I'll go exploring.
Comment 15 Neil Brown 2010-06-29 00:39:33 UTC
Problem appears to be fixed upstream by
commit 0522f6adedd2736cbca3c0e16ca51df668993eee

The description is an exact fit of the symptom.

I have added this patch to git for openSUSE-11.3

Please test and confirm.
Comment 16 Marcus Meissner 2010-07-01 12:10:21 UTC
I built the current openSUSE-11.3 branch kernel, installed it on my machine,
and rebooted this morning.

After 4 hours: no diskwait processes yet... will keep you updated.
Comment 17 Dirk Mueller 2010-07-01 16:44:21 UTC
also installed kernel with this fix now on x86_64 (after I had another lockup today)
Comment 18 Marcus Meissner 2010-07-05 12:30:45 UTC
My machine has run over the weekend, with full autobuilding etc.

No hangs anymore.

-> fixed, I would say.
Comment 19 Neil Brown 2010-07-06 00:46:08 UTC
Thanks.  
Patch is in git for 11.3 and is already upstream, so resolving as FIXED.
Comment 20 Bernhard Wiedemann 2016-04-15 11:51:26 UTC
This is an autogenerated message for OBS integration:
This bug (612794) was mentioned in
https://build.opensuse.org/request/show/42266 Factory / kernel-source
https://build.opensuse.org/request/show/42378 11.3:Test / kernel-source