Bugzilla – Bug 612794
process hanging in D state
Last modified: 2018-07-03 20:31:37 UTC
On my ppc64 machine with autobuild running, "lsof" from seccheck enters "D" state. It smells a bit NFS-related, or uname-related. I will attach the sysrq-t output.
Created attachment 368051 [details] sysrq-t.log sysrq-t output
lsof currently has process 24325 open (autobuild)

ls -la /proc/27555/fd
insgesamt 0
dr-x------ 2 root root 0 9. Jun 11:45 .
dr-xr-xr-x 7 root root 0 9. Jun 11:37 ..
lr-x------ 1 root root 64 9. Jun 11:45 0 -> /dev/null
l-wx------ 1 root root 64 9. Jun 11:45 1 -> pipe:[5639109]
l-wx------ 1 root root 64 9. Jun 11:45 2 -> pipe:[5638514]
lr-x------ 1 root root 64 9. Jun 11:45 3 -> /proc
lr-x------ 1 root root 64 9. Jun 11:45 4 -> /proc/24325/fd

I can't really see what the autobuild script has open :(
also seen by Rudi
Created attachment 369785 [details] syrq-t.log hung yet again some hours after reboot.
also seen by Dirk I think.
For me also "sync" gets stuck (can't really see where, strace does not do anything when attaching to the process, same for gdb, it just hangs), which means that "reboot" without "-n" hangs too. It frequently happens over the weekend, when autobuild is performing a lot of build jobs (Friday evening rebuild). How can I debug this? It is really getting annoying to reset the machine every Monday.
(In reply to comment #6)
> for me also "sync" gets stuck (can't really see where, strace does not do
> anything when attaching to the process, same for gdb, it just hangs).

Yep. If a process is in D state, it can't be attached to via ptrace, which is what both strace and gdb use.
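Since strace and gdb cannot attach, one way to at least see which processes are stuck is to read the state field from /proc/<pid>/stat. The following is a minimal sketch, not a tool used in this report; the function names parse_stat_line and find_d_state are made up for illustration, and it assumes the usual Linux layout where the third field of the stat line is the process state.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited (or line was unreadable) mid-scan
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

This only shows who is blocked, not where; the kernel stack of each blocked task still has to come from something like the sysrq-t output attached here.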
autobuild wants to write to a file, is first flushing writes that someone else made, and is waiting in writeback_single_inode for some other thread to finish syncing the inode.

uname is closing a file while exiting and so (as this is NFS) is syncing it out, and is blocked in nfs_writepages (called from writeback_single_inode, so uname is what autobuild is waiting for) trying to get a lock on a page.

flush-0:23 wants to flush data and is also waiting for uname to finish nfs_writepages.

lsof is performing a 'stat', which needs to flush writes so that it can be sure the 'size' is correct, so it is in nfs_writepages waiting for uname to finish so it can have a turn.

So the central problem is uname trying to lock a page. My guess is that the problem file is a log file connected to stdout on uname.

The second trace shows a similar pattern, though 'w' is the central blocker.

I note that 'grape' is running 2.6.34-rc6-7-ppc64. Is that right? An -rc for 11.3? Maybe it has changed since 2 weeks ago when the problem was reported. It looks a bit like the bug fixed by commit a6305ddb080fb483ca41ca56cacb6f96089f0c8e which is in -rc6. Do we know exactly what kernel was running on grape at the time?
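The reasoning above amounts to following "waits for" edges until reaching a task that is not waiting on another task. As a rough sketch only: the dictionary below is a hand-written simplification of the analysis in this comment, not something parsed from the sysrq-t log, and root_blockers is a hypothetical helper name.

```python
def root_blockers(waits_for):
    """Count, for each end-of-chain task, how many tasks are blocked behind it."""
    counts = {}
    for task in waits_for:
        seen = set()
        current = task
        # Walk the chain until we reach a task that waits on no other task.
        # The 'seen' set stops the walk if the edges form a cycle.
        while current in waits_for and current not in seen:
            seen.add(current)
            current = waits_for[current]
        counts[current] = counts.get(current, 0) + 1
    return counts


# Task -> task it is waiting for, as described in the first trace.
# uname itself is waiting on a page lock (a resource, not a task),
# so it has no entry of its own.
waits_for = {
    "autobuild": "uname",
    "flush-0:23": "uname",
    "lsof": "uname",
}

if __name__ == "__main__":
    print(root_blockers(waits_for))
```

With the edges above, every chain ends at uname, which matches the conclusion that uname trying to lock a page is the central problem.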
I have just booted 2.6.34-9; flushd was in "D" already. I then typed "sync", which entered "D" state, and large parts of the usual suspects immediately followed. Output of echo "t" > sysrq-trigger follows.
Created attachment 371535 [details] 2.6.34-9-diskwait.log 2.6.34-9 diskwait log.

rpm -q --changelog kernel-ppc64-2.6.34-9.10.ppc |less
* Mit Jun 02 2010 bphilips@suse.de
- patches.drivers/e1000e-entropy-source.patch: Reintroduce IRQF_SHARED to fix non-MSI case (bnc#610362).
Thanks. It looks like the same problem - lots of processes waiting on writeback or occasionally the page lock. Maybe the commit I mentioned above didn't quite fix the problem. I'll go exploring.
Problem appears to be fixed upstream by commit 0522f6adedd2736cbca3c0e16ca51df668993eee. The description is an exact fit of the symptom. I have added this patch to git for openSUSE-11.3. Please test and confirm.
current openSUSE-11.3 branch kernel built and installed on my machine, rebooted this morning. after 4 hours: No Diskwait processes yet... will keep you updated.
also installed kernel with this fix now on x86_64 (after I had another lockup today)
my machine has run over the weekend, full autobuilding etc. no hangs anymore. -> fixed I would say
Thanks. Patch is in git for 11.3 and is already upstream, so resolving as FIXED.
This is an autogenerated message for OBS integration:
This bug (612794) was mentioned in
https://build.opensuse.org/request/show/42266 Factory / kernel-source
https://build.opensuse.org/request/show/42378 11.3:Test / kernel-source