Bug 446233 - nscd crashes with "nscd: mem.c:413: gc: Assertion `(*next_data)->packet == off_alloc' failed."
Status: RESOLVED DUPLICATE of bug 387202
Duplicates: 445723
Alias: None
Product: openSUSE 11.1
Classification: openSUSE
Component: Other
Version: Beta 5
Hardware: x86-64 Other
Importance: P2 - High : Major
Target Milestone: ---
Assignee: Petr Baudis
QA Contact: E-mail List
URL:
Whiteboard: maint:released:11.0:21210 maint:relea...
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-18 19:19 UTC by Tvrtko Ursulin
Modified: 2009-02-05 14:44 UTC
CC List: 3 users

See Also:
Found By: Community User
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Core dump when nscd dies. (285.90 KB, application/x-gzip)
2008-11-19 09:35 UTC, Tvrtko Ursulin
Core dump with database persistence disabled. (342.19 KB, application/x-gzip)
2008-11-24 15:11 UTC, Tvrtko Ursulin
After a segfault (369.38 KB, application/x-gzip)
2008-11-25 23:11 UTC, Tvrtko Ursulin

Description Tvrtko Ursulin 2008-11-18 19:19:03 UTC
If you accidentally remove a line break while editing /etc/nsswitch.conf, leaving a snippet that looks like this (all on a single line, in case it gets wrapped):

hosts:          files mdns4_minimal [NOTFOUND=return] wins dnsnetworks:       files dns

Then nscd will silently exit at runtime. Not a big deal, but it would be nicer if it complained about it when starting up.
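
The kind of startup complaint asked for here could be approximated outside nscd. The following is only a minimal sketch of a heuristic, not anything nscd or glibc actually does: it assumes that a second colon on an nsswitch.conf line means a following "database:" label was swallowed by a missing line break. The function name `find_merged_lines` is hypothetical.

```python
def find_merged_lines(conf_text):
    """Return (line_number, line) pairs that appear to hold two database entries."""
    suspicious = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        # Drop comments and surrounding whitespace before inspecting the line.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        _database, sources = line.split(":", 1)
        # Assumption: the source list never contains a colon, so a second one
        # suggests that another "database:" label ended up on this line.
        if ":" in sources:
            suspicious.append((number, line))
    return suspicious


sample = (
    "passwd: compat\n"
    "hosts:          files mdns4_minimal [NOTFOUND=return] wins dnsnetworks:       files dns\n"
)
for number, line in find_merged_lines(sample):
    print(f"line {number} looks like two entries merged: {line}")
```

Run against the snippet above, the check flags the hosts line because of the stray "dnsnetworks:" token.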
Comment 1 Tvrtko Ursulin 2008-11-18 20:22:43 UTC
Just to clarify, it dies at a "random" time, not immediately on startup. Maybe it happens when it gets to the point of trying to resolve something with a non-existent method. But I am just guessing; I don't even understand how the hosts line works or why entries after [NOTFOUND=return] are used.
Comment 2 Tvrtko Ursulin 2008-11-18 20:45:18 UTC
It seems that the problem was not the corrupt nsswitch.conf after all. nscd just disappeared again after I corrected that issue.

Hm, could it be something related to the wins resolver I added? Although I am pretty sure this setup worked nicely in 11.

I ran nscd --debug under gdb and it exited with a broken pipe signal. There are no debug symbols, so nothing more useful for now.
Comment 3 Petr Baudis 2008-11-18 22:01:38 UTC
Please call ulimit -c unlimited and then run nscd -d outside of gdb and record the last few lines before it aborts. Attach the core if it dumps any.
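
For anyone who wants to script this request, the following is a rough sketch of a wrapper that enables core dumps for a child process and keeps only its last few lines of output. The function name `run_and_keep_tail` is hypothetical, and the sketch raises the soft core limit only as far as the existing hard limit, which matches `ulimit -c unlimited` only when the hard limit is already unlimited.

```python
import collections
import resource
import subprocess


def run_and_keep_tail(cmd, keep=20):
    """Run cmd with core dumps allowed; return (returncode, last `keep` output lines)."""

    def allow_core():
        # Runs in the child just before exec: lift the soft limit to the hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

    tail = collections.deque(maxlen=keep)  # older lines fall off automatically
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # the assertion message goes to stderr
        text=True,
        preexec_fn=allow_core,
    )
    for line in proc.stdout:
        tail.append(line.rstrip("\n"))
    returncode = proc.wait()
    # A negative return code means the process was killed by that signal number.
    return returncode, list(tail)


# Intended use (not executed here): run_and_keep_tail(["nscd", "-d"])
```

Where the core file ends up still depends on the system's core pattern settings, which this sketch does not touch.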
Comment 4 Tvrtko Ursulin 2008-11-19 09:35:55 UTC
Created attachment 253292
Core dump when nscd dies.
Comment 5 Tvrtko Ursulin 2008-11-19 09:36:17 UTC
Pasting output and attaching core as requested.

--------------------------------------
20309:  GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309:  GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309:  GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309:  GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22927
20309:  GETFDGR
20309: provide access to FD 9, for group
20309: Reloading "LIVING" in hosts cache!
20309: Reloading "secure.sophos.com" in hosts cache!
20309: Reloading "nobody" in password cache!
20309: Reloading "postfix" in password cache!
20309: remove GETAI entry "localhost"
20309: remove GETPWBYUID entry "1005"
20309: remove GETPWBYNAME entry "tvrtko"
nscd: mem.c:413: gc: Assertion `(*next_data)->packet == off_alloc' failed.
Aborted (core dumped)
Comment 6 Petr Baudis 2008-11-19 19:01:37 UTC
There is also SLES11 bug 445723 with a different assertion failure message, but quite possibly the same corruption.
Comment 7 Forgotten User qMyteedNxa 2008-11-22 17:16:41 UTC
side-note:

there's a drop-in replacement for nscd at http://busybox.net/~vda/unscd/

it's trying to address issues like these - maybe worth taking a look at.

description:

nscd problems are not exactly unheard of. Over the years, there were
quite a few bugs in it. This leads people to invent babysitters
which restart crashed/hung nscd. This is ugly.

After looking at nscd source in glibc I arrived at the conclusion
that its design is contributing to this significantly. Even if nscd's
code is 100.00% perfect and bug-free, it can still suffer from bugs
in libraries it calls.

As designed, it's a multithreaded program which calls NSS libraries.
These libraries are not part of libc, they may be provided
by third-party projects (samba, ldap, you name it).

Thus nscd cannot be sure that libraries it calls do not have memory
or file descriptor leaks and other bugs.

Since nscd is a multithreaded program with a single shared cache,
any resource leak in any NSS library has cumulative effect.
Even if an NSS library leaks a file descriptor 0.01% of the time,
this will make nscd crash or hang after some time.

Of course bugs in NSS .so modules should be fixed, but meanwhile
I do want an nscd which does not crash or lock up.

So I went ahead and wrote a replacement.

It is a single-threaded server process which offloads all NSS
lookups to worker children (not threads, but fully independent
processes). Cache hits are handled by parent. Only cache misses
start worker children. This design is immune against
resource leaks and hangs in NSS libraries.
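
The design described in this paragraph can be illustrated with a small sketch. This is not unscd's code: the class name `ForkingCache` and the `lookup` callable (standing in for a call into an NSS module) are assumptions made for illustration, and real unscd is written in C and speaks the nscd socket protocol. The sketch only shows the idea that cache hits stay in the parent while each miss is resolved in a separate process, so a crash or leak during the lookup cannot damage the parent.

```python
import os
import pickle


class ForkingCache:
    """Hypothetical cache: hits are served by the parent, misses by a forked child."""

    def __init__(self, lookup):
        self.lookup = lookup  # stand-in for a call into an NSS library
        self.cache = {}

    def get(self, key):
        if key in self.cache:  # cache hit: no child process is started
            return self.cache[key]
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: perform the risky lookup, send the result back, then exit.
            status = 1
            try:
                os.close(read_fd)
                try:
                    payload = pickle.dumps(("ok", self.lookup(key)))
                except Exception as exc:
                    payload = pickle.dumps(("error", repr(exc)))
                with os.fdopen(write_fd, "wb") as out:
                    out.write(payload)
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code path
        # Parent: wait for the answer; an empty read means the child died.
        os.close(write_fd)
        with os.fdopen(read_fd, "rb") as inp:
            data = inp.read()
        os.waitpid(pid, 0)
        if not data:
            raise RuntimeError("worker died without answering")
        status, value = pickle.loads(data)
        if status == "error":
            raise LookupError(value)
        self.cache[key] = value
        return value
```

Any file descriptors or memory leaked by the lookup disappear when the child exits, which is the property the description relies on. A production design would also need timeouts for hung children, which this sketch omits.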

It is also many times smaller.

Currently (v0.34) it emulates glibc nscd pretty closely
(handles same command line flags and config file), and is moderately tested.

Please note that as of 2008-08 it is not in wide use (yet?).
If you have trouble compiling it, see an incompatibility with
the "standard" one or experience hangs/crashes, please report it to
vda.linux@googlemail.com
Comment 8 Petr Baudis 2008-11-23 13:17:57 UTC
*** Bug 445723 has been marked as a duplicate of this bug. ***
Comment 9 Petr Baudis 2008-11-23 13:29:44 UTC
Yes, we are aware of and considering unscd in the long run; I'm currently working on preparing unscd in the build service so that interested people can start trying it out.

I think I have tracked down the bug: http://sourceware.org/bugzilla/show_bug.cgi?id=5381

I'm building a test patch now.
Comment 10 Petr Baudis 2008-11-23 15:39:29 UTC
Can you please test http://www.suse.de/~pbaudis/bug-446233-0-11.1/? (Should mirror out within an hour. Installing the nscd package should be enough.) It runs fine for me but I had trouble reproducing the race on my system before.
Comment 11 Tvrtko Ursulin 2008-11-23 16:38:15 UTC
Could you please make an x86_64 build as well? Otherwise I would have to install the 32-bit samba-client in order to get the 32-bit libnss_wins which I use.
Comment 12 Petr Baudis 2008-11-23 19:59:32 UTC
Done, it will be at http://www.suse.de/~pbaudis/bug-446233-0-11.1-x86_64/
Comment 13 Tvrtko Ursulin 2008-11-24 13:18:31 UTC
Thanks.

Got a different assertion with this one:

8499: remove GETPWBYUID entry "51"
8499: remove GETPWBYNAME entry "nobody"
8499: remove GETPWBYUID entry "65534"
8499: remove GETPWBYNAME entry "postfix"
nscd: mem.c:413: gc: Assertion `next_data < &he_data[db->head->nentries]' failed.

Unfortunately I forgot to enable core dumps this time. Will attach a core if I get the same failure again.

Also, when restarting after a failure it complains about this:
10160: invalid persistent database file "/var/run/nscd/passwd": verification failed

So I rm-ed that file and started it again.
Comment 14 Petr Baudis 2008-11-24 13:21:27 UTC
Actually, please turn persistency off in /etc/nscd.conf before starting nscd -d to gather cores, otherwise the core files aren't usable since the interesting data is in mapped files.
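
In nscd.conf, persistence is set per database with lines of the form `persistent <service> <yes|no>`. As a small sketch of the change requested here, the helper below flips every active `yes` to `no` in the configuration text. The function name `disable_persistence` is hypothetical, and reading and writing the real file (as root) is left out on purpose.

```python
import re

# Matches active (uncommented) "persistent <service> yes" lines.
_PERSISTENT_YES = re.compile(
    r"^([ \t]*persistent[ \t]+\S+[ \t]+)yes\b", re.MULTILINE
)


def disable_persistence(conf_text):
    """Return conf_text with every active 'persistent <service> yes' set to 'no'."""
    return _PERSISTENT_YES.sub(r"\1no", conf_text)


example = "persistent passwd yes\nshared passwd yes\npersistent hosts yes\n"
print(disable_persistence(example))
```

Commented-out lines and the separate `shared` setting are left alone; nscd has to be restarted for the change to take effect.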
Comment 15 Tvrtko Ursulin 2008-11-24 15:10:52 UTC
Ok, another crash with the same assertion and this time the core (persistence disabled) will follow:

10276: Reloading "195.29.150.5" in hosts cache!
10276: Reloading "localhost" in hosts cache!
10276: Reloading "nobody" in password cache!
10276: Reloading "postfix" in password cache!
10276: Reloading "213.202.100.127" in hosts cache!
10276: remove GETPWBYUID entry "1005"
10276: remove GETPWBYNAME entry "tvrtko"
nscd: mem.c:413: gc: Assertion `next_data < &he_data[db->head->nentries]' failed.
Comment 16 Tvrtko Ursulin 2008-11-24 15:11:58 UTC
Created attachment 254816
Core dump with database persistence disabled.
Comment 17 Petr Baudis 2008-11-25 01:53:05 UTC
Oh, there was a simple logic error in my previous patch. Rebuilding.
Comment 20 Tvrtko Ursulin 2008-11-25 23:10:34 UTC
New crash flavour:

nscd[4108]: segfault at 7f6145e98a21 ip 00007f60e086b102 sp 00007f60d787fde0 error 4 in nscd[7f60e085a000+1f000]

Last output:

4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8630
4106:   GETFDPW
4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8632
4106:   GETFDPW
4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8634
4106:   GETFDPW
4106: provide access to FD 5, for passwd

Core dump coming shortly.
Comment 21 Tvrtko Ursulin 2008-11-25 23:11:46 UTC
Created attachment 255484
After a segfault
Comment 22 Petr Baudis 2008-11-26 12:58:37 UTC
Curious. I wonder if this is caused by my patch or another bug not visible before. Did your nscd uptimes improve since applying the patch or stay more-or-less the same?

It would help a lot if you could run

   nscd -d -d -d 2>&1 | grep -v "$(echo -e ': \tGET')" | grep -v 'provide access' | grep -v 'request received' | tee nscd.log

keep capturing the cores and also attach the corresponding nscd.log files (complete if possible, but at least containing activity since the last prune).
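
To make clear what the pipeline in this comment keeps and drops, here is a rough Python equivalent of its three `grep -v` stages. It is only an illustration of the filtering logic, with hypothetical names, and is not meant to replace the shell command: lines containing a colon, space, tab and then "GET" (the per-request lines), "provide access", or "request received" are discarded, and everything else is kept.

```python
# Substrings matching the three grep -v patterns from the pipeline above.
NOISE = (": \tGET", "provide access", "request received")


def is_interesting(line):
    """True if none of the noisy request-handling markers occur in the line."""
    return not any(marker in line for marker in NOISE)


def filter_log(lines):
    """Keep only the lines the pipeline would pass through to nscd.log."""
    return [line for line in lines if is_interesting(line)]
```

Cache maintenance messages such as "Reloading ..." and "remove GETPWBYUID entry ..." survive the filter, which is the activity around a prune that the request is after.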
Comment 23 Petr Baudis 2008-11-27 15:22:09 UTC
It crashed for me. From what I can see this appears to be a prune_cache race, which seems to be trivial to fix. Please keep gathering data.
Comment 24 Petr Baudis 2008-11-28 00:28:54 UTC
http://www.suse.de/~pbaudis/bug-446233-2-11.1-x86_64/ and
http://www.suse.de/~pbaudis/bug-446233-2-11.1/ should fix the prune-invalidate race, please try it out.
Comment 25 Tvrtko Ursulin 2008-11-28 08:39:34 UTC
Hard to say about the uptime - I think it felt better after your patch but the sample was really too small to say for sure.

I am continuing to keep an eye on it (now with your latest version) but this week I wasn't able to spend much time on it.

A possibly interesting data point is that nscd is unstable on Ubuntu 8.10 as well. They have version 2.8~20080505-0ubuntu7 there, which is also giving random general protection faults. So your fix may be worth pushing upstream and pinging the competition about it. I may do this last bit; in fact I was planning to raise a bug there and link it to this one, but I really had no time this week.
Comment 26 Dirk Mueller 2008-12-01 10:50:47 UTC
I still get: 

18965: remove GETHOSTBYNAME entry "localhost"
18965: remove GETPWBYNAME entry "nobody"
18965: remove GETPWBYUID entry "65534"
nscd: mem.c:477: gc: Assertion `next_hash == &he[db->head->nentries]' failed.
Aborted

Unfortunately I didn't get a core file this time. I'll try to reproduce.
Comment 27 Petr Baudis 2008-12-02 19:29:52 UTC
Tvrtko: Yes, the Ubuntu nscd maintainer already got in touch with me earlier. Upstream seems to have decided to ignore my bug report for now, but maybe Ubuntu will pick it up directly.

Dirk: Interesting - I'm looking forward to a core.

The current version had a newly introduced deadlock in it; I have rebuilt with a more proper fix for the prune-invalidate race:

http://www.suse.de/~pbaudis/bug-446233-3-11.1-x86_64/ and
http://www.suse.de/~pbaudis/bug-446233-3-11.1/
Comment 28 Dirk Mueller 2008-12-03 07:40:34 UTC
Core was generated by `nscd -d'.
Program terminated with signal 6, Aborted.
#0  0x00007f272b843645 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f272b843645 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f272b844c33 in *__GI_abort () at abort.c:88
#2  0x00007f272b87f8e8 in __libc_message (do_abort=2, fmt=0x7f272b933f40 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#3  0x00007f272b885118 in malloc_printerr (action=2, str=0x7f272b933f70 "munmap_chunk(): invalid pointer",
    ptr=<value optimized out>) at malloc.c:5994
#4  0x00007f272c5f5a1c in ?? () from /usr/sbin/nscd
#5  0x00007f272c5f4527 in ?? () from /usr/sbin/nscd
#6  0x00007f272c5eb247 in ?? () from /usr/sbin/nscd
#7  0x00007f272bd89070 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#8  0x00007f272b8e40ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()
Comment 29 Petr Baudis 2008-12-03 08:52:27 UTC
Huh, now that seems to be yet another different bug, let's call it 446233.28 for now. ;-) Can you make the core available and attach your /etc/nsswitch.conf?
Comment 32 Raymond Planthold 2008-12-30 10:06:54 UTC
Two things:

1. I'm pretty sure bug 387202 is the same bug, just against 11.0.

2. Your patched nscd appears to have eliminated this crash for me.  (Thanks!)  However, after nscd has been running several hours, it still crashes.  While trying to run it in gdb, I was getting a lot of SIGPIPE false-positives.  Of course, now that I know they're false positives, I've got gdb ignoring them, and hopefully I'll be able to get a useful backtrace for whatever this new problem is.
Comment 33 Swamp Workflow Management 2009-01-01 17:04:53 UTC
Update released for: glibc, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd
Products:
openSUSE 11.0 (debug, i386, i686, ppc, ppc64, x86_64)
Comment 34 Petr Baudis 2009-01-05 16:55:26 UTC
Since the 11.0 and 11.1 codebases are in sync now, I'm marking this a dupe of 387202 to ease tracking of problems. I have not forgotten about the oldboy.suse.de coredump and I'm still going to research it, don't worry. ;-)

*** This bug has been marked as a duplicate of bug 387202 ***
Comment 35 Swamp Workflow Management 2009-02-05 14:44:57 UTC
Update released for: glibc, glibc-debuginfo, glibc-debugsource, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd
Products:
openSUSE 11.1 (debug, i586, i686, ppc, ppc64, x86_64)