Bugzilla – Bug 446233
nscd crashes with "nscd: mem.c:413: gc: Assertion `(*next_data)->packet == off_alloc' failed."
Last modified: 2009-02-05 14:44:57 UTC
If you make a mistake when editing and remove one line break, so that /etc/nsswitch.conf contains a snippet like this (all on a single line, in case it gets wrapped):

hosts: files mdns4_minimal [NOTFOUND=return] wins dnsnetworks: files dns

then nscd will silently exit at runtime. Not a big deal, but it would be nicer if it complained about it when starting up.
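For illustration, here is a minimal sketch of the kind of startup check being asked for. It is not nscd code; the function name `find_fused_entries` is hypothetical. It assumes that a second token containing a colon on one line means a line break was lost, and it ignores bracketed action items such as [NOTFOUND=return].

```python
import re


def find_fused_entries(text):
    """Return (line_number, token) pairs where a second 'database:' token
    shows up inside a line, which suggests a lost line break."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()       # drop comments
        if not line:
            continue
        line = re.sub(r"\[[^\]]*\]", " ", line)   # drop [STATUS=action] items
        tokens = line.split()
        for token in tokens[1:]:                  # tokens[0] is the database name
            if ":" in token:
                problems.append((lineno, token))
    return problems


sample = (
    "passwd: compat\n"
    "hosts: files mdns4_minimal [NOTFOUND=return] wins dnsnetworks: files dns\n"
)
for lineno, token in find_fused_entries(sample):
    print(f"line {lineno}: suspicious token {token!r}")
```

On the sample above this flags `dnsnetworks:` on line 2 and leaves a well-formed file alone.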
Just to clarify, it dies at a "random" time, not immediately on startup. Maybe it happens when it gets to the point of trying to resolve something with a non-existent method. But I am just guessing; I don't even understand how the hosts line works and why entries after [NOTFOUND=return] are used.
It seems that the problem was not the corrupt nsswitch.conf after all. nscd just disappeared again after I corrected that issue. Hm, could it be something related to the wins resolver I added? Although I am pretty sure this setup worked nicely in 11. I ran nscd --debug under gdb and it exited with a broken pipe signal; no debug symbols, so nothing more useful for now.
Please call ulimit -c unlimited and then run nscd -d outside of gdb and record the last few lines before it aborts. Attach the core if it dumps any.
Created attachment 253292 [details] Core dump when nscd dies.
Pasting output and attaching core as requested.
--------------------------------------
20309: GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309: GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309: GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22926
20309: GETPWBYNAME (root)
20309: handle_request: request received (Version = 2) from PID 22927
20309: GETFDGR
20309: provide access to FD 9, for group
20309: Reloading "LIVING" in hosts cache!
20309: Reloading "secure.sophos.com" in hosts cache!
20309: Reloading "nobody" in password cache!
20309: Reloading "postfix" in password cache!
20309: remove GETAI entry "localhost"
20309: remove GETPWBYUID entry "1005"
20309: remove GETPWBYNAME entry "tvrtko"
nscd: mem.c:413: gc: Assertion `(*next_data)->packet == off_alloc' failed.
Aborted (core dumped)
There is also SLES11 bug 445723 with a different assertion failure message, but quite possibly the same corruption.
Side note: there's a drop-in replacement for nscd at http://busybox.net/~vda/unscd/ which tries to address issues like these; it may be worth a look. Its description:

nscd problems are not exactly unheard of. Over the years, there were quite a few bugs in it. This leads people to invent babysitters which restart crashed/hung nscd. This is ugly.

After looking at nscd source in glibc I arrived at the conclusion that its design is contributing to this significantly. Even if nscd's code is 100.00% perfect and bug-free, it can still suffer from bugs in libraries it calls.

As designed, it's a multithreaded program which calls NSS libraries. These libraries are not part of libc; they may be provided by third-party projects (samba, ldap, you name it). Thus nscd cannot be sure that libraries it calls do not have memory or file descriptor leaks and other bugs.

Since nscd is a multithreaded program with a single shared cache, any resource leak in any NSS library has a cumulative effect. Even if an NSS library leaks a file descriptor 0.01% of the time, this will make nscd crash or hang after some time.

Of course bugs in NSS .so modules should be fixed, but meanwhile I do want an nscd which does not crash or lock up. So I went ahead and wrote a replacement.

It is a single-threaded server process which offloads all NSS lookups to worker children (not threads, but fully independent processes). Cache hits are handled by the parent. Only cache misses start worker children. This design is immune against resource leaks and hangs in NSS libraries. It is also many times smaller.

Currently (v0.34) it emulates glibc nscd pretty closely (handles the same command line flags and config file), and is moderately tested. Please note that as of 2008-08 it is not in wide use (yet?). If you have trouble compiling it, see an incompatibility with the "standard" one, or experience hangs/crashes, please report it to vda.linux@googlemail.com
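To make the design described above concrete, here is a rough sketch of the idea: a single-threaded parent that answers cache hits itself and forks a throwaway child for each miss. This is not unscd's code. The names `lookup_in_child`, `ForkingCache` and `stand_in_lookup` are made up for illustration, there is no cache expiry, and the parent here simply blocks until the child answers, which a real daemon would not do.

```python
import os
import pickle


def lookup_in_child(lookup, key):
    """Run lookup(key) in a forked child and return (status, value) via a pipe."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                       # child: do the risky work, then exit
        os.close(read_fd)
        try:
            payload = pickle.dumps(("ok", lookup(key)))
        except Exception as exc:
            payload = pickle.dumps(("error", repr(exc)))
        with os.fdopen(write_fd, "wb") as out:
            out.write(payload)
        os._exit(0)                    # leaks in the child die with the child
    os.close(write_fd)                 # parent: read the answer, reap the child
    with os.fdopen(read_fd, "rb") as inp:
        data = inp.read()
    os.waitpid(pid, 0)
    if not data:
        return ("error", "worker died without answering")
    return pickle.loads(data)


class ForkingCache:
    """Cache hits are served by the parent; only misses start a worker child."""

    def __init__(self, lookup):
        self.lookup = lookup
        self.cache = {}

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        status, value = lookup_in_child(self.lookup, key)
        if status != "ok":
            raise LookupError(value)
        self.cache[key] = value
        return value


def stand_in_lookup(name):
    # Hypothetical stand-in; a real daemon would call an NSS-backed
    # function here (for example pwd.getpwnam).
    return (name, os.getpid())


cache = ForkingCache(stand_in_lookup)
first = cache.get("root")              # miss: answered by a child process
second = cache.get("root")             # hit: answered from the parent's cache
print(first == second, first[1] != os.getpid())
```

If the lookup crashes or leaks inside the child, the parent only sees an empty pipe and reports a failed lookup; its own memory and file descriptors are untouched.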
*** Bug 445723 has been marked as a duplicate of this bug. ***
Yes, we are aware of and considering unscd in the long run; I'm currently working on preparing unscd in the build service so that interested people can start trying it out. I think I have tracked down the bug: http://sourceware.org/bugzilla/show_bug.cgi?id=5381 I'm building a test patch now.
Can you please test http://www.suse.de/~pbaudis/bug-446233-0-11.1/? (Should mirror out within an hour. Installing the nscd package should be enough.) It runs fine for me but I had trouble reproducing the race on my system before.
Could you please make an x86_64 build as well? Otherwise I would have to install the 32-bit samba-client in order to get the 32-bit libnss_wins which I use.
Done, it will be at http://www.suse.de/~pbaudis/bug-446233-0-11.1-x86_64/
Thanks. Got a different assertion with this one:

8499: remove GETPWBYUID entry "51"
8499: remove GETPWBYNAME entry "nobody"
8499: remove GETPWBYUID entry "65534"
8499: remove GETPWBYNAME entry "postfix"
nscd: mem.c:413: gc: Assertion `next_data < &he_data[db->head->nentries]' failed.

Unfortunately I forgot to enable core dumps this time. I will attach the core if I get the same failure again. Also, when restarting after a failure it complains about this:

10160: invalid persistent database file "/var/run/nscd/passwd": verification failed

So I rm-ed that file and started it again.
Actually, please turn persistency off in /etc/nscd.conf before starting nscd -d to gather cores, otherwise the core files aren't usable since the interesting data is in mapped files.
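A quick way to double-check that setting before starting a debugging run is sketched below. It assumes the usual `persistent <service> <yes|no>` lines in nscd.conf; the helper name `persistent_services` and the sample text are made up for illustration.

```python
def persistent_services(conf_text):
    """Return the services that still have 'persistent <service> yes' set."""
    enabled = []
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()     # drop comments, split on whitespace
        if len(fields) == 3 and fields[0] == "persistent" and fields[2] == "yes":
            enabled.append(fields[1])
    return enabled


sample_conf = """\
# illustrative fragment, not a complete nscd.conf
enable-cache    passwd  yes
persistent      passwd  yes
persistent      group   no
persistent      hosts   yes   # still on
"""
print(persistent_services(sample_conf))
```

An empty list means no cache is backed by a mapped file, so the core should contain the interesting data.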
Ok, another crash with the same assertion and this time the core (persistence disabled) will follow:

10276: Reloading "195.29.150.5" in hosts cache!
10276: Reloading "localhost" in hosts cache!
10276: Reloading "nobody" in password cache!
10276: Reloading "postfix" in password cache!
10276: Reloading "213.202.100.127" in hosts cache!
10276: remove GETPWBYUID entry "1005"
10276: remove GETPWBYNAME entry "tvrtko"
nscd: mem.c:413: gc: Assertion `next_data < &he_data[db->head->nentries]' failed.
Created attachment 254816 [details] Core dump with database persistence disabled.
Oh, there was a simple logic error in my previous patch. Rebuilding.
New builds will be at http://www.suse.de/~pbaudis/bug-446233-0-11.1-x86_64/ and http://www.suse.de/~pbaudis/bug-446233-0-11.1/
Actually at http://www.suse.de/~pbaudis/bug-446233-1-11.1-x86_64/ and http://www.suse.de/~pbaudis/bug-446233-1-11.1/ ;) Testing..
New crash flavour:

nscd[4108]: segfault at 7f6145e98a21 ip 00007f60e086b102 sp 00007f60d787fde0 error 4 in nscd[7f60e085a000+1f000]

Last output:

4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8630
4106: GETFDPW
4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8632
4106: GETFDPW
4106: provide access to FD 5, for passwd
4106: handle_request: request received (Version = 2) from PID 8634
4106: GETFDPW
4106: provide access to FD 5, for passwd

Core dump coming shortly.
Created attachment 255484 [details] After a segfault
Curious. I wonder if this is caused by my patch or by another bug not visible before. Did your nscd uptimes improve since applying the patch or stay more or less the same? It would help a lot if you could run

nscd -d -d -d 2>&1 | grep -v "$(echo -e ': \tGET')" | grep -v 'provide access' | grep -v 'request received' | tee nscd.log

keep capturing the cores, and also attach the corresponding nscd.log files (complete if possible, but at least containing activity since the last prune).
It crashed for me; from what I can see this appears to be a prune_cache race, which seems to be trivial to fix. Please keep gathering data.
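If a log was already captured without the grep filters, the same filtering can be applied afterwards. The sketch below mirrors the three `grep -v` patterns from the command; the function name `filter_log` and the sample lines are illustrative, and the tab after the colon in the first sample line is an assumption taken from the grep pattern itself.

```python
# Substrings matching the three grep -v patterns in the command.
NOISE = (": \tGET", "provide access", "request received")


def filter_log(lines):
    """Keep only the lines that contain none of the noise markers."""
    return [line for line in lines if not any(marker in line for marker in NOISE)]


sample_log = [
    "20309: \tGETPWBYNAME (root)",
    "20309: handle_request: request received (Version = 2) from PID 22926",
    "20309: provide access to FD 9, for group",
    '20309: Reloading "nobody" in password cache!',
    '20309: remove GETPWBYUID entry "1005"',
]
for line in filter_log(sample_log):
    print(line)
```

The "Reloading" and "remove" lines survive, which is the prune activity being asked for.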
http://www.suse.de/~pbaudis/bug-446233-2-11.1-x86_64/ and http://www.suse.de/~pbaudis/bug-446233-2-11.1/ should fix the prune-invalidate race, please try it out.
Hard to say about the uptime. I think it felt better after your patch, but the sample was really too small to say for sure. I am continuing to keep an eye on it (now with your latest version), but this week I wasn't able to spend much time on it. A possibly interesting data point is that nscd is unstable on Ubuntu 8.10 as well. They have version 2.8~20080505-0ubuntu7 there, which is also giving random general protection faults. So your fix may be worth pushing upstream and pinging the competition about. I may do this last bit; in fact I was planning to raise a bug there and link it to this one, but really had no time this week.
I still get:

18965: remove GETHOSTBYNAME entry "localhost"
18965: remove GETPWBYNAME entry "nobody"
18965: remove GETPWBYUID entry "65534"
nscd: mem.c:477: gc: Assertion `next_hash == &he[db->head->nentries]' failed.
Aborted

Unfortunately I didn't have a core file this time. I'll try to reproduce.
Tvrtko: Yes, the Ubuntu nscd maintainer already got in touch with me earlier. Upstream seems to have decided to ignore my bug report for now, but maybe Ubuntu will pick it up directly.

Dirk: Interesting, I'm looking forward to a core. The current version had a newly introduced deadlock in it; I have rebuilt with a more proper fix for the prune-invalidate race: http://www.suse.de/~pbaudis/bug-446233-3-11.1-x86_64/ and http://www.suse.de/~pbaudis/bug-446233-3-11.1/
Core was generated by `nscd -d'.
Program terminated with signal 6, Aborted.
#0  0x00007f272b843645 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x00007f272b843645 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f272b844c33 in *__GI_abort () at abort.c:88
#2  0x00007f272b87f8e8 in __libc_message (do_abort=2, fmt=0x7f272b933f40 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#3  0x00007f272b885118 in malloc_printerr (action=2, str=0x7f272b933f70 "munmap_chunk(): invalid pointer", ptr=<value optimized out>) at malloc.c:5994
#4  0x00007f272c5f5a1c in ?? () from /usr/sbin/nscd
#5  0x00007f272c5f4527 in ?? () from /usr/sbin/nscd
#6  0x00007f272c5eb247 in ?? () from /usr/sbin/nscd
#7  0x00007f272bd89070 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#8  0x00007f272b8e40ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#9  0x0000000000000000 in ?? ()
Huh, now that seems to be yet another different bug, let's call it 446233.28 for now. ;-) Can you make the core available and attach your /etc/nsswitch.conf?
Two things:
1. I'm pretty sure bug 387202 is the same bug, just against 11.0.
2. Your patched nscd appears to have eliminated this crash for me. (Thanks!) However, after nscd has been running several hours, it still crashes. While trying to run it in gdb, I was getting a lot of SIGPIPE false positives. Of course, now that I know they're false positives, I've got gdb ignoring them, and hopefully I'll be able to get a useful backtrace for whatever this new problem is.
Update released for: glibc, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd Products: openSUSE 11.0 (debug, i386, i686, ppc, ppc64, x86_64)
Since the 11.0 and 11.1 codebases are in sync now, I'm marking this a dupe of 387202 to ease tracking of problems. I have not forgotten about the oldboy.suse.de coredump and I'm still going to research it, don't worry. ;-) *** This bug has been marked as a duplicate of bug 387202 ***
Update released for: glibc, glibc-debuginfo, glibc-debugsource, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd Products: openSUSE 11.1 (debug, i586, i686, ppc, ppc64, x86_64)