Bug 545995

Summary: PAM/nscd prevents logins on systems with less than 130MB memory
Product: [openSUSE] openSUSE 11.1 Reporter: Bernhard Wiedemann <novellbmw>
Component: OtherAssignee: Petr Baudis <pbaudis>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P3 - Medium    
Version: Final   
Target Milestone: ---   
Hardware: x86-64   
OS: openSUSE 11.1   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Bernhard Wiedemann 2009-10-10 19:36:10 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.23) Gecko/20090912 SUSE/1.1.18-1.1 SeaMonkey/1.1.18

On my server I run several VMs for small tasks.
Today I found that I could no more log into my openSUSE-11.1 VM running nothing but bind9 which has reasonable 64MB RAM and 150MB swap.
After testing I found that cause was a "zypper up" which updated nscd yesterday and obviously also did an "insserv nscd" in the process.

Booting your normal system with mem=100M could probably also trigger this.

Reproducible: Always

Steps to Reproduce:
1. have x86_64 OS and 3GB free disk space to run my VM:
zypper in qemu-svn #0.11
wget -nc http://www1.zq1.de/bernhard/temp/opensuse-11.1-bad.img.gz
gzip -cd opensuse-11.1-bad.img.gz > opensuse-11.1-bad.img
2. qemu-system-x86_64 -hda opensuse-11.1-bad.img -m 128
3. wait til it is booted up, then login
root
root
4. and test
rcnscd start
su -
ssh root@localhost
rcnscd stop

Actual Results:  
# su -
su: user root does not exist
# ssh root@localhost
You don't exist, go away!
# tail /var/log/messages
... cannot write to database file /var/run/nscd/services: Cannot allocate memory

Expected Results:  
# su -
#
# ssh root@localhost
Password:
Last login: ...
Have a lot of fun...


Further tests showed that even with 140MB RAM there are errors in /var/log/messages.
"free" shows below 20MB used by apps.
Effects and error messages vary with amount of free/used RAM.
With 64MB RAM+swap, the error messages are even worse, as they never mention nscd - just non-existing users "root" and "sshd" for priviledge separation.

also tried setting up extra swap with
dd if=/dev/zero of=/tmp/swap bs=1M count=192
mkswap /tmp/swap
swapon /tmp/swap
and found that at 100MB even without nscd, ssh works only when swap is added at boot time (i.e. via /etc/fstab). So maybe PAM causes this?

I adjusted /etc/init.d/boot:111 to provide enough inodes in /dev even on low-mem systems.
My old openSUSE-10.3 system without nscd ran fine (apart from some memory leaks) with 32MB for over a year.
Comment 1 Bernhard Wiedemann 2009-10-23 16:00:27 UTC
I have done some test row at varying levels of total RAM using qemu-kvm.

what I did was very much like the above:
login as root
rcnscd start # to trigger the bug
su -

what I found is only partially making sense to me

last time I saw that at 128M su fails

at 140M su succeeds but syslog has
cannot write to database file /var/run/nscd/hosts: No such file or directory
cannot write to database file /var/run/nscd/services: Cannot allocate memory

at 180M su succeeds but syslog has
cannot write to database file /var/run/nscd/services: Cannot allocate memory

at 198M su succeeds but syslog has
could not start clean-up thread; terminating

at 205M su succeeds but syslog has
could not start any worker thread; terminating

at 209M su succeeds

at 210-215M su fails badly
# su -
su: user root does not exist

and no message is added to syslog

at 216M and above, su succeeds without error again.


Do you have any way to explain how something can fail with more memory?

btw: stopping other services or occupying extra RAM did not seem to alter behaviour.
Comment 2 Bernhard Wiedemann 2009-10-23 16:28:56 UTC
I also did at 215M
strace id root

socket(PF_FILE, 0x80801 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"passwd\0", 7}, {"\270O\3\0\0\0\0\0", 8}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 15
mmap(NULL, 217016, PROT_READ, MAP_SHARED, 4, 0) = 0x7fa9f529a000
close(4)                                = 0
close(3)                                = 0
socket(PF_FILE, 0x80801 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\0\0\0\0\5\0\0\0root\0", 17, MSG_NOSIGNAL, NULL, 0) = 17
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
read(3, "\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0"..., 36) = 36
close(3)                                = 0


then it wrote "id: root: No such user"


but at 216M "id root" does

socket(PF_FILE, 0x80801 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"passwd\0", 7}, {"\270O\3\0\0\0\0\0", 8}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 15
mmap(NULL, 217016, PROT_READ, MAP_SHARED, 4, 0) = 0x7f3b7b394000
close(4)                                = 0
close(3)                                = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3b7b393000
socket(PF_FILE, 0x80801 /* SOCK_??? */, 0) = 3
connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\f\0\0\0\6\0\0\0group\0", 18, MSG_NOSIGNAL, NULL, 0) = 18
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"group\0", 6}, {"\270O\3\0\0\0\0\0", 8}], msg_controllen=24, {cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, {4}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 14
mmap(NULL, 217016, PROT_READ, MAP_SHARED, 4, 0) = 0x7f3b7b35e000
close(4)                                = 0
close(3)


In the second part I can see an extra fstat to my pts and writing "group" instead of "root" onto the socket.
So something on the requesting side is already different.
Comment 3 Michael Calmer 2009-11-13 11:11:43 UTC
re-assign to nscd maintainer
Comment 4 Petr Baudis 2009-11-13 15:34:21 UTC
I'm sorry, 11.1 is in maintenance mode where we don't have the resources to fix bugs like this ourselves, especially since 11.2 uses a completely different nscd daemon (unscd). If this issue still persists for you in 11.2, I will look into it. Alternatively, you may report the issue upstream (nscd in 11.1 is part of glibc).