Bug 908780

Summary: getdents breakage on symlink causes ldconfig core dump
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Martin Pluskal <mpluskal>
Component: Kernel
Assignee: Jan Kara <jack>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P3 - Medium
CC: comes, jack, lmb, mpluskal, rjschwei, werner
Version: 201412*
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: systemd-coredumpctl gdb 4612
strace
xfs_repair -n /dev/md0

Description Martin Pluskal 2014-12-08 08:26:05 UTC
Created attachment 616228 [details]
systemd-coredumpctl gdb 4612

On current Factory, the following occurs:
# ldconfig 
Aborted (core dumped)

# rpm -qf `which ldconfig`
glibc-2.20-2.1.x86_64

(gdb) bt
#0  0x0000000000458d57 in raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1  0x000000000040ddfa in abort () at abort.c:78
#2  0x0000000000404905 in insert_to_aux_cache (id=id@entry=0x7fff186e3ca0, flags=771, osversion=0, soname=0x1acde30 "libnuma.so.1", used=used@entry=1) at cache.c:648
#3  0x0000000000405654 in add_to_aux_cache (stat_buf=stat_buf@entry=0x7fff186e4d50, flags=<optimized out>, osversion=<optimized out>, soname=<optimized out>) at cache.c:673
#4  0x00000000004043b9 in search_dir (entry=0x1a9d780, entry=0x1a9d780) at ldconfig.c:887
#5  0x0000000000401b04 in search_dirs () at ldconfig.c:1030
#6  main (argc=<optimized out>, argv=<optimized out>) at ldconfig.c:1385


# cat /etc/ld.so.conf
/usr/local/lib64
/usr/local/lib
include /etc/ld.so.conf.d/*.conf
# /lib64, /lib, /usr/lib64 and /usr/lib gets added
# automatically by ldconfig after parsing this file.
# So, they do not need to be listed.

# cat /etc/ld.so.conf.d/*
/usr/lib64/graphviz
/usr/lib64/graphviz/sharp
/usr/lib64/graphviz/java
/usr/lib64/graphviz/perl
/usr/lib64/graphviz/php
/usr/lib64/graphviz/ocaml
/usr/lib64/graphviz/python
/usr/lib64/graphviz/lua
/usr/lib64/graphviz/tcl
/usr/lib64/graphviz/guile
/usr/lib64/graphviz/ruby

# strace ldconfig
...
lstat("/usr/lib64/libldap_r-2.4.so.2.10.2", {st_mode=S_IFREG|0755, st_size=335584, ...}) = 0
lstat("/usr/lib64/libnuma.so.1", {st_mode=S_IFLNK|0777, st_size=16, ...}) = 0
open("/usr/lib64/libnuma.so.1", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0755, st_size=48256, ...}) = 0
mmap(NULL, 48256, PROT_READ, MAP_SHARED, 4, 0) = 0x7f5c0a3fe000
munmap(0x7f5c0a3fe000, 48256)           = 0
close(4)                                = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 17702
tgkill(17702, 17702, SIGABRT)           = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=17702, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++
Comment 1 Martin Pluskal 2014-12-08 08:26:27 UTC
Created attachment 616229 [details]
strace
Comment 2 Lars Marowsky-Bree 2014-12-11 12:18:47 UTC
Same here, exact same trace.

This also implies that all rpm installs that call ldconfig now are rather ugly.
Comment 3 Andreas Schwab 2014-12-11 12:51:34 UTC
I'm unable to reproduce.
Comment 5 Lars Marowsky-Bree 2014-12-11 16:47:34 UTC
I can also upload a core + strace, if that helps.
Comment 6 Andreas Schwab 2014-12-15 10:50:22 UTC
This is a kernel bug:

getdents(3, {... {d_ino=671856685, d_off=3127, d_reclen=32, d_name="libnuma.so.1", d_type=DT_REG} ...})

$ stat /usr/lib64/libnuma.so.1
  File: ‘/usr/lib64/libnuma.so.1’ -> ‘libnuma.so.1.0.0’
  Size: 16              Blocks: 0          IO Block: 4096   symbolic link
Device: 900h/2304d      Inode: 671856685   Links: 1
Comment 7 Lars Marowsky-Bree 2014-12-23 19:45:02 UTC
Problem persists with 3.19.0-rc1-1.g85f0072-desktop and glibc-2.20-2.1.x86_64 and libnuma1-2.0.10-1.1.

Indeed, if I remove libnuma1, ldconfig does not crash. Somehow, ldconfig becomes confused trying to parse those links. Andreas seems to think this is a kernel bug?
Comment 8 Martin Pluskal 2014-12-28 10:00:38 UTC
I am not sure if it is relevant, but the filesystem is XFS, and the issue occurs with both 3.17.4 and 3.18.1.
Comment 9 Martin Pluskal 2014-12-28 12:55:55 UTC
Hi could you look into this? Thanks a lot.
Comment 10 Jan Kara 2015-01-05 16:30:10 UTC
I guess Andreas points at the fact that getdents(2) reports the file as being a regular file (d_type == DT_REG) but it is in fact a symbolic link as stat(2) shows. I agree this is a kernel problem (I'll look into that) although it isn't completely clear to me whether this is a problem which makes ldconfig(1) crash.
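The mismatch described above (the directory entry says "regular file" while the inode is a symbolic link) can be checked for across a whole directory. The following is only a rough sketch, not the tool used in this bug: the function names are made up for illustration, and it assumes that Python's os.scandir() answers is_symlink()/is_file()/is_dir() from the d_type field when the filesystem supplies one, falling back to lstat(2) otherwise.

```python
import os
import stat


def dirent_kind(entry):
    """Kind as reported by the directory entry (d_type when available)."""
    if entry.is_symlink():
        return "symlink"
    if entry.is_dir(follow_symlinks=False):
        return "dir"
    if entry.is_file(follow_symlinks=False):
        return "file"
    return "other"


def inode_kind(path):
    """Kind as reported by the inode itself, via lstat(2)."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


def find_type_mismatches(directory):
    """Return (path, dirent kind, inode kind) for entries where the two disagree."""
    mismatches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            from_dirent = dirent_kind(entry)
            from_inode = inode_kind(entry.path)
            if from_dirent != from_inode:
                mismatches.append((entry.path, from_dirent, from_inode))
    return mismatches
```

On a healthy filesystem find_type_mismatches("/usr/lib64") should return an empty list; on the affected machine one would expect an entry such as libnuma.so.1 reported as "file" by the directory and "symlink" by the inode.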
Comment 11 Jan Kara 2015-01-05 16:47:12 UTC
Looking at medusa.suse.cz it seems the filetype is wrong on disk so this looks like an XFS bug. I'll look into it further...
Comment 12 Jan Kara 2015-01-05 20:22:00 UTC
OK, so when I deleted the libnuma.so.1 symlink with wrong file type and created a new one, it got created with proper file type. Now also ldconfig doesn't crash. So the question is how file type in the directory gets corrupted...

I have checked the code and I don't see where we could get the file type wrong. Guys do you know when the problem started appearing?
Comment 13 Martin Pluskal 2015-01-06 07:24:51 UTC
(In reply to Jan Kara from comment #12)
> 
> I have checked the code and I don't see where we could get the file type
> wrong. Guys do you know when the problem started appearing?
If I recall correctly, the issue started to occur after numactl was updated - probably after this https://build.opensuse.org/request/show/262711 , but there does not seem to be anything suspicious in the request.
Comment 14 Jan Kara 2015-01-06 08:12:14 UTC
OK, that makes some sense. On the update, the symlink libnuma.so.1 was recreated and apparently got a wrong file type. Can you run "xfs_repair -n" on the filesystem (either from a rescue CD or single-user mode) to check whether there are more inconsistencies of this kind?
Comment 15 Jan Kara 2015-01-06 08:14:58 UTC
Oh, and if you find some inconsistencies, please run "xfs_metadump -o <root-device> <some-file-eg-on-usb-stick>" to preserve fs metadata for further inspection and after that you can run xfs_repair without -n to fix the filesystem. Thanks!
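The check / dump / repair order requested above could be scripted roughly as follows. This is a sketch under assumptions: the function name and the injectable run parameter are invented for illustration, and it relies on the xfs_repair man page's statement that "xfs_repair -n" exits with status 1 when corruption is detected and 0 when it is not. It must be run against an unmounted filesystem.

```python
import subprocess


def check_dump_repair(device, dump_path, run=subprocess.call):
    """Run xfs_repair -n, and only if it reports corruption, save a
    metadump and then repair. Returns the list of commands executed.

    `run` is a hypothetical hook (defaults to subprocess.call) so the
    sequence can be dry-run without touching a real device.
    """
    executed = []

    def step(cmd):
        executed.append(cmd)
        return run(cmd)

    # -n: no-modify mode, only report what would be fixed.
    status = step(["xfs_repair", "-n", device])
    if status == 0:
        return executed  # no inconsistencies reported
    if status != 1:
        raise RuntimeError("xfs_repair -n failed with status %d" % status)

    # Preserve the metadata first; -o disables name obfuscation.
    if step(["xfs_metadump", "-o", device, dump_path]) != 0:
        raise RuntimeError("xfs_metadump failed; not repairing")

    # Only now modify the filesystem.
    step(["xfs_repair", device])
    return executed
```

Taking the dump before the real repair matters because xfs_repair rewrites the very metadata that is needed for later inspection.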
Comment 16 Martin Pluskal 2015-01-06 10:10:19 UTC
Created attachment 618715 [details]
xfs_repair -n /dev/md0

xfs_metadump is placed at /boot/908780/xfs_metadump.log.xz on medusa.suse.cz (too large for bugzilla)
Comment 17 Jan Kara 2015-01-06 13:17:28 UTC
Thanks, I have copied it to my machine and will investigate.
Comment 18 Jan Kara 2015-01-06 22:02:53 UTC
BTW, Martin, had you cleanly unmounted the filesystem before you ran xfs_repair? I'm just wondering because of those unlinked inodes xfs_repair also complains about.
Comment 19 Martin Pluskal 2015-01-06 22:41:22 UTC
(In reply to Jan Kara from comment #18)
> BTW, Martin, have you cleanly unmounted the filesystem before you ran
> xfs_repair? I'm just wondering because of those unlinked inode xfs_repair
> also complains about.

Honestly, I am not sure. If something went wrong during shutdown/reboot, the machine might have been reset by the watchdog.
Comment 20 Martin Pluskal 2015-01-08 15:45:03 UTC
Btw I am not sure if boo#910336 is not somehow related.
Comment 21 Jan Kara 2015-01-08 17:01:45 UTC
That's a good point but I don't think it is - after that event you ran xfs_repair and it didn't find any inconsistency. So at least at that point the filesystem was still clean. But at least it does give us some clue that the corruption happened relatively recently.
Comment 22 Jan Kara 2015-01-29 13:36:26 UTC
I was looking into this for a while but I still cannot find the place where the corruption happens and I'll need to create a reproducer. Since the corruption happens in /usr/bin and /usr/lib it's likely triggered by package updates. Hopefully I can somehow simulate that...
Comment 23 Jan Kara 2015-03-04 16:23:49 UTC
Martin, have you seen the issue happen recently? If not, maybe patches from boo#910336 did help after all...
Comment 24 Martin Pluskal 2015-03-04 18:22:16 UTC
(In reply to Jan Kara from comment #23)
> Martin, have you seen the issue happen recently? If not, maybe patches from
> boo#910336 did help after all...

I haven't seen this issue for a while so we can probably assume that they helped.
Comment 25 Jan Kara 2015-04-21 14:55:55 UTC
OK, I'll close the bug for now assuming patches fixed the problem. Please reopen the bug in case you see the problem again.
Comment 26 Dr. Werner Fink 2015-04-24 12:44:43 UTC
(In reply to Jan Kara from comment #25)

Hmmm ... I see

  d136:~ # ldconfig               
  Aborted
  d136:~ # rpm -q --changelog kernel-desktop | head
  * Tue Apr 14 2015 jlee@suse.com
  - Update config files. (boo#925479)
    Do not set CONFIG_SYSTEM_TRUSTED_KEYRING until we need it in future
    openSUSE version:
    e.g. MODULE_SIG, IMA, PKCS7(new), KEXEC_BZIMAGE_VERIFY_SIG(new)
  - commit 74c332b

  * Mon Apr 13 2015 jslaby@suse.cz
  - Linux 3.19.4.
  - commit 51ddeac

... has this fix really reached openSUSE Factory?
Comment 27 Jan Kara 2015-04-24 17:09:15 UTC
So when the problem happens, the filesystem gets corrupted (the kernel bug results in a filesystem corruption) and you have to run xfs_repair to fix the problem. If the problem happens again after you've fixed your filesystem with xfs_repair, please report here. Thanks!
Comment 28 Bernhard Wiedemann 2015-04-26 12:31:36 UTC
*** Bug 928534 has been marked as a duplicate of this bug. ***
Comment 29 Andreas Schwab 2015-08-12 15:27:52 UTC
*** Bug 941305 has been marked as a duplicate of this bug. ***
Comment 30 Giacomo Comes 2015-08-12 17:50:01 UTC
I submitted boo#941305 and I was redirected here.

System: 13.2 with all updates, root filesystem: xfs

I booted from the rescue image and ran xfs_repair /dev/sda2.
This fixed a lot of ftype mismatch errors.

Now I reboot and run:
zypper rm sbl
zypper in sbl-3.5.0 (install previous version of sbl)
zypper in sbl (install sbl update)

There is the following message:
 Additional rpm output:
/var/tmp/rpm-tmp.7MTEpn: line 1: 26734 Aborted                 /sbin/ldconfig

Now ldconfig doesn't work anymore:

ldconfig
  Aborted

I reboot from the rescue image and I run again xfs_repair /dev/sda2:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 0
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
fixing ftype mismatch (1/7) in directory/child inode 578442/793961
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

A new ftype mismatch error was fixed. Now I reboot and ldconfig works again.
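For anyone comparing several xfs_repair runs, the "ftype mismatch" lines can be pulled out of the log with a few lines of Python. This is a minimal sketch with assumptions: the function name is made up, the code-to-name table is my reading of the XFS directory ftype values (1 = regular file, 7 = symlink, and so on), and I am assuming the first number is the type stored in the directory entry and the second the type derived from the inode, which matches the DT_REG-versus-symlink case seen earlier in this bug.

```python
import re

# Assumed meaning of the XFS directory-entry ftype codes.
FTYPE_NAMES = {
    0: "unknown",
    1: "regular file",
    2: "directory",
    3: "char device",
    4: "block device",
    5: "fifo",
    6: "socket",
    7: "symlink",
}

# Matches both the repair wording ("fixing") and the wording assumed
# for no-modify mode ("would fix").
MISMATCH_RE = re.compile(
    r"(?:fixing|would fix) ftype mismatch \((\d+)/(\d+)\) "
    r"in directory/child inode (\d+)/(\d+)"
)


def parse_ftype_mismatches(report):
    """Extract ftype mismatch records from xfs_repair output text."""
    records = []
    for match in MISMATCH_RE.finditer(report):
        dir_ftype, inode_ftype, dir_ino, child_ino = map(int, match.groups())
        records.append({
            "directory_inode": dir_ino,
            "child_inode": child_ino,
            "dirent_type": FTYPE_NAMES.get(dir_ftype, "code %d" % dir_ftype),
            "inode_type": FTYPE_NAMES.get(inode_ftype, "code %d" % inode_ftype),
        })
    return records
```

Under those assumptions, the "(1/7)" line above reads as: the entry in directory inode 578442 claimed a regular file, while child inode 793961 is a symlink.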
Comment 31 Martin Pluskal 2015-08-12 18:38:53 UTC
Well, it seems that the fix mentioned at boo#910336#c14 never made it into any maintenance update and thus is not present in 13.2.
Comment 32 Giacomo Comes 2015-08-12 18:55:26 UTC
If the patches mentioned at boo#910336#c14 are the following ones:

http://oss.sgi.com/archives/xfs/2015-01/msg00341.html
http://oss.sgi.com/archives/xfs/2015-01/msg00339.html
http://oss.sgi.com/archives/xfs/2015-01/msg00363.html
http://oss.sgi.com/archives/xfs/2015-01/msg00364.html
http://oss.sgi.com/archives/xfs/2015-01/msg00342.html

then they are all included in 13.2. If not then please let me know where 
the patch is so I can test it.
Comment 33 Martin Pluskal 2015-08-12 18:59:22 UTC
(In reply to Giacomo Comes from comment #32)
> If the patches mentioned at boo#910336#c14 are the following ones:
> 
> http://oss.sgi.com/archives/xfs/2015-01/msg00341.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00339.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00363.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00364.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00342.html
> 
> then they are all included in 13.2. If not then please let me know where 
> the patch is so I can test it.

Hmpf, this would actually suggest that the issue is not fixed :(
Comment 34 Martin Pluskal 2015-08-13 08:34:52 UTC
Closing (see comment#5 at boo#941305)
Comment 35 Michal Marek 2015-10-07 10:11:43 UTC
*** Bug 945909 has been marked as a duplicate of this bug. ***