Bugzilla – Full Text Bug Listing

| Summary: | getdents breakage on symlink causes ldconfig core dump | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Martin Pluskal <mpluskal> |
| Component: | Kernel | Assignee: | Jan Kara <jack> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P3 - Medium | CC: | comes, jack, lmb, mpluskal, rjschwei, werner |
| Version: | 201412* | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | systemd-coredumpctl gdb 4612, strace, xfs_repair -n /dev/md0 | | |

Created attachment 616229 [details]
strace

Same here, exact same trace. This also implies that all rpm installs that call ldconfig now are rather ugly.

I'm unable to reproduce.

I can also upload a core + strace, if that helps.

This is a kernel bug:
getdents(3, {... {d_ino=671856685, d_off=3127, d_reclen=32, d_name="libnuma.so.1", d_type=DT_REG} ...})
$ stat /usr/lib64/libnuma.so.1
File: ‘/usr/lib64/libnuma.so.1’ -> ‘libnuma.so.1.0.0’
Size: 16 Blocks: 0 IO Block: 4096 symbolic link
Device: 900h/2304d Inode: 671856685 Links: 1
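
The mismatch shown above (getdents reporting DT_REG for an inode that stat reports as a symbolic link) can be looked for across a whole directory. The following is a minimal sketch, not something used in this bug: it assumes CPython on Linux, where `os.scandir` answers `is_symlink()`, `is_dir()` and `is_file()` from the `d_type` returned by the directory read whenever that value is not DT_UNKNOWN, and it compares that answer with a fresh `lstat`. The function names are hypothetical.

```python
import os
import stat


def dtype_kind(entry):
    """Classify a DirEntry from the type the directory read reported."""
    if entry.is_symlink():
        return "symlink"
    if entry.is_dir(follow_symlinks=False):
        return "dir"
    if entry.is_file(follow_symlinks=False):
        return "file"
    return "other"


def inode_kind(path):
    """Classify a path from the inode itself, via lstat(2)."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return "symlink"
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


def find_type_mismatches(directory):
    """Return (name, directory_type, inode_type) for entries that disagree."""
    mismatches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            from_dir = dtype_kind(entry)
            from_inode = inode_kind(entry.path)
            if from_dir != from_inode:
                mismatches.append((entry.name, from_dir, from_inode))
    return mismatches


if __name__ == "__main__":
    # /usr/lib64 is the directory from this report; adjust as needed.
    for name, from_dir, from_inode in find_type_mismatches("/usr/lib64"):
        print(f"{name}: directory says {from_dir}, inode says {from_inode}")
```

On a healthy filesystem the list should be empty. On filesystems that report DT_UNKNOWN, `os.scandir` falls back to `lstat` itself, so this check cannot detect anything there.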

Problem persists with 3.19.0-rc1-1.g85f0072-desktop and glibc-2.20-2.1.x86_64 and libnuma1-2.0.10-1.1. Indeed, if I remove libnuma1, ldconfig does not crash. Somehow, ldconfig becomes confused trying to parse those links. Andreas seems to think this is a kernel bug?

I am not sure if it is relevant but fs is xfs, issue occurs with both 3.17.4 and 3.18.1

Hi could you look into this? Thanks a lot.

I guess Andreas points at the fact that getdents(2) reports the file as being a regular file (d_type == DT_REG) but it is in fact a symbolic link as stat(2) shows. I agree this is a kernel problem (I'll look into that) although it isn't completely clear to me whether this is a problem which makes ldconfig(1) crash.

Looking at medusa.suse.cz it seems the filetype is wrong on disk so this looks like an XFS bug. I'll look into it further...

OK, so when I deleted the libnuma.so.1 symlink with wrong file type and created a new one, it got created with proper file type. Now also ldconfig doesn't crash. So the question is how file type in the directory gets corrupted...

I have checked the code and I don't see where we could get the file type wrong. Guys do you know when the problem started appearing?

(In reply to Jan Kara from comment #12)
> I have checked the code and I don't see where we could get the file type
> wrong. Guys do you know when the problem started appearing?
If I recall correctly, issue started to occur after numactl was updated - probably after this https://build.opensuse.org/request/show/262711 , but there does not seem to be anything suspicious in sr.

OK, that makes some sense. On the update symlink libnuma.so.1 has been recreated and apparently got a wrong file type. Can you run "xfs_repair -n" on the filesystem (either from a rescue CD or single-user mode) to check whether there are more inconsistencies of this kind?

Oh, and if you find some inconsistencies, please run "xfs_metadump -o <root-device> <some-file-eg-on-usb-stick>" to preserve fs metadata for further inspection and after that you can run xfs_repair without -n to fix the filesystem. Thanks!

Created attachment 618715 [details]
xfs_repair -n /dev/md0
xfs_metadump is placed at /boot/908780/xfs_metadump.log.xz on medusa.suse.cz (too large for bugzilla)

Thanks, I have copied it to my machine and will investigate. BTW, Martin, have you cleanly unmounted the filesystem before you ran xfs_repair? I'm just wondering because of those unlinked inode xfs_repair also complains about.

(In reply to Jan Kara from comment #18)
> BTW, Martin, have you cleanly unmounted the filesystem before you ran
> xfs_repair? I'm just wondering because of those unlinked inode xfs_repair
> also complains about.
Honestly I am not sure, it might be possible that if something went wrong during shutdown/reboot that machine might have been reset by watchdog.

Btw I am not sure if boo#910336 is not somehow related.

That's a good comment but I don't think it does - after that even you ran xfs_repair and it didn't find any inconsistency. So at least at that point the filesystem was still clean. But at least it does give us some clue that the corruption happened relatively recently.

I was looking into this for a while but I still cannot find the place where the corruption happens and I'll need to create a reproducer. Since the corruption happens in /usr/bin and /usr/lib it's likely triggered by package updates. Hopefully I can somehow simulate that...

Martin, have you seen the issue happen recently? If not, maybe patches from boo#910336 did help after all...

(In reply to Jan Kara from comment #23)
> Martin, have you seen the issue happen recently? If not, maybe patches from
> boo#910336 did help after all...
I haven't seen this issue for a while so we can probably assume that they helped.

OK, I'll close the bug for now assuming patches fixed the problem. Please reopen the bug in case you see the problem again.

(In reply to Jan Kara from comment #25)
Hmmm ... I see

d136:~ # ldconfig
Aborted
d136:~ # rpm -q --changelog kernel-desktop | head
* Tue Apr 14 2015 jlee@suse.com
- Update config files. (boo#925479)
  Do not set CONFIG_SYSTEM_TRUSTED_KEYRING until we need it in future openSUSE version: e.g.
  MODULE_SIG, IMA, PKCS7(new), KEXEC_BZIMAGE_VERIFY_SIG(new)
- commit 74c332b
* Mon Apr 13 2015 jslaby@suse.cz
- Linux 3.19.4.
- commit 51ddeac

... has this fix really reached openSUSE Factory?

So when the problem happens, the filesystem gets corrupted (the kernel bug results in a filesystem corruption) and you have to run fsck.xfs to fix the problem. If the problem happens again after you've fixed your filesystem with fsck, please report here. Thanks!

*** Bug 928534 has been marked as a duplicate of this bug. ***

*** Bug 941305 has been marked as a duplicate of this bug. ***

I submitted boo#941305 and I was redirected here.
System: 13.2 with all updates, root filesystem: xfs
I boot from the rescue image and I run
xfs_repair /dev/sda2
This will fix a lot of ftype mismatch errors.
Now I reboot and run:
zypper rm sbl
zypper in sbl-3.5.0 (install previous version of sbl)
zypper in sbl (install sbl update)
There is the following message:
Additional rpm output:
/var/tmp/rpm-tmp.7MTEpn: line 1: 26734 Aborted /sbin/ldconfig
Now ldconfig doesn't work anymore:
ldconfig
Aborted
I reboot from the rescue image and I run again xfs_repair /dev/sda2:
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 0
        - agno = 2
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
fixing ftype mismatch (1/7) in directory/child inode 578442/793961
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
A new ftype mismatch error was fixed.
Now I reboot and ldconfig works again.

Well it seems that fix mentioned at boo#910336#c14 never got to any maintenance update and thus is not present in 13.2.

If the patches mentioned at boo#910336#c14 are the following ones:

http://oss.sgi.com/archives/xfs/2015-01/msg00341.html
http://oss.sgi.com/archives/xfs/2015-01/msg00339.html
http://oss.sgi.com/archives/xfs/2015-01/msg00363.html
http://oss.sgi.com/archives/xfs/2015-01/msg00364.html
http://oss.sgi.com/archives/xfs/2015-01/msg00342.html

then they are all included in 13.2. If not then please let me know where the patch is so I can test it.

(In reply to Giacomo Comes from comment #32)
> If the patches mentioned at boo#910336#c14 are the following ones:
>
> http://oss.sgi.com/archives/xfs/2015-01/msg00341.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00339.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00363.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00364.html
> http://oss.sgi.com/archives/xfs/2015-01/msg00342.html
>
> then they are all included in 13.2. If not then please let me know where
> the patch is so I can test it.
hmpf this would actually suggest that issue is not fixed :(

Closing (see comment#5 at boo#941305)

*** Bug 945909 has been marked as a duplicate of this bug. ***
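
The xfs_repair run quoted earlier in this thread reports the corruption as "fixing ftype mismatch (1/7) in directory/child inode 578442/793961". A rough sketch for pulling such lines out of a saved xfs_repair log follows. Two things here are assumptions rather than facts taken from this bug: that the pair in parentheses is (file type stored in the directory entry / file type of the inode), and that the numbers follow the XFS on-disk directory file type codes (1 = regular file, 7 = symlink). Under those assumptions the line matches the libnuma.so.1 case, where the directory entry claimed a regular file and the inode was a symlink. The function name and dictionary keys are hypothetical.

```python
import re

# Assumed XFS directory-entry file type codes (on-disk format values).
XFS_FTYPE_NAMES = {
    0: "unknown",
    1: "regular file",
    2: "directory",
    3: "char device",
    4: "block device",
    5: "fifo",
    6: "socket",
    7: "symlink",
}

# Matches the message text seen in this thread, whatever verb precedes it.
MISMATCH_RE = re.compile(
    r"ftype mismatch \((\d+)/(\d+)\) in directory/child inode (\d+)/(\d+)"
)


def parse_ftype_mismatches(output):
    """Extract ftype mismatch reports from xfs_repair output text."""
    results = []
    for match in MISMATCH_RE.finditer(output):
        dir_ftype, inode_ftype, dir_ino, child_ino = map(int, match.groups())
        results.append(
            {
                "dir_inode": dir_ino,
                "child_inode": child_ino,
                # Assumed order: directory entry type first, inode type second.
                "dir_ftype": XFS_FTYPE_NAMES.get(dir_ftype, str(dir_ftype)),
                "inode_ftype": XFS_FTYPE_NAMES.get(inode_ftype, str(inode_ftype)),
            }
        )
    return results


if __name__ == "__main__":
    line = "fixing ftype mismatch (1/7) in directory/child inode 578442/793961"
    for report in parse_ftype_mismatches(line):
        print(report)
```

Counting the parsed entries before and after a package update gives a quick way to tell whether a new mismatch appeared, which is what the sbl reinstall test above does by hand.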

Created attachment 616228 [details]
systemd-coredumpctl gdb 4612

On current factory, following occurs:

# ldconfig
Aborted (core dumped)

# rpm -qf `which ldconfig`
glibc-2.20-2.1.x86_64

(gdb) bt
#0 0x0000000000458d57 in raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x000000000040ddfa in abort () at abort.c:78
#2 0x0000000000404905 in insert_to_aux_cache (id=id@entry=0x7fff186e3ca0, flags=771, osversion=0, soname=0x1acde30 "libnuma.so.1", used=used@entry=1) at cache.c:648
#3 0x0000000000405654 in add_to_aux_cache (stat_buf=stat_buf@entry=0x7fff186e4d50, flags=<optimized out>, osversion=<optimized out>, soname=<optimized out>) at cache.c:673
#4 0x00000000004043b9 in search_dir (entry=0x1a9d780, entry=0x1a9d780) at ldconfig.c:887
#5 0x0000000000401b04 in search_dirs () at ldconfig.c:1030
#6 main (argc=<optimized out>, argv=<optimized out>) at ldconfig.c:1385

# cat /etc/ld.so.conf
/usr/local/lib64
/usr/local/lib
include /etc/ld.so.conf.d/*.conf
# /lib64, /lib, /usr/lib64 and /usr/lib gets added
# automatically by ldconfig after parsing this file.
# So, they do not need to be listed.

# cat /etc/ld.so.conf.d/*
/usr/lib64/graphviz
/usr/lib64/graphviz/sharp
/usr/lib64/graphviz/java
/usr/lib64/graphviz/perl
/usr/lib64/graphviz/php
/usr/lib64/graphviz/ocaml
/usr/lib64/graphviz/python
/usr/lib64/graphviz/lua
/usr/lib64/graphviz/tcl
/usr/lib64/graphviz/guile
/usr/lib64/graphviz/ruby

# strace ldconfig
...
lstat("/usr/lib64/libldap_r-2.4.so.2.10.2", {st_mode=S_IFREG|0755, st_size=335584, ...}) = 0
lstat("/usr/lib64/libnuma.so.1", {st_mode=S_IFLNK|0777, st_size=16, ...}) = 0
open("/usr/lib64/libnuma.so.1", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0755, st_size=48256, ...}) = 0
mmap(NULL, 48256, PROT_READ, MAP_SHARED, 4, 0) = 0x7f5c0a3fe000
munmap(0x7f5c0a3fe000, 48256) = 0
close(4) = 0
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 17702
tgkill(17702, 17702, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=17702, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++