Bug 929806

Summary: y2base occasionally freezes during install due to bug exposed by glibc: SR#295007
Product: [openSUSE] openSUSE Tumbleweed Reporter: Dominique Leuenberger <dimstar>
Component: Installation Assignee: Martin Vidner <mvidner>
Status: RESOLVED FIXED QA Contact: Jiri Srain <jsrain>
Severity: Normal    
Priority: P1 - Urgent CC: dimstar, mgorman, mpluskal, mvidner, schwab
Version: 201503* Flags: mvidner: needinfo? (mgorman)
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Dominique Leuenberger 2015-05-06 06:26:19 UTC
The latest submission of glibc has been reverted in openSUSE:Factory again.

Already in the staging area we saw an increased number of 'installation hangs' with this patch. The system would simply deadlock and never finish installing.

To rule out that this was just an anomaly, we checked it in to openSUSE:Factory, where we saw the same hang across multiple test runs.

As a consequence, that patch has been reverted again until somebody can find out / fix the installation hangs introduced by it.

A sample test:
https://openqa.opensuse.org/tests/60367/modules/livecdreboot/steps/21
Comment 1 Mel Gorman 2015-05-06 06:53:38 UTC
Note that this probably indicates that the installer uses uninitialised memory. The primary impact this patch has is that some new allocations that were filled with zeros now contain uninitialised data. In older versions of glibc the application would work by coincidence. Reverting the patch avoids the problem temporarily but it'll recur when glibc 2.22 is released if it's used by openSUSE.

One way of testing would be to force the installer to globally set MALLOC_CHECK_=2 during installation and see whether that "fixes" it. I don't know how to set up a temporary installation environment like that but some of the yast people should.
Comment 2 Mel Gorman 2015-05-06 10:35:08 UTC
Marcus, I see you assigned this to Andreas but did you see comment 2 where it was stated that this is very likely to be a bug in the installer using uninitialised memory?
Comment 3 Dominique Leuenberger 2015-05-06 10:38:11 UTC
(In reply to Mel Gorman from comment #2)
> Marcus, I see you assigned this to Andreas but did you see comment 2 where
> it was stated that this is very likely to be a bug in the installer using
> uninitialised memory?

Or rpm - or any of the rpm scriptlets running code. or libzypp, or [...]

In the various tests I'd seen, the lockup was not always in the same package(s).
Comment 4 Mel Gorman 2015-05-06 10:59:33 UTC
(In reply to Dominique Leuenberger from comment #3)
> (In reply to Mel Gorman from comment #2)
> > Marcus, I see you assigned this to Andreas but did you see comment 2 where
> > it was stated that this is very likely to be a bug in the installer using
> > uninitialised memory?
> 
> Or rpm - or any of the rpm scriptlets running code. or libzypp, or [...]
> 
> In the various tests I'd seen, the lockup was not always in the same
> package(s).

I think the installation scripts are a bad fit because we'd expect the same packages to freeze each time. It's also very likely that they are single-threaded which means they are unaffected by the glibc patch. rpm also feels like a bad fit because it's short-lived and I don't see calls to pthread_create in there.

libzypp, zypper or the installer are better candidates because at least zypper is threaded and they're long-lived enough to eventually see an unlucky allocation pattern that gets uninitialised memory. I guessed the installer simply because zypper use on an installed system seems ok.
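
Whether a given long-lived process is actually threaded can be checked on a running Linux system. The following is a minimal sketch, not something used in this report: the helper name `thread_count` is hypothetical, and it assumes `/proc` is mounted and `awk` is available.

```shell
#!/bin/sh
# Print the number of threads of a running process, taken from the
# "Threads:" field of /proc/PID/status (Linux only).
# thread_count is a hypothetical helper name used for illustration.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Example: the current shell. A candidate such as zypper would be checked
# the same way with its own PID.
thread_count "$$"
```

A result above 1 means the process has created additional threads and so falls into the group described above as affected by the glibc change.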

Bugs due to uninitialised memory are not bugs in glibc though, so the assignee is still inappropriate.
Comment 5 Andreas Schwab 2015-05-06 14:49:23 UTC
Please file bug reports for every lockup you see and assign to the respective maintainer.
Comment 6 Mel Gorman 2015-05-07 07:18:51 UTC
The installer is the most likely component to be locking up here but I don't know how to set up the appropriate test environment. I'm going to attempt a reassign and see if the maintainers respond. To be clear, based on previous tests I believe that the installer is using uninitialised memory and getting confused. If it's not fixed now, it'll just be a problem later when glibc is next updated.
Comment 7 Martin Vidner 2015-05-11 15:56:15 UTC
> One way of testing would be to force the installer to globally set MALLOC_CHECK_=2 during installation and see whether that "fixes" it. I don't know how to set up a temporary installation environment like that but some of the yast people should.

Sure :-)

Simply use a boot parameter MALLOC_CHECK_=2 and the installer will export it to the environment, producing the desired result. It seems even PID 1 has it.
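
To confirm that the variable really reached a process, its environment can be read from `/proc`. This is a minimal sketch under assumptions: the helper name `env_file_has` is hypothetical, it relies on the Linux `/proc/PID/environ` format (NUL-separated `NAME=value` entries), and reading PID 1's environment normally needs root.

```shell
#!/bin/sh
# Succeed if the NUL-separated environment file $1 contains the entry $2=$3.
# /proc/PID/environ has this format on Linux.
# env_file_has is a hypothetical helper name used for illustration.
env_file_has() {
    tr '\0' '\n' < "$1" | grep -qxF "$2=$3"
}

# Check PID 1 inside the installation system.
if env_file_has /proc/1/environ MALLOC_CHECK_ 2 2>/dev/null; then
    echo "PID 1 has MALLOC_CHECK_=2"
else
    echo "PID 1 does not have MALLOC_CHECK_=2 (or its environment is unreadable)"
fi
```

The same helper works for any other process by passing `/proc/PID/environ` for its PID.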
Comment 8 Mel Gorman 2015-05-13 09:05:06 UTC
(In reply to Martin Vidner from comment #7)
> > One way of testing would be to force the installer to globally set MALLOC_CHECK_=2 during installation and see whether that "fixes" it. I don't know how to set up a temporary installation environment like that but some of the yast people should.
> 
> Sure :-)
> 
> Simply use a boot parameter MALLOC_CHECK_=2 and the installer will export it
> to the environment, producing the desired result. It seems even PID 1 has it.

All righty Martin, thanks.

Dominique, I know these are dumb questions but I never deal with the installer and just want to push this along so we don't get burned in the future when glibc updates again. Is there still an ISO image available that freezes during install? I can at least download it and see if MALLOC_CHECK_=2 "fixes" it. That would at least indicate that something in the installer has an uninitialised memory bug.
Comment 9 Dominique Leuenberger 2015-05-13 09:14:05 UTC
@Mel,

The link in the original comment to openQA also allows you to get the ISO file used for the task.

https://openqa.opensuse.org/tests/60367 => https://openqa.opensuse.org/tests/60367/asset/3037

The difficulty in finding the root cause will likely be that it's not necessarily the yast installer failing; it could as well be RPM (as we spawn rpm every so often), zypp/libzypp, or any of the rpm scriptlet commands that might possibly cause this.
Comment 10 Mel Gorman 2015-05-13 15:04:29 UTC
(In reply to Dominique Leuenberger from comment #9)
> @Mel,
> 
> The link in the original comment to openQA also allows you to get the ISO
> file used for the task.
> 
> https://openqa.opensuse.org/tests/60367 =>
> https://openqa.opensuse.org/tests/60367/asset/3037
> 

Well, I get a duh prize.

I used the ISO and KVM to reproduce this. 1 in 5 installations appear to fail with a freeze where the UI ceases to interact -- X pointer works, no text can be selected and the UI cannot be interacted with. Terminal switching still works and using that I checked what was active.

There were no RPM scripts active or any portion of rpm. tar existed as a zombie process that was a child of y2base. Even if they were the problem with packages, the UI would not freeze and besides, it would always be the same package that froze. The window manager is not threaded so that's not likely to be the problem. What appears to be frozen is y2base.
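
A check of this kind can be approximated from a text console as follows. This is only a sketch under assumptions: the exact commands used are not recorded in this report, the helper name `filter_zombies` is hypothetical, and it assumes a procps-style `ps` that accepts `-e` and `-o` (a minimal installation environment may ship a different `ps`).

```shell
#!/bin/sh
# Read "PID PPID STAT COMMAND" lines on stdin and print the zombie entries
# (STAT starting with Z) as "PID PPID COMMAND".
# filter_zombies is a hypothetical helper name used for illustration.
filter_zombies() {
    awk '$3 ~ /^Z/ { print $1, $2, $4 }'
}

# List every zombie on the system together with its parent PID.
# Separate -o options with an empty header suppress the header line.
ps -e -o pid= -o ppid= -o stat= -o comm= | filter_zombies
```

The parent PID in the second column can then be matched against the PID of the suspected process, here y2base.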

I'll now test with MALLOC_CHECK_=2 and see whether it still freezes, but right now y2base appears to be the primary candidate. Martin, would you be able to run the installer through valgrind, or identify someone on the yast team who could, to see if it spits out any warnings about uninitialised memory use and debug it? Ideally it would be with the devel version of glibc, but that's not strictly necessary as uninitialised memory use is unconditionally a bug regardless of the system libraries used.
Comment 11 Mel Gorman 2015-05-13 18:25:09 UTC
(In reply to Mel Gorman from comment #10)
> (In reply to Dominique Leuenberger from comment #9)
> > @Mel,
> > 
> > The link in the original comment to openQA also allows you to get the ISO
> > file used for the task.
> > 
> > https://openqa.opensuse.org/tests/60367 =>
> > https://openqa.opensuse.org/tests/60367/asset/3037
> > 
> 
> <SNIP>
> I used the ISO and KVM to reproduce this. 1 in 5 installations appear to fail
> with a freeze where the UI ceases to interact -- X pointer works, no text
> can be selected and the UI cannot be interacted with. Terminal switching
> still works and using that I checked what was active.
> 
> I'll now test with MALLOC_CHECK_=2 and see whether it still freezes

I successfully installed 10 times without freezes with MALLOC_CHECK_=2 specified as a boot parameter. At this point, it really looks like y2base is the source. Based on the experiences with llvm regression suites, I also suspect it's due to an uninitialised memory bug. I updated the bug title accordingly.
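
As a rough sanity check on how much ten clean runs say: if the freeze rate were still the roughly 1 in 5 seen earlier and the runs were independent (both are assumptions, not measurements), the chance of ten clean installations in a row would be (1 - 0.2)^10. The helper name `clean_run_probability` below is hypothetical.

```shell
#!/bin/sh
# Probability of n clean runs in a row when each run fails independently
# with probability p: (1 - p)^n.
# clean_run_probability is a hypothetical helper name used for illustration.
clean_run_probability() {
    LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

clean_run_probability 0.2 10   # prints 0.107
```

So under those assumptions ten clean runs would still happen by chance about 11% of the time, which makes the result a strong hint rather than proof.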
Comment 12 Mel Gorman 2015-05-20 10:16:13 UTC
Martin, any thoughts?
Comment 13 Martin Vidner 2015-05-21 09:08:26 UTC
I will test the installation with valgrind myself.
Comment 14 Martin Vidner 2015-05-22 14:15:15 UTC
The test is still running, and it has found some bugs but I guess they are pretty harmless. The Tumbleweed repo doesn't have that glibc patch though.
Comment 15 Mel Gorman 2015-05-22 14:33:31 UTC
(In reply to Martin Vidner from comment #14)
> The test is still running, and it has found some bugs but I guess they are
> pretty harmless. The Tumbleweed repo doesn't have that glibc patch though.

Anything resembling an uninitialised memory usage bug or a use-after-free bug could cause problems with the newer version of glibc. It's not in Tumbleweed because it was backed out due to the installer occasionally freezing. The devel project still has the updates though, and it builds cleanly against Factory:
https://build.opensuse.org/package/show/Base:System/glibc
Comment 16 Martin Vidner 2015-05-26 14:20:54 UTC
I have used https://github.com/openSUSE/mksusecd to make an installation ISO with the new glibc, but I still cannot reproduce the problem.

I have used kvm on x86_64, first with a single cpu, then with "-smp 2".
I have used MALLOC_CHECK_=3 and run y2base under valgrind. It has uncovered problems that I reported in bug 932306, but they all seem minor and not related to the UI thread.
Comment 17 Mel Gorman 2015-05-26 14:45:13 UTC
(In reply to Martin Vidner from comment #16)
> I have used https://github.com/openSUSE/mksusecd to make an installation ISO
> with the new glibc, but I still cannot reproduce the problem.
> 

Have you tried with the iso linked at https://openqa.opensuse.org/tests/60367/asset/3037? I was definitely able to stall that when installing under KVM. I was using a machine with 8 logical CPUs and the launch command

qemu-kvm \
         -cpu host \
         -hda disk.img \
         -drive file=3037.iso,media=cdrom \
         -net nic,model=rtl8139 -net user,hostname=installcheck \
         -m 1G \
         -monitor stdio \
         -name Installer \
         "$@"

It's not 100% reproducible. Only 1 in 5 installations failed.

> I have used kvm on x86_64, first with a single cpu, then with "-smp 2".
> I have used MALLOC_CHECK_=3 and run y2base under valgrind. It has uncovered
> problems that I reported in bug 932306, but they all seem minor and not
> related to the UI thread.

There is an outside possibility that this has since been fixed by accident. If you make the ISO you used available somewhere, I can try installing with it and see whether I can hit the problem.
Comment 18 Martin Vidner 2015-05-27 08:43:00 UTC
Thank you, Mel. But https://openqa.opensuse.org/tests/60367/asset/3037 seems to have expired. Do you have the image around?

I am testing with https://w3.suse.de/~mvidner/glibc-bsc929806.iso which I run as

qemu-kvm -m 4096 -smp 2 -cdrom ~/svn/mksusecd/glibc-bsc929806.iso scratch.qcow2 
with the boot option VALGRIND=1

The CD was made with:
./mksusecd --verbose \
  --create glibc-bsc929806.iso \
  --micro \
  --initrd ~/tmp/glibc-debuginfo-2.21-409.39.x86_64.rpm \
  --initrd ~/tmp/glibc-2.21-409.39.x86_64.rpm \
  --initrd ~/tmp/valgrind-3.10.1-2.1.x86_64.rpm \
  --initrd ~/tmp/yast2-core-3.1.17-2.1.x86_64.rpm \
  --initrd ~/tmp/yast2-core-debuginfo-3.1.17-2.1.x86_64.rpm \
  --initrd ~/dl/yast2-storage-debuginfo-3.1.55-1.3.x86_64.rpm \
  --initrd ~/dl/libstorage6-debuginfo-2.25.20-2.2.x86_64.rpm \
  --initrd ~/dl/libstorage6-2.25.20-2.2.x86_64.rpm \
  --initrd ~/dl/yast2-storage-3.1.55-1.3.x86_64.rpm \
  --initrd ~/dl/ruby2.2-debuginfo-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/ruby2.2-stdlib-debuginfo-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/libruby2_2-2_2-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/ruby2.2-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/ruby2.2-stdlib-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/libruby2_2-2_2-debuginfo-2.2.2-1.3.x86_64.rpm \
  --initrd ~/dl/libstorage-ruby-debuginfo-2.25.20-2.2.x86_64.rpm \
  --initrd ~/dl/libstorage-ruby-2.25.20-2.2.x86_64.rpm \
  --initrd ~/dl/gdb-7.9-2.1.x86_64.rpm \
  --initrd initrd \
  /dist/install/openSUSE-UNTESTED/openSUSE-Tumbleweed-NET-x86_64-Snapshot20150525-Media.iso

(master) mvidner@mrakoplas:mksusecd$ diff -u initrd/usr/lib/YaST2/startup/YaST2.call{.orig,}
--- initrd/usr/lib/YaST2/startup/YaST2.call.orig        2015-05-26 10:23:36.479179018 +0200
+++ initrd/usr/lib/YaST2/startup/YaST2.call     2015-05-27 10:07:59.813261210 +0200
@@ -307,8 +307,15 @@
        log "\tUI_ARGS:      $Y2_UI_ARGS"
        log "\tQT_IM_MODULE: $QT_IM_MODULE"
 
+        if [ "$VALGRIND" = 1 ]; then
+            VALGRIND="valgrind --leak-check=no --track-origins=yes \
+  --time-stamp=yes \
+  --main-stacksize=10000000 \
+  --log-file=/tmp/valgrind"
+        fi
+
        if [ "$Y2GDB" != "1" ]; then
-           $OPT_FBITERM y2base         \
+           $OPT_FBITERM $VALGRIND y2base               \
                "$Y2_MODULE_NAME"       \
                $Y2_MODE_FLAGS          \
                $Y2_MODULE_ARGS         \
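
Once y2base has run under valgrind with the patch above, the log can be scanned for the reports this bug is about. This is a minimal sketch: the helper name `count_uninitialised` is hypothetical, the log path is taken from the `--log-file` option in the patch, and the search string relies on memcheck's usual lowercase wording for these reports (for example "depends on uninitialised value(s)").

```shell
#!/bin/sh
# Count the lines in a valgrind log that mention "uninitialised".
# The match is case-sensitive, so the capitalised origin lines added by
# --track-origins=yes ("Uninitialised value was created by ...") are not
# counted a second time.
# count_uninitialised is a hypothetical helper name used for illustration.
count_uninitialised() {
    grep -c 'uninitialised' "$1"
}

log=/tmp/valgrind   # path from --log-file in the patch above
if [ -r "$log" ]; then
    echo "uninitialised-memory reports: $(count_uninitialised "$log")"
fi
```

A non-zero count only says that such reports exist; each one still has to be read to judge whether it could plausibly cause the freeze.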
Comment 19 Mel Gorman 2015-05-28 08:35:33 UTC
(In reply to Martin Vidner from comment #18)
> Thank you, Mel. But https://openqa.opensuse.org/tests/60367/asset/3037 seems
> to have expired. Do you have the image around?
> 

I have a copy locally but it could take a few days to complete an upload due to limited upstream bandwidth. Does anyone cc'd have a copy on a machine in an office that they could make available?

> I am testing with https://w3.suse.de/~mvidner/glibc-bsc929806.iso which I
> run as
> 
> qemu-kvm -m 4096 -smp 2 -cdrom ~/svn/mksusecd/glibc-bsc929806.iso
> scratch.qcow2 
> with the boot option VALGRIND=1
> 

I'm unable to reproduce the freeze with this ISO. Has anything changed in yast since about mid-April? It's possible it got accidentally fixed or worked around since the original glibc submission. Related to that, is the version of yast used the same as the one in Factory? If so then it might be appropriate to try resubmitting SR#295007. At worst, the same problem will recur but there will be a problematic ISO available.
Comment 20 Martin Vidner 2015-06-01 14:24:22 UTC
> I'm unable to reproduce the freeze with this ISO. Has anything changed in yast since about mid-April?

Actually, yes, we have fixed some GCC warnings:
https://github.com/yast/yast-core/pull/100
I *think* this should not change things related to uninitialized memory, but it seems best to retry the glibc submission. 
I am not sure how to do that since https://build.opensuse.org/request/show/295007 is marked as Accepted.

Mel, can you resubmit glibc and then resolve this as Works For Me please?
Comment 21 Dominique Leuenberger 2015-06-01 14:50:17 UTC
(In reply to Martin Vidner from comment #20)
> > I'm unable to reproduce the freeze with this ISO. Has anything changed in yast since about mid-April?
> 
> Actually, yes, we have fixed some GCC warnings:
> https://github.com/yast/yast-core/pull/100
> I *think* this should not change things related to uninitialized memory, but
> it seems best to retry the glibc submission. 
> I am not sure how to do that since
> https://build.opensuse.org/request/show/295007 is marked as Accepted.

As glibc was reverted post-accept, you will have to create a new submitrequest:
> osc sr Base:System glibc openSUSE:Factory -m "Let's retry to see what this brings"
Comment 22 Mel Gorman 2015-06-01 16:52:23 UTC
(In reply to Dominique Leuenberger from comment #21)
> (In reply to Martin Vidner from comment #20)
> > > I'm unable to reproduce the freeze with this ISO. Has anything changed in yast since about mid-April?
> > 
> > Actually, yes, we have fixed some GCC warnings:
> > https://github.com/yast/yast-core/pull/100
> > I *think* this should not change things related to uninitialized memory, but
> > it seems best to retry the glibc submission. 
> > I am not sure how to do that since
> > https://build.opensuse.org/request/show/295007 is marked as Accepted.
> 
> As glibc was reverted post-accept, you will have to create a new submitrequest:

As there have been no changes to the Base:System glibc project since, I went ahead and created a new request 309677. Thanks.
Comment 23 Mel Gorman 2015-06-01 16:56:44 UTC
(In reply to Martin Vidner from comment #20)

> Mel, can you resubmit glibc and then resolve this as Works For Me please?

It's resubmitted, but I will not close this as resolved until we see whether the ISO created for openQA testing reproduces the problem.
Comment 24 Martin Vidner 2015-06-09 10:51:27 UTC
Status update:
The new submission https://build.opensuse.org/request/show/309677 revealed a crash in mksquashfs.
Mel made a patch for that yesterday: https://sourceware.org/ml/libc-alpha/2015-06/msg00255.html which I don't see in our builds yet.
Comment 25 Mel Gorman 2015-06-09 11:22:11 UTC
(In reply to Martin Vidner from comment #24)
> Mel made a patch for that yesterday:
> https://sourceware.org/ml/libc-alpha/2015-06/msg00255.html which I don't see
> in our builds yet.


It's not included in the builds yet because I need upstream to review and merge it before I can add it to Base:System/glibc.
Comment 26 Mel Gorman 2015-06-16 13:00:43 UTC
glibc has now been updated in Factory and the installer was fine. Closing this bug now. Thanks Martin for all your help on this.