Bug 965564

Summary: make[2]: fork: Resource temporarily unavailable
Product: openSUSE Tumbleweed
Reporter: Dominique Leuenberger <dimstar>
Component: Basesystem
Assignee: systemd maintainers <systemd-maintainers>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: adrian.schroeter, asarai, dimstar, fbui, fvogt, jengelh, jslaby, lnussel, ro, stefan.fent, thomas.blume
Version: Current
Flags: mmarek: needinfo?
Hardware: Other
OS: Other
Attachments: systemd-obs-defaults-patch

Description Dominique Leuenberger 2016-02-08 08:27:30 UTC
Since the introduction of 'lamb' there is a high number of builds 'randomly' failing with the error message:

make[2]: fork: Resource temporarily unavailable

I have seen this on packages with as low memory usage as 600MB in their previous builds but also on larger packages...

The latest one seen this morning:

https://build.opensuse.org/package/live_build_log/openSUSE:Factory:Staging:J:DVD/kconfig/standard/x86_64

Build was running on:
2016-02-05 20:06:01 CET  kconfig                                            meta change      unchanged                4m  7s   cloud113:4      
2016-02-08 01:49:48 CET  kconfig                                            meta change      failed                   1m 15s   lamb09:5        
2016-02-08 08:20:31 CET  kconfig                                            new build        failed                   1m 15s   lamb06:5 

A retrigger, paired with some luck, allows the build to pass - but is of course frustrating.
Comment 1 Adrian Schröter 2016-02-08 09:07:48 UTC
I believe that it happens, but it is not clear to me where the problem is. First guess is of course that there is not enough memory for the instance, but the setup looks okay to me.

What makes lamb* special is the use of tmpfs, but this shouldn't become visible
inside the VM. I also see nothing suspicious in the kernel logs.

So, how can we tackle this? My first attempt would be to patch "make" to at least report errno. We could also add some code to dump the guest kernel messages (though they should already be visible in the log file), or simply read /proc/meminfo and write it out.

Rudi, do you have another idea here?
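The diagnostics proposed above could be sketched like this (a hedged sketch, not the actual OBS implementation; dmesg may be unavailable without privileges, in which case that section is simply empty):

```shell
#!/bin/sh
# Sketch of the pre-build diagnostics from comment 1: dump limits and
# memory state at the start of the build so a transient fork failure
# leaves evidence in the build log.
echo "=== ulimit -a ==="
ulimit -a
echo "=== /proc/meminfo ==="
cat /proc/meminfo
echo "=== recent kernel messages (may be empty without privileges) ==="
dmesg 2>/dev/null | tail -n 20
true
```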
Comment 2 Ruediger Oertel 2016-02-08 11:15:33 UTC
Can we print an "ulimit -a" at the start of the build section somehow?
It sounds like something might be overwriting the maximum number of processes
with a low value while too many processes are still running inside the VM.

Are we perhaps installing/configuring /etc/security/limits.conf inside the
distro?
Comment 3 Aleksa Sarai 2016-02-08 11:22:14 UTC
The error message "fork: Resource temporarily unavailable" tells me it's probably caused by one of the following:

1. RLIMIT_NPROC being set too low (however, rlimits apply to a process tree, so this seems unlikely to be a "random" occurrence).
2. A pids cgroup limit being set too low (similar to RLIMIT_NPROC, but it's based on cgroups, and it's also not likely to happen randomly).
3. A ulimit set too low (since ulimits apply on a per-user basis this seems likely to me).

AFAIK, memory issues don't cause fork to fail (the OOM killer would SIGKILL you, though).
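The three candidate limits above can be inspected from inside the VM with a few lines of shell (a sketch; the cgroup paths assume cgroup v1 with the "pids" controller mounted, and the loop is simply skipped where that is not the case):

```shell
#!/bin/sh
# Inspect the candidate limits from comment 3.
# 1./3. per-process and per-user process limit:
echo "RLIMIT_NPROC / ulimit -u: $(ulimit -u)"
# 2. pids cgroup limits, if the controller is mounted (cgroup v1 path):
for f in /sys/fs/cgroup/pids/pids.max /sys/fs/cgroup/pids/*/pids.max; do
    [ -e "$f" ] && echo "$f: $(cat "$f")"
done
true
```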
Comment 4 Dominique Leuenberger 2016-02-08 11:33:52 UTC
(In reply to Ruediger Oertel from comment #2)
> can we print an "ulimit -a" at the start of the build section somehow ?
> it sounds like something might overwrite the number of max processes
> to a low value and too many processes are still running inside the VM.
> 
> are we installing/configuring /etc/security/limits.conf inside the distro
> maybe ?

I created a local (local to the staging project) modification on 
openSUSE:Factory:Staging:C:DVD/kdesignerplugin

this package had the error before (after a short 2 minute build); can we somehow force this to end up on lamb?
Comment 5 Dominique Leuenberger 2016-02-08 11:39:58 UTC
ok, ended up quickly on lamb again:

[  132s] + echo JUST SOME OBS DEBUG
[  132s] JUST SOME OBS DEBUG
[  132s] + ulimit -a
[  132s] core file size          (blocks, -c) 0
[  132s] data seg size           (kbytes, -d) unlimited
[  132s] scheduling priority             (-e) 0
[  132s] file size               (blocks, -f) unlimited
[  132s] pending signals                 (-i) 31890
[  132s] max locked memory       (kbytes, -l) 64
[  132s] max memory size         (kbytes, -m) unlimited
[  132s] open files                      (-n) 1024
[  132s] pipe size            (512 bytes, -p) 8
[  132s] POSIX message queues     (bytes, -q) 819200
[  132s] real-time priority              (-r) 0
[  132s] stack size              (kbytes, -s) 8192
[  132s] cpu time               (seconds, -t) unlimited
[  132s] max user processes              (-u) 1200
[  132s] virtual memory          (kbytes, -v) unlimited
[  132s] file locks                      (-x) unlimited
Comment 6 Dominique Leuenberger 2016-02-08 11:46:33 UTC
For reference, the output while building on cloud125:

[  209s] + echo JUST SOME OBS DEBUG
[  209s] JUST SOME OBS DEBUG
[  209s] + ulimit -a
[  209s] core file size          (blocks, -c) 0
[  209s] data seg size           (kbytes, -d) unlimited
[  209s] scheduling priority             (-e) 0
[  209s] file size               (blocks, -f) unlimited
[  209s] pending signals                 (-i) 11704
[  209s] max locked memory       (kbytes, -l) 64
[  209s] max memory size         (kbytes, -m) unlimited
[  209s] open files                      (-n) 1024
[  209s] pipe size            (512 bytes, -p) 8
[  209s] POSIX message queues     (bytes, -q) 819200
[  209s] real-time priority              (-r) 0
[  209s] stack size              (kbytes, -s) 8192
[  209s] cpu time               (seconds, -t) unlimited
[  209s] max user processes              (-u) 1200
[  209s] virtual memory          (kbytes, -v) unlimited
[  209s] file locks                      (-x) unlimited

So the output looks rather similar; the only difference is:

lamb:  pending signals             (-i) 31890
cloud: pending signals             (-i) 11704
Comment 7 Dominique Leuenberger 2016-02-08 11:47:17 UTC
2016-02-08 09:45:54 CET  kdesignerplugin                                    new build        failed                   2m 32s   lamb03:1        
2016-02-08 11:34:57 CET  kdesignerplugin                                    source change    succeeded                2m 17s   build80:1       
2016-02-08 11:38:51 CET  kdesignerplugin                                    new build        failed                   2m 27s   lamb19:7        
2016-02-08 11:45:44 CET  kdesignerplugin                                    new build        succeeded                5m  9s   cloud125:4
Comment 8 Adrian Schröter 2016-02-08 15:15:21 UTC
It looks like systemd applies the pids limit. We deliberately do not use systemd during build, but it now gets used as part of the initrd from the kernel-obs-build package.

So either the initrd should not use systemd, or systemd should not apply these new limits. When this happens during build, the same errors can IMHO also happen at runtime.

In any case, either an initrd or systemd issue.
Comment 10 Ludwig Nussel 2016-02-09 08:01:37 UTC
systemd sets up /sys/fs/cgroup/pids/init.scope specifically for pid 1. So it could be argued that the bug is that systemd doesn't fully clean up after itself when switching to the real root, assuming pid 1 there is also systemd.
Comment 11 Aleksa Sarai 2016-02-09 08:36:01 UTC
(In reply to Ludwig Nussel from comment #10)
> systemd sets up /sys/fs/cgroup/pids/init.scope specifically for pid 1. So it
> could be argued that the bug is that systemd doesn't fully clean up after
> itself when switching to the real root, assuming pid 1 there is also systemd.

I'm not sure you'll be able to convince the systemd guys that this is a bug. However, here's a simple workaround to stick at the start of the build scripts:

% echo $$ > /sys/fs/cgroup/pids/cgroup.procs

Since the root cgroup doesn't allow for pids limits, attaching to the root cgroup should solve your problems. We could also increase the ulimits here if appropriate.
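A slightly hardened version of that workaround (a sketch; the pids controller mount point is assumed to be /sys/fs/cgroup/pids, it needs root, and it is a no-op where the controller is absent):

```shell
#!/bin/sh
# Move the current shell into the root pids cgroup, which carries no
# pids limit, so all child processes can fork freely. Does nothing if
# the cgroup v1 pids controller is not mounted or not writable.
PIDS_CG=/sys/fs/cgroup/pids
if [ -w "$PIDS_CG/cgroup.procs" ]; then
    echo $$ > "$PIDS_CG/cgroup.procs"
fi
```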
Comment 12 Jan Engelhardt 2016-02-09 08:50:47 UTC
I suppose the problem will be aggravated for worker hosts which use the "chroot" type configuration, because there all worker jobs sit in the same cgroup, since obsworker is one huge service rather than a service template :-(
Comment 13 Thomas Blume 2016-02-09 14:55:20 UTC
(In reply to Ludwig Nussel from comment #9)
> The upstream discussion about that feature is here:
> https://lists.freedesktop.org/archives/systemd-devel/2015-November/035006.html
> 
> Commits:
> https://github.com/systemd/systemd/commit/0af20ea2ee2af2bcf2258e7a8e1a13181a6a75d6
> https://github.com/systemd/systemd/commit/9ded9cd14cc03c67291b10a5c42ce5094ba0912f

systemd 228 has these defaults (/etc/systemd/system.conf):

#DefaultTasksMax=512

I guess this is the bottleneck.
Can't you just set it to a higher value for the build machines?
Comment 14 Adrian Schröter 2016-02-09 15:05:03 UTC
The build machines just provide the VM.

You need to fix that in the distribution, so for example configuring the right defaults in the initrd of the  kernel-obs-build package.

(However, personal comment: I wonder if this leads to similar errors as with btrfs when compiling locally and getting random errors ... I have already lost days to that :/)
Comment 15 Ludwig Nussel 2016-02-09 16:24:49 UTC
(In reply to Thomas Blume from comment #13)
> systemd 228 has these defaults (/etc/systemd/system.conf):
> 
> #DefaultTasksMax=512
> 
> I guess this is the bottleneck.
> Can't you just set it to a higher value for the build machines?

Still feels like a workaround. Shouldn't systemd clean up after itself and undo the changes it did to cgroups?
Comment 17 Thomas Blume 2016-02-10 07:08:01 UTC
(In reply to Ludwig Nussel from comment #15)
> (In reply to Thomas Blume from comment #13)
> > systemd 228 has these defaults (/etc/systemd/system.conf):
> > 
> > #DefaultTasksMax=512
> > 
> > I guess this is the bottleneck.
> > Can't you just set it to a higher value for the build machines?
> 
> Still feels like a workaround. Shouldn't systemd clean up after itself and
> undo the changes it did to cgroups?

Hm, systemd only resets RLIMIT_NOFILE on re-execution:

-->--
                /* Reset the RLIMIT_NOFILE to the kernel default, so
                 * that the new systemd can pass the kernel default to
                 * its child processes */
                if (saved_rlimit_nofile.rlim_cur > 0)
                        (void) setrlimit(RLIMIT_NOFILE, &saved_rlimit_nofile);
--<--

Maybe it should also reset RLIMIT_NPROC?

But I'm unsure whether this would have an effect on the reported behaviour, unless:

DefaultTasksAccounting=no

is set in system.conf. See the systemd.resource-control manpage for details.
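For reference, the knobs discussed here live in /etc/systemd/system.conf and would be set like this (values are illustrative, not a recommendation):

```ini
# /etc/systemd/system.conf (illustrative values)
[Manager]
DefaultTasksAccounting=no
DefaultTasksMax=infinity
```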
Comment 18 Dominique Leuenberger 2016-02-10 08:14:10 UTC
(In reply to Thomas Blume from comment #17)

> Maybe it should also reset RLIMIT_NPROC?
> 
> But I'm unsure wheter this would have an effect on the reported behaviour,
> unless:

Probably it won't, as the qemu / build VM is started with:

init=/.build/build (so systemd is not even re-executed)

So in this case any remnants of systemd's limits are just confusing, as systemd is not PID 1.
Comment 19 Thomas Blume 2016-02-10 09:01:00 UTC
(In reply to Dominique Leuenberger from comment #18)
> (In reply to Thomas Blume from comment #17)
> 
> > Maybe it should also reset RLIMIT_NPROC?
> > 
> > But I'm unsure wheter this would have an effect on the reported behaviour,
> > unless:
> 
> Probably it won't as the qemu / build VM is started with:
> 
> init=/.build/build (so systemd is not even re-executed)
> 
> So in this case any remainings of systemd's limits is just confusing, as
> systemd is not pid1
 
dracut in hostonly mode copies /etc/systemd/system.conf into the initrd. 
So a solution would be to provide an adapted system.conf before the initrd is built.
Comment 20 Thomas Blume 2016-02-10 09:03:46 UTC
(In reply to Thomas Blume from comment #19)
> dracut in hostonly mode copies /etc/systemd/system.conf into the initrd. 
> So a solution would be to provide an adapted system.conf before the initrd
> is built.

To be more precise, this is rather a workaround.
I agree that systemd should do a proper cleanup when it gets shut down.
Comment 21 Dr. Werner Fink 2016-02-10 09:12:57 UTC
(In reply to Dominique Leuenberger from comment #18)

> So in this case any remainings of systemd's limits is just confusing, as
> systemd is not pid1

systemd, like any other init program, is designed to run as PID 1. Otherwise it cannot reap the zombies of dead daemon processes.
Comment 22 Dominique Leuenberger 2016-02-10 09:25:47 UTC
(In reply to Dr. Werner Fink from comment #21)
> (In reply to Dominique Leuenberger from comment #18)
> 
> > So in this case any remainings of systemd's limits is just confusing, as
> > systemd is not pid1
> 
> The design of systemd and any other init program is that it has pid 1. 
> Otherwise it can not wipe out any zombi of a died daemon process.

Right, and the init program in the build bot is called /.build/build, NOT systemd. systemd just wrongly survives, having been spawned from the initrd already, and sets up limits which do not apply on the actual system. THAT's the issue, and that's what we claim systemd should clean up.

From within an OBS worker:
[  176s]     1 ttyS0    Ss+    0:01 /bin/bash /.build/build
Comment 23 Franck Bui 2016-02-10 09:39:57 UTC
(In reply to Dominique Leuenberger from comment #22)
> (In reply to Dr. Werner Fink from comment #21)
> > (In reply to Dominique Leuenberger from comment #18)
> > 
> > > So in this case any remainings of systemd's limits is just confusing, as
> > > systemd is not pid1
> > 
> > The design of systemd and any other init program is that it has pid 1. 
> > Otherwise it can not wipe out any zombi of a died daemon process.
> 
> right.. and the init program in the build bot is called /.build/build - NOT
> systemd. Systemd just wrongly survives as being spawned out of initrd
> already and setting up limits which are not on the actual system. THAT's the
> issue and that's what we claim systemd should cleanup
> 

Then don't use systemd at all.

PID 1 is not supposed to be started and then replaced by another init system later.

Also, I would suggest teaching your init system to do some basic initialisations when starting, instead of relying entirely on an undefined state.
Comment 24 Dr. Werner Fink 2016-02-10 09:48:15 UTC
(In reply to Dominique Leuenberger from comment #22)

Hmm ... IMHO not using systemd (or sysvinit) as PID 1 is the fault. If the bash script /.build/build is PID 1, then it has to clean up the zombies as well as perform the other minimal jobs of an init program. Using systemd in parallel to another init program smells like a dirty hack. IMHO the /.build/build script should become a real service unit, with systemd running as PID 1, and the configuration of the new limit features should be adapted to the needs of the build system.
Comment 25 Ludwig Nussel 2016-02-10 11:12:56 UTC
I'm not sure the discussion is going into the right direction here.

systemd sits in the initrd, so no matter what becomes pid 1 in the target system, there will always be a systemd first. Now unfortunately that systemd in initrd sets up some cgroups there already that have consequences for the target system later. That means no matter whether the target system uses sysvinit, bash or busybox those cgroups are still there.
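This is easy to verify from inside the target system (a quick inspection sketch; paths assume cgroup v1, and the second command is simply silent where no pids limits survive):

```shell
#!/bin/sh
# Show which cgroups PID 1 ended up in after the switch from the initrd,
# and any surviving pids limits set up by the initrd's systemd.
cat /proc/1/cgroup
cat /sys/fs/cgroup/pids/*/pids.max 2>/dev/null
true
```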
Comment 26 Ludwig Nussel 2016-02-10 11:14:04 UTC
(In reply to Dominique Leuenberger from comment #22)
> Systemd just wrongly survives as being spawned out of initrd

I'm not sure that's accurate, what we know is that the cgroups survive.
Comment 27 Dr. Werner Fink 2016-02-10 13:06:10 UTC
(In reply to Ludwig Nussel from comment #26)

As already mentioned, if you use systemd you might configure cgroups and limits according to your needs, e.g. Delegate=no and TasksMax=infinity in /usr/lib/systemd/system/systemd-nspawn@.service.

From NEWS of current git repository:

        * There's a new system.conf setting DefaultTasksMax= to
          control the default TasksMax= setting for services and
          scopes running on the system. (TasksMax= is the primary
          setting that exposes the "pids" cgroup controller on systemd
          and was introduced in the previous systemd release.) The
          setting now defaults to 512, which means services that are
          not explicitly configured otherwise will only be able to
          create 512 processes or threads at maximum, from this
          version on. Note that this means that thread- or
          process-heavy services might need to be reconfigured to set
          TasksMax= to a higher value. It is sufficient to set
          TasksMax= in these specific unit files to a higher value, or
          even "infinity". Similar, there's now a logind.conf setting
          UserTasksMax= that defaults to 4096 and limits the total
          number of processes or tasks each user may own
          concurrently. nspawn containers also have the TasksMax=
          value set by default now, to 8192. Note that all of this
          only has an effect if the "pids" cgroup controller is
          enabled in the kernel. The general benefit of these changes
          should be a more robust and safer system, that provides a
          certain amount of per-service fork() bomb protection.
Comment 28 Thomas Blume 2016-02-11 08:43:48 UTC
Created attachment 665194 [details]
systemd-obs-defaults-patch

man systemd.resource-control shows:

-->--
TasksAccounting=

Turn on task accounting for this unit. Takes a boolean argument. If enabled, the system manager will keep track of the number of tasks in the unit. The number of tasks accounted this way includes both kernel threads and userspace processes, with each thread counting individually. Note that turning on tasks accounting for one unit will also implicitly turn it on for all units contained in the same slice and for all its parent slices and the units contained therein. The system default for this setting may be controlled with DefaultTasksAccounting= in systemd-system.conf(5).
--<--

Please try attached patch for the kernel-obs-build package.
Comment 29 Dominique Leuenberger 2016-02-11 08:59:17 UTC
(In reply to Thomas Blume from comment #28)
> Created attachment 665194 [details]
> systemd-obs-defaults-patch

I tested a similar hack/workaround yesterday and it does indeed work. This is currently work in progress by the kernel team to get it integrated into the kernel-source package.
Comment 30 Dr. Werner Fink 2016-02-11 09:44:37 UTC
(In reply to Dominique Leuenberger from comment #29)

This is neither a hack nor a workaround; it is simply the configuration for Open Build Service VM builds.

Besides this, another option would be a dedicated module script below

   /usr/lib/dracut/modules.d/99obs/

allowing OBS to change the configuration within the initrd for its own needs.
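A minimal sketch of what such a module could look like (paths and contents are illustrative, not an actual OBS module; run as a plain script it only stages the files under /tmp/99obs for inspection, following the standard dracut module-setup.sh layout):

```shell
#!/bin/sh
# Stage a minimal dracut module; MODDIR stands in for
# /usr/lib/dracut/modules.d/99obs/ so a dry-run touches nothing system-wide.
MODDIR="${MODDIR:-/tmp/99obs}"
mkdir -p "$MODDIR"
cat > "$MODDIR/module-setup.sh" <<'EOF'
#!/bin/bash
check() { return 0; }
depends() { return 0; }
install() {
    # Ship a system.conf with relaxed task limits inside the initrd only,
    # leaving the host configuration untouched; inst_simple is provided
    # by dracut at initrd build time.
    inst_simple "${moddir}/system.conf" /etc/systemd/system.conf
}
EOF
cat > "$MODDIR/system.conf" <<'EOF'
[Manager]
DefaultTasksMax=infinity
DefaultTasksAccounting=no
EOF
echo "module staged in $MODDIR"
```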
Comment 31 Adrian Schröter 2016-02-11 09:56:21 UTC
Werner, you are right it is a configuration thing.

But why have this only inside OBS builds? As a developer you can run into exactly the same thing on your workstation.

And I can tell you, random failures in large builds can drive you nuts, especially since fork and make failures are almost never handled well out there. Often you even get a result, it is just broken :/

(I had this with random btrfs disk-full errors on a 20% filled disk. It took me two days to learn that the 1 GB of source code was not buildable due to the file system. IMHO this is almost the same situation here, where you will sometimes run into this limit. An opt-in makes IMHO way more sense, so the user is at least aware of the existence of such limits.)
Comment 32 Michal Marek 2016-02-11 10:12:19 UTC
Hm, I just pulled this:

commit 39b708bbcea0079de363746a4d3323e7d3016e67
Author: Jiri Slaby <jslaby@suse.cz>
Date:   Thu Feb 11 09:46:23 2016 +0100

    rpm/kernel-obs-build.spec.in: do not limit TasksMax
    
    We run with build as PID 1 (boo#965564).

diff --git a/rpm/kernel-obs-build.spec.in b/rpm/kernel-obs-build.spec.in
index 897f496be7ec..7ae2588749e2 100644
--- a/rpm/kernel-obs-build.spec.in
+++ b/rpm/kernel-obs-build.spec.in
@@ -99,6 +99,10 @@ info "  binfmt misc..."
 modprobe binfmt_misc
 EOF
 chmod a+rx /usr/lib/dracut/modules.d/80obs/setup_obs.sh
+# Configure systemd in kernel-obs-build's initrd not to limit TasksMax,
+# we run with build as PID 1 (boo#965564)
+echo "DefaultTasksMax=infinity" >> /etc/systemd/system.conf
+echo "DefaultTasksAccounting=no" >> /etc/systemd/system.conf

This comes from https://build.opensuse.org/request/show/358722 and looks equivalent to the approach suggested by Thomas. Also, which distributions besides Factory need this?
Comment 33 Dominique Leuenberger 2016-02-11 10:16:51 UTC
(In reply to Michal Marek from comment #32)
> Hm, I just pulled this:
> 
> commit 39b708bbcea0079de363746a4d3323e7d3016e67
> Author: Jiri Slaby <jslaby@suse.cz>
> Date:   Thu Feb 11 09:46:23 2016 +0100
> 
>     rpm/kernel-obs-build.spec.in: do not limit TasksMax
>     

Indeed, this is the same approach; even the solution is not that different.

This will be needed in everything with a chance to get systemd >= 226, so currently that's Tumbleweed and, from what I gathered, SLE12SP2 should also receive the updated systemd version.
Comment 34 Dr. Werner Fink 2016-02-11 10:22:45 UTC
(In reply to Adrian Schröter from comment #31)

Hmmm ... if you buy a modern new car you get ABS, ESP, and some more features ... right? But if you are (or want to be) a racing driver, you might consider disabling ABS together with ESP, so you can drift through sharp turns and use the brakes to bring the wheels just short of locking up ...

If you are a developer in front of a modern workstation, you might consider how to configure the system up front ... right?

Indeed I'm thinking about one or more default configurations for systemd for several setups ... similar to /etc/permissions.paranoid, /etc/permissions.secure, and /etc/permissions.easy ;)
Comment 39 Franck Bui 2017-05-24 13:52:03 UTC
Ok, assuming this bug is fixed, closing.