Bug 1126703 (CVE-2018-20784) - VUL-0: CVE-2018-20784: kernel-source: kernel/sched/fair.c mishandles leaf cfs_rq's, which allows attackers to cause a denial of service (infinite loop in update_blocked_averages) or possibly have unspecifi
Summary: VUL-0: CVE-2018-20784: kernel-source: kernel/sched/fair.c mishandles leaf cfs...
Status: RESOLVED FIXED
Alias: CVE-2018-20784
Product: SUSE Security Incidents
Classification: Novell Products
Component: Incidents
Version: unspecified
Hardware: Other Other
Priority: P3 - Medium, Severity: Major
Target Milestone: ---
Assignee: Security Team bot
QA Contact: Security Team bot
URL: https://smash.suse.de/issue/225243/
Whiteboard: CVSSv3.1:SUSE:CVE-2018-20784:5.9:(AV:...
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-25 06:14 UTC by Marcus Meissner
Modified: 2024-04-19 08:36 UTC

See Also:
Found By: Security Response Team
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Marcus Meissner 2019-02-25 06:14:30 UTC
CVE-2018-20784

In the Linux kernel before 4.20.2, kernel/sched/fair.c mishandles leaf cfs_rq's,
which allows attackers to cause a denial of service (infinite loop in
update_blocked_averages) or possibly have unspecified other impact by inducing a
high load.

References:
http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-20784
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c40f7d74c741a907cfaeb73a7697081881c497d0
https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.20.2
https://github.com/torvalds/linux/commit/c40f7d74c741a907cfaeb73a7697081881c497d0
Comment 1 Marcus Meissner 2019-02-25 06:16:50 UTC
Going by the Fixes tag, this currently affects 4.13 or later; please check.
Comment 2 Michal Hocko 2019-02-25 08:52:48 UTC
Frederic, could you have a look please?
Comment 8 Frederic Weisbecker 2022-03-22 10:21:12 UTC
I just checked every tree, and only the cve/linux-4.4-based trees (SLE12-SP3-TD, SLE12-SP3-LTSS, SLE12-SP2-LTSS) are affected. I'm cooking the backport for cve/linux-4.4.
Comment 9 Gabriele Sonnu 2022-04-08 14:26:49 UTC
@Frederic: any update on this?
Comment 10 Frederic Weisbecker 2022-04-14 16:36:00 UTC
(In reply to Gabriele Sonnu from comment #9)
> @Frederic: any update on this?

Pushed to cve/linux-4.4
Comment 11 Michal Hocko 2022-04-20 12:32:57 UTC
Let's bounce back to the security team
Comment 12 Gabriele Sonnu 2022-05-09 12:47:29 UTC
Done.
Comment 16 Swamp Workflow Management 2022-06-14 22:17:40 UTC
SUSE-SU-2022:2077-1: An update that solves 29 vulnerabilities and has two fixes is now available.

Category: security (important)
Bug References: 1055710,1065729,1084513,1087082,1126703,1158266,1173265,1182171,1183646,1183723,1187055,1191647,1196426,1197343,1198031,1198032,1198516,1198577,1198660,1198687,1198742,1199012,1199063,1199426,1199505,1199507,1199605,1199650,1200143,1200144,1200249
CVE References: CVE-2017-13695,CVE-2018-20784,CVE-2018-7755,CVE-2019-19377,CVE-2020-10769,CVE-2021-20292,CVE-2021-20321,CVE-2021-28688,CVE-2021-33061,CVE-2021-38208,CVE-2022-1011,CVE-2022-1184,CVE-2022-1353,CVE-2022-1419,CVE-2022-1516,CVE-2022-1652,CVE-2022-1729,CVE-2022-1734,CVE-2022-1974,CVE-2022-1975,CVE-2022-21123,CVE-2022-21125,CVE-2022-21127,CVE-2022-21166,CVE-2022-21180,CVE-2022-21499,CVE-2022-28388,CVE-2022-28390,CVE-2022-30594
JIRA References: 
Sources used:
SUSE Linux Enterprise Server 12-SP2-BCL (src):    kernel-default-4.4.121-92.175.2, kernel-source-4.4.121-92.175.2, kernel-syms-4.4.121-92.175.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 17 Swamp Workflow Management 2022-06-14 22:25:35 UTC
SUSE-SU-2022:2082-1: An update that solves 29 vulnerabilities and has 8 fixes is now available.

Category: security (important)
Bug References: 1051510,1055710,1065729,1084513,1087082,1126703,1158266,1173265,1182171,1183646,1183723,1187055,1191647,1195651,1196426,1197343,1198031,1198032,1198516,1198577,1198660,1198687,1198742,1198962,1198997,1199012,1199063,1199314,1199426,1199505,1199507,1199605,1199650,1199785,1200143,1200144,1200249
CVE References: CVE-2017-13695,CVE-2018-20784,CVE-2018-7755,CVE-2019-19377,CVE-2020-10769,CVE-2021-20292,CVE-2021-20321,CVE-2021-28688,CVE-2021-33061,CVE-2021-38208,CVE-2022-1011,CVE-2022-1184,CVE-2022-1353,CVE-2022-1419,CVE-2022-1516,CVE-2022-1652,CVE-2022-1729,CVE-2022-1734,CVE-2022-1974,CVE-2022-1975,CVE-2022-21123,CVE-2022-21125,CVE-2022-21127,CVE-2022-21166,CVE-2022-21180,CVE-2022-21499,CVE-2022-28388,CVE-2022-28390,CVE-2022-30594
JIRA References: 
Sources used:
SUSE OpenStack Cloud Crowbar 8 (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2, kgraft-patch-SLE12-SP3_Update_45-1-4.3.2
SUSE OpenStack Cloud 8 (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2, kgraft-patch-SLE12-SP3_Update_45-1-4.3.2
SUSE Linux Enterprise Server for SAP 12-SP3 (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2, kgraft-patch-SLE12-SP3_Update_45-1-4.3.2
SUSE Linux Enterprise Server 12-SP3-LTSS (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2, kgraft-patch-SLE12-SP3_Update_45-1-4.3.2
SUSE Linux Enterprise Server 12-SP3-BCL (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2
SUSE Linux Enterprise High Availability 12-SP3 (src):    kernel-default-4.4.180-94.164.3
HPE Helion Openstack 8 (src):    kernel-default-4.4.180-94.164.3, kernel-source-4.4.180-94.164.2, kernel-syms-4.4.180-94.164.2, kgraft-patch-SLE12-SP3_Update_45-1-4.3.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 18 Michal Hocko 2023-05-31 09:08:13 UTC
(In reply to Frederic Weisbecker from comment #10)
> (In reply to Gabriele Sonnu from comment #9)
> > @Frederic: any update on this?
> 
> Pushed to cve/linux-4.4

Frederic, we've had a performance regression reported for SLE12-SP3-TD (bug 1210904), and it has turned out that a fix-up is far from trivial and very error-prone, as proven by bug 1211747. Could you please re-evaluate the actual risk vs. benefit of this fix for 4.4-based codestreams?
Comment 19 Frederic Weisbecker 2023-06-09 11:06:19 UTC
(In reply to Michal Hocko from comment #18)
> (In reply to Frederic Weisbecker from comment #10)
> > (In reply to Gabriele Sonnu from comment #9)
> > > @Frederic: any update on this?
> > 
> > Pushed to cve/linux-4.4
> 
> Frederic, we've had a performance regression reported for SLE12-SP3-TD (bug
> 1210904), and it has turned out that a fix-up is far from trivial and very
> error-prone, as proven by bug 1211747. Could you please re-evaluate the actual
> risk vs. benefit of this fix for 4.4-based codestreams?

So now that we have reverted all the problematic patches in https://bugzilla.suse.com/show_bug.cgi?id=1211747,
the initial issue fixed by the following remains:

        c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b9c")

As mentioned in this changelog and the related discussion here:

        https://lore.kernel.org/all/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com/T/#u

the issue can be triggered when a high load is involved with cgroups.
It's probably not easy to trigger in practice because it took more than
one year to ever fire. It's still reproducible though and the symptoms
can be serious.

That weighs in favour of a full backport.

On the other side of the balance, the full backport is very large, invasive
and error-prone. The performance regression reported by the customer
after our first try is one such example. We may be able to fix it by
tracking down potentially missing changes, but that may introduce even
further issues.

So it's hard to find the right decision to take. If we take the direction
of a full backport, we must ensure that QA on either side performs intensive
testing to track down further issues. But even then it's not guaranteed
that something won't fall through the cracks.
Comment 20 Michal Hocko 2023-06-09 12:13:04 UTC
(In reply to Frederic Weisbecker from comment #19)
[...]
> So it's hard to find the right decision to take. If we take the direction
> of a full backport, we must ensure that QA on either side performs intensive
> testing to track down further issues. But even then it's not guaranteed
> that something won't fall through the cracks.

One thing that might help the evaluation is how serious a fallout we can expect should this ever hit. Would that be something the hard/soft lockup detector can detect? Would a special cgroup configuration be needed, or can a plain heavy load (many tasks?) trigger this? If the latter, is this so much worse than a good old fork bomb?
Comment 21 Frederic Weisbecker 2023-06-14 09:08:46 UTC
(In reply to Michal Hocko from comment #20)
> (In reply to Frederic Weisbecker from comment #19)
> [...]
> > So it's hard to find the right decision to take. If we take the direction
> > of a full backport, we must ensure that QA on either side performs intensive
> > testing to track down further issues. But even then it's not guaranteed
> > that something won't fall through the cracks.
> 
> One thing that might help the evaluation is how serious a fallout we can
> expect should this ever hit. Would that be something the hard/soft lockup
> detector can detect?

Likely detectable with the hardlockup detector given that this function is called with IRQs disabled.

> Would a special cgroup configuration be needed, or can a plain heavy load
> (many tasks?) trigger this? If the latter, is this so much worse than a
> good old fork bomb?

It rather seems to involve a topology of several nodes. Vincent's example describes a 3-level configuration. But the issue might also happen with a simpler setup. I can't really tell for sure since I have a very limited understanding of this subsystem.
Comment 24 Michal Hocko 2023-06-15 10:52:16 UTC
Our internal evaluation shows that the CVE can be triggered, but the fix itself causes more problems than it solves. Triggering the issue shouldn't provide any way to gain privileges, and it seems only to allow a DoS of the system under very specific circumstances (CPU throttling enabled with many cgroups, a lot of them idle). Such setups already suffer from CPU scheduling problems, and CPU throttling will be unpredictable at best. In the worst case we expect the hard/soft lockup detector to complain.

To hit this bug, an adversary would need fairly unconstrained execution capabilities, and in that case there are many other ways to DoS the system, so being particularly concerned about this one seems far-fetched.

All that being said, considering that the fix for this CVE causes performance overhead that is not limited to workloads which hit the problem, and that the fixes required to address all the fallout are too risky, we have concluded that they are likely more harmful than the underlying problem.
Comment 28 Maintenance Automation 2023-07-11 08:36:56 UTC
SUSE-SU-2023:2805-1: An update that solves 38 vulnerabilities and has four fixes can now be installed.

Category: security (important)
Bug References: 1126703, 1204405, 1205756, 1205758, 1205760, 1205762, 1205803, 1206878, 1207036, 1207125, 1207168, 1207795, 1208600, 1208777, 1208837, 1209008, 1209039, 1209052, 1209256, 1209287, 1209289, 1209291, 1209532, 1209549, 1209687, 1209871, 1210329, 1210336, 1210337, 1210498, 1210506, 1210647, 1210715, 1210940, 1211105, 1211186, 1211449, 1212128, 1212129, 1212154, 1212501, 1212842
CVE References: CVE-2017-5753, CVE-2018-20784, CVE-2022-3566, CVE-2022-45884, CVE-2022-45885, CVE-2022-45886, CVE-2022-45887, CVE-2022-45919, CVE-2023-0590, CVE-2023-1077, CVE-2023-1095, CVE-2023-1118, CVE-2023-1249, CVE-2023-1380, CVE-2023-1390, CVE-2023-1513, CVE-2023-1611, CVE-2023-1670, CVE-2023-1989, CVE-2023-1990, CVE-2023-1998, CVE-2023-2124, CVE-2023-2162, CVE-2023-2194, CVE-2023-23454, CVE-2023-23455, CVE-2023-2513, CVE-2023-28328, CVE-2023-28464, CVE-2023-28772, CVE-2023-30772, CVE-2023-3090, CVE-2023-3141, CVE-2023-31436, CVE-2023-3159, CVE-2023-3161, CVE-2023-32269, CVE-2023-35824
Sources used:
SUSE Linux Enterprise Server 12 SP2 BCL 12-SP2 (src): kernel-syms-4.4.121-92.205.1, kernel-source-4.4.121-92.205.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 29 Michal Koutný 2023-07-26 11:25:09 UTC
Let me add one more update on this CVE and our 4.4-based kernels.
We are sticking with _not having_ the patch for this CVE there, because those
kernels are not affected by the scenario (and because of the performance drop
and heavy fixups the patch would generate).

TL;DR: A kernel must contain commit 9c2791f936ef ("sched/fair: Fix hierarchical
order in rq->leaf_cfs_rq_list") to be exposed to the infinite-loop risk.

The argument why 4.4-based kernels are not affected by CVE-2018-20784 is as follows:
- The original report [1] points to an infinite loop caused by corrupted
  rq->leaf_cfs_rq_list.
- In the kernels under discussion, the list is iterated with
  for_each_leaf_cfs_rq_safe() in update_blocked_averages() (UBA() below)
  while holding rq->lock [2].
- The list is mutated with list_add_tail_rcu(), list_add_rcu() and
  list_del_rcu() (with a9e7f6544b9c) where these functions are used in a
  standard way by passing the list's head and the node operated upon -- always
  under rq->lock (UBA(), enqueue_entity()). (*)
- The list is initialized empty (sched_init()) early (no concurrent access yet).
- Relying on:
  a) the correct synchronization and
  b) list mutation primitives preserving list's validity,
  we can tell that the rq->leaf_cfs_rq_list will always be a valid structure
  whose iteration is bound by the number of entries (not infinite).
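
The argument in the list above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct node`, `add_after()`, `add_before()`, `unlink_node()` and `walk_len()` are hypothetical stand-ins for the kernel's `struct list_head`, `list_add_rcu()`, `list_add_tail_rcu()`, `list_del_rcu()` and the list iteration; they ignore RCU and locking entirely. The sketch shows that when every mutation goes through the list head or a node that is currently on the list, a forward walk returns to the head after as many steps as there are entries.

```c
/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct node {
    struct node *next, *prev;
};

static void list_init(struct node *h)
{
    h->next = h;
    h->prev = h;
}

/* Link n between two adjacent nodes, in the spirit of __list_add(). */
static void link_between(struct node *n, struct node *prev, struct node *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* Insert n right after head (list_add() semantics). */
static void add_after(struct node *n, struct node *head)
{
    link_between(n, head, head->next);
}

/* Insert n right before head, i.e. at the tail (list_add_tail() semantics). */
static void add_before(struct node *n, struct node *head)
{
    link_between(n, head->prev, head);
}

/* Unlink a node that is currently on the list. */
static void unlink_node(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Count entries; return -1 if head is not reached within limit steps. */
static int walk_len(const struct node *head, int limit)
{
    int steps = 0;
    const struct node *p;

    for (p = head->next; p != head; p = p->next)
        if (++steps > limit)
            return -1;
    return steps;
}

/* Only the head and on-list nodes are ever passed to the primitives. */
int valid_list_demo(void)
{
    struct node head, a, b, c;

    list_init(&head);
    add_before(&a, &head);   /* head -> a */
    add_before(&b, &head);   /* head -> a -> b */
    add_after(&c, &head);    /* head -> c -> a -> b */
    unlink_node(&a);         /* head -> c -> b */
    add_before(&a, &head);   /* head -> c -> b -> a */
    unlink_node(&c);         /* head -> b -> a */

    return walk_len(&head, 1000);
}
```

Calling `valid_list_demo()` from a small `main()` returns the number of entries left on the list; the walk terminates because the circular structure is preserved by each step.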

Commit 9c2791f936ef ("sched/fair: Fix hierarchical order in
rq->leaf_cfs_rq_list") changes the code in such a way that a reference to the
middle of the list (or even to a removed element) is taken and then passed
to the list mutation primitives; statement (*) then no longer holds.
Commit 9c2791f936ef is necessary (but not sufficient) for the
rq->leaf_cfs_rq_list corruption that gives the list a cyclic structure and may
cause infinite looping.
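
To illustrate the general hazard described here, the following userspace C sketch reuses a stale anchor after it has left the list. It is not the actual kernel sequence: all names (`struct node`, `add_before()`, `unlink_node()`, `walk_len()`, `stale_anchor_demo()`) are hypothetical, and `unlink_node()` leaves the removed node's own pointers untouched, whereas the kernel's `list_del_rcu()` poisons `->prev`. The point is only that inserting relative to a node that is no longer on the list can produce a cycle that never passes through the head.

```c
/* Userspace stand-in for the kernel's struct list_head (illustration only). */
struct node {
    struct node *next, *prev;
};

static void list_init(struct node *h)
{
    h->next = h;
    h->prev = h;
}

/* Link n between prev and next, in the spirit of __list_add(). */
static void link_between(struct node *n, struct node *prev, struct node *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/* Insert n just before anchor (list_add_tail() semantics). */
static void add_before(struct node *n, struct node *anchor)
{
    link_between(n, anchor->prev, anchor);
}

/* Unlink n but keep n's own next/prev as they were (simplification). */
static void unlink_node(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Count entries; return -1 if head is not reached within limit steps. */
static int walk_len(const struct node *head, int limit)
{
    int steps = 0;
    const struct node *p;

    for (p = head->next; p != head; p = p->next)
        if (++steps > limit)
            return -1;
    return steps;
}

/* Hypothetical sequence: anchor t is reused after it was removed. */
int stale_anchor_demo(void)
{
    struct node head, p, t, n, x;

    list_init(&head);
    add_before(&p, &head);   /* head -> p */
    add_before(&t, &head);   /* head -> p -> t */
    add_before(&n, &head);   /* head -> p -> t -> n */

    unlink_node(&t);         /* head -> p -> n; t still points at p and n */
    unlink_node(&p);         /* head -> n */
    add_before(&p, &head);   /* head -> n -> p */

    add_before(&x, &t);      /* BUG: t is no longer on the list */

    /* The forward walk is now n -> p -> x -> t -> n -> ... without head. */
    return walk_len(&head, 1000);
}
```

An unbounded iteration over this list would spin forever; the bounded `walk_len()` gives up and returns -1 instead, which is how the sketch detects the cycle.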

I've blacklisted the commit 9c2791f936ef in our 4.4-based kernels as a safety
fuse against CVE-2018-20784 (and it's not present now).

You may ask what is lost without commit 9c2791f936ef, namely that the
invariant of hierarchical sorting of rq->leaf_cfs_rq_list may be broken.
That ordering would be needed for a proper bottom-up calculation in UBA();
however, the kernel between:
> 9d89c257dfb9c ("sched/fair: Rewrite runnable load and utilization average tracking") v4.3-rc1~136^2~21
and 
> 4e5160766fcc9 ("sched/fair: Propagate asynchrous detach") v4.10-rc1~189^2~24
does not rely on the bottom-up traversal AFAICS.

(Note about CPU controller access: systemd's default behavior denies
unprivileged users access to the CPU controller (on SLE12-SP3). That makes
control over taskgroups (wrt triggering the CVE) more difficult for an
unprivileged user. However, I can't rule out a clever combination of
idle/active tasks that may cause the list corruption if commit 9c2791f936ef
were present.)

[1] https://lore.kernel.org/all/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com/
[2] Also in print_cfs_stats() which is a debugging interface with mode 444.
    Here it's without rq->lock but within an RCU read section.
Comment 30 Maintenance Automation 2023-08-16 08:31:33 UTC
SUSE-SU-2023:3324-1: An update that solves 14 vulnerabilities and has two fixes can now be installed.

Category: security (important)
Bug References: 1087082, 1126703, 1206418, 1207561, 1209779, 1210584, 1211738, 1211867, 1212502, 1213059, 1213167, 1213251, 1213286, 1213287, 1213585, 1213588
CVE References: CVE-2018-20784, CVE-2018-3639, CVE-2022-40982, CVE-2023-0459, CVE-2023-1637, CVE-2023-20569, CVE-2023-20593, CVE-2023-2985, CVE-2023-3106, CVE-2023-3268, CVE-2023-35001, CVE-2023-3567, CVE-2023-3611, CVE-2023-3776
Sources used:
SUSE Linux Enterprise Server 12 SP2 BCL 12-SP2 (src): kernel-syms-4.4.121-92.208.1, kernel-source-4.4.121-92.208.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.