Bug 1095131

Summary: kubelet service (1.10.2) fails to start: failed to get device for dir "/var/lib/kubelet"
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Maximilian Meister <mmeister>
Component: Kubic
Assignee: Maximilian Meister <mmeister>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P2 - High
CC: aherzig, rbrown, vrothberg
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard: obs:running:10751:important
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Maximilian Meister 2018-05-30 06:08:10 UTC

May 30 04:56:24 admin hyperkube[2766]: F0530 04:56:24.779150    2766 kubelet.go:1354] Failed to start ContainerManager failed to get rootfs info: failed to get device for dir "/var/lib/kubelet": could not find device with major: 0, minor: 46 in cached partitions map
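
For anyone triaging similar journals, the path and device numbers can be pulled out of the fatal line above with a small script. This is only a sketch: the pattern is written against the single log line quoted in this report, the names `DEVICE_ERROR` and `parse_device_error` are made up for illustration, and other kubelet versions may word the message differently.

```python
import re

# Pattern written against the one log line quoted in this report (assumption:
# other kubelet/cadvisor versions may phrase the error differently).
DEVICE_ERROR = re.compile(
    r'failed to get device for dir "(?P<path>[^"]+)": '
    r"could not find device with major: (?P<major>\d+), minor: (?P<minor>\d+)"
)


def parse_device_error(line):
    """Return (path, major, minor) from the log line, or None if it does not match."""
    match = DEVICE_ERROR.search(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("major")), int(match.group("minor"))


# The message part of the log line from this report.
LOG_LINE = (
    "F0530 04:56:24.779150    2766 kubelet.go:1354] Failed to start ContainerManager "
    'failed to get rootfs info: failed to get device for dir "/var/lib/kubelet": '
    "could not find device with major: 0, minor: 46 in cached partitions map"
)

print(parse_device_error(LOG_LINE))
```

The extracted pair (major 0, minor 46) is what would then be compared against the mount table on the affected host.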

kubelet won't start with kubernetes 1.10.2

it might be a regression of https://github.com/google/cadvisor/pull/1668 ?

k8s is not yet updated officially, i have used: https://build.opensuse.org/package/show/home:m_meister:branches:devel:CaaSP:Head:ControllerNode/kubernetes

tested on the Stack-hardware image, which comes with the additional 9p kernel modules and k8s preinstalled: https://download.opensuse.org/repositories/devel:/CaaSP:/images/images/openSUSE-Tumbleweed-Kubic.x86_64-15.0-CaaSP-Stack-hardware-x86_64-Build4.43.qcow2

some previous discussion can be found also in https://bugzilla.suse.com/show_bug.cgi?id=1084766

Reproducible: Always
Comment 1 Maximilian Meister 2018-06-15 07:29:22 UTC
i've run the conformance tests for 1.10.4 [0] (sle based environment with updated cri-o and crio-tools) and they were green, which makes me wonder why this only happens on kubic. i skimmed through the k8s changelogs but couldn't find any meaningful entry about something having fixed this issue

@richard any idea what could be the difference here? or should we test it again on kubic with 1.10.4? last test was done with 1.10.3 IIRC

also feel free to adapt the priority of the bug

[0] http://jenkins.caasp.suse.net/job/caasp-manual-sandbox/job/master/60/
Comment 2 Thorsten Kukuk 2018-06-15 08:22:27 UTC
(In reply to Maximilian Meister from comment #1)
> @richard any idea what could be the difference here? or should we test it
> again on kubic with 1.10.4? last test was done with 1.10.3 IIRC

SLE12 SP3 (CaaSP until v3) has /var/lib/kubelet as subvolume
Kubic (CaaSP from v4) has /var as subvolume and /var/lib/kubelet is a directory inside this subvolume.

I bet that this is what confuses kubernetes.
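
To make the layout difference concrete: a lookup keyed on device numbers can only succeed if the pair reported for the directory also shows up in the mount table. A rough sketch of such a lookup follows. The mountinfo sample is hypothetical (not taken from an affected machine) and is chosen so that 0:46 is absent, and `parse_mountinfo` is an illustrative name, not the actual kubelet or cadvisor code.

```python
def parse_mountinfo(text):
    """Map (major, minor) -> list of mount points from mountinfo-formatted text."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        # Field 3 is "major:minor", field 5 is the mount point.
        major, minor = (int(n) for n in fields[2].split(":"))
        table.setdefault((major, minor), []).append(fields[4])
    return table


# Hypothetical sample in /proc/self/mountinfo format, for illustration only.
SAMPLE = """\
22 1 0:40 /@/.snapshots/1/snapshot / rw,relatime - btrfs /dev/vda2 rw,subvol=/@/.snapshots/1/snapshot
30 22 0:40 /@/var /var rw,relatime - btrfs /dev/vda2 rw,subvol=/@/var
"""

table = parse_mountinfo(SAMPLE)
print(table.get((0, 40)))  # mount points sharing device 0:40
print(table.get((0, 46)))  # None: the pair from the error is not in the table
```

On a real host the same function can be fed the contents of /proc/self/mountinfo to see whether the pair from the kubelet error is present.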
Comment 3 Richard Brown 2018-06-15 19:38:37 UTC
(In reply to Thorsten Kukuk from comment #2)
> (In reply to Maximilian Meister from comment #1)
> > @richard any idea what could be the difference here? or should we test it
> > again on kubic with 1.10.4? last test was done with 1.10.3 IIRC
> 
> SLE12 SP3 (CaaSP until v3) has /var/lib/kubelet as subvolume
> Kubic (CaaSP from v4) has /var as subvolume and /var/lib/kubelet is a
> directory inside this subvolume.
> 
> I bet that this is what confuses kubernetes.

Indeed - my guesstimate suggests that https://github.com/google/cadvisor/pull/1668 only works if /var/lib/kubelet is its own subvolume

It's only a guesstimate because I really don't understand how go's 'stat' works, so I'm a little blind as to how that fix worked in the past.
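
For reference, the device ID that a stat call returns for a path can be inspected without reading any Go. The sketch below uses Python's `os.stat` and decodes `st_dev` into a major/minor pair; it is not the code from the cadvisor PR, and `device_numbers` is an illustrative name. The paths in the loop are the ones discussed in this bug and may not exist on every machine.

```python
import os


def device_numbers(path):
    """Return the (major, minor) pair that stat reports for the device holding path."""
    st_dev = os.stat(path).st_dev
    return os.major(st_dev), os.minor(st_dev)


if __name__ == "__main__":
    # Paths from this bug report; missing ones are reported instead of crashing.
    for path in ("/", "/var", "/var/lib/kubelet"):
        try:
            print(path, device_numbers(path))
        except OSError as err:
            print(path, "not available:", err)
```

Running this on both a SLE 12 based and a Kubic based node would show whether the pair reported for /var/lib/kubelet differs between the two subvolume layouts.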

But one thing we can say for sure is that it doesn't work on Kubic and the difference in the subvolume layout is the biggest change that is likely to trigger any difference in logic for volume/partition ID detection.

That change isn't just present in Kubic - we can expect similar behaviour in any SLE 15 based CaaSP also (eg. CaaSP v4)

So I'd recommend running any conformance tests for 1.10.x against both SLE 12/CaaSP v3 and SLE 15/Kubic/CaaSP v4 - assuming both are being targeted for k8s 1.10 releases.

Bumping up the severity and priority on the grounds of Kubic/CaaSP v4 without kubernetes is as useful as a submarine with a sunroof or an inflatable dartboard ;)

I'd recommend the bug be considered equally important for CaaSPv4 until it's proven that it doesn't exist there.
Comment 4 Maximilian Meister 2018-06-18 12:59:49 UTC
i've added a patch as part of [0] to fix this bug, and after a local test, k8s was running fine and the error message hasn't appeared anymore. i only ran into a failing openldap as a followup but this was more or less expected

[0] https://build.opensuse.org/request/show/617020
Comment 5 Maximilian Meister 2018-06-18 13:04:16 UTC
(In reply to Maximilian Meister from comment #4)
> i've added a patch as part of [0] to fix this bug, and after a local
> test, k8s was running fine and the error message hasn't appeared anymore, i
> only ran into a failing openldap as a followup but this was more or less
> expected
> 
> [0] https://build.opensuse.org/request/show/617020

old sr, this is the correct one -> https://build.opensuse.org/request/show/617501
Comment 6 Thorsten Kukuk 2018-06-18 13:16:57 UTC
(In reply to Maximilian Meister from comment #4)
> i've added a patch as part of [0] to fix this bug, and after a local
> test, k8s was running fine and the error message hasn't appeared anymore, i
> only ran into a failing openldap as a followup but this was more or less
> expected

Looks like the containers were not part of the last Tumbleweed snapshot ...
Comment 7 Maximilian Meister 2018-06-18 14:41:11 UTC
has been accepted to devel now. the fix is part of this factory SR -> https://build.opensuse.org/request/show/617520
Comment 8 Maximilian Meister 2018-08-27 12:41:50 UTC
fixed
Comment 13 Swamp Workflow Management 2018-12-07 17:25:26 UTC
SUSE-SU-2018:4020-1: An update that solves two vulnerabilities and has 7 fixes is now available.

Category: security (important)
Bug References: 1084765,1095131,1108195,1111341,1112967,1112980,1114645,1116933,1118198
CVE References: CVE-2016-8859,CVE-2018-1002105
Sources used:
SUSE CaaS Platform 3.0 (src):    caasp-container-manifests-3.0.0+git_r291_33f7b2d-3.6.3, cri-o-1.10.6-4.8.5, cri-tools-1.0.0beta2-3.3.3, kubernetes-1.10.11-4.8.2, kubernetes-salt-3.0.0+git_r888_7af7095-3.33.2
Comment 14 Swamp Workflow Management 2018-12-17 15:42:00 UTC
This is an autogenerated message for OBS integration:
This bug (1095131) was mentioned in
https://build.opensuse.org/request/show/658922 15.0 / kubectl
Comment 15 Swamp Workflow Management 2018-12-18 09:40:11 UTC
This is an autogenerated message for OBS integration:
This bug (1095131) was mentioned in
https://build.opensuse.org/request/show/659046 15.0+Backports:SLE-12 / kubectl
Comment 16 Swamp Workflow Management 2018-12-18 11:41:35 UTC
This is an autogenerated message for OBS integration:
This bug (1095131) was mentioned in
https://build.opensuse.org/request/show/659074 15.0 / kubectl
Comment 17 Swamp Workflow Management 2019-07-11 19:21:10 UTC
This is an autogenerated message for OBS integration:
This bug (1095131) was mentioned in
https://build.opensuse.org/request/show/714707 15.1 / kubernetes
Comment 18 Swamp Workflow Management 2019-07-12 00:21:02 UTC
This is an autogenerated message for OBS integration:
This bug (1095131) was mentioned in
https://build.opensuse.org/request/show/714723 15.1 / kubernetes
Comment 19 Swamp Workflow Management 2020-04-26 19:15:46 UTC
openSUSE-SU-2020:0554-1: An update that solves 7 vulnerabilities and has 22 fixes is now available.

Category: security (important)
Bug References: 1039663,1042383,1042387,1057277,1059207,1061027,1065972,1069469,1084765,1084766,1085009,1086185,1086412,1095131,1095154,1096773,1097473,1100838,1101010,1104598,1104821,1112980,1118897,1118898,1136403,1144065,1155323,1161056,1161179
CVE References: CVE-2016-5195,CVE-2016-8859,CVE-2017-1002101,CVE-2018-1002105,CVE-2018-16873,CVE-2018-16874,CVE-2019-10214
Sources used:
openSUSE Leap 15.1 (src):    cri-o-1.17.1-lp151.2.2, cri-tools-1.18.0-lp151.2.1, go1.14-1.14-lp151.6.1, kubernetes-1.18.0-lp151.5.1