Kubernetes / Flannel – Failed to list *v1.Service
I was recently building a new HA Kubernetes cluster and was having issues with CoreDNS failing to do its thing.
The Problem
The Kubernetes nodes were reporting as Ready, but the CoreDNS pods were refusing to become healthy:
[root@km01 ~]# kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-6955765f44-kxvnf 0/1 Running 2 18h
kube-system coredns-6955765f44-wgh6c 0/1 Running 2 18h
All the nodes were reporting as Ready:
[root@km01 ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
km01 Ready master 13d v1.17.3
km02 Ready master 3d22h v1.17.3
km03 Ready master 3d20h v1.17.3
kw01 Ready <none> 2d19h v1.17.3
kw02 Ready <none> 21h v1.17.3
kw03 Ready <none> 21h v1.17.3
The logs of the two CoreDNS pods were full of:
[root@km01 ~]# kubectl logs coredns-6955765f44-kxvnf -n kube-system
I0806 23:36:54.132644 1 trace.go:82] Trace[1939897294]: "Reflector pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98 ListAndWatch" (started: 2020-08-06 23:36:24.131764014 +0000 UTC m=+63460.404444423) (total time: 30.000829888s):
Trace[1939897294]: [30.000829888s] [30.000829888s] END
E0806 23:36:54.132664 1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190620085101-78d2af792bab/tools/cache/reflector.go:98: Failed to list *v1.Service: Get https://10.96.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.96.0.1:443: i/o timeout
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"
After checking the following:
- Flannel interfaces existed on each node
- The routing table on each node contained every other node's flannel subnet, and I could ping the flannel interfaces on the other nodes
- Kubelet was running with the parameter --network-plugin=cni
- kube-controller-manager was running with the parameters --allocate-node-cidrs=true and --cluster-cidr
The Solution
After looking at the flannel subnet.env file way too many times, it finally clicked: the FLANNEL_SUBNET doesn't sit inside the FLANNEL_NETWORK. The other nodes also had a FLANNEL_SUBNET outside of FLANNEL_NETWORK:
[root@km01 ~]# cat /run/flannel/subnet.env
FLANNEL_NETWORK=10.244.0.0/16
FLANNEL_SUBNET=10.243.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
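The mismatch is easy to confirm with a few lines of Python, a quick sketch using the standard-library ipaddress module with the values from the subnet.env above:

```python
import ipaddress

# Values taken from /run/flannel/subnet.env on km01 (before the fix)
network = ipaddress.ip_network("10.244.0.0/16")
# strict=False because FLANNEL_SUBNET is written as the node's gateway
# address (10.243.0.1/24) rather than the network address (10.243.0.0/24)
subnet = ipaddress.ip_network("10.243.0.1/24", strict=False)

# Each node's subnet should live inside the flannel network, but here it doesn't
print(subnet.subnet_of(network))  # False
```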
This happened because we bootstrapped the cluster with the pod CIDR 10.243.0.0/16, while the default flannel deployment has the network set to 10.244.0.0/16, conflicting with the CIDR the cluster was bootstrapped with:
  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
The fix is to edit the flannel ConfigMap and update the network to match the CIDR the cluster was bootstrapped with.
[root@km01 ~]# kubectl edit configmap kube-flannel-cfg -n kube-system
apiVersion: v1
data:
  cni-conf.json: |
    {
      "name": "cbr0",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "flannel",
          "delegate": {
            "hairpinMode": true,
            "isDefaultGateway": true
          }
        },
        {
          "type": "portmap",
          "capabilities": {
            "portMappings": true
          }
        }
      ]
    }
  net-conf.json: |
    {
      "Network": "10.243.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
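If you'd rather not go through an interactive editor, the same change can be applied with kubectl patch. Here's a sketch that builds the merge-patch payload; the ConfigMap name and namespace are the ones from the deployment above:

```python
import json

# Desired flannel network config: must match the pod CIDR the cluster
# was bootstrapped with (10.243.0.0/16 in our case)
net_conf = json.dumps(
    {"Network": "10.243.0.0/16", "Backend": {"Type": "vxlan"}}, indent=2
)

# ConfigMap data values are plain strings, so the JSON document
# is embedded into the patch as a single string value
patch = json.dumps({"data": {"net-conf.json": net_conf}})

print(
    "kubectl patch configmap kube-flannel-cfg -n kube-system "
    "--type merge -p '%s'" % patch
)
```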
Restart the flannel pods:
[root@km01 ~]# kubectl delete pods -l app=flannel -n kube-system
pod "kube-flannel-ds-amd64-9tmwn" deleted
pod "kube-flannel-ds-amd64-b4htd" deleted
pod "kube-flannel-ds-amd64-b6kmn" deleted
pod "kube-flannel-ds-amd64-nlmqk" deleted
pod "kube-flannel-ds-amd64-qpqsv" deleted
pod "kube-flannel-ds-amd64-zhh95" deleted
After flannel restarted, subnet.env showed the correct values:
[root@km01 ~]# cat /var/run/flannel/subnet.env
FLANNEL_NETWORK=10.243.0.0/16
FLANNEL_SUBNET=10.243.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
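As a sanity check, the containment rule can be wrapped in a few lines of Python that parse the subnet.env format (a sketch; check_subnet_env is a hypothetical helper name, and the input is the fixed file shown above):

```python
import ipaddress

# subnet.env contents after the fix, as shown above
SUBNET_ENV = """\
FLANNEL_NETWORK=10.243.0.0/16
FLANNEL_SUBNET=10.243.0.1/24
FLANNEL_MTU=1450
FLANNEL_IPMASQ=true
"""

def check_subnet_env(text):
    # Parse the KEY=VALUE lines into a dict
    vals = dict(line.split("=", 1) for line in text.splitlines() if "=" in line)
    network = ipaddress.ip_network(vals["FLANNEL_NETWORK"])
    # strict=False: FLANNEL_SUBNET carries the gateway address,
    # not the network address
    subnet = ipaddress.ip_network(vals["FLANNEL_SUBNET"], strict=False)
    return subnet.subnet_of(network)

print(check_subnet_env(SUBNET_ENV))  # True
```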
And CoreDNS started reporting as healthy:
[root@km01 ~]# kubectl get pods -l k8s-app=kube-dns -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-sxctb 1/1 Running 0 66m
coredns-6955765f44-xhhcp 1/1 Running 1 66m