Using MachineHealthCheck with Tanzu Kubernetes Grid (TKGm) 1.2.1
Starting with Tanzu Kubernetes Grid (TKG) 1.2.1, the Cluster API MachineHealthCheck feature can also be used for the Control Plane. I tried it out and describe the steps below.
Environment
- TKG v1.2.1
- AWS
Procedure
Creating the Management Cluster
Following the official documentation, create a Management Cluster on AWS using either the GUI or the CLI.
To use MachineHealthCheck, enable the feature when creating the cluster.
Check the tkg CLI version
$ tkg version
Client:
Version: v1.2.1
Git commit: 9d15a485f2ccc462622f8df6a81e5fa831c51895
In the steps below, a config file containing the environment variables for deploying TKG on AWS is loaded, ~/.tkg/config.yaml is updated accordingly, and the Management Cluster is deployed with the CLI. The following environment variables are set (a sketch of the corresponding config entries follows the list):
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_DEFAULT_REGION
AWS_SSH_KEY_NAME
AWS_REGION
AWS_NODE_AZ
AWS_NODE_AZ_1
AWS_NODE_AZ_2
AWS_PRIVATE_NODE_CIDR
AWS_PRIVATE_NODE_CIDR_1
AWS_PRIVATE_NODE_CIDR_2
AWS_PUBLIC_NODE_CIDR
AWS_PUBLIC_NODE_CIDR_1
AWS_PUBLIC_NODE_CIDR_2
AWS_VPC_CIDR
CONTROL_PLANE_MACHINE_TYPE
NODE_MACHINE_TYPE
ENABLE_MHC
AWS_B64ENCODED_CREDENTIALS
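For reference, the corresponding entries in ~/.tkg/config.yaml look roughly like the following. This is only a sketch: the AZ, CIDR, SSH key, and instance-type values are placeholders rather than the exact values used in this environment, and ENABLE_MHC is the setting that turns MachineHealthCheck on (credentials are omitted).
AWS_REGION: ap-northeast-1
AWS_NODE_AZ: ap-northeast-1a
AWS_NODE_AZ_1: ap-northeast-1c
AWS_NODE_AZ_2: ap-northeast-1d
AWS_VPC_CIDR: 10.0.0.0/16
AWS_PRIVATE_NODE_CIDR: 10.0.0.0/24
AWS_PRIVATE_NODE_CIDR_1: 10.0.2.0/24
AWS_PRIVATE_NODE_CIDR_2: 10.0.4.0/24
AWS_PUBLIC_NODE_CIDR: 10.0.1.0/24
AWS_PUBLIC_NODE_CIDR_1: 10.0.3.0/24
AWS_PUBLIC_NODE_CIDR_2: 10.0.5.0/24
AWS_SSH_KEY_NAME: my-aws-keypair
CONTROL_PLANE_MACHINE_TYPE: t3.large
NODE_MACHINE_TYPE: m5.large
ENABLE_MHC: "true"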
Create the Management Cluster.
$ tkg init --infrastructure aws --name schecter-aws --plan prod
Check the MachineHealthCheck objects.
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 1 1
Creating a MachineHealthCheck manifest for the Control Plane
The MachineHealthCheck feature has been available since TKG v1.2.0, but it only covered Worker nodes.
Cluster API added MachineHealthCheck support for the Control Plane in v0.3.11, and this became usable with TKG v1.2.1. However, a cluster deployed by simply following the documented procedure does not have MachineHealthCheck enabled for the Control Plane.
So we create a manifest file and enable MachineHealthCheck for the Control Plane as well.
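Before writing the new manifest, you can check the selector of the default MachineHealthCheck to confirm that it only targets the Worker nodes (a sketch; the output is omitted here):
$ kubectl get machinehealthcheck schecter-aws -n tkg-system -o jsonpath='{.spec.selector}'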
$ cat mhc-schecter-aws-cp.yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: schecter-aws
  name: schecter-aws-cp
  namespace: tkg-system
spec:
  clusterName: schecter-aws
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
      cluster.x-k8s.io/cluster-name: schecter-aws
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
$ kubectl apply -f mhc-schecter-aws-cp.yaml
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 1 1
tkg-system schecter-aws-cp 100% 3 3
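You can also check which Machine objects the new MachineHealthCheck selects by querying with the same labels as its selector (a sketch):
$ kubectl get machines -n tkg-system -l cluster.x-k8s.io/cluster-name=schecter-aws,cluster.x-k8s.io/control-plane=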
Since we want an AZ-distributed configuration for what follows, scale out the Worker nodes so that the Management Cluster has one Worker node in each AZ.
$ tkg scale cluster schecter-aws --worker-machine-count 3 --namespace tkg-system
Successfully updated worker node machine deployment replica count for cluster schecter-aws
workload cluster schecter-aws is being scaled
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 3 1
tkg-system schecter-aws-cp 100% 3 3
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 3
Verifying automatic instance recovery - Part 1
Let's delete the instances in ap-northeast-1d from the AWS Console. Because the cluster is distributed across AZs, one Control Plane node and one Worker node of the Management Cluster are targeted for deletion.
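The same deletion can also be done with the AWS CLI instead of the Console (a sketch; the instance IDs below are placeholders):
$ aws ec2 describe-instances --filters "Name=availability-zone,Values=ap-northeast-1d" "Name=instance-state-name,Values=running" --query "Reservations[].Instances[].InstanceId" --output text
$ aws ec2 terminate-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210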
The MachineHealthCheck status changes, and you can see that EXPECTEDMACHINES and CURRENTHEALTHY now differ.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 28m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 23m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 4m28s v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 26m v1.19.3+vmware.1
ip-10-0-4-231.ap-northeast-1.compute.internal Ready none 3m59s v1.19.3+vmware.1
ip-10-0-4-38.ap-northeast-1.compute.internal Ready master 30m v1.19.3+vmware.1
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 32m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 27m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 8m42s v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 30m v1.19.3+vmware.1
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 3 2
tkg-system schecter-aws-cp 100% 3 2
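To follow the remediation as it happens, you can also watch the Machine objects in the Management Cluster (a sketch):
$ kubectl get machines -n tkg-system -w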
After a while, the deleted machines are recreated automatically.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 34m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 29m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 10m v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 32m v1.19.3+vmware.1
ip-10-0-4-241.ap-northeast-1.compute.internal Ready none 51s v1.19.3+vmware.1
ip-10-0-4-46.ap-northeast-1.compute.internal Ready master 42s v1.19.3+vmware.1
$ kubectl get machinehealthcheck -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 3
Creating the Workload Cluster
Create a Workload Cluster from the Management Cluster.
Specify the prod plan, with 3 Worker nodes and 3 Control Plane nodes.
$ tkg create cluster prs01 --plan prod -w 3
Logs of the command execution can also be found at: /var/folders/s5/yw3x_zj91v75b_pmkb55dw700000gp/T/tkg-20210226T144624504942630.log
Validating configuration...
Creating workload cluster 'prs01'...
Waiting for cluster to be initialized...
Waiting for cluster nodes to be available...
Waiting for addons installation...
Workload cluster 'prs01' created
$ tkg get cluster
NAME NAMESPACE STATUS CONTROLPLANE WORKERS KUBERNETES ROLES
prs01 default running 3/3 3/3 v1.19.3+vmware.1 none
$ tkg get credentials prs01
Credentials of workload cluster 'prs01' have been saved
You can now access the cluster by running 'kubectl config use-context prs01-admin@prs01'
$ kubectl config use-context prs01-admin@prs01
Switched to context "prs01-admin@prs01".
$ kubectl get ns
NAME STATUS AGE
default Active 17m
kube-node-lease Active 17m
kube-public Active 17m
kube-system Active 17m
tkg-system-public Active 16m
As with the Management Cluster, MachineHealthCheck for the Control Plane is not enabled on the Workload Cluster by default, so create and apply a manifest here as well.
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get machinehealthchecks -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
default prs01 100% 3 3
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 3
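The manifest shown below differs from the Management Cluster one only in the cluster name and namespace, so instead of writing it by hand it could also be generated from a small template. The following is just a sketch using envsubst; the CLUSTER_NAME / CLUSTER_NAMESPACE variables are my own, not part of TKG:
$ export CLUSTER_NAME=prs01 CLUSTER_NAMESPACE=default
$ envsubst <<'EOF' > mhc-${CLUSTER_NAME}-cp.yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
  name: ${CLUSTER_NAME}-cp
  namespace: ${CLUSTER_NAMESPACE}
spec:
  clusterName: ${CLUSTER_NAME}
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
      cluster.x-k8s.io/cluster-name: ${CLUSTER_NAME}
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
EOF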
$ cat mhc-prs01-cp.yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: demo
  name: prs01-cp
  namespace: default
spec:
  clusterName: prs01
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
      cluster.x-k8s.io/cluster-name: prs01
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
$ kubectl apply -f mhc-prs01-cp.yaml
$ kubectl get mhc -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
default prs01 100% 3 3
default prs01-cp 100% 3 3
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 3
$ kubectl get mhc prs01-cp -o yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1alpha3","kind":"MachineHealthCheck","metadata":{"annotations":{},"labels":{"cluster.x-k8s.io/cluster-name":"demo"},"name":"prs01-cp","namespace":"default"},"spec":{"clusterName":"prs01","maxUnhealthy":"100%","nodeStartupTimeout":"20m0s","selector":{"matchLabels":{"cluster.x-k8s.io/cluster-name":"prs01","cluster.x-k8s.io/control-plane":""}},"unhealthyConditions":[{"status":"Unknown","timeout":"5m0s","type":"Ready"},{"status":"False","timeout":"5m0s","type":"Ready"}]}}
  creationTimestamp: "2021-02-26T06:16:35Z"
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: prs01
  managedFields:
  - apiVersion: cluster.x-k8s.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:cluster.x-k8s.io/cluster-name: {}
      f:spec:
        .: {}
        f:clusterName: {}
        f:maxUnhealthy: {}
        f:nodeStartupTimeout: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:cluster.x-k8s.io/cluster-name: {}
            f:cluster.x-k8s.io/control-plane: {}
        f:unhealthyConditions: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-02-26T06:16:35Z"
  - apiVersion: cluster.x-k8s.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentHealthy: {}
        f:expectedMachines: {}
        f:observedGeneration: {}
        f:remediationsAllowed: {}
        f:targets: {}
    manager: manager
    operation: Update
    time: "2021-02-26T06:16:35Z"
  name: prs01-cp
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha3
    kind: Cluster
    name: prs01
    uid: 930fcd95-7212-4073-aee2-54c028d69fad
  resourceVersion: "26636"
  selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinehealthchecks/prs01-cp
  uid: 28e2cd6e-2de7-462e-8e1b-8b92899980fe
spec:
  clusterName: prs01
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: prs01
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
status:
  conditions:
  - lastTransitionTime: "2021-02-26T06:16:35Z"
    status: "True"
    type: RemediationAllowed
  currentHealthy: 3
  expectedMachines: 3
  observedGeneration: 1
  remediationsAllowed: 3
  targets:
  - prs01-control-plane-nfp2n
  - prs01-control-plane-c6khb
  - prs01-control-plane-tzjsl
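As the spec above shows, maxUnhealthy and nodeStartupTimeout control how aggressively remediation is allowed. If you want a more conservative policy, these fields can be changed later, for example with kubectl patch (a sketch; the values are examples only):
$ kubectl patch machinehealthcheck prs01-cp -n default --type merge -p '{"spec":{"maxUnhealthy":"33%","nodeStartupTimeout":"30m0s"}}'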
Verifying automatic instance recovery - Part 2
As in the Management Cluster test, let's delete the instances in ap-northeast-1d from the AWS Console.
The MachineHealthCheck status of both the Management Cluster and the Workload Cluster changes, and you can see that EXPECTEDMACHINES and CURRENTHEALTHY now differ.
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 75m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 70m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 51m v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 74m v1.19.3+vmware.1
$ kubectl get mhc -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
default prs01 100% 3 2
default prs01-cp 100% 3 2
tkg-system schecter-aws 100% 3 2
tkg-system schecter-aws-cp 100% 3 2
After a while, the deleted machines are recreated automatically. The bottom two nodes in the kubectl get nodes output below are the recovered ones.
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 76m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 71m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 52m v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 75m v1.19.3+vmware.1
ip-10-0-4-134.ap-northeast-1.compute.internal NotReady master 7s v1.19.3+vmware.1
ip-10-0-4-55.ap-northeast-1.compute.internal Ready none 39s v1.19.3+vmware.1
$ kubectl get mhc -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
default prs01 100% 3 2
default prs01-cp 100% 3 3
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 2
$ kubectl get mhc -A
NAMESPACE NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
default prs01 100% 3 3
default prs01-cp 100% 3 3
tkg-system schecter-aws 100% 3 3
tkg-system schecter-aws-cp 100% 3 3
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal Ready none 77m v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal Ready master 72m v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal Ready none 53m v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal Ready master 75m v1.19.3+vmware.1
ip-10-0-4-134.ap-northeast-1.compute.internal Ready master 51s v1.19.3+vmware.1
ip-10-0-4-55.ap-northeast-1.compute.internal Ready none 83s v1.19.3+vmware.1
Let's check the Workload Cluster as well.
$ kubectl config use-context prs01-admin@prs01
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-0-47.ap-northeast-1.compute.internal Ready none 29m v1.19.3+vmware.1
ip-10-0-0-9.ap-northeast-1.compute.internal Ready master 27m v1.19.3+vmware.1
ip-10-0-2-231.ap-northeast-1.compute.internal Ready master 24m v1.19.3+vmware.1
ip-10-0-2-27.ap-northeast-1.compute.internal Ready none 29m v1.19.3+vmware.1
ip-10-0-4-244.ap-northeast-1.compute.internal Ready master 84s v1.19.3+vmware.1
ip-10-0-4-88.ap-northeast-1.compute.internal Ready none 104s v1.19.3+vmware.1
This confirms that on AWS, starting with TKG v1.2.1, both Control Plane and Worker nodes recover automatically when an instance fails.