Using the MachineHealthCheck Feature with Tanzu Kubernetes Grid (TKGm) 1.2.1

Starting with Tanzu Kubernetes Grid (TKG) 1.2.1, the Cluster API MachineHealthCheck feature can also be applied to the Control Plane. I tried it out and describe the steps below.


Environment

  • TKG v1.2.1
  • AWS

Steps

Creating the Management Cluster

Following the official documentation, create a Management Cluster on AWS using either the GUI or the CLI.
To use the MachineHealthCheck feature, enable it when creating the cluster.
Check the tkg CLI version:
$ tkg version
Client:
	Version: v1.2.1
	Git commit: 9d15a485f2ccc462622f8df6a81e5fa831c51895

In the steps below, a config file containing the environment variables needed to deploy TKG on AWS is loaded, ~/.tkg/config.yaml is updated accordingly, and the Management Cluster is deployed with the CLI.

The following environment variables are set (an illustrative snippet follows the list):
  AWS_ACCESS_KEY_ID
  AWS_SECRET_ACCESS_KEY
  AWS_DEFAULT_REGION
  AWS_SSH_KEY_NAME
  AWS_REGION
  AWS_NODE_AZ
  AWS_NODE_AZ_1
  AWS_NODE_AZ_2
  AWS_PRIVATE_NODE_CIDR
  AWS_PRIVATE_NODE_CIDR_1
  AWS_PRIVATE_NODE_CIDR_2
  AWS_PUBLIC_NODE_CIDR
  AWS_PUBLIC_NODE_CIDR_1
  AWS_PUBLIC_NODE_CIDR_2
  AWS_VPC_CIDR
  CONTROL_PLANE_MACHINE_TYPE
  NODE_MACHINE_TYPE
  ENABLE_MHC
  AWS_B64ENCODED_CREDENTIALS

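For reference, a minimal sketch of how these settings might look in the config file is shown below. The values are placeholders only and need to be replaced with ones matching your own AWS account; only a subset of the variables listed above is shown.
# Illustrative excerpt of ~/.tkg/config.yaml (placeholder values)
AWS_REGION: ap-northeast-1
AWS_NODE_AZ: ap-northeast-1a
AWS_NODE_AZ_1: ap-northeast-1c
AWS_NODE_AZ_2: ap-northeast-1d
AWS_SSH_KEY_NAME: my-aws-keypair
AWS_VPC_CIDR: 10.0.0.0/16
CONTROL_PLANE_MACHINE_TYPE: t3.large
NODE_MACHINE_TYPE: m5.large
# ENABLE_MHC is what creates the default (Worker-only) MachineHealthCheck
ENABLE_MHC: "true"
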
Create the Management Cluster:
$ tkg init --infrastructure aws --name schecter-aws --plan prod

Check the MachineHealthCheck:
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME           MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws   100%           1                  1
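
This default MachineHealthCheck only covers the Worker nodes; its selector does not match the Control Plane Machines. To see exactly what it looks like before writing the Control Plane version in the next section, you can dump it (output omitted here):
$ kubectl get machinehealthcheck schecter-aws -n tkg-system -o yaml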


Creating a MachineHealthCheck manifest for the Control Plane

The MachineHealthCheck feature has been available since TKG v1.2.0, but it only covered Worker nodes.
Cluster API added MachineHealthCheck support for the Control Plane in v0.3.11, and this capability became available in TKG v1.2.1. However, simply deploying a cluster by following the standard procedure does not enable MachineHealthCheck for the Control Plane.
Therefore, create a manifest and enable MachineHealthCheck for the Control Plane as well.
$ cat mhc-schecter-aws-cp.yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: schecter-aws
  name: schecter-aws-cp
  namespace: tkg-system
spec:
  clusterName: schecter-aws
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
      cluster.x-k8s.io/cluster-name: schecter-aws
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready

$ kubectl apply -f mhc-schecter-aws-cp.yaml
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws      100%           1                  1
tkg-system   schecter-aws-cp   100%           3                  3
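
If EXPECTEDMACHINES were to stay at 0, the selector would not be matching anything. An optional sanity check (not part of the original procedure) is to list the Machines carrying the control-plane label that the selector relies on:
$ kubectl get machines -n tkg-system -l cluster.x-k8s.io/control-plane --show-labels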

Since I want an AZ-distributed configuration for the tests that follow, scale out the Worker nodes of the Management Cluster so that one Worker node is placed in each AZ.
$ tkg scale cluster schecter-aws --worker-machine-count 3 --namespace tkg-system
Successfully updated worker node machine deployment replica count for cluster schecter-aws
workload cluster schecter-aws is being scaled
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws      100%           3                  1
tkg-system   schecter-aws-cp   100%           3                  3
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  3
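
To confirm that the Worker nodes really ended up in different AZs, one option is to display the zone label of each node (on some versions the older failure-domain.beta.kubernetes.io/zone label may be the one populated instead):
$ kubectl get nodes -L topology.kubernetes.io/zone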


Verifying automatic instance recovery - Part 1

Delete the instances located in ap-northeast-1d from the AWS Console. Because the cluster is spread across AZs, one Control Plane node and one Worker node of the Management Cluster are affected by this deletion.
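
If you prefer the CLI over the console, the same failure can be injected with the AWS CLI; a sketch, assuming the two instance IDs have already been looked up:
$ aws ec2 terminate-instances --instance-ids <control-plane-instance-id> <worker-instance-id>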

The MachineHealthCheck status has changed, and you can see a gap between EXPECTEDMACHINES and CURRENTHEALTHY.
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE     VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready    <none>   28m     v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready    master   23m     v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready    <none>   4m28s   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready    master   26m     v1.19.3+vmware.1
ip-10-0-4-231.ap-northeast-1.compute.internal   Ready    <none>   3m59s   v1.19.3+vmware.1
ip-10-0-4-38.ap-northeast-1.compute.internal    Ready    master   30m     v1.19.3+vmware.1
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE     VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready    <none>   32m     v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready    master   27m     v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready    <none>   8m42s   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready    master   30m     v1.19.3+vmware.1
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws      100%           3                  2
tkg-system   schecter-aws-cp   100%           3                  2
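
While remediation is in progress, the Machines being deleted and recreated can be watched (an optional step) with:
$ kubectl get machines -n tkg-system -w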

After a while, the nodes recover automatically.
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready    <none>   34m   v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready    master   29m   v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready    <none>   10m   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready    master   32m   v1.19.3+vmware.1
ip-10-0-4-241.ap-northeast-1.compute.internal   Ready    <none>   51s   v1.19.3+vmware.1
ip-10-0-4-46.ap-northeast-1.compute.internal    Ready    master   42s   v1.19.3+vmware.1
$ kubectl get machinehealthcheck -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  3


Creating a Workload Cluster

Create a Workload Cluster from the Management Cluster.
Specify the prod plan, with 3 Worker nodes and 3 Control Plane nodes.
$ tkg create cluster prs01 --plan prod -w 3
Logs of the command execution can also be found at: /var/folders/s5/yw3x_zj91v75b_pmkb55dw700000gp/T/tkg-20210226T144624504942630.log
Validating configuration...
Creating workload cluster 'prs01'...
Waiting for cluster to be initialized...
Waiting for cluster nodes to be available...
Waiting for addons installation...

Workload cluster 'prs01' created
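
Note that the prod plan already defaults to three Control Plane nodes, which is why only the Worker count is passed above. If you need a different number of Control Plane nodes, tkg create cluster also accepts an explicit count, e.g.:
$ tkg create cluster prs01 --plan prod -w 3 --controlplane-machine-count 3
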
$ tkg get cluster
 NAME   NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
 prs01  default    running  3/3           3/3      v1.19.3+vmware.1  <none>
$ tkg get credentials prs01
Credentials of workload cluster 'prs01' have been saved
You can now access the cluster by running 'kubectl config use-context prs01-admin@prs01'
$ kubectl config use-context prs01-admin@prs01
Switched to context "prs01-admin@prs01".
$ kubectl get ns
NAME                STATUS   AGE
default             Active   17m
kube-node-lease     Active   17m
kube-public         Active   17m
kube-system         Active   17m
tkg-system-public   Active   16m

As with the Management Cluster, MachineHealthCheck is not enabled for the Control Plane of the Workload Cluster either, so create and apply a manifest here as well.
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get machinehealthchecks -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default      prs01             100%           3                  3
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  3
$ cat mhc-prs01-cp.yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  labels:
    cluster.x-k8s.io/cluster-name: demo
  name: prs01-cp
  namespace: default
spec:
  clusterName: prs01
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
      cluster.x-k8s.io/cluster-name: prs01
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
$ kubectl apply -f mhc-prs01-cp.yaml
$ kubectl get mhc -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default      prs01             100%           3                  3
default      prs01-cp          100%           3                  3
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  3
$ kubectl get mhc prs01-cp -o yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"cluster.x-k8s.io/v1alpha3","kind":"MachineHealthCheck","metadata":{"annotations":{},"labels":{"cluster.x-k8s.io/cluster-name":"demo"},"name":"prs01-cp","namespace":"default"},"spec":{"clusterName":"prs01","maxUnhealthy":"100%","nodeStartupTimeout":"20m0s","selector":{"matchLabels":{"cluster.x-k8s.io/cluster-name":"prs01","cluster.x-k8s.io/control-plane":""}},"unhealthyConditions":[{"status":"Unknown","timeout":"5m0s","type":"Ready"},{"status":"False","timeout":"5m0s","type":"Ready"}]}}
  creationTimestamp: "2021-02-26T06:16:35Z"
  generation: 1
  labels:
    cluster.x-k8s.io/cluster-name: prs01
  managedFields:
  - apiVersion: cluster.x-k8s.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:cluster.x-k8s.io/cluster-name: {}
      f:spec:
        .: {}
        f:clusterName: {}
        f:maxUnhealthy: {}
        f:nodeStartupTimeout: {}
        f:selector:
          .: {}
          f:matchLabels:
            .: {}
            f:cluster.x-k8s.io/cluster-name: {}
            f:cluster.x-k8s.io/control-plane: {}
        f:unhealthyConditions: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-02-26T06:16:35Z"
  - apiVersion: cluster.x-k8s.io/v1alpha3
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentHealthy: {}
        f:expectedMachines: {}
        f:observedGeneration: {}
        f:remediationsAllowed: {}
        f:targets: {}
    manager: manager
    operation: Update
    time: "2021-02-26T06:16:35Z"
  name: prs01-cp
  namespace: default
  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1alpha3
    kind: Cluster
    name: prs01
    uid: 930fcd95-7212-4073-aee2-54c028d69fad
  resourceVersion: "26636"
  selfLink: /apis/cluster.x-k8s.io/v1alpha3/namespaces/default/machinehealthchecks/prs01-cp
  uid: 28e2cd6e-2de7-462e-8e1b-8b92899980fe
spec:
  clusterName: prs01
  maxUnhealthy: 100%
  nodeStartupTimeout: 20m0s
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: prs01
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
  - status: Unknown
    timeout: 5m0s
    type: Ready
  - status: "False"
    timeout: 5m0s
    type: Ready
status:
  conditions:
  - lastTransitionTime: "2021-02-26T06:16:35Z"
    status: "True"
    type: RemediationAllowed
  currentHealthy: 3
  expectedMachines: 3
  observedGeneration: 1
  remediationsAllowed: 3
  targets:
  - prs01-control-plane-nfp2n
  - prs01-control-plane-c6khb
  - prs01-control-plane-tzjsl
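
The status section above is what drives remediation. If you only want the healthy/expected counts rather than the whole object, a jsonpath query (an optional shortcut) also works:
$ kubectl get mhc prs01-cp -o jsonpath='{.status.currentHealthy}/{.status.expectedMachines}{"\n"}'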

Verifying automatic instance recovery - Part 2

As in the Management Cluster-only case, delete the instances located in ap-northeast-1d from the AWS Console.
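
As in Part 1, this can also be done with the AWS CLI instead of the console; a sketch for looking up candidate instance IDs first (it lists every running instance in that AZ, so narrow the result down to this cluster's nodes before terminating anything):
$ aws ec2 describe-instances \
    --filters Name=availability-zone,Values=ap-northeast-1d Name=instance-state-name,Values=running \
    --query 'Reservations[].Instances[].InstanceId' --output text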

The MachineHealthCheck status of both the Management Cluster and the Workload Cluster has changed, and you can see a gap between EXPECTEDMACHINES and CURRENTHEALTHY.
$ kubectl config use-context schecter-aws-admin@schecter-aws
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready    <none>   75m   v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready    master   70m   v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready    <none>   51m   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready    master   74m   v1.19.3+vmware.1
$ kubectl get mhc -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default      prs01             100%           3                  2
default      prs01-cp          100%           3                  2
tkg-system   schecter-aws      100%           3                  2
tkg-system   schecter-aws-cp   100%           3                  2

After a while, the nodes recover automatically. The bottom two nodes in the kubectl get nodes output are the ones that were recovered.
$ kubectl get nodes
NAME                                            STATUS     ROLES    AGE   VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready      <none>   76m   v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready      master   71m   v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready      <none>   52m   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready      master   75m   v1.19.3+vmware.1
ip-10-0-4-134.ap-northeast-1.compute.internal   NotReady   master   7s    v1.19.3+vmware.1
ip-10-0-4-55.ap-northeast-1.compute.internal    Ready      <none>   39s   v1.19.3+vmware.1
$ kubectl get mhc -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default      prs01             100%           3                  2
default      prs01-cp          100%           3                  3
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  2
$ kubectl get mhc -A
NAMESPACE    NAME              MAXUNHEALTHY   EXPECTEDMACHINES   CURRENTHEALTHY
default      prs01             100%           3                  3
default      prs01-cp          100%           3                  3
tkg-system   schecter-aws      100%           3                  3
tkg-system   schecter-aws-cp   100%           3                  3
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE   VERSION
ip-10-0-0-150.ap-northeast-1.compute.internal   Ready    <none>   77m   v1.19.3+vmware.1
ip-10-0-0-201.ap-northeast-1.compute.internal   Ready    master   72m   v1.19.3+vmware.1
ip-10-0-2-185.ap-northeast-1.compute.internal   Ready    <none>   53m   v1.19.3+vmware.1
ip-10-0-2-59.ap-northeast-1.compute.internal    Ready    master   75m   v1.19.3+vmware.1
ip-10-0-4-134.ap-northeast-1.compute.internal   Ready    master   51s   v1.19.3+vmware.1
ip-10-0-4-55.ap-northeast-1.compute.internal    Ready    <none>   83s   v1.19.3+vmware.1

Check the Workload Cluster as well.
$ kubectl config use-context prs01-admin@prs01
$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE    VERSION
ip-10-0-0-47.ap-northeast-1.compute.internal    Ready    <none>   29m    v1.19.3+vmware.1
ip-10-0-0-9.ap-northeast-1.compute.internal     Ready    master   27m    v1.19.3+vmware.1
ip-10-0-2-231.ap-northeast-1.compute.internal   Ready    master   24m    v1.19.3+vmware.1
ip-10-0-2-27.ap-northeast-1.compute.internal    Ready    <none>   29m    v1.19.3+vmware.1
ip-10-0-4-244.ap-northeast-1.compute.internal   Ready    master   84s    v1.19.3+vmware.1
ip-10-0-4-88.ap-northeast-1.compute.internal    Ready    <none>   104s   v1.19.3+vmware.1

This confirms that in an AWS environment, starting with TKG v1.2.1, both Control Plane and Worker nodes recover automatically when their instances fail.
