Cost Optimization and EKS

Cloud Journey
Oct 31, 2021

Overview

In this article, we will explore AWS EKS cost optimization using Cluster Autoscaler and EC2 Spot Instances.

Spot Instances are spare Amazon EC2 capacity that allows customers to save up to 90% over On-Demand prices. Spot capacity is split into pools determined by instance type, Availability Zone (AZ), and AWS Region.
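As a quick illustration of how pricing varies per pool, the current Spot price for a given instance type and Availability Zone can be checked with the AWS CLI (assuming the CLI is installed and configured; the instance type and AZ below are just examples):

aws ec2 describe-spot-price-history \
  --instance-types t3a.small \
  --availability-zone us-east-1a \
  --product-descriptions "Linux/UNIX" \
  --max-items 1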

System Diagram

The idea is to create an AWS EKS cluster with three node groups: two backed by Spot Instances and one backed by EC2 On-Demand Instances. Cluster Autoscaler runs on the On-Demand node group, and the Node Termination Handler runs on all instances.

Cluster Autoscaler

An open-source tool that automatically scales EC2 instances based on the pods running in the cluster: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler

AWS Node Termination Handler

An open-source tool that detects EC2 Spot interruptions and automatically drains the affected nodes: https://github.com/aws/aws-node-termination-handler

Getting Started with the Lab Environment

Install eksctl

Provision an Ubuntu EC2 instance to run the eksctl CLI tool. When connecting to it over SSH, use the username ubuntu.

ubuntu@ip-172-31-18-248:~$ curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
ubuntu@ip-172-31-18-248:~$ sudo mv /tmp/eksctl /usr/local/bin
ubuntu@ip-172-31-18-248:~$ eksctl
The official CLI for Amazon EKS
Usage: eksctl [command] [flags]

Commands:
eksctl anywhere EKS anywhere
eksctl associate Associate resources with a cluster
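A quick sanity check that the binary is on the PATH and reasonably recent:

ubuntu@ip-172-31-18-248:~$ eksctl version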

EKS Provisioning

Since this lab is purely experimental and won't run any real workload, we don't need much computing power. To save cost, I select t3a.small as the node type, at a unit price of $0.0188/hour.

For the cluster version, I select the latest release at the time of writing, 1.21.

To authenticate to your AWS account, install the AWS CLI, then configure it with "aws configure". After that, we can provision EKS using eksctl.
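If the AWS CLI is not installed yet, a minimal sketch of the install and configure steps on the Ubuntu box looks like this (assuming unzip is available; the IAM user needs permissions for EKS, EC2, CloudFormation, and IAM):

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure   # prompts for access key ID, secret access key, default region, and output format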

eksctl create cluster --version=1.21 --name=spotcluster-eksctl --node-private-networking --managed --nodes=2 --alb-ingress-access --region=us-east-1 --node-type t3a.small --node-labels="lifecycle=OnDemand" --asg-access

Two CloudFormation stacks are created, and the CloudFormation console's visual designer generates a diagram for each stack automatically.
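Once the stacks finish creating, the cluster and its initial node group can also be verified from the CLI (a quick check using eksctl itself):

eksctl get cluster --region us-east-1
eksctl get nodegroup --cluster spotcluster-eksctl --region us-east-1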

Cluster stack

Core Resource: EKS control plane

Networking Resources:
- Security group shared by all nodes in the cluster
- Security group between the control plane and worker nodes
- Ingress Cluster To Node SG rule
- Ingress Inter Node Group SG rule
- Ingress Node To Cluster SG rule
- Internet gateway (IGW)
- VPC
- Two public subnets and two private subnets
- IGW attachment to the VPC
- NAT Elastic IP (NATIP)
- NAT gateway in public subnet 1F
- Private route table us-east-1c
- Private route table us-east-1f
- NAT private subnet route us-east-1c
- NAT private subnet route us-east-1f
- Public route table
- Public subnet routes
- Route table and subnet associations

IAM Resources:
- Service role (AWS managed policies AmazonEKSClusterPolicy and AmazonEKSVPCResourceController, plus inline policies)
- IAM policy for CloudWatch
- IAM policy for ELB

NodeGroup Stack

List of Resources:
- Node Instance Role (four AWS managed IAM policies and two inline policies)
- Load Balancer Controller IAM policy
- Auto Scaling IAM policy
- Launch Template
- Managed NodeGroup

Validation

To verify connectivity to the new nodes, install and run kubectl.

ubuntu@ip-172-31-18-248:~$ sudo snap install kubectl --classic
kubectl 1.22.3 from Canonical✓ installed

ubuntu@ip-172-31-18-248:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-118-216.ec2.internal Ready <none> 53m v1.21.4-eks-033ce7e
ip-192-168-65-200.ec2.internal Ready <none> 53m v1.21.4-eks-033ce7e

Add Spot Instance NodeGroup

Cluster Autoscaler requires all instances within a node group to share the same number of vCPUs and amount of RAM. To adhere to Spot Instance best practices and maximize diversification, we use multiple node groups. Each of these node groups is a mixed-instance Auto Scaling group with the capacity-optimized Spot allocation strategy.

To save cost, I select recent-generation, low-end instance types. minSize is 0, meaning all instances in a group can be shut down. Below is spotinstances-ng.yml.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spotcluster-eksctl
  region: us-east-1
nodeGroups:
  - name: ng-small-spot
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
    instancesDistribution:
      instanceTypes: ["t3a.small","t4g.small"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      intent: apps
      aws.amazon.com/spot: "true"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
      k8s.io/cluster-autoscaler/node-template/label/intent: apps
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true
  - name: ng-medium-spot
    minSize: 0
    maxSize: 5
    desiredCapacity: 1
    instancesDistribution:
      instanceTypes: ["t3a.medium","t4g.medium"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: capacity-optimized
    labels:
      lifecycle: Ec2Spot
      intent: apps
      aws.amazon.com/spot: "true"
    tags:
      k8s.io/cluster-autoscaler/node-template/label/lifecycle: Ec2Spot
      k8s.io/cluster-autoscaler/node-template/label/intent: apps
    iam:
      withAddonPolicies:
        autoScaler: true
        albIngress: true

After the two node groups are created, two more EC2 instances are added, one per node group, since we set desiredCapacity to 1.

eksctl create nodegroup -f spotinstances-ng.yml

ubuntu@ip-172-31-18-248:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-118-216.ec2.internal Ready <none> 98m v1.21.4-eks-033ce7e
ip-192-168-32-103.ec2.internal Ready <none> 78s v1.21.4-eks-033ce7e
ip-192-168-61-77.ec2.internal Ready <none> 97s v1.21.4-eks-033ce7e
ip-192-168-65-200.ec2.internal Ready <none> 98m v1.21.4-eks-033ce7e

Furthermore, describing one of the new EC2 instances shows that it is a Spot Instance.

ubuntu@ip-172-31-18-248:~$ aws ec2 describe-instances --instance-ids i-0767abe5db3db4ff1 | grep InstanceLifecycle
"InstanceLifecycle": "spot",
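To also confirm that the Spot node groups use the capacity-optimized allocation strategy, the underlying Auto Scaling groups can be inspected (a sketch; the ASG names generated by eksctl contain the node group name, so the filter below is an assumption about naming):

aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName, 'ng-small-spot')].MixedInstancesPolicy.InstancesDistribution"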

Node Termination Handler

Install the termination handler on all four instances (note: to find the latest version, refer to https://github.com/aws/aws-node-termination-handler#installation-and-configuration-1).

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.14.0/all-resources.yaml

kubectl get daemonsets --all-namespaces
NAMESPACE     NAME                               DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR              AGE
kube-system aws-node 4 4 4 4 4 <none> 130m
kube-system aws-node-termination-handler 4 4 4 4 4 kubernetes.io/os=linux 26s
kube-system aws-node-termination-handler-win 0 0 0 0 0 kubernetes.io/os=windows 26s
kube-system kube-proxy 4 4 4 4 4 <none> 130m
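To confirm the handler is actually watching for interruption notices, one can tail the logs of one of its pods through the DaemonSet (exact log wording varies by version):

kubectl -n kube-system logs daemonset/aws-node-termination-handler --tail=20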

Cluster Autoscaler

Download the Cluster Autoscaler configuration file:

curl -o cluster-autoscaler-autodiscover.yaml https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

Edit the configuration file: replace the placeholder with your cluster name, add two options (--balance-similar-node-groups and --skip-nodes-with-system-pods=false), and change the expander to random, as in the snippet below.

spec:
  priorityClassName: system-cluster-critical
  securityContext:
    runAsNonRoot: true
    runAsUser: 65534
    fsGroup: 65534
  serviceAccountName: cluster-autoscaler
  containers:
    - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
      name: cluster-autoscaler
      resources:
        limits:
          cpu: 100m
          memory: 600Mi
        requests:
          cpu: 100m
          memory: 600Mi
      command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=random
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/spotcluster-eksctl
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false

Deploy Cluster Autoscaler and annotate the deployment as not safe to evict, so the autoscaler does not evict its own pod during scale-down.

kubectl apply -f cluster-autoscaler-autodiscover.yaml

ubuntu@ip-172-31-18-248:~$ kubectl -n kube-system annotate deployment.apps/cluster-autoscaler cluster-autoscaler.kubernetes.io/safe-to-evict="false"
deployment.apps/cluster-autoscaler annotated

Check the latest release number at https://github.com/kubernetes/autoscaler/releases and set the Cluster Autoscaler image tag accordingly. Since my cluster is Kubernetes v1.21, I use autoscaler v1.21.1.

kubectl -n kube-system set image deployment.apps/cluster-autoscaler cluster-autoscaler=us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.1

deployment.apps/cluster-autoscaler image updated

Check the deployment status and view the Cluster Autoscaler logs:

ubuntu@ip-172-31-18-248:~$ kubectl get deployments --namespace kube-system
NAME READY UP-TO-DATE AVAILABLE AGE
cluster-autoscaler 1/1 1 1 0h46m
~$ kubectl -n kube-system logs -f deployment.apps/cluster-autoscaler
......
I1031 20:09:02.520520 1 leaderelection.go:243] attempting to acquire leader lease kube-system/cluster-autoscaler...
I1031 20:09:02.525927 1 leaderelection.go:346] lock is held by cluster-autoscaler-6bdddb89bc-pvbxw and has not yet expired
I1031 20:09:02.525973 1 leaderelection.go:248] failed to acquire lease kube-system/cluster-autoscaler

I1031 20:09:19.950980 1 leaderelection.go:253] successfully acquired lease kube-system/cluster-autoscaler
I1031 20:09:19.951086 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"e49cb7ad-436a-45f0-9200-ddd013f928d2", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"25009", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-6bdddb89bc-cwd28 became leader
I1031 20:09:19.957454 1 reflector.go:219] Starting reflector *v1.Pod (1h0m0s) from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
I1031 20:09:19.957492 1 reflector.go:255] Listing and watching *v1.Pod from k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:188
......

After some time, the EC2 Spot Instances in the two Spot node groups are all terminated, since they are idle and minSize is 0.
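The scale-down decisions show up in the Cluster Autoscaler logs; grepping for scale-down related lines is a convenient filter (the exact message wording depends on the autoscaler version):

kubectl -n kube-system logs deployment/cluster-autoscaler | grep -i "scale.down"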

Validate All Together

Below is web-app.yaml, which requests 0.5 vCPU and 512 MiB of memory per pod.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stateless
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        service: nginx
        app: nginx
    spec:
      containers:
        - image: nginx
          name: web-stateless
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 512Mi
      nodeSelector:
        lifecycle: Ec2Spot
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-stateful
spec:
  replicas: 2
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        service: redis
        app: redis
    spec:
      containers:
        - image: redis:3.2-alpine
          name: web-stateful
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 512Mi
      nodeSelector:
        lifecycle: OnDemand

Deploy the application and confirm the deployments are running. It takes a little while for the stateless deployment to become available, because EC2 Spot Instances are created and brought up in response to the deployment.

There are five running pods.

kubectl apply -f web-app.yaml

ubuntu@ip-172-31-18-248:~$ kubectl get deployment/web-stateless
NAME READY UP-TO-DATE AVAILABLE AGE
web-stateless 3/3 3 3 3m26s
ubuntu@ip-172-31-18-248:~$ kubectl get deployment/web-stateful
NAME READY UP-TO-DATE AVAILABLE AGE
web-stateful 2/2 2 2 3m35s
ubuntu@ip-172-31-18-248:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-stateful-5c6b49b9ff-cd9cd 1/1 Running 0 13m
web-stateful-5c6b49b9ff-krkcx 1/1 Running 0 13m
web-stateless-758b795b85-hzd68 1/1 Running 0 13m
web-stateless-758b795b85-l4f5n 1/1 Running 0 13m
web-stateless-758b795b85-pldm6 1/1 Running 0 13m
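To confirm the nodeSelector placed the stateless pods on Spot nodes and the stateful pods on the On-Demand node group, the node assignment and the lifecycle label can be cross-checked (a quick verification; the label values come from the node group configs above):

kubectl get pods -o wide
kubectl get nodes -L lifecycle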

Let's scale out the stateless application:

kubectl scale --replicas=20 deployment/web-stateless

Pods are pending:

ubuntu@ip-172-31-18-248:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-stateful-5c6b49b9ff-cd9cd 1/1 Running 0 15m
web-stateful-5c6b49b9ff-krkcx 1/1 Running 0 15m
web-stateless-758b795b85-2g96g 0/1 Pending 0 6s
web-stateless-758b795b85-2nq4h 0/1 Pending 0 6s
web-stateless-758b795b85-2rdgl 0/1 Pending 0 6s
web-stateless-758b795b85-2rjgq 0/1 Pending 0 6s
web-stateless-758b795b85-2smm8 0/1 Pending 0 6s
web-stateless-758b795b85-2zvpd 0/1 Pending 0 6s
web-stateless-758b795b85-755dk 0/1 Pending 0 6s
......
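While the pods sit in Pending, the Cluster Autoscaler reacts by scaling out the Spot node groups; the triggered scale-ups appear as events on the pending pods (event wording may differ slightly between versions):

kubectl get events --sort-by=.metadata.creationTimestamp | grep -i TriggeredScaleUp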

Both Spot node groups are scaled out to their maximum of five instances each, giving 12 instances altogether.

ubuntu@ip-172-31-18-248:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
ip-192-168-118-216.ec2.internal Ready <none> 5h59m v1.21.4-eks-033ce7e
ip-192-168-15-72.ec2.internal Ready <none> 3m52s v1.21.4-eks-033ce7e
ip-192-168-17-203.ec2.internal Ready <none> 3m49s v1.21.4-eks-033ce7e
ip-192-168-2-217.ec2.internal Ready <none> 19m v1.21.4-eks-033ce7e
ip-192-168-21-165.ec2.internal Ready <none> 4m9s v1.21.4-eks-033ce7e
ip-192-168-34-106.ec2.internal Ready <none> 4m2s v1.21.4-eks-033ce7e
ip-192-168-40-119.ec2.internal Ready <none> 3m48s v1.21.4-eks-033ce7e
ip-192-168-45-236.ec2.internal Ready <none> 19m v1.21.4-eks-033ce7e
ip-192-168-47-163.ec2.internal Ready <none> 3m38s v1.21.4-eks-033ce7e
ip-192-168-61-183.ec2.internal Ready <none> 4m4s v1.21.4-eks-033ce7e
ip-192-168-63-92.ec2.internal Ready <none> 3m40s v1.21.4-eks-033ce7e
ip-192-168-65-200.ec2.internal Ready <none> 5h59m v1.21.4-eks-033ce7e

Now more pods are running:

ubuntu@ip-172-31-18-248:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
web-stateful-5c6b49b9ff-cd9cd 1/1 Running 0 32m
web-stateful-5c6b49b9ff-krkcx 1/1 Running 0 32m
web-stateless-758b795b85-2nq4h 1/1 Running 0 17m
web-stateless-758b795b85-2rdgl 1/1 Running 0 17m
web-stateless-758b795b85-2rjgq 1/1 Running 0 17m
web-stateless-758b795b85-2zvpd 1/1 Running 0 17m
web-stateless-758b795b85-755dk 1/1 Running 0 17m
web-stateless-758b795b85-b8kvw 1/1 Running 0 17m
web-stateless-758b795b85-bt497 1/1 Running 0 17m
......

The termination handler is running on all 12 instances:

ubuntu@ip-172-31-18-248:~$ kubectl get daemonsets --all-namespaces
NAMESPACE NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
kube-system aws-node 12 12 12 12 12 <none> 6h43m
kube-system aws-node-termination-handler 12 12 12 12 12 kubernetes.io/os=linux 4h33m
kube-system aws-node-termination-handler-win 0 0 0 0 0 kubernetes.io/os=windows 4h33m
kube-system kube-proxy 12 12 12 12 12 <none> 6h43m

Clean-up

kubectl delete daemonset aws-node-termination-handler -n kube-system
eksctl delete nodegroup ng-small-spot --cluster spotcluster-eksctl
eksctl delete nodegroup ng-medium-spot --cluster spotcluster-eksctl
eksctl delete cluster --name spotcluster-eksctl

By the way, I kept the lab environment for half a day; the total cost was $1.37, including EKS and EC2 charges, which is quite affordable for blog purposes.
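As a rough back-of-envelope check (assuming roughly 10 hours of runtime): the EKS control plane is billed at $0.10 per hour, or about $1.00, and the two t3a.small On-Demand nodes at $0.0188 per hour add roughly another $0.38; the Spot nodes only ran briefly, so the observed $1.37 is in the expected range.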

References

EC2 On-Demand Instance Pricing — Amazon Web Services

Install and Set Up kubectl on Linux | Kubernetes

Cluster Autoscaler — Amazon EKS

https://github.com/aws/aws-node-termination-handler#installation-and-configuration-1


Cloud Journey

All blogs are strictly personal and do not reflect the views of my employer. https://github.com/Ronnie-personal