

Karpenter Nodes maxPods not working with Bottlerocket despite enabling VPC-Prefix


Problem

- Bottlerocket nodes' max pod capacity is lower than it should be
- Karpenter-provisioned Bottlerocket nodes do not respect VPC prefix delegation
- Max pods on a .large instance stays at 29 despite enabling VPC prefix delegation
- .large instances only have capacity for 29 pods instead of 110 with prefix delegation

Solution

Assuming VPC prefix delegation is enabled with the correct VPC CNI version, there are two main reasons for this: 1. The provisioned instance is not a Nitro instance. 2. Bottlerocket requires max pods to be overridden.
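
Prefix delegation itself is controlled by the ENABLE_PREFIX_DELEGATION setting of the VPC CNI (the aws-node DaemonSet in kube-system), normally set through the EKS add-on configuration. If you need to confirm it is on, the relevant container env excerpt (not a complete manifest) looks like this:

# Excerpt of the aws-node container env in the aws-node DaemonSet (kube-system)
env:
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"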

Explanation

VPC prefix delegation works only on AWS Nitro-based instances. These come with dedicated hardware for handling network traffic and support VPC prefix delegation. Older Xen-based instance families like M4 and C4 do not have this support, so you must configure Karpenter to provision Nitro-based instances only. This is defined in the NodePool configuration:

karpenter.k8s.aws/instance-hypervisor: nitro

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: primary-nodepool
spec:
  template:
    spec:
      requirements:
        - key: "karpenter.k8s.aws/instance-hypervisor"
          operator: In
          values: ["nitro"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]

The second part of the problem relates to the Bottlerocket AMI. When Karpenter provisions the node, it does not run a script to check whether prefix delegation is enabled, so max pods is calculated incorrectly. A c6i.large, for example, will show as supporting a maximum of 29 pods.

The solution is to update the kubelet configuration to override the default value. With Karpenter v1, the fields for configuring the kubelet were moved from the NodePool spec to the EC2NodeClass spec, so that other Karpenter providers are not required to support them.

Update kubelet with maxPods: 110

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  kubelet:
    maxPods: 110
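
For context, a fuller Bottlerocket EC2NodeClass might look like the sketch below. The AMI alias, role name and discovery tags are placeholders and will differ per environment:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: bottlerocket
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest              # Bottlerocket AMI family
  role: "KarpenterNodeRole-my-cluster"        # placeholder: Karpenter node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"  # placeholder: subnet discovery tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"  # placeholder: security group discovery tag
  kubelet:
    maxPods: 110                              # override so prefix delegation capacity is used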

Complete EKS cluster [Terraform]

Getting started with creating a functional EKS cluster from scratch can be challenging, as it requires some specific settings. While the EKS module will create a new cluster, it does not address how you will expose an application, the tags required for subnets, the number of pod IP addresses, etc.

🖥 EKS cluster using Terraform contains everything required for you to spin up a new cluster and expose an application via an Application Load Balancer. All you need to do is apply the Terraform code.

Source Code, Sample app

  • [x] VPC with 2 private and 2 public zones
  • [x] EKS cluster with Managed NodeGroup (1 Node)
  • [x] VPC CNI add-on with prefix delegation
  • [x] AWS Loadbalancer controller

EKS secrets as env variable with CSI driver


By default, the CSI driver is configured to mount secrets as files. However, it is possible to expose secrets as environment variables using the method below. When the pod is started, the driver creates a Kubernetes Secret that can be consumed as an environment variable. The Secret object only exists while the pod is active.

Source code Aws-Eks-SecretsManager

Configuring the AWS Secrets Manager and Config Provider

Scenario

  • You have configured a Secret called MySecret with data username and password
  • The necessary policy has been created in AWS to allow access to the Secret
  • An IAM serviceAccount called 'nginx-deployment-sa' has been created with the policy attached (a sketch of such a ServiceAccount follows)
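
As a sketch, an IRSA-enabled ServiceAccount for that last point could look like the following; the role ARN is a placeholder for the IAM role that carries the Secrets Manager read policy:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx-deployment-sa
  namespace: default
  annotations:
    # placeholder ARN: IAM role with the Secrets Manager read policy attached
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/nginx-secretsmanager-role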

Creating K8s Secrets from AWS Secrets

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: aws-secret-to-k8s-secret
  namespace: default
spec:
  provider: aws
  parameters:
    objects: |
        - objectName: "MySecret"
          objectType: "secretsmanager"
          objectAlias: mysecret
          jmesPath:
            - path: "username"
              objectAlias: "Username"
  secretObjects:
    - secretName: myusername
      type: Opaque
      data: 
        - objectName: "Username"
          key: "username"
    - secretName: myk8ssecret
      type: Opaque
      data: 
        - objectName: "mysecret"
          key: "mysecret"
spec.parameters
  • objectName: Name of the secret object in the secret store
  • objectAlias: Optional alias name for the secretObject
  • jmesPath.path: Name of the specific field within the secret to be exposed
  • jmesPath.objectAlias: Alias name for that field
spec.secretObjects
  • secretName: Name of the secret to be created in k8s
  • data.objectName: Name of the secretObject/alias to retrieve data from
  • key: Name of the key within the k8s secret used for storing the retrieved data

The above configuration will create a k8s Secret called 'myusername' containing the value of username under the key 'username'. The k8s Secret 'myk8ssecret' will contain the whole MySecret object under the key 'mysecret'.
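
To tie it together, a minimal Deployment sketch that mounts the SecretProviderClass above and exposes the synced 'myusername' Secret as an environment variable could look like this (the Deployment name, labels and image are illustrative; the CSI volume must be mounted for the Secret sync to happen):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      serviceAccountName: nginx-deployment-sa   # IAM serviceAccount from the scenario
      containers:
        - name: nginx
          image: nginx:1.27                     # illustrative image
          env:
            - name: USERNAME
              valueFrom:
                secretKeyRef:
                  name: myusername              # k8s Secret synced by the CSI driver
                  key: username
          volumeMounts:
            - name: secrets-store               # mounting the CSI volume triggers the sync
              mountPath: /mnt/secrets
              readOnly: true
      volumes:
        - name: secrets-store
          csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: aws-secret-to-k8s-secret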

EKS: avoid errors and timeouts during deployment (ALB)

Scenario

EKS cluster configured with an Application Load Balancer. During deployments, pods become unhealthy in the target group for a short while, causing a brief outage.

target group status

Root cause

There are two possible reasons for this scenario, and both must be addressed: 1. The ALB takes time to register and health-check new pods. 2. The ALB is slow to detect and drain terminated pods.

Solution

Enable pod readiness Gate

Configure a Pod readiness gate to indicate that a pod is registered to the ALB/NLB and healthy to receive traffic. This ensures the new pod is healthy in the target group before the old pod is terminated.

To enable the Pod readiness gate, add the label elbv2.k8s.aws/pod-readiness-gate-inject: enabled to the application's Namespace. The change takes effect for any new pod being deployed.

apiVersion: v1
kind: Namespace
metadata:
  name: my-app            # placeholder: the namespace your application is deployed to
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled

Pod lifecycle preStop

When a pod is terminated, it can take a couple of seconds for the ALB to pick up the change and start draining connections. By that time, the pod has most likely already been terminated by Kubernetes. The solution to this issue is a workaround: add a lifecycle hook to the pod to ensure pods are de-registered before termination.

    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: app                     # placeholder: your application container
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 60"]

Adjust the ALB target group de-registration time to be smaller than the preStop sleep time by adding the annotation deregistration_delay.timeout_seconds:

ingress:
  enabled: true
  className: "alb"
  annotations: 
    alb.ingress.kubernetes.io/scheme: internet-facing
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30