Best Practices for running optimized Kubernetes applications on GKE

The Forrester Wave™: Public Cloud Container Platforms, Q1 2022:
"Google Cloud is the best fit for firms that want extensive cutting-edge cloud-native capabilities for distributed workloads spanning public cloud, private cloud, and multicloud environments." [report and blog post]
01 Understanding GKE options
We support the 4 scalability dimensions

Workload:
● Horizontal Pod Autoscaler (HPA): more Pods
● Vertical Pod Autoscaler (VPA): different Pod size

Infrastructure:
● Cluster Autoscaler (CA): more nodes
● Node Auto-Provisioning (NAP): different node size
In GKE Standard, tuning these dimensions is where you can save money. In GKE Autopilot, the infrastructure dimensions are a Google responsibility.
HPA - Horizontal Pod Autoscaler
● Target Utilization:
  ○ apiVersion: autoscaling/v1: only CPU utilization
  ○ apiVersion: autoscaling/v2beta2 (recommended): multiple metrics, even custom ones (e.g. requests per second)
● Indicated for stateless workloads that need to handle growth or spikes in demand
  ○ E.g. 30% growth in traffic in a couple of minutes
HPA - Buffer size
● Small buffer:
  ○ Prevents early scale-ups
  ○ But can overload your application during spikes
● Big buffer:
  ○ Causes resource waste, increasing the cost of your bill
● The buffer needs to be big enough to handle requests for two or three minutes during a spike.
For more information, see Configuring a Horizontal Pod Autoscaler.
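A minimal sketch of an HPA that follows the guidance above: the 70% CPU target leaves roughly a 30% buffer for spikes. The name and Deployment are illustrative assumptions:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend              # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # 70% target keeps ~30% headroom for spikes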
PUBLIC PREVIEW
HPA - Load balancer-based autoscaling
As of GKE 1.22, Gateways are "capacity-aware":
● They monitor the incoming traffic load relative to the Service's capacity and can autoscale Pods or spill traffic over to other clusters when load exceeds capacity.
[Diagram: autoscaler behavior with Pods serving 10 rps each against a 70% utilization target]
Best practice: always configure a secondary target, like CPU.
See Gateway traffic management and Autoscaling based on load balancer traffic.
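A sketch of such a configuration, with CPU as the secondary target. The object metric name below is an assumption to verify against the linked docs, and all resource names are illustrative:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: store-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: store                 # hypothetical Deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Object                # traffic relative to the Service's declared capacity
    object:
      describedObject:
        apiVersion: v1
        kind: Service
        name: store
      metric:
        name: "autoscaling.googleapis.com|gclb-capacity-utilization"  # assumed metric name; verify in the docs
      target:
        type: AverageValue
        averageValue: "70"      # ~70% of per-Pod capacity; verify units in the docs
  - type: Resource              # secondary target, as recommended above
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60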
VPA - Vertical Pod Autoscaler
● Indicated for stateless workloads not handled by HPA, where traffic can be evenly distributed between the replicas
● Initial/Auto mode is not indicated for StatefulSets, nor if you need to handle sudden growth in traffic
  ○ Use HPA instead.
● Modes:
  ○ Off: recommendation only
  ○ Initial: apply at Pod creation, do not restart
  ○ Auto: apply and restart
● Be careful when enabling Auto mode:
  ○ The application will restart; can it tolerate that?
    ■ Usually stateful applications can not
  ○ Help VPA out by setting
    ■ Pod disruption budgets
    ■ min & max container sizes
● Deeply integrated with Cluster Autoscaler
For more information, see Configuring Vertical Pod Autoscaling.
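A minimal sketch of a VPA in recommendation mode with min and max container sizes set, per the guidance above (names and bounds are illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: backend-vpa             # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend               # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"           # recommendations only; "Initial" applies at creation, "Auto" restarts Pods
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: 2
        memory: 2Gi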
Mixing HPA and VPA
Never mix HPA and VPA on either CPU or memory.
Two options are safe to implement:
● HPA using custom metrics
● VPA in recommendation mode
In either case, make sure your application is always running over the HPA min-replicas.
CA - Cluster Autoscaler
Scaling up:
● Indicated for: whenever you are using either HPA or VPA
● Optimized for the cost of infrastructure, based on scheduling simulations and declared Pod requests
Scaling down:
● Covered by the autoscaling profiles on the next slides
CA Profiles - Different scheduling
● Balanced profile - TL;DR: prefer less-used nodes, to keep all nodes, well... balanced.
● Optimize-utilization profile - TL;DR: prefer the most-used nodes, to leave nodes with fewer requested resources that can be scaled down quickly.
[Diagram: a pending application Pod being scheduled onto a different node under each profile]
For more information, see Autoscaling profiles.
Disclaimer: Google may fine-tune profiles to improve reliability or the desired profile behavior.
CA Profiles - Different scale downs
● Balanced profile - TL;DR: maximize availability.
  ○ Runs every: 10m
  ○ Scales down: nodes < 50% resource requests
  ○ Each run deletes empty nodes and evicts Pods from nodes flagged for scale down: 20m to delete 2 nodes.
● Optimize-utilization profile - TL;DR: remove more nodes, faster.
  ○ Runs every: 1m
  ○ Scales down: nodes < 85% resource requests
  ○ Each run deletes empty nodes and evicts Pods from nodes flagged for scale down: 3m to delete 3 nodes.
For more information, see Autoscaling profiles.
Disclaimer: Google may fine-tune profiles to improve reliability or the desired profile behavior.
CA Profiles and Location Policy
● Use the optimize-utilization profile for a more aggressive scale down.
● Use location-policy=ANY (1.24+) to reduce preemption and speed up autoscaling in edge cases:
  ○ Default for Spot node pools.
  ○ CA queries the Recommend Locations API for the optimal VM locations.
    ■ It prioritizes the user's reservations and picks the zones with the most unused capacity, reducing preemption considerably.
    ■ Coming soon: an algorithm that uses signals to minimize the risk of preemption.
  ○ Prefer setting --total-max-nodes when using this location policy.
For more information, see Autoscaling a cluster and Learn how to analyse Cluster Autoscaler events in the logs.
Bin Packing
● Make sure your workload fits well inside the machine size (CPU, memory, ports).
● You can create multiple node pools and use a nodeSelector based on t-shirt sizes to select which node your Pod must run on (see the sketch below).
● Another, simpler option is to configure NAP.
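A sketch of the t-shirt-size approach, assuming node pools labeled with a hypothetical pool-size label:

apiVersion: v1
kind: Pod
metadata:
  name: medium-workload         # hypothetical name
spec:
  nodeSelector:
    pool-size: medium           # hypothetical label applied to the "medium" node pool
  containers:
  - name: app
    image: nginx
    resources:
      requests:                 # sized to pack cleanly into the pool's machine type
        cpu: "1"
        memory: 2Gi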
What happens when you scale up (workload only)
●
Download Container Image = Mounting Volume → Fetching Image
●
App Startup = Init Containers Startup → Regular Containers Startup → Health Checks pass
What happens when you scale up (infra and workload)
GKE image streaming
For fast image downloading during autoscaling.
Regular image pulling:
1. Container image is downloaded to the VM
2. Application starts
GKE image streaming:
1. Application starts; files are streamed from a remote filesystem on demand
GKE image streaming runs a mysql container (~140MB image) in 3s (vs 16s with the regular image pulling process).
For more information read GKE image streaming for fast application startup and autoscaling.
HPA - Scheduled & Predictive Autoscaling EAP
GKE uses machine learning to predict which large traffic spikes are likely to occur and preemptively scales your workloads to handle the anticipated traffic.
Ask your account team for more information.
HPA - Scheduled & Predictive Autoscaling EAP
Use when:
● Strong spikes and/or deep valleys
● Predictable traffic ebb and flow:
  ○ Scale down to a minimum during off-peak hours
  ○ Increase capacity ahead of known events
  ○ Warm up staging clusters for load tests
Don't use when:
● Current spikes can be handled by HPA and/or pause Pods without big overprovisioning
● You purchased CUDs for your entire fleet
[Diagram: CPU utilization vs. CUD commitment under each strategy]
● HPA + CA (possibly with pause Pods): keep your cluster warm
● Consider the scheduled autoscaler:
  ○ Combined with pause Pods if unsure about which workload will scale up
  ○ For batch workloads
For more information, see Reducing costs by scaling down GKE clusters during off-peak hours.
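The pause-Pod pattern mentioned above uses low-priority placeholder capacity: when real workloads need room, the pause Pods are preempted and go Pending, which triggers CA to add nodes ahead of demand. A minimal sketch, with illustrative names and sizes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning        # hypothetical name
value: -10                      # lower than any real workload, so these Pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause-pods              # hypothetical name
spec:
  replicas: 2                   # size of the warm buffer
  selector:
    matchLabels:
      app: pause-pods
  template:
    metadata:
      labels:
        app: pause-pods
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause
        resources:
          requests:             # reserve roughly one real Pod's worth of capacity each
            cpu: "1"
            memory: 1Gi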
Review your logging and monitoring strategies
● Cloud Logging and Cloud Monitoring are pay-as-you-go systems.
● The more you log and the longer you keep the logs, the more you pay.
  ○ Exclude logs to avoid ingestion costs.
    ■ However, the best place to do this filtering is in the application itself.
    ■ Note: if you create new buckets and routers, exclude the routed logs from _Default, otherwise you will be charged twice.
  ○ Monitor log bytes per application.
  ○ Create an alerting system for logs.
See Cost optimization for Cloud Logging, Cloud Monitoring, and Application Performance Management.
Review inter-region egress traffic in regional and multi-zonal clusters
● Both regional and multi-zonal clusters need to pay for egress traffic between the zones.
● Suggestions:
  ○ Deploy development clusters in a single zone.
  ○ Enable GKE usage metering and its network egress agent (not enabled by default).
  ○ Consider inter-pod affinity/anti-affinity to co-locate dependent Pods from different services on the same nodes or in the same availability zone. Note this can compromise HA.
  ○ In GKE 1.24 you will be able to favor traffic to local zones using TopologyAwareHints (see the sketch after this list). However, you may face some issues if the traffic is all concentrated in one zone.
    ■ For example: HPA may not realize it needs to create new Pods because only the Pods in one zone (33%) are running out of CPU, while all the others (66%) are running idle.
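A sketch of enabling TopologyAwareHints on a Service via the upstream Kubernetes annotation (the Service name and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: backend                 # hypothetical Service
  annotations:
    service.kubernetes.io/topology-aware-hints: "auto"   # prefer same-zone endpoints where safe
spec:
  selector:
    app: backend
  ports:
  - port: 80
    targetPort: 8080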
Batch workloads
● Usually concerned with the work being done eventually, and commonly tolerant to some latency at job startup time.
● Separate batch workloads into different node pools. Reasoning:
  ○ Cluster Autoscaler can delete empty nodes faster when it doesn't need to restart Pods.
  ○ Consider configuring CA's optimize-utilization profile to speed up scale-down activities even more.
  ○ Some Pods cannot be restarted, so they permanently block the scale-down of their nodes.
● Use Node auto-provisioning to automatically create dedicated node pools for jobs (see the sketch below).
● Make sure your long-running jobs save checkpoints, since they can be evicted during CA downscaling.
See Optimizing resource usage in a multi-tenant GKE cluster using node auto-provisioning tutorial.
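A sketch of a Job aimed at a dedicated batch pool: it requests a label and tolerates a taint that no existing pool carries, which lets NAP create a matching, tainted node pool (key names are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: report-job              # hypothetical name
spec:
  template:
    spec:
      nodeSelector:
        dedicated: batch        # hypothetical label; NAP can create a matching node pool
      tolerations:
      - key: dedicated
        operator: Equal
        value: batch
        effect: NoSchedule      # keeps serving workloads off the batch nodes
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing && sleep 60"]   # placeholder workload
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
      restartPolicy: Never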
Serving workloads
● Must respond as quickly as possible to bursts or spikes.
● Problems in handling such spikes are commonly related to one or more of the following reasons:
  ○ Applications not ready to run on top of Kubernetes. For example, apps with big image sizes, slow startup times, non-optimal Kubernetes configurations, etc.
  ○ Applications depending on infrastructure that takes some time to be provisioned. For example, GPUs.
  ○ Autoscalers and overprovisioning not being appropriately set.
● In an attempt to "fix" the instability caused by these short-lived traffic peaks, companies tend to overprovision their clusters. Although this may work in some cases, it is not a cost-effective solution.
02 Reliable Kubernetes applications
Understanding your application capacity
● Per-Pod capacity:
  ○ Isolate a single application Pod replica with autoscaling off, then execute tests simulating a real usage load. Can your app support this?
● Per-app capacity:
  ○ Configure your Cluster Autoscaler, container replicas, resource requests and limits, and either Horizontal Pod Autoscaler or Vertical Pod Autoscaler (or use VPA in recommendation mode).
  ○ Stress your application harder to simulate sudden bursts or spikes.
● To eliminate latency concerns, these tests must run from the same region/zone the application runs in on Google Cloud.
See Distributed load testing using Google Kubernetes Engine tutorial.
Make sure your application can grow both vertically and horizontally
● Make sure your application can handle spikes by adding more CPU and memory, or by increasing the number of Pod replicas.
● This gives you the flexibility to experiment with what fits your application better.
● Unfortunately, some applications are single-threaded or architecturally limited to a fixed number of workers/sub-processes.
Understanding QoS

Best Effort (no requests or limits) - the Pod can be killed and marked as Failed at any time:

apiVersion: v1
...
spec:
  containers:
  - name: qos-demo-3-ctr
    image: nginx

Burstable (requests < limits) - the Pod can be killed and marked as Failed when the memory used is bigger than requested:

apiVersion: v1
...
spec:
  containers:
  - name: qos-demo-3-ctr
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 100Mi
      limits:
        cpu: 1
        memory: 200Mi

Guaranteed (requests == limits) - the Pod is guaranteed to run:

apiVersion: v1
...
spec:
  containers:
  - name: qos-demo-3-ctr
    image: nginx
    resources:
      requests:
        cpu: 3
        memory: 200Mi
      limits:
        cpu: 3
        memory: 200Mi
Set the appropriate requests and limits
● Always set your container resource requests.
  ○ If you only set limits, requests equal to the limits are assumed (Guaranteed QoS).
● For Burstable workloads, set:
  ○ Memory: the same amount for both requests and limits
  ○ CPU: a bigger or unbounded CPU limit
● Use GKE Cost Insights and Recommendations:
  ○ To help you understand the recommended values for your production workloads
  ○ Discussed in the next section
Make your container image as lean as possible
Have the smallest image possible:
● Every time Cluster Autoscaler provisions a new node, the node needs to download the images. The smaller the image, the faster the node can download it.
● Examples:
  ○ Go: you can get images of ~10MB (distroless!)
  ○ Java: consider jib (for lean images); GraalVM (for low-latency native binaries); Quarkus, Micronaut, and Spring Boot (for a Kubernetes-native Java stack)
Start the application as fast as possible:
● Some apps can take minutes to start due to class loading, caching, etc.
● Whenever a Pod requires a long startup, requests may fail while your application is booting up.
For more information about how to build containers, see Best practices for building containers.
Add Pod Disruption Budget (PDB) to your application
● PDBs are essential for autoscaling activities, mainly during the Cluster Autoscaler compacting phase.
For more information, see Specifying a Disruption Budget for your Application.
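A minimal sketch of a PDB that keeps at least two replicas up during voluntary disruptions such as CA node drains (names are illustrative):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb            # hypothetical name
spec:
  minAvailable: 2               # never voluntarily evict below two ready replicas
  selector:
    matchLabels:
      app: frontend             # hypothetical label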
Set meaningful Kubernetes health checks
● Readiness probes: to know whether your Pods must be added to or removed from load balancers
● Liveness probes: to know when your Pods need to be restarted
● Do this to ensure the correct lifecycle of your application:
  ○ Define at least the readiness probe for all your containers.
  ○ If a cache is to be loaded at startup time, the readiness probe must report ready only after the cache is fully loaded.
  ○ If the app can start serving right away, use a simple probe. For example, an HTTP endpoint returning a 200 status code.
  ○ If a more advanced probe is needed (e.g. checking whether the connection pool has available resources), make sure your error rate doesn't increase compared with a simpler implementation.
  ○ Never make any probe logic access other services. It can compromise the lifecycle of your Pod if those services do not respond in a timely manner.
For more information, see Configure Liveness, Readiness and Startup Probes.
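A sketch of a container with both probes, assuming the app exposes hypothetical /ready and /healthz HTTP endpoints:

apiVersion: v1
kind: Pod
metadata:
  name: probed-app              # hypothetical name
spec:
  containers:
  - name: app
    image: my-app:1.0           # hypothetical image
    ports:
    - containerPort: 8080
    readinessProbe:             # gates traffic from Services and load balancers
      httpGet:
        path: /ready            # hypothetical endpoint; report ready only after caches are loaded
        port: 8080
      periodSeconds: 5
    livenessProbe:              # restarts the container when it hangs
      httpGet:
        path: /healthz          # hypothetical endpoint; keep the logic local, never call other services
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10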
Make sure your apps are shutting down correctly
Consider the async nature of how Kubernetes updates endpoints and load balancers:
● Don't stop accepting new requests right after SIGTERM.
● If your application doesn't follow the above practice, use the preStop hook.
● Handle SIGTERM for clean-ups only, e.g. in-memory state which needs to be persisted.
● Configure terminationGracePeriodSeconds to fit your application's needs:
  ○ High values may increase the time for node upgrades, rollouts, etc.
  ○ Low values may not give Kubernetes enough time to finish the Pod termination process.
  ○ Either way, keep your application's termination period smaller than 10 minutes, because Cluster Autoscaler honors it for 10 minutes only.
● If your application uses container-native load balancing, start failing your readiness probe as soon as you receive a SIGTERM.
For more information, see Kubernetes best practices: terminating with grace.
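A sketch that combines a preStop delay (for apps that cannot keep serving after SIGTERM on their own) with a grace period sized to cover it; the values are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-app            # hypothetical name
spec:
  terminationGracePeriodSeconds: 60   # must cover the preStop sleep plus the app's own shutdown
  containers:
  - name: app
    image: my-app:1.0           # hypothetical image
    lifecycle:
      preStop:
        exec:
          # keep serving while endpoints and load balancers catch up with the Pod deletion
          command: ["sh", "-c", "sleep 15"]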
03 Monitoring and enforcing
https://www.youtube.com/watch?v=sYdCqxM7OFM
Workload rightsizing recommendations
Features:
● VPA recommendations in the GKE UI
● VPA metrics in Cloud Monitoring
Coming soon:
● VPA recommendations in the Recommendation Hub
For more details, see GKE workload rightsizing — from recommendations to action
Workload rightsizing recommendations at scale
VPA recommendation intelligence:
● Allows you to get workload recommendations at the org, division, or team level
● Allows you to figure out:
  a. Whether it is worth investing in workload rightsizing
  b. What the size of the effort is
  c. Where you should focus initially
  d. How many workloads are over-provisioned and how many are under-provisioned
  e. How many workloads are at reliability or performance risk
  f. Whether you are getting better over time at workload rightsizing
For more details, see Right-sizing your GKE workloads at scale using GKE built-in Vertical Pod Autoscaler recommendations
Use Kubernetes Resource Quotas
● Multitenant clusters: different teams are responsible for workloads in different namespaces.
● Kubernetes Resource Quotas:
  ○ Manage the amount of resources used by objects in a namespace
  ○ You can set quotas in terms of:
    ■ compute (CPU and memory) and storage resources
    ■ object counts
  ○ They let you ensure that no tenant uses more than its assigned share of cluster resources
For more information, see Configure Memory and CPU Quotas for a Namespace.
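A minimal sketch of a per-namespace quota covering compute resources and an object count (namespace and values are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota            # hypothetical name
  namespace: team-a             # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "20"          # total CPU the namespace may request
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"                  # object-count quota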
05 What's next?

Kubernetes: General
General Best Practices
● Write configuration in YAML; CI/CD your YAMLs
● When defining configurations, specify the latest stable API version
● Avoid direct command-line interaction with nodes in the cluster
● Administrator-only SSH access to Kubernetes nodes; devs may only use "kubectl exec"
● In Deployments, use the "record" option for easier rollbacks
Don't use naked Pods
Pods not bound to a ReplicaSet or Deployment will not be rescheduled in the event of a node failure.
ReplicaSets are a simple control loop that ensures N copies of a Pod:
● Grouped by a selector
● Too few? Start some
● Too many? Kill some
Example ReplicaSet:
  Name = "my-rc"
  Selector = {"App": "MyApp"}
  Template = { ... }
  Replicas = 4
[Diagram: the control loop asks the API server "How many?", sees 3 instead of 4, and starts 1 more]
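A sketch of the preferred pattern: wrap the Pod template in a Deployment so failed Pods are rescheduled (names are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 4                   # the control loop keeps four copies running
  selector:
    matchLabels:
      App: MyApp
  template:
    metadata:
      labels:
        App: MyApp
    spec:
      containers:
      - name: app
        image: my-app:1.0       # hypothetical image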
Use labels
Define and use labels that identify semantic attributes of your application or Deployment, such as:
● App: MyApp
● Phase: prod
You can use these labels to select the appropriate Deployments for other resources such as ReplicaSets and Services.
[Diagram: Pods labeled App: MyApp with Phase: prod or test and Role: FE or BE]
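A sketch of a Service that uses those labels to select only the prod front-end Pods:

apiVersion: v1
kind: Service
metadata:
  name: myapp-fe                # hypothetical name
spec:
  selector:                     # matches the labels from the diagram above
    App: MyApp
    Phase: prod
    Role: FE
  ports:
  - port: 80
    targetPort: 8080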
Kubernetes: Container Images
Base Image
It is highly recommended to use a managed image as a base image. Google provides managed base images, which are base container images automatically patched by Google for security vulnerabilities, using the most recent patches available from the project upstream.
Make the image light
When building containers, it is recommended to install only the required utilities and keep the container size small. Unnecessary utilities make container images bigger and increase the chances of hosting security holes. In addition, big container images take longer in build and deployment cycles. For example, the Node.js 6 image from Google Cloud Container Registry takes around 670MB, while the Ubuntu 16 base image takes only 140MB.
Change the tag on every new container build
Every container image build needs a new tag name. Kubernetes caches container images based on the tag name: if you build a new container with the same image tag, the new image will not be pulled; the cached image will be used instead. So every container build needs a new tag name.
Don't use :latest or no tag
This one is pretty obvious, and most people are doing it already. If you don't add a tag to your container, it will always try to pull the latest one from the repository, and that may or may not contain the changes you think it has.
Kubernetes: Security
Avoid privileged containers
And grant Linux capabilities only if strictly necessary
Avoid running as root
Containers share kernel resources!
Run as non-root in the Dockerfile

Bad:

FROM node:carbon
EXPOSE 8080
WORKDIR /home/appuser
COPY server.js .
CMD node server.js > /home/appuser/log.out

Good:

FROM node:carbon
EXPOSE 8080
RUN groupadd -r -g 2001 appuser && useradd -r -u 1001 -g appuser appuser
RUN mkdir /home/appuser && chown appuser /home/appuser
USER appuser
WORKDIR /home/appuser
COPY --chown=appuser:appuser server.js .
CMD node server.js > /home/appuser/log.out
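The same intent can also be enforced at the Pod level with a securityContext; a sketch:

apiVersion: v1
kind: Pod
metadata:
  name: nonroot-app             # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true          # the kubelet refuses to start containers running as root
    runAsUser: 1001             # matches the appuser UID from the Dockerfile above
  containers:
  - name: app
    image: node:carbon
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]           # grant capabilities back only if strictly necessary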
GKE: General
General Best Practices
● Use kubectl or the Google Cloud Platform console as the primary method of Kubernetes Engine interaction.
● Use Container-Optimized OS for a streamlined platform that minimizes maintenance and administration.
Use Regional Clusters for production
GKE provides two options for master node management:
● Zonal clusters: have one master node in one zone, hence they cannot support zone-level high availability.
● Regional clusters: enable zero-downtime upgrades and 99.95% uptime by deploying multiple masters across a region.
For production, it is recommended to use regional clusters.
[Diagram: a regional control plane with nodes in Zone A, Zone B, and Zone C]
Enable Node Auto-Repair
Keep nodes in your K8s cluster in a healthy, running state.
Nodes that fail health checks are automatically scheduled for repair, including:
● Unreachable nodes
● Nodes reporting a NotReady status for a prolonged period
● Nodes whose boot disks are out of space for a prolonged period
Unhealthy nodes have their Pods drained, then the node is recreated.
GKE
Upgrades
Enable Node Auto Upgrades
Automatically keep nodes in your K8s cluster or node pool up to date with the latest stable version of Kubernetes.
Get the latest Kubernetes features; apply security updates automatically.
Control when upgrades occur with maintenance windows.
Enable Release Channels for Upgrades
Release channels allow you to control your cluster and node pool version based on a version's stability rather than
managing the version directly.
When you enroll a cluster in a release channel, that cluster is upgraded automatically when a new version is available in
that channel.
Node upgrades
● Upgrades the node OS, Kubernetes, and the container runtime
  ○ Nodes are cordoned, drained, and replaced one at a time
  ○ Respects Pod disruption budgets (minAvailable and maxUnavailable) to avoid service downtime
  ○ Supports rollbacks
● Trigger cluster upgrades manually (per node pool), or use the node auto-upgrade functionality
  ○ Use blue-green upgrades if you have highly available production workloads that you need to be able to roll back quickly in case the workload does not tolerate the upgrade, and a temporary cost increase is acceptable
  ○ Enable maintenance windows and maintenance exclusions, which provide control over when cluster maintenance such as auto-upgrades can and cannot occur on your Google Kubernetes Engine (GKE) clusters
GKE: Security
GKE: Minimal OS
Container-Optimized OS (COS), based on Chromium OS and maintained by Google.
● Built from source: since COS is based on Chromium OS, Google maintains all components and is able to rebuild from source if a new vulnerability is discovered and needs to be patched.
● Smaller attack surface: Container-Optimized OS is purpose-built to run containers and has a smaller footprint, reducing your instance's potential attack surface.
● Locked down by default: the firewall restricts all TCP/UDP except SSH on port 22, and kernel modules are prevented. The root filesystem is mounted read-only.
● Automatic updates: COS instances automatically download weekly updates in the background; only a reboot is necessary to use the latest updates. Google provides patches and maintenance.
Use private clusters
The cluster does not have a public IP; it can be accessed from the private network only.
A private cluster can use an HTTP load balancer or a network load balancer to accept incoming traffic, even though the cluster nodes do not have public IP addresses.
A private cluster can also use an internal load balancer to accept traffic from within your VPC network.
[Diagram: private cluster nodes on RFC 1918 addresses only (10.0.0.0, 172.16.0.0, 192.168.0.0), reaching the WWW through Cloud NAT and Google services through Private Google Access; kubectl clients connect from on-premises]
Security on GKE: IAM and RBAC
GKE: Using IAM and RBAC
Use IAM at the project level. Set roles for:
● Cluster Admin: manage clusters
● Container Developer: API access
Use RBAC at the cluster and namespace level. Set permissions on individual clusters and namespaces (see the sketch below).
Use Groups: only ever add groups; groups get one or more roles.
[Diagram: scope hierarchy from GCP Project to Cluster, Node, Namespace, Pod, and Containers]
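A sketch of the RBAC half: a namespace-scoped Role bound to a group, per the "only ever add groups" guidance (names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-developer           # hypothetical name
  namespace: app-a              # hypothetical namespace
rules:
- apiGroups: ["", "apps"]       # core API group and apps
  resources: ["pods", "services", "deployments"]
  verbs: ["get", "list", "watch", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-developer-binding
  namespace: app-a
subjects:
- kind: Group
  name: developers@example.com  # hypothetical Google group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-developer
  apiGroup: rbac.authorization.k8s.io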
IAM: Example of managing access with roles
Applying the principle of least privilege for developers across environments:
● Project DEV, cluster "Dev Playground" (namespaces dev-alice-app-a, dev-mary-app-a): developers=container.developer
● Project PROD, clusters "Prod A" and "Prod B" (namespaces app-a, app-b): developers=container.viewer
Use Least Privilege Service Accounts
Security on GKE: Encryption, Keys and Secrets
Kubernetes Secrets
[Diagram: options for protecting secrets, from a plain Kubernetes Secret, to a Secret with EncryptionConfig, to a Secret with EncryptionConfig backed by a KMS provider. Cloud KMS can act as the KMS provider managing the KEK, with Cloud IAM controlling access; alternatively, applications can authenticate to Vault directly, with Vault acting as the secrets engine for keys, encryption, and service account management]
Encryption and keys
● All persistent disks are encrypted by default!
● GKE Customer-Managed Encryption Keys (CMEK)
  ○ Encrypt volumes with keys managed by Cloud KMS
● Plugin for Cloud KMS (open-sourced for DIY:ers)
  ○ Also known as "Application-layer Secrets Encryption"
  ○ DEKs on the master(s) encrypt Kubernetes Secrets
  ○ KEK in Cloud KMS, potentially backed by an HSM
[Diagram: master and nodes with persistent disks, keyed via Cloud KMS]
Keeping up with the news
● Google Cloud blog, filtered on container-related news: https://cloud.google.com/blog/products/containers-kubernetes
● Exclusive roadmap for customers and partners: https://www.cloudconnectcommunity.com/
● The "Google Cloud Platform Podcast": https://www.gcppodcast.com/
● The "Kubernetes Podcast from Google": https://kubernetespodcast.com/
Thank you!