About MCaaS
Configuration Best Practices
In this documentation, we go over configurations we HIGHLY recommend you put in place for a successful go-live and continuous operation in the production environment. Configuring them correctly makes your applications resilient and scalable. Please use the following checklist to help you.
Note
This checklist is not intended to be part of the application's ATO process. It is meant to help you succeed on the MCaaS platform when going live.
Production Readiness Checklist
- Application has Datadog APM configured
- Application metrics are monitored in Datadog Dashboard
- Application issues and errors are monitored by Datadog monitors and can send automated alerts to the right people
- Application deployment has realistic application liveness and readiness probes
- Application deployment replicas, scaling and pod resources are configured
- Application deployment pod disruption budget is configured
- Application deployment is using advanced deployment strategy
- Application deployment has security context enabled
Monitoring
If you have not configured any monitoring for your applications, please refer to Datadog Overview to get started.
- For visibility into your application's traces and the performance of each trace, please review Datadog Application Performance Monitoring.
- To set up dashboards that monitor clusters and applications in a single view, please review Datadog Dashboard.
- To set up automated alerts that trigger when events happen, please review Datadog Monitors.
Application Deployment Configurations
In this section, we go over the values you can set in your HelmRelease file for your application deployment. Some values, like replica count, should be configured carefully so that you don't over- or under-provision resources.
Important
It is highly recommended to perform load and performance testing using realistic numbers on the applications before going live so that you know the baselines and spikes of the traffic. This will help you with the following configurations. Please coordinate with the MCaaS team in advance when performing these tests.
Configure realistic liveness and readiness probes
Configuring liveness and readiness probes is the first step toward making sure that the application deployment pods are ready to accept traffic. Therefore, it is important to have realistic liveness and readiness checks at the application level.
readinessProbe:
  exec:
    enabled: false # change httpGet.enabled to false if using this
    command: [] # commands for readiness probe
  httpGet:
    enabled: true
    path: /readiness
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
livenessProbe:
  exec:
    enabled: true
    command:
      - /bin/sh
      - /healthcheck.sh
  httpGet:
    enabled: false
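If your application exposes an HTTP health endpoint instead of a shell script, the liveness probe can use httpGet as well. The sketch below mirrors the readiness example above; the /healthz path and the timings are illustrative and should be tuned against your load-test baselines:

```yaml
livenessProbe:
  exec:
    enabled: false
  httpGet:
    enabled: true
    path: /healthz # illustrative; point this at your application's health route
  initialDelaySeconds: 10 # allow the app to finish starting before the first check
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3 # restart the pod after 3 consecutive failed checks
```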
Configure deployment replica count, resources per replica and scaling
It is important to know how many replicas you need to serve traffic and how many resources each replica needs to run. These values should be configured with care, as you can easily claim a lot of cluster resources that the application does not actually need. Thus, load testing is important for establishing the baselines for these values.
If autoscaling is not enabled, please use replicaCount
to have a static count of replicas.
# how many replica pods to deploy as part of the deployment
replicaCount: 2
## resources the pod reserves
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits: # a pod exceeding its memory limit is killed (OOMKilled); CPU usage above the limit is throttled
    cpu: 250m
    memory: 500Mi
Important
It is important for your applications to scale horizontally rather than vertically, so that you have more replicas with smaller resource requests instead of a single replica with large resource requests. This also improves your applications' availability by eliminating a single point of failure.
There are two types of autoscaling tools available for you to use: the horizontal pod autoscaler (HPA) and the vertical pod autoscaler (VPA). We do not recommend using VPA unless the application can only scale vertically. HPA and VPA cannot be used together on CPU/RAM metrics, as they conflict with each other when scaling. Please let the MCaaS team know if you need VPA and provide us a reason, as using VPA requires further configuration.
HPA can be enabled as follows. In this example, HPA monitors the application pods' CPU and RAM usage. If either metric hits the threshold of 50% utilization of the resources requested by the pod, HPA scales up to more pods. If traffic decreases and utilization goes down, HPA automatically scales the number of pods back down after a minute or so.
# replicaCount: 2 ## minReplicas in horizontalPodAutoscaler configuration is used instead of this.
## defines how pod can horizontally scale; please do not add this section if unsure
horizontalPodAutoscaler:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  cpuMetrics:
    enabled: true
    averageUtilization: 50
  memoryMetrics:
    enabled: true
    averageUtilization: 50
  otherMetrics: [] # custom metrics or external metrics; for more information: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics
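As a sketch of what otherMetrics could hold — assuming the tenant chart passes these entries through to the autoscaling/v2 HPA metrics field — a hypothetical per-pod request-rate metric would look like the following. The metric name must actually be exported by your metrics pipeline:

```yaml
otherMetrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100" # scale up when pods average more than 100 req/s
```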
This is an example of using VPA. Currently, VPA does not automatically update the pods' resource requests and limits. It only provides recommendations for how they should be configured. The recommendations can be viewed from the Datadog Infrastructure pod details page.
## defines how pod can vertically scale; please do not add this section if unsure
# cannot be used together with HPA on cpu and memory; for more information on limitations: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#known-limitations
verticalPodAutoscaler:
  enabled: false
  minAllowed:
    cpu: 100m
    memory: 200Mi
  maxAllowed:
    cpu: 200m
    memory: 400Mi
Helpful references:
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
- https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#intro
Configure pod disruption budget
A pod disruption budget (PDB) is important to have on all deployments to minimize any downtime, especially in production. It specifies that, at any given time, at least x pod(s) must be running. A common use case: whenever the MCaaS team performs cluster upgrades, the existing cluster nodes are drained and deleted. With a PDB specified, the drain process waits for replacement pods to spin up on new nodes before evicting the existing pods.
Important
PDB should be configured to be less than the replicaCount or minReplicas you specify.
## defines how many pods should be available during a deployment or cluster upgrade
podDisruptionBudget:
  enabled: true
  minAvailable: 1 # this number should always be less than replicaCount or horizontalPodAutoscaler.minReplicas
Important
PDB should NOT be configured if replicaCount is 1. It will cause unnecessary delays during EKS node group upgrades.
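As a worked example (numbers are illustrative), a deployment with three replicas and minAvailable: 2 lets a node drain evict at most one pod at a time, so two replicas keep serving traffic throughout an upgrade:

```yaml
replicaCount: 3
podDisruptionBudget:
  enabled: true
  minAvailable: 2 # at most one of the three pods may be disrupted at any time
```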
Configure advanced deployment strategy
Using an advanced deployment strategy is highly recommended in the production environment to give end users a smoother deployment experience. This topic is covered extensively in the Advanced Deployment Strategies documentation. Please refer to that documentation to learn more.
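For orientation only: in plain Kubernetes, the simplest such strategy is a rolling update with surge and unavailability bounds. The keys below are the generic Kubernetes Deployment fields, not MCaaS tenant chart values; see the Advanced Deployment Strategies documentation for what the chart actually exposes:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1 # at most one extra pod above the desired count during rollout
    maxUnavailable: 0 # never drop below the desired replica count mid-rollout
```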
Enable security context
Security context adds restrictions on what the pod can do. This prevents malicious code or a rogue container from breaking out and performing malicious acts such as taking over the cluster. The GSA SecOps team has container runtime security agents running in the cluster to prevent that, but having this enabled adds an extra layer of security on your pods. To enable it, simply put this:
securityContext:
  enabled: true
The following configurations will be set on your deployments if it is enabled:
hostIPC: false
hostNetwork: false
hostPID: false
securityContext: # pod-level security context
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
...
securityContext: # container-level security context
  privileged: false
  allowPrivilegeEscalation: false
  runAsNonRoot: true
Please refer to the chart to see where they are applied: https://github.helix.gsa.gov/MCaaS/mcaas-tenant-charts/blob/tenant-prod/tenant-stateless-application/templates/deployment.yaml.
For more information about what each parameter does, please refer to the Kubernetes documentation on security contexts.
Assume IAM Role
Applications frequently require permission to access AWS resources, such as an S3 bucket or an SQS queue. MCaaS can create an IAM role with the necessary permissions for your application. However, the application needs to assume this role in order to leverage its permissions.
To enable an application to assume an IAM role, MCaaS supports IAM Roles for Service Accounts. Under this approach, an application’s Kubernetes service account will be annotated with the IAM role it can assume. MCaaS tenant charts support this option via:
assumeRoleArn: <role arn>
Set this value in an application’s helm release.
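For illustration, a HelmRelease values snippet might look like the following; the account ID and role name are hypothetical. Under IRSA, the service account typically ends up annotated with eks.amazonaws.com/role-arn pointing at this role, and the AWS SDKs pick up the credentials automatically:

```yaml
# role ARN shown is illustrative; use the role MCaaS created for your application
assumeRoleArn: arn:aws:iam::123456789012:role/my-app-s3-access
```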