About MCaaS
Configuration Best Practices
In this documentation, we go over configurations we HIGHLY recommend you put in place for a successful go-live and continuous operation in the production environment. Configuring them correctly makes your applications resilient and scalable. Please use the following checklist to help you.
Note
This checklist is not intended to be part of the application's ATO process. It is meant to help you succeed on the MCaaS platform when going live.
Production Readiness Checklist
- Application has Datadog APM configured
- Application metrics are monitored in Datadog Dashboard
- Application issues and errors are monitored by Datadog monitors and can send automated alerts to the right people
- Application deployment has realistic application liveness and readiness probes
- Application deployment replicas, scaling and pod resources are configured
- Application deployment pod disruption budget is configured
- Application deployment is using advanced deployment strategy
- Application deployment has security context enabled
Monitoring
If you have not configured any monitoring for your applications, please refer to Datadog Overview to get started.
- For visibility into your application's traces and the performance of each trace, please review Datadog Application Performance Monitoring.
- To set up dashboards that monitor clusters and applications in a single view, please review Datadog Dashboard.
- To set up automated alerts that trigger when events happen, please review Datadog Monitors.
Application Deployment Configurations
In this section, we go over the values you can set in your HelmRelease file for your application deployment. Some values, like replica count, should be configured carefully so that you don't over- or under-provision resources.
Important
It is highly recommended to perform load and performance testing using realistic numbers on the applications before going live so that you know the baselines and spikes of the traffic. This will help you with the following configurations. Please coordinate with the MCaaS team in advance when performing these tests.
Configure realistic liveness and readiness probes
Configuring liveness and readiness probes is the first step toward making sure that the application deployment pods are ready to accept traffic. Therefore, it is important to have realistic liveness and readiness checks at the application level.
readinessProbe:
  exec:
    enabled: false # change httpGet.enabled to false if using this
    command: [] # commands for readiness probe
  httpGet:
    enabled: true
    path: /readiness
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
livenessProbe:
  exec:
    enabled: true
    command:
      - /bin/sh
      - /healthcheck.sh
  httpGet:
    enabled: false
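If your application exposes an HTTP health endpoint instead of a shell script, the liveness probe can use httpGet as well. The sketch below mirrors the readiness example above; the /healthz path and the timings are illustrative and should be tuned against your load-test baselines:

```yaml
livenessProbe:
  exec:
    enabled: false
  httpGet:
    enabled: true
    path: /healthz # illustrative; point this at your application's health route
  initialDelaySeconds: 10 # allow the app to finish starting before the first check
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3 # restart the pod after 3 consecutive failed checks
```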
Configure deployment replica count, resources per replica and scaling
It is important to know how many replicas you need to serve traffic and how many resources each replica needs to run. These values should be configured with care, as you can easily claim a lot of cluster resources that the application does not actually need. Thus, load testing is important for establishing the baselines for these values.
If autoscaling is not enabled, please use replicaCount
to have a static count of replicas.
# how many replica pods to deploy as part of the deployment
replicaCount: 2
## resources the pod reserves
resources:
  requests:
    cpu: 100m
    memory: 200Mi
  limits: # a pod exceeding its memory limit is killed (OOMKilled); CPU usage above the limit is throttled
    cpu: 250m
    memory: 500Mi
Important
It is important for your applications to scale horizontally rather than vertically, so that you have more replicas with smaller resource requests instead of a single replica with large resource requests. This also improves your applications' availability by eliminating a single point of failure.
There are two types of autoscaling tools available for you to use: the horizontal pod autoscaler (HPA) and the vertical pod autoscaler (VPA). We do not recommend using VPA unless the application can only scale vertically. HPA and VPA cannot be used together on CPU/RAM metrics, as they conflict with each other when scaling. Please let the MCaaS team know if you need VPA and provide us a reason, as using VPA requires further configuration.
HPA can be enabled as follows. In this example, HPA monitors the application pods' CPU and RAM usage. If either metric hits the threshold of 50% utilization of the resources requested by the pod, HPA scales up to more pods. If traffic decreases and utilization goes down, HPA automatically scales the number of pods back down after a minute or so.
# replicaCount: 2 ## minReplicas in horizontalPodAutoscaler configuration is used instead of this.
## defines how pod can horizontally scale; please do not add this section if unsure
horizontalPodAutoscaler:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  cpuMetrics:
    enabled: true
    averageUtilization: 50
  memoryMetrics:
    enabled: true
    averageUtilization: 50
  otherMetrics: [] # custom metrics or external metrics; for more information: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics
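As a sketch of what otherMetrics could hold — assuming the tenant chart passes these entries through to the autoscaling/v2 HPA metrics field — a hypothetical per-pod request-rate metric would look like the following. The metric name must actually be exported by your metrics pipeline:

```yaml
otherMetrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100" # scale up when pods average more than 100 req/s
```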
This is an example of using VPA. Currently, VPA does not automatically update the pods' resource requests and limits. It only provides recommendations for how they should be configured. The recommendations can be viewed from the Datadog Infrastructure pod details page.
## defines how pod can vertically scale; please do not add this section if unsure
# cannot be used together with HPA on cpu and memory; for more information on limitations: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#known-limitations
verticalPodAutoscaler:
  enabled: false
  minAllowed:
    cpu: 100m
    memory: 200Mi
  maxAllowed:
    cpu: 200m
    memory: 400Mi
Helpful references:
- https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
- https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler#intro
Configure pod disruption budget
A pod disruption budget (PDB) is important to have on all deployments to minimize any downtime, especially in production. It specifies that, at any given time, at least x pod(s) must be running. A common use case: whenever the MCaaS team performs cluster upgrades, the existing cluster nodes are drained and deleted. With a PDB specified, the drain process waits for replacement pods to spin up on new nodes before evicting the existing pods.
Important
PDB should be configured to be less than the replicaCount or minReplicas you specify.
## defines how many pods should be available during a deployment or cluster upgrade
podDisruptionBudget:
  enabled: true
  minAvailable: 1 # this number should always be less than replicaCount or horizontalPodAutoscaler.minReplicas
Important
PDB should NOT be configured if replicaCount is 1. It will cause unnecessary delays during EKS node group upgrades.
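As a worked example (numbers are illustrative), a deployment with three replicas and minAvailable: 2 lets a node drain evict at most one pod at a time, so two replicas keep serving traffic throughout an upgrade:

```yaml
replicaCount: 3
podDisruptionBudget:
  enabled: true
  minAvailable: 2 # at most one of the three pods may be disrupted at any time
```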
Configure advanced deployment strategy
Using an advanced deployment strategy is highly recommended in the production environment to give end users a smoother deployment experience. This topic is covered extensively in the Advanced Deployment Strategies documentation. Please refer to that documentation to learn more.
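For orientation only: in plain Kubernetes, the simplest such strategy is a rolling update with surge and unavailability bounds. The keys below are the generic Kubernetes Deployment fields, not MCaaS tenant chart values; see the Advanced Deployment Strategies documentation for what the chart actually exposes:

```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1 # at most one extra pod above the desired count during rollout
    maxUnavailable: 0 # never drop below the desired replica count mid-rollout
```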
Enable security context
Security context adds restrictions on what the pod can do. This prevents malicious code or a rogue container from breaking out and performing malicious acts such as taking over the cluster. The GSA SecOps team has container runtime security agents running in the cluster to prevent that, but having this enabled adds an extra layer of security on your pods. To enable it, simply put this:
securityContext:
  enabled: true
The following configurations will be set on your deployments if it is enabled:
hostIPC: false
hostNetwork: false
hostPID: false
securityContext: # pod-level security context
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
...
securityContext: # container-level security context
  privileged: false
  allowPrivilegeEscalation: false
  runAsNonRoot: true
Please refer to the chart to see where they are applied: https://github.helix.gsa.gov/MCaaS/mcaas-tenant-charts/blob/tenant-prod/tenant-stateless-application/templates/deployment.yaml.
For more information about what each parameter does, please refer to the Kubernetes documentation on security contexts.
Assume IAM Role
Applications frequently require permission to access AWS resources, such as an S3 bucket or an SQS queue. MCaaS can create an IAM role with the necessary permissions for your application. However, the application needs to assume this role in order to leverage its permissions.
To enable an application to assume an IAM role, MCaaS supports IAM Roles for Service Accounts. Under this approach, an application’s Kubernetes service account will be annotated with the IAM role it can assume. MCaaS tenant charts support this option via:
assumeRoleArn: <role arn>
Set this value in an application’s helm release.
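For illustration, a HelmRelease values snippet might look like the following; the account ID and role name are hypothetical. Under IRSA, the service account typically ends up annotated with eks.amazonaws.com/role-arn pointing at this role, and the AWS SDKs pick up the credentials automatically:

```yaml
# role ARN shown is illustrative; use the role MCaaS created for your application
assumeRoleArn: arn:aws:iam::123456789012:role/my-app-s3-access
```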