ECS container network latency

ECS container network latency injects network delay into the containers of an AWS ECS task. It does so through the Amazon SSM Run Command, using an SSM document that is built into the fault.

It adds network latency to the containers of the ECS tasks in the cluster named by the CLUSTER_NAME environment variable, for a specific duration.
To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with that service are selected as chaos targets.
It tests the sanity of the ECS task (service availability) and the recovery of the task containers that are subjected to network stress.

Usage

This fault degrades the network of the task container without the container being marked as unhealthy or unworthy of traffic. It simulates issues within the ECS task network or in communication across services in different availability zones or regions. Such degradation can be handled by middleware that switches traffic based on SLOs or performance parameters, or surfaced through notifications and alerts. The fault also helps determine the impact on the microservice: a task may stall or get corrupted while waiting endlessly for a packet. Specifying the service used to find the Task Under Chaos (TUC) limits the blast radius to only the traffic you wish to test. Over time, this fault helps improve the resilience of the services.

Prerequisites

Kubernetes >= 1.17
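Adequate AWS access to run SSM commands against the ECS container instances that host the target tasks (see the example policy under Permissions required).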
Create a Kubernetes secret in the CHAOS_NAMESPACE that holds the AWS access configuration (key). Below is a sample secret file:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX

It is recommended to use the same secret name, that is, cloud-secret. If you use a different name, you need to update the AWS_SHARED_CREDENTIALS_FILE environment variable in the fault template, and you may be unable to use the default health check probes.
Refer to AWS Named Profile For Chaos to know how to use a different profile for AWS faults.

Permissions required

Here is an example AWS policy to execute the fault.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ecs:UpdateContainerInstancesState",
                "ecs:RegisterContainerInstance",
                "ecs:ListContainerInstances",
                "ecs:DeregisterContainerInstance",
                "ecs:DescribeContainerInstances",
                "ecs:ListTasks",
                "ecs:DescribeClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetDocument",
                "ssm:DescribeDocument",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:SendCommand",
                "ssm:CancelCommand",
                "ssm:CreateDocument",
                "ssm:DeleteDocument",
                "ssm:GetCommandInvocation",
                "ssm:UpdateInstanceInformation",
                "ssm:DescribeInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2messages:AcknowledgeMessage",
                "ec2messages:DeleteMessage",
                "ec2messages:FailMessage",
                "ec2messages:GetEndpoint",
                "ec2messages:GetMessages",
                "ec2messages:SendReply"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Refer to the superset permission/policy to execute all AWS faults.

Default validations

The ECS container instance should be in a healthy state.

Fault tunables

Mandatory fields

Variables	Description	Notes
CLUSTER_NAME	Name of the target ECS cluster.	For example, `cluster-1`.
REGION	Region name of the target ECS cluster.	For example, `us-east-1`.
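
As a minimal sketch of how the mandatory fields might be supplied, the fragment below sets CLUSTER_NAME and REGION in a ChaosEngine laid out like the other examples on this page. The values `cluster-1` and `us-east-1` are the sample values from the table, not defaults; replace them with your own cluster and region.

```yaml
# sets the mandatory tunables (sample values, replace with your own)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # name of the target ECS cluster
        - name: CLUSTER_NAME
          value: 'cluster-1'
        # region of the target ECS cluster
        - name: REGION
          value: 'us-east-1'
```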

Optional fields

Variables	Description	Notes
TOTAL_CHAOS_DURATION	Duration for which chaos is injected into the target resource (in seconds).	Defaults to 30s.
CHAOS_INTERVAL	Interval between successive chaos iterations (in seconds).	Defaults to 30s.
AWS_SHARED_CREDENTIALS_FILE	Path to the AWS secret credentials.	Defaults to `/tmp/cloud_config.yml`.
NETWORK_LATENCY	Latency you wish to induce within the service (in milliseconds).	Defaults to 2000 ms.
DESTINATION_IPS	IP addresses of the services, or CIDR blocks (ranges of IPs), to which access is impacted.	Comma-separated IPs or CIDRs. If not provided, network chaos is induced for all IPs/destinations.
DESTINATION_HOSTS	DNS names of the services to which access is impacted.	If not provided, network chaos is induced for all IPs/destinations, or for DESTINATION_IPS if already defined.
NETWORK_INTERFACE	Name of the ethernet interface considered for shaping traffic.	Defaults to `eth0`.
JITTER	Variation in the injected network delay (in milliseconds).	Defaults to 0.
SEQUENCE	Sequence of chaos execution for multiple instances.	Defaults to parallel. Supports serial sequence as well.
RAMP_TIME	Period to wait before and after injecting chaos (in seconds).	For example, 30
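
To illustrate two of the optional fields that have no dedicated example on this page, the sketch below sets RAMP_TIME and SEQUENCE. The values are illustrative only; `serial` is assumed to be the literal that selects the serial sequence mentioned in the table.

```yaml
# runs chaos serially with a ramp period (illustrative values)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # period to wait before and after injecting chaos (in seconds)
        - name: RAMP_TIME
          value: '30'
        # assumed literal for the serial sequence; the default is parallel
        - name: SEQUENCE
          value: 'serial'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```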

Fault examples

Common and AWS-specific tunables

Refer to the common attributes and AWS-specific tunables to configure the tunables that are shared by all faults and those that are specific to AWS.

Network latency

It defines the network latency (in ms) to be injected into the targeted container. You can tune it using the NETWORK_LATENCY environment variable.

Use the following example to tune it:

# injects network latency for a certain chaos duration
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # network latency to be injected
        - name: NETWORK_LATENCY
          value: '2000' #in ms
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Network interface

It defines the name of the ethernet interface that is considered for shaping traffic. You can tune it using the NETWORK_INTERFACE environment variable. Its default value is eth0.

Use the following example to tune it:

# provide the network interface
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # name of the network interface
        - name: NETWORK_INTERFACE
          value: 'eth0'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Jitter

It defines the jitter (in ms), a parameter that introduces variation in the network delay. You can tune it using the JITTER environment variable. Its default value is 0.

Use the following example to tune it:

# provide the network latency jitter
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # value of the network latency jitter (in ms)
        - name: JITTER
          value: '200'
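
Since jitter describes variation around a delay, it may be useful to set it together with NETWORK_LATENCY. The sketch below combines the two tunables from the examples above; the values are illustrative, and how the resulting delay is distributed depends on the underlying traffic shaping, which this page does not describe.

```yaml
# injects network latency together with jitter (illustrative values)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # base network latency to be injected (in ms)
        - name: NETWORK_LATENCY
          value: '2000'
        # variation applied to the injected latency (in ms)
        - name: JITTER
          value: '200'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```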

Destination IPs and destination hosts

By default, the network fault affects traffic to all IPs and hosts. To restrict it to specific destinations, use the DESTINATION_IPS and DESTINATION_HOSTS environment variables.

DESTINATION_IPS: The IP addresses of the services, or CIDR blocks (ranges of IPs), to which access is impacted.
DESTINATION_HOSTS: The DNS names (FQDNs) of the services to which access is impacted.

Use the following example to tune it:

# injects chaos into the egress traffic for specific IPs/hosts
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-network-latency
    spec:
      components:
        env:
        # supports comma-separated destination ips
        - name: DESTINATION_IPS
          value: '8.8.8.8,192.168.5.6'
        # supports comma-separated destination hosts
        - name: DESTINATION_HOSTS
          value: 'nginx.default.svc.cluster.local,google.com'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
