add tutorial on simulating outages #73

Merged
merged 1 commit into from
Jun 30, 2025
Binary file added src/assets/images/aws/tutorials/banner.png
230 changes: 230 additions & 0 deletions src/content/docs/aws/tutorials/simulating-outages.mdx
@@ -0,0 +1,230 @@
---
title: "Chaos Engineering: Simulating Outages using Chaos API"
description: Use the Chaos API to simulate service disruptions and assess how well your infrastructure can deploy and recover from unexpected situations.
services:
- ecs
- ec2
- agw
- ddb
platform:
- JavaScript
deployment:
- terraform
pro: true
leadimage: "banner.png"
---

## Introduction

The [LocalStack Chaos API](/aws/capabilities/chaos-engineering/chaos-api) simulates infrastructure faults so that you can run controlled chaos engineering tests against your AWS infrastructure.
Its purpose is to uncover vulnerabilities and improve system robustness.
By deliberately introducing failures and observing their impact, developers can better prepare their systems for real outages.

## Getting started

In this tutorial we study the effects of outages on a sample AWS application.
We use the Chaos API to simulate the outage and design a mitigation to make the application resilient against database outages.

This tutorial is designed for users new to the Chaos API and assumes basic knowledge of the AWS CLI and our [`awslocal`](https://github.com/localstack/awscli-local) wrapper script.
In this example, we will use the Chaos API to create controlled outages in a DynamoDB database.
The aim is to test the software's behavior and error handling capabilities.

For this particular example, we'll be using a [sample application repository](https://github.com/localstack-samples/samples-chaos-engineering/tree/master/chaos-api).
Clone the repository, and follow the instructions below to get started.

### Prerequisites

The general prerequisites for this guide are:

- LocalStack Pro with [LocalStack Auth Token](/aws/getting-started/auth-token)
- [AWS CLI](/aws/integrations/aws-native-tools/aws-cli) with the [`awslocal` wrapper](/aws/integrations/aws-native-tools/aws-cli#localstack-aws-cli-awslocal)
- [Docker](https://docs.docker.com/get-docker/) and [Docker Compose](https://docs.docker.com/compose/install/)

Start LocalStack by using the `docker-compose.yml` file from the repository.
Make sure to set your Auth Token as an environment variable before starting the containers.
The cloud resources are created automatically when LocalStack starts.

```bash
export LOCALSTACK_AUTH_TOKEN=<YOUR_LOCALSTACK_AUTH_TOKEN>
docker compose up
```

### Architecture

The following diagram shows the architecture that this application builds and deploys:

![Architecture](/images/aws/arch-1.png)

### Preflight checks

Before starting any outages, it's important to verify that our application is functioning correctly.
Start by creating an entity and saving it.
To do this, use curl to call the API Gateway endpoint for the POST method:

```bash
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-2004",
"name": "Ultimate Gadget",
"price": "49.99",
"description": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
}'
```

```bash title="Output"
Product added/updated successfully.
```
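
The same preflight check can be scripted. The following is a minimal Python sketch that builds the POST request with the standard library only. `build_product_request` and `send_request` are hypothetical helper names, and the API ID `12345` is copied from the curl command above and may differ in your deployment.

```python
import json
import urllib.request

# Endpoint copied from the curl command above; the API ID (12345) is an assumption.
API_URL = "http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi"


def build_product_request(product: dict, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request that stores a product."""
    body = json.dumps(product).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request; needs a running LocalStack instance with the sample deployed."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


product = {
    "id": "prod-2004",
    "name": "Ultimate Gadget",
    "price": "49.99",
    "description": "The Ultimate Gadget is the perfect tool for tech enthusiasts "
    "looking for the next level in gadgetry. "
    "Compact, powerful, and loaded with features.",
}
request = build_product_request(product)
print(request.get_method(), request.full_url)
```

Calling `send_request(request)` against the running sample stack should return the same success message as the curl command.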

### Simulating the outage

Next, we will configure the Chaos API to target all DynamoDB operations.
The Chaos API can also narrow outages to particular operations such as `PutItem` or `GetItem`, but the objective here is to simulate a failure of the entire service.
The following configuration causes each DynamoDB API call to fail with a probability of 80%, returning an HTTP 500 status code and a `SomethingWentWrong` error.

```bash
curl --location --request PATCH 'http://localhost.localstack.cloud:4566/_localstack/chaos/faults' \
--header 'Content-Type: application/json' \
--data '
[
{
"service": "dynamodb",
"probability": 0.8,
"error": {
"statusCode": 500,
"code": "SomethingWentWrong"
}
}
]'
```

This makes the database effectively inaccessible.
Neither external clients nor LocalStack services can reliably retrieve or add products, so API Gateway returns an Internal Server Error.
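
If you prefer to generate the fault configuration from a script, the sketch below builds the same rule in Python and validates the probability before anything is sent. The helper names `build_fault_rule` and `build_faults_request` are hypothetical. The payload fields and the endpoint are copied from the curl command above.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
CHAOS_FAULTS_URL = "http://localhost.localstack.cloud:4566/_localstack/chaos/faults"


def build_fault_rule(service: str, probability: float, status_code: int, code: str) -> dict:
    """Return one fault rule in the shape used by the curl command above."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return {
        "service": service,
        "probability": probability,
        "error": {"statusCode": status_code, "code": code},
    }


def build_faults_request(rules: list, method: str = "PATCH") -> urllib.request.Request:
    """Build (but do not send) the request that registers the fault rules."""
    return urllib.request.Request(
        CHAOS_FAULTS_URL,
        data=json.dumps(rules).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


rules = [build_fault_rule("dynamodb", 0.8, 500, "SomethingWentWrong")]
faults_request = build_faults_request(rules)
print(json.dumps(rules, indent=2))
```

Passing `faults_request` to `urllib.request.urlopen` while LocalStack is running would apply the rule. The sketch stops short of sending it so that it can be run without a live instance.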

Downtime and data loss are critical issues to avoid in enterprise applications.
Fortunately, encountering this issue early in the development phase allows developers to implement effective error handling and develop mechanisms to prevent data loss during a database outage.

### Designing a more resilient system

![Architecture](/images/aws/arch-2.png)

A possible solution involves setting up an SNS topic, an SQS queue, and a Lambda function.
The Lambda function will be responsible for retrieving queued items and attempting to re-execute the `PutItem` operation on the database.
If DynamoDB remains unavailable, the item will be placed back in the queue for a later retry.

With the outage still active, send another product to the API:

```bash
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1003",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
```

```bash title="Output"
A DynamoDB error occurred.
Message sent to queue.
```

The logs show that the `DynamoDbException` was handled and the message was published to the queue.

```text
2023-11-06T22:21:40.789 INFO --- [ asgi_gw_2] localstack.request.aws : AWS dynamodb.PutItem => 500 (DynamoDbException)
2023-11-06T22:21:40.834 DEBUG --- [ asgi_gw_4] l.services.sns.publisher : Topic 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic' publishing '5520d37a-fc21-4a73-b1bf-f9b9afce5908' to subscribed 'arn:aws:sqs:us-east-1:000000000000:ProductEventsQueue' with protocol 'sqs' (subscription 'arn:aws:sns:us-east-1:000000000000:ProductEventsTopic:0a4abf8c-744a-404a-9ff9-f132e25d1b30')
```

The message remains in the queue until the outage is resolved.
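
The fallback flow described in this section can be sketched in a few lines of Python. This is not the sample application's code: `FlakyTable`, `handle_request`, and `drain_queue` are hypothetical in-memory stand-ins for the DynamoDB table, the API handler, and the retry Lambda. A plain `deque` takes the place of the SNS topic and SQS queue.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Stand-in for the error the application sees during the outage."""


class FlakyTable:
    """In-memory stand-in for the DynamoDB table; fails while `available` is False."""

    def __init__(self):
        self.available = True
        self.items = {}

    def put_item(self, item: dict) -> None:
        if not self.available:
            raise DatabaseUnavailable("simulated outage")
        self.items[item["id"]] = item


def handle_request(table: FlakyTable, queue: deque, item: dict) -> str:
    """API handler: write to the table, or park the item in the queue on failure."""
    try:
        table.put_item(item)
        return "Product added/updated successfully."
    except DatabaseUnavailable:
        queue.append(item)  # stands in for publishing to SNS, which fans out to SQS
        return "A DynamoDB error occurred. Message sent to queue."


def drain_queue(table: FlakyTable, queue: deque) -> int:
    """Retry worker: attempt each queued item once, re-queueing any that still fail."""
    stored = 0
    for _ in range(len(queue)):
        item = queue.popleft()
        try:
            table.put_item(item)
            stored += 1
        except DatabaseUnavailable:
            queue.append(item)  # leave it for a later retry
    return stored


table, queue = FlakyTable(), deque()
table.available = False  # outage starts
print(handle_request(table, queue, {"id": "prod-1003", "name": "Super Widget"}))
print(drain_queue(table, queue), "item(s) stored during the outage")
table.available = True  # outage ends
print(drain_queue(table, queue), "item(s) stored after recovery")
```

The design choice worth noting is that the handler never drops a write. It either stores the item or hands it to the queue, and the worker only removes an item for good once the write has succeeded.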

### Ending the outage

To end the outage, replace the fault configuration with an empty list:

```bash
curl --location --request POST 'http://localhost.localstack.cloud:4566/_localstack/chaos/faults' \
--header 'Content-Type: application/json' \
--data '[]'
```

With the outage over, the product that initially failed to reach the database is finally stored successfully.
This can be confirmed by scanning the database.

```bash
awslocal dynamodb scan --table-name Products
```

```bash title="Output"
{
"Items": [
{
"name": {
"S": "Super Widget"
},
"description": {
"S": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
},
"id": {
"S": "prod-1003"
},
"price": {
"N": "29.99"
}
},
{
"name": {
"S": "Ultimate Gadget"
},
"description": {
"S": "The Ultimate Gadget is the perfect tool for tech enthusiasts looking for the next level in gadgetry. Compact, powerful, and loaded with features."
},
"id": {
"S": "prod-2004"
},
"price": {
"N": "49.99"
}
}
],
"Count": 2,
"ScannedCount": 2,
"ConsumedCapacity": null
}
```
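
The scan output uses DynamoDB's typed attribute format, where each value is wrapped in a type tag such as `S` (string) or `N` (number, sent as a string). To check the result programmatically, a small converter like the one below can help. It is a minimal sketch that handles only the two tags appearing in the output above, and `unmarshal_value` and `unmarshal_item` are hypothetical helper names.

```python
from decimal import Decimal


def unmarshal_value(attr: dict):
    """Convert one DynamoDB-typed attribute such as {"S": "x"} into a Python value."""
    ((type_tag, raw),) = attr.items()  # each attribute carries exactly one type tag
    if type_tag == "S":
        return raw
    if type_tag == "N":
        return Decimal(raw)  # numbers arrive as strings; Decimal avoids float rounding
    raise ValueError(f"unsupported attribute type: {type_tag}")


def unmarshal_item(item: dict) -> dict:
    """Convert a whole item from typed attributes to plain Python values."""
    return {name: unmarshal_value(attr) for name, attr in item.items()}


# Abbreviated copy of the scan output above (descriptions omitted for brevity).
scan_result = {
    "Items": [
        {"id": {"S": "prod-1003"}, "name": {"S": "Super Widget"}, "price": {"N": "29.99"}},
        {"id": {"S": "prod-2004"}, "name": {"S": "Ultimate Gadget"}, "price": {"N": "49.99"}},
    ],
    "Count": 2,
}

products = {p["id"]: p for p in map(unmarshal_item, scan_result["Items"])}
print(sorted(products))
print(products["prod-1003"]["price"])
```

Both product IDs being present confirms that the queued item was written once the outage ended.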

### Introducing network latency

The LocalStack Chaos API can also introduce network latency for all connections.
This can be done with the following configuration:

```bash
curl --location --request POST 'http://localhost.localstack.cloud:4566/_localstack/chaos/effects' \
--header 'Content-Type: application/json' \
--data '{
"latency": 5000
}'
```

With this configured, you can use the same sample stack to observe and understand the effects of a 5-second delay on each service call.

```bash
curl --location 'http://12345.execute-api.localhost.localstack.cloud:4566/dev/productApi' \
--max-time 2 \
--header 'Content-Type: application/json' \
--data '{
"id": "prod-1088",
"name": "Super Widget",
"price": "29.99",
"description": "A versatile widget that can be used for a variety of purposes. Durable, reliable, and affordable."
}'
```

```bash title="Output"
An error occurred (InternalError) when calling the GetResources operation (reached max retries: 4)
```
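
When the injected latency is longer than the client timeout, as with the 5-second delay and `--max-time 2` above, every attempt times out and retries alone cannot help. The sketch below illustrates this with a simulated call rather than a real network request. `call_with_retries` and `make_slow_operation` are hypothetical helpers, and the retry count and delays are arbitrary assumptions.

```python
import time


def call_with_retries(operation, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `operation`, retrying on TimeoutError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)


def make_slow_operation(latency_s: float, timeout_s: float):
    """Simulate a call that times out whenever the latency exceeds the client timeout."""

    def operation():
        if latency_s > timeout_s:
            raise TimeoutError(f"no response within {timeout_s}s")
        return "Product added/updated successfully."

    return operation


delays = []  # record backoff delays instead of actually sleeping
try:
    call_with_retries(make_slow_operation(5.0, 2.0), sleep=delays.append)
except TimeoutError as exc:
    print("gave up:", exc, "| backoff delays:", delays)

# A timeout longer than the injected latency lets the call succeed on the first try.
print(call_with_retries(make_slow_operation(5.0, 10.0), sleep=delays.append))
```

In practice this suggests choosing client timeouts from the latency you expect under degraded conditions and using retries only for transient failures.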