The links below all give information on traffic count and major road data.
- Data: https://storage.googleapis.com/dft-statistics/road-traffic/downloads/data-gov-uk/dft_traffic_counts_aadf_by_direction.zip
- Metadata: https://storage.googleapis.com/dft-statistics/road-traffic/all-traffic-data-metadata.pdf
- General info: https://roadtraffic.dft.gov.uk/about
I will create (and deploy) a web service/API that allows a user to navigate this data.
### Requirements
**BEWARE:** the following guide assumes the user is able to decide what is best for them and knows how to work with scripts, errors, debugging, etc. This is by no means polished code!
I created a Jupyter notebook, data_explore, to visualize the input data contained in traffic_data.csv. From this exploration I defined a few models that allow counting and filtering the data.
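For orientation, a quick pandas exploration along these lines shows the kind of counting and filtering the API later exposes (column names such as `year` and `road_category` follow the DfT AADF schema and may need adjusting to the actual file):

```python
# Minimal exploration sketch, assuming traffic_data.csv is a local copy of the
# AADF-by-direction data; column names may need adjusting to the actual file.
import pandas as pd

df = pd.read_csv("traffic_data.csv")

print(df.shape)               # rows x columns
print(df.columns.tolist())    # available fields

# Example count/filter, similar to what the API exposes:
ta_roads = df[df["road_category"] == "TA"]
print(ta_roads.groupby("year").size())   # observations per year for TA-category roads
```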
The app can be run either locally or inside a Docker container.
To run the code, first create a virtual environment. I used Poetry to manage dependencies:

```sh
poetry install
poetry shell
```

or, if you prefer:

```sh
python -m venv .venv
source .venv/bin/activate
poetry install
```
**BEWARE:** the project requires pandas. The Docker image I am currently using already contains pandas v1.4.2, but if you are not running the app inside a container, remember to install pandas!
A Dockerfile and a docker-compose file are provided.
I created an nginx proxy (provided as a git submodule in the repo) because Django's built-in webserver is not designed for production use. Nginx is designed to be fast, efficient, and secure, so it is a better choice for handling incoming web requests when your website is on the public Internet and thus subject to large amounts of traffic (if you're lucky and your site takes off).
The first time, it is necessary to build the Docker image:

```sh
docker build -f Dockerfile-postgres-pandas-numpy.dockerfile -t <image-name>:<tag> .
docker tag <image-name>:<tag> <your-username>/<image-name>:<tag>
docker login
docker push <your-username>/<image-name>:<tag>
```

Then edit the Dockerfile so it points to `<your-username>/<image-name>:<tag>`.
I also provide a script (docker-task.sh) that can help speed up building, running, and pushing images to AWS. The user is free to explore that script (a help message is provided) or just run:

```sh
docker-compose build
```

and then

```sh
docker-compose up
```

and finally open a browser and navigate to 127.0.0.1:8000.
If you prefer to run locally while simulating the production environment with the proxy, run:

```sh
docker-compose -f docker-compose-proxy.yml up
```

and then navigate to 127.0.0.1:8000.
To run the tests:

```sh
docker-compose run --rm app sh -c "python manage.py wait_for_db && pytest API/tests"
```
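For orientation, a test in API/tests might look something like the sketch below (the endpoint path and filter name mirror the curl example further down; the real tests may be organized differently):

```python
# Hypothetical test sketch; requires pytest-django and DRF.
import pytest
from rest_framework.test import APIClient


@pytest.mark.django_db
def test_count_endpoint_accepts_road_category_filter():
    client = APIClient()
    response = client.get("/count/", {"road__category__name": "TA"})
    assert response.status_code == 200
```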
The very first time the app is run, the database will be empty. To populate it:

```sh
docker-compose -f docker-compose-proxy.yml run --rm app sh -c "python manage.py populate_db --num 1000 --print True"
```

This reads the input data from its remote location and populates the database with the first 1000 rows. Informative output is shown in the terminal.
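Conceptually, a `populate_db`-style management command does something like the sketch below; the actual command in the repo, its model names, and its column-to-field mapping may differ.

```python
# Sketch of a populate_db-style management command (hypothetical; the real
# implementation in the repo may differ).
import pandas as pd
from django.core.management.base import BaseCommand

# Remote AADF-by-direction download (a zip containing a single CSV).
DATA_URL = (
    "https://storage.googleapis.com/dft-statistics/road-traffic/downloads/"
    "data-gov-uk/dft_traffic_counts_aadf_by_direction.zip"
)


class Command(BaseCommand):
    help = "Populate the database with the first --num rows of the remote AADF data"

    def add_arguments(self, parser):
        parser.add_argument("--num", type=int, default=1000)
        parser.add_argument("--print", type=bool, default=False)

    def handle(self, *args, **options):
        df = pd.read_csv(DATA_URL, nrows=options["num"])
        for record in df.to_dict(orient="records"):
            # Create the corresponding model instances here; the exact mapping
            # depends on the models defined in the app.
            if options["print"]:
                self.stdout.write(str(record))
```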
It is also possible to run the app without Docker. After setting up the local environment, just run:

```sh
./run.sh
```

This will make the migrations, apply them, run the unit tests, populate the database, and run the app.
To deploy to AWS I chose to use GitLab CI (the repo also contains GitHub workflows if the user prefers those) and Terraform, in conjunction with Docker. To do so I created a small script, ./aws_scripts.sh, with a set of instructions to create the policies, users, and some of the resources that the infrastructure needs. This script needs to be run once, before deploying to AWS. After those are created and you have pushed the proxy and app images to AWS ECR, you are ready to run Terraform.
Before doing so, it is necessary to store the AWS credentials as environment variables. I chose to use aws-vault, as it provides additional security by creating ephemeral credentials that last at most 12 hours. In my case I run:

```sh
aws-vault exec <myuser> --duration=12h
```

or you can choose to export:
- AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
- AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
- AWS_SESSION_TOKEN=${AWS_SESSION_TOKEN}
Furthermore, it is necessary to edit the deploy/terraform/variables.tf file and change:
variable "dns_zone_name" {
description = "Domain name"
default = "<yourdomain>"
}
variable "ecr_image_api" {
description = "ECR Image for API"
default = "<youraccountid>.dkr.ecr.us-east-1.amazonaws.com/traffic-django-restapi-app:latest"
}
variable "ecr_image_proxy" {
description = "ECR Image for API"
default = "<youraccountid>.dkr.ecr.us-east-1.amazonaws.com/traffic-django-restapi-proxy:latest"
}
To make life easy, the repo contains a Makefile with some aliases for the Terraform commands. Without going into the details, the user has to create a workspace (dev, staging, prod, ...), then initialize, plan, and apply. The `apply` command will create the resources.
A deployment can also be triggered using GitLab CI; a .gitlab-ci.yml file is provided in the repo. Here the user can tag releases and branches.
If everything is successful, you can access the API at `api.<workspace>.<yourdns>`.
Once the URL is opened, a user can filter the data by clicking on the filter button and then using the dropdown menus or entering case-sensitive strings in the text fields.

### curl

```sh
curl -X GET "https://api.<workspace>.<yourdns>/count/?road__category__name=TA" -H "accept: application/json"
```

Any combination of the following filter fields can be used (a Python example follows the list):

- road__road__name
- date__year
- location__count_point_ref
- road__category__name
- road__junc_start__name
- road__junc_end__name
- road__direction__name
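The same queries can also be issued from Python, for example with requests (the base URL below is a placeholder; use 127.0.0.1:8000 locally or the deployed hostname):

```python
# Query the count endpoint with a couple of filters.
# BASE_URL is a placeholder: use http://127.0.0.1:8000 locally,
# or https://api.<workspace>.<yourdns> once deployed.
import requests

BASE_URL = "http://127.0.0.1:8000"

params = {
    "road__category__name": "TA",
    "date__year": 2019,
}
response = requests.get(
    f"{BASE_URL}/count/",
    params=params,
    headers={"accept": "application/json"},
)
response.raise_for_status()
print(response.json())
```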
Upon deployment the database will be empty. It can be populated from the bastion host: the user can use the keypair generated earlier to SSH into the instance (or use the AWS console), then run:

```sh
$(aws ecr get-login --no-include-email --region <your-aws-region>)
docker run -it \
    -e DB_HOST=<DB_HOST> \
    -e DB_NAME=<DB_NAME> \
    -e DB_USER=<DB_USER> \
    -e DB_PASS=<DB_PASS> \
    <ECR_REPO>:latest \
    sh -c "python manage.py populate_db --num 999 --print True"
```
The API at this stage assumes that the input data (a large, roughly 170 MB file) is read from a remote location. There is room for improvement. For example:
- the data could be stored in S3 and accessed from there (see the sketch after this list);
- better still, the data could be ingested by a Lambda function that checks for changes in S3 and then populates a DynamoDB table, used further down the line to store columnar data ready to be pushed into the RDS database;
- another option would be a data ingestion pipeline using Firehose, again storing into S3;
- allow deployment using CloudFormation (the application won't work with the current templates).
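As a minimal sketch of the first point, pandas can read the file straight from S3 (the bucket and key below are placeholders, and the s3fs package plus AWS credentials are required):

```python
# Sketch: read the input data from S3 instead of the public download URL.
# The bucket/key are placeholders; requires s3fs and valid AWS credentials.
import pandas as pd

df = pd.read_csv(
    "s3://<your-bucket>/dft_traffic_counts_aadf_by_direction.csv",
    nrows=1000,
)
print(df.head())
```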
Another area of improvement is the Django models and filters. At this stage I created a set of simple models to characterize the data; many columns have been left aside and are not used. Framing the models differently could therefore improve the overall filtering and counting capabilities of the API.
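For illustration, the filter names above suggest models roughly along the lines of the sketch below (a simplified, hypothetical version using django-filter and DRF; the actual models, serializers, and filters in the repo are richer and may be organized differently):

```python
# Simplified, hypothetical sketch (e.g. living in API/models.py / API/views.py);
# the real models and filters in the repo may differ.
from django.db import models
from django_filters.rest_framework import DjangoFilterBackend
from rest_framework import serializers, viewsets


class Category(models.Model):
    name = models.CharField(max_length=10)      # e.g. "TA"


class Direction(models.Model):
    name = models.CharField(max_length=5)       # e.g. "N", "S"


class Road(models.Model):
    name = models.CharField(max_length=50)
    category = models.ForeignKey(Category, on_delete=models.CASCADE)
    direction = models.ForeignKey(Direction, on_delete=models.CASCADE)


class Date(models.Model):
    year = models.IntegerField()


class Count(models.Model):
    road = models.ForeignKey(Road, on_delete=models.CASCADE)
    date = models.ForeignKey(Date, on_delete=models.CASCADE)
    value = models.IntegerField()


class CountSerializer(serializers.ModelSerializer):
    class Meta:
        model = Count
        fields = "__all__"


class CountViewSet(viewsets.ReadOnlyModelViewSet):
    queryset = Count.objects.all()
    serializer_class = CountSerializer
    filter_backends = [DjangoFilterBackend]
    # More CSV columns could be modelled and exposed as filters here.
    filterset_fields = ["road__name", "road__category__name",
                        "road__direction__name", "date__year"]
```

Modelling more of the CSV columns and exposing them through the filter set would widen what the API can count and filter.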
Finally, the API at this stage is public. The Terraform architecture already provides a bastion host for admin access, so in the future one could think of implementing a user model in the Django app to regulate access and expose only certain counting features or filters, leaving admin or other super-users with private endpoints.
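As a rough sketch of that direction, DRF permission classes could keep the read-only counting endpoints public while restricting anything else to staff users (the class name here is illustrative, not the app's actual code):

```python
# Illustrative sketch: public read access, staff-only writes/admin actions.
# Names are hypothetical and not taken from the repo.
from rest_framework import permissions


class PublicReadAdminWrite(permissions.BasePermission):
    """Allow anyone to read; only staff users may modify data."""

    def has_permission(self, request, view):
        if request.method in permissions.SAFE_METHODS:
            return True
        return bool(request.user and request.user.is_staff)


# Attach it to a viewset, e.g.:
#   class CountViewSet(viewsets.ModelViewSet):
#       permission_classes = [PublicReadAdminWrite]
```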