Malware URL Lookup Coding Exercise

This is a coding exercise assigned to me by the Engineering Manager of the Cisco Cloud Security Team as part of the interview process.

Problem Statement

We have an HTTP proxy that is scanning traffic, looking for malware URLs. Before allowing HTTP connections to be made, this proxy asks a service that maintains several databases of malware URLs if the resource being requested is known to contain malware.

Write a small web service that responds to GET requests where the caller passes in a URL and the service responds with some information about that URL. The GET requests would look like this:

GET /v1/urlinfo/{resource_url_with_query_string}

The caller wants to know whether it is safe to access that URL.

Quickstart Guide

Ensure you have the required dependencies installed:

git
docker
make
go # only needed to run the example client

Build and run necessary containers:

cd malware-url-lookup
make build
make run

Perform URL Lookups or Updates

# Perform desired API calls
# A small client script is included for convenience
go run cmd/client/main.go

What to expect when you run the client:

  1. Uploads 5000 dummy URLs url0.com - url4999.com
    • Should return {"success":true}
  2. Checks for existence of URLs
    • url1.com
      • Should return {"safe":false}
    • url2500.com
      • Should return {"safe":false}
    • url4999.com
      • Should return {"safe":false}
  3. Checks for non-existence of URLs
    • url5000.com
      • Should return {"safe":true}

Clean up

make -i clean

API Schema

Check if URL is a known source of malware

GET /v1/urlinfo/{resource_url_with_query_string}

Returns an application/json response object indicating whether the URL is "safe".

    {
        "safe": true
    }
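
As a minimal sketch (not the repository's actual code), a lookup handler for this route might look like the following, using Go's net/http. The Store interface is a hypothetical stand-in for the backing database discussed under Bonus Features below.

    package main

    import (
        "encoding/json"
        "net/http"
        "strings"
    )

    // Store is a hypothetical stand-in for the backing database
    // (see Bonus Features below for the Redis-based choice).
    type Store interface {
        IsMalicious(url string) (bool, error)
    }

    func lookupHandler(store Store) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            // Everything after the route prefix is the URL to check.
            target := strings.TrimPrefix(r.URL.Path, "/v1/urlinfo/")
            if target == "" {
                http.Error(w, "missing url", http.StatusBadRequest)
                return
            }
            // Reattach the query string, which net/http parses off the path.
            if r.URL.RawQuery != "" {
                target += "?" + r.URL.RawQuery
            }
            malicious, err := store.IsMalicious(target)
            if err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(map[string]bool{"safe": !malicious})
        }
    }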

Submit URL(s) that are a known source of malware

POST /v1/urlinfo/

Accepts URLs to be added to the malware database. Expects them as an application/json object in the body of a POST request, as defined below:

    {
        "urls": [
            "url1.com",
            "url2.com",
            ...
            "urln.com"
        ]
    }
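
Continuing the sketch above, an update handler for this route might look like the following; again this is illustrative rather than the repository's actual code. updateRequest mirrors the JSON body just defined, and WritableStore is a hypothetical companion to the Store interface from the lookup sketch.

    // WritableStore is a hypothetical companion to Store that accepts writes.
    type WritableStore interface {
        AddMalicious(urls []string) error
    }

    // updateRequest mirrors the JSON body defined above.
    type updateRequest struct {
        URLs []string `json:"urls"`
    }

    func updateHandler(store WritableStore) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            var req updateRequest
            if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
                http.Error(w, "invalid JSON body", http.StatusBadRequest)
                return
            }
            if err := store.AddMalicious(req.URLs); err != nil {
                http.Error(w, err.Error(), http.StatusInternalServerError)
                return
            }
            // Matches the {"success":true} response the client expects.
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(map[string]bool{"success": true})
        }
    }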

Bonus Features

The size of the URL list could grow infinitely; how might you scale this beyond the memory capacity of the system? Bonus if you implement this.

There are two dimensions to this question: the first is reducing the storage footprint/memory requirements of the system, and the second is, given an optimized system, how does it scale?

For the latter, my choice would be to use a distributed cloud-based database, which would scale automatically with the needs of the system. I will tackle this if I have time.

For the former, I made the design decision to ONLY include known malicious URLs in the database, since their non-existence in the database implies their safety.

I also took some time to consider the database best suited for this job. A relational database (e.g. MySQL) is not needed (at time of writing) since we need not know any details about the URL in question. We need only a single column, yet even a columnar database (e.g. Cassandra) is suboptimal, since it strives to make data and data ranges queryable. We are looking for something that offers a one-to-one "does it exist" type of functionality. Our data is not suited to a Graph Database, nor do we need the flexibility of a Document Database. The best choice for this problem is a Key-Value Database.

I have further refined this selection to my final choice, Redis, because it is well suited to running as a container, and because it keeps its data in memory (i.e. RAM), a good fit for this highly latency-sensitive application.
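
To make the choice concrete, here is a sketch of how a Redis-backed store could satisfy the hypothetical Store and WritableStore interfaces from the API sketches above, assuming the go-redis client. A single Redis set gives exactly the "does it exist" check described.

    import (
        "context"

        "github.com/redis/go-redis/v9"
    )

    // redisStore keeps known-malicious URLs in a single Redis set.
    type redisStore struct {
        rdb *redis.Client
        key string // e.g. "malware_urls"
    }

    func (s *redisStore) IsMalicious(url string) (bool, error) {
        // SISMEMBER is O(1): exactly the "does it exist" lookup we want.
        return s.rdb.SIsMember(context.Background(), s.key, url).Result()
    }

    func (s *redisStore) AddMalicious(urls []string) error {
        if len(urls) == 0 {
            return nil
        }
        members := make([]interface{}, len(urls))
        for i, u := range urls {
            members[i] = u
        }
        // SADD is variadic, so even a 5000-URL batch is one round trip.
        return s.rdb.SAdd(context.Background(), s.key, members...).Err()
    }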

The number of requests may exceed the capacity of this system; how might you solve that? Bonus if you implement this.

This question is similar to the last, in that we should examine the problem from the same two angles. Firstly, for the system to function at optimal capacity, it needs to handle requests concurrently (i.e. non-blocking) so that it can accept incoming requests while still processing prior ones. Caching should also be added for efficient retrieval.

A nuance of our caching system is that we are only storing the URLs which are a known source of malware in the database. HOWEVER, to truly optimize the system, the cache should also contain commonly requested NON-malicious URLs, since this will help to reduce latency overall.

A further optimization may be to use separate caches for malicious and non-malicious URLs: depending on the source of the URLs, one type may be more common than the other and would otherwise out-compete it in a shared cache.
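
As an illustrative sketch of the caching idea (not implemented in the repo), a small in-process map in front of Redis could hold recent verdicts for both malicious and non-malicious URLs; the crude clear-when-full eviction below stands in for a proper LRU.

    import "sync"

    // verdictCache caches recent lookup verdicts, including verdicts for
    // NON-malicious URLs, so hot safe URLs skip the Redis round trip.
    type verdictCache struct {
        mu       sync.RWMutex
        verdicts map[string]bool // url -> malicious?
        max      int
    }

    func newVerdictCache(max int) *verdictCache {
        return &verdictCache{verdicts: make(map[string]bool, max), max: max}
    }

    func (c *verdictCache) get(url string) (malicious, ok bool) {
        c.mu.RLock()
        defer c.mu.RUnlock()
        malicious, ok = c.verdicts[url]
        return
    }

    func (c *verdictCache) put(url string, malicious bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if len(c.verdicts) >= c.max {
            // Crude eviction: drop everything; a real cache would use an LRU.
            c.verdicts = make(map[string]bool, c.max)
        }
        c.verdicts[url] = malicious
    }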

Secondly, given the system is already optimized, my answer would be to load balance between distributed instances of the application running on different devices, for example Docker containers orchestrated by Kubernetes.

What are some strategies you might use to update the service with new URLs? Updates may be as many as 5000 URLs a day with updates arriving every 10 minutes.

In my opinion, the best way to handle these variable-size, irregularly timed updates would be to receive them via an API. For the sake of the example, I will assume they come from a trusted, vetted source. To maximize the system's efficiency, this "Update API" could be decoupled from the previously defined "Lookup API", allowing them to scale independently and without resource competition.
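
A sketch of what that decoupling might look like, reusing the hypothetical handlers and store sketched earlier. In a real deployment the two APIs would be separate services (e.g. separate containers); separate listeners in one process merely illustrate the split, and the ports are illustrative.

    import (
        "log"
        "net/http"

        "github.com/redis/go-redis/v9"
    )

    func main() {
        store := &redisStore{
            rdb: redis.NewClient(&redis.Options{Addr: "localhost:6379"}),
            key: "malware_urls",
        }

        // Lookup API: latency-sensitive read path, scaled for query volume.
        lookupMux := http.NewServeMux()
        lookupMux.Handle("/v1/urlinfo/", lookupHandler(store))

        // Update API: write path for trusted feeds, scaled independently.
        updateMux := http.NewServeMux()
        updateMux.Handle("/v1/urlinfo/", updateHandler(store))

        go func() { log.Fatal(http.ListenAndServe(":8081", updateMux)) }()
        log.Fatal(http.ListenAndServe(":8080", lookupMux))
    }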

Next Steps and Future Improvements

  • Fully-featured client allowing for user-defined uploads/queries
  • Unit testing
  • Decouple GET and POST APIs into discrete microservices
  • Kubernetes manifests to help easily deploy at scale
