In this project I have built an app that helps users avoid wasting time in crowds. Imagine we had data about people's locations. It would be nice to use that data to detect clusters of population. The app takes the user's location of interest in San Francisco and a search radius, separated by a comma. The result shows the clusters of people and can look like this:
Here are the details of how I approached this problem:
- Data Collection: Since such data is not available to me, I engineered it. I took Yelp data containing the locations of restaurants in San Francisco and performed a random walk to create more points around the restaurants.
- Here is what my pipeline looks like:
I use streaming k-means in the Spark Streaming environment. Two indices are created in Elasticsearch. One contains data about people's locations and is updated once a day because of limited storage. The other contains the locations of 1,000,000 people and is updated every 3 minutes.
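To make the data generation and clustering steps above more concrete, here is a minimal pure-Python sketch. It is not the Spark code used in this project. It only illustrates the idea of scattering points around seed locations with a random walk and then updating cluster centers one batch at a time. The seed coordinates, the step size, the `decay` parameter, and the function names (`walk_from`, `make_batch`, `update_centers`) are all assumptions made for illustration.

```python
import math
import random

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude


def walk_from(seed, steps, step_m, rng):
    """Return the (lat, lon) reached after a random walk starting at seed."""
    lat, lon = seed
    for _ in range(steps):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        # Convert a metre-scale step to degrees (rough, but fine at city scale).
        lat += step_m * math.sin(angle) / M_PER_DEG_LAT
        lon += step_m * math.cos(angle) / (M_PER_DEG_LAT * math.cos(math.radians(lat)))
    return lat, lon


def make_batch(seeds, people_per_seed, steps, step_m, rng):
    """Simulate one batch of people scattered around the seed locations."""
    batch = [walk_from(s, steps, step_m, rng)
             for s in seeds for _ in range(people_per_seed)]
    rng.shuffle(batch)
    return batch


def nearest(centers, p):
    """Index of the center closest to p (squared distance in degrees)."""
    return min(range(len(centers)),
               key=lambda i: (centers[i][0] - p[0]) ** 2 + (centers[i][1] - p[1]) ** 2)


def update_centers(centers, weights, batch, decay=1.0):
    """One streaming k-means step: assign the batch, then blend old and new.

    Each center becomes the weighted mean of its (decayed) history and the
    batch points assigned to it. decay=1.0 keeps all history.
    """
    k = len(centers)
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for p in batch:
        i = nearest(centers, p)
        sums[i][0] += p[0]
        sums[i][1] += p[1]
        counts[i] += 1

    new_centers, new_weights = [], []
    for i in range(k):
        old_w = weights[i] * decay
        total = old_w + counts[i]
        if total == 0:
            # No history and no new points: leave this center where it is.
            new_centers.append(centers[i])
            new_weights.append(0.0)
            continue
        new_centers.append((
            (centers[i][0] * old_w + sums[i][0]) / total,
            (centers[i][1] * old_w + sums[i][1]) / total,
        ))
        new_weights.append(total)
    return new_centers, new_weights


rng = random.Random(42)
# Hypothetical seed locations standing in for restaurant coordinates.
seeds = [(37.7749, -122.4194), (37.8024, -122.4058)]
batches = [make_batch(seeds, 50, 10, 30.0, rng) for _ in range(5)]

centers = batches[0][:2]  # naive initialization: first k points seen
weights = [0.0, 0.0]
for batch in batches:
    centers, weights = update_centers(centers, weights, batch)
print(centers, weights)
```

Setting `decay` below 1.0 makes older batches count for less, which can help the centers follow crowds that move over time. The initialization here is deliberately naive, so a real run would likely want a better starting point.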
Some engineering challenges:
- Tuning Kafka, Spark Streaming, and Elasticsearch so that the map updates as quickly as possible. In particular, batch intervals have to be tuned carefully to avoid situations where the map shows no points.
- Choosing k, the number of clusters.