This repository has been archived by the owner on Dec 17, 2022. It is now read-only.

GitHub Issue Classification Use Case

This end-to-end use case walks through a classification example with the Data Analytics Reference Stack and the Deep Learning Reference Stack, using data from GitHub* to show how machine learning can analyze and tag new issues. Automatic tagging saves time for developers who file issues and ensures each issue reaches the right audience.

Introduction

GitHub issues are currently classified by hand, so why not automate the tagging process? A simple ML algorithm can analyze an issue's content and tag it automatically, saving developers time and directing their attention to critical issues. This use case shows you how to do exactly that.

GitHub issue classifier architecture diagram

To run the use case locally, follow the steps below. You will preprocess the data with the DARS container (an optimized Spark container), train the model with the DLRS container (a deep learning container), serve it with rest.py, and run the frontend from the website folder.

If you would prefer a simple walkthrough in a Jupyter notebook, explore github-notebook.ipynb, a self-contained and simplified version of this use case. Instructions are below under "Training the Model using DLRS and Jupyter Notebooks".

Installation

To install and use the Data Analytics Reference Stack (DARS), refer here

To install and use the Deep Learning Reference Stack (DLRS), refer here

Table of contents

  • data
    • Where the raw and cleaned data are stored
  • kubeflow
    • All cloud uses and implementations
  • models
    • Where the machine learning and vectorizer models are stored
  • scripts
    • A bash script to retrieve the data and a Scala script to process it
  • website
    • A Flask-based server that provides a front end on the host for interacting with the model
  • Dockerfile
    • Builds an image based on DLRS that automatically runs rest.py
  • Makefile
    • make targets for the Dockerfile
  • config.make
    • Configuration for the Makefile
  • github-notebook.ipynb
    • A user-friendly walkthrough and explanation of what happens in train.py
  • requirements.txt
    • Requirements needed in the DLRS image during inference
  • rest.py
    • Runs a RESTful API server that receives issue content and returns labels
  • train.py
    • Trains the model

Local Container Walkthrough

Clone this repo and pull the DARS container:

git clone https://github.com/intel/stacks-usecase
docker pull clearlinux/stacks-dars-mkl:latest
cd stacks-usecase/github-issue-classification
docker run -p 8888:8888 -it --ulimit nofile=1000000:1000000 -v ${PWD}:/workdir clearlinux/stacks-dars-mkl bash

Prepare the Spark environment

In this section we will prepare our Spark environment using DARS.

First, create the output directories if they don't exist:

cd /workdir
mkdir /data
mkdir /data/raw

Make the "get-data.sh" script executable and run it to retrieve Clear Linux issues data:

cd /workdir/scripts
chmod u+x get-data.sh
./get-data.sh
cd /workdir

Note that you must be in the /workdir directory before starting Spark.
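For intuition, get-data.sh presumably pages through the GitHub issues API and saves each page as raw JSON into data/raw. A minimal Python sketch of the request URL it would build (the repository name here is an illustrative assumption, not necessarily the one the script uses):

```python
# Hypothetical sketch of what scripts/get-data.sh fetches: one page of
# issues from the GitHub REST API. Owner/repo below are assumptions.

def issues_url(owner, repo, page, per_page=100):
    """Build a GitHub issues API URL for one page of results."""
    return (f"https://api.github.com/repos/{owner}/{repo}/issues"
            f"?state=all&per_page={per_page}&page={page}")

# Example: the first page of issues for a hypothetical clearlinux repo.
print(issues_url("clearlinux", "distribution", 1))
# → https://api.github.com/repos/clearlinux/distribution/issues?state=all&per_page=100&page=1
```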

Run the Spark shell:

spark-shell

Process the data

  1. Import SparkSession and create a Spark session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("github-issue-classification").getOrCreate()
import spark.implicits._
  2. Load the data into a Spark dataframe
var df = spark.read.option("multiline", true).json("file:///workdir/data/raw/*.json")
  3. Select the body, id, and label name columns
df = df.select(col("body"), col("id"), col("labels.name"))
  4. Explode the labels column to prepare for filtering the top labels
var df2 = df.select(col("id"),explode(col("name")).as("labels"))
  5. Order the labels by frequency and keep the top 10
var df3 = df2.select("labels").groupBy("labels").count().orderBy(col("count").desc).limit(10).select("labels")
  6. Turn the top labels into a list (for the next step)
var list = df3.select("labels").map(r => r.getString(0)).collect.toList
  7. Filter to the top labels
df2 = df2.filter($"labels".isin(list:_*))
  8. Recombine the top labels per issue id
df2 = df2.groupBy("id").agg(collect_set("labels").alias("labels"))
  9. Join on id to pair each body with its top labels
df = df.join(df2, "id").select("body","labels")
  10. Save the tidy data
df.write.json("file:///workdir/data/tidy/")

Or, from within the spark shell run:

:load -v scripts/proc-data.scala

The proc-data.scala script performs steps 2 through 10 described above.
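To make the intent of steps 2-10 concrete without Spark, here is a plain-Python sketch of the same transformation: count label frequency, keep only the most frequent labels, and join them back to the issue bodies. The sample issues are made up for illustration (with so few labels, the top-10 cut keeps everything).

```python
from collections import Counter

# Illustrative stand-in for the raw GitHub issues data.
issues = [
    {"id": 1, "body": "crash on boot", "labels": ["bug", "kernel"]},
    {"id": 2, "body": "add dark mode", "labels": ["enhancement"]},
    {"id": 3, "body": "boot hangs", "labels": ["bug"]},
]

# Steps 4-6: "explode" the labels and rank them by frequency.
counts = Counter(label for issue in issues for label in issue["labels"])
top_labels = {label for label, _ in counts.most_common(10)}

# Steps 7-9: keep only top labels per issue, drop issues with none left.
tidy = [
    {"body": i["body"], "labels": [l for l in i["labels"] if l in top_labels]}
    for i in issues
]
tidy = [row for row in tidy if row["labels"]]
print(tidy)
```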

Train a model using DLRS

In this section we will train a model using DLRS in preparation for serving it.

  1. If you have not done so already, clone the use case repo into your local workspace
git clone https://github.com/intel/stacks-usecase
cd stacks-usecase/github-issue-classification
  2. Pull and run the Deep Learning Reference Stack (DLRS)
docker pull clearlinux/stacks-dlrs-mkl
docker run -it -v ${PWD}:/workdir clearlinux/stacks-dlrs-mkl
  3. Navigate to the github use case and install requirements
cd /workdir/docker
pip install -r requirements_train.txt
  4. Create the output directory
mkdir /workdir/models
  5. Run the training script
cd /workdir/python
python train.py

That's it! DLRS does not require changes to your code. Once the environment is set up (steps 1-4), a single call runs your code as expected while utilizing Intel optimizations. This is the base functionality of DLRS, and most implementations build on this example.
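The task train.py solves is multi-label text classification: issue text in, a set of labels out. The real model in train.py is not reproduced here; as a toy sketch of the idea only, here is a tiny bag-of-words classifier with made-up training data and a hypothetical overlap threshold:

```python
from collections import Counter, defaultdict

# Made-up training pairs of (issue body, labels) for illustration.
train = [
    ("kernel panic during boot", ["bug"]),
    ("please add a dark theme", ["enhancement"]),
    ("segfault crash in installer", ["bug"]),
]

# "Training": accumulate a word-frequency profile per label.
profiles = defaultdict(Counter)
for body, labels in train:
    for label in labels:
        profiles[label].update(body.split())

def predict(body, min_overlap=1):
    """Return every label whose profile shares enough words with the text."""
    words = set(body.split())
    return sorted(
        label for label, prof in profiles.items()
        if len(words & set(prof)) >= min_overlap
    )

print(predict("installer crash on boot"))  # → ['bug']
```

A real model would replace the word profiles with a learned vectorizer and classifier (the repo stores both under models/), but the input/output shape is the same.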

Serve the model

To run inference, we've set up a dedicated Dockerfile based on our image. It creates a RESTful API that communicates with a local Flask server to run live inference.

From your local system, navigate to the github-issue-classification folder; the Dockerfiles are stored inside the docker directory.

To build the training container, run:

make train

To build the inference container, run:

make infer

To finally deploy the model for inference using a high performance async REST server, run:

make infer_run

The server is built with the Quart web microframework and deployed with Hypercorn, an ASGI server, using uvloop.
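The contract rest.py exposes is simple: the frontend sends issue text as JSON and receives predicted labels as JSON. A minimal framework-free sketch of that request/response shape (the field names "body" and "labels" and the stub predictor are assumptions, not the actual rest.py schema):

```python
import json

def predict_labels(text):
    """Stand-in for the real model; returns a fixed label for illustration."""
    return ["bug"] if "crash" in text else []

def handle_request(raw_json):
    """Decode a JSON request body and encode the predicted labels as JSON."""
    body = json.loads(raw_json)["body"]
    return json.dumps({"labels": predict_labels(body)})

print(handle_request('{"body": "app crash on startup"}'))
# → {"labels": ["bug"]}
```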

Now run one last step in a second terminal:

cd ../website
flask run

This starts a Flask server on your local system. Open your favorite browser and navigate to localhost:5000 to see an interactive example of the GitHub issues use case. Copy or type any issue into the top left box and hit submit. The Flask server calls the REST API, which processes your input and returns the appropriate labels.

Training the Model using DLRS and Jupyter Notebooks

  1. Pull and run the Deep Learning Reference Stack (DLRS). You will need to mount your working directory and publish the Jupyter notebook port.
docker pull clearlinux/stacks-dlrs-mkl
docker run -it -v ${PWD}:/workdir -p 8888:8888 clearlinux/stacks-dlrs-mkl
  2. From within the container, navigate to the workspace, install sklearn, and start a Jupyter notebook bound to the published port. Make sure to copy the token from the output.
cd ../workdir
pip install sklearn
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
  3. Open a browser and navigate to localhost:8888. If the notebook asks for a token, paste the token from the previous step and submit. You now have a notebook running inside DLRS that can access your local files. We have a Jupyter notebook prebuilt for you.

NOTE: If you get a rate limit error when fetching the raw JSON files from the GitHub API, add the -u "<github username>" option to curl.

Mailing List

See our public mailing list page for details on how to contact us. You should only subscribe to the Stacks mailing lists using an email address that you don't mind being public.

Reporting Security Issues

Security issues can be reported to Intel's security incident response team via https://intel.com/security.