This repository has been archived by the owner on Dec 17, 2022. It is now read-only.

GitHub Issue Classification Use Case

This end-to-end use case walks through a classification example with the Data Analytics Reference Stack and the Deep Learning Reference Stack, using data from GitHub* to show how machine learning can analyze and tag new issues. Automatic tagging saves time for developers who file issues and ensures each issue reaches the right audience.

Introduction

GitHub issues are currently classified by hand, so why not automate the tagging process? A simple ML algorithm can analyze an issue's content and tag it automatically, saving developers time and directing their attention to critical issues. This use case shows you how to do exactly that.

GitHub issue classifier architecture diagram

To run the use case locally, follow the steps below. You will preprocess the data with the DARS container (an optimized Spark container), train the model with the DLRS container (a deep learning container), serve it with rest.py, and run the frontend from the website folder.

If you would prefer a simple walkthrough in a Jupyter notebook, explore github-notebook.ipynb, a self-contained and simplified version of this use case. Instructions are below under "Training the Model using DLRS and Jupyter Notebooks".

Installation

To install and use the Data Analytics Reference Stack (DARS), refer here

To install and use the Deep Learning Reference Stack (DLRS), refer here

Table of contents

  • data
    • Where the raw and cleaned data are stored
  • kubeflow
    • All cloud uses and implementations
  • models
    • Where the machine learning and vectorizer models are stored
  • scripts
    • A bash script to retrieve the data and a Scala script to process it
  • website
    • A Flask-based server that provides a front end on the host for interacting with the model
  • Dockerfile
    • Builds an image based on DLRS that automatically runs rest.py
  • Makefile
    • make targets for the Dockerfile
  • config.make
    • Configuration for the Makefile
  • github-notebook.ipynb
    • A user-friendly walkthrough and explanation of what happens in train.py
  • requirements.txt
    • Requirements needed in the DLRS image during inference
  • rest.py
    • Runs a RESTful API server that receives issue content and returns labels
  • train.py
    • Trains the model

Local Container Walkthrough

Clone this repo and pull the DARS container:

git clone https://github.com/intel/stacks-usecase
docker pull clearlinux/stacks-dars-mkl:latest
cd stacks-usecase/github-issue-classification
docker run -p 8888:8888 -it --ulimit nofile=1000000:1000000 -v ${PWD}:/workdir clearlinux/stacks-dars-mkl bash

Prepare the Spark environment

In this section we will prepare our Spark environment using DARS.

First, create the output directories if they don't exist:

cd /workdir
mkdir /data
mkdir /data/raw

Make the "get-data.sh" script executable and run it to retrieve Clear Linux issues data:

cd /workdir/scripts
chmod u+x get-data.sh
./get-data.sh
cd /workdir

Note that you must be in the /workdir directory before starting Spark.
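For intuition, get-data.sh presumably pages through the GitHub issues API and saves each page as raw JSON into data/raw. A minimal Python sketch of the request URL it would build (the repository name here is an illustrative assumption, not necessarily the one the script uses):

```python
# Hypothetical sketch of what scripts/get-data.sh fetches: one page of
# issues from the GitHub REST API. Owner/repo below are assumptions.

def issues_url(owner, repo, page, per_page=100):
    """Build a GitHub issues API URL for one page of results."""
    return (f"https://api.github.com/repos/{owner}/{repo}/issues"
            f"?state=all&per_page={per_page}&page={page}")

# Example: the first page of issues for a hypothetical clearlinux repo.
print(issues_url("clearlinux", "distribution", 1))
# → https://api.github.com/repos/clearlinux/distribution/issues?state=all&per_page=100&page=1
```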

Run the Spark shell:

spark-shell

Process the data

  1. Import SparkSession and create a Spark session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.appName("github-issue-classification").getOrCreate()
import spark.implicits._
  2. Load the data into a Spark dataframe
var df = spark.read.option("multiline", true).json("file:///workdir/data/raw/*.json")
  3. Select the body, id, and label name columns
df = df.select(col("body"), col("id"), col("labels.name"))
  4. Explode the labels column to prepare for filtering the top labels
var df2 = df.select(col("id"),explode(col("name")).as("labels"))
  5. Order the labels by frequency and keep the top 10
var df3 = df2.select("labels").groupBy("labels").count().orderBy(col("count").desc).limit(10).select("labels")
  6. Turn the top labels into a list (for the next step)
var list = df3.select("labels").map(r => r.getString(0)).collect.toList
  7. Filter to the top labels
df2 = df2.filter($"labels".isin(list:_*))
  8. Recombine the top labels per issue id
df2 = df2.groupBy("id").agg(collect_set("labels").alias("labels"))
  9. Join on id to pair each body with its top labels
df = df.join(df2, "id").select("body","labels")
  10. Save the tidy data
df.write.json("file:///workdir/data/tidy/")

Or, from within the spark shell run:

:load -v scripts/proc-data.scala

The proc-data.scala script performs steps 2 through 10 described above.
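To make the intent of steps 2-10 concrete without Spark, here is a plain-Python sketch of the same transformation: count label frequency, keep only the most frequent labels, and join them back to the issue bodies. The sample issues are made up for illustration (with so few labels, the top-10 cut keeps everything).

```python
from collections import Counter

# Illustrative stand-in for the raw GitHub issues data.
issues = [
    {"id": 1, "body": "crash on boot", "labels": ["bug", "kernel"]},
    {"id": 2, "body": "add dark mode", "labels": ["enhancement"]},
    {"id": 3, "body": "boot hangs", "labels": ["bug"]},
]

# Steps 4-6: "explode" the labels and rank them by frequency.
counts = Counter(label for issue in issues for label in issue["labels"])
top_labels = {label for label, _ in counts.most_common(10)}

# Steps 7-9: keep only top labels per issue, drop issues with none left.
tidy = [
    {"body": i["body"], "labels": [l for l in i["labels"] if l in top_labels]}
    for i in issues
]
tidy = [row for row in tidy if row["labels"]]
print(tidy)
```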

Train a model using DLRS

In this section we will train a model using DLRS in preparation for serving it.

  1. If you have not done so already, clone the use case repo into your local workspace
git clone https://github.com/intel/stacks-usecase
cd stacks-usecase/github-issue-classification
  2. Pull and run the Deep Learning Reference Stack (DLRS)
docker pull clearlinux/stacks-dlrs-mkl
docker run -it -v ${PWD}:/workdir clearlinux/stacks-dlrs-mkl
  3. Navigate to the github use case and install requirements
cd /workdir/docker
pip install -r requirements_train.txt
  4. Create the output directory
mkdir /workdir/models
  5. Run the training script
cd /workdir/python
python train.py

That's it! DLRS does not require changes to your code. Once the environment is set up (steps 1-4), a single call runs your code as expected while utilizing Intel optimizations. This is the base functionality of DLRS, and most implementations build on this example.
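The task train.py solves is multi-label text classification: issue text in, a set of labels out. The real model in train.py is not reproduced here; as a toy sketch of the idea only, here is a tiny bag-of-words classifier with made-up training data and a hypothetical overlap threshold:

```python
from collections import Counter, defaultdict

# Made-up training pairs of (issue body, labels) for illustration.
train = [
    ("kernel panic during boot", ["bug"]),
    ("please add a dark theme", ["enhancement"]),
    ("segfault crash in installer", ["bug"]),
]

# "Training": accumulate a word-frequency profile per label.
profiles = defaultdict(Counter)
for body, labels in train:
    for label in labels:
        profiles[label].update(body.split())

def predict(body, min_overlap=1):
    """Return every label whose profile shares enough words with the text."""
    words = set(body.split())
    return sorted(
        label for label, prof in profiles.items()
        if len(words & set(prof)) >= min_overlap
    )

print(predict("installer crash on boot"))  # → ['bug']
```

A real model would replace the word profiles with a learned vectorizer and classifier (the repo stores both under models/), but the input/output shape is the same.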

Serve the model

To run inference, we've set up a dedicated Dockerfile based on our image. It creates a RESTful API that communicates with a local Flask server to run live inference.

From your local system, navigate to the github-issue-classification folder; the Dockerfiles are stored inside the docker directory.

To build the training container, run:

make train

To build the inference container, run:

make infer

To finally deploy the model for inference using a high performance async REST server, run:

make infer_run

The server is built with the Quart web microframework and deployed with Hypercorn, an ASGI server, using uvloop.
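The contract rest.py exposes is simple: the frontend sends issue text as JSON and receives predicted labels as JSON. A minimal framework-free sketch of that request/response shape (the field names "body" and "labels" and the stub predictor are assumptions, not the actual rest.py schema):

```python
import json

def predict_labels(text):
    """Stand-in for the real model; returns a fixed label for illustration."""
    return ["bug"] if "crash" in text else []

def handle_request(raw_json):
    """Decode a JSON request body and encode the predicted labels as JSON."""
    body = json.loads(raw_json)["body"]
    return json.dumps({"labels": predict_labels(body)})

print(handle_request('{"body": "app crash on startup"}'))
# → {"labels": ["bug"]}
```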

Now run one last step in a second terminal:

cd ../website
flask run

This starts a Flask server on your local system. Open your favorite browser and navigate to localhost:5000 to see an interactive example of the GitHub issues use case. Copy or type any issue into the top left box and hit submit. The Flask server calls the REST API, which processes your input and returns the appropriate labels.

Training the Model using DLRS and Jupyter Notebooks

  1. Pull and run the Deep Learning Reference Stack (DLRS). You will need to mount your working directory and publish the Jupyter notebook port.
docker pull clearlinux/stacks-dlrs-mkl
docker run -it -v ${PWD}:/workdir -p 8888:8888 clearlinux/stacks-dlrs-mkl
  2. From within the container, navigate to the workspace, install sklearn, and start a Jupyter notebook bound to the published port. Make sure to copy the token from the output.
cd ../workdir
pip install sklearn
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root
  3. Open a browser and navigate to localhost:8888. If the notebook asks for a token, paste the token from the previous step and submit. You now have a notebook running inside DLRS that can access your local files. We have a Jupyter notebook prebuilt for you.

NOTE: If you get a rate limit error when fetching the raw JSON files from the GitHub API, add the -u "<github username>" option to curl.

Mailing List

See our public mailing list page for details on how to contact us. You should only subscribe to the Stacks mailing lists using an email address that you don't mind being public.

Reporting Security Issues

Security issues can be reported to Intel's security incident response team via https://intel.com/security.