Finding code similarities based on Code2Vec model

This extension to the Code2Vec model enables finding the most similar code snippets in the given code database.

README of the original Code2Vec code can be found at CODE2VEC_README.md.

Requirements

python (3.6)
TensorFlow - version 2.0.0
If you are using a GPU, you will need CUDA 10.0 and cuDNN (>=7.5)
Java JDK

Usage

Training a model

You can either download an already-trained model, or train a new model using a preprocessed dataset.

Information about the training process can be found in the Code2Vec README. To download the trained model we can use:

wget https://s3.amazonaws.com/code2vec/model/java14m_model.tar.gz
tar -xvzf java14m_model.tar.gz

Preprocessing the dataset

In order to create and preprocess a new dataset (some raw datasets are provided by the authors of the original code):

Edit the file preprocess_for_extracting.sh using the instructions there, pointing it to the data directory.
Run the preprocess_for_extracting.sh file

./preprocess_for_extracting.sh

Extracting code vectors

To extract names and code vectors of the preprocessed data we need to run:

python code2vec.py --load *path_to_model* --ev *path_to_data*

for example:

python code2vec.py --load models/java14_model/saved_model_iter8.release -ev data/smallest/smallest.data.c2v

Manually extraction

To manually get closes code functions based on a trained model, run:

python code2vec.py --load *path_to_model* --interactive_extract

After the model loads, follow the instructions and edit the file Input.java and enter a Java method or code snippet.

Example result:

Original name:  f
    (0.921215) predicted: ['test', 'swap']
    (0.032008) predicted: ['bubble', 'sort']
    (0.009010) predicted: ['visit', 'binary', 'spatial', 'operator']
Most similar functions:
    sort                dist:  0.5792298538826088
    sort|writable       dist:  0.626716160262489
    distance|to         dist:  0.639008344707509
    down|heap           dist:  0.6416091544528166
    swap                dist:  0.6422565405152388
    generate|rows       dist:  0.6526368586445036
    swap                dist:  0.6582794146808126
    to|json|array       dist:  0.6597630007702668
    main                dist:  0.6661443300050862
    move|to|next|queue  dist:  0.6737719875144312

Name		Name	Last commit message	Last commit date
Latest commit History 164 Commits
CSharpExtractor		CSharpExtractor
JavaExtractor		JavaExtractor
images		images
.gitignore		.gitignore
CODE2VEC_README.md		CODE2VEC_README.md
Input.java		Input.java
KMeans.ipynb		KMeans.ipynb
KMeans_centers.vectors		KMeans_centers.vectors
LICENSE		LICENSE
Model_sensitivity.ipynb		Model_sensitivity.ipynb
README.md		README.md
__init__.py		__init__.py
build_extractor.sh		build_extractor.sh
code2vec.py		code2vec.py
code_sensitivity.vectors		code_sensitivity.vectors
common.py		common.py
config.py		config.py
extractor.py		extractor.py
interactive_extract.py		interactive_extract.py
interactive_predict.py		interactive_predict.py
keras_attention_layer.py		keras_attention_layer.py
keras_checkpoint_saver_callback.py		keras_checkpoint_saver_callback.py
keras_model.py		keras_model.py
keras_topk_word_predictions_layer.py		keras_topk_word_predictions_layer.py
keras_word_prediction_layer.py		keras_word_prediction_layer.py
keras_words_subtoken_metrics.py		keras_words_subtoken_metrics.py
model_base.py		model_base.py
path_context_reader.py		path_context_reader.py
preprocess.py		preprocess.py
preprocess.sh		preprocess.sh
preprocess_csharp.sh		preprocess_csharp.sh
preprocess_for_evaluation.py		preprocess_for_evaluation.py
preprocess_for_evaluation.sh		preprocess_for_evaluation.sh
project_description.md		project_description.md
requirements.txt		requirements.txt
tensorflow_model.py		tensorflow_model.py
train.sh		train.sh
vocabularies.py		vocabularies.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Finding code similarities based on Code2Vec model

Requirements

Usage

Training a model

Preprocessing the dataset

Extracting code vectors

Manually extraction

About

Uh oh!

Releases

Packages

Languages

License

MatMarkiewicz/code2vec

Folders and files

Latest commit

History

Repository files navigation

Finding code similarities based on Code2Vec model

Requirements

Usage

Training a model

Preprocessing the dataset

Extracting code vectors

Manually extraction

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages