PurCL/LLMSCAN
LLMSCAN

LLMSCAN is a tool that parses and analyzes source code to support LLM-based program analysis. Built on Tree-sitter, it identifies and extracts functions from source code along with their metadata, such as function name, line numbers, parameters, call sites, and other program constructs (including branches and loops). Importantly, it performs lightweight, parsing-based call graph analysis, which enables more effective code browsing and navigation for real-world programs. The latest version of LLMSCAN supports four programming languages: C, C++, Java, and Python.

Attention: Because of syntax differences between languages, the main branch no longer supports multiple languages. Since 2024/11/23, the active development branches have been cpp, java, and python.

Features

  • Parse source code using Tree-sitter.
  • Browse code for prompting-based static analysis.
  • Multi-language support.

Functionality

  • MetaScan: Extract syntactic facts as function metadata.

You can define your own scanners in the directory src/pipeline.
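
To make the idea of function metadata concrete, here is a minimal sketch of what one extracted record could look like and how a name-based call graph can be derived from call sites. The `FunctionMeta` class, its field names, and `build_call_graph` are hypothetical and chosen for illustration; they are not taken from the LLMSCAN code base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FunctionMeta:
    """Hypothetical record for one function; field names are illustrative."""
    name: str
    start_line: int
    end_line: int
    parameters: List[str] = field(default_factory=list)
    # Names of callees found at call sites inside the function body.
    call_sites: List[str] = field(default_factory=list)


def build_call_graph(functions: List[FunctionMeta]) -> Dict[str, Set[str]]:
    """Map each function to the known functions it calls, matching by name only."""
    known = {f.name for f in functions}
    # Calls to functions outside the scanned set (e.g. libc) are dropped.
    return {f.name: {c for c in f.call_sites if c in known} for f in functions}


funcs = [
    FunctionMeta("main", 1, 10, ["argc", "argv"], ["parse", "printf"]),
    FunctionMeta("parse", 12, 30, ["buf"], ["helper"]),
    FunctionMeta("helper", 32, 40),
]
graph = build_call_graph(funcs)
```

Matching callees by name alone is cheap but imprecise: overloaded functions, function pointers, and virtual dispatch are not resolved, which is the usual trade-off of a parsing-only approach.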

Installation

  1. Clone the repository:

    git clone git@github.com:PurCL/LLMSCAN.git
    cd LLMSCAN
  2. Install the required dependencies:

    pip install -r requirements.txt
  3. Build the Tree-sitter library and language bindings:

    cd lib
    python build.py
  4. Configure the keys:

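    # Append the export to ~/.bashrc so it persists, then reload the shell configuration.
    echo 'export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkey1:sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkey2:sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkey3:sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxkey4' >> ~/.bashrc
    source ~/.bashrc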

    We suggest providing multiple keys, separated by colons as above, to enable parallel analysis with higher throughput.

    Similarly, the other two keys can be set as follows:

    echo 'export REPLICATE_API_TOKEN=xxxxxx' >> ~/.bashrc
    echo 'export GEMINI_KEY=xxxxxx' >> ~/.bashrc
    source ~/.bashrc
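
As an illustration of why several keys help, the sketch below splits the colon-separated OPENAI_API_KEY value and hands keys out in round-robin order so that parallel requests are spread across them. How LLMSCAN itself consumes the keys is not shown here; `load_keys` and `key_rotation` are hypothetical helpers written only for this example.

```python
import itertools
import os
from typing import Iterator, List


def load_keys(env_var: str = "OPENAI_API_KEY") -> List[str]:
    """Split a colon-separated environment variable into individual keys."""
    raw = os.environ.get(env_var, "")
    # Skip empty segments produced by stray or trailing colons.
    return [k for k in raw.split(":") if k]


def key_rotation(keys: List[str]) -> Iterator[str]:
    """Return an endless round-robin iterator over the configured keys."""
    if not keys:
        raise ValueError("no API keys configured")
    return itertools.cycle(keys)
```

A worker would call `next()` on the iterator before each request; with four keys, consecutive requests use four different keys before the first one is reused.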

Quick Start

  1. Prepare the project that you want to analyze. Here we use the Linux kernel as an example:

    cd benchmark
    mkdir C && cd C
    git clone git@github.com:torvalds/linux.git

    You can also use our provided benchmark programs to run a demo.

  2. Run the analysis to extract the metadata of each function:

    cd src
    ./run.sh

The output files are written to the log directory.

How to Extend

More Program Facts

You can implement your own analysis by adding more modules, such as additional parsing-based primitives (in parser/program_parser). If you want to derive semantic facts that are beyond the capability of parsing-based analysis, you can customize the prompts and use LLMs to derive them.

More Programming Languages

The framework is language-agnostic. To migrate the current implementation to another programming language or to extract more syntactic facts, refer to the grammar files in the corresponding Tree-sitter library and refactor the code in parser/program_parser.py. In most cases, you only need to change the node types passed to find_nodes_by_type.
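
The following sketch shows the kind of change this involves. It assumes that find_nodes_by_type walks a parse tree and collects nodes whose type matches a given string; the actual signature in parser/program_parser.py may differ. The `Node` class is a stand-in for a Tree-sitter node (which also exposes `type` and `children`), and the `FUNCTION_NODE_TYPES` table is a hypothetical helper whose values are the function-definition node types used by the respective Tree-sitter grammars.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in for a parse-tree node with the `type` and `children` attributes."""
    type: str
    children: List["Node"] = field(default_factory=list)


def find_nodes_by_type(root, node_type: str) -> list:
    """Collect every node under `root` (inclusive) whose type equals `node_type`, in pre-order."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.type == node_type:
            found.append(node)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return found


# Node type that marks a function definition in each grammar (illustrative table).
FUNCTION_NODE_TYPES = {
    "c": "function_definition",
    "cpp": "function_definition",
    "java": "method_declaration",
    "python": "function_definition",
}

# A tiny hand-built tree: a module with a function (containing a nested function) and a class.
tree = Node("module", [
    Node("function_definition", [Node("block", [Node("function_definition")])]),
    Node("class_definition"),
])
functions = find_nodes_by_type(tree, FUNCTION_NODE_TYPES["python"])
```

Porting to a new language then amounts to adding an entry with that grammar's node type and checking language-specific constructs (for example, constructors or lambdas) that use different node types.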

Here are the links to grammar files in Tree-sitter libraries targeting mainstream programming languages:

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

Contact

For any questions or suggestions, please contact [email protected].
