A Python CLI application that analyzes codebases and generates structured documentation as markdown files or a ChromaDB vector database using local LLMs. It is aimed at small codebases (up to a couple of hundred source files).
This tool has been developed using both LM Studio and Ollama as LLM providers. The idea behind using a local LLM, such as Google's Gemma-3 1B, is data privacy and low cost. In addition, with a capable LLM a multitude of programming languages can be analyzed without depending on language-specific code parsers. Depending on the available hardware and the chosen model, however, performance may vary and accuracy may be affected.
A lot of tweaking has gone into formatting and processing the LLM's responses, but depending on the source there may still be some warnings. An LLM may not always format code within JSON replies correctly, in which case an entry might be skipped.
Output was initially aimed only at markdown files, but has since been extended to also support a vector database (ChromaDB), which can be queried afterwards (see the small test script in the examples folder).
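For illustration, querying the generated database from Python might look roughly like the following minimal sketch. The storage path and collection name are assumptions, so check the test script in the examples folder for the actual values:

```python
import chromadb

# Open the database produced by the chromadb reporter
# (the path "data" and collection name "code_analysis" are assumptions).
client = chromadb.PersistentClient(path="data")
collection = client.get_collection("code_analysis")

# Retrieve the file analyses most relevant to a natural-language question.
results = collection.query(
    query_texts=["Where is the LLM provider selected?"],
    n_results=3,
)

for document, metadata in zip(results["documents"][0], results["metadatas"][0]):
    print(metadata, document[:120])
```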
An example output file can be found here; it is an analysis of the CSA project itself as of March 17th, 2025.
This repository is experimental and was developed almost entirely using Claude 3.7 Sonnet AI. The code structure, documentation, and implementation reflect an AI-assisted development approach, showcasing the capabilities of modern LLMs in software engineering.
- Recursively scans source directories for code files
- Filters files by extension and excludes binary/generated folders
- Analyzes code files in chunks using local LLMs (via LM Studio or Ollama)
- Generates either Markdown or ChromaDB vector database documentation with:
  - File structure visualization (Mermaid diagram)
  - File-by-file analysis summaries
- User-friendly CLI with:
  - Comprehensive help documentation with usage examples
  - Optional file inclusion/exclusion patterns
- LM Studio and Ollama integration:
  - Smart extraction of content from LLM responses
  - Multiple fallback mechanisms for resilient operation
- Extensible output formats through the reporter pattern
- Clean markdown output
- Optional logging (csa.log)
- Supports gitignore-based file exclusion
- Custom chunk sizing for optimal LLM context utilization
- Environment variable configuration via .env files
- Cross-platform compatibility (Windows, Linux, WSL2)
- Extensive test suite with unit and integration tests
- Efficient error handling and recovery mechanisms
- Python 3.10 or later
- One of the following LLM providers:
- LM Studio running locally (default, configure with LMSTUDIO_HOST)
- Ollama running locally (configure with OLLAMA_HOST and OLLAMA_MODEL)
- Clone this repository
- Run setup.bat to create a virtual environment and install dependencies
- Make sure one of the following LLM providers is running:
  - LM Studio on localhost:1234 (default)
  - Ollama on localhost:11434
- Clone this repository
- Make the shell scripts executable:
  chmod +x setup.sh run_tests.sh
- Run ./setup.sh to create a virtual environment and install dependencies
- Make sure one of the following LLM providers is running:
  - LM Studio on localhost:1234 (default)
  - Ollama on localhost:11434
- Clone this repository
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - Windows:
    venv\Scripts\activate
  - Linux/WSL2:
    source venv/bin/activate
- Install dependencies:
  pip install -r requirements.txt
- In the folder csa, create a .env file from .env.example
Basic usage:
# Using the Python module directly
python -m csa.cli /path/to/source/directory
# Or if installed via pip
csa /path/to/source/directory
This will analyze the codebase in the specified directory and generate a trace_ai.md file in the current directory.
usage: python -m csa.cli [-h] [-o OUTPUT] [-c CHUNK_SIZE] [--folders]
[--reporter {markdown,chromadb}] [--llm-provider LLM_PROVIDER]
[--llm-host LLM_HOST] [--lmstudio-host LMSTUDIO_HOST]
[--ollama-host OLLAMA_HOST] [--include INCLUDE]
[--exclude EXCLUDE] [--obey-gitignore] [--no-dependencies]
[--no-functions] [--verbose]
[source_dir]
Code Structure Analyzer - Generate structured documentation for codebases
positional arguments:
source_dir Path to the source directory to analyze
optional arguments:
-h, --help Show this help message and examples.
-o OUTPUT, --output OUTPUT
Path to the output markdown file or chromadb directory (default: trace_ai.md)
-c CHUNK_SIZE, --chunk-size CHUNK_SIZE
Number of lines to read in each chunk (default: 200)
--folders Recursively include files in sub-folders of the source directory.
--reporter Reporter type to use (markdown or chromadb) - for chromadb, the -o/--output must specify a folder name (e.g., "data")
--llm-provider LLM_PROVIDER
LLM provider to use (default: lmstudio)
--llm-host LLM_HOST Host address for the LLM provider (default: localhost:1234)
--lmstudio-host LMSTUDIO_HOST
Host address for the LM Studio provider (default: localhost:1234)
--ollama-host OLLAMA_HOST
Host address for Ollama (default: localhost:11434)
--include INCLUDE Comma-separated list in double quotes of file patterns to include (gitignore style)
--exclude EXCLUDE Comma-separated list in double quotes of file patterns to exclude (gitignore style)
--obey-gitignore Whether to obey .gitignore files in the processed folder
--no-dependencies Disable output of dependencies/imports in the analysis
--no-functions Disable output of functions list in the analysis
--verbose, -v Enable verbose logging
# Analyze the current directory with default settings
python -m csa.cli .
# Analyze a specific directory with a custom output file
python -m csa.cli /path/to/source -o analysis.md
# Analyze with a custom chunk size (number of lines processed per LLM call)
python -m csa.cli /path/to/source -c 200
# Analyze recursively including all sub-folders
python -m csa.cli /path/to/source --folders
# Show detailed help text with examples
python -m csa.cli --help
# Use LM Studio with a specific host
python -m csa.cli /path/to/source --llm-provider lmstudio --lmstudio-host localhost:1234
# Use Ollama as the LLM provider with specific host and model
python -m csa.cli /path/to/source --llm-provider ollama --ollama-host localhost:11434 --ollama-model qwen2.5-coder:14b
# Use the legacy --llm-host parameter (will set the appropriate provider-specific host based on llm-provider)
python -m csa.cli /path/to/source --llm-provider lmstudio --llm-host localhost:5000
# Include only specific file patterns
python -m csa.cli /path/to/source --include "*.cs,*.py"
# Exclude specific file patterns
python -m csa.cli /path/to/source --exclude "test_*.py,*.tmp"
# Obey .gitignore files in the processed folder
python -m csa.cli /path/to/source --obey-gitignore
# Disable dependencies/imports in the output
python -m csa.cli /path/to/source --no-dependencies
# Disable functions list in the output
python -m csa.cli /path/to/source --no-functions
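The ChromaDB reporter can be exercised in the same way; per the --reporter description above, the -o/--output value then names a folder:
# Write the analysis to a ChromaDB vector database in the "data" folder
python -m csa.cli /path/to/source --reporter chromadb -o data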
CSA employs several architectural patterns to ensure maintainability and extensibility:
The application uses a reporter pattern to separate analysis logic from output formatting:
- BaseAnalysisReporter: Abstract base class defining the interface for all reporters
- MarkdownAnalysisReporter: Concrete implementation that formats analysis results as Markdown
This design allows for:
- Easy addition of new output formats (HTML, JSON, etc.)
- Clear separation of concerns
- Better testability of individual components
If you want to create your own output format, simply:
- Subclass BaseAnalysisReporter
- Implement the required methods (initialize, update_file_analysis, finalize)
- Pass your reporter instance to analyze_codebase()
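As a rough illustration only, a custom reporter could look like the sketch below. The method names come from the list above, but their exact signatures and the import path are assumptions; see reporters.py for the real interface.

```python
from csa.reporters import BaseAnalysisReporter  # import path assumed from the project layout


class JsonAnalysisReporter(BaseAnalysisReporter):
    """Hypothetical reporter that writes the analysis as a JSON file (illustrative only)."""

    def initialize(self, source_dir):
        # Called once before the analysis starts (argument names are assumptions).
        self.results = {"source_dir": str(source_dir), "files": []}

    def update_file_analysis(self, file_path, analysis):
        # Called after each file has been analyzed.
        self.results["files"].append({"file": str(file_path), "analysis": analysis})

    def finalize(self):
        # Called once after all files have been processed.
        import json
        with open("analysis.json", "w", encoding="utf-8") as fh:
            json.dump(self.results, fh, indent=2, default=str)
```

The instance would then be passed to analyze_codebase() in place of the default Markdown reporter.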
The application supports multiple LLM providers through a provider abstraction:
- LLMProvider: Base class for all LLM providers
- Provider-specific implementations for LM Studio, Ollama, etc.
This allows for easy integration of new LLM backends while maintaining a consistent interface.
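The provider interface is not spelled out here, so the following is only a sketch of the idea: the method name, constructor arguments, and HTTP endpoint are invented for illustration, and llm.py holds the actual base class.

```python
import requests

from csa.llm import LLMProvider  # import path assumed from the project layout


class MyBackendProvider(LLMProvider):
    """Hypothetical provider for another local LLM server (illustrative only)."""

    def __init__(self, host: str = "localhost:8000"):
        self.host = host

    def generate(self, prompt: str) -> str:
        # Send the prompt to the backend's HTTP API and return the raw completion text.
        # The endpoint and payload shape are made up for this sketch.
        response = requests.post(
            f"http://{self.host}/v1/completions",
            json={"prompt": prompt},
            timeout=120,
        )
        response.raise_for_status()
        return response.json().get("text", "")
```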
The code analysis process follows a pipeline approach:
- File discovery and filtering
- File chunking to fit within LLM context windows
- Analysis of each chunk with LLM
- Aggregation of chunk analyses into a comprehensive file analysis
- Output generation via the reporter system
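In simplified terms the pipeline boils down to something like the sketch below; the helper names are invented for illustration, and the real logic lives in analyzer.py and code_analyzer.py.

```python
def analyze_codebase_sketch(source_dir, llm, reporter, chunk_size=200):
    """Simplified outline of the analysis pipeline (not the actual implementation)."""
    reporter.initialize(source_dir)
    for file_path in discover_files(source_dir):               # 1. discovery and filtering
        chunk_summaries = []
        for chunk in read_in_chunks(file_path, chunk_size):    # 2. chunking for the context window
            chunk_summaries.append(llm.analyze(chunk))         # 3. per-chunk LLM analysis
        file_analysis = merge_chunk_results(chunk_summaries)   # 4. aggregate into a file summary
        reporter.update_file_analysis(file_path, file_analysis)
    reporter.finalize()                                        # 5. output via the reporter
```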
The project includes test scripts for both Windows and Linux environments:
Run tests using the batch script:
run_tests.bat # Run unit tests only
run_tests.bat --all # Run all tests (including integration tests)
run_tests.bat --integration # Run only integration tests
First, ensure the shell script is executable:
chmod +x run_tests.sh
Then run tests:
./run_tests.sh # Run unit tests only
./run_tests.sh --all # Run all tests (including integration tests)
./run_tests.sh --integration # Run only integration tests
Note: Integration tests require a running LLM provider. By default, they expect LM Studio running on localhost:1234, but this can be configured through environment variables.
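For example, from a Linux shell the host could presumably be overridden like this (assuming the integration tests read the same LMSTUDIO_HOST variable described in the Configuration section):
LMSTUDIO_HOST=localhost:5000 ./run_tests.sh --integration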
If you're interested in contributing to CSA, follow these steps to set up your development environment:
- Clone this repository and navigate to it
- Create a virtual environment:
  python -m venv .venv
- Activate the virtual environment:
  - Windows (Command Prompt):
    .venv\Scripts\activate
  - Windows (Git Bash):
    source .venv/Scripts/activate
  - Linux/macOS:
    source .venv/bin/activate
- Install project dependencies:
  pip install -r requirements.txt
- Install development dependencies including pre-commit hooks:
  pip install pre-commit==3.7.0
  pip install ruff mypy types-requests types-setuptools types-pyyaml types-toml
- Set up pre-commit hooks:
  pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --all-files
To manually run the linting/pre-commit tools from Git Bash:
- First, activate your virtual environment:
  source .venv/Scripts/activate
- Run the full pre-commit suite:
  pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --all-files
This will:
- Format your code with Ruff
- Run linting checks
- Check for type errors with MyPy
- Fix common issues automatically
- To run specific hooks individually:
  # Run just the ruff linter
  pre-commit run ruff --config ./dev_config/python/.pre-commit-config.yaml --all-files
  # Run just the ruff formatter
  pre-commit run ruff-format --config ./dev_config/python/.pre-commit-config.yaml --all-files
  # Run just the mypy type checker
  pre-commit run mypy --config ./dev_config/python/.pre-commit-config.yaml --all-files
- Run the linting tools directly (without pre-commit):
  # Run ruff linter
  ruff check --config dev_config/python/ruff.toml .
  # Run ruff formatter
  ruff format --config dev_config/python/ruff.toml .
  # Run mypy type checker
  mypy --config-file dev_config/python/mypy.ini .
If you want to run these tools on specific files instead of the entire project, replace the --all-files
flag with the path to the specific files, or provide the file path directly to the linting tools.
Configuration is handled through environment variables or a .env file:
- LLM_PROVIDER: LLM provider to use (default: "lmstudio")
- LMSTUDIO_HOST: Host address for the LM Studio provider (default: "localhost:1234")
- OLLAMA_HOST: Host address for the Ollama provider (default: "localhost:11434")
- OLLAMA_MODEL: Model name for Ollama (default: "qwen2.5-coder:14b")
- CHUNK_SIZE: Number of lines to read in each chunk (default: 200)
- OUTPUT_FILE: Default output file path (default: "trace_ai.md")
- FILE_EXTENSIONS: Comma-separated list of file extensions to analyze (default: ".cs,.py,.js,.ts,.html,.css")
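A minimal .env could therefore look like this (the values shown are simply the documented defaults):

```
LLM_PROVIDER=lmstudio
LMSTUDIO_HOST=localhost:1234
OLLAMA_HOST=localhost:11434
OLLAMA_MODEL=qwen2.5-coder:14b
CHUNK_SIZE=200
OUTPUT_FILE=trace_ai.md
FILE_EXTENSIONS=.cs,.py,.js,.ts,.html,.css
```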
csa/
+-- setup.bat # Windows setup script
+-- setup.sh # Linux/WSL2 setup script
+-- requirements.txt # Dependencies
+-- pyproject.toml # Python project configuration
+-- setup.py # Legacy setup file for compatibility
+-- csa/ # Python package
| +-- .env.example # Example environment variables
| +-- __init__.py # Package initialization
| +-- config.py # Configuration handling
| +-- llm.py # LLM wrapper for different providers
| +-- analyzer.py # Core file analysis logic
| +-- code_analyzer.py # Code analysis
| +-- reporters.py # Output formatting abstraction
| +-- cli.py # Command-line interface (entry point)
+-- tests/ # Test directory
+-- run_tests.bat # Windows test script
+-- run_tests.sh # Linux/WSL2 test script
+-- README.md # Documentation
The generated trace_ai.md file will have the following structure:
# Code Structure Analysis
Source directory: `/path/to/source/directory`
Analysis started: 2023-06-10 14:30:45
## Codebase Structure
```mermaid
graph TD
...
```
<details>
<summary>path/to/file.cs</summary>
- **Lines Analyzed**: 1-150 of 150
- **Description**: This file contains a class that implements the IDisposable interface...
</details>
Special credits to X user shannonNullCode for the initial idea and inspiration for this project.
MIT License - This software is provided "as is", without warranty of any kind, express or implied. You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of it under the terms of the MIT License; see the LICENSE file for the full text.
Contributions are welcome! Please feel free to submit a Pull Request.