
Code Structure Analyzer (CSA)


A Python CLI application that analyzes codebases and generates structured documentation as markdown files or a ChromaDB vector database using local LLMs. It is aimed at small codebases (up to a few hundred source files).

Dev Notes

This tool has been developed using both LM Studio and Ollama as LLM providers. The idea behind using a local LLM, such as Google's Gemma-3 1B, is data privacy and low cost. In addition, with a good LLM, a multitude of programming languages can be analyzed without depending on language-specific code parsers. Depending on the available hardware and the model used, however, performance may vary and accuracy may be affected.

A lot of tweaking has gone into formatting and processing the LLM's responses, but depending on the source there may still be warnings. An LLM may not always format code within JSON replies correctly, in which case the affected entry is skipped.

Output was initially aimed only at markdown files, but has since been extended to also support a vector database (ChromaDB) that can be queried against (see the small test script in the examples folder).
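
Querying the generated database might look roughly like the following minimal sketch. The collection name "csa_analysis" is an assumption for illustration; check the reporter implementation and the examples folder for the actual names and stored fields.

```python
# Minimal sketch of querying the generated ChromaDB store.
# Assumptions: the store lives in the folder passed via -o/--output,
# and the collection name "csa_analysis" is hypothetical; check
# csa/reporters.py or the examples folder for the real name.
import chromadb

client = chromadb.PersistentClient(path="data")
collection = client.get_collection("csa_analysis")

results = collection.query(
    query_texts=["Which files handle LLM provider configuration?"],
    n_results=3,
)
for doc in results["documents"][0]:
    print(doc[:200])
```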

An example output file can be found here: an analysis of the CSA project itself as of March 17, 2025.

This repository is experimental and was developed almost entirely using Claude 3.7 Sonnet AI. The code structure, documentation, and implementation reflect an AI-assisted development approach, showcasing the capabilities of modern LLMs in software engineering.

Features

  • Recursively scans source directories for code files
  • Filters files by extension and excludes binary/generated folders
  • Analyzes code files in chunks using local LLMs (via LM Studio or Ollama)
  • Generates documentation as Markdown or a ChromaDB vector database, with:
    • File structure visualization (Mermaid diagram)
    • File-by-file analysis summaries
  • User-friendly CLI with:
    • Comprehensive help documentation with usage examples
    • Optional file inclusion/exclusion patterns
  • LM Studio and Ollama integration:
    • Smart extraction of content from LLM responses
    • Multiple fallback mechanisms for resilient operation
  • Extensible output formats through the reporter pattern
  • Clean markdown output
  • Optional logging (csa.log)
  • Supports gitignore-based file exclusion
  • Custom chunk sizing for optimal LLM context utilization
  • Environment variable configuration via .env files
  • Cross-platform compatibility (Windows, Linux, WSL2)
  • Extensive test suite with unit and integration tests
  • Efficient error handling and recovery mechanisms

Requirements

  • Python 3.10 or later
  • One of the following LLM providers:
    • LM Studio running locally (default, configure with LMSTUDIO_HOST)
    • Ollama running locally (configure with OLLAMA_HOST and OLLAMA_MODEL)

Installation

Windows

  1. Clone this repository
  2. Run setup.bat to create a virtual environment and install dependencies
  3. Make sure one of the following LLM providers is running:
    • LM Studio on localhost:1234 (default)
    • Ollama on localhost:11434

Linux/WSL2

  1. Clone this repository

  2. Make the shell scripts executable:

    chmod +x setup.sh run_tests.sh
  3. Run ./setup.sh to create a virtual environment and install dependencies

  4. Make sure one of the following LLM providers is running:

    • LM Studio on localhost:1234 (default)
    • Ollama on localhost:11434

Manual Installation

  1. Clone this repository
  2. Create a virtual environment: python -m venv venv
  3. Activate the virtual environment:
    • Windows: venv\Scripts\activate
    • Linux/WSL2: source venv/bin/activate
  4. Install dependencies: pip install -r requirements.txt
  5. In the csa folder, create a .env file from .env.example

Usage

Basic usage:

# Using the Python module directly
python -m csa.cli /path/to/source/directory

# Or if installed via pip
csa /path/to/source/directory

This will analyze the codebase in the specified directory and generate a trace_ai.md file in the current directory.

Command-line Options

usage: python -m csa.cli [-h] [-o OUTPUT] [-c CHUNK_SIZE] [--folders]
              [--reporter {markdown,chromadb}] [--llm-provider LLM_PROVIDER]
              [--llm-host LLM_HOST] [--lmstudio-host LMSTUDIO_HOST]
              [--ollama-host OLLAMA_HOST] [--include INCLUDE]
              [--exclude EXCLUDE] [--obey-gitignore] [--no-dependencies]
              [--no-functions] [--verbose]
              [source_dir]

Code Structure Analyzer - Generate structured documentation for codebases

positional arguments:
  source_dir            Path to the source directory to analyze

optional arguments:
  -h, --help            Show this help message and examples.
  -o OUTPUT, --output OUTPUT
                        Path to the output markdown file or chromadb directory (default: trace_ai.md)
  -c CHUNK_SIZE, --chunk-size CHUNK_SIZE
                        Number of lines to read in each chunk (default: 200)
  --folders             Recursively include files in sub-folders of the source directory.
  --reporter {markdown,chromadb}
                        Reporter type to use; for chromadb, -o/--output must specify a folder name (e.g., "data")
  --llm-provider LLM_PROVIDER
                        LLM provider to use (default: lmstudio)
  --llm-host LLM_HOST   Host address for the LLM provider (default: localhost:1234)
  --lmstudio-host LMSTUDIO_HOST
                        Host address for the LM Studio provider (default: localhost:1234)
  --ollama-host OLLAMA_HOST
                        Host address for Ollama (default: localhost:11434)
  --include INCLUDE     Comma-separated list of file patterns to include, in double quotes (gitignore style)
  --exclude EXCLUDE     Comma-separated list of file patterns to exclude, in double quotes (gitignore style)
  --obey-gitignore      Whether to obey .gitignore files in the processed folder
  --no-dependencies     Disable output of dependencies/imports in the analysis
  --no-functions        Disable output of functions list in the analysis
  --verbose, -v         Enable verbose logging

Examples

# Analyze the current directory with default settings
python -m csa.cli .

# Analyze a specific directory with a custom output file
python -m csa.cli /path/to/source -o analysis.md

# Analyze with a custom chunk size (number of lines per LLM request)
python -m csa.cli /path/to/source -c 200

# Analyze recursively including all sub-folders
python -m csa.cli /path/to/source --folders

# Show detailed help text with examples
python -m csa.cli --help

# Use LM Studio with a specific host
python -m csa.cli /path/to/source --llm-provider lmstudio --lmstudio-host localhost:1234

# Use Ollama as the LLM provider with specific host and model
python -m csa.cli /path/to/source --llm-provider ollama --ollama-host localhost:11434 --ollama-model qwen2.5-coder:14b

# Use the legacy --llm-host parameter (sets the appropriate provider-specific host based on --llm-provider)
python -m csa.cli /path/to/source --llm-provider lmstudio --llm-host localhost:5000

# Include only specific file patterns
python -m csa.cli /path/to/source --include "*.cs,*.py"

# Exclude specific file patterns
python -m csa.cli /path/to/source --exclude "test_*.py,*.tmp"

# Obey .gitignore files in the processed folder
python -m csa.cli /path/to/source --obey-gitignore

# Disable dependencies/imports in the output
python -m csa.cli /path/to/source --no-dependencies

# Disable functions list in the output
python -m csa.cli /path/to/source --no-functions

Architecture

CSA employs several architectural patterns to ensure maintainability and extensibility:

Reporter Pattern

The application uses a reporter pattern to separate analysis logic from output formatting:

  • BaseAnalysisReporter: Abstract base class defining the interface for all reporters
  • MarkdownAnalysisReporter: Concrete implementation that formats analysis results as Markdown

This design allows for:

  • Easy addition of new output formats (HTML, JSON, etc.)
  • Clear separation of concerns
  • Better testability of individual components

If you want to create your own output format (a sketch follows the list), simply:

  1. Subclass BaseAnalysisReporter
  2. Implement the required methods (initialize, update_file_analysis, finalize)
  3. Pass your reporter instance to analyze_codebase()
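
For example, a minimal JSON reporter might look like this. The method signatures are assumptions based on the steps above; consult BaseAnalysisReporter in csa/reporters.py for the actual interface.

```python
# Hypothetical JSON reporter; the method signatures below are
# assumptions, so align them with BaseAnalysisReporter before use.
import json

from csa.reporters import BaseAnalysisReporter


class JsonAnalysisReporter(BaseAnalysisReporter):
    def __init__(self, output_path: str) -> None:
        self.output_path = output_path
        self.results: dict[str, dict] = {}

    def initialize(self) -> None:
        # Called once before analysis starts.
        self.results = {}

    def update_file_analysis(self, file_path: str, analysis: dict) -> None:
        # Called after each file has been analyzed.
        self.results[file_path] = analysis

    def finalize(self) -> None:
        # Called once after all files are processed.
        with open(self.output_path, "w", encoding="utf-8") as f:
            json.dump(self.results, f, indent=2)
```

An instance of this reporter would then be passed to analyze_codebase() as in step 3 above.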

LLM Provider Abstraction

The application supports multiple LLM providers through a provider abstraction:

  • LLMProvider: Base class for all LLM providers
  • Provider-specific implementations for LM Studio, Ollama, etc.

This allows for easy integration of new LLM backends while maintaining a consistent interface.
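
As a rough illustration, a backend for any OpenAI-compatible server could be added along these lines. The interface shown (a single complete() method) is an assumption, so check csa/llm.py for the actual abstract methods; LM Studio itself serves an OpenAI-compatible API, which is why the endpoint below uses /v1/chat/completions.

```python
# Rough sketch only: the method name "complete" is an assumption;
# csa/llm.py defines the real LLMProvider interface.
import requests

from csa.llm import LLMProvider


class OpenAICompatibleProvider(LLMProvider):
    def __init__(self, host: str = "localhost:1234", model: str = "local-model") -> None:
        self.url = f"http://{host}/v1/chat/completions"
        self.model = model

    def complete(self, prompt: str) -> str:
        # Send a single-turn chat completion request to the local server.
        resp = requests.post(
            self.url,
            json={
                "model": self.model,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```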

Code Analysis Pipeline

The code analysis process follows a pipeline approach (a sketch of the chunking step follows the list):

  1. File discovery and filtering
  2. File chunking to fit within LLM context windows
  3. Analysis of each chunk with LLM
  4. Aggregation of chunk analyses into a comprehensive file analysis
  5. Output generation via the reporter system
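
To illustrate step 2, a chunking helper along these lines splits a file into fixed-size line windows so each LLM request stays within the context limit. This is a sketch of the idea; the real logic lives in csa/analyzer.py and may differ.

```python
# Illustrative chunker: yields (start_line, end_line, text) windows
# of at most chunk_size lines, using 1-based inclusive line numbers.
from pathlib import Path
from typing import Iterator


def iter_chunks(path: Path, chunk_size: int = 200) -> Iterator[tuple[int, int, str]]:
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    for start in range(0, len(lines), chunk_size):
        end = min(start + chunk_size, len(lines))
        yield start + 1, end, "\n".join(lines[start:end])
```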

Testing

The project includes test scripts for both Windows and Linux environments:

Windows Tests

Run tests using the batch script:

run_tests.bat             # Run unit tests only
run_tests.bat --all       # Run all tests (including integration tests)
run_tests.bat --integration  # Run only integration tests

Linux/WSL2 Tests

First, ensure the shell script is executable:

chmod +x run_tests.sh

Then run tests:

./run_tests.sh            # Run unit tests only
./run_tests.sh --all      # Run all tests (including integration tests)
./run_tests.sh --integration  # Run only integration tests

Note: Integration tests require a running LLM provider. By default, they expect LM Studio running on localhost:1234, but this can be configured through environment variables.
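
For example, on Linux the host can be overridden inline for a single run (LMSTUDIO_HOST is documented under Configuration below):

LMSTUDIO_HOST=localhost:5000 ./run_tests.sh --integration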

Development

If you're interested in contributing to CSA, follow these steps to set up your development environment:

Setting Up Development Environment

  1. Clone this repository and navigate to it

  2. Create a virtual environment:

    python -m venv .venv
  3. Activate the virtual environment:

    • Windows (Command Prompt):

      .venv\Scripts\activate
    • Windows (Git Bash):

      source .venv/Scripts/activate
    • Linux/macOS:

      source .venv/bin/activate
  4. Install project dependencies:

    pip install -r requirements.txt
  5. Install development dependencies including pre-commit hooks:

    pip install pre-commit==3.7.0
    pip install ruff mypy types-requests types-setuptools types-pyyaml types-toml
  6. Set up pre-commit hooks:

    pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --all-files

Running Pre-commit Hooks Manually

To manually run the linting/pre-commit tools from Git Bash:

  1. First, activate your virtual environment:

    source .venv/Scripts/activate
  2. Run the full pre-commit suite:

    pre-commit run --config ./dev_config/python/.pre-commit-config.yaml --all-files

    This will:

    • Format your code with Ruff
    • Run linting checks
    • Check for type errors with MyPy
    • Fix common issues automatically
  3. To run specific hooks individually:

    # Run just the ruff linter
    pre-commit run ruff --config ./dev_config/python/.pre-commit-config.yaml --all-files
    
    # Run just the ruff formatter
    pre-commit run ruff-format --config ./dev_config/python/.pre-commit-config.yaml --all-files
    
    # Run just the mypy type checker
    pre-commit run mypy --config ./dev_config/python/.pre-commit-config.yaml --all-files
  4. Run the linting tools directly (without pre-commit):

    # Run ruff linter
    ruff check --config dev_config/python/ruff.toml .
    
    # Run ruff formatter
    ruff format --config dev_config/python/ruff.toml .
    
    # Run mypy type checker
    mypy --config-file dev_config/python/mypy.ini .

If you want to run these tools on specific files instead of the entire project, replace the --all-files flag with the path to the specific files, or provide the file path directly to the linting tools.

Configuration

Configuration is handled through environment variables or a .env file (a sample .env follows the list):

  • LLM_PROVIDER: LLM provider to use (default: "lmstudio")
  • LMSTUDIO_HOST: Host address for the LM Studio provider (default: "localhost:1234")
  • OLLAMA_HOST: Host address for the Ollama provider (default: "localhost:11434")
  • OLLAMA_MODEL: Model name for Ollama (default: "qwen2.5-coder:14b")
  • CHUNK_SIZE: Number of lines to read in each chunk (default: 200)
  • OUTPUT_FILE: Default output file path (default: "trace_ai.md")
  • FILE_EXTENSIONS: Comma-separated list of file extensions to analyze (default: ".cs,.py,.js,.ts,.html,.css")
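
Based on the defaults listed above, a complete .env might look like:

LLM_PROVIDER=lmstudio
LMSTUDIO_HOST=localhost:1234
OLLAMA_HOST=localhost:11434
OLLAMA_MODEL=qwen2.5-coder:14b
CHUNK_SIZE=200
OUTPUT_FILE=trace_ai.md
FILE_EXTENSIONS=.cs,.py,.js,.ts,.html,.css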

Project Structure

csa/
+-- setup.bat                # Windows setup script
+-- setup.sh                 # Linux/WSL2 setup script
+-- requirements.txt         # Dependencies
+-- pyproject.toml           # Python project configuration
+-- setup.py                 # Legacy setup file for compatibility
+-- csa/                     # Python package
|   +-- .env.example         # Example environment variables
|   +-- __init__.py          # Package initialization
|   +-- config.py            # Configuration handling
|   +-- llm.py               # LLM wrapper for different providers
|   +-- analyzer.py          # Core file analysis logic
|   +-- code_analyzer.py     # Code analysis
|   +-- reporters.py         # Output formatting abstraction
|   +-- cli.py               # Command-line interface (entry point)
+-- tests/                   # Test directory
+-- run_tests.bat            # Windows test script
+-- run_tests.sh             # Linux/WSL2 test script
+-- README.md                # Documentation

Example Output

The generated trace_ai.md file will have the following structure:

# Code Structure Analysis

Source directory: `/path/to/source/directory`
Analysis started: 2023-06-10 14:30:45

## Codebase Structure

```mermaid
graph TD
  ...
```

## Files Analyzed

<details>
<summary>path/to/file.cs</summary>

- **Lines Analyzed**: 1-150 of 150
- **Description**: This file contains a class that implements the IDisposable interface...

</details>

Credits

Special credits to X user shannonNullCode for the initial idea and inspiration for this project.

  • LM Studio for their desktop application and Python SDK
  • Mermaid for the diagramming library

License

MIT License - This software is provided "as is" without warranty of any kind, express or implied. You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software under the terms of the MIT License; see the LICENSE file for the full text.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
