Update mkdocs content and config.
eli64s committed Dec 11, 2023
1 parent 7687a0b commit 0edc428
Showing 4 changed files with 142 additions and 2 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -50,7 +50,6 @@ Pipfile
Pipfile.lock

# Temporarily Ignored
-mkdocs.yml
docs/docs
docs/notes
examples/markdown/readme-edgecase.md
132 changes: 132 additions & 0 deletions docs/architecture.md
@@ -0,0 +1,132 @@
# Architecture

## Repository Preprocessing

- User provides a repository URL or local path to the command-line interface.
- Input arguments are sanitized and validated.
- A temporary directory is created into which the user's repository is cloned.
- A file tree structure is then generated from the cloned content (a sketch of one way to do this follows the example below). It serves two purposes:
- Used to provide context to the language model.
- Displayed in the output README file.

> Directory Tree Example:
```sh
└── readmeai/
├── Dockerfile
├── Makefile
├── poetry.lock
├── pyproject.toml
├── readmeai/
│ ├── cli/
│ │ ├── commands.py
│ │ └── options.py
│ ├── config/
│   │   ├── __init__.py
│ │ └── settings.py
│ ├── core/
│ │ ├── factory.py
│ │ ├── logger.py
│ │ ├── model.py
│ │ ├── parser.py
│ │ ├── preprocess.py
│ │ └── tokens.py
│ ├── main.py
│ ├── markdown/
│ │ ├── badges.py
│ │ ├── headers.py
│ │ ├── quickstart.py
│ │ ├── tables.py
│ │ ├── template.py
│ │ └── tree.py
│ ├── services/
│ │ └── version_control.py
│ ├── settings/
│ │ ├── config.toml
│ │ ├── dependency_files.toml
│ │ ├── identifiers.toml
│ │ ├── ignore_files.toml
│ │ ├── language_names.toml
│ │ └── language_setup.toml
│ └── utils/
│ └── utils.py
```
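
A minimal sketch of how such a tree can be rendered with the standard library (illustrative only; the project's actual logic lives in `tree.py`, and the function name here is hypothetical):

```python
from pathlib import Path


def generate_tree(directory: Path, prefix: str = "") -> str:
    """Recursively render a directory tree in the style shown above."""
    entries = sorted(directory.iterdir())
    lines = []
    for index, entry in enumerate(entries):
        is_last = index == len(entries) - 1
        connector = "└── " if is_last else "├── "
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{prefix}{connector}{entry.name}{suffix}")
        if entry.is_dir():
            extension = "    " if is_last else "│   "
            subtree = generate_tree(entry, prefix + extension)
            if subtree:
                lines.append(subtree)
    return "\n".join(lines)


repo_root = Path("readmeai")  # hypothetical path to the cloned repository
print(f"└── {repo_root.name}/")
print(generate_tree(repo_root))
```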

Following this, the repository contents and metadata are processed and organized by the [preprocess.py](https://github.com/eli64s/readme-ai/blob/main/readmeai/core/preprocess.py) module. It returns two data structures to [main.py](https://github.com/eli64s/readme-ai/blob/main/readmeai/main.py#L81) that are used in downstream tasks: a list of dependencies and a file contents dictionary.

### *Dependencies List*

A list of project dependencies is generated by scanning each file and its contents during preprocessing. The list includes the dependencies and packages found in the codebase's manifest files (e.g. `requirements.txt`, `package.json`, `poetry.lock`), the programming languages used, platforms such as Docker, and any other relevant information.

> Dependencies List Example:
```python
dependencies = ['poetry.lock', 'pandas', 'click', 'sh', 'pytest', 'python', 'streamlit', 'numpy']
```

The list serves a variety of downstream purposes, including generating the badges rendered in the output file, and it is also injected into prompts to give the LLM additional context about the repository.
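
As an illustration only, not the project's actual parser, dependency names might be collected along these lines:

```python
import json
from pathlib import Path


def extract_dependencies(repo_root: Path) -> list[str]:
    """Collect dependency names from common manifest files (illustrative sketch)."""
    found: set[str] = set()

    requirements = repo_root / "requirements.txt"
    if requirements.is_file():
        for line in requirements.read_text().splitlines():
            name = line.split("#")[0].strip()  # drop inline comments
            if name:
                # Strip version specifiers, e.g. "pandas>=2.0" -> "pandas"
                found.add(name.split("==")[0].split(">=")[0].split("<")[0].strip())

    package_json = repo_root / "package.json"
    if package_json.is_file():
        data = json.loads(package_json.read_text())
        found.update(data.get("dependencies", {}))

    return sorted(found)
```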


### *File Contents Dictionary*

A dictionary of file paths and contents is generated from the repository.

> File Contents Dictionary Example:
```json
{
"file_path": "my_root_directory/my_subdirectory/my_file.py",
"file_content_string": "import numpy as np\n\n\ndef my_function():\n return np.random.rand(1)\n",
}
```
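
A minimal sketch of how this mapping might be assembled (illustrative; the real logic lives in `preprocess.py`):

```python
from pathlib import Path


def build_file_contents(repo_root: Path) -> list[dict[str, str]]:
    """Pair each file path in the repository with its contents."""
    records = []
    for path in repo_root.rglob("*"):
        if not path.is_file():
            continue
        try:
            content = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        records.append(
            {
                "file_path": str(path.relative_to(repo_root)),
                "file_content_string": content,
            }
        )
    return records
```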

This structure is first used to generate the file summaries at the individual file level. The file summaries are then used to generate three additional README.md sections.

## OpenAI API Model

### *File Summaries*

The program invokes the OpenAI API and begins processing the file contents dictionary. First, the code summary prompt is built by injecting the file path and contents into the prompt template. The prompt is also embedded with the directory tree structure to give the LLM context about where the current file sits in the repository.

> Code Summary Prompt Example:
```toml
summaries = """Offer a comprehensive summary <= 80 words that encapsulates the core functionalities of the code below. Aim for precision and conciseness in your explanation, ensuring a fine balance between detail and brevity.\nDirectory Tree: {0}\nPath: {1}\nCode:\n{2}\n"""
```
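
Filling the template is then a single `str.format` call along these lines (variable names are illustrative):

```python
# {0} = directory tree, {1} = file path, {2} = file contents
prompt = summaries.format(directory_tree, file_path, file_content)
```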

After the prompt is created, we check whether it exceeds the model's maximum token limit. If it does, the prompt is truncated to the maximum length and then sent to the API.

> The current default maximum is 4,000 tokens for the `gpt-3.5-turbo` engine.
> This is not ideal, but from what I've seen, the model is able to generate a decent summary even from a truncated prompt. We are looking at ways to enhance context and accuracy, considering tools like [LangChain](https://python.langchain.com/docs/get_started/introduction) and [LlamaIndex](https://gpt-index.readthedocs.io/en/stable/index.html).
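
A minimal sketch of the truncation step, assuming a [tiktoken](https://github.com/openai/tiktoken)-based token count (the helper name is illustrative):

```python
import tiktoken


def truncate_prompt(prompt: str, max_tokens: int = 4000) -> str:
    """Trim a prompt to the model's token budget before sending it to the API."""
    encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by gpt-3.5-turbo
    tokens = encoding.encode(prompt)
    if len(tokens) <= max_tokens:
        return prompt
    return encoding.decode(tokens[:max_tokens])
```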

### *Architecture Section*

The prompt I use to generate the architecture table is located in the project settings file [here](https://github.com/eli64s/readme-ai/blob/main/readmeai/settings/config.toml#L31) and is as follows:

```toml
features = """Hello! Analyze the repository {0} and following the instructions below to generate a comprehensive list of features.
Please provide a comprehensive technical analysis of the codebase and its components.
Consider the codebase as a whole and highlight the key characteristics, design patterns, architectural decisions, and any other noteworthy elements.
Generate your response as a Markdown table with the following columns:
| | Feature | Description |
|----|--------------------|--------------------------------------------------------------------------------------------------------------------|
| ⚙️ | **Architecture** | Analyze the structural design of the system here. Limit your response to a maximum of 200 characters. |
| 📄 | **Documentation** | Discuss the quality and comprehensiveness of the documentation here. Limit your response to a maximum of 200 characters.|
| 🔗 | **Dependencies** | Examine the external libraries or other systems that this system relies on here. Limit your response to a maximum of 200 characters.|
| 🧩 | **Modularity** | Discuss the system's organization into smaller, interchangeable components here. Limit your response to a maximum of 200 characters.|
| 🧪 | **Testing** | Evaluate the system's testing strategies and tools here. Limit your response to a maximum of 200 characters. |
| ⚡️ | **Performance** | Analyze how well the system performs, considering speed, efficiency, and resource usage here. Limit your response to a maximum of 200 characters.|
| 🔐 | **Security** | Assess the measures the system uses to protect data and maintain functionality here. Limit your response to a maximum of 200 characters.|
| 🔀 | **Version Control**| Discuss the system's version control strategies and tools here. Limit your response to a maximum of 200 characters.|
| 🔌 | **Integrations** | Evaluate how the system interacts with other systems and services here. Limit your response to a maximum of 200 characters.|
| 📶 | **Scalability** | Analyze the system's ability to handle growth here. Limit your response to a maximum of 200 characters. |
Repository Details:
\nDirectory Tree: {1}\nDependencies: {2}\nCode Summaries: {3}\n
"""
```

The prompt has four `{}` placeholders, with data injected as we did previously: the repository URL, the directory tree, the dependencies list, and the code summaries dictionary generated in the preprocessing step.

The prompt is very specific in its instructions, explicitly telling the LLM to generate a Markdown table with the specified columns and imposing a 200-character limit on each row of the table.
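
Assembling the final prompt is then one more `format` call over the four placeholders (variable names are illustrative):

```python
# {0} = repository URL, {1} = directory tree, {2} = dependencies, {3} = code summaries
prompt = features.format(repository_url, directory_tree, dependencies, code_summaries)
```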

---
3 changes: 3 additions & 0 deletions docs/examples.md
@@ -0,0 +1,3 @@
## Examples

---
8 changes: 7 additions & 1 deletion mkdocs.yml
@@ -1 +1,7 @@
-site_name: My Docs
+site_name: readme-ai
+theme:
+  name: material
+  palette:
+    primary: blue
+    accent: blue
+#
