RepoHarvester is a Python utility designed to clone a GitHub repository and compile all non-media files into a single text file. It's particularly useful for code analysis, backups, or when you need an offline version of the repository's text-based content.
- Selective Cloning: Clones a GitHub repository and intelligently excludes non-essential file types (e.g., media, system files).
- Comment Removal: Provides an option to remove comments from code files, supporting commonly used programming languages.
- Flexible Exclusions: Allows customization of excluded file types using the --no-skip command-line argument.
- Maximum File Size Control: Implements size restrictions for included files:
-
- Files larger than a specified threshold (
--max-size
) are automatically skipped.
- Files larger than a specified threshold (
-
- Files exceeding 500KB but smaller than the maximum size are logged for awareness.
- Folder Exclusion: Exclude specific folders and their contents using the --exclude option.
- Single File Output: Compiles relevant files into a unified, well-structured text file.
- Logging: Detailed logs, including file processing information, are saved to the specified log file (--log option).
- Ensure you have Python (https://www.python.org/) and Git (https://git-scm.com/) installed.
- Clone the RepoHarvester repository or download the repoharvester.py and comment_pattens.py.
Run the script using the following command:
python repoharvester.py <repo_url> [options]
repo_url
: (Required): The SSH URL of the GitHub repository to clone.--remove
\-r
(Optional): Removes comments from supported code files.--no-skip
(Optional): Includes specified file groups that would otherwise be excluded.-
- Available groups: media, office, system, executables, archive, audio, video, database, font, temporary, compiled_code, certificate, configuration, virtual_env, node_modules, python_bytecode, package_locks, log_files, cache_files
--max-size
(Optional): Set the maximum file size in KB (default: 1000 KB). Files exceeding this size are skipped. Files larger than 500KB but within the limit are logged.--log
(Optional): Path to the log file (default: output/union_file.log)--exclude
(Optional): Specify folders to exclude (and their contents).
- `repo_url``: The SSH URL of the GitHub repository to clone.
-r, --remove
: Remove comments from code files office files.--skip
: Skip files of certain types. You can specify multiple types separated by spaces. Available types are: media, office, system, executables, archive, audio, video, database, font, temporary, compiled_code, certificate, configuration, virtual_env, node_modules, python_bytecode, package_locks, log_files, cache_files.
This command:
- Clones the specified repository.
- Removes comments from code files.
- Includes media and office files.
- Sets the maximum allowed file size to 250KB.
- Exclude folder
vendor
python repoharvester.py [email protected]:username/repo.git --remove --no-skip media office --max-size 250 --exclude vendor
RepoHarvester excludes many common file types by default. These include but are not limited to:
-
Media files: png, jpg, jpeg, gif, bmp, tiff, svg, ico, raw, psd, ai
-
Office documents: xlsx, xls, docx, pptx, pdf
-
System files: pack, idx, log, DS_Store, sys, ini, bat, plist
-
Executables: exe, dll, so, bin
-
And many more…
-
See the EXTENSION_GROUPS dictionary within the code for the complete list.
The comment_patterns.py file contains regular expressions used to identify and remove comments from various programming languages:
COMMENT_PATTERNS = {
'py': r'#.*', # Python
'js': r'//.*|/\*[\s\S]*?\*/', # JavaScript
...
# Add more patterns for different file types as needed
}
The generated text file will have the following structure:
## <name_of_repository>
### <filename>.<file_extension>
<content_of_the_file>
### end of file
...
- non-UTF-8 files are skipped and logged.
- A list of skipped files is saved to
output/skipped_files.txt
. - The
--exclude
option allows for more granular control over which files are included.
Contributions to RepoHarvester are welcome! Please read our contributing guidelines to get started.
RepoHarvester is released under the MIT License. See the LICENSE file for more details.