This project is a Python tool for classifying URLs into predefined categories using Natural Language Processing. It extracts and cleans text from web pages, processes the content, and compares it against a corpus of categorized text data using a probabilistic model.
Sample output for the current contents of `links.json` can be found in `result.json`.
- Downloads Wikipedia data into `category_data/` to create the categorized corpus.
- Fetches web pages and extracts the title, meta tags, and content.
- Lemmatizes and cleans the extracted text.
- Compares the text against the categorized corpus.
- Assigns the URL to the category with the highest probabilistic match (a sketch of this flow follows the list).
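For orientation, here is a minimal sketch of how such a pipeline can fit together, assuming a Laplace-smoothed bag-of-words (naive Bayes style) comparison. The function names, extraction logic, and scoring below are illustrative assumptions, not the actual implementation in `main.py`:

```python
# Illustrative sketch only; assumes NLTK's WordNet data is available
# (see the note under Installation).
import math
import os
import re
from collections import Counter
from typing import Dict, List

import requests
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def extract_text(url: str) -> str:
    """Fetch a page and gather its title, meta description, and body text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = (soup.title.string or "") if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    return " ".join([title, description, soup.get_text(" ")])


def clean_tokens(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, and lemmatize them."""
    return [lemmatizer.lemmatize(w) for w in re.findall(r"[a-z]+", text.lower())]


def load_corpus(root: str = "category_data") -> Dict[str, Counter]:
    """Build per-category word counts from directories of .txt files."""
    corpus = {}
    for category in sorted(os.listdir(root)):
        counts = Counter()
        for name in os.listdir(os.path.join(root, category)):
            with open(os.path.join(root, category, name), encoding="utf-8") as f:
                counts.update(clean_tokens(f.read()))
        corpus[category] = counts
    return corpus


def classify(url: str, corpus: Dict[str, Counter]) -> str:
    """Return the category whose smoothed log-likelihood for the page text is highest."""
    tokens = clean_tokens(extract_text(url))
    scores = {}
    for category, counts in corpus.items():
        total, vocab = sum(counts.values()), len(counts)
        scores[category] = sum(
            math.log((counts[t] + 1) / (total + vocab)) for t in tokens
        )
    return max(scores, key=scores.get)
```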
- Python: 3.8 or higher
- Dependencies: `requests`, `nltk`, `Scrapy`, `beautifulsoup4`, `newspaper3k`, `lxml_html_clean` (a sample `requirements.txt` follows)
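These dependencies are captured in the repository's `requirements.txt`; unpinned, it would look like the following (the actual file may pin specific versions):

```text
requests
nltk
Scrapy
beautifulsoup4
newspaper3k
lxml_html_clean
```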
- Clone the repository:

  ```bash
  git clone [email protected]:6b70/nlp-url-categorizer.git
  cd nlp-url-categorizer
  ```

- Set up a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies (a note on NLTK data follows this list):

  ```bash
  pip install -r requirements.txt
  ```
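One caveat worth noting: NLTK's lemmatizer depends on downloadable corpora. If `main.py` does not fetch these itself (an assumption worth verifying), a one-time download is needed:

```python
import nltk

# One-time downloads used by lemmatization and tokenization.
nltk.download("wordnet")   # WordNet data required by WordNetLemmatizer
nltk.download("omw-1.4")   # Open Multilingual Wordnet, used by newer NLTK releases
nltk.download("punkt")     # tokenizer models, if word/sentence tokenization is used
```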
- The input is a JSON list of links stored in `links.json`. Example:

  ```json
  [
    "https://developer.apple.com",
    "https://github.com"
  ]
  ```
- Corpus Data:
  - The program automatically downloads corpus data from Wikipedia and organizes it into `category_data/` for the predefined categories.
  - Categories are stored as directories, each containing `.txt` files with data. The resulting layout is sketched below.
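The layout looks roughly like this (the category names here are illustrative placeholders):

```text
category_data/
├── technology/
│   ├── article_1.txt
│   └── article_2.txt
└── sports/
    ├── article_1.txt
    └── article_2.txt
```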
- Run the Program:

  ```bash
  python3 main.py
  ```
- Output:
  - The program generates a `result.json` file with categorized results for the input URLs (a hypothetical example follows).
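The `result.json` in the repository is the authoritative sample; one plausible shape, mapping each input URL to its best-matching category, might look like this (values invented for illustration):

```json
{
  "https://developer.apple.com": "technology",
  "https://github.com": "technology"
}
```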