This project is a Python tool for classifying URLs into predefined categories using Natural Language Processing. It extracts and cleans text from web pages, processes the content, and compares it against a corpus of categorized text data using a probabilistic model.
Sample output for the current contents of `links.json` can be found in `result.json`.
- Downloads Wikipedia data into `category_data/` to create the categorized corpus.
- Fetches web pages and extracts the title, meta tags, and content.
- Lemmatizes and cleans the extracted text.
- Compares the text against the categorized corpus.
- Assigns the URL to the category with the highest probabilistic match (a sketch of this flow follows the list).
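For orientation, here is a minimal sketch of how such a pipeline can fit together, assuming a Laplace-smoothed bag-of-words (naive Bayes style) comparison. The function names, extraction logic, and scoring below are illustrative assumptions, not the actual implementation in `main.py`:

```python
# Illustrative sketch only; assumes NLTK's WordNet data is available
# (see the note under Installation).
import math
import os
import re
from collections import Counter
from typing import Dict, List

import requests
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()


def extract_text(url: str) -> str:
    """Fetch a page and gather its title, meta description, and body text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    title = (soup.title.string or "") if soup.title else ""
    meta = soup.find("meta", attrs={"name": "description"})
    description = meta.get("content", "") if meta else ""
    return " ".join([title, description, soup.get_text(" ")])


def clean_tokens(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, and lemmatize them."""
    return [lemmatizer.lemmatize(w) for w in re.findall(r"[a-z]+", text.lower())]


def load_corpus(root: str = "category_data") -> Dict[str, Counter]:
    """Build per-category word counts from directories of .txt files."""
    corpus = {}
    for category in sorted(os.listdir(root)):
        counts = Counter()
        for name in os.listdir(os.path.join(root, category)):
            with open(os.path.join(root, category, name), encoding="utf-8") as f:
                counts.update(clean_tokens(f.read()))
        corpus[category] = counts
    return corpus


def classify(url: str, corpus: Dict[str, Counter]) -> str:
    """Return the category whose smoothed log-likelihood for the page text is highest."""
    tokens = clean_tokens(extract_text(url))
    scores = {}
    for category, counts in corpus.items():
        total, vocab = sum(counts.values()), len(counts)
        scores[category] = sum(
            math.log((counts[t] + 1) / (total + vocab)) for t in tokens
        )
    return max(scores, key=scores.get)
```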
- Python: 3.8 or higher
- Dependencies: `requests`, `nltk`, `Scrapy`, `beautifulsoup4`, `newspaper3k`, `lxml_html_clean` (a sample `requirements.txt` follows)
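These dependencies are captured in the repository's `requirements.txt`; unpinned, it would look like the following (the actual file may pin specific versions):

```text
requests
nltk
Scrapy
beautifulsoup4
newspaper3k
lxml_html_clean
```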
- Clone the repository:

  ```bash
  git clone [email protected]:6b70/nlp-url-categorizer.git
  cd nlp-url-categorizer
  ```

- Set up a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies (a note on NLTK data follows this list):

  ```bash
  pip install -r requirements.txt
  ```
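One caveat worth noting: NLTK's lemmatizer depends on downloadable corpora. If `main.py` does not fetch these itself (an assumption worth verifying), a one-time download is needed:

```python
import nltk

# One-time downloads used by lemmatization and tokenization.
nltk.download("wordnet")   # WordNet data required by WordNetLemmatizer
nltk.download("omw-1.4")   # Open Multilingual Wordnet, used by newer NLTK releases
nltk.download("punkt")     # tokenizer models, if word/sentence tokenization is used
```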
- The input is a JSON list of links stored in `links.json`. Example:

  ```json
  [
    "https://developer.apple.com",
    "https://github.com"
  ]
  ```
- Corpus Data:
  - The program automatically downloads corpus data from Wikipedia and organizes it into `category_data/` for the predefined categories.
  - Categories are stored as directories, each containing `.txt` files with data. The resulting layout is sketched below.
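The layout looks roughly like this (the category names here are illustrative placeholders):

```text
category_data/
├── technology/
│   ├── article_1.txt
│   └── article_2.txt
└── sports/
    ├── article_1.txt
    └── article_2.txt
```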
- Run the Program:

  ```bash
  python3 main.py
  ```
- Output:
  - The program generates a `result.json` file with categorized results for the input URLs (a hypothetical example follows).
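The `result.json` in the repository is the authoritative sample; one plausible shape, mapping each input URL to its best-matching category, might look like this (values invented for illustration):

```json
{
  "https://developer.apple.com": "technology",
  "https://github.com": "technology"
}
```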