pyminiscraper is a lightweight Python library designed for easy web scraping. It provides a simple interface for extracting data from web pages with minimal setup. Whether you are a beginner or an advanced user, pyminiscraper offers the flexibility to handle a variety of scraping tasks efficiently.
Feature | Description |
---|---|
Basic Web Page scraping | Scrape HTML content from web pages |
Async scraping | Extremely scalable asynchronous scraping |
Web Page spidering | Follow and scrape links from web pages |
Parallel requests | Configure number of concurrent requests |
Headless browser support | JavaScript rendering support |
Robots.txt parsing | Respect robots.txt rules |
Sitemap parsing | Parse and follow sitemap.xml |
RSS/Atom parsing | Parse and follow RSS/Atom feeds |
Open Graph parsing | Extract Open Graph metadata |
Rate limiting | Configurable per-domain rate limiting |
Error handling | Robust error handling with retry logic |
Depth control | Control recursion depth for link following |
Custom user agent | Set custom User-Agent strings |
File storage | Built-in file storage system |
Custom callbacks | Define custom processing logic |
Domain restrictions | Control allowed/blocked domains |
Request timeout | Configurable request timeouts |
Page caching | Cache and reuse downloaded pages |
Install with pip:

```bash
pip install pyminiscraper
```
The scraper follows this workflow:

```
┌───────────────────┐
│                   │
│   Initializing    │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Download      │
│    Robots.txt     │
│                   │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│     Queue for     │
│   Configurable    │◀────┐
│     Parallel      │     │
│    Processing     │     │
└─────────┬─────────┘     │
          │               │
          ▼               │
┌───────────────────┐     │    ┌───────────────────┐
│      Scrape       │     │    │                   │
│    Web Pages,     │     │    │      Loading      │
│    RSS & Atom     │─────┼────│      Saving       │
│                   │     │    │     Web Pages     │
└─────────┬─────────┘     │    └───────────────────┘
          │               │
          ▼               │
┌───────────────────┐     │
│     Discover      │     │
│     Outgoing      │     │
│     Web Page      │─────┘
│   RSS/Atom feed   │
│       links       │
└───────────────────┘
```
This example shows how to scrape only pages referenced in a sitemap:
```python
from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl("https://www.example.com/", max_depth=2)
        ],
        follow_sitemap_links=True,
        follow_web_page_links=False,
        follow_feed_links=False,
        callback=FileStore(storage_dir),  # storage_dir: local directory for scraped output
    ),
)
await scraper.run()
```
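Since `Scraper.run()` is awaited, the snippet above must run inside an event loop. A minimal sketch of a complete script, assuming `storage_dir` simply points at a local output directory for `FileStore` (the `main` wrapper and the directory value are illustrative, not part of the library):

```python
import asyncio

from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

async def main() -> None:
    storage_dir = "./scraped_pages"  # illustrative local directory for FileStore output
    scraper = Scraper(
        ScraperConfig(
            seed_urls=[ScraperUrl("https://www.example.com/", max_depth=2)],
            follow_sitemap_links=True,
            follow_web_page_links=False,
            follow_feed_links=False,
            callback=FileStore(storage_dir),
        ),
    )
    await scraper.run()

if __name__ == "__main__":
    asyncio.run(main())
```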
Example of scraping content from RSS/Atom feeds:
```python
from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, ScraperUrlType, FileStore

scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl(
                "https://feeds.feedburner.com/PythonInsider",
                type=ScraperUrlType.FEED
            )
        ],
        follow_sitemap_links=False,
        follow_web_page_links=False,
        follow_feed_links=True,
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```
Example of comprehensive website crawling using all available sources:
```python
scraper = Scraper(
    ScraperConfig(
        seed_urls=[
            ScraperUrl("https://www.example.com/")
        ],
        follow_sitemap_links=True,
        follow_web_page_links=True,
        follow_feed_links=True,
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```
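For a large site you will usually want to bound the crawl. A sketch using only configuration options documented in the parameter reference below (the specific values are illustrative):

```python
scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://www.example.com/")],
        follow_sitemap_links=True,
        follow_web_page_links=True,
        follow_feed_links=True,
        max_depth=4,                # stop following links four hops from the seeds
        max_requested_urls=10000,   # hard cap on the total number of requested URLs
        max_parallel_requests=8,    # fewer concurrent requests than the default of 16
        crawl_delay_seconds=2,      # per-domain delay between requests
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```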
Example of custom processing using callbacks:
```python
from pyminiscraper.config import ScraperCallback, ScraperContext
from pyminiscraper.model import ScraperWebPage, ScraperUrl

class CustomCallback(ScraperCallback):
    async def on_web_page(self, context: ScraperContext, request: ScraperUrl, response: ScraperWebPage) -> None:
        # Custom processing logic here
        print(f"Processing {response.url}")
        print(f"Title: {response.metadata_title}")

    async def on_feed(self, context: ScraperContext, feed: Feed) -> None:
        # Custom feed processing; Feed is the library's parsed-feed type and
        # must be imported from the appropriate pyminiscraper module.
        for item in feed.items:
            print(f"Feed item: {item.title}")

scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://example.com")],
        callback=CustomCallback(),
    )
)
await scraper.run()
```
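A callback can also accumulate results instead of printing them. A minimal sketch that records the `response.url` and `response.metadata_title` fields shown above; `TitleCollector` is illustrative, not part of the library, and depending on the version other callback methods (such as `on_feed`) may also need to be overridden:

```python
from pyminiscraper.config import ScraperCallback, ScraperContext
from pyminiscraper.model import ScraperWebPage, ScraperUrl

class TitleCollector(ScraperCallback):
    """Illustrative callback that collects (url, title) pairs for scraped pages."""

    def __init__(self) -> None:
        self.titles: list[tuple[str, str | None]] = []

    async def on_web_page(self, context: ScraperContext, request: ScraperUrl, response: ScraperWebPage) -> None:
        # Record the page URL and the title metadata extracted by the scraper.
        self.titles.append((response.url, response.metadata_title))

collector = TitleCollector()
scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://example.com")],
        callback=collector,
    )
)
await scraper.run()
print(f"Collected titles from {len(collector.titles)} pages")
```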
ScraperConfig holds the configuration for scraping behavior.
Parameters:
- seed_urls (list[ScraperUrl]): Initial URLs to start scraping from
- callback (ScraperCallback): Callback for processing scraped content
- include_path_patterns (list[str]): URL paths to include (default: [])
- exclude_path_patterns (list[str]): URL paths to exclude (default: [])
- max_parallel_requests (int): Maximum concurrent requests (default: 16)
- use_headless_browser (bool): Use headless browser for JavaScript (default: False)
- request_timeout_seconds (int): Request timeout in seconds (default: 30)
- follow_web_page_links (bool): Follow links in web pages (default: False)
- follow_sitemap_links (bool): Follow sitemap.xml links (default: True)
- follow_feed_links (bool): Follow RSS/Atom feed links (default: True)
- prevent_default_queuing (bool): Disable automatic URL queuing (default: False)
- max_requested_urls (int): Maximum total URLs to request (default: 65536)
- max_back_to_back_errors (int): Consecutive errors before stopping (default: 128)
- on_response_callback (ScraperResponseCallback): Optional response callback
- max_depth (int): Maximum recursion depth for links (default: 16)
- crawl_delay_seconds (int): Delay between requests per domain (default: 1)
- domain_config (ScraperDomainConfig): Allowed/blocked domains configuration
- user_agent (str): User agent string (default: 'pyminiscraper')
- referer (str): Referer header (default: "https://www.google.com")
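A sketch that combines several of these parameters; the values are illustrative, and `storage_dir` is assumed to be a local directory path as in the earlier examples:

```python
from pyminiscraper import Scraper, ScraperConfig, ScraperUrl, FileStore

scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://www.example.com/")],
        max_parallel_requests=8,       # below the default of 16
        request_timeout_seconds=15,    # give up on slow requests sooner
        use_headless_browser=True,     # only needed when pages require JavaScript rendering
        user_agent="my-crawler/0.1",   # identify your crawler to servers
        referer="https://www.example.com/",
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```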
ScraperDomainConfig controls which domains are allowed or blocked:
```python
from pyminiscraper.config import ScraperDomainConfig, ScraperDomainConfigMode
# ScraperAllowedDomains (used below) is part of the same domain-configuration API;
# import it from the module your pyminiscraper version exposes it in.

# Allow only specific domains
config = ScraperDomainConfig(
    allowance=ScraperAllowedDomains(domains=["example.com", "api.example.com"]),
    forbidden_domains=["ads.example.com"]
)

# Allow all domains
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.ALLOW_ALL
)

# Allow only domains derived from seed URLs
config = ScraperDomainConfig(
    allowance=ScraperDomainConfigMode.DERIVE_FROM_SEED_URLS
)
```
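The resulting object is passed to `ScraperConfig` through the `domain_config` parameter listed above. A minimal sketch:

```python
scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://www.example.com/")],
        domain_config=ScraperDomainConfig(
            allowance=ScraperDomainConfigMode.DERIVE_FROM_SEED_URLS
        ),
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```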
The scraper includes built-in error handling:

- Respects `max_back_to_back_errors` to stop after too many consecutive failures
- Retries failed requests with exponential backoff
- Logs errors for debugging
- Continues operation after non-fatal errors

Performance tips:

- Adjust `max_parallel_requests` based on your needs and server capacity
- Use `crawl_delay_seconds` to control the request rate
- Enable `use_headless_browser` only when JavaScript rendering is required
- Implement caching in your callback to avoid re-downloading pages
- Use path patterns to filter URLs before downloading, as shown in the sketch below
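For example, include and exclude path patterns keep a crawl focused before any download happens. A sketch using the `include_path_patterns` and `exclude_path_patterns` options documented above; the pattern values are illustrative, and the exact matching semantics are defined by the library:

```python
scraper = Scraper(
    ScraperConfig(
        seed_urls=[ScraperUrl("https://www.example.com/")],
        follow_web_page_links=True,
        include_path_patterns=["/blog/"],      # only queue URLs under /blog/
        exclude_path_patterns=["/blog/tag/"],  # skip tag archive pages
        callback=FileStore(storage_dir),
    ),
)
await scraper.run()
```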
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.