
Memory Issues / Timeouts Running Docker Images on AWS #361

Closed
kylepatel opened this issue Dec 19, 2024 · 1 comment
kylepatel commented Dec 19, 2024

Awesome project! Super helpful!

I'm just using the images straight from Docker Hub for basic-amd64 and all-amd64.

I'm seeing some strange memory behavior when running as a Docker container on AWS. While the crawl is running, memory utilization climbs steadily. If I let it go, it reaches 99%+ and the crawler gets very slow (or just starts timing out), so what I've been doing is rebooting the container every hour or so to clear the memory, but that's not best practice. I'm curious what's going on with the memory management under the hood.

My typical crawl params:

            "crawler_params": {
                "headless": True,
                "simulate_user": True,
                "page_timeout": 3000,
                "remove_overlay_elements": True,
                "magic": True,
                "override_navigator": True,
            },
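
Roughly how I'm submitting those params, for context (a simplified sketch, not copied verbatim from my setup; the /crawl and /task/{task_id} endpoints, the port, and the response field names here are assumptions about the Docker server's REST API, so adjust to your deployment):

import time
import requests

API = "http://localhost:11235"  # assumed host/port of the Crawl4AI Docker container

payload = {
    "urls": "https://example.com",  # placeholder URL
    "crawler_params": {
        "headless": True,
        "simulate_user": True,
        "page_timeout": 3000,
        "remove_overlay_elements": True,
        "magic": True,
        "override_navigator": True,
    },
}

# Submit the crawl, then poll the task endpoint until it finishes
task_id = requests.post(f"{API}/crawl", json=payload).json()["task_id"]
while True:
    status = requests.get(f"{API}/task/{task_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(2)
print(status.get("status"))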

Example: at the beginning my crawl is running and memory is steadily being consumed. While the container is idling (for a solid 10 hours), the memory isn't released. At the end I just rebooted it to clear it out.

This machine has 32GB RAM, which should be plenty.
[screenshot: container memory utilization graph over time]

Haven't had a chance yet to debug this locally myself.
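
When I do, I'll probably start from a minimal local reproduction along these lines (a sketch only, assuming the same params map onto CrawlerRunConfig fields of the same names; the URL is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        simulate_user=True,
        page_timeout=3000,            # ms, same value as the Docker crawler_params above
        remove_overlay_elements=True,
        magic=True,
        override_navigator=True,
    )
    # The context manager starts and closes the browser, so nothing should linger between runs
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.success, len(result.markdown or ""))

asyncio.run(main())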

unclecode (Owner) commented

@kylepatel Thanks for trying Crawl4AI! Apologies for the delayed response—I’ve been busy updating documentation. Let’s address your issue:

  1. Optimization Recommendations:

    • Disable the GPU by passing --disable-gpu as a browser argument.
    • Reduce the viewport size to improve performance.
  2. Crawl Usage:

    • Please share your implementation details, especially your approach to creating instances of the AsyncWebCrawler class.
    • Avoid creating a new browser instance for each URL; this significantly impacts performance. Instead:
      • Create one browser instance and open new pages or tabs within it.
      • For sequential crawling, use a session ID to reuse the same tab rather than opening multiple tabs or processes.
  3. Next Steps:

    • Let me know how you’ve used Crawl4AI and share your code if possible. I can provide more specific guidance based on your setup.

I'm also sharing a code snippet below that illustrates the general approach. Let me know your thoughts.

"""
This example demonstrates optimal browser usage patterns in Crawl4AI:
1. Sequential crawling with session reuse
2. Parallel crawling with browser instance reuse
3. Performance optimization settings
"""

import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    """
    Sequential crawling using session reuse - most efficient for moderate workloads
    """
    print("\n=== Sequential Crawling with Session Reuse ===")
    
    # Configure browser with optimized settings
    browser_config = BrowserConfig(
        headless=True,
        extra_args=[
            "--disable-gpu",              # Disable GPU acceleration
            "--disable-dev-shm-usage",    # Avoid /dev/shm exhaustion inside Docker
            "--no-sandbox",               # Required for Docker
        ],
        viewport_width=800,    # Smaller viewport for better performance
        viewport_height=600,
    )
    
    # Configure crawl settings
    crawl_config = CrawlerRunConfig(
        content_filter=PruningContentFilter(),
        markdown_generator=DefaultMarkdownGenerator(),
        screenshot=False  # Disable screenshots if not needed
    )
    
    # Create single crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    
    try:
        session_id = "session1"  # Use same session for all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id  # Reuse same browser tab
            )
            if result.success:
                print(f"Successfully crawled {url}")
                print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    """
    Parallel crawling while reusing browser instance - best for large workloads
    """
    print("\n=== Parallel Crawling with Browser Reuse ===")
    
    browser_config = BrowserConfig(
        headless=True,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
        viewport_width=800,
        viewport_height=600,
    )
    
    crawl_config = CrawlerRunConfig(
        content_filter=PruningContentFilter(),
        markdown_generator=DefaultMarkdownGenerator(),
        screenshot=False
    )
    
    # Create single crawler instance for all parallel tasks
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    
    try:
        # Create tasks in batches to control concurrency
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i:i + max_concurrent]
            tasks = []
            
            for j, url in enumerate(batch):
                session_id = f"parallel_session_{j}"  # Different session per concurrent task
                task = crawler.arun(
                    url=url,
                    config=crawl_config,
                    session_id=session_id
                )
                tasks.append(task)
            
            # Wait for batch to complete
            results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # Process results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {str(result)}")
                elif result.success:
                    print(f"Successfully crawled {url}")
                    print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()

async def main():
    # Example URLs
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    ]
    
    # Demo sequential crawling
    await crawl_sequential(urls)
    
    # Demo parallel crawling
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())

Looking forward to your details so I can assist further!
