
Memory Issues / Timeouts Running Docker Images on AWS #361

Closed
kylepatel opened this issue Dec 19, 2024 · 1 comment
kylepatel commented Dec 19, 2024

Awesome project! Super helpful!

I'm just using the images straight from Docker Hub for basic-amd64 and all-amd64.

I'm seeing some strange memory behavior when running as a Docker container on AWS. While the crawl is running, memory utilization climbs steadily. If I let it go, it reaches 99%+ and the crawler gets very slow (or just starts timing out), so what I've been doing is rebooting the container every hour or so to clear the memory, but that's not best practice. I'm curious what's going on with the memory management under the hood.

My typical crawl params:

            "crawler_params": {
                "headless": True,
                "simulate_user": True,
                "page_timeout": 3000,
                "remove_overlay_elements": True,
                "magic": True,
                "override_navigator": True,
            },
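
Roughly how I'm submitting those params, for context (a simplified sketch, not copied verbatim from my setup; the /crawl and /task/{task_id} endpoints, the port, and the response field names here are assumptions about the Docker server's REST API, so adjust to your deployment):

import time
import requests

API = "http://localhost:11235"  # assumed host/port of the Crawl4AI Docker container

payload = {
    "urls": "https://example.com",  # placeholder URL
    "crawler_params": {
        "headless": True,
        "simulate_user": True,
        "page_timeout": 3000,
        "remove_overlay_elements": True,
        "magic": True,
        "override_navigator": True,
    },
}

# Submit the crawl, then poll the task endpoint until it finishes
task_id = requests.post(f"{API}/crawl", json=payload).json()["task_id"]
while True:
    status = requests.get(f"{API}/task/{task_id}").json()
    if status.get("status") in ("completed", "failed"):
        break
    time.sleep(2)
print(status.get("status"))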

Example: at the beginning my crawl is running and memory is steadily being consumed. While the container is idling (for a solid 10 hours), the memory isn't released. At the end I just rebooted it to clear it out.

This machine has 32GB RAM, which should be plenty.
[screenshot: container memory utilization graph over time]

Haven't had a chance yet to debug this locally myself.
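
When I do, I'll probably start from a minimal local reproduction along these lines (a sketch only, assuming the same params map onto CrawlerRunConfig fields of the same names; the URL is a placeholder):

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        simulate_user=True,
        page_timeout=3000,            # ms, same value as the Docker crawler_params above
        remove_overlay_elements=True,
        magic=True,
        override_navigator=True,
    )
    # The context manager starts and closes the browser, so nothing should linger between runs
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.success, len(result.markdown or ""))

asyncio.run(main())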

unclecode (Owner) commented

@kylepatel Thanks for trying Crawl4AI! Apologies for the delayed response—I’ve been busy updating documentation. Let’s address your issue:

  1. Optimization Recommendations:

    • Disable the GPU by passing --disable-gpu as a browser argument.
    • Reduce the viewport size to improve performance.
  2. Crawl Usage:

    • Please share your implementation details, especially your approach to creating instances of the AsyncWebCrawler class.
    • Avoid creating a new browser instance for each URL; this significantly impacts performance. Instead:
      • Create one browser instance and open new pages or tabs within it.
      • For sequential crawling, use a session ID to reuse the same tab rather than opening multiple tabs or processes.
  3. Next Steps:

    • Let me know how you’ve used Crawl4AI and share your code if possible. I can provide more specific guidance based on your setup.

I'm also sharing a code snippet below that illustrates the general approach. Let me know your thoughts.

"""
This example demonstrates optimal browser usage patterns in Crawl4AI:
1. Sequential crawling with session reuse
2. Parallel crawling with browser instance reuse
3. Performance optimization settings
"""

import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    """
    Sequential crawling using session reuse - most efficient for moderate workloads
    """
    print("\n=== Sequential Crawling with Session Reuse ===")
    
    # Configure browser with optimized settings
    browser_config = BrowserConfig(
        headless=True,
        extra_args=[
            "--disable-gpu",              # Disable GPU acceleration
            "--disable-dev-shm-usage",    # Avoid /dev/shm exhaustion inside Docker
            "--no-sandbox",               # Required for Docker
        ],
        viewport_width=800,    # Smaller viewport for better performance
        viewport_height=600,
    )
    
    # Configure crawl settings
    crawl_config = CrawlerRunConfig(
        content_filter=PruningContentFilter(),
        markdown_generator=DefaultMarkdownGenerator(),
        screenshot=False  # Disable screenshots if not needed
    )
    
    # Create single crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    
    try:
        session_id = "session1"  # Use same session for all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id  # Reuse same browser tab
            )
            if result.success:
                print(f"Successfully crawled {url}")
                print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    """
    Parallel crawling while reusing browser instance - best for large workloads
    """
    print("\n=== Parallel Crawling with Browser Reuse ===")
    
    browser_config = BrowserConfig(
        headless=True,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
        viewport_width=800,
        viewport_height=600,
    )
    
    crawl_config = CrawlerRunConfig(
        content_filter=PruningContentFilter(),
        markdown_generator=DefaultMarkdownGenerator(),
        screenshot=False
    )
    
    # Create single crawler instance for all parallel tasks
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    
    try:
        # Create tasks in batches to control concurrency
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i:i + max_concurrent]
            tasks = []
            
            for j, url in enumerate(batch):
                session_id = f"parallel_session_{j}"  # Different session per concurrent task
                task = crawler.arun(
                    url=url,
                    config=crawl_config,
                    session_id=session_id
                )
                tasks.append(task)
            
            # Wait for batch to complete
            results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # Process results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {str(result)}")
                elif result.success:
                    print(f"Successfully crawled {url}")
                    print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()

async def main():
    # Example URLs
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    ]
    
    # Demo sequential crawling
    await crawl_sequential(urls)
    
    # Demo parallel crawling
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())

Looking forward to your details so I can assist further!
