Web Scraper to Markdown Converter

This project is a C# console application that scrapes content from a list of URLs specified in an Excel file, converts the HTML content to Markdown, and saves the Markdown files to a specified directory structure. The application uses the ClosedXML library to read Excel files, HtmlAgilityPack for web scraping, and ReverseMarkdown for converting HTML to Markdown.

Features

- Excel Integration: Reads URLs and corresponding folder paths from an Excel file.
- Web Scraping: Scrapes the main content of web pages.
- HTML to Markdown Conversion: Converts the scraped HTML content to Markdown format.
- Parallel Processing: Uses Parallel.ForEachAsync to process multiple URLs concurrently.
- File Organization: Saves Markdown files in a structured directory based on the folder paths specified in the Excel file.
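
As an illustration of the parallel processing feature, here is a minimal sketch of how `Parallel.ForEachAsync` can fan work out over a list of URLs. The URL list, the concurrency cap of 4, and the `Task.Delay` placeholder are assumptions made for this example; they are not taken from Program.cs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical input; the real application reads its URLs from links.xlsx.
var urls = new List<string>
{
    "https://example.com/page1",
    "https://example.com/page2",
};

// Thread-safe collection, because the loop body runs concurrently.
var processed = new ConcurrentBag<string>();

// Assumed cap on concurrent work; tune it for the target site.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

await Parallel.ForEachAsync(urls, options, async (url, cancellationToken) =>
{
    // Placeholder for the scrape, convert, and save steps.
    await Task.Delay(100, cancellationToken);
    processed.Add(url);
});

Console.WriteLine($"Processed {processed.Count} URLs.");
```

Capping `MaxDegreeOfParallelism` is a design choice worth considering when all URLs point at the same host, since unbounded concurrency can trigger rate limiting.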

Prerequisites

.NET SDK (version 6.0 or later)
ClosedXML (for Excel file handling)
HtmlAgilityPack (for web scraping)
ReverseMarkdown (for HTML to Markdown conversion)

Installation

Clone the repository:

```
git clone https://github.com/yourusername/web-scraper-markdown-converter.git
cd web-scraper-markdown-converter
```

Install the required NuGet packages:

```
dotnet add package ClosedXML
dotnet add package HtmlAgilityPack
dotnet add package ReverseMarkdown
```

Build the project:
```
dotnet build
```

Usage

Prepare the Excel File:
- Create an Excel file named links.xlsx in the root directory.
- The first column should contain the URLs, and the second column should contain the folder paths where the Markdown files will be saved.
- Example:

  | URL | Folder Path |
  | --- | --- |
  | https://example.com/page1 | folder1/subfolder1 |
  | https://example.com/page2 | folder2/subfolder2 |

Run the Application:
```
dotnet run
```
Output:
- The application will create a directory named output in the root directory.
- Markdown files will be saved in the specified folder structure within the output directory.
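
For readers who want to see how the two-column layout from "Prepare the Excel File" could be consumed, here is a rough sketch using ClosedXML. The `ReadLinks` helper is hypothetical, and it assumes the first worksheet holds the data and that the first used row is a header; the actual Program.cs may handle these details differently.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using ClosedXML.Excel;

// Hypothetical helper: returns (URL, folder path) pairs from the first worksheet.
static List<(string Url, string Folder)> ReadLinks(string path)
{
    var entries = new List<(string Url, string Folder)>();

    using var workbook = new XLWorkbook(path);
    var worksheet = workbook.Worksheet(1);

    // Assumption: the first used row is a header ("URL", "Folder Path"), so skip it.
    foreach (var row in worksheet.RowsUsed().Skip(1))
    {
        var url = row.Cell(1).GetString().Trim();
        var folder = row.Cell(2).GetString().Trim();

        // Ignore rows without a URL.
        if (!string.IsNullOrEmpty(url))
        {
            entries.Add((url, folder));
        }
    }

    return entries;
}

// Only read the file if it is present in the working directory.
if (File.Exists("links.xlsx"))
{
    foreach (var (url, folder) in ReadLinks("links.xlsx"))
    {
        Console.WriteLine($"{url} -> {folder}");
    }
}
```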

Code Overview

- Program.cs: The main entry point of the application.
  - Reads URLs and folder paths from the Excel file.
  - Scrapes the content from the URLs.
  - Converts the HTML content to Markdown.
  - Saves the Markdown files to the specified directory structure.
- Excel File Handling: Uses ClosedXML to read the Excel file.
- Web Scraping: Uses HtmlAgilityPack to scrape the main content of web pages.
- HTML to Markdown Conversion: Uses ReverseMarkdown to convert HTML to Markdown.
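
The scraping and conversion steps can be sketched roughly as follows. This is an illustration under assumptions, not the code in Program.cs: the `HtmlToMarkdown` helper is hypothetical, the HTML is assumed to be already downloaded as a string, and "main content" is assumed to mean the first `<main>` element, falling back to `<article>` and then `<body>`.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: extracts the main content node and converts it to Markdown.
static string HtmlToMarkdown(string html)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);

    // Assumed selection order: <main>, then <article>, then <body>, then the whole document.
    var contentNode = document.DocumentNode.SelectSingleNode("//main")
        ?? document.DocumentNode.SelectSingleNode("//article")
        ?? document.DocumentNode.SelectSingleNode("//body")
        ?? document.DocumentNode;

    // ReverseMarkdown converts an HTML fragment into Markdown text.
    var converter = new ReverseMarkdown.Converter();
    return converter.Convert(contentNode.InnerHtml);
}

// Sample input; the navigation block sits outside <main> and is dropped.
var sample = "<html><body><nav>Menu</nav><main><h1>Title</h1>"
    + "<p>Hello <strong>world</strong>.</p></main></body></html>";

Console.WriteLine(HtmlToMarkdown(sample));
```

Real pages often need a site-specific XPath expression, so the selector is the part most likely to require adjustment.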

Example

Given the following Excel file:

| URL | Folder Path |
| --- | --- |
| https://example.com/page1 | folder1/subfolder1 |
| https://example.com/page2 | folder2/subfolder2 |

The application will create the following directory structure:

```
output/
├── folder1/
│   └── subfolder1/
│       └── page1.md
└── folder2/
    └── subfolder2/
        └── page2.md
```
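
The mapping from a spreadsheet row to a file in this tree might look like the sketch below. The `BuildOutputPath` helper is hypothetical, and two details are assumptions inferred from the example rather than confirmed behavior: the file name is taken from the last segment of the URL path, and `index` is used when the URL has no path segment.

```csharp
using System;
using System.IO;

// Hypothetical helper: builds the Markdown file path for one spreadsheet row.
static string BuildOutputPath(string outputRoot, string folderPath, string url)
{
    // Assumption: the file name is the last non-empty segment of the URL path.
    var segments = new Uri(url).AbsolutePath
        .Split('/', StringSplitOptions.RemoveEmptyEntries);
    var name = segments.Length > 0 ? segments[^1] : "index"; // "index" is an assumed fallback

    // Split on '/' so the spreadsheet value is rebuilt with the platform's separator.
    var parts = folderPath.Split('/', StringSplitOptions.RemoveEmptyEntries);
    var directory = Path.Combine(outputRoot, Path.Combine(parts));

    return Path.Combine(directory, name + ".md");
}

var path = BuildOutputPath("output", "folder1/subfolder1", "https://example.com/page1");

// Create the folder structure before a file would be written into it.
Directory.CreateDirectory(Path.GetDirectoryName(path)!);
Console.WriteLine(path);
```

For the first row of the example, this yields `page1.md` inside `output/folder1/subfolder1`, written with whatever directory separator the operating system uses.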

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

- ClosedXML for Excel file handling.
- HtmlAgilityPack for web scraping.
- ReverseMarkdown for HTML to Markdown conversion.
