This project is a C# console application that scrapes content from a list of URLs specified in an Excel file, converts the HTML content to Markdown, and saves the Markdown files to a specified directory structure. The application uses the ClosedXML library to read Excel files, HtmlAgilityPack for web scraping, and ReverseMarkdown for converting HTML to Markdown.
- Excel Integration: Reads URLs and corresponding folder paths from an Excel file.
- Web Scraping: Scrapes the main content of web pages.
- HTML to Markdown Conversion: Converts the scraped HTML content to Markdown format.
- Parallel Processing: Uses Parallel.ForEachAsync to process multiple URLs concurrently.
- File Organization: Saves Markdown files in a structured directory based on the folder paths specified in the Excel file.
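The parallel processing feature listed above can be illustrated with a minimal sketch. It is not the project's actual code: the `links` array, the concurrency limit of 4, and the `Task.Delay` stand-in for the download, convert, and save work are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical work list: (URL, folder path) pairs as they might be read from the spreadsheet.
var links = new[]
{
    (Url: "https://example.com/page1", Folder: "folder1/subfolder1"),
    (Url: "https://example.com/page2", Folder: "folder2/subfolder2"),
};

// Thread-safe collection, because the loop body runs concurrently.
var processed = new ConcurrentBag<string>();

// Cap concurrency so the target site is not flooded; 4 is an arbitrary choice.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

await Parallel.ForEachAsync(links, options, async (link, cancellationToken) =>
{
    // Stand-in for the real scrape/convert/save step.
    await Task.Delay(100, cancellationToken);
    processed.Add(link.Url);
});

Console.WriteLine($"Processed {processed.Count} URLs");
```

Parallel.ForEachAsync requires .NET 6 or later, which matches the SDK version this project asks for. Limiting MaxDegreeOfParallelism is a design choice worth considering when many URLs point at the same host.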
Prerequisites:
- .NET SDK (version 6.0 or later)
- ClosedXML (for Excel file handling)
- HtmlAgilityPack (for web scraping)
- ReverseMarkdown (for HTML to Markdown conversion)
- Clone the repository:

      git clone https://github.com/yourusername/web-scraper-markdown-converter.git
      cd web-scraper-markdown-converter

- Install the required NuGet packages:

      dotnet add package ClosedXML
      dotnet add package HtmlAgilityPack
      dotnet add package ReverseMarkdown

- Build the project:

      dotnet build
- Prepare the Excel File:
  - Create an Excel file named links.xlsx in the root directory.
  - The first column should contain the URLs, and the second column should contain the folder paths where the Markdown files will be saved.
  - Example:

    URL | Folder Path |
    ---|---|
    https://example.com/page1 | folder1/subfolder1 |
    https://example.com/page2 | folder2/subfolder2 |
- Run the Application:

      dotnet run
- Output:
  - The application will create a directory named output in the root directory.
  - Markdown files will be saved in the specified folder structure within the output directory.
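As a rough illustration of how the links.xlsx layout described above could be read with ClosedXML, here is a sketch under stated assumptions. The helper name `ReadLinks` is hypothetical, and the code assumes the first worksheet holds the data and that row 1 is a header row ("URL", "Folder Path"), as in the example; the project's own reading logic may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using ClosedXML.Excel;

// Hypothetical helper: reads (URL, folder path) pairs from the first worksheet.
// Assumes row 1 is a header row and skips it.
static List<(string Url, string Folder)> ReadLinks(string excelPath)
{
    var links = new List<(string Url, string Folder)>();
    using var workbook = new XLWorkbook(excelPath);
    var sheet = workbook.Worksheet(1);

    foreach (var row in sheet.RowsUsed().Skip(1))
    {
        string url = row.Cell(1).GetString().Trim();    // column A: URL
        string folder = row.Cell(2).GetString().Trim(); // column B: folder path
        if (url.Length == 0) continue;                  // ignore rows without a URL
        links.Add((url, folder));
    }
    return links;
}

// Only read the file if it exists, so the sketch does not throw in an empty directory.
if (File.Exists("links.xlsx"))
{
    foreach (var (url, folder) in ReadLinks("links.xlsx"))
    {
        Console.WriteLine($"{url} -> {folder}");
    }
}
```

If the spreadsheet has no header row, dropping the `Skip(1)` call would be the only change needed.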
- Program.cs: The main entry point of the application.
  - Reads URLs and folder paths from the Excel file.
  - Scrapes the content from the URLs.
  - Converts the HTML content to Markdown.
  - Saves the Markdown files to the specified directory structure.
- Excel File Handling: Uses ClosedXML to read the Excel file.
- Web Scraping: Uses HtmlAgilityPack to scrape the main content of web pages.
- HTML to Markdown Conversion: Uses ReverseMarkdown to convert HTML to Markdown.
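The scraping and conversion steps listed above might look roughly like the following sketch. The inline HTML string stands in for a downloaded page so the example runs offline, and the rule for picking the "main content" (prefer `<main>`, then `<article>`, then `<body>`) is an assumed heuristic, not necessarily the one Program.cs uses.

```csharp
using System;
using HtmlAgilityPack;

// Inline HTML stands in for a downloaded page so the sketch needs no network access.
string html = @"<html><body>
  <nav>Site menu</nav>
  <main><h1>Title</h1><p>Hello <strong>world</strong></p></main>
</body></html>";

var document = new HtmlDocument();
document.LoadHtml(html);

// Assumed heuristic: prefer <main>, then <article>, then <body>, then the whole document.
HtmlNode content =
    document.DocumentNode.SelectSingleNode("//main")
    ?? document.DocumentNode.SelectSingleNode("//article")
    ?? document.DocumentNode.SelectSingleNode("//body")
    ?? document.DocumentNode;

// Convert only the selected fragment, so navigation and similar chrome are left out.
var converter = new ReverseMarkdown.Converter();
string markdown = converter.Convert(content.InnerHtml);

Console.WriteLine(markdown);
```

For a live page, the HTML string could instead come from `HttpClient.GetStringAsync`; selecting a fragment before conversion keeps menus and footers out of the resulting Markdown.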
Given the following Excel file:
URL | Folder Path |
---|---|
https://example.com/page1 | folder1/subfolder1 |
https://example.com/page2 | folder2/subfolder2 |
The application will create the following directory structure:

    output/
    ├── folder1/
    │   └── subfolder1/
    │       └── page1.md
    └── folder2/
        └── subfolder2/
            └── page2.md
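The mapping from a spreadsheet row to a file in this structure can be sketched as below. The helper name `BuildOutputPath` is hypothetical. Taking the file name from the last URL path segment is inferred from the example (page1 becomes page1.md), and the "index" fallback for URLs without a path segment is an assumption.

```csharp
using System;
using System.IO;

// Hypothetical helper: maps a URL and its folder path from the spreadsheet
// to a Markdown file path under the output directory.
static string BuildOutputPath(string outputRoot, string url, string folderPath)
{
    var uri = new Uri(url);

    // Last path segment without extension, e.g. "page1" for https://example.com/page1.
    string name = Path.GetFileNameWithoutExtension(uri.AbsolutePath.TrimEnd('/'));
    if (string.IsNullOrEmpty(name))
    {
        name = "index"; // assumed fallback for URLs such as https://example.com/
    }

    // Split on '/' so the spreadsheet value maps to native separators on any OS.
    string directory = outputRoot;
    foreach (string part in folderPath.Split('/', StringSplitOptions.RemoveEmptyEntries))
    {
        directory = Path.Combine(directory, part);
    }

    return Path.Combine(directory, name + ".md");
}

string path = BuildOutputPath("output", "https://example.com/page1", "folder1/subfolder1");
Console.WriteLine(path);
```

Before writing the file, the target directory would still need to exist; `Directory.CreateDirectory` creates any missing intermediate folders in one call.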
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License. See the LICENSE file for details.
Acknowledgements:
- ClosedXML for Excel file handling.
- HtmlAgilityPack for web scraping.
- ReverseMarkdown for HTML to Markdown conversion.