Web Scraper to Markdown Converter

This project is a C# console application that scrapes content from a list of URLs specified in an Excel file, converts the HTML content to Markdown, and saves the Markdown files to a specified directory structure. The application uses the ClosedXML library to read Excel files, HtmlAgilityPack for web scraping, and ReverseMarkdown for converting HTML to Markdown.

Features

Excel Integration: Reads URLs and corresponding folder paths from an Excel file.
Web Scraping: Scrapes the main content of web pages.
HTML to Markdown Conversion: Converts the scraped HTML content to Markdown format.
Parallel Processing: Uses Parallel.ForEachAsync to process multiple URLs concurrently.
File Organization: Saves Markdown files in a structured directory based on the folder paths specified in the Excel file.
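
The parallel-processing feature can be illustrated with a minimal sketch of a `Parallel.ForEachAsync` loop. The `ProcessUrlAsync` helper, the sample URLs, and the concurrency limit of 4 are assumptions made for illustration and are not taken from Program.cs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the real scrape-and-convert step.
static async Task<string> ProcessUrlAsync(string url, CancellationToken ct)
{
    await Task.Delay(50, ct); // simulate network latency
    return $"processed {url}";
}

var urls = new List<string>
{
    "https://example.com/page1",
    "https://example.com/page2",
};

// ConcurrentBag can be added to safely from several loop bodies at once.
var results = new ConcurrentBag<string>();

// Assumed cap on concurrency; the project's actual setting is not documented.
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

await Parallel.ForEachAsync(urls, options, async (url, ct) =>
{
    results.Add(await ProcessUrlAsync(url, ct));
});

Console.WriteLine($"Processed {results.Count} of {urls.Count} URLs");
```

Capping `MaxDegreeOfParallelism` is one way to avoid sending too many simultaneous requests to the same site.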

Prerequisites

.NET SDK (version 7.0 or later; the repository's build output is under bin/Debug/net7.0)
ClosedXML (for Excel file handling)
HtmlAgilityPack (for web scraping)
ReverseMarkdown (for HTML to Markdown conversion)

Installation

Clone the repository:

```
git clone https://github.com/yourusername/web-scraper-markdown-converter.git
cd web-scraper-markdown-converter
```

Install the required NuGet packages:

```
dotnet add package ClosedXML
dotnet add package HtmlAgilityPack
dotnet add package ReverseMarkdown
```

Build the project:
```
dotnet build
```

Usage

Prepare the Excel File:
- Create an Excel file named links.xlsx in the root directory.
- The first column should contain the URLs, and the second column should contain the folder paths where the Markdown files will be saved.
- Example:
  
  | URL | Folder Path |
  | --- | --- |
  | https://example.com/page1 | folder1/subfolder1 |
  | https://example.com/page2 | folder2/subfolder2 |

Run the Application:
```
dotnet run
```
Output:
- The application will create a directory named output in the root directory.
- Markdown files will be saved in the specified folder structure within the output directory.
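
The two-column layout described under "Prepare the Excel File" could be read with ClosedXML roughly as follows. This is a sketch under stated assumptions: the `ReadLinks` helper is hypothetical, it reads the first worksheet, and it treats row 1 as a header row (as in the example table) and skips it. The actual Program.cs may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using ClosedXML.Excel;

// Reads (URL, folder path) pairs from the first worksheet of a workbook.
// Assumption: row 1 is a header row and is skipped.
static List<(string Url, string Folder)> ReadLinks(string excelPath)
{
    var links = new List<(string Url, string Folder)>();

    using var workbook = new XLWorkbook(excelPath);
    var sheet = workbook.Worksheet(1); // ClosedXML worksheet positions start at 1

    foreach (var row in sheet.RowsUsed().Skip(1))
    {
        var url = row.Cell(1).GetString().Trim();    // column A: URL
        var folder = row.Cell(2).GetString().Trim(); // column B: folder path

        // Ignore rows that have no URL.
        if (url.Length > 0)
        {
            links.Add((url, folder));
        }
    }

    return links;
}

// Only runs when links.xlsx is present in the working directory.
if (File.Exists("links.xlsx"))
{
    foreach (var link in ReadLinks("links.xlsx"))
    {
        Console.WriteLine($"{link.Url} -> {link.Folder}");
    }
}
```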

Code Overview

Program.cs: The main entry point of the application.
- Reads URLs and folder paths from the Excel file.
- Scrapes the content from the URLs.
- Converts the HTML content to Markdown.
- Saves the Markdown files to the specified directory structure.
Excel File Handling: Uses ClosedXML to read the Excel file.
Web Scraping: Uses HtmlAgilityPack to scrape the main content of web pages.
HTML to Markdown Conversion: Uses ReverseMarkdown to convert HTML to Markdown.
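
The scraping and conversion steps listed above might be combined along these lines. This is a sketch, not the project's code: the `ExtractMainHtml` and `HtmlToMarkdown` helpers are hypothetical, and the choice of `<main>`, then `<article>`, then `<body>` as the "main content" is an assumption, since the README does not say which selector Program.cs uses. Downloading the page is left out so the example runs offline.

```csharp
using System;
using HtmlAgilityPack;

// Picks the node most likely to hold the page's main content.
// Assumption: prefer <main>, then <article>, then fall back to <body>.
static string ExtractMainHtml(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var node = doc.DocumentNode.SelectSingleNode("//main")
            ?? doc.DocumentNode.SelectSingleNode("//article")
            ?? doc.DocumentNode.SelectSingleNode("//body")
            ?? doc.DocumentNode;

    return node.InnerHtml;
}

// Converts an HTML fragment to Markdown with ReverseMarkdown's default settings.
static string HtmlToMarkdown(string html)
{
    var converter = new ReverseMarkdown.Converter();
    return converter.Convert(html).Trim();
}

var page = "<html><body><nav>Menu</nav>"
         + "<main><h1>Title</h1><p>Hello, world.</p></main>"
         + "</body></html>";

var markdown = HtmlToMarkdown(ExtractMainHtml(page));
Console.WriteLine(markdown);
```

Because only the selected node's inner HTML is converted, navigation and other page chrome outside it do not end up in the Markdown file.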

Example

Given the following Excel file:

| URL | Folder Path |
| --- | --- |
| https://example.com/page1 | folder1/subfolder1 |
| https://example.com/page2 | folder2/subfolder2 |

The application will create the following directory structure:

output/
├── folder1/
│   └── subfolder1/
│       └── page1.md
└── folder2/
    └── subfolder2/
        └── page2.md
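
The mapping from a spreadsheet row to a file in the tree above can be sketched as follows. The `BuildOutputPath` helper is hypothetical. Naming the file after the last URL path segment matches the page1.md and page2.md example but is an assumption about how the application derives file names, and the `index` fallback is purely illustrative.

```csharp
using System;
using System.IO;
using System.Linq;

// Builds the output path for one spreadsheet row.
// Assumption: the file name is the last URL path segment plus ".md".
static string BuildOutputPath(string outputRoot, string folderPath, string url)
{
    var lastSegment = new Uri(url).Segments.Last().Trim('/');
    var fileName = lastSegment.Length > 0 ? lastSegment : "index"; // hypothetical fallback

    // Split on '/' so "folder1/subfolder1" maps to native separators on any OS.
    var parts = new[] { outputRoot }
        .Concat(folderPath.Split('/', StringSplitOptions.RemoveEmptyEntries))
        .Append(fileName + ".md")
        .ToArray();

    return Path.Combine(parts);
}

var path = BuildOutputPath("output", "folder1/subfolder1", "https://example.com/page1");
Console.WriteLine(path);
```

Before writing the file, calling `Directory.CreateDirectory` on the parent folder of the returned path creates any missing intermediate directories.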

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

ClosedXML for Excel file handling.
HtmlAgilityPack for web scraping.
ReverseMarkdown for HTML to Markdown conversion.

Pritam-Ganguly/Webscraper
