CRMScraper.Library 1.1.84

There is a newer version of this package available.
See the version list below for details.
.NET CLI:

    dotnet add package CRMScraper.Library --version 1.1.84

Package Manager (intended for the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):

    NuGet\Install-Package CRMScraper.Library -Version 1.1.84

PackageReference (for projects that support it, copy this XML node into the project file):

    <PackageReference Include="CRMScraper.Library" Version="1.1.84" />

Paket CLI:

    paket add CRMScraper.Library --version 1.1.84

Script & Interactive (the #r directive works in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script source):

    #r "nuget: CRMScraper.Library, 1.1.84"

Cake:

    // Install CRMScraper.Library as a Cake Addin
    #addin nuget:?package=CRMScraper.Library&version=1.1.84

    // Install CRMScraper.Library as a Cake Tool
    #tool nuget:?package=CRMScraper.Library&version=1.1.84

CRM Scraper


Features

  • Static HTML Parsing: Scrape static websites using HtmlAgilityPack.
  • Dynamic Content Scraping: Use Playwright to scrape JavaScript-heavy websites.
  • Extensible API: Flexible and easily extendable for custom requirements.
  • Retry Mechanism: Built-in retry logic with exponential backoff.
  • Concurrent Scraping: Supports scraping multiple pages simultaneously.
  • Unit Tested: Extensive test coverage using xUnit.
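
The README does not show how the built-in retry logic is implemented. As a rough illustration of what "exponential backoff" means, here is a minimal sketch using only the .NET base class library. RetrySketch, BackoffDelay, and RetryAsync are hypothetical names chosen for this example and are not part of CRMScraper.Library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of CRMScraper.Library.
public static class RetrySketch
{
    // Delay before retry number `attempt` (0-based): baseDelay * 2^attempt.
    public static TimeSpan BackoffDelay(TimeSpan baseDelay, int attempt) =>
        TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * Math.Pow(2, attempt));

    // Runs `operation`, retrying up to `maxRetries` times after a failure.
    // Once the retries are used up, the last exception propagates to the caller.
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> operation,
        int maxRetries,
        TimeSpan baseDelay,
        CancellationToken cancellationToken = default)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxRetries && ex is not OperationCanceledException)
            {
                // Wait 1x, 2x, 4x, ... the base delay before the next try.
                await Task.Delay(BackoffDelay(baseDelay, attempt), cancellationToken);
            }
        }
    }
}
```

A caller could wrap any awaitable operation this way, for example RetrySketch.RetryAsync(() => scraperClient.ScrapePageAsync(url), 3, TimeSpan.FromSeconds(1)). Cancellation is deliberately not retried, so a cancelled scrape stops immediately.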

NuGet Package

You can install the CRMScraper.Library package via NuGet:

Platform    Version
.NET 8.0    1.1.58

Installation

To install the package via .NET CLI:

dotnet add package CRMScraper.Library --version 1.1.58

To install via the NuGet Package Manager:

Install-Package CRMScraper.Library -Version 1.1.58

Dependencies

  • HtmlAgilityPack (>= 1.11.65)
  • Microsoft.Playwright (>= 1.47.0)

Project Structure

.
├── .github                     # GitHub Actions for CI/CD workflows
├── .gitignore                   # Git ignore rules
├── README.md                    # Project documentation
├── samples                      # Sample applications for testing
│   └── ScraperConsoleApp        # Console application for manual testing
├── scraping_service_library_net.sln # Solution file
├── scripts                      # Scripts for building and publishing
│   ├── build_and_test.sh        # Script for building and running tests
│   └── publish_nuget.sh         # Script for packing and publishing NuGet packages
└── src
    ├── CRMScraper.Library       # Main library containing the scraping logic
    │   ├── Core                 # Core components for scraping logic
    │   └── CRMScraper.Library.csproj # Library project file
    └── CRMScraper.Tests         # Unit tests for the library

Getting Started

Prerequisites

  • .NET 8 SDK or later
  • Playwright, including its browser binaries (for dynamic content scraping)

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/scraping_service_library_net.git
    cd scraping_service_library_net
    
  2. Restore dependencies:

    dotnet restore
    
  3. Build the project:

    dotnet build --configuration Release
    
  4. Run the console application:

    cd samples/ScraperConsoleApp
    dotnet run
    

Running Tests

The project uses xUnit for unit tests and coverlet for code coverage. To run the tests and generate coverage reports:

dotnet test --configuration Release --collect:"XPlat Code Coverage" --results-directory TestResults/ --logger "trx;LogFileName=TestResults.trx"

CI/CD Pipeline

This project uses GitHub Actions for continuous integration and deployment. The pipeline automatically:

  • Builds the project
  • Runs unit tests with code coverage
  • Generates a NuGet package and uploads it as an artifact

See .github/workflows/dotnet-ci.yml for the pipeline configuration.

Creating a NuGet Package

To create a NuGet package, run the following command:

dotnet pack --configuration Release --output ./nupkgs

Usage

This section explains how to use the CRMScraper.Library for both static and dynamic content scraping.

1. Scraping Static Pages

Use the ScraperClient class to scrape static web pages and extract HTML content, JavaScript, and API requests.

Example: Scraping a Static Page
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        var httpClient = new HttpClient();
        var pageElementsExtractor = new PageElementsExtractor();  // Implement to extract JavaScript and API requests
        var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);

        var result = await scraperClient.ScrapePageAsync("https://example.com");

        Console.WriteLine($"URL: {result.Url}");
        Console.WriteLine($"HTML Content: {result.HtmlContent}");
        Console.WriteLine($"JavaScript Data: {string.Join(", ", result.JavaScriptData)}");
        Console.WriteLine($"API Requests: {string.Join(", ", result.ApiRequests)}");
    }
}

2. Scraping Dynamic Pages

For JavaScript-heavy websites, ScraperClient uses Playwright to fully render the page before scraping.

Example: Scraping a Dynamic Page
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        var httpClient = new HttpClient();
        var pageElementsExtractor = new PageElementsExtractor();  // Implement to extract JavaScript and API requests
        var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);

        var result = await scraperClient.ScrapeDynamicPageAsync("https://example.com");

        Console.WriteLine($"URL: {result.Url}");
        Console.WriteLine($"HTML Content: {result.HtmlContent}");
        Console.WriteLine($"API Requests: {string.Join(", ", result.ApiRequests)}");
    }
}

3. Concurrent Scraping

For large-scale scraping, use ScraperTaskExecutor to scrape multiple pages concurrently.

Example: Concurrent Scraping Task
using CRMScraper.Library;
using CRMScraper.Library.Core;
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        var httpClient = new HttpClient();
        var pageElementsExtractor = new PageElementsExtractor();
        var scraperClient = new ScraperClient(httpClient, pageElementsExtractor);
        var scraperTaskExecutor = new ScraperTaskExecutor(scraperClient);

        var scrapingTask = new ScrapingTask
        {
            TargetUrl = "https://example.com",
            MaxPages = 10,
            TimeLimit = TimeSpan.FromMinutes(1),
            MaxConcurrentPages = 3,
            UseDynamicScraping = true
        };

        var cancellationTokenSource = new CancellationTokenSource();
        var results = await scraperTaskExecutor.ExecuteScrapingTaskAsync(scrapingTask, cancellationTokenSource.Token);

        foreach (var result in results)
        {
            Console.WriteLine($"Scraped URL: {result.Url}");
            Console.WriteLine($"HTML Content: {result.HtmlContent}");
        }
    }
}
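
How ScraperTaskExecutor enforces MaxConcurrentPages is not documented here. One common way to cap the number of operations in flight is a SemaphoreSlim used as a gate, sketched below with only the .NET base class library. BoundedConcurrencySketch and RunAsync are hypothetical names for this illustration and do not describe the library's internals.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not the library's implementation.
public static class BoundedConcurrencySketch
{
    // Runs `work` for every item, with at most `maxConcurrent` running at once.
    // Results come back in the same order as the input items.
    public static async Task<TResult[]> RunAsync<TItem, TResult>(
        IEnumerable<TItem> items,
        int maxConcurrent,
        Func<TItem, CancellationToken, Task<TResult>> work,
        CancellationToken cancellationToken = default)
    {
        using var gate = new SemaphoreSlim(maxConcurrent);

        async Task<TResult> RunOneAsync(TItem item)
        {
            // Wait for a free slot, then release it even if the work fails.
            await gate.WaitAsync(cancellationToken);
            try
            {
                return await work(item, cancellationToken);
            }
            finally
            {
                gate.Release();
            }
        }

        // Start every item; the semaphore keeps the number in flight bounded.
        return await Task.WhenAll(items.Select(item => RunOneAsync(item)));
    }
}
```

With a list of URLs and maxConcurrent set to 3, no more than three callbacks run at the same moment, which mirrors the intent of MaxConcurrentPages = 3 in the example above.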

Core Classes

  • ScraperClient: Core logic for static and dynamic page scraping.
  • ScraperTaskExecutor: Manages concurrent scraping tasks and retries.
  • ScrapedPageResult: Represents the result of a scraping operation.
  • ScrapingTask: Defines a scraping task with limits on pages and time.
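
The concurrent example passes a token from a CancellationTokenSource that is never cancelled, so any stop relies on the task's own TimeLimit. If a caller also wants a hard stop of its own, a time-limited token is one option. The sketch below uses only the .NET base class library; TimeLimitSketch and RunWithLimitAsync are hypothetical names and not part of CRMScraper.Library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of CRMScraper.Library.
public static class TimeLimitSketch
{
    // Returns true if `work` finished, false if it was cancelled, either by
    // the caller's token or because the time limit elapsed.
    public static async Task<bool> RunWithLimitAsync(
        Func<CancellationToken, Task> work,
        TimeSpan limit,
        CancellationToken callerToken = default)
    {
        // Link to the caller's token so either source can cancel the work.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(limit);
        try
        {
            await work(cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

The token handed to `work` could be passed straight to ExecuteScrapingTaskAsync in place of cancellationTokenSource.Token. This only helps if the callee observes the token, which the README does not confirm.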

Contributing

Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Product    Compatible and additional computed target framework versions
.NET       net8.0 is compatible. net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, and net8.0-windows were computed.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version    Downloads    Last updated
1.1.95     130          9/18/2024
1.1.92     94           9/18/2024
1.1.89     98           9/18/2024
1.1.84     89           9/18/2024
1.1.79     101          9/18/2024
1.1.65     84           9/17/2024
1.1.58     85           9/17/2024
1.1.52     87           9/17/2024