ProSol.WebScrap 2.0.2
An HTML parser for extracting text from web pages using CSS selectors.
Purpose
The purpose of this library is to extract the essential data from a web page and return it to the user in JSON format.
The result can then be used for:
- Analyzing the essential data, for example in charts, diagrams, or plain tables.
- Tracking the history of the essential data, such as sale prices, currency rates, or user activity.
- Searching for specific data across multiple HTML resources, such as a movie title, a product name, or any other mention.
Usage
Let's make a console demo and install the package:
dotnet new console -n WebScrap.Demo.CLI
cd WebScrap.Demo.CLI
dotnet add package ProSol.WebScrap --version 2.0.2
And try the following code:
using ProSol.WebScrap;
var url = "https://en.wikipedia.org/wiki/Food_energy";
// Download the html:
using var client = new HttpClient();
using var response = await client.GetAsync(url);
// Fail early on a non-success HTTP status:
response.EnsureSuccessStatusCode();
var html = await response.Content.ReadAsStringAsync();
// Run the WebScrapper:
var css = "#firstHeading";
var result = WebScrapper
    .Run(html, css)
    .ToJsonString();
// Get the results:
Console.WriteLine(result);
// OUTPUT:
// [{"key":"#firstHeading","values":[{"value":"Food energy"}]}]
Console.Read();
Known Issues
The project is currently under active development. It has some known issues, some of them obvious, that are not a priority right now.
CSS
- Multiple comma-separated CSS selectors in one string are not supported.
- Attribute-based CSS selectors are not supported.
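Until comma-separated selectors are supported, one possible workaround is to split the selector list on the caller's side and scrape once per selector. The sketch below is only an illustration: `SelectorList`, `Split`, and `RunEach` are hypothetical names, not part of ProSol.WebScrap, and the split is naive (it assumes no commas appear inside a single selector).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ProSol.WebScrap.
public static class SelectorList
{
    // Naive split on commas; assumes no selector contains a comma itself,
    // e.g. ":is(a, b)" would be split incorrectly.
    public static IReadOnlyList<string> Split(string css) =>
        css.Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

    // Calls `scrape(html, selector)` once per selector, keeping selector order.
    public static IReadOnlyList<string> RunEach(
        string html, string css, Func<string, string, string> scrape) =>
        Split(css).Select(selector => scrape(html, selector)).ToList();
}
```

With the package installed, the delegate could be `(h, c) => WebScrapper.Run(h, c).ToJsonString()`, mirroring the call in the Usage section and assuming `ToJsonString()` returns a string. Each selector then yields its own JSON document rather than one merged result.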
HTML
- The object model returns tags in reverse order.
- Non-Unicode text is not converted.
Goals
The goal of this project is to extract text from HTML in a performant way.
Extract text
- Plain text: This tool must extract plain text from HTML.
- User-defined result structure: The amount of text and its structure are defined by the user via multiple CSS selectors.
Performance
- Parallel processing: All CSS selectors should process the HTML in parallel.
- Stream-based processing: Parts of the HTML that have already been processed should be released from memory.
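As a caller-level approximation of the parallel-processing goal, several selectors can be fanned out over the same HTML string with PLINQ. This is a sketch under assumptions and says nothing about how the library itself is or will be implemented: `ParallelSelectors` is a hypothetical name, and the `scrape` delegate is assumed to be safe to call concurrently. It does not address the stream-based goal, since the whole HTML string stays in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ProSol.WebScrap.
public static class ParallelSelectors
{
    // Runs `scrape(html, selector)` for every selector in parallel.
    // AsOrdered() keeps the results in the same order as the selectors.
    public static IReadOnlyList<string> Run(
        string html, IEnumerable<string> selectors, Func<string, string, string> scrape) =>
        selectors
            .AsParallel()
            .AsOrdered()
            .Select(selector => scrape(html, selector))
            .ToList();
}
```

For a handful of selectors on a small page the parallel overhead may outweigh the gain, so measuring before adopting this is advisable.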
Footnote
- Versioning complies with SemVer 2.0.0. Please refer to semver.org for details.
- Please refer to the Changelog for progress.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 was computed. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. | 
        
Dependencies
net8.0
- ProSol.Html.TagsProvider (>= 2.0.0)
- ProSol.Messaging (>= 4.0.0)
 