PDFParser-CSharp 1.2.2

.NET CLI:

    dotnet add package PDFParser-CSharp --version 1.2.2

Package Manager (run this in the Visual Studio Package Manager Console, since it uses the NuGet module's version of Install-Package):

    NuGet\Install-Package PDFParser-CSharp -Version 1.2.2

PackageReference (for projects that support it, copy this XML node into the project file):

    <PackageReference Include="PDFParser-CSharp" Version="1.2.2" />

Paket CLI:

    paket add PDFParser-CSharp --version 1.2.2

Script & Interactive (the #r directive works in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script's source code):

    #r "nuget: PDFParser-CSharp, 1.2.2"

Cake:

    // Install PDFParser-CSharp as a Cake Addin
    #addin nuget:?package=PDFParser-CSharp&version=1.2.2

    // Install PDFParser-CSharp as a Cake Tool
    #tool nuget:?package=PDFParser-CSharp&version=1.2.2

PDFIndexer

A simple way to extract text from PDFs, including metadata.

With a single line you can add a batch of PDFs, and with another single line you can search for exactly where a word appears in the text (and more).

Architecture

To install using the NuGet Package Manager

Install-Package PDFParser-CSharp -Version 1.2.2

How to use

General use:

    // Add a batch of PDFs
    ProcessPDF.AddPDFs(new List<string>() { path });

    // Search across them
    var result = ProcessPDF.GetVisualResults("{your search word}");

To extract text and metadata directly:

    string path = "path with my pdf";
    TextExtractor te = new TextExtractor();
    var list = te.ExtractLinesMetadata(path);

Methods

ExtractFullText → Extracts the full text as a single string

ExtractWordsMetadata → Extracts every word with its metadata (text, Point X, Point Y, Width and Height)

ExtractLinesMetadata → Extracts every line with its metadata (text, Point X, Point Y, Width and Height)

GetIndexMetadata → Returns an object holding the full text plus the text and position of every line and word, which is useful for building an hOCR page or another XML-based page format.

In all cases, you can pass the PDF document as either a string or a stream.

Main Methods

  • AddPDFs → receives a list of strings (PDF paths) to process and save
  • GetVisualResults → receives a search string and searches the metadata database. The result should be a list of SampleObject with the word, its position, and other metadata for each word found:
    {
        HighlightObject = {
            IndexMetadata Metadata
            List<BoundingBox> HighlightedWords
            string Keyword
            int PageNumber
        },
        Metadata = {
            string Text
            List<PdfMetadata> ListOfLines
            List<PdfMetadata> ListOfWords
            string PDFURI
        },
        ImageUri = "https://{uri_image_path}"
    };
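
To illustrate the kind of lookup described above, here is a minimal, self-contained sketch of a keyword search over word metadata, grouped by page. It does not call the package: the tuple shape (Text, X, Y, Width, Height, PageNumber) is a hypothetical stand-in modeled on the documented fields, and the case-insensitive exact match is an assumption, not necessarily how GetVisualResults matches words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the documented word metadata.
// These tuples are NOT the package's real types.
var words = new List<(string Text, double X, double Y, double Width, double Height, int PageNumber)>
{
    ("some", 150.233, 88.45, 12.2, 11.82, 1),
    ("text", 170.233, 88.45, 12.2, 11.82, 1),
    ("Text", 150.233, 102.10, 12.2, 11.82, 2),
};

// Assumption: a case-insensitive exact match on the word text.
var keyword = "text";
var hits = words
    .Where(w => string.Equals(w.Text, keyword, StringComparison.OrdinalIgnoreCase))
    .ToList();

// Group the matching boxes by page, so each page gets its own set of highlights.
var hitsByPage = hits
    .GroupBy(w => w.PageNumber)
    .ToDictionary(g => g.Key, g => g.ToList());

foreach (var (page, boxes) in hitsByPage.OrderBy(p => p.Key))
{
    foreach (var b in boxes)
        Console.WriteLine($"page {page}: \"{b.Text}\" at ({b.X}, {b.Y}), {b.Width} x {b.Height}");
}
```

Each matching box can then be drawn over the page image to highlight the keyword.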
    

Expected Results

ExtractFullText

"some text of entire page (or pages)"

ExtractWordsMetadata

[
    {
        Text = "some"
        X = 150.233
        Y = 88.45
        Width = 12.2
        Height = 11.82
        PageInfo =  {
                        PageNumber = 1,
                        BlobkId = 0
                    }
    },

    {
        Text = "text"
        X = 170.233
        Y = 88.45
        Width = 12.2
        Height = 11.82
        PageInfo =  {
                        PageNumber = 1,
                        BlobkId = 1
                    }
    }
]
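
If you need a single box that spans several words (for example, to highlight a phrase), the word boxes can be merged. The sketch below is self-contained and does not use the package's types: the tuples are hypothetical stand-ins for the fields shown above, and it assumes that (X, Y) is the corner of the box with the smallest coordinates.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for two word entries like the ones shown above.
var words = new List<(string Text, double X, double Y, double Width, double Height)>
{
    ("some", 150.233, 88.45, 12.2, 11.82),
    ("text", 170.233, 88.45, 12.2, 11.82),
};

// Assumption: (X, Y) is the minimum corner, so the union is min/max over the edges.
double left = words.Min(w => w.X);
double top = words.Min(w => w.Y);
double right = words.Max(w => w.X + w.Width);
double bottom = words.Max(w => w.Y + w.Height);

// One box covering every word, with the words joined by spaces.
var line = (Text: string.Join(" ", words.Select(w => w.Text)),
            X: left, Y: top, Width: right - left, Height: bottom - top);

Console.WriteLine($"{line.Text}: X={line.X}, Y={line.Y}, Width={line.Width:F2}, Height={line.Height:F2}");
```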

ExtractLinesMetadata

{
    Text = "some text of entire line"
    X = 150.233
    Y = 88.45
    Width = 12.2
    Height = 11.82
    PageInfo =  {
                    PageNumber = 1,
                    BlobkId = 0
                }
}

GetIndexMetadata

{
    Text = "some text of entire line"
    ListOfLines = 
    [ 
        {
            Text = "some text of entire line"
            X = 150.233
            Y = 88.45
            Width = 12.2
            Height = 11.82
            PageInfo =  {
                            PageNumber = 2,
                            BlobkId = 12
                        }
        },
        ... 
    ]
    ListOfWords = 
    [
        {
            Text = "some"
            X = 150.233
            Y = 88.45
            Width = 12.2
            Height = 11.82
            PageInfo =  {
                            PageNumber = 1,
                            BlobkId = 0
                        }
        },

        {
            Text = "text"
            X = 170.233
            Y = 88.45
            Width = 12.2
            Height = 11.82
            PageInfo =  {
                            PageNumber = 1,
                            BlobkId = 1
                        }
        }
    ]
}

Compatible and additional computed target framework versions

.NET (all computed): net5.0, net5.0-windows, net6.0, net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.

.NET Core: netcoreapp2.0 is compatible. netcoreapp2.1, netcoreapp2.2, netcoreapp3.0, and netcoreapp3.1 were computed.

Version  Downloads  Last updated
1.2.2    6,006      6/8/2019
1.2.1    722        4/2/2019