GroupDocs.Parser 20.1.0

GroupDocs.Parser for .NET is a parsing class library that extracts data from documents in a variety of formats. The data extraction API can extract either quick raw text or higher-quality formatted text from PDF, DOC, DOCX, PPT, PPTX, XLS and XLSX files. The library also lets you create descriptive document templates specific to your business workflow, apply them to documents, and extract the required data.

Features:

 * Extract both raw and formatted text from supported file formats with a few lines of code;
 * Extract metadata from supported file formats with a few lines of code;
 * Extract attachments from container formats (PDF, Email), including each attachment's name, path, media type and content;
 * Support encrypted document formats;
 * Extract structured text;
 * Extract images;
 * Text Analysis API;
 * Extract PDF form data;
 * Tools for encoding detection;
 * Tools for media type detection;
 * Document data parsing API by template;
 * Zip archives support;

Supported document formats:

 * Microsoft Word documents - DOC, DOT, DOCX, DOCM, DOTX, DOTM, TXT, RTF;
 * Microsoft Excel spreadsheets - XLS, XLT, XLSX, XLSM, XLSB, XLTX, XLTM, CSV, XLA, XLAM, XML;
 * Microsoft PowerPoint presentations - PPT, PPS, POT, PPTX, PPTM, POTX, POTM, PPSX, PPSM;
 * Microsoft OneNote - ONE;
 * Open Document formats - ODP, ODS, ODT, OTT;
 * Portable Document Formats - PDF;
 * Email - PST, OST, EML, EMLX, MSG;
 * Ebook - EPUB, FB2, CHM;
 * Archive - ZIP;
 * Markup - HTML, XHTML, MHTML, MD, XML;

For more details on the GroupDocs.Parser for .NET API, visit the GroupDocs website at:
https://www.groupdocs.com/products/parser/net

Note: GroupDocs.Parser for .NET will run in evaluation mode. To test the full feature set of the product, request a free 30-day temporary license.

There is a newer version of this package available. See the version history below for details.

Installation:

 * Package Manager: Install-Package GroupDocs.Parser -Version 20.1.0
 * .NET CLI: dotnet add package GroupDocs.Parser --version 20.1.0
 * PackageReference (for projects that support it, copy this XML node into the project file): <PackageReference Include="GroupDocs.Parser" Version="20.1.0" />
 * Paket CLI: paket add GroupDocs.Parser --version 20.1.0

Document Parser .NET API

This on-premise text parser API searches and extracts both formatted and raw text from documents in a variety of supported file formats.

Document Parser Processing Features

  • Parse documents by user-defined templates.
  • Extract plain and structured text.
  • Extract text areas with coordinates, text styles and other information.
  • Search text by a keyword or regular expression; extract text around that word.
  • Extract HTML or Markdown (MD) formatted text for a fast preview.
  • Increase performance by extracting raw text.
  • Extract formatted text, metadata, images, containers, and attachments.
  • Extract table of contents for some supported document formats.
  • Parse form data from PDF documents.
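
As a rough illustration of the keyword search feature listed above, the sketch below looks for a word in a document and prints each hit with its position. The file name sample.pdf and the keyword are placeholders, and the call shapes used here (Parser.Search, SearchResult.Position, SearchResult.Text) are assumed to match the 20.1 API; check the API Reference before relying on them.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

class SearchSketch
{
    static void Main()
    {
        // "sample.pdf" is a placeholder path, not a file shipped with the package
        using (Parser parser = new Parser("sample.pdf"))
        {
            // assumption: Search returns null when the format does not support text search
            IEnumerable<SearchResult> results = parser.Search("invoice");
            if (results == null)
            {
                Console.WriteLine("Search isn't supported.");
                return;
            }
            // print the position and matched text of every occurrence
            foreach (SearchResult result in results)
            {
                Console.WriteLine("At {0}: {1}", result.Position, result.Text);
            }
        }
    }
}
```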

New Features in Version 20.1.0

  • Extract text by TOC item.
  • Extract TOC from:
    • Word Processing documents
    • PDF documents

Breaking Changes in Version 20.1.0

  • The legacy API has been removed (all types in the GroupDocs.Parser.Legacy namespace).

For detailed notes, see the GroupDocs.Parser for .NET 20.1 Release Notes.

Supported Document Formats

Word Processing: DOC, DOT, DOCX, DOCM, DOTX, DOTM, ODT, OTT, RTF
Spreadsheet: XLS, XLT, XLSX, XLSM, XLSB, XLTX, XLTM, ODS, OTS, XLA, XLAM, NUMBERS
Presentation: PPT, PPS, POT, PPTX, PPTM, POTX, POTM, PPSX, PPSM, ODP, OTP
Email: EML, EMLX, MSG
Portable: PDF
Archive: ZIP

Extract Containers and Attachments

Email: PST, OST, EML, EMLX, MSG
Portable: PDF
Archive: ZIP

Parse Form Data

Portable: PDF

Extract Table of Contents

Word Processing: DOC, DOT, DOCX, DOCM, DOTX, DOTM, ODT, OTT, RTF
Portable: PDF
eBook: CHM, EPUB
Databases: Databases are supported via ADO.NET. To work with a given database format, install its database provider.

Platform Independence

GroupDocs.Parser for .NET does not require any external software or third-party tools to be installed. It supports any 32-bit or 64-bit operating system where the .NET or Mono framework is installed. Further details are as follows:

Microsoft Windows: Microsoft Windows Desktop (x86, x64) (XP & up), Microsoft Windows Server (x86, x64) (2000 & up), Windows Azure
Mac OS: Mac OS X
Linux: Linux (Ubuntu, OpenSUSE, CentOS and others)
Development Environments: Microsoft Visual Studio (2010 & up), Xamarin.Android, Xamarin.iOS, Xamarin.Mac, MonoDevelop 2.4 and later.
Supported Frameworks: GroupDocs.Parser for .NET supports the .NET and Mono frameworks.

Getting Started with GroupDocs.Parser for .NET

Are you ready to give GroupDocs.Parser for .NET a try? Run Install-Package GroupDocs.Parser from the Package Manager Console in Visual Studio to fetch and reference the GroupDocs.Parser assembly in your project. If you already have GroupDocs.Parser for .NET and want to upgrade it, run Update-Package GroupDocs.Parser to get the latest version.
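
Once the package is referenced, a first program might look like the minimal sketch below, which prints the plain text of a document. The path sample.docx is a placeholder, and the behaviour assumed here (GetText returning null for formats without text extraction) should be confirmed against the API Reference.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;

class TextSketch
{
    static void Main()
    {
        // "sample.docx" is a placeholder path; substitute a real document
        using (Parser parser = new Parser("sample.docx"))
        // assumption: GetText returns null when text extraction isn't supported
        using (TextReader reader = parser.GetText())
        {
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported."
                : reader.ReadToEnd());
        }
    }
}
```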

Please check the GitHub Repository for other common usage scenarios.

Use C# Code to Extract Data from Database

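// namespaces used by this snippet
using System;
using System.Collections.Generic;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

string connectionString = string.Format("Provider=System.Data.Sqlite;Data Source={0};Version=3;", "database.db");
// create an instance of the Parser class to extract tables from the database;
// the connection string is passed in place of a file path, and LoadOptions selects the Database file format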
using (Parser parser = new Parser(connectionString, new LoadOptions(FileFormat.Database)))
{
    // check if text extraction is supported
    if (!parser.Features.Text)
    {
        Console.WriteLine("Text extraction isn't supported.");
        return;
    }
    // check if toc extraction is supported
    if (!parser.Features.Toc)
    {
        Console.WriteLine("Toc extraction isn't supported.");
        return;
    }
    // get a list of tables
    IEnumerable<TocItem> toc = parser.GetToc();
    // iterate over tables
    foreach (TocItem i in toc)
    {
        // print the table name
        Console.WriteLine(i.Text);
        // extract a table content as a text
        using (TextReader reader = parser.GetText(i.PageIndex.Value))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Extract all Images and Save them in PNG Format via C# Code

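// namespaces used by this snippet
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

// create an instance of the Parser class
// (Constants.SampleZip is a sample file path that is not defined in this snippet; substitute your own path)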
using (Parser parser = new Parser(Constants.SampleZip))
{
    // extract images from document
    IEnumerable<PageImageArea> images = parser.GetImages();
    // check if images extraction is supported
    if (images == null)
    {
        Console.WriteLine("Page images extraction isn't supported");
        return;
    }
    // create the options to save images in PNG format
    ImageOptions options = new ImageOptions(ImageFormat.Png);
    int imageNumber = 0;
    // iterate over images
    foreach (PageImageArea image in images)
    {
        // save the image to the png file
        image.Save(imageNumber.ToString() + ".png", options);
        imageNumber++;
    }
}


Release Notes

https://docs.groupdocs.com/display/parsernet/GroupDocs.Parser+for+.NET+20.1+Release+Notes

NuGet packages (2)

Showing the top 2 NuGet packages that depend on GroupDocs.Parser:

Package Downloads
GroupDocs.Total
GroupDocs.Total for .NET is a compilation of every .NET API offered by GroupDocs. We compile it on a daily basis to ensure that it contains the most up-to-date versions of each of our .NET document manipulation APIs. With GroupDocs.Total for .NET, developers can use all our APIs with a single license. However, you can order any individual API as well. The APIs we offer include: GroupDocs.Viewer, GroupDocs.Annotation, GroupDocs.Conversion, GroupDocs.Comparison, GroupDocs.Signature, GroupDocs.Assembly, GroupDocs.Metadata, GroupDocs.Search, GroupDocs.Parser, GroupDocs.Watermark, GroupDocs.Editor, GroupDocs.Merger, GroupDocs.Redaction, GroupDocs.Classification. GroupDocs.Total for .NET documentation: https://docs.groupdocs.com/display/gdtotalproductfamily/GroupDocs.Total+for+.NET Free support for GroupDocs.Total for .NET is provided on our support forum: https://forum.groupdocs.com/
Conholdate.Total
Conholdate.Total for .NET is a complete package to work with a large number of file formats from Microsoft Word, Excel, PowerPoint, Outlook, Project, Visio, Adobe Acrobat, Illustrator, Photoshop, AutoCAD, OpenOffice and many more. Conholdate.Total for .NET allows you to use any API released under Aspose and GroupDocs for .NET in order to create, convert, read, edit, update and print popular document formats. Moreover, you may view, annotate, watermark, assemble, classify, search, redact, parse, merge and compare documents without needing to install the native applications. Conholdate.Total for .NET also includes specialized APIs to read and create barcodes, extract text from images using OCR, and extract human-marked data from questionnaires, surveys, quizzes, MCQ papers and feedback forms.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version History

Version Downloads Last updated
20.10.0 445 10/27/2020
20.8.0 475 8/19/2020
20.6.1 282 6/30/2020
20.6.0 179 6/19/2020
20.5.0 425 5/8/2020
20.3.0 415 3/19/2020
20.1.0 339 1/31/2020
19.12.0 297 12/27/2019
19.11.0 262 11/22/2019
19.9.0 288 9/27/2019
19.5.0 435 5/29/2019
18.12.0 464 12/11/2018
18.11.0 330 11/8/2018
18.10.0 353 10/10/2018
18.9.0 295 9/5/2018
18.8.0 397 8/7/2018
18.7.0 430 7/3/2018
18.5.0 489 5/23/2018