GroupDocs.Parser 24.10.0

.NET CLI:
dotnet add package GroupDocs.Parser --version 24.10.0

Package Manager:
NuGet\Install-Package GroupDocs.Parser -Version 24.10.0
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.

PackageReference:
<PackageReference Include="GroupDocs.Parser" Version="24.10.0" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.

Paket CLI:
paket add GroupDocs.Parser --version 24.10.0

Script & Interactive:
#r "nuget: GroupDocs.Parser, 24.10.0"
The #r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.

Cake:
// Install GroupDocs.Parser as a Cake Addin
#addin nuget:?package=GroupDocs.Parser&version=24.10.0

// Install GroupDocs.Parser as a Cake Tool
#tool nuget:?package=GroupDocs.Parser&version=24.10.0

Advanced Document Parsing API for .NET



Product Page Docs API Ref Examples Blog Releases Support Temp License


Important Note: Starting from 24.2.0, the GroupDocs.Parser package has been split into two distinct platform packages: .NET Standard and .NET Framework. The GroupDocs.Parser package is specifically designed to support the .NET Standard platform, making it compatible with .NET Core, .NET 5, .NET 6, etc. It includes backward compatibility improvements, allowing it to function with .NET Framework versions starting from 4.6.2. In addition, we have introduced the GroupDocs.Parser.NETFramework package, which is optimized to run seamlessly in the .NET Framework runtime because it includes all the GroupDocs product libraries in their respective .NET Framework versions. It is tailored specifically for .NET Framework users and offers better dependency resolution for those utilizing the .NET Framework. We hope these changes will enhance your experience and provide a more streamlined approach to using the GroupDocs.Parser package. If you have any further questions or concerns, please don't hesitate to reach out to our free support forum.

GroupDocs.Parser for .NET is a powerful API designed for advanced document parsing, offering extensive features like text extraction, metadata retrieval, and image extraction across various document formats, including PDFs, Word, Excel, and PowerPoint. This robust API supports .NET Standard and .NET Framework, making it compatible with .NET Core, .NET 5, and .NET 6, while also providing backward compatibility with older .NET Framework versions. With specialized parsing capabilities for PDF documents, email parsing, and template-based data extraction, GroupDocs.Parser ensures high-performance, secure parsing and scalability, suitable for cross-platform environments including Windows, Linux, and macOS. It's the ideal solution for developers needing to integrate efficient document processing into their .NET applications.

Text Extraction

Document Text Extraction

Extract text from PDF, Word, Excel, and more.

Retain Text Formatting

Extract text with font styles, sizes, and colors.
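As a minimal sketch of formatted extraction, the snippet below assumes the GetFormattedText method with FormattedTextOptions in HTML mode. The file name sample.docx is a placeholder, and which formatting attributes survive is likely to depend on the document format and the chosen mode.

```csharp
using System;
using System.IO;
using GroupDocs.Parser;
using GroupDocs.Parser.Options;

public class ExtractFormattedText
{
    public static void Run()
    {
        // "sample.docx" is a placeholder path, not a file shipped with the package
        using (Parser parser = new Parser("sample.docx"))
        // Request HTML output so basic formatting is kept as tags;
        // the reader is null when formatted extraction isn't supported
        using (TextReader reader = parser.GetFormattedText(
            new FormattedTextOptions(FormattedTextMode.Html)))
        {
            Console.WriteLine(reader == null
                ? "Formatted text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}
```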

Text Search and Extraction

Search for and extract specific text.
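A keyword search could look roughly like the following sketch, which assumes the Search method and the SearchResult type from GroupDocs.Parser.Data. The file name and the keyword "invoice" are placeholders chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class SearchKeyword
{
    public static void Run()
    {
        // "sample.pdf" and the keyword are placeholders
        using (Parser parser = new Parser("sample.pdf"))
        {
            // Search returns null when search isn't supported for the format
            IEnumerable<SearchResult> results = parser.Search("invoice");
            if (results == null)
            {
                Console.WriteLine("Search isn't supported for this document.");
                return;
            }

            // Print the character position and the matched text of each hit
            foreach (SearchResult result in results)
            {
                Console.WriteLine($"At {result.Position}: {result.Text}");
            }
        }
    }
}
```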

OCR Text Extraction

Extract text from images using OCR.

Metadata Extraction

Document Metadata Extraction

Extract properties like author, title, and subject.

Date Property Extraction

Extract creation and modification dates.

Field-Specific Data Extraction

Extract custom fields like invoice numbers.

Image and Attachment Extraction

Extract Embedded Images

Extract images within documents.

Extract File Attachments

Extract attachments from PDF and email files.
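The sketch below lists the attachments of an email, assuming the GetContainer method and the ContainerItem type from GroupDocs.Parser.Data. The file name sample.msg is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ListAttachments
{
    public static void Run()
    {
        // "sample.msg" is a placeholder path
        using (Parser parser = new Parser("sample.msg"))
        {
            // GetContainer returns null when the format has no container support
            IEnumerable<ContainerItem> attachments = parser.GetContainer();
            if (attachments == null)
            {
                Console.WriteLine("Attachment extraction isn't supported for this document.");
                return;
            }

            // Print the path of each embedded item
            foreach (ContainerItem item in attachments)
            {
                Console.WriteLine(item.FilePath);
            }
        }
    }
}
```

Each ContainerItem can in turn be opened with its own parser, which is one way to extract text from the attachments themselves.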

Barcode Extraction

Extract and recognize barcodes from documents.
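Barcode scanning might be sketched as follows, assuming the GetBarcodes method and the PageBarcodeArea type from GroupDocs.Parser.Data. The file name is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ScanBarcodes
{
    public static void Run()
    {
        // "sample.pdf" is a placeholder path
        using (Parser parser = new Parser("sample.pdf"))
        {
            // GetBarcodes returns null when barcode scanning isn't supported
            IEnumerable<PageBarcodeArea> barcodes = parser.GetBarcodes();
            if (barcodes == null)
            {
                Console.WriteLine("Barcode scanning isn't supported for this document.");
                return;
            }

            // Print the page index and decoded value of each barcode
            foreach (PageBarcodeArea barcode in barcodes)
            {
                Console.WriteLine($"Page {barcode.Page.Index}: {barcode.Value}");
            }
        }
    }
}
```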

Document Structure Analysis

Structured Document Analysis

Analyze and extract tables, lists, and paragraphs.

Table Extraction

Extract tables and their content.

Hyperlink Extraction

Extract hyperlinks from documents.
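A minimal sketch of hyperlink extraction, assuming the GetHyperlinks method and the PageHyperlinkArea type from GroupDocs.Parser.Data. The file name is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractHyperlinks
{
    public static void Run()
    {
        // "sample.pdf" is a placeholder path
        using (Parser parser = new Parser("sample.pdf"))
        {
            // GetHyperlinks returns null when the format isn't supported
            IEnumerable<PageHyperlinkArea> hyperlinks = parser.GetHyperlinks();
            if (hyperlinks == null)
            {
                Console.WriteLine("Hyperlink extraction isn't supported for this document.");
                return;
            }

            // Print the link text and its target URL
            foreach (PageHyperlinkArea link in hyperlinks)
            {
                Console.WriteLine($"{link.Text} -> {link.Url}");
            }
        }
    }
}
```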

Bookmark Extraction

Extract bookmarks from PDFs.
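The following sketch assumes that PDF bookmarks are surfaced through the table-of-contents API (the GetToc method and the TocItem type). That mapping is an assumption made for illustration, and the file name is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractTableOfContents
{
    public static void Run()
    {
        // "sample.pdf" is a placeholder path
        using (Parser parser = new Parser("sample.pdf"))
        {
            // GetToc returns null when table-of-contents extraction isn't supported
            IEnumerable<TocItem> tocItems = parser.GetToc();
            if (tocItems == null)
            {
                Console.WriteLine("Table of contents extraction isn't supported for this document.");
                return;
            }

            // Indent each entry by its nesting depth
            foreach (TocItem item in tocItems)
            {
                Console.WriteLine(new string(' ', item.Depth * 2) + item.Text);
            }
        }
    }
}
```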

PDF-Specific Parsing

PDF Parsing

Extract text, images, and metadata from PDFs.

Extract PDF Page Count

Extract page count and PDF-specific properties.

PDF Bookmark Management

Extract and manage bookmarks in PDFs.

Email Parsing

Email Content Extraction

Extract text, attachments, and metadata from emails.

Email Property Extraction

Extract sender, receiver, subject, and body content.

Spreadsheet Parsing

Excel Data Extraction

Extract text, metadata, and data from Excel files.

Specific Range Extraction

Extract specific cells, ranges, or sheets from Excel.

Presentation Parsing

PowerPoint Extraction

Extract text, images, and metadata from presentations.

Slide-Specific Extraction

Extract content from slides, including notes and shapes.

Template-Based Data Extraction

Template Data Extraction

Use templates for structured data extraction.

Template Editor

Create and edit templates for data extraction.

Custom Parsing Rules

Define custom content extraction rules.

Advanced Features

Multi-Format Support

Support for PDF, DOCX, XLSX, PPTX, and more.

Cross-Platform Compatibility

Works on Windows, Linux, and macOS.

.NET Integration

Integrate with .NET applications.

High Performance

Efficient handling of large documents.

Secure Parsing

Maintain document security and integrity.

Scalable Batch Processing

Handle large document volumes.

Additional Features

Page Count Retrieval

Retrieve the number of pages in a document.
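Page count retrieval can be sketched as below, assuming the GetDocumentInfo method and its PageCount, FileType, and Size properties. The file name is a placeholder.

```csharp
using System;
using GroupDocs.Parser;

public class PrintDocumentInfo
{
    public static void Run()
    {
        // "sample.docx" is a placeholder path
        using (Parser parser = new Parser("sample.docx"))
        {
            // Basic document information: detected type, page count, and size in bytes
            var info = parser.GetDocumentInfo();
            Console.WriteLine($"File type: {info.FileType}");
            Console.WriteLine($"Pages: {info.PageCount}");
            Console.WriteLine($"Size: {info.Size} bytes");
        }
    }
}
```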

Form Data Extraction

Extract data from forms and interactive elements.
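As a rough sketch of form parsing, the snippet below assumes the ParseForm method and the DocumentData, FieldData, and PageTextArea types from GroupDocs.Parser.Data. The file name is a placeholder.

```csharp
using System;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ParseFormFields
{
    public static void Run()
    {
        // "form.pdf" is a placeholder path
        using (Parser parser = new Parser("form.pdf"))
        {
            // ParseForm returns null when form parsing isn't supported
            DocumentData data = parser.ParseForm();
            if (data == null)
            {
                Console.WriteLine("Form parsing isn't supported for this document.");
                return;
            }

            // Print each field name with its text value, if the field holds text
            for (int i = 0; i < data.Count; i++)
            {
                PageTextArea area = data[i].PageArea as PageTextArea;
                Console.WriteLine($"{data[i].Name}: {(area == null ? "(no text value)" : area.Text)}");
            }
        }
    }
}
```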

Content-Aware Parsing

Detect and extract specific data types.

Supported Document Formats

Word Processing

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
DOC - Microsoft Word Document ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
DOT - Microsoft Word Document Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
DOCX - Office Open XML Document ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
DOCM - Office Open XML Macro-Enabled Document ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
DOTX - Office Open XML Document Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
DOTM - Office Open XML Document Macro-Enabled Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
TXT - Plain text ✔️
ODT - Open Document Text ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
OTT - Open Document Text Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
RTF - Rich Text Format ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️

PDF

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
PDF - Portable Document Format ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️

Markup

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
XHTML - Extensible Hypertext Markup Language File ✔️ ✔️
MHTML - MIME HTML File ✔️ ✔️
MD - Markdown ✔️ ✔️ (formatted text is not supported)
XML - XML File ✔️

Ebook

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
CHM - Compiled HTML Help File ✔️ ✔️ ✔️ ✔️ ✔️
EPUB - Digital E-Book File Format ✔️ ✔️ ✔️ ✔️ ✔️
FB2 - FictionBook 2.0 File ✔️ ✔️
MOBI - Mobipocket ✔️
AZW3 - Kindle Format 8 ✔️

Spreadsheet

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
XLS - Microsoft Excel Spreadsheet ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLT - Microsoft Excel Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLSX - Office Open XML Spreadsheet ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLSM - Office Open XML Macro-Enabled Spreadsheet ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLSB - Office Open XML Binary Spreadsheet ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLTX - Office Open XML Spreadsheet Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLTM - Office Open XML Macro-Enabled Spreadsheet Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
ODS - Open Document Spreadsheet ✔️ ✔️ ✔️ ✔️ ✔️
OTS - Open Document Spreadsheet Template ✔️ ✔️ ✔️ ✔️ ✔️
CSV - Comma Separated Values ✔️
XLA - Excel Add-In File ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
XLAM - Excel Open XML Macro-Enabled Add-In ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
NUMBERS - Apple iWork Numbers ✔️ ✔️ ✔️ ✔️

Presentation

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
PPT - PowerPoint Presentation ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
PPS - PowerPoint Slideshow ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
POT - PowerPoint Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
PPTX - Office Open XML Presentation ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
PPTM - Office Open XML Macro-Enabled Presentation ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
POTX - Office Open XML Presentation Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
POTM - Office Open XML Macro-Enabled Presentation Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
PPSX - Office Open XML Presentation Slideshow ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
PPSM - Office Open XML Macro-Enabled Presentation Slideshow ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
ODP - Open Document Presentation ✔️ ✔️ ✔️ ✔️ ✔️ ✔️
OTP - Open Document Presentation Template ✔️ ✔️ ✔️ ✔️ ✔️ ✔️

Email

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
PST - Outlook Personal Information Store File ✔️
OST - Outlook Offline Data File ✔️
EML - E-Mail Message ✔️ ✔️ ✔️ ✔️ ✔️
EMLX - Apple Mail Message ✔️ ✔️ ✔️ ✔️ ✔️
MSG - Outlook Mail Message ✔️ ✔️ ✔️ ✔️ ✔️

Note

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
ONE - OneNote Document ✔️

Archive

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
7Z* - 7Z File ✔️ ✔️
ZIP - Zipped File ✔️ ✔️
RAR - Rar File ✔️ ✔️
TAR - Tar File ✔️ ✔️
GZ - GZip file ✔️ ✔️
BZ2 - BZip2 File ✔️ ✔️

Note: Encrypted 7-zip archives are not supported.

Image*

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
BMP - Bitmap Image file ✔️ ✔️
GIF - Graphics Interchange Format ✔️
JP2 - JPEG 2000 ✔️
JPG, JPEG - JPEG Image file ✔️ ✔️
PNG - Portable Network Graphics ✔️ ✔️
TIF, TIFF - Tagged Image File Format ✔️ ✔️
DICOM - DICOM (Digital Imaging and Communications in Medicine) ✔️
DJVU - DjVu File Format ✔️ ✔️
EMF - Enhanced metafile ✔️
J2K - JPEG 2000 ✔️
PS - PostScript File Format ✔️
PSD - Photoshop Document ✔️
SVG - Scalable Vector Graphics file ✔️
SVGZ - Scalable Vector Graphics file (with gzip compression) ✔️
WEBP - WebP Image File Format ✔️
WMF - Microsoft Windows Metafile ✔️

Database

Document Type Parse Document by Template Extract Text (Accurate) Extract Text (Raw) Extract Structured Text and Formatted Text Extract Text Areas Extract Metadata Extract Images Extract Containers and Attachments Parse Form Data Extract Table of Contents Scan Barcode
ADO.NET ✔️ ✔️

Platform Independence

GroupDocs.Parser for .NET does not require any external software or third-party tools to be installed. It supports any 32-bit or 64-bit operating system where the .NET or Mono framework is installed. The details are as follows:

Microsoft Windows: Microsoft Windows Desktop (x86, x64) (XP & up), Microsoft Windows Server (x86, x64) (2000 & up), Windows Azure
Mac OS: Mac OS X
Linux: Linux (Ubuntu, OpenSUSE, CentOS and others)
Development Environments: Microsoft Visual Studio (2010 & up), Xamarin.Android, Xamarin.iOS, Xamarin.Mac, MonoDevelop 2.4 and later.
Supported Frameworks: GroupDocs.Parser for .NET supports .NET and Mono frameworks.

Get Started

Are you ready to give GroupDocs.Parser for .NET a try? Run Install-Package GroupDocs.Parser from the Package Manager Console in Visual Studio to fetch and reference the GroupDocs.Parser assembly in your project. If you already have GroupDocs.Parser for .NET and want to upgrade it, run Update-Package GroupDocs.Parser to get the latest version.

Please check the GitHub Repository for other common usage scenarios.

How to Install GroupDocs.Parser for .NET

1. Install from NuGet
Option 1: Using Package Manager GUI
  1. Open Visual Studio:

    • Load your solution/project.
  2. Access NuGet Package Manager:

    • Go to Tools -> NuGet Package Manager -> Manage NuGet Packages for Solution.
    • Alternatively, right-click the solution or project in Solution Explorer and select Manage NuGet Packages.
  3. Search for GroupDocs.Parser:

    • Navigate to the Browse tab.
    • Type “GroupDocs.Parser” in the search box.
  4. Install the Package:

    • Click the Install button to add the latest version of GroupDocs.Parser to your project.
Option 2: Using Package Manager Console
  1. Open Visual Studio:

    • Load your solution/project.
  2. Open Package Manager Console:

    • Go to Tools -> NuGet Package Manager -> Package Manager Console.
  3. Install GroupDocs.Parser:

    • Type the command Install-Package GroupDocs.Parser and press Enter.
  4. Verify Installation:

    • GroupDocs.Parser should now be referenced in your application.
2. Handling .NET Framework and .NET Standard
  • Starting with version 24.2, GroupDocs.Parser is split into two packages: one for .NET Framework and one for .NET Standard.
  • For .NET Framework projects:
    • Ensure AutoGenerateBindingRedirects is enabled.
    • Add the following to your project file for unit tests:
<PropertyGroup>
    <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
    <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>
3. Install from the Official GroupDocs Website
  1. Download GroupDocs.Parser:

    • Visit the official GroupDocs website and download the package.
  2. Unpack or Install:

    • Unzip the archive or run the MSI installer.
  3. Add a Reference in Visual Studio:

    • In Solution Explorer, right-click the References node of your project and select Add Reference.
    • If you used the MSI installer, select GroupDocs.Parser from the .NET tab. Otherwise, browse to the location of the GroupDocs.Parser.dll file.
  4. Confirm Reference:

    • Ensure GroupDocs.Parser appears under the References node in your project.
4. Additional Considerations
  • .NET Standard 2.0 Version:

    • This version has external references to several packages like System.Drawing.Common, System.Text.Encoding.CodePages, SkiaSharp, etc.
  • Linux Environment:

    • Install the following packages for proper functionality:
      • libgdiplus
      • libc6-dev
      • ttf-mscorefonts-installer (e.g., sudo apt-get install ttf-mscorefonts-installer)
    • Also, ensure SkiaSharp.NativeAssets.Linux.NoDependencies is installed.

GroupDocs.Parser for .NET Coding Samples

Code Sample 1: Extracting Text from a PDF Document

This code loads a PDF file (sample.pdf) and extracts its text content using the GetText() method. The extracted text is then displayed in the console.

using System;
using System.IO;
using GroupDocs.Parser;

public class ExtractTextFromPdf
{
    public static void Run()
    {
        // Load the PDF document
        using (Parser parser = new Parser("sample.pdf"))
        // GetText() returns a TextReader, or null if text extraction isn't supported
        using (TextReader reader = parser.GetText())
        {
            // Output the extracted text
            Console.WriteLine(reader == null
                ? "Text extraction isn't supported for this document."
                : reader.ReadToEnd());
        }
    }
}

Code Sample 2: Extracting Images from a Word Document

This code loads a Word document (sample.docx) and extracts all images found within the document. Each image is saved as a separate PNG file.

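using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;
using GroupDocs.Parser.Options;

public class ExtractImagesFromWord
{
    public static void Run()
    {
        // Load the Word document
        using (Parser parser = new Parser("sample.docx"))
        {
            // Get images from the document (null if image extraction isn't supported)
            IEnumerable<PageImageArea> images = parser.GetImages();
            if (images == null)
            {
                return;
            }

            // Convert each image to PNG and save it to a separate file
            ImageOptions options = new ImageOptions(ImageFormat.Png);
            int imageNumber = 1;
            foreach (PageImageArea image in images)
            {
                image.Save($"image{imageNumber++}.png", options);
            }
        }
    }
}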

Code Sample 3: Parsing Metadata from an Excel Spreadsheet

This code loads an Excel spreadsheet (sample.xlsx) and extracts its metadata, such as author, title, and creation date. The metadata is then displayed in the console.

using System;
using System.Collections.Generic;
using GroupDocs.Parser;
using GroupDocs.Parser.Data;

public class ExtractMetadataFromExcel
{
    public static void Run()
    {
        // Load the Excel spreadsheet
        using (Parser parser = new Parser("sample.xlsx"))
        {
            // Get the document's metadata (null if metadata extraction isn't supported)
            IEnumerable<MetadataItem> metadata = parser.GetMetadata();
            if (metadata == null)
            {
                return;
            }

            // Output each metadata item as "name: value"
            foreach (MetadataItem item in metadata)
            {
                Console.WriteLine($"{item.Name}: {item.Value}");
            }
        }
    }
}


Tags

.NET | Text Parsing | Document Parsing | NuGet | Data Extraction | Metadata Extraction | Document Automation | OCR | PDF Parsing | Email Parsing | Spreadsheet Parsing | Presentation Parsing | Template-based Parsing | Cross Platform | High Performance | API | Batch Processing | Secure Parsing | Document Security | Scalable API | Microsoft Word | Excel | PowerPoint | PDF | Email | Barcode Recognition | Linux | macOS | Windows | Software Development | C# | Programming | Application Development | Content Extraction | Structured Data Parsing | Document Structure Analysis | Hyperlink Extraction | Bookmark Extraction | Table Extraction | Form Parsing | Image Extraction | File Attachment Extraction

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 was computed.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 was computed.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp2.0 was computed.  netcoreapp2.1 was computed.  netcoreapp2.2 was computed.  netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.0 is compatible.  netstandard2.1 was computed. 
.NET Framework net461 was computed.  net462 was computed.  net463 was computed.  net47 was computed.  net471 was computed.  net472 was computed.  net48 was computed.  net481 was computed. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen40 was computed.  tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
24.10.0 1,414 11/1/2024
24.9.0 2,231 9/30/2024
24.8.0 33,128 8/30/2024
24.7.0 1,535 7/24/2024
24.6.0 2,744 6/29/2024
24.5.0 5,651 5/31/2024
24.4.0 6,021 4/23/2024
24.2.1 7,376 3/13/2024
24.2.0 1,310 2/29/2024
23.12.0 134,304 12/23/2023
23.11.0 37,081 11/24/2023
23.10.0 13,683 10/21/2023
23.8.0 65,642 8/18/2023
23.5.0 85,313 5/31/2023
23.3.0 16,120 3/31/2023
23.2.0 22,869 3/1/2023
22.11.1 25,837 1/17/2023
22.11.0 38,897 11/29/2022
22.8.0 74,549 8/12/2022
22.6.0 31,447 6/7/2022
22.2.0 37,481 2/25/2022
21.5.0 63,488 5/31/2021
21.2.0 51,089 2/22/2021
20.12.0 24,482 12/30/2020
20.10.0 169,927 10/27/2020
20.8.0 49,153 8/19/2020
20.6.1 47,555 6/30/2020
20.6.0 20,123 6/19/2020
20.5.0 35,251 5/8/2020
20.3.0 48,546 3/19/2020
20.1.0 35,811 1/31/2020
19.12.0 33,538 12/27/2019
19.11.0 28,459 11/22/2019
19.9.0 2,810 9/27/2019
19.5.0 3,040 5/29/2019
18.12.0 3,215 12/11/2018
18.11.0 2,702 11/8/2018
18.10.0 2,786 10/10/2018
18.9.0 2,773 9/5/2018
18.8.0 2,842 8/7/2018
18.7.0 2,792 7/3/2018
18.5.0 3,014 5/23/2018