DocumentAtom 1.0.10

.NET CLI:
dotnet add package DocumentAtom --version 1.0.10

Package Manager (intended for the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
NuGet\Install-Package DocumentAtom -Version 1.0.10

PackageReference (for projects that support PackageReference, copy this XML node into the project file):
<PackageReference Include="DocumentAtom" Version="1.0.10" />

Paket CLI:
paket add DocumentAtom --version 1.0.10

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the source code of the script):
#r "nuget: DocumentAtom, 1.0.10"

Cake:
// Install DocumentAtom as a Cake Addin
#addin nuget:?package=DocumentAtom&version=1.0.10

// Install DocumentAtom as a Cake Tool
#tool nuget:?package=DocumentAtom&version=1.0.10

<img src="https://github.com/jchristn/DocumentAtom/blob/main/assets/icon.png" width="256" height="256">

DocumentAtom

DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, and artificial intelligence.

The following packages are available on NuGet:

  • DocumentAtom.Excel
  • DocumentAtom.Image
  • DocumentAtom.Markdown
  • DocumentAtom.Pdf
  • DocumentAtom.PowerPoint
  • DocumentAtom.Ocr
  • DocumentAtom.Text
  • DocumentAtom.Word

New in v1.0.x

  • Initial release

Motivation

Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction that are broadly useful, I would love your feedback on how to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.

Bugs, Quality, Feedback, or Enhancement Requests

Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.

Types Supported

DocumentAtom supports the following input file types:

  • Text
  • Markdown
  • Microsoft Word (.docx)
  • Microsoft Excel (.xlsx)
  • Microsoft PowerPoint (.pptx)
  • PNG images (requires Tesseract on the host)
  • PDF

Simple Example

Refer to the various Test projects for working examples.

The following example shows processing a markdown (.md) file.

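using System;
using DocumentAtom.Core.Atoms;
using DocumentAtom.Markdown;

// Path to the markdown file to process (placeholder; substitute your own).
string filename = "README.md";

MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
// Pass the settings object created above to the processor.
MarkdownProcessor processor = new MarkdownProcessor(settings);
foreach (Atom atom in processor.Extract(filename))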
{
    Console.WriteLine(atom.ToString());
}

Atom Types

DocumentAtom parses input data assets into a variety of Atom objects. Each Atom includes top-level metadata including:

  • GUID
  • Type - including Text, Image, Binary, Table, and List
  • PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
  • Position - the ordinal position of the Atom, relative to others
  • Length - the length of the Atom's content
  • MD5Hash - the MD5 hash of the Atom content
  • SHA1Hash - the SHA1 hash of the Atom content
  • SHA256Hash - the SHA256 hash of the Atom content
  • Quarks - sub-atomic particles created from the Atom content, for instance, when chunking text
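
To illustrate the Length and hash metadata listed above, here is a minimal sketch that derives comparable values from a piece of content using only standard .NET cryptography APIs. It is not DocumentAtom's own implementation; the sample content and variable names are assumptions for illustration, and the README does not state whether the library exposes hashes as hex strings or raw bytes, or in which letter case.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical content; in practice this would be an Atom's text or bytes.
string content = "hello";
byte[] bytes = Encoding.UTF8.GetBytes(content);

// Hash the raw bytes and render each digest as an uppercase hex string.
string md5 = Convert.ToHexString(MD5.HashData(bytes));
string sha1 = Convert.ToHexString(SHA1.HashData(bytes));
string sha256 = Convert.ToHexString(SHA256.HashData(bytes));

Console.WriteLine($"Length: {bytes.Length}");
Console.WriteLine($"MD5:    {md5}");
Console.WriteLine($"SHA1:   {sha1}");
Console.WriteLine($"SHA256: {sha256}");
```

Hashes like these are commonly used to detect duplicate or changed content between processing runs, since identical content always produces identical digests.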

The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:

  • BinaryAtom - includes a Bytes property
  • DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
  • ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
  • MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
  • PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
  • PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
  • TableAtom - includes Rows, Columns, Irregular, and Table properties
  • TextAtom - includes Text
  • XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties

Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.

Underlying Libraries

DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.

Each of these libraries was integrated as a NuGet package, and no source code from these packages was included or modified.

My libraries used within DocumentAtom:

Version History

Please refer to CHANGELOG.md for version history.

Thanks

Special thanks to iconduck.com and the content authors for producing this icon.

Compatible and additional computed target framework versions

  • Compatible: net8.0
  • Computed: net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows, net9.0, net9.0-android, net9.0-browser, net9.0-ios, net9.0-maccatalyst, net9.0-macos, net9.0-tvos, net9.0-windows

NuGet packages (7)

Showing the top 5 NuGet packages that depend on DocumentAtom:

Package Downloads
DocumentAtom.Image

DocumentAtom provides a light, fast library for breaking input images into constituent text parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.

DocumentAtom.Pdf

DocumentAtom provides a light, fast library for breaking input PDF documents into constituent parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.

DocumentAtom.PowerPoint

DocumentAtom provides a light, fast library for breaking input PowerPoint (pptx) documents into constituent parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.

DocumentAtom.Markdown

DocumentAtom provides a light, fast library for breaking input markdown documents into constituent parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.

DocumentAtom.Excel

DocumentAtom provides a light, fast library for breaking input Excel (xlsx) documents into constituent parts (atoms), useful for AI, machine learning, processing, analytics, and general analysis.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version  Downloads  Last updated
1.0.10   86         1/25/2025
1.0.9    46         1/25/2025
1.0.8    34         1/25/2025
1.0.7    35         1/25/2025
1.0.6    32         1/25/2025
1.0.5    33         1/25/2025
1.0.3    36         1/25/2025
1.0.2    78         1/25/2025
1.0.1    30         1/25/2025
1.0.0    150        12/28/2024
