SemanticChunker.NET 1.0.1

.NET CLI:
dotnet add package SemanticChunker.NET --version 1.0.1

Package Manager (run inside the Visual Studio Package Manager Console, which provides the NuGet module's Install-Package):
NuGet\Install-Package SemanticChunker.NET -Version 1.0.1

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="SemanticChunker.NET" Version="1.0.1" />

Central Package Management (CPM): copy this XML node into the solution's Directory.Packages.props to version the package:
<PackageVersion Include="SemanticChunker.NET" Version="1.0.1" />
and reference it without a version in the project file:
<PackageReference Include="SemanticChunker.NET" />

Paket CLI:
paket add SemanticChunker.NET --version 1.0.1

F# Interactive / Polyglot Notebooks (copy into the interactive tool or script source):
#r "nuget: SemanticChunker.NET, 1.0.1"

C# file-based apps (starting in .NET 10 preview 4; place before any lines of code in the .cs file):
#:package SemanticChunker.NET@1.0.1

Cake Addin:
#addin nuget:?package=SemanticChunker.NET&version=1.0.1

Cake Tool:
#tool nuget:?package=SemanticChunker.NET&version=1.0.1


SemanticChunker.NET

Automatic Semantic Chunking for RAG in .NET
Transforms long texts into coherent, retrieval-ready chunks with a single call - powered by embeddings and fully compatible with Semantic Kernel and Microsoft.Extensions.AI.

License: Apache 2.0

Split long documents into semantically coherent chunks that fit your LLM’s context window while maximising retrieval precision.

Features ✨

  • Plug‑and‑play API – One call to CreateChunksAsync returns ready‑to‑use Chunk objects with ID, text, and embedding.
  • Model‑agnostic – Works with any embedding generator supported by Microsoft.Extensions.AI; no framework lock‑in.
  • Four breakpoint strategies – Percentile, StandardDeviation, InterQuartile, and Gradient cover most corpus profiles.
  • Context buffer window – Configurable bufferSize preserves cross‑sentence semantics.
  • Target chunk count – Unique targetChunkCount option produces exactly the number of chunks you need.
  • Multilingual sentence splitting – ICU4N ensures accurate sentence boundaries in 70+ languages.
  • Token‑limit safety – Automatic 10 % safety margin below your model’s context window.
  • Parallel embedding generation – Maximises throughput when your embedding provider supports batching.
  • Zero external overhead – Pure .NET plus ICU4N; lightweight for microservices and serverless functions.
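To make the breakpoint idea concrete, here is a minimal, self-contained sketch of how a Percentile strategy can decide where to split: compute cosine distances between consecutive sentence embeddings, then break wherever the distance exceeds a chosen percentile. This is illustrative only; the names and the exact rounding are assumptions, not the library's internals.

```csharp
using System;
using System.Linq;

// Cosine distance = 1 - cosine similarity between two embedding vectors.
static double CosineDistance(double[] a, double[] b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return 1.0 - dot / (Math.Sqrt(na) * Math.Sqrt(nb));
}

// Toy sentence embeddings: the first two are similar, the third is a topic shift.
double[][] embeddings =
{
    new[] { 1.0, 0.0 },
    new[] { 0.9, 0.1 },
    new[] { 0.0, 1.0 },
};

// Distances between consecutive sentences.
double[] distances = Enumerable.Range(0, embeddings.Length - 1)
    .Select(i => CosineDistance(embeddings[i], embeddings[i + 1]))
    .ToArray();

// Break where the distance exceeds the chosen percentile of all distances
// (95 is the usual default; 50 here so the tiny example produces a split).
double[] sorted = distances.OrderBy(d => d).ToArray();
double threshold = sorted[(int)(0.5 * (sorted.Length - 1))];
int[] breakpoints = distances
    .Select((d, i) => (d, i))
    .Where(t => t.d > threshold)
    .Select(t => t.i)
    .ToArray();
```

With the toy data above, only the gap between sentence 2 and sentence 3 crosses the threshold, so the text would be split into two chunks there.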

Installation 📦

dotnet add package SemanticChunker.NET

Quick Start 🛠️

using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using SemanticChunkerNET;

// 1. Wire an embedding generator (example uses LM Studio)
var builder = Kernel.CreateBuilder();

#pragma warning disable SKEXP0010
builder.Services.AddLmStudioEmbeddingGenerator("text-embedding-multilingual-e5-base");
#pragma warning restore SKEXP0010

var kernel = builder.Build();

// 2. Create Chunker with your model’s token limit (e.g. 512)
var embeddingGenerator = kernel.Services.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();

var semanticChunker = new SemanticChunker(embeddingGenerator, tokenLimit: 512);

// 3. Chunk text
string input = File.ReadAllText("whitepaper.md");
IList<Chunk> chunks = await semanticChunker.CreateChunksAsync(input);

// 4. Persist embeddings to your vector store
await myVectorStore.UpsertAsync(chunks);
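The "parallel embedding generation" feature above is about issuing many embedding calls concurrently. The sketch below shows the general pattern with a stand-in generator (a hypothetical FakeEmbedAsync, not the Microsoft.Extensions.AI interface): launch all calls at once and await them together, rather than embedding sentences one by one.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for a real embedding call; here the "embedding" is just the length.
static Task<float[]> FakeEmbedAsync(string sentence) =>
    Task.FromResult(new float[] { sentence.Length });

string[] sentences = { "First sentence.", "Second one.", "Third." };

// Start all embedding tasks concurrently and await the whole batch.
float[][] embeddings = await Task.WhenAll(sentences.Select(FakeEmbedAsync));
```

When the provider supports batching, a single batched request is usually cheaper still; the concurrent pattern is the fallback for providers that only expose per-item calls.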

Step‑by‑Step Calibration Guide

This section walks you through finding the best settings for your corpus and embedding model.

| Step | Action | Why |
|---|---|---|
| 1 | Choose an embedding model. Prefer models whose training data match your language/domain. | Embedding quality dominates chunk quality. |
| 2 | Set tokenLimit to the embedding model's token limit. | Leaves headroom for prompts/RAG metadata. |
| 3 | Pick a buffer size. Start with 1; raise to 2–3 if individual sentences lose context. | Neighboring sentences improve semantic continuity. |
| 4 | Choose a breakpoint strategy. Percentile 95 % is the industry default; switch to StandardDeviation when your corpus shows heavy-tailed distance distributions. | Percentile is robust; SD handles outliers. |
| 5 | Adjust thresholdAmount. Lower values → more chunks, higher recall; higher values → fewer, longer chunks, better precision. Tune in 5-point increments (e.g. 90, 95, 98). | Balances retrieval recall vs. answer accuracy. |
| 6 | Optionally set targetChunkCount. If you know how many chunks you need (e.g. for a fixed-budget eval), supply it and skip manual threshold tuning. | Directly controls output size. |
| 7 | Evaluate. Measure answer F1/EM and retrieval hit rate on a validation set; iterate Steps 4–6 until metrics plateau. | Empirical tuning beats rules of thumb. |
| 8 | Lock parameters in production. Persist calibrated values in app settings or environment variables. | Guarantees reproducibility across builds. |
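For the evaluation step, retrieval hit rate is the simplest metric to start with: the fraction of validation questions for which the gold chunk appears among the retrieved chunks. A minimal sketch, with hypothetical chunk IDs standing in for a real retriever's output:

```csharp
using System;
using System.Linq;

// Each entry: the chunk ID that answers the question, and the IDs the
// retriever actually returned for it.
var validationSet = new[]
{
    (Gold: "c1", Retrieved: new[] { "c1", "c4" }),  // hit
    (Gold: "c2", Retrieved: new[] { "c3", "c5" }),  // miss
    (Gold: "c3", Retrieved: new[] { "c3" }),        // hit
};

// Hit rate = questions where the gold chunk was retrieved / total questions.
double hitRate = validationSet.Count(s => s.Retrieved.Contains(s.Gold))
                 / (double)validationSet.Length;
```

Track this number across the settings you try in Steps 4–6; a plateau is the signal to stop tuning.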

Configuration Reference

| Ctor Parameter | Default | Description |
|---|---|---|
| tokenLimit | — (required) | Max LLM tokens per chunk (safety margin = 10 %). |
| bufferSize | 1 | Sentences added before/after the current sentence during embedding. |
| thresholdType | Percentile | Breakpoint metric (Percentile, StandardDeviation, InterQuartile, Gradient). |
| thresholdAmount | see table | E.g. 95 % for Percentile, 3 σ for StandardDeviation. |
| targetChunkCount | null | Overrides thresholds to hit an exact chunk count. |
| minChunkChars | 0 | Skips chunks shorter than this. |
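The 10 % safety margin noted for tokenLimit means the chunker effectively budgets slightly below your model's context window. The arithmetic, assuming simple truncating rounding (the library's exact rounding may differ):

```csharp
// With tokenLimit = 512 and a 10 % safety margin, the effective
// per-chunk budget is about 460 tokens.
int tokenLimit = 512;
int effectiveBudget = (int)(tokenLimit * 0.9);
```

So when sizing prompts, assume roughly 90 % of the configured tokenLimit per chunk, not the full limit.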

👨‍💻 Author

Gregor Biswanger is a leading expert in generative AI and a Microsoft MVP for Azure AI and Web App Development. As an independent consultant, he works closely with the Microsoft product team for GitHub Copilot and supports companies in implementing modern AI solutions.

As a freelance consultant, trainer, and author, he shares his expertise in software architecture and cloud technologies and is a sought-after speaker at international conferences. For several years, he has been live-streaming every Friday evening on Twitch with My Coding Zone in German and is an active YouTuber.

Reach out to Gregor if you need support in the form of consulting, training, or implementing AI solutions using .NET or Node.js, via LinkedIn or Twitter (@BFreakout).

See also the list of contributors who participated in this project.

🙋‍♀️🙋‍♂ Contributing

Feel free to submit a pull request if you find any bugs (to see a list of active issues, visit the Issues section). Please make sure all commits are properly documented.

The best thing would be to write about what you plan to do in the issue beforehand. Then there will be no disappointment if we cannot accept your pull request.

🙏 Donate

I work on this open-source project in my free time alongside a full-time job and raising three kids. If you'd like to support my work and help me dedicate more time to this project, consider sponsoring me on GitHub:

Your sponsorship allows me to invest more time in improving the project and prioritizing important issues or features. Any support is greatly appreciated - thank you! 🍻

📜 License

This project is licensed under the Apache License 2.0 - © Gregor Biswanger 2025

Happy chunking! 🧩

Product compatible and additional computed target framework versions:
.NET: net8.0 is compatible. The platform-specific net8.0 targets (android, browser, ios, maccatalyst, macos, tvos, windows) were computed, as were net9.0 and net10.0 and their platform-specific targets.
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.1 | 81 | 7/27/2025 |
| 1.0.0 | 487 | 7/22/2025 |