# SemanticChunker.NET
**Automatic Semantic Chunking for RAG in .NET**

Transforms long texts into coherent, retrieval-ready chunks with a single call – powered by embeddings and fully compatible with Semantic Kernel and Microsoft.Extensions.AI. Split long documents into semantically coherent chunks that fit your LLM’s context window while maximising retrieval precision.
## Features ✨
- **Plug‑and‑play API** – one call to `CreateChunksAsync` returns ready‑to‑use `Chunk` objects with ID, text, and embedding.
- **Model‑agnostic** – works with any embedding generator supported by `Microsoft.Extensions.AI`; no framework lock‑in.
- **Four breakpoint strategies** – `Percentile`, `StandardDeviation`, `InterQuartile`, and `Gradient` cover most corpus profiles.
- **Context buffer window** – a configurable `bufferSize` preserves cross‑sentence semantics.
- **Target chunk count** – the unique `targetChunkCount` option produces exactly the number of chunks you need.
- **Multilingual sentence splitting** – ICU4N ensures accurate sentence boundaries in 70+ languages.
- **Token‑limit safety** – automatic 10 % safety margin below your model’s context window.
- **Parallel embedding generation** – maximises throughput when your embedding provider supports batching.
- **Zero external overhead** – pure .NET plus ICU4N; lightweight for microservices and serverless functions.
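To make the breakpoint strategies concrete, here is a minimal sketch (illustrative only, not SemanticChunker.NET internals) of the idea behind the `Percentile` strategy: any sentence gap whose embedding distance exceeds the chosen percentile becomes a chunk boundary.

```csharp
using System;
using System.Linq;

// Sketch of the Percentile breakpoint idea (not the library's code):
// gaps whose embedding distance exceeds the percentile threshold
// become chunk boundaries.
static int[] FindBreakpoints(double[] distances, double percentile)
{
    // Read the threshold at the requested percentile of the distances.
    var sorted = distances.OrderBy(d => d).ToArray();
    int rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length) - 1;
    double threshold = sorted[Math.Clamp(rank, 0, sorted.Length - 1)];

    // Every gap above the threshold is a boundary.
    return distances
        .Select((d, i) => (d, i))
        .Where(t => t.d > threshold)
        .Select(t => t.i)
        .ToArray();
}

// Toy distances: two clear topic shifts at indices 2 and 4.
double[] distances = { 0.05, 0.08, 0.62, 0.07, 0.71, 0.06 };
Console.WriteLine(string.Join(",", FindBreakpoints(distances, 60))); // prints "2,4"
```

The `StandardDeviation`, `InterQuartile`, and `Gradient` strategies follow the same pattern but derive the threshold differently from the distance distribution.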
## Installation 📦

```shell
dotnet add package SemanticChunker.NET
```
## Quick Start 🛠️
```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;
using SemanticChunkerNET;

// 1. Wire up an embedding generator (example uses LM Studio)
var builder = Kernel.CreateBuilder();
#pragma warning disable SKEXP0010
builder.Services.AddLmStudioEmbeddingGenerator("text-embedding-multilingual-e5-base");
#pragma warning restore SKEXP0010
var kernel = builder.Build();

// 2. Create the chunker with your model’s token limit (e.g. 512)
var embeddingGenerator = kernel.Services.GetRequiredService<IEmbeddingGenerator<string, Embedding<float>>>();
var semanticChunker = new SemanticChunker(embeddingGenerator, tokenLimit: 512);

// 3. Chunk the text
string input = File.ReadAllText("whitepaper.md");
IList<Chunk> chunks = await semanticChunker.CreateChunksAsync(input);

// 4. Persist the chunks to your vector store
await myVectorStore.UpsertAsync(chunks);
```
## Step‑by‑Step Calibration Guide

This section walks you through finding the best settings for your corpus and embedding model.
| Step | Action | Why |
|---|---|---|
| 1. Choose an embedding model | Prefer models whose training data match your language/domain. | Embedding quality dominates chunk quality. |
| 2. Set `tokenLimit` | Use the embedding model’s token limit. | Leaves headroom for prompts/RAG metadata. |
| 3. Pick a buffer size | Start with 1; raise to 2–3 if individual sentences lose context. | Neighbouring sentences improve semantic continuity. |
| 4. Choose a breakpoint strategy | Percentile 95 % is the industry default. Switch to `StandardDeviation` when your corpus shows heavy-tail distance distributions. | Percentile is robust; standard deviation handles outliers. |
| 5. Adjust `thresholdAmount` | Lower value → more chunks, higher recall; higher value → fewer, longer chunks, better precision. Tune in 5‑point increments (e.g. 90, 95, 98). | Balances retrieval recall vs. answer accuracy. |
| 6. Optionally set `targetChunkCount` | If you know how many chunks you need (e.g. for a fixed‑budget evaluation), supply it and skip manual threshold tuning. | Directly controls output size. |
| 7. Evaluate | Measure answer F1/EM and retrieval hit rate on a validation set. Iterate Steps 4–6 until metrics plateau. | Empirical tuning beats rules of thumb. |
| 8. Lock parameters in production | Persist calibrated values in app settings or environment variables. | Guarantees reproducibility across builds. |
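As a rough numeric illustration of Step 5 (again a sketch, not the library's internals), sweeping the percentile over a toy distance profile shows the trade-off directly: lower thresholds produce more, shorter chunks; higher thresholds produce fewer, longer ones.

```csharp
using System;
using System.Linq;

// Illustrative sweep over percentile thresholds (not library code):
// chunk count = number of gaps above the threshold, plus one.
static int ChunkCount(double[] distances, double percentile)
{
    var sorted = distances.OrderBy(d => d).ToArray();
    int rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length) - 1;
    double threshold = sorted[Math.Clamp(rank, 0, sorted.Length - 1)];
    return distances.Count(d => d > threshold) + 1; // boundaries + 1
}

// A toy profile: mostly small sentence-to-sentence distances, a few spikes.
double[] profile =
{
    0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08,
    0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16,
    0.5, 0.6, 0.7, 0.8
};

foreach (int p in new[] { 90, 95, 98 })
    Console.WriteLine($"percentile {p}: {ChunkCount(profile, p)} chunk(s)");
// prints:
// percentile 90: 3 chunk(s)
// percentile 95: 2 chunk(s)
// percentile 98: 1 chunk(s)
```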
## Configuration Reference

| Ctor Parameter | Default | Description |
|---|---|---|
| `tokenLimit` | – | Max LLM tokens per chunk (safety margin = 10 %). |
| `bufferSize` | 1 | Sentences added before/after the current sentence during embedding. |
| `thresholdType` | `Percentile` | Breakpoint metric (`StandardDeviation`, `InterQuartile`, `Gradient`). |
| `thresholdAmount` | see table | E.g. 95 % for `Percentile`, 3 σ for `StandardDeviation`. |
| `targetChunkCount` | `null` | Overrides thresholds to hit an exact chunk count. |
| `minChunkChars` | 0 | Skips chunks shorter than this. |
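Putting the table together, a fully specified constructor call might look like the fragment below. The parameter names come from the reference table above; the `BreakpointThresholdType` enum name is an assumption, not confirmed API, and `embeddingGenerator` is the generator from the Quick Start.

```csharp
// Sketch only: parameter names are taken from the reference table above;
// the BreakpointThresholdType enum name is an assumption, not confirmed API.
var chunker = new SemanticChunker(
    embeddingGenerator,
    tokenLimit: 512,            // max tokens per chunk (10 % margin applied)
    bufferSize: 2,              // two neighbouring sentences on each side
    thresholdType: BreakpointThresholdType.Percentile,
    thresholdAmount: 95,        // 95th-percentile breakpoint
    minChunkChars: 40);         // drop fragments shorter than 40 characters
```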
## 👨‍💻 Author

Gregor Biswanger is a leading expert in generative AI and a Microsoft MVP for Azure AI and Web App Development. As an independent consultant, he works closely with the Microsoft product team for GitHub Copilot and supports companies in implementing modern AI solutions. As a freelance consultant, trainer, and author, he shares his expertise in software architecture and cloud technologies and is a sought-after speaker at international conferences. For several years, he has been live-streaming every Friday evening on Twitch with My Coding Zone in German and is an active YouTuber.

Reach out to Gregor if you need support in the form of consulting, training, or implementing AI solutions using .NET or Node.js: LinkedIn or Twitter @BFreakout
See also the list of contributors who participated in this project.
## 🙋‍♀️🙋‍♂️ Contributing

Feel free to submit a pull request if you find any bugs (to see a list of active issues, visit the Issues section). Please make sure all commits are properly documented.

It is best to describe what you plan to do in an issue beforehand; that way there will be no disappointment if we cannot accept your pull request.
## 🙏 Donate

I work on this open-source project in my free time alongside a full-time job and raising three kids. If you'd like to support my work and help me dedicate more time to this project, consider sponsoring me on GitHub.

Your sponsorship allows me to invest more time in improving the project and prioritising important issues and features. Any support is greatly appreciated – thank you! 🍻
## 📜 License

This project is licensed under the Apache License 2.0 – © Gregor Biswanger 2025
Happy chunking! 🧩
## Dependencies

Targets `net8.0` (later frameworks such as `net9.0` and `net10.0` are computed as compatible).

- ICU4N (>= 60.1.0-alpha.438)
- Microsoft.Extensions.AI.Abstractions (>= 9.7.1)