Microsoft.ML.Tokenizers
1.0.0
dotnet add package Microsoft.ML.Tokenizers --version 1.0.0
NuGet\Install-Package Microsoft.ML.Tokenizers -Version 1.0.0
<PackageReference Include="Microsoft.ML.Tokenizers" Version="1.0.0" />
paket add Microsoft.ML.Tokenizers --version 1.0.0
#r "nuget: Microsoft.ML.Tokenizers, 1.0.0"
// Install Microsoft.ML.Tokenizers as a Cake Addin
#addin nuget:?package=Microsoft.ML.Tokenizers&version=1.0.0
// Install Microsoft.ML.Tokenizers as a Cake Tool
#tool nuget:?package=Microsoft.ML.Tokenizers&version=1.0.0
About
Microsoft.ML.Tokenizers provides implementations of the tokenization algorithms used in NLP transforms.
Key Features
- Extensible tokenizer architecture that allows for specialization of Normalizer, PreTokenizer, Model/Encoder, Decoder
- BPE - Byte pair encoding model
- English Roberta model
- Tiktoken model
- Llama model
- Phi2 model
How to Use
using Microsoft.ML.Tokenizers;
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
//
// Using Tiktoken Tokenizer
//
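// Initialize the tokenizer for the `gpt-4` model.
// The cl100k_base data file used by GPT-4 ships in the Microsoft.ML.Tokenizers.Data.Cl100kBase package.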
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
string source = "Text tokenization is the process of splitting a string into a list of tokens.";
Console.WriteLine($"Tokens: {tokenizer.CountTokens(source)}");
// prints: Tokens: 16
// processedText may come back null when the tokenizer applies no normalization, so fall back to the source text
var trimIndex = tokenizer.GetIndexByTokenCountFromEnd(source, 5, out string? processedText, out _);
processedText ??= source;
Console.WriteLine($"5 tokens from end: {processedText.Substring(trimIndex)}");
// prints: 5 tokens from end: a list of tokens.
trimIndex = tokenizer.GetIndexByTokenCount(source, 5, out processedText, out _);
processedText ??= source;
Console.WriteLine($"5 tokens from start: {processedText.Substring(0, trimIndex)}");
// prints: 5 tokens from start: Text tokenization is the
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
Console.WriteLine(string.Join(", ", ids));
// prints: 1199, 4037, 2065, 374, 279, 1920, 315, 45473, 264, 925, 1139, 264, 1160, 315, 11460, 13
//
// Using Llama Tokenizer
//
// Open a stream to the remote Llama tokenizer model data file
using HttpClient httpClient = new();
const string modelUrl = @"https://huggingface.co/hf-internal-testing/llama-tokenizer/resolve/main/tokenizer.model";
using Stream remoteStream = await httpClient.GetStreamAsync(modelUrl);
// Create the Llama tokenizer using the remote stream
Tokenizer llamaTokenizer = LlamaTokenizer.Create(remoteStream);
string input = "Hello, world!";
ids = llamaTokenizer.EncodeToIds(input);
Console.WriteLine(string.Join(", ", ids));
// prints: 1, 15043, 29892, 3186, 29991
Console.WriteLine($"Tokens: {llamaTokenizer.CountTokens(input)}");
// prints: Tokens: 5
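The calls above can be combined for two common tasks: checking that encoding round-trips and clipping text to a token budget. The following is a minimal sketch, not the library's recommended pattern. `TruncateToTokenBudget` is a hypothetical helper introduced here for illustration, and the example assumes the cl100k_base data package (Microsoft.ML.Tokenizers.Data.Cl100kBase) is referenced alongside Microsoft.ML.Tokenizers.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: the cl100k_base data package is referenced in the project.
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("gpt-4");

string source = "Text tokenization is the process of splitting a string into a list of tokens.";

// Round trip: encode the text to token IDs, then decode the IDs back to text.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(source);
string roundTripped = tokenizer.Decode(ids);
Console.WriteLine($"Round trip matches: {roundTripped == source}");

// Keep at most 8 tokens from the start of the text.
string clipped = TruncateToTokenBudget(tokenizer, source, maxTokens: 8);
Console.WriteLine($"Clipped: {clipped}");
Console.WriteLine($"Clipped token count: {tokenizer.CountTokens(clipped)}");

// Hypothetical helper (not part of Microsoft.ML.Tokenizers): returns the longest
// prefix of `text` that fits within `maxTokens` tokens.
static string TruncateToTokenBudget(Tokenizer tokenizer, string text, int maxTokens)
{
    int index = tokenizer.GetIndexByTokenCount(text, maxTokens, out string? normalizedText, out _);

    // The returned index refers to the normalized text when normalization ran;
    // otherwise normalizedText is null and the index refers to the input text.
    string basis = normalizedText ?? text;
    return basis.Substring(0, index);
}
```

Cutting at the index returned by `GetIndexByTokenCount` keeps the cut on a token boundary, which avoids splitting a token in half as a character-count cut might.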
Main Types
The main types provided by this library are:
- Microsoft.ML.Tokenizers.Tokenizer
- Microsoft.ML.Tokenizers.BpeTokenizer
- Microsoft.ML.Tokenizers.EnglishRobertaTokenizer
- Microsoft.ML.Tokenizers.TiktokenTokenizer
- Microsoft.ML.Tokenizers.Normalizer
- Microsoft.ML.Tokenizers.PreTokenizer
Additional Documentation
Related Packages
Feedback & Contributing
Microsoft.ML.Tokenizers is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 was computed. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp2.0 was computed. netcoreapp2.1 was computed. netcoreapp2.2 was computed. netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.0 is compatible. netstandard2.1 was computed. |
.NET Framework | net461 was computed. net462 was computed. net463 was computed. net47 was computed. net471 was computed. net472 was computed. net48 was computed. net481 was computed. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen40 was computed. tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
- .NETStandard 2.0
  - Google.Protobuf (>= 3.27.1)
  - Microsoft.Bcl.HashCode (>= 6.0.0)
  - Microsoft.Bcl.Memory (>= 9.0.0)
  - System.Text.Json (>= 8.0.5)
- net8.0
  - Google.Protobuf (>= 3.27.1)
  - System.Text.Json (>= 8.0.5)
NuGet packages (16)
Showing the top 5 NuGet packages that depend on Microsoft.ML.Tokenizers:
Package | Description |
---|---|
Microsoft.ML.TorchSharp | Microsoft.ML.TorchSharp contains ML.NET integration of TorchSharp. |
Microsoft.ML.Tokenizers.Data.Cl100kBase | The Microsoft.ML.Tokenizers.Data.Cl100kBase package includes the Tiktoken tokenizer data file cl100k_base.tiktoken, which is used by models such as GPT-4. |
Microsoft.ML.Tokenizers.Data.O200kBase | The Microsoft.ML.Tokenizers.Data.O200kBase package includes the Tiktoken tokenizer data file o200k_base.tiktoken, which is used by models such as gpt-4o. |
Alkampfer.KernelMemory.Extensions | Added some extensions for Kernel Memory. |
Cnblogs.DashScope.Core | Provides pure API access to DashScope without extra references. Cnblogs.DashScope.Sdk should be used for general purposes. |
GitHub repositories (13)
Showing the top 5 popular GitHub repositories that depend on Microsoft.ML.Tokenizers:
Repository | Description |
---|---|
microsoft/semantic-kernel | Integrate cutting-edge LLM technology quickly and easily into your apps |
dotnet/extensions | This repository contains a suite of libraries that provide facilities commonly needed when creating production-ready applications. |
microsoft/WhatTheHack | A collection of challenge based hack-a-thons including student guide, coach guide, lecture presentations, sample/instructional code and templates. Please visit the What The Hack website at: https://aka.ms/wth |
dotnet/ResXResourceManager | Manage localization of all ResX-Based resources in one central place. |
microsoft/teams-ai | SDK focused on building AI based applications and extensions for Microsoft Teams and other Bot Framework channels |
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 23,053 | 11/14/2024 |
0.22.0 | 5,095 | 11/13/2024 |
0.22.0-preview.24526.1 | 2,555 | 10/27/2024 |
0.22.0-preview.24522.7 | 1,987 | 10/23/2024 |
0.22.0-preview.24378.1 | 142,752 | 7/29/2024 |
0.22.0-preview.24271.1 | 154,312 | 5/21/2024 |
0.22.0-preview.24179.1 | 149,802 | 4/2/2024 |
0.22.0-preview.24162.2 | 20,462 | 3/13/2024 |
0.21.1 | 99,177 | 1/18/2024 |
0.21.0 | 52,453 | 11/27/2023 |
0.21.0-preview.23511.1 | 51,954 | 10/13/2023 |
0.21.0-preview.23266.6 | 51,415 | 5/17/2023 |
0.21.0-preview.22621.2 | 2,141 | 12/22/2022 |
0.20.1 | 89,089 | 2/1/2023 |
0.20.1-preview.22573.9 | 2,397 | 11/24/2022 |
0.20.0 | 31,501 | 11/8/2022 |
0.20.0-preview.22551.1 | 243 | 11/1/2022 |