BpeTokenizer 1.0.5

BpeTokenizer


BpeTokenizer is a C# port of tiktoken, the byte pair encoding (BPE) tokenizer written by OpenAI. It tokenizes text into subword units.

This library is built for x64 architectures.

Because BpeTokenizer is derived from tiktoken, it can also serve as a token counter. This is useful when streaming tokens from the OpenAI API for GPT chat completions, since it lets the calling software keep track of the cost of its API usage.

To install BpeTokenizer, run the following command in the Package Manager Console:

Install-Package BpeTokenizer

If you'd prefer to use the .NET CLI, run this command instead:

dotnet add package BpeTokenizer

Usage

To use BpeTokenizer, import the namespace:

using BpeTokenizer;

Then create an encoder by its model or encoding name:

// By its encoding name:
var encoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");

// By its model:
var encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");

Both variants are async because they either download the encoding from a remote server or load it from the local cache.

Once you have an encoding, you can encode your text:

var tokens = encoder.Encode("Hello BPE world!"); //Results in: [9906, 426, 1777, 1917, 0]

To decode a stream of tokens, you can use the following:

var text = encoder.Decode(tokens); //Results in: "Hello BPE world!"
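Putting the pieces together, a minimal round trip might look like this (the model name and sample text are illustrative; top-level statements require C# 9 or later):

```csharp
using System;
using BpeTokenizer;

// Resolve the encoding used by gpt-4 (cl100k_base).
var encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");

// Encode text into token IDs, then decode them back.
var tokens = encoder.Encode("Hello BPE world!");
var roundTripped = encoder.Decode(tokens);

Console.WriteLine(string.Join(", ", tokens)); // 9906, 426, 1777, 1917, 0
Console.WriteLine(roundTripped);              // Hello BPE world!
```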

Supported Encodings/Models:

BpeTokenizer supports the following encodings:

  1. cl100k_base
  2. p50k_edit
  3. p50k_base
  4. r50k_base
  5. gpt2

You can use these encoding names when creating an encoder:

var cl100kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");
var p50kEditEncoder   = await BytePairEncodingRegistry.GetEncodingAsync("p50k_edit");
var p50kBaseEncoder   = await BytePairEncodingRegistry.GetEncodingAsync("p50k_base");
var r50kBaseEncoder   = await BytePairEncodingRegistry.GetEncodingAsync("r50k_base");
var gpt2Encoder       = await BytePairEncodingRegistry.GetEncodingAsync("gpt2");

The following models are supported (from tiktoken source, embedding in parentheses):

  1. Chat (all cl100k_base)
    1. gpt-4 - e.g., gpt-4-0314, etc., plus gpt-4-32k
    2. gpt-3.5-turbo - e.g., gpt-3.5-turbo-0301, -0401, etc.
    3. gpt-35-turbo - Azure deployment name
  2. Text (future use, all cl100k_base; API availability on Jan 4, 2024)
    1. ada-002
    2. babbage-002
    3. curie-002
    4. davinci-002
    5. gpt-3.5-turbo-instruct
  3. Code (all p50k_base)
    1. code-davinci-002
    2. code-davinci-001
    3. code-cushman-002
    4. code-cushman-001
    5. davinci-codex
    6. cushman-codex
  4. Edit (all p50k_edit)
    1. text-davinci-edit-001
    2. code-davinci-edit-001
  5. Embeddings
    1. text-embedding-ada-002 (cl100k_base)
  6. Legacy (no longer available on Jan 4, 2024)
    1. text-davinci-003 (p50k_base)
    2. text-davinci-002 (p50k_base)
    3. text-davinci-001 (r50k_base)
    4. text-curie-001 (r50k_base)
    5. text-babbage-001 (r50k_base)
    6. text-ada-001 (r50k_base)
    7. davinci (r50k_base)
    8. curie (r50k_base)
    9. babbage (r50k_base)
    10. ada (r50k_base)
  7. Old Embeddings (all r50k_base)
    1. text-similarity-davinci-001
    2. text-similarity-curie-001
    3. text-similarity-babbage-001
    4. text-similarity-ada-001
    5. text-search-davinci-doc-001
    6. text-search-curie-doc-001
    7. text-search-babbage-doc-001
    8. text-search-ada-doc-001
    9. code-search-babbage-code-001
    10. code-search-ada-code-001
  8. Open Source
    1. gpt2 (gpt2)

You can use these model names when creating an encoder (list not exhaustive):

var gpt4Encoder                     = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");
var textDavinci003Encoder           = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-003");
var textDavinci001Encoder           = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-001");
var codeDavinci002Encoder           = await BytePairEncodingModels.EncodingForModelAsync("code-davinci-002");
var textDavinciEdit001Encoder       = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-edit-001");
var textEmbeddingAda002Encoder      = await BytePairEncodingModels.EncodingForModelAsync("text-embedding-ada-002");
var textSimilarityDavinci001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-similarity-davinci-001");
var gpt2Encoder                     = await BytePairEncodingModels.EncodingForModelAsync("gpt2");

Several of the older models are being deprecated at the start of 2024; see the Legacy section above.

Token Counting

To count tokens in a given string, you can use the following:

var tokenCount = encoder.CountTokens("Hello BPE world!"); //Results in: 5
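For example, token counting can be used to estimate the cost of a prompt before sending it to the API. A hedged sketch (the per-1K-token price below is a placeholder, not a real rate; always check OpenAI's current pricing):

```csharp
using System;
using BpeTokenizer;

var encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");

// Placeholder price per 1,000 prompt tokens -- not an actual OpenAI rate.
const decimal pricePer1KTokens = 0.03m;

var prompt = "Hello BPE world!";
int tokenCount = encoder.CountTokens(prompt); // 5, per the example above

decimal estimatedCost = tokenCount / 1000m * pricePer1KTokens;
Console.WriteLine($"{tokenCount} tokens, estimated cost ${estimatedCost:F5}");
```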
Compatible and additional computed target frameworks

.NET: net7.0 is compatible. Computed: net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.

NuGet packages (1)

One NuGet package depends on BpeTokenizer: BpeChatAI.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version history (version, downloads, last updated):

  1.0.5 - 224 - 7/21/2023
  1.0.4 - 174 - 7/15/2023 (deprecated)
  1.0.3 - 207 - 7/15/2023 (deprecated)
  1.0.2 - 215 - 7/14/2023 (deprecated)
  1.0.1 - 180 - 7/13/2023 (deprecated)
  1.0.0 - 165 - 7/13/2023 (deprecated)

Release notes (1.0.5): Corrected ReadMe.md to point to the appropriate branch for the GitHub link.