BertJapaneseTokenizer 1.0.5
There is a newer version of this package available.
See the version list below for details.
dotnet add package BertJapaneseTokenizer --version 1.0.5
NuGet\Install-Package BertJapaneseTokenizer -Version 1.0.5
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="BertJapaneseTokenizer" Version="1.0.5" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add BertJapaneseTokenizer --version 1.0.5
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
#r "nuget: BertJapaneseTokenizer, 1.0.5"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install BertJapaneseTokenizer as a Cake Addin
#addin nuget:?package=BertJapaneseTokenizer&version=1.0.5
// Install BertJapaneseTokenizer as a Cake Tool
#tool nuget:?package=BertJapaneseTokenizer&version=1.0.5
The NuGet Team does not provide support for this client. Please contact its maintainers for support.
BertJapaneseTokenizer
A minimal C# implementation of the BertJapanese tokenizer (cl-tohoku/bert-base-japanese).
Quickstart
- Add the BertJapaneseTokenizer package from NuGet.
- Download the UniDic MeCab dictionary unidic-mecab-2.1.2_bin.zip from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and unzip the archive to a folder of your choice.
- Download the BertJapanese vocab file from Hugging Face. For example, vocab.txt of bert-base-japanese-v2 can be accessed from [here]. (Alternatively, use the extension method GetVocabFromHub(). See the example below.)
- Check the example code below and you are good to go.
using BertJapaneseTokenizer;

// Folder containing the unzipped UniDic MeCab dictionary.
var dicPath = @"D:\DATASET\unidic-mecab-2.1.2_bin";
// Option 1: point to a vocab.txt that was downloaded manually.
//var vocabPath = @"D:\DATASET\bert-japanese\bert-base-japanese-v2\vocab.txt";
// Option 2: obtain the vocab from the Hugging Face Hub.
var vocabPath = await HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var tokenizer = new BertJapaneseTokenizer.BertJapaneseTokenizer(dicPath, vocabPath);

var sentence = "打ち合わせが終わった後にご飯を食べましょう。";
//var sentence = "ご飯を食べましょう。";
//var sentence = "打ち合わせ";

// Encode the sentence into token IDs and an attention mask.
(var tokenIds, var attentionMask) = tokenizer.EncodePlus(sentence);
Console.WriteLine($"Sentence: {sentence}");
Console.WriteLine($"Token IDs: {string.Join(", ", tokenIds)}");

// Convert the token IDs back into text.
var decoded = tokenizer.Decode(tokenIds);
Console.WriteLine($"Decoded: {decoded}");
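For readers curious about what happens between the MeCab word split and the token IDs: BERT-style vocabularies such as the one used by bert-base-japanese are applied with WordPiece, a greedy longest-match-first subword split. The following is a minimal sketch of that idea only. It is not this package's code; the class name WordPieceSketch, its Tokenize method, and the toy vocabulary are hypothetical and exist purely for illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy vocabulary (hypothetical; a real vocab.txt has tens of thousands of entries).
var vocab = new HashSet<string> { "打ち", "##合わせ", "ご飯", "食べ", "##ましょう" };

Console.WriteLine(string.Join(" ", WordPieceSketch.Tokenize("打ち合わせ", vocab)));   // 打ち ##合わせ
Console.WriteLine(string.Join(" ", WordPieceSketch.Tokenize("食べましょう", vocab))); // 食べ ##ましょう
Console.WriteLine(string.Join(" ", WordPieceSketch.Tokenize("会議", vocab)));         // [UNK]

// Illustrative helper, not part of the BertJapaneseTokenizer package.
public static class WordPieceSketch
{
    // Splits one word into subword pieces by repeatedly taking the longest
    // prefix of the remaining text that exists in the vocabulary.
    // Pieces after the first carry the "##" continuation prefix.
    // Works on UTF-16 chars, so characters outside the BMP are not handled specially.
    public static List<string> Tokenize(string word, ISet<string> vocab, string unkToken = "[UNK]")
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            string match = "";
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0) candidate = "##" + candidate;
                if (vocab.Contains(candidate)) { match = candidate; break; }
                end--; // try a shorter candidate
            }
            // If no prefix matches, the whole word maps to the unknown token.
            if (match.Length == 0) return new List<string> { unkToken };
            pieces.Add(match);
            start = end;
        }
        return pieces;
    }
}
```

In a full pipeline each piece would then be looked up in the vocab file to obtain its integer ID; how this package performs that step internally is not shown here.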
To-do List
- Implement Decode() method
- Support BPE-type vocabulary (like cl-tohoku/bert-base-japanese-char)
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
- net8.0
  - MeCab.DotNet (>= 1.2.0)
NuGet packages (1)
Showing the top 1 NuGet packages that depend on BertJapaneseTokenizer:
Package | Downloads |
---|---|
EDMTranslator: Text translator library based on LLM models, especially EncoderDecoderModel in HuggingFace | |
GitHub repositories
This package is not used by any popular GitHub repositories.