EDMTranslator 1.2.1
EDMTranslator
Text translation library based on LLMs, especially the EncoderDecoderModel architecture from Hugging Face
NuGet package list
Package | Repo | Description |
---|---|---|
EDMTranslator | | Main library |
Requirements
- .NET 6 or above
- At least 3.5 GB of free RAM before running the translator
Supported models
- JESCJaEnTranslator (sappho192/jesc-ja-en-translator): Japanese-to-English translator based on tohoku-nlp/bert-base-japanese-v2 and openai-community/gpt2, fine-tuned with the JESC dataset
- FF14JaKoTranslator (sappho192/ffxiv-ja-ko-translator): Japanese-to-Korean translator based on tohoku-nlp/bert-base-japanese-v2 and skt/kogpt2-base-v2, fine-tuned with the FF14 dataset
- AihubJaKoTranslator (sappho192/aihub-ja-ko-translator): Japanese-to-Korean translator based on tohoku-nlp/bert-base-japanese-v2 and skt/kogpt2-base-v2, fine-tuned with the AIHub dataset
- More to be added...
Quickstart
The following guide assumes that you are using the JESCJaEnTranslator mentioned above.
Install the packages
- From NuGet, install the EDMTranslator package
- Then install the Tokenizers.DotNet.runtime.win package as well
Prepare the required data
Japanese dictionary
- Download the UniDic MeCab dictionary unidic-mecab-2.1.2_bin.zip from https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/ and unzip the archive to a folder of your choice
Fine-tuned translator model
- Download the translator model from sappho192/jesc-ja-en-translator (specifically onnx_jesc-ja-en.7z) and extract the archive to a folder of your choice
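Once both archives are extracted, a quick check of the folders can catch path mistakes before any model loading starts. The snippet below is a minimal sketch, not part of EDMTranslator: `FindMissingData` is a hypothetical helper, the two ONNX file names come from the comment next to `modelDir` in the driver code, and the example paths are the ones used in this README.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not an EDMTranslator API): returns every required
// path that does not exist, so the list is empty when the data is ready.
static List<string> FindMissingData(string encoderDictDir, string modelDir)
{
    var missing = new List<string>();

    // The dictionary is passed to the tokenizer as a folder, so only the folder is checked.
    if (!Directory.Exists(encoderDictDir))
        missing.Add(encoderDictDir);

    // File names taken from the comment next to modelDir in the driver code.
    foreach (var name in new[] { "encoder_model.onnx", "decoder_model_merged.onnx" })
    {
        var filePath = Path.Combine(modelDir, name);
        if (!File.Exists(filePath))
            missing.Add(filePath);
    }

    return missing;
}

// Example paths from this README; change them to where you extracted the archives.
var missingPaths = FindMissingData(
    @"D:\DATASET\unidic-mecab-2.1.2_bin",
    @"D:\MODEL\jesc-ja-en-translator\onnx");

foreach (var missingPath in missingPaths)
    Console.WriteLine($"Missing: {missingPath}");
```

If nothing is printed, both folders contain what the driver code expects.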
Implement the driver code
Write code like the following and you are good to go 🫡
Note that you need to set encoderDictDir and modelDir to the correct paths on your machine.
// Console application that translates Japanese sentences into English with JESCJaEnTranslator
using EDMTranslator.Tokenization;
using EDMTranslator.Translation;
// Prepare the tokenizer
var encoderVocabPath = await BertJapaneseTokenizer.HuggingFace.GetVocabFromHub("tohoku-nlp/bert-base-japanese-v2");
var hubName = "openai-community/gpt2";
var decoderVocabFilename = "tokenizer.json";
var decoderVocabPath = await Tokenizers.DotNet.HuggingFace.GetFileFromHub(hubName, decoderVocabFilename, "deps");
string encoderDictDir = @"D:\DATASET\unidic-mecab-2.1.2_bin";
var tokenizer = new BertJa2GPTTokenizer(
    encoderDictDir: encoderDictDir, encoderVocabPath: encoderVocabPath,
    decoderVocabPath: decoderVocabPath);
void TestTokenizer(ITokenizer tokenizer)
{
    Console.WriteLine("--Tokenizer test--");
    Console.WriteLine("[Encode]");
    var sentenceJa = "打ち合わせが終わった後にご飯を食べましょう。";
    Console.WriteLine($"Input: {sentenceJa}");
    var (embeddingsJa, attentionMask) = tokenizer.Encode(sentenceJa);
    Console.WriteLine($"Encoded: {string.Join(", ", embeddingsJa)}");
    Console.WriteLine("[Decode]");
    // Tokens of "i was nervous before the exam, and i had a fever."
    var tokens = new uint[] { 72, 373, 10927, 878, 262, 2814, 11, 290, 1312, 550, 257, 17372, 13 };
    Console.WriteLine($"Input: {string.Join(", ", tokens)}");
    var decoded = tokenizer.Decode(tokens);
    Console.WriteLine($"Decoded: {decoded}");
}
TestTokenizer(tokenizer);
// Prepare the translator
string modelDir = @"D:\MODEL\jesc-ja-en-translator\onnx"; // The folder should contain encoder_model.onnx and decoder_model_merged.onnx
var translator = new JESCJaEnTranslator(tokenizer, modelDir);
void TestTranslator(JESCJaEnTranslator translator)
{
    Console.WriteLine("--Translator test--");
    Translate(translator, "打ち合わせが終わった後にご飯を食べましょう。");
    Translate(translator, "試験前に緊張したあまり、熱がでてしまった。");
    Translate(translator, "山田は英語にかけてはクラスの誰にも負けない。");
    Translate(translator, "この本によれば、最初の人工橋梁は新石器時代にさかのぼるという。");
}
TestTranslator(translator);
// Translate one sentence and print both the source and the result
static void Translate(JESCJaEnTranslator translator, string sentence)
{
    Console.WriteLine($"SourceText: {sentence}");
    string translated = translator.Translate(sentence);
    Console.WriteLine($"Translated: {translated}");
}
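Beyond the fixed test sentences, the same `Translate(string)` call can be applied to text read line by line, for example from the console or a file. The following is a rough sketch under stated assumptions: `TranslateLines` is a hypothetical helper that is not part of EDMTranslator, and it takes the translation step as a `Func<string, string>` so that it runs here with a stand-in delegate instead of the real model.

```csharp
using System;
using System.IO;

// Hypothetical helper (not an EDMTranslator API): reads lines from `input`,
// skips blank ones, and writes the result of `translate` for each remaining line.
static void TranslateLines(TextReader input, TextWriter output, Func<string, string> translate)
{
    for (var line = input.ReadLine(); line != null; line = input.ReadLine())
    {
        if (string.IsNullOrWhiteSpace(line))
            continue;
        output.WriteLine(translate(line));
    }
}

// With the driver code above, the call would presumably look like:
//   TranslateLines(Console.In, Console.Out, translator.Translate);
// A stand-in delegate keeps this sketch runnable without the model files.
TranslateLines(
    new StringReader("打ち合わせが終わった後にご飯を食べましょう。\n"),
    Console.Out,
    sentence => $"[stub translation of: {sentence}]");
```

Passing the translation step as a delegate also makes it easy to swap in one of the other supported translators without changing the loop.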
How to build
- Prepare the following:
  - .NET build system (dotnet 6.0, 7.0, 8.0)
  - PowerShell (7.4.2 or above recommended)
- Run cbuild.ps1
The build artifacts will be saved in the nuget directory.
Product | Versions (compatible and additional computed target framework versions) |
---|---|
.NET | net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 is compatible. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
Dependencies
- net6.0
  - BertJapaneseTokenizer (>= 1.0.9)
  - Microsoft.ML.OnnxRuntime (>= 1.17.3)
  - NumSharp (>= 0.30.0)
  - Tokenizers.DotNet (>= 1.0.5)
- net7.0
  - BertJapaneseTokenizer (>= 1.0.9)
  - Microsoft.ML.OnnxRuntime (>= 1.17.3)
  - NumSharp (>= 0.30.0)
  - Tokenizers.DotNet (>= 1.0.5)
- net8.0
  - BertJapaneseTokenizer (>= 1.0.9)
  - Microsoft.ML.OnnxRuntime (>= 1.17.3)
  - NumSharp (>= 0.30.0)
  - Tokenizers.DotNet (>= 1.0.5)