Cynic-Magnit.Tokenization
1.0.0
dotnet add package Cynic-Magnit.Tokenization --version 1.0.0
NuGet\Install-Package Cynic-Magnit.Tokenization -Version 1.0.0
<PackageReference Include="Cynic-Magnit.Tokenization" Version="1.0.0" />
paket add Cynic-Magnit.Tokenization --version 1.0.0
#r "nuget: Cynic-Magnit.Tokenization, 1.0.0"
// Install Cynic-Magnit.Tokenization as a Cake Addin
#addin nuget:?package=Cynic-Magnit.Tokenization&version=1.0.0
// Install Cynic-Magnit.Tokenization as a Cake Tool
#tool nuget:?package=Cynic-Magnit.Tokenization&version=1.0.0
Magnit.Tokenization
Tokenize strings into custom tokens using ordered regex operations.
Overview
This library takes a string input and asynchronously parses it to produce a List of Token objects. These Token objects are completely custom and are used to represent whatever distinct parts of the text you would like to separate.
The Tokens are defined in a Specification object, where each entry requires a Regex to match a string and a "Type" string, which is used to identify the token to you.
The order in which you define your Specification entries is the order in which the regex comparisons run. That means the first SpecificationItem whose regex matches a given string in the text determines how that string is tokenized. This usually takes a little trial and error, but it allows you to do things like ignore all whitespace. A good rule of thumb is to always use the "start of line" anchor (^) and not to use multiline flags. You can see working examples below.
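The first-match-wins ordering can be sketched with plain System.Text.RegularExpressions, independent of this library. The rules and input below are hypothetical, and the library's internal scan loop may differ; this only illustrates why rule order matters:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Ordered rules: the first pattern that matches the start of the
// remaining text wins. A null type means "match but skip".
var rules = new (Regex Pattern, string Type)[]
{
    (new Regex(@"^\s+"), null),      // whitespace: matched but skipped
    (new Regex(@"^#\S*"), "TAGGED"), // must come before the catch-all word rule
    (new Regex(@"^\S+"), "WORD"),
};

var types = new List<string>();
string remaining = "hello #tag world";
while (remaining.Length > 0)
{
    foreach (var (pattern, type) in rules)
    {
        Match m = pattern.Match(remaining);
        if (!m.Success) continue;
        if (type != null) types.Add(type); // skip rules produce no token
        remaining = remaining.Substring(m.Length);
        break; // restart from the first rule for the rest of the text
    }
}
Console.WriteLine(string.Join(",", types)); // WORD,TAGGED,WORD
```

If the WORD rule were listed before TAGGED, "#tag" would be tokenized as WORD and the TAGGED rule would never fire.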
You also have the option of defining an asynchronous function to perform string manipulation on the matched string. That way, if you match something with markup, like <custom-tag>, you can strip the unnecessary markup and use custom-tag as your Token's value.
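As a standalone sketch, that transform is just an async function from the matched string to the value you want stored. The delegate shape here is illustrative; the library's actual signature may differ:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical transform: strip the markup characters from a matched tag
// so "<custom-tag>" is stored as "custom-tag" on the resulting Token.
Func<string, Task<string>> stripMarkup = matched =>
    Task.FromResult(matched.Trim('<', '>', '/'));

string value = await stripMarkup("<custom-tag>");
Console.WriteLine(value); // custom-tag
```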
Usage
Create a Tokenizer
public Tokenizer Tokenizer { get; set; } = new Tokenizer(CurrentSpecification);
Create a Specification
public static Magnit.Tokenization.Specification CurrentSpecification { get; set; } = new()
{
// Whitespace
{ new Regex(@"^\s+"), null }, // Returning null as the token Type will skip the match. This regex prevents whitespace from being represented in the returned token list.
// Comments
{ new Regex(@"^\/\/.*"), null },
{ new Regex(@"^\/\*[\s\S]*?\*\/"), null },
// Tagged String
{ new Regex(@"^#.*"), "TAGGED_STRING" },
// Cleaned Tagged String (same pattern as above, so only the first of these two entries will ever match; keep whichever you need)
{ new Regex(@"^#.*"), "CLEANED_TAGGED_STRING", (result) => { return Task.FromResult(result.TrimStart('#')); } }, // Pass in an async function to handle any string manipulation on the matched token
// String
{ new Regex(@"^.*"), "STRING" }, // Catch-all: must come after the more specific rules above, or they will never match
// Utility
{ new Regex(@"^[\s\S]*"), "UNKNOWN" }, // Catch anything left over (including newlines) to prevent errors.
};
Parse into a List of Tokens
private async Task ParseText(string input)
{
List<Token> tokens = await Tokenizer.Parse(input);
foreach (Token token in tokens)
{
Console.WriteLine($"Type: {token.Type}, Start Index: {token.StartIndex}");
}
}
What is this for?
Tokenization is used for breaking plain text up into discrete objects. That could mean splitting text into paragraphs for grammatical tools, or into blocks that are then interpreted as the logic of a programming language.
It is usually just the very first part of a larger process. This library is focused on making tokenization simple and straightforward rather than highly optimized. Most of what stops me from parsing plain text isn't speed; it's the many layers of planning it takes to get something usable. This library crunches that down into three questions:
- What do you want to match?
- How do you want to represent that to your code?
- How do you want to handle the result?
With that simplification, I find it easier to convert plain-text blobs into usable objects for my code to interact with. If it sounds like you would get the same benefit, give this library a try.
Product | Compatible and additional computed target frameworks
---|---
.NET | net6.0 is compatible. Computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.
Dependencies
net6.0: No dependencies.
NuGet packages (1)
Showing the top 1 NuGet package that depends on Cynic-Magnit.Tokenization:
Package | Downloads
---|---
Magnit.BranchingDialog.Development | A C# library for reading Magnit Branching Dialog markup and parsing it into Magnit.Branching dialog objects. This library allows developers to interpret the markup dynamically, then save the generated objects into well-formed, non-recursive, ID-keyed records.