Luthor 2.2.0

There is a newer version of this package available.
See the version list below for details.

.NET CLI:
dotnet add package Luthor --version 2.2.0

Package Manager (intended for the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package):
NuGet\Install-Package Luthor -Version 2.2.0

PackageReference (for projects that support it, copy this XML node into the project file):
<PackageReference Include="Luthor" Version="2.2.0" />

Paket CLI:
paket add Luthor --version 2.2.0

Script & Interactive (the #r directive can be used in F# Interactive and Polyglot Notebooks; copy it into the interactive tool or the script source):
#r "nuget: Luthor, 2.2.0"

Cake:
// Install Luthor as a Cake Addin
#addin nuget:?package=Luthor&version=2.2.0

// Install Luthor as a Cake Tool
#tool nuget:?package=Luthor&version=2.2.0

Luthor

Extract structure from any text using a tokenising lexer.

Using Luthor you can convert any single-line or multi-line text into a collection of tokens, where each token is a run of characters of one type together with its content. This gives you access to the text at a higher level of abstraction, so further processing does not have to deal with the specifics of the raw characters.

For each token you get the offset, the line number, the column within the line, and the content.

For example:

Sample text.
Across 3 lines.
With a "multi 'word' string".

This produces a list of tokens like the following (each token also carries its offset, line number, and column):

Letters    : "Sample"
Whitespace : " "
Letters    : "text"
Symbols    : "."
EOL        : \n
Letters    : "Across"
Whitespace : " "
Digits     : "3"
Whitespace : " "
Letters    : "lines"
Symbols    : "."
EOL        : \n
Letters    : "With"
Whitespace : " "
Letters    : "a"
Whitespace : " "
String     : ""multi 'word' string""
Symbols    : "."
EOF        : ""

Note the difference between Letters and String: a String is quoted (with single quotes, double quotes, or backticks) and can have other quotation symbols embedded within it.

This means your code can deal in tokens instead of having to interpret a stream of plain text, which makes the processing steps that follow simpler.

Usage

To get the tokens from a given source text:

var tokens = new Lexer(sourceAsString).GetTokens();
tokens.ForEach(x => Console.WriteLine($"{x.Location.Offset,3}: {x.TokenType} => {x.Content}"));

To do the same, but with each whitespace run compressed to a single space:

var tokens = new Lexer(sourceAsString).GetTokens(true);
tokens.ForEach(x => Console.WriteLine($"{x.Location.Offset,3}: {x.TokenType} => {x.Content}"));
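
Building on the calls above, here is a minimal sketch that keeps only the words from a text by filtering on the token type. It uses only the members shown above (Lexer, GetTokens(), TokenType, Content). The `using Luthor;` namespace and the comparison of the token type by its printed name are assumptions, not confirmed by this page, so check them against the package before relying on them.

```csharp
using System;
using System.Linq;
using Luthor; // assumption: the namespace matches the package name

var source = "Sample text.\nAcross 3 lines.";

// Keep only tokens whose type prints as "Letters". The type is compared by
// name because the exact type of TokenType is not stated on this page.
var words = new Lexer(source)
    .GetTokens()
    .Where(x => x.TokenType.ToString() == "Letters")
    .Select(x => x.Content)
    .ToList();

Console.WriteLine(string.Join(", ", words));
```

Filtering by token type like this is the kind of step that would otherwise need hand-written character scanning.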

To get the tokens from a given source text as a collection of lines:

var lines = new Lexer(sourceAsString).GetTokensAsLines();
foreach (var line in lines)
{
    Console.WriteLine($"Line: {line.Key}");
    line.Value.ForEach(x => Console.WriteLine($" {x.Location.Column,3}: {x.TokenType} => {x.Content}"));
}

GetTokensAsLines() also accepts the optional whitespace compression argument.
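
As a rough sketch of the line-based call with whitespace compression switched on, the following reports how many tokens each line holds. It assumes the argument is passed positionally as a bool, in the same way as GetTokens(true), and that the namespace is `Luthor`; neither is spelled out on this page.

```csharp
using System;
using Luthor; // assumption: the namespace matches the package name

var source = "Sample    text.\nAcross   3   lines.";

// true = compress each whitespace run to a single space
// (assumed to mirror the GetTokens(true) call shown earlier).
var lines = new Lexer(source).GetTokensAsLines(true);

foreach (var line in lines)
{
    // line.Key is the line number; line.Value is that line's token list.
    Console.WriteLine($"Line {line.Key}: {line.Value.Count} tokens");
}
```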

The output tokens

Token types

These are the default definitions of the available tokens.

  • Whitespace - spaces, tabs
  • Letters - upper and lower case English alphabet
  • Digits - 0 to 9
  • Symbols - any of !£$%^&*()-_=+[]{};:'@#~,.<>/?\|
  • String - anything enclosed in either ", ', or a backtick
  • Other - input characters not covered by other types
  • EOL - an LF (\n); any CRs (\r) are ignored
  • EOF - automatically added

Redefining the tokens

You can change the characters underlying the different token types:

var lexer = new Lexer(sourceAsString)
{
    Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",
    Digits = "0123456789",
    Symbols = "!£$%^&*()-_=+[]{};:'@#~,.<>/?\\|",
    Whitespace = " \t",
    Quotes = "'\"`",
};
var tokens = lexer.GetTokens();

The Quotes characters are handled differently from the others. Each one is a valid start/end character (a 'terminator'), and a string must be closed by the same character that opened it.

Other quote characters within the string (i.e. between the terminators) are considered plain content within the current string rather than terminators in their own right.
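
The terminator rule can be illustrated with a small stand-alone sketch. This is not Luthor's implementation; `QuoteRule` and `ReadString` are hypothetical names invented for the illustration, and the handling of an unclosed string (running to the end of the text) is an assumption of the sketch rather than documented behaviour.

```csharp
using System;

var text = "With a \"multi 'word' string\".";
var start = text.IndexOf('"');

// The single quotes inside are plain content; only the matching
// double quote closes the string.
Console.WriteLine(QuoteRule.ReadString(text, start));

// Hypothetical helper, for illustration only.
static class QuoteRule
{
    // Returns the quoted run starting at 'start', including both terminators.
    // If the opening quote is never closed, returns the rest of the text
    // (an assumption made for this sketch).
    public static string ReadString(string text, int start, string quotes = "'\"`")
    {
        char open = text[start];
        if (quotes.IndexOf(open) < 0)
            throw new ArgumentException("Position is not a quote character.");

        // Only the character that opened the string can close it.
        int close = text.IndexOf(open, start + 1);
        return close < 0
            ? text.Substring(start)
            : text.Substring(start, close - start + 1);
    }
}
```

Searching only for the opening character is what lets the other quote characters pass through as content.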

General comments

  • Linux/Unix, macOS, and Windows all have a \n (LF) in their line endings, so \r (CR) is discarded and won't appear in any tokens.
  • There will always be a final EOF token, even for an empty input string.
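
A quick way to check the EOF behaviour described above is to tokenise an empty string and list what comes back. This sketch assumes the `Luthor` namespace; according to the note above, the result should still contain an EOF token.

```csharp
using System;
using Luthor; // assumption: the namespace matches the package name

// Tokenise an empty input and list whatever tokens come back.
var tokens = new Lexer("").GetTokens();

Console.WriteLine($"Token count: {tokens.Count}");
tokens.ForEach(x => Console.WriteLine(x.TokenType));
```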

Frameworks and dependencies

  • Included target framework: .NET Standard 2.1 (netstandard2.1), with no dependencies.
  • Computed as compatible: .NET Core 3.0 and 3.1; .NET 5.0 to 8.0, including their platform-specific targets; MonoAndroid; MonoMac; MonoTouch; Tizen (tizen60); Xamarin.iOS; Xamarin.Mac; Xamarin.TVOS; Xamarin.WatchOS.

Version  Downloads  Last updated
2.3.0    258        8/30/2023
2.2.1    474        6/19/2021
2.2.0    386        6/18/2021
2.1.0    410        6/18/2021
1.0.1    925        8/19/2018