Levenshtypo 1.1.0

There is a newer version of this package available.
See the version list below for details.
dotnet add package Levenshtypo --version 1.1.0                
NuGet\Install-Package Levenshtypo -Version 1.1.0                
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="Levenshtypo" Version="1.1.0" />                
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add Levenshtypo --version 1.1.0                
#r "nuget: Levenshtypo, 1.1.0"                
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install Levenshtypo as a Cake Addin
#addin nuget:?package=Levenshtypo&version=1.1.0

// Install Levenshtypo as a Cake Tool
#tool nuget:?package=Levenshtypo&version=1.1.0                

Levenshtypo - a .NET fuzzy matching string dictionary

Levenshtypo is a library which allows you to search large data sets by fuzzy matching the key strings.

The dataset is loaded upfront as a sequence of key-value pairs. Once loaded it allows searching for the values which are up to a certain Levenshtein Distance away from a query string.

Levenshtein Distance is the number of character insertions, deletions or substitutions required to transform one string into another.

Planned Work (Coming Soon)

  • Automaton to return edit distance
  • State Serialization logic
  • Preserialized state machines offered on GitHub

Installation

Install via Nuget.

Getting Started

// Start with a dataset
IEnumerable<KeyValuePair<string, object>> dataset = ...;

// Index the dataset in a levenshtrie. The levenshtrie should be stored for re-use.
Levenshtrie<object> levenshtrie = Levenshtrie<object>.Create(dataset);

// Search the dataset for keys with edit distance 2 from "hello"
object[] results = levenshtrie.Search("hello", 2);

Samples

These samples and more can be found in the samples directory.

<details> <summary>Suggest similar words</summary>

public class TypoSuggestion
{
    private readonly Levenshtrie<string> _trie;

    public TypoSuggestion(IEnumerable<string> words)
    {
        _trie = Levenshtrie<string>.Create(
            words.Select(w => new KeyValuePair<string, string>(w, w)),
            ignoreCase: true);
    }

    public string[] GetSimilarWords(string word)
    {
        return _trie.Search(word, maxEditDistance: 2);
    }
}

</details>

<details> <summary>Find whether a string matches blacklist</summary>

public class BlacklistDetection
{
    private readonly Levenshtrie<string> _trie;

    public BlacklistDetection(IEnumerable<string> blacklist)
    {
        _trie = Levenshtrie<string>.Create(
            blacklist.Select(w => new KeyValuePair<string, string>(w, w)),
            ignoreCase: true);
    }

    public bool IsBlacklisted(string word)
    {
        return _trie.Search(word, maxEditDistance: 1).Contains(word);
    }
}

</details>

</details>

<details> <summary>Quickly check whether a list of strings matches an input</summary>

// Benchmarks below show that a naive implementation,
// even if it is well written, is 10x slower than using
// an automaton.
// Benchmark run against English language dataset.
//
// | Method          | Mean       | Error     | StdDev    | Allocated |
// |-----------------|-----------:|----------:|----------:|----------:|
// | Using_naive     | 103.190 ms | 1.4706 ms | 1.3756 ms |     214 B |
// | Using_automaton |   8.161 ms | 0.0469 ms | 0.0439 ms |      12 B |

public static string[] Search(string searchWord, string[] against)
{
    var automaton = LevenshtomatonFactory.Instance.Construct(searchWord, maxEditDistance: 2);

    var results = new List<string>();

    foreach (var word in against)
    {
        // Naive version would be:
        // bool matches = LevenshteinDistance.Levenshtein(searchWord, word) <= 2;

        // Automaton version is:
        bool matches = automaton.Matches(word);
        if (matches)
        {
            results.Add(word);
        }
    }

    return results.ToArray();
}

</details>

Limitations

  • No lookup by UTF8 byte arrays.
  • No support for surrogate character pairs.
  • Only ordinal character comparison, whether case sensitive or insensitive.
  • Maximum Levenshtein Distance of 3.

Performance

The English Language dataset used in the benchmarks contains approximately 465,000 words.

<details> <summary>Search all English Language with a fuzzy key</summary>

  • Naive: Compute Levenshtein Distance against all words.
  • Levenshtypo: This library.
  • Dictionary: .NET Dictionary which only works for distance of 0.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.400-preview.0.24324.5
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2


Method Mean Error StdDev Gen0 Allocated
Distance0_Dictionary 8.623 ns 0.0761 ns 0.0712 ns - -
Distance0_Levenshtypo 597.182 ns 2.3004 ns 1.7960 ns 0.0124 208 B
Distance1_Levenshtypo 22,879.582 ns 149.3766 ns 139.7270 ns - 424 B
Distance2_Levenshtypo 305,240.260 ns 2,498.8835 ns 2,337.4572 ns - 1832 B
Distance3_Levenshtypo 1,690,603.294 ns 11,989.1677 ns 11,214.6749 ns - 17905 B
Distance0_Naive 862,346.973 ns 10,007.3755 ns 8,871.2777 ns - 89 B
Distance1_Naive 98,747,597.143 ns 564,828.7729 ns 500,705.9951 ns - 2770 B
Distance2_Naive 98,188,072.000 ns 638,972.9260 ns 597,695.6714 ns - 822 B
Distance3_Naive 99,317,118.889 ns 1,241,670.8616 ns 1,161,459.6944 ns - 4443 B

</details>

<details> <summary>Load all English Language dataset</summary>

  • Levenshtypo: This library.
  • Dictionary: .NET Dictionary for comparison.

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3880/23H2/2023Update/SunValley3)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.400-preview.0.24324.5
  [Host]     : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX2


Method Mean Error StdDev Gen0 Gen1 Gen2 Allocated
English_Dictionary 32,450.80 μs 647.413 μs 770.700 μs 781.2500 781.2500 781.2500 35524.19 KB
English_Levenshtypo 282,953.40 μs 4,376.502 μs 4,093.783 μs 27000.0000 6000.0000 2000.0000 527682.66 KB

</details>

References

The algorithm in this library is based on the 2002 paper Fast String Correction with Levenshtein-Automata by Klaus Schulz and Stoyan Mihov.

I used the following blog posts to further help understand the algorithm.

I used the following repository to obtain the list of English words, used in tests.

Product Compatible and additional computed target framework versions.
.NET net5.0 was computed.  net5.0-windows was computed.  net6.0 was computed.  net6.0-android was computed.  net6.0-ios was computed.  net6.0-maccatalyst was computed.  net6.0-macos was computed.  net6.0-tvos was computed.  net6.0-windows was computed.  net7.0 was computed.  net7.0-android was computed.  net7.0-ios was computed.  net7.0-maccatalyst was computed.  net7.0-macos was computed.  net7.0-tvos was computed.  net7.0-windows was computed.  net8.0 is compatible.  net8.0-android was computed.  net8.0-browser was computed.  net8.0-ios was computed.  net8.0-maccatalyst was computed.  net8.0-macos was computed.  net8.0-tvos was computed.  net8.0-windows was computed. 
.NET Core netcoreapp3.0 was computed.  netcoreapp3.1 was computed. 
.NET Standard netstandard2.1 is compatible. 
MonoAndroid monoandroid was computed. 
MonoMac monomac was computed. 
MonoTouch monotouch was computed. 
Tizen tizen60 was computed. 
Xamarin.iOS xamarinios was computed. 
Xamarin.Mac xamarinmac was computed. 
Xamarin.TVOS xamarintvos was computed. 
Xamarin.WatchOS xamarinwatchos was computed. 
Compatible target framework(s)
Included target framework(s) (in package)
Learn more about Target Frameworks and .NET Standard.
  • .NETStandard 2.1

    • No dependencies.
  • net8.0

    • No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
1.4.0 11,420 8/21/2024
1.3.0 6,942 8/12/2024
1.2.0 110 8/9/2024
1.1.0 67 8/2/2024
1.0.0 64 7/31/2024