Homoglyph 0.1.0
dotnet add package Homoglyph --version 0.1.0
NuGet\Install-Package Homoglyph -Version 0.1.0
<PackageReference Include="Homoglyph" Version="0.1.0" />
paket add Homoglyph --version 0.1.0
#r "nuget: Homoglyph, 0.1.0"
// Install Homoglyph as a Cake Addin #addin nuget:?package=Homoglyph&version=0.1.0 // Install Homoglyph as a Cake Tool #tool nuget:?package=Homoglyph&version=0.1.0
Homoglyph
In orthography and typography, a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar. Wikipedia
Usage
Allows the user to retrieve all homoglyphs for a specific utf-32 code-point as a ReadOnlySpan<uint>
.
using Homoglyph;
ReadOnlySpan<uint> codepoints = Homoglyphs.GetHomoglyphs('\u0020'); //
Console.WriteLine(codepoints.Contains(' ') ? "Yay" : "Nope");
// Output: Yay
API documentation
Continuous Integration
Build | Test | Coverage |
---|---|---|
<sup>Appveyor</sup> | <sup>Appveyor</sup> | <sup>Coveralls</sup> |
Behind the scenes
The library embeds a Homoglyph hash table into its DLL .text section. A specific feature that allows us to load byte sized data is used. If a property exposes a array of byte
or sbyte
as a ReadOnlySpan<byte>
, then the array is not allocation on the heap a 2nd time, but loaded directly from the dll .text section.
The following IL is generated for such a property getter.
using System;
Console.WriteLine(C.P[0]);
static class C {
public static ReadOnlySpan<byte> P => new byte[] { 0x20, };
}
IL_0000: ldsflda uint8 '<PrivateImplementationDetails>'::'36A9E7F1C95B82FFB99743E0C5C4CE95D83C9A430AAC59F84EF3CBFAB6145068'
IL_0005: ldc.i4.1
IL_0006: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1<uint8>::.ctor(void*, int32)
IL_000b: ret
Currently, no array allocation is performed. If we were to replace byte
with nuint
or most other primitives a separate array would be allocated and filled inline.
IL_0000: ldc.i4.1
IL_0001: newarr [System.Runtime]System.UIntPtr
IL_0006: dup
IL_0007: ldc.i4.0
IL_0008: ldc.i4.s 32
IL_000a: conv.i
IL_000b: stelem.i
IL_000c: call valuetype [System.Runtime]System.ReadOnlySpan`1<!0> valuetype [System.Runtime]System.ReadOnlySpan`1<native uint>::op_Implicit(!0[])
IL_0011: ret
Hash map
The native shape of homoglyph groupings would be a array of dynamic sized arrays of codepoints.
uint[][] homoglyphs = new uint[][] {
new uint[] { 20,a0,1680,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,200a,2028,2029,202f,205f },
new uint[] { ... },
};
Lookup of any given group is O(n)
with n
the sum of the number of codepoints over all groups. Additionally it would require many separate heap allocations.
The first step is to flatten this array into a 1-dimensional structure, by concatenating all groupings. We note down the start index and length of each group in a separate array.
nuint[] homoglyph_flat = new uint[] { 20,a0,1680,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,200a,2028,2029,202f,205f, ... };
(ushort Index, byte Length)[] homoglyph_indexer = new[] { ((ushort)0, (byte)18), ...};
This memory optimization brings us down to two allocations, but lookup remains inefficient.
In order to optimize the lookup we reshape homoglyph_flat
into a new array of $8419$ items inserting at a index using a hash function. This same index is then used to store the index and length data. The resulting hash table has a fill of $6184 / 8419 ~= 73%$.
On this table a hash lookup using the codepoint
as its own hash value yields the index and length inside the homoglyph_flat
array. The slice of the array then represents the entire grouping.
uint[] blockValues = new uint[] { 20,a0, ... };
uint[] hashValues = new uint[] { 0, 0, 0, ... };
ushort[] hashIndices = new ushort[] { 0, 0, 0, ... };
byte[] hashIndices = new byte[] { 0, 0, 0, ... };
public static ReadOnlySpan<uint> GetHomoglyphs(uint codepoint) {
int prediction = codepoint % 8419;
while (true) {
if (hashValues[prediction] == codepoint) {
ushort index = hashIndicies[prediction];
byte length = hashLengths[prediction];
return blockValues[index..(index + length)];
}
if (predictionCount <= HashCapacity) {
prediction = (prediction + 1) % 8419;
continue;
}
return default;
}
}
Further optimization requires the embedding the arrays into the DLL .text section, optimizing the probing algorithm, the modulo calculation for 64bit platforms, and preventing probing bound checks for array access when possible.
The result of this optimization can be found in Homoglyph.cs
Product | Versions Compatible and additional computed target framework versions. |
---|---|
.NET | net5.0 was computed. net5.0-windows was computed. net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
.NET Core | netcoreapp3.0 was computed. netcoreapp3.1 was computed. |
.NET Standard | netstandard2.1 is compatible. |
MonoAndroid | monoandroid was computed. |
MonoMac | monomac was computed. |
MonoTouch | monotouch was computed. |
Tizen | tizen60 was computed. |
Xamarin.iOS | xamarinios was computed. |
Xamarin.Mac | xamarinmac was computed. |
Xamarin.TVOS | xamarintvos was computed. |
Xamarin.WatchOS | xamarinwatchos was computed. |
-
.NETStandard 2.1
- System.Memory (>= 4.5.5)
-
net6.0
- System.Memory (>= 4.5.5)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
Version | Downloads | Last updated |
---|---|---|
0.1.0 | 226 | 1/31/2023 |
# Version 0.1.0
- Initial Release
- Implement Homoglyphs