RecordLinkageNet 1.0.0
dotnet add package RecordLinkageNet --version 1.0.0
NuGet\Install-Package RecordLinkageNet -Version 1.0.0
<PackageReference Include="RecordLinkageNet" Version="1.0.0" />
paket add RecordLinkageNet --version 1.0.0
#r "nuget: RecordLinkageNet, 1.0.0"
// Install RecordLinkageNet as a Cake Addin
#addin nuget:?package=RecordLinkageNet&version=1.0.0
// Install RecordLinkageNet as a Cake Tool
#tool nuget:?package=RecordLinkageNet&version=1.0.0
Overview
aim: open-source library that helps compare datasets (CSV files, database tables, classes) in a memory-limited environment
license: BSD 2-Clause
This project is a pure C# port of the very useful Python package recordlinkage. It also tries to make use of the strengths of the C# language (e.g. LINQ, Dataflow).
features
- string comparison with multiple string metrics
- uses a scoring method to calculate the overall similarity
- uses its own data table structure to reduce the memory footprint (in comparison to System.Data.DataTable)
- uses Dataflow to reduce the memory footprint
- uses parallelism to reduce runtime
- limits: right now every data cell is a string
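The Dataflow and parallelism points above can be illustrated with a minimal sketch. It is not the library's internal pipeline; the class `BoundedPipelineSketch`, the method `ScoreAllAsync`, and the buffer size of 1024 are hypothetical names and values chosen for illustration. Only the `System.Threading.Tasks.Dataflow` types are real. The idea is that a bounded buffer keeps the number of candidate pairs held in memory small, while several pairs are scored at the same time.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical sketch, not part of RecordLinkageNet.
public static class BoundedPipelineSketch
{
    // Scores every candidate pair with a caller-supplied function.
    public static async Task<List<double>> ScoreAllAsync(
        IEnumerable<(string Left, string Right)> pairs,
        Func<string, string, double> score)
    {
        var scorerOptions = new ExecutionDataflowBlockOptions
        {
            // At most 1024 pairs are buffered at once (assumed value).
            BoundedCapacity = 1024,
            // Several pairs are scored concurrently.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        var scorer = new TransformBlock<(string Left, string Right), double>(
            p => score(p.Left, p.Right), scorerOptions);

        // The collector runs with the default parallelism of 1,
        // so adding to the plain List<double> is safe.
        var results = new List<double>();
        var collector = new ActionBlock<double>(
            s => results.Add(s),
            new ExecutionDataflowBlockOptions { BoundedCapacity = 1024 });

        scorer.LinkTo(collector, new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var pair in pairs)
        {
            // SendAsync waits while the buffer is full, so a lazily
            // enumerated source is never materialised as a whole.
            await scorer.SendAsync(pair);
        }

        scorer.Complete();
        await collector.Completion;
        return results;
    }
}
```

Because `TransformBlock` keeps input order by default, the returned scores line up with the order of the input pairs.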
platforms:
all platforms that support .NET 6.0:
- Linux
- macOS
- Windows
minimal examples
Using this project should look and feel like using the Python equivalent:
//we create some test data
//see UnitTest.TestDataPerson
List<TestDataPerson> testDataPeopleA = new List<TestDataPerson>
{
new TestDataPerson("Thomas", "Mueller", "Lindetrasse", "Testhausen", "12345"),
new TestDataPerson("Thomas", "Mueller", "Lindenstrasse", "Testcity", "012345"),
new TestDataPerson("Thomas", "Müller", "Lindenstrasse", "Testcity", "012345"),
new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Testhausen", "012342"),
new TestDataPerson("Tomas", "Müller", "Lindenstroad", "Dorf", "012342")
};
DataTableFeather tabA = TableConverter.CreateTableFeatherFromDataObjectList(testDataPeopleA);
//we load some data from sqlite file
DataTableFeather tabB = RecordLinkageNet.Util.SqliteReader.ReadTableFromSqliteFile("filenameof.sqlite","testtablename");
ConditionList conList = new ConditionList();
Condition.StringMethod testMethod = Condition.StringMethod.JaroWinklerSimilarity;
conList.String("NameFirst", "NameFirst", testMethod);
conList.String("Street", "Street", testMethod);
conList.String("PostalCode", "PostalCode", Condition.StringMethod.Exact);
conList.String("NameLast", "NameLast", testMethod);
//configure comparison
Configuration config = Configuration.Instance;
config.AddIndex(new IndexFeather().Create(tabB, tabA));
config.AddConditionList(conList);
config.SetStrategy(Configuration.CalculationStrategy.WeightedConditionSum);
config.SetNumberTransposeModus(NumberTransposeHelper.TransposeModus.LOG10);
//we init a worker
WorkScheduler workScheduler = new WorkScheduler();
var pipeLineCancellation = new CancellationTokenSource();//for optional cancellation
var resultTask = workScheduler.Compare(pipeLineCancellation.Token);
await resultTask;
int amount = resultTask.Result.Count();
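The example above selects a strategy named `WeightedConditionSum`. How the library computes it is not described here, so the following is only a sketch of one common way to fold several per-field similarities into a single score: a weighted average. The class `WeightedScoreSketch`, the method `Combine`, and the weights are hypothetical, and the `LOG10` transpose mode is not modelled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch, not the library's WeightedConditionSum implementation.
public static class WeightedScoreSketch
{
    // Each part is the similarity of one compared field (0.0 to 1.0)
    // together with the weight given to that field.
    public static double Combine(IReadOnlyList<(double Similarity, double Weight)> parts)
    {
        double totalWeight = parts.Sum(p => p.Weight);
        if (totalWeight <= 0)
        {
            throw new ArgumentException("The weights must sum to a positive value.", nameof(parts));
        }

        // Weighted average: fields with a higher weight move the result more.
        return parts.Sum(p => p.Similarity * p.Weight) / totalWeight;
    }
}
```

For instance, `WeightedScoreSketch.Combine(new[] { (1.0, 1.0), (0.0, 1.0) })` gives 0.5, because one of two equally weighted fields matches fully and the other not at all.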
The project implements multiple metrics for string comparison as extension methods:
- HammingDistance
- DamerauLevenshteinDistance
- JaroDistance
- JaroWinklerSimilarity
- ShannonEntropyDistance
using RecordLinkageNet.Core.Distance;
var result1 = "foo".HammingDistance("bar");//3
var result2 = "foo".DamerauLevenshteinDistance("bar");//3
var result3 = "foo".JaroWinklerSimilarity("bar");//0
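To show how a metric can be offered as a string extension, here is a minimal, self-contained Hamming distance. It is a sketch and not the library's code; the names `StringMetricSketch` and `HammingDistanceSketch` are hypothetical. It assumes the convention used by jellyfish, where strings of different length are allowed and every surplus character counts as one difference.

```csharp
using System;

// Hypothetical sketch, not part of RecordLinkageNet.
public static class StringMetricSketch
{
    // Counts the positions at which the two strings differ.
    // Characters beyond the shorter string each count as one difference.
    public static int HammingDistanceSketch(this string left, string right)
    {
        if (left is null) throw new ArgumentNullException(nameof(left));
        if (right is null) throw new ArgumentNullException(nameof(right));

        int shared = Math.Min(left.Length, right.Length);
        int distance = Math.Abs(left.Length - right.Length);

        for (int i = 0; i < shared; i++)
        {
            if (left[i] != right[i])
            {
                distance++;
            }
        }

        return distance;
    }
}
```

With this sketch, `"foo".HammingDistanceSketch("bar")` returns 3, which agrees with the value shown for `HammingDistance` above.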
The distance metrics are tested against results from the Python library jellyfish.
structure:
folder | description |
---|---|
RecordLinkageNet | c# library code |
UnitTest | test for the lib |
thanks to
- jamesturk for jellyfish and his c implementation of string metrics
- jeff-atwood for Shannon Entropy
- wickedshimmy and joannaksk for basic Damerau Levenshtein Distance
Product | Compatible and computed target framework versions |
---|---|
.NET | net6.0 is compatible. net6.0-android was computed. net6.0-ios was computed. net6.0-maccatalyst was computed. net6.0-macos was computed. net6.0-tvos was computed. net6.0-windows was computed. net7.0 was computed. net7.0-android was computed. net7.0-ios was computed. net7.0-maccatalyst was computed. net7.0-macos was computed. net7.0-tvos was computed. net7.0-windows was computed. net8.0 was computed. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. |
Dependencies
- net6.0
  - Microsoft.Bcl.HashCode (>= 1.1.1)
  - Microsoft.Data.Sqlite.Core (>= 7.0.10)
  - System.Threading.Tasks.Dataflow (>= 7.0.0)
Version | Downloads | Last updated |
---|---|---|
1.0.0 | 170 | 9/21/2023 |