ReLinker
ReLinker is a fast and flexible record linkage library for .NET that helps you find matching records across different datasets. Think of it as a smart way to connect customer records from different databases, even when the data isn't perfectly clean or consistent.
Built on the proven Fellegi-Sunter methodology, ReLinker handles the heavy lifting of comparing records at scale while giving you fine-grained control over how matches are found and scored.
Why ReLinker?
Record linkage is tricky. You need to balance accuracy with performance, handle messy real-world data, and often work with millions of records. ReLinker was designed to solve these challenges:
- Smart blocking cuts the number of comparisons from quadratic growth down to a manageable candidate set
- Multi-level comparisons let you define nuanced similarity rules
- EM training automatically learns the best parameters from your data
- Hybrid memory caching keeps things fast even with large datasets
- Parallel processing takes advantage of modern multi-core machines
Getting Started
Installation
dotnet add package ReLinker
dotnet add package Microsoft.Extensions.Logging.Console # optional, for logging
Your First Linkage Job
Let's say you have two CSV files with customer data that you want to link together. Here's how you'd set that up:
1. Create Your Data Mapper
First, you need to tell ReLinker how to read and clean your data by implementing IRecordMapper:
using System;
using System.Collections.Generic;
using ReLinker;

public sealed class CustomerMapper : IRecordMapper
{
    public Record Map(Dictionary<string, string> row)
    {
        // Every record needs a unique ID - return null to skip this row
        if (!row.TryGetValue("Id", out var id) || string.IsNullOrWhiteSpace(id))
            return null;

        // Helper to safely get values
        string Get(string field) => row.TryGetValue(field, out var value)
            ? value?.Trim() ?? string.Empty
            : string.Empty;

        // Normalize text for better matching
        string Normalize(string text) => text.Trim().ToLowerInvariant();

        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["FirstName"] = Get("FirstName"),
            ["LastName"] = Get("LastName"),
            ["Email"] = Get("Email"),
            ["Phone"] = Get("Phone"),
            // Pre-compute normalized versions for blocking and comparison
            ["FullNameNorm"] = Normalize($"{Get("FirstName")} {Get("LastName")}"),
            ["EmailNorm"] = Normalize(Get("Email")),
        };

        return new Record(id, fields);
    }
}
2. Set Up Data Sources
Create sources from your CSV files using CsvSource:
var leftFile = CsvSource.From("customers_db1.csv", new CustomerMapper());
var rightFile = CsvSource.From("customers_db2.csv", new CustomerMapper());
3. Configure Your Output Sink
Implement IScoredPairSink to handle the results:
public sealed class MatchSink : IScoredPairSink
{
    private readonly StreamWriter _writer;

    public MatchSink(string outputPath)
    {
        _writer = new StreamWriter(File.Create(outputPath));
        _writer.WriteLine("LeftId,RightId,MatchScore,LeftName,RightName");
    }

    public ValueTask WriteAsync(string leftId, string rightId, double score,
        Record left, Record right)
    {
        var leftName = left.Fields.GetValueOrDefault("FullNameNorm", "");
        var rightName = right.Fields.GetValueOrDefault("FullNameNorm", "");
        _writer.WriteLine($"{leftId},{rightId},{score:F4},{leftName},{rightName}");
        return ValueTask.CompletedTask;
    }

    public async ValueTask DisposeAsync()
    {
        await _writer.FlushAsync();
        _writer.Dispose();
    }
}
4. Configure Linkage Settings
var settings = new LinkSettings
{
    // Use probability-based scoring (more intuitive than raw scores)
    UseProbabilityThreshold = true,
    OutputScoreAsProbability = true,
    MatchThreshold = 0.90,          // Only accept matches with 90%+ confidence

    // Performance optimizations
    UseValueSpecificU = true,       // Better accuracy for exact matches
    EnableBucketMemoryCache = true, // Keep frequently used data in memory
    BucketCacheMaxBuckets = 32,
    UseParallelism = true,
    MaxDegreeOfParallelism = 0      // Use all available CPU cores
};
5. Set Up Blocking Rules
// Only compare records that share a name prefix OR an email prefix
settings.Blocking.Add(Block.OnPrefix("FullNameNorm", 4)); // First 4 characters of name
settings.Blocking.Add(Block.OnPrefix("EmailNorm", 10));   // First 10 characters of email
6. Define Comparison Logic
// Multi-level name comparison (most sophisticated approach)
settings.Comparisons.Add(
    Compare.Levels(
        "FullNameNorm",
        CompareLevels.Exact("exact_match"),                   // Perfect match
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"), // Almost perfect
        CompareLevels.JaroWinklerAtLeast(0.85, "close"),      // Pretty similar
        CompareLevels.NullOrEmpty("missing_name"),            // Handle missing data
        CompareLevels.Else("different")                       // Everything else
    )
);

// Provide probabilities for each level (these will be refined during training)
var nameMatchProbs = new double[] { 0.95, 0.80, 0.50, 0.10, 0.05 };   // m probabilities
var nameNoMatchProbs = new double[] { 0.01, 0.05, 0.20, 0.30, 0.95 }; // u probabilities
settings.LevelMProbsPerComparison = new() { nameMatchProbs };
settings.LevelUProbsPerComparison = new() { nameNoMatchProbs };
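These m/u pairs drive the scoring. Under the Fellegi-Sunter model ReLinker is built on, each observed level contributes a weight of log(m/u) (commonly base 2) to the match score, which is why the "exact_match" level counts so heavily. A quick sanity check of the values above, written as plain C# arithmetic rather than any ReLinker API:

// Standard Fellegi-Sunter arithmetic, not a ReLinker call:
// each observed level adds log2(m/u) to the log-likelihood ratio.
double Weight(double m, double u) => Math.Log2(m / u);
Console.WriteLine(Weight(0.95, 0.01)); // exact_match: ≈ +6.57 bits of evidence
Console.WriteLine(Weight(0.05, 0.95)); // different:   ≈ -4.25 bits of evidence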
7. Run the Linkage
// Set up logging (optional but helpful)
using Microsoft.Extensions.Logging;

using var loggerFactory = LoggerFactory.Create(b =>
    b.AddSimpleConsole().SetMinimumLevel(LogLevel.Information));
var logger = loggerFactory.CreateLogger<Linker>();

// Configure EM training options
var trainingOptions = new EMOptions
{
    MaxIterations = 20,
    Tolerance = 1e-6,
    UseParallelism = true,
    UseBucketMemoryCache = true,
    BucketCacheMaxBuckets = 64
};

// Create output sink
await using var outputSink = new MatchSink("customer_matches.csv");

// Run the entire process: train parameters, then find matches
await Linker.RunAsync(
    settings,
    leftFile,
    rightFile,
    outputSink,
    logger,
    train: true,              // Let EM training improve the initial parameters
    emOptions: trainingOptions
);
Console.WriteLine("Linkage complete! Check customer_matches.csv for results.");
Complete API Reference
Core Classes
Record
Represents a single record with an ID and field dictionary.
Constructor:
Record(string id, Dictionary<string, string> fields) - Creates a new record
Properties:
string Id { get; } - Unique identifier for this record
Dictionary<string, string> Fields { get; } - Case-insensitive field dictionary
Methods:
bool TryGet(string fieldName, out string value) - Safely retrieve a field value
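A quick illustration of the Record API (the values are made up):

// Constructing and querying a Record directly; field lookup is case-insensitive.
var record = new Record("42", new Dictionary<string, string>
{
    ["FirstName"] = "Ada",
    ["LastName"] = "Lovelace"
});
if (record.TryGet("lastname", out var surname))
    Console.WriteLine(surname); // prints "Lovelace"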
LinkSettings
Main configuration class for the linkage process.
Linkage Behavior Properties:
LinkType LinkType { get; set; } - LinkOnly (default), DedupeOnly, or LinkAndDedupe
double MatchThreshold { get; set; } - Minimum score to accept a match (default: 0.0)
bool UseProbabilityThreshold { get; set; } - Threshold on probability vs raw LLR (default: true)
bool OutputScoreAsProbability { get; set; } - Output probability vs LLR (default: true)
double ProbabilityTwoRandomRecordsMatch { get; set; } - Prior probability (default: 1e-6)
Performance Properties:
bool UseParallelism { get; set; } - Enable parallel processing (default: false)
int MaxDegreeOfParallelism { get; set; } - CPU cores to use, 0 = all (default: 0)
int BucketCount { get; set; } - Number of disk buckets (default: 4096)
bool EnableBucketMemoryCache { get; set; } - Hybrid memory+disk cache (default: true)
int BucketCacheMaxBuckets { get; set; } - Max buckets in memory (default: 32)
int OutputBatchSize { get; set; } - Batch size for sink writes (default: 1024)
Advanced Properties:
bool UseValueSpecificU { get; set; } - Use term frequency for exact matches (default: false)
long TargetBucketFileSizeBytes { get; set; } - Target bucket file size (default: 64MB)
int RightSampleForSizing { get; set; } - Records to sample for auto-sizing (default: 10000)
Configuration Lists:
List<IBlockRule> Blocking { get; } - Blocking rules to reduce candidate pairs
List<IComparison> Comparisons { get; } - Field comparison rules
double[] MProbs { get; set; } - Match probabilities for non-levelled comparisons
double[] UProbs { get; set; } - Non-match probabilities for non-levelled comparisons
List<double[]> LevelMProbsPerComparison { get; set; } - Match probabilities per level
List<double[]> LevelUProbsPerComparison { get; set; } - Non-match probabilities per level
Linker
Main class for running record linkage.
Static Methods:
static Linker Create(LinkSettings settings, ILogger<Linker> logger) - Create a new linker instance
static Task RunAsync(LinkSettings settings, IRecordSource left, IRecordSource right, IScoredPairSink sink, ILogger<Linker> logger, bool train = false, EMOptions emOptions = null, CancellationToken cancellationToken = default) - One-shot convenience method to train and predict
Instance Methods:
Linker InputLeft(IRecordSource left) - Set the left dataset source (fluent)
Linker InputRight(IRecordSource right) - Set the right dataset source (fluent)
Task TrainAsync(EMOptions options = null, CancellationToken cancellationToken = default) - Run EM training to learn parameters
Task PredictAsync(IScoredPairSink sink, CancellationToken cancellationToken = default) - Find and output matches
EMOptions
Configuration for Expectation-Maximization training.
Convergence Properties:
int MaxIterations { get; set; } - Maximum EM iterations (default: 20)
double Tolerance { get; set; } - Convergence tolerance on log-likelihood (default: 1e-5)
double Smoothing { get; set; } - Laplace smoothing to avoid zeros (default: 1e-6)
Estimation Properties:
bool EstimateLambda { get; set; } - Learn the match prior probability (default: false)
Performance Properties:
bool UseParallelism { get; set; } - Parallel E-step processing (default: true)
bool DeduplicateCandidatesPerLeft { get; set; } - Remove duplicate right candidates (default: true)
bool UseBucketMemoryCache { get; set; } - Use hybrid memory cache (default: true)
int BucketCacheMaxBuckets { get; set; } - Max buckets in memory (default: 16)
Sampling Properties:
int? SampleLeftEveryN { get; set; } - Subsample left records (default: null)
int? MaxCandidatePairsPerIteration { get; set; } - Cap pairs per iteration (default: null)
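For very large left datasets, the sampling properties trade training fidelity for speed. A hedged example of what that configuration might look like:

// Faster training on big data: score every 20th left record and cap pairs.
var fastTraining = new EMOptions
{
    SampleLeftEveryN = 20,                   // subsample the left source
    MaxCandidatePairsPerIteration = 500_000, // hard cap per EM iteration
    EstimateLambda = true                    // also learn the match prior
};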
Data Sources and Sinks
IRecordSource
Interface for reading records asynchronously.
Methods:
IAsyncEnumerable<Record> ReadAsync(CancellationToken cancellationToken = default) - Stream records
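Because the interface is a single method, a custom source is easy to write. Here is a minimal in-memory implementation, useful for tests (InMemorySource is our own name, not part of ReLinker):

using System.Runtime.CompilerServices;

public sealed class InMemorySource : IRecordSource
{
    private readonly IReadOnlyList<Record> _records;

    public InMemorySource(IReadOnlyList<Record> records) => _records = records;

    public async IAsyncEnumerable<Record> ReadAsync(
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        await Task.Yield(); // satisfy the async iterator without blocking
        foreach (var record in _records)
        {
            cancellationToken.ThrowIfCancellationRequested();
            yield return record;
        }
    }
}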
CsvSource : IRecordSource
Built-in CSV file reader.
Static Methods:
static CsvSource From(string path, IRecordMapper mapper) - Create from file path and mapper
IRecordMapper
Interface for converting raw CSV rows to Records.
Methods:
Record Map(Dictionary<string, string> row) - Convert a CSV row, return null to skip
IScoredPairSink : IAsyncDisposable
Interface for handling match results.
Methods:
ValueTask WriteAsync(string id1, string id2, double score, Record record1, Record record2) - Handle a match
ValueTask DisposeAsync() - Clean up resources
Blocking Rules
Blocking rules reduce the number of candidate pairs by only comparing records that share certain characteristics.
Block
Static Factory Class
Methods:
static IBlockRule OnPrefix(string fieldName, int prefixLength, bool toLower = true) - Match on first N characters
static IBlockRule OnExact(string fieldName, bool toLower = true) - Exact field match
static IBlockRule OnConcatExact(char separator = '|', bool toLower = true, params string[] fields) - Exact match on concatenated fields
static IBlockRule OnInitialAndSurnamePrefix(string firstNameField, string surnameField, int surnamePrefix, bool toLower = true) - First initial + surname prefix
static IBlockRule OnSoundex(string fieldName) - Phonetic matching using Soundex
Examples:
settings.Blocking.Add(Block.OnPrefix("LastName", 3)); // First 3 chars of surname
settings.Blocking.Add(Block.OnExact("ZipCode")); // Exact zip code match
settings.Blocking.Add(Block.OnConcatExact('_', true, "City", "State")); // City_State key
settings.Blocking.Add(Block.OnSoundex("LastName")); // Phonetically similar surnames
Comparison Methods
Compare
Static Factory Class
Single-Field Continuous Comparisons:
static IComparison Jaro(string fieldName) - Jaro string similarity
static IComparison JaroWinkler(string fieldName) - Jaro-Winkler similarity (emphasizes common prefixes)
static IComparison Levenshtein(string fieldName) - Normalized Levenshtein edit distance
static IComparison TfIdf(string fieldName, Dictionary<string, double> idf) - TF-IDF cosine similarity
Multi-Level Comparisons:
static IComparison Levels(string fieldName, params IComparisonLevel[] levels) - Create a multi-level comparison
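The continuous comparisons pair with the flat MProbs/UProbs arrays on LinkSettings rather than the per-level lists; presumably one entry per comparison, in the order added, as in this sketch:

// Non-levelled comparisons with the flat m/u arrays (assumed: one entry each,
// in registration order; starting guesses that EM training can refine).
settings.Comparisons.Add(Compare.JaroWinkler("FullNameNorm"));
settings.Comparisons.Add(Compare.Levenshtein("EmailNorm"));
settings.MProbs = new[] { 0.90, 0.85 };
settings.UProbs = new[] { 0.05, 0.02 };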
CompareLevels
Static Factory Class
Factory for creating individual comparison levels.
String Matching Levels:
static IComparisonLevel Exact(bool ignoreCase = true, string label = "exact")
- Perfect matchstatic IComparisonLevel JaroAtLeast(double threshold, string label = null)
- Jaro ≥ thresholdstatic IComparisonLevel JaroWinklerAtLeast(double threshold, string label = null)
- Jaro-Winkler ≥ thresholdstatic IComparisonLevel LevenshteinSimilarityAtLeast(double threshold, string label = null)
- Levenshtein similarity ≥ thresholdstatic IComparisonLevel JaccardTokensAtLeast(double threshold, string label = null)
- Jaccard token similarity ≥ threshold
Specialized Levels:
static IComparisonLevel NullOrEmpty(string label = "null_or_empty")
- Either field is null/emptystatic IComparisonLevel SoundexEqual(string label = "soundex_equal")
- Phonetically equalstatic IComparisonLevel NumericWithin(double tolerance, string label = null)
- Numeric values within tolerancestatic IComparisonLevel DateWithinDays(int days, string label = null)
- Dates within N daysstatic IComparisonLevel Else(string label = "else")
- Catch-all level (always matches)
Example Multi-Level Comparison:
settings.Comparisons.Add(
    Compare.Levels(
        "PersonName",
        CompareLevels.Exact("exact"),                         // Perfect match
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"), // Almost identical
        CompareLevels.JaroWinklerAtLeast(0.85, "close"),      // Pretty similar
        CompareLevels.SoundexEqual("sounds_alike"),           // Phonetically similar
        CompareLevels.NullOrEmpty("missing"),                 // Handle missing data
        CompareLevels.Else("different")                       // Everything else
    )
);
String Similarity Classes
All similarity classes implement IStringSimilarity with a single method:
double Compute(string inputString1, string inputString2) - Returns similarity in [0,1]
JaroSimilarity
Classic Jaro string similarity algorithm. Good for names and short strings.
JaroWinklerSimilarity
Jaro-Winkler algorithm that gives extra weight to common prefixes.
Constructor:
JaroWinklerSimilarity(double prefixScale = 0.1, int maxPrefix = 4) - Customize prefix weighting
LevenshteinSimilarity
Optimized Levenshtein edit distance, converted to similarity (1 - distance/maxLength).
JaccardTokenSimilarity
Jaccard similarity on word tokens. Good for addresses and multi-word fields.
TfIdfSimilarity
TF-IDF cosine similarity using pre-computed IDF weights.
Constructor:
TfIdfSimilarity(Dictionary<string, double> idf) - Provide IDF dictionary
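The similarity classes also work standalone, which is handy for tuning thresholds. The sketch below builds a toy IDF dictionary by hand using the usual idf(t) = ln(N / df(t)) formula; in practice you would compute it over your right dataset. Printed scores are illustrative, not exact outputs:

// Standalone use of the documented constructors and Compute method.
IStringSimilarity jw = new JaroWinklerSimilarity(prefixScale: 0.1, maxPrefix: 4);
Console.WriteLine(jw.Compute("jonathan smith", "johnathan smith")); // close to 1.0

// Toy IDF over a 3-"document" corpus: idf(t) = ln(N / df(t)).
var idf = new Dictionary<string, double>
{
    ["main"] = Math.Log(3.0 / 2.0), // appears in 2 of 3 records
    ["st"]   = Math.Log(3.0 / 3.0), // appears everywhere, weight 0
    ["12"]   = Math.Log(3.0 / 1.0)  // rare token, highest weight
};
var tfidf = new TfIdfSimilarity(idf);
Console.WriteLine(tfidf.Compute("12 main st", "main st 12")); // token order ignored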
Advanced Features
Value-Specific U
When UseValueSpecificU = true, exact matches use the actual frequency of the matched value in the right dataset instead of the learned u parameter. This dramatically improves accuracy for rare exact matches.
Hybrid Memory Caching
ReLinker can keep frequently-accessed bucket files parsed in memory to avoid repeated disk reads:
EnableBucketMemoryCache = true - Enable the feature
BucketCacheMaxBuckets = N - Keep up to N buckets in memory (LRU eviction)
Parallel Processing
Enable parallel processing for both training and prediction:
UseParallelism = true - Enable parallel candidate scoring
MaxDegreeOfParallelism = 0 - Use all CPU cores (or specify a number)
Common Patterns and Recipes
High-Precision Linkage (Few False Positives)
settings.MatchThreshold = 0.95;    // Very strict threshold
settings.UseValueSpecificU = true; // Better handling of rare exact matches

// Use restrictive blocking
settings.Blocking.Add(Block.OnExact("Email"));     // Only compare identical emails
settings.Blocking.Add(Block.OnPrefix("Phone", 6)); // Similar phone prefixes

// Multi-level comparison with exact match having high weight
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact("exact"),                    // Very high m, very low u
        CompareLevels.JaroWinklerAtLeast(0.95, "close"), // Still high confidence
        CompareLevels.Else("different")                  // Low confidence
    )
);
High-Recall Linkage (Find More Matches)
settings.MatchThreshold = 0.75; // More lenient threshold

// Multiple blocking strategies for broader coverage
settings.Blocking.Add(Block.OnPrefix("LastName", 3));
settings.Blocking.Add(Block.OnPrefix("FirstName", 2));
settings.Blocking.Add(Block.OnSoundex("LastName")); // Phonetic matching
settings.Blocking.Add(Block.OnInitialAndSurnamePrefix("FirstName", "LastName", 4));

// More granular comparison levels
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact("exact"),
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"),
        CompareLevels.JaroWinklerAtLeast(0.90, "close"),
        CompareLevels.JaroWinklerAtLeast(0.80, "somewhat_close"),
        CompareLevels.SoundexEqual("sounds_alike"),
        CompareLevels.Else("different")
    )
);
Large Dataset Processing
settings.BucketCount = 8192;             // More, smaller bucket files
settings.EnableBucketMemoryCache = true;
settings.BucketCacheMaxBuckets = 128;    // Use more memory for caching
settings.OutputBatchSize = 4096;         // Larger output batches
settings.UseParallelism = true;
settings.MaxDegreeOfParallelism = 0;     // Use all cores

var emOptions = new EMOptions
{
    UseBucketMemoryCache = true,
    BucketCacheMaxBuckets = 256,              // Even more cache for training
    UseParallelism = true,
    MaxCandidatePairsPerIteration = 1_000_000 // Cap training pairs per iteration
};
Multi-Field Comparison
// Name comparison with multiple levels
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact(),
        CompareLevels.JaroWinklerAtLeast(0.90),
        CompareLevels.SoundexEqual(),
        CompareLevels.Else()
    )
);

// Address comparison
settings.Comparisons.Add(
    Compare.Levels("Address",
        CompareLevels.Exact(),
        CompareLevels.JaroWinklerAtLeast(0.85),
        CompareLevels.JaccardTokensAtLeast(0.75), // Good for addresses
        CompareLevels.Else()
    )
);

// Phone number comparison
settings.Comparisons.Add(
    Compare.Levels("Phone",
        CompareLevels.Exact(),
        CompareLevels.NumericWithin(0), // Treat as numbers if possible
        CompareLevels.Else()
    )
);

// Provide separate m/u arrays for each comparison
settings.LevelMProbsPerComparison = new()
{
    new double[] { 0.95, 0.80, 0.60, 0.05 }, // Name
    new double[] { 0.90, 0.70, 0.50, 0.05 }, // Address
    new double[] { 0.98, 0.85, 0.02 }        // Phone
};
settings.LevelUProbsPerComparison = new()
{
    new double[] { 0.01, 0.05, 0.15, 0.95 }, // Name
    new double[] { 0.02, 0.10, 0.20, 0.95 }, // Address
    new double[] { 0.001, 0.01, 0.99 }       // Phone
};
Troubleshooting
"Not enough matches found"
- Lower your
MatchThreshold
- Add more blocking rules for better coverage (
Block.OnPrefix
,Block.OnSoundex
) - Check if your field names are consistent between left/right datasets
- Verify your data mapper is working correctly with sample data
- Use
train: true
to let EM improve your initial parameter guesses
"Too many false positives"
- Increase your
MatchThreshold
- Add more discriminative comparison levels
- Improve data normalization in your
IRecordMapper
- Set
UseValueSpecificU = true
for better exact match handling - Use more restrictive blocking rules
"Process is too slow"
- Set
UseParallelism = true
and tuneMaxDegreeOfParallelism
- Increase
BucketCacheMaxBuckets
if you have available memory - Make your blocking rules more restrictive (fewer candidates per record)
- Increase
OutputBatchSize
for slow output sinks - Consider using
EMOptions.SampleLeftEveryN
for faster training on large datasets
"Out of memory errors"
- Reduce
BucketCacheMaxBuckets
- Increase
BucketCount
to create smaller bucket files - Use more restrictive blocking to reduce candidate set size
- Process data in smaller chunks
"Training not converging"
- Increase
EMOptions.MaxIterations
(try 30-50) - Decrease
EMOptions.Tolerance
(try 1e-7) - Check that your initial m/u parameters make sense
- Ensure you have enough training data
- Verify your comparison levels are well-designed
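For the convergence tips above, the corresponding EMOptions tweaks look like this (the values are suggestions, not requirements):

// Looser convergence budget for stubborn training runs.
var emOptions = new EMOptions
{
    MaxIterations = 40, // default: 20
    Tolerance = 1e-7,   // default: 1e-5
    Smoothing = 1e-6    // Laplace smoothing keeps m/u away from exact zeros
};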
Performance Characteristics
ReLinker is designed to handle large datasets efficiently:
- Memory usage: Configurable via bucket cache settings. Can run in low-memory mode (cache disabled) or high-memory mode (large cache).
- Disk I/O: Minimized through hybrid caching and sequential file access patterns.
- CPU scaling: Near-linear scaling with core count for candidate scoring (I/O remains single-threaded).
- Typical throughput: 10K-100K candidate pairs per second on modern hardware, depending on comparison complexity.
Compatible frameworks
- .NET: net8.0 is compatible; net9.0, net10.0, and the platform-specific TFMs of each (android, browser, ios, maccatalyst, macos, tvos, windows) were computed as compatible.
Dependencies (net8.0)
- Microsoft.Extensions.Logging.Console (>= 9.0.6)