ReLinker 1.2.0

ReLinker

ReLinker is a fast and flexible record linkage library for .NET that helps you find matching records across different datasets. Think of it as a smart way to connect customer records from different databases, even when the data isn't perfectly clean or consistent.

Built on the proven Fellegi-Sunter methodology, ReLinker handles the heavy lifting of comparing records at scale while giving you fine-grained control over how matches are found and scored.
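
Concretely, Fellegi-Sunter scoring sums a log-likelihood-ratio weight per comparison — how much more likely the observed agreement is among true matches (m) than among non-matches (u) — and folds in a prior match probability. A minimal numeric sketch (standalone; the helper and probabilities are illustrative, not ReLinker's API):

```csharp
using System;

// m[i] = P(agreement on field i | match), u[i] = P(agreement on field i | non-match),
// lambda = prior probability that a random pair is a match.
static double MatchProbability(double[] m, double[] u, double lambda)
{
    double logOdds = Math.Log(lambda / (1 - lambda));
    for (int i = 0; i < m.Length; i++)
        logOdds += Math.Log(m[i] / u[i]);   // per-field log-likelihood-ratio weight
    double odds = Math.Exp(logOdds);
    return odds / (1 + odds);               // back from log-odds to probability
}

// Name and email both agree. Agreement is rare among non-matches,
// so two fields of evidence overwhelm a 1-in-10,000 prior.
double p = MatchProbability(new[] { 0.95, 0.90 }, new[] { 0.01, 0.001 }, 1e-4);
Console.WriteLine($"{p:F4}"); // 0.8953
```

This is the same quantity ReLinker thresholds against when `UseProbabilityThreshold = true`.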

Why ReLinker?

Record linkage is tricky. You need to balance accuracy with performance, handle messy real-world data, and often work with millions of records. ReLinker was designed to solve these challenges:

  • Smart blocking cuts candidate comparisons from quadratic growth down to a manageable number
  • Multi-level comparisons let you define nuanced similarity rules
  • EM training automatically learns the best parameters from your data
  • Hybrid memory caching keeps things fast even with large datasets
  • Parallel processing takes advantage of modern multi-core machines
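
To make the first bullet concrete: linking two 100,000-record datasets exhaustively means 10^10 candidate pairs, whereas blocking only pairs records that share a key. A toy LINQ illustration (data invented) — note the built-in trade-off that "smyth, jon" never meets "smith, j." under this key:

```csharp
using System;
using System.Linq;

var left  = new[] { "smith, john", "smyth, jon", "jones, ann", "brown, bob" };
var right = new[] { "smith, j.", "jones, anne", "green, cal" };

// Naive: every left record against every right record.
int naive = left.Length * right.Length;

// Blocked on a 3-character prefix key: only same-key records become candidates.
var blocked = (from l in left
               join r in right on l[..3] equals r[..3]
               select (l, r)).ToList();

Console.WriteLine($"{naive} naive pairs vs {blocked.Count} blocked pairs"); // 12 vs 2
```

Combining several blocking rules (as shown later) recovers pairs that any single key would miss.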

Getting Started

Installation

dotnet add package ReLinker
dotnet add package Microsoft.Extensions.Logging.Console  # optional, for logging

Your First Linkage Job

Let's say you have two CSV files with customer data that you want to link together. Here's how you'd set that up:

1. Create Your Data Mapper

First, you need to tell ReLinker how to read and clean your data by implementing IRecordMapper:

using ReLinker;

public sealed class CustomerMapper : IRecordMapper
{
    public Record Map(Dictionary<string, string> row)
    {
        // Every record needs a unique ID - return null to skip this row
        if (!row.TryGetValue("Id", out var id) || string.IsNullOrWhiteSpace(id))
            return null;

        // Helper to safely get values
        string Get(string field) => row.TryGetValue(field, out var value) ? 
            value?.Trim() ?? string.Empty : string.Empty;
        
        // Normalize text for better matching
        string Normalize(string text) => text.Trim().ToLowerInvariant();

        var fields = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["FirstName"] = Get("FirstName"),
            ["LastName"] = Get("LastName"),
            ["Email"] = Get("Email"),
            ["Phone"] = Get("Phone"),
            
            // Pre-compute normalized versions for blocking and comparison
            ["FullNameNorm"] = Normalize($"{Get("FirstName")} {Get("LastName")}"),
            ["EmailNorm"] = Normalize(Get("Email")),
        };

        return new Record(id, fields);
    }
}
2. Set Up Data Sources

Create sources from your CSV files using CsvSource:

var leftFile = CsvSource.From("customers_db1.csv", new CustomerMapper());
var rightFile = CsvSource.From("customers_db2.csv", new CustomerMapper());
3. Configure Your Output Sink

Implement IScoredPairSink to handle the results:

public sealed class MatchSink : IScoredPairSink
{
    private readonly StreamWriter _writer;
    
    public MatchSink(string outputPath)
    {
        _writer = new StreamWriter(File.Create(outputPath));
        _writer.WriteLine("LeftId,RightId,MatchScore,LeftName,RightName");
    }

    public ValueTask WriteAsync(string leftId, string rightId, double score, 
                                Record left, Record right)
    {
        var leftName = left.Fields.GetValueOrDefault("FullNameNorm", "");
        var rightName = right.Fields.GetValueOrDefault("FullNameNorm", "");
        
        _writer.WriteLine($"{leftId},{rightId},{score:F4},{leftName},{rightName}");
        return ValueTask.CompletedTask;
    }

    public async ValueTask DisposeAsync()
    {
        await _writer.FlushAsync();
        await _writer.DisposeAsync();
    }
}
4. Configure Linkage Settings
var settings = new LinkSettings
{
    // Use probability-based scoring (more intuitive than raw scores)
    UseProbabilityThreshold = true,
    OutputScoreAsProbability = true,
    MatchThreshold = 0.90,  // Only accept matches with 90%+ confidence
    
    // Performance optimizations
    UseValueSpecificU = true,           // Better accuracy for exact matches
    EnableBucketMemoryCache = true,     // Keep frequently used data in memory
    BucketCacheMaxBuckets = 32,
    UseParallelism = true,
    MaxDegreeOfParallelism = 0         // Use all available CPU cores
};
5. Set Up Blocking Rules
// Only compare records that have similar names OR the same email domain
settings.Blocking.Add(Block.OnPrefix("FullNameNorm", 4));  // First 4 characters of name
settings.Blocking.Add(Block.OnPrefix("EmailNorm", 10));    // Email prefix matching
6. Define Comparison Logic
// Multi-level name comparison (most sophisticated approach)
settings.Comparisons.Add(
    Compare.Levels(
        "FullNameNorm",
        CompareLevels.Exact(label: "exact_match"),             // Perfect match
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"),  // Almost perfect
        CompareLevels.JaroWinklerAtLeast(0.85, "close"),       // Pretty similar
        CompareLevels.NullOrEmpty("missing_name"),             // Handle missing data
        CompareLevels.Else("different")                        // Everything else
    )
);

// Provide probabilities for each level (these will be refined during training)
var nameMatchProbs = new double[] { 0.95, 0.80, 0.50, 0.10, 0.05 };  // m probabilities
var nameNoMatchProbs = new double[] { 0.01, 0.05, 0.20, 0.30, 0.95 }; // u probabilities

settings.LevelMProbsPerComparison = new() { nameMatchProbs };
settings.LevelUProbsPerComparison = new() { nameNoMatchProbs };
7. Run the Linkage
// Set up logging (optional but helpful)
using var loggerFactory = LoggerFactory.Create(b => 
    b.AddSimpleConsole().SetMinimumLevel(LogLevel.Information));
var logger = loggerFactory.CreateLogger<Linker>();

// Configure EM training options
var trainingOptions = new EMOptions
{
    MaxIterations = 20,
    Tolerance = 1e-6,
    UseParallelism = true,
    UseBucketMemoryCache = true,
    BucketCacheMaxBuckets = 64
};

// Create output sink
await using var outputSink = new MatchSink("customer_matches.csv");

// Run the entire process: train parameters, then find matches
await Linker.RunAsync(
    settings,
    leftFile,
    rightFile, 
    outputSink,
    logger,
    train: true,  // Let EM training improve the initial parameters
    emOptions: trainingOptions
);

Console.WriteLine("Linkage complete! Check customer_matches.csv for results.");

Complete API Reference

Core Classes

Record

Represents a single record with an ID and field dictionary.

Constructor:

  • Record(string id, Dictionary<string, string> fields) - Creates a new record

Properties:

  • string Id { get; } - Unique identifier for this record
  • Dictionary<string, string> Fields { get; } - Case-insensitive field dictionary

Methods:

  • bool TryGet(string fieldName, out string value) - Safely retrieve a field value
LinkSettings

Main configuration class for the linkage process.

Linkage Behavior Properties:

  • LinkType LinkType { get; set; } - LinkOnly (default), DedupeOnly, or LinkAndDedupe
  • double MatchThreshold { get; set; } - Minimum score to accept a match (default: 0.0)
  • bool UseProbabilityThreshold { get; set; } - Threshold on probability vs raw LLR (default: true)
  • bool OutputScoreAsProbability { get; set; } - Output probability vs LLR (default: true)
  • double ProbabilityTwoRandomRecordsMatch { get; set; } - Prior probability (default: 1e-6)

Performance Properties:

  • bool UseParallelism { get; set; } - Enable parallel processing (default: false)
  • int MaxDegreeOfParallelism { get; set; } - CPU cores to use, 0 = all (default: 0)
  • int BucketCount { get; set; } - Number of disk buckets (default: 4096)
  • bool EnableBucketMemoryCache { get; set; } - Hybrid memory+disk cache (default: true)
  • int BucketCacheMaxBuckets { get; set; } - Max buckets in memory (default: 32)
  • int OutputBatchSize { get; set; } - Batch size for sink writes (default: 1024)

Advanced Properties:

  • bool UseValueSpecificU { get; set; } - Use term frequency for exact matches (default: false)
  • long TargetBucketFileSizeBytes { get; set; } - Target bucket file size (default: 64MB)
  • int RightSampleForSizing { get; set; } - Records to sample for auto-sizing (default: 10000)

Configuration Lists:

  • List<IBlockRule> Blocking { get; } - Blocking rules to reduce candidate pairs
  • List<IComparison> Comparisons { get; } - Field comparison rules
  • double[] MProbs { get; set; } - Match probabilities for non-levelled comparisons
  • double[] UProbs { get; set; } - Non-match probabilities for non-levelled comparisons
  • List<double[]> LevelMProbsPerComparison { get; set; } - Match probabilities per level
  • List<double[]> LevelUProbsPerComparison { get; set; } - Non-match probabilities per level
Linker

Main class for running record linkage.

Static Methods:

  • static Linker Create(LinkSettings settings, ILogger<Linker> logger) - Create a new linker instance
  • static Task RunAsync(LinkSettings settings, IRecordSource left, IRecordSource right, IScoredPairSink sink, ILogger<Linker> logger, bool train = false, EMOptions emOptions = null, CancellationToken cancellationToken = default) - One-shot convenience method to train and predict

Instance Methods:

  • Linker InputLeft(IRecordSource left) - Set the left dataset source (fluent)
  • Linker InputRight(IRecordSource right) - Set the right dataset source (fluent)
  • Task TrainAsync(EMOptions options = null, CancellationToken cancellationToken = default) - Run EM training to learn parameters
  • Task PredictAsync(IScoredPairSink sink, CancellationToken cancellationToken = default) - Find and output matches
EMOptions

Configuration for Expectation-Maximization training.

Convergence Properties:

  • int MaxIterations { get; set; } - Maximum EM iterations (default: 20)
  • double Tolerance { get; set; } - Convergence tolerance on log-likelihood (default: 1e-5)
  • double Smoothing { get; set; } - Laplace smoothing to avoid zeros (default: 1e-6)

Estimation Properties:

  • bool EstimateLambda { get; set; } - Learn the match prior probability (default: false)

Performance Properties:

  • bool UseParallelism { get; set; } - Parallel E-step processing (default: true)
  • bool DeduplicateCandidatesPerLeft { get; set; } - Remove duplicate right candidates (default: true)
  • bool UseBucketMemoryCache { get; set; } - Use hybrid memory cache (default: true)
  • int BucketCacheMaxBuckets { get; set; } - Max buckets in memory (default: 16)

Sampling Properties:

  • int? SampleLeftEveryN { get; set; } - Subsample left records (default: null)
  • int? MaxCandidatePairsPerIteration { get; set; } - Cap pairs per iteration (default: null)

Data Sources and Sinks

IRecordSource

Interface for reading records asynchronously.

Methods:

  • IAsyncEnumerable<Record> ReadAsync(CancellationToken cancellationToken = default) - Stream records
CsvSource : IRecordSource

Built-in CSV file reader.

Static Methods:

  • static CsvSource From(string path, IRecordMapper mapper) - Create from file path and mapper
IRecordMapper

Interface for converting raw CSV rows to Records.

Methods:

  • Record Map(Dictionary<string, string> row) - Convert a CSV row, return null to skip
IScoredPairSink : IAsyncDisposable

Interface for handling match results.

Methods:

  • ValueTask WriteAsync(string id1, string id2, double score, Record record1, Record record2) - Handle a match
  • ValueTask DisposeAsync() - Clean up resources

Blocking Rules

Blocking rules reduce the number of candidate pairs by only comparing records that share certain characteristics.

Block Static Factory Class

Methods:

  • static IBlockRule OnPrefix(string fieldName, int prefixLength, bool toLower = true) - Match on first N characters
  • static IBlockRule OnExact(string fieldName, bool toLower = true) - Exact field match
  • static IBlockRule OnConcatExact(char separator = '|', bool toLower = true, params string[] fields) - Exact match on concatenated fields
  • static IBlockRule OnInitialAndSurnamePrefix(string firstNameField, string surnameField, int surnamePrefix, bool toLower = true) - First initial + surname prefix
  • static IBlockRule OnSoundex(string fieldName) - Phonetic matching using Soundex

Examples:

settings.Blocking.Add(Block.OnPrefix("LastName", 3));           // First 3 chars of surname
settings.Blocking.Add(Block.OnExact("ZipCode"));               // Exact zip code match
settings.Blocking.Add(Block.OnConcatExact('_', true, "City", "State"));  // City_State key
settings.Blocking.Add(Block.OnSoundex("LastName"));            // Phonetically similar surnames
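
For reference, Block.OnSoundex groups values that encode to the same Soundex code. A simplified sketch of the classic algorithm (it omits American Soundex's special H/W rule, and ReLinker's exact implementation may differ):

```csharp
using System;
using System.Text;

static string Soundex(string s)
{
    if (string.IsNullOrEmpty(s)) return "0000";

    // Consonants that sound alike share a digit; vowels act as separators.
    static char Code(char c) => char.ToUpperInvariant(c) switch
    {
        'B' or 'F' or 'P' or 'V' => '1',
        'C' or 'G' or 'J' or 'K' or 'Q' or 'S' or 'X' or 'Z' => '2',
        'D' or 'T' => '3',
        'L' => '4',
        'M' or 'N' => '5',
        'R' => '6',
        _ => '0'
    };

    var sb = new StringBuilder().Append(char.ToUpperInvariant(s[0]));
    char prev = Code(s[0]);
    foreach (char c in s[1..])
    {
        char code = Code(c);
        if (code != '0' && code != prev) sb.Append(code);
        if (sb.Length == 4) break;
        prev = code;
    }
    return sb.ToString().PadRight(4, '0');
}

Console.WriteLine(Soundex("Robert")); // R163
Console.WriteLine(Soundex("Rupert")); // R163 — blocked together despite the spelling
```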

Comparison Methods

Compare Static Factory Class

Single-Field Continuous Comparisons:

  • static IComparison Jaro(string fieldName) - Jaro string similarity
  • static IComparison JaroWinkler(string fieldName) - Jaro-Winkler similarity (emphasizes common prefixes)
  • static IComparison Levenshtein(string fieldName) - Normalized Levenshtein edit distance
  • static IComparison TfIdf(string fieldName, Dictionary<string, double> idf) - TF-IDF cosine similarity

Multi-Level Comparisons:

  • static IComparison Levels(string fieldName, params IComparisonLevel[] levels) - Create a multi-level comparison
CompareLevels Static Factory Class

Factory for creating individual comparison levels.

String Matching Levels:

  • static IComparisonLevel Exact(bool ignoreCase = true, string label = "exact") - Perfect match
  • static IComparisonLevel JaroAtLeast(double threshold, string label = null) - Jaro ≥ threshold
  • static IComparisonLevel JaroWinklerAtLeast(double threshold, string label = null) - Jaro-Winkler ≥ threshold
  • static IComparisonLevel LevenshteinSimilarityAtLeast(double threshold, string label = null) - Levenshtein similarity ≥ threshold
  • static IComparisonLevel JaccardTokensAtLeast(double threshold, string label = null) - Jaccard token similarity ≥ threshold

Specialized Levels:

  • static IComparisonLevel NullOrEmpty(string label = "null_or_empty") - Either field is null/empty
  • static IComparisonLevel SoundexEqual(string label = "soundex_equal") - Phonetically equal
  • static IComparisonLevel NumericWithin(double tolerance, string label = null) - Numeric values within tolerance
  • static IComparisonLevel DateWithinDays(int days, string label = null) - Dates within N days
  • static IComparisonLevel Else(string label = "else") - Catch-all level (always matches)

Example Multi-Level Comparison:

settings.Comparisons.Add(
    Compare.Levels(
        "PersonName",
        CompareLevels.Exact(label: "exact"),                    // Perfect match
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"),   // Almost identical
        CompareLevels.JaroWinklerAtLeast(0.85, "close"),        // Pretty similar
        CompareLevels.SoundexEqual("sounds_alike"),             // Phonetically similar
        CompareLevels.NullOrEmpty("missing"),                   // Handle missing data
        CompareLevels.Else("different")                         // Everything else
    )
);

String Similarity Classes

All similarity classes implement IStringSimilarity with a single method:

  • double Compute(string inputString1, string inputString2) - Returns similarity in [0,1]
JaroSimilarity

Classic Jaro string similarity algorithm. Good for names and short strings.

JaroWinklerSimilarity

Jaro-Winkler algorithm that gives extra weight to common prefixes.

Constructor:

  • JaroWinklerSimilarity(double prefixScale = 0.1, int maxPrefix = 4) - Customize prefix weighting
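
The algorithm in brief: Jaro counts characters that match within a sliding window and penalizes transpositions; Winkler then boosts pairs that share a prefix. A self-contained sketch using the same default parameters (illustrative, not ReLinker's source):

```csharp
using System;

static double Jaro(string a, string b)
{
    if (a.Length == 0 && b.Length == 0) return 1.0;
    int window = Math.Max(0, Math.Max(a.Length, b.Length) / 2 - 1);
    var aMatched = new bool[a.Length];
    var bMatched = new bool[b.Length];
    int matches = 0;

    // Characters match if equal and within the window of each other.
    for (int i = 0; i < a.Length; i++)
    {
        int lo = Math.Max(0, i - window), hi = Math.Min(b.Length - 1, i + window);
        for (int j = lo; j <= hi; j++)
        {
            if (bMatched[j] || a[i] != b[j]) continue;
            aMatched[i] = bMatched[j] = true;
            matches++;
            break;
        }
    }
    if (matches == 0) return 0.0;

    // Half-transpositions: matched characters appearing in a different order.
    int t = 0, k = 0;
    for (int i = 0; i < a.Length; i++)
    {
        if (!aMatched[i]) continue;
        while (!bMatched[k]) k++;
        if (a[i] != b[k]) t++;
        k++;
    }
    double m = matches;
    return (m / a.Length + m / b.Length + (m - t / 2.0) / m) / 3.0;
}

static double JaroWinkler(string a, string b, double prefixScale = 0.1, int maxPrefix = 4)
{
    double jaro = Jaro(a, b);
    int prefix = 0;
    while (prefix < Math.Min(maxPrefix, Math.Min(a.Length, b.Length))
           && a[prefix] == b[prefix]) prefix++;
    return jaro + prefix * prefixScale * (1 - jaro);  // shared-prefix boost
}

Console.WriteLine($"{JaroWinkler("martha", "marhta"):F4}"); // 0.9611
```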
LevenshteinSimilarity

Optimized Levenshtein edit distance, converted to similarity (1 - distance/maxLength).
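
The conversion is straightforward: compute the edit distance, then divide by the longer string's length. A single-row dynamic-programming sketch (illustrative; ReLinker's optimized version may differ):

```csharp
using System;

static double LevenshteinSimilarity(string a, string b)
{
    if (a.Length == 0 && b.Length == 0) return 1.0;

    // Classic DP over edit distance, keeping only the previous row.
    var prev = new int[b.Length + 1];
    for (int j = 0; j <= b.Length; j++) prev[j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        var curr = new int[b.Length + 1];
        curr[0] = i;
        for (int j = 1; j <= b.Length; j++)
        {
            int substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            curr[j] = Math.Min(substitution, Math.Min(prev[j] + 1, curr[j - 1] + 1));
        }
        prev = curr;
    }

    int distance = prev[b.Length];
    return 1.0 - (double)distance / Math.Max(a.Length, b.Length);
}

// 3 edits over a max length of 7.
Console.WriteLine($"{LevenshteinSimilarity("kitten", "sitting"):F4}"); // 0.5714
```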

JaccardTokenSimilarity

Jaccard similarity on word tokens. Good for addresses and multi-word fields.
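
Token-level Jaccard is the intersection over the union of the two word sets, which makes it robust to word order and to one extra or missing token. A sketch (whitespace tokenization assumed; ReLinker's tokenizer may differ):

```csharp
using System;
using System.Linq;

static double JaccardTokens(string a, string b)
{
    var ta = a.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    var tb = b.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries).ToHashSet();
    if (ta.Count == 0 && tb.Count == 0) return 1.0;
    return (double)ta.Intersect(tb).Count() / ta.Union(tb).Count();
}

// 4 shared tokens out of 6 distinct tokens overall.
Console.WriteLine($"{JaccardTokens("12 main st apt 4", "12 main street apt 4"):F4}"); // 0.6667
```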

TfIdfSimilarity

TF-IDF cosine similarity using pre-computed IDF weights.

Constructor:

  • TfIdfSimilarity(Dictionary<string, double> idf) - Provide IDF dictionary
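
The point of IDF weighting is that agreement on ubiquitous tokens ("llc", "inc") is nearly meaningless, while agreement on rare tokens is decisive. A sketch of IDF-weighted cosine similarity over word tokens (the IDF values are invented; ReLinker's weighting scheme may differ):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static double TfIdfCosine(string a, string b, Dictionary<string, double> idf)
{
    // Term-frequency vector, weighted by IDF (unknown tokens default to 1.0).
    static Dictionary<string, double> Vector(string s, Dictionary<string, double> idf) =>
        s.ToLowerInvariant()
         .Split(' ', StringSplitOptions.RemoveEmptyEntries)
         .GroupBy(t => t)
         .ToDictionary(g => g.Key, g => g.Count() * idf.GetValueOrDefault(g.Key, 1.0));

    var va = Vector(a, idf);
    var vb = Vector(b, idf);
    double dot = va.Keys.Intersect(vb.Keys).Sum(t => va[t] * vb[t]);
    double norm = Math.Sqrt(va.Values.Sum(w => w * w)) * Math.Sqrt(vb.Values.Sum(w => w * w));
    return norm == 0 ? 0 : dot / norm;
}

// "acme" is rare (high IDF); "llc"/"corp" appear everywhere (low IDF),
// so the mismatched suffix barely dents the similarity.
var idf = new Dictionary<string, double> { ["acme"] = 3.0, ["llc"] = 0.1, ["corp"] = 0.1 };
Console.WriteLine($"{TfIdfCosine("acme llc", "acme corp", idf):F4}"); // 0.9989
```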

Advanced Features

Value-Specific U

When UseValueSpecificU = true, exact matches use the actual frequency of the matched value in the right dataset instead of the learned u parameter. This dramatically improves accuracy for rare exact matches.
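
In weight terms: for an exact match, u is effectively the probability that two random non-matching records share the value, which is just the value's frequency in the data. A sketch of why that matters (toy data; not ReLinker's internals):

```csharp
using System;
using System.Linq;

// Surname frequencies in a toy "right" dataset.
var right = new[] { "smith", "smith", "smith", "smith", "garcia", "garcia",
                    "okonkwo-adeyemi", "lee", "lee", "patel" };
var freq = right.GroupBy(s => s)
                .ToDictionary(g => g.Key, g => (double)g.Count() / right.Length);

// Evidence (in bits) carried by an exact match, with u = observed value frequency.
static double ExactMatchWeight(double m, double u) => Math.Log2(m / u);

double common = ExactMatchWeight(0.95, freq["smith"]);            // frequent value
double rare   = ExactMatchWeight(0.95, freq["okonkwo-adeyemi"]);  // one-off value
Console.WriteLine($"smith: {common:F2} bits, okonkwo-adeyemi: {rare:F2} bits"); // 1.25 vs 3.25
```

A flat learned u would give both surnames the same weight; the frequency-based u lets the rare agreement count for much more.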

Hybrid Memory Caching

ReLinker can keep frequently-accessed bucket files parsed in memory to avoid repeated disk reads:

  • EnableBucketMemoryCache = true - Enable the feature
  • BucketCacheMaxBuckets = N - Keep up to N buckets in memory (LRU eviction)
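
The eviction policy can be pictured with a generic LRU sketch (standalone, not ReLinker's cache): reading a bucket marks it recently used, and inserting beyond capacity evicts the bucket touched longest ago.

```csharp
using System;
using System.Collections.Generic;

int capacity = 2;
var map = new Dictionary<int, LinkedListNode<(int Key, string Value)>>();
var order = new LinkedList<(int Key, string Value)>();   // front = most recently used

void Put(int key, string value)
{
    if (map.TryGetValue(key, out var existing)) order.Remove(existing);
    map[key] = order.AddFirst((key, value));
    if (map.Count > capacity)
    {
        var lru = order.Last!;        // evict the least recently used bucket
        order.RemoveLast();
        map.Remove(lru.Value.Key);
    }
}

bool TryGet(int key)
{
    if (!map.TryGetValue(key, out var node)) return false;
    order.Remove(node);               // touch: move to the front
    order.AddFirst(node);
    return true;
}

Put(1, "bucket-1");
Put(2, "bucket-2");
TryGet(1);                            // touch bucket 1
Put(3, "bucket-3");                   // evicts bucket 2, the least recently used
Console.WriteLine(TryGet(2));         // False
Console.WriteLine(TryGet(1));         // True
```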
Parallel Processing

Enable parallel processing for both training and prediction:

  • UseParallelism = true - Enable parallel candidate scoring
  • MaxDegreeOfParallelism = 0 - Use all CPU cores (or specify a number)

Common Patterns and Recipes

High-Precision Linkage (Few False Positives)

settings.MatchThreshold = 0.95;          // Very strict threshold
settings.UseValueSpecificU = true;       // Better handling of rare exact matches

// Use restrictive blocking
settings.Blocking.Add(Block.OnExact("Email"));      // Only compare identical emails
settings.Blocking.Add(Block.OnPrefix("Phone", 6));  // Similar phone prefixes

// Multi-level comparison with exact match having high weight
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact(label: "exact"),             // Very high m, very low u
        CompareLevels.JaroWinklerAtLeast(0.95, "close"), // Still high confidence
        CompareLevels.Else("different")                  // Low confidence
    )
);

High-Recall Linkage (Find More Matches)

settings.MatchThreshold = 0.75;          // More lenient threshold

// Multiple blocking strategies for broader coverage
settings.Blocking.Add(Block.OnPrefix("LastName", 3));
settings.Blocking.Add(Block.OnPrefix("FirstName", 2));
settings.Blocking.Add(Block.OnSoundex("LastName"));         // Phonetic matching
settings.Blocking.Add(Block.OnInitialAndSurnamePrefix("FirstName", "LastName", 4));

// More granular comparison levels
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact(label: "exact"),
        CompareLevels.JaroWinklerAtLeast(0.95, "very_close"),
        CompareLevels.JaroWinklerAtLeast(0.90, "close"),
        CompareLevels.JaroWinklerAtLeast(0.80, "somewhat_close"),
        CompareLevels.SoundexEqual("sounds_alike"),
        CompareLevels.Else("different")
    )
);

Large Dataset Processing

settings.BucketCount = 8192;                    // More, smaller bucket files
settings.EnableBucketMemoryCache = true;       
settings.BucketCacheMaxBuckets = 128;           // Use more memory for caching
settings.OutputBatchSize = 4096;                // Larger output batches
settings.UseParallelism = true;
settings.MaxDegreeOfParallelism = 0;            // Use all cores

var emOptions = new EMOptions
{
    UseBucketMemoryCache = true,
    BucketCacheMaxBuckets = 256,                // Even more cache for training
    UseParallelism = true,
    MaxCandidatePairsPerIteration = 1_000_000   // Cap training pairs per iteration
};

Multi-Field Comparison

// Name comparison with multiple levels
settings.Comparisons.Add(
    Compare.Levels("FullName",
        CompareLevels.Exact(),
        CompareLevels.JaroWinklerAtLeast(0.90),
        CompareLevels.SoundexEqual(),
        CompareLevels.Else()
    )
);

// Address comparison
settings.Comparisons.Add(
    Compare.Levels("Address", 
        CompareLevels.Exact(),
        CompareLevels.JaroWinklerAtLeast(0.85),
        CompareLevels.JaccardTokensAtLeast(0.75),  // Good for addresses
        CompareLevels.Else()
    )
);

// Phone number comparison
settings.Comparisons.Add(
    Compare.Levels("Phone",
        CompareLevels.Exact(),
        CompareLevels.NumericWithin(0),  // Numerically equal after parsing
        CompareLevels.Else()
    )
);

// Provide separate m/u arrays for each comparison
settings.LevelMProbsPerComparison = new()
{
    new double[] { 0.95, 0.80, 0.60, 0.05 },  // Name
    new double[] { 0.90, 0.70, 0.50, 0.05 },  // Address  
    new double[] { 0.98, 0.85, 0.02 }          // Phone
};

settings.LevelUProbsPerComparison = new()
{
    new double[] { 0.01, 0.05, 0.15, 0.95 },  // Name
    new double[] { 0.02, 0.10, 0.20, 0.95 },  // Address
    new double[] { 0.001, 0.01, 0.99 }        // Phone
};

Troubleshooting

"Not enough matches found"

  • Lower your MatchThreshold
  • Add more blocking rules for better coverage (Block.OnPrefix, Block.OnSoundex)
  • Check if your field names are consistent between left/right datasets
  • Verify your data mapper is working correctly with sample data
  • Use train: true to let EM improve your initial parameter guesses

"Too many false positives"

  • Increase your MatchThreshold
  • Add more discriminative comparison levels
  • Improve data normalization in your IRecordMapper
  • Set UseValueSpecificU = true for better exact match handling
  • Use more restrictive blocking rules

"Process is too slow"

  • Set UseParallelism = true and tune MaxDegreeOfParallelism
  • Increase BucketCacheMaxBuckets if you have available memory
  • Make your blocking rules more restrictive (fewer candidates per record)
  • Increase OutputBatchSize for slow output sinks
  • Consider using EMOptions.SampleLeftEveryN for faster training on large datasets

"Out of memory errors"

  • Reduce BucketCacheMaxBuckets
  • Increase BucketCount to create smaller bucket files
  • Use more restrictive blocking to reduce candidate set size
  • Process data in smaller chunks

"Training not converging"

  • Increase EMOptions.MaxIterations (try 30-50)
  • Decrease EMOptions.Tolerance (try 1e-7)
  • Check that your initial m/u parameters make sense
  • Ensure you have enough training data
  • Verify your comparison levels are well-designed

Performance Characteristics

ReLinker is designed to handle large datasets efficiently:

  • Memory usage: Configurable via bucket cache settings. Can run in low-memory mode (cache disabled) or high-memory mode (large cache).
  • Disk I/O: Minimized through hybrid caching and sequential file access patterns.
  • CPU scaling: Near-linear scaling with core count for candidate scoring (I/O remains single-threaded).
  • Typical throughput: 10K-100K candidate pairs per second on modern hardware, depending on comparison complexity.
Target frameworks

ReLinker targets .NET net8.0. net9.0, net10.0, and the platform-specific variants (android, browser, ios, maccatalyst, macos, tvos, windows) were computed as compatible.


Version history

Version  Downloads  Last Updated
1.2.0    144        9/22/2025
1.1.0    111        6/21/2025
1.0.111  114        6/20/2025
1.0.15   111        6/20/2025
1.0.13   111        6/20/2025
1.0.12   113        6/20/2025
1.0.0    117        6/20/2025