FilePrepper 0.4.8

FilePrepper

A powerful .NET library and CLI tool for data preprocessing. Its Pipeline API chains transformations in memory, cutting file I/O by 67-90% compared with running the equivalent CLI commands one at a time. It is well suited to ML data preparation, ETL pipelines, and data analysis workflows.

🚀 Quick Start

SDK Installation

# Install FilePrepper SDK for programmatic use
dotnet add package FilePrepper

# Or install CLI tool globally
dotnet tool install -g fileprepper-cli

SDK Usage

using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData; // for NormalizationMethod

// CSV Processing: Only 2 file I/O operations (read + write)
await DataPipeline
    .FromCsvAsync("data.csv")
    // Filter on the raw Age values first: after MinMax scaling they lie in [0, 1]
    .FilterRows(row => int.Parse(row["Age"]) >= 30)
    .Normalize(columns: new[] { "Age", "Salary", "Score" },
               method: NormalizationMethod.MinMax)
    .FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
    .ToCsvAsync("output.csv");

// Multi-Format Support: Excel → Transform → JSON
await DataPipeline
    .FromExcelAsync("sales.xlsx")
    .AddColumn("Total", row =>
        (double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
    .FilterRows(row => double.Parse(row["Total"]) >= 1000)
    .ToJsonAsync("high_value_sales.json");

// Multi-File CSV Concatenation: Merge 33 files ⭐ NEW
await DataPipeline
    .ConcatCsvAsync("kemp-*.csv", "dataset/")
    .ParseKoreanTime("Time", "ParsedTime")  // Korean time format ⭐ NEW
    .ExtractDateFeatures("ParsedTime", DateFeatures.Hour | DateFeatures.Minute)
    .ToCsvAsync("processed.csv");

CLI Usage

# Normalize numeric columns
fileprepper normalize-data --input data.csv --output normalized.csv \
  --columns "Age,Salary,Score" --method MinMax

# Fill missing values
fileprepper fill-missing-values --input data.csv --output filled.csv \
  --columns "Age,Salary" --method Mean

# Get help
fileprepper --help
fileprepper <command> --help

📦 Supported Formats

Process data in multiple formats:

  • CSV (Comma-Separated Values)
  • TSV (Tab-Separated Values)
  • JSON (JavaScript Object Notation)
  • XML (Extensible Markup Language)
  • Excel (XLSX/XLS files)

🛠️ Available Commands (26+)

Data Transformation

  • normalize-data - Normalize columns (MinMax, ZScore)
  • scale-data - Scale numeric data (StandardScaler, MinMaxScaler, RobustScaler)
  • one-hot-encoding - Convert categorical to binary columns
  • data-type-convert - Convert column data types
  • date-extraction - Extract date features (Year, Month, Day, DayOfWeek)
  • datetime - Parse datetime and extract features ⭐ Phase 2
  • string - String transformations (upper, lower, trim, substring) ⭐ Phase 2
  • conditional - Conditional column creation with if-then-else logic ⭐ Phase 2
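
For reference, the two methods listed for normalize-data are conventionally defined as below. These are the textbook formulas, not taken from FilePrepper's source. Details such as how a constant column is handled, or whether the standard deviation is the sample or population form, are not stated in this README.

```latex
x'_{\text{MinMax}} = \frac{x - \min(x)}{\max(x) - \min(x)}, \qquad
x'_{\text{ZScore}} = \frac{x - \mu}{\sigma}
```

For example, for an Age column holding only 25 and 30, MinMax maps 25 to 0 and 30 to 1.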

Data Cleaning

  • fill-missing-values - Fill missing data (Mean, Median, Mode, Forward, Backward, Constant)
  • drop-duplicates - Remove duplicate rows
  • value-replace - Replace values in columns

Column Operations

  • add-columns - Add new calculated columns
  • remove-columns - Delete unwanted columns
  • rename-columns - Rename column headers
  • reorder-columns - Change column order
  • column-interaction - Create interaction features

Data Analysis

  • basic-statistics - Calculate statistics (Mean, Median, StdDev, ZScore)
  • aggregate - Group and aggregate data
  • filter-rows - Filter rows by conditions

Data Organization

  • merge - Combine multiple files (Horizontal/Vertical merge)
  • merge-asof - Time-series merge with tolerance ⭐ Phase 2
  • data-sampling - Sample rows (Random, Stratified, Systematic)
  • file-format-convert - Convert between formats
  • unpivot - Reshape data from wide to long format ⭐ Phase 2

Feature Engineering

  • create-lag-features - Create time-series lag features
  • window - Window operations (resample, rolling aggregations) ⭐ Phase 2

💡 Common Use Cases

Data Cleaning Pipeline (CLI)

# 1. Remove unnecessary columns
fileprepper remove-columns --input raw.csv --output step1.csv \
  --columns "Debug,TempCol,Notes"

# 2. Drop duplicates
fileprepper drop-duplicates --input step1.csv --output step2.csv \
  --columns "Email" --keep First

# 3. Fill missing values
fileprepper fill-missing-values --input step2.csv --output step3.csv \
  --columns "Age,Salary" --method Mean

# 4. Normalize numeric columns
fileprepper normalize-data --input step3.csv --output clean.csv \
  --columns "Age,Salary,Score" --method MinMax

Time-Series Processing (Phase 2) ⭐

# 5-minute window aggregation for sensor data
fileprepper window --input sensor_current.csv --output aggregated.csv \
  --type resample --method mean \
  --columns "RMS[A]" --time-column "Time_s[s]" \
  --window 5T --header

# Rolling window for smoothing
fileprepper window --input noisy_data.csv --output smoothed.csv \
  --type rolling --method mean \
  --columns temperature,humidity --window-size 3 \
  --suffix "_smooth" --header
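
To make the rolling option concrete, here is a minimal sketch of a trailing rolling mean in plain C#. It is illustrative only and does not use or reproduce FilePrepper's implementation. The RollingMean helper is hypothetical, and averaging over a shorter window at the start of the series is an assumption; the tool may align or pad windows differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of FilePrepper): trailing rolling mean.
// Assumption: the first windowSize - 1 outputs average over the values seen so far.
static List<double> RollingMean(IReadOnlyList<double> values, int windowSize)
{
    if (windowSize < 1) throw new ArgumentOutOfRangeException(nameof(windowSize));

    var result = new List<double>(values.Count);
    double sum = 0;
    for (int i = 0; i < values.Count; i++)
    {
        sum += values[i];
        if (i >= windowSize) sum -= values[i - windowSize]; // drop the value leaving the window
        int count = Math.Min(i + 1, windowSize);
        result.Add(sum / count);
    }
    return result;
}

var temperature = new double[] { 20, 22, 27, 23, 22 };
Console.WriteLine(string.Join(", ", RollingMean(temperature, 3)));
// prints "20, 21, 23, 24, 24"
```

The running sum keeps the cost linear in the number of rows instead of recomputing each window from scratch.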

ML Feature Engineering (SDK - Efficient!)

using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData; // for NormalizationMethod

// Single pipeline: Only 2 file I/O operations instead of 8!
await DataPipeline
    .FromCsvAsync("orders.csv")
    .AddColumn("Year", row => DateTime.Parse(row["OrderDate"]).Year.ToString())
    .AddColumn("Month", row => DateTime.Parse(row["OrderDate"]).Month.ToString())
    .Normalize(columns: new[] { "Revenue", "Quantity" },
               method: NormalizationMethod.MinMax)
    .FilterRows(row => int.Parse(row["Year"]) >= 2023)
    .ToCsvAsync("features.csv");

// 67-90% reduction in file I/O compared to CLI approach!
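
One way to reproduce the quoted range is sketched below. It assumes that each CLI command performs one read and one write, and that a pipeline performs one read and one write in total regardless of the number of steps n. This is an interpretation of the figures, not a measurement.

```latex
\text{reduction}(n) = 1 - \frac{2}{2n} = 1 - \frac{1}{n}, \qquad
\text{reduction}(3) \approx 67\%, \quad
\text{reduction}(4) = 75\%, \quad
\text{reduction}(10) = 90\%
```

The four-step pipeline above (8 operations reduced to 2) corresponds to n = 4. Under this reading, the 67-90% range covers workflows of 3 to 10 steps.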

Format Conversion

# CSV to JSON
fileprepper file-format-convert --input data.csv --output data.json --format JSON

# Excel to CSV
fileprepper file-format-convert --input report.xlsx --output report.csv --format CSV

# CSV to XML
fileprepper file-format-convert --input data.csv --output data.xml --format XML

Data Analysis

# Calculate statistics
fileprepper basic-statistics --input data.csv --output stats.csv \
  --columns "Age,Salary,Score" --statistics Mean,Median,StdDev,ZScore

# Aggregate by group
fileprepper aggregate --input sales.csv --output summary.csv \
  --group-by "Region,Category" --agg-columns "Revenue:Sum,Quantity:Mean"

# Sample data
fileprepper data-sampling --input large.csv --output sample.csv \
  --method Random --sample-size 1000

🔧 Programmatic Usage (SDK)

FilePrepper provides a powerful SDK with a Pipeline API for efficient data processing:

dotnet add package FilePrepper

Benefits: 67-90% reduction in file I/O, fluent API, in-memory processing

using FilePrepper.Pipeline;
using FilePrepper.Tasks.NormalizeData;

// Efficient: Only 2 file I/O operations (read + write)
await DataPipeline
    .FromCsvAsync("data.csv")
    // Filter on the raw Age values first: after MinMax scaling they lie in [0, 1]
    .FilterRows(row => int.Parse(row["Age"]) >= 30)
    .Normalize(columns: new[] { "Age", "Salary", "Score" },
               method: NormalizationMethod.MinMax)
    .FillMissing(columns: new[] { "Score" }, method: FillMethod.Mean)
    .AddColumn("ProcessedDate", _ => DateTime.Now.ToString())
    .ToCsvAsync("output.csv");

// Or work in-memory without any file I/O
// (inMemoryData is a List<Dictionary<string, string>>, as built under In-Memory Processing)
var result = DataPipeline
    .FromData(inMemoryData)
    .Normalize(columns: new[] { "Age", "Salary" },
               method: NormalizationMethod.MinMax)
    .ToDataFrame();  // Get immutable snapshot

Advanced Pipeline Features

// Chain multiple transformations
var pipeline = await DataPipeline
    .FromCsvAsync("sales.csv")
    .RemoveColumns(new[] { "Debug", "TempCol" })
    .RenameColumn("OldName", "NewName")
    .AddColumn("Total", row =>
        (double.Parse(row["Price"]) * double.Parse(row["Quantity"])).ToString())
    .FilterRows(row => double.Parse(row["Total"]) > 100)
    .Normalize(columns: new[] { "Total" }, method: NormalizationMethod.MinMax);

// Get intermediate results without file I/O
var dataFrame = pipeline.ToDataFrame();
Console.WriteLine($"Processed {dataFrame.RowCount} rows");

// Continue processing
await pipeline
    .AddColumn("ProcessedAt", _ => DateTime.UtcNow.ToString("o"))
    .ToCsvAsync("output.csv");

In-Memory Processing

// Work entirely in memory - zero file I/O
var data = new List<Dictionary<string, string>>
{
    new() { ["Name"] = "Alice", ["Age"] = "25", ["Salary"] = "50000" },
    new() { ["Name"] = "Bob", ["Age"] = "30", ["Salary"] = "60000" }
};

var result = DataPipeline
    .FromData(data)
    // Derive Category from the raw Age before scaling: after MinMax, Age lies in [0, 1]
    .AddColumn("Category", row =>
        int.Parse(row["Age"]) < 30 ? "Junior" : "Senior")
    .Normalize(columns: new[] { "Age", "Salary" },
               method: NormalizationMethod.MinMax)
    .ToDataFrame();

// Access results directly
foreach (var row in result.Rows)
{
    Console.WriteLine($"{row["Name"]}: {row["Category"]}");
}
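
The lambdas in these examples call int.Parse and double.Parse directly, which throw on blank or malformed cells and follow the current culture's number format. A minimal sketch of a more tolerant approach in plain C# follows. ParseOrNull is a hypothetical helper, not part of FilePrepper, and the sample rows are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper (not part of FilePrepper): returns null instead of throwing
// when the column is absent, blank, or not a number. Parsing is culture-invariant.
static double? ParseOrNull(IReadOnlyDictionary<string, string> row, string column) =>
    row.TryGetValue(column, out var text) &&
    double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out var value)
        ? value
        : (double?)null;

var rows = new List<Dictionary<string, string>>
{
    new() { ["Name"] = "Alice", ["Age"] = "25" },
    new() { ["Name"] = "Bob",   ["Age"] = "" },     // missing value
    new() { ["Name"] = "Cara",  ["Age"] = "n/a" },  // malformed value
};

foreach (var row in rows)
{
    var age = ParseOrNull(row, "Age");
    Console.WriteLine(age is null
        ? $"{row["Name"]}: no usable Age"
        : $"{row["Name"]}: {age}");
}
```

A helper like this could be called inside a FilterRows or AddColumn lambda so that one bad cell does not abort the whole run. Whether to skip, default, or fail on such rows depends on the dataset.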

Traditional Task API

using FilePrepper.Tasks.NormalizeData;
using Microsoft.Extensions.Logging;

var options = new NormalizeDataOption
{
    InputPath = "data.csv",
    OutputPath = "normalized.csv",
    TargetColumns = new[] { "Age", "Salary", "Score" },
    Method = NormalizationMethod.MinMax
};

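// `logger` is assumed to be an ILogger instance supplied by your application (for example, via dependency injection)
var task = new NormalizeDataTask(logger);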
var context = new TaskContext(options);
bool success = await task.ExecuteAsync(context);

See SDK Usage Guide for comprehensive examples and best practices.

📖 Documentation

Guides are grouped into four areas: Getting Started, SDK & Programming, Advanced Features, and Use Cases.

For more documentation, see the docs/ directory.

🎯 Use Cases

  • Machine Learning - Prepare datasets for training (normalization, encoding, feature engineering)
  • Time-Series Analysis - Window aggregations, resampling, lag features ⭐ Phase 2
  • Data Analysis - Clean and transform data for analysis
  • ETL Pipelines - Extract, transform, and load data workflows with minimal I/O overhead
  • Data Migration - Convert between formats and clean legacy data
  • Automation - Script data processing with SDK or CLI
  • In-Memory Processing - Chain transformations without file I/O costs

📋 Requirements

  • .NET 10.0 or later (version 0.4.8 targets net10.0)
  • Cross-platform - Windows, Linux, macOS
  • Flexible Usage - CLI tool (no coding) or SDK (programmatic)

🤝 Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Made with ❤️ by iyulab | Efficient Data Preprocessing - CLI & SDK | Phase 2 Complete

Compatible and additional computed target framework versions
.NET: net10.0 is compatible. Computed: net10.0-android, net10.0-browser, net10.0-ios, net10.0-maccatalyst, net10.0-macos, net10.0-tvos, net10.0-windows.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version   Downloads   Last Updated
0.4.8     138         11/16/2025
0.4.7     247         11/14/2025
0.4.5     287         11/13/2025
0.4.3     263         11/10/2025
0.4.0     191         11/3/2025
0.2.3     190         11/3/2025
0.2.2     154         1/17/2025
0.2.1     132         1/16/2025
0.2.0     159         1/11/2025
0.1.1     167         12/16/2024
0.1.0     161         12/6/2024