Imagibee.Gigantor 0.3.5

There is a newer version of this package available.
See the version list below for details.
dotnet add package Imagibee.Gigantor --version 0.3.5
NuGet\Install-Package Imagibee.Gigantor -Version 0.3.5
This command is intended to be used within the Package Manager Console in Visual Studio, as it uses the NuGet module's version of Install-Package.
<PackageReference Include="Imagibee.Gigantor" Version="0.3.5" />
For projects that support PackageReference, copy this XML node into the project file to reference the package.
paket add Imagibee.Gigantor --version 0.3.5
#r "nuget: Imagibee.Gigantor, 0.3.5"
#r directive can be used in F# Interactive and Polyglot Notebooks. Copy this into the interactive tool or source code of the script to reference the package.
// Install Imagibee.Gigantor as a Cake Addin
#addin nuget:?package=Imagibee.Gigantor&version=0.3.5

// Install Imagibee.Gigantor as a Cake Tool
#tool nuget:?package=Imagibee.Gigantor&version=0.3.5

Gigantor

Gigantor provides classes that support regular expression searches of gigantic files.

The purpose of Gigantor is robust, easy, ready-made searching of gigantic files that avoids common pitfalls. In particular, it aims to overcome the problems of responsiveness, memory footprint, and processing time that are often encountered in this type of application.

In order to accomplish this goal, Gigantor provides RegexSearcher and LineIndexer classes that work together to search and read a file. Both these classes use a similar approach. They partition the file into chunks in the background, launch threads to work on each partition, update progress statistics, and finally join and sort the results.
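The partition, search, and join steps described above can be sketched with standard .NET types only. This is a minimal illustration under stated assumptions, not Gigantor's implementation: the class name `ChunkedSearchSketch` and the `overlap` parameter are hypothetical, the text is assumed to be single-byte (Latin-1), and every match is assumed to be shorter than `overlap` bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical sketch, not part of Gigantor
public static class ChunkedSearchSketch
{
    // Returns the sorted file positions at which a match starts.
    public static List<long> Search(
        string path, Regex regex, int chunkSize, int overlap)
    {
        long length = new FileInfo(path).Length;
        int chunkCount = (int)((length + chunkSize - 1) / chunkSize);
        var perChunk = new List<long>[chunkCount];

        // Map: each partition is read and searched independently
        Parallel.For(0, chunkCount, i =>
        {
            long start = (long)i * chunkSize;
            // Read a little past the chunk so a match that straddles
            // the boundary is still visible to this worker
            int size = (int)Math.Min(
                (long)chunkSize + overlap, length - start);
            var buffer = new byte[size];
            int read = 0;
            using (var stream = new FileStream(
                path, FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                stream.Seek(start, SeekOrigin.Begin);
                while (read < size)
                {
                    int n = stream.Read(buffer, read, size - read);
                    if (n == 0) break;
                    read += n;
                }
            }
            string text = Encoding.Latin1.GetString(buffer, 0, read);
            var found = new List<long>();
            foreach (Match m in regex.Matches(text))
            {
                // Keep only matches that start inside this chunk; matches
                // starting in the overlap belong to the next chunk
                if (m.Index < chunkSize) found.Add(start + m.Index);
            }
            perChunk[i] = found;
        });

        // Join: merge the per-chunk results and sort by file position
        return perChunk.SelectMany(p => p).OrderBy(p => p).ToList();
    }
}
```

Dropping matches that start in the overlap region is what prevents the same match from being reported by two neighbouring chunks. Patterns that depend on context before the chunk start (anchors, lookbehind) would need extra care that this sketch does not attempt.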

Since many file processing applications fit into this parallel chunk processing paradigm, Gigantor also provides FileMapJoin<T> as a reusable base class for creating new file map/join classes. This base class is customizable through its Start, Map, Join, and Finish methods as well as its chunkSize, maxWorkers, and joinMode constructor parameters.
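To make the map/join idea concrete, here is a rough sketch of what a base class of this kind could look like. Everything in it is hypothetical: `ChunkMapJoinSketch<T>`, its `Run` method, and the `Map` and `Join` signatures are assumptions made for illustration and do not reproduce the actual FileMapJoin<T> API, which should be looked up in the library source.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical base class for illustration; NOT Gigantor's FileMapJoin<T>
public abstract class ChunkMapJoinSketch<T>
{
    // Map: compute a partial result for a single chunk
    protected abstract T Map(ArraySegment<byte> chunk);

    // Join: combine two partial results, left chunk first
    protected abstract T Join(T left, T right);

    public T Run(byte[] data, int chunkSize, int maxWorkers)
    {
        if (data.Length == 0)
            throw new ArgumentException("data must not be empty");
        int chunkCount = (data.Length + chunkSize - 1) / chunkSize;
        var results = new T[chunkCount];
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxWorkers
        };
        Parallel.For(0, chunkCount, options, i =>
        {
            int start = i * chunkSize;
            int count = Math.Min(chunkSize, data.Length - start);
            results[i] = Map(new ArraySegment<byte>(data, start, count));
        });
        // Partial results are joined sequentially in chunk order
        return results.Aggregate(Join);
    }
}

// Example subclass: counts newline bytes
public sealed class NewlineCounter : ChunkMapJoinSketch<long>
{
    protected override long Map(ArraySegment<byte> chunk)
    {
        long count = 0;
        foreach (byte b in chunk)
        {
            if (b == (byte)'\n') count++;
        }
        return count;
    }

    protected override long Join(long left, long right) => left + right;
}
```

A subclass only has to describe the per-chunk work and how two partial results combine; the partitioning and worker limit stay in the base class. This sketch works on an in-memory buffer to stay short, whereas a file-based version would read each chunk from disk.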

Contents

  • RegexSearcher - regex searching in the background
  • LineIndexer - line counting in the background, maps line numbers to file positions (fpos) and file positions to line numbers
  • DuplicateChecker - file duplicate detection in the background
  • FileMapJoin<T> - base class for implementing custom file-based map/join operations
  • IBackground - common interface for controlling a background job
  • Background - functions for managing collections of IBackground
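The two-way mapping between lines and file positions mentioned for LineIndexer can be illustrated with a sorted list of line-start offsets and a binary search. This is a simplified sketch and not the library's implementation: `LineIndexSketch`, `OffsetOfLine`, and `LineAtOffset` are hypothetical names, line numbers are assumed to be 1-based, and the whole input is held in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not part of Gigantor
public sealed class LineIndexSketch
{
    // lineStarts[k] is the byte offset of the first byte of line k + 1
    private readonly List<long> lineStarts = new List<long> { 0 };

    public LineIndexSketch(byte[] data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            // A new line starts after each '\n' unless it ends the data
            if (data[i] == (byte)'\n' && i + 1 < data.Length)
            {
                lineStarts.Add(i + 1);
            }
        }
    }

    public int LineCount => lineStarts.Count;

    // 1-based line number to byte offset of the start of that line
    public long OffsetOfLine(int line) => lineStarts[line - 1];

    // Byte offset to the 1-based number of the line that contains it
    public int LineAtOffset(long fpos)
    {
        int idx = lineStarts.BinarySearch(fpos);
        // When fpos is not a line start, BinarySearch returns the bitwise
        // complement of the index of the next larger element
        if (idx < 0) idx = ~idx - 1;
        return idx + 1;
    }
}
```

Because the offsets are stored in ascending order, both lookups are cheap: line to position is an index, and position to line is a binary search.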

Example

Here is an example that illustrates searching a large file and reading several lines around a match.

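using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading;
using Imagibee.Gigantor;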

// Get enwik9 (this takes a moment)
var path = Utilities.GetEnwik9();

// The regular expression for the search
const string pattern = @"comfort\s*food";
Regex regex = new(
    pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

// A shared wait event to facilitate progress notifications
AutoResetEvent progress = new(false);

// Create the search and indexing workers
LineIndexer indexer = new(path, progress);
RegexSearcher searcher = new(path, regex, progress);

// Create an IBackground collection for convenient management
var processes = new List<IBackground>()
{
    indexer,
    searcher
};

// Create a progress bar to illustrate progress updates
Utilities.ByteProgress progressBar = new(
    40, processes.Count * Utilities.FileByteCount(path));

// Start search and indexing in parallel and wait for completion
Console.WriteLine("Searching ...");
Background.StartAndWait(
    processes,
    progress,
    (_) =>
    {
        progressBar.Update(
            processes.Select((p) => p.ByteCount).Sum());
    },
    1000);
Console.Write('\n');

// All done, check for errors
var error = Background.AnyError(processes);
if (error.Length != 0) {
    throw new Exception(error);
}

// Check for cancellation
if (Background.AnyCancelled(processes)) {
    throw new Exception("search cancelled");
}

// Display search results
if (searcher.MatchCount != 0) {
    Console.WriteLine($"Found {searcher.MatchCount} matches ...");
    var matchDatas = searcher.GetMatchData();
    for (var i = 0; i < matchDatas.Count; i++) {
        var matchData = matchDatas[i];
        Console.WriteLine(
            $"[{i}]({matchData.Value}) ({matchData.Name}) " +
            $"line {indexer.LineFromPosition(matchData.StartFpos)} " +
            $"fpos {matchData.StartFpos}");
    }

    // Get the line of the first match
    var matchLine = indexer.LineFromPosition(
        matchDatas[0].StartFpos);

    // Open the searched file for reading
    using FileStream fileStream = new(path, FileMode.Open);
    Imagibee.Gigantor.StreamReader gigantorReader = new(fileStream);

    // Seek to the first line we want to read
    var contextLines = 6;
    fileStream.Seek(indexer.PositionFromLine(
        matchLine - contextLines), SeekOrigin.Begin);

    // Read and display a few lines around the match
    for (var line = matchLine - contextLines;
        line <= matchLine + contextLines;
        line++) {
        Console.WriteLine(
            $"[{line}]({indexer.PositionFromLine(line)})  " +
            gigantorReader.ReadLine());
    }
}

Example console output

 Searching ...
 ########################################
 Found 11 matches ...
 [0](Comfort food) (0) line 2115660 fpos 185913740
 [1](comfort food) (0) line 2115660 fpos 185913753
 [2](comfort food) (0) line 2405473 fpos 212784867
 [3](comfort food) (0) line 3254241 fpos 275813781
 [4](comfort food) (0) line 3254259 fpos 275817860
 [5](comfort food) (0) line 3993946 fpos 334916584
 [6](comfort food) (0) line 4029113 fpos 337507601
 [7](comfort food) (0) line 4194105 fpos 350053436
 [8](comfort food) (0) line 8614841 fpos 691616502
 [9](comfort food) (0) line 10190137 fpos 799397876
 [10](comfort food) (0) line 12488963 fpos 954837923
 [2115654](185912493)  
 [2115655](185912494)  Some [[fruit]]s were available in the area. [[Muscadine]]s, [[blackberry|blackberries]], [[raspberry|raspberries]], and many other wild berries were part of settlers&amp;#8217; diets when they were available.
 [2115656](185912703)  
 [2115657](185912704)  Early settlers also supplemented their diets with meats.  Most meat came from the hunting of native game.  [[Venison]] was an important meat staple due to the abundance of [[white-tailed deer]] in the area.  Settlers also hunted [[rabbit]]s, [[squirrel]]s, [[opossum]]s, and [[raccoon]]s, all of which were pests to the crops they raised.  [[Livestock]] in the form of [[hog]]s and [[cattle]] were kept.  When game or livestock was killed, the entire animal was used.  Aside from the meat, it was not uncommon for settlers to eat organ meats such as [[liver]], [[brain]]s and [[intestine]]s. This tradition remains today in hallmark dishes like [[chitterlings]] (commonly called ''chit&amp;#8217;lins'') which are fried large [[intestines]] of [[hog]]s, [[livermush]] (a common dish in the Carolinas made from hog liver), and pork [[brain]]s and eggs.  The fat of the animals, particularly hogs, was rendered and used for cooking and frying.
 [2115658](185913646)  
 [2115659](185913647)  ===Southern cuisine for the masses===
 [2115660](185913685)  A niche market for Southern food along with American [[Comfort food|comfort food]] has proven profitable for chains such as [[Cracker Barrel]], who have extended their market across the country, instead of staying solely in the South.
 [2115661](185913920)  
 [2115662](185913921)  Southern chains that are popular across the country include [[Stuckey's]] and [[Popeyes Chicken &amp; Biscuits|Popeye's]]. The former is known for being a &quot;pecan shoppe&quot; and the latter is known for its spicy fried chicken.
 [2115663](185914154)  
 [2115664](185914155)  Other Southern chains which specialize in this type of cuisine, but have decided mainly to stay in the South, are [[Po' Folks]] (also known as ''Folks'' in some markets) and Famous Amos. Another type of selection is [[Sonny's Real Pit Bar-B-Q]].
 [2115665](185914401)  
 [2115666](185914402)  ==Cajun and Creole cuisine==

Refer to the tests and console apps for additional examples.

Performance

The performance benchmark consists of running the included benchmarking apps over enwik9 and measuring the throughput. Enwik9 is a 1e9 byte file that is not included.

Throughput Graph

Here is the search benchmark console output for a 5 GByte (5e9 byte) search. On the test system, throughput leveled off at about 16 worker threads, where it was roughly eight times (8x) the single-threaded baseline (about 1.3 GBytes/s versus 161 MBytes/s).

$ dotnet SearchApp/bin/Release/netcoreapp3.1/SearchApp.dll benchmark ${TMPDIR}/enwik9
..................................
maxWorkers=1, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 31.0263856 seconds
-> 161.1531573307076 MBytes/s
...................
maxWorkers=2, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 16.8778632 seconds
-> 296.24603190290105 MBytes/s
.........
maxWorkers=4, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 9.1642743 seconds
-> 545.5969383194914 MBytes/s
.........
maxWorkers=8, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 5.2054124 seconds
-> 960.5386885388754 MBytes/s
....
maxWorkers=16, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.7841506 seconds
-> 1321.3004788974308 MBytes/s
....
maxWorkers=32, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.714317 seconds
-> 1346.1425074919562 MBytes/s
....
maxWorkers=64, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.7860814 seconds
-> 1320.6266510804548 MBytes/s
....
maxWorkers=128, chunkKiBytes=512, maxThread=32767
   105160 matches found
   searched 5000000000 bytes in 3.8072122 seconds
-> 1313.296905278881 MBytes/s
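The 8x figure can be checked directly from the timings in the output above. The helper below is only a small arithmetic sketch; `SpeedupSketch` and its method names are hypothetical and not part of the benchmark apps.

```csharp
using System;

// Hypothetical helper for checking the benchmark arithmetic
public static class SpeedupSketch
{
    // Speedup is the baseline time divided by the parallel time
    public static double Speedup(
        double baselineSeconds, double parallelSeconds)
    {
        return baselineSeconds / parallelSeconds;
    }

    // Throughput in MBytes/s (1 MByte = 1e6 bytes, as in the output)
    public static double MBytesPerSecond(long bytes, double seconds)
    {
        return bytes / seconds / 1e6;
    }
}
```

With the maxWorkers=1 and maxWorkers=16 timings, `SpeedupSketch.Speedup(31.0263856, 3.7841506)` comes out at about 8.2, and `SpeedupSketch.MBytesPerSecond(5000000000, 31.0263856)` reproduces the reported baseline of about 161 MBytes/s.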

The hardware used to measure performance was a MacBook Pro with:

  • 8-Core Intel Core i9
  • L2 Cache (per Core): 256 KB
  • L3 Cache: 16 MB
  • Memory: 16 GB

License

MIT

Versioning

This package uses semantic versioning. Tags on the main branch indicate versions. It is recommended to use a tagged version. The latest version on the main branch should be considered under development when it is not tagged.

Issues

Report and track issues here.

Contributing

Minor changes such as bug fixes are welcome. Simply make a pull request. Please discuss more significant changes prior to making the pull request by opening a new issue that describes the change.

Product compatibility and additional computed target framework versions
.NET: net6.0 is compatible. The following were computed: net6.0-android, net6.0-ios, net6.0-maccatalyst, net6.0-macos, net6.0-tvos, net6.0-windows, net7.0, net7.0-android, net7.0-ios, net7.0-maccatalyst, net7.0-macos, net7.0-tvos, net7.0-windows, net8.0, net8.0-android, net8.0-browser, net8.0-ios, net8.0-maccatalyst, net8.0-macos, net8.0-tvos, net8.0-windows.
  • net6.0

    • No dependencies.

NuGet packages

This package is not used by any NuGet packages.

GitHub repositories

This package is not used by any popular GitHub repositories.

Version Downloads Last updated
3.0.0 399 4/5/2023
2.0.1 171 4/3/2023
2.0.0 168 4/3/2023
1.0.2 159 3/30/2023
1.0.1 179 3/25/2023
1.0.0 189 3/24/2023
0.8.2 220 3/8/2023
0.8.1 193 3/8/2023
0.8.0 214 3/6/2023
0.7.1 213 3/6/2023
0.7.0 213 3/5/2023
0.6.3 202 3/1/2023
0.6.2 209 2/21/2023
0.6.1 214 2/18/2023
0.6.0 229 2/18/2023
0.5.0 226 2/13/2023
0.4.1 228 2/10/2023
0.4.0 249 2/8/2023
0.3.5 353 2/7/2023
0.3.4 219 2/7/2023
0.3.3 227 2/6/2023
0.3.2 252 2/6/2023
0.3.1 241 2/6/2023