WebHdfs.Extensions.FileProviders
A comprehensive file provider implementation for Apache Hadoop HDFS through the WebHDFS REST API, fully compatible with the Microsoft.Extensions.FileProviders abstraction.
Problem Statement
Working with HDFS in .NET applications typically requires custom clients or complex setup procedures. Developers often need to:
- Learn Hadoop-specific APIs and protocols
- Handle low-level HTTP requests to WebHDFS endpoints
- Implement custom file system abstractions
- Deal with authentication and connection management
- Write boilerplate code for common file operations
This creates a barrier for .NET developers who want to integrate with Hadoop ecosystems in cloud-native or hybrid environments where HDFS is used for big data storage.
Solution
This library provides a standard IFileProvider implementation that integrates HDFS with the .NET ecosystem. It hides the details of the WebHDFS protocol while providing:
- Native .NET Integration: Works with ASP.NET Core, dependency injection, and other .NET frameworks out of the box
- Familiar API: Uses the same IFileProvider interface that .NET developers already know
- WebHDFS Protocol: Leverages HTTP/HTTPS for cross-platform compatibility and firewall-friendly communication
- Change Monitoring: Implements polling-based file system watching for reactive applications
- Performance Optimized: Efficient streaming, connection reuse, and minimal memory footprint
Features
- Full IFileProvider Implementation: Seamless integration with ASP.NET Core and other .NET applications
- WebHDFS REST API: Access HDFS files and directories through HTTP/HTTPS without custom clients
- File Operations: Read file content, get file information, and browse directories with streaming support
- Change Detection: Monitor file system changes with configurable polling-based change tokens
- Cross-Platform: Works on Windows, Linux, and macOS with no native dependencies
- Production Ready: Thread-safe, efficient, and designed for high-throughput scenarios
- Multiple Framework Support: .NET Framework 4.6.2+, .NET Standard 2.0 and 2.1, .NET 8.0, and .NET 9.0
Installation
Install the package via NuGet:
dotnet add package WebHdfs.Extensions.FileProviders
Or via Package Manager Console:
Install-Package WebHdfs.Extensions.FileProviders
Usage
Basic File Provider Usage
using System;
using System.IO;
using WebHdfs.Extensions.FileProviders;
// Create a file provider pointing to your HDFS NameNode
var nameNodeUri = new Uri("http://namenode:9870");
var fileProvider = new WebHdfsFileProvider(nameNodeUri);
// Get file info
var fileInfo = fileProvider.GetFileInfo("/path/to/your/file.txt");
if (fileInfo.Exists)
{
    Console.WriteLine($"File size: {fileInfo.Length} bytes");
    Console.WriteLine($"Last modified: {fileInfo.LastModified}");
    // Read file content
    using var stream = fileInfo.CreateReadStream();
    using var reader = new StreamReader(stream);
    var content = reader.ReadToEnd();
    Console.WriteLine($"Content: {content}");
}
// Browse directory
var directoryContents = fileProvider.GetDirectoryContents("/path/to/directory");
foreach (var item in directoryContents)
{
    Console.WriteLine($"{(item.IsDirectory ? "DIR" : "FILE")}: {item.Name}");
}
Direct File Access
using System;
using System.IO;
using WebHdfs.Extensions.FileProviders;
var nameNodeUri = new Uri("http://namenode:9870");
var fileInfo = new WebHdfsFileInfo(nameNodeUri, "/path/to/file.txt");
if (fileInfo.Exists)
{
    Console.WriteLine($"File exists: {fileInfo.Exists}");
    Console.WriteLine($"File size: {fileInfo.Length} bytes");
    Console.WriteLine($"Is directory: {fileInfo.IsDirectory}");
    Console.WriteLine($"Last modified: {fileInfo.LastModified}");
    // Read file content
    using var stream = fileInfo.CreateReadStream();
    using var reader = new StreamReader(stream);
    var content = reader.ReadToEnd();
    Console.WriteLine($"Content: {content}");
}
Change Monitoring
using System;
using WebHdfs.Extensions.FileProviders;
var nameNodeUri = new Uri("http://namenode:9870");
var fileProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(10)); // Poll every 10 seconds
// Monitor file changes
var changeToken = fileProvider.Watch("/path/to/watch/**");
// Change tokens are single-use by convention; call Watch again after a change to keep monitoring
changeToken.RegisterChangeCallback(_ =>
{
    Console.WriteLine("File system changed!");
}, null);
ASP.NET Core Integration
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.FileProviders;
using WebHdfs.Extensions.FileProviders;
public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register HDFS file provider
        services.AddSingleton<IFileProvider>(serviceProvider =>
        {
            var nameNodeUri = new Uri("http://namenode:9870");
            return new WebHdfsFileProvider(nameNodeUri);
        });
        // Or add it to an options class that exposes a FileProviders list.
        // FileProviderOptions is assumed here to be an options class defined
        // by your application.
        services.Configure<FileProviderOptions>(options =>
        {
            var nameNodeUri = new Uri("http://namenode:9870");
            options.FileProviders.Add(new WebHdfsFileProvider(nameNodeUri));
        });
    }
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        // Use HDFS for static files (read-only)
        var nameNodeUri = new Uri("http://namenode:9870");
        var hdfsProvider = new WebHdfsFileProvider(nameNodeUri);
        app.UseStaticFiles(new StaticFileOptions
        {
            FileProvider = hdfsProvider,
            RequestPath = "/hdfs-content"
        });
    }
}
How It Works
The WebHdfsFileProvider leverages the WebHDFS REST API to provide seamless access to HDFS:
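- HTTP/HTTPS Communication: All operations start with standard HTTP requests to the HDFS NameNode; for file reads (OPEN), the NameNode redirects the client to a DataNode that serves the content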
- JSON Responses: Parses WebHDFS JSON responses for file metadata and directory listings
- Streaming Support: Efficient streaming for large file reads without loading entire files into memory
- Connection Reuse: Uses HttpClient connection pooling for optimal performance
- Polling-based Monitoring: Implements change detection through periodic metadata checks
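The polling idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not the library's actual implementation: `ModificationPoller` and its delegate parameter are hypothetical names, and the delegate stands in for a GETFILESTATUS call that returns the `modificationTime` of a path (or null when the path does not exist).

```csharp
using System;

// Hypothetical illustration of polling-based change detection.
// Not thread-safe; a real implementation would synchronize access.
public sealed class ModificationPoller
{
    // Returns the current modification time, or null if the path is missing.
    private readonly Func<long?> _readModificationTime;
    private long? _lastSeen;

    public ModificationPoller(Func<long?> readModificationTime)
    {
        _readModificationTime = readModificationTime;
        _lastSeen = readModificationTime(); // baseline snapshot
    }

    // Intended to be called once per polling interval.
    // Returns true when the metadata differs from the previous poll.
    public bool Poll()
    {
        var current = _readModificationTime();
        if (current == _lastSeen)
        {
            return false;
        }
        _lastSeen = current;
        return true;
    }
}
```

A timer firing at the configured polling interval would call `Poll()` and signal the change token when it returns true. Note that a change made and reverted between two polls goes unnoticed, which is inherent to this approach.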
Request Flow
// Simplified version of the internal process
GET /webhdfs/v1/path/to/file?op=GETFILESTATUS
→ Returns: {"FileStatus": {"length": 1024, "modificationTime": 1234567890, ...}}
GET /webhdfs/v1/path/to/file?op=OPEN
→ NameNode responds with a 307 redirect to a DataNode, which returns the file content stream
GET /webhdfs/v1/path/to/directory?op=LISTSTATUS
→ Returns: {"FileStatuses": {"FileStatus": [...]}}
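The request flow above can be reproduced with only the base class library. The sketch below is an illustration under stated assumptions, not the library's internal code: `WebHdfsSketch` and its methods are hypothetical names, while the `/webhdfs/v1` prefix, the `op` parameter, and the JSON field names (`length`, `modificationTime` in milliseconds since the Unix epoch, `type`, `pathSuffix`) come from the WebHDFS REST API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper for illustration; not part of the library's public API.
public static class WebHdfsSketch
{
    // Builds e.g. http://namenode:9870/webhdfs/v1/path/to/file?op=GETFILESTATUS
    public static Uri BuildUri(Uri nameNodeUri, string hdfsPath, string operation)
    {
        // Escape each segment separately so the '/' separators survive.
        var segments = hdfsPath.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        var escaped = string.Join("/", Array.ConvertAll(segments, s => Uri.EscapeDataString(s)));
        return new Uri(nameNodeUri, $"/webhdfs/v1/{escaped}?op={operation}");
    }

    // Parses the body of a GETFILESTATUS response.
    public static (long Length, DateTimeOffset LastModified, bool IsDirectory) ParseFileStatus(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var status = doc.RootElement.GetProperty("FileStatus");
        var length = status.GetProperty("length").GetInt64();
        var modified = DateTimeOffset.FromUnixTimeMilliseconds(
            status.GetProperty("modificationTime").GetInt64());
        var isDirectory = status.GetProperty("type").GetString() == "DIRECTORY";
        return (length, modified, isDirectory);
    }

    // Parses the body of a LISTSTATUS response and returns the entry names.
    public static List<string> ParseListing(string json)
    {
        using var doc = JsonDocument.Parse(json);
        var entries = doc.RootElement.GetProperty("FileStatuses").GetProperty("FileStatus");
        var names = new List<string>();
        foreach (var entry in entries.EnumerateArray())
        {
            names.Add(entry.GetProperty("pathSuffix").GetString() ?? string.Empty);
        }
        return names;
    }
}
```

For example, `WebHdfsSketch.BuildUri(new Uri("http://namenode:9870"), "/path/to/file", "GETFILESTATUS")` yields the first request shown above; the response body can then be passed to `ParseFileStatus`.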
API Reference
WebHdfsFileProvider Class
Constructors
WebHdfsFileProvider(Uri nameNodeUri)
Creates a new instance with default polling interval (5 seconds).
Parameters:
- nameNodeUri: The URI of the HDFS NameNode (e.g., http://namenode:9870)
WebHdfsFileProvider(Uri nameNodeUri, TimeSpan pollingInterval)
Creates a new instance with custom polling interval for change detection.
Parameters:
- nameNodeUri: The URI of the HDFS NameNode
- pollingInterval: How often to check for file system changes
Provider Methods
IFileInfo GetFileInfo(string subpath)
Gets file information for the specified path.
Parameters:
- subpath: The relative path to the file in HDFS
Returns: IFileInfo instance with file metadata
IDirectoryContents GetDirectoryContents(string subpath)
Gets the contents of a directory.
Parameters:
- subpath: The relative path to the directory in HDFS
Returns: IDirectoryContents containing directory items
IChangeToken Watch(string filter)
Monitors the specified path pattern for changes.
Parameters:
- filter: The path pattern to monitor (supports wildcards)
Returns: IChangeToken for change notifications
WebHdfsFileInfo Class
File Info Constructors
WebHdfsFileInfo(Uri nameNodeUri, string path)
Creates a direct file info instance for the specified HDFS path.
Parameters:
- nameNodeUri: The URI of the HDFS NameNode
- path: The absolute path to the file in HDFS
Properties
- bool Exists: Whether the file or directory exists
- long Length: File size in bytes (0 for directories)
- string Name: File or directory name
- DateTimeOffset LastModified: Last modification timestamp
- bool IsDirectory: Whether this is a directory
- string PhysicalPath: Returns null (not applicable for HDFS)
File Info Methods
Stream CreateReadStream()
Creates a read-only stream for the file content.
Returns: Stream for reading file data
Throws: InvalidOperationException if the file doesn't exist or is a directory
Configuration
NameNode URI
The NameNode URI should include the protocol, hostname, and port:
// HTTP (default WebHDFS port on Hadoop 3.x; Hadoop 2.x defaults to 50070)
var httpUri = new Uri("http://namenode:9870");
// HTTPS (default secure WebHDFS port on Hadoop 3.x; Hadoop 2.x defaults to 50470)
var httpsUri = new Uri("https://namenode:9871");
// Custom port configuration
var customUri = new Uri("http://hdfs-cluster:8080");
Polling Interval
Configure the polling interval for change detection based on your requirements:
// Default polling interval (5 seconds)
var defaultProvider = new WebHdfsFileProvider(nameNodeUri);
// Fast polling for near-real-time applications (1 second)
var fastProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(1));
// Slow polling for batch processing (30 seconds)
var slowProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(30));
// Disable polling when change notifications are not needed
var staticProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromMilliseconds(-1));
HTTP Client Configuration
For advanced scenarios, you can configure the underlying HttpClient:
// Configure with custom HttpClient
var httpClient = new HttpClient()
{
    Timeout = TimeSpan.FromSeconds(30)
};
httpClient.DefaultRequestHeaders.Add("User-Agent", "MyApp/1.0");
var fileProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(5), httpClient);
Performance Considerations
- Streaming: The library uses efficient streaming for file reads, minimizing memory usage for large files
- Connection Pooling: Leverages HttpClient connection pooling for optimal network performance
- Lazy Loading: Directory contents and file information are loaded on-demand
- Caching: Change tokens use intelligent caching to reduce unnecessary WebHDFS requests
- Thread Safety: All operations are thread-safe and can be called concurrently
Compatibility
This library supports the following target frameworks:
- .NET Standard 2.0 (for broad compatibility)
- .NET Standard 2.1
- .NET Framework 4.6.2+
- .NET 8.0
- .NET 9.0
It's compatible with:
- ASP.NET Core 2.0+
- .NET Framework applications using Microsoft.Extensions.FileProviders
- Any application using the Microsoft.Extensions.FileProviders package
- Azure Functions, AWS Lambda, and other serverless environments
- Docker containers and Kubernetes deployments
Thread Safety
All methods in this library are thread-safe and can be called concurrently from multiple threads. The internal HTTP client and caching mechanisms use appropriate synchronization to ensure consistency.
Requirements
- Target Frameworks: .NET Framework 4.6.2+, .NET Standard 2.0 and 2.1, .NET 8.0, .NET 9.0
- HDFS Version: Compatible with Apache Hadoop 2.0+ with WebHDFS enabled
- Network Access: HTTP/HTTPS access to the HDFS NameNode WebHDFS port (default: 9870 on Hadoop 3.x, 50070 on Hadoop 2.x) and to the DataNodes that file reads are redirected to
- Permissions: Read access to HDFS paths (write operations not supported)
Dependencies
- Microsoft.Extensions.FileProviders.Abstractions
- System.Text.Json (for parsing WebHDFS JSON responses)
- System.Net.Http (for WebHDFS REST API communication)
Supported Operations
Operation | Supported | WebHDFS API | Notes |
---|---|---|---|
Read Files | ✅ | OPEN | Full streaming support, efficient for large files |
Get File Info | ✅ | GETFILESTATUS | Size, timestamps, permissions, type |
Browse Directories | ✅ | LISTSTATUS | Recursive directory listing with metadata |
Change Monitoring | ✅ | GETFILESTATUS | Polling-based with configurable intervals |
Write Files | ❌ | N/A | Read-only provider by design |
Create Directories | ❌ | N/A | Read-only provider by design |
Delete Files | ❌ | N/A | Read-only provider by design |
Move/Rename | ❌ | N/A | Read-only provider by design |
Troubleshooting
Common Issues
Connection Refused
System.Net.Http.HttpRequestException: Connection refused
Solution:
- Verify HDFS NameNode is running and WebHDFS is enabled
- Check the NameNode URI (default ports are 9870 for HTTP and 9871 for HTTPS on Hadoop 3.x; 50070 and 50470 on Hadoop 2.x)
- Ensure network connectivity and firewall rules allow access
File Not Found
WebHdfsFileInfo.Exists returns false for existing files
Solution:
- Verify the file path is absolute (starts with /)
- Check HDFS permissions (user must have read access)
- Ensure the path exists in HDFS using hdfs dfs -ls /path/to/file
Slow Performance
Solutions:
- Increase polling interval for change monitoring
- Use connection pooling by reusing the same WebHdfsFileProvider instance
- Consider caching file information for frequently accessed files
- Check network latency between client and HDFS cluster
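The caching suggestion above could look like the following time-to-live cache. This is a sketch under assumptions rather than a feature of the library: `TtlCache` is a hypothetical name, the clock parameter exists only to make the behavior testable, and two threads that miss at the same time may both call the loader.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical time-to-live cache for metadata lookups; not part of the library.
public sealed class TtlCache<TValue>
{
    private readonly ConcurrentDictionary<string, (TValue Value, DateTimeOffset Expires)> _entries =
        new ConcurrentDictionary<string, (TValue Value, DateTimeOffset Expires)>();
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;

    public TtlCache(TimeSpan ttl) : this(ttl, () => DateTimeOffset.UtcNow)
    {
    }

    // The clock is injectable so expiry can be exercised without waiting.
    public TtlCache(TimeSpan ttl, Func<DateTimeOffset> clock)
    {
        _ttl = ttl;
        _clock = clock;
    }

    // Returns the cached value for key, or calls load and caches its result.
    public TValue GetOrAdd(string key, Func<string, TValue> load)
    {
        var now = _clock();
        if (_entries.TryGetValue(key, out var entry) && entry.Expires > now)
        {
            return entry.Value;
        }
        var value = load(key);
        _entries[key] = (value, now + _ttl);
        return value;
    }
}
```

A call such as `cache.GetOrAdd("/path/to/file.txt", p => fileProvider.GetFileInfo(p).Length)` would then reach the NameNode at most once per TTL window for that path. Cache plain values (length, timestamp) rather than the IFileInfo object if the provider evaluates metadata lazily; whether it does is not stated here.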
Authentication Errors
HTTP 401 Unauthorized
Solution:
- This library currently supports anonymous access only
- Ensure HDFS cluster allows anonymous read access
- For authenticated access, consider extending the library or using a proxy
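On clusters with security turned off, WebHDFS accepts a `user.name` query parameter (pseudo-authentication). One way to extend the provider without changing it is a message handler that appends this parameter to every request. The sketch below is untested against this library: `UserNameHandler` is a hypothetical name, and it assumes the HttpClient passed to the three-argument constructor shown under HTTP Client Configuration is used for all requests.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler that appends WebHDFS pseudo-authentication
// (?user.name=...) to every outgoing request. Not part of the library.
public sealed class UserNameHandler : DelegatingHandler
{
    private readonly string _userName;

    public UserNameHandler(string userName, HttpMessageHandler innerHandler)
        : base(innerHandler)
    {
        _userName = userName;
    }

    // Returns a copy of uri with user.name appended to its query string.
    public static Uri AppendUserName(Uri uri, string userName)
    {
        var builder = new UriBuilder(uri);
        var pair = "user.name=" + Uri.EscapeDataString(userName);
        // UriBuilder.Query includes the leading '?' when a query is present.
        builder.Query = string.IsNullOrEmpty(builder.Query)
            ? pair
            : builder.Query.TrimStart('?') + "&" + pair;
        return builder.Uri;
    }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        if (request.RequestUri != null)
        {
            request.RequestUri = AppendUserName(request.RequestUri, _userName);
        }
        return base.SendAsync(request, cancellationToken);
    }
}
```

Usage would be `new HttpClient(new UserNameHandler("hdfs", new HttpClientHandler()))`, where `"hdfs"` is a placeholder user. This does not help on Kerberos-secured clusters; a gateway such as Apache Knox remains the option there.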
Performance Tuning
Optimize Polling Interval
// For near-real-time applications
var fastProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(1));
// For batch processing
var batchProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromMinutes(5));
// Disable polling when change notifications are not needed
var staticProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromMilliseconds(-1));
Connection Management
// Reuse provider instances
private static readonly WebHdfsFileProvider SharedProvider =
new WebHdfsFileProvider(new Uri("http://namenode:9870"));
// Configure HttpClient timeout
var httpClient = new HttpClient { Timeout = TimeSpan.FromMinutes(5) };
var fileProvider = new WebHdfsFileProvider(nameNodeUri, TimeSpan.FromSeconds(5), httpClient);
Memory Management
// Use streaming for large files
using var stream = fileInfo.CreateReadStream();
using var bufferedStream = new BufferedStream(stream, 8192);
// Process in chunks instead of loading entire file
var buffer = new byte[4096];
int bytesRead;
while ((bytesRead = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
{
// Process chunk
}
Known Limitations
- Read-Only Operations: This provider only supports read operations. Write, delete, and create operations are not implemented by design
- Anonymous Authentication: Currently supports anonymous access only. For authenticated scenarios, consider using a proxy or extending the library
- Polling-Based Monitoring: Uses polling for change detection, which may not be suitable for high-frequency monitoring or real-time scenarios
- No Transaction Support: Operations are not transactional; partial reads may occur if files are modified during access
- Limited Error Handling: WebHDFS error responses are mapped to basic .NET exceptions; detailed HDFS error information may be lost
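If you call WebHDFS yourself (for example through a proxy) and need the HDFS error detail that the last limitation mentions, the error body can be parsed directly. WebHDFS reports failures as a JSON object with a `RemoteException` member containing `exception`, `javaClassName`, and `message`. The sketch below assumes you already have the response body as a string; `WebHdfsRemoteError` is a hypothetical type, not something the library exposes.

```csharp
#nullable enable
using System;
using System.Text.Json;

// Hypothetical type for illustration; not part of the library.
public sealed class WebHdfsRemoteError
{
    public string ExceptionName { get; }
    public string JavaClassName { get; }
    public string Message { get; }

    private WebHdfsRemoteError(string exceptionName, string javaClassName, string message)
    {
        ExceptionName = exceptionName;
        JavaClassName = javaClassName;
        Message = message;
    }

    // Returns true and sets error when body is a WebHDFS RemoteException document.
    public static bool TryParse(string body, out WebHdfsRemoteError? error)
    {
        error = null;
        try
        {
            using var doc = JsonDocument.Parse(body);
            if (doc.RootElement.ValueKind != JsonValueKind.Object ||
                !doc.RootElement.TryGetProperty("RemoteException", out var remote) ||
                remote.ValueKind != JsonValueKind.Object)
            {
                return false;
            }
            error = new WebHdfsRemoteError(
                Read(remote, "exception"),
                Read(remote, "javaClassName"),
                Read(remote, "message"));
            return true;
        }
        catch (JsonException)
        {
            // The body was not JSON, e.g. an HTML error page from a proxy.
            return false;
        }
    }

    // Reads a string member, falling back to an empty string when absent.
    private static string Read(JsonElement obj, string name) =>
        obj.TryGetProperty(name, out var value) && value.ValueKind == JsonValueKind.String
            ? value.GetString() ?? string.Empty
            : string.Empty;
}
```

With this in place, a 404 from GETFILESTATUS can be distinguished from other failures by checking whether `ExceptionName` equals `FileNotFoundException`.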
Alternatives
If this library doesn't meet your requirements, consider these alternatives:
- Microsoft.Hadoop.Client: Official Microsoft Hadoop client (deprecated, but may still work for some scenarios)
- HDInsight .NET SDK: Azure HDInsight specific client with more features
- Custom WebHDFS Implementation: Build your own WebHDFS client for specific authentication or feature requirements
- Hadoop Filesystem Bridge: Use JNI or process invocation to call native HDFS commands
- Apache Knox: Use Knox gateway for authentication and proxying to WebHDFS
Related Projects
- Microsoft.Extensions.FileProviders - The base file provider abstractions
- Apache Hadoop WebHDFS - Official WebHDFS documentation
- Azure Data Lake Storage - Microsoft's cloud-native HDFS alternative
License
This project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.
Support
For questions, issues, or contributions, please visit the GitHub repository.
Package dependencies (identical for .NETFramework 4.6.2, .NETStandard 2.0, .NETStandard 2.1, net8.0, and net9.0):
- Microsoft.Extensions.FileProviders.Abstractions (>= 2.1.0)
- System.Net.Http (>= 4.3.4)
- System.Text.Json (>= 4.6.0)
# Release Notes
## Version 1.0.2 (Initial Release)
### Features
- **Full IFileProvider Implementation**: Complete implementation of Microsoft.Extensions.FileProviders.IFileProvider interface
- **WebHDFS REST API Integration**: Native support for accessing HDFS through WebHDFS REST API
- **File Operations**: Read file content, get file information (size, timestamps, type), and browse directories
- **Change Detection**: Polling-based file system change monitoring with configurable intervals
- **Streaming Support**: Efficient streaming of large files from HDFS
- **Multi-Framework Support**: Comprehensive targeting for .NET Framework 4.6.2+, .NET Standard 2.0/2.1, .NET 8.0, and .NET 9.0
- **Cross-Platform Compatibility**: Works seamlessly on Windows, Linux, and macOS
### Core Components
- **WebHdfsFileProvider**: Main file provider implementation with configurable polling intervals
- **WebHdfsFileInfo**: File information implementation supporting both files and directories
- **WebHdfsDirectoryContents**: Directory enumeration with lazy loading
- **PollingFileChangeToken**: Change detection mechanism for file system monitoring
- **Error Handling**: Comprehensive error handling for network and HDFS-specific exceptions
### API Features
- **File Access**: Direct file access through WebHdfsFileInfo class
- **Directory Browsing**: Enumerate directory contents with file/directory type detection
- **Change Monitoring**: Watch file system changes with glob pattern support
- **Configuration**: Flexible NameNode URI configuration with HTTP/HTTPS support
- **Documentation**: Comprehensive XML documentation for all public APIs
### Performance & Security
- **Efficient HTTP Client Usage**: Optimized HTTP client usage with proper disposal patterns
- **Memory Management**: Efficient memory usage with streaming and IDisposable implementations
- **Connection Management**: Proper connection lifecycle management
- **Anonymous Access**: Anonymous (unauthenticated) access to HDFS clusters
### Compatibility
- **HDFS Version**: Compatible with Apache Hadoop 2.0+ with WebHDFS enabled
- **Network Requirements**: HTTP/HTTPS access to HDFS NameNode (default ports 9870/9871 on Hadoop 3.x; 50070/50470 on Hadoop 2.x)
- **Dependencies**:
  - Microsoft.Extensions.FileProviders.Abstractions
  - System.Text.Json
  - System.Net.Http
### Supported Operations
| Operation | Support | Implementation |
|-----------|---------|----------------|
| Read Files | ✅ | Full streaming support with efficient memory usage |
| Get File Info | ✅ | Complete metadata including size, timestamps, and type |
| Browse Directories | ✅ | Lazy-loaded directory enumeration |
| Change Monitoring | ✅ | Polling-based with configurable intervals |
| File Watching | ✅ | Glob pattern support for flexible monitoring |
### Known Limitations
- **Read-Only Provider**: This version supports read operations only (write operations planned for future releases)
- **Authentication**: Currently supports anonymous access only (OAuth and Kerberos planned)
- **Polling-Based Changes**: Uses polling for change detection (may not be suitable for high-frequency monitoring scenarios)
### Technical Details
- **Default Polling Interval**: 5 seconds (configurable)
- **WebHDFS API Version**: Compatible with WebHDFS REST API v1
- **JSON Serialization**: Uses System.Text.Json for efficient parsing
- **Nullable Reference Types**: Full support on modern .NET versions (.NET 8.0+, .NET Standard 2.1)
### Future Roadmap
- OAuth authentication support
- Kerberos authentication support
- Write operations (create, update, delete files and directories)
- Glob pattern enhancements for file watching
- Performance optimizations for large-scale deployments
- Azure Data Lake Storage Gen2 compatibility
- Real-time change notifications (when supported by HDFS)
### Breaking Changes
None - this is the initial release.
### Migration Guide
This is the initial release, no migration required.
### Contributors
- Zhang Shuai - Initial implementation and design
---
For detailed usage examples and API documentation, see the [README.md](README.md) file.