Reading Parquet Files with DuckDB in a .NET Core Project
Working with large datasets often requires a balance between performance, flexibility, and ease of use. Parquet, an open-source columnar storage file format, has become a standard for efficient data storage and processing. Its design optimizes for speed, compression, and compatibility with modern data processing tools.
When it comes to reading and querying Parquet files, DuckDB stands out as a lightweight, high-performance SQL database management system. Often referred to as the SQLite for Analytics, DuckDB provides robust support for analytical queries directly on Parquet files without the need for an intermediate database setup. This capability makes it a perfect match for scenarios involving rapid prototyping, data exploration, or even production-grade analytics.
In this blog post, we will explore how to integrate DuckDB with a .NET Core project to seamlessly read and query Parquet files. By the end, you’ll be equipped with:
- A step-by-step guide to set up a .NET Core project with DuckDB.
- Sample code to load and query data from a Parquet file.
- Practical insights into why DuckDB is a game-changer for .NET developers handling analytical workloads.
Whether you’re building a data-centric application or performing ad hoc analytics, this guide will show you how DuckDB can simplify your workflow while maintaining high performance. Let’s dive in!
Creating a .NET Core Project in Visual Studio 2022
1. Create the Project
- Open Visual Studio 2022.
- Click on Create a new project.
- Select the Console App template (the C# one, not the .NET Framework variant) and click Next.
- Enter the Project Name (e.g., ParquetDuckDBReader), choose a location, and click Next.
- Choose the framework version (e.g., .NET 6 or .NET 7) from the Framework dropdown and click Create.
2. Install Required NuGet Packages
To work with DuckDB and Parquet files, install the following packages:
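- DuckDB.NET: Enables interaction with DuckDB in .NET.
  - Open the NuGet Package Manager: go to Tools > NuGet Package Manager > Manage NuGet Packages for Solution.
  - Search for DuckDB.NET and install the latest version of DuckDB.NET.Data.Full. This package provides the DuckDB.NET.Data namespace used in the code below and bundles the native DuckDB library.
- Parquet.Net: Provides functionality to work with Parquet files in .NET.
  - Search for Parquet.Net in the NuGet Package Manager and install it. The code in this post does not call it, because DuckDB reads Parquet files natively; it is only needed if you also want to read or write Parquet files directly from .NET.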
Setting Up a DuckDB Connection
using DuckDB.NET.Data;

// Open an in-memory DuckDB database; nothing is persisted to disk.
var connectionString = "DataSource=:memory:";
using var connection = new DuckDBConnection(connectionString);
connection.Open();
Console.WriteLine("DuckDB initialized!");
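The connection above lives only in memory, so its tables disappear when the program exits. As a minimal sketch, assuming the same DataSource key is given a file path instead of :memory:, a file-backed database keeps its tables between connections. The file name demo.duckdb and the table name notes are made up for illustration.

```csharp
using System;
using DuckDB.NET.Data;

// Hypothetical file name; DuckDB creates the file if it does not exist.
const string connectionString = "DataSource=demo.duckdb";

// First connection: create a small table, then close the database.
using (var writer = new DuckDBConnection(connectionString))
{
    writer.Open();
    using var create = writer.CreateCommand();
    // CREATE OR REPLACE keeps the sketch re-runnable against an existing file.
    create.CommandText = "CREATE OR REPLACE TABLE notes AS SELECT 42 AS answer";
    create.ExecuteNonQuery();
}

// Second connection: the table is still there because it was written to disk.
using var connection = new DuckDBConnection(connectionString);
connection.Open();
using var command = connection.CreateCommand();
command.CommandText = "SELECT answer FROM notes";
var answer = Convert.ToInt32(command.ExecuteScalar());
Console.WriteLine($"Stored answer: {answer}");
```

An in-memory database is usually enough for one-off exploration; a file is worth considering when loading the Parquet data is expensive and you want to reuse the resulting table.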
Loading and Reading a Parquet File
// Copy the Parquet file into a DuckDB table; the path is relative to the working directory.
var sql = "CREATE TABLE sample AS SELECT * FROM parquet_scan('sample.parquet')";
using var command = connection.CreateCommand();
command.CommandText = sql;
command.ExecuteNonQuery();
Console.WriteLine("Parquet file loaded into DuckDB.");
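Creating a table copies the data into DuckDB. When a single pass over the file is enough, the file can also be queried in place. The sketch below is one way to do that under a few assumptions: it writes its own small file (the hypothetical demo_direct.parquet) with DuckDB's COPY statement so that it runs without external data, and it uses read_parquet, the function for which parquet_scan is an alias.

```csharp
using System;
using DuckDB.NET.Data;

// Hypothetical file name used only for this sketch.
const string path = "demo_direct.parquet";

using var connection = new DuckDBConnection("DataSource=:memory:");
connection.Open();
using var command = connection.CreateCommand();

// Write a five-row Parquet file so the example is self-contained.
command.CommandText =
    $"COPY (SELECT i AS id, i * 2 AS doubled FROM range(5) t(i)) TO '{path}' (FORMAT PARQUET)";
command.ExecuteNonQuery();

// Query the file directly: no CREATE TABLE step is involved.
command.CommandText = $"SELECT COUNT(*) FROM read_parquet('{path}')";
var rowCount = Convert.ToInt64(command.ExecuteScalar());
Console.WriteLine($"Rows in {path}: {rowCount}");
```

With your own data, only the read_parquet query is needed; replace the path with the location of your file.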
Querying Data with DuckDB
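// Reuse the command from the previous step to fetch the first ten rows.
command.CommandText = "SELECT * FROM sample LIMIT 10";
using var reader = command.ExecuteReader();
while (reader.Read())
{
    // Print the first two columns; this assumes the file has at least two.
    Console.WriteLine($"{reader[0]}, {reader[1]}");
}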
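The loop above prints only the first two columns. When the schema is not known in advance, the standard ADO.NET reader members FieldCount, GetName, and GetValue can print every column with a header row. The sketch below uses inline stand-in data (the column names id and label are invented) so that it runs on its own; in the project above, the query would be the SELECT against the sample table.

```csharp
using System;
using System.Linq;
using DuckDB.NET.Data;

using var connection = new DuckDBConnection("DataSource=:memory:");
connection.Open();
using var command = connection.CreateCommand();

// Stand-in data for this sketch; swap in "SELECT * FROM sample LIMIT 10" in the real project.
command.CommandText = "SELECT * FROM (VALUES (1, 'a'), (2, 'b')) AS t(id, label) LIMIT 10";
using var reader = command.ExecuteReader();

// Header row built from the column names of the result set.
var columnCount = reader.FieldCount;
Console.WriteLine(string.Join(", ", Enumerable.Range(0, columnCount).Select(i => reader.GetName(i))));

// One output line per row, covering every column regardless of how many there are.
var rowsPrinted = 0;
while (reader.Read())
{
    Console.WriteLine(string.Join(", ", Enumerable.Range(0, columnCount).Select(i => reader.GetValue(i))));
    rowsPrinted++;
}
```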
Conclusion
In this blog post, we explored how to leverage the power of DuckDB and .NET Core to read and query Parquet files effortlessly. DuckDB’s lightweight yet robust analytical capabilities, combined with the flexibility of .NET Core, provide an excellent solution for developers working with large datasets and columnar storage formats.
We started by creating a .NET Core project in Visual Studio 2022 and adding the DuckDB.NET and Parquet.Net NuGet packages. Then, we demonstrated how to:
- Load a Parquet file into DuckDB with just a few lines of code.
- Query the loaded data using SQL within DuckDB.
This workflow eliminates the need for external database systems, making it perfect for tasks such as:
- Prototyping: Quickly load and analyze data without setting up a full database.
- Data Exploration: Perform ad hoc SQL queries on large datasets stored in Parquet format.
- Integration: Easily integrate Parquet data into larger .NET Core applications or APIs.
DuckDB’s ability to directly operate on Parquet files enhances productivity while maintaining exceptional performance. Combined with .NET Core’s versatility, this pairing becomes a powerful tool for developers seeking a scalable and efficient solution for analytical workloads.
Next Steps
Here are some ideas to extend what you’ve learned:
- Integrate this setup into a web API for real-time analytics.
- Explore DuckDB’s support for other file formats like CSV or JSON.
- Experiment with more complex analytical queries and performance optimizations.
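As a possible starting point for the last idea, here is a small sketch of an aggregation run straight against a Parquet file. Everything about the data is an assumption made for illustration: the file name demo_sales.parquet and the columns region and amount are invented, and the file is generated by the sketch itself so that it runs without external data.

```csharp
using System;
using DuckDB.NET.Data;

// Hypothetical file name and columns, used only for this sketch.
const string path = "demo_sales.parquet";

using var connection = new DuckDBConnection("DataSource=:memory:");
connection.Open();
using var command = connection.CreateCommand();

// Generate a three-row Parquet file to aggregate over.
command.CommandText =
    "COPY (SELECT * FROM (VALUES ('north', 10), ('north', 15), ('south', 7)) AS t(region, amount)) " +
    $"TO '{path}' (FORMAT PARQUET)";
command.ExecuteNonQuery();

// Total and order count per region, read directly from the file.
// The cast keeps the sum a 64-bit integer so GetInt64 can read it.
command.CommandText =
    "SELECT region, SUM(amount)::BIGINT AS total, COUNT(*) AS orders " +
    $"FROM read_parquet('{path}') GROUP BY region ORDER BY region";

using var reader = command.ExecuteReader();
long grandTotal = 0;
while (reader.Read())
{
    var total = reader.GetInt64(1);
    grandTotal += total;
    Console.WriteLine($"{reader.GetString(0)}: total={total}, orders={reader.GetInt64(2)}");
}
```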
With DuckDB and .NET Core in your toolbox, you’re well-equipped to tackle data processing challenges with speed and simplicity. Happy coding!