Recently, I stumbled across the Parquet file format. I know that folks who work with data are acquainted with Parquet and its benefits, but I wanted to explore this from an app development perspective.

Quoting their documentation:

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

I have had some interesting experiences with ClickHouse, a columnar database, so the columnar model was already familiar territory. Parquet also has Python libraries (pyarrow, for instance) that I can use to build applications.

The REIT Data Challenge

On a side note, I have been thinking about a better way to calculate returns from my REIT investments. Most Indian portals do not provide a clear picture for REITs, since there are only four listed REITs in India. I am never sure whether the numbers I am looking at include income distributions or not. It made sense to build a simple web app to display REIT returns. So I decided to look for an API, or scrape the REITs' official websites, to get distribution data.

I did just that. I tried using yfinance, but the data on Yahoo Finance for Indian REITs was completely incorrect. The official websites were the only reliable source I could find.

It seems that Yahoo Finance has since fixed its data.

Building with Parquet Instead of Traditional Databases

So I built a simple web app, but instead of serving the data from a traditional database, I am using a Parquet file to store the historical data. While looking around, I found DuckDB, an in-process SQL engine that can run queries directly against Parquet files.
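
As a minimal sketch, here is what querying a Parquet file with DuckDB looks like in Python. The file name and columns (symbol, pay_date, amount) are hypothetical stand-ins for my actual schema:

```python
import duckdb

# DuckDB can treat a Parquet file as a table directly in the FROM clause;
# no server, no import step. The file and its columns are hypothetical.
con = duckdb.connect()  # in-memory database
rows = con.execute("""
    SELECT symbol, SUM(amount) AS total_distribution
    FROM 'distributions.parquet'
    WHERE pay_date >= DATE '2023-01-01'
    GROUP BY symbol
    ORDER BY total_distribution DESC
""").fetchall()
print(rows)
```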

The combination of Parquet files and DuckDB offers several advantages for application development. The columnar storage format of Parquet provides excellent compression ratios and fast query performance, especially for analytical workloads. DuckDB acts as a lightweight SQL engine that can directly query these files without the overhead of setting up and maintaining a full database server.

This approach is particularly useful for applications that deal with time-series data, financial information, or any scenario where you need to perform analytical queries on structured data. For my REIT tracking application, it means I can store historical distribution data efficiently and query it using familiar SQL syntax without the complexity of database administration.

The setup is remarkably simple. You can read Parquet files into pandas DataFrames, manipulate the data as needed, and then use DuckDB to run complex queries that would typically require a full-featured database system. This makes it an excellent choice for prototyping, small to medium-scale applications, or scenarios where you want to avoid database infrastructure overhead.
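
A rough sketch of that round trip, with made-up distribution figures (writing Parquet via pandas requires pyarrow or fastparquet to be installed):

```python
import duckdb
import pandas as pd

# Made-up distribution records; the real data comes from the REITs'
# official websites.
df = pd.DataFrame({
    "symbol": ["EMBASSY", "MINDSPACE", "EMBASSY"],
    "pay_date": pd.to_datetime(["2024-02-15", "2024-03-10", "2024-05-20"]),
    "amount": [5.53, 4.80, 5.22],
})
df.to_parquet("distributions.parquet")  # needs pyarrow or fastparquet

# DuckDB can query the file (or even the DataFrame itself by name)
# and hand the result back as a pandas DataFrame.
result = duckdb.sql("""
    SELECT symbol, AVG(amount) AS avg_payout
    FROM 'distributions.parquet'
    GROUP BY symbol
""").df()
print(result)
```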

For developers coming from traditional database backgrounds, this combination offers a refreshing alternative that maintains the power of SQL while simplifying deployment and maintenance requirements.

Why not SQLite?

SQLite would fit the bill here… mostly. I initially picked Parquet and DuckDB as a learning exercise. However, over the course of building the app, I realized that Parquet and DuckDB offer several advantages:

  • The columnar storage format of Parquet is inherently better suited for analytical queries on time-series data. When calculating REIT returns over different time periods, I often need to aggregate data by date ranges or perform calculations across specific columns. Parquet's columnar structure means these operations are significantly faster, since only the required columns are read from disk (see the sketch after this list).
  • The compression benefits of Parquet are substantial. Financial data often contains repetitive patterns and null values, which Parquet handles very efficiently. My REIT data files are roughly 60-70% smaller than equivalent SQLite databases, which matters when you’re storing years of historical distribution data.
  • I can easily export data to various analytics tools, share files with other developers, or even load the same Parquet files into different systems like Pandas, Spark, or cloud analytics platforms without conversion. SQLite, while portable, doesn’t offer this same level of ecosystem integration.
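
For instance, a per-year distribution total touches only three columns, so DuckDB reads just those column chunks from the Parquet file rather than whole rows. A sketch, reusing the hypothetical schema from above:

```python
import duckdb

# Aggregate distributions by year per REIT; only the symbol, pay_date,
# and amount columns are read from disk, not entire rows.
yearly = duckdb.sql("""
    SELECT symbol,
           date_trunc('year', pay_date) AS year,
           SUM(amount) AS total_distribution
    FROM 'distributions.parquet'
    GROUP BY symbol, year
    ORDER BY symbol, year
""").df()
print(yearly)
```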

That said, SQLite would still be my go-to choice for transactional applications or when I need ACID compliance. But for this analytical workload, focused on historical data and return calculations, the Parquet and DuckDB combination proved superior.

Now check out the app at https://trackreit.in/. Note that my UX skills still need improvement.
