DuckDB for Data Analytics

Posted on Feb 21, 2024

Saw an intriguing comment from Zetta Venture Partners in their 2024 “Unveiling AI Predictions for the Year Ahead” here that “DuckDB uses serverless OLAP to turn your laptop into a personal analytics engine”. Fast forward a few months and I can say it’s been a great experience adopting DuckDB in data analytics tasks.

I’ve always resorted to Excel (or Sheets) when analyzing data for whatever purpose. If it’s a one-off task I think there’s little that can beat the ease of use, the relatively frictionless path to a result, and the speed with which data can be cleaned, re-shaped and merged.

But when the analytics task will be repeated cyclically on refreshed data or there’s a need to iterate on the design then the limited scope within spreadsheets to encapsulate code segments and design for drop-in re-usability is an impediment to an efficient workflow.

Fast forward a few months and I can say that using DuckDB in various data-crunching applications has generally been a breeze. Clearly most data, certainly in a finance context, is suited to database table and row constructs–and applying that architecture explicitly in managing your data turns out to be much more intuitive. And all the analytics tools, such as the pivot table, are similarly as easy to apply in the database environment, albeit needing a little competence in SQL. A great pair programming resource for SQL is ChatGPT, as long as it is used appropriately in the design process.

Couple the improved experience of managing data with access to the Python ecosystem for data computation and visualization and it’s a really compelling everyday solution for applications in which the user would traditionally default to spreadsheets. Python also brings the terrific Streamlit app environment and the ability to deploy the analytics app easily to other users.

Lastly, the challenge of managing version control is left to git. I’ve never found Google Sheets versioning to be that usable in practice, and with Excel it was a case of hardcoding versioning in filenames.

DuckDB brings a multitude of other benefits and advantages over existing database technology solutions, from handling massive datasets, to parallel execution, to remote endpoint access and so on, much of which I will never use. But in enabling a replacement workflow to spreadsheets for everyday analytics it’s really compelling.