SQL Analytics at Lightning Speed: Getting Started with DuckDB in Python

Learn how to use DuckDB in Python for lightning-fast SQL analytics on CSV, Parquet, and JSON files. Covers installation, querying, hybrid Pandas/Polars workflows, and performance tips.
10 min read · Updated May 2026 · Database


If you work with data in Python, you've probably spent way too much time waiting for Pandas to grind through a CSV file. You know the drill — make coffee while read_csv() does its thing, then more coffee while groupby().agg() churns away.

Here's the thing: it doesn't have to be this way. In fact, by 2026, it really shouldn't be.

DuckDB is an embedded analytical database: one pip install, and you can immediately start querying CSV, Parquet, and JSON files using plain SQL. No servers. No configuration files. No Docker containers. Just pure, fast analytical queries that regularly outperform Pandas by 4–10x on real workloads.

In this tutorial, I'll walk you through everything you need to start using DuckDB effectively — from basic queries to hybrid workflows that combine the best of DuckDB, Pandas, and Polars.

What Exactly Is DuckDB?

DuckDB is often called "SQLite for analytics." That comparison works because both are embedded databases — they live inside your application process rather than running as a separate server. But while SQLite is built for transactional workloads (think: your phone's contact list), DuckDB is purpose-built for analytical queries — the kind you'd run in a data warehouse.

Under the hood, DuckDB uses a columnar storage engine (data is stored column-by-column rather than row-by-row), which makes aggregations and scans dramatically faster. It's written in C++ with a Python client that's dead simple to use. And because it's vectorized and multi-threaded by default, it saturates your CPU cores without you lifting a finger.

The real headline feature, though, is this: you can point DuckDB at a CSV or Parquet file sitting on your disk and run SQL against it without importing anything first. No read_csv(), no schema inference headaches, no memory ballooning. Just write a query and get results.

Installation

Nothing fancy here — just the usual pip install:

pip install duckdb

If you want to follow along with the hybrid workflow examples later, grab Pandas and Polars while you're at it:

pip install duckdb pandas polars pyarrow

That's it. You're ready. No Docker, no Postgres connection string, no cloud account. Let's write some queries.

Getting Started: Your First DuckDB Queries

Fire up a Python shell or Jupyter notebook and create a connection. The default duckdb.connect() gives you an in-memory database — perfect for quick analysis:

import duckdb

conn = duckdb.connect()  # in-memory database

Let's create a small table from scratch to get a feel for the API:

conn.execute("""
    CREATE TABLE employees (
        id INTEGER,
        name VARCHAR,
        department VARCHAR,
        salary DECIMAL(10, 2)
    )
""")

conn.execute("""
    INSERT INTO employees VALUES
    (1, 'Alice', 'Engineering', 95000),
    (2, 'Bob', 'Engineering', 87000),
    (3, 'Charlie', 'Marketing', 72000),
    (4, 'Diana', 'Marketing', 78000),
    (5, 'Eve', 'Engineering', 105000),
    (6, 'Frank', 'Sales', 65000),
    (7, 'Grace', 'Sales', 71000)
""")

The fetchdf() method returns results as a Pandas DataFrame — DuckDB plays nicely with the Python data stack by default:

result = conn.execute("""
    SELECT * FROM employees ORDER BY salary DESC
""").fetchdf()

print(result)

Grouping and aggregation are just as straightforward:

conn.execute("""
    SELECT 
        department, 
        ROUND(AVG(salary), 2) AS avg_salary,
        COUNT(*) AS headcount
    FROM employees 
    GROUP BY department 
    ORDER BY avg_salary DESC
""").fetchdf()

So far, this looks like any SQL database. But the real magic starts when you stop importing data and start querying files directly.

The Killer Feature: Query Files Directly — No Import Required

This is where DuckDB genuinely changes how you work. Let me show you what I mean.

I generated a 500,000-row sales dataset — the kind of CSV that makes Pandas sweat a little. It's about 35 MB on disk with columns for order ID, region, product, category, quantity, unit price, and order date.

Here's the DuckDB way to analyze it:

result = conn.execute("""
    SELECT 
        region,
        category,
        COUNT(*) AS num_orders,
        ROUND(SUM(total_amount), 2) AS revenue,
        ROUND(AVG(total_amount), 2) AS avg_order_value
    FROM read_csv('sales_data.csv', auto_detect = true)
    GROUP BY region, category
    ORDER BY revenue DESC
    LIMIT 10
""").fetchdf()

Notice what didn't happen here. I didn't call pd.read_csv(). I didn't check dtypes. I didn't worry about memory. I just wrote a SQL query that references a file path, and DuckDB handled schema detection, type inference, and query execution in one shot.

The result came back in 0.075 seconds:

  region     category  num_orders       revenue  avg_order_value
0  North       Office       50199  1.274445e+08          2538.79
1  South  Electronics       50228  1.271582e+08          2531.62
2   East  Electronics       49780  1.258078e+08          2527.28
3   West       Office       49771  1.257954e+08          2527.48
4   East       Office       49901  1.257002e+08          2518.99

This one capability — running SQL directly against files — eliminates the entire "load then query" workflow that dominates most Python data scripts. And it's not limited to CSV. DuckDB reads Parquet, JSON, and even queries files over HTTP/S3 natively.

DuckDB vs Pandas: Let's Talk Performance

I ran the exact same aggregation on the same 500K-row CSV file using both DuckDB and Pandas. Here's what happened:

# Pandas approach
import pandas as pd

df = pd.read_csv('sales_data.csv')
pandas_result = (df.groupby(["region", "category"])
                 .agg(num_orders=("order_id", "count"),
                      revenue=("total_amount", "sum"),
                      avg_order_value=("total_amount", "mean"))
                 .sort_values("revenue", ascending=False)
                 .head(10))
Method                        Time     Speedup
Pandas (read_csv + groupby)   0.343s   baseline
DuckDB (direct query)         0.075s   4.6× faster

That's a nearly 5x speedup on a modest dataset. On larger files — the kind where Pandas starts pushing against your RAM ceiling — the gap widens significantly. DuckDB's columnar engine only needs to scan the columns referenced in your query, while Pandas loads the entire DataFrame into memory before doing anything.

Now, to be fair: if you're working with tiny datasets (a few thousand rows), the difference won't matter much. Pandas is perfectly fine for that. But the moment you're dealing with anything north of 100K rows, DuckDB starts pulling away fast.

Parquet: Even Faster, Even Smaller

If you're still storing analytical data as CSV in 2026, let me gently suggest you switch to Parquet. It's a columnar file format that compresses better and reads faster — and DuckDB absolutely flies on it.

That 35 MB CSV I mentioned earlier? As Parquet, it shrinks to 11 MB. And queries get even faster because DuckDB can skip entire columns it doesn't need:
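If you want to do the CSV-to-Parquet conversion yourself, it's a single COPY statement. Here's a self-contained sketch using a tiny stand-in file (the article's sales_data.csv converts the same way — just swap the paths):

```python
import csv

import duckdb

# Tiny stand-in CSV (substitute your own file path)
with open("mini_sales.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["region", "total_amount"])
    w.writerows([["North", 10.5], ["South", 20.0]])

# One statement: read the CSV, write Parquet; DuckDB infers the schema
duckdb.sql("""
    COPY (SELECT * FROM read_csv('mini_sales.csv'))
    TO 'mini_sales.parquet' (FORMAT PARQUET)
""")

# Verify the round-trip by querying the new Parquet file
n = duckdb.sql("SELECT COUNT(*) FROM 'mini_sales.parquet'").fetchone()[0]
print(n)  # 2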

result = conn.execute("""
    SELECT 
        product,
        ROUND(SUM(total_amount), 2) AS total_revenue,
        COUNT(*) AS units_sold,
        ROUND(AVG(quantity), 1) AS avg_qty_per_order
    FROM read_parquet('sales_data.parquet')
    GROUP BY product
    ORDER BY total_revenue DESC
""").fetchdf()

This query ran in 0.015 seconds — 5× faster than the already-fast CSV query. DuckDB can do this because Parquet files contain embedded statistics (min/max values per column chunk), allowing the engine to skip entire sections of the file that can't possibly match your query.

Window Functions: Ranking Without the Headache

One thing SQL has always done better than Pandas is window functions. Here's how to find the top 3 products by revenue in each region — a query that would take several chained Pandas operations:

result = conn.execute("""
    WITH ranked AS (
        SELECT 
            region,
            product,
            ROUND(SUM(total_amount), 2) AS revenue,
            ROW_NUMBER() OVER (
                PARTITION BY region 
                ORDER BY SUM(total_amount) DESC
            ) AS rank
        FROM read_parquet('sales_data.parquet')
        GROUP BY region, product
    )
    SELECT * FROM ranked WHERE rank <= 3
    ORDER BY region, rank
""").fetchdf()

The CTE (Common Table Expression) with ROW_NUMBER() OVER (PARTITION BY ...) is clean and readable. No multi-step index manipulation, no transform chains — just a single declarative query that reads like English.
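For comparison, here's roughly what the same top-N-per-group logic looks like in Pandas, sketched on a hypothetical mini version of the aggregated data (region, product, revenue) so it runs standalone:

```python
import pandas as pd

# Hypothetical mini version of the already-aggregated revenue data
agg = pd.DataFrame({
    "region": ["North", "North", "North", "South", "South"],
    "product": ["A", "B", "C", "A", "B"],
    "revenue": [300.0, 200.0, 100.0, 250.0, 150.0],
})

# Pandas equivalent of ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC):
# sort by revenue, then number the rows within each region group
agg["rank"] = (agg.sort_values("revenue", ascending=False)
                  .groupby("region")
                  .cumcount() + 1)
top = agg[agg["rank"] <= 2].sort_values(["region", "rank"])
print(top)
```

It works, but the sort/groupby/cumcount dance is exactly the kind of multi-step index manipulation the SQL version avoids.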

The Hybrid Workflow: DuckDB + Pandas + Polars

Here's where things get interesting. DuckDB doesn't replace Pandas or Polars — it complements them. The sweet spot is using each tool for what it does best:

  1. DuckDB for the heavy lifting — filtering, aggregating, and joining large files
  2. Pandas for quick transformations and pivot tables on summarized data
  3. Polars for lightning-fast final polishing when you need maximum performance

Here's a concrete example. First, DuckDB crunches through 500K rows and spits out a monthly summary:

# Step 1: DuckDB does the heavy aggregation (500K → 216 rows)
summary = conn.execute("""
    SELECT 
        region,
        category,
        DATE_TRUNC('month', order_date) AS month,
        COUNT(*) AS order_count,
        ROUND(SUM(total_amount), 2) AS monthly_revenue
    FROM read_parquet('sales_data.parquet')
    GROUP BY region, category, DATE_TRUNC('month', order_date)
""").fetchdf()
# DuckDB aggregation: 0.041s → 216 rows

Then Pandas takes over for a pivot table — something it handles beautifully on small data:

# Step 2: Pandas pivots the summary
pivot = summary.pivot_table(
    index="month",
    columns="region",
    values="monthly_revenue",
    aggfunc="sum"
).round(2)
# Pandas pivot: 0.005s

And if you need to squeeze out every last millisecond, Polars can take the baton:

# Step 3: Polars for final formatting
import polars as pl

pl_df = pl.from_pandas(summary)
top_month = (pl_df
             .group_by("region")
             .agg(pl.col("monthly_revenue").max().alias("best_month_revenue"))
             .sort("best_month_revenue", descending=True))
# Polars processing: 0.018s

The whole pipeline — 500K rows aggregated, pivoted, and polished — completes in under 0.07 seconds. Each tool handled exactly the part it excels at, and you didn't have to commit to any single framework for the entire workflow.

Persistent Databases: When You Need to Keep the Data Around

So far we've used in-memory connections. But DuckDB also supports file-based persistent databases — just pass a path to connect():

persistent_conn = duckdb.connect('analytics.duckdb')

# Create a table directly from a Parquet file
persistent_conn.execute("""
    CREATE OR REPLACE TABLE sales AS 
    SELECT * FROM read_parquet('sales_data.parquet')
""")

# Verify it's there
row_count = persistent_conn.execute(
    "SELECT COUNT(*) FROM sales"
).fetchone()[0]
print(f"Sales table: {row_count:,} rows")
# Sales table: 500,000 rows

persistent_conn.close()

The resulting analytics.duckdb file is only about 5 MB — smaller than both the CSV (35 MB) and the Parquet (11 MB) because DuckDB applies its own compression on top. You can share this file, back it up, or query it again later without re-ingesting the original data.

This is incredibly useful for iterative analysis. Load once, query many times — and close the connection when you're done without losing anything.

Exporting Results

Once you've done your analysis, DuckDB makes it trivial to export results in whatever format you need:

# Export to CSV
conn.execute("""
    COPY (
        SELECT region, product, ROUND(SUM(total_amount), 2) AS revenue
        FROM read_parquet('sales_data.parquet')
        GROUP BY region, product
        ORDER BY revenue DESC
    ) TO 'revenue_summary.csv' (HEADER, DELIMITER ',')
""")

# Export to Parquet
conn.execute("""
    COPY (
        SELECT region, product, ROUND(SUM(total_amount), 2) AS revenue
        FROM read_parquet('sales_data.parquet')
        GROUP BY region, product
        ORDER BY revenue DESC
    ) TO 'revenue_summary.parquet' (FORMAT PARQUET)
""")

The COPY command handles the serialization efficiently — no loop writing rows one at a time.

When Should You Actually Use DuckDB?

I'm not going to tell you to throw away Pandas. That would be bad advice. Here's my honest take on when each tool makes sense:

Scenario                                         Best Tool
Quick exploration of a small CSV (< 10K rows)    Pandas
Aggregating a large CSV/Parquet file             DuckDB
Joining multiple large files                     DuckDB
Complex statistical modeling                     Pandas + scikit-learn
ETL pipelines on structured data                 DuckDB → Polars
BI-style reporting from files                    DuckDB
Machine learning preprocessing                   DuckDB → Pandas

The pattern to internalize is this: use DuckDB to shrink your data, then hand off a manageable DataFrame to Pandas or Polars for the parts they excel at.

Wrapping Up

DuckDB solves a problem that's been nagging Python data folks for years: the awkward gap between "this file is too big for Pandas to handle comfortably" and "I don't want to set up a full database server."

It's not a replacement for your production Postgres instance, and it won't help you train neural networks. But for the vast middle ground of data analysis — where you're wrangling CSV files, joining datasets, and building reports — it's genuinely transformative.

The fact that you can pip install this kind of performance and just start writing SQL against files sitting on your laptop is kind of remarkable. In 2026, there's very little reason to keep suffering through slow Pandas aggregations when DuckDB is a single import away.
