DuckDB is an embedded analytical database that runs entirely in-process, similar to SQLite but optimized for OLAP workloads. It is perfect for analytics on VPS instances without running a separate database server.
Installing DuckDB
# Install via pip (Python)
pip install duckdb
# Or download the CLI
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
sudo mv duckdb /usr/local/bin/
Querying Data Files Directly
# Query CSV files directly
duckdb -c "SELECT country, COUNT(*) as cnt FROM read_csv_auto('access_log.csv') GROUP BY country ORDER BY cnt DESC LIMIT 10;"
# Query Parquet files
duckdb -c "SELECT * FROM read_parquet('data/*.parquet') WHERE date > '2024-01-01';"
# Create persistent database
duckdb analytics.db
CREATE TABLE events AS SELECT * FROM read_csv_auto('events.csv');
.tables
.exit
Integration with Python
import duckdb
# Query directly from Python
result = duckdb.sql("""
SELECT date_trunc('hour', timestamp) as hour,
COUNT(*) as requests
FROM read_csv_auto('access.log')
GROUP BY 1 ORDER BY 1
""").fetchdf()
print(result)
Summary
DuckDB is an excellent choice for analytics workloads on VPS instances where you want powerful SQL capabilities without the overhead of running a separate database server.