Overview
Topic: Using Postgres or CSV to store stock/crypto/commodity price data? | Category: Python | Collection: Algorithmic Trading with Python
This guide shows when to use CSV vs Postgres, a minimal working example in Python for both, a quickstart with numbered steps, pitfalls, performance notes, and a short FAQ.
When to use CSV vs Postgres
| Use case | CSV | Postgres |
|---|---|---|
| Data volume | Small to moderate (≤ a few million rows total across files) | Large (tens of millions+), long history, many symbols |
| Concurrency | Single user/process | Multiple writers/readers, backtests + live ingestion |
| Queries | Sequential reads, simple filters | Complex filters, joins, aggregations, rolling windows |
| Data integrity | Minimal (no schema, easy to corrupt) | Strong (constraints, transactions, indexes) |
| Tooling | Dead simple, no server | Requires DB server, migrations |
Rule of thumb:
- Prototyping or single-strategy research: CSV is fine.
- Production or multi-strategy research on intraday data: Postgres is safer and faster for queries.
Minimal working example (CSV and Postgres)
The code below creates a tiny OHLCV DataFrame, writes/reads CSV, and persists/queries the same data in Postgres.
Requirements:
pip install pandas sqlalchemy "psycopg[binary]"
# file: demo_storage.py
import pandas as pd
from datetime import datetime, timezone
from sqlalchemy import create_engine
# 1) Create a tiny OHLCV DataFrame
rows = [
    {"symbol": "BTC-USD", "timeframe": "1h", "ts": datetime(2024, 1, 1, 0, tzinfo=timezone.utc),
     "open": 42000.0, "high": 42500.0, "low": 41800.0, "close": 42300.0, "volume": 12.3},
    {"symbol": "BTC-USD", "timeframe": "1h", "ts": datetime(2024, 1, 1, 1, tzinfo=timezone.utc),
     "open": 42300.0, "high": 42700.0, "low": 42050.0, "close": 42650.0, "volume": 9.7},
]
df = pd.DataFrame(rows)
# 2) CSV: write and read
csv_path = "prices.csv"
df.to_csv(csv_path, index=False)
print(f"Wrote {csv_path}")
loaded_csv = pd.read_csv(csv_path, parse_dates=["ts"])
print("CSV loaded rows:", len(loaded_csv))
# 3) Postgres: connect, create table, insert, query
# Ensure a local Postgres is running and accessible by this connection string.
engine = create_engine("postgresql+psycopg://postgres:postgres@localhost:5432/postgres", future=True)
create_sql = """
create table if not exists candles (
symbol text not null,
timeframe text not null,
ts timestamptz not null,
open double precision not null,
high double precision not null,
low double precision not null,
close double precision not null,
volume double precision not null,
primary key (symbol, timeframe, ts)
);
"""
with engine.begin() as conn:
    conn.exec_driver_sql(create_sql)

# Insert using pandas to_sql for simplicity (fine for small batches).
# Note: re-running this script violates the primary key; truncate the table first
# or switch to COPY / an ON CONFLICT upsert for repeatable loads.
df.to_sql("candles", engine, if_exists="append", index=False, method="multi")
# Query back
queried = pd.read_sql("select * from candles where symbol='BTC-USD' order by ts", engine)
print("Postgres loaded rows:", len(queried))
print(queried)
Notes:
- The example uses timestamptz so timestamps are stored as UTC. Keep all times in UTC for trading.
- For large batch loads, prefer Postgres COPY (see Performance Notes).
Quickstart
- Decide storage:
- CSV for small, single-user research; Postgres for serious, multi-symbol, intraday histories.
- Install dependencies:
- CSV only: pandas
- Postgres: Postgres server, plus pandas, SQLAlchemy, psycopg (driver)
- Design schema (Postgres):
- One table for candles with a composite primary key: (symbol, timeframe, ts).
- Create table:
create table if not exists candles (
symbol text not null,
timeframe text not null,
ts timestamptz not null,
open double precision not null,
high double precision not null,
low double precision not null,
close double precision not null,
volume double precision not null,
primary key (symbol, timeframe, ts)
);
-- Optional indexes to accelerate filters (composite PK already covers many queries):
create index if not exists candles_symbol_ts on candles(symbol, ts desc);
- Ingest data:
- CSV: write one file per symbol/timeframe per day or per month with df.to_csv.
- Postgres: batch via to_sql for prototypes; switch to COPY for production.
- Query:
- CSV: read specific files into pandas; concatenate and filter.
- Postgres: push filters into SQL (WHERE on symbol, timeframe, and a ts range), then fetch only the window you need; see the sketch below.
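For the query step, here is a minimal sketch of pushing the filter into SQL, assuming the candles table and local connection string from the demo above (adjust both to your setup):
import pandas as pd
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://postgres:postgres@localhost:5432/postgres", future=True)

# Parameterized query: only the requested symbol/timeframe/time window leaves the database.
query = text("""
    select ts, open, high, low, close, volume
    from candles
    where symbol = :symbol
      and timeframe = :timeframe
      and ts >= :start
      and ts < :end
    order by ts
""")

params = {
    "symbol": "BTC-USD",
    "timeframe": "1h",
    "start": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "end": datetime(2024, 2, 1, tzinfo=timezone.utc),
}

with engine.connect() as conn:
    window = pd.read_sql(query, conn, params=params)
print("Rows in window:", len(window))
Because the filter runs in Postgres, only the requested window is materialized in pandas, which keeps backtest memory bounded.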
Recommended file layout (CSV)
- Use one directory per asset class and exchange (e.g., crypto/binance, stocks/nasdaq).
- Partition by symbol and timeframe; keep rolling files by month.
- Example path: data/crypto/binance/BTC-USD/1h/2024-01.csv
- Include header row; use ISO-8601 UTC timestamps.
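A short sketch of writing and reading one monthly partition in this layout; the data/ root and the monthly_csv_path helper are illustrative names, not a fixed convention:
from pathlib import Path
import pandas as pd

def monthly_csv_path(root: str, asset_class: str, exchange: str,
                     symbol: str, timeframe: str, month: str) -> Path:
    # Builds e.g. data/crypto/binance/BTC-USD/1h/2024-01.csv
    return Path(root) / asset_class / exchange / symbol / timeframe / f"{month}.csv"

month_df = pd.DataFrame([
    {"ts": "2024-01-01T00:00:00+00:00", "open": 42000.0, "high": 42500.0,
     "low": 41800.0, "close": 42300.0, "volume": 12.3},
])

path = monthly_csv_path("data", "crypto", "binance", "BTC-USD", "1h", "2024-01")
path.parent.mkdir(parents=True, exist_ok=True)
month_df.to_csv(path, index=False)      # header row + ISO-8601 UTC timestamps

loaded = pd.read_csv(path, parse_dates=["ts"])
print(loaded.dtypes)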
Postgres tips for schema and queries
- Keep a compact row: symbol, timeframe, ts, ohlcv. Avoid JSON for hot paths.
- Use timestamptz and store UTC.
- Primary key on (symbol, timeframe, ts) prevents duplicates and speeds typical queries.
- Consider BRIN indexes for very large append-only time series where rows arrive roughly in ts order.
- For extremely large datasets, partition by timeframe and month (range partition on ts) or by symbol groups; a sketch of both tips follows this list.
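A sketch of the BRIN and partitioning tips, assuming the candles table from the Quickstart. The partitioned table candles_part and its monthly partition are illustrative; in practice you would create the partitioned table instead of, not alongside, the flat one:
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg://postgres:postgres@localhost:5432/postgres", future=True)

statements = [
    # BRIN index: tiny and effective when rows are inserted roughly in ts order.
    "create index if not exists candles_ts_brin on candles using brin (ts)",
    # Range partitioning must be declared at table creation time, so this is a
    # separate table here rather than an ALTER of the existing candles table.
    """
    create table if not exists candles_part (
        symbol text not null,
        timeframe text not null,
        ts timestamptz not null,
        open double precision not null,
        high double precision not null,
        low double precision not null,
        close double precision not null,
        volume double precision not null,
        primary key (symbol, timeframe, ts)
    ) partition by range (ts)
    """,
    # One partition per month; add more as new months arrive.
    """
    create table if not exists candles_part_2024_01
        partition of candles_part
        for values from ('2024-01-01') to ('2024-02-01')
    """,
]

with engine.begin() as conn:
    for stmt in statements:
        conn.exec_driver_sql(stmt)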
Performance notes
- CSV
- Reading: pass explicit dtype hints, parse_dates only for the timestamp column, and usecols to skip columns you don't need.
- Use chunked reads for large files: pd.read_csv(..., chunksize=1_000_000).
- Compression: gzip saves space; pandas can read/write .csv.gz directly.
- Postgres
- Batch inserts: prefer COPY over row-by-row inserts. Convert the DataFrame to CSV in memory and COPY it into the table (see the sketch after this list).
- Indexes: create only those you use; each index slows inserts.
- Connection pooling: reuse engines/cursors for ingestion jobs.
- Hardware: fast disks and sufficient shared_buffers help; keep autovacuum enabled.
- Query pushdown: filter by symbol, timeframe, and ts range in SQL to minimize data transfer.
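A minimal sketch of the in-memory CSV + COPY path, assuming psycopg 3 as the driver and the candles table from the example; the copy_candles helper name is illustrative:
import io
import pandas as pd

def copy_candles(engine, df: pd.DataFrame, table: str = "candles") -> None:
    # Serialize the DataFrame to CSV in memory, then stream it through COPY,
    # which avoids per-row INSERT overhead for large batches.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    cols = ", ".join(df.columns)
    raw = engine.raw_connection()   # underlying psycopg connection
    try:
        with raw.cursor() as cur:
            with cur.copy(f"copy {table} ({cols}) from stdin with (format csv)") as copy:
                copy.write(buf.getvalue())
        raw.commit()
    finally:
        raw.close()

# Usage, reusing engine and df from demo_storage.py:
# copy_candles(engine, df)
Note that COPY still fails on duplicate keys, so de-duplicate each batch first or COPY into a staging table and upsert from there.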
Common pitfalls
- Timezones: mixing naive datetimes, exchange local times, and UTC creates bugs. Always store UTC (timestamptz) and convert at the edges.
- Duplicates: vendor redeliveries can produce duplicate candles. Use a primary key (Postgres) or de-duplicate in pandas before writing CSV (see the sketch after this list).
- Floating-point rounding: storing OHLCV as double-precision floats is usually fine; for money-sensitive PnL, use Decimal in Python or numeric in Postgres for monetary aggregates.
- CSV corruption and schema drift: CSV has no constraints. A bad line or extra column breaks parsing. Validate on write; keep strict column order.
- Transaction size: gigantic single transactions in Postgres can bloat memory. Commit in batches.
- Vacuum/maintenance: heavy updates/deletes without autovacuum tuning can degrade performance.
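A small sketch of de-duplicating a candle batch in pandas before writing it anywhere; the column names follow the schema used above:
import pandas as pd

KEY = ["symbol", "timeframe", "ts"]

def dedupe_candles(df: pd.DataFrame) -> pd.DataFrame:
    # Keep the last-delivered row for each (symbol, timeframe, ts), then sort.
    return (
        df.drop_duplicates(subset=KEY, keep="last")
          .sort_values(KEY)
          .reset_index(drop=True)
    )

# clean = dedupe_candles(raw_batch)   # raw_batch: DataFrame from your vendor feed
On the Postgres side, the composite primary key rejects duplicates outright; an INSERT ... ON CONFLICT (symbol, timeframe, ts) DO NOTHING lets redeliveries pass silently instead of erroring.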
Example workflows
- Research with daily candles for 200 symbols:
- CSV: monthly files per symbol/timeframe. Load only needed months into memory.
- Live intraday ingestion (tick/1s/1m bars) with backtests reading concurrently:
- Postgres: ingest via COPY in batches every N seconds; backtests query by time windows.
Tiny FAQ
- Should I use SQLite instead?
- SQLite is great for single-user local databases and can outperform CSV while avoiding a server. For multi-process writes and heavy parallel queries, Postgres scales better.
- How big can CSVs get before they hurt?
- Past 1–5 GB per file, seek times and parsing costs climb. Split by month or day and index filenames by date to avoid scanning huge files.
- Do I store trades or candles?
- Store raw trades if you build custom bars or need microstructure; otherwise candles are simpler and smaller. You can aggregate trades into candles in Postgres with SQL (see the sketch after this FAQ).
- What about corporate actions and splits for stocks?
- Store raw prices and a separate adjustments table; adjust on query or maintain adjusted series as another timeframe/symbol suffix.
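As a sketch of the trades-to-candles point, assuming a hypothetical trades(symbol, ts, price, size) table (adjust names to your schema), hourly candles can be built in SQL and pulled into pandas:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg://postgres:postgres@localhost:5432/postgres", future=True)

# open/close come from the first/last trade in each hourly bucket.
trades_to_1h_candles = """
select
    symbol,
    date_trunc('hour', ts)                  as bucket,
    (array_agg(price order by ts asc))[1]   as open,
    max(price)                              as high,
    min(price)                              as low,
    (array_agg(price order by ts desc))[1]  as close,
    sum(size)                               as volume
from trades
group by symbol, date_trunc('hour', ts)
order by symbol, bucket
"""

candles_1h = pd.read_sql(trades_to_1h_candles, engine)
print(candles_1h.head())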
Summary
- CSV: low-friction, good for small-scale research, but fragile for integrity and concurrency.
- Postgres: robust, queryable, and scalable for production-grade algo trading datasets.
- Start with CSV; migrate to Postgres when data volume, query complexity, or concurrency demands it.