Benchmarks


TPC-H, all 22 queries, SF=1 / SF=10 / SF=100 / SF=1000, measured two ways:

AWS cloud runs — commodity c7i.4xlarge boxes: single node vs DuckDB / Polars, and the 4-node peer mesh vs Trino / PySpark on the identical cluster — plus the 1 TB run on a memory-class box. This is the “what you get on rented hardware” view.
Single-machine strict protocol — Apple M4 Max, per-query process isolation, thermal gating, 2σ verdicts, five engines. This is the tightest-controlled per-query view.

Same Parquet files per track, shipped pip install ematix-flow defaults, no per-query tuning anywhere.

AWS cloud runs (v0.13.0, July 2026)

Unless a section below states otherwise, the AWS numbers use c7i.4xlarge (16 vCPU / 32 GB, us-east-2) with TPC-H Parquet loaded from S3. All 22 queries are value-validated, each query is measured over 5 trials × 2 warmups, and suite totals are the sum of per-query medians. ematix-flow runs the shipped pip install ematix-flow defaults, with no flags and no tuning. Raw per-query JSONs with full provenance (instance, AZ, git SHA, env) are stored with every run.
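As a rough illustration of the "sum of per-query medians" total, here is a minimal Python sketch. It assumes each query's timings are stored as a list with the warmup runs first; the function name, the data layout, and the timings are hypothetical and are not taken from the harness.

```python
from statistics import median

def suite_total(trials_by_query, warmups=2):
    """Sum of per-query medians, ignoring the leading warmup runs."""
    total = 0.0
    for query, timings in trials_by_query.items():
        timed = timings[warmups:]  # drop warmup runs, keep the timed trials
        if not timed:
            raise ValueError(f"{query}: no timed trials after {warmups} warmups")
        total += median(timed)
    return total

# Hypothetical timings in seconds: 2 warmups followed by 5 timed trials.
example = {
    "Q01": [0.90, 0.40, 0.31, 0.30, 0.33, 0.29, 0.32],
    "Q06": [0.50, 0.12, 0.10, 0.11, 0.10, 0.12, 0.10],
}
total = suite_total(example)  # 0.31 + 0.10
```

Using the median per query keeps a single slow trial from moving the total, which is why the sum is taken over medians rather than over means.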

Single node — vs DuckDB 1.5.4 / Polars 1.42.1

Scale	ematix-flow	DuckDB	Polars
SF=1	0.68 s	1.45 s	2.68 s (21/22)
SF=10	4.76 s	5.79 s	9.1 s (19/22)
SF=100	51.8 s	59.0 s	251 s (16/22)

Polars (standard build, per-query process isolation) is shown with its completion count: its totals cover only the queries it finished, so they understate its true totals — the comparison is conservative in its favor. What it can’t finish on this box is honest physics for an in-memory engine: five SF=100 queries (and two at SF=10) exceed 32 GB, and one query has no Polars SQL variant.
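The completion-count convention can be sketched as follows. This is an illustrative helper with made-up medians, assuming unfinished queries are recorded as None; it shows why a partial total is a lower bound on what the full suite would cost.

```python
def partial_total(medians):
    """Return (sum over finished queries, finished count, total count)."""
    finished = [m for m in medians.values() if m is not None]
    return sum(finished), len(finished), len(medians)

# Hypothetical medians in seconds; None marks a query that did not finish.
medians = {"Q01": 2.0, "Q05": None, "Q06": 0.5, "Q09": None}
total, done, n = partial_total(medians)
label = f"{total:.1f} s ({done}/{n})"  # "2.5 s (2/4)"
```

Any query that did not finish would add a positive time had it completed, so the reported figure can only understate the engine's true suite total.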

The SF=100 total includes the July 2026 join-side correction (default-on, no flags): Q09 — formerly a memory-cliff outlier that could swing 16–75 s on the 32 GB box — was root-caused to hash joins built on the wrong (150M/80M-row) side, whose ~12 GB peak evicted the page-cached working set. The planner now measures string-filter selectivity by sampling the actual file at plan time and builds on the small side; Q09 runs a flat 5.8 s (DuckDB: 6.2 s steady on the same box). Per query, ematix-flow is faster on 13 / 22 with 3 ties; every loss is under 0.7 s. The full diagnosis — cold-IO parity, page-cache eviction traces, before/after trials — ships with the raw run JSONs.
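The idea behind the join-side correction can be sketched in a few lines of Python. This is not the planner's code: the function names, the sampling strategy, and the row counts are assumptions chosen only to show how a sampled filter selectivity can decide which side a hash join builds on.

```python
import random

def sampled_selectivity(rows, predicate, sample_size=1000, seed=0):
    """Estimate the fraction of rows passing `predicate` from a random sample."""
    if not rows:
        return 0.0
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    return sum(1 for r in sample if predicate(r)) / len(sample)

def choose_build_side(left_rows, left_sel, right_rows, right_sel):
    """Build the hash table on the side with fewer estimated post-filter rows."""
    left_est = left_rows * left_sel
    right_est = right_rows * right_sel
    return "left" if left_est <= right_est else "right"

# Hypothetical data: 1 in 20 part names matches a substring filter.
part_names = [f"part {i} {'green' if i % 20 == 0 else 'blue'}" for i in range(10_000)]
sel = sampled_selectivity(part_names, lambda name: "green" in name)
side = choose_build_side(len(part_names), sel, 8_000, 1.0)
```

Without the sampled estimate, a planner that assumes the string filter keeps most rows would see the filtered table as the larger input and build on the other, genuinely larger side.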

ClickHouse — both of its modes (SF=100, July 2026)

ClickHouse got its own head-to-head because it’s the engine most often cited against us on large data. Same c7i.4xlarge, same Parquet files, ClickHouse’s own published TPC-H queries and table DDL (vendored verbatim from their benchmark kit), newest stable server at the time (26.6.1.1193). Two modes, because ClickHouse is really two stories:

Mode	Suite total	Completed	Ingest first	Extra storage
ematix-flow (query-in-place)	51.8 s	22/22	none	none
ClickHouse MergeTree (their published mode)	58.2 s	19/22	23.9 min	41.4 GiB
ClickHouse on parquet (lakehouse mode)	740.5 s	20/22	none	none

The headline: even after granting ClickHouse a 24-minute ingest and 41 GiB of duplicated storage, its partial 19-query suite is still slower than ematix-flow’s complete 22-query suite running straight off the shared files. And the two ClickHouse modes fail different queries — on this 32 GB box there is no ClickHouse configuration that completes all 22.

Per query (ClickHouse, both modes; DNF = did not finish):

Query	MergeTree (server)	Parquet (chdb)	Query	MergeTree (server)	Parquet (chdb)
Q01	2.07 s	3.37 s	Q12	1.42 s	DNF¹
Q02	DNF²	0.87 s	Q13	3.09 s	38.93 s
Q03	2.00 s	58.53 s	Q14	1.07 s	2.57 s
Q04	1.38 s	2.31 s	Q15	1.64 s	3.47 s
Q05	DNF²	5.56 s	Q16	0.58 s	1.28 s
Q06	0.83 s	1.67 s	Q17	1.62 s	44.95 s
Q07	3.08 s	48.49 s	Q18	5.14 s	7.11 s
Q08	DNF²	360.29 s	Q19	1.79 s	4.01 s
Q09	20.72 s	DNF³	Q20	2.04 s	3.46 s
Q10	2.79 s	30.43 s	Q21	5.96 s	121.43 s
Q11	0.36 s	0.71 s	Q22	0.64 s	1.04 s

Honesty notes, all reproducible from the raw run JSONs:

¹ Q12 on parquet is a ClickHouse engine bug, not physics: its parquet reader fails with “Not found column materialize(l_commitdate) in block” on a column-to-column predicate. The same query runs fine on native tables (1.42 s).
² Q02/Q05/Q08 on MergeTree each exceeded a 600 s-per-execution bound with a pathological low-parallelism join phase (~2 of 16 cores busy after all input rows were read; EXPLAIN shows an ordinary hash-join chain). Reproduced on both the embedded engine and the standalone server; the same queries complete on parquet views of the same data. Loading into the native format made these queries worse, not better.
³ Q09 on parquet needed 21.5 GiB against the 18.5 GiB cap that keeps the box alive (two earlier runs at higher caps took the whole machine down via untracked page-cache pressure). DuckDB and ematix-flow complete Q09 on the same box.

Context for ClickHouse’s own published numbers: their “TPC-H SF100 in 19.8 s” runs on a 59-core / 236 GiB ClickHouse Cloud node, and their join-improvement benchmarks use 32 vCPU / 128 GiB — several times this box, always on pre-ingested MergeTree. Nothing on this page compares against those machines; every number here shares one box and one set of files.

Distributed — 4-node clusters, vs Trino 482 / PySpark 4.1.2

Same cluster shape for every engine within each row (coordinator + 3 workers), same S3 data. SF=1 and SF=10 rows ran on 4 × c7i.2xlarge; SF=100 on 4 × c7i.4xlarge — every engine in a row shares its hardware. ematix-flow’s mesh is auto-detected per query (EMAT_MESH unset = AUTO): small queries stay single-node, large scans fan out over Arrow Flight.

Scale	ematix-flow (auto)	Trino	PySpark
SF=1	1.47 s	10.65 s (7.2×)	28.2 s (19×)
SF=10	10.1 s	56.4 s (5.6×)	65.0 s (6.5×)
SF=100	46.0 s	497.0 s (10.8×)	374.5 s (8.1×)

Two honesty notes:

PySpark: its SF=100 run originally DNF’d 18/22. That was our rig’s fault, not Spark’s: Amazon Linux mounts /tmp as a 16 GB RAM-backed tmpfs and Spark’s default shuffle dir sits there, so the shuffle-heavy queries ran out of “disk” no matter how large the EBS volume was. With shuffle scratch on EBS, PySpark completes all 22.
ematix-flow: the totals were refreshed Jul 16 on the current build. AUTO now measures instead of predicting: each join query probes its execution modes untimed (native single-node, mesh, mesh with broadcast joins) and runs the trials on the fastest, with every probe row-verified against the other modes. That verification caught a distributed-planner shape that silently dropped rows, which we fixed before any of it shipped in a release. Result: AUTO 46.0 s, vs forced-mesh 57.2 s and forced-single 57.1 s on the same fleet. Earlier runs for comparison: Jul 15 (twin routing only) read 55.3 s; Jul 11 read forced-mesh 58.8 s / AUTO 59.1 s; Jul 7, which predated the join-side correction, read 61.9 s.
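The measure-then-run behaviour described for AUTO can be sketched as below. This is a simplified illustration, not the engine's implementation: the mode names, the callables, and the order-insensitive row comparison are all assumptions.

```python
import time

def pick_mode(modes, query):
    """Probe each mode once, check all modes agree on rows, return the fastest.

    `modes` maps a mode name to a callable that runs `query` and returns rows.
    """
    results, elapsed = {}, {}
    for name, run in modes.items():
        start = time.perf_counter()
        rows = run(query)
        elapsed[name] = time.perf_counter() - start
        results[name] = sorted(rows)  # order-insensitive comparison
    reference_name, reference = next(iter(results.items()))
    for name, rows in results.items():
        if rows != reference:
            raise AssertionError(f"{name} disagrees with {reference_name} on {query}")
    return min(elapsed, key=elapsed.get)

# Hypothetical stand-ins for two execution modes returning the same rows.
def slow_single(query):
    time.sleep(0.05)
    return [(1, "a"), (2, "b")]

def fast_mesh(query):
    return [(2, "b"), (1, "a")]

best = pick_mode({"single": slow_single, "mesh": fast_mesh}, "Q03")
```

The probe timings only select a mode; the reported trials would then run separately on the chosen mode, so the probes themselves never enter the published medians.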

Two readings worth calling out:

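One node vs their cluster: single-node ematix-flow at SF=10 (4.76 s) outruns the 4-node Trino cluster (56.4 s) by ~12× and the 4-node PySpark cluster (65.0 s) by ~14×, with one node instead of four. By the instance sizes listed above, that is one c7i.4xlarge (16 vCPU / 32 GB) against 4 × c7i.2xlarge (32 vCPU / 64 GB aggregate), so half the aggregate vCPUs and RAM.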
The cluster now beats one box at SF=100: 4-node AUTO (46.0 s) outruns single-node (51.8 s) — the first configuration where adding nodes lowers the TPC-H total instead of just adding capacity. AUTO gets there by refusing to guess: per query it runs whichever mode its own row-verified probes measured fastest, so it inherits the best of the native engine (Q05/Q12/Q15/Q21 stay single-node) and the mesh (broadcast-join fan-out on the scan-heavy joins) at once.

SF=1000 — one terabyte (July 2026)

The scale where “just use a single-node engine” is supposed to stop working, and where the cluster engines are supposed to earn their keep. Bigger memory-class boxes than the rest of this page: single-node engines ran on one r7i.8xlarge (32 vCPU / 256 GB); the cluster engines got 4× r7i.4xlarge — twice the aggregate cores and RAM of the one box. Same TPC-H Parquet dataset (~373 GB on disk), all 22 queries, 5 trials × 2 warmups, sum of per-query medians, value-validated.

Every engine we measured, TPC-H SF=1000 (sum of 22 query medians)

Single nodes are 1× r7i.8xlarge (32 vCPU / 256 GB); clusters are 4× r7i.4xlarge (64 vCPU / 512 GB aggregate, 2× the hardware).

Engine	Nodes	Total
ematix auto	1	384 s
ematix-flow	1	417 s
DuckDB	1	468 s
Trino	4	4121 s
PySpark	4	DNF (4/22 queries; application aborted from Q05 on)

ematix auto runs the whole suite on one box with a loopback peer, probing twin / mesh / mesh+broadcast per join query and running the fastest — 384.2 s, ahead of the single-node leg on the same box (16/22 queries) and DuckDB's 467.9 s. The clusters get a hardware handicap in their favor: 4× r7i.4xlarge is twice the cores and twice the RAM of the single box the single-node numbers come from. Trino completes all 22 — 10.7× slower than ematix auto. PySpark's totals cover only Q01–Q04; from Q05 on the Spark master repeatedly removed the application (executor loss on the shuffle-heavy joins) — raw log shipped with the run.

ematix-flow at SF=1000 is the same story as SF=1: shipped defaults, no flags, complete suite. The single-node leg runs all 22 in 417.0 s on one box (v0.14.2, NO_DISTRIBUTE=1). Turn the mesh on — still one box, now with a loopback peer — and AUTO runs it in 384.2 s, ahead of the single-node leg on the same machine (faster on 16 of 22 queries). Per query the mode memo probes twin / mesh / mesh+broadcast untimed and runs the fastest, so AUTO inherits the native engine on the join-light queries and the broadcast-join fan-out on the scan-heavy ones (Q10, Q17, Q21) at once — the same “cluster beats one box” crossover the SF=100 row shows, here on a single machine. DuckDB completes all 22 on the same box class at 467.9 s. The cluster engines, on twice the hardware: Trino finishes all 22 in 4121.3 s — 10.7× slower than ematix AUTO — and PySpark never gets past Q04 (the Spark master repeatedly removed the application on the shuffle-heavy joins; the raw log ships with the run stamp). Raw per-query JSONs with full provenance: 20260717T164020Z (ematix AUTO), 20260716T193957Z (ematix single), 20260712T175438Z (DuckDB), 20260712T222000Z (Trino / PySpark).
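The multipliers quoted for SF=1000 follow directly from the suite totals above. A short Python check, using only the totals stated in this section (the dictionary keys are just labels):

```python
# Suite totals in seconds, as reported for SF=1000.
totals = {
    "ematix AUTO": 384.2,
    "ematix single-node": 417.0,
    "DuckDB": 467.9,
    "Trino (4 nodes)": 4121.3,
}

baseline = totals["ematix AUTO"]
# How many times slower each configuration is than ematix AUTO.
ratios = {name: round(seconds / baseline, 2) for name, seconds in totals.items()}
# ratios["Trino (4 nodes)"] is 10.73, matching the 10.7× quoted above.
```

PySpark is left out because its run covers only Q01 to Q04, so it has no comparable 22-query total.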

Single-machine strict protocol (Apple M4 Max)

The tightest-controlled per-query view: five engines, all 22 queries at three scales — SF=1 (~1 GB, fits in cache), SF=10 (~10 GB, production shape), SF=100 (~100 GB, out-of-core) — one M4 Max, the same Parquet files, per-query process isolation, thermal gating, 2σ verdicts.

How to read the tables below: each row is one TPC-H query; each column is an engine’s median time in milliseconds (lower is better). The fastest engine per row is highlighted in teal, and ematix-flow’s whole column is tinted so you can scan down it. “—” means that engine couldn’t run the query at that scale (see caveats). There is one table per scale factor, in the order SF=1, SF=10, SF=100.

SF=1 (~1 GB, fits in cache)

ematix-flow has the lowest median on 22 / 22 queries. Geomean speedup: 2.51× vs DuckDB, 4.27× vs Polars, 18.5× vs PySpark, 17.0× vs Postgres.

Query	ematix-flow	DuckDB	Polars	PySpark	Postgres
Q01	19.5	49.3	38.6	167	411
Q02	7.5	18.3	47.9	190	123
Q03	13.1	32.8	46.5	257	163
Q04	11.3	23.3	23.8	184	94.6
Q05	12.1	31.4	8,949	335	226
Q06	1.7	13.4	10.5	41.5	218
Q07	25.2	34.3	118	260	1,262
Q08	14.7	39.4	96.7	182	99.7
Q09	18.7	56.3	47.5	583	820
Q10	20.7	42.0	111	362	355
Q11	5.6	9.8	8.9	119	36.3
Q12	14.8	25.4	19.2	269	361
Q13	9.0	143	118	684	871
Q14	11.5	23.0	12.3	114	69.8
Q15	11.3	14.9	11.5	127	140
Q16	9.3	21.9	21.2	205	113
Q17	14.2	25.8	39.0	233	398
Q18	18.3	45.8	56.6	560	1,154
Q19	17.6	36.4	105	86.0	31.9
Q20	12.3	30.4	22.4	106	147
Q21	34.3	77.9	721	628	609
Q22	4.2	11.2	13.6	354	24.1

Median ms · lowest per row in teal.

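SF=10 (~10 GB, production scale)

ematix-flow has the lowest median on 20 / 22 queries (DuckDB is nominally ahead on Q01 and Q05, both inside the 2σ noise bar). Geomean speedup: 1.47× vs DuckDB, 4.81× vs Polars (n=21), 14.3× vs PySpark, 26.0× vs Postgres.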

Query	ematix-flow	DuckDB	Polars	PySpark	Postgres
Q01	248	246	547	732	4,306
Q02	19.3	39.2	417	599	2,222
Q03	129	136	572	2,722	3,488
Q04	58.5	85.0	264	1,711	905
Q05	137	132	—	4,589	3,233
Q06	25.4	75.5	56.3	205	1,373
Q07	118	135	1,287	3,737	2,188
Q08	147	157	1,172	940	1,340
Q09	271	288	445	2,187	7,431
Q10	193	221	3,145	2,355	3,362
Q11	11.4	25.4	39.7	197	578
Q12	86.5	109	116	826	3,542
Q13	102	233	419	2,069	10,989
Q14	82.0	126	84.4	379	813
Q15	61.8	81.1	62.1	645	1,606
Q16	44.2	57.4	169	638	1,098
Q17	80.3	147	473	3,956	5,387
Q18	178	199	589	6,953	19,846
Q19	122	189	1,264	493	148
Q20	91.9	137	265	419	3,279
Q21	180	375	34,184	7,523	6,952
Q22	23.7	52.6	122	628	203

Median ms · lowest per row in teal. “—” = engine couldn’t run the query at this scale (see caveats below).

SF=100 (~100 GB, out-of-core)

ematix-flow has the lowest median on 22 / 22 queries. Geomean speedup: 1.37× vs DuckDB, 6.01× vs Polars (n=17), 10.4× vs PySpark, 81.0× vs Postgres (n=6).

Query	ematix-flow	DuckDB	Polars	PySpark	Postgres
Q01	2,461	2,748	79,084	5,184	—
Q02	195	337	53,109	6,697	26,054
Q03	1,501	1,860	30,103	26,128	—
Q04	850	1,011	6,550	10,543	—
Q05	1,434	1,822	—	38,145	—
Q06	519	858	540	1,142	—
Q07	1,476	1,968	95,946	12,775	—
Q08	1,655	2,368	—	23,610	—
Q09	3,495	4,863	21,438	67,105	—
Q10	1,953	2,412	—	24,590	—
Q11	184	246	412	5,639	60,108
Q12	1,033	1,357	1,112	6,482	—
Q13	1,968	2,886	5,114	14,442	86,674
Q14	827	1,204	895	2,623	—
Q15	821	1,157	871	5,142	—
Q16	404	468	1,809	5,012	29,144
Q17	1,484	1,842	9,536	47,823	—
Q18	2,150	2,791	15,572	54,206	—
Q19	1,227	1,856	—	3,101	52,107
Q20	1,144	1,492	6,692	5,682	—
Q21	3,467	5,025	—	55,583	—
Q22	350	700	1,255	5,107	16,805

Median ms · lowest per row in teal. ematix + DuckDB re-measured 2026-07-03 (v0.12.0, strict protocol, per-query isolated, 4 invocations × 10 trials): 20 / 22 clear ematix wins at 2σ, Q01 + Q16 inside the noise bar, zero DuckDB wins. Postgres ran with a 90 s per-query cap — 16 / 22 heavy queries timed out (—). “—” = engine couldn’t run the query at this scale (see caveats below).

Full methodology — how each engine was measured

ematix-flow + DuckDB are measured under the repo’s strict protocol (scripts/bench/strict_22q.sh, 2026-07-03, the exact v0.12.0 code): solo-engine passes with per-query process isolation, plan cache off, thermal gating, 10 timed trials × 2 warmups × 4 invocations per engine (first invocation discarded), medians, and machine/power/git/env provenance (env.json) stored with every result (bench-results/release-v0.12.0/ in the repo). Verdicts use a 2σ noise bar — a query is a “win” only when the gap clears twice the larger arm’s spread; anything inside the bar is reported as a tie, not a win. The harness session is constructed from the shipped production preset and pinned to it by parity tests, so there are no bench-only rules by construction. The two engines are measured as a same-session pair per scale (cross-day machine state moves either column by ±4–5%).
Polars runs the same in-process harness (hand-translated q??.polars.sql where its planner rejects the canonical shape); Q14 / Q15 — the queries where the verdict is marginal — were re-measured alongside the 2026-06-12 pairs at the same 20×3 trial counts.
PySpark runs local[*] out-of-process on the JVM via bench-tpch-pyspark.py, against the same files.
Postgres 14 runs each query under EXPLAIN ANALYZE (B-tree indexed + ANALYZEd), reported as the planner’s Execution Time.
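The 2σ verdict rule can be sketched as follows. This is a minimal illustration rather than the logic of strict_diff.py: it assumes "spread" means the sample standard deviation of an engine's trial timings, and the function name and timings are hypothetical.

```python
from statistics import median, stdev

def verdict(a_trials, b_trials, k=2.0):
    """Return 'a', 'b', or 'tie' using a k-sigma noise bar on the median gap."""
    med_a, med_b = median(a_trials), median(b_trials)
    # The noise bar is k times the larger of the two arms' spreads.
    noise = k * max(stdev(a_trials), stdev(b_trials))
    gap = abs(med_a - med_b)
    if gap <= noise:
        return "tie"
    return "a" if med_a < med_b else "b"

# Hypothetical timings in milliseconds.
clear = verdict([100, 102, 98, 101, 99], [130, 128, 131, 129, 132])   # "a"
noisy = verdict([100, 110, 90, 105, 95], [104, 96, 112, 100, 108])    # "tie"
```

Taking the larger arm's spread makes the bar conservative: a noisy measurement on either side is enough to downgrade a nominal lead to a tie.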

What changes across scale

Fast across the whole sweep — no soft spots (v0.12.0, 2σ verdicts). At SF=1 the working set is L3-resident and per-query constant cost dominates — ematix-flow is fastest on all 22 / 22 (2.51× geomean). At SF=10 the workload turns memory-bound: fastest on 17, with 5 queries a statistical tie (Q01 / Q03 / Q05 / Q08 / Q09, ematix nominally ahead in 3) and none slower — 1.47× geomean. At SF=100 — out-of-core, each query in its own process — fastest on 20, 2 ties (Q01, Q16), none slower — 1.37× geomean, +35% on the sum of medians. A handful of queries that were marginal in earlier releases (Q10 / Q18 at SF=100, Q05 at both scales) moved into the win column through engine work in 0.10–0.12: runtime-bloom build-side swaps, narrow-key decode, and the shuffle-free cluster-key aggregation (RANGE.AGG Stage 2) — each landed behind a strict interleaved A/B with a 2σ bar.
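For readers unfamiliar with the geomean figures, this Python sketch shows one way such a speedup could be computed. It assumes the per-query ratio is the other engine's median divided by ematix-flow's, taken over queries both engines finished; the sample uses three rows from the SF=1 table, so its result is not the full-suite 2.51×.

```python
import math

def geomean_speedup(baseline_ms, other_ms):
    """Geometric mean of other/baseline over queries both engines finished."""
    ratios = [
        other_ms[q] / baseline_ms[q]
        for q in baseline_ms
        if other_ms.get(q) is not None
    ]
    if not ratios:
        raise ValueError("no query finished on both engines")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Three rows from the SF=1 table (median ms): ematix-flow vs DuckDB.
ematix = {"Q01": 19.5, "Q06": 1.7, "Q13": 9.0}
duckdb = {"Q01": 49.3, "Q06": 13.4, "Q13": 143.0}
subset_speedup = geomean_speedup(ematix, duckdb)
```

A geometric mean weights every query equally, so one very slow query cannot dominate the figure the way it does in a sum of medians.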

Concurrent-stream throughput

New in v0.12.0: the same strict harness measures N ∈ {1, 10, 100} simultaneous query streams per engine (seeded query permutations, solo engines, first batch discarded); the SF=100 rows cover 1 and 10 streams. ematix-flow’s concurrency-aware scheduling, in which a cross-process registry sizes partitions, the reader’s decode fan-out, and the rayon pool from one per-process core share, holds its throughput lead at every measured stream count (DuckDB shown as the columnar reference; QPH = queries per hour):

Config	ematix-flow QPH	DuckDB QPH	ratio
SF=10, 1 stream	27,462	21,405	1.28×
SF=10, 10 streams	28,333	26,463	1.07×
SF=10, 100 streams	26,073	25,824	1.01×
SF=100, 1 stream	2,212	1,814	1.22×
SF=100, 10 streams	1,581	1,118	1.41×
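The ratio column can be recomputed from the two QPH columns, as in the Python sketch below. The queries_per_hour helper is an assumption about how a QPH figure is typically derived (completed queries scaled to one hour of wall time); the harness may define it differently.

```python
def queries_per_hour(completed_queries, wall_seconds):
    """Throughput as completed queries scaled to one hour of wall time."""
    return completed_queries * 3600.0 / wall_seconds

# (config, ematix-flow QPH, DuckDB QPH) as reported in the table above.
rows = [
    ("SF=10, 1 stream", 27462, 21405),
    ("SF=10, 10 streams", 28333, 26463),
    ("SF=10, 100 streams", 26073, 25824),
    ("SF=100, 1 stream", 2212, 1814),
    ("SF=100, 10 streams", 1581, 1118),
]
ratios = [round(ematix / duckdb, 2) for _, ematix, duckdb in rows]
# ratios == [1.28, 1.07, 1.01, 1.22, 1.41]

# Hypothetical example: 22 queries finishing in 36 s of wall time.
example_qph = queries_per_hour(22, 36.0)  # 2200.0
```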

Caveats

Polars can’t run several canonical TPC-H shapes; we feed it semantically identical hand-translated variants. Q05 still overflows Polars’s default 32-bit row index (the bigidx build would fix it) and shows “—” at SF=10 / SF=100; a few more SF=100 queries (Q08, Q10, Q19, Q21) likewise exceed it.
Postgres is a row-store OLTP engine, included as a familiar single-node reference, not as a columnar-analytics peer. At SF=100 it ran with a 90 s per-query cap; 16 / 22 heavy queries timed out (shown “—”), so its SF=100 geomean covers only the 6 that finished.
DuckDB runs at defaults (in-memory read_parquet). ematix-flow runs the production preset — target_partitions = cores, plus the fused-aggregate, dict-group-count, push-LeftSemi, transitive-semi, runtime-bloom, scale-relative-broadcast, and cluster-key single-phase aggregation rules. That’s the same config you get from pip install ematix-flow; no per-query tuning, one fresh context per query.
Thermal note (SF=10 / SF=100): back-to-back runs on an M4 Max drift ~5–20 % as the package heats; the isolated passes run under caffeinate -i with warmups + 20-trial medians to bound it.
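Per-query process isolation with a time cap, as used for the isolated passes and the Postgres 90 s limit, can be approximated with the standard library. This is a generic Python sketch, not the repo's harness: the helper name is hypothetical, and a small Python snippet stands in for a real query.

```python
import subprocess
import sys

def run_isolated(code, cap_seconds):
    """Run `code` in a fresh Python process; return stdout or None on timeout."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=cap_seconds,
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None  # would be reported as a dash in the tables
    return done.stdout.strip()

# A fast stand-in "query" finishes; a slow one hits the cap.
fast = run_isolated("print(6 * 7)", cap_seconds=10)
slow = run_isolated("import time; time.sleep(5)", cap_seconds=0.5)
```

Starting a fresh process per query means no caches, allocator state, or thread pools carry over from the previous query, at the cost of paying process startup on every run.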

Reproducing

# ematix-flow + DuckDB — strict protocol (solo passes, per-query process
# isolation, thermal gating, env.json provenance, 2σ verdict diff)
cargo build --release --features triangulation --example tpch_triangulation_bench
scripts/bench/strict_22q.sh --sf 1 --engine ematix --isolate --out /tmp/em-sf1
scripts/bench/strict_22q.sh --sf 1 --engine duckdb --isolate --out /tmp/dk-sf1
python3 scripts/bench/strict_diff.py --a /tmp/em-sf1/strict-22q-summary.md \
  --b /tmp/dk-sf1/strict-22q-summary.md --out /tmp/verdict-sf1.md \
  --label-a ematix --label-b duckdb
#   → repeat with --sf 10 / --sf 100

# Concurrent-stream throughput (QPH, memory-guarded)
scripts/bench/strict_throughput.sh --sf 10 --streams "1,10,100" \
  --partitions auto --max-inflight 10 --min-free-gb 6 --out /tmp/tput-sf10

# Polars — the in-process triangulation harness
TPCH_DATA_DIR=examples/tpch/data/sf10 TPCH_SKIP_DUCKDB=1 TPCH_SKIP_EMATIX=1 \
  TPCH_OUT=/tmp/polars.md cargo run --release -p ematix-flow-core \
  --example tpch_triangulation_bench --features triangulation

# PySpark — needs JDK 17+ (brew install openjdk)
JAVA_HOME=$(/usr/libexec/java_home) \
  SPARK_DRIVER_MEM=16g SPARK_SHUFFLE_PARTS=64 \
  python scripts/bench-tpch-pyspark.py \
  --data-dir examples/tpch/data/sf100 --trials 5 --warmups 1

# Postgres — load the Parquet via ematix-flow's own ADBC ingest,
# then run each query under EXPLAIN ANALYZE.

The 2026-07-03 tables are the v0.12.0 release benchmark: ematix-flow and DuckDB re-measured at all three scales in one session with the exact tagged code under the strict protocol; full raw evidence (per-invocation runs, verdict diffs, env.json) ships in the repo at bench-results/release-v0.12.0/ and bench-results/sched-arc-2026-07-03/ (throughput). Polars / PySpark / Postgres columns are carried baselines from earlier same-machine runs (raw logs in bench-results/refresh-2026-05-30/). Release-over-release history lives in the repo CHANGELOG.