Benchmarks
TPC-H at SF=1 / SF=10 / SF=100, all 22 queries on an Apple M4 Max — ematix-flow vs DuckDB, Polars, single-node PySpark, and Postgres. Median ms, same hardware and same Parquet files.
Same-machine TPC-H benchmark (Apple M4 Max, single-node) over all 22 queries at three scale factors: SF=1 (~1 GB, fits in cache), SF=10 (~10 GB, production shape), and SF=100 (~100 GB, out-of-core). Five engines, the same Parquet files, the same machine. The three tables below give SF=1, SF=10, and SF=100, in that order.
| Query | ematix-flow | DuckDB | Polars | PySpark | Postgres |
|---|---|---|---|---|---|
| Q01 | 17.1 | 48.5 | 38.6 | 167 | 411 |
| Q02 | 7.0 | 17.6 | 47.9 | 190 | 123 |
| Q03 | 9.6 | 32.8 | 46.5 | 257 | 163 |
| Q04 | 10.3 | 22.3 | 23.8 | 184 | 94.6 |
| Q05 | 6.6 | 31.7 | 8,949 | 335 | 226 |
| Q06 | 0.9 | 13.1 | 10.5 | 41.5 | 218 |
| Q07 | 27.0 | 32.9 | 118 | 260 | 1,262 |
| Q08 | 11.4 | 39.4 | 96.7 | 182 | 99.7 |
| Q09 | 17.4 | 55.6 | 47.5 | 583 | 820 |
| Q10 | 27.2 | 60.6 | 111 | 362 | 355 |
| Q11 | 6.0 | 9.7 | 8.9 | 119 | 36.3 |
| Q12 | 14.4 | 25.1 | 19.2 | 269 | 361 |
| Q13 | 9.2 | 141 | 118 | 684 | 871 |
| Q14 | 10.2 | 22.3 | 12.3 | 114 | 69.8 |
| Q15 | 10.4 | 14.1 | 11.2 | 127 | 140 |
| Q16 | 7.7 | 21.4 | 21.2 | 205 | 113 |
| Q17 | 14.8 | 24.7 | 39.0 | 233 | 398 |
| Q18 | 1.6 | 45.7 | 56.6 | 560 | 1,154 |
| Q19 | 15.5 | 34.4 | 105 | 86.0 | 31.9 |
| Q20 | 16.7 | 28.9 | 22.4 | 106 | 147 |
| Q21 | 35.0 | 74.5 | 721 | 628 | 609 |
| Q22 | 8.5 | 20.9 | 13.6 | 354 | 24.1 |
SF=1 · Median ms · lowest per row in teal.
| Query | ematix-flow | DuckDB | Polars | PySpark | Postgres |
|---|---|---|---|---|---|
| Q01 | 230 | 254 | 343 | 732 | 4,306 |
| Q02 | 18.8 | 41.6 | 447 | 599 | 2,222 |
| Q03 | 83.7 | 150 | 590 | 2,722 | 3,488 |
| Q04 | 57.9 | 92.3 | 277 | 1,711 | 905 |
| Q05 | 116 | 148 | — | 4,589 | 3,233 |
| Q06 | 37.8 | 85.8 | 58.6 | 205 | 1,373 |
| Q07 | 124 | 133 | 1,315 | 3,737 | 2,188 |
| Q08 | 200 | 181 | 1,282 | 940 | 1,340 |
| Q09 | 270 | 276 | 414 | 2,187 | 7,431 |
| Q10 | 200 | 385 | 4,153 | 2,355 | 3,362 |
| Q11 | 12.7 | 24.6 | 32.5 | 197 | 578 |
| Q12 | 96.7 | 122 | 134 | 826 | 3,542 |
| Q13 | 115 | 268 | 424 | 2,069 | 10,989 |
| Q14 | 82.4 | 123 | 83.8 | 379 | 813 |
| Q15 | 63.4 | 91.2 | 72.0 | 645 | 1,606 |
| Q16 | 33.6 | 59.6 | 171 | 638 | 1,098 |
| Q17 | 121 | 159 | 523 | 3,956 | 5,387 |
| Q18 | 20.6 | 245 | 624 | 6,953 | 19,846 |
| Q19 | 134 | 185 | 1,389 | 493 | 148 |
| Q20 | 110 | 140 | 275 | 419 | 3,279 |
| Q21 | 257 | 409 | 34,717 | 7,523 | 6,952 |
| Q22 | 51.3 | 115 | 112 | 628 | 203 |
SF=10 · Median ms · lowest per row in teal. “—” = engine couldn’t run the query at this scale (see caveats below).
| Query | ematix-flow | DuckDB | Polars | PySpark | Postgres |
|---|---|---|---|---|---|
| Q01 | 2,361 | 2,619 | 79,084 | 5,184 | — |
| Q02 | 241 | 419 | 53,109 | 6,697 | 26,054 |
| Q03 | 1,040 | 1,539 | 30,103 | 26,128 | — |
| Q04 | 840 | 897 | 6,550 | 10,543 | — |
| Q05 | 1,381 | 1,617 | — | 38,145 | — |
| Q06 | 483 | 742 | 540 | 1,142 | — |
| Q07 | 1,560 | 1,714 | 95,946 | 12,775 | — |
| Q08 | 1,901 | 2,417 | — | 23,610 | — |
| Q09 | 6,086 | 7,371 | 21,438 | 67,105 | — |
| Q10 | 3,291 | 2,691 | — | 24,590 | — |
| Q11 | 221 | 234 | 412 | 5,639 | 60,108 |
| Q12 | 1,009 | 1,229 | 1,112 | 6,482 | — |
| Q13 | 2,004 | 2,349 | 5,114 | 14,442 | 86,674 |
| Q14 | 835 | 1,249 | 895 | 2,623 | — |
| Q15 | 947 | 1,086 | 890 | 5,142 | — |
| Q16 | 196 | 401 | 1,809 | 5,012 | 29,144 |
| Q17 | 1,849 | 1,892 | 9,536 | 47,823 | — |
| Q18 | 495 | 2,812 | 15,572 | 54,206 | — |
| Q19 | 1,488 | 1,941 | — | 3,101 | 52,107 |
| Q20 | 2,445 | 2,019 | 6,692 | 5,682 | — |
| Q21 | 5,274 | 5,589 | — | 55,583 | — |
| Q22 | 831 | 804 | 1,255 | 5,107 | 16,805 |
SF=100 · Median ms · lowest per row in teal. “—” = engine couldn’t run the query at this scale (see caveats below). For Postgres, which ran with a 90 s per-query cap, “—” marks the 16 of 22 queries that timed out.
Scope: every number here is single-node. ematix-flow also has an auto-detected distributed mode (Arrow Flight peer mesh; see Why ematix-flow). A cross-host cluster-scale panel is deferred to a later release; the harness (`tpch_distributed`) already ships in the repo.
How to read it
- ematix-flow and DuckDB are co-measured in one process (10 timed trials after 3 warmups, medians), so thermal drift hits both equally in the head-to-head that matters.
- Polars runs the same in-process harness (hand-translated `q??.polars.sql` where its planner rejects the canonical shape).
- PySpark runs `local[*]` out-of-process on the JVM via `bench-tpch-pyspark.py`, against the same files.
- Postgres 14 runs each query under `EXPLAIN ANALYZE` (B-tree indexed + `ANALYZE`d), reported as the `Execution Time` from that output.
- The fastest engine per row is highlighted; ematix-flow’s column is tinted for scanning.
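The "10 timed trials after 3 warmups, medians" protocol can be sketched in a few lines of Python. This is an illustration only, not the repo's harness (which is the Rust example `tpch_triangulation_bench`): the names `median_ms` and `co_measure` are hypothetical, and each engine is assumed to be wrapped in a zero-argument callable that runs one query.

```python
import statistics
import time
from typing import Callable, Dict


def median_ms(run_query: Callable[[], object], warmups: int = 3, trials: int = 10) -> float:
    """Median wall-clock time of `run_query` in milliseconds."""
    for _ in range(warmups):
        run_query()  # untimed runs, so caches and allocators settle first
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def co_measure(engines: Dict[str, Callable[[], object]],
               warmups: int = 3, trials: int = 10) -> Dict[str, float]:
    """Time every engine back to back inside this one process for a single query."""
    return {name: median_ms(fn, warmups, trials) for name, fn in engines.items()}
```

Reporting the median rather than the mean keeps a single slow trial (a GC pause, a thermal dip) from skewing the figure, and timing both engines in one process means they see the same machine state.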
What changes across scale
At SF=1 the working set is L3-resident and per-query constant cost dominates, so ematix-flow’s fused aggregate / filter paths take all 22 queries. At SF=10 and SF=100 the workload turns memory-bound and then IO-bound: DuckDB’s mature join-order heuristics and vectorised kernels reclaim a handful of multi-fact joins (Q08 at SF=10; Q10, Q20, and Q22 at SF=100), and Polars edges ahead on Q15 at SF=100 (890 ms vs 947 ms). ematix-flow still leads the field (21 / 22, then 18 / 22) and widens specific wins, most visibly Q18 at SF=100: 495 ms vs DuckDB’s 2,812 ms (5.7×), which comes from the scale-relative broadcast-join rule.
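The SF=100 win count and the Q18 ratio can be re-derived from the table. The sketch below copies the ematix-flow, DuckDB, and Polars medians from the SF=100 table; PySpark and Postgres are left out because neither is fastest on any SF=100 row. The helper name `fastest` is illustrative.

```python
from collections import Counter

ENGINES = ("ematix-flow", "DuckDB", "Polars")

# SF=100 medians in ms, copied from the table: (ematix-flow, DuckDB, Polars).
# None marks a query Polars could not run at this scale.
SF100 = {
    "Q01": (2361, 2619, 79084), "Q02": (241, 419, 53109),
    "Q03": (1040, 1539, 30103), "Q04": (840, 897, 6550),
    "Q05": (1381, 1617, None),  "Q06": (483, 742, 540),
    "Q07": (1560, 1714, 95946), "Q08": (1901, 2417, None),
    "Q09": (6086, 7371, 21438), "Q10": (3291, 2691, None),
    "Q11": (221, 234, 412),     "Q12": (1009, 1229, 1112),
    "Q13": (2004, 2349, 5114),  "Q14": (835, 1249, 895),
    "Q15": (947, 1086, 890),    "Q16": (196, 401, 1809),
    "Q17": (1849, 1892, 9536),  "Q18": (495, 2812, 15572),
    "Q19": (1488, 1941, None),  "Q20": (2445, 2019, 6692),
    "Q21": (5274, 5589, None),  "Q22": (831, 804, 1255),
}


def fastest(row):
    """Name of the engine with the lowest median, ignoring engines that did not run."""
    timed = [(ms, name) for ms, name in zip(row, ENGINES) if ms is not None]
    return min(timed)[1]


wins = Counter(fastest(row) for row in SF100.values())
q18_speedup = SF100["Q18"][1] / SF100["Q18"][0]  # DuckDB ms / ematix-flow ms

print(dict(wins))             # wins per engine across the 22 queries
print(round(q18_speedup, 1))  # Q18 ratio, DuckDB over ematix-flow
```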
Caveats
- Polars can’t run several canonical TPC-H shapes; we feed it semantically identical hand-translated variants. Q05 still overflows Polars’s default 32-bit row index (the `bigidx` build would fix it) and shows “—” at SF=10 / SF=100; a few more SF=100 queries (Q08, Q10, Q19, Q21) likewise exceed it.
- Postgres is a row-store OLTP engine, included as a familiar single-node reference, not a columnar-analytics peer. At SF=100 it ran with a 90 s per-query cap; 16 / 22 heavy queries timed out (shown “—”), so its SF=100 geomean covers only the 6 that finished.
- DuckDB runs at defaults (in-memory `read_parquet`). ematix-flow runs the production preset: `target_partitions = cores`, plus the fused-aggregate, dict-group-count, push-LeftSemi, runtime-bloom, and scale-relative-broadcast rules. That’s the same config you get from `pip install ematix-flow`; no per-query tuning.
- Thermal note (SF=10 / SF=100): back-to-back runs on an M4 Max drift ~5–20 % as the package heats. ematix-flow and DuckDB move together, so the head-to-head holds, which is exactly why the two are co-measured in a single process.
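To make the Postgres caveat concrete, here is a minimal sketch of a geometric mean that skips timed-out queries, using the Postgres SF=100 column from the table. The function name `geomean_finished` is hypothetical; the page does not show how its geomean is computed, so treat this as one plausible reading.

```python
import math

# Postgres SF=100 medians in ms from the table; None = hit the 90 s cap ("—").
POSTGRES_SF100 = {
    "Q01": None, "Q02": 26054, "Q03": None, "Q04": None, "Q05": None,
    "Q06": None, "Q07": None, "Q08": None, "Q09": None, "Q10": None,
    "Q11": 60108, "Q12": None, "Q13": 86674, "Q14": None, "Q15": None,
    "Q16": 29144, "Q17": None, "Q18": None, "Q19": 52107, "Q20": None,
    "Q21": None, "Q22": 16805,
}


def geomean_finished(timings):
    """Geometric mean over the queries that finished, plus how many that was."""
    finished = [ms for ms in timings.values() if ms is not None]
    if not finished:
        raise ValueError("no query finished")
    mean_log = sum(math.log(ms) for ms in finished) / len(finished)
    return math.exp(mean_log), len(finished)


geomean_ms, n_finished = geomean_finished(POSTGRES_SF100)
print(f"geomean over {n_finished} finished queries: {geomean_ms:.0f} ms")
```

Because every dropped query ran for more than 90 s, which is longer than any finished one, a geomean computed this way flatters Postgres and should not be compared directly with a 22-query geomean for another engine.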
Reproducing
```bash
# ematix-flow + DuckDB + Polars: same harness, swap the data dir per scale
TPCH_DATA_DIR=examples/tpch/data/sf1 \
  cargo run --release -p ematix-flow-core \
  --example tpch_triangulation_bench --features triangulation
# → sf10, then sf100 (SF=100 wants ~64 GB free + a few minutes/run)

# PySpark: needs JDK 17+ (brew install openjdk)
JAVA_HOME=$(/usr/libexec/java_home) \
SPARK_DRIVER_MEM=16g SPARK_SHUFFLE_PARTS=64 \
  python scripts/bench-tpch-pyspark.py \
  --data-dir examples/tpch/data/sf100 --trials 5 --warmups 1

# Postgres: load the Parquet via ematix-flow's own ADBC ingest,
# then run each query under EXPLAIN ANALYZE.
```
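For the Postgres step, the reported figure is the `Execution Time` line that `EXPLAIN ANALYZE` prints at the end of its text output. A minimal sketch of pulling that number out follows; the function name and the sample plan text are hypothetical and are not taken from the repo's scripts or logs.

```python
import re

# Matches the trailing "Execution Time: 12.345 ms" line of EXPLAIN ANALYZE text output.
_EXEC_TIME = re.compile(r"Execution Time:\s*([0-9]+(?:\.[0-9]+)?)\s*ms", re.IGNORECASE)


def execution_time_ms(explain_output: str) -> float:
    """Extract the execution time in milliseconds from EXPLAIN ANALYZE text output."""
    match = _EXEC_TIME.search(explain_output)
    if match is None:
        raise ValueError("no 'Execution Time' line found")
    return float(match.group(1))


# Made-up sample output, for illustration only (not a real benchmark log).
SAMPLE = """\
 Seq Scan on lineitem  (cost=0.00..1.00 rows=1 width=8) (actual time=0.010..0.020 rows=1 loops=1)
 Planning Time: 0.250 ms
 Execution Time: 12.500 ms
"""

print(execution_time_ms(SAMPLE))
```

Note that this figure excludes planning time and client round-trips, so it measures less than the wall-clock timings the other engines report.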
The 2026-06-07 refresh re-ran the harness above for ematix-flow, DuckDB, and Polars; the PySpark and Postgres numbers are carried over, with raw logs in `bench-results/refresh-2026-05-30/` (same machine).
Release-over-release history lives in the repo `CHANGELOG`.