What's shipped

v1.0.0 surface matrix — what's stable, what's still in motion.

Backends

Backend	Batch source	Streaming source	Target	DDL planning	Strategy executors (append / merge / scd2 / truncate)	CDC target
Postgres	✅	—	✅	✅	✅ (native + COPY BINARY)	✅
MySQL	✅	—	✅	✅	✅ (`ON DUPLICATE KEY`)	✅
SQLite	✅	—	✅	✅	✅	✅
DuckDB	✅	—	✅	✅	✅	✅
Snowflake	✅	—	✅	n/a	append (Arrow → Parquet → `PUT` + `COPY INTO`) + truncate + merge (staged `MERGE`)	—
BigQuery	✅	—	✅	n/a	append (Arrow → Parquet → GCS + `load_table_from_uri`) + truncate + merge (`MERGE INTO` from staging)	—
Redshift	✅	—	✅	n/a	append (S3 → `COPY`) + truncate + merge (`MERGE INTO` from staging)	—
Delta Lake (local + S3)	✅	✅	✅	n/a	✅ (DataFusion-backed `MERGE`)	✅
Object stores (Parquet / CSV / ORC / JSONL, local + S3)	✅	✅	✅	n/a	append + truncate	—
Kafka	—	✅	✅	n/a	append (cross-backend)	source role only
RabbitMQ	—	✅	✅	n/a	append (cross-backend)	—
GCP Pub/Sub	—	✅	✅	n/a	append (cross-backend)	—
AWS Kinesis	—	✅	✅	n/a	append (cross-backend)	—

Batch source = readable by @ematix.pipeline (the function returns a SQL string; the framework executes it against the source connection). Streaming source = tailable by flow consume / @ematix.streaming_pipeline (long-running consumer with manual offset commit / ack). Target = writable by either pipeline shape. Cross-backend moves stream Apache Arrow batches end-to-end — same-DB pairs take the INSERT … SELECT fast path automatically.

Workflows + Jobs surface

@ematix.job — batch / scheduled. Atomic unit of work; one function, one target, one schedule. (Historical name @ematix.pipeline remains as an alias.)
ematix.workflow(name, jobs, depends_on) — names a group of jobs + declares the DAG between them. Workflows are the user-facing organizing concept on the Web UI; the DAG lives here, not on individual jobs.
@ematix.warehouse_pipeline — scheduled warehouse-to-warehouse read → DuckDB transform → bulk write, registered with the same cron / DAG / retry / flow run-due machinery.
@ematix.streaming_pipeline — long-running consumer.
@ematix.connection — typed connection with ${VAR} interpolation.
@ematix.table / ManagedTable — declarative schema + DDL.
@ematix_flow.udf / @ematix_flow.udaf — Python user-defined scalar + aggregate functions.

Modes

append — default.
truncate — full refresh.
merge (a.k.a. scd1) — upsert on merge keys.
scd2 — slowly-changing-dimension Type 2.

Execution

Single-node (default when no peers). The whole engine runs in-process; the flow binary is a single ~25 MB native executable.
Distributed (auto-detected). engine = "auto" is the default — set peers = [...] and SQL fans out across a peer-to-peer mesh of flow-worker processes via Apache Arrow Flight; otherwise falls back to in-process. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linking ematix-flow-distributed can play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.

Operational

Cron + DAG + cycle detection + retries. DAG edges live on the workflow declaration (or, legacy, on per-job depends_on=).
Cron schedule timezones — timezone="America/New_York" on the job; Web UI renders “Next: …” in the job’s tz.
Watermarks + restart-safe state.
Run-history store (queryable via flow runs ...).
Prometheus metrics + OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, or OTLP HTTP collectors).
Alerters: Slack, generic webhook, stdout, email (email://user:pass@host:port?from=&to=) and PagerDuty (pagerduty://routing_key?service=&severity= — auto-resolves on recovery).
Web UI bearer-token auth (flow web --token …) for off-host access; cross-pipeline DAG view of every depends_on edge; streaming pipeline live throughput + batch-cycle stats.
Grafana starter dashboard JSON (examples/grafana/ematix-flow-dashboard.json).
DLQ at app + broker level.
Schema Registry: Confluent + Apicurio (Avro + Protobuf) and AWS Glue Schema Registry (Avro, end-to-end Kafka dispatch + LocalStack integration tests).
Exactly-once Kafka → Kafka (transactions + consumer coordination).

Stream processing

SQL transforms over Arrow batches (in-process, no JVM).
Tumbling / hopping / session windows.
Scalar + aggregate Python UDFs.

Analytics (SQL Lab / charts / dashboards)

SQL Lab — ad-hoc SQL through the engine (Arrow-native, no pyarrow/pandas in the request path) against registered data sources (flow web --datasource name=url); schema browser + autocomplete, typed results grid, saved queries.
Query safety — read-only guard (SELECT / WITH / EXPLAIN only), filesystem / network / cross-DB function denylist (read_csv, *_scan, ATTACH, …), row cap, per-query timeout.
Charts — table / big-number / bar / line / area / pie / scatter via Apache ECharts, with a no-SQL GROUP BY builder; saved + reused.
Dashboards — drag/resize grid of charts, batch tile query + result cache, cross-filtering (click a bar/slice), drill-down, auto-refresh.
RBAC — reverse-proxy / SSO-trust identity header → roles (viewer / editor / admin); objects shared org-wide, owner recorded. Async query jobs + chart-metric alerts (/api/alerts/{id}/check).

Data quality + freshness

Expectations — declare expectations= on a pipeline (not-null, unique, range, regex, is-in, row-count, freshness…); they run through ematix-probe as a SQL-pushdown stage after the write. PK → unique and NOT NULL → not_null are auto-derived from the target table. on_quality_failure is warn (record + alert, run succeeds) or fail (run recorded failed, post-transforms skipped). Postgres + DuckDB targets.
Freshness SLOs — freshness_sla="6h" per pipeline, evaluated both at run-end and on a schedule (flow freshness-check) so a stalled pipeline is caught even when it isn’t firing; alerts fire on the healthy→breached edge, de-duped across runs.
Quality view — freshness cards + a table of expectation runs with per-assertion drill-down (/api/quality, /api/freshness, RBAC read-gated).
Alerts — quality_failed + sla_breached events through the existing Slack / email / PagerDuty / stdout alerters.
Opt-in: pip install "ematix-flow[quality]".

Recently closed (v0.14.2)

v0.14 is a planner-parity + perf release. No Python API changes. The headline: multi-file (partitioned) tables now get the same optimizer treatment as single-file tables, closing the last big single-node gap.

Partitioned-scan planner parity — the width-pinned interleaved union keeps partitioned scans at the configured parallelism (no more per-file fan-out multiplication), and the clustered-aggregation, runtime-bloom, and join-side-sampling rules all now see through the union. TPC-H SF=100 on the partitioned layout, one 32 GB box: 175.6 s → 60.6 s, with Q09 49.5 s → 6.5 s.
Grace hash join (opt-in) — EMAT_GRACE_JOIN=1 enables a partitioned, spill-capable hash join for memory-tight deployments. Kept opt-in: the same-box A/B showed the planner-parity fix beats spilling (a right-sized build needs no spill).
Sidecar index lookups, lazy path — point lookups through .parquet.idx sidecars now open footer-only and decode just the matching row group(s): ~3 s → ~18 ms per lookup at SF=100; the index beats a full scan 9–12× on a 3 GB part.
PyPI metadata — project links now point at ematix.dev.

Recently closed (v0.11.0)

v0.11.0 is a benchmark-accuracy + perf release. No Python API changes; the surface from v0.10.0 is unchanged. The headline is a corrected SF=100 measurement protocol plus the win-campaign optimizer levers.

Three-scale benchmark refresh — all 22 TPC-H queries across SF=1 / SF=10 / SF=100 and five engines (ematix-flow, DuckDB, Polars, single-node PySpark, Postgres), same hardware + same Parquet. Under the strict protocol (v0.12.0 re-measure, 2026-07-03, solo passes, per-query process isolation, 2σ bars) ematix-flow is fastest — or tied for fastest — on every query at every scale: 22 / 22 (SF=1), 17 fastest + 5 ties, 0 slower (SF=10), 20 fastest + 2 ties, 0 slower (SF=100), and holds the lead on concurrent-stream throughput at every scale × stream count. Full tabbed-by-scale tables at /reference/benchmarks.
Scale-relative broadcast join — when a plain dimension⋈fact Inner join has a probe side ≥ 16× the build side, the build is broadcast via CollectLeft instead of hash-repartitioned. DataFusion’s own gate is an absolute byte/row threshold that misses this at scale; the rule is scale-relative and default-on. Measured −54% on the Q18 shape when it landed; nets out positive at SF=1 / SF=10 too. (Earlier “Q18 SF=100 5.8×” claims came from a cached-plan benchmark context — the cold production number is a DuckDB win on that query; see Benchmarks.)
Cardinality-gated Robin-Hood aggregates — the fused SUM(f64) GROUP BY / COUNT / AVG operators now refuse to fire above a group-cardinality ceiling, where DataFusion’s vectorised hash aggregate wins. Fixes a ~9× Q18 SF=100 regression the rule used to cause when group count exploded.
CollectLeft for semi-bounded builds — runtime-bloom semi-detection extended to RightSemi / RightAnti, so a build side bounded by a semi-join is broadcast. Q18 SF=10.
Cost-model join reorder, default-on at scale — NDV-realistic selectivity (string-equality + FK cap) + leftmost-leaf build cost in the DP enumeration. Unblocks the Q05 / Q08 join funnels.
Q15 SF=100 correctness — repaired partitioning after the shared- subtree dedupe wrap (re-run EnforceDistribution); all 22 queries match DuckDB row-for-row at every scale.
ematix-parquet 0.16.3 — scale-aware footer cache collapses the O(N_rg²) re-parse on wide row-group files (Q06 SF=10).

Recently closed (v0.8.0)

v0.8.0 is a perf release. No Python API changes; the v0.7.0 workflow trigger surface (per-job DAG + trigger kwargs) is unchanged. The headline lift is in the single-node execution engine and the Web UI redesign — operational behavior the same, query latency materially better.

SF=1 + SF=10 release benchmarks at 20 trials × 3 warmups across ematix-flow / DuckDB / Polars / PySpark on the same hardware + the same Parquet files. Headline: SF=1 20 / 22 ematix wins vs DuckDB; SF=10 14 / 22 (11–14 noise band — back-to-back runs of the same binary moved every per-query median 5–29% in lockstep across both ematix and DuckDB, Mac thermal / frequency rather than engine). Full numbers + caveats at /reference/benchmarks.
Web UI Linear-style theme — dark-first, single teal accent, monospace numerics, theme toggle persists. Replaces the Pip-Boy / CRT-phosphor theme. Composite trigger expressions (nested AllOf / AnyOf) render as labelled state-pill trees inline in the Workflows view. Dev-only Vite mock middleware so npm run dev in web-ui/ works without a Python backend.
DAG flowchart — long workflow-qualified job names now fit cleanly: leaf name on the primary line, workflow prefix as a muted subtitle, ellipsis safety net for over-long names.
Σ.Q runtime bloom sideband — distributed-friendly per-query bloom-filter transport (BridgeFilterSideband + BuildSideBloomEmitterExec + EnableRuntimeBloomSidebandRule). Default ON with EMAT_RT_BLOOM_RATIO=1024; Inner-join firings selectivity-gated. Closes Q07 / Q17 / Q18 SF=10 wins.
Σ.Q.L10 PushDownLeftSemiRule — pushes LeftSemi joins toward the scan side. Default ON in the milestone config. Q18 SF=10 −54%.
Σ.O.c row-group decode cache — process-wide Arc-shared, byte-bounded LRU keyed by file + RG + column. Default ON. Q13 / Q21 SF=10 tail-rep speedups.
Robin-Hood SUM(f64) operator — EMAT_RH_SUM_F64=1, default ON in milestone config. Closes Q22 / Q04 SF=1.
Σ.L.1 adaptive runtime — speculative-race + 3-state DictArrivalVerdict + resolver. First piece of the “optimizer-that-learns-from-every-query” framing.
ematix-parquet 0.16.1 — V1 level-prefix correctness fix, LZ4_RAW arm + V1+Optional bug fix, splash-shaped scalar predicate-bitmap pack, full SIMD specialisation bw=1..=32, and the new sidecar-index capability (sorted / Bloom / composite / inverted-text) plus an ematix-iceberg crate. Sidecar consumption by flow is on the roadmap — not wired into v0.8.0.

Recently closed (v0.7.0)

v0.7.0 reorganises the workflow trigger model. The previous v0.6.0 shape — workflow with a centralised depends_on={dict} — is replaced by a richer trigger surface on the workflow + per-job DAG declaration on each member job. Hard break, no backwards compat — alpha project, nothing shipped externally depended on the v0.6.0 shape.

Workflow trigger kwargs (all AND-conjoined since last successful self-run):
- triggered_by=[name, ...] — workflow / job names that must have succeeded.
- schedule="cron" + optional timezone="IANA" — cron tick must reach.
- on_message=<source> — per-message firing (exclusive with the above).
- At least one required, unless the workflow contains a streaming pipeline (implicit streaming).
Composite triggers — declaring more than one kwarg ANDs the conditions together. Example: triggered_by=["wf_A", "wf_B"] + schedule="0 21 * * *" fires when both workflows have completed since this workflow last succeeded AND the 21:00 tick has reached. If conditions are satisfied at different times, the workflow fires immediately on the last satisfied condition rather than waiting for the next cron tick.
Within-workflow DAG declared on each job via @ematix.job(name=..., depends_on=[upstream_job, ...]). The workflow itself no longer takes a depends_on= dict — passing one raises ValueError with a pointer to the new model.
Validation at registration time for trigger conflicts (e.g. on_message + schedule together), workflow-level cycles, missing triggers on non-streaming workflows.
Web UI Workflows card renders a trigger summary line under the header: “Trigger: After: a, b · Schedule: 0 21 * * * America/New_York”. Streaming workflows keep the LIVE STREAMING pill in place of a trigger line.
/api/workflows carries triggered_by, schedule, timezone, on_message per workflow; edges derived by walking each member job’s depends_on.

Recently closed (v0.6.1)

Streaming workflows visible on the Workflows tab. /api/workflows now includes streaming pipelines as kind: "streaming" workflow-of-one cards. They previously only appeared on the Jobs and Runs tabs.
Web UI renders streaming workflows with the pulsing amber ▶ LIVE STREAMING pill and a live throughput / batch-cycle footer driven by the same streaming_stats snapshot the Jobs page uses.

Recently closed (v0.6.0)

v0.6.0 reorganised the user-facing model around Workflows and Jobs — same scheduler and runtime as v0.5.0, but the Web UI and decorator names now reflect the natural hierarchy.

ematix.workflow(name=..., jobs=[...], depends_on={...}) — new declaration grouping jobs and declaring the DAG between them. (Superseded in v0.7.0 by the trigger-kwargs surface above.)
@ematix.job decorator as the primary name; @ematix.pipeline remains a non-breaking alias.
/api/workflows endpoint — returns declared workflows + their member jobs and DAG edges, with synthetic single-job workflows for any job not in a declared workflow.
Web UI restructured into four tabs: Workflows (default, with inline SVG flowcharts), Jobs (flat list with filter + sort), Runs (renamed from Jobs, sortable columns), and DAG (cubic-Bézier arrows, #/dag/<job> to focus a subgraph).
flow web --module <name> — pre-imports a pipelines module so the UI can render schedules, next-run times, and the DAG without a separate scheduler tick having populated the rich-history first.
Loopback bind no longer requires a bearer token by default — set --token explicitly when binding to a non-loopback address.

Earlier (v0.5.0 highlights)

v0.5.0 was the operational-surface release — CLIs, alerters, Web UI, and observability on top of the v0.4.0 backend matrix.

@ematix.warehouse_pipeline decorator + Rust executor (invoke_warehouse_pipeline) via the PyO3 callback bridge — no subprocess fork per run.
AWS Glue Schema Registry — end-to-end Kafka dispatch (typed glue_schema_registry connection, Rust dispatch on both consumer and producer paths, zlib codec, LocalStack integration suite).
Four new CLI subcommands: flow doctor, flow init, flow logs, flow secrets test.
Web UI bearer-token auth (flow web --token …) + cross-pipeline DAG view at #/dag.
Email + PagerDuty alerters — email://… (stdlib smtplib) and pagerduty://… (Events API v2, auto-resolves on recovery).
OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, OTLP HTTP) + 6-panel Grafana starter dashboard.
Streaming-pipeline live stats in the Web UI — rolling 1 m / 5 m throughput + batch-cycle windows in place of the old “Median duration: —” footer.
Cron schedule timezones — timezone="America/New_York" on the job; is_due() evaluates in that tz, Web UI renders “Next: …” in the same.
Arrow-native warehouse adapters — Snowflake PUT + COPY INTO, Redshift S3 + COPY, BigQuery GCS + load_table_from_uri — pandas no longer on the warehouse write path.

Full changelog: see CHANGELOG.md in the repo for every prior release.

What’s still on the roadmap

Published distributed benchmarks. The distributed code path is shipped (see Execution above) and has a bench harness (tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /reference/benchmarks is single-machine.

Other roadmap items are tracked openly in the GitHub repo issues.