What's shipped
v0.10.0 surface matrix — what's stable, what's still in motion.
Backends
| Backend | Batch source | Streaming source | Target | DDL planning | Strategy executors (append / merge / scd2 / truncate) | CDC target |
|---|---|---|---|---|---|---|
| Postgres | ✅ | — | ✅ | ✅ | ✅ (native + COPY BINARY) | ✅ |
| MySQL | ✅ | — | ✅ | ✅ | ✅ (ON DUPLICATE KEY) | ✅ |
| SQLite | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| DuckDB | ✅ | — | ✅ | ✅ | ✅ | ✅ |
| Snowflake | ✅ | — | ✅ | n/a | append (Arrow → Parquet → PUT + COPY INTO) + truncate + merge (staged MERGE) | — |
| BigQuery | ✅ | — | ✅ | n/a | append (Arrow → Parquet → GCS + load_table_from_uri) + truncate + merge (MERGE INTO from staging) | — |
| Redshift | ✅ | — | ✅ | n/a | append (S3 → COPY) + truncate + merge (MERGE INTO from staging) | — |
| Delta Lake (local + S3) | ✅ | ✅ | ✅ | n/a | ✅ (DataFusion-backed MERGE) | ✅ |
| Object stores (Parquet / CSV / ORC / JSONL, local + S3) | ✅ | ✅ | ✅ | n/a | append + truncate | — |
| Kafka | — | ✅ | ✅ | n/a | append (cross-backend) | source role only |
| RabbitMQ | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| GCP Pub/Sub | — | ✅ | ✅ | n/a | append (cross-backend) | — |
| AWS Kinesis | — | ✅ | ✅ | n/a | append (cross-backend) | — |
Batch source = readable by @ematix.pipeline (the function returns a
SQL string; the framework executes it against the source connection).
Streaming source = tailable by flow consume /
@ematix.streaming_pipeline (long-running consumer with manual offset
commit / ack).
Target = writable by either pipeline shape. Cross-backend moves
stream Apache Arrow batches end-to-end — same-DB pairs take the
INSERT … SELECT fast path automatically.
Workflows + Jobs surface
@ematix.job— batch / scheduled. Atomic unit of work; one function, one target, one schedule. (Historical name@ematix.pipelineremains as an alias.)ematix.workflow(name, jobs, depends_on)— names a group of jobs + declares the DAG between them. Workflows are the user-facing organizing concept on the Web UI; the DAG lives here, not on individual jobs.@ematix.warehouse_pipeline— scheduled warehouse-to-warehouse read → DuckDB transform → bulk write, registered with the same cron / DAG / retry /flow run-duemachinery.@ematix.streaming_pipeline— long-running consumer.@ematix.connection— typed connection with${VAR}interpolation.@ematix.table/ManagedTable— declarative schema + DDL.@ematix_flow.udf/@ematix_flow.udaf— Python user-defined scalar + aggregate functions.
Modes
append— default.truncate— full refresh.merge(a.k.a.scd1) — upsert on merge keys.scd2— slowly-changing-dimension Type 2.
Execution
- Single-node (default when no peers). The whole engine runs
in-process; the
flowbinary is a single ~25 MB native executable. - Distributed (auto-detected).
engine = "auto"is the default — setpeers = [...]and SQL fans out across a peer-to-peer mesh offlow-workerprocesses via Apache Arrow Flight; otherwise falls back to in-process. mTLS-secured mesh, cross-pod lookup broadcast, no separate cluster service. Symmetric — any process linkingematix-flow-distributedcan play coordinator or worker. Spark / DuckDB dialect translator means existing SQL ports across without rewrites.
Operational
- Cron + DAG + cycle detection + retries. DAG edges live on the
workflow declaration (or, legacy, on per-job
depends_on=). - Cron schedule timezones —
timezone="America/New_York"on the job; Web UI renders “Next: …” in the job’s tz. - Watermarks + restart-safe state.
- Run-history store (queryable via
flow runs ...). - Prometheus metrics + OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, or OTLP HTTP collectors).
- Alerters: Slack, generic webhook, stdout, email
(
email://user:pass@host:port?from=&to=) and PagerDuty (pagerduty://routing_key?service=&severity=— auto-resolves on recovery). - Web UI bearer-token auth (
flow web --token …) for off-host access; cross-pipeline DAG view of everydepends_onedge; streaming pipeline live throughput + batch-cycle stats. - Grafana starter dashboard JSON
(
examples/grafana/ematix-flow-dashboard.json). - DLQ at app + broker level.
- Schema Registry: Confluent + Apicurio (Avro + Protobuf) and AWS Glue Schema Registry (Avro, end-to-end Kafka dispatch + LocalStack integration tests).
- Exactly-once Kafka → Kafka (transactions + consumer coordination).
Stream processing
- SQL transforms over Arrow batches (in-process, no JVM).
- Tumbling / hopping / session windows.
- Scalar + aggregate Python UDFs.
Recently closed (v0.9.0)
v0.10.0 is a perf + capability release. No Python API changes; the surface from v0.8.0 is unchanged. The work is in the optimizer’s join handling at scale and a full benchmark refresh.
- Three-scale benchmark refresh — all 22 TPC-H queries across SF=1 / SF=10 / SF=100 and five engines (ematix-flow, DuckDB, Polars, single-node PySpark, Postgres), same hardware + same Parquet. ematix-flow wins 22 / 22 (SF=1), 18 / 22 (SF=10), 16 / 22 (SF=100). Full tabbed-by-scale tables at /reference/benchmarks.
- Scale-relative broadcast join — when a plain dimension⋈fact Inner
join has a probe side ≥ 16× the build side, the build is broadcast via
CollectLeftinstead of hash-repartitioned. DataFusion’s own gate is an absolute byte/row threshold that misses this at scale; the rule is scale-relative and default-on. Q18 at SF=100: 377 ms vs DuckDB’s 2 193 ms (5.8×); nets out positive at SF=1 / SF=10 too. - Cardinality-gated Robin-Hood aggregates — the fused
SUM(f64) GROUP BY/COUNT/AVGoperators now refuse to fire above a group-cardinality ceiling, where DataFusion’s vectorised hash aggregate wins. Fixes a ~9× Q18 SF=100 regression the rule used to cause when group count exploded. - CollectLeft for semi-bounded builds — runtime-bloom semi-detection
extended to
RightSemi/RightAnti, so a build side bounded by a semi-join is broadcast. Q18 SF=10. - Cost-model join reorder, default-on at scale — NDV-realistic selectivity (string-equality + FK cap) + leftmost-leaf build cost in the DP enumeration. Unblocks the Q05 / Q08 join funnels.
- Q15 SF=100 correctness — repaired partitioning after the shared-
subtree dedupe wrap (re-run
EnforceDistribution); all 22 queries match DuckDB row-for-row at every scale. - ematix-parquet 0.16.3 — scale-aware footer cache collapses the O(N_rg²) re-parse on wide row-group files (Q06 SF=10).
Recently closed (v0.8.0)
v0.8.0 is a perf release. No Python API changes; the v0.7.0 workflow trigger surface (per-job DAG + trigger kwargs) is unchanged. The headline lift is in the single-node execution engine and the Web UI redesign — operational behavior the same, query latency materially better.
- SF=1 + SF=10 release benchmarks at 20 trials × 3 warmups across ematix-flow / DuckDB / Polars / PySpark on the same hardware + the same Parquet files. Headline: SF=1 20 / 22 ematix wins vs DuckDB; SF=10 14 / 22 (11–14 noise band — back-to-back runs of the same binary moved every per-query median 5–29% in lockstep across both ematix and DuckDB, Mac thermal / frequency rather than engine). Full numbers + caveats at /reference/benchmarks.
- Web UI Linear-style theme — dark-first, single teal accent,
monospace numerics, theme toggle persists. Replaces the Pip-Boy /
CRT-phosphor theme. Composite trigger expressions (nested
AllOf/AnyOf) render as labelled state-pill trees inline in the Workflows view. Dev-only Vite mock middleware sonpm run devinweb-ui/works without a Python backend. - DAG flowchart — long workflow-qualified job names now fit cleanly: leaf name on the primary line, workflow prefix as a muted subtitle, ellipsis safety net for over-long names.
- Σ.Q runtime bloom sideband — distributed-friendly per-query
bloom-filter transport (
BridgeFilterSideband+BuildSideBloomEmitterExec+EnableRuntimeBloomSidebandRule). Default ON withEMAT_RT_BLOOM_RATIO=1024; Inner-join firings selectivity-gated. Closes Q07 / Q17 / Q18 SF=10 wins. - Σ.Q.L10
PushDownLeftSemiRule— pushes LeftSemi joins toward the scan side. Default ON in the milestone config. Q18 SF=10 −54%. - Σ.O.c row-group decode cache — process-wide Arc-shared, byte-bounded LRU keyed by file + RG + column. Default ON. Q13 / Q21 SF=10 tail-rep speedups.
- Robin-Hood SUM(f64) operator —
EMAT_RH_SUM_F64=1, default ON in milestone config. Closes Q22 / Q04 SF=1. - Σ.L.1 adaptive runtime — speculative-race + 3-state
DictArrivalVerdict+ resolver. First piece of the “optimizer-that-learns-from-every-query” framing. - ematix-parquet 0.16.1 — V1 level-prefix correctness fix,
LZ4_RAW arm + V1+Optional bug fix, splash-shaped scalar
predicate-bitmap pack, full SIMD specialisation bw=1..=32, and the
new sidecar-index capability (sorted / Bloom / composite /
inverted-text) plus an
ematix-icebergcrate. Sidecar consumption by flow is on the roadmap — not wired into v0.8.0.
Recently closed (v0.7.0)
v0.7.0 reorganises the workflow trigger model. The previous v0.6.0
shape — workflow with a centralised depends_on={dict} — is replaced
by a richer trigger surface on the workflow + per-job DAG declaration
on each member job. Hard break, no backwards compat — alpha
project, nothing shipped externally depended on the v0.6.0 shape.
- Workflow trigger kwargs (all AND-conjoined since last successful
self-run):
triggered_by=[name, ...]— workflow / job names that must have succeeded.schedule="cron"+ optionaltimezone="IANA"— cron tick must reach.on_message=<source>— per-message firing (exclusive with the above).- At least one required, unless the workflow contains a streaming pipeline (implicit streaming).
- Composite triggers — declaring more than one kwarg ANDs the
conditions together. Example:
triggered_by=["wf_A", "wf_B"]+schedule="0 21 * * *"fires when both workflows have completed since this workflow last succeeded AND the 21:00 tick has reached. If conditions are satisfied at different times, the workflow fires immediately on the last satisfied condition rather than waiting for the next cron tick. - Within-workflow DAG declared on each job via
@ematix.job(name=..., depends_on=[upstream_job, ...]). The workflow itself no longer takes adepends_on=dict — passing one raisesValueErrorwith a pointer to the new model. - Validation at registration time for trigger conflicts (e.g.
on_message+scheduletogether), workflow-level cycles, missing triggers on non-streaming workflows. - Web UI Workflows card renders a trigger summary line under the header: “Trigger: After: a, b · Schedule: 0 21 * * * America/New_York”. Streaming workflows keep the LIVE STREAMING pill in place of a trigger line.
/api/workflowscarriestriggered_by,schedule,timezone,on_messageper workflow; edges derived by walking each member job’sdepends_on.
Recently closed (v0.6.1)
- Streaming workflows visible on the Workflows tab.
/api/workflowsnow includes streaming pipelines askind: "streaming"workflow-of-one cards. They previously only appeared on the Jobs and Runs tabs. - Web UI renders streaming workflows with the pulsing amber
▶ LIVE STREAMINGpill and a live throughput / batch-cycle footer driven by the samestreaming_statssnapshot the Jobs page uses.
Recently closed (v0.6.0)
v0.6.0 reorganised the user-facing model around Workflows and Jobs — same scheduler and runtime as v0.5.0, but the Web UI and decorator names now reflect the natural hierarchy.
ematix.workflow(name=..., jobs=[...], depends_on={...})— new declaration grouping jobs and declaring the DAG between them. (Superseded in v0.7.0 by the trigger-kwargs surface above.)@ematix.jobdecorator as the primary name;@ematix.pipelineremains a non-breaking alias./api/workflowsendpoint — returns declared workflows + their member jobs and DAG edges, with synthetic single-job workflows for any job not in a declared workflow.- Web UI restructured into four tabs: Workflows (default, with
inline SVG flowcharts), Jobs (flat list with filter + sort), Runs
(renamed from Jobs, sortable columns), and DAG (cubic-Bézier arrows,
#/dag/<job>to focus a subgraph). flow web --module <name>— pre-imports a pipelines module so the UI can render schedules, next-run times, and the DAG without a separate scheduler tick having populated the rich-history first.- Loopback bind no longer requires a bearer token by default —
set
--tokenexplicitly when binding to a non-loopback address.
Earlier (v0.5.0 highlights)
v0.5.0 was the operational-surface release — CLIs, alerters, Web UI, and observability on top of the v0.4.0 backend matrix.
@ematix.warehouse_pipelinedecorator + Rust executor (invoke_warehouse_pipeline) via the PyO3 callback bridge — no subprocess fork per run.- AWS Glue Schema Registry — end-to-end Kafka dispatch (typed
glue_schema_registryconnection, Rust dispatch on both consumer and producer paths, zlib codec, LocalStack integration suite). - Four new CLI subcommands:
flow doctor,flow init,flow logs,flow secrets test. - Web UI bearer-token auth (
flow web --token …) + cross-pipeline DAG view at#/dag. - Email + PagerDuty alerters —
email://…(stdlib smtplib) andpagerduty://…(Events API v2, auto-resolves on recovery). - OpenTelemetry trace spans for every pipeline run (stdout, OTLP gRPC, OTLP HTTP) + 6-panel Grafana starter dashboard.
- Streaming-pipeline live stats in the Web UI — rolling 1 m / 5 m throughput + batch-cycle windows in place of the old “Median duration: —” footer.
- Cron schedule timezones —
timezone="America/New_York"on the job;is_due()evaluates in that tz, Web UI renders “Next: …” in the same. - Arrow-native warehouse adapters — Snowflake
PUT+COPY INTO, Redshift S3 +COPY, BigQuery GCS +load_table_from_uri— pandas no longer on the warehouse write path.
Full changelog: see
CHANGELOG.md
in the repo for every prior release.
What’s still on the roadmap
- Published distributed benchmarks. The distributed code path
is shipped (see Execution above) and has a bench
harness (
tpch_distributed), but cluster-scale TPC-H runs at SF≥100 aren’t published yet — every number on /reference/benchmarks is single-machine.
Other roadmap items are tracked openly in the GitHub repo issues.