Refactor: standard install/start/check/stop/load/query interface per system by alexey-milovidov · Pull Request #860 · ClickHouse/ClickBench

alexey-milovidov · 2026-05-07T12:15:17Z

Summary

Split each local system's monolithic benchmark.sh into 7 single-purpose scripts (install, start, check, stop, load, query, data-size) with a stable contract, driven by a new shared lib/benchmark-common.sh.
Wrap dataframe / in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius) in small FastAPI servers so they fit the same start/stop/query lifecycle.
88 local systems refactored; cloud/managed systems and a handful of non-functional ones are intentionally untouched.

Why

Previously, every system's benchmark.sh bundled installation, server lifecycle, dataset download, data loading, and query dispatch into one script — and run.sh hard-coded the per-query orchestration. There was no programmatic per-query entry point, so:

Tweaking the dataset, query set, or per-query behavior (e.g. restarting the system between queries to neutralize warm-process effects) required editing every system's scripts individually.
Building an online "run query X against system Y" service was impossible.
Most run.sh ran all 3 tries inside a single CLI invocation, so OS-cache warmth from try 1 leaked into tries 2/3.

The new per-system interface

Script	Stdin	Stdout	Stderr	Notes
`install`	-	progress	progress	Idempotent. Env prep + system install.
`start`	-	-	progress	Start daemon. Idempotent. Empty/exit-0 for stateless tools.
`check`	-	-	progress	Trivial query (e.g. `SELECT 1`). Exit 0 iff responsive.
`stop`	-	-	progress	Stop daemon. Idempotent.
`load`	-	progress	progress	Runs create.sql + loads data; deletes source files then `sync`.
`query`	one query	query result, any format	last line: fractional seconds (`0.123`)	Non-zero exit on failure.
`data-size`	-	bytes (one integer)	-	Reports the data footprint.
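
As an illustration of the `query` row above, here is a minimal sketch of a script body that honors the contract. ENGINE_CMD is a hypothetical stand-in for a system's real client (it defaults to `cat` so the sketch runs anywhere), and timing with `date +%s.%N` assumes GNU date; real per-system scripts may instead parse the client's own timer output.

```shell
# ENGINE_CMD is hypothetical: any command that reads one query on stdin
# and writes the result to stdout.
ENGINE_CMD=${ENGINE_CMD:-cat}

run_query() {
    local start end status
    start=$(date +%s.%N)            # assumes GNU date (%N = nanoseconds)
    $ENGINE_CMD                     # stdin: the query, stdout: the result
    status=$?
    end=$(date +%s.%N)
    # Non-zero exit on failure, and no timing line in that case.
    [ "$status" -eq 0 ] || return "$status"
    # Last line of stderr: elapsed wall-clock time in fractional seconds.
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }' >&2
}
```

A real `query` script would end by calling `run_query`, so that a caller can do `echo 'SELECT 1' | ./query` and read the timing from the last stderr line.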

Each system's benchmark.sh becomes a 4-line shim that sets a couple of env vars and exec's the shared driver:

#!/bin/bash
export BENCH_DOWNLOAD_SCRIPT="download-hits-parquet-partitioned"
export BENCH_RESTARTABLE=yes
exec ../lib/benchmark-common.sh

The shared driver runs install → start+check → download → load (timed) → for each query: flush caches; if BENCH_RESTARTABLE=yes, stop+start; run query 3× → data-size → stop. The output log shape (Load time:, one [t1,t2,t3] line per query, Data size:) is identical to the old benchmark.sh, so cloud-init.sh.in's POST to play.clickhouse.com keeps working unchanged.
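
The "run query 3×" step of that flow can be sketched roughly as follows. This is not the code in lib/benchmark-common.sh: QUERY_CMD, time_one_try and run_tries are hypothetical names, and cache flushing and the optional stop+start are left out because they need root and a real system.

```shell
# QUERY_CMD is a hypothetical override so the loop can be pointed at a stub;
# by default it is the per-system ./query script.
QUERY_CMD=${QUERY_CMD:-./query}

time_one_try() {
    local err
    # Capture stderr only; the query result on stdout is discarded.
    # A failing query is recorded as null.
    err=$(printf '%s\n' "$1" | "$QUERY_CMD" 2>&1 >/dev/null) || { printf 'null'; return 0; }
    # The contract puts the timing on the last stderr line.
    printf '%s' "${err##*$'\n'}"
}

run_tries() {
    local t1 t2 t3
    t1=$(time_one_try "$1")
    t2=$(time_one_try "$1")
    t3=$(time_one_try "$1")
    printf '[%s,%s,%s]\n' "$t1" "$t2" "$t3"
}
```

Because each try is a separate `./query` invocation, the driver can insert a cache flush or a restart before the first try without touching any per-system script.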

BENCH_RESTARTABLE=no for embedded CLIs (duckdb, sqlite, datafusion, …) and dataframe wrappers — restarting a single CLI/Python process between queries would dominate query time. For these, OS caches are still flushed between queries.

Scope

Refactored (88 systems):

Server, restartable: clickhouse, postgresql, mysql, mariadb, monetdb, druid, pinot, vertica, exasol, kinetica, heavyai, questdb, cockroachdb, elasticsearch, ydb, … and the postgres/clickhouse/mysql variants (timescaledb, citus, paradedb, postgresql-indexed, clickhouse-parquet*, clickhouse-datalake*, mysql-myisam, tidb, infobright, …)
Embedded CLI, not restartable: duckdb (and variants), sqlite, datafusion (and partitioned), glaredb (and partitioned), hyper, hyper-parquet, octosql, opteryx, sail (and partitioned), drill, turso, chdb, chdb-parquet-partitioned
Dataframe with FastAPI wrapper, not restartable: pandas, polars-dataframe, chdb-dataframe, daft-parquet, daft-parquet-partitioned, duckdb-dataframe, sirius
Spark family: spark, spark-auron, spark-comet, spark-gluten

Not refactored (intentionally out of scope):

Cloud / managed: alloydb, athena, aurora-{mysql,postgresql}, bigquery, clickhouse-cloud, databricks, motherduck, redshift, redshift-serverless, snowflake, hydrolix, firebolt(), hologres, tinybird, hydra, mariadb-columnstore, pg_duckdb, singlestore, supabase, tablespace, tembo-olap, timescale-cloud, crunchy-bridge-for-analytics, s3select, …
Non-functional: csvq, dsq, locustdb (panic on first query); exasol, spark-velox (empty dirs)
Non-SQL or no SQL CLI: mongodb (JS aggregation pipelines), polars (no SQL CLI; the dataframe variant is wrapped instead)

Validated end-to-end on a 96-core / 185 GB ARM machine

System	Data	Outcome
clickhouse	14.2 GB / 100M rows	Full 43 queries × 3 tries with stop/start between queries; load 124s
duckdb	20.6 GB / 100M rows	Full 43 queries × 3 tries (no restart); load 69s
pandas	4.2 GB in-mem (5M-row subset)	42/43 queries; Q43 hit a pandas lambda bug → recorded as `null` (framework's error path works)
sqlite	3.9 GB (5M-row subset)	First 5 queries × 3 tries; load 68s
postgresql	100M rows / 75 GB TSV	First 3 queries × 3 tries with restart; load 829s. Cold-cache spike clearly visible (135s → 7s after warmup) — confirms per-query restart actually flushes the page cache

All 88 refactored systems pass bash -n and have executable bits set on the 7 scripts + benchmark.sh.

Bug fixes surfaced during validation

lib/benchmark-common.sh: data-size now runs before stop (clickhouse and pandas need the server up to report size).
clickhouse/start: idempotent (was erroring when already running).
duckdb/load, sqlite/load: rm -f hits.db/mydb for idempotent reruns.
postgresql/load: -v ON_ERROR_STOP=1 so COPY data errors actually fail the script instead of silently rolling back.
BENCH_DOWNLOAD_SCRIPT may now be empty for systems that read directly from S3 datalakes / remote services (clickhouse-datalake*, duckdb-datalake*, chyt, …).

Flagged for follow-up review

duckdb-memory — :memory: semantics force a per-query reload; will inflate timings vs. the original single-process flow.
cloudberry, greenplum — multi-phase install (reboot between phases); the shim only runs phase 1.
sirius — GPU-dependent; long-lived duckdb CLI subprocess proxy; review the stdin/sentinel protocol.
paradedb*, pg_ducklake, pg_mooncake — Docker container created in install then docker cp in load (small divergence from the original docker run -v ... due to the lifecycle order: start runs before download).

Test plan

bash -n on all 88 systems' scripts
clickhouse: full 43-query benchmark.sh on 100M-row real data
duckdb: full 43-query benchmark.sh on 100M-row real data
pandas: 43-query benchmark.sh on a 5M-row subset
sqlite: abbreviated benchmark.sh on a 5M-row subset
postgresql: abbreviated benchmark.sh on full 100M-row data
Smoke-run on a fresh c6a.metal/equivalent VM via cloud-init for a representative system from each family before merging
Verify play.clickhouse.com log-ingestion sink continues to parse the output for at least one production benchmark run

🤖 Generated with Claude Code

…/data-size

Each local system now exposes a small set of single-purpose scripts with a stable contract, so they can be driven by a shared lib/benchmark-common.sh and reused by external tooling (e.g. an online "run query against system X" service):

install    env prep + system install (idempotent)
start      start daemon (idempotent; empty for stateless tools)
check      trivial query, exit 0 iff responsive
stop       stop daemon (idempotent)
load       runs create.sql + loads data, deletes source files, sync
query      SQL on stdin; result on stdout; runtime in fractional seconds on the last line of stderr; non-zero exit on error
data-size  prints data footprint in bytes (one integer to stdout)

Each system's old monolithic benchmark.sh is replaced by a 4-line shim that sets a couple of env vars (BENCH_DOWNLOAD_SCRIPT, BENCH_RESTARTABLE) and exec's lib/benchmark-common.sh.

The shared driver runs the unified flow: install -> start+check -> download -> load (timed) -> for each query {flush caches; optionally stop+start to neutralize warm-process effects; run query 3x} -> data-size -> stop. Output format ([t1,t2,t3], Load time, Data size) matches the previous benchmark.sh exactly so cloud-init.sh.in's log POST to play.clickhouse.com keeps working unchanged.

For dataframe/in-process systems (pandas, polars-dataframe, chdb-dataframe, daft-parquet*, duckdb-dataframe, sirius), the engine is wrapped in a small FastAPI server (server.py) so the start/stop/query interface still applies. BENCH_RESTARTABLE=no for these (and for embedded CLIs like duckdb, sqlite, datafusion, etc.) since restarting a single Python/CLI process between queries would dominate query time.

Scope: 88 local systems refactored. Cloud/managed systems and a handful of non-functional ones (csvq, dsq, locustdb, mongodb, polars CLI, exasol, spark-velox) are intentionally left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflict in clickhouse-datalake{,-partitioned}: upstream switched the datalake variants from filesystem-cache to userspace page-cache (PR #818). The refactored install/query scripts now adopt the page-cache approach. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mongodb: query takes a MongoDB aggregation pipeline (Extended JSON, one line) on stdin instead of SQL; these are the same canonical 43 ClickBench queries, just expressed as mongo pipelines. queries.txt is generated from queries.js (the source of truth) by replacing JS-only constructors (NumberLong, ISODate, NumberDecimal) with their EJSON canonical form. The shim sets BENCH_QUERIES_FILE=queries.txt to point the driver at it.

polars: wrapped in a FastAPI server analogous to polars-dataframe, but the load step uses pl.scan_parquet (LazyFrame), so the parquet file remains needed at query time and the load script does NOT delete hits.parquet. data-size returns the on-disk parquet size since a LazyFrame has no materialized in-memory size.

Both systems now expose the standard install/start/check/stop/load/query/data-size scripts and a 4-line benchmark.sh shim, removing the old benchmark.sh / run.js / query.py / formatResult.js paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…use in query Per review: clickhouse-local persists table metadata in its --path dir, so the CREATE TABLE only needs to run once during ./load. ./query just runs the query against the persisted table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…atively Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… readiness Per review (alexey-milovidov): clickhouse start leaves the system in the desired state (server running) even when it returns non-zero with "already running". Make the shared driver tolerate non-zero from ./start and rely on bench_check_loop as the authoritative readiness signal. This lets per-system start scripts stay simple — they just need to make a best-effort attempt to launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ouse#860)

Adopts the per-system 7-script interface from ClickHouse#860 for gizmosql/, and replaces the Java sqlline-based gizmosqlline client with the C++ gizmosql_client shell that ships with gizmosql_server.

Scripts (matching the contract from lib/benchmark-common.sh):
benchmark.sh - 4-line shim that exec's ../lib/benchmark-common.sh
install - apt + curl gizmosql_cli_linux_$ARCH.zip; no openjdk, no separate gizmosqlline download
start - idempotent server bring-up (skips if port 31337 is open)
check - cheap TCP probe (auth-gated SQL would need credentials)
stop - kills tracked PID; pkill belt-and-braces fallback
load - rm -f clickbench.db, then create.sql + load.sql via gizmosql_client; deletes hits.parquet and sync's
query - reads one query from stdin, runs via gizmosql_client with .timer on + .mode trash; emits fractional seconds as the last stderr line (parsed from "Run Time: X.XXs")
data-size - wc -c clickbench.db

Notes:
- BENCH_DOWNLOAD_SCRIPT=download-hits-parquet-single, BENCH_RESTARTABLE=yes (gizmosql is a server, so per-query restart neutralizes warm-process effects, matching the clickhouse/postgres pattern in ClickHouse#860).
- util.sh now exports GIZMOSQL_HOST/PORT/USER/PASSWORD, the env vars gizmosql_client reads natively, so query/load can call gizmosql_client with no flags. The server still receives the username via --username.
- PID_FILE moved to a stable /tmp path (was /tmp/gizmosql_server_$$.pid, which broke across the start/stop process boundary in the new layout).

This PR depends on ClickHouse#860 (which introduces lib/benchmark-common.sh and the contract). Once ClickHouse#860 lands, this PR's diff against main will be only the gizmosql/ files.

Validated locally on macOS with gizmosql v1.22.4: the query script produces the expected fractional-seconds last line with stdout and stderr kept separate, and exits non-zero on error paths.

See https://docs.gizmosql.com/#/client for gizmosql_client docs.

Resolves merge conflicts:
- Removed cedardb/run.sh, gizmosql/run.sh: superseded by the standard query interface; the refactor branch already replaced them.
- Restored datafusion{,-partitioned}/make-json.sh, doris{,-parquet}/get-result-json.sh with main's dated-results version. These are independent post-run JSON builders, still referenced from the per-system READMEs.
- Kept the thin benchmark.sh shim in gizmosql/, spark-{auron,comet,gluten}/, trino/. Per-system result-JSON auto-save (added on main while this branch was in flight) is intentionally not carried over: under the new interface, result.csv is the single timing artifact and JSON construction belongs in separate tooling.
- gizmosql/{install,load,query,util.sh}: merge auto-took main's switch from gizmosqlline (Java) to gizmosql_client (CLI shipped with the server), but the refactor branch's load/query still referenced GIZMOSQL_SERVER_URI and GIZMOSQL_USERNAME. Updated install to drop openjdk + gizmosqlline, load to use gizmosql_client (and stop the server first to release the database file), and query to drive gizmosql_client with .timer/.mode trash and parse "Run Time:" instead of "rows selected (... seconds)".

…-system layout

These four entries were added on main while this branch was in flight (the existing trino/ scripts here were a memory-connector stub that never worked end-to-end). Rebuild each one against the new install/start/check/stop/load/query/data-size contract so they share lib/benchmark-common.sh:
- trino, trino-partitioned: Hive connector + file metastore + local Parquet hardlinked into data/hits/ (matches main's working impl from PR #856).
- trino-datalake{,-partitioned}: same, plus the AnonymousAWSCredentials shim to read clickhouse-public-datasets/hits_compatible/athena from anonymous S3 (the published bucket size is reported by data-size since the data is read on demand). BENCH_DOWNLOAD_SCRIPT="" since there is no local dataset to fetch.
- benchmark.sh in all four becomes a 4-line shim. Old run.sh deleted.

…r-system layout

These four entries were added on main while this branch was in flight. Adapt them to the install/start/check/stop/load/query/data-size contract:
- presto, presto-partitioned: Hive connector + file metastore + local Parquet hardlinked into data/hits/.
- presto-datalake{,-partitioned}: same plus the AnonymousAWSCredentials shim (compiled in a throwaway trinodb/trino container, since the prestodb image ships only a JRE) so the hive-hadoop2 plugin can read the public bucket anonymously. BENCH_DOWNLOAD_SCRIPT="" since this is a schema-only load against S3.

Each benchmark.sh becomes a 4-line shim. Old run.sh deleted.

These two entries were added on main while this branch was in flight. Adapt to the install/start/check/stop/load/query/data-size contract:
- BENCH_DOWNLOAD_SCRIPT="": the vortex bench binary fetches Parquet and converts to .vortex on first invocation.
- BENCH_RESTARTABLE=no: embedded Rust CLI; per-query restart would dominate query time.
- query: stages stdin into a temp queries-file and passes -q 0, since the bench binary addresses queries by index rather than reading SQL on stdin.
- The single variant uses the `clickbench` binary (vortex 0.34.0); the partitioned variant uses `query_bench clickbench` (vortex 0.44.0).

Old run.sh deleted.

Quickwit was added on main while this branch was in flight. Adapt to the install/start/check/stop/load/query/data-size contract:
- BENCH_QUERIES_FILE="queries.json": Quickwit accepts Elasticsearch-format JSON queries via the /_elastic compat API, not SQL. queries.json holds one ES query per line; queries not expressible in Quickwit are encoded as the literal "null".
- BENCH_DOWNLOAD_SCRIPT="": the load script fetches hits.json.gz directly (there is no shared download-hits-json helper) and pipes it through `quickwit tool local-ingest`, since v0.9's sharded ingest-v2 endpoint caps single-node throughput at a few MB/s.
- BENCH_RESTARTABLE=yes: relies on the common driver's per-query restart to flush Quickwit's fast_field_cache and split_footer_cache (the result caches are already disabled in node-config.yaml).
- query: returns non-zero for "null" queries so the framework records null in the per-query timing array; otherwise reports .took (ms → seconds).

Old run.sh deleted.
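
The last bullet (reject "null" queries, otherwise convert the `took` milliseconds to seconds) can be sketched as below. The function names are hypothetical and extracting `took` from the JSON response is left out; this only shows the exit-code and unit-conversion behaviour described above.

```shell
# Convert a millisecond count into fractional seconds (millisecond precision).
ms_to_seconds() {
    awk -v ms="$1" 'BEGIN { printf "%.3f\n", ms / 1000 }'
}

# $1: the query line from queries.json, $2: the response's took value in ms.
# Prints the timing as the last stderr line, or fails for a "null" query.
report_took() {
    if [ "$1" = "null" ]; then
        return 1                  # driver records null for this query
    fi
    ms_to_seconds "$2" >&2
}
```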

The original used /tmp/gizmosql_server_$$.pid where $$ is the calling process's PID. That worked when benchmark.sh sourced util.sh and called start/stop in the same shell, but under the new per-system layout each of start, stop, load, and query sources util.sh in its own subshell — so stop_gizmosql couldn't find the PID file written by start_gizmosql. Use a fixed path under the system directory instead. Also expose wait_for_gizmosql so callers (like load) can wait for readiness without restarting.
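
A minimal sketch of the fixed-path approach, assuming a hypothetical PID_FILE location and using `sleep` as a stand-in for the server binary. With a `$$`-based name, start and stop would each compute a different file name because they run as different processes; a fixed path avoids that.

```shell
# PID_FILE is a hypothetical fixed location; it must not depend on $$.
PID_FILE=${PID_FILE:-"$PWD/server.pid"}

start_server() {
    # `sleep` stands in for the real server process.
    sleep 300 >/dev/null 2>&1 &
    echo $! > "$PID_FILE"
}

stop_server() {
    [ -f "$PID_FILE" ] || return 0            # nothing tracked: already stopped
    kill "$(cat "$PID_FILE")" 2>/dev/null || true
    rm -f "$PID_FILE"
}
```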

Conflict only in gizmosql/benchmark.sh — kept the thin shim. Main switched gizmosql to the official one-line installer (PR #879); fold that into gizmosql/install so we stop hand-detecting arch and downloading the zip. Other changes auto-merged: quickwit/index_config.yaml gained tag_fields on CounterID + record:basic on text fields (PR #886), and assorted result JSONs for ClickHouse Cloud / Citus / Cratedb / etc.

start/stop scripts may emit progress lines (clickhouse-server prints PID table tracking, sudo's chown invocation, postgres's startup messages, etc.). With BENCH_RESTARTABLE=yes those scripts run before every query, so their output interleaves with the parseable [t1,t2,t3] / Load time / Data size lines and breaks the cloud-init log POST to play.clickhouse.com. Redirect both stdout and stderr from ./start and ./stop to /dev/null at the three call sites in lib/benchmark-common.sh. The check loop is the authoritative readiness signal, so losing start's output costs nothing in steady state; for debugging, run ./start manually outside the driver.

The DuckDB installer at install.duckdb.org drops the binary into ~/.duckdb/cli/latest/duckdb and only suggests adding that directory to PATH. Previously each install attempted a per-user symlink into ~/.local/bin, which silently no-ops when that directory isn't on PATH (default for root in cloud-init). The result was ./check failing for 300s with no useful error. Symlink to /usr/local/bin/duckdb via sudo right after install instead; that's on PATH for every user, and the symlink is itself idempotent.

Ubuntu's docker.io ships the docker CLI without the v2 compose plugin, so the existing `command -v docker` short-circuit skipped installation on boxes that already had docker but no `docker compose`. ./start then ran `docker compose up -d`, which silently failed, and ./check timed out at 300s. Fall back to docker-compose-v2 for the Ubuntu package name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Throughput variant of ClickBench. N connections (default 10) hold open sessions and each picks a uniformly random query from the standard 43-query set; the run goes for a fixed wall-clock window (default 600s) after a warmup. Reports completed queries, QPS, latency p50/p95/p99, and per-query mean. Backends: ClickHouse over HTTP (stdlib http.client), StarRocks over the MySQL wire protocol (pymysql). Each backend uses its system's recommended path, so neither pays a wire-format penalty the other doesn't. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ned}/query: pass query via temp file

`python3 - <<'PY' ... PY` directs the heredoc into python3's stdin so the interpreter can read its program from there. Once the heredoc is fully consumed, sys.stdin (the same FD) is at EOF, so sys.stdin.read() inside the heredoc returned an empty string, and chdb / hyper / sail dutifully ran the empty query and reported ~0.000s for every try. Stage stdin into a temp file in bash before invoking the heredoc and pass the path as argv[1]; the python script reads the query from that file.

Also include result materialization in the timing window for chdb/query and chdb-parquet-partitioned/query (move `end = ...` past fetchall / str(res)); the timer was previously stopped before the result was realized, which would have under-counted query time even when the stdin bug wasn't masking it entirely.
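
The staging pattern can be sketched as follows. To keep the example free of a Python dependency, `bash -s` stands in for `python3 -` here (both read their program from stdin, so the heredoc takes over stdin in the same way); run_staged is a hypothetical name.

```shell
run_staged() {
    local tmp
    tmp=$(mktemp)
    cat > "$tmp"        # drain the caller's stdin (the query) into a file first
    # The heredoc now owns stdin; the query travels as a file path in $1.
    bash -s -- "$tmp" <<'SH'
query=$(cat "$1")
printf 'running: %s\n' "$query"
SH
    rm -f "$tmp"
}
```

Usage: `echo 'SELECT 1' | run_staged` prints `running: SELECT 1`. In the Python case the inner program would read the query from the path in `sys.argv[1]` instead of from `sys.stdin`.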

Right now ./check stderr is silently dropped while the loop retries for 300s, then we report "did not succeed within 300s" with no clue why. For deterministic failures (missing env var like YT_PROXY for chyt, an install step that didn't run, etc.) the user wastes 5 minutes and still has to dig through the per-system check script to find out what happened. Capture the last attempt's stderr and print it on timeout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
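
A rough sketch of such a loop, not the actual bench_check_loop: CHECK_CMD and wait_until_ready are hypothetical names, and the interval is assumed to be a whole number of seconds.

```shell
# CHECK_CMD is a hypothetical override; by default the per-system ./check.
CHECK_CMD=${CHECK_CMD:-./check}

wait_until_ready() {
    local timeout=${1:-300} interval=${2:-1} waited=0 last_err=""
    while [ "$waited" -lt "$timeout" ]; do
        # Keep only stderr of this attempt; stdout is discarded.
        if last_err=$("$CHECK_CMD" 2>&1 >/dev/null); then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "check did not succeed within ${timeout}s; last stderr:" >&2
    printf '%s\n' "$last_err" >&2
    return 1
}
```

Only the final attempt's stderr is printed, so a deterministic failure shows its cause once instead of once per retry.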

The upstream install path assumes RHEL/Rocky/Alma: yum, grubby, SELinux, the wheel group, /data0. On Ubuntu/Debian the prereqs phase silently half-completes (several `|| true` skips), the gpadmin user is sometimes not created, and db-install would later die at `yum install -y go`. Either way ./check times out at 300s with no diagnostic. Bail with a clear "needs yum" message before doing anything destructive, and call out the requirement in the README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cloud-init runs scripts as root with HOME unset. Tools that follow XDG-ish conventions then fall over: the GizmoSQL one-line installer exits at line 32 with "HOME: parameter not set" (it runs under `sh -u`), duckdb-vortex's `INSTALL vortex` writes to /.duckdb/extensions/... and later fails to find it ("Extension /.duckdb/extensions/v1.5.2/..."), and duckdb-datalake{,-partitioned} queries crash 43 times each with "Can't find the home directory at ''" while autoloading httpfs. Each affected install script tried to paper over this locally with `export HOME=${HOME:=~}`, but the export only lives for that script — the sibling load/query scripts the lib runs in fresh subprocesses still see HOME unset. Set it once here so every per-system step inherits it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
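
One way to set HOME once in the shared driver is sketched below; ensure_home is a hypothetical name and the real lib may pick the value differently. It relies on bash expanding `~` from the password database when HOME is unset.

```shell
ensure_home() {
    if [ -z "${HOME:-}" ]; then
        # With HOME unset, bash expands ~ to the current user's home
        # directory as recorded in the password database.
        unset HOME
        HOME=~
    fi
    # Exported once in the driver, so every child script inherits it.
    export HOME
}
```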

apt's monetdb5-sql post-install creates /var/lib/monetdb as the monetdb user's home dir, so the existing `if [ ! -d /var/lib/monetdb ]` guard skipped `monetdbd create` and left the dbfarm uninitialized. ./check then looped 300s on `mclient: cannot connect: control socket does not exist` and the run died. Probe the dbfarm marker file (.merovingian_properties) instead of the directory, and explicitly `monetdbd start` after create — both are idempotent, and a daemon that's already up just no-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
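
The marker-file probe can be sketched as below. MONETDBD and DBFARM are hypothetical overrides added so the logic can be exercised against a stub; the `monetdbd create` and `monetdbd start` subcommands are the ones named above.

```shell
DBFARM=${DBFARM:-/var/lib/monetdb}
MONETDBD=${MONETDBD:-monetdbd}

dbfarm_initialized() {
    # The directory alone proves nothing: the apt package creates it.
    [ -f "$DBFARM/.merovingian_properties" ]
}

ensure_dbfarm() {
    dbfarm_initialized || "$MONETDBD" create "$DBFARM" || return 1
    "$MONETDBD" start "$DBFARM"
}
```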

paradedb/paradedb:0.10.0 (the prior pin) was rotated out of Docker Hub — docker pull returned "manifest not found" and ./check timed out. The oldest tags still hosted are 0.15.x, so move both directories onto a real Postgres-version-specific tag (latest-pg17) that paradedb still maintains. This unblocks the image pull. NOTE: paradedb dropped its pg_lakehouse / parquet_fdw extension after 0.10.x (the parquet_fdw_handler() function no longer exists), so create.sql still needs to be reworked away from the foreign-table approach for queries to succeed end-to-end. That's a separate change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The prior URL (qa-build.oss-cn-beijing.aliyuncs.com selectdb-doris-2.1.7-rc01) returned 404 — SelectDB stopped publishing free standalone tarballs once the product moved fully to a managed-cloud offering. VeloDB (the company that now stewards SelectDB) hosts the official Apache Doris release binaries instead, which are functionally what SelectDB ships today. Pin to the current stable (4.0.5) and use the symmetric $dir_name path layout that doris/install already uses, instead of the hardcoded selectdb-doris-2.1.7 segment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Stock SQLite (3.46 in this run, but the same on every Ubuntu version ClickBench targets) does not ship REGEXP_REPLACE; it requires a build-time --enable-regexp flag or a loaded extension. So Q29 has been returning [null,null,null] on every SQLite run with "Parse error: no such function: REGEXP_REPLACE" in the log.

Same shape of fix as the recent mariadb-columnstore rewrite: pull the host out with INSTR + SUBSTR + IIF instead of a regex. SQLite has no SUBSTRING_INDEX so each "split" needs its own subquery layer, but the algebra is identical:

level 1: take everything after "://" (or whole string if no protocol)
level 2: take everything before first "/" (or whole string if no path)
level 3: strip leading "www."

Verified on a synthetic table with the typical referer shapes: the rewrite produces the same hostname keys as REGEXP_REPLACE for every URL with a protocol-and-path. URLs without a trailing "/" or without a protocol get bucketed by host instead of pass-through, which is consistent with mariadb-columnstore and arguably more correct for the query's intent (group by host).

Other systems that fail Q29 for the same root cause (REGEXP_REPLACE unsupported) but couldn't be tested locally:
* cratedb: uses '$1' (PostgreSQL-style); current result also null, error not yet inspected, may need rewrite
* elasticsearch: ES SQL surface is intentionally limited; needs an ES-specific rewrite (no SUBSTRING_INDEX, no IIF)

Historical-only systems left as-is (results from 2020–2022, not in active CI): aurora-mysql, druid, heavyai, infobright, monetdb, pinot, singlestore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
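
To make the three levels concrete, here is the same algebra restated with bash parameter expansion. This is only an illustration of the string logic, not the SQL from the commit; host_of is a hypothetical name.

```shell
host_of() {
    local s=$1
    s=${s#*://}      # level 1: drop everything up to "://", if present
    s=${s%%/*}       # level 2: keep everything before the first "/"
    s=${s#www.}      # level 3: strip a leading "www."
    printf '%s\n' "$s"
}
```

As in the SQL rewrite, an input without a protocol or without a path falls through the corresponding level unchanged, so it is still bucketed by host.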

Both installs hard-coded x86_64 release URLs, so every c8g.* run downloaded an amd64 binary that immediately died on aarch64:
* databend: exec'd briefly, then ./check looped 600 s on "Failed to connect to localhost:8124"; produced 0 results.
* octosql: failed at first invocation with "cannot execute binary file: Exec format error" before any query ran.

Both projects ship matching arm64 nightlies/releases. Resolve `uname -m` in install and pick the corresponding tarball. Also point databend at the renamed databendlabs/databend org (datafuselabs/* still 301s but no reason to keep the old path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
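
Resolving the architecture in an install script might look like the sketch below. release_arch is a hypothetical helper, and the labels it prints (amd64, arm64) are placeholders: each project names its release artifacts differently, so a real script would map to that project's own suffixes.

```shell
release_arch() {
    # $1 is optional and only exists so the mapping can be tested;
    # by default the machine type comes from `uname -m`.
    local machine=${1:-$(uname -m)}
    case "$machine" in
        x86_64|amd64)  echo amd64 ;;
        aarch64|arm64) echo arm64 ;;
        *)
            echo "unsupported architecture: $machine" >&2
            return 1
            ;;
    esac
}
```

Failing early for an unknown machine type keeps the error at install time rather than at the first query.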

The recent c8g.* sweep surfaced systems that download an x86_64 binary or pull an amd64-only Docker image and then fail late with opaque errors after a 14 GB hits.parquet download or a 600 s ./check timeout. Where upstream simply doesn't publish an arm64 artifact, abort install with a clear message instead of letting the run burn an EC2 instance to discover the mismatch.

Affected:
* hyper, hyper-parquet: tableauhyperapi has Linux x86_64 manylinux2014 wheels only (no aarch64).
* doris, doris-parquet: Apache Doris release tarballs are apache-doris-*-bin-x64.tar.gz; no arm64 mirror exists.
* citus: citusdata/citus is amd64-only on every tag (Docker Hub manifest), runs under QEMU on arm64 and never starts in time.
* pgpro_tam: innerlife/pgpro_tam is amd64-only.

opteryx is the one fixable case in this batch: 0.26.1 only ships x86_64 wheels and the sdist build was breaking on arm64 ("third_party/abseil/containers.pyx doesn't match any files"); 0.26.8 publishes manylinux2014_aarch64 wheels for cp310-cp313, so bumping the pinned version unblocks c8g.* without affecting x86_64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pgducklake container ships with duckdb.memory_limit defaulted to 4 GB, so the CTAS over the 14 GB hits.parquet failed mid-load with Out of Memory Error: failed to allocate data of size 256 KiB (4.0 GiB/4.0 GiB used) even on c8g.metal-48xl (384 GB RAM). On the c7a.metal-48xl run the same OOM surfaced one query later as "Current transaction is aborted (please ROLLBACK)" because the failed CTAS poisoned the session. Compute 80 % of host MemTotal and SET the memory_limit on the same session that runs create.sql (psql with -c followed by -f resets the session between commands, so pipe both through stdin instead). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
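
The "80 % of MemTotal, same session" idea can be sketched as follows. The function names are hypothetical, the MiB value syntax for the setting is an assumption, and the psql invocation itself is omitted; the point is that the SET statement and create.sql are emitted as one stream that can be piped into a single psql process.

```shell
# Print 80 % of MemTotal in MiB. /proc/meminfo reports MemTotal in kB
# (1024-byte units). $1 optionally names an alternative meminfo file.
memory_limit_mib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 * 8 / 10 / 1024 }' "${1:-/proc/meminfo}"
}

# Emit the SET statement followed by the contents of create.sql ($1), so both
# reach the server over one session when piped into psql's stdin.
# The 'NMiB' value format is an assumption, not taken from the commit.
session_sql() {
    printf "SET duckdb.memory_limit = '%sMiB';\n" "$(memory_limit_mib "$2")"
    cat "$1"
}
```

Usage would be along the lines of `session_sql create.sql | psql ...`, with the connection arguments taken from the system's existing load script.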

ES 9.4.0 was released on 2026-05-05 and apt picked it up on subsequent runs. The new bootstrap is stricter: cluster.initial_master_nodes must match an existing node.name, and the original install set the former to ["clickbench"] but never set the latter. 9.0–9.3 had been silently tolerating the mismatch; 9.4 leaves the cluster permanently red, every bulk write returns HTTP 503, and on smaller hosts ES eventually exits and connections start refusing. May 5 c6a.metal/c8g.4xlarge/t3a.small runs were green; May 9–10 ones (the same machines) all came back with 0/43 results and 100s+ "Sent batch N - Warning: HTTP 503" lines. Switch to `discovery.type: single-node`, the documented way to skip cluster bootstrap entirely for a single-process install. Drops both the matching-name requirement and any future surprise from 9.x bootstrap tightening. Also harden load: use `curl -sSf` for the index PUT so a non-2xx (cluster red, mapping error, etc.) aborts the run instead of spamming thousands of 503 lines against a missing /hits index. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For dataframe systems with no SQL backend, the previous design pinned each ClickBench question to a hardcoded (sql_string, lambda) tuple in server.py. That makes the SQL string a translation key, not the query the system actually runs, and any drift between queries.sql and the server's whitelist silently 404s every query.

Replace it with the generic shape duckdb-dataframe already uses:
* queries.py: one Python expression per line, in the same order as queries.sql. This is the actual workload.
* server.py /query: take a Python expression in the request body, compile + eval against the loaded DataFrame (`hits` and the engine module, pd or pl, in scope). Drops the QUERIES list, the lambda factory, and the QUERY_INDEX lookup.
* benchmark.sh: BENCH_QUERIES_FILE=queries.py so the lib drives the Python file instead of queries.sql.

queries.sql stays unchanged for cross-system comparison; the per-line mapping queries.sql ↔ queries.py is what makes results comparable.

Also:
* Q43 in pandas/queries.py: `freq='T'` was deprecated in pandas 2.x and removed in 3.x; switched to `freq='min'`.
* For chdb-dataframe, daft-parquet, daft-parquet-partitioned (which pass SQL straight through to chdb / daft.sql), drop the same dead whitelist + `_make_runner` factory; /query just executes the body.

Verified locally on a 100k-row slice of hits.parquet: all 43 pandas queries return non-null timings via /query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d}/check: probe HTTP, not docker logs

Old check did `docker logs presto | grep "SERVER STARTED"`, but `docker stop` does not clear container logs: the SERVER STARTED line from the first start stayed visible forever, so check kept reporting "up" even while the container was stopped.

With BENCH_RESTARTABLE=yes the loop is

./stop (docker stop presto)
bench_wait_stopped (poll ./check until it FAILS)
bench_flush_caches
./start (docker start presto, coordinator re-initializing)
bench_check_loop (poll ./check until it SUCCEEDS)

With the broken check, wait_stopped never observed a failure and timed out at 60 s; the run then drop_caches'd while the container was actually stopped, then ./start kicked off coordinator init, and bench_check_loop saw the still-cached "SERVER STARTED" string immediately, declaring the server ready. Queries hit a half-init coordinator and every one failed with

Error running command: java.io.IOException: unexpected end of stream on http://localhost:8081/...

Three retries of that produce [null,null,null] in the result.

Replace with the HTTP probe trino/check already uses: hit /v1/info and require "starting":false.

Container stopped → connection refused → check fails (correct).
Coordinator mid-init → "starting":true → check fails (correct).
Coordinator ready → check succeeds.

Same fix copy-pasted to the three sibling presto-* variants since they all run the same prestodb/presto coordinator on host port 8081.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
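
A minimal sketch of such an HTTP probe, assuming the response contains the literal `"starting":false` with no whitespace, as quoted above. The function names and the PRESTO_PORT override are hypothetical.

```shell
coordinator_ready() {
    # Reads the /v1/info response body on stdin.
    grep -q '"starting":false'
}

check_presto() {
    # A stopped container gives connection refused: curl prints nothing,
    # grep sees empty input, and the check fails as intended.
    curl -sf "http://localhost:${PRESTO_PORT:-8081}/v1/info" | coordinator_ready
}
```

Unlike the log grep, this probe has no memory of earlier starts, so bench_wait_stopped can observe the failure it is waiting for.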

CrateDB is built on Elasticsearch internals: the PostgreSQL wire protocol accepts SELECT 1 a few hundred ms after start, but the underlying shards take longer to recover from disk after a restart. The old check just SELECT 1'd, so bench_check_loop declared the system ready while shards were still RECOVERING. The first few cold queries then failed with

    ERROR: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]

or

    ERROR: [crate.hits] shard 3 is not available until recovery caught up.

Visible in c6a.metal/2026-05-10: Q1..Q4 came back mostly NULL, then a few partial triplets like [null, 4.964, 0.559], before the run settled.

Require every shard on the node to be in a queryable state (STARTED or POST_RECOVERY) before reporting ready. Pre-load there are only system shards, which are STARTED in ms; post-load the hits shards must finish disk recovery before the cold timer starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
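The readiness rule stated here reduces to a check over shard states. The sketch below assumes the states have already been fetched, for example from a query against CrateDB's `sys.shards` table (an assumption; the commit message does not show the query the check actually uses). `blocking_shards` and `shards_ready` are hypothetical names.

```python
# States in which a shard can serve queries, per the commit message.
QUERYABLE_STATES = frozenset({"STARTED", "POST_RECOVERY"})


def blocking_shards(states):
    """Return the states that still prevent the node from being ready."""
    return [s for s in states if s not in QUERYABLE_STATES]


def shards_ready(states) -> bool:
    """Ready only when no shard is outside the queryable states."""
    return not blocking_shards(states)


# Right after a restart: one shard is still recovering from disk.
print(shards_ready(["STARTED", "RECOVERING", "STARTED"]))
# Once recovery has caught up.
print(shards_ready(["STARTED", "POST_RECOVERY", "STARTED"]))
```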

AWS returns InsufficientInstanceCapacity when the requested AZ has no spare instances of a given type. This happens often for the bigger Graviton/AMD metal sizes during peak hours, and a single failure was enough to abandon the whole benchmark queue.

Wrap the run-instances call in a loop that retries every 60 s on that specific error and keeps going indefinitely. Other errors (bad AMI, missing IAM perms, malformed user-data, ...) still propagate immediately so a fundamentally broken invocation doesn't loop forever.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Extend the run-instances retry loop to cover quota errors alongside InsufficientInstanceCapacity:

    VcpuLimitExceeded             on-demand vCPU service quota
    InstanceLimitExceeded         older AWS error code for the same
    MaxSpotInstanceCountExceeded  spot count quota

All four are transient: AZ capacity comes back as other instances drain, and quotas free as other benchmarks in the queue finish. 60 s polling is right for both. Real config errors (bad AMI, missing IAM perms, malformed user-data, etc.) still propagate immediately.

The retry-on-match is now driven by a single regex and the loop logs which specific code triggered the retry, so an operator watching the run can tell capacity from quota without diving into the AWS response body.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
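A regex-driven retry loop of this kind might look like the sketch below. The actual loop wraps the run-instances call in shell; here `launch` is a hypothetical callable standing in for that call, `RuntimeError` stands in for a failed invocation, and `launch_with_retry` is an illustrative name. The four error codes and the 60 s delay come from the commit messages.

```python
import re
import time

# One pattern for all four transient capacity/quota codes.
RETRYABLE = re.compile(
    r"InsufficientInstanceCapacity|VcpuLimitExceeded|"
    r"InstanceLimitExceeded|MaxSpotInstanceCountExceeded"
)


def launch_with_retry(launch, delay=60.0, sleep=time.sleep, log=print):
    """Call launch() until it succeeds, retrying only on capacity/quota codes."""
    while True:
        try:
            return launch()
        except RuntimeError as err:
            match = RETRYABLE.search(str(err))
            if match is None:
                # Bad AMI, missing IAM perms, etc.: fail immediately.
                raise
            # Log the specific code so capacity and quota are distinguishable.
            log(f"{match.group(0)}: retrying in {delay:.0f} s")
            sleep(delay)
```

Injecting `sleep` and `log` keeps the loop testable without waiting a real minute between attempts.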

The gizmosql installer at https://install.gizmosql.com/install.sh runs under `sh -u` and its line 32 is

    PREFIX="${HOME}/.local/bin"

with no fallback. cloud-init.sh.in does set HOME=/root, but a checkout that predates that fix (or any operator running ./benchmark.sh by hand without the env) sees the install die in 21 s with

    sh: 32: HOME: parameter not set
    Disk usage after: 2317328384
    System: gizmosql
    Total time: 21

The same shape appeared on the May 9 c6a.4xlarge run.

Pin HOME to /root in the install if it isn't already set, and pass it into the piped sh explicitly so the gizmosql installer is insulated from whatever the parent env looked like. Also exports HOME for the later `sudo install $HOME/.local/bin/...` step that uses the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 2026-05-10 sweep on c6a.metal, c6a.2xlarge, c7a.metal-48xl, and c8g.metal-48xl all hit the 20000s operator timeout. Load completed (525–734s) and a handful of queries ran before the budget expired, but the bench was killed before reaching ./data-size, so the sink.parser MV's `length(runtimes) = 43` precondition failed and no JSON ever made it to the website.

Capture what we have: real timings for the queries that finished, [null,null,null] for the rest, data_size:null since the load was truncated. Same shape as the c6a.4xlarge / c8g.4xlarge JSONs that did complete on the same date.

    c7a.metal-48xl   1/43 measured, load 525s
    c8g.metal-48xl   1/43 measured, load 525s
    c6a.metal        3/43 measured, load 544s
    c6a.2xlarge     21/43 measured, load 734s

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
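Padding a truncated run out to 43 triplets is simple to sketch. The helper name `partial_result` is hypothetical, and the `result` key for the timing array is an assumption about the result JSON layout; `load_time` and `data_size` are the field names used elsewhere in this PR.

```python
import json

TOTAL_QUERIES = 43  # matches the `length(runtimes) = 43` precondition


def partial_result(measured, load_time, total=TOTAL_QUERIES):
    """Build a result document from the triplets that finished before the kill."""
    missing = total - len(measured)
    if missing < 0:
        raise ValueError("more measured queries than the benchmark defines")
    runtimes = [list(t) for t in measured]
    # Queries that never ran are recorded as three nulls each.
    runtimes += [[None, None, None] for _ in range(missing)]
    # data_size stays null because the run never reached ./data-size.
    return {"load_time": load_time, "data_size": None, "result": runtimes}


# Illustrative timings only, not values from any real run.
doc = partial_result([[1.5, 0.4, 0.4], [2.0, 0.9, 0.8], [3.1, 1.2, 1.1]], load_time=544)
print(json.dumps(doc)[:80])
```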

…ckHouse/ClickBench into refactor/per-system-script-interface

A JSON file of the form {"error": "..."} marks a failed run for that system/machine; such entries are now excluded from data.generated.js so the system is omitted from the report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
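The exclusion rule can be sketched as a filter over parsed result documents. How the real generator detects a marker is not shown in the commit message; the sketch assumes that a top-level `error` key is the signal, and `is_error_marker` and `reportable` are hypothetical names.

```python
def is_error_marker(doc) -> bool:
    """True for documents of the form {"error": "..."} (assumed detection rule)."""
    return isinstance(doc, dict) and "error" in doc


def reportable(docs):
    """Keep only documents that should appear in the generated report data."""
    return [d for d in docs if not is_error_marker(d)]


docs = [
    {"system": "example-a", "result": [[0.5, 0.2, 0.2]]},
    {"error": "run timed out"},
]
print(len(reportable(docs)))
```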

- Combine the small filters (Open source, Hardware, Tuned) into a single horizontal row at the top of the selectors table.
- Top-row filters now hide options in System / Type / Machine / Cluster size that have no entries satisfying the criteria.
- Hovering a system in the System list, a summary row, or a details column header highlights that system's tags in the Type list with a green background (light/dark theme aware).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The stored-theme bootstrap was calling setTheme(), which calls render(), which now references `let systems` via applyTopRowFilters(). That throws a ReferenceError because the binding is still in its temporal dead zone at that point. Set the data-theme attribute directly at bootstrap; the final render() at the end of the script handles the initial render.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The lukewarm-cold-run tag predates the May 7 refactor (1f352ad) when the bench loop didn't reliably restart between queries, so each cold run could re-use a warm process. After the refactor BENCH_RESTARTABLE governs that explicitly: systems either fully restart (with the cache drop landing on an actually-cold process tree thanks to the stop→wait_stopped→drop_caches→start ordering) or sit out at BENCH_RESTARTABLE=no for in-process tools where restart would dominate the timing. Either way the "lukewarm" qualifier no longer applies to results produced under the new driver.

Strip the tag from:

* every results/2026050[7-9]/*.json and results/20260510/*.json that carried it: 295 files across 29 systems (citus, clickhouse, clickhouse-web, databend, doris, doris-parquet, greenplum, mariadb-columnstore, pg_clickhouse, pg_duckdb-parquet, pg_mooncake, pgpro_tam, polars, polars-dataframe, presto + 3 variants, questdb, siglens, starrocks, timescaledb, trino + 3 variants, ursa, velodb, victorialogs)
* template.json of those same 29 systems, so future runs don't re-introduce it.

Older results (pre-refactor) keep the tag: they were produced under the historical driver and the attribute is genuine for them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
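Stripping one tag from a parsed result document could be sketched as below. This is not the script used for the 295 files; `strip_tag` is a hypothetical helper and the `tags` key is an assumption about the result and template JSON layout.

```python
def strip_tag(doc: dict, tag: str = "lukewarm-cold-run") -> bool:
    """Remove `tag` from doc["tags"] in place; return True if anything changed."""
    tags = doc.get("tags")
    if not isinstance(tags, list) or tag not in tags:
        return False
    # Preserve the order of the remaining tags.
    doc["tags"] = [t for t in tags if t != tag]
    return True


doc = {"system": "example", "tags": ["column-oriented", "lukewarm-cold-run"]}
changed = strip_tag(doc)
print(changed, doc["tags"])
```

Returning whether the document changed lets a caller rewrite only the files that actually carried the tag.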

Firebolt's "wait until ready" loop did

    curl -sS http://localhost:3473/ --data-binary 'SELECT ...' \
        > /dev/null && break

which exited on the FIRST HTTP response, including the HTTP 200 that carries

    {"errors":[{"description":"Cluster not yet healthy: Node startup is not yet finished"}],
     "statistics":{"elapsed":0.0}}

while the container is still warming up. So the bench would proceed straight to CREATE TABLE, get the same Cluster-not-healthy error, run all 43 queries (each replying with "elapsed":0.0), and emit a log that looked fine: 43/43 timing triplets, load_time present, data_size present.

The sink.parser MV's "good" predicate then rejected the row for

    arrayExists(x -> arrayExists(y -> toFloat64OrZero(y) > 0.1, x), runtimes)

Every timing is 0.0, so no element exceeds 0.1, the row never lands in sink.results, and the website has had no new Firebolt result since 2026-02-21 even though the bench has been "running" successfully.

Pipe the response into grep "Firebolt is ready" and only break when the sentinel actually appears in the body. Same fix for all three variants (firebolt, firebolt-parquet, firebolt-parquet-partitioned).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
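Both halves of this failure can be restated in a few lines: the sentinel test the wait loop now applies, and the "good" predicate that rejected the all-zero rows. The Python below is a sketch that mirrors the quoted ClickHouse expression and the grep; the function names are hypothetical, and `to_float_or_zero` only approximates `toFloat64OrZero`.

```python
def to_float_or_zero(value) -> float:
    """Approximate toFloat64OrZero: unparsable input becomes 0.0."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def has_real_timing(runtimes, threshold: float = 0.1) -> bool:
    """Mirror of the predicate: some timing in some triplet exceeds the threshold."""
    return any(any(to_float_or_zero(y) > threshold for y in x) for x in runtimes)


def firebolt_ready(body: str, sentinel: str = "Firebolt is ready") -> bool:
    """Ready only when the sentinel text appears in the response body."""
    return sentinel in body


# The warm-up response is a successful HTTP reply, yet it is not "ready".
warming_up = (
    '{"errors":[{"description":"Cluster not yet healthy: '
    'Node startup is not yet finished"}],"statistics":{"elapsed":0.0}}'
)
print(firebolt_ready(warming_up))

# 43 triplets of 0.0 look complete but fail the predicate.
print(has_real_timing([["0.0", "0.0", "0.0"]] * 43))
```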

Add the "stateless" tag to all result files and templates of systems that do not maintain persistent state of their own: polars, sail*, spark*, *-parquet, *-datalake. With the recent load-metric filter change, these systems are correctly omitted from the Load Time view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- When the Load Time metric is selected, exclude entries tagged "stateless"; they have no meaningful load time.
- Hovering a tag in the Type list highlights every summary row and details column header for systems carrying that tag, mirroring the existing system-hover behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pick up the latest result entries that landed upstream during the stateless-tag rebase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

alexey-milovidov and others added 3 commits May 7, 2026 12:14

alexey-milovidov commented May 7, 2026

Comment thread clickhouse-datalake-partitioned/load Outdated

alexey-milovidov commented May 7, 2026

Comment thread clickhouse/query Outdated

alexey-milovidov commented May 7, 2026

Comment thread clickhouse/start Outdated

alexey-milovidov and others added 3 commits May 7, 2026 12:29

clickhouse/query: drop the cat shim — clickhouse-client reads stdin n…

b5d60e8

…atively Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add missing change

94794b5

prmoore77 mentioned this pull request May 7, 2026

GizmoSQL: switch from gizmosqlline (Java) to gizmosql_client #863

Merged

2 tasks

alexey-milovidov and others added 18 commits May 9, 2026 01:22

alexey-milovidov and others added 15 commits May 10, 2026 13:54

Merge branch 'main' into refactor/per-system-script-interface

ba4a5b6

More results

c18a16e

alexey-milovidov force-pushed the refactor/per-system-script-interface branch from 65ba6ac to 84a86f4 Compare May 10, 2026 17:40

alexey-milovidov added 3 commits May 10, 2026 19:45

Add error markers

80d85bd

Merge branch 'refactor/per-system-script-interface' of github.com:Cli…

d6342fd

…ckHouse/ClickBench into refactor/per-system-script-interface

Merge branch 'refactor/per-system-script-interface' of github.com:Cli…

51cb9c6

…ckHouse/ClickBench into refactor/per-system-script-interface

alexey-milovidov force-pushed the refactor/per-system-script-interface branch from d6342fd to 51cb9c6 Compare May 10, 2026 17:46

alexey-milovidov and others added 10 commits May 10, 2026 19:49

Merge branch 'refactor/per-system-script-interface' of github.com:Cli…

e0e741e

…ckHouse/ClickBench into refactor/per-system-script-interface

More results

9c7e07f

data.generated.js: regenerate after upstream merge

caa8dc6

alexey-milovidov wants to merge 133 commits into main from refactor/per-system-script-interface
