Apache Wayang™

The first open-source cross-platform data processing system

Write your data pipeline once. Run it anywhere.

You write your pipeline against a single API, then decide how it runs. Point it at one engine and it runs there. Or hand the choice to Wayang's cost-based optimizer and let it pick the best platform for each step across your laptop, Apache Spark, Apache Flink, or a database, even splitting a single job across several. Either way, when your data outgrows one machine you don't rewrite anything; you just make another engine available.

A single pipeline, written once, feeds the Wayang optimizer, which routes each step to the best available engine — Local, Spark, Flink, Postgres, and others.

How it works

Most data processing systems are designed around a single execution engine. That keeps things simple, but your pipeline ends up tied to that engine's API. Combining engines, or moving to another one, then typically means rewriting the pipeline and writing glue code between systems, which is costly and time-consuming.

Wayang sits one level up. You write a pipeline against Wayang's API and register the engines you have. Then it's your call. Want control? Register one engine and it runs there. Want it handled? Register several and let the cost-based optimizer pick the best one for each step, even splitting a single job across engines.

Supported platforms today

Wayang's APIs

  • Java (Scala-like fluent builder)
  • Scala
  • SQL
  • Java native (low-level; we recommend the fluent Scala-like builder instead)

The plugin architecture makes adding new operators and platforms straightforward without touching internals — see Adding operators.

Quickstart

We'll run a word count locally first — no cluster, nothing to install on a server — then make Spark available with a one-line change. The pipeline itself never changes; only the set of engines you register does.

1. Run locally

import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.java.Java;
import java.util.Arrays;

public class WordCount {
    public static void main(String[] args) {
        // Register ONLY the local Java engine → runs on your machine, no cluster needed.
        WayangContext wayang = new WayangContext(new Configuration())
                .withPlugin(Java.basicPlugin());

        new JavaPlanBuilder(wayang)
                .withJobName("WordCount")
                .withUdfJarOf(WordCount.class)
                .readTextFile("file:///path/to/input.txt")
                .flatMap(line -> Arrays.asList(line.split("\\W+")))
                .filter(word -> !word.isEmpty())
                .map(word -> new Tuple2<>(word.toLowerCase(), 1))
                .reduceByKey(Tuple2::getField0,
                             (t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1()))
                .writeTextFile("file:///path/to/output.txt", t -> t.getField0() + ": " + t.getField1());
    }
}

It executes locally. Good for development, tests, and small data.
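
To see what the pipeline computes before installing anything, here is a rough plain-Java equivalent of the same steps (split, drop empties, lowercase, sum per word) using only java.util.stream. It is not Wayang code: the class name WordCountSemantics and the in-memory input are made up for illustration, and the result is sorted only to make the printed output deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative only: mirrors the logic of the Wayang pipeline with JDK streams.
class WordCountSemantics {

    /** Counts words the way the pipeline above does. */
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\W+"))) // flatMap step
                .filter(word -> !word.isEmpty())                     // filter step (leading punctuation yields "")
                .collect(Collectors.toMap(
                        String::toLowerCase,                         // map step: the key
                        word -> 1,                                   // map step: the initial count
                        Integer::sum,                                // reduceByKey step
                        TreeMap::new));                              // sorted map, for stable output only
    }

    public static void main(String[] args) {
        // Prints:
        // hello: 2
        // wayang: 1
        // world: 1
        count(List.of("Hello, world!", "hello Wayang"))
                .forEach((word, n) -> System.out.println(word + ": " + n));
    }
}
```

The filter step matters because split("\\W+") returns an empty first token whenever a line starts with punctuation or whitespace.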

2. Run it on Spark

Now run the exact same pipeline on Spark instead of locally. You don't touch the pipeline — you change which platform you register: comment out Java and register Spark.

import org.apache.wayang.spark.Spark;               // swap the import

// Same pipeline as before — only the registered platform changed.
WayangContext wayang = new WayangContext(new Configuration())
        // .withPlugin(Java.basicPlugin())           // comment out the local engine
        .withPlugin(Spark.basicPlugin());            // register Spark instead

Run it again. The same pipeline now executes on Spark. You changed where it runs without changing what it does. Switch to Flink or any other supported platform the same way: swap the import and the registered plugin.

Why register only Spark here? Wayang's real power is registering several platforms and letting the optimizer pick. But on small test data the optimizer will almost always pick the local engine (Spark's startup overhead isn't worth it for a tiny file), so you'd never actually see Spark run. Registering Spark alone forces the issue so you can confirm it works. Step 3 shows the production pattern.

3. Register both and let the optimizer choose

This is the point of Wayang. In practice you don't pick a platform at all: you register every engine you have and let the optimizer choose the best one for each step.

// Register BOTH platforms — Wayang's optimizer decides which to use per step.
WayangContext wayang = new WayangContext(new Configuration())
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin());

Now Wayang owns the placement decision. For each operator it estimates the cost on every registered platform and picks the cheapest, keeping a small job entirely local, pushing a large one onto Spark, or mixing both within the same job as the data and query demand. On a tiny input you'll see it keep everything local (that's the optimizer working correctly, not ignoring Spark); cross-platform splits show up once the data is big enough to justify them.
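
The "estimate every platform, pick the cheapest" idea can be illustrated with a toy model. The sketch below is not Wayang's optimizer or its cost model: PlacementSketch, Platform, and every number in it are hypothetical. It assumes a linear cost (fixed startup plus a per-record cost) and ignores factors a real optimizer would weigh, such as moving data between engines.

```java
import java.util.Comparator;
import java.util.List;

// Toy illustration of cost-based placement; all names and numbers are made up.
class PlacementSketch {

    /** Hypothetical linear cost model: a fixed startup cost plus a cost per input record. */
    record Platform(String name, double startupCost, double costPerRecord) {
        double estimate(long records) {
            return startupCost + costPerRecord * records;
        }
    }

    /** Returns the registered platform with the lowest estimated cost for this input size. */
    static Platform cheapest(List<Platform> registered, long records) {
        return registered.stream()
                .min(Comparator.comparingDouble((Platform p) -> p.estimate(records)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // A local engine with no startup cost vs. a cluster engine that is
        // expensive to start but cheap per record.
        List<Platform> registered = List.of(
                new Platform("java", 0.0, 1.0),
                new Platform("spark", 5_000_000.0, 0.01));

        // Prints:
        // 1000 records -> java
        // 100000000 records -> spark
        for (long records : new long[] {1_000L, 100_000_000L}) {
            System.out.println(records + " records -> " + cheapest(registered, records).name());
        }
    }
}
```

With these made-up numbers the crossover sits at roughly five million records, which is why a tiny test file stays local even when Spark is registered.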

Install

Replace WAYANG_VERSION with the latest Maven Central release.

From Maven Central

<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-core</artifactId>
  <version>WAYANG_VERSION</version>
</dependency>
<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-basic</artifactId>
  <version>WAYANG_VERSION</version>
</dependency>
<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-api-scala-java</artifactId>
  <version>WAYANG_VERSION</version>
</dependency>
<!-- add one artifact per engine you want available -->
<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-java</artifactId>
  <version>WAYANG_VERSION</version>
</dependency>
<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-spark</artifactId>
  <version>WAYANG_VERSION</version>
</dependency>

The available modules:

  • wayang-core — core data structures and the optimizer (required)
  • wayang-basic — common operators and data types (recommended)
  • wayang-api-scala-java — fluent Scala/Java API for building plans (recommended)
  • wayang-java, wayang-spark, wayang-flink, wayang-postgres, wayang-sqlite3, wayang-graphchi, wayang-tensorflow, wayang-kafka — per-platform adapters; include one per engine you want available
  • wayang-profiler — learns operator and UDF cost functions from historical executions

For snapshot builds, add Apache's snapshot repository:

<repositories>
  <repository>
    <id>apache-snapshots</id>
    <name>Apache Foundation Snapshot Repository</name>
    <url>https://repository.apache.org/content/repositories/snapshots</url>
  </repository>
</repositories>

Build from source

git clone https://github.com/apache/wayang.git
cd wayang
./mvnw clean install -DskipTests

The current snapshot version lives in pom.xml.

Runtime requirements

  • Java 17 — set JAVA_HOME to your Java 17 installation.
  • Apache Spark 3.4.4 with Scala 2.12 — set SPARK_HOME.
  • Apache Hadoop 3+ — set HADOOP_HOME.
  • Maven for building from source.

Important

Java 17 needs extra JVM flags. Running Wayang on Java 17 (especially with Spark) requires opening some internal Java modules, or you'll hit IllegalAccessError. Edit your wayang-submit script (under wayang-assembly/target/wayang-WAYANG_VERSION/bin/wayang-submit) so the runner invocation passes:

--add-exports=java.base/sun.nio.ch=ALL-UNNAMED
--add-opens=java.base/java.nio=ALL-UNNAMED
--add-opens=java.base/java.lang=ALL-UNNAMED
--add-opens=java.base/java.util=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED
--add-opens=java.base/java.net=ALL-UNNAMED
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED

On Windows, also set HADOOP_HOME to a directory containing winutils.exe (unofficial source).
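
To check whether the flags above actually reached the JVM, a small probe like the following may help. It is a sketch, not part of Wayang: the class name ModuleFlagCheck is made up, and it assumes it is run from the classpath (the unnamed module) with the same JVM options as your job. It relies only on the JDK's java.lang.Module API (isExported and isOpen).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative probe: reports which java.base packages are exported/opened to this code.
class ModuleFlagCheck {

    static Map<String, Boolean> check() {
        Module base = Object.class.getModule();           // the java.base module
        Module self = ModuleFlagCheck.class.getModule();  // unnamed module when run from the classpath

        Map<String, Boolean> result = new LinkedHashMap<>();
        // Corresponds to --add-exports=java.base/sun.nio.ch=ALL-UNNAMED
        result.put("sun.nio.ch (exported)", base.isExported("sun.nio.ch", self));
        // Corresponds to the --add-opens=java.base/<package>=ALL-UNNAMED flags
        String[] opened = {
            "java.nio", "java.lang", "java.util", "java.io",
            "java.lang.reflect", "java.util.concurrent", "java.net", "java.lang.invoke"
        };
        for (String pkg : opened) {
            result.put(pkg + " (opened)", base.isOpen(pkg, self));
        }
        return result;
    }

    public static void main(String[] args) {
        // Each line reads "<package> (<kind>): true" once the matching flag is active.
        check().forEach((pkg, ok) -> System.out.println(pkg + ": " + ok));
    }
}
```

Run it with and without the flags, for example `java --add-opens=java.base/java.nio=ALL-UNNAMED ModuleFlagCheck.java`; any entry that still prints false points at a flag that was not passed through.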

Validate the install

After building, unpack the assembly and put Wayang on your PATH:

tar -xvf wayang-WAYANG_VERSION.tar.gz
cd wayang-WAYANG_VERSION

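# Linux
echo "export WAYANG_HOME=$(pwd)" >> ~/.bashrc
# Single quotes keep ${PATH} and ${WAYANG_HOME} literal, so they expand when ~/.bashrc is sourced
echo 'export PATH=${PATH}:${WAYANG_HOME}/bin' >> ~/.bashrc
source ~/.bashrc

# macOS
echo "export WAYANG_HOME=$(pwd)" >> ~/.zshrc
# Single quotes here too: WAYANG_HOME is not set yet in the current shell
echo 'export PATH=${PATH}:${WAYANG_HOME}/bin' >> ~/.zshrc
source ~/.zshrc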

Then run the bundled WordCount on your local Java engine:

bin/wayang-submit org.apache.wayang.apps.wordcount.Main java file://$(pwd)/README.md

Running the tests

./mvnw test

Documentation

Contributing

Contributions are welcome — bug reports, doc fixes, new platform adapters, new operators, optimizer improvements, anything. Start with CONTRIBUTING.md and the building guide, open an issue if you're not sure where to start, and introduce yourself on the dev mailing list — that's where active work gets discussed.

If you're looking for somewhere to begin, doc improvements, new operators, and additional examples are areas where a focused PR can land quickly.

Community

Authors

See the full list of contributors.

License

All files in this repository are licensed under the Apache License 2.0.

Copyright 2020 - 2026 The Apache Software Foundation.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Acknowledgements

The logo was donated by Brian Vera.