Apache Beam Tutorial for Batch and Streaming Data Pipelines

Apache Beam is an open-source unified programming model from the Apache Software Foundation for defining data processing pipelines. A Beam pipeline can describe ETL, batch processing, and stream processing logic without tying the business logic to only one execution engine. The same pipeline model can be run on a supported runner such as Apache Flink, Apache Spark, Google Cloud Dataflow, and other runner implementations, depending on the environment and workload.

MapReduce triggered the evolution of the Big Data ecosystem that we are seeing today. There are many frameworks like Hadoop, Spark, Flink, Google Cloud Dataflow, etc, that came into existence. But there has been no unified API that binds all these frameworks and data sources, and provide an abstraction to the application logic from big data ecosystem. Apache Beam provides this abstraction between your application logic and the execution environment.

This Apache Beam tutorial introduces the Beam programming model, the main pipeline terms, where Beam fits in the data engineering stack, and what a beginner should learn before writing production pipelines.

What Apache Beam is used for in data engineering

Apache Beam is used to define data processing pipelines in a runner-independent way. The pipeline describes what should happen to the data: read input, transform records, group or aggregate values, handle windows for streaming data, and write output. The selected runner decides how that pipeline is executed on a local machine, a distributed cluster, or a managed cloud service.

Hence, there is no need to mix the core application logic with every input-specific or runner-specific detail when you are writing a data processing or analytic application:

  • Data Source – Data source can be bounded data, such as files and tables, or unbounded streaming data, such as messages and events.
  • SDK – You may choose your SDK, such as Java or Python, based on the language used by your team and the IO connectors required by your pipeline.
  • Runner – Once the application logic is written as a Beam pipeline, you may choose one of the available runners, such as Apache Spark, Apache Flink, Google Cloud Dataflow, and others, based on the nature of your inputs and processing needs.

This is how Beam lets you write your pipeline logic once and keep the execution choice separate from most of the transformation logic.

Apache Beam programming model for unified batch and streaming

The Apache Beam model treats both batch and streaming workloads through a common set of concepts. A batch input is usually a bounded collection because it has a known end. A streaming input is usually an unbounded collection because new records may keep arriving. Beam uses the same pipeline structure for both, while streaming-specific concepts such as windowing, triggers, event time, and watermarks help control how unbounded data is processed.

Beam conceptMeaning in a pipeline
PipelineThe complete data processing job, from reading input to writing output.
PCollectionA distributed data collection. It can be bounded or unbounded.
PTransformAn operation that takes one or more collections and returns one or more collections.
ParDoA transform for applying per-element processing logic.
DoFnThe user-defined function used by ParDo to process each element.
RunnerThe execution engine that runs the pipeline.

Apache Beam pipeline terms: PCollection, PTransform, ParDo, and DoFn

A Beam pipeline is usually written as a chain of transforms. The input transform creates a PCollection. Other transforms read that PCollection and produce another PCollection. The final transform writes the result to a sink such as a file, table, database, message topic, or another supported destination.

ParDo and DoFn are often the first Beam terms that beginners find confusing. The difference is simple: DoFn is the function that contains your per-record logic, and ParDo is the Beam transform that applies that function to the elements of a PCollection.

</>
Copy
import apache_beam as beam

class ExtractWords(beam.DoFn):
    def process(self, line):
        for word in line.split():
            yield word

with beam.Pipeline() as pipeline:
    words = (
        pipeline
        | "Create lines" >> beam.Create(["apache beam tutorial", "beam pipeline example"])
        | "Extract words" >> beam.ParDo(ExtractWords())
    )

In this syntax example, ExtractWords is the DoFn, and beam.ParDo(ExtractWords()) applies it to every input line.

Apache Beam runners and the difference between Beam and Dataflow

Apache Beam is the programming model and SDK used to write the pipeline. A runner is the execution backend that runs the pipeline. Google Cloud Dataflow is a managed service and one of the Beam runners. In other words, Beam is not the same thing as Dataflow. Beam defines the pipeline; Dataflow can execute it on Google Cloud.

TermRoleExample use
Apache BeamProgramming model and SDKWrite a pipeline using Java or Python SDK APIs.
Beam runnerExecution adapterRun the same pipeline on a chosen processing engine.
Google Cloud DataflowManaged cloud runner/serviceRun Beam pipelines on Google Cloud without managing the cluster.
Apache Flink or Spark runnerDistributed execution optionRun Beam pipelines on an existing Flink or Spark environment when supported by the runner.

When learning Beam, keep the runner choice separate from the pipeline logic. Start with the local runner for small examples, then move to the runner used by your team or platform.

Is Apache Beam good for real-time processing?

Apache Beam can be used for real-time processing when the selected runner and IO connectors support streaming. Beam provides streaming concepts such as event time, windowing, triggers, allowed lateness, and watermarks. These features help process unbounded data where records arrive continuously or arrive late.

Beam is also useful for batch ETL pipelines. A common advantage is that a team can use one model for both bounded and unbounded data instead of learning separate APIs for every processing style. The practical result still depends on the runner, connector support, cluster configuration, and the latency needs of the application.

Simple Apache Beam pipeline example in Python

The following example creates a small in-memory collection, converts every word to lowercase, counts each word, and prints the result. It is suitable for understanding the pipeline shape before connecting files, tables, or streaming sources.

</>
Copy
import apache_beam as beam

with beam.Pipeline() as pipeline:
    word_counts = (
        pipeline
        | "Create input" >> beam.Create([
            "Apache Beam Tutorial",
            "Beam supports batch and streaming",
            "Apache Beam pipeline example"
        ])
        | "Split words" >> beam.FlatMap(lambda line: line.split())
        | "Lowercase words" >> beam.Map(lambda word: word.lower())
        | "Count words" >> beam.combiners.Count.PerElement()
        | "Print results" >> beam.Map(print)
    )

The output order may vary because Beam pipelines can run in parallel, but the result will contain counts similar to the following.

('apache', 2)
('beam', 3)
('tutorial', 1)
('supports', 1)
('batch', 1)
('and', 1)
('streaming', 1)
('pipeline', 1)
('example', 1)

Apache Beam best practices for beginner pipelines

  • Keep business logic in transforms and keep runner-specific settings outside the core transformation code where possible.
  • Use meaningful transform names, such as Read events, Parse JSON, and Write failed records, so the runner UI and logs are easier to understand.
  • Avoid collecting large datasets into local memory. Let Beam process data as distributed PCollection objects.
  • Test DoFn and transform logic with small inputs before running the pipeline on production data.
  • Choose the runner based on workload requirements, connector support, operational skills, and the deployment platform.
  • For streaming pipelines, define windowing and late-data behavior deliberately instead of relying on defaults without checking the result semantics.

Apache Beam tutorial editorial QA checklist

  • The page clearly explains that Apache Beam is a programming model, not a storage system or a cluster manager.
  • The difference between Apache Beam and Google Cloud Dataflow is stated in plain terms.
  • The tutorial explains ParDo and DoFn without treating them as the same object.
  • The batch and streaming discussion mentions bounded and unbounded data.
  • Code examples use small, testable inputs and do not depend on unavailable local files.
  • The tutorial avoids promising runner portability for every connector and every advanced feature without testing.

Apache Beam tutorial FAQs

Is Apache Beam good for real-time processing?

Yes, Apache Beam can be used for real-time processing when the selected runner and IO connectors support streaming. Beam includes streaming concepts such as event time, windows, triggers, watermarks, and late-data handling.

What is the difference between ParDo and DoFn in Apache Beam?

DoFn is the user-defined function that contains the per-element processing logic. ParDo is the Beam transform that applies that DoFn to each element in a PCollection.

What is the difference between Apache Beam and Google Cloud Dataflow?

Apache Beam is the SDK and programming model used to define pipelines. Google Cloud Dataflow is a managed Google Cloud service that can run Beam pipelines. Beam is the pipeline model; Dataflow is one possible execution service.

Which language should I use to learn Apache Beam?

Use the language that matches your project and available connectors. Java and Python are common choices for Beam tutorials and production pipelines. Beginners usually start with small local examples before adding external IO and a distributed runner.

Can one Apache Beam pipeline run on every runner without changes?

The Beam model is designed for runner portability, but every real pipeline should be tested on the target runner. Connector support, runner-specific behavior, streaming features, and operational settings can affect whether changes are needed.

Summary of Apache Beam tutorial concepts

Apache Beam helps define batch and streaming data pipelines with a unified programming model. The core ideas to learn first are Pipeline, PCollection, PTransform, ParDo, DoFn, and runner. Once these are clear, you can move from simple local examples to real inputs, streaming windows, and a production runner such as Flink, Spark, or Google Cloud Dataflow.