Apache Flink Tutorial

This Apache Flink tutorial introduces Flink as a distributed stream processing framework, explains where it fits with Hadoop and Spark, and gives you a practical path for learning Flink with examples. The goal is to help you understand Flink concepts before you move to installation, DataStream API programs, SQL jobs, and production-style pipelines.


What is Apache Flink in stream processing?

Apache Flink is an open-source framework and distributed processing engine for stateful computations over data streams. In simple terms, Flink helps you process data while it is arriving, keep state across events, and produce results with low latency. It can process both unbounded streams, such as continuously arriving click events, and bounded streams, such as a fixed file or historical data set.

A Flink application usually reads data from a source, applies transformations, maintains state when needed, and writes results to a sink. Sources and sinks can be systems such as Kafka, files, databases, object stores, message queues, or custom connectors. This makes Flink useful for real-time analytics, event-driven applications, data enrichment, fraud checks, monitoring, and continuous ETL pipelines.

  • Stream processing: process events continuously as they arrive.
  • Stateful computation: remember previous events, counters, windows, sessions, or keyed state.
  • Event-time handling: reason about when an event actually happened, not only when it reached the system.
  • Fault tolerance: recover application state by using checkpoints and restart strategies.
  • Unified bounded and unbounded processing: write programs that can work with both live streams and finite data.
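To make the source, transformation, state, and sink roles described above concrete, here is a minimal plain-Java sketch. It is not Flink API: the class name DedupPipelineSketch and the event IDs are hypothetical, and an in-memory set stands in for Flink-managed state. It only shows why a stateful operator has to remember something across events, here the IDs it has already seen.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Plain-Java model of source -> stateful transformation -> sink.
// Not Flink API; names and data are illustrative assumptions.
class DedupPipelineSketch {

    // State that survives across events: the event IDs already seen.
    private final Set<String> seenIds = new HashSet<>();

    // Handles one event at a time and forwards it only the first time it appears.
    void process(String eventId, Consumer<String> sink) {
        if (seenIds.add(eventId)) {
            sink.accept(eventId);
        }
    }

    // Feeds every event from the source through the operator and
    // returns what reached the sink, in arrival order.
    static List<String> run(List<String> source) {
        DedupPipelineSketch operator = new DedupPipelineSketch();
        List<String> sink = new ArrayList<>();
        for (String eventId : source) {
            operator.process(eventId, sink::add);
        }
        return sink;
    }

    public static void main(String[] args) {
        // prints [click-1, click-2, click-3]
        System.out.println(run(List.of("click-1", "click-2", "click-1", "click-3")));
    }
}
```

In a real Flink job the set would live in keyed state so that it is partitioned by key and included in checkpoints; the sketch keeps it in a local field only to stay self-contained.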

Apache Flink tutorial learning path for beginners

If you are new to Apache Flink, learn it in layers. Start with the processing model, then write a small local job, and only then move to connectors, state, windows, and deployment. This sequence avoids a common mistake: trying to configure a cluster before understanding how a Flink job is built.

  1. Understand streams: learn the difference between bounded and unbounded data.
  2. Install Flink locally: follow the setup step in this tutorial series and run a sample job.
  3. Write a DataStream API job: read records, transform them, group by key, and write output.
  4. Learn windows and time: use tumbling, sliding, session, or custom windows when aggregating events.
  5. Add state: use keyed state when the result depends on previous events for the same key.
  6. Use connectors: connect Flink to Kafka, files, JDBC systems, Elasticsearch, or other external systems.
  7. Try Table API and SQL: write declarative stream and batch queries when SQL is a better fit.
  8. Deploy and monitor: learn checkpoints, savepoints, parallelism, job managers, task managers, and logs.
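Step 4 in the path above mentions tumbling windows. The idea can be sketched in plain Java before touching the Flink window API: a tumbling window of a fixed size assigns each event timestamp to exactly one window, whose start is the timestamp rounded down to a multiple of the window size. The class and method names below are hypothetical, and windows are assumed to be aligned to epoch zero with no offset.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of tumbling window assignment; not Flink API.
class TumblingWindowSketch {

    // Start of the tumbling window that contains the timestamp.
    // floorMod keeps the result correct for negative timestamps as well.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Counts events per window, keyed by window start (sorted for readability).
    static Map<Long, Integer> countPerWindow(long[] timestampsMillis, long sizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long timestamp : timestampsMillis) {
            counts.merge(windowStart(timestamp, sizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] timestamps = {1000L, 4000L, 5000L, 9999L, 10000L};
        // Five-second windows: [0, 5000), [5000, 10000), [10000, 15000)
        // prints {0=2, 5000=2, 10000=1}
        System.out.println(countPerWindow(timestamps, 5000L));
    }
}
```

Sliding windows differ in that one event can belong to several overlapping windows, and session windows have no fixed size at all, so this arithmetic applies to the tumbling case only.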

Apache Flink APIs used in real Flink applications

Flink offers more than one API because not every data problem is best expressed in the same way. Choose the API based on how much control you need over events, state, and logic.

| Flink API | Best use | Typical example |
| --- | --- | --- |
| DataStream API | Event-by-event stream processing with custom logic | Fraud detection, alerts, session tracking, keyed aggregations |
| Table API | Relational style programs in Java, Scala, or Python | Filtering, joins, aggregations, and table-like transformations |
| Flink SQL | Declarative queries over streaming or batch tables | Continuous reports, streaming joins, and ETL pipelines |
| Process functions | Low-level control over timers, state, and event-time behavior | Timeout detection, custom session rules, delayed alerts |

For most beginners, the DataStream API is the best place to start because it makes core Flink ideas visible: streams, transformations, keying, state, windows, and sinks. SQL becomes easier once you understand what Flink is doing underneath.
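The timeout detection case listed for process functions relies on two ideas: per-key state and timers that fire when time passes a deadline. The following plain-Java sketch models that interaction without using the Flink API. The class name, the key names, and the onWatermark method are hypothetical; in Flink itself, timers are registered through a timer service and fire in a callback.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java model of "alert when a key has been silent for too long".
// Not Flink API; it only mimics keyed state plus event-time timers.
class TimeoutTimerSketch {

    private final long timeoutMillis;

    // Keyed state: the current deadline for each key.
    private final Map<String, Long> deadlineByKey = new HashMap<>();

    TimeoutTimerSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Every event for a key pushes that key's deadline further out.
    void onEvent(String key, long eventTimeMillis) {
        deadlineByKey.put(key, eventTimeMillis + timeoutMillis);
    }

    // Called when event time advances. Returns the keys whose deadline
    // has passed and clears their state so they fire only once.
    List<String> onWatermark(long watermarkMillis) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> entries = deadlineByKey.entrySet().iterator();
        while (entries.hasNext()) {
            Map.Entry<String, Long> entry = entries.next();
            if (entry.getValue() <= watermarkMillis) {
                timedOut.add(entry.getKey());
                entries.remove();
            }
        }
        Collections.sort(timedOut); // deterministic order for display
        return timedOut;
    }

    public static void main(String[] args) {
        TimeoutTimerSketch detector = new TimeoutTimerSketch(30_000L);
        detector.onEvent("order-1", 0L);
        detector.onEvent("order-2", 10_000L);
        detector.onEvent("order-1", 20_000L); // order-1 deadline moves to 50_000
        System.out.println(detector.onWatermark(45_000L)); // prints [order-2]
        System.out.println(detector.onWatermark(60_000L)); // prints [order-1]
    }
}
```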

Apache Flink example using the DataStream API

The following Java example shows the basic shape of a Flink DataStream job. It creates a small bounded stream, splits each line into words, groups by word, and counts occurrences. In a real project, the source could be Kafka or a file, and the sink could be a database, another Kafka topic, or an object store.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines = env.fromData(
                "apache flink stream processing",
                "apache flink stateful processing"
        );

        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    // Split on runs of whitespace; "\\s+" is the Java literal for the regex \s+.
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0)
                .sum(1);

        counts.print();

        env.execute("Flink Word Count");
    }
}

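The printed records may appear in a different order, and each line may carry a subtask prefix such as 3>, because Flink executes jobs in parallel. In the default streaming execution mode, the keyed sum emits an updated count for every incoming record, so intermediate results such as (apache,1) are printed as well. For the two input lines above, the final count for each word matches the following records.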

(apache,2)
(flink,2)
(processing,2)
(stream,1)
(stateful,1)
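To see how a keyed running sum arrives at these counts, the following plain-Java sketch processes the same two lines one word at a time and records every update it would emit. It is a single-threaded model, not Flink API, and the class name is hypothetical. A real Flink job interleaves records from parallel subtasks, so only the per-word sequence of counts is comparable, not the global order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-threaded model of keyBy(word) followed by a running sum.
// Not Flink API; it only shows that each record produces an updated count.
class RunningWordCountSketch {

    static List<String> emit(List<String> lines) {
        Map<String, Integer> countByWord = new HashMap<>(); // per-key state
        List<String> emitted = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                // Update the state for this key and emit the new running total.
                int updated = countByWord.merge(word, 1, Integer::sum);
                emitted.add("(" + word + "," + updated + ")");
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "apache flink stream processing",
                "apache flink stateful processing");
        // Prints, one per line: (apache,1) (flink,1) (stream,1) (processing,1)
        // (apache,2) (flink,2) (stateful,1) (processing,2)
        emit(lines).forEach(System.out::println);
    }
}
```

The last record emitted for each word is its final count, which is why the totals listed above are (apache,2), (flink,2), (processing,2), (stream,1), and (stateful,1).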

How to run Apache Flink locally before using a cluster

Before deploying Flink on a cluster, run it locally. A local setup helps you check Java, build tools, Flink dependencies, and the job structure without adding cluster configuration problems.

The next steps in this tutorial series cover installing Flink and its modules. When setting up a new project, refer to the current Apache Flink documentation for the supported Java, Maven, Gradle, Docker, and PyFlink requirements because version requirements change over time. As a first check, confirm that Java and Maven are available on your machine:

java -version
mvn -version

After the local environment is ready, create a small Flink project, add your job class, build it, and run it from your IDE or through the Flink command line. Keep the first job small; the purpose is to verify the pipeline structure, not to solve every production concern at once.

Apache Flink vs Hadoop and Spark

Apache Flink, Hadoop, and Spark are related because all three are used in distributed data processing, but they are not the same tool. Hadoop MapReduce is mainly associated with batch processing. Spark is a general-purpose distributed compute engine with strong batch and SQL capabilities, and its streaming use is often based on micro-batch execution. Flink is designed around stream processing first and treats bounded data as a special case of streams.

  • Flink and Hadoop: Hadoop MapReduce is suitable for large batch jobs, while Flink is better suited when results must be updated continuously as events arrive.
  • Flink and Spark: both are distributed processing frameworks, but Flink gives stream processing, state, event time, and continuous execution a central role.
  • Batch data in Flink: Flink can process bounded data, but its model still comes from streams with a known end.
  • Iterative and stateful workloads: Flink is often chosen when a job needs long-running stateful computation over continuous events.

Apache Flink is often discussed together with the Kappa architecture, where one stream processing path handles both real-time data and replay of historical data. In practice, the right architecture depends on the system, the data source, latency needs, operational skill, and the way historical corrections are handled.

Apache Flink concepts every beginner should know

Before writing larger Flink jobs, learn the terms that appear repeatedly in examples and documentation.

  • Job: the complete Flink application submitted for execution.
  • Source: the input connector or collection that provides data to the job.
  • Transformation: an operation such as map, flatMap, filter, keyBy, window, join, or aggregation.
  • Sink: the destination where Flink writes the result.
  • Parallelism: the number of parallel task instances used to process data.
  • Checkpoint: a consistent snapshot that helps a job recover state after failure.
  • Savepoint: a manually triggered snapshot often used for upgrades, migrations, or controlled restarts.
  • Watermark: a mechanism used with event time to track progress when events can arrive late.
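The watermark entry above is often the hardest concept for beginners, so here is a small plain-Java sketch of one common strategy: allow events to be out of order by a fixed bound and let the watermark trail the highest timestamp seen so far. The class and method names are hypothetical and this is not Flink API. The "minus one" follows the convention that a watermark of t means no more events with timestamp t or earlier are expected.

```java
// Plain-Java model of a bounded-out-of-orderness watermark; not Flink API.
class WatermarkSketch {

    private final long maxOutOfOrdernessMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Records an event timestamp and returns the watermark after seeing it.
    long onEvent(long eventTimeMillis) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimeMillis);
        return currentWatermark();
    }

    // The watermark trails the highest timestamp by the allowed bound.
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // no events yet, avoid arithmetic underflow
        }
        return maxTimestampSeen - maxOutOfOrdernessMillis - 1;
    }

    // In this sketch an event is "late" when it is at or behind the watermark.
    boolean isLate(long eventTimeMillis) {
        return eventTimeMillis <= currentWatermark();
    }

    public static void main(String[] args) {
        WatermarkSketch watermarks = new WatermarkSketch(2_000L);
        System.out.println(watermarks.onEvent(10_000L)); // prints 7999
        System.out.println(watermarks.onEvent(12_000L)); // prints 9999
        System.out.println(watermarks.isLate(9_000L));   // prints true
        System.out.println(watermarks.isLate(11_000L));  // prints false
    }
}
```

Whether a late event is dropped, routed elsewhere, or still included depends on the operator and its lateness settings; the sketch only shows how the watermark itself moves.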

Apache Flink use cases for real-time data

For examples of Flink use cases that are running in production today, refer to the use-case documentation that the Apache Flink project maintains.

Common Apache Flink use cases include:

  1. Real-time dashboards: aggregate metrics continuously for operations, finance, product usage, or infrastructure monitoring.
  2. Fraud and risk checks: evaluate a user, account, transaction, or device against recent behavior and rules.
  3. IoT event processing: process sensor readings, device status events, and alerts from distributed systems.
  4. Log and observability pipelines: enrich, filter, route, and aggregate application logs or telemetry data.
  5. Customer activity streams: update recommendations, segments, alerts, or personalization signals as events arrive.
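As an illustration of the fraud and risk pattern in the list above, the following plain-Java sketch flags an account that makes more than a set number of transactions within a time span. It is not Flink API and not a real fraud rule: the class name, the threshold, and the account IDs are hypothetical, and timestamps are assumed to arrive in order for each account. In a Flink job, the per-account queue would be keyed state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of a per-account velocity check; not Flink API.
class VelocityRuleSketch {

    private final int maxTransactions;
    private final long spanMillis;

    // Per-key state: recent transaction timestamps for each account.
    private final Map<String, Deque<Long>> recentByAccount = new HashMap<>();

    VelocityRuleSketch(int maxTransactions, long spanMillis) {
        this.maxTransactions = maxTransactions;
        this.spanMillis = spanMillis;
    }

    // Returns true when this transaction pushes the account over the limit.
    boolean isSuspicious(String accountId, long timestampMillis) {
        Deque<Long> recent =
                recentByAccount.computeIfAbsent(accountId, id -> new ArrayDeque<>());
        // Drop timestamps that have fallen out of the time span.
        while (!recent.isEmpty() && recent.peekFirst() <= timestampMillis - spanMillis) {
            recent.pollFirst();
        }
        recent.addLast(timestampMillis);
        return recent.size() > maxTransactions;
    }

    public static void main(String[] args) {
        VelocityRuleSketch rule = new VelocityRuleSketch(3, 60_000L);
        System.out.println(rule.isSuspicious("acct-A", 0L));       // prints false
        System.out.println(rule.isSuspicious("acct-A", 10_000L));  // prints false
        System.out.println(rule.isSuspicious("acct-A", 20_000L));  // prints false
        System.out.println(rule.isSuspicious("acct-A", 30_000L));  // prints true
        System.out.println(rule.isSuspicious("acct-A", 100_000L)); // prints false
    }
}
```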

In the earlier version of this tutorial, examples such as Alibaba and Bouygues Telecom were listed as Flink users. Large organizations may change their internal platforms over time, so treat company examples as references and focus on the technical pattern: continuous data, low-latency decisions, stateful processing, and fault-tolerant execution.

When Apache Flink is a good fit

Use Apache Flink when your data problem depends on events that keep arriving and the result must be updated continuously. It is especially useful when the application needs event-time processing, large keyed state, exactly-once state consistency, or long-running pipelines that must recover from failures.

  • Choose Flink for continuous event processing, stateful stream analytics, and pipelines that need low latency.
  • Consider Flink SQL when analysts or engineers can describe the transformation more clearly as a query.
  • Use the DataStream API when the logic needs custom functions, timers, state, or fine-grained stream control.
  • Use batch-focused tools when the data is fixed, latency is not important, and a simpler batch workflow is enough.
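The recovery behavior mentioned above can be reasoned about with a small plain-Java model. A checkpoint pairs the operator state with the source position it corresponds to; after a failure, the job restores that state and replays the source from that position, so each event affects the state exactly once. This is not Flink API: the class name, the in-memory event list, and the offsets are hypothetical stand-ins for a replayable source such as a log with offsets.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of checkpoint, failure, restore, and replay; not Flink API.
class CheckpointReplaySketch {

    // A checkpoint: a copy of the state plus the source offset it was taken at.
    static final class Checkpoint {
        final int offset;
        final Map<String, Integer> counts;

        Checkpoint(int offset, Map<String, Integer> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts); // defensive copy
        }
    }

    static Map<String, Integer> runWithoutFailure(List<String> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (String event : events) {
            counts.merge(event, 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> runWithFailure(
            List<String> events, int checkpointAtOffset, int failAtOffset) {
        Map<String, Integer> counts = new HashMap<>();
        Checkpoint last = new Checkpoint(0, counts); // empty state at offset 0

        for (int offset = 0; offset < failAtOffset; offset++) {
            if (offset == checkpointAtOffset) {
                last = new Checkpoint(offset, counts); // snapshot before this event
            }
            counts.merge(events.get(offset), 1, Integer::sum);
        }

        // Simulated crash: in-memory state is lost. Restore and replay.
        counts = new HashMap<>(last.counts);
        for (int offset = last.offset; offset < events.size(); offset++) {
            counts.merge(events.get(offset), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> events = List.of("a", "b", "a", "c", "a");
        // Both runs end with a=3, b=1, c=1 despite the failure; prints true.
        System.out.println(runWithFailure(events, 2, 4).equals(runWithoutFailure(events)));
    }
}
```

The model covers state consistency only. Whether results written to an external sink are also free of duplicates depends on the sink, which is a separate concern.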

Apache Flink official references and practice resources

While working through this Apache Flink tutorial, treat the official Apache Flink documentation as your primary reference for current commands, supported versions, and connector behavior.

Apache Flink tutorial QA checklist

Use this checklist while reviewing your first Flink program or while editing a Flink tutorial example.

  • Does the example clearly state whether it processes bounded data or an unbounded stream?
  • Are the source, transformations, keyBy logic, and sink easy to identify?
  • If the example uses time windows, does it explain processing time, event time, and late events?
  • If the job keeps state, does it mention checkpoints or recovery behavior?
  • Are Java, Maven, Gradle, Docker, or PyFlink version requirements checked against the current Flink documentation?
  • Does the tutorial avoid outdated claims such as treating every Spark streaming workload as identical or promising absolute availability?

FAQs on Apache Flink tutorial topics

What is Apache Flink mainly used for?

Apache Flink is mainly used for stateful stream processing, real-time analytics, continuous ETL, event-driven applications, fraud detection, monitoring pipelines, and workloads that need low-latency results from continuously arriving data.

Is Apache Flink only for streaming data?

No. Apache Flink can process both unbounded streams and bounded data sets. Its execution model is stream-first, so bounded data is treated as a stream with a known end.

Should I learn Flink DataStream API or Flink SQL first?

Start with the DataStream API if you want to understand Flink concepts such as events, state, keyBy, windows, watermarks, and sinks. Learn Flink SQL when you want to express stream or batch transformations with declarative queries.

How is Apache Flink different from Apache Spark Streaming?

Apache Flink is designed around continuous stream processing and stateful event-time computation. Spark is a broader distributed compute engine with strong batch and SQL support, and its streaming workloads are commonly implemented with micro-batch execution.

Can Apache Flink run on my laptop for learning?

Yes. You can run Flink locally for learning and testing small jobs. For production, Flink is usually deployed on a cluster environment with proper checkpointing, monitoring, resources, and connector configuration.

Conclusion on learning Apache Flink

In this Apache Flink tutorial, we learned what Flink is, why it is used for stateful stream processing, how it compares with Hadoop and Spark, and how a simple DataStream API program is structured. The next step is to install Flink, run a local job, and then practice with sources, transformations, windows, state, connectors, Table API, and Flink SQL.