Apache Spark MLlib Tutorial – Learn about Spark’s Scalable Machine Learning Library

MLlib is one of Apache Spark’s four main libraries, alongside Spark SQL, Spark Streaming, and GraphX. It is Spark’s scalable machine learning library.


MLlib applications can be developed in Java, Scala, or Python using Spark’s APIs.

In recent Spark releases, MLlib also interoperates with Python’s NumPy library and with R libraries.

Data Source

Using MLlib, one can access HDFS (Hadoop Distributed File System) and HBase, in addition to local files. This makes it easy to plug MLlib into Hadoop workflows.


Spark’s framework excels at iterative computation, which lets the iterative parts of MLlib algorithms run fast. MLlib also provides algorithms for classification, regression, recommendation, clustering, topic modelling, and more.
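
To see why iterative computation matters, consider gradient descent for logistic regression: every iteration makes a full pass over the same training data. The following is a plain Python sketch of that loop, not MLlib code. The function names `train_logistic` and `predict`, the toy data, and the learning rate are assumptions chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, iterations=100, learning_rate=0.5):
    """Batch gradient descent for logistic regression (illustrative only)."""
    n = len(points)
    n_features = len(points[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        # Each iteration scans the whole dataset again.
        for x, y in zip(points, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Move the parameters against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy, linearly separable data (hypothetical values).
points = [[-2.0, -1.0], [-1.0, -1.5], [1.0, 1.5], [2.0, 1.0]]
labels = [0, 0, 1, 1]

weights, bias = train_logistic(points, labels)
predictions = [predict(weights, bias, x) for x in points]
print(predictions)
```

Because the inner loop re-reads the full dataset on every iteration, an engine that can keep the data in memory between passes avoids paying the loading cost each time. That is the property the paragraph above refers to.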

Apache Spark MLlib Tutorial

The following are some examples of MLlib algorithms, each with a step-by-step walkthrough of ML Pipeline construction and model building:

  1. Classification using Logistic Regression
  2. Classification using Naive Bayes
  3. Generalized Regression
  4. Survival Regression
  5. Decision Trees
  6. Random Forests
  7. Gradient Boosted Trees
  8. Recommendation using Alternating Least Squares (ALS)
  9. Clustering using KMeans
  10. Clustering using Gaussian Mixtures
  11. Topic Modelling using Latent Dirichlet Allocation (LDA)
  12. Frequent Itemsets
  13. Association Rules
  14. Sequential Pattern Mining
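
To make the frequent itemsets and association rules entries more concrete, here is a plain Python sketch of what those two tasks compute. It uses brute-force counting over a toy basket dataset, whereas MLlib implements the FP-Growth algorithm for this purpose, so treat it as an illustration of the idea and not of MLlib's implementation. The function names, the `min_support` and `min_confidence` parameters, and the data are assumptions made for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                result[frozenset(candidate)] = support
                found_any = True
        if not found_any:
            break  # no larger itemset can be frequent either
    return result

def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, confidence) rules with one consequent item."""
    rules = []
    for itemset, support in itemsets.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # Every subset of a frequent itemset is itself frequent,
            # so the antecedent's support is always available.
            confidence = support / itemsets[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules

# Toy shopping baskets (hypothetical data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.8)
print(rules)
```

With these baskets, butter appears in two of the four transactions and always together with bread, so "butter implies bread" is the only rule that reaches a confidence of 0.8.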

MLlib Utilities

MLlib provides the following workflow utilities:

  1. Feature Transformation
  2. ML Pipeline construction
  3. Model Evaluation
  4. Hyper-parameter tuning
  5. Saving and loading of models and pipelines
  6. Distributed Linear Algebra
  7. Statistics
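
The first few utilities usually work together: features are transformed, a model is applied, its predictions are evaluated, and the process is repeated for several hyper-parameter values. The plain Python sketch below illustrates that workflow in miniature. It is not MLlib code: the classes `MinMaxScaler1D` and `ThresholdClassifier`, the helper functions, and the data are all hypothetical stand-ins for the corresponding MLlib stages.

```python
class MinMaxScaler1D:
    """Feature transformation stage: rescales values to the range [0, 1]."""
    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0
        return [(v - self.low) / span for v in values]

class ThresholdClassifier:
    """Model stage: predicts 1 when the scaled feature exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        return [1 if v > self.threshold else 0 for v in values]

def accuracy(predictions, labels):
    """Model evaluation: fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def grid_search(train_x, valid_x, valid_y, thresholds):
    """Hyper-parameter tuning: score every candidate on held-out data."""
    scaler = MinMaxScaler1D().fit(train_x)   # fit the transformer on training data
    scaled = scaler.transform(valid_x)       # reuse it on the validation data
    scores = {
        t: accuracy(ThresholdClassifier(t).predict(scaled), valid_y)
        for t in thresholds
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical training and validation data.
train_x = [10, 20, 30, 40, 50]
valid_x = [15, 25, 35, 45]
valid_y = [0, 0, 1, 1]

best, scores = grid_search(train_x, valid_x, valid_y, thresholds=[0.2, 0.5, 0.8])
print(best, scores)
```

The design point to notice is that the scaler is fitted once on the training data and then reused unchanged on the validation data, which is the same discipline a pipeline enforces when several stages are chained.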


In this Apache Spark MLlib tutorial, we have learnt about the machine learning algorithms available in Spark MLlib and the utilities it provides.