What does it mean to configure the Apache Spark ecosystem?
Some parameters depend on the cluster setup or on user preferences: the number of nodes in the cluster, the number of cores in each node, the memory available at each node, the number of threads that can be launched, the deployment mode, extra Java options, extra library paths, mapper properties, reducer properties, and so on. The user of an Apache Spark application should have control over these parameters, so that the Apache Spark ecosystem can be configured to fit the needs of the application.
We shall learn which parameters are available for configuration and what they mean to the Apache Spark ecosystem.
Following are the three broad categories of parameters through which you can configure the Apache Spark ecosystem:
- Spark Application Parameters
- Spark Environment Parameters
- Logging Parameters
Spark Application Parameters
These parameters affect only the behavior and working of the Apache Spark application submitted by the user.
Following are the ways to set Spark Application Parameters:
- They can be set in the Spark application itself, using a SparkConf object in the driver program.
- They can also be set using Java system properties if you are programming in a language that runs on the JVM.
- They can also be provided by the user on the command line when submitting the Spark application with the spark-submit command.
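The third option, passing parameters on the command line, can be sketched as follows. This is only a minimal illustration in Python that assembles a spark-submit argument list without running it. The helper name build_spark_submit_command, the application file my_app.py, and all values are hypothetical; --master, --deploy-mode, and --conf are regular spark-submit options, and spark.executor.memory and spark.executor.cores are regular Spark properties.

```python
import shlex


def build_spark_submit_command(app_file, master, deploy_mode, conf):
    """Assemble a spark-submit argument list (hypothetical helper).

    Each entry of `conf` becomes one `--conf key=value` pair.
    """
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    # The application file goes last, after all options.
    args.append(app_file)
    return args


# Example values are assumptions chosen for illustration only.
command = build_spark_submit_command(
    app_file="my_app.py",
    master="local[4]",
    deploy_mode="client",
    conf={
        "spark.executor.memory": "2g",
        "spark.executor.cores": "2",
    },
)

# Print a shell-ready command line; shlex.join quotes arguments when needed.
print(shlex.join(command))
```

Inside the driver program, the same properties could instead be set on the SparkConf object with its set(key, value) method.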
Spark Environment Parameters
These parameters affect the behavior, working, and memory usage of the nodes in the cluster.
To configure each node in the Spark cluster individually, environment parameters have to be set in the spark-env.sh shell script. The location of spark-env.sh is <apache-installation-directory>/conf/spark-env.sh. To configure a particular node in the cluster, the spark-env.sh file on that node has to be set up with the required parameters.
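As an illustration of what the spark-env.sh of one node might contain, the sketch below renders a few export lines. SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, and SPARK_LOCAL_IP are environment variables that Spark reads (the worker ones apply to standalone mode); the helper name render_spark_env and all values are assumptions made for the example.

```python
def render_spark_env(settings):
    """Render spark-env.sh content from a dict (hypothetical helper).

    Values are written unquoted, so they must not contain spaces.
    """
    lines = ["#!/usr/bin/env bash"]
    for name, value in settings.items():
        lines.append(f"export {name}={value}")
    return "\n".join(lines) + "\n"


# Values below are illustrative assumptions for a single node.
node_settings = {
    "SPARK_WORKER_CORES": "4",         # cores the worker on this node may use
    "SPARK_WORKER_MEMORY": "8g",       # total memory the worker may hand out
    "SPARK_LOCAL_IP": "192.168.1.10",  # IP address Spark binds to on this node
}

print(render_spark_env(node_settings))
```

The printed text would be saved as conf/spark-env.sh on the node it describes; a node with different hardware would get its own file with different values.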
Logging Parameters
These parameters affect the logging behavior of the running Apache Spark application.
To configure logging parameters, modify the log4j.properties file with the required values and place it at <apache-installation-directory>/conf/log4j.properties. This can be done at node level, i.e., the logging properties for each node can be set by placing a log4j.properties file on that node at the specified location. Note that Spark 3.3 and later use Log4j 2, where the corresponding file is named log4j2.properties.
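A minimal sketch of such a file, assuming the Log4j 1.x properties syntax that log4j.properties uses: the snippet below renders a console-only configuration with a chosen root log level. The helper name render_log4j_properties and the WARN level are assumptions for the example; the property keys and class names are standard Log4j 1.x ones.

```python
def render_log4j_properties(root_level="WARN"):
    """Render Log4j 1.x properties with one console appender (hypothetical helper)."""
    properties = [
        # Root logger: minimum level, then the appender(s) it writes to.
        ("log4j.rootCategory", root_level + ", console"),
        ("log4j.appender.console", "org.apache.log4j.ConsoleAppender"),
        ("log4j.appender.console.target", "System.err"),
        ("log4j.appender.console.layout", "org.apache.log4j.PatternLayout"),
        # Pattern: timestamp, level, short logger name, message, newline.
        ("log4j.appender.console.layout.ConversionPattern",
         "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"),
    ]
    return "\n".join(f"{key}={value}" for key, value in properties) + "\n"


print(render_log4j_properties("WARN"))
```

Raising the root level from INFO to WARN or ERROR is a common way to reduce the amount of log output a node produces.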