In this article we will discuss how to install Apache Spark version 2.4.5 on Ubuntu 20.04 LTS.
Apache Spark is an open-source framework for distributed, general-purpose cluster computing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It offers high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. Apache Spark was originally developed at the University of California, Berkeley's AMPLab; the codebase was later donated to the Apache Software Foundation.
Apache Spark Installation on Ubuntu 20.04
In this article we will walk through the installation step by step. As the installation source, we will use the latest stable Apache Spark release, which was published on February 8, 2020. We will cover:
- Installing Apache Spark on Ubuntu 20.04 LTS
- Testing Spark
Before continuing with the Apache Spark installation, we have to prepare our Ubuntu 20.04 (Focal Fossa) environment. The installation requires a user with sudo or root privileges, sufficient disk space, an up-to-date system, and a good internet connection for downloading the required packages. Apache Spark also depends on Java, Git and Scala, so we have to install these packages first.
Installing Java (OpenJDK 11)
Java installation was covered in a previous article, which we refer to here. Then we check the Java version with the command line below.
ramans@otodiginet:~$ java --version
The output is as follows; the Java version in use is OpenJDK 11.0.7.
openjdk 11.0.7 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-3ubuntu1, mixed mode, sharing)
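Spark 2.4.x runs on Java 8 or 11, so it is worth confirming the major version programmatically before going further. The sketch below parses the major version out of the banner line; it is demonstrated on a captured sample of the output above, since the exact banner depends on your JDK build.

```shell
# One way to pull the major version out of the `java --version` banner,
# shown here against a captured sample line so the parsing can be seen
# in isolation.
sample='openjdk 11.0.7 2020-04-14'
major=$(printf '%s\n' "$sample" | awk '{split($2, v, "."); print v[1]}')
echo "major=$major"
# Against a live JDK you would feed it the real banner instead:
#   java --version 2>&1 | head -n1 | awk '{split($2, v, "."); print v[1]}'
```

Any value of 8 or higher (here, 11) is sufficient for Spark 2.4.5.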
Apache Spark is implemented in the Scala programming language, so we have to install Scala to run it. Scala was chosen for Apache Spark because Scala code lets developers easily access and build on new Spark features. Scala can be installed with the command line below.
ramans@otodiginet:~$ sudo apt-get install scala
We can verify the installation result by querying its version.
ramans@otodiginet:~$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
Then we will test Scala by running its interactive interpreter (REPL).
ramans@otodiginet:~$ scala
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.7).
Type in expressions for evaluation. Or try :help.

scala> println("Hello World")
Hello World
Scala is running properly on Ubuntu 20.04, so we will continue to the next step of the Apache Spark installation.
Installing Apache Spark on Ubuntu 20.04 LTS
1. Download Apache Spark from the source
We will use the latest version of Apache Spark from its official source; at the time of writing, the latest version is 2.4.5.
We use the root account to download the source, creating a directory named 'spark' under /opt first. Then we download the source with the command lines below.
root@otodiginet:~# mkdir -p /opt/spark
root@otodiginet:~# cd /opt/spark
root@otodiginet:/opt/spark# wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
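Before extracting, it is good practice to verify the integrity of the download against the SHA-512 checksum published on the Apache Spark download/archive pages. The sketch below demonstrates the `sha512sum -c` pattern on a stand-in file, since the real checksum value depends on the release you downloaded:

```shell
# Pattern for verifying a downloaded tarball with sha512sum -c, demonstrated
# on a stand-in file. For the real download, put the checksum value published
# by the Apache Spark project into the .sha512 file instead.
workdir=$(mktemp -d)
echo "stand-in for spark-2.4.5-bin-hadoop2.7.tgz" > "$workdir/spark.tgz"
# Record the expected checksum in the "HASH  FILE" format sha512sum expects.
expected=$(sha512sum "$workdir/spark.tgz" | awk '{print $1}')
printf '%s  %s\n' "$expected" "$workdir/spark.tgz" > "$workdir/spark.tgz.sha512"
sha512sum -c "$workdir/spark.tgz.sha512" && echo "checksum ok"
```

If the file was corrupted in transit, `sha512sum -c` reports FAILED and exits non-zero, so it is safe to chain the extraction step after it with `&&`.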
2. Extracting Apache Spark
The next step is to extract the Apache Spark tarball into the /opt/spark directory.
root@otodiginet:/opt/spark# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
All of Spark's files will be placed under the /opt/spark/spark-2.4.5-bin-hadoop2.7/ directory.
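The tar flags used above can be rehearsed safely on a throwaway archive before touching the real tarball; the sketch below builds a tiny gzip'd archive in a temp directory and extracts it the same way (-x extract, -z filter through gzip, -v verbose listing, -f archive file):

```shell
# Demonstration of the -xzvf extraction on a throwaway archive; the
# spark-demo directory name here is purely illustrative.
workdir=$(mktemp -d)
mkdir -p "$workdir/src/spark-demo/bin"
echo 'echo demo' > "$workdir/src/spark-demo/bin/spark-shell"
tar -czf "$workdir/demo.tgz" -C "$workdir/src" spark-demo
mkdir -p "$workdir/opt"
tar -xzvf "$workdir/demo.tgz" -C "$workdir/opt"
test -f "$workdir/opt/spark-demo/bin/spark-shell" && echo "extract ok"
```

The `-C` flag tells tar which directory to extract into, which is handy if you downloaded the tarball somewhere other than /opt/spark.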
3. Configuring Apache Spark Environment
Almost all of Spark's executable scripts are located in /opt/spark/spark-2.4.5-bin-hadoop2.7/sbin and /opt/spark/spark-2.4.5-bin-hadoop2.7/bin. To make administering Spark easier, we will add these directories to the shell profile, so we can invoke the Spark scripts without typing the full path. For this purpose we edit the ~/.bashrc file.
ramans@otodiginet:/opt/spark/spark-2.4.5-bin-hadoop2.7$ vi ~/.bashrc
Add the two lines below at the end of the file.
export SPARK_HOME=/opt/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
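Edits to ~/.bashrc only take effect in new shells; in the current session you can apply them with `source ~/.bashrc` (or by re-running the two export lines) and then confirm the Spark directories actually landed on PATH. A quick check, using the install path from this article:

```shell
# Apply the same two exports as in ~/.bashrc, then confirm the Spark
# directories are on PATH before relying on them.
export SPARK_HOME=/opt/spark/spark-2.4.5-bin-hadoop2.7
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"
echo "$PATH" | grep -q "$SPARK_HOME/sbin" && echo "PATH updated"
```

Once this prints "PATH updated", scripts such as spark-shell and the start/stop scripts in sbin can be called from any directory.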
Testing Apache Spark
1. Start Standalone Master Server
After all files are extracted and configured, we can now test starting the standalone Spark master server by submitting the command line below. Thanks to the PATH setup above, the script can be called from any directory.
ramans@otodiginet:~$ start-master.sh
We can see that the Spark master service is up and the process is listening on TCP port 8080. We can then verify the service in a web browser, as shown below.
2. Starting Spark Worker Process
The Spark master service is running at spark://otodiginet:7077, so we will use this address to start the Spark worker process by submitting the command line below.
ramans@otodiginet:~$ start-slave.sh spark://otodiginet:7077
We can verify the worker service in the web browser as well.
With this, the Apache Spark installation on Ubuntu 20.04 is complete. Have a nice day!