
How To Install Apache Spark On Ubuntu 20.04 LTS

In this article we will discuss how to install Apache Spark version 2.4.5 on Ubuntu 20.04 LTS.

Introduction

Apache Spark is an open-source framework for distributed, general-purpose cluster computing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. Apache Spark was originally developed at the University of California, Berkeley's AMPLab; the codebase was later donated to the Apache Software Foundation.

Apache Spark Installation on Ubuntu 20.04

In this article we will discuss how to install Apache Spark on Ubuntu 20.04 LTS. We will use the latest stable release of Apache Spark, version 2.4.5, which was released on February 8, 2020. Apache Spark needs several prerequisites to run properly on the host system, so we will break the installation down into the following steps:

  • Prerequisites
  • Installing Apache Spark on Ubuntu 20.04 LTS
  • Testing Spark

Prerequisites

Before continuing with the Apache Spark installation, we have to prepare our Ubuntu 20.04 (Focal Fossa) environment. The installation requires a user with sudo or root privileges, sufficient disk space, an up-to-date system, and a good internet connection for downloading the required packages. Apache Spark also requires Java and Scala, so we have to install these packages first.
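For example, before starting we can bring the system up to date with apt (sudo access assumed):

ramans@otodiginet:~$ sudo apt-get update
ramans@otodiginet:~$ sudo apt-get upgrade -y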

Installing Java (OpenJDK 11)

Java installation was covered in a previous article, so we will refer to the Java installation article here.
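In case Java is not installed yet, a minimal way to get OpenJDK 11 on Ubuntu 20.04 is from the default repositories (the package name openjdk-11-jdk is assumed here):

ramans@otodiginet:~$ sudo apt-get install -y openjdk-11-jdk

Then we check the installed Java version with the command below.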

ramans@otodiginet:~$ java --version

The output is as follows; the Java version we use is the latest OpenJDK 11.0.7.

openjdk 11.0.7 2020-04-14
OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-3ubuntu1)
OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-3ubuntu1, mixed mode, sharing)

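Some tools also expect the JAVA_HOME variable to be set. On Ubuntu 20.04 the OpenJDK 11 installation path is typically the one below, but verify it on your system before adding this line to ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64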

Installing Scala

Apache Spark is implemented in the Scala programming language, so we have to install Scala to run Apache Spark. Scala was chosen for Apache Spark because Scala code makes it easy for developers to access and implement new Spark features. Scala can be installed with the command below.

ramans@otodiginet:~$ sudo apt-get install scala


We can verify the installation result by querying its version.

ramans@otodiginet:/opt/spark/spark-2.4.5-bin-hadoop2.7/sbin$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Then we will test Scala by running its interactive shell.

ramans@otodiginet:~$ scala
Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.7).
Type in expressions for evaluation. Or try :help.

scala> println("Hello World")
Hello World


Scala is running properly on Ubuntu 20.04, so we can continue to the next step of the Apache Spark installation.

Installing Apache Spark on Ubuntu 20.04 LTS

1. Downloading Apache Spark

We will use the latest version of Apache Spark from its official source; at the time of writing, the latest Apache Spark version is 2.4.5.


We use the root account to download the package, and we create a directory named 'spark' under /opt. We then download the binary package with the commands below.

root@otodiginet:~# cd /opt
root@otodiginet:/opt# mkdir spark
root@otodiginet:/opt# cd spark
root@otodiginet:/opt/spark# wget http://apachemirror.wuchna.com/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
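If the mirror above no longer serves this version, older releases are kept on the Apache archive, so the same file can be downloaded from there instead:

root@otodiginet:/opt/spark# wget https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz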


2. Extracting Apache Spark

The next step is to extract the Apache Spark tarball into the /opt/spark directory.

root@otodiginet:/opt/spark# tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz
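After extraction, the tarball itself is no longer needed, so it can optionally be removed to save space:

root@otodiginet:/opt/spark# rm spark-2.4.5-bin-hadoop2.7.tgz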


All of Spark's files are now located under the /opt/spark/ directory.

3. Configuring Apache Spark Environment

Almost all of Spark's executable files are located in /opt/spark/spark-2.4.5-bin-hadoop2.7/sbin and /opt/spark/spark-2.4.5-bin-hadoop2.7/bin. To administer the Spark services more easily, we will add these directories to our shell profile, so we can run the Spark scripts without typing the full path. For this purpose we edit the ~/.bashrc file.

ramans@otodiginet:/opt/spark/spark-2.4.5-bin-hadoop2.7$ vi ~/.bashrc


Add the two lines below at the end of the file.

export SPARK_HOME=/opt/spark/spark-2.4.5-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

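To apply the new environment variables to the current shell session, reload the profile:

ramans@otodiginet:~$ source ~/.bashrc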

Testing Apache Spark

1. Starting the Standalone Master Server

After all files are extracted and the environment is configured, we can now start the standalone Spark master server by submitting the command line below. Since the sbin directory is on our PATH, we can call this executable from any directory.

ramans@otodiginet:~$ start-master.sh

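The startup script prints the location of the master log file; we can inspect the log to confirm a clean start (the file name pattern below follows the standalone daemon scripts, so adjust it to your user and hostname):

ramans@otodiginet:~$ tail -n 20 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.master.Master-*.out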

We can see that the Spark master service is up and that the process is listening on TCP port 8080. We can then verify the service in a web browser at http://<server-ip>:8080.

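Alternatively, we can confirm from the terminal that the web UI port is listening, for example with ss:

ramans@otodiginet:~$ ss -ltn | grep 8080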

2. Starting Spark Worker Process

The Spark master service is running on spark://otodiginet:7077, so we will use this address to start the Spark worker process by submitting the command line below.

ramans@otodiginet:~$ start-slave.sh spark://otodiginet:7077


We can verify the worker service in the web browser as well; the new worker should appear on the master's web UI.
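As an additional check, we can start an interactive Spark shell against the master, using the same master URL as above; if the shell comes up and its banner shows the master URL, the cluster is working:

ramans@otodiginet:~$ spark-shell --master spark://otodiginet:7077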

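When we are done, the services can be shut down with the matching stop scripts from the sbin directory:

ramans@otodiginet:~$ stop-slave.sh
ramans@otodiginet:~$ stop-master.sh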

With this, the Spark installation on Ubuntu 20.04 is complete. Have a nice day!
