November 24, 2020

How To Install and Configure Apache Hadoop on Linux Ubuntu 18.04 LTS

Hadoop is a Java-based open-source framework, released under the Apache license, that supports applications working with Big Data. Hadoop runs in an environment that provides distributed storage and computing across clusters of computers/nodes. Hadoop stores data in the Hadoop Distributed File System (HDFS), and the processing of that data is done using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster. In this article, we will explain how to install and configure Apache Hadoop on Ubuntu 18.04.

This article explains how to install Hadoop Version 2 on Ubuntu 18.04. We will install HDFS (NameNode and DataNode), YARN, and MapReduce on a single-node cluster in Pseudo-Distributed Mode, which simulates a distributed cluster on one machine. Each Hadoop daemon (HDFS, YARN, MapReduce) runs as a separate Java process. Before the installation, we first describe the main components of Hadoop and their functions, since these are the components we will configure.

Hadoop framework modules / components

The Hadoop framework consists of four main modules/components, namely:

  • Hadoop Common contains the libraries and utilities needed by the other Hadoop modules
  • Hadoop Distributed File System (HDFS) is a distributed file system that stores data across the machines in the cluster
  • MapReduce is a programming model for large-scale data processing with distributed computing
  • Hadoop YARN (Yet Another Resource Negotiator) is a resource-management platform responsible for managing cluster resources and scheduling applications

Before installing Hadoop on the system, it is recommended to do some preparation first. We have to prepare the hardware (server and operating system included), the required dependencies, and the Hadoop binary file itself. The details are described in the subsections below, covering the hardware and software prerequisites.

Hardware and Software Requirements for Hadoop

Hardware and Platform

For enterprise use, Hadoop should run on server-class machines; high-end machines have more memory, and newer machines are packed with more disks (high storage capacity). For learning purposes, however, we can also run the operating system in a virtual machine (e.g. VirtualBox). The machine must run a supported platform: GNU/Linux is supported as both a development and a production platform, and Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. Any major Linux distribution will be a good environment for installing Hadoop. Here are the main Linux distributions tested for Hadoop:

  • Ubuntu: the Server edition of Ubuntu is a good fit. Long Term Support (LTS) releases are recommended, because they continue to be updated for at least 2 years.
  • Red Hat Enterprise Linux (RHEL): a well-tested Linux distro geared for the enterprise. Comes with Red Hat support.
  • CentOS: a source-compatible distro with RHEL, free, and very popular for running Hadoop. Use a later version (6.x).

Throughout this article, we will set up Hadoop as a single-node cluster running on Ubuntu 18.04 LTS.

Software

Hadoop requires Java and the SSH service running on Linux; the details are below:

  • Oracle Java Development Kit (JDK); we will use JDK 12
  • Hadoop 2.8.5, which can be downloaded from https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/
  • SSH service (OpenSSH server and client)

Hadoop Installation Steps

Adding a New User for Hadoop

First, we verify the OS release on which we will install Hadoop. Then we create a new user and group dedicated to Hadoop; we will call this user hadoop. Root privileges are required for this step (or sudo for sudoer users). We can use the commands:

# lsb_release -a
# adduser hadoop
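
The remaining commands shown with the $ prompt are meant to be run as this new hadoop user; we can switch to it with:

# su - hadoop
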
Installing and Configuring the Oracle Java Development Kit (JDK)

This step was described in the previous article, How to Install Oracle Java 12 in Linux Ubuntu 18.04, so we skip it here. To verify, we can check the Java version on the system using the command:

# java -version



Configuring SSH Server Service

We need to install the OpenSSH server and also the OpenSSH client; this service encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks. This can be done with the command:

# apt-get install openssh-server openssh-client
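
After the packages are installed, we can optionally confirm that the SSH service is running (on Ubuntu 18.04 the systemd unit is named ssh):

# systemctl status ssh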

After completing the SSH installation, we have to generate a public/private key pair on the system and add the public key to the authorized keys. This can be done with the following commands:

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
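
If password-less login still prompts for a password afterwards, check the key file permissions, since sshd refuses an authorized_keys file that is group- or world-writable:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys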

So far we have succeeded in installing and configuring the SSH service; the next step is to verify the password-less SSH configuration with the command:

$ ssh localhost

Hadoop Installation and Configuration of Related XML Files

Once the environment is ready for Hadoop deployment, we download the Hadoop 2.8.5 binary from the official Apache website and place it in the hadoop user's home directory (/home/hadoop/).
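
For example, working as the hadoop user, the archive can be fetched directly with wget from the download page listed earlier (the exact mirror path may vary):

$ cd /home/hadoop
$ wget https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/hadoop-2.8.5.tar.gz

Then extract the file with the command: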

$ tar -xzf hadoop-2.8.5.tar.gz

Next, we have to modify the profile of the hadoop user by editing the ~/.bashrc file located in the /home/hadoop/ directory.

Append these lines to the file:

 export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
 export HADOOP_INSTALL=$HADOOP_HOME
 export HADOOP_MAPRED_HOME=$HADOOP_HOME
 export HADOOP_COMMON_HOME=$HADOOP_HOME
 export HADOOP_HDFS_HOME=$HADOOP_HOME
 export YARN_HOME=$HADOOP_HOME
 export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
 export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
 export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
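
To apply these settings to the current shell session (new logins will pick them up automatically), reload the profile and check that the variables took effect:

$ source ~/.bashrc
$ echo $HADOOP_HOME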

In the $HADOOP_HOME/etc/hadoop directory (i.e. /home/hadoop/hadoop-2.8.5/etc/hadoop), find the hadoop-env.sh file and set JAVA_HOME and HADOOP_CONF_DIR in it.
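
For example (the JAVA_HOME path below is an assumption; point it at wherever your JDK 12 is actually installed):

 # Adjust JAVA_HOME to the actual JDK install location on your system
 export JAVA_HOME=/usr/lib/jvm/jdk-12
 export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop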

Configuration Changes in the core-site.xml File

The next step is to configure the XML files that define the specific service locations in Hadoop. Edit the core-site.xml file in the etc/hadoop directory inside the Hadoop home directory and add the following entries:

 <configuration>
   <property>
     <name>fs.defaultFS</name>
     <value>hdfs://localhost:9000</value>
   </property>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/home/hadoop/hadooptmpdata</value>
   </property>
 </configuration>
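
The hadoop.tmp.dir value above points to a directory that does not exist yet; create it as the hadoop user before starting the daemons:

$ mkdir -p /home/hadoop/hadooptmpdata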


