Hadoop is a Java-based open-source framework under the Apache license that supports applications running on Big Data. Hadoop runs in an environment that provides distributed storage and computing across clusters of computers (nodes). Hadoop stores data in the Hadoop Distributed File System (HDFS), and processing of this data is done using MapReduce. YARN provides an API for requesting and allocating resources in the Hadoop cluster. In this article, we will explain how to install and configure Apache Hadoop on Ubuntu 18.04.
This article explains how to install Hadoop Version 2 on Ubuntu 18.04. We are going to install HDFS (NameNode and DataNode), YARN, and MapReduce on a single-node cluster in Pseudo-Distributed Mode, which simulates a distributed deployment on one machine. Each Hadoop daemon (HDFS, YARN, MapReduce) runs as a separate Java process. Before the installation steps, we first describe the main components of Hadoop and their functions, since these are the components that will be configured.
Hadoop framework modules / components
The Hadoop framework consists of four main modules/components, namely:
- Hadoop Common contains the libraries and utilities needed by the other Hadoop modules
- Hadoop Distributed File System (HDFS) is the distributed file system that stores data across the machines in the cluster
- MapReduce is a programming model for large-scale data processing with distributed computing
- Hadoop YARN (Yet Another Resource Negotiator) is a resource-management platform responsible for managing cluster resources and scheduling jobs
Before installing Hadoop on the system, it is recommended to do some preparation first. We have to prepare the hardware (including the server and operating system), the required dependency applications, and the Hadoop binary file itself. The details are described in the subsections below, covering both hardware and software prerequisites.
Hardware and Software Requirements for Hadoop
Hardware and Platform
For enterprise use, Hadoop should run on server-class machines: high-end machines have more memory, and newer machines are packed with many more disks (high storage capacity). For learning purposes, however, we can also use a virtual machine (for example in VirtualBox) with the operating system installed on it. The machine must run a supported platform; GNU/Linux is supported as both a development and a production platform, and Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. Most Linux distributions are a good environment for installing Hadoop. Here are the main Linux distributions that have been tested for Hadoop:
- Ubuntu, The Server edition of Ubuntu is a good fit. Long Term Support (LTS) releases are recommended, because they continue to be updated for at least 2 years.
- Red Hat Enterprise Linux (RHEL), A well-tested Linux distribution that is geared for the enterprise, and comes with Red Hat support.
- CentOS, A source-compatible distribution with RHEL. Free, and very popular for running Hadoop. Use a recent version (6.x or later).
Throughout this article, we will set up Hadoop as a single-node cluster running on Ubuntu 18.04 LTS.
Hadoop requires the following software, running on Linux; the details are below:
- Oracle Java Development Kit (JDK); we will use JDK 12.
- Hadoop 2.8.5, which can be downloaded from https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/
- SSH service
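As a quick sanity check, the software prerequisites above can be verified with a short shell snippet. It only reports whether the `java` and `ssh` tools are on the PATH; it does not check versions:

```shell
#!/bin/sh
# Report whether each required tool is available on the PATH.
for tool in java ssh; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

If either line reports MISSING, install that prerequisite before continuing with the steps below.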
Hadoop Installation Steps
Adding new user for Hadoop
First we verify the OS release on which we will install Hadoop. Then we create a new user and group specifically for Hadoop use; we call this user hadoop. We need root privileges for this purpose (or sudo for sudoer users). We can use the commands:
# lsb_release -a
# adduser hadoop
Installing and Configuring the Oracle Java Development Kit (JDK)
This step was described in the previous article, How to Install Oracle Java 12 in Linux Ubuntu 18.04, so we skip it here. To verify, we can check the Java version on the system with the command:
# java -version
Configuring SSH Server Service
We need to install the OpenSSH server and also the OpenSSH client. This service encrypts all traffic (including passwords) to effectively eliminate eavesdropping, connection hijacking, and other network-level attacks. The installation can be done with the command:
# apt-get install openssh-server openssh-client
After completing the SSH installation, we have to generate a public/private key pair on the system and authorize it for password-less login. This can be done with the following commands:
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
So far we have succeeded in installing and configuring the SSH service; the next step is to verify the password-less SSH configuration with the command:
# ssh localhost
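If ssh localhost still prompts for a password, a common cause is over-permissive file modes on the key files, since sshd rejects keys that are group- or world-readable. A minimal fix, assuming the default ~/.ssh layout created above:

```shell
# Tighten permissions so sshd will accept the authorized key.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
```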
Hadoop Installation and Configuring Related xml Files
Once the environment is ready for Hadoop deployment, we download the Hadoop 2.8.5 binary file from the official Apache website. Place the file in the hadoop user's home directory (/home/hadoop/), then extract it with the command:
# tar -xzf hadoop-2.8.5.tar.gz
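For completeness, the download and extraction can be done in one go as the hadoop user. The URL below is the Apache dist mirror given in the prerequisites section; the `ls` at the end is only a quick check that the distribution unpacked:

```shell
# Fetch and unpack the Hadoop 2.8.5 binary distribution into /home/hadoop.
cd /home/hadoop
wget https://www.apache.org/dist/hadoop/core/hadoop-2.8.5/hadoop-2.8.5.tar.gz
tar -xzf hadoop-2.8.5.tar.gz
ls hadoop-2.8.5    # should list bin/, sbin/, etc/, ...
```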
After extraction, we have to modify the profile of the hadoop user by editing the ~/.bashrc file in the hadoop home directory. Append these lines to the file:
export HADOOP_HOME=/home/hadoop/hadoop-2.8.5
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Apply the changes to the current session with:
$ source ~/.bashrc
Next, in the Hadoop configuration directory (etc/hadoop under the extracted distribution, i.e. /home/hadoop/hadoop-2.8.5/etc/hadoop), find the hadoop-env.sh file and append the JAVA_HOME setting so that Hadoop knows where the JDK is installed.
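The line appended to hadoop-env.sh typically just sets JAVA_HOME explicitly. The path below assumes a JDK installed under /usr/lib/jvm and is illustrative only; adjust it to the actual JDK location on your system:

```shell
# hadoop-env.sh: point Hadoop at the installed JDK.
# The path is an example; replace it with your actual JDK directory.
export JAVA_HOME=/usr/lib/jvm/jdk-12
```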
Configuration Changes in core-site.xml file
The next step is to configure the XML files that define the locations of specific services in Hadoop. Edit the core-site.xml file in the etc/hadoop directory under the Hadoop home directory and add the entry that defines the default file system.
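As an illustration, a minimal core-site.xml for a pseudo-distributed setup usually defines only the default file system URI. The hdfs://localhost:9000 value below is the conventional choice for a single-node cluster, not a requirement; the host and port can be changed to suit your environment:

```xml
<configuration>
  <!-- URI of the default file system: the local NameNode on port 9000 -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```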