Summary -

This topic covers the following sections -

Hadoop is supported on the GNU/Linux platform, and installation is straightforward in a Linux environment. To install Hadoop on any OS other than Linux, run it inside a Linux virtual machine using VirtualBox. We will discuss Hadoop installation in both the Linux and VirtualBox environments.

Installing/verifying Java

Java is the main prerequisite for installing and running Hadoop. The latest version of the Java JDK should be installed on the system. Check the installed version with the command below.

$ java -version

If Java is already installed on the machine, the command gives the following output.

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12) 
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode) 

If Java is not installed, install it with the steps below.

a.1:

Download the JDK from the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Choose jdk-8u<latest-version>-linux-x64.tar.gz from the list for a 64-bit system; for a different architecture, choose the matching package.

For example, if the latest-version is 161, the file to download would be jdk-8u161-linux-x64.tar.gz.
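
Optionally, verify the downloaded archive before unpacking it. This is a quick check, assuming a SHA-256 checksum is published alongside the download; compare the printed hash with the published value:

$ sha256sum jdk-8u161-linux-x64.tar.gz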

a.2:

Change to the target directory where Java is to be installed, and move the .tar.gz file there.

a.3:

Unpack the tarball and install the JDK.

$ tar zxvf jdk-8u<latest-version>-linux-x64.tar.gz

For example, if the latest-version is 161, the command would be

$ tar zxvf jdk-8u161-linux-x64.tar.gz

The Java Development Kit files are installed in a directory called jdk1.8.0_<latest-version> in the current directory.

a.4:

Delete the .tar.gz file to save disk space.

$ rm jdk-8u<latest-version>-linux-x64.tar.gz
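
For the verification below to succeed, the shell must be able to find the newly unpacked JDK. A minimal sketch, assuming the JDK was unpacked to /usr/local/jdk1.8.0_161 (adjust to your actual target directory):

$ export JAVA_HOME=/usr/local/jdk1.8.0_161
$ export PATH=$PATH:$JAVA_HOME/bin

Add the same two lines to ~/.bashrc to make the setting persist across sessions.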

Now verify the Java version from the terminal.

$ java -version 

It produces the output below -

java version "1.8.0_161" 
Java(TM) SE Runtime Environment (build 1.8.0_161-b12) 
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)  

Pre-installation Setup

A working Linux installation must be in place before proceeding with the Hadoop installation. Some additional pre-installation setup is also required in the Linux environment. The steps below describe how to set up the Linux environment -

b.1: Creating a User Group and User

Creating a dedicated Hadoop group and user is not mandatory, but it is recommended before installing Hadoop. Open the Linux terminal and type the command below to create the group.

$ sudo addgroup hadoop

If the command is successful, you will see the messages below and the command prompt will return.

$ sudo addgroup hadoop
Adding group 'hadoop' (GID 1001) ...
Done.
$

Type below command to create user.

$ sudo adduser --ingroup hadoop hdpuser

If the command is successful, you will be prompted for the details shown below.

$ sudo adduser --ingroup hadoop hdpuser
Adding user 'hdpuser' ...
Adding new user 'hdpuser' (1002) with group 'hadoop' ...
Creating home directory '/home/hdpuser' ...
Copying files from '/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hdpuser
Enter the new value, or press ENTER for the default
	Full Name []:
	Room Number []:
	Work Phone []:
	Home Phone []:
	Other []:
	Is the information correct? [Y/n] Y
$

Once the command prompt returns, the user has been created successfully.
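
You can confirm the new user and its group membership with the id command; the output should list hadoop as the user's group:

$ id hdpuser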

b.2: SSH Server installation.

SSH is used for remote login. Hadoop requires SSH to manage its nodes, i.e. remote machines as well as the local machine, and it needs an SSH server to log in to localhost. Installing the SSH server is therefore a mandatory step and must complete successfully.

b.2.1:

Installation of SSH: Type the commands below to install the SSH server and pdsh (a parallel remote-shell utility used by Hadoop's helper scripts).

$ sudo apt-get install ssh
$ sudo apt-get install pdsh
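
Before continuing, you can confirm the SSH daemon is running (a quick check; the exact service-management command varies by distribution):

$ sudo service ssh status

Note that when pdsh is installed, Hadoop's start scripts may try to use pdsh's default rcmd backend; exporting PDSH_RCMD_TYPE=ssh in your shell profile tells pdsh to use SSH instead.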

b.2.2:

After installing the SSH server, log in as the newly created user. This step is optional.

Type the command below to log in as the user created earlier

$ su - hdpuser

b.2.3:

Generate Key Pairs: As a next step, create the SSH key pair using the ssh-keygen command.

$ ssh-keygen -t rsa -P ""

b.2.4:

After successful key generation, the public key needs to be appended to the authorized_keys file to enable login without a password prompt.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

Or

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

b.2.5:

Restrict the permissions of the file that contains the authorized keys.

$ chmod 0600 ~/.ssh/authorized_keys
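
You can now verify that passwordless login to localhost works; the first connection may ask you to confirm the host key, but it should not prompt for a password:

$ ssh localhost
$ exit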

b.3: Downloading Hadoop

Download and extract Hadoop 3.1.0 from the Apache Software Foundation using the following commands.

$ wget http://www-us.apache.org/dist/hadoop/common/hadoop-3.1.0/hadoop-3.1.0.tar.gz

Once the download completes, untar the tarball archive

$ tar -xzf hadoop-3.1.0.tar.gz

Rename the extracted folder to hadoop to avoid confusion in later paths

$ mv hadoop-3.1.0 hadoop
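
A quick listing of the renamed folder shows the layout used in the following steps - bin/ holds the command-line tools, etc/hadoop/ the configuration files, and sbin/ the start/stop scripts:

$ ls hadoop
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share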

Installing Hadoop on a Single Node (Pseudo-Distributed Mode)

The processing steps below show the installation and configuration of Hadoop on a single node.

c.1. Setting Up Hadoop

c.1.1:

As a first step, update the .bashrc file with a few environment variables.

$ vi ~/.bashrc

Copy the information below into .bashrc and save it

# Set Hadoop-related environment variables.
# HADOOP_PREFIX points to the Hadoop home directory.
export HADOOP_PREFIX=/home/hdpuser/hadoop
# JAVA_HOME points to the Oracle JDK home directory
# (we will also configure JAVA_HOME directly for Hadoop later).
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Add the Hadoop bin/ and sbin/ directories to PATH.
export PATH=$PATH:$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_PREFIX/sbin
export HADOOP_MAPRED_HOME=${HADOOP_PREFIX}
export HADOOP_COMMON_HOME=${HADOOP_PREFIX}
export HADOOP_HDFS_HOME=${HADOOP_PREFIX}
export YARN_HOME=${HADOOP_PREFIX}
:wq
$
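
Reload the file so the variables take effect in the current shell, and confirm that the hadoop command is now resolved from the new PATH:

$ source ~/.bashrc
$ hadoop version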

c.1.2:

Update the hadoop-env.sh file with the Java home directory; only JAVA_HOME needs to be set in this file. In Hadoop 3.x, the configuration files live under etc/hadoop/ inside the installation directory.

$ vi /home/hdpuser/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
:wq
$

c.1.3:

Update the core-site.xml file to set the Hadoop temp directory and the default filesystem name.

$ mkdir /home/hdpuser/tmp
$ vi /home/hdpuser/hadoop/etc/hadoop/core-site.xml

Add the properties below between the configuration tags (<configuration></configuration>).

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdpuser/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

c.1.4:

Update mapred-site.xml to set the MapReduce framework. Hadoop 3.x no longer uses a JobTracker; MapReduce jobs run on YARN, so the yarn framework name replaces the old job-tracker property.

$ vi /home/hdpuser/hadoop/etc/hadoop/mapred-site.xml

Add the property below between the configuration tags (<configuration></configuration>)

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

c.1.5:

Update the hdfs-site.xml file to set the replication factor. On a single-node setup there is only one DataNode, so each HDFS block can be stored in only one copy.

$ vi /home/hdpuser/hadoop/etc/hadoop/hdfs-site.xml

Add the property below between the configuration tags (<configuration></configuration>)

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The configuration is now complete.
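
To confirm Hadoop picks up the new configuration, you can read a key back with the getconf tool; it should print the value set in core-site.xml:

$ hdfs getconf -confKey fs.defaultFS
hdfs://localhost:9000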

c.2: Verify the Hadoop Installation

c.2.1:

Format the NameNode before running Hadoop for the first time.

$ hdfs namenode -format

Remember, this is a one-time activity: it should be done only before the first run of Hadoop, and never afterwards, since reformatting erases the HDFS metadata.

c.2.2: Start Hadoop:

There are two scripts to start Hadoop: one for HDFS and one for YARN (in Hadoop 3.x, start-yarn.sh replaces the older start-mapred.sh).

$ start-dfs.sh 
$ start-yarn.sh

c.2.3:

To check which Hadoop services are running, use the jps command, which lists the Java processes on the machine:

$ jps
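
If both scripts started cleanly, the output lists the HDFS and YARN daemons, for example (the process IDs will differ):

12345 NameNode
12456 DataNode
12567 SecondaryNameNode
12678 ResourceManager
12789 NodeManager
12890 Jps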

c.2.4: Accessing Hadoop on Browser

In Hadoop 3.x, the default port to access the NameNode web interface is 9870 (older versions used 50070). Use the following URL to get the Hadoop services on the browser.

http://localhost:9870/

c.2.5: Verify All Applications for Cluster

The default port to access all applications of the cluster (the YARN ResourceManager UI) is 8088. Use the following URL to visit this service.

http://localhost:8088/

c.2.6: To stop Hadoop, use the commands below

$ stop-dfs.sh 
$ stop-yarn.sh