Hadoop Installation on Linux Systems
If you have ever installed Hadoop on any system, you know how painful and unnecessarily tiresome the setup process can be. In this tutorial we will walk through installing Hadoop on a Linux system. I will also point out the common mistakes I ran into while installing it on my own machines and on those of my colleagues.
Downloading Requirements
I recommend installing Hadoop from the terminal, since it provides an easy way to check whether each step of the installation succeeded. This tutorial assumes an open terminal throughout. On most Ubuntu systems you can open one with Ctrl+Alt+T. Once the terminal is open, download the requirements with the following command.
sudo apt update && sudo apt install openjdk-8-jdk
The first command refreshes the package lists from the configured repositories; the second installs OpenJDK 8, which this installation requires. To verify the Java installation, run the following command:
java -version
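The first line of the output reports the Java version. OpenJDK 8 still uses the legacy 1.8.x numbering, so a script that needs the major version has to strip the leading "1." prefix. Here is a small sketch of that parsing; the banner string is a stand-in for real `java -version` output, whose exact vendor text varies between builds:

```shell
# Stand-in for the first line of `java -version` output;
# the build number will differ on your machine
banner='openjdk version "1.8.0_392"'
# Strip everything up to the quoted version, then map 1.8 -> 8
major=$(echo "$banner" | sed -E 's/.*"1\.([0-9]+)\.[^"]*".*/\1/')
echo "$major"
```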
Next we will create a dedicated hadoop user on Ubuntu. You will be prompted for a password and some optional information; fill it in as you see fit.
sudo adduser hadoop
Install ssh (secure shell) to enable secure connections between the nodes in the cluster.
sudo apt install ssh
Installing Hadoop
First we need to switch to the new user. You might need to prefix the command with sudo, depending on your system configuration.
su - hadoop
Now configure password-less SSH access for the newly created hadoop user; press Enter at every prompt that follows to accept the defaults.
ssh-keygen -t rsa
Copy the generated public key to the authorized_keys file and set the proper permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys
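The 640 mode matters: with its default StrictModes setting, sshd refuses to use an authorized_keys file that is writable by group or others. The octal digits mean owner read+write (6), group read (4), others nothing (0). A quick demonstration on a throwaway file (the `demo` variable is just for illustration):

```shell
# Create a scratch file and apply the same mode used above
demo=$(mktemp)
chmod 640 "$demo"
# Print the octal permission bits (GNU coreutils stat)
mode=$(stat -c '%a' "$demo")
echo "$mode"
rm -f "$demo"
```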
The first time you connect, you will be asked to confirm the host's key fingerprint. Type yes and hit Enter to add localhost to the known hosts.
ssh localhost
Switch to the hadoop user again (with or without sudo, as before):
su - hadoop
Download Hadoop 3.3.6:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once the download finishes, extract the archive:
tar -xvzf hadoop-3.3.6.tar.gz
Rename the extracted folder to drop the version information. This step is optional, but if you skip it, adjust the paths in the remaining configuration accordingly.
mv hadoop-3.3.6 hadoop
Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. Here I am using nano: paste with Ctrl+Shift+V, then save with Ctrl+X, press Y to confirm, and hit Enter:
nano ~/.bashrc
Append the below lines to the file.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration in the current environment.
source ~/.bashrc
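A quick way to sanity-check the result is to echo the variables, e.g. `echo $HADOOP_HOME`. The PATH line is what makes the `hadoop`, `hdfs`, and `start-*.sh` commands used later resolvable from any directory. A self-contained sketch of its effect, using local assignments so it runs anywhere:

```shell
# Mirror the two PATH entries appended in ~/.bashrc
HADOOP_HOME=/home/hadoop/hadoop
PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
# Show the two entries that were just appended
echo "$PATH" | tr ':' '\n' | tail -n 2
```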
You also need to set JAVA_HOME in the hadoop-env.sh file. Open the Hadoop environment variable file in the text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the line that sets JAVA_HOME, uncomment it, and point it at the JDK:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Configuring Hadoop
First, you will need to create the NameNode and DataNode directories inside the hadoop user's home directory. Run the following command to create both directories:
mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
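The braces are shell brace expansion (a bash feature): the single command expands to two mkdir targets, and -p creates any missing parent directories. A demonstration in a throwaway temp directory:

```shell
# {namenode,datanode} expands to two paths; -p creates parents as needed
base=$(mktemp -d)
mkdir -p "$base"/hdfs/{namenode,datanode}
# ls prints the two directories in alphabetical order
created=$(ls "$base"/hdfs)
echo "$created"
rm -rf "$base"
```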
Next, edit the core-site.xml file to set the default filesystem URI:
nano $HADOOP_HOME/etc/hadoop/core-site.xml
For a single-node setup, localhost is correct as-is; on a multi-node cluster, replace it with your system hostname:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Save and close the file.
Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Set the replication factor and the NameNode and DataNode directory paths as shown below:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
  </property>
</configuration>
Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes. mapreduce.framework.name tells MapReduce jobs to run on YARN, and HADOOP_MAPRED_HOME must point at the Hadoop installation directory:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/hadoop/hadoop</value>
  </property>
</configuration>
Then, edit the yarn-site.xml file:
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Save and close the file.
Starting the Hadoop Cluster
Before starting the Hadoop cluster, you need to format the NameNode as the hadoop user. Run the following command to format the Hadoop NameNode:
hdfs namenode -format
Once the NameNode directory is successfully formatted with the HDFS filesystem, you will see the message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted”.
Then start the Hadoop cluster with the following command.
start-all.sh
You can now check the status of all Hadoop services using the jps command:
jps
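On a healthy single-node setup, jps lists five Hadoop daemons besides Jps itself: NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. A sketch of checking for them in a script; the `sample` variable is a hard-coded stand-in for real jps output, and the PIDs are made up:

```shell
# Sample jps output for illustration; real PIDs will differ
sample='2101 NameNode
2245 DataNode
2466 SecondaryNameNode
2693 ResourceManager
2831 NodeManager
3120 Jps'
missing=0
# -w matches whole words, so "NameNode" does not match "SecondaryNameNode"
for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
  echo "$sample" | grep -qw "$d" || { echo "missing: $d"; missing=1; }
done
[ "$missing" -eq 0 ] && echo "all daemons running"
```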
If the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager daemons all appear in the output, the cluster has started successfully. To stop the cluster, run the following command:
stop-all.sh