Hadoop
The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters of computers
using simple programming models.
Install Hadoop on Ubuntu 14.04
In this chapter, we'll install a single-node Hadoop cluster backed by the Hadoop Distributed File System on Ubuntu.
Installing Updates:
Open a terminal and run the following command:
$ sudo apt-get update
Installing Java:
$ sudo apt-get install default-jdk
Check the Java version:
$ java -version
Installing SSH:
$ sudo apt-get install ssh
Check ssh:
$ which ssh
Create and Set Up SSH Keys
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost.
So, we need to have SSH up and running on our machine and configure it to allow SSH public-key authentication.
$ ssh-keygen -t rsa
When prompted, accept the default file location and an empty passphrase, so that logins can later happen without a password.
The following command adds the newly created key to the list of authorized keys so that Hadoop can use ssh without prompting for a password:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
We can check if ssh works:
$ ssh localhost
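If this still prompts for a password, the permissions on the key files are the usual culprit. As an extra step not in the original walkthrough, tightening them to the standard OpenSSH requirements usually fixes it:
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys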
Install Hadoop
Download Hadoop:
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.1/hadoop-2.7.1.tar.gz
Extract the tar file:
$ tar xvzf hadoop-2.7.1.tar.gz
We want to move the Hadoop installation to the /usr/local/hadoop directory. Create the Hadoop directory, give ownership of it to hduser (this chapter assumes a dedicated hduser account; substitute your own username if you use a different one), and move the extracted files into it:
$ sudo mkdir /usr/local/hadoop
$ sudo chown -R hduser /usr/local/hadoop
$ sudo mv hadoop-2.7.1/* /usr/local/hadoop
Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc
Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed, so that we can set the JAVA_HOME environment variable. Use the following commands:
$ which javac
$ readlink -f /usr/bin/javac
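For example, if readlink prints /usr/lib/jvm/java-7-openjdk-amd64/bin/javac, then JAVA_HOME is everything before /bin/javac, i.e. /usr/lib/jvm/java-7-openjdk-amd64. The value used below assumes this default OpenJDK 7 location; adjust it if your output differs.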
Then open ~/.bashrc and append the following lines at the end of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
Now reload the .bashrc file:
$ source ~/.bashrc
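As a quick sanity check (assuming the PATH entries above were picked up), the hadoop command should now be found and report the installed version:
$ hadoop version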
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
We need to set JAVA_HOME by modifying the hadoop-env.sh file:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above line to hadoop-env.sh ensures that the value of the JAVA_HOME variable will be available to Hadoop whenever it is started up.
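In a stock 2.7.1 tarball, hadoop-env.sh typically already contains a line of the form export JAVA_HOME=${JAVA_HOME}; replace it with the explicit path above rather than adding a duplicate, so that Hadoop does not depend on the variable being set in the calling shell.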
3. /usr/local/hadoop/etc/hadoop/core-site.xml
The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that Hadoop uses when starting up. This file can be used to override the default settings that Hadoop starts with.
First, create a directory for Hadoop's temporary files and give ownership of it to hduser:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser /app/hadoop/tmp
Then open the file and enter the following in between the <configuration></configuration> tags:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary
directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class.
The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
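Note: in Hadoop 2.x the fs.default.name property is deprecated in favour of fs.defaultFS; the old name used above still works, but Hadoop will log a deprecation warning, and you can use the newer name with the same hdfs://localhost:54310 value if you prefer.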
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains a /usr/local/hadoop/etc/hadoop/mapred-site.xml.template file, which has to be copied (or renamed) to mapred-site.xml, as shown below.
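A minimal way to do this, assuming the default locations used throughout this chapter:
$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml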
The mapred-site.xml file is used to specify which framework is being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tags:
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then
jobs are run in-process as a single map
and reduce task.
</description>
</property>
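Note: mapred.job.tracker is a Hadoop 1.x (JobTracker) property; on a 2.x installation that runs MapReduce on YARN, the property that actually selects the framework is mapreduce.framework.name. If you want jobs to be scheduled by YARN rather than run locally, you can additionally add:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>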
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the cluster that is being used. It is used to specify the directories that will hold the namenode and the datanode data on that host.
Before editing this file, we need to create the two directories which will contain the namenode and the datanode data for this Hadoop installation. This can be done using the following commands:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser /usr/local/hadoop_store
Open the file and enter the following content in between the <configuration></configuration> tags:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be
specified when the file is created.
The default is used if replication is not specified
in create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
Format the New Hadoop Filesystem
Now, the Hadoop file system needs to be formatted so that we can start using it. The format command should be issued with write permission, since it creates the current directory under the /usr/local/hadoop_store/hdfs/namenode folder:
$ hadoop namenode -format
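Two notes on this step: in Hadoop 2.x the hadoop namenode -format form is deprecated in favour of the equivalent hdfs namenode -format, and formatting should be done only once, before Hadoop is first started, because re-formatting wipes all data in the existing HDFS.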
Starting Hadoop
Now it's time to start the newly installed single-node cluster. We can use start-all.sh, or start-dfs.sh and start-yarn.sh:
$ start-dfs.sh
$ start-yarn.sh
We can check if it's really up and running:
$ jps
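On a healthy single-node setup, the jps output should list roughly the following daemons (process IDs will differ):
NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps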
Hadoop Web Interfaces
http://localhost:50070/ - web UI of the NameNode daemon
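http://localhost:8088/ - web UI of the YARN ResourceManager (normally available once start-yarn.sh has been run)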
Running a MapReduce Job
Now it's time to run our first Hadoop MapReduce job. We will use one of the examples that come with the Hadoop package.
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar pi 100 100
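The two trailing arguments are the number of map tasks and the number of samples per map; the job prints an estimate of Pi when it finishes. As a further sketch (the /user/hduser paths here are an assumption, not part of the original walkthrough), the bundled wordcount example can be run against files copied into HDFS:
$ hdfs dfs -mkdir -p /user/hduser/input
$ hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /user/hduser/input
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/hduser/input /user/hduser/output
$ hdfs dfs -cat /user/hduser/output/part-r-00000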