Creating a Hadoop Pseudo-Distributed Environment
Hadoop developers usually test their scripts and code on a pseudo-distributed environment (also known as a single node setup), in which all of the Hadoop daemons run simultaneously on a single machine. This allows you to quickly write scripts and test them on limited data sets without having to connect to a remote cluster or pay for EC2 instances. If you're learning Hadoop, you'll probably also want to set up a pseudo-distributed environment to facilitate your understanding of the various Hadoop daemons.
These instructions will help you install a pseudo-distributed environment with Hadoop 2.5.2 on Ubuntu 14.04.
Quick Start
There are a couple of options that will allow you to quickly get up and running if you are not familiar with systems administration on Linux or do not wish to work through the process of installing Hadoop yourself. District Data Labs has provided a Virtual Machine Disk (VMDK), configured exactly as described in the instructions below, available for you to download directly. You can then use this VMDK in the virtualization software of your choice (e.g. VirtualBox or VMWare Fusion). Alternatively, both Hortonworks and Cloudera supply virtual machines for quick download. Be aware that if you do use a Cloudera or Hortonworks distribution, the environment may be subtly different from the one described below.
Click here to download the VMDK we have put together.
If you are using the VMDK supplied by District Data Labs, log in to the machine using the username and password as follows:
username: student
password: password
If you're brave enough to set up the environment yourself, go ahead and move to the next section!
Setting up Linux
Before you can get started installing Hadoop, you'll need to have a Linux environment configured and ready to use. These instructions assume that you can get an Ubuntu 14.04 distribution installed on the machine of your choice, either in a dual-booted configuration or using a virtual machine. Whether you use Ubuntu Server or Ubuntu Desktop is left to your preference, since you'll need to be comfortable working with the command line either way. Personally, I prefer to use Ubuntu Server, since it's more lightweight, and to SSH into it from my host operating system.
Base Environment: Ubuntu x64 Desktop 14.04 LTS
Make sure your system is fully up-to-date and has the required packages by running the following commands:
~$ sudo apt-get update && sudo apt-get upgrade
~$ sudo apt-get install build-essential ssh lzop git rsync curl
~$ sudo apt-get install python-dev python-setuptools
~$ sudo apt-get install libcurl4-openssl-dev
~$ sudo easy_install pip
~$ sudo pip install virtualenv virtualenvwrapper python-dateutil
Creating a Hadoop User
In order to secure our Hadoop services, we will make sure that Hadoop is run as a Hadoop-specific user and group. This user will be able to initiate SSH connections to other nodes in a cluster, but will not have the administrative access needed to damage the operating system on which the services run. Implementing Linux permissions also helps secure HDFS and is the start of preparing a secure computing cluster.
This tutorial is not meant for operational implementation. However, as a data scientist, these permissions may save you some headache in the long run, so it is helpful to have the permissions in place on your development environment. This will also ensure that the Hadoop installation is separate from other software applications and will help organize the maintenance of the machine.
Create the hadoop user and group, then add the student user to the hadoop group:
~$ sudo addgroup hadoop
~$ sudo useradd -m -g hadoop hadoop
~$ sudo usermod -a -G hadoop student
Once you have logged out and logged back in (or restarted the machine), you should be able to see that you've been added to the hadoop group by issuing the groups command. Note that the -m flag creates a home directory for the new user and -g sets hadoop as its primary group.
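If you'd like to double-check that the accounts are set up the way you expect before moving on, a quick sanity check with standard Linux tools looks something like this (the exact IDs and group lists will differ on your machine):
~$ groups student
student : student sudo hadoop
~$ id hadoop
uid=1001(hadoop) gid=1001(hadoop) groups=1001(hadoop)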
Configuring SSH
SSH must be installed on your system to use Hadoop (and it also makes managing the virtual machine easier, especially if you're running a headless Ubuntu). Generate SSH keys for the hadoop user by issuing the following commands:
~$ sudo su hadoop
~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
[... snip ...]
Simply hit enter at all the prompts to accept the defaults and to create a key that does not require a passphrase to authenticate (passwordless SSH is required for Hadoop). In order to allow the key to be used to SSH into the box, copy the public key to the authorized_keys file with the following command:
~$ cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
~$ chmod 600 /home/hadoop/.ssh/authorized_keys
You can also copy this key to your host machine and use it to SSH into the Ubuntu environment. To test the SSH key, issue the following command:
~$ ssh -l hadoop localhost
If this completes successfully without asking you for a password, then you have successfully configured SSH for Hadoop. Exit the SSH session by typing exit; you should be returned to the hadoop user's shell. Type exit again to leave the hadoop user; you should now be back in a terminal window that says student@ubuntu.
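If you'd rather verify the passwordless login in one shot from the student account, a quick sketch that mirrors the sudo -H -u hadoop pattern used later in this guide should also work (the echoed text is just a placeholder, and you may see a one-time warning about adding localhost to the list of known hosts):
~$ sudo -H -u hadoop ssh -o StrictHostKeyChecking=no localhost echo "passwordless ssh ok"
passwordless ssh ok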
Installing Java
Hadoop and most of the Hadoop ecosystem require Java to run. Hadoop requires Java 1.6 or later, and the project used to recommend particular Oracle Java™ versions; it now maintains a report of the various JDKs that work well with Hadoop. Ubuntu does not ship the Oracle JDK in its repositories because it is proprietary code, so we will install OpenJDK instead. For more information on supported Java™ versions, see Hadoop Java Versions, and for information about installing different versions on Ubuntu, please see Installing Java on Ubuntu.
~$ sudo apt-get install openjdk-7-*
Do a quick check to ensure the right version of Java™ is installed:
~$ java -version
java version "1.7.0_65"
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-0ubuntu0.14.04.1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
Hadoop is currently built and tested on both OpenJDK and Oracle's JDK/JRE.
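If you have more than one JDK installed, it is worth confirming which one the java command actually points to, since we will need the matching path for JAVA_HOME later. A quick check looks like this (the path shown is what I'd expect for OpenJDK 7 on 64-bit Ubuntu 14.04; yours may differ):
~$ readlink -f $(which java)
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Everything before /jre/bin/java is the value we will use for JAVA_HOME in the configuration steps below.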
Disabling IPv6
It has been reported for a while now that Hadoop running on Ubuntu has a conflict with IPv6, and ever since Hadoop 0.20, Ubuntu users have been disabling IPv6 on their clustered boxes. It is unclear whether or not this is still a bug in the latest versions of Hadoop; however, in a single-node or pseudo-distributed environment we have no need for IPv6, so it is best to simply disable it and not worry about any potential problems.
Edit the /etc/sysctl.conf file by executing the following command:
~$ gksu gedit /etc/sysctl.conf
Then add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
For this change to take effect, reboot your computer. Once it has rebooted check the status with the following command:
~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If the output is 0, then IPv6 is enabled. If it is 1, then we have successfully disabled IPv6.
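If you'd rather not wait for a reboot, the same settings can usually be applied in place with sysctl; a quick sketch, assuming the lines above were added to /etc/sysctl.conf:
~$ sudo sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
~$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1
Rebooting at some point is still a good way to confirm that the setting persists.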
Installing Hadoop
To get Hadoop, you'll need to download the release of your choice from one of the Apache Download Mirrors. These instructions will download the current stable version of Hadoop with YARN at the time of this writing, Hadoop 2.5.2.
After you've selected a mirror, type the following command into a Terminal window, replacing http://apache.mirror.com/hadoop-2.5.2/ with the mirror URL that you selected and that is best for your region:
~/Downloads$ curl -O http://apache.mirror.com/hadoop-2.5.2/hadoop-2.5.2.tar.gz
You can verify the download by ensuring that its md5sum matches the md5sum published at the mirror:
~/Downloads$ md5sum hadoop-2.5.2.tar.gz
74a7581893a8224540a9417a4c2630da hadoop-2.5.2.tar.gz
Of course, you can use any mechanism you wish to download Hadoop; wget or a browser will work just fine.
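If you prefer to check the digest against something published by Apache rather than the mirror page, checksum files for each release are typically available from the Apache archive as well; a hedged example (the exact URL layout may differ by release):
~/Downloads$ curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.5.2/hadoop-2.5.2.tar.gz.mds
~/Downloads$ cat hadoop-2.5.2.tar.gz.mds
Compare the MD5 line in that file to the output of md5sum above.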
Unpacking
After obtaining the compressed tarball, the next step is to unpack it. You can use an Archive Manager or simply follow the instructions below. The most significant decision you have to make is where to unpack Hadoop to.
The Linux operating system depends upon a hierarchical directory structure to function. At the root, many directories that you've heard of have specific purposes:
- /etc is used to store configuration files
- /home is used to store user-specific files
- /bin and /sbin include programs that are vital for the OS
- /usr/sbin is for programs that are not vital but are system-wide
- /usr/local is for locally installed programs
- /var is used for program data, including caches and logs
You can read more about these directories in this Stack Exchange post.
Two good choices for where to put Hadoop are the /opt and /srv directories:
- /opt contains non-packaged programs, usually source. A lot of developers stick their code there for deployments.
- /srv stands for services. Hadoop, HBase, Hive, and others run as services on your machine, so this seems like a great place to put things, and it's a standard location that's easy to get to. So let's stick everything there!
Enter the following commands:
~/Downloads$ tar -xzf hadoop-2.5.2.tar.gz
~/Downloads$ sudo mv hadoop-2.5.2 /srv/
~/Downloads$ sudo chown -R hadoop:hadoop /srv/hadoop-2.5.2
~/Downloads$ sudo chmod g+w -R /srv/hadoop-2.5.2
~/Downloads$ sudo ln -s /srv/hadoop-2.5.2 /srv/hadoop
These commands unpack Hadoop, move it to the service directory where we will keep all of our Hadoop and cluster services, and then set permissions. Finally, we create a symlink to the version of Hadoop that we would like to use; this will make it easy to upgrade our Hadoop distribution in the future.
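Before moving on, a quick look at /srv is a useful sanity check (the listing details will differ on your machine):
~$ ls -l /srv
You should see the hadoop-2.5.2 directory owned by hadoop:hadoop and a hadoop symlink pointing at it.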
Environment
In order to ensure everything executes correctly, we are going to set some environment variables so that Hadoop executes in its correct context. Enter the following command to open a text editor with the hadoop user's profile so that we can change its environment variables.
/srv$ gksu gedit /home/hadoop/.bashrc
Add the following lines to this file:
# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
We'll also add some convenience functionality to the student user environment. Open the student user bash profile file with the following command:
~$ gedit ~/.profile
Add the following contents to that file:
# Set the Hadoop Related Environment variables
export HADOOP_HOME=/srv/hadoop
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar
export PATH=$PATH:$HADOOP_HOME/bin
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Helpful Aliases
alias ..="cd .."
alias ...="cd ../.."
alias hfs="hadoop fs"
alias hls="hfs -ls"
These simple aliases may save you a lot of typing in the long run! Feel free to add any other helpers that you think might be useful in your development work.
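Note that changes to ~/.profile are only picked up by new login shells. To use the new variables and aliases in your current terminal, source the file first and confirm one of the variables, for example:
~$ source ~/.profile
~$ echo $HADOOP_HOME
/srv/hadoop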
Check that your environment configuration has worked by running a Hadoop command:
~$ hadoop version
Hadoop 2.5.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r cc72e9b000545b86b75a61f4835eb86d57bfafc0
Compiled by jenkins on 2014-11-14T23:45Z
Compiled with protoc 2.5.0
From source with checksum df7537a4faa4658983d397abf4514320
This command was run using /srv/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar
If that ran with no errors and displayed an output similar to the one above, then everything has been configured correctly up to this point.
Hadoop Configuration
The penultimate step in setting up Hadoop as a pseudo-distributed node is to edit the configuration files for the Hadoop environment, the MapReduce site, the HDFS site, and the YARN site. Most of this work is simply editing XML configuration files.
Edit the hadoop-env.sh file by entering the following on the command line.
~$ gedit $HADOOP_HOME/etc/hadoop/hadoop-env.sh
The most important part of this configuration is to change the following line:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Next, edit the core site configuration file:
~$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml
Replace the <configuration></configuration>
with the following:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/app/hadoop/data</value>
</property>
</configuration>
Edit the MapReduce site configuration by first copying the template and then opening the file for editing:
~$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
$HADOOP_HOME/etc/hadoop/mapred-site.xml
~$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml
Replace the <configuration></configuration>
with the following:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Now edit the HDFS site configuration by editing the following file:
~$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Replace the <configuration></configuration>
with the following:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Finally, edit the YARN site configuration file:
~$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml
And update the configuration as follows:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8050</value>
</property>
</configuration>
With these files edited, Hadoop should be fully configured as a pseudo-distributed environment.
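If you want to confirm that Hadoop is actually picking up these values, the hdfs getconf tool can print individual configuration keys; a small sketch (you may also see a deprecation warning, since fs.default.name is the older name for fs.defaultFS):
~$ hdfs getconf -confKey fs.default.name
hdfs://localhost:9000
~$ hdfs getconf -confKey dfs.replication
1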
Formatting the Namenode
The final step before we can turn Hadoop on is to format the namenode. The namenode is in charge of HDFS, the distributed file system. The namenode on this machine is going to keep its files in the /var/app/hadoop/data directory. We need to initialize this directory and then format the namenode to properly use it.
~$ sudo mkdir -p /var/app/hadoop/data
~$ sudo chown hadoop:hadoop -R /var/app/hadoop
~$ sudo su hadoop
~$ hadoop namenode -format
You should see a bunch of Java messages scrolling down the page if the namenode has executed successfully. There should be directories inside of the /var/app/hadoop/data directory, including a dfs directory. If that is what you see, then Hadoop should be all set up and ready to use!
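A quick, hedged check of what the format step created (the exact layout can vary a bit between Hadoop versions):
~$ ls /var/app/hadoop/data/dfs
name
The data and namesecondary directories should appear later, once the daemons have been started.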
Starting Hadoop
At this point we can start and run our Hadoop daemons. When you formatted the namenode, you switched to the hadoop user with the sudo su hadoop command. If you're still that user, go ahead and execute the following commands:
~$ $HADOOP_HOME/sbin/start-dfs.sh
~$ $HADOOP_HOME/sbin/start-yarn.sh
The daemons should start up and issue messages about where they are logging to and other important information. If you get asked about the host's SSH key, just type yes at the prompt. You can see the processes that are running via the jps command:
~$ jps
4801 Jps
4468 ResourceManager
4583 NodeManager
4012 NameNode
4318 SecondaryNameNode
4150 DataNode
If the processes are not running, then something has gone wrong. You can also access the Hadoop cluster administration site by opening a browser and pointing it to http://localhost:8088. This should bring up a page with the Hadoop logo and a table of applications.
To wrap up the configuration, prepare a space on HDFS for the student account to store data and run analytical jobs:
~$ hadoop fs -mkdir -p /user/student
~$ hadoop fs -chown student:student /user/student
You can now exit from the hadoop user's shell with the exit command.
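Back in the student shell, a simple smoke test of the new HDFS home directory is to copy a small local file in and list it back (the choice of /etc/hosts is arbitrary; it's just a file that always exists):
~$ hadoop fs -put /etc/hosts /user/student/hosts
~$ hadoop fs -ls /user/student
You should see a single entry for the hosts file, owned by student. You can remove it afterwards with hadoop fs -rm /user/student/hosts.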
Restarting Hadoop
If you reboot your machine, the Hadoop daemons will stop running and will not automatically be restarted. If you are attempting to run a Hadoop command and you get a "connection refused" message, it is likely because the daemons are not running. You can check this by issuing the jps command as sudo:
~$ sudo jps
To restart Hadoop in the case that it shuts down, issue the following commands:
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
~$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh
The processes should start up again as the dedicated hadoop user and you'll be back on your way!
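Since you'll be typing those two commands after every reboot, you may want to wrap them in a small helper in the student user's ~/.profile, alongside the aliases added earlier. This is just a convenience sketch; the function name is my own:
# Restart the Hadoop daemons after a reboot
hadoop_start() {
    sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
    sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh
}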
Installing Hive
For the most part, installing services on Hadoop (e.g. Hive, HBase, or others) will consist of the following in the environment we have set up:
- Download the release tarball of the service
- Unpack the release to /srv/ and create a symlink from the release to a simple name
- Configure environment variables with the new paths
- Configure the service to run in pseudo-distributed mode
Hive also follows this pattern. Find the Hive release you wish to download from the Apache Hive downloads page. At the time of this writing, Hive release 0.14.0 is current. Once you have selected a mirror, download the apache-hive-0.14.0-bin.tar.gz file to your Downloads directory. Then issue the following commands in the terminal to unpack it:
~$ tar -xzf apache-hive-0.14.0-bin.tar.gz
~$ sudo mv apache-hive-0.14.0-bin /srv
~$ sudo chown -R hadoop:hadoop /srv/apache-hive-0.14.0-bin
~$ sudo ln -s /srv/apache-hive-0.14.0-bin /srv/hive
Edit your ~/.profile to include these environment variables by adding the following to the bottom of the file:
# Configure Hive environment
export HIVE_HOME=/srv/hive
export PATH=$PATH:$HIVE_HOME/bin
No other configuration for Hive is required, although you can find other configuration details in $HIVE_HOME/conf, including the Hive environment shell file and the Hive site configuration XML.
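A quick way to confirm that Hive is working is to run a trivial query from the command line with the Hadoop daemons running. Note that the first run will create a Derby metastore_db directory in whatever directory you launch Hive from:
~$ hive -e "SHOW DATABASES;"
You should see OK followed by the default database. If Hive complains that it cannot create its scratch or warehouse directories, create them on HDFS as the hadoop user first (for example, hadoop fs -mkdir -p /tmp /user/hive/warehouse followed by hadoop fs -chmod g+w on each) and try again.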
Installing Spark
Installing Spark is also pretty straightforward, and we'll install it similarly to how we installed Hive. Find the Spark release you wish to download from the Apache Spark downloads page. The Spark release at the time of this writing is 1.1.0. You should choose the package type "Pre-built for Hadoop 2.4" and the download type "Direct Download". Then unpack it as follows:
~$ tar -xzf spark-1.1.0-bin-hadoop2.4.tgz
~$ sudo mv spark-1.1.0-bin-hadoop2.4 /srv
~$ sudo chown -R hadoop:hadoop /srv/spark-1.1.0-bin-hadoop2.4
~$ sudo ln -s /srv/spark-1.1.0-bin-hadoop2.4 /srv/spark
Edit your ~/.profile to include the following environment variables at the bottom of the file:
# Configure Spark environment
export SPARK_HOME=/srv/spark
export PATH=$SPARK_HOME/bin:$PATH
After you source your .profile or restart your terminal, you should be able to run a pyspark interpreter locally. You can now use the pyspark and spark-submit commands to run Spark jobs.
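As a final check that Spark is wired up, you can start the pyspark shell and run a tiny computation against the local SparkContext it provides as sc; nothing here touches HDFS:
~$ pyspark
>>> sc.parallelize(range(1000)).sum()
499500
>>> exit()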
Conclusion
At this point you should now have a fully configured Hadoop setup ready for development in pseudo-distributed mode on Ubuntu with HDFS, MapReduce on YARN, Hive, and Spark all ready to go as well as a simple methodology for installing other services.
District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!