Setup Hadoop 2.6.0 on Intel Galileo Gen 2

In this tutorial I will explain how to set up an Apache Hadoop Multi-Node Cluster on Intel Galileo Gen 2 boards. The Master Node will run on a server with Ubuntu Server, and the four Slave Nodes will run on the Galileo boards with a Yocto OS.

The problem is the limited resources of the Galileo boards:

Intel Galileo Gen 2 resources:

  • Intel® Quark™ SoC X1000 application processor, a 32-bit, single-core, single-thread, Intel® Pentium® processor instruction set architecture (ISA)-compatible, operating at speeds up to 400 MHz.
  • Support for a wide range of industry standard I/O interfaces, including a full-sized mini-PCI Express* slot, 100 Mb Ethernet port, microSD* slot, USB host port, and USB client port.
  • 256 MB DDR3, 512 KB embedded SRAM, 8 MB NOR Flash, and 8 KB EEPROM standard on the board, plus support for microSD card up to 32 GB.
  • Support for Yocto 1.4 Poky* Linux release.

  

  1. Install Hadoop
  2. Configure Master Node
  3. Configure Slave Nodes
  4. Start the Cluster 

1. Install Hadoop:

These steps have to be performed on every node (the Master Node and all Slave Nodes) on which you want to run Hadoop.

First, download Hadoop and extract the package to a location of your choice. I am using Hadoop 2.6.0.

 

You also need a recent Java version (Java 8, Java 7, or a late Java 6 release). I am using Java 7.

 

Now add the Hadoop install directory and your Java directory to your system PATH. To make this permanent, add the following lines to your "~/.profile" file and reboot your system. After the reboot, check your environment variables using the "env" command.

 

export HADOOP_PREFIX=/path/to/Hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin

export JAVA_HOME=/path/to/Java
export PATH=$PATH:$JAVA_HOME/bin
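A quick way to confirm that the paths took effect after the reboot is to call the tools directly; both should print a version banner if PATH and JAVA_HOME are set correctly (this only works on a node where Hadoop and Java are already installed):

```shell
# Should print the Hadoop and Java version banners if the paths are set
hadoop version
java -version
```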

 

The next step is really important! You have to set up a static IP on the Galileo boards. To do this, enter the IP in the interfaces file and register the networking script in runlevel five, so that it runs at system startup.

First, create a backup of your interfaces file:

cp /etc/network/interfaces  /etc/network/interfaces.backup

vi /etc/network/interfaces

 

Now change "iface eth0 inet dhcp" to "iface eth0 inet static" and add the address, netmask, and gateway.

# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)

# The loopback interface
auto lo
iface lo inet loopback

# Wireless interfaces
iface wlan0 inet dhcp
        wireless_mode managed
        wireless_essid any
        wpa-driver wext
        wpa-conf /etc/wpa_supplicant.conf

iface atml0 inet dhcp

# Wired or wireless interfaces
auto eth0
iface eth0 inet static
        address x.x.x.x
        netmask 255.255.255.0
        gateway x.x.x.x

#iface eth1 inet dhcp
# Ethernet/RNDIS gadget (g_ether)
# ... or on host side, usbnet and random hwaddr
iface usb0 inet static                         
        address 192.168.7.2                    
        netmask 255.255.255.0                  
        network 192.168.7.0                    
        gateway 192.168.7.1                    
                                               
# Bluetooth networking                         
iface bnep0 inet dhcp              

The last step is to remove the startup script "S05connman" and link the networking script as "S05networking" in runlevel five.

rm /etc/rc5.d/S05connman

ln -s ../init.d/networking /etc/rc5.d/S05networking

Make sure your system is in runlevel five by using the runlevel command:

uc-lab-node-3:~# runlevel
N 5

Now that you have everything in place to run Hadoop, we will modify the Hadoop configuration to run the server as our Master Node and the Galileo boards as our Slave Nodes.


2. Configure Master Node:

We have to modify five Hadoop configuration files to run the cluster (a detailed description of all configuration parameters can be found in the Hadoop documentation). I recommend mapping the IPs of all nodes (Master and Slaves) to concise names, so you don't have to enter the IPs every time. Therefore, add the IPs and the names you have chosen to the hosts file on your Master Node.


Here is an example of the hosts file:

 

127.0.0.1        localhost
x.x.x.x          uc-lab-master-node
x.x.x.x          uc-lab-node-1
x.x.x.x          uc-lab-node-2
x.x.x.x          uc-lab-node-3
x.x.x.x          uc-lab-node-4

 

Now we navigate to the etc/hadoop folder in the Hadoop install directory and open the core-site.xml file.


 

First, we have to set the fs.defaultFS parameter (the successor of the deprecated fs.default.name), which specifies the NameNode host and port.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://uc-lab-master-node:9000</value>
</property>

</configuration>

 

Second, we have to modify the hdfs-site.xml file. Here we set dfs.permissions, which toggles permission checking; dfs.replication, which specifies the default block replication; dfs.namenode.name.dir, which specifies the path on the local filesystem where the NameNode (Master Node) stores the namespace and transaction logs; and dfs.datanode.data.dir, which specifies the path where the blocks are stored on the DataNodes (Slave Nodes). The data on the Slave Nodes is stored on 32 GB microSD cards.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hdfs/namenode</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>

</configuration>
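If formatting or starting the daemons later fails because these paths do not exist or are not writable, they can be created by hand. A sketch, with the paths taken from the config above:

```shell
# On the Master Node: metadata directory for the NameNode
mkdir -p /usr/local/hadoop/hdfs/namenode

# On each Slave Node: block storage on the mounted microSD card
mkdir -p /media/mmcblk0p1/hadoop_data/hdfs/datanode
```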

 

Third, we have to modify the yarn-site.xml. Here we specify the properties for the NodeManager and the ResourceManager. The first two properties (yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce.shuffle.class) configure the shuffle service that MapReduce applications need. The yarn.resourcemanager.resource-tracker.address parameter specifies the host and port the NodeManagers on the Slave Nodes report to, yarn.resourcemanager.scheduler.address specifies the host and port the ApplicationMasters contact to obtain resources from the scheduler, and yarn.resourcemanager.address specifies the host and port clients use to submit jobs.

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>

<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>uc-lab-master-node:8025</value>
</property>

<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>uc-lab-master-node:8030</value>
</property>

<property>
   <name>yarn.resourcemanager.address</name>
   <value>uc-lab-master-node:8050</value>
</property>

</configuration>

 

Fourth, we have to modify the mapred-site.xml. Here we set the mapred.job.tracker parameter, which specifies the JobTracker host and port. The mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address parameters specify the host and port of the JobHistory Server and its Web UI.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>mapred.job.tracker</name>
   <value>uc-lab-master-node:54311</value>
</property>

<property>
   <name>mapreduce.jobhistory.address</name>
   <value>uc-lab-master-node:10020</value>
</property>

<property>
   <name>mapreduce.jobhistory.webapp.address</name>
   <value>uc-lab-master-node:19888</value>
</property>


</configuration>

 

The last file we have to modify is the slaves file, in which all Slave Nodes of the cluster are registered. This is what my slaves file looks like:

uc-lab-node-1
uc-lab-node-2
uc-lab-node-3
uc-lab-node-4

You can also enter the IPs of the slaves if you did not register them in the hosts file.

Now the Master Node is ready and we can configure our Slave Nodes. 


3. Configure Slave Nodes:

The core-site.xml and the yarn-site.xml are the same as on the Master Node, and the slaves file does not need to be modified.
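Since core-site.xml and yarn-site.xml are identical on all nodes, one way to avoid editing them four times is to copy them from the Master Node to each slave. This is just a convenience sketch; it assumes passwordless SSH as root and the same $HADOOP_PREFIX on every node:

```shell
# Copy the two identical config files from the master to every slave
for node in uc-lab-node-1 uc-lab-node-2 uc-lab-node-3 uc-lab-node-4 ; do
    scp $HADOOP_PREFIX/etc/hadoop/core-site.xml \
        $HADOOP_PREFIX/etc/hadoop/yarn-site.xml \
        root@$node:$HADOOP_PREFIX/etc/hadoop/
done
```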

First, we modify the mapred-site.xml.template. Copy or rename the file to mapred-site.xml and add the following property. The mapred.job.tracker parameter specifies the host and port that the MapReduce JobTracker runs on.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
   <name>mapred.job.tracker</name>
   <value>uc-lab-master-node:54311</value>
</property>

</configuration>

 

Second, we modify the hdfs-site.xml. This file looks almost like the one on the Master Node; we just have to remove the dfs.namenode.name.dir property.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>

</configuration>

 

Now we can start the cluster. 


 4. Start the Cluster:

First, go to the Hadoop install directory on your Master Node. There you have to format the NameNode by executing the following command:

galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ ./bin/hadoop namenode -format

 

This creates the NameNode directory defined in the hdfs-site.xml.

To start the cluster, I wrote a script that makes this easier and ensures that everything starts in the right order:

#!/bin/bash
# Starts the cluster in order: HDFS daemons first, then YARN, then the history server.
# Usage: sh start-cluster.sh etc/hadoop/slaves

slaveList=$1
dataNodes=$(cat "$slaveList")

# HDFS daemons on the Master Node
hadoop-daemon.sh start namenode
hadoop-daemon.sh start secondarynamenode

# DataNodes on the slaves
for node in $dataNodes ; do
    ssh root@$node $HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
done

# YARN daemons
yarn-daemon.sh start resourcemanager

for node in $dataNodes ; do
    ssh root@$node $HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
done

# JobHistory Server on the Master Node
mr-jobhistory-daemon.sh start historyserver

 

This script expects the slaves file, located in the etc/hadoop/ folder, as its argument.
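The script logs into each slave via SSH, so passwordless SSH from the Master Node to all slaves has to be in place before running it. If it is not set up yet, one common way (assuming OpenSSH with ssh-copy-id available on the master) is:

```shell
# Generate a key pair on the Master Node (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Install the public key on every slave listed in the slaves file
for node in $(cat etc/hadoop/slaves) ; do
    ssh-copy-id root@$node
done
```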

sh start-cluster.sh etc/hadoop/slaves

 

This starts the NameNode, SecondaryNameNode, ResourceManager, and JobHistoryServer on the Master Node, and a DataNode and NodeManager on every slave registered in the slaves file.

If you now execute the jps command, it should look like this:

Master:

galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ jps

29406 SecondaryNameNode

29736 JobHistoryServer

29478 ResourceManager

29778 Jps

29313 NameNode

 

Slaves:

root@uc-lab-node-1:/usr/local/hadoop-2.6.0# jps

2912 Jps

2704 NodeManager

2644 DataNode

 

If you get an output like this, the cluster is working and you can enter the Web UI.

To open the Web UI, enter the IP of the Master Node with the default port 50070 (you can change this port by adding the dfs.namenode.http-address parameter to the hdfs-site.xml file).

http://<IP>:50070/
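Besides the Web UI, the cluster state can also be checked from the command line on the Master Node. The dfsadmin report should list all four DataNodes as live, and the examples jar bundled with the Hadoop 2.6.0 distribution offers a quick smoke test:

```shell
# Should report four live DataNodes
./bin/hdfs dfsadmin -report

# Optional smoke test: estimate pi with 4 map tasks x 100 samples each
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 4 100
```

Given the boards' 256 MB of RAM, expect the example job to run noticeably slower than on commodity hardware.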
