In this tutorial I will explain how you can set up an Apache Hadoop Multi-Node Cluster on Intel Galileo Gen 2 boards. The Master Node will run on a server with Ubuntu Server as its OS, and the four Slave Nodes will run on the Galileo boards with the Yocto OS.
The problem is the limited resources of the Galileo boards:
Intel Galileo Gen 2 resources:
- Intel® Quark™ SoC X1000 application processor, a 32-bit, single-core, single-thread, Intel® Pentium® processor instruction set architecture (ISA)-compatible, operating at speeds up to 400 MHz.
- Support for a wide range of industry standard I/O interfaces, including a full-sized mini-PCI Express* slot, 100 Mb Ethernet port, microSD* slot, USB host port, and USB client port.
- 256 MB DDR3, 512 KB embedded SRAM, 8 MB NOR flash, and 8 KB EEPROM standard on the board, plus support for microSD cards up to 32 GB.
- Support for Yocto 1.4 Poky* Linux release.
The tutorial consists of four steps:
- Install Hadoop
- Configure Master Node
- Configure Slave Nodes
- Start the Cluster
1. Install Hadoop:
You have to perform these steps on every node (the Master Node and all Slave Nodes) on which you want to run Hadoop.
First, download Hadoop and extract the package to a location of your choice. I am using Hadoop 2.6.0.
You also need a reasonably recent Java version (Java 8, Java 7, or a late Java 6 release). I am using Java 7.
Now add the Hadoop install directory and your Java directory to your system path. To make these settings permanent, add the following lines to your "~/.profile" file and reboot your system. After the reboot, check your environment variables with the "env" command.
HADOOP_PREFIX=/path/to/Hadoop
export HADOOP_PREFIX
PATH=$PATH:$HADOOP_PREFIX/bin
export PATH
PATH=$PATH:$HADOOP_PREFIX/sbin
export PATH
JAVA_HOME=/path/to/Java
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH
The next step is really important! You have to set up a static IP for the Galileo boards. To do this, enter the IP in the interfaces file and register the networking script in runlevel five, so that it runs at system startup.
First, create a backup of your interfaces file:
cp /etc/network/interfaces /etc/network/interfaces.backup
vi /etc/network/interfaces
Now change "iface eth0 inet dhcp" to "iface eth0 inet static" and add the address, netmask, and gateway.
# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)
# The loopback interface
auto lo
iface lo inet loopback
# Wireless interfaces
iface wlan0 inet dhcp
wireless_mode managed
wireless_essid any
wpa-driver wext
wpa-conf /etc/wpa_supplicant.conf
iface atml0 inet dhcp
# Wired or wireless interfaces
auto eth0
iface eth0 inet static
address x.x.x.x
netmask 255.255.255.0
gateway x.x.x.x
#iface eth1 inet dhcp
# Ethernet/RNDIS gadget (g_ether)
# ... or on host side, usbnet and random hwaddr
iface usb0 inet static
address 192.168.7.2
netmask 255.255.255.0
network 192.168.7.0
gateway 192.168.7.1
# Bluetooth networking
iface bnep0 inet dhcp
The last step is to remove the startup script "S05connman" and symlink the networking init script as "S05networking" in runlevel five.
rm /etc/rc5.d/S05connman
ln -s ../init.d/networking /etc/rc5.d/S05networking
Make sure that your system is in runlevel five by using the runlevel command:
uc-lab-node-3:~# runlevel
N 5
Now that you have everything in place to run Hadoop, we will modify the Hadoop configuration to run the server as our Master Node and the Galileo boards as our Slave Nodes.
2. Configure the Master Node:
We have to modify five Hadoop configuration files to run the cluster (here you can find a detailed description of all configuration parameters). I recommend mapping the IPs of all nodes (Master and Slaves) to concise names, so you don't have to type the IPs every time. To do so, add the IPs and the names you have chosen to the hosts file on your Master Node.
Here is an example of the hosts file:
127.0.0.1 localhost
x.x.x.x uc-lab-master-node
x.x.x.x uc-lab-node-1
x.x.x.x uc-lab-node-2
x.x.x.x uc-lab-node-3
x.x.x.x uc-lab-node-4
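If you later add more boards, the hosts entries can also be generated with a small shell loop. A sketch: the helper name and the 192.168.1.x subnet are mine for illustration, not from the setup above; substitute the static IPs you actually assigned.

```shell
#!/bin/sh
# print_host_entries: emit "IP hostname" lines for /etc/hosts.
# The subnet used in the call below is a placeholder.
print_host_entries() {
    subnet=$1   # e.g. 192.168.1
    offset=$2   # last octet of the first node
    shift 2
    i=$offset
    for name in "$@"; do
        echo "$subnet.$i $name"
        i=$((i + 1))
    done
}

print_host_entries 192.168.1 11 uc-lab-node-1 uc-lab-node-2 uc-lab-node-3 uc-lab-node-4
```

Append the output to /etc/hosts once you have checked it.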
Now navigate to the etc/hadoop folder in the Hadoop install directory and open the core-site.xml file.
First, we have to set the fs.defaultFS parameter, which specifies the NameNode host and port.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://uc-lab-master-node:9000</value>
</property>
</configuration>
Second, we have to modify the hdfs-site.xml file. Here we set four parameters: dfs.permissions turns permission checking on or off, dfs.replication sets the default block replication, dfs.namenode.name.dir specifies the path on the local filesystem where the NameNode (Master Node) stores the namespace and transaction logs, and dfs.datanode.data.dir specifies the path where the DataNodes (Slave Nodes) store their blocks. On the Slave Nodes the data is stored on the 32 GB microSD cards.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
Third, we have to modify the yarn-site.xml file. Here we specify the properties for the NodeManager and the ResourceManager. The first two properties (yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce.shuffle.class) configure the shuffle service that MapReduce applications require. The yarn.resourcemanager.resource-tracker.address parameter specifies the host and port the NodeManagers on the Slave Nodes report to, yarn.resourcemanager.scheduler.address specifies the host and port ApplicationMasters use to talk to the scheduler and obtain resources, and yarn.resourcemanager.address specifies the host and port clients use to submit jobs.
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>uc-lab-master-node:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>uc-lab-master-node:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>uc-lab-master-node:8050</value>
</property>
</configuration>
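With only 256 MB of RAM on each board, it can also help to tell YARN how little memory the NodeManagers actually have, so containers are not scheduled against the 8 GB default. These property names exist in Hadoop 2.6, but the values below are illustrative guesses for the Galileo, not taken from the original setup:

```xml
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>192</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>64</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>192</value>
</property>
```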
Fourth, we have to modify the mapred-site.xml file. Here we set the mapred.job.tracker parameter, which specifies the JobTracker host and port. The mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address parameters specify the host and port of the JobHistory Server and its Web UI.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>uc-lab-master-node:54311</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>uc-lab-master-node:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>uc-lab-master-node:19888</value>
</property>
</configuration>
The last file we have to modify is the slaves file. Here we register all Slave Nodes of the cluster. This is what my slaves file looks like:
uc-lab-node-1
uc-lab-node-2
uc-lab-node-3
uc-lab-node-4
You can also enter the IPs of the slaves if you have not registered them in the hosts file.
Now the Master Node is ready and we can configure our Slave Nodes.
3. Configure the Slave Nodes:
The core-site.xml and yarn-site.xml files are the same as on the Master Node, and the slaves file does not have to be modified.
First, we modify the mapred-site.xml.template file. Copy or rename it to mapred-site.xml and add the following property. The mapred.job.tracker parameter specifies the host and port that the MapReduce JobTracker runs at.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>uc-lab-master-node:54311</value>
</property>
</configuration>
Second, we modify the hdfs-site.xml file. It looks almost like the one on the Master Node; we just remove the dfs.namenode.name.dir property.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>
</configuration>
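Instead of editing the files on each board by hand, you can push them from the Master Node. A sketch that only prints the copy commands as a dry run (drop the echo to actually run them); the install path is an assumption, adjust it to yours:

```shell
#!/bin/sh
# Print the scp commands that would copy the slave-side configuration
# files to every board. Drop the "echo" to perform the copies for real.
conf=/usr/local/hadoop-2.6.0/etc/hadoop   # assumed install path
nodes="uc-lab-node-1 uc-lab-node-2 uc-lab-node-3 uc-lab-node-4"
for node in $nodes; do
    echo scp "$conf/core-site.xml" "$conf/yarn-site.xml" \
        "$conf/hdfs-site.xml" "$conf/mapred-site.xml" "root@$node:$conf/"
done
```

This assumes the same Hadoop install path on every node, which is also what the start script below relies on.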
Now we can start the cluster.
4. Start the Cluster:
First, go to the Hadoop install directory on your Master Node. Now you have to format the NameNode by executing the following command:
galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ ./bin/hadoop namenode -format
This creates the NameNode directory defined in hdfs-site.xml.
To make starting the cluster easier and to be sure that everything starts in the right order, I wrote a script:
#!/bin/bash
# start-cluster.sh -- starts the HDFS daemons first, then YARN,
# then the JobHistory Server.
slaveList=$1
dataNodes=$(cat "$slaveList")
# HDFS: NameNode and SecondaryNameNode on the master, DataNodes on the slaves
hadoop-daemon.sh start namenode
hadoop-daemon.sh start secondarynamenode
for node in $dataNodes; do
    ssh "root@$node" "$HADOOP_PREFIX/sbin/hadoop-daemon.sh" start datanode
done
# YARN: ResourceManager on the master, NodeManagers on the slaves
yarn-daemon.sh start resourcemanager
for node in $dataNodes; do
    ssh "root@$node" "$HADOOP_PREFIX/sbin/yarn-daemon.sh" start nodemanager
done
mr-jobhistory-daemon.sh start historyserver
The script takes the slaves file, located in the etc/hadoop folder, as its argument:
sh start-cluster.sh etc/hadoop/slaves
This starts the NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer on the Master Node, and a DataNode and NodeManager on every slave registered in the slaves file.
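The script above logs into each board with "ssh root@...", so it only runs non-interactively if the Master Node has passwordless SSH access to every slave. A dry-run sketch for distributing a key (drop the echo to actually run the commands; generate a key pair first with "ssh-keygen -t rsa" if you don't have one):

```shell
#!/bin/sh
# Print the key-distribution commands; drop the "echo" to run them for real.
# Assumes an SSH key pair already exists on the Master Node.
nodes="uc-lab-node-1 uc-lab-node-2 uc-lab-node-3 uc-lab-node-4"
for node in $nodes; do
    echo ssh-copy-id "root@$node"
done
```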
If you now execute the jps command, it should look like this:
Master:
galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ jps
29406 SecondaryNameNode
29736 JobHistoryServer
29478 ResourceManager
29778 Jps
29313 NameNode
Slaves:
root@uc-lab-node-1:/usr/local/hadoop-2.6.0# jps
2912 Jps
2704 NodeManager
2644 DataNode
If you get an output like this, the cluster is up and running and you can open the Web UI (you can also check the cluster state from the command line with "hdfs dfsadmin -report").
To open the Web UI, enter the IP of the Master Node and the default port 50070 in your browser (you can change this port by adding the dfs.namenode.http-address parameter to the hdfs-site.xml file).
http://<IP>:50070/
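When you want to shut the cluster down again, the start order can simply be reversed. A sketch, wrapped in a function so you can source and reuse it; it mirrors the start script's assumptions (slaves file as argument, root SSH access to each board) and is untested on the original hardware:

```shell
#!/bin/sh
# stop_cluster: stop the daemons in the reverse of the start order.
stop_cluster() {
    dataNodes=$(cat "$1")
    mr-jobhistory-daemon.sh stop historyserver
    for node in $dataNodes; do
        ssh "root@$node" "$HADOOP_PREFIX/sbin/yarn-daemon.sh" stop nodemanager
    done
    yarn-daemon.sh stop resourcemanager
    for node in $dataNodes; do
        ssh "root@$node" "$HADOOP_PREFIX/sbin/hadoop-daemon.sh" stop datanode
    done
    hadoop-daemon.sh stop secondarynamenode
    hadoop-daemon.sh stop namenode
}
```

Source the file and call "stop_cluster etc/hadoop/slaves" from the Hadoop install directory.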