Map-Reduce Architecture on Intel Galileo

Map-Reduce Architecture on Intel Galileo

The basic idea of this blog is to document the implementation of a Map-Reduce framework on a Grid Computing architecture. The architecture will be implemented on a set of Intel Galileo Gen 2 protoyping boards. One result of this work will be a distributed system, with heterogeneous nodes geographically dispersed.

The Bachelor Thesis is co-tutored by Prof. Dr. Ralf Seepold (HTWG Konstanz - Germany, UC-Lab) and Prof. Dr. Juan Antonio Ortega (Universidad de Sevilla - Spain, Escuela Técnica Superior de Ingenería Informática). The thesis is executed during an ERASMUS+ stay at the University of Seville, Campus Reina Mercedes, Sevilla, Spain. 

Continue reading
Rate this blog entry:
4
5526 Hits 0 Comments

Let's get acquainted

Let's get acquainted

Hi everyone!

This is my first post in the UC-Lab Blog and first of all we have to present ourselves. We are part of the Department of Computer Science (Ubiquitous Computing) at the University of Applied Sciences Hochschule Konstanz (HTWG) in Germany. Here our location in Google maps: https://goo.gl/maps/UpJnL .

Our research fields are mobile sensors, intelligent environments and several other topics. We work on different projects but we always try to create innovative concepts that can be useful for the society.

Our international team consists of researchers and students (http://uc-lab.in.htwg-konstanz.de/team) and the professor. To present everyone shortly, Dr. Prof. Seepold is the mastermind of the team, Patrick is our Android guru, Oana is an exchange student from the mysterious Romania, Mario is our secret agent who was sent to Seville as a Bachelor exchange student to provide us with the latest developments of Spanish science. My name is Daniel and I stress people to research biological sensors.

Why do we write a blog? It’s pretty simple: we are extremely excited about what we are doing and we want to share the knowledge and the experience that we get while working on our projects. We are planning to post at least once a week (most probably every Monday). Of course, we will try not to be boring ;).

A little teaser: in the next post we are going to write about the Intel Galileo gen 2. As a small preview we are going to share already now the first version of some 3D models for Intel Galileo gen 2 case.

galileoGen2Case.zip

P.S. While printing the cases please don't forget to turn them so that they fit your 3D printer

 

Continue reading
Rate this blog entry:
0
1751 Hits 0 Comments

Hadoop cluster running on Intel Galileo Gen 2

This my first post in the UC-Lab Blog. I am studying at the University of Applied Sciences Hochschule Konstanz (HTWG) in Germany and I am writing my bachelor thesis within the Ubiquitous Computing Laboratory in cooperation with the University of Seville (ERASMUS program). My objective is to develop a Grid based system for data mining using Map-Reduce. In this blog I will docuIment my process of my bachelor thesis.

Here you can find more information about this and other projects of the UC-Lab: http://uc-lab.in.htwg-konstanz.de/ucl-projects.html

 

The system has to run on the Intel Galileo Gen 2 board (http://www.intel.com/content/www/us/en/do-it-yourself/galileo-maker-quark-board.html). Because of the limited resources of the boards, this is going to be one of the main aspects I have to focus on.

Figure 2 shows the setup of the four boards I develop on. 

 

b2ap3_thumbnail_IMG_0260.JPG

Figure 1: Setup of the Intel Galileo Gen 2 boards

 

The Map-Reduce (MR) programming model with the Apache Hadoop framework is one of the most well-known and usually most common models. Specifically, it supports a simple programming model so that the end-user programmer only has to write the Map-Reduce tasks. However, Hadoop itself is a name for a federation of services, like HDFS, Hive, HBase, Map-Reduce, etc. (See Figure 2: Hadoop Architecture). Apache Storm and Apache Spark are distributed realtime computation systems and can be used with some of these services.

 

b2ap3_thumbnail_Hadoop-Architecture.png

Figure 2: Hadoop Architecture

 

 

In a first field test, we have setup a Hadoop Cluster (v2.6) on five Intel Galileodevelopment boards. Due to the minimum resources (RAM, CPU) it was not possible to run the Hadoop system in an appropriate way. After infrastructural changes (Namenode & ResouceManager had been moved to regular workstation pc), the system provides a higher performance and usability. Simple Map-Reduce jobs (e.g. WordCount) as well as jobs with a higher complexity (Recommendation system) work, even for millions of data entries, with an acceptable performance.

 

 

Apache Storm (https://storm.apache.org) offers the opportunity to use it on Hadoop or to run it in a standalone mode. Storm is a real-time, streaming computational system. Storm is a online framework, meaning, in this sense, a service that interacts with a running application. In contrast to Map-Reduce, it receives small pieces of data as they are processed in your application. You define a DAG (Directed acyclic graph) of operations to perform on the data.

Storm doesn't have anything (necessarily) to do with persisting your data. Here, streaming is another way to say keeping the information you care about and throwing the rest away. In reality, you probably have a persistence layer in your application that has already recorded the data, and so this a good and justified separation of concerns.

At the moment I couldn't get Storm running on the Intel Galileo boards. The main problem is the operations system running on the boards. I used an existing Yocto image and found out, that there are some needed services missing. On the server, which is running with Debian OS, the setup was no problem. I think if I make my own Image with Yocto Project it should run on the boards too.

 

 

The third opportunity is Apache Spark (https://spark.apache.org). Just like Storm, Spark offers the opportunity to use it on Hadoop or to run it in a standalone mode. 

One of the most interesting features of Spark is its smart use of memory. Map-Reduce has always worked primarily with data stored on disk. Spark, by contrast, can exploit the considerable amount of RAM that is spread across all the nodes in a cluster. It is smart about use of disk for overflow data and persistence. That gives Spark huge performance advantages for many workloads.

We were able to set up Spark over the cluster, but because of the limited RAM, we weren't able to run a job on the boards successfully. Maybe with some optimization of the use of the resources we could get it running.

 

 

My next Posts will contain a detailed tutorial how I set up the different frameworks and what is important to look out for.

Continue reading
Rate this blog entry:
9
6983 Hits 0 Comments

Setup Hadoop 2.6.0 on Intel Galileo Gen 2

Setup Hadoop 2.6.0 on Intel Galileo Gen 2

In this tutorial I will explain, how you can setup a Apache Hadoop Multi-Node Cluster on the Intel Galileo Gen 2 boards. The Master Node will run on a server with a Ubuntu Server OS and the four Slave Nodes will run on the Galileo boards with a Yocto OS.

The Problem are the limited resources of the Galileo boards:

Intel Galileo Gen 2 resources:

  • Intel® Quark™ SoC X1000 application processor, a 32-bit, single-core, single-thread, Intel® Pentium® processor instruction set architecture (ISA)-compatible, operating at speeds up to 400 MHz.
  • Support for a wide range of industry standard I/O interfaces, including a full-sized mini-PCI Express* slot, 100 Mb Ethernet port, microSD* slot, USB host port, and USB client port.
  • 256 MB DDR3, 512 kb embedded SRAM, 8 MB NOR Flash, and 8 kb EEPROM standard on the board, plus support for microSD card up to 32 GB.
  • Support for Yocto 1.4 Poky* Linux release.

  

  1. Install Hadoop
  2. Configure Master Node
  3. Configure Slave Nodes
  4. Start the Cluster 

1. Install Hadoop:

These steps you have to make on all nodes (Master Node and on all Slave Nodes) on which you want to run Hadoop.

First you have to Download Hadoop and extract the package to a location of your choice. I am using Hadoop 2.6.0.

 

You also need one of the latest java versions (Java 8, Java 7 or late Java 6). I am using Java 7. 

 

Now you set the Hadoop install directory and your Java directory to your system path. To add those permanently, you have to add the following commands to your "~/.profile"-file and reboot your system. After the reboot, check your environment variables using the "env"-command.

 

HADOOP_PREFIX=/path/to/Hadoop
export HADOOP_PREFIX
PATH=$PATH:$HADOOP_PREFIX/bin
export PATH
PATH=$PATH:$HADOOP_PREFIX/sbin
export PATH

JAVA_HOME=/path/to/Java
export JAVA_HOME
PATH=$PATH:$JAVA_HOME/bin
export PATH

 

The next step is really important! You have to setup a static IP for the Galileo boards. To do this, you have to enter the IP in your interface file and register it in runlevel five, so that it will run at the start of the system.

First, create a Backup of your interface file:

cp /etc/network/interfaces  /etc/network/interfaces.backup

vi /etc/network/interfaces

 

Now change "iface eth0 inet dhcd" to "iface eth0 inet static" and add the address, netmask and the gateway.

# /etc/network/interfaces -- configuration file for ifup(8), ifdown(8)

# The loopback interface
auto lo
iface lo inet loopback

# Wireless interfaces
iface wlan0 inet dhcp
        wireless_mode managed
        wireless_essid any
        wpa-driver wext
        wpa-conf /etc/wpa_supplicant.conf

iface atml0 inet dhcp

# Wired or wireless interfaces
auto eth0
iface eth0 inet static
        address x.x.x.x
        netmask 255.255.255.0
        gateway x.x.x.x

#iface eth1 inet dhcp
# Ethernet/RNDIS gadget (g_ether)
# ... or on host side, usbnet and random hwaddr
iface usb0 inet static                         
        address 192.168.7.2                    
        netmask 255.255.255.0                  
        network 192.168.7.0                    
        gateway 192.168.7.1                    
                                               
# Bluetooth networking                         
iface bnep0 inet dhcp              

The last step is to remove the startup script "S05connman" and add the networking script to "S05networking" in runlevel five. 

rm /etc/rc5.d/S05connman

ln -s ../init.d/networking /etc/rc5.d/S05networking

 Be sure that your system is in runlevel five by using the runlevel-command:

uc-lab-node-3:~# runlevel
N 5

Now, that you have everything to run Hadoop, we will modify the Hadoop configuration to run the Server as our Master Node and the Galileo boards as our Slave Nodes. 


2. Configuration Master Node:

We have to modify five hadoop configuration files to run the cluster (here you can find a detailed description of all configuration parameters). I recommend, to map the IPs of all Nodes (Master and Slaves) with concise names, so you don´t have to enter the IPs every time. Therefor, add the IPs and the names you have chosen to your host-file on your Master Node.

b2ap3_thumbnail_Bildschirmfoto-2015-04-05-um-13.48.08.png

Here is an example of the hosts file:

 

127.0.0.1        localhost
x.x.x.x          uc-lab-master-node
x.x.x.x          uc-lab-node-1
x.x.x.x          uc-lab-node-2
x.x.x.x          uc-lab-node-3
x.x.x.x          uc-lab-node-4

 

Now we navigate to the etc/hadoop folder in our Hadoop install directory and open the core-site.xml file.

b2ap3_thumbnail_Bildschirmfoto-2015-04-05-um-13.57.26.png 

 

First, we have to change the fs.default.name parameter, which specifies the NameNode host and port.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://uc-lab-master-node:9000</value>
</property>

</configuration>

 

Second, we have to modify the hdf-site.xml file. Here we change the dfs.permissions which specifies the permisson checking, the dfs.replication parameter specifies the default block replication, the dfs.namenode.name.dir specifies the path on the local filesystem where the NameNode (Master Node) stores the namespace and transactions logs and the dfs.datanode.data.dir parameter specifies the path where the blocks on the datanodes (Slave Nodes) are stored. The data on the Slave Nodes are stored on 32GB microSD cards.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>dfs.permissions</name>
   <value>false</value>
</property>

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hdfs/namenode</value>
</property>

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>

</configuration>

 

Third, we have to modify the yarn-site.xml. Here we specify the properties for the NodeManager and the ResourceManager. The first two properties (yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce.shuffle.class) are set by default. Those are to set the shuffle service that needs to be set for Map Reduce applications. The yarn.resourcemanager.resource-tracker.address parameter specifies the host and port for the NodeManagers running on the Slave Nodes, the yarn.resourcemanager.scheduler.address parameter specifies the host and port for the ApplicationMasters to talk to scheduler to obtain resources and the yarn.resourcemanager.address parameter specifies the host and port for clients to submit jobs.

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

<property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
</property>

<property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>uc-lab-master-node:8025</value>
</property>

<property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>uc-lab-master-node:8030</value>
</property>

<property>
   <name>yarn.resourcemanager.address</name>
   <value>uc-lab-master-node:8050</value>
</property>

</configuration>

 

Fourth, we have to modify the mapred-site.xml. Here we change the mapred.job.tracker parameter which specifies JobTracker host and port. The mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address specify the host and port of the JobHistory Server and the corresponding Web UI.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>mapred.job.tracker</name>
   <value>uc-lab-master-node:54311</value>
</property>

<property>
   <name>mapreduce.jobhistory.address</name>
   <value>uc-lab-master-node:10020</value>
</property>

<property>
   <name>mapreduce.jobhistory.webapp.address</name>
   <value>uc-lab-master-node:19888</value>
</property>


</configuration>

 

The last file we have to modify is the slaves file. Here we have to register all slave nodes of the cluster. Thats what my slaves file looks like:

uc-lab-node-1
uc-lab-node-2
uc-lab-node-3
uc-lab-node-4

You can also enter the ip´s of the slaves if you did not registerd them in the hosts file.

Now the Master Node is ready and we can configure our Slave Nodes. 


 3. Configuration Slave Nodes:

The core-site.xml and the yarn-site.xml are the same as on the Master Node and the slaves file has not to be modified. 

First, we modify the mapred-site.xml.template. Copy or rename the file to mapred-site.xml and add the following properties. The mapred.job.tracker parameter specifies the host and port that the MapReduce job tracker runs at.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
   <name>mapred.job.tracker</name>
   <value>uc-lab-master-node:54311</value>
</property>

</configuration>

 

Second, we modify the hdfs-site.xml. This file almost looks like the one on the Master Node. We just have to remove the dfs.namenode.name.dir property.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="/configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
   <name>dfs.replication</name>
   <value>3</value>
</property>

<property>
   <name>dfs.datanode.data.dir</name>
   <value>file:/media/mmcblk0p1/hadoop_data/hdfs/datanode</value>
</property>

</configuration>

 

Now we can start the cluster. 


 4. Start the Cluster:

First you go to you Hadoop install directory on Your Master Node. Now you have to format the NameNode by executing the following command:

galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ ./bin/hadoop namenode -format

 

This creates the in the hdfs-site.xml defined location for the Namenode.

To start the cluster I wrote a script to make it easier and to be sure that everything starts in the write order:

#!/bin/bash

slaveList=$1
dataNodes=`cat $slaveList`

hadoop-daemon.sh start namenode

hadoop-daemon.sh start secondarynamenode

for node in $dataNodes ; do
    ssh root@$node $HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
done

yarn-daemon.sh start resourcemanager

for node in $dataNodes ; do
    ssh root@$node $HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
done

mr-jobhistory-daemon.sh start historyserver

 

This script needs the slaves file located in the etc/hadoop/ folder.

sh start-cluster.sh etc/hadoop/slaves

 

This starts the NameNode, SecondaryNameNode, ResourceManager and the JobHistoryServer on the Master Node and the DataNode and NodeManager on all saves registered in the slaves file.

If you now execute the jps command, it should look like this:

Master:

galileo@uc-lab-master-node:/usr/local/hadoop-2.6.0$ jps

29406 SecondaryNameNode

29736 JobHistoryServer

29478 ResourceManager

29778 Jps

29313 NameNode

 

Slaves:

root@uc-lab-node-1:/usr/local/hadoop-2.6.0# jps

2912 Jps

2704 NodeManager

2644 DataNode

 

If you get an output like this, the cluster is working and you can enter the Web UI.

To enter the Web UI you enter the IP of the Master Node and the default Port 50070 (you can change this port by adding the dfs.namenode.http-address parameter to the hdfs-site.xml file).

http://<IP>:50070/

  • Table_1

(click to enlarge)

 

 

Continue reading
Rate this blog entry:
13
5692 Hits 0 Comments