You will find here all research content and exercises for a one-month training session on Apache Hadoop.
This training is divided into two sections:
- Warming up (1 week)
- Hands on (3 weeks)
To gain quick insight about Hadoop, we recommend the following bibliography:
- Give yourself a bit of background about where this technology comes from:
- Google File System http://labs.google.com/papers/gfs.html (gfs-sosp2003.pdf)
- Google MapReduce http://labs.google.com/papers/mapreduce.html (mapreduce-osdi04.pdf)
- Google BigTable http://labs.google.com/papers/bigtable.html (bigtable-osdi06.pdf)
- Intro to Parallel computing: http://code.google.com/edu/parallel/mapreduce-tutorial.html
- Hadoop in Action (HadoopinAction-reduced.pdf)
- Join Algorithms Hadoop (Join Algorithms Hadoop.pdf)
- Hadoop The Definitive Guide (Hadoop_The_Definitive_Guide.pdf)
To consolidate your knowledge of Hadoop, we suggest the following exercises using MapReduce:
- Given a file, count the number of occurrences of each word.
- Given a group of files, return the number of files that contain each of the given words.
- How would you join two tables using MapReduce?
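The first exercise is the canonical MapReduce word count. A minimal sketch, written as a Hadoop Streaming-style script in Python (the document mentions the Hadoop Java API; Streaming is an alternative way to run the same logic, and the wiring below is illustrative):

```python
#!/usr/bin/env python
# Word count in MapReduce style: the mapper emits a (word, 1) pair for
# every word, the reducer sums the counts per word. Hadoop Streaming
# delivers the reducer's input sorted by key, which groupby relies on.
import sys
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Sum the counts for each word; input must be sorted by word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__" and sys.argv[1:]:
    # Under Hadoop Streaming, the same script is invoked as either the
    # mapper or the reducer; tab-separated pairs flow over stdin/stdout.
    if sys.argv[1] == "--mapper":
        for word, count in mapper(sys.stdin):
            sys.stdout.write(f"{word}\t{count}\n")
    elif sys.argv[1] == "--reducer":
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, count in reducer((w, int(c)) for w, c in pairs):
            sys.stdout.write(f"{word}\t{count}\n")
```

Locally the pipeline can be simulated with `cat file | wordcount.py --mapper | sort | wordcount.py --reducer`; on a cluster the script would be submitted through the hadoop-streaming jar.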
Hadoop under Windows
We strongly recommend NOT running Hadoop under Windows, even as a development environment.
Our experience at Globant shows that configuring Hadoop under Windows raises serious issues. However, if you’d like to proceed, here are some instructions.
Through practical application we aim to cultivate knowledge within the studio about the different projects that are part of the Hadoop framework. At the same time we wish to build an application that will serve as a platform to present the different possibilities these technologies offer our clients.
The architecture used in these technologies makes them particularly useful for analyzing large volumes of data. With this in mind we have designed the following exercise, which we believe will serve both as an entertaining challenge to start learning about the framework and as the basis for an interesting product for our clients.
Implement a data analysis system which supports the following functionality:
- Develop a search module against the Twitter APIs
Twitter provides a RESTful service that lets us run queries against the social network’s historic database. This API supports up to 350 requests per hour (100 if the user is not authenticated), and the generated results are grouped in pages of up to 100 tweets each. A stand-alone process will be configured to run a query against the API and consume as many tweets as possible every hour. This information will be stored in HDFS using the Hadoop Java API.
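The rate limits above bound how much the hourly fetcher can ingest. A sketch of the paging loop under those limits; `fetch_page` is a hypothetical callable wrapping the actual Twitter search request:

```python
# Hourly ingestion budget for the stand-alone fetcher, derived from the
# limits stated above: 350 requests/hour when authenticated (100 when
# anonymous), with result pages of up to 100 tweets each.
REQUESTS_PER_HOUR_AUTH = 350
REQUESTS_PER_HOUR_ANON = 100
TWEETS_PER_PAGE = 100

def hourly_tweet_budget(authenticated: bool = True) -> int:
    """Upper bound on tweets one process can collect per hour."""
    requests = REQUESTS_PER_HOUR_AUTH if authenticated else REQUESTS_PER_HOUR_ANON
    return requests * TWEETS_PER_PAGE

def fetch_hour(fetch_page, authenticated: bool = True):
    """Drain as many result pages as the rate limit allows.

    `fetch_page(page_number)` is a hypothetical wrapper around the
    Twitter search call; it returns a list of tweets, empty when the
    results are exhausted.
    """
    budget = REQUESTS_PER_HOUR_AUTH if authenticated else REQUESTS_PER_HOUR_ANON
    tweets = []
    for page in range(1, budget + 1):
        batch = fetch_page(page)
        if not batch:
            break
        tweets.extend(batch)
    return tweets
```

So an authenticated process tops out at 350 × 100 = 35,000 tweets per hour; the collected batch would then be appended to a file in HDFS, e.g. through the Hadoop FileSystem Java API.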
- Configure a storage system for raw data
Given the rate limitations imposed by the Twitter APIs, we need a storage system that provides high data availability. HDFS will allow us to accumulate large volumes of data efficiently so that we can improve the speed and accuracy of our analysis. Depending on our implementation, an appropriate data format will have to be chosen.
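One simple candidate format, assuming the raw tweets are kept as text rather than a binary container like SequenceFile, is one JSON document per line: newline-delimited text is splittable, so the standard text input formats can hand independent HDFS blocks to independent mappers. A sketch of the round-trip (field names are illustrative):

```python
import json

def encode_tweets(tweets):
    """Serialize tweets as JSON Lines: one JSON document per line.
    Newline-delimited text is splittable, so each HDFS block can be
    processed by an independent mapper."""
    return "\n".join(json.dumps(t, sort_keys=True) for t in tweets)

def decode_tweets(blob):
    """Parse a JSON Lines blob back into tweet dictionaries."""
    return [json.loads(line) for line in blob.splitlines() if line]
```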
- Put together a data processing cluster
Using a MapReduce cluster we can distribute the processing load among several nodes for fast, efficient analysis. Each node in the cluster must perform data processing for a portion of the collected data (Map). The results generated by each of the nodes must in turn be grouped to produce the information we wish to present our users (Reduce).
A powerful analysis will most likely include a chain of MapReduce jobs, each using the previous job’s output as its input. As a first approach, we’d like to perform the following analyses:
- Tweet counting
- Tweet topic grouping
- Tweet mood grouping
These tasks will most likely execute in reverse order to produce a result of “number of positive/negative tweets about each topic”.
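The chained analysis above can be sketched in-process as two phases: a map phase that tags each tweet with a topic and a mood, and a reduce phase that counts tweets per (topic, mood) pair. The keyword-based mood classifier below is a naive placeholder for a real one, and the function names are illustrative:

```python
# Job 1 (map): tag each tweet with (topic, mood).
# Job 2 (reduce): count tweets per (topic, mood) key.
from collections import Counter

POSITIVE = {"love", "great", "happy"}   # placeholder sentiment lexicons
NEGATIVE = {"hate", "awful", "sad"}

def classify_mood(text):
    """Naive keyword lookup standing in for a real sentiment classifier."""
    words = set(text.lower().split())
    if words & POSITIVE:
        return "positive"
    if words & NEGATIVE:
        return "negative"
    return "neutral"

def map_phase(tweets):
    """Emit ((topic, mood), 1) for every (topic, text) tweet."""
    for topic, text in tweets:
        yield (topic, classify_mood(text)), 1

def reduce_phase(pairs):
    """Aggregate the counts per (topic, mood) key."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return dict(counts)
```

On a cluster, each phase would be its own MapReduce job, with the second job reading the first job’s HDFS output directory as its input.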
- Design and develop a “user-friendly” presentation of the analyzed data
Correct use of the technologies is not enough to consider the exercise a success. It is just as important that we present the results to our users in a way that makes it fast and easy to see the value this service provides for their business.
Some examples of visualizing large volumes of data can be found here: http://www.visualcomplexity.com/vc/
- Generate trend predictions (Nice-To-Have)
In the last couple of years we have seen several studies on Twitter’s capability to predict events (sales, stock pricing, elections). Mahout is a framework that groups together a set of AI libraries built on top of a MapReduce architecture. These algorithms could be used to make recommendations based on the analyzed data.