Assignment
In this part of the project, you will execute queries using Hive, Pig, and Hadoop streaming, and develop a custom version of KMeans clustering. The schema is available below.
The data is available at http://rasinsrv07.cstcis.cti.depaul.edu/CSC555/SSBM1/
In your submission, please note which cluster you are using. Be sure to submit all of your code. You should also submit the command lines you used and a screenshot of a completed run (just the last page; do not worry about capturing the entire output).
I highly recommend creating a small sample input and testing your code on it first (e.g., running head -n 1000 lineorder.tbl > lineorder.tbl.sample creates a 1,000-line version of lineorder).
Part 1: Pig
Implement the following query using Pig:
select c_nation, AVG(lo_extendedprice) as AVG1
from customer, lineorder
where lo_custkey = c_custkey
and c_region = 'AFRICA'
and lo_discount = 5
group by c_nation
order by AVG1;
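
For reference, a query of this shape can be expressed in Pig Latin roughly as follows. This is only a sketch: the file paths and the column layouts in the AS clauses are assumptions based on the standard SSBM table definitions, so verify them against the schema before running.

    customer = LOAD 'customer.tbl' USING PigStorage('|')
        AS (c_custkey:int, c_name:chararray, c_address:chararray, c_city:chararray,
            c_nation:chararray, c_region:chararray, c_phone:chararray, c_mktsegment:chararray);
    lineorder = LOAD 'lineorder.tbl' USING PigStorage('|')
        AS (lo_orderkey:int, lo_linenumber:int, lo_custkey:int, lo_partkey:int, lo_suppkey:int,
            lo_orderdate:int, lo_orderpriority:chararray, lo_shippriority:chararray,
            lo_quantity:int, lo_extendedprice:double, lo_ordtotalprice:double, lo_discount:int,
            lo_revenue:double, lo_supplycost:double, lo_tax:int, lo_commitdate:int,
            lo_shipmode:chararray);
    -- apply the selection predicates before the join to reduce the data size
    cust_f = FILTER customer BY c_region == 'AFRICA';
    lo_f = FILTER lineorder BY lo_discount == 5;
    joined = JOIN cust_f BY c_custkey, lo_f BY lo_custkey;
    grouped = GROUP joined BY cust_f::c_nation;
    avgs = FOREACH grouped GENERATE group AS c_nation,
           AVG(joined.lo_f::lo_extendedprice) AS AVG1;
    ordered = ORDER avgs BY AVG1;
    DUMP ordered;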
Part 2: Hadoop streaming
Implement the following query using Hadoop streaming:
select sum(lo_revenue), d_year, p_brand1
from lineorder, dwdate, part
where lo_orderdate = d_datekey
and lo_partkey = p_partkey
and p_brand1 between 'MFGR#2221' and 'MFGR#2228'
group by d_year, p_brand1;
In Hadoop streaming, this query requires a total of three MapReduce passes: one for each of the two joins, and a third for the GROUP BY.
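
As an illustration of one of those passes, below is a sketch of the lineorder-part join written as a reduce-side (repartition) join in Python. The file names, delimiters, and column positions are assumptions (lineorder has 17 pipe-delimited columns and part has 9 in the standard SSBM layout); the join with dwdate and the final GROUP BY pass follow the same mapper/reducer pattern.

    #!/usr/bin/env python
    # join_mapper.py: tag each row with its source table so the reducer
    # can join lineorder and part on the part key.
    import sys

    for line in sys.stdin:
        fields = line.strip().split('|')
        if len(fields) >= 17:                        # lineorder row (17 columns assumed)
            lo_partkey, lo_orderdate, lo_revenue = fields[3], fields[5], fields[12]
            print('%s\tL|%s|%s' % (lo_partkey, lo_orderdate, lo_revenue))
        elif len(fields) >= 9:                       # part row (9 columns assumed)
            p_partkey, p_brand1 = fields[0], fields[4]
            if 'MFGR#2221' <= p_brand1 <= 'MFGR#2228':  # apply the brand filter early
                print('%s\tP|%s' % (p_partkey, p_brand1))

    #!/usr/bin/env python
    # join_reducer.py: for each part key, emit orderdate|brand|revenue for
    # every matching lineorder row. Values within a key are not ordered,
    # so lineorder rows seen before the part row are buffered.
    import sys

    current, brand, buffered = None, None, []

    for line in sys.stdin:
        key, value = line.strip().split('\t', 1)
        if key != current:
            current, brand, buffered = key, None, []
        fields = value.split('|')
        if fields[0] == 'P':
            brand = fields[1]
            for orderdate, revenue in buffered:
                print('%s|%s|%s' % (orderdate, brand, revenue))
            buffered = []
        elif brand is not None:
            print('%s|%s|%s' % (fields[1], brand, fields[2]))
        else:
            buffered.append((fields[1], fields[2]))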
Part 3: Clustering
Using Hadoop streaming and randomly generated data (similar to what you did in Assignment 6, but generate 1M rows and 9 columns of data), perform five KMeans iterations manually, using 7 centers. You can choose the initial centers randomly, for example by picking 7 random points from your data. For each of the five iterations, include the centers produced by your code (i.e., do not just submit the command line five times without the corresponding output).
This will require passing a text file with the cluster centers to each job using the -file option as discussed in class, opening centers.txt in the mapper with open('centers.txt', 'r'), and assigning a key to each point based on which center is closest to that point. Your reducer then computes the new centers by averaging the points assigned to each key, which concludes the iteration. At that point, the output of the reducer (the new centers) can be given to the next pass of the same MapReduce code using the -file option (you will need to copy the reducer output from HDFS into a local file for that). A sketch of such a mapper and reducer appears below.
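
The sketch below assumes the points and the centers are comma-delimited, one per line; adjust the parsing to whatever format your data generator produces.

    #!/usr/bin/env python
    # kmeans_mapper.py: assign each point to its nearest center.
    import sys

    centers = []
    for line in open('centers.txt', 'r'):        # shipped with the job via -file
        if line.strip():
            centers.append([float(x) for x in line.strip().split(',')])

    for line in sys.stdin:
        if not line.strip():
            continue
        point = [float(x) for x in line.strip().split(',')]
        best, best_dist = 0, None
        for i, c in enumerate(centers):
            # squared Euclidean distance suffices for picking the minimum
            d = sum((p - q) ** 2 for p, q in zip(point, c))
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        print('%d\t%s' % (best, ','.join(str(x) for x in point)))

    #!/usr/bin/env python
    # kmeans_reducer.py: average the points assigned to each center to
    # produce the new center for the next iteration.
    import sys

    current, sums, count = None, None, 0

    def emit():
        # print the finished center as a comma-delimited line
        if current is not None:
            print(','.join(str(s / count) for s in sums))

    for line in sys.stdin:
        key, value = line.strip().split('\t')
        point = [float(x) for x in value.split(',')]
        if key != current:
            emit()
            current, sums, count = key, [0.0] * len(point), 0
        sums = [s + p for s, p in zip(sums, point)]
        count += 1
    emit()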
The only difference between the first and subsequent iterations is that in the first iteration you have to pick the initial centers yourself; starting from the 2nd iteration, the centers are produced by the previous pass of KMeans, and so on. Include the centers you computed at each iteration in your answer.
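
Each iteration then amounts to a command line along these lines; the streaming jar path and the HDFS paths are placeholders for whatever your cluster uses:

    hadoop jar /path/to/hadoop-streaming.jar \
        -file kmeans_mapper.py -mapper 'python kmeans_mapper.py' \
        -file kmeans_reducer.py -reducer 'python kmeans_reducer.py' \
        -file centers.txt \
        -input /user/you/points.txt -output /user/you/kmeans_iter1
    # pull the new centers back to the local file system for the next pass
    hadoop fs -getmerge /user/you/kmeans_iter1 centers.txt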