Big Data - HADOOP on Amazon Customer Review - AWS

Further on the Road to Big Data (Amazon Customer Review Dataset - AWS)

Project Goal and Objective:

As introduced in the first Project Page, this was a self teaching HADOOP project(s) that I created to expand my skills. This is a second project was create to learn about the big data tools now being used in the market. The goal was to read the below refrenced maternal and generate big data analytical results.

** This is a second chapter of my Big Data journey, the first part was covered under the NOAA data effort: Big Data Hadoop NOAA

Working Through the book:

As mentioned im my inital discover project (link above) the White-Sarfi Hadoop book was greatly used. After the NCDC/NOAA weather dataset example, there was second Amazon Customer Review Datasets under AWS that made real world and tangible sense to me. In addition, Amazon is starting to rule the retail world, why not get more exposure to them and learn what you can to enhance marketable skills.

Web Research Points ---> In the White-Safri book there was an AWS sample with the reference to the AWS data links. (Amazon Customer Dataset- AWS )

So, I am a frequent user Amazon to a degree and was curious who or what actually gets purchased on the site.
Two AWS- Customer data sets the I found interesting where the Retail set (2 gigs -unzipped) and Garden set (1 gigs - unzipped)
The files are in tsv format (tab), the differs from the standard CSV on most text file sources.

Problem Statement:

USE THE AWS DATA SETS FIND THE HIGHEST VOLUMENT ITEMS PURCHASED AND CAN THOSE TOTALS BE ROLLED INTO PARENT PRODUCT GROUPS (ie: Shoes vs. Shirts)

Solution Steps:

Look into the AWS sample data sets provided.
I choose not to review at the AWS sample code referenced, since I run Apache open source Hadoop on my own hardware.
New challenge was data in is tab separated format, soad to reference Java to find how to read tsv files using standard libaries.

Analysis Steps:

Review data sent in a text editor for some quick findings (On Ubuntu, I found the mousepad text editor best to handle 1 Gig files)
AWS sets are build on customer feedback and the star rating customer provided on the items purchased
I can count each of the data rows by the unique product code in data sets, drop customer number.
In the data sets it was found the parent product code was not helpful, I did not find a useful pattern. (ie: assumed the higer digests in parent number would seperate shoes from shits.)
I was able to use the AWS product description to lookup Actual Amazon products and confirm other assumptions/approaches

Question on the Data Sample Qualtiy:

I started to wonder on the data quality that was being provided. It was nice and very thoughtful that that Amazon provided these data sets for free with a login registration process. Why would Amazon give other retailer insight into their site and how sales feedback or volume is collected?

Second question, how was this sample data chosen (single time slice steam or selective choice option, -- ie: tech guy in the background playing a data god role)

AWS feedback: I sent an email off to AWS and they responded, it was confirmed that do collect a continues stream of data. that was very good to hear. There is a time stamp is present in each data row, so it can be use to explain some seasonal oddities or 'funnies' in the data. {Still some data funnies found, results found that 1,0000+ set of socks sold {fair and possible}, but then 750+ rubber costume horse-head</font> items sold same period {very odd}.

HADOOP Logic:

Main goal of extract was to roll-up data for unique product count sold. So, individual data rows fields like customer description feedback comments at the purchase level were dropped (large text field, happy to drop). I could always reference the full extract file in the mousepad or excel to get the the actual record detail if needed.

Using the main key as "product_id" and then to build a value string of multiple fields in the mapper was my design goal. The idea here was to possibly further work or process the data once an Hadoop reduce output out file was created. I could parse or further process the value field, once if was out of Hadoop (ie: Excel, My-Sql, or Tableau )

I did keep the individual purchase feedback votes field in the hadoop mapper at the individual at the record level. The idea for this was to add these up during reducer stage A latrer secondary goal was to calculate an average of approval by unique product_id by adding total_votes feedback divided by total unique product purchased count.

Working on tsv files on both data set was able to pick these AWS key data fields for Hadoop processing/extract:

product_category
product_parent
product_title
total_votes
product_id

Mapper key fields:

key.set(product_id.trim());
val.set(product_category + "," + product_parent + "," + product_title + "," + total_votes + "," + product_id);

Reducer code sample :

Table of results:

I was able to export the Hadoop output file into Excel for better data review and trend analysis finding.

*^{Extracted Amazon Retail Survery- Tom 30}*
product_id	purchase count	product_category ,product_parent ,product_title, total_votes, prodcut_id	Combined total Customer votes
B004M6XUI2	1834	Apparel 320933414 RFID Blocking Men's Leather Classic Bifold Wallet Black - Stops Electronic Pick Pocketing Works Against Identity Theft & Credit Card Data Breach by Stopping RFID Scans. 5	0
B004M6UDF0	1762	Apparel 920540636 Alpine Swiss Mens Wallet Leather Money Clip Thin Slim Front Pocket Wallet 0 B004M6UDF0	1155
B006PGGJOE	1495	Apparel 847381387 SHARKK® Aluminum Wallet Credit Card Holder With RFID Protection Made By SHARKK Brands 0 B006PGGJOE	804
B0045H0L1W	1427	Apparel 564769013 LED Light 0 B0045H0L1W	1273
B004M6UD46	1395	Apparel 357827969 Men's Leather Wallet Euro Traveler Extra Capacity Bifold Center Flip ID Window Black 0 B004M6UD46	930
B004M6UDC8	1301	Apparel 893991247 Alpine Swiss Men's Leather Zip Around Wallet ID Card Window Secure Zipper Bifold 4 B004M6UDC8	1183
B002HJ377A	1271	Apparel 601200592 The Mountain Three Wolf Moon Short Sleeve Tee 7 B002HJ377A	25588
B00FBD2G0I	1212	Apparel 875464768 Hanes Men's 10-Pack Crew Socks 2 B00FBD2G0I	1024
B000NZW3IY	1115	Apparel 601200592 THREE WOLF MOON SHIRT ADULT SIZE M 15 B000NZW3IY	61869
B0073MLZ70	1046	Apparel 808840165 Alpine Swiss Mens Leather Flipout ID Wallet Bifold Trifold Hybrid 0 B0073MLZ70	1256
B008Q0E61U	989	Apparel 126722583 BMC Mens Design Ties Clips Bar Wear Accessories Shirt Pins for Skinny & Vintage Neckties Dress Shirts Suits & Ties - Perfect Gifts for Men & Adolescent Boys - 8pc Mixed Color Variety Set Bundle Pack	0
B000SKI752	976	Apparel 110450175 Hanes Men's 6-Pack FreshIQ Cushion Crew Socks 0 B000SKI752	352
B00FJ5LI3O	918	Apparel 743151828 SHARKK® LARGE Aluminum Wallet Credit Card Holder With RFID Protection - Fits Bills 0 B00FJ5LI3O	394
B00422MCVW	869	Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCVW	523
B00FBD2G2Q	860	Apparel 317355887 Hanes Men's 10 Pack Ankle Socks 0 B00FBD2G2Q	235
B001FOR2CY	858	Apparel 354265211 Columbia Men's Military-style Web Belt 0 B001FOR2CY	606
B00CONNGVQ	827	Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 3 B00CONNGVQ	3340
B0002TOZ1O	822	Apparel 42124028 Gold Toe Men's Cotton Crew Athletic Sock 6-Pack 0	0
B00LMI9A6Y	808	Apparel 144275913 Stylish and Fit Body Fleece Lined Leggings for Women Warm Thick Spandex Tights 0 B00LMI9A6Y	1122
B00AB0AX4C	804	Apparel 650261858 Alpine Swiss Front Pocket Wallet Minimalist Super Thin 5 Card Wallet Genuine Leather 5 B00AB0AX4C	1028
B003YJBYPE	774	Apparel 724814536 Champion Men's 6 Pack No Show Socks 0 B003YJBYPE	170
B00CONNILE	770	Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 4 B00CONNILE	2421
B009Y9QCCS	746	Apparel 174908806 4pairs 71% Premium Merino Wool Crew Hiking Socks Made in USA People Socks 0 B009Y9QCCS	918
B001PU9A9Q	732	Apparel 904346411 Nippies Skin ORIGINAL Hypoallergenic Nipple Covers Pasties with ADHESIVE CREME COLOR 0 B001PU9A9Q	1294
B004U0IRGU	724	Apparel 852309029 Raver Blacked Out Gloves RGB LED 7 Colors Light Show Gloves 0 B004U0IRGU	486
B00422MCW6	712	Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCW6	543
B001OAMTQI	702	Apparel 480913524 Hanes Men's Cushion Crew Sock 0 B001OAMTQI	295
B009BKWE9U	674	Apparel 408055634 kilofly Purse Insert Organizer Expandable with Handles	0
B008VL8AW0

My-SQL data check:

Since there were some unexpected values found (ie: three wolf t-shirt sales), I loaded the full data set into My-SQL and ran some queries, The results did check or correlate with the Hadoop program unique product count purchases.

Note: The 'three wolf t-shirt sales' (orange rows above) was an online purchase hoax with fake reviews that drove sales to record high. (pls check, lookup in wiki: 'three wolf shirt sales')

Some other odd item total count purchase count total appeared to needed validation. One product item under two separate product descriptions different sizes did validate to positive results. A second prodcut item suspect was the total number purchased costume rubber horse heads. By check the 'product title' in the row and also looking up the item under Amazon, it the same product but by different manufactures.

SELECT
`asw_garden_sales`.`product_id` , count(*) AS a,
`asw_garden_sales`.`product_parent`,
`asw_garden_sales`.`product_title`,
`asw_garden_sales`.`product_category`
FROM `mysql`.`asw_garden_sales`
GROUP by `asw_garden_sales`.`product_id`
order by a desc;

You are here

Big Data - HADOOP on Amazon Customer Review - AWS

Further on the Road to Big Data (Amazon Customer Review Dataset - AWS)