Big Data - HADOOP on Amazon Customer Review - AWS

Further on the Road to Big Data  (Amazon Customer Review Dataset - AWS)

 

Project Goal and Objective:

As introduced in the first  Project Page, this was a self teaching HADOOP project(s) that I created to expand my skills. This is a  second project was create to learn about the big data tools now being used in the market.  The goal was to read the below refrenced maternal and generate big data analytical results.

 

** This is a second chapter of my Big Data journey, the first part was covered under the NOAA data  effort:  Big Data Hadoop NOAA

 

Working Through the book:

As mentioned im my inital discover project (link above) the White-Sarfi Hadoop book was greatly used.   After the NCDC/NOAA weather dataset example, there was second Amazon Customer Review Datasets under AWS that made real world and tangible sense to me.  In addition, Amazon is starting to rule the retail world, why not get more exposure to them and learn what you can to enhance marketable skills.

 

Web Research Points  ---> In the White-Safri book there was an AWS sample with the reference to the AWS data links.  (Amazon Customer Dataset- AWS )

  • So, I am a frequent user Amazon to a degree and was curious who or what actually gets purchased on the site.
  • Two AWS- Customer data sets the I found interesting where the Retail set (2 gigs -unzipped) and Garden set (1 gigs - unzipped)
  •  The files are in tsv format (tab), the differs from the standard CSV on most text file sources.

 

Problem Statement:

USE THE AWS DATA SETS FIND THE HIGHEST VOLUMENT ITEMS PURCHASED AND CAN THOSE TOTALS BE ROLLED INTO PARENT PRODUCT GROUPS  (ie: Shoes vs. Shirts)

 

Solution Steps:

  • Look into the AWS sample data sets provided.

  • I choose not to review at the AWS sample code referenced, since I run Apache open source Hadoop on my own hardware.

  • New challenge was data in is tab separated format, soad to reference Java to find how to read tsv files using standard libaries.

 

Analysis Steps:

  • Review data sent in a text editor for some quick findings (On Ubuntu, I found the mousepad text editor best to handle 1 Gig files)
  • AWS sets are build on customer feedback and the star rating customer provided on the items purchased
  • I can count each of the data rows by the unique product code in data sets, drop customer number.
  • In the data sets it was found the parent product code was not helpful, I did not find a useful pattern. (ie: assumed the higer digests in parent number would seperate shoes from shits.) 
  • I was able to use the AWS product description to lookup Actual Amazon products and confirm other assumptions/approaches

 

Question on the Data Sample Qualtiy:

I started to wonder on the data quality that was being provided.  It was nice and very thoughtful that that Amazon provided these data sets for free with a login registration process. Why would Amazon give other retailer insight into their site and how sales feedback or volume is collected?

Second question, how was this sample data chosen (single time slice steam or selective choice option, -- ie: tech guy in the background playing a data god role) 

 

AWS feedback: I sent an email off to AWS and they responded, it was confirmed that do collect a continues stream of data.  that was very good to hear.  There is a time stamp is present in each data row, so it can be use to explain some seasonal oddities or 'funnies' in the data.  {Still some data funnies found, results found that 1,0000+ set of socks sold {fair and possible}, but then 750+ rubber costume horse-head</font> items sold same period {very odd}.

 

HADOOP Logic:

Main goal of extract was to roll-up data for unique product count sold. So, individual data rows fields like customer description feedback comments at the purchase level were dropped (large text field, happy to drop).  I could always reference the full extract file in the mousepad or excel to get the the actual record detail  if needed. 

Using the main key as "product_id" and then to build a value string of multiple fields in the mapper was my design goal.  The idea here was to possibly further work or process the data once an Hadoop reduce output out file was created.  I could parse or further process the value field, once if was out of Hadoop (ie: Excel, My-Sql,  or Tableau )

I did keep the individual purchase feedback votes field in the hadoop mapper at the  individual at the record level. The idea for this was to add these up during reducer stage A latrer secondary goal was to calculate an average of approval by unique product_id by adding total_votes feedback divided  by total unique product purchased count.

Working on tsv files on both data set was able to pick these AWS key data fields for Hadoop processing/extract:

  • product_category
  • product_parent
  • product_title
  • total_votes
  • product_id

 

Mapper key fields:

  • key.set(product_id.trim());
  • val.set(product_category   + "," + product_parent + "," + product_title + "," + total_votes + "," + product_id);

 

Reducer code sample :

Amazon AWS Hadoop Reduce Sample 

 

 

Table of results:

I was able to export the Hadoop output file into Excel for better data review and trend analysis finding.

Extracted Amazon Retail Survery- Tom 30
 product_id  purchase count  product_category ,product_parent ,product_title,  total_votes, prodcut_id Combined total Customer votes  
B004M6XUI2 1834 Apparel 320933414 RFID Blocking Men's Leather Classic Bifold Wallet Black - Stops Electronic Pick Pocketing Works Against Identity Theft & Credit Card Data Breach by Stopping RFID Scans. 5 0
B004M6UDF0 1762 Apparel 920540636 Alpine Swiss Mens Wallet Leather Money Clip Thin Slim Front Pocket Wallet 0 B004M6UDF0 1155
B006PGGJOE 1495 Apparel 847381387 SHARKK® Aluminum Wallet Credit Card Holder With RFID Protection Made By SHARKK Brands 0 B006PGGJOE 804
B0045H0L1W 1427 Apparel 564769013 LED Light 0 B0045H0L1W 1273
B004M6UD46 1395 Apparel 357827969 Men's Leather Wallet Euro Traveler Extra Capacity Bifold Center Flip ID Window Black 0 B004M6UD46 930
B004M6UDC8 1301 Apparel 893991247 Alpine Swiss Men's Leather Zip Around Wallet ID Card Window Secure Zipper Bifold 4 B004M6UDC8 1183
B002HJ377A 1271 Apparel 601200592 The Mountain Three Wolf Moon Short Sleeve Tee 7 B002HJ377A 25588
B00FBD2G0I 1212 Apparel 875464768 Hanes Men's 10-Pack Crew Socks 2 B00FBD2G0I 1024
B000NZW3IY 1115 Apparel 601200592 THREE WOLF MOON SHIRT ADULT SIZE M 15 B000NZW3IY 61869
B0073MLZ70 1046 Apparel 808840165 Alpine Swiss Mens Leather Flipout ID Wallet Bifold Trifold Hybrid 0 B0073MLZ70 1256
B008Q0E61U 989 Apparel 126722583 BMC Mens Design Ties Clips Bar Wear Accessories Shirt Pins for Skinny & Vintage Neckties Dress Shirts Suits & Ties - Perfect Gifts for Men & Adolescent Boys - 8pc Mixed Color Variety Set Bundle Pack 0
B000SKI752 976 Apparel 110450175 Hanes Men's 6-Pack FreshIQ Cushion Crew Socks 0 B000SKI752 352
B00FJ5LI3O 918 Apparel 743151828 SHARKK® LARGE Aluminum Wallet Credit Card Holder With RFID Protection - Fits Bills 0 B00FJ5LI3O 394
B00422MCVW 869 Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCVW 523
B00FBD2G2Q 860 Apparel 317355887 Hanes Men's 10 Pack Ankle Socks 0 B00FBD2G2Q 235
B001FOR2CY 858 Apparel 354265211 Columbia Men's Military-style Web Belt 0 B001FOR2CY 606
B00CONNGVQ 827 Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 3 B00CONNGVQ 3340
B0002TOZ1O 822 Apparel 42124028 Gold Toe Men's Cotton Crew Athletic Sock 6-Pack 0 0
B00LMI9A6Y 808 Apparel 144275913 Stylish and Fit Body Fleece Lined Leggings for Women Warm Thick Spandex Tights 0 B00LMI9A6Y 1122
B00AB0AX4C 804 Apparel 650261858 Alpine Swiss Front Pocket Wallet Minimalist Super Thin 5 Card Wallet Genuine Leather 5 B00AB0AX4C 1028
B003YJBYPE 774 Apparel 724814536 Champion Men's 6 Pack No Show Socks 0 B003YJBYPE 170
B00CONNILE 770 Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 4 B00CONNILE 2421
B009Y9QCCS 746 Apparel 174908806 4pairs 71% Premium Merino Wool Crew Hiking Socks Made in USA People Socks 0 B009Y9QCCS 918
B001PU9A9Q 732 Apparel 904346411 Nippies Skin ORIGINAL Hypoallergenic Nipple Covers Pasties with ADHESIVE CREME COLOR 0 B001PU9A9Q 1294
B004U0IRGU 724 Apparel 852309029 Raver Blacked Out Gloves RGB LED 7 Colors Light Show Gloves 0 B004U0IRGU 486
B00422MCW6 712 Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCW6 543
B001OAMTQI 702 Apparel 480913524 Hanes Men's Cushion Crew Sock 0 B001OAMTQI 295
B009BKWE9U 674 Apparel 408055634 kilofly Purse Insert Organizer Expandable with Handles 0
B008VL8AW0      

 

 My-SQL data check:

Since there were some unexpected values found (ie: three wolf t-shirt sales), I loaded the full data set into My-SQL and ran some queries,  The results did check or correlate with the Hadoop program unique product count purchases. 

Note: The 'three wolf t-shirt sales' (orange rows above) was an online purchase hoax with fake reviews that drove sales to record high.  (pls check, lookup in wiki: 'three wolf shirt sales')

 

Some other odd item total count purchase count total appeared to needed validation. One product item under two separate product descriptions different sizes did validate to positive results. A second prodcut item suspect was the total number purchased costume rubber horse heads.  By check the 'product title' in the row and also looking up the item under Amazon, it the same product but by different manufactures. 


  • SELECT 
        `asw_garden_sales`.`product_id` , count(*) AS a,
        `asw_garden_sales`.`product_parent`,
        `asw_garden_sales`.`product_title`,
        `asw_garden_sales`.`product_category`
    FROM `mysql`.`asw_garden_sales`
    GROUP by `asw_garden_sales`.`product_id`
    order by a desc;