Further on the Road to Big Data (Amazon Customer Review Dataset - AWS)
Project Goal and Objective:
As introduced in the first Project Page, this was a self teaching HADOOP project(s) that I created to expand my skills. This is a second project was create to learn about the big data tools now being used in the market. The goal was to read the below refrenced maternal and generate big data analytical results.
** This is a second chapter of my Big Data journey, the first part was covered under the NOAA data effort: Big Data Hadoop NOAA
Working Through the book:
As mentioned im my inital discover project (link above) the White-Sarfi Hadoop book was greatly used. After the NCDC/NOAA weather dataset example, there was second Amazon Customer Review Datasets under AWS that made real world and tangible sense to me. In addition, Amazon is starting to rule the retail world, why not get more exposure to them and learn what you can to enhance marketable skills.
Web Research Points ---> In the White-Safri book there was an AWS sample with the reference to the AWS data links. (Amazon Customer Dataset- AWS )
- So, I am a frequent user Amazon to a degree and was curious who or what actually gets purchased on the site.
- Two AWS- Customer data sets the I found interesting where the Retail set (2 gigs -unzipped) and Garden set (1 gigs - unzipped)
- The files are in tsv format (tab), the differs from the standard CSV on most text file sources.
Problem Statement:
USE THE AWS DATA SETS FIND THE HIGHEST VOLUMENT ITEMS PURCHASED AND CAN THOSE TOTALS BE ROLLED INTO PARENT PRODUCT GROUPS (ie: Shoes vs. Shirts)
Solution Steps:
-
Look into the AWS sample data sets provided.
-
I choose not to review at the AWS sample code referenced, since I run Apache open source Hadoop on my own hardware.
- New challenge was data in is tab separated format, soad to reference Java to find how to read tsv files using standard libaries.
Analysis Steps:
- Review data sent in a text editor for some quick findings (On Ubuntu, I found the mousepad text editor best to handle 1 Gig files)
- AWS sets are build on customer feedback and the star rating customer provided on the items purchased
- I can count each of the data rows by the unique product code in data sets, drop customer number.
- In the data sets it was found the parent product code was not helpful, I did not find a useful pattern. (ie: assumed the higer digests in parent number would seperate shoes from shits.)
- I was able to use the AWS product description to lookup Actual Amazon products and confirm other assumptions/approaches
Question on the Data Sample Qualtiy:
I started to wonder on the data quality that was being provided. It was nice and very thoughtful that that Amazon provided these data sets for free with a login registration process. Why would Amazon give other retailer insight into their site and how sales feedback or volume is collected?
Second question, how was this sample data chosen (single time slice steam or selective choice option, -- ie: tech guy in the background playing a data god role)
AWS feedback: I sent an email off to AWS and they responded, it was confirmed that do collect a continues stream of data. that was very good to hear. There is a time stamp is present in each data row, so it can be use to explain some seasonal oddities or 'funnies' in the data. {Still some data funnies found, results found that 1,0000+ set of socks sold {fair and possible}, but then 750+ rubber costume horse-head</font> items sold same period {very odd}.
HADOOP Logic:
Main goal of extract was to roll-up data for unique product count sold. So, individual data rows fields like customer description feedback comments at the purchase level were dropped (large text field, happy to drop). I could always reference the full extract file in the mousepad or excel to get the the actual record detail if needed.
Using the main key as "product_id" and then to build a value string of multiple fields in the mapper was my design goal. The idea here was to possibly further work or process the data once an Hadoop reduce output out file was created. I could parse or further process the value field, once if was out of Hadoop (ie: Excel, My-Sql, or Tableau )
I did keep the individual purchase feedback votes field in the hadoop mapper at the individual at the record level. The idea for this was to add these up during reducer stage A latrer secondary goal was to calculate an average of approval by unique product_id by adding total_votes feedback divided by total unique product purchased count.
Working on tsv files on both data set was able to pick these AWS key data fields for Hadoop processing/extract:
- product_category
- product_parent
- product_title
- total_votes
- product_id
Mapper key fields:
- key.set(product_id.trim());
- val.set(product_category + "," + product_parent + "," + product_title + "," + total_votes + "," + product_id);
Reducer code sample :
Table of results:
I was able to export the Hadoop output file into Excel for better data review and trend analysis finding.
product_id | purchase count | product_category ,product_parent ,product_title, total_votes, prodcut_id | Combined total Customer votes | |
---|---|---|---|---|
B004M6XUI2 | 1834 | Apparel 320933414 RFID Blocking Men's Leather Classic Bifold Wallet Black - Stops Electronic Pick Pocketing Works Against Identity Theft & Credit Card Data Breach by Stopping RFID Scans. 5 | 0 | |
B004M6UDF0 | 1762 | Apparel 920540636 Alpine Swiss Mens Wallet Leather Money Clip Thin Slim Front Pocket Wallet 0 B004M6UDF0 | 1155 | |
B006PGGJOE | 1495 | Apparel 847381387 SHARKK® Aluminum Wallet Credit Card Holder With RFID Protection Made By SHARKK Brands 0 B006PGGJOE | 804 | |
B0045H0L1W | 1427 | Apparel 564769013 LED Light 0 B0045H0L1W | 1273 | |
B004M6UD46 | 1395 | Apparel 357827969 Men's Leather Wallet Euro Traveler Extra Capacity Bifold Center Flip ID Window Black 0 B004M6UD46 | 930 | |
B004M6UDC8 | 1301 | Apparel 893991247 Alpine Swiss Men's Leather Zip Around Wallet ID Card Window Secure Zipper Bifold 4 B004M6UDC8 | 1183 | |
B002HJ377A | 1271 | Apparel 601200592 The Mountain Three Wolf Moon Short Sleeve Tee 7 B002HJ377A | 25588 | |
B00FBD2G0I | 1212 | Apparel 875464768 Hanes Men's 10-Pack Crew Socks 2 B00FBD2G0I | 1024 | |
B000NZW3IY | 1115 | Apparel 601200592 THREE WOLF MOON SHIRT ADULT SIZE M 15 B000NZW3IY | 61869 | |
B0073MLZ70 | 1046 | Apparel 808840165 Alpine Swiss Mens Leather Flipout ID Wallet Bifold Trifold Hybrid 0 B0073MLZ70 | 1256 | |
B008Q0E61U | 989 | Apparel 126722583 BMC Mens Design Ties Clips Bar Wear Accessories Shirt Pins for Skinny & Vintage Neckties Dress Shirts Suits & Ties - Perfect Gifts for Men & Adolescent Boys - 8pc Mixed Color Variety Set Bundle Pack | 0 | |
B000SKI752 | 976 | Apparel 110450175 Hanes Men's 6-Pack FreshIQ Cushion Crew Socks 0 B000SKI752 | 352 | |
B00FJ5LI3O | 918 | Apparel 743151828 SHARKK® LARGE Aluminum Wallet Credit Card Holder With RFID Protection - Fits Bills 0 B00FJ5LI3O | 394 | |
B00422MCVW | 869 | Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCVW | 523 | |
B00FBD2G2Q | 860 | Apparel 317355887 Hanes Men's 10 Pack Ankle Socks 0 B00FBD2G2Q | 235 | |
B001FOR2CY | 858 | Apparel 354265211 Columbia Men's Military-style Web Belt 0 B001FOR2CY | 606 | |
B00CONNGVQ | 827 | Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 3 B00CONNGVQ | 3340 | |
B0002TOZ1O | 822 | Apparel 42124028 Gold Toe Men's Cotton Crew Athletic Sock 6-Pack 0 | 0 | |
B00LMI9A6Y | 808 | Apparel 144275913 Stylish and Fit Body Fleece Lined Leggings for Women Warm Thick Spandex Tights 0 B00LMI9A6Y | 1122 | |
B00AB0AX4C | 804 | Apparel 650261858 Alpine Swiss Front Pocket Wallet Minimalist Super Thin 5 Card Wallet Genuine Leather 5 B00AB0AX4C | 1028 | |
B003YJBYPE | 774 | Apparel 724814536 Champion Men's 6 Pack No Show Socks 0 B003YJBYPE | 170 | |
B00CONNILE | 770 | Apparel 672857011 Ann Chery Women's Faja Clasica Waist Cincher 4 B00CONNILE | 2421 | |
B009Y9QCCS | 746 | Apparel 174908806 4pairs 71% Premium Merino Wool Crew Hiking Socks Made in USA People Socks 0 B009Y9QCCS | 918 | |
B001PU9A9Q | 732 | Apparel 904346411 Nippies Skin ORIGINAL Hypoallergenic Nipple Covers Pasties with ADHESIVE CREME COLOR 0 B001PU9A9Q | 1294 | |
B004U0IRGU | 724 | Apparel 852309029 Raver Blacked Out Gloves RGB LED 7 Colors Light Show Gloves 0 B004U0IRGU | 486 | |
B00422MCW6 | 712 | Apparel 723568374 Tommy Hilfiger Men's Ranger Leather Passcase Wallet 0 B00422MCW6 | 543 | |
B001OAMTQI | 702 | Apparel 480913524 Hanes Men's Cushion Crew Sock 0 B001OAMTQI | 295 | |
B009BKWE9U | 674 | Apparel 408055634 kilofly Purse Insert Organizer Expandable with Handles | 0 | |
B008VL8AW0 |
My-SQL data check:
Since there were some unexpected values found (ie: three wolf t-shirt sales), I loaded the full data set into My-SQL and ran some queries, The results did check or correlate with the Hadoop program unique product count purchases.
Note: The 'three wolf t-shirt sales' (orange rows above) was an online purchase hoax with fake reviews that drove sales to record high. (pls check, lookup in wiki: 'three wolf shirt sales')
Some other odd item total count purchase count total appeared to needed validation. One product item under two separate product descriptions different sizes did validate to positive results. A second prodcut item suspect was the total number purchased costume rubber horse heads. By check the 'product title' in the row and also looking up the item under Amazon, it the same product but by different manufactures.
-
SELECT
`asw_garden_sales`.`product_id` , count(*) AS a,
`asw_garden_sales`.`product_parent`,
`asw_garden_sales`.`product_title`,
`asw_garden_sales`.`product_category`
FROM `mysql`.`asw_garden_sales`
GROUP by `asw_garden_sales`.`product_id`
order by a desc;