Project Report (Phase-2)
On
Consumer Woes
Submitted in partial fulfilment of the requirement for the Degree of Bachelor of
Technology in Information Technology
at
DIT University, Dehradun
Submitted By :-
Kartik Singh (1501051132)
Shantanu Gupta (1501051125)
Divya Sharma (1501051075)
Under the guidance of
T. Santosh , Assistant Professor
IT Department
DIT UNIVERSITY, DEHRADUN
(State Private University through State Legislature Act No. 10 of 2013 of Uttarakhand and approved by
UGC)
Mussoorie Diversion Road, Dehradun, Uttarakhand – 248009, India.
2018-2019
Declaration by Candidates
We declare that the work presented in this project titled “Consumer Woes”, submitted
to the Department of Information Technology, DIT University, for the award of the
Bachelor of Technology degree, is our original work. We have not plagiarized or
submitted the same work for the award of any other degree. In case this undertaking is
found incorrect, we accept that our degree may be unconditionally withdrawn.
Kartik Singh (1501051132)
Shantanu Gupta (1501051125)
Divya Sharma (1501051075)
Date :
Abstract
The term ‘Big Data’ designates advanced methods and tools to capture, store, distribute,
manage and investigate datasets of petabyte size or larger that arrive at high velocity and
in different formats. Big data can be structured, unstructured or semi-structured, which
makes conventional data management methods inadequate. Hadoop is the main platform for
organizing Big Data and solves the problem of making it convenient for analytics
purposes. It is an open source software project that allows the distributed processing
of large datasets across clusters of commodity servers.
The main aim of the project is to provide a mechanism that uses customer ratings for
various products and aids the governing body to further distinguish the usability and
reliability of each product based on the ratings.
Index
Chapter 1 – Introduction
1.1 Development Environment
1.2 System Users
1.3 Assumptions
Chapter 2 – Implementation Details
2.1 Hardware Requirements
2.2 Software Requirements
Chapter 3 – Methodology
3.1 Defining Thresholds
3.2 Preparation of Data
3.3 Aggregation of Data
Chapter 4 – Modules Covered
4.1 Data Upload
4.2 Data Processing
4.3 Partitioning of Data
4.4 Visualisation
Chapter 5 – Features
5.1 Expected Analysis and Output
Conclusion
References
Chapter 1
Introduction
The volume of information flowing into the apex financial governing body of a developed
nation is enormous. Its knowledge management team is working on leveraging the huge
amount of data received 24 x 7 from various financial institutions across the nation. The
goal is to close loopholes in the financial services being offered by various companies,
ensure that consumers are satisfied with the products and services, and minimize
post-engagement issues. The data source currently being taken up for ironing out issues
is consumer data, which is rapidly swelling with information about the financial
companies and the issues consumers have with their products. In the first phase, the
management is looking for status and insights on consumer issues and how they are
handled. The analysis reflecting the financial institutions’ status on products and
services should be presented in the form of visualizations.
The solution will be developed using Hive and subsequently deployed on Cloudera, a
PaaS platform on Cloud providing Analytics for Hadoop service. This document is the
primary input to the development team to architect the proposed visual mining model for
this project.
Development Environment
The development will be carried out using Hive operations in Cloudera. These tools
will simplify analysis and the creation of visualizations. The Hive operations use
MapReduce in the background to produce the desired output. They also support
techniques for text analytics.
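To make the role of MapReduce concrete, the following is a minimal sketch of how a grouped count (the kind of result a Hive GROUP BY query returns) can be expressed as map, shuffle and reduce steps. It is an illustration in plain Python, not the code Hive generates; the field names `company` and `product` and the sample rows are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (company, 1) pair per complaint record."""
    for record in records:
        yield record["company"], 1


def shuffle(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample rows; real data would be read from HDFS.
complaints = [
    {"company": "BANK A", "product": "MORTGAGE"},
    {"company": "BANK B", "product": "CREDIT CARD"},
    {"company": "BANK A", "product": "CREDIT CARD"},
]

counts = reduce_phase(shuffle(map_phase(complaints)))
print(counts)
```

On a cluster the map and reduce steps run in parallel on different nodes; here they run in sequence only to show the data flow.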
System Users
The users of the solution shall be the management team of the apex financial
institution. In addition, the same will become available to the financial service
providers.
Assumptions
1. It is assumed that the developer will make an effort to understand Hive
functionality and explore its features to generate the desired outputs.
2. The output generated from this project would be visualizations.
3. The data links provided are for financial institutions within USA.
4. Not all columns of data will get used in the given problem; the developer may
like to try out additional visualizations if the time permits.
Chapter 2
Implementation Details
Hardware Requirements
Device : Laptop
Processor : i3 (minimum) and above
RAM : 6 GB (minimum) and above
Hard disk : 25 GB (minimum) and above
Software Requirements
Operating System : LINUX
Languages : UNIX, Hadoop Frameworks (Hive, MySQL, MapReduce, Pig)
Platform : VirtualBox Cloudera
Chapter 3
Methodology
The financial data must be analyzed to discover insights on product
grievances amongst the service providers. The analysis must span the last 3
years in order to produce meaningful results. The outputs will be
visualizations from the Tableau chart feature along with the developer’s
observations that will be useful in drawing inferences.
Defining Thresholds
The challenge is to rank the institutions with the best and worst handling of cases in
the products being offered by them. It is critical to understand what counts as excess
in the financial industry. In the absence of public information on the threshold
values, the developer may decide on the threshold values to glean meaningful
insights from the data.
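One possible way to apply a developer-chosen threshold is sketched below. It ranks institutions by the share of their complaints that the consumer disputed and flags those above the threshold. The metric itself, the threshold value of 0.30, the field names `company` and `disputed`, and the sample rows are all assumptions for illustration; the report does not prescribe them.

```python
from collections import Counter

# Assumed, developer-chosen threshold: flag institutions whose
# dispute rate is above 30 percent.
DISPUTE_RATE_THRESHOLD = 0.30


def dispute_rates(rows):
    """Return {company: disputed complaints / all complaints}."""
    total = Counter()
    disputed = Counter()
    for row in rows:
        total[row["company"]] += 1
        if row["disputed"]:
            disputed[row["company"]] += 1
    return {company: disputed[company] / total[company] for company in total}


def rank_institutions(rows):
    """Sort institutions from best (lowest rate) to worst (highest rate)."""
    rates = dispute_rates(rows)
    return sorted(rates.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sample rows.
rows = [
    {"company": "BANK A", "disputed": False},
    {"company": "BANK A", "disputed": False},
    {"company": "BANK A", "disputed": True},
    {"company": "BANK A", "disputed": False},
    {"company": "BANK B", "disputed": True},
    {"company": "BANK B", "disputed": True},
    {"company": "BANK C", "disputed": False},
]

ranking = rank_institutions(rows)
flagged = [company for company, rate in ranking if rate > DISPUTE_RATE_THRESHOLD]
print(ranking)
print(flagged)
```

Keeping the threshold in a single named constant makes it easy to try different values while exploring the data.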
Preparation of Data
To ensure the data is in good shape for Hive operations, the following
checklist is followed.
1. All characters in text columns should be converted to uppercase.
2. Remove all punctuation, whitespace and control characters if any.
3. All numbers are integer values. Some columns may have negative values; no
transformation should be carried out on those columns.
4. The flight departure and arrival time columns need to be converted into hh:mm
format.
5. Since the flight date information is already available as Year, Month, Day of
Month and Day of Week, there is no transformation required on this data.
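The text and time steps of the checklist can be sketched as two small helper functions. This is a minimal illustration under stated assumptions: whitespace removal is interpreted as trimming and collapsing extra whitespace (so that words are not merged), only ASCII punctuation is stripped, and time values are assumed to arrive as integers such as 730 or 1545. The function names `clean_text` and `to_hhmm` are hypothetical.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(value):
    """Uppercase a text value, drop punctuation and control characters,
    and collapse runs of whitespace into single spaces."""
    value = value.upper().translate(_PUNCT_TABLE)
    # Replace ASCII control characters (including tabs and newlines) by spaces.
    value = re.sub(r"[\x00-\x1f\x7f]", " ", value)
    return " ".join(value.split())


def to_hhmm(raw):
    """Convert a time given as an hhmm integer (assumed format) to 'hh:mm'."""
    hours, minutes = divmod(int(raw), 100)
    if not (0 <= hours <= 23 and 0 <= minutes <= 59):
        raise ValueError(f"not a valid hhmm time: {raw!r}")
    return f"{hours:02d}:{minutes:02d}"


print(clean_text("  Credit\tcard, (debt)\n"))
print(to_hhmm(730))
```

Numeric columns are left untouched, in line with item 3 of the checklist.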
Aggregation of Data
Keeping the objective in mind, the next step is to run operations on columns to arrive at
the number of instances. Cloudera provides an array of useful functions to group data
based on the column values.
On eyeballing the data you may wonder how such a high volume of data can be
converted to charts. This thought itself leads you onto the right track. Visualizations
require data that is the result of pre-processing, either using ready-to-use functions or
writing algorithms to generate a final data set that does not require any further
breaking down.
To understand the type of operations on prepared data, consider finding out how
many products are being serviced per institution for the last 3 years. Using the
group sheet feature, for each year, group on institution using the count function. This
operation will return a year-wise list of institutions with the number of products being
serviced in each year.
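The grouping described above can be sketched as follows. The example counts distinct products per institution and year, which is one reading of "number of products being serviced"; the field names `year`, `company` and `product` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def products_per_institution(rows):
    """Return {(year, company): number of distinct products}."""
    products = defaultdict(set)
    for row in rows:
        products[(row["year"], row["company"])].add(row["product"])
    # Sort by year, then company, so the result reads as a year-wise list.
    return {key: len(names) for key, names in sorted(products.items())}


# Hypothetical sample rows.
rows = [
    {"year": 2016, "company": "BANK A", "product": "MORTGAGE"},
    {"year": 2016, "company": "BANK A", "product": "CREDIT CARD"},
    {"year": 2016, "company": "BANK A", "product": "MORTGAGE"},
    {"year": 2017, "company": "BANK A", "product": "MORTGAGE"},
    {"year": 2017, "company": "BANK B", "product": "STUDENT LOAN"},
]

summary = products_per_institution(rows)
for (year, company), count in summary.items():
    print(year, company, count)
```

The resulting small table is the kind of pre-aggregated data set that a charting tool can consume directly.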
[Figure: Use case diagram (Cloudera)]
Chapter 4
Modules Covered
Module 1
1. FETCHING CFPB CONSUMER DATA AND STORING IT INTO HDFS
2. STORING DATA IN HDFS
Module 2
DATA PROCESSING USING APACHE HIVE
1. UPLOADED DATA
2. CREATING DATABASE AND TABLE TO STORE THE DATA
3. DISPLAYING THE PROCESSED DATA
Module 3
PARTITIONING OF DATA USING APACHE HIVE
1. DISPLAYING THE PARTITIONED DATA
Module 4
ANALYSED TWITTER DATA VISUALIZATION
1. STATEWISE VISUALISATION
2. PRODUCTWISE VISUALISATION
3. VISUALISATION BY TYPES OF ISSUES
4. VISUALISATION OF CONSUMER DISPUTES
Chapter 5
Features
The analysis of consumer data using Tableau operations shall produce multiple outputs
that can be used to draw inferences about the product service patterns of institutions and
help model the rules and regulations for improving the service levels of
the financial institutions.
Expected Analysis and Output
The governing financial body is keen to get the following insights.
1. State-wise status of issues. Use heat maps to depict the concentration of issues reported
in the financial institutions in each state. Interaction is required: as the mouse hovers over
each state, the name of the state and its count of issues are displayed.
2. Issues by financial institution using bar charts.
3. Type of issues in the top three financial institutions. Rank the number of issues in
descending order.
4. Issues exceeding the defined norms. The visualization should represent only those
issues that exceed the defined threshold value and the way they have been resolved. On a
percentile scale, display these issues categorized by the company response.
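The data behind these charts can be prepared with simple counts before it is handed to the visualization tool. The sketch below is a minimal illustration, not the project's actual queries: the field names `state`, `company` and `response`, the sample rows, and the reading of the percentile scale as percentage shares per company response are all assumptions.

```python
from collections import Counter


def issues_by_state(rows):
    """Count issues per state (input for the heat map)."""
    return Counter(row["state"] for row in rows)


def top_institutions(rows, n=3):
    """Return the n institutions with the most issues, in descending order."""
    return Counter(row["company"] for row in rows).most_common(n)


def response_percentages(rows):
    """Return the percentage share of each company response."""
    counts = Counter(row["response"] for row in rows)
    total = sum(counts.values())
    return {resp: round(100 * count / total, 1) for resp, count in counts.items()}


# Hypothetical sample rows.
rows = [
    {"state": "CA", "company": "BANK A", "response": "CLOSED WITH EXPLANATION"},
    {"state": "CA", "company": "BANK A", "response": "CLOSED WITH RELIEF"},
    {"state": "NY", "company": "BANK A", "response": "CLOSED WITH EXPLANATION"},
    {"state": "TX", "company": "BANK B", "response": "CLOSED WITH EXPLANATION"},
]

state_counts = issues_by_state(rows)
leaders = top_institutions(rows, n=1)
shares = response_percentages(rows)
print(state_counts, leaders, shares)
```

Each function returns a small summary that maps onto one chart: state counts for the heat map, the top institutions for the bar chart, and response shares for the threshold view.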
CONCLUSION
The data source currently being taken up for ironing out the issues of financial services
institutions is consumer data, which is rapidly swelling with information about the
financial companies and the issues consumers have with their products. In the first phase,
the management is looking for status and insights on consumer issues and how they are handled.
The solution to be developed shall produce multiple outputs that can be used to draw
inferences about the product service patterns of institutions and help model the rules and
regulations for improving the service levels of the financial institutions.
References
• Martin C. Brown, “Tableau for the common man”,
https://www.ibm.com/developerworks/library/tableau/index.html, accessed
on 16th August 2018.
• “MapReduce Tutorial”, https://www.tutorialspoint.com/map_reduce/,
accessed on 18th August 2018.
• “Getting Started With Flume”,
https://cwiki.apache.org//confluence/display/FLUME/Getting+Started,
accessed on 18th October 2018.
• “IBM Bluemix Dev – Hands on with Hadoop in Minutes”,
https://developer.ibm.com/bluemix/2014/08/26/hands-on-with-hadoop-in-minutes/,
accessed on 20th October 2018.
Team Details
Project Member:
Kartik Singh (1501051132)
Shantanu Gupta (1501051125)
Divya Sharma (1501051075)
Guide Name and Signature:
Mr. T Santosh
Assistant Professor, IT Department
DIT University
Signature: ______________