Faculty of Information Technology and Computer Science
Department of Computer Information System
Handling Big Data Using BigML
Submitted By
Ghada A.A Mheidat
(2017930020)
Supervised By
Dr. Samer Samara
Table of Contents
Introduction
Handling Big Data
BigML
Models in BigML
Experiments Based on BigML
BigML and Other Tools
Conclusion
Future Work
1. Introduction
What would life be like without oxygen, water, plants or the other basics of life?
There might be disorder in the universe, or we might be looking for another planet
to live on. This is one way to imagine how important data is, especially for people
whose lives, profits and work depend on it. In my view, data is the basic entity in
anything and for everything.
Several decades ago it was possible to handle data in traditional ways, because
data was not large or varied and came from a limited set of sources. Today data
can include everything, of every type and from any source, and it attracts
attention because it is growing significantly in every field. At first the most
important question was how to store all this data, but with increasing needs the
question became how to handle it, because the methods of dealing with data have
changed and many challenges have appeared, such as storage, analysis, searching,
sharing, transfer, visualization and others.
In 2005, Roger Mougalas of O'Reilly Media coined the term "Big Data". Put simply,
it refers to a data set so large that it is impossible to handle with traditional
business intelligence tools, which are too complicated to adapt to changing data
needs and user requirements and which consume time and budget. In the same year
Hadoop was created, the first open source software platform for managing the
storage and distributed processing of vast amounts of data. Many other tools are
now available for storing and handling data, with different levels of efficiency
and capability, such as BigML, RapidMiner and Weka.
Big Data has become a popular concept. It refers to dealing with huge data sets
that are difficult to collect, store, analyze, maintain and visualize; these are
the challenges of Big Data. If we do not overcome these challenges, Big Data will
be like gold ore that we do not have the ability to mine (Chen and Zhang, 2014).
2. Handling Big Data
Data grows exponentially year over year, and data today differs from the data of
decades ago in size, in importance, and in how it is used to find relationships or
make predictions for a specific case or problem. As a result, data faces many
difficulties, such as storage, understandability and how to benefit from it. All
of this falls under one concept: handling big data. The ability to handle big data
consists of understanding the main purpose of specific data, extracting
relationships from it and building a model that achieves the objective of that
data. Handling big data is what we should focus on, because it looks beyond the
mere size of a data set: we try to find out the details of the data through several
stages using different techniques. Note that the needs of users vary depending on
the purpose of the data. With the tremendous increase in the amount of data, new
tools had to be created that can deal with various types and sizes of data, such
as BigML, Weka and RapidMiner, which are discussed in this report.
3. BigML
BigML, built in 2011, is a machine learning platform for building and sharing data
sets and models. Why choose BigML? Many researchers, experiments and individual
users depend on BigML because it provides appealing models that are easy to
understand and that anyone can work with, along with many options for handling
data and building the predictive model you expect. BigML can be considered a
consumable, programmable and scalable machine learning platform (bigml.com) that
gives users many options to handle and automate classification, regression,
analysis, detection and predictive modelling. BigML is used by more than 68,000
users and over 600 universities around the world, as shown in Figure 1 below. Note
that BigML relies on decision trees to analyze data.
Figure 1: Prevalence of BigML around the world
BigML does not settle for the features it already contains; it has kept evolving
over the years, as shown below.
Figure 2: Development of the BigML tool
BigML plays an effective role in education: through its education program, every
educator can learn to analyze and work with their own data. The program provides
free PRO subscription access for one year to educators and students.
3.1. Organizations
BigML has also studied the needs of multiple users and of shared workspaces. It
therefore created what it calls organizations, which let different people in
different places work on the same projects from the same account, but with
different permission levels.
3.2. Share and Clone Models
With BigML you can easily clone models, datasets and scripts from any user into
your own BigML account. This is possible when a user shares a resource by
activating the sharing and cloning options, as shown in Figure 3 below.
Figure 3: Options provided by BigML to share and clone models and datasets
3.3. Hadoop Integration
Many users would like to benefit from their external data sources, so BigML
provides a Hadoop integration option that lets users upload data from a Hadoop
server by using a URL starting with hdfs://, as shown in Figure 4 below. Data can
also be imported from Google Cloud Storage, Google Drive and Dropbox.
Figure 4: How users can upload their data from a Hadoop server
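As a small illustration of the hdfs:// form mentioned above, the following sketch
checks such a URL and splits it into host, port and file path using only the Python
standard library. It does not call BigML; the function name describe_hdfs_source and
the host, port and path in the example are hypothetical.

```python
from urllib.parse import urlparse


def describe_hdfs_source(url):
    """Split an hdfs:// URL into the parts a remote upload would need."""
    parts = urlparse(url)
    if parts.scheme != "hdfs":
        raise ValueError(f"expected an hdfs:// URL, got scheme {parts.scheme!r}")
    if not parts.hostname or not parts.path:
        raise ValueError("URL must name a host and a file path")
    return {"host": parts.hostname, "port": parts.port, "path": parts.path}


# Hypothetical NameNode host, port and file path, used only for illustration.
print(describe_hdfs_source("hdfs://namenode.example.org:8020/data/sales.csv"))
```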
3.4. Handling Missing Values
Most users know how difficult missing data is and how important it is to have all
the data at prediction time. BigML therefore provides an option to create a model
that handles missing values explicitly when making predictions, as shown in
Figure 5 below. Moreover, BigML provides several measures to evaluate the
correlation between fields of a dataset, such as chi-square, the Pearson and
Spearman coefficients, ANOVA and Cramér's V.
Figure 5: Creating a model that explicitly handles missing values
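To make one of the correlation measures listed above concrete, the following is a
minimal sketch of the Pearson coefficient for two numeric fields, skipping rows in
which either value is missing. It is plain Python for illustration and not BigML's
implementation; the function name pearson and the sample values are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None marks a missing value."""
    # Keep only the rows where both fields are present.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_x * var_y)


# Invented sample data: one missing value in each field.
ages = [20, 30, None, 50, 60]
incomes = [200, 300, 400, 500, None]
print(round(pearson(ages, incomes), 3))
```

Dropping incomplete rows is only one possible treatment of missing values; it is
used here because it keeps the example short.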
3.5. Models in BigML
The only type of model in BigML is the decision tree, which is considered a
limitation. However, a decision tree is an entirely white-box model: the user can
look inside the model to see what it deduced from the data and how it uses that
information to make predictions about unseen data.
The white-box model brings several benefits: the user can explore the model on the
website and download it for offline use with no connection to the platform, and a
model can be built without any programming skills, which makes model construction
fast.
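The white-box property can be illustrated with a small sketch: a decision tree
stored as plain data, whose rules can be printed and which can make predictions
offline. The tree, field names and thresholds below are invented for illustration
and were not produced by BigML.

```python
# A hypothetical tree: each internal node tests one numeric field against a threshold.
TREE = {
    "field": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "field": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, row):
    """Follow the tests from the root until a leaf label is reached."""
    while "label" not in node:
        branch = "left" if row[node["field"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


def rules(node, path=()):
    """Return every root-to-leaf path as a readable IF ... THEN rule."""
    if "label" in node:
        condition = " AND ".join(path) or "TRUE"
        return [f"IF {condition} THEN {node['label']}"]
    test = f"{node['field']} <= {node['threshold']}"
    negated = f"{node['field']} > {node['threshold']}"
    return rules(node["left"], path + (test,)) + rules(node["right"], path + (negated,))


for rule in rules(TREE):
    print(rule)
print(predict(TREE, {"petal_length": 4.8, "petal_width": 1.5}))
```

Because every prediction follows one printed rule, the user can see exactly why the
model gave a particular answer.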
4. Experiments Based on BigML
BigML and the subset of its features described above offer distinctive and
powerful options for handling data, obtaining accurate predictions and dealing
with missing values without harming the accuracy of the predictive model. This
makes it an appropriate tool for experiments that aim to solve a specific problem
or run statistical tests in a specific field, and many researchers rely on BigML
to build the desired model. This section introduces some of these experiments and
their results:
• Menk and Sebastia (2017): In this experiment, the researchers try to predict the
degree of human curiosity from data collected from Facebook user profiles.
Facebook includes a large number of people from different places with different
opinions, feelings and emotions, who share the details of their lives, so it is a
good place to collect varied data. The study creates several models using
different tools, such as Weka and BigML, in order to compare the results. The
tools were used to generate a model based on three classes of curiosity
(slightly, moderately or extremely curious users). The prediction is based on
different experiments: for example, the first was based on the level of education
and curiosity, and another on the number of cities, places and countries. In each
experiment the result was a strong positive correlation. The three-class model
obtained the best result with BigML, with up to 78.90% of instances correctly
classified.
• Böhmová and Chudán (2018): The main idea here is to analyze data from social
media networks in order to build a model that supports the recruitment process in
modern human resources management. The researchers chose a decision tree, a
machine learning technique specialized in classification and prediction tasks,
and built the model with the help of BigML. Running the model gave a prediction
accuracy of 68% to 84%. The limitation of the model is that it covers only
candidates whose information is available on social media, so it is not
appropriate for finding all people in the labor market.
• Zainudin and Shamsuddin (2018): BigML can also be used in medical experiments.
In this paper the authors build a model to predict the places in Malaysia where
dengue is most prevalent, in order to give people early warning and raise
awareness. The authors collected the data from the Malaysian government open data
portal, specifically Malaysian dengue data from 2010 to 2015. The model created
with BigML shows that in areas with many flats crowded close to each other, dengue
spreads quickly and it takes a very long time to treat all the patients.
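The accuracy figures quoted in these experiments are percentages of correctly
classified instances. The following sketch shows how such a figure can be computed
for a three-class problem. The labels are invented for illustration and are not data
from any of the cited studies; the function names are assumptions.

```python
from collections import Counter


def percent_correct(actual, predicted):
    """Percentage of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) pair to see where the model is wrong."""
    return Counter(zip(actual, predicted))


# Invented labels for five users.
actual = ["slightly", "moderately", "extremely", "moderately", "slightly"]
predicted = ["slightly", "moderately", "moderately", "moderately", "slightly"]
print(percent_correct(actual, predicted))
print(confusion_counts(actual, predicted)[("extremely", "moderately")])
```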
5. BigML and Other Tools
BigML provides a powerful and attractive interface that allows the user to upload
data, build a model and make predictions. The user does not need previous
programming knowledge. For users who are proficient in a particular language,
BigML offers a RESTful API and libraries for the most common platforms and
programming languages, such as Java, Python, Ruby, R and iOS. Within the data
preparation phase, BigML allows users to upload data through the website or
through the API. If the user uploads data without knowing whether the fields are
numeric or categorical, BigML auto-detects the data types. BigML also has
techniques for parsing poorly formatted data. Despite all these features, BigML
does not provide the ability to add new data to an existing dataset or model.
BigML is built to be consumable, which means it focuses on user experience,
interpretability, visualization and the exportability of models so they can be
used everywhere. With BigML you do not need to install anything to get started;
you simply use the computational resources available to you.
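The idea of auto-detecting field types can be sketched as follows: a column is
treated as numeric if every non-empty value parses as a number, and as categorical
otherwise. This is a simplified illustration in plain Python and not BigML's actual
detection logic; the function name infer_types and the sample data are assumptions.

```python
import csv
import io


def infer_types(csv_text):
    """Label each column 'numeric' or 'categorical' from its non-empty values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    types = {name: "numeric" for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name is None:
                continue  # extra cells beyond the header are ignored
            if value is None or value.strip() == "":
                continue  # an empty cell is a missing value, not evidence
            try:
                float(value)
            except ValueError:
                types[name] = "categorical"
    return types


# Invented sample with a missing age and a missing income.
SAMPLE = "age,city,income\n34,Amman,1200.5\n,Irbid,900\n41,Amman,\n"
print(infer_types(SAMPLE))
```

A column with no values at all stays "numeric" in this sketch; a real tool would
need a rule for that case.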
5.1. Weka
Weka is a collection of machine learning algorithms and one of the earliest
machine learning tools; it was started in 1992. It was developed by researchers
and academics, without much engineering focus, around the core of a desktop
application without interactive visualization.
Weka provides many options for handling data, such as preprocessing,
classification, clustering and association. It works with data files in ARFF, CSV
and C4.5 formats, and the user can also import data from a URL or an SQL database.
It provides many algorithms, such as Naïve Bayes, decision trees, support vector
machines and neural networks, which can be used through its API to build
applications and custom tools. However, Weka has limitations that restrict the
user. It cannot work with non-Java-based databases. Moreover, CSV files cannot be
read incrementally, because Weka needs to know the data type of each column (such
as nominal or numeric); this information is not present in CSV files but exists in
ARFF files (the default file format in Weka), which include a header that declares
the attributes. Another limitation is that the data size Weka can handle depends
on the selected algorithm, the computer's memory and the characteristics of the
data, so do not be surprised by an "Out Of Memory Exception" error. Because Weka
is a desktop Java application, a maximum amount of memory is set for running it,
and this should be well below the size of the computer's memory (RAM); the user
should therefore set the maximum Java heap size to a suitable value such as 2 GB
or 3 GB. In addition, using Weka can require writing code, so the user needs
programming skills.
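The difference between CSV and ARFF described above can be seen in a short sketch
that writes an ARFF document, whose header declares the type of every attribute
before the data begins. The sketch assumes that names and values contain no spaces
or commas (real ARFF files need quoting in that case); the function name to_arff
and the sample rows are assumptions.

```python
def to_arff(relation, columns, rows):
    """Build ARFF text. columns is a list of (name, kind) pairs, where kind is
    'numeric' or 'nominal'; rows is a list of lists of strings ('' = missing)."""
    lines = ["@relation " + relation, ""]
    for index, (name, kind) in enumerate(columns):
        if kind == "numeric":
            lines.append("@attribute " + name + " numeric")
        else:
            # A nominal attribute lists every value it can take.
            values = sorted({row[index] for row in rows if row[index] != ""})
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join(value if value != "" else "?" for value in row))
    return "\n".join(lines)


# Invented sample: two attributes, one missing age.
print(to_arff("people",
              [("age", "numeric"), ("city", "nominal")],
              [["34", "Amman"], ["", "Irbid"]]))
```

Note that the nominal values can only be listed after all rows have been seen,
which is consistent with the point that a plain CSV file cannot simply be read
line by line.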
5.2. RapidMiner
RapidMiner was started in 2001 under the name YALE (Yet Another Learning
Environment) and was later renamed RapidMiner. It is open source software designed
for data preprocessing, optimization, validation and visualization, and it
includes many clustering and classification algorithms. The main feature of
RapidMiner is that anyone can analyze data with it, because it does not require
writing code. RapidMiner can use many dataset formats, such as ARFF, CSV, XML,
Excel and Access. Its graphical user interface is easier to use and more effective
than Weka's. RapidMiner also provides many algorithms for flexibility in building
models, such as Naïve Bayes, neural networks, support vector machines and decision
trees, and it provides database connection tools, for example for MySQL. Another
very useful feature is that RapidMiner can include weka.jar and access all the
filters and methods provided by Weka (this does not include the visualization
resources); note that the reverse is not possible. The main disadvantage of this
tool is that it requires a lot of space and therefore sometimes shows the error
"Out Of Space", because it depends on the memory of your computer.
6. Conclusion
In this report we introduced BigML, a tool for handling big data. We also pointed to
two other tools for handling big data, Weka and RapidMiner, and identified the
limitations of each tool. BigML is a very modern tool that continues to develop over
the years, as mentioned previously. No single tool offers every option; there is
always a limitation. Given BigML's distinctive features, we can overlook its
limitation, namely that the only algorithm it uses is the decision tree, especially
since this is an effective algorithm used in many experiments. The following table
compares the three tools:
             | BigML | Weka | RapidMiner
-------------+-------+------+-----------
Size         | Up to 64 GB | Depends on computer memory | Depends on computer memory
Data formats | CSV, ARFF, Excel, Access, URL, and many other sources | ARFF (default), CSV, C4.5, SQL database, URL | ARFF, CSV, XML, Excel, Access
Algorithms   | Decision tree | Naïve Bayes, decision tree, support vector machine, neural network | Naïve Bayes, decision tree, support vector machine, neural network
API          | Provided | Provided | Provided
7. Future Work
In future work we plan to build a model based on information collected from the
LinkedIn social network, in order to discover the most common job vacancies in the
IT field or in other fields.
References
1 Chen, C.L.P. and Zhang, C.Y. (2014) "Data-intensive applications, challenges,
techniques and technologies: A survey on Big Data".
2 Menk, A. and Sebastia, L. (2017) "Are you Curious? Predicting the Human
Curiosity from Facebook", International Journal of Uncertainty, Vol. 25, pp. 79-95.
3 Böhmová, L. and Chudán, D. (2018) "Analyzing Social Media Data for Recruiting
Purposes", Acta Informatica Pragensia, Vol. 7, No. 1, pp. 4-21.
4 Zainudin, Z. and Shamsuddin, S.M. (2018) "Predictive Analytics in Malaysian
Dengue Data from 2010 until 2015 using BigML", Vol. 8, No. 3, pp. 18-30.
5 Rangra, K. and Bansal, K.L. (2014) International Journal of Advanced
Research in Computer Science and Software Engineering, Vol. 4, No. 6, pp. 2016-223.
6 Yao, Y. et al. (2017) "Complexity vs. Performance: Empirical Analysis of
Machine Learning as a Service", London, United Kingdom, pp. 384-397.
7 www.bigml.com
8 www.cs.waikato.ac.nz
9 www.rapidminer.com