
Name: Ajay Nair
Student ID: 17211015
E-mail: [email protected]
Programme: MSc in Computing
Module code: MCM
Date of submission: 10-08-2018
Project Title: Smart City Services and Sentiment Analysis
Supervisor: D.Sc. Antti Knutas

Disclaimer:

A report submitted to Dublin City University, School of Computing MCM Practicum, 2017/2018. I understand that the University regards breaches of academic integrity and plagiarism as grave and serious. I have read and understood the DCU Academic Integrity and Plagiarism Policy. I accept the penalties that may be imposed should I engage in practice or practices that breach this policy. I have identified and included the source of all facts, ideas, opinions, viewpoints of others in the assignment references. Direct quotations, paraphrasing, discussion of ideas from books, journal articles, internet sources, module text, or any other source whatsoever are acknowledged, and the sources cited are identified in the assignment references. I declare that this material, which I now submit for assessment, is entirely my own work and has not been taken from the work of others save and to the extent that such work has been cited and acknowledged within the text of my work. By signing this form or by submitting this material online I confirm that this assignment, or any part of it, has not been previously submitted by me or any other person for assessment on this or any other course of study. By signing this form or by submitting material for assessment online I confirm that I have read and understood DCU Academic Integrity and Plagiarism Policy (available at: http://www.dcu.ie/registry/examinations/index.shtml)

Name(s): Ajay Nair
Date: 10-08-2018

Smart City Services and Sentiment Analysis

Ajay S Nair
School of Computing
Dublin City University

Dublin, Ireland
[email protected]

Abstract: A major part of the world's population resides in urban areas, and the facilities and services of each city matter in every aspect of daily life. Despite steady progress, all major cities lag in one area or another in terms of services, policy making, facilities, and planning. The main objective of this research is to analyze what people think about and expect from city services and, using sentiment analysis, to find the level of satisfaction with each service. The research uses a forum discussion dataset from “Boards.ie”, one of the major discussion forums in Ireland, builds topic models on it, and runs sentiment analysis on the resulting topics. The question that motivates this paper is “What do people expect from a city, and how can this be achieved through a study of current scenarios?”. The findings are expected to highlight areas where improvements are needed and so support better planning and policy-making by city authorities.
Keywords: HTML data scraping, LDA topic model, sentiment analysis, city services, text mining

I. INTRODUCTION
Smart cities are urban areas that use operational and feedback data from a variety of sources, such as power consumption statistics, buying power and trend statistics, employment indexes, traffic congestion, public safety events and social media discussions, to optimize city services [1] and the quality of life of the people residing there. Over the last decade the smart city idea has become prominent. According to 2014 studies [2], there are 26 smart cities around the world, and more are expected to emerge, notably in North America and Europe, by 2025 [2]. This should be facilitated by detailed analysis and planning and by developing and adopting digital systems and technologies that improve the efficiency of city services and the quality of life of urban citizens.

The research goal of this study is to identify the sentiments of people living in Dublin regarding all areas of city services and, with the help of topic modeling and sentiment analysis, to single out the areas that require more focus and improvement.

A. Background and Motivation
Ireland has a fast-growing market, with rapid growth in both the infrastructure and technology sectors. This growth is clearly visible in all parts of Dublin, and the city needs more planning and service schemes to cope with it.

This study deals with city services and the sentiment around topics concerning Dublin, and aims to identify the major service sectors related to Dublin city. Various existing studies address smart city services and sentiment analysis, covering topics such as sentiment analysis, Gensim LDA topic modeling, and the core smart city concept and its benchmarks.

Topic modeling is used to identify and cluster the major topics in a set of documents or data. In this study, LDA topic modeling is carried out with Gensim [3], a Python library designed for large-scale text processing operations such as LDA topic modeling, LSI topic modeling, TF-IDF calculation and tokenization.

Sentiment analysis has a number of applications in the modern world, such as behavior analysis, hate speech detection, crime rate prediction and prevention, satisfaction analysis, social media analysis, e-commerce and digital marketing. In this study it is limited to smart city services, where it deals with what people discuss, debate and expect from a smart city. The most widely used forms of sentiment analysis produce three results or emotions: positive, negative and neutral. Research published in the past few years has sought to overcome the limitations of this basic analysis [4]. In most sentiment analysis cases, the main content carrying the sentiment is identified first, while categorizing the sentiment is the more difficult task.

II. RELATED WORKS
‘The evolution of sentiment analysis’ (Mäntylä, Graziotin and Kuutila, 2016) [5] is highly recommended for learning how and when sentiment analysis started and in which period its growth accelerated. The paper gives a complete overview of what sentiment analysis is, the research areas in which it is applied, and its history. It discusses the possible results of the analysis, trending areas, applications of sentiment analysis, human-behavior-based goal classification, the most-cited papers, and the limitations of current sentiment analysis. Because it covers the whole field, the paper is a good aid for a beginner in sentiment analysis.
Mäntylä, Graziotin and Kuutila conducted their study [5] in 2016 by searching for sentiment analysis in Google Scholar and the Scopus database. They clustered the articles using LDA topic modeling and carried out a manual quantitative analysis. They found that the number of published papers related to sentiment analysis has increased visibly since 2005, and that most of them relate to opinion mining. The paper also shows that citation counts have risen along with the number of papers, and that they surpass those of the much more mature and larger research area of software engineering. The authors classified the analysis methods into three categories: machine learning, natural language processing, and sentiment-analysis-specific methods. A notable change in recent papers is that they concentrate mainly on social media such as Twitter and Facebook, which indicates the current trend in the market and the technology being targeted.

The study ‘Social data sentiment analysis in smart environment’ (Vakali et al., 2013) [6] describes implementations of sentiment analysis in smart platforms. The report shows that the success of sentiment analysis rests on two points: first, how to design processes that closely reflect human behavior, and second, how to implement the idea computationally.

The authors, Vakali, Despoina Chatzakou, Vassiliki Koutsonikola and Georgios Andreadis [6], address the challenge of going beyond the usual polarities of human behavior to accommodate the wider and more complicated emotional processes found in social media opinions. They drew on seminal work in psychology, which helped them categorize human emotions into six types and so create a wider spectrum than the basic dual polarity: anger, disgust, fear, joy, sadness and surprise.

The authors used two main parameters in the computational procedure, intensity and valence, which support the semantic side of the emotional scaling. These parameters help to discover the emotional relevance of tweets and to qualify the strength of the emotions. After relating tweets to the six primary emotions, the next step is data analysis through data summarization (for example, k-means can be used to group tweets with similar emotions). The main advantage of this paper is that it is theoretically and technically sounder and more descriptive than most other papers.

In ‘Multi-Aspect Sentiment Analysis with Topic Models’ (Lu et al., 2011) [7], the authors concentrate on how to classify and mine user ratings and content using different topic models. The paper explains several LDA variants and the differences between them, including types of labeling, performance and efficiency. It describes how documents are represented as mixtures over latent topics, and it also covers multi-grain LDA and segmented topic models. The authors compare several unsupervised and weakly supervised topic models and discuss two multi-aspect sentiment analysis tasks: multi-aspect sentence labeling and multi-aspect rating prediction. For both tasks they used four topic models: LDA, Local LDA, Multi-Grain LDA and the Segmented Topic Model.

Multi-aspect sentence labeling is used here to label and gather reviews of restaurants from different regions for further summarization. Multi-aspect rating prediction predicts implicit aspect-specific star ratings for every review. The authors found that weakly supervised topic models performed well on multi-aspect sentence labeling but worked well for multi-aspect rating prediction only with indirect supervision. Unsupervised topic models, however, performed well only in weak prediction models.

In ‘Large scale data analytics for smart cities and related use cases’ (Barnaghi, 2014) [8], the author emphasizes data mining through technical solutions for handling large data and looks for patterns, co-occurrences and trends in large volumes of data. The presentation uses examples, 101 smart city use cases and many visualizations to depict the findings, and it also visualizes how data analysis works for smart city development. Six stages of data analysis are suggested for smart city projects: data is collected in the first stage; it is filtered and preprocessed in the second; metadata integration and post-processing pattern recognition form the third and fourth stages; the patterns are then analyzed semantically; and the results are visualized in the last stage. The article covers only the basic ideas of data analysis in smart cities, but its visualizations and pictorial representations help the reader to understand the correlation between different areas.

Liangjie Hong and Brian D. Davison conducted the ‘Empirical Study of Topic Modeling’ (Hong and Davison, 2010) [9]. The authors show that training a topic model on aggregated messages improves the quality of the learned model, which significantly boosts performance in real-world classification problems. They used several schemes to train a standard topic model and assessed its quality and effectiveness through experiments. The paper progresses through three stages. The first explains some existing LDA topic models, points out different extensions of LDA and shows how these differ from standard text mining tools. The second describes how LDA and the Author-Topic model work. LDA has a common generative process that applies to all document collections: for each word in a document, LDA picks a topic from the document's distribution over topics and then samples a word from that topic's distribution over words; this is repeated for all words in the document. The Author-Topic model is an extension of LDA in which two latent variables, an author x and a topic z, are considered for each word. The main difference from LDA is that each document has an extra observed variable, its set of authors, so the observed variables for a document are its authors and its words.
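
The generative process described above can be sketched in a few lines of Python. This is a toy illustration only: the two topics, their word distributions and the document's topic mixture are hypothetical values that do not come from the cited paper, and a real LDA implementation infers these distributions from data rather than sampling from known ones.

```python
import random


def generate_document(topic_mixture, topic_word_dists, n_words, rng):
    """Sample one document following the LDA generative story."""
    topics = list(topic_mixture)
    topic_weights = [topic_mixture[t] for t in topics]
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic z from the document's distribution over topics.
        z = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Step 2: pick a word from that topic's distribution over words.
        vocab = list(topic_word_dists[z])
        word_weights = [topic_word_dists[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


# Hypothetical topics and mixture, for illustration only.
TOPIC_WORD_DISTS = {
    "transport": {"bus": 0.5, "tram": 0.3, "traffic": 0.2},
    "housing": {"rent": 0.6, "landlord": 0.3, "apartment": 0.1},
}
DOC_MIXTURE = {"transport": 0.7, "housing": 0.3}

doc = generate_document(DOC_MIXTURE, TOPIC_WORD_DISTS, n_words=10,
                        rng=random.Random(0))
print(doc)
```

Fitting an LDA model amounts to running this story in reverse: given only the words, it estimates the topic mixture of each document and the word distribution of each topic.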

The core of ‘Empirical Study of Topic Modeling’ [9] describes different training schemes and training steps. AT, MSG and TERM are the training schemes used. Twitter data is used to perform two main tasks: predicting popular messages and grouping users by topical category. The experiments showed that document length is directly related to the effectiveness of the topic model, and that aggregating short messages can produce a better-trained model. It was also observed that the AT model extension is not effective for modeling messages or users, and that standard LDA performs better on user-aggregated profiles.

III. RESEARCH PLAN
The main objective of this research is to obtain relevant and accurate results about city data that support future planning and insight-driven approaches to developing a smart city project. The research plan is summarized in the chart below.
CHART 1 – HIGH-LEVEL RESEARCH WORKFLOW

A. Data collection and creating a database
In this stage, the main aim is to collect raw data, clean it and store it in a form that is easily available for analysis. In this project the data set is expected to be in the form of metadata or web-based data, so web data scraping and parsing are essential at this stage (refer to Table 1).

TABLE 1- INFORMATION ABOUT DATA SOURCE AND TYPE OF DATA

Source | Data format | Data size
Boards.ie (http://data.sioc-project.org/download) [16] | Metadata, XML-RDF format | Size on disk: 14.5 MB; sentence count: 3,554; word count: 129,644

B. Feature extraction and LDA Topic Modeling
The data is parsed using the lxml [10], [11] Python library. The raw data is arranged in a tree-shaped structure (refer to Fig. 1), and the parsing uses the ‘etree’ [11] module of the ‘lxml’ library. The output of the scraped data is stored in a .txt text file (refer to Fig. 2).
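
A minimal sketch of this parsing step is given below. It uses the standard library module xml.etree.ElementTree as a stand-in, since its interface is largely compatible with lxml's etree for this purpose. The sample fragment and the choice of the sioc:content element are assumptions made for illustration; the actual layout of the Boards.ie dump may differ.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "sioc": "http://rdfs.org/sioc/ns#",
}

# Hypothetical SIOC-style fragment; the real dump's structure may differ.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:sioc="http://rdfs.org/sioc/ns#">
  <sioc:Post rdf:about="http://example.org/post/1">
    <sioc:content>Dublin Bus was late again this morning.</sioc:content>
  </sioc:Post>
  <sioc:Post rdf:about="http://example.org/post/2">
    <sioc:content>  The new cycle lanes are great.  </sioc:content>
  </sioc:Post>
</rdf:RDF>"""


def extract_posts(xml_text):
    """Walk the element tree and collect the text of every post body."""
    root = ET.fromstring(xml_text)
    posts = []
    for node in root.iterfind(".//sioc:content", NAMESPACES):
        text = (node.text or "").strip()
        if text:  # skip empty posts
            posts.append(text)
    return posts


posts = extract_posts(SAMPLE)
print("\n".join(posts))  # one post per line, ready to be written to a .txt file
```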
At this stage, the filtered data from the discussion forum is clustered into topics (topic modeling) using the Gensim [3] topic modeling algorithms. A bag of words called the Key-list was also used to keep only data related to Dublin. Gensim uses ‘NumPy’ and ‘SciPy’ [12] for performance. It is specifically designed to handle large text collections using data streaming and efficient incremental algorithms, which differentiates it from most other scientific software packages. LDA [13], [14], [15] is used because it is part of Gensim and helps to discover the semantic structure of the documents (meaningful insights that support better decision making) by analyzing the SIOC corpus [16].
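
The Key-list filter can be pictured as a simple keyword match. The sketch below is one possible reading of that step; the keywords and sentences are hypothetical, as the study does not list the contents of its Key-list.

```python
import re

# Hypothetical key-list; the study does not enumerate its actual keywords.
KEY_LIST = {"dublin", "luas", "liffey"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def filter_by_key_list(sentences, key_list):
    """Keep only sentences that mention at least one key-list word."""
    return [s for s in sentences if key_list & set(tokenize(s))]


sentences = [
    "The Luas was packed this morning",
    "Cork has great food",
    "Rents in Dublin keep rising",
]
dublin_sentences = filter_by_key_list(sentences, KEY_LIST)
print(dublin_sentences)
```

Only the sentences that survive this filter would then be tokenized and passed on to the topic model.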

Fig. 1 Example of data-Meta Data format

Fig. 2 Output of html parsed data
C. LDA Visualization
pyLDAvis [17] is a Python library widely used for visualizing topic models. This study uses pyLDAvis [18] for visualization, on the assumption that it improves the understandability of the topic models relevant to this study (refer to Fig. 3).

D. Sentiment Analysis
The most interesting area of this research is sentiment analysis, which is done with the help of ‘thematic coding’ (the topic clustering done in the previous step). Thematic coding groups the data by theme, and each group then undergoes sentiment analysis to find out which emotion suits the data best. The sentiment analysis methods chosen here are VADER, AFINN and TextBlob.

VADER [19] (Valence Aware Dictionary and sEntiment Reasoner) is a fully open-source lexicon and rule-based sentiment analysis tool specifically designed for social media expressions.

AFINN [21] is a wordlist-based approach to sentiment analysis. AFINN is a list of English words rated for valence between -5 and +5, manually labeled by Finn Årup Nielsen in 2009-2011.
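
The wordlist idea can be illustrated with a short sketch in which a sentence score is the sum of the valences of its words. The tiny lexicon below is hypothetical and its values are illustrative only; they should not be read as entries from the published AFINN list.

```python
import re

# Toy lexicon with illustrative valences in the range -5..+5.
# These values are assumptions, not quoted from the published AFINN list.
TOY_LEXICON = {"good": 3, "great": 3, "bad": -3, "awful": -3, "delay": -1}


def wordlist_score(text, lexicon):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


score = wordlist_score("The bus service is great but the delay was bad",
                       TOY_LEXICON)
print(score)  # great (+3) + delay (-1) + bad (-3) = -1
```

A negative total marks the sentence as negative, a positive total as positive, and zero as neutral.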

TextBlob [20] is a Python library for processing textual data that provides common natural language processing (NLP) tasks such as sentiment analysis, noun phrase extraction and classification.

Using different types of sentiment analysis [22] helps to improve the accuracy and precision of the findings. Cross-validation is planned using the positive, negative and overall accuracy scores of the VADER and TextBlob sentiments. However, not every LDA topic may be considered in the final analysis. Inclusion and exclusion criteria are used to identify topics that contain only words related to each other. If a topic is negative (by the majority of sentiments among the three methods) and 50% or more of its words are not related to a similar area, then that topic must be excluded from the study; this exclusion is planned to be done manually.
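
The exclusion rule above can be written down as a small decision function. This is a sketch under stated assumptions: in the study the judgment of which words belong to a similar area is made manually, so the related_words set here is a hypothetical stand-in for that judgment, and the tie-breaking rule in majority_label is also an assumption.

```python
from collections import Counter


def majority_label(labels):
    """Return the label given by most methods; no clear majority gives 'neu'."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) / 2 else "neu"


def should_exclude(topic_words, related_words, labels, threshold=0.5):
    """Exclude a majority-negative topic when at least `threshold` of its
    words are unrelated to a common area."""
    if majority_label(labels) != "neg":
        return False
    unrelated = [w for w in topic_words if w not in related_words]
    return len(unrelated) / len(topic_words) >= threshold


# Hypothetical topic: three of its four words share no common area.
topic_words = ["one", "bit", "thing", "school"]
related_words = {"school"}  # stand-in for the manual relatedness judgment
labels = ["neg", "neg", "pos"]  # e.g. VADER, AFINN, TextBlob

print(should_exclude(topic_words, related_words, labels))
```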

E. Evaluation
The major challenge in this study is the evaluation of results. The plan is to cross-check against a manually labeled dataset, with an expected accuracy in the range of 70-80% given the high volume of data. The positive, negative and overall accuracy scores of the VADER and TextBlob sentiments also provide a secondary validation [23] of this research.

IV. RESULTS
The LDA topic modeling results reveal the most relevant and important topics that appear in the discussion forum. The number of topics produced as LDA [15] output can be chosen. Initially, four topics were considered, which were the most relevant to Dublin and appeared in the discussion forum.

The output of the LDA topic modeling is shown in the example below:
(0, '0.027*"one" + 0.009*"great" + 0.009*"many" + 0.009*"going" + 0.008*"last" + 0.008*"still" + 0.008*"ie" + 0.007*"thing" + 0.007*"bit" + 0.007*"two"')

The output contains the topic number, the most prominent words and their probabilities in that particular topic. In the example above, the first 0 indicates the topic number and 0.027 indicates the probability (2.7%) of the word ‘one’ in topic 0.
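
To make the reading of this output concrete, the sketch below splits a shortened copy of the topic string into (word, probability) pairs and prints each probability as a percentage. It assumes the string uses straight double quotes around each word; the helper name parse_topic is hypothetical.

```python
import re

# Shortened copy of the topic string from the example above.
TOPIC_STRING = '0.027*"one" + 0.009*"great" + 0.009*"many"'


def parse_topic(topic_string):
    """Turn 'weight*"word" + ...' into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(TOPIC_STRING)
for word, weight in terms:
    print(f"{word}: {weight:.1%}")
```

A weight of 0.027 therefore corresponds to 2.7%, and 0.009 to 0.9%.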

pyLDAvis [18] is a Python library mainly used to visualize LDA topic modeling results in an easier, interactive manner. The four main topics of the above example are laid out by the LDA visualization and shown in Fig. 3. The same process is repeated for 8 topics and 12 topics, and the results are visualized using the same method.

In the next stage, an algorithm is run to find the topic number of each word in the corpus. Refer to Table 2, where corpus word (1, 1) has topic id 0 and the probability that the word belongs to topic 0 is 62.47%. This experiment is repeated for different numbers of topics, i.e. for the 4-topic (output shown in Table 2 below), 8-topic and 12-topic LDA models.

TABLE 2- CORPUS AND CORRESPONDING TOPIC NUMBER

Corpus | Probability (*100 = %) | Topic No
(1, 1) | 0.6247 | 1
(2, 1) | 0.6247 | 0
(3, 1) | 0.6245 | 1
(4, 1) | 0.6247 | 0
(5, 1) | 0.6244 | 1
(6, 1) | 0.6244 | 0
(7, 1) | 0.6249 | 2

In the next step, sentiment results are stored in a CSV file. Three different observations are listed for each input sentence: the sentiment scores of the VADER [19], AFINN [21] and TextBlob [20] sentiment analyses are stored in an output CSV file. This experiment is also repeated for different numbers of topics, i.e. for the 4-topic (output shown in Table 3 below), 8-topic and 12-topic LDA models. The output file contains each sentence, the sentence number, the topic number indicating the corresponding topic of each sentence, and the sentiment scores. This output is the input for the next stage of the results (refer to Table 3).

TABLE 3- SENTENCE, CORRESPONDING TOPIC AND SENTIMENT OF THE SENTENCE

Topic Number | VADER Sentiment | AFINN-Sen | TB-Sentiment | TB-Subjectivity
Topic0 | {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0} | 0 | 0 | 0
neutral | {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0} | 0 | 0 | 0
Topic7 | {'neg': 0.293, 'neu': 0.707, 'pos': 0.0, 'compound': -0.6597} | -5 | 0.1 | 0.2
neutral | {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0} | 0 | 0 | 0

In the next stage of the research, an algorithm was run to cluster the sentiments of each topic (refer to Fig. 4) and to find the overall sentiment score of each topic under the three methods chosen earlier.
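
One simple way to derive a per-topic sentiment from sentence-level scores is sketched below. The study does not state how the sentence scores were combined, so averaging is an assumption here, as are the sample rows; the cut-off of 0.05 follows the convention commonly used for VADER compound scores.

```python
from collections import defaultdict
from statistics import mean


def label_from_score(score, threshold=0.05):
    """Map a score in [-1, 1] to Pos / Neg / Neu using a symmetric cut-off."""
    if score >= threshold:
        return "Pos"
    if score <= -threshold:
        return "Neg"
    return "Neu"


def topic_sentiment(rows):
    """Group (topic, score) rows by topic and label each topic by its mean."""
    by_topic = defaultdict(list)
    for topic, score in rows:
        by_topic[topic].append(score)
    return {topic: label_from_score(mean(scores))
            for topic, scores in by_topic.items()}


# Hypothetical sentence-level scores: (topic number, compound score).
rows = [(0, 0.0), (0, 0.4), (7, -0.6597), (7, 0.1), (3, 0.01), (3, -0.02)]
overall = topic_sentiment(rows)
print(overall)
```

Running the same aggregation once per method gives one Pos/Neg label per topic and method, which is the form used in the tables that follow.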

The output of each topic is mentioned in the tables below.

A. LDA with 4 topics
TABLE 4- 4 TOPICS AND VADER, TEXTBLOB AND AFINN SENTIMENTS

Topic | VADER Sentiment Result | AFINN Sentiment Result | TextBlob Sentiment Result
Topic 1 | Pos | Pos | Pos
Topic 2 | Pos | Neg | Pos
Topic 3 | Pos | Pos | Pos
Topic 4 | Neg | Neg | Neg
B. LDA with 8 Topics
TABLE 5- 8 TOPICS AND VADER, TEXTBLOB AND AFINN SENTIMENTS

Topic | VADER Sentiment Result | AFINN Sentiment Result | TextBlob Sentiment Result
Topic 1 | Pos | Pos | Pos
Topic 2 | Pos | Pos | Pos
Topic 3 | Pos | Pos | Pos
Topic 4 | Neg | Neg | Neg
Topic 5 | Pos | Neg | Pos
Topic 6 | Pos | Pos | Pos
Topic 7 | Pos | Pos | Pos
Topic 8 | Pos | Pos | Pos

C. LDA with 12 Topics

TABLE 6- 12 TOPICS AND VADER, TEXTBLOB AND AFINN SENTIMENTS

Topic | VADER Sentiment Result | AFINN Sentiment Result | TextBlob Sentiment Result
Topic 1 | Neg | Neg | Neg
Topic 2 | Pos | Pos | Pos
Topic 3 | Pos | Pos | Pos
Topic 4 | Pos | Pos | Pos
Topic 5 | Pos | Pos | Pos
Topic 6 | Pos | Pos | Pos
Topic 7 | Pos | Pos | Pos
Topic 8 | Neg | Neg | Pos
Topic 9 | Pos | Pos | Pos
Topic 10 | Pos | Neg | Pos
Topic 11 | Pos | Pos | Pos
Topic 12 | Pos | Pos | Pos

D. Validation results
Validation is done in two streams. The first is an accuracy comparison of the sentiment predictions of the VADER and TextBlob methods, which can be regarded as validation on the non-labeled data set (Table 7 and Table 8).

D.1 Accuracy score of VADER Sentiment analysis
TABLE 7: ACCURACY SCORE OF DIFFERENT LDA TOPIC MODEL OUTPUTS USING VADER SENTIMENT

LDA Topic Model | Accuracy Score (%)
4-Topic model | 76.27
8-Topic model | 51.38
12-Topic model | 51.41

D.2 Accuracy score of TextBlob Sentiment analysis
TABLE 8: ACCURACY SCORE OF DIFFERENT LDA TOPIC MODEL OUTPUTS USING TEXTBLOB SENTIMENT

LDA Topic Model | Overall Accuracy Score (%)
4-Topic model | 24.55
8-Topic model | 24.08
12-Topic model | 24.35

In the second stream of validation, sentiment prediction accuracy was observed from a manually labeled data set. Every sentence in the data set was manually tagged as positive, negative or neutral as per the nature of the sentence. The accuracy scores of each method are shown in the table below (Table 9).
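
For reference, the four measures reported for the labeled data can be computed from predicted and manually assigned labels as sketched below. The study does not state how precision, recall and F-score were averaged over the three classes, so macro-averaging is an assumption here, and the sample labels are hypothetical.

```python
def macro_report(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F-score."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_score": sum(f_scores) / n,
    }


# Hypothetical manual tags and predictions for six sentences.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]
report = macro_report(y_true, y_pred)
print({name: round(value * 100, 2) for name, value in report.items()})
```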

TABLE 9: ACCURACY, PRECISION, RECALL AND F-SCORE ON MANUALLY LABELED DATA

Method | Accuracy | Precision | Recall | F-Score
VADER Sentiment | 82.81 | NA | NA | NA
TextBlob Sentiment | 50.75 | NA | NA | NA
Logistic Regression [22] | 53.13 | 55.63 | 55.16 | 52.71
SVM model [22] | 46.88 | 47.98 | 48.02 | 46.82
Naive Bayes model [22] | 50.00 | 48.33 | 48.41 | 48.18
Random Forest model [22] | 62.50 | 70.83 | 65.87 | 61.13
Decision Tree classification model [22] | 53.13 | 59.71 | 56.76 | 50.77
Ensemble approach [22] | 53.13 | 55.63 | 55.16 | 52.71

V. DISCUSSION
This study was based on unsupervised learning, and therefore the possibility of external validation was very limited. The main aim of studies like this is to gain insight into the topics and to encourage new studies that continue this work. Most reputable case study papers are based primarily on internal validity and construct validity rather than on external validity [23].

Comparing the sentiment predictions of the different methods, VADER and TextBlob predict the output in almost the same pattern (both are lexicon-based) (refer to Table 4 and Table 5). A change in prediction pattern is observed only when the number of topics is greater (refer to Table 6). The AFINN method, however, is a wordlist-based sentiment classification and shows different outputs from the other two methods more often (refer to Table 4, Table 5 and Table 6).

Accuracy comparison is another way to distinguish the better method for sentiment prediction between TextBlob and VADER. VADER shows much higher accuracy than TextBlob in predicting the sentiment of normal (non-labeled) input: around 76% for VADER on the 4-topic model, compared with around 24% for TextBlob (refer to Table 7 and Table 8).

Interestingly, however, the accuracy of the VADER sentiment analysis decreases as the number of topics in the LDA model increases, from 76.27% with 4 topics to about 51% with 8 and 12 topics. TextBlob shows steady accuracy in sentiment prediction, although its accuracy is much lower than that of VADER.

Another interesting fact is that on the labeled dataset (Table 9) the accuracy of VADER is even higher than that of well-known methods such as Logistic Regression, SVM, Naive Bayes, Random Forest, Decision Tree classification and an ensemble approach using a voting classifier [22]. All of these are well-known supervised learning methods. Among them, Random Forest gives a much higher accuracy than the others (refer to Table 9).
VADER shows a high accuracy rate on both labeled and unlabeled data. TextBlob's accuracy, however, indicates that its results are undependable for this study. The other methods are used here only for cross-validation and comparison and are not examined in depth in this research.

The results of this study give insight for new policymaking on different topics related to city services and help to gauge the overall pulse of people's reactions to topics related to Dublin. Here, internal validation is performed by comparing the results of one method with another; for example, the results of the VADER analysis are compared with the results of AFINN or TextBlob, and vice versa.

In contrast to the internal validation described above, an attempt was also made at external validation using manual tagging and supervised learning, which helps to cross-validate the results and gives this research a second form of verification.
VI. CONCLUSION & FUTURE WORK
Data preprocessing and cleaning is a vital part of most machine learning projects and is explicitly included in the CRISP-DM lifecycle. The data set is in metadata format, and the major challenge in this study was scraping the data and arranging it in a usable format. The Gensim library played a vital role in processing this large amount of data, and LDA helped to produce refined topic models with less chance of topic irregularities. The prime advantage of this study is that it points out the topics that are important to the public and how the public reacts to each of them.

The objective of this study was to capture the sentiment of people regarding city services. The results indicate that people are mostly happy with current services, and the same trend holds whether the number of topics is increased or decreased. Only a few areas need improvement and more care. However, not all topics extracted by LDA modeling are considered in the final analysis: after applying the inclusion and exclusion criteria, topic number 4 in the 4-topic LDA model is ignored. Only a few areas, such as schooling, gaming and old buildings, show a need for improvement, and people express negative sentiment in these areas.

This study indicates that improvements are needed in a few areas. It helps practitioners to understand the lagging areas and to direct effort towards resolving them. However, practitioners need further insight into the depth of and reasons behind the problems and into how to improve the lagging areas.

In this research, papers from different research areas such as sentiment analysis and topic modeling were combined and discussed. A sample project presentation was also studied for a better understanding of smart city projects (refer to Related Works above). It is clearly visible that the importance of sentiment analysis is increasing day by day and that its scope is wide [2]. Topic modeling is understood to be an integral part of sentiment analysis, and most researchers prefer the LDA method for it (refer to [24], [25] for detailed topic model reviews and comparisons). The basic goal of this research is to support a smart city project by applying the most appropriate topic modeling and running a sentiment analysis in order to obtain better actionable knowledge and a decision support mechanism. Most of the previous studies mentioned in the related works emphasized topics such as “how to do smart city analysis and the future scope of smart cities in the economic and social areas of life”. This study, by contrast, concentrates on the practical side of sentiment analysis and points out which specific areas need more focus in order to move towards smart city status.

The main limitation of this study was the difficulty of filtering out data that relates only to city services, mainly because of the vast scope and availability of data. The second major limitation relates to the accuracy of topic prediction. Even though LDA is a proven method, it sometimes fails to group related words in the same topic, which reduces the dependability and reliability of the topics. This study helped to find the main topics where people feel discomfort but did not cover the reasons behind this discomfort in depth.

The sentence-by-sentence sentiment analysis table helps future studies to categorize these topics further into different sections and to drive further studies on smart city development. The scope of this study varies from person to person according to their requirements. The main advantage of this study is that it drew on several areas of machine learning (topic modeling, lexicon-based sentiment analysis and supervised classification), though not image analysis or neural networks. An extension of this study that can predict the reasons for the negative sentiment and suggest resolving measures has wide scope in smart city studies. This study is a halfway mark towards that ultimate aim.

The next step was to screen titles and abstracts to determine which ones to accept for full article screening. This was accomplished using two reviewers who rated each article for inclusion or exclusion based on predefined criteria. Disagreements among raters were settled by a third reviewer. Inclusion criteria were as follows: (i) Objective or self-report measurement of physical activity or body mass (e.g. height and weight, skin-fold or waist circumference); (ii) Measurement, either perceived (e.g. participant self-report) or objective (e.g. geographic information systems (GIS) mapping of objective environmental data or neighbourhood audits) of at least one of the 10 smart growth principles and (iii) Publication in a peer-reviewed journal. Exclusion criteria were (i) Papers that focused primarily on socioeconomic characteristics of a geographic area, neighbourhood problems, social cohesion, social capital, or total city or town size; (ii) A target population consisting mostly of senior citizens (because of functional limitations that may limit their physical activity); (iii) Instrument validation studies; (iv) Papers that were reviews, case reports, editorials, commentaries, discussions or letters and (v) Behavioural interventions without an environmental component (e.g. walking programmes, fitness education classes, etc.). Full articles of the accepted titles and abstracts were then screened using the same dual rater system and against the same criteria.
VII. ACKNOWLEDGMENT
I would like to thank my supervisor, D.Sc. Antti Knutas, for his support and direction in helping me to complete this research and document. I would also like to thank Dr. John Breslin (NUIG) for providing access to the SIOC Corpus database.

VIII. REFERENCES
1 C. Harrison, B. Eckman, R. Hamilton, P. Hartswick, J. Kalagnanam, J. Paraszczak, and P. Williams, “Foundations for Smarter Cities,” IBM J. Res. Dev., vol. 54, no. 4, pp. 1–16, Jul. 2010.

2 Frost & Sullivan, “Frost & Sullivan: Global Smart Cities market to reach US$1.56 trillion by 2020.” Online. Available: http://www.prnewswire.com/newsreleases/frost–sullivan-global-smart-cities-market-toreach-us156-trillion-by-2020-300001531.html. Accessed: 30-May-2016.

3 Řehůřek, R. and Sojka, P. (2010). Software Framework for Topic Modeling with Large Corpora. online https://radimrehurek.com/gensim. Available at: http://is.muni.cz/publication/884893/en Accessed 4 Aug. 2018.
4 Wilson, T., Wiebe, J. and Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. online Available at: http://delivery.acm.org/10.1145/1230000/1220619/p347-wilson.pdf?ip=51.171.78.86&id=1220619&acc=OPEN&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E6D218144511F3437&__acm__=1533388406_8f45c230c6f0ffd6e2ad9592cbdcf4c6 Accessed 4 Aug. 2018.
5 Mäntylä, M. V., Graziotin, D. and Kuutila, M. (2016). The evolution of sentiment analysis—A review of research topics, venues, and top cited papers. online Available at: https://arxiv.org/ftp/arxiv/papers/1612/1612.01556.pdf Accessed 18 Nov. 2017.
6 Vakali, A., Chatzakou, D., Koutsonikola, V. and Andreadis, G. (2013). Social data sentiment analysis in smart environment extending dual polarities for crowd pulse capturing. online Available at: http://oswinds.csd.auth.gr/sen_2_soc/wp-content/uploads/2013/09/DATA13-Vakali-camera.pdf Accessed 29 Oct. 2017.
7 Lu, B., Ott, M., Cardie, C. and Tsou, B. (2011). Multi-aspect Sentiment Analysis with Topic Models. online Available at: https://www.cs.cornell.edu/home/cardie/papers/masa-sentire-2011.pdf Accessed 5 Nov. 2017
8 Barnaghi, P. (2014). Large scale data analytics for smart cities and related use cases. online Available at: http://ec.europa.eu/information_society/newsroom/cf/dae/document.cfm?action=display&doc_id=7686 Accessed 12 Oct. 2017
9 Hong, L. and Davison, B. D. (2010). Empirical Study of Topic Modeling in Twitter. online Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.180.5941&rep=rep1&type=pdf Accessed 18 Nov. 2017.
10 Anon, (n.d.). lxml – XML and HTML with Python.
11 Bicking, I. (2008). lxml: an underappreciated web scraping library. online Available at: http://www.ianbicking.org/blog/2008/12/lxml-an-underappreciated-web-scraping-library.html Accessed 4 Aug. 2018.
12 Jones, E., Oliphant, T. and Peterson, P. (2001). SciPy: Open Source Scientific Tools for Python. online Available at: http://www.scipy.org/ Accessed 4 Aug. 2018.
13 AlSumait, L., Barbará, D., Gentle, J. and Domeniconi, C. (2009). Topic Significance Ranking of LDA Generative Models. online Available at: https://link.springer.com/chapter/10.1007%2F978-3-642-04180-8_22 Accessed 4 Aug. 2018.
14 Li, S. (2018). Topic Modeling and Latent Dirichlet Allocation (LDA) in Python. online Available at: https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24 Accessed 4 Aug. 2018.
15 Anon, (2018). Latent Dirichlet allocation. online Available at: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation Accessed 4 Aug. 2018.
16 Breslin, J. and Bojars, U. (n.d.). sioc-project. online Available at: http://sioc-project.org/ Accessed 4 Aug. 2018.
17 Mabey, B. (n.d.). pyLDAvis. online Available at: https://github.com/bmabey/pyLDAvis Accessed 4 Aug. 2018.
18 Sievert, C. and Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. online Available at: https://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf Accessed 4 Aug. 2018.
19 Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
20 Loria, S. (n.d.). TextBlob: Simplified Text Processing. online Available at: https://textblob.readthedocs.io/en/dev/ Accessed 4 Aug. 2018.
21 Finn Årup Nielsen, “A new ANEW: evaluation of a word list for sentiment analysis in microblogs”, Proceedings of the ESWC2011 Workshop on ‘Making Sense of Microposts’: Big things come in small packages. Volume 718 in CEUR Workshop Proceedings: 93-98. 2011 May. Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, Mariann Hardey (editors).
22 Wu, X., Kumar, V., Quinlan, J., Ghosh, J. and Yang, Q. (2007). Top 10 algorithms in data mining. online Available at: http://www.realtechsupport.org/UB/CM/algorithms/Wu_10Algorithms_2008.pdf Accessed 4 Aug. 2018.
23 Gibbert, M., Ruigrok, W. and Wicki, B. (2008). What Passes as a Rigorous Case Study? online Available at: https://www.jstor.org/stable/40060241 Accessed 4 Aug. 2018.
24 Liu, Z. (2013). High Performance Latent Dirichlet Allocation for Text Mining. online Available at: https://pdfs.semanticscholar.org/6390/31a930df256987a1a230e319e19d3b0c2b84.pdf Accessed 4 Aug. 2018.
25 Blei, D. M. (2012). Probabilistic Topic Models. online Available at: https://dl.acm.org/citation.cfm?id=2133826 Accessed 4 Aug. 2018.