CHAPTER 1 INTRODUCTION TO PROMETHEUS MONITORING SYSTEM

Prometheus is an open-source system monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project, after Kubernetes. The software was created to meet the need to monitor the many microservices that may be running in a system. Its architecture is modular, and it comes with several readily available modules called exporters, which help capture metrics from the most popular software. Prometheus is written in Go and ships as easily distributed binaries, so it can be up and running quickly.

Prometheus works well for recording any purely numeric time series. It fits both machine-centric monitoring and the monitoring of highly dynamic service-oriented architectures. In a world of microservices, its support for collecting and querying multi-dimensional data is a particular strength. It is designed for reliability: it is the system to turn to during an outage in order to diagnose problems quickly. Each Prometheus server is standalone and does not depend on network storage or other remote services. It can be relied upon when other parts of the infrastructure are broken, and there is no need to set up extensive infrastructure to use it.

1.1 State of the Art Development
Cloud computing has gained importance over the years, and reasonable performance is now available on cloud machines such as Amazon EC2. Analysing the performance of a cloud deployment depends on various metrics such as compute, memory, network and I/O [1]. Managing distributed systems infrastructure requires a dedicated set of tools. A monitoring solution provides the tool that visualizes the current operational state of all systems and notifies operators when a failure occurs. Its core layer initiates monitoring, stores the collected data, visualizes metrics and triggers alert notifications when needed. Prometheus is an open-source tool used for collecting monitoring metrics. It stores every time series as timestamped values identified by a metric name and key-value pairs. Data is gathered in pull mode over HTTP [2].

As companies shift from desktop applications to cloud-based software as a service (SaaS), there is constant competition to provide better services and a good quality of service (QoS) for their users. QoS is best achieved by allocating more resources whenever they are required and releasing them when they are not. This can be done by estimating demand through workload prediction. A cloud workload prediction module for SaaS providers based on the autoregressive integrated moving average (ARIMA) model has been proposed as an effective approach for such predictions [3]. The workload of a particular cloud machine depends on various parameters such as CPU usage and network traffic [4]. The built-in monitoring system (CloudWatch) offers limited facilities and does not provide all the required metrics [5].

AWS provides certain monitoring facilities based on the requirements and the applications installed. Best practices include using automated tools to monitor web applications, monitoring all environments, performance-testing both the data centre and AWS, and estimating the cost of usage [6]. Overall usage figures and graphs are collected dynamically over time. Time-series analysis becomes difficult unless it is carried out efficiently and automatically. Python offers various functions for analysing time-series data through a library called statsmodels [7]. Time-series data usually arrive in various formats and sequences, which calls for unsupervised real-time anomaly detection [8].
1.2 Motivation
Servers are commonly monitored manually, with corrective action taken only after an error has occurred. With a large number of machines, doing all of this work by hand becomes tedious. A solution is therefore needed that automates monitoring and alerting in real time, and this is what motivated the development of the Prometheus Monitoring System.

1.3 Problem Statement
The world has adapted itself to the use of DevOps and microservices. This shift adds a great deal of complexity. Instead of having to monitor one system, the challenge is to oversee a multitude of services. Even though numerous monitoring systems are available, not all of them are fit for monitoring large, distributed systems. Prometheus is a white-box monitoring and alerting system that is designed for large, scalable environments. With Prometheus, these challenges can be addressed by exposing the internal state of applications. By monitoring this internal state, alerts can be raised and action can be taken upon certain events. Hence, the aim is to deploy a monitoring and alerting system for the system architecture of MeTripping using the Prometheus tool.

1.4 Objectives
A number of objectives are set for the successful implementation of the project. The objectives of the Prometheus Monitoring System are listed below:
To design a monitoring and alerting system for the system architecture of MeTripping and continuously check the health of all the servers deployed in AWS by the company.

Metrics for the whole system are classified as:
System Metrics, which include CPU utilization, memory utilization and disk I/O of all the servers.

Application Metrics, which are customized for the specific application deployed on each server.

Business Metrics, which include user metrics, hotel bookings, flight bookings and so on.

To create dashboards for each of the above metrics, analyse the entire system from time to time, and create alerts about system performance through data analysis.

To perform predictive analysis of the data using the ARIMA model.

1.5 Scope
The system is designed to monitor the company's existing AWS servers. Prometheus scrapes metrics from these instances with the help of exporters deployed on each machine instance, chosen according to the type of application service running on it. The scrape interval is set to 5s, so the performance of these servers is checked continuously. Graphical visualization through Grafana dashboards enables easy understanding of the usage of the whole system. Additional servers can be provisioned if the search traffic to the website increases. This continuous monitoring practice makes efficient usage of resources possible. In addition, predictive analysis of the time-series data is done to identify patterns in the various metrics.

1.6 Methodology
Initially, an AWS machine named “mt_monitor” is created, which is responsible for monitoring the MeTripping system architecture. AWS configuration is done on this machine to extract information about all the servers through the AWS CLI (Command Line Interface). All the production and staging machines require monitoring, so those machines are extracted. Prometheus is installed on the mt_monitor machine. Various exporters, such as Node exporter (to scrape system metrics), Postgres exporter, Redis exporter, MongoDB exporter, Elasticsearch exporter and CloudWatch exporter, are installed based on the specific application running on each AWS instance. PromQL is used as the query language to analyse various metrics and display them on an HTTP web page. To enhance the visualization, Grafana, a data visualization tool to which Prometheus can be added as a data source, is used to create dashboards for the various metrics. Predictive analysis is done using the ARIMA (Autoregressive Integrated Moving Average) model on the real-time time-series data.
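The querying step of this methodology can be illustrated with a minimal Python sketch. It is not the project's actual code: it assumes that Prometheus is reachable on port 9090 of the local machine, and the helper names build_range_query_url and extract_series are hypothetical. It builds a URL for the range-query endpoint of the Prometheus HTTP API and converts a JSON response into plain Python lists.

```python
import json
from urllib.parse import urlencode

# Assumption: Prometheus listens on port 9090 of the monitoring machine.
PROMETHEUS_URL = "http://localhost:9090"


def build_range_query_url(query, start, end, step, base_url=PROMETHEUS_URL):
    """Build a URL for the Prometheus HTTP API range-query endpoint."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


def extract_series(response_text):
    """Map each label set (as a JSON string) to its list of (timestamp, value) pairs."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    series = {}
    for item in payload["data"]["result"]:
        key = json.dumps(item["metric"], sort_keys=True)
        # Prometheus returns sample values as strings, so convert them to floats.
        series[key] = [(float(ts), float(val)) for ts, val in item["values"]]
    return series
```

The URL would be fetched with any HTTP client, and the response body passed to extract_series. The resulting pairs could then be written to the CSV files used for predictive analysis.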

1.7 Organization of the Report
This section gives a broad picture of the various chapters in the report.

Chapter 2 is an overview of the project, which describes the details of the domain in which the project is carried out.

Chapter 3 is the Software Requirement Specification, which describes the user characteristics, assumptions and dependencies, constraints and functional requirements of the project.

Chapter 4 is the High Level Design, which covers the design phase of the Software Development Life Cycle. This chapter discusses design considerations such as general constraints, development methods and architectural strategies. It also explains the project's System Architecture and Data Flow Diagrams.

Chapter 5 is the Detailed Design, which explicates the project modules. The functionality of each module is explained in this chapter, along with the structural diagram of each system.

Chapter 6 is Implementation, which describes the technology used in the system. This chapter also explains the programming language, development environment and code conventions followed.

Chapter 7 is Software Testing, which elaborates on the test environment and briefly explains the test cases executed during unit testing, integration testing and system testing.

Chapter 8 is Experimental Results, which presents the results of the experimental analysis on the available data and the inferences drawn from them.

Chapter 9 is the Conclusion, conveying the summary, limitations and future enhancements of the project.

This chapter gives an introduction to the Prometheus monitoring system and clarifies the motivation, objectives and problem statement. The task at hand is explained through a detailed problem statement, objectives and methodology. The state of the art shows the latest developments in the field. It gives a clear picture of the existing possibilities and of those most suitable for the project requirements.

2.1 Design Considerations
2.1.1 General Constraints
Robustness: The software should withstand errors and handle all exceptions. The system should work well for all use cases.
Security: The software should not be vulnerable to attack. It should allow only authorized access.
Usability: The software should be easy to use for all users.
Accuracy: The software should give accurate predictions based on the data.
2.1.2 Development Methods

Figure 2.1 Agile software development life cycle
Agile software development life cycle:
The software was developed iteratively and delivered module by module. The requirements changed from time to time, and the software had to go through many changes during development.
Small chunks were developed at a time, each of which required some changes to be incorporated into the system. Meanwhile, the developed modules were tested, and feedback was collected continuously and incorporated into the system.
The initial version of the software was released with some simple functionality. Subsequent changes, feedback and updated requirements then added further improvements to the software.

2.2 Architectural Strategies
2.2.1 Programming Language
Python- Python is a powerful programming language whose built-in modules make development faster and easier. The classes and methods are developed in Python, and the prediction model uses some Python libraries.
PromQL- The queries for extracting data and generating the graphs are written in the PromQL query language.
2.2.2 Future Plans
In the future, the system is intended to take corrective actions automatically, using AI agents to handle all aspects of failure and recovery. Planned enhancements also include a chatbot that answers a limited set of queries about usage statistics and data analysis.

2.2.3 User Interface Paradigm
The user is provided with a dashboard for the results and reports generated. The dashboard offers features such as querying the data and viewing statistics about resource usage.

The predictive analysis is shown in the console of the PyCharm IDE. The user is given a set of values from which to get an idea of the usage.

2.2.4 Error Detection and Recovery
Error detection is carried out through user testing, and a Slack bot has been set up to report bugs in the system. The ARIMA model has been tested on different datasets to assess the efficiency of the system.

Recovery is handled by alerting the user about a crash through the automated Slack system, after which the system's previous stable state is restored.
2.2.5 Data Storage Management
The data are extracted from the exporters and stored in a CSV file. The extraction happens at an interval of 5 seconds. As the data will not be accessed or modified frequently, they are stored on stable storage within the machine running the programs.
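A minimal sketch of the CSV storage step is given below, using only the Python standard library. The column layout in FIELDNAMES, the function name write_samples and the sample values are hypothetical and are chosen purely for illustration. An in-memory buffer stands in for the CSV file.

```python
import csv
import io

# Hypothetical column layout for the stored samples.
FIELDNAMES = ["timestamp", "instance", "metric", "value"]


def write_samples(stream, samples, write_header=False):
    """Write metric samples (dicts keyed by FIELDNAMES) to an open CSV stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    if write_header:
        writer.writeheader()
    for sample in samples:
        writer.writerow(sample)


# In practice the stream would be a file opened with open(path, "a", newline="").
buffer = io.StringIO()
write_samples(
    buffer,
    [
        {"timestamp": 0, "instance": "server-1", "metric": "cpu_utilization", "value": 0.42},
        {"timestamp": 5, "instance": "server-1", "metric": "cpu_utilization", "value": 0.47},
    ],
    write_header=True,
)
csv_text = buffer.getvalue()
```

Opening the file in append mode suits data that is written every few seconds and rarely modified afterwards.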

2.2.6 Communication Mechanism
Prometheus uses the HTTP protocol to communicate with its targets. It pulls the raw resource-usage data from the exporters by sending HTTP requests to the ‘/metrics’ endpoint that each exporter exposes.

Grafana also uses the HTTP protocol to retrieve data from Prometheus, which it queries as a data source through the Prometheus HTTP API.
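To make the ‘/metrics’ exchange more concrete, the following sketch parses a small sample of the text that an exporter returns. It is a simplified illustration rather than a complete parser: the function name parse_metrics_text is hypothetical, and it assumes that there are no timestamps after the value and no commas or escaped quotes inside label values. The sample text is hand-written in the style of node exporter output.

```python
def parse_metrics_text(text):
    """Parse simple exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        name_part, _, value_part = line.rpartition(" ")
        labels = {}
        if "{" in name_part:
            name, _, label_part = name_part.partition("{")
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        else:
            name = name_part
        samples.append((name, labels, float(value_part)))
    return samples


# Hand-written sample of what an exporter might return from /metrics.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.25
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""
parsed = parse_metrics_text(SAMPLE)
```

Each parsed tuple corresponds to one sample that Prometheus would record, with the labels supplying the dimensions used later in PromQL queries.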

2.2.7 Graph Generation Mechanism
Prometheus provides a query language called PromQL, which is used to aggregate the extracted data. The graphs are generated from the results of these queries.
2.3 System Architecture
The system architecture follows a typical style in which the system is built from separate modules and microservices.

Figure 2.2 System Architecture
2.4 Data Flow Diagrams
2.4.1 Data Flow Diagram – Level 0

Figure 2.3 Data Flow Diagram – Level 0

The initial step is to collect the data from the system (AWS) and store them in a CSV file for further analysis. Prometheus is used for real-time monitoring of the AWS instances and for generating usage graphs.

2.4.2 Data Flow Diagram – Level 1

Figure 2.4 Data Flow Diagram – Level 1

Exporters are installed to extract the metrics from the AWS instances. The Prometheus monitoring tool uses these metrics to generate usage graphs, and the extracted data are stored in a CSV file for further analysis.
2.4.3 Data Flow Diagram – Level 2

Figure 2.5 Data Flow Diagram – Level 2
Different exporters are installed to get the metrics from different instances, and Prometheus uses each exporter to obtain the data for generating graphs and usage statistics.

Predictive analysis is done on the stored data using the ARIMA model.

3.1 Structure Chart
[Structure chart: Monitoring and Alerting System; Predictive Model; Analysis data; Raw Data; Stored Data]

Figure 3.1 Structure Chart for Prometheus Monitoring System
The above figure is the structure chart of the developed system, in which the main monitoring system at the top interacts with the two subsystems beneath it. The monitoring system is mainly supported by the Prometheus tool, which gets the data and displays the usage statistics. The alerting system is the predictive analysis, which is done using the ARIMA (Autoregressive Integrated Moving Average) model.
3.2 Functional Description of the Modules
3.2.1. Prediction Module- This module is developed in Python. It uses the ARIMA model to predict failure of the system, and it relies on some built-in and third-party libraries for complicated operations.

This module includes code for some mathematical operations and presents its output in GUI form.

3.2.2. Data Store Module- This module extracts the data (metrics) from the system and stores them in a CSV file for further analysis. It uses some built-in libraries for extracting the data and storing them in the specified CSV file.

3.2.3. Data Cleansing Module- This module handles the NULL values in the datasets. It uses the pandas library to calculate the mean for the particular type of instance and to replace the NA or NULL values in the datasets with it.
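The mean-replacement idea behind the data cleansing module can be sketched as follows. This is a pure-Python stand-in for the pandas-based module, not the project's code: the function name fill_missing_with_mean is hypothetical, and missing readings are assumed to be represented by None.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Two readings are missing; the mean of the observed readings is 20.0.
cleaned = fill_missing_with_mean([10.0, None, 30.0, None])
print(cleaned)  # [10.0, 20.0, 30.0, 20.0]
```

With pandas, a comparable effect is commonly obtained by calling fillna on a column with that column's mean.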
3.2.4. Efficient Parameter Prediction Module- This module finds the values of the ARIMA model parameters for which the mean squared error is minimum.
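One common way to realise such a parameter search is a grid search over candidate (p, d, q) orders, sketched below under stated assumptions. The function names are hypothetical, and toy_fit_predict is only a stand-in for a real ARIMA fit so that the example runs with the standard library alone. In the actual module a statsmodels-based forecaster would presumably be plugged in instead.

```python
from itertools import product


def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def grid_search(train, test, orders, fit_predict):
    """Return (best_order, best_mse) over the candidate (p, d, q) orders."""
    best_order, best_mse = None, float("inf")
    for order in orders:
        try:
            forecast = fit_predict(train, order, len(test))
        except Exception:  # some orders may fail to fit; skip them
            continue
        mse = mean_squared_error(test, forecast)
        if mse < best_mse:
            best_order, best_mse = order, mse
    return best_order, best_mse


def toy_fit_predict(train, order, steps):
    """Stand-in for an ARIMA fit (NOT ARIMA): repeats the mean of the last p values."""
    p, _d, _q = order
    window = train[-p:] if p > 0 else train
    level = sum(window) / len(window)
    return [level] * steps


# Candidate orders: p in 0..2, d in 0..1, q in 0..1.
orders = list(product(range(0, 3), range(0, 2), range(0, 2)))
train, test = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [6.0, 6.0]
best_order, best_mse = grid_search(train, test, orders, toy_fit_predict)
print(best_order, best_mse)  # (1, 0, 0) 0.0
```

Passing the forecaster in as a function keeps the search loop independent of any particular modelling library.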

Implementation is the significant phase in which the project is built according to the design requirements so that it meets the specified objectives. In this phase the low-level designs are transformed into language-specific programs. The phase requires the actual implementation of the ideas described in the analysis and design phases, following the methodology already described.

4.1 Programming Language Selection
Python is the programming language used for coding and PromQL is the query language used to analyse the data.

The diverse application of the Python language is a result of the combination of features which give this language an edge over others. Some of the benefits of programming in Python include:
The Python Package Index (PyPI) contains numerous third-party modules that make Python capable of interacting with most of the other languages and platforms.

Python provides a large standard library which includes areas like internet protocols, string operations, web services tools and operating system interfaces.

The Python language is developed under an OSI-approved open source license, which makes it free to use and distribute, including for commercial purposes.

Python offers excellent readability and an uncluttered, simple-to-learn syntax, which helps beginners to use the language. The code style guidelines, PEP 8, provide a set of rules to facilitate the formatting of code.

Python has built-in list and dictionary data structures, which can be used to construct fast runtime data structures.

Python has a clean object-oriented design, provides enhanced process control capabilities, and possesses strong integration and text processing capabilities and its own unit testing framework, all of which contribute to development speed and productivity.
4.2 Platform Selection
The system is designed to work on both Windows and Linux operating systems. All the implementations and Docker environments are created on specific AWS machine instances.

A monitoring machine is created and configured with AWS credentials. Grafana is used as the platform for graphical visualization of the real-time time-series data. For predictive analysis of the time-series data, the Spyder IDE from Anaconda is chosen. For graphical plotting, Python libraries such as matplotlib are used.
4.3 Code Conventions
This section discusses the coding standards followed throughout the project. It also covers the software applications that are necessary to complete the project. Proper coding standards should be followed because a large project needs to be coded in a consistent style. Comments have to be provided for all modules to increase the readability of the code, making it easier to understand any part of it without much difficulty. Code conventions are important because they improve the understandability of the software, allowing programmers to understand the code clearly.


4.3.1 Naming Conventions
Naming conventions help make programs more user-friendly by improving readability. The names given to scripts, packages, graphs and modules should be clear and precise, so that they are relevant to what they represent and their contents can easily be understood. The conventions followed for this project are as follows:
Methods: Method names are generally verbs, and lower case is used for them. Example: fit().

Variables: Variable names must be short and meaningful. Example: train, test, model.
4.3.2 File Organization
The files used to implement the project were organized and kept in a certain order based on their types.
The configuration file for Prometheus, which contains specific targets for each exporter, is stored in /etc.
The Docker environment is set up for each database, such as Postgres, MongoDB and Redis. dockercompose.yml is the configuration file written to store the credentials and the Docker run environment for each of the respective containers.

The data generated by querying Prometheus metrics are stored in CSV files, which in turn are used as input for predictive time-series analysis.

4.3.3 Declarations
The declarations and conventions followed are the ones specified in the standard. Standard names are given, which makes it straightforward to understand each entity and its role. More than one declaration per line is not allowed, both to leave room for comments and to reduce ambiguity.

Comments are an important part of any coding convention, as they enhance the understandability of the code. In Python, a comment starts with the character ‘#’, and everything after the ‘#’ on that line is ignored by the interpreter. In the project files, commented areas are generally shown in a different color by default, so they are easy to identify.

Comments are valuable for clarifying what a specific part of the code does, particularly if the code depends on certain assumptions or performs subtle actions. Any new developer can understand previously written code if it is well documented with comments.


4.4 Difficulties Encountered and Strategies Used to Tackle
This section discusses the difficulties encountered in the development of the project.
4.4.1 Selection of metrics for each application
The company uses multiple NoSQL databases to satisfy different requirements. Selecting appropriate metrics for monitoring each database, based on the function it performs, required a lot of analysis of the behavior of these databases.

4.4.2 CloudWatch monitoring
The CloudWatch exporter depends on the Amazon CloudWatch API, which is not free beyond a fixed quota. It provides 1 million free API requests per month, and any request beyond that is charged. Monitoring had to be done for 20 machine instances, with an API request per machine instance every 5 seconds. Hence the total number of CloudWatch requests exceeds the available free quota every month.
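The size of the shortfall can be checked with a short calculation. The sketch below assumes a 30-day month and exactly one API request per machine instance per 5-second interval. The real request count per scrape may be higher, which would only widen the gap.

```python
FREE_REQUESTS_PER_MONTH = 1_000_000   # free quota stated above
INSTANCES = 20                        # machine instances monitored
INTERVAL_SECONDS = 5                  # assumption: one request per instance per interval
DAYS_PER_MONTH = 30                   # assumption: a 30-day month

requests_per_day = INSTANCES * (24 * 60 * 60 // INTERVAL_SECONDS)
requests_per_month = requests_per_day * DAYS_PER_MONTH
days_until_quota_used = FREE_REQUESTS_PER_MONTH / requests_per_day

print(requests_per_day)                  # 345600
print(requests_per_month)                # 10368000
print(round(days_until_quota_used, 1))   # 2.9
```

Under these assumptions the monthly volume is roughly ten times the free quota, which is exhausted in under three days.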

4.4.3 Conflict with private IP addresses:
The AWS machines provisioned are in different zones. Node exporters were installed directly on each of these instances to collect system metrics. Since each machine belonged to a different zone, the machines could not interact through their private IP addresses. The target configuration of Prometheus was therefore changed to use their public IP addresses. The scrape interval was also increased to 45 seconds, as a 5-second interval wasn’t sufficient for scraping data from distant machines.

4.4.4 Selection of algorithm for time-series data:
Since the data generated from querying are time-series data, they have to be made stationary before a forecasting algorithm is applied. AutoRegressive Integrated Moving Average (ARIMA) is the algorithm chosen for time-series analysis, and the Dickey-Fuller test is used to check the stationarity of the time-series data.
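Differencing is one standard way of making a series stationary, and it is the step that the "Integrated" part of ARIMA refers to. The following minimal sketch uses a hypothetical helper named difference and made-up sample values. It does not implement the Dickey-Fuller test itself, which is available in statsmodels as adfuller.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


trend = [10.0, 12.0, 14.0, 16.0, 18.0]   # steadily rising, so the mean is not constant
stationary = difference(trend)
print(stationary)  # [2.0, 2.0, 2.0, 2.0]
```

After one round of differencing the linear trend has been removed, leaving a series with a constant mean.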

4.5 Summary
This chapter provides an overview of the development environment, programming language, code conventions and platform selection during project development. Several difficulties faced during the project development phase are highlighted, and the strategies used to tackle them are discussed.

Chapter 5
Software Testing
Software testing is performed to detect defects or errors by testing the components of programs individually. During testing, the components are then combined with one another to form the complete system. The testing phase is mainly concerned with demonstrating that all the individual functional goals and requirements of the system are met. The test cases are chosen to ensure that the system behaves appropriately for the relevant combinations of inputs, covering both normal and exceptional behavior. Test cases are specified by giving inputs together with their expected outputs. All exceptions are handled appropriately. Test cases also include cases in which unsuitable input is given, to ensure that an appropriate error or debug message is shown.

In this chapter, various test cases are written that perform module/unit testing for each functionality. Once the modules are ascertained to work properly, integration testing is done by combining all the modules. The system as a whole is then tested through system testing. Initially, testing was done on the local machine. Once the system performed well and met all the required objectives, it was deployed on an AWS instance.

5.1 Test Environment
Prometheus is used to monitor the performance of all the instances in the company. Hence a proper testing environment and proper test strategies are required to perform unit, integration and system testing. All the servers run in the background as daemon processes, and the environment required by each server is provided by setting up a Docker environment for each server on the same machine instance.
PyCharm is used as the editor for coding. Each module can be tested by running the program in debug mode. Breakpoints are set at the necessary points and the code is examined line by line for better understanding. Inputs are chosen so that the code is examined for all possible inputs, including the edge cases. Initially the code is tested on a sample dataset so that execution takes less time. The sample dataset contains inputs that test all the exceptions in the code. Once the sample dataset runs successfully, the program is run on the whole set of input data.

5.2 Unit testing of the main module
Testing is an integral part of software development. Performing unit testing reduces the risk of failure of the whole component at the end. Each module or sub-component of the product is tested for the functionality it is expected to achieve after design. This also helps to identify bugs and errors at early stages. Unit testing encourages individual modules that are not dependent on each other, so that the failure of a single module does not affect the performance of the whole system. Replacing that specific module can bring the whole system back to normal working, and identifying bugs in the system becomes much easier.

5.2.1 Unit Tests for Prometheus Server
The following shows the test case for identifying the state of the application used for monitoring.

Sl No. of Test Case:
Name of Test Case: Active state of Prometheus server
Feature being Tested: To check if the Prometheus server is started and is continuously running
Description: The Prometheus server is started as a daemon process on port 9090 and its activity is monitored by checking its status
Sample Input: sudo systemctl status prometheus (command to check the status)
Expected Output: Server is in the running state
Actual Output: Server is in the running state
Remarks: Server has to be monitored continuously for its up state

Table 5.1 Test case to demonstrate active state of Prometheus server

Figure 5.1 Test case Snapshot for Prometheus Server
The above table and figure show the active state of the Prometheus server. The server is started whenever monitoring is required and runs in the background. Its status can be checked with commands.

Sl No. of Test Case:
Name of Test Case: To check the state of the Postgres, Redis, MongoDB and Elasticsearch servers
Feature being Tested: To check for up/active status of the above servers
Description: To continuously use database services, all the above services should be up
Sample Input:
Expected Output: Active state of all the above servers
Actual Output: Active state of all the above servers
Remarks: All the servers are always running

Table 5.2 Test case to demonstrate active state of various servers
The above table represents the test case for all the applications running at remote locations, including database servers and application servers.

Sl No. of Test Case:
Name of Test Case: Status of all targets
Feature being Tested: To check for the up status of all the configured targets
Description: The /targets web page running on port 9090 of the mt_monitor machine lists all the configured jobs and their status
Sample Input:
Expected Output: Display of a web page with the targets configured in Prometheus along with their status
Actual Output: Display of a web page with the targets configured in Prometheus along with their status
Remarks: This web page gives the details of all the configured jobs, and hence monitoring becomes easier

Table 5.3 Test case to demonstrate active state of targets
The above table represents the test case covering all the targets used by the application for monitoring purposes.
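The target check in this test case could also be automated, since Prometheus exposes the information shown on the /targets page through its HTTP API at /api/v1/targets. The sketch below is illustrative only. The function name unhealthy_targets and the addresses in the sample response are hypothetical, and the sample is hand-written rather than captured from the project.

```python
import json


def unhealthy_targets(targets_response_text):
    """Return (job, scrape URL, health) for every active target that is not 'up'."""
    payload = json.loads(targets_response_text)
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            job = target.get("labels", {}).get("job", "")
            problems.append((job, target.get("scrapeUrl", ""), target.get("health")))
    return problems


# Hand-written sample response with hypothetical addresses.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node"}, "scrapeUrl": "http://10.0.0.1:9100/metrics", "health": "up"},
        {"labels": {"job": "redis"}, "scrapeUrl": "http://10.0.0.2:9121/metrics", "health": "down"},
    ]},
})
down = unhealthy_targets(SAMPLE_RESPONSE)
print(down)  # [('redis', 'http://10.0.0.2:9121/metrics', 'down')]
```

An empty result would correspond to the expected output of the test case, in which all configured targets are up.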

Sl No. of Test Case:
Name of Test Case: Evaluation metric for ARIMA model
Feature being Tested: Mean squared error
Description: Mean squared error gives the average of the squared prediction errors
Sample Input: Run ARIMA model with input CSV files
Expected Output: Predicted and expected values match with maximum accuracy
Actual Output: Predicted and expected values match with maximum accuracy
Remarks: Enables predictive analysis of data

Table 5.4 Test case to demonstrate ARIMA model metric evaluation
The predictive analysis is carried out using the ARIMA model, whose efficiency is judged by the mean squared error.
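As an illustration of the metric named in the table above, the mean squared error over n predictions can be written as follows. Here y_i denotes an actual value and \hat{y}_i the corresponding predicted value. These symbols are introduced for illustration and do not appear elsewhere in the report.

```latex
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

For example, with actual values 3 and 5 and predicted values 2 and 7, the errors are 1 and -2, so the MSE is (1 + 4) / 2 = 2.5. The mean of the absolute errors, which is a different metric, would be (1 + 2) / 2 = 1.5.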

Sl No. of Test Case:
Name of Test Case: Testing the Efficiency of the Predictive Analysis
Feature being Tested: To check how fast the model gets trained for different datasets
Description: As the model works on different datasets, it is useful to test and record the runtime for datasets of different sizes
Sample Input: Two sample datasets for testing the model efficiency
Expected Output: Time for training the model and getting the predictions
Actual Output: Time for training the model and getting the predictions
Remarks: The testing was carried out repeatedly on different datasets to obtain the time analysis

Table 5.5 Test case to demonstrate Efficiency of Predictive Analysis
The predictive analysis is carried out using functions that run for different durations, so it is useful to find the average running time for the efficiency analysis.
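Average running time of this kind can be measured with the standard library, as in the sketch below. The names average_runtime and placeholder_training are hypothetical, and the placeholder function merely stands in for training the model on a dataset. The two lists play the role of the small and large sample datasets.

```python
import time


def average_runtime(func, *args, repeats=5):
    """Call func(*args) `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def placeholder_training(dataset):
    """Hypothetical stand-in for fitting the model on a dataset."""
    return sum(x * x for x in dataset)


small, large = list(range(1_000)), list(range(100_000))
timings = {
    "small": average_runtime(placeholder_training, small),
    "large": average_runtime(placeholder_training, large),
}
```

Averaging over several repeats smooths out run-to-run variation, which matters when datasets of different sizes are compared.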

5.3 Integration Testing of the Modules
Integration testing is a technique in which the system is deployed alongside other existing systems and checked to ensure that it does not break the functionality of existing applications. This helps to identify and uncover errors associated with interfacing.

5.3.1 Integration Tests for Prometheus Server
The following shows the test case for integration of the Prometheus server with other applications.

Sl No. of Test Case:
Name of Test Case: Integration Test for Prometheus Server
Feature being Tested: To check that Prometheus is connected with other applications
Description: The Prometheus server should be connected to applications and remote servers such as AWS instances and Grafana
Sample Input: Raw data from AWS instances
Expected Output: Generation of the graphs according to the usage
Actual Output: Generation of the graphs according to the usage
Remarks: The graphs were generated according to the usage

Table 5.6 Test case to demonstrate Integration Test for Prometheus Server
Different applications have to be integrated for the overall software. The applications run remotely and are integrated and tested using scripts.

5.4 System Testing of the modules
All the modules combined and interfaced together to deliver the functionality as specified in the objectives of the project gives the entire system. System testing hence determines the accuracy and performance of the deployed software and specifies the requirement for monitoring and maintenance.

5.4.1 System Testing of the modules
The following show the test cases for System Testing of the Predictive Analysis Module

Sl No. of Test Case:
Name of Test Case: System Testing of the Predictive Analysis Module
Feature being Tested: To check the Predictive Analysis Module as part of the complete system
Description: The ARIMA model is used for predictive analysis and is tested for efficiency
Sample Input: Different datasets
Expected Output: Generation of actual and expected values with minimum mean squared error
Actual Output: Generation of actual and expected values with minimum mean squared error
Remarks: The model with the minimum MSE is the best model

Table 5.7 Test case to demonstrate System Testing of the Predictive Analysis Module
The Predictive Analysis Module is the main module for analysis of the data, and it is tested for minimum MSE.
5.5 Functional Testing of the GUI
A graphical user interface is developed to allow users to interact with the system. Grafana dashboards are developed for each application, and they can be viewed continuously for performance monitoring of the system. Authentication is a simple username and password login. The user can select the dashboard to view, edit the query behind a graph, and adjust the time interval. These options give the user a better understanding of the whole monitoring system.

5.5.1 Functional Testing of the GUI
The following table shows the test case for functional testing of the GUI.

Sl No. of Test Case:
Name of Test Case: Functional Testing of the GUI
Feature being Tested: To check that the GUI functions work correctly
Description: The GUI should be easy for the user to operate
Sample Input: User selection of different graph views
Expected Output: Generation of graphs
Actual Output: Generation of graphs
Remarks: Graphs were generated for better understanding

Table 5.8 Test case to demonstrate Functional Testing of the GUI
Functional testing of the GUI is carried out to verify its usability and that it helps the user understand the system.

Chapter 6
Experimental analysis and results of Prometheus monitoring system
In the analysis of a process, experiments are commonly used to evaluate which inputs have a significant effect on the output of the process, and what the target level of those inputs should be to achieve a desired result. The output obtained from the system is compared with the observed values to confirm the accuracy of the system. The ARIMA model is used as the predictive forecasting model for statistical analysis of the time-series data generated by the Prometheus monitoring system.
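To make the forecasting idea concrete, the following is a minimal sketch of the simplest non-trivial ARIMA variant, ARIMA(1,1,0) without a constant term: the series is differenced once, and an AR(1) coefficient is fitted to the differences by least squares. It is an illustration under stated assumptions and not the implementation used in this project; the function names and the sample CPU utilization values are hypothetical, and a full ARIMA library would also estimate moving-average terms and select the model order.

```python
def fit_ar1_on_differences(series):
    """Difference the series once and fit d[t] = phi * d[t-1] by least squares."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    numerator = sum(cur * prev for prev, cur in zip(diffs, diffs[1:]))
    denominator = sum(prev * prev for prev in diffs[:-1])
    phi = numerator / denominator if denominator else 0.0
    return phi, diffs


def forecast(series, steps):
    """Forecast future values by extrapolating the fitted AR(1) on differences."""
    phi, diffs = fit_ar1_on_differences(series)
    level, last_diff = series[-1], diffs[-1]
    predictions = []
    for _ in range(steps):
        last_diff = phi * last_diff   # predicted next difference
        level = level + last_diff     # integrate back to the original scale
        predictions.append(level)
    return predictions


# Hypothetical CPU utilization samples (percent) at a fixed scrape interval.
cpu_utilization = [10.0, 12.0, 14.0, 16.0, 18.0]
print(forecast(cpu_utilization, steps=2))
```

For a series that grows by a constant amount per step, the fitted coefficient is 1 and the forecast simply continues the trend.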

6.1 Evaluation Metrics
6.1.1 Mean Square Error
Because this is a short-term forecast, the forecasted values can be checked against the values observed shortly afterwards, and the distribution of the error over time can be examined. The mean squared error (MSE) is the average of the squared prediction errors. The relative mean squared error (relative MSE) is the MSE of one model divided by the MSE of a second, reference model. A strength of the relative MSE that is particularly useful in practice is that it enables standardized comparisons of candidate models with reference models. This encourages honest evaluation (a model could have very low error, but a simpler model may have similarly low error) and can help identify the strengths and weaknesses of prediction models. The relative MSE has several desirable properties. First, its interpretation for a given dataset does not depend on the scale of the data. Second, it has an intuitive interpretation: a value below 1 means that the candidate model has a lower error than the reference model.
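The metric can be illustrated with a short calculation, taking the MSE as the mean of the squared prediction errors. The numbers below are invented for illustration only; the reference model is assumed to be a naive forecast that repeats the previously observed value.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between observed and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_mse(actual, candidate, reference):
    """MSE of the candidate model divided by the MSE of the reference model."""
    return mean_squared_error(actual, candidate) / mean_squared_error(actual, reference)


# Illustrative values only (not measurements from the project).
actual = [12.0, 14.0, 16.0, 18.0]
candidate = [11.0, 14.0, 17.0, 18.0]   # hypothetical model forecasts
reference = [10.0, 12.0, 14.0, 16.0]   # naive forecast: previous observed value

print(mean_squared_error(actual, candidate))       # 0.5
print(mean_squared_error(actual, reference))       # 4.0
print(relative_mse(actual, candidate, reference))  # 0.125
```

A relative MSE of 0.125 means the candidate's squared error is one eighth of the reference model's, regardless of the units of the data.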

Figure 6.1 Mean squared error (Evaluation metric)

Figure 6.2 ARIMA model results

Figure 6.3 Real-time analysis of metrics using ARIMA model

Figure 6.4 Snapshot of Predictive analysis of time-series data using ARIMA model
6.2 Experimental Dataset
6.2.1 Data set generated from Scrape requests:
The dataset is generated by sending scrape requests to the exporters at an interval of 5 seconds. The scraped samples are held in the local time-series database of Prometheus and are queried using the PromQL query language. The query results are redirected to CSV files. Data is collected for all system metrics, which include CPU utilization, memory utilization and disk storage. This dataset is used as the input to the ARIMA model, first for training the model and then for testing it.
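The conversion of query results to CSV can be sketched as below. The sketch assumes the results come from the Prometheus HTTP API range query endpoint (`/api/v1/query_range`), whose JSON response has the `matrix` shape shown in `SAMPLE_RESPONSE`; whether the project used this endpoint is an assumption. The metric name, instance address and sample values are hypothetical, and no network request is made.

```python
import csv
import io

# Hypothetical response in the shape returned by /api/v1/query_range.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "node_load1", "instance": "203.0.113.10:9100"},
                "values": [[1525000000, "0.42"], [1525000005, "0.45"]],
            }
        ],
    },
}

FIELDS = ["timestamp", "instance", "metric", "value"]


def matrix_to_rows(response):
    """Flatten a range-query response into one row per sample."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    rows = []
    for series in response["data"]["result"]:
        labels = series["metric"]
        for timestamp, value in series["values"]:
            rows.append({
                "timestamp": float(timestamp),
                "instance": labels.get("instance", ""),
                "metric": labels.get("__name__", ""),
                "value": float(value),  # sample values arrive as strings
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = matrix_to_rows(SAMPLE_RESPONSE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Writing the text returned by `rows_to_csv` to a file gives a dump that can be cleaned and passed to the forecasting model.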

Figure 6.5 Query results stored in a csv dump(Before data cleaning)

Figure 6.6 Cleaned dataset
6.3 Performance analysis of Prometheus Monitoring System
6.3.1 Monitoring of the Prometheus server
For the monitoring system to perform effectively, the Prometheus server must be up at all times, with no downtime. The server is therefore run as a daemon process and is itself monitored continuously.

Figure 6.7 Monitoring of Prometheus server
6.3.2 Metrics collection by Prometheus server from various exporters
Exporters are deployed on the AWS machines, either as Docker containers or as plugins depending on the application, and each is configured as a job in the Prometheus configuration file, prometheus.yml. The targets for each job are specified by the IP address of the machine instance and the default port on which the respective exporter runs.
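The job layout described above can be sketched by generating the relevant part of prometheus.yml from a mapping of job names to targets. This is an illustration only: `render_scrape_configs` is a hypothetical helper, the IP address is a documentation placeholder, and the job names are assumptions. Port 9100 is the default port of the node exporter and 9090 that of the Prometheus server.

```python
def render_scrape_configs(jobs, scrape_interval="5s"):
    """Render a minimal prometheus.yml with one static job per exporter."""
    lines = [
        "global:",
        f"  scrape_interval: {scrape_interval}",
        "scrape_configs:",
    ]
    for job_name, targets in jobs.items():
        quoted_targets = ", ".join(f"'{target}'" for target in targets)
        lines.append(f"  - job_name: '{job_name}'")
        lines.append("    static_configs:")
        lines.append(f"      - targets: [{quoted_targets}]")
    return "\n".join(lines) + "\n"


# Hypothetical jobs: each target is "<instance IP>:<exporter port>".
jobs = {
    "prometheus": ["localhost:9090"],
    "node": ["203.0.113.10:9100"],
}

config_text = render_scrape_configs(jobs)
print(config_text)
```

Each job groups the targets of one exporter type, so adding a machine only requires appending its address to the matching target list.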

Figure 6.8 Collection of system metrics through node exporter

Figure 6.9 Export of Elasticsearch metrics through a plugin installed on Elasticsearch server into Prometheus

Figure 6.10 AWS credentials

Figure 6.11 Configuration of prometheus.yml for each job
6.4 Inference from the Result
The monitoring system developed using Prometheus gives a better understanding of the performance of all the microservices running in the system. The real-time analysis of system, business and application metrics helps in understanding the traffic coming into the website at various times. Predictive analysis of the data supports better decisions and efficient utilization of the resources of the machine instances. The monitoring system therefore helps MeTripping to manage resources efficiently and monitor their performance at scale.

Prometheus is a scalable open-source software tool that provides a monitoring solution for numerical time-series data. The project was started at SoundCloud, inspired by Google's Borgmon. The Prometheus server exposes an HTTP endpoint, by default on port 9090, where all the collected metrics are listed and can be visualized with simple graphs.

The Prometheus monitoring system has achieved all the objectives set initially in the design phase of the project. It collects the metrics from the various exporters by pulling them continuously at a scrape interval of 5 s. The PromQL query language is used to query these metrics and generate results. The results are visualized graphically using Grafana dashboards. Along with real-time analysis, predictive analysis is carried out on the stored data. The algorithm used is the ARIMA (autoregressive integrated moving average) model.

The deployed system therefore allows MeTripping to monitor all its microservices and DevOps operations, to manage resources efficiently, and to make better decisions based on the traffic coming to the website at different times.

7.1 Limitations of the Project
Prometheus provides no support for log storage.

Prometheus does not support long-term durable storage.

Prometheus does not have a dashboarding solution of its own. It depends on Grafana dashboards, which adds some setup complexity.

Since the AWS instances belong to different regions, public IP addresses have to be used in the configuration. This can increase the exposure of the data to attacks.

Some metrics, such as CPU surplus storage, have to be collected from Amazon CloudWatch through the CloudWatch exporter, and CloudWatch itself is a proprietary service rather than an open-source tool.

7.2 Future Enhancement
Alertmanager can be configured to send alerts through email or Slack messages.

Proper authentication should be set up for access to the Prometheus monitoring system.

In addition to real-time analysis of data, more importance should be given to predictive analysis, which gives better insight into the performance of the system under any load.

Iman Sadooghi, "Understanding the Performance and Potential of Cloud Computing for Scientific Applications", Publication, Vol-3, 65-55, page number, 2017.
Lukasz Kufel, "Tools for Distributed Systems Monitoring", Foundations of Computing and Decision Sciences, Vol. 41, No. 4, ISSN 0867-6356.
Rodrigo N. Calheiros, Enayat Masoumi, Rajiv Ranjan, and Rajkumar Buyya, "Workload Prediction Using ARIMA Model and Its Impact on Cloud Applications' QoS", IEEE Transactions on Cloud Computing, Vol. 3, No. 4, October-December 2015.
"Release: Amazon EC2 on 2007-07-12", Amazon Web Services, online 2013 (Accessed: 1 November 2013).
Mick England, "Monitoring AWS beyond CloudWatch", whitepaper, 2017.
Mick England, "Best Practices for Monitoring While Migrating to AWS", whitepaper, 2017.
Wes McKinney, Josef Perktold, Skipper Seabold, "Time Series Analysis in Python with statsmodels", Proc. of the 10th Python in Science Conf. (SciPy 2011).
Subutai Ahmad, Alexander Lavin, Scott Purdy, Zuha Agha, "Unsupervised real-time anomaly detection for streaming data", Neurocomputing 262 (2017) 134–147, Elsevier.
Jina Wang, Yongming Yan and Jun Guo, "Research on the Prediction Model of CPU Utilization Based on ARIMA-BP Neural Network", MATEC Web of Conferences 65, 03009 (2016).
Shervin Sharifi, Dilip Krishnaswamy, Tajana Šimunić Rosing, "PROMETHEUS: A Proactive Method for Thermal Management of Heterogeneous MPSoCs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
Konstantinos Angelopoulos, Fatma Başak Aydemir, Paolo Giorgini, John Mylopoulos, "Solving the Next Adaptation Problem with Prometheus", University of Trento, Italy.
L. Ramakrishnan, R. S. Canon, K. Muriki, I. Sakrejda, and N. J. Wright, "Evaluating Interconnect and virtualization performance for high performance computing", ACM Performance Evaluation Review, 2013.