Web Attack Detection Using Machine Learning: Building an ML-Based IDS

KarthikJChandran
16 min read · Jan 6, 2021

Introduction:

The increase in internet usage has been revolutionary. Worldwide, there were 4.66 billion active internet users as of October 2020, encompassing 59 percent of the global population. These statistics support the idea that the development of web apps has rocketed in recent years. Just think about it: how many people do you know who don't own a smartphone with an internet browser installed?

Because web applications are web-based, the business systems built on them can be accessed 24/7, provided you have an internet connection. What's more, they are totally flexible, offering access from almost any device or browser. By using a web application, you avoid the hassle and memory usage of installing software on every device, and you'll also find web applications less punishing on older or low-spec devices. But as convenient as web applications are, there are serious threats that can cause severe damage to a web application and to the personal details of the users associated with it. In this blog we will look into such threats and into how ML and data science techniques can be effective in overcoming them.

What is the problem?

Web application security is a never-ending game of cat and mouse. As soon as the latest threat is mitigated, a new one emerges. Web attacks can bring down the operations and services of corporations, leaving them with financial and reputational damage and customer dissatisfaction.

Amazon Web Services (AWS) reports that in February 2020 it defended against a 2.3-terabit-per-second (Tbps) distributed denial of service (DDoS) attack. GitHub had previously sustained what was then the largest DDoS attack on record, a 1.35 Tbps attack against the site in 2018. The brute-force attack is still one of the most popular password-cracking methods for hacking WordPress today. In this blog, we will focus on machine learning techniques to detect web attacks in network communication flows, using a continuous learning approach that learns the normal pattern of network traffic and the behavior of the network protocols and identifies a compromised network flow.

Let’s learn more about Web Attacks:

To get more clarity on the different types of web attacks and how they are carried out, we should first understand the basic operation of a web application.

A web application is a computer program that uses web browsers and web technology to perform tasks over the Internet. Basically, a web application requires a web server to manage requests from the client, an application server to perform the tasks requested, and a database to store the information.

1.1 Basic Web Application architecture

As you can see in the snip above, the client issues a request from their browser over the internet to the web server. The web server receives the request and passes it on to the application server, which checks the relevant data in the database, creates a dynamic page, and passes it back to the web server, which finally returns it to the browser for display.

How do web attacks take place?

Web attacks can be a serious threat to a web application: they take advantage of vulnerabilities in the application to gain access to its database and do serious damage. Common attack types include:

1. Brute-force: A brute-force attack is a trial-and-error method used by hackers to guess credentials or encrypted data such as logins, passwords, or encryption keys, through exhaustive effort (brute force) in the hope of eventually guessing correctly.

2. DoS Attack: A Denial-of-Service (DoS) attack is an attack meant to shut down a website, making it inaccessible to its intended users by flooding it with useless traffic (junk requests). Sometimes DoS attacks are also used to disable computer defence systems.

3. Botnets and DDoS Attack: A DDoS attack is short for "Distributed DoS attack". Such attacks are performed by flooding the targeted website with useless traffic from multiple devices, i.e. a botnet. A botnet is a network of computers infected with malicious software (malware) without the users' knowledge, organized into a group and controlled by cybercriminals. Modern botnets can contain tens of thousands of compromised mobile devices or desktop computers. By their nature, modern DDoS attacks are costly and require a lot of resources, which usually means the adversary is well funded. Very often, DDoS attacks are ordered by unscrupulous competitors or political opponents.

4. SQL Injection: SQL injection is a code injection technique used to attack data-driven applications, in which malicious SQL statements are inserted into an entry field for execution. It is one of the most common web hacking techniques: the placement of malicious code in SQL statements via web page input.

5. Infiltration: Infiltration can be accomplished by directly breaching a network, or by infecting a host which is then joined to a private network.

6. Heartbleed: The Heartbleed attack works by tricking servers into leaking information stored in their memory. Attackers can also get access to a server's private encryption key. That could allow the attacker to unscramble any private messages sent to the server and even impersonate the server.

Existing solutions to prevent web attacks:

In production, web apps are typically paired with an IDS (Intrusion Detection System) to protect them from web attacks. An IDS is an application that monitors the computer network to detect any malicious activities or threats and alerts the admin. The primary purpose of an IDS is to detect anomalies in incoming requests and raise an alert. But it is impossible for an IDS to provide complete protection, because it sometimes makes false positive and false negative errors, and the false negative requests slip through unreviewed and attack the web server.

Attackers also use evasion techniques, such as encoding the malicious requests, to hide them. These can compromise the IDS, pass straight through it, and reach the web application infrastructure to carry out their intended attack.

The IDS can be classified into two types:

(i) Network-based Intrusion Detection System (NIDS): Network intrusion detection systems operate at the network level and monitor traffic from all devices going in and out of the network. A NIDS analyzes the traffic looking for patterns and abnormal behaviors, upon which a warning is sent.

(ii) Host-based Intrusion Detection System (HIDS): Unlike a NIDS, which monitors the entire network, a HIDS monitors system data and looks for malicious activity on an individual host. A HIDS can take snapshots, and if they change maliciously over time, an alert is raised.

Role of Machine Learning Algorithms

Traditional IDSs cannot detect and restrict attacks at full scale because they are limited to observing static patterns in web requests; when a malicious web request is slightly encoded, it can easily traverse the IDS and cause damage. Here I have used flow-based traffic characteristics to analyze the difference in pattern between normal and anomalous packets. We evaluate several supervised classification algorithms using metrics such as maximum detection accuracy and lowest false negative rate. ML algorithms are capable of learning large amounts of malicious and benign requests with different patterns and can predict them effectively in production.

Business objective

The ultimate objective here is to build an ML-based network intrusion detection system that can detect malicious traffic flows at the application layer and stop them from entering the web application, thereby protecting it from web attacks.

Dataset used

The biggest challenge here is to identify an appropriate dataset which holds both malicious and normal requests so that we can train the ML model on it; many such datasets cannot be shared due to privacy issues. The dataset used here contains normal network flows and flows with web attacks. The research team from the Canadian Institute for Cybersecurity generated the CSE-CIC-IDS2018 on AWS dataset. The team's top priority was to generate realistic network traffic, using a benign profile system that abstracts the behavior of human interactions and generates naturalistic benign background traffic.

1.2 Testbed Architecture

The team built a testbed architecture consisting of interconnected Windows and Linux workstations. The Windows machines use different service packs (because each pack has a diverse set of known vulnerabilities), and the Linux machines use the Metasploitable distribution, which is designed to be attacked by new penetration testers. Thanks to the University of New Brunswick, who have built web attack datasets at regular intervals covering the trending web attacks. The CSE-CIC-IDS2018 on AWS dataset used here was downloaded from this Kaggle page. Data is recorded in various CSV files based on dates; each dated CSV file holds the web attacks passed on that date.

The dataset includes seven different attack scenarios: brute-force, Heartbleed, botnet, DoS, DDoS, web attacks, and infiltration of the network from inside. It includes the captured network traffic and system logs of each machine, along with 80 features extracted from the captured traffic using CICFlowMeter-V3.

The 80 features present in the dataset

Please visit this link to learn more about the testbed architecture and for a detailed description of the web attacks passed.

Performance Metrics

The performance of a machine learning model is measured using performance metrics. There are different types of performance metrics for evaluating a model, and choosing the appropriate ones is very important in order to monitor and optimize performance. We are using the metrics below to evaluate our models.

1. Confusion matrix: The confusion matrix is used to measure the performance of a classification model. It is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem it is a table with 4 different combinations of predicted and actual values.

Confusion Matrix

True Positive Rate (TPR) = True Positives / Actual Positives = TP / (TP + FN)

False Positive Rate (FPR) = False Positives / Actual Negatives = FP / (FP + TN)

False Negative Rate (FNR) = False Negatives / Actual Positives = FN / (FN + TP)

True Negative Rate (TNR) = True Negatives / Actual Negatives = TN / (TN + FP)

For instance, if a malicious web request is predicted to be genuine and is let through to the network, it may damage the web app, so the cost of such a misclassification (a false positive, given that benign is later encoded as the positive class) is high here. For good performance, TPR and TNR should be high and FNR and FPR should be low.

2) Precision: Of all the points that are declared positive, what percentage are actually positive?

Precision = True Positives / Predicted Positives = TP / (TP + FP)

3) Recall: Of all the points that are actually positive, what percentage are declared positive?

Recall = True Positives / Actual Positives = TP / (TP + FN)

4) F1-Score: It is used to measure test accuracy. It is the harmonic mean of precision and recall. An F1 score of 1 is the best and 0 is the worst.

F1 = 2 * (precision * recall) / (precision + recall)

Precision and Recall should always be high.

5) Accuracy: It is simply the rate of correct classifications.

Accuracy = (TP + TN) / (TP + FP + FN + TN)
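As a quick reference, here is a minimal sketch of how these metrics might be computed with scikit-learn; y_test and y_pred are placeholders for the true and predicted labels of whichever model is being evaluated.

```python
# Sketch: computing the metrics above; y_test / y_pred are placeholders.
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, f1_score, accuracy_score)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

tpr = tp / (tp + fn)   # True Positive Rate
fpr = fp / (fp + tn)   # False Positive Rate
fnr = fn / (fn + tp)   # False Negative Rate
tnr = tn / (tn + fp)   # True Negative Rate

print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
```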

Exploratory Data Analysis

EDA is the most important and crucial stage in a machine learning project. Before we apply any model to the dataset, it is very important that we do proper data analysis and pre-processing to uncover the underlying structure of the dataset, because this process exposes trends, patterns, and relationships that are not readily apparent.

The CSE-CIC-IDS2018 dataset was generated in a testbed environment by passing each type of web attack, together with benign traffic, on each day. The resulting web requests are available as one CSV file per day, each holding different types of web attack; together the CSV files total 6.41 GB, so we perform the EDA and modelling on the Google Colab platform for better RAM and storage. Please refer to this link for a detailed understanding of the Google Colab notebook.

But concatenating and processing all the CSV files at once may crash the runtime even on Colab, hence we apply sampling here. Since the web attack classes are heavily imbalanced, we apply stratified sampling and random sampling to each day's CSV file.

What is sampling and stratified sampling?

A sample is a random subset of the population. We usually use samples when the population is big enough that it is difficult to analyze the whole set. Sampling is a statistical technique used to select, manipulate, and analyze a representative subset of data points to identify patterns and trends in the larger dataset being examined.

In a simple random sample, every member of the population has an equal chance of being selected, and your sampling frame should include the whole population. Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.

Connecting the drive to Google colab notebook and loading the csv
Applying sampling and stratified sampling on the datasets
Concatenating the stratified samples forming a final dataset
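In case the snippets above are hard to read, the sketch below outlines the same three steps (mounting Drive, per-day stratified sampling, concatenation). The folder path, the 'Label' column name, and the 10% sampling fraction are illustrative assumptions, not the exact values used.

```python
# Sketch: mount Drive, stratify-sample each day's CSV, concatenate.
import glob
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')
csv_files = glob.glob('/content/drive/MyDrive/CSE-CIC-IDS2018/*.csv')  # assumed path

samples = []
for path in csv_files:
    day = pd.read_csv(path, low_memory=False)
    # Stratified sample: keep ~10% of each class so rare attacks are not lost.
    part = day.groupby('Label', group_keys=False).apply(
        lambda g: g.sample(frac=0.10, random_state=42))
    samples.append(part)

final_df = pd.concat(samples, ignore_index=True)
print(final_df['Label'].value_counts())
```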

Concatenating the stratified samples, we form our final dataset, which holds benign and web attack requests in an almost balanced ratio; however, we do not have enough samples of each individual web attack type. Hence we convert this problem from a multi-class classification problem to a binary classification problem, encoding all types of web attacks as 0 and benign requests as 1.

Class Labels(web attacks) and its corresponding counts
Replacing the class labels with 0 and 1
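A minimal sketch of this relabelling step, assuming the class column is named 'Label' and benign flows are recorded as the string 'Benign':

```python
# Sketch: benign flows become 1, every attack type becomes 0.
# Column and class names are assumptions based on the dataset description.
final_df['Label'] = (final_df['Label'] == 'Benign').astype(int)
print(final_df['Label'].value_counts())
```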

Data Cleaning

This is another very important step before modelling; in general, any dataset acquired should be keenly analyzed, cleaned, and pre-processed. In fact this is the most time-consuming part of an ML life cycle. When you clean data, all outdated or incorrect information is removed, leaving you with the highest-quality information.

The data here had lots of outliers, missing values, special characters, and features of different data types; before proceeding to the modelling part it is very important to clean and process them.
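A sketch of the kind of cleaning applied here, assuming the usual artefacts of CICFlowMeter output (stray non-numeric entries, infinite flow rates, missing values); this is not the exact cleaning code from the project.

```python
# Sketch: coerce features to numeric, remove infinities and missing values.
import numpy as np
import pandas as pd

feature_cols = final_df.columns.drop('Label')

# Coerce every feature column to numeric; unparseable entries
# (e.g. stray header rows or special characters) become NaN.
final_df[feature_cols] = final_df[feature_cols].apply(pd.to_numeric, errors='coerce')

# CICFlowMeter can emit infinite flow rates; treat them as missing and drop.
final_df = final_df.replace([np.inf, -np.inf], np.nan).dropna()
```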

Data Pre-Processing

Feature Importance

It is important that we feed relevant data to the model so that it can perform effectively. Feeding redundant and constant data does not help prediction; instead, it reduces the performance of the model. So we should identify the important features in a large dataset and pass only those to the model.

In the CSE-CIC-IDS2018 dataset considered here, there are 80 features; excluding the class label we are left with 79 features. We cannot conclude that all of these 79 features are important and should be passed to an ML model. We should perform feature reduction, retaining the important features and dropping the less important ones.

1. Firstly, constant feature values cannot help prediction. We can see that the features Bwd PSH Flags, Bwd URG Flags, Fwd Byts/b Avg, Fwd Pkts/b Avg, Fwd Blk Rate Avg, Bwd Byts/b Avg, Bwd Pkts/b Avg, and Bwd Blk Rate Avg hold only zero values; retaining them is of no use for prediction, hence we drop these features from the dataset (see the sketch after the figure below).
Features having zero values
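A sketch of the drop, using the all-zero column names listed above (their exact spelling may differ slightly in the raw CSVs):

```python
# Drop the all-zero (constant) columns identified above; they carry no information.
constant_cols = ['Bwd PSH Flags', 'Bwd URG Flags', 'Fwd Byts/b Avg',
                 'Fwd Pkts/b Avg', 'Fwd Blk Rate Avg', 'Bwd Byts/b Avg',
                 'Bwd Pkts/b Avg', 'Bwd Blk Rate Avg']
final_df = final_df.drop(columns=constant_cols)

# Equivalently, constant columns could be detected programmatically:
# constant_cols = [c for c in final_df.columns if final_df[c].nunique() == 1]
```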

2. To dive deeper into identifying the important features, we use feature importance algorithms. Here I have used the permutation importance algorithm. With the permutation importance feature selection method, the performance of a model is measured after shuffling the values of each individual feature (effectively replacing it with noise). In this way the importance of individual features can be directly compared, and a quantitative threshold can be used to determine feature inclusion. Permutation importance is calculated after a model has been fitted, so we don't change the model or its predictions. This can get quite confusing; please refer to this link for a more descriptive explanation.

The permutation importance feature selection method is available in the Python library ELI5.
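A minimal sketch of how permutation importance might be run with ELI5, assuming a decision tree has been fitted on a training split and X_val/y_val hold a validation split; the model choice here is illustrative.

```python
# Sketch: permutation importance with ELI5 on a fitted decision tree.
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the validation split and measure the score drop.
perm = PermutationImportance(model, random_state=42).fit(X_val, y_val)
eli5.show_weights(perm, feature_names=list(X_val.columns))
```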

3. The third feature reduction technique is finding highly correlated features: the stronger the correlation between two features, the more redundant information they carry, since it is difficult to change one variable without changing the other.

Correlation heatmap between the features
Dropping the highly correlated features

Final set of features after dropping the highly correlated features with a correlation score of more than 0.95 (see the sketch below).
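A sketch of how the highly correlated features might be identified and dropped using the 0.95 threshold mentioned above:

```python
# Sketch: drop one feature from every pair with |correlation| > 0.95.
import numpy as np

corr = final_df.drop(columns=['Label']).corr().abs()

# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
final_df = final_df.drop(columns=to_drop)
```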

Modeling

There is no single perfect model that will fit the dataset and provide optimum results; we have to experiment with different types of models and tune their respective hyperparameters. But visualizing this particular dataset, it is clear that the data is not linearly separable, hence linear ML models are unlikely to be the best choice. We shall try to demonstrate this by fitting a linear model to the data.

1. First we apply the SGD classifier with log loss for different numbers of iterations (a code sketch follows the figures below).
Accuracy score of SGD classifier
Accuracy score vs Iteration graph
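A minimal sketch of this experiment, assuming X_train/X_test have already been scaled (SGD is sensitive to feature scale) and using illustrative iteration counts:

```python
# Sketch: SGD classifier with log loss over increasing iteration counts.
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

for n_iter in [5, 10, 100, 500, 1000, 2000]:
    # Older scikit-learn versions use loss='log' instead of 'log_loss'.
    clf = SGDClassifier(loss='log_loss', max_iter=n_iter, tol=None, random_state=42)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"max_iter={n_iter:5d}  accuracy={acc:.3f}")

# The hinge-loss experiment in the next step only changes loss='hinge'.
```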

It is evident that the linear model performs poorly on this data. The accuracy of the model stops improving after the 1000th iteration, and an accuracy of 0.53 is definitely not a good score.

2. We try the same SGD classifier with hinge loss.

Here the results are even worse: the accuracy does not rise above 0.47 even after increasing the number of iterations.

Accuracy score vs Iteration graph

3. Given that the data is not linearly separable and the number of observations is large, we can try a decision tree model with default parameters, sketched below.
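A minimal sketch of the decision tree baseline, with X_train/X_test as the prepared splits:

```python
# Sketch: decision tree with default parameters.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```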

The DT model produces a better accuracy score than the previous models, and from the confusion matrix we can see that the model separates benign and anomalous requests reasonably well.

4. We can use hyperparameter tuning to find the best hyperparameters and check the performance with them.

Using GridSearchCV we identify the best hyperparameters, and we then train the model with those values (a sketch follows).
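A sketch of the grid search; the parameter grid shown is illustrative, not the exact grid used:

```python
# Sketch: GridSearchCV over a few decision tree hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth': [5, 10, 20, None],
              'min_samples_split': [2, 10, 50],
              'criterion': ['gini', 'entropy']}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
best_dt = grid.best_estimator_  # refit on the full training set
```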

After hyperparameter tuning and applying the best parameters, the model accuracy improved by a small amount. The accuracy difference is not large, but hyperparameter tuning should still be done to obtain optimized results.

5. Ensemble techniques make models even more powerful. We use the random forest model, a bagging ensemble technique: an easy-to-use ML algorithm that produces effective results even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and diversity.

The RF model produces good results, but they are more or less similar to those of the well-tuned DT model.

6. We experimented with other boosting ensemble algorithms such as AdaBoost and XGBoost, but the results have not been much better (a combined sketch of these ensemble experiments follows).
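A combined sketch of the ensemble experiments from steps 5 and 6, with default hyperparameters and assuming xgboost is installed:

```python
# Sketch: bagging (random forest) vs boosting (AdaBoost, XGBoost) with defaults.
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:12s} accuracy = {acc:.3f}")
```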

7. As part of the modelling I have also built a custom ensemble model with decision trees as base learners, which produced results close to those of the best-performing DT model.

Custom ensemble model:

1) Make the initial train and test (80-20) split.
2) Split the 80% train set into D1 and D2 (50-50). From D1, sample with replacement to create d1, d2, d3, …, dk (k samples in total), and train k models, one on each of these k samples.
3) Pass the D2 set to each of these k models; we now get k predictions for D2, one from each model.
4) Using these k predictions, create a new dataset. Since we already know the corresponding target values for D2, we train a meta model on these k predictions.
5) For model evaluation, use the 20% of the data kept aside as the test set. Pass the test set to each of the base models to get k predictions, create a new dataset with these k predictions, and pass it to the meta model to get the final prediction. Using this final prediction along with the targets for the test set, we calculate the model's performance score.
For the base models we use decision trees (see the sketch below).
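A sketch of this custom stacking ensemble. The text does not specify the meta model, so logistic regression is used here as one simple choice; k = 10 and the variable names X, y (features and labels) are illustrative.

```python
# Sketch: custom stacking ensemble with k bootstrap-trained decision trees
# as base learners and a logistic-regression meta model (an assumption).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1) 80-20 split, then 2) split the train set into D1 and D2 (50-50).
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5, random_state=42)

k = 10
base_models = []
for i in range(k):
    # Sample D1 with replacement and train one base model per bootstrap sample.
    idx = np.random.RandomState(i).choice(len(X_d1), size=len(X_d1), replace=True)
    model = DecisionTreeClassifier(random_state=i)
    model.fit(X_d1.iloc[idx], y_d1.iloc[idx])
    base_models.append(model)

# 3) + 4) Base-model predictions on D2 become the meta model's training data.
meta_X = np.column_stack([m.predict(X_d2) for m in base_models])
meta_model = LogisticRegression().fit(meta_X, y_d2)

# 5) Evaluate on the untouched 20% test set.
meta_X_test = np.column_stack([m.predict(X_test) for m in base_models])
y_pred = meta_model.predict(meta_X_test)
print("Stacked ensemble accuracy:", accuracy_score(y_test, y_pred))
```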

Model Deployment

This ML-based intrusion detection web app was built using the Flask API; the trained models were saved as joblib files and loaded whenever the app is called with an input. The input here is passed as a CSV file with the 79 features. Error exceptions such as too few features, special text characters, missing features, and blank input CSV files have also been handled. To track error occurrences we also record events to text files using the Python logger.
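A minimal sketch of what such a Flask service could look like; the file names, route, and log file are illustrative, and the real app's error handling is more extensive.

```python
# Sketch: Flask endpoint that loads a saved model and scores an uploaded CSV.
import logging
import joblib
import pandas as pd
from flask import Flask, request, jsonify

logging.basicConfig(filename='ids_app.log', level=logging.INFO)  # assumed log file
app = Flask(__name__)
model = joblib.load('model.joblib')  # assumed file name for the saved model

@app.route('/predict', methods=['POST'])
def predict():
    try:
        df = pd.read_csv(request.files['file'])   # CSV with the selected features
        preds = model.predict(df)
        return jsonify(predictions=preds.tolist())
    except Exception as exc:
        # Record the failure (e.g. missing features, blank file) for later review.
        logging.exception("Prediction failed: %s", exc)
        return jsonify(error=str(exc)), 400

if __name__ == '__main__':
    app.run(debug=False)
```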

Links

GitHub: click here, LinkedIn: click here

References

  1. https://csr.lanl.gov/data/2017/
  2. https://arxiv.org/pdf/1903.02460.pdf
  3. Detection of Denial of Service Attacks in Communication Networks by Vancouver, British Columbia, Canada
  4. Network Traffic Behavioral Analytics for Detection of DDoS Attacks by Alma D. Lopez Southern Methodist University, adlopez@smu.edu
  5. https://www.kaggle.com/solarmainframe/ids-intrusion-csv?select=02-21-2018.csv
  6. https://www.unb.ca/cic/datasets/ids-2018.html
  7. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

Future Work

  1. Include more samples on a machine with more compute, apply more complex ML models, and optimize the results.
  2. Extend the binary classification problem to a multi-class classification problem predicting the different types of web attacks, using the CIRA-CIC-DoHBrw-2020 dataset, the next version of the IDS dataset, which has enough samples of each web attack type.
  3. Apply deep learning concepts: use MLPs and also experiment with LSTMs, which can capture the sequential patterns in the traffic, to achieve better accuracy and lower latency.
