1 Introduction
Web applications play an extremely important role in people’s daily lives and bring great convenience: people use them for shopping, office work, learning, entertainment, and so on. However, security problems in Web applications have long existed. By attacking Web applications, hackers can disable Web services, steal sensitive user information, and cause serious financial loss to both service providers and users.
However, it is hard to protect Web applications from attack. Even though developers and researchers have developed many solutions to protect Web applications, such as Web application firewalls (WAFs), Web intrusion detection systems (Web IDSs), and penetration testing, Web attacks remain a major threat. Generally, there are two approaches to detecting Web attacks: signature-based [1] and anomaly-based [2]. The signature-based method builds its detection model from known attacks, and any behavior that matches a known attack signature is identified as an attack. In contrast, the anomaly-based method builds a profile of normal behavior, and any deviation from it is identified as an attack. The signature-based method is adopted more widely than the anomaly-based method because it generally has a lower false alarm rate and achieves higher accuracy. Although it is effective, the signature-based (rule-based) method is still problematic. On the one hand, it is only as good as its rule set, which means it cannot identify attacks that are not in its signature set. On the other hand, attackers can easily bypass a WAF by replacing keywords in existing malicious requests or by encoding the requests multiple times [3, 4].
Here, we put forward a new anomaly detection method for Web attacks, BL-IDS, based on the bidirectional recurrent neural network (BRNN) [5] with Bidirectional Long Short-Term Memory (Bi-LSTM) units. Our model takes as input the Uniform Resource Locator (URL) and the request body of HTTP POST requests (only the URL for HTTP GET requests). After the input is tokenized, the tokens are mapped to vectors. The Bi-LSTM then learns from the normal request patterns, and a neural network trained on the output of the Bi-LSTM judges whether a given request is anomalous. Our method achieves state-of-the-art results in detecting Web attacks: the experimental results show that BL-IDS has a high detection rate and maintains a low false alarm rate.
The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 describes our deep learning method for detecting Web attacks. Section 4 presents the experimental results and discussion. Section 5 concludes the paper.
2 Related Works
Many machine learning techniques have been used to detect Web attacks. Kruegel et al. [6] presented a multi-model method that analyzes HTTP requests using several models built on different features, such as attribute length, attribute character distribution, structural inference, and invocation order. Abou-Assaleh et al. [7] explored automatically detecting new malicious code with N-gram features extracted from a collected dataset of benign and malicious code. Moh et al. [8] put forward a multi-stage log analysis architecture that combines pattern matching with supervised machine learning; it uses logs produced by the application during attacks to detect SQL injection attacks effectively. Cao [9] built a system that uses the widely adopted machine learning algorithm AdaBoost to avoid false negatives and improve detection efficiency.
In recent years, deep learning, a branch of machine learning, has become increasingly popular and has been applied in the field of information security. Cui et al. [10] proposed an improved NIDS using word embedding-based deep learning (WEDL-NIDS), which first reduces the dimension of a packet's payload via word embedding and learns the local content features of network traffic using deep convolutional neural networks (CNNs) [11], and then adds the head features and learns global temporal features using long short-term memory (LSTM) [12] networks. They reported good results. Valeur et al. [13] developed an anomaly-based system that learns profiles of the normal database access performed by Web-based applications using a number of different models. These models allow the detection of unknown attacks with reduced false positives and limited overhead. Zhang et al. [14] put forward a deep learning method that detects Web attacks with a specially designed CNN. Their work is similar to ours; the difference lies in the network architecture: they use a convolutional neural network, while we use a Bi-LSTM based on the bidirectional recurrent neural network [5]. As shown in Sect. 4, our method achieves better performance.
Data preprocessing: We decode the HTTP request and then split the decoded request on segmentation characters such as / and &.
Word embedding: We map each word to a word vector using word2vec [15]. The word vectors are used as the input to a neural network model.
Training the model and detecting Web attacks: We use the labeled word vectors to train a neural network model, and then use the trained model to classify each new HTTP request as a Web attack or a normal request.
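The bookkeeping behind the second step in the list above can be sketched as follows. This is only a hedged illustration, not the authors' code: word2vec training itself needs an external library and is not reproduced here. The sketch shows the lookup that precedes any embedding, namely mapping each token to an integer id and padding or truncating every request to a fixed length. The names `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical, and the length 56 is taken from the output shapes in the network summary of Sect. 4.2.

```python
from collections import Counter
from typing import Dict, Iterable, List

PAD_ID = 0    # reserved for padding (assumption, not from the paper)
UNK_ID = 1    # reserved for tokens never seen during training (assumption)
MAX_LEN = 56  # sequence length read off the network summary in Sect. 4.2


def build_vocab(token_lists: Iterable[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every token seen in the training requests."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent tokens get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(ordered, start=2)}


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int = MAX_LEN) -> List[int]:
    """Map tokens to ids, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))
```

Each id then selects one row of the embedding matrix, so a request becomes a fixed-size sequence of word vectors regardless of its original length.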
3.1 Data Preprocessing
In this step, we decode the HTTP request and then split the decoded request on segmentation characters, which include /, &, +, ?, =, @, and so on.
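A minimal sketch of this decode-and-split step is given below, using only the Python standard library. It is an illustration under stated assumptions rather than the exact preprocessing used in the paper: the delimiter set contains only the six characters named above (the full list is not given), and the repeated decoding loop with `max_rounds` is an assumption added to handle requests that were encoded more than once. The function names `decode_request` and `tokenize` are hypothetical.

```python
import re
from typing import List
from urllib.parse import unquote

# Only the delimiters named in the text; the paper's complete list is not known.
DELIMITERS = "/&+?=@"
_SPLIT_RE = re.compile("[" + re.escape(DELIMITERS) + "]")


def decode_request(raw: str, max_rounds: int = 3) -> str:
    """Percent-decode repeatedly (assumption) so multiply encoded payloads are unwrapped."""
    for _ in range(max_rounds):
        decoded = unquote(raw)
        if decoded == raw:  # nothing left to decode
            break
        raw = decoded
    return raw


def tokenize(raw: str) -> List[str]:
    """Decode a URL or request body and split it into non-empty tokens."""
    decoded = decode_request(raw)
    return [tok for tok in _SPLIT_RE.split(decoded) if tok]
```

For example, `tokenize("/shop/cart.jsp?id=2&name=%27%20OR%201%3D1")` yields `['shop', 'cart.jsp', 'id', '2', 'name', "' OR 1", '1']`; note that the `=` revealed by decoding the payload is also treated as a delimiter.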
An HTTP request message consists of three parts: the request line, the headers, and the request body. The request line is the first line of the message, and its format is as follows:
Method Request-URI HTTP-Version
Method is the request method, Request-URI is a uniform resource identifier, and HTTP-Version is the HTTP protocol version of the request. There are many request methods; the two most common are GET and POST. A GET request retrieves the resource identified by the Request-URI, while a POST request appends new data to the resource identified by the Request-URI. The format of the Request-URI is as follows:
3.2 Word Embedding
3.3 Training Model and Detecting Web Attack
We treat each token in the preprocessed sequence from Sect. 3.1 as a word and map it to a vector using word2vec; these vectors are the input of the model. We then train a neural network model on the training samples. Once trained, the model serves as a classifier that labels each request as a Web attack or a normal request. The neural network architecture is shown in Fig. 7.
A bidirectional recurrent neural network (BRNN) can be trained using all available input information in both the past and the future of a given time step. It therefore overcomes a limitation of the conventional RNN, which can only use the inputs seen so far.
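The idea can be illustrated with a deliberately small sketch. It swaps the LSTM unit for a scalar simple (Elman) RNN cell and shares one pair of weights between the two directions, both of which are simplifications made here for brevity and not part of the proposed model. The function names and the weight values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def rnn_states(xs: Sequence[float], w_in: float = 0.8, w_rec: float = 0.5) -> List[float]:
    """Hidden state after each step of a scalar simple RNN:
    h_t = tanh(w_in * x_t + w_rec * h_{t-1}), with h_0 = 0."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_states(xs: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair, for every position, a state that has read the past with one that has read the future."""
    forward = rnn_states(xs)                         # position t has seen x_1 .. x_t
    backward = rnn_states(list(reversed(xs)))[::-1]  # position t has seen x_T .. x_t
    return list(zip(forward, backward))
```

Changing only the last input leaves the forward state at the first position untouched but changes the backward state there, which is the extra context a unidirectional RNN cannot provide.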
4 Experimental Results and Discussion
In this section, we conduct various experiments on the HTTP DATASET CSIC 2010 [16] to evaluate the performance of our proposed method for detecting Web attacks.
4.1 Dataset
The HTTP dataset CSIC 2010 contains thousands of automatically generated Web requests that can be used to test Web attack protection systems. It was developed at the Information Security Institute of CSIC (Spanish Research National Council) and consists of generated traffic targeted at an e-Commerce Web application. The dataset includes 36,000 normal requests and more than 25,000 anomalous requests, and each HTTP request is labeled as normal or anomalous.
4.2 Experiment
The network summary of BL-IDS
Layer (type) | Output shape | Param # |
Embedding | (None, 56, 40) | 2053840 |
Bidirectional | (None, 56, 20) | 4080 |
Dropout | (None, 56, 20) | 0 |
Bidirectional | (None, 20) | 2480 |
Dropout | (None, 20) | 0 |
Dense | (None, 2) | 42 |
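The parameter counts in the summary can be reproduced with a few lines of arithmetic. The sketch below assumes a standard LSTM cell with four gates and bias terms, 10 units per direction with the two directions concatenated (which gives the output width of 20), and a vocabulary of 51,346 entries; the vocabulary size is not stated in the text and is inferred here from 2053840 / 40. The helper names are hypothetical.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_params(input_dim: int, units_per_direction: int) -> int:
    """A forward and a backward LSTM with separate weights."""
    return 2 * lstm_params(input_dim, units_per_direction)


def dense_params(input_dim: int, outputs: int) -> int:
    """Weight matrix plus one bias per output."""
    return input_dim * outputs + outputs


VOCAB_SIZE = 2053840 // 40  # 51346, inferred from the table, not stated in the text

print(embedding_params(VOCAB_SIZE, 40))  # 2053840
print(bilstm_params(40, 10))             # 4080
print(bilstm_params(20, 10))             # 2480
print(dense_params(20, 2))               # 42
```

Under these assumptions all four non-zero entries of the table are matched, and the embedding layer accounts for almost all of the model's parameters.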
Binary confusion matrix
| Actual class: abnormal | Actual class: normal |
Predicted class: abnormal | TP | FP |
Predicted class: normal | FN | TN |
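The evaluation metrics reported in Sect. 4.3 can be derived from the four cells of this matrix. The sketch below uses the standard definitions, which are assumed here because the paper's own formulas are not reproduced in this text; in particular, the detection rate is taken to be the recall on the abnormal class. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary metrics with 'abnormal' as the positive class."""
    detection_rate = tp / (tp + fn)    # share of attacks that are caught (recall)
    false_alarm_rate = fp / (fp + tn)  # share of normal requests flagged as attacks
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)         # share of alarms that are real attacks
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {
        "detection_rate": detection_rate,
        "false_alarm_rate": false_alarm_rate,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
    }
```

With made-up counts such as `detection_metrics(tp=90, fp=5, fn=10, tn=95)`, the detection rate is 0.90, the false alarm rate 0.05, and the accuracy 0.925.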
4.3 Results and Discussions
We trained the Bi-LSTM in mini-batches for 10 epochs, with the batch size set to 128 and validation_split set to 0.1, which gives 43,966 training samples and 4,886 validation samples. The training and validation accuracy and loss were recorded after every epoch, and their trends are presented in Fig. 10. Figure 10(a) shows the accuracy trends, where the orange curve represents the validation accuracy and the dark cyan curve represents the training accuracy. After about 7 epochs of training, both the training and validation accuracies exceed 98%. Figure 10(b) shows the loss trends, where the orange curve represents the validation loss and the dark cyan curve represents the training loss. Both the training and validation losses decrease rapidly towards 0. These accuracy and loss trends indicate that the Bi-LSTM learns the task well.
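The two sample counts follow from the split fraction. The sketch below assumes that the split keeps the first floor(n × (1 − validation_split)) samples for training and holds out the rest, which is how the Keras `fit` argument of the same name behaves; the total of 48,852 samples is inferred as the sum of the two reported counts. The function name is hypothetical.

```python
import math
from typing import Tuple


def split_counts(n_samples: int, validation_split: float) -> Tuple[int, int]:
    """Return (training count, validation count) for a trailing hold-out split."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    return split_at, n_samples - split_at


print(split_counts(43966 + 4886, 0.1))  # (43966, 4886)
```

Because 48,852 × 0.9 = 43,966.8 is rounded down, the validation set receives the extra sample.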
We evaluated the detection ability of the model by running the trained Bi-LSTM on the test data after 10 epochs of training: the detection rate is 98.17%, the false alarm rate is 1.40%, the test accuracy is 98.35%, the precision is 99.00%, and the F1-score is 98.58%. This illustrates that, with a certain amount of training, the Bi-LSTM achieves state-of-the-art results in detecting Web attacks, with both a high detection rate and a low false alarm rate.
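As a consistency check, the F1-score implied by the reported precision and detection rate can be computed as follows, taking the detection rate as the recall R (an assumption); it agrees with the 98.58% reported above.

```latex
F_1 = \frac{2 P R}{P + R}
    = \frac{2 \times 0.9900 \times 0.9817}{0.9900 + 0.9817}
    = \frac{1.943766}{1.9717}
    \approx 0.9858
```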
Compared with the method of Zhang et al. [14], our method achieves better results. Our experimental results show that BL-IDS can greatly improve the accuracy and detection rate while maintaining a low false alarm rate. Our analysis suggests that HTTP requests resemble natural language: each request can be treated as a sequence of tokens, and there is a sequential relationship between the tokens. HTTP requests are therefore well suited to recurrent neural networks such as the Bi-LSTM, whereas convolutional neural networks are better suited to image tasks.
5 Conclusion
In this paper, we explored a deep learning method for detecting Web attacks, based on an RNN with Bi-LSTM units. The method detects Web attacks by inspecting HTTP request packets. First, we studied data preprocessing, which selects useful information from HTTP request packets and produces word sequences. Second, we studied the embedding method used to map words to vectors. Finally, a Bi-LSTM is used to extract features automatically and classify HTTP request packets as normal or abnormal. We conducted experiments on the HTTP DATASET CSIC 2010 to evaluate the effectiveness of the method. The results show that the Bi-LSTM can be trained easily and that the detection method has a high detection rate and a low false alarm rate in detecting Web attacks.