1 Introduction
Modern software-intensive systems, including service-oriented systems, have become increasingly large and complex. While these systems provide users with rich services, they also bring new challenges to system operation and maintenance. One of the challenges is to identify faults and discover potential risks by analyzing massive amounts of log data. Logs are composed of semi-structured texts, i.e., log messages. Log analysis is one of the main techniques that engineers use to troubleshoot faults and capture potential risks. When a fault occurs, checking system logs helps to detect and locate it efficiently. However, as systems grow in scale and complexity, manually identifying abnormal logs in massive log data has become infeasible.
During the past decade, many automated log analysis approaches, including supervised, semi-supervised, and unsupervised approaches, have been proposed to detect system anomalies reflected by logs [1, 2, 3, 4, 5]. Although supervised approaches show promising results, the scarcity of labeled anomalous log data is a daunting issue. In contrast, unsupervised and semi-supervised approaches have the significant advantage that no labeled anomalous data are needed. However, existing unsupervised and semi-supervised methods suffer from low accuracy.
In this paper, we propose a log anomaly detection method, LogDP, which simultaneously utilizes both the dependency among log events and the proximity among log sequences to detect anomalous log sequences. LogDP first discovers the normal patterns of logs, then identifies the log sequences that violate these patterns as anomalies. There are two types of normal patterns: dependency patterns (DPs) and proximity patterns (PPs). DPs are learned for events that have dependency relationships with other events, and PPs for events that are independent of other events. To find the DP of an event, LogDP trains a predictive model to predict this event using some other events as predictors. We name the log event to be predicted the focused event, and the predictor events the related events of the focused event. To find the PP of an event, a mean prediction model is trained to use the mean value of the event as its expected value. When detecting anomalies, given a log sequence, its expected values on all log events are predicted using the learned models, and the differences between the observed and expected values, named pattern deviations, are calculated; they indicate the degree to which the log sequence deviates from the corresponding normal patterns. If any pattern deviation is beyond its normal range, i.e., a normal pattern is violated, the log sequence is flagged as an anomaly.
In summary, our main contributions in this work are as follows:

We propose LogDP, a novel log-based anomaly detection method, which utilizes the dependency among log events and the proximity among log sequences at the same time. To the best of our knowledge, we are the first to introduce dependency-based anomaly detection techniques to the field of log analysis.

We experimentally demonstrate the effectiveness of the proposed method on seven settings of three widely-used log datasets. The empirical results show that the proposed approach outperforms state-of-the-art unsupervised and semi-supervised log-based anomaly detection methods.
2 The LogDP Method
In this section, we first explain log preprocessing, and then present the LogDP method. LogDP consists of two phases: the training phase and the test phase. In the training phase, for each log event, LogDP trains an expected value prediction model and produces the corresponding threshold. In the test phase, the trained prediction models and thresholds are used to determine whether a log sequence is an anomaly.
2.1 Log Preprocessing
Logs are usually semi-structured texts used to record the status of systems. Each log message consists of a constant part (log event) and a variable part (log parameters). Log parsers [6, 7, 8] can parse log messages into log events, which are the templates of the log messages. Figure 1(a) shows a snippet of raw logs and the results after they are parsed.
Log messages can be grouped into log sequences (i.e., series of log events that record specific execution flows) according to sessions or time windows. Session-based log partitioning often utilizes certain log identifiers to generate log sequences. When using time windows to partition logs, two strategies are usually used: fixed windows and sliding windows. The fixed window strategy uses a predefined window size, e.g., 1 hour, to produce log sequences, while the sliding window strategy generates log sequences with overlaps between two consecutive fixed windows. For each log sequence, the occurrences of the events are counted, resulting in an Event Count Matrix (ECM). For example, an ECM is shown in Figure 1(b), where x_ij indicates the number of occurrences of event e_j in sequence c_i.
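As a minimal illustration, the session-based grouping and counting described above can be sketched in Python (the session IDs and event IDs below are made-up examples; in practice the pairs would come from a log parser such as those in [6, 7, 8]):

```python
from collections import Counter

# Toy parsed log stream: (session_id, event_id) pairs.
# Session IDs and event IDs here are hypothetical.
parsed_logs = [
    ("blk_1", "E1"), ("blk_1", "E2"), ("blk_1", "E2"),
    ("blk_2", "E1"), ("blk_2", "E3"),
]

# Fixed event vocabulary (the columns of the ECM).
events = ["E1", "E2", "E3"]

# Group messages into sequences by session ID, then count events.
sequences = {}
for sid, event in parsed_logs:
    sequences.setdefault(sid, Counter())[event] += 1

# Event Count Matrix: one row per sequence, one column per event.
ecm = [[sequences[sid][e] for e in events] for sid in sorted(sequences)]
# ecm == [[1, 2, 0], [1, 0, 1]]
```

A time-window strategy would differ only in the grouping key: the window index derived from each message's timestamp instead of the session ID.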
The notation used in this paper is as follows. We use a boldfaced upper case letter, e.g., X, to denote a matrix; a boldfaced lower case letter, e.g., e, for a vector; and a lower case letter, e.g., x, for a scalar. We reserve X for an ECM with n log sequences and m log events. E = {e_1, ..., e_m} represents the set of log events of X, and e_j ∈ E is a log event. A log sequence is denoted as c = (x_1, ..., x_m), where x_j is a log instance, i.e., the occurrence count of event e_j in c. The log instance of event e_j in sequence c_i is represented as x_ij.
2.2 The Training Phase of LogDP
The workflow of the training phase of the LogDP method is presented in Figure 2. The inputs of the training phase are a training set X_train and a validation set X_val, both of which only contain normal log sequences. X_train is used to train the expected value prediction models, and X_val is used to obtain the thresholds. The training phase is composed of two steps: related event selection and prediction model training. In the related event selection step, for each event, named the focused event, its related events are selected to be used as predictors of the focused event. In the prediction model training step, two different prediction models are trained according to whether a Markov blanket (MB) is found for the focused event. If the focused event is dependent, i.e., an MB is found for it, a Multi-Layer Perceptron (MLP) regressor is trained to embody the dependency relationship between the focused event and its MB. If the focused event is independent, i.e., no MB is found for it, a mean prediction model is trained. That is, DPs are learned for dependent events using the dependency-based technique, and PPs for independent events using the proximity-based technique. After training the expected value prediction models, X_val is input to obtain the corresponding thresholds. The outputs of the training phase are a set of prediction models and their corresponding thresholds.
2.2.1 Related Event Selection
In this step, we aim to identify the related events of a focused event, which are later used as predictors in a predictive model to predict the value of the focused event.
We follow [9] to adopt a causal feature selection technique, MBs, in this step to achieve good prediction accuracy and efficiency. MBs are defined in the context of a Bayesian Network (BN) [10]. A BN is a type of probabilistic graphical model used to represent and infer the dependency among variables. In the context of log analysis, variables correspond to log events. A BN can be denoted as a pair (G, P), where G is a Directed Acyclic Graph (DAG) showing the structure of the BN, and P is the joint probability of the nodes in G. Specifically, G = (E, A), where E is the set of nodes representing the random variables in the domain under consideration, and A is the set of arcs representing the dependency among the nodes. e_i is known as a parent of e_j (or e_j is a child of e_i) if there exists an arc e_i → e_j. For any variable e in a BN, its MB contains all the children, parents, and spouses (other parents of the children) of e, denoted as MB(e). Given MB(e), e is conditionally independent of all the other variables in E, i.e.,

P(e | MB(e), R) = P(e | MB(e)),    (1)

where R = E \ ({e} ∪ MB(e)).
According to Equation 1, MB(e) represents all the information needed to estimate the probability of e, rendering the remaining variables irrelevant; hence MB(e) is the minimal set of relevant variables for capturing the complete dependency of e. The study in [9] has shown that using MBs as related variables achieves better performance than other choices of related events.
2.2.2 Dependency Model Training
The goal of this step is to train the expected value prediction models. As shown in Figure 2, after learning MBs in the first step, events are categorized into two groups: independent events, i.e., events having no MB, and dependent events, i.e., events having an MB. For an independent event, the expected value is predicted as the mean of the instances of the event in the training set. For a dependent event e, an MLP regressor is trained to predict the expected value of e using MB(e) as predictors. Theoretically, any regression model could be used for this step, and several regression models, such as regression trees, linear regression and SVM regressors, have been adopted in existing dependency-based anomaly detection techniques. We chose MLP as the dependency model because it can deal with more complex data distributions and showed better performance than other regression models in our experiments.
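A simplified sketch of this step, assuming scikit-learn and a toy ECM, and substituting a naive correlation filter for the MB discovery that LogDP actually performs via the causal feature selection technique of [9]:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n = 200

# Toy normal ECM: event 0 and event 1 are dependent (e1 = 2 * e0),
# while event 2 is independent of the others.
e0 = rng.integers(1, 5, n).astype(float)
e2 = rng.integers(0, 2, n).astype(float)
X_train = np.column_stack([e0, 2 * e0, e2])

models = {}
for j in range(X_train.shape[1]):
    y = X_train[:, j]                       # focused event
    others = np.delete(X_train, j, axis=1)  # candidate related events
    # Naive stand-in for MB discovery: keep events that are strongly
    # correlated with the focused event as its "related events".
    related = [k for k in range(others.shape[1])
               if abs(np.corrcoef(others[:, k], y)[0, 1]) > 0.5]
    if related:
        # Dependent event -> DP: MLP regressor on the related events.
        reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(others[:, related], y)
        models[j] = ("DP", related, reg)
    else:
        # Independent event -> PP: mean prediction model.
        models[j] = ("PP", None, y.mean())
```

On this toy data, events 0 and 1 get DP models and event 2 falls back to a PP mean model; a proper MB discovery algorithm would replace the correlation filter without changing the rest of the flow.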
In LogDP, we consider both dependent and independent log events in anomaly detection because it is common that some anomalous messages are printed to system logs only when anomalies occur. Such anomalous log messages usually have no dependency on other log events. If they were excluded from anomaly detection, many anomalies could be missed. As these anomalous events only occur when anomalies happen, they are unlikely to be present in normal log sequences, which is why LogDP detects them by examining the deviation from the mean values of normal sequences.
To obtain the thresholds, a validation set X_val with normal log sequences is input into the learned expected value prediction models to get the expected values of the validation set, denoted as X̂_val. The deviation matrix of X_val is calculated as D = |X_val − X̂_val|. Then, for each event e_j, its threshold is calculated as the maximum value of the deviations of the event, i.e., θ_j = max(d_j), where d_j is the j-th column of D.
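This computation can be sketched in NumPy (the validation matrix and model predictions below are made-up numbers):

```python
import numpy as np

# Made-up validation ECM (normal sequences only) and the models'
# expected values for it.
X_val = np.array([[2.0, 4.0, 1.0],
                  [3.0, 6.0, 0.0]])
X_val_hat = np.array([[2.1, 4.2, 0.5],
                      [2.8, 5.9, 0.5]])

# Deviation matrix D = |X_val - X_val_hat|.
D = np.abs(X_val - X_val_hat)

# Per-event threshold: the largest deviation observed on normal data.
theta = D.max(axis=0)   # approximately [0.2, 0.2, 0.5]
```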
2.3 The Test Phase of LogDP
The goal of the test phase is to use the learned models and thresholds to detect anomalies. Given a log sequence c, the expected value x̂_j of each instance x_j is predicted by the corresponding prediction model. Then, the deviation is calculated as d_j = |x_j − x̂_j|. If d_j > θ_j for any event e_j, then c is flagged as an anomaly. c is considered normal only if it follows all the normal patterns.
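A sketch of this check (NumPy assumed; the thresholds, observed sequence and expected values are made-up numbers):

```python
import numpy as np

theta = np.array([0.2, 0.2, 0.5])   # per-event thresholds (made up)
c = np.array([2.0, 9.0, 1.0])       # observed test sequence
c_hat = np.array([2.1, 4.1, 0.8])   # expected values from the models

d = np.abs(c - c_hat)
# The sequence is anomalous if ANY event violates its normal pattern.
is_anomaly = bool((d > theta).any())  # True: event 1 deviates by 4.9
```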
3 Evaluation
3.0.1 Datasets
Three public log datasets, HDFS, BGL and Spirit, are used in our experiments, all available from [11]. From these three datasets, we generate seven datasets using different log grouping strategies. HDFS is grouped by session, while BGL and Spirit are grouped using 1-hour, 100-log, and 20-log windows. The resulting BGL and Spirit datasets are named Dataset-Window, e.g., BGL-100logs, as shown in Table 1.
For LogDP, the first 2/3 sequences of the training set are used for training, and the remaining 1/3 sequences are used as a validation set.
Dataset  #Evt   Window    Training Set                  Test Set
                          #Seq     #Anom.  %Anom.       #Seq     #Anom.  %Anom.
HDFS     29     session   287,530  8,419   2.93%        287,531  8,419   2.93%
BGL      980    1 hour    3,673    495     13.48%       1,481    170     11.48%
BGL      980    100 logs  37,707   4,009   10.63%       9,426    816     8.66%
BGL      980    20 logs   188,539  17,252  9.15%        47,134   3,005   6.38%
Spirit   1,229  1 hour    1,751    1,213   69.27%       585      225     38.46%
Spirit   1,229  100 logs  79,999   20,598  25.75%       19,999   429     2.15%
Spirit   1,229  20 logs   399,999  82,002  20.50%       99,999   498     0.50%

#Evt: number of events; #Seq: number of sequences; #Anom.: number of anomalies; %Anom.: percentage of anomalies.
3.0.2 Benchmark Methods
Six state-of-the-art log-based anomaly detection methods are selected as benchmarks: three proximity-based methods, PCA [12], One-Class SVM [13] (OCSVM) and LogCluster [14]; a sequential-based method, DeepLog [4]; and two invariant relation-based methods, Invariants Mining [1] (IM) and ADR [3]. Descriptions of the benchmark methods can be found in Section 4.
3.0.3 Experimental Results
The experimental results (precision, recall and F1) of LogDP and the benchmark methods are presented in Table 2, with the best results in boldface. Overall, LogDP produces superior results compared to the benchmark methods: out of 7 datasets, it achieves the best F1 on all 7, the best precision on five, and the best recall on two.
Dataset  Metrics  LogDP  PCA  OCSVM  LogCluster  DeepLog  IM  ADR 
HDFS-session  F1  0.987  0.790  0.068  0.800  0.945  0.943  0.974 
Precision  0.979  0.980  0.035  0.870  0.958  0.893  0.951  
Recall  0.995  0.670  0.940  0.740  0.933  1.000  1.000  
BGL-1hour  F1  0.789  0.170  0.393  0.147  0.596  0.490  0.547 
Precision  0.935  0.352  0.383  0.009  0.474  0.343  0.377  
Recall  0.682  0.112  0.403  0.394  0.802  0.859  1.000  
BGL-100logs  F1  0.539  0.130  0.132  0.243  0.378  0.387  0.250 
Precision  0.858  0.440  0.075  0.147  0.321  0.324  0.143  
Recall  0.393  0.076  0.556  0.705  0.461  0.482  0.987  
BGL-20logs  F1  0.460  0.237  0.168  0.226  0.224  0.203  0.204 
Precision  0.985  0.447  0.094  0.129  0.126  0.163  0.114  
Recall  0.300  0.162  0.744  0.884  0.981  0.269  0.988  
Spirit-1hour  F1  0.821  0.187  0.601  0.367  0.582  0.387  0.792 
Precision  0.697  0.312  0.742  0.324  0.412  0.678  0.656  
Recall  1.000  0.133  0.505  0.422  0.991  0.271  1.000  
Spirit-100logs  F1  0.575  0.111  0.003  0.110  0.153  0.107  0.445 
Precision  0.405  0.094  0.002  0.152  0.087  0.057  0.287  
Recall  0.993  0.135  0.023  0.086  0.643  0.993  0.994  
Spirit-20logs  F1  0.905  0.095  0.009  0.173  0.135  0.032  0.558 
Precision  0.835  0.051  0.005  0.150  0.191  0.016  0.387  
Recall  0.988  0.639  0.057  0.205  0.104  0.974  0.999 
As for the different log partitioning strategies, i.e., session (for HDFS) and time window (for BGL and Spirit), LogDP performs well with both. In contrast, as IM, ADR and DeepLog are designed to be more suitable for session-based log partitioning, they yield good results on the HDFS dataset but relatively poor results on the other datasets. Compared to the benchmark methods based on proximity-based anomaly detection techniques, i.e., PCA, OCSVM and LogCluster, LogDP produces significantly better results on all datasets except for the precision of PCA on the HDFS dataset. In summary, the experiments show the superior performance of LogDP on different datasets with different log partitioning strategies.
4 Related Work
Log-based anomaly detection has been intensively studied in recent decades. In terms of the techniques used, the existing approaches can be roughly categorized into proximity-based, sequential-based, and relation-based approaches. Proximity-based methods, such as PCA (Principal Component Analysis) [12] and LogCluster [14], cast a log event sequence as a point in a feature space and utilize distance or density metrics to evaluate the proximity of the log sequence to others. Sequences far from the others are flagged as anomalies. Sequential-based methods, such as DeepLog [4] and LogAnomaly [5], use sequences of log events to train models that try to predict future events. Log sequences that do not comply with the predicted sequential patterns are identified as anomalies. Relation-based methods, such as Invariants Mining [1] and ADR [3], try to find meaningful relations among log events and use the relations to detect anomalies. As a relation-based method, LogDP is more flexible than the existing ones. Existing relation-based methods [1, 3] are based on the invariant relationships among log events. Invariant relations refer to the linear relationships among log events that are related to the program workflows. However, there are two limitations in the existing invariant relation-based methods: (1) the mined relations are sensitive to data noise; (2) the mined relations are restricted to linear relations among the events. In contrast, LogDP utilizes the probabilistic relationships among log events, which makes it less sensitive to data noise. LogDP also adopts MLP regressors as dependency models, which can deal with both linear and non-linear relationships.
5 Conclusion
We have proposed a log-based anomaly detection method, LogDP, which utilizes deviations from normal patterns to effectively detect anomalous log sequences. LogDP divides log events into two types: dependent events and independent events. For dependent events, the normal patterns are learned from the probabilistic relationship between an event and its MB, i.e., the dependency among events. For independent events, the normal patterns are obtained from the mean prediction models, i.e., the proximity among sequences. Log sequences that violate any normal pattern are identified as anomalies. Our experimental results show that LogDP outperforms state-of-the-art benchmark methods. Our source code and experimental data are available at: https://github.com/ilwoof/LogDP.
6 Acknowledgments
This research was supported by an Australian Government Research Training Program (RTP) Scholarship, and by the Australian Research Council’s Discovery Projects funding scheme (project DP200102940). The work was also supported with supercomputing resources provided by the Phoenix High Powered Computing (HPC) service at the University of Adelaide.
References
 [1] J. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li. Mining invariants from console logs for system problem detection. In USENIX Annual Technical Conference, 2010.
 [2] S. He, J. Zhu, P. He, and M. Lyu. Experience report: System log analysis for anomaly detection. In ISSRE, pages 207–218. IEEE, 2016.
 [3] Bo Zhang, Hongyu Zhang, Pablo Moscato, and Aozhong Zhang. Anomaly detection via mining numerical workflow relations from logs. In SRDS. IEEE, 2020.

 [4] M. Du, F. Li, G. Zheng, and V. Srikumar. DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.
 [5] W. Meng, Y. Liu, Y. Zhu, S. Zhang, D. Pei, Y. Liu, Y. Chen, R. Zhang, S. Tao, P. Sun, et al. LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs. In IJCAI, volume 19, pages 4739–4745, 2019.
 [6] P. He, J. Zhu, Z. Zheng, and M. Lyu. Drain: An online log parsing approach with fixed depth tree. In ICWS. IEEE, 2017.
 [7] Min Du and Feifei Li. Spell: Streaming parsing of system event logs. In IEEE ICDM, pages 859–864. IEEE, 2016.

 [8] H. Dai, H. Li, C. Chen, W. Shang, and T. Chen. Logram: Efficient log parsing using n-gram dictionaries. IEEE Transactions on Software Engineering, 2020.
 [9] S. Lu, L. Liu, J. Li, T. D. Le, and J. Liu. LoPAD: A local prediction approach to anomaly detection. In Advances in Knowledge Discovery and Data Mining, 2020.
 [10] Judea Pearl. Causality: models, reasoning and inference. Springer, 2000.
 [11] S. He, J. Zhu, P. He, and M. R. Lyu. Loghub: A large collection of system log datasets towards automated log analytics. arXiv e-prints, 2020.
 [12] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, pages 117–132, 2009.
 [13] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 2001.
 [14] Q. Lin, H. Zhang, J. Lou, Y. Zhang, and X. Chen. Log clustering based problem identification for online service systems. In ICSE-C. IEEE, 2016.