LOG REPRESENTATION AS AN INTERFACE FOR LOG PROCESSING APPLICATIONS

Abstract

Log files from computer systems are ubiquitous and record events, messages, or transactions. Logs are rich containers of data because they can store sequences of structured textual and numerical data. Many sequential forms of data, including natural languages and temporal signals, can be represented as logs. We propose to represent logs at several levels of abstraction: field level, log level, and log sequence level. The representation at each level can be computed from the previous level. These representations are in vector format and serve as interfaces to downstream applications. We use a version of Transformer Networks (TNs), suited to log embeddings, to encode numerical and textual information. We show how a number of log processing applications can be readily solved with our representations.

1. INTRODUCTION

A wide range of computer systems record their events as logs. Log-generating systems include telecommunication systems, data centers, software applications, operating systems, sensors, banks, markets, and blockchains (Barik et al. (2016); Brandt et al. (2020); Busany & Maoz (2016); Cucurull & Puiggalí (2016); Sutton & Samavi (2017)). In these systems, transactions, events, actions, communications, and error messages are documented as log entries. Log entries are stored as plain text, so they can hold any textual or numerical information. Many different data types can be viewed as logs, including natural languages, temporal signals, and even DNA sequences. All transitions of a Turing machine can be stored as logs, so logs are, in principle, expressive enough to reproduce the state of any computer system. In some systems log entries are standardized and complete. For example, in financial transactions and some database management systems, one can recreate the state of the system by applying a set of rules to the transactions (Mohan et al. (1992)). In other systems, such as telecommunication networks, log entries are ad hoc, unstructured, and diverse, so the state of the system cannot be recreated with a fixed set of rules. We instead use layered, learnt vector embeddings to represent the state of the system and apply them to downstream diagnostic applications including anomaly detection, classification, causal analysis, and search.

Example: A snapshot of logs from a telecommunication product installed in a cell tower is shown in Figure 1. The first entry of the log is a trigger for the subsequent action of restarting the unit. Each log entry is embedded in a vector space, and the sequence of log entries has its own vector representation. Timestamp information provides crucial clues about the nature of the events.

We summarize our contributions as follows:
1. We propose levels of abstraction for log representation, in order to standardize and simplify log processing applications.
2. We present a Transformer-based model to embed log sequences in a vector space.
3. We show how log processing applications can be simplified using these log representations.
We validate our approach on a real data set obtained from a leading telecommunications vendor. The vocabulary of this data set is twenty times larger than what is currently available in open source and often used in other research papers.

The prototypical work is DeepLog, which, in a fashion analogous to natural language processing, models logs as sequences from a restricted vocabulary following certain patterns and rules (Du et al. (2017)). An LSTM model M of log executions is inferred from a database of log sequences. To determine whether a given element w_{t+1} in a log sequence is normal or anomalous, DeepLog outputs the probability distribution P_M(· | w_{1:t}), where w_{1:t} = w_1, w_2, ..., w_t. If the actual token w_{t+1} is ranked high in P_M(· | w_{1:t}), it is deemed a normal event; otherwise it is flagged as anomalous. Several variations on the above approach have been proposed. For example, Nedelkoski et al. (2020) proposed Logsy, a classification-based method to learn effective vector representations of log entries for downstream tasks like anomaly detection. The core idea of the approach is to use easily accessible auxiliary data to supplement the positive (normal-only) target training samples. The auxiliary logs, which constitute anomalous data samples, can be obtained from the internet. The intuition behind such an approach to anomaly detection is that the auxiliary dataset is sufficiently informative to enhance the representation of normal and abnormal data, yet diverse enough to regularize against over-fitting and improve generalization.
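DeepLog's decision rule can be sketched independently of the LSTM itself: given the model's predicted distribution over the next log key, the observed key is flagged as anomalous when it falls outside the g most probable candidates. The function and the example probabilities below are an illustrative sketch, not DeepLog's actual implementation:

```python
import heapq

def is_anomalous(next_key_probs, actual_key, g=2):
    """Flag the observed next log key as anomalous if it is not among
    the g most probable keys under the model, where `next_key_probs`
    maps candidate log keys to P_M(key | w_1:t)."""
    top_g = heapq.nlargest(g, next_key_probs, key=next_key_probs.get)
    return actual_key not in top_g

# Hypothetical model output for the next log key after a sequence w_1:t.
probs = {"unit_restart": 0.55, "heartbeat": 0.30,
         "link_down": 0.10, "auth_fail": 0.05}
print(is_anomalous(probs, "heartbeat", g=2))  # False: in the top-2, normal
print(is_anomalous(probs, "auth_fail", g=2))  # True: outside the top-2, anomalous
```

The parameter g trades false alarms against missed anomalies: a larger g accepts more next-key candidates as normal.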

2. SYSTEM LOGS AND NATURAL LANGUAGE MODELS

Log processing models often adopt natural language processing techniques, because logs share several similarities with natural language: (i) both logs and natural language consist of sequences of tokens; (ii) context matters in both data streams, so both models need temporal memory; (iii) both application scenarios have large datasets; and (iv) annotation is a limiting factor in both settings. This is why the log processing literature often reuses natural language approaches. However, there are several differences between system logs and natural languages that need to be brought to the fore: (i) temporal information is associated with each log entry, which can be exploited to gain insight into the underlying processes that generate the logs; (ii) each log entry is itself a composite record of different pieces of information, unlike a word in a sentence, which is nearly atomic; and (iii) log files often aggregate event logs from multiple threads, processes, and devices, so the inference model needs to identify the relevant context among all threads.
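To illustrate the composite nature of a log entry, a minimal field-level parser might look as follows. The entry format, field names, and the sample line are hypothetical; real telecom logs are far more diverse and also contain entries that match no fixed template:

```python
import re
from datetime import datetime

# Hypothetical entry layout: timestamp, originating thread, severity, message.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<severity>[A-Z]+)\s+"
    r"(?P<message>.*)"
)

def parse_entry(line):
    """Split one composite log entry into its fields, or return None
    for unstructured entries that fall through to free-text handling."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # The timestamp field carries the temporal information that has no
    # analogue in natural-language text.
    fields["timestamp"] = datetime.strptime(fields["timestamp"],
                                            "%Y-%m-%d %H:%M:%S")
    return fields

entry = parse_entry("2021-03-01 12:04:59 [rru-7] ERROR unit restart triggered")
print(entry["severity"], entry["thread"])  # ERROR rru-7
```

The thread field is what allows entries interleaved from multiple processes to be grouped back into per-thread contexts before sequence modeling.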



After suitable anonymization, we plan to release the data set to the research community.



Figure 1: Sample telecommunication log file including three log entries. Telecommunication logs are complex and diverse because they involve various devices and software.

Yuan et al. (2020) proposed ADA (Adaptive Deep Log Anomaly Detector), which exploits the online deep learning methodology of Sahoo et al. (2018) to build an on-the-fly, unsupervised, adaptive deep log anomaly detector based on LSTMs; new models are trained as new log samples arrive.
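The online-learning principle behind ADA can be illustrated with a much simpler stand-in for the LSTM: a scorer whose statistics are updated on every incoming sample. The class, its feature choice, and its threshold below are illustrative only and are not part of the ADA paper:

```python
class OnlineAnomalyScorer:
    """Toy stand-in for an online detector: tracks the running mean and
    variance of a per-entry feature (e.g. inter-arrival time in seconds)
    with Welford's algorithm, scoring each new sample against the
    statistics seen so far before folding it in."""

    def __init__(self, threshold=4.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def score_then_update(self, x):
        # Score against the current statistics (i.e. before this sample).
        if self.n > 1:
            std = (self.m2 / self.n) ** 0.5
            anomalous = abs(x - self.mean) / (std + 1e-9) > self.threshold
        else:
            anomalous = False  # not enough history yet
        # Welford's online update: the model adapts to every new sample.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

scorer = OnlineAnomalyScorer()
arrivals = [1.0, 1.1, 0.9, 1.0, 1.2, 0.95, 1.05, 20.0]
flags = [scorer.score_then_update(x) for x in arrivals]
print(flags)  # only the final, abrupt jump is flagged
```

ADA replaces this hand-crafted scorer with an LSTM whose weights are likewise updated as new log samples stream in, removing the need for a fixed, pre-trained model.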

2.1 SEQ-TO-SEQ MODELS

Transformer Networks (TNs) have become the de-facto choice for sequence-to-sequence modeling (Vaswani et al. (2017)). Based on the concept of self-attention, TNs overcome some of the key limitations of the family of Recurrent Neural Networks (RNNs), including LSTMs and GRUs (Graves & Schmidhuber (2005)). Transformer Networks can model long range dependencies with constant

