LOG REPRESENTATION AS AN INTERFACE FOR LOG PROCESSING APPLICATIONS

Abstract

Log files from computer systems are ubiquitous and record events, messages, or transactions. Logs are rich containers of data because they can store sequences of structured textual and numerical data. Many sequential forms of data, including natural languages and temporal signals, can be represented as logs. We propose to represent logs at several levels of abstraction, including the field level, the log level, and the log sequence level. The representation at each level can be computed from the previous level. These representations are vectors and serve as interfaces to downstream applications. We use a version of Transformer Networks (TNs), suitable for log embeddings, to encode numerical and textual information. We show how a number of log processing applications can be readily solved with our representation.

1. INTRODUCTION

A wide range of computer systems record their events as logs. Log-generating systems include telecommunication systems, data centers, software applications, operating systems, sensors, banks, markets, and blockchains (Barik et al. (2016); Brandt et al. (2020); Busany & Maoz (2016); Cucurull & Puiggalí (2016); Sutton & Samavi (2017)). In these systems, transactions, events, actions, communications, and error messages are documented as log entries. Log entries are stored as plain text, so they can hold any textual or numerical information. A variety of data types can be viewed as logs, including natural languages, temporal signals, and even DNA sequences. All transactions in a Turing machine can be stored as logs. Logs are therefore, in theory, expressive enough to reproduce the state of any computer system.

In some systems, log entries are standardized and complete. For example, in financial transactions and some database management systems, one can recreate the state of the system by applying a set of rules to the transactions (Mohan et al. (1992)). In other systems, such as telecommunication networks, log entries are ad hoc, unstructured, and diverse, so the state of the system cannot be recreated with a set of rules. We instead use layered, learned vector embeddings to represent the state of the system, and use them for downstream diagnostic applications including anomaly detection, classification, causal analysis, and search.

Example: A snapshot of logs from a telecommunication product installed in a cell tower is shown in Figure 1. The first log entry triggers the subsequent action, a restart of the unit. Each log entry is embedded in a vector space, and the sequence of log entries has its own vector representation. Timestamp information provides crucial clues about the nature of the event(s).

We summarize our contributions as:
1. We propose levels of abstraction for log representation, in order to standardize and simplify log processing applications.
2. We present a Transformer-based model to embed log sequences in a vector space.
3. We show how log processing applications can be simplified using these log representations.

We validate our approach on a real data set obtained from a leading telecommunications vendor. The vocabulary of this data set is twenty times larger than that of the open-source data sets often used in other research papers.¹

¹ After suitable anonymization, we plan to release the data set to the research community.
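To make the layered idea concrete, the following minimal sketch shows how a representation at each level (field, log, log sequence) could be computed purely from the level below it. It is not the Transformer-based model proposed in this paper: the hash-based token vectors, the mean pooling, the width `DIM`, and all function names are assumptions chosen only to keep the example self-contained, and the sample log entries are invented.

```python
import hashlib
from typing import Dict, List

DIM = 8  # hypothetical embedding width, chosen only for illustration


def embed_token(token: str) -> List[float]:
    """Deterministic stand-in for a learned token embedding (hash-based)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    # Map the first DIM bytes (0..255) to floats in [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:DIM]]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_field(name: str, value: str) -> List[float]:
    """Field level: pool the field name with the tokens of its value."""
    tokens = [name] + value.split()
    return mean_pool([embed_token(t) for t in tokens])


def embed_log(entry: Dict[str, str]) -> List[float]:
    """Log level: computed only from the field-level vectors."""
    return mean_pool([embed_field(k, v) for k, v in entry.items()])


def embed_sequence(entries: List[Dict[str, str]]) -> List[float]:
    """Log sequence level: computed only from the log-level vectors."""
    return mean_pool([embed_log(e) for e in entries])


# Invented log entries, not taken from the paper's data set.
logs = [
    {"timestamp": "2020-01-01T00:00:01", "level": "ERROR",
     "message": "link down on port 3"},
    {"timestamp": "2020-01-01T00:00:02", "level": "INFO",
     "message": "unit restart requested"},
]

sequence_vector = embed_sequence(logs)
print(len(sequence_vector))
```

Mean pooling ignores the order of log entries and treats timestamps as opaque tokens, so this sketch only illustrates the interface between levels. A learned sequence encoder, such as the Transformer-based model described here, would be needed to capture ordering and numerical information.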

