# Nezha
This repository contains the basic implementation of our `FSE'23` paper [Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data](./FSE2023_Nezha.pdf).
## Description
`Nezha` is an interpretable and fine-grained RCA approach that pinpoints root causes at the code region and resource type level by jointly analyzing multi-modal data. `Nezha` transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of `Nezha` is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way.
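To make the core idea concrete, here is a toy sketch of comparing event patterns between the two phases. It is not the implementation in `pattern_miner.py` or `pattern_ranker.py`: the functions `pattern_support` and `rank_changed_patterns`, the use of adjacent event pairs as "patterns", the scoring rule, and the events `Receive request` and `Order placed` are all assumptions made for illustration.

```python
from collections import Counter


def pattern_support(traces):
    """Count how many traces contain each adjacent-event pair (a toy 'pattern')."""
    support = Counter()
    for events in traces:
        # A set ensures each pattern counts at most once per trace.
        support.update(set(zip(events, events[1:])))
    return support


def rank_changed_patterns(normal_traces, faulty_traces):
    """Rank patterns by how much their support ratio differs between phases."""
    normal = pattern_support(normal_traces)
    faulty = pattern_support(faulty_traces)
    scores = {}
    for pattern in set(normal) | set(faulty):
        normal_ratio = normal[pattern] / len(normal_traces)
        faulty_ratio = faulty[pattern] / len(faulty_traces)
        scores[pattern] = abs(normal_ratio - faulty_ratio)
    # Highest score first; ties are broken by pattern for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical event sequences; only the two charge-related log
# statements come from the label example in this README.
normal_traces = [
    ["Receive request", "Start charge card", "Charge successfully", "Order placed"],
    ["Receive request", "Start charge card", "Charge successfully", "Order placed"],
]
faulty_traces = [
    ["Receive request", "Start charge card", "Order placed"],
    ["Receive request", "Start charge card", "Charge successfully", "Order placed"],
]

ranked = rank_changed_patterns(normal_traces, faulty_traces)
for pattern, score in ranked:
    print(pattern, score)
```

In this toy data the patterns around `Charge successfully` change between phases, while the unaffected `Receive request` to `Start charge card` pattern scores zero and ranks last.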
## Quick Start
### Requirements
- Python 3.6 is recommended for running the anomaly detection; otherwise, any Python 3 version should work.
- Git is also required.
### Setup
Clone `Nezha` first via `git clone git@github.com:IntelligentDDS/Nezha.git`
Enter the `Nezha` directory with `cd Nezha`
Install the dependencies for Nezha with `python3.6 -m pip install -r requirements.txt`
### Running Nezha
#### OnlineBoutique at service level
```
python3.6 ./main.py --ns hipster --level service
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### OnlineBoutique at inner service level
```
python3.6 ./main.py --ns hipster --level inner
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### Trainticket at service level
```
python3.6 ./main.py --ns ts --level service
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
#### Trainticket at inner service level
```
python3.6 ./main.py --ns ts --level inner
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
The detailed service-level and inner-service-level results are printed and recorded in `./log`
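The sketch below shows how a top-k accuracy such as `AS@k` or `AIS@k` could be computed, assuming it denotes the percentage of fault cases whose true root cause appears among the top k ranked candidates. The function `accuracy_at_k` and the sample rankings are hypothetical and are not taken from `pattern_ranker.py`.

```python
def accuracy_at_k(ranked_lists, true_causes, k):
    """Percentage of fault cases whose true root cause is in the top-k candidates."""
    if not true_causes:
        raise ValueError("at least one fault case is required")
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_causes)
        if truth in ranked[:k]
    )
    return 100.0 * hits / len(true_causes)


# Illustrative rankings for three fault cases (service names are examples only).
ranked_lists = [
    ["checkoutservice", "cartservice", "frontend"],   # true cause ranked first
    ["cartservice", "checkoutservice", "frontend"],   # true cause ranked second
    ["frontend", "cartservice", "emailservice"],      # true cause missing
]
true_causes = ["checkoutservice", "checkoutservice", "checkoutservice"]

for k in (1, 3, 5):
    print(f"AS@{k}: {accuracy_at_k(ranked_lists, true_causes, k):.6f} %")
```

Under this assumption, the reported 92.857143 % for 56 OnlineBoutique faults is consistent with 52 hits out of 56 (52 / 56 ≈ 0.92857143).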
## Dataset
[2022-08-22](./rca_data/2022-08-22/) and [2022-08-23](./rca_data/2022-08-23/) are the fault-suffering datasets of OnlineBoutique
[2023-01-29](./rca_data/2023-01-29/) and [2023-01-30](./rca_data/2023-01-30/) are the fault-suffering datasets of Trainticket
### Fault-free data
[construct_data](./construct_data/) contains the data of the fault-free phase
[root_cause_hipster.json](./construct_data/root_cause_hipster.json) is the inner-service level label of root causes in OnlineBoutique
[root_cause_ts.json](./construct_data/root_cause_ts.json) is the inner-service level label of root causes in Trainticket
As an example,
```
"checkoutservice": {
"return": "Start charge card_Charge successfully",
"exception": "Start charge card_Charge successfully",
"network_delay": "NetworkP90(ms)",
"cpu_contention": "CpuUsageRate(%)",
"cpu_consumed": "CpuUsageRate(%)"
},
```
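In this example, the label for the `return` fault of `checkoutservice` means that the root cause is the code region between the log statements containing `Start charge card` and `Charge successfully`. For resource-level faults such as `cpu_contention`, the label is the name of the affected metric, e.g. `CpuUsageRate(%)`.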
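The label format can also be read programmatically. The minimal sketch below assumes that a code-region label joins its two log statements with a single underscore and that neither statement contains an underscore before that separator. The helper `split_code_region` is hypothetical and not part of the repository.

```python
import json

# Excerpt from the example above, wrapped in braces to form valid JSON.
LABELS_JSON = """
{
  "checkoutservice": {
    "return": "Start charge card_Charge successfully",
    "exception": "Start charge card_Charge successfully",
    "network_delay": "NetworkP90(ms)",
    "cpu_contention": "CpuUsageRate(%)",
    "cpu_consumed": "CpuUsageRate(%)"
  }
}
"""


def split_code_region(label):
    """Split a code-region label into its start and end log statements.

    Assumption: the first underscore separates the two statements.
    """
    start, end = label.split("_", 1)
    return start, end


labels = json.loads(LABELS_JSON)
service_labels = labels["checkoutservice"]

# Code-level fault: the root cause lies between two log statements.
start, end = split_code_region(service_labels["return"])
print(f"return fault region: '{start}' -> '{end}'")

# Resource-level fault: the label is a metric name and is used as-is.
print(f"cpu_contention metric: {service_labels['cpu_contention']}")
```

To read the full label file instead of the excerpt, replace `json.loads(LABELS_JSON)` with `json.load` on an open handle to `./construct_data/root_cause_hipster.json`.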
### Fault-suffering Data
[rca_data](./rca_data/) contains the data of the fault-suffering phase
[2022-08-22-fault_list](./rca_data/2022-08-22-fault_list) and [2022-08-23-fault_list](./rca_data/2022-08-23-fault_list) are the service-level labels of root causes in OnlineBoutique
[2023-01-29-fault_list](./rca_data/2022-01-29-fault_list) and [2023-01-30-fault_list](./rca_data/2022-01-30-fault_list) are the service-level labels of root causes in TrainTicket
## Project Structure
```
.
├── LICENSE
├── README.md
├── construct_data
│ ├── 2022-08-22
│ │ ├── log
│ │ ├── metric
│ │ ├── trace
│ │ └── traceid
│ ├── 2022-08-23
│ ├── 2023-01-29
│ ├── 2023-01-30
│ ├── root_cause_hipster.json: label at inner-service level for OnlineBoutique
│ └── root_cause_ts.json: label at inner-service level for ts
├── rca_data
│ ├── 2022-08-22
│ │ ├── log
│ │ ├── metric
│ │ ├── trace
│ │ ├── traceid
│ │ └── 2022-08-22-fault_list.json: label at service level
│ ├── 2022-08-23
│ ├── 2023-01-29
│ └── 2023-01-30
├── log: RCA result
├── log_template: drain3 config
├── alarm.py: generate alarm
├── data_integrate.py: transform metric, log, and trace to event graph
├── log_parsing.py: parsing logs
├── log.py: record logs
├── pattern_miner.py: mine patterns from event graph
├── pattern_ranker.py: rank suspicious patterns
├── main.py: running nezha
└── requirements.txt
```
## Reference
Please cite our FSE'23 paper if you find this work helpful.
```
@inproceedings{nezha,
title={Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data},
author={Yu, Guangba and Chen, Pengfei and Li, Yufeng and Chen, Hongyang and Li, Xiaoyun and Zheng, Zibin},
booktitle={ESEC/FSE 2023},
pages={},
year={2023},
organization={ACM}
}
```