.vscode/
__pycache__
# Quick Start
### 1.1 Requirements
- Python 3.6 is recommended for running the anomaly detection; any other Python 3 version should also work.
- Git is also needed.
### 1.2 Setup
Download `Nezha` first via `git clone git@github.com:IntelligentDDS/Nezha.git`
Enter the `Nezha` directory with `cd Nezha`
Run `python3.6 -m pip install -r requirements.txt` to install the dependencies for Nezha
### 1.3 Running Nezha
#### 1.3.1 Localize OnlineBoutique at service level
```
python3.6 ./main.py --ns hipster --level service
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### 1.3.2 Localize OnlineBoutique at inner-service level
```
python3.6 ./main.py --ns hipster --level inner
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### 1.3.3 Localize Trainticket at service level
```
python3.6 ./main.py --ns ts --level service
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
#### 1.3.4 Localize Trainticket at inner-service level
```
python3.6 ./main.py --ns ts --level inner
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
Detailed service-level and inner-service-level results are printed and recorded in `./log`
MIT License
Copyright (c) 2023 IntelligentDDS
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# Nezha
This repository contains the basic implementation of our FSE'23 conference paper [Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data](./FSE2023_Nezha.pdf)
## Description
`Nezha` is an interpretable and fine-grained RCA approach that pinpoints root causes at the code-region and resource-type level by jointly analyzing multi-modal observability data. `Nezha` transforms heterogeneous multi-modal data into a homogeneous event representation and extracts event patterns by constructing and mining event graphs. The core idea of `Nezha` is to compare event patterns in the fault-free phase with those in the fault-suffering phase to localize root causes in an interpretable way.
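To make the core idea concrete, here is a minimal, illustrative sketch (not `Nezha`'s actual implementation; event extraction and graph mining are omitted) of comparing event-pattern frequencies between the two phases:
```
from collections import Counter

def rank_suspicious_patterns(fault_free_events, fault_suffering_events):
    # Count how often each event pattern occurs in each phase.
    free = Counter(fault_free_events)
    suffering = Counter(fault_suffering_events)
    # Patterns that appear much more often (or only) in the
    # fault-suffering phase are ranked as most suspicious.
    scores = {p: suffering[p] - free.get(p, 0) for p in suffering}
    return sorted(scores, key=scores.get, reverse=True)

# Toy input: (service, event) pairs that might be extracted from logs/metrics/traces.
print(rank_suspicious_patterns(
    [("checkoutservice", "charge_ok")] * 9,
    [("checkoutservice", "charge_ok")] * 3 + [("checkoutservice", "cpu_alarm")] * 6,
))
```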
## Quick Start
### Requirements
- Python 3.6 is recommended for running the anomaly detection; any other Python 3 version should also work.
- Git is also needed.
### Setup
Download `Nezha` first via `git clone git@github.com:IntelligentDDS/Nezha.git`
Enter the `Nezha` directory with `cd Nezha`
Run `python3.6 -m pip install -r requirements.txt` to install the dependencies for Nezha
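In a single shell session, the setup is:
```
git clone git@github.com:IntelligentDDS/Nezha.git
cd Nezha
python3.6 -m pip install -r requirements.txt
```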
### Running Nezha
#### OnlineBoutique at service level
```
python3.6 ./main.py --ns hipster --level service
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### OnlineBoutique at inner-service level
```
python3.6 ./main.py --ns hipster --level inner
pattern_ranker.py:622: -------- hipster Fault numbuer : 56-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 92.857143 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 96.428571 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 96.428571 %
```
#### Trainticket at service level
```
python3.6 ./main.py --ns ts --level service
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
#### Trainticket at inner-service level
```
python3.6 ./main.py --ns ts --level inner
pattern_ranker.py:622: -------- ts Fault numbuer : 45-------
pattern_ranker.py:623: --------AIS@1 Result-------
pattern_ranker.py:624: 86.666667 %
pattern_ranker.py:625: --------AIS@3 Result-------
pattern_ranker.py:626: 97.777778 %
pattern_ranker.py:627: --------AIS@5 Result-------
pattern_ranker.py:628: 97.777778 %
```
Detailed service-level and inner-service-level results are printed and recorded in `./log`
## Dataset
[2022-08-22](./rca_data/2022-08-22/) and [2022-08-23](./rca_data/2022-08-23/) are the fault-suffering datasets of OnlineBoutique
[2023-01-29](./rca_data/2023-01-29/) and [2023-01-30](./rca_data/2023-01-30/) are the fault-suffering datasets of Trainticket
### Fault-free data
[construct_data](./construct_data/) contains the data of the fault-free phase
[root_cause_hipster.json](./construct_data/root_cause_hipster.json) is the inner-service-level label of root causes in OnlineBoutique
[root_cause_ts.json](./construct_data/root_cause_ts.json) is the inner-service-level label of root causes in Trainticket
As an example,
```
"checkoutservice": {
"return": "Start charge card_Charge successfully",
"exception": "Start charge card_Charge successfully",
"network_delay": "NetworkP90(ms)",
"cpu_contention": "CpuUsageRate(%)",
"cpu_consumed": "CpuUsageRate(%)"
},
```
The label for `checkoutservice` means that, for the `return` fault of `checkoutservice`, the root-cause code region lies between the log statements containing `Start charge card` and `Charge successfully`.
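For illustration, a minimal sketch of reading these labels with the Python standard library (assuming, as in the example above, that the underscore separates the start and end log statements of the code region):
```
import json

# Inner-service-level root-cause labels for OnlineBoutique.
with open("construct_data/root_cause_hipster.json") as f:
    root_causes = json.load(f)

# For a log-level fault such as "return", the label encodes the code region
# as "<start log content>_<end log content>".
start_log, end_log = root_causes["checkoutservice"]["return"].split("_")
print(start_log)  # Start charge card
print(end_log)    # Charge successfully
```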
### Fault-suffering Data
[rca_data](./rca_data/) contains the data of the fault-suffering phase
[2022-08-22-fault_list](./rca_data/2022-08-22-fault_list) and [2022-08-23-fault_list](./rca_data/2022-08-23-fault_list) are the service-level labels of root causes in OnlineBoutique
[2023-01-29-fault_list](./rca_data/2023-01-29-fault_list) and [2023-01-30-fault_list](./rca_data/2023-01-30-fault_list) are the service-level labels of root causes in TrainTicket
## Project Structure
```
.
├── LICENSE
├── README.md
├── construct_data
│   ├── 2022-08-22
│   │   ├── log
│   │   ├── metric
│   │   ├── trace
│   │   └── traceid
│   ├── 2022-08-23
│   ├── 2023-01-29
│   ├── 2023-01-30
│   ├── root_cause_hipster.json: label at inner-service level for OnlineBoutique
│   └── root_cause_ts.json: label at inner-service level for ts
├── rca_data
│   ├── 2022-08-22
│   │   ├── log
│   │   ├── metric
│   │   ├── trace
│   │   ├── traceid
│   │   └── 2022-08-22-fault_list.json: label at service level
│   ├── 2022-08-23
│   ├── 2023-01-29
│   └── 2023-01-30
├── log: RCA result
├── log_template: drain3 config
├── alarm.py: generate alarms
├── data_integrate.py: transform metrics, logs, and traces into event graphs
├── log_parsing.py: parse logs
├── log.py: record logs
├── pattern_miner.py: mine patterns from event graphs
├── pattern_ranker.py: rank suspicious patterns
├── main.py: run Nezha
└── requirements.txt
```
## Reference
Please cite our FSE'23 paper if you find this work helpful.
```
@inproceedings{nezha,
  title={Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data},
  author={Yu, Guangba and Chen, Pengfei and Li, Yufeng and Chen, Hongyang and Li, Xiaoyun and Zheng, Zibin},
  booktitle={ESEC/FSE 2023},
  pages={},
  year={2023},
  organization={ACM}
}
```
# STATUS
We would like to apply for the **Artifacts Evaluated - Functional**, **Artifacts Evaluated - Reusable**, and **Artifacts Available** badges.
## Artifacts Evaluated - Functional
We believe that the artifact deserves this badge because it has undergone a thorough evaluation of its functionality. The implementation associated with the paper has been tested and proven to successfully demonstrate the claimed functionality.
## Artifacts Evaluated - Reusable
We believe that the artifact deserves this badge because it has undergone an evaluation of its reusability. The implementation provided with the paper is considered to be of high quality and easy to use. Other researchers can utilize these resources in their own studies, as they are valuable and offer meaningful contributions to the research community.
## Artifacts Available
We believe that the artifact deserves this badge because we have made the implementation, dataset, and other relevant materials openly available at `https://github.com/IntelligentDDS/Nezha`.
import csv
import datetime
import logging
import os
import re
import statistics
from os.path import dirname

import numpy as np
import pandas as pd

from log import Logger

log_path = dirname(__file__) + '/log/' + str(datetime.datetime.now().strftime(
    '%Y-%m-%d')) + '_nezha.log'
logger = Logger(log_path, logging.DEBUG, __name__).getlog()

metric_threshold_dir = "metric_threshold"
def get_svc(path):
    # strip the two trailing "-<suffix>" segments of a pod name to get its service name
    svc = path.rsplit('-', 1)[0]
    svc = svc.rsplit('-', 1)[0]
    return svc
def generate_threshold(metric_dir, trace_file):
    """
    func generate_threshold: calculate the mean and std of each metric of each service
    and write the result to metric_threshold_dir/<service>.csv
    :parameter
        metric_dir - metric dir of the construction (fault-free) phase
        trace_file - trace file used to derive the network latency baseline
    """
    metric_map = {}
    path_list = os.listdir(metric_dir)
    for path in path_list:
        if "metric" in path:
            svc = path.rsplit('-', 1)[0]
            svc = svc.rsplit('-', 1)[0]
            if svc in metric_map:
                metric_map[svc].append(os.path.join(metric_dir, path))
            else:
                metric_map[svc] = [os.path.join(metric_dir, path)]
    for svc in metric_map:
        frames = []
        # get pod name
        for path in path_list:
            if svc in path:
                pod_name = path.split("_")[0]
                print(pod_name)
                network_mean, network_std = get_netwrok_metric(
                    trace_file=trace_file, pod_name=pod_name)
                break
        metric_threshold_file = metric_threshold_dir + "/" + svc + ".csv"
        for path in metric_map[svc]:
            frames.append(pd.read_csv(path, index_col=False, usecols=[
                'CpuUsageRate(%)', 'MemoryUsageRate(%)', 'SyscallRead', 'SyscallWrite']))
        # concat pods of the same service
        result = pd.concat(frames)
        with open(metric_threshold_file, 'w', newline='') as f:
            writer = csv.writer(f)
            header = ['CpuUsageRate(%)', 'MemoryUsageRate(%)', 'SyscallRead',
                      'SyscallWrite', 'NetworkP90(ms)']
            writer.writerow(header)
            mean_list = []
            std_list = []
            for metric in header:
                if metric == 'NetworkP90(ms)':
                    continue
                mean_list.append(np.mean(result[metric]))
                std_list.append(np.std(result[metric]))
            mean_list.append(network_mean)
            std_list.append(network_std)
            writer.writerow(mean_list)
            writer.writerow(std_list)
def get_netwrok_metric(trace_file, pod_name):
    """
    func get_netwrok_metric: use trace data to get the network latency metric
    :parameter
        trace_file - trace file of the target minute
        pod_name - pod to compute the latency for
    :return
        p90 network latency and its standard deviation
    """
    latency_list = []
    if "front" in pod_name:
        # the front end does not calculate network latency
        return 10, 10
    pod_reader = pd.read_csv(
        trace_file, index_col='PodName', usecols=['TraceID', 'SpanID', 'ParentID', 'PodName', 'EndTimeUnixNano'])
    parent_span_reader = pd.read_csv(
        trace_file, index_col='SpanID', usecols=['TraceID', 'SpanID', 'ParentID', 'PodName', 'EndTimeUnixNano'])
    try:
        pod_spans = pod_reader.loc[[pod_name], [
            'SpanID', 'ParentID', 'PodName', 'EndTimeUnixNano']]
    except KeyError:
        # pod not found in the trace: fall back to the recorded service threshold
        service = pod_name.rsplit('-', 1)[0]
        service = service.rsplit('-', 1)[0]
        csv_file = dirname(__file__) + "/metric_threshold/" + service + ".csv"
        pod_reader = pd.read_csv(csv_file, usecols=['NetworkP90(ms)'])
        # print("pod", pod_name, " not found in trace, return default ",
        #       float(pod_reader.iloc[0]))
        return float(pod_reader.iloc[0]), 0
    if len(pod_spans['SpanID']) > 0:
        # process each span independently, ordered by timestamp
        for span_index in range(len(pod_spans['SpanID'])):
            # span event
            parent_id = pod_spans['ParentID'].iloc[span_index]
            pod_start_time = int(
                pod_spans['EndTimeUnixNano'].iloc[span_index])
            try:
                parent_pod_span = parent_span_reader.loc[[
                    parent_id], ['PodName', 'EndTimeUnixNano']]
                if len(parent_pod_span) > 0:
                    for parent_span_index in range(len(parent_pod_span['PodName'])):
                        parent_pod_name = parent_pod_span['PodName'].iloc[parent_span_index]
                        parent_end_time = int(
                            parent_pod_span['EndTimeUnixNano'].iloc[parent_span_index])
                        if str(parent_pod_name) != str(pod_name):
                            latency = (parent_end_time - pod_start_time) / \
                                1000000  # convert from nanoseconds to milliseconds
                            # if "contacts-service" in pod_name:
                            #     logger.info("%s, %s, %s, %s, %s" % (
                            #         pod_name, pod_spans['SpanID'].iloc[span_index], parent_pod_name, pod_spans['ParentID'].iloc[span_index], latency))
                            latency_list.append(latency)
            except KeyError:
                pass
    # logger.info("%s latency is %s" % (pod_name, np.percentile(latency_list, 90)))
    if len(latency_list) > 2:
        return np.percentile(latency_list, 90), statistics.stdev(latency_list)
    else:
        return 10, 10
def determine_alarm(pod, metric_type, metric_value, std_num, ns):
    """
    func determine_alarm: determine whether a metric value should raise an alarm
    (static thresholds; a 3-sigma variant is kept below as a comment)
    :parameter
        pod - pod name, used to find the corresponding metric threshold file
        metric_type - which metric column to check
        metric_value - value to compare against the threshold
        std_num - controls the band width (std_num * std) in the 3-sigma variant
        ns - namespace ("hipster" or "ts"), selects the network latency threshold
    :return
        True - alarm
        False - no alarm
    """
    # used only by the commented-out 3-sigma variant below
    path_list = os.listdir(metric_threshold_dir)
    if metric_type == "CpuUsageRate(%)" or metric_type == 'MemoryUsageRate(%)':
        if metric_value > 80:
            return True
    else:
        if ns == "hipster":
            # for hipster
            if metric_value > 200:
                return True
        elif ns == "ts":
            # for ts
            if metric_value > 300:
                return True
    return False
    # 3-sigma variant against the recorded thresholds (currently unused):
    # for path in path_list:
    #     if re.search(path.split('.')[0], pod):
    #         history_metric = pd.read_csv(os.path.join(
    #             metric_threshold_dir, path), index_col=False, usecols=[metric_type])
    #         if metric_value > history_metric[metric_type][0] + std_num * history_metric[metric_type][1]:
    #             return True
    #         # elif metric_value < history_metric[metric_type][0] - std_num * history_metric[metric_type][1]:
    #         #     return True
    #         else:
    #             return False
def generate_alarm(metric_list, ns, std_num=6):
    """
    func generate_alarm: generate alarms for each pod at the current minute
    :parameter
        metric_list - metric list from get_metric_with_time
    :return
        alarm_list, e.g., [{'pod': 'cartservice-579f59597d-n69b4', 'alarm': [{'metric_type': 'CpuUsageRate(%)', 'alarm_flag': True}]}]
        [{
            pod:
            alarm: [
                {
                    metric_type: CpuUsageRate(%)
                    alarm_flag: True
                }
            ]
        }]
    """
    alarm_list = []
    for pod_metric in metric_list:
        alarm = {}
        for i in range(len(pod_metric['metrics'])):
            alarm_flag = determine_alarm(pod=pod_metric["pod"], metric_type=pod_metric['metrics'][i]["metric_type"],
                                         metric_value=pod_metric['metrics'][i]["metric_value"], std_num=std_num, ns=ns)
            if alarm_flag:
                # create the map lazily on the first alarm for this pod
                if "pod" not in alarm:
                    alarm = {"pod": pod_metric["pod"], "alarm": []}
                alarm['alarm'].append(
                    {"metric_type": pod_metric['metrics'][i]["metric_type"], "alarm_flag": alarm_flag})
        if "pod" in alarm:
            alarm_list.append(alarm)
    return alarm_list
def get_metric_with_time(time, base_dir):
    """
    func get_metric_with_time: get the metric list at the given minute
    :parameter
        time - timestamp to match by regex, e.g., "2022-04-18 13:00"
        base_dir - data dir containing <date>/metric and <date>/trace
    :return
        target_list - target metrics
        [
            {
                pod:
                metrics: [
                    {
                        "metric_type":
                        "metric_value":
                    }
                ]
            }
        ]
    """
    date = time.split(' ')[0]
    hour_min = time.split(' ')[1]
    hour = hour_min.split(':')[0]
    minute = hour_min.split(':')[1]
    trace_file = base_dir + "/" + date + "/trace/" + hour + "_" + minute + "_trace.csv"
    metric_dir = base_dir + "/" + date + "/metric/"
    path_list = os.listdir(metric_dir)
    # metric_list = ['CpuUsageRate(%)', 'MemoryUsageRate(%)', 'SyscallRead',
    #                'SyscallWrite']
    metric_list = ['CpuUsageRate(%)', 'MemoryUsageRate(%)']
    target_list = []
    for path in path_list:
        if "metric" in path:
            metrics = pd.read_csv(os.path.join(metric_dir, path))
            # metrics = pd.read_csv(os.path.join(product_metric_dir, path), index_col=False, usecols=['TimeStamp', 'PodName', 'CpuUsageRate(%)', 'MemoryUsageRate(%)', 'SyscallRead', 'SyscallWrite', 'PodServerLatencyP90(s)', 'PodClientLatencyP90(s)'])
            for index in range(len(metrics['Time'])):
                # regex timestamp
                if re.search(time, metrics['Time'][index]):
                    target_metric = {
                        "pod": metrics['PodName'][index], "metrics": []}
                    for metric in metric_list:
                        target_metric["metrics"].append({
                            "metric_type": metric, "metric_value": metrics[metric][index]})
                    network_p90, _ = get_netwrok_metric(
                        trace_file=trace_file, pod_name=metrics['PodName'][index])
                    target_metric["metrics"].append(
                        {"metric_type": "NetworkP90(ms)", "metric_value": network_p90})
                    target_list.append(target_metric)
    # print(target_list)
    return target_list
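

# Hypothetical usage sketch (not part of the original pipeline; in practice
# main.py drives these functions). It assumes the rca_data layout described in
# the README and a minute that actually appears in the metric CSVs.
if __name__ == "__main__":
    metric_list = get_metric_with_time("2022-08-22 03:00", "rca_data")
    alarm_list = generate_alarm(metric_list, ns="hipster")
    logger.info("alarms at 2022-08-22 03:00: %s" % alarm_list)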
Source,Target,Weight
checkoutservice,emailservice,1
frontend,checkoutservice,1
frontend,recommendationservice,1
checkoutservice,cartservice,1
frontend,cartservice,1
checkoutservice,productcatalogservice,1
checkoutservice,shippingservice,1
frontend,productcatalogservice,1
frontend,adservice,1
checkoutservice,currencyservice,1
frontend,currencyservice,1
checkoutservice,paymentservice,1
frontend,shippingservice,1
recommendationservice,productcatalogservice,1
adservice,33.33.33.76,1
cartservice,33.33.33.80,1
checkoutservice,33.33.33.227,1
currencyservice,33.33.33.227,1
emailservice,33.33.33.116,1
frontend,33.33.33.167,1
loadgenerator,33.33.33.79,1
loadgenerator,33.33.33.60,1
paymentservice,33.33.33.115,1
productcatalogservice,33.33.33.115,1
recommendationservice,33.33.33.155,1
shippingservice,33.33.33.82,1