NebulaGraph Performance Monitoring

Purpose

The metrics and filtering granularity in Nebula Dashboard, the monitoring tool that ships with NebulaGraph, are already quite rich. Even so, for the following reasons, we built our own monitoring layer on top of it.

1. Metric Selection

The official stats endpoint exposes several hundred metrics. Across the graph, storage, and RocksDB layers, we filtered these down to sixty-odd key metrics that deserve attention.

In the end, we selected the following metrics:

(Figure: NebulaGraph performance monitoring — metric list)

2. Writing the Collection Code (Python)

Since an official endpoint exists, this part is just a matter of requesting it:


for nodeid, nodehost in graphlist:
    dict_result = self.get_nebula_stats(nodehost)
    # persist dict_result to the database


def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:           # keep only the selected metrics
                keyname = dict_graph[k][0]
                dict_result[keyname] = v
    return dict_result
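The `get_urldata` helper used above is not shown in the post; below is a minimal sketch, assuming the endpoint returns a JSON array of single-key objects (the shape NebulaGraph's `/stats?format=json` produces, e.g. `[{"num_queries.sum.60": 5}, ...]`). The filtering step is pulled out into a standalone `parse_stats` function for illustration:

```python
import json
from urllib.request import urlopen

def get_urldata(urlstr, timeout=5):
    """Fetch a stats endpoint and return the parsed JSON (a list of dicts)."""
    with urlopen(urlstr, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def parse_stats(mlist, metric_map):
    """Keep only the metrics listed in metric_map, renaming keys.

    metric_map maps a raw metric name to a (short_name, thresholds) tuple,
    like the dict_graph table in the next section.
    """
    result = {}
    for item in mlist:
        for k, v in item.items():
            if k in metric_map:
                result[metric_map[k][0]] = v
    return result
```

Splitting fetch and parse this way also makes the filtering logic testable without a live graphd process.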

3. Performance Grading

This is the crucial step. It reuses the database performance-rating model I built in 2005 (see: https://github.com/51ak/DatabaseRating/).

The weight-tagging logic starts from an interval list defined as follows:

dict_graph = {
    "num_active_queries.sum.60": ("num_active_queries", (-4, 5, 10, 50, 1000000)),
    "num_active_sessions.sum.60": ("num_active_sessions", (-4, 5, 10, 50, 1000000)),
    "num_opened_sessions.rate.60": ("num_opened_sessions", (-4, 100, 500, 10000, 10000000)),
    "num_queries.rate.60": ("num_queries_rate", (-4, 100, 500, 10000, 10000000)),
    "num_queries.sum.60": ("num_queries_sum", (-4, 5000, 50000, 900000, 100000000)),
    "num_sentences.rate.60": ("num_sentences_rate", (-4, 100, 500, 10000, 10000000)),
    "num_sentences.sum.60": ("num_sentences_sum", (-4, 5000, 50000, 900000, 100000000)),
    "query_latency_us.p95.60": ("query_latency_us_p95", (-4, 30000, 100000, 500000, 10000000000)),
    "query_latency_us.avg.60": ("query_latency_us_avg", (-4, 30000, 100000, 500000, 10000000000)),
    "num_queries_hit_memory_watermark.sum.60": ("num_queries_hit_memory_watermark", (-4, 100, 5000, 90000, 100000000)),
    "num_query_errors.rate.60": ("num_query_errors", (-4, 100, 5000, 90000, 100000000)),
    "num_slow_queries.sum.60": ("num_slow_queries", (-4, 50, 500, 5000, 100000000))
}

Each value is checked against its intervals: if it falls in the first interval it is graded "good", and so on through the worse grades. The collection code therefore becomes:


def get_nebula_stats(self, nodehost):
    dict_result = {}
    urlstr = 'http://%s/stats?format=json' % (nodehost,)
    mlist = self.get_urldata(urlstr)
    for dictc in mlist:
        for k, v in dictc.items():
            if k in dict_graph:
                keyname = dict_graph[k][0]
                keyvalueweightlist = dict_graph[k][1]
                if v < 0:
                    v = 0  # clamp negative (uncollected) values to 0
                value = float(v)
                valueweight = self.func_checkweight(value, keyvalueweightlist)
                dict_result[keyname] = v
                dict_result[f"{keyname}_weight"] = valueweight
    return dict_result
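`func_checkweight` itself is not shown in the post; the sketch below is one way to implement the interval logic described above, assuming each threshold tuple is ascending and the grade is returned as an integer (1 = good, higher = worse):

```python
def func_checkweight(value, weightlist):
    """Map a metric value to a grade using an ascending boundary list.

    With weightlist = (-4, 5, 10, 50, 1000000), a value in (-4, 5] is
    grade 1 (good), (5, 10] is grade 2, (10, 50] grade 3, and so on.
    """
    for i in range(len(weightlist) - 1):
        if weightlist[i] < value <= weightlist[i + 1]:
            return i + 1
    # Above the last boundary: assign the worst grade
    return len(weightlist) - 1
```

The lower bound of -4 ensures that values clamped to 0 still fall inside the first ("good") interval.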

4. Configuring a Report for Centralized Display

We added a report configuration on our dboop platform.

(Figure: NebulaGraph performance monitoring — backend configuration)

The configuration is as follows:



<page>

<!-- Custom table header; optional. Defaults to the column names. To customize widths or the header, write HTML. -->
  <tableheader><![CDATA[
  <tr style="height:25px;">
<th colspan="1" style="border-bottom:0"> </th>
<th colspan="4" style="border-bottom:0"><h3>Host</h3></th>
<th colspan="4" style="border-bottom:0"><h3>Network</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Active</h3></th>
<th colspan="3" style="border-bottom:0"><h3>Per Second</h3></th>
<th colspan="2" style="border-bottom:0"><h3>Latency</h3></th>
<th colspan="4" style="border-bottom:0"><h3>1-Minute Metrics</h3></th>
</tr>
  <tr style="height:25px;">
<th><h3>Graph Service</h3></th>

<th><h3>CPU</h3></th>
<th><h3>Memory</h3></th>
<th><h3>IOWait</h3></th>
<th><h3>Disk Space</h3></th>
<th><h3>Inbound</h3></th>
<th><h3>Outbound</h3></th>
<th><h3>TcpWait</h3></th>
<th><h3>TCP Conns</h3></th>

<th><h3>Queries</h3></th>
<th><h3>Sessions</h3></th>

<th><h3>QPS</h3></th>
<th><h3>Statements/s</h3></th>
<th><h3>Opened Sessions</h3></th>

<th><h3>P95 (ms)</h3></th>
<th><h3>Avg (ms)</h3></th>

<th><h3>Queries</h3></th>
<th><h3>Memory Watermark</h3></th>
<th><h3>Errors</h3></th>
<th><h3>Slow Queries</h3></th>
</tr>

 ]]>
  </tableheader>  
  

  <viewtype>table</viewtype>
   <connstr>link:cnpg3rzn2x********byenoa</connstr>
  <sqlstr><![CDATA[

select 
case when left(h.hosttype,1)='物' then 
concat(n.ipstr,' <span class="f_1">物</span>')
 else concat(n.ipstr,' <span  class="f_2">虚</span>')
 end as 'IP',
            format_dash(`cpu`,`cpu_weight`,'d203qtapmhv1ywbojc579lkrgx','cpu','host',a.hostid,'percent') as 'cpu',
            format_dash(`mempercent`,`mempercent_weight`,'d203qtapmhv1ywbojc579lkrgx','mempercent','host',a.hostid,'percent') as 'mempercent',
            format_dash(`iowait`,`iowait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','iowait','host',a.hostid,'') as 'iowait', 
case when `diskused_data` is null or `diskused_data`<0 then 
   format_dash(`diskused_root`,`diskused_root_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_root','host',a.hostid,'percent')
else 
   format_dash(`diskused_data`,`diskused_data_weight`,'d203qtapmhv1ywbojc579lkrgx','diskused_data','host',a.hostid,'percent')
end as 'diskused_data',
            format_dash(`network_in`,`network_in_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_in','host',a.hostid,'bytes') as 'network_in',
            format_dash(`network_out`,`network_out_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','network_out','host',a.hostid,'bytes') as 'network_out',
            format_dash(`tcp_wait`,`tcp_wait_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_wait','host',a.hostid,'floor') as 'tcp_wait',
            format_dash(`tcp_count`,`tcp_count_weight`,'dl4bzsk7pfg89v1uyoq2x0j65c','tcp_count','host',a.hostid,'floor') as 'tcp_count',
format_dash(`num_active_queries`,`num_active_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_queries','graph',b.graphid,'floor') as 'num_active_queries',
format_dash(`num_active_sessions`,`num_active_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_active_sessions','graph',b.graphid,'floor') as 'num_active_sessions',


format_dash(`num_queries_rate`,`num_queries_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_rate','graph',b.graphid,'floor') as 'num_queries_rate',
format_dash(`num_sentences_rate`,`num_sentences_rate_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_sentences_rate','graph',b.graphid,'floor') as 'num_sentences_rate',
format_dash(`num_opened_sessions`,`num_opened_sessions_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_opened_sessions','graph',b.graphid,'floor') as 'num_opened_sessions',

format_dash(`query_latency_us_p95`/1000,`query_latency_us_p95_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_p95','graph',b.graphid,'floor') as 'query_latency_us_p95',
format_dash(`query_latency_us_avg`/1000,`query_latency_us_avg_weight`,'day9kl2pm1wbts8c0xzhvfnueq','query_latency_us_avg','graph',b.graphid,'floor') as 'query_latency_us_avg',


format_dash(`num_queries_sum`,`num_queries_sum_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_sum','graph',b.graphid,'floor') as 'num_queries_sum',
format_dash(`num_queries_hit_memory_watermark`,`num_queries_hit_memory_watermark_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_queries_hit_memory_watermark','graph',b.graphid,'floor') as 'num_queries_hit_memory_watermark',
format_dash(`num_query_errors`,`num_query_errors_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_query_errors','graph',b.graphid,'floor') as 'num_query_errors',
format_dash(`num_slow_queries`,`num_slow_queries_weight`,'day9kl2pm1wbts8c0xzhvfnueq','num_slow_queries','graph',b.graphid,'floor') as 'num_slow_queries'
from  dboop.linknode n 
join dboop.linkhost h on n.hostid=h.hostid
left join wc_host a on a.hostid=n.hostid  and a._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
left join wc_nebula_graph b on n.instanceid=b.graphid  and b._timestamp>DATE_SUB(NOW(), INTERVAL 15 MINUTE)
where  h.usetype='nebula' and h.ontype=9 and n.clustertype like '%g%'
;



]]>
  
  </sqlstr> 
  
  
 <title2>Nebula Graph Performance</title2>

  <width>1250px</width>
</page>


Final Result

(Figure: NebulaGraph performance monitoring — the finished report)

What we got:

What it cost:


51ak

2022/06/23

Categories: NoSQL, Monitoring, Nebula, Dashboard Tags: Original, Featured

Follow my WeChat official account: 《数据库工作笔记》 (Database Work Notes)