在上一篇[1]中,我们通过 NetworkX 和 Gephi 展示了<权力的游戏>中的人物关系。在本篇中,我们将展示如何通过 NetworkX 访问图数据库 Nebula Graph。
NetworkX [2] 是一个用 Python 语言开发的图论与复杂网络建模工具,内置了大量常用的图与复杂网络分析算法,可以方便地进行复杂网络数据分析、仿真建模等工作,功能丰富,简单易用。
在 NetworkX 中,图是由顶点、边和可选的属性构成的数据结构。顶点表示数据,边是由两个顶点唯一确定的,表示两个顶点之间的关系。顶点和边也可以拥有更多的属性,以存储更多的信息。
NetworkX 支持 4 种类型的图:
在 NetworkX 中创建一个无向图:
import networkx as nx
G = nx.Graph()
添加顶点:
G.add_node(1)
G.add_nodes_from([2,3,4])
G.add_node(2,name='Tom',age=23)
添加边:
G.add_edge(2,3)
G.add_edges_from([(1,2),(1,3)])
g.add_edge(1, 2, start_year=1996, end_year=2019)
在上一篇文章(一)中,我们已经演示了 NetworkX 的 Girvan-Newman 社区发现算法。
NetworkX 通常使用本地文件作为数据源,这在静态网络研究的时候没什么问题,但如果图网络经常会发生变化——例如某些中心节点已经不存在(Fig.1)或者引入了重要的网络拓扑变化(Fig.2)——每次生成全新的静态文件再加载分析就有些麻烦,最好整个变化过程可以持久化在一个数据库中,并且可以实时地直接从数据库中加载子图或者全图做分析。本文选用 Nebula Graph [3]作为存储图数据的图数据库。
Nebula Graph 提供了两种方式来获取图结构:
第一种方式适合在一个大规模的图网络中通过精细的过滤和剪枝条件来获取符合需求的若干个点和边。第二种方式更适合于全图的分析,这通常是在项目前期对全图进行一些启发式探索,当有进一步认知后再用第一种方式做精细的剪枝分析。
分析完 Nebula Graph 两种获取图结构方式后,下面来查看 Nebula Graph 的 Python 客户端代码,nebula-python/nebula/ngStorage/StorageClient.py 与 nebula-python/nebula/ngMeta/MetaClient.py 就是和底层存储交互的 API, 里面有扫描点、扫描边、读取一堆属性等等一系列丰富的接口。
下面两个接口可以用来读取所有的点、边数据:
def scan_vertex(self, space, return_cols, all_cols, limit, start_time, end_time)
def scan_edge(self, space, return_cols, all_cols, limit, start_time, end_time)
meta_client = MetaClient([('192.168.8.16', 45500)])
meta_client.connect()
storage_client = StorageClient(meta_client)
scan_edge_processor = ScanEdgeProcessor(meta_client)
space_name = 'nba' # 要读取的图空间名称
return_cols = {} # 要返回的边(或点)及其属性列
return_cols['serve'] = ['start_year', 'end_year']
return_cols['follow'] = ['degree']
allCols = False # 是否返回所有属性列,当该值为 False 时,仅返回在 returnCols 里指定的属性列,当为 True 时,返回所有属性列
limit = 100 # 最多返回的数据条数
start_time = 0
end_time = sys.maxsize
scan_edge_response_iterator = storage_client.scan_edge(space_name, return_cols, all_cols, limit, start_time, end_time)
while scan_edge_response_iterator.has_next():
scan_edge_response = scan_edge_response_iterator.next()
if scan_edge_response is None:
print("Error occurs while scaning edge")
break
process_edge(space, scan_edge_response)
其中,process_edge 是自定义的一个处理读出来边数据的函数,该函数可以先使用 scan_edge_processor 对 scan_edge_response 中的数据进行解码,解码后的数据可以直接打印出来,也可以做一些简单处理,另作他用,比如:将这些数据读入计算框架 NetworkX 里。
def process_edge(space, scan_edge_response):
result = scan_edge_processor.process(space, scan_edge_response)
# Get the corresponding rows by edge_name
for edge_name, edge_rows in result.rows.items():
for row in edge_rows:
srcId = row.default_properties[0].get_value()
dstId = row.default_properties[2].get_value()
print('%d -> %d' % (srcId, dstId))
props = {}
for prop in row.properties:
prop_name = prop.get_name()
prop_value = prop.get_value()
props[prop_name] = prop_value
G.add_edges_from([(srcId, dstId, props)]) # 添加边到 NetworkX 中的图 G
读取顶点数据的方法和上面的流程类似。
此外,对于分布式的一些图计算框架[4]来说,Nebula Graph 还提供了根据分片 (partition) 并发地批量读取存储的功能,这会在之后的文章中演示。
当我们把所有点和边数据都按照上述流程读入 NetworkX 后,我们还可以做一些基本的图分析和图计算:
nx.draw(G, with_labels=True, font_weight='bold')
import matplotlib.pyplot as plt
plt.show()
plt.savefig('./test.png')
绘制出来的图:
print('nodes: ', list(G.nodes))
print('edges: ', list(G.edges))
输出的结果:
nodes: [109, 119, 129, 139, 149, 209, 219, 229, 108, 118, 128, 138, 148, 208, 218, 228, 107, 117, 127, 137, 147, 207, 217, 227, 106, 116, 126, 136, 146, 206, 216, 226, 101, 111, 121, 131, 141, 201, 211, 221, 100, 110, 120, 130, 140, 150, 200, 210, 220, 102, 112, 122, 132, 142, 202, 212, 222, 103, 113, 123, 133, 143, 203, 213, 223, 104, 114, 124, 134, 144, 204, 214, 224, 105, 115, 125, 135, 145, 205, 215, 225]
edges: [(109, 100), (109, 125), (109, 204), (109, 219), (109, 222), (119, 200), (119, 205), (119, 113), (129, 116), (129, 121), (129, 128), (129, 216), (129, 221), (129, 229), (129, 137), (139, 138), (139, 212), (139, 218), (149, 130), (149, 219), (209, 123), (219, 130), (219, 112), (219, 104), (229, 147), (229, 116), (229, 141), (229, 144), (108, 100), (108, 101), (108, 204), (108, 206), (108, 214), (108, 215), (108, 222), (118, 120), (118, 131), (118, 205), (118, 113), (128, 116), (128, 121), (128, 201), (128, 202), (128, 205), (128, 223), (138, 115), (138, 204), (138, 210), (138, 212), (138, 221), (138, 225), (148, 127), (148, 136), (148, 137), (148, 214), (148, 223), (148, 227), (148, 213), (208, 127), (208, 103), (208, 104), (208, 124), (218, 127), (218, 110), (218, 103), (218, 104), (218, 114), (218, 105), (228, 146), (228, 145), (107, 100), (107, 204), (107, 217), (107, 224), (117, 200), (117, 136), (117, 142), (127, 114), (127, 212), (127, 213), (127, 214), (127, 222), (127, 226), (127, 227), (137, 136), (137, 213), (137, 150), (147, 136), (147, 214), (147, 223), (207, 121), (207, 140), (207, 122), (207, 134), (217, 126), (217, 141), (217, 124), (217, 144), (106, 204), (106, 212), (106, 113), (116, 141), (116, 126), (116, 210), (116, 216), (116, 121), (116, 113), (116, 105), (126, 216), (136, 210), (136, 213), (136, 214), (146, 202), (146, 210), (146, 215), (146, 222), (146, 226), (206, 123), (216, 144), (216, 105), (226, 140), (226, 112), (226, 114), (226, 144), (101, 100), (101, 102), (101, 125), (101, 204), (101, 215), (101, 113), (101, 104), (111, 200), (111, 204), (111, 215), (111, 220), (121, 202), (121, 215), (121, 113), (121, 134), (131, 205), (131, 220), (141, 124), (141, 205), (141, 225), (201, 145), (211, 124), (221, 104), (221, 124), (100, 125), (100, 204), (100, 102), (100, 113), (100, 104), (100, 144), (100, 105), (110, 204), (110, 220), (120, 150), (120, 202), (120, 205), (120, 113), (140, 114), (140, 214), (140, 224), (150, 143), (150, 213), (200, 142), (200, 104), (200, 145), (210, 124), (210, 144), (210, 115), (210, 145), (102, 203), (102, 204), (102, 103), (102, 135), (112, 204), (122, 213), (122, 223), (132, 225), (202, 133), (202, 114), (212, 103), (222, 104), (103, 204), (103, 114), (113, 104), (113, 105), (113, 125), (113, 204), (133, 114), (133, 144), (143, 213), (143, 223), (203, 135), (213, 124), (213, 145), (104, 105), (104, 204), (104, 215), (114, 115), (114, 204), (134, 224), (144, 145), (144, 214), (204, 105), (204, 125)]
p1 = nx.shortest_path(G, source=114, target=211)
print('顶点 114 到顶点 211 的最短路径: ', p1)
输出的结果:
顶点 114 到顶点 211 的最短路径: [114, 127, 208, 124, 211]
print(nx.pagerank(G))
输出的结果:
{109: 0.011507076520104863, 119: 0.007835838669313514, 129: 0.015304593799331218, 139: 0.007772926737873626, 149: 0.0073896601012629825, 209: 0.0065558926178649985, 219: 0.014100908598251508, 229: 0.011454115940170253, 108: 0.01645334474680034, 118: 0.01010598371500564, 128: 0.01594717876199238, 138: 0.01671097227127263, 148: 0.015898676579503977, 208: 0.009437234075904938, 218: 0.0153795416919104, 228: 0.005900393773635255, 107: 0.009745182763645681, 117: 0.008716335675518244, 127: 0.021565565312365507, 137: 0.011642680498867146, 147: 0.009721031073465738, 207: 0.01040504770909835, 217: 0.012054472529765329, 227: 0.005615576255373405, 106: 0.007371191843767635, 116: 0.020955704443679106, 126: 0.007589432032220849, 136: 0.015987209357117116, 146: 0.013922108926721374, 206: 0.008554794629575304, 216: 0.011219193251536395, 226: 0.013613173390725904, 101: 0.016680863106330837, 111: 0.010121524312495604, 121: 0.017545503989576015, 131: 0.008531567756846938, 141: 0.014598319866130227, 201: 0.0058643663430632525, 211: 0.003936285336338021, 221: 0.009587911774927793, 100: 0.02243017302167168, 110: 0.007928429795381916, 120: 0.011875669801396205, 130: 0.0073896601012629825, 140: 0.01205992633948699, 150: 0.010045605782606326, 200: 0.015289870550944322, 210: 0.017716629501785937, 220: 0.008666577509181518, 102: 0.014865431161046641, 112: 0.007931095811770324, 122: 0.008087439927630492, 132: 0.004659566123187912, 142: 0.006487446038191551, 202: 0.013579313206377282, 212: 0.01190888044566142, 222: 0.011376739416933006, 103: 0.013438110749144392, 113: 0.02458154500563397, 123: 0.01104978432213578, 133: 0.00743370900670294, 143: 0.008011123394996112, 203: 0.006883198710237787, 213: 0.020392557117890422, 223: 0.012345866520333572, 104: 0.024902235588979776, 114: 0.019369722463816744, 124: 0.017165705442951484, 134: 0.008284361176173354, 144: 0.019363506469972095, 204: 0.03507634139024834, 214: 0.015500649025348538, 224: 0.008320315540621754, 105: 0.01439975542831122, 115: 0.007592722237637133, 125: 0.010808523955754608, 135: 0.006883198710237788, 145: 0.014654713389044883, 205: 0.014660118545887803, 215: 0.01337467974572934, 225: 0.009909720748343093}
此外,也可以和上一篇中一样,接入 Gephi [5]来得到更好的图可视化效果。
本文的代码可以参见[6].
[1] https://nebula-graph.com.cn/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph/
[2] https://networkx.github.io/
[3] https://github.com/vesoft-inc/nebula
[4] https://spark.apache.org/graphx/
[6] https://github.com/vesoft-inc/nebula-python/pull/31
喜欢这篇文章?来来来,给我们的 GitHub 点个 star 表鼓励啦~~ 🙇♂️🙇♀️ [手动跪谢]
交流图数据库技术?交个朋友,Nebula Graph 官方小助手微信:NebulaGraphbot 拉你进交流群~~
作者有话说:Hi,我是王杰,是图数据 Nebula Graph 研发工程师,希望本次的经验分享能给大家带来帮助,如有不当之处也希望能帮忙纠正,谢谢~
这是一个专为移动设备优化的页面(即为了让你能够在 Google 搜索结果里秒开这个页面),如果你希望参与 V2EX 社区的讨论,你可以继续到 V2EX 上打开本讨论主题的完整版本。
V2EX 是创意工作者们的社区,是一个分享自己正在做的有趣事物、交流想法,可以遇见新朋友甚至新机会的地方。
V2EX is a community of developers, designers and creative people.