Hands-on | LDBC Data Import and nGQL Practice

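Overview

I recently imported the LDBC dataset into a single-node NebulaGraph cluster that I had set up myself, and tried writing a few of the basic LDBC SNB queries (Short Reads) in nGQL.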

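Data import

The Nebula bench repo wraps the process of generating LDBC data and importing it into Nebula in Python. Following the steps in its documentation is basically all it takes.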

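A few small pitfalls I ran into:

  • After running python3 run.py importer, the generated yaml sets the space's replica count to 3 by default, which does not work on my single-node cluster. Either change the py code yourself, or do only a dry run and then run nebula importer yourself to load the data. I wisely chose the latter;
  • In the imported data, some vertices of completely different types share the same vid, for example person and organization, which feels a bit odd later when running nGQL. The Nebula bench documentation mentions this as well: since it does not affect benchmarking, it was left unhandled. Fine.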
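
For the first pitfall, patching the generated config text is one more option. Below is a minimal sketch, assuming the generated yaml embeds a CREATE SPACE statement that contains `replica_factor = 3`. I have not checked the exact layout Nebula bench emits, so the `sample` string and the `force_single_replica` helper are hypothetical and only illustrate the idea:

```python
import re


def force_single_replica(config_text: str) -> str:
    """Rewrite every `replica_factor = N` (or `replica_factor: N`) to 1."""
    # Keep the key and separator (group 1), replace only the number.
    return re.sub(r"(replica_factor\s*[=:]\s*)\d+", r"\g<1>1", config_text)


# Hypothetical excerpt; the real generated file may look different.
sample = (
    "CREATE SPACE IF NOT EXISTS ldbc1"
    "(partition_num = 24, replica_factor = 3, vid_type = int64);"
)
patched = force_single_replica(sample)
print(patched)
```

On a real file, one would read the yaml, pass its text through the helper, and write it back before running nebula importer by hand.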

Once the import succeeds, admire the SHOW STATS output:

(root@nebula) [ldbc1]> show stats;
+---------+------------------+----------+
| Type    | Name             | Count    |
+---------+------------------+----------+
| "Tag"   | "Comment"        | 2052169  |
+---------+------------------+----------+
| "Tag"   | "Forum"          | 90492    |
+---------+------------------+----------+
| "Tag"   | "Organisation"   | 7955     |
+---------+------------------+----------+
| "Tag"   | "Person"         | 9892     |
+---------+------------------+----------+
| "Tag"   | "Place"          | 1460     |
+---------+------------------+----------+
| "Tag"   | "Post"           | 1003605  |
+---------+------------------+----------+
| "Tag"   | "Tag"            | 16080    |
+---------+------------------+----------+
| "Tag"   | "Tagclass"       | 71       |
+---------+------------------+----------+
| "Edge"  | "CONTAINER_OF"   | 1003605  |
+---------+------------------+----------+
| "Edge"  | "HAS_CREATOR"    | 3055774  |
+---------+------------------+----------+
| "Edge"  | "HAS_INTEREST"   | 229166   |
+---------+------------------+----------+
| "Edge"  | "HAS_MEMBER"     | 1611869  |
+---------+------------------+----------+
| "Edge"  | "HAS_MODERATOR"  | 90492    |
+---------+------------------+----------+
| "Edge"  | "HAS_TAG"        | 3721409  |
+---------+------------------+----------+
| "Edge"  | "HAS_TYPE"       | 16080    |
+---------+------------------+----------+
| "Edge"  | "IS_LOCATED_IN"  | 3073620  |
+---------+------------------+----------+
| "Edge"  | "IS_PART_OF"     | 1454     |
+---------+------------------+----------+
| "Edge"  | "IS_SUBCLASS_OF" | 70       |
+---------+------------------+----------+
| "Edge"  | "KNOWS"          | 180623   |
+---------+------------------+----------+
| "Edge"  | "LIKES"          | 2190095  |
+---------+------------------+----------+
| "Edge"  | "REPLY_OF"       | 2052169  |
+---------+------------------+----------+
| "Edge"  | "STUDY_AT"       | 7949     |
+---------+------------------+----------+
| "Edge"  | "WORK_AT"        | 21654    |
+---------+------------------+----------+
| "Space" | "vertices"       | 3165488  |
+---------+------------------+----------+
| "Space" | "edges"          | 17256029 |
+---------+------------------+----------+
Got 25 rows (time spent 1344/16017 us)
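
The shared vids mentioned above also seem to show up in these numbers. Here is a small arithmetic check, assuming the Space-level "vertices" row counts each vid once regardless of how many tags it carries (my reading, not something the output itself states). The counts are copied from the table above:

```python
# Per-tag counts copied from the SHOW STATS output above.
tag_counts = {
    "Comment": 2052169,
    "Forum": 90492,
    "Organisation": 7955,
    "Person": 9892,
    "Place": 1460,
    "Post": 1003605,
    "Tag": 16080,
    "Tagclass": 71,
}
total_vertices = 3165488  # the "vertices" row of the same output

tag_instances = sum(tag_counts.values())
shared = tag_instances - total_vertices

print(tag_instances)  # 3181724
print(shared)         # 16236
```

Under that assumption, 16236 tag instances sit on a vid that another tag also uses. The edge rows, by contrast, add up exactly to the reported 17256029.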

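nGQL queries

Below I try to solve some of the more basic query scenarios in the LDBC SNB Interactive workload, the Short Reads. See the spec for the detailed requirements of each scenario.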

Short Reads #1 - Profile of a person

match (v1:Person)-[:IS_LOCATED_IN]->(v2:Place) where id(v1)==$person_id
return v1.firstName, v1.lastName, v1.birthday, v1.locationIP, v1.browserUsed, id(v2), v1.gender, v1.creationDate

Short Reads #2 - Recent messages of a person

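Getting from a comment to its post needs an unbounded number of hops. Nebula does not support that yet, so the only option is to specify a sufficiently large upper bound. I arbitrarily picked 5.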

match(p1:Person)<-[:HAS_CREATOR]-(m:`Comment`)-[:REPLY_OF*..5]->(p:Post)-[:HAS_CREATOR]->(p2:Person) 
where id(p1)==$person_id return id(m) as messageId, 
(case m.content is null when false then m.content when true then m.imageFile end) as content,
id(p),id(p2),p2.firstName,p2.lastName,
m.creationDate as creationDate order by creationDate desc, messageId desc limit 10;
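
To see what the hop cap costs, here is a toy sketch in plain Python (not nGQL) that walks a reply chain with a bounded number of hops. The `root_post` helper and the data are made up for illustration. The point is only that a chain deeper than the bound is silently missed:

```python
from typing import Dict, Optional, Set


def root_post(message_id: str, reply_of: Dict[str, str],
              posts: Set[str], max_hops: int) -> Optional[str]:
    """Follow REPLY_OF-style links for at most max_hops hops."""
    current = message_id
    for _ in range(max_hops):
        parent = reply_of.get(current)
        if parent is None:
            return None          # dangling chain, nothing to return
        if parent in posts:
            return parent        # reached the original post
        current = parent
    return None                  # bound exhausted before reaching a post


# Toy chain: c7 -> c6 -> ... -> c1 -> post
reply_of = {f"c{i}": f"c{i - 1}" for i in range(2, 8)}
reply_of["c1"] = "post"
posts = {"post"}

print(root_post("c3", reply_of, posts, 5))  # post (3 hops)
print(root_post("c7", reply_of, posts, 5))  # None (needs 7 hops)
print(root_post("c7", reply_of, posts, 7))  # post
```

So the bound has to be at least as large as the deepest reply chain in the dataset, otherwise rows are dropped without any error.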

Short Reads #3 - Friends of a person

match (p1:Person)-[k:KNOWS]-(p2:Person) where id(p1)==$person_id 
return id(p2) as friendId,p2.firstName,p2.lastName,k.creationDate as creationDate 
order by creationDate desc, friendId;

Short Reads #4 - Content of a message

Finally, no MATCH needed. This simple query is handled directly with FETCH.

fetch prop on Post $message_id 
yield Post.creationDate, Post.content, Post.imageFile

Short Reads #5 - Creator of a message

No MATCH needed here either; GO does the job.

go from 6605817 over HAS_CREATOR yield HAS_CREATOR._dst as personId, $$.Person.firstName, $$.Person.lastName;

Short Reads #6 - Forum of a message

GO again. This one also runs into the unbounded-hops problem, which GO does not support either, so I set the maximum number of hops to 5.

go 0 to 5 steps from $message_id over REPLY_OF yield REPLY_OF._dst as postId 
| go from $-.postId over CONTAINER_OF REVERSELY yield CONTAINER_OF._dst as forumId, $$.Forum.title as title
| go from $-.forumId over HAS_MODERATOR yield $-.forumId, $-.title, HAS_MODERATOR._dst as moderatorId, $$.Person.firstName, $$.Person.lastName

Short Reads #7 - Replies of a message

As far as I can tell, this scenario needs openCypher's OPTIONAL MATCH, which Nebula does not support yet. Hopefully a later version adds it.
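
Until then, one possible workaround is to run two plain queries and combine them on the client side. The sketch below is not nGQL and not something I ran against Nebula; it only illustrates the left-join behaviour that OPTIONAL MATCH would provide. It assumes, as I read the spec, that each reply row carries a flag saying whether the reply's author knows the original message's author. `replies_with_knows_flag` and the sample data are hypothetical:

```python
from typing import List, Set, Tuple


def replies_with_knows_flag(
    replies: List[Tuple[str, int]],
    author_id: int,
    knows_pairs: Set[Tuple[int, int]],
) -> List[Tuple[str, int, bool]]:
    """Keep every reply; add True/False instead of dropping non-matches."""
    rows = []
    for comment_id, reply_author_id in replies:
        # KNOWS is treated as undirected, so check both orientations.
        knows = ((author_id, reply_author_id) in knows_pairs
                 or (reply_author_id, author_id) in knows_pairs)
        rows.append((comment_id, reply_author_id, knows))
    return rows


# Result of a hypothetical "replies of message" query: (commentId, authorId)
replies = [("c1", 101), ("c2", 102), ("c3", 100)]
# Result of a hypothetical "friends of author 100" query.
knows_pairs = {(100, 101)}

for row in replies_with_knows_flag(replies, 100, knows_pairs):
    print(row)
```

A plain MATCH that includes the KNOWS edge would return only c1 here; the rows with False are exactly what the optional part is for.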

the end.
