Performance Validation of Bulk Import with Nebula-Importer: A Summary

Originally published in the Blog section of the Nebula official site: https://discuss.nebula-graph.com.cn/t/topic/4159

1. Test Server Configuration

Hostname	Operating system	CPU architecture	CPU cores	Memory	Disk
hadoop10	CentOS 7.6	x86_64	32	128 GB	1.8 TB
hadoop11	CentOS 7.6	x86_64	32	64 GB	1 TB
hadoop12	CentOS 7.6	x86_64	16	64 GB	1 TB

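2. Nebula Cluster Environment

Operating system: CentOS 7.5+
Software environment that meets the official requirements, for example gcc 7.1.0+, cmake 3.5.0+, glibc 2.12+, plus the basic dependency packages: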

yum update
yum install -y make \
                 m4 \
                 git \
                 wget \
                 unzip \
                 xz \
                 readline-devel \
                 ncurses-devel \
                 zlib-devel \
                 gcc \
                 gcc-c++ \
                 cmake \
                 gettext \
                 curl \
                 redhat-lsb-core

Nebula version: v2.0.0
Storage backend: 3 nodes, RocksDB

Process \ Hostname	hadoop10	hadoop11	hadoop12
metad processes	1	1	1
storaged processes	1	1	1
graphd processes	1	1	1

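3. Data Preparation and Data Format

Vertex rows / file size	Edge rows / file size	Vertex + edge rows / file size
74,314,635 rows / 4.6 G	139,951,301 rows / 6.6 G	214,265,936 rows / 11.2 G

Notes:

edge.csv: 139,951,301 rows, about 140 million, 6.6 G
vertex.csv: 74,314,635 rows, about 74 million, 4.6 G
Edges and vertices combined: 214,265,936 rows, about 214 million, 11.2 G

[Screenshot: total data volume]

Vertex and edge data to import: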

[root@hadoop10 datas]# wc -l edge.csv 
139951301 edge.csv
[root@hadoop10 datas]# head -10 vertex.csv 
-201035082963479683,实体
-1779678833482502384,值
4646408208538057683,胶饴
-1861609733419239066,别名: 饴糖、畅糖、畅、软糖。
-2047289935702608120,词条
5842706712819643509,词条（拼音：cí tiáo）也叫词目，是辞书学用语，指收列的词语及其释文。
-3063129772935425027,文化
-2484942249444426630,红色食品
-3877061284769534378,红色食品是指食品为红色、橙红色或棕红色的食品。
-3402450096279275143,否
[root@hadoop10 datas]# wc -l vertex.csv 
74314635 vertex.csv
[root@hadoop10 datas]# head -10 edge.csv 
-201035082963479683,-1779678833482502384,属性
4646408208538057683,-1861609733419239066,描述
-2047289935702608120,5842706712819643509,描述
-2047289935702608120,-3063129772935425027,标签
-2484942249444426630,-3877061284769534378,描述
-2484942249444426630,-2484942249444426630,中文名
-2484942249444426630,-3402450096279275143,是否含防腐剂
-2484942249444426630,4786182067583989997,主要食用功效
-2484942249444426630,-8978611301755314833,适宜人群
-2484942249444426630,-382812815618074210,用途
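
The sample rows above suggest a simple layout: vertex.csv has two columns (a signed 64-bit integer ID and a name) and edge.csv has three (source ID, destination ID, relation name). A minimal sketch of a pre-import check under that assumption follows. The function name `check_rows` and the int64 range check are illustrative and are not part of Nebula Importer.

```python
import csv
import io

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def check_rows(lines, id_columns, total_columns):
    """Return (line_number, reason) for each row that breaks the assumed layout."""
    problems = []
    for line_no, row in enumerate(csv.reader(lines), start=1):
        if len(row) != total_columns:
            problems.append((line_no, f"expected {total_columns} columns, got {len(row)}"))
            continue
        # The first id_columns fields are assumed to be int64 IDs.
        for col in range(id_columns):
            try:
                value = int(row[col])
            except ValueError:
                problems.append((line_no, f"column {col} is not an integer"))
                break
            if not INT64_MIN <= value <= INT64_MAX:
                problems.append((line_no, f"column {col} is outside the int64 range"))
                break
    return problems


# Rows copied from the head output above.
vertex_sample = io.StringIO(
    "-201035082963479683,实体\n"
    "4646408208538057683,胶饴\n"
)
edge_sample = io.StringIO(
    "-201035082963479683,-1779678833482502384,属性\n"
    "4646408208538057683,-1861609733419239066,描述\n"
)
print(check_rows(vertex_sample, id_columns=1, total_columns=2))
print(check_rows(edge_sample, id_columns=2, total_columns=3))
```

For the real files, an open file handle can be passed in place of the `io.StringIO` samples. A row whose text field contains an unquoted comma would be reported as having too many columns, which is worth knowing before the importer sees it.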

4. Validation Approach

Approach: use the Nebula Importer bulk import tool

Writing the import YAML file

Write the YAML import file:

version: v1rc1
description: example
clientSettings:
  concurrency: 10 # number of graph clients
  channelBufferSize: 128
  space: test # must name an existing space; the console session in this article creates test2
  connection:
    user: user
    password: password
    address: 191.168.7.10:9669,191.168.7.11:9669,191.168.7.12:9669
logPath: ./err/test.log
files:
  - path: ./vertex.csv
    failDataPath: ./err/vertex.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: vertex
      vertex:
        tags:
          - name: entity
            props:
              - name: name
                type: string
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batchSize: 100
    type: csv
    csv:
      withHeader: false
      withLabel: false
    schema:
      type: edge
      edge:
        name: relation
        withRanking: false
        props:
          - name: name
            type: string
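
With `batchSize: 100`, rows are sent in groups of at most 100 per request. As a rough sketch of what that implies for this dataset, the following computes the number of batches per file from the row counts reported in the data section. The helper name `batches_needed` is illustrative, and the calculation assumes every batch except possibly the last is full.

```python
def batches_needed(rows, batch_size):
    """Number of requests needed when each request carries at most batch_size rows."""
    return -(-rows // batch_size)  # integer ceiling division


BATCH_SIZE = 100            # batchSize in the YAML above
VERTEX_ROWS = 74_314_635    # wc -l vertex.csv
EDGE_ROWS = 139_951_301     # wc -l edge.csv

vertex_batches = batches_needed(VERTEX_ROWS, BATCH_SIZE)
edge_batches = batches_needed(EDGE_ROWS, BATCH_SIZE)
print(vertex_batches, edge_batches, vertex_batches + edge_batches)
# prints: 743147 1399514 2142661
```

Roughly 2.1 million requests are needed at this batch size, so per-request latency has a direct effect on total import time.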

Creating the schema

Create the space, tag, and edge in Nebula Console:

# 1. Create the space
(admin@nebula) [(none)]> create space test2(vid_type = FIXED_STRING(64));
# 2. Switch to the space
(admin@nebula) [(none)]> use test2;
# 3. Create the tag
(admin@nebula) [test2]> create tag entity(name string);
# 4. Create the edge
(admin@nebula) [test2]> create edge relation(name string);
# 5. Show the tag structure
(admin@nebula) [test2]> describe tag entity;
+--------+----------+-------+---------+
| Field  | Type     | Null  | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" |         |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1002 us)
# 6. Show the edge structure
(admin@nebula) [test2]> describe edge relation;
+--------+----------+-------+---------+
| Field  | Type     | Null  | Default |
+--------+----------+-------+---------+
| "name" | "string" | "YES" |         |
+--------+----------+-------+---------+
Got 1 rows (time spent 703/1041 us)

Build

Build nebula-importer and run it from the shell:

# Build the nebula-importer binary
make build
# Run the importer, passing the YAML configuration file
/opt/software/nebulagraph/nebula-importer/nebula-importer --config /opt/software/datas/rdf-import2.yaml

Result output

Output:

# Excerpt from the import log
2021/04/19 19:05:55 [INFO] statsmgr.go:61: Tick: Time(2400.00s), Finished(210207018), Failed(0), Latency AVG(32441us), Batches Req AVG(33824us), Rows AVG(87586.25/s)
2021/04/19 19:06:00 [INFO] statsmgr.go:61: Tick: Time(2405.00s), Finished(210541418), Failed(0), Latency AVG(32461us), Batches Req AVG(33844us), Rows AVG(87543.20/s)
2021/04/19 19:06:05 [INFO] statsmgr.go:61: Tick: Time(2410.00s), Finished(210901218), Failed(0), Latency AVG(32475us), Batches Req AVG(33857us), Rows AVG(87510.88/s)
2021/04/19 19:06:10 [INFO] statsmgr.go:61: Tick: Time(2415.00s), Finished(211270318), Failed(0), Latency AVG(32486us), Batches Req AVG(33869us), Rows AVG(87482.50/s)
2021/04/19 19:06:15 [INFO] statsmgr.go:61: Tick: Time(2420.00s), Finished(211685318), Failed(0), Latency AVG(32490us), Batches Req AVG(33873us), Rows AVG(87473.27/s)
2021/04/19 19:06:20 [INFO] statsmgr.go:61: Tick: Time(2425.00s), Finished(211959718), Failed(0), Latency AVG(32517us), Batches Req AVG(33900us), Rows AVG(87406.07/s)
2021/04/19 19:06:25 [INFO] statsmgr.go:61: Tick: Time(2430.00s), Finished(212220818), Failed(0), Latency AVG(32545us), Batches Req AVG(33928us), Rows AVG(87333.67/s)
2021/04/19 19:06:30 [INFO] statsmgr.go:61: Tick: Time(2435.00s), Finished(212433518), Failed(0), Latency AVG(32579us), Batches Req AVG(33963us), Rows AVG(87241.69/s)
2021/04/19 19:06:35 [INFO] statsmgr.go:61: Tick: Time(2440.00s), Finished(212780818), Failed(0), Latency AVG(32593us), Batches Req AVG(33977us), Rows AVG(87205.25/s)
2021/04/19 19:06:40 [INFO] statsmgr.go:61: Tick: Time(2445.01s), Finished(213240518), Failed(0), Latency AVG(32589us), Batches Req AVG(33973us), Rows AVG(87214.69/s)
2021/04/19 19:06:40 [INFO] reader.go:180: Total lines of file(/opt/software/datas/edge.csv) is: 139951301, error lines: 0
2021/04/19 19:06:42 [INFO] statsmgr.go:61: Done(/opt/software/datas/edge.csv): Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---

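Notes: the import finished 213,307,919 rows in 2,446.70 s (about 41 minutes) with 0 failed rows, averaging about 87,182 rows/s:

Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), Batches Req AVG(33968us), Rows AVG(87181.95/s)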
2021/04/19 19:06:42 Finish import data, consume time: 2447.20s
2021/04/19 19:06:43 --- END OF NEBULA IMPORTER ---
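
The throughput figure can be cross-checked from the log itself. The sketch below parses the final `Done` line shown above and recomputes rows per second as Finished divided by Time. The regular expression and the function name `parse_stats` are assumptions based only on the log format visible in this article, not on the importer's source code.

```python
import re

# The final summary line from the log excerpt above.
DONE_LINE = (
    "2021/04/19 19:06:42 [INFO] statsmgr.go:61: Done(/opt/software/datas/edge.csv): "
    "Time(2446.70s), Finished(213307919), Failed(0), Latency AVG(32585us), "
    "Batches Req AVG(33968us), Rows AVG(87181.95/s)"
)

STATS_PATTERN = re.compile(
    r"Time\((?P<seconds>[\d.]+)s\), Finished\((?P<finished>\d+)\), Failed\((?P<failed>\d+)\)"
    r".*Rows AVG\((?P<rows_per_s>[\d.]+)/s\)"
)


def parse_stats(line):
    """Extract elapsed time, row counts, and reported throughput from a stats line."""
    match = STATS_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a statsmgr summary")
    return {
        "seconds": float(match["seconds"]),
        "finished": int(match["finished"]),
        "failed": int(match["failed"]),
        "rows_per_s": float(match["rows_per_s"]),
    }


stats = parse_stats(DONE_LINE)
recomputed = stats["finished"] / stats["seconds"]
print(f"elapsed: {stats['seconds'] / 60:.1f} min")
print(f"reported: {stats['rows_per_s']:.2f} rows/s, recomputed: {recomputed:.2f} rows/s")
```

The recomputed value agrees with the reported `Rows AVG` to within one row per second, which suggests that figure is a whole-run average rather than an instantaneous rate.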

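Machine configuration

The import places high demands on machine specifications (CPU cores, memory, and disk space).

[Screenshots: node configuration of hadoop10, hadoop11, and hadoop12]

Configuration recommendations:

Comparing memory usage across the 3 machines shows that a dataset of more than 200 million rows consumes a large amount of memory, so provision as much memory as possible.
For CPU cores, disk space, and other requirements, refer to the official documentation: https://docs.nebula-graph.com.cn/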

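Statement testing

NebulaGraph natively supports nGQL and is compatible with openCypher. Note that nGQL does not yet support traversing all vertices and edges, for example MATCH (v) RETURN v; a MATCH statement needs at least one usable index. If you create an index when the relevant vertices, edges, or properties already exist, you must rebuild the index after creating it before it takes effect.

Cypher statements are supported:

# Test Cypher statements
# Import the ngql file
./nebula-console -addr 191.168.7.10 -port 9669 -u user -p password -t 120 -f /opt/software/datas/basketballplayer-2.X.ngql

[Screenshot: statement execution]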
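
The index requirement described above can be sketched as a short sequence of nGQL statements: create the index, rebuild it because data already exists, then run a MATCH that uses it. The helper below only builds the statement strings for the `entity` tag from this article. The index name `entity_name_index` and the prefix length of 64 are hypothetical choices for illustration, not values from the original test.

```python
def index_statements(tag, prop, index_name, prefix_length):
    """Build nGQL statements that create, then rebuild, a tag index on a string property."""
    return [
        f"CREATE TAG INDEX IF NOT EXISTS {index_name} ON {tag}({prop}({prefix_length}));",
        # Rebuilding is needed when vertices with this tag were inserted before the index existed.
        f"REBUILD TAG INDEX {index_name};",
    ]


def match_by_property(tag, prop, value):
    """Build a MATCH statement that filters on an indexed string property."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'MATCH (v:{tag} {{{prop}: "{escaped}"}}) RETURN v;'


# Hypothetical index name and prefix length.
statements = index_statements("entity", "name", "entity_name_index", 64)
statements.append(match_by_property("entity", "name", "胶饴"))
for statement in statements:
    print(statement)
```

The printed statements can be pasted into Nebula Console or saved to a file and passed with `-f`, as in the command above.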

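Summary

Validation with a large dataset on a 3-node NebulaGraph cluster shows that the bulk write performance of this approach can meet the requirements of production workloads. Going forward, however, large CSV datasets should be stored in a data warehouse on a Hadoop distributed storage platform and used as the data source. The specific tag and edge fields are then declared in the YAML configuration, and the tool handles the rest.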
