欢迎阅读HugeGraph文档
Documentation
- 1: Introduction with HugeGraph
- 2: 下载 Apache HugeGraph (Incubating)
- 3: Quick Start
- 3.1: HugeGraph-Server Quick Start
- 3.2: HugeGraph-Loader Quick Start
- 3.3: HugeGraph-Hubble Quick Start
- 3.4: HugeGraph-AI Quick Start
- 3.5: HugeGraph-Client Quick Start
- 3.6: HugeGraph-Tools Quick Start
- 3.7: HugeGraph-Computer Quick Start
- 4: Config
- 4.1: HugeGraph 配置
- 4.2: HugeGraph 配置项
- 4.3: HugeGraph 内置用户权限与扩展权限配置及使用
- 4.4: 配置 HugeGraphServer 使用 https 协议
- 4.5: HugeGraph-Computer 配置
- 5: API
- 5.1: HugeGraph RESTful API
- 5.1.1: Schema API
- 5.1.2: PropertyKey API
- 5.1.3: VertexLabel API
- 5.1.4: EdgeLabel API
- 5.1.5: IndexLabel API
- 5.1.6: Rebuild API
- 5.1.7: Vertex API
- 5.1.8: Edge API
- 5.1.9: Traverser API
- 5.1.10: Rank API
- 5.1.11: Variable API
- 5.1.12: Graphs API
- 5.1.13: Task API
- 5.1.14: Gremlin API
- 5.1.15: Cypher API
- 5.1.16: Authentication API
- 5.1.17: Metrics API
- 5.1.18: Other API
- 5.2: HugeGraph Java Client
- 5.3: Gremlin-Console
- 6: GUIDES
- 6.1: HugeGraph Architecture Overview
- 6.2: HugeGraph Design Concepts
- 6.3: HugeGraph Plugin 机制及插件扩展流程
- 6.4: Backup Restore
- 6.5: FAQ
- 6.6: 报告安全问题
- 7: QUERY LANGUAGE
- 7.1: HugeGraph Gremlin
- 7.2: HugeGraph Examples
- 8: PERFORMANCE
- 8.1: HugeGraph BenchMark Performance
- 8.2: HugeGraph-API Performance
- 8.2.1: v0.5.6 Stand-alone(RocksDB)
- 8.2.2: v0.5.6 Cluster(Cassandra)
- 8.3: HugeGraph-Loader Performance
- 9: Contribution Guidelines
- 9.1: 如何参与 HugeGraph 社区
- 9.2: 订阅社区邮箱
- 9.3: 验证 Apache 发版
- 9.4: 在 IDEA 中配置 Server 开发环境
- 9.5: Apache HugeGraph Committer 指南
- 10: CHANGELOGS
- 10.1: HugeGraph 0.12 Release Notes
- 10.2: HugeGraph 1.0.0 Release Notes
- 10.3: HugeGraph 1.2.0 Release Notes
- 10.4: HugeGraph 1.3.0 Release Notes
- 10.5: HugeGraph 1.5.0 Release Notes
1 - Introduction with HugeGraph
Summary
Apache HugeGraph 是一款易用、高效、通用的开源图数据库系统(Graph Database,GitHub 项目地址),实现了 Apache TinkerPop3 框架,完全兼容 Gremlin 查询语言,具备完善的工具链组件,助力用户轻松构建基于图数据库之上的应用和产品。HugeGraph 支持百亿以上的顶点和边快速导入,提供毫秒级的关联关系查询能力(OLTP),并支持大规模分布式图分析(OLAP)。
HugeGraph 典型应用场景包括深度关系探索、关联分析、路径搜索、特征抽取、数据聚类、社区检测、知识图谱等,适用的业务领域如网络安全、电信反欺诈、金融风控、广告推荐、社交网络和智能机器人等。
本系统的主要应用场景是解决反欺诈、威胁情报、黑产打击等业务的图数据存储和建模分析需求,在此基础上逐步扩展及支持了更多的通用图应用。
Features
HugeGraph 支持在线及离线环境下的图操作,支持批量导入数据,支持高效的复杂关联关系分析,并且能够与大数据平台无缝集成。 HugeGraph 支持多用户并行操作,用户可输入 Gremlin 查询语句,并及时得到图查询结果,也可在用户程序中调用 HugeGraph API 进行图分析或查询。
本系统具备如下特点:
- 易用:HugeGraph 支持 Gremlin 图查询语言与 RESTful API,同时提供图检索常用接口,具备功能齐全的周边工具,轻松实现基于图的各种查询分析运算。
- 高效:HugeGraph 在图存储和图计算方面做了深度优化,提供多种批量导入工具,轻松完成百亿级数据快速导入,通过优化过的查询达到图检索的毫秒级响应。支持数千用户并发的在线实时操作。
- 通用:HugeGraph 支持 Apache Gremlin 标准图查询语言和 Property Graph 标准图建模方法,支持基于图的 OLTP 和 OLAP 方案。集成 Apache Hadoop 及 Apache Spark 大数据平台。
- 可扩展:支持分布式存储、数据多副本及横向扩容,内置多种后端存储引擎,也可插件式轻松扩展后端存储引擎。
- 开放:HugeGraph 代码开源(Apache 2 License),客户可自主修改定制,选择性回馈开源社区。
本系统的功能包括但不限于:
- 支持从多数据源批量导入数据 (包括本地文件、HDFS 文件、MySQL 数据库等数据源),支持多种文件格式导入 (包括 TXT、CSV、JSON 等格式)
- 具备可视化操作界面,可用于操作、分析及展示图,降低用户使用门槛
- 优化的图接口:最短路径 (Shortest Path)、K 步连通子图 (K-neighbor)、K 步到达邻接点 (K-out)、个性化推荐算法 PersonalRank 等
- 基于 Apache TinkerPop3 框架实现,支持 Gremlin 图查询语言
- 支持属性图,顶点和边均可添加属性,支持丰富的属性类型
- 具备独立的 Schema 元数据信息,拥有强大的图建模能力,方便第三方系统集成
- 支持多顶点 ID 策略:支持主键 ID、支持自动生成 ID、支持用户自定义字符串 ID、支持用户自定义数字 ID
- 可以对边和顶点的属性建立索引,支持精确查询、范围查询、全文检索
- 存储系统采用插件方式,支持 RocksDB(单机/集群)、Cassandra、ScyllaDB、HBase、MySQL、PostgreSQL、Palo 以及 Memory 等
- 与 HDFS、Spark/Flink、GraphX 等大数据系统集成,支持 BulkLoad 操作导入海量数据
- 支持高可用 HA、数据多副本、备份恢复、监控、分布式 Trace 等
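上面提到的属性图建模、ID 策略与索引能力,可以用一小段 schema 示例直观感受(仅为示意,完整可运行示例见后文 HugeGraph-Loader 章节):
// 定义属性、顶点/边类型与索引(groovy 形式的 schema 定义)
schema.propertyKey("name").asText().ifNotExist().create();
schema.vertexLabel("person").properties("name").primaryKeys("name").ifNotExist().create();
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").ifNotExist().create();
schema.indexLabel("personByName").onV("person").by("name").secondary().ifNotExist().create();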
Modules
- HugeGraph-Server: HugeGraph-Server 是 HugeGraph 项目的核心部分,包含 Core、Backend、API 等子模块;
- Core:图引擎实现,向下连接 Backend 模块,向上支持 API 模块;
- Backend:实现将图数据存储到后端,支持的后端包括:Memory、Cassandra、ScyllaDB、RocksDB、HBase、MySQL 及 PostgreSQL,用户根据实际情况选择一种即可;
- API:内置 REST Server,向用户提供 RESTful API,同时完全兼容 Gremlin 查询。(支持分布式存储和计算下推)
- HugeGraph-Toolchain: (工具链)
- HugeGraph-Client:HugeGraph-Client 提供了 RESTful API 的客户端,用于连接 HugeGraph-Server,目前仅实现 Java 版,其他语言用户可自行实现;
- HugeGraph-Loader:HugeGraph-Loader 是基于 HugeGraph-Client 的数据导入工具,将普通文本数据转化为图形的顶点和边并插入图形数据库中;
- HugeGraph-Hubble:HugeGraph-Hubble 是 HugeGraph 的 Web 可视化管理平台,一站式可视化分析平台,平台涵盖了从数据建模,到数据快速导入,再到数据的在线、离线分析、以及图的统一管理的全过程;
- HugeGraph-Tools:HugeGraph-Tools 是 HugeGraph 的部署和管理工具,包括管理图、备份/恢复、Gremlin 执行等功能。
- HugeGraph-Computer:HugeGraph-Computer 是分布式图处理系统 (OLAP),是 Pregel 模型的一个实现,可以运行在 Kubernetes/Yarn 等集群上,支持超大规模图计算。
- HugeGraph-AI:HugeGraph-AI 是 HugeGraph 独立的 AI 组件,提供了图神经网络的训练和推理功能,LLM/Graph RAG 结合/Python-Client 等相关组件,持续更新 ing。
Contact Us
- GitHub Issues: 使用途中出现问题或提供功能性建议,可通过此反馈 (推荐)
- 邮件反馈:dev@hugegraph.apache.org (邮箱订阅方式)
- SEC 反馈: security@hugegraph.apache.org (报告安全相关问题)
- 微信公众号:Apache HugeGraph, 欢迎扫描下方二维码加入我们!
2 - 下载 Apache HugeGraph (Incubating)
指南:
- 推荐使用最新版本的 HugeGraph 软件包, 运行时环境请选择 Java11
- 验证下载版本, 请使用相应的哈希 (SHA512)、签名和 项目签名验证 KEYS
- 检查哈希 (SHA512)、签名的说明在 版本验证 页面, 也可参考 ASF 验证说明
注: HugeGraph 所有组件版本号已保持一致, client/loader/hubble/common 等 maven 仓库版本号同理, 依赖引用可参考 maven 示例
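以 hugegraph-client 为例,Maven 依赖写法大致如下(版本号请以 Maven 中央仓库实际发布为准):
<dependency>
    <groupId>org.apache.hugegraph</groupId>
    <artifactId>hugegraph-client</artifactId>
    <version>1.5.0</version>
</dependency>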
最新版本 1.5.0
注: 从版本 1.5.0 开始, 需要 Java11 运行时环境
- Release Date: 2024-12-10
- Release Notes
二进制包
Server | Toolchain |
---|---|
[Binary] [Sign] [SHA512] | [Binary] [Sign] [SHA512] |
源码包
Please refer to build from source.
Server | Toolchain | AI | Computer |
---|---|---|---|
[Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] |
归档版本
注:
- 请大家尽早迁移到最新 Release 版本上, 社区将不再维护 1.0.0 前的旧版本 (非 ASF 版本)
- 1.3.0 是最后一个兼容 Java8 的主版本, 请尽早使用/迁移运行时为 Java11 (低版本 Java 有潜在更多的 SEC 风险和性能影响)
1.3.0
- Release Date: 2024-04-01
- Release Notes
二进制包
Server | Toolchain |
---|---|
[Binary] [Sign] [SHA512] | [Binary] [Sign] [SHA512] |
源码包
Please refer to build from source.
Server | Toolchain | AI | Common |
---|---|---|---|
[Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] |
1.2.0
- Release Date: 2023-12-28
- Release Notes
二进制包
Server | Toolchain |
---|---|
[Binary] [Sign] [SHA512] | [Binary] [Sign] [SHA512] |
源码包
Server | Toolchain | Computer | Common |
---|---|---|---|
[Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] |
1.0.0
- Release Date: 2023-02-22
- Release Notes
二进制包
Server | Toolchain | Computer |
---|---|---|
[Binary] [Sign] [SHA512] | [Binary] [Sign] [SHA512] | [Binary] [Sign] [SHA512] |
源码包
Server | Toolchain | Computer | Common |
---|---|---|---|
[Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] | [Source] [Sign] [SHA512] |
旧版本 (非 ASF 版本)
由于 ASF 规则要求, 不能直接在当前页面存放非 ASF 发行包, 对于 1.0.0 前旧版本 (非 ASF 版本) 的下载说明, 请跳转至 https://github.com/apache/incubator-hugegraph-doc/wiki/Apache-HugeGraph-(Incubating)-Old-Versions-Download
3 - Quick Start
3.1 - HugeGraph-Server Quick Start
1 HugeGraph-Server 概述
HugeGraph-Server 是 HugeGraph 项目的核心部分,包含 Core、Backend、API 等子模块。
Core 模块是 TinkerPop 接口的实现,Backend 模块用于管理数据存储,目前支持的后端包括:Memory、Cassandra、ScyllaDB、RocksDB、HBase、MySQL 及 PostgreSQL,API 模块提供 HTTP Server,将 Client 的 HTTP 请求转化为对 Core 的调用。
文档中会出现 HugeGraph-Server 及 HugeGraphServer 这两种写法,其他组件也类似。这两种写法含义上并无明显差异,可以这么区分:HugeGraph-Server 表示服务端相关组件代码,HugeGraphServer 表示服务进程。
2 依赖
2.1 安装 Java 11 (JDK 11)
请优先考虑在 Java 11 的环境上启动 HugeGraph-Server
(在 1.5.0 版前,会保留对 Java 8 的基本兼容)
在往下阅读之前先执行 java -version 命令确认 jdk 版本
注:使用 Java 8 启动 HugeGraph-Server 会失去一些安全性的保障,也会降低性能相关指标
我们推荐生产或对外网暴露访问的环境使用 Java 11 并考虑开启 Auth 权限认证。
3 部署
有四种方式可以部署 HugeGraph-Server 组件:
- 方式 1:使用 Docker 容器 (便于测试)
- 方式 2:下载 tar 包
- 方式 3:源码编译
- 方式 4:使用 tools 工具部署 (Outdated)
3.1 使用 Docker 容器 (便于测试)
可参考 Docker 部署方式。
我们可以使用 docker run -itd --name=server -p 8080:8080 hugegraph/hugegraph:1.5.0 快速启动一个内置了 RocksDB 的 HugeGraph Server。
可选项:
- 可以使用 docker exec -it server bash 进入容器完成一些操作
- 可以使用 docker run -itd --name=server -p 8080:8080 -e PRELOAD="true" hugegraph/hugegraph:1.5.0 在启动的时候预加载一个内置的样例图,可以通过 RESTful API 进行验证,具体步骤可以参考 5.1.1
- 可以使用 -e PASSWORD=123456 设置是否开启鉴权模式以及 admin 的密码,具体步骤可以参考 Config Authentication
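下面是一个把预加载与鉴权两个可选项组合起来的启动示例(密码 123456 与镜像 tag 仅为示意);开启鉴权后,可用 admin 账号通过 Basic Auth 简单验证接口是否可访问:
docker run -itd --name=server -p 8080:8080 -e PRELOAD=true -e PASSWORD=123456 hugegraph/hugegraph:1.5.0
# 等待服务就绪后,用 admin + 上面设置的密码做一次简单验证(示意)
curl -u admin:123456 "http://localhost:8080/graphs"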
如果使用 docker desktop,则可以按照如下的方式设置可选项:
另外,如果我们希望能够在一个文件中管理除了 server 之外的其他 HugeGraph 相关实例,也可以使用 docker-compose 完成部署,使用命令 docker-compose up -d(当然只配置 server 也是可以的),以下是一个样例的 docker-compose.yml:
version: '3'
services:
  server:
    image: hugegraph/hugegraph:1.5.0
    container_name: server
    # environment:
    #   - PRELOAD=true 为可选参数,为 True 时可以在启动的时候预加载一个内置的样例图
    #   - PASSWORD=123456 为可选参数,设置的时候可以开启鉴权模式,并设置密码
    ports:
      - 8080:8080
注意:
hugegraph 的 docker 镜像是一个便捷版本,用于快速启动 hugegraph,并不是官方发布物料包方式。你可以从 ASF Release Distribution Policy 中得到更多细节。
推荐使用 release tag (如 1.5.0/1.x.0) 以获取稳定版。使用 latest tag 可以使用开发中的最新功能。
3.2 下载 tar 包
# use the latest version, here is 1.5.0 for example
wget https://downloads.apache.org/incubator/hugegraph/{version}/apache-hugegraph-incubating-{version}.tar.gz
tar zxf *hugegraph*.tar.gz
3.3 源码编译
源码编译前请确保本机有安装 wget/curl 命令
下载 HugeGraph 源代码
git clone https://github.com/apache/hugegraph.git
编译打包生成 tar 包
cd hugegraph
# (Optional) use "-P stage" param if you build failed with the latest code(during pre-release period)
mvn package -DskipTests
执行日志如下:
......
[INFO] Reactor Summary for hugegraph 1.5.0:
[INFO]
[INFO] hugegraph .......................................... SUCCESS [ 2.405 s]
[INFO] hugegraph-core ..................................... SUCCESS [ 13.405 s]
[INFO] hugegraph-api ...................................... SUCCESS [ 25.943 s]
[INFO] hugegraph-cassandra ................................ SUCCESS [ 54.270 s]
[INFO] hugegraph-scylladb ................................. SUCCESS [ 1.032 s]
[INFO] hugegraph-rocksdb .................................. SUCCESS [ 34.752 s]
[INFO] hugegraph-mysql .................................... SUCCESS [ 1.778 s]
[INFO] hugegraph-palo ..................................... SUCCESS [ 1.070 s]
[INFO] hugegraph-hbase .................................... SUCCESS [ 32.124 s]
[INFO] hugegraph-postgresql ............................... SUCCESS [ 1.823 s]
[INFO] hugegraph-dist ..................................... SUCCESS [ 17.426 s]
[INFO] hugegraph-example .................................. SUCCESS [ 1.941 s]
[INFO] hugegraph-test ..................................... SUCCESS [01:01 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
......
执行成功后,在 hugegraph 目录下生成 *hugegraph-*.tar.gz 文件,就是编译生成的 tar 包。
3.4 使用 tools 工具部署 (Outdated)
HugeGraph-Tools 提供了一键部署的命令行工具,用户可以使用该工具快速地一键下载、解压、配置并启动 HugeGraph-Server 和 HugeGraph-Hubble,最新的 HugeGraph-Toolchain 中已经包含所有的这些工具,直接下载它解压就有工具包集合了
# download toolchain package, it includes loader + tool + hubble, please check the latest version (here is 1.5.0)
wget https://downloads.apache.org/incubator/hugegraph/1.5.0/apache-hugegraph-toolchain-incubating-1.5.0.tar.gz
tar zxf *hugegraph-*.tar.gz
# enter the tool's package
cd *hugegraph*/*tool*
注: ${version} 为版本号,最新版本号可参考 Download 页面,或直接从 Download 页面点击链接下载
HugeGraph-Tools 的总入口脚本是 bin/hugegraph,用户可以使用 help 子命令查看其用法,这里只介绍一键部署的命令。
bin/hugegraph deploy -v {hugegraph-version} -p {install-path} [-u {download-path-prefix}]
{hugegraph-version} 表示要部署的 HugeGraphServer 及 HugeGraphStudio 的版本,用户可查看 conf/version-mapping.yaml 文件获取版本信息;{install-path} 指定 HugeGraphServer 及 HugeGraphStudio 的安装目录;{download-path-prefix} 可选,指定 HugeGraphServer 及 HugeGraphStudio tar 包的下载地址,不提供时使用默认下载地址。比如要启动 0.6 版本的 HugeGraph-Server 及 HugeGraphStudio,将上述命令写为 bin/hugegraph deploy -v 0.6 -p services 即可。
4 配置
如果需要快速启动 HugeGraph 仅用于测试,那么只需要进行少数几个配置项的修改即可(见下一节)。
5 启动
5.1 使用启动脚本启动
启动分为"首次启动"和"非首次启动",这么区分是因为在第一次启动前需要初始化后端数据库,然后启动服务。
而在人为停掉服务后,或者其他原因需要再次启动服务时,因为后端数据库是持久化存在的,直接启动服务即可。
HugeGraphServer 启动时会连接后端存储并尝试检查后端存储版本号,如果未初始化后端或者后端已初始化但版本不匹配时(旧版本数据),HugeGraphServer 会启动失败,并给出错误信息。
如果需要外部访问 HugeGraphServer,请修改 rest-server.properties 的 restserver.url 配置项(默认为 http://127.0.0.1:8080),修改成机器名或 IP 地址。
由于各种后端所需的配置(hugegraph.properties)及启动步骤略有不同,下面逐一对各后端的配置及启动做介绍。
如果想要使用 HugeGraph 鉴权模式,在后面正式启动 Server 之前应按照 Server 鉴权配置 进行配置。
5.1.1 RocksDB
点击展开/折叠 RocksDB 配置及启动方法
RocksDB 是一个嵌入式的数据库,不需要手动安装部署,要求 GCC 版本 >= 4.3.0(GLIBCXX_3.4.10),如不满足,需要提前升级 GCC
修改 hugegraph.properties
backend=rocksdb
serializer=binary
rocksdb.data_path=.
rocksdb.wal_path=.
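RocksDB 的数据目录与 WAL 目录在上面的配置中都是当前目录(.),生产环境建议改为独立的磁盘路径,例如(路径仅为示意):
rocksdb.data_path=/data/hugegraph/rocksdb-data
rocksdb.wal_path=/data/hugegraph/rocksdb-wal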
初始化数据库(第一次启动时或在 conf/graphs/ 下手动添加了新配置时需要进行初始化)
cd *hugegraph-${version}
bin/init-store.sh
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
提示的 url 与 rest-server.properties 中配置的 restserver.url 一致
5.1.2 HBase
点击展开/折叠 HBase 配置及启动方法
用户需自行安装 HBase,要求版本 2.0 以上,下载地址
修改 hugegraph.properties
backend=hbase
serializer=hbase
# hbase backend config
hbase.hosts=localhost
hbase.port=2181
# Note: recommend to modify the HBase partition number by the actual/env data amount & RS amount before init store
# it may influence the loading speed a lot
#hbase.enable_partition=true
#hbase.vertex_partitions=10
#hbase.edge_partitions=30
初始化数据库(第一次启动时或在 conf/graphs/ 下手动添加了新配置时需要进行初始化)
cd *hugegraph-${version}
bin/init-store.sh
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
更多其它后端配置可参考配置项介绍
5.1.3 MySQL
点击展开/折叠 MySQL 配置及启动方法
由于 MySQL 是在 GPL 协议下,与 Apache 协议不兼容,用户需自行安装 MySQL,下载地址
下载 MySQL 的驱动包,比如 mysql-connector-java-8.0.30.jar,并放入 HugeGraph-Server 的 lib 目录下。
修改 hugegraph.properties,配置数据库 URL、用户名和密码,store 是数据库名,如果没有会被自动创建。
backend=mysql
serializer=mysql
store=hugegraph
# mysql backend config
jdbc.driver=com.mysql.cj.jdbc.Driver
jdbc.url=jdbc:mysql://127.0.0.1:3306
jdbc.username=
jdbc.password=
jdbc.reconnect_max_times=3
jdbc.reconnect_interval=3
jdbc.ssl_mode=false
初始化数据库(第一次启动时或在 conf/graphs/ 下手动添加了新配置时需要进行初始化)
cd *hugegraph-${version}
bin/init-store.sh
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
5.1.4 Cassandra
点击展开/折叠 Cassandra 配置及启动方法
用户需自行安装 Cassandra,要求版本 3.0 以上,下载地址
修改 hugegraph.properties
backend=cassandra
serializer=cassandra
# cassandra backend config
cassandra.host=localhost
cassandra.port=9042
cassandra.username=
cassandra.password=
#cassandra.connect_timeout=5
#cassandra.read_timeout=20
#cassandra.keyspace.strategy=SimpleStrategy
#cassandra.keyspace.replication=3
初始化数据库(第一次启动时或在 conf/graphs/ 下手动添加了新配置时需要进行初始化)
cd *hugegraph-${version}
bin/init-store.sh
Initing HugeGraph Store...
2017-12-01 11:26:51 1424 [main] [INFO ] org.apache.hugegraph.HugeGraph [] - Opening backend store: 'cassandra'
2017-12-01 11:26:52 2389 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph, try init keyspace later
2017-12-01 11:26:52 2472 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph, try init keyspace later
2017-12-01 11:26:52 2557 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph, try init keyspace later
2017-12-01 11:26:53 2797 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_graph
2017-12-01 11:26:53 2945 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_schema
2017-12-01 11:26:53 3044 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_index
2017-12-01 11:26:53 3046 [pool-3-thread-1] [INFO ] org.apache.hugegraph.backend.Transaction [] - Clear cache on event 'store.init'
2017-12-01 11:26:59 9720 [main] [INFO ] org.apache.hugegraph.HugeGraph [] - Opening backend store: 'cassandra'
2017-12-01 11:27:00 9805 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph1, try init keyspace later
2017-12-01 11:27:00 9886 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph1, try init keyspace later
2017-12-01 11:27:00 9955 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Failed to connect keyspace: hugegraph1, try init keyspace later
2017-12-01 11:27:00 10175 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_graph
2017-12-01 11:27:00 10321 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_schema
2017-12-01 11:27:00 10413 [main] [INFO ] org.apache.hugegraph.backend.store.cassandra.CassandraStore [] - Store initialized: huge_index
2017-12-01 11:27:00 10413 [pool-3-thread-1] [INFO ] org.apache.hugegraph.backend.Transaction [] - Clear cache on event 'store.init'
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
5.1.5 Memory
点击展开/折叠 Memory 配置及启动方法
修改 hugegraph.properties
backend=memory
serializer=text
Memory 后端的数据是保存在内存中无法持久化的,不需要初始化后端,这也是唯一一个不需要初始化的后端。
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
提示的 url 与 rest-server.properties 中配置的 restserver.url 一致
5.1.6 ScyllaDB
点击展开/折叠 ScyllaDB 配置及启动方法
用户需自行安装 ScyllaDB,推荐版本 2.1 以上,下载地址
修改 hugegraph.properties
backend=scylladb
serializer=scylladb
# cassandra backend config
cassandra.host=localhost
cassandra.port=9042
cassandra.username=
cassandra.password=
#cassandra.connect_timeout=5
#cassandra.read_timeout=20
#cassandra.keyspace.strategy=SimpleStrategy
#cassandra.keyspace.replication=3
由于 scylladb 数据库本身就是基于 cassandra 的"优化版",如果用户未安装 scylladb,也可以直接使用 cassandra 作为后端存储,只需要把 backend 和 serializer 修改为 scylladb,host 和 port 指向 cassandra 集群的 seeds 和 port 即可,但是并不建议这样做,这样发挥不出 scylladb 本身的优势了。
初始化数据库(第一次启动时或在 conf/graphs/ 下手动添加了新配置时需要进行初始化)
cd *hugegraph-${version}
bin/init-store.sh
启动 server
bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)....OK
5.1.7 启动 server 的时候创建示例图
在脚本启动时携带 -p true 参数,表示 preload,即创建示例图
bin/start-hugegraph.sh -p true
Starting HugeGraphServer in daemon mode...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)......OK
并且使用 RESTful API 请求 HugeGraphServer
得到如下结果:
> curl "http://localhost:8080/graphs/hugegraph/graph/vertices" | gunzip
{"vertices":[{"id":"2:lop","label":"software","type":"vertex","properties":{"name":"lop","lang":"java","price":328}},{"id":"1:josh","label":"person","type":"vertex","properties":{"name":"josh","age":32,"city":"Beijing"}},{"id":"1:marko","label":"person","type":"vertex","properties":{"name":"marko","age":29,"city":"Beijing"}},{"id":"1:peter","label":"person","type":"vertex","properties":{"name":"peter","age":35,"city":"Shanghai"}},{"id":"1:vadas","label":"person","type":"vertex","properties":{"name":"vadas","age":27,"city":"Hongkong"}},{"id":"2:ripple","label":"software","type":"vertex","properties":{"name":"ripple","lang":"java","price":199}}]}
代表创建示例图成功。
5.2 使用 Docker
在 3.1 使用 Docker 容器中,我们已经介绍了如何使用 docker 部署 hugegraph-server,我们还可以使用其他的后端存储,或者设置参数在 server 启动的时候加载样例图
5.2.1 使用 Cassandra 作为后端
点击展开/折叠 Cassandra 配置及启动方法
在使用 Docker 的时候,我们可以使用 Cassandra 作为后端存储。我们更加推荐直接使用 docker-compose 来对于 server 以及 Cassandra 进行统一管理
样例的 docker-compose.yml
可以在 github 中获取,使用 docker-compose up -d
启动。(如果使用 cassandra 4.0 版本作为后端存储,则需要大约两个分钟初始化,请耐心等待)
version: "3"
services:
  server:
    image: hugegraph/hugegraph
    container_name: cas-server
    ports:
      - 8080:8080
    environment:
      hugegraph.backend: cassandra
      hugegraph.serializer: cassandra
      hugegraph.cassandra.host: cas-cassandra
      hugegraph.cassandra.port: 9042
    networks:
      - ca-network
    depends_on:
      - cassandra
    healthcheck:
      test: ["CMD", "bin/gremlin-console.sh", "--" ,"-e", "scripts/remote-connect.groovy"]
      interval: 10s
      timeout: 30s
      retries: 3

  cassandra:
    image: cassandra:4
    container_name: cas-cassandra
    ports:
      - 7000:7000
      - 9042:9042
    security_opt:
      - seccomp:unconfined
    networks:
      - ca-network
    healthcheck:
      test: ["CMD", "cqlsh", "--execute", "describe keyspaces;"]
      interval: 10s
      timeout: 30s
      retries: 5

networks:
  ca-network:

volumes:
  hugegraph-data:
在这个 yaml 中,需要在环境变量中以 hugegraph.<parameter_name> 的形式进行参数传递,配置 Cassandra 相关的参数。
具体来说,hugegraph.properties 配置文件中提供了 backend=xxx、cassandra.host=xxx 等配置项,为了配置这些配置项,在传递环境变量时需要在这些配置项前加上 hugegraph.,即 hugegraph.backend 和 hugegraph.cassandra.host。
其他配置可以参照 4 配置
5.2.2 启动 server 的时候创建示例图
在 docker 启动的时候设置环境变量 PRELOAD=true,从而实现启动脚本的时候加载数据。

使用 docker run:

docker run -itd --name=server -p 8080:8080 -e PRELOAD=true hugegraph/hugegraph:1.5.0

使用 docker-compose:

创建 docker-compose.yml,具体文件如下,在环境变量中设置 PRELOAD=true。其中,example.groovy 是一个预定义的脚本,用于预加载样例数据。如果有需要,可以通过挂载新的 example.groovy 脚本改变预加载的数据。

version: '3'
services:
  server:
    image: hugegraph/hugegraph:1.5.0
    container_name: server
    environment:
      - PRELOAD=true
    volumes:
      - /path/to/yourscript:/hugegraph/scripts/example.groovy
    ports:
      - 8080:8080

使用命令 docker-compose up -d 启动容器
使用 RESTful API 请求 HugeGraphServer
得到如下结果:
> curl "http://localhost:8080/graphs/hugegraph/graph/vertices" | gunzip
{"vertices":[{"id":"2:lop","label":"software","type":"vertex","properties":{"name":"lop","lang":"java","price":328}},{"id":"1:josh","label":"person","type":"vertex","properties":{"name":"josh","age":32,"city":"Beijing"}},{"id":"1:marko","label":"person","type":"vertex","properties":{"name":"marko","age":29,"city":"Beijing"}},{"id":"1:peter","label":"person","type":"vertex","properties":{"name":"peter","age":35,"city":"Shanghai"}},{"id":"1:vadas","label":"person","type":"vertex","properties":{"name":"vadas","age":27,"city":"Hongkong"}},{"id":"2:ripple","label":"software","type":"vertex","properties":{"name":"ripple","lang":"java","price":199}}]}
代表创建示例图成功。
6 访问 Server
6.1 服务启动状态校验
jps 查看服务进程:

jps
6475 HugeGraphServer

curl 请求 RESTful API:
echo `curl -o /dev/null -s -w %{http_code} "http://localhost:8080/graphs/hugegraph/graph/vertices"`
返回结果 200,代表 server 启动正常
6.2 请求 Server
HugeGraphServer 的 RESTful API 包括多种类型的资源,典型的包括 graph、schema、gremlin、traverser 和 task
- graph 包含 vertices、edges
- schema 包含 vertexlabels、propertykeys、edgelabels、indexlabels
- gremlin 包含各种 Gremlin 语句,如 g.v(),可以同步或者异步执行
- traverser 包含各种高级查询,包括最短路径、交叉点、N 步可达邻居等
- task 包含异步任务的查询和删除
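以 gremlin 资源为例,可以用类似下面的 curl 方式提交一条同步查询(请求体字段含义详见 Gremlin API 文档,图名 hugegraph 请按实际环境调整):
curl -s -X POST -H "Content-Type: application/json" \
     -d '{"gremlin": "hugegraph.traversal().V().limit(3)", "bindings": {}, "language": "gremlin-groovy", "aliases": {}}' \
     "http://localhost:8080/gremlin"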
6.2.1 获取 hugegraph 的顶点及相关属性
curl http://localhost:8080/graphs/hugegraph/graph/vertices
说明
由于图的点和边很多,对于 list 型的请求,比如获取所有顶点、获取所有边等,Server 会将数据压缩再返回,所以使用 curl 时得到一堆乱码,可以重定向至 gunzip 进行解压。推荐使用 Chrome 浏览器 + Restlet 插件发送 HTTP 请求进行测试。

curl "http://localhost:8080/graphs/hugegraph/graph/vertices" | gunzip

当前 HugeGraphServer 的默认配置只能是本机访问,可以修改配置,使其能在其他机器访问。

vim conf/rest-server.properties

restserver.url=http://0.0.0.0:8080
响应体如下:
{
"vertices": [
{
"id": "2:lop",
"label": "software",
"type": "vertex",
"properties": {
"price": [
{
"id": "price",
"value": 328
}
],
"name": [
{
"id": "name",
"value": "lop"
}
],
"lang": [
{
"id": "lang",
"value": "java"
}
]
}
},
{
"id": "1:josh",
"label": "person",
"type": "vertex",
"properties": {
"name": [
{
"id": "name",
"value": "josh"
}
],
"age": [
{
"id": "age",
"value": 32
}
]
}
},
...
]
}
详细的 API 请参考 RESTful-API 文档。
另外也可以通过访问 localhost:8080/swagger-ui/index.html 查看 API。
在使用 Swagger UI 调试 HugeGraph 提供的 API 时,如果 HugeGraph Server 开启了鉴权模式,可以在 Swagger 页面输入鉴权信息。
当前 HugeGraph 支持基于 Basic 和 Bearer 两种形式设置鉴权信息。
7 停止 Server
$cd *hugegraph-${version}
$bin/stop-hugegraph.sh
8 使用 IntelliJ IDEA 调试 Server
3.2 - HugeGraph-Loader Quick Start
1 HugeGraph-Loader 概述
HugeGraph-Loader 是 HugeGraph 的数据导入组件,能够将多种数据源的数据转化为图的顶点和边并批量导入到图数据库中。
目前支持的数据源包括:
- 本地磁盘文件或目录,支持 TEXT、CSV 和 JSON 格式的文件,支持压缩文件
- HDFS 文件或目录,支持压缩文件
- 主流关系型数据库,如 MySQL、PostgreSQL、Oracle、SQL Server
本地磁盘文件和 HDFS 文件支持断点续传。
后面会具体说明。
注意:使用 HugeGraph-Loader 需要依赖 HugeGraph Server 服务,下载和启动 Server 请参考 HugeGraph-Server Quick Start
2 获取 HugeGraph-Loader
有两种方式可以获取 HugeGraph-Loader:
- 使用 Docker 镜像 (便于测试)
- 下载已编译的压缩包
- 克隆源码编译安装
2.1 使用 Docker 镜像 (便于测试)
我们可以使用 docker run -itd --name loader hugegraph/loader:1.5.0 部署 loader 服务。对于需要加载的数据,则可以通过挂载 -v /path/to/data/file:/loader/file 或者 docker cp 的方式将文件复制到 loader 容器内部。
或者使用 docker-compose 启动 loader, 启动命令为 docker-compose up -d
, 样例的 docker-compose.yml 如下所示:
version: '3'
services:
  server:
    image: hugegraph/hugegraph:1.5.0
    container_name: server
    ports:
      - 8080:8080
  hubble:
    image: hugegraph/hubble:1.5.0
    container_name: hubble
    ports:
      - 8088:8088
  loader:
    image: hugegraph/loader:1.5.0
    container_name: loader
    # mount your own data here
    # volumes:
    #   - /path/to/data/file:/loader/file
具体的数据导入流程可以参考 4.5 使用 docker 导入
注意:
hugegraph-loader 的 docker 镜像是一个便捷版本,用于快速启动 loader,并不是官方发布物料包方式。你可以从 ASF Release Distribution Policy 中得到更多细节。
推荐使用 release tag (如 1.5.0) 以获取稳定版。使用 latest tag 可以使用开发中的最新功能。
2.2 下载已编译的压缩包
下载最新版本的 HugeGraph-Toolchain Release 包,里面包含了 loader + tool + hubble 全套工具,如果你已经下载,可跳过重复步骤
wget https://downloads.apache.org/incubator/hugegraph/{version}/apache-hugegraph-toolchain-incubating-{version}.tar.gz
tar zxf *hugegraph*.tar.gz
2.3 克隆源码编译安装
克隆最新版本的 HugeGraph-Loader 源码包:
# 1. get from github
git clone https://github.com/apache/hugegraph-toolchain.git
# 2. get from direct url (please choose the **latest release** version)
wget https://downloads.apache.org/incubator/hugegraph/{version}/apache-hugegraph-toolchain-incubating-{version}-src.tar.gz
点击展开/折叠 手动安装 ojdbc 方法
由于 Oracle ojdbc license 的限制,需要手动安装 ojdbc 到本地 maven 仓库。 访问 Oracle jdbc 下载 页面。选择 Oracle Database 12c Release 2 (12.2.0.1) drivers,如下图所示。
打开链接后,选择“ojdbc8.jar”
把 ojdbc8 安装到本地 maven 仓库,进入ojdbc8.jar
所在目录,执行以下命令。
mvn install:install-file -Dfile=./ojdbc8.jar -DgroupId=com.oracle -DartifactId=ojdbc8 -Dversion=12.2.0.1 -Dpackaging=jar
编译生成 tar 包:
cd hugegraph-loader
mvn clean package -DskipTests
3 使用流程
使用 HugeGraph-Loader 的基本流程分为以下几步:
- 编写图模型
- 准备数据文件
- 编写输入源映射文件
- 执行命令导入
3.1 编写图模型
这一步是建模的过程,用户需要对自己已有的数据和想要创建的图模型有一个清晰的构想,然后编写 schema 建立图模型。
比如想创建一个拥有两类顶点及两类边的图,顶点是"人"和"软件",边是"人认识人"和"人创造软件",并且这些顶点和边都带有一些属性,比如顶点"人"有:"姓名"、"年龄"等属性,"软件"有:"名字"、"售卖价格"等属性;边"认识"有:"日期"属性等。
示例图模型
在设计好了图模型之后,我们可以用 groovy 编写出 schema 的定义,并保存至文件中,这里命名为 schema.groovy。
// 创建一些属性
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("date").asText().ifNotExist().create();
schema.propertyKey("price").asDouble().ifNotExist().create();
// 创建 person 顶点类型,其拥有三个属性:name, age, city,主键是 name
schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create();
// 创建 software 顶点类型,其拥有两个属性:name, price,主键是 name
schema.vertexLabel("software").properties("name", "price").primaryKeys("name").ifNotExist().create();
// 创建 knows 边类型,这类边是从 person 指向 person 的
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").ifNotExist().create();
// 创建 created 边类型,这类边是从 person 指向 software 的
schema.edgeLabel("created").sourceLabel("person").targetLabel("software").ifNotExist().create();
关于 schema 的详细说明请参考 hugegraph-client 中对应部分。
3.2 准备数据
目前 HugeGraph-Loader 支持的数据源包括:
- 本地磁盘文件或目录
- HDFS 文件或目录
- 部分关系型数据库
- Kafka topic
3.2.1 数据源结构
3.2.1.1 本地磁盘文件或目录
用户可以指定本地磁盘文件作为数据源,如果数据分散在多个文件中,也支持以某个目录作为数据源,但暂时不支持以多个目录作为数据源。
比如:我的数据分散在多个文件中,part-0、part-1 … part-n,要想执行导入,必须保证它们是放在一个目录下的。然后在 loader 的映射文件中,将path
指定为该目录即可。
支持的文件格式包括:
- TEXT
- CSV
- JSON
TEXT 是自定义分隔符的文本文件,第一行通常是标题,记录了每一列的名称,也允许没有标题行(在映射文件中指定)。其余的每行代表一条记录,会被转化为一个顶点/边;行的每一列对应一个字段,会被转化为顶点/边的 id、label 或属性;
示例如下:
id|name|lang|price|ISBN
1|lop|java|328|ISBN978-7-107-18618-5
2|ripple|java|199|ISBN978-7-100-13678-5
CSV 是分隔符为逗号,
的 TEXT 文件,当列值本身包含逗号时,该列值需要用双引号包起来,如:
marko,29,Beijing
"li,nary",26,"Wu,han"
JSON 文件要求每一行都是一个 JSON 串,且每行的格式需保持一致。
{"source_name": "marko", "target_name": "vadas", "date": "20160110", "weight": 0.5}
{"source_name": "marko", "target_name": "josh", "date": "20130220", "weight": 1.0}
3.2.1.2 HDFS 文件或目录
用户也可以指定 HDFS 文件或目录作为数据源,上面关于本地磁盘文件或目录的要求全部适用于这里。除此之外,鉴于 HDFS 上通常存储的都是压缩文件,loader 也提供了对压缩文件的支持,并且本地磁盘文件或目录同样支持压缩文件。
目前支持的压缩文件类型包括:GZIP、BZ2、XZ、LZMA、SNAPPY_RAW、SNAPPY_FRAMED、Z、DEFLATE、LZ4_BLOCK、LZ4_FRAMED、ORC 和 PARQUET。
3.2.1.3 主流关系型数据库
loader 还支持以部分关系型数据库作为数据源,目前支持 MySQL、PostgreSQL、Oracle 和 SQL Server。
但目前对表结构要求较为严格,如果导入过程中需要做关联查询,这样的表结构是不允许的。关联查询的意思是:在读到表的某行后,发现某列的值不能直接使用(比如外键),需要再去做一次查询才能确定该列的真实值。
举个例子:假设有三张表,person、software 和 created
// person 表结构
id | name | age | city
// software 表结构
id | name | lang | price
// created 表结构
id | p_id | s_id | date
如果在建模(schema)时指定 person 或 software 的 id 策略是 PRIMARY_KEY,选择以 name 作为 primary keys(注意:这是 hugegraph 中 vertexlabel 的概念),在导入边数据时,由于需要拼接出源顶点和目标顶点的 id,必须拿着 p_id/s_id 去 person/software 表中查到对应的 name,这种需要做额外查询的表结构的情况,loader 暂时是不支持的。这时可以采用以下两种方式替代:
- 仍然指定 person 和 software 的 id 策略为 PRIMARY_KEY,但是以 person 表和 software 表的 id 列作为顶点的主键属性,这样导入边时直接使用 p_id 和 s_id 和顶点的 label 拼接就能生成 id 了;
- 指定 person 和 software 的 id 策略为 CUSTOMIZE,然后直接以 person 表和 software 表的 id 列作为顶点 id,这样导入边时直接使用 p_id 和 s_id 即可;
关键点就是要让边能直接使用 p_id 和 s_id,不要再去查一次。
3.2.2 准备顶点和边数据
3.2.2.1 顶点数据
顶点数据文件由一行一行的数据组成,一般每一行作为一个顶点,每一列会作为顶点属性。下面以 CSV 格式作为示例进行说明。
- person 顶点数据(数据本身不包含 header)
Tom,48,Beijing
Jerry,36,Shanghai
- software 顶点数据(数据本身包含 header)
name,price
Photoshop,999
Office,388
3.2.2.2 边数据
边数据文件由一行一行的数据组成,一般每一行作为一条边,其中有部分列会作为源顶点和目标顶点的 id,其他列作为边属性。下面以 JSON 格式作为示例进行说明。
- knows 边数据
{"source_name": "Tom", "target_name": "Jerry", "date": "2008-12-12"}
- created 边数据
{"source_name": "Tom", "target_name": "Photoshop"}
{"source_name": "Tom", "target_name": "Office"}
{"source_name": "Jerry", "target_name": "Office"}
3.3 编写数据源映射文件
3.3.1 映射文件概述
输入源的映射文件用于描述如何将输入源数据与图的顶点类型/边类型建立映射关系,以 JSON 格式组织,由多个映射块组成,其中每一个映射块都负责将一个输入源映射为顶点和边。
具体而言,每个映射块包含一个输入源和多个顶点映射与边映射块。输入源块对应上面介绍的本地磁盘文件或目录、HDFS 文件或目录和关系型数据库,负责描述数据源的基本信息,比如数据在哪,是什么格式的,分隔符是什么等。顶点映射/边映射与该输入源绑定,可以选择输入源的哪些列,哪些列作为 id、哪些列作为属性,以及每一列映射成什么属性,列的值映射成属性的什么值等等。
以最通俗的话讲,每一个映射块描述了:要导入的文件在哪,文件的每一行要作为哪一类顶点/边,文件的哪些列是需要导入的,以及这些列对应顶点/边的什么属性等。
注意:0.11.0 版本以前的映射文件与 0.11.0 以后的格式变化较大,为表述方便,下面称 0.11.0 以前的映射文件(格式)为 1.0 版本,0.11.0 以后的为 2.0 版本。并且若无特殊说明,“映射文件”表示的是 2.0 版本的。
点击展开/折叠 2.0 版本的映射文件的框架
{
"version": "2.0",
"structs": [
{
"id": "1",
"input": {
},
"vertices": [
{},
{}
],
"edges": [
{},
{}
]
}
]
}
这里直接给出两个版本的映射文件(描述了上面图模型和数据文件)
点击展开/折叠 2.0 版本的映射文件
{
"version": "2.0",
"structs": [
{
"id": "1",
"skip": false,
"input": {
"type": "FILE",
"path": "vertex_person.csv",
"file_filter": {
"extensions": [
"*"
]
},
"format": "CSV",
"delimiter": ",",
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"header": [
"name",
"age",
"city"
],
"charset": "UTF-8",
"list_format": {
"start_symbol": "[",
"elem_delimiter": "|",
"end_symbol": "]"
}
},
"vertices": [
{
"label": "person",
"skip": false,
"id": null,
"unfold": false,
"field_mapping": {},
"value_mapping": {},
"selected": [],
"ignored": [],
"null_values": [
""
],
"update_strategies": {}
}
],
"edges": []
},
{
"id": "2",
"skip": false,
"input": {
"type": "FILE",
"path": "vertex_software.csv",
"file_filter": {
"extensions": [
"*"
]
},
"format": "CSV",
"delimiter": ",",
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"header": null,
"charset": "UTF-8",
"list_format": {
"start_symbol": "",
"elem_delimiter": ",",
"end_symbol": ""
}
},
"vertices": [
{
"label": "software",
"skip": false,
"id": null,
"unfold": false,
"field_mapping": {},
"value_mapping": {},
"selected": [],
"ignored": [],
"null_values": [
""
],
"update_strategies": {}
}
],
"edges": []
},
{
"id": "3",
"skip": false,
"input": {
"type": "FILE",
"path": "edge_knows.json",
"file_filter": {
"extensions": [
"*"
]
},
"format": "JSON",
"delimiter": null,
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"header": null,
"charset": "UTF-8",
"list_format": null
},
"vertices": [],
"edges": [
{
"label": "knows",
"skip": false,
"source": [
"source_name"
],
"unfold_source": false,
"target": [
"target_name"
],
"unfold_target": false,
"field_mapping": {
"source_name": "name",
"target_name": "name"
},
"value_mapping": {},
"selected": [],
"ignored": [],
"null_values": [
""
],
"update_strategies": {}
}
]
},
{
"id": "4",
"skip": false,
"input": {
"type": "FILE",
"path": "edge_created.json",
"file_filter": {
"extensions": [
"*"
]
},
"format": "JSON",
"delimiter": null,
"date_format": "yyyy-MM-dd HH:mm:ss",
"time_zone": "GMT+8",
"skipped_line": {
"regex": "(^#|^//).*|"
},
"compression": "NONE",
"header": null,
"charset": "UTF-8",
"list_format": null
},
"vertices": [],
"edges": [
{
"label": "created",
"skip": false,
"source": [
"source_name"
],
"unfold_source": false,
"target": [
"target_name"
],
"unfold_target": false,
"field_mapping": {
"source_name": "name",
"target_name": "name"
},
"value_mapping": {},
"selected": [],
"ignored": [],
"null_values": [
""
],
"update_strategies": {}
}
]
}
]
}
点击展开/折叠 1.0 版本的映射文件
{
"vertices": [
{
"label": "person",
"input": {
"type": "file",
"path": "vertex_person.csv",
"format": "CSV",
"header": ["name", "age", "city"],
"charset": "UTF-8"
}
},
{
"label": "software",
"input": {
"type": "file",
"path": "vertex_software.csv",
"format": "CSV"
}
}
],
"edges": [
{
"label": "knows",
"source": ["source_name"],
"target": ["target_name"],
"input": {
"type": "file",
"path": "edge_knows.json",
"format": "JSON"
},
"field_mapping": {
"source_name": "name",
"target_name": "name"
}
},
{
"label": "created",
"source": ["source_name"],
"target": ["target_name"],
"input": {
"type": "file",
"path": "edge_created.json",
"format": "JSON"
},
"field_mapping": {
"source_name": "name",
"target_name": "name"
}
}
]
}
映射文件 1.0 版本是以顶点和边为中心,设置输入源;而 2.0 版本是以输入源为中心,设置顶点和边映射。有些输入源(比如一个文件)既能生成顶点,也能生成边,如果用 1.0 版的格式写,就需要在 vertex 和 edge 映射块中各写一次 input 块,这两次的 input 块是完全一样的;而 2.0 版本只需要写一次 input。所以 2.0 版相比于 1.0 版,能省掉一些 input 的重复书写。
在 hugegraph-loader-{version} 的 bin 目录下,有一个脚本工具 mapping-convert.sh,能直接将 1.0 版本的映射文件转换为 2.0 版本的,使用方式如下:
bin/mapping-convert.sh struct.json
会在 struct.json 的同级目录下生成一个 struct-v2.json。
3.3.2 输入源
输入源目前分为四类:FILE、HDFS、JDBC、KAFKA,由 type 节点区分,我们称为本地文件输入源、HDFS 输入源、JDBC 输入源和 KAFKA 输入源,下面分别介绍。
3.3.2.1 本地文件输入源
- id: 输入源的 id,该字段用于支持一些内部功能,非必填(未填时会自动生成),强烈建议写上,对于调试大有裨益;
- skip: 是否跳过该输入源,由于 JSON 文件无法添加注释,如果某次导入时不想导入某个输入源,但又不想删除该输入源的配置,则可以设置为 true 将其跳过,默认为 false,非必填;
- input: 输入源映射块,复合结构
- type: 输入源类型,必须填 file 或 FILE;
- path: 本地文件或目录的路径,绝对路径或相对于映射文件的相对路径,建议使用绝对路径,必填;
- file_filter: 从 path 中筛选符合条件的文件,复合结构,目前只支持配置扩展名,用子节点 extensions 表示,默认为 "*",表示保留所有文件;
- format: 本地文件的格式,可选值为 CSV、TEXT 及 JSON,必须大写,必填;
- header: 文件各列的列名,如不指定则会以数据文件第一行作为 header;当文件本身有标题且又指定了 header,文件的第一行会被当作普通的数据行;JSON 文件不需要指定 header,选填;
- delimiter: 文件行的列分隔符,默认以逗号 "," 作为分隔符,JSON 文件不需要指定,选填;
- charset: 文件的编码字符集,默认 UTF-8,选填;
- date_format: 自定义的日期格式,默认值为 yyyy-MM-dd HH:mm:ss,选填;如果日期是以时间戳的形式呈现的,此项须写为 timestamp(固定写法);
- time_zone: 设置日期数据是处于哪个时区的,默认值为 GMT+8,选填;
- skipped_line: 想跳过的行,复合结构,目前只能配置要跳过的行的正则表达式,用子节点 regex 描述,默认不跳过任何行,选填;
- compression: 文件的压缩格式,可选值为 NONE、GZIP、BZ2、XZ、LZMA、SNAPPY_RAW、SNAPPY_FRAMED、Z、DEFLATE、LZ4_BLOCK、LZ4_FRAMED、ORC 和 PARQUET,默认为 NONE,表示非压缩文件,选填;
- list_format: 当文件 (非 JSON) 的某列是集合结构时(对应图中的 PropertyKey 的 Cardinality 为 Set 或 List),可以用此项设置该列的起始符、分隔符、结束符,复合结构:
  - start_symbol: 集合结构列的起始符 (默认值是 [,JSON 格式目前不支持指定)
  - elem_delimiter: 集合结构列的分隔符 (默认值是 |,JSON 格式目前只支持原生 , 分隔)
  - end_symbol: 集合结构列的结束符 (默认值是 ],JSON 格式目前不支持指定)
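例如,当 CSV 的某一列是形如 [1|2|3] 的集合值时,可以在 input 块中这样配置(字段取值均来自上面的说明,文件路径为示意值):
"input": {
  "type": "FILE",
  "path": "vertex_person.csv",
  "format": "CSV",
  "list_format": {
    "start_symbol": "[",
    "elem_delimiter": "|",
    "end_symbol": "]"
  }
}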
3.3.2.2 HDFS 输入源
上述本地文件输入源的节点及含义这里基本都适用,下面仅列出 HDFS 输入源不一样的和特有的节点。
- type: 输入源类型,必须填 hdfs 或 HDFS,必填;
- path: HDFS 文件或目录的路径,必须是 HDFS 的绝对路径,必填;
- core_site_path: HDFS 集群的 core-site.xml 文件路径,重点要指明 NameNode 的地址(fs.default.name),以及文件系统的实现(fs.hdfs.impl);
3.3.2.3 JDBC 输入源
前面说到过支持多种关系型数据库,但由于它们的映射结构非常相似,故统称为 JDBC 输入源,然后用 vendor 节点区分不同的数据库。
- type: 输入源类型,必须填 jdbc 或 JDBC,必填;
- vendor: 数据库类型,可选项为 [MySQL、PostgreSQL、Oracle、SQLServer],不区分大小写,必填;
- driver: jdbc 使用的 driver 类型,必填;
- url: jdbc 要连接的数据库的 url,必填;
- database: 要连接的数据库名,必填;
- schema: 要连接的 schema 名,不同的数据库要求不一样,下面详细说明;
- table: 要连接的表名,custom_sql 和 table 参数必须填其中一个;
- custom_sql: 自定义 SQL 语句,custom_sql 和 table 参数必须填其中一个;
- password: 连接数据库的密码,必填;
- batch_size: 按页获取表数据时的一页的大小,默认为 500,选填;
MYSQL
节点 | 固定值或常见值 |
---|---|
vendor | MYSQL |
driver | com.mysql.cj.jdbc.Driver |
url | jdbc:mysql://127.0.0.1:3306 |
schema: 可空,若填写必须与 database 的值一样
POSTGRESQL
节点 | 固定值或常见值 |
---|---|
vendor | POSTGRESQL |
driver | org.postgresql.Driver |
url | jdbc:postgresql://127.0.0.1:5432 |
schema: 可空,默认值为“public”
ORACLE
节点 | 固定值或常见值 |
---|---|
vendor | ORACLE |
driver | oracle.jdbc.driver.OracleDriver |
url | jdbc:oracle:thin:@127.0.0.1:1521 |
schema: 可空,默认值与用户名相同
SQLSERVER
节点 | 固定值或常见值 |
---|---|
vendor | SQLSERVER |
driver | com.microsoft.sqlserver.jdbc.SQLServerDriver |
url | jdbc:sqlserver://127.0.0.1:1433 |
schema: 必填
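下面给出一个 MySQL 输入源的 input 块片段作为参考(库名、表名、账号均为示意值):
"input": {
  "type": "JDBC",
  "vendor": "MYSQL",
  "driver": "com.mysql.cj.jdbc.Driver",
  "url": "jdbc:mysql://127.0.0.1:3306",
  "database": "load_test",
  "table": "person",
  "username": "root",
  "password": "******",
  "batch_size": 500
}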
3.3.2.4 Kafka 输入源
- type:输入源类型,必须填 kafka 或 KAFKA,必填;
- bootstrap_server:设置 kafka bootstrap server 列表;
- topic:订阅的 topic;
- group:Kafka 消费者组;
- from_beginning:设置是否从头开始读取;
- format:本地文件的格式,可选值为 CSV、TEXT 及 JSON,必须大写,必填;
- header:文件各列的列名,如不指定则会以数据文件第一行作为 header;当文件本身有标题且又指定了 header,文件的第一行会被当作普通的数据行;JSON 文件不需要指定 header,选填;
- delimiter:文件行的列分隔符,默认以逗号 "," 作为分隔符,JSON 文件不需要指定,选填;
- charset:文件的编码字符集,默认 UTF-8,选填;
- date_format:自定义的日期格式,默认值为 yyyy-MM-dd HH:mm:ss,选填;如果日期是以时间戳的形式呈现的,此项须写为 timestamp(固定写法);
- extra_date_formats:自定义的其他日期格式列表,默认为空,选填;列表中每一项都是一个备用的日期格式(写法同 date_format);
- time_zone:设置日期数据是处于哪个时区的,默认值为 GMT+8,选填;
- skipped_line:想跳过的行,复合结构,目前只能配置要跳过的行的正则表达式,用子节点 regex 描述,默认不跳过任何行,选填;
- early_stop:某次从 Kafka broker 拉取的记录为空,停止任务,默认为 false,仅用于调试,选填;
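一个 Kafka 输入源的 input 块片段示意(broker 地址与 topic 为假设值,仅作为字段组合的参考):
"input": {
  "type": "KAFKA",
  "bootstrap_server": "127.0.0.1:9092",
  "topic": "hugegraph-person",
  "group": "hugegraph-loader",
  "from_beginning": true,
  "format": "JSON"
}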
3.3.3 顶点和边映射
顶点和边映射的节点(JSON 文件中的一个 key)有很多相同的部分,下面先介绍相同部分,再分别介绍顶点映射和边映射的特有节点。
相同部分的节点
- label: 待导入的顶点/边数据所属的 label,必填;
- field_mapping: 将输入源列的列名映射为顶点/边的属性名,选填;
- value_mapping: 将输入源的数据值映射为顶点/边的属性值,选填;
- selected: 选择某些列插入,其他未选中的不插入,不能与 ignored 同时存在,选填;
- ignored: 忽略某些列,使其不参与插入,不能与 selected 同时存在,选填;
- null_values: 可以指定一些字符串代表空值,比如 "NULL",如果该列对应的顶点/边属性又是一个可空属性,那在构造顶点/边时不会设置该属性的值,选填;
- update_strategies: 如果数据需要按特定方式批量更新时可以对每个属性指定具体的更新策略 (具体见下),选填;
- unfold: 是否将列展开,展开的每一列都会与其他列一起组成一行,相当于是展开成了多行;比如文件的某一列(id 列)的值是 [1,2,3],其他列的值是 18,Beijing,当设置了 unfold 之后,这一行就会变成 3 行,分别是:1,18,Beijing、2,18,Beijing 和 3,18,Beijing。需要注意的是此项只会展开被选作为 id 的列。默认 false,选填;
更新策略支持 8 种 : (需要全大写)
- 数值累加: SUM
- 两个数字/日期取更大的: BIGGER
- 两个数字/日期取更小: SMALLER
- Set 属性取并集: UNION
- Set 属性取交集: INTERSECTION
- List 属性追加元素: APPEND
- List/Set 属性删除元素: ELIMINATE
- 覆盖已有属性: OVERRIDE
注意: 如果新导入的属性值为空,会采用已有的旧数据而不会采用空值,效果可以参考如下示例
// JSON 文件中以如下方式指定更新策略
{
"vertices": [
{
"label": "person",
"update_strategies": {
"age": "SMALLER",
"set": "UNION"
},
"input": {
"type": "file",
"path": "vertex_person.txt",
"format": "TEXT",
"header": ["name", "age", "set"]
}
}
]
}
// 1.写入一行带 OVERRIDE 更新策略的数据 (这里 null 代表空)
'a b null null'
// 2.再写一行
'null null c d'
// 3.最后可以得到
'a b c d'
// 如果没有更新策略,则会得到
'null null c d'
注意 : 采用了批量更新的策略后, 磁盘读请求数会大幅上升, 导入速度相比纯写覆盖会慢数倍 (此时HDD磁盘IOPS会成为瓶颈, 建议采用SSD以保证速度)
顶点映射的特有节点
- id: 指定某一列作为顶点的 id 列,当顶点 id 策略为 CUSTOMIZE 时,必填;当 id 策略为 PRIMARY_KEY 时,必须为空;
边映射的特有节点
- source: 选择输入源某几列作为源顶点的 id 列,当源顶点的 id 策略为 CUSTOMIZE 时,必须指定某一列作为顶点的 id 列;当源顶点的 id 策略为 PRIMARY_KEY 时,必须指定一列或多列用于拼接生成顶点的 id,也就是说,不管是哪种 id 策略,此项必填;
- target: 指定某几列作为目标顶点的 id 列,与 source 类似,不再赘述;
- unfold_source: 是否展开文件的 source 列,效果与顶点映射中的类似,不再赘述;
- unfold_target: 是否展开文件的 target 列,效果与顶点映射中的类似,不再赘述;
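结合前面 3.2.1.3 的 person/software/created 例子,下面是一个同时用到 id、source、target 的映射片段示意(列名为假设值):
"vertices": [
  { "label": "software", "id": "id" }
],
"edges": [
  { "label": "created", "source": ["p_id"], "target": ["s_id"] }
]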
3.4 执行命令导入
准备好图模型、数据文件以及输入源映射关系文件后,接下来就可以将数据文件导入到图数据库中。
导入过程由用户提交的命令控制,用户可以通过不同的参数控制执行的具体流程。
3.4.1 参数说明
参数 | 默认值 | 是否必传 | 描述信息 |
---|---|---|---|
-f 或 --file | | Y | 配置脚本的路径 |
-g 或 --graph | | Y | 图数据库空间 |
-s 或 --schema | | Y | schema 文件路径 |
-h 或 --host | localhost | | HugeGraphServer 的地址 |
-p 或 --port | 8080 | | HugeGraphServer 的端口号 |
--username | null | | 当 HugeGraphServer 开启了权限认证时,当前图的 username |
--token | null | | 当 HugeGraphServer 开启了权限认证时,当前图的 token |
--protocol | http | | 向服务端发请求的协议,可选 http 或 https |
--trust-store-file | | | 请求协议为 https 时,客户端的证书文件路径 |
--trust-store-password | | | 请求协议为 https 时,客户端证书密码 |
--clear-all-data | false | | 导入数据前是否清除服务端的原有数据 |
--clear-timeout | 240 | | 导入数据前清除服务端的原有数据的超时时间 |
--incremental-mode | false | | 是否使用断点续导模式,仅输入源为 FILE 和 HDFS 支持该模式,启用该模式能从上一次导入停止的地方开始导 |
--failure-mode | false | | 失败模式为 true 时,会导入之前失败了的数据,一般来说失败数据文件需要在人工更正编辑好后,再次进行导入 |
--batch-insert-threads | CPUs | | 批量插入线程池大小 (CPUs 是当前 OS 可用逻辑核个数) |
--single-insert-threads | 8 | | 单条插入线程池的大小 |
--max-conn | 4 * CPUs | | HugeClient 与 HugeGraphServer 的最大 HTTP 连接数,调整线程的时候建议同时调整此项 |
--max-conn-per-route | 2 * CPUs | | HugeClient 与 HugeGraphServer 每个路由的最大 HTTP 连接数,调整线程的时候建议同时调整此项 |
--batch-size | 500 | | 导入数据时每个批次包含的数据条数 |
--max-parse-errors | 1 | | 最多允许多少行数据解析错误,达到该值则程序退出 |
--max-insert-errors | 500 | | 最多允许多少行数据插入错误,达到该值则程序退出 |
--timeout | 60 | | 插入结果返回的超时时间(秒) |
--shutdown-timeout | 10 | | 多线程停止的等待时间(秒) |
--retry-times | 0 | | 发生特定异常时的重试次数 |
--retry-interval | 10 | | 重试之前的间隔时间(秒) |
--check-vertex | false | | 插入边时是否检查边所连接的顶点是否存在 |
--print-progress | true | | 是否在控制台实时打印导入条数 |
--dry-run | false | | 打开该模式,只解析不导入,通常用于测试 |
--help | false | | 打印帮助信息 |
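例如,在开启了权限认证的 Server 上使用断点续导模式导入(地址与凭证为示意值):
bin/hugegraph-loader.sh -g hugegraph -f example/file/struct.json -s example/file/schema.groovy -h 127.0.0.1 -p 8080 --username admin --token xxxx --incremental-mode true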
3.4.2 断点续导模式
通常情况下,Loader 任务都需要较长时间执行,如果因为某些原因导致导入中断进程退出,而下次希望能从中断的点继续导,这就是使用断点续导的场景。
用户设置命令行参数 --incremental-mode 为 true 即打开了断点续导模式。断点续导的关键在于进度文件,导入进程退出的时候,会把退出时刻的导入进度记录到进度文件中。进度文件位于 ${struct} 目录下,文件名形如 load-progress ${date},${struct} 为映射文件的前缀,${date} 为导入开始的时刻。比如:在 2019-10-10 12:30:30 开始的一次导入任务,使用的映射文件为 struct-example.json,则进度文件的路径为与 struct-example.json 同级的 struct-example/load-progress 2019-10-10 12:30:30。
注意:进度文件的生成与 --incremental-mode 是否打开无关,每次导入结束都会生成一个进度文件。
如果数据文件格式都是合法的,是用户自己停止(CTRL + C 或 kill,kill -9 不支持)的导入任务,也就是说没有错误记录的情况下,下一次导入只需要设置为断点续导即可。
但如果是因为太多数据不合法或者网络异常,达到了 --max-parse-errors 或 --max-insert-errors 的限制,Loader 会把这些插入失败的原始行记录到失败文件中,用户对失败文件中的数据行修改后,设置 --reload-failure 为 true 即可把这些"失败文件"也当作输入源进行导入(不影响正常的文件的导入),当然如果修改后的数据行仍然有问题,则会被再次记录到失败文件中(不用担心会有重复行)。
每个顶点映射或边映射有数据插入失败时都会产生自己的失败文件,失败文件又分为解析失败文件(后缀 .parse-error)和插入失败文件(后缀 .insert-error),它们被保存在 ${struct}/current 目录下。比如映射文件中有一个顶点映射 person 和边映射 knows,它们各有一些错误行,当 Loader 退出后,在 ${struct}/current 目录下会看到如下文件:
- person-b4cd32ab.parse-error: 顶点映射 person 解析错误的数据
- person-b4cd32ab.insert-error: 顶点映射 person 插入错误的数据
- knows-eb6b2bac.parse-error: 边映射 knows 解析错误的数据
- knows-eb6b2bac.insert-error: 边映射 knows 插入错误的数据
.parse-error 和 .insert-error 并不总是一起存在的,只有存在解析出错的行才会有 .parse-error 文件,只有存在插入出错的行才会有 .insert-error 文件。
3.4.3 logs 目录文件说明
程序执行过程中各日志及错误数据会写入 hugegraph-loader.log 文件中。
3.4.4 执行命令
运行 bin/hugegraph-loader 并传入参数
bin/hugegraph-loader -g {GRAPH_NAME} -f ${INPUT_DESC_FILE} -s ${SCHEMA_FILE} -h {HOST} -p {PORT}
4 完整示例
下面给出的是 hugegraph-loader 包中 example 目录下的例子。(GitHub 地址)
4.1 准备数据
顶点文件:example/file/vertex_person.csv
marko,29,Beijing
vadas,27,Hongkong
josh,32,Beijing
peter,35,Shanghai
"li,nary",26,"Wu,han"
tom,null,NULL
顶点文件:example/file/vertex_software.txt
id|name|lang|price|ISBN
1|lop|java|328|ISBN978-7-107-18618-5
2|ripple|java|199|ISBN978-7-100-13678-5
边文件:example/file/edge_knows.json
{"source_name": "marko", "target_name": "vadas", "date": "20160110", "weight": 0.5}
{"source_name": "marko", "target_name": "josh", "date": "20130220", "weight": 1.0}
边文件:example/file/edge_created.json
{"aname": "marko", "bname": "lop", "date": "20171210", "weight": 0.4}
{"aname": "josh", "bname": "lop", "date": "20091111", "weight": 0.4}
{"aname": "josh", "bname": "ripple", "date": "20171210", "weight": 1.0}
{"aname": "peter", "bname": "lop", "date": "20170324", "weight": 0.2}
4.2 编写 schema
点击展开/折叠 schema 文件:example/file/schema.groovy
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("weight").asDouble().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
schema.propertyKey("date").asText().ifNotExist().create();
schema.propertyKey("price").asDouble().ifNotExist().create();
schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create();
schema.vertexLabel("software").properties("name", "lang", "price").primaryKeys("name").ifNotExist().create();
schema.indexLabel("personByAge").onV("person").by("age").range().ifNotExist().create();
schema.indexLabel("personByCity").onV("person").by("city").secondary().ifNotExist().create();
schema.indexLabel("personByAgeAndCity").onV("person").by("age", "city").secondary().ifNotExist().create();
schema.indexLabel("softwareByPrice").onV("software").by("price").range().ifNotExist().create();
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").properties("date", "weight").ifNotExist().create();
schema.edgeLabel("created").sourceLabel("person").targetLabel("software").properties("date", "weight").ifNotExist().create();
schema.indexLabel("createdByDate").onE("created").by("date").secondary().ifNotExist().create();
schema.indexLabel("createdByWeight").onE("created").by("weight").range().ifNotExist().create();
schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist().create();
4.3 编写输入源映射文件example/file/struct.json
点击展开/折叠 源映射文件 example/file/struct.json
{
"vertices": [
{
"label": "person",
"input": {
"type": "file",
"path": "example/file/vertex_person.csv",
"format": "CSV",
"header": ["name", "age", "city"],
"charset": "UTF-8",
"skipped_line": {
"regex": "(^#|^//).*"
}
},
"null_values": ["NULL", "null", ""]
},
{
"label": "software",
"input": {
"type": "file",
"path": "example/file/vertex_software.txt",
"format": "TEXT",
"delimiter": "|",
"charset": "GBK"
},
"id": "id",
"ignored": ["ISBN"]
}
],
"edges": [
{
"label": "knows",
"source": ["source_name"],
"target": ["target_name"],
"input": {
"type": "file",
"path": "example/file/edge_knows.json",
"format": "JSON",
"date_format": "yyyyMMdd"
},
"field_mapping": {
"source_name": "name",
"target_name": "name"
}
},
{
"label": "created",
"source": ["source_name"],
"target": ["target_id"],
"input": {
"type": "file",
"path": "example/file/edge_created.json",
"format": "JSON",
"date_format": "yyyy-MM-dd"
},
"field_mapping": {
"source_name": "name"
}
}
]
}
4.4 执行命令导入
sh bin/hugegraph-loader.sh -g hugegraph -f example/file/struct.json -s example/file/schema.groovy
导入结束后,会出现类似如下统计信息:
vertices/edges has been loaded this time : 8/6
--------------------------------------------------
count metrics
input read success : 14
input read failure : 0
vertex parse success : 8
vertex parse failure : 0
vertex insert success : 8
vertex insert failure : 0
edge parse success : 6
edge parse failure : 0
edge insert success : 6
edge insert failure : 0
4.5 使用 docker 导入
4.5.1 使用 docker exec 直接导入数据
4.5.1.1 数据准备
如果仅仅尝试使用 loader, 我们可以使用内置的 example 数据集进行导入,无需自己额外准备数据
如果使用自定义的数据,则在使用 loader 导入数据之前,我们需要将数据复制到容器内部。
首先我们可以根据 4.1-4.3 的步骤准备数据,将准备好的数据通过 docker cp 复制到 loader 容器内部。
假设我们已经按照上述的步骤准备好了对应的数据集,存放在 hugegraph-dataset 文件夹下,文件结构如下:
tree -f hugegraph-dataset/
hugegraph-dataset
├── hugegraph-dataset/edge_created.json
├── hugegraph-dataset/edge_knows.json
├── hugegraph-dataset/schema.groovy
├── hugegraph-dataset/struct.json
├── hugegraph-dataset/vertex_person.csv
└── hugegraph-dataset/vertex_software.txt
将文件复制到容器内部
docker cp hugegraph-dataset loader:/loader/dataset
docker exec -it loader ls /loader/dataset
edge_created.json edge_knows.json schema.groovy struct.json vertex_person.csv vertex_software.txt
4.5.1.2 数据导入
以内置的 example 数据集为例,我们可以使用以下的命令对数据进行导入。
如果需要导入自己准备的数据集,则只需要修改 -f 配置脚本的路径以及 -s schema 文件路径即可。
其他的参数可以参照 3.4.1 参数说明
docker exec -it loader bin/hugegraph-loader.sh -g hugegraph -f example/file/struct.json -s example/file/schema.groovy -h server -p 8080
如果导入用户自定义的数据集,按照刚才的例子,则使用:
docker exec -it loader bin/hugegraph-loader.sh -g hugegraph -f /loader/dataset/struct.json -s /loader/dataset/schema.groovy -h server -p 8080
如果 loader 和 server 位于同一 docker 网络,则可以指定 -h {server_container_name},否则需要指定 server 宿主机的 ip(在我们的例子中,server_container_name 为 server)。
然后我们可以观察到结果:
HugeGraphLoader worked in NORMAL MODE
vertices/edges loaded this time : 8/6
--------------------------------------------------
count metrics
input read success : 14
input read failure : 0
vertex parse success : 8
vertex parse failure : 0
vertex insert success : 8
vertex insert failure : 0
edge parse success : 6
edge parse failure : 0
edge insert success : 6
edge insert failure : 0
--------------------------------------------------
meter metrics
total time : 0.199s
read time : 0.046s
load time : 0.153s
vertex load time : 0.077s
vertex load rate(vertices/s) : 103
edge load time : 0.112s
edge load rate(edges/s) : 53
也可以使用 curl 或者 hubble 观察导入结果,此处以 curl 为例:
> curl "http://localhost:8080/graphs/hugegraph/graph/vertices" | gunzip
{"vertices":[{"id":1,"label":"software","type":"vertex","properties":{"name":"lop","lang":"java","price":328.0}},{"id":2,"label":"software","type":"vertex","properties":{"name":"ripple","lang":"java","price":199.0}},{"id":"1:tom","label":"person","type":"vertex","properties":{"name":"tom"}},{"id":"1:josh","label":"person","type":"vertex","properties":{"name":"josh","age":32,"city":"Beijing"}},{"id":"1:marko","label":"person","type":"vertex","properties":{"name":"marko","age":29,"city":"Beijing"}},{"id":"1:peter","label":"person","type":"vertex","properties":{"name":"peter","age":35,"city":"Shanghai"}},{"id":"1:vadas","label":"person","type":"vertex","properties":{"name":"vadas","age":27,"city":"Hongkong"}},{"id":"1:li,nary","label":"person","type":"vertex","properties":{"name":"li,nary","age":26,"city":"Wu,han"}}]}
如果想检查边的导入结果,可以使用 curl "http://localhost:8080/graphs/hugegraph/graph/edges" | gunzip
4.5.2 进入 docker 容器进行导入
除了直接使用 docker exec 导入数据,我们也可以进入容器进行数据导入,基本流程与 4.5.1 相同。
使用 docker exec -it loader bash 进入容器内部,并执行命令
sh bin/hugegraph-loader.sh -g hugegraph -f example/file/struct.json -s example/file/schema.groovy -h server -p 8080
执行的结果如 4.5.1 所示
4.6 使用 spark-loader 导入
Spark 版本:Spark 3+,其他版本未测试。 HugeGraph Toolchain 版本:toolchain-1.0.0
spark-loader 的参数分为两部分。注意:因二者参数名缩写存在重合部分,请使用参数全称;两种参数之间无需保证先后顺序。
- hugegraph 参数(参考:hugegraph-loader 参数说明 )
- Spark 任务提交参数(参考:Submitting Applications)
示例:
sh bin/hugegraph-spark-loader.sh --master yarn \
--deploy-mode cluster --name spark-hugegraph-loader --file ./hugegraph.json \
--username admin --token admin --host xx.xx.xx.xx --port 8093 \
--graph graph-test --num-executors 6 --executor-cores 16 --executor-memory 15g
3.3 - HugeGraph-Hubble Quick Start
1 HugeGraph-Hubble 概述
特别注意: 当前版本的 Hubble 还没有添加 Auth/Login 相关界面和接口和单独防护, 在下一个 Release 版 (> 1.5) 会加入, 请留意避免把它暴露在公网环境或不受信任的网络中,以免引起相关 SEC 问题 (另外也可以使用 IP & 端口白名单 + HTTPS)
HugeGraph-Hubble 是 HugeGraph 的一站式可视化分析平台,平台涵盖了从数据建模,到数据快速导入, 再到数据的在线、离线分析、以及图的统一管理的全过程,实现了图应用的全流程向导式操作,旨在提升用户的使用流畅度, 降低用户的使用门槛,提供更为高效易用的使用体验。
平台主要包括以下模块:
图管理
图管理模块通过图的创建,连接平台与图数据,实现多图的统一管理,并实现图的访问、编辑、删除、查询操作。
元数据建模
元数据建模模块通过创建属性库,顶点类型,边类型,索引类型,实现图模型的构建与管理,平台提供两种模式,列表模式和图模式,可实时展示元数据模型,更加直观。同时还提供了跨图的元数据复用功能,省去相同元数据繁琐的重复创建过程,极大地提升建模效率,增强易用性。
图分析
通过输入图遍历语言 Gremlin 可实现图数据的高性能通用分析,并提供顶点的定制化多维路径查询等功能,提供 3 种图结果展示方式,包括:图形式、表格形式、Json 形式,多维度展示数据形态,满足用户使用的多种场景需求。提供运行记录及常用语句收藏等功能,实现图操作的可追溯,以及查询输入的复用共享,快捷高效。支持图数据的导出,导出格式为 Json 格式。
任务管理
对于需要遍历全图的 Gremlin 任务,索引的创建与重建等耗时较长的异步任务,平台提供相应的任务管理功能,实现异步任务的统一的管理与结果查看。
数据导入 (BETA)
注: 数据导入功能目前适合初步试用,正式数据导入请使用 hugegraph-loader, 性能/稳定性/功能全面许多
数据导入是将用户的业务数据转化为图的顶点和边并插入图数据库中,平台提供了向导式的可视化导入模块,通过创建导入任务, 实现导入任务的管理及多个导入任务的并行运行,提高导入效能。进入导入任务后,只需跟随平台步骤提示,按需上传文件,填写内容, 就可轻松实现图数据的导入过程,同时支持断点续传,错误重试机制等,降低导入成本,提升效率。
2 部署
有三种方式可以部署hugegraph-hubble
- 使用 docker (便于测试)
- 下载 toolchain 二进制包
- 源码编译
2.1 使用 Docker (便于测试)
特别注意: docker 模式下,若 hubble 和 server 在同一宿主机,hubble 页面中设置 server 的 hostname 不能设置为 localhost/127.0.0.1,因这会指向 hubble 容器内部而非宿主机,导致无法连接到 server。
若 hubble 和 server 在同一 docker 网络下,推荐直接使用 container_name (如下例的 server) 作为主机名。或者也可以使用宿主机 IP 作为主机名,此时端口号为宿主机给 server 配置的端口。
我们可以使用 docker run -itd --name=hubble -p 8088:8088 hugegraph/hubble:1.5.0
快速启动 hubble.
或者使用 docker-compose 启动 hubble,另外如果 hubble 和 server 在同一个 docker 网络下,可以使用 server 的 contain_name 进行访问,而不需要宿主机的 ip
使用docker-compose up -d
,docker-compose.yml
如下:
version: '3'
services:
server:
image: hugegraph/hugegraph:1.5.0
container_name: server
ports:
- 8080:8080
hubble:
image: hugegraph/hubble:1.5.0
container_name: hubble
ports:
- 8088:8088
注意:
hugegraph-hubble
的 docker 镜像是一个便捷发布版本,用于快速测试试用 hubble,并非ASF 官方发布物料包的方式。你可以从 ASF Release Distribution Policy 中得到更多细节。生产环境推荐使用
release tag
(如1.5.0
) 稳定版。使用latest
tag 默认对应 master 最新代码。
2.2 下载 toolchain 二进制包
hubble
项目在toolchain
项目中,首先下载toolchain
的 tar 包
wget https://downloads.apache.org/incubator/hugegraph/{version}/apache-hugegraph-toolchain-incubating-{version}.tar.gz
tar -xvf apache-hugegraph-toolchain-incubating-{version}.tar.gz
cd apache-hugegraph-toolchain-incubating-{version}.tar.gz/apache-hugegraph-hubble-incubating-{version}
运行hubble
bin/start-hubble.sh
随后我们可以看到
starting HugeGraphHubble ..............timed out with http status 502
2023-08-30 20:38:34 [main] [INFO ] o.a.h.HugeGraphHubble [] - Starting HugeGraphHubble v1.0.0 on cpu05 with PID xxx (~/apache-hugegraph-toolchain-incubating-1.0.0/apache-hugegraph-hubble-incubating-1.0.0/lib/hubble-be-1.0.0.jar started by $USER in ~/apache-hugegraph-toolchain-incubating-1.0.0/apache-hugegraph-hubble-incubating-1.0.0)
...
2023-08-30 20:38:38 [main] [INFO ] c.z.h.HikariDataSource [] - hugegraph-hubble-HikariCP - Start completed.
2023-08-30 20:38:41 [main] [INFO ] o.a.c.h.Http11NioProtocol [] - Starting ProtocolHandler ["http-nio-0.0.0.0-8088"]
2023-08-30 20:38:41 [main] [INFO ] o.a.h.HugeGraphHubble [] - Started HugeGraphHubble in 7.379 seconds (JVM running for 8.499)
然后使用浏览器访问 ip:8088 可看到 hubble 页面,通过 bin/stop-hubble.sh 则可以停止服务
2.3 源码编译
注意: 目前已在 hugegraph-hubble/hubble-be/pom.xml 中引入插件 frontend-maven-plugin,编译 hubble 时不需要用户本地环境提前安装 Nodejs V16.x 与 yarn 环境,可直接按下述步骤执行
下载 toolchain 源码包
git clone https://github.com/apache/hugegraph-toolchain.git
编译 hubble,它依赖 loader 和 client,编译时需提前构建这些依赖 (后续可跳)
cd hugegraph-toolchain
sudo pip install -r hugegraph-hubble/hubble-dist/assembly/travis/requirements.txt
mvn install -pl hugegraph-client,hugegraph-loader -am -Dmaven.javadoc.skip=true -DskipTests -ntp
cd hugegraph-hubble
mvn -e package -Dmaven.javadoc.skip=true -Dmaven.test.skip=true -ntp
cd apache-hugegraph-hubble-incubating*
启动hubble
bin/start-hubble.sh -d
3 平台使用流程
平台的模块使用流程如下:
4 平台使用说明
4.1 图管理
4.1.1 图创建
图管理模块下,点击【创建图】,通过填写图 ID、图名称、主机名、端口号、用户名、密码的信息,实现多图的连接。
创建图填写内容如下:
注意:如果使用 docker 启动 hubble,且 server 和 hubble 位于同一宿主机,不能直接使用 localhost/127.0.0.1 作为主机名。如果 hubble 和 server 在同一 docker 网络下,则可以直接使用 container_name 作为主机名,端口则为 8080。或者也可以使用宿主机 ip 作为主机名,此时端口为宿主机为 server 配置的端口
4.1.2 图访问
实现图空间的信息访问,进入后,可进行图的多维查询分析、元数据管理、数据导入、算法分析等操作。
4.1.3 图管理
- 用户通过对图的概览、搜索以及单图的信息编辑与删除,实现图的统一管理。
- 搜索范围:可对图名称和 ID 进行搜索。
4.2 元数据建模(列表 + 图模式)
4.2.1 模块入口
左侧导航处:
4.2.2 属性类型
4.2.2.1 创建
- 填写或选择属性名称、数据类型、基数,完成属性的创建。
- 创建的属性可作为顶点类型和边类型的属性。
列表模式:
图模式:
4.2.2.2 复用
- 平台提供【复用】功能,可直接复用其他图的元数据。
- 选择需要复用的图 ID,继续选择需要复用的属性,之后平台会进行是否冲突的校验,通过后,可实现元数据的复用。
选择复用项:
校验复用项:
4.2.2.3 管理
- 在属性列表中可进行单条删除或批量删除操作。
4.2.3 顶点类型
4.2.3.1 创建
- 填写或选择顶点类型名称、ID 策略、关联属性、主键属性,顶点样式、查询结果中顶点下方展示的内容,以及索引的信息:包括是否创建类型索引,及属性索引的具体内容,完成顶点类型的创建。
列表模式:
图模式:
4.2.3.2 复用
- 顶点类型的复用,会将此类型关联的属性和属性索引一并复用。
- 复用功能使用方法类似属性的复用,见 4.2.2.2。
4.2.3.3 管理
可进行编辑操作,顶点样式、关联类型、顶点展示内容、属性索引可编辑,其余不可编辑。
可进行单条删除或批量删除操作。
4.2.4 边类型
4.2.4.1 创建
- 填写或选择边类型名称、起点类型、终点类型、关联属性、是否允许多次连接、边样式、查询结果中边下方展示的内容,以及索引的信息:包括是否创建类型索引,及属性索引的具体内容,完成边类型的创建。
列表模式:
图模式:
4.2.4.2 复用
- 边类型的复用,会将此类型的起点类型、终点类型、关联的属性和属性索引一并复用。
- 复用功能使用方法类似属性的复用,见 4.2.2.2。
4.2.4.3 管理
- 可进行编辑操作,边样式、关联属性、边展示内容、属性索引可编辑,其余不可编辑,同顶点类型。
- 可进行单条删除或批量删除操作。
4.2.5 索引类型
展示顶点类型和边类型的顶点索引和边索引。
4.3 数据导入
注意:目前推荐使用 hugegraph-loader 进行正式数据导入,hubble 内置的导入用来做测试和简单上手
数据导入的使用流程如下:
4.3.1 模块入口
左侧导航处:
4.3.2 创建任务
- 填写任务名称和备注(非必填),可以创建导入任务。
- 可创建多个导入任务,并行导入。
4.3.3 上传文件
- 上传需要构图的文件,目前支持的格式为 CSV,后续会不断更新。
- 可同时上传多个文件。
4.3.4 设置数据映射
对上传的文件分别设置数据映射,包括文件设置和类型设置
文件设置:勾选或填写是否包含表头、分隔符、编码格式等文件本身的设置内容,均设置默认值,无需手动填写
类型设置:
顶点映射和边映射:
【顶点类型】 :选择顶点类型,并为其 ID 映射上传文件中列数据;
【边类型】:选择边类型,为其起点类型和终点类型的 ID 列映射上传文件的列数据;
映射设置:为选定的顶点类型的属性映射上传文件中的列数据,此处,若属性名称与文件的表头名称一致,可自动匹配映射属性,无需手动填选
完成设置后,显示设置列表,方可进行下一步操作,支持映射的新增、编辑、删除操作
设置映射的填写内容:
映射列表:
4.3.5 导入数据
导入前需要填写导入设置参数,填写完成后,可开始向图库中导入数据
- 导入设置
- 导入设置参数项如下图所示,均设置默认值,无需手动填写
- 导入详情
- 点击开始导入,开始文件的导入任务
- 导入详情中提供每个上传文件设置的映射类型、导入速度、导入的进度、耗时以及当前任务的具体状态,并可对每个任务进行暂停、继续、停止等操作
- 若导入失败,可查看具体原因
4.4 数据分析
4.4.1 模块入口
左侧导航处:
4.4.2 多图切换
通过左侧切换入口,灵活切换多图的操作空间
4.4.3 图分析与处理
HugeGraph 支持 Apache TinkerPop3 的图遍历查询语言 Gremlin,Gremlin 是一种通用的图数据库查询语言,通过输入 Gremlin 语句,点击执行,即可执行图数据的查询分析操作,并可实现顶点/边的创建及删除、顶点/边的属性修改等。
Gremlin 查询后,下方为图结果展示区域,提供 3 种图结果展示方式,分别为:【图模式】、【表格模式】、【Json 模式】。
支持缩放、居中、全屏、导出等操作。
【图模式】
【表格模式】
【Json 模式】
4.4.4 数据详情
点击顶点/边实体,可查看顶点/边的数据详情,包括:顶点/边类型,顶点 ID,属性及对应值,拓展图的信息展示维度,提高易用性。
4.4.5 图结果的多维路径查询
除了全局的查询外,可针对查询结果中的顶点进行深度定制化查询以及隐藏操作,实现图结果的定制化挖掘。
右击顶点,出现顶点的菜单入口,可进行展示、查询、隐藏等操作。
- 展开:点击后,展示与选中点关联的顶点。
- 查询:通过选择与选中点关联的边类型及边方向,在此条件下,再选择其属性及相应筛选规则,可实现定制化的路径展示。
- 隐藏:点击后,隐藏选中点及与之关联的边。
双击顶点,也可展示与选中点关联的顶点。
4.4.6 新增顶点/边
4.4.6.1 新增顶点
在图区可通过两个入口,动态新增顶点,如下:
- 点击图区面板,出现添加顶点入口
- 点击右上角的操作栏中的首个图标
通过选择或填写顶点类型、ID 值、属性信息,完成顶点的增加。
入口如下:
添加顶点内容如下:
4.4.6.2 新增边
右击图结果中的顶点,可增加该点的出边或者入边。
4.4.7 执行记录与收藏的查询
- 图区下方记载每次查询记录,包括:查询时间、执行类型、内容、状态、耗时、以及【收藏】和【加载】操作,实现图执行的全方位记录,有迹可循,并可对执行内容快速加载复用
- 提供语句的收藏功能,可对常用语句进行收藏操作,方便高频语句快速调用
4.5 任务管理
4.5.1 模块入口
左侧导航处:
4.5.2 任务管理
- 提供异步任务的统一的管理与结果查看,异步任务包括 4 类,分别为:
- gremlin:Gremlin 任务
- algorithm:OLAP 算法任务
- remove_schema:删除元数据
- rebuild_index:重建索引
- 列表显示当前图的异步任务信息,包括:任务 ID,任务名称,任务类型,创建时间,耗时,状态,操作,实现对异步任务的管理。
- 支持对任务类型和状态进行筛选
- 支持搜索任务 ID 和任务名称
- 可对异步任务进行删除或批量删除操作
4.5.3 Gremlin 异步任务
1.创建任务
- 数据分析模块,目前支持两种 Gremlin 操作,Gremlin 查询和 Gremlin 任务;若用户切换到 Gremlin 任务,点击执行后,在异步任务中心会建立一条异步任务;
2.任务提交
- 任务提交成功后,图区部分返回提交结果和任务 ID
3.任务详情
- 提供【查看】入口,可跳转到任务详情查看当前任务具体执行情况;跳转到任务中心后,直接显示当前执行的任务行
点击查看入口,跳转到任务管理列表,如下:
4.查看结果
- 结果通过 json 形式展示
4.5.4 OLAP 算法任务
Hubble 上暂未提供可视化的 OLAP 算法执行,可调用 RESTful API 进行 OLAP 类算法任务,在任务管理中通过 ID 找到相应任务,查看进度与结果等。
4.5.5 删除元数据、重建索引
1.创建任务
- 在元数据建模模块中,删除元数据时,可建立删除元数据的异步任务
- 在编辑已有的顶点/边类型操作中,新增索引时,可建立创建索引的异步任务
2.任务详情
- 确认/保存后,可跳转到任务中心查看当前任务的详情
3.4 - HugeGraph-AI Quick Start
1 HugeGraph-AI 概述
hugegraph-ai 旨在探索 HugeGraph 与人工智能(AI)的融合,包括与大模型结合的应用,与图机器学习组件的集成等,为开发者在项目中利用 HugeGraph 的 AI 能力提供全面支持。
2 环境要求
- python 3.9+(建议使用 3.10)
- hugegraph-server 1.3+
3 准备工作
- 启动 HugeGraph 数据库,可以通过 Docker/Binary Package 运行它,请参阅详细文档以获取更多指导
- 克隆项目
git clone https://github.com/apache/incubator-hugegraph-ai.git
- 安装 hugegraph-python-client 和 hugegraph_llm
cd ./incubator-hugegraph-ai  # better to use virtualenv (source venv/bin/activate)
pip install ./hugegraph-python-client
pip install -r ./hugegraph-llm/requirements.txt
- 进入项目目录
cd ./hugegraph-llm/src
- 启动 Graph RAG 的 gradio 交互 demo,可以使用以下命令运行,启动后打开 http://127.0.0.1:8001
python3 -m hugegraph_llm.demo.rag_demo.app
默认主机为 0.0.0.0,端口为 8001。您可以通过传递命令行参数 --host 和 --port 来更改它们。
python3 -m hugegraph_llm.demo.rag_demo.app --host 127.0.0.1 --port 18001
- 启动 Text2Gremlin 的 gradio 交互演示,可以使用以下命令运行,启动后打开 http://127.0.0.1:8002,您还可以按上述方式更改默认主机 0.0.0.0 和端口 8002。(🚧ing)
python3 -m hugegraph_llm.demo.gremlin_generate_web_demo
- 在运行演示程序后,配置文件 .env 将自动生成在 hugegraph-llm/.env 路径下。此外,还有一个与 prompt 相关的配置文件 config_prompt.yaml,也会在 hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml 路径下生成。您可以在页面上修改内容,触发相应功能后会自动保存到配置文件中;也可以直接修改文件而无需重启应用程序,只需刷新页面即可加载最新的更改。(可选)要重新生成配置文件,可以将 config.generate 与 -u 或 --update 一起使用。
python3 -m hugegraph_llm.config.generate --update
- (可选)您可以使用 hugegraph-hubble 来访问图形数据,可以通过 Docker/Docker-Compose 运行它以获得指导。 (Hubble 是一个图形分析仪表板,包括数据加载/模式管理/图形遍历/显示)。
- (可选)离线下载 NLTK 停用词
python ./hugegraph_llm/operators/common_op/nltk_helper.py
4 示例
4.1 通过 LLM 在 HugeGraph 中构建知识图谱
4.1.1 通过 gradio 交互式界面构建知识图谱
参数描述:
- Docs:
- text: 从纯文本建立 rag 索引
- file: 上传文件:TXT 或 .docx(可同时选择多个文件)
- Schema:(接受2种类型)
- 用户定义模式( JSON 格式,遵循模板来修改它)
- 指定 HugeGraph 图实例的名称,它将自动从中获取模式(如 “hugegraph”)
- Graph extract head: 用户自定义的图提取提示
- 如果已经存在图数据,你应该点击 “Rebuild vid Index” 来更新索引
4.1.2 通过代码构建知识图谱
KgBuilder 类用于构建知识图谱。下面是使用过程:
- 初始化:KgBuilder 类使用语言模型的实例进行初始化,该实例可以从 LLMs 类中获得。初始化 LLMs 实例并获取 LLM,然后创建一个任务实例 KgBuilder 用于图的构建。KgBuilder 定义了多个运算符,用户可以根据需要自由组合它们。(提示:print_result() 可以在控制台中打印出每一步的结果,而不会影响整个执行逻辑)
from hugegraph_llm.models.llms.init_llm import LLMs
from hugegraph_llm.operators.kg_construction_task import KgBuilder

TEXT = ""
builder = KgBuilder(LLMs().get_llm())
(
    builder
    .import_schema(from_hugegraph="talent_graph").print_result()
    .chunk_split(TEXT).print_result()
    .extract_info(extract_type="property_graph").print_result()
    .commit_to_hugegraph()
    .run()
)
- 导入架构:import_schema 方法用于从源导入架构,源可以是 HugeGraph 实例、用户定义的模式或提取结果,可以链式调用 print_result 方法来打印结果。
# Import schema from a HugeGraph instance
builder.import_schema(from_hugegraph="xxx").print_result()
# Import schema from an extraction result
builder.import_schema(from_extraction="xxx").print_result()
# Import schema from user-defined schema
builder.import_schema(from_user_defined="xxx").print_result()
- Chunk Split:chunk_split 方法用于将输入文本分割为块,文本应该作为字符串参数传递给该方法。
# Split the input text into documents
builder.chunk_split(TEXT, split_type="document").print_result()
# Split the input text into paragraphs
builder.chunk_split(TEXT, split_type="paragraph").print_result()
# Split the input text into sentences
builder.chunk_split(TEXT, split_type="sentence").print_result()
- 信息抽取:extract_info 方法用于从文本中提取信息,文本应该作为字符串参数传递给该方法。
TEXT = "Meet Sarah, a 30-year-old attorney, and her roommate, James, whom she's shared a home with since 2010."
# extract property graph from the input text
builder.extract_info(extract_type="property_graph").print_result()
# extract triples from the input text
builder.extract_info(extract_type="property_graph").print_result()
- Commit to HugeGraph:commit_to_hugegraph 方法用于将构建的知识图谱提交到 HugeGraph 实例。
builder.commit_to_hugegraph().print_result()
- Run:run 方法用于执行链式操作。
builder.run()
KgBuilder 类的方法可以链式调用,以执行一系列操作。
4.2 基于 HugeGraph 的检索增强生成(RAG)
RAGPipeline 类用于将 HugeGraph 与大型语言模型集成,以提供检索增强生成功能。下面是使用过程:
- 提取关键字:提取关键字并扩展同义词。
from hugegraph_llm.operators.graph_rag_task import RAGPipeline
graph_rag = RAGPipeline()
graph_rag.extract_keywords(text="Tell me about Al Pacino.").print_result()
- 根据关键字匹配 Vid:将关键字与图中的节点进行匹配。
graph_rag.keywords_to_vid().print_result()
- RAG 查询图:根据关键词从 HugeGraph 中检索对应的顶点及其多度关联关系。
graph_rag.query_graphdb(max_deep=2, max_items=30).print_result()
- 重新排序搜索结果:根据问题与结果之间的相似性对搜索结果进行重新排序。
graph_rag.merge_dedup_rerank().print_result()
- 合成答案:总结结果并组织语言来回答问题。
graph_rag.synthesize_answer(vector_only_answer=False, graph_only_answer=True).print_result()
- 运行:run 方法用于执行上述操作。
graph_rag.run(verbose=True)
3.5 - HugeGraph-Client Quick Start
1 HugeGraph-Client 概述
HugeGraph-Client 向 HugeGraph-Server 发出 HTTP 请求,获取并解析 Server 的执行结果。
提供了 Java/Go/Python 版,
用户可以使用 Client-API 编写代码操作 HugeGraph,比如元数据和图数据的增删改查,或者执行 gremlin 语句等。
后文主要是 Java 使用示例(其他语言 SDK 可参考对应 README 页面)
现在已经支持基于 Go 语言的 HugeGraph Client SDK (version >=1.2.0)
2 环境要求
- java 11 (兼容 java 8)
- maven 3.5+
3 使用流程
使用 HugeGraph-Client 的基本步骤如下:
- 新建 Eclipse/IDEA Maven 项目;
- 在 pom 文件中添加 HugeGraph-Client 依赖;
- 创建类,调用 HugeGraph-Client 接口;
详细使用过程见下节完整示例。
4 完整示例
4.1 新建 Maven 工程
可以选择 Eclipse 或者 Intellij Idea 创建工程:
4.2 添加 hugegraph-client 依赖
添加 hugegraph-client 依赖
<dependencies>
<dependency>
<groupId>org.apache.hugegraph</groupId>
<artifactId>hugegraph-client</artifactId>
<!-- Update to the latest release version -->
<version>1.5.0</version>
</dependency>
</dependencies>
注:Graph 所有组件版本号均保持一致
4.3 Example
4.3.1 SingleExample
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.hugegraph.driver.GraphManager;
import org.apache.hugegraph.driver.GremlinManager;
import org.apache.hugegraph.driver.HugeClient;
import org.apache.hugegraph.driver.SchemaManager;
import org.apache.hugegraph.structure.constant.T;
import org.apache.hugegraph.structure.graph.Edge;
import org.apache.hugegraph.structure.graph.Path;
import org.apache.hugegraph.structure.graph.Vertex;
import org.apache.hugegraph.structure.gremlin.Result;
import org.apache.hugegraph.structure.gremlin.ResultSet;
public class SingleExample {
public static void main(String[] args) throws IOException {
        // If the connection fails, an exception will be thrown.
HugeClient hugeClient = HugeClient.builder("http://localhost:8080",
"hugegraph")
.build();
SchemaManager schema = hugeClient.schema();
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("weight").asDouble().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
schema.propertyKey("date").asDate().ifNotExist().create();
schema.propertyKey("price").asInt().ifNotExist().create();
schema.vertexLabel("person")
.properties("name", "age", "city")
.primaryKeys("name")
.ifNotExist()
.create();
schema.vertexLabel("software")
.properties("name", "lang", "price")
.primaryKeys("name")
.ifNotExist()
.create();
schema.indexLabel("personByCity")
.onV("person")
.by("city")
.secondary()
.ifNotExist()
.create();
schema.indexLabel("personByAgeAndCity")
.onV("person")
.by("age", "city")
.secondary()
.ifNotExist()
.create();
schema.indexLabel("softwareByPrice")
.onV("software")
.by("price")
.range()
.ifNotExist()
.create();
schema.edgeLabel("knows")
.sourceLabel("person")
.targetLabel("person")
.properties("date", "weight")
.ifNotExist()
.create();
schema.edgeLabel("created")
.sourceLabel("person").targetLabel("software")
.properties("date", "weight")
.ifNotExist()
.create();
schema.indexLabel("createdByDate")
.onE("created")
.by("date")
.secondary()
.ifNotExist()
.create();
schema.indexLabel("createdByWeight")
.onE("created")
.by("weight")
.range()
.ifNotExist()
.create();
schema.indexLabel("knowsByWeight")
.onE("knows")
.by("weight")
.range()
.ifNotExist()
.create();
GraphManager graph = hugeClient.graph();
Vertex marko = graph.addVertex(T.LABEL, "person", "name", "marko",
"age", 29, "city", "Beijing");
Vertex vadas = graph.addVertex(T.LABEL, "person", "name", "vadas",
"age", 27, "city", "Hongkong");
Vertex lop = graph.addVertex(T.LABEL, "software", "name", "lop",
"lang", "java", "price", 328);
Vertex josh = graph.addVertex(T.LABEL, "person", "name", "josh",
"age", 32, "city", "Beijing");
Vertex ripple = graph.addVertex(T.LABEL, "software", "name", "ripple",
"lang", "java", "price", 199);
Vertex peter = graph.addVertex(T.LABEL, "person", "name", "peter",
"age", 35, "city", "Shanghai");
marko.addEdge("knows", vadas, "date", "2016-01-10", "weight", 0.5);
marko.addEdge("knows", josh, "date", "2013-02-20", "weight", 1.0);
marko.addEdge("created", lop, "date", "2017-12-10", "weight", 0.4);
josh.addEdge("created", lop, "date", "2009-11-11", "weight", 0.4);
josh.addEdge("created", ripple, "date", "2017-12-10", "weight", 1.0);
peter.addEdge("created", lop, "date", "2017-03-24", "weight", 0.2);
GremlinManager gremlin = hugeClient.gremlin();
System.out.println("==== Path ====");
ResultSet resultSet = gremlin.gremlin("g.V().outE().path()").execute();
Iterator<Result> results = resultSet.iterator();
results.forEachRemaining(result -> {
System.out.println(result.getObject().getClass());
Object object = result.getObject();
if (object instanceof Vertex) {
System.out.println(((Vertex) object).id());
} else if (object instanceof Edge) {
System.out.println(((Edge) object).id());
} else if (object instanceof Path) {
List<Object> elements = ((Path) object).objects();
elements.forEach(element -> {
System.out.println(element.getClass());
System.out.println(element);
});
} else {
System.out.println(object);
}
});
hugeClient.close();
}
}
4.3.2 BatchExample
import java.util.ArrayList;
import java.util.List;
import org.apache.hugegraph.driver.GraphManager;
import org.apache.hugegraph.driver.HugeClient;
import org.apache.hugegraph.driver.SchemaManager;
import org.apache.hugegraph.structure.graph.Edge;
import org.apache.hugegraph.structure.graph.Vertex;
public class BatchExample {
public static void main(String[] args) {
        // If the connection fails, an exception will be thrown.
HugeClient hugeClient = HugeClient.builder("http://localhost:8080",
"hugegraph")
.build();
SchemaManager schema = hugeClient.schema();
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
schema.propertyKey("date").asDate().ifNotExist().create();
schema.propertyKey("price").asInt().ifNotExist().create();
schema.vertexLabel("person")
.properties("name", "age")
.primaryKeys("name")
.ifNotExist()
.create();
schema.vertexLabel("person")
.properties("price")
.nullableKeys("price")
.append();
schema.vertexLabel("software")
.properties("name", "lang", "price")
.primaryKeys("name")
.ifNotExist()
.create();
schema.indexLabel("softwareByPrice")
.onV("software").by("price")
.range()
.ifNotExist()
.create();
schema.edgeLabel("knows")
.link("person", "person")
.properties("date")
.ifNotExist()
.create();
schema.edgeLabel("created")
.link("person", "software")
.properties("date")
.ifNotExist()
.create();
schema.indexLabel("createdByDate")
.onE("created").by("date")
.secondary()
.ifNotExist()
.create();
// get schema object by name
System.out.println(schema.getPropertyKey("name"));
System.out.println(schema.getVertexLabel("person"));
System.out.println(schema.getEdgeLabel("knows"));
System.out.println(schema.getIndexLabel("createdByDate"));
// list all schema objects
System.out.println(schema.getPropertyKeys());
System.out.println(schema.getVertexLabels());
System.out.println(schema.getEdgeLabels());
System.out.println(schema.getIndexLabels());
GraphManager graph = hugeClient.graph();
Vertex marko = new Vertex("person").property("name", "marko")
.property("age", 29);
Vertex vadas = new Vertex("person").property("name", "vadas")
.property("age", 27);
Vertex lop = new Vertex("software").property("name", "lop")
.property("lang", "java")
.property("price", 328);
Vertex josh = new Vertex("person").property("name", "josh")
.property("age", 32);
Vertex ripple = new Vertex("software").property("name", "ripple")
.property("lang", "java")
.property("price", 199);
Vertex peter = new Vertex("person").property("name", "peter")
.property("age", 35);
Edge markoKnowsVadas = new Edge("knows").source(marko).target(vadas)
.property("date", "2016-01-10");
Edge markoKnowsJosh = new Edge("knows").source(marko).target(josh)
.property("date", "2013-02-20");
Edge markoCreateLop = new Edge("created").source(marko).target(lop)
.property("date",
"2017-12-10");
Edge joshCreateRipple = new Edge("created").source(josh).target(ripple)
.property("date",
"2017-12-10");
Edge joshCreateLop = new Edge("created").source(josh).target(lop)
.property("date", "2009-11-11");
Edge peterCreateLop = new Edge("created").source(peter).target(lop)
.property("date",
"2017-03-24");
List<Vertex> vertices = new ArrayList<>();
vertices.add(marko);
vertices.add(vadas);
vertices.add(lop);
vertices.add(josh);
vertices.add(ripple);
vertices.add(peter);
List<Edge> edges = new ArrayList<>();
edges.add(markoKnowsVadas);
edges.add(markoKnowsJosh);
edges.add(markoCreateLop);
edges.add(joshCreateRipple);
edges.add(joshCreateLop);
edges.add(peterCreateLop);
vertices = graph.addVertices(vertices);
vertices.forEach(vertex -> System.out.println(vertex));
edges = graph.addEdges(edges, false);
edges.forEach(edge -> System.out.println(edge));
hugeClient.close();
}
}
4.4 运行 Example
运行 Example 之前需要启动 Server,启动过程见 HugeGraph-Server Quick Start
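下面是一种运行示例类的示意方式(假设示例类 SingleExample 位于该 Maven 工程的默认包下,且 Server 已在本机 8080 端口启动,命令仅供参考):
# 编译并运行示例(示意命令)
mvn -q compile exec:java -Dexec.mainClass="SingleExample"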
4.5 详细 API 说明
3.6 - HugeGraph-Tools Quick Start
1 HugeGraph-Tools概述
HugeGraph-Tools 是 HugeGraph 的自动化部署、管理和备份/还原组件。
2 获取 HugeGraph-Tools
有两种方式可以获取 HugeGraph-Tools(它被包含在 Toolchain 中):
- 下载二进制tar包
- 下载源码编译安装
2.1 下载二进制tar包
下载最新版本的 HugeGraph-Toolchain 包, 然后进入 tools 子目录
wget https://downloads.apache.org/incubator/hugegraph/1.0.0/apache-hugegraph-toolchain-incubating-1.0.0.tar.gz
tar zxf *hugegraph*.tar.gz
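解压后进入其中的 tools 子目录即可使用(目录名随版本变化,以下仅为 1.0.0 版本的示意,请以实际解压结果为准):
cd apache-hugegraph-toolchain-incubating-1.0.0/apache-hugegraph-tools-incubating-1.0.0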
2.2 下载源码编译安装
源码编译前请确保安装了 wget 命令
下载最新版本的 HugeGraph-Toolchain 源码包,然后在根目录编译,或者单独编译 tools 子模块:
# 1. get from github
git clone https://github.com/apache/hugegraph-toolchain.git
# 2. get from direct (e.g. here is 1.0.0, please choose the latest version)
wget https://downloads.apache.org/incubator/hugegraph/1.0.0/apache-hugegraph-toolchain-incubating-1.0.0-src.tar.gz
编译生成 tar 包:
cd hugegraph-tools
mvn package -DskipTests
生成 tar 包 hugegraph-tools-${version}.tar.gz
3 使用
3.1 功能概览
解压后,进入 hugegraph-tools 目录,可以使用 bin/hugegraph 或者 bin/hugegraph help 来查看 usage 信息。主要分为:
- 图管理类,graph-mode-set、graph-mode-get、graph-list、graph-get 和 graph-clear
- 异步任务管理类,task-list、task-get、task-delete、task-cancel 和 task-clear
- Gremlin类,gremlin-execute 和 gremlin-schedule
- 备份/恢复类,backup、restore、migrate、schedule-backup 和 dump
- 安装部署类,deploy、clear、start-all 和 stop-all
Usage: hugegraph [options] [command] [command options]
3.2 [options]-全局变量
options 是 HugeGraph-Tools 的全局变量,可以在 hugegraph-tools/bin/hugegraph 中配置,包括:
- –graph,HugeGraph-Tools 操作的图的名字,默认值是 hugegraph
- –url,HugeGraph-Server 的服务地址,默认是 http://127.0.0.1:8080
- –user,当 HugeGraph-Server 开启认证时,传递用户名
- –password,当 HugeGraph-Server 开启认证时,传递用户的密码
- –timeout,连接 HugeGraph-Server 时的超时时间,默认是 30s
- –trust-store-file,证书文件的路径,当 –url 使用 https 时,HugeGraph-Client 使用的 truststore 文件,默认为空,代表使用 hugegraph-tools 内置的 truststore 文件 conf/hugegraph.truststore
- –trust-store-password,证书文件的密码,当 –url 使用 https 时,HugeGraph-Client 使用的 truststore 的密码,默认为空,代表使用 hugegraph-tools 内置的 truststore 文件的密码
上述全局变量,也可以通过环境变量来设置。一种方式是在命令行使用 export 设置临时环境变量,在该命令行关闭之前均有效
全局变量 | 环境变量 | 示例 |
---|---|---|
–url | HUGEGRAPH_URL | export HUGEGRAPH_URL=http://127.0.0.1:8080 |
–graph | HUGEGRAPH_GRAPH | export HUGEGRAPH_GRAPH=hugegraph |
–user | HUGEGRAPH_USERNAME | export HUGEGRAPH_USERNAME=admin |
–password | HUGEGRAPH_PASSWORD | export HUGEGRAPH_PASSWORD=test |
–timeout | HUGEGRAPH_TIMEOUT | export HUGEGRAPH_TIMEOUT=30 |
–trust-store-file | HUGEGRAPH_TRUST_STORE_FILE | export HUGEGRAPH_TRUST_STORE_FILE=/tmp/trust-store |
–trust-store-password | HUGEGRAPH_TRUST_STORE_PASSWORD | export HUGEGRAPH_TRUST_STORE_PASSWORD=xxxx |
另一种方式是在 bin/hugegraph 脚本中设置环境变量:
#!/bin/bash
# Set environment here if needed
#export HUGEGRAPH_URL=
#export HUGEGRAPH_GRAPH=
#export HUGEGRAPH_USERNAME=
#export HUGEGRAPH_PASSWORD=
#export HUGEGRAPH_TIMEOUT=
#export HUGEGRAPH_TRUST_STORE_FILE=
#export HUGEGRAPH_TRUST_STORE_PASSWORD=
3.3 图管理类,graph-mode-set、graph-mode-get、graph-list、graph-get和graph-clear
- graph-mode-set,设置图的 restore mode
- –graph-mode 或者 -m,必填项,指定将要设置的模式,合法值包括 [NONE, RESTORING, MERGING, LOADING]
- graph-mode-get,获取图的 restore mode
- graph-list,列出某个 HugeGraph-Server 中全部的图
- graph-get,获取某个图及其存储后端类型
- graph-clear,清除某个图的全部 schema 和 data
- –confirm-message 或者 -c,必填项,删除确认信息,需要手动输入,二次确认防止误删,“I’m sure to delete all data”,包括双引号
当需要把备份的图原样恢复到一个新的图中的时候,需要先将图模式设置为 RESTORING 模式;当需要将备份的图合并到已存在的图中时,需要先将图模式设置为 MERGING 模式。
3.4 异步任务管理类,task-list、task-get和task-delete
- task-list,列出某个图中的异步任务,可以根据任务的状态过滤
- –status,选填项,指定要查看的任务的状态,即按状态过滤任务
- –limit,选填项,指定要获取的任务的数目,默认为 -1,意思为获取全部符合条件的任务
- task-get,获取某个异步任务的详细信息
- –task-id,必填项,指定异步任务的 ID
- task-delete,删除某个异步任务的信息
- –task-id,必填项,指定异步任务的 ID
- task-cancel,取消某个异步任务的执行
- –task-id,要取消的异步任务的 ID
- task-clear,清理完成的异步任务
- –force,选填项,设置时,表示清理全部异步任务,未执行完成的先取消,然后清除所有异步任务。默认只清理已完成的异步任务
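基于上述参数,下面给出几条异步任务管理的示意命令(假设 Server 运行在本机 8080 端口,任务 ID 为示例值):
# 查看指定任务详情
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-get --task-id 12
# 删除指定任务信息
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-delete --task-id 12
# 清理已完成的异步任务(加 --force 时会先取消未完成任务再全部清理)
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-clear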
3.5 Gremlin类,gremlin-execute和gremlin-schedule
- gremlin-execute,发送 Gremlin 语句到 HugeGraph-Server 来执行查询或修改操作,同步执行,结束后返回结果
- –file 或者 -f,指定要执行的脚本文件,UTF-8编码,与 –script 互斥
- –script 或者 -s,指定要执行的脚本字符串,与 –file 互斥
- –aliases 或者 -a,Gremlin 别名设置,格式为:key1=value1,key2=value2,…
- –bindings 或者 -b,Gremlin 绑定设置,格式为:key1=value1,key2=value2,…
- –language 或者 -l,Gremlin 脚本的语言,默认为 gremlin-groovy
–file 和 –script 二者互斥,必须设置其中之一
- gremlin-schedule,发送 Gremlin 语句到 HugeGraph-Server 来执行查询或修改操作,异步执行,任务提交后立刻返回异步任务id
- –file 或者 -f,指定要执行的脚本文件,UTF-8编码,与 –script 互斥
- –script 或者 -s,指定要执行的脚本字符串,与 –file 互斥
- –bindings 或者 -b,Gremlin 绑定设置,格式为:key1=value1,key2=value2,…
- –language 或者 -l,Gremlin 脚本的语言,默认为 gremlin-groovy
–file 和 –script 二者互斥,必须设置其中之一
3.6 备份/恢复类
- backup,将某张图中的 schema 或者 data 备份到 HugeGraph 系统之外,以 JSON 形式存在本地磁盘或者 HDFS
- –format,备份的格式,可选值包括 [json, text],默认为 json
- –all-properties,是否备份顶点/边全部的属性,仅在 –format 为 text 是有效,默认 false
- –label,要备份的顶点/边的类型,仅在 –format 为 text 是有效,只有备份顶点或者边的时候有效
- –properties,要备份的顶点/边的属性,逗号分隔,仅在 –format 为 text 是有效,只有备份顶点或者边的时候有效
- –compress,备份时是否压缩数据,默认为 true
- –directory 或者 -d,存储 schema 或者 data 的目录,本地目录时,默认为’./{graphName}’,HDFS 时,默认为 ‘{fs.default.name}/{graphName}’
- –huge-types 或者 -t,要备份的数据类型,逗号分隔,可选值为 ‘all’ 或者 一个或多个 [vertex,edge,vertex_label,edge_label,property_key,index_label] 的组合,‘all’ 代表全部6种类型,即顶点、边和所有schema
- –log 或者 -l,指定日志目录,默认为当前目录
- –retry,指定失败重试次数,默认为 3
- –split-size 或者 -s,指定在备份时对顶点或者边分块的大小,默认为 1048576
- -D,用 -Dkey=value 的模式指定动态参数,用来备份数据到 HDFS 时,指定 HDFS 的配置项,例如:-Dfs.default.name=hdfs://localhost:9000
- restore,将 JSON 格式存储的 schema 或者 data 恢复到一个新图中(RESTORING 模式)或者合并到已存在的图中(MERGING 模式)
- –directory 或者 -d,存储 schema 或者 data 的目录,本地目录时,默认为’./{graphName}’,HDFS 时,默认为 ‘{fs.default.name}/{graphName}’
- –clean,是否在恢复图完成后删除 –directory 指定的目录,默认为 false
- –huge-types 或者 -t,要恢复的数据类型,逗号分隔,可选值为 ‘all’ 或者 一个或多个 [vertex,edge,vertex_label,edge_label,property_key,index_label] 的组合,‘all’ 代表全部6种类型,即顶点、边和所有schema
- –log 或者 -l,指定日志目录,默认为当前目录
- –retry,指定失败重试次数,默认为 3
- -D,用 -Dkey=value 的模式指定动态参数,用来从 HDFS 恢复图时,指定 HDFS 的配置项,例如:-Dfs.default.name=hdfs://localhost:9000
只有当 –format 为 json 执行 backup 时,才可以使用 restore 命令恢复
- migrate, 将当前连接的图迁移至另一个 HugeGraphServer 中
- –target-graph,目标图的名字,默认为 hugegraph
- –target-url,目标图所在的 HugeGraphServer,默认为 http://127.0.0.1:8081
- –target-username,访问目标图的用户名
- –target-password,访问目标图的密码
- –target-timeout,访问目标图的超时时间
- –target-trust-store-file,访问目标图使用的 truststore 文件
- –target-trust-store-password,访问目标图使用的 truststore 的密码
- –directory 或者 -d,迁移过程中,存储源图的 schema 或者 data 的目录,本地目录时,默认为’./{graphName}’,HDFS 时,默认为 ‘{fs.default.name}/{graphName}’
- –huge-types 或者 -t,要迁移的数据类型,逗号分隔,可选值为 ‘all’ 或者 一个或多个 [vertex,edge,vertex_label,edge_label,property_key,index_label] 的组合,‘all’ 代表全部6种类型,即顶点、边和所有schema
- –log 或者 -l,指定日志目录,默认为当前目录
- –retry,指定失败重试次数,默认为 3
- –split-size 或者 -s,指定迁移过程中对源图进行备份时顶点或者边分块的大小,默认为 1048576
- -D,用 -Dkey=value 的模式指定动态参数,用来在迁移图过程中需要备份数据到 HDFS 时,指定 HDFS 的配置项,例如:-Dfs.default.name=hdfs://localhost:9000
- –graph-mode 或者 -m,将源图恢复到目标图时将目标图设置的模式,合法值包括 [RESTORING, MERGING]
- –keep-local-data,是否保留在迁移图的过程中产生的源图的备份,默认为 false,即默认迁移图结束后不保留产生的源图备份
- schedule-backup,周期性对图执行备份操作,并保留一定数目的最新备份(目前仅支持本地文件系统)
- –directory 或者 -d,必填项,指定备份数据的目录
- –backup-num,选填项,指定保存的最新的备份的数目,默认为 3
- –interval,选填项,指定进行备份的周期,格式同 Linux crontab 格式
- dump,把整张图的顶点和边全部导出,默认以 vertex vertex-edge1 vertex-edge2... 的 JSON 格式存储。用户也可以自定义存储格式,只需要在 hugegraph-tools/src/main/java/com/baidu/hugegraph/formatter 目录下实现一个继承自 Formatter 的类,例如 CustomFormatter,使用时指定该类为 formatter 即可,例如 bin/hugegraph dump -f CustomFormatter(用法示例见下文)
- –formatter 或者 -f,指定使用的 formatter,默认为 JsonFormatter
- –directory 或者 -d,存储 schema 或者 data 的目录,默认为当前目录
- –log 或者 -l,指定日志目录,默认为当前目录
- –retry,指定失败重试次数,默认为 3
- –split-size 或者 -s,指定在备份时对顶点或者边分块的大小,默认为 1048576
- -D,用 -Dkey=value 的模式指定动态参数,用来备份数据到 HDFS 时,指定 HDFS 的配置项,例如:-Dfs.default.name=hdfs://localhost:9000
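下面是一条 dump 的示意命令(假设 Server 运行在本机 8080 端口,导出目录为示例值):
# 使用默认的 JsonFormatter 将整张图导出到 ./dump-data 目录
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph dump -d ./dump-data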
3.7 安装部署类
- deploy,一键下载、安装和启动 HugeGraph-Server 和 HugeGraph-Studio
- -v,必填项,指明安装的 HugeGraph-Server 和 HugeGraph-Studio 的版本号,最新的是 0.9
- -p,必填项,指定安装的 HugeGraph-Server 和 HugeGraph-Studio 目录
- -u,选填项,指定下载 HugeGraph-Server 和 HugeGraph-Studio 压缩包的链接
- clear,清理 HugeGraph-Server 和 HugeGraph-Studio 目录和tar包
- -p,必填项,指定要清理的 HugeGraph-Server 和 HugeGraph-Studio 的目录
- start-all,一键启动 HugeGraph-Server 和 HugeGraph-Studio,并启动监控,服务死掉时自动拉起服务
- -v,必填项,指明要启动的 HugeGraph-Server 和 HugeGraph-Studio 的版本号,最新的是 0.9
- -p,必填项,指定安装了 HugeGraph-Server 和 HugeGraph-Studio 的目录
- stop-all,一键关闭 HugeGraph-Server 和 HugeGraph-Studio
deploy 命令中有可选参数 -u,提供时会使用指定的下载地址替代默认下载地址下载 tar 包,并且将地址写入 ~/hugegraph-download-url-prefix 文件中;之后如果不指定地址,会优先从 ~/hugegraph-download-url-prefix 指定的地址下载 tar 包;如果 -u 和 ~/hugegraph-download-url-prefix 都没有时,则从默认下载地址进行下载
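以下为 deploy/start-all/stop-all 的示意命令(版本号与安装目录均为示例值,请按实际情况替换):
# 一键下载、安装并启动(示例版本号 0.9,安装到 ./hg-install 目录)
./bin/hugegraph deploy -v 0.9 -p ./hg-install
# 启动并开启监控
./bin/hugegraph start-all -v 0.9 -p ./hg-install
# 关闭
./bin/hugegraph stop-all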
3.8 具体命令参数
各子命令的具体参数如下:
Usage: hugegraph [options] [command] [command options]
Options:
--graph
Name of graph
Default: hugegraph
--password
Password of user
--timeout
Connection timeout
Default: 30
--trust-store-file
The path of client truststore file used when https protocol is enabled
--trust-store-password
The password of the client truststore file used when the https protocol
is enabled
--url
The URL of HugeGraph-Server
Default: http://127.0.0.1:8080
--user
Name of user
Commands:
graph-list List all graphs
Usage: graph-list
graph-get Get graph info
Usage: graph-get
graph-clear Clear graph schema and data
Usage: graph-clear [options]
Options:
* --confirm-message, -c
Confirm message of graph clear is "I'm sure to delete all data".
(Note: include "")
graph-mode-set Set graph mode
Usage: graph-mode-set [options]
Options:
* --graph-mode, -m
Graph mode, include: [NONE, RESTORING, MERGING]
Possible Values: [NONE, RESTORING, MERGING, LOADING]
graph-mode-get Get graph mode
Usage: graph-mode-get
task-list List tasks
Usage: task-list [options]
Options:
--limit
Limit number, no limit if not provided
Default: -1
--status
Status of task
task-get Get task info
Usage: task-get [options]
Options:
* --task-id
Task id
Default: 0
task-delete Delete task
Usage: task-delete [options]
Options:
* --task-id
Task id
Default: 0
task-cancel Cancel task
Usage: task-cancel [options]
Options:
* --task-id
Task id
Default: 0
task-clear Clear completed tasks
Usage: task-clear [options]
Options:
--force
Force to clear all tasks, cancel all uncompleted tasks firstly,
and delete all completed tasks
Default: false
gremlin-execute Execute Gremlin statements
Usage: gremlin-execute [options]
Options:
--aliases, -a
Gremlin aliases, valid format is: 'key1=value1,key2=value2...'
Default: {}
--bindings, -b
Gremlin bindings, valid format is: 'key1=value1,key2=value2...'
Default: {}
--file, -f
Gremlin Script file to be executed, UTF-8 encoded, exclusive to
--script
--language, -l
Gremlin script language
Default: gremlin-groovy
--script, -s
Gremlin script to be executed, exclusive to --file
gremlin-schedule Execute Gremlin statements as asynchronous job
Usage: gremlin-schedule [options]
Options:
--bindings, -b
Gremlin bindings, valid format is: 'key1=value1,key2=value2...'
Default: {}
--file, -f
Gremlin Script file to be executed, UTF-8 encoded, exclusive to
--script
--language, -l
Gremlin script language
Default: gremlin-groovy
--script, -s
Gremlin script to be executed, exclusive to --file
backup Backup graph schema/data. If directory is on HDFS, use -D to
set HDFS params. For exmaple:
-Dfs.default.name=hdfs://localhost:9000
Usage: backup [options]
Options:
--all-properties
All properties to be backup flag
Default: false
--compress
compress flag
Default: true
--directory, -d
Directory of graph schema/data, default is './{graphname}' in
local file system or '{fs.default.name}/{graphname}' in HDFS
--format
File format, valid is [json, text]
Default: json
--huge-types, -t
Type of schema/data. Concat with ',' if more than one. 'all' means
all vertices, edges and schema, in other words, 'all' equals with
'vertex,edge,vertex_label,edge_label,property_key,index_label'
Default: [PROPERTY_KEY, VERTEX_LABEL, EDGE_LABEL, INDEX_LABEL, VERTEX, EDGE]
--label
Vertex or edge label, only valid when type is vertex or edge
--log, -l
Directory of log
Default: ./logs
--properties
Vertex or edge properties to backup, only valid when type is
vertex or edge
Default: []
--retry
Retry times, default is 3
Default: 3
--split-size, -s
Split size of shard
Default: 1048576
-D
HDFS config parameters
Syntax: -Dkey=value
Default: {}
schedule-backup Schedule backup task
Usage: schedule-backup [options]
Options:
--backup-num
The number of latest backups to keep
Default: 3
* --directory, -d
The directory of backups stored
--interval
The interval of backup, format is: "a b c d e". 'a' means minute
(0 - 59), 'b' means hour (0 - 23), 'c' means day of month (1 -
31), 'd' means month (1 - 12), 'e' means day of week (0 - 6)
(Sunday=0), "*" means all
Default: "0 0 * * *"
dump Dump graph to files
Usage: dump [options]
Options:
--directory, -d
Directory of graph schema/data, default is './{graphname}' in
local file system or '{fs.default.name}/{graphname}' in HDFS
--formatter, -f
Formatter to customize format of vertex/edge
Default: JsonFormatter
--log, -l
Directory of log
Default: ./logs
--retry
Retry times, default is 3
Default: 3
--split-size, -s
Split size of shard
Default: 1048576
-D
HDFS config parameters
Syntax: -Dkey=value
Default: {}
restore Restore graph schema/data. If directory is on HDFS, use -D to
set HDFS params if needed. For
exmaple:-Dfs.default.name=hdfs://localhost:9000
Usage: restore [options]
Options:
--clean
Whether to remove the directory of graph data after restored
Default: false
--directory, -d
Directory of graph schema/data, default is './{graphname}' in
local file system or '{fs.default.name}/{graphname}' in HDFS
--huge-types, -t
Type of schema/data. Concat with ',' if more than one. 'all' means
all vertices, edges and schema, in other words, 'all' equals with
'vertex,edge,vertex_label,edge_label,property_key,index_label'
Default: [PROPERTY_KEY, VERTEX_LABEL, EDGE_LABEL, INDEX_LABEL, VERTEX, EDGE]
--log, -l
Directory of log
Default: ./logs
--retry
Retry times, default is 3
Default: 3
-D
HDFS config parameters
Syntax: -Dkey=value
Default: {}
migrate Migrate graph
Usage: migrate [options]
Options:
--directory, -d
Directory of graph schema/data, default is './{graphname}' in
local file system or '{fs.default.name}/{graphname}' in HDFS
--graph-mode, -m
Mode used when migrating to target graph, include: [RESTORING,
MERGING]
Default: RESTORING
Possible Values: [NONE, RESTORING, MERGING, LOADING]
--huge-types, -t
Type of schema/data. Concat with ',' if more than one. 'all' means
all vertices, edges and schema, in other words, 'all' equals with
'vertex,edge,vertex_label,edge_label,property_key,index_label'
Default: [PROPERTY_KEY, VERTEX_LABEL, EDGE_LABEL, INDEX_LABEL, VERTEX, EDGE]
--keep-local-data
Whether to keep the local directory of graph data after restored
Default: false
--log, -l
Directory of log
Default: ./logs
--retry
Retry times, default is 3
Default: 3
--split-size, -s
Split size of shard
Default: 1048576
--target-graph
The name of target graph to migrate
Default: hugegraph
--target-password
The password of target graph to migrate
--target-timeout
The timeout to connect target graph to migrate
Default: 0
--target-trust-store-file
The trust store file of target graph to migrate
--target-trust-store-password
The trust store password of target graph to migrate
--target-url
The url of target graph to migrate
Default: http://127.0.0.1:8081
--target-user
The username of target graph to migrate
-D
HDFS config parameters
Syntax: -Dkey=value
Default: {}
deploy Install HugeGraph-Server and HugeGraph-Studio
Usage: deploy [options]
Options:
* -p
Install path of HugeGraph-Server and HugeGraph-Studio
-u
Download url prefix path of HugeGraph-Server and HugeGraph-Studio
* -v
Version of HugeGraph-Server and HugeGraph-Studio
start-all Start HugeGraph-Server and HugeGraph-Studio
Usage: start-all [options]
Options:
* -p
Install path of HugeGraph-Server and HugeGraph-Studio
* -v
Version of HugeGraph-Server and HugeGraph-Studio
clear Clear HugeGraph-Server and HugeGraph-Studio
Usage: clear [options]
Options:
* -p
Install path of HugeGraph-Server and HugeGraph-Studio
stop-all Stop HugeGraph-Server and HugeGraph-Studio
Usage: stop-all
help Print usage
Usage: help
3.9 具体命令示例
1. gremlin语句
# 同步执行gremlin
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph gremlin-execute --script 'g.V().count()'
# 异步执行gremlin
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph gremlin-schedule --script 'g.V().count()'
2. 查看task情况
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-list
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-list --limit 5
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph task-list --status success
3. 图模式查看和设置
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-mode-set -m RESTORING MERGING NONE
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-mode-set -m RESTORING
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-mode-get
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-list
4. 清理图
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-clear -c "I'm sure to delete all data"
5. 图备份
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph backup -t all --directory ./backup-test
6. 周期性的备份
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph schedule-backup -d ./backup-0.10.2 --interval "*/2 * * * *"
7. 图恢复
# 设置图模式
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-mode-set -m RESTORING
# 恢复图
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph restore -t all --directory ./backup-test
# 恢复图模式
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph graph-mode-set -m NONE
8. 图迁移
./bin/hugegraph --url http://127.0.0.1:8080 --graph hugegraph migrate --target-url http://127.0.0.1:8090 --target-graph hugegraph
3.7 - HugeGraph-Computer Quick Start
1 HugeGraph-Computer 概述
HugeGraph-Computer 是分布式图处理系统(OLAP),是 Pregel 模型的一个实现,可以运行在 Kubernetes 上。
特性
- 支持分布式MPP图计算,集成HugeGraph作为图输入输出存储。
- 算法基于BSP(Bulk Synchronous Parallel)模型,通过多次并行迭代进行计算,每一次迭代都是一次超步。
- 自动内存管理。该框架永远不会出现 OOM(内存不足),因为如果它没有足够的内存来容纳所有数据,它会将一些数据拆分到磁盘。
- 即使是超级节点,其部分边或消息也可以保留在内存中(其余拆分到磁盘),因此不会丢失数据。
- 您可以从 HDFS 或 HugeGraph 或任何其他系统加载数据。
- 您可以将结果输出到 HDFS 或 HugeGraph,或任何其他系统。
- 易于开发新算法。您只需要像在单个服务器中一样专注于仅顶点处理,而不必担心消息传输和内存存储管理。
2 依赖
2.1 安装 Java 11 (JDK 11)
必须在 ≥ Java 11 的环境上启动 Computer,然后自行配置。
在往下阅读之前,务必执行 java -version 命令查看 JDK 版本。
3 开始
3.1 在本地运行 PageRank 算法
要使用 HugeGraph-Computer 运行算法,必须装有 Java 11 或更高版本。
还需要首先部署 HugeGraph-Server 和 Etcd。
有两种方式可以获取 HugeGraph-Computer:
- 下载已编译的压缩包
- 克隆源码编译打包
3.1.1 下载已编译的压缩包
下载最新版本的 HugeGraph-Computer release 包:
wget https://downloads.apache.org/incubator/hugegraph/${version}/apache-hugegraph-computer-incubating-${version}.tar.gz
tar zxvf apache-hugegraph-computer-incubating-${version}.tar.gz -C hugegraph-computer
3.1.2 克隆源码编译打包
克隆最新版本的 HugeGraph-Computer 源码包:
$ git clone https://github.com/apache/hugegraph-computer.git
编译生成tar包:
cd hugegraph-computer
mvn clean package -DskipTests
3.1.3 启动 master 节点
您可以使用 -c 参数指定配置文件,更多 computer 配置请看:Computer Config Options
cd hugegraph-computer
bin/start-computer.sh -d local -r master
3.1.4 启动 worker 节点
bin/start-computer.sh -d local -r worker
3.1.5 查询算法结果
3.1.5.1 为 server 启用 OLAP 索引查询
如果没有启用 OLAP 索引,则需要启用,更多参考:modify-graphs-read-mode
PUT http://localhost:8080/graphs/hugegraph/graph_read_mode
"ALL"
3.1.5.2 查询 page_rank 属性值:
curl "http://localhost:8080/graphs/hugegraph/graph/vertices?page&limit=3" | gunzip
3.2 在 Kubernetes 中运行 PageRank 算法
要使用 HugeGraph-Computer 运行算法,您需要先部署 HugeGraph-Server
3.2.1 安装 HugeGraph-Computer CRD
# Kubernetes version >= v1.16
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-crd.v1.yaml
# Kubernetes version < v1.16
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-crd.v1beta1.yaml
3.2.2 显示 CRD
kubectl get crd
NAME CREATED AT
hugegraphcomputerjobs.hugegraph.apache.org 2021-09-16T08:01:08Z
3.2.3 安装 hugegraph-computer-operator&etcd-server
kubectl apply -f https://raw.githubusercontent.com/apache/hugegraph-computer/master/computer-k8s-operator/manifest/hugegraph-computer-operator.yaml
3.2.4 等待 hugegraph-computer-operator&etcd-server 部署完成
kubectl get pod -n hugegraph-computer-operator-system
NAME READY STATUS RESTARTS AGE
hugegraph-computer-operator-controller-manager-58c5545949-jqvzl 1/1 Running 0 15h
hugegraph-computer-operator-etcd-28lm67jxk5 1/1 Running 0 15h
3.2.5 提交作业
更多 computer crd spec 请看: Computer CRD
更多 Computer 配置请看: Computer Config Options
cat <<EOF | kubectl apply --filename -
apiVersion: hugegraph.apache.org/v1
kind: HugeGraphComputerJob
metadata:
namespace: hugegraph-computer-operator-system
name: &jobName pagerank-sample
spec:
jobId: *jobName
algorithmName: page_rank
image: hugegraph/hugegraph-computer:latest # algorithm image url
jarFile: /hugegraph/hugegraph-computer/algorithm/builtin-algorithm.jar # algorithm jar path
pullPolicy: Always
workerCpu: "4"
workerMemory: "4Gi"
workerInstances: 5
computerConf:
job.partitions_count: "20"
algorithm.params_class: org.apache.hugegraph.computer.algorithm.centrality.pagerank.PageRankParams
hugegraph.url: http://${hugegraph-server-host}:${hugegraph-server-port} # hugegraph server url
hugegraph.name: hugegraph # hugegraph graph name
EOF
3.2.6 显示作业
kubectl get hcjob/pagerank-sample -n hugegraph-computer-operator-system
NAME JOBID JOBSTATUS
pagerank-sample pagerank-sample RUNNING
3.2.7 显示节点日志
# Show the master log
kubectl logs -l component=pagerank-sample-master -n hugegraph-computer-operator-system
# Show the worker log
kubectl logs -l component=pagerank-sample-worker -n hugegraph-computer-operator-system
# Show diagnostic log of a job
# 注意: 诊断日志仅在作业失败时存在,并且只会保存一小时。
kubectl get event --field-selector reason=ComputerJobFailed --field-selector involvedObject.name=pagerank-sample -n hugegraph-computer-operator-system
3.2.8 显示作业的成功事件
NOTE: it will only be saved for one hour
kubectl get event --field-selector reason=ComputerJobSucceed --field-selector involvedObject.name=pagerank-sample -n hugegraph-computer-operator-system
3.2.9 查询算法结果
如果输出到 Hugegraph-Server,则与 Locally 模式一致;如果输出到 HDFS,请检查 hugegraph-computer/results/{jobId} 目录下的结果文件。
4 内置算法文档
4.1 支持的算法列表:
中心性算法:
- PageRank
- BetweennessCentrality
- ClosenessCentrality
- DegreeCentrality
社区算法:
- ClusteringCoefficient
- Kcore
- Lpa
- TriangleCount
- Wcc
路径算法:
- RingsDetection
- RingsDetectionWithFilter
更多算法请看: Built-In algorithms
4.2 算法描述
TODO
5 算法开发指南
TODO
6 注意事项
- 如果 computer-k8s 模块下面的某些类不存在,你需要运行 mvn compile 来提前生成对应的类。
4 - Config
4.1 - HugeGraph 配置
1 概述
配置文件的目录为 hugegraph-release/conf,所有关于服务和图本身的配置都在此目录下。
主要的配置文件包括:gremlin-server.yaml、rest-server.properties 和 hugegraph.properties
HugeGraphServer 内部集成了 GremlinServer 和 RestServer,而 gremlin-server.yaml 和 rest-server.properties 就是用来配置这两个 Server 的。
- GremlinServer:GremlinServer 接受用户的 gremlin 语句,解析后转而调用 Core 的代码。
- RestServer:提供 RESTful API,根据不同的 HTTP 请求,调用对应的 Core API,如果用户请求体是 gremlin 语句,则会转发给 GremlinServer,实现对图数据的操作。
下面对这三个配置文件逐一介绍。
2 gremlin-server.yaml
gremlin-server.yaml 文件默认的内容如下:
# host and port of gremlin server, need to be consistent with host and port in rest-server.properties
#host: 127.0.0.1
#port: 8182
# Gremlin 查询中的超时时间(以毫秒为单位)
evaluationTimeout: 30000
channelizer: org.apache.tinkerpop.gremlin.server.channel.WsAndHttpChannelizer
# 不要在此处设置图形,此功能将在支持动态添加图形后再进行处理
graphs: {
}
scriptEngines: {
gremlin-groovy: {
staticImports: [
org.opencypher.gremlin.process.traversal.CustomPredicates.*',
org.opencypher.gremlin.traversal.CustomFunctions.*
],
plugins: {
org.apache.hugegraph.plugin.HugeGraphGremlinPlugin: {},
org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {
classImports: [
java.lang.Math,
org.apache.hugegraph.backend.id.IdGenerator,
org.apache.hugegraph.type.define.Directions,
org.apache.hugegraph.type.define.NodeRole,
org.apache.hugegraph.traversal.algorithm.CollectionPathsTraverser,
org.apache.hugegraph.traversal.algorithm.CountTraverser,
org.apache.hugegraph.traversal.algorithm.CustomizedCrosspointsTraverser,
org.apache.hugegraph.traversal.algorithm.CustomizePathsTraverser,
org.apache.hugegraph.traversal.algorithm.FusiformSimilarityTraverser,
org.apache.hugegraph.traversal.algorithm.HugeTraverser,
org.apache.hugegraph.traversal.algorithm.JaccardSimilarTraverser,
org.apache.hugegraph.traversal.algorithm.KneighborTraverser,
org.apache.hugegraph.traversal.algorithm.KoutTraverser,
org.apache.hugegraph.traversal.algorithm.MultiNodeShortestPathTraverser,
org.apache.hugegraph.traversal.algorithm.NeighborRankTraverser,
org.apache.hugegraph.traversal.algorithm.PathsTraverser,
org.apache.hugegraph.traversal.algorithm.PersonalRankTraverser,
org.apache.hugegraph.traversal.algorithm.SameNeighborTraverser,
org.apache.hugegraph.traversal.algorithm.ShortestPathTraverser,
org.apache.hugegraph.traversal.algorithm.SingleSourceShortestPathTraverser,
org.apache.hugegraph.traversal.algorithm.SubGraphTraverser,
org.apache.hugegraph.traversal.algorithm.TemplatePathsTraverser,
org.apache.hugegraph.traversal.algorithm.steps.EdgeStep,
org.apache.hugegraph.traversal.algorithm.steps.RepeatEdgeStep,
org.apache.hugegraph.traversal.algorithm.steps.WeightedEdgeStep,
org.apache.hugegraph.traversal.optimize.ConditionP,
org.apache.hugegraph.traversal.optimize.Text,
org.apache.hugegraph.traversal.optimize.TraversalUtil,
org.apache.hugegraph.util.DateUtil,
org.opencypher.gremlin.traversal.CustomFunctions,
org.opencypher.gremlin.traversal.CustomPredicate
],
methodImports: [
java.lang.Math#*,
org.opencypher.gremlin.traversal.CustomPredicate#*,
org.opencypher.gremlin.traversal.CustomFunctions#*
]
},
org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {
files: [scripts/empty-sample.groovy]
}
}
}
}
serializers:
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphBinaryMessageSerializerV1,
config: {
serializeResultToString: false,
ioRegistries: [org.apache.hugegraph.io.HugeGraphIoRegistry]
}
}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0,
config: {
serializeResultToString: false,
ioRegistries: [org.apache.hugegraph.io.HugeGraphIoRegistry]
}
}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV2d0,
config: {
serializeResultToString: false,
ioRegistries: [org.apache.hugegraph.io.HugeGraphIoRegistry]
}
}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0,
config: {
serializeResultToString: false,
ioRegistries: [org.apache.hugegraph.io.HugeGraphIoRegistry]
}
}
metrics: {
consoleReporter: {enabled: false, interval: 180000},
csvReporter: {enabled: false, interval: 180000, fileName: ./metrics/gremlin-server-metrics.csv},
jmxReporter: {enabled: false},
slf4jReporter: {enabled: false, interval: 180000},
gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
graphiteReporter: {enabled: false, interval: 180000}
}
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536
ssl: {
enabled: false
}
上面的配置项很多,但目前只需要关注如下几个配置项:channelizer 和 graphs。
- graphs:GremlinServer 启动时需要打开的图,该项是一个 map 结构,key 是图的名字,value 是该图的配置文件路径;
- channelizer:GremlinServer 与客户端有两种通信方式,分别是 WebSocket 和 HTTP(默认)。如果选择 WebSocket, 用户可以通过 Gremlin-Console 快速体验 HugeGraph 的特性,但是不支持大规模数据导入, 推荐使用 HTTP 的通信方式,HugeGraph 的外围组件都是基于 HTTP 实现的;
默认情况下,GremlinServer 服务在 localhost:8182,如果需要修改,配置 host、port 即可
- host:部署 GremlinServer 机器的机器名或 IP,目前 HugeGraphServer 不支持分布式部署,且 GremlinServer 不直接暴露给用户;
- port:部署 GremlinServer 机器的端口;
同时需要在 rest-server.properties 中增加对应的配置项 gremlinserver.url=http://host:port
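例如,若将 GremlinServer 的监听地址改为 192.168.1.100:8182(IP 为示意值),两处配置需保持一致:
# gremlin-server.yaml 中
host: 192.168.1.100
port: 8182
# rest-server.properties 中
gremlinserver.url=http://192.168.1.100:8182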
3 rest-server.properties
rest-server.properties 文件的默认内容如下:
# bind url
# could use '0.0.0.0' or specified (real)IP to expose external network access
restserver.url=http://127.0.0.1:8080
#restserver.enable_graphspaces_filter=false
# gremlin server url, need to be consistent with host and port in gremlin-server.yaml
#gremlinserver.url=http://127.0.0.1:8182
graphs=./conf/graphs
# The maximum thread ratio for batch writing, only take effect if the batch.max_write_threads is 0
batch.max_write_ratio=80
batch.max_write_threads=0
# configuration of arthas
arthas.telnet_port=8562
arthas.http_port=8561
arthas.ip=127.0.0.1
arthas.disabled_commands=jad
# authentication configs
# choose 'org.apache.hugegraph.auth.StandardAuthenticator' or
# 'org.apache.hugegraph.auth.ConfigAuthenticator'
#auth.authenticator=
# for StandardAuthenticator mode
#auth.graph_store=hugegraph
# auth client config
#auth.remote_url=127.0.0.1:8899,127.0.0.1:8898,127.0.0.1:8897
# for ConfigAuthenticator mode
#auth.admin_token=
#auth.user_tokens=[]
# TODO: Deprecated & removed later (useless from version 1.5.0)
# rpc server configs for multi graph-servers or raft-servers
#rpc.server_host=127.0.0.1
#rpc.server_port=8091
#rpc.server_timeout=30
# rpc client configs (like enable to keep cache consistency)
#rpc.remote_url=127.0.0.1:8091,127.0.0.1:8092,127.0.0.1:8093
#rpc.client_connect_timeout=20
#rpc.client_reconnect_period=10
#rpc.client_read_timeout=40
#rpc.client_retries=3
#rpc.client_load_balancer=consistentHash
# raft group initial peers
#raft.group_peers=127.0.0.1:8091,127.0.0.1:8092,127.0.0.1:8093
# lightweight load balancing (beta)
server.id=server-1
server.role=master
# slow query log
log.slow_query_threshold=1000
# jvm(in-heap) memory usage monitor, set 1 to disable it
memory_monitor.threshold=0.85
memory_monitor.period=2000
- restserver.url:RestServer 提供服务的 url,根据实际环境修改。如果其他 IP 地址无法访问,可以尝试修改为特定的地址;或修改为 http://0.0.0.0 来监听来自任何 IP 地址的请求,这种方案较为便捷,但需要留意服务可被访问的网络范围;
- graphs:RestServer 启动时也需要打开图,该项指向存放各个图配置文件的目录(如默认的 ./conf/graphs),目录下每个配置文件对应一张图,详见下文"多图配置"一节;
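修改 restserver.url(例如绑定 http://0.0.0.0:8080)并重启服务后,可以在其它机器上用类似下面的命令简单验证 RestServer 是否可达(IP 为示意值):
curl http://192.168.1.100:8080/graphs/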
注意:gremlin-server.yaml 和 rest-server.properties 都包含 graphs 配置项,而 init-store 命令是根据 gremlin-server.yaml 的 graphs 下的图进行初始化的。
配置项 gremlinserver.url 是 GremlinServer 为 RestServer 提供服务的 url,该配置项默认为 http://localhost:8182,如需修改,需要和 gremlin-server.yaml 中的 host 和 port 相匹配;
4 hugegraph.properties
hugegraph.properties 是一类文件,因为如果系统存在多个图,则会有多个相似的文件。该文件用来配置与图存储和查询相关的参数,文件的默认内容如下:
# gremlin entrence to create graph
gremlin.graph=org.apache.hugegraph.HugeFactory
# cache config
#schema.cache_capacity=100000
# vertex-cache default is 1000w, 10min expired
#vertex.cache_capacity=10000000
#vertex.cache_expire=600
# edge-cache default is 100w, 10min expired
#edge.cache_capacity=1000000
#edge.cache_expire=600
# schema illegal name template
#schema.illegal_name_regex=\s+|~.*
#vertex.default_label=vertex
backend=rocksdb
serializer=binary
store=hugegraph
raft.mode=false
raft.safe_read=false
raft.use_snapshot=false
raft.endpoint=127.0.0.1:8281
raft.group_peers=127.0.0.1:8281,127.0.0.1:8282,127.0.0.1:8283
raft.path=./raft-log
raft.use_replicator_pipeline=true
raft.election_timeout=10000
raft.snapshot_interval=3600
raft.backend_threads=48
raft.read_index_threads=8
raft.queue_size=16384
raft.queue_publish_timeout=60
raft.apply_batch=1
raft.rpc_threads=80
raft.rpc_connect_timeout=5000
raft.rpc_timeout=60000
# if use 'ikanalyzer', need download jar from 'https://github.com/apache/hugegraph-doc/raw/ik_binary/dist/server/ikanalyzer-2012_u6.jar' to lib directory
search.text_analyzer=jieba
search.text_analyzer_mode=INDEX
# rocksdb backend config
#rocksdb.data_path=/path/to/disk
#rocksdb.wal_path=/path/to/disk
# cassandra backend config
cassandra.host=localhost
cassandra.port=9042
cassandra.username=
cassandra.password=
#cassandra.connect_timeout=5
#cassandra.read_timeout=20
#cassandra.keyspace.strategy=SimpleStrategy
#cassandra.keyspace.replication=3
# hbase backend config
#hbase.hosts=localhost
#hbase.port=2181
#hbase.znode_parent=/hbase
#hbase.threads_max=64
# mysql backend config
#jdbc.driver=com.mysql.jdbc.Driver
#jdbc.url=jdbc:mysql://127.0.0.1:3306
#jdbc.username=root
#jdbc.password=
#jdbc.reconnect_max_times=3
#jdbc.reconnect_interval=3
#jdbc.ssl_mode=false
# postgresql & cockroachdb backend config
#jdbc.driver=org.postgresql.Driver
#jdbc.url=jdbc:postgresql://localhost:5432/
#jdbc.username=postgres
#jdbc.password=
# palo backend config
#palo.host=127.0.0.1
#palo.poll_interval=10
#palo.temp_dir=./palo-data
#palo.file_limit_size=32
重点关注未注释的几项:
- gremlin.graph:GremlinServer 的启动入口,用户不要修改此项;
- backend:使用的后端存储,可选值有 memory、cassandra、scylladb、mysql、hbase、postgresql 和 rocksdb;
- serializer:主要为内部使用,用于将 schema、vertex 和 edge 序列化到后端,对应的可选值为 text、cassandra、scylladb 和 binary;(注:rocksdb 后端值需是 binary,其他后端 backend 与 serializer 值需保持一致,如 hbase 后端该值为 hbase)
- store:图存储到后端使用的数据库名,在 cassandra 和 scylladb 中就是 keyspace 名,此项的值与 GremlinServer 和 RestServer 中的图名并无关系,但是出于直观考虑,建议仍然使用相同的名字;
- cassandra.host:backend 为 cassandra 或 scylladb 时此项才有意义,cassandra/scylladb 集群的 seeds;
- cassandra.port:backend 为 cassandra 或 scylladb 时此项才有意义,cassandra/scylladb 集群的 native port;
- rocksdb.data_path:backend 为 rocksdb 时此项才有意义,rocksdb 的数据目录
- rocksdb.wal_path:backend 为 rocksdb 时此项才有意义,rocksdb 的日志目录
- admin.token: 通过一个 token 来获取服务器的配置信息,例如:http://localhost:8080/graphs/hugegraph/conf?token=162f7848-0b6d-4faf-b557-3a0797869c55
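例如,可以用如下命令通过 token 查看图的配置信息(token 为示例值,请替换为实际配置的 admin.token):
curl "http://localhost:8080/graphs/hugegraph/conf?token=162f7848-0b6d-4faf-b557-3a0797869c55"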
5 多图配置
我们的系统是可以存在多个图的,并且各个图的后端可以不一样,比如图 hugegraph_rocksdb 和 hugegraph_mysql,其中 hugegraph_rocksdb 以 RocksDB 作为后端,hugegraph_mysql 以 MySQL 作为后端。
配置方法也很简单:
[可选]:修改 rest-server.properties
通过修改 rest-server.properties 中的 graphs 配置项来设置图的配置文件目录。默认配置为 graphs=./conf/graphs,如果想要修改为其它目录则调整 graphs 配置项,比如调整为 graphs=/etc/hugegraph/graphs,示例如下:
graphs=./conf/graphs
在 conf/graphs 路径下基于 hugegraph.properties 修改得到 hugegraph_mysql_backend.properties 和 hugegraph_rocksdb_backend.properties
hugegraph_mysql_backend.properties 修改的部分如下:
backend=mysql
serializer=mysql
store=hugegraph_mysql
# mysql backend config
jdbc.driver=com.mysql.cj.jdbc.Driver
jdbc.url=jdbc:mysql://127.0.0.1:3306
jdbc.username=root
jdbc.password=123456
jdbc.reconnect_max_times=3
jdbc.reconnect_interval=3
jdbc.ssl_mode=false
hugegraph_rocksdb_backend.properties 修改的部分如下:
backend=rocksdb
serializer=binary
store=hugegraph_rocksdb
停止 Server,初始化执行 init-store.sh(为新的图创建数据库),重新启动 Server
$ ./bin/stop-hugegraph.sh
$ ./bin/init-store.sh
Initializing HugeGraph Store...
2023-06-11 14:16:14 [main] [INFO] o.a.h.u.ConfigUtil - Scanning option 'graphs' directory './conf/graphs'
2023-06-11 14:16:14 [main] [INFO] o.a.h.c.InitStore - Init graph with config file: ./conf/graphs/hugegraph_rocksdb_backend.properties
...
2023-06-11 14:16:15 [main] [INFO] o.a.h.StandardHugeGraph - Graph 'hugegraph_rocksdb' has been initialized
2023-06-11 14:16:15 [main] [INFO] o.a.h.c.InitStore - Init graph with config file: ./conf/graphs/hugegraph_mysql_backend.properties
...
2023-06-11 14:16:16 [main] [INFO] o.a.h.StandardHugeGraph - Graph 'hugegraph_mysql' has been initialized
2023-06-11 14:16:16 [main] [INFO] o.a.h.StandardHugeGraph - Close graph standardhugegraph[hugegraph_rocksdb]
...
2023-06-11 14:16:16 [main] [INFO] o.a.h.HugeFactory - HugeFactory shutdown
2023-06-11 14:16:16 [hugegraph-shutdown] [INFO] o.a.h.HugeFactory - HugeGraph is shutting down
Initialization finished.
$ ./bin/start-hugegraph.sh
Starting HugeGraphServer...
Connecting to HugeGraphServer (http://127.0.0.1:8080/graphs)...OK
Started [pid 21614]
查看创建的图:
curl http://127.0.0.1:8080/graphs/
{"graphs":["hugegraph_rocksdb","hugegraph_mysql"]}
查看某个图的信息:
curl http://127.0.0.1:8080/graphs/hugegraph_mysql_backend
{"name":"hugegraph_mysql","backend":"mysql"}
curl http://127.0.0.1:8080/graphs/hugegraph_rocksdb_backend
{"name":"hugegraph_rocksdb","backend":"rocksdb"}
4.2 - HugeGraph 配置项
Gremlin Server 配置项
对应配置文件gremlin-server.yaml
config option | default value | description |
---|---|---|
host | 127.0.0.1 | The host or ip of Gremlin Server. |
port | 8182 | The listening port of Gremlin Server. |
graphs | hugegraph: conf/hugegraph.properties | The map of graphs with name and config file path. |
scriptEvaluationTimeout | 30000 | The timeout for gremlin script execution(millisecond). |
channelizer | org.apache.tinkerpop.gremlin.server.channel.HttpChannelizer | Indicates the protocol which the Gremlin Server provides service. |
authentication | authenticator: org.apache.hugegraph.auth.StandardAuthenticator, config: {tokens: conf/rest-server.properties} | The authenticator and config(contains tokens path) of authentication mechanism. |
Rest Server & API 配置项
对应配置文件rest-server.properties
config option | default value | description |
---|---|---|
graphs | [hugegraph:conf/hugegraph.properties] | The map of graphs’ name and config file. |
server.id | server-1 | The id of rest server, used for license verification. |
server.role | master | The role of nodes in the cluster, available types are [master, worker, computer] |
restserver.url | http://127.0.0.1:8080 | The url for listening of rest server. |
ssl.keystore_file | server.keystore | The path of server keystore file used when https protocol is enabled. |
ssl.keystore_password | The password of the path of the server keystore file used when the https protocol is enabled. | |
restserver.max_worker_threads | 2 * CPUs | The maximum worker threads of rest server. |
restserver.min_free_memory | 64 | The minimum free memory(MB) of rest server, requests will be rejected when the available memory of system is lower than this value. |
restserver.request_timeout | 30 | The time in seconds within which a request must complete, -1 means no timeout. |
restserver.connection_idle_timeout | 30 | The time in seconds to keep an inactive connection alive, -1 means no timeout. |
restserver.connection_max_requests | 256 | The max number of HTTP requests allowed to be processed on one keep-alive connection, -1 means unlimited. |
gremlinserver.url | http://127.0.0.1:8182 | The url of gremlin server. |
gremlinserver.max_route | 8 | The max route number for gremlin server. |
gremlinserver.timeout | 30 | The timeout in seconds of waiting for gremlin server. |
batch.max_edges_per_batch | 500 | The maximum number of edges submitted per batch. |
batch.max_vertices_per_batch | 500 | The maximum number of vertices submitted per batch. |
batch.max_write_ratio | 50 | The maximum thread ratio for batch writing, only take effect if the batch.max_write_threads is 0. |
batch.max_write_threads | 0 | The maximum threads for batch writing, if the value is 0, the actual value will be set to batch.max_write_ratio * restserver.max_worker_threads. |
auth.authenticator | The class path of authenticator implementation. e.g., org.apache.hugegraph.auth.StandardAuthenticator, or org.apache.hugegraph.auth.ConfigAuthenticator. | |
auth.admin_token | 162f7848-0b6d-4faf-b557-3a0797869c55 | Token for administrator operations, only for org.apache.hugegraph.auth.ConfigAuthenticator. |
auth.graph_store | hugegraph | The name of graph used to store authentication information, like users, only for org.apache.hugegraph.auth.StandardAuthenticator. |
auth.user_tokens | [hugegraph:9fd95c9c-711b-415b-b85f-d4df46ba5c31] | The map of user tokens with name and password, only for org.apache.hugegraph.auth.ConfigAuthenticator. |
auth.audit_log_rate | 1000.0 | The max rate of audit log output per user, default value is 1000 records per second. |
auth.cache_capacity | 10240 | The max cache capacity of each auth cache item. |
auth.cache_expire | 600 | The expiration time in seconds of vertex cache. |
auth.remote_url | If the address is empty, it provide auth service, otherwise it is auth client and also provide auth service through rpc forwarding. The remote url can be set to multiple addresses, which are concat by ‘,’. | |
auth.token_expire | 86400 | The expiration time in seconds after token created |
auth.token_secret | FXQXbJtbCLxODc6tGci732pkH1cyf8Qg | Secret key of HS256 algorithm. |
exception.allow_trace | false | Whether to allow exception trace stack. |
memory_monitor.threshold | 0.85 | The threshold of JVM(in-heap) memory usage monitoring , 1 means disabling this function. |
memory_monitor.period | 2000 | The period in ms of JVM(in-heap) memory usage monitoring. |
基本配置项
基本配置项及后端配置项对应配置文件:{graph-name}.properties,如hugegraph.properties
config option | default value | description |
---|---|---|
gremlin.graph | org.apache.hugegraph.HugeFactory | Gremlin entrance to create graph. |
backend | rocksdb | The data store type, available values are [memory, rocksdb, cassandra, scylladb, hbase, mysql]. |
serializer | binary | The serializer for backend store, available values are [text, binary, cassandra, hbase, mysql]. |
store | hugegraph | The database name like Cassandra Keyspace. |
store.connection_detect_interval | 600 | The interval in seconds for detecting connections, if the idle time of a connection exceeds this value, detect it and reconnect if needed before using, value 0 means detecting every time. |
store.graph | g | The graph table name, which store vertex, edge and property. |
store.schema | m | The schema table name, which store meta data. |
store.system | s | The system table name, which store system data. |
schema.illegal_name_regex | .\s+$|~. | The regex specified the illegal format for schema name. |
schema.cache_capacity | 10000 | The max cache size(items) of schema cache. |
vertex.cache_type | l2 | The type of vertex cache, allowed values are [l1, l2]. |
vertex.cache_capacity | 10000000 | The max cache size(items) of vertex cache. |
vertex.cache_expire | 600 | The expire time in seconds of vertex cache. |
vertex.check_customized_id_exist | false | Whether to check the vertices exist for those using customized id strategy. |
vertex.default_label | vertex | The default vertex label. |
vertex.tx_capacity | 10000 | The max size(items) of vertices(uncommitted) in transaction. |
vertex.check_adjacent_vertex_exist | false | Whether to check the adjacent vertices of edges exist. |
vertex.lazy_load_adjacent_vertex | true | Whether to lazy load adjacent vertices of edges. |
vertex.part_edge_commit_size | 5000 | Whether to enable the mode to commit part of edges of vertex, enabled if commit size > 0, 0 means disabled. |
vertex.encode_primary_key_number | true | Whether to encode number value of primary key in vertex id. |
vertex.remove_left_index_at_overwrite | false | Whether remove left index at overwrite. |
edge.cache_type | l2 | The type of edge cache, allowed values are [l1, l2]. |
edge.cache_capacity | 1000000 | The max cache size(items) of edge cache. |
edge.cache_expire | 600 | The expiration time in seconds of edge cache. |
edge.tx_capacity | 10000 | The max size(items) of edges(uncommitted) in transaction. |
query.page_size | 500 | The size of each page when querying by paging. |
query.batch_size | 1000 | The size of each batch when querying by batch. |
query.ignore_invalid_data | true | Whether to ignore invalid data of vertex or edge. |
query.index_intersect_threshold | 1000 | The maximum number of intermediate results to intersect indexes when querying by multiple single index properties. |
query.ramtable_edges_capacity | 20000000 | The maximum number of edges in ramtable, include OUT and IN edges. |
query.ramtable_enable | false | Whether to enable ramtable for query of adjacent edges. |
query.ramtable_vertices_capacity | 10000000 | The maximum number of vertices in ramtable, generally the largest vertex id is used as capacity. |
query.optimize_aggregate_by_index | false | Whether to optimize aggregate query(like count) by index. |
oltp.concurrent_depth | 10 | The min depth to enable concurrent oltp algorithm. |
oltp.concurrent_threads | 10 | Thread number to concurrently execute oltp algorithm. |
oltp.collection_type | EC | The implementation type of collections used in oltp algorithm. |
rate_limit.read | 0 | The max rate(times/s) to execute query of vertices/edges. |
rate_limit.write | 0 | The max rate(items/s) to add/update/delete vertices/edges. |
task.wait_timeout | 10 | Timeout in seconds for waiting for the task to complete, such as when truncating or clearing the backend. |
task.input_size_limit | 16777216 | The job input size limit in bytes. |
task.result_size_limit | 16777216 | The job result size limit in bytes. |
task.sync_deletion | false | Whether to delete schema or expired data synchronously. |
task.ttl_delete_batch | 1 | The batch size used to delete expired data. |
computer.config | /conf/computer.yaml | The config file path of computer job. |
search.text_analyzer | ikanalyzer | Choose a text analyzer for searching the vertex/edge properties, available types are [word, ansj, hanlp, smartcn, jieba, jcseg, mmseg4j, ikanalyzer]. Note: if using ‘ikanalyzer’, download the jar from ‘https://github.com/apache/hugegraph-doc/raw/ik_binary/dist/server/ikanalyzer-2012_u6.jar’ to the lib directory. |
search.text_analyzer_mode | smart | Specify the mode for the text analyzer, the available mode of analyzer are {word: [MaximumMatching, ReverseMaximumMatching, MinimumMatching, ReverseMinimumMatching, BidirectionalMaximumMatching, BidirectionalMinimumMatching, BidirectionalMaximumMinimumMatching, FullSegmentation, MinimalWordCount, MaxNgramScore, PureEnglish], ansj: [BaseAnalysis, IndexAnalysis, ToAnalysis, NlpAnalysis], hanlp: [standard, nlp, index, nShort, shortest, speed], smartcn: [], jieba: [SEARCH, INDEX], jcseg: [Simple, Complex], mmseg4j: [Simple, Complex, MaxWord], ikanalyzer: [smart, max_word]}. |
snowflake.datecenter_id | 0 | The datacenter id of snowflake id generator. |
snowflake.force_string | false | Whether to force the snowflake long id to be a string. |
snowflake.worker_id | 0 | The worker id of snowflake id generator. |
raft.mode | false | Whether the backend storage works in raft mode. |
raft.safe_read | false | Whether to use linearly consistent read. |
raft.use_snapshot | false | Whether to use snapshot. |
raft.endpoint | 127.0.0.1:8281 | The peerid of current raft node. |
raft.group_peers | 127.0.0.1:8281,127.0.0.1:8282,127.0.0.1:8283 | The peers of current raft group. |
raft.path | ./raft-log | The log path of current raft node. |
raft.use_replicator_pipeline | true | Whether to use the replicator pipeline: when turned on, multiple logs can be sent in parallel, and the next log doesn’t have to wait for the ack message of the current log before being sent. |
raft.election_timeout | 10000 | Timeout in milliseconds to launch a round of election. |
raft.snapshot_interval | 3600 | The interval in seconds to trigger snapshot save. |
raft.backend_threads | current CPU v-cores | The thread number used to apply task to backend. |
raft.read_index_threads | 8 | The thread number used to execute reading index. |
raft.apply_batch | 1 | The apply batch size to trigger disruptor event handler. |
raft.queue_size | 16384 | The disruptor buffers size for jraft RaftNode, StateMachine and LogManager. |
raft.queue_publish_timeout | 60 | The timeout in second when publish event into disruptor. |
raft.rpc_threads | 80 | The rpc threads for jraft RPC layer. |
raft.rpc_connect_timeout | 5000 | The rpc connect timeout for jraft rpc. |
raft.rpc_timeout | 60000 | The rpc timeout for jraft rpc. |
raft.rpc_buf_low_water_mark | 10485760 | The ChannelOutboundBuffer’s low water mark of netty, when buffer size less than this size, the method ChannelOutboundBuffer.isWritable() will return true, it means that low downstream pressure or good network. |
raft.rpc_buf_high_water_mark | 20971520 | The ChannelOutboundBuffer’s high water mark of netty, only when buffer size exceed this size, the method ChannelOutboundBuffer.isWritable() will return false, it means that the downstream pressure is too great to process the request or network is very congestion, upstream needs to limit rate at this time. |
raft.read_strategy | ReadOnlyLeaseBased | The linearizability of read strategy. |
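以下是一份最小化的 {graph-name}.properties 配置示意(仅供参考,假设采用默认的 RocksDB 后端,具体取值请结合上表按需调整):
gremlin.graph=org.apache.hugegraph.HugeFactory
backend=rocksdb
serializer=binary
store=hugegraph
rocksdb.data_path=rocksdb-data/data
rocksdb.wal_path=rocksdb-data/wal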
RPC server 配置
config option | default value | description |
---|---|---|
rpc.client_connect_timeout | 20 | The timeout(in seconds) of rpc client connect to rpc server. |
rpc.client_load_balancer | consistentHash | The rpc client uses a load-balancing algorithm to access multiple rpc servers in one cluster. Default value is ‘consistentHash’, means forwarding by request parameters. |
rpc.client_read_timeout | 40 | The timeout(in seconds) of rpc client read from rpc server. |
rpc.client_reconnect_period | 10 | The period(in seconds) of rpc client reconnect to rpc server. |
rpc.client_retries | 3 | Failed retry number of rpc client calls to rpc server. |
rpc.config_order | 999 | Sofa rpc configuration file loading order, the larger the more later loading. |
rpc.logger_impl | com.alipay.sofa.rpc.log.SLF4JLoggerImpl | Sofa rpc log implementation class. |
rpc.protocol | bolt | Rpc communication protocol, client and server need to be specified the same value. |
rpc.remote_url | The remote urls of rpc peers, it can be set to multiple addresses, which are concat by ‘,’, empty value means not enabled. | |
rpc.server_adaptive_port | false | Whether the bound port is adaptive, if it’s enabled, when the port is in use, automatically +1 to detect the next available port. Note that this process is not atomic, so there may still be port conflicts. |
rpc.server_host | The hosts/ips bound by rpc server to provide services, empty value means not enabled. | |
rpc.server_port | 8090 | The port bound by rpc server to provide services. |
rpc.server_timeout | 30 | The timeout(in seconds) of rpc server execution. |
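以下是一份 RPC server 配置示意(通常写在 rest-server.properties 中,IP 与端口均为假设值,需按实际部署调整):
# 本节点 RPC 服务绑定的地址与端口 (假设值)
rpc.server_host=127.0.0.1
rpc.server_port=8091
# 同一集群中其它 RPC 节点的地址,多个地址以逗号分隔 (假设值)
rpc.remote_url=127.0.0.1:8092,127.0.0.1:8093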
Cassandra 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to cassandra. | |
serializer | Must be set to cassandra. | |
cassandra.host | localhost | The seeds hostname or ip address of cassandra cluster. |
cassandra.port | 9042 | The seeds port address of cassandra cluster. |
cassandra.connect_timeout | 5 | The cassandra driver connect server timeout(seconds). |
cassandra.read_timeout | 20 | The cassandra driver read from server timeout(seconds). |
cassandra.keyspace.strategy | SimpleStrategy | The replication strategy of keyspace, valid value is SimpleStrategy or NetworkTopologyStrategy. |
cassandra.keyspace.replication | [3] | The keyspace replication factor of SimpleStrategy, like ‘[3]’.Or replicas in each datacenter of NetworkTopologyStrategy, like ‘[dc1:2,dc2:1]’. |
cassandra.username | The username to use to login to cassandra cluster. | |
cassandra.password | The password corresponding to cassandra.username. | |
cassandra.compression_type | none | The compression algorithm of cassandra transport: none/snappy/lz4. |
cassandra.jmx_port | 7199 | The port of JMX API service for cassandra. |
cassandra.aggregation_timeout | 43200 | The timeout in seconds of waiting for aggregation. |
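若使用 Cassandra 后端,可在 {graph-name}.properties 中做类似如下配置(示意,主机与账号均为假设值):
backend=cassandra
serializer=cassandra
cassandra.host=localhost
cassandra.port=9042
cassandra.keyspace.strategy=SimpleStrategy
cassandra.keyspace.replication=[3]
# 若集群开启认证,按需配置用户名与密码 (假设值)
#cassandra.username=cassandra
#cassandra.password=cassandra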
ScyllaDB 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to scylladb. | |
serializer | Must be set to scylladb. |
其它与 Cassandra 后端一致。
RocksDB 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to rocksdb. | |
serializer | Must be set to binary. | |
rocksdb.data_disks | [] | The optimized disks for storing data of RocksDB. The format of each element: STORE/TABLE: /path/disk .Allowed keys are [g/vertex, g/edge_out, g/edge_in, g/vertex_label_index, g/edge_label_index, g/range_int_index, g/range_float_index, g/range_long_index, g/range_double_index, g/secondary_index, g/search_index, g/shard_index, g/unique_index, g/olap] |
rocksdb.data_path | rocksdb-data/data | The path for storing data of RocksDB. |
rocksdb.wal_path | rocksdb-data/wal | The path for storing WAL of RocksDB. |
rocksdb.allow_mmap_reads | false | Allow the OS to mmap file for reading sst tables. |
rocksdb.allow_mmap_writes | false | Allow the OS to mmap file for writing. |
rocksdb.block_cache_capacity | 8388608 | The amount of block cache in bytes that will be used by RocksDB, 0 means no block cache. |
rocksdb.bloom_filter_bits_per_key | -1 | The bits per key in bloom filter, a good value is 10, which yields a filter with ~ 1% false positive rate, -1 means no bloom filter. |
rocksdb.bloom_filter_block_based_mode | false | Use block based filter rather than full filter. |
rocksdb.bloom_filter_whole_key_filtering | true | True if place whole keys in the bloom filter, else place the prefix of keys. |
rocksdb.bottommost_compression | NO_COMPRESSION | The compression algorithm for the bottommost level of RocksDB, allowed values are none/snappy/z/bzip2/lz4/lz4hc/xpress/zstd. |
rocksdb.bulkload_mode | false | Switch to the mode to bulk load data into RocksDB. |
rocksdb.cache_index_and_filter_blocks | false | Indicating if we’d put index/filter blocks to the block cache. |
rocksdb.compaction_style | LEVEL | Set compaction style for RocksDB: LEVEL/UNIVERSAL/FIFO. |
rocksdb.compression | SNAPPY_COMPRESSION | The compression algorithm for compressing blocks of RocksDB, allowed values are none/snappy/z/bzip2/lz4/lz4hc/xpress/zstd. |
rocksdb.compression_per_level | [NO_COMPRESSION, NO_COMPRESSION, SNAPPY_COMPRESSION, SNAPPY_COMPRESSION, SNAPPY_COMPRESSION, SNAPPY_COMPRESSION, SNAPPY_COMPRESSION] | The compression algorithms for different levels of RocksDB, allowed values are none/snappy/z/bzip2/lz4/lz4hc/xpress/zstd. |
rocksdb.delayed_write_rate | 16777216 | The rate limit in bytes/s of user write requests when need to slow down if the compaction gets behind. |
rocksdb.log_level | INFO | The info log level of RocksDB. |
rocksdb.max_background_jobs | 8 | Maximum number of concurrent background jobs, including flushes and compactions. |
rocksdb.level_compaction_dynamic_level_bytes | false | Whether to enable level_compaction_dynamic_level_bytes, if it’s enabled we give max_bytes_for_level_multiplier a priority against max_bytes_for_level_base, the bytes of base level is dynamic for a more predictable LSM tree, it is useful to limit worse case space amplification. Turning this feature on/off for an existing DB can cause unexpected LSM tree structure so it’s not recommended. |
rocksdb.max_bytes_for_level_base | 536870912 | The upper-bound of the total size of level-1 files in bytes. |
rocksdb.max_bytes_for_level_multiplier | 10.0 | The ratio between the total size of level (L+1) files and the total size of level L files for all L. |
rocksdb.max_open_files | -1 | The maximum number of open files that can be cached by RocksDB, -1 means no limit. |
rocksdb.max_subcompactions | 4 | The value represents the maximum number of threads per compaction job. |
rocksdb.max_write_buffer_number | 6 | The maximum number of write buffers that are built up in memory. |
rocksdb.max_write_buffer_number_to_maintain | 0 | The total maximum number of write buffers to maintain in memory. |
rocksdb.min_write_buffer_number_to_merge | 2 | The minimum number of write buffers that will be merged together. |
rocksdb.num_levels | 7 | Set the number of levels for this database. |
rocksdb.optimize_filters_for_hits | false | This flag allows us to not store filters for the last level. |
rocksdb.optimize_mode | true | Optimize for heavy workloads and big datasets. |
rocksdb.pin_l0_filter_and_index_blocks_in_cache | false | Indicating if we’d put index/filter blocks to the block cache. |
rocksdb.sst_path | The path for ingesting SST file into RocksDB. | |
rocksdb.target_file_size_base | 67108864 | The target file size for compaction in bytes. |
rocksdb.target_file_size_multiplier | 1 | The size ratio between a level L file and a level (L+1) file. |
rocksdb.use_direct_io_for_flush_and_compaction | false | Enable the OS to use direct read/writes in flush and compaction. |
rocksdb.use_direct_reads | false | Enable the OS to use direct I/O for reading sst tables. |
rocksdb.write_buffer_size | 134217728 | Amount of data in bytes to build up in memory. |
rocksdb.max_manifest_file_size | 104857600 | The max size of manifest file in bytes. |
rocksdb.skip_stats_update_on_db_open | false | Whether to skip statistics update when opening the database, setting this flag true allows us to not update statistics. |
rocksdb.max_file_opening_threads | 16 | The max number of threads used to open files. |
rocksdb.max_total_wal_size | 0 | Total size of WAL files in bytes. Once WALs exceed this size, we will start forcing the flush of column families related, 0 means no limit. |
rocksdb.db_write_buffer_size | 0 | Total size of write buffers in bytes across all column families, 0 means no limit. |
rocksdb.delete_obsolete_files_period | 21600 | The periodicity in seconds when obsolete files get deleted, 0 means always do full purge. |
rocksdb.hard_pending_compaction_bytes_limit | 274877906944 | The hard limit to impose on pending compaction in bytes. |
rocksdb.level0_file_num_compaction_trigger | 2 | Number of files to trigger level-0 compaction. |
rocksdb.level0_slowdown_writes_trigger | 20 | Soft limit on number of level-0 files for slowing down writes. |
rocksdb.level0_stop_writes_trigger | 36 | Hard limit on number of level-0 files for stopping writes. |
rocksdb.soft_pending_compaction_bytes_limit | 68719476736 | The soft limit to impose on pending compaction in bytes. |
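以下是 RocksDB 后端的配置示意,演示如何通过 rocksdb.data_disks 将不同表的数据放到不同磁盘(磁盘路径为假设值):
backend=rocksdb
serializer=binary
rocksdb.data_path=rocksdb-data/data
rocksdb.wal_path=rocksdb-data/wal
# 每个元素格式为 STORE/TABLE:/path/disk,以下路径仅为示意
rocksdb.data_disks=[g/vertex:/disk1/hugegraph, g/edge_out:/disk2/hugegraph]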
HBase 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to hbase. | |
serializer | Must be set to hbase. | |
hbase.hosts | localhost | The hostnames or ip addresses of HBase zookeeper, separated with commas. |
hbase.port | 2181 | The port address of HBase zookeeper. |
hbase.threads_max | 64 | The max threads num of hbase connections. |
hbase.znode_parent | /hbase | The znode parent path of HBase zookeeper. |
hbase.zk_retry | 3 | The recovery retry times of HBase zookeeper. |
hbase.aggregation_timeout | 43200 | The timeout in seconds of waiting for aggregation. |
hbase.kerberos_enable | false | Is Kerberos authentication enabled for HBase. |
hbase.kerberos_keytab | The HBase’s key tab file for kerberos authentication. | |
hbase.kerberos_principal | The HBase’s principal for kerberos authentication. | |
hbase.krb5_conf | etc/krb5.conf | Kerberos configuration file, including KDC IP, default realm, etc. |
hbase.hbase_site | /etc/hbase/conf/hbase-site.xml | The HBase’s configuration file |
hbase.enable_partition | true | Is pre-split partitions enabled for HBase. |
hbase.vertex_partitions | 10 | The number of partitions of the HBase vertex table. |
hbase.edge_partitions | 30 | The number of partitions of the HBase edge table. |
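若使用 HBase 后端,配置示意如下(zookeeper 地址为假设值):
backend=hbase
serializer=hbase
# HBase zookeeper 的主机名或 IP,多个以逗号分隔 (假设值)
hbase.hosts=zk1,zk2,zk3
hbase.port=2181
hbase.znode_parent=/hbase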
MySQL & PostgreSQL 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to mysql. | |
serializer | Must be set to mysql. | |
jdbc.driver | com.mysql.jdbc.Driver | The JDBC driver class to connect database. |
jdbc.url | jdbc:mysql://127.0.0.1:3306 | The url of database in JDBC format. |
jdbc.username | root | The username to login database. |
jdbc.password | ****** | The password corresponding to jdbc.username. |
jdbc.ssl_mode | false | The SSL mode of connections with database. |
jdbc.reconnect_interval | 3 | The interval(seconds) between reconnections when the database connection fails. |
jdbc.reconnect_max_times | 3 | The reconnect times when the database connection fails. |
jdbc.storage_engine | InnoDB | The storage engine of backend store database, like InnoDB/MyISAM/RocksDB for MySQL. |
jdbc.postgresql.connect_database | template1 | The database used to connect when init store, drop store or check store exist. |
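若使用 MySQL 后端,配置示意如下(地址、用户名与密码均为假设值):
backend=mysql
serializer=mysql
jdbc.driver=com.mysql.jdbc.Driver
jdbc.url=jdbc:mysql://127.0.0.1:3306
jdbc.username=root
jdbc.password=******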
PostgreSQL 后端配置项
config option | default value | description |
---|---|---|
backend | Must be set to postgresql. | |
serializer | Must be set to postgresql. |
其它与 MySQL 后端一致。
PostgreSQL 后端的 driver 和 url 应该设置为:
jdbc.driver=org.postgresql.Driver
jdbc.url=jdbc:postgresql://localhost:5432/
4.3 - HugeGraph 内置用户权限与扩展权限配置及使用
概述
HugeGraph 为了方便不同用户场景下的鉴权使用,目前内置了完备的 StandardAuthenticator 权限模式,支持多用户认证以及细粒度的权限访问控制,采用基于“用户 - 用户组 - 操作 - 资源”的 4 层设计,灵活控制用户角色与权限 (支持多 GraphServer)。
StandardAuthenticator 模式的几个核心设计:
- 初始化时创建超级管理员 (admin) 用户,后续通过超级管理员创建其它用户,新创建的用户被分配足够权限后,可以创建或管理更多的用户
- 支持动态创建用户、用户组、资源,支持动态分配或取消权限
- 用户可以属于一个或多个用户组,每个用户组可以拥有对任意个资源的操作权限,操作类型包括:读、写、删除、执行等种类
- “资源” 描述了图数据库中的数据,比如符合某一类条件的顶点,每一个资源包括 type、label、properties 三个要素,共有 18 种类型、任意 label、任意 properties 可组合形成的资源,一个资源的内部条件是且关系,多个资源之间的条件是或关系
举例说明:
// 场景:某用户只有北京地区的数据读取权限
user(name=xx) -belong-> group(name=xx) -access(read)-> target(graph=graph1, resource={label: person, city: Beijing})
配置用户认证
HugeGraph 目前默认未启用用户认证功能,需通过修改配置文件来启用该功能。(Note: 如果在生产环境/外网使用, 请使用 Java11 版本 + 开启权限避免安全相关隐患)
目前已内置实现了 StandardAuthenticator 模式,该模式支持多用户认证与细粒度权限控制。此外,开发者可以自定义实现 HugeAuthenticator 接口来对接自身的权限系统。
用户认证方式均采用 HTTP Basic Authentication,简单说就是在发送 HTTP 请求时在 Authentication 设置选择 Basic,然后输入对应的用户名和密码,对应 HTTP 明文如下所示:
GET http://localhost:8080/graphs/hugegraph/schema/vertexlabels
Authorization: Basic admin xxxx
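例如使用 curl 以 Basic 认证方式发起同样的请求(用户名与密码仅为示意):
curl -u admin:xxxx "http://localhost:8080/graphs/hugegraph/schema/vertexlabels"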
警告:在 1.5.0 之前版本的 HugeGraph-Server 在鉴权模式下存在 JWT 相关的安全隐患,请务必使用新版本或自行修改 JWT token 的 secretKey。
修改方式为在配置文件 rest-server.properties 中重写 auth.token_secret 信息 (1.5.0 后会默认生成随机值则无需配置):
auth.token_secret=XXXX #这里为 32 位 String,由 a-z,A-Z 和 0-9 组成
也可以通过下面的命令实现:
RANDOM_STRING=$(head /dev/urandom | tr -dc A-Za-z0-9 | head -c 32)
echo "auth.token_secret=${RANDOM_STRING}" >> rest-server.properties
StandardAuthenticator 模式
StandardAuthenticator 模式是通过在数据库后端存储用户信息来支持用户认证和权限控制,该实现基于数据库存储的用户的名称与密码进行认证 (密码已被加密),基于用户的角色来细粒度控制用户权限。下面是具体的配置流程 (重启服务生效):
在配置文件 gremlin-server.yaml 中配置 authenticator 及其 rest-server 文件路径:
authentication: {
authenticator: org.apache.hugegraph.auth.StandardAuthenticator,
authenticationHandler: org.apache.hugegraph.auth.WsAndHttpBasicAuthHandler,
config: {tokens: conf/rest-server.properties}
}
在配置文件 rest-server.properties 中配置 authenticator 及其 graph_store 信息:
auth.authenticator=org.apache.hugegraph.auth.StandardAuthenticator
auth.graph_store=hugegraph
# auth client config
# 如果是分开部署 GraphServer 和 AuthServer,还需要指定下面的配置,地址填写 AuthServer 的 IP:RPC 端口
#auth.remote_url=127.0.0.1:8899,127.0.0.1:8898,127.0.0.1:8897
其中,graph_store 配置项是指使用哪一个图来存储用户信息,如果存在多个图的话,选取任意一个均可。
在配置文件 hugegraph{n}.properties 中配置 gremlin.graph 信息:
gremlin.graph=org.apache.hugegraph.auth.HugeFactoryAuthProxy
然后详细的权限 API 调用和说明请参考 Authentication-API 文档。
自定义用户认证系统
如果需要支持更加灵活的用户系统,可自定义 authenticator 进行扩展,自定义 authenticator 实现接口 org.apache.hugegraph.auth.HugeAuthenticator 即可,然后修改配置文件中 authenticator 配置项指向该实现。
基于鉴权模式启动
在鉴权配置完成后,需在首次执行 init-store.sh 时在命令行中输入 admin 密码 (非 docker 部署模式下)。
如果基于 docker 镜像部署或者已经初始化 HugeGraph 并需要转换为鉴权模式,需要删除相关图数据并重新启动 HugeGraph, 若图已有业务数据,暂时无法直接转换鉴权模式 (hugegraph 版本 <= 1.2.0)
对于该功能的改进已经在最新版本发布 (Docker latest 可用),可参考 PR 2411, 此时可无缝切换。
# stop the hugeGraph firstly
bin/stop-hugegraph.sh
# delete the store data (here we use the default path for rocksdb)
# Note: no need to delete data in the latest code (fixed in https://github.com/apache/incubator-hugegraph/pull/2411)
rm -rf rocksdb-data/
# init store again
bin/init-store.sh
# start hugeGraph again
bin/start-hugegraph.sh
使用 Docker 时开启鉴权模式
对于 hugegraph/hugegraph 镜像大于等于 1.2.0 的版本,我们可以在启动 docker 镜像的同时开启鉴权模式。
具体做法如下:
1. 采用 docker run
在 docker run 中添加环境变量 PASSWORD=123456 (密码可以自由设置) 即可开启鉴权模式:
docker run -itd -e PASSWORD=123456 --name=server -p 8080:8080 hugegraph/hugegraph:1.5.0
2. 采用 docker-compose
使用 docker-compose 时在环境变量中设置 PASSWORD=123456 即可:
version: '3'
services:
server:
image: hugegraph/hugegraph:1.5.0
container_name: server
ports:
- 8080:8080
environment:
- PASSWORD=123456
3. 进入容器后重新开启鉴权模式
首先进入容器:
docker exec -it server bash
# 用于快速修改配置, 修改前的文件被保存在conf-bak文件夹下
bin/enable-auth.sh
之后参照 基于鉴权模式启动 即可
4.4 - 配置 HugeGraphServer 使用 https 协议
概述
HugeGraphServer 默认使用的是 http 协议,如果用户对请求的安全性有要求,可以配置成 https。
服务端配置
修改 conf/rest-server.properties 配置文件,将 restserver.url 的 schema 部分改为 https。
# 将协议设置为 https
restserver.url=https://127.0.0.1:8080
# 服务端 keystore 文件路径,当协议为 https 时该默认值自动生效,可按需修改此项
ssl.keystore_file=conf/hugegraph-server.keystore
# 服务端 keystore 文件密码,当协议为 https 时该默认值自动生效,可按需修改此项
ssl.keystore_password=******
服务端的 conf 目录下已经给出了一个 keystore 文件 hugegraph-server.keystore,该文件的密码为 hugegraph,这两项都是在开启了 https 协议时的默认值。用户可以生成自己的 keystore 文件及密码,然后修改 ssl.keystore_file 和 ssl.keystore_password 的值。
客户端配置
在 HugeGraph-Client 中使用 https
在构造 HugeClient 时传入 https 相关的配置,代码示例:
String url = "https://localhost:8080";
String graphName = "hugegraph";
HugeClientBuilder builder = HugeClient.builder(url, graphName);
// 客户端 keystore 文件路径
String trustStoreFilePath = "hugegraph.truststore";
// 客户端 keystore 密码
String trustStorePassword = "******";
builder.configSSL(trustStoreFilePath, trustStorePassword);
HugeClient hugeClient = builder.build();
注意:HugeGraph-Client 在 1.9.0 版本以前是直接以 new 的方式创建,并且不支持 https 协议,在 1.9.0 版本以后改成以 builder 的方式创建,并支持配置 https 协议。
在 HugeGraph-Loader 中使用 https
启动导入任务时,在命令行中添加如下选项:
# https
--protocol https
# 客户端证书文件路径,当指定 --protocol 为 https 时,默认值 conf/hugegraph.truststore 自动生效,可按需修改
--trust-store-file {file}
# 客户端证书文件密码,当指定 --protocol 为 https 时,默认值 hugegraph 自动生效,可按需修改
--trust-store-password {password}
hugegraph-loader 的 conf 目录下已经放了一个默认的客户端证书文件 hugegraph.truststore,其密码是 hugegraph。
在 HugeGraph-Tools 中使用 https
执行命令时,在命令行中添加如下选项:
# 客户端证书文件路径,当 url 中使用 https 协议时,默认值 conf/hugegraph.truststore 自动生效,可按需修改
--trust-store-file {file}
# 客户端证书文件密码,当 url 中使用 https 协议时,默认值 hugegraph 自动生效,可按需修改
--trust-store-password {password}
# 执行迁移命令时,当 --target-url 中使用 https 协议时,默认值 conf/hugegraph.truststore 自动生效,可按需修改
--target-trust-store-file {target-file}
# 执行迁移命令时,当 --target-url 中使用 https 协议时,默认值 hugegraph 自动生效,可按需修改
--target-trust-store-password {target-password}
hugegraph-tools 的 conf 目录下已经放了一个默认的客户端证书文件 hugegraph.truststore,其密码是 hugegraph。
如何生成证书文件
本部分给出生成证书的示例,如果默认的证书已经够用,或者已经知晓如何生成,可跳过。
服务端
- 生成服务端私钥,并且导入到服务端 keystore 文件中,server.keystore 是给服务端用的,其中保存着自己的私钥
keytool -genkey -alias serverkey -keyalg RSA -keystore server.keystore
过程中根据需求填写描述信息,默认证书的描述信息如下:
名字和姓氏:hugegraph
组织单位名称:hugegraph
组织名称:hugegraph
城市或区域名称:BJ
州或省份名称:BJ
国家代码:CN
- 根据服务端私钥,导出服务端证书
keytool -export -alias serverkey -keystore server.keystore -file server.crt
server.crt 就是服务端的证书
客户端
keytool -import -alias serverkey -file server.crt -keystore client.truststore
client.truststore 是给客户端用的,其中保存着受信任的证书
4.5 - HugeGraph-Computer 配置
Computer Config Options
config option | default value | description |
---|---|---|
algorithm.message_class | org.apache.hugegraph.computer.core.config.Null | The class of message passed when compute vertex. |
algorithm.params_class | org.apache.hugegraph.computer.core.config.Null | The class used to transfer algorithms’ parameters before algorithm been run. |
algorithm.result_class | org.apache.hugegraph.computer.core.config.Null | The class of vertex’s value, the instance is used to store computation result for the vertex. |
allocator.max_vertices_per_thread | 10000 | Maximum number of vertices per thread processed in each memory allocator |
bsp.etcd_endpoints | http://localhost:2379 | The end points to access etcd. |
bsp.log_interval | 30000 | The log interval(in ms) to print the log while waiting bsp event. |
bsp.max_super_step | 10 | The max super step of the algorithm. |
bsp.register_timeout | 300000 | The max timeout to wait for master and workers to register. |
bsp.wait_master_timeout | 86400000 | The max timeout(in ms) to wait for master bsp event. |
bsp.wait_workers_timeout | 86400000 | The max timeout to wait for workers bsp event. |
hgkv.max_data_block_size | 65536 | The max byte size of hgkv-file data block. |
hgkv.max_file_size | 2147483648 | The max number of bytes in each hgkv-file. |
hgkv.max_merge_files | 10 | The max number of files to merge at one time. |
hgkv.temp_file_dir | /tmp/hgkv | This folder is used to store temporary files, temporary files will be generated during the file merging process. |
hugegraph.name | hugegraph | The graph name to load data and write results back. |
hugegraph.url | http://127.0.0.1:8080 | The hugegraph url to load data and write results back. |
input.edge_direction | OUT | The data of the edge in which direction is loaded, when the value is BOTH, the edges in both OUT and IN direction will be loaded. |
input.edge_freq | MULTIPLE | The frequency of edges that can exist between a pair of vertices, allowed values: [SINGLE, SINGLE_PER_LABEL, MULTIPLE]. SINGLE means that only one edge can exist between a pair of vertices, identified by sourceId + targetId; SINGLE_PER_LABEL means that one edge can exist per edge label between a pair of vertices, identified by sourceId + edgelabel + targetId; MULTIPLE means that many edges can exist between a pair of vertices, identified by sourceId + edgelabel + sortValues + targetId. |
input.filter_class | org.apache.hugegraph.computer.core.input.filter.DefaultInputFilter | The class to create input-filter object, input-filter is used to Filter vertex edges according to user needs. |
input.loader_schema_path | The schema path of loader input, only takes effect when the input.source_type=loader is enabled | |
input.loader_struct_path | The struct path of loader input, only takes effect when the input.source_type=loader is enabled | |
input.max_edges_in_one_vertex | 200 | The maximum number of adjacent edges allowed to be attached to a vertex, the adjacent edges will be stored and transferred together as a batch unit. |
input.source_type | hugegraph-server | The source type to load input data, allowed values: [‘hugegraph-server’, ‘hugegraph-loader’], the ‘hugegraph-loader’ means use hugegraph-loader load data from HDFS or file, if use ‘hugegraph-loader’ load data then please config ‘input.loader_struct_path’ and ‘input.loader_schema_path’. |
input.split_fetch_timeout | 300 | The timeout in seconds to fetch input splits |
input.split_max_splits | 10000000 | The maximum number of input splits |
input.split_page_size | 500 | The page size for streamed load input split data |
input.split_size | 1048576 | The input split size in bytes |
job.id | local_0001 | The job id on Yarn cluster or K8s cluster. |
job.partitions_count | 1 | The partitions count for computing one graph algorithm job. |
job.partitions_thread_nums | 4 | The number of threads for partition parallel compute. |
job.workers_count | 1 | The workers count for computing one graph algorithm job. |
master.computation_class | org.apache.hugegraph.computer.core.master.DefaultMasterComputation | Master-computation is computation that can determine whether to continue next superstep. It runs at the end of each superstep on master. |
output.batch_size | 500 | The batch size of output |
output.batch_threads | 1 | The threads number used to batch output |
output.hdfs_core_site_path | The hdfs core site path. | |
output.hdfs_delimiter | , | The delimiter of hdfs output. |
output.hdfs_kerberos_enable | false | Is Kerberos authentication enabled for Hdfs. |
output.hdfs_kerberos_keytab | The Hdfs’s key tab file for kerberos authentication. | |
output.hdfs_kerberos_principal | The Hdfs’s principal for kerberos authentication. | |
output.hdfs_krb5_conf | /etc/krb5.conf | Kerberos configuration file. |
output.hdfs_merge_partitions | true | Whether merge output files of multiple partitions. |
output.hdfs_path_prefix | /hugegraph-computer/results | The directory of hdfs output result. |
output.hdfs_replication | 3 | The replication number of hdfs. |
output.hdfs_site_path | The hdfs site path. | |
output.hdfs_url | hdfs://127.0.0.1:9000 | The hdfs url of output. |
output.hdfs_user | hadoop | The hdfs user of output. |
output.output_class | org.apache.hugegraph.computer.core.output.LogOutput | The class to output the computation result of each vertex. Be called after iteration computation. |
output.result_name | value | The value is assigned dynamically by #name() of instance created by WORKER_COMPUTATION_CLASS. |
output.result_write_type | OLAP_COMMON | The result write-type to output to hugegraph, allowed values are: [OLAP_COMMON, OLAP_SECONDARY, OLAP_RANGE]. |
output.retry_interval | 10 | The retry interval when output failed |
output.retry_times | 3 | The retry times when output failed |
output.single_threads | 1 | The threads number used to single output |
output.thread_pool_shutdown_timeout | 60 | The timeout seconds of output threads pool shutdown |
output.with_adjacent_edges | false | Output the adjacent edges of the vertex or not |
output.with_edge_properties | false | Output the properties of the edge or not |
output.with_vertex_properties | false | Output the properties of the vertex or not |
sort.thread_nums | 4 | The number of threads performing internal sorting. |
transport.client_connect_timeout | 3000 | The timeout(in ms) of client connect to server. |
transport.client_threads | 4 | The number of transport threads for client. |
transport.close_timeout | 10000 | The timeout(in ms) of close server or close client. |
transport.finish_session_timeout | 0 | The timeout(in ms) to finish session, 0 means using (transport.sync_request_timeout * transport.max_pending_requests). |
transport.heartbeat_interval | 20000 | The minimum interval(in ms) between heartbeats on client side. |
transport.io_mode | AUTO | The network IO Mode, either ‘NIO’, ‘EPOLL’, ‘AUTO’, the ‘AUTO’ means selecting the property mode automatically. |
transport.max_pending_requests | 8 | The max number of client unreceived ack, it will trigger the sending unavailable if the number of unreceived ack >= max_pending_requests. |
transport.max_syn_backlog | 511 | The capacity of SYN queue on server side, 0 means using system default value. |
transport.max_timeout_heartbeat_count | 120 | The maximum times of timeout heartbeat on client side, if the number of timeouts waiting for heartbeat response continuously > max_heartbeat_timeouts the channel will be closed from client side. |
transport.min_ack_interval | 200 | The minimum interval(in ms) of server reply ack. |
transport.min_pending_requests | 6 | The minimum number of client unreceived ack, it will trigger the sending available if the number of unreceived ack < min_pending_requests. |
transport.network_retries | 3 | The number of retry attempts for network communication if the network is unstable. |
transport.provider_class | org.apache.hugegraph.computer.core.network.netty.NettyTransportProvider | The transport provider, currently only supports Netty. |
transport.receive_buffer_size | 0 | The size of socket receive-buffer in bytes, 0 means using system default value. |
transport.recv_file_mode | true | Whether enable receive buffer-file mode, it will receive buffer write file from socket by zero-copy if enable. |
transport.send_buffer_size | 0 | The size of socket send-buffer in bytes, 0 means using system default value. |
transport.server_host | 127.0.0.1 | The server hostname or ip to listen on to transfer data. |
transport.server_idle_timeout | 360000 | The max timeout(in ms) of server idle. |
transport.server_port | 0 | The server port to listen on to transfer data. The system will assign a random port if it’s set to 0. |
transport.server_threads | 4 | The number of transport threads for server. |
transport.sync_request_timeout | 10000 | The timeout(in ms) to wait response after sending sync-request. |
transport.tcp_keep_alive | true | Whether enable TCP keep-alive. |
transport.transport_epoll_lt | false | Whether enable EPOLL level-trigger. |
transport.write_buffer_high_mark | 67108864 | The high water mark for write buffer in bytes, it will trigger the sending unavailable if the number of queued bytes > write_buffer_high_mark. |
transport.write_buffer_low_mark | 33554432 | The low water mark for write buffer in bytes, it will trigger the sending available if the number of queued bytes < write_buffer_low_mark. |
transport.write_socket_timeout | 3000 | The timeout(in ms) to write data to socket buffer. |
valuefile.max_segment_size | 1073741824 | The max number of bytes in each segment of value-file. |
worker.combiner_class | org.apache.hugegraph.computer.core.config.Null | Combiner can combine messages into one value for a vertex, for example page-rank algorithm can combine messages of a vertex to a sum value. |
worker.computation_class | org.apache.hugegraph.computer.core.config.Null | The class to create worker-computation object, worker-computation is used to compute each vertex in each superstep. |
worker.data_dirs | [jobs] | The directories separated by ‘,’ that received vertices and messages can persist into. |
worker.edge_properties_combiner_class | org.apache.hugegraph.computer.core.combiner.OverwritePropertiesCombiner | The combiner can combine several properties of the same edge into one properties at inputstep. |
worker.partitioner | org.apache.hugegraph.computer.core.graph.partition.HashPartitioner | The partitioner that decides which partition a vertex should be in, and which worker a partition should be in. |
worker.received_buffers_bytes_limit | 104857600 | The limit bytes of buffers of received data, the total size of all buffers can’t excess this limit. If received buffers reach this limit, they will be merged into a file. |
worker.vertex_properties_combiner_class | org.apache.hugegraph.computer.core.combiner.OverwritePropertiesCombiner | The combiner can combine several properties of the same vertex into one properties at inputstep. |
worker.wait_finish_messages_timeout | 86400000 | The max timeout(in ms) message-handler wait for finish-message of all workers. |
worker.wait_sort_timeout | 600000 | The max timeout(in ms) message-handler wait for sort-thread to sort one batch of buffers. |
worker.write_buffer_capacity | 52428800 | The initial size of write buffer that used to store vertex or message. |
worker.write_buffer_threshold | 52428800 | The threshold of write buffer, exceeding it will trigger sorting, the write buffer is used to store vertex or message. |
K8s Operator Config Options
NOTE: Option needs to be converted through environment variable settings, e.g. k8s.internal_etcd_url => INTERNAL_ETCD_URL
config option | default value | description |
---|---|---|
k8s.auto_destroy_pod | true | Whether to automatically destroy all pods when the job is completed or failed. |
k8s.close_reconciler_timeout | 120 | The max timeout(in ms) to close reconciler. |
k8s.internal_etcd_url | http://127.0.0.1:2379 | The internal etcd url for operator system. |
k8s.max_reconcile_retry | 3 | The max retry times of reconcile. |
k8s.probe_backlog | 50 | The maximum backlog for serving health probes. |
k8s.probe_port | 9892 | The value is the port that the controller bind to for serving health probes. |
k8s.ready_check_internal | 1000 | The time interval(ms) of check ready. |
k8s.ready_timeout | 30000 | The max timeout(in ms) of check ready. |
k8s.reconciler_count | 10 | The max number of reconciler thread. |
k8s.resync_period | 600000 | The minimum frequency at which watched resources are reconciled. |
k8s.timezone | Asia/Shanghai | The timezone of computer job and operator. |
k8s.watch_namespace | hugegraph-computer-system | The value is watch custom resources in the namespace, ignore other namespaces, the ‘*’ means is all namespaces will be watched. |
HugeGraph-Computer CRD
spec | default value | description | required |
---|---|---|---|
algorithmName | The name of algorithm. | true | |
jobId | The job id. | true | |
image | The image of algorithm. | true | |
computerConf | The map of computer config options. | true | |
workerInstances | The number of worker instances, it will take the place of the ‘job.workers_count’ option. | true | |
pullPolicy | Always | The pull-policy of image, detail please refer to: https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy | false |
pullSecrets | The pull-secrets of Image, detail please refer to: https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod | false | |
masterCpu | The cpu limit of master, the unit can be ’m’ or without unit detail please refer to:https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu | false | |
workerCpu | The cpu limit of worker, the unit can be ’m’ or without unit detail please refer to:https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu | false | |
masterMemory | The memory limit of master, the unit can be one of Ei、Pi、Ti、Gi、Mi、Ki detail please refer to:https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory | false | |
workerMemory | The memory limit of worker, the unit can be one of Ei、Pi、Ti、Gi、Mi、Ki detail please refer to:https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-memory | false | |
log4jXml | The content of log4j.xml for computer job. | false | |
jarFile | The jar path of computer algorithm. | false | |
remoteJarUri | The remote jar uri of computer algorithm, it will overlay algorithm image. | false | |
jvmOptions | The java startup parameters of computer job. | false | |
envVars | please refer to: https://kubernetes.io/docs/tasks/inject-data-application/define-interdependent-environment-variables/ | false | |
envFrom | please refer to: https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/ | false | |
masterCommand | bin/start-computer.sh | The run command of master, equivalent to ‘Entrypoint’ field of Docker. | false |
masterArgs | ["-r master", “-d k8s”] | The run args of master, equivalent to ‘Cmd’ field of Docker. | false |
workerCommand | bin/start-computer.sh | The run command of worker, equivalent to ‘Entrypoint’ field of Docker. | false |
workerArgs | ["-r worker", “-d k8s”] | The run args of worker, equivalent to ‘Cmd’ field of Docker. | false |
volumes | Please refer to: https://kubernetes.io/docs/concepts/storage/volumes/ | false | |
volumeMounts | Please refer to: https://kubernetes.io/docs/concepts/storage/volumes/ | false | |
secretPaths | The map of k8s-secret name and mount path. | false | |
configMapPaths | The map of k8s-configmap name and mount path. | false | |
podTemplateSpec | Please refer to: https://kubernetes.io/docs/reference/kubernetes-api/workload-resources/pod-template-v1/#PodTemplateSpec | false | |
securityContext | Please refer to: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/ | false |
KubeDriver Config Options
config option | default value | description |
---|---|---|
k8s.build_image_bash_path | The path of command used to build image. | |
k8s.enable_internal_algorithm | true | Whether enable internal algorithm. |
k8s.framework_image_url | hugegraph/hugegraph-computer:latest | The image url of computer framework. |
k8s.image_repository_password | The password for login image repository. | |
k8s.image_repository_registry | The address for login image repository. | |
k8s.image_repository_url | hugegraph/hugegraph-computer | The url of image repository. |
k8s.image_repository_username | The username for login image repository. | |
k8s.internal_algorithm | [pageRank] | The name list of all internal algorithm. |
k8s.internal_algorithm_image_url | hugegraph/hugegraph-computer:latest | The image url of internal algorithm. |
k8s.jar_file_dir | /cache/jars/ | The directory where the algorithm jar to upload location. |
k8s.kube_config | ~/.kube/config | The path of k8s config file. |
k8s.log4j_xml_path | The log4j.xml path for computer job. | |
k8s.namespace | hugegraph-computer-system | The namespace of hugegraph-computer system. |
k8s.pull_secret_names | [] | The names of pull-secret for pulling image. |
5 - API
5.1 - HugeGraph RESTful API
HugeGraph-Server 通过 HugeGraph-API 基于 HTTP 协议为 Client 提供操作图的接口,主要包括元数据和图数据的增删改查、遍历算法、变量、图操作及其他操作。
除了下方的文档,你还可以通过 localhost:8080/swagger-ui/index.html 访问 swagger-ui 以查看 RESTful API。示例可以参考此处。
5.1.1 - Schema API
1.1 Schema
HugeGraph 提供单一接口获取某个图的全部 Schema 信息,包括:PropertyKey、VertexLabel、EdgeLabel 和 IndexLabel。
Method & Url
GET http://localhost:8080/graphs/{graph_name}/schema
e.g: GET http://localhost:8080/graphs/hugegraph/schema
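对应的 curl 调用示意如下(假设服务地址为 localhost:8080 且未开启鉴权):
curl "http://localhost:8080/graphs/hugegraph/schema"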
Response Status
200
Response Body
{
"propertykeys": [
{
"id": 7,
"name": "price",
"data_type": "DOUBLE",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.316"
}
},
{
"id": 6,
"name": "date",
"data_type": "TEXT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.309"
}
},
{
"id": 3,
"name": "city",
"data_type": "TEXT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.287"
}
},
{
"id": 2,
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.280"
}
},
{
"id": 5,
"name": "lang",
"data_type": "TEXT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.301"
}
},
{
"id": 4,
"name": "weight",
"data_type": "DOUBLE",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.294"
}
},
{
"id": 1,
"name": "name",
"data_type": "TEXT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.250"
}
}
],
"vertexlabels": [
{
"id": 1,
"name": "person",
"id_strategy": "PRIMARY_KEY",
"primary_keys": [
"name"
],
"nullable_keys": [
"age",
"city"
],
"index_labels": [
"personByAge",
"personByCity",
"personByAgeAndCity"
],
"properties": [
"name",
"age",
"city"
],
"status": "CREATED",
"ttl": 0,
"enable_label_index": true,
"user_data": {
"~create_time": "2023-05-08 17:49:05.336"
}
},
{
"id": 2,
"name": "software",
"id_strategy": "CUSTOMIZE_NUMBER",
"primary_keys": [],
"nullable_keys": [],
"index_labels": [
"softwareByPrice"
],
"properties": [
"name",
"lang",
"price"
],
"status": "CREATED",
"ttl": 0,
"enable_label_index": true,
"user_data": {
"~create_time": "2023-05-08 17:49:05.347"
}
}
],
"edgelabels": [
{
"id": 1,
"name": "knows",
"source_label": "person",
"target_label": "person",
"frequency": "SINGLE",
"sort_keys": [],
"nullable_keys": [],
"index_labels": [
"knowsByWeight"
],
"properties": [
"weight",
"date"
],
"status": "CREATED",
"ttl": 0,
"enable_label_index": true,
"user_data": {
"~create_time": "2023-05-08 17:49:08.437"
}
},
{
"id": 2,
"name": "created",
"source_label": "person",
"target_label": "software",
"frequency": "SINGLE",
"sort_keys": [],
"nullable_keys": [],
"index_labels": [
"createdByDate",
"createdByWeight"
],
"properties": [
"weight",
"date"
],
"status": "CREATED",
"ttl": 0,
"enable_label_index": true,
"user_data": {
"~create_time": "2023-05-08 17:49:08.446"
}
}
],
"indexlabels": [
{
"id": 1,
"name": "personByAge",
"base_type": "VERTEX_LABEL",
"base_value": "person",
"index_type": "RANGE_INT",
"fields": [
"age"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:05.375"
}
},
{
"id": 2,
"name": "personByCity",
"base_type": "VERTEX_LABEL",
"base_value": "person",
"index_type": "SECONDARY",
"fields": [
"city"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:06.898"
}
},
{
"id": 3,
"name": "personByAgeAndCity",
"base_type": "VERTEX_LABEL",
"base_value": "person",
"index_type": "SECONDARY",
"fields": [
"age",
"city"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:07.407"
}
},
{
"id": 4,
"name": "softwareByPrice",
"base_type": "VERTEX_LABEL",
"base_value": "software",
"index_type": "RANGE_DOUBLE",
"fields": [
"price"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:07.916"
}
},
{
"id": 5,
"name": "createdByDate",
"base_type": "EDGE_LABEL",
"base_value": "created",
"index_type": "SECONDARY",
"fields": [
"date"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:08.454"
}
},
{
"id": 6,
"name": "createdByWeight",
"base_type": "EDGE_LABEL",
"base_value": "created",
"index_type": "RANGE_DOUBLE",
"fields": [
"weight"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:08.963"
}
},
{
"id": 7,
"name": "knowsByWeight",
"base_type": "EDGE_LABEL",
"base_value": "knows",
"index_type": "RANGE_DOUBLE",
"fields": [
"weight"
],
"status": "CREATED",
"user_data": {
"~create_time": "2023-05-08 17:49:09.473"
}
}
]
}
5.1.2 - PropertyKey API
1.2 PropertyKey
Params说明:
- name:属性类型名称,必填
- data_type:属性类型数据类型,包括:bool、byte、int、long、float、double、text、date、uuid、blob,默认 text 类型 (代表 string 字符串类型)
- cardinality:属性类型基数,包括:single、list、set,默认 single (代表单属性值)
请求体字段说明:
- id:属性类型id值
- properties:属性的属性,对于属性而言,此项为空
- user_data:设置属性类型的通用信息,比如可设置age属性的取值范围,最小为0,最大为100;目前此项不做任何校验,只为后期拓展提供预留入口
1.2.1 创建一个 PropertyKey
Method & Url
POST http://localhost:8080/graphs/hugegraph/schema/propertykeys
Request Body
{
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE"
}
Response Status
202
Response Body
{
"property_key": {
"id": 1,
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"~create_time": "2022-05-13 13:47:23.745"
}
},
"task_id": 0
}
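等价的 curl 调用示意如下(假设服务地址为 localhost:8080 且未开启鉴权):
curl -X POST -H "Content-Type: application/json" -d '{
  "name": "age",
  "data_type": "INT",
  "cardinality": "SINGLE"
}' "http://localhost:8080/graphs/hugegraph/schema/propertykeys"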
1.2.2 为已存在的 PropertyKey 添加或移除 userdata
Params
- action: 表示当前行为是添加还是移除,取值为 append (添加) 和 eliminate (移除)
Method & Url
PUT http://localhost:8080/graphs/hugegraph/schema/propertykeys/age?action=append
Request Body
{
"name": "age",
"user_data": {
"min": 0,
"max": 100
}
}
Response Status
202
Response Body
{
"property_key": {
"id": 1,
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"min": 0,
"max": 100,
"~create_time": "2022-05-13 13:47:23.745"
}
},
"task_id": 0
}
1.2.3 获取所有的 PropertyKey
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/propertykeys
Response Status
200
Response Body
{
"propertykeys": [
{
"id": 3,
"name": "city",
"data_type": "TEXT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 2,
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 5,
"name": "lang",
"data_type": "TEXT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 4,
"name": "weight",
"data_type": "DOUBLE",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 6,
"name": "date",
"data_type": "TEXT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 1,
"name": "name",
"data_type": "TEXT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
},
{
"id": 7,
"name": "price",
"data_type": "INT",
"cardinality": "SINGLE",
"properties": [],
"user_data": {}
}
]
}
1.2.4 根据name获取PropertyKey
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/propertykeys/age
其中,age 为要获取的 PropertyKey 的名称
Response Status
200
Response Body
{
"id": 1,
"name": "age",
"data_type": "INT",
"cardinality": "SINGLE",
"aggregate_type": "NONE",
"write_type": "OLTP",
"properties": [],
"status": "CREATED",
"user_data": {
"min": 0,
"max": 100,
"~create_time": "2022-05-13 13:47:23.745"
}
}
1.2.5 根据 name 删除 PropertyKey
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/schema/propertykeys/age
其中,age 为要删除的 PropertyKey 的名称
Response Status
202
Response Body
{
"task_id" : 0
}
5.1.3 - VertexLabel API
1.3 VertexLabel
假设已经创建好了1.1.3中列出来的 PropertyKeys
Params说明
- id:顶点类型id值
- name:顶点类型名称,必填
- id_strategy: 顶点类型的ID策略,主键ID、自动生成、自定义字符串、自定义数字、自定义UUID,默认主键ID
- properties: 顶点类型关联的属性类型
- primary_keys: 主键属性,当ID策略为PRIMARY_KEY时必须有值,其他ID策略时必须为空;
- enable_label_index: 是否开启类型索引,默认关闭
- index_names:顶点类型创建的索引,详情见3.4
- nullable_keys:可为空的属性
- user_data:设置顶点类型的通用信息,作用同属性类型
1.3.1 创建一个VertexLabel
Method & Url
POST http://localhost:8080/graphs/hugegraph/schema/vertexlabels
Request Body
{
"name": "person",
"id_strategy": "DEFAULT",
"properties": [
"name",
"age"
],
"primary_keys": [
"name"
],
"nullable_keys": [],
"enable_label_index": true
}
Response Status
201
Response Body
{
"id": 1,
"primary_keys": [
"name"
],
"id_strategy": "PRIMARY_KEY",
"name": "person2",
"index_names": [
],
"properties": [
"name",
"age"
],
"nullable_keys": [
],
"enable_label_index": true,
"user_data": {}
}
从 hugegraph-server v0.11.2 版本开始支持顶点的 TTL 功能。顶点的 TTL 是通过 VertexLabel 来设置的。比如希望 person 类型的顶点存活时间为一天,需要在创建 person VertexLabel 的时候将 TTL 字段设置为 86400000,即单位为毫秒。
{
"name": "person",
"id_strategy": "DEFAULT",
"properties": [
"name",
"age"
],
"primary_keys": [
"name"
],
"nullable_keys": [],
"ttl": 86400000,
"enable_label_index": true
}
另外,当顶点中带有"创建时间"的属性且希望以"创建时间"属性作为计算顶点存活时间的起点时,可以设置 VertexLabel 中的 ttl_start_time 字段。比如 person VertexLabel 有 createdTime 属性,且 createdTime 是 Date 类型的参数,希望 person 类型的顶点从创建开始存活一天的时间,那么创建 person VertexLabel 的 Request Body 如下:
{
"name": "person",
"id_strategy": "DEFAULT",
"properties": [
"name",
"age",
"createdTime"
],
"primary_keys": [
"name"
],
"nullable_keys": [],
"ttl": 86400000,
"ttl_start_time": "createdTime",
"enable_label_index": true
}
1.3.2 为已存在的VertexLabel添加properties或userdata,或者移除userdata(目前不支持移除properties)
Params
- action: 表示当前行为是添加还是移除,取值为 append (添加) 和 eliminate (移除)
Method & Url
PUT http://localhost:8080/graphs/hugegraph/schema/vertexlabels/person?action=append
Request Body
{
"name": "person",
"properties": [
"city"
],
"nullable_keys": ["city"],
"user_data": {
"super": "animal"
}
}
Response Status
200
Response Body
{
"id": 1,
"primary_keys": [
"name"
],
"id_strategy": "PRIMARY_KEY",
"name": "person",
"index_names": [
],
"properties": [
"city",
"name",
"age"
],
"nullable_keys": [
"city"
],
"enable_label_index": true,
"user_data": {
"super": "animal"
}
}
1.3.3 获取所有的VertexLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/vertexlabels
Response Status
200
Response Body
{
"vertexlabels": [
{
"id": 1,
"primary_keys": [
"name"
],
"id_strategy": "PRIMARY_KEY",
"name": "person",
"index_names": [
],
"properties": [
"city",
"name",
"age"
],
"nullable_keys": [
"city"
],
"enable_label_index": true,
"user_data": {
"super": "animal"
}
},
{
"id": 2,
"primary_keys": [
"name"
],
"id_strategy": "PRIMARY_KEY",
"name": "software",
"index_names": [
],
"properties": [
"price",
"name",
"lang"
],
"nullable_keys": [
"price"
],
"enable_label_index": false,
"user_data": {}
}
]
}
1.3.4 根据name获取VertexLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/vertexlabels/person
Response Status
200
Response Body
{
"id": 1,
"primary_keys": [
"name"
],
"id_strategy": "PRIMARY_KEY",
"name": "person",
"index_names": [
],
"properties": [
"city",
"name",
"age"
],
"nullable_keys": [
"city"
],
"enable_label_index": true,
"user_data": {
"super": "animal"
}
}
1.3.5 根据name删除VertexLabel
删除 VertexLabel 会导致删除对应的顶点以及相关的索引数据,会产生一个异步任务
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/schema/vertexlabels/person
Response Status
202
Response Body
{
"task_id": 1
}
注:可以通过 GET http://localhost:8080/graphs/hugegraph/tasks/1 (其中 "1" 是 task_id) 来查询异步任务的执行状态,更多内容请参考异步任务 RESTful API
5.1.4 - EdgeLabel API
1.4 EdgeLabel
假设已经创建好了1.2.3中的 PropertyKeys 和 1.3.3中的 VertexLabels
Params说明
- name:顶点类型名称,必填
- source_label: 源顶点类型的名称,必填
- target_label: 目标顶点类型的名称,必填
- frequency:两个点之间是否可以有多条边,可以取值SINGLE和MULTIPLE,非必填,默认值SINGLE
- properties: 边类型关联的属性类型,选填
- sort_keys: 当允许关联多次时,指定区分键属性列表
- nullable_keys:可为空的属性,选填,默认可为空
- enable_label_index: 是否开启类型索引,默认关闭
1.4.1 创建一个EdgeLabel
Method & Url
POST http://localhost:8080/graphs/hugegraph/schema/edgelabels
Request Body
{
"name": "created",
"source_label": "person",
"target_label": "software",
"frequency": "SINGLE",
"properties": [
"date"
],
"sort_keys": [],
"nullable_keys": [],
"enable_label_index": true
}
Response Status
201
Response Body
{
"id": 1,
"sort_keys": [
],
"source_label": "person",
"name": "created",
"index_names": [
],
"properties": [
"date"
],
"target_label": "software",
"frequency": "SINGLE",
"nullable_keys": [
],
"enable_label_index": true,
"user_data": {}
}
从 hugegraph-server v0.11.2 版本开始支持边的 TTL 功能。边的 TTL 是通过 EdgeLabel 来设置的。比如希望 knows 类型的边存活时间为一天,需要在创建 knows EdgeLabel 的时候将 TTL 字段设置为 86400000,即单位为毫秒。
{
"id": 1,
"sort_keys": [
],
"source_label": "person",
"name": "knows",
"index_names": [
],
"properties": [
"date",
"createdTime"
],
"target_label": "person",
"frequency": "SINGLE",
"nullable_keys": [
],
"enable_label_index": true,
"ttl": 86400000,
"user_data": {}
}
另外,当边中带有"创建时间"的属性且希望以"创建时间"属性作为计算边存活时间的起点时,可以设置 EdgeLabel 中的 ttl_start_time 字段。比如 knows EdgeLabel 有 createdTime 属性,且 createdTime 是 Date 类型的参数,希望 knows 类型的边从创建开始存活一天的时间,那么创建 knows EdgeLabel 的 Request Body 如下:
{
"id": 1,
"sort_keys": [
],
"source_label": "person",
"name": "knows",
"index_names": [
],
"properties": [
"date",
"createdTime"
],
"target_label": "person",
"frequency": "SINGLE",
"nullable_keys": [
],
"enable_label_index": true,
"ttl": 86400000,
"ttl_start_time": "createdTime",
"user_data": {}
}
1.4.2 为已存在的EdgeLabel添加properties或userdata,或者移除userdata(目前不支持移除properties)
Params
- action: 表示当前行为是添加还是移除,取值为 append (添加) 和 eliminate (移除)
Method & Url
PUT http://localhost:8080/graphs/hugegraph/schema/edgelabels/created?action=append
Request Body
{
"name": "created",
"properties": [
"weight"
],
"nullable_keys": [
"weight"
]
}
Response Status
200
Response Body
{
"id": 2,
"sort_keys": [
],
"source_label": "person",
"name": "created",
"index_names": [
],
"properties": [
"date",
"weight"
],
"target_label": "software",
"frequency": "SINGLE",
"nullable_keys": [
"weight"
],
"enable_label_index": true,
"user_data": {}
}
1.4.3 获取所有的EdgeLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/edgelabels
Response Status
200
Response Body
{
"edgelabels": [
{
"id": 1,
"sort_keys": [
],
"source_label": "person",
"name": "created",
"index_names": [
],
"properties": [
"date",
"weight"
],
"target_label": "software",
"frequency": "SINGLE",
"nullable_keys": [
"weight"
],
"enable_label_index": true,
"user_data": {}
},
{
"id": 2,
"sort_keys": [
],
"source_label": "person",
"name": "knows",
"index_names": [
],
"properties": [
"date",
"weight"
],
"target_label": "person",
"frequency": "SINGLE",
"nullable_keys": [
],
"enable_label_index": false,
"user_data": {}
}
]
}
1.4.4 根据name获取EdgeLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/edgelabels/created
Response Status
200
Response Body
{
"id": 1,
"sort_keys": [
],
"source_label": "person",
"name": "created",
"index_names": [
],
"properties": [
"date",
"city",
"weight"
],
"target_label": "software",
"frequency": "SINGLE",
"nullable_keys": [
"city",
"weight"
],
"enable_label_index": true,
"user_data": {}
}
1.4.5 根据name删除EdgeLabel
删除 EdgeLabel 会导致删除对应的边以及相关的索引数据,会产生一个异步任务
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/schema/edgelabels/created
Response Status
202
Response Body
{
"task_id": 1
}
注:可以通过 GET http://localhost:8080/graphs/hugegraph/tasks/1 (其中 "1" 是 task_id) 来查询异步任务的执行状态,更多内容请参考异步任务 RESTful API
5.1.5 - IndexLabel API
1.5 IndexLabel
假设已经创建好了1.1.3中的 PropertyKeys 、1.2.3中的 VertexLabels 以及 1.3.3中的 EdgeLabels
1.5.1 创建一个IndexLabel
Method & Url
POST http://localhost:8080/graphs/hugegraph/schema/indexlabels
Request Body
{
"name": "personByCity",
"base_type": "VERTEX_LABEL",
"base_value": "person",
"index_type": "SECONDARY",
"fields": [
"city"
]
}
Response Status
202
Response Body
{
"index_label": {
"id": 1,
"base_type": "VERTEX_LABEL",
"base_value": "person",
"name": "personByCity",
"fields": [
"city"
],
"index_type": "SECONDARY"
},
"task_id": 2
}
1.5.2 获取所有的IndexLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/indexlabels
Response Status
200
Response Body
{
"indexlabels": [
{
"id": 3,
"base_type": "VERTEX_LABEL",
"base_value": "software",
"name": "softwareByPrice",
"fields": [
"price"
],
"index_type": "RANGE"
},
{
"id": 4,
"base_type": "EDGE_LABEL",
"base_value": "created",
"name": "createdByDate",
"fields": [
"date"
],
"index_type": "SECONDARY"
},
{
"id": 1,
"base_type": "VERTEX_LABEL",
"base_value": "person",
"name": "personByCity",
"fields": [
"city"
],
"index_type": "SECONDARY"
},
{
"id": 3,
"base_type": "VERTEX_LABEL",
"base_value": "person",
"name": "personByAgeAndCity",
"fields": [
"age",
"city"
],
"index_type": "SECONDARY"
}
]
}
1.5.3 根据name获取IndexLabel
Method & Url
GET http://localhost:8080/graphs/hugegraph/schema/indexlabels/personByCity
Response Status
200
Response Body
{
"id": 1,
"base_type": "VERTEX_LABEL",
"base_value": "person",
"name": "personByCity",
"fields": [
"city"
],
"index_type": "SECONDARY"
}
1.5.4 根据name删除IndexLabel
删除 IndexLabel 会导致删除相关的索引数据,会产生一个异步任务
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/schema/indexlabels/personByCity
Response Status
202
Response Body
{
"task_id": 1
}
注:可以通过 GET http://localhost:8080/graphs/hugegraph/tasks/1 (其中 "1" 是 task_id) 来查询异步任务的执行状态,更多内容请参考异步任务 RESTful API
5.1.6 - Rebuild API
1.6 Rebuild
1.6.1 重建IndexLabel
Method & Url
PUT http://localhost:8080/graphs/hugegraph/jobs/rebuild/indexlabels/personByCity
Response Status
202
Response Body
{
"task_id": 1
}
Note: the execution status of the asynchronous task can be queried via
GET http://localhost:8080/graphs/hugegraph/tasks/1
(where "1" is the task_id); see the asynchronous task RESTful API for more details.
1.6.2 Rebuild all indexes of a VertexLabel
Method & Url
PUT http://localhost:8080/graphs/hugegraph/jobs/rebuild/vertexlabels/person
Response Status
202
Response Body
{
"task_id": 2
}
Note: the execution status of the asynchronous task can be queried via
GET http://localhost:8080/graphs/hugegraph/tasks/2
(where "2" is the task_id); see the asynchronous task RESTful API for more details.
1.6.3 Rebuild all indexes of an EdgeLabel
Method & Url
PUT http://localhost:8080/graphs/hugegraph/jobs/rebuild/edgelabels/created
Response Status
202
Response Body
{
"task_id": 3
}
Note: the execution status of the asynchronous task can be queried via
GET http://localhost:8080/graphs/hugegraph/tasks/3
(where "3" is the task_id); see the asynchronous task RESTful API for more details.
5.1.7 - Vertex API
2.1 Vertex
The Id strategy of the vertex label determines the type of the vertex Id; the corresponding id types are as follows:
Id_Strategy | id type |
---|---|
AUTOMATIC | number |
PRIMARY_KEY | string |
CUSTOMIZE_STRING | string |
CUSTOMIZE_NUMBER | number |
CUSTOMIZE_UUID | uuid |
For the vertex GET/PUT/DELETE APIs, the id part of the URL must carry its type information, which is expressed by whether or not the id is quoted as a JSON string:
- when the id type is number, the id in the URL is unquoted, e.g. xxx/vertices/123456
- when the id type is string, the id in the URL is quoted, e.g. xxx/vertices/"123456"
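For example, with curl (a sketch assuming the string-id vertex "1:marko" created below and a hypothetical number-id vertex 123456; the double quotes are part of the URL path for string ids, so the whole URL is wrapped in single quotes for the shell):
curl -X GET 'http://localhost:8080/graphs/hugegraph/graph/vertices/"1:marko"'
curl -X GET 'http://localhost:8080/graphs/hugegraph/graph/vertices/123456'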
The following examples first require the graph schema to be created with this groovy script
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("weight").asDouble().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
schema.propertyKey("price").asDouble().ifNotExist().create();
schema.propertyKey("hobby").asText().valueList().ifNotExist().create();
schema.vertexLabel("person").properties("name", "age", "city", "weight", "hobby").primaryKeys("name").nullableKeys("age", "city", "weight", "hobby").ifNotExist().create();
schema.vertexLabel("software").properties("name", "lang", "price").primaryKeys("name").nullableKeys("lang", "price").ifNotExist().create();
schema.indexLabel("personByAge").onV("person").by("age").range().ifNotExist().create();
2.1.1 Create a vertex
Method & Url
POST http://localhost:8080/graphs/hugegraph/graph/vertices
Request Body
{
"label": "person",
"properties": {
"name": "marko",
"age": 29
}
}
Response Status
201
Response Body
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 29
}
}
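The same request issued with curl (assuming the server listens on localhost:8080, as in the other examples):
curl -X POST -H "Content-Type: application/json" -d '{"label":"person","properties":{"name":"marko","age":29}}' http://localhost:8080/graphs/hugegraph/graph/vertices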
2.1.2 Create multiple vertices
Method & Url
POST http://localhost:8080/graphs/hugegraph/graph/vertices/batch
Request Body
[
{
"label": "person",
"properties": {
"name": "marko",
"age": 29
}
},
{
"label": "software",
"properties": {
"name": "ripple",
"lang": "java",
"price": 199
}
}
]
Response Status
201
Response Body
[
"1:marko",
"2:ripple"
]
2.1.3 Update vertex properties
Method & Url
PUT http://127.0.0.1:8080/graphs/hugegraph/graph/vertices/"1:marko"?action=append
Request Body
{
"label": "person",
"properties": {
"age": 30,
"city": "Beijing"
}
}
Note: there are three property cardinalities: single, set and list. For single, the value is added or updated; for set or list, the value is appended.
Response Status
200
Response Body
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 30,
"city": "Beijing"
}
}
2.1.4 Batch update vertex properties
Description
When updating vertex properties in batch, an update strategy can be chosen per property:
- SUM: numeric accumulation
- BIGGER: keep the larger of the old and new value (number or date)
- SMALLER: keep the smaller of the old and new value (number or date)
- UNION: take the union of Set properties
- INTERSECTION: take the intersection of Set properties
- APPEND: append elements to a List property
- ELIMINATE: remove elements from a List/Set property
- OVERRIDE: override the existing property; if the new value is null, the old value is kept
Assume the original vertices have the following properties:
{
"vertices": [
{
"id": "2:lop",
"label": "software",
"type": "vertex",
"properties": {
"name": "lop",
"lang": "java",
"price": 328
}
},
{
"id": "1:josh",
"label": "person",
"type": "vertex",
"properties": {
"name": "josh",
"age": 32,
"city": "Beijing",
"weight": 0.1,
"hobby": [
"reading",
"football"
]
}
}
]
}
Add the vertices with the following command:
curl -H "Content-Type: application/json" -d '[{"label":"person","properties":{"name":"josh","age":32,"city":"Beijing","weight":0.1,"hobby":["reading","football"]}},{"label":"software","properties":{"name":"lop","lang":"java","price":328}}]' http://127.0.0.1:8080/graphs/hugegraph/graph/vertices/batch
Method & Url
PUT http://127.0.0.1:8080/graphs/hugegraph/graph/vertices/batch
Request Body
{
"vertices": [
{
"label": "software",
"type": "vertex",
"properties": {
"name": "lop",
"lang": "c++",
"price": 299
}
},
{
"label": "person",
"type": "vertex",
"properties": {
"name": "josh",
"city": "Shanghai",
"weight": 0.2,
"hobby": [
"swimming"
]
}
}
],
"update_strategies": {
"price": "BIGGER",
"age": "OVERRIDE",
"city": "OVERRIDE",
"weight": "SUM",
"hobby": "UNION"
},
"create_if_not_exist": true
}
Response Status
200
Response Body
{
"vertices": [
{
"id": "2:lop",
"label": "software",
"type": "vertex",
"properties": {
"name": "lop",
"lang": "c++",
"price": 328
}
},
{
"id": "1:josh",
"label": "person",
"type": "vertex",
"properties": {
"name": "josh",
"age": 32,
"city": "Shanghai",
"weight": 0.3,
"hobby": [
"reading",
"football",
"swimming"
]
}
}
]
}
Analysis of the result:
- the lang property has no update strategy, so the new value simply overrides the old value, whether or not the new value is null;
- the price property uses the BIGGER strategy; the old value is 328 and the new value is 299, so the old value 328 is kept;
- the age property uses the OVERRIDE strategy, but the new vertex does not carry age (i.e. age is null), so the original value 32 is kept;
- the city property also uses the OVERRIDE strategy and the new value is not null, so the old value is overridden;
- the weight property uses the SUM strategy; the old value is 0.1 and the new value is 0.2, so the final value is 0.3;
- the hobby property (cardinality Set) uses the UNION strategy, so the new and old values are merged;
The other update strategies are used in a similar way and are not detailed here.
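As a hedged illustration, removing an element from the hobby list with the ELIMINATE strategy could look like this (assuming the josh vertex above still exists):
curl -X PUT -H "Content-Type: application/json" -d '{"vertices":[{"label":"person","type":"vertex","properties":{"name":"josh","hobby":["swimming"]}}],"update_strategies":{"hobby":"ELIMINATE"},"create_if_not_exist":true}' http://127.0.0.1:8080/graphs/hugegraph/graph/vertices/batch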
2.1.5 Delete vertex properties
Method & Url
PUT http://127.0.0.1:8080/graphs/hugegraph/graph/vertices/"1:marko"?action=eliminate
Request Body
{
"label": "person",
"properties": {
"city": "Beijing"
}
}
Note: this removes the property entirely (the key and all values), regardless of whether its cardinality is single, set or list.
Response Status
200
Response Body
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 30
}
}
2.1.6 Get vertices matching conditions
Params
- label: the vertex label
- properties: property key-value pairs (a property can only be used as a query condition if an index has been built on it)
- limit: the maximum number of results
- page: the page number
All of the above parameters are optional, but if page is provided, limit must also be provided and no other parameters may be used. label, properties and limit can be combined arbitrarily.
Property key-value pairs form a JSON object of property names and values; multiple key-value pairs can be used as query conditions. Property values support exact matching and range matching: an exact match looks like properties={"age":29}, a range match looks like properties={"age":"P.gt(29)"}. Range matching supports the following expressions:
Expression | Description |
---|---|
P.eq(number) | vertices whose property value equals number |
P.neq(number) | vertices whose property value does not equal number |
P.lt(number) | vertices whose property value is less than number |
P.lte(number) | vertices whose property value is less than or equal to number |
P.gt(number) | vertices whose property value is greater than number |
P.gte(number) | vertices whose property value is greater than or equal to number |
P.between(number1,number2) | vertices whose property value is greater than or equal to number1 and less than number2 |
P.inside(number1,number2) | vertices whose property value is greater than number1 and less than number2 |
P.outside(number1,number2) | vertices whose property value is less than number1 or greater than number2 |
P.within(value1,value2,value3,…) | vertices whose property value equals any one of the given values |
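As a hedged example, a range match on the indexed age property (using the personByAge range index created in the schema above; curl's -g flag disables URL globbing so the braces pass through literally, and strictly speaking the JSON should be URL-encoded):
curl -g -X GET 'http://localhost:8080/graphs/hugegraph/graph/vertices?label=person&properties={"age":"P.between(20,30)"}&limit=10'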
Query all vertices with label person and age 29
Method & Url
GET http://localhost:8080/graphs/hugegraph/graph/vertices?label=person&properties={"age":29}&limit=1
Response Status
200
Response Body
{
"vertices": [
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 30
}
}
]
}
Query all vertices with paging, get the first page (page without a value), limited to 3 results
Add more vertices with the following command:
curl -H "Content-Type: application/json" -d '[{"label":"person","properties":{"name":"peter","age":29,"city":"Shanghai"}},{"label":"person","properties":{"name":"vadas","age":27,"city":"Hongkong"}}]' http://localhost:8080/graphs/hugegraph/graph/vertices/batch
Method & Url
GET http://localhost:8080/graphs/hugegraph/graph/vertices?page&limit=3
Response Status
200
Response Body
{
"vertices": [
{
"id": "2:lop",
"label": "software",
"type": "vertex",
"properties": {
"name": "lop",
"lang": "c++",
"price": 328
}
},
{
"id": "1:josh",
"label": "person",
"type": "vertex",
"properties": {
"name": "josh",
"age": 32,
"city": "Shanghai",
"weight": 0.3,
"hobby": [
"reading",
"football",
"swimming"
]
}
},
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 30
}
}
],
"page": "CIYxOnBldGVyAAAAAAAAAAM="
}
The returned body carries the page token of the next page, "page": "CIYxOnBldGVyAAAAAAAAAAM="; pass this value as the page parameter when querying the next page.
Query all vertices with paging, get the next page (page set to the value returned by the previous page), limited to 3 results
Method & Url
GET http://localhost:8080/graphs/hugegraph/graph/vertices?page=CIYxOnBldGVyAAAAAAAAAAM=&limit=3
Response Status
200
Response Body
{
"vertices": [
{
"id": "1:peter",
"label": "person",
"type": "vertex",
"properties": {
"name": "peter",
"age": 29,
"city": "Shanghai"
}
},
{
"id": "1:vadas",
"label": "person",
"type": "vertex",
"properties": {
"name": "vadas",
"age": 27,
"city": "Hongkong"
}
},
{
"id": "2:ripple",
"label": "software",
"type": "vertex",
"properties": {
"name": "ripple",
"lang": "java",
"price": 199
}
}
],
"page": null
}
当"page": null
时,表示已经没有下一页了(注:如果后端使用的是 Cassandra ,为了提高性能,当返回的页数刚好是最后一页时,返回的 page
值可能不为空,但是如果用这个 page
值再请求下一页数据时,就会返回 空数据
和 page = null
,其他情况也类似)
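A minimal shell sketch for walking through all pages (assuming jq is installed; the page token is passed back as returned, matching the raw form used in the examples above, although it may need URL-encoding depending on its content):
# first page: 'page' without a value, as in the example above
next="page"
while true; do
  resp=$(curl -s -g "http://localhost:8080/graphs/hugegraph/graph/vertices?${next}&limit=3")
  echo "${resp}"
  token=$(echo "${resp}" | jq -r '.page // empty')   # empty string when page is null
  [ -z "${token}" ] && break
  next="page=${token}"
done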
2.1.7 Get a vertex by Id
Method & Url
GET http://localhost:8080/graphs/hugegraph/graph/vertices/"1:marko"
Response Status
200
Response Body
{
"id": "1:marko",
"label": "person",
"type": "vertex",
"properties": {
"name": "marko",
"age": 30
}
}
2.1.8 Delete a vertex by Id
Params
- label: the vertex label, optional
Delete a vertex by Id only
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/graph/vertices/"1:marko"
Response Status
204
Delete a vertex by Label + Id
Deleting a vertex by specifying both the label parameter and the Id generally performs better than deleting by Id only.
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/graph/vertices/"1:marko"?label=person
Response Status
204
5.1.8 - Edge API
2.2 Edge
The change to the vertex id format also affects the edge id and the ids of the source and target vertices.
An EdgeId is the concatenation of src-vertex-id + direction + label + sort-values + tgt-vertex-id, but here the vertex id type is distinguished by a prefix rather than by quotes:
- when the id type is number, the vertex ids inside the EdgeId carry the prefix L, e.g. "L123456>1>>L987654"
- when the id type is string, the vertex ids inside the EdgeId carry the prefix S, e.g. "S1:peter>1>>S2:lop"
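Because an EdgeId contains > characters, quote the URL when calling the API from a shell; for example, fetching the edge created in 2.2.1 below:
curl -X GET 'http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>2>>S2:lop'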
The following examples first require the graph schema to be created with this groovy script
import org.apache.hugegraph.HugeFactory
import org.apache.tinkerpop.gremlin.structure.T
conf = "conf/graphs/hugegraph.properties"
graph = HugeFactory.open(conf)
schema = graph.schema()
schema.propertyKey("name").asText().ifNotExist().create()
schema.propertyKey("age").asInt().ifNotExist().create()
schema.propertyKey("city").asText().ifNotExist().create()
schema.propertyKey("weight").asDouble().ifNotExist().create()
schema.propertyKey("lang").asText().ifNotExist().create()
schema.propertyKey("date").asText().ifNotExist().create()
schema.propertyKey("price").asInt().ifNotExist().create()
schema.vertexLabel("person").properties("name", "age", "city").primaryKeys("name").ifNotExist().create()
schema.vertexLabel("software").properties("name", "lang", "price").primaryKeys("name").ifNotExist().create()
schema.indexLabel("personByCity").onV("person").by("city").secondary().ifNotExist().create()
schema.indexLabel("personByAgeAndCity").onV("person").by("age", "city").secondary().ifNotExist().create()
schema.indexLabel("softwareByPrice").onV("software").by("price").range().ifNotExist().create()
schema.edgeLabel("knows").sourceLabel("person").targetLabel("person").properties("date", "weight").ifNotExist().create()
schema.edgeLabel("created").sourceLabel("person").targetLabel("software").properties("date", "weight").ifNotExist().create()
schema.indexLabel("createdByDate").onE("created").by("date").secondary().ifNotExist().create()
schema.indexLabel("createdByWeight").onE("created").by("weight").range().ifNotExist().create()
schema.indexLabel("knowsByWeight").onE("knows").by("weight").range().ifNotExist().create()
marko = graph.addVertex(T.label, "person", "name", "marko", "age", 29, "city", "Beijing")
vadas = graph.addVertex(T.label, "person", "name", "vadas", "age", 27, "city", "Hongkong")
lop = graph.addVertex(T.label, "software", "name", "lop", "lang", "java", "price", 328)
josh = graph.addVertex(T.label, "person", "name", "josh", "age", 32, "city", "Beijing")
ripple = graph.addVertex(T.label, "software", "name", "ripple", "lang", "java", "price", 199)
peter = graph.addVertex(T.label, "person", "name", "peter", "age", 35, "city", "Shanghai")
graph.tx().commit()
g = graph.traversal()
2.2.1 Create an edge
Params
Path parameters:
- graph: the graph to operate on
Request body:
- label: the edge label name, required
- outV: the source vertex id, required
- inV: the target vertex id, required
- outVLabel: the source vertex label, required
- inVLabel: the target vertex label, required
- properties: the properties attached to the edge, each given as:
- name: the property name
- value: the property value
Method & Url
POST http://localhost:8080/graphs/hugegraph/graph/edges
Request Body
{
"label": "created",
"outV": "1:marko",
"inV": "2:lop",
"outVLabel": "person",
"inVLabel": "software",
"properties": {
"date": "20171210",
"weight": 0.4
}
}
Response Status
201
Response Body
{
"id": "S1:marko>2>>S2:lop",
"label": "created",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "2:lop",
"inVLabel": "software",
"properties": {
"weight": 0.4,
"date": "20171210"
}
}
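The same request with curl (assuming the two vertices created earlier exist):
curl -X POST -H "Content-Type: application/json" -d '{"label":"created","outV":"1:marko","inV":"2:lop","outVLabel":"person","inVLabel":"software","properties":{"date":"20171210","weight":0.4}}' http://localhost:8080/graphs/hugegraph/graph/edges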
2.2.2 Create multiple edges
Params
Path parameters:
- graph: the graph to operate on
Request parameters:
- check_vertex: whether to check that the vertices exist (true | false); when true, an error is reported if the source or target vertex of an edge to be inserted does not exist; defaults to true
Request body:
- a list of edge objects
Method & Url
POST http://localhost:8080/graphs/hugegraph/graph/edges/batch
Request Body
[
{
"label": "knows",
"outV": "1:marko",
"inV": "1:vadas",
"outVLabel": "person",
"inVLabel": "person",
"properties": {
"date": "20160110",
"weight": 0.5
}
},
{
"label": "knows",
"outV": "1:marko",
"inV": "1:josh",
"outVLabel": "person",
"inVLabel": "person",
"properties": {
"date": "20130220",
"weight": 1.0
}
}
]
Response Status
201
Response Body
[
"S1:marko>1>>S1:vadas",
"S1:marko>1>>S1:josh"
]
2.2.3 Update edge properties
Params
Path parameters:
- graph: the graph to operate on
- id: the id of the edge to operate on
Request parameters:
- action: the append operation
Request body:
- the edge information
Method & Url
PUT http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>2>>S2:lop?action=append
Request Body
{
"properties": {
"weight": 1.0
}
}
Note: there are three property cardinalities: single, set and list. For single, the value is added or updated; for set or list, the value is appended.
Response Status
200
Response Body
{
"id": "S1:marko>2>>S2:lop",
"label": "created",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "2:lop",
"inVLabel": "software",
"properties": {
"weight": 1.0,
"date": "20171210"
}
}
2.2.4 Batch update edge properties
Params
Path parameters:
- graph: the graph to operate on
Request body:
- edges: a list of edge objects
- update_strategies: an update strategy can be set per property, including:
- SUM: number types only
- BIGGER/SMALLER: date/number types only
- UNION/INTERSECTION: set types only
- APPEND/ELIMINATE: collection types only
- OVERRIDE
- check_vertex: whether to check that the vertices exist (true | false); when true, an error is reported if the source or target vertex of an edge to be inserted does not exist; defaults to true
- create_if_not_exist: currently only true is supported
Method & Url
PUT http://127.0.0.1:8080/graphs/hugegraph/graph/edges/batch
Request Body
{
"edges": [
{
"label": "knows",
"outV": "1:marko",
"inV": "1:vadas",
"outVLabel": "person",
"inVLabel": "person",
"properties": {
"date": "20160111",
"weight": 1.0
}
},
{
"label": "knows",
"outV": "1:marko",
"inV": "1:josh",
"outVLabel": "person",
"inVLabel": "person",
"properties": {
"date": "20130221",
"weight": 0.5
}
}
],
"update_strategies": {
"weight": "SUM",
"date": "OVERRIDE"
},
"check_vertex": false,
"create_if_not_exist": true
}
Response Status
200
Response Body
{
"edges": [
{
"id": "S1:marko>1>>S1:vadas",
"label": "knows",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "1:vadas",
"inVLabel": "person",
"properties": {
"weight": 1.5,
"date": "20160111"
}
},
{
"id": "S1:marko>1>>S1:josh",
"label": "knows",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "1:josh",
"inVLabel": "person",
"properties": {
"weight": 1.5,
"date": "20130221"
}
}
]
}
2.2.5 Delete edge properties
Params
Path parameters:
- graph: the graph to operate on
- id: the id of the edge to operate on
Request parameters:
- action: the eliminate operation
Request body:
- the edge information
Method & Url
PUT http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>2>>S2:lop?action=eliminate
Request Body
{
"properties": {
"weight": 1.0
}
}
Note: this removes the property entirely (the key and all values), regardless of whether its cardinality is single, set or list
Response Status
400
Response Body
A property that has not been declared nullable cannot be removed:
{
"exception": "class java.lang.IllegalArgumentException",
"message": "Can't remove non-null edge property 'p[weight->1.0]'",
"cause": ""
}
2.2.6 Get edges matching conditions
Params
Path parameters:
- graph: the graph to operate on
Request parameters:
- vertex_id: the vertex id
- direction: the edge direction (OUT | IN | BOTH), defaults to BOTH
- label: the edge label
- properties: property key-value pairs (a property can only be used as a query condition if an index has been built on it)
- keep_start_p: defaults to false; when set to true, range-match expressions in the input are not parsed, so properties={"age":"P.gt(0.8)"} is then treated as an exact match, i.e. the age property must equal the string "P.gt(0.8)"
- offset: the offset, defaults to 0
- limit: the number of results, defaults to 100
- page: the page number
Property key-value pairs form a JSON object of property names and values; multiple key-value pairs can be used as query conditions. Property values support exact matching and range matching: an exact match looks like properties={"weight":0.8}, a range match looks like properties={"age":"P.gt(0.8)"}. Range matching supports the following expressions:
Expression | Description |
---|---|
P.eq(number) | edges whose property value equals number |
P.neq(number) | edges whose property value does not equal number |
P.lt(number) | edges whose property value is less than number |
P.lte(number) | edges whose property value is less than or equal to number |
P.gt(number) | edges whose property value is greater than number |
P.gte(number) | edges whose property value is greater than or equal to number |
P.between(number1,number2) | edges whose property value is greater than or equal to number1 and less than number2 |
P.inside(number1,number2) | edges whose property value is greater than number1 and less than number2 |
P.outside(number1,number2) | edges whose property value is less than number1 or greater than number2 |
P.within(value1,value2,value3,…) | edges whose property value equals any one of the given values |
P.textcontains(value) | edges whose property value contains the given value (string type) |
P.contains(value) | edges whose property value contains the given value (collection type) |
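As a hedged example, a range match on the indexed weight property (using the createdByWeight range index from the schema above; curl's -g flag disables URL globbing so the braces pass through literally):
curl -g -X GET 'http://127.0.0.1:8080/graphs/hugegraph/graph/edges?vertex_id="1:marko"&label=created&properties={"weight":"P.gt(0.5)"}'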
Query the edges connected to vertex person:marko (vertex_id="1:marko") with label knows and date property equal to "20160111"
Method & Url
GET http://127.0.0.1:8080/graphs/hugegraph/graph/edges?vertex_id="1:marko"&label=knows&properties={"date":"P.within(\"20160111\")"}
Response Status
200
Response Body
{
"edges": [
{
"id": "S1:marko>1>>S1:vadas",
"label": "knows",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "1:vadas",
"inVLabel": "person",
"properties": {
"weight": 1.5,
"date": "20160111"
}
}
]
}
Query all edges with paging, get the first page (page without a value), limited to 2 results
Method & Url
GET http://127.0.0.1:8080/graphs/hugegraph/graph/edges?page&limit=2
Response Status
200
Response Body
{
"edges": [
{
"id": "S1:marko>1>>S1:josh",
"label": "knows",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "1:josh",
"inVLabel": "person",
"properties": {
"weight": 1.5,
"date": "20130221"
}
},
{
"id": "S1:marko>1>>S1:vadas",
"label": "knows",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "1:vadas",
"inVLabel": "person",
"properties": {
"weight": 1.5,
"date": "20160111"
}
}
],
"page": "EoYxOm1hcmtvgggCAIQyOmxvcAAAAAAAAAAC"
}
The returned body carries the page token of the next page, "page": "EoYxOm1hcmtvgggCAIQyOmxvcAAAAAAAAAAC"; pass this value as the page parameter when querying the next page
Query all edges with paging, get the next page (page set to the value returned by the previous page), limited to 2 results
Method & Url
GET http://127.0.0.1:8080/graphs/hugegraph/graph/edges?page=EoYxOm1hcmtvgggCAIQyOmxvcAAAAAAAAAAC&limit=2
Response Status
200
Response Body
{
"edges": [
{
"id": "S1:marko>2>>S2:lop",
"label": "created",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "2:lop",
"inVLabel": "software",
"properties": {
"weight": 1.0,
"date": "20171210"
}
}
],
"page": null
}
此时 "page": null
表示已经没有下一页了
注:后端为 Cassandra 时,为了性能考虑,返回页恰好为最后一页时,返回
page
值可能非空,通过该page
再请求下一页数据时则返回空数据
及page = null
,其他情况类似
2.2.7 Get an edge by id
Params
Path parameters:
- graph: the graph to operate on
- id: the id of the edge to operate on
Method & Url
GET http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>2>>S2:lop
Response Status
200
Response Body
{
"id": "S1:marko>2>>S2:lop",
"label": "created",
"type": "edge",
"outV": "1:marko",
"outVLabel": "person",
"inV": "2:lop",
"inVLabel": "software",
"properties": {
"weight": 1.0,
"date": "20171210"
}
}
2.2.8 Delete an edge by id
Params
Path parameters:
- graph: the graph to operate on
- id: the id of the edge to operate on
Request parameters:
- label: the edge label
Delete an edge by id only
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>2>>S2:lop
Response Status
204
Delete an edge by label + id
Deleting an edge by specifying both the label parameter and the id generally performs better than deleting by id only
Method & Url
DELETE http://localhost:8080/graphs/hugegraph/graph/edges/S1:marko>1>>S1:vadas?label=knows
Response Status
204
5.1.9 - Traverser API
3.1 Traverser API overview
HugeGraphServer provides a RESTful API for the HugeGraph graph database. Besides the basic CRUD operations on vertices and edges, it offers a set of traversal methods, which we call the traverser API. These traversal methods implement a number of complex graph algorithms that make it easy to analyze and mine the graph.
The traverser APIs supported by HugeGraph include:
- K-out API, finds the neighbors reachable in exactly N steps from a start vertex, in a basic and an advanced version:
  - the basic version uses the GET method and finds the neighbors reachable in exactly N steps from a start vertex
  - the advanced version uses the POST method and finds the neighbors reachable in exactly N steps from a start vertex; it differs from the basic version in that it:
    - supports counting the neighbors only
    - supports filtering by vertex and edge properties
    - supports returning the shortest path to each neighbor
- K-neighbor API, finds all neighbors reachable within N steps from a start vertex, in a basic and an advanced version:
  - the basic version uses the GET method and finds all neighbors reachable within N steps from a start vertex
  - the advanced version uses the POST method and finds all neighbors reachable within N steps from a start vertex; it differs from the basic version in that it:
    - supports counting the neighbors only
    - supports filtering by vertex and edge properties
    - supports returning the shortest path to each neighbor
- Same Neighbors, queries the common neighbors of two vertices
- Jaccard Similarity API, computes the jaccard similarity, in two flavors:
  - one uses the GET method and computes the similarity (intersection over union) of the neighbors of two vertices
  - the other uses the POST method and finds the N vertices in the whole graph with the highest jaccard similarity to a start vertex
- Shortest Path API, finds the shortest path between two vertices
- All Shortest Paths, finds all shortest paths between two vertices
- Weighted Shortest Path, finds the weighted shortest path from a start vertex to a target vertex
- Single Source Shortest Path, finds the weighted shortest paths from one vertex to every other vertex
- Multi Node Shortest Path, finds the pairwise shortest paths among a given set of vertices
- Paths API, finds all paths between two vertices, in a basic and an advanced version:
  - the basic version uses the GET method and finds all paths between a start vertex and an end vertex
  - the advanced version uses the POST method and finds all qualifying paths between a set of start vertices and a set of end vertices
- Customized Paths API, traverses all paths that follow a given pattern starting from a batch of vertices
- Template Path API, finds paths that match the specified start vertex, end vertex and path description
- Crosspoints API, finds the intersection points of two vertices (common ancestors or common descendants)
- Customized Crosspoints API, starting from a batch of vertices, traverses along multiple patterns and returns the intersection of the vertices reached in the last step
- Rings API, finds reachable cycle paths starting from a start vertex
- Rays API, finds paths from a start vertex to the boundary (i.e. acyclic paths)
- Fusiform Similarity API, finds the fusiform-similar vertices of a vertex
- Vertices API
  - query vertices in batch by ID;
  - get the shards (partitions) of vertices;
  - query vertices by shard;
- Edges API
  - query edges in batch by ID;
  - get the shards (partitions) of edges;
  - query edges by shard;
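As a first taste before the detailed subsections, the basic K-out query can be issued with a single GET (a hedged sketch; the exact parameters are documented in the K-out section below):
curl -X GET 'http://localhost:8080/graphs/hugegraph/traversers/kout?source="1:marko"&max_depth=2'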
3.2 Traverser API in detail
The examples in the usage descriptions are all based on the graph given on the TinkerPop website:
The data loading program is as follows:
public class Loader {
public static void main(String[] args) {
HugeClient client = new HugeClient("http://127.0.0.1:8080", "hugegraph");
SchemaManager schema = client.schema();
schema.propertyKey("name").asText().ifNotExist().create();
schema.propertyKey("age").asInt().ifNotExist().create();
schema.propertyKey("city").asText().ifNotExist().create();
schema.propertyKey("weight").asDouble().ifNotExist().create();
schema.propertyKey("lang").asText().ifNotExist().create();
schema.propertyKey("date").asText().ifNotExist().create();
schema.propertyKey("price").asInt().ifNotExist().create();
schema.vertexLabel("person")
.properties("name", "age", "city")
.primaryKeys("name")
.nullableKeys("age")
.ifNotExist()
.create();
schema.vertexLabel("software")
.properties("name", "lang", "price")
.primaryKeys("name")
.nullableKeys("price")
.ifNotExist()
.create();
schema.indexLabel("personByCity")
.onV("person")
.by("city")
.secondary()
.ifNotExist()
.create();
schema.indexLabel("personByAgeAndCity")
.onV("person")
.by("age", "city")
.secondary()
.ifNotExist()
.create();
schema.indexLabel("softwareByPrice")
.onV("software")
.by("price")
.range()
.ifNotExist()
.create();
schema.edgeLabel("knows")
.multiTimes()
.sourceLabel("person")
.targetLabel("person")
.properties("date", "weight")
.sortKeys("date")
.nullableKeys("weight")
.ifNotExist()
.create();
schema.edgeLabel("created")
.sourceLabel("person").targetLabel("software")
.properties("date", "weight")
.nullableKeys("weight")
.ifNotExist()
.