The Best Graph Database for Your CG Production Data In 2026

9 years ago • 8 min read

By Frank Rousseau

CG production data is a graph: a movie breaks down into sequences, shots, scenes, and assets all bound together by dependency relationships. Change the mesh on a prop, and you need to know exactly what rebuilds downstream: the rig, the animation keys, the final image sequence. That chain of impact is a graph.

And yet, most studios still use relational databases by default.

This article makes the case for graph databases in CG pipelines, walks through the three most viable open-source options with working Python code, and helps you decide which one to reach for first.

Why Graph Databases Belong in Your Pipeline

A traditional relational database forces you to flatten graph-shaped data into tables. Traversing dependencies then requires recursive queries or chains of self-joins that are slow to write and slower to run.

Graph databases store your data as nodes and edges. That structure unlocks three concrete wins for pipeline TDs:

Instant impact analysis. When a director asks "what breaks if I change this asset?", you can answer in milliseconds instead of writing a new SQL query.
Dependency-aware build ordering. As long as the dependency graph is directed and acyclic, a topological sort gives you the correct sequence of operations to rebuild any element; the order is implicit in the structure.
Faster iteration. When a change cascades through your production, you can respond and re-queue only the affected work rather than rebuilding everything.
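To make the build-ordering point concrete, here is a minimal sketch using Python's standard graphlib module (Python 3.9+). The DEPENDS_ON map and the build_order helper are illustrative names, not part of any pipeline tool; the map assumes the same prop dependencies used later in this article.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each key is built from the elements in its set.
DEPENDS_ON = {
    "texture": {"concept"},
    "mesh": {"concept"},
    "model": {"mesh", "texture"},
    "rig": {"mesh"},
    "keys": {"mesh", "rig"},
    "shot": {"model", "keys"},
}


def build_order(depends_on):
    """Return the elements in an order where every dependency comes first."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(depends_on).static_order())


order = build_order(DEPENDS_ON)
print(order)
```

Several valid orders exist for this graph, so the exact list may vary; what is guaranteed is that each element appears after everything it depends on.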

Kitsu tip: Kitsu already tracks the task graph across your production: assets, shots, sequences, and their task statuses. If you want to layer a custom graph database on top for deeper dependency analysis, Kitsu's open REST API makes it straightforward to sync production entities into your graph store of choice.

The Test Setup

To compare databases practically, we model the dependency graph for a single prop going through a CG pipeline (concept → texture/mesh → model/rig → animation keys → final image sequence) and then run one representative query 10,000 times:

"Which elements are impacted if Props 1 Mesh changes?"
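Before involving any database, it can help to have a reference answer to that question. The sketch below is a plain breadth-first traversal over an adjacency map; DEPENDENTS and impacted_by are illustrative names, and the edges mirror the ones inserted in the database examples in this article.

```python
from collections import deque

# The test graph as an adjacency map: element -> elements built from it.
DEPENDENTS = {
    "Props 1 concept": ["Props 1 texture", "Props 1 mesh"],
    "Props 1 texture": ["Props 1 model"],
    "Props 1 mesh": ["Props 1 model", "Props 1 rig", "Props 1 keys"],
    "Props 1 rig": ["Props 1 keys"],
    "Props 1 model": ["Shot 1"],
    "Props 1 keys": ["Shot 1"],
}


def impacted_by(start, dependents):
    """Return every element reachable downstream of `start`, each listed once."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(impacted_by("Props 1 mesh", DEPENDENTS))
# ['Props 1 model', 'Props 1 rig', 'Props 1 keys', 'Shot 1']
```

Any of the databases below should return the same four elements, possibly in a different order or with duplicates.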

All benchmarks run on an i7-6700 CPU @ 3.40GHz and include the Python client overhead, since that's what you'll actually be running in production.
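The timing loop itself is simple. The following is a sketch of the kind of harness such a measurement implies, not the exact script behind the numbers in this article; benchmark and fake_query are hypothetical names, and fake_query stands in for a real database call.

```python
import time


def benchmark(query_fn, runs=10_000):
    """Call query_fn `runs` times and return the total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(runs):
        query_fn()
    return time.perf_counter() - start


def fake_query():
    # Stand-in for a real database round trip.
    return ["Props 1 model", "Props 1 rig", "Props 1 keys", "Shot 1"]


elapsed = benchmark(fake_query)
print(f"10,000 queries: {elapsed:.3f} seconds")
```

Passing the query as a callable keeps the client overhead inside the measured loop, which matches the intent stated above.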

Option 1: Neo4j

Best for: Production-critical pipelines where robustness and query power matter most.

Neo4j is the most mature graph database available. It has a commercial enterprise tier (with monitoring, backup, and HA clustering) and a solid community edition that's free for most studio use cases.

Getting Started

Spin up the community edition via Docker:

docker run \
  --publish=7474:7474 --publish=7687:7687 \
  --volume=$HOME/neo4j/data:/data \
  neo4j

Install the Python driver:

pip install neo4j

Populating the Graph

Neo4j uses Cypher, a purpose-built graph query language that reads almost like English. The MERGE command acts as "create if not exists", which keeps your setup scripts idempotent:

# Current driver releases expose everything from the top-level package
# (the old neo4j.v1 module no longer exists).
from neo4j import GraphDatabase, basic_auth

driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=basic_auth("neo4j", "tests")
)
session = driver.session()

def create_asset(name):
    session.run("MERGE (a:Asset { name: $name })", name=name)

def create_shot(name):
    session.run("MERGE (a:Shot { name: $name })", name=name)

def create_relation(asset1, asset2):
    # Adjacent string literals are concatenated, so keep a trailing space
    # between the MATCH and MERGE clauses.
    session.run(
        "MATCH (a:Asset { name: $asset1 }), (b:Asset { name: $asset2 }) "
        "MERGE (a)-[r:ELEMENT_OF]->(b)",
        asset1=asset1, asset2=asset2
    )

def create_casting(asset, shot):
    session.run(
        "MATCH (a:Asset { name: $asset }), (b:Shot { name: $shot }) "
        "MERGE (a)-[r:CASTED_IN]->(b)",
        asset=asset, shot=shot
    )

# Nodes
create_asset("Props 1 concept")
create_asset("Props 1 mesh")
create_asset("Props 1 texture")
create_asset("Props 1 rig")
create_asset("Props 1 model")
create_asset("Props 1 keys")
create_shot("Shot 1")

# Edges
create_relation("Props 1 concept", "Props 1 texture")
create_relation("Props 1 concept", "Props 1 mesh")
create_relation("Props 1 mesh", "Props 1 model")
create_relation("Props 1 texture", "Props 1 model")
create_relation("Props 1 mesh", "Props 1 rig")
create_relation("Props 1 mesh", "Props 1 keys")
create_relation("Props 1 rig", "Props 1 keys")
create_casting("Props 1 model", "Shot 1")
create_casting("Props 1 keys", "Shot 1")

Querying for Impact

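The [*] wildcard traverses all hops in a single query, so there is no recursion to manage manually. Each matching path yields one row, which means an element reachable by several paths is listed several times; use RETURN DISTINCT if you want each name once:

result = session.run(
    "MATCH (:Asset { name: 'Props 1 mesh' })-[*]->(out) "
    "RETURN out.name as name"
)

for record in result:
    print(record["name"])

session.close()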

Performance

10,000 queries: 3.5 seconds (keep the session open; reopening it each time costs ~17 seconds).

Verdict

Neo4j is the fastest option in this test and has the most expressive query language. If you're on a hard-deadline production with SLAs, the enterprise tier's monitoring and backup features are worth the cost. For most studios, the community edition is plenty. There's also a community ORM client that makes the Python integration more ergonomic.

Option 2: ArangoDB

Best for: Studios that want to experiment quickly and may also need document storage alongside graph data.

ArangoDB is a multi-model database that handles documents, key-value, and graph storage in one engine. This flexibility means you can store rich asset metadata as JSON documents and model the relationships between them as a graph without running two separate databases.

Getting Started

docker run -p 8529:8529 -e ARANGO_ROOT_PASSWORD=openSesame arangodb/arangodb:3.2.1
pip install python-arango

Setting Up the Graph Schema

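ArangoDB requires you to define vertex collections and edge definitions explicitly. Edges are always directed. The snippets in this section follow the python-arango API from the ArangoDB 3.2 era; later client releases changed how you connect and select a database, so pin the client version or adapt the calls: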

from arango.client import ArangoClient

client = ArangoClient(username='root', password='openSesame')
db = client.create_database('cgproduction')

dependencies = db.create_graph('dependencies')
shots = dependencies.create_vertex_collection('shots')
assets = dependencies.create_vertex_collection('assets')

casting = dependencies.create_edge_definition(
    name='casting',
    from_collections=['assets'],
    to_collections=['shots']
)
elements = dependencies.create_edge_definition(
    name='element',
    from_collections=['assets'],
    to_collections=['assets']
)

Note: ArangoDB raises an exception if you try to create something that already exists. You'll need to wrap creation calls in your own "get or create" helpers for idempotent setup scripts.
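One way to write such a helper is sketched below. It is deliberately client-agnostic: get_or_create takes the existence check, the getter, and the creator as callables, because the exact method names depend on the python-arango version you run. FakeGraphStore is a hypothetical in-memory stand-in used only to show the behaviour.

```python
def get_or_create(name, exists, get, create):
    """Return the existing object called `name`, creating it only if missing."""
    if exists(name):
        return get(name)
    return create(name)


class FakeGraphStore:
    """In-memory stand-in that, like the database, refuses duplicate creation."""

    def __init__(self):
        self._collections = {}

    def has(self, name):
        return name in self._collections

    def get(self, name):
        return self._collections[name]

    def create(self, name):
        if name in self._collections:
            raise ValueError(f"{name} already exists")
        self._collections[name] = {"name": name}
        return self._collections[name]


store = FakeGraphStore()
first = get_or_create("assets", store.has, store.get, store.create)
second = get_or_create("assets", store.has, store.get, store.create)
print(first is second)  # True: the second call reuses the existing object
```

With a real client you would pass the corresponding bound methods for graphs, vertex collections, and edge definitions instead of the fake store's methods.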

Inserting Data

# Vertices
assets.insert({'_key': 'props1-concept', 'name': 'Props 1 Concept'})
assets.insert({'_key': 'props1-texture', 'name': 'Props 1 Texture'})
assets.insert({'_key': 'props1-mesh',    'name': 'Props 1 Mesh'})
assets.insert({'_key': 'props1-rig',     'name': 'Props 1 Rig'})
assets.insert({'_key': 'props1-model',   'name': 'Props 1 Model'})
assets.insert({'_key': 'props1-keys',    'name': 'Props 1 Keys'})
shots.insert({'_key': 'shot1', 'name': 'Shot 1 Image Sequence'})

# Edges
elements.insert({'_from': 'assets/props1-concept', '_to': 'assets/props1-texture'})
elements.insert({'_from': 'assets/props1-concept', '_to': 'assets/props1-mesh'})
elements.insert({'_from': 'assets/props1-texture', '_to': 'assets/props1-model'})
elements.insert({'_from': 'assets/props1-mesh',    '_to': 'assets/props1-rig'})
elements.insert({'_from': 'assets/props1-mesh',    '_to': 'assets/props1-model'})
elements.insert({'_from': 'assets/props1-mesh',    '_to': 'assets/props1-keys'})
elements.insert({'_from': 'assets/props1-rig',     '_to': 'assets/props1-keys'})
casting.insert({'_from': 'assets/props1-model', '_to': 'shots/shot1'})
casting.insert({'_from': 'assets/props1-keys',  '_to': 'shots/shot1'})

Querying for Impact

traversal_results = dependencies.traverse(
    start_vertex='assets/props1-mesh',
    direction='outbound'
)

for result in traversal_results["vertices"]:
    print(result["name"])

The traversal API also exposes depth-first vs breadth-first options, shortest path finding, and path length retrieval, all of which are useful for more advanced pipeline analysis.

Performance

10,000 queries: 26 seconds. Slower than Neo4j, but still perfectly acceptable for most pipeline tooling that won't run tens of thousands of queries per session.

Verdict

ArangoDB is the most developer-friendly of the bunch. It's well documented, the Python client is clean, and the web UI makes it easy to visualise and debug your graph as you build it. The document storage model maps naturally to how pipeline TDs already think about asset data.

Because ArangoDB stores vertices as JSON documents, you can directly mirror Kitsu's asset and shot entities (fetched via the Kitsu API or gazu, the Python client) into ArangoDB with minimal transformation. This makes ArangoDB a natural companion to Kitsu if you want to add dependency tracking to your Kitsu-managed production.
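As a rough illustration of that "minimal transformation", the sketch below converts an entity dictionary into a vertex document and builds a casting edge. The field names (id, name, data) and the sample values are assumptions about the payload shape, and to_vertex and to_casting_edge are hypothetical helpers; check the entities your Kitsu instance actually returns before relying on them.

```python
def to_vertex(entity):
    """Turn a Kitsu-style entity dict into an ArangoDB vertex document."""
    # Copy every field except the id, which becomes the document key.
    doc = {key: value for key, value in entity.items() if key != "id"}
    doc["_key"] = str(entity["id"])
    return doc


def to_casting_edge(asset_id, shot_id):
    """Build an edge document linking an asset vertex to a shot vertex."""
    return {"_from": f"assets/{asset_id}", "_to": f"shots/{shot_id}"}


# Hypothetical sample payload, for illustration only.
sample_asset = {
    "id": "a24a6ea4-ce75-4665-a070-57453082c25",
    "name": "Props 1 Mesh",
    "data": {"polycount": 12000},
}

print(to_vertex(sample_asset))
print(to_casting_edge(sample_asset["id"], "shot1"))
```

Using the tracker's id as the document key keeps repeated syncs pointed at the same vertex instead of creating duplicates.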

Option 3: Cayley

Best for: Complementing an existing relational database with lightweight graph traversal, if you're comfortable with an experimental tool.

Cayley is an open-source graph database written in Go that originated at Google. Its defining feature is that it's a layer on top of other storage backends (Bolt, PostgreSQL, etc.), which means you may be able to add graph capabilities without replacing your existing database.

Limitations to Know Upfront

Documentation is thin.
The Python client (pyley) is incomplete — quad creation must be done via raw HTTP requests.
The visualisation UI is buggy.
Recursive traversal isn't yet available in the Python client.

Getting Started

Download the Cayley binary, initialise the database, and start the HTTP server:

./cayley init -db bolt -dbpath /tmp/testdb
./cayley http --dbpath=/tmp/testdb --host 0.0.0.0 --port 64210

pip install pyley requests

Inserting Quads

Cayley models everything as quads: subject → predicate → object (+ optional label). Since the Python client doesn't support quad creation, use the REST API directly:

import requests

def create_quad(quad):
    return requests.post(
        "http://localhost:64210/api/v1/write",
        json=[quad]
    )

quads = [
    {"subject": "props1-concept", "predicate": "dependencyof", "object": "props1-texture"},
    {"subject": "props1-concept", "predicate": "dependencyof", "object": "props1-mesh"},
    {"subject": "props1-texture", "predicate": "dependencyof", "object": "props1-model"},
    {"subject": "props1-mesh",    "predicate": "dependencyof", "object": "props1-model"},
    {"subject": "props1-mesh",    "predicate": "dependencyof", "object": "props1-rig"},
    {"subject": "props1-mesh",    "predicate": "dependencyof", "object": "props1-keys"},
    {"subject": "props1-rig",     "predicate": "dependencyof", "object": "props1-keys"},
    {"subject": "props1-model",   "predicate": "dependencyof", "object": "shot1-image-sequence"},
    {"subject": "props1-keys",    "predicate": "dependencyof", "object": "shot1-image-sequence"},
]

for quad in quads:
    create_quad(quad)

Re-inserting identical quads is a no-op.
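Writing the quad dictionaries by hand gets tedious as the graph grows. A small helper can derive them from a plain edge list; to_quads and EDGES are illustrative names, and the default predicate simply reuses the "dependencyof" label from the example above.

```python
def to_quads(edges, predicate="dependencyof"):
    """Build Cayley quad dictionaries from (subject, object) pairs."""
    return [
        {"subject": subject, "predicate": predicate, "object": obj}
        for subject, obj in edges
    ]


# A subset of the example graph, as (subject, object) pairs.
EDGES = [
    ("props1-concept", "props1-mesh"),
    ("props1-mesh", "props1-rig"),
    ("props1-rig", "props1-keys"),
]

quads = to_quads(EDGES)
print(quads[0])
# {'subject': 'props1-concept', 'predicate': 'dependencyof', 'object': 'props1-mesh'}
```

Because the write call above already wraps a single quad in a list, posting the whole list in one request is probably possible too, but treat that as an assumption to verify against your Cayley version.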

Querying for Impact

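from pyley import CayleyClient, GraphObject

client = CayleyClient("http://localhost:64210", "v1")
graph = GraphObject()

# Out() follows a single hop, so this returns direct dependents only.
query = graph.V("props1-mesh").Out().All()

# Send the query to the server and print the raw result.
response = client.Send(query)
print(response.result)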

Performance

10,000 queries: 50 seconds. The slowest of the group.

Verdict

Cayley has a genuinely elegant design and the concept of a graph layer over an existing backend is compelling. But it's not production-ready for most studios today: documentation is sparse, the Python client is incomplete, and performance lags behind. Watch this project, but don't ship on it yet.

Quick Comparison

Criterion	Neo4j	ArangoDB	Cayley
Performance (10k queries)	3.5s	26s	50s
Query language	Cypher (expressive)	AQL + traversal API	Gremlin / MQL
Python client quality	Good (+ ORM option)	Good	Incomplete
Documentation	Excellent	Good	Poor
Multi-model	No	Yes (doc + graph)	No
Web UI	Yes	Yes	Buggy
Production-ready	✅	✅	⚠️
Best for	Speed & robustness	Flexibility & dev experience	Experimenting

Alternatives Worth Knowing

If you're not ready to adopt a dedicated graph database, two approaches work well with tools you may already have:

PostgreSQL recursive queries (WITH RECURSIVE common table expressions) can handle straightforward dependency traversal without adding a new database to your stack. Query complexity grows quickly, but it's a valid starting point.
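To show what the recursive approach looks like, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so it runs without a server. The dependency table and its column names are assumptions made for this example; the WITH RECURSIVE query should carry over to PostgreSQL with only the parameter placeholder changed.

```python
import sqlite3

# Edges of the example graph as (source, target) rows.
EDGES = [
    ("Props 1 concept", "Props 1 texture"),
    ("Props 1 concept", "Props 1 mesh"),
    ("Props 1 texture", "Props 1 model"),
    ("Props 1 mesh", "Props 1 model"),
    ("Props 1 mesh", "Props 1 rig"),
    ("Props 1 mesh", "Props 1 keys"),
    ("Props 1 rig", "Props 1 keys"),
    ("Props 1 model", "Shot 1"),
    ("Props 1 keys", "Shot 1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dependency (source TEXT, target TEXT)")
conn.executemany("INSERT INTO dependency VALUES (?, ?)", EDGES)

# UNION (rather than UNION ALL) drops duplicates, so each element is
# returned once and the recursion stops even if the data contains a cycle.
IMPACT_QUERY = """
WITH RECURSIVE impacted(name) AS (
    SELECT target FROM dependency WHERE source = ?
    UNION
    SELECT d.target
    FROM dependency AS d
    JOIN impacted AS i ON d.source = i.name
)
SELECT name FROM impacted
"""


def impacted_by(connection, start):
    """Return the sorted names of everything downstream of `start`."""
    rows = connection.execute(IMPACT_QUERY, (start,)).fetchall()
    return sorted(row[0] for row in rows)


print(impacted_by(conn, "Props 1 mesh"))
# ['Props 1 keys', 'Props 1 model', 'Props 1 rig', 'Shot 1']
```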

Elasticsearch can store vertices and edges as JSON documents and supports graph-like queries. It also adds full-text and fuzzy search across your asset metadata, which is useful if you want to search and traverse in the same system.

Visualising Your Graph

Once your data is in a graph database, you'll eventually want to render it in your own tools. Good options by platform:

Qt (Python/C++):

Nodz - Python, easy to integrate
ZodiacGraph - C++, high performance

Web/Electron:

Cytoscape.js - versatile and production-grade
SigmaJS - fast, well documented
D3.js - maximum flexibility, steeper learning curve
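For the web libraries, most of the work is getting your graph into the shape the library expects. As one example, the sketch below converts an edge list into the flat elements array that Cytoscape.js accepts, where nodes carry an id and edges carry source and target. The to_cytoscape_elements helper and the edge id format are choices made for this illustration.

```python
import json


def to_cytoscape_elements(edges):
    """Convert (source, target) pairs into a Cytoscape.js elements list."""
    # Collect each node name once, in a stable order.
    nodes = sorted({name for edge in edges for name in edge})
    elements = [{"data": {"id": name}} for name in nodes]
    elements += [
        {"data": {"id": f"{source}->{target}", "source": source, "target": target}}
        for source, target in edges
    ]
    return elements


EDGES = [
    ("props1-mesh", "props1-rig"),
    ("props1-rig", "props1-keys"),
]

# Serialise for the front end, e.g. as the body of an HTTP response.
print(json.dumps(to_cytoscape_elements(EDGES), indent=2))
```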

Kitsu's web interface already provides a visual breakdown of production structure: episodes, sequences, shots, and assets. For teams that want a pre-built production graph view without custom tooling, Kitsu gives you that out of the box. Custom graph visualisations make the most sense for deep technical dependency analysis (e.g. which render farm jobs to invalidate when an asset version changes).

Recommended Action Plan

Start with ArangoDB if you're exploring graph databases for the first time or want to prototype quickly. Its document model and clean Python client have the lowest barrier to entry.
Switch to Neo4j when performance becomes a constraint or when you need enterprise-grade reliability. The Cypher query language is worth learning — it pays off quickly for complex traversals.
Use Kitsu + gazu to seed your graph database with production entities. Kitsu is your source of truth for assets, shots, and task statuses; your graph database adds the dependency and build-order layer on top.
Skip Cayley for now. Check back in 12–18 months — the core design is sound, but it needs more documentation and a more complete Python client.
Consider Postgres first if you have simple dependency needs and want to avoid adding a new technology to your stack.

Graph databases won't replace your production tracker, but they will make your pipeline smarter about what needs to rebuild when things change. If you're managing CG production with Kitsu, you already have the asset and shot graph; a dedicated graph database lets you extend that into full dependency tracking and build orchestration.
