[[titan-basics]]
Titan Basics
============
[[configuration]]
Configuration
-------------
A Titan graph database cluster consists of one or more Titan instances. To open a Titan instance, a configuration must be provided which specifies how Titan should be set up.
A Titan configuration specifies which components Titan should use, controls all operational aspects of a Titan deployment, and provides a number of tuning options to get maximum performance from a Titan cluster.
At a minimum, a Titan configuration must define the persistence engine that Titan should use as a storage backend. <<storage-backends>> lists all supported persistence engines and how to configure them respectively.
If advanced graph query support (e.g., full-text search, geo search, or range queries) is required, an additional indexing backend must be configured. See <<index-backends>> for details. If query performance is a concern, then caching should be enabled. Cache configuration and tuning is described in <<caching>>.
Example Configurations
~~~~~~~~~~~~~~~~~~~~~~
Below are some example configuration files to demonstrate how to configure the most commonly used storage backends, indexing systems, and performance components. This covers only a tiny portion of the available configuration options. Refer to <<titan-config-ref>> for the complete list of all options.
Cassandra+Elasticsearch
^^^^^^^^^^^^^^^^^^^^^^^
Sets up Titan to use the Cassandra persistence engine running locally and a remote Elasticsearch indexing system:
[source, properties]
----
storage.backend=cassandra
storage.hostname=localhost
index.search.backend=elasticsearch
index.search.hostname=100.100.101.1, 100.100.101.2
index.search.elasticsearch.client-only=true
----
HBase+Caching
^^^^^^^^^^^^^
Sets up Titan to use the HBase persistence engine running remotely and uses Titan's caching component for better performance.
[source, properties]
----
storage.backend=hbase
storage.hostname=100.100.101.1
storage.port=2181
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
----
BerkeleyDB
^^^^^^^^^^
Sets up Titan to use BerkeleyDB as an embedded persistence engine with Elasticsearch as an embedded indexing system.
[source, properties]
----
storage.backend=berkeleyje
storage.directory=/tmp/graph
index.search.backend=elasticsearch
index.search.directory=/tmp/searchindex
index.search.elasticsearch.client-only=false
index.search.elasticsearch.local-mode=true
----
<<titan-config-ref>> describes all of these configuration options in detail. The +conf+ directory of the Titan distribution contains additional configuration examples.
Further Examples
^^^^^^^^^^^^^^^^
There are several example configuration files in the `conf/` directory that can be used to get started with Titan quickly. Paths to these files can be passed to `TitanFactory.open(...)` as shown below:
[source, java]
----
// Connect to Cassandra on localhost using a default configuration
graph = TitanFactory.open("conf/titan-cassandra.properties")
// Connect to HBase on localhost using a default configuration
graph = TitanFactory.open("conf/titan-hbase.properties")
----
Using Configuration
~~~~~~~~~~~~~~~~~~~
How the configuration is provided to Titan depends on the instantiation mode.
TitanFactory
^^^^^^^^^^^^
Console
+++++++
The Titan distribution contains a command line Console which makes it easy to get started and interact with Titan. Invoke `bin/gremlin.sh` (Unix/Linux) or `bin/gremlin.bat`
(Windows) to start the Console and then open a Titan graph using the factory with the configuration stored in an accessible properties configuration file:
[source, gremlin]
----
graph = TitanFactory.open('path/to/configuration.properties')
----
Titan Embedded
++++++++++++++
TitanFactory can also be used to open an embedded Titan graph instance from within a JVM-based user application. In that case, Titan is part of the user application and the application can call upon Titan directly through its public http://thinkaurelius.github.io/titan/javadoc/current/[API documentation].
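As a sketch, an embedded graph can also be opened by building the configuration programmatically with `TitanFactory.build()` rather than loading a properties file; the option values shown are illustrative:

[source, gremlin]
----
graph = TitanFactory.build().
    set('storage.backend', 'berkeleyje').
    set('storage.directory', '/tmp/graph').
    open()
----

This is convenient when the configuration is assembled at runtime by the embedding application instead of being read from disk.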
Short Codes
+++++++++++
If the Titan graph cluster has been previously configured and/or only the storage backend needs to be defined, TitanFactory accepts a colon-separated string representation of the storage backend name and hostname or directory.
[source, gremlin]
----
graph = TitanFactory.open('cassandra:localhost')
----
[source, gremlin]
----
graph = TitanFactory.open('berkeleyje:/tmp/graph')
----
Titan Server
^^^^^^^^^^^^
To interact with Titan remotely or in another process through a client, a Titan "server" needs to be configured and started. Internally, Titan uses http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#gremlin-server[Gremlin Server] of the http://tinkerpop.incubator.apache.org/[TinkerPop] stack to service client requests; therefore, configuring Titan Server is accomplished through a Gremlin Server configuration file.
To configure Gremlin Server with a `TitanGraph` instance the Gremlin Server configuration file requires the following settings:
[source, yaml]
----
...
graphs: {
graph: conf/titan-berkeleyje.properties
}
plugins:
- aurelius.titan
...
----
The entry for `graphs` defines the bindings to specific `TitanGraph` configurations. In the above case it binds `graph` to a Titan configuration at `conf/titan-berkeleyje.properties`. This means that when referencing the `TitanGraph` in remote contexts, this graph can simply be referred to as `graph` in scripts sent to the server. The `plugins` entry simply enables the Titan Gremlin Plugin, which enables auto-imports of Titan classes so that they can be referenced in remotely submitted scripts.
Learn more about using and connecting to Titan server in <<server>>.
Server Distribution
+++++++++++++++++++
The Titan zip file contains a quick start server component that helps make it easier to get started with Gremlin Server and Titan. Invoke `bin/titan.sh start` to start Gremlin Server with Cassandra and Elasticsearch.
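Once the server is running, the Gremlin Console can connect to it using the TinkerPop remote plugin; a sketch, assuming a remote-connection file `conf/remote.yaml` that points at the server's host and port:

[source, gremlin]
----
gremlin> :remote connect tinkerpop.server conf/remote.yaml
gremlin> :> g.V().count()
----

The `:>` prefix submits the line that follows it to the connected server for evaluation.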
[[configuration-global]]
Global Configuration
~~~~~~~~~~~~~~~~~~~~
Titan distinguishes between local and global configuration options. Local configuration options apply to an individual Titan instance. Global configuration options apply to all instances in a cluster. More specifically, Titan distinguishes the following five scopes for configuration options:
* *LOCAL*: These options only apply to an individual Titan instance and are specified in the configuration provided when initializing the Titan instance.
* *MASKABLE*: These configuration options can be overwritten for an individual Titan instance by the local configuration file. If the local configuration file does not specify the option, its value is read from the global Titan cluster configuration.
* *GLOBAL*: These options are always read from the cluster configuration and cannot be overwritten on an instance basis.
* *GLOBAL_OFFLINE*: Like _GLOBAL_, but changing these options requires a cluster restart to ensure that the value is the same across the entire cluster.
* *FIXED*: Like _GLOBAL_, but the value cannot be changed once the Titan cluster is initialized.
When the first Titan instance in a cluster is started, the global configuration options are initialized from the provided local configuration file. Subsequently changing global configuration options is done through Titan's management API. To access the management API, call `graph.openManagement()` on an open Titan instance handle `graph`. For example, to change the default caching behavior on a Titan cluster:
[source, gremlin]
----
mgmt = graph.openManagement()
mgmt.get('cache.db-cache')
// Prints the current config setting
mgmt.set('cache.db-cache', true)
// Changes option
mgmt.get('cache.db-cache')
// Prints 'true'
mgmt.commit()
// Changes take effect
----
Changing Offline Options
^^^^^^^^^^^^^^^^^^^^^^^^
Changing configuration options does not affect running instances and only applies to newly started ones. Changing _GLOBAL_OFFLINE_ configuration options requires restarting the cluster so that the changes take effect immediately for all instances.
To change _GLOBAL_OFFLINE_ options follow these steps:
* Close all but one Titan instance in the cluster
* Connect to the single instance
* Ensure all running transactions are closed
* Ensure no new transactions are started (i.e. the cluster must be offline)
* Open the management API
* Change the configuration option(s)
* Call commit which will automatically shut down the graph instance
* Restart all instances
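The management steps above can be sketched in the Gremlin Console; `some.option.name` is a placeholder for an actual _GLOBAL_OFFLINE_ option from <<titan-config-ref>>:

[source, gremlin]
----
mgmt = graph.openManagement()
// 'some.option.name' stands in for a real GLOBAL_OFFLINE option
mgmt.set('some.option.name', newValue)
// for GLOBAL_OFFLINE options, commit automatically shuts down the graph instance
mgmt.commit()
----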
Refer to the full list of configuration options in <<titan-config-ref>> for more information including the configuration scope of each option.
[[schema]]
Schema and Data Modeling
------------------------
Each Titan graph has a schema comprised of the edge labels, property keys, and vertex labels used therein. A Titan schema can either be explicitly or implicitly defined. Users are encouraged to explicitly define the graph schema during application development. An explicitly defined schema is an important component of a robust graph application and greatly improves collaborative software development. Note that a Titan schema can be evolved over time without any interruption of normal database operations. Extending the schema does not slow down query answering and does not require database downtime.
The schema type - i.e. edge label, property key, or vertex label - is assigned to elements in the graph - i.e. edges, properties, or vertices respectively - when they are first created. The assigned schema type cannot be changed for a particular element. This ensures a stable type system that is easy to reason about.
Beyond the schema definition options explained in this section, schema types provide performance tuning options that are discussed in <<advanced-schema>>.
Defining Edge Labels
~~~~~~~~~~~~~~~~~~~~
Each edge connecting two vertices has a label which defines the semantics of the relationship. For instance, an edge labeled `friend` between vertices A and B encodes a friendship between the two individuals.
To define an edge label, call `makeEdgeLabel(String)` on an open graph or management transaction and provide the name of the edge label as the argument. Edge label names must be unique in the graph. This method returns a builder for edge labels that allows its multiplicity to be defined. The *multiplicity* of an edge label defines a multiplicity constraint on all edges of this label, that is, a maximum number of edges between pairs of vertices. Titan recognizes the following multiplicity settings.
Edge Label Multiplicity
^^^^^^^^^^^^^^^^^^^^^^^
.Multiplicity Settings
* *MULTI*: Allows multiple edges of the same label between any pair of vertices. In other words, the graph is a _multi graph_ with respect to such edge label. There is no constraint on edge multiplicity.
* *SIMPLE*: Allows at most one edge of such label between any pair of vertices. In other words, the graph is a _simple graph_ with respect to the label. Ensures that edges are unique for a given label and pairs of vertices.
* *MANY2ONE*: Allows at most one outgoing edge of such label on any vertex in the graph but places no constraint on incoming edges. The edge label `mother` is an example with MANY2ONE multiplicity since each person has at most one mother but mothers can have multiple children.
* *ONE2MANY*: Allows at most one incoming edge of such label on any vertex in the graph but places no constraint on outgoing edges. The edge label `winnerOf` is an example with ONE2MANY multiplicity since each contest is won by at most one person but a person can win multiple contests.
* *ONE2ONE*: Allows at most one incoming and one outgoing edge of such label on any vertex in the graph. The edge label `marriedTo` is an example with ONE2ONE multiplicity since a person is married to exactly one other person.
The default multiplicity is MULTI. The definition of an edge label is completed by calling the `make()` method on the builder which returns the defined edge label as shown in the following example.
[source, gremlin]
mgmt = graph.openManagement()
follow = mgmt.makeEdgeLabel('follow').multiplicity(MULTI).make()
mother = mgmt.makeEdgeLabel('mother').multiplicity(MANY2ONE).make()
mgmt.commit()
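With the `mother` label defined as MANY2ONE, Titan enforces the constraint when edges are added; a sketch, assuming vertices `hercules`, `alcmene`, and `juno` already exist:

[source, gremlin]
----
hercules.addEdge('mother', alcmene)
// a second outgoing 'mother' edge violates the MANY2ONE constraint
// and causes Titan to throw an exception
hercules.addEdge('mother', juno)
----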
Defining Property Keys
~~~~~~~~~~~~~~~~~~~~~~
Properties on vertices and edges are key-value pairs. For instance, the property `name='Daniel'` has the key `name` and the value `'Daniel'`. Property keys are part of the Titan schema and can constrain the allowed data types and cardinality of values.
To define a property key, call `makePropertyKey(String)` on an open graph or management transaction and provide the name of the property key as the argument. Property key names must be unique in the graph. This method returns a builder for the property keys.
Property Key Data Type
^^^^^^^^^^^^^^^^^^^^^^
Use `dataType(Class)` to define the data type of a property key. Titan will enforce that all values associated with the key have the configured data type and thereby ensures that data added to the graph is valid. For instance, one can define that the `name` key has a String data type.
Define the data type as `Object.class` in order to allow any (serializable) value to be associated with a key. However, it is encouraged to use concrete data types whenever possible.
Configured data types must be concrete classes and not interfaces or abstract classes. Titan enforces class equality, so adding a sub-class of a configured data type is not allowed.
Titan natively supports the following data types.
.Native Titan Data Types
[options="header"]
|=====
|Name |Description
|String |Character sequence
|Character |Individual character
|Boolean |true or false
|Byte |byte value
|Short |short value
|Integer |integer value
|Long |long value
|Float |4 byte floating point number
|Double |8 byte floating point number
|Decimal |Number with 3 decimal digits
|Precision |Number with 6 decimal digits
|Date |Date
|Geoshape |Geographic shape like point, circle or box
|UUID |UUID
|=====
[[property-cardinality]]
Property Key Cardinality
^^^^^^^^^^^^^^^^^^^^^^^^
Use `cardinality(Cardinality)` to define the allowed cardinality of the values associated with the key on any given vertex.
.Cardinality Settings
* *SINGLE*: Allows at most one value per element for such key. In other words, the key->value mapping is unique for all elements in the graph. The property key `birthDate` is an example with SINGLE cardinality since each person has exactly one birth date.
* *LIST*: Allows an arbitrary number of values per element for such key. In other words, the key is associated with a list of values allowing duplicate values. Assuming we model sensors as vertices in a graph, the property key `sensorReading` is an example with LIST cardinality to allow lots of (potentially duplicate) sensor readings to be recorded.
* *SET*: Allows multiple values but no duplicate values per element for such key. In other words, the key is associated with a set of values. The property key `name` has SET cardinality if we want to capture all names of an individual (including nick name, maiden name, etc).
The default cardinality setting is SINGLE.
Note that property keys used on edges and properties have cardinality SINGLE. Attaching multiple values for a single key on an edge or property is not supported.
[source, gremlin]
mgmt = graph.openManagement()
birthDate = mgmt.makePropertyKey('birthDate').dataType(Long.class).cardinality(Cardinality.SINGLE).make()
name = mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SET).make()
sensorReading = mgmt.makePropertyKey('sensorReading').dataType(Double.class).cardinality(Cardinality.LIST).make()
mgmt.commit()
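The cardinality settings defined above then govern how repeated property assignments on the same vertex behave; a sketch:

[source, gremlin]
----
v = graph.addVertex()
v.property('birthDate', 297L)
v.property('birthDate', 298L)   // SINGLE cardinality: replaces the previous value
v.property('name', 'hercules')
v.property('name', 'herakles')  // SET cardinality: both values are kept
----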
Relation Types
~~~~~~~~~~~~~~
Edge labels and property keys are jointly referred to as *relation types*. Names of relation types must be unique in the graph which means that property keys and edge labels cannot have the same name. There are methods in the Titan API to query for the existence or retrieve relation types which encompasses both property keys and edge labels.
[source, gremlin]
mgmt = graph.openManagement()
if (mgmt.containsRelationType('name'))
name = mgmt.getPropertyKey('name')
mgmt.getRelationTypes(EdgeLabel.class)
mgmt.commit()
Defining Vertex Labels
~~~~~~~~~~~~~~~~~~~~~~
Like edges, vertices have labels. Unlike edge labels, vertex labels are optional. Vertex labels are useful to distinguish different types of vertices, e.g. _user_ vertices and _product_ vertices.
For compatibility with Blueprints, Titan provides differently-named methods for adding labeled and unlabeled vertices:
* `addVertexWithLabel`
* `addVertex`
Although labels are optional at the conceptual and data model level, Titan assigns all vertices a label as an internal implementation detail. Vertices created by the `addVertex` methods use Titan's default label.
To create a label, call `makeVertexLabel(String).make()` on an open graph or management transaction and provide the name of the vertex label as the argument. Vertex label names must be unique in the graph.
[source, gremlin]
mgmt = graph.openManagement()
person = mgmt.makeVertexLabel('person').make()
mgmt.commit()
// Create a labeled vertex
person = graph.addVertex(label, 'person')
// Create an unlabeled vertex
v = graph.addVertex()
graph.tx().commit()
Automatic Schema Maker
~~~~~~~~~~~~~~~~~~~~~~
If an edge label, property key, or vertex label has not been defined explicitly, it will be defined implicitly when it is first used during the addition of an edge or vertex, or the setting of a property. The `DefaultSchemaMaker` configured for the Titan graph defines such types.
By default, implicitly created edge labels have multiplicity MULTI and implicitly created property keys have cardinality SINGLE and data type `Object.class`. Users can control automatic schema element creation by implementing and registering their own `DefaultSchemaMaker`.
It is strongly encouraged to explicitly define all schema elements and to disable automatic schema creation by setting `schema.default=none` in the Titan graph configuration.
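For example, automatic schema creation can be disabled in the graph's properties file:

[source, properties]
----
# disable automatic creation of schema elements
schema.default=none
----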
Changing Schema Elements
~~~~~~~~~~~~~~~~~~~~~~~~
The definition of an edge label, property key, or vertex label cannot be changed once it is committed into the graph. However, the names of schema elements can be changed via `TitanManagement.changeName(TitanSchemaElement, String)` as shown in the following example where the property key `place` is renamed to `location`.
[source, gremlin]
mgmt = graph.openManagement()
place = mgmt.getPropertyKey('place')
mgmt.changeName(place, 'location')
mgmt.commit()
Note that schema name changes may not be immediately visible in currently running transactions and other Titan graph instances in the cluster. While schema name changes are announced to all Titan instances through the storage backend, it may take a while for the schema changes to take effect and it may require an instance restart in the event of certain failure conditions - like network partitions - if they coincide with the rename. Hence, the user must ensure that either of the following holds:
* The renamed label or key is not currently in active use (i.e. written or read) and will not be in use until all Titan instances are aware of the name change.
* Running transactions actively accommodate the brief intermediate period where either the old or new name is valid based on the specific Titan instance and status of the name-change announcement. For instance, that could mean transactions query for both names simultaneously.
Should the need arise to re-define an existing schema type, it is recommended to change the name of this type to a name that is not currently (and will never be) in use. After that, a new label or key can be defined with the original name, thereby effectively replacing the old one.
However, note that this would not affect vertices, edges, or properties previously written with the existing type. Redefining existing graph elements is not supported online and must be accomplished through a batch graph transformation.
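A sketch of this rename-and-redefine approach for a property key; the new data type is illustrative:

[source, gremlin]
----
mgmt = graph.openManagement()
place = mgmt.getPropertyKey('place')
// retire the old key under a name that will never be used again
mgmt.changeName(place, 'place_deprecated')
// define a replacement key under the original name
mgmt.makePropertyKey('place').dataType(Geoshape.class).make()
mgmt.commit()
----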
[[gremlin]]
Gremlin Query Language
----------------------
image:http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/images/gremlin-logo.png[link="http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/"]
http://tinkerpop.incubator.apache.org/[Gremlin] is Titan's query language used to retrieve data from and modify data in the graph. Gremlin is a path-oriented language which succinctly expresses complex graph traversals and mutation operations. Gremlin is a http://en.wikipedia.org/wiki/Functional_programming[functional language] whereby traversal operators are chained together to form path-like expressions. For example, "from Hercules, traverse to his father and then his father's father and return the grandfather's name."
Gremlin is developed independently from Titan and supported by most graph databases. By building applications on top of Titan through the Gremlin query language, users avoid vendor lock-in because their application can be migrated to other graph databases supporting Gremlin.
This section is a brief overview of the Gremlin query language. For more information on Gremlin, refer to the following resources:
* http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/[Complete Gremlin Manual]
* http://sql2gremlin.com[Gremlin for SQL developers] (Gremlin2 syntax)
Introductory Traversals
~~~~~~~~~~~~~~~~~~~~~~~
A Gremlin query is a chain of operations/functions that are evaluated from left to right. A simple grandfather query is provided below over the _Graph of the Gods_ dataset discussed in <<getting-started>>.
[source, gremlin]
gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
==>saturn
The query above can be read:
. `g`: for the current graph traversal.
. `V`: for all vertices in the graph
. `has('name', 'hercules')`: filters the vertices down to those with name property "hercules" (there is only one).
. `out('father')`: traverse outgoing father edges from Hercules.
. `out('father')`: traverse outgoing father edges from Hercules' father's vertex (i.e. Jupiter).
. `values('name')`: get the name property of the "hercules" vertex's grandfather.
Taken together, these steps form a path-like traversal query. Each step can be decomposed and its results demonstrated. This style of building up a traversal/query is useful when constructing larger, complex query chains.
[source, gremlin]
gremlin> g
==>graphtraversalsource[titangraph[cassandrathrift:127.0.0.1], standard]
gremlin> g.V().has('name', 'hercules')
==>v[24]
gremlin> g.V().has('name', 'hercules').out('father')
==>v[16]
gremlin> g.V().has('name', 'hercules').out('father').out('father')
==>v[20]
gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
==>saturn
For a sanity check, it is usually good to look at the properties of each return, not the assigned long id.
[source, gremlin]
gremlin> g.V().has('name', 'hercules').values('name')
==>hercules
gremlin> g.V().has('name', 'hercules').out('father').values('name')
==>jupiter
gremlin> g.V().has('name', 'hercules').out('father').out('father').values('name')
==>saturn
Note the related traversal that shows the entire father family tree branch of Hercules. This more complicated traversal is provided in order to demonstrate the flexibility and expressivity of the language. A competent grasp of Gremlin provides the Titan user the ability to fluently navigate the underlying graph structure.
[source, gremlin]
gremlin> g.V().has('name', 'hercules').repeat(out('father')).emit().values('name')
==>jupiter
==>saturn
Some more traversal examples are provided below.
[source, gremlin]
gremlin> hercules = g.V().has('name', 'hercules').next()
==>v[1536]
gremlin> g.V(hercules).out('father', 'mother').label()
==>god
==>human
gremlin> g.V(hercules).out('battled').label()
==>monster
==>monster
==>monster
gremlin> g.V(hercules).out('battled').valueMap()
==>{name=nemean}
==>{name=hydra}
==>{name=cerberus}
Each _step_ (denoted by a separating `.`) is a function that operates on the objects emitted from the previous step. There are numerous steps in the Gremlin language (see http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#graph-traversal-steps[Gremlin Steps]). By simply changing a step or order of the steps, different traversal semantics are enacted. The example below returns the name of all the people that have battled the same monsters as Hercules who themselves are not Hercules (i.e. "co-battlers" or perhaps, "allies").
Given that _The Graph of the Gods_ only has one battler (Hercules), another battler (for the sake of example) is added to the graph with Gremlin, showcasing how vertices and edges are added to the graph.
[source, gremlin]
gremlin> theseus = graph.addVertex('human')
==>v[3328]
gremlin> theseus.property('name', 'theseus')
==>null
gremlin> cerberus = g.V().has('name', 'cerberus').next()
==>v[2816]
gremlin> battle = theseus.addEdge('battled', cerberus, 'time', 22)
==>e[7eo-2kg-iz9-268][3328-battled->2816]
gremlin> battle.values('time')
==>22
When adding a vertex, an optional vertex label can be provided. An edge label must be specified when adding edges. Properties as key-value pairs can be set on both vertices and edges. When a property key is defined with SET or LIST cardinality, calling `property` with that key adds an additional value rather than replacing the existing one.
[source, gremlin]
gremlin> g.V(hercules).as('h').out('battled').in('battled').where(neq('h')).values('name')
==>theseus
The example above has 4 chained functions: `out`, `in`, `where`, and `values` (i.e. `name` is shorthand for `values('name')`). The function signatures of each are itemized below, where `V` is vertex and `U` is any object, where `V` is a subset of `U`.
. `out: V -> V`
. `in: V -> V`
. `where: U -> U`
. `values: V -> U`
When chaining together functions, the incoming type must match the outgoing type, where `U` matches anything. Thus, the "co-battled/ally" traversal above is correct.
[NOTE]
The Gremlin overview presented in this section focused on the Gremlin-Groovy language implementation. Additional https://github.com/tinkerpop/gremlin/wiki/JVM-Language-Implementations[JVM language implementations] of Gremlin are available.
[[server]]
Titan Server
------------
image:http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/images/gremlin-server.png[width=400]
Titan uses the http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#gremlin-server[Gremlin Server] engine as the server component to process and answer client queries.
Gremlin Server provides a way to remotely execute Gremlin scripts against one or more Titan instances hosted within it. By default, client applications can connect to it via link:https://en.wikipedia.org/wiki/WebSocket[WebSockets] using a custom subprotocol (there are a link:http://tinkerpop.incubator.apache.org/#libraries[number of clients] developed in different languages to help support the subprotocol). Gremlin Server can also be configured to serve a simple REST-style endpoint for processing Gremlin as well. These configurations just represent the out-of-the-box options for Gremlin Server. It is certainly possible to also extend it with other means of communication by implementing the interfaces that it provides.
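For example, the REST-style endpoint can be enabled by swapping the channelizer in the Gremlin Server configuration file; this setting belongs to Gremlin Server itself, not to Titan:

[source, yaml]
----
# serve Gremlin over HTTP instead of WebSockets
channelizer: org.apache.tinkerpop.gremlin.server.channel.HttpChannelizer
----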
Getting Started
~~~~~~~~~~~~~~~
The Titan https://github.com/thinkaurelius/titan/wiki/Downloads[Download] comes pre-configured to run Gremlin Server without any additional configuration. Alternatively, one can http://tinkerpop.incubator.apache.org/[Download Gremlin Server] separately and then install Titan manually.
Using the Pre-Packaged Distribution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The pre-packaged version of Titan with Gremlin Server is designed to get users started quickly with Gremlin Server, Cassandra and Elasticsearch. It starts each of these components in their own process through a single shell script called `bin/titan.sh`. This quick-start bundle is not meant to be representative of what a production installation architecture should look like, but it does provide a good way to do some development with Titan, run some tests and see how all the components are wired up together.
* Download a copy of the current `titan-$VERSION.zip` file from the https://github.com/thinkaurelius/titan/wiki/Downloads[Downloads page]
* Unzip it and enter the `titan-$VERSION` directory
* Run `bin/titan.sh start`. This step will start Gremlin Server with Cassandra/ES forked into a separate process.
[source,bourne]
----
$ bin/titan.sh start
Forking Cassandra...
Running `nodetool statusthrift`.. OK (returned exit status 0 and printed string "running").
Forking Elasticsearch...
Connecting to Elasticsearch (127.0.0.1:9300)... OK (connected to 127.0.0.1:9300).
Forking Gremlin-Server...
Connecting to Gremlin-Server (127.0.0.1:8182)... OK (connected to 127.0.0.1:8182).
Run gremlin.sh to connect.
----
Manual Setup
^^^^^^^^^^^^
Manual setup of Titan in Gremlin Server is straightforward as long as the individual doing the setup has some basic understanding of Titan configuration and how Gremlin Server handles any link:http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#_configuring_2[graph configuration]. In short, Gremlin Server configuration files point to graph-specific configuration files and use those to instantiate `Graph` instances that it will then host. In order to instantiate these `Graph` instances, Gremlin Server requires that the appropriate libraries and dependencies for the `Graph` be available on its classpath.
Get started by link:http://tinkerpop.incubator.apache.org/[downloading] the appropriate version of Gremlin Server, which needs to <<versions.txt#version-compat,match a version>> supported by the Titan version in use. For purposes of demonstration, these instructions will outline how to configure the BerkeleyDB backend for Titan in Gremlin Server. As stated earlier, Gremlin Server needs Titan dependencies on its classpath. Invoke the following command replacing `$VERSION` with the version of Titan to use:
[source,bourne]
----
bin/gremlin-server.sh -i com.thinkaurelius.titan titan-all $VERSION
----
When this process completes, Gremlin Server should now have all the Titan dependencies available to it and will thus be able to instantiate `TitanGraph` objects.
IMPORTANT: The above command uses Groovy Grape and if it is not configured properly download errors may ensue. Please refer to link:http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#gremlin-applications[this section] of the TinkerPop documentation for more information around setting up `~/.groovy/grapeConfig.xml`.
Create a file called `GREMLIN_SERVER_HOME/conf/titan.properties` with the following contents:
[source,text]
----
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=berkeleyje
storage.directory=db/berkeley
----
Configuration of other backends is similar. If using Cassandra, use Cassandra configuration options in the `titan.properties` file. The only important piece to leave unchanged is the `gremlin.graph` setting, which should always use `TitanFactory`. This setting tells Gremlin Server how to instantiate a `TitanGraph` instance.
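As a sketch, a hypothetical Cassandra-backed `titan.properties` would only swap the storage options (the hostname below is an assumption for illustration):

[source,text]
----
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
----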
Next create a file called `GREMLIN_SERVER_HOME/conf/gremlin-server-titan.yaml` that has the following contents:
[source,yaml]
----
host: localhost
port: 8182
graphs: {
graph: conf/titan.properties}
plugins:
- aurelius.titan
scriptEngines: {
gremlin-groovy: {
scripts: [scripts/titan.groovy]}}
serializers:
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { useMapperFromGraph: graph }}
- { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
metrics: {
slf4jReporter: {enabled: true, interval: 180000}}
----
There are several important parts to this configuration file as they relate to Titan. First, in the `graphs` map, there is a key called `graph` and its value is `conf/titan.properties`. This tells Gremlin Server to instantiate a `Graph` instance called "graph" and use the `conf/titan.properties` file to configure it. The "graph" key becomes the unique name for the `Graph` instance in Gremlin Server and it can be referenced as such in the scripts submitted to it. Second, in the `plugins` list, there is a reference to `aurelius.titan`, which tells Gremlin Server to initialize the "Titan Plugin". The "Titan Plugin" will auto-import Titan specific classes for usage in scripts. Finally, note the `scripts` key and the reference to `scripts/titan.groovy`. This Groovy file is an initialization script for Gremlin Server and that particular ScriptEngine. Create `scripts/titan.groovy` with the following contents:
[source,groovy]
----
def globals = [:]
globals << [g : graph.traversal()]
----
The above script creates a `Map` called `globals` and assigns to it a key/value pair. The key is `g` and its value is a `TraversalSource` generated from `graph`, which was configured for Gremlin Server in its configuration file. At this point, there are now two global variables available to scripts provided to Gremlin Server - `graph` and `g`.
Gremlin Server is now set up and the configuration of Titan in Gremlin Server is complete. To start the server:
[source,bourne]
----
$ bin/gremlin-server.sh conf/gremlin-server-titan.yaml
[INFO] GremlinServer -
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
[INFO] GremlinServer - Configuring Gremlin Server from conf/gremlin-server-titan.yaml
[INFO] MetricManager - Configured Metrics Slf4jReporter configured with interval=180000ms and loggerName=org.apache.tinkerpop.gremlin.server.Settings$Slf4jReporterMetrics
[INFO] GraphDatabaseConfiguration - Set default timestamp provider MICRO
[INFO] GraphDatabaseConfiguration - Generated unique-instance-id=7f0000016240-ubuntu1
[INFO] Backend - Initiated backend operations thread pool of size 8
[INFO] KCVSLog$MessagePuller - Loaded unidentified ReadMarker start time 2015-10-02T12:28:24.411Z into com.thinkaurelius.titan.diskstorage.log.kcvs.KCVSLog$MessagePuller@35399441
[INFO] GraphManager - Graph [graph] was successfully configured via [conf/titan.properties].
[INFO] ServerGremlinExecutor - Initialized Gremlin thread pool. Threads in pool named with pattern gremlin-*
[INFO] ScriptEngines - Loaded gremlin-groovy ScriptEngine
[INFO] GremlinExecutor - Initialized gremlin-groovy ScriptEngine with scripts/titan.groovy
[INFO] ServerGremlinExecutor - Initialized GremlinExecutor and configured ScriptEngines.
[INFO] ServerGremlinExecutor - A GraphTraversalSource is now bound to [g] with graphtraversalsource[standardtitangraph[berkeleyje:db/berkeley], standard]
[INFO] AbstractChannelizer - Configured application/vnd.gremlin-v1.0+gryo with org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
[INFO] AbstractChannelizer - Configured application/vnd.gremlin-v1.0+gryo-stringd with org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0
[INFO] GremlinServer$1 - Gremlin Server configured with worker thread pool of 1, gremlin pool of 8 and boss thread pool of 1.
[INFO] GremlinServer$1 - Channel started at port 8182.
----
The following section explains how to connect to the running server.
Connecting to Gremlin Server
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gremlin Server will be ready to listen for WebSocket connections when it is started. The easiest way to test the connection is with Gremlin Console.
Start http://tinkerpop.incubator.apache.org/docs/{tinkerpop_version}/#gremlin-console[Gremlin Console] with `bin/gremlin.sh` and use the `:remote` and `:>` commands to issue Gremlin to Gremlin Server:
[source, text]
----
$ bin/gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.utilities
plugin activated: aurelius.titan
plugin activated: tinkerpop.tinkergraph
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Connected - localhost/127.0.0.1:8182
gremlin> :> graph.addVertex("name", "stephen")
==>v[256]
gremlin> :> g.V().values('name')
==>stephen
----
The `:remote` command tells the console to configure a remote connection to Gremlin Server using the `conf/remote.yaml` file to connect. That file points to a Gremlin Server instance running on `localhost`. The `:>` is the "submit" command which sends the Gremlin on that line to the currently active remote.
[TIP]
To start Titan Server with the REST API, find the `conf/gremlin-server/gremlin-server.yaml` file in the distribution and edit it. Modify the `channelizer` setting to be `org.apache.tinkerpop.gremlin.server.channel.HttpChannelizer` then start Titan Server.
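With the HTTP channelizer enabled, Gremlin can be submitted to the REST endpoint as a JSON payload. A minimal sketch using `curl` against the default host and port (assuming the server is running locally):

[source,bourne]
----
curl -X POST -d '{"gremlin": "g.V().count()"}' http://localhost:8182
----

The server answers with a JSON document containing the script's result.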
[[indexes]]
Indexing for better Performance
-------------------------------
Titan supports two different kinds of indexing to speed up query processing: *graph indexes* and *vertex-centric indexes*. Most graph queries start the traversal from a list of vertices or edges that are identified by their properties. Graph indexes make these global retrieval operations efficient on large graphs. Vertex-centric indexes speed up the actual traversal through the graph, in particular when traversing through vertices with many incident edges.
[[graph-indexes]]
Graph Index
~~~~~~~~~~~
Graph indexes are global index structures over the entire graph which allow efficient retrieval of vertices or edges by their properties for sufficiently selective conditions. For instance, consider the following queries:
[source, gremlin]
g.V().has('name', 'hercules')
g.E().has('reason', textContains('loves'))
The first query asks for all vertices with the name `hercules`. The second asks for all edges where the property `reason` contains the word `loves`. Without a graph index, answering those queries would require a full scan over all vertices or edges in the graph to find those that match the given condition, which is very inefficient and infeasible for huge graphs.
Titan distinguishes between two types of graph indexes: *composite* and *mixed* indexes. Composite indexes are very fast and efficient but limited to equality lookups for a particular, previously-defined combination of property keys. Mixed indexes can be used for lookups on any combination of indexed keys and support multiple condition predicates in addition to equality depending on the backing index store.
Both types of indexes are created through the Titan management system and the index builder returned by `TitanManagement.buildIndex(String, Class)` where the first argument defines the name of the index and the second argument specifies the type of element to be indexed (e.g. `Vertex.class`). The name of a graph index must be unique.
Graph indexes built against newly defined property keys, i.e. property keys that are defined in the same management transaction as the index, are immediately available. Graph indexes built against property keys that are already in use require the execution of a <<reindex, reindex procedure>> to ensure that the index contains all previously added elements. Until the reindex procedure has completed, the index will not be available. It is encouraged to define graph indexes in the same transaction as the initial schema.
[NOTE]
In the absence of an index, Titan will default to a full graph scan in order to retrieve the desired list of vertices. While this produces the correct result set, the graph scan can be very inefficient and lead to poor overall system performance in a production environment. Enable the `force-index` configuration option in production deployments of Titan to prohibit graph scans.
Composite Index
^^^^^^^^^^^^^^^
Composite indexes retrieve vertices or edges by a fixed composition of one or multiple keys.
Consider the following composite index definitions.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('byNameComposite', Vertex.class).addKey(name).buildCompositeIndex()
mgmt.buildIndex('byNameAndAgeComposite', Vertex.class).addKey(name).addKey(age).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameComposite').call()
mgmt.awaitGraphIndexStatus(graph, 'byNameAndAgeComposite').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REINDEX).get()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndAgeComposite"), SchemaAction.REINDEX).get()
mgmt.commit()
First, the two property keys `name` and `age` are retrieved; they must already be defined. Next, a simple composite index on just the `name` property key is built. Titan will use this index to answer the following query.
[source, gremlin]
g.V().has('name', 'hercules')
The second composite graph index includes both keys. Titan will use this index to answer the following query.
[source, gremlin]
g.V().has('age', 30).has('name', 'hercules')
Note that all keys of a composite graph index must be found in the query's equality conditions for this index to be used. For example, the following query cannot be answered with either of the indexes because it only contains a constraint on `age` but not on `name`.
[source, gremlin]
g.V().has('age', 30)
Also note that composite graph indexes can only be used for equality constraints like those in the queries above. The following query would be answered with just the simple composite index defined on the `name` key because the `age` constraint is not an equality constraint.
[source, gremlin]
g.V().has('name', 'hercules').has('age', inside(20, 50))
Composite indexes do not require configuration of an external indexing backend and are supported through the primary storage backend. Hence, composite index modifications are persisted through the same transaction as graph modifications which means that those changes are atomic and/or consistent if the underlying storage backend supports atomicity and/or consistency.
[NOTE]
A composite index may comprise one or multiple keys. A composite index with just one key is sometimes referred to as a key-index.
[[index-unique]]
Index Uniqueness
++++++++++++++++
Composite indexes can also be used to enforce property uniqueness in the graph. If a composite graph index is defined as `unique()` there can be at most one vertex or edge for any given concatenation of property values associated with the keys of that index.
For instance, to enforce that names are unique across the entire graph the following composite graph index would be defined.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameUnique').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameUnique"), SchemaAction.REINDEX).get()
mgmt.commit()
[NOTE]
To enforce uniqueness against an eventually consistent storage backend, the <<eventual-consistency, consistency>> of the index must be explicitly set to enable locking.
[[index-mixed]]
Mixed Index
^^^^^^^^^^^
Mixed indexes retrieve vertices or edges by any combination of previously added property keys.
Mixed indexes provide more flexibility than composite indexes and support additional condition predicates beyond equality. On the other hand, mixed indexes are slower for most equality queries than composite indexes.
Unlike composite indexes, mixed indexes require the configuration of an <<index-backends, indexing backend>> and use that indexing backend to execute lookup operations. Titan can support multiple indexing backends in a single installation. Each indexing backend must be uniquely identified by name in the Titan configuration which is called the *indexing backend name*.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
age = mgmt.getPropertyKey('age')
mgmt.buildIndex('nameAndAge', Vertex.class).addKey(name).addKey(age).buildMixedIndex("search")
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()
The example above defines a mixed index containing the property keys `name` and `age`. The definition refers to the indexing backend name `search` so that Titan knows which configured indexing backend it should use for this particular index. The `search` parameter specified in the `buildMixedIndex` call must match the second clause in the Titan configuration definition, i.e. index.*search*.backend. If the index were named 'solrsearch', the configuration definition would instead be index.*solrsearch*.backend.
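For example, assuming an Elasticsearch indexing backend, the matching entries in the Titan configuration might look as follows (the hostname is an assumption for illustration):

[source,text]
----
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
----

The token between `index.` and `.backend` (here `search`) is the indexing backend name that `buildMixedIndex("search")` refers to.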
The `mgmt.buildIndex` example specified above uses text search as its default behavior. An index statement that explicitly defines the index as a text index can be written as follows:
[source,gremlin]
mgmt.buildIndex('nameAndAge',Vertex.class).addKey(name,Mapping.TEXT.getParameter()).addKey(age,Mapping.TEXT.getParameter()).buildMixedIndex("search")
See <<index-parameters>> for more information on text and string search options, and see the documentation section specific to the indexing backend in use for more details on how each backend handles text versus string searches.
While the index definition example looks similar to the composite index above, it provides greater query support and can answer _any_ of the following queries.
[source, gremlin]
g.V().has('name', textContains('hercules')).has('age', inside(20, 50))
g.V().has('name', textContains('hercules'))
g.V().has('age', lt(50))
Mixed indexes support full-text search, range search, geo search and others. Refer to <<search-predicates>> for a list of predicates supported by a particular indexing backend.
[NOTE]
Unlike composite indexes, mixed indexes do not support uniqueness.
Adding Property Keys
++++++++++++++++++++
Property keys can be added to an existing mixed index which allows subsequent queries to include this key in the query condition.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
location = mgmt.makePropertyKey('location').dataType(Geoshape.class).make()
nameAndAge = mgmt.getGraphIndex('nameAndAge')
mgmt.addIndexKey(nameAndAge, location)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'nameAndAge').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("nameAndAge"), SchemaAction.REINDEX).get()
mgmt.commit()
To add a newly defined key, we first retrieve the existing index from the management transaction by its name and then invoke the `addIndexKey` method to add the key to this index.
If the added key is defined in the same management transaction, it will be immediately available for querying. If the property key has already been in use, adding the key requires the execution of a <<reindex, reindex procedure>> to ensure that the index contains all previously added elements. Until the reindex procedure has completed, the key will not be available in the mixed index.
Mapping Parameters
++++++++++++++++++
When adding a property key to a mixed index - either through the index builder or the `addIndexKey` method - a list of parameters can be optionally specified to adjust how the property value is mapped into the indexing backend. Refer to the <<text-search, mapping parameters overview>> for a complete list of parameter types supported by each indexing backend.
Ordering
^^^^^^^^
The order in which the results of a graph query are returned can be defined using the `order().by()` directive. The `order().by()` method expects two parameters:
* The name of the property key by which to order the results. The results will be ordered by the value of the vertices or edges for this property key.
* The sort order: either increasing (`incr`) or decreasing (`decr`)
For example, the query `g.V().has('name', textContains('hercules')).order().by('age', decr).limit(10)` retrieves the ten oldest individuals with 'hercules' in their name.
When using `order().by()` it is important to note that:
* Composite graph indexes do not natively support ordering search results. All results will be retrieved and then sorted in-memory. For large result sets, this can be very expensive.
* Mixed indexes support ordering natively and efficiently. However, the property key used in the `order().by()` method must have been previously added to the mixed index for native result ordering support. This is important in cases where the `order().by()` key is different from the query keys. If the property key is not part of the index, then sorting requires loading all results into memory.
Label Constraint
^^^^^^^^^^^^^^^^
In many cases it is desirable to only index vertices or edges with a particular label. For instance, one may want to index only gods by their name and not every single vertex that has a name property.
When defining an index it is possible to restrict the index to a particular vertex or edge label using the `indexOnly` method of the index builder. The following creates a composite index for the property key `name` that indexes only vertices labeled `god`.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
name = mgmt.getPropertyKey('name')
god = mgmt.getVertexLabel('god')
mgmt.buildIndex('byNameAndLabel', Vertex.class).addKey(name).indexOnly(god).buildCompositeIndex()
mgmt.commit()
//Wait for the index to become available
mgmt.awaitGraphIndexStatus(graph, 'byNameAndLabel').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getGraphIndex("byNameAndLabel"), SchemaAction.REINDEX).get()
mgmt.commit()
Label restrictions similarly apply to mixed indexes. When a composite index with label restriction is defined as unique, the uniqueness constraint only applies to properties on vertices or edges for the specified label.
Composite vs Mixed Index
^^^^^^^^^^^^^^^^^^^^^^^^
. Use a composite index for exact match index retrievals. Composite indexes do not require configuring or operating an external index system and are often significantly faster than mixed indexes.
.. As an exception, use a mixed index for exact matches when the number of distinct values for the query constraint is relatively small or if one value is expected to be associated with many elements in the graph (i.e. in case of low selectivity).
. Use a mixed index for numeric range, full-text or geo-spatial indexing. Also, a mixed index can speed up `order().by()` queries.
[[vertex-indexes]]
Vertex-centric Index
~~~~~~~~~~~~~~~~~~~~
Vertex-centric indexes are local index structures built individually per vertex. In large graphs vertices can have thousands of incident edges. Traversing through those vertices can be very slow because a large subset of the incident edges has to be retrieved and then filtered in memory to match the conditions of the traversal. Vertex-centric indexes can speed up such traversals by using localized index structures to retrieve only those edges that need to be traversed.
Suppose that Hercules battled hundreds of monsters in addition to the three captured in the introductory <<getting-started, Graph of the Gods>>. Without a vertex-centric index, a query asking for those monsters battled between time point `10` and `20` would require retrieving all `battled` edges even though there are only a handful of matching edges.
[source, gremlin]
h = g.V().has('name', 'hercules').next()
g.V(h).outE('battled').has('time', inside(10, 20)).inV()
Building a vertex-centric index by time speeds up such traversal queries.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled, 'battlesByTime', Direction.BOTH, Order.decr, time)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitRelationIndexStatus(graph, 'battlesByTime', 'battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(mgmt.getEdgeLabel('battled'), 'battlesByTime'), SchemaAction.REINDEX).get()
mgmt.commit()
This example builds a vertex-centric index which indexes `battled` edges in both directions by time in decreasing order.
A vertex-centric index is built against a particular edge label which is the first argument to the index construction method `TitanManagement.buildEdgeIndex()`. The index only applies to edges of this label - `battled` in the example above. The second argument is a unique name for the index. The third argument is the edge direction in which the index is built. The index will only apply to traversals along edges in this direction. In this example, the vertex-centric index is built in both directions, which means that time-restricted traversals along `battled` edges can be served by this index in both the `IN` and `OUT` direction. Titan will maintain a vertex-centric index on both the in- and out-vertex of `battled` edges. Alternatively, one could define the index to apply to the `OUT` direction only, which would speed up traversals from Hercules to the monsters but not in the reverse direction. This would only require maintaining one index and hence half the index maintenance and storage cost.
The last two arguments are the sort order of the index and a list of property keys to index by. The sort order is optional and defaults to ascending order (i.e. `Order.incr`). The list of property keys must be non-empty and defines the keys by which to index the edges of the given label. A vertex-centric index can be defined with multiple keys.
[source, gremlin]
graph.tx().rollback() //Never create new indexes while a transaction is active
mgmt = graph.openManagement()
time = mgmt.getPropertyKey('time')
rating = mgmt.makePropertyKey('rating').dataType(Double.class).make()
battled = mgmt.getEdgeLabel('battled')
mgmt.buildEdgeIndex(battled, 'battlesByRatingAndTime', Direction.OUT, Order.decr, rating, time)
mgmt.commit()
//Wait for the index to become available
mgmt.awaitRelationIndexStatus(graph, 'battlesByRatingAndTime', 'battled').call()
//Reindex the existing data
mgmt = graph.openManagement()
mgmt.updateIndex(mgmt.getRelationIndex(mgmt.getEdgeLabel('battled'), 'battlesByRatingAndTime'), SchemaAction.REINDEX).get()
mgmt.commit()
This example extends the schema by a `rating` property on `battled` edges and builds a vertex-centric index which indexes `battled` edges in the out-going direction by rating and time in decreasing order. Note that the order in which the property keys are specified is important because vertex-centric indexes are prefix indexes. This means that `battled` edges are indexed by `rating` _first_ and `time` _second_.
[source, gremlin]
//Add some rating data
h = g.V().has('name', 'hercules').next()
g.V(h).outE('battled').property('rating', 5.0) //Add some rating properties
g.V(h).outE('battled').has('rating', gt(3.0)).inV()
g.V(h).outE('battled').has('rating', 5.0).has('time', inside(10, 50)).inV()
g.V(h).outE('battled').has('time', inside(10, 50)).inV()
Hence, the `battlesByRatingAndTime` index can speed up the first two but not the third query.
Multiple vertex-centric indexes can be built for the same edge label in order to support different constraint traversals. Titan's query optimizer attempts to pick the most efficient index for any given traversal. Vertex-centric indexes only support equality and range/interval constraints.
[NOTE]
The property keys used in a vertex-centric index must have an explicitly defined data type (i.e. _not_ `Object.class`) which supports a native sort order. If the data type is a floating point number, Titan's custom `Decimal` or `Precision` data types must be used, which have a fixed number of decimals.
If the vertex-centric index is built against an edge label that is defined in the same management transaction, the index will be immediately available for querying. If the edge label has already been in use, building a vertex-centric index against it requires the execution of a <<reindex, reindex procedure>> to ensure that the index contains all previously added edges. Until the reindex procedure has completed, the index will not be available.
[NOTE]
Titan automatically builds vertex-centric indexes per edge label and property key. That means, even with thousands of incident `battled` edges, queries like `g.V(h).out('mother')` or `g.V(h).values('age')` are efficiently answered by the local index.
Vertex-centric indexes cannot speed up unconstrained traversals which require traversing through all incident edges of a particular label. Those traversals will become slower as the number of incident edges increases. Often, such traversals can be rewritten as constrained traversals that can utilize a vertex-centric index to ensure acceptable performance at scale.
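For instance, assuming the `battlesByTime` index from above, an unconstrained traversal over `battled` edges can often be rewritten with an explicit time constraint so that the index applies; a sketch (the `gt(0)` bound assumes all time values are positive):

[source, gremlin]
h = g.V().has('name', 'hercules').next()
g.V(h).outE('battled').inV() //Unconstrained: retrieves all incident 'battled' edges
g.V(h).outE('battled').has('time', gt(0)).inV() //Constrained: can be served by 'battlesByTime'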
Ordered Traversals
^^^^^^^^^^^^^^^^^^
The following queries specify an order in which the incident edges are to be traversed. Use the `local` step to retrieve a subset of the edges (in a given order) for EACH vertex that is traversed.
[source, gremlin]
h = g.V().has('name', 'hercules').next()
g.V(h).local(outE('battled').order().by('time', decr).limit(10)).inV().values('name')
g.V(h).local(outE('battled').has('rating', 5.0).order().by('time', decr).limit(10)).values('place')
The first query asks for the names of the 10 most recently battled monsters by Hercules. The second query asks for the places of the 10 most recent battles of Hercules that are rated 5 stars. In both cases, the query is constrained by an order on a property key with a limit on the number of elements to be returned.
Such queries can also be efficiently answered by vertex-centric indexes if the order key matches the key of the index and the requested order (i.e. increasing or decreasing) is the same as the one defined for the index. The `battlesByTime` index would be used to answer the first query and `battlesByRatingAndTime` applies to the second. Note that the `battlesByRatingAndTime` index cannot be used to answer the first query because an equality constraint on `rating` must be present for the second key in the index to be effective.
[NOTE]
Ordered vertex queries are a Titan extension to Gremlin, which is the reason for the somewhat verbose syntax using the `local` step shown above.
[[tx]]
Transactions
------------
Almost all interaction with Titan is associated with a transaction. Titan transactions are safe for concurrent use by multiple threads. Methods on a TitanGraph instance like `graph.vertices(...)` and `graph.tx().commit()` perform a `ThreadLocal` lookup to retrieve or create a transaction associated with the calling thread. Callers can alternatively forego `ThreadLocal` transaction management in favor of calling `graph.newTransaction()`, which returns a reference to a transaction object with methods to read/write graph data and commit or rollback.
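In contrast to the thread-bound approach, a transaction obtained via `newTransaction()` can be passed around explicitly; a minimal sketch (assuming `graph` is an open `TitanGraph`):

[source, gremlin]
----
tx = graph.newTransaction() //Transaction object not bound to the calling thread
v = tx.addVertex() //Read/write through the transaction object
v.property("name", "juno")
tx.commit() //Or tx.rollback() to discard the changes
----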
Titan transactions are not necessarily ACID. They can be so configured on BerkeleyDB, but they are not generally so on Cassandra or HBase, where the underlying storage system does not provide serializable isolation or multi-row atomic writes and the cost of simulating those properties would be substantial.
This section describes Titan's transactional semantics and API.
Transaction Handling
~~~~~~~~~~~~~~~~~~~~
Every graph operation in Titan occurs within the context of a transaction. According to the Blueprints specification, each thread opens its own transaction against the graph database with the first operation (i.e. retrieval or mutation) on the graph:
[source, gremlin]
----
graph = TitanFactory.open("berkeleyje:/tmp/titan")
juno = graph.addVertex() //Automatically opens a new transaction
juno.property("name", "juno")
graph.tx().commit() //Commits transaction
----
In this example, a local Titan graph database is opened. Adding the vertex "juno" is the first operation (in this thread), which automatically opens a new transaction. All subsequent operations occur in the context of that same transaction until the transaction is explicitly stopped or the graph database is `shutdown()`. If transactions are still open when `shutdown()` is called, then the behavior of the outstanding transactions is technically undefined. In practice, any non-thread-bound transactions will usually be effectively rolled back, but the thread-bound transaction belonging to the thread that invoked shutdown will first be committed. Note that both read and write operations occur within the context of a transaction.
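A transaction can likewise be discarded explicitly instead of committed; a minimal sketch (the vertex name is illustrative):

[source, gremlin]
----
graph = TitanFactory.open("berkeleyje:/tmp/titan")
v = graph.addVertex() //Automatically opens a new transaction
v.property("name", "neptune")
graph.tx().rollback() //Discards the vertex and property addition
----

After the rollback, the vertex was never persisted and is not visible to subsequent transactions.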
Transactional Scope
~~~~~~~~~~~~~~~~~~~
All graph elements (vertices, edges, and types) are associated with the transactional scope in which they were retrieved or created. Under Blueprints' default transactional semantics, transactions are automatically created with the first operation on the graph and closed explicitly using `commit()` or `rollback()`. Once the transaction is closed, all graph elements associated with that transaction become stale and unavailable. However, Titan will automatically transition vertices and types into the new transactional scope, as shown in this example:
[source, gremlin]
----
graph = TitanFactory.open("berkeleyje:/tmp/titan")
juno = graph.addVertex() //Automatically opens a new transaction
graph.tx().commit() //Ends transaction
juno.property("name", "juno") //Vertex is automatically transitioned
----
Edges, on the other hand, are not automatically transitioned and cannot be accessed outside their original transaction. They must be explicitly transitioned.
[source, gremlin]
----
e = juno.addEdge("knows", graph.addVertex())
graph.tx().commit() //Ends transaction
e = g.E(e).next() //Need to refresh edge
e.property("time", 99)
----
Transaction Failures
~~~~~~~~~~~~~~~~~~~~
When committing a transaction, Titan will attempt to persist all changes to the storage backend. This might not always be successful due to IO exceptions, network errors, machine crashes or resource unavailability. Hence, transactions can fail. In fact, transactions *will eventually fail* in sufficiently large systems. Therefore, we highly recommend writing code that expects and accommodates such failures.
[source, gremlin]
----
try {
    if (g.V().has("name", name).iterator().hasNext())
        throw new IllegalArgumentException("Username already taken: " + name)
    user = graph.addVertex()
    user.property("name", name)
    graph.tx().commit()
} catch (Exception e) {
    //Recover, retry, or return error message
    println(e.getMessage())
}
----
The example above demonstrates a simplified user signup implementation where `name` is the name of the user who wishes to register. First, the code checks whether a user with that name already exists. If not, it creates a new user vertex and assigns the name. Finally, it commits the transaction.
If the transaction fails, a `TitanException` is thrown. There are a variety of reasons why a transaction may fail. Titan differentiates between _potentially temporary_ and _permanent_ failures.
Potentially temporary failures are those related to resource unavailability and IO hiccups (e.g. network timeouts). Titan automatically attempts to recover from temporary failures by retrying to persist the transactional state after some delay. The number of retry attempts and the retry delay are configurable (see <<titan-config-ref>>).
Permanent failures can be caused by complete connection loss, hardware failure or lock contention. To understand how lock contention arises, consider the signup example above and suppose a user tries to sign up with the username "juno". That username may still be available at the beginning of the transaction, but by the time the transaction is committed another user might have concurrently registered "juno" as well; that concurrent transaction holds the lock on the username, causing this transaction to fail. Depending on the transaction semantics, one can recover from a lock contention failure by re-running the entire transaction.
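Such a re-run can be sketched as a bounded retry loop around the signup logic (a minimal sketch; `maxRetries` and the loop structure are illustrative assumptions, not a prescribed Titan pattern):

[source, gremlin]
----
maxRetries = 3
for (i in 1..maxRetries) {
    try {
        //Re-run the entire signup transaction from the beginning
        if (g.V().has("name", name).iterator().hasNext())
            throw new IllegalArgumentException("Username already taken: " + name)
        user = graph.addVertex()
        user.property("name", name)
        graph.tx().commit()
        break //Success, stop retrying
    } catch (TitanException ex) {
        graph.tx().rollback() //Discard the failed transaction before retrying
    }
}
----

Note that the availability check must be repeated inside the loop: the whole point of re-running the transaction is that the state observed in the failed attempt may no longer hold.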