While many papers proceed from a graph-theoretic definition G = (V, E), these typically fail to take into account two important aspects of graph technologies in industry practice:
- labels and properties (key/value attribute pairs) for more effective modeling of linked data
- internationalized resource identifiers (IRIs) as unique identifiers that map into controlled vocabularies, which can be leveraged for graph queries and semantic inference
Industry analysts sometimes present these two concerns as competing approaches: labeled property graph (LPG) representation versus the semantic web standards defined by the World Wide Web Consortium (W3C). Efforts are in progress to harmonize both needs within the same graphs, such as #hartig14, with a view to eventual standards. However, with some discipline in data modeling practices, both criteria can be met within current graph frameworks, provided that:
- nodes and edges each have specific labels which serve as IRIs that map to a set of controlled vocabularies
- nodes and edges each have properties, which include probabilities from the point of generation
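The two conditions above could be checked at the point where nodes and edges are generated. The following is a minimal sketch, not the project's implementation: the vocabulary `VOCAB`, the `Element` class, and the property key `prob` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical controlled vocabulary: the set of IRIs allowed as labels.
VOCAB = {
    "https://schema.org/Person",
    "https://schema.org/Organization",
    "https://schema.org/worksFor",
}


@dataclass
class Element:
    """A node or an edge: one IRI label plus key/value properties."""
    label: str
    props: Dict[str, Any] = field(default_factory=dict)


def check_element(elem: Element) -> None:
    """Raise ValueError unless the element follows the modeling discipline."""
    # Condition 1: the label must be an IRI from the controlled vocabulary.
    if elem.label not in VOCAB:
        raise ValueError(f"label not in controlled vocabulary: {elem.label}")
    # Condition 2: the properties must carry a probability in [0, 1].
    # The key name "prob" is an assumption made for this sketch.
    prob = elem.props.get("prob")
    if not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
        raise ValueError(f"missing or invalid probability: {prob!r}")


node = Element("https://schema.org/Person", {"name": "Ada", "prob": 0.92})
check_element(node)  # passes without raising
```

Validating at generation time keeps the LPG representation compatible with vocabulary-based queries later, since every label is known to resolve to a vocabulary term.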
Building on definitions given in #martonsv17 and #qin2023sgr, this project primarily uses LPG representation, while adhering to the data modeling discipline described above.
G = (V, E, src, tgt, lbl, P) is an edge-labeled directed multigraph with:
- a set of nodes V
- a set of edges E
- a function src: E → V that associates each edge with its source vertex
- a function tgt: E → V that associates each edge with its target vertex
- a function lbl: E → dom(S) that associates each edge with its label
- a function P: (V ∪ E) → 2^p that associates nodes and edges with their properties
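To make the definition concrete, the tuple can be written down almost literally as a data structure. This is a minimal sketch under stated assumptions: the class name `PropertyGraph` and its methods are hypothetical, each function is modeled as a dictionary, and node and edge identifiers are assumed to be disjoint so that P can use a single mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, Set


@dataclass
class PropertyGraph:
    """Edge-labeled directed multigraph G = (V, E, src, tgt, lbl, P)."""
    V: Set[Hashable] = field(default_factory=set)                 # nodes
    E: Set[Hashable] = field(default_factory=set)                 # edges
    src: Dict[Hashable, Hashable] = field(default_factory=dict)   # E -> V
    tgt: Dict[Hashable, Hashable] = field(default_factory=dict)   # E -> V
    lbl: Dict[Hashable, str] = field(default_factory=dict)        # E -> label
    # (V ∪ E) -> properties; assumes node and edge ids never collide.
    P: Dict[Hashable, Dict[str, Any]] = field(default_factory=dict)

    def add_node(self, v: Hashable, **props: Any) -> None:
        self.V.add(v)
        self.P.setdefault(v, {}).update(props)

    def add_edge(self, e: Hashable, source: Hashable, target: Hashable,
                 label: str, **props: Any) -> None:
        if source not in self.V or target not in self.V:
            raise KeyError("both endpoints must already be nodes")
        self.E.add(e)
        self.src[e] = source
        self.tgt[e] = target
        self.lbl[e] = label
        self.P.setdefault(e, {}).update(props)


g = PropertyGraph()
g.add_node("n1", name="Ada")
g.add_node("n2", name="Analytical Engine")
# Two parallel edges between the same pair of nodes: permitted in a
# multigraph because each edge has its own identifier.
g.add_edge("e1", "n1", "n2", "https://example.org/describes", prob=0.9)
g.add_edge("e2", "n1", "n2", "https://example.org/programs", prob=0.7)
```

Because edges are first-class elements with their own identifiers, two edges can share a source and a target while still carrying different labels and properties.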
The project architecture enables a "map-reduce" style of distributed processing, so that "chunks" of text (e.g., paragraphs) can be processed independently, with results being aggregated at the end of a batch.
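The "map-reduce" pattern described here can be illustrated with a toy example. In this sketch the map step is only a placeholder that counts adjacent word pairs as weighted edges; it stands in for the real per-chunk processing, which is not shown. The function names `process_chunk` and `merge` are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def process_chunk(chunk: str) -> Counter:
    """Map step (placeholder): count adjacent word pairs within one chunk."""
    words = chunk.lower().split()
    return Counter(zip(words, words[1:]))


def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: add up the edge weights of two partial results."""
    return left + right


chunks = ["graphs link data", "link data well"]

# Each chunk is processed independently, so the map step can run in parallel.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))

# Partial results are aggregated once the batch is complete.
edges = reduce(merge, partials, Counter())
```

The key property is that `process_chunk` needs no shared state, so chunks can be distributed across workers and merged in any order.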
The intermediate processing of each "chunk" uses NetworkX #hagberg2008, which allows in-memory graph algorithms and analytics to run and integrates more efficiently with graph machine learning libraries.
openCypher representation #martonsv17 is used to serialize end results, which get aggregated using the open source KùzuDB graph database #feng2023kuzu and its Python API.
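As an illustration of the serialization step, the sketch below renders nodes and edges as openCypher `CREATE` statements using only the standard library. It is not the project's serializer: the helper names are hypothetical, the `id` property is an assumed convention for matching endpoints, and JSON escaping is used as an approximation of Cypher string literals. A database such as KùzuDB may also require table schemas to be declared before loading, which this sketch omits.

```python
import json
from typing import Any, Dict


def quote_name(name: str) -> str:
    """Backtick-quote a label or property key, so that IRIs can be used."""
    return "`" + name.replace("`", "``") + "`"


def props_map(props: Dict[str, Any]) -> str:
    """Render properties as a Cypher map literal with keys in sorted order."""
    inner = ", ".join(
        f"{quote_name(key)}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in sorted(props.items())
    )
    return "{" + inner + "}"


def node_statement(node_id: str, label: str, props: Dict[str, Any]) -> str:
    """CREATE statement for one node; `id` is an assumed key convention."""
    return f"CREATE (:{quote_name(label)} {props_map({'id': node_id, **props})})"


def edge_statement(src_id: str, tgt_id: str, label: str,
                   props: Dict[str, Any]) -> str:
    """MATCH both endpoints by `id`, then CREATE a labeled edge between them."""
    return (
        f"MATCH (a {{`id`: {json.dumps(src_id)}}}), "
        f"(b {{`id`: {json.dumps(tgt_id)}}}) "
        f"CREATE (a)-[:{quote_name(label)} {props_map(props)}]->(b)"
    )


print(node_statement("n1", "Person", {"name": "Ada", "prob": 0.92}))
# CREATE (:`Person` {`id`: "n1", `name`: "Ada", `prob`: 0.92})
```

Emitting plain statements keeps the per-chunk results independent of any particular database until the final aggregation step.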