Graph Representation

While many papers proceed from a graph-theoretic definition G = (V, E), such definitions typically fail to take into account two important aspects of graph technologies in industry practice:

  1. labels and properties (key/value attribute pairs) for more effective modeling of linked data
  2. internationalized resource identifiers (IRIs) as unique identifiers that map into controlled vocabularies, which can be leveraged for graph queries and semantic inference

Industry analysts sometimes describe these two concerns as being represented by competing approaches, namely the labeled property graph (LPG) representation versus the semantic web standards defined by the World Wide Web Consortium (W3C). Efforts such as #hartig14 are in progress to harmonize both needs within the same graph, working toward eventual standards. However, with some discipline in data modeling practices, both criteria can be met within current graph frameworks, provided that:

  • nodes and edges each have specific labels which serve as IRIs that map to a set of controlled vocabularies
  • nodes and edges each have properties, which include probabilities from the point of generation

Building on the definitions given in #martonsv17 and #qin2023sgr, this project primarily uses an LPG representation, while adhering to the aforementioned data modeling discipline.

G = (V, E, src, tgt, lbl, P) is an edge-labeled directed multigraph with:

  • a set of nodes V
  • a set of edges E
  • function src: E → V that associates each edge with its source vertex
  • function tgt: E → V that associates each edge with its target vertex
  • function lbl: E → dom(S) that associates each edge with its label
  • function P: (V ∪ E) → 2^p that associates nodes and edges with their properties
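
As a rough illustration, the definition above can be mirrored almost one-to-one in a small Python data structure. This is only a sketch, not the project's implementation: the class name PropertyGraph, the helper methods, the example.org IRIs, and the "prob" property key are all hypothetical, and node and edge identifiers are assumed to be disjoint so that P can be keyed on V ∪ E.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, Set


@dataclass
class PropertyGraph:
    """Sketch of an edge-labeled directed multigraph G = (V, E, src, tgt, lbl, P)."""

    V: Set[Hashable] = field(default_factory=set)                 # nodes
    E: Set[Hashable] = field(default_factory=set)                 # edges
    src: Dict[Hashable, Hashable] = field(default_factory=dict)   # E -> V
    tgt: Dict[Hashable, Hashable] = field(default_factory=dict)   # E -> V
    lbl: Dict[Hashable, str] = field(default_factory=dict)        # E -> label (an IRI here)
    P: Dict[Hashable, Dict[str, Any]] = field(default_factory=dict)  # (V ∪ E) -> properties

    def add_node(self, node: Hashable, **props: Any) -> None:
        self.V.add(node)
        self.P.setdefault(node, {}).update(props)

    def add_edge(self, edge: Hashable, source: Hashable, target: Hashable,
                 label: str, **props: Any) -> None:
        # src and tgt must map into V, so make sure both endpoints exist.
        self.add_node(source)
        self.add_node(target)
        self.E.add(edge)
        self.src[edge] = source
        self.tgt[edge] = target
        self.lbl[edge] = label
        self.P.setdefault(edge, {}).update(props)


# Hypothetical data: labels are IRIs, properties carry a probability.
g = PropertyGraph()
g.add_node("n1", text="entity A", prob=0.92)
g.add_node("n2", text="entity B", prob=0.88)
# Two parallel edges between the same nodes: allowed in a multigraph,
# because each edge has its own identity in E.
g.add_edge("e1", "n1", "n2", "http://example.org/vocab/relatedTo", prob=0.75)
g.add_edge("e2", "n1", "n2", "http://example.org/vocab/locatedIn", prob=0.40)
```

Keeping edges as first-class identifiers, rather than (source, target) pairs, is what permits parallel edges with different labels and properties between the same two nodes.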

The project architecture enables a "map-reduce" style of distributed processing, so that "chunks" of text (e.g., paragraphs) can be processed independently, with results being aggregated at the end of a batch. The intermediate processing of each "chunk" uses NetworkX #hagberg2008 to allow for running in-memory graph algorithms and analytics, and to integrate more efficiently with graph machine learning libraries. An openCypher representation #martonsv17 is then used to serialize the end results, which get aggregated using the open source KùzuDB graph database #feng2023kuzu and its Python API.
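
The overall shape of this pipeline can be sketched with the standard library alone. The example below is an assumption-laden stand-in, not the project's code: plain dictionaries take the place of the per-chunk NetworkX graphs, the "map" step simply links adjacent tokens, the FOLLOWS label and Token node type are invented for illustration, and the generated openCypher-style strings are not claimed to match what the project sends to KùzuDB.

```python
from functools import reduce
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, label, target)


def map_chunk(chunk: str) -> Dict[Edge, int]:
    """Hypothetical 'map' step: build a small graph for one chunk of text.

    Each pair of adjacent tokens becomes a FOLLOWS edge with a count.
    """
    tokens = chunk.lower().split()
    counts: Dict[Edge, int] = {}
    for source, target in zip(tokens, tokens[1:]):
        edge = (source, "FOLLOWS", target)
        counts[edge] = counts.get(edge, 0) + 1
    return counts


def reduce_graphs(left: Dict[Edge, int], right: Dict[Edge, int]) -> Dict[Edge, int]:
    """'Reduce' step: merge two per-chunk graphs by summing edge counts."""
    merged = dict(left)
    for edge, count in right.items():
        merged[edge] = merged.get(edge, 0) + count
    return merged


def to_cypher(graph: Dict[Edge, int]) -> List[str]:
    """Serialize the aggregated graph as openCypher-style MERGE statements.

    No escaping is done here; real code should pass values as query parameters.
    """
    return [
        f"MERGE (a:Token {{text: '{s}'}}) "
        f"MERGE (b:Token {{text: '{t}'}}) "
        f"MERGE (a)-[:{label} {{count: {n}}}]->(b)"
        for (s, label, t), n in sorted(graph.items())
    ]


# Chunks are processed independently, then aggregated at the end of the batch.
chunks = ["graph data links graph data", "graph data models"]
merged = reduce(reduce_graphs, map(map_chunk, chunks), {})
statements = to_cypher(merged)
```

Because map_chunk depends only on its own chunk, the map step could be handed to a process pool or a distributed scheduler without changing the reduce step.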