Graph-Based Data Science

The kglab package provides a simple abstraction layer in Python for building knowledge graphs.

The main goal is to leverage idiomatic Python for common use cases in data science and data engineering work that require graph data, presenting graph-based data science as an emerging practice.

Cut to the Chase

  1. To get started right away, jump to Getting Started
  2. For an extensive, hands-on coding tour through kglab, follow the Tutorial notebooks
  3. Check the source code at https://github.com/DerwenAI/kglab

Motivations

Note

FAQ: Why build yet another graph library, when there are already so many available?

A short list of primary motivations has been identified for kglab, along with its design criteria and engineering trade-offs:

Point 1: integrate with popular graph libraries, including RDFlib, OWL-RL, pySHACL, NetworkX, iGraph, PyVis, node2vec, pslpython, pgmpy, and so on – several of which would otherwise not have much common ground.

Data Science Workflows

Point 2: close integration plus example code for working with the "PyData" stack, such as pandas, NumPy, scikit-learn, and matplotlib, as well as PyTorch and other quintessential data science tools.

Distributed Systems Infrastructure

Point 3: integrate efficiently with Big Data tools and practices for contemporary data engineering and cloud computing infrastructure, including: Ray, Jupyter, RAPIDS, Apache Arrow, Apache Parquet, Apache Spark, etc.

Natural Language Understanding

Point 4: incorporate graph-based methods and semantic technologies into spaCy pipelines, e.g., through pytextrank, plus biome.text and other customized natural language pipelines.

Hybrid AI Approaches

Point 5: explore "hybrid" approaches that combine machine learning with symbolic, rule-based processing – including probabilistic graph inference and knowledge graph embedding.

Abstraction Layer

The overall intent of kglab is to build an abstraction layer for KG work in Python. This is provided as a library, not as a framework. It's difficult to imagine how to implement this kind of abstraction layer outside of a functional programming language.

Consider the fact that many of kglab's dependencies have their origins in the Semantic Web. The ongoing work of W3C provides ontologies, standards, and other initiatives that are incredibly valuable for graph-based work. That overall effort began in the 1990s, and arguably its momentum imploded circa 2005 – despite best intentions by brilliant individuals and quite capable organizations.

In retrospect, it was a classic case of a technology being "too early" since those efforts generally lacked the necessary compute resources and language constructs. The "Big Data" efforts did not really take off until a few years following 2005. For example, Apache Spark would never have been possible prior to the mid-2000s introduction of: the Scala language (2004), commodity multi-core processors (2005), cloud computing (2006), actor model (2006), and so on.

Arguably, many challenges faced by the Semantic Web developer community can be traced to their nearly exclusive focus on using Java, C, or C++ for reference implementations of their proposed standards. They did not benefit from the many lessons about distributed systems which arrived a decade later.

In particular, applicative systems leverage functional programming constructs to implement valuable uses of advanced math when working with data at scale. This allows for cost-effective parallel processing that is relatively simple to use. As a "thought exercise", consider how the semantic technologies might have differed if they had been launched after Spark became popular. Stated differently, kglab is a direct exploration of how semantic technologies and other graph-based techniques can be improved by using contemporary distributed systems as a foundation.

Python 3.x provides just enough of a foundation as a functional programming language – e.g., classes, type annotations, closures, and so on – to make kglab feasible. While perhaps this might be simpler to write in Clojure, Scala, Haskell, etc., those languages lack enough "critical mass" in terms of graph libraries or user communities to sustain this kind of open source project.
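To make the point about language constructs concrete, here is a minimal sketch using only the Python standard library. It is not kglab's API: the `Triple` type, the `has_predicate` and `count_by_subject` helpers, and the sample data are all hypothetical, chosen only to show how classes, type annotations, and closures can combine in a functional style over graph-like data.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, NamedTuple


class Triple(NamedTuple):
    """Hypothetical (subject, predicate, object) statement; not a kglab type."""
    subject: str
    predicate: str
    obj: str


# Type alias: any function that accepts or rejects a triple.
TripleFilter = Callable[[Triple], bool]


def has_predicate(predicate: str) -> TripleFilter:
    """Return a closure that remembers `predicate` and matches triples on it."""
    def _match(triple: Triple) -> bool:
        return triple.predicate == predicate
    return _match


def count_by_subject(triples: Iterable[Triple]) -> Dict[str, int]:
    """Fold triples into per-subject counts without mutating shared state."""
    def _step(acc: Dict[str, int], triple: Triple) -> Dict[str, int]:
        # Build a new dict at each step, so the fold stays side-effect free.
        return {**acc, triple.subject: acc.get(triple.subject, 0) + 1}
    return reduce(_step, triples, {})


# Made-up sample data, for illustration only.
TRIPLES = [
    Triple("ex:alice", "foaf:knows", "ex:bob"),
    Triple("ex:alice", "foaf:knows", "ex:carol"),
    Triple("ex:bob", "foaf:name", "Bob"),
    Triple("ex:bob", "foaf:knows", "ex:carol"),
]

knows = has_predicate("foaf:knows")
knows_counts = count_by_subject(filter(knows, TRIPLES))
print(knows_counts)
```

Because the filter and the fold are plain functions with no shared mutable state, the same steps could in principle be handed to a parallel or distributed runtime, which is the property the preceding paragraphs argue for.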

Community Resources

Getting Help

The Knowledge Graph Conference hosts several community resources where you can post questions and get help about kglab and related KG topics.

KGC also hosts "knowledge espresso" (monthly office hours) with Paco Nathan and others involved in this open source project.

Feedback and Roadmap

Note

SPECIAL REQUEST: Which features would you like to see the most in an open source Python library for building knowledge graphs?

Your feedback through this online survey helps us prioritize the roadmap for kglab: https://forms.gle/FMHgtmxHYWocprMn6


Last update: 2021-01-21