Graph-Based Data Science¶
The kglab package provides a simple abstraction layer in Python for building knowledge graphs.
The main goal is to leverage idiomatic Python for common use cases in data science and data engineering work that require graph data, presenting graph-based data science as an emerging practice.
Cut to the Chase¶
- To get started right away, jump to Getting Started
- For an extensive, hands-on coding tour through kglab, follow the Tutorial notebooks
- Check the source code at https://github.com/DerwenAI/kglab
Motivations¶
Note
FAQ: Why build yet another graph library, when there are already so many available?
A short list of primary motivations has been identified for kglab, its design criteria, and its engineering trade-offs:
Popular Graph Libraries¶
Point 1: integrate with popular graph libraries, including RDFlib, OWL-RL, pySHACL, NetworkX, iGraph, PyVis, node2vec, pslpython, pgmpy, and so on – several of which would otherwise not have much common ground.
Data Science Workflows¶
Point 2: close integration plus example code for working with the "PyData" stack, namely pandas, NumPy, scikit-learn, matplotlib, etc., as well as PyTorch, and other quintessential data science tools.
Distributed Systems Infrastructure¶
Point 3: integrate efficiently with Big Data tools and practices for contemporary data engineering and cloud computing infrastructure, including: Ray, Jupyter, RAPIDS, Apache Arrow, Apache Parquet, Apache Spark, etc.
Natural Language Understanding¶
Point 4: incorporate graph-based methods and semantic technologies into spaCy pipelines, e.g., through pytextrank, plus biome.text and other customized natural language pipelines.
Hybrid AI Approaches¶
Point 5: explore "hybrid" approaches that combine machine learning with symbolic, rule-based processing – including probabilistic graph inference and knowledge graph embedding.
Abstraction Layer¶
The overall intent of kglab is to build an abstraction layer for KG work in Python. This is provided as a library, not as a framework. It's difficult to imagine how to implement this kind of abstraction layer outside of a functional programming language.
Consider the fact that many dependencies have their origins in the Semantic Web. The ongoing work of W3C provides ontologies, standards, and other initiatives that are incredibly valuable for graph-based work. That overall effort began in the 1990s, and arguably its momentum imploded circa 2005 – despite best intentions by brilliant individuals and quite capable organizations.
In retrospect, it was a classic case of a technology being "too early" since those efforts generally lacked the necessary compute resources and language constructs. The "Big Data" efforts did not really take off until a few years following 2005. For example, Apache Spark would never have been possible prior to the mid-2000s introduction of: the Scala language (2004), commodity multi-core processors (2005), cloud computing (2006), actor model (2006), and so on.
Arguably, many challenges faced by the Semantic Web developer community can be traced to their nearly-exclusive focus on using Java, C, or C++ for reference implementations of their proposed standards. They did not benefit from the many lessons about distributed systems that arrived a decade later.
In particular, applicative systems leverage functional programming constructs to implement valuable uses of advanced math when working with data at scale. This allows for cost-effective parallel processing that is relatively simple to use. As a "thought exercise", consider how the semantic technologies might have differed if they had been launched after Spark became popular. Stated differently, kglab is a direct exploration of how semantic technologies and other graph-based techniques can be improved by using contemporary distributed systems as a foundation.
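As a rough illustration of the functional style described here, the following sketch counts predicate usage across partitions of triples with a map step followed by a reduce step. It uses only the Python standard library. The sample triples, the two-way partitioning, and the names `count_predicates` and `merge_counts` are assumptions made up for this example; they are not part of kglab or Spark.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample data, already split into two partitions.
PARTITIONS: List[List[Triple]] = [
    [
        ("ex:alice", "foaf:knows", "ex:bob"),
        ("ex:alice", "rdf:type", "foaf:Person"),
    ],
    [
        ("ex:bob", "foaf:knows", "ex:carol"),
        ("ex:bob", "rdf:type", "foaf:Person"),
        ("ex:carol", "rdf:type", "foaf:Person"),
    ],
]


def count_predicates(partition: List[Triple]) -> Counter:
    """Map step: a pure function applied to one partition at a time."""
    return Counter(predicate for _, predicate, _ in partition)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: an associative merge of two partial results."""
    return left + right


# Each partition is processed independently, so the map step can run in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(count_predicates, PARTITIONS))

totals = reduce(merge_counts, partials, Counter())
print(totals.most_common())
```

Because the map function has no side effects and the merge is associative, the partial results can be combined in any grouping, which is what makes this pattern straightforward to distribute.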
Python 3.x provides just enough of a foundation as a functional programming language – e.g., classes, type annotations, closures, and so on – to make kglab feasible. While perhaps this might be simpler to write in Clojure, Scala, Haskell, etc., those languages lack enough "critical mass" in terms of graph libraries or user communities to sustain this kind of open source project.
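To make the mention of classes, type annotations, and closures more concrete, here is a minimal standard-library sketch. The `Triple` class, the `predicate_filter` helper, and the sample data are hypothetical names chosen for illustration; none of them come from kglab.

```python
from typing import Callable, Iterable, List, NamedTuple


class Triple(NamedTuple):
    """A typed record for one graph statement (hypothetical, for illustration)."""
    subject: str
    predicate: str
    obj: str


def predicate_filter(predicate: str) -> Callable[[Iterable[Triple]], List[Triple]]:
    """Return a closure that keeps only triples with the given predicate."""
    def apply(triples: Iterable[Triple]) -> List[Triple]:
        # `predicate` is captured from the enclosing function's scope.
        return [t for t in triples if t.predicate == predicate]
    return apply


triples = [
    Triple("ex:alice", "foaf:knows", "ex:bob"),
    Triple("ex:alice", "foaf:name", "Alice"),
    Triple("ex:bob", "foaf:knows", "ex:carol"),
]

# Build a reusable filter once, then apply it like any other function.
knows = predicate_filter("foaf:knows")
acquaintances = [(t.subject, t.obj) for t in knows(triples)]
print(acquaintances)
```

The closure lets a small piece of configuration (the predicate) travel with the function, while the annotations document what flows in and out.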
Community Resources¶
Getting Help¶
The Knowledge Graph Conference hosts several community resources where you can post questions and get help about kglab and related KG topics.
- community Slack – specifically on the #ask channel
- Knowledge Tech Q&A site for extended discussions
- "KG 101" tutorial at Knowledge Connexions 2020
- Just Enough Math group on LinkedIn – join to receive related updates, news, conference coupons, etc.
KGC also hosts "knowledge espresso" (monthly office hours) with Paco Nathan and others involved in this open source project.
Feedback and Roadmap¶
Note
SPECIAL REQUEST: Which features would you like to see the most in an open source Python library for building knowledge graphs?
Your feedback through this online survey helps us prioritize the roadmap for kglab: https://forms.gle/FMHgtmxHYWocprMn6
Links for other open source community resources: