Graph-Based Data Science
Cut to the Chase
- To get started right away, jump to Getting Started
- For an extensive, hands-on coding tour through kglab, follow the Tutorial notebooks
- Check the source code at https://github.com/DerwenAI/kglab
FAQ: Why build yet another graph library, when there are already so many available?
A short list of primary motivations shaped kglab, its design criteria, and its engineering trade-offs:
Popular Graph Libraries
Point 1: integrate with popular graph libraries, including RDFLib, OWL-RL, pySHACL, NetworkX, igraph, PyVis, node2vec, pslpython, pgmpy, and so on – several of which would otherwise not have much common ground.
Data Science Workflows
Point 2: close integration plus example code for working with the "PyData" stack, namely pandas, NumPy, scikit-learn, matplotlib, etc., as well as PyTorch, and other quintessential data science tools.
Distributed Systems Infrastructure
Point 3: integrate efficiently with Big Data tools and practices for contemporary data engineering and cloud computing infrastructure, including: Ray, Jupyter, RAPIDS, Apache Arrow, Apache Parquet, Apache Spark, etc.
Natural Language Understanding
Hybrid AI Approaches
The overall intent of kglab is to build an abstraction layer for KG work in Python. This is provided as a library, not as a framework. It's difficult to imagine how to implement this kind of abstraction layer outside of a functional programming language.
Consider that many of kglab's dependencies have their origins in the Semantic Web. The ongoing work of the W3C provides ontologies, standards, and other initiatives that are incredibly valuable for graph-based data science. That overall effort began in the 1990s, and arguably its momentum stalled circa 2005 – despite best intentions by brilliant individuals and quite capable organizations.
In retrospect, it was a classic case of a technology being "too early", since those efforts generally lacked the necessary compute resources and language constructs. The "Big Data" efforts did not really take off until a few years after 2005. For example, Apache Spark would not have been possible before several developments of the mid-2000s: the Scala language (2004), commodity multi-core processors (2005), cloud computing (2006), actor-model concurrency in Scala (2006), and so on.
Arguably, many challenges faced by the Semantic Web developer community can be traced to their nearly exclusive focus on using Java, C, or C++ for reference implementations of their proposed standards. They did not benefit from the many lessons about distributed systems that arrived a decade later.
In particular, applicative systems leverage functional programming constructs to implement valuable uses of advanced math when working with data at scale. This allows for cost-effective parallel processing that is relatively simple to use. As a thought exercise, consider how the semantic technologies might have differed if they had been launched after Spark became popular. Stated differently, kglab is a direct exploration of how semantic technologies and other graph-based techniques can be improved by using contemporary distributed systems as a foundation.
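To make the "applicative" idea concrete, here is a minimal sketch in plain Python. It is not kglab code: the sample triples, the `ex:` names, and the helper functions `count_predicates` and `merge_counts` are all hypothetical. A pure map step tallies predicates within each partition of triples, and an associative reduce step merges the tallies. Because neither step has side effects, the partitions could in principle be handed to separate workers without changing the result.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List, Tuple

# (subject, predicate, object); hypothetical sample data, not from kglab
Triple = Tuple[str, str, str]


def count_predicates(partition: Iterable[Triple]) -> Counter:
    """Map step: a pure function that tallies predicates in one partition."""
    return Counter(p for _, p, _ in partition)


def merge_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: an associative merge, so partition order does not matter."""
    return left + right


partitions: List[List[Triple]] = [
    [("ex:alice", "ex:knows", "ex:bob"), ("ex:alice", "ex:name", "Alice")],
    [("ex:bob", "ex:knows", "ex:carol"), ("ex:carol", "ex:name", "Carol")],
]

# The built-in map() runs serially here; a process pool or a cluster
# scheduler could apply the same function to each partition in parallel.
totals = reduce(merge_counts, map(count_predicates, partitions), Counter())
```

The design point is that only the scheduling changes when moving from one machine to many, while the two functions stay exactly as written.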
Python 3.x provides just enough of a foundation as a functional programming language – e.g., classes, type annotations, closures, and so on – to make kglab feasible. While this might be simpler to write in Clojure, Scala, Haskell, etc., those languages lack enough "critical mass" in terms of graph libraries or user communities to sustain this kind of open source project.
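As a small illustration of the constructs named above, the following sketch uses a closure and type annotations to build reusable triple-pattern filters. It is an assumption-laden example in plain Python rather than kglab's API: the names `match`, `select`, and `TripleFilter`, along with the sample data, are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# (subject, predicate, object); hypothetical sample data, not from kglab
Triple = Tuple[str, str, str]
TripleFilter = Callable[[Triple], bool]


def match(
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> TripleFilter:
    """Return a closure that tests a triple against a pattern; None is a wildcard."""
    def matches(triple: Triple) -> bool:
        # The inner function closes over s, p, and o from the enclosing call.
        return all(want is None or want == got for want, got in zip((s, p, o), triple))
    return matches


def select(triples: Iterable[Triple], keep: TripleFilter) -> List[Triple]:
    """Apply any filter to any iterable of triples, preserving order."""
    return [t for t in triples if keep(t)]


triples: List[Triple] = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

knows = select(triples, match(p="ex:knows"))
about_alice = select(triples, match(s="ex:alice"))
```

Here `match` returns a function rather than a result, so a pattern can be defined once and passed around, and the annotations document what each piece accepts and returns.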
Links for open source community resources:
- Issue Tracker
- Project Board
- Graph-Based Data Science group on LinkedIn – join to receive related updates, news, conference coupons, etc.
- "Graph-Based Data Science" talk
Feedback and Roadmap
SPECIAL REQUEST: Which features would you like to see the most in an open source Python library for building knowledge graphs?
Your feedback through this online survey helps us prioritize the roadmap for kglab: https://forms.gle/FMHgtmxHYWocprMn6