Implementation Details
This project implements an LLM-augmented textgraph algorithm for constructing a lemma graph from raw, unstructured text sources.
The TextGraphs library is based on work developed by Derwen in 2023 Q2 for customer apps and used in our Cysoni product.
This library integrates code from several other open source libraries.
For more background about early efforts which led to this line of inquiry, see the recent talks:
- "Language, Graphs, and AI in Industry" Paco Nathan, K1st World (2023-10-11) (video)
- "Language Tools for Creators" Paco Nathan, FOSSY (2023-07-13)
The TextGraphs library demonstrates integrations of several of these kinds of components, complemented by the use of graph queries, graph algorithms, and other related tooling.
Admittedly, the results present a "hybrid" approach:
it's not purely "generative" -- whatever that might mean.
A core principle here is to provide results from the natural language workflows which can be used for expert feedback. In other words, how can we support human-in-the-loop (HITL) processes?
Another principle has been to create a Python library built to produce configurable, extensible pipelines.
Care has been given to writing code that can be run concurrently (e.g., leveraging asyncio), using dependencies which have business-friendly licenses, and paying attention to security concerns.
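As a minimal sketch of that concurrency style (not the TextGraphs API itself), the hypothetical `parse_document` coroutine below stands in for a pipeline stage, and `asyncio.gather` fans it out across documents:

```python
import asyncio

async def parse_document(doc_id: str, text: str) -> dict:
    """
    Hypothetical coroutine standing in for one pipeline stage;
    a real application would call into its own parsing code here.
    """
    await asyncio.sleep(0)  # yield control back to the event loop
    return {"doc_id": doc_id, "num_chars": len(text)}

async def main(corpus: dict) -> list:
    # schedule one task per document, then gather the results
    tasks = [
        parse_document(doc_id, text)
        for doc_id, text in corpus.items()
    ]
    return await asyncio.gather(*tasks)

if __name__ == "__main__":
    corpus = {
        "doc1": "Werner Herzog is a remarkable filmmaker.",
        "doc2": "Blue whales are the largest animals on Earth.",
    }
    print(asyncio.run(main(corpus)))
```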
The library provides three main affordances for AI applications:
- With the default settings, one can use TextGraphs to extract ranked key phrases from raw text -- even without using any of the additional deep learning models (see the first sketch below).
- Going a few steps further, one can generate an RDF or LPG graph from raw texts, and make use of entity linking, relation extraction, and other techniques to ground the natural language parsing by leveraging some knowledge graph which represents a particular domain. Default examples use WikiMedia graphs: DBPedia, Wikidata, etc. (see the second sketch below).
- A third set of goals for TextGraphs is to provide a "playground" or "gym" for evaluating graph levels of detail, i.e., abstraction layers for knowledge graphs, and for exploring some of the emerging work to produce foundation models for knowledge graphs through topological transforms.
Regarding the third point, consider how language parsing produces graphs by definition, although NLP results tend to be quite noisy. The annotations inferred by NLP pipelines often get thrown out. This seemed like a good opportunity to generate sample data for "condensing" graphs into more abstracted representations. In other words, patterns within the relatively noisy parse results can be condensed into relatively refined knowledge graph elements.
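Here is a minimal sketch of that condensation idea, using spaCy and NetworkX rather than the TextGraphs internals: build a graph over the full dependency parse, then keep only lemma-level nodes for nouns and proper nouns:

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Werner Herzog, a remarkable filmmaker, was born in Germany.")

# a noisy graph: one node per token, one edge per dependency arc
parse_graph = nx.DiGraph()

for token in doc:
    parse_graph.add_node(token.i, text=token.text, pos=token.pos_)

    if token.head.i != token.i:
        parse_graph.add_edge(token.head.i, token.i, dep=token.dep_)

# condense: keep only the lemmas of noun and proper-noun tokens,
# linking lemmas whose tokens were connected in the parse
keep = {token.i: token.lemma_ for token in doc if token.pos_ in ("NOUN", "PROPN")}

lemma_graph = nx.Graph()

for src, dst in parse_graph.edges():
    if src in keep and dst in keep:
        lemma_graph.add_edge(keep[src], keep[dst])

print(parse_graph.number_of_nodes(), "token nodes in the parse graph")
print(list(lemma_graph.edges()))
```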
Note that while the spaCy library for NLP plays a central role, the TextGraphs library is not intended to become a spaCy pipeline.