Implementation Details

This project implements an LLM-augmented textgraph algorithm for constructing a lemma graph from raw, unstructured text sources.

The TextGraphs library is based on work developed by Derwen in 2023 Q2 for customer apps and used in our Cysoni product.

This library integrates code from:

For more background about early efforts which led to this line of inquiry, see the recent talks:

The TextGraphs library shows how several of these kinds of components can be integrated, complemented by graph queries, graph algorithms, and other related tooling. Admittedly, the result is a "hybrid" approach: it is not purely "generative" -- whatever that might mean.

A core principle here is to produce results from the natural language workflows which can be used for expert feedback. In other words, how can we better support human-in-the-loop (HITL) processes?

Another principle has been to create a Python library built to produce configurable, extensible pipelines. Care has been given to writing code that can be run concurrently (e.g., leveraging asyncio), using dependencies which have business-friendly licenses, and paying attention to security concerns.
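As a hedged sketch of that concurrency principle only -- not the TextGraphs API -- the coroutine below is a hypothetical stand-in for a pipeline stage, showing how multiple documents could be analyzed concurrently with asyncio:

```python
import asyncio
from typing import Dict, List

async def analyze_text (text: str) -> Dict[str, int]:
    """Hypothetical pipeline stage: in practice this might await a remote model call."""
    await asyncio.sleep(0.1)  # placeholder for I/O-bound work
    return { "n_tokens": len(text.split()) }

async def run_pipeline (docs: List[str]) -> List[Dict[str, int]]:
    """Run one analysis task per document, concurrently."""
    return await asyncio.gather(*[ analyze_text(doc) for doc in docs ])

if __name__ == "__main__":
    results = asyncio.run(run_pipeline([ "first document", "second document" ]))
    print(results)
```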

The library provides three main affordances for AI applications:

  1. With the default settings, one can use TextGraphs to extract ranked key phrases from raw text -- even without using any of the additional deep learning models. A rough sketch of this idea follows the list below.

  2. Going a few steps further, one can generate an RDF or LPG graph from raw texts, making use of entity linking, relation extraction, and other techniques to ground the natural language parsing in a knowledge graph which represents a particular domain. Default examples use WikiMedia graphs: DBPedia, Wikidata, etc. A second sketch after this list illustrates the RDF serialization step.

  3. A third set of goals for TextGraphs is to provide a "playground" or "gym" for evaluating graph levels of detail, i.e., abstraction layers for knowledge graphs, and to explore some of the emerging work on producing foundation models for knowledge graphs through topological transforms.
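As a rough illustration of the first point, here is a minimal sketch of textrank-style phrase ranking built directly on spaCy and NetworkX. It is not the TextGraphs API; the window size, part-of-speech filter, and example sentence are arbitrary choices for demonstration.

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Werner Herzog is a remarkable filmmaker and intellectual originally from Germany.")

# one node per content-word lemma
lemmas = [ tok.lemma_ for tok in doc if tok.pos_ in { "NOUN", "PROPN", "ADJ", "VERB" } ]

graph = nx.Graph()
graph.add_nodes_from(lemmas)

# link lemmas which co-occur within a small sliding window
WINDOW = 3

for i, lemma in enumerate(lemmas):
    for other in lemmas[i + 1 : i + WINDOW]:
        graph.add_edge(lemma, other)

rank = nx.pagerank(graph)

# score each noun chunk by the summed rank of its lemmas
scores = {
    chunk.text: sum(rank.get(tok.lemma_, 0.0) for tok in chunk)
    for chunk in doc.noun_chunks
}

for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{score:.4f}  {phrase}")
```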
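For the second point, the following hedged sketch shows one way linked entities might be serialized as RDF with rdflib, assuming an upstream entity-linking step has already resolved Wikidata IRIs; the specific identifiers and property shown are only illustrative, not output from the library.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

WD = Namespace("http://www.wikidata.org/entity/")
WDT = Namespace("http://www.wikidata.org/prop/direct/")

g = Graph()
g.bind("wd", WD)
g.bind("wdt", WDT)

herzog = WD.Q44131   # QID used here for illustration
germany = WD.Q183    # Germany

g.add((herzog, RDFS.label, Literal("Werner Herzog", lang="en")))
g.add((herzog, WDT.P27, germany))   # P27: country of citizenship
g.add((germany, RDFS.label, Literal("Germany", lang="en")))

print(g.serialize(format="turtle"))
```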

Regarding the third point, consider how language parsing produces graphs by definition, although NLP results tend to be quite noisy. The annotations inferred by NLP pipelines often get thrown out. This seemed like a good opportunity to generate sample data for "condensing" graphs into more abstracted representations. In other words, patterns within the relatively noisy parse results can be condensed into relatively refined knowledge graph elements.
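To make the idea concrete, here is a hedged sketch, independent of the TextGraphs internals, which distills dependency-parse output into candidate triples by walking subject-verb-object arcs; the example sentence and dependency-label sets are chosen only for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Werner Herzog directed Fitzcarraldo in the Peruvian rainforest.")

triples = []

for tok in doc:
    if tok.pos_ == "VERB":
        subjects = [ c for c in tok.children if c.dep_ in { "nsubj", "nsubjpass" } ]
        objects = [ c for c in tok.children if c.dep_ in { "dobj", "attr" } ]

        for subj in subjects:
            for obj in objects:
                triples.append(( subj.text, tok.lemma_, obj.text ))

print(triples)  # expected output along the lines of [('Herzog', 'direct', 'Fitzcarraldo')]
```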

Note that while the spaCy library for NLP plays a central role, the TextGraphs library is not intended to become a spaCy pipeline.