Other projects have investigated related lines of inquiry, which help frame the problems encountered.
- primary goal is to generate entities and facts from a KG
- emphasis on handling rare facts from a broad domain of topics and on improving perplexity
- "we are interested in LMs that dynamically decide the facts to incorporate from the KG, guided by the discourse"
- con: uses a relatively simple graph-theoretic notion of graph data, G = (V, E), which is ostensibly RDF
- "traditional LMs are only capable of remembering facts seen at training time, and often have difficulty recalling them"
- introducing KGLM: enables the model to render information it has never seen before, as well as generate out-of-vocabulary tokens
- generates conditional probability of mapping an entity to a parsed token, based on previous tokens and entities within the same stream
- maintains a dynamically growing local KG, a subset of the KG that contains entities that have already been mentioned in the text, and their related entities
- "one of the primary barriers to incorporating factual knowledge into LMs is that training data is hard to obtain"
- provides the Linked WikiText-2 dataset for running benchmarks, available on GitHub
- "For most LMs, it is difficult to control their generation since factual knowledge is entangled with generation capabilities of the model"
- "Standard language modeling corpora consist only of text, and thus are unable to describe which entities or facts each token is referring to. In contrast, while relation extraction datasets link text to a knowledge graph, the text is made up of disjoint sentences that do not provide sufficient context to train a powerful language model."
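The local-KG mechanism described above can be sketched in a few lines. This is a toy illustration, not KGLM's implementation: the entity names and facts are illustrative, and the real model scores mention/copy decisions with a neural LM rather than rule-based bookkeeping.

```python
# Toy sketch of KGLM-style local-KG bookkeeping: when an entity is mentioned,
# it and its one-hop KG neighbors join the local KG, so later tokens can copy
# rare facts (e.g. "Cambridge") the base LM would struggle to recall.
full_kg = {
    "Douglas_Adams": {"notable_work": "Hitchhikers_Guide", "born_in": "Cambridge"},
    "Hitchhikers_Guide": {"author": "Douglas_Adams"},
}

def mention_entity(local_kg: set, entity: str) -> set:
    """Grow the local KG with a newly mentioned entity and its neighbors."""
    local_kg.add(entity)
    for related in full_kg.get(entity, {}).values():
        local_kg.add(related)
    return local_kg

local = set()
mention_entity(local, "Douglas_Adams")
# local now holds the mentioned entity plus its one-hop neighbors
```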
- uses spaCy to parse and annotate tokens with metadata
- parse trees => graph => heuristics to map from phrases to concepts
- uses sense2vec to find neighborhoods for surface forms (acronyms, synonyms, etc.)
- UMAP, etc. => hinting toward: "descriptive but not computable"
- UX: active learning vs. annotations of wrong examples using
- "spend more effort per example" => coining term active teaching
- rethinking beyond the "optimality trap"
- "maybe familiarity is a liability in data analytics?" => doubt can be an advantage
- how to prompt LLMs with KGs
- "build a prompting pipeline that endows LLMs with the capability of comprehending KG inputs and inferring with a combined implicit knowledge and the retrieved external knowledge"
- in contrast, the prompt engineering paradigm: "pre-train, prompt, and predict"
- "goal of this work is to build a plug-and-play prompting approach to elicit the graph-of-thoughts reasoning capability in LLMs"
1. consolidates the retrieved facts from KGs and the implicit knowledge from LLMs
2. discovers new patterns in input KGs
3. reasons over the mind map to yield final outputs
- build multiple evidence sub-graphs which get aggregated into reasoning graphs, then prompt LLMs and build a mind map to explain the reasoning process
- conjecture that LLMs can comprehend and extract knowledge from a reasoning graph that is described by natural language
- prompting GPT-3.5 with MindMap yields an overwhelming performance over GPT-4 consistently
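The conjecture above (that LLMs can extract knowledge from a reasoning graph "described by natural language") implies a verbalization step. A minimal sketch, with invented medical triples standing in for a retrieved evidence sub-graph:

```python
# Hedged sketch: flatten an evidence sub-graph into plain sentences that can
# be placed in the LLM prompt as retrieved external knowledge.
triples = [
    ("Fever", "is_symptom_of", "Influenza"),
    ("Influenza", "treated_with", "Oseltamivir"),
]

def verbalize(triples) -> str:
    """Describe (subject, predicate, object) triples as natural-language text."""
    sentences = [
        f"{s.replace('_', ' ')} {p.replace('_', ' ')} {o.replace('_', ' ')}."
        for s, p, o in triples
    ]
    return " ".join(sentences)

prompt_context = verbalize(triples)
# -> "Fever is symptom of Influenza. Influenza treated with Oseltamivir."
```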
"Deep NLP on SF Literature" Krishna Tripathi GitHub (2024-01-25)
- processes texts using customized methods, NLTK, and spaCy
- performs domain-specific named entity recognition in multiple stages
- fine-tunes a RoBERTa model using GPT to generate annotated data
- implements multicore LDA for efficient topic modeling and theme extraction
- modularized code makes this work highly reusable for other domain-specific literature tasks: it can be easily refitted for legal datasets, a corpus of classics, etc.
- goes the additional step of using these results to rework training data and train models
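The staged, domain-specific NER described above can be sketched as a pipeline of passes. The stages and gazetteer here are invented for illustration; in the repo, the second stage would be the fine-tuned RoBERTa model rather than a stub.

```python
# Sketch of a multi-stage NER pass: stage 1 tags known domain terms from a
# gazetteer; stage 2 (the model pass) would resolve the remaining tokens.
GAZETTEER = {"ansible": "DEVICE", "terraforming": "PROCESS"}  # invented terms

def stage1_gazetteer(tokens):
    """Rule-based pass: tag tokens found in the domain gazetteer."""
    return [(tok, GAZETTEER.get(tok.lower(), "O")) for tok in tokens]

def stage2_model(tagged):
    """Placeholder for the fine-tuned RoBERTa pass; keeps stage-1 tags."""
    return tagged

entities = stage2_model(stage1_gazetteer("The ansible enabled terraforming".split()))
```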
"How to Convert Any Text Into a Graph of Concepts"
Rahul Nayak, Towards Data Science (2023-11-09)
- "a method to convert any text corpus into a graph of concepts" (aka KG)
- use KGs to implement RAG and "chat with our documents"
- Q: is this work solid enough to cite in an academic paper?
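The article's text-to-graph-for-RAG idea can be sketched under simplifying assumptions: concepts that co-occur in a sentence get an edge, and retrieval pulls a query concept's neighborhood back as chat context. Concept extraction here is bare keyword matching; the article delegates that step to an LLM.

```python
from collections import defaultdict
from itertools import combinations

# Invented concept vocabulary for illustration.
CONCEPTS = {"graph", "retrieval", "embedding", "prompt"}

def build_graph(sentences):
    """Link concepts that co-occur within a sentence."""
    graph = defaultdict(set)
    for sent in sentences:
        found = {w for w in sent.lower().split() if w in CONCEPTS}
        for a, b in combinations(sorted(found), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

docs = [
    "Graph retrieval beats plain embedding lookup.",
    "A prompt can cite the retrieval results.",
]
g = build_graph(docs)
neighborhood = g["retrieval"]  # context to stuff into the chat prompt
```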
"Extracting Relation from Sentence using LLM" Muhammad Nizami Medium (2023-11-15)
"Text-to-Graph via LLM: pre-training, prompting, or tuning?" Peter Lawrence Medium (2024-01-16)