Probabilistic Graph Features

TODO: summarize from

Results from the combined analysis are collected into an intermediate form: a probabilistic structure called a lemma graph.

Note: NLP parsers tend to produce a wealth of annotations from raw text, most of which are thrown away in many applications. What if this parse information were instead collected together temporarily, while analyzing a chunk of text?

An application running in production most likely would not want to persist the entire lemma graph generated while analyzing a full corpus. Instead, consider this structure as a kind of temporary cache used during the analysis of one unit of work, i.e., a "chunk" of text.

From the pragmatics of writing, editing, and critical review, a natural size for this kind of chunking is the paragraph. In some domains, such as analysis of patent applications, chunking at the level of "claims" might be indicated.
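
The "temporary cache per chunk" idea can be sketched as follows. This is a minimal illustration, not the actual implementation: the function names are hypothetical, lowercased tokens stand in for real lemmas (which would normally come from an NLP parser), and the "graph" is just a pair of counters. The point is only that a graph is built for one paragraph, summarized, and then discarded.

```python
import re
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    """Split raw text into paragraph-sized chunks on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def build_chunk_graph(chunk: str) -> dict:
    """Build a throwaway graph for one chunk (hypothetical helper).

    Nodes are lowercased tokens, a crude stand-in for lemmas;
    edges count how often two tokens appear next to each other.
    """
    tokens = re.findall(r"[a-z]+", chunk.lower())
    nodes = Counter(tokens)
    edges = Counter(zip(tokens, tokens[1:]))
    return {"nodes": nodes, "edges": edges}


def analyze(text: str) -> list[int]:
    """Process each chunk in turn, keeping only a small summary."""
    summaries = []
    for chunk in split_paragraphs(text):
        graph = build_chunk_graph(chunk)  # temporary cache for this chunk
        summaries.append(len(graph["nodes"]))
        # the graph is not persisted; it is replaced on the next iteration
    return summaries
```

Only the per-chunk summary survives the loop, so memory use stays proportional to the largest paragraph rather than to the full corpus.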

The probabilistic aspects of the intermediate lemma graph data become especially important in a linguistic context:

  • entities have many surface forms
  • synonyms (synsets) change meanings in different domains, especially when abbreviated
  • ambiguous references may exist, though not all are important enough to resolve; attempting to resolve every one would be "premature optimization"
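
One way to picture the points above is an index from surface forms to candidate entities, with a probability attached to each candidate. The sketch below is an assumption for illustration only: the class `SurfaceFormIndex`, its methods, and the threshold value are hypothetical, and probabilities are simple normalized counts. References that fall below the threshold are left unresolved rather than forced.

```python
from collections import defaultdict
from typing import Optional


class SurfaceFormIndex:
    """Hypothetical index mapping surface forms to candidate entities."""

    def __init__(self) -> None:
        # surface form (lowercased) -> entity -> accumulated weight
        self._counts = defaultdict(lambda: defaultdict(float))

    def observe(self, surface: str, entity: str, weight: float = 1.0) -> None:
        """Record evidence that `surface` refers to `entity`."""
        self._counts[surface.lower()][entity] += weight

    def candidates(self, surface: str) -> dict:
        """Return candidate entities with normalized probabilities."""
        counts = self._counts.get(surface.lower(), {})
        total = sum(counts.values())
        if not total:
            return {}
        return {entity: c / total for entity, c in counts.items()}

    def resolve(self, surface: str, threshold: float = 0.8) -> Optional[str]:
        """Return the top entity only if it is confident enough.

        Ambiguous forms return None and stay unresolved.
        """
        cands = self.candidates(surface)
        if not cands:
            return None
        best = max(cands, key=cands.get)
        return best if cands[best] >= threshold else None


index = SurfaceFormIndex()
index.observe("NYC", "New_York_City", 9.0)
index.observe("nyc", "New_York_City")
# an abbreviation whose meaning depends on the domain
index.observe("ML", "machine_learning", 3.0)
index.observe("ML", "milliliter", 2.0)
```

Here `index.resolve("NYC")` returns an entity, while `index.resolve("ML")` returns `None` because neither candidate reaches the threshold; the ambiguity is simply carried forward until it matters.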

Note that semantic modeling practices using RDF tend to have a relatively trivial notion of "synonyms", notably by annotating a subject with one preferred label and zero or more additional labels. This may be sufficiently descriptive for building taxonomies manually; however, this approach is not sufficient for making the modeled representation computable in light of the many kinds of surface forms and possible sources of ambiguity. The RDF representation uses skos:broader to connect surface forms, and the LPG representation uses probabilities to manage disambiguation of these terms.
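
The contrast between label annotations and probabilistic edges can be shown with plain data structures. This is a sketch under stated assumptions: the `ex:` identifiers, the `REFERS_TO` relation name, the helper functions, and all probability values are invented for illustration, and the triples are written as plain tuples rather than with an RDF library. Only `skos:prefLabel` and `skos:altLabel` are actual SKOS properties.

```python
# RDF-style labels: one preferred label plus additional labels, with no weights
triples = [
    ("ex:NewYorkCity", "skos:prefLabel", "New York City"),
    ("ex:NewYorkCity", "skos:altLabel", "NYC"),
    ("ex:NewYorkCity", "skos:altLabel", "Big Apple"),
]

# LPG-style edges: each link carries a property map, here a probability
# (values are illustrative, not measured)
lpg_edges = [
    {"src": "NYC", "dst": "ex:NewYorkCity", "rel": "REFERS_TO", "prob": 0.95},
    {"src": "Big Apple", "dst": "ex:NewYorkCity", "rel": "REFERS_TO", "prob": 0.70},
    {"src": "Big Apple", "dst": "ex:AppleVariety", "rel": "REFERS_TO", "prob": 0.30},
]


def labels_for(subject: str) -> list:
    """All labels attached to a subject; every label looks equally valid."""
    return [obj for subj, pred, obj in triples
            if subj == subject and pred.endswith("Label")]


def rank_referents(surface: str) -> list:
    """Candidate entities for a surface form, most probable first."""
    matches = [(e["dst"], e["prob"]) for e in lpg_edges if e["src"] == surface]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)
```

The label list says that "Big Apple" is a name for the city but cannot express that it might also mean something else; the weighted edges keep both readings and rank them.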