Note
To run this notebook in JupyterLab, load examples/ex7_1.ipynb
Statistical Relational Learning with pslpython
¶
As we've seen there are several ways to work with graph-based data, including: SPARQL queries, graph algorithms traversals, ML embedding, etc. Each of these methods makes trade-offs in terms of:
- computational costs as the graph size scales
- robustness when there is uncertainty or conflicting information in the graph
- formalism (i.e., analytic solutions) vs. empirical approaches (i.e., data-driven, machine learning)
One way to visualize some of these trade-offs is in the following diagram:
Note in the top/right corner of the diagram that a relatively formal category of graph-based approaches is called statistical relational learning. The gist is that so much of the network analysis that we want to perform can be describe mathematically as markov networks, in terms of probabilistic models. Sometimes these can be quite computationally expensive; for example, hedge funds on Wall Street tend to burn lots of cloud computing on markov models. They are robust in terms of being able to work well even with lots of missing or conflicting data, and the formalism implies that we can infer mathematical guarantees from the analyis. That's quite the opposite of deep learning models, which are great at predicting sequences of things, but terrible at providing guarantees.
Clearly, there's been much emphasis in industry recently that equates "artificial intelligence" with "deep learning", although we are also recognizing diminishing returns for methods that rely purely on ever-larger data rates and ever-larger ML models. One path forward will be to combine machine learning with use of structured knowledge (i.e., KGs) such that we can avoid "boiling the oceans" with purely data-driven approaches when in so many use cases we can leverage domain expertise.
In this secton we'll consider one form of statistical relational learning called probabilistic soft logic (PSL) which is essentially a kind of "fuzzy logic" for graphs that has interesting computational qualities. Whereas many kinds of formal graph analysis (e.g., "traveling salesman problem") are provably hard and quite expensive in practice, PSL can be solved with a convex optimization (e.g., like so many machine learning algorithms).
Consider this: we can describe "rules" about nodes and relations in a KG, then assign probabilities to specific instances of those rules that are found within our graph. If the probabilities are all zero then the system is consistent. As some of the assigned probabilities are increased, then some of the rules become inconsistent. How high (i.e., optimal) of a set of probabilities can we assign while still keeping the system consistent? Alternatively, if we apply a set of rules, then how "far away" (probabilistically speaking) is a graph from being logically consistent?
This comes in quite handy when we want to combine semantic technologies and machine learning, or rather when we have explicit rules plus lots of empirical data. Data quality is a persistent problem, so we can leverage PSL to identify which parts of the graph seem the least "logically consistent", and therefore need some review and curation.
RDF representation of the "simple acquaintances" example¶
One of the examples given for PSL is called simple acquaintances, which uses a graph of some friends, where they live, what interests they share, and then infers who probably knows whom. Some people explicitly do or do not know each other, while other "knows" relations can be inferred based on whether two people have lived in the same place or share common interest.
The objective is to build a PSL model for link prediction, to evaluate the annotations in the friend graph. In this case, we'll assume that the "knows" relations have been added from a questionable source (e.g., some third-party dataset) so we'll measure a subset of these relations and determine their likelihood. NB: this is really useful for cleaning up annotations in a large graph!
Now let's load a KG which is an RDF representation of this example, based on a simple extension of the foaf
vocabulary:
import kglab
namespaces = {
"acq": "http://example.org/stuff/",
"foaf": "http://xmlns.com/foaf/0.1/",
}
kg = kglab.KnowledgeGraph(
name = "LINQS simple acquaintance example for PSL",
base_uri = "http://example.org/stuff/",
namespaces = namespaces,
)
kg.load_rdf("../dat/acq.ttl")
<kglab.kglab.KnowledgeGraph at 0x109a4e290>
Take a look at the dat/acq.ttl
file to see the people and their relations.
Here's a quick visualization of the graph:
VIS_STYLE = {
"foaf": {
"color": "orange",
"size": 5,
},
"acq":{
"color": "blue",
"size": 30,
},
}
excludes = [
kg.get_ns("rdf").type,
kg.get_ns("rdfs").domain,
kg.get_ns("rdfs").range,
]
subgraph = kglab.SubgraphTensor(kg, excludes=excludes)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True, style=VIS_STYLE)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig04.html")
Also, let's serialize this in TTL/Turtle format:
kg.save_rdf("foo.ttl")
Take a look at the resulting foo.ttl
file to see how the relations are organized now.
Is that more readable?
Loading a PSL model¶
Next, we'll use the pslpython
library implemented in Python (atop Java core software) to define three predicates (i.e., relations – similar as in RDF) which are: Neighbors
, Likes
, Knows
from pslpython.model import Model
from pslpython.partition import Partition
from pslpython.predicate import Predicate
from pslpython.rule import Rule
model = Model("simple acquaintances")
Then add each of the predicates:
predicate = Predicate("Neighbors", closed=True, size=2)
model.add_predicate(predicate)
predicate = Predicate("Likes", closed=True, size=2)
model.add_predicate(predicate)
predicate = Predicate("Knows", closed=False, size=2)
model.add_predicate(predicate)
<pslpython.model.Model at 0x12992d810>
Next, we'll add a set of probabilistic rules, all with different weights applied:
- "Two people who live in the same place are more likely to know each other"
- "Two people who don't live in the same place are less likely to know each other"
- "Two people who share a common interest are more likely to know each other"
- "Two people who both know a third person are more likely to know each other"
- "Otherwise, any pair of people are less likely to know each other"
model.add_rule(Rule("20: Neighbors(P1, L) & Neighbors(P2, L) & (P1 != P2) -> Knows(P1, P2) ^2"))
model.add_rule(Rule("5: Neighbors(P1, L1) & Neighbors(P2, L2) & (P1 != P2) & (L1 != L2) -> !Knows(P1, P2) ^2"))
model.add_rule(Rule("10: Likes(P1, L) & Likes(P2, L) & (P1 != P2) -> Knows(P1, P2) ^2"))
model.add_rule(Rule("5: Knows(P1, P2) & Knows(P2, P3) & (P1 != P3) -> Knows(P1, P3) ^2"))
model.add_rule(Rule("5: !Knows(P1, P2) ^2"))
<pslpython.model.Model at 0x12992d810>
Finally we'll add a commutative rule such that "If Person 1 knows Person 2, then Person 2 also knows Person 1."
model.add_rule(Rule("Knows(P1, P2) = Knows(P2, P1) ."))
<pslpython.model.Model at 0x12992d810>
To initialize the model, we'll clear any pre-existing data from each of the predicates:
for predicate in model.get_predicates().values():
predicate.clear_data()
And we'll define a simple helper function, to format a unique URL within our acq
vocabulary (a simple extension of foaf
) based on the purely numeric identifiers used within PSL:
def get_person_id (url):
return url.replace("http://example.org/stuff/person_", "")
Let's query our KG to populate data into the Neighbors
predicate in the PSL model, based on foaf:based_near
that represents living near the same locations:
predicate = model.get_predicate("Neighbors")
sparql = """
SELECT DISTINCT ?p1 ?p2
WHERE {
?p1 foaf:based_near ?l .
?p2 foaf:based_near ?l .
}
"""
for row in kg.query(sparql):
p1 = get_person_id(row.p1)
p2 = get_person_id(row.p2)
if p1 != p2:
predicate.add_data_row(Partition.OBSERVATIONS, [p1, p2])
Note: these data points are observations, i.e., empirical support for the probabilistic model.
Then let's query our KG to populate data into the Likes
predicate in the PSL model, based on shared interests in foaf:topic_interest
topics:
predicate = model.get_predicate("Likes")
sparql = """
SELECT DISTINCT ?p1 ?p2
WHERE {
?p1 foaf:topic_interest ?t .
?p2 foaf:topic_interest ?t .
}
"""
for row in kg.query(sparql):
p1 = get_person_id(row.p1)
p2 = get_person_id(row.p2)
if p1 != p2:
predicate.add_data_row(Partition.OBSERVATIONS, [p1, p2])
Just for kicks, let's take a look at the internal representation of a PSL predicate, which is a pandas
DataFrame:
predicate = model.get_predicate("Likes")
predicate.__dict__
{'_types': [<ArgType.UNIQUE_STRING_ID: 'UniqueStringID'>,
<ArgType.UNIQUE_STRING_ID: 'UniqueStringID'>],
'_data': {<Partition.OBSERVATIONS: 'observations'>: 0 1 2
0 21 9 1.0
1 21 22 1.0
2 21 3 1.0
3 21 1 1.0
4 21 24 1.0
.. .. .. ...
567 5 18 1.0
568 5 15 1.0
569 5 13 1.0
570 15 5 1.0
571 22 11 1.0
[572 rows x 3 columns],
<Partition.TARGETS: 'targets'>: Empty DataFrame
Columns: [0, 1, 2]
Index: [],
<Partition.TRUTH: 'truth'>: Empty DataFrame
Columns: [0, 1, 2]
Index: []},
'_name': 'LIKES',
'_closed': True}
Now we'll load data from the dat/psl/knows_targets.txt
CSV file, which is a list of foaf:knows
relations in our graph that we want to analyze.
Each of these has an assumed value of 1.0
(true) or 0.0
(false).
Our PSL analysis will assign probabilities for each so that we can compare which annotations appear to be suspect and require further review:
import csv
import pandas as pd
import rdflib
targets = []
rows_list = []
predicate = model.get_predicate("Knows")
with open("../dat/psl/knows_targets.txt", "r") as f:
reader = csv.reader(f, delimiter="\t")
for i, row in enumerate(reader):
p1, p2 = row
targets.append((p1, p2))
p1_url = rdflib.URIRef("http://example.org/stuff/person_" + p1)
p2_url = rdflib.URIRef("http://example.org/stuff/person_" + p2)
if (p1_url, kg.get_ns("foaf").knows, p2_url) in kg.rdf_graph():
truth = 1.0
predicate.add_data_row(Partition.TRUTH, [p1, p2], truth_value=truth)
predicate.add_data_row(Partition.TARGETS, [p1, p2])
rows_list.append({ 0: p1, 1: p2, "truth": truth})
elif (p1_url, kg.get_ns("acq").wantsIntro, p2_url) in kg.rdf_graph():
truth = 0.0
predicate.add_data_row(Partition.TRUTH, [p1, p2], truth_value=truth)
predicate.add_data_row(Partition.TARGETS, [p1, p2])
rows_list.append({ 0: p1, 1: p2, "truth": truth})
else:
print("UNKNOWN", p1, p2)
df_dat = pd.DataFrame(rows_list)
These data points are considered to be ground atoms, each with a truth value set initially. These are also our targets for which nodes in the graph to analyze based on the rules.
Next, we'll add foaf:knows
observations which are in the graph, although not among our set of targets.
This provides more evidence for the probabilistic inference.
Note that since RDF does not allow for representing probabilities on relations, we're using the acq:wantsIntro
to represent a foaf:knows
with a 0.0
probability:
predicate = model.get_predicate("Knows")
sparql = """
SELECT ?p1 ?p2
WHERE {
?p1 foaf:knows ?p2 .
}
ORDER BY ?p1 ?p2
"""
for row in kg.query(sparql):
p1 = get_person_id(row.p1)
p2 = get_person_id(row.p2)
if (p1, p2) not in targets:
predicate.add_data_row(Partition.OBSERVATIONS, [p1, p2], truth_value=1.0)
sparql = """
SELECT ?p1 ?p2
WHERE {
?p1 acq:wantsIntro ?p2 .
}
ORDER BY ?p1 ?p2
"""
for row in kg.query(sparql):
p1 = get_person_id(row.p1)
p2 = get_person_id(row.p2)
if (p1, p2) not in targets:
predicate.add_data_row(Partition.OBSERVATIONS, [p1, p2], truth_value=0.0)
Now we're ready to optimize the PSL model – this may take a few minutes to run:
PSL_OPTIONS = {
"log4j.threshold": "INFO"
}
results = model.infer(additional_cli_optons=[], psl_config=PSL_OPTIONS)
6988 [pslpython.model PSL] INFO --- 0 [main] INFO org.linqs.psl.cli.Launcher - Running PSL CLI Version 2.2.2-5f9a472
7356 [pslpython.model PSL] INFO --- 369 [main] INFO org.linqs.psl.cli.Launcher - Loading data
7634 [pslpython.model PSL] INFO --- 647 [main] INFO org.linqs.psl.cli.Launcher - Data loading complete
7635 [pslpython.model PSL] INFO --- 648 [main] INFO org.linqs.psl.cli.Launcher - Loading model from /var/folders/zz/2ffrqd5j7n52x67qd94h_r_h0000gp/T/psl-python/simple acquaintances/simple acquaintances.psl
7802 [pslpython.model PSL] INFO --- 816 [main] INFO org.linqs.psl.cli.Launcher - Model loading complete
7804 [pslpython.model PSL] INFO --- 816 [main] INFO org.linqs.psl.cli.Launcher - Starting inference with class: org.linqs.psl.application.inference.MPEInference
7985 [pslpython.model PSL] INFO --- 998 [main] INFO org.linqs.psl.application.inference.MPEInference - Grounding out model.
14361 [pslpython.model PSL] INFO --- 7374 [main] INFO org.linqs.psl.application.inference.MPEInference - Grounding complete.
14842 [pslpython.model PSL] INFO --- 7855 [main] INFO org.linqs.psl.application.inference.InferenceApplication - Beginning inference.
35191 [pslpython.model PSL] WARNING --- 28204 [main] WARN org.linqs.psl.reasoner.admm.ADMMReasoner - No feasible solution found. 112 constraints violated.
35192 [pslpython.model PSL] INFO --- 28204 [main] INFO org.linqs.psl.reasoner.admm.ADMMReasoner - Optimization completed in 10796 iterations. Objective: 147035.2, Feasible: false, Primal res.: 0.16985348, Dual res.: 0.0083313715
35193 [pslpython.model PSL] INFO --- 28205 [main] INFO org.linqs.psl.application.inference.InferenceApplication - Inference complete.
35194 [pslpython.model PSL] INFO --- 28205 [main] INFO org.linqs.psl.application.inference.InferenceApplication - Writing results to Database.
35230 [pslpython.model PSL] INFO --- 28239 [main] INFO org.linqs.psl.application.inference.InferenceApplication - Results committed to database.
35231 [pslpython.model PSL] INFO --- 28239 [main] INFO org.linqs.psl.cli.Launcher - Inference Complete
Let's examine the results.
We'll get a pandas
DataFrame describing the targets in the Knows
predicate:
predicate = model.get_predicates()["KNOWS"]
df = results[predicate]
df.head()
0 | 1 | truth | |
---|---|---|---|
0 | 7 | 15 | 0.001825 |
1 | 7 | 17 | 0.004212 |
2 | 7 | 18 | 0.994970 |
3 | 7 | 12 | 0.988291 |
4 | 6 | 19 | 0.005619 |
Now we can compare the "truth" values from our targets, with their probabilities from the inference provided by the PSL model:
import pandas as pd
dat_val = {}
for index, row in df_dat.iterrows():
p1 = row[0]
p2 = row[1]
key = (int(p1), int(p2))
dat_val[key] = row["truth"]
for index, row in df.iterrows():
p1 = row[0]
p2 = row[1]
key = (int(p1), int(p2))
df.at[index, "diff"] = row["truth"] - dat_val[key]
pd.set_option("max_rows", None)
df
0 | 1 | truth | diff | |
---|---|---|---|---|
0 | 7 | 15 | 0.001825 | 0.001825 |
1 | 7 | 17 | 0.004212 | 0.004212 |
2 | 7 | 18 | 0.994970 | -0.005030 |
3 | 7 | 12 | 0.988291 | -0.011709 |
4 | 6 | 19 | 0.005619 | 0.005619 |
5 | 6 | 12 | 0.243235 | -0.756765 |
6 | 6 | 13 | 0.990194 | -0.009806 |
7 | 6 | 14 | 0.240264 | 0.240264 |
8 | 6 | 16 | 0.982446 | -0.017554 |
9 | 6 | 17 | 0.003854 | 0.003854 |
10 | 5 | 22 | 0.232084 | -0.767916 |
11 | 5 | 16 | 0.996406 | -0.003594 |
12 | 5 | 17 | 0.997194 | -0.002806 |
13 | 5 | 12 | 0.997899 | -0.002101 |
14 | 5 | 13 | 0.998490 | -0.001510 |
15 | 4 | 15 | 0.003186 | 0.003186 |
16 | 9 | 19 | 0.209264 | 0.209264 |
17 | 9 | 17 | 0.977174 | -0.022826 |
18 | 8 | 21 | 0.977074 | -0.022926 |
19 | 8 | 10 | 0.986552 | -0.013448 |
20 | 8 | 12 | 0.980424 | -0.019576 |
21 | 8 | 13 | 0.983486 | -0.016514 |
22 | 7 | 20 | 0.002991 | 0.002991 |
23 | 13 | 4 | 0.983961 | -0.016039 |
24 | 13 | 7 | 0.003287 | 0.003287 |
25 | 12 | 6 | 0.242892 | -0.757108 |
26 | 11 | 7 | 0.002605 | 0.002605 |
27 | 11 | 8 | 0.003733 | 0.003733 |
28 | 11 | 3 | 0.004305 | 0.004305 |
29 | 10 | 6 | 0.991534 | -0.008466 |
30 | 10 | 9 | 0.986129 | -0.013871 |
31 | 16 | 7 | 0.004930 | 0.004930 |
32 | 16 | 8 | 0.971347 | -0.028653 |
33 | 16 | 2 | 0.005648 | 0.005648 |
34 | 15 | 1 | 0.003491 | 0.003491 |
35 | 15 | 5 | 0.999349 | -0.000651 |
36 | 14 | 6 | 0.240554 | 0.240554 |
37 | 14 | 9 | 0.986702 | -0.013298 |
38 | 14 | 0 | 0.999034 | -0.000966 |
39 | 14 | 1 | 0.987152 | -0.012848 |
40 | 14 | 3 | 0.986234 | -0.013766 |
41 | 19 | 9 | 0.209216 | 0.209216 |
42 | 19 | 1 | 0.970383 | -0.029617 |
43 | 18 | 5 | 0.999129 | -0.000871 |
44 | 18 | 8 | 0.003016 | 0.003016 |
45 | 17 | 1 | 0.211154 | 0.211154 |
46 | 0 | 1 | 0.997292 | -0.002708 |
47 | 23 | 9 | 0.981103 | -0.018897 |
48 | 23 | 0 | 0.997935 | -0.002065 |
49 | 3 | 7 | 0.986071 | -0.013929 |
50 | 23 | 2 | 0.986615 | -0.013385 |
51 | 22 | 5 | 0.231897 | -0.768103 |
52 | 22 | 9 | 0.008285 | 0.008285 |
53 | 2 | 6 | 0.989766 | -0.010234 |
54 | 22 | 1 | 0.969754 | -0.030246 |
55 | 1 | 2 | 0.983649 | -0.016351 |
56 | 21 | 1 | 0.977122 | -0.022878 |
57 | 0 | 7 | 0.998759 | -0.001241 |
58 | 20 | 1 | 0.984048 | -0.015952 |
59 | 20 | 6 | 0.003489 | 0.003489 |
60 | 7 | 1 | 0.986649 | -0.013351 |
61 | 6 | 7 | 0.991634 | -0.008366 |
62 | 6 | 9 | 0.235502 | 0.235502 |
63 | 5 | 7 | 0.245458 | -0.754542 |
64 | 5 | 9 | 0.256432 | -0.743568 |
65 | 5 | 2 | 0.998474 | -0.001526 |
66 | 4 | 0 | 0.997036 | -0.002964 |
67 | 10 | 15 | 0.995006 | -0.004994 |
68 | 9 | 1 | 0.977481 | -0.022519 |
69 | 9 | 5 | 0.256282 | -0.743718 |
70 | 9 | 6 | 0.235542 | 0.235542 |
71 | 9 | 7 | 0.986425 | -0.013575 |
72 | 8 | 7 | 0.986281 | -0.013719 |
73 | 8 | 3 | 0.007074 | 0.007074 |
74 | 7 | 5 | 0.245358 | -0.754642 |
75 | 12 | 24 | 0.991920 | -0.008080 |
76 | 13 | 10 | 0.003481 | 0.003481 |
77 | 13 | 11 | 0.990418 | -0.009582 |
78 | 12 | 21 | 0.215407 | -0.784593 |
79 | 12 | 17 | 0.005308 | 0.005308 |
80 | 11 | 22 | 0.982138 | -0.017862 |
81 | 12 | 10 | 0.988092 | -0.011908 |
82 | 11 | 19 | 0.981664 | -0.018336 |
83 | 16 | 23 | 0.006045 | 0.006045 |
84 | 16 | 15 | 0.988322 | -0.011678 |
85 | 16 | 12 | 0.006512 | 0.006512 |
86 | 14 | 23 | 0.989004 | -0.010996 |
87 | 15 | 11 | 0.995351 | -0.004649 |
88 | 14 | 17 | 0.004377 | 0.004377 |
89 | 14 | 11 | 0.992662 | -0.007338 |
90 | 14 | 13 | 0.991027 | -0.008973 |
91 | 13 | 22 | 0.977845 | -0.022155 |
92 | 19 | 16 | 0.009076 | 0.009076 |
93 | 18 | 15 | 0.000952 | 0.000952 |
94 | 18 | 17 | 0.230089 | 0.230089 |
95 | 17 | 21 | 0.204119 | 0.204119 |
96 | 17 | 22 | 0.970583 | -0.029417 |
97 | 18 | 12 | 0.992946 | -0.007054 |
98 | 17 | 18 | 0.230132 | 0.230132 |
99 | 0 | 18 | 0.999131 | -0.000869 |
100 | 20 | 12 | 0.985861 | -0.014139 |
101 | 0 | 15 | 0.999314 | -0.000686 |
102 | 3 | 21 | 0.977511 | -0.022489 |
103 | 23 | 18 | 0.002387 | 0.002387 |
104 | 3 | 10 | 0.986157 | -0.013843 |
105 | 22 | 21 | 0.209190 | -0.790810 |
106 | 3 | 12 | 0.980083 | -0.019917 |
107 | 22 | 24 | 0.987341 | -0.012659 |
108 | 21 | 24 | 0.003140 | 0.003140 |
109 | 2 | 11 | 0.003025 | 0.003025 |
110 | 21 | 22 | 0.209066 | -0.790934 |
111 | 21 | 14 | 0.003638 | 0.003638 |
112 | 21 | 17 | 0.203993 | 0.203993 |
113 | 1 | 11 | 0.003799 | 0.003799 |
114 | 1 | 13 | 0.983726 | -0.016274 |
115 | 1 | 17 | 0.210739 | 0.210739 |
116 | 21 | 12 | 0.215540 | -0.784460 |
117 | 0 | 22 | 0.001780 | 0.001780 |
In other words, which of these "knows" relations in the graph appears to be suspect, based on our rules plus the other evidence in the graph?
Let's visualize a histogram of how the inferred probabilities are distributed:
df["diff"].hist();
In most cases there is little or no difference (0.0 <= d <= 0.2
) in the probabilities for the target relations.
However, some appear to be off by a substantial (-0.8
) amount, which indicates problems in that part of our graph data.
The following rows show where these foaf:knows
annotations in the graph differs significantly from their truth values predicted by PSL:
df[df["diff"] < -0.2]
0 | 1 | truth | diff | |
---|---|---|---|---|
5 | 6 | 12 | 0.243235 | -0.756765 |
10 | 5 | 22 | 0.232084 | -0.767916 |
25 | 12 | 6 | 0.242892 | -0.757108 |
51 | 22 | 5 | 0.231897 | -0.768103 |
63 | 5 | 7 | 0.245458 | -0.754542 |
64 | 5 | 9 | 0.256432 | -0.743568 |
69 | 9 | 5 | 0.256282 | -0.743718 |
74 | 7 | 5 | 0.245358 | -0.754642 |
78 | 12 | 21 | 0.215407 | -0.784593 |
105 | 22 | 21 | 0.209190 | -0.790810 |
110 | 21 | 22 | 0.209066 | -0.790934 |
116 | 21 | 12 | 0.215540 | -0.784460 |
Speaking of human-in-the-loop practices for AI, using PSL along with a KG seems like a great way to leverage machine learning, so that the people can focus on parts of the graph that have the most uncertainty. And, therefore, probably provide the best ROI for investing time+cost into curation.
Exercises¶
Exercise 1:
Build a PSL model that tests the "noodle vs. pancake" rules used in an earlier example with our recipe KG. Which recipes should be annotated differently?
Exercise 2:
Try representing one of the other PSL examples using RDF and kglab
.