Note
To run this notebook in JupyterLab, load examples/ex5_0.ipynb
SHACL validation with pySHACL
Let's explore use of the W3C Shapes Constraint Language (SHACL) based on the pySHACL library.
When we build KGs, it can be helpful to think of the different semantic technologies in terms of layers:
- SKOS - thesauri and classification
- SHACL - requirements
- OWL - concepts
- RDF - represent nodes, predicates, literals
For an excellent overview and demos of SHACL, see shacl-masterclass by Veronika Heimsbakk.
Another great online resource for working with SHACL is the SHACL Playground.
With SHACL we can validate as well as run some forms of inference to complement what's provided by RDF, OWL, and so on. For a good overview, see the discussion of SHACL and other rule-based approaches in "Rules for Knowledge Graphs Rules" by Dan McCreary.
First, we'll show one of the examples from pySHACL, starting with its SHACL shapes graph in Turtle format:
shapes_graph = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:givenName ;
        sh:datatype xsd:string ;
        sh:name "given name" ;
    ] ;
    sh:property [
        sh:path schema:birthDate ;
        sh:lessThan schema:deathDate ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path schema:gender ;
        sh:in ( "female" "male" "nonbinary" "self-descr" ) ;
    ] ;
    sh:property [
        sh:path schema:address ;
        sh:node schema:AddressShape ;
    ] .

schema:AddressShape
    a sh:NodeShape ;
    sh:closed true ;
    sh:property [
        sh:path schema:streetAddress ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path schema:postalCode ;
        sh:datatype xsd:integer ;
        sh:minInclusive 10000 ;
        sh:maxInclusive 99999 ;
    ] .
"""
Then define a simple data graph to test against, given in JSON-LD format:
data_graph = """
{
    "@context": { "@vocab": "http://schema.org/" },
    "@id": "http://example.org/ns#Bob",
    "@type": "Person",
    "givenName": "Robert",
    "familyName": "Junior",
    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
"""
Now let's run pySHACL directly, to test whether this data graph conforms to its shapes graph, then print out the results:
import pyshacl

results = pyshacl.validate(
    data_graph,
    shacl_graph=shapes_graph,
    data_graph_format="json-ld",
    shacl_graph_format="ttl",
    inference="rdfs",
    debug=True,
    serialize_report_graph="ttl",
)
conforms, report_graph, report_text = results
print("conforms", conforms)
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
Focus Node: <http://example.org/ns#Bob>
Value Node: Literal("1971-07-07")
Result Path: schema:birthDate
Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")
Constraint Violation in MinInclusiveConstraintComponent (http://www.w3.org/ns/shacl#MinInclusiveConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:integer ; sh:maxInclusive Literal("99999", datatype=xsd:integer) ; sh:minInclusive Literal("10000", datatype=xsd:integer) ; sh:path schema:postalCode ]
Focus Node: <http://example.org/ns#BobsAddress>
Value Node: Literal("9404", datatype=xsd:integer)
Result Path: schema:postalCode
Message: Value is not >= Literal("10000", datatype=xsd:integer)
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
Focus Node: <http://example.org/ns#Bob>
Value Node: <http://example.org/ns#BobsAddress>
Result Path: schema:address
Message: Value does not conform to Shape schema:AddressShape
conforms False
The conforms flag should return False since the given data graph violates some of its shape constraints.
Let's look at the report_text output for human-readable analysis:
print(report_text)
Validation Report
Conforms: False
Results (2):
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
Focus Node: <http://example.org/ns#Bob>
Value Node: Literal("1971-07-07")
Result Path: schema:birthDate
Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
Focus Node: <http://example.org/ns#Bob>
Value Node: <http://example.org/ns#BobsAddress>
Result Path: schema:address
Message: Value does not conform to Shape schema:AddressShape
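The birthDate value in the data graph causes a LessThanConstraintComponent violation based on the PersonShape constraint rules, since the birth date is later than the death date.
The postalCode value of 9404 falls below the sh:minInclusive bound of 10000 in AddressShape, so the address node fails that shape. The report surfaces this as a NodeConstraintComponent violation on the schema:address property of Bob, which is why the debug log lists three violations while the report counts two results.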
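As a rough cross-check, the two failing comparisons can be reproduced in plain Python. This is only an analogue of the constraint logic, not how pySHACL evaluates shapes. The variable names are illustrative, and the dates are parsed with `datetime.date` for readability even though the data graph stores them as plain literals.

```python
from datetime import date

# Values copied from the JSON-LD data graph in this example.
birth_date = date.fromisoformat("1971-07-07")
death_date = date.fromisoformat("1968-09-10")
postal_code = 9404

# sh:lessThan schema:deathDate requires birthDate < deathDate.
birth_before_death = birth_date < death_date

# sh:minInclusive 10000 and sh:maxInclusive 99999 bound the postal code.
postal_code_in_range = 10000 <= postal_code <= 99999

print("birthDate < deathDate:", birth_before_death)
print("postalCode within bounds:", postal_code_in_range)
```

Both checks evaluate to False, matching the two constraints that the validation report flags.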
The serialize_report_graph parameter for pyshacl.validate() requested that the report graph be serialized as an RDF graph in Turtle format, returned as a byte array.
Let's parse this as an rdflib.Graph object, and iterate through its RDF statements to view a machine-readable validation report:
import rdflib

report_g = rdflib.Graph()
report_g.parse(data=report_graph, format="ttl", encoding="utf-8")
nm = report_g.namespace_manager

# print each statement in sorted order, using prefixed names where available
for s, p, o in sorted(report_g):
    print(s.n3(nm), p.n3(nm), o.n3(nm))
_:n6099ef7758484ffe8efa7612be89be9db1 rdf:type sh:ValidationReport
_:n6099ef7758484ffe8efa7612be89be9db1 sh:conforms "false"^^xsd:boolean
_:n6099ef7758484ffe8efa7612be89be9db1 sh:result _:n6099ef7758484ffe8efa7612be89be9db2
_:n6099ef7758484ffe8efa7612be89be9db1 sh:result _:n6099ef7758484ffe8efa7612be89be9db4
_:n6099ef7758484ffe8efa7612be89be9db2 rdf:type sh:ValidationResult
_:n6099ef7758484ffe8efa7612be89be9db2 sh:focusNode <http://example.org/ns#Bob>
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultMessage "Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal(\"1971-07-07\")"
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultPath schema:birthDate
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultSeverity sh:Violation
_:n6099ef7758484ffe8efa7612be89be9db2 sh:sourceConstraintComponent sh:LessThanConstraintComponent
_:n6099ef7758484ffe8efa7612be89be9db2 sh:sourceShape _:n6099ef7758484ffe8efa7612be89be9db3
_:n6099ef7758484ffe8efa7612be89be9db2 sh:value "1971-07-07"
_:n6099ef7758484ffe8efa7612be89be9db3 sh:lessThan schema:deathDate
_:n6099ef7758484ffe8efa7612be89be9db3 sh:maxCount "1"^^xsd:integer
_:n6099ef7758484ffe8efa7612be89be9db3 sh:path schema:birthDate
_:n6099ef7758484ffe8efa7612be89be9db4 rdf:type sh:ValidationResult
_:n6099ef7758484ffe8efa7612be89be9db4 sh:focusNode <http://example.org/ns#Bob>
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultMessage "Value does not conform to Shape schema:AddressShape"
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultPath schema:address
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultSeverity sh:Violation
_:n6099ef7758484ffe8efa7612be89be9db4 sh:sourceConstraintComponent sh:NodeConstraintComponent
_:n6099ef7758484ffe8efa7612be89be9db4 sh:sourceShape _:n6099ef7758484ffe8efa7612be89be9db5
_:n6099ef7758484ffe8efa7612be89be9db4 sh:value <http://example.org/ns#BobsAddress>
_:n6099ef7758484ffe8efa7612be89be9db5 sh:node schema:AddressShape
_:n6099ef7758484ffe8efa7612be89be9db5 sh:path schema:address
For some use cases, you may need to query this report graph, e.g., to identify data quality issues in the data graph.
Validating RDF graphs with kglab
Now let's try this again, using the kglab abstraction layer.
First we'll load our recipe graph developed in the earlier examples, which has been serialized to dat/recipes.ttl in Turtle format:
from os.path import dirname
import kglab
import os
namespaces = {
    "nom": "http://example.org/#",
    "wtm": "http://purl.org/heals/food/",
    "ind": "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}
kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
)
kg.load_rdf(dirname(os.getcwd()) + "/dat/recipes.ttl") ;
Next we define a SHACL shape graph to provide requirements for our recipes KG:
shape_graph = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nom: <http://example.org/#> .
@prefix wtm: <http://purl.org/heals/food/> .
@prefix ind: <http://purl.org/heals/ingredient/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

nom:RecipeShape
    a sh:NodeShape ;
    sh:targetClass wtm:Recipe ;
    sh:property [
        sh:path wtm:hasIngredient ;
        sh:node wtm:Ingredient ;
        sh:minCount 3 ;
    ] ;
    sh:property [
        sh:path skos:definition ;
        sh:datatype xsd:string ;
        sh:maxLength 50 ;
    ] .
"""
Now let's run the SHACL validation through the kglab integration for the pySHACL library.
Note that providing a shape graph through the shacl_graph parameter is optional.
Alternatively, RDF statements in the SHACL shape graph could have been included in our dat/recipes.ttl file.
conforms, report_graph, report_text = kg.validate(
    shacl_graph=shape_graph,
    shacl_graph_format="ttl"
)
Usage of abort_on_error is deprecated. Use abort_on_first instead.
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/261361>
Value Node: Literal("german dumplings spaetzle or kniffles for soup or saute")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/137158>
Value Node: Literal("pikkuleipienperustaikina finnish butter cookie dough")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/61108>
Value Node: Literal("german pancakes from the mennonite treasury of recipes")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/279314>
Value Node: Literal("choux pastry for profiteroles cream puffs or eclairs")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Let's print the report text, which should be approximately what was in the logged output:
print(report_text)
Validation Report
Conforms: False
Results (4):
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/261361>
Value Node: Literal("german dumplings spaetzle or kniffles for soup or saute")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/137158>
Value Node: Literal("pikkuleipienperustaikina finnish butter cookie dough")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/61108>
Value Node: Literal("german pancakes from the mennonite treasury of recipes")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
Severity: sh:Violation
Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
Focus Node: <https://www.food.com/recipe/279314>
Value Node: Literal("choux pastry for profiteroles cream puffs or eclairs")
Result Path: skos:definition
Message: String length not <= Literal("50", datatype=xsd:integer)
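All four results come from the sh:maxLength 50 bound on skos:definition. A minimal sketch in plain Python, assuming the four strings are copied verbatim from the report above, confirms that each one exceeds the limit. This mirrors the length check only; it is not how pySHACL evaluates the constraint.

```python
MAX_LENGTH = 50  # the sh:maxLength bound from the recipe shape graph

# Definitions copied from the validation report, keyed by recipe ID.
definitions = {
    "261361": "german dumplings spaetzle or kniffles for soup or saute",
    "137158": "pikkuleipienperustaikina finnish butter cookie dough",
    "61108": "german pancakes from the mennonite treasury of recipes",
    "279314": "choux pastry for profiteroles cream puffs or eclairs",
}

# Map each recipe ID to how many characters it is over the limit.
overflow = {
    recipe_id: len(text) - MAX_LENGTH
    for recipe_id, text in definitions.items()
    if len(text) > MAX_LENGTH
}

for recipe_id, extra in overflow.items():
    print(f"recipe {recipe_id}: {extra} characters over the limit")
```

A check like this can help decide whether to shorten the definitions or to relax the bound in the shape graph.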
As a wrapper on pySHACL, the kglab integration returns report_graph as another KnowledgeGraph object.
Let's run a SPARQL query on that to determine programmatically which elements of our recipe KG are violating the SHACL constraint rules:
import pandas as pd
sparql = """
SELECT ?id ?focus ?path ?value ?constraint ?message
WHERE {
    ?id rdf:type sh:ValidationResult .
    ?id sh:focusNode ?focus .
    ?id sh:resultPath ?path .
    ?id sh:value ?value .
    ?id sh:resultMessage ?message .
    ?id sh:sourceConstraintComponent ?constraint
}
"""
df = report_graph.query_as_df(sparql)
df
  | id | focus | path | value | constraint | message |
---|---|---|---|---|---|---|
0 | _:n3e261bc5685149ffab113fd5b6f0bebfb2 | <https://www.food.com/recipe/137158> | skos:definition | pikkuleipienperustaikina finnish butter cooki... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
1 | _:n3e261bc5685149ffab113fd5b6f0bebfb4 | <https://www.food.com/recipe/61108> | skos:definition | german pancakes from the mennonite treasury o... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
2 | _:n3e261bc5685149ffab113fd5b6f0bebfb5 | <https://www.food.com/recipe/279314> | skos:definition | choux pastry for profiteroles cream puffs or... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
3 | _:n3e261bc5685149ffab113fd5b6f0bebfb6 | <https://www.food.com/recipe/261361> | skos:definition | german dumplings spaetzle or kniffles for so... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
Let's visualize the report_graph as well, with the SHACL rule nodes highlighted in red, and the violations in orange:
VIS_STYLE = {
    "sh": {
        "color": "red",
        "size": 20,
    },
    "_": {
        "color": "orange",
        "size": 30,
    },
}
subgraph = kglab.SubgraphTensor(report_graph)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True, style=VIS_STYLE)
pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig05.html")
In practice, you may need to run inference (RDFS, OWL, SKOS, etc.) prior to running a SHACL validation, to add RDF statements to the data graph before applying the shape constraints.
We'll cover that topic later.
In summary, SHACL provides an excellent approach for:
- auditing work
- ensuring data quality when working with KGs
- inference, since the data graph elements which violate rules could in turn be annotated
- building applications for human-in-the-loop aka machine teaching
Exercises
Exercise 1:
Fix the errors in the first example by modifying its data graph, i.e., its ABox.
Can you get it to a state where the returned flag conforms is True?
Exercise 2:
Extend the SHACL shape graph for our recipe KG to validate that each recipe has a non-zero cooking time. How large must the maximum cooking time be set to avoid violations?