Note

To run this notebook in JupyterLab, load examples/ex5_0.ipynb

SHACL validation with pySHACL

Let's explore the W3C Shapes Constraint Language (SHACL) using the pySHACL library.

When we build KGs, it can be helpful to think of the different semantic technologies in terms of layers:

  • SKOS - thesauri and classification
  • SHACL - requirements
  • OWL - concepts
  • RDF - represent nodes, predicates, literals

For an excellent overview + demos of SHACL, see shacl-masterclass by Veronika Heimsbakk. Another great online resource for working with SHACL is the SHACL Playground.

With SHACL we can validate data as well as run some forms of inference, complementing what RDF, OWL, and so on provide. For a good overview, see the discussion of SHACL and other rule-based approaches in "Rules for Knowledge Graphs Rules" by Dan McCreary.

First, we'll show one of the examples from pySHACL, starting with its SHACL shapes graph in Turtle format:

shapes_graph = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:givenName ;
        sh:datatype xsd:string ;
        sh:name "given name" ;
    ] ;
    sh:property [
        sh:path schema:birthDate ;
        sh:lessThan schema:deathDate ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path schema:gender ;
        sh:in ( "female" "male" ) ;
    ] ;
    sh:property [
        sh:path schema:address ;
        sh:node schema:AddressShape ;
    ] .

schema:AddressShape
    a sh:NodeShape ;
    sh:closed true ;
    sh:property [
        sh:path schema:streetAddress ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path schema:postalCode ;
        sh:datatype xsd:integer ;
        sh:minInclusive 10000 ;
        sh:maxInclusive 99999 ;
    ] .
"""

Then define a simple data graph to test against, given in JSON-LD format:

data_graph = """
{
    "@context": { "@vocab": "http://schema.org/" },
    "@id": "http://example.org/ns#Bob",
    "@type": "Person",
    "givenName": "Robert",
    "familyName": "Junior",

    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
"""

Now let's run pySHACL directly, to test whether this data graph conforms to its shapes graph, then print out the results:

import pyshacl

results = pyshacl.validate(
    data_graph,
    shacl_graph=shapes_graph,
    data_graph_format="json-ld",
    shacl_graph_format="ttl",
    inference="rdfs",
    debug=True,
    serialize_report_graph="ttl",
    )

conforms, report_graph, report_text = results

print("conforms", conforms)
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: Literal("1971-07-07")
    Result Path: schema:birthDate
    Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")

Constraint Violation in MinInclusiveConstraintComponent (http://www.w3.org/ns/shacl#MinInclusiveConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:integer ; sh:maxInclusive Literal("99999", datatype=xsd:integer) ; sh:minInclusive Literal("10000", datatype=xsd:integer) ; sh:path schema:postalCode ]
    Focus Node: <http://example.org/ns#BobsAddress>
    Value Node: Literal("9404", datatype=xsd:integer)
    Result Path: schema:postalCode
    Message: Value is not >= Literal("10000", datatype=xsd:integer)

Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: <http://example.org/ns#BobsAddress>
    Result Path: schema:address
    Message: Value does not conform to Shape schema:AddressShape



conforms False

The conforms flag is False because the given data graph violates some of its shape constraints. Let's look at the report_text output for a human-readable analysis:

print(report_text)
Validation Report
Conforms: False
Results (2):
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: Literal("1971-07-07")
    Result Path: schema:birthDate
    Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: <http://example.org/ns#BobsAddress>
    Result Path: schema:address
    Message: Value does not conform to Shape schema:AddressShape

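The birthDate value in the data graph (1971-07-07) is later than the deathDate value (1968-09-10), which causes a LessThanConstraintComponent violation of the PersonShape constraints. The postalCode value (9404) falls below the sh:minInclusive bound of 10000, so the address node does not conform to AddressShape; the report surfaces this as a NodeConstraintComponent violation on Bob's schema:address. That appears to be why the debug log above shows a MinInclusiveConstraintComponent entry which the report does not count as a separate result.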
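
As a rough illustration of what these two constraints test, here is a plain-Python sketch that applies the same comparisons to the JSON-LD document directly. It is not how pySHACL evaluates shapes: the check_person helper, the bob_json name, and the message strings are hypothetical, and the sketch assumes ISO 8601 date strings, which sort as plain strings in chronological order.

```python
import json

# Same JSON-LD document as the data graph above.
bob_json = """
{
    "@context": { "@vocab": "http://schema.org/" },
    "@id": "http://example.org/ns#Bob",
    "@type": "Person",
    "givenName": "Robert",
    "familyName": "Junior",
    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
"""


def check_person(person):
    """Hypothetical helper: mimic two of the shape constraints by hand."""
    problems = []

    # sh:lessThan: birthDate must be strictly less than deathDate.
    # ISO 8601 date strings compare correctly as plain strings.
    if not person["birthDate"] < person["deathDate"]:
        problems.append("birthDate is not less than deathDate")

    # sh:minInclusive / sh:maxInclusive on the address's postalCode.
    postal_code = person["address"]["postalCode"]
    if not 10000 <= postal_code <= 99999:
        problems.append("postalCode is outside 10000..99999")

    return problems


problems = check_person(json.loads(bob_json))

for problem in problems:
    print(problem)
```

Both checks fail for this document, which matches the two underlying problems described above. A real SHACL engine works on the RDF graph and covers far more cases (datatypes, closed shapes, multiple values per path), so treat this only as a reading aid.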

The serialize_report_graph parameter for pyshacl.validate() requested that the report graph be serialized as an RDF graph in Turtle format, returned as a byte array. Let's parse this as an rdflib.Graph object, and iterate through its triples to see a machine-readable validation report:

import rdflib

report_g = rdflib.Graph()
report_g.parse(data=report_graph, format="ttl", encoding="utf-8")
nm = report_g.namespace_manager

for s, p, o in sorted(report_g):
    print(s.n3(nm), p.n3(nm), o.n3(nm))
_:ub2bL12C28 sh:node schema:AddressShape
_:ub2bL12C28 sh:path schema:address
_:ub2bL15C9 rdf:type sh:ValidationResult
_:ub2bL15C9 sh:focusNode <http://example.org/ns#Bob>
_:ub2bL15C9 sh:resultMessage "Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal(\"1971-07-07\")"
_:ub2bL15C9 sh:resultPath schema:birthDate
_:ub2bL15C9 sh:resultSeverity sh:Violation
_:ub2bL15C9 sh:sourceConstraintComponent sh:LessThanConstraintComponent
_:ub2bL15C9 sh:sourceShape _:ub2bL21C28
_:ub2bL15C9 sh:value "1971-07-07"
_:ub2bL21C28 sh:lessThan schema:deathDate
_:ub2bL21C28 sh:maxCount "1"^^xsd:integer
_:ub2bL21C28 sh:path schema:birthDate
_:ub2bL4C1 rdf:type sh:ValidationReport
_:ub2bL4C1 sh:conforms "false"^^xsd:boolean
_:ub2bL4C1 sh:result _:ub2bL15C9
_:ub2bL4C1 sh:result _:ub2bL6C15
_:ub2bL6C15 rdf:type sh:ValidationResult
_:ub2bL6C15 sh:focusNode <http://example.org/ns#Bob>
_:ub2bL6C15 sh:resultMessage "Value does not conform to Shape schema:AddressShape"
_:ub2bL6C15 sh:resultPath schema:address
_:ub2bL6C15 sh:resultSeverity sh:Violation
_:ub2bL6C15 sh:sourceConstraintComponent sh:NodeConstraintComponent
_:ub2bL6C15 sh:sourceShape _:ub2bL12C28
_:ub2bL6C15 sh:value <http://example.org/ns#BobsAddress>

For some use cases, you may need to query this report graph, e.g., to identify data quality issues in the data graph.
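
One lightweight way to do that, short of writing a SPARQL query, is to regroup the triples into one record per sh:ValidationResult. The sketch below assumes the triples are available as (subject, predicate, object) tuples whose str() form is the full IRI or literal value. The results_by_node helper is hypothetical, and the sample is a hand-copied subset of the triples printed above with prefixes expanded.

```python
from collections import defaultdict

SH = "http://www.w3.org/ns/shacl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def results_by_node(triples):
    """Hypothetical helper: collect the properties of each sh:ValidationResult."""
    props = defaultdict(dict)
    for s, p, o in triples:
        # Keeps one value per predicate, which is enough for result nodes.
        props[str(s)][str(p)] = str(o)
    return {
        node: po
        for node, po in props.items()
        if po.get(RDF_TYPE) == SH + "ValidationResult"
    }


# Hand-copied subset of the report triples shown above.
sample = [
    ("_:ub2bL4C1", RDF_TYPE, SH + "ValidationReport"),
    ("_:ub2bL4C1", SH + "conforms", "false"),
    ("_:ub2bL15C9", RDF_TYPE, SH + "ValidationResult"),
    ("_:ub2bL15C9", SH + "focusNode", "http://example.org/ns#Bob"),
    ("_:ub2bL15C9", SH + "resultPath", "http://schema.org/birthDate"),
    ("_:ub2bL15C9", SH + "sourceConstraintComponent",
     SH + "LessThanConstraintComponent"),
    ("_:ub2bL6C15", RDF_TYPE, SH + "ValidationResult"),
    ("_:ub2bL6C15", SH + "focusNode", "http://example.org/ns#Bob"),
    ("_:ub2bL6C15", SH + "resultPath", "http://schema.org/address"),
    ("_:ub2bL6C15", SH + "sourceConstraintComponent",
     SH + "NodeConstraintComponent"),
]

results = results_by_node(sample)

for node, po in sorted(results.items()):
    print(node, po[SH + "focusNode"], po[SH + "resultPath"])
```

Each record then holds the focus node, path, and constraint component for one violation, which is a convenient shape for feeding a data quality dashboard or a review queue.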


Validating RDF graphs with kglab

Now let's try this again, using the kglab abstraction layer. First we'll load our recipe graph developed in the earlier examples, which has been serialized to dat/recipes.ttl in Turtle format:

import kglab

namespaces = {
    "nom":  "http://example.org/#",
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
    )

kg.load_rdf("../dat/recipes.ttl")

Next we define a SHACL shape graph to provide requirements for our recipes KG:

shape_graph = """
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix nom:  <http://example.org/#> .
@prefix wtm:  <http://purl.org/heals/food/> .
@prefix ind:  <http://purl.org/heals/ingredient/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

nom:RecipeShape
    a sh:NodeShape ;
    sh:targetClass wtm:Recipe ;
    sh:property [
        sh:path wtm:hasIngredient ;
        sh:node wtm:Ingredient ;
        sh:minCount 3 ;
    ] ;
    sh:property [
        sh:path skos:definition ;
        sh:datatype xsd:string ;
        sh:maxLength 50 ;
    ] .
"""

Now let's run the SHACL validation through the kglab integration for the pySHACL library. Note that providing a shape graph through the shacl_graph parameter is optional; alternatively the SHACL shape graph triples could have been included in our dat/recipes.ttl file.

conforms, report_graph, report_text = kg.validate(
    shacl_graph=shape_graph,
    shacl_graph_format="ttl"
)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/137158>
    Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/261361>
    Value Node: Literal("german dumplings  spaetzle or kniffles  for soup or saute")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/279314>
    Value Node: Literal("choux pastry  for profiteroles  cream puffs or eclairs")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/61108>
    Value Node: Literal("german pancakes  from the mennonite treasury of recipes")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Let's print the report text, which should be approximately what was in the logged output:

print(report_text)
Validation Report
Conforms: False
Results (4):
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/137158>
    Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/261361>
    Value Node: Literal("german dumplings  spaetzle or kniffles  for soup or saute")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/279314>
    Value Node: Literal("choux pastry  for profiteroles  cream puffs or eclairs")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/61108>
    Value Node: Literal("german pancakes  from the mennonite treasury of recipes")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

As a wrapper on pySHACL, the kglab integration returns report_graph as another KnowledgeGraph object. Let's run a SPARQL query on that to determine programmatically which elements of our recipe KG are violating the SHACL constraint rules:

import pandas as pd
pd.set_option("display.max_rows", None)

sparql = """
SELECT ?id ?focus ?path ?value ?constraint ?message
  WHERE {
    ?id rdf:type sh:ValidationResult .
    ?id sh:focusNode ?focus .
    ?id sh:resultPath ?path .
    ?id sh:value ?value .
    ?id sh:resultMessage ?message .
    ?id sh:sourceConstraintComponent ?constraint
  }
"""

df = report_graph.query_as_df(sparql)
df
   id           focus                                 path             value                                             constraint                       message
0  _:ub5bL6C15  <https://www.food.com/recipe/279314>  skos:definition  choux pastry for profiteroles cream puffs or...   sh:MaxLengthConstraintComponent  String length not <= Literal("50", datatype=xs...
1  _:ub5bL14C9  <https://www.food.com/recipe/261361>  skos:definition  german dumplings spaetzle or kniffles for so...   sh:MaxLengthConstraintComponent  String length not <= Literal("50", datatype=xs...
2  _:ub5bL22C9  <https://www.food.com/recipe/137158>  skos:definition  pikkuleipienperustaikina finnish butter cooki...  sh:MaxLengthConstraintComponent  String length not <= Literal("50", datatype=xs...
3  _:ub5bL30C9  <https://www.food.com/recipe/61108>   skos:definition  german pancakes from the mennonite treasury o...  sh:MaxLengthConstraintComponent  String length not <= Literal("50", datatype=xs...
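
To see why these four definitions fail, recall that sh:maxLength bounds the length of a value's string form. A minimal plain-Python sketch of that test follows. The four long strings are copied from the validation report above, while the "example-short" entry is made up to show a passing case, and the names MAX_LENGTH, definitions, and too_long are illustrative.

```python
MAX_LENGTH = 50  # the sh:maxLength bound used in RecipeShape

# Recipe id -> skos:definition value, copied from the validation report.
definitions = {
    "137158": "pikkuleipienperustaikina  finnish butter cookie dough",
    "261361": "german dumplings  spaetzle or kniffles  for soup or saute",
    "279314": "choux pastry  for profiteroles  cream puffs or eclairs",
    "61108": "german pancakes  from the mennonite treasury of recipes",
    "example-short": "anytime crepes",  # hypothetical passing entry
}

# Keep only the definitions whose length exceeds the bound.
too_long = {
    recipe_id: len(text)
    for recipe_id, text in definitions.items()
    if len(text) > MAX_LENGTH
}

for recipe_id, length in sorted(too_long.items()):
    print(recipe_id, length)
```

Only the four reported recipes end up in too_long, each a few characters over the limit, so raising sh:maxLength slightly or trimming those definitions would clear the violations.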

In practice, you may need to run inference (RDFS, OWL, SKOS, etc.) prior to running a SHACL validation, to add triples to the data graph before applying the shape constraints. We'll cover that topic later.

In summary, SHACL provides an excellent approach for:

  • auditing work
  • ensuring data quality when working with KGs
  • inference, since the data graph elements which violate rules could in turn be annotated
  • building applications for human-in-the-loop aka machine teaching

Exercises

Exercise 1:

Fix the errors in the first example by modifying its data graph, i.e., its ABox. Can you get it to a state where the returned conforms flag is True?

Exercise 2:

Extend the SHACL shape graph for our recipe KG to validate that each recipe has a non-zero cooking time. How large must the maximum cooking time be set to avoid violations?


Last update: 2021-01-21