Note

To run this notebook in JupyterLab, load examples/ex5_0.ipynb

SHACL validation with pySHACL

Let's explore the W3C Shapes Constraint Language (SHACL) using the pySHACL library.

When we build KGs, it can be helpful to think of the different semantic technologies in terms of layers:

  • SKOS - thesauri and classification
  • SHACL - requirements
  • OWL - concepts
  • RDF - represent nodes, predicates, literals

For an excellent overview + demos of SHACL, see shacl-masterclass by Veronika Heimsbakk. Another great online resource for working with SHACL is the SHACL Playground.

With SHACL we can validate data as well as run some forms of inference, to complement what's provided by RDF, OWL, and so on. For a good overview, see the discussion of SHACL and other rule-based approaches in "Rules for Knowledge Graphs Rules" by Dan McCreary.

First, we'll show one of the examples from pySHACL, starting with its SHACL shapes graph in Turtle format:

shapes_graph = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:givenName ;
        sh:datatype xsd:string ;
        sh:name "given name" ;
    ] ;
    sh:property [
        sh:path schema:birthDate ;
        sh:lessThan schema:deathDate ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path schema:gender ;
        sh:in ( "female" "male" "nonbinary" "self-descr" ) ;
    ] ;
    sh:property [
        sh:path schema:address ;
        sh:node schema:AddressShape ;
    ] .

schema:AddressShape
    a sh:NodeShape ;
    sh:closed true ;
    sh:property [
        sh:path schema:streetAddress ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path schema:postalCode ;
        sh:datatype xsd:integer ;
        sh:minInclusive 10000 ;
        sh:maxInclusive 99999 ;
    ] .
"""

Then define a simple data graph to test against, given in JSON-LD format:

data_graph = """
{
    "@context": { "@vocab": "http://schema.org/" },
    "@id": "http://example.org/ns#Bob",
    "@type": "Person",
    "givenName": "Robert",
    "familyName": "Junior",

    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
"""

Now let's run pySHACL directly, to test whether this data graph conforms to its shapes graph, then print out the results:

import pyshacl

results = pyshacl.validate(
    data_graph,
    shacl_graph=shapes_graph,
    data_graph_format="json-ld",
    shacl_graph_format="ttl",
    inference="rdfs",
    debug=True,
    serialize_report_graph="ttl",
    )

conforms, report_graph, report_text = results

print("conforms", conforms)
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: Literal("1971-07-07")
    Result Path: schema:birthDate
    Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")

Constraint Violation in MinInclusiveConstraintComponent (http://www.w3.org/ns/shacl#MinInclusiveConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:integer ; sh:maxInclusive Literal("99999", datatype=xsd:integer) ; sh:minInclusive Literal("10000", datatype=xsd:integer) ; sh:path schema:postalCode ]
    Focus Node: <http://example.org/ns#BobsAddress>
    Value Node: Literal("9404", datatype=xsd:integer)
    Result Path: schema:postalCode
    Message: Value is not >= Literal("10000", datatype=xsd:integer)

Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: <http://example.org/ns#BobsAddress>
    Result Path: schema:address
    Message: Value does not conform to Shape schema:AddressShape



conforms False

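The conforms flag is False since the given data graph violates some of its shape constraints. Because debug=True was set, pySHACL also logged each violation as it was found, which is the output shown above the flag. Let's look at the report_text output for human-readable analysis: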

print(report_text)
Validation Report
Conforms: False
Results (2):
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: Literal("1971-07-07")
    Result Path: schema:birthDate
    Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
    Focus Node: <http://example.org/ns#Bob>
    Value Node: <http://example.org/ns#BobsAddress>
    Result Path: schema:address
    Message: Value does not conform to Shape schema:AddressShape

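The birthDate value (1971-07-07) is later than the deathDate value (1968-09-10), which causes a LessThanConstraintComponent violation of the PersonShape constraints. The postalCode value (9404) falls below the sh:minInclusive bound of 10000 in AddressShape, so the address node fails to conform. The report lists this as a NodeConstraintComponent violation on schema:address, which is why it counts 2 results while the debug log above also shows the nested MinInclusiveConstraintComponent violation.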
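To make the two failing constraints concrete, here is a minimal plain-Python sketch that repeats the same comparisons on Bob's values. It is only an analogy for what sh:lessThan and sh:minInclusive / sh:maxInclusive require, not how pySHACL evaluates shapes internally; the variable names are hypothetical.

```python
import json

# The relevant fields of data_graph, parsed as ordinary JSON (no RDF involved).
bob = json.loads("""
{
    "@id": "http://example.org/ns#Bob",
    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
""")

# sh:lessThan schema:deathDate: each birthDate must be less than each deathDate.
# Both values are plain strings here; ISO 8601 dates sort correctly as text.
birth_before_death = bob["birthDate"] < bob["deathDate"]

# sh:minInclusive 10000 and sh:maxInclusive 99999 on schema:postalCode.
postal_code = bob["address"]["postalCode"]
postal_code_in_range = 10000 <= postal_code <= 99999

print("birthDate < deathDate:", birth_before_death)
print("postalCode in range:", postal_code_in_range)
```

Both checks fail for this data, which matches the two problems that the validation report traces back to Bob and to his address.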

The serialize_report_graph parameter for pyshacl.validate() requested that the report graph be serialized as an RDF graph in Turtle format, returned as a byte array. Let's parse this as an rdflib.Graph object, and iterate through its RDF statements to view a machine-readable validation report:

import rdflib

report_g = rdflib.Graph()
report_g.parse(data=report_graph, format="ttl", encoding="utf-8")
nm = report_g.namespace_manager

for s, p, o in sorted(report_g):
    print(s.n3(nm), p.n3(nm), o.n3(nm))
_:n6099ef7758484ffe8efa7612be89be9db1 rdf:type sh:ValidationReport
_:n6099ef7758484ffe8efa7612be89be9db1 sh:conforms "false"^^xsd:boolean
_:n6099ef7758484ffe8efa7612be89be9db1 sh:result _:n6099ef7758484ffe8efa7612be89be9db2
_:n6099ef7758484ffe8efa7612be89be9db1 sh:result _:n6099ef7758484ffe8efa7612be89be9db4
_:n6099ef7758484ffe8efa7612be89be9db2 rdf:type sh:ValidationResult
_:n6099ef7758484ffe8efa7612be89be9db2 sh:focusNode <http://example.org/ns#Bob>
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultMessage "Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal(\"1971-07-07\")"
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultPath schema:birthDate
_:n6099ef7758484ffe8efa7612be89be9db2 sh:resultSeverity sh:Violation
_:n6099ef7758484ffe8efa7612be89be9db2 sh:sourceConstraintComponent sh:LessThanConstraintComponent
_:n6099ef7758484ffe8efa7612be89be9db2 sh:sourceShape _:n6099ef7758484ffe8efa7612be89be9db3
_:n6099ef7758484ffe8efa7612be89be9db2 sh:value "1971-07-07"
_:n6099ef7758484ffe8efa7612be89be9db3 sh:lessThan schema:deathDate
_:n6099ef7758484ffe8efa7612be89be9db3 sh:maxCount "1"^^xsd:integer
_:n6099ef7758484ffe8efa7612be89be9db3 sh:path schema:birthDate
_:n6099ef7758484ffe8efa7612be89be9db4 rdf:type sh:ValidationResult
_:n6099ef7758484ffe8efa7612be89be9db4 sh:focusNode <http://example.org/ns#Bob>
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultMessage "Value does not conform to Shape schema:AddressShape"
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultPath schema:address
_:n6099ef7758484ffe8efa7612be89be9db4 sh:resultSeverity sh:Violation
_:n6099ef7758484ffe8efa7612be89be9db4 sh:sourceConstraintComponent sh:NodeConstraintComponent
_:n6099ef7758484ffe8efa7612be89be9db4 sh:sourceShape _:n6099ef7758484ffe8efa7612be89be9db5
_:n6099ef7758484ffe8efa7612be89be9db4 sh:value <http://example.org/ns#BobsAddress>
_:n6099ef7758484ffe8efa7612be89be9db5 sh:node schema:AddressShape
_:n6099ef7758484ffe8efa7612be89be9db5 sh:path schema:address

For some use cases, you may need to query this report graph, e.g., to identify data quality issues in the data graph.


Validating RDF graphs with kglab

Now let's try this again, using the kglab abstraction layer. First we'll load our recipe graph developed in the earlier examples, which has been serialized to dat/recipes.ttl in Turtle format:

from os.path import dirname
import kglab
import os

namespaces = {
    "nom":  "http://example.org/#",
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    namespaces = namespaces,
    )

kg.load_rdf(dirname(os.getcwd()) + "/dat/recipes.ttl") ;

Next we define a SHACL shape graph to provide requirements for our recipes KG:

shape_graph = """
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix nom:  <http://example.org/#> .
@prefix wtm:  <http://purl.org/heals/food/> .
@prefix ind:  <http://purl.org/heals/ingredient/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

nom:RecipeShape
    a sh:NodeShape ;
    sh:targetClass wtm:Recipe ;
    sh:property [
        sh:path wtm:hasIngredient ;
        sh:node wtm:Ingredient ;
        sh:minCount 3 ;
    ] ;
    sh:property [
        sh:path skos:definition ;
        sh:datatype xsd:string ;
        sh:maxLength 50 ;
    ] .
"""

Now let's run the SHACL validation through the kglab integration for the pySHACL library. Note that providing a shape graph through the shacl_graph parameter is optional. Alternatively, RDF statements in the SHACL shape graph could have been included in our dat/recipes.ttl file.

conforms, report_graph, report_text = kg.validate(
    shacl_graph=shape_graph,
    shacl_graph_format="ttl"
)
Usage of abort_on_error is deprecated. Use abort_on_first instead.
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/261361>
    Value Node: Literal("german dumplings  spaetzle or kniffles  for soup or saute")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/137158>
    Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/61108>
    Value Node: Literal("german pancakes  from the mennonite treasury of recipes")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/279314>
    Value Node: Literal("choux pastry  for profiteroles  cream puffs or eclairs")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)

Let's print the report text, which should be approximately what was in the logged output:

print(report_text)
Validation Report
Conforms: False
Results (4):
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/261361>
    Value Node: Literal("german dumplings  spaetzle or kniffles  for soup or saute")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/137158>
    Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/61108>
    Value Node: Literal("german pancakes  from the mennonite treasury of recipes")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
    Severity: sh:Violation
    Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
    Focus Node: <https://www.food.com/recipe/279314>
    Value Node: Literal("choux pastry  for profiteroles  cream puffs or eclairs")
    Result Path: skos:definition
    Message: String length not <= Literal("50", datatype=xsd:integer)
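
All four results above come from the sh:maxLength 50 constraint on skos:definition. As a quick sanity check, this plain-Python sketch measures the same four strings, copied from the report; the dictionary and constant names are hypothetical, and it does not involve kglab or pySHACL.

```python
MAX_LENGTH = 50  # mirrors sh:maxLength 50 in the shape graph

# Focus nodes and value nodes copied from the validation report above.
definitions = {
    "https://www.food.com/recipe/261361":
        "german dumplings  spaetzle or kniffles  for soup or saute",
    "https://www.food.com/recipe/137158":
        "pikkuleipienperustaikina  finnish butter cookie dough",
    "https://www.food.com/recipe/61108":
        "german pancakes  from the mennonite treasury of recipes",
    "https://www.food.com/recipe/279314":
        "choux pastry  for profiteroles  cream puffs or eclairs",
}

# Keep only the definitions whose character count exceeds the limit.
too_long = {
    uri: len(text)
    for uri, text in definitions.items()
    if len(text) > MAX_LENGTH
}

for uri, length in sorted(too_long.items()):
    print(f"{uri}: {length} characters (limit {MAX_LENGTH})")
```

Every one of the four definitions exceeds the limit, so either the data needs shorter definitions or the shape needs a larger sh:maxLength.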

As a wrapper on pySHACL, the kglab integration returns report_graph as another KnowledgeGraph object. Let's run a SPARQL query on that to determine programmatically which elements of our recipe KG are violating the SHACL constraint rules:

import pandas as pd

sparql = """
SELECT ?id ?focus ?path ?value ?constraint ?message
  WHERE {
    ?id rdf:type sh:ValidationResult .
    ?id sh:focusNode ?focus .
    ?id sh:resultPath ?path .
    ?id sh:value ?value .
    ?id sh:resultMessage ?message .
    ?id sh:sourceConstraintComponent ?constraint
  }
"""

df = report_graph.query_as_df(sparql)
df
|   | id | focus | path | value | constraint | message |
|---|----|-------|------|-------|------------|---------|
| 0 | _:n3e261bc5685149ffab113fd5b6f0bebfb2 | <https://www.food.com/recipe/137158> | skos:definition | pikkuleipienperustaikina finnish butter cooki... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
| 1 | _:n3e261bc5685149ffab113fd5b6f0bebfb4 | <https://www.food.com/recipe/61108> | skos:definition | german pancakes from the mennonite treasury o... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
| 2 | _:n3e261bc5685149ffab113fd5b6f0bebfb5 | <https://www.food.com/recipe/279314> | skos:definition | choux pastry for profiteroles cream puffs or... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |
| 3 | _:n3e261bc5685149ffab113fd5b6f0bebfb6 | <https://www.food.com/recipe/261361> | skos:definition | german dumplings spaetzle or kniffles for so... | sh:MaxLengthConstraintComponent | String length not <= Literal("50", datatype=xs... |

Let's visualize the report_graph as well, with the SHACL rule nodes highlighted in red, and the violations in orange:

VIS_STYLE = {
    "sh": {
        "color": "red",
        "size": 20,
    },
    "_":{
        "color": "orange",
        "size": 30,
    },
}

subgraph = kglab.SubgraphTensor(report_graph)
pyvis_graph = subgraph.build_pyvis_graph(notebook=True, style=VIS_STYLE)

pyvis_graph.force_atlas_2based()
pyvis_graph.show("tmp.fig05.html")

In practice, you may need to run inference (RDFS, OWL, SKOS, etc.) prior to running a SHACL validation, to add RDF statements to the data graph before applying the shape constraints. We'll cover that topic later.

In summary, SHACL provides an excellent approach for:

  • auditing work
  • ensuring data quality when working with KGs
  • inference, since the data graph elements which violate rules could in turn be annotated
  • building applications for human-in-the-loop aka machine teaching

Exercises

Exercise 1:

Fix the errors in the first example by modifying its data graph, i.e., its ABox. Can you get it to a state where the returned conforms flag is True?

Exercise 2:

Extend the SHACL shape graph for our recipe KG to validate that each recipe has a non-zero cooking time. How large must the maximum cooking time be set to avoid violations?


Last update: 2022-03-23