Load data via Morph-KGC¶

morph-kgc is an engine that constructs RDF knowledge graphs from heterogeneous data sources with R2RML and RML mapping languages. Morph-KGC is built on top of pandas and it leverages mapping partitions to significantly reduce execution times and memory consumption for large data sources.

For documentation see https://github.com/oeg-upm/Morph-KGC/wiki/Usage

This example uses a simple SQLite database as input, transforming it into an RDF knowledge graph based on an R2RML mapping for relations between "students" and "sports".

First, let's visualize the sample database: CREATE TABLE "Student" ( "ID" integer PRIMARY KEY, "FirstName" varchar(50), "LastName" varchar(50) );

CREATE TABLE "Sport" ( "ID" integer PRIMARY KEY, "Description" varchar(50) );

CREATE TABLE "Student_Sport" ( "ID_Student" integer, "ID_Sport" integer, PRIMARY KEY ("ID_Student","ID_Sport"), FOREIGN KEY ("ID_Student") REFERENCES "Student"("ID"), FOREIGN KEY ("ID_Sport") REFERENCES "Sport"("ID") );

INSERT INTO "Student" ("ID","FirstName","LastName") VALUES (10,'Venus', 'Williams'); INSERT INTO "Student" ("ID","FirstName","LastName") VALUES (11,'Fernando', 'Alonso'); INSERT INTO "Student" ("ID","FirstName","LastName") VALUES (12,'David', 'Villa');

INSERT INTO "Sport" ("ID", "Description") VALUES (110,'Tennis'); INSERT INTO "Sport" ("ID", "Description") VALUES (111,'Football'); INSERT INTO "Sport" ("ID", "Description") VALUES (112,'Formula1');

INSERT INTO "Student_Sport" ("ID_Student", "ID_Sport") VALUES (10,110); INSERT INTO "Student_Sport" ("ID_Student", "ID_Sport") VALUES (11,111); INSERT INTO "Student_Sport" ("ID_Student", "ID_Sport") VALUES (11,112); INSERT INTO "Student_Sport" ("ID_Student", "ID_Sport") VALUES (12,111); This has three tables plus the data to populate them.

Morph-KGC needs a configuration to describe the mapping, so let's create a basic one for our example:

import os

cwd = os.path.dirname(os.getcwd())

config = f"""
[StudentSportDB]
mappings={cwd}/dat/student_sport.r2rml.ttl
db_url=sqlite:///{cwd}/dat/student_sport.db
"""

print(config)

[StudentSportDB]
mappings=/Users/paco/src/kglab/dat/student_sport.r2rml.ttl
db_url=sqlite:////Users/paco/src/kglab/dat/student_sport.db

You can see how to create this config file in the docs.

Alternatively, you provide a path to a config file, for example:

config = "path/to/config.ini"

This config references a R2RML mapping (in student_sport.r2rml.ttl) which gets applied to the input data. One way to create such a mapping is YARRRML, "a human-friendly text-based representation of RML rules". You write rules in YAML then transform these with a yarrrml-parser into either a RML or R2RML file. See the browser-based version called Matey to get an idea of how this works.

The YARRRML file needed for this example can be found in dat/ which can be used with the sample database:

prefixes:
  ns1: "http://example.com/"

sources:
  students:
    queryFormulation: mysql
    query: select * from Student left join Student_Sport on Student.ID = Student_Sport.ID_Student left join Sport on Sport.ID = Student_Sport.ID_Sport;

mappings:
  student:
    sources: students
    s: http://example.com/$(ID_Student)
    po:
      - [a, ns1:Student]
      - [ns1:firstName, $(FirstName)]
      - [ns1:lastName, $(LastName)]
      - p: ns1:plays
        o: 
        - mapping: sport
          condition:
            function: equal
            parameters:
              - [str1, $(ID_Sport)]
              - [str2, $(ID_Sport)]
  sport:
    sources: students
    s: http://example.com/$(ID_Sport)
    po:
      - [a, ns1:Sport]
      - [ns1:description, $(Description)]
      - [ns1:id, $(ID_Sport)]

Using the yarrrml-parser:

yarrrml-parser -i student_sport.yml -o student_sport.r2rml.ttl -f R2RML

produces the students_sport.r2rml.ttl file. This is the file referenced in the configuration for Morph-KGC.

Next we'll use morph-kgc to load the RDF data from the SQLite based on an R2RML mapping:

from icecream import ic
import kglab

namespaces = {
    "ex":  "http://example.com/",
    }

kg = kglab.KnowledgeGraph(
    name = "A KG example with students and sports",
    namespaces = namespaces,
    )

kg.materialize(config);

INFO | 2022-03-23 11:32:45,453 | 7 mapping rules retrieved.
INFO | 2022-03-23 11:32:45,466 | Mapping partition with 1 groups generated.
INFO | 2022-03-23 11:32:45,467 | Maximum number of rules within mapping group: 7.
INFO | 2022-03-23 11:32:45,469 | Mappings processed in 0.281 seconds.
INFO | 2022-03-23 11:32:45,694 | Number of triples generated in total: 22.

Data can be loaded from multiple text formats, e.g. CSV, JSON, XML, Parquet, and also through different relational DBMS such as PostgresSQL, MySQL, Oracle, Microsoft SQL Server, MariaDB, and so on.

Now let's try to query!

sparql = """
PREFIX ex: <http://example.com/>

SELECT ?student_name ?sport_desc
WHERE {
  ?student rdf:type ex:Student .
  ?student ex:firstName ?student_name .
  ?student ex:plays ?sport .
  ?sport ex:description ?sport_desc
}
    """

for row in kg._g.query(sparql):
    student_name = kg.n3fy(row.student_name)
    sport_desc = kg.n3fy(row.sport_desc)
    ic(student_name, sport_desc)

ic| student_name: 'Fernando', sport_desc: 'Formula1'
ic| student_name: 'Fernando', sport_desc: 'Football'
ic| student_name: 'Venus', sport_desc: 'Tennis'
ic| student_name: 'David', sport_desc: 'Football'

Last update: 2022-03-23