CSV Extractor Algorithm

The CSV Extractor produces an RDF representation of a CSV file compliant with the RFC 4180 and that foresees an header. Such extractor relies on the presence of an header to use the named fields as RDF properties. Field delimiter could be automatically guessed or specified via Apache Any23 Configuration.

Given a document with URL url, Apache Any23 uses the following algorithm to extract RDF:

  • It tries to guess the fields delimiter and to detect the header
  • for each field name:
    • if name is a valid IRI keep it as an IRI since could be derefenceable.
    • if name is not a valid IRI, the associated RDF Property IRI propUri will be in the form of: url concatenated name
    • add label statement: propUri rdfs:label name
    • add column index statement: propUri <http://vocab.sindice.net/csv/rowPosition> index
  • for each row:
    • add RDFS type statement: <url/row/index> rdfs:type <http://vocab.sindice.net/csv/Row>, where index is the column index number.
    • for each cell value:
      • write statement, <url/row/<index>> propUri cell where: cell could be an IRI if the cell value is an IRI, or a typed literal according the value of the CSV actual value cell.
  • add RDF statements claiming number of rows and columns.

For example, given this trivial CSV with an header and just two rows:

first name; last name; http://xmlns.com/foaf/0.1/knows; age
Davide; Palmisano; http://michelemostarda.com; 30; value should not appear
Michele; Mostarda; http://g1o.net;

the following RDF (serialized in RDF/XML) is produced:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

  <rdf:Description rdf:about="http://bob.example.com/firstName">
    <label xmlns="http://www.w3.org/2000/01/rdf-schema#">first name</label>
    <columnPosition xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">0</columnPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/lastName">
    <label xmlns="http://www.w3.org/2000/01/rdf-schema#">last name</label>
    <columnPosition xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1</columnPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://xmlns.com/foaf/0.1/knows">
    <columnPosition xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</columnPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/age">
    <label xmlns="http://www.w3.org/2000/01/rdf-schema#">age</label>
    <columnPosition xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">3</columnPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/row/0">
    <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
    <firstName xmlns="http://bob.example.com/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Davide</firstName>
    <lastName xmlns="http://bob.example.com/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Palmisano</lastName>
    <knows xmlns="http://xmlns.com/foaf/0.1/"
    rdf:resource="http://michelemostarda.com"/
    <age xmlns="http://bob.example.com/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">30</age>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/">
    <row xmlns="http://vocab.sindice.net/csv/" rdf:resource="http://bob.example.com/row/0"/>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/row/0">
    <rowPosition xmlns="http://vocab.sindice.net/csv/">0</rowPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/row/1">
    <rdf:type rdf:resource="http://vocab.sindice.net/csv/Row"/>
    <firstName xmlns="http://bob.example.com/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Michele</firstName>
    <lastName xmlns="http://bob.example.com/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Mostarda</lastName>
    <knows xmlns="http://xmlns.com/foaf/0.1/" rdf:resource="http://g1o.net" />
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/">
    <row xmlns="http://vocab.sindice.net/csv/"
    rdf:resource="http://bob.example.com/row/1"/>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/row/1">
    <rowPosition xmlns="http://vocab.sindice.net/csv/">1</rowPosition>
  </rdf:Description>

  <rdf:Description rdf:about="http://bob.example.com/">
    <numberOfRows xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</numberOfRows>
    <numberOfColumns xmlns="http://vocab.sindice.net/csv/"
    rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">4</numberOfColumns>
  </rdf:Description>

</rdf:RDF>