doc/tutorials/dataimport/index.rst
author Laurent Wouters <lwouters@cenotelie.fr>
Fri, 20 Mar 2020 14:34:07 +0100
changeset 12931 6eae252361e5
parent 12792 e2cdb1be6bd9
permissions -rw-r--r--
[rql] Store selected variables for RQL select queries in ResultSet (#17218476) By storing the name of the selected variables for RQL select queries in the ResultSet (within the "variables" attribute), the information can be passed down to specific protocols, e.g. rqlio that may wish to pass is down further to clients. In turn, clients can then choose to present the results of RQL select queries as symbolic bindings using the names used in the query's projection, instead of ordinal arrays.
Importing relational data into a CubicWeb instance
==================================================

Introduction
~~~~~~~~~~~~

This tutorial explains how to import data from an external source (e.g. a collection of files)
into a CubicWeb cube instance.

First, once we know the format of the data we wish to import, we devise a
*data model*, that is, a CubicWeb (Yams) schema which reflects the way the data
is structured. This schema is implemented in the ``schema.py`` file.
In this tutorial, we will describe such a schema for a particular data set,
the Diseasome data (see below).

Once the schema is defined, we create a cube and an instance.
The cube is a specification of an application, whereas an instance
is the application per se.

Once the schema is defined and the instance is created, the import can be performed in
the following steps:

1. Build a custom parser for the data to be imported. This yields an in-memory
   Python representation of the data.

2. Map the parsed data to the data model defined in ``schema.py``.

3. Perform the actual import of the data. This comes down to "populating"
   the data model with the in-memory representation obtained in step 1, according to
   the mapping defined in step 2.

This tutorial illustrates all the above steps in the context of relational data
stored in the RDF format.

More specifically, we describe the import of Diseasome_ RDF/OWL data.

.. _Diseasome: http://datahub.io/dataset/fu-berlin-diseasome

Building a data model
~~~~~~~~~~~~~~~~~~~~~

The first thing to do when using CubicWeb for creating an application from scratch
is to devise a *data model*, that is, a relational representation of the problem to be
modeled or of the structure of the data to be imported.

In such a schema, we define
an entity type (a subclass of ``EntityType``) for each type of entity to import. Each such type
has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers,
strings or characters, then they are defined as attributes, e.g. ``attribute = Int()``
for an attribute named ``attribute`` which is an integer.

Each entity type also has a set of
relations. They are declared like attributes, but they link
entities of the type being defined to entities of the type
named in the relation definition.

For example, for the Diseasome data, we have two types of entities, genes and diseases.
Thus, we create two classes which inherit from ``EntityType``::

    class Disease(EntityType):
        # Corresponds to http://www.w3.org/2000/01/rdf-schema#label
        label = String(maxsize=512, fulltextindexed=True)
        ...

        # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene
        associated_genes = SubjectRelation('Gene', cardinality='**')
        ...

        # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/chromosomalLocation
        chromosomal_location = SubjectRelation('ExternalUri', cardinality='?*', inlined=True)


    class Gene(EntityType):
        ...

In this schema, there are attributes whose values are numbers or strings. Thus, they are
defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=512)``.
These types accept several constraints or options, such as ``maxsize``.
There are also relations, either between the entity types themselves, or between them
and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in
CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between
a ``Disease`` entity and an ``ExternalUri`` entity. The relation is declared with the CubicWeb /
Yams ``SubjectRelation`` class, which accepts several optional keyword arguments, such as
``cardinality``, which specifies how many subjects and objects the relation type
can link. For example, the ``'?*'`` cardinality in the ``chromosomal_location`` relation type says
that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities.
In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the
``chromosomal_location`` relation type, and any number of ``Disease`` entities can be related
to the same ``ExternalUri``.
As an example of a relation between the entity types themselves, ``associated_genes`` links a ``Disease``
entity to ``Gene`` entities. Its ``'**'`` cardinality means that any number of ``Gene`` entities can be
associated with a ``Disease``, and any number of ``Disease`` entities with a ``Gene``.
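The two-character cardinality notation can be spelled out with a small helper. This is a minimal sketch, not part of Yams: ``describe_cardinality`` and ``_RANGES`` are hypothetical names, and the sketch assumes the usual reading in which the first character constrains the subject side and the second the object side.

```python
# Hypothetical helper, not provided by Yams: it only decodes the notation.
_RANGES = {
    '1': '1',      # exactly one
    '?': '0..1',   # zero or one
    '+': '1..n',   # one or more
    '*': '0..n',   # zero or more
}


def describe_cardinality(cardinality, subject, obj):
    """Return two readable strings for a two-character cardinality."""
    if len(cardinality) != 2 or any(c not in _RANGES for c in cardinality):
        raise ValueError('unsupported cardinality: %r' % (cardinality,))
    subject_side, object_side = cardinality
    return (
        # How many objects a single subject may be related to.
        '%s per %s: %s' % (obj, subject, _RANGES[subject_side]),
        # How many subjects a single object may be related to.
        '%s per %s: %s' % (subject, obj, _RANGES[object_side]),
    )


print(describe_cardinality('?*', 'Disease', 'ExternalUri'))
# ('ExternalUri per Disease: 0..1', 'Disease per ExternalUri: 0..n')
print(describe_cardinality('**', 'Disease', 'Gene'))
# ('Gene per Disease: 0..n', 'Disease per Gene: 0..n')
```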

Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them::

    from yams.buildobjs import EntityType, SubjectRelation, String, Int
    from cubicweb.schemas.base import ExternalUri

Building a custom data parser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The data we wish to import is structured in the RDF format,
as a text file containing a set of lines.
On each line, there are three fields.
The first two fields are URIs ("Uniform Resource Identifiers").
The third field is either a URI or a string. Each field bears a particular meaning:

- the leftmost field is a URI that identifies the entity to be imported.
  Note that the entities defined in the data model (i.e., in ``schema.py``) should
  correspond to the entities whose URIs are specified in the import file.

- the middle field is a URI that identifies a relation whose subject is the entity
  defined by the leftmost field. Note that this should also correspond
  to the definitions in the data model.

- the rightmost field is either a URI or a string. When this field is a URI,
  it gives the object of the relation defined by the middle field.
  When the rightmost field is a string, the middle field is interpreted as an attribute
  of the subject (introduced by the leftmost field) and the rightmost field is
  interpreted as the value of the attribute.

Note however that some attributes (i.e. relations whose objects are strings)
have their objects defined as strings followed by ``^^`` and by another URI,
which gives the datatype of the string; we ignore this part.

Let us show some examples:

- a line holding an attribute definition:
  ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1>
  <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .``
  The line contains the definition of the ``label`` attribute of an
  entity of type ``gene``. The value of ``label`` is ``"CYP17A1"``.

- a line holding a relation definition:
  ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1>
  <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene>
  <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .``
  The line contains the definition of the ``associatedGene`` relation between
  a ``disease`` subject entity identified by ``1`` and a ``gene`` object
  entity identified by ``HADH2``.
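The identifiers quoted in these examples (``label``, ``associatedGene``, ``genes``, ``HADH2``) are simply the trailing parts of the URIs. The sketch below shows one way to extract them with plain string operations. ``local_name`` and ``resource_kind`` are hypothetical helpers introduced for illustration; they are not taken from the ``diseasome_parser`` module, and they assume URIs laid out like the ones above.

```python
def _strip_brackets(uri):
    """Remove surrounding whitespace and the '<' / '>' delimiters."""
    return uri.strip().lstrip('<').rstrip('>')


def local_name(uri):
    """Return the fragment after '#', or else the last path segment."""
    bare = _strip_brackets(uri)
    # A fragment, when present, takes precedence over the path.
    for separator in ('#', '/'):
        if separator in bare:
            return bare.rsplit(separator, 1)[1]
    return bare


def resource_kind(uri):
    """Return the path segment just before the last one, e.g. 'genes'."""
    parts = _strip_brackets(uri).split('/')
    return parts[-2] if len(parts) >= 2 else ''


gene = '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2>'
label = '<http://www.w3.org/2000/01/rdf-schema#label>'
print(resource_kind(gene), local_name(gene))  # genes HADH2
print(local_name(label))                      # label
```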

Thus, for parsing the data, we can (see the ``diseasome_parser`` module):

1. define a couple of regular expressions for parsing the two kinds of lines,
   ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing
   the relation definitions.

2. define a function that iterates through the lines of the file and
   ``yield``\ s a (subject, relation, object) tuple for each line.
   We called it ``_retrieve_structure`` in the ``diseasome_parser`` module.
   The function needs the file name and the types for which information
   should be retrieved.
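The two steps above can be sketched as follows. This is only an illustration under stated assumptions: the tutorial names ``RE_ATTS`` and ``RE_RELS`` but does not show their patterns, so the expressions below are guesses that fit the example lines. ``iter_triples`` is a hypothetical, simplified stand-in for ``_retrieve_structure``: it takes an iterable of lines instead of a file name and does not filter by type.

```python
import re

# Assumed patterns; the actual ones in diseasome_parser may differ.
_SUBJ_REL = r'^<(?P<subj>[^>]+)>\s+<(?P<rel>[^>]+)>\s+'
# <subject> <relation> "string" . with an optional ^^<datatype> that is ignored.
# Language-tagged strings ("..."@en) are not handled by this sketch.
RE_ATTS = re.compile(_SUBJ_REL + r'"(?P<obj>.*)"(?:\^\^<[^>]+>)?\s*\.$')
# <subject> <relation> <object> .
RE_RELS = re.compile(_SUBJ_REL + r'<(?P<obj>[^>]+)>\s*\.$')


def iter_triples(lines):
    """Yield a (subject, relation, object) tuple for each recognised line."""
    for line in lines:
        line = line.strip()
        # Try the attribute form first, then the relation form.
        match = RE_ATTS.match(line) or RE_RELS.match(line)
        if match:
            yield match.group('subj'), match.group('rel'), match.group('obj')
        # Lines matching neither pattern (e.g. blank lines) are skipped.


sample = [
    '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .',
    '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1> '
    '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene> '
    '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .',
]
for triple in iter_triples(sample):
    print(triple)
```

Writing the function as a generator means the whole file never has to be held in memory. To read from a file, an open file object can be passed directly, since iterating over it yields its lines.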
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   154
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   155
Alternatively, instead of hand-making the parser, one could use the RDF parser provided
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   156
in the ``dataio`` cube.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   157
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   158
.. XXX To further study and detail the ``dataio`` cube usage.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   159
Once we have the (subject, relation, object) triples, we need to map them into
the data model.


Mapping the data to the schema
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the case of the diseasome data, two dictionaries suffice to map
the names of the relations as extracted by the parser to the names of the relations
as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module
they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``.
Since the relation and attribute names are written in camelCase in the original data,
mappings are necessary if we follow PEP 8 when naming the attributes in the data model.
For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation
``chromosomal_location``.

Once these mappings have been defined, we just iterate over the (subject, relation, object)
tuples provided by the parser and extract the entities, with their attributes and relations.
For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``.
The value associated with the ``attributes`` key is a dictionary containing (attribute: value)
pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of
the entity itself.
The value associated with the ``relations`` key is a dictionary containing (relation: value)
pairs, where "value" is a URI.
This is implemented in the ``entities_from_rdf`` interface function of the module
``diseasome_parser``. This function provides an iterator over the dictionaries containing
the ``attributes`` and ``relations`` keys for all entities.

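The grouping described above can be sketched as follows. This is a simplified
illustration, not the code of ``diseasome_parser``: the function name
``entities_from_triples``, the dictionary entries other than
``chromosomalLocation``, and the choice to treat ``chromosomalLocation`` as an
attribute are all assumptions made for the example.

```python
# Hypothetical excerpts of the two mapping dictionaries. The real ones live in
# the ``diseasome_parser`` module; the entries below are for illustration only.
MAPPING_ATTS = {
    'label': 'label',
    'chromosomalLocation': 'chromosomal_location',  # assumed to be an attribute
}
MAPPING_RELS = {
    'associatedGene': 'associated_genes',  # invented relation name
}


def entities_from_triples(triples):
    """Group (subject, relation, object) triples into one dict per entity.

    Each resulting dict has an ``attributes`` key (which always contains
    ``cwuri``) and a ``relations`` key, as described in the text.
    """
    entities = {}
    for subj, rel, obj in triples:
        entity = entities.setdefault(
            subj, {'attributes': {'cwuri': subj}, 'relations': {}})
        if rel in MAPPING_ATTS:
            entity['attributes'][MAPPING_ATTS[rel]] = obj
        elif rel in MAPPING_RELS:
            # Simplification: a later triple with the same relation name
            # overwrites the earlier one.
            entity['relations'][MAPPING_RELS[rel]] = obj
        # Names without a mapping are silently skipped in this sketch.
    return iter(entities.values())
```

Keeping the two dictionaries separate lets the loop decide, from the RDF name
alone, whether a triple becomes an attribute value or a link to another entity.
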
However, this is a simple case. In real life, things can get much more complicated, and the
mapping can be far from trivial, especially when several data sources (which may follow
different formatting and even structuring conventions) must be mapped into the same data model.

Importing the data
~~~~~~~~~~~~~~~~~~

The data import code should be placed in a Python module. Let us call it
``diseasome_import.py``. This module is then run via
``cubicweb-ctl``, as follows::

    cubicweb-ctl shell <instance name> diseasome_import.py -- <other arguments e.g. data file>

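A minimal sketch of how ``diseasome_import.py`` might parse the arguments placed
after ``--``. The ``--datafile`` option and the ``parse_import_args`` helper are
hypothetical, and the way the argument list reaches the script is an assumption
to check against your CubicWeb version; here it is simply passed in as a list.

```python
import argparse


def parse_import_args(argv):
    """Parse the arguments that follow ``--`` on the command line."""
    parser = argparse.ArgumentParser(prog='diseasome_import.py')
    # Hypothetical option, for illustration only.
    parser.add_argument('--datafile', required=True,
                        help='path of the data file to import')
    return parser.parse_args(argv)
```

Keeping the parsing in a small function makes the import module testable
without a running instance.
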
In the import module, we should use a *store* to perform the import.
A store is an object which provides three kinds of methods for
importing data:

- a method for importing the entities, along with the values
  of their attributes.
- a method for importing the relations between the entities.
- a method for committing the imports to the database.

In CubicWeb, we have four stores:

1. ``ObjectStore``: base class for the stores in CubicWeb.
   It only provides a skeleton for all other stores and
   provides the means for creating the memory structures
   (dictionaries) that hold the entities and the relations
   between them.

2. ``RQLObjectStore``: store which uses the RQL language for performing
   database insertions and updates. It relies on all the CubicWeb hooks
   machinery, especially for dealing with security issues (database access
   permissions).

3. ``NoHookRQLObjectStore``: store which uses the RQL language for
   performing database insertions and updates, but for which
   all hooks are deactivated. This implies that
   certain checks with respect to the CubicWeb / Yams schema
   (data model) are not performed. However, all SQL queries
   obtained from the RQL ones are executed in a sequential
   manner, one query per inserted entity.

4. ``SQLGenObjectStore``: store which uses the SQL language directly.
   It inserts entities either sequentially, by executing an SQL query
   for each entity, or directly by using one PostgreSQL ``COPY FROM``
   query for a set of similarly structured entities.

For really massive imports (millions or billions of entities), the
``dataio`` cube provides another store, called
``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``,
except that anything related to CubicWeb is bypassed. That is, even the
CubicWeb EID entity identifiers are not handled. This store is the fastest,
but has a slightly different API from the other four stores mentioned above.
Moreover, it has an important limitation, in that it doesn't insert inlined [#]_
relations in the database.

.. [#] An inlined relation is a relation defined in the schema
       with the keyword argument ``inlined=True``. Such a relation
       is inserted in the database as an attribute of the entity
       whose subject it is.

In the following section we will see how to import data by using the stores
in CubicWeb's ``dataimport`` module.

Using the stores in ``dataimport``
++++++++++++++++++++++++++++++++++

``ObjectStore`` is seldom used in real life for importing data, since it is
only the base class for the other stores and does not perform an actual
import of the data. The other three stores, which do import data,
are based on ``ObjectStore`` and provide the same API.

All three stores, ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
``SQLGenObjectStore``, provide exactly the same API for importing data, that is
entities and relations, into an SQL database.

Before using a store, one must import the ``dataimport`` module and then initialize
the store, with the current ``session`` as a parameter::

    import cubicweb.dataimport as cwdi
    ...

    store = cwdi.RQLObjectStore(session)

Each such store provides three methods for data import:

#. ``create_entity(Etype, **attributes)``, which allows us to add
   an entity of the Yams type ``Etype`` to the database. This entity's attributes
   are passed as keyword arguments (``attributes``). The method returns the entity
   created in the database. For example, we add two entities,
   a person, of ``Person`` type, and a location, of ``Location`` type::

        person = store.create_entity('Person', name='Toto', age='18', height='190')

        location = store.create_entity('Location', town='Paris', arrondissement='13')

#. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation
   of the Yams type ``r_type`` to the database. The relation's subject is an entity
   whose EID is ``subject_eid``; its object is another entity, whose EID is
   ``object_eid``. For example [#]_::

        store.relate(person.eid(), 'lives_in', location.eid(), **kwargs)

   ``kwargs`` is only used by the ``relate`` method of ``SQLGenObjectStore``; it
   allows us to specify the type of the subject of the relation when the relation is
   defined as inlined in the schema.

   .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns
          the entity identifier as assigned by CubicWeb when creating the entity.
          This only works for entities defined via the stores in CubicWeb's
          ``dataimport`` module.

   The keyword argument understood by ``SQLGenObjectStore`` is called
   ``subjtype`` and holds the type of the subject entity. For the example considered here,
   this amounts to [#]_::

        store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.cw_etype)

   If ``subjtype`` is not specified, the store tries to infer the type of the subject.
   However, this doesn't always work, e.g. when there are several possible subject types
   for a given relation type.

   .. [#] The ``cw_etype`` attribute of an entity defined via ``create_entity`` holds
          the type of the entity just created. This only works for entities defined via
          the stores in CubicWeb's ``dataimport`` module. In the example considered
          here, ``person.cw_etype`` holds ``'Person'``.

   All stores other than ``SQLGenObjectStore`` ignore the ``kwargs`` parameters.

8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   318
#. ``flush()``, which allows us to perform the actual commit into the database, along
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   319
   with some cleanup operations. Ideally, this method should be called as often as
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   possible, that is, after each insertion in the database, so that database sessions
   are kept as atomic as possible. In practice, we usually call this method twice:
   first, after all the entities have been created, and second, after all relations have
   been created.

   Note however that before each commit the database insertions
   have to be consistent with the schema. Thus, if, for instance,
   an entity has an attribute defined through a relation (viz.
   a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object
   cardinality, we have to create the entity under discussion,
   the object entity of the relation under discussion, and the
   relation itself, before committing the additions to the database.

   The ``flush`` method is simply called as::

        store.flush()


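The ordering constraint on commits can be illustrated with a small, self-contained sketch.
``ToyStore`` below is a hypothetical in-memory stand-in, not a CubicWeb store: its class name,
its ``mandatory`` parameter and its checking logic are assumptions made for illustration only.
It mimics the idea that a flush is rejected while an entity still lacks a relation whose
cardinality makes it mandatory.

```python
class ToyStore:
    """Hypothetical in-memory stand-in for a store (NOT the CubicWeb API)."""

    def __init__(self, mandatory):
        # mandatory: set of (subject etype, rtype) pairs that require at
        # least one relation, i.e. a "1" or "+" cardinality.
        self.mandatory = mandatory
        self.pending_entities = {}   # eid -> etype, not yet committed
        self.pending_relations = []  # (subject eid, rtype, object eid)
        self.committed = []          # one record per successful flush
        self._next_eid = 1

    def create_entity(self, etype):
        eid = self._next_eid
        self._next_eid += 1
        self.pending_entities[eid] = etype
        return eid

    def relate(self, subj_eid, rtype, obj_eid):
        self.pending_relations.append((subj_eid, rtype, obj_eid))

    def flush(self):
        # Refuse to commit if a pending entity lacks a mandatory relation.
        for eid, etype in self.pending_entities.items():
            for subj_etype, rtype in self.mandatory:
                if etype != subj_etype:
                    continue
                if not any(s == eid and r == rtype
                           for s, r, _ in self.pending_relations):
                    raise ValueError(
                        '%s #%d lacks mandatory relation %s' % (etype, eid, rtype))
        self.committed.append((dict(self.pending_entities),
                               list(self.pending_relations)))
        self.pending_entities.clear()
        self.pending_relations.clear()


store = ToyStore(mandatory={('Person', 'lives_in')})
person = store.create_entity('Person')
location = store.create_entity('Location')

try:
    store.flush()          # too early: the mandatory relation is missing
    flushed_early = True
except ValueError:
    flushed_early = False

store.relate(person, 'lives_in', location)
store.flush()              # entity, object entity and relation together
```

In this toy model the first ``flush`` fails and the second succeeds, which is the reason for
creating the subject entity, the object entity and the relation before committing.
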
Using the ``MassiveObjectStore`` in the ``dataio`` cube
+++++++++++++++++++++++++++++++++++++++++++++++++++++++

This store, available in the ``dataio`` cube, allows us to
fully dispense with the CubicWeb import mechanisms and hence
to interact directly with the database server, via SQL queries.

Moreover, these queries rely on PostgreSQL's ``COPY FROM`` instruction
to create several entities in a single query. This brings tremendous
performance improvements with respect to the RQL-based data insertion
procedures.

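To give an idea of what a ``COPY FROM`` based insertion involves, here is a minimal sketch
that serializes a batch of rows into PostgreSQL's tab-separated text format. It is not taken
from the ``dataio`` cube: the helper names ``copy_escape`` and ``build_copy_buffer``, the
table name ``person`` and its column names are hypothetical, and the use of ``psycopg2`` as
database driver is an assumption.

```python
import io


def copy_escape(value):
    """Render one value in PostgreSQL's COPY text format."""
    if value is None:
        return r'\N'  # default NULL marker of COPY
    text = str(value)
    # Backslash must be escaped first, then the separator and line breaks.
    return (text.replace('\\', '\\\\')
                .replace('\t', '\\t')
                .replace('\n', '\\n')
                .replace('\r', '\\r'))


def build_copy_buffer(rows):
    """Serialize rows (sequences of values) into one tab-separated buffer."""
    lines = ['\t'.join(copy_escape(v) for v in row) for row in rows]
    return io.StringIO(''.join(line + '\n' for line in lines))


rows = [
    ('Toto', 18, 'http://link/to/person/toto_18_190'),
    ('Titi', None, 'http://link/to/person/titi'),
]
buf = build_copy_buffer(rows)
payload = buf.getvalue()

# With a psycopg2 cursor, the whole batch could then be sent in one call
# (hypothetical table and column names):
#     cursor.copy_from(buf, 'person', columns=('name', 'age', 'uri'))
```

The point of the sketch is that the whole batch travels to the server in a single statement,
instead of one insertion query per entity.
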
However, the API of this store is slightly different from the API of
the stores in CubicWeb's ``dataimport`` module.

Before using the store, one has to import the ``dataio`` cube's
``dataimport`` module, then initialize the store by giving it the
``session`` parameter::

    from cubicweb_dataio import dataimport as mcwdi
    ...

    store = mcwdi.MassiveObjectStore(session)

The ``MassiveObjectStore`` provides six methods for inserting data
into the database:

#. ``init_rtype_table(SubjEtype, r_type, ObjEtype)``, which specifies the
   creation of the tables associated with the relation types in the database.
   Each such table has three columns: the type of the subject entity, the
   type of the relation (that is, the name of the attribute in the subject
   entity which is defined via the relation), and the type of the object
   entity. For example::

        store.init_rtype_table('Person', 'lives_in', 'Location')

   Please note that these tables can be created before the entities, since
   they only specify their types, not their unique identifiers.

#. ``create_entity(Etype, **attributes)``, which allows us to add new entities,
   whose attributes are given in the ``attributes`` dictionary.
   Please note however that, by default, this method does *not* return
   the created entity. The method is called, for example, as in::

        store.create_entity('Person', name='Toto', age='18', height='190',
                            uri='http://link/to/person/toto_18_190')
        store.create_entity('Location', town='Paris', arrondissement='13',
                            uri='http://link/to/location/paris_13')

   In order to be able to link these entities via the relations when needed,
   we must ourselves provide a means of uniquely identifying the entities.
   In general, this is done via URIs, stored in attributes like ``uri`` or
   ``cwuri``. The name of the attribute is irrelevant as long as its value is
   unique for each entity.

#. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually
   relate the entities uniquely identified by ``subject_iid`` and
   ``object_iid`` via a relation of type ``r_type``. For example::

        store.relate_by_iid('http://link/to/person/toto_18_190',
                            'lives_in',
                            'http://link/to/location/paris_13')

   Please note that this method does *not* work for inlined relations!

#. ``convert_relations(SubjEtype, r_type, ObjEtype, subj_iid_attribute,
   obj_iid_attribute)`` allows us to actually insert the relations in the
   database. Each call to this method inserts all the relations of type
   ``r_type`` between entities of the given types.
   ``subj_iid_attribute`` and ``obj_iid_attribute`` are the names
   of the attributes which store the unique identifiers of the entities,
   as assigned by the user. These names can be identical, as long as
   their values are unique. For example, for inserting all relations
   of type ``lives_in`` between ``Person`` and ``Location`` entities,
   we write::

        store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri')

#. ``flush()`` performs the actual commit in the database. It only needs
   to be called after ``create_entity`` and ``relate_by_iid`` calls.
   Please note that ``relate_by_iid`` does *not* perform insertions into
   the database, hence calling ``flush()`` for it would have no effect.

#. ``cleanup()`` performs database cleanups, by removing temporary tables.
   It should only be called at the end of the import.



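The way user-assigned identifiers are turned into actual relations can be pictured with the
following toy model. It is a sketch under stated assumptions, not the ``dataio`` code:
``ToyMassiveStore`` is a hypothetical in-memory class that only reuses the method names and
argument order listed above, and it leaves out ``flush()`` and ``cleanup()`` because it has no
database behind it.

```python
class ToyMassiveStore:
    """Hypothetical in-memory model of the call sequence (NOT the dataio code)."""

    def __init__(self):
        self.entities = {}     # eid -> (etype, attributes)
        self.iid_buffer = {}   # rtype -> list of (subject iid, object iid)
        self.relations = {}    # rtype -> list of (subject eid, object eid)
        self._next_eid = 1

    def init_rtype_table(self, subj_etype, rtype, obj_etype):
        # Only types are needed here, so this can precede entity creation.
        self.iid_buffer.setdefault(rtype, [])

    def create_entity(self, etype, **attributes):
        self.entities[self._next_eid] = (etype, attributes)
        self._next_eid += 1    # deliberately returns nothing

    def relate_by_iid(self, subj_iid, rtype, obj_iid):
        # Records the pair of user-assigned identifiers; eids are not used.
        self.iid_buffer[rtype].append((subj_iid, obj_iid))

    def convert_relations(self, subj_etype, rtype, obj_etype,
                          subj_iid_attribute, obj_iid_attribute):
        def index(etype, attribute):
            # Map identifier value -> eid for all entities of one type.
            return {attrs[attribute]: eid
                    for eid, (etype_, attrs) in self.entities.items()
                    if etype_ == etype and attribute in attrs}

        subjects = index(subj_etype, subj_iid_attribute)
        objects = index(obj_etype, obj_iid_attribute)
        resolved = self.relations.setdefault(rtype, [])
        for subj_iid, obj_iid in self.iid_buffer[rtype]:
            if subj_iid in subjects and obj_iid in objects:
                resolved.append((subjects[subj_iid], objects[obj_iid]))


store = ToyMassiveStore()
store.init_rtype_table('Person', 'lives_in', 'Location')
store.create_entity('Person', name='Toto',
                    uri='http://link/to/person/toto_18_190')
store.create_entity('Location', town='Paris',
                    uri='http://link/to/location/paris_13')
store.relate_by_iid('http://link/to/person/toto_18_190', 'lives_in',
                    'http://link/to/location/paris_13')
store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri')
```

In this model ``relate_by_iid`` only records identifier pairs, and the relation between the
two internal entity identifiers exists only once ``convert_relations`` has matched the ``uri``
values on both sides.
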
.. XXX to add smth on the store's parameter initialization.



Application to the Diseasome data
+++++++++++++++++++++++++++++++++

Import setup
############

We define an import function, ``diseasome_import``, which basically does four things:

#. creates and initializes the store to be used, via a line such as::

        store = cwdi.SQLGenObjectStore(session)

   where ``cwdi`` is the imported ``cubicweb.dataimport`` or
   ``cubicweb_dataio.dataimport``.

#. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the
   ``diseasome_parser`` module and iterates on its result, in a line such as::

        for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')):

   where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the
   name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``.

#. creates the entities to be inserted in the database; for Diseasome, there are two
   kinds of entities:

   #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case.
   #. entities which are built into CubicWeb / Yams, viz. ``ExternalUri``, which defines
      URIs.

   As we are working with RDF data, each entity is defined through a series of URIs. Hence,
   each "relational attribute" [#]_ of an entity is defined via a URI, that is, in CubicWeb
   terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above,
   as such::

        ent = store.create_entity(etype, **entity)

   where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``.

.. [#] By "relational attribute" we denote an attribute (of an entity) which
       is defined through a relation, e.g. the ``chromosomal_location`` attribute
       of ``Disease`` entities, which is defined through a relation between a
       ``Disease`` and an ``ExternalUri``.

   There are as many ``ExternalUri`` entities as URIs in the data file. For them, we define a
   single attribute, ``uri``, which holds the URI under discussion::

        extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri")

8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   480
#. creates the relations between the entities. We have relations between:
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   481
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   482
   #. entities defined in the schema, e.g. between ``Disease`` and ``Gene``
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   483
      entities, such as the ``associated_genes`` relation defined for
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   484
      ``Disease`` entities.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   485
   #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``.
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   486
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   487
   The way relations are added to the database depends on the store:
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   488
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   489
   - for the stores in the CubicWeb ``dataimport`` module, we only use
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   490
     ``store.relate``, in
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   491
     another loop, on the relations (that is, a
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   492
     loop inside the preceding one, mentioned at step 2)::
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   493
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   494
        for rtype, rels in relations.iteritems():
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   495
            ...
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   496
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   497
            store.relate(ent.eid(), rtype, extu.eid(), **kwargs)
12792
e2cdb1be6bd9 [doc8] D002 Trailing whitespace
Arthur Lutz <arthur.lutz@logilab.fr>
parents: 12556
diff changeset
   498
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   499
     where ``kwargs`` is a dictionary designed to accommodate the need for specifying
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   500
     the type of the subject entity of the relation, when the relation is inlined and
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   501
     ``SQLGenObjectStore`` is used. For example::

            ...
            store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease')

   - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module,
     the relations are created in three steps:

     #. first, a table is created for each relation type, as in::

            ...
            store.init_rtype_table(ent.cw_etype, rtype, extu.cw_etype)

        which comes down to lines such as::

            store.init_rtype_table('Disease', 'associated_genes', 'Gene')
            store.init_rtype_table('Gene', 'gene_id', 'ExternalUri')

     #. second, each relation is declared with the ``relate_by_iid`` method,
        which uses the URI of each entity as its identifier, as in::

            disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3'
            gene_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2'
            store.relate_by_iid(disease_uri, 'associated_genes', gene_uri)

     #. third, the relations for each relation type are added to the database,
        via the ``convert_relations`` method, as in::

            store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri')

        and::

            store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri')

        where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities
        defined in the data model, and of the ``ExternalUri`` entities, respectively.

#. flushes all relations and entities::

    store.flush()

   which performs the actual commit of the inserted entities and relations in the database.
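
Going back to the ``store.relate`` call made in the relation loop, the ``kwargs``
dictionary could be built by a small helper such as the one sketched below. This is
only an illustration: ``relate_kwargs``, its parameters and the ``inlined_rtypes`` set
are hypothetical names, not part of CubicWeb, and the list of inlined relation types
is assumed to be known from the schema:

```python
def relate_kwargs(rtype, subject_etype, inlined_rtypes, store_name):
    """Return the extra keyword arguments to pass to ``store.relate``.

    ``inlined_rtypes`` is a hypothetical set listing the relation types that
    the schema declares as inlined; it is supplied by the caller.
    """
    if store_name == 'SQLGenObjectStore' and rtype in inlined_rtypes:
        # Inlined relation with SQLGenObjectStore: the subject type is needed.
        return {'subjtype': subject_etype}
    # Every other case: no extra argument.
    return {}


# Illustrative values only; 'chromosomal_location' is assumed to be inlined.
inlined = {'chromosomal_location'}
print(relate_kwargs('chromosomal_location', 'Disease', inlined, 'SQLGenObjectStore'))
print(relate_kwargs('associated_genes', 'Disease', inlined, 'SQLGenObjectStore'))
```

The result would then be passed as ``store.relate(ent.eid(), rtype, extu.eid(), **kwargs)``,
so that the same loop body can serve every store.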

If the ``MassiveObjectStore`` is used, then a cleanup of temporary SQL tables should be performed
at the end of the import::

    store.cleanup()
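
The order of the ``MassiveObjectStore`` calls described above can be summarised in a
short sketch. Since the real store needs a running instance, the example uses
``RecordingStore``, a hypothetical stand-in that only logs the calls it receives;
``import_relations`` is likewise an illustrative helper, not a function of the
``dataio`` cube:

```python
class RecordingStore:
    """Hypothetical stand-in that only records the calls it receives.

    It is NOT the real ``MassiveObjectStore``; it merely mirrors the method
    names used in this tutorial so that the call order can be run anywhere.
    """

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method call (init_rtype_table, relate_by_iid, ...) is logged.
        def record(*args):
            self.calls.append((name,) + args)
        return record


def import_relations(store, subj_etype, rtype, obj_etype, uri_pairs,
                     subj_attr='cwuri', obj_attr='cwuri'):
    """Run the three relation steps, then flush and clean up."""
    store.init_rtype_table(subj_etype, rtype, obj_etype)        # step 1
    for subj_uri, obj_uri in uri_pairs:
        store.relate_by_iid(subj_uri, rtype, obj_uri)           # step 2
    store.convert_relations(subj_etype, rtype, obj_etype,
                            subj_attr, obj_attr)                 # step 3
    store.flush()
    store.cleanup()


store = RecordingStore()
import_relations(store, 'Disease', 'associated_genes', 'Gene', [
    ('http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3',
     'http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2'),
])
print([call[0] for call in store.calls])
```

With a real store, the same helper would be called once per relation type, the
cleanup being kept for the very end of the import.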

Timing benchmarks
#################

To time the import script, we simply decorate the import function with the ``timed``
decorator::

    from logilab.common.decorators import timed
    ...

    @timed
    def diseasome_import(session, filename):
        ...

After running the import function as shown in the "Importing the data" section, we obtain two time measurements::

    diseasome_import clock: ... / time: ...

Here, the meanings of these measurements are [#]_:

- ``clock`` is the time spent by CubicWeb, on the server side (i.e. hooks and data pre- / post-processing on SQL
  queries),

- ``time`` is the sum of ``clock`` and the time spent in PostgreSQL.

.. [#] The meanings of the ``clock`` and ``time`` measurements, when using the ``@timed``
       decorator, were taken from `a blog post on massive data import in CubicWeb`_.

.. _a blog post on massive data import in CubicWeb: http://www.cubicweb.org/blogentry/2116712
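
To make the difference between the two measurements more concrete, here is a minimal
sketch of a decorator in the same spirit. It is not the ``logilab.common``
implementation: ``timed_sketch`` and ``fake_import`` are hypothetical names, and the
sketch assumes that ``clock`` corresponds to the CPU time used by the Python process
while ``time`` corresponds to the elapsed wall-clock time:

```python
import functools
import time


def timed_sketch(func):
    """Print CPU time ("clock") and wall-clock time ("time") of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        cpu_start = time.process_time()    # CPU time used by this process
        wall_start = time.perf_counter()   # elapsed real time
        result = func(*args, **kwargs)
        wrapper.clock = time.process_time() - cpu_start
        wrapper.time = time.perf_counter() - wall_start
        print('%s clock: %.3f / time: %.3f'
              % (func.__name__, wrapper.clock, wrapper.time))
        return result
    return wrapper


@timed_sketch
def fake_import():
    # Sleeping stands in for waiting on the database: it adds wall-clock
    # time but almost no CPU time to this process.
    time.sleep(0.05)
    return 'done'


fake_import()
```

Under that assumption, the difference ``time - clock`` is the time during which the
process was waiting, which is how the time attributed to the database is obtained in
the table further down.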

The import function is put in an import module, named ``diseasome_import`` here. The module is called
directly from the CubicWeb shell, as follows::

    cubicweb-ctl shell diseasome_instance diseasome_import.py \
    -- -df diseasome_import_file.nt -st StoreName

The module accepts two arguments:

- the data file, introduced by ``-df [--datafile]``, and
- the store, introduced by ``-st [--store]``.
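
One possible way of handling these two options inside the module is sketched below
with the standard ``argparse`` module. The function name ``parse_import_args``, the
choice of making the data file mandatory and the default store are assumptions made
for the example; the tutorial's own module may parse its arguments differently:

```python
import argparse


def parse_import_args(argv):
    """Parse the two options described above from a list of strings."""
    parser = argparse.ArgumentParser(description='Diseasome import options')
    # Assumption: the data file is mandatory.
    parser.add_argument('-df', '--datafile', required=True,
                        help='path of the data file to import')
    # Assumption: RQLObjectStore is used when no store is given.
    parser.add_argument('-st', '--store', default='RQLObjectStore',
                        help='name of the store to use')
    return parser.parse_args(argv)


args = parse_import_args(['-df', 'diseasome_import_file.nt', '-st', 'StoreName'])
print(args.datafile, args.store)
```

The list given to ``parse_import_args`` stands for the arguments placed after ``--``
on the command line shown above.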

The timings (in seconds) for different stores are given in the following table, for
importing 4213 ``Disease`` entities and 3919 ``Gene`` entities with the import module
just described:

+--------------------------+------------------------+--------------------------------+------------+
| Store                    | CubicWeb time (clock)  | PostgreSQL time (time - clock) | Total time |
+==========================+========================+================================+============+
| ``RQLObjectStore``       | 225.98                 | 62.05                          | 288.03     |
+--------------------------+------------------------+--------------------------------+------------+
| ``NoHookRQLObjectStore`` | 62.73                  | 51.38                          | 114.11     |
+--------------------------+------------------------+--------------------------------+------------+
| ``SQLGenObjectStore``    | 20.41                  | 11.03                          | 31.44      |
+--------------------------+------------------------+--------------------------------+------------+
| ``MassiveObjectStore``   | 4.84                   | 6.93                           | 11.77      |
+--------------------------+------------------------+--------------------------------+------------+
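
The figures in the table can be cross-checked with a few lines of arithmetic. The
sketch below copies the ``clock`` and total timings from the table, recomputes the
PostgreSQL column as their difference and derives the speed-up of each store relative
to ``RQLObjectStore``; the ``timings`` dictionary is just a convenient container
chosen for the example:

```python
# (clock, total) timings in seconds, copied from the table above.
timings = {
    'RQLObjectStore': (225.98, 288.03),
    'NoHookRQLObjectStore': (62.73, 114.11),
    'SQLGenObjectStore': (20.41, 31.44),
    'MassiveObjectStore': (4.84, 11.77),
}

baseline_total = timings['RQLObjectStore'][1]
for name, (clock, total) in timings.items():
    postgres = total - clock            # time spent in PostgreSQL
    speedup = baseline_total / total    # relative to RQLObjectStore
    print('%-22s postgres=%6.2f  speedup=x%.1f' % (name, postgres, speedup))
```

In total time, ``MassiveObjectStore`` comes out roughly 24.5 times faster than
``RQLObjectStore`` and roughly 2.7 times faster than ``SQLGenObjectStore``.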


Conclusions
~~~~~~~~~~~

In this tutorial we have seen how to import data into a CubicWeb application instance. We first saw how to
create a schema, then how to create a parser for the data and a mapping of the data to the schema.
Finally, we have seen four ways of importing data into CubicWeb.

Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
``SQLGenObjectStore`` stores, which have a common API:

- ``RQLObjectStore`` is by far the slowest, especially on the
  CubicWeb side, and so it should be used only for small amounts of
  "sensitive" data (i.e. where security is a concern).

- ``NoHookRQLObjectStore`` cuts the time spent on the CubicWeb side by a factor of almost four,
  but is still quite slow; on the PostgreSQL side it is nearly as slow as the previous store.
  It should be used for data where security is not a concern,
  but consistency (with the data model) is.

- ``SQLGenObjectStore`` cuts the time spent on the CubicWeb side by a further factor of three and
  the time spent on the PostgreSQL side by almost five, compared to ``NoHookRQLObjectStore``.
  It should be used for relatively large amounts of data, where
  security and data consistency are not a concern. Compared to the previous store, it has the
  disadvantage that, for inlined relations, we must specify their subjects' types.

For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available
from the ``dataio`` cube. It is considerably faster than all the other stores:
it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than
``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account:

#. it cannot insert relations defined as inlined in the schema,
#. no security or consistency check is performed on the data,
#. its API is slightly different from the other stores.

Hence, this store should be used when security and data consistency are not a concern,
and there are no inlined relations in the schema.
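
The recommendations above can be condensed into a small decision helper. This is a
deliberately simplified sketch: ``choose_store`` is a hypothetical function, it only
looks at the three criteria discussed in these conclusions and it ignores the volume
of data, which should also be weighed in practice:

```python
def choose_store(security, consistency, inlined_relations):
    """Suggest a store name from the three criteria discussed above."""
    if security:
        return 'RQLObjectStore'          # slowest, for small sensitive data
    if consistency:
        return 'NoHookRQLObjectStore'    # no security, consistency with the data model
    if inlined_relations:
        return 'SQLGenObjectStore'       # needs subject types for inlined relations
    return 'MassiveObjectStore'          # fastest, from the dataio cube


print(choose_store(security=False, consistency=False, inlined_relations=True))
```

For example, a schema with inlined relations and no security or consistency
requirement leads to ``SQLGenObjectStore``.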