doc/tutorials/dataimport/index.rst
changeset 12792 e2cdb1be6bd9
parent 12556 d1c659d70368
equal deleted inserted replaced
12791:4564ecfc0134 12792:e2cdb1be6bd9
     2 ==================================================
     2 ==================================================
     3 
     3 
     4 Introduction
     4 Introduction
     5 ~~~~~~~~~~~~
     5 ~~~~~~~~~~~~
     6 
     6 
     7 This tutorial explains how to import data from an external source (e.g. a collection of files) 
     7 This tutorial explains how to import data from an external source (e.g. a collection of files)
     8 into a CubicWeb cube instance.
     8 into a CubicWeb cube instance.
     9 
     9 
    10 First, once we know the format of the data we wish to import, we devise a 
    10 First, once we know the format of the data we wish to import, we devise a
    11 *data model*, that is, a CubicWeb (Yams) schema which reflects the way the data
    11 *data model*, that is, a CubicWeb (Yams) schema which reflects the way the data
    12 is structured. This schema is implemented in the ``schema.py`` file.
    12 is structured. This schema is implemented in the ``schema.py`` file.
    13 In this tutorial, we will describe such a schema for a particular data set, 
    13 In this tutorial, we will describe such a schema for a particular data set,
    14 the Diseasome data (see below).
    14 the Diseasome data (see below).
    15 
    15 
    16 Once the schema is defined, we create a cube and an instance. 
    16 Once the schema is defined, we create a cube and an instance.
    17 The cube is a specification of an application, whereas an instance 
    17 The cube is a specification of an application, whereas an instance
    18 is the application per se. 
    18 is the application per se.
    19 
    19 
    20 Once the schema is defined and the instance is created, the import can be performed, via
    20 Once the schema is defined and the instance is created, the import can be performed, via
    21 the following steps:
    21 the following steps:
    22 
    22 
    23 1. Build a custom parser for the data to be imported. Thus, one obtains a Python
    23 1. Build a custom parser for the data to be imported. Thus, one obtains a Python
    39 Building a data model
    39 Building a data model
    40 ~~~~~~~~~~~~~~~~~~~~~
    40 ~~~~~~~~~~~~~~~~~~~~~
    41 
    41 
    42 The first thing to do when using CubicWeb for creating an application from scratch
    42 The first thing to do when using CubicWeb for creating an application from scratch
    43 is to devise a *data model*, that is, a relational representation of the problem to be
    43 is to devise a *data model*, that is, a relational representation of the problem to be
    44 modeled or of the structure of the data to be imported. 
    44 modeled or of the structure of the data to be imported.
    45 
    45 
    46 In such a schema, we define
    46 In such a schema, we define
    47 an entity type (``EntityType`` objects) for each type of entity to import. Each such type
    47 an entity type (``EntityType`` objects) for each type of entity to import. Each such type
    48 has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers,
    48 has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers,
    49 strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()``
    49 strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()``
    50 for an attribute named ``attribute`` which is an integer. 
    50 for an attribute named ``attribute`` which is an integer.
    51 
    51 
    52 Each such type also has a set of
    52 Each such type also has a set of
    53 relations, which are defined like the attributes, except that they represent, in fact,
    53 relations, which are defined like the attributes, except that they represent, in fact,
    54 relations between the entities of the type under discussion and the objects of a type which
    54 relations between the entities of the type under discussion and the objects of a type which
    55 is specified in the relation definition. 
    55 is specified in the relation definition.
    56 
    56 
    57 For example, for the Diseasome data, we have two types of entities, genes and diseases.
    57 For example, for the Diseasome data, we have two types of entities, genes and diseases.
    58 Thus, we create two classes which inherit from ``EntityType``::
    58 Thus, we create two classes which inherit from ``EntityType``::
    59 
    59 
    60     class Disease(EntityType):
    60     class Disease(EntityType):
    71 
    71 
    72 
    72 
    73     class Gene(EntityType):
    73     class Gene(EntityType):
    74         ...
    74         ...
    75 
    75 
    76 In this schema, there are attributes whose values are numbers or strings. Thus, they are 
    76 In this schema, there are attributes whose values are numbers or strings. Thus, they are
    77 defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=12)``. 
    77 defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=12)``.
    78 These types can have several constraints or attributes, such as ``maxsize``. 
    78 These types can have several constraints or attributes, such as ``maxsize``.
    79 There are also relations, either between the entity types themselves, or between them
    79 There are also relations, either between the entity types themselves, or between them
    80 and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in 
    80 and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in
    81 CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between 
    81 CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between
    82 a ``Disease`` entity and an ``ExternalUri`` entity. The relation is marked by the CubicWeb /
    82 a ``Disease`` entity and an ``ExternalUri`` entity. The relation is marked by the CubicWeb /
    83 Yams ``SubjectRelation`` method. The latter can have several optional keyword arguments, such as
    83 Yams ``SubjectRelation`` method. The latter can have several optional keyword arguments, such as
    84 ``cardinality`` which specifies the number of subjects and objects related by the relation type 
    84 ``cardinality`` which specifies the number of subjects and objects related by the relation type
    85 specified. For example, the ``'?*'`` cardinality in the ``chromosomal_relation`` relation type says
    85 specified. For example, the ``'?*'`` cardinality in the ``chromosomal_relation`` relation type says
    86 that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities.
    86 that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities.
    87 In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the
    87 In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the
    88 ``chromosomal_location`` relation type, and that we can have zero or more ``Disease`` entities in the
    88 ``chromosomal_location`` relation type, and that we can have zero or more ``Disease`` entities in the
    89 data base. 
    89 data base.
    90 For a relation between the entity types themselves, the ``associated_genes`` between a ``Disease``
    90 For a relation between the entity types themselves, the ``associated_genes`` between a ``Disease``
    91 entity and a ``Gene`` entity is defined, so that any number of ``Gene`` entities can be associated
    91 entity and a ``Gene`` entity is defined, so that any number of ``Gene`` entities can be associated
    92 to a ``Disease``, and there can be any number of ``Disease`` s if a ``Gene`` exists.
    92 to a ``Disease``, and there can be any number of ``Disease`` s if a ``Gene`` exists.
    93 
    93 
    94 Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them::
    94 Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them::
    95 
    95 
    96     
    96 
    97     from yams.buildobjs import EntityType, SubjectRelation, String, Int
    97     from yams.buildobjs import EntityType, SubjectRelation, String, Int
    98     from cubicweb.schemas.base import ExternalUri
    98     from cubicweb.schemas.base import ExternalUri
    99 
    99 
   100 Building a custom data parser
   100 Building a custom data parser
   101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   102 
   102 
   103 The data we wish to import is structured in the RDF format,
   103 The data we wish to import is structured in the RDF format,
   104 as a text file containing a set of lines. 
   104 as a text file containing a set of lines.
   105 On each line, there are three fields. 
   105 On each line, there are three fields.
   106 The first two fields are URIs ("Universal Resource Identifiers"). 
   106 The first two fields are URIs ("Universal Resource Identifiers").
   107 The third field is either an URI or a string. Each field bares a particular meaning:
   107 The third field is either an URI or a string. Each field bares a particular meaning:
   108 
   108 
   109 - the leftmost field is an URI that holds the entity to be imported. 
   109 - the leftmost field is an URI that holds the entity to be imported.
   110   Note that the entities defined in the data model (i.e., in ``schema.py``) should 
   110   Note that the entities defined in the data model (i.e., in ``schema.py``) should
   111   correspond to the entities whose URIs are specified in the import file.
   111   correspond to the entities whose URIs are specified in the import file.
   112 
   112 
   113 - the middle field is an URI that holds a relation whose subject is the  entity 
   113 - the middle field is an URI that holds a relation whose subject is the  entity
   114   defined by the leftmost field. Note that this should also correspond
   114   defined by the leftmost field. Note that this should also correspond
   115   to the definitions in the data model.
   115   to the definitions in the data model.
   116 
   116 
   117 - the rightmost field is either an URI or a string. When this field is an URI, 
   117 - the rightmost field is either an URI or a string. When this field is an URI,
   118   it gives the object of the relation defined by the middle field.
   118   it gives the object of the relation defined by the middle field.
   119   When the rightmost field is a string, the middle field is interpreted as an attribute
   119   When the rightmost field is a string, the middle field is interpreted as an attribute
   120   of the subject (introduced by the leftmost field) and the rightmost field is
   120   of the subject (introduced by the leftmost field) and the rightmost field is
   121   interpreted as the value of the attribute.
   121   interpreted as the value of the attribute.
   122 
   122 
   123 Note however that some attributes (i.e. relations whose objects are strings) 
   123 Note however that some attributes (i.e. relations whose objects are strings)
   124 have their objects defined as strings followed by ``^^`` and by another URI;
   124 have their objects defined as strings followed by ``^^`` and by another URI;
   125 we ignore this part.
   125 we ignore this part.
   126 
   126 
   127 Let us show some examples:
   127 Let us show some examples:
   128 
   128 
   129 - of line holding an attribute definition:
   129 - of line holding an attribute definition:
   130   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1> 
   130   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1>
   131   <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .``
   131   <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .``
   132   The line contains the definition of the ``label`` attribute of an
   132   The line contains the definition of the ``label`` attribute of an
   133   entity of type ``gene``. The value of ``label`` is '``CYP17A1``'.
   133   entity of type ``gene``. The value of ``label`` is '``CYP17A1``'.
   134 
   134 
   135 - of line holding a relation definition:
   135 - of line holding a relation definition:
   136   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1> 
   136   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1>
   137   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene> 
   137   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene>
   138   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .``
   138   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .``
   139   The line contains the definition of the ``associatedGene`` relation between
   139   The line contains the definition of the ``associatedGene`` relation between
   140   a ``disease`` subject entity identified by ``1`` and a ``gene`` object 
   140   a ``disease`` subject entity identified by ``1`` and a ``gene`` object
   141   entity defined by ``HADH2``.
   141   entity defined by ``HADH2``.
   142 
   142 
   143 Thus, for parsing the data, we can (:note: see the ``diseasome_parser`` module):
   143 Thus, for parsing the data, we can (:note: see the ``diseasome_parser`` module):
   144 
   144 
   145 1. define a couple of regular expressions for parsing the two kinds of lines, 
   145 1. define a couple of regular expressions for parsing the two kinds of lines,
   146    ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing
   146    ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing
   147    the relation definitions.
   147    the relation definitions.
   148 
   148 
   149 2. define a function that iterates through the lines of the file and retrieves
   149 2. define a function that iterates through the lines of the file and retrieves
   150    (``yield`` s) a (subject, relation, object) tuple for each line.
   150    (``yield`` s) a (subject, relation, object) tuple for each line.
   164 Mapping the data to the schema
   164 Mapping the data to the schema
   165 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   165 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   166 
   166 
   167 In the case of diseasome data, we can just define two dictionaries for mapping
   167 In the case of diseasome data, we can just define two dictionaries for mapping
   168 the names of the relations as extracted by the parser, to the names of the relations
   168 the names of the relations as extracted by the parser, to the names of the relations
   169 as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module 
   169 as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module
   170 they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``. 
   170 they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``.
   171 Given that the relation and attribute names are given in CamelCase in the original data,
   171 Given that the relation and attribute names are given in CamelCase in the original data,
   172 mappings are necessary if we follow the PEP08 when naming the attributes in the data model.
   172 mappings are necessary if we follow the PEP08 when naming the attributes in the data model.
   173 For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation 
   173 For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation
   174 ``chromosomal_location``.
   174 ``chromosomal_location``.
   175 
   175 
   176 Once these mappings have been defined, we just iterate over the (subject, relation, object)
   176 Once these mappings have been defined, we just iterate over the (subject, relation, object)
   177 tuples provided by the parser and we extract the entities, with their attributes and relations.
   177 tuples provided by the parser and we extract the entities, with their attributes and relations.
   178 For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``.
   178 For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``.
   179 The value associated to the ``attributes`` key is a dictionary containing (attribute: value) 
   179 The value associated to the ``attributes`` key is a dictionary containing (attribute: value)
   180 pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of 
   180 pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of
   181 the entity itself.
   181 the entity itself.
   182 The value associated to the ``relations`` key is a dictionary containing (relation: value)
   182 The value associated to the ``relations`` key is a dictionary containing (relation: value)
   183 pairs, where "value" is an URI.
   183 pairs, where "value" is an URI.
   184 This is implemented in the ``entities_from_rdf`` interface function of the module 
   184 This is implemented in the ``entities_from_rdf`` interface function of the module
   185 ``diseasome_parser``. This function provides an iterator on the dictionaries containing
   185 ``diseasome_parser``. This function provides an iterator on the dictionaries containing
   186 the ``attributes`` and ``relations`` keys for all entities.
   186 the ``attributes`` and ``relations`` keys for all entities.
   187 
   187 
   188 However, this is a simple case. In real life, things can get much more complicated, and the 
   188 However, this is a simple case. In real life, things can get much more complicated, and the
   189 mapping can be far from trivial, especially when several data sources (which can follow 
   189 mapping can be far from trivial, especially when several data sources (which can follow
   190 different formatting and even structuring conventions) must be mapped into the same data model.
   190 different formatting and even structuring conventions) must be mapped into the same data model.
   191 
   191 
   192 Importing the data
   192 Importing the data
   193 ~~~~~~~~~~~~~~~~~~
   193 ~~~~~~~~~~~~~~~~~~
   194 
   194 
   195 The data import code should be placed in a Python module. Let us call it 
   195 The data import code should be placed in a Python module. Let us call it
   196 ``diseasome_import.py``. Then, this module should be called via
   196 ``diseasome_import.py``. Then, this module should be called via
   197 ``cubicweb-ctl``, as follows::
   197 ``cubicweb-ctl``, as follows::
   198 
   198 
   199     cubicweb-ctl shell diseasome_import.py -- <other arguments e.g. data file>
   199     cubicweb-ctl shell diseasome_import.py -- <other arguments e.g. data file>
   200 
   200 
   214    provides the means for creating the memory structures
   214    provides the means for creating the memory structures
   215    (dictionaries) that hold the entities and the relations
   215    (dictionaries) that hold the entities and the relations
   216    between them.
   216    between them.
   217 
   217 
   218 2. ``RQLObjectStore``: store which uses the RQL language for performing
   218 2. ``RQLObjectStore``: store which uses the RQL language for performing
   219    database insertions and updates. It relies on all the CubicWeb hooks 
   219    database insertions and updates. It relies on all the CubicWeb hooks
   220    machinery, especially for dealing with security issues (database access
   220    machinery, especially for dealing with security issues (database access
   221    permissions).
   221    permissions).
   222 
   222 
   223 2. ``NoHookRQLObjectStore``: store which uses the RQL language for
   223 2. ``NoHookRQLObjectStore``: store which uses the RQL language for
   224    performing database insertions and updates, but for which 
   224    performing database insertions and updates, but for which
   225    all hooks are deactivated. This implies that 
   225    all hooks are deactivated. This implies that
   226    certain checks with respect to the CubicWeb / Yams schema 
   226    certain checks with respect to the CubicWeb / Yams schema
   227    (data model) are not performed. However, all SQL queries 
   227    (data model) are not performed. However, all SQL queries
   228    obtained from the RQL ones are executed in a sequential
   228    obtained from the RQL ones are executed in a sequential
   229    manner, one query per inserted entity.
   229    manner, one query per inserted entity.
   230 
   230 
   231 4. ``SQLGenObjectStore``: store which uses the SQL language directly. 
   231 4. ``SQLGenObjectStore``: store which uses the SQL language directly.
   232    It inserts entities either sequentially, by executing an SQL query 
   232    It inserts entities either sequentially, by executing an SQL query
   233    for each entity, or directly by using one PostGRES ``COPY FROM`` 
   233    for each entity, or directly by using one PostGRES ``COPY FROM``
   234    query for a set of similarly structured entities. 
   234    query for a set of similarly structured entities.
   235 
   235 
   236 For really massive imports (millions or billions of entities), there
   236 For really massive imports (millions or billions of entities), there
   237 is a cube ``dataio`` which contains another store, called 
   237 is a cube ``dataio`` which contains another store, called
   238 ``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``,
   238 ``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``,
   239 except that anything related to CubicWeb is bypassed. That is, even the
   239 except that anything related to CubicWeb is bypassed. That is, even the
   240 CubicWeb EID entity identifiers are not handled. This store is the fastest,
   240 CubicWeb EID entity identifiers are not handled. This store is the fastest,
   241 but has a slightly different API from the other four stores mentioned above.
   241 but has a slightly different API from the other four stores mentioned above.
   242 Moreover, it has an important limitation, in that it doesn't insert inlined [#]_
   242 Moreover, it has an important limitation, in that it doesn't insert inlined [#]_
   243 relations in the database. 
   243 relations in the database.
   244 
   244 
   245 .. [#] An inlined relation is a relation defined in the schema
   245 .. [#] An inlined relation is a relation defined in the schema
   246        with the keyword argument ``inlined=True``. Such a relation
   246        with the keyword argument ``inlined=True``. Such a relation
   247        is inserted in the database as an attribute of the entity
   247        is inserted in the database as an attribute of the entity
   248        whose subject it is.
   248        whose subject it is.
   258 import of the data. Nevertheless, the other three stores, which import data,
   258 import of the data. Nevertheless, the other three stores, which import data,
   259 are based on ``ObjectStore`` and provide the same API.
   259 are based on ``ObjectStore`` and provide the same API.
   260 
   260 
   261 All three stores ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
   261 All three stores ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
   262 ``SQLGenObjectStore`` provide exactly the same API for importing data, that is
   262 ``SQLGenObjectStore`` provide exactly the same API for importing data, that is
   263 entities and relations, in an SQL database. 
   263 entities and relations, in an SQL database.
   264 
   264 
   265 Before using a store, one must import the ``dataimport`` module and then initialize 
   265 Before using a store, one must import the ``dataimport`` module and then initialize
   266 the store, with the current ``session`` as a parameter::
   266 the store, with the current ``session`` as a parameter::
   267 
   267 
   268     import cubicweb.dataimport as cwdi
   268     import cubicweb.dataimport as cwdi
   269     ...
   269     ...
   270 
   270 
   272 
   272 
   273 Each such store provides three methods for data import:
   273 Each such store provides three methods for data import:
   274 
   274 
   275 #. ``create_entity(Etype, **attributes)``, which allows us to add
   275 #. ``create_entity(Etype, **attributes)``, which allows us to add
   276    an entity of the Yams type ``Etype`` to the database. This entity's attributes
   276    an entity of the Yams type ``Etype`` to the database. This entity's attributes
   277    are specified in the ``attributes`` dictionary. The method returns the entity 
   277    are specified in the ``attributes`` dictionary. The method returns the entity
   278    created in the database. For example, we add two entities,
   278    created in the database. For example, we add two entities,
   279    a person, of ``Person`` type, and a location, of ``Location`` type::
   279    a person, of ``Person`` type, and a location, of ``Location`` type::
   280 
   280 
   281         person = store.create_entity('Person', name='Toto', age='18', height='190')
   281         person = store.create_entity('Person', name='Toto', age='18', height='190')
   282 
   282 
   283         location = store.create_entity('Location', town='Paris', arrondissement='13')
   283         location = store.create_entity('Location', town='Paris', arrondissement='13')
   284 
   284 
   285 #. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation
   285 #. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation
   286    of the Yams type ``r_type`` to the database. The relation's subject is an entity
   286    of the Yams type ``r_type`` to the database. The relation's subject is an entity
   287    whose EID is ``subject_eid``; its object is another entity, whose EID is 
   287    whose EID is ``subject_eid``; its object is another entity, whose EID is
   288    ``object_eid``.  For example [#]_::
   288    ``object_eid``.  For example [#]_::
   289 
   289 
   290         store.relate(person.eid(), 'lives_in', location.eid(), **kwargs)
   290         store.relate(person.eid(), 'lives_in', location.eid(), **kwargs)
   291 
   291 
   292    ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here
   292    ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here
   293    to allow us to specify the type of the subject of the relation, when the relation is
   293    to allow us to specify the type of the subject of the relation, when the relation is
   294    defined as inlined in the schema. 
   294    defined as inlined in the schema.
   295 
   295 
   296 .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns
   296 .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns
   297        the entity identifier as assigned by CubicWeb when creating the entity.
   297        the entity identifier as assigned by CubicWeb when creating the entity.
   298        This only works for entities defined via the stores in the CubicWeb's
   298        This only works for entities defined via the stores in the CubicWeb's
   299        ``dataimport`` module.
   299        ``dataimport`` module.
   300 
   300 
   301    The keyword argument that is understood by ``SQLGenObjectStore`` is called 
   301    The keyword argument that is understood by ``SQLGenObjectStore`` is called
   302    ``subjtype`` and holds the type of the subject entity. For the example considered here,
   302    ``subjtype`` and holds the type of the subject entity. For the example considered here,
   303    this comes to having [#]_::
   303    this comes to having [#]_::
   304 
   304 
   305         store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.cw_etype)
   305         store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.cw_etype)
   306 
   306 
   307    If ``subjtype`` is not specified, then the store tries to infer the type of the subject.
   307    If ``subjtype`` is not specified, then the store tries to infer the type of the subject.
   308    However, this doesn't always work, e.g. when there are several possible subject types
   308    However, this doesn't always work, e.g. when there are several possible subject types
   309    for a given relation type. 
   309    for a given relation type.
   310 
   310 
   311 .. [#] The ``cw_etype`` attribute of an entity defined via ``create_entity`` holds
   311 .. [#] The ``cw_etype`` attribute of an entity defined via ``create_entity`` holds
   312        the type of the entity just created. This only works for entities defined via
   312        the type of the entity just created. This only works for entities defined via
   313        the stores in the CubicWeb's ``dataimport`` module. In the example considered
   313        the stores in the CubicWeb's ``dataimport`` module. In the example considered
   314        here, ``person.cw_etype`` holds ``'Person'``.
   314        here, ``person.cw_etype`` holds ``'Person'``.
   315     
   315 
   316    All the other stores but ``SQLGenObjectStore`` ignore the ``kwargs`` parameters.
   316    All the other stores but ``SQLGenObjectStore`` ignore the ``kwargs`` parameters.
   317 
   317 
   318 #. ``flush()``, which allows us to perform the actual commit into the database, along
   318 #. ``flush()``, which allows us to perform the actual commit into the database, along
   319    with some cleanup operations. Ideally, this method should be called as often as 
   319    with some cleanup operations. Ideally, this method should be called as often as
   320    possible, that is after each insertion in the database, so that database sessions
   320    possible, that is after each insertion in the database, so that database sessions
   321    are kept as atomic as possible. In practice, we usually call this method twice: 
   321    are kept as atomic as possible. In practice, we usually call this method twice:
   322    first, after all the entities have been created, second, after all relations have
   322    first, after all the entities have been created, second, after all relations have
   323    been created. 
   323    been created.
   324 
   324 
   325    Note however that before each commit the database insertions
   325    Note however that before each commit the database insertions
   326    have to be consistent with the schema. Thus, if, for instance,
   326    have to be consistent with the schema. Thus, if, for instance,
   327    an entity has an attribute defined through a relation (viz.
   327    an entity has an attribute defined through a relation (viz.
   328    a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object 
   328    a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object
   329    cardinality, we have to create the entity under discussion,
   329    cardinality, we have to create the entity under discussion,
   330    the object entity of the relation under discussion, and the
   330    the object entity of the relation under discussion, and the
   331    relation itself, before committing the additions to the database.
   331    relation itself, before committing the additions to the database.
   332 
   332 
   333    The ``flush`` method is simply called as::
   333    The ``flush`` method is simply called as::
   341 This store, available in the ``dataio`` cube, allows us to
   341 This store, available in the ``dataio`` cube, allows us to
   342 fully dispense with the CubicWeb import mechanisms and hence
   342 fully dispense with the CubicWeb import mechanisms and hence
   343 to interact directly with the database server, via SQL queries.
   343 to interact directly with the database server, via SQL queries.
   344 
   344 
   345 Moreover, these queries rely on PostGreSQL's ``COPY FROM`` instruction
   345 Moreover, these queries rely on PostGreSQL's ``COPY FROM`` instruction
   346 to create several entities in a single query. This brings tremendous 
   346 to create several entities in a single query. This brings tremendous
   347 performance improvements with respect to the RQL-based data insertion
   347 performance improvements with respect to the RQL-based data insertion
   348 procedures.
   348 procedures.
   349 
   349 
   350 However, the API of this store is slightly different from the API of
   350 However, the API of this store is slightly different from the API of
   351 the stores in CubicWeb's ``dataimport`` module.
   351 the stores in CubicWeb's ``dataimport`` module.
   352 
   352 
   353 Before using the store, one has to import the ``dataio`` cube's 
   353 Before using the store, one has to import the ``dataio`` cube's
   354 ``dataimport`` module, then initialize the store by giving it the
   354 ``dataimport`` module, then initialize the store by giving it the
   355 ``session`` parameter::
   355 ``session`` parameter::
   356 
   356 
   357     from cubicweb_dataio import dataimport as mcwdi
   357     from cubicweb_dataio import dataimport as mcwdi
   358     ...
   358     ...
   373 
   373 
   374    Please note that these tables can be created before the entities, since
   374    Please note that these tables can be created before the entities, since
   375    they only specify their types, not their unique identifiers.
   375    they only specify their types, not their unique identifiers.
   376 
   376 
   377 #. ``create_entity(Etype, **attributes)``, which allows us to add new entities,
   377 #. ``create_entity(Etype, **attributes)``, which allows us to add new entities,
   378    whose attributes are given in the ``attributes`` dictionary. 
   378    whose attributes are given in the ``attributes`` dictionary.
   379    Please note however that, by default, this method does *not* return 
   379    Please note however that, by default, this method does *not* return
   380    the created entity. The method is called, for example, as in::
   380    the created entity. The method is called, for example, as in::
   381 
   381 
   382         store.create_entity('Person', name='Toto', age='18', height='190', 
   382         store.create_entity('Person', name='Toto', age='18', height='190',
   383                             uri='http://link/to/person/toto_18_190')
   383                             uri='http://link/to/person/toto_18_190')
   384         store.create_entity('Location', town='Paris', arrondissement='13',
   384         store.create_entity('Location', town='Paris', arrondissement='13',
   385                             uri='http://link/to/location/paris_13')
   385                             uri='http://link/to/location/paris_13')
   386    
   386 
   387    In order to be able to link these entities via the relations when needed,
   387    In order to be able to link these entities via the relations when needed,
   388    we must provide ourselves a means for uniquely identifying the entities.
   388    we must provide ourselves a means for uniquely identifying the entities.
   389    In general, this is done via URIs, stored in attributes like ``uri`` or
   389    In general, this is done via URIs, stored in attributes like ``uri`` or
   390    ``cwuri``. The name of the attribute is irrelevant as long as its value is
   390    ``cwuri``. The name of the attribute is irrelevant as long as its value is
   391    unique for each entity.
   391    unique for each entity.
   392 
   392 
   393 #. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually 
   393 #. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually
   394    relate the entities uniquely identified by ``subject_iid`` and 
   394    relate the entities uniquely identified by ``subject_iid`` and
   395    ``object_iid`` via a relation of type ``r_type``. For example::
   395    ``object_iid`` via a relation of type ``r_type``. For example::
   396 
   396 
   397         store.relate_by_iid('http://link/to/person/toto_18_190',
   397         store.relate_by_iid('http://link/to/person/toto_18_190',
   398                             'lives_in',
   398                             'lives_in',
   399                             'http://link/to/location/paris_13')
   399                             'http://link/to/location/paris_13')
   409    of the attributes which store the unique identifiers of the entities,
   409    of the attributes which store the unique identifiers of the entities,
   410    as assigned by the user. These names can be identical, as long as
   410    as assigned by the user. These names can be identical, as long as
   411    their values are unique. For example, for inserting all relations
   411    their values are unique. For example, for inserting all relations
   412    of type ``lives_in`` between ``People`` and ``Location`` entities,
   412    of type ``lives_in`` between ``People`` and ``Location`` entities,
   413    we write::
   413    we write::
   414         
   414 
   415         store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri')
   415         store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri')
   416 
   416 
   417 #. ``flush()`` performs the actual commit in the database. It only needs 
   417 #. ``flush()`` performs the actual commit in the database. It only needs
   418    to be called after ``create_entity`` and ``relate_by_iid`` calls. 
   418    to be called after ``create_entity`` and ``relate_by_iid`` calls.
   419    Please note that ``relate_by_iid`` does *not* perform insertions into
   419    Please note that ``relate_by_iid`` does *not* perform insertions into
   420    the database, hence calling ``flush()`` for it would have no effect.
   420    the database, hence calling ``flush()`` for it would have no effect.
   421 
   421 
   422 #. ``cleanup()`` performs database cleanups, by removing temporary tables.
   422 #. ``cleanup()`` performs database cleanups, by removing temporary tables.
   423    It should only be called at the end of the import.
   423    It should only be called at the end of the import.
   435 ############
   435 ############
   436 
   436 
   437 We define an import function, ``diseasome_import``, which does basically four things:
   437 We define an import function, ``diseasome_import``, which does basically four things:
   438 
   438 
   439 #. creates and initializes the store to be used, via a line such as::
   439 #. creates and initializes the store to be used, via a line such as::
   440     
   440 
   441         store = cwdi.SQLGenObjectStore(session)
   441         store = cwdi.SQLGenObjectStore(session)
   442    
   442 
   443    where ``cwdi`` is the imported ``cubicweb.dataimport`` or 
   443    where ``cwdi`` is the imported ``cubicweb.dataimport`` or
   444    ``cubicweb_dataio.dataimport``.
   444    ``cubicweb_dataio.dataimport``.
   445 
   445 
   446 #. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the 
   446 #. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the
   447    ``diseasome_parser`` module and iterates on its result, in a line such as::
   447    ``diseasome_parser`` module and iterates on its result, in a line such as::
   448         
   448 
   449         for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')):
   449         for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')):
   450         
   450 
   451    where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the 
   451    where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the
   452    name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``.
   452    name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``.
   453 
   453 
   454 #. creates the entities to be inserted in the database; for Diseasome, there are two 
   454 #. creates the entities to be inserted in the database; for Diseasome, there are two
   455    kinds of entities:
   455    kinds of entities:
   456    
   456 
   457    #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case.
   457    #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case.
   458    #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define
   458    #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define
   459       URIs.
   459       URIs.
   460    
   460 
   461    As we are working with RDF data, each entity is defined through a series of URIs. Hence,
   461    As we are working with RDF data, each entity is defined through a series of URIs. Hence,
   462    each "relational attribute" [#]_ of an entity is defined via an URI, that is, in CubicWeb
   462    each "relational attribute" [#]_ of an entity is defined via an URI, that is, in CubicWeb
   463    terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above,
   463    terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above,
   464    as such::
   464    as such::
   465         
   465 
   466         ent = store.create_entity(etype, **entity)
   466         ent = store.create_entity(etype, **entity)
   467         
   467 
   468    where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``.
   468    where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``.
   469 
   469 
   470 .. [#] By "relational attribute" we denote an attribute (of an entity) which
   470 .. [#] By "relational attribute" we denote an attribute (of an entity) which
   471        is defined through a relation, e.g. the ``chromosomal_location`` attribute
   471        is defined through a relation, e.g. the ``chromosomal_location`` attribute
   472        of ``Disease`` entities, which is defined through a relation between a
   472        of ``Disease`` entities, which is defined through a relation between a
   473        ``Disease`` and an ``ExternalUri``.
   473        ``Disease`` and an ``ExternalUri``.
   474    
   474 
   475    The ``ExternalUri`` entities are as many as URIs in the data file. For them, we define a unique
   475    The ``ExternalUri`` entities are as many as URIs in the data file. For them, we define a unique
   476    attribute, ``uri``, which holds the URI under discussion::
   476    attribute, ``uri``, which holds the URI under discussion::
   477         
   477 
   478         extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri")
   478         extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri")
   479 
   479 
   480 #. creates the relations between the entities. We have relations between:
   480 #. creates the relations between the entities. We have relations between:
   481    
   481 
   482    #. entities defined in the schema, e.g. between ``Disease`` and ``Gene``
   482    #. entities defined in the schema, e.g. between ``Disease`` and ``Gene``
   483       entities, such as the ``associated_genes`` relation defined for 
   483       entities, such as the ``associated_genes`` relation defined for
   484       ``Disease`` entities.
   484       ``Disease`` entities.
   485    #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``.
   485    #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``.
   486    
   486 
   487    The way relations are added to the database depends on the store: 
   487    The way relations are added to the database depends on the store:
   488    
   488 
   489    - for the stores in the CubicWeb ``dataimport`` module, we only use 
   489    - for the stores in the CubicWeb ``dataimport`` module, we only use
   490      ``store.relate``, in 
   490      ``store.relate``, in
   491      another loop, on the relations (that is, a 
   491      another loop, on the relations (that is, a
   492      loop inside the preceding one, mentioned at step 2)::
   492      loop inside the preceding one, mentioned at step 2)::
   493         
   493 
   494         for rtype, rels in relations.iteritems():
   494         for rtype, rels in relations.iteritems():
   495             ...
   495             ...
   496             
   496 
   497             store.relate(ent.eid(), rtype, extu.eid(), **kwargs)
   497             store.relate(ent.eid(), rtype, extu.eid(), **kwargs)
   498         
   498 
   499      where ``kwargs`` is a dictionary designed to accommodate the need for specifying
   499      where ``kwargs`` is a dictionary designed to accommodate the need for specifying
   500      the type of the subject entity of the relation, when the relation is inlined and
   500      the type of the subject entity of the relation, when the relation is inlined and
   501      ``SQLGenObjectStore`` is used. For example::
   501      ``SQLGenObjectStore`` is used. For example::
   502             
   502 
   503             ...
   503             ...
   504             store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease')
   504             store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease')
   505    
   505 
   506    - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module, 
   506    - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module,
   507      the relations are created in three steps:
   507      the relations are created in three steps:
   508      
   508 
   509      #. first, a table is created for each relation type, as in::
   509      #. first, a table is created for each relation type, as in::
   510             
   510 
   511             ...
   511             ...
   512             store.init_rtype_table(ent.cw_etype, rtype, extu.cw_etype)
   512             store.init_rtype_table(ent.cw_etype, rtype, extu.cw_etype)
   513             
   513 
   514         which comes down to lines such as::
   514         which comes down to lines such as::
   515             
   515 
   516             store.init_rtype_table('Disease', 'associated_genes', 'Gene')
   516             store.init_rtype_table('Disease', 'associated_genes', 'Gene')
   517             store.init_rtype_table('Gene', 'gene_id', 'ExternalUri')
   517             store.init_rtype_table('Gene', 'gene_id', 'ExternalUri')
   518             
   518 
   519      #. second, the URI of each entity will be used as its identifier, in the 
   519      #. second, the URI of each entity will be used as its identifier, in the
   520         ``relate_by_iid`` method, such as::
   520         ``relate_by_iid`` method, such as::
   521             
   521 
   522             disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3'
   522             disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3'
   523             gene_uri = '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2'
   523             gene_uri = '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2'
   524             store.relate_by_iid(disease_uri, 'associated_genes', gene_uri)
   524             store.relate_by_iid(disease_uri, 'associated_genes', gene_uri)
   525             
   525 
   526      #. third, the relations for each relation type will be added to the database, 
   526      #. third, the relations for each relation type will be added to the database,
   527         via the ``convert_relations`` method, such as in::
   527         via the ``convert_relations`` method, such as in::
   528             
   528 
   529             store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri')
   529             store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri')
   530             
   530 
   531         and::
   531         and::
   532             
   532 
   533             store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri')
   533             store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri')
   534             
   534 
   535         where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities
   535         where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities
   536         defined in the data model, and of the ``ExternalUri`` entities, respectively.
   536         defined in the data model, and of the ``ExternalUri`` entities, respectively.
   537 
   537 
   538 #. flushes all relations and entities::
   538 #. flushes all relations and entities::
   539     
   539 
   540     store.flush()
   540     store.flush()
   541 
   541 
   542    which performs the actual commit of the inserted entities and relations in the database.
   542    which performs the actual commit of the inserted entities and relations in the database.
   543 
   543 
   544 If the ``MassiveObjectStore`` is used, then a cleanup of temporary SQL tables should be performed
   544 If the ``MassiveObjectStore`` is used, then a cleanup of temporary SQL tables should be performed
   549 Timing benchmarks
   549 Timing benchmarks
   550 #################
   550 #################
   551 
   551 
   552 In order to time the import script, we just decorate the import function with the ``timed``
   552 In order to time the import script, we just decorate the import function with the ``timed``
   553 decorator::
   553 decorator::
   554     
   554 
   555     from logilab.common.decorators import timed
   555     from logilab.common.decorators import timed
   556     ...
   556     ...
   557     
   557 
   558     @timed
   558     @timed
   559     def diseasome_import(session, filename):
   559     def diseasome_import(session, filename):
   560         ...
   560         ...
   561 
   561 
   562 After running the import function as shown in the "Importing the data" section, we obtain two time measurements::
   562 After running the import function as shown in the "Importing the data" section, we obtain two time measurements::
   563 
   563 
   564     diseasome_import clock: ... / time: ...
   564     diseasome_import clock: ... / time: ...
   565 
   565 
   566 Here, the meanings of these measurements are [#]_:
   566 Here, the meanings of these measurements are [#]_:
   567 
   567 
   568 - ``clock`` is the time spent by CubicWeb, on the server side (i.e. hooks and data pre- / post-processing on SQL 
   568 - ``clock`` is the time spent by CubicWeb, on the server side (i.e. hooks and data pre- / post-processing on SQL
   569   queries),
   569   queries),
   570 
   570 
   571 - ``time`` is the sum between ``clock`` and the time spent in PostGreSQL.
   571 - ``time`` is the sum between ``clock`` and the time spent in PostGreSQL.
   572 
   572 
   573 .. [#] The meanings of the ``clock`` and ``time`` measurements, when using the ``@timed``
   573 .. [#] The meanings of the ``clock`` and ``time`` measurements, when using the ``@timed``
   584 The module accepts two arguments:
   584 The module accepts two arguments:
   585 
   585 
   586 - the data file, introduced by ``-df [--datafile]``, and
   586 - the data file, introduced by ``-df [--datafile]``, and
   587 - the store, introduced by ``-st [--store]``.
   587 - the store, introduced by ``-st [--store]``.
   588 
   588 
   589 The timings (in seconds) for different stores are given in the following table, for 
   589 The timings (in seconds) for different stores are given in the following table, for
   590 importing 4213 ``Disease`` entities and 3919 ``Gene`` entities with the import module
   590 importing 4213 ``Disease`` entities and 3919 ``Gene`` entities with the import module
   591 just described:
   591 just described:
   592 
   592 
   593 +--------------------------+------------------------+--------------------------------+------------+
   593 +--------------------------+------------------------+--------------------------------+------------+
   594 | Store                    | CubicWeb time (clock)  | PostGreSQL time (time - clock) | Total time |
   594 | Store                    | CubicWeb time (clock)  | PostGreSQL time (time - clock) | Total time |
   611 Finally, we have seen four ways of importing data into CubicWeb.
   611 Finally, we have seen four ways of importing data into CubicWeb.
   612 
   612 
   613 Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
   613 Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
   614 ``SQLGenObjectStore`` stores, which have a common API:
   614 ``SQLGenObjectStore`` stores, which have a common API:
   615 
   615 
   616 - ``RQLObjectStore`` is by far the slowest, especially its time spent on the 
   616 - ``RQLObjectStore`` is by far the slowest, especially its time spent on the
   617   CubicWeb side, and so it should be used only for small amounts of 
   617   CubicWeb side, and so it should be used only for small amounts of
   618   "sensitive" data (i.e. where security is a concern).
   618   "sensitive" data (i.e. where security is a concern).
   619 
   619 
   620 - ``NoHookRQLObjectStore`` slashes by almost four the time spent on the CubicWeb side, 
   620 - ``NoHookRQLObjectStore`` slashes by almost four the time spent on the CubicWeb side,
   621   but is also quite slow; on the PostGres side it is as slow as the previous store. 
   621   but is also quite slow; on the PostGres side it is as slow as the previous store.
   622   It should be used for data where security is not a concern,
   622   It should be used for data where security is not a concern,
   623   but consistency (with the data model) is.
   623   but consistency (with the data model) is.
   624 
   624 
   625 - ``SQLGenObjectStore`` slashes by three the time spent on the CubicWeb side and by five the time 
   625 - ``SQLGenObjectStore`` slashes by three the time spent on the CubicWeb side and by five the time
   626   spent on the PostGreSQL side. It should be used for relatively great amounts of data, where
   626   spent on the PostGreSQL side. It should be used for relatively great amounts of data, where
   627   security and data consistency are not a concern. Compared to the previous store, it has the
   627   security and data consistency are not a concern. Compared to the previous store, it has the
   628   disadvantage that, for inlined relations, we must specify their subjects' types.
   628   disadvantage that, for inlined relations, we must specify their subjects' types.
   629 
   629 
   630 For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available
   630 For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available
   631 from the ``dataio`` cube. It provides a blazing performance with respect to all other stores:
   631 from the ``dataio`` cube. It provides a blazing performance with respect to all other stores:
   632 it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than 
   632 it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than
   633 ``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account:
   633 ``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account:
   634 
   634 
   635 #. it cannot insert relations defined as inlined in the schema,
   635 #. it cannot insert relations defined as inlined in the schema,
   636 #. no security or consistency check is performed on the data,
   636 #. no security or consistency check is performed on the data,
   637 #. its API is slightly different from the other stores.
   637 #. its API is slightly different from the other stores.