doc/tutorials/dataimport/data_import_tutorial.rst
       
     1 Importing relational data into a CubicWeb instance
       
     2 ==================================================
       
     3 
       
     4 Introduction
       
     5 ~~~~~~~~~~~~
       
     6 
       
     7 This tutorial explains how to import data from an external source (e.g. a collection of files) 
       
     8 into a CubicWeb instance.
       
     9 
       
    10 First, once we know the format of the data we wish to import, we devise a 
       
    11 *data model*, that is, a CubicWeb (Yams) schema which reflects the way the data
       
    12 is structured. This schema is implemented in the ``schema.py`` file.
       
    13 In this tutorial, we will describe such a schema for a particular data set, 
       
    14 the Diseasome data (see below).
       
    15 
       
    16 Once the schema is defined, we create a cube and an instance. 
       
    17 The cube is a specification of an application, whereas an instance 
       
    18 is the application per se. 
       
    19 
       
    20 Once the schema is defined and the instance is created, the import can be performed, via
       
    21 the following steps:
       
    22 
       
    23 1. Build a custom parser for the data to be imported. Thus, one obtains a Python
       
    24    memory representation of the data.
       
    25 
       
    26 2. Map the parsed data to the data model defined in ``schema.py``.
       
    27 
       
    28 3. Perform the actual import of the data. This comes down to "populating"
       
    29    the data model with the memory representation obtained at 1, according to
       
    30    the mapping defined at 2.
       
    31 
       
    32 This tutorial illustrates all the above steps in the context of relational data
       
    33 stored in the RDF format.
       
    34 
       
    35 More specifically, we describe the import of Diseasome_ RDF/OWL data.
       
    36 
       
    37 .. _Diseasome: http://datahub.io/dataset/fu-berlin-diseasome
       
    38 
       
    39 Building a data model
       
    40 ~~~~~~~~~~~~~~~~~~~~~
       
    41 
       
    42 The first thing to do when using CubicWeb for creating an application from scratch
       
    43 is to devise a *data model*, that is, a relational representation of the problem to be
       
    44 modeled or of the structure of the data to be imported. 
       
    45 
       
    46 In such a schema, we define
       
    47 an entity type (``EntityType`` objects) for each type of entity to import. Each such type
       
    48 has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers,
       
    49 strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()``
       
    50 for an attribute named ``attribute`` which is an integer. 
       
    51 
       
    52 Each such type also has a set of

    53 relations, which are defined like the attributes, except that they represent

    54 relations between entities of the type under discussion and entities of the type

    55 specified in the relation definition. 
       
    56 
       
    57 For example, for the Diseasome data, we have two types of entities, genes and diseases.
       
    58 Thus, we create two classes which inherit from ``EntityType``::
       
    59 
       
    60     class Disease(EntityType):
       
    61         # Corresponds to http://www.w3.org/2000/01/rdf-schema#label
       
    62         label = String(maxsize=512, fulltextindexed=True)
       
    63         ...
       
    64 
       
    65         #Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene
       
    66         associated_genes = SubjectRelation('Gene', cardinality='**')
       
    67         ...
       
    68 
       
    69         #Corresponds to 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/chromosomalLocation'
       
    70         chromosomal_location = SubjectRelation('ExternalUri', cardinality='?*', inlined=True)
       
    71 
       
    72 
       
    73     class Gene(EntityType):
       
    74         ...
       
    75 
       
    76 In this schema, there are attributes whose values are numbers or strings. Thus, they are 
       
    77 defined using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=512)``. 
       
    78 These types can have several constraints or attributes, such as ``maxsize``. 
       
    79 There are also relations, either between the entity types themselves, or between them
       
    80 and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in 
       
    81 CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between 
       
    82 a ``Disease`` entity and an ``ExternalUri`` entity. The relation is declared with the CubicWeb /

    83 Yams ``SubjectRelation`` construct. The latter can take several optional keyword arguments, such as

    84 ``cardinality``, which specifies the number of subjects and objects related by the relation type 

    85 specified. For example, the ``'?*'`` cardinality in the ``chromosomal_location`` relation type says

    86 that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entity.
       
    87 In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the

    88 ``chromosomal_location`` relation type, and we can have zero or more ``Disease`` entities in the

    89 database. 

    90 As for relations between the entity types themselves, the ``associated_genes`` relation between

    91 ``Disease`` and ``Gene`` entities is defined with ``'**'`` cardinality, so that any number of ``Gene``

    92 entities can be associated with a ``Disease``, and a ``Gene`` can be associated with any number of ``Disease`` entities.
       
    93 
       
    94 Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them::
       
    95 
       
    96     
       
    97     from yams.buildobjs import EntityType, SubjectRelation, String, Int
       
    98     from cubicweb.schemas.base import ExternalUri
       
    99 
       
   100 Building a custom data parser
       
   101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       
   102 
       
   103 The data we wish to import is structured in the RDF format,
       
   104 as a text file containing a set of lines. 
       
   105 On each line, there are three fields. 
       
   106 The first two fields are URIs ("Uniform Resource Identifiers"). 

   107 The third field is either a URI or a string. Each field bears a particular meaning:
       
   108 
       
   109 - the leftmost field is a URI that identifies the entity to be imported. 
       
   110   Note that the entities defined in the data model (i.e., in ``schema.py``) should 
       
   111   correspond to the entities whose URIs are specified in the import file.
       
   112 
       
   113 - the middle field is a URI that holds a relation whose subject is the entity 
       
   114   defined by the leftmost field. Note that this should also correspond
       
   115   to the definitions in the data model.
       
   116 
       
   117 - the rightmost field is either a URI or a string. When this field is a URI, 
       
   118   it gives the object of the relation defined by the middle field.
       
   119   When the rightmost field is a string, the middle field is interpreted as an attribute
       
   120   of the subject (introduced by the leftmost field) and the rightmost field is
       
   121   interpreted as the value of the attribute.
       
   122 
       
   123 Note however that some attributes (i.e. relations whose objects are strings) 
       
   124 have their objects defined as strings followed by ``^^`` and another URI

   125 (a datatype annotation); we ignore this part.
       
   126 
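Stripping such a datatype suffix is a one-line operation; for instance (an illustrative helper, not part of the tutorial's modules):

```python
def strip_datatype(value):
    """Drop a trailing ``^^<datatype-uri>`` annotation from an RDF literal,
    keeping only the quoted value itself."""
    return value.split('^^', 1)[0]
```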
       
   127 Let us show some examples:
       
   128 
       
   129 - a line holding an attribute definition:
       
   130   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1> 
       
   131   <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .``
       
   132   The line contains the definition of the ``label`` attribute of an
       
   133   entity of type ``gene``. The value of ``label`` is '``CYP17A1``'.
       
   134 
       
   135 - a line holding a relation definition:
       
   136   ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1> 
       
   137   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene> 
       
   138   <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .``
       
   139   The line contains the definition of the ``associatedGene`` relation between
       
   140   a ``disease`` subject entity identified by ``1`` and a ``gene`` object 
       
   141   entity defined by ``HADH2``.
       
   142 
       
   143 Thus, to parse the data, we can (see the ``diseasome_parser`` module):
       
   144 
       
   145 1. define a couple of regular expressions for parsing the two kinds of lines, 
       
   146    ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing
       
   147    the relation definitions.
       
   148 
       
   149 2. define a generator function that iterates through the lines of the file and

   150    yields a (subject, relation, object) tuple for each line.
       
   151    We called it ``_retrieve_structure`` in the ``diseasome_parser`` module.
       
   152    The function needs the file name and the types for which information
       
   153    should be retrieved.
       
   154 
       
   155 Alternatively, instead of hand-making the parser, one could use the RDF parser provided
       
   156 in the ``dataio`` cube.
       
   157 
       
   158 .. XXX To further study and detail the ``dataio`` cube usage.
       
   159 
       
   160 Once we have the (subject, relation, object) triples, we need to map them into
       
   161 the data model.
       
   162 
       
   163 
       
   164 Mapping the data to the schema
       
   165 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       
   166 
       
   167 In the case of diseasome data, we can just define two dictionaries for mapping
       
   168 the names of the relations as extracted by the parser, to the names of the relations
       
   169 as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module 
       
   170 they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``. 
       
   171 Given that the relation and attribute names are in CamelCase in the original data,

   172 mappings are necessary if we follow PEP 8 when naming the attributes in the data model.
       
   173 For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation 
       
   174 ``chromosomal_location``.
       
   175 
       
   176 Once these mappings have been defined, we just iterate over the (subject, relation, object)
       
   177 tuples provided by the parser and we extract the entities, with their attributes and relations.
       
   178 For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``.
       
   179 The value associated to the ``attributes`` key is a dictionary containing (attribute: value) 
       
   180 pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of 
       
   181 the entity itself.
       
   182 The value associated to the ``relations`` key is a dictionary containing (relation: value)
       
   183 pairs, where "value" is a URI.
       
   184 This is implemented in the ``entities_from_rdf`` interface function of the module 
       
   185 ``diseasome_parser``. This function provides an iterator on the dictionaries containing
       
   186 the ``attributes`` and ``relations`` keys for all entities.
       
   187 
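To make this mapping step concrete, here is a schematic version. The function name and the reduced mapping dictionaries below are hypothetical; the real ``MAPPING_ATTS`` / ``MAPPING_RELS`` dictionaries and the ``entities_from_rdf`` function in ``diseasome_parser`` are richer, and we assume here that each relation maps to a list of object URIs.

```python
# Illustrative excerpts of the mapping dictionaries (CamelCase -> PEP 8 names).
MAPPING_ATTS = {'label': 'label'}
MAPPING_RELS = {'associatedGene': 'associated_genes',
                'chromosomalLocation': 'chromosomal_location'}

def entities_from_triples(triples):
    """Group (subject, relation, object) triples into per-entity dictionaries
    with ``attributes`` and ``relations`` keys."""
    entities = {}
    for subject, relation, obj in triples:
        entity = entities.setdefault(
            subject, {'attributes': {'cwuri': subject}, 'relations': {}})
        # the relation name is the last component of the relation URI
        rel_name = relation.rsplit('/', 1)[-1].rsplit('#', 1)[-1]
        if obj.startswith('http://'):
            # a URI object: this is a relation
            if rel_name in MAPPING_RELS:
                entity['relations'].setdefault(
                    MAPPING_RELS[rel_name], []).append(obj)
        elif rel_name in MAPPING_ATTS:
            # a string object: this is an attribute value
            entity['attributes'][MAPPING_ATTS[rel_name]] = obj
    return entities.values()
```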
       
   188 However, this is a simple case. In real life, things can get much more complicated, and the 
       
   189 mapping can be far from trivial, especially when several data sources (which can follow 
       
   190 different formatting and even structuring conventions) must be mapped into the same data model.
       
   191 
       
   192 Importing the data
       
   193 ~~~~~~~~~~~~~~~~~~
       
   194 
       
   195 The data import code should be placed in a Python module. Let us call it 
       
   196 ``diseasome_import.py``. Then, this module should be called via
       
   197 ``cubicweb-ctl``, as follows::
       
   198 
       
   199     cubicweb-ctl shell diseasome_import.py -- <other arguments e.g. data file>
       
   200 
       
   201 In the import module, we should use a *store* for doing the import.
       
   202 A store is an object which provides three kinds of methods for
       
   203 importing data:
       
   204 
       
   205 - a method for importing the entities, along with the values
       
   206   of their attributes.
       
   207 - a method for importing the relations between the entities.
       
   208 - a method for committing the imports to the database.
       
   209 
       
   210 In CubicWeb, we have four stores:
       
   211 
       
   212 1. ``ObjectStore``: the base class for the stores in CubicWeb.

   213    It provides a skeleton for all the other stores, along with

   214    the means for creating the memory structures

   215    (dictionaries) that hold the entities and the relations

   216    between them.
       
   217 
       
   218 2. ``RQLObjectStore``: store which uses the RQL language for performing
       
   219    database insertions and updates. It relies on all the CubicWeb hooks 
       
   220    machinery, especially for dealing with security issues (database access
       
   221    permissions).
       
   222 
       
   223 3. ``NoHookRQLObjectStore``: store which uses the RQL language for
       
   224    performing database insertions and updates, but for which 
       
   225    all hooks are deactivated. This implies that 
       
   226    certain checks with respect to the CubicWeb / Yams schema 
       
   227    (data model) are not performed. However, all SQL queries 
       
   228    obtained from the RQL ones are executed in a sequential
       
   229    manner, one query per inserted entity.
       
   230 
       
   231 4. ``SQLGenObjectStore``: store which uses the SQL language directly. 
       
   232    It inserts entities either sequentially, by executing an SQL query 
       
   233    for each entity, or directly by using one PostgreSQL ``COPY FROM`` 
       
   234    query for a set of similarly structured entities. 
       
   235 
       
   236 For really massive imports (millions or billions of entities), there
       
   237 is a cube ``dataio`` which contains another store, called 
       
   238 ``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``,
       
   239 except that anything related to CubicWeb is bypassed. That is, even the
       
   240 CubicWeb EID entity identifiers are not handled. This store is the fastest,
       
   241 but has a slightly different API from the four stores mentioned above.
       
   242 Moreover, it has an important limitation, in that it doesn't insert inlined [#]_
       
   243 relations in the database. 
       
   244 
       
   245 .. [#] An inlined relation is a relation defined in the schema
       
   246        with the keyword argument ``inlined=True``. Such a relation
       
   247        is inserted in the database as an attribute of the entity
       
   248        whose subject it is.
       
   249 
       
   250 In the following section we will see how to import data by using the stores
       
   251 in CubicWeb's ``dataimport`` module.
       
   252 
       
   253 Using the stores in ``dataimport``
       
   254 ++++++++++++++++++++++++++++++++++
       
   255 
       
   256 ``ObjectStore`` is seldom used in real life for importing data, since it is
       
   257 only the base store for the other stores and it doesn't perform an actual
       
   258 import of the data. Nevertheless, the other three stores, which import data,
       
   259 are based on ``ObjectStore`` and provide the same API.
       
   260 
       
   261 All three stores ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
       
   262 ``SQLGenObjectStore`` provide exactly the same API for importing data, that is
       
   263 entities and relations, in an SQL database. 
       
   264 
       
   265 Before using a store, one must import the ``dataimport`` module and then initialize 
       
   266 the store, with the current ``session`` as a parameter::
       
   267 
       
   268     import cubicweb.dataimport as cwdi
       
   269     ...
       
   270 
       
   271     store = cwdi.RQLObjectStore(session)
       
   272 
       
   273 Each such store provides three methods for data import:
       
   274 
       
   275 #. ``create_entity(Etype, **attributes)``, which allows us to add
       
   276    an entity of the Yams type ``Etype`` to the database. This entity's attributes
       
   277    are specified in the ``attributes`` dictionary. The method returns the entity 
       
   278    created in the database. For example, we add two entities,
       
   279    a person, of ``Person`` type, and a location, of ``Location`` type::
       
   280 
       
   281         person = store.create_entity('Person', name='Toto', age='18', height='190')
       
   282 
       
   283         location = store.create_entity('Location', town='Paris', arrondissement='13')
       
   284 
       
   285 #. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation
       
   286    of the Yams type ``r_type`` to the database. The relation's subject is an entity
       
   287    whose EID is ``subject_eid``; its object is another entity, whose EID is 
       
   288    ``object_eid``.  For example [#]_::
       
   289 
       
   290         store.relate(person.eid(), 'lives_in', location.eid(), **kwargs)
       
   291 
       
   292    ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here
       
   293    to allow us to specify the type of the subject of the relation, when the relation is
       
   294    defined as inlined in the schema. 
       
   295 
       
   296 .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns
       
   297        the entity identifier as assigned by CubicWeb when creating the entity.
       
   298        This only works for entities defined via the stores in the CubicWeb's
       
   299        ``dataimport`` module.
       
   300 
       
   301    The keyword argument that is understood by ``SQLGenObjectStore`` is called
       
   302    ``subjtype`` and holds the type of the subject entity. For the example considered here,
       
   303    this comes to having [#]_::
       
   304 
       
   305         store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.dc_type())
       
   306 
       
   307    If ``subjtype`` is not specified, then the store tries to infer the type of the subject.
       
   308    However, this doesn't always work, e.g. when there are several possible subject types
       
   309    for a given relation type. 
       
   310 
       
   311 .. [#] The ``dc_type`` method of an entity defined via ``create_entity`` returns
       
   312        the type of the entity just created. This only works for entities defined via
       
   313        the stores in the CubicWeb's ``dataimport`` module. In the example considered
       
   314        here, ``person.dc_type()`` returns ``'Person'``.
       
   315     
       
   316    All the stores other than ``SQLGenObjectStore`` ignore the ``kwargs`` parameters.
       
   317 
       
   318 #. ``flush()``, which allows us to perform the actual commit into the database, along
       
   319    with some cleanup operations. Ideally, this method should be called as often as 
       
   320    possible, that is after each insertion in the database, so that database sessions
       
   321    are kept as atomic as possible. In practice, we usually call this method twice: 
       
   322    first, after all the entities have been created, and second, after all relations have
       
   323    been created. 
       
   324 
       
   325    Note however that before each commit the database insertions
       
   326    have to be consistent with the schema. Thus, if, for instance,
       
   327    an entity has an attribute defined through a relation (viz.
       
   328    a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object 
       
   329    cardinality, we have to create the entity under discussion,
       
   330    the object entity of the relation under discussion, and the
       
   331    relation itself, before committing the additions to the database.
       
   332 
       
   333    The ``flush`` method is simply called as::
       
   334 
       
   335         store.flush()
       
   336 
       
   337 
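Putting the three methods together, the overall calling pattern looks as follows. The ``FakeStore`` class below is a deliberately trivial in-memory stand-in, used only to illustrate the API shape outside a running instance; a real import would instantiate e.g. ``cwdi.RQLObjectStore(session)`` instead.

```python
class FakeEntity:
    """Stand-in for the entity object returned by ``create_entity``."""
    def __init__(self, eid):
        self._eid = eid
    def eid(self):
        return self._eid

class FakeStore:
    """In-memory stand-in mimicking the store API described above
    (``create_entity`` / ``relate`` / ``flush``); for illustration only."""
    def __init__(self):
        self.entities, self.relations, self.committed = [], [], False
    def create_entity(self, etype, **attributes):
        self.entities.append((etype, attributes))
        return FakeEntity(len(self.entities))  # eids assigned sequentially
    def relate(self, subject_eid, r_type, object_eid, **kwargs):
        self.relations.append((subject_eid, r_type, object_eid))
    def flush(self):
        self.committed = True

# The calling pattern, as with the real stores:
store = FakeStore()
person = store.create_entity('Person', name='Toto', age='18', height='190')
location = store.create_entity('Location', town='Paris', arrondissement='13')
store.relate(person.eid(), 'lives_in', location.eid())
store.flush()
```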
       
   338 Using the ``MassiveObjectStore`` in the ``dataio`` cube
       
   339 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
       
   340 
       
   341 This store, available in the ``dataio`` cube, allows us to
       
   342 fully dispense with the CubicWeb import mechanisms and hence
       
   343 to interact directly with the database server, via SQL queries.
       
   344 
       
   345 Moreover, these queries rely on PostgreSQL's ``COPY FROM`` instruction
       
   346 to create several entities in a single query. This brings tremendous 
       
   347 performance improvements with respect to the RQL-based data insertion
       
   348 procedures.
       
   349 
       
   350 However, the API of this store is slightly different from the API of
       
   351 the stores in CubicWeb's ``dataimport`` module.
       
   352 
       
   353 Before using the store, one has to import the ``dataio`` cube's 
       
   354 ``dataimport`` module, then initialize the store by giving it the
       
   355 ``session`` parameter::
       
   356 
       
   357     from cubes.dataio import dataimport as mcwdi
       
   358     ...
       
   359 
       
   360     store = mcwdi.MassiveObjectStore(session)
       
   361 
       
   362 The ``MassiveObjectStore`` provides six methods for inserting data
       
   363 into the database:
       
   364 
       
   365 #. ``init_rtype_table(SubjEtype, r_type, ObjEtype)``, which specifies the
       
   366    creation of the tables associated to the relation types in the database.
       
   367    Each such table has three columns: the type of the subject entity, the
       
   368    type of the relation (that is, the name of the attribute in the subject
       
   369    entity which is defined via the relation), and the type of the object
       
   370    entity. For example::
       
   371 
       
   372         store.init_rtype_table('Person', 'lives_in', 'Location')
       
   373 
       
   374    Please note that these tables can be created before the entities, since
       
   375    they only specify their types, not their unique identifiers.
       
   376 
       
   377 #. ``create_entity(Etype, **attributes)``, which allows us to add new entities,
       
   378    whose attributes are given in the ``attributes`` dictionary. 
       
   379    Please note however that, by default, this method does *not* return 
       
   380    the created entity. The method is called, for example, as in::
       
   381 
       
   382         store.create_entity('Person', name='Toto', age='18', height='190', 
       
   383                             uri='http://link/to/person/toto_18_190')
       
   384         store.create_entity('Location', town='Paris', arrondissement='13',
       
   385                             uri='http://link/to/location/paris_13')
       
   386    
       
   387    In order to be able to link these entities via the relations when needed,
       
   388    we must ourselves provide a means of uniquely identifying the entities.
       
   389    In general, this is done via URIs, stored in attributes like ``uri`` or
       
   390    ``cwuri``. The name of the attribute is irrelevant as long as its value is
       
   391    unique for each entity.
       
   392 
       
   393 #. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually 
       
   394    relate the entities uniquely identified by ``subject_iid`` and 
       
   395    ``object_iid`` via a relation of type ``r_type``. For example::
       
   396 
       
   397         store.relate_by_iid('http://link/to/person/toto_18_190',
       
   398                             'lives_in',
       
   399                             'http://link/to/location/paris_13')
       
   400 
       
   401    Please note that this method does *not* work for inlined relations!
       
   402 
       
   403 #. ``convert_relations(SubjEtype, r_type, ObjEtype, subj_iid_attribute,
       
   404    obj_iid_attribute)``
       
   405    allows us to actually insert
       
   406    the relations in the database. At one call of this method, one inserts
       
   407    all the relations of type ``r_type`` between entities of given types.
       
   408    ``subj_iid_attribute`` and ``obj_iid_attribute`` are the names
       
   409    of the attributes which store the unique identifiers of the entities,
       
   410    as assigned by the user. These names can be identical, as long as
       
   411    their values are unique. For example, for inserting all relations
       
   412    of type ``lives_in`` between ``People`` and ``Location`` entities,
       
   413    we write::
       
   414         
       
   415         store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri')
       
   416 
       
   417 #. ``flush()`` performs the actual commit in the database. It only needs 
       
   418    to be called after ``create_entity`` and ``relate_by_iid`` calls. 
       
   419    Please note that ``relate_by_iid`` does *not* perform insertions into
       
   420    the database, hence calling ``flush()`` for it would have no effect.
       
   421 
       
   422 #. ``cleanup()`` performs database cleanups, by removing temporary tables.
       
   423    It should only be called at the end of the import.
       
   424 
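The division of labour between ``relate_by_iid`` and ``convert_relations`` can be illustrated in plain Python (a schematic model, not the ``dataio`` implementation): ``relate_by_iid`` merely records identifier pairs, and ``convert_relations`` later resolves them to eids, much like the SQL join performed by the real store.

```python
pending_relations = []  # (subject_iid, r_type, object_iid) rows, unresolved

def relate_by_iid(subject_iid, r_type, object_iid):
    # record the application-level identifiers; nothing touches the database yet
    pending_relations.append((subject_iid, r_type, object_iid))

def convert_relations(entities, iid_attribute='uri'):
    """Resolve the recorded identifier pairs against ``entities``, a mapping
    eid -> attribute dictionary, and return (subject_eid, r_type, object_eid)
    triples -- the analogue of the join behind the real ``convert_relations``."""
    iid_to_eid = {attrs[iid_attribute]: eid for eid, attrs in entities.items()}
    return [(iid_to_eid[s], r, iid_to_eid[o]) for s, r, o in pending_relations]
```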
       
   425 
       
   426 
       
   427 .. XXX to add smth on the store's parameter initialization.
       
   428 
       
   429 
       
   430 
       
   431 Application to the Diseasome data
       
   432 +++++++++++++++++++++++++++++++++
       
   433 
       
   434 Import setup
       
   435 ############
       
   436 
       
   437 We define an import function, ``diseasome_import``, which basically does five things:
       
   438 
       
   439 #. creates and initializes the store to be used, via a line such as::
       
   440     
       
   441         store = cwdi.SQLGenObjectStore(session)
       
   442    
       
   443    where ``cwdi`` is the imported ``cubicweb.dataimport`` or 
       
   444    ``cubes.dataio.dataimport``.
       
   445 
       
   446 #. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the 
       
   447    ``diseasome_parser`` module and iterates on its result, in a line such as::
       
   448         
       
   449         for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')):
       
   450         
       
   451    where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the 
       
   452    name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``.
       
   453 
       
   454 #. creates the entities to be inserted in the database; for Diseasome, there are two 
       
   455    kinds of entities:
       
   456    
       
   457    #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case.
       
   458    #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define
       
   459       URIs.
       
   460    
       
   461    As we are working with RDF data, each entity is defined through a series of URIs. Hence,
       
   462    each "relational attribute" [#]_ of an entity is defined via a URI, that is, in CubicWeb
       
   463    terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above,
       
   464    as such::
       
   465         
       
   466         ent = store.create_entity(etype, **entity)
       
   467         
       
   468    where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``.
       
   469 
       
   470 .. [#] By "relational attribute" we denote an attribute (of an entity) which
       
   471        is defined through a relation, e.g. the ``chromosomal_location`` attribute
       
   472        of ``Disease`` entities, which is defined through a relation between a
       
   473        ``Disease`` and an ``ExternalUri``.
       
   474    
       
   475    There are as many ``ExternalUri`` entities as there are URIs in the data file. For them, we define a single

   476    attribute, ``uri``, which holds the URI under discussion::
       
   477         
       
   478         extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri")
       
   479 
       
   480 #. creates the relations between the entities. We have relations between:
       
   481    
       
   482    #. entities defined in the schema, e.g. between ``Disease`` and ``Gene``
       
   483       entities, such as the ``associated_genes`` relation defined for 
       
   484       ``Disease`` entities.
       
   485    #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``.
       
   486    
       
   487    The way relations are added to the database depends on the store: 
       
   488    
       
   489    - for the stores in the CubicWeb ``dataimport`` module, we only use 
       
   490      ``store.relate``, in 
       
   491      another loop, on the relations (that is, a 
       
   492      loop inside the preceding one, mentioned at step 2)::
       
   493         
       
   494         for rtype, rels in relations.iteritems():
       
   495             ...
       
   496             
       
   497             store.relate(ent.eid(), rtype, extu.eid(), **kwargs)
       
   498         
       
   499      where ``kwargs`` is a dictionary designed to accommodate the need for specifying
       
   500      the type of the subject entity of the relation, when the relation is inlined and
       
   501      ``SQLGenObjectStore`` is used. For example::
       
   502             
       
   503             ...
       
   504             store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease')
       
   505    
       
   - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module,
     the relations are created in three steps:

     #. first, a table is created for each relation type, as in::

            ...
            store.init_rtype_table(ent.dc_type(), rtype, extu.dc_type())

        which comes down to lines such as::

            store.init_rtype_table('Disease', 'associated_genes', 'Gene')
            store.init_rtype_table('Gene', 'gene_id', 'ExternalUri')

     #. second, the URI of each entity is used as its identifier, by the
        ``relate_by_iid`` method, as in::

            disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3'
            gene_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2'
            store.relate_by_iid(disease_uri, 'associated_genes', gene_uri)

     #. third, the relations for each relation type are added to the database,
        via the ``convert_relations`` method, as in::

            store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri')

        and::

            store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri')

        where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities
        defined in the data model and of the ``ExternalUri`` entities, respectively.


#. flushes all entities and relations::

    store.flush()

   which performs the actual commit of the inserted entities and relations to the database.

If the ``MassiveObjectStore`` is used, a cleanup of temporary SQL tables should be performed
at the end of the import::

    store.cleanup()

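The nested entity/relation loops described above can be sketched with a minimal,
self-contained stand-in store. The ``StubStore`` class and the sample data below are
hypothetical, for illustration only: they record entities and relations in memory
instead of writing to a database, but follow the same ``create_entity`` /
``relate`` call pattern as the CubicWeb ``dataimport`` stores::

    # Hypothetical in-memory stand-in for a dataimport store, for
    # illustration only: create_entity() and relate() just record
    # their arguments instead of writing to a database.
    class StubStore(object):
        def __init__(self):
            self.entities = []   # list of (etype, attributes) pairs
            self.relations = []  # list of (subject eid, rtype, object eid)

        def create_entity(self, etype, **attrs):
            self.entities.append((etype, attrs))
            return len(self.entities) - 1  # fake eid

        def relate(self, subj_eid, rtype, obj_eid, **kwargs):
            self.relations.append((subj_eid, rtype, obj_eid))

    def import_one(store, etype, attrs, relations):
        """Create one entity, then run the inner loop over its relations."""
        eid = store.create_entity(etype, **attrs)
        for rtype, target_eids in relations.items():
            for target_eid in target_eids:
                store.relate(eid, rtype, target_eid)
        return eid

    store = StubStore()
    gene = store.create_entity('Gene', label='HSD3B2')
    import_one(store, 'Disease', {'label': '3-beta-HSD deficiency'},
               {'associated_genes': [gene]})

With a real store, ``create_entity`` returns an entity object whose ``eid()`` is
passed to ``relate``, as in the snippets above; the flow is otherwise the same.
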
       
Timing benchmarks
#################

In order to time the import script, we just decorate the import function with the ``timed``
decorator::

    from logilab.common.decorators import timed
    ...

    @timed
    def diseasome_import(session, filename):
        ...
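
To see what such a decorator does, here is a minimal home-grown analogue, written
for Python 3 (this is not the logilab implementation, and the exact output format
is an assumption): it measures CPU time and wall-clock time around each call::

    import functools
    import time

    def simple_timed(func):
        """Minimal analogue of a timing decorator: record the CPU time
        ('clock') and wall-clock time ('time') of each call."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            cpu0, wall0 = time.process_time(), time.time()
            result = func(*args, **kwargs)
            wrapper.clock = time.process_time() - cpu0
            wrapper.time = time.time() - wall0
            print('%s clock: %.6f / time: %.6f'
                  % (func.__name__, wrapper.clock, wrapper.time))
            return result
        return wrapper

    @simple_timed
    def busy_sum(n):
        return sum(range(n))
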
       
After running the import function as shown in the "Importing the data" section, we obtain
two time measurements::

    diseasome_import clock: ... / time: ...

The meanings of these measurements are [#]_:

- ``clock`` is the time spent by CubicWeb, on the server side (i.e. hooks and pre- /
  post-processing of SQL queries),

- ``time`` is the sum of ``clock`` and the time spent in PostgreSQL.

.. [#] The meanings of the ``clock`` and ``time`` measurements, when using the ``@timed``
       decorator, were taken from `a blog post on massive data import in CubicWeb`_.

.. _a blog post on massive data import in CubicWeb: http://www.cubicweb.org/blogentry/2116712

The import function is put in an import module, named ``diseasome_import`` here. The module is called
directly from the CubicWeb shell, as follows::

    cubicweb-ctl shell diseasome_instance diseasome_import.py \
    -- -df diseasome_import_file.nt -st StoreName

The module accepts two arguments:

- the data file, introduced by ``-df [--datafile]``, and
- the store, introduced by ``-st [--store]``.

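Inside the script itself, those two options could be parsed as sketched below. This is
a minimal sketch using the standard ``argparse`` module, not the tutorial's actual
parsing code; only the option names are taken from the command line above::

    import argparse

    def parse_options(argv):
        """Parse the -df/--datafile and -st/--store options (option names
        taken from the command line above; the rest is illustrative)."""
        parser = argparse.ArgumentParser(description='Diseasome data import')
        parser.add_argument('-df', '--datafile', required=True,
                            help='path to the data file to import')
        parser.add_argument('-st', '--store', default='RQLObjectStore',
                            help='name of the store to use for the import')
        return parser.parse_args(argv)

    opts = parse_options(['-df', 'diseasome_import_file.nt',
                          '-st', 'MassiveObjectStore'])
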
       
The timings (in seconds) for the different stores are given in the following table, for
importing 4213 ``Disease`` entities and 3919 ``Gene`` entities with the import module
just described:

+--------------------------+------------------------+--------------------------------+------------+
| Store                    | CubicWeb time (clock)  | PostgreSQL time (time - clock) | Total time |
+==========================+========================+================================+============+
| ``RQLObjectStore``       | 225.98                 | 62.05                          | 288.03     |
+--------------------------+------------------------+--------------------------------+------------+
| ``NoHookRQLObjectStore`` | 62.73                  | 51.38                          | 114.11     |
+--------------------------+------------------------+--------------------------------+------------+
| ``SQLGenObjectStore``    | 20.41                  | 11.03                          | 31.44      |
+--------------------------+------------------------+--------------------------------+------------+
| ``MassiveObjectStore``   | 4.84                   | 6.93                           | 11.77      |
+--------------------------+------------------------+--------------------------------+------------+

Conclusions
~~~~~~~~~~~

In this tutorial we have seen how to import data into a CubicWeb application instance. We have first seen how to
create a schema, then how to create a parser for the data and a mapping of the data to the schema.
Finally, we have seen four ways of importing data into CubicWeb.

Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and
``SQLGenObjectStore`` stores, which share a common API:

- ``RQLObjectStore`` is by far the slowest, especially in the time spent on the
  CubicWeb side, and so it should be used only for small amounts of
  "sensitive" data (i.e. where security is a concern).

- ``NoHookRQLObjectStore`` cuts the time spent on the CubicWeb side by almost a factor
  of four, but is still quite slow; on the PostgreSQL side it is as slow as the previous store.
  It should be used for data where security is not a concern,
  but consistency (with the data model) is.

- ``SQLGenObjectStore`` cuts the time spent on the CubicWeb side by a factor of three and the time
  spent on the PostgreSQL side by a factor of five. It should be used for relatively large amounts of data, where
  security and data consistency are not a concern. Compared to the previous store, it has the
  disadvantage that, for inlined relations, we must specify their subjects' types.

For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available
from the ``dataio`` cube. It provides blazing performance with respect to all other stores:
it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than
``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account:

#. it cannot insert relations defined as inlined in the schema,
#. no security or consistency checks are performed on the data,
#. its API is slightly different from that of the other stores.

Hence, this store should be used when security and data consistency are not a concern,
and there are no inlined relations in the schema.