doc/book/devrepo/dataimport.rst
changeset 10491 c67bcee93248
parent 10461 37644c518705
child 10513 7bec01a59f92
10490:76ab3c71aff2 10491:c67bcee93248
       
     1 .. -*- coding: utf-8 -*-
       
     2 
       
     3 .. _dataimport:
       
     4 
       
     5 Dataimport
       
     6 ==========
       
     7 
       
     8 *CubicWeb* is designed to manipulate huge amounts of data, and provides utilities to do so.
       
     9 
       
    10 The main entry point is :mod:`cubicweb.dataimport.importer` which defines an
       
    11 :class:`ExtEntitiesImporter` class responsible for importing data from an external source in the
       
    12 form of :class:`ExtEntity` objects. An :class:`ExtEntity` is a transitional representation of an
       
    13 entity to be imported in the CubicWeb instance; building this representation is usually
       
    14 domain-specific -- e.g. dependent on the kind of data source (RDF, CSV, etc.) -- and is thus the
       
    15 responsibility of the end-user.
       
    16 
       
    17 Along with the importer, a *store* must be selected, which is responsible for insertion of data into
       
    18 the database. There are different kinds of stores_, which insert data at different
       
    19 levels of the *CubicWeb* API and with different speed/security tradeoffs. Stores that keep all the
       
    20 *CubicWeb* hooks and security checks are slower, but they handle possible insertion errors (bad
       
    21 data types, integrity errors, ...).
       
    22 
       
    23 
       
    24 Example
       
    25 -------
       
    26 
       
    27 Consider the following schema snippet.
       
    28 
       
    29 .. code-block:: python
       
    30 
       
    31     class Person(EntityType):
       
    32         name = String(required=True)
       
    33 
       
    34     class knows(RelationDefinition):
       
    35         subject = 'Person'
       
    36         object = 'Person'
       
    37 
       
    38 along with some data in a ``people.csv`` file::
       
    39 
       
    40     # uri,name,knows
       
    41     http://www.example.org/alice,Alice,
       
    42     http://www.example.org/bob,Bob,http://www.example.org/alice
       
    43 
       
    44 The following code (using a shell context) defines a function `extentities_from_csv` to read
       
    45 `Person` external entities coming from a CSV file and calls the :class:`ExtEntitiesImporter` to
       
    46 insert corresponding entities and relations into the CubicWeb instance.
       
    47 
       
    48 .. code-block:: python
       
    49 
       
    50     from cubicweb.dataimport import ucsvreader, RQLObjectStore
       
    51     from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter
       
    52 
       
    53     def extentities_from_csv(fpath):
       
    54         """Yield Person ExtEntities read from `fpath` CSV file."""
       
    55         with open(fpath) as f:
       
    56             for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
       
    57                 yield ExtEntity('Person', uri,
       
    58                                 {'name': set([name]), 'knows': set([knows])})
       
    59 
       
    60     extentities = extentities_from_csv('people.csv')
       
    61     store = RQLObjectStore(cnx)
       
    62     importer = ExtEntitiesImporter(schema, store)
       
    63     importer.import_entities(extentities)
       
    64     commit()
       
    65     rset = cnx.execute('String N WHERE X name N, X knows Y, Y name "Alice"')
       
    66     assert rset[0][0] == u'Bob', rset
       
    67 
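The CSV-reading step above can also be tried without a CubicWeb instance. The following is only
a sketch: ``FakeExtEntity`` is a hypothetical stand-in for :class:`ExtEntity` (a plain named
tuple, not the real class), and the standard ``csv`` module replaces ``ucsvreader``. It also
omits the ``knows`` value when the column is empty, as it is for Alice; whether the real importer
needs this is not stated here, so treat it as an assumption.

.. code-block:: python

    import csv
    import io
    from collections import namedtuple

    # Hypothetical stand-in for cubicweb's ExtEntity, for illustration only.
    FakeExtEntity = namedtuple('FakeExtEntity', 'etype extid values')

    CSV_DATA = u"""# uri,name,knows
    http://www.example.org/alice,Alice,
    http://www.example.org/bob,Bob,http://www.example.org/alice
    """.replace('\n    ', '\n')  # drop the documentation indent

    def extentities_from_rows(stream):
        """Yield one FakeExtEntity per CSV row read from `stream`."""
        reader = csv.reader(stream)
        next(reader)  # skip the header line
        for uri, name, knows in reader:
            values = {'name': {name}}
            if knows:  # leave the relation out when the column is empty
                values['knows'] = {knows}
            yield FakeExtEntity('Person', uri, values)

    entities = list(extentities_from_rows(io.StringIO(CSV_DATA)))

Each value is a set because an attribute or relation of an external entity may hold several
values; here every set holds at most one.
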
       
    68 Importer API
       
    69 ------------
       
    70 
       
    71 .. automodule:: cubicweb.dataimport.importer
       
    72 
       
    73 
       
    74 Stores
       
    75 ~~~~~~
       
    76 
       
    77 Stores are responsible for inserting properly formatted entities and relations into the database. They
       
    78 have the following API::
       
    79 
       
    80     >>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
       
    81     >>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
       
    82     >>> store.prepare_insert_relation(user_eid, 'in_group', group_eid)
       
    83     >>> store.flush()
       
    84     >>> store.commit()
       
    85     >>> store.finish()
       
    86 
       
    87 Some stores **require a flush** to copy data into the database, so store-independent code
       
    88 should call ``flush()`` explicitly. (There may be multiple flushes during the
       
    89 process, or only one at the end if memory is not an issue.) This is different from the
       
    90 commit, which validates the database transaction. Finally, the ``finish()`` method should be called in
       
    91 case the store requires additional work once everything is done.
       
    92 
       
    93 * ``prepare_insert_entity(<entity type>, **kwargs) -> eid``: given an entity
       
    94   type, attributes and inlined relations, return the eid of the entity to be
       
    95   inserted, *with no guarantee that anything has been inserted in database*.
       
    96 
       
    97 * ``prepare_update_entity(<entity type>, eid, **kwargs) -> None``: given an
       
    98   entity type and eid, schedule an update of the given attributes and inlined
       
    99   relations, *with no guarantee that anything has been written to database*.
       
   100 
       
   101 * ``prepare_insert_relation(eid_from, rtype, eid_to) -> None``: indicate that a
       
   102   relation ``rtype`` should be added between entities with eids ``eid_from``
       
   103   and ``eid_to``. Similar to ``prepare_insert_entity()``, *there is no
       
   104   guarantee that the relation has been inserted in database*.
       
   105 
       
   106 * ``flush() -> None``: flush any temporary data to database. May be called
       
   107   several times during an import.
       
   108 
       
   109 * ``commit() -> None``: commit the database transaction.
       
   110 
       
   111 * ``finish() -> None``: perform any additional work required once the import is complete.
       
   112 
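To make the difference between the ``prepare_*`` calls and ``flush()`` concrete, here is a
minimal sketch of a store exposing the methods listed above. ``InMemoryStore`` is a hypothetical
class written for this illustration, not one of the stores shipped with *CubicWeb*: it buffers
every operation and only applies it to its internal dictionaries on ``flush()``.

.. code-block:: python

    import itertools


    class InMemoryStore(object):
        """Hypothetical store that buffers operations until flush()."""

        def __init__(self):
            self._eids = itertools.count(1)
            self._pending = []      # operations not yet flushed
            self.entities = {}      # eid -> (etype, attributes)
            self.relations = set()  # (eid_from, rtype, eid_to)
            self.finished = False

        def prepare_insert_entity(self, etype, **kwargs):
            eid = next(self._eids)  # the eid is known before anything is stored
            self._pending.append(('entity', eid, etype, kwargs))
            return eid

        def prepare_update_entity(self, etype, eid, **kwargs):
            self._pending.append(('update', eid, etype, kwargs))

        def prepare_insert_relation(self, eid_from, rtype, eid_to):
            self._pending.append(('relation', eid_from, rtype, eid_to))

        def flush(self):
            # Apply buffered operations in the order they were requested.
            for op in self._pending:
                if op[0] == 'entity':
                    _, eid, etype, attrs = op
                    self.entities[eid] = (etype, dict(attrs))
                elif op[0] == 'update':
                    _, eid, etype, attrs = op
                    self.entities[eid][1].update(attrs)
                else:
                    self.relations.add(op[1:])
            self._pending = []

        def commit(self):
            pass  # no transaction to validate in this sketch

        def finish(self):
            self.finished = True


    store = InMemoryStore()
    user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
    group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
    store.prepare_insert_relation(user_eid, 'in_group', group_eid)
    assert not store.entities  # nothing is stored before the flush
    store.flush()
    store.commit()
    store.finish()

The eids are returned immediately so that relations can be prepared before any entity has
reached storage, which is why the API gives no guarantee about insertion until ``flush()``.
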
       
   113 ObjectStore
       
   114 -----------
       
   115 
       
   116 This store keeps objects in memory for *faster* validation. It may be useful in development
       
   117 mode. However, since it neither enforces the constraints of the schema nor inserts anything in the
       
   118 database, it may miss some problems.
       
   119 
       
   120 
       
   121 RQLObjectStore
       
   122 --------------
       
   123 
       
   124 This store works with an actual RQL repository, and it may be used in production mode.
       
   125 
       
   126 
       
   127 NoHookRQLObjectStore
       
   128 --------------------
       
   129 
       
   130 This store works similarly to the *RQLObjectStore* but bypasses some *CubicWeb* hooks to be faster.
       
   131 
       
   132 
       
   133 SQLGenObjectStore
       
   134 -----------------
       
   135 
       
   136 This store relies on *COPY FROM* and ``executemany`` SQL commands to push data directly into the database
       
   137 rather than going through the whole *CubicWeb* API. For now, **it only works with PostgreSQL** as it requires
       
   138 the *COPY FROM* command.
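
The idea of pushing rows in bulk can be sketched with the standard library alone. The snippet
below is not the store's implementation: it uses ``sqlite3`` as a stand-in database (which has no
*COPY FROM*), and the ``person`` table and its columns are made up for the illustration. It only
shows the ``executemany`` half of the approach, where one statement is run over many rows.

.. code-block:: python

    import sqlite3

    # Rows prepared in memory, as a store would buffer them before a flush.
    rows = [(1, u'Alice'), (2, u'Bob')]

    cnx = sqlite3.connect(':memory:')
    cnx.execute('CREATE TABLE person (eid INTEGER PRIMARY KEY, name TEXT)')
    # One parameterized statement, executed once per buffered row.
    cnx.executemany('INSERT INTO person (eid, name) VALUES (?, ?)', rows)
    cnx.commit()

    names = [name for (name,) in
             cnx.execute('SELECT name FROM person ORDER BY eid')]

Writing rows straight to tables this way skips hooks and security checks, which is the
speed/security tradeoff described at the top of this page.
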