doc/book/en/devrepo/dataimport.rst
author Yann Voté <yann.vote@logilab.fr>
Fri, 26 Jun 2015 16:09:27 +0200
changeset 10460 d260722f2453
parent 10457 1f5026e7d848
child 10461 37644c518705
permissions -rw-r--r--
[dataimport] introduce the importer and extentity classes This introduces the ``ExtEntity`` class which is a transitional state between data at external source and the actual CubicWeb entities. ``ExtEntitiesImporter`` is then in charge to turn a bunch of ext entities into CW entities in repository, using a given store. This changeset also introduces ``SimpleImportLog`` and ``HTMLImportLog`` which implement the CW DataImportLog interface in order to show log messages in UI using simple text and HTML formats respectively, instead of storing these messages in database. Both have mostly been backported from cubes.skos.dataimport. Closes #5414753.
8625
7ee0752178e5 [dataimport] Add SQL Store for faster import - works ONLY with Postgres for now, as it requires "copy from" command - closes #2410822
Vincent Michel <vincent.michel@logilab.fr>
parents:
.. -*- coding: utf-8 -*-

.. _dataimport:

Dataimport
==========

*CubicWeb* is designed to manipulate huge amounts of data, and provides utilities to do so. They
let you insert data at different levels of the *CubicWeb* API, with different speed/security
tradeoffs. The utilities that keep all the *CubicWeb* hooks and security checks are slower, but
possible insertion errors (bad data types, integrity errors, ...) will be raised.

These data import utilities are provided in the package `cubicweb.dataimport`.

The API is built on top of the following concepts:

* `Store`, class responsible for inserting values in the backend database

* `ExtEntity`, an intermediate representation of the data to import, identified by an external
  identifier rather than an eid, and usually with a slightly different representation than the
  associated entity's schema

* `Generator`, class or function that yields `ExtEntity` objects from some data source (e.g. RDF, CSV)

* `Importer`, class responsible for turning each `ExtEntity`'s extid into an eid, creating or
  updating entities accordingly, and possibly controlling the insertion order of entities before
  feeding them to a `Store`
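
The division of labor between these concepts can be sketched with plain Python objects. This is
only an illustration under stated assumptions: ``ToyExtEntity``, ``csv_generator``, ``ToyStore``
and ``ToyImporter`` are hypothetical stand-ins written for this example, not the classes shipped
in `cubicweb.dataimport`, and the ``Person`` entity type and CSV columns are made up.

.. code-block:: python

    import csv
    import io
    from collections import namedtuple

    # Hypothetical stand-in for an ext entity: a type, an external id, raw values.
    ToyExtEntity = namedtuple('ToyExtEntity', 'etype extid values')


    def csv_generator(stream):
        """Generator: yield one ToyExtEntity per CSV row."""
        for row in csv.DictReader(stream):
            yield ToyExtEntity('Person', row['id'], {'name': row['name']})


    class ToyStore(object):
        """Store: hands out eids and records what it was asked to insert or update."""

        def __init__(self):
            self.entities = {}
            self._next_eid = 1

        def prepare_insert_entity(self, etype, **kwargs):
            eid = self._next_eid
            self._next_eid += 1
            self.entities[eid] = (etype, dict(kwargs))
            return eid

        def prepare_update_entity(self, etype, eid, **kwargs):
            self.entities[eid][1].update(kwargs)


    class ToyImporter(object):
        """Importer: map extids to eids, creating or updating as needed."""

        def __init__(self, store):
            self.store = store
            self.extid2eid = {}

        def import_entities(self, ext_entities):
            for ext in ext_entities:
                eid = self.extid2eid.get(ext.extid)
                if eid is None:
                    # First time this external id is seen: create the entity.
                    eid = self.store.prepare_insert_entity(ext.etype, **ext.values)
                    self.extid2eid[ext.extid] = eid
                else:
                    # Already known: update the existing entity instead.
                    self.store.prepare_update_entity(ext.etype, eid, **ext.values)


    # The external id "x1" appears twice: once for creation, once for update.
    data = io.StringIO(u"id,name\nx1,Alice\nx2,Bob\nx1,Alicia\n")
    store = ToyStore()
    importer = ToyImporter(store)
    importer.import_entities(csv_generator(data))

In this sketch the generator knows nothing about eids and the store knows nothing about external
identifiers; only the importer holds the mapping between the two.
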

Stores
~~~~~~

Stores are responsible for inserting properly formatted entities and relations into the database. They
have the following API::

    >>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
    >>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
    >>> store.prepare_insert_relation(user_eid, 'in_group', group_eid)
    >>> store.flush()
    >>> store.commit()
    >>> store.finish()

Some stores **require a flush** to copy data into the database, so if you want store-independent
code you should call `flush()` explicitly. (There may be multiple flushes during the process, or
only one at the end if memory is not an issue.) This is different from the commit, which validates
the database transaction. Finally, the `finish()` method should be called in case the store requires
additional work once everything is done.
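
A store-independent import loop along these lines might look as follows. This is a minimal sketch,
not code from *CubicWeb*: ``RecordingStore`` is a hypothetical object that only logs the calls it
receives, and ``import_logins`` and its ``batch_size`` parameter are names invented for the example.

.. code-block:: python

    class RecordingStore(object):
        """Hypothetical store that only logs calls; not a CubicWeb class."""

        def __init__(self):
            self.calls = []
            self._next_eid = 1

        def prepare_insert_entity(self, etype, **kwargs):
            self.calls.append('insert')
            eid = self._next_eid
            self._next_eid += 1
            return eid

        def flush(self):
            self.calls.append('flush')

        def commit(self):
            self.calls.append('commit')

        def finish(self):
            self.calls.append('finish')


    def import_logins(store, logins, batch_size=2):
        """Insert one CWUser per login, flushing every ``batch_size`` entities."""
        eids = []
        for count, login in enumerate(logins, 1):
            eids.append(store.prepare_insert_entity('CWUser', login=login))
            if count % batch_size == 0:
                store.flush()  # bound memory use on stores that buffer data
        store.flush()   # push whatever is still buffered
        store.commit()  # validate the database transaction
        store.finish()  # let the store do its post-import work
        return eids


    store = RecordingStore()
    eids = import_logins(store, [u'alice', u'bob', u'carol'])

Because the function only relies on the methods described in this section, the same loop should
work unchanged whether the store writes immediately or buffers until ``flush()``.
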

* ``prepare_insert_entity(<entity type>, **kwargs) -> eid``: given an entity
  type, attributes and inlined relations, return the eid of the entity to be
  inserted, *with no guarantee that anything has been inserted in database*.

* ``prepare_update_entity(<entity type>, eid, **kwargs) -> None``: given an
  entity type and eid, promise to update the given attributes and inlined
  relations, *with no guarantee that anything has been updated in database*.

* ``prepare_insert_relation(eid_from, rtype, eid_to) -> None``: indicate that a
  relation ``rtype`` should be added between entities with eids ``eid_from``
  and ``eid_to``. Similar to ``prepare_insert_entity()``, *there is no
  guarantee that the relation has been inserted in database*.

* ``flush() -> None``: flush any temporary data to database. May be called
  several times during an import.

* ``commit() -> None``: commit the database transaction.

* ``finish() -> None``: perform any additional work once the import is finished.
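
To make the "no guarantee" wording concrete, here is a small sketch of an object exposing the six
methods above. ``BufferedStore`` is a hypothetical class backed by plain lists, written for this
example only; the real stores talk to a repository or a database.

.. code-block:: python

    class BufferedStore(object):
        """Hypothetical store implementing the interface above with plain lists."""

        def __init__(self):
            self.pending = []    # operations promised but not yet written
            self.database = []   # operations written by flush()
            self.committed = False
            self._next_eid = 1

        def prepare_insert_entity(self, etype, **kwargs):
            eid = self._next_eid
            self._next_eid += 1
            self.pending.append(('insert', etype, eid, kwargs))
            return eid  # the eid is known before anything is written

        def prepare_update_entity(self, etype, eid, **kwargs):
            self.pending.append(('update', etype, eid, kwargs))

        def prepare_insert_relation(self, eid_from, rtype, eid_to):
            self.pending.append(('relate', eid_from, rtype, eid_to))

        def flush(self):
            self.database.extend(self.pending)
            del self.pending[:]

        def commit(self):
            self.committed = True

        def finish(self):
            pass  # nothing left to do for this toy store


    store = BufferedStore()
    user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
    group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
    store.prepare_insert_relation(user_eid, 'in_group', group_eid)
    written_before_flush = len(store.database)  # still empty at this point
    store.flush()
    written_after_flush = len(store.database)
    store.commit()
    store.finish()

Here the eids are available as soon as ``prepare_insert_entity()`` returns, while the three
operations only reach ``store.database`` once ``flush()`` has been called.
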

ObjectStore
-----------

This store keeps objects in memory for *faster* validation. It may be useful in development
mode. However, as it neither enforces the constraints of the schema nor inserts anything in the
database, it may miss some problems.

RQLObjectStore
--------------

This store works with an actual RQL repository, and it may be used in production mode.

NoHookRQLObjectStore
--------------------

This store works similarly to the *RQLObjectStore* but bypasses some *CubicWeb* hooks to be faster.

SQLGenObjectStore
-----------------

This store relies on *COPY FROM*/execute-many SQL commands to push data directly to the database
rather than going through the whole *CubicWeb* API. For now, **it only works with PostgreSQL**, as
it requires the *COPY FROM* command.
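
For readers unfamiliar with *COPY FROM*, the sketch below builds the kind of tab-separated text
buffer that PostgreSQL accepts in its default text format. It illustrates the data format only and
is not taken from the store's implementation; ``escape_copy_value`` and ``rows_to_copy_buffer`` are
hypothetical helpers, and the table and column names in the final comment are made up.

.. code-block:: python

    import io


    def escape_copy_value(value):
        """Render one value in PostgreSQL's COPY text format."""
        if value is None:
            return u'\\N'  # COPY's default marker for NULL
        text = u'%s' % (value,)
        # Backslash first, so the escapes added below are not escaped again.
        text = text.replace(u'\\', u'\\\\')
        return text.replace(u'\t', u'\\t').replace(u'\n', u'\\n').replace(u'\r', u'\\r')


    def rows_to_copy_buffer(rows):
        """Build a file-like object holding one tab-separated line per row."""
        buf = io.StringIO()
        for row in rows:
            buf.write(u'\t'.join(escape_copy_value(value) for value in row))
            buf.write(u'\n')
        buf.seek(0)
        return buf


    rows = [(1, u'johndoe', None), (2, u'tab\there', u'x')]
    buf = rows_to_copy_buffer(rows)
    # With psycopg2, such a buffer could then be sent in a single call, e.g.
    # cursor.copy_from(buf, 'some_table', columns=('eid', 'login', 'note'))

Sending many rows in one such call avoids a round trip per row, which is where most of the speed
gain of this approach comes from.
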

ExtEntity and Importer
~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: cubicweb.dataimport.importer