doc/book/devrepo/dataimport.rst
author Philippe Pepiot <ph@itsalwaysdns.eu>
Tue, 31 Mar 2020 18:22:05 +0200
changeset 12966 6cd938c29ca3
parent 11239 19cacea03fde
permissions -rw-r--r--
.. -*- coding: utf-8 -*-

.. _dataimport:

Dataimport
==========

*CubicWeb* is designed to easily manipulate large amounts of data, and provides
utilities to make imports simple.

The main entry point is :mod:`cubicweb.dataimport.importer`, which defines an
:class:`ExtEntitiesImporter` class responsible for importing data from an external source in the
form of :class:`ExtEntity` objects. An :class:`ExtEntity` is a transitional representation of an
entity to be imported into the CubicWeb instance; building this representation is usually
domain-specific (e.g. dependent on the kind of data source: RDF, CSV, etc.) and is thus the
responsibility of the end user.

Along with the importer, a *store* must be selected, which is responsible for inserting data into
the database. There are different kinds of stores_, which insert data at different
levels of the *CubicWeb* API and offer different speed/security tradeoffs. Stores that keep all the
*CubicWeb* hooks and security checks are slower, but possible insertion errors (bad data types,
integrity errors, ...) are handled.


Example
-------

Consider the following schema snippet:

.. code-block:: python

    from yams.buildobjs import EntityType, RelationDefinition, String

    class Person(EntityType):
        name = String(required=True)

    class knows(RelationDefinition):
        subject = 'Person'
        object = 'Person'

along with some data in a ``people.csv`` file::

    # uri,name,knows
    http://www.example.org/alice,Alice,
    http://www.example.org/bob,Bob,http://www.example.org/alice

The following code (using a ``cubicweb-ctl shell`` context, where ``cnx``, ``schema`` and
``commit`` are expected to be available) defines a function ``extentities_from_csv`` that reads
``Person`` external entities from a CSV file, then calls the :class:`ExtEntitiesImporter` to
insert the corresponding entities and relations into the CubicWeb instance.

.. code-block:: python

    from cubicweb.dataimport import ucsvreader, RQLObjectStore
    from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter

    def extentities_from_csv(fpath):
        """Yield Person ExtEntities read from `fpath` CSV file."""
        with open(fpath) as f:
            # skipfirst=True skips the header line
            for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
                # attribute and relation values are given as sets
                yield ExtEntity('Person', uri,
                                {'name': {name}, 'knows': {knows}})

    extentities = extentities_from_csv('people.csv')
    store = RQLObjectStore(cnx)
    importer = ExtEntitiesImporter(schema, store)
    importer.import_entities(extentities)
    commit()
    # check that Bob knows Alice
    rset = cnx.execute('String N WHERE X name N, X knows Y, Y name "Alice"')
    assert rset[0][0] == u'Bob', rset

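
The CSV-reading step of the example can be tried without a CubicWeb instance. The following is a
minimal sketch under stated assumptions: ``FakeExtEntity`` is a hypothetical stand-in for
:class:`ExtEntity` (it only mirrors the ``etype``, ``extid`` and ``values`` arguments used in the
example), the standard :mod:`csv` module replaces ``ucsvreader``, and leaving out the ``knows``
relation when its cell is empty is a choice made for this sketch, not documented importer
behaviour.

.. code-block:: python

    import csv
    import io
    from collections import namedtuple

    # Hypothetical stand-in for ExtEntity: same three fields as in the example.
    FakeExtEntity = namedtuple('FakeExtEntity', ['etype', 'extid', 'values'])

    PEOPLE_CSV = (
        "# uri,name,knows\n"
        "http://www.example.org/alice,Alice,\n"
        "http://www.example.org/bob,Bob,http://www.example.org/alice\n"
    )

    def fake_extentities_from_csv(stream):
        """Yield one FakeExtEntity per data row of the CSV `stream`."""
        reader = csv.reader(stream)
        next(reader)  # skip the header line
        for uri, name, knows in reader:
            values = {'name': {name}}
            if knows:  # sketch-specific choice: no relation when the cell is empty
                values['knows'] = {knows}
            yield FakeExtEntity('Person', uri, values)

    people = list(fake_extentities_from_csv(io.StringIO(PEOPLE_CSV)))
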

Importer API
------------

.. automodule:: cubicweb.dataimport.importer


Stores
~~~~~~

.. automodule:: cubicweb.dataimport.stores


SQLGenObjectStore
-----------------

This store relies on *COPY FROM*/``executemany`` SQL commands to push data directly into the
database rather than going through the whole *CubicWeb* API. For now, **it only works with
PostgreSQL**, as it requires the *COPY FROM* command.
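
To give an idea of what *COPY FROM* consumes, the sketch below serializes rows into PostgreSQL's
default text format (tab-separated columns, ``\N`` for NULL, backslash escapes). It is only an
illustration: ``copy_text_buffer`` and the sample rows are hypothetical and not part of the store's
API, and the actual store may build its buffers differently.

.. code-block:: python

    import io

    def copy_text_buffer(rows):
        """Serialize `rows` (sequences of values) into PostgreSQL COPY text format."""
        def encode(value):
            if value is None:
                return r'\N'  # NULL marker of the COPY text format
            text = str(value)
            # escape the backslash first, then the column and row delimiters
            for char, escaped in (('\\', '\\\\'), ('\t', '\\t'),
                                  ('\n', '\\n'), ('\r', '\\r')):
                text = text.replace(char, escaped)
            return text

        buffer = io.StringIO()
        for row in rows:
            buffer.write('\t'.join(encode(value) for value in row) + '\n')
        buffer.seek(0)
        return buffer

    # hypothetical (eid, name) rows
    buf = copy_text_buffer([(1, 'Alice'), (2, None), (3, 'tab\there')])
    copy_data = buf.getvalue()

Such a buffer could then be handed to a DB-API cursor, for instance with psycopg2's
``cursor.copy_from(buf, 'some_table')``, where the table name is a placeholder.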