.. -*- coding: utf-8 -*-

.. _dataimport:

Dataimport
==========

*CubicWeb* is designed to manipulate huge amounts of data, and provides utilities to do so.

The main entry point is :mod:`cubicweb.dataimport.importer`, which defines an
:class:`ExtEntitiesImporter` class responsible for importing data from an external source in the
form of :class:`ExtEntity` objects. An :class:`ExtEntity` is a transitional representation of an
entity to be imported into the CubicWeb instance; building this representation is usually
domain-specific (that is, dependent on the kind of data source: RDF, CSV, etc.) and is thus the
responsibility of the end user.

Along with the importer, a *store* must be selected, which is responsible for inserting data into
the database. Different kinds of stores_ exist; they insert data at different levels of the
*CubicWeb* API, with different speed/security tradeoffs. Stores that keep all the *CubicWeb* hooks
and security checks are slower, but they handle possible insertion errors (bad data types,
integrity errors, ...).


Example
-------

Consider the following schema snippet.

.. code-block:: python

    from yams.buildobjs import EntityType, RelationDefinition, String

    class Person(EntityType):
        name = String(required=True)

    class knows(RelationDefinition):
        subject = 'Person'
        object = 'Person'

along with some data in a ``people.csv`` file::

    # uri,name,knows
    http://www.example.org/alice,Alice,
    http://www.example.org/bob,Bob,http://www.example.org/alice

The following code (using a shell context) defines a function `extentities_from_csv` that reads
`Person` external entities from a CSV file, then calls the :class:`ExtEntitiesImporter` to
insert the corresponding entities and relations into the CubicWeb instance.

.. code-block:: python

    from cubicweb.dataimport import ucsvreader, RQLObjectStore
    from cubicweb.dataimport.importer import ExtEntity, ExtEntitiesImporter

    def extentities_from_csv(fpath):
        """Yield Person ExtEntities read from `fpath` CSV file."""
        with open(fpath) as f:
            for uri, name, knows in ucsvreader(f, skipfirst=True, skip_empty=False):
                # Entity type and keys match the schema above (Person, name, knows).
                yield ExtEntity('Person', uri,
                                {'name': set([name]), 'knows': set([knows])})

    # `cnx`, `schema` and `commit` come from the shell context.
    extentities = extentities_from_csv('people.csv')
    store = RQLObjectStore(cnx)
    importer = ExtEntitiesImporter(schema, store)
    importer.import_entities(extentities)
    commit()
    rset = cnx.execute('String N WHERE X name N, X knows Y, Y name "Alice"')
    assert rset[0][0] == u'Bob', rset

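The mapping from CSV rows to external entities can be tried out without a CubicWeb instance.
The sketch below is only an illustration under stated assumptions: ``FakeExtEntity`` is a
hypothetical stand-in for :class:`ExtEntity` (a plain named tuple, not the CubicWeb class), the
standard ``csv`` module replaces ``ucsvreader``, and omitting the ``knows`` key for empty cells is
an assumption about what one may want, not documented importer behaviour.

.. code-block:: python

    import csv
    import io
    from collections import namedtuple

    # Hypothetical stand-in for cubicweb.dataimport.importer.ExtEntity.
    FakeExtEntity = namedtuple('FakeExtEntity', ['etype', 'extid', 'values'])

    PEOPLE_CSV = (
        "# uri,name,knows\n"
        "http://www.example.org/alice,Alice,\n"
        "http://www.example.org/bob,Bob,http://www.example.org/alice\n"
    )

    def fake_extentities_from_csv(stream):
        """Yield one FakeExtEntity per data row of `stream`."""
        reader = csv.reader(stream)
        next(reader)  # skip the header row
        for uri, name, knows in reader:
            values = {'name': {name}}
            if knows:  # assumption: an empty cell means "no relation"
                values['knows'] = {knows}
            yield FakeExtEntity('Person', uri, values)

    people = list(fake_extentities_from_csv(io.StringIO(PEOPLE_CSV)))

Each value is a set because an attribute or relation may carry several values; the external
identifier (here the URI) is what lets ``knows`` refer to another row.
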
Importer API
------------

.. automodule:: cubicweb.dataimport.importer


Stores
~~~~~~

Stores are responsible for inserting properly formatted entities and relations into the database. They
have the following API::

    >>> user_eid = store.prepare_insert_entity('CWUser', login=u'johndoe')
    >>> group_eid = store.prepare_insert_entity('CWGroup', name=u'unknown')
    >>> store.prepare_insert_relation(user_eid, 'in_group', group_eid)
    >>> store.flush()
    >>> store.commit()
    >>> store.finish()

Some stores **require a flush** to copy data into the database, so if you want store-independent
code you should call ``flush()`` explicitly. (There may be multiple flushes during the
process, or only one at the end if memory is not an issue.) This is different from the
commit, which validates the database transaction. Finally, the `finish()` method should be called in
case the store requires additional work once everything is done.

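The call order described above (periodic flushes, then one commit, then ``finish()``) might be
wrapped in a small helper. This is a minimal sketch, not part of CubicWeb: ``RecordingStore`` is a
hypothetical stub that only logs which methods are called, and ``import_in_batches`` and its
``batch_size`` parameter are names introduced here for illustration.

.. code-block:: python

    class RecordingStore:
        """Hypothetical stub that only records which store methods are called."""

        def __init__(self):
            self.calls = []
            self._next_eid = 1

        def prepare_insert_entity(self, etype, **kwargs):
            eid = self._next_eid
            self._next_eid += 1
            self.calls.append('prepare_insert_entity')
            return eid

        def flush(self):
            self.calls.append('flush')

        def commit(self):
            self.calls.append('commit')

        def finish(self):
            self.calls.append('finish')


    def import_in_batches(store, rows, batch_size=2):
        """Insert (etype, attributes) pairs, flushing every `batch_size` rows."""
        eids = []
        for count, (etype, attrs) in enumerate(rows, start=1):
            eids.append(store.prepare_insert_entity(etype, **attrs))
            if count % batch_size == 0:
                store.flush()  # bound memory use on large imports
        store.flush()   # push whatever is still buffered
        store.commit()  # validate the database transaction
        store.finish()  # let the store do any final work
        return eids


    store = RecordingStore()
    rows = [('CWGroup', {'name': 'g%d' % i}) for i in range(3)]
    eids = import_in_batches(store, rows)

Because the helper only relies on the generic store methods, the same function could in principle
be handed any store; whether the intermediate flushes actually do anything depends on the store.
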
* ``prepare_insert_entity(<entity type>, **kwargs) -> eid``: given an entity
  type, attributes and inlined relations, return the eid of the entity to be
  inserted, *with no guarantee that anything has been inserted in database*.

* ``prepare_update_entity(<entity type>, eid, **kwargs) -> None``: given an
  entity type and eid, promise to update the given attributes and inlined
  relations, *with no guarantee that anything has been updated in database*.

* ``prepare_insert_relation(eid_from, rtype, eid_to) -> None``: indicate that a
  relation ``rtype`` should be added between entities with eids ``eid_from``
  and ``eid_to``. Similar to ``prepare_insert_entity()``, *there is no
  guarantee that the relation has been inserted in database*.

* ``flush() -> None``: flush any temporary data to database. May be called
  several times during an import.

* ``commit() -> None``: commit the database transaction.

* ``finish() -> None``: additional work to do once the import is finished.

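To make the "no guarantee that anything has been inserted" wording concrete, here is a toy store
that buffers everything until ``flush()``. It is a sketch under assumptions, not a CubicWeb class:
``BufferedStore`` and its ``database`` dictionary are hypothetical, and real stores are free to
write earlier or later than this one does.

.. code-block:: python

    import itertools


    class BufferedStore:
        """Hypothetical in-memory store; `database` stands in for a real one."""

        def __init__(self):
            self._eids = itertools.count(1)
            self._pending_entities = []
            self._pending_relations = []
            self.database = {'entities': {}, 'relations': set()}

        def prepare_insert_entity(self, etype, **kwargs):
            eid = next(self._eids)  # the eid is known before anything is written
            self._pending_entities.append((eid, etype, kwargs))
            return eid

        def prepare_insert_relation(self, eid_from, rtype, eid_to):
            self._pending_relations.append((eid_from, rtype, eid_to))

        def flush(self):
            # Move buffered entities and relations into the fake database.
            for eid, etype, attrs in self._pending_entities:
                self.database['entities'][eid] = (etype, attrs)
            self.database['relations'].update(self._pending_relations)
            self._pending_entities = []
            self._pending_relations = []


    store = BufferedStore()
    user_eid = store.prepare_insert_entity('CWUser', login='johndoe')
    group_eid = store.prepare_insert_entity('CWGroup', name='unknown')
    store.prepare_insert_relation(user_eid, 'in_group', group_eid)
    nothing_written_yet = not store.database['entities']
    store.flush()

The eids returned by ``prepare_insert_entity()`` are usable immediately to declare relations, even
though the toy database stays empty until the flush.
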
ObjectStore
-----------

This store keeps objects in memory for *faster* validation. It may be useful in development
mode. However, since it neither enforces the constraints of the schema nor inserts anything into the
database, it may miss some problems.


RQLObjectStore
--------------

This store works with an actual RQL repository, and it may be used in production mode.


NoHookRQLObjectStore
--------------------

This store works similarly to the *RQLObjectStore* but bypasses some *CubicWeb* hooks to be faster.


SQLGenObjectStore
-----------------

This store relies on *COPY FROM* and execute-many SQL commands to push data directly into the
database rather than going through the whole *CubicWeb* API. For now, **it only works with
PostgreSQL**, as it requires the *COPY FROM* command.