39 Building a data model |
39 Building a data model |
40 ~~~~~~~~~~~~~~~~~~~~~ |
40 ~~~~~~~~~~~~~~~~~~~~~ |
41 |
41 |
42 The first thing to do when using CubicWeb for creating an application from scratch |
42 The first thing to do when using CubicWeb for creating an application from scratch |
43 is to devise a *data model*, that is, a relational representation of the problem to be |
43 is to devise a *data model*, that is, a relational representation of the problem to be |
44 modeled or of the structure of the data to be imported. |
44 modeled or of the structure of the data to be imported. |
45 |
45 |
46 In such a schema, we define |
46 In such a schema, we define |
47 an entity type (``EntityType`` objects) for each type of entity to import. Each such type |
47 an entity type (``EntityType`` objects) for each type of entity to import. Each such type |
48 has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers, |
48 has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers, |
49 strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()`` |
49 strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()`` |
50 for an attribute named ``attribute`` which is an integer. |
50 for an attribute named ``attribute`` which is an integer. |
51 |
51 |
52 Each such type also has a set of |
52 Each such type also has a set of |
53 relations, which are defined like the attributes, except that they represent, in fact, |
53 relations, which are defined like the attributes, except that they represent, in fact, |
54 relations between the entities of the type under discussion and the objects of a type which |
54 relations between the entities of the type under discussion and the objects of a type which |
55 is specified in the relation definition. |
55 is specified in the relation definition. |
56 |
56 |
57 For example, for the Diseasome data, we have two types of entities, genes and diseases. |
57 For example, for the Diseasome data, we have two types of entities, genes and diseases. |
58 Thus, we create two classes which inherit from ``EntityType``:: |
58 Thus, we create two classes which inherit from ``EntityType``:: |
59 |
59 |
60 class Disease(EntityType): |
60 class Disease(EntityType): |
71 |
71 |
72 |
72 |
73 class Gene(EntityType): |
73 class Gene(EntityType): |
74 ... |
74 ... |
75 |
75 |
76 In this schema, there are attributes whose values are numbers or strings. Thus, they are |
76 In this schema, there are attributes whose values are numbers or strings. Thus, they are |
77 defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=12)``. |
77 defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=12)``. |
78 These types can have several constraints or attributes, such as ``maxsize``. |
78 These types can have several constraints or attributes, such as ``maxsize``. |
79 There are also relations, either between the entity types themselves, or between them |
79 There are also relations, either between the entity types themselves, or between them |
80 and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in |
80 and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in |
81 CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between |
81 CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between |
82 a ``Disease`` entity and an ``ExternalUri`` entity. The relation is marked by the CubicWeb / |
82 a ``Disease`` entity and an ``ExternalUri`` entity. The relation is marked by the CubicWeb / |
83 Yams ``SubjectRelation`` method. The latter can have several optional keyword arguments, such as |
83 Yams ``SubjectRelation`` method. The latter can have several optional keyword arguments, such as |
84 ``cardinality`` which specifies the number of subjects and objects related by the relation type |
84 ``cardinality`` which specifies the number of subjects and objects related by the relation type |
85 specified. For example, the ``'?*'`` cardinality in the ``chromosomal_relation`` relation type says |
85 specified. For example, the ``'?*'`` cardinality in the ``chromosomal_relation`` relation type says |
86 that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities. |
86 that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities. |
87 In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the |
87 In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the |
88 ``chromosomal_location`` relation type, and that we can have zero or more ``Disease`` entities in the |
88 ``chromosomal_location`` relation type, and that we can have zero or more ``Disease`` entities in the |
89 data base. |
89 data base. |
90 For a relation between the entity types themselves, the ``associated_genes`` between a ``Disease`` |
90 For a relation between the entity types themselves, the ``associated_genes`` between a ``Disease`` |
91 entity and a ``Gene`` entity is defined, so that any number of ``Gene`` entities can be associated |
91 entity and a ``Gene`` entity is defined, so that any number of ``Gene`` entities can be associated |
92 to a ``Disease``, and there can be any number of ``Disease`` s if a ``Gene`` exists. |
92 to a ``Disease``, and there can be any number of ``Disease`` s if a ``Gene`` exists. |
93 |
93 |
94 Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them:: |
94 Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them:: |
95 |
95 |
96 |
96 |
97 from yams.buildobjs import EntityType, SubjectRelation, String, Int |
97 from yams.buildobjs import EntityType, SubjectRelation, String, Int |
98 from cubicweb.schemas.base import ExternalUri |
98 from cubicweb.schemas.base import ExternalUri |
99 |
99 |
100 Building a custom data parser |
100 Building a custom data parser |
101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
101 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
102 |
102 |
103 The data we wish to import is structured in the RDF format, |
103 The data we wish to import is structured in the RDF format, |
104 as a text file containing a set of lines. |
104 as a text file containing a set of lines. |
105 On each line, there are three fields. |
105 On each line, there are three fields. |
106 The first two fields are URIs ("Universal Resource Identifiers"). |
106 The first two fields are URIs ("Universal Resource Identifiers"). |
107 The third field is either an URI or a string. Each field bares a particular meaning: |
107 The third field is either an URI or a string. Each field bares a particular meaning: |
108 |
108 |
109 - the leftmost field is an URI that holds the entity to be imported. |
109 - the leftmost field is an URI that holds the entity to be imported. |
110 Note that the entities defined in the data model (i.e., in ``schema.py``) should |
110 Note that the entities defined in the data model (i.e., in ``schema.py``) should |
111 correspond to the entities whose URIs are specified in the import file. |
111 correspond to the entities whose URIs are specified in the import file. |
112 |
112 |
113 - the middle field is an URI that holds a relation whose subject is the entity |
113 - the middle field is an URI that holds a relation whose subject is the entity |
114 defined by the leftmost field. Note that this should also correspond |
114 defined by the leftmost field. Note that this should also correspond |
115 to the definitions in the data model. |
115 to the definitions in the data model. |
116 |
116 |
117 - the rightmost field is either an URI or a string. When this field is an URI, |
117 - the rightmost field is either an URI or a string. When this field is an URI, |
118 it gives the object of the relation defined by the middle field. |
118 it gives the object of the relation defined by the middle field. |
119 When the rightmost field is a string, the middle field is interpreted as an attribute |
119 When the rightmost field is a string, the middle field is interpreted as an attribute |
120 of the subject (introduced by the leftmost field) and the rightmost field is |
120 of the subject (introduced by the leftmost field) and the rightmost field is |
121 interpreted as the value of the attribute. |
121 interpreted as the value of the attribute. |
122 |
122 |
123 Note however that some attributes (i.e. relations whose objects are strings) |
123 Note however that some attributes (i.e. relations whose objects are strings) |
124 have their objects defined as strings followed by ``^^`` and by another URI; |
124 have their objects defined as strings followed by ``^^`` and by another URI; |
125 we ignore this part. |
125 we ignore this part. |
126 |
126 |
127 Let us show some examples: |
127 Let us show some examples: |
128 |
128 |
129 - of line holding an attribute definition: |
129 - of line holding an attribute definition: |
130 ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1> |
130 ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/CYP17A1> |
131 <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .`` |
131 <http://www.w3.org/2000/01/rdf-schema#label> "CYP17A1" .`` |
132 The line contains the definition of the ``label`` attribute of an |
132 The line contains the definition of the ``label`` attribute of an |
133 entity of type ``gene``. The value of ``label`` is '``CYP17A1``'. |
133 entity of type ``gene``. The value of ``label`` is '``CYP17A1``'. |
134 |
134 |
135 - of line holding a relation definition: |
135 - of line holding a relation definition: |
136 ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1> |
136 ``<http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/1> |
137 <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene> |
137 <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene> |
138 <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .`` |
138 <http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HADH2> .`` |
139 The line contains the definition of the ``associatedGene`` relation between |
139 The line contains the definition of the ``associatedGene`` relation between |
140 a ``disease`` subject entity identified by ``1`` and a ``gene`` object |
140 a ``disease`` subject entity identified by ``1`` and a ``gene`` object |
141 entity defined by ``HADH2``. |
141 entity defined by ``HADH2``. |
142 |
142 |
143 Thus, for parsing the data, we can (:note: see the ``diseasome_parser`` module): |
143 Thus, for parsing the data, we can (:note: see the ``diseasome_parser`` module): |
144 |
144 |
145 1. define a couple of regular expressions for parsing the two kinds of lines, |
145 1. define a couple of regular expressions for parsing the two kinds of lines, |
146 ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing |
146 ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing |
147 the relation definitions. |
147 the relation definitions. |
148 |
148 |
149 2. define a function that iterates through the lines of the file and retrieves |
149 2. define a function that iterates through the lines of the file and retrieves |
150 (``yield`` s) a (subject, relation, object) tuple for each line. |
150 (``yield`` s) a (subject, relation, object) tuple for each line. |
164 Mapping the data to the schema |
164 Mapping the data to the schema |
165 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
165 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
166 |
166 |
167 In the case of diseasome data, we can just define two dictionaries for mapping |
167 In the case of diseasome data, we can just define two dictionaries for mapping |
168 the names of the relations as extracted by the parser, to the names of the relations |
168 the names of the relations as extracted by the parser, to the names of the relations |
169 as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module |
169 as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module |
170 they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``. |
170 they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``. |
171 Given that the relation and attribute names are given in CamelCase in the original data, |
171 Given that the relation and attribute names are given in CamelCase in the original data, |
172 mappings are necessary if we follow the PEP08 when naming the attributes in the data model. |
172 mappings are necessary if we follow the PEP08 when naming the attributes in the data model. |
173 For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation |
173 For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation |
174 ``chromosomal_location``. |
174 ``chromosomal_location``. |
175 |
175 |
176 Once these mappings have been defined, we just iterate over the (subject, relation, object) |
176 Once these mappings have been defined, we just iterate over the (subject, relation, object) |
177 tuples provided by the parser and we extract the entities, with their attributes and relations. |
177 tuples provided by the parser and we extract the entities, with their attributes and relations. |
178 For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``. |
178 For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``. |
179 The value associated to the ``attributes`` key is a dictionary containing (attribute: value) |
179 The value associated to the ``attributes`` key is a dictionary containing (attribute: value) |
180 pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of |
180 pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of |
181 the entity itself. |
181 the entity itself. |
182 The value associated to the ``relations`` key is a dictionary containing (relation: value) |
182 The value associated to the ``relations`` key is a dictionary containing (relation: value) |
183 pairs, where "value" is an URI. |
183 pairs, where "value" is an URI. |
184 This is implemented in the ``entities_from_rdf`` interface function of the module |
184 This is implemented in the ``entities_from_rdf`` interface function of the module |
185 ``diseasome_parser``. This function provides an iterator on the dictionaries containing |
185 ``diseasome_parser``. This function provides an iterator on the dictionaries containing |
186 the ``attributes`` and ``relations`` keys for all entities. |
186 the ``attributes`` and ``relations`` keys for all entities. |
187 |
187 |
188 However, this is a simple case. In real life, things can get much more complicated, and the |
188 However, this is a simple case. In real life, things can get much more complicated, and the |
189 mapping can be far from trivial, especially when several data sources (which can follow |
189 mapping can be far from trivial, especially when several data sources (which can follow |
190 different formatting and even structuring conventions) must be mapped into the same data model. |
190 different formatting and even structuring conventions) must be mapped into the same data model. |
191 |
191 |
192 Importing the data |
192 Importing the data |
193 ~~~~~~~~~~~~~~~~~~ |
193 ~~~~~~~~~~~~~~~~~~ |
194 |
194 |
195 The data import code should be placed in a Python module. Let us call it |
195 The data import code should be placed in a Python module. Let us call it |
196 ``diseasome_import.py``. Then, this module should be called via |
196 ``diseasome_import.py``. Then, this module should be called via |
197 ``cubicweb-ctl``, as follows:: |
197 ``cubicweb-ctl``, as follows:: |
198 |
198 |
199 cubicweb-ctl shell diseasome_import.py -- <other arguments e.g. data file> |
199 cubicweb-ctl shell diseasome_import.py -- <other arguments e.g. data file> |
200 |
200 |
214 provides the means for creating the memory structures |
214 provides the means for creating the memory structures |
215 (dictionaries) that hold the entities and the relations |
215 (dictionaries) that hold the entities and the relations |
216 between them. |
216 between them. |
217 |
217 |
218 2. ``RQLObjectStore``: store which uses the RQL language for performing |
218 2. ``RQLObjectStore``: store which uses the RQL language for performing |
219 database insertions and updates. It relies on all the CubicWeb hooks |
219 database insertions and updates. It relies on all the CubicWeb hooks |
220 machinery, especially for dealing with security issues (database access |
220 machinery, especially for dealing with security issues (database access |
221 permissions). |
221 permissions). |
222 |
222 |
223 2. ``NoHookRQLObjectStore``: store which uses the RQL language for |
223 2. ``NoHookRQLObjectStore``: store which uses the RQL language for |
224 performing database insertions and updates, but for which |
224 performing database insertions and updates, but for which |
225 all hooks are deactivated. This implies that |
225 all hooks are deactivated. This implies that |
226 certain checks with respect to the CubicWeb / Yams schema |
226 certain checks with respect to the CubicWeb / Yams schema |
227 (data model) are not performed. However, all SQL queries |
227 (data model) are not performed. However, all SQL queries |
228 obtained from the RQL ones are executed in a sequential |
228 obtained from the RQL ones are executed in a sequential |
229 manner, one query per inserted entity. |
229 manner, one query per inserted entity. |
230 |
230 |
231 4. ``SQLGenObjectStore``: store which uses the SQL language directly. |
231 4. ``SQLGenObjectStore``: store which uses the SQL language directly. |
232 It inserts entities either sequentially, by executing an SQL query |
232 It inserts entities either sequentially, by executing an SQL query |
233 for each entity, or directly by using one PostGRES ``COPY FROM`` |
233 for each entity, or directly by using one PostGRES ``COPY FROM`` |
234 query for a set of similarly structured entities. |
234 query for a set of similarly structured entities. |
235 |
235 |
236 For really massive imports (millions or billions of entities), there |
236 For really massive imports (millions or billions of entities), there |
237 is a cube ``dataio`` which contains another store, called |
237 is a cube ``dataio`` which contains another store, called |
238 ``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``, |
238 ``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``, |
239 except that anything related to CubicWeb is bypassed. That is, even the |
239 except that anything related to CubicWeb is bypassed. That is, even the |
240 CubicWeb EID entity identifiers are not handled. This store is the fastest, |
240 CubicWeb EID entity identifiers are not handled. This store is the fastest, |
241 but has a slightly different API from the other four stores mentioned above. |
241 but has a slightly different API from the other four stores mentioned above. |
242 Moreover, it has an important limitation, in that it doesn't insert inlined [#]_ |
242 Moreover, it has an important limitation, in that it doesn't insert inlined [#]_ |
243 relations in the database. |
243 relations in the database. |
244 |
244 |
245 .. [#] An inlined relation is a relation defined in the schema |
245 .. [#] An inlined relation is a relation defined in the schema |
246 with the keyword argument ``inlined=True``. Such a relation |
246 with the keyword argument ``inlined=True``. Such a relation |
247 is inserted in the database as an attribute of the entity |
247 is inserted in the database as an attribute of the entity |
248 whose subject it is. |
248 whose subject it is. |
272 |
272 |
273 Each such store provides three methods for data import: |
273 Each such store provides three methods for data import: |
274 |
274 |
275 #. ``create_entity(Etype, **attributes)``, which allows us to add |
275 #. ``create_entity(Etype, **attributes)``, which allows us to add |
276 an entity of the Yams type ``Etype`` to the database. This entity's attributes |
276 an entity of the Yams type ``Etype`` to the database. This entity's attributes |
277 are specified in the ``attributes`` dictionary. The method returns the entity |
277 are specified in the ``attributes`` dictionary. The method returns the entity |
278 created in the database. For example, we add two entities, |
278 created in the database. For example, we add two entities, |
279 a person, of ``Person`` type, and a location, of ``Location`` type:: |
279 a person, of ``Person`` type, and a location, of ``Location`` type:: |
280 |
280 |
281 person = store.create_entity('Person', name='Toto', age='18', height='190') |
281 person = store.create_entity('Person', name='Toto', age='18', height='190') |
282 |
282 |
283 location = store.create_entity('Location', town='Paris', arrondissement='13') |
283 location = store.create_entity('Location', town='Paris', arrondissement='13') |
284 |
284 |
285 #. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation |
285 #. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation |
286 of the Yams type ``r_type`` to the database. The relation's subject is an entity |
286 of the Yams type ``r_type`` to the database. The relation's subject is an entity |
287 whose EID is ``subject_eid``; its object is another entity, whose EID is |
287 whose EID is ``subject_eid``; its object is another entity, whose EID is |
288 ``object_eid``. For example [#]_:: |
288 ``object_eid``. For example [#]_:: |
289 |
289 |
290 store.relate(person.eid(), 'lives_in', location.eid(), **kwargs) |
290 store.relate(person.eid(), 'lives_in', location.eid(), **kwargs) |
291 |
291 |
292 ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here |
292 ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here |
293 to allow us to specify the type of the subject of the relation, when the relation is |
293 to allow us to specify the type of the subject of the relation, when the relation is |
294 defined as inlined in the schema. |
294 defined as inlined in the schema. |
295 |
295 |
296 .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns |
296 .. [#] The ``eid`` method of an entity defined via ``create_entity`` returns |
297 the entity identifier as assigned by CubicWeb when creating the entity. |
297 the entity identifier as assigned by CubicWeb when creating the entity. |
298 This only works for entities defined via the stores in the CubicWeb's |
298 This only works for entities defined via the stores in the CubicWeb's |
299 ``dataimport`` module. |
299 ``dataimport`` module. |
300 |
300 |
301 The keyword argument that is understood by ``SQLGenObjectStore`` is called |
301 The keyword argument that is understood by ``SQLGenObjectStore`` is called |
302 ``subjtype`` and holds the type of the subject entity. For the example considered here, |
302 ``subjtype`` and holds the type of the subject entity. For the example considered here, |
303 this comes to having [#]_:: |
303 this comes to having [#]_:: |
304 |
304 |
305 store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.cw_etype) |
305 store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.cw_etype) |
306 |
306 |
307 If ``subjtype`` is not specified, then the store tries to infer the type of the subject. |
307 If ``subjtype`` is not specified, then the store tries to infer the type of the subject. |
308 However, this doesn't always work, e.g. when there are several possible subject types |
308 However, this doesn't always work, e.g. when there are several possible subject types |
309 for a given relation type. |
309 for a given relation type. |
310 |
310 |
311 .. [#] The ``cw_etype`` attribute of an entity defined via ``create_entity`` holds |
311 .. [#] The ``cw_etype`` attribute of an entity defined via ``create_entity`` holds |
312 the type of the entity just created. This only works for entities defined via |
312 the type of the entity just created. This only works for entities defined via |
313 the stores in the CubicWeb's ``dataimport`` module. In the example considered |
313 the stores in the CubicWeb's ``dataimport`` module. In the example considered |
314 here, ``person.cw_etype`` holds ``'Person'``. |
314 here, ``person.cw_etype`` holds ``'Person'``. |
315 |
315 |
316 All the other stores but ``SQLGenObjectStore`` ignore the ``kwargs`` parameters. |
316 All the other stores but ``SQLGenObjectStore`` ignore the ``kwargs`` parameters. |
317 |
317 |
318 #. ``flush()``, which allows us to perform the actual commit into the database, along |
318 #. ``flush()``, which allows us to perform the actual commit into the database, along |
319 with some cleanup operations. Ideally, this method should be called as often as |
319 with some cleanup operations. Ideally, this method should be called as often as |
320 possible, that is after each insertion in the database, so that database sessions |
320 possible, that is after each insertion in the database, so that database sessions |
321 are kept as atomic as possible. In practice, we usually call this method twice: |
321 are kept as atomic as possible. In practice, we usually call this method twice: |
322 first, after all the entities have been created, second, after all relations have |
322 first, after all the entities have been created, second, after all relations have |
323 been created. |
323 been created. |
324 |
324 |
325 Note however that before each commit the database insertions |
325 Note however that before each commit the database insertions |
326 have to be consistent with the schema. Thus, if, for instance, |
326 have to be consistent with the schema. Thus, if, for instance, |
327 an entity has an attribute defined through a relation (viz. |
327 an entity has an attribute defined through a relation (viz. |
328 a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object |
328 a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object |
329 cardinality, we have to create the entity under discussion, |
329 cardinality, we have to create the entity under discussion, |
330 the object entity of the relation under discussion, and the |
330 the object entity of the relation under discussion, and the |
331 relation itself, before committing the additions to the database. |
331 relation itself, before committing the additions to the database. |
332 |
332 |
333 The ``flush`` method is simply called as:: |
333 The ``flush`` method is simply called as:: |
373 |
373 |
374 Please note that these tables can be created before the entities, since |
374 Please note that these tables can be created before the entities, since |
375 they only specify their types, not their unique identifiers. |
375 they only specify their types, not their unique identifiers. |
376 |
376 |
377 #. ``create_entity(Etype, **attributes)``, which allows us to add new entities, |
377 #. ``create_entity(Etype, **attributes)``, which allows us to add new entities, |
378 whose attributes are given in the ``attributes`` dictionary. |
378 whose attributes are given in the ``attributes`` dictionary. |
379 Please note however that, by default, this method does *not* return |
379 Please note however that, by default, this method does *not* return |
380 the created entity. The method is called, for example, as in:: |
380 the created entity. The method is called, for example, as in:: |
381 |
381 |
382 store.create_entity('Person', name='Toto', age='18', height='190', |
382 store.create_entity('Person', name='Toto', age='18', height='190', |
383 uri='http://link/to/person/toto_18_190') |
383 uri='http://link/to/person/toto_18_190') |
384 store.create_entity('Location', town='Paris', arrondissement='13', |
384 store.create_entity('Location', town='Paris', arrondissement='13', |
385 uri='http://link/to/location/paris_13') |
385 uri='http://link/to/location/paris_13') |
386 |
386 |
387 In order to be able to link these entities via the relations when needed, |
387 In order to be able to link these entities via the relations when needed, |
388 we must provide ourselves a means for uniquely identifying the entities. |
388 we must provide ourselves a means for uniquely identifying the entities. |
389 In general, this is done via URIs, stored in attributes like ``uri`` or |
389 In general, this is done via URIs, stored in attributes like ``uri`` or |
390 ``cwuri``. The name of the attribute is irrelevant as long as its value is |
390 ``cwuri``. The name of the attribute is irrelevant as long as its value is |
391 unique for each entity. |
391 unique for each entity. |
392 |
392 |
393 #. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually |
393 #. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually |
394 relate the entities uniquely identified by ``subject_iid`` and |
394 relate the entities uniquely identified by ``subject_iid`` and |
395 ``object_iid`` via a relation of type ``r_type``. For example:: |
395 ``object_iid`` via a relation of type ``r_type``. For example:: |
396 |
396 |
397 store.relate_by_iid('http://link/to/person/toto_18_190', |
397 store.relate_by_iid('http://link/to/person/toto_18_190', |
398 'lives_in', |
398 'lives_in', |
399 'http://link/to/location/paris_13') |
399 'http://link/to/location/paris_13') |
435 ############ |
435 ############ |
436 |
436 |
437 We define an import function, ``diseasome_import``, which does basically four things: |
437 We define an import function, ``diseasome_import``, which does basically four things: |
438 |
438 |
439 #. creates and initializes the store to be used, via a line such as:: |
439 #. creates and initializes the store to be used, via a line such as:: |
440 |
440 |
441 store = cwdi.SQLGenObjectStore(session) |
441 store = cwdi.SQLGenObjectStore(session) |
442 |
442 |
443 where ``cwdi`` is the imported ``cubicweb.dataimport`` or |
443 where ``cwdi`` is the imported ``cubicweb.dataimport`` or |
444 ``cubicweb_dataio.dataimport``. |
444 ``cubicweb_dataio.dataimport``. |
445 |
445 |
446 #. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the |
446 #. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the |
447 ``diseasome_parser`` module and iterates on its result, in a line such as:: |
447 ``diseasome_parser`` module and iterates on its result, in a line such as:: |
448 |
448 |
449 for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')): |
449 for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')): |
450 |
450 |
451 where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the |
451 where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the |
452 name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``. |
452 name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``. |
453 |
453 |
454 #. creates the entities to be inserted in the database; for Diseasome, there are two |
454 #. creates the entities to be inserted in the database; for Diseasome, there are two |
455 kinds of entities: |
455 kinds of entities: |
456 |
456 |
457 #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case. |
457 #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case. |
458 #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define |
458 #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define |
459 URIs. |
459 URIs. |
460 |
460 |
461 As we are working with RDF data, each entity is defined through a series of URIs. Hence, |
461 As we are working with RDF data, each entity is defined through a series of URIs. Hence, |
462 each "relational attribute" [#]_ of an entity is defined via an URI, that is, in CubicWeb |
462 each "relational attribute" [#]_ of an entity is defined via an URI, that is, in CubicWeb |
463 terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above, |
463 terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above, |
464 as such:: |
464 as such:: |
465 |
465 |
466 ent = store.create_entity(etype, **entity) |
466 ent = store.create_entity(etype, **entity) |
467 |
467 |
468 where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``. |
468 where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``. |
469 |
469 |
470 .. [#] By "relational attribute" we denote an attribute (of an entity) which |
470 .. [#] By "relational attribute" we denote an attribute (of an entity) which |
471 is defined through a relation, e.g. the ``chromosomal_location`` attribute |
471 is defined through a relation, e.g. the ``chromosomal_location`` attribute |
472 of ``Disease`` entities, which is defined through a relation between a |
472 of ``Disease`` entities, which is defined through a relation between a |
473 ``Disease`` and an ``ExternalUri``. |
473 ``Disease`` and an ``ExternalUri``. |
474 |
474 |
475 The ``ExternalUri`` entities are as many as URIs in the data file. For them, we define a unique |
475 The ``ExternalUri`` entities are as many as URIs in the data file. For them, we define a unique |
476 attribute, ``uri``, which holds the URI under discussion:: |
476 attribute, ``uri``, which holds the URI under discussion:: |
477 |
477 |
478 extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri") |
478 extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri") |
479 |
479 |
480 #. creates the relations between the entities. We have relations between: |
480 #. creates the relations between the entities. We have relations between: |
481 |
481 |
482 #. entities defined in the schema, e.g. between ``Disease`` and ``Gene`` |
482 #. entities defined in the schema, e.g. between ``Disease`` and ``Gene`` |
483 entities, such as the ``associated_genes`` relation defined for |
483 entities, such as the ``associated_genes`` relation defined for |
484 ``Disease`` entities. |
484 ``Disease`` entities. |
485 #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``. |
485 #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``. |
486 |
486 |
487 The way relations are added to the database depends on the store: |
487 The way relations are added to the database depends on the store: |
488 |
488 |
489 - for the stores in the CubicWeb ``dataimport`` module, we only use |
489 - for the stores in the CubicWeb ``dataimport`` module, we only use |
490 ``store.relate``, in |
490 ``store.relate``, in |
491 another loop, on the relations (that is, a |
491 another loop, on the relations (that is, a |
492 loop inside the preceding one, mentioned at step 2):: |
492 loop inside the preceding one, mentioned at step 2):: |
493 |
493 |
494 for rtype, rels in relations.iteritems(): |
494 for rtype, rels in relations.iteritems(): |
495 ... |
495 ... |
496 |
496 |
497 store.relate(ent.eid(), rtype, extu.eid(), **kwargs) |
497 store.relate(ent.eid(), rtype, extu.eid(), **kwargs) |
498 |
498 |
499 where ``kwargs`` is a dictionary designed to accommodate the need for specifying |
499 where ``kwargs`` is a dictionary designed to accommodate the need for specifying |
500 the type of the subject entity of the relation, when the relation is inlined and |
500 the type of the subject entity of the relation, when the relation is inlined and |
501 ``SQLGenObjectStore`` is used. For example:: |
501 ``SQLGenObjectStore`` is used. For example:: |
502 |
502 |
503 ... |
503 ... |
504 store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease') |
504 store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease') |
505 |
505 |
506 - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module, |
506 - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module, |
507 the relations are created in three steps: |
507 the relations are created in three steps: |
508 |
508 |
509 #. first, a table is created for each relation type, as in:: |
509 #. first, a table is created for each relation type, as in:: |
510 |
510 |
511 ... |
511 ... |
512 store.init_rtype_table(ent.cw_etype, rtype, extu.cw_etype) |
512 store.init_rtype_table(ent.cw_etype, rtype, extu.cw_etype) |
513 |
513 |
514 which comes down to lines such as:: |
514 which comes down to lines such as:: |
515 |
515 |
516 store.init_rtype_table('Disease', 'associated_genes', 'Gene') |
516 store.init_rtype_table('Disease', 'associated_genes', 'Gene') |
517 store.init_rtype_table('Gene', 'gene_id', 'ExternalUri') |
517 store.init_rtype_table('Gene', 'gene_id', 'ExternalUri') |
518 |
518 |
519 #. second, the URI of each entity will be used as its identifier, in the |
519 #. second, the URI of each entity will be used as its identifier, in the |
520 ``relate_by_iid`` method, such as:: |
520 ``relate_by_iid`` method, such as:: |
521 |
521 |
522 disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3' |
522 disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3' |
523 gene_uri = '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2' |
523 gene_uri = '<http://www4.wiwiss.fu-berlin.de/diseasome/resource/genes/HSD3B2' |
524 store.relate_by_iid(disease_uri, 'associated_genes', gene_uri) |
524 store.relate_by_iid(disease_uri, 'associated_genes', gene_uri) |
525 |
525 |
526 #. third, the relations for each relation type will be added to the database, |
526 #. third, the relations for each relation type will be added to the database, |
527 via the ``convert_relations`` method, such as in:: |
527 via the ``convert_relations`` method, such as in:: |
528 |
528 |
529 store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri') |
529 store.convert_relations('Disease', 'associated_genes', 'Gene', 'cwuri', 'cwuri') |
530 |
530 |
531 and:: |
531 and:: |
532 |
532 |
533 store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri') |
533 store.convert_relations('Gene', 'hgnc_id', 'ExternalUri', 'cwuri', 'uri') |
534 |
534 |
535 where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities |
535 where ``cwuri`` and ``uri`` are the attributes which store the URIs of the entities |
536 defined in the data model, and of the ``ExternalUri`` entities, respectively. |
536 defined in the data model, and of the ``ExternalUri`` entities, respectively. |
537 |
537 |
538 #. flushes all relations and entities:: |
538 #. flushes all relations and entities:: |
539 |
539 |
540 store.flush() |
540 store.flush() |
541 |
541 |
542 which performs the actual commit of the inserted entities and relations in the database. |
542 which performs the actual commit of the inserted entities and relations in the database. |
543 |
543 |
544 If the ``MassiveObjectStore`` is used, then a cleanup of temporary SQL tables should be performed |
544 If the ``MassiveObjectStore`` is used, then a cleanup of temporary SQL tables should be performed |
611 Finally, we have seen four ways of importing data into CubicWeb. |
611 Finally, we have seen four ways of importing data into CubicWeb. |
612 |
612 |
613 Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and |
613 Three of those are integrated into CubicWeb, namely the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and |
614 ``SQLGenObjectStore`` stores, which have a common API: |
614 ``SQLGenObjectStore`` stores, which have a common API: |
615 |
615 |
616 - ``RQLObjectStore`` is by far the slowest, especially its time spent on the |
616 - ``RQLObjectStore`` is by far the slowest, especially its time spent on the |
617 CubicWeb side, and so it should be used only for small amounts of |
617 CubicWeb side, and so it should be used only for small amounts of |
618 "sensitive" data (i.e. where security is a concern). |
618 "sensitive" data (i.e. where security is a concern). |
619 |
619 |
620 - ``NoHookRQLObjectStore`` slashes by almost four the time spent on the CubicWeb side, |
620 - ``NoHookRQLObjectStore`` slashes by almost four the time spent on the CubicWeb side, |
621 but is also quite slow; on the PostGres side it is as slow as the previous store. |
621 but is also quite slow; on the PostGres side it is as slow as the previous store. |
622 It should be used for data where security is not a concern, |
622 It should be used for data where security is not a concern, |
623 but consistency (with the data model) is. |
623 but consistency (with the data model) is. |
624 |
624 |
625 - ``SQLGenObjectStore`` slashes by three the time spent on the CubicWeb side and by five the time |
625 - ``SQLGenObjectStore`` slashes by three the time spent on the CubicWeb side and by five the time |
626 spent on the PostGreSQL side. It should be used for relatively great amounts of data, where |
626 spent on the PostGreSQL side. It should be used for relatively great amounts of data, where |
627 security and data consistency are not a concern. Compared to the previous store, it has the |
627 security and data consistency are not a concern. Compared to the previous store, it has the |
628 disadvantage that, for inlined relations, we must specify their subjects' types. |
628 disadvantage that, for inlined relations, we must specify their subjects' types. |
629 |
629 |
630 For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available |
630 For really huge amounts of data there is a fourth store, ``MassiveObjectStore``, available |
631 from the ``dataio`` cube. It provides a blazing performance with respect to all other stores: |
631 from the ``dataio`` cube. It provides a blazing performance with respect to all other stores: |
632 it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than |
632 it is almost 25 times faster than ``RQLObjectStore`` and almost three times faster than |
633 ``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account: |
633 ``SQLGenObjectStore``. However, it has a few usage caveats that should be taken into account: |
634 |
634 |
635 #. it cannot insert relations defined as inlined in the schema, |
635 #. it cannot insert relations defined as inlined in the schema, |
636 #. no security or consistency check is performed on the data, |
636 #. no security or consistency check is performed on the data, |
637 #. its API is slightly different from the other stores. |
637 #. its API is slightly different from the other stores. |