# HG changeset patch # User Vladimir Popescu # Date 1363109475 -3600 # Node ID 8a57802d40d328154e94ecb3bc1c122515dbf3d3 # Parent 3612b760488b834a8d3a7b512917e8789d89ed56 [cubicweb/doc] Add tutorial on data import in CubicWeb. This involves creating the "tutorials/dataimport" directory structure under "cubicweb/doc" and, inside the "dataimport" directory, putting several files: - a ResT file containing the tutorial *per se*; this tutorial addresses the following issues: * creating a CubicWeb schema for representing a given data set (here, the Diseasome RDF data, for illustration purposes); * parsing the data; * importing the data, by using several stores: + the ``RQLObjectStore``, ``NoHookRQLObjectStore`` and ``SQLGenObjectStore`` from the ``dataimport`` module in CubicWeb; + the ``MassiveObjectStore`` from the ``dataimport`` module in the ``dataio`` cube. The tutorial also provides timing benchmarks of the various stores. - a set of Python files illustrating the data import, in the context of Diseasome RDF data parsing: * a Diseasome RDF data parse module, * a Diseasome data import module, * a CubicWeb schema for representing Diseasome data. diff -r 3612b760488b -r 8a57802d40d3 doc/tutorials/dataimport/data_import_tutorial.rst --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/tutorials/dataimport/data_import_tutorial.rst Tue Mar 12 18:31:15 2013 +0100 @@ -0,0 +1,646 @@ +Importing relational data into a CubicWeb instance +================================================== + +Introduction +~~~~~~~~~~~~ + +This tutorial explains how to import data from an external source (e.g. a collection of files) +into a CubicWeb cube instance. + +First, once we know the format of the data we wish to import, we devise a +*data model*, that is, a CubicWeb (Yams) schema which reflects the way the data +is structured. This schema is implemented in the ``schema.py`` file. +In this tutorial, we will describe such a schema for a particular data set, +the Diseasome data (see below). 
+ +Once the schema is defined, we create a cube and an instance. +The cube is a specification of an application, whereas an instance +is the application per se. + +Once the schema is defined and the instance is created, the import can be performed, via +the following steps: + +1. Build a custom parser for the data to be imported. Thus, one obtains a Python + memory representation of the data. + +2. Map the parsed data to the data model defined in ``schema.py``. + +3. Perform the actual import of the data. This comes down to "populating" + the data model with the memory representation obtained at 1, according to + the mapping defined at 2. + +This tutorial illustrates all the above steps in the context of relational data +stored in the RDF format. + +More specifically, we describe the import of Diseasome_ RDF/OWL data. + +.. _Diseasome: http://datahub.io/dataset/fu-berlin-diseasome + +Building a data model +~~~~~~~~~~~~~~~~~~~~~ + +The first thing to do when using CubicWeb for creating an application from scratch +is to devise a *data model*, that is, a relational representation of the problem to be +modeled or of the structure of the data to be imported. + +In such a schema, we define +an entity type (``EntityType`` objects) for each type of entity to import. Each such type +has several attributes. If the attributes are of known CubicWeb (Yams) types, viz. numbers, +strings or characters, then they are defined as attributes, as e.g. ``attribute = Int()`` +for an attribute named ``attribute`` which is an integer. + +Each such type also has a set of +relations, which are defined like the attributes, except that they represent, in fact, +relations between the entities of the type under discussion and the objects of a type which +is specified in the relation definition. + +For example, for the Diseasome data, we have two types of entities, genes and diseases. 
+Thus, we create two classes which inherit from ``EntityType``:: + + class Disease(EntityType): + # Corresponds to http://www.w3.org/2000/01/rdf-schema#label + label = String(maxsize=512, fulltextindexed=True) + ... + + # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/associatedGene + associated_genes = SubjectRelation('Gene', cardinality='**') + ... + + # Corresponds to 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/chromosomalLocation' + chromosomal_location = SubjectRelation('ExternalUri', cardinality='?*', inlined=True) + + + class Gene(EntityType): + ... + +In this schema, there are attributes whose values are numbers or strings. Thus, they are +defined by using the CubicWeb / Yams primitive types, e.g., ``label = String(maxsize=512)``. +These types can have several constraints or attributes, such as ``maxsize``. +There are also relations, either between the entity types themselves, or between them +and a CubicWeb type, ``ExternalUri``. The latter defines a class of URI objects in +CubicWeb. For instance, the ``chromosomal_location`` attribute is a relation between +a ``Disease`` entity and an ``ExternalUri`` entity. The relation is declared with the CubicWeb / +Yams ``SubjectRelation`` class. The latter accepts several optional keyword arguments, such as +``cardinality``, which specifies the number of subjects and objects related by the relation type +specified. For example, the ``'?*'`` cardinality in the ``chromosomal_location`` relation type says +that zero or more ``Disease`` entities are related to zero or one ``ExternalUri`` entities. +In other words, a ``Disease`` entity is related to at most one ``ExternalUri`` entity via the +``chromosomal_location`` relation type, and zero or more ``Disease`` entities can be related to +a given ``ExternalUri`` entity. 
+For a relation between the entity types themselves, the ``associated_genes`` relation between a ``Disease`` +entity and a ``Gene`` entity is defined, so that any number of ``Gene`` entities can be associated +to a ``Disease``, and any number of ``Disease`` entities can be associated to a ``Gene``. + +Of course, before being able to use the CubicWeb / Yams built-in objects, we need to import them:: + + + from yams.buildobjs import EntityType, SubjectRelation, String, Int + from cubicweb.schemas.base import ExternalUri + +Building a custom data parser +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The data we wish to import is structured in the RDF format, +as a text file containing a set of lines. +On each line, there are three fields. +The first two fields are URIs ("Uniform Resource Identifiers"). +The third field is either an URI or a string. Each field bears a particular meaning: + +- the leftmost field is an URI that holds the entity to be imported. + Note that the entities defined in the data model (i.e., in ``schema.py``) should + correspond to the entities whose URIs are specified in the import file. + +- the middle field is an URI that holds a relation whose subject is the entity + defined by the leftmost field. Note that this should also correspond + to the definitions in the data model. + +- the rightmost field is either an URI or a string. When this field is an URI, + it gives the object of the relation defined by the middle field. + When the rightmost field is a string, the middle field is interpreted as an attribute + of the subject (introduced by the leftmost field) and the rightmost field is + interpreted as the value of the attribute. + +Note however that some attributes (i.e. relations whose objects are strings) +have their objects defined as strings followed by ``^^`` and by another URI; +we ignore this part. 
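The three-field line structure described above can be sketched in a few lines of Python. This is only an illustration: the two patterns are modelled on the ``RE_RELS`` and ``RE_ATTS`` expressions of the ``diseasome_parser`` module, while ``parse_line`` and the ``example.org`` sample lines are hypothetical names and data introduced here, not part of the tutorial code.

```python
import re

# Patterns modelled on RE_RELS / RE_ATTS from the diseasome_parser module.
RE_REL = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
RE_ATT = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')


def parse_line(line):
    """Hypothetical helper: return (subject, predicate, object, is_attribute).

    Returns None when the line matches neither kind of definition.
    """
    match = RE_REL.match(line)
    if match:
        # Third field is an URI: the line defines a relation.
        return match.group(1), match.group(2), match.group(3), False
    match = RE_ATT.match(line)
    if match:
        # Third field is a string: the line defines an attribute.
        # The optional ``^^<datatype>`` suffix (groups 4 and 5) is ignored.
        return match.group(1), match.group(2), match.group(3), True
    return None


# Made-up sample lines (the URIs are placeholders, not Diseasome data).
attribute_line = '<http://example.org/genes/G1> <http://example.org/label> "G1" .'
relation_line = ('<http://example.org/diseases/1> '
                 '<http://example.org/associatedGene> '
                 '<http://example.org/genes/G1> .')
typed_line = ('<http://example.org/diseases/1> <http://example.org/degree> '
              '"3"^^<http://www.w3.org/2001/XMLSchema#integer> .')

for sample in (attribute_line, relation_line, typed_line):
    print(parse_line(sample))
```

Trying the relation pattern first and falling back to the attribute pattern mirrors the order used in ``_retrieve_structure``; returning ``None`` for unmatched lines is a choice made for this sketch.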
+ +Let us show some examples: + +- of line holding an attribute definition: + `` + "CYP17A1" .`` + The line contains the definition of the ``label`` attribute of an + entity of type ``gene``. The value of ``label`` is '``CYP17A1``'. + +- of line holding a relation definition: + `` + + .`` + The line contains the definition of the ``associatedGene`` relation between + a ``disease`` subject entity identified by ``1`` and a ``gene`` object + entity defined by ``HADH2``. + +Thus, for parsing the data, we can (:note: see the ``diseasome_parser`` module): + +1. define a couple of regular expressions for parsing the two kinds of lines, + ``RE_ATTS`` for parsing the attribute definitions, and ``RE_RELS`` for parsing + the relation definitions. + +2. define a function that iterates through the lines of the file and retrieves + (``yield`` s) a (subject, relation, object) tuple for each line. + We called it ``_retrieve_structure`` in the ``diseasome_parser`` module. + The function needs the file name and the types for which information + should be retrieved. + +Alternatively, instead of hand-making the parser, one could use the RDF parser provided +in the ``dataio`` cube. + +.. XXX To further study and detail the ``dataio`` cube usage. + +Once we have the (subject, relation, object) triples, we need to map them into +the data model. + + +Mapping the data to the schema +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +In the case of diseasome data, we can just define two dictionaries for mapping +the names of the relations as extracted by the parser, to the names of the relations +as defined in the ``schema.py`` data model. In the ``diseasome_parser`` module +they are called ``MAPPING_ATTS`` and ``MAPPING_RELS``. +Given that the relation and attribute names are given in CamelCase in the original data, +mappings are necessary if we follow PEP 8 when naming the attributes in the data model. 
+For example, the RDF relation ``chromosomalLocation`` is mapped into the schema relation +``chromosomal_location``. + +Once these mappings have been defined, we just iterate over the (subject, relation, object) +tuples provided by the parser and we extract the entities, with their attributes and relations. +For each entity, we thus have a dictionary with two keys, ``attributes`` and ``relations``. +The value associated to the ``attributes`` key is a dictionary containing (attribute: value) +pairs, where "value" is a string, plus the ``cwuri`` key / attribute holding the URI of +the entity itself. +The value associated to the ``relations`` key is a dictionary containing (relation: value) +pairs, where "value" is an URI. +This is implemented in the ``entities_from_rdf`` interface function of the module +``diseasome_parser``. This function provides an iterator on the dictionaries containing +the ``attributes`` and ``relations`` keys for all entities. + +However, this is a simple case. In real life, things can get much more complicated, and the +mapping can be far from trivial, especially when several data sources (which can follow +different formatting and even structuring conventions) must be mapped into the same data model. + +Importing the data +~~~~~~~~~~~~~~~~~~ + +The data import code should be placed in a Python module. Let us call it +``diseasome_import.py``. Then, this module should be called via +``cubicweb-ctl``, as follows:: + + cubicweb-ctl shell diseasome_import.py -- + +In the import module, we should use a *store* for doing the import. +A store is an object which provides three kinds of methods for +importing data: + +- a method for importing the entities, along with the values + of their attributes. +- a method for importing the relations between the entities. +- a method for committing the imports to the database. + +In CubicWeb, we have four stores: + +1. ``ObjectStore`` base class for the stores in CubicWeb. 
+ It only provides a skeleton for all other stores and + provides the means for creating the memory structures + (dictionaries) that hold the entities and the relations + between them. + +2. ``RQLObjectStore``: store which uses the RQL language for performing + database insertions and updates. It relies on all the CubicWeb hooks + machinery, especially for dealing with security issues (database access + permissions). + +3. ``NoHookRQLObjectStore``: store which uses the RQL language for + performing database insertions and updates, but for which + all hooks are deactivated. This implies that + certain checks with respect to the CubicWeb / Yams schema + (data model) are not performed. However, all SQL queries + obtained from the RQL ones are executed in a sequential + manner, one query per inserted entity. + +4. ``SQLGenObjectStore``: store which uses the SQL language directly. + It inserts entities either sequentially, by executing an SQL query + for each entity, or directly by using one PostgreSQL ``COPY FROM`` + query for a set of similarly structured entities. + +For really massive imports (millions or billions of entities), there +is a cube ``dataio`` which contains another store, called +``MassiveObjectStore``. This store is similar to ``SQLGenObjectStore``, +except that anything related to CubicWeb is bypassed. That is, even the +CubicWeb EID entity identifiers are not handled. This store is the fastest, +but has a slightly different API from the other four stores mentioned above. +Moreover, it has an important limitation, in that it doesn't insert inlined [#]_ +relations in the database. + +.. [#] An inlined relation is a relation defined in the schema + with the keyword argument ``inlined=True``. Such a relation + is inserted in the database as an attribute of the entity + whose subject it is. + +In the following section we will see how to import data by using the stores +in CubicWeb's ``dataimport`` module. 
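To make the three kinds of store methods more concrete, here is a toy, purely in-memory stand-in. It is a sketch under stated assumptions: ``ToyStore`` is a hypothetical class written for this illustration, it is not part of CubicWeb, and it only mimics the general shape (create entities, relate them, then commit) rather than the behaviour of any real store.

```python
class ToyStore(object):
    """Hypothetical in-memory stand-in for a store; not a CubicWeb class."""

    def __init__(self):
        self._next_eid = 1
        self._pending = []       # operations waiting for flush()
        self.entities = {}       # eid -> (etype, attributes)
        self.relations = set()   # (subject eid, rtype, object eid)

    def create_entity(self, etype, **attributes):
        """Queue an entity for insertion and return its identifier."""
        eid = self._next_eid
        self._next_eid += 1
        self._pending.append(('entity', eid, etype, attributes))
        return eid

    def relate(self, subject_eid, rtype, object_eid):
        """Queue a relation between two previously created entities."""
        self._pending.append(('relation', subject_eid, rtype, object_eid))

    def flush(self):
        """'Commit' every queued operation to the in-memory 'database'."""
        for operation in self._pending:
            if operation[0] == 'entity':
                _, eid, etype, attributes = operation
                self.entities[eid] = (etype, attributes)
            else:
                self.relations.add(operation[1:])
        self._pending = []


store = ToyStore()
person = store.create_entity('Person', name='Toto')
location = store.create_entity('Location', town='Paris')
store.relate(person, 'lives_in', location)
committed_before_flush = len(store.entities)  # nothing is committed yet
store.flush()
```

Nothing reaches the toy "database" until ``flush`` is called, which is the point of the sketch: entity and relation creation only queue work, and the commit is a separate, explicit step.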
+ +Using the stores in ``dataimport`` +++++++++++++++++++++++++++++++++++ + +``ObjectStore`` is seldom used in real life for importing data, since it is +only the base store for the other stores and it doesn't perform an actual +import of the data. Nevertheless, the other three stores, which import data, +are based on ``ObjectStore`` and provide the same API. + +All three stores ``RQLObjectStore``, ``NoHookRQLObjectStore`` and +``SQLGenObjectStore`` provide exactly the same API for importing data, that is +entities and relations, in an SQL database. + +Before using a store, one must import the ``dataimport`` module and then initialize +the store, with the current ``session`` as a parameter:: + + import cubicweb.dataimport as cwdi + ... + + store = cwdi.RQLObjectStore(session) + +Each such store provides three methods for data import: + +#. ``create_entity(Etype, **attributes)``, which allows us to add + an entity of the Yams type ``Etype`` to the database. This entity's attributes + are specified in the ``attributes`` dictionary. The method returns the entity + created in the database. For example, we add two entities, + a person, of ``Person`` type, and a location, of ``Location`` type:: + + person = store.create_entity('Person', name='Toto', age='18', height='190') + + location = store.create_entity('Location', town='Paris', arrondissement='13') + +#. ``relate(subject_eid, r_type, object_eid)``, which allows us to add a relation + of the Yams type ``r_type`` to the database. The relation's subject is an entity + whose EID is ``subject_eid``; its object is another entity, whose EID is + ``object_eid``. For example [#]_:: + + store.relate(person.eid(), 'lives_in', location.eid(), **kwargs) + + ``kwargs`` is only used by the ``SQLGenObjectStore``'s ``relate`` method and is here + to allow us to specify the type of the subject of the relation, when the relation is + defined as inlined in the schema. + +.. 
[#] The ``eid`` method of an entity defined via ``create_entity`` returns + the entity identifier as assigned by CubicWeb when creating the entity. + This only works for entities defined via the stores in the CubicWeb's + ``dataimport`` module. + + The keyword argument that is understood by ``SQLGenObjectStore`` is called + ``subjtype`` and holds the type of the subject entity. For the example considered here, + this comes to having [#]_:: + + store.relate(person.eid(), 'lives_in', location.eid(), subjtype=person.dc_type()) + + If ``subjtype`` is not specified, then the store tries to infer the type of the subject. + However, this doesn't always work, e.g. when there are several possible subject types + for a given relation type. + +.. [#] The ``dc_type`` method of an entity defined via ``create_entity`` returns + the type of the entity just created. This only works for entities defined via + the stores in the CubicWeb's ``dataimport`` module. In the example considered + here, ``person.dc_type()`` returns ``'Person'``. + + All the other stores but ``SQLGenObjectStore`` ignore the ``kwargs`` parameters. + +#. ``flush()``, which allows us to perform the actual commit into the database, along + with some cleanup operations. Ideally, this method should be called as often as + possible, that is after each insertion in the database, so that database sessions + are kept as atomic as possible. In practice, we usually call this method twice: + first, after all the entities have been created, second, after all relations have + been created. + + Note however that before each commit the database insertions + have to be consistent with the schema. Thus, if, for instance, + an entity has an attribute defined through a relation (viz. + a ``SubjectRelation``) with a ``"1"`` or ``"+"`` object + cardinality, we have to create the entity under discussion, + the object entity of the relation under discussion, and the + relation itself, before committing the additions to the database. 
+ + The ``flush`` method is simply called as:: + + store.flush() + + +Using the ``MassiveObjectStore`` in the ``dataio`` cube ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +This store, available in the ``dataio`` cube, allows us to +fully dispense with the CubicWeb import mechanisms and hence +to interact directly with the database server, via SQL queries. + +Moreover, these queries rely on PostgreSQL's ``COPY FROM`` instruction +to create several entities in a single query. This brings tremendous +performance improvements with respect to the RQL-based data insertion +procedures. + +However, the API of this store is slightly different from the API of +the stores in CubicWeb's ``dataimport`` module. + +Before using the store, one has to import the ``dataio`` cube's +``dataimport`` module, then initialize the store by giving it the +``session`` parameter:: + + from cubes.dataio import dataimport as mcwdi + ... + + store = mcwdi.MassiveObjectStore(session) + +The ``MassiveObjectStore`` provides six methods for inserting data +into the database: + +#. ``init_rtype_table(SubjEtype, r_type, ObjEtype)``, which specifies the + creation of the tables associated to the relation types in the database. + Each such table has three columns: the type of the subject entity, the + type of the relation (that is, the name of the attribute in the subject + entity which is defined via the relation), and the type of the object + entity. For example:: + + store.init_rtype_table('Person', 'lives_in', 'Location') + + Please note that these tables can be created before the entities, since + they only specify their types, not their unique identifiers. + +#. ``create_entity(Etype, **attributes)``, which allows us to add new entities, + whose attributes are given in the ``attributes`` dictionary. + Please note however that, by default, this method does *not* return + the created entity. 
The method is called, for example, as in:: + + store.create_entity('Person', name='Toto', age='18', height='190', + uri='http://link/to/person/toto_18_190') + store.create_entity('Location', town='Paris', arrondissement='13', + uri='http://link/to/location/paris_13') + + In order to be able to link these entities via the relations when needed, + we must ourselves provide a means for uniquely identifying the entities. + In general, this is done via URIs, stored in attributes like ``uri`` or + ``cwuri``. The name of the attribute is irrelevant as long as its value is + unique for each entity. + +#. ``relate_by_iid(subject_iid, r_type, object_iid)`` allows us to actually + relate the entities uniquely identified by ``subject_iid`` and + ``object_iid`` via a relation of type ``r_type``. For example:: + + store.relate_by_iid('http://link/to/person/toto_18_190', + 'lives_in', + 'http://link/to/location/paris_13') + + Please note that this method does *not* work for inlined relations! + +#. ``convert_relations(SubjEtype, r_type, ObjEtype, subj_iid_attribute, + obj_iid_attribute)`` + allows us to actually insert + the relations in the database. At one call of this method, one inserts + all the relations of type ``r_type`` between entities of given types. + ``subj_iid_attribute`` and ``obj_iid_attribute`` are the names + of the attributes which store the unique identifiers of the entities, + as assigned by the user. These names can be identical, as long as + their values are unique. For example, for inserting all relations + of type ``lives_in`` between ``Person`` and ``Location`` entities, + we write:: + + store.convert_relations('Person', 'lives_in', 'Location', 'uri', 'uri') + +#. ``flush()`` performs the actual commit in the database. It only needs + to be called after ``create_entity`` and ``relate_by_iid`` calls. 
+ Please note that ``relate_by_iid`` does *not* perform insertions into + the database, hence calling ``flush()`` for it would have no effect. + +#. 
``cleanup()`` performs database cleanups, by removing temporary tables. + It should only be called at the end of the import. + + + +.. XXX to add smth on the store's parameter initialization. + + + +Application to the Diseasome data ++++++++++++++++++++++++++++++++++ + +Import setup +############ + +We define an import function, ``diseasome_import``, which does basically four things: + +#. creates and initializes the store to be used, via a line such as:: + + store = cwdi.SQLGenObjectStore(session) + + where ``cwdi`` is the imported ``cubicweb.dataimport`` or + ``cubes.dataio.dataimport``. + +#. calls the diseasome parser, that is, the ``entities_from_rdf`` function in the + ``diseasome_parser`` module and iterates on its result, in a line such as:: + + for entity, relations in parser.entities_from_rdf(filename, ('gene', 'disease')): + + where ``parser`` is the imported ``diseasome_parser`` module, and ``filename`` is the + name of the file containing the data (with its path), e.g. ``../data/diseasome_dump.nt``. + +#. creates the entities to be inserted in the database; for Diseasome, there are two + kinds of entities: + + #. entities defined in the data model, viz. ``Gene`` and ``Disease`` in our case. + #. entities which are built in CubicWeb / Yams, viz. ``ExternalUri`` which define + URIs. + + As we are working with RDF data, each entity is defined through a series of URIs. Hence, + each "relational attribute" [#]_ of an entity is defined via an URI, that is, in CubicWeb + terms, via an ``ExternalUri`` entity. The entities are created, in the loop presented above, + as such:: + + ent = store.create_entity(etype, **entity) + + where ``etype`` is the appropriate entity type, either ``Gene`` or ``Disease``. + +.. [#] By "relational attribute" we denote an attribute (of an entity) which + is defined through a relation, e.g. 
the ``chromosomal_location`` attribute + of ``Disease`` entities, which is defined through a relation between a + ``Disease`` and an ``ExternalUri``. + + The ``ExternalUri`` entities are as many as URIs in the data file. For them, we define a unique + attribute, ``uri``, which holds the URI under discussion:: + + extu = store.create_entity('ExternalUri', uri="http://path/of/the/uri") + +#. creates the relations between the entities. We have relations between: + + #. entities defined in the schema, e.g. between ``Disease`` and ``Gene`` + entities, such as the ``associated_genes`` relation defined for + ``Disease`` entities. + #. entities defined in the schema and ``ExternalUri`` entities, such as ``gene_id``. + + The way relations are added to the database depends on the store: + + - for the stores in the CubicWeb ``dataimport`` module, we only use + ``store.relate``, in + another loop, on the relations (that is, a + loop inside the preceding one, mentioned at step 2):: + + for rtype, rels in relations.iteritems(): + ... + + store.relate(ent.eid(), rtype, extu.eid(), **kwargs) + + where ``kwargs`` is a dictionary designed to accommodate the need for specifying + the type of the subject entity of the relation, when the relation is inlined and + ``SQLGenObjectStore`` is used. For example:: + + ... + store.relate(ent.eid(), 'chromosomal_location', extu.eid(), subjtype='Disease') + + - for the ``MassiveObjectStore`` in the ``dataio`` cube's ``dataimport`` module, + the relations are created in three steps: + + #. first, a table is created for each relation type, as in:: + + ... + store.init_rtype_table(ent.dc_type(), rtype, extu.dc_type()) + + which comes down to lines such as:: + + store.init_rtype_table('Disease', 'associated_genes', 'Gene') + store.init_rtype_table('Gene', 'gene_id', 'ExternalUri') + + #. 
second, the URI of each entity will be used as its identifier, in the + ``relate_by_iid`` method, such as:: + + disease_uri = 'http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseases/3' + gene_uri = '. + +"""This module imports the Diseasome data into a CubicWeb instance. +""" + +# Python imports +import sys +import argparse + +# Logilab import, for timing +from logilab.common.decorators import timed + +# CubicWeb imports +import cubicweb.dataimport as cwdi +from cubes.dataio import dataimport as mcwdi + +# Diseasome parser import +import diseasome_parser as parser + +def _is_of_class(instance, class_name): + """Helper function to determine whether an instance is + of a specified class or not. + Returns a True if this is the case and False otherwise. + """ + if instance.__class__.__name__ == class_name: + return True + else: + return False + +@timed +def diseasome_import(session, file_name, store): + """Main function for importing Diseasome data. + + It uses the Diseasome data parser to get the contents of the + data from a file, then uses a store for importing the data + into a CubicWeb instance. 
+ + >>> diseasome_import(session, 'file_name', Store) + + """ + exturis = dict(session.execute('Any U, X WHERE X is ExternalUri, X uri U')) + uri_to_eid = {} + uri_to_etype = {} + all_relations = {} + etypes = {('http://www4.wiwiss.fu-berlin.de/' + 'diseasome/resource/diseasome/genes'): 'Gene', + ('http://www4.wiwiss.fu-berlin.de/' + 'diseasome/resource/diseasome/diseases'): 'Disease'} + # Read the parsed data + for entity, relations in parser.entities_from_rdf(file_name, + ('gene', 'disease')): + uri = entity.get('cwuri', None) + types = list(relations.get('types', [])) + if not types: + continue + etype = etypes.get(types[0]) + if not etype: + # Report the unknown type and skip the entity, since it cannot + # be created without a schema entity type. + sys.stderr.write('Entity type %s not recognized.\n' % types[0]) + sys.stderr.flush() + continue + if _is_of_class(store, 'MassiveObjectStore'): + for relation in (set(relations).intersection(('classes', + 'possible_drugs', 'omim', 'omim_page', + 'chromosomal_location', 'same_as', 'gene_id', + 'hgnc_id', 'hgnc_page'))): + store.init_rtype_table(etype, relation, 'ExternalUri') + for relation in set(relations).intersection(('subtype_of',)): + store.init_rtype_table(etype, relation, 'Disease') + for relation in set(relations).intersection(('associated_genes',)): + store.init_rtype_table(etype, relation, 'Gene') + # Create the entities + ent = store.create_entity(etype, **entity) + if not _is_of_class(store, 'MassiveObjectStore'): + uri_to_eid[uri] = ent.eid + uri_to_etype[uri] = ent.dc_type() + else: + uri_to_eid[uri] = uri + uri_to_etype[uri] = etype + # Store relations for after + all_relations[uri] = relations + # Perform a first commit, of the entities + store.flush() + kwargs = {} + for uri, relations in all_relations.iteritems(): + from_eid = uri_to_eid.get(uri) + # ``subjtype`` should be initialized if ``SQLGenObjectStore`` is used + # and there are inlined relations in the schema. 
+ # If ``subjtype`` is not given, while ``SQLGenObjectStore`` is used + # and there are inlined relations in the schema, the store + # tries to infer the type of the subject, but this does not always + # work, e.g. when there are several object types for the relation. + # ``subjtype`` is ignored for other stores, or if there are no + # inlined relations in the schema. + kwargs['subjtype'] = uri_to_etype.get(uri) + if not from_eid: + continue + for rtype, rels in relations.iteritems(): + if rtype in ('classes', 'possible_drugs', 'omim', 'omim_page', + 'chromosomal_location', 'same_as', 'gene_id', + 'hgnc_id', 'hgnc_page'): + for rel in list(rels): + if rel not in exturis: + # Create the "ExternalUri" entities, which are the + # objects of the relations + extu = store.create_entity('ExternalUri', uri=rel) + if not _is_of_class(store, 'MassiveObjectStore'): + rel_eid = extu.eid + else: + # For the "MassiveObjectStore", the EIDs are + # in fact the URIs. + rel_eid = rel + exturis[rel] = rel_eid + else: + rel_eid = exturis[rel] + # Create the relations that have "ExternalUri"s as objects + if not _is_of_class(store, 'MassiveObjectStore'): + store.relate(from_eid, rtype, rel_eid, **kwargs) + else: + store.relate_by_iid(from_eid, rtype, rel_eid) + elif rtype in ('subtype_of', 'associated_genes'): + for rel in list(rels): + to_eid = uri_to_eid.get(rel) + if to_eid: + # Create relations that have objects of other type + # than "ExternalUri" + if not _is_of_class(store, 'MassiveObjectStore'): + store.relate(from_eid, rtype, to_eid, **kwargs) + else: + store.relate_by_iid(from_eid, rtype, to_eid) + else: + sys.stderr.write('Missing entity with URI %s ' + 'for relation %s' % (rel, rtype)) + sys.stderr.flush() + # Perform a second commit, of the "ExternalUri" entities. + # when the stores in the CubicWeb ``dataimport`` module are used, + # relations are also committed. 
+ store.flush() + # If the ``MassiveObjectStore`` is used, then entity and relation metadata + # are pushed as well. By metadata we mean information on the creation + # time and author. + if _is_of_class(store, 'MassiveObjectStore'): + store.flush_meta_data() + for relation in ('classes', 'possible_drugs', 'omim', 'omim_page', + 'chromosomal_location', 'same_as'): + # Afterwards, relations are actually created in the database. + store.convert_relations('Disease', relation, 'ExternalUri', + 'cwuri', 'uri') + store.convert_relations('Disease', 'subtype_of', 'Disease', + 'cwuri', 'cwuri') + store.convert_relations('Disease', 'associated_genes', 'Gene', + 'cwuri', 'cwuri') + for relation in ('gene_id', 'hgnc_id', 'hgnc_page', 'same_as'): + store.convert_relations('Gene', relation, 'ExternalUri', + 'cwuri', 'uri') + # Clean up temporary tables in the database + store.cleanup() + +if __name__ == '__main__': + # Change sys.argv so that ``cubicweb-ctl shell`` can work out the options + # we give to our ``diseasome_import.py`` script. 
+ sys.argv = [arg for + arg in sys.argv[sys.argv.index("--") - 1:] if arg != "--"] + PARSER = argparse.ArgumentParser(description="Import Diseasome data") + PARSER.add_argument("-df", "--datafile", type=str, + help="RDF data file name") + PARSER.add_argument("-st", "--store", type=str, + default="RQLObjectStore", + help="data import store") + ARGS = PARSER.parse_args() + if ARGS.datafile: + FILENAME = ARGS.datafile + if ARGS.store in (st + "ObjectStore" for + st in ("RQL", "NoHookRQL", "SQLGen")): + IMPORT_STORE = getattr(cwdi, ARGS.store)(session) + elif ARGS.store == "MassiveObjectStore": + IMPORT_STORE = mcwdi.MassiveObjectStore(session) + else: + sys.exit("Import store unknown") + diseasome_import(session, FILENAME, IMPORT_STORE) + else: + sys.exit("Data file not found or not specified") diff -r 3612b760488b -r 8a57802d40d3 doc/tutorials/dataimport/diseasome_parser.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/doc/tutorials/dataimport/diseasome_parser.py Tue Mar 12 18:31:15 2013 +0100 @@ -0,0 +1,100 @@ +# -*- coding: utf-8 -*- + +""" +Diseasome data import module. +Its interface is the ``entities_from_rdf`` function. +""" + +import re +RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.') +RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.') + +MAPPING_ATTS = {'bio2rdfSymbol': 'bio2rdf_symbol', + 'label': 'label', + 'name': 'name', + 'classDegree': 'class_degree', + 'degree': 'degree', + 'size': 'size'} + +MAPPING_RELS = {'geneId': 'gene_id', + 'hgncId': 'hgnc_id', + 'hgncIdPage': 'hgnc_page', + 'sameAs': 'same_as', + 'class': 'classes', + 'diseaseSubtypeOf': 'subtype_of', + 'associatedGene': 'associated_genes', + 'possibleDrug': 'possible_drugs', + 'type': 'types', + 'omim': 'omim', + 'omimPage': 'omim_page', + 'chromosomalLocation': 'chromosomal_location'} + +def _retrieve_reltype(uri): + """ + Retrieve a relation type from an URI. 
+
+    Internal function which takes a URI containing a relation type as input
+    and returns the name of the relation.
+    If no URI string is given, then the function returns None.
+    """
+    if uri:
+        return uri.rsplit('/', 1)[-1].rsplit('#', 1)[-1]
+
+def _retrieve_etype(tri_uri):
+    """
+    Retrieve entity type from a triple of URIs.
+
+    Internal function which takes a line holding a triple of URIs as
+    input and returns the type of the entity, as obtained from the
+    first member of the triple.
+    """
+    if tri_uri:
+        return tri_uri.split('> <')[0].rsplit('/', 2)[-2].rstrip('s')
+
+def _retrieve_structure(filename, etypes):
+    """
+    Retrieve a (subject, relation, object) tuples iterator from a file.
+
+    Internal function which takes as input a file name and a tuple of
+    entity types, and returns an iterator of (subject, relation, object)
+    tuples.
+    """
+    with open(filename) as fil:
+        for line in fil:
+            if _retrieve_etype(line) not in etypes:
+                continue
+            match = RE_RELS.match(line) or RE_ATTS.match(line)
+            if not match:
+                continue  # neither a relation nor an attribute triple
+            subj = match.group(1)
+            relation = _retrieve_reltype(match.group(2))
+            obj = match.group(3)
+            yield subj, relation, obj
+
+def entities_from_rdf(filename, etypes):
+    """
+    Return entities from an RDF file.
+
+    Module interface function which takes as input a file name and
+    a tuple of entity types, and returns an iterator over the
+    attributes and relations of each entity. The attributes
+    and relations are retrieved as dictionaries.
+
+    >>> for entities, relations in entities_from_rdf('data_file',
+                                                     ('type_1', 'type_2')):
+        ...
+    """
+    entities = {}
+    for subj, rel, obj in _retrieve_structure(filename, etypes):
+        entities.setdefault(subj, {})
+        entities[subj].setdefault('attributes', {})
+        entities[subj].setdefault('relations', {})
+        entities[subj]['attributes'].setdefault('cwuri', unicode(subj))
+        if rel in MAPPING_ATTS:
+            entities[subj]['attributes'].setdefault(MAPPING_ATTS[rel],
+                                                    unicode(obj))
+        if rel in MAPPING_RELS:
+            entities[subj]['relations'].setdefault(MAPPING_RELS[rel], set())
+            entities[subj]['relations'][MAPPING_RELS[rel]].add(unicode(obj))
+    return ((ent.get('attributes'), ent.get('relations'))
+            for ent in entities.itervalues())
diff -r 3612b760488b -r 8a57802d40d3 doc/tutorials/dataimport/schema.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/doc/tutorials/dataimport/schema.py Tue Mar 12 18:31:15 2013 +0100
@@ -0,0 +1,136 @@
+# -*- coding: utf-8 -*-
+# copyright 2012 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
+# contact http://www.logilab.fr -- mailto:contact@logilab.fr
+#
+# This program is free software: you can redistribute it and/or modify it under
+# the terms of the GNU Lesser General Public License as published by the Free
+# Software Foundation, either version 2.1 of the License, or (at your option)
+# any later version.
+#
+# This program is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+# FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more
+# details.
+#
+# You should have received a copy of the GNU Lesser General Public License along
+# with this program. If not, see <http://www.gnu.org/licenses/>.
+
+"""cubicweb-diseasome schema"""
+
+from yams.buildobjs import EntityType, SubjectRelation, String, Int
+from cubicweb.schemas.base import ExternalUri
+
+
+class Disease(EntityType):
+    """Disease entity definition.
+
+    A Disease entity is characterized by several attributes which are
+    defined by URIs:
+
+    - a name, which we define as a CubicWeb / Yams String object
+    - a label, also defined as a Yams String
+    - a class degree, defined as a Yams Int (that is, an integer)
+    - a degree, also defined as a Yams Int
+    - size, also defined as an Int
+    - classes, defined as a set containing zero, one or several objects
+      identified by their URIs, that is, objects of type ``ExternalUri``
+    - subtype_of, defined as a set containing zero, one or several
+      objects of type ``Disease``
+    - associated_genes, defined as a set containing zero, one or several
+      objects of type ``Gene``, that is, genes associated with the
+      disease
+    - possible_drugs, defined as a set containing zero, one or several
+      objects, identified by their URIs, that is, of type ``ExternalUri``
+    - omim and omim_page are identifiers in the OMIM (Online Mendelian
+      Inheritance in Man) database, which contains an inventory of "human
+      genes and genetic phenotypes"
+      (see http://www.ncbi.nlm.nih.gov/omim). Given that a disease
+      only has unique omim and omim_page identifiers, when it has them,
+      these attributes have been defined through relations such that
+      for each disease there is at most one omim and one omim_page.
+      Each such identifier is defined through a URI, that is, through
+      an ``ExternalUri`` entity.
+      That is, these relations are of cardinality "?*". For optimization
+      purposes, they are defined as inlined, by setting the ``inlined``
+      keyword argument to ``True``.
+    - chromosomal_location is also defined through a relation of
+      cardinality "?*", since any disease has at most one chromosomal
+      location associated with it.
+    - same_as is also defined through a URI, and hence through a
+      relation having ``ExternalUri`` entities as objects.
+
+    For more information on this data set and the data set itself,
+    please consult http://datahub.io/dataset/fu-berlin-diseasome.
+    """
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/name
+    name = String(maxsize=256, fulltextindexed=True)
+    # Corresponds to http://www.w3.org/2000/01/rdf-schema#label
+    label = String(maxsize=512, fulltextindexed=True)
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/classDegree
+    class_degree = Int()
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/degree
+    degree = Int()
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/size
+    size = Int()
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/class
+    classes = SubjectRelation('ExternalUri', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/diseaseSubtypeOf
+    subtype_of = SubjectRelation('Disease', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/associatedGene
+    associated_genes = SubjectRelation('Gene', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/possibleDrug
+    possible_drugs = SubjectRelation('ExternalUri', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/omim
+    omim = SubjectRelation('ExternalUri', cardinality='?*', inlined=True)
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/omimPage
+    omim_page = SubjectRelation('ExternalUri', cardinality='?*', inlined=True)
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/chromosomalLocation
+    chromosomal_location = SubjectRelation('ExternalUri', cardinality='?*',
+                                           inlined=True)
+    # Corresponds to http://www.w3.org/2002/07/owl#sameAs
+    same_as = SubjectRelation('ExternalUri', cardinality='**')
+
+
+class Gene(EntityType):
+    """Gene entity definition.
+
+    A gene is characterized by the following attributes:
+
+    - label, defined through a Yams String.
+    - bio2rdf_symbol, also defined as a Yams String, since it is
+      just an identifier.
+    - gene_id is a URI identifying a gene, hence it is defined
+      as a relation with an ``ExternalUri`` object.
+    - a pair of unique identifiers in the HUGO Gene Nomenclature
+      Committee (http://www.genenames.org/). They are defined
+      as ``ExternalUri`` entities as well.
+    - same_as is also defined through a URI, and hence through a
+      relation having ``ExternalUri`` entities as objects.
+    """
+    # Corresponds to http://www.w3.org/2000/01/rdf-schema#label
+    label = String(maxsize=512, fulltextindexed=True)
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/geneId
+    gene_id = SubjectRelation('ExternalUri', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/hgncId
+    hgnc_id = SubjectRelation('ExternalUri', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/hgncIdPage
+    hgnc_page = SubjectRelation('ExternalUri', cardinality='**')
+    # Corresponds to http://www4.wiwiss.fu-berlin.de/diseasome/resource/
+    # diseasome/bio2rdfSymbol
+    bio2rdf_symbol = String(maxsize=64, fulltextindexed=True)
+    # Corresponds to http://www.w3.org/2002/07/owl#sameAs
+    same_as = SubjectRelation('ExternalUri', cardinality='**')
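
The triple-matching step of ``diseasome_parser.py`` can be tried outside CubicWeb. The following is a minimal sketch and not part of the changeset: it reuses the two regular expressions from the parser, targets Python 3 (the modules in this patch are written for Python 2), and introduces a hypothetical helper ``parse_triple`` together with made-up ``example.org`` sample lines.

```python
import re

# Same patterns as in diseasome_parser.py.
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')


def parse_triple(line):
    """Hypothetical helper: return (subject, relation name, object),
    or None when the line matches neither pattern."""
    match = RE_RELS.match(line) or RE_ATTS.match(line)
    if match is None:
        return None
    subj, predicate, obj = match.group(1), match.group(2), match.group(3)
    # Keep only the last path segment (or fragment) of the predicate URI,
    # as _retrieve_reltype does.
    relation = predicate.rsplit('/', 1)[-1].rsplit('#', 1)[-1]
    return subj, relation, obj


# Made-up sample lines; the example.org URIs are assumptions for illustration.
SAMPLE = [
    '<http://example.org/diseases/1> <http://example.org/vocab/name> "Asthma" .',
    '<http://example.org/diseases/1> <http://www.w3.org/2002/07/owl#sameAs> '
    '<http://example.org/other/1> .',
    'not a triple',
]

for sample_line in SAMPLE:
    print(parse_triple(sample_line))
```

Lines matching neither pattern yield ``None``, so a caller can skip them instead of failing on ``match.group``.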