doc/tutorials/dataimport/diseasome_parser.py
author Denis Laxalde <denis.laxalde@logilab.fr>
Mon, 06 Mar 2017 14:19:20 +0100
changeset 12008 7694dcf5ad30
parent 10663 54b8a1f249fb
permissions -rw-r--r--
[etwist] Do not call repository's start_looping_tasks anymore and warn about this We are about to drop this method from Repository class and replace it by a blocking alternative. This is not compatible with how things currently work in a Twisted server implementation. So do not start repository "looping tasks" in Twisted server anymore and issue a warning about this. If someone is interested in restoring the "all-in-one" behavior where the repository runs within a Twisted server, they may start by implementing repository looping tasks using a Twisted mechanism such as, e.g., http://twistedmatrix.com/documents/current/core/howto/time.html and eventually provide the repository with a compatible scheduler instance so that is can register its periodic tasks. At the moment, we lack resources to do this (and maintain the Twisted server of CubicWeb in general). Related to #17057223.
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     1
# -*- coding: utf-8 -*-
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     2
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     3
"""
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     4
Diseasome data import module.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     5
Its interface is the ``entities_from_rdf`` function.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     6
"""
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     7
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     8
import re
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     9
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    10
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    11
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    12
MAPPING_ATTS = {'bio2rdfSymbol': 'bio2rdf_symbol',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    13
                'label': 'label',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    14
                'name': 'name',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    15
                'classDegree': 'class_degree',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    16
                'degree': 'degree',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    17
                'size': 'size'}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    18
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    19
MAPPING_RELS = {'geneId': 'gene_id',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    20
                'hgncId': 'hgnc_id', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    21
                'hgncIdPage': 'hgnc_page', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    22
                'sameAs': 'same_as', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    23
                'class': 'classes', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    24
                'diseaseSubtypeOf': 'subtype_of', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    25
                'associatedGene': 'associated_genes', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    26
                'possibleDrug': 'possible_drugs',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    27
                'type': 'types',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    28
                'omim': 'omim', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    29
                'omimPage': 'omim_page', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    30
                'chromosomalLocation': 'chromosomal_location'}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    31
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    32
def _retrieve_reltype(uri):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    33
    """
9702
c2108dbfb508 an URI -> a URI
Rémi Cardona <remi.cardona@logilab.fr>
parents: 8836
diff changeset
    34
    Retrieve a relation type from a URI.
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    35
9702
c2108dbfb508 an URI -> a URI
Rémi Cardona <remi.cardona@logilab.fr>
parents: 8836
diff changeset
    36
    Internal function which takes a URI containing a relation type as input
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    37
    and returns the name of the relation.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    38
    If no URI string is given, then the function returns None.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    39
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    40
    if uri:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    41
        return uri.rsplit('/', 1)[-1].rsplit('#', 1)[-1]
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    42
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    43
def _retrieve_etype(tri_uri):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    44
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    45
    Retrieve entity type from a triple of URIs.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    46
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    47
    Internal function whith takes a tuple of three URIs as input
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    48
    and returns the type of the entity, as obtained from the
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    49
    first member of the tuple.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    50
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    51
    if tri_uri:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    52
        return tri_uri.split('> <')[0].rsplit('/', 2)[-2].rstrip('s')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    53
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    54
def _retrieve_structure(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    55
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    56
    Retrieve a (subject, relation, object) tuples iterator from a file.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    57
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    58
    Internal function which takes as input a file name and a tuple of 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    59
    entity types, and returns an iterator of (subject, relation, object)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    60
    tuples.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    61
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    62
    with open(filename) as fil:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    63
        for line in fil:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    64
            if _retrieve_etype(line) not in etypes:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    65
                continue
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    66
            match = RE_RELS.match(line)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    67
            if not match:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    68
                match = RE_ATTS.match(line)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    69
            subj = match.group(1)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    70
            relation = _retrieve_reltype(match.group(2))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    71
            obj = match.group(3)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    72
            yield subj, relation, obj
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    73
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    74
def entities_from_rdf(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    75
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    76
    Return entities from an RDF file.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    77
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    78
    Module interface function which takes as input a file name and
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    79
    a tuple of entity types, and returns an iterator on the 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    80
    attributes and relations of each entity. The attributes
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    81
    and relations are retrieved as dictionaries.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    82
    
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    83
    >>> for entities, relations in entities_from_rdf('data_file', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    84
                                                     ('type_1', 'type_2')):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    85
        ...
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    86
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    87
    entities = {}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    88
    for subj, rel, obj in _retrieve_structure(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    89
        entities.setdefault(subj, {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    90
        entities[subj].setdefault('attributes', {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    91
        entities[subj].setdefault('relations', {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    92
        entities[subj]['attributes'].setdefault('cwuri', unicode(subj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    93
        if rel in MAPPING_ATTS:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    94
            entities[subj]['attributes'].setdefault(MAPPING_ATTS[rel], 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    95
                                                    unicode(obj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    96
        if rel in MAPPING_RELS:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    97
            entities[subj]['relations'].setdefault(MAPPING_RELS[rel], set())
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    98
            entities[subj]['relations'][MAPPING_RELS[rel]].add(unicode(obj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    99
    return ((ent.get('attributes'), ent.get('relations')) 
10663
54b8a1f249fb [py3k] dict.itervalues → dict.values
Rémi Cardona <remi.cardona@logilab.fr>
parents: 9702
diff changeset
   100
            for ent in entities.values())