doc/tutorials/dataimport/diseasome_parser.py
author Sylvain Thénault <sylvain.thenault@logilab.fr>
Fri, 26 Jun 2015 13:04:25 +0200
changeset 10473 23a2fa8cb725
parent 9702 c2108dbfb508
child 10663 54b8a1f249fb
permissions -rw-r--r--
avoid sanitizing warnings
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     1
# -*- coding: utf-8 -*-
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     2
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     3
"""
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     4
Diseasome data import module.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     5
Its interface is the ``entities_from_rdf`` function.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     6
"""
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     7
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     8
import re
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
     9
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    10
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    11
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    12
MAPPING_ATTS = {'bio2rdfSymbol': 'bio2rdf_symbol',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    13
                'label': 'label',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    14
                'name': 'name',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    15
                'classDegree': 'class_degree',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    16
                'degree': 'degree',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    17
                'size': 'size'}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    18
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    19
MAPPING_RELS = {'geneId': 'gene_id',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    20
                'hgncId': 'hgnc_id', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    21
                'hgncIdPage': 'hgnc_page', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    22
                'sameAs': 'same_as', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    23
                'class': 'classes', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    24
                'diseaseSubtypeOf': 'subtype_of', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    25
                'associatedGene': 'associated_genes', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    26
                'possibleDrug': 'possible_drugs',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    27
                'type': 'types',
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    28
                'omim': 'omim', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    29
                'omimPage': 'omim_page', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    30
                'chromosomalLocation': 'chromosomal_location'}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    31
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    32
def _retrieve_reltype(uri):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    33
    """
9702
c2108dbfb508 an URI -> a URI
Rémi Cardona <remi.cardona@logilab.fr>
parents: 8836
diff changeset
    34
    Retrieve a relation type from a URI.
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    35
9702
c2108dbfb508 an URI -> a URI
Rémi Cardona <remi.cardona@logilab.fr>
parents: 8836
diff changeset
    36
    Internal function which takes a URI containing a relation type as input
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    37
    and returns the name of the relation.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    38
    If no URI string is given, then the function returns None.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    39
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    40
    if uri:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    41
        return uri.rsplit('/', 1)[-1].rsplit('#', 1)[-1]
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    42
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    43
def _retrieve_etype(tri_uri):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    44
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    45
    Retrieve entity type from a triple of URIs.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    46
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    47
    Internal function whith takes a tuple of three URIs as input
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    48
    and returns the type of the entity, as obtained from the
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    49
    first member of the tuple.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    50
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    51
    if tri_uri:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    52
        return tri_uri.split('> <')[0].rsplit('/', 2)[-2].rstrip('s')
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    53
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    54
def _retrieve_structure(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    55
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    56
    Retrieve a (subject, relation, object) tuples iterator from a file.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    57
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    58
    Internal function which takes as input a file name and a tuple of 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    59
    entity types, and returns an iterator of (subject, relation, object)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    60
    tuples.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    61
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    62
    with open(filename) as fil:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    63
        for line in fil:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    64
            if _retrieve_etype(line) not in etypes:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    65
                continue
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    66
            match = RE_RELS.match(line)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    67
            if not match:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    68
                match = RE_ATTS.match(line)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    69
            subj = match.group(1)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    70
            relation = _retrieve_reltype(match.group(2))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    71
            obj = match.group(3)
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    72
            yield subj, relation, obj
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    73
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    74
def entities_from_rdf(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    75
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    76
    Return entities from an RDF file.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    77
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    78
    Module interface function which takes as input a file name and
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    79
    a tuple of entity types, and returns an iterator on the 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    80
    attributes and relations of each entity. The attributes
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    81
    and relations are retrieved as dictionaries.
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    82
    
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    83
    >>> for entities, relations in entities_from_rdf('data_file', 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    84
                                                     ('type_1', 'type_2')):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    85
        ...
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    86
    """
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    87
    entities = {}
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    88
    for subj, rel, obj in _retrieve_structure(filename, etypes):
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    89
        entities.setdefault(subj, {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    90
        entities[subj].setdefault('attributes', {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    91
        entities[subj].setdefault('relations', {})
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    92
        entities[subj]['attributes'].setdefault('cwuri', unicode(subj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    93
        if rel in MAPPING_ATTS:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    94
            entities[subj]['attributes'].setdefault(MAPPING_ATTS[rel], 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    95
                                                    unicode(obj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    96
        if rel in MAPPING_RELS:
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    97
            entities[subj]['relations'].setdefault(MAPPING_RELS[rel], set())
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    98
            entities[subj]['relations'][MAPPING_RELS[rel]].add(unicode(obj))
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
    99
    return ((ent.get('attributes'), ent.get('relations')) 
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
parents:
diff changeset
   100
            for ent in entities.itervalues())