doc/tutorials/dataimport/diseasome_parser.py
author Julien Cristau <julien.cristau@logilab.fr>
Wed, 05 Feb 2014 16:34:21 +0100
branch stable
changeset 9523 cd5738fc440f
parent 8836 8a57802d40d3
child 9702 c2108dbfb508
permissions -rw-r--r--
[ajax] use a custom tag to handle dynamically loaded js Using <pre class="script"> makes it trivial for a malicious user to inject arbitrary javascript into a html or rest text element (because it looks innocent to the html sanitizer). Using a custom tag we can be sure that it actually comes from our code and not from untrusted user data. IE ignores custom tags, though, so we put it in its own namespace. https://extranet.logilab.fr/1530578
8836
8a57802d40d3 [cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
# -*- coding: utf-8 -*-

"""
Diseasome data import module.
Its interface is the ``entities_from_rdf`` function.
"""

import re
# Triple whose object is a URI: <subject> <predicate> <object> .
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
# Triple whose object is a literal, optionally typed: <subject> <predicate> "value"^^<type> .
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')

MAPPING_ATTS = {'bio2rdfSymbol': 'bio2rdf_symbol',
                'label': 'label',
                'name': 'name',
                'classDegree': 'class_degree',
                'degree': 'degree',
                'size': 'size'}

MAPPING_RELS = {'geneId': 'gene_id',
                'hgncId': 'hgnc_id',
                'hgncIdPage': 'hgnc_page',
                'sameAs': 'same_as',
                'class': 'classes',
                'diseaseSubtypeOf': 'subtype_of',
                'associatedGene': 'associated_genes',
                'possibleDrug': 'possible_drugs',
                'type': 'types',
                'omim': 'omim',
                'omimPage': 'omim_page',
                'chromosomalLocation': 'chromosomal_location'}

def _retrieve_reltype(uri):
    """
    Retrieve a relation type from a URI.

    Internal function which takes a URI containing a relation type as input
    and returns the name of the relation.
    If no URI string is given, then the function returns None.
    """
    if uri:
        return uri.rsplit('/', 1)[-1].rsplit('#', 1)[-1]

def _retrieve_etype(tri_uri):
    """
    Retrieve entity type from a triple of URIs.

    Internal function which takes a line holding a triple of URIs as input
    and returns the type of the entity, as obtained from the
    first member of the triple.
    """
    if tri_uri:
        return tri_uri.split('> <')[0].rsplit('/', 2)[-2].rstrip('s')

def _retrieve_structure(filename, etypes):
    """
    Return an iterator of (subject, relation, object) tuples from a file.

    Internal function which takes as input a file name and a tuple of
    entity types, and returns an iterator of (subject, relation, object)
    tuples. Lines that match neither triple pattern are skipped.
    """
    with open(filename) as fil:
        for line in fil:
            if _retrieve_etype(line) not in etypes:
                continue
            match = RE_RELS.match(line)
            if not match:
                match = RE_ATTS.match(line)
            if not match:
                # Neither a relation nor an attribute triple: skip the line.
                continue
            subj = match.group(1)
            relation = _retrieve_reltype(match.group(2))
            obj = match.group(3)
            yield subj, relation, obj

def entities_from_rdf(filename, etypes):
    """
    Return entities from an RDF file.

    Module interface function which takes as input a file name and
    a tuple of entity types, and returns an iterator over the
    attributes and relations of each entity. The attributes
    and relations are retrieved as dictionaries.

    >>> for attributes, relations in entities_from_rdf('data_file',
    ...                                                ('type_1', 'type_2')):
    ...     ...
    """
    entities = {}
    for subj, rel, obj in _retrieve_structure(filename, etypes):
        entities.setdefault(subj, {})
        entities[subj].setdefault('attributes', {})
        entities[subj].setdefault('relations', {})
        entities[subj]['attributes'].setdefault('cwuri', unicode(subj))
        if rel in MAPPING_ATTS:
            entities[subj]['attributes'].setdefault(MAPPING_ATTS[rel],
                                                    unicode(obj))
        if rel in MAPPING_RELS:
            entities[subj]['relations'].setdefault(MAPPING_RELS[rel], set())
            entities[subj]['relations'][MAPPING_RELS[rel]].add(unicode(obj))
    return ((ent.get('attributes'), ent.get('relations'))
            for ent in entities.itervalues())
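
The listing above reads a file, so its behaviour is easier to follow on a few in-memory lines. The following is a minimal, self-contained sketch of the same steps (match a line, reduce the predicate URI to a short name, derive the entity type, group by subject), written so that it also runs on Python 3. The sample triples, the `example.org` URIs, the helper names `parse_line`, `etype_of` and `group`, and the reduced mapping tables are assumptions made for this illustration only; they are not taken from the Diseasome dump or from the module.

```python
import re

# Same two patterns as in the module.
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')

# Reduced, illustrative mappings (the module defines the full ones).
MAPPING_ATTS = {'name': 'name', 'label': 'label'}
MAPPING_RELS = {'associatedGene': 'associated_genes'}

# Hypothetical sample lines, invented for this sketch.
DISEASE = 'http://example.org/diseasome/diseases/1'
GENE = 'http://example.org/diseasome/genes/G1'
SAMPLE = [
    '<%s> <http://example.org/ns#name> "Asthma" .' % DISEASE,
    '<%s> <http://example.org/ns/associatedGene> <%s> .' % (DISEASE, GENE),
    '<%s> <http://example.org/ns#label> "G1" .' % GENE,
    'not a triple',
]


def parse_line(line):
    """Return (subject, relation name, object), or None for a non-triple."""
    match = RE_RELS.match(line) or RE_ATTS.match(line)
    if match is None:
        return None
    # Keep only the last path segment, then the part after any '#'.
    relation = match.group(2).rsplit('/', 1)[-1].rsplit('#', 1)[-1]
    return match.group(1), relation, match.group(3)


def etype_of(subject_uri):
    """Second-to-last path segment of the subject, trailing 's' stripped."""
    return subject_uri.rsplit('/', 2)[-2].rstrip('s')


def group(triples):
    """Collect attributes and relations per subject, as the module does."""
    entities = {}
    for subj, rel, obj in triples:
        ent = entities.setdefault(
            subj, {'attributes': {'cwuri': subj}, 'relations': {}})
        if rel in MAPPING_ATTS:
            # First value wins for attributes.
            ent['attributes'].setdefault(MAPPING_ATTS[rel], obj)
        if rel in MAPPING_RELS:
            # Relations accumulate every target URI in a set.
            ent['relations'].setdefault(MAPPING_RELS[rel], set()).add(obj)
    return entities


triples = [t for t in map(parse_line, SAMPLE) if t is not None]
disease_triples = [t for t in triples if etype_of(t[0]) == 'disease']
entities = group(disease_triples)
```

In this sketch the entity type is derived from the subject URI after parsing, whereas the module filters on the raw line before matching; the result is the same for well-formed lines. Attributes keep the first value seen, while relations collect all targets, which mirrors the `setdefault` and `set()` usage in `entities_from_rdf`.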