author | Sylvain Thénault <sylvain.thenault@logilab.fr> |
Fri, 30 Sep 2016 17:36:02 +0200 | |
changeset 11755 | 96ced95e4002 |
parent 10663 | 54b8a1f249fb |
permissions | -rw-r--r-- |
8836
8a57802d40d3
[cubicweb/doc] Add tutorial on data import in CubicWeb.
Vladimir Popescu <vladimir.popescu@logilab.fr>
# -*- coding: utf-8 -*-

"""
Diseasome data import module.
Its interface is the ``entities_from_rdf`` function.
"""

import re

try:
    unicode  # Python 2: the built-in text type exists
except NameError:
    unicode = str  # Python 3: ``str`` is the text type

# Triple whose object is a URI (a relation between two resources).
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
# Triple whose object is a quoted literal, optionally typed (an attribute).
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')

# RDF predicate name -> attribute name used in the returned dictionaries.
MAPPING_ATTS = {'bio2rdfSymbol': 'bio2rdf_symbol',
                'label': 'label',
                'name': 'name',
                'classDegree': 'class_degree',
                'degree': 'degree',
                'size': 'size'}

# RDF predicate name -> relation name used in the returned dictionaries.
MAPPING_RELS = {'geneId': 'gene_id',
                'hgncId': 'hgnc_id',
                'hgncIdPage': 'hgnc_page',
                'sameAs': 'same_as',
                'class': 'classes',
                'diseaseSubtypeOf': 'subtype_of',
                'associatedGene': 'associated_genes',
                'possibleDrug': 'possible_drugs',
                'type': 'types',
                'omim': 'omim',
                'omimPage': 'omim_page',
                'chromosomalLocation': 'chromosomal_location'}

def _retrieve_reltype(uri):
    """
    Retrieve a relation type from a URI.

    Internal function which takes a URI containing a relation type as input
    and returns the name of the relation.
    If no URI string is given, then the function returns None.
    """
    if uri:
        return uri.rsplit('/', 1)[-1].rsplit('#', 1)[-1]

def _retrieve_etype(tri_uri):
    """
    Retrieve entity type from a triple of URIs.

    Internal function which takes a tuple of three URIs as input
    and returns the type of the entity, as obtained from the
    first member of the tuple.
    """
    if tri_uri:
        return tri_uri.split('> <')[0].rsplit('/', 2)[-2].rstrip('s')

def _retrieve_structure(filename, etypes):
    """
    Retrieve an iterator of (subject, relation, object) tuples from a file.

    Internal function which takes as input a file name and a tuple of
    entity types, and returns an iterator of (subject, relation, object)
    tuples.
    """
    with open(filename) as fil:
        for line in fil:
            if _retrieve_etype(line) not in etypes:
                continue
            match = RE_RELS.match(line)
            if not match:
                match = RE_ATTS.match(line)
            if not match:
                # Neither a relation nor an attribute triple: skip the line
                # instead of failing on ``match.group``.
                continue
            subj = match.group(1)
            relation = _retrieve_reltype(match.group(2))
            obj = match.group(3)
            yield subj, relation, obj

def entities_from_rdf(filename, etypes):
    """
    Return entities from an RDF file.

    Module interface function which takes as input a file name and
    a tuple of entity types, and returns an iterator over the
    attributes and relations of each entity. The attributes
    and relations are retrieved as dictionaries.

    >>> for entities, relations in entities_from_rdf('data_file',
                                                     ('type_1', 'type_2')):
        ...
    """
    entities = {}
    for subj, rel, obj in _retrieve_structure(filename, etypes):
        entities.setdefault(subj, {})
        entities[subj].setdefault('attributes', {})
        entities[subj].setdefault('relations', {})
        entities[subj]['attributes'].setdefault('cwuri', unicode(subj))
        if rel in MAPPING_ATTS:
            entities[subj]['attributes'].setdefault(MAPPING_ATTS[rel],
                                                    unicode(obj))
        if rel in MAPPING_RELS:
            entities[subj]['relations'].setdefault(MAPPING_RELS[rel], set())
            entities[subj]['relations'][MAPPING_RELS[rel]].add(unicode(obj))
    # The last line below was changed in changeset 10663 (54b8a1f249fb,
    # parent 9702) by Rémi Cardona <remi.cardona@logilab.fr>:
    # [py3k] dict.itervalues → dict.values
    return ((ent.get('attributes'), ent.get('relations'))
            for ent in entities.values())
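
The line-parsing step can be tried in isolation. What follows is a minimal sketch, not part of the module: it copies the two regular expressions and the URI-splitting expression from the listing and applies them to sample lines. The `parse_line` helper, the `example.org` URIs and the literal value are hypothetical, made up purely for illustration.

```python
import re

# Same patterns as RE_RELS and RE_ATTS in the module.
RE_RELS = re.compile(r'^<(.*?)>\s<(.*?)>\s<(.*?)>\s*\.')
RE_ATTS = re.compile(r'^<(.*?)>\s<(.*?)>\s"(.*)"(\^\^<(.*?)>|)\s*\.')


def parse_line(line):
    """Hypothetical helper: return (subject, relation, object) or None."""
    match = RE_RELS.match(line) or RE_ATTS.match(line)
    if match is None:
        return None
    # Keep only the last path segment or fragment of the predicate URI,
    # which is what _retrieve_reltype does in the module.
    relation = match.group(2).rsplit('/', 1)[-1].rsplit('#', 1)[-1]
    return match.group(1), relation, match.group(3)


# Made-up sample lines, for illustration only.
REL_LINE = ('<http://example.org/diseases/1> '
            '<http://example.org/diseasome/associatedGene> '
            '<http://example.org/genes/42> .')
ATT_LINE = ('<http://example.org/diseases/1> '
            '<http://www.w3.org/2000/01/rdf-schema#label> '
            '"Some disease" .')

print(parse_line(REL_LINE))
print(parse_line(ATT_LINE))
print(parse_line('# a comment line'))
```

A line whose object is a URI is picked up by the first pattern and a line whose object is a quoted literal by the second; anything else yields `None` in this sketch, so the caller can decide whether to skip it.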