cubicweb/server/sources/datafeed.py
author Sylvain Thénault <sylvain.thenault@logilab.fr>
Fri, 30 Sep 2016 17:36:40 +0200
changeset 11756 60fed6272771
parent 11740 dabbb2a4a493
child 11757 e845746b4d3c
permissions -rw-r--r--
[repository] Drop deprecated extid2eid API and friends

This will break cwxmlparser based sources. They should be rewritten using a specific parser, based on xml representation or on rqlio.

This is harsh but allows a very large cleanup of the code base. Furthermore, it's necessary for asource/extid handling in the entities table which is costly for most apps that don't care at all about that...

In this cset, delete:

* all extid2eid methods
* repo._extid_cache handling
* [before/after]_entity_insertion source callback
* the cwxmlparser

and update related tests, notably unittest_datafeed where 'repull' testing has been removed, since it's now handled by the dataimport API and should not be retested there.

Related to #15538288
Closes #15538383
# copyright 2010-2016 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
# contact http://www.logilab.fr/ -- mailto:contact@logilab.fr
#
# This file is part of CubicWeb.
#
# CubicWeb is free software: you can redistribute it and/or modify it under the
# terms of the GNU Lesser General Public License as published by the Free
# Software Foundation, either version 2.1 of the License, or (at your option)
# any later version.
#
# CubicWeb is distributed in the hope that it will be useful, but WITHOUT
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
# FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more
# details.
#
# You should have received a copy of the GNU Lesser General Public License along
# with CubicWeb.  If not, see <http://www.gnu.org/licenses/>.
"""datafeed sources: copy data from an external data stream into the system
database
"""

from io import BytesIO
from os.path import exists
from datetime import datetime, timedelta
from functools import partial

from six.moves.urllib.parse import urlparse
from six.moves.urllib.request import Request, build_opener, HTTPCookieProcessor
from six.moves.urllib.error import HTTPError
from six.moves.http_cookiejar import CookieJar

from pytz import utc
from lxml import etree

from logilab.common.deprecation import deprecated

from cubicweb import RegistryNotFound, ObjectNotFound, ValidationError, SourceException
from cubicweb.server.sources import AbstractSource
from cubicweb.appobject import AppObject


class DataFeedSource(AbstractSource):
    use_cwuri_as_url = True

    options = (
        ('synchronize',
         {'type' : 'yn',
          'default': True,
          'help': ('Is the repository responsible to automatically import '
                   'content from this source? '
                   'You should say yes unless you don\'t want this behaviour '
                   'or if you use a multiple repositories setup, in which '
                   'case you should say yes on one repository, no on others.'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('synchronization-interval',
         {'type' : 'time',
          'default': '5min',
          'help': ('Interval in seconds between synchronization with the '
                   'external source (default to 5 minutes, must be >= 1 min).'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('max-lock-lifetime',
         {'type' : 'time',
          'default': '1h',
          'help': ('Maximum time allowed for a synchronization to be run. '
                   'Exceeded that time, the synchronization will be considered '
                   'as having failed and not properly released the lock, hence '
                   'it won\'t be considered'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('delete-entities',
         {'type' : 'yn',
          'default': False,
          'help': ('Should already imported entities not found anymore on the '
                   'external source be deleted?'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('logs-lifetime',
         {'type': 'time',
          'default': '10d',
          'help': ('Time before logs from datafeed imports are deleted.'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('http-timeout',
         {'type': 'time',
          'default': '1min',
          'help': ('Timeout of HTTP GET requests, when synchronizing a source.'),
          'group': 'datafeed-source', 'level': 2,
          }),
        ('use-cwuri-as-url',
         {'type': 'yn',
          'default': None, # explicitly unset
          'help': ('Use cwuri (i.e. external URL) for link to the entity '
                   'instead of its local URL.'),
          'group': 'datafeed-source', 'level': 1,
          }),
        )

    def check_config(self, source_entity):
        """check configuration of source entity"""
        typed_config = super(DataFeedSource, self).check_config(source_entity)
        if typed_config['synchronization-interval'] < 60:
            _ = source_entity._cw._
            msg = _('synchronization-interval must be greater than 1 minute')
            raise ValidationError(source_entity.eid, {'config': msg})
        return typed_config

    def _entity_update(self, source_entity):
        super(DataFeedSource, self)._entity_update(source_entity)
        self.parser_id = source_entity.parser
        self.latest_retrieval = source_entity.latest_retrieval

    def update_config(self, source_entity, typed_config):
        """update configuration from source entity. `typed_config` is config
        properly typed with defaults set
        """
        super(DataFeedSource, self).update_config(source_entity, typed_config)
        self.synchro_interval = timedelta(seconds=typed_config['synchronization-interval'])
        self.max_lock_lifetime = timedelta(seconds=typed_config['max-lock-lifetime'])
        self.http_timeout = typed_config['http-timeout']
        # if typed_config['use-cwuri-as-url'] is set, we have to update
        # use_cwuri_as_url attribute and public configuration dictionary
        # accordingly
        if typed_config['use-cwuri-as-url'] is not None:
            self.use_cwuri_as_url = typed_config['use-cwuri-as-url']
            self.public_config['use-cwuri-as-url'] = self.use_cwuri_as_url

ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   129
    def init(self, activated, source_entity):
8674
001c1592060a [repo sources] move handling of source's url into abstract source as this becomes shared by most sources
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8573
diff changeset
   130
        super(DataFeedSource, self).init(activated, source_entity)
7527
ef1e9bc38137 [datafeed] renaming parser attribute to parser_id makes things clearer
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7461
diff changeset
   131
        self.parser_id = source_entity.parser
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   132
        self.load_mapping(source_entity._cw)
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   133
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   134
    def _get_parser(self, cnx, **kwargs):
10454
20f45a9b385c [datafeed] give an error message if a source is missing a parser id
Julien Cristau <julien.cristau@logilab.fr>
parents: 10143
diff changeset
   135
        if self.parser_id is None:
20f45a9b385c [datafeed] give an error message if a source is missing a parser id
Julien Cristau <julien.cristau@logilab.fr>
parents: 10143
diff changeset
   136
            self.warning('No parser defined on source %r', self)
20f45a9b385c [datafeed] give an error message if a source is missing a parser id
Julien Cristau <julien.cristau@logilab.fr>
parents: 10143
diff changeset
   137
            raise ObjectNotFound()
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   138
        return self.repo.vreg['parsers'].select(
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   139
            self.parser_id, cnx, source=self, **kwargs)
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   140
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   141
    def load_mapping(self, cnx):
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   142
        self.mapping = {}
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   143
        self.mapping_idx = {}
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   144
        try:
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   145
            parser = self._get_parser(cnx)
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   146
        except (RegistryNotFound, ObjectNotFound):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   147
            return # no parser yet, don't go further
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   148
        self._load_mapping(cnx, parser=parser)

    def add_schema_config(self, schemacfg, checkonly=False, parser=None):
        """A CWSourceSchemaConfig was added: update the mapping accordingly."""
        if parser is None:
            parser = self._get_parser(schemacfg._cw)
        parser.add_schema_config(schemacfg, checkonly)

    def del_schema_config(self, schemacfg, checkonly=False, parser=None):
        """A CWSourceSchemaConfig was deleted: update the mapping accordingly."""
        if parser is None:
            parser = self._get_parser(schemacfg._cw)
        parser.del_schema_config(schemacfg, checkonly)

    def fresh(self):
        if self.latest_retrieval is None:
            return False
        return datetime.now(tz=utc) < (self.latest_retrieval + self.synchro_interval)

    def update_latest_retrieval(self, cnx):
        self.latest_retrieval = datetime.now(tz=utc)
        cnx.execute('SET X latest_retrieval %(date)s WHERE X eid %(x)s',
                    {'x': self.eid, 'date': self.latest_retrieval})
        cnx.commit()

    def acquire_synchronization_lock(self, cnx):
        # XXX race condition until WHERE of SET queries is executed using
        # 'SELECT FOR UPDATE'
        now = datetime.now(tz=utc)
        maxdt = now - self.max_lock_lifetime
        if not cnx.execute(
                'SET X in_synchronization %(now)s WHERE X eid %(x)s, '
                'X in_synchronization NULL OR X in_synchronization < %(maxdt)s',
                {'x': self.eid, 'now': now, 'maxdt': maxdt}):
            cnx.commit()
            raise SourceException("a concurrent synchronization is already running")
        cnx.commit()

    def release_synchronization_lock(self, cnx):
        cnx.execute('SET X in_synchronization NULL WHERE X eid %(x)s',
                    {'x': self.eid})
        cnx.commit()

    def pull_data(self, cnx, force=False, raise_on_error=False, async=False):
        """Launch synchronization of the source if needed.

        If `async` is true, the method immediately returns a dictionary containing the import
        log's eid, and the actual synchronization is done asynchronously. If `async` is false,
        return some import statistics (e.g. number of created and updated entities).

        This method is responsible for handling commit/rollback on the given connection.
        """
        if not force and self.fresh():
            return {}
        try:
            self.acquire_synchronization_lock(cnx)
        except SourceException as exc:
            if force:
                raise
            self.error(str(exc))
            return {}
        try:
            if async:
                return self._async_pull_data(cnx, force, raise_on_error)
            else:
                return self._pull_data(cnx, force, raise_on_error)
        finally:
            cnx.rollback()  # rollback first in case there is some dirty transaction remaining
            self.release_synchronization_lock(cnx)

    def _async_pull_data(self, cnx, force, raise_on_error):
        import_log = cnx.create_entity('CWDataImport', cw_import_of=self)
        cnx.commit()  # commit the import log creation before starting the synchronize task

        def _synchronize_source(repo, source_eid, import_log_eid):
            with repo.internal_cnx() as cnx:
                source = repo.sources_by_eid[source_eid]
                source._pull_data(cnx, force, raise_on_error, import_log_eid=import_log_eid)

        sync = partial(_synchronize_source, cnx.repo, self.eid, import_log.eid)
        cnx.repo.threaded_task(sync)
        return {'import_log_eid': import_log.eid}

    def _pull_data(self, cnx, force=False, raise_on_error=False, import_log_eid=None):
        importlog = self.init_import_log(cnx, import_log_eid)
        source_uris = self.source_uris(cnx)
        try:
            parser = self._get_parser(cnx, import_log=importlog,
                                      source_uris=source_uris,
                                      moved_uris=self.moved_uris(cnx))
        except ObjectNotFound:
            msg = 'failed to load parser for %s'
            importlog.record_error(msg % ('source "%s"' % self.uri))
            self.error(msg, self)
            stats = {}
        else:
            if parser.process_urls(self.urls, raise_on_error):
                self.warning("some error occurred, don't attempt to delete entities")
            else:
                parser.handle_deletion(self.config, cnx, source_uris)
            stats = parser.stats
        self.update_latest_retrieval(cnx)
        if stats.get('created'):
            importlog.record_info('added %s entities' % len(stats['created']))
        if stats.get('updated'):
            importlog.record_info('updated %s entities' % len(stats['updated']))
        importlog.write_log(cnx, end_timestamp=self.latest_retrieval)
        cnx.commit()
        return stats

    def source_uris(self, cnx):
        sql = 'SELECT extid, eid, type FROM entities WHERE asource=%(source)s'
        return dict((self.decode_extid(uri), (eid, type))
                    for uri, eid, type in cnx.system_sql(sql, {'source': self.uri}).fetchall())

    def moved_uris(self, cnx):
        sql = 'SELECT extid FROM moved_entities'
        return set(self.decode_extid(uri) for uri, in cnx.system_sql(sql).fetchall())

    def init_import_log(self, cnx, import_log_eid=None, **kwargs):
        if import_log_eid is None:
            import_log = cnx.create_entity('CWDataImport', cw_import_of=self,
                                           start_timestamp=datetime.now(tz=utc),
                                           **kwargs)
        else:
            import_log = cnx.entity_from_eid(import_log_eid)
            import_log.cw_set(start_timestamp=datetime.now(tz=utc), **kwargs)
        cnx.commit()  # make changes visible
        import_log.init()
        return import_log
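
The freshness test in `fresh()` and the stale-lock cutoff in `acquire_synchronization_lock()` both reduce to arithmetic on timezone-aware datetimes. The following is a minimal standalone sketch of those two comparisons, not CubicWeb code: the helper names `is_fresh` and `lock_is_stale` are hypothetical, the `now` parameter is added only to make the functions testable, and the standard library's `timezone.utc` stands in for the `utc` object imported by the module.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(latest_retrieval, synchro_interval, now=None):
    """Return True while the last retrieval is younger than the interval."""
    if latest_retrieval is None:
        return False  # never synchronized, so never fresh
    now = now or datetime.now(tz=timezone.utc)
    return now < latest_retrieval + synchro_interval


def lock_is_stale(in_synchronization, max_lock_lifetime, now=None):
    """Return True if the lock is unset or older than its maximum lifetime."""
    if in_synchronization is None:
        return True  # no lock held
    now = now or datetime.now(tz=timezone.utc)
    # same cutoff as `maxdt = now - self.max_lock_lifetime` in the source
    return in_synchronization < now - max_lock_lifetime


# Fixed reference time so the comparisons are reproducible.
now = datetime(2016, 9, 30, 12, 0, tzinfo=timezone.utc)
fresh = is_fresh(now - timedelta(minutes=4), timedelta(minutes=5), now)
stale = lock_is_stale(now - timedelta(hours=2), timedelta(hours=1), now)
```

In the real source the stale-lock test is evaluated by the database as part of the `SET ... WHERE` query, so the check and the update happen in one statement; the sketch only mirrors the comparison itself.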


class DataFeedParser(AppObject):
    __registry__ = 'parsers'

    def __init__(self, cnx, source, import_log=None, source_uris=None, moved_uris=None):
        super(DataFeedParser, self).__init__(cnx)
        self.source = source
        self.import_log = import_log
        if source_uris is None:
            source_uris = {}
        self.source_uris = source_uris
        if moved_uris is None:
            moved_uris = ()
        self.moved_uris = moved_uris
        self.stats = {'created': set(), 'updated': set(), 'checked': set()}

    def normalize_url(self, url):
        """Normalize a URL by looking for a replacement for it in
        `cubicweb.sobjects.URL_MAPPING`.

        This dictionary allows redirecting from one host to another, which may be
        useful, for example, for a test instance using production data, when you
        want neither to load the external source nor to hack your `/etc/hosts`
        file.
        """
        # local import mandatory, it's available after registration
        from cubicweb.sobjects import URL_MAPPING
        for mappedurl in URL_MAPPING:
            if url.startswith(mappedurl):
                return url.replace(mappedurl, URL_MAPPING[mappedurl], 1)
        return url
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   310
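A minimal, self-contained sketch of the prefix substitution that `normalize_url` performs. The mapping contents and host names below are hypothetical and only serve the illustration; in CubicWeb the dictionary is `cubicweb.sobjects.URL_MAPPING` and is filled in by the application.

```python
# Hypothetical mapping, for illustration only.
URL_MAPPING = {'http://prod.example.org/': 'http://test.example.org/'}


def normalize_url(url, mapping=URL_MAPPING):
    """Rewrite `url` if it starts with one of the mapped prefixes."""
    for mappedurl, replacement in mapping.items():
        if url.startswith(mappedurl):
            # count=1: only the leading occurrence is rewritten
            return url.replace(mappedurl, replacement, 1)
    # no prefix matched: the url is returned unchanged
    return url


print(normalize_url('http://prod.example.org/feed.xml'))
```

Because matching is by prefix, a mapping key is best written with a trailing slash so that `http://prod.example.org` does not also match `http://prod.example.org.evil/`.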
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   311
    def retrieve_url(self, url):
9823
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   312
        """Return stream linked by the given url:
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   313
        * HTTP urls will be normalized (see :meth:`normalize_url`)
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   314
        * file:// URLs are opened from the local filesystem
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   315
        * anything else is treated as plain content, which is useful for testing
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   316
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   317
        For http URLs, it will try to find a cwclientlib config entry
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   318
        (if available) and use it as requester.
9823
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   319
        """
10603
65ad6980976e [py3k] import URL mangling functions using six.moves
Rémi Cardona <remi.cardona@logilab.fr>
parents: 10581
diff changeset
   320
        purl = urlparse(url)
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   321
        if purl.scheme == 'file':
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   322
            return URLLibResponseAdapter(open(url[7:]), url)
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   323
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   324
        url = self.normalize_url(url)
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   325
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   326
        # first, try to use cwclientlib if it's available and if the
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   327
        # url matches a configuration entry in ~/.config/cwclientlibrc
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   328
        try:
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   329
            from cwclientlib import cwproxy_for
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   330
            # hand the normalized url over to cwclientlib
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   331
            cnx = cwproxy_for(url)
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   332
            cnx.timeout = self.source.http_timeout
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   333
            self.source.info('Using cwclientlib for %s', url)
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   334
            resp = cnx.get(url)
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   335
            resp.raise_for_status()
11055
3c1139344621 [datafeed] io.BytesIO requires a buffer, not a unicode (closes #9783743)
David Douard <david.douard@logilab.fr>
parents: 11042
diff changeset
   336
            return URLLibResponseAdapter(BytesIO(resp.content), url)
10532
2cc74c688eb9 [datafeed] also catch EnvironmentError when trying to load the cwclientlib config file
David Douard <david.douard@logilab.fr>
parents: 10522
diff changeset
   337
        except (ImportError, ValueError, EnvironmentError) as exc:
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   338
            # ImportError: not available
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   339
            # ValueError: no config entry found
10532
2cc74c688eb9 [datafeed] also catch EnvironmentError when trying to load the cwclientlib config file
David Douard <david.douard@logilab.fr>
parents: 10522
diff changeset
   340
            # EnvironmentError: no cwclientlib config file found
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   341
            self.source.debug(str(exc))
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   342
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   343
        # no chance with cwclientlib, fall back to former implementation
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   344
        if purl.scheme in ('http', 'https'):
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   345
            self.source.info('GET %s', url)
10610
d53b9c157f99 [py3k] import urllib2 from six.moves
Rémi Cardona <remi.cardona@logilab.fr>
parents: 10603
diff changeset
   346
            req = Request(url)
9825
946b483bc8a1 [datafeed parser] enhance retrieve_url to support POSTing data and custom HTTP headers
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9824
diff changeset
   347
            return _OPENER.open(req, timeout=self.source.http_timeout)
10516
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   348
4c59409220b6 [datafeed] allow to use cwclientlib for datafeed's queries (closes #5456849)
David Douard <david.douard@logilab.fr>
parents: 10143
diff changeset
   349
        # url is probably plain content
10757
f73a9a884534 [py3k] io.BytesIO
Rémi Cardona <remi.cardona@logilab.fr>
parents: 10662
diff changeset
   350
        return URLLibResponseAdapter(BytesIO(url.encode('ascii')), url)
9823
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   351
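The dispatch in `retrieve_url` can be sketched without any network access. This is an illustration only: `classify_source` and `inline_stream` are hypothetical helper names, and the import uses Python 3's `urllib.parse`, whereas the changeset metadata indicates the module itself imports URL functions through `six.moves`.

```python
from io import BytesIO
from urllib.parse import urlparse


def classify_source(url):
    """Tell which branch a retrieve_url-like dispatcher would take."""
    scheme = urlparse(url).scheme
    if scheme == 'file':
        return 'file'
    if scheme in ('http', 'https'):
        return 'http'
    # no recognised scheme: the "url" is taken to be the content itself
    return 'inline'


def inline_stream(content):
    """Wrap inline content in a readable byte stream, as the fallback does."""
    return BytesIO(content.encode('ascii'))


print(classify_source('<rset/>'))
```

The inline branch is what makes it possible to feed literal XML to a parser in tests instead of serving it over HTTP.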
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   352
    def add_schema_config(self, schemacfg, checkonly=False):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   353
        """added CWSourceSchemaConfig, modify mapping accordingly"""
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   354
        msg = schemacfg._cw._("this parser doesn't use a mapping")
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   355
        raise ValidationError(schemacfg.eid, {None: msg})
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   356
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   357
    def del_schema_config(self, schemacfg, checkonly=False):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   358
        """deleted CWSourceSchemaConfig, modify mapping accordingly"""
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   359
        msg = schemacfg._cw._("this parser doesn't use a mapping")
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   360
        raise ValidationError(schemacfg.eid, {None: msg})
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   361
11251
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   362
    def process_urls(self, urls, raise_on_error=False):
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   363
        error = False
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   364
        for url in urls:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   365
            self.info('pulling data from %s', url)
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   366
            try:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   367
                if self.process(url, raise_on_error):
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   368
                    error = True
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   369
            except IOError as exc:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   370
                if raise_on_error:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   371
                    raise
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   372
                self.import_log.record_error(
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   373
                    'could not pull data while processing %s: %s'
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   374
                    % (url, exc))
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   375
                error = True
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   376
            except Exception as exc:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   377
                if raise_on_error:
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   378
                    raise
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   379
                self.import_log.record_error(str(exc))
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   380
                self.exception('error while processing %s: %s',
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   381
                               url, exc)
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   382
                error = True
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   383
        return error
b66a8c3eebeb [datafeed] move process_urls to the parser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11151
diff changeset
   384
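The control flow of `process_urls` (keep going after a failure, remember that something failed, re-raise only on request) can be tried in isolation. `SketchParser` below is a stand-in written for this illustration, not the CubicWeb class; its `process` method and the `errors` list are assumptions that replace the real parser and `import_log`.

```python
class SketchParser:
    """Stand-in reproducing the error handling of process_urls."""

    def __init__(self):
        self.errors = []  # plays the role of import_log

    def process(self, url, raise_on_error=False):
        # hypothetical behaviour: any url containing 'bad' fails
        if 'bad' in url:
            raise IOError('cannot fetch %s' % url)
        return False  # False means "no error"

    def process_urls(self, urls, raise_on_error=False):
        error = False
        for url in urls:
            try:
                if self.process(url, raise_on_error):
                    error = True
            except Exception as exc:
                if raise_on_error:
                    raise
                # record the failure and continue with the next url
                self.errors.append(str(exc))
                error = True
        return error


parser = SketchParser()
print(parser.process_urls(['http://a/ok', 'http://a/bad', 'http://a/ok2']))
```

A single failing URL therefore marks the whole pull as erroneous without preventing the remaining URLs from being processed.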
8409
79534887943e [datafeed] fix/finish cleanup started by auc in 8393:c25b96ae4f8a: parser.process prototytpe is (url, raise_on_error=False). Drop partialcommit argument which were never specified
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8408
diff changeset
   385
    def process(self, url, raise_on_error=False):
6957
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   386
        """main callback: process the url"""
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   387
        raise NotImplementedError
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   388
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   389
    def created_during_pull(self, entity):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   390
        return entity.eid in self.stats['created']
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   391
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   392
    def updated_during_pull(self, entity):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   393
        return entity.eid in self.stats['updated']
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   394
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   395
    def notify_updated(self, entity):
ffda12be2e9f [repository] #1460066: backport datafeed cube as cubicweb source
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents:
diff changeset
   396
        return self.stats['updated'].add(entity.eid)
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   397
8435
5064b6e0d6f4 [datafeed] correctly distinguish checked/updated
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8434
diff changeset
   398
    def notify_checked(self, entity):
5064b6e0d6f4 [datafeed] correctly distinguish checked/updated
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8434
diff changeset
   399
        return self.stats['checked'].add(entity.eid)
5064b6e0d6f4 [datafeed] correctly distinguish checked/updated
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8434
diff changeset
   400
8187
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   401
    def is_deleted(self, extid, etype, eid):
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   402
        """return True if the entity of given external id, entity type and eid
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   403
        is actually deleted. Always return True by default, put more sensible
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   404
        stuff in sub-classes.
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   405
        """
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   406
        return True
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   407
11252
6b1d09ef0c45 [datafeed] rename parser.sourceuris to source_uris
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11251
diff changeset
   408
    def handle_deletion(self, config, cnx, source_uris):
6b1d09ef0c45 [datafeed] rename parser.sourceuris to source_uris
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11251
diff changeset
   409
        if config['delete-entities'] and source_uris:
8430
5bee87a14bb1 fix ldap removal handling in ldapfeed (closes #2376625 and #2385133)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8429
diff changeset
   410
            byetype = {}
11252
6b1d09ef0c45 [datafeed] rename parser.sourceuris to source_uris
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 11251
diff changeset
   411
            for extid, (eid, etype) in source_uris.items():
8430
5bee87a14bb1 fix ldap removal handling in ldapfeed (closes #2376625 and #2385133)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8429
diff changeset
   412
                if self.is_deleted(extid, etype, eid):
5bee87a14bb1 fix ldap removal handling in ldapfeed (closes #2376625 and #2385133)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8429
diff changeset
   413
                    byetype.setdefault(etype, []).append(str(eid))
10662
10942ed172de [py3k] dict.iteritems → dict.items
Rémi Cardona <remi.cardona@logilab.fr>
parents: 10611
diff changeset
   414
            for etype, eids in byetype.items():
8430
5bee87a14bb1 fix ldap removal handling in ldapfeed (closes #2376625 and #2385133)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8429
diff changeset
   415
                self.warning('delete %s %s entities', len(eids), etype)
9879
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   416
                cnx.execute('DELETE %s X WHERE X eid IN (%s)'
21278eb03bbf [datafeed sources] finish the session -> cnx switch
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 9860
diff changeset
   417
                            % (etype, ','.join(eids)))
9975
98b4f7fa2e3a [datafeed] Commit after all deletions in datafeed parser
Denis Laxalde <denis.laxalde@logilab.fr>
parents: 9879
diff changeset
   418
            cnx.commit()
8430
5bee87a14bb1 fix ldap removal handling in ldapfeed (closes #2376625 and #2385133)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8429
diff changeset
   419
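How `handle_deletion` groups entities and builds one RQL statement per entity type can be sketched as a pure function. `deletion_queries` is a hypothetical name, the entity types are made up, and the `sorted` call is added here only to make the output order predictable; the real method executes each query on `cnx` and commits once at the end.

```python
def deletion_queries(source_uris, is_deleted=lambda extid, etype, eid: True):
    """Return the DELETE queries for entities reported as deleted.

    `source_uris` maps an external id to an `(eid, etype)` pair.
    """
    byetype = {}
    for extid, (eid, etype) in source_uris.items():
        if is_deleted(extid, etype, eid):
            byetype.setdefault(etype, []).append(str(eid))
    return ['DELETE %s X WHERE X eid IN (%s)' % (etype, ','.join(eids))
            for etype, eids in sorted(byetype.items())]


uris = {'http://a/1': (1, 'Card'), 'http://a/2': (2, 'Card'),
        'http://a/3': (3, 'Tag')}
for query in deletion_queries(uris):
    print(query)
```

Batching by type keeps the number of queries proportional to the number of entity types rather than the number of entities.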
8188
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   420
    def update_if_necessary(self, entity, attrs):
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   421
        entity.complete(tuple(attrs))
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   422
        # check modification date and compare attribute values to only update
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   423
        # what's actually needed
8435
5064b6e0d6f4 [datafeed] correctly distinguish checked/updated
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8434
diff changeset
   424
        self.notify_checked(entity)
8188
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   425
        mdate = attrs.get('modification_date')
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   426
        if not mdate or mdate > entity.modification_date:
10662
10942ed172de [py3k] dict.iteritems → dict.items
Rémi Cardona <remi.cardona@logilab.fr>
parents: 10611
diff changeset
   427
            attrs = dict( (k, v) for k, v in attrs.items()
8188
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   428
                          if v != getattr(entity, k))
1867e252e487 [repository] ldap-feed source. Closes #2086984
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8187
diff changeset
   429
            if attrs:
8483
4ba11607d84a [entity api] unify set_attributes / set_relations into a cw_set method. Closes #2423719
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8435
diff changeset
   430
                entity.cw_set(**attrs)
8434
39c5bb4dcc59 [ldapfeed] do not crash on ldap user deletion + pull + already deactivated users, cleanups (closes #2392933)
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 8430
diff changeset
   431
                self.notify_updated(entity)
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   432
8547
f23ac525ddd1 [datafeed] properly call hooks for inlined relations on entity creation. Closes #2481156
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8529
diff changeset
   433
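The decision made by `update_if_necessary` can be illustrated with plain dictionaries. In this sketch `changed_attrs` is a hypothetical helper, and `current` stands in for the entity whose attributes the real method reads with `getattr`.

```python
from datetime import datetime


def changed_attrs(current, incoming):
    """Return the subset of `incoming` that actually needs writing."""
    mdate = incoming.get('modification_date')
    if mdate and mdate <= current['modification_date']:
        # the local copy is at least as recent: nothing to do
        return {}
    # keep only the values that differ from what is already stored
    return {k: v for k, v in incoming.items() if v != current.get(k)}


current = {'title': 'old', 'modification_date': datetime(2016, 1, 1)}
print(changed_attrs(current, {'title': 'new'}))
```

When the incoming data carries no modification date the comparison still runs, so an unchanged entity is counted as checked but not as updated.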
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   434
class DataFeedXMLParser(DataFeedParser):
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   435
10914
fed8bd56f223 [repository] deprecate the extid2eid based multi-sources API
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 10805
diff changeset
   436
    @deprecated()
8409
79534887943e [datafeed] fix/finish cleanup started by auc in 8393:c25b96ae4f8a: parser.process prototytpe is (url, raise_on_error=False). Drop partialcommit argument which were never specified
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8408
diff changeset
   437
    def process(self, url, raise_on_error=False):
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   438
        """IDataFeedParser main entry point"""
7447
d5705c9bbe82 don't crash if we can't fetch data or if xml is malformed
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7446
diff changeset
   439
        try:
d5705c9bbe82 don't crash if we can't fetch data or if xml is malformed
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7446
diff changeset
   440
            parsed = self.parse(url)
8695
358d8bed9626 [toward-py3k] rewrite to "except AnException as exc:" (part of #2711624)
Nicolas Chauvat <nicolas.chauvat@logilab.fr>
parents: 8694
diff changeset
   441
        except Exception as ex:
7533
43835fbdf97d [datafeed] actually raise on error
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7527
diff changeset
   442
            if raise_on_error:
43835fbdf97d [datafeed] actually raise on error
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7527
diff changeset
   443
                raise
8069
4341fb713b14 [datafeed log] properly log errors catched at the source level
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8068
diff changeset
   444
            self.import_log.record_error(str(ex))
7447
d5705c9bbe82 don't crash if we can't fetch data or if xml is malformed
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7446
diff changeset
   445
            return True
d5705c9bbe82 don't crash if we can't fetch data or if xml is malformed
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7446
diff changeset
   446
        for args in parsed:
11151
4259c55df3e7 merge changes from 3.22.2
Julien Cristau <julien.cristau@logilab.fr>
parents: 11138
diff changeset
   447
            self.process_item(*args, raise_on_error=raise_on_error)
4259c55df3e7 merge changes from 3.22.2
Julien Cristau <julien.cristau@logilab.fr>
parents: 11138
diff changeset
   448
        return False
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   449
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   450
    def parse(self, url):
9823
258d2f9f7d39 [datafeed parser] factor out retrieve_url method from DataFeedXMLParser.parse
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 9822
diff changeset
   451
        stream = self.retrieve_url(url)
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   452
        return self.parse_etree(etree.parse(stream).getroot())
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   453
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   454
    def parse_etree(self, document):
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   455
        return [(document,)]
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   456
10089
6346f53c85f1 [datafeed] Add a raise_on_error parameter to DataFeedSource.extid2entity
Denis Laxalde <denis.laxalde@logilab.fr>
parents: 9990
diff changeset
   457
    def process_item(self, *args, **kwargs):
7378
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   458
        raise NotImplementedError
86a1ae289f05 [datafeed] extract a generic DataFeedXMLParser from CWEntityXMLParser
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 7351
diff changeset
   459
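A sketch of how the two extension points, `parse_etree` and `process_item`, fit together. It uses the standard library's `xml.etree.ElementTree` as an assumption (the module's own `etree` import is not shown in this part of the file), and the `<feed>`/`<item>` document layout is invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_etree(document):
    """Return one argument tuple per <item> rather than the whole document."""
    return [(item,) for item in document.findall('item')]


def process_item(item, raise_on_error=False):
    """Hypothetical item handler: extract an id and a title."""
    return item.get('id'), item.findtext('title')


root = ET.fromstring(
    '<feed>'
    '<item id="1"><title>a</title></item>'
    '<item id="2"><title>b</title></item>'
    '</feed>')
# same calling convention as DataFeedXMLParser.process
results = [process_item(*args, raise_on_error=False)
           for args in parse_etree(root)]
print(results)
```

Each tuple returned by `parse_etree` is unpacked into the positional arguments of `process_item`, so a subclass can pass several values per item by returning longer tuples.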
8187
981f6e487788 [datafeed] set delete-entities=yes is now safer, by checking each entity actually seems deleted. Closes #2165381
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 8069
diff changeset
   460
    def is_deleted(self, extid, etype, eid):
10551
1182f5f16a3d [datafeed] fix typo in DataFeedXMLParser.is_deleted (closes #5729755)
David Douard <david.douard@logilab.fr>
parents: 10532
diff changeset
   461
        if extid.startswith('file://'):
1182f5f16a3d [datafeed] fix typo in DataFeedXMLParser.is_deleted (closes #5729755)
David Douard <david.douard@logilab.fr>
parents: 10532
diff changeset
   462
            # a local file counts as deleted once it no longer exists
            return not exists(extid[7:])

        url = self.normalize_url(extid)
        # first, try to use cwclientlib if it's available and if the
        # url matches a configuration entry in ~/.config/cwclientlibrc
        try:
            from cwclientlib import cwproxy_for
            # parse url again since it has been normalized
            cnx = cwproxy_for(url)
            cnx.timeout = self.source.http_timeout
            self.source.info('Using cwclientlib for checking %s' % url)
            return cnx.get(url).status_code == 404
        except (ImportError, ValueError, EnvironmentError) as exc:
            # ImportError: not available
            # ValueError: no config entry found
            # EnvironmentError: no cwclientlib config file found
            self.source.debug(str(exc))

        # no chance with cwclientlib, fall back to former implementation
        if urlparse(url).scheme in ('http', 'https'):
            try:
                _OPENER.open(url, timeout=self.source.http_timeout)
            except HTTPError as ex:
                if ex.code == 404:
                    return True
        return False


class URLLibResponseAdapter(object):
    """Thin wrapper to be used to fake a value returned by urllib2.urlopen"""
    def __init__(self, stream, url, code=200):
        self._stream = stream
        self._url = url
        self.code = code

    def read(self, *args):
        return self._stream.read(*args)

    def geturl(self):
        return self._url

    def getcode(self):
        return self.code


# use a cookie enabled opener to use session cookie if any
_OPENER = build_opener()
try:
    from logilab.common import urllib2ext
    _OPENER.add_handler(urllib2ext.HTTPGssapiAuthHandler())
except ImportError:  # python-kerberos not available
    pass
_OPENER.add_handler(HTTPCookieProcessor(CookieJar()))