devtools/htmlparser.py
author Pierre-Yves David <pierre-yves.david@logilab.fr>
Fri, 15 Oct 2010 17:00:09 +0200
branchstable
changeset 6514 f328ec853e18
parent 5424 8ecbcbff9777
child 6771 da71f1ad1721
permissions -rw-r--r--
[devtools] Firefox need write access to $HOME Check that you have write acces to the user home before launching Firefox. Firefox need to write stuff in the home directory of the user and will silency crash when unable to do it. We add an assertion with an explicite message to improve crash report in such situation.
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
5421
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     1
# copyright 2003-2010 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     2
# contact http://www.logilab.fr/ -- mailto:contact@logilab.fr
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     3
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     4
# This file is part of CubicWeb.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     5
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     6
# CubicWeb is free software: you can redistribute it and/or modify it under the
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     7
# terms of the GNU Lesser General Public License as published by the Free
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     8
# Software Foundation, either version 2.1 of the License, or (at your option)
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     9
# any later version.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    10
#
5424
8ecbcbff9777 replace logilab-common by CubicWeb in disclaimer
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5421
diff changeset
    11
# CubicWeb is distributed in the hope that it will be useful, but WITHOUT
5421
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    12
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    13
# FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    14
# details.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    15
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    16
# You should have received a copy of the GNU Lesser General Public License along
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    17
# with CubicWeb.  If not, see <http://www.gnu.org/licenses/>.
1977
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    18
"""defines a validating HTML parser used in web application tests
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    19
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    20
"""
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    21
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    22
import re
3325
44caeccd2db9 fix sys import
Julien Jehannet <julien.jehannet@logilab.fr>
parents: 3151
diff changeset
    23
import sys
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    24
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    25
from lxml import etree
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    26
1421
77ee26df178f doc type handling refactoring: do the ext substitution at the module level
sylvain.thenault@logilab.fr
parents: 1132
diff changeset
    27
from cubicweb.view import STRICT_DOCTYPE, TRANSITIONAL_DOCTYPE
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    28
STRICT_DOCTYPE = str(STRICT_DOCTYPE)
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    29
TRANSITIONAL_DOCTYPE = str(TRANSITIONAL_DOCTYPE)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    30
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    31
ERR_COUNT = 0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    32
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    33
class Validator(object):
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    34
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    35
    def parse_string(self, data, sysid=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    36
        try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    37
            data = self.preprocess_data(data)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    38
            return PageInfo(data, etree.fromstring(data, self.parser))
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    39
        except etree.XMLSyntaxError, exc:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    40
            def save_in(fname=''):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    41
                file(fname, 'w').write(data)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    42
            new_exc = AssertionError(u'invalid xml %s' % exc)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    43
            new_exc.position = exc.position
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    44
            raise new_exc
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    45
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    46
    def preprocess_data(self, data):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    47
        return data
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    48
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    49
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    50
class DTDValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    51
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    52
        Validator.__init__(self)
3151
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    53
        # XXX understand what's happening under windows
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    54
        validate = True
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    55
        if sys.platform == 'win32':
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    56
            validate = False
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    57
        self.parser = etree.XMLParser(dtd_validation=validate)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    58
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    59
    def preprocess_data(self, data):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    60
        """used to fix potential blockquote mess generated by docutils"""
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    61
        if STRICT_DOCTYPE not in data:
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    62
            return data
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    63
        # parse using transitional DTD
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    64
        data = data.replace(STRICT_DOCTYPE, TRANSITIONAL_DOCTYPE)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    65
        tree = etree.fromstring(data, self.parser)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    66
        namespace = tree.nsmap.get(None)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    67
        # this is the list of authorized child tags for <blockquote> nodes
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    68
        expected = 'p h1 h2 h3 h4 h5 h6 div ul ol dl pre hr blockquote address ' \
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    69
                   'fieldset table form noscript ins del script'.split()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    70
        if namespace:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    71
            blockquotes = tree.findall('.//{%s}blockquote' % namespace)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    72
            expected = ['{%s}%s' % (namespace, tag) for tag in expected]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    73
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    74
            blockquotes = tree.findall('.//blockquote')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    75
        # quick and dirty approach: remove all blockquotes
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    76
        for blockquote in blockquotes:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    77
            parent = blockquote.getparent()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    78
            parent.remove(blockquote)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    79
        data = etree.tostring(tree)
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    80
        return '<?xml version="1.0" encoding="UTF-8"?>%s\n%s' % (
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    81
            STRICT_DOCTYPE, data)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    82
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    83
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    84
class SaxOnlyValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    85
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    86
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    87
        Validator.__init__(self)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    88
        self.parser = etree.XMLParser()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    89
5276
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    90
class XMLDemotingValidator(SaxOnlyValidator):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    91
    """ some views produce html instead of xhtml, using demote_to_html
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    92
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    93
    this is typically related to the use of external dependencies
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    94
    which do not produce valid xhtml (google maps, ...)
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    95
    """
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    96
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    97
    def preprocess_data(self, data):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    98
        if data.startswith('<?xml'):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    99
            self.parser = etree.XMLParser()
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   100
        else:
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   101
            self.parser = etree.HTMLParser()
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   102
        return data
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   103
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   104
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   105
class HTMLValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   106
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   107
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   108
        Validator.__init__(self)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   109
        self.parser = etree.HTMLParser()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   110
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   111
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   112
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   113
class PageInfo(object):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   114
    """holds various informations on the view's output"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   115
    def __init__(self, source, root):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   116
        self.source = source
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   117
        self.etree = root
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   118
        self.source = source
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   119
        self.raw_text = u''.join(root.xpath('//text()'))
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   120
        self.namespace = self.etree.nsmap
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   121
        self.default_ns = self.namespace.get(None)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   122
        self.a_tags = self.find_tag('a')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   123
        self.h1_tags = self.find_tag('h1')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   124
        self.h2_tags = self.find_tag('h2')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   125
        self.h3_tags = self.find_tag('h3')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   126
        self.h4_tags = self.find_tag('h4')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   127
        self.input_tags = self.find_tag('input')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   128
        self.title_tags = [self.h1_tags, self.h2_tags, self.h3_tags, self.h4_tags]
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   129
1945
2b59d9ae17ae new argument telling if we want text or (text / attrs), keeping bw compat
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1485
diff changeset
   130
    def find_tag(self, tag, gettext=True):
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   131
        """return a list which contains text of all "tag" elements """
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   132
        if self.default_ns is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   133
            iterstr = ".//%s" % tag
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   134
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   135
            iterstr = ".//{%s}%s" % (self.default_ns, tag)
1945
2b59d9ae17ae new argument telling if we want text or (text / attrs), keeping bw compat
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1485
diff changeset
   136
        if not gettext or tag in ('a', 'input'):
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   137
            return [(elt.text, elt.attrib) for elt in self.etree.iterfind(iterstr)]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   138
        return [u''.join(elt.xpath('.//text()')) for elt in self.etree.iterfind(iterstr)]
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   139
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   140
    def appears(self, text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   141
        """returns True if <text> appears in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   142
        return text in self.raw_text
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   143
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   144
    def __contains__(self, text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   145
        return text in self.source
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   146
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   147
    def has_title(self, text, level=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   148
        """returns True if <h?>text</h?>
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   149
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   150
        :param level: the title's level (1 for h1, 2 for h2, etc.)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   151
        """
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   152
        if level is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   153
            for hlist in self.title_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   154
                if text in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   155
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   156
            return False
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   157
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   158
            hlist = self.title_tags[level - 1]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   159
            return text in hlist
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   160
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   161
    def has_title_regexp(self, pattern, level=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   162
        """returns True if <h?>pattern</h?>"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   163
        sre = re.compile(pattern)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   164
        if level is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   165
            for hlist in self.title_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   166
                for title in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   167
                    if sre.match(title):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   168
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   169
            return False
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   170
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   171
            hlist = self.title_tags[level - 1]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   172
            for title in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   173
                if sre.match(title):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   174
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   175
            return False
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   176
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   177
    def has_link(self, text, url=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   178
        """returns True if <a href=url>text</a> was found in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   179
        for link_text, attrs in self.a_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   180
            if text == link_text:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   181
                if url is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   182
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   183
                try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   184
                    href = attrs['href']
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   185
                    if href == url:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   186
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   187
                except KeyError:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   188
                    continue
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   189
        return False
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   190
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   191
    def has_link_regexp(self, pattern, url=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   192
        """returns True if <a href=url>pattern</a> was found in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   193
        sre = re.compile(pattern)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   194
        for link_text, attrs in self.a_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   195
            if sre.match(link_text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   196
                if url is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   197
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   198
                try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   199
                    href = attrs['href']
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   200
                    if href == url:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   201
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   202
                except KeyError:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   203
                    continue
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   204
        return False
2773
b2530e3e0afb [testlib] #345052 and #344207: major test lib refactoring/cleanup + update usage
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1977
diff changeset
   205
b2530e3e0afb [testlib] #345052 and #344207: major test lib refactoring/cleanup + update usage
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1977
diff changeset
   206
VALMAP = {None: None, 'dtd': DTDValidator, 'xml': SaxOnlyValidator}