devtools/htmlparser.py
author Sylvain Thénault <sylvain.thenault@logilab.fr>
Wed, 28 Apr 2010 10:06:01 +0200
branchstable
changeset 5421 8167de96c523
parent 5276 5037d891e207
child 5424 8ecbcbff9777
permissions -rw-r--r--
proper licensing information (LGPL-2.1). Hope I get it right this time.
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
5421
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     1
# copyright 2003-2010 LOGILAB S.A. (Paris, FRANCE), all rights reserved.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     2
# contact http://www.logilab.fr/ -- mailto:contact@logilab.fr
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     3
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     4
# This file is part of CubicWeb.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     5
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     6
# CubicWeb is free software: you can redistribute it and/or modify it under the
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     7
# terms of the GNU Lesser General Public License as published by the Free
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     8
# Software Foundation, either version 2.1 of the License, or (at your option)
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
     9
# any later version.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    10
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    11
# logilab-common is distributed in the hope that it will be useful, but WITHOUT
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    12
# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    13
# FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    14
# details.
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    15
#
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    16
# You should have received a copy of the GNU Lesser General Public License along
8167de96c523 proper licensing information (LGPL-2.1). Hope I get it right this time.
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 5276
diff changeset
    17
# with CubicWeb.  If not, see <http://www.gnu.org/licenses/>.
1977
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    18
"""defines a validating HTML parser used in web application tests
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    19
606923dff11b big bunch of copyright / docstring update
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents: 1945
diff changeset
    20
"""
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    21
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    22
import re
3325
44caeccd2db9 fix sys import
Julien Jehannet <julien.jehannet@logilab.fr>
parents: 3151
diff changeset
    23
import sys
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    24
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    25
from lxml import etree
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    26
1421
77ee26df178f doc type handling refactoring: do the ext substitution at the module level
sylvain.thenault@logilab.fr
parents: 1132
diff changeset
    27
from cubicweb.view import STRICT_DOCTYPE, TRANSITIONAL_DOCTYPE
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    28
STRICT_DOCTYPE = str(STRICT_DOCTYPE)
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    29
TRANSITIONAL_DOCTYPE = str(TRANSITIONAL_DOCTYPE)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    30
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    31
ERR_COUNT = 0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    32
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    33
class Validator(object):
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    34
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    35
    def parse_string(self, data, sysid=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    36
        try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    37
            data = self.preprocess_data(data)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    38
            return PageInfo(data, etree.fromstring(data, self.parser))
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    39
        except etree.XMLSyntaxError, exc:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    40
            def save_in(fname=''):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    41
                file(fname, 'w').write(data)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    42
            new_exc = AssertionError(u'invalid xml %s' % exc)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    43
            new_exc.position = exc.position
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    44
            raise new_exc
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    45
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    46
    def preprocess_data(self, data):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    47
        return data
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    48
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    49
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    50
class DTDValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    51
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    52
        Validator.__init__(self)
3151
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    53
        # XXX understand what's happening under windows
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    54
        validate = True
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    55
        if sys.platform == 'win32':
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    56
            validate = False
5d45c0945bd3 note about this test under windows
Aurelien Campeas
parents: 1977
diff changeset
    57
        self.parser = etree.XMLParser(dtd_validation=validate)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    58
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    59
    def preprocess_data(self, data):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    60
        """used to fix potential blockquote mess generated by docutils"""
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    61
        if STRICT_DOCTYPE not in data:
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    62
            return data
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    63
        # parse using transitional DTD
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    64
        data = data.replace(STRICT_DOCTYPE, TRANSITIONAL_DOCTYPE)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    65
        tree = etree.fromstring(data, self.parser)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    66
        namespace = tree.nsmap.get(None)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    67
        # this is the list of authorized child tags for <blockquote> nodes
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    68
        expected = 'p h1 h2 h3 h4 h5 h6 div ul ol dl pre hr blockquote address ' \
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    69
                   'fieldset table form noscript ins del script'.split()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    70
        if namespace:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    71
            blockquotes = tree.findall('.//{%s}blockquote' % namespace)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    72
            expected = ['{%s}%s' % (namespace, tag) for tag in expected]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    73
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    74
            blockquotes = tree.findall('.//blockquote')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    75
        # quick and dirty approach: remove all blockquotes
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    76
        for blockquote in blockquotes:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    77
            parent = blockquote.getparent()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    78
            parent.remove(blockquote)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    79
        data = etree.tostring(tree)
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    80
        return '<?xml version="1.0" encoding="UTF-8"?>%s\n%s' % (
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    81
            STRICT_DOCTYPE, data)
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    82
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
    83
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    84
class SaxOnlyValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    85
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    86
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    87
        Validator.__init__(self)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    88
        self.parser = etree.XMLParser()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
    89
5276
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    90
class XMLDemotingValidator(SaxOnlyValidator):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    91
    """ some views produce html instead of xhtml, using demote_to_html
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    92
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    93
    this is typically related to the use of external dependencies
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    94
    which do not produce valid xhtml (google maps, ...)
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    95
    """
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    96
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    97
    def preprocess_data(self, data):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    98
        if data.startswith('<?xml'):
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
    99
            self.parser = etree.XMLParser()
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   100
        else:
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   101
            self.parser = etree.HTMLParser()
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   102
        return data
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   103
5037d891e207 [devtools/validators] add an Xml validator able to degrade to Html validation because of views perusing demote_to_html
Aurelien Campeas <aurelien.campeas@logilab.fr>
parents: 4252
diff changeset
   104
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   105
class HTMLValidator(Validator):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   106
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   107
    def __init__(self):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   108
        Validator.__init__(self)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   109
        self.parser = etree.HTMLParser()
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   110
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   111
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   112
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   113
class PageInfo(object):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   114
    """holds various informations on the view's output"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   115
    def __init__(self, source, root):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   116
        self.source = source
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   117
        self.etree = root
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   118
        self.source = source
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   119
        self.raw_text = u''.join(root.xpath('//text()'))
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   120
        self.namespace = self.etree.nsmap
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   121
        self.default_ns = self.namespace.get(None)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   122
        self.a_tags = self.find_tag('a')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   123
        self.h1_tags = self.find_tag('h1')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   124
        self.h2_tags = self.find_tag('h2')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   125
        self.h3_tags = self.find_tag('h3')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   126
        self.h4_tags = self.find_tag('h4')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   127
        self.input_tags = self.find_tag('input')
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   128
        self.title_tags = [self.h1_tags, self.h2_tags, self.h3_tags, self.h4_tags]
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   129
1945
2b59d9ae17ae new argument telling if we want text or (text / attrs), keeping bw compat
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1485
diff changeset
   130
    def find_tag(self, tag, gettext=True):
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   131
        """return a list which contains text of all "tag" elements """
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   132
        if self.default_ns is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   133
            iterstr = ".//%s" % tag
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   134
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   135
            iterstr = ".//{%s}%s" % (self.default_ns, tag)
1945
2b59d9ae17ae new argument telling if we want text or (text / attrs), keeping bw compat
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1485
diff changeset
   136
        if not gettext or tag in ('a', 'input'):
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   137
            return [(elt.text, elt.attrib) for elt in self.etree.iterfind(iterstr)]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   138
        return [u''.join(elt.xpath('.//text()')) for elt in self.etree.iterfind(iterstr)]
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   139
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   140
    def appears(self, text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   141
        """returns True if <text> appears in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   142
        return text in self.raw_text
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   143
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   144
    def __contains__(self, text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   145
        return text in self.source
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   146
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   147
    def has_title(self, text, level=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   148
        """returns True if <h?>text</h?>
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   149
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   150
        :param level: the title's level (1 for h1, 2 for h2, etc.)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   151
        """
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   152
        if level is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   153
            for hlist in self.title_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   154
                if text in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   155
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   156
            return False
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   157
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   158
            hlist = self.title_tags[level - 1]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   159
            return text in hlist
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   160
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   161
    def has_title_regexp(self, pattern, level=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   162
        """returns True if <h?>pattern</h?>"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   163
        sre = re.compile(pattern)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   164
        if level is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   165
            for hlist in self.title_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   166
                for title in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   167
                    if sre.match(title):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   168
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   169
            return False
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   170
        else:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   171
            hlist = self.title_tags[level - 1]
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   172
            for title in hlist:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   173
                if sre.match(title):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   174
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   175
            return False
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   176
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   177
    def has_link(self, text, url=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   178
        """returns True if <a href=url>text</a> was found in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   179
        for link_text, attrs in self.a_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   180
            if text == link_text:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   181
                if url is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   182
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   183
                try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   184
                    href = attrs['href']
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   185
                    if href == url:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   186
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   187
                except KeyError:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   188
                    continue
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   189
        return False
1485
4d532f3c012e nicer fix
sylvain.thenault@logilab.fr
parents: 1480
diff changeset
   190
0
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   191
    def has_link_regexp(self, pattern, url=None):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   192
        """returns True if <a href=url>pattern</a> was found in the page"""
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   193
        sre = re.compile(pattern)
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   194
        for link_text, attrs in self.a_tags:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   195
            if sre.match(link_text):
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   196
                if url is None:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   197
                    return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   198
                try:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   199
                    href = attrs['href']
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   200
                    if href == url:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   201
                        return True
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   202
                except KeyError:
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   203
                    continue
b97547f5f1fa Showtime !
Adrien Di Mascio <Adrien.DiMascio@logilab.fr>
parents:
diff changeset
   204
        return False
2773
b2530e3e0afb [testlib] #345052 and #344207: major test lib refactoring/cleanup + update usage
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1977
diff changeset
   205
b2530e3e0afb [testlib] #345052 and #344207: major test lib refactoring/cleanup + update usage
Sylvain Thénault <sylvain.thenault@logilab.fr>
parents: 1977
diff changeset
   206
VALMAP = {None: None, 'dtd': DTDValidator, 'xml': SaxOnlyValidator}