doc/book/en/devrepo/fti.rst
changeset 10491 c67bcee93248
parent 10490 76ab3c71aff2
child 10492 68c13e0c0fc5
.. _fti:

Full Text Indexing in CubicWeb
------------------------------

When an attribute is tagged as *fulltext-indexable* in the datamodel,
CubicWeb will automatically trigger hooks to update the internal
fulltext index (i.e. the ``appears`` SQL table) each time this attribute
is modified.

CubicWeb also provides a ``db-rebuild-fti`` command to rebuild the whole
fulltext index on demand:

.. sourcecode:: bash

   cubicweb@esope~$ cubicweb-ctl db-rebuild-fti my_tracker_instance

You can also rebuild the fulltext index for a given set of entity types:

.. sourcecode:: bash

   cubicweb@esope~$ cubicweb-ctl db-rebuild-fti my_tracker_instance Ticket Version

In the above example, only the fulltext index of entity types ``Ticket`` and
``Version`` will be rebuilt.
       
Standard FTI process
~~~~~~~~~~~~~~~~~~~~

Considering an entity type ``ET``, the default *fti* process is to:

1. fetch all entities of type ``ET``

2. for each entity, adapt it to ``IFTIndexable`` (see
   :class:`~cubicweb.entities.adapters.IFTIndexableAdapter`)

3. call
   :meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words` on
   the adapter, which is supposed to return a dictionary mapping each
   *weight* to a *list of words*, as expected by
   :meth:`~logilab.database.fti.FTIndexerMixIn.index_object`. The
   tokenization of each attribute value is done by
   :func:`~logilab.database.fti.tokenize`.

See :class:`~cubicweb.entities.adapters.IFTIndexableAdapter` for more documentation.
       
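The three steps above can be pictured with a small standalone sketch. It
does not use CubicWeb at all: ``FakeEntity``, ``simple_tokenize`` and
``build_words`` are hypothetical names introduced for illustration, the
tokenizer is only a naive stand-in for ``logilab.database.fti.tokenize``,
and the weight given to each attribute is an assumption.

.. sourcecode:: python

    import re
    from collections import namedtuple

    # Hypothetical stand-in for an entity with two fulltext-indexed attributes.
    FakeEntity = namedtuple('FakeEntity', ['title', 'description'])


    def simple_tokenize(text):
        """Naive tokenizer: lowercase the text, split on non-word characters."""
        return [word for word in re.split(r'\W+', text.lower()) if word]


    def build_words(entity, weights):
        """Return a dictionary mapping each weight to the list of words
        found in the attributes assigned to that weight."""
        words = {}
        for attr, weight in sorted(weights.items()):
            words.setdefault(weight, []).extend(
                simple_tokenize(getattr(entity, attr)))
        return words


    entity = FakeEntity(title='Rebuild the index',
                        description='Run db-rebuild-fti on demand.')
    # Assumption: the title matters more ('A') than the description ('C').
    words = build_words(entity, {'title': 'A', 'description': 'C'})

Here ``words`` has the same shape as the dictionary ``get_words`` is
expected to return: one list of tokens per weight.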
       
Yams and ``fulltext_container``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible in the datamodel to indicate that fulltext-indexed
attributes defined for an entity type will be used to index not the
entity itself but a related entity. This is especially useful for
composite entities. Let's take a look at (a simplified version of)
the base schema defined in CubicWeb (see :mod:`cubicweb.schemas.base`):

.. sourcecode:: python

  from yams.buildobjs import (EntityType, RelationDefinition,
                              RelationType, String, Password)
  from cubicweb.schema import WorkflowableEntityType

  class CWUser(WorkflowableEntityType):
      login     = String(required=True, unique=True, maxsize=64)
      upassword = Password(required=True)

  class EmailAddress(EntityType):
      address = String(required=True, fulltextindexed=True,
                       indexed=True, unique=True, maxsize=128)


  class use_email_relation(RelationDefinition):
      name = 'use_email'
      subject = 'CWUser'
      object = 'EmailAddress'
      cardinality = '*?'
      composite = 'subject'


The schema above states that there is a relation between ``CWUser`` and ``EmailAddress``
and that the ``address`` field of ``EmailAddress`` is fulltext indexed. Therefore,
in your application, if you use fulltext search to look for an email address, CubicWeb
will return the ``EmailAddress`` itself. But the objects we'd like to index
are more likely to be the associated ``CWUser`` than the ``EmailAddress`` itself.

The simplest way to achieve that is to tag the ``use_email`` relation in
the datamodel:

.. sourcecode:: python

  class use_email(RelationType):
      fulltext_container = 'subject'
       
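To picture the effect of ``fulltext_container``, here is a toy model in
plain Python. It is not CubicWeb's implementation: ``index_words`` is a
hypothetical helper and the entities are plain dictionaries. It files the
words of each email address under the user owning it, so that searching for
an address leads to the user.

.. sourcecode:: python

    def index_words(users):
        """Map each word to the set of user logins it should lead to.

        Toy model: the words of every email address are filed under the
        user owning it, as if ``use_email`` had
        ``fulltext_container = 'subject'``.
        """
        index = {}
        for user in users:
            for address in user['use_email']:
                # naive tokenization: split the address around the '@'
                for word in address.lower().replace('@', ' ').split():
                    index.setdefault(word, set()).add(user['login'])
        return index


    users = [
        {'login': 'alice', 'use_email': ['alice@example.org']},
        {'login': 'bob', 'use_email': ['bob@example.org',
                                       'robert@example.net']},
    ]
    index = index_words(users)

In this model, looking up ``index['robert']`` gives the login ``bob``
rather than the address itself.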
       
Customizing how entities are fetched during ``db-rebuild-fti``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``db-rebuild-fti`` will call the
:meth:`~cubicweb.entities.AnyEntity.cw_fti_index_rql_queries` class
method on your entity type.

.. automethod:: cubicweb.entities.AnyEntity.cw_fti_index_rql_queries

Now, suppose you've got a *huge* table to index: you probably don't want to
get all entities at once. So here's a simple customized example that will
process blocks of 10000 entities:

.. sourcecode:: python

    from cubicweb.entities import AnyEntity

    class MyEntityClass(AnyEntity):
        __regid__ = 'MyEntityClass'

        @classmethod
        def cw_fti_index_rql_queries(cls, req):
            # get the default RQL query and insert LIMIT / OFFSET instructions
            base_rql = super(MyEntityClass, cls).cw_fti_index_rql_queries(req)[0]
            selected, restrictions = base_rql.split(' WHERE ')
            rql_template = '%s ORDERBY X LIMIT %%(limit)s OFFSET %%(offset)s WHERE %s' % (
                selected, restrictions)
            # count how many entities you'll have to index
            count = req.execute('Any COUNT(X) WHERE X is MyEntityClass')[0][0]
            # iterate by blocks of 10000 entities
            chunksize = 10000
            for offset in range(0, count, chunksize):
                rql = rql_template % {'limit': chunksize, 'offset': offset}
                print('SENDING %s' % rql)
                yield rql

Since you have access to ``req``, you can more or less fetch whatever you want.
       
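The LIMIT / OFFSET rewriting used in the example above can be tried without
a running instance. The sketch below is a standalone helper:
``paginate_rql`` is a hypothetical name, and the base query string is only
an assumed example of what the default implementation might return.

.. sourcecode:: python

    def paginate_rql(base_rql, count, chunksize):
        """Yield one RQL query per block of ``chunksize`` entities."""
        selected, restrictions = base_rql.split(' WHERE ', 1)
        template = '%s ORDERBY X LIMIT %d OFFSET %d WHERE %s'
        for offset in range(0, count, chunksize):
            yield template % (selected, chunksize, offset, restrictions)


    # Assumed base query; the real one is built by the default implementation.
    queries = list(paginate_rql('Any X WHERE X is MyEntityClass', 25000, 10000))

With 25000 entities and blocks of 10000, this yields three queries, with
offsets 0, 10000 and 20000.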
       
Customizing :meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also customize the FTI process by providing your own ``get_words()``
implementation:

.. sourcecode:: python

    from cubicweb.predicates import is_instance
    from cubicweb.entities.adapters import IFTIndexableAdapter

    class SearchIndexAdapter(IFTIndexableAdapter):
        __regid__ = 'IFTIndexable'
        __select__ = is_instance('MyEntityClass')

        def fti_containers(self, _done=None):
            """this should yield any entity that must be considered to
            fulltext-index self.entity

            CubicWeb's default implementation will look for yams'
            ``fulltext_container`` property.
            """
            yield self.entity
            yield self.entity.some_related_entity

        def get_words(self):
            # implement any logic here
            # see http://www.postgresql.org/docs/9.1/static/textsearch-controls.html
            # for the actual meaning of 'C'
            return {'C': ['any', 'word', 'I', 'want']}