[fti] add cw_fti_index_rql_limit method
Improves on and deprecate cw_fti_index_rql_queries: execute the rql
directly, so that we don't need to compute the set of eids up-front, but
can just keep track of the last seen entity. Use the new method in
reindex_entities.
Keep calling cw_fti_index_rql_queries if it has been redefined (e.g in
cubes).
Related to #3621392.
.. _fti:
Full Text Indexing in CubicWeb
------------------------------
When an attribute is tagged as *fulltext-indexable* in the datamodel,
CubicWeb will automatically trigger hooks to update the internal
fulltext index (i.e the ``appears`` SQL table) each time this attribute
is modified.
CubicWeb also provides a ``db-rebuild-fti`` command to rebuild the whole
fulltext on demand:
.. sourcecode:: bash
cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance
You can also rebuild the fulltext index for a given set of entity types:
.. sourcecode:: bash
cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance Ticket Version
In the above example, only fulltext index of entity types ``Ticket`` and ``Version``
will be rebuilt.
Standard FTI process
~~~~~~~~~~~~~~~~~~~~
Considering an entity type ``ET``, the default *fti* process is to :
1. fetch all entities of type ``ET``
2. for each entity, adapt it to ``IFTIndexable`` (see
:class:`~cubicweb.entities.adapters.IFTIndexableAdapter`)
3. call
:meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words` on
the adapter which is supposed to return a dictionary *weight* ->
*list of words* as expected by
:meth:`~logilab.database.fti.FTIndexerMixIn.index_object`. The
tokenization of each attribute value is done by
:meth:`~logilab.database.fti.tokenize`.
See :class:`~cubicweb.entities.adapters.IFTIndexableAdapter` for more documentation.
Yams and ``fulltext_container``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is possible in the datamodel to indicate that fulltext-indexed
attributes defined for an entity type will be used to index not the
entity itself but a related entity. This is especially useful for
composite entities. Let's take a look at (a simplified version of)
the base schema defined in CubicWeb (see :mod:`cubicweb.schemas.base`):
.. sourcecode:: python
class CWUser(WorkflowableEntityType):
login = String(required=True, unique=True, maxsize=64)
upassword = Password(required=True)
class EmailAddress(EntityType):
address = String(required=True, fulltextindexed=True,
indexed=True, unique=True, maxsize=128)
class use_email_relation(RelationDefinition):
name = 'use_email'
subject = 'CWUser'
object = 'EmailAddress'
cardinality = '*?'
composite = 'subject'
The schema above states that there is a relation between ``CWUser`` and ``EmailAddress``
and that the ``address`` field of ``EmailAddress`` is fulltext indexed. Therefore,
in your application, if you use fulltext search to look for an email address, CubicWeb
will return the ``EmailAddress`` itself. But the objects we'd like to index
are more likely to be the associated ``CWUser`` than the ``EmailAddress`` itself.
The simplest way to achieve that is to tag the ``use_email`` relation in
the datamodel:
.. sourcecode:: python
class use_email(RelationType):
fulltext_container = 'subject'
Customizing how entities are fetched during ``db-rebuild-fti``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``db-rebuild-fti`` will call the
:meth:`~cubicweb.entities.AnyEntity.cw_fti_index_rql_limit` class
method on your entity type.
.. automethod:: cubicweb.entities.AnyEntity.cw_fti_index_rql_limit
Customizing :meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can also customize the FTI process by providing your own ``get_words()``
implementation:
.. sourcecode:: python
from cubicweb.entities.adapters import IFTIndexableAdapter
class SearchIndexAdapter(IFTIndexableAdapter):
__regid__ = 'IFTIndexable'
__select__ = is_instance('MyEntityClass')
def fti_containers(self, _done=None):
"""this should yield any entity that must be considered to
fulltext-index self.entity
CubicWeb's default implementation will look for yams'
``fulltex_container`` property.
"""
yield self.entity
yield self.entity.some_related_entity
def get_words(self):
# implement any logic here
# see http://www.postgresql.org/docs/9.1/static/textsearch-controls.html
# for the actual signification of 'C'
return {'C': ['any', 'word', 'I', 'want']}