.. _fti:

Full Text Indexing in CubicWeb
------------------------------

When an attribute is tagged as *fulltext-indexable* in the datamodel,
CubicWeb will automatically trigger hooks to update the internal
fulltext index (i.e. the ``appears`` SQL table) each time this attribute
is modified.

CubicWeb also provides a ``db-rebuild-fti`` command to rebuild the whole
fulltext index on demand:

.. sourcecode:: bash

    cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance

You can also rebuild the fulltext index for a given set of entity types:

.. sourcecode:: bash

    cubicweb@esope~$ cubicweb db-rebuild-fti my_tracker_instance Ticket Version

In the above example, only the fulltext index of entity types ``Ticket`` and
``Version`` will be rebuilt.


Standard FTI process
~~~~~~~~~~~~~~~~~~~~

Considering an entity type ``ET``, the default *fti* process is to:

1. fetch all entities of type ``ET``

2. for each entity, adapt it to ``IFTIndexable`` (see
   :class:`~cubicweb.entities.adapters.IFTIndexableAdapter`)

3. call
   :meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words` on
   the adapter, which is supposed to return a dictionary *weight* ->
   *list of words* as expected by
   :meth:`~logilab.database.fti.FTIndexerMixIn.index_object`. The
   tokenization of each attribute value is done by
   :func:`~logilab.database.fti.tokenize`.


See :class:`~cubicweb.entities.adapters.IFTIndexableAdapter` for more documentation.

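The three steps above can be sketched in plain Python. This is a simplified
stand-in, not the CubicWeb implementation: ``FakeEntity``, ``FakeAdapter``,
``simple_tokenize`` and ``rebuild_index`` are hypothetical names, and the
single ``'C'`` weight is an assumption made for the example.

.. sourcecode:: python

    import re


    def simple_tokenize(text):
        """Hypothetical tokenizer: lowercase the text and split it into words."""
        return re.findall(r'\w+', text.lower())


    class FakeEntity(object):
        """Stand-in for an entity: an eid plus a dict of attribute values."""

        def __init__(self, eid, **attributes):
            self.eid = eid
            self.attributes = attributes


    class FakeAdapter(object):
        """Stand-in for an ``IFTIndexable`` adapter."""

        def __init__(self, entity):
            self.entity = entity

        def get_words(self):
            # return a dictionary mapping a weight to a list of words;
            # every attribute gets the same (assumed) weight 'C' here
            words = []
            for value in self.entity.attributes.values():
                words.extend(simple_tokenize(value))
            return {'C': words}


    def rebuild_index(entities):
        """Walk through the entities, adapt each one and collect its words."""
        index = {}
        for entity in entities:                      # step 1: fetched entities
            adapter = FakeAdapter(entity)            # step 2: adapt the entity
            index[entity.eid] = adapter.get_words()  # step 3: weight -> words
        return index


    tickets = [FakeEntity(1, title=u'Fix login page'),
               FakeEntity(2, title=u'Update docs')]
    index = rebuild_index(tickets)

In this toy model ``index`` maps each eid to the dictionary returned by
``get_words()``; the real indexer stores the words in the database instead.
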
Yams and ``fulltext_container``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible in the datamodel to indicate that fulltext-indexed
attributes defined for an entity type will be used to index not the
entity itself but a related entity. This is especially useful for
composite entities. Let's take a look at (a simplified version of)
the base schema defined in CubicWeb (see :mod:`cubicweb.schemas.base`):

.. sourcecode:: python

    from yams.buildobjs import EntityType, RelationDefinition, String, Password
    from cubicweb.schema import WorkflowableEntityType

    class CWUser(WorkflowableEntityType):
        login = String(required=True, unique=True, maxsize=64)
        upassword = Password(required=True)

    class EmailAddress(EntityType):
        address = String(required=True, fulltextindexed=True,
                         indexed=True, unique=True, maxsize=128)


    class use_email_relation(RelationDefinition):
        name = 'use_email'
        subject = 'CWUser'
        object = 'EmailAddress'
        cardinality = '*?'
        composite = 'subject'

The schema above states that there is a relation between ``CWUser`` and ``EmailAddress``
and that the ``address`` field of ``EmailAddress`` is fulltext indexed. Therefore,
in your application, if you use fulltext search to look for an email address, CubicWeb
will return the ``EmailAddress`` itself. However, the object we would like the search
to return is more likely the associated ``CWUser`` than the ``EmailAddress``.

The simplest way to achieve that is to tag the ``use_email`` relation in
the datamodel, so that its subject (the ``CWUser``) acts as the fulltext container:

.. sourcecode:: python

    from yams.buildobjs import RelationType

    class use_email(RelationType):
        fulltext_container = 'subject'

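To illustrate the effect of this tag, here is a toy model in plain Python.
It is not CubicWeb code: ``index_words`` and both dictionaries are
hypothetical, and the eids are made up. Words coming from an
``EmailAddress`` are stored under the eid of the ``CWUser`` that uses it,
so a search on the address leads to the user.

.. sourcecode:: python

    def index_words(entities, containers):
        """Map each indexed word to the eids that a search should return.

        ``entities`` maps an eid to its list of words; ``containers`` maps
        the eid of a contained entity to the eid of its fulltext container.
        """
        index = {}
        for eid, words in entities.items():
            # index under the container when there is one, else under the entity
            target = containers.get(eid, eid)
            for word in words:
                index.setdefault(word, set()).add(target)
        return index


    # eid 10 stands for a CWUser, eid 20 for an EmailAddress used by that user
    entities = {10: ['alice'], 20: ['alice@example.com']}
    containers = {20: 10}  # effect of fulltext_container = 'subject'
    index = index_words(entities, containers)

With an empty ``containers`` dictionary, the address would be indexed under
its own eid, which corresponds to the untagged behaviour described above.
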
Customizing how entities are fetched during ``db-rebuild-fti``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``db-rebuild-fti`` will call the
:meth:`~cubicweb.entities.AnyEntity.cw_fti_index_rql_queries` class
method on your entity type.

.. automethod:: cubicweb.entities.AnyEntity.cw_fti_index_rql_queries

Now, suppose you've got a *huge* table to index: you probably don't want to
get all entities at once. Here's a simple customized example that will
process blocks of 10000 entities:

.. sourcecode:: python

    from cubicweb.entities import AnyEntity


    class MyEntityClass(AnyEntity):
        __regid__ = 'MyEntityClass'

        @classmethod
        def cw_fti_index_rql_queries(cls, req):
            # get the default RQL query and insert LIMIT / OFFSET instructions
            base_rql = super(MyEntityClass, cls).cw_fti_index_rql_queries(req)[0]
            selected, restrictions = base_rql.split(' WHERE ')
            rql_template = '%s ORDERBY X LIMIT %%(limit)s OFFSET %%(offset)s WHERE %s' % (
                selected, restrictions)
            # count how many entities you'll have to index
            count = req.execute('Any COUNT(X) WHERE X is MyEntityClass')[0][0]
            # iterate by blocks of 10000 entities
            chunksize = 10000
            for offset in range(0, count, chunksize):
                rql = rql_template % {'limit': chunksize, 'offset': offset}
                print('SENDING %s' % rql)
                yield rql

Since you have access to ``req``, you can more or less fetch whatever you want.


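The string manipulation used to build the paginated queries can be checked
in isolation, without a running instance. The helper below is a standalone
sketch: ``chunked_queries`` is a hypothetical name, and the base query and
entity count are assumed values chosen for illustration.

.. sourcecode:: python

    def chunked_queries(base_rql, count, chunksize):
        """Yield copies of ``base_rql`` restricted to successive blocks of rows."""
        selected, restrictions = base_rql.split(' WHERE ')
        template = '%s ORDERBY X LIMIT %%(limit)s OFFSET %%(offset)s WHERE %s' % (
            selected, restrictions)
        for offset in range(0, count, chunksize):
            yield template % {'limit': chunksize, 'offset': offset}


    # assumed base query and entity count, for illustration only
    queries = list(chunked_queries('Any X WHERE X is MyEntityClass', 25000, 10000))

Here ``queries`` holds three RQL strings, with offsets 0, 10000 and 20000.
The ``ORDERBY X`` clause keeps the blocks stable from one query to the next.
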
Customizing :meth:`~cubicweb.entities.adapters.IFTIndexableAdapter.get_words`
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also customize the FTI process by providing your own ``get_words()``
implementation:

.. sourcecode:: python

    from cubicweb.entities.adapters import IFTIndexableAdapter
    # is_instance lives in cubicweb.predicates in recent releases
    # (cubicweb.selectors in older ones)
    from cubicweb.predicates import is_instance

    class SearchIndexAdapter(IFTIndexableAdapter):
        __regid__ = 'IFTIndexable'
        __select__ = is_instance('MyEntityClass')

        def fti_containers(self, _done=None):
            """this should yield any entity that must be considered to
            fulltext-index self.entity

            CubicWeb's default implementation will look for yams'
            ``fulltext_container`` property.
            """
            yield self.entity
            yield self.entity.some_related_entity


        def get_words(self):
            # implement any logic here
            # see http://www.postgresql.org/docs/9.1/static/textsearch-controls.html
            # for the actual meaning of the 'C' weight
            return {'C': ['any', 'word', 'I', 'want']}

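As a rough idea of what such logic could look like, the standalone sketch
below builds a *weight* -> *list of words* dictionary from several pieces
of text. ``build_words`` is a hypothetical helper, the regular expression is
a simplistic substitute for the real tokenizer, and the use of ``'A'`` for
more important text assumes the PostgreSQL weight labels referenced above.

.. sourcecode:: python

    import re


    def build_words(weighted_fields):
        """Build a ``weight -> list of words`` dictionary.

        ``weighted_fields`` is an iterable of ``(weight, text)`` pairs.
        """
        words = {}
        for weight, text in weighted_fields:
            tokens = re.findall(r'\w+', text.lower())
            words.setdefault(weight, []).extend(tokens)
        return words


    words = build_words([('A', u'Broken login'),
                         ('C', u'The login page fails'),
                         ('C', u'Seen on Firefox')])

A custom ``get_words()`` could return such a dictionary, for instance giving
a title a higher weight than a description.
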