;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; ;; Sphinx search module for Drupal 5.x ;; $Id: README.txt,v 1.2.2.8 2008-10-01 20:02:19 markuspetrux Exp $ ;; ;; Original author: markus_petrux at drupal.org (July 2008) ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; REQUIREMENTS ============ - Drupal 5.x (planned port to D6) - PHP 4.4.x or PHP 5.x (PHP needs to be compiled with --enable-memory-limit). - It should work for any DB engine supported by Drupal. - Sphinx 0.9.8 - Shell access to the box where Sphinx is installed. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; INSTALLATION ============ 1) Install Sphinx. It is recommended to install Sphinx on separate box, but it may also work on any other server of your farm, or even in the same box your web server, mysql or whatever is installed. For more details, additional requirements, etc. please, read Sphinx documentation. Here's just a quick start guide. You need root access to the box. # move to a temp directory. cd /opt # download and untar Sphinx source. wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.tar.gz tar xzf sphinx-0.9.8.tar.gz cd sphinx-0.9.8 # optionally, download and untar libstemmer. wget http://snowball.tartarus.org/dist/libstemmer_c.tgz tar xzf libstemmer_c.tgz # you may need to adjust file ownerships. chown -R root.root * # build, compile and install sphinx + libstemmer. ./configure --with-mysql --with-libstemmer --prefix=/usr/local/sphinx make make install 2) See sphinxsearch/contrib subdirectory. It contains samples for sphinx.conf and sphinx start/stop script. ***** IMPORTANT ***** Files in contrib subdirectory are just samples. Please, note they are provided in order to help you setup your Sphinx installation, but without warranties of any kind. Note that I started to learn it just recently. Also, my environment and needs may differ a lot from yours. Please, don't use them as-is. If you do, it is at your own risk. ********************* 3) Install sphinxsearch Drupal module. - Copy to modules/sphinxsearch all files and directories. - Copy sphinxsearch_scripts subdirectory provided within this module to your Drupal root directory. Instead, you may wish to setup a symbolic link from your Drupal root to the sphinxsearch_scripts subdirectory of this module. This way you don't need to copy files when module is updated. Please, see README-XMLPIPE.txt for further information and examples. - Goto admin/build/modules to install the module. - Goto admin/user/access to adjust permissions. (use sphinxsearch, administer sphinxsearch) - Goto admin/settings/sphinxsearch to configure module options. (see below) 4) Customization. - Check module settings and adjust them to your environment. - Create and/or adjust your sphinx.conf to include definitions for all indexes required by your Drupal site. You need at least one main index, optionally as many main indexes as you need, and also optionaly one single delta index. It is also necessary to create a distributed index that will be used to join all your indexes when resolving search queries. (see contrib subdirectory for examples and further information). - Setup crontab to build your main and delta indexes at intervals. ***** IMPORTANT ***** There are options in the module settings panel that require you to rebuild main indexes. Otherwise, you may get errors when searching. ********************* - Watchdog logging: XMLPipe processing generates watchdog records with information on memory used, execution time, nodes processed, etc., to help you adjust module settings to suit your needs. - Steps to create your initial set of indexes: It is assumed that your sphinxsearch module has been installed and configured, also that you have already installed and configured your Sphinx server accordingly. 1) Stop your searchd daemon. 2) Use Sphinx indexer to build all your main indexes. 3) Start your searchd daemon. 4) Setup cron task to rebuild your delta index at short intervals. 5) Setup cron task to rebuild your main indexes once a day or so. Once your initial set of indexes is created, you don't need to stop your searchd daemon. Instead, you can invoke Sphinx indexer with --rotate argument. See docs/contrib subdirectory of this package for sample script. - Troubleshooting: Symptom: When creating your initial set of main/delta indexes, you may endup with index file names with ".new" in them. Often, Sphinx searchd daemon deals with this naming convention transparently. However, it may sometimes fail to recognise these files. Not exactly sure why, though. Solution: Stop searchd daemon and rename you index files to remove the ".new" part. ie. if you see something like "main.new.spp", you can rename it to "main.spp". Note than each Sphinx index uses several files with same name and different extension. Start again searchd daemon when all files have been renamed. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; SPHINX IMPLEMENTATION DETAILS ============================= - Sphinx is a fast and scalable full-text search engine. However, it currently has a few limitations related to the way text is indexed. - Sphinx index documents that are composed of fields of different types. It basically supports text fields, integers, timestamps, booleans, multi-valued attributes (lists of integers that can be used to implement 1-N relations), etc. For instance, to manage basic Drupal content (nodes) we can use text fields to index titles and bodies, an integer field to store the node author id, a timestamp field to store last updated time, a boolean field to store the is_deleted attribute (we'll see this later) or multi-valued attributes field to store the list of terms related to each node. - Current version of Sphinx does not support full live index updates. Instead, it is necessary to build indexes in jobs that transform your data into a special kind of documents that are stored in Sphinx indexes. This process is executed by the Sphinx indexer command and it should be invoked from the server where Sphinx is installed. A particular Sphinx installation can manage a number of indexes, you can partition indexes managed locally, or even remotely from other Sphinx servers. You can install your Sphinx server on a dedicated server (recommended) or it can coexist in one server with any other software of your choice. - Each Sphinx instance is configured with its own sphinx.conf file where you can specify how your indexes are built, structure of your Sphinx documents, how your data should be extracted to build them, as well as options that tell Sphinx how the searchd daemon should work. The searchd daemon can be configured to listen on a particular TCP port of the server. Then, Sphinx provides a series of APIs that can be used to connect your application to the searchd daemon (locally or remotely) to perform search queries, retrieve results, build excerpts highlighting keywords, or even update some kind of attributes. However, it is not possible to index new documents, it is not possible to update text fields and it is not possible to delete indexed documents. It is only possible to update non-text fields. - Therefore, it is necessary to create indexes in batch jobs. These jobs will index all your content at once, and it is necessary to repeat this task periodically in order to recover space used by documents marked as being deleted, index new documents and/or reindex documents that have been updated since last time indexes were built. - A note on document deletions. This module creates Sphinx documents with a boolean attribute, is_deleted, that is used as a flag to keep track of nodes that have been deleted from Drupal database, but that still exist in Sphinx indexes. When a node is indexed, its own is_deleted attribute is set to 0. When a node is deleted from the Drupal database, the Sphinx API is used to set the is_deleted attribute of that node to 1. Finally, all search queries sent by this module filter out documents with this attribute enabled. This method allows us to tell Sphinx supports live document deletions, but as you can see this is not the case. - In this scenario, we need to work in Sphinx with the so called main + delta scheme. See Sphinx documentation for more details. In short, main indexes should be rebuilt periodically in order to recover space used by deleted documents, and delta index should be rebuild as often as possible to take care of new and updates documents until your main indexes are rebuilt. - Once you have created your main indexes, new and/or updated nodes will be stored in delta index. You may wish to rebuild your delta index at short intervals using crontab from the server where Sphinx has been installed. These intervals basically depend on the time required to process each delta and the number of node updates in your site. You may wish to start with 5 minutes and adjust your crontab as you get more experience. The module generates full reports in watchdog to help you monitor index processing. - Sphinx also supports distributed indexes. This type of indexes can be used to join a number of indexes that share exact same structure. In this case, we join as many main indexes as we may need, plus the delta index. In case a document is stored in more than one index, the one stored in the last index in the list "wins". Joined indexes can be local (managed by the same Sphinx instance) or remote. This is great in terms of scalability. In fact, this means we can split the index rebuild process in chunks that can be easily managed, or even spread to other servers in your infrastructure. Queries sent to distributed indexes are resolved by Sphinx transparently, as if it was a single index. - Data sources to build Sphinx indexes can be of type MySQL, PostgreSQL and XMLPipe. In the case of MySQL or PostgreSQL source types it is possible to tell Sphinx indexer to extract data directly from your database, and this method is impressively fast. However, these methods cannot be used to index Drupal nodes, or at least it would be so difficult to achieve, because data related to nodes often needs to be proprocessed by a number of hooks that may involve a lot of small and quick (or not so quick) SQL queries and further processing performed by core modules as well as contrib modules. For instance, XMLPipe is the only method that allows us to index nodes along with their comments, cck fields, taxonomy terms, etc. In fact, this method allows us to index content the same way Drupal core search works. - It is something important to take into account that XMLPipe generation may require more resources than what one would expect at first, compared to other Sphinx implementations. It all depends on the complexity of your Drupal intallation, modules installed, size and number of nodes, available infraestructure, etc. Note that Drupal search core solves this problem by splitting index generation in chunks where a number of nodes is indexed at cron intervals, however with Sphinx we need to index all content at once. Of course, it is also possible to partition indexes so your nodes are spread into several storage units, though this method might only be recommended when your site has thousands of nodes, maybe millions. Again, it all depends on the time it takes to create your indexes, which may be from a few minutes up to one or more hours. - So here's why this module is based on and supports XMLPipe index type generation. Problem is now, this method is MUCH slower than indexing content using MySQL/PostgreSQL index types. You may wish now take a look at the docs subdirectory of this project to see the options this module provides to help you setup and manage your index creation jobs, etc. - In order to minimize these problems, the XMLPipe generation script provided with this module implements a few checks that will abort XMLPipe stream generation and report the cause of the problem to watchdog. Actually, the module monitors memory usage and execution time in order to prevent crashes when PHP memory_limit and/or max_execution_time values are exceeded. Depending on module settings, it is also possible to setup the XMLPipe generation script to restart client connection to DB server to prevent from getting max connection time problems. You may also wish to adjust PHP settings from the .htaccess file provided within the sphinxsearch_scripts subdirectory of this module. - Here's a couple of examples where I have implemented Sphinx, so you can get an idea of how many time it may take to process your indexes, and/or a sample reference on how to setup your Sphinx installation. a) phpBB based board with 14+ million posts, 15,000 posts a day average, and growing. Here, I used 4 main indexes with capacity for 5 million posts each, and one delta index. Generation of each main index takes around 1 hour. 1 or 2 main indexes are built daily. Generation of delta index just takes seconds and it is scheduled to run at 1 minute intevals from cron. If you wish, you can test Sphinx search engine implemented on this site from here: http://zonaforo.meristation.com/foros/search.php b) Drupal based site running this module. Site has 10,000+ blog entries and 30,000+ comments. It uses 1 main index + 1 delta. Main index takes less than 5 minutes to build and it is executed daily. Delta index takes a few seconds and it is executed at 5 minutes intervals from cron. Again, if you wish, you can test Sphinx search engine implemented on this site from here: http://blogs.gamefilia.com/search It all depends on several factors. Of course, your mileage may vary. - New or different ideas to fight against forementioned "limitations" are welcome. Please, use issue tracker of the module. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; TODO ==== - New features for D5 version are frozen. Active development has been moved to the Drupal 6 version of the module.