FAQ (Frequently Asked Questions)

  1. How do I pull the full rows (including string columns) from Sphinx without going to SQL database?
  2. How do I filter, sort, or group by string without string attributes?
  3. Do I need to build that SphinxSE thing into MySQL in order to use/test Sphinx?
  4. How do I pull more than 1000 matches from Sphinx
  5. My queries with non-ASCII characters are not working?
  6. How do I filter using OR conditions if filters only support AND?
  7. How do I search in several (but not all) fields?
  8. How do I search for substrings, not just whole words?
  9. I built SphinxSE as a plugin for MySQL 5.1.x but it crashes on CREATE TABLE?

How do I pull the full rows (including string columns) from Sphinx without going to SQL database?

APIs only returns what search index actually contains. And currently, Sphinx does not store all of the original row data. (This might change in the future, but there are no immediate plans.) It can store numeric columns but not the text ones. So you will have to pull document IDs from Sphinx and then go to the database for the full row contents. This is usually a matter of exactly 1 additional SQL query, for example:

SELECT * FROM documents WHERE id IN (3,5,7)

Note that the database will usually reorder the rows in the result set! So you will also have to reorder them back in the order as received from Sphinx. With MySQL, you can use its ORDER BY field feature:

SELECT * FROM documents WHERE id IN (3,5,7) ORDER BY FIELD(id,3,5,7)

The output you might be seeing from CLI search does exactly additional SQL queries (and, by the way, in a bad way, with a separate query per each row, instead of using IN() and firing only 1 query for the whole result set). It's only intended for debugging and tesing.

How do I filter, sort, or group by string column without string attributes?

You can do all of this, except for precise arbtrary-length sorting over several indexes.

To filter and group, you can replace the string with an unique numeric ID. Sometimes its possible to create a lookup dictionary in the database (eg. for fixed lists of cities or countries), or even use an existing one, replace strings with their IDs in that dictionary, then filter and group on that ID. If not, you can always replace the string with its checksum, eg. CRC32() or (any) 64 bits taken from MD5() at indexing time (no need to alter the tables!), store it using sql_attr_uint or sql_attr_bigint respectively, and then filter or group on that checksum attribute. (Note that there's a certain chance of CRC32() collisions if you have millions of strings but practically zero chance of MD5() collisions.)

Sorting is harder, but also possible to some extent. First, you can use sql_attr_str2ordinal that replaces every string with its ordinal number (a sequential number in the sorted list of all the uniqie strings mentioned in this index). However, that will produce garbled results when querying through several indexes at a time, because every index will assign its own sequential number to the same strings. Ie. string "zzz" could be number 1000 in index A but at the same time number 1 in index B, and errorneously float to the top when searching through both A+B. Second, you can extract 4-byte or 8-byte substrings, store them as attributes, and sort on those attributes. That will work across different indexes, but sort only on few first bytes.

Do I need to build that SphinxSE thing into MySQL in order to use/test Sphinx?

No, you don't!

SphinxSE is just an ordinary client that talks to searchd over network. You need it when you do not have a native API for your language and want to work through MySQL. Or if you want to optimize certain workloads that pull big Sphinx result sets directly to MySQL and additionally process them there (and avoid overheads of pulling Sphinx set to application and then immediately pushing it to MySQL). However, you are not required to use SphinxSE to work with Sphinx.

How do I pull more than 1000 matches from Sphinx?

There are two places where the default limit of at most 1000 matches per query is imposed.

First, on the server side, you have to change max_matches settings in sphinx.conf, and restart the searchd. (With 0.9.9 and above, you can simply reload the config file using SIGHUP.)

Second, there's also a 3rd argument to SetLimits() API call that also defaults to 1000 and must be raised. (With SphinxSE, use "maxmatches" option.)

My queries with non-ASCII characters are not working?

Character encoding must be consistent everywhere: namely, when pulling data from SQL, when indexing that data, and when searching. The checklist is as follows:

  1. Ensure that indexer is receiving properly encoded data. When using MySQL, your server might default to something like latin1. Use the following snippet to enforce the specific encoding for indexer's connection:
    sql_query_pre = SET NAMES utf8
    
  2. Ensure that charset_type and charset_table are both properly configured in config file. Note that the default values for charset_table only include ASCII Latin characters and Russian characters; but nothing else. Refer here for a number of tables for different languages.
  3. Ensure that the index was built with proper settings and that searchd was started with proper settings too. (Starting with 0.9.9, it's enough to reload the index using SIGHUP instead of restarting searchd. But in that case, you should also check searchd.log for index rotation status.)
  4. Ensure that the query passed to searchd is in proper encoding. If your web page is in latin1 but you've configured everything else to use UTF-8, queries won't work.

How do I filter using OR conditions if filters only support AND?

There are the following routes:

  1. Unless your filters include range checks, you can index magic keywords instead of the attributes, and then replace filtering condition with fulltext search condition. Here is a sample:
    # in sphinx.conf
    sql_query = SELECT id, ..., CONCAT('_user',userid,'; _group',groupid) FROM docs
    
    // in webapp
    $cl->SetMatchMode ( SPH_MATCH_EXTENDED2 );
    $cl->Query ( "my query ( _user123 | _group456 )", "myindex" );
    
  2. Starting with 0.9.9, you can compute an expression and filter matches based on that:
    $cl->SetSelect ( "*, userid=123 OR groupid=456 AS mycond" );
    $cl->SetFilter ( "mycond", array(1) );
    $cl->Query ( "my query", "myindex" );
    

How do I search in several (but not all) fields?

You can enumerate several fields in the field limit operator:

@(title,body) hello world

This will search for both keywords (hello and world) in both fields (title and body).

How do I search for substrings, not just whole words?

Currently, you have to explicitly a) configure Sphinx to index either prefixes (substrings starting at word start) or infixes (general substrings), and b) enable star-syntax (ie. hello wor*) which is disabled by default for compatibility reasons. Refer to documentation on min_prefix_len, min_infix_len, and enable_star for details. Note that indexing time and index size can grow significantly when using these!

I built SphinxSE as a plugin for MySQL 5.1.x but it crashes on CREATE TABLE?

This can be caused by configure-time options used to build your MySQL binary and SphinxSE plugin. For instance, pre-built MySQL binaries from Sun usually come built with --with-fast-mutexes switch enabled at configure. That results in a size mismatch between certain internal structures and leads to crashes. Unfortunately, we do not know of a reliable way to automatically detect that.

To fix the issue, either rebuild SphinxSE plugin using ./configure --with-fast-mutexes, or rebuild mysqld binary from the source.

Did not find a specific answer?

...that you were looking for? Consider our commercial support, or ask around on community forum for advice.