

A short time ago I was working on a project using SQL Server Full Text Search.  This was my first real in-depth exposure to the engine, and it worked pretty well.  However, for the project I was working on there were some very serious problems that I was never able to overcome, no matter how much I tweaked the engine.

Here are some of the problems that I faced while playing with the Full Text Engine:

  1. SQL Server Full Text Engine 2005 (FTE) had some scalability issues.  I didn’t so much need it to index hundreds of millions of rows (although it needed to handle a lot of records); mostly I needed a lot of queries per second.  It seemed to only be able to use a single processor at a time.  If someone ran a ridiculously huge query, everyone else on that server came to a halt!  On a box with 8 logical processors this was not acceptable!
  2. FTE has no capability for compound words (e.g. the terms ‘Riverboat’ and ‘River Boat’).  Sure, you could put such pairs in the thesaurus as expansions, right?  Well, I will get to that.
  3. FTE black-boxed the word breaker.  You might think that this is not a big deal, but you would be wrong!  FTE considered the ampersand ‘&’ a word breaker when I needed it not to be.  For example, if you did a search for ‘R&B’, the ampersand would break it into the words ‘R’ and ‘B’, both of which are in the noise words list by default.  Therefore terms like ‘R&B’ and ‘AT&T’ were optimized out of the query by design.  Creating your own word breaker is possible but very difficult and not recommended.  Also, I needed an underscore ‘_’ to break words, and it did not.
  4. The ranking information that came back from FTE was not very good.  The rank could not be used to compare results across two different queries, and the numbers were not evenly distributed.  For example, I might have the top 30 rows with a rank of “34” while the rest had a rank of “20”.  The numbers are also arbitrary and meaningless.
  5. The ‘FREETEXTTABLE’ function is useless!  It will only UNION sets rather than INTERSECT them.  This means that a search for ‘Sax’ could return 1,254 rows while the term ‘Tenor Sax’ would return 2,259 rows: every term you add increases the size of the result rather than decreasing it.  We had to use ‘CONTAINSTABLE’ instead, but that led to problems with any compound word in the thesaurus and looked awful (a sketch of generating these criteria follows this list).  Something like:
    (“Sax” OR “Sax*” OR FORMSOF(“Sax”) OR THESAURUS(“Sax”))
    That is for a single-word term.  Each word needs its own set of criteria.
  6. FTE was kind of slow on larger queries.  Returning a set of over 1,000 rows seemed to be quite a chore!
  7. In order to add things to the thesaurus you had to change the XML file and restart the service.  You may even have had to rebuild the index – I don’t remember.
  8. Every few days the searches would slow down a lot!  To get the speed back up we had to shut down the FTE service and the SQL Server service and rebuild the index.  We hoped to write a script to do this chore periodically, but for some reason the script that Management Studio generated didn’t seem to work the same way.
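
To illustrate item 5, here is a rough sketch of the kind of helper we ended up needing to generate those criteria for each word.  The method name is hypothetical, and I am using the documented CONTAINS syntax (FORMSOF(INFLECTIONAL, …) and FORMSOF(THESAURUS, …)) rather than the shorthand shown above; escaping of user input is omitted for brevity:

// Illustrative sketch only: builds verbose per-word CONTAINSTABLE criteria.
// Requires System and System.Linq.
static string BuildContainsCriteria(string searchText) {
    var clauses = searchText
        .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(word => string.Format(
            "(\"{0}\" OR \"{0}*\" OR FORMSOF(INFLECTIONAL, \"{0}\") " +
            "OR FORMSOF(THESAURUS, \"{0}\"))", word));
    // Each word gets its own OR-group; the groups are ANDed together,
    // which is what finally gives us intersection semantics.
    return string.Join(" AND ", clauses.ToArray());
}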

That's all well and good, but what did I gain by writing my own version?

  1. Most of the code is reentrant, and the code that is not uses highly efficient reader/writer locking to make sure the index doesn’t change out from underneath you!  This means that I can fully utilize all logical processors on the machine, and large queries will not interfere with smaller ones.
  2. It also means that the index can be rebuilt while the old index is still in use.  Once the build is complete the two can be quickly and easily swapped.  (See the sketch just after this list.)
  3. I was able to create ranking that yields a very nice, even distribution across the entire result set. (more on this below)
  4. I was able to pull word-breaking terms from a file, with one file per language.  The same files are used to break both the search terms and the text used to build the index.  Because the exact same word breaker runs on both, we got much better search matching!  Terms like ‘R&B’ were indexed and searched for in the same way, noise words were dropped in the same way, and working with compound words became possible.
  5. This search is much faster than SQL Server FTE.  All my queries returned in less than 100ms, usually much less!
  6. Ranking could be customized. (more on this below)
  7. I was able to create a ‘FREETEXTTABLE’ style query that reduced the number of results as the search term became more specific.
  8. I could change the thesaurus on the fly.
  9. I was able to manage compound words much more effectively.
  10. I got to implement my own custom Binary Search which was pretty fun!
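
To illustrate items 1 and 2, here is a minimal sketch of the shared-index pattern using .NET’s ReaderWriterLockSlim.  The type and member names are invented; the actual CSharpQuery internals differ, but the idea is the same: queries share a read lock, and a freshly rebuilt index is swapped in under a very brief write lock.

using System;
using System.Threading;

// Sketch of the shared-index pattern; names are illustrative only.
class IndexHost<TIndex> where TIndex : class {
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private TIndex _current;

    // Any number of queries can hold the read lock at the same time, so
    // every logical processor stays busy and a huge query can't starve
    // the small ones.
    public TResult Query<TResult>(Func<TIndex, TResult> query) {
        _lock.EnterReadLock();
        try { return query(_current); }
        finally { _lock.ExitReadLock(); }
    }

    // The rebuild happens entirely outside the lock; only the reference
    // swap at the end needs the write lock.
    public void Swap(TIndex freshlyBuiltIndex) {
        _lock.EnterWriteLock();
        try { _current = freshlyBuiltIndex; }
        finally { _lock.ExitWriteLock(); }
    }
}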

Before you go tearing off to download my version of search and rip out SQL Server FTE, there are some tradeoffs to using my system.  Most notably, I keep my index in memory, which I don’t believe FTE does.  In the case of this project it only resulted in ~16 MiB of memory usage per language.  We had 32 GiB of memory on that system, so that isn’t such a big deal even with the pretty deep set of data indexed.  But if you had 100 million rows or more, this may not be a good solution for you.

CSharpQuery is also slower at creating the index.  Index creation for FTE is almost instantaneous, whereas mine takes about 10 seconds per language to build.  Again, not a big deal in most cases, but it could be problematic for very large data sets.  Lastly, FTE hands the data right to you in T-SQL and is easy to set up and consume.  CSharpQuery will merely give you a list of matching keys and ranking info; you are expected to join that data in SQL Server yourself.  Then again, this does mean that it can be used outside of SQL Server, which is a plus!

The real genius in the CSharpQuery code is the ranking.  As mentioned above, the ranking in FTE has two fatal flaws: the rankings cannot be compared across result sets, and the ranking values are pretty scant.  In a typical result set of 100 rows there might be only 5-10 distinct ranking values, which means there are large sections that appear unsorted or unranked!  It is a difficult problem to come up with a ranking algorithm that is both universal in its application and provides a ranking so unique and precise as to be almost synonymous with a hash.

This is accomplished by using these four filters:

  • Word Proximity – How close are the search terms to each other?
  • Multiple Occurrence – How many times does the search term appear in the indexed phrase?
  • Low Phrase Index – Are the search terms found near the beginning of the phrase, perhaps in a title?
  • Word Matching – Did we find the exact words or did we use the thesaurus or front word matching to find this row?

The ranking of results is the most complex part of the code.  It is also the most processing-intensive.  Each row is ranked independently by the four filters, and each filter scores the row between 0 and 1.  To produce the final rank, each filter’s result is weighted, because some filters are better than others at finding what you are looking for.  With this setup the less discriminating filters effectively act as tie-breakers, which yields a very nice, even distribution across the entire result set.  The final rank is also a number between 0 and 1.
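
As a sketch, the combination step looks something like the following.  The weights here are made-up placeholders (the tuned values live in the CSharpQuery source); the point is the shape of the math: a weighted average of scores in [0, 1] stays in [0, 1].

// Sketch of combining the four filter scores described above.  Each filter
// has already scored the row between 0 and 1.  The weights are hypothetical
// placeholders, not the values shipped with CSharpQuery.
static double CombineRanks(
    double wordProximity,      // how close the search terms are to each other
    double multipleOccurrence, // how many times the terms appear
    double lowPhraseIndex,     // how near the start of the phrase they occur
    double wordMatching) {     // exact match vs. thesaurus/front matching

    const double wProximity = 4.0, wOccurrence = 2.0,
                 wLowIndex  = 1.0, wMatching   = 8.0;

    double weightedSum =
        wProximity  * wordProximity +
        wOccurrence * multipleOccurrence +
        wLowIndex   * lowPhraseIndex +
        wMatching   * wordMatching;

    // Dividing by the total weight keeps the final rank in [0, 1]; the
    // lower-weighted filters mostly end up breaking ties between rows
    // that the heavier filters score identically.
    return weightedSum / (wProximity + wOccurrence + wLowIndex + wMatching);
}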

Another great feature of CSharpQuery is the tweaked thesaurus for compound words.  Let’s say you were looking for a song titled “Riverboat Shuffle”.  In FTE, if we simply do a thesaurus lookup on “Riverboat” we’ll also get “River Boat”, which returns all results containing ‘river’ and ‘boat’ but not necessarily together.  In my version of the thesaurus the exact same results are returned for the queries “Riverboat Shuffle” and “River Boat Shuffle”.

NOTE: The thesaurus that comes with this download is not necessarily a great one.  Basically, I used a dictionary to find compound words, but discovered that there are a lot of words that look like compound words to a computer but actually are not.  For example, “monkey” was found from “mon” and “key”, but “mon key” is not the same as “monkey”.  It is your responsibility to clean up the thesaurus.  If you would like, you can also send it back to me when you’re done so everyone else can benefit.

Building an Index

In yet another instance of “it’s free for a reason”, CSharpQuery doesn’t come with any kind of tool to build your index for you.  Instead you have to write a small program to build the index.  This can also be advantageous.  For example, in my C# code I combine rows from several track-related tables.  That way, even though the composer is part of a different table, typing a composer’s name will show all of the tracks for that composer.

public static void UpdateCSharpFullTextSearchIndex(
    IDataStore con, int langId, CultureInfo cultureInfo, 
    string cSharpQueryIndexDirectory) {
    
    // Quicksearch SQL
    string quicksearchSql = @"
        SELECT 
            t.TrackID,
            p1.Text + ' ' + -- TrackName
            p2.Text + ' ' + -- TrackDescription
            p3.Text + ' ' + -- AlbumName
            p4.Text + ' ' + -- LibraryName
            ar.ArtistName + ' ' +
            t.Publisher as IndexText
        FROM Track t 
        INNER JOIN Album a 
            ON t.AlbumID = a.AlbumID
        INNER JOIN RecordLabel r 
            ON a.RecordLabelID = r.RecordLabelID
        INNER JOIN Artist ar 
            ON t.ArtistID = ar.ArtistID

        INNER JOIN Phrase p1 
            ON t.Title_Dict = p1.DictionaryID 
            AND p1.LanguageID=@LangID
        INNER JOIN Phrase p2 
            ON t.Description_Dict = p2.DictionaryID 
            AND p2.LanguageID=@LangID
        INNER JOIN Phrase p3 
            ON a.AlbumName_Dict = p3.DictionaryID 
            AND p3.LanguageID=@LangID
        INNER JOIN Phrase p4 
            ON r.RecordLabelName_Dict = p4.DictionaryID 
            AND p4.LanguageID=@LangID";

    SqlConnection conn = Con(con).Connection as SqlConnection;
    try {
        conn.Open();
        SqlCommand cmd = new SqlCommand(quicksearchSql, conn);
        cmd.CommandType = System.Data.CommandType.Text;
        cmd.Parameters.AddWithValue("@LangID", langId);
        SqlDataReader rdr = cmd.ExecuteReader();

        SQLServerIndexCreator creator = new SQLServerIndexCreator();
        
        // Quicksearch Index
        creator.CreateIndex(
            "QuickSearch", // The name of the index
            cSharpQueryIndexDirectory, // Index Dir
            rdr, // An open Data Reader
            cultureInfo, // The culture info for this index
            "TrackID", // The [Key] (int) column of the index
            "IndexText"); // The [Value] column of the index
        rdr.Close();
    } finally {
        if (conn != null && 
            conn.State == System.Data.ConnectionState.Open)
            conn.Close();
    }
}
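
For reference, invoking this method from your own indexing job looks something like the following.  The data store, language ID, and directory are specific to my project and purely illustrative here:

// Hypothetical call site: builds the en-US "QuickSearch" index.
// IDataStore is my project's connection wrapper, not part of CSharpQuery.
IDataStore store = GetDataStore();  // however your app obtains one
UpdateCSharpFullTextSearchIndex(
    store,
    1,                              // our internal LanguageID for English
    new CultureInfo("en-US"),
    @"D:\SearchIndexes");           // where the .index files will live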

As you can see in this index creation example, we concatenate the track name, track description, album name, and library name for the quick search index.  This creates an index file like “Index_QuickSearch.en-US.index”; the name follows the format “Index_[index name].[culture code].index”.  It is important to have a few things in place before you try this.  The following files should exist by default:

  • Invalid Chars.txt – Contains invalid chars like [tab]~{}[CR][LF], etc.
  • Noisewords.global.txt – Contains words that are not useful to index like [a-z], “and”, “the”, etc.
  • Substitutions.global.txt – Contains a list of substitutions you wish to make.  This is usually used to indicate which symbols break words and which do not.  For example, the line “:= ” means that the “:” character will be substituted with a blank space.
  • Thesaurus.global.xml – The thesaurus contains synonyms and compound words.  NOTE: if you use the compound word functionality, the compound term must come second.
  • WhiteSpace.global.txt – This file tells the word breaker which chars are whitespace so they can be safely used to split terms.

These files all work together to help create a better index.  You will also notice the “.global.txt” convention.  That is because you can specialize these files for each culture, so you can have the files “WhiteSpace.en.txt”, “WhiteSpace.en-US.txt”, etc.  The global lists are merged with the language-specific lists; that is what makes them global to all languages.
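
Conceptually the merge is just a union of the global and per-culture files.  The loader below is a sketch of that idea, not the actual CSharpQuery code:

// Conceptual sketch of the "global + per-culture" merge described above.
// Requires System, System.Collections.Generic, System.Globalization, System.IO.
static HashSet<string> LoadWordList(
    string indexDir, string baseName, CultureInfo culture) {

    var words = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    string[] suffixes = {
        "global",                          // e.g. WhiteSpace.global.txt
        culture.TwoLetterISOLanguageName,  // e.g. WhiteSpace.en.txt
        culture.Name                       // e.g. WhiteSpace.en-US.txt
    };
    foreach (string suffix in suffixes) {
        string path = Path.Combine(indexDir, baseName + "." + suffix + ".txt");
        if (!File.Exists(path)) continue;
        foreach (string line in File.ReadAllLines(path)) {
            string word = line.Trim();
            if (word.Length > 0) words.Add(word);
        }
    }
    return words;
}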

Congratulations, you now have an index!  Now for the easy part: using the index.  A typical full text query will look something like this:

public static List<Track> QuickSearch(
    IDataStore con, 
    string csharpQueryIndexLocation, 
    string searchCriteria, 
    int recordLabelID, 
    int pageSize, 
    int pageNumber, 
    SortCriteriaType sort, 
    Language uiCulture, 
    User user, 
    out int rowCount) {
    
    rowCount = 0;
    int? labelID = 
        (recordLabelID > 0) ? recordLabelID : (int?)null;

    FreeTextQuery.DatabasePath = csharpQueryIndexLocation;
    CultureInfo ci = 
        uiCulture.LanguageID == 1 ? 
        CultureInfo.InvariantCulture : 
        new CultureInfo(uiCulture.CultureCode);

    List<QueryResult> trackResults = 
        FreeTextQuery.SearchFreeTextQuery(
        "QuickSearch",    // The Index Name 
        ci,               // The culture info
        searchCriteria);  // The search term

    string trackIDxml = FormatResultsAsXml(trackResults);


    // Do the search
    int? nRowCount = 0;
    var searchResults = Con(con).QuickSearch2(
        trackIDxml, 
        uiCulture.LanguageID, 
        pageSize, 
        pageNumber, 
        labelID, 
        user.UserID, 
        (int)sort, 
        ref nRowCount);
    
    rowCount = nRowCount ?? 0;

    return searchResults;
}

The code above first sets FreeTextQuery.DatabasePath to the location of the text index.  You only need to set this path once since it is static, but I set it on every call, just in case.  Next I call FreeTextQuery.SearchFreeTextQuery to perform the search.  This returns a list of QueryResult, which gives you back the [Key] you specified when creating the index, the rank (pre-sorted), and the locations of the matched words within the original [Value].  This is very handy if you want to do something like highlight search terms, especially if you also want to highlight words that matched via the thesaurus.
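
Since the word locations are what make highlighting possible, here is a rough sketch of how you might use them.  I am treating each location as a (start, length) pair; QueryResult’s actual members may be shaped differently:

// Rough sketch of highlighting matched words in the original [Value] text.
// Requires System.Collections.Generic and System.Text.
static string HighlightMatches(
    string text, IEnumerable<KeyValuePair<int, int>> matches) {

    StringBuilder sb = new StringBuilder();
    int cursor = 0;
    foreach (var m in matches) {                 // assumed sorted by start
        sb.Append(text, cursor, m.Key - cursor); // text before the hit
        sb.Append("<b>");
        sb.Append(text, m.Key, m.Value);         // the matched word
        sb.Append("</b>");
        cursor = m.Key + m.Value;
    }
    sb.Append(text, cursor, text.Length - cursor);
    return sb.ToString();
}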

After I get my results I call a function named FormatResultsAsXml so I can get an XML string to send to SQL Server and join these keys to the actual track information.  My implementation of FormatResultsAsXml uses a StringBuilder because I found it to be a little quicker at creating an XML string.

private static string FormatResultsAsXml(
    List<QueryResult> phraseResults) {
    if (phraseResults == null || phraseResults.Count == 0)
        return null;
    StringBuilder sb = new StringBuilder();
    sb.AppendLine("<root>");
    foreach (QueryResult r in phraseResults) {
        sb.AppendLine("<result>");
        sb.AppendLine("<key>" + r.Key + "</key>");
        sb.AppendLine("<rank>" + r.Rank + "</rank>");
        sb.AppendLine("</result>");
    }
    sb.AppendLine("</root>");
    return sb.ToString();
}

All we have to do now is pass this into our stored proc.  Now, at about this point I’m sure you are thinking something along the lines of “I’m using SQL Server 2005, why not pass this in as the new xml data type rather than as text?”  The answer to that question is simple: for some very strange reason the new XML capability in SQL Server 2005 is VERY SLOW!  It is unusably slow for sets of data larger than just 500 nodes!

Here is a snippet of my T-SQL code:

ALTER PROCEDURE [dbo].[QuickSearch]
(
    @TracksKeysXml text
    ,@LanguageID int
    ,@PageSize int
    ,@PageNumber int
    ,@RecordLabelID int = NULL
    ,@UserID int
    ,@SortID int = NULL 
    ,@RowCount int OUTPUT
) AS
BEGIN

    DECLARE @Tracks TABLE 
    ( 
        TrackID int primary key, 
        [Rank] real 
    )

    IF ( @TracksKeysXml IS NOT NULL )
    BEGIN
        DECLARE @xmlDocTracks int
        EXEC sp_xml_preparedocument 
            @xmlDocTracks OUTPUT, 
            @TracksKeysXml

        /* xml input: 
        <root>
            <result>
                <key>123456</key>
                <rank>0.75245</rank>
            </result>
        </root>*/
                
        INSERT INTO @Tracks
            SELECT * FROM 
            OPENXML(@xmlDocTracks, '/root/result', 2) 
            WITH ( [key] int 'key[1]', [rank] real 'rank[1]')
            
        EXEC sp_xml_removedocument @xmlDocTracks
    END

...

It’s as easy as that!  Feel free to leave a comment if you have any questions.

This software is distributed under the terms of the Microsoft Public License (Ms-PL).  Under the terms of this license, and in addition to it, the following apply:

  • Use this code royalty free in either open source or for profit software.
  • You may not remove the copyright notices from any of the files.
  • You may not charge any 3rd party for the use of this code.
  • You may alter the code, but must not distribute the source once it has been altered.
  • You should give the author, Nathan Zaugg, credit for code whenever possible.
  • The code is provided to you as-is without any warranty, implicit or explicit.

I would also appreciate it if you left a comment if you found the code to be useful.

You may download the source from here: http://interactiveasp.net/media/p/1124.aspx

UPDATE: This project is now being maintained on CodePlex.  See http://csharpquery.codeplex.com/ to get the latest code.
