Searching

edited November 2011 in Feedback
Vanilla 2 search uses MySql full text searching which doesn't behave intuitively. In fact, our users find the our forums' search completely unusable. The information about how to correct this as best possible is somewhat buried, as are the customisation options for searching.

Using Google Custom Search is a poor solution, because you have to wait for the next page crawl.

A mainstream search function should try to mirror the same query syntax used for Google, since that is what most users will expect. You don't need the full Google syntax. You just need the following rules:
  • term => all discussions with "term" anywhere (discussion title, discussion body, comment body)
  • term1 term2 => all discussions with "term1" AND "term2" anywhere
  • "some phrase" => all discussions with "some phrase" anywhere
  • "some phrase" term2 => all discussions with "some phrase" AND "term2" anywhere.
That would go a long way. You could get slightly more fancy and allow prefixing terms or phrases with "-" to exclude them. "+" would be unnecessary (but could be recognised and ignored), since in Google it means *precise* phrase, not relevant for mysql boolean full text searching.

The problems with the existing behaviour are:
  • You enter 2 words, and get something along the lines of word1 OR word2, completely counter what you expect
  • Words of 3 characters or less result in no hits at all
  • There's no explanation of existing behaviour
  • You often get back either nothing, or completely irrelevant results. Often you *know* a page exists but just can't find it.
  • The SQL query appears to be wrongly structured; with boolean searching, and the query "+term1 +term2", you only get results where *just* the discussion, or *just* the a single comment, contains both terms, but what I'm after is where the [discussion and all comments] contains both terms anywhere - for example, the discussion contains term1, and the 3rd comment contains term2.
The solution to this is all relatively simple:
  1. Provide a list of the configuration options with explanation. I've discovered this option: $Configuration['Garden']['Search']['Mode'] = 'matchboolean'; // matchboolean, match, boolean, like
    and worked out that:
    • match = default behaviour; natural language searching which attempts to sort by relevance and largely fails, in part due to the min-4-character word indexing, and OR operator
    • boolean = use mysql boolean searching, where you enter "+term1 +term2 -term3" to mean "term1 AND term2 AND NOT term 3"
    • matchboolean = hybrid; if there are any pluses or minuses, it uses boolean searching - this is probably the best default option (in the absence of below improvements to vanilla search), but you'll need to explain to users that the Google search eggs sausages "scrambled eggs" needs to be written +eggs +sausages +"scrambled eggs"
    • like = use SQL LIKE, low performance but OK for small forums
  2. Change the SQL query so it is effectively searching the page you see when you view the discussion - i.e. the discussion title, body, and all comment bodies. Allow each term to "hit" anywhere in the discussion and comments.
  3. Provide documentation explaining how to remove the mysql 4 character min word length and stop words. In short:
    Edit /etc/mysql/my.conf
    Under [mysqld], add something like this: ft_min_word_len = 1 ft_max_word_len = 30 ft_stopword_file = ''
    You might need to restart (sudo /etc/init.d/mysql restart)
    Log into mysql (mysql -u username -p databasename) and execute:
    REPAIR TABLE GDN_Discussion QUICK;
    REPAIR TABLE GDN_Comment QUICK;
  4. Lastly have a preprocessor in the PHP which executes the SQL. Before constructing the full text search query (using boolean mode), it should parse the search query using Google syntax and convert into Mysql boolean syntax.
    So, when the user enters eggs sausages "scrambled eggs" -mushrooms
    vanilla supplies +eggs +sausages +"scrambled eggs" -mushrooms to mysql "in boolean mode".
This should result in near-perfect Google query emulation.
RastegarDirtyDogrbrahmson

Comments

  • For part 4, the pseudo code should be:

    Iterate through all characters in the search query.
    Track whether we are "in quotes" or not.
    When not "in quotes", and a word start is encountered (the first non-space character, or any non-space character following a space):
    If it is a plus, keep
    If it is a minus, keep
    Otherwise, insert a plus.

    I think that's right. I'm going to try hacking this up on our site's PHP.
    Rastegar
  • I always offer google site search as a default, if it generates no result then the offer the full text search. Even very famous sites use google site search if they can't justify a dedicated search.

    Searching produces an overhead that needs to be factored. It is also a good idea to promote tagging, you can enhance the tag search with multiple tags.

    It is best using a dedicated search engine like Lucene or Sphinx if you are really serious about having search functionality.

    grep is your friend.

    Yoazt
  • 422422 Developer MVP
    Sphinx is cool, but last time we used it, you could not initiate search updates via cron only via admin...
  • TimTim Operations Vanilla Staff
    We're using Sphinx internally and although its a ***** to configure, it works pretty well. @Todd is the search guy, I wonder what he thinks of all this?

    Vanilla Forums COO [GitHub, Twitter, About.me]

    422
  • ToddTodd Chief Product Officer Vanilla Staff
    To me programming a custom search parsing like this is a wild goose chase that the core development team won't go down. I tried doing a custom search before the launch of Vanilla 2 and it just didn't perform well and the parsing broke in unicode. Using an OR query on a large data set is also very slow.

    The fact of the matter is MySQL is not built for search. The other fact is that no matter how good of a search engine we use for a site, nothing will beat Google. And real-time searching is a a goal that is really not achievable for full-text search. In a way, putting an internal search on a forum is just for the graphic design.

    Our default search sucks because MySQL full-text search sucks and that's what we have to support. However, search is not our goal with Vanilla. Building a great forum is.

    We may change the default search, but if we do the following will be used.

    1. A simple like search on search terms.
    2. More search options like author and date and tags.
    422
  • candymancandyman ✭✭
    edited November 2011
    It's my impression or Vanilla internal search engine look at the posts text only and not at the thread name? Just to know, thank you.

    Another question: in other forums software I've found a "restore searching database" (or similar) function to improving searching results. In Vanilla there's something similar? I've tried to searching something (in a 250Mb database) but the results are max on 2 pages. I expected more.

    p.s.: more I use Vanilla (just installed with some problems of collation) more I like it... :)
  • @Todd, I agree, the approach I'm describing may not work for multi-byte unicode, but should for unicode, and certainly does for western/ascii.

    I'm not trying to completely solve the problem of search and come up with a perfect solution. I'm only trying to do an 80:20 - to make a few simple tweaks that measurably improve the user experience for the most common search requirements, while keeping the basic paradigm (MySQL full text searching).

    (Unfortunately, we can't rely on Google CSE, since half of our site is restricted-access, only available to logged-in users in permissioned groups. For public sites, a "fig-leaf" search isn't an issue if you can swap it for CSE, but for ours, it's a really big deal, since the entire point of our forums is to be a searchable knowledge repository for users of our product).

    To sum up, I'm trying to allow users to search, for example, for eggs bacon -"scrambled eggs" as they would in Google and have that be interpreted as "Any discussion or comment containing "eggs" AND "bacon" AND NOT EXACT PHRASE "scrambled eggs", rather than the default behaviour of natural language searching which returns results that don't containing all required phrases and misses out words 3 letters or shorter.

    The limitations of the changes I'm outlining are:
    • Unlike Google, search terms must be exact. You can't search for "colour" and get hits for "color" or "colouring"
    • I don't know how this will work for unicode. It certainly won't work for multi-byte, but there are a great number of non-multibyte locales using Vanilla, I should imagine. I would expect unicode to work OK; all I'm doing is inserting a + character before every search term.
    • They don't address the issue where all your search terms must match against a single discussion or comment. So, if you are searching for [term1 term2], and your discussion contains "term1" in one comment and "term2" in another comment, you still won't get any results.
    • Performance - I don't know how the full-text indexing tweaks will perform on very large sites; on our modest 1000+ discussion site, they have no noticeable impact.
    For those who wish to try this out, here are the instructions:
    1. In conf/config.php, add the following to the end of the 'Garden' section:
      $Configuration['Garden']['Search']['Mode'] = 'google';
    2. In applications/dashboard/models/class.searchmodel.php
    3. insert a plus before every search term and switch to "boolean" search mode, if the search mode is initially 'google'. I'll paste the tested & working changes necessary in my next comment.
    4. Update the MySQL database to perform full-text indexing of every word without any constraints. This is described in step 3 of my original post, above.
    Rastegar
  • edited November 2011
    These are the necessary changes to applications/dashboard/models/class.searchmodel.php for step 2 above.
    Line numbers relate to Vanilla 2.0.17.3.

    After "Reset" function definition, around line 71:
    // Converts a Google search phrase such as [eggs bacon "scrambled eggs" -omelette]
            // into a boolean mode equivalent [+eggs +bacon +"scrambled eggs" -omelette]
            // Prefixes all terms (words or quoted phrases) with a plus, 
            // unless they already have a plus or minus prefix.
            public function ConvertGoogleToBoolean($s) {
              $inQuotes = false;
              $nextIsTermStart = true;
              for ($i = 0; $i < strlen($s); $i++) {
                $c = substr($s, $i, 1);
                if (!$inQuotes) {
                  if ($c==' ') {
                    // Term separator
                    $nextIsTermStart = true;
                  } else {
                    // Everything other than space is considered a term character
                    if ($nextIsTermStart) {
                      // This is the first character in the term, including a quote
                      if ($c!='+' && $c!='-') {
                        // Insert s +, unless there is already a + or -
                        $s = substr($s, 0, $i) . '+' . substr($s, $i);
                        $i++;
                      }
                      $nextIsTermStart = false;
                    }
                  }
                }
                if ($c == '"') $inQuotes = !$inQuotes;
              }
              return $s;
            }
    In "Search" function definition, around line 82, between the "ForceSearchMode" and "matchboolean" 'if' blocks:
    if ($SearchMode == 'google') {
            $SearchMode = 'boolean';
            $Search = $this->ConvertGoogleToBoolean($Search);
          }
    ifihaveyoujohansonlockerRastegar
  • ToddTodd Chief Product Officer Vanilla Staff
    You can wrap your version of the SearchModel in plugin by doing the following:

    1. Create a plugin.
    2. Copy your version of class.searchmodel.php into /plugins/yourplugin/class.searchmodel.php.

    The framework will use your search model instead of the core when your plugin is enabled.
    jspautsch422
  • Thanks for the tip, @Todd. I'll do so and share it, if I get time.
  • Yes, please share @stevevisokio :)

  • mcu_hqmcu_hq yippie ki-yay Arizona, USA ✭✭✭

    Todd said:
    You can wrap your version of the SearchModel in plugin by doing the following:

    1. Create a plugin.
    2. Copy your version of class.searchmodel.php into /plugins/yourplugin/class.searchmodel.php.

    The framework will use your search model instead of the core when your plugin is enabled.

    Is this true? I have been trying this with no luck....@todd

  • ToddTodd Chief Product Officer Vanilla Staff

    This is true. Make sure you clear your /cache/*.ini if you are creating files in development.

    UnderDog
  • mcu_hqmcu_hq yippie ki-yay Arizona, USA ✭✭✭

    Todd said:
    This is true. Make sure you clear your /cache/*.ini if you are creating files in development.

    Nice call. It is working as intended. Thanks Todd.

    UnderDog
Sign In or Register to comment.