
Searches only phrases starting with Roman characters

This discussion is related to the SphinxSearch addon.
edited August 2012 in Vanilla 2.0 - 2.3

Hi,

I installed Vanilla and SphinxSearch (on OS X 10.8, with PHP 5.3) and they both work great. For the most part.

The problem is that SphinxSearch appears to erroneously dismiss searches that don't start with Roman characters.

I run a Persian forum, and naturally almost everything is in Persian. I've added Arabic and Persian character mappings (http://sphinxsearch.com/wiki/doku.php?id=charset_tables#arabic) to Sphinx's charset_table (in assets/sphinx.conf.tpl) and all Persian conversations have been indexed (I can verify it by running python api/test.py اپل and seeing the results).
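A quick way to sanity-check the charset_table step is to confirm that every character of a search term falls inside one of the configured ranges. The sketch below is only an illustration: the `CHARSET_TABLE` string is an assumed excerpt (not the full table from the wiki page), the helper names are made up, and the parser handles only plain `a..z`, `U+xxx..U+yyy`, single-character and `->` mapping entries, not the `/2` forms.

```python
# Assumed, abbreviated excerpt in charset_table syntax; copy the real
# Arabic/Persian table from the Sphinx wiki instead of relying on this one.
CHARSET_TABLE = "0..9, A..Z->a..z, _, a..z, U+621..U+63a, U+640..U+64a, U+671..U+6d3"


def _codepoint(token):
    """Turn 'U+63a' or a single literal character such as 'a' into an int."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) == 1:
        return ord(token)
    raise ValueError("unsupported charset_table token: %r" % token)


def parse_ranges(table):
    """Return (low, high) code point pairs for the source side of each entry."""
    ranges = []
    for entry in table.split(","):
        source = entry.split("->")[0]  # keep the source side of a mapping
        low, _, high = source.partition("..")
        lo = _codepoint(low)
        hi = _codepoint(high) if high else lo
        ranges.append((lo, hi))
    return ranges


def uncovered(word, table=CHARSET_TABLE):
    """List the characters of `word` that no charset_table entry covers."""
    ranges = parse_ranges(table)
    return [ch for ch in word if not any(lo <= ord(ch) <= hi for lo, hi in ranges)]


# The Persian word for "Apple" from this thread, written with escapes.
print(uncovered("\u0627\u067e\u0644"))
```

An empty list means every character would be treated as part of a keyword under the assumed table; any character it returns would be treated as a separator by the indexer.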

But, when I search for a Persian phrase, or anything that doesn't start with [a-zA-Z0-9_+], the SphinxSearch plugin ignores the search.

Here's the /var/query.log when I search for "mac", "اپل" (it's "Apple" in Persian), "mac اپل" and "اپل mac". Notice that for the second and fourth queries (which don't start with Roman characters), it only updates the "vss_stats" index and doesn't do the actual searching. My guess (and I'll be shocked if it's not true!) is that somewhere in your giant PHP codebase (7884 lines of code!) you try to validate the searched phrase, but do it incorrectly and reject some searches.

I looked for it in your code, and the offending validator must be in class.searchmodel.php, but I couldn't find anything.

[Wed Aug 22 22:59:23.745 2012] 0.002 sec [ext2/0/rel 2354 (0,30)] [vanilla] [MainSearch]   @(title,body) mac
[Wed Aug 22 22:59:23.745 2012] 0.000 sec [ext2/0/rel 131 (0,20)] [vanilla] [RelatedMainThreads] @(title) mac
[Wed Aug 22 22:59:23.746 2012] 0.000 sec [ext2/1/rel 1 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) mac


[Wed Aug 22 22:59:28.269 2012] 0.000 sec [ext2/1/rel 0 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) اپل


[Wed Aug 22 22:59:39.341 2012] 0.003 sec [ext2/0/rel 6550 (0,30)] [vanilla] [MainSearch]   @(title,body) اپل | mac
[Wed Aug 22 22:59:39.342 2012] 0.000 sec [ext2/0/rel 212 (0,20)] [vanilla] [RelatedMainThreads] @(title) اپل | mac
[Wed Aug 22 22:59:39.343 2012] 0.000 sec [ext2/1/rel 0 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) اپل | mac


[Wed Aug 22 23:24:44.116 2012] 0.000 sec [ext2/1/rel 0 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) mac | اپل
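
To see at a glance which indexes each search actually reached, log lines in the format above can be parsed with a small script. This is a rough sketch that assumes every line follows the bracketed layout shown here (timestamp, time, match info, index, comment, query); `parse_line` and `indexes_hit` are hypothetical helper names, not part of Sphinx or the plugin.

```python
import re

# One bracketed field per group, in the order they appear in the log above.
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+"
    r"(?P<secs>\d+\.\d+) sec\s+"
    r"\[(?P<mode>[^\]]+)\]\s+"
    r"\[(?P<index>[^\]]+)\]\s+"
    r"\[(?P<comment>[^\]]+)\]\s+"
    r"(?P<query>.*)$"
)


def parse_line(line):
    """Return the fields of one query.log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def indexes_hit(lines):
    """Collect the set of index names that appear in a batch of log lines."""
    return {rec["index"] for rec in map(parse_line, lines) if rec}


mac_lines = [
    "[Wed Aug 22 22:59:23.745 2012] 0.002 sec [ext2/0/rel 2354 (0,30)] [vanilla] [MainSearch]   @(title,body) mac",
    "[Wed Aug 22 22:59:23.745 2012] 0.000 sec [ext2/0/rel 131 (0,20)] [vanilla] [RelatedMainThreads] @(title) mac",
    "[Wed Aug 22 22:59:23.746 2012] 0.000 sec [ext2/1/rel 1 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) mac",
]
persian_lines = [
    "[Wed Aug 22 22:59:28.269 2012] 0.000 sec [ext2/1/rel 0 (1,20) @keywords_crc] [vss_stats] [Related Searches] @(keywords) \u0627\u067e\u0644",
]

print(sorted(indexes_hit(mac_lines)))
print(sorted(indexes_hit(persian_lines)))
```

A batch whose set lacks `vanilla` is one where the main search never ran, which is the symptom described above.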

I'll be posting the exact same thing on the GitHub Issues page.

Tagged:
mcu_hq

Comments

  • mcu_hq yippie ki-yay Arizona, USA ✭✭✭
    edited August 2012

    I must say, this is probably the best bug report I have ever gotten in terms of insight and feedback! I really like that you posted it on GitHub, because now everyone can track the progress and see what was fixed and what wasn't... so thank you!

    One of the things I did omit from the sphinx config file was the char mapping. I'll put it in a new option in the next release.

    class.searchmodel.php is actually the generic Vanilla core file, which I modified slightly. I had to modify it to add my own hook, which launches Sphinx correctly and prevents the default MySQL query from running.

    I do some validating of my own here: https://github.com/mcuhq/SphinxSearchPlugin/blob/master/widgets/class.widgets.php#L307

    I'll have a look at this later today, but in the meantime see if I filter out your search there.

  • Thanks for the response (and I forgot: thanks for the amazing plugin!).

    No, that didn't help. My Persian $words pass through is_string just fine.

  • mcu_hq yippie ki-yay Arizona, USA ✭✭✭
    edited August 2012

    Yea, I looked at this a little last night and narrowed it down to the charset table.

    The problem, I think, comes down to escaping your input for Sphinx with the EscapeString function in the API.

    If you uncomment this line in the main plugin file and perform a search with Arabic/Persian characters, you should see an array printed with an error for some of the queries: "unexpected $end near". I suppose this is because Persian and Arabic characters are unknown characters. Since you put in Arabic and Persian charset mappings, I thought that this would fix it, but I'll try to re-create it on my side of things.

  • I changed line #159 to echo '<pre>'; var_dump($Results); die; so it would print prettier.

    I attached three gists. What happens is exactly what you predicted: https://gist.github.com/3443785 and https://gist.github.com/3443800 and https://gist.github.com/3443792

    (in the first gist, the Persian word was entered first)

  • mcu_hq yippie ki-yay Arizona, USA ✭✭✭
    edited August 2012

    You added the Arabic and Persian charset table in your sphinx.conf file, right? You can view it on the cpanel from the plugin. Your first post leads me to believe that you did. What that var dump does not show is the actual query. The ones that fail are something like @(title,body) اپل .

    Try running python api/test.py @(title,body) اپل .

    Not sure why the last one works properly... will investigate.

  • mcu_hq yippie ki-yay Arizona, USA ✭✭✭
    edited August 2012

    Not sure why this seems to work, but can you try adding a space after the main query as shown here and test it on your main search page?

    https://github.com/mcuhq/SphinxSearchPlugin/commit/8405f76b9d9abceab3839fbbc5268da741f345bb

    I was able to get my sphinx to query successfully after I terminated each query with a space. I'm going to seek out an answer as to why this is.
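
The trailing-space workaround described above can be sketched as follows. This is only an illustration in Python (the plugin itself is PHP); `build_extended_query` is a hypothetical helper, and the ` | ` joiner is an assumption taken from how multi-word queries appear in the query log earlier in this thread.

```python
def build_extended_query(fields, words, operator=" | "):
    """Build an extended-syntax query such as '@(title,body) w1 | w2 '.

    The string deliberately ends with a single space, mirroring the
    workaround in this thread. It is a workaround, not an explanation
    of why the unterminated query fails.
    """
    field_spec = "@(%s)" % ",".join(fields)
    body = operator.join(w.strip() for w in words if w.strip())
    return "%s %s " % (field_spec, body)


# repr() makes the trailing space visible.
print(repr(build_extended_query(["title", "body"], ["mac"])))
print(repr(build_extended_query(["title"], ["\u0627\u067e\u0644", "mac"])))
```

The resulting string would then be passed to whatever client call issues the search; that call is left out here because its exact form in the plugin is not shown in this thread.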

  • 1:

    $ python api/test.py @(title,body) اپل
    Query '@(title,body) اپل ' retrieved 0 of 0 matches in 0.039 sec
    Query stats:
      'titl' found 7366 times in 5304 documents
      'bodi' found 19 times in 17 documents
      'اپل' found 8656 times in 4694 documents
      'title' found 0 times in 0 documents
      'body' found 0 times in 0 documents
      '?' found 0 times in 0 documents
      '?' found 0 times in 0 documents
    
    Matches:
    

    2: The . ' ' certainly did the trick! All queries now return the relevant results... It still doesn't search in RelatedMainThreads index (so, there's no 'Related Threads' in the results page, or when you're starting a new discussion), but I'm not sure if your hack was supposed to fix all problems.

    :)

  • mcu_hq yippie ki-yay Arizona, USA ✭✭✭

    No, it was just a test case to see if it would work.

    I filed a bug report on sphinx here. The latest commit should solve this issue found here on github.

    pooriaazimi
  • Thanks a lot :)
