
Making Sphinx respect permissions

I installed Sphinx today. It's really fantastic. The only concern I have is that it returns search results from closed areas of my forum, even to users who are not logged in. Obviously they can't get to the actual discussion, but I would like to eliminate these sub-forums from results when users do not have permission to view the content.

Any ideas?


Comments

  • mcu_hq yippie ki-yay ✭✭✭
    edited December 2012

    That may be a little bit difficult, but certainly doable. Before it searches, you can check which user is performing the request (whether they are logged in or not) and then add another filter on the CategoryID (catid). You can do that around here.

    You will want to use SetFilter, probably with exclude = true. Exclude the CategoryIDs that you don't want displayed, and any threads/posts under those categories will not be returned.
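
    Roughly, something like this before the query runs (just a sketch; $SphinxClient and the category IDs are placeholders, not the plugin's actual names):

        // Sketch: exclude closed categories for guests before querying Sphinx.
        // $SphinxClient stands in for the plugin's SphinxClient instance, and
        // the category IDs are made-up examples of closed sub-forums.
        if (!Gdn::Session()->IsValid()) {
            $ClosedCategoryIDs = array(12, 15, 22);
            // Third argument TRUE means "exclude these values of the catid attribute"
            $SphinxClient->SetFilter('catid', $ClosedCategoryIDs, TRUE);
        }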

  • hbf wiki guy? MVP

    Thanks @mcu_hq, I'll poke around a bit. This is probably functionality that should be added back into the main branch of your plugin if I get it working. Alternatively, if you wanted to tackle it and make a few bucks, I'll give you $30 for the enhancement.

  • mcu_hq yippie ki-yay ✭✭✭

    I would prefer that you just clone the master trunk as it is now on GitHub (the one on the plugin site is outdated after a few fixes) and then see if you can add it in.

  • hbf wiki guy? MVP

    OK, I made the changes. It took a couple of hours to figure out how your plugin works; it's brilliant, by the way, very well written.

    The place to make changes is twofold. To get the best performance, you want to add the category permission ID as sql_attr_bigint = catpermid in the indexing routines. You pick up this info from the Category table, which you are already joining, so it's a breeze.
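
    Roughly, the relevant bits of the generated sphinx.conf end up looking like this (just a sketch; the plugin's real sql_query selects more columns, and I'm assuming the default GDN_ table prefix):

        source vanilla_discussions
        {
            # ...connection settings omitted...

            # Category is already joined, so catpermid is just one extra column
            sql_query = \
                SELECT d.DiscussionID AS id, d.Name, d.Body, \
                       d.CategoryID AS catid, c.PermissionCategoryID AS catpermid \
                FROM GDN_Discussion d \
                JOIN GDN_Category c ON c.CategoryID = d.CategoryID

            sql_attr_uint   = catid
            sql_attr_bigint = catpermid   # bigint because PermissionCategoryID defaults to -1
        }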

    To use this newfound information, go into class.sphinxsearch.plugin.php.

    In runsearch you need to add one line (I left the context around the line; the new line is the if (Gdn::Session()...) check):

        if (isset($Info['attrs']))
        {
            //echo "  catid " . $Info['attrs']['catid'] . " PermID " . $Info['attrs']['catpermid'];
            if (Gdn::Session()->CheckPermission('Vanilla.Discussions.View', TRUE, 'Category', $Info['attrs']['catpermid']))
                $ResultDocs[$Id] = Gdn_Format::ArrayAsObject($Info['attrs']); // get the result documents
        }
    
  • peregrine MVP
    edited December 2012

    I haven't tried your mod, but I believe you :) @hbf - kudos to you both.

    I may not provide the completed solution you might desire, but I do try to provide honest suggestions to help you solve your issue.

  • hbf wiki guy? MVP

    The changes are currently up on my site (of course you don't know what you're not seeing, 'cause you're not supposed to see it).

  • mcu_hq yippie ki-yay ✭✭✭

    Awesome @hbf!

    Using that permission ID is a good idea... I didn't even know it existed or what it was for. I am going to add your snippet in the next release, which will be very soon. There is still one problem with extremely large databases (>4GB) that I can't seem to figure out yet, because my laptop hard drive is only 4 gigs big.

  • hbf wiki guy? MVP

    The category permission ID is a mapping between category definitions and the permission group definitions. That way categories don't have to replicate the default discussion and comment permission set each time a new one is created. By default the category permission ID is -1 (which is why I had to use bigint instead of uint).

    What issue are you having with large databases?

  • mcu_hq yippie ki-yay ✭✭✭
    edited December 2012

    Oh yeah, and then there is another table which mirrors the permission ID and its attributes, such as guest viewing allowed, members only, etc.

    If you wanted to get around the signed int stuff, you could tell Sphinx to add +1 to each permission ID before indexing and then just remember this when comparing permission IDs. This is sort of what I did to each discussionID in order to have the whole data set return a unique number for each post/thread start. Vanilla has two tables, Discussion and Comment. Since a topic starter ID will belong in the Discussion table and NOT the Comment table, I had to take the maximum discussionID as the offset into the Comment table. I then added an attribute which tells me whether a result came from the Comment or the Discussion table. It is pretty confusing, but you can look at your sphinx.conf for more info. There are a few tricks I put in there to get it all to work without needing more than one index.
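
    In sphinx.conf terms the trick looks roughly like this (a sketch with made-up column lists and attribute name; two sources feed the single index, and the real config computes the offset itself):

        # Source 1: discussions keep their own IDs and are flagged as thread starts
        sql_query = \
            SELECT DiscussionID AS id, 1 AS isdiscussion, Name, Body, CategoryID AS catid \
            FROM GDN_Discussion

        # Source 2: comments are offset by the max DiscussionID so document IDs never collide
        sql_query = \
            SELECT CommentID + (SELECT MAX(DiscussionID) FROM GDN_Discussion) AS id, \
                   0 AS isdiscussion, Body, DiscussionID \
            FROM GDN_Comment

        sql_attr_uint = isdiscussion   # tells the plugin whether a hit is a thread start or a comment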

    One person on here is seeing large response times with a >5GB database, but not with a database of <3GB. I suspected it was the SQL query, NOT the Sphinx side, but we shall see. I can't replicate it until I get a large, crappy setup.

  • hbf wiki guy? MVP

    Yes, you're correct on the table's purpose.

    As for the +1, also correct, but I would recommend sticking with bigint, as it reflects the true value and requires no transformation to make the data usable. I find that unless transformations are necessary or provide a substantive performance gain, they add complexity to the code and make it harder to read and maintain. Of course proper commenting can overcome this, but it's just more work.

    My database isn't quite that large, so everything is good for me. If you need a dev sandbox, I may be able to hook you up in January for about a month.

  • mcu_hq yippie ki-yay ✭✭✭
    edited January 2013

    OK, I uploaded a new version that corrects a lot of things I have been meaning to put on the addon site.

    I've looked at your code snippet and noticed that it is actually out of place. I added 3 lines to the code BEFORE it queries Sphinx. This is much better than filtering AFTER the results are returned, since now the pagination is correct. If you filter after Sphinx returns the docs, the exact # returned may be incorrect and you can expect some of your pages to be missing expected documents.

    Here is where I put the edits.
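
    The gist of those lines is something like this (a sketch, not the literal plugin code; $SphinxClient stands in for the plugin's client instance):

        // Collect the permission category IDs the current user may view...
        $Allowed = array();
        $Categories = Gdn::SQL()->Select('PermissionCategoryID')->From('Category')->Get()->ResultArray();
        foreach ($Categories as $Category) {
            if (Gdn::Session()->CheckPermission('Vanilla.Discussions.View', TRUE, 'Category', $Category['PermissionCategoryID']))
                $Allowed[] = (int)$Category['PermissionCategoryID'];
        }
        // ...and filter on the catpermid attribute BEFORE the query, so the match count
        // (and therefore pagination) only ever reflects documents the user can see.
        if (!empty($Allowed))
            $SphinxClient->SetFilter('catpermid', array_unique($Allowed));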

    I noticed on your site that some pages for common search queries return only a few results while some others are fine. Can you upgrade and see if this fixes your problems? Also, how often do you index?

    If you need a dev sandbox, I may be able to hook you up in January for about a month.

    That would be awesome if you could make that happen.

  • hbf wiki guy? MVP

    @mcu_hq said:
    OK, I uploaded a new version that corrects a lot of things I have been meaning to put on the addon site.

    I've looked at your code snippet and noticed that it is actually out of place. I added 3 lines to the code BEFORE it queries Sphinx. This is much better than filtering AFTER the results are returned, since now the pagination is correct. If you filter after Sphinx returns the docs, the exact # returned may be incorrect and you can expect some of your pages to be missing expected documents.

    Here is where I put the edits.

    I noticed on your site that some pages for common search queries return only a few results while some others are fine. Can you upgrade and see if this fixes your problems? Also, how often do you index?

    That would be awesome if you could make that happen.

    @mcu_hq

    I updated the code with your suggested filtering; works great, although I'm not sure where you are setting the constant, so I just hard-coded the filter column name for now.

    As for upgrading, I assume you mean pull the latest from GitHub and overwrite? If that's not what you mean, then let me know. I'll probably do the update tomorrow.

    As for indexing, it's a bit of a problem for me. I see the cron log showing successful re-indexing occurring according to your default schedule, but the settings page does not reflect the indexing timestamp from the cron runs, so I assume it's not working as intended. I'd like to get that resolved, but I haven't had enough time to look under the hood to figure out what is happening. When I press the button on the settings page, it notifies me that the action was successful and the index timestamp gets updated. So I don't know the difference between pushing the button and having the cron run the PHP script.

    I'll get a sandbox set up for you for the large-DB stuff. I need to think about how I want to configure it, since you are going to be loading more data than all of my other sites combined right now. I'm probably going to put the DB on a separate, lower-performance test server, so it doesn't have any impact on my prod environments. I should have something for you next week.

  • mcu_hq yippie ki-yay ✭✭✭

    although I'm not sure where you are setting the constant,

    The constants are in their own separate file here

    The latest version includes all of the commits from Nov 13, whether you want to include those or not. Commit Log.

    I see the cron log showing successful re-indexing occurring according to your default schedule, but the settings page does not reflect the indexing timestamp from the cron runs, so I assume it's not working as intended.

    Actually what you are seeing is correct operation. I mentioned something like this in the FAQ a ways down and I still haven't found an elegant solution. When the cron job indexes, the PHP script is not updated. What I could do is have the cron job write to a file and then the plugin parse that when you land on the settings page. The only time the dashboard is updated with the correct time is when you manually index.
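
    For example, something like this (purely a sketch; the file name, and the assumption that the cron script bootstraps Vanilla so PATH_CACHE exists, are mine):

        // In the script the cron job runs, right after a successful indexer pass:
        file_put_contents(PATH_CACHE.'/sphinx_last_index.txt', time());

        // On the settings page, fall back to that file for the "last indexed" stamp:
        $File = PATH_CACHE.'/sphinx_last_index.txt';
        $LastIndexed = file_exists($File) ? Gdn_Format::Date((int)file_get_contents($File)) : 'never';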

    I should have something for you next week.

    OK, cool... the server needs to have at least 15 GB of space, I'd say. The DB is around 4GB and each MySQL index file could reach a gig or so.

  • hbf wiki guy? MVP

    Thanks, I'll go ahead and pull everything and update.

    Glad to hear the indexing is working as expected. Saves me some time poking it with a stick.

    Space shouldn't be a problem; I'm more or less thinking out loud. The web server will be on the prod box, since that is the only one that is externally facing. The DB will be on an older dev box, low CPU/memory performance but oodles of storage space. It'll be slow but should have space for hundreds of gigs of data.

  • cataldoc

    Hi, I am using the latest version from GitHub, but I think the permission fix is still missing in the "related threads" widgets. When I do a search on my website the permissions are honored, but in the related threads widget I still see topics I am not supposed to see :)

  • mcu_hq yippie ki-yay ✭✭✭

    @cataldoc You are right, it is missing in the widgets... I'll be sure to add that.

  • mcu_hq yippie ki-yay ✭✭✭

    Let me know if this commit fixes it. You just need to add 3 lines to 3 files:

    https://github.com/mcuhq/SphinxSearchPlugin/commit/26fc5de39e8eca11efd43eb697972cb7e025037f

  • cataldoc

    Looks OK now, awesome! :)

  • hbf wiki guy? MVP

    I just pulled the latest main branch off GitHub... I'm not getting search results any more. I'm having a hard time figuring out what's going on. My guess is that the catpermid is somehow not getting interpreted correctly. I'm digging in but just wanted to give you a heads up.

  • mcu_hq yippie ki-yay ✭✭✭

    @hbf

    How did you perform the upgrade? Did you disable the plugin and then re-enable? I'd suggest just going through the install wizard again to get all of the settings back aligned and working.
