New Carmen Sandiego (GeoIP) Plugin
Hi Everyone,
I've built a plugin that provides some GeoIP functionality. I called it Carmen Sandiego... of course
The plugin provides 2 pieces of functionality. It can display the author's flag for discussions and comments as well as log geoip information when a user logs in (UserMeta table).
Here's where the work got interesting. Since lots of hosting servers do not include the PHP geoip library as well as not enabling "LOAD DATA" in MySQL, I've had to make my own functions as well as manually parse the CSV into the MySQL database during the plugins import routine.
I've also added a caching layer to make sure this stuff stays fast. I strongly recommend enabling Memcached if you want to run this thing properly. I also have proper logic in my caching, so each IP is cached and available in other discussions.
Please let me know what you think,
Cheers
DDD
Comments
Why not using a ready made library, such as the MaxMind one, for Geolocation? I wrote over two dozens of solutions that use geolocation, all based on that database, and they work perfectly.
My shop | About Me
I've used them before and they are great... But those libraries are not always included in shared hosting. I wanted something more standalone.
@dericdd: Can you line out what can be done with the plugin exactly? Does it provide a map functionality to create flags for users on a world map, or might there be an option that allows to find "users in your area"?
I guess the plugin provides the data, that can than be mapped on whatever a forum admin likes... but doesn't have the functionality already built in.
The plan is still ongoing, lol . I am certainly open to suggestions
For now it can put a country flag near the author's name in discussions and comments. It can also log geoip info of a user when they login. (simply check which features to enable in Settings panel).
I built this plugin for the experience of building a proper Vanilla plugin. For now the plugin has very little experience out in the wild. I am hoping that feedback will guide the next steps of it's development.
Those libraries are just PHP files, which you can include with your plugin using Composer. None of my plugins assume that they are pre-installed on the server, they are included with the distribution.
My shop | About Me
@businessdad (love the name btw). Not sure i follow what you mean...
I actually started making this plugin using PHP's geoip lib. It occurred to me that not all servers will have that lib included. (http://php.net/manual/en/book.geoip.php). I then decided to write something that would import MaxMind's GeoIP-Lite City CSV file, parse it and load it into MySQL tables. Another reason is that having the data in MySQL makes it easier to operate on for future dev (not to mention faster).
I had also originally started using LOAD DATA function in MySQL to load the CSV into the database and then also realized that this is supported even less often on servers. This is why I parse through the file manually... Which works pretty fast. My dev VM takes takes less than a minute for the complete import (1 core, 1 gig of RAM).
@businessdad Have you tried installing this plugin? I am hoping to get feedback.
What I mean is that you don't have to rely on the PHP GeoIP extension, there is a ready made library to query the MaxMind database: https://github.com/maxmind/geoip-api-php
The library doesn't require anything installed on the server.
My shop | About Me
Ah i see... Yeah i am trying to move away from DAT file. I wanted the data in our own MySQL DB.
I'm not sure that moving to MySQL would be a good idea. The RDBMS is the bottleneck of every website, while the DAT file could be queried independently for each request. Personally, I find that the DAT file scales better.
My shop | About Me
Perhaps this Plugin is not for you then
Personally, we tested it and the DAT file is slow and bloats the process. And if you're directly hitting the database and ignoring the cache then I don't know what to tell you.
Vanilla Forums COO [GitHub, Twitter, About.me]
Again, the (wrong) assumption that any site has a cache layer installed?
Paraphrasing: if you assume that all self-hosted Vanilla installations have a cache layer, then I don't know what to tell you.
My shop | About Me
I don't have a Vanilla Forums site to run (correction: not anymore), I wouldn't need it anyway.
My shop | About Me
One implementation (DAT) is slow and will always be slow. It balloons the size of the executing process whenever you query it, and supports only single IP lookups. At 5-10ms per lookup (experimentally obtained), that's 200-400ms per 40 comment discussion.
The other implementation (MySQL) is slow, but can be sped up by querying in parallel. That means you're suffering a 20-40ms penalty whether you ask for 1 or 100 IPs, and you leave the PHP process lean and fast (which is one of the biggest contributors to poor concurrency at scale).
Additionally, having the data in MySQL allows for more extensive queries that surpass those offered by the extension's API and Maxmind's PHP library.
Vanilla Forums COO [GitHub, Twitter, About.me]
Based on my experience, single IP lookups should be enough. If you query the DAT when the comment is posted, then store the result against the post, that should be all you need. As I mentioned, I implemented geolocation on dozens of sites, and there is a maximum of one lookup per visitor (location data is cached after the lookup, so that visitor's location is known from that moment on). That's 10ms once-off, on a single page load (other pages loaded by the same visitor are not affected) which seems quite good to me.
I don't see why someone would have to query hundreds of IP addresses (unless it's to populate a catalogue), but, in that case, a RBAR approach would indeed be slow.
Yet, it still depends on if those queries are needed. I've been dealing with geolocation on a daily basis for for three years straight now, and I still have to find a scenario where I need to query multiple IP addresses in one shot. If I had such need, then I would probably dump the data into a database as well.
My shop | About Me
@dericdd For the record, and to be clear, I'm not implying that the solution you implemented is wrong. I was just wondering why you decided to rewrite the whole geolocation logic, when there are ready-made libraries available. My personal approach is to use what's available as much as possible, and write my own solution only when there aren't existing ones, or when the available ones don't cover my requirements (and, as I mentioned, in all geolocation scenarios I worked with, the PHP libraries worked perfectly fine).
To be clear, if I had to write a geolocation solution for Vanilla, I would have used the MaxMind PHP library straight away, until the need for something more complex arised.
My shop | About Me
When businessdad the SQL guy doesn't use SQL you better sit up and listen, and if you are no-SQL guy you feel slightly smug.
Just teasing Of course use the right tool for the job.
grep is your friend.
Precisely. And, for the record, I have nothing against NoSQL, I simply haven't yet come across a scenario where it would actually bring any significant benefit over a RDBMS.
My shop | About Me
Well the discussion thread is a clear example of needing multiple queries at a time. If we have 20-30 comments and we query them one at a time it takes forever. So at the beginning of the discussion thread we tie into "discussionController_beforeDiscussionDisplay_Handler" hook to pre-query all the IPs from the discussion and comments. Then we use the "base_authorInfo_Handler" to display flag using pre-loaded geoip info.
You made a good point that we should perhaps record the geoIP data while posting discussion/comment. I will probably start doing this as well in future versions to lower that query count... However I wanted it to work for old data as well.
I did not re-invent the wheel or rewrite the logic. I am using CSV data/structure from MaxMind. Does not take a rocket scientist to figure out how to query it if you understand how IPs and subnets work.
I agree that one should not build something if it is already made... it just wasn't made the way I needed it.