This Perl script uses the Google API to search Google for two search terms that appear within a certain distance of each other on a page. It does this by using a seldom-discussed Google feature: within a quoted phrase, * can be used as a wildcard meaning "any word." So to search for coppola within 2 words of nepotism, in either order, you need 6 queries:
"coppola nepotism"
"coppola * nepotism"
"coppola * * nepotism"
"nepotism coppola"
"nepotism * coppola"
"nepotism * * coppola"
The GAPS script simply constructs these queries, gets the first page of Google results for each query, compiles all the results, and presents them in a specified sort order.
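The query construction described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in gaps.cgi (which is Perl); the function name build_queries and its parameters are assumptions made up for the example.

```python
def build_queries(first, second, max_distance, either_order=True):
    """Return (query, distance) pairs for every gap from 0 to max_distance.

    Hypothetical helper for illustration only; not taken from gaps.cgi.
    """
    orderings = [(first, second)]
    if either_order:
        orderings.append((second, first))

    queries = []
    for left, right in orderings:
        for distance in range(max_distance + 1):
            # One "*" wildcard per intervening word, inside a quoted phrase.
            words = [left] + ["*"] * distance + [right]
            queries.append(('"' + " ".join(words) + '"', distance))
    return queries


for query, distance in build_queries("coppola", "nepotism", 2):
    print(distance, query)
```

Called with a distance of 2 in either order, this produces the same six quoted phrases listed above. Keeping the distance alongside each query is one way to make a later sort by proximity possible.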
Because it takes so many separate queries to do each search, we've currently limited the distance between terms to 3. This is an arbitrary limit, and if you want to take the script and run it on your own server (see below), you can use a much higher limit.
Proximity searching on this very basic and clumsy level may or may not do anyone any good (especially since Google already seems to take proximity into account to some extent when ranking results for multiple terms on a page), but it is sort of neat. If you find a useful application for this tool, or if you just come up with an amusing set of results, please let us know.
Form Fields
Find (2 fields): The first and second terms you want to find in proximity to each other. Both fields must be filled in for GAPS to work. You can enter a single word or a multi-word phrase in each box. Since each term will be treated as an exact phrase, you don't need quotes or hyphens.
Proximity (within __ words): The maximum distance between the two search terms. All searches include a zero-distance query, where the terms are immediately adjacent to each other.
Order: If you choose "in that order", only pages where the first term precedes the second term will be found; if you choose "in either order", you'll also get pages where the second term precedes the first term.
Sort: The sort order in which the final list of results will be displayed:
- The title and URL sorts are straightforward.
- Sorting by ranking is a little sketchy and probably won't do quite what you want it to do. The "ranking" in question here is simply a result's position within the first page of results for the query that found it. So if a query produces only a single result, that result's ranking will be 1, even if the page has very little relevance to either of the search terms. There's really no way to rank the results from different queries relative to one another. When you sort by ranking, the results are sub-sorted by proximity.
- Sorting by proximity will display the results in ascending order according to the distance between the terms, so pages where the terms are immediately adjacent to each other will be at the top. When you sort by proximity, the results are sub-sorted by ranking.
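To make the two-level sorts concrete, here is a small Python sketch (again an illustration, not the script's Perl code). The sample results and the field names "ranking" and "proximity" are invented for the example.

```python
from operator import itemgetter

# Made-up sample results. "ranking" is the position within the first page
# of the query that found the result; "proximity" is the number of words
# between the two search terms.
results = [
    {"title": "Page C", "ranking": 1, "proximity": 2},
    {"title": "Page A", "ranking": 2, "proximity": 0},
    {"title": "Page B", "ranking": 1, "proximity": 0},
    {"title": "Page D", "ranking": 1, "proximity": 1},
]

# Sort by proximity, sub-sorted by ranking.
by_proximity = sorted(results, key=itemgetter("proximity", "ranking"))

# Sort by ranking, sub-sorted by proximity.
by_ranking = sorted(results, key=itemgetter("ranking", "proximity"))

print([r["title"] for r in by_proximity])  # ['Page B', 'Page A', 'Page D', 'Page C']
print([r["title"] for r in by_ranking])    # ['Page B', 'Page D', 'Page C', 'Page A']
```

Note how Page C and Page D both have a ranking of 1 simply because each was the top hit of its own query, which is why the ranking sort cannot compare results across queries in any meaningful way.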
Additional terms: Whatever you enter here will be included in each query. Use this to specify extra words or phrases that can appear anywhere on a page, not necessarily near the two main terms. You can also use - (minus) to exclude terms, or use other Google keywords like site:, allintext:, etc. (Using OR here will probably produce strange results.)
Total limit (show __ results): Since a search may do up to 8 queries, it's possible for the script to return up to 80 results (although that's unlikely). Use this menu if you want to restrict the total number of results that will be displayed. Note that even if you limit the total (say, to 10), the script will still compile and sort the full set of results for all queries, and then will return the top 10.
Per-page limit (up to __ from each query): By default, GAPS gathers a full page of up to 10 results from each query. If you want fewer than 10 results from each query to be included in the final listing, use this menu.
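The way the two limits interact might be sketched like this in Python. It is an assumption-laden illustration rather than the script's actual Perl: the function compile_results, its parameters, and the sample data are all hypothetical. The per-query limit trims each page before compiling, while the total limit is applied only after the full set has been sorted.

```python
def compile_results(pages, per_query_limit=10, total_limit=None, sort_key=None):
    """Combine one page of results per query into a single listing.

    pages is a list with one list of results per query (hypothetical
    structure). The per-query limit is applied first; the total limit
    is applied last, after sorting.
    """
    compiled = []
    for page in pages:
        compiled.extend(page[:per_query_limit])
    if sort_key is not None:
        compiled.sort(key=sort_key)
    return compiled if total_limit is None else compiled[:total_limit]


# Made-up data: 8 queries, each returning a full page of 10 results.
pages = [[{"query": q, "ranking": r} for r in range(1, 11)] for q in range(8)]
by_rank = lambda result: result["ranking"]

print(len(compile_results(pages, sort_key=by_rank)))                  # 80
print(len(compile_results(pages, total_limit=10, sort_key=by_rank)))  # 10
print(len(compile_results(pages, per_query_limit=3)))                 # 24
```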
Filter each query: This checkbox enables or disables Google's built-in filter, which causes only one or two results from any given site to be displayed on a given page of results. In general, turning filtering on will give you results from a greater number of different sites; turning it off will give you more results from each of the sites that are most directly relevant to your search terms.
License key: Google requires a license key to be passed for each query. These keys are assigned when a developer signs up to use the Google API, and each key currently allows 1000 queries per day. By default, these scripts use staggernation.com's key; since the scripts do multiple queries (a search for terms "within N words in either order" will do ((N+1) * 2) queries), we might hit that limit rather quickly. For this reason, if you've signed up for the Google API program and you're not approaching the 1000-query limit with your own key, we would appreciate it if you could enter it here, especially if you plan to use these scripts extensively. Staggernation.com will not store your key or do anything with it except pass it to Google for whatever searches you do with these scripts.
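As a quick check on the arithmetic, this Python sketch applies the ((N+1) * 2) formula to the 1000-query daily limit. The function names are made up for the example, and the (N+1) count for "in that order" searches is inferred from the formula rather than stated above.

```python
DAILY_QUERY_LIMIT = 1000  # per-key daily limit stated above


def queries_per_search(n, either_order=True):
    """Queries needed for a search "within n words"."""
    # (n + 1) queries per direction: one for each gap from 0 to n.
    return (n + 1) * (2 if either_order else 1)


def searches_per_day(n, either_order=True):
    """Whole searches that fit within one key's daily limit."""
    return DAILY_QUERY_LIMIT // queries_per_search(n, either_order)


print(queries_per_search(3))  # 8
print(searches_per_day(3))    # 125
```

So at the maximum distance of 3 in either order, a single key covers only 125 searches a day across all users.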
Source Code
GAPS has the following components:
- gaps.cgi: the Perl CGI script
- ga_lib.pl: a Perl code library with some routines shared by the Google API scripts
Please feel free to download, peruse, and improve as you see fit. The scripts were coded hastily, so there's probably plenty of room for improvement.
If you'd like to host a mirrored version of any of these scripts on your own site, that would be great. A standard Perl installation plus SOAP::Lite and URI::Escape (and a Google API license key) should be all you need.
Change History
8/6/02 - version 1.1 released
- Added support for diacritically marked characters
- Cleaned up variable declarations
To Do
Possible future enhancements, some imminent, some but a distant dream:
- Show ODP page summary and category if present
- Display Google "search comments" field if returned
- Allow greater distance between words if user provides own license key
- Enable additional searching beyond 10 results per query. Simplest way would probably be a "More Results" link that would just get results 11-20, etc., for each query.
Contact
Email googlescripts [AT] staggernation [DOT] com with questions, comments, bug reports, feature requests, and what have you.
