Introduction
It is easy to write your own server-side connector. One of the strengths of the ES is that you can write your own connectors in Perl, which run directly on the ES server. A connector only needs to download the data from the source; all data conversion is handled by the ES.
The ES crawler API
The ES connector API requires you to make a Perl package that exports at least the subroutine crawl_update(). crawl_update() is called at regular intervals to see if any new data is available. It should inspect its source data and determine whether new data has arrived. If so, it uses add_document() to add it to the search index. Data added to the ES is always referred to as a “document”, regardless of type and source.
Example: A Twitter connector using the Twitter JSON API and Perl
This example shows how to make a custom connector for the ES. We will be crawling Twitter, a public data source, so we don’t have to worry about authentication and data permissions. Twitter has an HTTP API where you can see the latest tweets for a user. This is done by crafting a special URL in the format http://twitter.com/statuses/user_timeline/{USER}.{FORMAT}
For example, CNN Breaking News has the Twitter page http://twitter.com/cnnbrk. This makes RSS and JSON feeds available from URLs of that form, such as http://twitter.com/statuses/user_timeline/cnnbrk.json
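As a minimal sketch of the URL format described above, the helper below fills in the {USER} and {FORMAT} placeholders. The name timeline_url and the restriction to the rss and json formats are assumptions made for illustration; neither is part of the ES or Twitter API.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the ES API): build the timeline URL
# for a screen name and a feed format.
sub timeline_url {
    my ($user, $format) = @_;
    $format = 'json' unless defined $format;
    # Assumption: only the two formats mentioned in the text are accepted.
    die "Unsupported format: $format\n"
        unless $format eq 'rss' || $format eq 'json';
    return "http://twitter.com/statuses/user_timeline/$user.$format";
}

print timeline_url('cnnbrk', 'json'), "\n";
```

Keeping the URL construction in one place makes it easier to change the connector later if the feed address changes.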
Getting started
Start by selecting the “Connectors” section in the ES admin. Then create a new connector by clicking on the “Create a new connector” button. The new connector will be issued a default name, so our first step is to change this to something reasonable. At the Settings and parameters tab, set the name to “MyTwitter” and click the “Update” button.
To make this connector as general as possible, we will take the Twitter screen name to index as a parameter. To do so, first go to the Settings and parameters tab and add a parameter called “screen name”.
At the Configure test collection tab, set screen name to the Twitter screen name you want to crawl, in this case “cnnbrk”.
The code
Then go to the Edit source tab, where we will write the actual source code. The ES will have filled in some example code, but we don’t need that now. So start by removing all source code in crawl_update() so you get a clean routine like this.
sub crawl_update {
    my (undef, $self, $opt) = @_;
};
The $opt variable is a hash reference containing all input options. For example, the screen name we configured above will be at $opt->{'screen name'}. You can see the content of $opt by adding the following line to crawl_update():
warn "Options received: ", Dumper($opt), "\n";
At this point it is smart to test that the framework is working as expected. Update crawl_update() so you get:
sub crawl_update {
    my (undef, $self, $opt) = @_;
    warn "Options received: ", Dumper($opt), "\n";
};
Then click the Save and Run button below the code window.
The errors about mysql and bbdn can safely be ignored. You are not using threads or a persistent bbdn connection.
Implementing
Back at the edit source window we can start to implement the Twitter connector. We will be using the CPAN modules JSON::XS, Date::Parse and LWP::Simple in this connector. So first we add references to them at the top of the source, just below the other "use" and "our" statements. We get:
use Crawler;
our @ISA = qw(Crawler);
use LWP::Simple qw(get);
use JSON::XS qw(from_json);
use Date::Parse;
Then we will modify crawl_update() to crawl Twitter. We build the URL to the JSON feed, then use get() and from_json() to download and decode it.
my $jurl = "http://twitter.com/statuses/user_timeline/" . $opt->{'screen name'} . ".json";
my $t = from_json(get($jurl));
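The two lines above do not handle a failed download: get() from LWP::Simple returns undef when the request fails, and decoding an undefined value is likely to abort the crawl. A minimal sketch of a more defensive decode step follows. It uses decode_json from the core module JSON::PP so it can run outside the ES, and the helper name decode_timeline is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: turn a raw JSON response into an array reference.
# Returns an empty array reference if the download or the decoding failed,
# so the calling loop simply has nothing to iterate over.
sub decode_timeline {
    my ($raw) = @_;
    # LWP::Simple's get() returns undef on failure.
    return [] unless defined $raw && length $raw;
    # decode_json dies on malformed input; eval turns that into undef.
    my $data = eval { decode_json($raw) };
    return [] unless ref($data) eq 'ARRAY';
    return $data;
}

my $t = decode_timeline('[{"text":"hello"}]');
print scalar(@{$t}), " item(s) decoded\n";
```

In the connector, the result of get($jurl) would be passed to such a helper instead of being decoded directly.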
Finally we loop through the JSON data, format it correctly, and submit it to the ES.
for my $usr (@{$t}) {
    my $content = $usr->{text};
    my $url = "http://twitter.com/" . "$usr->{user}{screen_name}/statuses/$usr->{id}";
    next if $self->document_exists($url, 0);
    my $substr = substr($content, 0, 50);
    my $title = "$usr->{user}{name}: $substr ..";
    my $created_at = str2time($usr->{created_at});
    warn "Adding $title";
    $self->add_document((
        content => $content,
        title => $title,
        url => $url,
        type => "tapp",
        acl_allow => "Everyone",
        last_modified => $created_at,
    ));
}
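The loop above always appends " .." to the title, even when the tweet is shorter than 50 characters. As a small sketch of one possible refinement, the hypothetical helper make_title below only adds the marker when the text was actually cut. The helper name and the optional length argument are assumptions for illustration, not part of the ES API.

```perl
use strict;
use warnings;

# Hypothetical helper: build a document title from the user's name and the
# tweet text, truncating the text to $max characters (50 if not given).
sub make_title {
    my ($name, $text, $max) = @_;
    $max = 50 unless defined $max;
    # Short texts are used as they are, without a truncation marker.
    return "$name: $text" if length($text) <= $max;
    return "$name: " . substr($text, 0, $max) . " ..";
}

print make_title('CNN Breaking News', 'A short update'), "\n";
```

Inside the loop, this would replace the two lines that compute $substr and $title.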
Click Save and Run. Hopefully you will see something like this.
Finally, all we have to do is enable anonymous search of this collection. Go to the Settings and parameters tab and select accesslevel as an input field. Then at the Configure test collection tab, set accesslevel to "Anonymous". Click on the Public search page button in the top left corner and you will see the search page. Search for something.
Full code
package Perlcrawl;
use Carp;
use Data::Dumper;
use strict;
use warnings;
use Crawler;
our @ISA = qw(Crawler);
use LWP::Simple qw(get);
use JSON::XS qw(from_json);
use Date::Parse;
##
# Main loop for a crawl update.
# This is where a resource is crawled, and documents added.
sub crawl_update {
    my (undef, $self, $opt) = @_;
    warn "Options received: ", Dumper($opt), "\n";
    my $jurl = "http://twitter.com/statuses/user_timeline/" . $opt->{'screen name'} . ".json";
    my $t = from_json(get($jurl));
    for my $usr (@{$t}) {
        my $content = $usr->{text};
        my $url = "http://twitter.com/" . "$usr->{user}{screen_name}/statuses/$usr->{id}";
        next if $self->document_exists($url, 0);
        my $substr = substr($content, 0, 50);
        my $title = "$usr->{user}{name}: $substr ..";
        my $created_at = str2time($usr->{created_at});
        print "Adding $title\n";
        $self->add_document((
            content => $content,
            title => $title,
            url => $url,
            type => "tapp",
            acl_allow => "Everyone",
            last_modified => $created_at,
        ));
    }
};
sub path_access {
    my ($undef, $self, $opt) = @_;
    # During a user search, `path access' is called against the search results
    # before they are shown to the user. This is to check if the user still has
    # access to the results.
    #
    # If this is irrelevant to you, just return 1.
    # You'll want to return 0 when:
    # * The document doesn't exist anymore
    # * The user has lost privileges to read the document
    # * .. when you want the document to be filtered from a user search in general.
    return 1;
}
1;
Download the full source code at: http://www.searchdaimon.com/files/code%20examples/Simple%20Twitter%20connector.txt