Acts_As_Ferret Tutorial

by g on Feb 19, 2017

So you want to add rapid full-text searching to your Rails app, and either you haven't had much luck getting full-text search working with MySQL, or you want to start working with something a little more customizable (and very fast). This tutorial will show you exactly how to do it.

 

What exactly is Ferret?

Ferret is a high-performance text search engine library for Ruby, based on Apache Lucene (which is what all the Java big boys use). Installing Ferret is as simple as:


gem install ferret

If you take a look at the guts of Ferret, you'll notice it's a small amount of Ruby code bound to a large amount of C code. Ferret was designed with Ruby in mind, not Ruby on Rails in particular. If you're hardcore, the Ferret API documentation includes a pretty good Ferret Tutorial.

 

Then there was Acts_as_Ferret

Luckily for us rapid Rails developers, Jens Krämer wrote Acts As Ferret, which gives us a very simple interface so we can start creating complex search indexes in very little time.

Acts As Ferret can be installed as a plugin:


ruby script/plugin install svn://projects.jkraemer.net/acts_as_ferret/tags/stable/acts_as_ferret
 

Basic Usage

We're going to start out with the simplest of examples, and move onward from there.

The first thing you're going to want to do is go to the model you want to add an index on, and add at the top:

class Member < ActiveRecord::Base
  acts_as_ferret :fields => [:first_name, :last_name]
end

As you can see here, you specify the names of the fields you want to index. When you do a search, all of these fields (and only these fields) are searched, and the matching Members are returned.

How to do a Basic Search

The Acts As Ferret plugin adds additional search methods to your ActiveRecord Model, and unlike other tutorials out there, we're going to start with:

find_id_by_contents

So, if we call:


total_results, members = Member.find_id_by_contents("G")

The following will happen:
1. A folder in our rails application called /index/development/member is created, and the index files will be created there.
2. All of my Members will be queried and their first/last name put into this index. Now every time I add/update/delete a member, this index will auto-magically be updated for me! If you ever need to regenerate an index, then just remove this corresponding folder, restart your server, and it will be regenerated next time you query against the table.
3. ActsAsFerret then calls Ferret's search_each method on our index.
4. We get returned a count of the items, and the first 10 results, in this format:

members = [
         {:model => "Member", :id => "4", :score => "1.0"}, 
         {:model => "Member", :id => "21", :score => "0.93211"}, 
         {:model => "Member", :id => "27", :score => "0.32212"}
         ]

We get an array of the first 10 results (I'm only showing 3 above) and we get the ids and search scores for each of em.
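The hashes shown above carry the id and score as strings (at least as printed in this post), so a little conversion is needed before you compare or sort them. Here is a minimal sketch that uses hard-coded sample data in place of a real find_id_by_contents call; top_ids is a hypothetical helper name, not part of the plugin:

```ruby
# Sample data copied from the result format shown above.
# In a real app this array would come from Member.find_id_by_contents.
members = [
  {:model => "Member", :id => "4",  :score => "1.0"},
  {:model => "Member", :id => "21", :score => "0.93211"},
  {:model => "Member", :id => "27", :score => "0.32212"}
]

# Hypothetical helper: keep ids whose score is at least min_score,
# best match first. to_f/to_i also work if the values are already numeric.
def top_ids(results, min_score = 0.0)
  results.
    select  { |r| r[:score].to_f >= min_score }.
    sort_by { |r| -r[:score].to_f }.
    map     { |r| r[:id].to_i }
end

puts top_ids(members).inspect        # all ids, best first
puts top_ids(members, 0.5).inspect   # only the stronger matches
```

Converting once, right where the results come back, keeps the rest of your code free of string-versus-number surprises.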

However, even if there are more than 40 possible results, we're only going to get 10 results returned.

So what if I want more than 10 results?

Well, find_id_by_contents has a bunch of options you can send into it:
(I pulled these out of the search_each method from Ferret)

Options

  • offset: Default: 0. The offset of the start of the section of the result-set to return. This is used for paging through results. Let’s say you have a page size of 10. If you don’t find the result you want among the first 10 results then set +:offset+ to 10 and look at the next 10 results, then 20 and so on.
  • limit: Default: 10. This is the number of results you want returned, also called the page size. Set +:limit+ to +:all+ to return all results

These should look familiar, and feel just like your normal "find" searches, except normally your find doesn't have a limit unless you specify one.
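To make the two options concrete, here is a small sketch that mimics them on a plain Ruby array. It does not touch Ferret at all: window is a hypothetical helper, and treating :all as "no limit" simply mirrors the option description above.

```ruby
# Stand-in for a ranked result set of 25 hits.
ranked_ids = (1..25).to_a

# Hypothetical helper that applies :offset and :limit the way the
# options above describe them (defaults: offset 0, limit 10).
def window(results, options = {})
  offset = options.fetch(:offset, 0)
  limit  = options.fetch(:limit, 10)
  return results.drop(offset) if limit == :all
  results[offset, limit] || []   # [] when the offset is past the end
end

puts window(ranked_ids).inspect                   # first page
puts window(ranked_ids, :offset => 20).inspect    # short last page
puts window(ranked_ids, :limit => :all).size      # everything
```

Note that the last page can be shorter than the limit, and an offset past the end simply yields an empty page.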

Alternatively, you can also use find_id_by_contents in a code block:

results = []
total_results = Member.find_id_by_contents("G") {|result| 
        results.push result
}

At this point you may be thinking: "Well, I want to display the results of the search not just the ids, and in order to do that I'm going to have to query the Models". So you might end up doing something that looks like this:

results = []
total_results = Member.find_id_by_contents("G") {|result| 
        results.push Member.find(result[:id])
}

However, there is a better way!

This is where find_by_contents comes in.


@results = Member.find_by_contents("G")

find_by_contents does the following:

  1. Calls our friend "find_id_by_contents" first thing and gets the ids.
  2. Keeps track of all the ids of what we return, and then queries for the Model data. So, if Member 4, 21, and 27 were returned, it's going to do a query to get the actual models: select * from members where (members.id in ('4', '21', '27'))
  3. Returns an array of these results which we can treat just like an Array of ActiveRecord objects, but which is actually an ActsAsFerret::SearchResults object (giving us a few additional features shown below).
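
One detail worth spelling out: a SQL IN (...) query does not promise to return rows in the order of the ids you pass, so the records have to be put back into ranked order and paired with their scores. The sketch below shows that step in plain Ruby. It only illustrates the idea and is not the plugin's actual code; FakeMember and in_ranked_order are made-up names.

```ruby
# Stand-in for an ActiveRecord model row.
FakeMember = Struct.new(:id, :first_name)

# Ranked hits, as find_id_by_contents might return them.
hits = [
  {:id => "21", :score => 0.93},
  {:id => "4",  :score => 0.71},
  {:id => "27", :score => 0.32}
]

# Rows as the database might return them: in primary-key order.
rows = [FakeMember.new(4, "A"), FakeMember.new(21, "B"), FakeMember.new(27, "C")]

# Hypothetical helper: return [record, score] pairs in ranked order,
# skipping hits whose row no longer exists in the database.
def in_ranked_order(hits, rows)
  by_id = {}
  rows.each { |row| by_id[row.id] = row }
  pairs = hits.map { |hit| [by_id[hit[:id].to_i], hit[:score]] }
  pairs.reject { |record, _score| record.nil? }
end

in_ranked_order(hits, rows).each do |record, score|
  puts "#{record.first_name} (score #{score})"
end
```

Building the id-to-row hash once keeps the reordering linear in the number of hits, rather than searching the rows again for every hit.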

So we could do something that looks like this:

members = Member.find_by_contents("G")

# It gives us total hits!
puts "Total hits = #{members.total_hits}"   
for member in members
   puts "#{member.first_name} #{member.last_name}"

   # And the search Score! 
   puts "Search Score = #{member.ferret_score}"   
end

Be sure to notice "total_hits" and "ferret_score" above. Neither of these is a field in my database; I get them for free!

 

So how do I paginate already?

To blatantly steal code from Roman Mackovcak's blog, you could do something like this:

Your model would have this function:

def self.full_text_search(q, options = {})
   return nil if q.nil? or q==""
   default_options = {:limit => 10, :page => 1}
   options = default_options.merge options
   
   # get the offset based on what page we're on
   options[:offset] = options[:limit] * (options.delete(:page).to_i-1)  
   
   # now do the query with our options
   results = Member.find_by_contents(q, options)
   return [results.total_hits, results]
end
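
One thing to watch in the function above: a :page of 0, a negative number, or a non-numeric string produces a negative offset. Here is a hedged sketch of a guard you could put in front of it. page_window and total_pages are hypothetical helper names, not part of acts_as_ferret:

```ruby
# Hypothetical helper: turn a (possibly user-supplied) page number into
# :offset/:limit options, clamping bad input to page 1.
def page_window(page, limit = 10)
  page = page.to_i
  page = 1 if page < 1
  {:offset => (page - 1) * limit, :limit => limit}
end

# Hypothetical helper: number of pages needed for total_hits results.
def total_pages(total_hits, limit = 10)
  (total_hits + limit - 1) / limit   # integer ceiling division
end

puts page_window("3").inspect
puts page_window("abc").inspect   # "abc".to_i is 0, so this falls back to page 1
puts total_pages(41)
```

Since params[:page] comes straight from the URL, clamping it in one place is cheaper than debugging an odd offset later.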

Then in your application.rb:

def pages_for(size, options = {})
  default_options = {:per_page => 10}
  options = default_options.merge options
  pages = Paginator.new self, size, options[:per_page], (params[:page]||1)
  return pages
end

Then in your controller:

def search
  @query = params[:query]
  @total, @members = Member.full_text_search(@query, :page => (params[:page]||1))          
  @pages = pages_for(@total)
end

Then in your member view you could have the totally normal pagination helpers:

<%= link_to 'Previous page', { :page => @pages.current.previous, :query => @query} if @pages.current.previous %>
<%= pagination_links(@pages, :params => { :query=> @query }) %>
<%= link_to 'Next page', { :page => @pages.current.next, :query => @query} if @pages.current.next %>

You're 95% Golden!

Using the knowledge above is going to work fine for 95% of your work. However, ActsAsFerret has additional features which can do some cool stuff.

 

Additional Query Strings

There are a few things you can do with your strings. I'm going to go through a couple examples to illustrate.

  • "G Billack" - Will search for results which contain "G" and "Billack" in ANY order in ANY of the fields
  • "G OR Billack" - Will search for results which contain "G" or "Billack"
  • "G~" - Fuzzy searching - Will return many more results, including near matches of "G", since we're fuzzy
  • "first_name:G" - Search for results where the first name is "G", ignoring all other fields
  • "+first_name:G -last_name:Jones" - Boolean Searching. Give me all the results where G is in the first name, and Jones is NOT in the last name.

For more complex query ideas, check out the Apache Lucene Parser Syntax page.
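
Since these queries are just strings, you can assemble them from form input. Below is a rough sketch of the boolean form shown above. It assumes simple single-word terms with no special characters (nothing is escaped here), and build_query is a made-up helper:

```ruby
# Hypothetical helper: build a "+field:term -field:term" query string.
# must and must_not map field names to single-word terms.
# NOTE: terms are not escaped; this assumes plain alphanumeric input.
def build_query(must = {}, must_not = {})
  required   = must.map     { |field, term| "+#{field}:#{term}" }
  prohibited = must_not.map { |field, term| "-#{field}:#{term}" }
  (required + prohibited).join(" ")
end

puts build_query({:first_name => "G"}, {:last_name => "Jones"})
```

The resulting string can be passed to find_by_contents like any other query. If your users can type arbitrary text, check the query parser documentation for the characters that need escaping before relying on something like this.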

 

Adding Non-Model or Non-Standard Fields

Let's change our example. Let's say we have Books, and Books have many Authors. What if I want my search to cover not only book titles, but also book authors?

The obvious problem here is that I'm dealing with two tables. My authors' names are in the Author table and my book titles are in the Book table. I don't want to have to search multiple indexes, so how do I do this?

Well, you'd change your /models/book.rb to look like this:

class Book < ActiveRecord::Base
  acts_as_ferret :fields => [:title, :author_name]

  def author_name
    return "#{self.author.first_name} #{self.author.last_name}"
  end
end

That's it! Now when I search books, I search the author name as well!

You can index anything you return in a model function. You can even reformat your fields.
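
One caveat with the author_name method above: if a book has no author, self.author.first_name would raise a NoMethodError as soon as the method is called, which includes indexing time. Here is a defensive variant, sketched with plain Structs standing in for the ActiveRecord models (FakeAuthor and FakeBook are placeholders):

```ruby
# Plain Structs standing in for the ActiveRecord models.
FakeAuthor = Struct.new(:first_name, :last_name)

FakeBook = Struct.new(:title, :author) do
  # Returns "First Last", or an empty string when there is no author,
  # so there is always something safe to hand to the indexer.
  def author_name
    return "" if author.nil?
    [author.first_name, author.last_name].compact.join(" ")
  end
end

puts FakeBook.new("With author", FakeAuthor.new("G", "Billack")).author_name
puts FakeBook.new("No author", nil).author_name.inspect
```

The compact call also covers authors with only one of the two name parts filled in.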

You would do something similar if you were using acts_as_taggable and you wanted to make your tags searchable. If Book was taggable, then your model might look like this:

class Book < ActiveRecord::Base
  acts_as_taggable
  acts_as_ferret :fields => [:title, :tags_with_spaces]

  def tags_with_spaces
        return self.tag_names.join(" ")
  end
end

If you were using the acts_as_taggable plugin you might not even need the extra function, and use ":tag_list" in the ferret field list, as shown on Johnny's Thoughts. I'm not nearly as cool though, I'm using the acts_as_taggable gem.

Either way, now your tags get searched when you search the index.

 

Sorting

Everything we've done so far is getting sorted by the search score, which is what you're going to want most of the time. But what about when you want to sort by an alternative field such as book title?

The first thing you need to do is make sure the field you are trying to sort by is untokenized. Unfortunately, by making a field untokenized I'm not indexing it to be searchable anymore. This makes for a little funky coding.

So if I wanted to sort by title in the above example, but I also want to search by title, I would do this:

  acts_as_ferret :fields => {
        :title => {}, 
        :tags_with_spaces => {}, 
        :title_for_sort => {:index => :untokenized}
        }

  def title_for_sort
        return self.title
  end

Remember, if you change something in this acts_as_ferret line you'll want to regenerate your index. You can do this by deleting your /index directory and restarting your server.

I would then be able to do the following code to get returned results in title order:

s = Ferret::Search::SortField.new(:title_for_sort, :reverse => false)
@total, @members = Book.full_text_search(@query, 
                 {:page => (params[:page]||1), :sort => s})

Lastly, if you want to sort by date, you may need to convert the date field to an integer. See this Slash Dot Dash blog entry for an example.
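
The linked entry isn't reproduced here, but the general idea is to index a value whose ordering matches chronological order. Here is a minimal sketch, assuming a hypothetical published_on date attribute and a published_on_for_sort method that you would list among the indexed fields the same way as title_for_sort above:

```ruby
require 'date'

# Stand-in for a model with a date column (published_on is an assumed name).
FakeDatedBook = Struct.new(:title, :published_on) do
  # YYYYMMDD as an integer: numeric order equals chronological order.
  def published_on_for_sort
    published_on.strftime("%Y%m%d").to_i
  end
end

books = [
  FakeDatedBook.new("Newer", Date.new(2007, 2, 19)),
  FakeDatedBook.new("Older", Date.new(2006, 11, 3))
]

books.sort_by { |b| b.published_on_for_sort }.each do |b|
  puts "#{b.published_on_for_sort} #{b.title}"
end
```

The zero-padded YYYYMMDD layout is the important part: it sorts the same way whether it is compared as a number or as a string.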

 

Field Storage

Before we get into Highlighting (one of the coolest features), we need to discuss how the data in the indexes is stored.

If you take a look inside one of your search indexes right now, believe it or not, you would not see your data. By default, acts_as_ferret does not store your data in a recoverable form; it just indexes it.

"What if my data is small and I want to store it in the index?" I hear you ask.

Good question, grasshopper. If your data is small, or you only really care about one field of information, you can get a speed bonus by storing the data in the index itself.

To do this you would write the following:

acts_as_ferret :fields => {
         :title => {:store => :yes}, 
         :author_name => {:store => :yes}
         }

Then, when we run our queries, we specify the fields that we want to "lazy load" from the ferret index.


@books = Book.find_by_contents("J", :lazy => [:title, :author_name])

Now when we render our view we don't have to touch the database at all! 0 Queries! The view might look like this:

<% @books.each do |book| %>
  <li>
    "<%= book.title %>" by 
    <%= book.author_name %>
  </li>
<% end %>

Pretty darn cool. Now for the icing on the cake:

 

Highlighting

You know how in Google search results the words that you search for always appear bold in the search results? Well, now you can do that in Ferret too!

The requirement to do this, however, is that you must have your search fields stored as I showed above.

To show how to use this, I'm going to slightly modify the code from above.

<% @books.each do |book| %>
  <li>
    "<%= book.highlight("J", :field => :title, :num_excerpts => 1, :pre_tag => "<strong>", :post_tag => "</strong>") %>" by 
    <%= book.highlight("J", :field => :author_name, :num_excerpts => 1, :pre_tag => "<strong>", :post_tag => "</strong>") %>
  </li>
<% end %>
What you get might be something that looks like this:

  1. "Story of G"
  2. "J's Book" by billack
  3. "G certainly is the Man" by y Ufuny

The Highlight function has a few other fun methods, and if your field is long (a blog entry for instance), you can have it return you an array of snippets with the keywords inside. See Highlight in the API for all the options.
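
If you're curious what highlighting amounts to, here is a toy version in plain Ruby. It is not Ferret's implementation: it only wraps literal, case-insensitive matches of a single term, and it ignores HTML escaping. toy_highlight is a made-up name; the :pre_tag and :post_tag option names are borrowed from the call above.

```ruby
# Toy highlighter: wraps each case-insensitive occurrence of term in
# pre_tag/post_tag. Illustration only, not Ferret's algorithm.
def toy_highlight(text, term, options = {})
  pre  = options.fetch(:pre_tag, "<strong>")
  post = options.fetch(:post_tag, "</strong>")
  return text if term.to_s.empty?
  text.gsub(/#{Regexp.escape(term)}/i) { |match| "#{pre}#{match}#{post}" }
end

puts toy_highlight("Story of Ruby", "ruby")
puts toy_highlight("Ferret and ferrets", "ferret", :pre_tag => "[", :post_tag => "]")
```

The block form of gsub keeps the original capitalization of each match, which is what you want when the search term was typed in a different case.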

 

Using Boost

Lastly, it's worth mentioning the Boost attribute. What this allows you to do is boost the score of a given indexed field, for instance:

  acts_as_ferret :fields => {
        :title => {:boost => 2}, 
        :author => {:boost => 0}
        }

This will modify the score slightly when you do a search, so that the results from a title match are scored a little higher than results from an author match.

However, this does NOT mean that all title results will appear above author results. If an author result is a direct match, it still may be ranked above a title result.

Perhaps this feature should be called "Nudge" instead of "Boost". I thought I could use a large boost to get all the title results to appear above the author results. I was mistaken: one can only "Nudge" the scores, but never separate them, as I was hoping.
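
A toy calculation shows why a boost nudges rather than separates. Suppose, purely for illustration, that a document's score is the sum of each field's raw match score multiplied by that field's boost. This is a deliberate simplification and not Ferret's real scoring formula, and the numbers are invented.

```ruby
# Toy model: final score = sum of (raw field score * field boost).
# NOT Ferret's real formula; the boosts and scores below are invented.
BOOSTS = {:title => 2.0, :author => 1.0}

def toy_score(field_scores)
  field_scores.inject(0.0) do |sum, (field, raw)|
    sum + raw * BOOSTS.fetch(field, 1.0)
  end
end

weak_title_match   = toy_score(:title => 0.3)    # 0.3 * 2.0
exact_author_match = toy_score(:author => 0.9)   # 0.9 * 1.0

puts weak_title_match
puts exact_author_match
puts exact_author_match > weak_title_match   # the author hit still ranks first
```

Because the boost multiplies a score that already varies from document to document, a strong match in a low-boost field can still outrank a weak match in a high-boost one.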

 

Production Usage

As more people use Acts_As_Ferret in production environments, the consensus is that you need to run it as a DRb server. Follow that last link to find out how.

Conclusion

Ferret is a very powerful search tool, as you can see here. Please let me know if you see any errors in my code above, and feel free to drop me a line if you need any assistance.


Comments


Duncan Beevers - February 19, 2017 @ 02:28 PM

Any information on how a single Ferret index deals with being accessed (read / write) by multiple instances of a Rails app simultaneously? If you're looking to scale, I recommend going with Hyper Estraier. aaf has some nice functionality like built-in multi-model search (which might make a good subject for a more advanced treatment of Ferret usage), but if you expect to scale beyond a single mongrel, you might want to read more about some of the lock issues Ferret has. Thanks for the clear tutorial, and I hope this gets the larger Rails community fired up about Search.


Jens Krämer - February 19, 2017 @ 03:29 PM

First of all, thanks for the great tutorial!

Duncan, you are right, Ferret has locking problems at least in some environments when it comes to lots of concurrent writes by multiple processes. However these problems can easily be solved by using some central process that does all the index reads/writes in a controlled manner.

Once you scale your app to more than one physical machine you have to do this anyway, regardless of what search engine you’re using. In a current project using Ferret without aaf we built a backgroundrb worker to do all the searching and indexing, which works really great - since then we never had a corrupted index again.

Now the good news for acts_as_ferret users is that the current development version comes with a built in DRb server - and it’s possible to switch between remote and local index usage with a single parameter to the acts_as_ferret call.


Keeran - February 19, 2017 @ 06:04 PM

Brilliant tech, brilliant article. Thanks guys - I wish I had found out about your blog a while ago!

Kee


Brent - February 19, 2017 @ 06:04 PM

Any suggestions on how to convert user entered search terms to return the widest possible array of results? So if params[:search_string] is “G Billack” I effectively would like to do a fuzzy OR search for G and Billack. Something like “G~ OR Billack~”

Is the best manner to just do normal string operations to create that from params[:search_string]?


Patrick Hall - February 19, 2017 @ 06:10 PM

This looks cool, thanks for the writeup, looking forward to trying it.

Any idea whether this approach handles Unicode (utf-8) okay?


G Billack - February 19, 2017 @ 06:44 PM

Brent, you’re on the right track. To return the widest possible array you’d make everything OR’d and Fuzzy.

@query = params[:search].split(" ").collect{|term| term + "~"}.join(" OR ")

Patrick, Yes, Ferret does provide UTF-8 support out of the box. If you look on the front page of the acts_as_ferret wiki, there’s a blurb about it.


Paul Davis - February 19, 2017 @ 11:17 PM

This is a clear, purpose-driven tutorial. Thanks for your excellent work.

I did notice that you used the word “then” in a comparison, when the proper word is “than”. It comes up a couple of times in this article.

It’s a common mistake in online writing, but it still grates.

Thanks.

Paul


Andre Lewis - February 19, 2017 @ 11:49 PM

Nice writeup, thanks. I’m curious how the index holds up over time with the incremental additions—do you have to periodically rebuild the index to keep it from becoming fragmented?


Casey Helbling - February 20, 2017 @ 01:54 AM

Another thing to note is the way to specify an alternate place for the index files to be stored. In my production env I need them stored in …/shared/index and in development I store them in RAILS_ROOT/ferret_index—you can use the :index_dir switch on acts_as_ferret definition.

acts_as_ferret :fields=>{ :name => {:boost=>10}}, :index_dir => FERRET_INDEX_DIR

obviously you can specify the actual dir in the correct environment file.

And—to regenerate your indexes quickly you can do this

script/console >> ModelName.rebuild_index


Chris - February 20, 2017 @ 04:26 AM

Looks very interesting. Anyone tested this with globalize? I wonder if I can use this in one of my multi-language sites.


slicematt - February 20, 2017 @ 12:11 PM

Great article - thanks guys.


G Billack - February 20, 2017 @ 04:18 PM

Andre, the index gets modified every time you add/edit/remove the ActiveRecord model it’s associated with. You never have to worry about doing this yourself, it happens automatically, so your search index is always 100% accurate. No rebuilding needed.

Casey, thank you for the great additions, didn’t know about those.


Walt Stoneburner - February 20, 2017 @ 04:21 PM

Be aware that Lucene’s parsing syntax is a little sneaky, the OR operator, for instance is all in caps. So if you convert the user’s input to lowercase, make it become ‘or’, then it is no longer a boolean operator. Even more surprising, ‘or’ is considered a stop-word, so it’s not indexed and will act as a no-op. As a result your query will then consist of two words, and that default behavior is an AND. Surprise, not what the user meant.

After a bit of discussion with the very kind folks that work on Lucene, it was explained to me that Lucene isn’t really doing boolean stuff. Any boolean expressions are actually converted into expressions involving + and -. Anything with a + must be present, anything with a - must not be present, and sans either operator, it’s a “nice to have, not mandatory, but, if present, the result will score even higher.”


Ilya Grigorik - February 20, 2017 @ 08:57 PM

G, great stuff! Very thorough, and with great examples! I followed up with some pagination code I’ve been using in conjunction with paginating_find: Ferret Pagination in Rails.


pireland - February 21, 2017 @ 01:42 AM

Ditto on all the positive kudos! Using functions to gather data from other tables is SO powerful, thank you for pointing that out!

Has anyone implemented auto-complete or auto-suggest using Ferret?


Migrate - February 22, 2017 @ 12:39 PM

Thanks for the great tutorial. I will try the plugin.

I just have some issues:

  • I would like to index all columns in a specific model. How can I do that?
  • My application has roles and depending on the user’s role he/she will only see the information available for his/her role. For instance, consider the following: I have a Project Model where Project1 can be seen by role Customer1 and Project2 can be seen by role Customer2.

Do you know how can I do this with Ferret?

Thanks.


Casey Helbling - February 22, 2017 @ 04:20 PM

I haven’t tried this code but theoretically you can switch on the Role and pass the correct field to the query…

query_param = case role
              when 'Cust1' then "id:1 #{search_text}"
              when 'Cust2' then "id:2 #{search_text}"
              end

Project.find_by_contents(query_param)

I would suggest checking out the Lucene documentation on how to do it exactly


Bruno - March 01, 2017 @ 12:07 PM

Thank you for the article!


Andrej - March 01, 2017 @ 10:47 PM

Thanks for the article, very clear and concise. I used it for a project I’m currently working on. :)


Bill Siggelkow - March 03, 2017 @ 03:45 PM

Great article - I read over the article, and, in under 5 minutes start to finish, added a search screen to the depot app - very sweet!


Xavier Belanche - March 10, 2017 @ 03:03 PM

Hi folks, and thanks for this fast-and-fantastic how-to. I’ve had some problems trying to put the Highlighting approach into practice. It returns the following error:

undefined method `zero?' for [1, [{:score=>0.205208286643028, :title=>nil}]]:Array

I try to analyze via console the find_storage_by_contents and always return ‘nil’ when I evaluate indexdoc (:title is one of the fields of model Book.rb)

Thanks and hope you can help me :), Xavier

