Memcached and Ruby

September 06, 2010 / category: Ruby / 3 comments

There is a long-awaited moment after launching a web application when the technical means used so far are no longer sufficient. Long story short: the number of pageviews grows each day and, at the same time, response time increases rapidly. One of the first things to consider in such a case is introducing data caching.

Alongside the filesystem and local memory, memcached is one of the most widely used storage systems adopted for caching data in web applications. I'll try to show different ways of integrating memcached with web applications in order to gain a performance boost. The described techniques can be applied regardless of the framework (Ruby on Rails / Merb / Sinatra) and model layer library (ActiveRecord / DataMapper / Sequel) being used. I assume that the reader has at least a basic understanding of what memcached is and how it can be used to store and retrieve data.



The Quick and Dirty Way

So you have fully functional, but slightly slow, data retrieval methods? Try decorating them with cache logic and see what happens:

#
# Get all blog posts. The `memcached` object is assumed to be
# a configured client that returns nil on a cache miss.
#
def get_posts
  cached = memcached.get('posts')
  return cached if cached

  posts = Post.all
  memcached.set('posts', posts)

  posts
end

Here's a somewhat more civilized example:

def get_posts
  begin
    posts = memcached.get('posts')
  rescue Memcached::NotFound
    # Cache miss: fall back to the database and warm the cache.
    posts = Post.all
    memcached.set('posts', posts)
  end

  posts
end

What's the first impression like? Not too good, I hope. To be honest, the quick and dirty approach has only one advantage: it can be implemented quickly.

Let's move on to the disadvantages. First, writing code in this manner causes cache logic to be spread across the whole application. Moreover, it tempts you to apply caching in an implementation-bound manner. What happens if cache keys (potentially repeated in a few places) change? What if one memcached client is replaced with another? Also, in the case of manual cache invalidation, there will be even more cache-related logic spread across the application code.

As you can see, this way of caching data in Ruby (or in nearly any programming language) can be implemented quickly, but it mixes application tiers and may turn introducing changes into a horror.

Caching Method Calls - Memoization

Instead of mixing cache logic into your methods, try wrapping method calls with a cache handler. This shouldn't be difficult, especially given some of Ruby's nice features.

Let's wrap the method call with a block like this:

posts = cached('posts') do
  get_posts()
end

It looks much cleaner than the previous example, but when do the caching and the retrieval from cache take place? Let's try to implement the cached() method now:

def cached(cache_key)
  # Serve from the cache when possible (the client is again assumed
  # to return nil on a miss).
  if result = memcached.get(cache_key)
    return result
  else
    # Cache miss: run the block and store its result for next time.
    result = yield
    memcached.set(cache_key, result)

    return result
  end
end

Laziness is a virtue, remember? The block of code associated with the cached() method call is run only when necessary, i.e. when the cache data for the specific key is empty. This means the underlying code is called at most once per cache lifetime (provided there was no cache eviction). Of course, this simple cached() method implementation can be further extended to handle method parameters (see the sketch below) and to support various types of storage, including the filesystem and local memory. Wrapping expensive code is easy in Ruby, as well as in any other programming language supporting anonymous functions.
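
For instance, here's a minimal sketch of a parameter-aware version. The nil-on-miss memcached client, the TTL argument to set() and the Post.all query syntax are assumptions for illustration, not prescriptions:

# Arguments become part of the cache key, so calls with different
# arguments are cached under separate keys.
def cached(cache_key, args = [], ttl = 600)
  full_key = ([cache_key] + args).join(':')

  result = memcached.get(full_key)
  return result if result

  result = yield(*args)
  memcached.set(full_key, result, ttl)

  result
end

# Posts by author 42 end up under the key 'posts:42'.
posts = cached('posts', [42]) do |author_id|
  Post.all(:author_id => author_id)
end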

By the way, storing the results of function calls for future reference is often referred to as memoization. It is similar in spirit to dynamic programming, which makes it possible to solve some computationally complex problems quickly.
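
A classic (if contrived) illustration: the naive recursive Fibonacci takes exponential time, while a memoized version computes each value only once:

def fib(n, memo = {})
  return n if n < 2
  memo[n] ||= fib(n - 1, memo) + fib(n - 2, memo)
end

fib(40) # => 102334155, computed instantly instead of via millions of redundant calls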

There are some Ruby gems available which perform memoization, yet they only store results in local memory. This is exactly what we should avoid in a distributed web environment: a situation where each machine holds its own, possibly stale copy of the data, and the same call is made on many machines over and over again, is clearly unacceptable. However, these gems can be handy for optimizing Ruby programs running on a single machine.
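
The idiom these gems automate boils down to something like this (the result lives in an instance variable, i.e. in the memory of a single process only):

class Blog
  def posts
    @posts ||= Post.all # computed on the first call, reused afterwards
  end
end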

Memoization wrapped with blocks is at least a little cleaner than the "quick and dirty" approach, and it's not strictly bound to implementation details. However, if you wish to invalidate cached data on demand rather than rely on TTL alone, keep in mind that this technique suffers from spreading invalidation logic to the same extent as the previous one.

Cache Handled Entirely by a Library

Sometimes we have a clever library at our disposal: one clever enough to wrap all data layer-related calls (including all CRUD operations) and claim full responsibility for storing results in cache and invalidating them when necessary.

This level of encapsulation can only be achieved when all calls go through the library:

post = Post.create(:title => 'My New Ruby Post', :content => 'Some useless things...', :date => Date.today)

When we want to modify or delete a post, the library should know about this, too:

post.update(:title => 'Much Better Title')

post.delete()

When writing code this way, we tell the library about everything that's going on with the object. The library then knows when to store data in cache (create()), when to modify it (update()) and, last but not least, when to delete it (delete()).

In this case our code is absolutely free from cache-related logic, though we must write the program with the invisible cache layer in mind.

Having a "clever library" has many advantages, though it is far from being perfect. First, we have to get this kind of library somehow and from my experience, libraries like this are usually home-baked solutions, made either by a single programmer for his own purpose or by a specific company to be used within the company only. That is to say, choice of such libraries is limited.

I think you'll agree this is not a big limitation; it should not be difficult to write a library like this on top of ActiveRecord or DataMapper, as sketched below. But there's another problem: sometimes application logic is too complicated to be fully understood by the library, which is, after all, supposed to be general-purpose. When cache invalidation doesn't directly result from deleting an object, there has to be another way to mark cached data as stale. And so we come back to a problem that has already emerged before: manual cache invalidation.
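
To make this less abstract, here's a minimal sketch of such a library built on ActiveRecord callbacks. Everything here - the CachedModel name, the CACHE client (assumed to be configured elsewhere and to return nil on a miss), the key scheme - is a made-up illustration, not an existing gem:

module CachedModel
  def self.included(base)
    base.extend(ClassMethods)
    # Keep the cache in sync with create(), update() and destroy().
    base.after_save    :expire_cache
    base.after_destroy :expire_cache
  end

  module ClassMethods
    # Read-through cache for primary key lookups.
    def cached_find(id)
      key = "#{name}:#{id}"
      record = CACHE.get(key)
      return record if record

      record = find(id)
      CACHE.set(key, record)
      record
    end
  end

  def expire_cache
    CACHE.delete("#{self.class.name}:#{id}")
  end
end

class Post < ActiveRecord::Base
  include CachedModel
end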

To sum up, having a smart cache-handling library is convenient, because it lets you easily forget about memcached or any other place where cached data is stored. The cache will be completely transparent as long as you don't have to invalidate it manually.

View Cache

View cache is like memoization from a bird's-eye view. You don't care how many methods need to be called to render a specific part of the application's output, or how complicated they are. You just cache it all and don't go into details.

View caching also seems to be the most general-purpose method, as the language or framework used is not essential. Of course, a convenient framework like Merb or Ruby on Rails will let you implement view caching in a simple and elegant way, but this is not a must.

As usual, there is at least one drawback. The main one is that view caching leads to caching the same data multiple times, which likely means an increased number of cache evictions. Also, some computations are performed repeatedly even though their results are already sitting in memcached - just formatted in a different way and therefore unusable. The same thing happens when you provide an API for your application: it usually returns data formatted as XML or JSON, not HTML.

But view caching has one huge benefit (apart from easy implementation): it can yield a significant performance boost, especially when you use page caching. Sound interesting?

Page Cache

The following examples are (loosely) based on Ruby on Rails, but their equivalents can easily be found in other web frameworks, such as Merb and Sinatra. Take a look at this simple controller:

class BlogController < ActionController::Base

  def about_me
    @info = Me.get_info()
  end

  def posts
    # ...
  end

end

It's a simple blog controller, able to display posts and an "about me" page. Since we know that the "about me" page rarely changes, we can page-cache it:

class BlogController < ActionController::Base

  caches_page :about_me

  def about_me
    @info = Me.get_info()
  end

  def posts
    # ...
  end

end

The code of the about_me action didn't change at all; however, its result is now page-cached. What does this mean in practice? When first run, the action will generate HTML output which Rails will store in a file in the public directory. You can probably guess what happens next: the web server will treat /about_me as a static page and serve it directly from disk, without even launching the Rails framework. I hope I don't have to convince you that this is significantly faster than firing up any Ruby-based web framework.

By the way, page caching doesn't have anything to do with memcached, but it's just too important to leave out.

Action Cache

Action caching is similar to page caching, though the HTML output is not saved directly to disk. Instead, it is kept in a cache store and served only when the before filters allow it:

class SecretController < ActionController::Base

  before_filter :check_access
  caches_action :classified

  def classified
    @data = XFiles.secret_data
  end

end

When a user wants to display the "classified" page, the check_access before filter is always run. When it says everything is OK, Rails looks into the cache store (e.g. memcached) for the output of the classified action and sends it back. It's a little more flexible than page caching, but it's also slower. Nothing to add.

Fragment Cache

Fragment caching is also referred to as partial caching. It's useful when not the whole response, but only a fragment of it (e.g. a list of active users displayed in a sidebar block), can be cached. Take a look at this piece of Embedded Ruby (ERB) code:

<h1>Dashboard</h1>

<p>Your name: <strong>Lukasz Wrobel</strong></p>

<% cache('active_users') do %>
  <ul>
    <% User.active.each do |u| %>
      <li><%= u.name %></li>
    <% end %>
  </ul>
<% end %>

Memcached will do just fine for storing fragment caches. As you can see, fragment caching simply boils down to wrapping method calls in blocks in order to cache their results, which is what we've already done before.
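
If the view iterated over an @users variable instead of querying the model directly, the controller could even skip the query when the fragment is already cached. A sketch using Rails' fragment_exist? helper (the action and variable names are illustrative; the key must match the one passed to cache() in the view):

def dashboard
  # Run the query only when the fragment will actually be rendered.
  @users = User.active unless fragment_exist?('active_users')
end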

All of the view caching methods discussed so far share one disadvantage: they require you to invalidate the cache manually, either by writing Rake tasks that delete the generated files (in the case of page caching) or by writing so-called cache sweepers, sketched below. When storing data in memcached, though, you can set a Time To Live (TTL) and simply wait for the cache to expire. Of course, you can only do that if you can accept some inconsistencies.
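
For completeness, here's roughly what a cache sweeper might look like in Rails 2.x. The controller, action names and fragment key are carried over from the earlier examples; treat this as a sketch rather than a drop-in solution:

class PostSweeper < ActionController::Caching::Sweeper
  observe Post # watch the Post model for changes

  def after_save(post)
    expire_cache_for(post)
  end

  def after_destroy(post)
    expire_cache_for(post)
  end

  private

  def expire_cache_for(post)
    # Drop stale page and fragment caches; they will be
    # regenerated on the next request.
    expire_page(:controller => 'blog', :action => 'posts')
    expire_fragment('active_users')
  end
end

class BlogController < ActionController::Base
  cache_sweeper :post_sweeper, :only => [:create, :update, :destroy]
end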

Conclusion

As you can see, there are many techniques available to developers who want to speed up their web applications written in Ruby. In most cases one can do without memcached and simply store data in the filesystem or in local memory, but only memcached combines speed with the ability to work well in a distributed environment.

There are even more powerful solutions available, including reverse proxy caches such as Squid and Varnish, which can serve thousands of requests per second on a single machine. There are also similar, Ruby-based solutions communicating via the Rack interface. However, all of these operate quite differently from the techniques described in this article, and they deserve a detailed analysis of their own, including an introduction to some lesser-known properties of the HTTP protocol.

To sum up: whether you're using Rails, Merb, Sinatra or any other framework, sooner or later you will encounter efficiency problems. I hope this article is a good place to start, and in conclusion I can only wish you performance problems with your web applications - it usually means you're on the right track.

Comments

There are 3 comments

qbolec
November 03, 2010 02:43 PM

At the beginning we used to cache HTML output of fragments, boxes, columns, even whole pages. But we learned it is a nightmare to maintain if you roll out builds several times a week, and some servers still serve stale cached versions of HTML incompatible with cached versions of js, cached versions of css, cached versions of graphics, or simply cached versions of other parts of the site.

It took a lot of effort to drop this idea and replace all of its occurrences with caching the data (not the view).

From the time perspective I think it might have been solved in different ways; here are some of them:

1. Using "generational" keys - keys that have a version tacked onto their name. Each release could increment the version, rendering all old keys invalid.
2. Using a separate network of memcached servers just for HTML, and flushing them after each release.
3. Manually changing the names of the keys of HTML parts affected by the new release (we actually tried to do this for a year, but we failed).

One more problem with HTML cache is that it hides other bottlenecks in middleware and backend (actually it is also its main advantage!), which can hurt you when you invalidate the cache. This is true for all three solutions above, so you have to perform updates at 6 A.M...

Lukasz Wrobel
November 03, 2010 07:31 PM

First of all, I should introduce qbolec (as he used to call himself) to the broader audience. He is a software architect at nk.pl (formerly nasza-klasa.pl), the biggest Polish social website and the biggest Polish website overall. We've been working together for a few months, and I can tell you that if you're looking for someone who knows how to build high-traffic web applications, he's the guy. Not to mention he was a member of a successful TopCoder competition team.

But enough of this buttering up :-). I'm afraid that - when it comes to a reasonable amount of traffic - qbolec is right about problems related to view cache. Warming that kind of cache up may be complicated and might lead to unnecessary calculations. Which cache keys should be filled up in the first place?

To be honest, the only type of cache I've worked with that seemed almost completely transparent was the cache handled entirely by a library. Of course, it puts some limitations on how the data can be accessed: in this case, you should forget about complex JOINs or other advanced SQL features. Wrapping all such cases in a library may be too complicated or even impossible.

Another thing I'm thinking of is a data store with reads and writes fast enough that the difference between it and memcached becomes insignificant. Maybe key-value stores are the bright future of web applications? Time will tell.

Wes Noor
September 27, 2012 12:37 PM

You have done a good job explaining how Memcached can be used to enhance performance of web applications.

Here are some articles that explain how NCache can do the same in a better and easier way.

Scalable WCF Applications Using Distributed Caching

Scale ASP.NET Apps Through Distributed Caching

Scaling Java and JSP Apps with Distributed Caching
