A big problem we hit with the current release (1.5.0) of the Ruby memcache-client library is that if the memcache server connection dies, it leaves the mongrel permanently broken. I've written a big patch to refactor it to reconnect properly and also retry the request (once) to give it a fresh start.
Might as well just quote myself from the ticket:
I've written a big patch for memcache-client that does two things. Firstly, it reconnects properly if the connection dies, so that you won't get permanently broken mongrels when the memcache server goes down but has been restarted or otherwise fixed up.
Secondly, if the connection dies, it retries the request, but only once; it won't keep looping if things aren't working.
In doing this I've also cleaned up a bit of the codebase to provide for the refactoring that implements the retry-once mechanism. I've integrated the two apparently-equivalent patterns that were used to handle locking when multithreading is on and factored that into the mechanism too, so there's less repetition of the locking code.
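For anyone who wants the general shape of the idea without reading the patch, here is a minimal sketch of a retry-once wrapper that also owns the locking. It is not the code from the patch: `FlakyConnection`, `RetryOnceClient`, `with_lock` and `with_connection` are hypothetical names invented for illustration, and the "connection" is a stand-in that fails a set number of times rather than a real socket.

```ruby
require 'thread'

# Hypothetical stand-in for a server connection that can die mid-request.
# Not part of memcache-client; it exists only to make the sketch runnable.
class FlakyConnection
  attr_reader :connects

  def initialize(failures)
    @failures = failures # how many requests will fail before one succeeds
    @connects = 0
    @open = false
  end

  def connect
    @connects += 1
    @open = true
  end

  def close
    @open = false
  end

  def open?
    @open
  end

  def request(key)
    raise IOError, "not connected" unless @open
    if @failures > 0
      @failures -= 1
      @open = false
      raise IOError, "connection died"
    end
    "value-for-#{key}"
  end
end

# Hypothetical client: one helper handles locking, reconnecting and the
# single retry, so individual operations don't repeat any of that logic.
class RetryOnceClient
  def initialize(connection, multithread: false)
    @connection = connection
    @mutex = multithread ? Mutex.new : nil
  end

  def get(key)
    with_connection { |conn| conn.request(key) }
  end

  private

  # Take the lock only when multithreading is on.
  def with_lock
    return yield unless @mutex
    @mutex.synchronize { yield }
  end

  def with_connection
    with_lock do
      retried = false
      begin
        # Reconnect lazily, so a connection that died earlier gets a fresh start.
        @connection.connect unless @connection.open?
        yield @connection
      rescue IOError
        @connection.close
        raise if retried # second failure: give up rather than loop
        retried = true
        retry
      end
    end
  end
end
```

With this shape, a request that hits a dead connection reconnects and is tried once more. If the second attempt also fails, the error propagates to the caller, and the next request starts again with a fresh reconnect attempt.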
We've tested this new version out with a genuine breakage – specifically, using the (now fixed) Solaris libevent event port bindings to memcache, which made connections die painfully, quite regularly; with the old client, that would quickly leave us with permanently-broken mongrels (until we restarted them), but with this patched client they happily try and reconnect to the memcache server, and handle repeat errors cleanly.
If you want/need this before it's been accepted and released by the seattlerb guys, you can grab the patch or patched memcache.rb from the rubyforge ticket.
Comments
Will, thanks for this nice patch. We just discovered this issue in testing our new Starling-based infrastructure. I'm going to apply this along with the Twitter set() patch (http://dev.twitter.com/2008/02/solving-case-of-missing-updates.html) and hopefully we'll have a robust client API.
Will, one issue your patch does not seem to address is the case of multiple memcached servers where one of them goes down. We want the client to seamlessly fail over to the other server. Currently the client fails the current operation without retrying another server. Do you have any suggestions on how to handle this? I'm thinking of having with_socket_management raise an error which I can handle in a higher-level block which effectively restarts the operation (and thus would go through the server selection process again).
Glad to hear the patch is useful to you. Can you +1 the ticket? Most patches I submit to these guys' projects just get ignored :(. The SET thing is pretty crap - that library evidently wasn't very carefully put together.
Re the retry thing, I haven't looked at that. Bearing in mind that the normal memcache design is that the data is partitioned across servers based on a hash of the key, if one server fails its data wouldn't normally be available elsewhere. AFAIK having both partitioning+redundancy is outside the scope of the memcache design. Has that changed?
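To illustrate the partitioning point, here is a rough sketch of key-hash server selection. The server names, the `server_for` helper and the use of CRC32 with a simple modulo are assumptions made for the example, not a description of what any particular client does.

```ruby
require 'zlib'

# Hypothetical server list, for illustration only.
SERVERS = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

# Pick a server from a hash of the key (CRC32 modulo the server count is an
# assumption here; real clients may hash and distribute differently).
def server_for(key, servers)
  servers[Zlib.crc32(key) % servers.size]
end

# Simulate each server's storage with an in-memory hash.
stores = Hash.new { |h, server| h[server] = {} }

%w[user:1 user:2 session:abc].each do |key|
  stores[server_for(key, SERVERS)][key] = "data for #{key}"
end

# Each key lives on exactly one server. If that server drops out of the
# list, the key hashes to a different server, which never stored it.
key = "user:1"
home = server_for(key, SERVERS)
remaining = SERVERS - [home]
fallback = server_for(key, remaining)
puts "#{key} stored on #{home}; after failure maps to #{fallback}"
puts "present on fallback? #{stores[fallback].key?(key)}"
```

So re-running server selection after a failure can find a live server, but that server would only produce a cache miss for the key, not the original data.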