Thursday, April 16, 2009

Erlang Style Actors Are All About Locking

I managed to offend a few Erlangers by talking about actors as being shared stateful objects a couple of articles back. Now I'm going to continue the tradition by talking about the actor locking model. And I'm not just talking about Erlang, but any of the libraries that have sprung up for other languages to support actor style concurrency. Let me emphasize again that I like both Erlang and actors. But I dislike the number of times I read from enthusiasts that you "can't have deadlocks in Erlang because it doesn't have any locks." Of course you can because it does. This article will explain.

Locks

There are all manner of locks in the world: mutexes, monitors, synchronization barriers, read/write locks, etc., etc., yada yada yada. All of them boil down to a spot in the program where, if conditions aren't good to go, the currently executing thread stops executing until some other thread (or threads) changes the condition the lock is waiting on[1]. Erlang has a locking primitive, too. It stops the execution of the current thread and waits for the right conditions to move on. The primitive is called (drum roll please)...'receive'.

"But, but, 'receive' receives messages - it's not a lock." Sure, it receives messages. True. But it also brings the currently running thread to a screeching halt until it receives a message it wants to handle. "receive foo -> doSomething()" waits until the atom "foo" is at the front of the current process's mailbox - any other message gets thrown to the back of a "this receive can't handle that message" queue for processing by some other receive. In other words "receive foo ->" is a lock that waits for the condition "foo at head of mailbox queue" before it unlocks.

Deadlocks

A deadlock is when there's a cycle in the locking dependencies of 2 or more threads. Thread 1 is locked waiting for condition 1 after which it will set condition 2, Thread 2 is waiting for condition 2 after which it will set condition 3, etc until you get Thread n which is waiting for condition n after which it will set condition 1. Everybody's waiting on somebody else to go first so nobody can go.

Based on this, a deadlock in Erlang is easy to create. Just spawn two actors that are each waiting for a message from the other:

Foo = spawn(fun() -> receive 
  {From, foo} -> 
    From ! {self(), bar}, 
    io:format("foo done~n") 
  end 
end).
Bar = spawn(fun() -> receive 
  {From, bar} -> 
    From ! {self(), foo}, 
    io:format("bar done~n") 
  end 
end).

If a running system doesn't have any other actors that will send Foo or Bar the right message then they are deadlocked until some poor sleepy-eyed admin comes along to break the deadlock with "Foo ! {Bar, foo}". Now, in this case the deadlock is not a big deal. The resources these actors consume are pretty tiny and they don't really do anything useful even when they unlock. But if these actors were meant to do something actually useful, such a deadlock could seriously impair the functionality of a system.
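
Just to trace what that rescue does (this is me walking through the code above, nothing new): the one message un-blocks Foo, which in turn un-blocks Bar.

Foo ! {Bar, foo}.   %% Foo's receive finally matches {From, foo},
                    %% so Foo forwards {self(), bar} to Bar and prints "foo done";
                    %% Bar's receive then matches {From, bar} and prints "bar done"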

This was a toy example, of course. It's easy to see the deadlock. In a "real world" large scale application it might not be so obvious where the deadlocks lurk. Actors can end up in very complicated, dynamically changing relationships. Couple that with the possibility of race conditions and you'll see that deadlock is a real concern. Erlang style actors make concurrency much easier to deal with, but they don't remove the need for careful design and testing.

That's pretty much all I really wanted to say in this article, but I can't feel my work here is done until I've committed further atrocities against Erlang.

Perverting Erlang

A couple of posts ago I had a problem with race conditions on a shared stateful actor. Here's what it looked like.

1> c(sockmachine).
{ok,sockmachine}
2> Machine = sockmachine:start().
<0.38.0>
3> sockmachine:test(Machine, 2).
all test processes launched
[2] BONUS COIN!
[1] BONUS COIN!
ok
[2] Got my socks.
[1] Blasted machine ate my money! Give it to me!  Rattle, rattle, ARRRGGHGHHRHRHGH
[2] BONUS COIN!
[1] BONUS COIN!
[2] Got my socks.
[1] Blasted machine ate my money! Give it to me!  Rattle, rattle, ARRRGGHGHHRHRHGH
[2] BONUS COIN!
[1] BONUS COIN!

etc.

While there are many good solutions, one evil solution is to use a mutex. You see, one well known aspect of concurrent programming is that once you have one locking primitive, many others are pretty easy to build. To the perversion mobile!

-module(mutex).
-export([create/0, lock/1, unlock/1]).

%% The mutex is itself just an actor with two states: unlocked, or
%% locked by a particular process.
create() -> spawn(fun() -> unlocked() end).

%% Blocks the caller (with receive, of course) until the mutex replies ok.
lock(Mutex) ->
   Mutex ! {lock, self()},
   receive ok -> ok end.

unlock(Lock) -> Lock ! {unlock, self()}.

%% Unlocked state: the first {lock, Pid} message to arrive wins.
unlocked() ->
   receive {lock, LockedBy} ->
      LockedBy ! ok,
      locked(LockedBy, 1)
   end.

%% Locked state: only messages from LockedBy match, so lock requests from
%% other processes wait in the mailbox until we return to unlocked().
locked(LockedBy, Count) ->
  receive
     {unlock, LockedBy} ->
       if Count == 1 -> unlocked();
       true -> locked(LockedBy, Count - 1)
       end;
     {lock, LockedBy} ->
       LockedBy ! ok,
       locked(LockedBy, Count + 1)
  end.

When a Mutex actor is in the unlocked state, any actor can tell it to lock. Once in the locked state it won't unlock until the actor that locked it tells it to.[2] The counting business is so that one process can lock a mutex several times; it must then unlock the mutex the same number of times before it is considered unlocked.
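
As a quick usage sketch (hypothetical calling code, not part of the sock machine tests; criticalwork is a stand-in for whatever needs protecting), the same process can nest lock calls as long as the unlocks balance out:

Mutex = mutex:create(),
mutex:lock(Mutex),       %% Count is 1; any other process would block here
mutex:lock(Mutex),       %% same process again: Count becomes 2, no blocking
criticalwork(),          %% hypothetical function doing the protected work
mutex:unlock(Mutex),     %% Count drops back to 1, mutex still held
mutex:unlock(Mutex).     %% back to unlocked; the next waiting {lock, Pid} gets in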

And now a bit of test code. I've modified the original sockmachine test code so that, before inserting coins and pushing buttons, a testloop must acquire the lock on a mutex and, once done, must release it.

-module(mutextest).
-export([test/3]).

test(Machine, Mutex, Processes) -> 
  if 
    Processes > 0 ->
       spawn(fun() -> testloop(Processes, Machine, Mutex, 10) end),
       test(Machine, Mutex, Processes - 1);
    true -> io:format("all test processes launched~n")
  end.       

testloop(Process, Machine, Mutex, Count) ->
   if 
     Count > 0 -> 
       mutex:lock(Mutex),
       testzerocoins(Process, Machine),
       mutex:unlock(Mutex),
       testloop(Process, Machine, Mutex, Count - 1);
     true -> io:format("[~w] testing completed~n", [Process])
   end.

testzerocoins(Process, Machine) ->
  case sockmachine:insertcoin(Machine) of
    {nothing} -> testonecoin(Process, Machine);
    {coin} ->
       io:format("[~w] BONUS COIN!~n", [Process]),
       testtwocoins(Process, Machine)
  end.

testonecoin(Process, Machine) ->
  case sockmachine:insertcoin(Machine) of
    {nothing} -> testtwocoins(Process, Machine);
    {coin} ->
       io:format("[~w] BONUS COIN!~n", [Process]),
       testtwocoins(Process, Machine)
  end.

testtwocoins(Process, Machine) ->
  case sockmachine:pushbutton(Machine) of
    {tubeofsocks} -> io:format("[~w] Got my socks.~n", [Process]);
    {nothing} -> io:format("[~w] Blasted machine ate my money! Give it to me!  Rattle, rattle, ARRRGGHGHHRHRHGH~n", [Process])
  end.

And here's what it looks like when I run all this:

4> c(mutex).
{ok,mutex}
5> c(mutextest).
{ok,mutextest}
6> Mutex = mutex:create().
<0.53.0>
7> mutextest:test(Machine, Mutex, 2).
all test processes launched
ok
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[1] Got my socks.
[2] Got my socks.
[2] testing completed
[1] Got my socks.
[1] testing completed

Ta da! Fixed my problem!

Kids, Don't Try This At Home

There are better ways to solve this kind of problem with actors. Seriously. Why use a custom mutex when Erlang has a lock built in as described above? Just change the sock machine to not be so stateful, i.e. make it take one message "{From, coin, coin, pushbutton}" and do all the work necessary before returning the tube of socks (there's a sketch of that below). What's particularly nasty about this manual mutex code is that it's hard to ensure that every code path unlocks the mutex. If my test code had thrown an exception at the wrong point then the mutex would have been left locked. So don't do this. Unless you're evil. Like me.
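
Here's roughly the shape of that less stateful machine. It's a sketch of my own, not the real sockmachine module, but it shows the idea: the whole purchase rides in on one message, so there's nothing for another process to interleave with.

%% Sketch only -- the real sockmachine's internals differ.
atomicmachine() ->
  receive
    {From, coin, coin, pushbutton} ->
      From ! {self(), tubeofsocks},   %% all the work happens before the reply
      atomicmachine()
  end.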

Conclusions

Erlang's a great language. Actors are a great model. But they are neither race nor deadlock free. That's because, fundamentally, they're all about shared mutable state and locking. If after all of these articles you still don't believe me or you think I'm using the words wrong then please send your comments to /dev/null.

In my next article I'll try to pervert Erlang for good rather than for evil and maybe try to explain the frequent observation that "yeah, races and deadlocks can happen with Erlang, but it seems less common than with more traditional shared memory/manual locking mechanisms."

Footnotes

  1. Erlangers use the word "process" because of the memory isolation enforced between different actors. I'm using "thread" here to avoid confusion with OS processes. But even here there's confusion because I mean logical threads which may or may not be the same as hardware threads. I've seen a few libraries call this concept "sparks."
  2. Or until some third actor spoofs an unlock message from the locking actor.

7 comments:

Mihai said...

echo "I think you're using the words wrong" > /dev/null

Ulf Wiger said...

I agree. Deadlock and race conditions can still happen. One of the simple ways to create a deadlock is to have two generic servers call each other. If you haven't selected Timeout = infinity, at least you won't have to wait forever...

In fact, some Erlang programmers have tried to avoid deadlock by changing (synchronous) calls to (asynchronous) casts. It's very important to understand that not only doesn't this eliminate deadlocks, it will make them much harder to detect!

It's probably fair to say that Erlang greatly reduces the number of deadlocks and race conditions by eliminating low-level locks, a bit like it reduces the risk of memory leaks by eliminating explicit pointers. Erlang code can still deadlock, and it can still leak memory, since these things can happen at any level of abstraction.

Anonymous said...

Badly designed communication protocols can deadlock - yes, isn't this rather obvious?

Who is arguing that this is not the case? Please share a few links. I am curious.

Tony said...

Having two gen_servers call each other cyclically was one of the first deadlocks I ran into implementing Reia, a hybrid object/actor language for the Erlang VM. Reia uses gen_server as the basis of its objects, and as it disguises gen_server calls in traditional obj.method(arg1, arg2...) style syntax it's easy to accidentally have two objects call each other cyclically.

I believe we have a solution though: someone on the mailing list proposed a rather simple mechanism for cycle detection (by generating a unique ref for each call chain).

Rather than deadlocking on cyclical calls, Reia objects will immediately crash when called cyclically. There's no need to depend on timeouts to detect the problem.

Fernando Ipar said...

I'm not an Erlang expert, at all, but while it's true that deadlock can occur with a poorly designed communications protocol, Erlang/OTP gives programmers better tools to avoid this.

Of course, the first thing is to avoid bad designs :)

But since everyone can write buggy code, a receive with a timeout will probably solve your situation.
Also, using OTP's supervision trees will help you make sure that, should anything go wrong and explode in your face, your application (or parts of it) will be properly restarted.

My 2c, like I said, I'm no expert, so far I've only played around with the language while migrating one of my projects to it. My recommendation to anyone who is also not an expert would be to stick to OTP behaviors instead of reinventing the wheel.

Greg Rogers said...

I'm no expert but it looks like if your mutex is already locked it won't send ok back to the locking process to wake it up to leave the lock function. So recursively locking your mutex is broken and guaranteed to deadlock.

James Iry said...

Greg, you're right! I managed to paste a broken version into the post. I'll fix it.