Wednesday, February 26, 2014

Spark's "weird" context sharing rules - how I killed a week trying!

"I spent about a week on this."

That's my opening statement intended to fully depict my level of frustration. When you are facing deadlines (as many of us do in the real world), it is not easy to spend a week or two, just like that, with not much in terms of results. But, I digress, politely.

So, what's the problem? Well, imagine you spent some time and wrote a bunch of "queries" to crunch your "big data" in Spark. It holds a great promise - it is a much faster reimplementation of Hadoop, in memory. It can read, no, devour files in HDFS in parallel. It can do streaming, almost-in-real-time queries. It can do 0MQ, it can do parallel access to files on Amazon's S3, it can talk Cassandra, Mongo. It is written in Scala, it can do Akka actors talking to each other, it features Resilient Distributed Datasets that are reproducible (hence the resilient part).

Only problem is, the thing is undocumented. Well, almost documented. Well, that depends on your point of view and how much time you have to hack on it. The documentation is not bad, it is just lacking in the very fine points, the ones you inevitably hit when you start "abusing" the platform ;)

OK, back to my "situation". Let's say you spent some time, you learned Scala, you learned something about what/how Map/Reduce is, now you wrote a bunch of Scala programs to execute Spark queries. Each query is a jar file, that's the nature of the beast, you run these in batches, via a cron job or by hand when needed. Nothing wrong with that. Except....

Except, one day, your head data analyst comes to you and says that it would be nice to have all this stuff exposed via the web in a slick, "exploratory" kind of way. You say, sure, great! I get to learn Scalatra, Bootstrap and more!

You are all gung-ho, walking on clouds. Except....

Except that it turns out you cannot allocate multiple Spark Contexts (main way to drive Spark calculations and actions) from the same app. That's a bit of a problem (and that's a bit of an understatement ;) since a web server by default serves requests in parallel from multiple threads. You cannot pool SparkContext object, cannot optimize, heck, all your dreams have crashed and burned and you risk looking like an idiot in front of your whole team!

(Digression: Scalatra is absolutely awesome, by the way! It is a very clean, nice way to write RESTful APIs, you can serve HTML, Javascript, deluxe. But, it turns out there is a kink here too - I ran into major problems making a Scalatra servlet work with Spark - for some reason allocating a SparkContext from a Scalatra servlet was a no-go).

This drove me to Unfiltered (after I spent 3-4 days and $40 on a Scalatra book and fell in love with the framework, damn it!). Unfiltered is simple, functional, beautiful in a Haskellish kind of way, does not depend on any version of Akka so you can use whichever one matches the version Spark comes with.

OK. Back to sharing SparkContext instances. It turns out the friendly experts at Ooyala thought about this problem a few months ago and wrote a patch to the latest version of Spark (0.9). This jobserver github branch is meant to expose a RESTful API to submit jars to your Spark cluster. Sounds great (!) and it is, it actually works. Except....

Except that it comes with Hadoop 1.0.4 by default. My luck has it that my Hadoop HDFS cluster runs 2.2.0. You can compile Ooyala's jobserver branch of Spark-0.9-incubating with support for Hadoop 2.2.0 the standard way. It assembles. Except that it does not work. You cannot start the standalone cluster, it complains with the following exceptions:

14/02/25 20:15:24 ERROR ActorSystemImpl: RemoteClientError@akka://sparkMaster@ Error[java.lang.UnsupportedOperationException:This is supposed to be overridden by subclasses.
at akka.remote.RemoteProtocol$AddressProtocol.getSerializedSize(
[snipped for brevity]

So, what to do? A week or more later into this nightmare, I ran into a few online discussions claiming that since 0.8.1 (Dec 2013) Spark supports SparkContexts that are "thread safe". To be more precise, this is exactly the promise:

"Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users)."

In theory, it should work. Stay tuned for the conclusions from the practical experience.

No comments: