Thursday, February 27, 2014

Easy Spark development on Amazon

I have a VPC set up on Amazon for our data pipeline - from the get-go, it was a priority of mine to make everything as secure as possible. As part of this setup, I created an OpenVPN gateway to access the "inside" of the pipeline. The whole setup takes some time and is a bit intricate - I will document it in a different post.

The goal of this post is to share my "workflow" for developing and testing Spark apps. Inside the VPC is a set of 16 nodes that form the standalone Spark cluster. The same 16 nodes are also part of a Hadoop HDFS cluster: each node's ephemeral disk space (1.6TB per machine, which Amazon gives you as 4x400GB partitions) has been set up as a RAID0 array, and these RAID0 arrays make up the HDFS pool. The difference in speed compared to EBS-attached volumes is very noticeable, but I digress - as I said, this will be a topic for a different article.

I work at home on my MacBook Air, but my cluster is on Amazon. Since I use OpenVPN, I purchased a copy of Viscosity to be able to connect to the VPC. There is also Tunnelblick, which is free, but I have found it to be "flaky" and a bit unstable (personal opinion/experience - YMMV) compared to Viscosity, which has been solid, and at $9/year the price cannot be beat.

So, workflow:

1/ Fire up Viscosity, establish VPN connection to VPC

2/ Mount a directory on the Spark cluster where I will be running my application code - I use sshfs:
sshfs -o sshfs_sync -o sync_readdir sparkuser@spark-master:/home/sparkuser/spark_code spark-master/

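To do this you will need key-based ssh authentication set up for sparkuser on spark-master (your public key goes into ~/.ssh/authorized_keys on the remote side).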

3/ Now that my remote folder is mounted locally in spark-master/, I use Sublime Text to edit my files. Each save is immediately propagated to the remote machine. This may or may not be slow, depending on your setup/connectivity.

4/ ssh sparkuser@spark-master

5/ cd spark_code/whatever_directory_my_current_project_is_in

6/ Run sbt

7/ You can run sbt so it is sensitive to files changing in the project - it will trigger an automatic recompile each time a file changes. The simplest way is to type this at the sbt prompt:
> ~ compile
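
If you get tired of typing, steps 2 and 4-7 can be strung together in a small script. This is only a sketch, not something lifted from my setup: the user, host and paths are the ones used above, but my_project is a made-up project directory, and the run helper, the DRY_RUN switch, the mkdir -p for the mount point and the ssh -t flag (to get a terminal for sbt) are additions for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of steps 2 and 4-7 as one script.
# By default (DRY_RUN=1) it only prints the commands; set DRY_RUN=0 to run them.

REMOTE_USER="${REMOTE_USER:-sparkuser}"                 # from the post
REMOTE_HOST="${REMOTE_HOST:-spark-master}"              # from the post
REMOTE_DIR="${REMOTE_DIR:-/home/sparkuser/spark_code}"  # from the post
MOUNT_POINT="${MOUNT_POINT:-spark-master}"              # local mount directory
PROJECT="${PROJECT:-my_project}"                        # hypothetical project directory
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 2: mount the remote code directory locally via sshfs.
# sshfs needs the local mount point to exist, hence the mkdir -p (an addition).
mount_code() {
    run mkdir -p "$MOUNT_POINT"
    run sshfs -o sshfs_sync -o sync_readdir \
        "$REMOTE_USER@$REMOTE_HOST:$REMOTE_DIR" "$MOUNT_POINT/"
}

# Steps 4-7: log in, cd into the project, start sbt in continuous-compile mode.
# '~compile' is quoted so that no shell tries to tilde-expand it.
watch_compile() {
    run ssh -t "$REMOTE_USER@$REMOTE_HOST" \
        "cd '$REMOTE_DIR/$PROJECT' && sbt '~compile'"
}

mount_code
watch_compile
```

Run it once as-is to check what it prints, then something like DRY_RUN=0 PROJECT=whatever_directory_my_current_project_is_in sh dev.sh (dev.sh being whatever you name the file) to actually mount and start compiling. You would normally run the last step in its own terminal, since sbt keeps running and watching for changes.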

In any case, hope this helps :)
Ognen
