Wednesday, November 16, 2011

Ruby Heredocs in Array Constants

While writing a spec that uses different chunks of CSV and text data, I was wondering about the best way to define multi-line strings in array constants.

Normally, I would use a heredoc to define a single multi-line string like this:

CSV_CHUNK = <<-CSV
10, "a", "b"
20, "c", "d"
30, "e", "e"
CSV

Perfect. The unattractiveness starts when adding more chunk definitions. It usually ends up with CSV_CHUNK_0, CSV_CHUNK_1, CSV_CHUNK_2 and so on. That's a bit unfortunate. For example, it prevents using normal array iteration with each and friends.
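
Just for illustration, that numbered-constant approach ends up looking something like this:

CSV_CHUNK_0 = <<-CSV
10, "a", "b"
20, "c", "d"
30, "e", "f"
CSV
CSV_CHUNK_1 = <<-CSV
40, "a", "b"
50, "c", "d"
60, "e", "f"
CSV
# ... and so on, with no collection to call each on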

So, my question was whether there is a way to simply add chunk after chunk to an array. Sure, it's possible:

chunks = []
chunks <<<<-CSV
10, "a", "b"
20, "c", "d"
30, "e", "f"
CSV
chunks <<<<-CSV
40, "a", "b"
50, "c", "d"
60, "e", "f"
CSV

This is valid Ruby syntax. It's really just Array's << method combined with the heredoc syntax. (Yes, you can add a space in between :) )
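
Written with that space, the same construct is a little easier on the eyes:

chunks = []
chunks << <<-CSV
10, "a", "b"
20, "c", "d"
30, "e", "f"
CSV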

But since we are building the array up in a variable, this approach doesn't lend itself to a constant. To use a constant, we have to define the heredocs inline in the array literal:

CHUNKS = [
  <<-CSV ,
10, "a", "b"
20, "c", "d"
30, "e", "f"
  CSV
  <<-CSV ]
40, "a", "b"
50, "c", "d"
60, "e", "f"
   CSV

Although this looks pretty scary, it's again valid Ruby syntax. Like many other languages, Ruby allows a trailing comma before the closing square bracket. We can use this to clean up the construct and make it more readable:

CHUNKS = [
  <<-CSV ,
10, "a", "b"
20, "c", "d"
30, "e", "f"
  CSV
  <<-CSV ,
40, "a", "b"
50, "c", "d"
60, "e", "f"
   CSV
]
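
With all chunks in a single constant, each and friends work as expected again, for example:

CHUNKS.each do |chunk|
  puts chunk             # print the raw chunk
  puts chunk.lines.count # => 3 rows per chunk
end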

Thursday, October 6, 2011

Flume and Hadoop on OS X

For a kick-ass webscale big-data setup on your local Mac, you'll want to have Hadoop and Flume in place.
No, seriously: this setup is especially useful if you want to route the syslog output from your nodes to HDFS in order to process it later using Map/Reduce jobs.

It took me a while to figure out how Flume and Hadoop have to be configured so that incoming messages actually get written to HDFS. That's why I decided to write a quick tutorial to get things up and running.

I assume that you have brew in place. So, the first step is as easy as:

brew install flume

But don't think that you can just install Hadoop with brew. The current formula is pinned to 0.21, which is an unstable version that, AFAIK, doesn't play well with Flume. We need version 0.20.2. I edited my Hadoop formula locally on my Mac, but this should work, too:

brew install https://raw.github.com/mxcl/homebrew/d0efd9ee94a55e243f3b10e903526274fc21d569/Library/Formula/hadoop.rb
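
Just to be sure the right version got picked up, you can ask Hadoop itself:

hadoop version
# should report 0.20.2, not 0.21.x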

After the Hadoop installation has finished, we need to edit a bunch of files in order to configure Hadoop for a local single-node setup. Go to /usr/local/Cellar/hadoop/0.20.2/libexec/conf and change the following files:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Hadoop is now configured to use /tmp/hadoop as its HDFS directory. Next, we need to create that directory and format the namenode.

mkdir /tmp/hadoop
cd /tmp/hadoop
hadoop namenode -format

Hadoop will connect to localhost using ssh. To configure ssh so that it can connect from localhost to localhost without asking for a password, we need to add your public key to your authorized keys.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

We can test this by sshing into localhost; it should not ask for a password:

ssh localhost
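
If ssh still asks for a password, chances are you don't have a key pair yet (or Remote Login is disabled under System Preferences > Sharing). Generating one is quick:

ssh-keygen -t rsa
# accept the defaults, then re-run the authorized_keys command from above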

You should now be able to start Hadoop. Fire it up by typing:

start-all.sh
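
start-all.sh launches the HDFS and MapReduce daemons. If you want to double-check that they all came up, jps lists the running Java processes:

jps
# should show something like: NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker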

To check if HDFS is up and running, try to list the files of your brand-new distributed filesystem:

hadoop dfs -ls /

To make Flume talk to HDFS, we need to replace the hadoop-core jar in Flume's lib directory with the one that shipped with Hadoop.

cd /usr/local/Cellar/flume/0.9.3-CDH3B4/libexec/lib/
mv hadoop-core-0.20.2-CDH3B4.jar hadoop-core-0.20.2-CDH3B4.jar.unused
cp /usr/local/Cellar/hadoop/0.20.2/libexec/hadoop-0.20.2-core.jar .

Now it's time to start a Flume master node:

flume master

Go to a different terminal window and start a Flume node, too:

flume node_nowatch

At this point, we should be able to open Flume's "dashboard" in the browser at http://localhost:35871/.
In the config section we can now configure a sink that writes into HDFS. For testing purposes we can use a fake source that just reads a local file. Choose your local node from the drop-down list and enter

text("/etc/services")

as source and

collectorSink("hdfs://localhost/","testfile")

as sink.

Now, check if the file was written to HDFS:

hadoop dfs -ls /
# cat it
hadoop dfs -cat /testfilelog.00000038.20111006-160327897+0200.1317909807897847000.seq

If everything worked well, you can switch the source in Flume to:

syslogUdp(5140)

Now, Flume acts like a syslog server and writes all incoming log messages directly to HDFS.
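
A quick smoke test for the syslog source (not strictly part of the setup) is to push a hand-crafted syslog line at port 5140 over UDP with netcat and then look for a fresh file in HDFS:

echo "<13>hello from netcat" | nc -u -w1 localhost 5140
# the <13> prefix is just a syslog priority value
hadoop dfs -ls /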

Cheers, Arbo