Thursday, October 6, 2011

Flume and Hadoop on OS X

For a kick-ass webscale big-data setup on your local Mac, you'll want to have Hadoop and Flume in place.
No, seriously: this setup is especially useful if you want to route the syslog output from your nodes to HDFS in order to process it later with Map/Reduce jobs.

It took me a while to figure out how Flume and Hadoop have to be configured so that incoming messages get written to HDFS. That is why I decided to write a quick tutorial to get things up and running.

I assume that you have brew in place. So, the first step is as easy as:

brew install flume

But don't think that you can just install Hadoop with brew the same way. The Hadoop formula in brew is currently pinned to 0.21, which is an unstable version that, AFAIK, doesn't play well with Flume. We need version 0.20.2. I edited my Hadoop formula locally on my Mac, but this should work, too:

brew install https://raw.github.com/mxcl/homebrew/d0efd9ee94a55e243f3b10e903526274fc21d569/Library/Formula/hadoop.rb

Once Hadoop is installed, we need to edit a bunch of files to configure it for a local single-node setup. Go to /usr/local/Cellar/hadoop/0.20.2/libexec/conf and change the following files:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
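
Several commenters below ran into "Error: JAVA_HOME is not set." when starting Hadoop. One more file in the same conf directory, hadoop-env.sh, is where that gets fixed. The line below is a sketch of the variant suggested in the comments; it assumes the OS X helper /usr/libexec/java_home exists on your system:

```shell
# hadoop-env.sh: ask OS X for the active JDK instead of hard-coding a path
# (assumption: /usr/libexec/java_home is available on this machine)
export JAVA_HOME="$(/usr/libexec/java_home)"
```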

Hadoop is now configured to use /tmp/hadoop as its storage folder for HDFS. Next, we need to create the directory and format the namenode.

mkdir -p /tmp/hadoop
cd /tmp/hadoop
hadoop namenode -format

Hadoop will connect to localhost using ssh. So that ssh can connect from localhost to localhost without asking for a password, we need to add your public key to your authorized keys.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

We can test this by trying to ssh into localhost without using a password:

ssh localhost
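
If ssh still asks for a password, two things are worth checking: a key pair has to exist (ssh-keygen -t rsa creates one), and on OS X Remote Login has to be switched on under System Preferences > Sharing. Running the cat line above twice also leaves a duplicate entry. Here is a minimal sketch of a more careful variant; authorize_key is a made-up helper name, and it assumes the public key is called id_rsa.pub as above:

```shell
# Append the public key to authorized_keys only if it is not listed yet,
# then tighten permissions, since sshd may ignore files that are too open.
authorize_key() {
  ssh_dir=$1
  pub_key="$ssh_dir/id_rsa.pub"
  auth_file="$ssh_dir/authorized_keys"

  [ -f "$pub_key" ] || { echo "no public key at $pub_key" >&2; return 1; }
  touch "$auth_file"

  # -x matches whole lines, -F treats the key as a literal string
  grep -qxF -f "$pub_key" "$auth_file" || cat "$pub_key" >> "$auth_file"

  chmod 700 "$ssh_dir"
  chmod 600 "$auth_file"
}

# Usage:
#   authorize_key "$HOME/.ssh"
```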

You should now be able to start Hadoop. Fire it up by typing:

start-all.sh

To check if HDFS is up and running, try to list the files of your brand-new distributed filesystem:

hadoop dfs -ls /
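
If the listing fails, it helps to see which daemons actually came up. On a single-node 0.20 setup, start-all.sh should launch five Java processes: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. Below is a small sketch that reports the ones missing from jps output; missing_daemons is a hypothetical helper name, and it assumes the JDK's jps tool is on your PATH:

```shell
# Print the name of every expected Hadoop 0.20 daemon that does not
# appear in `jps` output supplied on stdin.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # jps prints "<pid> <ClassName>"; compare the second column exactly,
    # so that NameNode does not match SecondaryNameNode
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Usage:
#   jps | missing_daemons
```

An empty result means all five are running. Anything it prints is a good reason to look into the matching log file in Hadoop's logs directory.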

To make Flume talk to HDFS, we need to replace the hadoop-core jar in Flume's lib directory with the one that ships with our Hadoop installation.

cd /usr/local/Cellar/flume/0.9.3-CDH3B4/libexec/lib/
mv hadoop-core-0.20.2-CDH3B4.jar hadoop-core-0.20.2-CDH3B4.jar.unused
cp /usr/local/Cellar/hadoop/0.20.2/libexec/hadoop-0.20.2-core.jar .
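
The version numbers in these paths will differ if brew gave you another Flume build. The same swap can be sketched as a small function that takes the paths as arguments and refuses to touch anything when the Hadoop jar is not where you expect it; swap_hadoop_jar is a made-up name, not something Flume or Hadoop ships:

```shell
# Swap Flume's bundled hadoop-core jar for the one from the local Hadoop install.
#   $1 = Flume lib directory
#   $2 = file name of the bundled jar inside that directory
#   $3 = full path to the jar from the Hadoop install
swap_hadoop_jar() {
  flume_lib=$1
  bundled_jar=$2
  hadoop_jar=$3

  [ -f "$hadoop_jar" ] || { echo "missing $hadoop_jar" >&2; return 1; }

  # Keep the bundled jar as a backup under a name that no longer ends in .jar
  if [ -f "$flume_lib/$bundled_jar" ]; then
    mv "$flume_lib/$bundled_jar" "$flume_lib/$bundled_jar.unused"
  fi

  cp "$hadoop_jar" "$flume_lib/"
}

# Usage, with the paths from this post:
#   swap_hadoop_jar /usr/local/Cellar/flume/0.9.3-CDH3B4/libexec/lib \
#     hadoop-core-0.20.2-CDH3B4.jar \
#     /usr/local/Cellar/hadoop/0.20.2/libexec/hadoop-0.20.2-core.jar
```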

Now it's time to start a Flume master node:

flume master

Go to a different terminal window and start a Flume node, too:

flume node_nowatch

At this point, we should be able to start the "dashboard" of Flume in the browser. Open http://localhost:35871/.
In the config section we can now configure a sink that writes into HDFS. For testing purposes we can use a fake source that just reads a local file. Choose your local node from the drop-down list and enter

text("/etc/services")

as source and

collectorSink("hdfs://localhost/","testfile")

as sink.

Now, check if the file was written to HDFS:

hadoop dfs -ls /

Then cat it (the file name will differ on your machine):

hadoop dfs -cat /testfilelog.00000038.20111006-160327897+0200.1317909807897847000.seq

If everything worked, you can switch the source in Flume to:

syslogUdp(5140)

Now Flume acts like a syslog server and writes all log messages directly to HDFS.
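
To see the whole chain working before pointing real machines at it, we can push a hand-made message at the port. The following is a sketch, not Flume tooling: syslog_line and send_test_message are made-up helper names, nc (netcat) is assumed to be installed, and the line carries only the priority prefix and a tag, without the timestamp and hostname of a full syslog message. Whether your Flume version accepts that shortened form is an assumption worth verifying. The priority is facility * 8 + severity, so user.notice is 1 * 8 + 5 = 13.

```shell
# Build a minimal syslog line: "<PRI>tag: message".
#   $1 = facility number, $2 = severity number, $3 = tag, rest = message
syslog_line() {
  facility=$1
  severity=$2
  tag=$3
  shift 3
  pri=$((facility * 8 + severity))
  printf '<%d>%s: %s\n' "$pri" "$tag" "$*"
}

# Send one user.notice test message to the Flume source over UDP.
# nc: -u selects UDP, -w 1 gives up after one second.
send_test_message() {
  syslog_line 1 5 flume-test "$@" | nc -u -w 1 localhost 5140
}

# Usage:
#   send_test_message "hello from syslog"
```

After sending a message, a new file should show up in the output of hadoop dfs -ls / once Flume has rolled its output file.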

Cheers, Arbo

6 comments:

  1. Hi Arbo,

    Great post - I've had some issues previously with setting up Hadoop with macports and decided to give brew a try.

    Just a small note to those who followed your tutorial and saw the error: Error: JAVA_HOME is not set.

    I've set the JAVA_HOME as:
    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
    and added this to the file:
    /usr/local/Cellar/hadoop/0.20.2/libexec/conf/hadoop-env.sh

    Thanks,
    Radek

  2. I forgot this step. Thanks for pointing this out!!

  3. IMHO, instead of giving the exact path when setting JAVA_HOME, it could be safer to try:

    shell> export JAVA_HOME=`/usr/libexec/java_home`

  4. Hi arbovm,

    I installed hadoop and flume via homebrew. Flume is working (for some sinks) and I can write to console and local file system using text. Hadoop was also installed via homebrew.

    But if I use collectorSink with either a file:/// or an hdfs:// target, I get an error. The error is the same for both types of targets.


    2012-11-29 20:22:47,182 [logicalNode new-host-2.home-21] INFO debug.InsistentAppendDecorator: append attempt 6 failed, backoff (60000ms): failure to login
    2012-11-29 20:23:47,181 [pool-8-thread-1] INFO hdfs.EscapedCustomDfsSink: Opening file:///Users/abhi/flume/testfile20121129-202317290-0800.1354249397290310000.00000037
    2012-11-29 20:23:47,183 [logicalNode new-host-2.home-21] INFO debug.StubbornAppendSink: append failed on event 'new-host-2.home [INFO Thu Nov 29 20:21:44 PST 2012] #' with error: failure to login
    2012-11-29 20:23:47,183 [logicalNode new-host-2.home-21] INFO rolling.RollSink: closing RollSink 'escapedCustomDfs("file:///Users/abhi/flume","testfile%{rolltag}" )'
    2012-11-29 20:23:47,183 [logicalNode new-host-2.home-21] INFO rolling.RollSink: opening RollSink 'escapedCustomDfs("file:///Users/abhi/flume","testfile%{rolltag}" )'
    2012-11-29 20:23:47,184 [logicalNode new-host-2.home-21] INFO debug.InsistentOpenDecorator: Opened MaskDecorator on try 0
    2012-11-29 20:23:47,185 [pool-9-thread-1] INFO hdfs.EscapedCustomDfsSink: Opening file:///Users/abhi/flume/testfile20121129-202347184-0800.1354249427184178000.00000021
    2012-11-29 20:23:47,192 [logicalNode new-host-2.home-21] INFO debug.InsistentAppendDecorator: append attempt 7 failed, backoff (60000ms): failure to login

    I am not sure what is going on. The permissions on the directory are 777. Do you have any insight into what may be wrong?

  5. Nice blog i like this post i am looking for such information long time & finally i got it from this post on hadoop,Thanks for sharing this great information.
    Hadoop Training in Hyderabad

  6. Thanks for the nice post. But I am getting the following error while installing Flume on my Mac (10.7):

    ganeshapple$ brew install flume
    Error: No available formula for flume
    Searching taps...
