For a kick-ass webscale big-data setup on your local Mac, you'll want to have Hadoop and Flume in place.
No, seriously: this setup is especially useful if you want to route the syslog output from your nodes to HDFS in order to process it later using Map/Reduce jobs.
It took me a while to figure out how Flume and Hadoop have to be configured so that incoming messages get written to HDFS. This is why I decided to write a quick tutorial to get things up and running.
I assume that you have brew in place. So, the first step is as easy as:
brew install flume
But don't think that you can just install Hadoop with brew. The current Hadoop formula is pinned to 0.21, which is an unstable version that, AFAIK, doesn't play well with Flume. We need version 0.20.2. I edited my Hadoop formula locally on my Mac, but this should work, too:
brew install https://raw.github.com/mxcl/homebrew/d0efd9ee94a55e243f3b10e903526274fc21d569/Library/Formula/hadoop.rb
Once the Hadoop installation has finished, we need to edit a bunch of files to configure Hadoop for a local single-node setup. Go to /usr/local/Cellar/hadoop/0.20.2/libexec/conf and change the following files:
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Hadoop is now configured to use /tmp/hadoop as its HDFS folder. Next, we need to create and format the directory.
mkdir /tmp/hadoop
cd /tmp/hadoop
hadoop namenode -format
Hadoop will connect to localhost using ssh. To configure ssh so that it can connect from localhost to localhost without needing a password, we need to add your public key to your authorized keys.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
We can test this by trying to ssh into localhost without using a password:
ssh localhost
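If you run the cat command above more than once, the same key ends up in authorized_keys several times. Here is a small sketch that only appends the key when it is missing. The helper name authorize_key is made up for this post, and I assume your key lives in ~/.ssh/id_rsa.pub (if you have none yet, ssh-keygen -t rsa creates one). On a Mac, ssh localhost also requires Remote Login to be enabled under Sharing in the System Preferences.

```shell
# authorize_key is a hypothetical helper, not part of ssh or Hadoop.
# Usage: authorize_key <public key file> <authorized_keys file>
authorize_key() {
  pub_key="$1"
  auth_file="$2"
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # Append the key only if this exact line is not already present.
  grep -qxF -e "$(cat "$pub_key")" "$auth_file" || cat "$pub_key" >> "$auth_file"
  # With its default StrictModes setting, sshd rejects a key file
  # that is writable by group or others.
  chmod 600 "$auth_file"
}

# Example call (paths are assumptions, adjust them to your setup):
# authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```

Calling it twice leaves exactly one copy of the key in the file.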
You should now be able to start Hadoop. Fire it up by typing:
start-all.sh
To check if HDFS is up and running, try to list the files of your brand-new distributed filesystem:
hadoop dfs -ls /
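If the listing fails, it helps to see which daemon did not come up. In Hadoop 0.20.2, start-all.sh launches NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, and the JDK's jps tool lists running Java processes as "pid name". The sketch below compares the two. missing_daemons is a name I made up, and it assumes jps is on your PATH.

```shell
# missing_daemons is a hypothetical helper: it reads `jps` output on stdin
# and prints every expected Hadoop daemon that is not in the list.
missing_daemons() {
  # jps prints "<pid> <main class>", so the second column is the daemon name.
  running="$(awk '{print $2}')"
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$running" | grep -qx "$daemon" || echo "$daemon"
  done
}

# Example call (prints nothing when all five daemons are running):
# jps | missing_daemons
```

For anything it reports as missing, the matching log file in Hadoop's logs directory is usually the next place to look.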
To make Flume talk to HDFS, we need to replace the hadoop-core.jar in Flume's lib directory with the one that shipped with Hadoop.
cd /usr/local/Cellar/flume/0.9.3-CDH3B4/libexec/lib/
mv hadoop-core-0.20.2-CDH3B4.jar hadoop-core-0.20.2-CDH3B4.jar.unused
cp /usr/local/Cellar/hadoop/0.20.2/libexec/hadoop-0.20.2-core.jar .
Now it's time to start a Flume master node:
flume master
Go to a different terminal window and start a Flume node, too:
flume node_nowatch
At this point, we should be able to start the "dashboard" of Flume in the browser. Open http://localhost:35871/.
In the config section we can now configure a sink that writes into HDFS. For testing purposes we can use a fake source that just reads a local file. Choose your local node from the drop-down list and enter
text("/etc/services")
as source and
collectorSink("hdfs://localhost/","testfile")
as sink.
Now, check if the file was written to HDFS with:
hadoop dfs -ls /
# cat it
hadoop dfs -cat /testfilelog.00000038.20111006-160327897+0200.1317909807897847000.seq
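The file name contains a timestamp, so yours will look different from mine. Instead of copying it by hand, you can filter the listing. This is only a sketch: collector_files is a made-up helper, and it assumes that the path is the last column of the hadoop dfs -ls output and that your sink prefix is "testfile" as above.

```shell
# collector_files is a hypothetical helper: it reads `hadoop dfs -ls` output
# on stdin and prints the paths that start with the given prefix.
collector_files() {
  prefix="$1"
  # $NF is the last column; index(...) == 1 means "starts with prefix".
  awk -v prefix="$prefix" 'index($NF, prefix) == 1 { print $NF }'
}

# Example call: cat every file the collector sink has written so far.
# hadoop dfs -ls / | collector_files /testfile | while read -r path; do
#   hadoop dfs -cat "$path"
# done
```

The "Found N items" header line and other directories in the listing are skipped, because their last column does not start with the prefix.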
If everything worked well, you're ready to switch the source in Flume to:
syslogUdp(5140)
Now Flume acts like a syslog server and writes all log messages directly to HDFS.
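To see something arrive without reconfiguring a real node first, you can send a hand-made syslog message to the port. A classic BSD syslog line starts with a priority in angle brackets, computed as facility * 8 + severity, followed by timestamp, host, tag and message. The helpers below are a sketch with names I invented. Sending via nc is my choice here, and how strictly Flume parses the line is something I have not verified.

```shell
# syslog_pri and syslog_line are hypothetical helpers for building a test message.
# PRI = facility * 8 + severity (e.g. facility 1 = user, severity 5 = notice).
syslog_pri() {
  echo $(( $1 * 8 + $2 ))
}

# Usage: syslog_line <facility> <severity> <tag> <message>
syslog_line() {
  printf '<%s>%s %s %s: %s\n' \
    "$(syslog_pri "$1" "$2")" \
    "$(LC_ALL=C date '+%b %e %H:%M:%S')" \
    "$(hostname)" "$3" "$4"
}

# Example call: send one user.notice message to the Flume node over UDP.
# syslog_line 1 5 test "hello flume" | nc -u -w 1 localhost 5140
```

After a short while, a new file with your sink prefix should show up in the HDFS listing.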
Cheers, Arbo