
Thursday, October 6, 2011

Flume and Hadoop on OS X

For a kick-ass webscale big-data setup on your local Mac, you'll want to have Hadoop and Flume in place.
No, seriously - this setup is especially useful if you want to route the syslog output from your nodes to HDFS in order to process it later with Map/Reduce jobs.

It took me a while to figure out how Flume and Hadoop have to be configured so that incoming messages actually get written to HDFS. That's why I decided to write a quick tutorial to get things up and running.

I assume that you have brew in place. So, the first step is as easy as:

brew install flume

But don't think you can just install Hadoop with brew as well. The current formula is pinned to 0.21, which is an unstable version that, AFAIK, doesn't play well with Flume. We need version 0.20.2. I edited my Hadoop formula locally on my Mac, but installing straight from the older formula revision should work, too:

brew install https://raw.github.com/mxcl/homebrew/d0efd9ee94a55e243f3b10e903526274fc21d569/Library/Formula/hadoop.rb

After the Hadoop installation has finished, we need to edit a bunch of files to configure Hadoop for a local single-node setup. Go to /usr/local/Cellar/hadoop/0.20.2/libexec/conf and change the following files:

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
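
Depending on your Java setup, you may also need to point Hadoop at your JDK by setting JAVA_HOME in hadoop-env.sh in the same conf directory; on OS X, something like this usually works (adjust to your own Java installation):

export JAVA_HOME=$(/usr/libexec/java_home)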

Hadoop is now configured to use /tmp/hadoop as its HDFS directory. Next, we need to create that directory and format the namenode:

mkdir /tmp/hadoop
cd /tmp/hadoop
hadoop namenode -format

Hadoop will connect to localhost using ssh. To configure ssh so that it can connect from localhost to localhost without asking for a password, we need to add your public key to your authorized keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
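
If that fails because ~/.ssh/id_rsa.pub does not exist yet, generate a key pair first (RSA is just the default choice here):

ssh-keygen -t rsa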

We can test this by trying to ssh into localhost without using a password:

ssh localhost

You should now be able to start Hadoop. Fire it up by typing:

start-all.sh
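
A quick sanity check is jps, which ships with the JDK; on a single-node setup it should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker:

jps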

To check if HDFS is up and running, try to list the files of your brand-new distributed filesystem:

hadoop dfs -ls /
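
For a slightly stronger check, you can copy a local file into HDFS and list it again (the file name here is just an example):

echo "hello hdfs" > /tmp/hello.txt
hadoop dfs -put /tmp/hello.txt /hello.txt
hadoop dfs -ls /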

To make Flume talk to HDFS, we need to replace the hadoop-core jar in Flume's lib directory with the one that shipped with our Hadoop installation.

cd /usr/local/Cellar/flume/0.9.3-CDH3B4/libexec/lib/
mv hadoop-core-0.20.2-CDH3B4.jar hadoop-core-0.20.2-CDH3B4.jar.unused
cp /usr/local/Cellar/hadoop/0.20.2/libexec/hadoop-0.20.2-core.jar .

Now it's time to start a Flume master node:

flume master

Go to a different terminal window and start a Flume node, too:

flume node_nowatch

At this point, we should be able to open Flume's "dashboard" in the browser at http://localhost:35871/.
In the config section we can now set up a sink that writes into HDFS. For testing purposes we use a source that just reads a local file. Choose your local node from the drop-down list and enter

text("/etc/services")

as the source and

collectorSink("hdfs://localhost/","testfile")

as the sink. The first argument of collectorSink is the target directory in HDFS, the second one is a filename prefix.
Now, check whether the file actually landed in HDFS (the exact filename will differ on your machine):

hadoop dfs -ls /
hadoop dfs -cat /testfilelog.00000038.20111006-160327897+0200.1317909807897847000.seq

If everything worked, you're fine to switch the source in Flume to:

syslogUdp(5140)

Now Flume acts as a syslog server and writes all incoming log messages directly to HDFS.
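
To actually feed it, point the syslog daemons on your other machines at that port. With rsyslog, for example, a single forwarding rule is enough (a sketch; a single @ means UDP, and flume-host stands for the machine running your Flume node):

# e.g. in /etc/rsyslog.conf on each sending node
*.*  @flume-host:5140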

Cheers, Arbo

Wednesday, August 18, 2010

CouchDB: Using List Functions to sort Map/Reduce-Results by Value

I just found out that it is possible to sort the result of Map/Reduce with a list function.

Let's take the simple example that you want to count all documents grouped by a field called type. The following map function emits the value of the type field for every document:

function(doc) {
  emit(doc.type, 1);
}
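
For example, a (made-up) document like this would show up in the view as the row ["post", 1]:

{
  "_id": "0a1b2c",
  "type": "post",
  "title": "Hello"
}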

To sum up the documents with the same value in the type field, we just need this well-known reduce function:

function(key, values) {
  return sum(values)
}

By default, CouchDB returns the result ordered by key. If you want to order the result by the number of occurrences of each document type instead, you either have to sort it in your application or use a list function like this one:

function(head, req) {
  var row;
  var rows = [];
  while (row = getRow()) {
    rows.push(row);
  }
  rows.sort(function(a, b) {
    return b.value - a.value;
  });
  send(JSON.stringify({"rows" : rows}));
}

If you save the list function as sort and the Map/Reduce functions as count together in a design document, you can fetch the sorted result like this:

curl http://.../design-doc/_list/sort/count?group=true
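
For reference, the whole design document could look roughly like this (a sketch; the name _design/count-by-type is made up, the function bodies are the ones from above):

{
  "_id": "_design/count-by-type",
  "views": {
    "count": {
      "map": "function(doc) { emit(doc.type, 1); }",
      "reduce": "function(key, values) { return sum(values); }"
    }
  },
  "lists": {
    "sort": "function(head, req) { var row, rows = []; while (row = getRow()) { rows.push(row); } rows.sort(function(a, b) { return b.value - a.value; }); send(JSON.stringify({\"rows\": rows})); }"
  }
}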

Of course there are other options to sort a view result. I didn't find much documentation on this topic, but this thread on Stack Overflow is very informative.

Back to the couch - Cheers!