Category Archives: Computers

Building Hadoop and HBase for HBase Maven application development

1 Introduction

HBase 0.90.3 needs the Hadoop common 0.20-append branch in order to not lose data. More information about this can be found in the "Getting Started" section of the HBase guide. However, there is no official binary release of Hadoop common 0.20-append. To have consistent and correct bits on both your cluster and your development platform, you need to compile your own binary version of Hadoop common from the 0.20-append branch source, and then build your own version of HBase 0.90.3 against that Hadoop common binary.

This article provides an overview of building Hadoop and HBase for developing HBase applications that are managed using Maven.

2 Interpreting Maven terminology

This section briefly defines a few ambiguous terms to avoid potential confusion.

2.1 Maven repository vs repository manager

Maven repository refers to ~/.m2/repository, whereas Maven repository manager refers to an artifact repository manager like Apache Archiva or Artifactory.

2.2 Installing vs deploying artifacts

Installing an artifact is installing it in the Maven repository, whereas deploying an artifact means publishing the artifact in a Maven repository manager. For more information, please refer to Maven reference.

2.3 Installing artifacts vs binaries

Installing artifacts refers to installing them in Maven repository, whereas installing binaries refers to installing the entire binary distribution on the cluster.

3 Prerequisites

You need the following components for this process.

If you are using a Maven repository manager, make sure that you configure the authentication settings for the repository manager in the ~/.m2/settings.xml file.

<settings>
  ...
  <servers>
    <server>
      <id>yourrepo.internal</id>
      <username>USER</username>
      <password>PASSWORD</password>
    </server>
  </servers>
  ...
</settings>

yourrepo.internal is the ID that you will be referring to later from Ant and Maven build configurations.

USER and PASSWORD are the username and password of an account with deployment role in your Maven repository manager.
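
A quick sanity check can catch a missing or misspelled server ID before a deploy fails on authentication. The following is only a sketch: the helper name has_server_id is hypothetical, and it relies on a plain grep instead of a real XML parser, so it assumes that each <id> element sits on a single line as in the sample settings above.

```shell
# Hypothetical helper: succeed if the settings file declares the given server ID.
# Assumes each <id> element is on one line, as in the sample settings above.
has_server_id() {
    settings_file=$1
    server_id=$2
    grep -qF "<id>${server_id}</id>" "$settings_file"
}

# Example usage with the file name and ID used in this article.
if [ -f "$HOME/.m2/settings.xml" ]; then
    if has_server_id "$HOME/.m2/settings.xml" yourrepo.internal; then
        echo "yourrepo.internal is configured"
    else
        echo "yourrepo.internal is missing from settings.xml"
    fi
fi
```

The check only confirms that the ID is present. It does not validate the credentials themselves.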

4 Building Hadoop common

4.1 Checkout Hadoop common

Check out Hadoop common from the 0.20-append branch.

$ svn co http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append/ hadoop-common-0.20-append

4.2 Create build.properties

Hadoop uses Apache Ant as a build tool. In order to build hadoop-common, you need to create a hadoop-common-0.20-append/build.properties file that looks something like this.

resolvers=internal
version=0.20-append-r1057313-yourversion
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

Note that at the time of creation of this article, the latest revision that was available in 0.20-append branch was r1057313.

Also, try to assign a meaningful suffix in place of yourversion so that you can distinguish between the official artifacts and the artifacts that are deployed by you.
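
If you rebuild at more than one revision, generating the file from a small script keeps the version string consistent. This is a minimal sketch under stated assumptions: write_build_properties is a hypothetical helper, and the revision number and suffix are whatever you pass in. It writes the same seven lines shown above.

```shell
# Hypothetical helper: write a build.properties like the one shown above.
#   $1 = Subversion revision (for example 1057313)
#   $2 = your own version suffix
#   $3 = output file
write_build_properties() {
    version="0.20-append-r$1-$2"
    {
        echo "resolvers=internal"
        echo "version=$version"
        # The remaining keys refer back to ${version}, which Ant expands,
        # so the text is written literally (single quotes stop the shell).
        for key in project hadoop hadoop-core hadoop-hdfs hadoop-mapred; do
            printf '%s.version=${version}\n' "$key"
        done
    } > "$3"
}

# Example usage (not run here):
#   write_build_properties 1057313 yourversion hadoop-common-0.20-append/build.properties
```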

4.3 OPTIONAL: Configure your repository manager

Please follow this step only if you are running a Maven repository manager for team collaboration and you want to deploy the Hadoop common artifacts to that repository manager.

Edit hadoop-common-0.20-append/build.xml and add two new targets.

<target name="mvn-deploy-internal" depends="mvn-taskdef, bin-package, set-version, simpledeploy-internal"
   description="To deploy hadoop core and test jar's to apache maven repository"/>

<target name="simpledeploy-internal" unless="staging">
   <artifact:pom file="${hadoop-core.pom}" id="hadoop.core"/>
   <artifact:pom file="${hadoop-test.pom}" id="hadoop.test"/>
   <artifact:pom file="${hadoop-examples.pom}" id="hadoop.examples"/>
   <artifact:pom file="${hadoop-tools.pom}" id="hadoop.tools"/>
   <artifact:pom file="${hadoop-streaming.pom}" id="hadoop.streaming"/>

   <artifact:install-provider artifactId="wagon-http" version="${wagon-http.version}"/>
   <artifact:deploy file="${hadoop-core.jar}">
       <remoteRepository id="yourrepo.internal" url="http://yourreposerver.com:port/path"/>
       <pom refid="hadoop.core"/>
   </artifact:deploy>
   <artifact:deploy file="${hadoop-test.jar}">
       <remoteRepository id="yourrepo.internal" url="http://yourreposerver.com:port/path"/>
       <pom refid="hadoop.test"/>
   </artifact:deploy> 
   <artifact:deploy file="${hadoop-examples.jar}">
       <remoteRepository id="yourrepo.internal" url="http://yourreposerver.com:port/path"/>
       <pom refid="hadoop.examples"/>
   </artifact:deploy>
   <artifact:deploy file="${hadoop-tools.jar}">
       <remoteRepository id="yourrepo.internal" url="http://yourreposerver.com:port/path"/>
       <pom refid="hadoop.tools"/>
   </artifact:deploy>
   <artifact:deploy file="${hadoop-streaming.jar}">
       <remoteRepository id="yourrepo.internal" url="http://yourreposerver.com:port/path"/>
       <pom refid="hadoop.streaming"/>
   </artifact:deploy>
</target>

Note that yourrepo.internal is the same ID that you configured authentication for in the ~/.m2/settings.xml file earlier.

4.4 Build and install/deploy Hadoop common artifacts

Now, build and install/deploy Hadoop common artifacts using the Maven ant tasks.

4.4.1 Install artifacts

If you do not have a repository manager, and skipped the previous step, then use the mvn-install target and skip the "Deploy artifacts" section. Otherwise, jump directly to "Deploy artifacts" section.

$ ant mvn-install

This target will generate Hadoop common artifacts and Maven POM files, and install them in your local Maven repository (~/.m2/repository).

4.4.2 Deploy artifacts

If you have an internal repository manager, you should deploy the artifacts to the server that you specified in build.xml in the previous step. To do this, run the mvn-deploy-internal target.

$ ant mvn-deploy-internal

This target will generate the artifacts and Maven POM files, and publish them to your repository manager that you have specified in build.xml.

4.5 Generate the binary tarball to install on the cluster

You need to generate a binary tarball to install on the cluster. This is achieved by running the tar target.

$ ant tar -Djava5.home=<Java 5 SE Home> -Dforrest.home=<Forrest 0.8 Home>

Please note that you need Java SE/EE 5 and Apache Forrest 0.8 for this step. Substituting a different version, such as Java SE/EE 6 or Apache Forrest 0.9, will result in a build failure.

This will generate the hadoop-common-0.20-append/build/hadoop-common-0.20-append-r1057313-yourversion.tar.gz tarball.

4.6 Install Hadoop binaries on the cluster

Copy the tarball that was generated in the previous step to your cluster, and unpack it in the desired location.

This ensures that you have a consistent Hadoop installation because you are not mixing and matching artifacts from a Hadoop common official release and artifacts that you built.

5 Building HBase

5.1 Checkout HBase

Check out HBase from the 0.90.3 tag.

$ svn co http://svn.apache.org/repos/asf/hbase/tags/0.90.3 hbase-0.90.3

5.2 Modify HBase and Hadoop versions

Now, edit hbase-0.90.3/pom.xml and modify HBase and Hadoop versions.

...
<groupId>org.apache.hbase</groupId>
<artifactId>hbase</artifactId>
<packaging>jar</packaging>
<version>0.90.3-yourversion</version>
...
 <hadoop.version>0.20-append-r1057313-yourversion</hadoop.version>
...

Note that you should be using the same revision number for Hadoop that you have assigned while building Hadoop.

Also, try to assign a meaningful suffix in place of yourversion so that you can distinguish between the official artifacts and the artifacts that are deployed by you.
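
Because a mismatch between the two version strings is easy to miss by eye, it can be checked mechanically. The sketch below is hedged accordingly: the three helper names are hypothetical, and the sed expressions assume that the version key and the <hadoop.version> element each sit on one line, as in the snippets above.

```shell
# Hypothetical helpers; both use sed rather than a real properties or XML parser.

# Print the value of the "version" key from a Hadoop build.properties file.
hadoop_version_from_properties() {
    sed -n 's/^version=//p' "$1" | head -n 1
}

# Print the first <hadoop.version> value found in an HBase pom.xml.
hadoop_version_from_pom() {
    sed -n 's:.*<hadoop\.version>\(.*\)</hadoop\.version>.*:\1:p' "$1" | head -n 1
}

# Succeed only if both files name the same, non-empty Hadoop version.
#   $1 = build.properties, $2 = pom.xml
versions_match() {
    built=$(hadoop_version_from_properties "$1")
    wanted=$(hadoop_version_from_pom "$2")
    [ -n "$built" ] && [ "$built" = "$wanted" ]
}

# Example usage (not run here):
#   versions_match hadoop-common-0.20-append/build.properties hbase-0.90.3/pom.xml \
#       || echo "Hadoop version mismatch"
```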

5.3 OPTIONAL: Specify the URL of your repository manager

If you are running an internal repository manager for team collaboration, now is the time to specify its URL in hbase-0.90.3/pom.xml. Add the following section to it.

<project>
  ...
  <distributionManagement>
    <repository>
      <id>yourrepo.internal</id>
      <name>Your internal repository</name>
      <url>http://yourreposerver.com:port/path</url>
    </repository>
  </distributionManagement>
  ...
</project>

Note that yourrepo.internal is the same ID that you configured authentication for in the ~/.m2/settings.xml file earlier.

5.4 Build and install/deploy HBase artifacts

Now, build and install/deploy HBase artifacts using the Maven goals.

5.4.1 Install artifacts

If you do not have a repository manager, and skipped the previous step, then use the install goal and skip the "Deploy artifacts" section. Otherwise, jump directly to "Deploy artifacts" section.

$ mvn install

This goal will generate HBase artifacts and Maven POM files, and install them in your local Maven repository (~/.m2/repository). Ignore the rest of this section.
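
To confirm that the install actually landed, you can look for the jar in the local repository. The sketch below relies on Maven's standard repository layout (group ID with dots turned into directories, then artifact ID, then version). The helper name local_artifact_path and the M2_REPO override variable are hypothetical conveniences, not Maven features.

```shell
# Hypothetical helper: print where Maven's standard layout puts a jar.
# Layout: <repo>/<groupId as directories>/<artifactId>/<version>/<artifactId>-<version>.jar
#   $1 = groupId, $2 = artifactId, $3 = version
# M2_REPO is an assumed override for this sketch; the default is ~/.m2/repository.
local_artifact_path() {
    group_path=$(printf '%s' "$1" | tr . /)
    printf '%s/%s/%s/%s/%s-%s.jar\n' \
        "${M2_REPO:-$HOME/.m2/repository}" "$group_path" "$2" "$3" "$2" "$3"
}

# Coordinates taken from the pom.xml excerpt earlier in this article.
jar=$(local_artifact_path org.apache.hbase hbase 0.90.3-yourversion)
if [ -f "$jar" ]; then
    echo "installed: $jar"
else
    echo "not found: $jar"
fi
```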

5.4.2 Deploy artifacts

If you have a repository manager, you should deploy the artifacts on your internal server that you have specified in pom.xml of the previous step. To achieve this, invoke deploy goal.

$ mvn deploy

This goal will generate the artifacts and Maven POM files, and publish them to your internal repository manager that you have specified in pom.xml.

5.5 Generate the binary tarball to install on the cluster

You need to generate a binary tarball to install on the cluster. This is achieved by invoking assembly:single goal.

$ mvn assembly:single

This will generate the hbase-0.90.3/target/hbase-0.90.3-yourversion.tar.gz tarball.

5.6 Install HBase binaries on the cluster

Copy the tarball that was generated in the last step to your cluster and unpack it in the desired location.

This ensures that you have a consistent HBase installation with the right version of the Hadoop artifacts that you have built. There is no need to replace any artifact by hand, because Maven automatically pulled in the version of the artifact that you built.

6 OPTIONAL: Using the HBase artifact in your HBase application

Now you can edit the pom.xml of your HBase application to use the version of HBase that you have built (0.90.3-yourversion).
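
As an illustration, the small script below prints a dependency element that you could paste into the application's pom.xml. The group ID and artifact ID are the ones from the HBase pom.xml excerpt above. The helper name print_hbase_dependency is hypothetical, and the version argument is whatever suffix you chose.

```shell
# Hypothetical helper: print a <dependency> element for the HBase build described above.
#   $1 = the HBase version you assigned, for example 0.90.3-yourversion
print_hbase_dependency() {
    cat <<EOF
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>$1</version>
</dependency>
EOF
}

print_hbase_dependency 0.90.3-yourversion
```

If the artifact was deployed to a repository manager rather than installed locally, the application's build also has to be able to resolve artifacts from that server.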

7 Feedback

I tried to provide as much information in this article as possible without overloading its scope. I have also taken basic care to ensure that the commands above are accurate. However, there might be some typos or copy-paste errors. If you find something that doesn't work for you, please let me know and I'll fix it.

8 Credits

9 Disclaimer

This article is provided for informational purposes only and I will not be liable for any errors, omissions, or delays in this information or any losses, injuries, or damages arising from its use.

Making Mac applications to make use of memories larger than 4 GB

Today, I upgraded the memory on my MacBook Pro (2010) from 4 GB to 8 GB. One of the main reasons for the upgrade was to be able to run a Microsoft Windows 7 virtual machine using Oracle VirtualBox. However, I noticed that VirtualBox was not able to see more than 4 GB of memory even after the upgrade, while the system reported that there was 8 GB of installed memory.

A quick search showed that Mac OS X 10.6 (Snow Leopard) uses the 32-bit kernel by default. This limits applications to only 4 GB of memory. For applications to use more memory, one needs to use the 64-bit kernel (provided that you are on a 64-bit platform). There is an Apple support page that describes how to select the desired kernel.

I changed the defaults to use 64-bit kernel.

$ sudo systemsetup -setkernelbootarchitecture x86_64

This configuration is stored in the file /Library/Preferences/SystemConfiguration/com.apple.Boot.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>Kernel Architecture</key>
        <string>x86_64</string>
        <key>Kernel Flags</key>
        <string></string>
</dict>
</plist>
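
The setting can also be read back from that file without opening it by hand. The following is a rough sketch, assuming the file is stored as XML text laid out like the listing above, with the value on the line right after the key. The helper name kernel_arch_from_plist is hypothetical.

```shell
# Hypothetical helper: print the <string> value that follows the
# "Kernel Architecture" key in an XML property list.
# Assumes the value is on the line immediately after the key, as shown above.
kernel_arch_from_plist() {
    awk '/<key>Kernel Architecture<\/key>/ {
        getline
        gsub(/^[ \t]*<string>|<\/string>[ \t]*$/, "")
        print
        exit
    }' "$1"
}

plist=/Library/Preferences/SystemConfiguration/com.apple.Boot.plist
if [ -r "$plist" ]; then
    echo "Configured kernel architecture: $(kernel_arch_from_plist "$plist")"
fi
```

This reports what is configured for the next boot, not necessarily what the running kernel is using.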

After making this change and rebooting, VirtualBox was able to see the whole memory.

Plotting a weight chart using Emacs Org-Mode and Gnuplot

I discovered Org-Mode a couple of years ago. Since then, I have started using it for various tasks. One recent use is tracking my daily weight. This article describes how to use Org-Mode and Gnuplot to plot your weight measurements.

Here is the list of software that I use in my setup.

For the purpose of this article, it is assumed that you have a similar setup that works for you.

As part of my daily routine, I track three measurements, namely weight, body fat percentage and body water percentage. This data can be represented in an Org table that has four columns as follows. Please note that the data provided below is hypothetical.

#+PLOT: script:"Weight.plt"
| Date       | Weight |  BF% |  BW% |
|------------+--------+------+------|
| 05/01/2011 |  145.2 | 30.3 | 50.3 |
| 05/02/2011 |  145.1 | 29.4 | 50.7 |
| 05/03/2011 |  144.2 | 30.1 | 50.4 |
| 05/04/2011 |  144.0 | 29.0 | 50.8 |
| 05/05/2011 |  144.1 | 28.7 | 51.0 |
| 05/06/2011 |  144.2 | 29.2 | 50.7 |
| 05/07/2011 |  144.4 | 27.8 | 51.4 |
| 05/08/2011 |  143.7 | 27.6 | 51.5 |
| 05/09/2011 |  143.2 | 28.5 | 51.0 |
| 05/10/2011 |  142.8 | 30.1 | 50.4 |
| 05/11/2011 |  142.2 | 30.1 | 50.4 |
| 05/12/2011 |  142.4 | 29.8 | 50.5 |
| 05/13/2011 |  142.2 | 29.9 | 50.5 |
| 05/14/2011 |  141.2 | 27.4 | 51.6 |
| 05/15/2011 |  141.5 | 27.2 | 51.7 |
|------------+--------+------+------|
| 05/16/2011 |  141.2 | 28.8 | 51.0 |
| 05/17/2011 |  140.8 | 28.9 | 50.9 |
| 05/18/2011 |  140.2 | 29.5 | 50.7 |
| 05/19/2011 |  140.5 | 26.8 | 51.9 |
| 05/20/2011 |  140.4 | 28.1 | 51.3 |
| 05/21/2011 |  139.2 | 26.7 | 51.9 |
| 05/22/2011 |  137.1 | 27.0 | 51.7 |
| 05/23/2011 |  137.5 | 27.0 | 51.7 |
| 05/24/2011 |  137.4 | 28.5 | 51.1 |
| 05/25/2011 |  136.8 | 29.6 | 50.6 |
| 05/26/2011 |  136.3 | 28.8 | 51.0 |
| 05/27/2011 |  136.3 | 28.2 | 51.2 |
| 05/28/2011 |  136.3 | 28.9 | 50.9 |
| 05/29/2011 |  136.2 | 28.0 | 51.3 |
| 05/30/2011 |  135.9 | 28.8 | 50.9 |
| 05/31/2011 |  135.5 | 28.2 | 51.2 |
|------------+--------+------+------|
| 06/01/2011 |  135.1 | 28.2 | 51.2 |
| 06/02/2011 |  135.5 | 28.2 | 51.2 |
| 06/03/2011 |  134.8 | 28.9 | 50.9 |
| 06/04/2011 |  134.4 | 29.6 | 50.6 |
| 06/05/2011 |  134.2 | 27.9 | 51.3 |
| 06/06/2011 |  134.4 | 28.1 | 51.3 |
| 06/07/2011 |  133.2 | 29.2 | 50.8 |
| 06/08/2011 |  133.1 | 28.8 | 51.0 |
| 06/09/2011 |  133.0 | 28.6 | 51.0 |
| 06/10/2011 |  133.2 | 27.8 | 51.4 |
| 06/11/2011 |  132.2 | 27.9 | 51.4 |
| 06/12/2011 |  132.0 | 27.9 | 51.4 |
| 06/13/2011 |  132.1 | 28.9 | 51.0 |
| 06/14/2011 |  131.0 | 29.0 | 50.9 |
| 06/15/2011 |  130.2 | 27.6 | 51.5 |
|------------+--------+------+------|

I use the following Gnuplot script, Weight.plt, which plots the weight, body fat percentage, absolute body fat, body water percentage and absolute body water.

# Weight plotter.

set terminal aqua title "Weight plot"
set xdata time
set timefmt '"%m/%d/%Y"'
set xlabel 'Date'

set multiplot

set size 1, 0.33

# Weight.
set origin 0, 0.66
set ylabel 'Weight in lb'
plot '$datafile' using 1:2 title 'Weight' with lines linecolor 2

set size 0.5, 0.33

# Body fat.
# Percentage.
set origin 0, 0.33
set ylabel 'Body fat %'
plot '$datafile' using 1:3 title 'Body fat %' with lines linecolor 1
# Absolute.
set origin 0.5, 0.33
set ylabel 'Body fat in lb'
plot '$datafile' using 1:($3 * $2 / 100) title 'Body fat' with lines linecolor 1

# Body water.
# Percentage.
set origin 0, 0
set ylabel 'Body water %'
plot '$datafile' using 1:4 title 'Body water %' with lines linecolor 3
# Absolute.
set origin 0.5, 0
set ylabel 'Body water in lb'
plot '$datafile' using 1:($4 * $2 / 100) title 'Body water' with lines linecolor 3

unset multiplot
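
The two "absolute" plots derive their values as weight times percentage divided by 100. To see the same arithmetic outside Gnuplot, here is a small sketch that applies it directly to Org table rows. The helper name absolute_values is hypothetical. Note that splitting on the | character shifts every field up by one compared with the column numbers in the Gnuplot script.

```shell
# Hypothetical helper: read Org table rows on stdin and print, for each data row,
# the date followed by body fat and body water in lb (weight * percentage / 100).
absolute_values() {
    awk -F'|' '
        # Only rows whose first cell looks like a date carry measurements;
        # this skips the header row and the separator rows.
        $2 ~ /[0-9]+\/[0-9]+\/[0-9]+/ {
            gsub(/ /, "", $2)
            printf "%s %.2f %.2f\n", $2, $3 * $4 / 100, $3 * $5 / 100
        }'
}

# Example with the first row of the table above.
printf '%s\n' \
    '| Date       | Weight |  BF% |  BW% |' \
    '|------------+--------+------+------|' \
    '| 05/01/2011 |  145.2 | 30.3 | 50.3 |' | absolute_values
```

For that first row the script prints 05/01/2011 44.00 73.04, that is, about 44 lb of body fat and 73 lb of body water.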

Here is the screenshot of the generated plot. Click the thumbnail to get the actual size.

Making GNU Emacs detect custom error messages – A Maven Example

GNU Emacs’s compilation mode is capable of detecting error messages from various standard compilers and build tools. However, it is fairly common to run into an error message format that Emacs can’t handle by default.

As you know, Emacs is highly extensible, and it provides compilation-error-regexp-alist for accomplishing this. I have done this a couple of times before, and I recently had to do it once more to add a regular expression for Maven error messages. Constructing these regular expressions is not fun for everyone. However, regexp-builder makes it easier to construct the regular expression interactively.

Because the documentation around compilation-error-regexp-alist isn’t very intuitive for most, I have decided to make this screencast that describes how to approach this from scratch. Sorry about the typos! For best quality, watch it in HD.

GNU Emacs and MIT Scheme on Mac OS X

Today, I planned to go back to the basics by taking 6.001 Structure and Interpretation of Computer Programs offered by MIT OpenCourseWare. I’ll save the reason behind it for another post.

For running the programs that are used in the class, I decided to use MIT/GNU Scheme. I am running GNU Emacs 23 on my Mac OS X. After some research, I figured out the best way of doing this is through xscheme.

First, download the MIT/GNU Scheme binary for Mac OS X and copy it to your Applications directory. Then configure Emacs to use the downloaded binary by adding the following lines to your .emacs.

(setq scheme-program-name
      "/Applications/mit-scheme.app/Contents/Resources/mit-scheme")
(require 'xscheme)

Now write your Scheme program.

; Compute the square root of a given number using successive
; approximation.
 
(define (sqrt value)
  (define (is-good-enough? guess value)
    (< (abs (- (* guess guess) value)) 0.0000001))
 
  (define (try guess value)
    (if (is-good-enough? guess value)
	guess
	(try (/ (+ guess (/ value guess)) 2) value)))
 
  (try 1 value))
 
(sqrt 4.0)

Invoke the Scheme process by 'M-x run-scheme'. Send the Scheme buffer to the Scheme process by 'M-o' and now you are able to run Scheme programs from Emacs.

Below is the screenshot of Scheme running under my Emacs session.

MIT Scheme running under GNU Emacs

Apache Software Foundation

Recently, I started playing with a lot of Open Source products from the Apache Software Foundation. It all started with Hadoop, HBase and Cassandra. Day after day, I am getting my hands dirty with more of the Apache Software Foundation’s products like Ant, Maven, Archiva and Thrift.

When trying to build HBase from source, I noticed that the project was using Subversion for version control. I found it quite odd to see a modern project like HBase not using a distributed version control tool like Mercurial or Git. Soon, I realized that the source code of all Apache projects was maintained in Subversion. Then I remarked to my co-worker, “Maybe the Apache Foundation took over Subversion too!”, only to realize that it was true. We learned that Subversion became an Apache Incubator project in 2009 and an Apache top-level project in 2010.

I am really amazed by the number of projects that are now part of the Apache Software Foundation. Go Apache!

Introduction to Test-Driven Development in C++ using Boost Test Library

I have been following Test-Driven Development for a few years now. Even though TDD is widespread, I often come across friends who aren’t very familiar with the TDD approach. It took me a while to really appreciate TDD after I was introduced to it. When I demonstrated TDD in action, I got a few of my friends interested.

We have our own test framework that we use in our project, which was primarily developed by David Carlton. It works very well for our needs. However, for my personal projects, I wanted to try something that is more widely used in the industry. I used CppUnit for a while until I found that Boost Test Library had come a long way. Now, I use Boost Test Library for all my personal projects. It is very easy to set up tests and I really like it.

I also wanted to write a quick introduction to Boost Test Library. So, I thought that I would put together a screencast that serves two purposes: demonstrating Boost Test Library and introducing TDD. This is not an extensive demo or introduction. I have chosen a really simple problem that is often asked in preliminary rounds of technical interviews, but it is a good place to start. I don’t guarantee that the solution is efficient, but it is correct to my knowledge. Please feel free to suggest issues or improvements.

Please note that an HD version of this video is available when viewed on Vimeo’s site.

Introduction to Test Driven Development in C++ using Boost Test Library from Praveen Kumar on Vimeo.

Simple GNU Emacs keyboard macro demonstration

My obsession with GNU Emacs has grown over the years to the extent that I have managed to get a significant number of users to adopt Emacs. In the past 10 years, I have learned a lot of nice tricks that I can do in Emacs to improve my productivity. So, I have decided to create a series of screencasts demonstrating some of those.

I will start with a very simple one, macros. Quoting from Emacs documentation, “A keyboard macro is a command defined by an Emacs user to stand for another sequence of keys. For example, if you discover that you are about to type C-n M-d C-d forty times, you can speed your work by defining a keyboard macro to do C-n M-d C-d, and then executing it 39 more times.”

In this demo, I have taken a real world example where you have to add C++ class member variables and accessors. There are other efficient ways to do such tasks in GNU Emacs. I personally use yasnippets to do these things. However, this approach is shown just to demonstrate keyboard macros. To supplement this video, please take a look at the keyboard macro documentation that is available within Emacs.

Getting untruncated command line options passed to a Solaris process

If you have ever wanted to get the command line options that were passed to a running Solaris process, you might have noticed that the output of command line arguments from ps is truncated to 80 characters. Looking into /usr/include/sys/procfs.h will reveal the reason why! This is because of the restriction in struct psinfo. Here are the relevant fields from the definition of struct psinfo.

#define	PRFNSZ		16	/* Maximum size of execed filename */
#define	PRARGSZ		80	/* number of chars of arguments */
 
typedef struct psinfo {
         /* Fields omitted */
         char pr_fname[PRFNSZ];    /* name of exec'ed file */
         char pr_psargs[PRARGSZ];  /* initial characters of arg list */
         /* Fields omitted */
} psinfo_t;

So, due to the 80-character restriction in psinfo::pr_psargs, the kernel does not keep track of arguments beyond that limit. The only way to get the full information is from argv in the process’s memory, which means that you need permission to read that memory. This is the trick employed by both pargs and the BSD version of ps with the -ww switch.

To get the full length command line arguments passed to a process, you can do one of the following.

$ /usr/ucb/ps eww <pid>
$ pargs -l <pid>

One catch here is that, if the process has modified the argv since it was started, the output reported by both ps and pargs will show the modified data and not the initial arguments that were passed in. However, modifying argv within a program is not a standard practice and hence the chance of encountering such a scenario is remote.

Dumping core file from set-UID, set-GID ‘ed processes in Solaris

I had a previous post on how to turn on core files for set-UID, set-GID processes under Linux. Recently we ran into the same problem on Solaris. To turn on core files for set-id processes, use coreadm.

$ pfexec coreadm -e global-setid

Please keep in mind that these core files can contain information that a non-privileged user isn’t supposed to see. Quoting from the Solaris man page:

     A process that is or ever has been setuid  or  setgid  since
     its  last  exec(2)  presents  security issues that relate to
     dumping  core.  Similarly,  a  process  that  initially  had
     superuser  privileges  and  lost  those  privileges  through
     setuid(2) also presents security issues that are related  to
     dumping core. A process of either type can contain sensitive
     information in  its  address  space  to  which  the  current
     nonprivileged  owner  of the process should not have access.
     If setid core files are enabled, they are created  mode  600
     and owned by the superuser.