Archive for the ‘programming’ Category

h1

Using Hadoop to Create SOLR Indexes

September 26, 2010

One of the most challenging projects I faced at work recently was to create a Apache SOLR index consisting of approx 15 million records. This index had been created once in the history of the company using a MySQL database and SOLR’s Data Import Handler (DIH). It had not been attempted since then because the original indexing process was time consuming (12-14 hours), required human supervision, and on failure had to be restarted from the very beginning.

For small data sets (say, less than 100,000 records) SOLR’s DIH and MySQL works fine. However, for these larger sets it’s just too much of a drain on resources.  Some members of our team and the architecture team had had success working with large data sets by leveraging the Apache Hadoop project. One of the most attractive aspect of Hadoop is that the processing is distributed which should reduce the total time to index. Also Hadoop has a robust fail-over system which would remove the need for human supervision. We architected a data pipeline by which data would be processed by modules. When one module completed its task it would alert the system and the next module would begin work on the the output of the previous module. The SOLR indexing is one module.

Read the rest of this entry ?

Advertisements
h1

r cannot be resolved

January 2, 2010

In the Google group Android Beginners I frequently see messages that ask what the error “r cannot be resolved” means in Eclipse.

Read the rest of this entry ?

h1

Comparing Hash Tables

August 11, 2008

I was recently asked in a comment how to compare 2 hash tables in Perl. Furthermore, the commenter mention that this would be use in a subroutine.

There is a module Data::Compare http://search.cpan.org/~dcantrell/Data-Compare-1.19/lib/Data/Compare.pm. I’ve never used this in any way other than to learn what it can do. From what I can tell it will not provide details. It will just tell you yes, the data structures are the same or no, the data structures are not the same.

Read the rest of this entry ?

h1

Hash tables

April 23, 2008

I have found a number of potentially unconventional uses for hash tables (aka “associative arrays”). I suppose the first thing that comes to mind when thinking of hash tables is as a way to map a given value to another value. As a very simple example, say you have a list of items and want to keep track of how many of those items you have.

my %items = ();
$items{shoes} = 2;
$items{pants} = 1;
$items{dogs} = 5;
$items{cats} = 50;

We often refer to this arrangement as a “key/value” pair. Now, if you want to know how many shoes you have you can do so by referencing $items{shoes}. If you want to know just how crazy the cat person is, look at $items{cats}.

Read the rest of this entry ?

h1

Converting from Apache Ant to Apache Maven

April 5, 2008

At work (http://www.monsanto.com/) we’re converting all projects from Apache ant to Apache maven. The architecture team came up with the POM which all our applications would extend and set up the company-wide repository. I spent about a week and a half converting our project. Everyone thought that conversions would go smoothly. However, I haven’t heard of any project taking less than a week. A colleague wrote an application called “mavenizer”. The mavenizer helps analyze your project and generates a POM for each artifact. We had to determine how to split our application into artifacts but using the mavenizer app helped us figure out the dependencies and got us started with an initial POM.

I can’t speak for other projects but probably the most time consuming part of the process was weeding out the dependencies. We had some production code that had interdependencies but the majority was located in the test code. For instance, we have a TestHelper class that provides methods setting up the data environment prior to running a test and for tearing it down once the test completes. This class was used in a couple of leaf artifacts (domainpos and batchprocess).

When working with maven each, for lack of a better word, installable module is packaged in what’s known as an artifact. An “installable” would be something like a jar or war or ear or some other unit. We have a “common” artifact that’s a jar and can be used by all other artifacts. Then there is an “appdao” jar which contains all database-related code. Finally there are 3 leaf artifacts which can depend on the common and/or appdao but should not depend on each other: domainpos (war that is our point-of-service); plugin (jar that is a plugin for the company-wide GUI framework); batchprocess (jar that runs on a schedule for batch processing).

I’ll post here more about the conversion process and about things I learn about maven.