Archive for the ‘hadoop’ Category


Using Hadoop to Create SOLR Indexes

September 26, 2010

One of the most challenging projects I faced at work recently was to create an Apache SOLR index consisting of approximately 15 million records. This index had been created only once in the company’s history, using a MySQL database and SOLR’s Data Import Handler (DIH). It had not been attempted since, because the original indexing process was time consuming (12–14 hours), required human supervision, and on failure had to be restarted from the very beginning.

For small data sets (say, less than 100,000 records), SOLR’s DIH and MySQL work fine. For these larger sets, however, the process is just too much of a drain on resources. Some members of our team and the architecture team had had success working with large data sets by leveraging the Apache Hadoop project. One of the most attractive aspects of Hadoop is that the processing is distributed, which should reduce the total time to index. Hadoop also has a robust fail-over system, which would remove the need for human supervision. We architected a data pipeline in which data is processed by a series of modules: when one module completes its task, it alerts the system, and the next module begins work on the output of the previous one. SOLR indexing is one such module.
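The module hand-off described above can be sketched as a simple chain, where each stage consumes the previous stage’s output. This is only an illustrative sketch with hypothetical module names (`ExtractModule`, `SolrIndexModule`); the real system’s Hadoop jobs, alerting mechanism, and SOLR wiring are not shown.

```python
class Module:
    """A single stage in the pipeline; consumes the previous stage's output."""
    def run(self, records):
        raise NotImplementedError

class ExtractModule(Module):
    # Hypothetical stand-in for an upstream module that prepares raw records.
    def run(self, records):
        return [r.strip() for r in records]

class SolrIndexModule(Module):
    # Hypothetical stand-in for the final stage, which would build SOLR
    # documents and submit them to the index.
    def run(self, records):
        return [{"id": i, "text": r} for i, r in enumerate(records)]

def run_pipeline(modules, data):
    # Each module "completes" by returning; its output feeds the next module.
    for module in modules:
        data = module.run(data)
    return data

docs = run_pipeline([ExtractModule(), SolrIndexModule()], ["  alpha ", "beta"])
```

In the real pipeline each stage would run as a distributed Hadoop job rather than an in-process call, but the contract is the same: a module starts only when the previous module signals completion.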
