Articles by Tom White
Running Hadoop MapReduce on Amazon EC2 and Amazon S3
Amazon Web Services Developer Connection, 18 July 2007
Managing large datasets is hard; running computations on large datasets is even harder. Once a dataset has exceeded the capacity of a single filesystem or a single machine, running data processing tasks requires specialist hardware and applications, or, if attempted on a network of commodity machines, it requires a lot of manual work to manage the process: splitting the dataset into manageable chunks, launching jobs, handling failures, and combining job output into a final result.
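The split, process, and combine steps listed above can be sketched in plain Java, without any Hadoop classes. This is only an in-memory illustration: the class `WordCountSketch` and its methods are hypothetical names, the "dataset" is a list of strings, and failure handling is left out entirely.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory sketch of the split / process / combine pattern.
// It does not use the Hadoop API and does not handle task failures.
final class WordCountSketch {

    // "Map" step (with local combining): count the words in a single chunk.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "Reduce" step: merge the per-chunk counts into one final result.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    // Process every chunk independently (here on local threads), then combine.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map<String, Integer>> partials = chunks.parallelStream()
                .map(WordCountSketch::mapChunk)
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b c")));
    }
}
```

Because each chunk is processed independently, the map step could run on separate machines; only the reduce step needs to see all of the partial results.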
Introduction to Nutch, Part 2: Searching
java.net, 16 February 2006
In part one of this two-part series on Nutch, the open-source Java search engine, we looked at how to crawl websites.
Introduction to Nutch, Part 1: Crawling
java.net, 10 January 2006
Nutch is an open source Java implementation of a search engine. It provides all of the tools you need to run your own search engine. But why would anyone want to run their own search engine? After all, there's always Google. There are at least three reasons.
Did You Mean: Lucene?
java.net, 9 August 2005
All modern search engines attempt to detect and correct spelling errors in users' search queries. Google, for example, was one of the first to offer such a facility, and today we barely notice when we are asked "Did you mean x?" after a slip on the keyboard. This article shows you one way of adding a "did you mean" suggestion facility to your own search applications using the Lucene Spell Checker, an extension written by Nicolas Maisonneuve and David Spencer.
Feedback on Lucene Dev mailing list, Lucene User mailing list, and The Server Side.
How To Build a Compute Farm
java.net, 21 April 2005
Some programs can be made to run faster by dividing them up into smaller pieces and running these pieces on multiple processors. This is known as parallel computing, and a large number of hardware and software systems exist to facilitate it. The most famous example of a (distributed) parallel program is SETI@home, but there are many other applications including ray tracing, database searching, code breaking, neural network training, genetic algorithms, and a whole host of NP-complete problems where a brute force approach is needed.
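As a small illustration of dividing a brute-force problem into pieces, the sketch below counts primes in a range by splitting the range into chunks and running each chunk on a thread pool. It assumes a single machine with local threads rather than a distributed farm, and the class `ParallelPrimeCount` and its methods are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one brute-force job split into independent pieces.
final class ParallelPrimeCount {

    static boolean isPrime(long n) {
        if (n < 2) return false;
        for (long d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Counts primes in [from, to): 'from' inclusive, 'to' exclusive.
    static long countRange(long from, long to) {
        long count = 0;
        for (long n = from; n < to; n++) {
            if (isPrime(n)) count++;
        }
        return count;
    }

    // Splits [from, to) into roughly 'pieces' chunks and sums the partial counts.
    static long countPrimes(long from, long to, int pieces) {
        ExecutorService pool = Executors.newFixedThreadPool(pieces);
        try {
            long chunk = Math.max(1, (to - from + pieces - 1) / pieces);
            List<Future<Long>> parts = new ArrayList<>();
            for (long start = from; start < to; start += chunk) {
                final long lo = start;
                final long hi = Math.min(to, start + chunk);
                Callable<Long> job = () -> countRange(lo, hi);
                parts.add(pool.submit(job));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // blocks until that piece has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countPrimes(1, 101, 4)); // prints 25
    }
}
```

The pieces share no state, which is what makes this kind of problem easy to spread across processors or machines.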
Can't beat Jazzy
IBM developerWorks, 22 September 2004
Users have come to expect spell-check capabilities from applications that involve natural-language text entry. Because building a spell checker from scratch is no simple task, this article offers you a workaround using Jazzy, an open source Java spell checker API. Java developer Tom White offers an in-depth explanation of the main algorithms behind computer-based spell checking, then shows you how the Jazzy API can help you incorporate the best of them into your Java applications.
Using XML Catalogs with JAXP
XML.com, 3 March 2004
XML documents often refer to other documents that an XML processor has to retrieve in order to make sense of the main document. These external resources, typically referred to by URIs, may be local files; or they may be remote, distributed across the web. In an ideal world the difference would be invisible, since it would be as cheap to access a remote resource as a local one. However, in the real world network failures do occur, and it is wise to design applications that take this into account.
Scheduling recurring tasks in Java applications
IBM developerWorks, 4 November 2003
All manner of Java applications commonly need to schedule tasks for repeated execution. Enterprise applications need to schedule daily logging or overnight batch processes. A J2SE or J2ME calendar application needs to schedule alarms for a user's appointments. However, the standard scheduling classes, Timer and TimerTask, are not flexible enough to support the range of scheduling tasks typically required. In this article, I show you how to build a simple, general scheduling framework for task execution conforming to an arbitrarily complex schedule.
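One way to get past the fixed-delay and fixed-rate limits of Timer is to describe the schedule as an iterator of execution times and to re-submit a one-shot TimerTask after each run. The sketch below assumes that approach; `WeekdaySchedule` and `IteratorScheduler` are hypothetical names for this example and are not taken from the article's framework. It also uses the `java.time` classes, which postdate the article.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneId;
import java.util.Date;
import java.util.Iterator;
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical schedule: every weekday at a fixed time of day.
// A fixed-period Timer cannot express this, because it must skip weekends.
final class WeekdaySchedule implements Iterator<LocalDateTime> {
    private final LocalTime at;
    private LocalDateTime last;

    WeekdaySchedule(LocalDateTime after, LocalTime at) {
        this.last = after;
        this.at = at;
    }

    @Override public boolean hasNext() { return true; }

    // Returns the first weekday occurrence of 'at' strictly after the previous one.
    @Override public LocalDateTime next() {
        LocalDateTime candidate = last.toLocalDate().atTime(at);
        while (!candidate.isAfter(last) || isWeekend(candidate)) {
            candidate = candidate.plusDays(1);
        }
        last = candidate;
        return candidate;
    }

    private static boolean isWeekend(LocalDateTime t) {
        DayOfWeek day = t.getDayOfWeek();
        return day == DayOfWeek.SATURDAY || day == DayOfWeek.SUNDAY;
    }
}

// Hypothetical scheduler: runs a task at each time the iterator yields.
final class IteratorScheduler {
    private final Timer timer = new Timer(true); // daemon timer thread

    void schedule(Runnable task, Iterator<LocalDateTime> times) {
        if (!times.hasNext()) return;
        Date when = Date.from(times.next().atZone(ZoneId.systemDefault()).toInstant());
        timer.schedule(new TimerTask() {
            @Override public void run() {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    // Keep the timer thread alive if one run fails.
                    System.err.println("Scheduled task failed: " + e);
                }
                // Queue the next one-shot execution.
                IteratorScheduler.this.schedule(task, times);
            }
        }, when);
    }
}
```

A call such as `new IteratorScheduler().schedule(job, new WeekdaySchedule(LocalDateTime.now(), LocalTime.of(9, 0)))` would then run `job` at 09:00 on each weekday, for as long as the JVM stays up.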
Memoization in Java Using Dynamic Proxy Classes
O'Reilly Network, 20 August 2003
Memoization is a technique borrowed from functional programming languages like Lisp, Python, and Perl for giving functions a memory of previously computed values. Memoizing a function adds a transparent caching wrapper to the function, so that function values that have already been calculated are returned from a cache rather than being recomputed each time. Memoization can provide significant performance gains for computing-intensive calls. It is also a reusable solution to adding caching to arbitrary routines.
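A minimal sketch of the idea, using `java.lang.reflect.Proxy` to wrap any interface with a cache keyed on the method and its arguments. The `Squarer` interface and the `Memoizer` and `MemoizeDemo` classes are hypothetical names for this example, and it assumes the wrapped methods are pure and that their arguments implement `equals` and `hashCode` sensibly.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical interface, used only to have something to memoize.
interface Squarer {
    long square(int n);
}

// Caching wrapper: results are stored under a key of (method, arguments).
final class Memoizer implements InvocationHandler {
    private final Object target;
    private final Map<List<Object>, Object> cache = new ConcurrentHashMap<>();

    private Memoizer(Object target) {
        this.target = target;
    }

    // Returns a proxy that implements 'iface' and caches calls to 'target'.
    @SuppressWarnings("unchecked")
    static <T> T memoize(Class<T> iface, T target) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, new Memoizer(target));
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        List<Object> key = new ArrayList<>();
        key.add(method);
        if (args != null) {
            key.addAll(Arrays.asList(args));
        }
        Object cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        Object result;
        try {
            result = method.invoke(target, args);
        } catch (InvocationTargetException e) {
            throw e.getCause(); // rethrow the target's own exception
        }
        if (result != null) { // null and void results are not cached
            cache.put(key, result);
        }
        return result;
    }
}

final class MemoizeDemo {
    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        Squarer slow = n -> { calls.incrementAndGet(); return (long) n * n; };
        Squarer fast = Memoizer.memoize(Squarer.class, slow);
        System.out.println(fast.square(12));                    // prints 144
        System.out.println(fast.square(12));                    // prints 144, from the cache
        System.out.println("underlying calls: " + calls.get()); // prints underlying calls: 1
    }
}
```

The caller only sees the `Squarer` interface, so the cache can be added or removed without touching the code that uses it.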
Using Thread-Local Variables in Java
Dr. Dobb's Journal, July 2003, #350
Version 1.2 of the Java 2 SDK, Standard Edition introduced a new class called ThreadLocal to help with concurrent programming. This article explains how to use thread-local variables to improve the performance of frequently used utility classes whose instances must be shared between multiple threads. Such a scenario is common nowadays on application servers which have many long-running execution threads.
Read more... (Requires subscription.)
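A common case of the pattern described in the thread-local abstract is a utility object that is expensive to create and not thread-safe, such as `SimpleDateFormat`. The sketch below gives each thread its own instance instead of synchronizing on a shared one. The class `ThreadLocalFormat` is a hypothetical name, and `ThreadLocal.withInitial` is a Java 8 convenience that postdates the article; overriding `initialValue()` is the older equivalent.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical utility: one SimpleDateFormat per thread, created lazily.
final class ThreadLocalFormat {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            });

    private ThreadLocalFormat() { }

    // Safe to call from many threads: each thread uses its own formatter.
    static String format(Date date) {
        return FORMAT.get().format(date);
    }

    // Exposed only so the demo can show that instances are per-thread.
    static SimpleDateFormat current() {
        return FORMAT.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(format(new Date(0L))); // prints 1970-01-01
        SimpleDateFormat[] other = new SimpleDateFormat[1];
        Thread worker = new Thread(() -> other[0] = current());
        worker.start();
        worker.join();
        System.out.println(other[0] != current()); // prints true
    }
}
```

On a server with long-lived worker threads, each thread creates its formatter once and reuses it for every request it handles.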