Hadoop is a cluster-based software framework for distributed applications using large volumes (terabytes/petabytes) of distributed data running on thousands of nodes.
Hadoop was first developed for the Nutch search engine project, and Yahoo! is a major user and contributor. Yahoo was at the Cloud Computing Conference and Expo in Santa Clara, California, November 3rd., explaining how they use Hadoop to analyze the Web.
According to the article, Yahoo runs Hadoop on thousands of servers to batch-analyze data collected from spidering the web, and uses it to build the indexes behind Yahoo’s search engine. Yahoo runs it on 25,000 servers in clusters of 4,000 machines. The Hadoop file system subdivides files in 64 MB or greater chunks.