Saturday 22 December 2012

Part 1 : What is Hadoop ?

Imagine networking thousands of computers together. Each computer has its own processor and hard disk drive. All these computers are running software that makes the computers appear as a single powerful "super computer" with lots of processing capability and storage space. 

Now assume you are Amazon, and using this "super computer" to store click stream records from your amazon.com web site. You now want to understand if there are trends that lead to customers not completing transactions after having added items into their shopping carts. Recall that the click stream data is spread across the local disk drives on all of these thousands of computers.

To gain the insights required, a copy of your analytics (application) logic is sent to each individual computer. Each computer then runs the application logic against the data stored locally. Instead of bringing the data to the application, the application (or function) is moved to the location where the data is stored. Moving data across a network has a significant impact on performance, and by avoiding this, near-linear scalability is achieved. Increasing data processing requirements can be accommodated simply by adding more computers.

Hadoop is the term used to describe this distributed filesystem (HDFS) and data processing engine, which can be used to handle extremely high volumes of unstructured data at Internet scale. This group of computers makes up a Hadoop cluster, and each computer in the cluster is referred to as a node. The programming model used to bring the function (application logic) to the data is known as MapReduce.
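The MapReduce model can be illustrated with a small sketch. The snippet below is not Hadoop code; it simulates, in plain Python, the map, shuffle, and reduce phases over the hypothetical clickstream scenario above (session IDs, event names, and the three-node split are all made up for illustration). Each "node" maps over its own local records, the framework groups the output by key, and the reducer finds sessions that added to the cart but never checked out.

```python
from collections import defaultdict

# Hypothetical clickstream records as (session_id, event) pairs.
# In a real Hadoop job these would live in HDFS blocks spread across
# many nodes; here we simulate three nodes' local data in memory.
node_data = [
    [("s1", "add_to_cart"), ("s1", "checkout")],
    [("s2", "add_to_cart"), ("s3", "add_to_cart")],
    [("s3", "view_item"), ("s2", "view_item")],
]

def map_phase(records):
    # Runs on each node against its local data: emit (key, value) pairs.
    for session, event in records:
        yield session, event

def reduce_phase(session, events):
    # After the shuffle, each reducer call sees all events for one key.
    if "add_to_cart" in events and "checkout" not in events:
        return session  # abandoned cart
    return None

# Shuffle: group mapper output by key (Hadoop does this for you).
grouped = defaultdict(list)
for local_records in node_data:          # "each node maps its own data"
    for key, value in map_phase(local_records):
        grouped[key].append(value)

abandoned = sorted(s for s, ev in grouped.items()
                   if reduce_phase(s, ev) is not None)
print(abandoned)  # ['s2', 's3']
```

In the toy data, sessions s2 and s3 added items but never checked out, so they are flagged as abandoned. The key idea mirrors the text: the map function travels to where each node's records already sit, and only the much smaller keyed output crosses the network during the shuffle.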

Technologies like Hadoop are what enable companies like Facebook, Google and Yahoo! to store millions of digital images and elements of our conversations, without having to design or understand, up front, the format of the information or content they need to handle. This flexibility, and the ability to scale in a near-linear fashion, is one of the key attractions of Hadoop. Yahoo! reportedly has over 40,000 nodes spanning its Hadoop clusters, which store over 40PB of data.
