In previous blogs we addressed the history of "Big Data." We also talked about what comprises big data and why it's important to leverage big data. In this blog we'll delve into some of the underlying technology that people refer to when discussing big data as it pertains to content on the internet and other applications that leverage unstructured data.
In 2004 Google published a white paper about MapReduce, which it developed to handle the huge data volumes generated on the internet. Google handles tens of thousands of search requests every second, which requires it to search billions of sites for keywords. Google also tracks clicks, meta tags, descriptions, and more. All of this requires Google to handle massive data volumes.
Google needed to be able to spread processing across hardware. Hadoop is an Apache project. It is an open source data library with two core components. First, a large-scale, distributed file system called HDFS (Hadoop Distributed File System), which spreads data and processing across commodity hardware.
Second, MapReduce, which distributes the processing of large data sets across multiple servers and then reduces and distills the results.
MapReduce reduces the data into results and creates a summary of the data. The classic illustration is a simple word count.
On the left side of this picture is a set of words. Imagine this is your data input: simply text with no structure. You see "boat," "yacht," "lake," and so on.
MapReduce first splits the data, as you can see in the second phase of the diagram. It then maps and counts the data. Next, it shuffles the data into "like" bins. It then reduces the data based on the counts. And finally, it produces a summary of the results.
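The phases above can be sketched in a few lines of plain Python. This is a minimal simulation of the word-count example, not real Hadoop code; the sample text and the two-chunk split are hypothetical stand-ins for the diagram's input.

```python
from collections import defaultdict

# Hypothetical unstructured input, like the word list in the diagram.
text = "boat yacht lake boat lake boat"

# Split phase: divide the input into chunks that could be processed
# on separate servers (here, just two halves of the word list).
words = text.split()
chunks = [words[:3], words[3:]]

# Map phase: each chunk independently emits (word, 1) pairs.
mapped = [(word, 1) for chunk in chunks for word in chunk]

# Shuffle phase: group the pairs into "like" bins keyed by word.
bins = defaultdict(list)
for word, count in mapped:
    bins[word].append(count)

# Reduce phase: sum the counts in each bin to produce the summary.
summary = {word: sum(counts) for word, counts in bins.items()}

print(summary)  # {'boat': 3, 'yacht': 1, 'lake': 2}
```

In a real Hadoop cluster the map and reduce steps run in parallel on many machines, and the shuffle moves data across the network between them; the logic, however, is exactly this simple.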
Owen O'Malley, architect for MapReduce and security, says, "Hadoop lets you deal with volume, velocity and variety of data. It transforms commodity hardware and provides automatic failover."
Some common technical terms you will hear related to the topic of "Big Data" are listed below.
HDFS - Hadoop Distributed File System
Pig - A high-level scripting language whose programs compile down to MapReduce jobs
Hive - A SQL-like query layer that translates queries into MapReduce jobs
HBase - A scalable, distributed database that provides a simple interface to data
ZooKeeper - Coordinates the activity of servers in the cluster
HCatalog - A table and metadata management layer built on Hive's metastore
Mahout - A machine learning library
Sqoop - A tool for bulk-transferring data between Hadoop and relational databases
Cascading - An application framework for building data workflows that run as MapReduce jobs
Oozie - A workflow scheduler that coordinates chains of MapReduce jobs
At a recent conference I attended, business users interchangeably used the word "e-commerce" to refer to big data. For clarification on common big data myths, look for my article addressing these misconceptions in the December issue of CGT's Straight Talk.