The terms “Big Data,” “Data Warehousing,” and “Data Lake” are often used interchangeably these days. However, they are not the same thing.
The term “Data Lake” describes a large repository of data, both structured and unstructured. It is a place to put data in case you need it someday. In practice, it is more like a repository of replicated data.
A “Data Warehouse” is a structured database, designed and used for reporting and analytics. It is the single source of truth. It incorporates a sound methodology and process whereby data from transactional systems is extracted (typically on a nightly basis), cleansed, transformed into a consistent format and loaded into a data model designed for analytics. A data warehouse can integrate data from internal systems with data from external sources. This is not replicated data: a data warehouse pulls only the information needed for business intelligence. This should not be confused with operational reporting, but that’s another topic.
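The extract, cleanse, transform and load steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `raw_orders` rows, the `fact_sales` table and the helper function names are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3
from datetime import datetime

# Hypothetical rows as they might be extracted from transactional systems.
raw_orders = [
    {"order_id": "1001", "store": " Boston ", "sold_on": "2024-01-15", "amount": "19.99"},
    {"order_id": "1002", "store": "boston", "sold_on": "01/16/2024", "amount": "5.00"},
    {"order_id": "1003", "store": "Denver", "sold_on": "2024-01-16", "amount": ""},
]

def parse_date(text):
    """Accept ISO or US-style dates and return an ISO date string."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def cleanse_and_transform(rows):
    """Drop incomplete rows and bring the rest into one consistent format."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # cleanse: skip rows that have no amount
        clean.append((
            int(row["order_id"]),
            row["store"].strip().title(),  # normalize store names
            parse_date(row["sold_on"]),    # normalize dates
            float(row["amount"]),
        ))
    return clean

def load(conn, rows):
    """Load the cleaned rows into a table designed for analytics (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(order_id INTEGER PRIMARY KEY, store TEXT, sold_on TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, cleanse_and_transform(raw_orders))

# A reporting query against the loaded data: total sales per store.
totals = conn.execute(
    "SELECT store, ROUND(SUM(amount), 2) FROM fact_sales GROUP BY store ORDER BY store"
).fetchall()
print(totals)
```

In a real warehouse the nightly job, the data model and the cleansing rules would be far more involved. The point here is only the order of the steps.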
“Big Data” can be fed into a data warehouse. But big data is more of an adjective than a noun. It’s descriptive. In general, it refers to unstructured data, and it is usually characterized by its volume, variety, velocity and veracity. Data on the internet, for example, is unstructured, yet it is searchable. If you search for information on type 1 diabetes, the search engine looks for pages where that term is used and returns results you will likely find helpful.
Big data is constantly growing. Google developed a programming model called MapReduce and published a paper on it in 2004. It was designed to handle the huge volumes and variety of data found on the internet. MapReduce distills large data volumes by processing them in parallel across many machines. The classic example is a simple word count: the input is split, each split is mapped to words and counts, the results are shuffled into “like” bins, and each bin is reduced to a total, producing a summary of the results.
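The word count described above can be sketched on a single machine in plain Python. This is a minimal illustration of the map, shuffle and reduce phases, not Google's implementation: the function names and sample text are made up for the example, and a real MapReduce system would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle_phase(pairs):
    """Shuffle: group the counts for identical words into the same bin."""
    bins = defaultdict(list)
    for word, count in pairs:
        bins[word].append(count)
    return bins

def reduce_phase(bins):
    """Reduce: sum the counts in each bin to get one total per word."""
    return {word: sum(counts) for word, counts in bins.items()}

def word_count(splits):
    pairs = []
    for split in splits:  # each split would be a separate map task
        pairs.extend(map_phase(split))
    return reduce_phase(shuffle_phase(pairs))

# Hypothetical input, already divided into two splits.
splits = ["big data is big", "data lake and data warehouse"]
counts = word_count(splits)
print(counts)
```

Because each split is mapped independently and each bin is reduced independently, the same logic can be spread over as many workers as the data volume requires.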
A data warehouse will give you new insights into things like product performance, trade ROI and store sales. This is business intelligence, and in most cases it is best derived from a data warehouse. A data warehouse is a structured enterprise foundation. It integrates data from all necessary data sources, both internal and external, to get you the information you need. Decision makers use a data warehouse to better manage their business.
In addition to standard business insights, you can also get predictive and what-if analysis from a data warehouse. If you want to use social sentiment to determine whether sentiment is affecting sales, you need to bring in unstructured big data and align it with sales data in a data warehouse.
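As a rough sketch of what aligning sentiment with sales might look like, the Python below averages sentiment scores by week and lines them up with weekly sales totals. Everything here is assumed for illustration: the sample data, the weekly grain, the score range and the function names. How the raw social posts get scored in the first place is outside the scope of this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored social posts: (ISO week, sentiment score from -1.0 to 1.0).
posts = [
    ("2024-W01", 0.6),
    ("2024-W01", 0.2),
    ("2024-W02", -0.4),
    ("2024-W02", -0.8),
]

# Hypothetical weekly sales totals as they might come out of the warehouse.
weekly_sales = {"2024-W01": 12500.0, "2024-W02": 9800.0, "2024-W03": 11000.0}

def average_sentiment_by_week(scored_posts):
    """Summarize the unstructured feed down to one score per week."""
    by_week = defaultdict(list)
    for week, score in scored_posts:
        by_week[week].append(score)
    return {week: mean(scores) for week, scores in by_week.items()}

def align(sentiment, sales):
    """Join the two sources on week, keeping only weeks present in both."""
    return [
        (week, sentiment[week], sales[week])
        for week in sorted(sales)
        if week in sentiment
    ]

aligned = align(average_sentiment_by_week(posts), weekly_sales)
for week, score, total in aligned:
    print(week, round(score, 2), total)
```

Once both measures share the same grain, they can sit side by side in the warehouse and be analyzed together.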
The bottom line is that a data lake, a data warehouse and big data each have their own definition, and they are not the same. There is significant overlap in the terms, since they all deal with large volumes of data. But volumes vary, and they differ in velocity, variety and veracity as well. Most of all, the purpose of each is different. They will be used together in many cases and they may also share data, but each will be used for its own purpose, sometimes by the same people but often by completely different groups. If you need help understanding the difference, contact Relational Solutions for more information.