"Big data is a petabyte," "Big data is a data warehouse," "Big data is social media data." These are all incorrect statements I've heard from so-called "experts" over the past year.
It can be argued that big data has been around for many years. Big data can include internal data from old mainframes, new ERP systems and data warehouses, but it also includes data from external sources and the "new" data being generated on the internet.
Software companies referring to big data today are generally referring to unstructured data on the web. They talk about volume, variety and velocity. But it's also about veracity and the complexity of integrating that data. A "Big Data Foundation" must be in place to manage the variety, velocity, volume and veracity of data. I'll discuss that in more detail throughout this series of big data blogs.
Unstructured big data includes social media chatter from Facebook, Google+ and Twitter. It includes comments, announcements and posts from professional networking sites like LinkedIn. These are the areas people commonly think of when it comes to big data. But there are many other forms of unstructured big data.
Think about things like speech to text. Audio online is unstructured. Transcribing a recording into text makes it more structured, but it is still not in the format most databases want.
Another example is the tags, alt tags and meta keywords associated with images and video. This searchable text is an example of big data.
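To make that concrete, here is a minimal sketch of how the searchable text attached to an image could be pulled out of a web page. It uses Python's standard html.parser module. The class name SearchableTextParser, the sample page and the file name truck.jpg are all made up for illustration, and a real crawler would need to handle far messier markup.

```python
from html.parser import HTMLParser


class SearchableTextParser(HTMLParser):
    """Collects image alt text and meta keywords (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("alt"):
            # Alt text describes an image that is otherwise unsearchable.
            self.alt_texts.append(attrs["alt"])
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            # Meta keywords arrive as one comma-separated string.
            content = attrs.get("content") or ""
            self.keywords.extend(
                k.strip() for k in content.split(",") if k.strip()
            )


# Made-up sample page, not taken from any real site.
page = """
<html><head><meta name="keywords" content="big data, analytics"></head>
<body><img src="truck.jpg" alt="Delivery truck at loading dock"></body></html>
"""

parser = SearchableTextParser()
parser.feed(page)
print(parser.alt_texts)
print(parser.keywords)
```

The image itself stays unstructured, but the text pulled out this way can be loaded into an ordinary table and searched.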
Data posted on sites like YouTube is big data. Think of all the photos posted on “Instagram” and “Facebook.” Even profile pictures on LinkedIn. That’s big data. Now add geospatial or location information. This information can be used by companies to do things like track shipments, identify missing cars, monitor storms, and so forth.
How about blogs like this one? Comments on blogs are tracked by companies to determine what’s being said about them. Google indexes this content, so when you are looking for information on big data, you will find blogs like this!
Companies also use tools like Google Alerts and Tracker to figure out when competitors are making a new announcement. The same tools can be used to identify comments being made about your products.
Engineering companies can do things like scan and share schematics and blueprints over the web.
Another form of big data involves click stream analysis. Relational Solutions implemented what we believe was the first click stream analysis application back in the 1990s, for a telecom company that wanted to track where its customers were going on its website. We could figure out which pages were getting the most visits and where visitors went from each page. Today, we can do much more.
Today, click stream analysis is also used to target marketing at customers based on what they seem to like, as inferred from their clicks and the demographic information they provide. Clicks can be tied to what you buy and to information such as your age, location, where you went to high school, etc.
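The basic page-level analysis described above (which pages get the most visits, and where visitors go next) can be sketched in a few lines of Python. This is only an illustration, not the application we built for that telecom client. The click log, the visitor IDs and the function name summarize_clicks are all invented, and the sketch assumes the clicks are already sorted in the order they happened.

```python
from collections import Counter, defaultdict

# Hypothetical click log: (visitor_id, page) pairs in the order they occurred.
clicks = [
    ("v1", "/home"), ("v1", "/plans"), ("v1", "/signup"),
    ("v2", "/home"), ("v2", "/support"),
    ("v3", "/home"), ("v3", "/plans"),
]


def summarize_clicks(clicks):
    """Return (visits per page, next-page counts for each page)."""
    page_visits = Counter()
    next_pages = defaultdict(Counter)
    last_page = {}  # most recent page seen for each visitor

    for visitor, page in clicks:
        page_visits[page] += 1
        if visitor in last_page:
            # Record the move from the visitor's previous page to this one.
            next_pages[last_page[visitor]][page] += 1
        last_page[visitor] = page

    return page_visits, next_pages


page_visits, next_pages = summarize_clicks(clicks)
print(page_visits.most_common(3))   # the most visited pages
print(dict(next_pages["/home"]))    # where visitors went after /home
```

Joining these counts to customer or purchase records by visitor ID is where the targeting described above would start.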
The value gets even greater when unstructured data is integrated with other internal, structured data.
So why is “Big Data” associated with these items in the last circle depicted above, and what’s the difference in the data? Why do I have circles around the different data types, and what determines where each of these items resides? Well, it’s related to structure.
I recently heard an expert in supply chain analytics refer to “big data” as anything over a petabyte of data. I thought it was a funny comment coming from someone with no database background and very little technology insight.
A petabyte is an arbitrary number. It is just a number that focuses strictly on volume. And big data is about more than just volume.
In my series of big data blogs, I will walk you through the big data evolution. Look for my next big data blog coming out within a week.