Big Data Part 2

 

"Big Data Part 2" builds off my earlier blogs called “Before Big Data” and “Big Data Part 1.”

In this blog we will explore the different types of data and explain the differences at a high level. I thought of breaking this blog into three blogs due to length, but felt the subject matter was better served in one article.

So what's the difference in these various data types?

The first cylinder represents structured data. This includes data from ERP systems, mainframes and data warehouses. Although all of these are structured, they are not structured the same way.

In my earlier blog, "Big Data Part 1," I separated these structured data types into two separate circles. That's because they are structured differently.

ERP data and other transactional systems are structured in a way that allows for easy data entry.

Data warehouse and business intelligence solutions are structured in a way that allows for easy retrieval of information. This is why I had them in separate circles on the previous blog. That said, both transactional and analytical systems are structured.

As described in my blog on "Analytical versus Transactional Business Intelligence," ERP and other transactional data sources are designed to RUN your business. Data warehouse and business intelligence solutions are designed to help MANAGE your business. These data sources are typically stored in a traditional database and therefore have structure to them.

The second cylinder contains unstructured data. This is data mainly found out there on the web. It includes social media data such as "Tweets" and "Comments." But unstructured data also includes your activity, including your searches.

The internet captures a lot of different activity. Today, your social authority or clout can be tracked by determining how many followers you have, how many times things you post are reposted, etc. Different applications apply different algorithms, but social authority is tracked in a variety of ways.

Authority can be tracked based on the number of people you have the capacity to influence. Someone with 100 followers does not have the same clout as someone with 3000 followers, for example. However, someone who has 500 followers and regularly posts or tweets what they hear could have a higher authority ranking than someone with 3000 followers who is never online commenting.
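To make the follower example concrete, here is a minimal Python sketch. The scoring formula (followers multiplied by posts per week) and every name in it are hypothetical, chosen only to illustrate the idea; real applications apply their own algorithms.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    followers: int
    posts_per_week: float  # hypothetical measure of how active the account is


def authority_score(account: Account) -> float:
    """Hypothetical score: potential reach weighted by activity."""
    return account.followers * account.posts_per_week


# An account with many followers that never posts, and a smaller active one.
quiet = Account("quiet", followers=3000, posts_per_week=0)
active = Account("active", followers=500, posts_per_week=10)

ranked = sorted([quiet, active], key=authority_score, reverse=True)
top_name = ranked[0].name
```

Under this assumed formula, the active account with 500 followers ranks above the silent account with 3000.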

Big Data received a lot of attention in the press this summer. There were a lot of concerning stories. In June, "The Wall Street Journal" published an article reporting that the NSA, America’s National Security Agency, was obtaining a complete record of all Verizon customers and their calling history, including all local and long distance calls within the US.

This made the news because it made a lot of people upset. The idea that the government is listening in on our calls means a potential invasion of privacy. The government claims it tracks and uses this information to help identify terrorists. We hope that’s true. But the fact that it has the capability and is monitoring this information can be unsettling.

Big data has also come up in recent stories associated with the monitoring of certain journalists' calls and activities. In addition, it is related to the IRS scandal, which required search capabilities that would target certain non-profit applications. Regardless of political affiliation, most people found this disturbing because targeting groups for political gain is wrong.

Monitoring these activities requires the government to leverage big data. But right or wrong, for good, for bad or for profit, the capability to capture and leverage big data does exist.

Most companies leverage big data to target market and to manage their brand and company reputations. Either way, technology exists today that allows us to track and monitor and profile just about whatever and whomever we want.

The last cylinder represents multi-structured data or hybrid data. A lot of data sources can fall into this space.

For the purposes of a consumer goods manufacturer, I used common outside data sources in the cylinder to represent hybrid data. Let's use point of sale data as an example. Point of sale (POS) data comes in from multiple retailers with varying data elements at different times of the month. Even one retailer could have multiple ways of providing POS data.

Target is a good example of the ways in which POS data can arrive. If you are a vendor for Target, you might get POS data in an EDI 852 file. You might also get POS data from Info Retriever or IRI. In addition, you might purchase data from A.C. Nielsen or Symphony IRI. All these sources contain different data elements. But they also all contain point of sale (POS) data.

Let's start with the POS data coming in from an EDI file. That EDI file is structured. However, although it’s supposed to be standardized, it is not. Different retailers provide different data. Rules aren't followed. Files can be missing days or data elements. EDI from one retailer will be different from EDI from another retailer. Also, EDI from Target today might be different from the EDI that came from Target last year. There could also be missing or duplicate data. In addition, retailers often "recast" data, etc. We classify this as "hybrid" data because of the inconsistent, loose structure of the data and all the work required to make it work well with other data.

In addition to missing, invalid or duplicated data, outside data has different hierarchies, end dates, etc. Outside data needs to align with your internal hierarchies and calendars. It also needs to be aligned with outside data sources like weather trends, currency conversion, A.C. Nielsen, Symphony IRI, NPD and other data sources.
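The kind of cleanup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout, the retailer names, the item codes and the `item_map` lookup are all hypothetical, and real POS feeds need far more handling than this.

```python
from datetime import date, timedelta

# Hypothetical POS rows: (retailer, retailer_item_code, sale_date, units)
raw_rows = [
    ("RetailerA", "A-100", date(2013, 9, 1), 12),
    ("RetailerA", "A-100", date(2013, 9, 1), 12),  # duplicate row
    ("RetailerA", "A-100", date(2013, 9, 3), 7),   # Sept 2 never arrived
    ("RetailerB", "B-77", date(2013, 9, 1), 5),
]

# Hypothetical lookup that aligns each retailer's item code
# with an internal product hierarchy.
item_map = {
    ("RetailerA", "A-100"): "SKU-1",
    ("RetailerB", "B-77"): "SKU-1",
}


def clean_pos(rows, mapping):
    """Drop repeated rows and translate retailer codes to internal SKUs."""
    seen = set()
    cleaned = []
    for retailer, code, day, units in rows:
        key = (retailer, code, day)
        if key in seen:
            continue  # keep only the first row for a retailer/item/day
        seen.add(key)
        sku = mapping.get((retailer, code))
        if sku is None:
            continue  # unmapped item: skipped here, reviewed by hand in practice
        cleaned.append((retailer, sku, day, units))
    return cleaned


def missing_days(rows, retailer):
    """List calendar days with no data between a retailer's first and last day."""
    days = sorted({day for r, _, day, _ in rows if r == retailer})
    if not days:
        return []
    span = (days[-1] - days[0]).days
    expected = [days[0] + timedelta(days=n) for n in range(span + 1)]
    return [d for d in expected if d not in days]


cleaned = clean_pos(raw_rows, item_map)
gaps = missing_days(cleaned, "RetailerA")
```

Here `cleaned` holds three rows mapped to the internal SKU, and `gaps` flags September 2 as a missing day for the first retailer.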

These are just a few examples of data issues that arise from outside data sources. In other words, there is some structure to it, but the structure needs to be altered to be managed, integrated into other sources and ultimately provide more value.

Watch for my blogs next week, where I explain in more detail the way big data is further defined and described by the industry.

Watch our Big Data Training 101

Transactional versus Analytical Business Intelligence

 

The easiest way to understand the difference between a transactional and an analytical system is to think of transactional systems as those applications designed to run your business and analytical systems as those designed to manage your business.

Applications like SAP, Oracle Financials, JD Edwards and JDA for example, are transactional applications. They provide reports, but they tend to be reports from their systems unless you separately acquire their data warehouse modules. In most cases, even their data warehouse modules handle their own data better than other data sources. In general, ERP (Enterprise Resource Planning) systems are systems that are modeled for data entry. They are updated constantly throughout the day.

Reports derived from these systems are reports designed to understand what is going on at this moment. For example, what time did my last truck leave? Is my manufacturing formula set correctly today? What did that last customer complain about? They answer the "What?" not the "What if?" questions. 

These are transactional reports, coming from transactional systems. They are necessary reports required to run your day to day operations. A report pulled from a transactional system at noon will produce different results than a report pulled at 12:01 because the operational system is constantly changing. Even reports pulled simultaneously will likely produce different results. That is because in a transaction system, each query can take a different path. In addition, you never know who might be updating the system at any one point in time.

We call this the “twinkling database effect.” That is because the data is constantly changing.

These “twinkling databases” are fine for pulling operational reports. But trying to produce an analytical report from a transactional system is not wise.

First, the data is formatted for data entry, not data retrieval. Therefore it can take days to query the system. In addition, an ad-hoc query against a transactional system will affect the performance of that operational system. It will also negatively affect end users using the system. The last thing you want to do is make it difficult for people to enter orders. This could have a direct and negative impact on sales. Not to mention the negative, time wasting effect it will have on other job functions.

Analytical queries against a transaction system will put an undue burden on your network. In addition, they will return inconsistent, and oftentimes inaccurate, results.

That is why data warehouse solutions became a necessity. Transaction systems are designed to RUN the business; data warehouses are designed to help MANAGE the business.

The data warehouse is modeled in a way that business users can easily find and retrieve the data they need. The underlying infrastructure of an enterprise data warehouse (EDW) offers an architecture that will align data, provide business rules and accommodate growth and change in an iterative manner.
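As a rough illustration of "modeled for entry" versus "modeled for retrieval," the Python sketch below uses the standard library's sqlite3 module. The table names, columns and sample rows are hypothetical, and a real warehouse model is much richer than one flattened table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Entry-oriented tables: each fact is stored once, which suits data entry.
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE order_line (
    order_line_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product_id INTEGER REFERENCES product(product_id),
    order_date TEXT,
    units INTEGER
);
INSERT INTO customer VALUES (1, 'Acme', 'East'), (2, 'Bolt', 'West');
INSERT INTO product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO order_line VALUES
    (1, 1, 1, '2013-09-01', 10),
    (2, 2, 1, '2013-09-01', 4),
    (3, 1, 2, '2013-09-02', 6);

-- Retrieval-oriented copy: the joins are done once, ahead of time,
-- so a business question becomes a simple query on one table.
CREATE TABLE sales_flat AS
SELECT o.order_date, c.region, p.category, p.name AS product, o.units
FROM order_line o
JOIN customer c ON c.customer_id = o.customer_id
JOIN product p ON p.product_id = o.product_id;
""")

# "How many units did we sell by region?" against the retrieval-oriented table.
units_by_region = dict(
    conn.execute("SELECT region, SUM(units) FROM sales_flat GROUP BY region")
)
```

The same answer could be pulled from the entry-oriented tables, but only by repeating the joins in every query, which is the work the warehouse model takes off the business user.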

Query tools allow for easy analysis and business intelligence. Users need fast access to reliable information with the flexibility to change the view. They need to be able to drag, drop, drill, sort, compare and ultimately learn and act on the information they are receiving.

More and more, we are hearing business analysts referred to as “Data Scientists.” This is because today, they should have the capability to think outside the box with information available to them. Rather than spending their day gathering and cobbling together reporting information, they can be freed up to analyze it. Today, data integration can be automated and put into a usable format for data exploration.

By leveraging ALL their data, companies and their Data Scientists can understand not only WHAT is happening, but WHY!

The data warehouse is fed by the operational system and typically updated on a nightly basis. Sometimes it is updated more often, but most often nightly. More and more, we see the data warehouse also being fed by other outside sources. Relational Solutions advocates leveraging all the information you have access to. Unfortunately, not all data warehouses are created equal, so it's not always as easy as it sounds.
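The difference between reporting against a live operational table and against a nightly copy can be sketched with Python's built-in sqlite3 module. The `orders` and `dw_orders` tables and the `nightly_load` function are hypothetical names used only for this illustration; a real load process involves far more than copying one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical operational table that is updated throughout the day.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(100.0,), (250.0,)])


def nightly_load(conn):
    """Copy the operational table into a warehouse table (a simple snapshot)."""
    conn.execute("DROP TABLE IF EXISTS dw_orders")
    conn.execute("CREATE TABLE dw_orders AS SELECT id, amount FROM orders")


def total(conn, table):
    # The table name comes from this script only, never from user input.
    return conn.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


nightly_load(conn)

live_at_noon = total(conn, "orders")
conn.execute("INSERT INTO orders (amount) VALUES (?)", (75.0,))  # a new order arrives
live_a_minute_later = total(conn, "orders")

# The warehouse copy does not change until the next load runs.
warehouse_total = total(conn, "dw_orders")
```

The two live totals differ because an order arrived between them, while the warehouse total stays the same until the next load, which is what makes analytical reports repeatable.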

I’m pointing out these differences because this is all background information needed to understand how companies can use big data. Big data creates a potentially “fuzzy area” for reporting depending on how it’s defined.

In my next blog I'll explain the difference between data marts and data warehouses, and how evolving data sources such as "big data" should be leveraged in your enterprise data warehouse to offer more business intelligence.

 
