
Big Data Part 7


In previous blogs we addressed the history of "Big Data." We also talked about what comprises big data and why it's important to leverage big data. In this blog we'll delve into some of the underlying technology that people refer to when discussing big data as it pertains to content on the internet and other applications that leverage unstructured data.

In 2004 Google published a white paper about MapReduce, which it developed to handle the huge data volumes generated on the internet. Google handles tens of thousands of search requests every second, which requires it to be able to search billions of sites for keywords. Google also tracks clicks, meta tags, descriptions, etc. All of this requires Google to handle massive data volumes.

Google needed to be able to spread processing across commodity hardware. Hadoop, an Apache project, is an open-source framework with two core components. First is a large-scale, distributed file system called HDFS (Hadoop Distributed File System), which spreads data storage across hardware.

Second is MapReduce, which distributes the processing of large data sets across multiple servers, then reduces and distills the data into results.

MapReduce reduces the data into results and creates a summary of the data. The classic illustration is a simple word count.

On the left side of this picture is a set of words. Imagine this is your data input: simply text with no structure. You see "boat," "yacht," "lake" and so on.

MapReduce first splits the data as you can see in the second phase of this diagram. It then maps and counts the data. Next, it shuffles the data into “like” bins. It then reduces the data based on counts. And finally, it produces a summary of the results.
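The phases above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop itself; in real MapReduce each phase runs in parallel across many servers, but the split, map, shuffle and reduce steps are the same:

```python
from collections import defaultdict

def map_phase(text):
    """Map: emit a (word, 1) pair for every word in an input split."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle_phase(pairs):
    """Shuffle: group the pairs into 'like' bins keyed by word."""
    bins = defaultdict(list)
    for word, count in pairs:
        bins[word].append(count)
    return bins

def reduce_phase(bins):
    """Reduce: sum the counts in each bin to produce the summary."""
    return {word: sum(counts) for word, counts in bins.items()}

# Unstructured text input, already divided into splits
splits = ["boat yacht lake", "lake boat boat"]

# In Hadoop, each split would be mapped on a different server
mapped = [pair for split in splits for pair in map_phase(split)]
shuffled = shuffle_phase(mapped)
summary = reduce_phase(shuffled)
print(summary)  # {'boat': 3, 'yacht': 1, 'lake': 2}
```

The key design point is that every map call is independent of the others, which is what lets Hadoop farm the work out across commodity hardware and tolerate the failure of any single machine.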

Owen O'Malley, an architect for MapReduce and security, says, "Hadoop lets you deal with volume, velocity and variety of data. It transforms commodity hardware and provides automatic failover."

Some common technical terms you will hear related to the topic of "Big Data" are listed below.

HDFS - Hadoop Distributed File System

PIG - A high-level data flow language whose scripts are converted into MapReduce jobs

HIVE - A SQL-like query language that transforms queries into MapReduce jobs

HBASE - A scalable, distributed database that provides a simple interface to data

Zookeeper - Coordinates the activity of servers

HCatalog - A table and metadata management layer built on Hive's metastore

Mahout - The machine learning library

Sqoop - A tool for transferring bulk data between Hadoop and relational databases

Cascading - An application framework that abstracts the complexity of writing MapReduce jobs

Oozie - A workflow coordinator used to schedule and chain MapReduce jobs

At a recent conference I attended, business users used the word "e-commerce" interchangeably with big data. For clarification on common big data myths, look for my article addressing common misconceptions in the December issue of CGT's Straight Talk.

Sign up for a Demo & See How BlueSky Integration Studio Integrates Big Data

Big Data Part 6


So, most companies define big data as volume, variety and velocity. I add one more key element that comprises "Big Data": complexity.

Volume, variety and velocity, along with complexity, make up big data. You might think that variety covers complexity, but it doesn't. Making social media data and other big data work with your business to provide value involves a lot of complexity.

There are some vendors out there saying that infrastructure is not important. They are wrong. Maybe someday all data will be in the cloud, but that is not realistic today or in the near future. Every company has an internal infrastructure that includes SQL Server, SAP, Oracle, DB2, etc. Those companies will not be phasing out their internal IT departments any time soon.

Therefore, it's necessary for all big data to co-exist and to work together.

When data needs to be put into a usable format to be integrated with internal data, many alignments and rules must be applied. Business rules, code, processes and mappings need to be written to extract, load, clean and refine the data. All the variety of data needs to be transformed into a common format for analysis. There is also metadata, which is data about the data, that needs to be managed. The processes and data models designed to align data and put it into a usable format become new data themselves.
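As a rough illustration of those mapping rules (the source names, field names and mapping table here are all hypothetical), aligning records from two differently formatted POS feeds to one common schema might look like this:

```python
# Hypothetical mapping rules: each source's raw fields aligned to one common schema.
# Note the mappings themselves are new data that must be documented and maintained.
FIELD_MAPPINGS = {
    "retailer_a": {"UPC": "product_id", "Qty": "units_sold", "Wk": "week"},
    "retailer_b": {"item_code": "product_id", "sold": "units_sold", "period": "week"},
}

def transform(source, record):
    """Apply a source's mapping rules to align one raw record to the common format."""
    mapping = FIELD_MAPPINGS[source]
    return {common: record[raw] for raw, common in mapping.items()}

# Two feeds with different layouts become rows with identical keys
row_a = transform("retailer_a", {"UPC": "012345", "Qty": 40, "Wk": "2013-W22"})
row_b = transform("retailer_b", {"item_code": "012345", "sold": 12, "period": "2013-W22"})
# Both rows now share the keys product_id, units_sold and week, so they can be combined
```

Real integrations also clean values, convert units and validate records, but even this minimal sketch shows how the mapping layer itself becomes an asset that has to be managed alongside the source data.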

Volume, variety and velocity focus strictly on the source, but everything that makes that data usable alongside internal data is also new data. That new data needs to be structured, documented, maintained and managed. This is a complexity that adds to "Big Data."

Here are a few examples of where CPG companies add to their big data due to complexity: aligning hierarchies, integrating internal master data with retailer master data, and comparing sales with sentiment as well as promotions and pricing. Each illustrates the complexity involved in getting more value out of big data.

Therefore, what comprises big data includes volume, variety, velocity & complexity!

We can’t talk about “Big Data” without talking about Hadoop and MapReduce. So first, what is Hadoop? We’ll describe that in next week’s blog, “Big Data, Part 7.”

Big Data Part 5


As described in the last two blogs, big data comprises volume and variety. But it also includes velocity and another key characteristic that will be described in this blog and the next. In this blog we examine the fact that big data gets even bigger when you start to consider timing.

It’s not just volume and variety; it’s velocity: the speed with which data is coming in. ERP data is updated every second of the day by multiple users across the company. POS data can come in daily, weekly or monthly, sometimes more often if you’re able to download information directly from retailers’ portals. Pricing information and zip code information might come in quarterly or annually.

Online data? That’s an entirely new story. Click stream analysis is happening constantly. Twitter feeds and hashtags happen sporadically throughout the day. For smaller companies, new “mentions” might only happen a few times a month. Regardless of frequency, these comments still need to be monitored.

Larger companies, with more popular brands and a wide customer base, will have regular
comments about their products throughout the day. Depending on the number of brands you have, you may have hundreds of “mentions” per minute.

A large consumer goods company must have the ability to respond to negative sentiment almost immediately. At the very least, your social listening group should be checking your Facebook page every half hour for negative sentiment (if it isn’t, this needs to change).

Negative comments on your Facebook page should not be there long enough for others to
“Like” them.

I was recently at a customer that had implemented a “Social Listening” team about a month earlier. The team consisted of about a dozen people. During the meeting, I logged on to their Facebook page and pointed out that they had a very negative comment right at the top of it, and that it had been there for over two hours. In those two hours, the comment had received over 100 “Likes,” and several other negative comments had been posted along with it. Although I advised them to address it immediately, the comment, amazingly, remained on their Facebook page throughout the entire three-hour meeting. I could not believe that no one in the room logged in to fix the problem right away, even after I made them aware of it and told them they should address it.

Anyway, had this comment been caught earlier, they could have deleted it and blocked the “Follower.” I actually believe that post was made by one of their competitors, because it was a very generic negative comment, completely unrelated to a bad experience or recent event. Most genuine negative comments are tied to a bad experience or recent event.

Social media has made it very easy to influence a company’s reputation. Managing your
“Social Reputation” is more important today than ever before. But to manage it, you must be aware of it and have the ability to respond.

We recommend tools that make you aware of negative comments by automating monitoring and alerting you when your company or brands are mentioned. Once you are aware of what's being said, you can begin to manage your social reputation. You can then take it to the next level and start analyzing it so you can make those comments "work" for you.

If you can take those comments and analyze who is saying what, and where that sentiment is coming from, you can start to leverage that social media data to your benefit. Imagine if you could identify where negative sentiment is coming from, read the associated comments, and call the store manager to let them know that your customers can't find your product, or perhaps that the retailer's customers aren't coming in because of untidy conditions. Or perhaps there is a picket line that is costing them more business than they realize.
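As a minimal sketch of that idea (the mention feed, sentiment labels, brand and store names below are all hypothetical; in practice a social listening tool would supply this data and score the sentiment), surfacing which locations generate the most negative mentions might look like:

```python
from collections import Counter

# Hypothetical mentions pulled from a social listening feed,
# already tagged with sentiment and a store location
mentions = [
    {"text": "Can't find BrandX anywhere in this store", "sentiment": "negative", "store": "Store 14"},
    {"text": "Love the new BrandX flavor!",               "sentiment": "positive", "store": "Store 14"},
    {"text": "Shelves a mess, no BrandX again",           "sentiment": "negative", "store": "Store 14"},
    {"text": "BrandX out of stock",                       "sentiment": "negative", "store": "Store 7"},
]

def negative_by_store(mentions):
    """Count negative mentions per store so the worst locations surface first."""
    return Counter(m["store"] for m in mentions if m["sentiment"] == "negative")

alerts = negative_by_store(mentions)
for store, count in alerts.most_common():
    print(f"{store}: {count} negative mention(s) - worth a call to the store manager")
```

The point is not the code but the workflow: once mentions are monitored and tagged, even a simple aggregation tells you where to pick up the phone.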

Social media can and should work toward your benefit. It can help save time. It can make you aware of conditions where people are unhappy with you, your retailer, your supply chain, your new marketing, etc. 

These comments, clicks, likes and so on arrive rapidly, but that doesn't mean they should be ignored. There are ways to monitor and integrate this data for your benefit, and if done correctly, the benefit can be great.

So, most companies stop there and define big data as volume, variety and velocity. I add one more key component to “Big Data”: complexity. Watch for my next blog, “Big Data Part 6,” to learn how complexity factors into big data.


Big Data Part 4


“Big Data” is about volume, but it’s more than that... Big data is also about variety and a couple of other characteristics that will be described in follow-up blogs…

CPG companies are no strangers to variety. In addition to their own internal variety of data residing in databases such as Access, Excel, Oracle, mainframes, Teradata, DB2, Netezza and so forth, they have multiple applications such as trade promotion management, supply chain, manufacturing, planograms, CRM, forecasting and a slew of others.

In addition, the variety of data coming in from point of sale (POS) sources includes retailer files such as EDI 852 files, EDI 867 files, AS2 transfers, flat files, other EDI files, retailer portal downloads and syndicated data from AC Nielsen, IRI, NPD and others. Most companies are also buying competitive market data, demographic information, surveys, weather trends and currency conversion information, and might even be trying to integrate emerging market data. In addition, you might have space information, displays and diagrams that are unstructured or semi-structured.

Those are all examples of various data sources that have existed over the years. Some of these sources are newer than others. But the newest variety of data is coming in via the web. These sources are coming from various applications that track your “Social Reputation,” clicks, and media presence to name a few.

Marketing teams also have ads, including print, online ads, TV commercials and radio spots. They might also have targeted online marketing on social media, including offers on web sites, mobile offers, YouTube videos, etc. All these sources come in different data formats containing different information. All of this adds up to a lot of variety.

Big data just got bigger with more variety from the internet. In these last two blogs we discussed volume and variety, but it's also about velocity and one other key characteristic that will be discussed in the next two blogs. Watch for our next blog, Big Data Part 5 on velocity.

