What exactly is big data?
Welcome to the Introduction to Big Data course from AdminTome Online Training!
My name is Bill Ward and I will be your instructor for this course.
In this section, we will answer the question “What exactly is Big Data?”
Let’s start with the introduction to this section.
First let’s discuss the learning objectives for this course.
After this section you will be able to define what Big Data is and why it is important.
In addition, you will be able to discuss how big data is collected.
Finally, we will wrap up this section by walking through several advances in technology that accommodate big data.
My name is Bill Ward and I own a blog called admintome.com.
I have been blogging since 2013, and I like to write highly technical articles on big data, DevOps, development, and more.
I work with big data at a large retail company as a senior systems engineer.
Currently, I am writing a book on big data in action.
Why did I create AdminTome Online Training?
I wanted to create budget friendly courses that offered high value.
That is why this particular course is free.
AdminTome is based on the premise of Universal Technical Knowledge for all, and AdminTome Online Training pursues that goal by making highly technical courses available to engineers on any budget.
I strive to provide high-quality courses with tons of content.
I wanted to offer more, and better, content than you could find on YouTube.
To that end, as you are taking this course, please provide me feedback through the link on the left or by sending me emails.
Your input will help me make the best course possible.
Lastly, my courses have content that you can use in your job today.
I want to make sure that you are not just learning theory, but you are gaining skills that you can use in your current job or skills that will help you get the job you want.
Let’s get this course started!
In the first part of this lesson we will discuss how we collect big data. But first we will discuss how we used to collect data.
Companies used to think of data as a single lane highway.
We only wanted data that we thought we cared about.
For example, most companies know that it is a good idea to collect logs from applications, devices, and operating systems.
Data like performance metrics may not have been collected.
Other data, like social media shares or discussions, was not collected because it was not deemed important.
One reason for this was the high cost of storage: it wasn’t very cost effective to collect a ton of data.
But times change, and prices come down.
Today storage is cheaper and with commodity servers compute resources are more accessible.
Today companies are realizing the importance of collecting as much data as possible.
The more data we have the better business decisions we can make.
For example, we now have the ability to collect our customers’ shopping habits, tie that data into social media trends about our products, and use the combined data to estimate how likely a given customer is to switch to one of our competitors.
This is possible because we are collecting more data from more sources.
With examples like this in mind, we know that the more data the better.
This increase in data gives us deeper business insights and lets us make better business decisions, which is the true value of big data.
The value isn’t in the raw data itself; it is in the insights we gain by analyzing that raw data, insights that help us make better business decisions.
So, what makes big data big?
Any discussion of big data will invariably lead to a discussion of the four V’s of big data.
In 2001, analyst Doug Laney, then at META Group (later acquired by Gartner), introduced the idea of the three V’s in a research publication titled “3-D Data Management: Controlling Data Volume, Velocity and Variety.”
Recently, a fourth V has been added: Value or Veracity.
Basically, these help us identify big data.
Let’s discuss these in more detail.
First on the list is Volume.
Naturally, when we think of big data we think of large volumes of data. A good rule of thumb to use is if your data is more than one terabyte in size then you most likely have big data.
Big data is data that is too big for traditional data management tools, such as SQL databases, to handle.
As an example, the twitter.com statistics page states “Every second, on average, around 6,000 tweets are tweeted on Twitter which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day and around 200 billion tweets per year.”
Now that is what I call big data!
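Those quoted figures are easy to sanity-check with a bit of arithmetic. A quick sketch, starting only from the 6,000-tweets-per-second average in the quote (the yearly figure rounds up to the quoted 200 billion):

```python
# Rough scale check for the quoted Twitter figures,
# starting from ~6,000 tweets per second on average.
tweets_per_second = 6_000

tweets_per_minute = tweets_per_second * 60           # 360,000 -> "over 350,000"
tweets_per_day = tweets_per_second * 60 * 60 * 24    # ~518 million -> "~500 million"
tweets_per_year = tweets_per_day * 365               # ~189 billion -> "~200 billion"

print(f"{tweets_per_minute:,} per minute")
print(f"{tweets_per_day:,} per day")
print(f"{tweets_per_year:,} per year")
```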
The next V is variety.
Data comes in many different forms.
There is structured data like tables in a SQL database or CSV files, and semi-structured data like JSON or YAML files.
There is also unstructured data like photos, videos, and sound recordings.
Finally, there is streaming data. This is structured or unstructured data that comes in real-time. This data can be video camera feeds, temperature sensors, etc.
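To make the variety idea concrete, here is a minimal sketch using only Python’s standard library: the same kind of record as structured CSV and semi-structured JSON, versus unstructured raw bytes with no schema at all (the field names and values are made up for illustration):

```python
import csv
import io
import json

# Structured: rows and columns with a fixed schema.
csv_text = "user_id,product,price\n42,widget,9.99\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, nesting allowed, schema can vary per record.
json_text = '{"user_id": 42, "product": "widget", "tags": ["sale", "new"]}'
record = json.loads(json_text)

# Unstructured: just bytes (imagine a photo or a sound recording).
# There are no named fields to query; analysis needs specialized processing.
blob = b"\x89PNG\r\n..."

print(rows[0]["product"], record["tags"], len(blob))
```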
Which leads us to Velocity.
Big data deals with data that can arrive at high velocity.
This means that data is received in real-time.
Examples of high velocity data are video streams, social media feeds, and sensor data.
These types of data arrive too fast to be processed by traditional means, so analyzing them efficiently enough to make informed business decisions quickly requires new approaches.
The last V is Value or Veracity.
Collecting ALL data can be costly.
We want to make sure that the data we are getting has value to us.
We don’t want to collect data that is meaningless.
We also want to make sure that the data we collect is accurate.
As an example, collecting data about the room temperature of our server room may not have that much value for predicting our customer behavior.
The four V’s present some challenging problems for dealing with Big Data.
We have large volumes of a variety of data arriving at high velocity.
So, the IT industry has developed new technologies to overcome these challenges.
In this section we will discuss these new technologies.
I will just briefly mention the new technologies here.
In a later section of the course we will go into more detail on the most popular of these.
Starting with streaming data, there is Apache Spark, which provides an excellent streaming library to handle your real-time streaming data.
And as you can see, it also has another library for analyzing the data that you have collected.
Another player in analyzing big data is Apache Hadoop, which uses the MapReduce algorithm to analyze data in a distributed manner. We will also be discussing this in more depth much later.
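To give a feel for the MapReduce idea before we cover Hadoop properly, here is a toy word count in plain Python. This is not Hadoop itself, just the map/shuffle/reduce pattern that Hadoop distributes across a cluster of machines:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all counts by word (Hadoop does this across machines)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word, independently per word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data lakes hold big data"]
word_counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(word_counts)  # {'big': 3, 'data': 3, 'is': 1, 'lakes': 1, 'hold': 1}
```

Because each word’s reduce step is independent, the work can be split across many commodity servers, which is exactly what makes the approach scale.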
Machine learning helps us get a deeper understanding of our data by using artificial intelligence algorithms to uncover new insights. There is a complete introduction to artificial intelligence later in this course.
Apache Spark provides a machine learning library called MLlib that lets you write programs in Python, Scala, R and Java that make use of these machine learning algorithms.
In addition, there is a Python library called scikit-learn that you can use to write Python programs that make use of machine learning.
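To illustrate the kind of algorithm these libraries wrap, here is a tiny least-squares linear regression written in plain Python. This is not the MLlib or scikit-learn API, just the core idea of fitting a model to data, which those libraries do at scale with far more sophisticated models; the ad-spend numbers are made up for illustration:

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data: ad spend vs. sales, following y = 2x + 1 exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```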
As big data has become more mainstream, we have seen a rise in the need for data scientists. But what exactly is a data scientist?
Having tons of data is good, but there needs to be someone who truly understands that data: someone with domain knowledge of the data who can design systems to analyze it and provide meaningful business insights. That is what a data scientist is.
As you can imagine, true data scientists are hard to come by. Any data engineer can stand up a big data infrastructure that can be used to write big data applications.
But someone who truly knows their domain well enough to properly analyze big data and deliver meaningful business insights is hard to find.
That is why they are paid so highly.
One last technology to cover is the idea of a data lake.
A data lake is a huge storage solution that is used to store all your data. If you think of your data as coming in as streams of data, then it would be natural to think of combining those streams into a lake.
We can pull data from this lake, transform it to something usable, analyze it, then output the resulting data in a meaningful format that we can use to make business decisions from.
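The pull/transform/analyze flow just described can be sketched in miniature. Here the “lake” is just a Python list of raw, heterogeneous event records (in practice it would be distributed storage such as HDFS or S3), and all field names and values are invented for illustration:

```python
# Raw records as they landed in the lake: mixed event types, string fields.
raw_lake = [
    {"event": "purchase", "amount": "19.99", "region": "east"},
    {"event": "purchase", "amount": "5.00", "region": "west"},
    {"event": "page_view", "region": "east"},   # no amount; not usable here
    {"event": "purchase", "amount": "12.50", "region": "east"},
]

# Transform: keep only the usable records and coerce fields into typed values.
purchases = [
    {"amount": float(r["amount"]), "region": r["region"]}
    for r in raw_lake
    if r["event"] == "purchase"
]

# Analyze: aggregate into something a business decision can be based on.
revenue_by_region = {}
for p in purchases:
    revenue_by_region[p["region"]] = (
        revenue_by_region.get(p["region"], 0.0) + p["amount"]
    )

print(revenue_by_region)
```

The key property of the lake is that the page-view record was kept even though this particular analysis ignores it; a future analysis can go back and use it.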
Let’s review what we have learned in this section.
We learned that companies have changed how they view data and have therefore changed how they collect that data.
The four V’s have shown us that big data is composed of high volumes of a variety of data that can arrive in real time and must provide value to the organization.
Lastly, we discussed the new technologies that have been developed to deal with issues collecting and analyzing big data.
In the next topic of this course we will go into detail discussing some of these big data applications in use today.
Thanks for watching this section and I will see you in the next section.
Remember to provide feedback using the link to the right so that I can improve the course. I would love to hear from you.