What Is Big Data and How It Can Be Managed

Sep 17, 2020

By Rupali

Hello everyone!

In this article we are going to discuss the following points:

What is Big Data
Types of Big Data
Characteristics of Big Data
Generation of Big Data on Social Media
How it can be managed

What is Big Data

Big Data is data of a very large size. The term describes a collection of data that is huge in volume and still growing exponentially with time. In short, such data is so large and complex that traditional data management tools cannot store or process it efficiently.

Types Of Big Data

Big Data can be found in three forms:

1. Structured

Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data.

2. Unstructured

Any data whose form or structure is unknown is classified as unstructured data. Besides being huge in size, unstructured data poses multiple challenges when it is processed to derive value from it.

A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos etc.

3. Semi-structured

Semi-structured data can contain both forms of data. It looks structured in form, but it is not actually defined by a fixed schema such as a table definition in a relational DBMS.
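To make the three forms concrete, here is a minimal Python sketch. The sample records (a small CSV table, a JSON list, and a free-text note) are made up purely for illustration and do not come from any real data set.

```python
import csv
import io
import json

# Structured: every row follows the same fixed set of columns.
csv_text = "id,name,age\n1,Asha,29\n2,Ravi,34\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record carries its own field names, but no table
# definition forces the records to share the same fields.
json_text = (
    '[{"id": 1, "name": "Asha", "tags": ["admin"]},'
    ' {"id": 2, "email": "ravi@example.com"}]'
)
docs = json.loads(json_text)

# Unstructured: free text with no fields at all; any structure has to be
# extracted by the processing code itself.
note = "Met Asha on Monday, she said the report is almost done."
words = note.split()

print(rows[0]["name"])         # a column can be addressed by name
print(sorted(docs[1].keys()))  # the second record has different fields
print(len(words))              # only a crude word split is available
```

Note how the structured rows can be queried by column name straight away, while the free-text note needs extra processing before anything useful can be read out of it.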

Characteristics Of Big Data

(i) Volume: The name Big Data itself refers to an enormous size. The size of data plays a crucial role in determining the value that can be drawn from it, and whether a particular data set counts as Big Data at all depends on its volume. Hence, ‘Volume’ is one characteristic that needs to be considered when dealing with Big Data.

(ii) Variety: Variety refers to the heterogeneous sources and nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications.

(iii) Velocity: The term ‘velocity’ refers to the speed at which data is generated. How fast data is generated and processed to meet demand determines the real potential in the data.

Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.

(iv) Variability: This refers to the inconsistency that data can show at times, which hampers the ability to handle and manage the data effectively.

Now, let’s see how this Big Data is generated.

Generation of Big Data on Social Media

1. Facebook

Facebook has revealed some big stats on big data to reporters at its HQ, including that its system processes 2.5 billion pieces of content and more than 500 terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour.

2. WhatsApp

WhatsApp users send about 65 billion messages per day. Another commonly quoted figure is an average of 29 million WhatsApp messages per minute.

3. Twitter: 12 Terabytes Per Day

One wouldn’t think that 140-character messages add up to large stores of data, but it turns out that the Twitter community generates more than 12 terabytes of data per day. That equals 84 terabytes per week and 4368 terabytes (about 4.3 petabytes) per year.

4. Snapchat

According to Snapchat, users open the app more than 20 times a day on average and spend an average of 30 minutes using it. Snapchat users send an average of 4 billion Snaps on a daily basis.

5. Instagram

Instagram has 500 million daily Instagram Stories users.
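The per-day figures above are easier to compare once they are converted to a common unit. The small Python sketch below redoes the arithmetic from the list. The helper names are hypothetical, and it assumes decimal units (1 petabyte = 1000 terabytes) and a 52-week year, which is how the Twitter yearly figure appears to have been computed.

```python
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60


def per_week(per_day):
    """Scale a daily amount to a 7-day week."""
    return per_day * 7


def per_year_52_weeks(per_day):
    """Scale a daily amount to a 52-week year (364 days, not 365)."""
    return per_week(per_day) * 52


def per_minute(per_day):
    """Average amount per minute for a given daily amount."""
    return per_day / MINUTES_PER_DAY


# Twitter: 12 terabytes per day.
twitter_tb_per_day = 12
twitter_tb_per_week = per_week(twitter_tb_per_day)
twitter_tb_per_year = per_year_52_weeks(twitter_tb_per_day)
twitter_pb_per_year = twitter_tb_per_year / 1000  # assumes decimal petabytes

# WhatsApp: 65 billion messages per day.
whatsapp_msgs_per_day = 65_000_000_000
whatsapp_msgs_per_minute = per_minute(whatsapp_msgs_per_day)

# Facebook: 500 terabytes per day, expressed as gigabytes per second.
facebook_tb_per_day = 500
facebook_gb_per_second = facebook_tb_per_day * 1000 / SECONDS_PER_DAY

print(twitter_tb_per_week, twitter_tb_per_year, twitter_pb_per_year)
print(round(whatsapp_msgs_per_minute / 1e6, 1), "million messages per minute")
print(round(facebook_gb_per_second, 2), "GB per second")
```

One thing this arithmetic shows: 65 billion WhatsApp messages per day works out to roughly 45 million per minute, which is higher than the 29 million per minute figure, so the two WhatsApp numbers probably come from different reports.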

How to manage Big Data!

There are many tools that help us manage Big Data, such as Hadoop, Scuba, Cassandra, Hive, and Prism. Here we are going to discuss Apache Hadoop.

Apache Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
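The programming model most closely associated with Hadoop is MapReduce. The sketch below imitates its map, shuffle, and reduce phases in plain Python on a single machine, using the classic word count example. It does not use any Hadoop API, and the function names and sample input are assumptions made only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases one after another in a single process."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is growing"]
counts = word_count(lines)
print(counts)
```

In an actual Hadoop cluster the map and reduce steps would run in parallel on many machines, each working on the part of the data it stores locally. Here everything runs in one process, so the sketch only shows the shape of the model, not its scale or its failure handling.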