Big data analytics: What it is and why it matters? Extended RDBMS Architecture

BIG Data Analytics

BIG Data Analytics is like a mountain of data that you have to climb. The height of the mountain keeps increasing every day, and you have to climb it in less time than yesterday!

  • This is the kind of challenge big data throws at us
  • It has forced us to go back to the drawing board and design every aspect of a compute system afresh
  • It has spawned research in every sub-area of Computer Science



What Is Big Data ?

There is no consensus as to how to define big data
“Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.”   – Teradata Magazine article, 2011
“Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.”  – The McKinsey Global Institute,  2011
 
BIG DATA poses a big challenge to our capabilities

  • Data is scaling faster than compute resources
  • CPU speeds are not increasing either

At the same time, BIG DATA offers a BIGGER opportunity to

  • Understand nature
  • Understand evolution
  • Understand human behavior/psychology/physiology
  • Understand stock markets
  • Understand road and network traffic
  • Opportunity for us to be more and more innovative.

 


Big Data – Sources

  • Telecom data (≈4.75 bn mobile subscribers)
  • There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS and IMs
  • IP Network Traffic: up to 1 Billion packets per hour per router. Each ISP has many (hundreds) routers!
  • WWW
  • Weblog data (160 mn websites)
  • Email data
  • Satellite imaging data
  • Social networking sites data
  • Genome data
  • CERN’s LHC (15 petabytes/year)
  • Just a Hype?
  • Or a real Challenge?
  • Or a great Opportunity?
  • Challenge in terms of how to manage & use this data
  • Opportunity in terms of what we can do with this data to enrich the lives of everybody around us and to make our mother Earth a better place to live
  • We are living in a world of DATA!!
  • We are generating more data than we can handle!
  • Using this data to our benefit is still a far cry!
  • In the future, everything will be data-driven
  • High time we figured out how to tame this “monster” and use it for the benefit of the society.

Another Interesting Quote
“We don’t have better algorithms. We just have more data.”
– Peter Norvig, Director of Research, Google
 


Analyzing BIG DATA

Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed*
*Challenges and Opportunities with Big Data
  A community white paper developed by leading researchers across the United States

Big Data Analysis

[Figure: the big data analysis pipeline. Source: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States.]
 
BIG DATA is spawning research in:

  • Databases
  • Data Analytics (Data Warehousing, Data Mining & Machine Learning)
  • Parallel Programming & Programming Models
  • Distributed and High Performance Computing
  • Domain Specific Languages
  • Storage Technologies
  • Algorithms & Data Structures
  • Data Visualization
  • Architecture
  • Networks
  • Green Computing

Data Warehousing, Data Mining & Machine Learning are at the core of BIG Data Analytics
 


Why Cluster Computing ?

  • Scalable
  • Only way forward to deal with Big Data
  • Embrace this technology until a new disruptive/revolutionary technology surfaces

 


Extended RDBMS Architecture

  • Support for new data types required by BIG data
    • Vectors & matrices
    • Multimedia data
    • Unstructured and semi-structured data
    • Collections of name-value pairs, called data bags
  • Provide support for processing new data types within the DBMS inner loop by means of user-defined functions (a sketch follows below)
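As a rough illustration of the idea (not tied to any particular big-data RDBMS), the sketch below uses SQLite's Python bindings to register a user-defined function that processes a semi-structured "data bag" (a JSON object of name-value pairs) inside the query engine; the table and function names here are made up for the example.

import json
import sqlite3

def tag_count(doc):
    # Count the name-value pairs in a JSON "data bag" stored as TEXT
    return len(json.loads(doc))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.execute("""INSERT INTO events VALUES
    (1, '{"user": "a", "page": "x"}'),
    (2, '{"user": "b"}')""")

# Register the UDF so it runs inside the DBMS inner loop for every row
conn.create_function("tag_count", 1, tag_count)

for row in conn.execute("SELECT id, tag_count(payload) FROM events"):
    print(row)        # (1, 2) and (2, 1)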

 


MapReduce

  • What’s there in a name?
  • Everything!!
  • Map + Reduce
  • Both are functions used in “functional programming”
  • Both have primitives in LISP & other functional PLs
  • So what is functional programming?

Google paper: Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI, 2004
 


Functional Programming

  • Functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions
  • LISP is an interactive PL: you type an expression and the interpreter returns its value
  • Expression → Value
  • (function arg1 arg2 … argn)
  • (+ 12 23 34 45) → 114 (a Python analogue is sketched below)
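Purely as an analogy (Python rather than LISP), the same expression-to-value evaluation can be written as a fold of the + function over the argument list:

from functools import reduce
from operator import add

# The LISP form (+ 12 23 34 45) applies + to all its arguments;
# the Python analogue folds add over the same numbers.
print(reduce(add, [12, 23, 34, 45]))   # 114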

Scheme is a dialect of LISP; LISP itself is the second-oldest high-level PL that is still in use

  • In the early days computer use was very expensive, so it was natural to have the programming language resemble the architecture of the computer as closely as possible.
  • A computer consists of a central processing unit and a memory.
  • Therefore a program consisted of instructions to modify the memory, executed by the processing unit
  • With that the imperative programming style arose
  • Imperative programming languages, like Pascal and C, are characterized by the existence of assignments, executed sequentially.
  • Functions express the connection between parameters (the ‘input’) and the result (the ‘output’) of certain processes.
  • In each computation the result depends in a certain way on the parameters. Therefore a function is a good way of specifying a computation.
  • This is the basis of the functional programming style.
  • A ‘program’ consists of the definition of one or more functions
  • With the ‘execution’ of a program the function is provided with parameters, and the result must be calculated.
  • With this calculation there is still a certain degree of freedom
  • For instance, why would the programmer need to prescribe in what order independent subcalculations must be executed?
  • The theoretical basis of imperative programming was already founded in the 1930s by Alan Turing (in England) and John von Neumann (in the USA)
  • The theory of functions as a model for calculation also dates from the 1920s and 1930s. Some of the founders are M. Schönfinkel (in Germany and Russia), Haskell Curry (in England) and Alonzo Church (in the USA)
  • The language Lisp of John McCarthy was the first functional programming language, and for years it remained the only one
  • ML, Scheme (an adjustment to Lisp), Miranda and Clean are other examples of functional programming languages
  • Haskell (the first unified functional PL) & Gofer (a simplified version of Haskell)
  • ML & Scheme have overtones of imperative programming languages and therefore are not purely functional
  • Miranda is a purely functional PL!

 

What is Gofer ?

? 5+2*3
11
(5 reductions, 9 cells)
?

  • The interpreter calculates the value of the expression entered, where * denotes multiplication.
  • After reporting the result (11) the interpreter reports the calculation took ‘5 reductions’ (a measure for the amount of time needed) and ‘9 cells’ (a measure for the amount of memory used)
  • The question mark shows the interpreter is ready for the next expression

? sum [1..10]
55
(91 reductions, 130 cells)

  • In this example [1..10] is the Gofer notation for the list of numbers from 1 to 10.
  • The standard function sum can be applied to such a list to calculate the sum (55) of those numbers.
  • A list is one of the ways to compose data, making it possible to apply functions to large amounts of data.
  • Lists can also be the result of a function:

? sums [1..10]
[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
(111 reductions, 253 cells)

  • The standard function sums returns, besides the sum of the numbers in the list, all the intermediate sums as well.

? reverse (sort [1,6,2,9,2,7])
[9, 7, 6, 2, 2, 1]
(52 reductions, 135 cells)

  • g (f x) means that f should be applied to x and g should be applied to the result of that
  • Gofer is a largely parenthesis-free language

 

Gofer: Defining New Functions

  • The editor is called by typing ‘:edit’, followed by the name of a file, for example:

? :edit new

  • Definition of the factorial function can be put in the file ‘new’.
  • In Gofer the definition of the function fac could look like:

fac n = product [1..n]

? :load new
Reading script file “new”:
Parsing………………………………………………….
Dependency analysis……………………………………….
Type checking…………………………………………….
Compiling………………………………………………..
Gofer session for:
/usr/staff/lib/gofer/prelude
new
?

  • Now fac can be used

? fac 6
720
(59 reductions, 87 cells)
 


Gofer: Adding fn to a file

  • It is possible to add definitions to a file when it is already loaded. Then it is sufficient to just type :edit; the name of the file need not be specified.
  • For example a function which can be added to a file is the function ‘n choose k’: the number of ways in which k objects can be chosen from a collection of n objects
  • This definition can, just as with fac, be written down almost literally in Gofer:

choose n k = fac n / (fac k * fac (n-k))
Example:
? choose 10 3
120
(189 reductions, 272 cells)
 


Gofer: Defining a New Operator

  • An operator is a function with two parameters which is written between its parameters instead of in front of them
  • In Gofer it is possible to define your own operators
  • The function choose above could have been defined as an operator, for example as !^! :

n !^! k = fac n / (fac k * fac (n-k))
 


Gofer: Nesting Functions

  • Parameter of a function can be a function itself too!
  • An example of that is the function map, which takes two parameters: a function and a list.
  • The function map applies the parameter function to all the elements of the list.

For example:
? map fac [1,2,3,4,5]
[1, 2, 6, 24, 120]
? map sqrt [1.0,2.0,3.0,4.0]
[1.0, 1.41421, 1.73205, 2.0]
? map even [1..8]
[False, True, False, True, False, True, False, True]

  • Functions with functions as a parameter are frequently used in Gofer (why did you think it was called a functional language?).

 


MapReduce/Hadoop

  • MapReduce* – A programming model & its associated implementation
    • provides a high level of abstraction
    • but has limitations
    • Only data parallel tasks stand to benefit!
  • MapReduce hides parallel/distributed computing concepts from users/programmers
  • Even novice users/programmers can leverage cluster computing for data-intensive problems
  • Cluster, Grid, & MapReduce are intended platforms for general purpose computing
  • Hadoop/PIG combo is very effective!
  • MapReduce works by breaking processing into the following 2 phases:
    • Map (inherently parallel – each list element is processed independently)
    • Reduce (inherently sequential)

 


Map Function

  • Applies to a list
  • map(function, list) calls function(item) for each of the list’s items and returns a list of the return values. For example, to compute some cubes:

>>> def cube(x): return x*x*x

>>> map(cube, range(1, 11))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]
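Note that the session above shows Python 2 behaviour, where map returns a list; in Python 3, map returns a lazy iterator, so the values must be materialized explicitly:

>>> list(map(cube, range(1, 11)))
[1, 8, 27, 64, 125, 216, 343, 512, 729, 1000]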
 


Reduce Function

  • Applies to a list
  • reduce(function, list) returns a single value constructed by calling the binary function on the first two items of the list, then on the result and the next item, and so on…
  • For example, to compute the sum of the numbers 1 through 10:

>>> from functools import reduce   # needed in Python 3
>>> def add(x,y): return x+y

>>> reduce(add, range(1, 11))
55
 


My view of MapReduce

[Figure: the author's own diagram of MapReduce, not reproduced in this text version]

 

MapReduce

  • A programming model & its associated implementation
  • MapReduce is a patented software framework introduced by Google to support distributed computing on large data sets on clusters of computers (wiki defn.)
  • Inspired by the map & reduce functions of functional programming (in a tweaked form)
  • MapReduce is a framework for processing huge datasets on certain kinds of distributable problems using a large number of computers, referred to collectively as a
    • cluster (if all nodes use the same hardware) or a
    • grid (if the nodes use different hardware)  (wiki defn.)
  • Computational processing can occur on data stored either in a filesystem (unstructured) or within a database (structured).
  • Programmer need not worry about:
    • Communication between nodes
    • Division & scheduling of work
    • Fault tolerance
    • Monitoring & reporting
  • MapReduce handles and hides all these dirty details
  • Provides a clean abstraction for programmer

 


Composable Systems

  • Processing can be split into smaller computations and the partial results merged after some post-processing to give the final result (see the sketch after this list)
  • MapReduce can be applied to the class of scientific applications that exhibit this composable property.
  • We only need to worry about mapping a particular algorithm to Map & Reduce
  • If you can do that, with a little bit of high level programming, you are through!
  • SPMD algorithms
  • Data Parallel Problems
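A minimal sketch of the composable property in plain Python (the data and chunk size are made up for illustration): the mean of a large list is computed by mapping an independent partial computation over chunks and then merging the partial results with a little post-processing.

from functools import reduce

data = list(range(1, 1001))                       # stand-in for a large dataset
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]

# Map: each chunk is processed independently into a partial result
partials = [(sum(c), len(c)) for c in chunks]

# Merge: combine partial results, then post-process to get the final answer
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
print(total / count)                              # 500.5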

 


SPMD Algorithms

  • It’s a technique to achieve parallelism.
  • Tasks are split up and run on multiple processors simultaneously with different input data (see the sketch below)
  • The robustness provided by MapReduce implementations is an important feature for selecting this technology for such SPMD algorithms
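A minimal SPMD-style sketch using Python's multiprocessing module (the data here is made up): the same function runs in several worker processes, each on a different slice of the input, and the partial results are merged at the end.

from multiprocessing import Pool

def total_word_length(lines):
    # Same program on every worker, different data for each worker
    return sum(len(word) for line in lines for word in line.split())

if __name__ == "__main__":
    docs = ["the quick brown fox", "jumps over", "the lazy dog"] * 100
    slices = [docs[i::4] for i in range(4)]       # four disjoint slices
    with Pool(processes=4) as pool:
        partials = pool.map(total_word_length, slices)
    print(sum(partials))                          # merge the partial results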



Hadoop

  • MapReduce isn’t available outside Google!
  • Hadoop/HDFS is an open source implementation of MapReduce/GFS
  • Hadoop is a top-level Apache project being built and used by a global community of contributors, using Java
  • Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses

Hadoop & Facebook

  • FB uses Hadoop to store copies of internal log and dimension data sources and uses it as a source for reporting/analytics and machine learning.
  • Currently FB has 2 major clusters:
    • A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
    • A 300-machine cluster with 2400 cores and about 3 PB raw storage.
    • Each (commodity) node has 8 cores and 12 TB of storage.
    • FB has built a higher-level data warehousing framework on top of these features, called Hive

http://hadoop.apache.org/hive/

    • Have also developed a FUSE implementation over HDFS.
  • First company to abandon RDBMS and adopt Hadoop for implementation of a DW

 


A word about HIVE

  • A data warehouse infrastructure built on top of Hadoop
  • Enables easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files.
  • provides a mechanism to put structure on this data
  • A simple query language called Hive QL (based on SQL) which enables users familiar with SQL to query this data.
  • Hive QL also allows traditional MapReduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.

 

Hadoop & Yahoo

YAHOO

  • More than 100,000 CPUs in >36,000 computers running Hadoop
  • Biggest cluster: 4000 nodes (2×4-CPU boxes with 4×1 TB disks & 16 GB RAM each)
    • Used to support research for Ad Systems and Web Search
    • Also used to do scaling tests to support development of Hadoop on larger clusters.

 


HDFS

  • Data is organized into files & directories
  • Files are divided into uniform-sized blocks (64 MB by default) & distributed across cluster nodes
  • Blocks are replicated (3 copies by default) to handle hardware failure
  • Replication for performance & fault tolerance
  • Checksum for data corruption detection & recovery
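A back-of-the-envelope sketch of what those defaults mean for a single (hypothetical) 1 GB file:

import math

file_size_mb = 1024                  # a 1 GB file, for illustration
block_mb, replication = 64, 3        # HDFS defaults quoted above

blocks = math.ceil(file_size_mb / block_mb)
print(blocks)                        # 16 blocks spread across cluster nodes
print(blocks * replication)          # 48 block copies stored in total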

 

Word Count using Map Reduce

  • Mapper
  • Input: <key: offset, value: line of a document>
  • Output: for each word w in the input line, emit <key: w, value: 1>

Input: (The quick brown fox jumps over the lazy dog.)
Output: (the, 1) , (quick, 1), (brown, 1) … (fox,1), (the, 1)

  • Reducer
  • Input: <key: word, value: list<integer>>
  • Output: sum all the values in the input list for the given key and emit <key: word, value: count>

Input:  (the, [1, 1, 1, 1, 1]), (fox, [1, 1, 1]) …
Output: (the, 5), (fox, 3)
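A minimal single-process sketch of this word count in Python; Hadoop would run the mapper and reducer as distributed tasks, and the shuffle/sort phase is simulated here with a dictionary.

from collections import defaultdict

def mapper(offset, line):
    # Emit (word, 1) for every word in the input line
    for word in line.lower().split():
        yield word.strip("."), 1

def reducer(word, values):
    # Sum all the 1s emitted for this word
    return word, sum(values)

lines = ["The quick brown fox jumps over the lazy dog."]
shuffle = defaultdict(list)
for offset, line in enumerate(lines):
    for word, one in mapper(offset, line):
        shuffle[word].append(one)               # group values by key

for word in sorted(shuffle):
    print(reducer(word, shuffle[word]))         # ('brown', 1) ... ('the', 2)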
 


Map Reduce Architecture

  • Map Phase
    • Map tasks run in parallel – output intermediate key value pairs
  • Shuffle and sort phase
    • Map task output is partitioned by hashing the output key (sketched after this list)
    • Number of partitions is equal to the number of reducers
    • Partitioning ensures all key/value pairs sharing the same key belong to the same partition
    • The map output partition is sorted by key to group all values for the same key
  • Reduce Phase
    • Each partition is assigned to one reducer.
    • Reducers also run in parallel.
    • No two reducers process the same intermediate key
    • Reducer gets all values for a given key at the same time
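The partitioning step can be sketched in a few lines, assuming the common hash-based partitioner: hashing the key modulo the number of reducers guarantees that every occurrence of a key goes to the same reducer.

num_reducers = 3
partitions = {r: [] for r in range(num_reducers)}

intermediate = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
for key, value in intermediate:
    r = hash(key) % num_reducers     # same key always maps to the same partition
    partitions[r].append((key, value))

# Each partition goes to exactly one reducer, which therefore sees
# all values for its keys together.
print(partitions)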

 


Strengths of MapReduce

  • Provides the highest level of abstraction (to date)
  • Learning curve – manageable
  • Highly scalable
  • Highly fault tolerant
  • Economical