Uploaded by AYUSHI GANG

Big Data

Big Data - quantifying it by size alone is a misconception
● Volume
● Variety - unstructured data - json
● Complexity - irregular timesteps, missing values
● Velocity - rate at which data comes into the source
Apache Hadoop - umbrella for different types of large-scale, open source data processing software
● Using a large no. of cheap machines to process big data
● Not widely used now
● Hadoop Distributed File System (HDFS) - storage spread across machines
● Hadoop MapReduce - CPUs across thousands of servers to process data
Big data is not an architecture
Why do we need Data
● Reporting - what happened - batch reports
● Analyzing - why did it happen - ad hoc, BI tools
● Predicting - what will happen - prediction models
● Operationalizing - what is happening now - link to operational
systems
● Activating - make it happen - automated linkages
● The process doesn’t change for big data
● The tools change for big data
Cloud
● Computation decoupled from storage
● Parallelism
● Need a way to know how the data is spread
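One common way to know how the data is spread is a deterministic partitioning function that every machine can evaluate. A minimal sketch, assuming hash partitioning over a fixed number of nodes; `NUM_NODES`, `node_for_key` and `spread` are hypothetical names chosen for illustration, not something named in these notes:

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size


def node_for_key(key: str, num_nodes: int = NUM_NODES) -> int:
    """Map a record key to the node that stores it.

    zlib.crc32 is used instead of the built-in hash() because it gives
    the same value in every process, so any machine can work out where
    a key lives without asking a central directory.
    """
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def spread(keys, num_nodes: int = NUM_NODES):
    """Group keys by the node that owns them."""
    placement = {node: [] for node in range(num_nodes)}
    for key in keys:
        placement[node_for_key(key, num_nodes)].append(key)
    return placement


if __name__ == "__main__":
    print(spread(["user-1", "user-2", "user-3", "order-9"]))
```

The trade-off in this sketch is that changing `NUM_NODES` moves most keys to a different node; real systems often use other schemes to limit that movement.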
An embarrassingly parallel workload or problem is one where little or no
effort is needed to separate the problem into a number of parallel tasks.
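As an illustration of that definition, the sketch below counts even numbers by cutting the input into chunks that never need to talk to each other. The function names are hypothetical, and threads are used only to keep the example portable; it is a sketch of the idea, not a tuned implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def count_even(chunk):
    """Work on one chunk; needs nothing from any other chunk."""
    return sum(1 for x in chunk if x % 2 == 0)


def split(data, parts):
    """Cut data into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_count_even(data, workers=4):
    chunks = split(data, workers)
    # Threads keep the sketch portable; CPU-bound work in CPython would
    # normally use processes or separate machines instead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_even, chunks))
    # The only coordination step is adding up the partial answers.
    return sum(partial_counts)


if __name__ == "__main__":
    print(parallel_count_even(list(range(100))))
```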
Computers
● Network - slowest; order of millisec to sec depending on distance & data size
● Disk - order of millisec - sequential access better than random access
● Memory & cpu - order of nanosec
Divide & conquer
● Planning - how to break the problem down into smaller problems that
can be solved in parallel
● Scheduling - a top-level problem can't be solved before its lower-level subproblems, so schedule the smaller problems first
● Executing - start from the lowest level and work all the way up
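The three steps can be sketched on a deliberately simple problem, summing a list. This is an assumed toy example (the names `plan`, `combine_level` and `total` are hypothetical); each level is computed sequentially here, although the items within a level are independent and could run in parallel.

```python
def plan(data, leaf_size):
    """Planning: break the problem into independent leaf subproblems."""
    return [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)]


def combine_level(partials):
    """Merge neighbouring partial results into the next level up."""
    return [sum(partials[i:i + 2]) for i in range(0, len(partials), 2)]


def total(data, leaf_size=3):
    if not data:
        return 0
    # Executing starts at the lowest level: every leaf is solved first.
    level = [sum(chunk) for chunk in plan(data, leaf_size)]
    # Scheduling: a higher level runs only once the level below is done.
    while len(level) > 1:
        level = combine_level(level)
    return level[0]


if __name__ == "__main__":
    print(total(list(range(1, 11))))
```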
Mapreduce
● Uses divide & conquer
● Parallel programming framework
○ Any programming lang
○ Harness cpu power
○ Optimizes for scale to avoid using a lot of
memory
● Provides
○ Automatic parallelization
○ Fault tolerance
○ Monitoring & status updates
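The programming model can be sketched with the classic word count, run here in a single process. This is only an illustration of the map, shuffle and reduce phases under that simplification; the function names are hypothetical, and a real framework would run the map and reduce calls on many machines and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so this step is embarrassingly parallel.
    mapped = [pair for line in lines for pair in map_phase(line)]
    # Each key is reduced independently, so this step is parallel too.
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```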
Spark
● Uses divide & conquer
● Parallel programming framework
○ Scala, Java, Python, R
○ Harness cpu power
○ Optimizes for memory to get speed
● Provides
○ Automatic parallelization
○ Fault tolerance
○ Monitoring & status updates
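● Transformations are lazy; an action (returns a result or has side effects) triggers a job, and stages within a job are split at shuffle boundaries
● Resilient Distributed Datasets (RDDs) - partitioned, immutable collections; a lost partition can be recomputed from its lineage after a failure, and intermediate results can be cached
● RDD lineage forms the DAG (directed acyclic graph) of operations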
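A toy sketch of the idea behind RDDs: transformations are only recorded, and an action replays the recorded chain from the source. `ToyDataset` is a hypothetical class written for these notes and is not Spark's API; it ignores partitioning, caching and distribution entirely.

```python
class ToyDataset:
    """Toy stand-in for an RDD; NOT Spark's API."""

    def __init__(self, source, lineage=()):
        self._source = source      # function that re-reads the input
        self._lineage = lineage    # recorded transformations, in order

    def map(self, fn):
        # Transformation: only recorded, nothing is computed yet.
        return ToyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return ToyDataset(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the lineage from the source. Because the recipe
        # is kept, a lost result can always be rebuilt the same way.
        items = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items


if __name__ == "__main__":
    doubled = ToyDataset(lambda: range(5)).map(lambda x: x * 2)
    print(doubled.filter(lambda x: x > 2).collect())
```

Keeping the recipe rather than the result is the design choice that makes recovery cheap to describe: nothing has to be copied elsewhere in advance, at the cost of recomputation when something is lost.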
Duality of sorting & hashing
● What can be done by sorting can also be done by hashing
● Sorting - O(n log n)
○ Small scale - quick sort
○ Large scale - merge sort
● Hashing - O(n)
○ Small scale - hash table
○ Large scale - partitioned hash table
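The duality can be seen on a small task, counting how often each key occurs, solved once by sorting and once by hashing. A minimal single-machine sketch with hypothetical function names; both functions return the same dictionary for the same input.

```python
from itertools import groupby


def count_by_sorting(keys):
    """O(n log n): sort so equal keys become adjacent, then scan once."""
    return {key: len(list(run)) for key, run in groupby(sorted(keys))}


def count_by_hashing(keys):
    """Expected O(n): one pass, a hash table holds the running counts."""
    counts = {}
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    data = ["b", "a", "b", "c", "a", "b"]
    print(count_by_sorting(data))
    print(count_by_hashing(data))
```

Sorting needs the keys to be comparable and leaves them in order as a by-product; hashing needs them to be hashable and gives no ordering.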
Businesses over time
[Figure labels: data generated by apps; specific software & hardware; must adapt to different types of hardware; hardware getting expensive]
Cloud native
● Non-functional software + cloud
● Rent hardware from the cloud provider
● Write software that can be scaled w/out having to rewrite it
● Enables loosely coupled systems that are resilient, manageable &
observable
● Scale to needs in terms of volume & requirement
● Design principle
Infrastructure as a service
● Public cloud providers provide a lot of options
● Cloud layer
● VMs - isolated slices of a physical machine's resources; one physical machine can host multiple VMs
● Storage types - local - ssd, hdd, virtual
● Memory sizes
● CPU, GPU
● Network types
● OS types
● Adv - offers more control
● Disadv - needs admins
● Mostly used by companies offering open source software
Platform as a service
● Public cloud providers provide a few options
● Cloud + non-functional software layers
● Databases
● Application exec env - can exec apps w/out having to worry about
vms, cpu, memory etc.
● Workload-specific execution env
○ Amazon
○ Google
○ Microsoft Azure
● Container service - allows running of software containers designed
using cloud native principles
○ Containers are lightweight
○ Can operate on any vm
○ Container service automatically scales up
● Adv - actively managed
● Disadv - less control, predictability, stability
● Examples
● Mostly used by companies that need auto-scaling, backups &
recovery, on demand provisioning
● Serverless - special case of PaaS
○ A cloud native dev model that allows developers to build & run apps w/out having to manage servers
○ There are still servers, but they are abstracted away from app dev
○ Ideal for low complexity apps w/ embarrassingly parallel logic & trigger based exec
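A serverless function is typically a small, stateless handler that the platform calls once per trigger. The sketch below assumes a file-upload trigger; the handler name and the event fields (`file_name`, `size_bytes`) are hypothetical and do not follow any particular provider's event format.

```python
def handle_upload(event):
    """Runs once per trigger; keeps no state between invocations."""
    name = event["file_name"]         # hypothetical event field
    size_bytes = event["size_bytes"]  # hypothetical event field
    return {
        "file_name": name,
        "size_kb": round(size_bytes / 1024, 1),
        "is_large": size_bytes > 1_000_000,
    }


if __name__ == "__main__":
    # Locally the "trigger" is just a function call with a sample event.
    print(handle_upload({"file_name": "a.csv", "size_bytes": 2048}))
```

Because each call depends only on its own event, the platform can run as many copies in parallel as there are triggers, which is why this style suits embarrassingly parallel logic.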
Software as a service
● Public cloud providers provide limited options
● Many enterprises provide this service
● Cloud + functional + non functional software layers
● Biz specific apps
○ Payroll
○ Attendance
○ Fintech
○ Foodtech
○ Edtech
○ Healthtech
● Adv - config driven, high predictability & performance, new features
● Disadv - least control