Big Data - quantifying it by size alone is a misconception
● Volume
● Variety - unstructured data, e.g. JSON
● Complexity - irregular timesteps, missing values
● Velocity - rate at which data arrives from the source

Apache Hadoop - umbrella project for different types of open source, large-scale data processing software
● Uses a large number of cheap machines to process big data
● Not widely used now
● Hadoop Distributed File System (HDFS) - stores data across machines
● Hadoop MapReduce - uses CPUs across thousands of servers to process data

Big data is not an architecture

Why do we need data?
● Reporting - what happened - batch reports
● Analyzing - why did it happen - ad hoc, BI tools
● Predicting - what will happen - prediction models
● Operationalizing - what is happening now - link to operational systems
● Activating - make it happen - automated linkages
● The process doesn’t change for big data
● The tools change for big data

Cloud
● Computation decoupled from storage
● Parallelism
● Need a way to know how the data is spread across machines

An embarrassingly parallel workload or problem is one where little or no effort is needed to separate the problem into a number of parallel tasks.
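As a minimal sketch of an embarrassingly parallel workload, the example below counts words in independent chunks of text and then merges the partial results. The function names (`count_words`, `parallel_word_count`) and the sample chunks are made up for illustration. A thread pool keeps the example short; in CPython, real CPU-bound work would more likely use separate processes or machines, which is an assumption about deployment and not something these notes specify.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """Count the words in one chunk, independently of every other chunk."""
    return Counter(chunk.split())


def parallel_word_count(chunks, workers=4):
    # Tasks never communicate with each other, so splitting the problem
    # costs almost nothing: this is what makes it embarrassingly parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))

    # Merge the per-chunk counts into a single total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical input: each string stands in for a block of data on one machine.
chunks = ["big data big tools", "data tools data", "big cloud"]
totals = parallel_word_count(chunks)
print(totals["big"], totals["data"], totals["tools"], totals["cloud"])  # 3 3 2 1
```

The only coordination happens in the final merge, so adding more chunks or more workers does not change the per-chunk logic.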
Computers (access latency)
● Network - order of seconds
● Disk - order of milliseconds - sequential access is better than random access
● Memory & CPU - order of nanoseconds

Divide & conquer
● Planning - how to break the problem down into smaller problems that can be solved in parallel
● Scheduling - a top-level problem can’t be solved before its lower-level subproblems, so schedule the smaller problems first
● Executing - start from the lowest level and work all the way up

MapReduce
● Uses divide & conquer
● Parallel programming framework
○ Any programming language
○ Harnesses CPU power
○ Optimizes for scale and avoids using a lot of memory
● Provides
○ Automatic parallelization
○ Fault tolerance
○ Monitoring & status updates

Spark
● Uses divide & conquer
● Parallel programming framework
○ Scala, Java, Python, R
○ Harnesses CPU power
○ Optimizes for speed by keeping data in memory
● Provides
○ Automatic parallelization
○ Fault tolerance
○ Monitoring & status updates
● Transformations are lazy; actions have side effects (return a result or write output) and trigger execution. Within a job, stages are split at shuffle boundaries
● Resilient Distributed Datasets (RDDs) - record the lineage of operations so that lost partitions can be recomputed after a failure; intermediate results can also be persisted
● The RDD lineage forms the DAG (directed acyclic graph) that Spark schedules

Duality of sorting & hashing
● What can be done by sorting can also be done by hashing
● Sorting - O(n log n)
○ Small scale - quick sort
○ Large scale - merge sort
● Hashing - O(n)
○ Small scale - hash table
○ Large scale - partitioned hash table

Businesses over time
● Data generated by apps
● Specific software & hardware
● Must adapt to different types of hardware
● Hardware getting expensive

Cloud native
● Non-functional software + rented cloud hardware: write software that can be scaled without having to rewrite it
● Enables loosely coupled systems that are resilient, manageable & observable
● Scales to needs in terms of volume & requirements
● Design principle

Infrastructure as a service
● Public cloud providers provide a lot of options
● Cloud layer
● VMs - physical resources isolated for a specific virtual purpose; a physical machine can host multiple VMs
● Storage types - local (SSD, HDD), virtual
● Memory sizes
● CPU, GPU
● Network types
● OS types
● Advantage - offers more control
● Disadvantage - needs admins
● Mostly used by companies offering open source software

Platform as a service
● Public cloud providers provide a few options
● Cloud + non-functional software layers
● Databases
● Application execution environment - can run apps without having to worry about VMs, CPU, memory etc.
● Workload-specific execution environments
○ Amazon
○ Google
○ Microsoft Azure
● Container service - runs software containers designed using cloud native principles
○ Containers are lightweight
○ Can operate on any VM
○ Container service scales up automatically
● Advantage - actively managed
● Disadvantage - less control, predictability, stability
● Examples
● Mostly used by companies that need auto-scaling, backups & recovery, on-demand provisioning
● Serverless - special case
○ A cloud native development model that allows developers to build & run apps without having to manage servers
○ There are still servers, but they are abstracted away from app development
○ Ideal for low-complexity apps with embarrassingly parallel logic & trigger-based execution

Software as a service
● Public cloud providers provide limited options
● Many enterprises provide this service
● Cloud + functional + non-functional software layers
● Business-specific apps
○ Payroll
○ Attendance
○ Fintech
○ Foodtech
○ Edtech
○ Healthtech
● Advantage - config driven, high predictability & performance, new features
● Disadvantage - least control
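
Worked example for the duality of sorting & hashing noted earlier: a rough sketch showing that the same group-by-key result can be reached either by sorting (O(n log n)) or by hashing (O(n) expected). The function names and sample pairs are hypothetical, and this is the small-scale, single-machine version only; the large-scale variants (merge sort, partitioned hash table) are not shown.

```python
from collections import defaultdict
from itertools import groupby


def group_by_sorting(pairs):
    """Group values by key by sorting first, then scanning runs of equal keys."""
    ordered = sorted(pairs, key=lambda kv: kv[0])  # O(n log n), stable sort
    return {
        key: [value for _, value in run]
        for key, run in groupby(ordered, key=lambda kv: kv[0])
    }


def group_by_hashing(pairs):
    """Group values by key in a single pass using a hash table."""
    groups = defaultdict(list)
    for key, value in pairs:  # O(n) expected
        groups[key].append(value)
    return dict(groups)


# Hypothetical (key, value) records.
pairs = [("b", 1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]
by_sort = group_by_sorting(pairs)
by_hash = group_by_hashing(pairs)
print(by_sort == by_hash)  # True
```

Both functions produce the same groups; they differ in cost and in whether the keys come out ordered as a by-product.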