Virtuous Product Cycle
① A useful product or service attracts users.
② User behaviour generates data.
③ Analyse the data: extract and transform it into insights.
④ Insights lead to actions that improve the product.

Data Lifecycle
- Understand the problem.
- Ingest: collect useful data. At this point the data is in a messy form, not suitable for analysis.
- Transform: clean and shape the data; pre-process (convert) it into tabular form.
- Explore: understand the data.
- Analyze: create features and build models.
- Evaluate the data and communicate results.
- Deliver and deploy the model output.

Cloud Computing
What: computing resources as a metered service; dynamic provisioning of virtual machines.
Why:
① Lower cost: capital and operating expenses.
② Scalability: "infinite" capacity.
③ Elasticity: scale up or down on demand.

Service models
- IaaS (infrastructure): replaces traditional hardware with virtual machines; bare bones, with an OS, e.g. EC2.
- PaaS (platform): provides a preconfigured development platform, e.g. web server, database server.
- SaaS (software).

Virtual machines
① A hypervisor manages the VMs.
② Different apps can run on the same machine.
③ Apps are independent of each other.
④ A separate OS per app is expensive overhead.

Storage Hierarchy (Data Movement)
- Within a single server: cache, memory (DRAM), flash, disk. In speed, DRAM > flash > disk; memory is fastest but extremely expensive, disk is slowest.
- Scaled up: local DRAM and local disk, then rack DRAM and disk through the rack switch.
- Scaled further up: the rest of the cluster through the cluster / datacenter switch.
- Moving outward in the hierarchy: CAPACITY increases drastically, LATENCY increases, BANDWIDTH decreases.

Bandwidth vs Latency
- Bandwidth: rate of data transmission (GB/s); a "bigger pipe". Throughput is the actual rate of data transmission.
- Latency: time taken for data to travel; could be one way or round trip.
- Bandwidth tells us the rough time for large amounts of data; latency tells us the rough time for small amounts of data.
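The bandwidth/latency rule of thumb in the notes above can be sketched as transfer time ≈ latency + size / bandwidth. This is only a rough model; the function name and the link figures (0.5 ms latency, 1 GB/s bandwidth) are illustrative assumptions, not values from the notes.

```python
def transfer_time_s(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Rough one-way transfer time: fixed latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Hypothetical link (illustrative numbers only): 0.5 ms latency, 1 GB/s bandwidth.
LATENCY_S = 0.0005
BANDWIDTH_BYTES_PER_S = 1e9

small = transfer_time_s(100, LATENCY_S, BANDWIDTH_BYTES_PER_S)    # 100 bytes
large = transfer_time_s(10e9, LATENCY_S, BANDWIDTH_BYTES_PER_S)   # 10 GB

# Share of the total time that is pure latency in each case.
small_latency_share = LATENCY_S / small
large_latency_share = LATENCY_S / large

print(f"small transfer: {small:.7f} s, latency share {small_latency_share:.2%}")
print(f"large transfer: {large:.4f} s, latency share {large_latency_share:.4%}")
```

For the small transfer almost all of the time is latency; for the large one almost all of it is size / bandwidth, which is why each number is the better estimate in its own regime.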
Infrastructure for Big Data: Data Centre Ideas
① Scale OUT, not UP: combine many cheaper machines.
② Move PROCESSING to the data.
③ Process data sequentially: use big chunks rather than small amounts of data, to reduce disk SEEKS.
④ Seamless scalability: ideally, a task that takes 100 hours on 1 machine takes 10 hours on 10 machines.

MapReduce Implementation (BASIC)
① Input files
- Files are divided into INPUT SPLITS (128 MB); each split contains MULTIPLE key-value pairs.
② MAP phase
- {(K1, V1), (K2, V2), ...}: split 1 goes to Map Task 1, split 2 to Map Task 2, and so on.
- Each K-V pair in a map task makes an INDIVIDUAL CALL to the MAP FUNCTION, which emits intermediate K-V pairs.
2A Combiners (OPTIONAL)
- A "mini-reducer": locally aggregates data on the map task's machine, e.g. (A, 1), (A, 1), (A, 3).
- Can be the same as the REDUCE function.
- Correctness of the output MUST NOT depend on the combiner: whether it runs 0, 1, or multiple times, the RESULT must be the SAME.
③ Shuffle phase
- BARRIER: the reduce phase cannot start until ALL map tasks are finished.
- Aggregation of values into a list: K, [V1, V2, V3] goes to the reducer.
- PARTITIONER: maps keys to reducers; can be customized to redistribute uneven key loads.
④ Reduce phase
- Pairs with the same KEY are sent to the SAME machine.
- One reducer-function call per key: (K, [V1, ..., Vn]) is reduced and written to the output file.
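The phases above can be sketched as a single-process simulation. This is a minimal sketch, not Hadoop's API: `run_mapreduce`, `partition`, and the word-count functions are hypothetical names, and word count is an assumed example job. It also shows that the output is the same with or without the combiner.

```python
from collections import defaultdict


def word_count_map(_key, line):
    """Map function: called once per input key-value pair."""
    for word in line.split():
        yield word, 1


def word_count_reduce(key, values):
    """Reduce function: called once per key with the list of its values."""
    yield key, sum(values)


def partition(key, num_reducers):
    """Hypothetical deterministic partitioner: maps a key to a reducer index."""
    return sum(key.encode()) % num_reducers


def run_mapreduce(splits, map_fn, reduce_fn, combine_fn=None, num_reducers=2):
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                          # one map task per input split
        local = defaultdict(list)
        for key, value in split:                  # one map call per K-V pair
            for ikey, ivalue in map_fn(key, value):
                local[ikey].append(ivalue)
        for ikey, ivalues in local.items():
            if combine_fn is None:
                pairs = [(ikey, v) for v in ivalues]
            else:                                 # optional local aggregation
                pairs = list(combine_fn(ikey, ivalues))
            for ckey, cvalue in pairs:            # partitioner picks the reducer
                reducer_inputs[partition(ckey, num_reducers)][ckey].append(cvalue)
    # Barrier: reducers only run after every map task above has finished.
    output = {}
    for grouped in reducer_inputs:
        for key in sorted(grouped):               # keys reach the reducer sorted
            for okey, ovalue in reduce_fn(key, grouped[key]):
                output[okey] = ovalue
    return output


splits = [[(0, "a b a"), (1, "c a")], [(0, "b b c")]]
plain = run_mapreduce(splits, word_count_map, word_count_reduce)
combined = run_mapreduce(splits, word_count_map, word_count_reduce,
                         combine_fn=word_count_reduce)
print(plain == combined, plain)
```

Reusing the reduce function as the combiner works here because summation is associative and commutative; a job computing an average, for example, could not do this directly.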
Sorting in the Shuffle Phase
DEFAULT SORTING
- By default, intermediate pairs are sorted by KEY on the map task's machine, in ascending order, depending on the key's class, e.g. (A, 1), (B, 2), (C, 3).
- Only the key is sorted; the values stored under a key arrive in no particular order.

Secondary sorting
Additional sorting by a secondary key is possible via a composite key:
① A new composite key object has to be created: a (natural key, secondary key) pair.
② A Comparator on the composite key determines which composite key the reduce task handles first; pairs are still sorted by the natural key, then by the secondary key.
③ A custom Partitioner has to be written so that only the NATURAL KEY is used for partitioning, i.e. to determine the node. This way (A, 1) and (A, 3) are both sent to the same reducer.
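A minimal sketch of the secondary-sort idea, simulated in plain Python rather than with Hadoop's Comparator and Partitioner classes. `CompositeKey`, `partition`, and `shuffle_with_secondary_sort` are hypothetical names introduced for illustration.

```python
from itertools import groupby
from typing import NamedTuple


class CompositeKey(NamedTuple):
    natural: str     # decides the partition and the reduce group
    secondary: int   # only affects the order inside a group


def partition(key: CompositeKey, num_reducers: int) -> int:
    """Custom partitioner: uses ONLY the natural key to choose the reducer."""
    return sum(key.natural.encode()) % num_reducers


def shuffle_with_secondary_sort(pairs, num_reducers):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    result = []
    for bucket in buckets:
        # Comparator step: order by (natural key, secondary key).
        bucket.sort(key=lambda kv: kv[0])
        # Grouping step: one reduce group per natural key, values already ordered.
        grouped = {
            natural: [(k.secondary, v) for k, v in group]
            for natural, group in groupby(bucket, key=lambda kv: kv[0].natural)
        }
        result.append(grouped)
    return result


pairs = [
    (CompositeKey("A", 3), "x"),
    (CompositeKey("B", 2), "y"),
    (CompositeKey("A", 1), "z"),
    (CompositeKey("A", 2), "w"),
]
per_reducer = shuffle_with_secondary_sort(pairs, num_reducers=2)
print(per_reducer)
```

Partitioning on the full composite key instead would scatter (A, 1), (A, 2), and (A, 3) across reducers, which is why the partitioner looks at the natural key only.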
HADOOP Architecture
Hadoop = HDFS + MapReduce (MR).

HDFS
① Files are stored as chunks (128 MB) on the datanodes.
② Reliability: each chunk is replicated 3 times.
③ Access and metadata are coordinated via the name node (MASTER NODE).

Name node responsibilities
- Keep track of the file directory structure.
- Keep track of the addresses of the blocks in the slave nodes.

Reading a file
① The app sends a request to the name node to read a particular file.
② The file's block addresses are found.
③ The name node directs the client to read the blocks from the relevant nodes.

Writing a file (similar concept)
① The master node returns the block addresses to be written to.
② The client writes the first block to a node; that node then writes to the NEXT node (REPLICA).
③ For multiple blocks: the first writes are done in parallel; the next replicas are created sequentially.

Relational DBs in MapReduce
Projection
- Mapper: take in ALL tuples; emit tuples with only the selected attributes.
Selection
- Mapper: take in tuples; tuples are filtered through a predicate, e.g. price > 10; emit only those that pass the predicate.
Group by, e.g. AVG(price) GROUP BY att
- Mapper: take in all tuples; emit each tuple with K being the GROUP BY attribute.
- Shuffle: tuples with the same key are grouped together.
- Reducer: AVG(price) can be calculated from the list of tuples.

Similarity Metrics
For vectors A = (a_1, ..., a_n) and B = (b_1, ..., b_n), with distance d(A, B) and similarity s(A, B):
① Euclidean distance: $d(A,B) = \sqrt{\sum_i (a_i - b_i)^2}$
② Manhattan distance: $d(A,B) = \sum_i |a_i - b_i|$
③ Cosine similarity: $s(A,B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|}$ (dot product over the product of the lengths)
④ Jaccard similarity (sets): $s(A,B) = \frac{|A \cap B|}{|A \cup B|}$
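The four similarity metrics above can be written down directly. This is a small sketch with hypothetical function names; it assumes equal-length, non-zero vectors for the cosine case and a non-empty union for the Jaccard case.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths (non-zero vectors assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


def jaccard_similarity(s, t):
    """Size of the intersection over size of the union (non-empty union assumed)."""
    return len(s & t) / len(s | t)


print(euclidean_distance((0, 0), (3, 4)))
print(manhattan_distance((0, 0), (3, 4)))
print(cosine_similarity((1, 0), (0, 1)))
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
```

The first two are distances (smaller means more alike), while cosine and Jaccard are similarities (larger means more alike).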
Joins in MapReduce
① Broadcast / map join: high memory cost.
② Reduce-side join / common join: tuples are tagged with their TABLE NAME; for each join key the reducer produces all possible (unique) combinations, e.g. SG1-CS1000, SG1-CS2020, MY1-CS1000, MY1-CS2020.

Document Similarity
Naive approach: comparing the shingles of every pair of documents directly is expensive.
Process
① Shingles: break each document into shingles of a chosen shingle length, e.g. "the cat is glad" gives (the cat), (cat is), (is glad).
② Minhash: the value chosen for a document is the minimum hash over its shingles, and $\Pr[h(C_1) = h(C_2)] = \mathrm{Jsim}(C_1, C_2)$. Use N hash functions to produce N signatures per document.
③ Candidate pairs: the number of matching signatures out of N indicates how similar two documents are. Set a threshold to pick candidate pairs, then further compare them to each other to double check.
In MapReduce: documents with the same signature are grouped together and could be candidate pairs.
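A minimal sketch of shingling and minhash signatures as described above. The helper names are hypothetical, and deriving N hash functions from MD5 with an integer seed is an assumption made for reproducibility, not a choice taken from the notes. Documents are assumed to contain at least k words.

```python
import hashlib


def shingles(text, k=2):
    """Set of k-word shingles, e.g. 'the cat is glad' -> (the cat), (cat is), (is glad)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """One member of a family of hash functions, selected by `seed` (illustrative choice)."""
    data = f"{seed}|{' '.join(shingle)}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=200):
    """For each hash function keep the minimum hash value over all shingles."""
    return [min(_seeded_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = shingles("the cat is glad today")
doc_b = shingles("the cat is sad today")
true_jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
estimate = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"true Jaccard {true_jaccard:.3f}, minhash estimate {estimate:.3f}")
```

Pairs whose estimate clears a chosen threshold would be the candidate pairs that get the full, more expensive shingle comparison.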
k-Means Algorithm in MapReduce
Repeat:

MAPPER
- In: the cluster table (cluster IDs and centroid positions) and a point P.
- Compare P with the centroid position of each cluster ID; the closest centroid is found.
- Emit (cluster ID, extended point), where the point is extended with a 1 for counting, e.g. (B, ([1, 1], 1)).

REDUCER
① Input: K = cluster ID, list(V) = list of extended points, e.g. B, {([0, 1], 1), ([1, 1], 1), ([7, 8], 1)}.
② Initialise S = ([0, 0], 0).
③ Sum each extended point into S.
④ Obtain the new centroid: [S1, S2] / S3.
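One iteration of the mapper and reducer above can be sketched as follows. This is a single-process simulation with hypothetical function names and made-up 2D points; a real job would repeat the iteration until the centroids stop moving.

```python
from collections import defaultdict


def closest_cluster(point, centroids):
    """Cluster ID whose centroid has the smallest squared distance to the point."""
    return min(
        centroids,
        key=lambda cid: (point[0] - centroids[cid][0]) ** 2 + (point[1] - centroids[cid][1]) ** 2,
    )


def kmeans_map(point, centroids):
    """Emit (cluster ID, extended point); the trailing 1 is used for counting."""
    return closest_cluster(point, centroids), (point[0], point[1], 1)


def kmeans_reduce(cluster_id, extended_points):
    """Sum the extended points, then divide the coordinate sums by the count."""
    sum_x = sum_y = count = 0
    for x, y, c in extended_points:
        sum_x += x
        sum_y += y
        count += c
    return cluster_id, (sum_x / count, sum_y / count)


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)                 # shuffle: group by cluster ID
    for point in points:
        cluster_id, extended = kmeans_map(point, centroids)
        groups[cluster_id].append(extended)
    new_centroids = dict(centroids)            # clusters with no points keep their centroid
    for cluster_id, extended_points in groups.items():
        _, new_centroids[cluster_id] = kmeans_reduce(cluster_id, extended_points)
    return new_centroids


points = [(0, 1), (1, 1), (7, 8), (8, 8)]
centroids = {"A": (0, 0), "B": (9, 9)}
updated = kmeans_iteration(points, centroids)
print(updated)
```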
k-Means with Local Aggregation in the Mapper

MAPPER
① In: the cluster table (cluster IDs and centroid positions) and a point P.
② Compare P with the centroid positions; the closest centroid is found.
③ Sum the extended point into a hash map keyed by cluster ID.
④ IN CLEAN UP: emit only the locally aggregated point sum per cluster, which is smaller than emitting each point.

REDUCER
① Input: K = cluster ID, list(V) = list of extended points (now partial sums), e.g. B, {([0, 1], 1), ([1, 1], 1), ([7, 8], 1)}.
② Initialise S = ([0, 0], 0).
③ SUM each extended point with S.
④ Obtain the new centroid: [S1, S2] / S3.
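A minimal sketch of the clean-up variant: each mapper keeps a running sum per cluster in a hash map and emits it only once at the end. The class and method names are hypothetical and only loosely mirror the map / clean-up structure in the notes; they are not Hadoop's API.

```python
def closest_cluster(point, centroids):
    """Cluster ID whose centroid has the smallest squared distance to the point."""
    return min(
        centroids,
        key=lambda cid: (point[0] - centroids[cid][0]) ** 2 + (point[1] - centroids[cid][1]) ** 2,
    )


class AggregatingKMeansMapper:
    """Mapper that sums extended points locally instead of emitting each point."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.partial_sums = {}   # cluster ID -> [sum_x, sum_y, count]

    def map(self, point):
        cluster_id = closest_cluster(point, self.centroids)
        acc = self.partial_sums.setdefault(cluster_id, [0, 0, 0])
        acc[0] += point[0]
        acc[1] += point[1]
        acc[2] += 1

    def cleanup(self):
        """Called once after all points: emit one partial sum per cluster."""
        return [(cid, tuple(acc)) for cid, acc in self.partial_sums.items()]


def kmeans_reduce(cluster_id, extended_points):
    """Unchanged reducer: partial sums add up exactly like single extended points."""
    sum_x = sum(p[0] for p in extended_points)
    sum_y = sum(p[1] for p in extended_points)
    count = sum(p[2] for p in extended_points)
    return cluster_id, (sum_x / count, sum_y / count)


centroids = {"A": (0, 0), "B": (9, 9)}
mapper_1 = AggregatingKMeansMapper(centroids)
for p in [(0, 1), (1, 1), (7, 8)]:
    mapper_1.map(p)
mapper_2 = AggregatingKMeansMapper(centroids)
for p in [(8, 8), (1, 0)]:
    mapper_2.map(p)

emitted = mapper_1.cleanup() + mapper_2.cleanup()
by_cluster = {}
for cid, partial in emitted:                  # shuffle: group by cluster ID
    by_cluster.setdefault(cid, []).append(partial)
new_centroids = dict(kmeans_reduce(cid, parts) for cid, parts in by_cluster.items())
print(len(emitted), new_centroids)
```

The first mapper sees three points but emits only two pairs, one per cluster, and the saving grows with the number of points per split.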