Uploaded by Arihant

Notes

Virtuous Product Cycle
① Useful product / service
② Analyse user behaviour & extract insights (data → insights)
③ Transform insights → actions

Data Lifecycle
Start: understand the problem
① Ingest: collect useful data
   → data is in messy form → not suitable for analysis
② Transform: clean & shape data
   → pre-process: convert into tabular form
③ Analyze: explore & understand data; create & build models + features
④ Evaluate
⑤ Deliver & deploy model; communicate results (output)

Storage Hierarchy
L1/L2 cache → DRAM (memory) → FLASH → DISK
- SPEED: cache is fastest; memory > FLASH > DISK (slowest)
- COSTS ($): memory is extremely expensive compared to disks
- Memory hierarchy scaled up: single server (local DRAM, local disk) → rack switch (rack DRAM) → cluster switch → datacenter switch
- As the hierarchy is scaled further up:
   CAPACITY ↑ increases drastically
   LATENCY ↑ increases
   BANDWIDTH ↓ decreases
- Data movement: speed goes down drastically as the hierarchy is scaled up

BANDWIDTH: capacity of data transmission (GB/s), a "bigger pipe"
THROUGHPUT: actual rate of data transmission
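A rough way to combine these terms: the time to move data is about one latency (the time for data to travel) plus the data size divided by the bandwidth. The sketch below is a simplified model in Python, not a measurement tool; the link figures (1 ms latency, 1 GB/s bandwidth) and the function name are illustrative assumptions.

```python
def transfer_time_s(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Rough one-way transfer time: fixed latency plus size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


# Assumed, illustrative link: 1 ms latency, 1 GB/s bandwidth.
LATENCY_S = 1e-3
BANDWIDTH = 1e9

small = transfer_time_s(1_000, LATENCY_S, BANDWIDTH)           # 1 KB message
large = transfer_time_s(10_000_000_000, LATENCY_S, BANDWIDTH)  # 10 GB file

# Share of the total time that is pure latency.
small_latency_share = LATENCY_S / small
large_latency_share = LATENCY_S / large
```

For the small message almost all of the time is latency; for the large file almost all of it is size divided by bandwidth.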
LATENCY: time taken for data to travel; could be one way or round trip
- Bandwidth tells us the rough time for large amounts of data
- Latency tells us the rough time for small amounts of data

Infrastructure for big data

Cloud Computing
What:
① Computing resource as a metered service
② Dynamic provision of virtual machines
Why:
① Lower cost: capital & operating expenses
② Scalability: "infinite capacity"
③ Elasticity: scale up or down on demand

IAAS (Infrastructure)
- replaces traditional hardware with virtual machines
- bare bones, with OS, e.g. EC2
PAAS (Platform)
- provides a development platform
- preconfigured machine to do a specific task, e.g. web server, database server
SAAS (Software)
- applications

Virtual machines
① Hypervisor manages VMs
② Different apps → same machine
③ Apps are independent of each other
④ Runs a separate OS per app → expensive overhead

Data centre Ideas
① SCALE OUT, not UP: combine cheaper machines
② Move PROCESSING to the data, do not move the data: saves transmission delay and computation time
③ Process data sequentially: use big chunks rather than small amounts of data → reduce disk SEEKS
④ Seamless scalability for tasks: 1 machine: 100 hours for task → 10 machines: only 10 hours for task
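The advice to read big chunks sequentially rather than many small pieces can be illustrated with a simple cost model in Python: every chunk pays one disk seek, and the bytes themselves stream at disk bandwidth. The seek time (10 ms) and disk bandwidth (100 MB/s) are assumed round numbers for illustration, not figures from the notes.

```python
def read_time_s(total_bytes, chunk_bytes, seek_s, bandwidth_bytes_per_s):
    """Each chunk costs one seek; the data itself streams at disk bandwidth."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * seek_s + total_bytes / bandwidth_bytes_per_s


SEEK_S = 0.01           # assumed: 10 ms per seek
BANDWIDTH = 100e6       # assumed: 100 MB/s sequential read
TOTAL = 1_000_000_000   # 1 GB of data

small_chunks = read_time_s(TOTAL, 4_000, SEEK_S, BANDWIDTH)       # 4 KB reads
big_chunks = read_time_s(TOTAL, 100_000_000, SEEK_S, BANDWIDTH)   # 100 MB reads
```

Under these assumptions the 4 KB reads spend about 2500 s seeking and 10 s transferring, while the 100 MB reads spend about 0.1 s seeking for the same 10 s of transfer.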
MapReduce Implementation

BASIC
① Input files
   → divided into INPUT SPLITS (128 MB); each split contains MULTIPLE key-value pairs
   → input splits can be sent to the SAME machine

② MAP Phase
   split 1 → Map Task 1, split 2 → Map Task 2, ...
   - each KV pair in a map task makes an INDIVIDUAL CALL to the MAP FUNCTION
   - the map function emits intermediate KV pairs (K', V')

2A Combiners (OPTIONAL)
   - "mini reducer": locally aggregates data, e.g. (A,1), (A,1), (A,1) → (A,3)
   - can be the same as the REDUCE function
   ☆ Correctness of the output MUST NOT depend on the combiner:
     combiner ran 0, 1 or multiple times → SAME RESULT

③ Shuffle Phase
   - BARRIER: ALL map tasks are finished
   - Key aggregation: K → [V1, V2, V3]; the list of values goes into the Reducer
   - PARTITIONER: maps keys to reducers; can be customized to re-distribute uneven key loads

④ Reduce Phase
   - reduce function: (K, [V1, ..., Vn]) → (K, V)
   - one output file per reducer
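The phases above can be sketched as a small in-memory word count in Python. This is a toy simulation rather than Hadoop code: `run_job`, `map_fn`, `reduce_fn` and the CRC-based `partition` function are names and choices assumed for illustration. It also checks the combiner rule: the result is the same whether the combiner runs or not.

```python
import zlib
from collections import defaultdict


def map_fn(_offset, line):
    """Map function: one call per input KV pair, emits (word, 1)."""
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce function: (K, [V1..Vn]) -> (K, V). Also reused as the combiner."""
    yield key, sum(values)


def partition(key, num_reducers):
    """Assumed partitioner: deterministic hash of the key modulo reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def group_by_key(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, num_reducers=2, use_combiner=True):
    # MAP phase: one map task per input split.
    map_outputs = []
    for split in splits:
        pairs = [kv for offset, line in split for kv in map_fn(offset, line)]
        if use_combiner:  # optional local aggregation on the map side
            pairs = [kv for k, vs in group_by_key(pairs).items()
                     for kv in reduce_fn(k, vs)]
        map_outputs.append(pairs)
    # SHUFFLE phase: starts only after all map tasks finished (the barrier);
    # the partitioner decides which reducer receives each key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            reducer_inputs[partition(key, num_reducers)][key].append(value)
    # REDUCE phase: keys are handled in sorted order, one output per reducer.
    return [dict(kv for key in sorted(grouped) for kv in reduce_fn(key, grouped[key]))
            for grouped in reducer_inputs]


SPLITS = [[(0, "a b a"), (6, "c a")], [(0, "b b c")]]
with_combiner = run_job(SPLITS, use_combiner=True)
without_combiner = run_job(SPLITS, use_combiner=False)
```

Summation works as a combiner because it is associative and commutative; an average, for example, could not reuse its reduce function directly.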
Sorting
- keys are locally sorted on the Map Task's machine
- DEFAULT SORTING: by default, sorted by key (K) in ascending order, depending on the key's class
  e.g. (A,1), (B,2), (C,3)

Secondary sorting
- additional sorting by a secondary key; data is still sorted by the natural key
- usage of this feature is possible via:
  ① New composite key: a (natural key, secondary key) pair object has to be created
  ② Comparator defined on the composite key: used to determine which key is handled first by the reduce task, e.g. if A, B are sent to reducer 1
  ③ Custom Partitioner written: only the NATURAL KEY is used for partitioning
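A minimal Python sketch of the secondary-sort pattern, assuming records of the form ((natural key, secondary key), value). The function names and the CRC-based partitioner are illustrative assumptions, not Hadoop APIs. Tuples compare element by element, so sorting on the composite key plays the role of the comparator, while the partitioner looks at the natural key only.

```python
import zlib
from itertools import groupby


def partition(natural_key, num_reducers):
    """Custom partitioner: only the natural key decides the reducer."""
    return zlib.crc32(natural_key.encode("utf-8")) % num_reducers


def secondary_sort(records, num_reducers=2):
    """records: iterable of ((natural_key, secondary_key), value)."""
    partitions = [[] for _ in range(num_reducers)]
    for (natural, secondary), value in records:
        partitions[partition(natural, num_reducers)].append(((natural, secondary), value))
    output = []
    for part in partitions:
        # Comparator: order by natural key first, then by secondary key.
        part.sort(key=lambda kv: kv[0])
        # Group on the natural key so one reduce call sees all of its values,
        # already ordered by the secondary key.
        for natural, group in groupby(part, key=lambda kv: kv[0][0]):
            output.append((natural, [value for _key, value in group]))
    return output


RECORDS = [(("A", 3), "a3"), (("B", 1), "b1"), (("A", 1), "a1"),
           (("A", 2), "a2"), (("B", 0), "b0")]
result = dict(secondary_sort(RECORDS))
```

If the partitioner hashed the whole composite key instead, ("A", 1) and ("A", 2) could land on different reducers and the per-key ordering would be lost.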
HADOOP Architecture
① HDFS & ② HDMR

HDFS
① Files are stored as chunks (blocks) on datanodes
② Reliability: each chunk is replicated (3 times replication)
③ Access & metadata coordinated via: name node (MASTER NODE)

Name Node responsibilities
- keep track of the file directory structure
- keep track of block addresses in the slave nodes

Reading
① App sends a request to the Name Node to read a particular file
② File's block addresses are found; Namenode directs the client to read
③ Client reads the blocks from the relevant nodes

Writing (similar concepts)
→ Master Node returns the block addresses to be written to
→ client writes the first block to a node
→ then that node writes to the NEXT node (REPLICA)
For multiple write requests:
① First blocks are written in parallel
② Next replicas are created sequentially
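The name node's bookkeeping can be sketched as a toy Python class: it splits a file into 128 MB blocks, records three datanode addresses per block, and answers read requests with those addresses. `ToyNameNode` and its round-robin placement are assumptions for illustration only; real HDFS uses its own placement policy and RPC protocol.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB blocks, as in the notes
REPLICATION = 3                 # each block is replicated 3 times


class ToyNameNode:
    """Hypothetical in-memory stand-in for the HDFS name node's metadata."""

    def __init__(self, datanodes):
        if len(datanodes) < REPLICATION:
            raise ValueError("need at least as many datanodes as replicas")
        self.datanodes = list(datanodes)
        self.files = {}       # filename -> list of blocks, each a list of nodes
        self._next_node = 0   # round-robin cursor (assumed placement policy)

    def allocate(self, filename, size_bytes):
        """Write path: return the block addresses the client should write to."""
        num_blocks = max(1, -(-size_bytes // BLOCK_SIZE))  # ceiling division
        blocks = []
        for _ in range(num_blocks):
            replicas = [self.datanodes[(self._next_node + i) % len(self.datanodes)]
                        for i in range(REPLICATION)]
            self._next_node += 1
            blocks.append(replicas)
        self.files[filename] = blocks
        return blocks

    def locate(self, filename):
        """Read path: return the block addresses; the client reads from the nodes."""
        return self.files[filename]


name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
blocks = name_node.allocate("log.txt", 300 * 1024 * 1024)  # 300 MB -> 3 blocks
```

In this sketch the first node in each replica list is the one the client writes to; the remaining entries stand for the replicas that are then created node to node.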
Similarity Metrics
① Euclidean distance: $D(A,B) = \sqrt{\sum_i (a_i - b_i)^2}$
② Manhattan distance: $D(A,B) = \sum_i |a_i - b_i|$
③ COSINE similarity: $S(a,b) = \cos\theta = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}$ (dot product / product of lengths)
④ Jaccard similarity (sets): $S(A,B) = \frac{|A \cap B|}{|A \cup B|}$

Relational DBs in MapReduce
① Projection
   Mapper: take in ALL tuples → emit tuples with only the selected attributes present, as (key, attributes)
② Selection
   Mapper: take in tuples, filtered through a predicate (e.g. price > 10) → emit only those tuples that pass the predicate
③ Group by
   Mapper: take in all tuples → emit tuples with the GROUP BY attribute being the key: (att, values)
   Shuffle: all the tuples with the same key are grouped together
   Reducer: from the tuple list, aggregates can be calculated, e.g. AVG(price), COUNT

Relational Joins
① All possible (unique) combinations of matching tuples, e.g. (SG, CS1000), (SG, CS2020), (MY, CS1000), (MY, CS2020)
② IN MAPREDUCE
   1) BROADCAST / MAP-side join → high memory cost
   2) REDUCE-side / common join → tuples tagged with the TABLE NAME

Document Similarity Process
① Shingles (shingle length K)
   e.g. "the cat is glad" → (the cat), (cat is), (is glad)
   Naive: comparing shingles directly → expensive
② MinHashing
   minhash = min value chosen among the hashed shingles of a document
   $\Pr[h(C_1) = h(C_2)] = \mathrm{sim}(C_1, C_2)$
   use N hash funcs → produce a signature of N values
③ Compare signature pairs to each other
   number of matching values from N indicates similarity
   set threshold → candidate pairs produced
   further compare candidates to double check
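The four similarity metrics in these notes (Euclidean, Manhattan, cosine, Jaccard) translate directly into a few lines of Python. This is a plain standard-library sketch; returning 1.0 for the Jaccard similarity of two empty sets is a convention assumed here.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def jaccard(set_a, set_b):
    """Size of the intersection over size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:          # assumed convention for two empty sets
        return 1.0
    return len(set_a & set_b) / len(union)
```

Note that the first two are distances (smaller means more alike) while the last two are similarities (larger means more alike).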
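Selection, projection and group-by, as described in these notes, each fit a mapper (plus a reducer for group-by). The Python sketch below runs them through a tiny in-memory driver; `run_mapreduce`, the sample table and the `price > 10` predicate are illustrative assumptions.

```python
from collections import defaultdict


def run_mapreduce(tuples, mapper, reducer=None):
    """Tiny driver: map every tuple, group by key, then reduce (if given)."""
    mapped = [pair for t in tuples for pair in mapper(t)]
    if reducer is None:                      # map-only job
        return [value for _key, value in mapped]
    groups = defaultdict(list)               # shuffle: same key -> same list
    for key, value in mapped:
        groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


def selection_mapper(t):
    """Selection: emit only tuples that pass the predicate."""
    if t["price"] > 10:
        yield t["item"], t


def projection_mapper(t):
    """Projection: emit tuples with only the selected attributes."""
    yield t["item"], {"item": t["item"], "price": t["price"]}


def group_by_mapper(t):
    """Group by: the GROUP BY attribute becomes the key."""
    yield t["category"], t["price"]


def avg_reducer(key, prices):
    """Aggregate computed from the grouped list, here AVG(price)."""
    yield key, sum(prices) / len(prices)


ITEMS = [  # made-up sample table
    {"item": "pen", "category": "office", "price": 5},
    {"item": "desk", "category": "office", "price": 120},
    {"item": "tea", "category": "food", "price": 15},
]
selected = run_mapreduce(ITEMS, selection_mapper)
projected = run_mapreduce(ITEMS, projection_mapper)
averages = run_mapreduce(ITEMS, group_by_mapper, avg_reducer)
```

Selection and projection need no reducer at all, which is why only group-by involves a shuffle.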
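The two join strategies named in these notes, reduce-side (common) join and broadcast (map-side) join, can be contrasted in a short Python sketch. The table contents and the join key "cs" are made up for illustration, and both functions are simplified in-memory stand-ins rather than Hadoop code.

```python
from collections import defaultdict


def reduce_side_join(table_a, table_b):
    """Map: tag each row with its table name and key it by the join key.
    Reduce: for each key, emit all combinations of A-rows and B-rows."""
    groups = defaultdict(list)
    for key, row in table_a:
        groups[key].append(("A", row))
    for key, row in table_b:
        groups[key].append(("B", row))
    joined = []
    for key in sorted(groups):
        rows_a = [row for name, row in groups[key] if name == "A"]
        rows_b = [row for name, row in groups[key] if name == "B"]
        joined.extend((key, a, b) for a in rows_a for b in rows_b)
    return joined


def broadcast_join(large_table, small_table):
    """Map-side join: every mapper holds the whole small table in memory
    (hence the high memory cost) and joins while streaming the large one."""
    lookup = defaultdict(list)
    for key, row in small_table:
        lookup[key].append(row)
    return [(key, a, b) for key, a in large_table for b in lookup.get(key, [])]


TABLE_A = [("cs", "SG"), ("cs", "MY"), ("ma", "SG")]   # made-up rows
TABLE_B = [("cs", "CS1000"), ("cs", "CS2020")]

reduce_side = reduce_side_join(TABLE_A, TABLE_B)
broadcast = broadcast_join(TABLE_A, TABLE_B)
```

The broadcast version skips the shuffle entirely, which is its appeal when one table is small enough to fit in each mapper's memory.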
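The document-similarity process in these notes (shingles, then minhash signatures, then comparing signatures) can be sketched in Python as follows. The hash family is an assumption: each of the N hash functions is simulated by salting SHA-1 with a seed. The share of matching signature positions then estimates the Jaccard similarity of the shingle sets.

```python
import hashlib


def shingles(text, k=2):
    """Set of word shingles of length k."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """Assumed hash family: SHA-1 salted with the seed, truncated to 64 bits."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(shingle_set, num_hashes=200):
    """For each hash function keep the minimum hash value over the set."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


doc_a = shingles("the cat is glad")
doc_b = shingles("the cat is sad")
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
estimate = estimate_similarity(sig_a, sig_b)
exact = len(doc_a & doc_b) / len(doc_a | doc_b)   # true Jaccard: 2 / 4
```

Pairs whose estimate exceeds a chosen threshold become candidates, and only those are compared shingle by shingle to double check.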
k-Means Algo (repeat until done)

a) NAIVE

IN MAPPER
- input: cluster table (cluster IDs, centroid positions) and a point P
① compare P with the centroid positions
② closest centroid found
③ extend the point with a count of 1 (for counting)
④ emit [cluster ID, extended point]

IN REDUCER
① input: K = cluster ID, list(V) = list of extended points
   e.g. B, { ([0,1], 1), ([1,1], 1), ([7,8], 1) }
② initialise S → ([0,0], 0)
③ sum the extended points with S
④ obtain new centroid: [S1, S2] / S3
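One iteration of the naive version can be sketched in Python as below. The mapper extends each point with a count of 1 and emits it under the closest cluster ID; the reducer sums the extended points and divides by the count. The function names, cluster IDs and sample coordinates are illustrative assumptions.

```python
from collections import defaultdict


def closest_centroid(point, centroids):
    """Return the ID of the centroid with the smallest squared distance."""
    def sq_dist(cluster_id):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[cluster_id]))
    return min(centroids, key=sq_dist)


def kmeans_map(point, centroids):
    """Emit (cluster ID, extended point), where the extension is a count of 1."""
    return closest_centroid(point, centroids), (point, 1)


def kmeans_reduce(cluster_id, extended_points):
    """Sum points and counts, then divide to obtain the new centroid."""
    total = [0.0] * len(extended_points[0][0])
    count = 0
    for point, n in extended_points:
        total = [s + x for s, x in zip(total, point)]
        count += n
    return cluster_id, tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    groups = defaultdict(list)            # stands in for the shuffle phase
    for point in points:
        cluster_id, extended = kmeans_map(point, centroids)
        groups[cluster_id].append(extended)
    updated = dict(centroids)             # clusters with no points keep their centroid
    for cluster_id, extended_points in groups.items():
        updated[cluster_id] = kmeans_reduce(cluster_id, extended_points)[1]
    return updated


CENTROIDS = {"A": (0, 0), "B": (5, 5)}            # made-up cluster table
POINTS = [(0, 1), (1, 1), (7, 8), (6, 6)]
new_centroids = kmeans_iteration(POINTS, CENTROIDS)
```

Repeating `kmeans_iteration` with the returned table until the centroids stop moving gives the full algorithm.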
With local aggregation in the mapper

IN MAPPER
- input: cluster table (cluster IDs, centroid positions) and a point P
① compare P with the centroid positions
② closest centroid found
③ sum the point into a hash table keyed by cluster ID: [point sum, count]
④ IN CLEAN UP: only emit the smaller, locally aggregated point sum per cluster, rather than each point

IN REDUCER
① input: K = cluster ID, list(V) = list of extended points (local sums)
② initialise S → ([0,0], 0)
③ sum with S
④ obtain new centroid: [S1, S2] / S3
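The locally aggregating mapper can be sketched in Python as a class that keeps a hash table of running sums and only emits in its clean-up step. `CombiningMapper`, `cleanup` and the sample data are hypothetical names and values chosen for illustration.

```python
def closest_centroid(point, centroids):
    """Return the ID of the centroid with the smallest squared distance."""
    def sq_dist(cluster_id):
        return sum((p - c) ** 2 for p, c in zip(point, centroids[cluster_id]))
    return min(centroids, key=sq_dist)


class CombiningMapper:
    """Mapper that aggregates locally and emits once, at clean-up."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.partial = {}  # cluster ID -> (sum vector, count)

    def map(self, point):
        cluster_id = closest_centroid(point, self.centroids)
        total, count = self.partial.get(cluster_id, ([0.0] * len(point), 0))
        self.partial[cluster_id] = ([s + x for s, x in zip(total, point)], count + 1)

    def cleanup(self):
        # One extended point (sum, count) per cluster instead of one per input point.
        return [(cid, (tuple(total), count)) for cid, (total, count) in self.partial.items()]


def kmeans_reduce(cluster_id, partial_sums):
    """Sum the partial sums and counts, then divide to get the new centroid."""
    total = [0.0] * len(partial_sums[0][0])
    count = 0
    for vector, n in partial_sums:
        total = [s + x for s, x in zip(total, vector)]
        count += n
    return cluster_id, tuple(s / count for s in total)


CENTROIDS = {"A": (0, 0), "B": (5, 5)}                 # made-up cluster table
SPLITS = [[(0, 1), (1, 1), (7, 8)], [(6, 6)]]          # one list per map task

emitted = []
for split in SPLITS:
    mapper = CombiningMapper(CENTROIDS)
    for point in split:
        mapper.map(point)
    emitted.extend(mapper.cleanup())

grouped = {}
for cluster_id, extended in emitted:
    grouped.setdefault(cluster_id, []).append(extended)
new_centroids = dict(kmeans_reduce(cid, sums) for cid, sums in grouped.items())
```

Here four input points produce only three emitted records, and the reducer logic is unchanged because it already sums counts rather than assuming each record is a single point.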