PPT - PMI Baltimore Chapter

Cloud Computing:
What a Project Manager
Needs to Know
Dr. Patrick D. Allen, PMP
Patrick.allen@jhuapl.edu
Purpose
 Provide Project Managers with the very
basics of the two primary types of Clouds
and Cloud Computing, and the questions
they should ask when Clouds and their
project intersect
Overview
 “Computing as a Service” Clouds
 Questions PMs should ask
 “Data-Focused” Clouds
 Relational Databases vs Clouds
 Map-Reduce and Accumulo examples
 Questions PMs should ask
 General Cloud questions PMs should ask
 The importance of risk assessments
What’s a Cloud?
 Two primary definitions of Clouds presented today:
1. Compute-power as a Service (Utility Cloud; VMs)
 Infrastructure as a Service or
 Platforms as a Service or
 Software as a Service
2. A Data-focused Cloud that also runs on VMs
 E.g., Hadoop Distributed File System (HDFS) and data processing
 A third emerging type is a “Data Storage” Cloud
 PMs need to make sure everyone understands which
type is being discussed
 If participants each have a different type in mind, confusion
results and expectations will not be met
First Type: Computing as a Service
 Instead of using your own computers, you use a third party’s computers at another location (e.g., AWS’s EC2)
 Usually all the same hardware, with a variety of Virtual
Machine (VM) configurations to meet customer needs
 When hardware dies, it is seamlessly replaced
 All hardware and infrastructure and physical security
headaches are the responsibility of the Third Party
 You’re responsible for secure comms to and from the
data stores and the security on the machines you use
 You only pay for what you use (memory, computing
power or number of virtual machines used)
 Great for surge-type activities, such as the census
that’s run every ten years, or new venture start-ups
 Virtual private clouds are available for better security
First Type: Questions PMs Should Ask –1
 What’s the cost per unit of data stored (cents per gigabyte)?
 What’s the cost for the number of VMs used?
 How secure or private is my data when I store it on a
third-party platform?
 What security or privacy guarantees are provided?
 Will the PII be adequately protected?
 Can I test Cloud security before I put real data there?
 Would a Cloud be useful for my Continuity of
Operations (COOP) plans?
 It depends. Do your employees already regularly
perform remote operations like teleworking? Do you
have a re-routing plan to get them to the Cloud?
 Am I starting a new business with limited investment?
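The two cost questions at the top of this slide can be turned into a rough monthly estimate. The sketch below is only an illustration: the function name, its parameters and every rate used are placeholders, not real provider prices, and real bills usually include other charges such as data transfer.

```python
def monthly_cloud_cost(stored_gb, cents_per_gb, vm_count, hours_per_vm, dollars_per_vm_hour):
    """Rough monthly bill in dollars: storage charge plus VM usage charge."""
    storage = stored_gb * cents_per_gb / 100.0               # cents -> dollars
    compute = vm_count * hours_per_vm * dollars_per_vm_hour  # pay only for hours used
    return storage + compute


# Placeholder figures (assumptions for illustration, NOT real prices)
estimate = monthly_cloud_cost(
    stored_gb=500, cents_per_gb=2.0,
    vm_count=4, hours_per_vm=100, dollars_per_vm_hour=0.10,
)
print(f"Estimated monthly cost: ${estimate:.2f}")
```

Asking the provider for the actual rates and plugging them in gives a first-order comparison against the cost of running your own hardware.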
First Type: Questions PMs Should Ask – 2
 Can you store classified data on a cloud?
 If a properly secured government-accredited private cloud, Maybe
 If you are planning to use a Third-Party service, Maybe
 As a minimum, use a virtual private cloud (e.g., AWS VPC)
 And located entirely in the U.S. (not distributed world wide)
 Probably need to limit access to selected personnel at the
service provider site (like no foreign access in US Gov Cloud)
 US-Gov-only Cloud important for data under export control
 Need your security department’s approval, which includes your
plan and vetting the provider
 Probably need to do penetration testing before use, including
checks against “side-channel attacks”
 Not sure if this is yet being used for more than unclassified but
sensitive data
 For either case, always get a cyber security expert to prepare a risk
assessment, and for classified data, a proper accreditation
2nd Type: Data-Focused Cloud–Definitions
 Huge Data: Petabytes or larger amounts of data
 HDFS is the Hadoop Distributed File System (more on this later)
 Relational Database: Think rows and columns, densely
populated (like a spreadsheet)
 Structured non-relational databases: Cloud-based structured
data technologies like Accumulo and HBase running on HDFS
 Can be densely or sparsely populated
 Tend to use flexible labels of length three to six (more later)
 Many different types of data that may have some overlapping
elements, but not the same across all types of data
 If put into rows and columns it would be a huge table only
sparsely populated
Relational Database Example
Name             Address        Age  Height
John Smith       Washington DC  35   5’10”
Jane Doe         Baltimore      29   5’8”
Fred Flintstone  Rockville      55   4’10”
Tony D. Tiger    Battle Creek   67   6’2”
Elmer Fudd       DeForest       60   4’6”
Peter Parker     New York       28   5’5”
Bruce Wayne      Gotham         36   6’1”
Roger Rabbit     Fantasyland    41   4’0”
Peter Rabbit     Rural Address  118  1’1”
White Rabbit     Wonderland     135  1’11”
Find the Names of those of Age >25 but <60, and > 5’ tall
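In a relational database, the question above is a single query over one table. The following is a minimal sketch using Python's built-in sqlite3 module; the table name, column names and the choice to store height in inches are assumptions made for this illustration.

```python
import sqlite3

# Rows from the slide; height converted to inches so it can be compared numerically.
PEOPLE = [
    ("John Smith", "Washington DC", 35, 70),
    ("Jane Doe", "Baltimore", 29, 68),
    ("Fred Flintstone", "Rockville", 55, 58),
    ("Tony D. Tiger", "Battle Creek", 67, 74),
    ("Elmer Fudd", "DeForest", 60, 54),
    ("Peter Parker", "New York", 28, 65),
    ("Bruce Wayne", "Gotham", 36, 73),
    ("Roger Rabbit", "Fantasyland", 41, 48),
    ("Peter Rabbit", "Rural Address", 118, 13),
    ("White Rabbit", "Wonderland", 135, 23),
]


def find_names(rows):
    """Return names of people with 25 < age < 60 who are taller than 5 feet."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, address TEXT, age INTEGER, height_in INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", rows)
    cursor = conn.execute(
        "SELECT name FROM people "
        "WHERE age > 25 AND age < 60 AND height_in > 60 ORDER BY name"
    )
    names = [row[0] for row in cursor]
    conn.close()
    return names


print(find_names(PEOPLE))
```

Because every row has every column filled in, the query needs no special handling for missing data.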
Sparse Data Example
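[Figure: partial records about the same people, scattered across four separate data sets]
Data sets: Medical Records, Drivers Licenses, Facebook, Dating Service
Record fragments shown:
John Smith
John Smith: Age 35, 5’10”, Washington DC
Jane Doe: Age 29, 5’8”, Baltimore
Peter Parker: Age 28
Bruce Wayne: Gotham
Peter Parker: 5’5”, New York
Bruce Wayne: 36, 6’1”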
Find the Names of those of Age >25 but <60, and > 5’ tall from multiple data sets
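With sparse data, no single data set holds all the fields the question needs, so the fragments have to be combined first. The sketch below shows one way to do that in plain Python. Which fragment belongs to which data set is an assumption made for illustration, as are the field names, and matching people by name alone is a simplification that real data would not allow.

```python
from collections import defaultdict

# Hypothetical assignment of the slide's record fragments to the four data sets.
DATA_SETS = {
    "Medical Records": [
        {"name": "John Smith", "age": 35, "height_in": 70, "city": "Washington DC"},
        {"name": "Peter Parker", "age": 28},
    ],
    "Drivers Licenses": [
        {"name": "Jane Doe", "age": 29, "height_in": 68, "city": "Baltimore"},
        {"name": "Peter Parker", "height_in": 65, "city": "New York"},
    ],
    "Facebook": [
        {"name": "John Smith"},
        {"name": "Bruce Wayne", "city": "Gotham"},
    ],
    "Dating Service": [
        {"name": "Bruce Wayne", "age": 36, "height_in": 73},
    ],
}


def merge_by_name(data_sets):
    """Combine every fragment about the same name into one record."""
    merged = defaultdict(dict)
    for records in data_sets.values():
        for record in records:
            merged[record["name"]].update(record)
    return dict(merged)


def matches(person):
    """True if 25 < age < 60 and height > 5 feet; missing fields never match."""
    age, height = person.get("age"), person.get("height_in")
    return age is not None and height is not None and 25 < age < 60 and height > 60


merged = merge_by_name(DATA_SETS)
print(sorted(name for name, person in merged.items() if matches(person)))
```

Peter Parker only satisfies the query after merging, since his age and height sit in different data sets.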
Accumulo Data Example
ID   Col. Family  Col. Qualifier  Time        Security  Value
001  Personal     Name            31 Apr ‘12  PII       John Smith
001  Personal     Age             31 Apr ‘12  PII       35
001  Personal     Height          31 Apr ‘12  PII       5’ 10”
001  Address      City            31 Apr ‘12  PII       Wash DC
001  Address      Street          31 Apr ‘12  PII       K Street
001  Address      Number          31 Apr ‘12  PII       810
002  Personal     Name            31 Apr ‘12  PII       Peter Parker
002  Personal     Age             31 Apr ‘12  PII       28
002  Personal     Height          31 Apr ‘12  PII       5’ 5”
002  Address      City            31 Apr ‘12  PII       New York
002  Address      Street          31 Apr ‘12  PII       72nd Street
002  Address      Number          31 Apr ‘12  PII       145
Find the Names of those of Age >25 but <60, and > 5’ tall
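The table above stores one value per entry, keyed by ID, column family, column qualifier, time and security label. The sketch below imitates that layout in plain Python to show how the same question is answered. It does not use the Accumulo API; the names Entry, scan and find_names are invented for this illustration, the time is kept as the slide's literal label, and the security check is reduced to a simple label match.

```python
from collections import defaultdict, namedtuple

Entry = namedtuple("Entry", "row family qualifier timestamp visibility value")

T = "31 Apr '12"  # time label copied from the slide, kept as an opaque string
TABLE = sorted([
    Entry("001", "Personal", "Name", T, "PII", "John Smith"),
    Entry("001", "Personal", "Age", T, "PII", "35"),
    Entry("001", "Personal", "Height", T, "PII", "5'10\""),
    Entry("001", "Address", "City", T, "PII", "Wash DC"),
    Entry("001", "Address", "Street", T, "PII", "K Street"),
    Entry("001", "Address", "Number", T, "PII", "810"),
    Entry("002", "Personal", "Name", T, "PII", "Peter Parker"),
    Entry("002", "Personal", "Age", T, "PII", "28"),
    Entry("002", "Personal", "Height", T, "PII", "5'5\""),
    Entry("002", "Address", "City", T, "PII", "New York"),
    Entry("002", "Address", "Street", T, "PII", "72nd Street"),
    Entry("002", "Address", "Number", T, "PII", "145"),
])  # entries are kept sorted by key


def height_to_inches(text):
    """Convert a height such as 5'10" to inches."""
    feet, inches = text.replace('"', "").split("'")
    return int(feet) * 12 + int(inches)


def scan(table, authorizations):
    """Yield only entries whose security label the caller is allowed to read."""
    for entry in table:
        if entry.visibility in authorizations:
            yield entry


def find_names(table, authorizations):
    """Names of people with 25 < age < 60 and height > 5 feet."""
    rows = defaultdict(dict)
    for entry in scan(table, authorizations):
        if entry.family == "Personal":
            rows[entry.row][entry.qualifier] = entry.value
    names = []
    for row_id in sorted(rows):
        cells = rows[row_id]
        if all(q in cells for q in ("Name", "Age", "Height")):
            if 25 < int(cells["Age"]) < 60 and height_to_inches(cells["Height"]) > 60:
                names.append(cells["Name"])
    return names


print(find_names(TABLE, {"PII"}))
```

A caller without the PII authorization gets an empty result, which is the point of carrying a security label on every entry.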
2nd Type: Data-Focused Cloud
 Also runs on a VM farm, but uses a “Hadoop” or “Sector”
file management system (Hadoop is most widely used)
 What does the Hadoop Distributed File System (HDFS) do for you?
 Lets you store huge amounts of non-relational data
 Automatically parallelizes the computations
 Automatically sorts results of the “map” step
 Handles all of the overhead associated with storing,
locating and processing your data
 Lets you run Map-Reduce programs and direct-access,
table-based searches using Hadoop
 Can find relationships not easily visible in unstructured
data and/or large amounts of data
2nd Type: Map-Reduce Program Example
 Find the number of people per household in census data
[Figure: Map-Reduce data flow]
Distributed databases of census data -> Map: count Household (HH) members -> Hadoop auto-sorts -> Reduce: add # of HH w/ N members, N = 1 to 25
Map output (labelled “Key = HH Size, Value = #”): HH001, 3; HH002, 6; HH003, 4; HH004, 3
After the automatic sort: HH001, 3; HH004, 3; HH003, 4; HH002, 6
Reduce output (labelled “Key = #, Value = Total”), one sum (S) per size: 1, 3.5 M; 2, 9.6 M; 3, 6.8 M; 4, 5.3 M
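The household example can be imitated on one machine to show what the map, sort and reduce steps each contribute. This is a plain-Python sketch, not Hadoop code: the function names and the member lists are invented for illustration, and the map step here emits (household size, 1) pairs directly so the reduce step only has to add them up.

```python
from itertools import groupby
from operator import itemgetter

# Four sample households with the sizes shown on the slide (member names are placeholders).
HOUSEHOLDS = {
    "HH001": ["a", "b", "c"],
    "HH002": ["a", "b", "c", "d", "e", "f"],
    "HH003": ["a", "b", "c", "d"],
    "HH004": ["a", "b", "c"],
}


def map_household(hh_id, members):
    """Map step: emit (household size, 1) for one household."""
    yield (len(members), 1)


def reduce_sizes(size, counts):
    """Reduce step: total number of households of a given size."""
    return (size, sum(counts))


def run(households):
    intermediate = []
    for hh_id, members in households.items():  # on a cluster, these calls run in parallel
        intermediate.extend(map_household(hh_id, members))
    intermediate.sort(key=itemgetter(0))  # stands in for Hadoop's automatic sort
    return [
        reduce_sizes(size, (count for _, count in group))
        for size, group in groupby(intermediate, key=itemgetter(0))
    ]


print(run(HOUSEHOLDS))
```

Each map call looks at one household only, which is why the work can be spread across many machines without coordination.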
2nd Type: Map-Reduce Pros and Cons
 Map-Reduce programs are a good fit when:
 You have huge data sets
 Your data can't be managed in a relational database
 You are not sure what types of queries you will want to run
 You want to summarize the results of independent processes that
can be applied to data in parallel
 Map-Reduce programs are not a good fit when:
 You can answer your questions with an existing relational
database in a reasonable amount of time (why bother with the
overhead of a Cloud?)
 Your data fits within a relational database AND the queries you
plan to run are fairly well-defined; THEN you probably don’t need
the overhead of a Cloud
2nd Type: Questions PMs Should Ask – 1
 Do I even need to use a Cloud?
 If you have well-structured reasonable amounts of data, stick with a
relational database UNLESS you just want the compute power on
demand (1st Type of Cloud presented)
 If it is required by external authorities (like a customer), yes
 Do I have a lot of "surge" events, where I only need to store and
process large amounts of data periodically?
 Then using a cloud makes sense
 Do I need to know how to write a Map-Reduce program or an
Accumulo Table to use a Cloud?
 No, you can use pre-defined programs, or have someone who
knows how to write new ones do it for you
 Do I need to know how to design a Map-Reduce program?
 No, but it helps so you can ask for realistic output from the Cloud
and really leverage the Cloud to solve your data problems
2nd Type: Questions PMs Should Ask – 2
 Do I have access to an existing Cloud I could use?
 If it meets your requirements, third-party Clouds work
 Check the “fine print” on the guarantees, and whether the
recourse offered is sufficient to match the cost to you if a
guarantee is not met
 Have a security expert do a risk assessment before committing
 Do I need to build my own instead?
 If you have security, privacy or proprietary needs not met by an
existing Cloud, might want to build your own
 Consider the ongoing maintenance costs (may be primary rationale
for moving to a Cloud)
 More automation reduces Cloud maintenance costs
General Cloud Questions for PMs
 Where is the Cloud located? Can it be restricted to U.S.?
 Who gets access to it?
 How are the communications to/from the cloud secured?
 How does it ingest its data?
 How does it store its data?
 How do they secure your data at rest?
 How does it delete its data? Can you test that it’s gone?
 Does it keep your data separate from other people's data?
 Do you need/want a virtual private cloud instead?
 How often is the hardware upgraded?
 How many versions of VMs can you choose from?
 Has a security expert performed a risk assessment?
Summary Observations
 Cloud computing is here to stay
 Many more projects in the future will encounter Clouds in
some way that will impact the project
 Need to be aware of the strengths and limitations of
Clouds and whether they are appropriate for your project
 You may not have a choice whether or not to use a Cloud
 This briefing listed some of the basic questions you
should ask as appropriate to your project
 Hopefully some of the mystery (and hype) of the Cloud
has been dispelled by this talk
 It is useful to be able to design a Map-Reduce program so
your expectations of the output are realistic
 Always do a cyber risk assessment on a Cloud you plan
to use
Contact Info
Dr. Patrick D. Allen
Johns Hopkins University Applied Physics Lab
11100 Johns Hopkins Road
MS 21-N246
Laurel, MD 20723-6099
443-778-9915 v
443-778-3838 f
Patrick.allen@jhuapl.edu
Back-up: Terminology Relationship
                        APACHE                                 GOOGLE
Structured Data         Accumulo (on HDFS)                     Bigtable
Map Reduce Environment  Hadoop (Map Reduce)                    Map Reduce
File System             Hadoop Distributed File System (HDFS)  Google File System (GFS)
Back-up: Sample Map Reduce Program
 Map algorithm
Map (key: sourceURL, value: text) {
    for each (targetURL in text)
        EmitIntermediate (targetURL, sourceURL);
}
 Reduce algorithm
Reduce (key: targetURL, values: sourceURLs) {
    sourceList = [];
    for each (u in sourceURLs)
        add u to sourceList;
    Emit (targetURL, sourceList);
}
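The pseudocode above can be tried out on a single machine. The sketch below is a plain-Python imitation, not Hadoop code: the function names, the regular expression used to spot links and the two sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict

LINK = re.compile(r"https?://\S+")  # simplistic link detector, for illustration only


def map_links(source_url, text):
    """Map step: emit (targetURL, sourceURL) for every link found in the text."""
    for target_url in LINK.findall(text):
        yield (target_url, source_url)


def reduce_links(target_url, source_urls):
    """Reduce step: collect all sources that point to one target."""
    return (target_url, list(source_urls))


def reverse_links(docs):
    grouped = defaultdict(list)
    for source_url, text in docs.items():
        for target, source in map_links(source_url, text):
            grouped[target].append(source)
    # Sorting by target stands in for the framework's sort between map and reduce.
    return dict(reduce_links(t, s) for t, s in sorted(grouped.items()))


# Hypothetical documents keyed by their own URL
DOCS = {
    "http://url1": "see http://a and http://b",
    "http://url2": "see http://a and http://c",
}
print(reverse_links(DOCS))
```

The output maps each target URL to the list of pages that link to it, which is the answer to "for each URL, find all the pages that point to it".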
Back-up: Map Reduce Example 2
For each URL, find all the pages that point to it.
[Figure: Map-Reduce data flow for finding the pages that point to each URL]
Map: for each document (Doc 1, Doc 2, through Doc 10^9), find the targets for that source and emit targetURL – sourceURL pairs
Pairs shown: targetURL a – URL1; targetURL b – URL1; targetURL a – URL2; targetURL c – URL2; targetURL b – URL10^9; targetURL c – URL10^9; targetURL d – URL10^9
Sort: the targetURL – sourceURL list is sorted by targetURL
Reduce: create the list of sources for targetURL a, targetURL b, targetURL c and targetURL d