Uploaded by Kayla Morris

Data Science Review

Data Science
Foundations
Big Data: the dynamic, large and disparate volumes of data being created by people, tools, and machines. It requires new,
innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to
derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced
shareholder value.
Machine Learning: a subset of AI that uses computer algorithms to analyze data and make intelligent decisions based on what
it has learned, without being explicitly programmed. Machine learning algorithms are trained with large sets of data and they
learn from examples. They do not follow rules-based algorithms. Machine learning is what enables machines to solve
problems on their own and make accurate predictions using the provided data.
Deep learning: subset of machine learning; uses layered neural networks to simulate human decision-making. Deep learning
algorithms can label and categorize information and identify patterns. It is what enables AI systems to continuously learn on
the job and improve the quality and accuracy of results by determining whether decisions were correct.
Data Science vs Data Scientist
 A "data scientist" is an individual who finds solutions to problems by analyzing data of various sizes using the
appropriate and available tools and then tells a story to their major stakeholders.
 "Data science" is the art of taking Big Data and telling a story with it in order to reveal insights that assist
organizations in making strategic decisions
The “V”s of Big Data
• Velocity – speed at which data accumulates (it never stops)
• Volume – scale of the data (increase in the amount stored)
• Variety – diversity of the data (structured = rows/columns, unstructured = videos/tweets), which comes from different
sources; an estimated 80% is unstructured
• Veracity – quality and origin of the data, and its conformity to facts and accuracy
• Value – the ability and need to turn data into value
Data Science Writing Reports:
1. Cover page
2. Table of contents
3. Executive Summary
4. Introductory section
5. Methodology section
6. Results section
7. Discussion section
8. Conclusion section
9. References
10. Acknowledgment
Terminology: “Data Sets” - Collection of data
 Data structure
o Tabular data (table)
o Hierarchical (tree)
o Network (graph – like connections on social media)
o Raw data (like images, etc)
 Private vs open data
o Open data: datacatalogs.org, Kaggle, datasetsearch on google
o Community data license agreement
 CDLA-sharing – can use, but must share under same licensing terms
 CDLA-permissive – can use, no obligations
 Data asset eXchange (DAX)
Methodology for Data Science
From Problem to Approach:
1. What is the problem you are trying to solve?
2. How can you use data to answer the question (what are you REALLY asking? Do you want to decrease costs or
increase productivity etc)
 Business understanding:
o Ask the RIGHT question
o Get “buy in” from stakeholders
 Analytic approach – how can you use the data to answer the question:
o Probabilities = predictive
o Show relationships = descriptive
o Yes/no = classification
Working with the Data:
3. What data is needed to answer the question? Remember that costs vs benefits can be a factor here
4. Where is the data sourced from and how will you get it?
5. Is the data representative of the problem to be solved?
6. What additional work is needed to manipulate and use the data?
 Data Understanding: “which ingredients are required to make the perfect recipe”
o Who, what, when, where, why, how
 Data Understanding – collection:
 Data Understanding – descriptive statistics (increase data quality):
o Does any data appear to be missing, if so does this mean something important?
 Data preparation: “features” = characteristics that solve the problem – finding them and pulling them out can
help
Driving the Answer:
7. How can we visualize this data to get and show the answer?
8. Does the model REALLY answer the question or does it need to be adjusted? Do we need different data etc.?
9. Can you put the model into practice?
10. Can you get constructive feedback using this model?
Tools for Data Science
1. Fully integrated – no programming needed
 Open source: Knime, orange
 Commercial: Watson studio with Watson openscale (through cloud), h2o.ai
 Cloud: Azure Machine learning
2. Execution environments:
 Open source: apache spark (most used) , apache flink (stream processing – processes real time data streams)
3. Data asset management/data lineage (data needs to be annotated with metadata)
 Open source: Apache atlas, egeria, kylo
 Commercial: informatica, IBM infosphere Information Governance
 Cloud: amazon webservices DB, cloudant, DB2
4. Data Management
• Open source: MySQL, PostgreSQL, MongoDB, CouchDB, Apache Cassandra, Hadoop File System (HDFS), Ceph, Elasticsearch (text data)
• Commercial: Oracle, SQL Server, IBM Db2
5. Data integration/transformation (ETL or ELT) – data refining/cleaning
 Open source: Apache airflow, apache kafka, Kubeflow, sparkSQL, node-red
 Commercial: informatica, IBM infosphere Datastage, talend, Watson studio data refinery
 Cloud: Informatica, data refinery
o Data refinery (makes it easy to cleanse)
6. Data visualization
 Open source: Hue, kibana, apache superset
 Commercial: tableau, Microsoft Power BI, IBM cognos analytics, Watson studio desktop
 Cloud: datameer, cognos, Watson studio
7. Model building
 Commercial: SPSS modeler and modeler flow, Sas Miner,
 Cloud: Watson machine learning, AI platform training (google)
8. Model deployment
 Open source: PredictionIO, Seldon, mleap, tensorflow/tensorflowlite
 Commercial: SPSS
 Cloud: SPSS modeler, Watson machine learning
9. Model Monitoring/assessment
• Open source: ModelDB, Prometheus, AI Fairness 360, Adversarial Robustness 360 (helps prevent attacks and makes the
model more robust), AI Explainability 360 (explains the model)
 Cloud: amazon sagemaker model monitor, Watson openscale
10. Code asset management/version management:
 Open source: github, gitlab, bitbucket
11. Development environment:
• Open source:
o Jupyter (Python) and JupyterLab (next gen): two-process model with a client (the interface, in the browser, through
which a person sends code) and a kernel (executes the code and sends the result back for display). Jupyter notebooks
hold your code, metadata, contents and outputs
o Apache Zeppelin, RStudio, Spyder
• Commercial: Watson Studio
Libraries– collection of functions and methods that let you do things without writing the
code yourself
Python
 SCIENTIFIC COMPUTING LIBRARIES
o Pandas (data structures and tools)
o Numpy (pandas built on this – for arrays and matrices)
 VISUALIZATION LIBRARIES
o Matplotlib (plots and graphs)
o Seaborn (based on matplotlib - heat maps, time series, violin plots)
 HIGH-LEVEL MACHINE LEARNING AND DEEP LEARNING
o Sci-kit learn (built on numpy, scipy and matplotlib for statistical modeling including regression, classification,
clustering, etc)
o Keras (deep learning neural networks)
 DEEP LEARNING LIBRARIES
o Tensor flow (deep learning: production and deployment)
o pyTorch(deep learning: regression, classification…)
Libraries in other languages
• APACHE SPARK – general-purpose cluster-computing framework; can be used from many languages, including Scala and Scala
libraries like Vegas and BigDL
• R LIBRARIES – dplyr (data manipulation), stringr (string manipulation), ggplot/plotly/lattice/leaflet (data viz), caret
(machine learning); to install, use the command install.packages("package name")
Languages of data science:
• Python, R, SQL (we will see these used in the following) - recommended first
• Scala, Java, C++, Julia are some of the most popular
o Java – Hadoop: manages data processing and storage for big data applications running in clustered systems
• Go, Ruby, Visual Basic also have benefits
Python
Why Python is Great





It’s a high-level general-purpose programming language that can be applied to many different classes of problems.
It has a large, standard library that provides tools suited to many different tasks, including but not limited to
databases, automation, web scraping, text processing, image processing, machine learning, and data analytics.
For data science, you can use Python's scientific computing libraries such as Pandas, NumPy, SciPy, and Matplotlib.
For artificial intelligence, it has TensorFlow, PyTorch, Keras, and Scikit-learn.
Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK).
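Comments
• Use a "#" to put a comment in a line of code; Python ignores it when the code runs. It is really good practice to write
out what you are doing and why, just in case anyone else looks at the code
• OR use three quotation marks (""" hello world """) for MULTI-LINE comments; strictly speaking this is a string that is
never assigned to anything, so Python reads it and then discards it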
Types
• Text Type: str (string – sequence of characters like numbers, letters or symbols)
• Numeric Types: int (no decimal), float (decimal), complex (real and imaginary part, written with j, e.g. 3+5j)
• Sequence Types: list [], tuple (), range
• Mapping Type: dict {'apple': 'banana'}
• Set Types: set {'apple', 'banana'}, frozenset({'apple', 'banana'})
• Boolean Type: bool (Boolean values = True (1), False (0); must be capitalized!)
• Binary Types: bytes, bytearray, memoryview
Python collections (arrays):
o List is a collection which is ordered and changeable. Allows duplicate members.
o Tuple is a collection which is ordered and unchangeable. Allows duplicate members.
o Set is a collection which is unordered and unindexed. No duplicate members.
o Dictionary is a collection which is ordered (insertion order, as of Python 3.7) and changeable. No duplicate keys.
Expressions = Math Operations
a + b addition (sum of a and b)
-a (negative number of a or unary negation)
a-b (b subtracted from a)
a*b ( multiplication – product of a and b)
a / b (a divided by b; always gives a FLOAT answer)
a // b (floor division: a divided by b, rounded down to a whole number)
a ** b (a raised to the power of b)
+a (unary plus; leaves the number unchanged)
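The operators above can be tried out in a few lines. This is a minimal sketch; the values 7 and 2 are arbitrary and are not taken from the notes.

```python
a, b = 7, 2  # arbitrary example values

total = a + b        # addition
difference = a - b   # subtraction
product = a * b      # multiplication
quotient = a / b     # true division: the result is always a float
floored = a // b     # floor division: rounds down to a whole number
power = a ** b       # a raised to the power of b
negated = -a         # unary negation

print(total, difference, product, quotient, floored, power, negated)

# Floor division rounds down, not toward zero, which matters for negatives.
negative_floor = -7 // 2
print(negative_floor)
```

Note that `/` returns a float even when the division is exact, so `4 / 2` is `2.0`, not `2`.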
Variables
• Names:
o Stores your values, for example: my_variable = 1 stores 1 under the name my_variable for you to use later
o Cannot start with a number and can only contain letters, numbers and underscores
o Can assign multiple values at a time (i.e. x, y, z = 'banana', 'apple', 'cherry', after which print(x) shows banana, etc.)
• Output:
o print() function
• Global variables: create the variable OUTSIDE a function and you can then use it INSIDE any function; a variable
created inside a function is local to that function. You can also create or change a global variable from inside a
function by declaring it with the global keyword
• Casting:
o Done through constructor functions such as int(), float() and str()
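A minimal sketch of local vs global variables and of casting follows. The names counter, read_counter and increment are made up for illustration.

```python
counter = 0  # global variable: created outside any function


def read_counter():
    # A function can read a global variable without declaring anything.
    return counter


def increment():
    # The global keyword is needed to assign to the global name;
    # without it, "counter = ..." would create a new local variable.
    global counter
    counter = counter + 1


increment()
increment()
print(read_counter())

# Casting with constructor functions
as_int = int("42")       # string to integer
as_float = float("3.5")  # string to float
as_text = str(10)        # integer to string
truncated = int(9.99)    # int() drops the fractional part, it does not round
print(as_int, as_float, as_text, truncated)
```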
Strings
• Strings are arrays of characters
o Since they are arrays, you can loop through them with a FOR loop
• Square brackets can be used to access elements of a string (i.e. a = 'hello world', print(a[0]) results in h)
• Length uses the len() function (i.e. a = 'hi', print(len(a)) results in 2)
• Can check whether text is present using IF statements with the IN keyword:
txt = "The best things in life are free!"
if "free" in txt:
    print("Yes, 'free' is present.")
• Concatenate strings:
o Use the (+) operator: a = 'hi', b = 'there', c = a + b, print(c) results in hithere; to add a space between them use c = a + " " + b
o Can't combine strings and numbers through the (+) operator; convert the number first with str()
STRING INDEX: the position of each element in the string (for example 'Kayla' has indexes 0, 1, 2, 3, 4, OR, counting from the end, negative indexes -5, -4, -3, -2, -1)
STRING METHODS
capitalize()     Converts the first character to upper case
casefold()       Converts string into lower case
center()         Returns a centered string
count()          Returns the number of times a specified value occurs in a string
encode()         Returns an encoded version of the string
endswith()       Returns True if the string ends with the specified value
expandtabs()     Sets the tab size of the string
find()           Searches the string for a specified value and returns the position where it was found (-1 if not found)
format()         Formats specified values in a string
format_map()     Formats specified values in a string, taking them from a mapping such as a dictionary
index()          Searches the string for a specified value and returns the position where it was found (error if not found)
isalnum()        Returns True if all characters in the string are alphanumeric
isalpha()        Returns True if all characters in the string are in the alphabet
isdecimal()      Returns True if all characters in the string are decimals
isdigit()        Returns True if all characters in the string are digits
isidentifier()   Returns True if the string is an identifier
islower()        Returns True if all characters in the string are lower case
isnumeric()      Returns True if all characters in the string are numeric
isprintable()    Returns True if all characters in the string are printable
isspace()        Returns True if all characters in the string are whitespaces
istitle()        Returns True if the string follows the rules of a title
isupper()        Returns True if all characters in the string are upper case
join()           Joins the elements of an iterable into one string, using this string as the separator
ljust()          Returns a left justified version of the string
lower()          Converts a string into lower case
lstrip()         Returns a left trim version of the string
maketrans()      Returns a translation table to be used in translations
partition()      Returns a tuple where the string is parted into three parts
replace()        Returns a string where a specified value is replaced with a specified value
rfind()          Searches the string for a specified value and returns the last position where it was found (-1 if not found)
rindex()         Searches the string for a specified value and returns the last position where it was found (error if not found)
rjust()          Returns a right justified version of the string
rpartition()     Returns a tuple where the string is parted into three parts, around the last occurrence of the separator
rsplit()         Splits the string at the specified separator, starting from the right, and returns a list
rstrip()         Returns a right trim version of the string
split()          Splits the string at the specified separator, and returns a list
splitlines()     Splits the string at line breaks and returns a list
startswith()     Returns True if the string starts with the specified value
strip()          Returns a trimmed version of the string
swapcase()       Swaps cases, lower case becomes upper case and vice versa
title()          Converts the first character of each word to upper case
translate()      Returns a translated string
upper()          Converts a string into upper case
zfill()          Fills the string with 0 values at the beginning until it reaches the specified length
STRING STRIDING
STRING SLICING
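The notes give no example for slicing and striding, so here is a minimal sketch that reuses the 'Kayla' string from the index example. A slice is written [start:stop], where stop is not included, and a stride adds a third number, [start:stop:step].

```python
name = "Kayla"

first = name[0]            # index 0 is the first character
last = name[-1]            # index -1 is the last character
middle = name[1:4]         # slice: characters at indexes 1, 2 and 3
every_other = name[::2]    # stride: every second character
backwards = name[::-1]     # a stride of -1 walks the string in reverse

print(first, last, middle, every_other, backwards)
```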
ESCAPE SEQUENCE
• \n = new line in the text in your string
• \t = tab in your string
• \\ = puts an actual backslash in your string
Tuple ( ) Immutable = ordered and cannot be changed, allows duplicates
• Comma separated: a tuple with a single item MUST HAVE A TRAILING COMMA TO BE A TUPLE, i.e. ("apple",). Items can be
any data type, and one tuple can contain different data types
• Concatenate – combine (+ to add tuples and * to multiply tuples); tuples can be nested within tuples
• Check if something is in a tuple:
thistuple = ("apple", "banana", "cherry")
if "apple" in thistuple:
    print("Yes, 'apple' is in the fruits tuple")
**The only way to change something in a tuple is to convert it to a list, alter the list, and convert it back to a tuple
UNPACK TUPLE
fruits = ("apple", "banana", "cherry")
(green, yellow, red) = fruits
print(green) results in apple
print(yellow) results in banana
print(red) results in cherry
LOOP TUPLE
o Use “FOR” loop
thistuple = ("apple", "banana", "cherry")
for x in thistuple:
    print(x)
o Through index – use range() and len()
thistuple = ("apple", "banana", "cherry")
for i in range(len(thistuple)):
    print(thistuple[i])
o Use “WHILE” loop
thistuple = ("apple", "banana", "cherry")
i = 0
while i < len(thistuple):
    print(thistuple[i])
    i = i + 1
JOIN TUPLES by adding two tuples together with “+”, or multiply with “*”, which just repeats the tuple that many times
TUPLE METHODS
count() Returns the number of times a value occurs in the tuple
index() Searches the tuple for a value and returns the position where it was first found
Tuple items can also be accessed with negative indexes and with ranges of indexes
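The tuple rules above, sketched in code: the trailing comma for a one-item tuple, the list round trip for "changing" a tuple, and the two tuple methods. The fruit values follow the notes' examples; the variable names are made up.

```python
fruits = ("apple", "banana", "cherry", "apple")

single = ("apple",)      # the trailing comma makes this a one-item tuple
not_a_tuple = ("apple")  # without the comma this is just a string

# A tuple cannot be changed in place, so convert to a list, edit, convert back.
as_list = list(fruits)
as_list[1] = "kiwi"
changed = tuple(as_list)

apple_count = fruits.count("apple")      # how many times the value occurs
cherry_position = fruits.index("cherry")  # position of the first match

joined = single + ("pear",)  # + concatenates tuples
repeated = single * 2        # * repeats a tuple

print(changed, apple_count, cherry_position, joined, repeated)
```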
Lists [ ]
Mutable = can change, allow duplicates, can be any data type and different data types together
NOTABLE FUNCTIONS:
• len() function for the number of elements
• range() function
LOOP
• Use “for” loop to loop through list
thislist = ["apple", "banana", "cherry"]
for x in thislist:
    print(x)
• Use range() function and len() function to create an iterable and loop through the items by their index number
thislist = ["apple", "banana", "cherry"]
for i in range(len(thislist)):
    print(thislist[i])
• Use “while” loop and len() function; remember! Always increase the index by 1 after each iteration!!!
thislist = ["apple", "banana", "cherry"]
i = 0
while i < len(thislist):
    print(thislist[i])
    i = i + 1
LISTS METHODS
append() Adds an element at the end of the list
clear() Removes all the elements from the list
copy() Returns a copy of the list; since a list can be changed, the copy lets you change it without changing the
original list. You can also make a new list with the list() function
count() Returns the number of elements with the specified value
extend() Adds the elements of a list (or any iterable) to the end of the current list
list1 = ["a", "b", "c"]
list2 = [1, 2, 3]
list1.extend(list2)
print(list1)
index() Returns the index of the first element with the specified value
insert() Adds an element at the specified position
list() Built-in function (not a list method) that can be used to make a copy of a list [i.e. mylist = list(thislist)]
pop() Removes the element at the specified position and returns it (the last element if no position is given)
remove() Removes the first item with the specified value
reverse() Reverses the order of the list
sort() Sorts the list in ascending order
To sort descending use argument (reverse=True)
Sorting is case sensitive (upper case letters come before lower case); solve this by using argument (key=str.lower)
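A short sketch of the list methods above, with attention to the sort() arguments. The list contents are arbitrary; each sort is done on a copy so that the original list stays unchanged.

```python
thislist = ["banana", "Cherry", "apple"]

thislist.append("kiwi")        # add at the end
thislist.insert(1, "Orange")   # add at position 1
removed = thislist.pop()       # remove and return the last element
thislist.remove("Orange")      # remove the first matching value

default_order = thislist.copy()
default_order.sort()                   # case sensitive: upper case first

ignore_case = thislist.copy()
ignore_case.sort(key=str.lower)        # compare lower-cased versions

descending = thislist.copy()
descending.sort(key=str.lower, reverse=True)

print(removed, default_order, ignore_case, descending)
```

Because sort() changes the list in place and returns None, it is called as a statement rather than assigned to a variable.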
Dictionaries { } with keys and values
• Used to store data values in key:value PAIRS; a dictionary is ordered (insertion order, as of Python 3.7), changeable
and does not allow duplicate keys. Values can be any data type
• Example: thisdict = {"brand": "ford", "model": "mustang", "year": 1964}
FUNCTIONS TO USE WITH DICTIONARY:
• Length using len()
• type()
ACCESSING ITEMS IN DICTIONARY:
• Referring to the key name, i.e. x = thisdict["model"] (note the square brackets)
• .get() will give the same result, i.e. x = thisdict.get("model")
• .keys() will return a view of all the keys in the dictionary
• .values() will return a view of all the values in the dictionary
• .items() will return a view of all the key:value pairs, each as a tuple
• Check if a key exists with the IN keyword: if "model" in thisdict: print("Yes, this key is in this dictionary"); on
its own, "model" in thisdict returns a True or False Boolean
CHANGE ITEMS IN DICTIONARY:
• Refer to its key: thisdict["year"] = 2018 will change it from 1964 to 2018
• .update(): thisdict.update({"year": 2018})
ADD/REMOVE ITEMS IN DICTIONARY:
• Add:
o Make a new key: thisdict["color"] = "red" will add a new key:value pair
o .update() will do the same: thisdict.update({"color": "red"})
• Remove:
o .pop() removes the item with the specified key name
o .popitem() removes the last inserted item
o del keyword removes the item with the specified key name, i.e. del thisdict["model"]; del thisdict deletes the
whole dictionary
o .clear() empties the dictionary
LOOP THROUGH A DICTIONARY:
• FOR loop: for x in thisdict: print(x) prints all the keys in the dictionary one by one
• FOR loop: for x in thisdict: print(thisdict[x]) prints all the values one by one
• .values() method: for x in thisdict.values(): print(x) prints all values
• .keys() method: for x in thisdict.keys(): print(x) prints all keys
• .items() method: for x, y in thisdict.items(): print(x, y) prints keys and values
COPY DICTIONARY:
• .copy() method: mydict = thisdict.copy() print(mydict)
• dict() function: mydict = dict(thisdict) print(mydict)
NESTED DICTIONARY – either create one dictionary that has 3 nested dictionaries within it OR create 3 separate
dictionaries and combine them into one new dictionary
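Both ways of building a nested dictionary can be sketched as below. The family data (names and years) is invented for illustration and is not from the notes.

```python
# Option 1: write the nested dictionaries inline.
myfamily = {
    "child1": {"name": "Brady", "year": 2015},
    "child2": {"name": "Kyle", "year": 2018},
}

# Option 2: create a separate dictionary and add it under a new key.
child3 = {"name": "Kayla", "year": 2020}
myfamily["child3"] = child3

# Chain square brackets to reach a nested value.
second_name = myfamily["child2"]["name"]

# .items() yields (key, value) pairs; here each value is itself a dictionary.
for key, child in myfamily.items():
    print(key, child["name"], child["year"])

# .get() returns None for a missing key instead of raising an error.
missing = myfamily.get("child9")

# .copy() is a shallow copy: the nested dictionaries are shared, not duplicated.
shallow = myfamily.copy()
print(second_name, missing, shallow["child1"] is myfamily["child1"])
```

Because the copy is shallow, changing `shallow["child1"]["name"]` would also change the name seen through `myfamily`.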
DICTIONARY METHODS
clear()       Removes all the elements from the dictionary
copy()        Returns a copy of the dictionary
fromkeys()    Returns a dictionary with the specified keys and value
get()         Returns the value of the specified key
items()       Returns a view containing a tuple for each key-value pair
keys()        Returns a view containing the dictionary's keys
pop()         Removes the element with the specified key
popitem()     Removes the last inserted key-value pair
setdefault()  Returns the value of the specified key. If the key does not exist: insert the key, with the specified value
update()      Updates the dictionary with the specified key-value pairs
values()      Returns a view of all the values in the dictionary
Sets { }
• Used to store multiple items in a single variable, can be any data type and multiple types, it is unordered and
unindexed
• Constructor = set()
• Length of set with len() function
ACCESS ITEMS:
• Find a specific item using the “IN” keyword
thisset = {"apple", "banana", "cherry"}
print("banana" in thisset)
• Loop through the set using the “FOR” loop
thisset = {"apple", "banana", "cherry"}
for x in thisset:
    print(x)
ADDING/REMOVING ITEMS – YOU CANNOT CHANGE AN ITEM IN A SET, BUT YOU CAN ADD OR REMOVE ITEMS
• Add:
- Add specific elements by using the .add() method
- Add the items of another set with the .update() method, i.e. seta.update(setb) adds the items of setb to seta
• Remove:
- .remove() method: thisset.remove("banana"); if the item doesn't exist it will raise an error
- .discard() method: thisset.discard("banana"); if the item doesn't exist it will NOT raise an error
- .pop() will remove an arbitrary item; sets are unordered, so you won't know which item gets removed
thisset = {"apple", "banana", "cherry"}
x = thisset.pop()
print(x)
print(thisset)
• .clear() empties the set
• del “name of set” will delete the set altogether
LOOP SETS
• Use “FOR” loop, i.e. for x in thisset: print(x) will print all the elements in the set
JOIN SETS
• union() = returns a new set with all the values from the sets you combined
• update() = adds the values from the set you choose into the other set
**both union and update exclude duplicates!!
• intersection_update() keeps only the items present in BOTH sets, i.e. the items the sets have in common [i.e.
x.intersection_update(y)]
• intersection() creates a NEW set that contains only the items the sets have in common [i.e. z = x.intersection(y)]
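The difference between the methods that return a new set and the ones that change a set in place can be sketched as follows. The set contents are arbitrary example values.

```python
x = {"apple", "banana", "cherry"}
y = {"google", "microsoft", "apple"}

combined = x.union(y)       # new set; "apple" appears only once
common = x.intersection(y)  # new set with the items found in both
only_x = x.difference(y)    # new set with the items of x that are not in y

# The *_update variants change the set they are called on and return None.
x_in_place = x.copy()
x_in_place.intersection_update(y)

y_in_place = y.copy()
y_in_place.update(x)

print(len(combined), common, only_x, x_in_place, len(y_in_place))
```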
SET METHODS
add()                          Adds an element to the set
clear()                        Removes all the elements from the set
copy()                         Returns a copy of the set
difference()                   Returns a set containing the difference between two or more sets
difference_update()            Removes the items in this set that are also included in another, specified set
discard()                      Removes the specified item (no error if the item is missing)
intersection()                 Returns a set that is the intersection of two other sets
intersection_update()          Removes the items in this set that are not present in other, specified set(s)
isdisjoint()                   Returns True if the two sets have no items in common
issubset()                     Returns whether another set contains this set or not
issuperset()                   Returns whether this set contains another set or not
pop()                          Removes and returns an arbitrary element from the set
remove()                       Removes the specified element (error if the element is missing)
symmetric_difference()         Returns a set with the symmetric differences of two sets
symmetric_difference_update()  Inserts the symmetric differences from this set and another
union()                        Returns a set containing the union of sets
update()                       Updates the set with the union of this set and others
Conditions and Branching
LOGIC OPERATORS AKA CONDITIONS:
Equals: a == b
Not Equals: a != b
Less than: a < b
Less than or equal to: a <= b
Greater than: a > b
Greater than or equal to: a >= b
• How these are used for branching – if/elif/else:
o “if statements” and loops
• IF
a = 33
b = 200
if b > a:
    print("b is greater than a")
• ELIF – if the previous conditions were NOT true, then try this condition
a = 33
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
• ELSE – catches anything that wasn’t caught by the previous conditions
a = 200
b = 33
if b > a:
    print("b is greater than a")
elif a == b:
    print("a and b are equal")
else:
    print("a is greater than b")
• Or can have an ELSE statement without the ELIF
a = 200
b = 33
if b > a:
    print("b is greater than a")
else:
    print("b is not greater than a")
o Shorthand IF
if a > b: print("a is greater than b")
o Shorthand if…else (ternary operators or conditional expressions)
a = 2
b = 330
print("A") if a > b else print("B")
o AND/OR – combine the conditions above with and/or (i.e. if a > b and c > a:)
o Nested if – can have if statements inside if statements, each optionally ending with an else statement
o Pass statement – use pass if you need to leave the body of an if statement empty but don’t want to get an error
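The last three items (and/or, nested if, pass) have no example in the notes; a minimal sketch follows. All values and variable names are arbitrary.

```python
a, b, c = 200, 33, 500

# and: both conditions must be true
if a > b and c > a:
    both = True
else:
    both = False

# or: at least one condition must be true
either = a > b or a > c

# Nested if: an if statement inside another if statement
x = 41
if x > 10:
    if x > 20:
        size = "above 20"
    else:
        size = "between 11 and 20"
else:
    size = "10 or below"

# pass: placeholder so that an empty body does not cause an error
if b > a:
    pass

print(both, either, size)
```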
Loops
• WHILE Loops – can execute a set of statements as long as the condition is TRUE
o Print i as long as i is less than 6
i = 1
while i < 6:
    print(i)
    i += 1
o You must set the variable first (we set it to 1) and you MUST increment i or the loop will go on forever
o Break statement can be used to stop the loop even if the condition is still true (i.e. if i == 3: break)
o Continue statement can be used to stop the current iteration and continue with the next one
- Continue to the next iteration if i is 3:
i = 0
while i < 6:
    i += 1
    if i == 3:
        continue
    print(i)
o Use the else statement to run code once when the condition is no longer true:
i = 1
while i < 6:
    print(i)
    i += 1
else:
    print("no")
• FOR Loops - used for iterating over a sequence (that is either a list, a tuple, a dictionary, a set, or a string).
o The for loop does not require an indexing variable to be set beforehand.
o Break statement
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    print(x)
    if x == "banana":
        break
o Continue statement
fruits = ["apple", "banana", "cherry"]
for x in fruits:
    if x == "banana":
        continue
    print(x)
o range() function
for x in range(2, 30, 3):
    print(x)
o ELSE in FOR loop **else will not be executed if the loop is stopped by a break statement**
for x in range(6):
    if x == 3:
        break
    print(x)
else:
    print("Finally finished!")
o Nested Loops: The "inner loop" will be executed one time for each iteration of the "outer loop"
adj = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
for x in adj:
    for y in fruits:
        print(x, y)
o Pass statement – placeholder for an empty loop body so you don’t get an error
Functions
A function is a block of code which only runs when it is called. You can pass data, known as parameters, into a function. A
function can return data as a result.
CREATING A FUNCTION
• A Python function is defined by using the def keyword:
def my_function():
    print("Hello from a function")
• To call the function, use the function name you created followed by parentheses (i.e. my_function())
• Arguments: Information can be passed into functions as arguments. Arguments are specified after the function name,
inside the parentheses. You can add as many arguments as you want, just separate them with a comma.
def my_function(fname):
    print(fname + " Refsnes")

my_function("Emil")
my_function("Tobias")
my_function("Linus")


• Parameter vs Argument:
o A parameter is the variable listed inside the parentheses in the function definition.
o An argument (args) is the value that is sent to the function when it is called.
- You must call the function with as many arguments as it expects, i.e. for def myfun(fname, lname): the call
must be myfun("Kayla", "Morris")
- If you don’t know how many arguments you will have, put a * before the parameter (i.e. def myfun(*names)); the
function then receives a tuple of arguments, i.e. myfun("kayla", "kyle", "brady")
o Keyword arguments (kwargs) let you send arguments with key=value syntax [i.e. def myfun(child1, child2):
myfun(child1="brady", child2="kyle")]
- If the number of keyword arguments is unknown, put two asterisks (**) before the parameter; the function then
receives a dictionary of arguments
o You can pass a list as an argument
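The four ways of passing arguments listed above can be sketched side by side. The function names are made up for illustration; the argument values follow the notes' examples.

```python
def full_name(fname, lname):
    # Two parameters, so the call must supply exactly two arguments.
    return fname + " " + lname


def greet_all(*names):
    # *names collects any number of positional arguments into a tuple.
    return ["Hello " + name for name in names]


def describe_children(**children):
    # **children collects any number of keyword arguments into a dictionary.
    return {key: value.title() for key, value in children.items()}


def total(values):
    # A list (or any other object) can be passed as a single argument.
    return sum(values)


print(full_name("Kayla", "Morris"))
print(full_name(lname="Morris", fname="Kayla"))  # keyword order does not matter
print(greet_all("kayla", "kyle", "brady"))
print(describe_children(child1="brady", child2="kyle"))
print(total([1, 2, 3]))
```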
• Return values: use the return statement to send a result back to the caller (i.e. this returns 5 times whatever you
pass to my_function)
def my_function(x):
    return 5 * x
• Recursion – function can call itself: Recursion is a common mathematical and programming concept. It means that a
function calls itself. This has the benefit of meaning that you can loop through data to reach a result.
o Be very careful with recursion as it can be quite easy to slip into writing a function which never terminates, or
one that uses excess amounts of memory or processor power.
o In this example, tri_recursion() is a function that we have defined to call itself ("recurse"). We use the k
variable as the data, which decrements (-1) every time we recurse. The recursion ends when the condition is
not greater than 0 (i.e. when it is 0).
def tri_recursion(k):
    if k > 0:
        result = k + tri_recursion(k - 1)
        print(result)
    else:
        result = 0
    return result

print("\n\nRecursion Example Results")
tri_recursion(6)
LAMBDA
• A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can
only have one expression. Why use them? The power of lambda is better shown when you use them as an
anonymous function inside another function.
o Example: Use that function definition to make a function that always doubles the number you send in:
def myfunc(n):
    return lambda a: a * n

mydoubler = myfunc(2)
print(mydoubler(11))
Result: 22
Exception Handling (Try/Except)
• The try block lets you test a block of code for errors.
• The except block lets you handle the error.
• The finally block lets you execute code, regardless of the result of the try and except blocks.
• try...except...else: the else block runs only when the try block raised no error
• raise lets you throw an error yourself when a condition you choose occurs
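All five keywords can be seen together in a small sketch. The functions safe_divide and check_age and the calls list are hypothetical, made up for illustration.

```python
calls = []  # records every attempted division, successful or not


def safe_divide(a, b):
    try:
        result = a / b            # code that might fail
    except ZeroDivisionError:
        return None               # handle the error
    else:
        return result             # runs only if no error was raised
    finally:
        calls.append((a, b))      # always runs, even when returning


def check_age(age):
    # raise lets you signal an error for a condition you define yourself.
    if age < 0:
        raise ValueError("age cannot be negative")
    return age


print(safe_divide(6, 3))
print(safe_divide(1, 0))
print(calls)

try:
    check_age(-1)
except ValueError as err:
    print("Caught:", err)
```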
Classes and Objects
Python is an object oriented programming language. A Class is like an object constructor, or a "blueprint" for creating objects.

CREATE A CLASS
o Use the ‘class’ keyword (i.e. class MyClass:) and define a property underneath it, e.g. x = 5
o Create an object with p1 = MyClass(); now you can print the value of x with print(p1.x)
o For classes to be useful they need a function called __init__() to initialize the object, for example:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person("John", 36)
print(p1.name)
print(p1.age)
o Objects can also contain methods, which are functions that belong to the object. Create a method in the
Person class: insert a function that prints a greeting and execute it on the p1 object:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def myfunc(self):
        print("Hello my name is " + self.name)

p1 = Person("John", 36)
p1.myfunc()
o The self parameter is a reference to the current instance of the class, and is used to access variables that
belong to the class. It does not have to be named self; you can call it whatever you like, but it has to be the
first parameter of any method in the class
o You can:
 Modify object properties (i.e. p1.age=40)
 Delete object properties or object (i.e. del p1.age) or (del p1)
INHERITANCE: Inheritance allows us to define a class that inherits all the methods and properties from another class.
o Parent class is the class being inherited from, also called base class.
o Child class is the class that inherits from another class, also called derived class.
1. Create a parent class: define the class, define the function to perform, create a Person object and execute the
function
2. Create the child class: with the parent class defined above, define the new child class (the parent class goes in
the parentheses after the child class name), then create a Student object and execute
3. Use the super() function to inherit all the methods and properties from the parent class
4. Add properties and methods to get the final version: define the parent class and its function, define the child
class, call super() in the child class to inherit the parent class attributes and then add its own properties below,
add methods, then create an object and execute
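The code for the four inheritance steps did not survive in the notes, so the following is a reconstruction sketch under assumptions: Person and Student come from the step labels, while the property and method names (firstname, lastname, graduationyear, full_name, welcome) are invented.

```python
# Step 1: the parent class
class Person:
    def __init__(self, fname, lname):
        self.firstname = fname
        self.lastname = lname

    def full_name(self):
        return self.firstname + " " + self.lastname


# Step 2: the child class names its parent in parentheses
class Student(Person):
    def __init__(self, fname, lname, year):
        # Step 3: super() runs the parent's __init__ so that the
        # inherited properties are set up
        super().__init__(fname, lname)
        # Step 4: add the child's own property...
        self.graduationyear = year

    # ...and its own method
    def welcome(self):
        return "Welcome " + self.full_name() + " to the class of " + str(self.graduationyear)


student = Student("Kayla", "Morris", 2024)
print(student.full_name())  # method inherited from Person
print(student.welcome())    # method defined on Student
```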


CREATE AN ITERATOR: an iterator is an object that contains a countable number of values and can be iterated upon.
Technically, it is an object that implements the iterator protocol, which consists of the methods __iter__() and
__next__()
o As you have learned in the Python Classes/Objects chapter, all classes have a function called __init__(), which
allows you to do some initializing when the object is being created.
o The __iter__() method acts similarly: you can do operations (initializing etc.), but it must always return the
iterator object itself.
o The __next__() method also allows you to do operations, and must return the next item in the sequence.
o StopIteration: to prevent the iteration from going on forever, raise the StopIteration exception. In the
__next__() method, add a terminating condition that raises it once the iteration has been done a specified number of
times (in this case 20)
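The iterator example itself is missing from the notes; a minimal sketch of a class that counts from 1 to 20 and then stops could look like this. The class name MyNumbers and the attribute a are assumptions.

```python
class MyNumbers:
    def __iter__(self):
        # Initialize the state and return the iterator object itself.
        self.a = 1
        return self

    def __next__(self):
        # Return the next value, or signal the end after 20 values.
        if self.a <= 20:
            value = self.a
            self.a += 1
            return value
        raise StopIteration


# Manual use: iter() calls __iter__(), next() calls __next__().
numbers = iter(MyNumbers())
first = next(numbers)
second = next(numbers)

# A for loop (or list()) keeps calling next() until StopIteration is raised.
values = list(MyNumbers())
print(first, second, values)
```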
Working with Data in Python
Reading and Writing Files
MODULE






Module – create a module by saving your code in a file with the .py extension, so you have a file of the functions you want to use in other applications
You can use the module by importing it (i.e. import mymodule)
When using a function from the module, call it as module_name.function_name()
You can create an alias when you import a module by using 'as' (import mymodule as mx)
The built-in dir() function lists all the function names (or variable names) in a module
You can also import just the names you want by using 'from' (from mymodule import person1)
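The import forms above can be tried without creating a file by using a standard-library module. In this sketch the math module stands in for the notes' hypothetical mymodule; that substitution is an assumption made so the example runs on its own.

```python
import math as m             # alias, like "import mymodule as mx"
from math import pi, sqrt    # import only the names you want

print(m.floor(2.7))          # call through the alias: module_name.function_name()
names = dir(m)               # list of all names defined in the module
print("sqrt" in names)
print(sqrt(16), pi)          # imported names are used without the module prefix
```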
FILE HANDLING
 open() function is the key – modes for opening:
o "r" - Read - Default value. Opens a file for reading, error if the file does not exist
o "a" - Append - Opens a file for appending, creates the file if it does not exist
o "w" - Write - Opens a file for writing (overwrites existing content), creates the file if it does not exist
o "x" - Create - Creates the specified file, returns an error if the file exists
 Can also state whether the file is text ("t", the default) or binary ("b", e.g. images)
 Example: f = open("demofile.txt", "rt") – read in text mode
READ FILES – USE BUILT-IN OPEN() AND READ()
 If the file is located in the same folder as the Python script:
f = open("demofile.txt", "r")
print(f.read())
 If located somewhere else on the computer (escape every backslash in the path):
f = open("D:\\myfiles\\welcome.txt", "r")
print(f.read())
 Read parts = .read(number of characters)
 Read lines = .readline()  reads first line
o Calling it twice will read the first two lines, etc.
 Loop through the lines
f = open("demofile.txt", "r")
for x in f:
    print(x)
 ALWAYS CLOSE THE FILE  file_name.close()
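One way to make sure the file is always closed is a with block, which closes it automatically. The sketch below first writes a small temporary file so it can run anywhere; the temporary folder and the file contents are assumptions for illustration.

```python
import os
import tempfile

# Create a small stand-in for demofile.txt in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "demofile.txt")
with open(path, "w") as f:
    f.write("Hello!\nWelcome to demofile.txt\nGood luck!\n")

with open(path, "r") as f:
    first_five = f.read(5)                          # read only the first 5 characters
    rest_of_line = f.readline()                     # read the rest of the current line
    remaining = [line.rstrip("\n") for line in f]   # loop through the remaining lines

print(first_five)
print(remaining)
print(f.closed)   # the with block closed the file, no f.close() needed
```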
WRITE/CREATE FILES
o Write with .write() after opening the file in "a" mode (append to the end) or "w" mode (overwrite the existing content)
f = open("demofile2.txt", "a")
f.write("Now the file has more content!")
f.close()
o Create new
o To create a new file in Python, use the open() function with one of the following parameters:
o "x" - Create - will create a file, returns an error if the file exists
o "a" - Append - will create a file if the specified file does not exist
o "w" - Write - will create a file if the specified file does not exist
DELETE FILES
o Import the os module and delete the file with os.remove()
import os
os.remove("demofile.txt")
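The create, append, write, and delete modes can be compared side by side. This is a sketch that works in a temporary folder (an assumption, so that no real file is touched); checking os.path.exists() before os.remove() avoids an error when the file is missing.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demofile2.txt")

with open(path, "x") as f:      # "x" creates the file, error if it already exists
    f.write("first line\n")
with open(path, "a") as f:      # "a" adds to the end of the file
    f.write("Now the file has more content!\n")
with open(path, "r") as f:
    after_append = f.read()

with open(path, "w") as f:      # "w" overwrites whatever was there
    f.write("overwritten\n")
with open(path, "r") as f:
    after_write = f.read()

try:
    open(path, "x")             # the file exists now, so "x" fails
    create_failed = False
except FileExistsError:
    create_failed = True

if os.path.exists(path):        # delete only if the file is there
    os.remove(path)
deleted = not os.path.exists(path)
print(after_append, after_write, create_failed, deleted)
```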
Python References
Date/time
A date in Python is not a data type of its own, but we can import a module named datetime to work with dates as date objects
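A minimal sketch of the datetime workflow: import the module, create a date object, and format it with strftime() directives such as %Y or %j. The sample date (31 December 2018, 17:41) is an assumption chosen to match most of the example values listed in this section.

```python
import datetime

# Create a datetime object: year, month, day, hour, minute, second
x = datetime.datetime(2018, 12, 31, 17, 41, 0)

print(x.strftime("%Y-%m-%d %H:%M:%S"))   # 2018-12-31 17:41:00
print(x.strftime("%I:%M"))               # 05:41 (12-hour clock)
print(x.strftime("%j"))                  # 365 (day number of the year)
print(x.strftime("%A, %B %d"))           # weekday and month names depend on the locale

# strptime() goes the other way: it parses text using the same directives
parsed = datetime.datetime.strptime("12/31/18", "%m/%d/%y")
print(parsed.year)                       # 2018
```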
1. Import datetime
2. Create variable/function
3. Change directive (the format code passed to strftime())
Directive | Description | Example
%a | Weekday, short version | Wed
%A | Weekday, full version | Wednesday
%w | Weekday as a number 0-6, 0 is Sunday | 3
%d | Day of month 01-31 | 31
%b | Month name, short version | Dec
%B | Month name, full version | December
%m | Month as a number 01-12 | 12
%y | Year, short version, without century | 18
%Y | Year, full version | 2018
%H | Hour 00-23 | 17
%I | Hour 01-12 | 05
%p | AM/PM | PM
%M | Minute 00-59 | 41
%S | Second 00-59 | 08
%f | Microsecond 000000-999999 | 548513
%z | UTC offset | +0100
%Z | Timezone | CST
%j | Day number of year 001-366 | 365
%U | Week number of year, Sunday as the first day of week, 00-53 | 52
%W | Week number of year, Monday as the first day of week, 00-53 | 52
%c | Local version of date and time | Mon Dec 31 17:41:00 2018
%x | Local version of date | 12/31/18
%X | Local version of time | 17:41:00
%% | A % character | %
%G | ISO 8601 year | 2018
%u | ISO 8601 weekday (1-7) | 1
%V | ISO 8601 weeknumber (01-53) | 01
Keywords
Keyword | Description
and | A logical operator
as | To create an alias
assert | For debugging
break | To break out of a loop
class | To define a class
continue | To continue to the next iteration of a loop
def | To define a function
del | To delete an object
elif | Used in conditional statements, same as else if
else | Used in conditional statements
except | Used with exceptions, what to do when an exception occurs
False | Boolean value, result of comparison operations
finally | Used with exceptions, a block of code that will be executed no matter if there is an exception or not
for | To create a for loop
from | To import specific parts of a module
global | To declare a global variable
if | To make a conditional statement
import | To import a module
in | To check if a value is present in a list, tuple, etc.
is | To test if two variables refer to the same object
lambda | To create an anonymous function
None | Represents a null value
nonlocal | To declare a non-local variable
not | A logical operator
or | A logical operator
pass | A null statement, a statement that will do nothing
raise | To raise an exception
return | To exit a function and return a value
True | Boolean value, result of comparison operations
try | To make a try...except statement
while | To create a while loop
with | Used to simplify exception handling (runs a block inside a context manager, e.g. an open file)
yield | To hand back a value from a generator function, pausing it until the next value is requested
Built-in Python Functions
Function | Description
abs() | Returns the absolute value of a number
all() | Returns True if all items in an iterable object are true
any() | Returns True if any item in an iterable object is true
ascii() | Returns a readable version of an object. Replaces non-ASCII characters with escape characters
bin() | Returns the binary version of a number
bool() | Returns the boolean value of the specified object
bytearray() | Returns an array of bytes
bytes() | Returns a bytes object
callable() | Returns True if the specified object is callable, otherwise False
chr() | Returns a character from the specified Unicode code
classmethod() | Converts a method into a class method
compile() | Returns the specified source as an object, ready to be executed
complex() | Returns a complex number
delattr() | Deletes the specified attribute (property or method) from the specified object
dict() | Returns a dictionary
dir() | Returns a list of the specified object's properties and methods
divmod() | Returns the quotient and the remainder when argument1 is divided by argument2
enumerate() | Takes a collection (e.g. a tuple) and returns it as an enumerate object
eval() | Evaluates and executes an expression
exec() | Executes the specified code (or object)
filter() | Use a filter function to exclude items in an iterable object
float() | Returns a floating point number
format() | Formats a specified value
frozenset() | Returns a frozenset object
getattr() | Returns the value of the specified attribute (property or method)
globals() | Returns the current global symbol table as a dictionary
hasattr() | Returns True if the specified object has the specified attribute (property/method)
hash() | Returns the hash value of a specified object
help() | Executes the built-in help system
hex() | Converts a number into a hexadecimal value
id() | Returns the id of an object
input() | Allows user input
int() | Returns an integer number
isinstance() | Returns True if a specified object is an instance of a specified class
issubclass() | Returns True if a specified class is a subclass of a specified class
iter() | Returns an iterator object
len() | Returns the length of an object
list() | Returns a list
locals() | Returns an updated dictionary of the current local symbol table
map() | Returns an iterator with the specified function applied to each item
max() | Returns the largest item in an iterable
memoryview() | Returns a memory view object
min() | Returns the smallest item in an iterable
next() | Returns the next item from an iterator
object() | Returns a new object
oct() | Converts a number into an octal
open() | Opens a file and returns a file object
ord() | Returns the integer Unicode code of the specified character
pow() | Returns the value of x to the power of y
print() | Prints to the standard output device
property() | Gets, sets, deletes a property
range() | Returns a sequence of numbers, starting from 0 and incrementing by 1 (by default)
repr() | Returns a readable version of an object
reversed() | Returns a reversed iterator
round() | Rounds a number
set() | Returns a new set object
setattr() | Sets an attribute (property/method) of an object
slice() | Returns a slice object
sorted() | Returns a sorted list
staticmethod() | Converts a method into a static method (usually written as the @staticmethod decorator)
str() | Returns a string object
sum() | Sums the items of an iterable
super() | Returns an object that represents the parent class
tuple() | Returns a tuple
type() | Returns the type of an object
vars() | Returns the __dict__ property of an object
zip() | Returns an iterator of tuples built from two or more iterables
Pandas
Pandas is a library for working with data. It has functions for analyzing, cleaning, exploring, and manipulating data. Once you have Python and pip installed on your computer, you just need to install pandas (C:\Users\Your Name>pip install pandas) and import the library as pd (import pandas as pd) to get started. A distribution such as Anaconda can help since it already has pandas installed.


Series  .Series() – it's like a column in a table (one-dimensional array)
o Create a series from a list a = [1, 2, 3]  myvar = pd.Series(a). If no labels are given, the index is 0, 1, 2 like usual
o Can create labels for the elements, for example myvar = pd.Series(a, index=["x", "y", "z"]), so 1, 2, 3 are now labeled x, y, z
o You can also turn a key/value object into a series: calories = {"day1": 100, "day2": 200}, then myvar = pd.Series(calories)
DataFrame  .DataFrame() – it's like the whole table (a two-dimensional structure of rows and columns)
o To locate a row within a DataFrame, use the loc attribute with the row index  print(df.loc[0])
 Multiple indexes  print(df.loc[[0, 1]])
 Create named indexes  pass the index argument to pd.DataFrame()
 Locate based on named indexes  print(df.loc["day2"])
READ CSV OR JSON
o Load the CSV into a DataFrame with pd.read_csv() (for JSON, pd.read_json())  use to_string() to get the whole DataFrame
o **BY DEFAULT, print(df) ON A LARGE DATAFRAME ONLY SHOWS THE FIRST AND LAST 5 ROWS** – use to_string() to print everything
o The only difference with JSON: if the JSON is already in a Python dictionary, you can load it right into the DataFrame
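The Series, DataFrame, and CSV notes above can be sketched in one short script. The column names and values are made-up sample data, and the CSV is read from an in-memory string (via io.StringIO) rather than a real file so the example runs on its own.

```python
import io
import pandas as pd

# Series with custom labels
a = [1, 2, 3]
myvar = pd.Series(a, index=["x", "y", "z"])
print(myvar["y"])

# Series from a key/value object: the keys become the labels
calories = {"day1": 100, "day2": 200}
cal_series = pd.Series(calories)

# DataFrame with named indexes
data = {"calories": [420, 380, 390], "duration": [50, 40, 45]}
df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df.loc["day2"])               # one row, returned as a Series
print(df.loc[["day1", "day2"]])     # several rows, returned as a DataFrame

# read_csv accepts a file path or any file-like object
csv_text = "Duration,Pulse\n60,110\n45,117\n"
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.to_string())           # prints every row
print(csv_df.loc[0])                # default index starts at 0
```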
ANALYZING DATA: head(), tail(), info() – can tell you a lot of info including any null values (important for cleaning!)
CLEANING DATA
o Cleaning empty cells
 Remove rows with empty cells  file_name.dropna()
 Replace empty values  file_name.fillna(value)
 Replace in a specific column  file_name["column_name"] = file_name["column_name"].fillna(value) (assigning the result back is more reliable in recent pandas versions than inplace=True on a single column)
 Replace using mean(), median(), or mode()
 x = file_name["column_name"].mean(), then repeat the "specific column" code with x as the value (mode() returns a Series, so use mode()[0])
o Cleaning wrong format - either remove the rows or change all the cells in that column to the same format
 Change column - Date: file_name["Date"] = pd.to_datetime(file_name["Date"])
 Remove rows: file_name.dropna(subset=['Column_name'], inplace = True)
o Cleaning wrong data
 Replacing values 
 file_name.loc[row #, 'Column_name'] = new #
 for x in file_name.index:
    if file_name.loc[x, "Column_name"] > #:
        file_name.loc[x, "Column_name"] = # (the value you need it set to)
 Removing rows 
 for x in file_name.index:
    if file_name.loc[x, "Column_name"] > #:
        file_name.drop(x, inplace = True)
o Removing duplicates
 Check for duplicates  print(file_name.duplicated())  will return Boolean values
 Remove  file_name.drop_duplicates(inplace=True)
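A sketch of the cleaning steps on a tiny made-up dataset. The column names and values are assumptions for illustration: one missing Calories value, one implausible Duration, one missing Date, and one duplicated row.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [60, 60, 450, 45, 45, 30],
    "Date": ["2020/12/01", "2020/12/02", "2020/12/03", "2020/12/04", "2020/12/04", None],
    "Calories": [409.1, None, 282.4, 300.0, 300.0, 200.0],
})

# Empty cells: replace the missing Calories value with the column mean
mean_calories = df["Calories"].mean()
df["Calories"] = df["Calories"].fillna(mean_calories)

# Wrong data: cap any Duration above 120
for x in df.index:
    if df.loc[x, "Duration"] > 120:
        df.loc[x, "Duration"] = 120

# Wrong format: convert Date to datetime, then drop rows where it is missing
df["Date"] = pd.to_datetime(df["Date"])
df = df.dropna(subset=["Date"])

# Duplicates: count them, then remove them
duplicate_count = int(df.duplicated().sum())
df = df.drop_duplicates()
print(df.to_string())
```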
CORRELATIONS  corr() method
o Shows the relationship between columns  file_name.corr() … only numeric columns can be used (recent pandas versions need numeric_only=True to skip non-numeric columns)
o The number varies from -1 to 1, with 1 (or -1) being a perfect correlation. Example:
 Duration to duration got a perfect 1
 Duration to calories got 0.922721 – good correlation
 Duration to maxpulse got 0.009403 – bad correlation – cannot predict max pulse based on duration of workout and vice versa
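A small sketch of corr() on made-up data, where Calories is exactly five times Duration so the correlation comes out as a perfect 1. Selecting the numeric columns first with select_dtypes() is one way to keep text columns out of the calculation across pandas versions; the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150, 225, 300, 450],   # exactly 5 x Duration
    "Label": ["a", "b", "c", "d"],      # non-numeric column
})

# Keep only the numeric columns, then compute pairwise correlations
corr = df.select_dtypes("number").corr()
print(corr)
```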
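PLOTTING  plot() method  needs pyplot from matplotlib (import matplotlib.pyplot as plt, then plt.show() to display)
o Scatter plot (needs an x and a y column):
 file_name.plot(kind = 'scatter', x = 'Column_x', y = 'Column_y')
o Histogram (only needs one column):
 file_name["Column_name"].plot(kind = 'hist')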
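A sketch of both plot kinds on made-up data. It assumes matplotlib is installed and selects the non-interactive Agg backend so it also runs without a display; in a notebook or desktop session you would normally leave the backend alone and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")            # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Duration": [30, 45, 60, 90],
    "Calories": [150, 240, 310, 440],
})

# Scatter plot: needs an x column and a y column
ax_scatter = df.plot(kind="scatter", x="Duration", y="Calories")

# Histogram: needs only one column; draw it on its own figure
fig, ax = plt.subplots()
ax_hist = df["Duration"].plot(kind="hist", ax=ax)

plt.close("all")                 # free the figures (plt.show() would display them)
```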
Numpy (arrays)












Can be up to 50x faster than processing Python lists - object type is numpy.ndarray
import numpy as np
Basics on arrays:
Creating an object  array_name = np.array([1, 2, 3, etc.])
Dimensions in arrays  nested arrays are arrays that have other arrays as their elements
o 0-D – 1 value  arr = np.array(42)
o 1-D – arr = np.array([1, 2, 3])
o 2-D – arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) – matrix
o Check number of dimensions  print(arr.ndim)
Indexing
o array_name[index]: access an element by its position, starting at 0
o array_name.size: how many values the array holds
Slicing
o print(array_name[first:last]) – last is not included
o print(array_name[#:]) – from that element to the end
o print(array_name[:#]) – from the beginning to that element, NOT INCLUDING that element
o STEP:
 print(array_name[beginning:last(not included):how often])
 i.e.  print(arr[1:5:2]) returns every other element from index 1 up to (not including) index 5
o 2-D
 print(arr[row index, beginning:end]) i.e.  print(arr[1, 1:4])
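The array basics and slicing rules can be checked with a few small arrays. The array contents are sample values chosen for illustration.

```python
import numpy as np

arr0 = np.array(42)                                   # 0-D: a single value
arr1 = np.array([1, 2, 3, 4, 5, 6, 7])                # 1-D
arr2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])  # 2-D

print(arr0.ndim, arr1.ndim, arr2.ndim)   # 0 1 2
print(arr1.size)                         # 7

print(arr1[1:5])      # [2 3 4 5]  index 5 is not included
print(arr1[4:])       # [5 6 7]    from index 4 to the end
print(arr1[:4])       # [1 2 3 4]  from the start, index 4 not included
print(arr1[1:5:2])    # [2 4]      start:end:step
print(arr2[1, 1:4])   # [7 8 9]    second row, columns 1 to 3
```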
Data types – check with the dtype property; create an array of a specific type by adding the dtype argument, and use astype() to change the type of an existing array (it returns a converted copy)
o i - integer
o b - boolean
o u - unsigned integer
o f - float
o c - complex float
o m - timedelta
o M - datetime
o O - object
o S - string
o U - unicode string
o V - fixed chunk of memory for other type (void)
Copy vs view  .copy() makes a new array (changes to the copy won't change the original, and vice versa); .view() shares the original's data (changes to the view will change the original, and vice versa)
Array shape
o Array_name.shape: size of the array in each dimension
Reshaping
o newarr = arr.reshape(#of arrays, #of elements)
o unknown dimensions  use -1 and numpy will calculate it for you
o flatten array into 1d – use reshape(-1)
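A short sketch tying together dtype, copy vs view, and reshaping. The array values are made up for illustration.

```python
import numpy as np

# Data types: check with dtype, convert with astype()
arr = np.array([1.1, 2.1, 3.1])
print(arr.dtype)              # float64
as_int = arr.astype("i")      # "i" = integer; the decimals are dropped
print(as_int)                 # [1 2 3]

# Copy vs view
original = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
c = original.copy()           # owns its own data
v = original.view()           # shares data with the original
original[0] = 42
print(c[0], v[0])             # 1 42

# Shape and reshape
m = original.reshape(4, 3)           # 4 arrays of 3 elements
print(m.shape)                       # (4, 3)
cube = original.reshape(2, 2, -1)    # -1 lets NumPy work out the last dimension
print(cube.shape)                    # (2, 2, 3)
flat = m.reshape(-1)                 # flatten back to 1-D
print(flat.shape)                    # (12,)
```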
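Iterating – going through elements one by one
o Use the for loop
o 1-D and 2-D: looping over a 1-D array gives each element; looping over a 2-D array gives each row
o Then to return the actual values (the scalars), iterate in each dimension (a nested for loop)
o Simpler way: np.nditer(arr) goes through every scalar, whatever the number of dimensions
o Enumeration means mentioning the sequence number of things one by one: np.ndenumerate(arr) gives each index together with its value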
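The iteration options can be compared on one small 2-D array. This is a sketch assuming the "simpler way" refers to np.nditer() and the enumeration note to np.ndenumerate(); the array values are made up.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

# A plain for loop over a 2-D array yields whole rows
rows = [row.tolist() for row in arr]

# Nested loops reach the individual scalars
scalars = []
for row in arr:
    for value in row:
        scalars.append(int(value))

# nditer visits every scalar without nesting loops
simple = [int(value) for value in np.nditer(arr)]

# ndenumerate yields (index, value) pairs
indexed = [(idx, int(value)) for idx, value in np.ndenumerate(arr)]

print(rows)
print(scalars)
print(simple)
print(indexed[0], indexed[-1])
```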
Join
o arr = np.concatenate((arr1, arr2)) – joins along axis 0 by default
o join 2-D arrays side by side (along each row)  arr = np.concatenate((arr1, arr2), axis=1)
o stacking (done along a new axis)  arr = np.stack((arr1, arr2), axis=1)
o stack horizontally (along rows)  arr = np.hstack((arr1, arr2))
o stack vertically (along columns)  arr = np.vstack((arr1, arr2))
o stack along height (depth, the third axis)  arr = np.dstack((arr1, arr2))
Split
o Split in 3 ways example  newarr = np.array_split(arr, 3) – returns a list of arrays
o Split a 2-D array along each row (column-wise)  newarr = np.array_split(arr, 3, axis=1)
o Alternative  newarr = np.hsplit(arr, 3)
o vsplit() and dsplit() also available
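Looking at the resulting shapes is a quick way to see what each join and split call does. The arrays below are sample values.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

print(np.concatenate((a, b)).shape)           # (4, 2)  axis 0: b goes below a
print(np.concatenate((a, b), axis=1).shape)   # (2, 4)  axis 1: b goes beside a
print(np.stack((a, b), axis=1).shape)         # (2, 2, 2)  a new axis is created
print(np.hstack((a, b)).tolist())             # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(np.vstack((a, b)).tolist())             # [[1, 2], [3, 4], [5, 6], [7, 8]]
print(np.dstack((a, b)).shape)                # (2, 2, 2)

# array_split returns a list of arrays and copes with uneven splits
parts = np.array_split(np.array([1, 2, 3, 4, 5, 6]), 3)
uneven = np.array_split(np.array([1, 2, 3, 4, 5, 6]), 4)
print([p.tolist() for p in parts])            # [[1, 2], [3, 4], [5, 6]]
print([p.tolist() for p in uneven])           # [[1, 2], [3, 4], [5], [6]]

# hsplit cuts a 2-D array into column blocks
cols = np.hsplit(np.arange(12).reshape(2, 6), 3)
print([c.shape for c in cols])                # [(2, 2), (2, 2), (2, 2)]
```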
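Search
o Where function  x = np.where(arr == 4) – returns the indexes where the condition is true
o Find even  x = np.where(arr%2 == 0)
o Find odd  x = np.where(arr%2 == 1)
o Find the index where 7 should be inserted to keep a sorted array in order  x = np.searchsorted(arr, 7)
o Search from the right side  x = np.searchsorted(arr, 7, side='right')
Sort: sorts numbers in ascending order and strings alphabetically  print(np.sort(arr))  can also sort Boolean arrays
Filter
o Boolean index list: a list of True/False values, one per element  newarr = arr[boolean_list] keeps only the elements marked True
o The list can be built directly from a condition  newarr = arr[arr > #]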
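A sketch of searching, sorting, and filtering on small sample arrays (the values are made up). Note that searchsorted() assumes the array is already sorted.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 4, 4])
print(np.where(arr == 4)[0])        # [3 5 6]    indexes holding the value 4
print(np.where(arr % 2 == 0)[0])    # [1 3 5 6]  indexes of even values

sorted_arr = np.array([6, 7, 8, 9])
print(np.searchsorted(sorted_arr, 7))                 # 1  leftmost position for 7
print(np.searchsorted(sorted_arr, 7, side="right"))   # 2  position after the existing 7

print(np.sort(np.array([3, 2, 0, 1])))                # [0 1 2 3]
print(np.sort(np.array(["banana", "cherry", "apple"])))
print(np.sort(np.array([True, False, True])))         # [False  True  True]

# Filter with a Boolean index list built from a condition
vals = np.array([41, 42, 43, 44])
mask = vals > 42
print(mask)          # [False False  True  True]
print(vals[mask])    # [43 44]
```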
Simple APIs
Application Program Interface and REST APIs

WHAT IS AN API?: lets two pieces of software talk to each other. Your program passes inputs to the other software through the API and gets outputs back (you just need to know the inputs and outputs, not how the other software works inside)
o API Libraries – the API is the only part of the library that you see, while the library is the entire process behind the data going back and forth
o Example: pandas is a set of software components; your Python program uses the pandas API to have those components process the data
 REST API: enables you to communicate via the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, etc. REST stands for representational state transfer
o They have a set of rules about communication with the web service: the input (request) and the output (response)
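The request/response idea can be tried without internet access by running a tiny web service on your own machine. Everything here is a hypothetical stand-in, not a real REST service: the ToyHandler class, the /items/1 path, and the JSON fields are assumptions, built only from the Python standard library.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyHandler(BaseHTTPRequestHandler):
    """Hypothetical toy service: answers every GET request with JSON."""

    def do_GET(self):
        payload = {"requested": self.path, "status": "ok"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), ToyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client only needs to know the input (request) and the output (response)
conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("GET", "/items/1")
response = conn.getresponse()
status_code = response.status
data = json.loads(response.read().decode("utf-8"))
conn.close()

server.shutdown()
server.server_close()
print(status_code, data)
```

The client code never looks inside ToyHandler, which is the point of an API: only the request format and the response format matter.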
Rest APIs, Webscraping, and Working with Files

Databases/SQL
Why SQL is great


originally for relational databases but has expanded
Knowing SQL will help you do many different jobs in data science, including business and data analyst, and it's a must
in data engineering
 When performing operations with SQL, you access the data directly. There's no need to copy it beforehand. This can
speed up workflow executions considerably.
 SQL is the interpreter between you and the database.
 SQL is an American National Standards Institute, or "ANSI," standard, which means if you learn SQL and use it with
one database, you will be able to easily apply that SQL knowledge to many other databases.
 There are many different SQL databases available, including MySQL, IBM Db2, PostgreSQL, Apache OpenOffice Base,
SQLite, Oracle, MariaDB, Microsoft SQL Server, and more. The syntax of the SQL you write might change a little bit
based on the relational database management system you’re using.
Basics of SQL
 What is SQL – a query language to get data out of a database
 Basic commands
o Create Table
o Insert
o Select command – retrieving data from the table; DML statement, it is a query, and the result from the query
is a set/table
 Select * from <tablename>
o Update
o Delete
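The five basic commands can be tried from Python with the built-in sqlite3 module and an in-memory database. The table name book and its columns are made-up examples, and SQLite is just one of the relational databases listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # temporary database that lives in memory
cur = conn.cursor()

# CREATE TABLE
cur.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT, price REAL)")

# INSERT
cur.executemany(
    "INSERT INTO book (id, title, price) VALUES (?, ?, ?)",
    [(1, "Intro to SQL", 10.0), (2, "Advanced SQL", 20.0)],
)

# UPDATE
cur.execute("UPDATE book SET price = 15.0 WHERE id = 1")

# DELETE
cur.execute("DELETE FROM book WHERE id = 2")

# SELECT: the result of the query is itself a set of rows
rows = cur.execute("SELECT * FROM book").fetchall()
print(rows)   # [(1, 'Intro to SQL', 15.0)]

conn.close()
```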
Relational Database Model
 Data stored in tabular form (a table) – columns and rows
 RDBMS – Relational database management system - set of software tools that controls the access, organization, and
storage
 Advantages of relational model
 Explain how entity name and attributes map to a relational database

Working knowledge of SQL and database
Connect to database and run SQL queries
R
Why R is Great


R has become the world’s largest repository of statistical knowledge.
As of 2018, R has more than 15,000 publicly released packages, making it possible to conduct complex exploratory
data analysis.
 R integrates well with other computer languages, such as C++, Java, C, .Net, and Python.
 Common mathematical operations such as matrix multiplication work straight out of the box.
R has stronger object-oriented programming facilities than most statistical computing languages
Data Visualization with Python
Machine Learning with Python
Basics:


Machine learning uses algorithms to identify patterns in data through model training, and can then make decisions based on that training
Deep learning is a specialized type of machine learning – it's a general set of models and techniques that tries to loosely emulate the way the human brain solves a wide range of problems.
o Common uses: natural language processing, image/audio/video analysis, time series forecasting
o **typically requires VERY LARGE sets of labeled data and is computing intensive
o Built using: TensorFlow, PyTorch, Keras – look for a "model zoo" (a collection of pre-trained models)
o Model Asset Exchange (MAX)
1. SUPERVISED LEARNING: In supervised learning, a human provides input data and the correct outputs. The model tries to
identify relationships and dependencies between the input data and the correct output. Generally speaking,
supervised learning is used to solve regression and classification problems. Controlled environment
 Regression: predict real numeric value (i.e. home values based off home characteristic)
 Classification: does something belong to a class (i.e. identify spam)
2. UNSUPERVISED LEARNING: In unsupervised learning, the data is not labelled by a human. The models must analyze the
data and try to identify patterns and structure within the data based only on the characteristics of the data itself.
Clustering and anomaly detection are two examples of this learning style. Less controlled environment
 Clustering: used to divide the record set into groups (purchase recommendations based off group of purchases
from previous)
 Anomaly detection: identify outliers (ie like detecting credit card fraud because of anomaly)
3. REINFORCEMENT LEARNING: The third type of learning, reinforcement learning, is loosely based on the way human beings
and other organisms learn. Think about a mouse in a maze. If the mouse gets to the end of the maze it gets a piece of
cheese. This is the “reward” for completing a task. The mouse learns – through trial and error – how to get through
the maze to get as much cheese as it can. In a similar way, a reinforcement learning model learns the best set of
actions to take, given its current environment, in order to get the most reward over time. This type of learning has
recently been very successful in beating the best human players in games such as go, chess, and popular strategy
video games.
Techniques
Python for machine learning





Numpy – for working with arrays and doing computation
SciPy – for scientific and high performance computation
Matplotlib
Pandas
Scikit-learn – classification, regression, and clustering algorithms, easy to implement
Regression – predicting a continuous variable




Y = Dependent variable (all other variables affect this one) – must be continuous and cannot be a discrete value
X = Independent variables (affect the y variable)
Types:
 Simple regression: one independent variable is used to estimate a dependent variable
 Linear – depends on the nature of the relationship between the values
 Non-linear
 Multiple regression: multiple independent variables used to estimate a dependent variable
 Linear
 Non-linear
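Simple and multiple linear regression can both be sketched with NumPy alone. The data is made up and follows an exact line (or plane), so the fitted coefficients are easy to check; using np.polyfit() and np.linalg.lstsq() is one possible approach, not the only one.

```python
import numpy as np

# Simple linear regression: one independent variable x, dependent variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0                          # made-up data on the line y = 2x + 1
slope, intercept = np.polyfit(x, y, 1)     # fit a degree-1 polynomial
predicted = slope * 6.0 + intercept        # predict y for a new x value
print(slope, intercept, predicted)

# Multiple linear regression: two independent variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y2 = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0   # made-up data: y = 3*x1 + 2*x2 + 1
design = np.column_stack([X, np.ones(len(X))])     # extra column of 1s for the intercept
coefs = np.linalg.lstsq(design, y2, rcond=None)[0]
print(coefs)                               # approximately [3, 2, 1]
```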
Regression algorithms

Measuring regression model accuracy



Can improve out-of-sample accuracy by doing:
o Train/test split evaluation – gives a more accurate estimate of out-of-sample accuracy, but is highly dependent on which data the model is trained and tested on
o K-fold cross-validation – average the result of each fold; each fold uses a distinct test set (no test data reused in another fold)
What is an error? Difference between the data points and the trend line of a model
o Mean absolute error – just the average of the absolute errors
o Mean squared error – focuses more on large errors due to the squared term
o Root mean squared error – easy to interpret because it is in the same units as the target
o Relative absolute error – takes the total absolute error and normalizes it (by the total absolute error of always predicting the mean)
o Relative squared error – used to calculate R2 (R2 = 1 - RSE) – R2 represents how close the data values are to the trend line; higher R2 = better fit
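The error measures can be computed by hand with NumPy. The actual and predicted values below are made up so that each result is easy to verify.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])     # actual values (mean is 6)
y_pred = np.array([2.0, 5.0, 8.0, 11.0])    # values predicted by some model

errors = y_true - y_pred                    # [ 1.  0. -1. -2.]
mae = np.mean(np.abs(errors))               # mean absolute error
mse = np.mean(errors ** 2)                  # mean squared error
rmse = np.sqrt(mse)                         # root mean squared error

# Relative errors compare the model with always predicting the mean
baseline = y_true - y_true.mean()
rae = np.sum(np.abs(errors)) / np.sum(np.abs(baseline))   # relative absolute error
rse = np.sum(errors ** 2) / np.sum(baseline ** 2)         # relative squared error
r2 = 1 - rse                                              # R2

print(mae, mse, rmse)   # MAE = 1.0, MSE = 1.5, RMSE is about 1.2247
print(rae, rse, r2)     # RAE = 0.5, RSE is about 0.3, R2 is about 0.7
```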
Statistics for Data Science