APSC 258 Midterm Formula Sheet full formula sheet
Data Instrumentation and Design (The University of British Columbia)
Intro To Python
Variables are created the moment you assign a value to them.
There is no need to declare a type either. Multiple variables can be assigned on one line as well.
Variable Names
Variable names must start with a letter or the underscore character, not a number. They can only contain alphanumeric
characters and underscores (a-z, A-Z, 0-9, and _) and are case sensitive.
metals = ["iron", "copper", "aluminum"]
x, y, z = metals
Machine Learning Methods: Supervised learning is defined by its use of labeled datasets (like linear regression). Unsupervised learning
analyzes and clusters unlabeled datasets (like targeted adverts).
Steps: Big Picture > get data > visualize > prepare data > select model > fine tune model > present it > launch it
complex(var) converts to complex number
type(var) returns the data type
Fundamental Trade-Off: Notation E_train is the error on training data, E_test is the error on test data
E_test = (E_test - E_train) + E_train = E_approx + E_train where E_approx is the approximation error.
If E_approx is small, then E_train is a good approx. to E_test
Trade-Off: Simple models: E_approx is low (not very sensitive to the training set), but E_train might be high.
Complex models: E_train can be low, but E_approx might be high (very sensitive to the training set).
Training error is high for low complexity (underfitting), Test error initially goes down, but eventually increases (overfitting).
5-fold cross-validation: Train on 80% of the data, validate on the other 20%. Repeat 4 more times with different splits (5 folds in total), and
average the scores.
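A minimal 5-fold cross-validation sketch with NumPy; the data, the simple one-feature least-squares model, and the split are made up for illustration:

import numpy as np

np.random.seed(0)
x = np.random.randn(100)
y = 2.0 * x + 0.3 * np.random.randn(100)      # made-up data

idx = np.random.permutation(100)
folds = np.array_split(idx, 5)                # 5 folds of 20 points each

scores = []
for k in range(5):
    val = folds[k]                                                    # 20% held out for validation
    train = np.concatenate([folds[j] for j in range(5) if j != k])    # other 80% for training
    w = np.sum(x[train] * y[train]) / np.sum(x[train] ** 2)           # one-feature least squares
    scores.append(np.mean((y[val] - w * x[val]) ** 2))                # validation MSE for this fold
print(np.mean(scores))                                                # average the 5 scores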
Strings are sequences of Unicode characters. The len(var) function returns the length of a string or
array. To check if a certain phrase is not in a string, use the not in operator: <phrase> not in <var>.
Membership Operators
in operator: used to check if a value exists in a sequence.
not in operator: used to check if a value is not in a sequence.
String Slicing, Lists, and Indexing
to index a variable use square brackets:
[index] to get value at index number
[index:] to get value at index all the way to the end
[:index] to get value at start all the way to index (but not including index)
Arrays can be indexed with positive and negative index values
var[start:stop:step] start through not past stop, by step, step >=1
To concatenate, or combine two strings you can use the + operator
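A quick slicing sketch (the string is arbitrary):

s = "Python"
print(s[0])        # 'P'
print(s[2:])       # 'thon'
print(s[:2])       # 'Py' (stop index not included)
print(s[-2])       # 'o' (negative index counts from the end)
print(s[0:5:2])    # 'Pto' (start through stop-1, by step)
print(s + " 3")    # 'Python 3' (concatenation with +)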
String indexing example for "Python":
Character:        P   y   t   h   o   n
Positive index:   0   1   2   3   4   5
Negative index:  -6  -5  -4  -3  -2  -1
Norm of a Vector: the p-norm of a vector x = [x_1, x_2, ..., x_N], denoted ||x||_p (p = 1, 2, ...), is ||x||_p = (|x_1|^p + |x_2|^p + ... + |x_N|^p)^(1/p).
The most commonly used norm is the 2-norm (p = 2).
Distance between vectors x and y is the norm of their difference, e.g. ||x - y||_2 = sqrt((x_1 - y_1)^2 + ... + (x_N - y_N)^2).
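A small sketch of computing a p-norm and the distance between two vectors (the vectors are made up):

import numpy as np

x = np.array([3.0, 4.0])
y = np.array([0.0, 0.0])

print(np.sum(np.abs(x) ** 2) ** (1 / 2))   # 2-norm from the definition -> 5.0
print(np.linalg.norm(x, ord=2))            # same thing with numpy -> 5.0
print(np.linalg.norm(x - y))               # distance between x and y -> 5.0
print(np.linalg.norm(x, ord=1))            # 1-norm -> 7.0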
% formatting can be used to insert variables into strings
You can use it like so: string = "var1 is %i, string1 is %s" % (var1, string1)
The directives are as follow: %s for strings, %i for ints, %f for floats, %d for decimal ints, %x for hexadecimal ints, %o for
octal ints, %u for unsigned ints, %e for float exponent.
Regression belongs to the supervised learning category. It fits a model to training data.
Linear Regression with N data sets (input features), each of size n:
Model: yhat = w_0 + w_1*x_1 + w_2*x_2 + ... + w_N*x_N, or in matrix form yhat = Xw.
Calculate the weights (least squares): w = (X^T X)^-1 X^T y
Set up the matrices: the Y matrix is the dependent data set; the X matrix holds all the other data sets, one data set per column, with a leading column of ones for w_0.
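A minimal sketch of least-squares linear regression with NumPy, assuming the normal-equation form above; the data values are made up:

import numpy as np

# made-up example data: n = 5 samples, N = 2 input data sets
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([5.1, 6.9, 11.2, 12.8, 17.1])

# X matrix: leading column of ones (for w_0), then one column per data set
X = np.column_stack([np.ones_like(x1), x1, x2])

w = np.linalg.inv(X.T @ X) @ X.T @ y       # w = (X^T X)^-1 X^T y
y_hat = X @ w
mse = np.mean((y - y_hat) ** 2)            # MSE at the fitted weights
print(w, mse)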
String formatting is as follows: string = "var1 is {}, string1 is {}".format(var1, string1)
f-Strings are less verbose and more readable. As follows:
string = f"var1 is {var1}, string1 is {string1}"
Display integers in different bases: string = f"var1 is {var1:7x}, is hexadecimal"
With o being oct, x being hex, X being HEX (all caps), d being decimal, b being binary, and the number before being the
number of characters printed.
Center a string like so: string = f"str1 is {str1:^11}, is centered in a string of 11 chars"
You can add specific padding characters like so: {str1:*^11} to make * the padding char.
Format in scientific notation: string = f"var1 is {var1:.2E}, is in sci notation"
E or e is the type of notation; .2 rounds to 2 decimal places.
Operators
Arithmetic: + Addition, - Subtraction, * Multiplication, / Division, % Modulus (remainder), // Floor division (returns int), ** Exponent
Bitwise: << Bit shift left, >> Bit shift right, ^ Bitwise xor, & Bitwise and, | Bitwise or, ~ Bitwise not
Operators can also be combined with assignment (augmented assignment), e.g. x += 1, x //= 2.
For boolean logic, the and, or, and not keywords are used (there is no boolean xor keyword; use != or ^ on booleans).
Comparison: == Equal, != Not equal, > Greater than, < Less than, >= Greater than or equal to, <= Less than or equal to, is Compares identities
Degree of Polynomial and Fundamental Trade-off: as the polynomial degree goes up, E_train goes down, but
E_approx goes up. To prevent overfitting, we can train the model with several different degrees, then choose the
degree with the lowest validation error. We can also use a larger dataset.
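A minimal sketch of picking the polynomial degree with the lowest validation error; the data, the 80/20 split, and the degree range are made-up choices:

import numpy as np

np.random.seed(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(40)    # noisy made-up data

idx = np.random.permutation(40)
train, val = idx[:32], idx[32:]                           # hold out 20% for validation

best_deg, best_err = None, np.inf
for deg in range(1, 10):
    w = np.polyfit(x[train], y[train], deg)                    # fit on the training split
    e_val = np.mean((np.polyval(w, x[val]) - y[val]) ** 2)     # validation error
    if e_val < best_err:
        best_deg, best_err = deg, e_val
print(best_deg, best_err)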
L2 (Tikhonov) Regularization: controls the model complexity by adding a penalty on the weights.
Collection data types include List [] and Tuple () which are ordered, and allow duplicate members (List is changeable
and Tuples are not.). Set {} and Dictionary {} are unordered and unindexed collections with no duplicate members.
Tuples are immutable, so to change them
you need to re-declare them.
To convert between a list and tuple:
tuple(x) convert list x to tuple
list(y) convert tuple y to list
The equation applied to the weights for L2 regularization is: w = (X^T X + alpha*I)^-1 X^T y, where alpha is a constant that controls the strength of regularization.
Sets are written with {}. They can't be nested, and values can't be accessed with indexing,
but they can be accessed by iterating. Individual elements can't be changed, only added or removed.
x.add(val) adds val to x.
x.remove(val) removes val from x.
x.update(y) adds the elements of y to x in place.
x.union(y) returns a new set combining x and y.
To apply L2 regularization to the MSE (cost) function: c_reg(w) = c(w) + alpha*||w||_2^2, where c(w) is the original cost function.
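A minimal sketch of L2-regularized (ridge) least squares using the closed form above; the data and the value of alpha are made up, and the intercept column is regularized too for simplicity:

import numpy as np

np.random.seed(1)
X = np.column_stack([np.ones(20), np.random.randn(20, 3)])            # ones column + 3 made-up features
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * np.random.randn(20)

alpha = 0.1                                                           # regularization strength
I = np.eye(X.shape[1])
w_l2 = np.linalg.inv(X.T @ X + alpha * I) @ X.T @ y                   # w = (X^T X + alpha*I)^-1 X^T y
print(w_l2)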
Gradient: a vector giving the direction and rate of the fastest increase of the function.
Gradient Descent is one of the most important algorithms in machine learning. It tweaks the parameters w iteratively
to minimize the cost function.
The iterative update equation is: w_j(k+1) = w_j(k) - alpha * dC/dw_j (evaluated at w(k)), where k is the iteration.
In vector form: w(k+1) = w(k) - alpha * grad C(w(k)).
Operation on Dictionaries
Dictionaries {} are written like a set, but with comma-separated key:value pairs: {key : value}
Keys must be immutable (hashable) types; otherwise keys and values can be any data type.
x.keys() returns a list of keys in x.
x.clear() removes all elements in x.
x.values() returns a list of values in x.
x.get(key) returns the value of a given key in x.
x.items() returns a list containing a tuple for each item.
x.pop(key) removes the value of a given key in x.
x.popitem() removes the last inserted key-value pair.
x.update(iterable) updates the dictionary with an iterable containing key-value pairs.
x.fromkeys(keys, val) returns a new dictionary with the specified keys, each mapped to val.
x.setdefault(key,val) returns the value of the specified key. If the key does not exist: insert the key and value.
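A small usage sketch of the dictionary operations above (the keys and values are made up):

metals_density = {"iron": 7.87, "copper": 8.96}    # key : value pairs
metals_density["aluminum"] = 2.70                  # add or update an entry
print(metals_density.keys())                       # dict_keys(['iron', 'copper', 'aluminum'])
print(metals_density.get("copper"))                # 8.96
for metal, density in metals_density.items():      # items() gives (key, value) tuples
    print(metal, density)
removed = metals_density.pop("iron")               # removes the key and returns its value (7.87)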
Gradient Descent to minimize MSE:
(1) Set k = 0, pick an initial w(0), and set the learning rate alpha.
(2) Calculate the gradient of the cost at w(k) and update w(k+1) = w(k) - alpha * grad C(w(k)).
(3) Repeat until a stopping criterion is met (e.g. a small gradient, a small change in w, or a maximum number of iterations).
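A minimal gradient-descent sketch for the linear-regression MSE; the learning rate, iteration budget, and data are made-up choices:

import numpy as np

np.random.seed(2)
X = np.column_stack([np.ones(50), np.random.randn(50, 2)])            # ones column + 2 made-up features
y = X @ np.array([3.0, 1.5, -2.0]) + 0.1 * np.random.randn(50)

alpha, n = 0.05, len(y)                    # learning rate and number of samples
w = np.zeros(X.shape[1])                   # (1) k = 0, initial weights
for k in range(500):                       # (3) fixed iteration budget as the stopping criterion
    grad = (2.0 / n) * X.T @ (X @ w - y)   # (2) gradient of MSE = (1/n)*||Xw - y||^2
    w = w - alpha * grad                   #     update step
print(w)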
Outliers are data samples that deviate from the other samples. Least squares amplifies the impact of outliers
since the error is squared. The solution is to replace the squared error with the absolute error.
For least squares, we use the mean squared error: C(w) = (1/n) * sum_i (y_i - yhat_i)^2
For least absolute error, the cost function becomes the mean absolute deviation: C(w) = (1/n) * sum_i |y_i - yhat_i|
Data Cleansing can be done by removing rows with missing values or replacing them with the mean, removing duplicate rows, and removing outliers
(e.g. by Z-score).
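A minimal sketch of removing outliers by Z-score; the data and the threshold of 3 are made-up (but common) choices:

import numpy as np

data = np.array([5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 15.0])   # one obvious outlier
z = (data - np.mean(data)) / np.std(data)    # Z-score of each sample
clean = data[np.abs(z) < 3]                  # keep samples within 3 standard deviations
print(clean)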
Design
Problem Definition Recognize the need, identify client/end user, make sure the client knows their need.
Need Statements are written in a negative tone. They describe how the current situation is bad, and should be descriptive and clear.
Format: Issue + who/what it applies to + positive result of the solution.
Importing a Function or Package
Import a module as an alias import myModule as module_alias
Import a part of a module from myModule import myFunc
Import a module import myModule
Common Libraries/Modules
Math: import math contains common mathematical functions like acos, asin, atan, ceil, cos, cosh, degrees,
e, erf, erfc, exp, factorial, floor, gcd, log, log10, log1p, log2, pi, pow, radians, sin, sinh, sqrt, tan,
tanh, tau, trunc, etc.
Defining the problem. Goal: brief, general, and ideal response to the need statement. Objectives: quantifiable expectations of performance; must
have units; establish the operating environment; act as indicators of progress towards the goal; describe performance characteristics to the client;
specify how individual performance parameters are optimized.
Diagrams: Hierarchical system diagrams (1) define elements within a product using a tree diagram.
Functional Analysis diagrams (2) define functions and the flow of information within a product.
Random: import random contains random number generation tools, e.g. random.randrange(start, stop) returns a random integer from start up to (not including) stop.
Numpy: import numpy as np is a library used for working with arrays. It is also more memory efficient, and therefore faster.
np.array([[v1, v2, v3], [v4, v5, v6]], dtype='int') creates a 2x3 array of type int (dtype is optional).
np.array([v1, v2, v3, v4, v5, v6]).reshape([2, 3]) creates a 2x3 array.
x.shape returns the shape of the array x (rows, cols)
x.ndim returns the number of dimensions of x
x.dtype returns the data type of the array
x.size returns the number of elements in x.
np.arange(start, end, step) returns an array of evenly spaced points from start up to end (not included), stepping by step.
np.linspace(start, end, n_points, endpoint=False) returns n_points evenly spaced numbers between start and end; the
endpoint argument controls whether end itself is included (it is by default).
np.random.rand(n_points) returns a horizontal array of n_points random values between 0 and 1.
np.random.randn(n_points) returns a horizontal array of n_points random gaussian values.
np.random.seed(seed) sets the random seed.
Arrays are indexed the same as lists, but have two indexes for row and column x[row_indx, col_indx]
You can index individual values in an array all at once with an array of integers as your index.
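A short sketch tying the array calls above together (the values are arbitrary):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype='int')   # 2x3 array
print(a.shape, a.ndim, a.size)                      # (2, 3) 2 6
print(a[1, 2])                                      # row 1, column 2 -> 6
print(a[0, :])                                      # first row -> [1 2 3]
print(a[[0, 1], [0, 2]])                            # index several values at once -> [1 6]
b = np.arange(0, 6, 1).reshape([2, 3])              # same shape built from a range
print(b)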
Quality Function Deployment (QFD) diagrams (3) are used for matching customer requirements to engineering design and performance
parameters (key elements being customer requirements, engineering requirements, and the characteristics of competing
products).
QFD (4) Description:
Customer Req. must provide a clear and optimal direction
Engineering Req. must be quantifiable and have units
Units and Eng. Targets must have proper units and be realistic
The Roof describes relationships between Eng. Req. (‘+’, ‘-’, or nothing)
Matrix: Place an X where customer req. and Eng. req. are linked. Go row by row
Benchmarks are possible solutions and get an ‘O’ if it best fulfills customer req.
a.T returns the transpose of a.
a*b element wise multiplication of a and b.
a@b matrix multiplication of a and b.
np.linalg.det(a) returns determinant of a.
Numpy Statistics Operators np.median(array), np.average(array), np.mean(array), np.std(array)
Matplotlib: import matplotlib.pyplot as plt is a low level graph plotting library. Usually it is imported as plt.
To plot: plt.plot(x, y) To display: plt.show() Scatter plot: plt.scatter(x, y)
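A minimal plotting sketch using the calls above (the data is made up):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)
plt.plot(x, np.sin(x))        # line plot
plt.scatter(x, np.cos(x))     # scatter plot on the same axes
plt.show()                    # display the figure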
Numpy Matrix Operators
np.linalg.inv(a) inverses matrix a.
np.dot(a, b) performs dot product of a and b.
np.matmul(a, b) performs matrix multiplication of a and b.
np.multiply(a, b) performs element multiplication of matrices a and b.
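A short sketch of the matrix operators listed above (small made-up matrices):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])
print(a.T)                   # transpose
print(np.linalg.det(a))      # determinant -> -2.0
print(np.linalg.inv(a))      # inverse of a
print(a @ b)                 # matrix multiplication (same as np.matmul(a, b))
print(np.multiply(a, b))     # element-wise multiplication (same as a * b)
print(np.dot(a[0], b[0]))    # dot product of two vectors -> 17.0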
Advantages: gradient descent has a cost of O(NPK), where N is the number of input features, P is the number of data points, and
K is the number of iterations. It is usually faster when the number of features is very large, and it can be used
for many optimization problems.
Quality Metrics Having made the model and obtained w, we are ready to make predictions using this model. How
do we measure the quality of this model?
Modular Programming takes a large programming task and breaks it into smaller subtasks/modules. Functions,
Modules, and packages are all constructs in python. A module is a .py file containing functions or code.
Defining a function
def myFunc(in1, in2, …):
<code>
return output
Minimum MSE (cost function): MSE = (1/n) * sum_i (y_i - yhat_i)^2 = (1/n) * ||y - Xw||^2, evaluated at the least-squares weights.
Polynomial Regression: like linear regression, but in this case we only work with one dependent data set and
one input feature (the other data set).
Model: yhat = w_0 + w_1*x + w_2*x^2 + ... + w_p*x^p for a degree-p polynomial.
Calculations for the weights and MSE are the same; however, although we have one input feature, the polynomial fit
makes new features from its powers: x, x^2, ..., x^p (each power becomes a column of X).
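A minimal sketch of building the polynomial feature matrix by hand and reusing the least-squares formula; the degree and data are made up:

import numpy as np

np.random.seed(3)
x = np.linspace(-1, 1, 30)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.05 * np.random.randn(30)   # made-up data

p = 3                                                 # polynomial degree
X = np.column_stack([x**k for k in range(p + 1)])     # new features: columns 1, x, x^2, x^3
w = np.linalg.inv(X.T @ X) @ X.T @ y                  # same least-squares formula as before
mse = np.mean((X @ w - y) ** 2)
print(w, mse)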
Control Flow & Loops use indenting to determine what is part of the control statement.
If / Elif / Else:
if var == condition:
    <code>
elif condition:
    <code>
else:
    <code>
For:
for var in iterable:
    <code>
else:
    <code>
If var isn't used in a for loop, replace it with _. The else block runs when the for loop ends normally, not when break is used.
While:
while condition:
    <code>
Use break to stop a loop, use continue to skip to the next iteration.
Operation on Lists
x.append(y) inserts y at the end of the list x.
x.insert(i, y) inserts y at position i in x.
x.index(y) returns the index of the first occurrence of y in the list x.
x.count(y) counts the number of y occurrences in x.
x.remove(y) removes the 1st occurrence of y in x.
x.reverse() reverses the list.
x.sort() sorts the list.
x.pop(i) removes and returns the object at position i in the list (the last object if i is omitted).
x.extend(iterable) appends the elements of iterable to the end of the list.
enumerate(iterable, start) returns a tuple containing a count from
start and the values obtained from iterating over iterable.
range(start, stop, step) function can be used to create an iterable
from start, to stop (not including), with a step of step (all integers).
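A quick usage sketch of the list operations, enumerate, and range (the values are arbitrary):

metals = ["iron", "copper", "aluminum"]
metals.append("zinc")                  # ['iron', 'copper', 'aluminum', 'zinc']
metals.insert(1, "tin")                # insert at position 1
metals.extend(["gold", "silver"])      # append several elements
print(metals.index("copper"))          # 2 (first occurrence)
last = metals.pop(-1)                  # removes and returns 'silver'
for i, metal in enumerate(metals, start=1):
    print(i, metal)
print(list(range(0, 10, 2)))           # [0, 2, 4, 6, 8]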
Weight matrix is a column matrix of size N+1 (w_0 is your y-intercept).
Input Function
input(prompt)
Data Type Conversion
int(var) converts to integer
str(var) converts to string
float(var) converts to float
Given a list or tuple, you can extract (unpack) its values into variables, as in the metals example above.
Swap Variables
x, y = y, x
Machine Learning
Machine learning is an algorithmic framework that teaches computers to predict and recognize
patterns by observing datasets, then making predictions based on those datasets.
Typical Steps: Data Collection to train the computer. Feature Design to detect color, shape, size, etc.; this can be done
automatically or manually. Model Training teaches the machine to find a separation between two different things. Model Validation tests
the model on previously unseen samples.