APSC 258 Formula Sheet: Data Instrumentation & Design

APSC 258 Midterm Formula Sheet, Data Instrumentation and Design (The University of British Columbia)

Intro To Python
Variables are created the moment you assign a value to them. There is no need to declare a type either. Several variables can be assigned on one line as well, for example by unpacking a list:
metals = ["iron", "copper", "aluminum"]
x, y, z = metals
complex(var) converts var to a complex number.
type(var) returns the data type of var.
Strings are sequences of Unicode characters. The len(var) function returns the length of a string or array.

Variable Names
Variable names must start with a letter or the underscore character, not a number. They can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _) and are case sensitive.

Machine Learning Methods
Supervised learning is defined by its use of labeled datasets (like linear regression).
Unsupervised learning analyzes and clusters unlabeled datasets (like targeted adverts).
Steps: big picture > get data > visualize > prepare data > select model > fine-tune model > present it > launch it

Fundamental Trade-Off
Notation: E_train is the error on training data, E_test is the error on test data.
E_test = (E_test - E_train) + E_train = E_approx + E_train
where E_approx = E_test - E_train is the approximation error. If E_approx is small, then E_train is a good approximation to E_test.
Trade-off:
Simple models: E_approx is low (not very sensitive to the training set), but E_train might be high.
Complex models: E_train can be low, but E_approx might be high (very sensitive to the training set).
Training error is high for low complexity (underfitting). Test error initially goes down as complexity grows, but eventually increases (overfitting).

5-fold cross-validation: Split the data into 5 folds. Train on 80% of the data (4 folds) and validate on the other 20% (1 fold). Repeat this 4 more times with different splits, so that each fold is used for validation once, and average the 5 scores.
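The 5-fold cross-validation procedure above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the model is a straight-line fit, the score is the validation MSE, and the helper name five_fold_scores is hypothetical rather than something from the course.

```python
import numpy as np

def five_fold_scores(x, y, n_folds=5, seed=0):
    """Return one validation MSE per fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))          # shuffle before splitting
    folds = np.array_split(indices, n_folds)   # 5 folds of about 20% each
    scores = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        # Fit a straight line on the 4 training folds only
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
        pred = slope * x[val_idx] + intercept
        # Score the fit on the held-out fold
        scores.append(float(np.mean((y[val_idx] - pred) ** 2)))
    return scores

# Synthetic data (assumption): a noisy straight line
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + np.random.default_rng(1).normal(0.0, 0.5, size=50)

scores = five_fold_scores(x, y)
cv_score = float(np.mean(scores))   # average of the 5 validation scores
```

Each sample is used for validation exactly once, so the averaged score uses all of the data without ever scoring a point that the model was trained on.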
Norm of a Vector
The p-norm of a vector x = [x_1, x_2, …, x_N] is denoted by ||x||_p (p = 1, 2, …):
$\|x\|_p = \left( \sum_{i=1}^{N} |x_i|^p \right)^{1/p}$
The most common norm is the 2-norm (p = 2).
The distance between vectors x and y is found with the 2-norm of their difference:
$\|x - y\|_2 = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$

Regression
Regression belongs to the supervised learning category. It fits a model to training data.

Linear Regression
N data sets of size n:
Model: $y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_N x_N$
Calculate the weights: $w = (X^T X)^{-1} X^T Y$
Set up matrices: the Y matrix is the dependent data set, and the X matrix holds all the other data sets, with a leading column of ones for the intercept w_0. Each column is a data set.

Membership Operators
in operator: used to check if a value exists in a sequence.
not in operator: used to check if a value is not in a sequence.
To check if a certain phrase is not in a string, use <phrase> not in <var>.

String Slicing, Lists, and Indexing
To index a variable use square brackets:
var[index] gets the value at index
var[index:] gets the values from index all the way to the end
var[:index] gets the values from the start all the way to index (but not including index)
var[start:stop:step] goes from start up to, but not past, stop, in steps of step (step >= 1 moves forward; a negative step moves backward)
Arrays can be indexed with positive and negative index values:

Character        P   y   t   h   o   n
Positive index   0   1   2   3   4   5
Negative index  -6  -5  -4  -3  -2  -1

To concatenate, or combine, two strings you can use the + operator.

% formatting can be used to insert variables into strings:
string = "var1 is %i, string1 is %s" % (var1, string1)
The directives are as follows: %s for strings, %i for ints, %f for floats, %d for decimal ints, %x for hexadecimal ints, %o for octal ints, %u for unsigned ints, %e for floats in exponent notation.

String formatting with format() is as follows:
string = "var1 is {}, string1 is {}".format(var1, string1)

f-Strings are less verbose and more readable:
string = f"var1 is {var1}, string1 is {string1}"
Display integers in different bases:
string = f"var1 is {var1:7x}, is hexadecimal"
with o being octal, x being hex, X being HEX (all caps), d being decimal, and b being binary. The number before the letter is the minimum number of characters printed.
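The indexing, slicing, membership, and formatting rules above can be tried out with a short sketch. The variable names and values (word, value, label) are made up for illustration.

```python
word = "Python"

# Indexing with positive and negative indices
first = word[0]             # 'P'
last = word[-1]             # 'n'

# Slicing: start is included, stop is excluded
head = word[:2]             # 'Py'
tail = word[2:]             # 'thon'
every_other = word[0:6:2]   # 'Pto'

# Membership operators
has_th = "th" in word       # True
no_z = "z" not in word      # True

# Concatenation with +
joined = head + tail        # 'Python'

# Three ways to insert variables into a string (illustrative values)
value = 255
label = "iron"
old_style = "value is %i, label is %s" % (value, label)
new_style = "value is {}, label is {}".format(value, label)
f_style = f"value is {value}, label is {label}"

# Hexadecimal, padded to a width of 7 characters
as_hex = f"{value:7x}"      # '     ff'
```

All three formatting styles produce the same string here. The f-string version keeps each variable next to the place where it appears.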
Center a string like so:
string = f"str1 is {str1:^11}, is centered in a string of 11 chars"
You can add a specific padding character like so: {str1:*^11} makes * the padding character.
Format in scientific notation:
string = f"str1 is {str1:.2E}, is in sci notation"
E or e is the type of notation, and .2 rounds to 2 decimal places.

Operators

Arithmetic                      Bitwise
+   Addition                    <<  Bit shift left
-   Subtraction                 >>  Bit shift right
*   Multiplication              ^   Bitwise xor
/   Division                    &   Bitwise and
%   Modulus (remainder)         |   Bitwise or
//  Floor division              ~   Bitwise not
**  Exponent

Floor division rounds down and returns an int when both operands are ints.
Operators also work as assignments in the form operator= (for example, x += 1).
For boolean logic, the keywords and, or, and not are used. There is no xor keyword; for two booleans, a != b or a ^ b gives the exclusive or.

Comparison
==  Equal
!=  Not equal
>   Greater than
<   Less than
>=  Greater than or equal to
<=  Less than or equal to
is  Compares identities

Collection data types include List [] and Tuple (), which are ordered and allow duplicate members (lists are changeable and tuples are not). Set {} is an unordered and unindexed collection with no duplicate members. Dictionary {} is accessed by key rather than by position and has no duplicate keys.
Tuples are immutable, so to change them you need to re-declare them. To convert between a list and a tuple:
tuple(x) converts list x to a tuple
list(y) converts tuple y to a list

A set is written with {}. Sets can't be nested, and values can't be accessed with indexing, but they can be accessed by iterating. Items are unchangeable, but items can be added and removed.
x.add(val) adds val to x.
x.remove(val) removes val from x.
x.update(y) adds the items of y to x in place.
x.union(y) combines x with y and returns the new set.

L2 (Tikhonov) Regularization: controls the model complexity by adding a penalty on the weights. The penalty applied to the weights for L2 regularization is:
$\alpha \|w\|_2^2$
where alpha is a constant that controls the strength of regularization.

Degree of Polynomial and Fundamental Trade-off: as the polynomial degree goes up, E_train goes down, but E_approx goes up. To prevent overfitting, we can train the model with several different degrees, then choose the degree with the lowest validation error. We can also use a larger dataset.
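Choosing the polynomial degree by validation error, as described above, might look like the sketch below. The quadratic data, the noise level, the range of degrees, and the every-third-point validation split are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 60)
# Synthetic data (assumption): a quadratic plus a little noise
y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Hold out every third point for validation, train on the rest
val_mask = np.arange(x.size) % 3 == 0
x_train, y_train = x[~val_mask], y[~val_mask]
x_val, y_val = x[val_mask], y[val_mask]

train_errors = {}
val_errors = {}
for degree in range(1, 9):
    # Least-squares polynomial fit on the training points only
    coeffs = np.polyfit(x_train, y_train, degree)
    train_pred = np.polyval(coeffs, x_train)
    val_pred = np.polyval(coeffs, x_val)
    train_errors[degree] = float(np.mean((y_train - train_pred) ** 2))
    val_errors[degree] = float(np.mean((y_val - val_pred) ** 2))

# Keep the degree with the lowest validation error
best_degree = min(val_errors, key=val_errors.get)
```

The training error can only shrink as the degree grows, so it cannot be used to pick the degree. The validation error is the quantity that reveals overfitting.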
To apply L2 regularization to the MSE (cost) function, the regularized cost is:
$c(w) + \alpha \|w\|_2^2$
where c(w) is the original cost function and alpha is a constant that controls the strength of regularization.

Outliers are data samples that deviate from the other samples. Least squares amplifies the impact of outliers since the error is squared. The solution is to replace the squared error by the absolute error.
For least squares, we use the mean square error:
$MSE = \frac{1}{P} \sum_{i=1}^{P} (y_i - \hat{y}_i)^2$
For least absolute, the cost function becomes the mean absolute deviation:
$MAD = \frac{1}{P} \sum_{i=1}^{P} |y_i - \hat{y}_i|$
where P is the number of datapoints and $\hat{y}_i$ is the model prediction for sample i.

Data Cleansing can be done by removing missing values or replacing them with the mean, removing duplicate rows, and removing outliers (Z-score).

Operation on Dictionaries
Dictionaries {} are written like a set, with comma-separated entries, but each entry has a key and a value: {key: value}. Keys must be immutable; values can be any data type.
x.keys() returns the keys in x.
x.values() returns the values in x.
x.items() returns a tuple (key, value) for each item.
x.get(key) returns the value of a given key in x.
x.pop(key) removes the given key from x and returns its value.
x.popitem() removes the last inserted key-value pair.
x.clear() removes all elements in x.
x.update(iterable) updates the dictionary with an iterable containing key-value pairs.
dict.fromkeys(keys, val) returns a dictionary with the specified keys, each set to val.
x.setdefault(key, val) returns the value of the specified key. If the key does not exist, it inserts the key with the value val.

Design Problem Definition
Recognize the need, identify the client/end user, and make sure the client knows their need. Need statements are written in a negative tone.

Gradient is a vector that gives the direction and rate of the fastest increase of the function.
Gradient Descent is one of the most important algorithms in machine learning. It tweaks the parameters w iteratively to minimize the cost function.
The iterative equation is:
$w_j^{k+1} = w_j^k - \alpha \frac{\partial c(w^k)}{\partial w_j}$
In vector form:
$w^{k+1} = w^k - \alpha \nabla c(w^k)$
where k is the iteration.
Gradient Descent to minimize MSE:
(1) Set k = 0, pick an initial $w^0$, and set alpha.
(2) Calculate the gradient $\nabla c(w^k)$ and update to $w^{k+1}$.
(3) Repeat until the stopping criterion is met or a maximum number of iterations (# iter.) is reached.
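The gradient descent steps for minimizing the MSE can be sketched as below. This is a minimal illustration, not the course's reference implementation: the function name gradient_descent_mse, the zero starting point, the step size, the tolerance, and the noise-free line data are all assumptions.

```python
import numpy as np

def gradient_descent_mse(X, y, alpha=0.1, max_iter=5000, tol=1e-10):
    """Minimize (1/P) * ||X w - y||^2 by gradient descent (hypothetical helper)."""
    P = X.shape[0]
    w = np.zeros(X.shape[1])             # step (1): initial weights, k = 0
    for k in range(max_iter):            # stop after max_iter iterations at most
        residual = X @ w - y
        grad = (2.0 / P) * (X.T @ residual)   # gradient of the MSE at w
        w_new = w - alpha * grad              # step (2): update the weights
        # step (3): stop once the update becomes negligible
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return w

# Illustrative data (assumption): y = 1 + 2x with no noise
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])   # leading column of ones for w_0
y = 1.0 + 2.0 * x

w = gradient_descent_mse(X, y)   # approaches [1.0, 2.0]
```

If alpha is too large the updates diverge, and if it is too small the loop needs many more iterations, so the step size usually has to be tuned for each problem.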
A need statement describes how the current situation is bad and should be descriptive and clear. Format: issue + who/what it applies to + positive result of solution.

Defining the problem
Goal: a brief, general, and ideal response to the need statement.
Objective: quantifiable expectations of performance. Objectives must have units, establish the operating environment, act as indicators of progress towards the goal, describe performance characteristics to the client, and specify how individual performance parameters are optimized.

Diagrams
Hierarchical system diagrams (1) define elements within a product using a tree diagram.
System functional analysis diagrams (2) define functions and the flow of information within a product.
Quality Function Deployment (QFD) diagrams (3) match customer requirements to engineering requirements.

Importing a Function or Package
Import a module: import myModule
Import a module as an alias: import myModule as module_alias
Import a part of a module: from myModule import myFunc

Common Libraries/Modules
Math: import math
Contains common mathematical functions and constants like acos, asin, atan, ceil, cos, cosh, degrees, e, erf, erfc, exp, factorial, floor, gcd, log, log10, log1p, log2, pi, pow, radians, sin, sinh, sqrt, tan, tanh, tau, trunc, etc.

Random: import random
Contains random number generation tools such as random.randrange(min, max), which returns a random integer from min up to, but not including, max.

Numpy: import numpy as np
A library used for working with arrays. Its arrays are more memory efficient than lists, and therefore faster.
np.array([[v1, v2, v3], [v4, v5, v6]], dtype='int') creates a 2x3 array of type int (dtype is optional).
np.array([v1, v2, v3, v4, v5, v6]).reshape([2,3]) creates a 2x3 array.
x.shape returns the shape of the matrix x (row, col).
x.ndim returns the number of dimensions of x.
x.dtype returns the data type of matrix x.
x.size returns the total number of elements in matrix x.
np.arange(start, end, step) returns an array of evenly spaced values from start to end (not included), spaced out by step.
np.linspace(start, end, n_points, endpoint=False) returns an array of n_points evenly spaced numbers between start and end. The optional endpoint argument chooses whether end is included (it is included by default).
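The array-creation calls listed above can be checked with a short sketch. The values are arbitrary.

```python
import numpy as np

# Two ways to build the same 2x3 integer array
a = np.array([[1, 2, 3], [4, 5, 6]], dtype="int")
b = np.array([1, 2, 3, 4, 5, 6]).reshape([2, 3])

shape = a.shape    # (2, 3): 2 rows, 3 columns
ndim = a.ndim      # 2: number of dimensions, not number of rows
size = a.size      # 6: total number of elements

# arange uses a step size and excludes the end value
steps = np.arange(0, 10, 2)                            # 0, 2, 4, 6, 8

# linspace uses a number of points; the end value is included by default
closed = np.linspace(0.0, 1.0, 5)                      # 0, 0.25, 0.5, 0.75, 1
half_open = np.linspace(0.0, 1.0, 5, endpoint=False)   # 0, 0.2, 0.4, 0.6, 0.8
```

np.array takes one nested list as its first argument. Passing the two rows as separate arguments does not build a 2x3 array.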
QFD
The Quality Function Deployment (QFD) diagram (3) is used for matching customer requirements to engineering design and performance parameters. The key elements are the customer requirements, the engineering requirements, and the characteristics of competing products.
QFD description (4):
Customer Req. must provide a clear and optimal direction.
Engineering Req. must be quantifiable and have units.
Units and Eng. Targets must have proper units and be realistic.
The Roof describes relationships between Eng. Req. ('+', '-', or nothing).
Matrix: place an X where a customer req. and an Eng. req. are linked. Go row by row.
Benchmarks are possible solutions; a benchmark gets an 'O' if it best fulfills a customer req.
QFD figure labels: Eng. Req., Bench, matrix of req. relations, Eng. Targets, Customer Req.

Numpy random numbers and indexing
np.random.rand(n_points) returns a 1-D array of n_points random values uniformly distributed between 0 and 1.
np.random.randn(n_points) returns a 1-D array of n_points random values drawn from the standard Gaussian distribution.
np.random.seed(seed) sets the random seed.
Arrays are indexed the same as lists, but 2-D arrays have two indexes for row and column: x[row_indx, col_indx]
You can index several values of an array all at once by using an array of integers as your index.

Matplotlib: import matplotlib.pyplot as plt
A low-level graph plotting library, usually imported as plt.
To plot: plt.plot(x, y)
Scatter plot: plt.scatter(x, y)
To display: plt.show()

Numpy Statistics Operators
np.median(array), np.average(array), np.mean(array), np.std(array)

Numpy Matrix Operators
a.T returns the transpose of a.
a*b or np.multiply(a, b) performs element-wise multiplication of a and b.
a@b or np.matmul(a, b) performs matrix multiplication of a and b.
np.dot(a, b) performs the dot product of a and b (matrix multiplication for 2-D arrays).
np.linalg.det(a) returns the determinant of a.
np.linalg.inv(a) returns the inverse of matrix a.
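The matrix operators above can be compared side by side in a short sketch. The two 2x2 matrices are arbitrary examples.

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([[1.0, 0.0], [4.0, 5.0]])

transpose = a.T                   # rows and columns swapped
elementwise = a * b               # same as np.multiply(a, b)
matrix_product = a @ b            # same as np.matmul(a, b)
determinant = np.linalg.det(a)    # 2*3 - 1*1 = 5
inverse = np.linalg.inv(a)

# A matrix times its inverse gives the identity, up to rounding error
identity_check = a @ inverse
```

The * operator multiplies matching entries and @ multiplies rows by columns, so the two give different results for the same pair of matrices.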
Gradient descent advantages: it has a cost of O(NPK), where N is the number of input features, P is the number of datapoints, and K is the number of iterations. Gradient descent is usually faster than the closed-form solution when the number of features is very large. Gradient descent can be used for many optimization problems.

Modular Programming takes a large programming task and breaks it into smaller subtasks/modules. Functions, modules, and packages are the Python constructs that support this. A module is a .py file containing functions or code.
Defining a function:
def myFunc(in1, in2, …):
    <code>
    return output

Control Flow & Loops
Python uses indenting to determine what is a part of the control statements.

If Else                 For                     While (also uses break and continue)
if var == condition:    for var in iterable:    while condition:
    <code>                  <code>                  <code>
elif condition:         else:
    <code>                  <code>
else:
    <code>

Use break to stop a loop. Use continue to skip to the next iteration.
For for loops, if var isn't used, replace it with _. The else block runs when the for loop ends normally, not when break is used.

Operation on Lists
x.append(y) inserts y at the end of the list x.
x.insert(i, y) inserts y at position i in x.
x.index(y) returns the index of y in the list x (1st occurrence).
x.count(y) counts the number of y occurrences in x.
x.remove(y) removes the 1st occurrence of y in x.
x.reverse() reverses the list in place.
x.sort() sorts the list in place.
x.pop(i) removes and returns the object at position i in the list.
x.extend(y) appends every item of the iterable y to the end of x.

Quality Metrics
Having made the model and obtained w, we are ready to make predictions using this model. How do we measure the quality of this model?
Minimum MSE (Cost Function), evaluated at the fitted weights w:
$MSE = \frac{1}{P} \|Y - Xw\|_2^2$

Polynomial Regression: like linear regression, but in this case we only work with one dependent data set and one input feature (the other data set).
Model: $y = w_0 + w_1 x + w_2 x^2 + \dots + w_N x^N$
Calculations for the weights and MSE are the same as for linear regression. Although we have only one input feature, the polynomial fit makes new features: $x, x^2, \dots, x^N$, one column of X for each power.
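The way a polynomial fit makes new features out of a single input can be sketched as below. The sketch assumes the weights come from the usual least-squares normal equations, w = (X^T X)^{-1} X^T y, and uses made-up noise-free quadratic data; neither the data nor the variable names come from the course.

```python
import numpy as np

# Illustrative data (assumption): y = 0.5 - x + 2x^2 with no noise
x = np.linspace(-2.0, 2.0, 25)
y = 0.5 - 1.0 * x + 2.0 * x**2

degree = 2
# One input feature becomes the columns 1, x, x^2 of the X matrix
X = np.column_stack([x**p for p in range(degree + 1)])

# Normal equations, solved as a linear system instead of forming the inverse
w = np.linalg.solve(X.T @ X, X.T @ y)    # weights [w_0, w_1, w_2]

# Quality metric: MSE of the fitted model on the same data
y_hat = X @ w
mse = float(np.mean((y - y_hat) ** 2))
```

The weight vector has degree + 1 entries because the column of ones carries the intercept w_0.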
Machine Learning
Machine learning is an algorithmic framework that teaches computers to recognize patterns by observing datasets, then to make predictions based on those datasets.
Typical steps:
Data collection gathers the samples used to train the computer.
Feature design chooses features such as color, shape, and size to detect. It can be done automatically or manually.
Model training teaches the machine to find a separation between two different things.
Model validation tests whether the machine recognizes previously unseen samples.

The weight matrix is a column matrix of size N+1 (w0 is your y-intercept).

QFD figure labels: Units, Eng. Targets, Comp. Benchmark, Customer Req.

enumerate(iterable, start) returns tuples containing a count (from start) and the values obtained from iterating over iterable.
range(start, stop, step) creates an iterable of integers from start to stop (not including stop), with a step of step.

Input Function
input(prompt) displays prompt and returns what the user types as a string.

Data Type Conversion
int(var) converts to integer
str(var) converts to string
float(var) converts to float

Unpacking
Given a list or tuple, extract the values into variables, for example x, y, z = metals.

Swap Variables
x, y = y, x
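The enumerate, range, unpacking, swap, and conversion rules above can be tried together in a short sketch. The metals list and the other values are only illustrative.

```python
metals = ["iron", "copper", "aluminum"]

# enumerate pairs each value with a count that begins at start
numbered = list(enumerate(metals, 1))
# [(1, 'iron'), (2, 'copper'), (3, 'aluminum')]

# range stops before the stop value
evens = list(range(0, 10, 2))    # [0, 2, 4, 6, 8]

# Unpack a list into variables, then swap two of them
x, y, z = metals
x, y = y, x                      # x is now 'copper', y is now 'iron'

# Data type conversion
count = int("42")                # 42
ratio = float("2.5")             # 2.5
text = str(count + 1)            # '43'
```

Unpacking needs exactly as many variables on the left as there are values on the right; otherwise Python raises a ValueError.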
