BSAN 160 Cheat Sheet
Eliana Umaña

Data: the main ingredient in all forms of analysis to support business decision making; a collection of facts obtained as the result of experiments, observations, transactions, or expressions
● Big Data: data that cannot be stored in a single storage unit (ex. healthcare, web searches, financial data, consumer opinions)
Analytics: the process of identifying patterns and relationships in data and developing actionable decisions based on insights generated from data
Business Intelligence (BI): concerned with managing highly complex, high-dimensional data to support business decisions and improve performance
Descriptive Analytics: uses past data to find patterns and relationships among variables. What is happening?
Predictive Analytics: the use of statistical techniques and data mining to predict what is going to happen (future values) using current and past data. What will happen?
Prescriptive Analytics: aims not only to predict what is going to happen but also to compare different actions/decisions and select the best decision based on certain performance criteria. What should I do and why?
Decision Support System (DSS): computer-based support system (helps utilize data and models to solve problems)
True Positive Rate: TP/(TP+FN)
True Negative Rate: TN/(TN+FP)
Simulation: a prescriptive analytics method that analyzes characteristics of interest as random variables by using probability distributions for uncertain input data
● Must be run multiple times

Chapter 2
Structured Data: organized as data arrays (spreadsheets) with variables, observations, and values (e.g., names, dates, addresses)
Unstructured Data: cannot be processed and analyzed using conventional tools and methods (text, video, audio)
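As a rough illustration of the structured vs. unstructured distinction above, the sketch below stores a tiny structured dataset as rows with the same variables in every row, next to a piece of free text. All names and values (`customers`, `review`, the cities and dates) are made up for illustration and do not come from the course material.

```python
# Structured data: every observation (row) has the same variables (keys).
# All values below are invented for illustration only.
customers = [
    {"name": "Ana", "signup_date": "2024-01-15", "city": "Austin"},
    {"name": "Ben", "signup_date": "2024-02-03", "city": "Denver"},
]

variables = list(customers[0])                # the variables (column names)
n_observations = len(customers)               # the number of observations (rows)
cities = [row["city"] for row in customers]   # the values of one variable

# Unstructured data: free text with no predefined variables or rows.
review = "Loved the product, but shipping took forever."

print(variables)
print(n_observations, cities)
print(len(review.split()), "words of free text")
```

The structured records can be queried by variable name right away, while the review needs text preprocessing (Chapter 7) before it can be analyzed.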
Categorical: variables that classify information into categories (ordinal and nominal)
Types of Variables: Numeric (interval and ratio) - age, income, time; Ordinal - there is a natural ordering (grades A+, A, A-); Nominal - no natural ordering (race, gender)
Descriptive statistics (describe the data on hand): use mean, median, and mode to measure central tendency; use quartiles, standard deviation (typical distance from the mean), and variance (average squared distance from the mean) to measure spread/variability; use skewness to measure shape
● Central Tendency: Mean → average, Median → value that appears in the middle of the sorted data, Mode → value that appears most often
● When Mean > Median - positive (right) skew; when Mean < Median - negative (left) skew
Line graphs - good for time-series data (ex. time series of monthly sales volume)
Bar charts - good for depicting nominal or numerical data that can be easily categorized (ex. Yes/No questions)
Pie charts - good for depicting proportions (ex. what percentage of a company each person owns)
Histograms - depict frequency distributions (ex. the one with visual skew)
Scatter plots - illustrate relationships between two or three variables (ex. life expectancy vs. net income)
Boxplots - show descriptive statistics such as min, max, median, 1st and 3rd quartiles (ex. students' high school GPA percentile for two categories: those still enrolled in the 2nd year of college vs. those who dropped out)

Chapter 3
Data Consolidation: identify relevant data, collect data, merge datasets (inner join, outer join)
● Left: discards unmatched rows from right, keeps unmatched rows from left
● Right: discards unmatched rows from left, keeps unmatched rows from right
● Full: keeps all rows
Data Cleaning: assign values that were missing, reduce outliers and abnormal values, eliminate duplicate entries, reduce noise
Data Transformation: normalize data between a min and max to rescale, convert numeric variables into discrete categories, create new variables using existing variables
Data Reduction: variable selection (study & discard), dimensional reduction, reduce the number of attributes and records, balance skewed data
Cases/samples: sampling, balancing/stratification
Imputation: replaces missing values using the mean of the variable of interest - most appropriate
Information Dashboards: provide visual displays of important information that is consolidated, cleaned, and arranged on an (interactive) screen so that key insights can be digested at a single glance and easily drilled into for further exploration

Chapter 4 - Regression
Linear Regression: a straight line that gives the "best fit" to points in a set of data (most common type of regression)
● Simple Linear Regression: one numerical dependent variable, one independent variable which can be numerical or categorical. Variables in the first degree (no exponents)
● Multiple Linear Regression: one numerical dependent variable, more than one independent variable which can be numerical or categorical. All variables in the first degree
Logistic Regression: the dependent variable (output) is a binary variable (as opposed to a numerical variable) such as 0/1 or yes/no
Precision: ratio of correctly predicted positive outcomes to the total number of predicted positive outcomes, a/(a+c)
Recall: ratio of correctly predicted positive outcomes to the actual number of positive outcomes, a/(a+b)
Layout of a, b, c, d (as implied by the two formulas above):
                     Predicted positive   Predicted negative
  Actual positive            a                    b
  Actual negative            c                    d
R-Squared: a measure of goodness of fit of the regression model for the given data; always ranges between 0 and 1. The better the model fits, the closer R-squared is to 1; if R-squared = 0.5, the independent variable accounts for 50% of the variation in the dependent variable
● y = β0 + β1x1 + ε
● Coefficient β0 is the value of y where the line crosses the y-axis (intercept)
● Coefficient β1 is the slope of the line (change in y per one-unit change in x1)
● Error/residual term ε represents the gap between actual and predicted values of y
Confusion Matrix: shows the performance of a model by tabulating predicted outcomes against actual outcomes

Chapter 5 - Data Mining
Data Warehouse: a database; a physical repository for data.
It contains copies of relevant data in an organization, is generally thousands of gigabytes in size, and is optimized for fast searching
Relational data: specifically organized data, usually presented in multiple data tables, where data points in different tables are related to one another
● Each row in the tables is recorded with a unique ID (called the key)
OLAP: approach to execute queries against data such as 3-dimensional data cubes (how a DW is commonly presented), by changing the data orientation and granularity
● Slice: taking a subset of multidimensional data
● Dice: taking a slice on two or more dimensions of the data
● Drill-up: navigating among levels of data by moving from more detailed → summarized
● Drill-down: navigating among levels of data by moving from summarized → more detailed
● Pivot: changes the dimensional orientation of data by transforming the data from rows of a table into data grouped on several columns
Data mining methods are categorized based on the task and the learning type
Supervised Learning: the algorithm learns from labeled training data and aims to predict outcomes for unseen data
Prediction: tells the nature of future occurrences of certain events, numerical or categorical, based on what has happened in the past (synonym: forecasting)
● Classification: when the output is categorical. Learns patterns from training data: input and output variables that describe items, objects, or past events (is supervised data mining)
● Regression: when the output variable is numerical or binary.
A statistical estimation technique based on fitting a mathematical equation to existing data to explain the relationship between a dependent variable (y) and one (or more) independent variables (x1, x2, …)
○ Numeric output (dependent) variable → Linear regression
○ Binary output (dependent) variable → Logistic regression
Time series: when the values of a numerical output (dependent) variable are captured and modeled over time. Estimates a future value (1, 2, …, n steps into the future) of a numerical variable based on its own past data values
K-fold: the complete data set is split into k mutually exclusive subsets of approximately equal size; the model is trained and tested k times, each time tested on one left-out subset using the others as the training set
Support: of a rule X & Y is the probability of observing X and Y together in the entire dataset, P(X ∩ Y)
Confidence: of a rule Y|X is the probability of observing Y given that X is observed, P(X ∩ Y)/P(X)

Chapter 6 - Data Mining: Clustering
Tasks: Prediction, Association, Segmentation
● Segmentation: refers to data mining methods that categorize data into a set of classes
Clustering: a data mining tool for classifying items (e.g., people, things, events) into common groupings called "clusters" using similarity and distance metrics
Cluster Analysis Methods: hierarchical, k-means, neural networks, genetic algorithms
Hierarchical Clustering: a clustering method that iteratively groups the items (e.g., individuals) in a dataset based on the pairwise distances between these items until every item is linked into one larger cluster
● Linkage criteria: the criteria that decide how to define "closeness" when we use hierarchical clustering
● Single Linkage: creates clusters using the minimum distance between an item (e.g., person) and a cluster to decide which items to cluster in each step of the algorithm
Validation: evaluate the model's performance on data that wasn't used to build the model
Validation (Testing) Data: data that is excluded from the model-building process and used to test model accuracy/performance
Euclidean Distance: a commonly used distance measure; it measures the distance between (x1, y1) and (x2, y2)
● Equation in 2-D space: d = √((x2 - x1)² + (y2 - y1)²)
Linear programming: used for obtaining the optimal solution for a problem with given constraints

Chapter 7 - Textual Data
Textual data are characterized by their unstructured nature
Text Analytics - a broad concept that includes analytics techniques used on textual data
Text Mining - the specific process of extracting knowledge from textual data by discovering patterns and providing new and useful information
Text Mining Goals
● Information extraction: identification of key phrases, main points, and relationships within text
● Categorization: identifying/categorizing the main themes of a document
● Topic tracking: based on documents that a user views, predict other documents of interest to the user
● Summarization: summarizing a document
● Clustering: grouping similar documents
Text Preprocessing: must clean data before analyzing
Text Mining Methods
Tokenization: the process of breaking complex data like paragraphs into simple units called tokens.
It is often combined with the removal of stop words
● Word tokenization: splits a sentence into a list of words
● Sentence tokenization: splits a paragraph into a list of sentences
Bag-of-Words (BoW): a text mining technique that uses word frequencies as a method of feature extraction (a numeric vector recording how many times each word appears)
Word Cloud: visual representation of the frequency of words within a given body of text
Bigrams: tokens consist of two consecutive words → looks for the most frequently used sets of 2 words
● Ex: ['My name', 'name is', 'is John']
Ngrams: tokens consist of n consecutive words → looks for the most frequently used sets of n words

Chapter 8 - Decision Models
Descriptive and predictive analytics create the foundation (i.e., choice alternatives) for prescriptive analytics (i.e., selecting the best possible decision)
Risk: the level of uncertainty, i.e., the probability of each possible outcome occurring
1. There are several possible decision alternatives identified for a decision problem
2. Each decision alternative leads to possible outcome(s)
3. There is uncertainty about which outcome(s) will occur, and probabilities of the possible outcomes are assessed
4. For each decision alternative and each possible outcome, there is an associated payoff/cost
5. A "best" decision must be chosen using an appropriate decision criterion
Expected Monetary Value (EMV): (End Node1 × Probability1) + (End Node2 × Probability2)

Chapter 9 - Optimization
Optimization - Prescriptive analytics
Decision variables: variables whose values the decision maker can choose
Objective function: mathematical formulation to be optimized, formulated using the decision variables
Constraints: physical, logical, or economic restrictions, depending on the nature of the problem
Feasible solution: a solution (set of values for all x variables) that satisfies all the constraints
Infeasible solution: violates one or more constraints
Optimal solution: a feasible solution (a set of values for all x variables that satisfies all constraints) that optimizes the objective function
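The Chapter 9 vocabulary can be made concrete with a small worked example. The sketch below is only an illustration under stated assumptions: the profit coefficients and the three constraints are hypothetical, the decision variables are restricted to whole numbers from 0 to 10, and the search simply enumerates every candidate rather than using a real linear programming solver.

```python
from itertools import product

# Hypothetical problem (all numbers are made up for illustration):
# maximize profit = 3*x1 + 5*x2
# subject to  x1 <= 4,  2*x2 <= 12,  3*x1 + 2*x2 <= 18,  x1, x2 >= 0

def objective(x1, x2):
    """Objective function: the quantity to be optimized."""
    return 3 * x1 + 5 * x2

def is_feasible(x1, x2):
    """True only if every constraint is satisfied."""
    return x1 <= 4 and 2 * x2 <= 12 and 3 * x1 + 2 * x2 <= 18

# Candidate values of the decision variables (assumption: integers 0..10).
candidates = list(product(range(11), repeat=2))

feasible = [p for p in candidates if is_feasible(*p)]
infeasible = [p for p in candidates if not is_feasible(*p)]

# Optimal solution: the feasible solution with the best objective value.
optimal = max(feasible, key=lambda p: objective(*p))

print("feasible solutions:", len(feasible))
print("infeasible solutions:", len(infeasible))
print("optimal solution:", optimal, "profit:", objective(*optimal))
```

Enumeration only works for tiny problems like this one; in practice a linear programming solver searches the feasible region far more efficiently.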