edX Introduction to Data Science

Module 1: Defining Data Science
- Data Science studies large quantities of data to gain insights and make strategic decisions based upon those
- Data Science involves some math, some science, and lots of curiosity to explore the data
- Data Scientists need to not only analyse and interpret data, but also use the insights to tell a story that informs decisions

Module 2: What Data Scientists Do
- General steps:
  - Identify and understand the problem that needs to be solved
  - Collect the data for analysis
  - Identify the right tools to use
  - Develop a data strategy
- There are tons of data that could be collected; you need to figure out which is useful
- Basic tools often used:
  - Regressions
  - Neural networks
  - Machine learning (k-nearest neighbours, estimation, etc.)
- Cloud: remote central storage and working environment, no need to run things on your own system
  - Can access data and programs/algorithms without needing to install them locally

Module 3: Big Data and Data Mining
- 5 V's of Big Data:
  - Velocity: how fast data is accumulated
  - Volume: how much data is and can be stored
  - Variety: the different forms data comes in
  - Veracity: truth and completeness, conformity to facts/categories etc.
  - Value: the insight the data can yield
- Hadoop: essentially a distributed platform which segments data, sends it to different servers, computes the parts separately, then brings them back together
- Big data has different definitions depending on who you ask, but in almost all of them the name is the game: very large amounts of data that can't be processed (or only with great difficulty) using traditional methods
- New computational and statistical techniques have emerged that are better equipped to handle these giant data sets
- Data Mining:
  - Establish goals for the mining: identify key questions, costs vs. benefits, and the expected accuracy and usefulness of the data mined
  - Select data: quality of output depends largely on quality of data used; you might not have the data you need handy, so you have to look for alternate sources or plan data collection
  - Preprocess data: remove erroneous data, identify systematic errors, account for missing data, etc.
  - Transform data so that it fits the analysis you want to do, e.g. aggregating all sources of income into one variable ("income")
  - Store data in a way that gives you easy access, but is also secure and prevents mining of other unrelated datasets
  - Mine the data using a variety of algorithms (visualization is a good start)
  - Evaluate the mining results: could involve testing the predictive capabilities of the mined data, or how well the mined data reproduces the observed data
  - The entire process is iterative: interpret the results to improve your pipeline, then review the new results, etc.

Module 4: Deep Learning and Machine Learning
- AI vs. Data Science:
  - Data science is the process of extracting information from large amounts of disparate data via different methods, and includes the entirety of that process
  - AI encompasses everything that allows a computer to learn and make intelligent decisions
- Deep Learning:
  - Based on neural networks that need a lot of computing power
  - Gets better the more data you have to train it on

Module 5: Data Science and Business
- Skills people look for in a Data Scientist:
  - Curiosity, analytical thinking, problem solving
  - Good storytelling, communication, interpersonal skills
  - Analytical thinking is maybe more important than analytical skills, but basics in statistics, some algorithms, maybe machine learning, etc. are important

edX Data Science Tools

Module 1: Data Scientist's Toolkit

Languages
- Python
  - Most popular data science language
  - General purpose: data science, machine learning, IoT, web development, etc.
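As a minimal sketch of what everyday Python data work looks like in practice — using pandas and NumPy from the library list in these notes; the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with a missing value
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales":  [100.0, np.nan, 150.0, 200.0],
})

# Basic cleaning: fill the missing value with the column mean
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Basic analysis: aggregate sales per region
totals = df.groupby("region")["sales"].sum()
print(totals)
```

A few lines like this cover a surprising amount of the day-to-day work: load, clean, aggregate.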
  - Open source, with many dedicated libraries for different things:
    - Data Science: Pandas, SciPy, NumPy, Matplotlib
    - AI: TensorFlow, PyTorch, Keras, Scikit-learn
    - Natural Language: NLTK
- R
  - Easy to translate math to code, hence used a lot for stats models, analysis and graphs
  - World's largest repository of statistical knowledge
  - Integrates well with other languages
- SQL
  - Limited to querying and managing data
  - Works with structured data, but interfaces with non-structured data repositories
  - Directly accesses the data, no need for copying
  - E.g. MySQL, IBM Db2, PostgreSQL, Apache, SQLite, Oracle, MariaDB, etc.

Open source tools
- Data Management: relational databases
  - SQL: MySQL, PostgreSQL
  - NoSQL: MongoDB, Apache CouchDB, Apache Cassandra
  - File-based: Hadoop File System, Ceph
- Data Integration and Transformation (ETL: extract, transform, load)
  - Apache Airflow, Kubeflow, Apache Kafka, Apache NiFi, Apache SparkSQL, Node-RED
- Data Visualization
  - Hue, Kibana, Apache Superset
- Model Building
- Model Deployment
  - PredictionIO
  - Seldon: supports nearly every framework
  - TensorFlow
- Model Monitoring and Assessment
  - ModelDB, Prometheus
  - AI Fairness 360: assesses bias within models
  - Adversarial Robustness 360: protects against adversarial attacks with bad data
  - AI Explainability 360 Toolkit: tries to give examples of what happens in the model
- Code and Data Asset Management
  - Git for code asset management (GitHub, GitLab)
  - Data needs metadata: Apache Atlas or Kylo for data asset management

Development and Execution Environments
- Jupyter
  - Supports many different programming languages via kernels
  - Jupyter Notebooks unify documentation, code, output and visuals in a single document
  - JupyterLab is the next generation intended to replace it; can open different file types, including Jupyter Notebooks
- Apache Zeppelin: can do similar things, but can plot without having to code
- RStudio
  - One of the oldest, but only runs R (some Python development possible)
  - Also has remote data access
- Spyder: tries to bring the RStudio experience to Python, but can't do as much
- Apache Spark
  - Cluster execution environment, linearly scalable
  - Can process huge amounts of data file by file (batch)
  - Apache Flink is similar, but can do real-time data streams
- KNIME: fully integrated visual tool; can be extended with Python and R, as well as Apache Spark

Commercial options
- Database management: IBM Db2, Microsoft SQL Server, Oracle
- ETL tools: Informatica, IBM InfoSphere DataStage (GUI-based)
- Data Visualization
  - Visually attractive, live dashboards
  - Tableau, Microsoft Power BI, IBM Cognos Analytics
  - Options specifically for data scientists available
- Model Building: SPSS Modeler, SAS Miner (best for data mining)
- Model Deployment: tightly connected to model building in a commercial setting, but open-source deployment formats are also supported (e.g. by SPSS Modeler)
- Model Monitoring and Code Management are relatively new categories, hence open source is the way to go
- Data Asset Management
  - Data has to be versioned and annotated
  - Informatica and IBM InfoSphere provide tools specifically for that
- Fully integrated development environments: Watson Studio (web based or desktop version), H2O.ai

Cloud based tools
- Fully integrated visual tools and platforms, running on clusters of multiple server machines
  - IBM Watson Studio
  - Microsoft Azure Machine Learning
  - H2O Driverless AI (H2O.ai)
- SaaS (software as a service): the cloud provider operates the product for you
  - Proprietary tooling available as cloud-exclusive, e.g. the Amazon Web Services DynamoDB NoSQL database
  - Cloudant: most things done by the cloud provider, but compatible with CouchDB
- ELT (extract, load, transform) tools
  - Informatica Cloud Data Integration
  - IBM Data Refinery: transforms raw data in a spreadsheet-like interface; offers data exploration and visualization
- Model Building
  - Machine learning services available from nearly every cloud provider, e.g. Watson Machine Learning

Libraries
- Collections of functions and methods that allow actions without writing the code yourself
- Scientific Computing
  - Pandas for effective data cleaning, manipulation and analysis based on matrix-like structures; built on NumPy
  - NumPy, based on arrays
- Visualization
  - Matplotlib
  - Seaborn: based on Matplotlib; for easy heatmaps, violin plots, etc.
- High-Level Machine and Deep Learning
  - Scikit-learn: built on NumPy, SciPy & Matplotlib
  - Keras: uses the GPU; made for experimentation
- Deep Learning
  - TensorFlow: low-level framework for large-scale production
  - PyTorch: machine learning, made for experimentation
- Apache Spark
  - Can be used with Python, R, Scala or SQL
  - Complementary libraries: Vegas (statistical data visualization), BigDL (deep learning)
- R packages: ggplot2, or Keras/TensorFlow interfaces

Application Programming Interfaces (APIs)
- Let two pieces of software talk to each other
- Only the interface to the software component, not the component itself
- E.g. the pandas library has an API that allows us to call all its functions, even the parts not programmed in Python
- REST APIs are request-response APIs: one user/thing (the 'client') sends a request to a different thing (the 'resource') and in turn gets a response from it

Data Sets
- Collections of data, in structures such as:
  - Tabular
  - Hierarchical/network (represent relationships)
  - Raw files (images, audio, etc.)
- Private or open data
  - Open data from: open data portal lists, government/intergovernmental and organization websites, Kaggle, Google search engine
- Community Data License Agreement (CDLA)
  - CDLA-Sharing: use and modify, publication under same terms
  - CDLA-Permissive: use and modify, no obligations

Machine Learning Models
- Supervised learning: provide inputs and correct outputs; the model looks at the relationship between input and output
  - Regression: predict numerical values
  - Classification: predict class/category
- Unsupervised learning: data not labelled by a human; the model identifies patterns within the data
  - Clustering (e.g. recommendations for online shopping)
  - Anomaly detection (e.g. fraud)
- Reinforcement learning
- Deep learning
  - Tries to loosely emulate how the human brain computes
  - Very large data sets needed to train it, but pre-trained models can be downloaded
  - Built using frameworks, which can come with prebuilt models (model zoo)
- Model building steps:
  - Prepare data (labelling etc.)
  - Build the actual model
  - Train the model
  - Analyse results and iterate until the model performs how you want it to

Module 2: Open Source Tools

GitHub
- Repository: location where you put your stuff (remote, but can be cloned locally via SSH or HTTPS).
  - The remote is on GitHub/GitLab; mostly you add things to your local repository
- Typical workflow via terminal commands: cd into your local git folder, add the files you want to the staging area (git add XYZ), then commit them to the local repository (git commit)
- Push from the local to the remote repository (git push); can also edit the remote repository directly online
- The staging area controls which files go into a commit
- Have to (git pull) remote changes into the local repository
- Branches are different "subparts" of the repository
  - Master is the main branch; create a new branch via (git branch XYZ)
  - Merge branches via (git merge): merges the named branch into the branch you've switched into
  - Change branches via (git checkout XYZ)
- Forking a repository gives you a copy of the foreign repository under your own account
  - Can post a Pull Request to propose the changes in your copy back to the foreign repository

Jupyter
- Kernels are what provide R, Scala, etc. compatibility: a wrapper for the interpreter
- Runs locally (either on your laptop or whatever server you're using)
- Seaborn library uses matplotlib but makes it prettier
- Architecture: user → Notebook server, via the browser

Module 3: IBM Tools for Data Science

IBM Watson Studio
- Collaborative data science and machine learning platform
- Offers Jupyter Notebooks, Anaconda and RStudio
- Embedded machine learning models: neural networks, deep learning, etc.
- Projects are the centres of collaboration
  - Can add descriptions, readmes etc. in Overview
  - Assets contains Notebooks, Data, Models, etc.; Connection allows you to access data from outside sources
  - Environment contains Notebook and library settings
  - Settings allows you to add specific services

Jupyter Notebooks in Watson Studio
- Add to project → Notebook
- Best practice is to add a markdown cell with a description
- The Environment tab allows you to change the programming environment, restart it, or create a read-only Notebook version.
- You can also specify a Job to run at a specific date and time
- If you want to choose a different environment: stop the kernel, change the environment, select the new environment to be associated with the Notebook, and open it in edit mode
  - You will need to reload the data into the new environment

GitHub and Watson
- Profile settings → Integrations → GitHub personal access token → copy and paste it
- Settings tab → GitHub integration on the project level
- Best practice: remove or replace credentials before publishing

IBM Watson Knowledge Catalog
- Unites all information assets into a single metadata catalog
  - Based on relationships between assets
  - Covers data asset management, code asset management, data management, data integration and transformation
- Protects data and enables sharing via automated dynamic masking
- Built in:
  - Charts, visualization and stats
  - Watson Studio for AI, machine and deep learning tools
- Only contains the metadata on how to access the data assets
- Browse assets gives you catalog suggestions

Data Refinery
- Simplifies the task of refining data and its workflows
- Available with Watson Studio, public/private cloud and desktop
- Essentially a toolbox which helps you clean up your data and explore it a little; can also schedule the sorting/analysis to run on a regular schedule

SPSS Modeler
- Modeler Flows
  - Data management, preparation, visualization and model building
  - Built using a drag & drop editor
  - Consists of nodes, with flows according to connection direction
  - Nuggets are used to see info about models and the results of the flow; can connect nuggets to new data sources and view predictions
- SPSS Modeler is used to build predictive models without programming
  - GUI with many different types of nodes, similar to Modeler Flows

SPSS Statistics
- Statistical and machine learning app
- GUI, but more spreadsheet-like
- File tab: load data, save things, import/export to other formats
- Data tab: data operations, validation
- Transform tab: data transformation, e.g. filling in missing values
- Analyse menu
  - Stats and machine learning analysis: GLM, decision trees, etc.
- SPSS analyses can be saved in the form of Syntax for later use

Model Deployment with Watson Machine Learning
- Deployment of models can be done via different environments and workflows
- Open standards for model deployment:
  - PMML: includes models, combinations of models, and functions; based on XML
  - PFA: JSON-based
  - ONNX: from Microsoft and Facebook; initially created for neural networks, but now supports traditional machine learning
- IBM Watson Machine Learning can do the following:
  - Deploy models from Watson Studio, SPSS Modeler or open source packages
  - Support for PMML and ONNX
  - Batch or online scoring
  - Can integrate code from multiple different languages
- AutoAI
  - Automates certain aspects of repetitive work: data preparation, model development, feature engineering, hyper-parameter optimization
  - After it automatically generates pipelines, you can check how well they did
  - Only for classification and regression tasks for now
- OpenScale
  - Checks models for fairness and bias
    - Computes a fairness value for certain groups
    - Usually bias comes from the underlying data; OpenScale may be able to show where in the data the bias comes from
  - Audits and explains model decisions/predictions
  - Monitors model performance and checks where it can be improved
    - Drift is when the production data and training data start to differ and the model becomes less accurate; OpenScale checks which things cause drift
  - Business benefit monitoring

edX Data Science Methodology

CRISP-DM
- Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Aims to increase the use of data mining over a wide variety of applications and industries
- Does this by taking specific and general behaviours and making them domain neutral
- 6 parts to it:
  - Business Understanding: looking at the reason for the project; what are the goals for the different stakeholders?
    - Align the methodology with CRISP-DM
  - Data Understanding: what data do we need to collect to figure out the problem? What sources, what methods?
  - Data Preparation: transform the data into usable sets and clean it up
  - Data Modelling
  - Evaluation
  - Deployment: use the model on new data to gain new insights; might initiate re-evaluation of the business need or questions asked, the model used, or the data applicable

Module 1: From Problem to Approach

Business Understanding
- Having a clearly defined question/goal is vital for data science
  - Ex.: how can we reduce the costs of performing XY? Do we want it to run more efficiently or more profitably?
- Objectives need to be clarified to reach a goal
- Case study: limited healthcare budget while ensuring quality care
  - Prioritized "readmissions" as an effective review area; top of the readmissions list were patients with congestive heart failure
  - A decision tree model was applicable here
  - Set business requirements (4 here) that would need to be met by the model being built:
    - Readmission likelihood for congestive heart failure patients
    - Readmission outcome for them
    - What events led to the predicted outcome
    - Making a process to assess new patients' readmission risk

Analytic Approach
- What type of pattern will be needed to address the question?
  - Patterns? Relationships? Counts? Yes/No? Behaviour?
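To make the decision-tree approach from the case study concrete: a tree is just a sequence of learned yes/no conditions, which is why you can read off exactly which conditions classify a patient as high risk. A hand-written toy version — the feature names and thresholds below are invented for illustration, not learned from data and not from the actual case study:

```python
# Toy hand-coded "decision tree" for readmission risk.
# A trained tree would learn these splits from data; here they are
# hard-coded purely to show how the conditions classify a patient.
def readmission_risk(age, prior_admissions, on_diuretics):
    if prior_admissions >= 2:
        # First split: frequent prior admissions dominate the prediction
        return "high"
    if age >= 65 and on_diuretics:
        # Second split: elderly patients on diuretics
        return "high"
    return "low"

patients = [
    {"age": 70, "prior_admissions": 0, "on_diuretics": True},
    {"age": 50, "prior_admissions": 3, "on_diuretics": False},
    {"age": 40, "prior_admissions": 0, "on_diuretics": False},
]
for p in patients:
    print(p, "->", readmission_risk(**p))
```

The readability of these nested conditions is exactly the "easy to understand and apply" property the case study cites.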
Case Study:
- Decision tree approach: categorical outcome plus likelihood
  - Easy to understand and apply
  - Can see which conditions are classifying the patient as high risk
  - Multiple models can be built for different stages

Module 2: From Requirements to Collection
- Determine the data requirements for the approach you need/want to take
  - Contents, format and sources
- Case Study:
  - Defined their target group for data collection
  - One record per patient, with columns representing different variables
    - A patient could have 1000s of records, so the data scientists needed to process the data such that it fits into one record
  - Data collection can draw on many sources: records, diagnoses, demographic info, drug info, etc.
  - Data collection for unavailable data can be postponed; you can start building a model to get intermediate results with the data already present
- Data Collection
  - Gaps in the data have to be identified → fill them in or replace them

Module 3: From Understanding to Preparation

Data Understanding
- Is the data we collected representative of the problem?
- Case Study:
  - Descriptive statistics on the data
    - Correlations: highly correlated variables are redundant
    - Mean, SD, etc.
    - Histograms for understanding distributions
  - Quality assessment, e.g. missing values
    - Does a missing value mean "no", "0", or "don't know"?
    - Invalid or misleading values?

Data Preparation
- Usually takes 70-90% of project time
- Want to make data easier to work with and remove unwanted elements
  - Address missing/invalid values, and remove duplicates
- Feature engineering: create features that make the algorithm work
- Case Study:
  - How to define congestive heart failure? Needed re-admission criteria for the same condition
  - First vs. re-admission: within 30 days following initial admission
  - Transactional records were aggregated
    - Diagnostic group codes were needed to be able to filter them
    - 100s-1000s of records depending on clinical history, boiled down to one record
  - New records had many more columns about all types of variables (treatment, drugs, comorbidities, etc.)
  - A literature review added a few more indicators/procedures that weren't considered before
  - Age, gender, insurance etc. were also added to the record
  - All record variables were used in the model, and the dependent variable was "congestive heart failure"

Module 4: From Modelling to Evaluation

Modelling
- Stage where the data scientist can sample and see if the model works
- Key questions:
  - What's the purpose of data modelling?
  - What are its characteristics?
- Descriptive or predictive modelling
  - Descriptive: "if they do A, they're likely to like B"
  - Predictive: more yes/no, stop/go type answers
- Case Study:
  - After initial training on the training set, overall model accuracy was 85%, but "yes" accuracy was only 45% → no good
    - Adjust the relative costs of misclassified yes/no outcomes; the default cost ratio is 1:1, but we can change that
  - Second model: cost ratio 9:1 (yes:no) → high yes accuracy but low no accuracy, 49% overall accuracy → also not good
  - Third model: 4:1 ratio → 68% sensitivity (yes accuracy) and 85% specificity (no accuracy) → best balance we can attain with a small training set
  - We can iterate back to the data preparation stage to redefine some variables and check the model again

Evaluation
- Assess model quality and check that the model actually answers the question being asked
- Diagnostic measures: does the model work as intended?
- Statistical significance testing: is the data being handled and interpreted properly by the model?
- Case Study:
  - Check true vs. false positive rates to see which model is optimal
  - Maximum separation of the ROC (receiver operating characteristic) curve from the baseline, i.e. how many more true than false positives do we get?
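The sensitivity/specificity figures in the case study come straight from the confusion matrix, and the computation is worth seeing once. A minimal sketch — the patient counts below are invented, chosen only so the rates match the 68%/85% from the notes:

```python
# Confusion-matrix counts for a hypothetical yes/no readmission model
tp, fn = 68, 32   # actual "yes" patients: correctly vs. incorrectly classified
tn, fp = 85, 15   # actual "no" patients: correctly vs. incorrectly classified

sensitivity = tp / (tp + fn)   # true positive rate: the "yes" accuracy
specificity = tn / (tn + fp)   # true negative rate: the "no" accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```

Raising the misclassification cost of "yes" errors (as in the 9:1 and 4:1 models) pushes the model toward fewer false negatives, trading specificity for sensitivity.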
Module 5: From Deployment to Feedback

Deployment
- Once the model is built and tested, it has to be deployed and used by the different stakeholders
- Case study:
  - Translated results so that clinical staff could understand them and look for suitable interventions
  - The intervention director wanted automated, near real-time risk assessment
    - Ideally easy to use and browser based, updated while the patient was in the hospital
  - Part of deployment was training on how to use the model
  - Tracking of patients and updating needs to be developed with IT, database admins and clinical staff → model refinement

Feedback
- The value of the model depends on how well feedback can be addressed and incorporated
- Case study:
  - A review process is defined and put in place first; clinical management executives have overall responsibility for the review process
  - Patients receiving the intervention are tracked and outcomes recorded
  - Intervention success is tracked and compared to pre-intervention rates
  - Collection of data for one year, then refinement; might also be worth adding missing data previously not collected (i.e. pharmaceutical), and refining the interventions as well

SQL for Data Science

Module 1: Introduction

Databases and Basic SQL
- SQL is a language for querying data from relational databases
- Databases are a repository for data and provide functionality to add, modify or query it
- A relational database is like a table of related things; can form relationships between tables
- DBMS (database management system) is the software used for it
  - MySQL, Oracle Database, DB2, DB2 Express-C
- Basic SQL commands:
  - Create a table for data
  - Insert data
  - Select data
  - Update data
  - Delete data

Creating a Database Instance on Cloud
- Cloud databases are good because:
  - Easy to use and access from anywhere
  - Scalability
  - Disaster recovery (backups)
  - Ex.: IBM Db2 on Cloud, MS Azure database, amazon something
- A database service instance is a logical abstraction for managing workflows
  - Handles all application requests to work with the data in any database managed by that instance
  - Target of connection requests from applications
- IBM Db2 on Cloud: IBM Cloud → Db2 service → region to deploy to → create → Service credentials
  - Host name: unique computer label
  - Port number: DB port
  - Database name
  - User ID: personal user ID
  - Password

Module 2: Basic SQL
- Two categories of how to work with data in a table:
  - Data Definition Language (DDL): define, change or drop data
  - Data Manipulation Language (DML): read and modify data
- CREATE TABLE
  - The primary key uniquely identifies the "row" or tuple; can be assigned to any entity
  - Can add constraints such as NOT NULL
  - Each column is usually an attribute/variable for an entity
  - General syntax: CREATE TABLE tablename (column1 datatype, column2 datatype, …); can add unique ID or NOT NULL constraints here too
  - Common to add a DROP statement before, so no duplication error is possible
- SELECT
  - Retrieves data from a table; a DML query
  - select * from TABLENAME, or select <column1>, <column2> from TABLENAME with the proper column names
  - WHERE clauses have to evaluate to True, False or Unknown (i.e. a predicate); Booleans can be used here
  - COUNT retrieves the # of rows that fit the criteria
  - DISTINCT is used to remove duplicates from a result set
  - LIMIT restricts the # of rows returned from the database
- INSERT
  - A DML statement
  - INSERT INTO TableName (ColumnName) VALUES (values)
  - Multiple rows can be inserted in one VALUES clause by comma-separating the sets of parentheses
- UPDATE
  - UPDATE TableName SET ColumnName = Value WHERE Condition
  - Without specifying WHERE, all rows of the column will be updated
- DELETE
  - DELETE FROM TableName WHERE Condition
  - Same caveat as UPDATE

Module 3: String Patterns, Ranges, Sorting and Grouping
- SELECT using string patterns and ranges
  - If we don't know the author but know the name starts with 'R': LIKE predicate, with % as placeholder ("wildcard character") → where firstname like 'R%'
  - Can use a BETWEEN statement to specify a range → where pages between XYZ and ZYX
  - Multiple selections can be done via IN → where country IN ('C1', 'C2', …)
- Sorting result sets
  - select x from y order by XYZ; default is ascending order
  - order by XYZ desc → descending order
  - order by can also take a column number (i.e. 2 to sort by the second column)
- Grouping result sets
  - distinct(XYZ) sorts for unique values → removes duplicates
  - Can use count(XYZ) to get a count of a variable and then group by XYZ: it will match each value with its count; can add "as <name>" after the count to give its column a name (otherwise the default name is just a column #)
  - Conditions on a GROUP BY clause are defined with HAVING, i.e. having count(XYZ) > 4

Module 4: Functions, Sub-queries, Multiple Tables

Built-in Functions
- Using these functions can decrease the amount of data that needs to be retrieved → speeds up data processing
- Aggregate or column functions take a collection (a column) and return a single value
  - SUM(), MIN(), MAX(), AVG()
  - E.g.
select SUM(Column) from TABLE
    - Can include "as XYZ" to return it with a name; can include WHERE
- Scalar and string functions: ROUND(), LENGTH(), UCASE, LCASE
  - Can use LCASE or UCASE if you don't know whether the string you're searching for is upper or lower case
- Date and time functions
  - Types: DATE (YYYYMMDD), TIME (HHMMSS), TIMESTAMP (year down to microseconds)
  - Date and time functions can extract anything from the year to microseconds
  - Can perform date or time arithmetic, e.g. SALEDATE + 3 DAYS
  - Special registers for CURRENT_DATE and CURRENT_TIME

Sub-queries and Nested Selects
- Regular queries, but placed inside parentheses and nested inside another query
- Have to use (select XYZ from TABLE) in parentheses as a sub-select, since aggregate functions can't be used directly in a WHERE clause
- Essentially: if you ever want to reuse any selection or function result from above, put it in parentheses and specify from which table it comes

Working with Multiple Tables
- Ways to access multiple tables:
  - Sub-queries: select X from Table1 where Y IN (select Z from Table2)
  - Implicit JOIN: select * from Table1, Table2; can add where Table1.something = Table2.something → joins the rows of the tables
  - JOIN operators
  - Can use shorthand aliases: select * from Table1 E, Table2 B

Module 5: Python and Databases

Accessing Databases with Python
- Very rich ecosystem and easy tools for data science → yay Python
  - Open source, ported to many platforms
- The Python database API makes writing Python code for accessing databases easy
  - Python connects to a database via API calls
  - Many databases are supported via SQL APIs
- Advantages of Jupyter:
  - >40 programming languages supported
  - Tons of display and output options

Writing Code using DB-APIs
- DB-API is a Python programming API, usable via Jupyter Notebooks; you have to load some database-specific libraries though
  - Used for most relational databases
  - Easy to understand and consistent
  - Code portable across databases
- Connection objects: connect to the DB and manage transactions
  - commit() commits any buffered operations to the database
  - rollback() discards any uncommitted buffered operations
  - close() closes the connection
- Cursor objects: database queries
  - cursor() creates a new cursor
  - A cursor is a control structure that allows traversal of the records in a DB
  - Used to run queries and fetch results

Connecting using the ibm_db API
- Python functions for IBM DB servers
- Need the following to connect after importing ibm_db:
  - Driver name, DB name, host DNS or IP, host port, connection protocol, user ID, user password

Creating tables
- In Db2 Warehouse using Python code: use ibm_db.exec_immediate()
  - Connection resource as 1st parameter, SQL statement (i.e. CREATE TABLE) as the next; use the same language as SQL

Loading data
- Also uses ibm_db.exec_immediate(), with the connection resource as first parameter again
- Then use the SQL INSERT INTO statement

Querying data
- ibm_db.exec_immediate() with connection and SQL statement (SELECT)
- NOTE: the SQL statement has to be passed as a string ("…")
- We can also load a whole database table into pandas

Combining Python and SQL commands
- We can load the Python "magic" extension that allows us to write SQL directly
  - If we want an entire cell to be run as SQL, simply add %%sql at the top
  - Otherwise, anything within a line after typing %sql will be run as SQL
- If we want to use a Python variable, simply put a ":" before it
- If there are mixed-case headers (i.e. Id, not ID or id), use double quotes
- When specifying something in single quotes within single quotes, use a backslash → not 'blabla = 'trala'' but 'blabla = \'trala\''
- LIMITing the # of rows displayed: use LIMIT and then a number
  - Tables may have 1000s of rows and we don't want to load them all

Getting Table and Column Details
- Db2 database: select * from syscat.tables
  - Too many tables → use this instead: select TABSCHEMA, TABNAME, CREATE_TIME from syscat.tables where tabschema = 'username'
- select * from syscat.columns where tabname = 'DOGS' → will show all column names for that table
- select distinct(name), coltype, length from sysibm.syscolumns where tbname = 'TABLE' → displays the name, datatype and length of each column

Analyzing Data with Python
- McDonald's menu analysis
  - Need to store the nutrition table in the Db2 Warehouse
  - Then we can load pandas to analyze the data
  - E.g. which food item has the most sodium?
    - First visualize, e.g. a swarm plot via seaborn
    - df['Sodium'].describe() → summary statistics for sodium
    - df['Sodium'].idxmax() → index of the max-sodium item
    - df.at[82, 'Item'] → returns the item name for index 82
  - Can look at protein vs. fat, visualized via a jointplot in seaborn
  - Can check for outliers via boxplots
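The SQL statements and the pandas calls from these notes can be tried end-to-end without a Db2 instance by using Python's built-in sqlite3 module instead (a sketch: the table name and nutrition values below are made up, and SQLite's SQL dialect differs slightly from Db2's):

```python
import sqlite3
import pandas as pd

# In-memory database instead of a cloud Db2 instance
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE and multi-row INSERT, as in the Basic SQL module
cur.execute("CREATE TABLE menu (item TEXT NOT NULL, sodium INTEGER)")
cur.execute("INSERT INTO menu VALUES ('Burger', 970), ('Salad', 430), ('Fries', 290)")
conn.commit()

# SELECT with an aggregate function, as in Module 4
top = cur.execute("SELECT item, MAX(sodium) FROM menu").fetchone()
print(top)

# Load the whole table into pandas, as in Module 5, then use idxmax/at
df = pd.read_sql_query("SELECT * FROM menu", conn)
idx = df["sodium"].idxmax()
print(df.at[idx, "item"])
conn.close()
```

The same connection/cursor/commit pattern carries over to ibm_db or any other DB-API driver; only the connect call and SQL dialect change.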