
IBM Data Science Certificate Course notes

edX Introduction to Data Science
Module 1: Defining Data Science:
- Data Science studies large quantities of data to gain insights and make strategic decisions based upon those insights
- Data Science involves some math, some science, and lots of curiosity to explore the data
- Data Scientists need to not only be able to analyse and interpret data, but also use the insights to tell a story that informs decisions
Module 2: What Data Scientists do:
- General steps are as such:
  - Identify and understand the problem that needs to be solved
  - Collect the data for analysis
  - Identify the right tools to use
  - Develop a data strategy
- There are tons of data that could be collected, and you need to figure out which of it is useful
- Basic tools often used:
  - Regressions
  - Neural networks
  - Machine learning
    - K-nearest neighbours, estimations etc.
- Cloud
  - Remote central storage and working environment, no need to run things on your own system
  - Can access data and programs/algorithms without needing to install them locally
Module 3: Big Data and Data Mining:
- 5 V's of Big Data
  - Velocity → how fast data is accumulated
  - Volume → how much data is and can be stored
  - Variety
  - Veracity → truth and completeness, conformity to facts/categories etc.
  - Value
- Hadoop
  - Essentially a distributed platform which segments data, sends the segments to different servers and computes the parts separately, then brings them back together
- Big data has different definitions depending on who you ask, but for almost all of them the name is the game: very large amounts of data that can't be processed (or only with great difficulty) using traditional methods
- New computational and statistical techniques have emerged that are better equipped to handle these giant data sets
- Data Mining
  - Establish goals for the mining
    - Identify key questions and costs vs. benefits
    - Establish expected accuracy and usefulness of the data mined
  - Select data; the quality of the output depends largely on the quality of the data used
    - You might not have the data you need handy, so you have to look for alternate sources, or plan data collection
  - Preprocess data to remove erroneous data, identify systematic errors, account for missing data etc.
  - Transform data such that it fits into the analysis you want to do
    - E.g. aggregating all sources of income into one variable ("income")
  - Store data in a way that gives you easy access, but is also secure and prevents mining of other unrelated datasets
  - Mine the data using a variety of algorithms (visualization is a good start)
  - Evaluate the mining results
    - Could involve testing the predictive capabilities of the data mined
    - Could be how well the mined data reproduce the observed data
    - This entire process is iterative, so you interpret the results to improve your pipeline, then review the new results etc.
Module 4: Deep Learning and Machine Learning
- AI vs. Data Science
  - Data science is the process of extracting information from large amounts of disparate data via different methods, and includes the entirety of that process
  - AI encompasses everything that allows a computer to learn and make intelligent decisions
- Deep Learning
  - Based on neural networks that need a lot of computing power
  - Will get better the more data you have to train it on
Module 5: Data Science and Business
- What skills are people looking for in a Data Scientist?
  - Curiosity, analytical thinking, problem solving
  - Good storytelling, communication, interpersonal skills
  - Analytical thinking is maybe more important than analytical skills, but basics in statistics, some algorithms, maybe machine learning etc. are important
edX Data Science Tools
Module 1: Data Scientist’s Toolkit
Languages
- Python
  - Most popular data science language
  - General purpose, e.g. Data Science, Machine Learning, IoT, web development etc.
  - Open source with many dedicated libraries for different things
    - Data Science: Pandas, SciPy, NumPy, Matplotlib
    - AI: TensorFlow, PyTorch, Keras, Scikit-learn
    - Natural Language: NLTK
- R
  - Easy to translate math to code, hence used a lot to make stats models, analyses and graphs
  - World's largest repository of statistical knowledge
  - Integrates well with other languages
- SQL
  - Limited to querying and managing data
  - Works with structured data
    - But interfaces with non-structured data repositories
  - Directly accesses the data, no need for copying
  - E.g. MySQL, IBM Db2, PostgreSQL, Apache, SQLite, Oracle, MariaDB, etc.
Open source tools
- Data management
  - Relational databases
    - SQL: MySQL, PostgreSQL
    - NoSQL: MongoDB, Apache CouchDB & Cassandra
    - File-based: Hadoop File System, Ceph
- Data Integration and Transformation (ETL)
  - Extract, transform, load
  - Apache Airflow, Kubeflow, Apache Kafka, Apache NiFi, Apache SparkSQL, Node-RED
- Data Visualization
  - Hue, Kibana, Apache Superset
- Model Building
- Model Deployment
  - PredictionIO
  - Seldon → supports nearly every framework
  - TensorFlow
- Model Monitoring and Assessment
  - ModelDB, Prometheus
  - AI Fairness 360 → assesses bias within models
  - Adversarial Robustness 360 → protects against adversarial attacks with bad data
  - AI Explainability 360 Toolkit tries to give examples of what happens in the model
- Code and Data Asset Management
  - Git for code asset management → GitHub and GitLab
  - Data needs metadata → Apache Atlas or Kylo for data asset management
- Development and Execution Environment
  - Jupyter → development environment
    - Supports many different programming languages via kernels
    - Jupyter Notebooks unify documentation, code, output and visuals in a single document
    - JupyterLab is the new generation meant to replace it
      - Can open different files including Jupyter Notebooks
  - Apache Zeppelin
    - Can do similar things, but can plot without having to code
  - RStudio
    - One of the oldest ones, but only runs R
    - Python development possible
    - Also has remote data access
  - Spyder
    - Tries to emulate RStudio for Python, but can't do as much
  - Apache Spark
    - Cluster execution environment → linearly scalable
    - Can process huge amounts of data file by file (batch)
    - Apache Flink is similar, but can do real-time data streams
  - KNIME → fully integrated visual tool
    - Can be extended with Python and R, as well as Apache Spark
Commercial options
- Database management: IBM Db2, Microsoft SQL Server, Oracle
- ETL tools
  - Informatica, IBM InfoSphere DataStage
  - GUI based
- Data Visualization
  - Visually attractive and live dashboards
  - Tableau, Microsoft Power BI, IBM Cognos Analytics
  - Options specifically for data scientists available
- Model Building
  - Best for data mining
    - SPSS Modeler
    - SAS Enterprise Miner
- Model Deployment
  - Tightly connected to model building in a commercial setting
  - However, open source deployment is also supported (e.g. SPSS Modeler supports this)
- Model Monitoring and Code Management are relatively new, hence open source is the way to go
- Data Asset Management
  - Data has to be versioned and annotated
  - Informatica and IBM InfoSphere provide tools specifically for that
- Fully integrated development environments
  - Watson Studio → web based or desktop version
  - H2O.ai
Cloud based tools
- Fully integrated visual tools and platforms
  - Multiple server machine clusters
  - IBM Watson Studio
  - Microsoft Azure Machine Learning
  - H2O Driverless AI (H2O.ai)
- SaaS (software as a service)
  - Cloud provider operates the product for you
  - Proprietary tooling available as a cloud exclusive
    - E.g. Amazon Web Services DynamoDB → NoSQL database
    - Cloudant → most things done by the cloud provider, but compatible with CouchDB
- ELT → extract, load, transform tools
  - Informatica Cloud Data Integration
  - IBM Data Refinery
    - Transforms raw data into a spreadsheet-like interface
    - Offers data exploration and visualization
- Model Building
  - Machine learning services available from nearly every cloud provider
    - E.g. Watson Machine Learning
Libraries
- Collection of functions and methods that allow you to perform actions without writing the code yourself
- Scientific Computing
  - Pandas for effective data cleaning, manipulation and analysis → based on matrices
    - Built on NumPy
  - NumPy → based on arrays
- Visualization
  - Matplotlib
  - Seaborn
    - Based on Matplotlib
    - For easy heatmaps, violin plots etc.
- High-Level Machine and Deep Learning
  - Scikit-learn → built on NumPy, SciPy & Matplotlib
  - Keras → uses GPUs
    - Made for experimentation
  - TensorFlow
    - Deep Learning
    - Low-level framework for large-scale production
  - PyTorch
    - Machine Learning
  - Apache Spark
    - Can use Python, R, Scala, or SQL
    - Complementary libraries
      - Vegas → stats data visualization
      - BigDL → deep learning
  - R packages
    - ggplot2, or Keras/TensorFlow interfaces
Application Programming Interfaces (API)
- Lets two pieces of software talk to each other
- Is only the interface for the software components
- E.g. the pandas library has an API that allows us to call all its functions, even though parts of it are not programmed in Python
- REST APIs are request-response APIs where one user/thing (the 'client') sends a request to a different thing (the 'resource') and in turn gets a response from it (see the sketch below)
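A minimal sketch of that request-response cycle, assuming the third-party requests library is installed; httpbin.org is just a public test endpoint standing in for a real resource:

```python
# Send a REST request and read the response, assuming `requests` is installed.
import requests

response = requests.get(
    "https://httpbin.org/get",   # public test endpoint standing in for a real resource
    params={"limit": 5},         # query parameters sent along with the request
    timeout=10,
)
print(response.status_code)      # e.g. 200 on success
print(response.json())           # response body parsed from JSON
```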
Data Sets
- Collection of data in structures:
  - Tabular
  - Hierarchical/network → represent relationships
  - Raw files → images, audio etc.
- Private or open data
  - Open data from: open data portal lists, government/intergovernmental and organization websites, Kaggle, Google search
- Community Data License Agreement (CDLA)
  - CDLA-Sharing → use and modify, publication under the same terms
  - CDLA-Permissive → use and modify, no obligations
Machine Learning Models
- Supervised learning → provide inputs and correct outputs; the model learns the relationship between input and output
  - Regression → predict numerical values
  - Classification → predict a class/category
- Unsupervised learning → data not labelled by a human; the model identifies patterns within the data
  - Clustering → e.g. recommendations for online shopping
  - Anomaly detection → e.g. fraud detection
- Reinforcement learning
- Deep learning
  - Tries to loosely emulate how the human brain computes
  - Very large data sets are needed to train it, but pre-trained models can be downloaded
  - Built using frameworks → can come with prebuilt models (model zoo)
- Model building (see the sketch below)
  - Prepare data (labelling etc.)
  - Build the actual model
  - Train the model
  - Analyse results and iterate until the model performs how you want it to
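A minimal sketch of that supervised-learning workflow (prepare data, build a model, train it, check results), assuming scikit-learn is installed; the bundled iris dataset stands in for real project data:

```python
# Prepare data, build, train and evaluate a simple supervised classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # prepare: features and correct labels
X_train, X_test, y_train, y_test = train_test_split(   # hold out data for evaluation
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3)            # build the model
model.fit(X_train, y_train)                            # train it on labelled examples

y_pred = model.predict(X_test)                         # analyse results, then iterate
print("accuracy:", accuracy_score(y_test, y_pred))
```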
Module 2: Open Source Tools
GitHub
- Repository (remote, but can be cloned locally via SSH or HTTPS) is the location where you put your stuff; the remote lives on GitHub/GitLab
- Mostly you add things to your local repository
  - Do this via terminal commands: cd into your local git folder, add the files you want to the staging area (git add XYZ), and then add them to the local repository (git commit)
  - Add from the local to the remote repository (git push)
  - Can also push into the remote repository directly online
  - Can control which files go into a commit via the staging area
  - Have to (git pull) remote changes into the local repository
- Branches are different "subparts" of the repository
  - Master is the main branch; can create a new branch via (git branch XYZ)
  - Can merge branches via (git merge) → merges the named branch into the branch you've switched into
  - Change branches via (git checkout XYZ)
- Forking a repository gives you access to the foreign repository as a copy under your own account
  - Can post a pull request in order to propose the changes committed in your copy back to the foreign repository
Jupyter
- Kernels are the thing that offers R, Scala etc. compatibility
  - A wrapper for the interpreter
  - Run locally (either on your laptop or whatever server you're using)
- The Seaborn library uses matplotlib but makes it prettier
- Architecture
  - User → notebook server via the browser
Module 3: IBM Tools for Data Science:
IBM Watson Studio
- Collaborative data science and machine learning platform
- Offers Jupyter Notebooks, Anaconda and RStudio
- Embedded machine learning models → neural networks, deep learning etc.
- Projects are the centres of collaboration
  - Can add descriptions, readmes etc. in Overview
  - Assets contains notebooks, data, models etc.
    - Connections allow you to access data from outside sources
  - Environment contains notebook and library settings
  - Settings allows you to add specific services
Jupyter Notebooks in Watson Studio
- Add to project: Notebook
- Best practice is to add a markdown cell with a description
- The Environment tab allows you to change the programming environment, restart it, or create a read-only notebook version. You can also specify a job to run at a specific date and time
  - If you want to choose a different environment, stop the kernel, change the environment, select the new environment to be associated with the notebook, and open it in edit mode
  - Will need to reload the data into the new environment
GitHub and Watson
- Profile settings → Integrations → GitHub personal access token → copy and paste it
- Settings tab → GitHub integration at the project level
- Remove or replace credentials before publishing, as best practice
IBM Watson Knowledge Catalog
- Unites all information assets into a single metadata catalog
  - Based on relationships between assets
- Data asset management, code asset management, data management, data integration and transformation
- Protects data and enables sharing via automated dynamic masking
- Built in:
  - Charts, visualization and stats
  - Watson Studio → for AI, machine and deep learning tools
- Only contains the metadata → assets describe how to access the data
- Browse assets gives you catalog suggestions
Data Refinery
- Simplifies the task of refining data and its workflows
- Available with Watson Studio, public/private cloud and desktop
- Essentially just a toolbox which helps you clean up your data and explore it a little; can also schedule the sorting/analysis to run on a regular basis
SPSS Modeler
- Modeler Flows
  - Data management, preparation, visualization, and model building
  - Built using a drag & drop editor
  - Consists of nodes, with flows according to connection direction
    - Nuggets are used to see info about models and the results of the flow
    - Can connect nuggets to new data sources and view predictions
- SPSS Modeler is used to build predictive models without programming
  - GUI with many different types of nodes, similar to Modeler Flows
SPSS Stats
- Statistical and machine learning app
- GUI, but more spreadsheet-like
- File tab
  - Load data, save things, import/export to other formats
- Data tab
  - Data operations, validation
- Transform tab
  - Data transformation, pretty much: filling in missing values etc.
- Analyse menu
  - Stats and machine learning analysis stuff
  - GLM, decision trees, etc.
- SPSS analyses can be saved in the form of Syntax for later use
Model Deployment with Watson Machine Learning
- Deployment of models can be done via different environments and workflows
- Open standards for model deployment
  - PMML
    - Includes models, combinations of models and functions
    - Based on XML
  - PFA
    - JSON based
  - ONNX
    - Created by Microsoft and Facebook
    - Initially created for neural networks, but now supports traditional machine learning
- IBM Watson Machine Learning can do the following
  - Deploy Watson Studio, SPSS Modeler or open source packages
  - Support for PMML and ONNX
  - Batch or online scoring
  - Can integrate code from multiple different languages
- AutoAI
  - Automates certain aspects of repetitive work
    - Data preparation
    - Model development
    - Feature engineering
    - Hyper-parameter optimization
  - After it automatically generates pipelines, you can check how well they did
  - Only for classification and regression tasks for now
- OpenScale
  - Checks the model for fairness and bias
    - Checks a fairness value for certain groups
    - Usually bias comes from the underlying data; OpenScale may be able to show where in the data the bias comes from
  - Audits and explains model decisions/predictions
  - Monitors model performance and checks where it can be improved
    - Drift is when the production data and training data start to differ and the model becomes less accurate
    - Checks which things cause drift
  - Business benefit monitoring
edX Data Science Methodology
CRISP-DM
- Cross-Industry Standard Process for Data Mining (CRISP-DM)
- Aims to increase the use of data mining over a wide variety of applications and industries
  - Does this by taking specific and general behaviours and making them domain-neutral
- 6 parts to it
  - Business Understanding
    - Looking at the reason for the project: what are the goals for the different stakeholders?
    - Align the methodology and CRISP-DM
  - Data Understanding
    - What data do we need to collect to figure out the problem? What sources, what methods?
  - Data Preparation
    - Transform the data into usable sets and clean it up
  - Data Modelling
  - Evaluation
  - Deployment
    - Use the model on new data to gain new insights
    - Might initiate re-evaluation of the business need or questions asked, the model used, or the data applicable
Module 1: From Problem to Approach
Business Understanding
- Having a clearly defined question/goal is vital for data science
  - Ex. How can we reduce the costs of performing XY? → do we want it to run more efficiently or more profitably?
- Objectives need to be clarified to reach a goal
- Case study: limited healthcare budget while ensuring quality care
  - Prioritized "readmissions" as an effective review area
    - Top of the readmissions list were patients with congestive heart failure
  - A decision tree model was applicable here
  - Set business requirements (4 here) that would need to be met by the model being built
    - Readmission likelihood for congestive heart failure patients
    - Readmission outcome for them
    - What events led to the predicted outcome
    - Making a process to assess new patients' readmission risk
Analytic Approach
- What type of pattern will be needed to address the question?
  - Patterns? Relationships? Counts? Yes/No? Behaviour?
- Case Study:
  - Decision tree approach → categorical outcome and likelihood
    - Easy to understand and apply
    - Can see which conditions are classifying the patient as high risk
    - Multiple models can be built for different stages
Module 2: From Requirements to Collection
- Determine the data requirements for the approach you need/want to take
  - Contents, format and sources
- Case Study:
  - Defined their target group for data collection
  - One record per patient, with columns representing different variables
    - A patient could have thousands of records, so the data scientists needed to process the data such that it fits into one record
  - Data collection can be done via many sources: records, diagnoses, demographic info, drug info etc.
    - Data collection for unavailable data can be postponed, and you can start building a model to get intermediate results with the data already present
- Data Collection
  - Gaps in the data have to be identified → fill them in or replace them
Module 3: From Understanding to Preparation
Data Understanding
- Is the data we collected representative of the problem?
- Case Study
  - Descriptive statistics on our data (see the sketch below)
    - Correlations → highly correlated variables are redundant
    - Mean, SD, etc.
    - Histograms for understanding distributions
  - Quality assessment → e.g. missing values
    - Does a missing value mean "no", "0", or "don't know"?
    - Invalid or misleading values?
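A minimal sketch of those descriptive-statistics checks, assuming pandas (>=1.5) and matplotlib are installed; "patients.csv" and its columns are hypothetical stand-ins for the case-study data:

```python
# Basic data-understanding checks: summary stats, correlations, missing values, histograms.
import pandas as pd

df = pd.read_csv("patients.csv")      # hypothetical per-patient data set

print(df.describe())                  # mean, standard deviation, quartiles per column
print(df.corr(numeric_only=True))     # correlation matrix: highly correlated columns are redundant
print(df.isna().sum())                # quality check: count of missing values per column

df.hist(figsize=(10, 6))              # histograms to understand the distributions
```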
Data Preparation
- Usually takes 70-90% of the project time
- Want to make the data easier to work with and remove unwanted elements
  - Address missing/invalid values, and remove duplicates
- Feature engineering
  - Create features that make the algorithm work
- Case Study:
  - How to define congestive heart failure?
    - Diagnostic group codes were needed to be able to filter for it
  - Needed re-admission criteria for the same condition
    - First vs. re-admission → 30 days following the initial admission
  - Transactional records were aggregated (see the sketch below)
    - Hundreds to thousands of records per patient, depending on clinical history, boiled down to one record
  - New records had many more columns about all types of variables (treatment, drugs, comorbidities, etc.)
    - A literature review added a few more indicators/procedures that weren't considered before
    - Age, gender, insurance etc. were also added to the record
  - All record variables were used in the model and the dependent variable was "congestive heart failure"
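A minimal sketch of boiling many transactional records down to one record per patient, assuming pandas is installed; the file name and column names are hypothetical:

```python
# Aggregate transactional records into one row per patient.
import pandas as pd

tx = pd.read_csv("transactions.csv")          # hypothetical transactional records

per_patient = tx.groupby("patient_id").agg(   # one output row per patient
    n_admissions=("admission_id", "nunique"),
    total_drug_cost=("drug_cost", "sum"),
    last_discharge=("discharge_date", "max"),
).reset_index()

print(per_patient.head())
```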
Module 4: From Modelling to Evaluation
Modelling
- Stage where the data scientist can sample the data and see if the model works
- Key questions
  - What is the purpose of data modelling?
  - What are its characteristics?
- Descriptive or predictive modelling
  - Descriptive → if you do A, you are likely to like B
  - Predictive → more yes/no, stop/go type answers
- Case Study
  - After initial training, overall model accuracy was 85%, but the "yes" (readmission) accuracy was only 45% → no good
    - Adjust the relative costs of misclassified yes/no outcomes
    - The default misclassification cost ratio is 1:1, but we can change that to penalize false negatives more (see the sketch below)
  - In the second model the cost ratio was 9:1 (yes:no) → high yes accuracy, but low no accuracy → 49% overall accuracy → also not good
  - Third model, 4:1 ratio → 68% sensitivity (yes accuracy) and 85% specificity (no accuracy) → best balance we can attain with a small training set
  - We can iterate back to the data preparation stage to redefine some variables and check the model again
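A minimal sketch of weighting misclassification costs in a decision tree, assuming scikit-learn; class_weight stands in for the course's cost-ratio adjustment, and a synthetic imbalanced data set stands in for the real patient data:

```python
# Cost-sensitive decision tree: penalize missed "yes" (readmission) cases more.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# class_weight ~ 4:1 means a missed "yes" costs 4x a missed "no"
model = DecisionTreeClassifier(max_depth=5, class_weight={1: 4, 0: 1}, random_state=0)
model.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))   # "yes" accuracy
print("specificity:", tn / (tn + fp))   # "no" accuracy
```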
Evaluation
- Assess model quality and check if the model actually answers the question being asked
- Diagnostic measures → does the model work as intended?
- Statistical significance testing of the model → is the data being handled and interpreted properly by the model?
- Case Study
  - Check the true vs. false positive rate → which model is optimal?
  - Maximum separation of the ROC (receiver operating characteristic) curve from the baseline (see the sketch below)
    - i.e. how many more true positives than false positives do we get?
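A minimal sketch of comparing models by their ROC separation from the baseline, assuming scikit-learn; the data and the three weighted models are synthetic stand-ins for the case study's 1:1, 9:1 and 4:1 cost ratios:

```python
# Compare ROC/AUC for models trained with different "yes" misclassification weights.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for yes_weight in (1, 9, 4):
    model = DecisionTreeClassifier(max_depth=5, class_weight={1: yes_weight, 0: 1},
                                   random_state=0)
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]                   # predicted "yes" probabilities
    print(yes_weight, "-> AUC:", roc_auc_score(y_test, scores))  # 0.5 would be the baseline
```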
Module 5: From Deployment to Feedback
Deployment
- Once the model is built and tested, it's gotta be deployed and used by the different stakeholders
- Case study
  - Translated the results so that clinical staff could understand them and look for suitable interventions
  - The intervention director wanted automated and near real-time risk assessment
  - Ideally easy to use and browser based, updated while the patient was in the hospital
  - Part of deployment was training on how to use the model
  - Tracking of patients and updating needs to be developed with IT, database admins and clinical staff → model refinement
Feedback
- The value of the model depends on how well feedback can be addressed and incorporated
- Case study
  - A review process is defined and put in place first → clinical management executives have overall responsibility for the review process
  - Patients receiving the intervention are tracked and their outcomes recorded
  - Intervention success would be tracked and compared to pre-intervention rates
  - Collect data for one year, then refine; might also be worth adding missing data previously not collected (i.e. pharmaceutical records), and refining the interventions as well
SQL for Data Science
Module 1: Introduction
Databases and Basic SQL
- SQL is a language for querying data from relational databases
- Databases are a repository for data and provide functionality to add, modify or query it
  - Relational databases are like tables of related things → can form relationships between tables
- DBMS → a database management system is the software used for this
  - MySQL, Oracle Database, DB2, DB2 Express-C
- Basic SQL commands
  - Create a table for data
  - Insert data
  - Select data
  - Update data
  - Delete data
Creating a Database instance on Cloud
- Cloud databases are good because
  - Easy to use and access from anywhere
  - Scalability
  - Disaster recovery → backups
  - Ex.: IBM Db2 on Cloud, MS Azure database, Amazon something
- A database service instance is a logical abstraction for managing workflows
  - Handles all application requests to work with the data in any database managed by that instance
  - Target of connection requests from applications
- IBM Db2 on Cloud
  - IBM Cloud → Db2 service → region to deploy to → create
  - Service credentials
    - Host name → unique computer label
    - Port number → DB port
    - Database name
    - User ID → personal user ID
    - Password
Module 2: Basic SQL
- Two categories of how to work with data in a table
  - Data Definition Language (DDL) → define, change or drop data
  - Data Manipulation Language (DML) → read and modify data
CREATE TABLE
- The primary key uniquely identifies each "row" or tuple
  - Can be assigned to any entity
- Can add conditions such as NOT NULL
- Each column is usually an attribute/variable of the entity
- General syntax (see the sketch below)
  - create table TABLENAME ( COLUMN1 datatype, COLUMN2 datatype, ... );
  - Can add unique ID or NOT NULL statements here too
  - Common to add a DROP statement beforehand so no duplication error is possible
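A minimal sketch of DROP/CREATE TABLE, using Python's built-in sqlite3 module as a stand-in for the course's Db2 database; the table and column names are made up:

```python
# Drop a table if it exists, then create it fresh.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS BOOKS")   # avoid duplication errors on re-runs
cur.execute("""
    CREATE TABLE BOOKS (
        ID      INTEGER PRIMARY KEY,        -- uniquely identifies each row
        TITLE   VARCHAR(100) NOT NULL,
        PAGES   INTEGER
    )
""")
conn.commit()
```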
SELECT
- Retrieves data from a table → DML query
- select * from TABLE
  - select <column1>, <column2> from TABLE → returns only the named columns
- The WHERE clause has to evaluate to True, False or Unknown (i.e. a predicate)
  - Booleans can be used here
- COUNT → retrieves the # of rows that fit the criteria
- DISTINCT → used to remove duplicates from a result set
- LIMIT → restricts the # of rows returned (see the sketch below)
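A minimal sketch of SELECT with WHERE, COUNT, DISTINCT and LIMIT, continuing the sqlite3 stand-in; the BOOKS table and its rows are hypothetical:

```python
# SELECT variants: filtering, counting, de-duplicating and limiting rows.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE BOOKS (ID INTEGER PRIMARY KEY, TITLE TEXT, PAGES INTEGER)")
cur.executemany("INSERT INTO BOOKS (TITLE, PAGES) VALUES (?, ?)",
                [("SQL Basics", 120), ("Data Science", 300), ("SQL Basics", 120)])

print(cur.execute("SELECT * FROM BOOKS WHERE PAGES > 100").fetchall())
print(cur.execute("SELECT COUNT(*) FROM BOOKS").fetchone())          # number of matching rows
print(cur.execute("SELECT DISTINCT TITLE FROM BOOKS").fetchall())    # duplicates removed
print(cur.execute("SELECT * FROM BOOKS LIMIT 2").fetchall())         # restrict row count
```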
INSERT
- DML statement
- INSERT INTO TableName (ColumnNames) VALUES (Values)
  - Multiple rows can be inserted with a single values clause by comma-separating the parenthesized value tuples
UPDATE
- UPDATE TableName SET ColumnName = Value WHERE Condition
  - Without specifying WHERE, all rows of the column will be updated
DELETE
- DELETE FROM TableName WHERE Condition
  - Same caveat as with UPDATE (see the sketch below)
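A minimal sketch of INSERT, UPDATE and DELETE, again on the sqlite3 stand-in; the INSTRUCTOR table and its rows are hypothetical:

```python
# Insert multiple rows, then update and delete with WHERE conditions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE INSTRUCTOR (ID INTEGER PRIMARY KEY, NAME TEXT, CITY TEXT)")

# Multiple rows in one statement via comma-separated value tuples
cur.execute("INSERT INTO INSTRUCTOR (ID, NAME, CITY) "
            "VALUES (1, 'Rav', 'Toronto'), (2, 'Raul', 'Madrid')")

cur.execute("UPDATE INSTRUCTOR SET CITY = 'Markham' WHERE ID = 1")   # WHERE limits the update
cur.execute("DELETE FROM INSTRUCTOR WHERE ID = 2")                   # without WHERE, all rows go

print(cur.execute("SELECT * FROM INSTRUCTOR").fetchall())
```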
Module 3: String Patterns, Ranges, Sorting and Grouping
- SELECT using string patterns and ranges
  - If we don't know the author but know their name starts with 'R'
    - LIKE predicate → % is the placeholder ("wildcard") character
    - where firstname like 'R%'
  - Can use a BETWEEN statement to specify a range
    - where pages between XYZ and ZYX
  - Multiple selections can be done via IN
    - where country in ('C1', 'C2', ...)
SORTING Result Sets
- select X from Y order by XYZ
  - Default is ascending order
  - order by XYZ desc → descending order
  - order by can also take a column number instead of a name (i.e. 2 for the second selected column)
GROUPING Result Sets
- distinct(XYZ) sorts for unique values → removes duplicates
- Can use count(XYZ) to get a count of a variable and
  - then use group by XYZ → here it will match each value of the variable with its count
  - can add "as NAME" after the count expression to also give its column a name (otherwise the default name is just the column #)
- Conditions on a group by clause are defined with the having clause (see the sketch below)
  - i.e. having count(XYZ) > 4
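A minimal sketch of string patterns, ranges, sorting and grouping, using the sqlite3 stand-in; the AUTHORS table and its contents are hypothetical:

```python
# LIKE, BETWEEN, IN, ORDER BY and GROUP BY ... HAVING in one small example.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE AUTHORS (FIRSTNAME TEXT, COUNTRY TEXT, PAGES INTEGER)")
cur.executemany("INSERT INTO AUTHORS VALUES (?, ?, ?)",
                [("Raul", "ES", 150), ("Rav", "CA", 300), ("Hima", "IN", 150),
                 ("Rita", "ES", 220)])

print(cur.execute("SELECT * FROM AUTHORS WHERE FIRSTNAME LIKE 'R%'").fetchall())
print(cur.execute("SELECT * FROM AUTHORS WHERE PAGES BETWEEN 100 AND 200").fetchall())
print(cur.execute("SELECT * FROM AUTHORS WHERE COUNTRY IN ('ES', 'IN')").fetchall())
print(cur.execute("SELECT * FROM AUTHORS ORDER BY PAGES DESC").fetchall())
print(cur.execute(
    "SELECT COUNTRY, COUNT(COUNTRY) AS N FROM AUTHORS "
    "GROUP BY COUNTRY HAVING COUNT(COUNTRY) > 1").fetchall())
```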
Module 4: Functions, Sub-queries, Multiple Tables
Built in Functions
- Using these functions can decrease the amount of data that needs to be retrieved → speeds up data processing
- Aggregate or column functions → take a collection of values in a column and return a single value
  - SUM(), MIN(), MAX(), AVG()
  - E.g. select SUM(Column) from TABLE
    - Can include "as XYZ" to return it with a name
    - Can include WHERE
- Scalar and string functions
  - ROUND(), LENGTH(), UCASE(), LCASE()
  - Can use lcase or ucase if you don't know whether the string you're searching for is upper or lower case
- Date and time functions
  - Date and time types
    - DATE: YYYYMMDD
    - TIME: HHMMSS
    - TIMESTAMP: year down to microseconds
  - Date and time functions can extract anything from the year to the microseconds
  - Can perform date or time arithmetic
    - E.g. SALEDATE + 3 DAYS
  - Special registers for CURRENT_DATE and CURRENT_TIME (see the sketch below)
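A minimal sketch of aggregate and scalar functions on the sqlite3 stand-in; the PETSALE table is hypothetical, and note that the "+ 3 DAYS" date arithmetic and UCASE/LCASE above are Db2 syntax (SQLite uses UPPER/LOWER instead):

```python
# Aggregate (SUM/MIN/MAX/AVG) and scalar (ROUND/LENGTH/UPPER) functions.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE PETSALE (ANIMAL TEXT, SALEPRICE REAL)")
cur.executemany("INSERT INTO PETSALE VALUES (?, ?)",
                [("Cat", 450.09), ("Dog", 666.66), ("Dog", 300.00)])

print(cur.execute("SELECT SUM(SALEPRICE) AS TOTAL FROM PETSALE").fetchone())
print(cur.execute("SELECT MIN(SALEPRICE), MAX(SALEPRICE), AVG(SALEPRICE) FROM PETSALE").fetchone())
print(cur.execute("SELECT ROUND(SALEPRICE), LENGTH(ANIMAL), UPPER(ANIMAL) FROM PETSALE").fetchall())
```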
Sub Queries and Nested Selects
- Regular queries placed inside parentheses and nested inside another query
- Have to use (select XYZ from TABLE) as a sub-select, since things like aggregate functions can't be used directly in a WHERE clause
- → essentially, if you ever want to reuse any selection or function from above inside another query, you have to put it in parentheses and specify which table it comes from
Working with multiple Tables
- Ways to access multiple tables
  - Sub-queries
    - select X from Table1 where Y in (select Z from Table2)
  - Implicit JOIN
    - select * from Table1, Table2
    - Can add a where Table1.something = Table2.something
      - Joins the rows of the tables
    - Can use a shorthand alias: "select * from Table1 E, Table2 B" (see the sketch below)
  - JOIN operators
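A minimal sketch of a sub-query and an implicit join with aliases, using the sqlite3 stand-in; EMPLOYEES and DEPARTMENTS are hypothetical tables:

```python
# Sub-query in a WHERE clause, plus an implicit join with table aliases.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE DEPARTMENTS (DEPT_ID INTEGER, DEP_NAME TEXT)")
cur.execute("CREATE TABLE EMPLOYEES (EMP_ID INTEGER, F_NAME TEXT, DEP_ID INTEGER, SALARY REAL)")
cur.executemany("INSERT INTO DEPARTMENTS VALUES (?, ?)", [(1, "Software"), (2, "Design")])
cur.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?)",
                [(10, "Ann", 1, 90000), (11, "Bob", 2, 60000), (12, "Cat", 1, 70000)])

# Sub-query: employees earning more than the average salary
print(cur.execute(
    "SELECT F_NAME FROM EMPLOYEES "
    "WHERE SALARY > (SELECT AVG(SALARY) FROM EMPLOYEES)").fetchall())

# Implicit join with aliases E and D
print(cur.execute(
    "SELECT E.F_NAME, D.DEP_NAME FROM EMPLOYEES E, DEPARTMENTS D "
    "WHERE E.DEP_ID = D.DEPT_ID").fetchall())
```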
Module 5: Python and Databases
Accessing Databases with Python
- Very rich ecosystem and easy tools for data science → yay Python
  - Open source → ported to many platforms
- The Python database API makes writing Python code for accessing databases easy
  - Python connects to a database via API calls
  - Many databases are supported via SQL APIs
Advantages of Jupyter
- >40 programming languages supported
- Tons of display and output options
Writing Code using DB APIs
- DB-API is Python's standard API for database access, usable e.g. from Jupyter Notebooks → you have to load some database-specific libraries though
  - Used for most relational databases
  - Easy to understand and consistent
  - Code is portable across databases
- Connection objects → connect to the DB and manage transactions
  - commit() → makes any pending changes permanent in the database
  - rollback() → discards any pending, uncommitted changes
  - close() → closes the connection
- Cursor objects → run database queries
  - cursor() → creates a new cursor
    - A cursor is a control structure that allows traversal of the records in a database
    - Used to run queries and fetch results (see the sketch below)
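A minimal sketch of the DB-API pattern (connection → cursor → query → commit/close), shown with the built-in sqlite3 driver; other DB-API drivers follow the same interface:

```python
# Connection and cursor objects in the standard DB-API style.
import sqlite3

conn = sqlite3.connect("example.db")        # connection object manages the transaction
cur = conn.cursor()                         # cursor object runs queries

cur.execute("CREATE TABLE IF NOT EXISTS NOTES (ID INTEGER PRIMARY KEY, TEXT TEXT)")
cur.execute("INSERT INTO NOTES (TEXT) VALUES (?)", ("hello DB-API",))
conn.commit()                               # make the pending changes permanent

for row in cur.execute("SELECT * FROM NOTES"):
    print(row)                              # traverse the result set via the cursor

conn.close()                                # always release the connection
```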
Connecting using ibm_db API
- Python functions for IBM database servers
- Need the following to connect after importing ibm_db
  - Driver name
  - DB name
  - Host DNS or IP
  - Host port
  - Connection protocol
  - User ID
  - User password
Creating tables
- In Db2 Warehouse using Python code → use ibm_db.exec_immediate()
  - Connection resource → 1st parameter
  - SQL statement (i.e. create table) → same language as plain SQL
Loading data
- Also uses ibm_db.exec_immediate() → connection resource is the first parameter again
- Then use the SQL insert statement (insert into)
Querying data
- ibm_db.exec_immediate() with the connection and an SQL statement (select)
- NOTE: the SQL statement has to be passed as a string ("...")
- We can also load a whole database table into pandas (see the sketch below)
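A minimal sketch of this ibm_db workflow, assuming the ibm_db and ibm_db_dbi packages are installed; all credentials in the DSN and the DOGS table are placeholders:

```python
# Connect to Db2, create/load/query a table, then read it into pandas.
import ibm_db
import ibm_db_dbi
import pandas as pd

dsn = (
    "DRIVER={IBM DB2 ODBC DRIVER};"
    "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;"
    "PROTOCOL=TCPIP;UID=<user>;PWD=<password>;"
)
conn = ibm_db.connect(dsn, "", "")

# Connection resource first, SQL statement string second
ibm_db.exec_immediate(conn, "CREATE TABLE DOGS (ID INTEGER PRIMARY KEY NOT NULL, NAME VARCHAR(20))")
ibm_db.exec_immediate(conn, "INSERT INTO DOGS VALUES (1, 'Rex')")
stmt = ibm_db.exec_immediate(conn, "SELECT * FROM DOGS")
print(ibm_db.fetch_assoc(stmt))                          # first result row as a dictionary

# Load the whole table into pandas via the DB-API wrapper
df = pd.read_sql("SELECT * FROM DOGS", ibm_db_dbi.Connection(conn))
print(df.head())
```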
Combining Python and SQL commands:
- We can load a Python "magic" → an extension that allows us to write SQL in the notebook
- If we want an entire cell to be run as SQL, simply add %%sql at the top
  - Otherwise, anything within a line after typing %sql will be run as SQL
- If we want to use a Python variable, simply put a ":" before it
- If there are mixed-case headers (i.e. Id, not ID or id), use double quotes
- When specifying something in single quotes within single quotes, use a backslash
  - 'blabla = 'trala'' → NO → 'blabla = \'trala\''
- LIMITing the # of rows displayed → use LIMIT and then a number (see the sketch below)
  - Tables may have 1000s of rows and we don't want to load them all
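A minimal notebook sketch of the SQL magic, assuming the ipython-sql extension is installed; each "Cell" comment marks a separate notebook cell, and sqlite is used as a stand-in connection instead of Db2:

```python
# Cell 1 - load the magic and connect (sqlite stand-in connection string)
%load_ext sql
%sql sqlite://

# Cell 2 - %%sql on the first line runs the whole cell as SQL;
# the mixed-case header "Name" needs double quotes
%%sql
CREATE TABLE DOGS (Id INTEGER, "Name" TEXT);
INSERT INTO DOGS VALUES (1, 'Rex');

# Cell 3 - %sql runs a single line as SQL; :min_id binds the Python variable
min_id = 1
%sql SELECT * FROM DOGS WHERE Id >= :min_id LIMIT 10
```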
Getting Table and Column Details
- DB2 database → select * from syscat.tables
  - Too many tables? Use this instead:
    - select TABSCHEMA, TABNAME, CREATE_TIME from syscat.tables where tabschema = 'username'
- select * from syscat.columns where tabname = 'DOGS'
  - Will show all column names for that table
- select distinct(name), coltype, length from sysibm.syscolumns where tbname = 'TABLE' → this will display the name, datatype and length of each column
Analyzing Data with Python
- → McDonald's menu analysis
- Need to store the nutrition table in the Db2 Warehouse
- Then we can load pandas to analyze the data
- E.g. which food item has the most sodium?
  - First use visualization → swarm plot via seaborn
  - df['Sodium'].describe() → summary statistics of sodium
  - df['Sodium'].idxmax() → index of the max-sodium item
  - df.at[82, 'Item'] → return the item name for index 82
- Can look at protein vs. fat → visualize via jointplot in seaborn
- Can check for outliers via boxplots (see the sketch below)
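A minimal sketch of that menu analysis, assuming pandas, seaborn and matplotlib are installed; "menu.csv" and its column names ('Category', 'Item', 'Sodium', 'Protein', 'Total Fat', 'Sugars') are hypothetical stand-ins for the nutrition table:

```python
# Summary stats, max-sodium lookup and quick seaborn plots on the menu data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("menu.csv")

print(df["Sodium"].describe())            # summary statistics for sodium
idx = df["Sodium"].idxmax()               # index of the highest-sodium item
print(df.at[idx, "Item"])                 # name of that item

sns.swarmplot(x="Category", y="Sodium", data=df)    # sodium distribution per category
plt.show()
sns.jointplot(x="Protein", y="Total Fat", data=df)  # protein vs. fat
plt.show()
sns.boxplot(x=df["Sugars"])                         # quick outlier check
plt.show()
```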