ISOM5310 Lab 1 – Customer Pregnancy Prediction by Azure ML
Overview
In this lab, we will register an account for Microsoft Azure Machine Learning (Azure ML) Studio and navigate its
functionality. We will learn how to create and run an experiment that allows us to train and evaluate a machine
learning model in Azure. Specifically, we will train two binary classification models (of your choice) to determine
whether a retail customer is pregnant or not.
What you need:
• A Windows or Mac OS computer
• A web browser and Internet connection
• The lab1_pregnancy.csv file, downloaded from Canvas and saved on your local computer
Task 1: Sign Up for an Account
1.1
Open a web browser to visit Azure Machine Learning Studio at https://studio.azureml.net/.
1.2
Click the Sign In button and log in with your HKUST student email. If you land successfully on
Microsoft Machine Learning Studio (classic), move on to Task 2. Otherwise, if you prefer to use a personal
email address, follow steps 1.3-1.6 to create a new account.
1.3
Click the Sign up here link on the right. In the pop-up window, click the Sign up here link under Not an ML
Studio (classic) user. Go to the Free Workspace ($0/month) in the middle, and click Sign In.
1.4
You may sign in with your HKUST email account, or you may use a phone number or a new email
address to create a new outlook.com email account. [Note: the details of account registration are skipped
here. Just follow the instructions shown on the website.]
1.5
After the new account has been created, go back to the Azure page at https://studio.azureml.net/ and
choose Sign in with the email account that you just created.
ISOM 5310
[Note: a notification pane will say that “Machine Learning Studio (classic) will retire on 31 August
2024”. You may read it and click Close. A prompt will then ask “Would you like a tour of
ML Studio?”. You may click Not now to go back to the main window.]
1.6
By default, you will see a pane for you to create a new item in the studio, which could be a Dataset,
Module, Project or Experiment. Click the white cross in the top right corner to go back to the list of
experiments in your studio account.
Task 2: Import the Data
There are several sample datasets included in the Azure Machine Learning Studio for you to build experiments or
you can import data from external sources. In this lab, we will import a customer dataset of a grocery store, which
includes customer features generated from their purchase records and associated accounts.
2.1
Inside the menu on the left, click the DATASETS tab. Next, click +NEW at the bottom to create a new
dataset. Inside the floating pane, choose FROM LOCAL FILE.
2.2
Choose the lab1_pregnancy.csv file that you downloaded from Canvas, and click the tick button
(i.e., OK). Wait a few seconds for the file to be uploaded.
Task 3: Build an Experiment
3.1
Click the EXPERIMENTS tab inside the menu on the left. Next, click +NEW at the bottom to create a new
experiment. Inside the floating pane, choose Blank Experiment.
Now you should be directed to the experiment canvas page, inside which modules of the experiment can
be added, and a selected machine learning model can be trained.
3.2
At the top of the canvas, change the name of the experiment to Lab1 – Customer Pregnancy Prediction.
3.3
On the left of the experiment canvas, there is a list of saved datasets and machine learning modules. Expand
Saved Datasets -> My Datasets to locate the lab1_pregnancy.csv dataset imported in task 2. Drag the
dataset to the experiment canvas.
3.4
To see what this dataset looks like, click the output port (i.e., the circle with the number ‘1’ inside) at the
bottom of the dataset, then select Visualize.
[Note: datasets and modules have input and output ports represented by small circles. To create a flow of
data in an experiment, you can connect an output port of one module to an input port of another. At any
time, you can click the output port of a module to see what the data looks like at that point in the data flow.]
In this dataset, each row represents a customer, and the variables associated with each customer appear
as columns. You will predict whether the customer is pregnant or not (last column, titled “PREGNANT”)
using the other columns.
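For readers who want to see the same row-and-column layout outside the studio, here is a minimal Python sketch. It assumes a tiny made-up CSV: only the PREGNANT column name comes from this lab, and the three feature column names and all values are invented for illustration.

```python
import csv
import io

# Made-up sample rows. Only the PREGNANT column name comes from the lab;
# the feature column names and values are hypothetical.
SAMPLE_CSV = """FOLIC_ACID,PRENATAL_VITAMINS,WINE,PREGNANT
1,1,0,1
0,0,1,0
0,1,0,1
"""


def load_rows(text, target="PREGNANT"):
    """Split each CSV row into a dict of features and an integer target label."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        labels.append(int(row.pop(target)))  # remove the target from the row
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


features, labels = load_rows(SAMPLE_CSV)
print(len(features), "customers; feature columns:", list(features[0]))
print("labels:", labels)
```

Each customer becomes one feature dictionary, and the PREGNANT value is kept apart as the label to predict.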
3.5
Go to the module list on the left, expand Data Transformation -> Sample and Split, then drag the Split
Data module onto the experiment canvas, below the dataset module. This module allows us to split
the available data into a training set and a testing set.
3.6
Use your mouse to connect the output port of lab1_pregnancy.csv to the input port of the Split Data module.
3.7
Click the Split Data module inside the canvas to select it. In the Properties pane on the right, set the
Fraction of rows in the first output dataset to 0.8 and check the Randomized split option. This means we
will use a random 80% of the rows to train the model and hold back the remaining 20% for testing.
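The idea behind a randomized 80/20 split can be sketched in a few lines of Python. This is an illustration only, not the Split Data module's actual implementation; the function name and the fixed seed are assumptions made here so that the example is repeatable.

```python
import random


def split_rows(rows, fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut it into a first and a second part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed: same split on every run
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 customer rows
train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "testing rows")
```

Every row lands in exactly one of the two parts, so the testing rows are never seen during training.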
3.8
Go to the module list on the left, expand the group Machine Learning -> Initialize Model -> Classification.
Select Two-Class Logistic Regression and drag it to the experiment canvas.
3.9
Go to the module list on the left, expand the group Machine Learning -> Train, and drag Train Model to
the experiment canvas.
3.10 Inside your canvas, connect the output port of the Two-Class Logistic Regression module to the left
input port of the Train Model module, and connect the training data output (left port) of the Split Data
module to the right input port of the Train Model module.
3.11 Notice there is a red warning sign inside the Train Model module. Select it and go to its Properties pane on
the right. Click Launch column selector. Choose Include and column names in the first dropdown boxes,
and then click into the rightmost text box. When you see the list of column names, choose PREGNANT.
Click the tick sign at the bottom to confirm the setting. [Note: this step specifies PREGNANT as the target
variable when training the classification model.]
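To give a feel for what training a logistic regression on a target column involves, here is a rough pure-Python sketch that fits the model with plain batch gradient descent. It is not how the Train Model module works internally; the toy data, learning rate, and number of epochs are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Estimated probability that row x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # prediction error for this row
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Toy data (hypothetical): two binary features and a 0/1 target per row.
X = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 0]

w, b = train_logistic(X, y)
print("P(positive | [1, 0]) =", round(predict_proba(w, b, [1, 0]), 3))
print("P(positive | [0, 1]) =", round(predict_proba(w, b, [0, 1]), 3))
```

The list y plays the role of the PREGNANT column selected in the column selector: it is the value the model learns to predict from the remaining columns.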
3.12 Go to the module list on the left, expand the group Machine Learning -> Score, and drag a Score
Model module to the canvas. Connect the output port of the Train Model module to the left input port of
the Score Model module. Connect the test data port of the Split Data module (the right output port) to the
right input port of the Score Model module.
3.13 Go to the bottom of Azure Machine Learning Studio, click the RUN button to run the experiment (i.e., train
and score a logistic regression model). When the training is done, you can see green ticks on all modules
(except the dataset module).
3.14 Click the output port of Score Model and select Visualize from the context menu. The output shows the
predicted label and predicted probability for each testing data row (the Scored Labels and Scored
Probabilities columns) and the known labels (the PREGNANT column).
3.15 Finally, let us evaluate the prediction accuracy of the model. Go to the module list on the left, expand the
group Machine Learning -> Evaluate, and drag Evaluate Model to the canvas. Connect the output of
Score Model to the left input port of Evaluate Model.
3.16 Run the experiment again. You may choose to run only the Evaluate Model module by selecting it and then
choosing Run selected at the bottom. Wait until you see a green tick in Evaluate Model. Your experiment
canvas should look like the following.
3.17 Click the output port of Evaluate Model and choose Visualize.
You can see that the Threshold is set to 0.5 by default. It is the cut-off applied to the estimated probability
scores to separate the Positive Label (i.e., 1) from the Negative Label (i.e., 0). You can drag the bar to set
the threshold to other values. Notice that the model performance changes as the threshold changes.
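The effect of the threshold can be reproduced with a short Python sketch. The probabilities and known labels below are made up, and treating a score equal to the threshold as positive is an assumption of this sketch, not a documented behaviour of the studio.

```python
def label_at(probabilities, threshold):
    """Turn probability scores into 0/1 labels (score >= threshold gives 1)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up known labels and model scores for six testing rows.
actual = [1, 1, 0, 1, 0, 0]
probabilities = [0.95, 0.70, 0.35, 0.60, 0.30, 0.05]

for threshold in (0.3, 0.5, 0.8):
    predicted = label_at(probabilities, threshold)
    print(threshold, predicted, round(accuracy(actual, predicted), 3))
```

A low threshold labels more rows as positive and a high threshold labels fewer, so the same probability scores can produce different accuracy figures.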
The following performance metrics are shown for this classification model.
- Accuracy: the proportion of correctly predicted cases (positive + negative) out of the whole set
- Precision: the proportion of correctly predicted positive cases among all predicted positive cases
- Recall: the proportion of correctly predicted positive cases among all actual positive cases
- F1 Score: the harmonic mean of precision and recall, which balances the two
Each of these metrics lies in the range [0, 1]; the closer to 1, the better.
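The four definitions above can be written out as a small Python sketch that counts true and false positives and negatives. The labels below are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def classification_metrics(actual, predicted):
    """Compute accuracy, precision, recall and F1 from 0/1 label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up known labels and predicted labels for eight testing rows.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]

metrics = classification_metrics(actual, predicted)
print(metrics)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so all four metrics come out to 0.75.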
Task 4: Compare Two Classifiers
4.1
Go to the module list on the left, expand the group Machine Learning -> Initialize Model -> Classification.
Notice that there are more than a dozen models in the group. Models starting with “Two-Class” are
applicable to this dataset and prediction problem.
4.2
Select and drag any model starting with “Two-Class”, e.g., Two-Class Neural Network to the experiment
canvas.
4.3
Copy and paste the Train Model and Score Model modules in your canvas. Now make the following connections:
a) connect the output port of the Two-Class Neural Network module to the left input port of the 2nd Train
Model module;
b) connect the left output port of the Split Data module to the right input port of the 2nd Train Model module;
c) connect the output port of the 2nd Train Model module to the left input port of the 2nd Score Model module;
d) connect the right output port of the Split Data module to the right input port of the 2nd Score Model module; and
e) finally, connect the output port of the 2nd Score Model module to the right input port of the Evaluate Model
module.
After you are done, the experiment modules and their connections should look like the following.
4.4
Go to the bottom of Azure Machine Learning Studio and click the RUN button to run the experiment. Wait
until you see a green tick in all modules.
4.5
Click the output port of Evaluate Model and choose Visualize to compare the performance of your chosen
classification models. Notice that the Neural Network model’s performance is slightly different from that of
the previous Logistic Regression model.
[Note: the blue curve is for the model connected to the left input port of Evaluate Model, and the red one is
for the model connected to the right port. That is, the blue curve is for the logistic regression model, and the
red one is for the neural network model. For this data set, it seems that logistic regression outperforms the
neural network just a little (AUC: 0.918 vs 0.899).]
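The AUC values compared in the note can be understood as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. Here is a minimal Python sketch of that pairwise definition, using made-up scores for two hypothetical models; it is not the computation the studio performs, and it is only practical for small data sets.

```python
def auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0)
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up known labels and scores from two hypothetical models.
actual = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
model_b = [0.9, 0.3, 0.4, 0.5, 0.6, 0.1]

auc_a = auc(actual, model_a)
auc_b = auc(actual, model_b)
print("model A:", round(auc_a, 3), "model B:", round(auc_b, 3))
```

Unlike accuracy, this figure does not depend on the threshold, which makes it convenient for comparing two classifiers side by side.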