Machine Learning Lab

FACULTY OF ENGINEERING AND TECHNOLOGY
LAB 04
MODULE: ARTIFICIAL INTELLIGENCE AND NEURAL NETWORKS
CODE: EEEN 519
LAB TITLE: MACHINE LEARNING
NAME: KARABO DAVIS BOGAISANG
STUDENT ID: 14001009
PROGRAM: BENG COMPUTER AND TELECOMMUNICATIONS
Abstract
This lab covers machine learning as a class of algorithms used to predict outcomes. Concepts of
linear regression in machine learning are applied to historical stock price data.
Introduction
Machine learning (ML) is a category of algorithm that allows software applications to become more
accurate in predicting outcomes without being explicitly programmed. The basic premise of machine
learning is to build algorithms that can receive input data and use statistical analysis to predict an output
while updating outputs as new data becomes available [2].
Objectives
• To get an understanding of machine learning, covering theory, application, and inner workings
of supervised, unsupervised, and deep learning algorithms.
• To apply linear regression, K-Nearest Neighbours, Support Vector Machines (SVM), flat clustering,
hierarchical clustering, and neural networks.
Apparatus
• PyCharm CE 2018.3
• Python 3.7
• Python packages and their dependencies: numpy, pandas, sklearn (scikit-learn), quandl
Theory
Machine learning was defined in 1959 by Arthur Samuel as the "field of study that gives
computers the ability to learn without being explicitly programmed." This means imbuing
machines with knowledge without hard-coding it. People outside the programming community
mainly believe machine intelligence is hard-coded, completely unaware of the reality of the
field. One of the largest challenges with machine learning is the abundance of material on the
learning part. You can find formulas, charts, equations, and a bunch of theory on the topic of
machine learning, but very little on the actual "machine" part, where you actually program
the machine and run the algorithms on real data. This is mainly due to history: in the 1950s,
machines were quite weak and in short supply, which remained very much the case for
half a century. Machine Learning was relegated to being mainly theoretical and rarely
actually employed. The Support Vector Machine (SVM), for example, was created by
Vladimir Vapnik in the Soviet Union in 1963, but largely went unnoticed until the 90s when
Vapnik moved from the Soviet Union to the United States to work at Bell Labs. Neural
networks were conceived in the 1940s, but computers at the time were nowhere near powerful
enough to run them well, and were not until relatively recent times.
The "idea" of machine learning has come in and out of favour a few times through history,
each time leaving people thinking it was merely a fad. It is really only very recently that
we've been able to put much of machine learning to any decent test. Nowadays, you can spin
up and rent a $100,000 GPU cluster for a few dollars an hour, the stuff of PhD student
dreams just 10 years ago. Machine learning got another uptick in the mid-2000s and has
been on the rise ever since, also benefitting in general from Moore's Law. Beyond this, there
are ample resources out there to help you on your journey with machine learning, like this
tutorial. You can just do a Google search on the topic and find more than enough information
to keep you busy for a while.
This is so much the case that we now have modules and APIs at our disposal, and
you can engage in machine learning very easily with almost no knowledge at all of how
it works. With the defaults from Scikit-learn, you can get 90-95% accuracy on many tasks
right out of the gate. Machine learning is a lot like a car: you do not need to know much about
how it works in order to get an incredible amount of utility from it. If you want to push the
limits on performance and efficiency, however, you need to dig in under the hood, which is
more how this course is geared.
Despite the apparent age and maturity of machine learning, there's no better time than now to
learn it, since you can actually use it. Machines are quite powerful; the one you are working
on can probably run most of this lab quickly.
Procedure
The Python packages needed for machine learning (sklearn, numpy, pandas, and quandl) were
installed using the pip command. The following commands were used to install the
packages:
• pip install sklearn
• pip install quandl
• pip install pandas
1. The next step is to understand regression and how it is used with machine learning.
For regression to work, data is needed; sometimes data is easy to acquire and sometimes it has to be
hunted down and compiled. The following code grabs the stock price history for Alphabet
(previously Google), under the ticker GOOGL:
import pandas as pd
import quandl
df = quandl.get("WIKI/GOOGL")
print(df.head())
2. The output of the above code is displayed in the results.
3. For stock pricing the following terms are defined: Open, High, Low, Close, Volume, Ex-Dividend,
Split Ratio, Adj. Open, Adj. High, Adj. Low, Adj. Close, Adj. Volume.
The original dataframe is pared down to the adjusted columns; the following definition is used:
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
Machine learning can highlight value if it is there, but it has to actually be there. How does one know if
the data at hand is meaningful? Are historical prices indicative of future prices? Some people think so, but this
has been continually disproven over time. What about historical patterns? This has a bit more merit
when taken to the extremes (which machine learning can help with), but is overall fairly weak. What
about the relationship between price changes and volume over time, along with historical patterns?
Probably a bit better. So, as you can already see, it is not the case that the more data the merrier, but we
instead want to use useful data. At the same time, raw data sometimes should be transformed.
Thus, not all of the data is useful, and sometimes the data needs further manipulation to make it even more
valuable before feeding it through a machine learning algorithm. The following code is used to
transform the data:
• df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] *100.0
This creates a new column that is the % spread based on the closing price, which is our crude
measure of volatility. The following code computes the daily percent change:
• df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] *100.0
The new dataframe is then defined by the following code; a self-contained sketch of these transformations follows below:
• df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
• print(df.head())
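As a self-contained sketch of these transformations, the snippet below applies the same formulas to a small hand-made dataframe (the OHLC values are invented for illustration, not taken from the Quandl feed):
import pandas as pd
# Made-up OHLC rows, purely to illustrate the two derived features;
# the lab itself pulls this dataframe from Quandl.
df = pd.DataFrame({
    'Adj. Open':   [100.0, 101.0],
    'Adj. High':   [104.0, 109.0],
    'Adj. Low':    [99.0, 100.0],
    'Adj. Close':  [103.0, 108.0],
    'Adj. Volume': [44659000.0, 22834300.0],
})
# Percent spread between high and low, relative to the close (crude volatility).
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
# Daily percent change from open to close.
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())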
4. The output of the transformed stock price dataframe is displayed in the results.
5. How does the actual machine learning thing work?
With supervised learning, you have features and labels. The features are the descriptive attributes, and
the label is what you're attempting to predict or forecast. A common example with regression
might be to try to predict the dollar value of an insurance policy premium for someone. The company
may collect your age, past driving infractions, public criminal record, and your credit score for
example. The company will use past customers, taking this data, and feeding in the amount of the
"ideal premium" that they think should have been given to that customer, or they will use the one they
actually used if they thought it was a profitable amount.
Thus, for training the machine learning classifier, the features are the customer attributes and the label is the
premium associated with those attributes.
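As a minimal sketch of this split, using hypothetical numbers (the attribute values and premiums below are invented for illustration, not from any real insurer):
import numpy as np
# Hypothetical customers: each row holds age, past infractions,
# criminal record (0/1), and credit score.
X = np.array([[25, 2, 0, 640],
              [40, 0, 0, 720],
              [33, 1, 1, 580]])
# The label: the premium (in dollars) the insurer considered ideal for each customer.
y = np.array([1800.0, 900.0, 2400.0])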
6. In the case of stock pricing, what are the features and what is the label?
The following code defines the forecast column, replaces missing values with an outlier placeholder, and sets how far out to forecast (a worked example follows the list):
• forecast_col = 'Adj. Close'
• df.fillna(value=-99999, inplace=True)
• forecast_out = int(math.ceil(0.01 * len(df)))
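For example, with the 3424 rows reported in the results below, forecast_out = int(math.ceil(0.01 * 3424)) = ceil(34.24) = 35, so each label is the adjusted close 35 trading days into the future.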
The following code adds the label column with a simple pandas shift operation (demonstrated on a toy series below):
• df['label'] = df[forecast_col].shift(-forecast_out)
The following code is used to drop any remaining NaN rows from the dataframe:
• df.dropna(inplace=True)
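To make the shift concrete, here is a minimal sketch on a toy series (the prices are made up):
import pandas as pd
s = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
s['label'] = s['Adj. Close'].shift(-2)   # each label is the close 2 rows into the future
print(s)
# The last 2 rows get NaN labels, since no future price exists for them;
# these are exactly the rows that df.dropna(inplace=True) removes.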
Features and labels are defined by the code below, with X as the features and y as the labels:
• X = np.array(df.drop(columns=['label']))
• y = np.array(df['label'])
The following performs the preprocessing (feature scaling):
• X = preprocessing.scale(X)
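Here preprocessing.scale standardizes each feature column to zero mean and unit variance, which keeps features on very different scales (prices against volumes in the millions) from dominating the learning algorithm.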
Create the label y:
• y = np.array(df['label'])
The following code performs the training and testing split:
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
The following code defines the classifier as Support Vector Regression from Scikit-Learn's svm package:
• clf = svm.SVR()
The classifier is trained with sklearn's .fit method:
• clf.fit(X_train, y_train)
The classifier is then tested (scored):
• confidence = clf.score(X_test, y_test)
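For a regressor, .score returns the coefficient of determination R² on the test data, so a value closer to 1 indicates a better fit.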
The following is the complete code for the lab:
import quandl, math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
df = quandl.get("WIKI/GOOGL")
pd.set_option('display.max_columns', None)
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
print(df.describe())
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] *100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] *100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.describe())
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace=True)
X = np.array(df.drop(columns=['label']))
y = np.array(df['label'])
#preprocessing
X = preprocessing.scale(X)
#create label y
y = np.array(df['label'])
#training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#define classifier
clf = svm.SVR()
#another classifier (this overwrites the SVR above; comment one out to compare scores)
clf = LinearRegression()
#train classifier
clf.fit(X_train, y_train)
#test the classifier
confidence = clf.score(X_test, y_test)
print(confidence)
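Because the script overwrites clf, only the LinearRegression score is printed. Below is a minimal sketch for scoring both models on the same split, shown here on synthetic data from sklearn.datasets.make_regression so it runs without the Quandl feed:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn import preprocessing, svm
from sklearn.linear_model import LinearRegression
# Synthetic stand-in for the stock features and labels.
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Fit and score each model on the same split so the R^2 values are comparable.
for name, model in [('SVR', svm.SVR()), ('LinearRegression', LinearRegression())]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))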
Results
Output of the stock price data for Alphabet:
/home/davis/PycharmProjects/lab04/venv/bin/python
/home/davis/.PyCharmCE2018.3/config/scratches/scratch_2.py
                Open     High  ...   Adj. Close  Adj. Volume
Date                           ...
2004-08-19   100.010   104.06  ...    50.322842   44659000.0
2004-08-20   101.010   109.08  ...    54.322689   22834300.0
2004-08-23   110.760   113.48  ...    54.869377   18256100.0
2004-08-24   111.240   111.60  ...    52.597363   15247300.0
2004-08-25   104.760   108.00  ...    53.164113    9188600.0
...              ...      ...  ...          ...          ...
2018-03-21  1092.570  1108.70  ...  1094.000000    1990515.0
2018-03-22  1080.010  1083.92  ...  1053.150000    3418154.0
2018-03-23  1051.370  1066.78  ...  1026.550000    2413517.0
2018-03-26  1050.600  1059.27  ...  1054.090000    3272409.0
2018-03-27  1063.900  1064.54  ...  1006.940000    2940957.0

[3424 rows x 12 columns]
Process finished with exit code 0
Output of the transformed dataframe:
/home/davis/PycharmProjects/lab04/venv/bin/python
/home/davis/.PyCharmCE2018.3/config/scratches/scratch_2.py
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2004-08-19 50.322842 8.072956 0.324968 44659000.0
2004-08-20 54.322689 7.921706 7.227007 22834300.0
2004-08-23 54.869377 4.049360 -1.227880 18256100.0
2004-08-24 52.597363 7.657099 -5.726357 15247300.0
2004-08-25 53.164113 3.886792 1.183658 9188600.0
Process finished with exit code 0
Yes, the new features describe the stock price better: the derived percentage columns capture the
meaningful relationships (volatility and daily change) rather than redundant raw price levels.
Output of the SVR classifier:
/home/davis/PycharmProjects/lab04/venv/bin/python
/home/davis/.PyCharmCE2018.3/config/scratches/scratch_2.py
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2004-08-19 50.322842 8.072956 0.324968 44659000.0
2004-08-20 54.322689 7.921706 7.227007 22834300.0
2004-08-23 54.869377 4.049360 -1.227880 18256100.0
2004-08-24 52.597363 7.657099 -5.726357 15247300.0
2004-08-25 53.164113 3.886792 1.183658 9188600.0
0.7887948507466865
Process finished with exit code 0
Output of the classifier as linear regression:
/home/davis/PycharmProjects/lab04/venv/bin/python
/home/davis/.PyCharmCE2018.3/config/scratches/scratch_2.py
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2004-08-19 50.322842 8.072956 0.324968 44659000.0
2004-08-20 54.322689 7.921706 7.227007 22834300.0
2004-08-23 54.869377 4.049360 -1.227880 18256100.0
2004-08-24 52.597363 7.657099 -5.726357 15247300.0
2004-08-25 53.164113 3.886792 1.183658 9188600.0
0.9794609954055038
Process finished with exit code 0
Q. 8
0.9794609954055038
Q. 9
Linear regression is much better, since its score is close to 1.
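The score returned by clf.score is the coefficient of determination, R² = 1 − SS_res/SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the total sum of squares about the mean; the linear model therefore explains about 98% of the variance in the test labels, against roughly 79% for the default SVR.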
The following is the output of the complete lab script:
/home/davis/PycharmProjects/lab04/venv/bin/python
/home/davis/.PyCharmCE2018.3/config/scratches/scratch_2.py
Adj. Open Adj. High Adj. Low Adj. Close Adj. Volume
count 3424.000000 3424.000000 3424.000000 3424.000000 3.424000e+03
mean 409.221461 412.786556 405.227809 409.057885 7.818568e+06
std 257.844081 259.366718 255.981628 257.773495 8.248211e+06
min      49.698414    51.027517    48.128568    50.159839  5.211410e+05
25% 231.292728 233.416785 228.738594 231.316552 2.430647e+06
50% 300.174109 302.496277 297.909618 300.264387 5.076200e+06
75% 561.178118 565.000000 556.673036 561.200000 1.020910e+07
max 1188.000000 1198.000000 1184.060000 1187.560000 8.215110e+07
        Adj. Close       HL_PCT   PCT_change   Adj. Volume
count 3424.000000 3424.000000 3424.000000 3.424000e+03
mean 409.057885 2.124479 -0.025030 7.818568e+06
std 257.773495 1.397063 1.505326 8.248211e+06
min      50.159839     0.381715    -9.179757  5.211410e+05
25% 231.316552 1.239763 -0.742241 2.430647e+06
50% 300.264387 1.752383 -0.016254 5.076200e+06
75% 561.200000 2.535633 0.752556 1.020910e+07
max 1187.560000 16.278749 8.759770 8.215110e+07
0.9744166338755915
Process finished with exit code 0
Discussion
The processes involved in machine learning are similar to those of data mining and predictive modelling.
Both require searching through data to look for patterns and adjusting program actions accordingly.
Conclusion
The objectives of the lab were met: the concept of linear regression was applied in a machine learning
context, showing that machine learning does not conjure predictions out of nothing but rather teaches
the machine to use data efficiently.
References
[1] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd ed. New Jersey: Pearson Education, 2010.
[2] M. Rouse, "Machine learning platforms," TechTarget. [Online]. Available: https://searchenterpriseai.techtarget.com/definition/machine-learning-ML