Postdoctoral Fellow
Department of Computer Science
University of British Columbia
IEEE Haptics Symposium
March 4, 2012
Vancouver, B.C., Canada
[Overview diagram: machine learning, template matching, pattern recognition, statistical pattern recognition, structural pattern recognition, regression, neural networks, supervised methods, unsupervised methods]
What is pattern recognition?
this question even appears as a title in the International Association for Pattern Recognition (IAPR) newsletter
many definitions exist
simply: the process of labeling observations (x) with predefined categories (w)
Various applications of PR
[Jain et al., 2000]
Supervised learning
“tufa”
“tufa”
“tufa”
Can you identify other “tufa”s here?
[figure from lecture notes by Josh Tenenbaum]
Unsupervised learning
How many categories are there?
Which image belongs to which category?
[figure from lecture notes by Josh Tenenbaum]
Pattern recognition in haptics/HCI
[Altun et al., 2010a]
human activity recognition with body-worn inertial sensors (accelerometers and gyroscopes)
daily activities: sitting, standing, walking, stairs, etc.
sports activities: walking/running, cycling, rowing, basketball, etc.
Pattern recognition in haptics/HCI
[Altun et al., 2010a]
[Figure: right-arm and left-arm accelerometer signals during walking and basketball]
Pattern recognition in haptics/HCI
[Flagg et al., 2012]
touch gesture recognition on a conductive fur patch
Pattern recognition in haptics/HCI
[Flagg et al., 2012]
[Figure: sensor signals over 2.5 s for three gestures on the fur patch: stroke, scratch, light touch]
Other haptics/HCI applications?
Pattern recognition example
an excellent example by Duda et al.: classifying incoming fish on a conveyor belt using a camera image
sea bass
salmon
[Duda et al., 2000]
Pattern recognition example
how to classify? what kind of information can distinguish these two species?
length, width, weight, etc.
suppose a fisherman tells us that salmon are usually shorter
so, let's use length as a feature
what to do to classify? capture image – find fish in the image – measure length – make decision
how to make the decision? how to find the threshold?
Pattern recognition example
[Duda et al., 2000]
Pattern recognition example
on average, salmon are usually shorter, but is this a good feature?
let's try classifying according to the lightness of the fish scales
Pattern recognition example
[Duda et al., 2000]
Pattern recognition example
how to choose the threshold?
Pattern recognition example
how to choose the threshold?
minimize the probability of error
sometimes we should consider the costs of different errors
salmon is more expensive
customers who order salmon but get sea bass instead will be angry
customers who order sea bass but occasionally get salmon instead will not be unhappy
Pattern recognition example
we don't have to use just one feature
let's use lightness and width
each point is a feature vector
the 2-D plane is the feature space
[Duda et al., 2000]
Pattern recognition example
we don't have to use just one feature
let's use lightness and width
the curve shown is a decision boundary in the 2-D feature space
[Duda et al., 2000]
Pattern recognition example
should we add as many features as we can?
do not use redundant features
consider noise in the measurements
moreover, avoid adding too many features
more features mean higher-dimensional feature vectors
it is difficult to work in high-dimensional spaces; this is called the curse of dimensionality (more on this later)
Pattern recognition example
how to choose the decision boundary?
is this one better?
[Duda et al., 2000]
Probability theory review
a chance experiment, e.g., tossing a 6-sided die
1, 2, 3, 4, 5, 6 are the possible outcomes
the set of all outcomes Ω = {1, 2, 3, 4, 5, 6} is the sample space
any subset of the sample space is an event
the event that the outcome is odd: A = {1, 3, 5}
each event is assigned a number called the probability of the event: P(A)
the probabilities can be assigned freely, as long as the Kolmogorov axioms are not violated
Probability axioms
for any event A: P(A) ≥ 0
for the sample space: P(Ω) = 1
for disjoint events A1, A2, …: P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
the third axiom also includes the countably infinite case
die tossing: if all outcomes are equally likely, then for all i = 1…6, the probability of getting outcome i is 1/6
Conditional probability
sometimes events occur and change the probabilities of other events
example: ten coins in a bag
nine of them are fair coins, with heads (H) and tails (T)
one of them is fake, with both sides heads (H)
I randomly draw one coin from the bag, but I don't show it to you
H0: the coin is fake, both sides H
H1: the coin is fair, one side H, the other side T
which of these events would you bet on?
Conditional probability
suppose I flip the coin five times, obtaining the outcome HHHHH (five heads in a row)
call this event F
H0: the coin is fake, both sides H
H1: the coin is fair, one side H, the other side T
which of these events would you bet on now?
Conditional probability
definition: the conditional probability of event A given that event B has occurred:
P(A|B) = P(AB) / P(B)
read as: "probability of A given B"
P(AB) is the probability of events A and B occurring together
Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
Conditional probability
H0: the coin is fake, both sides H
H1: the coin is fair, one side H, the other side T
F: obtaining five heads in a row (HHHHH)
we know that F occurred; we want to find P(H0|F) and P(H1|F)
difficult to compute directly, so use Bayes' theorem:
P(H0|F) = P(F|H0) P(H0) / P(F)
prior probabilities (before the observation F): P(H0) = 1/10, P(H1) = 9/10
probability of observing F if H0 were true: P(F|H0) = 1
probability of observing F if H1 were true: P(F|H1) = (1/2)^5 = 1/32
total probability of observing F: P(F) = P(F|H0) P(H0) + P(F|H1) P(H1) = 1 · (1/10) + (1/32) · (9/10) = 41/320
posterior probability: P(H0|F) = (1 · 1/10) / (41/320) = 32/41 ≈ 0.78
which event would you bet on now?
this is very similar to a pattern recognition problem!
we can put a label on the coin as "fake" based on our observations!
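To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that reproduces the posterior computed above; the variable names are illustrative.

```python
# Posterior probability that the drawn coin is fake, given five heads (event F).
p_h0 = 1 / 10                 # prior: coin is fake (both sides heads)
p_h1 = 9 / 10                 # prior: coin is fair
p_f_given_h0 = 1.0            # a fake coin always shows heads
p_f_given_h1 = (1 / 2) ** 5   # five heads in a row with a fair coin

# total probability of observing F (law of total probability)
p_f = p_f_given_h0 * p_h0 + p_f_given_h1 * p_h1

# Bayes' theorem
p_h0_given_f = p_f_given_h0 * p_h0 / p_f
print(p_h0_given_f)           # 32/41, about 0.78: bet on "fake"
```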
Bayesian inference
w0: the coin belongs to the "fake" class
w1: the coin belongs to the "fair" class
x: observation
decide the class whose posterior probability is higher than the others'
this is called the MAP (maximum a posteriori) decision rule
Random variables
we model the observations with random variables
a random variable is a real number whose value depends on a chance experiment
discrete random variable
the possible values form a discrete set
continuous random variable
the possible values form a continuous set
Random variables
a discrete random variable X is characterized by a probability mass function (pmf): p(x) = P(X = x)
a pmf has two properties:
p(x) ≥ 0 for all possible values x
the values p(x) sum to 1 over all possible x
Random variables
a continuous random variable X is characterized by a probability density function (pdf), denoted p(x), defined for all possible values
probabilities are calculated for intervals: P(a ≤ X ≤ b) = ∫ p(x) dx over [a, b]
Random variables
a pdf also has two properties:
p(x) ≥ 0 for all x
∫ p(x) dx = 1 over the whole real line
Expectation
definition: E[X] = Σ x p(x) (discrete case) or E[X] = ∫ x p(x) dx (continuous case)
the average of the possible values of X, weighted by their probabilities
also called the expected value, or mean
Variance and standard deviation
variance is the expected value of the deviation from the mean: Var(X) = E[(X − μ)²], where μ = E[X]
variance is always positive, or zero, which means X is not random
standard deviation is the square root of the variance
Gaussian (normal) distribution
possibly the most "natural" distribution
encountered frequently in nature
central limit theorem: the sum of many i.i.d. random variables is asymptotically Gaussian
definition: the random variable with pdf p(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
two parameters: the mean μ and the variance σ²
Gaussian distribution
it can be proved that:
[Figure: areas under the Gaussian pdf within ±1σ, ±2σ, ±3σ of the mean; from http://assets.allbusiness.com]
Random vectors
extension of the scalar case
pdf: p(x), now defined over d-dimensional vectors x
mean: μ = E[x]
covariance matrix: Σ = E[(x − μ)(x − μ)ᵀ]
the covariance matrix is always symmetric and positive semidefinite
Multivariate Gaussian distribution
probability density function: p(x) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
two parameters: the mean vector μ and the covariance matrix Σ
compare with the univariate case: p(x) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))
Bivariate Gaussian exercise
The scatter plots show 100 independent samples drawn from zero-mean Gaussian distributions,with different covariance matrices. Match the covariance matrices with the scatter plots, by inspection only.
[Figure: three scatter plots (a, b, c), each showing 100 samples on axes from −4 to 4]
Bayesian decision theory
Bayesian decision theory falls into the subjective interpretation of probability
in the pattern recognition context, a prior belief about the class (category) of an observation is updated using Bayes' rule
Bayesian decision theory
back to the fish example
say we have two classes (states of nature), w1 and w2
let P(w1) be the prior probability that the fish is a sea bass
P(w2) is the prior probability that the fish is a salmon
Bayesian decision theory
prior probabilities reflect our belief about which kind of fish to expect, before we observe it
we can choose them according to the fishing location, time of year, etc.
if we don't have any prior knowledge, we can choose equal priors (or uniform priors)
Bayesian decision theory
let x be the feature vector obtained from our observations
x can include features like lightness, weight, length, etc.
calculate the posterior probabilities P(w1|x) and P(w2|x)
how to calculate them?
Bayesian decision theory
p(x|wi) is called the class-conditional probability density function (CCPDF)
the pdf of the observation x if the true class were wi
the CCPDF is usually not known
e.g., it is impossible to know the pdf of the length of all sea bass in the world
but it can be estimated; more on this later
for now, assume that the CCPDF is known
then, by Bayes' theorem, P(wi|x) = p(x|wi) P(wi) / p(x); just substitute the observation x
Bayesian decision theory
MAP rule (also called the minimum-error rule):
decide w1 if P(w1|x) > P(w2|x)
decide w2 otherwise
do we really have to calculate p(x)? (no: it is the same for both classes, so comparing p(x|wi) P(wi) is enough)
Bayesian decision theory
multiclass problems: the maximum a posteriori (MAP) decision rule decides wi if P(wi|x) ≥ P(wj|x) for all j
the MAP rule minimizes the error probability, and gives the best performance that can be achieved (of course, if the CCPDFs are known)
if the prior probabilities are equal, it reduces to the maximum likelihood (ML) decision rule: decide wi if p(x|wi) ≥ p(x|wj) for all j
Exercise (single feature)
find: the maximum likelihood decision rule
[Duda et al., 2000]
Exercise (single feature)
find: the MAP decision rule
decide w1 if …
decide w2 if …
[Duda et al., 2000]
Discriminant functions
we can generalize this
let gi(x) be the discriminant function for the i-th class
decision rule: assign x to class i if gi(x) > gj(x) for all j ≠ i
for the MAP rule: gi(x) = P(wi|x)
Discriminant functions
the discriminant functions divide the feature space into decision regions that are separated by decision boundaries
consider a multiclass problem (c classes) with Gaussian CCPDFs
discriminant functions: gi(x) = ln p(x|wi) + ln P(wi)
it is easy to show analytically that the decision boundaries are hyperquadrics
conic sections if the feature space is 2-D
hyperplanes (lines in 2-D) if the covariance matrices are the same for all classes (degenerate case)
Examples
[Figures from Duda et al., 2000: equal and spherical covariance matrices, and equal covariance matrices, in 2-D and 3-D]
Examples
[Duda et al., 2000]
Examples
[Duda et al., 2000]
2-D example
artificial data
[Jain et al., 2000]
[Scatter plot of the artificial 2-D data]
Density estimation
but CCPDFs are usually unknown
that's why we need training data
density estimation:
parametric: assume a class of densities (e.g., Gaussian), find the parameters
non-parametric: estimate the pdf directly (and numerically) from the training data
Density estimation
assume we have n samples of training vectors for a class
we assume that these samples are independent and drawn from a certain probability distribution
this is called the generative approach
Parametric methods
we will consider only the Gaussian case
underlying assumption: the samples are actually noise-corrupted versions of a single feature vector
why Gaussian? three important properties:
completely specified by its mean and variance
linear transformations of a Gaussian remain Gaussian
central limit theorem: many phenomena encountered in reality are asymptotically Gaussian
Gaussian case
assume x1, …, xn are drawn from a Gaussian distribution
how to find the pdf?
Gaussian case
assume x1, …, xn are drawn from a Gaussian distribution
how to find the pdf?
finding the mean and covariance is sufficient
sample mean: μ̂ = (1/n) Σ xk
sample covariance: Σ̂ = (1/(n−1)) Σ (xk − μ̂)(xk − μ̂)ᵀ
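As an illustration, here is a minimal Python sketch of this parametric approach (not from the original slides; function names are illustrative). It estimates the sample mean and covariance for each class and applies the MAP rule from the earlier slides:

```python
import numpy as np

def fit_gaussian(samples):
    """Sample mean and sample covariance of an (n, d) array of training vectors."""
    mu = samples.mean(axis=0)
    sigma = np.cov(samples, rowvar=False)  # unbiased estimate, divides by n - 1
    return mu, sigma

def log_gaussian(x, mu, sigma):
    """Log of the multivariate Gaussian pdf evaluated at x."""
    d = mu.size
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ np.linalg.solve(sigma, diff))

def map_classify(x, class_params, priors):
    """MAP rule: pick the class maximizing log p(x|w_i) + log P(w_i)."""
    scores = [log_gaussian(x, mu, sigma) + np.log(p)
              for (mu, sigma), p in zip(class_params, priors)]
    return int(np.argmax(scores))
```

Here class_params would hold one (mean, covariance) pair per class, each estimated with fit_gaussian from that class's training vectors.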
2-D example
back to the 2-D example: calculate the sample mean and covariance of each class, then apply the MAP rule
[Scatter plot of the 2-D data]
2-D example
back to the 2-D example
2-D example
[Figure: decision boundary with the true pdf vs. decision boundary with the estimated pdf]
Haptics example
[Flagg et al., 2012]
[Figure: sensor signals over 2.5 s for the stroke, scratch, and light touch gestures]
which feature to use for discrimination?
Haptics example
[Flagg et al., 2012]
7 participants performed each gesture 10 times
210 samples in total
we should find distinguishing features
let's use one feature at a time
we assume the feature value is normally distributed, and find its mean and variance
Haptics example
[Figure: estimated Gaussian pdfs of the minimum value feature for stroke, scratch, and light touch]
assume equal priors, apply the ML rule
Haptics example
[Figure: estimated Gaussian pdfs of the maximum value feature for stroke, scratch, and light touch]
apply the ML rule
decision boundaries? (decision thresholds for 1-D)
Haptics example
let's plot the 2-D distribution (minimum value vs. maximum value)
[Scatter plot: stroke, scratch, and light touch samples in the 2-D feature space]
clearly this isn't a "good" classifier for this problem
the Gaussian assumption is not valid
Activity recognition example
[Altun et al., 2010a]
4 participants (2 male, 2 female)
activities: standing, ascending stairs, walking
720 samples in total
sensor: accelerometer on the right leg
let's use the same features
minimum and maximum values
Activity recognition example
[Figure: estimated Gaussian pdfs of the minimum value and maximum value features for standing, stairs, and walking]
Activity recognition example
[Scatter plot: standing, stairs, and walking samples in the 2-D feature space]
the Gaussian assumption looks valid
this is a "good" classifier for this problem
Activity recognition example
[Figure: decision boundaries in the 2-D feature space for standing, stairs, and walking]
Haptics example
how to solve the problem?
either change the classifier, or change the features
Non-parametric methods
let's estimate the CCPDF directly from the samples
the simplest method is the histogram
partition the feature space into (equally sized) bins
count the number of samples in each bin: p̂(x) = k / (nV)
k: number of samples in the bin that includes x
n: total number of samples
V: volume of the bin
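A minimal sketch of this estimate in one dimension (not from the slides; the function name and the choice of anchoring bins at zero are illustrative):

```python
import numpy as np

def histogram_density(x, samples, h):
    """Histogram estimate p_hat(x) = k / (n V), equally sized 1-D bins of width h."""
    def bin_of(v):
        return np.floor(v / h)                 # index of the bin containing v
    k = np.sum(bin_of(samples) == bin_of(x))   # samples in the same bin as x
    return k / (len(samples) * h)              # V = h in one dimension
```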
Non-parametric methods
how to choose the bin size?
the number of bins increases exponentially with the dimension of the feature space
we can do better than that!
Non-parametric methods
compare the following density estimates
[Figure: pdf estimates with six samples; image from http://en.wikipedia.org/wiki/Parzen_Windows]
Kernel density estimation
a density estimate can be obtained as
p̂(x) = (1/n) Σ K((x − xk)/hn) / hnᵈ
where the kernel functions are Gaussians centered at the training samples xk
K: Gaussian kernel
hn: width of the Gaussian
Kernel density estimation
three different density estimates with different widths
if the width is large, the pdf will be too smooth
if the width is small, the pdf will be too spiked
as the width approaches zero, the pdf converges to a sum of Dirac delta functions
[Duda et al., 2000]
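A corresponding 1-D Python sketch of the Parzen estimate with a Gaussian kernel (the sample values below are purely illustrative):

```python
import numpy as np

def kde(x, samples, h):
    """Parzen estimate at x with a 1-D Gaussian kernel of width h."""
    u = (x - samples) / h
    kernel_values = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return kernel_values.mean() / h

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])  # six illustrative samples
print(kde(0.0, samples, h=1.0))  # larger h: smoother estimate; smaller h: spikier
```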
KDE for activity recognition data
[Figure: kernel density estimates of the minimum value and maximum value features for standing, stairs, and walking]
KDE for activity recognition data
[Figure: decision boundaries in the 2-D feature space obtained with the kernel density estimates]
KDE for gesture recognition data
[Figure: kernel density estimates of the minimum value and maximum value features for stroke, scratch, and light touch]
Other density estimation methods
Gaussian mixture models
parametric: model the distribution as a sum of M Gaussians
optimization algorithm: expectation-maximization (EM)
k-nearest neighbor estimation
non-parametric: variable width, fixed k
Another example
[Aksoy, 2011]
Measuring classifier performance
how do we know our classifiers will work?
how do we measure the performance, i.e., decide that one classifier is better than another?
correct recognition rate
confusion matrix
ideally, we should have more data, independent from the training set, and test the classifiers on it
Confusion matrix
[Figure: confusion matrix for an 8-class problem; Tunçel et al., 2009]
Measuring classifier performance
use the training samples to test the classifiers?
this is possible, but not good practice
100% correct classification rate for this example!
because the classifier "memorized" the training samples instead of "learning" them
[Duda et al., 2000]
Cross validation
having a separate test data set might not be possible in some cases
we can use cross validation
use some of the data for training, and the remaining for testing
how to divide the data?
Cross validation methods
repeated random sub-sampling
divide the data into two groups randomly (usually the training set is larger)
train and test, record the correct classification rate
do this repeatedly, and take the average
Cross validation methods
K-fold cross validation
randomly divide the data into K sets
use K−1 sets for training, 1 set for testing
repeat K times, using a different set for testing at each fold
leave-one-out cross validation
use one sample for testing, and all the remaining for training
the same as K-fold cross validation, with K equal to the total number of samples
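A small Python sketch of K-fold cross validation (not from the slides; the train_and_test callback interface is an illustrative assumption). Setting k equal to the number of samples gives leave-one-out:

```python
import numpy as np

def k_fold_rate(X, y, k, train_and_test, seed=0):
    """Average correct classification rate over k folds.

    train_and_test(train_X, train_y, test_X) must return predicted labels.
    """
    idx = np.random.default_rng(seed).permutation(len(y))  # random division
    folds = np.array_split(idx, k)
    rates = []
    for i in range(k):
        test = folds[i]                                    # fold i is for testing
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        predictions = train_and_test(X[train], y[train], X[test])
        rates.append(np.mean(predictions == y[test]))
    return float(np.mean(rates))
```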
Haptics example
[Figure: Gaussian pdfs of the minimum value feature for stroke, scratch, and light touch]
assume equal priors, apply the ML rule
confusion matrix (rows: true class, columns: assigned class):
              stroke  scratch  light touch
stroke          53      16         1
scratch          2      66         2
light touch     35      28         7
correct recognition rate: 60.0%
the decision region for light touch is too small!!
Haptics example
[Figure: Gaussian pdfs of the maximum value feature for stroke, scratch, and light touch]
apply the ML rule
confusion matrix (rows: true class, columns: assigned class):
              stroke  scratch  light touch
stroke          61       0         9
scratch         13      24        33
light touch     18      14        38
correct recognition rate: 58.5%
Haptics example
[Figure: kernel density estimates of the minimum value and maximum value features]
KDE with the minimum value feature (rows: true class, columns: assigned class):
              stroke  scratch  light touch
stroke          48      16         6
scratch          2      67         1
light touch     32      30         8
correct recognition rate: 58.8%
KDE with the maximum value feature:
              stroke  scratch  light touch
stroke          60       0        10
scratch          4      23        43
light touch      9      13        48
correct recognition rate: 62.4%
Activity recognition example
[Figure: Gaussian pdfs of the minimum value and maximum value features for standing, stairs, and walking]
minimum value feature (rows: true class, columns: assigned class):
           standing  stairs  walking
standing      239       1       0
stairs          5     171      64
walking         0     132     108
correct recognition rate: 71.9%
maximum value feature:
           standing  stairs  walking
standing      232       8       0
stairs         41     146      53
walking         0      72     168
correct recognition rate: 75.8%
Activity recognition example
[Scatter plot with decision boundaries in the 2-D feature space]
both features together (rows: true class, columns: assigned class):
           standing  stairs  walking
standing      239       1       0
stairs          0     209      31
walking         0      56     184
correct recognition rate: 87.8%
Another cross-validation method
used in HCI studies with multiple human subjects
subject-based leave-one-out cross validation
number of subjects: S
leave one subject's data out, train with the remaining data
repeat S times, each time testing with a different subject, then average
this gives an estimate of the expected correct recognition rate when a new user is encountered
Activity recognition example
minimum value feature, K-fold (rows: true class, columns: assigned class):
           standing  stairs  walking
standing      239       1       0
stairs          5     171      64
walking         0     132     108
correct recognition rate: 71.9%
minimum value feature, subject-based leave-one-out:
           standing  stairs  walking
standing      180      60       0
stairs         13     150      77
walking         1     125     114
correct recognition rate: 61.6%
maximum value feature, K-fold:
           standing  stairs  walking
standing      232       8       0
stairs         41     146      53
walking         0      72     168
correct recognition rate: 75.8%
maximum value feature, subject-based leave-one-out:
           standing  stairs  walking
standing      134     106       0
stairs         42     135      63
walking         0      71     169
correct recognition rate: 60.8%
Activity recognition example
[Scatter plot with decision boundaries, both features]
K-fold (rows: true class, columns: assigned class):
           standing  stairs  walking
standing      239       1       0
stairs          0     209      31
walking         0      56     184
correct recognition rate: 87.8%
subject-based leave-one-out:
           standing  stairs  walking
standing      206      34       0
stairs          0     182      58
walking         0      39     201
correct recognition rate: 81.8%
Dimensionality reduction
for most problems a few features are not enough
adding features sometimes helps
[Duda et al., 2000]
Dimensionality reduction
should we add as many features as we can?
what does this figure say?
[Jain et al., 2000]
Dimensionality reduction
we should add features up to a certain point
the more training samples we have, the farther away this point is
more features = higher-dimensional spaces
in higher dimensions, we need more samples to estimate the parameters and the densities accurately
the number of necessary training samples grows exponentially with the dimension of the feature space
this is called the curse of dimensionality
Dimensionality reduction
how many features to use?
rule of thumb: use at least ten times as many training samples as the number of features
which features to use?
difficult to know beforehand
one approach: consider many features and select among them
Pen input recognition
[Willems, 2010]
Touch gesture recognition
[Flagg et al., 2012]
Feature reduction and selection
form a set of many features
some of them might be redundant
feature reduction (sometimes called feature extraction):
form linear or nonlinear combinations of features
features in the reduced set usually don't have a physical meaning
feature selection:
select the most discriminative features from the set
Feature reduction
we will only consider Principal Component Analysis (PCA)
an unsupervised method: we don't care about the class labels
consider the distribution of all the feature vectors in the d-dimensional feature space
PCA is the projection onto a lower-dimensional space that "best represents the data"
get rid of unnecessary dimensions
Principal component analysis
how to "best represent the data"?
find the direction(s) in which the variance of the data is the largest
[Scatter plot of 2-D data with its principal directions]
Principal component analysis
find the covariance matrix Σ
spectral decomposition: Σ = V Λ Vᵀ
eigenvalues: on the diagonal of Λ
eigenvectors: the columns of V
the covariance matrix is symmetric and positive semidefinite, so the eigenvalues are nonnegative and the eigenvectors are orthogonal
Principal component analysis
put the eigenvalues in decreasing order
the corresponding eigenvectors show the principal directions, in which the variance of the data is largest
say we want to have m features only
project onto the space spanned by the first m eigenvectors
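A compact numpy sketch of this procedure (not from the slides; the function name is illustrative):

```python
import numpy as np

def pca_project(X, m):
    """Project (n, d) data onto the m principal directions of largest variance."""
    Xc = X - X.mean(axis=0)                   # center the data
    sigma = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(sigma)  # symmetric PSD, so eigh applies
    order = np.argsort(eigvals)[::-1]         # eigenvalues in decreasing order
    V = eigvecs[:, order[:m]]                 # first m eigenvectors as columns
    return Xc @ V                             # (n, m) reduced feature vectors
```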
Activity recognition example
[Altun et al., 2010a]
five sensor units (wrists, legs, chest)
each unit has three accelerometers, three gyroscopes, three magnetometers
45 sensors in total
computed 26 features from each sensor signal: mean, variance, min, max, Fourier transform, etc.
45 × 26 = 1170 features
Activity recognition example
compute the covariance matrix
find its eigenvalues and eigenvectors
plot the first 100 eigenvalues
reduced the number of features to 30
Activity recognition example
[Figure: the first 100 eigenvalues]
Activity recognition example
what does the Bayesian decision making (BDM) result suggest?
Feature reduction
ideally, this should be done for the training set only
estimate the covariance matrix from the training set, find the eigenvalues and eigenvectors, and form the projection
apply the projection to the test vectors
for K-fold cross validation, for example, this should be done K times
computationally expensive
Feature selection
alternatively, we can select from our large feature set
say we have d features and want to reduce the number to m
optimal way: evaluate all possibilities and choose the best one
not feasible except for small values of m and d
suboptimal methods: greedy search
Feature selection
best individual features
evaluate all d features individually, then select the best m features
Feature selection
sequential forward selection
start with the empty set
evaluate all features one by one, select the best one, and add it to the set
form pairs of features with this one and each of the remaining features, and add the best one to the set
form triplets of features with these two and each of the remaining features, and add the best one to the set
… (see the sketch below)
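A sketch of this greedy procedure (not from the slides), assuming an evaluate(subset) callback that scores a candidate feature subset, e.g., by a cross-validated correct recognition rate:

```python
def sequential_forward_selection(features, m, evaluate):
    """Greedy SFS: grow the set one feature at a time, keeping the best addition."""
    selected, remaining = [], list(features)
    while len(selected) < m and remaining:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```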
Feature selection
sequential backward selection
start with the full feature set
evaluate by removing one feature at a time from the set, then remove the worst feature
continue the previous step with the current feature set
…
Feature selection
plus p, take away r selection
first enlarge the feature set by adding p features using sequential forward selection
then remove r features using sequential backward selection
Activity recognition example
[Figure: correct recognition rates using the first 5 features selected by sequential forward selection vs. the first 5 features from PCA; Altun et al., 2010b]
SFS performs better than PCA when only a few features are used; if 10-15 features are used, their performances become closer
time-domain features and leg features are more discriminative
Activity recognition example
[Altun et al., 2010b]
Discriminative methods
we talked about discriminant functions
for the MAP rule we used gi(x) = P(wi|x)
discriminative methods try to find gi(x) directly from the data
Linear discriminant functions
consider a discriminant function that is a linear combination of the components of x: g(x) = wᵀx + w0
for the two-class case, there is a single decision boundary: g(x) = 0
Linear discriminant functions
for the multiclass case, there are options:
c two-class problems: separate each class from all the others
consider classes pairwise: c(c−1)/2 boundaries
Linear discriminant functions
[Figures from Duda et al., 2000: distinguishing one class from the others; considering classes pairwise]
Linear discriminant functions
or, use the original definition
assign x to class i if gi(x) > gj(x) for all j ≠ i
[Duda et al., 2000]
Nearest mean classifier
find the means of the training vectors for each class
for a test vector y, assign the class of the nearest mean
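A minimal numpy sketch of this classifier (not from the slides; names are illustrative):

```python
import numpy as np

def fit_class_means(X, y):
    """Mean of the training vectors of each class."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, means

def nearest_mean_classify(v, classes, means):
    """Assign the class of the nearest mean (Euclidean distance)."""
    return classes[np.argmin(np.linalg.norm(means - v, axis=1))]
```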
2-D example
artificial data
[Scatter plot of the artificial 2-D data]
2-D example
estimated parameters
[Figure: decision boundary with the true pdf vs. decision boundary with the nearest mean classifier]
Activity recognition example
[Scatter plot: standing, stairs, and walking samples with nearest mean decision boundaries]
k-nearest neighbor (k-NN) classifier
for a test vector y:
find the k closest training vectors
let ki be the number of training vectors belonging to class i among these k vectors; assign the class with the largest ki
simplest case: k = 1
just find the closest training vector and assign its class
decision boundaries: a Voronoi tessellation of the space
decision regions for k = 1:
[Figure from Duda et al., 2000: this is called a Voronoi tessellation]
[Figure: a test sample (circle) among training samples from the square and triangle classes; note how the decision is different for k=3 and k=5; from http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm]
no training is needed
computation time for testing is high
many techniques exist to reduce the computational load
alternatives exist for computing the distance:
Manhattan distance (L1 norm)
chessboard distance (L∞ norm)
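Putting the pieces together, a small numpy sketch of k-NN voting with a selectable norm (not from the slides; names are illustrative):

```python
import numpy as np

def knn_classify(v, train_X, train_y, k=3, order=2):
    """k-NN vote; order=2 Euclidean, order=1 Manhattan, order=np.inf chessboard."""
    dists = np.linalg.norm(train_X - v, ord=order, axis=1)
    nearest = np.argsort(dists)[:k]            # k closest training vectors
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # majority class among the k
```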
Haptics example
[Scatter plot of the 2-D feature space]
K-fold (rows: true class, columns: assigned class):
              stroke  scratch  light touch
stroke          52       6        12
scratch          7      40        23
light touch     13      16        41
correct recognition rate: 63.3%
subject-based leave-one-out:
              stroke  scratch  light touch
stroke          50       6        14
scratch          7      41        22
light touch     14      23        33
correct recognition rate: 59.0%
Activity recognition example
[Scatter plot of the 2-D feature space]
K-fold (rows: true class, columns: assigned class):
           standing  stairs  walking
standing      240       0       0
stairs          0     206      34
walking         0      38     202
correct recognition rate: 90.0%
subject-based leave-one-out:
           standing  stairs  walking
standing      240       0       0
stairs          0     202      38
walking         0      40     200
correct recognition rate: 89.2%
Activity recognition example
[Scatter plot: decision boundaries for k = 3]
Feature normalization
especially when computing distances, the scales of the feature axes are important
features with large ranges may be weighted more
feature normalization can be applied so that the ranges are similar
Feature normalization
linear scaling: x̃ = (x − l) / (u − l), where l is the lowest value and u is the largest value of the feature x
normalization to zero mean and unit variance: x̃ = (x − m) / s, where m is the mean value and s is the standard deviation of the feature x
other methods exist
Feature normalization
ideally, the parameters l, u, m, and s should be estimated from the training set only, and then used on the test vectors
for K-fold cross validation, for example, this should be done K times
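A small sketch of this fit-on-training, apply-everywhere pattern for zero mean and unit variance normalization (not from the slides; names are illustrative):

```python
import numpy as np

def zscore_fit(train_X):
    """Estimate m and s from the training set only."""
    return train_X.mean(axis=0), train_X.std(axis=0)

def zscore_apply(X, m, s):
    """Apply the training-set parameters to training or test vectors."""
    return (X - m) / s
```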
Discriminative methods
another popular method is the binary decision tree
start from the root node
proceed in the tree by setting thresholds on the feature values
sequentially answer questions like "is feature j less than threshold value Tk?"
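As a toy illustration only, a hand-built two-level tree in this spirit; the feature names and threshold values below are invented for the example, not taken from the slides:

```python
def classify_touch(features):
    """A hand-built two-level decision tree (illustrative thresholds)."""
    if features["minimum value"] < 1.0:      # "is feature j less than T_k?"
        return "light touch"
    elif features["maximum value"] < 4.2:
        return "stroke"
    else:
        return "scratch"
```

In practice the thresholds and the feature tested at each node are learned from the training data rather than set by hand.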
Activity recognition example
[Scatter plot: decision-tree decision boundaries for standing, stairs, and walking]
Discriminative methods
one very popular method is the support vector machine (SVM)
a linear classifier, applicable to linearly separable data
if the data is not linearly separable, it is mapped to a higher-dimensional space (usually a Hilbert space)
[Aksoy, 2011]
Comparison for activity recognition
1170 features reduced to 30 by PCA
19 activities
8 participants
References
S. Aksoy, Pattern Recognition lecture notes, Bilkent University, Ankara, Turkey, 2011.
A. Moore, Statistical Data Mining tutorials (http://www.autonlab.org/tutorials).
J. Tenenbaum, The Cognitive Science of Intuitive Theories lecture notes, Massachusetts Institute of Technology, MA, USA, 2006. (accessed online: http://www.mit.edu/~jbt/9.iap/9.94.Tenenbaum.ppt)
R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
A. K. Jain, R. P. W. Duin, J. Mao, "Statistical pattern recognition: a review," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4-37, January 2000.
A. R. Webb, Statistical Pattern Recognition, 2nd ed., John Wiley & Sons, West Sussex, England, 2002.
V. N. Vapnik, The Nature of Statistical Learning Theory, 2nd ed., Springer-Verlag New York, Inc., 2000.
K. Altun, B. Barshan, O. Tunçel, (2010a) "Comparative study on classifying human activities with miniature inertial/magnetic sensors," Pattern Recognition, 43(10):3605-3620, October 2010.
K. Altun, B. Barshan, (2010b) "Human activity recognition using inertial/magnetic sensor units," in Human Behavior Understanding, Lecture Notes in Computer Science, A. A. Salah et al. (eds.), vol. 6219, pp. 38-51, Springer, Berlin, Heidelberg, August 2010.
A. Flagg, D. Tam, K. MacLean, R. Flagg, "Conductive fur sensing for a gesture-aware furry robot," Proceedings of IEEE 2012 Haptics Symposium, March 4-7, 2012, Vancouver, B.C., Canada.
O. Tunçel, K. Altun, B. Barshan, "Classifying human leg motions with uniaxial piezoelectric gyroscopes," Sensors, 9(11):8508-8546, November 2009.
D. Willems, Interactive Maps: using the pen in human-computer interaction, PhD thesis, Radboud University Nijmegen, Netherlands, 2010. (accessed online: http://www.donwillems.net/waaaa/InteractiveMaps_PhDThesis_DWillems.pdf)