Latent tree analysis of unlabeled data

Latent Tree Analysis of Unlabeled Data
Nevin L. Zhang
Dept. of Computer Science & Engineering
The Hong Kong Univ. of Sci. & Tech.
http://www.cse.ust.hk/~lzhang
Page 2
Outline

Latent tree models

Latent tree analysis algorithms

What can LTA be used for:


Discovery of co-occurrence/correlation patterns

Discovery of latent variable/structures

Multidimensional clustering
Examples

Danish beer survey data

Text data

TCM survey data
Page 3
Latent Tree Models

Tree-structured probabilistic graphical models

Leaves observed (manifest variables)
 Discrete or continuous

Internal nodes latent (latent variables)
 Discrete

Each edge is associated with a conditional distribution

One node with marginal distribution

Defines a joint distribution over all the variables
(Zhang, JMLR 2004)
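
As a minimal sketch of how such a tree defines a joint distribution, the Python below builds a hypothetical model with one binary latent root H and two binary observed leaves X1 and X2. The variable names and all probabilities are invented for illustration and do not come from the talk.

```python
from itertools import product

# Hypothetical parameters (illustrative only, not from the talk).
p_h = [0.6, 0.4]                          # marginal distribution at the root H
p_x1_given_h = [[0.9, 0.1], [0.2, 0.8]]   # P(X1 | H), one row per state of H
p_x2_given_h = [[0.7, 0.3], [0.1, 0.9]]   # P(X2 | H), one row per state of H


def joint(h, x1, x2):
    """P(H=h, X1=x1, X2=x2): root marginal times one conditional per edge."""
    return p_h[h] * p_x1_given_h[h][x1] * p_x2_given_h[h][x2]


def marginal(x1, x2):
    """P(X1=x1, X2=x2): the latent variable is summed out."""
    return sum(joint(h, x1, x2) for h in range(2))


# The joint sums to one over all configurations of (H, X1, X2).
total = sum(joint(h, a, b) for h, a, b in product(range(2), repeat=3))
```

Only the marginal over X1 and X2 is ever compared with data, since H is never observed.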
Latent Tree Analysis
From data on observed variables, obtain a latent tree model
Learning latent tree models: Determine
• Number of latent variables
• Numbers of possible states for latent variables
• Connections among nodes
• Probability distributions
Model Selection Criterion
Find the model that maximizes the BIC score
BIC(m|D) = log P(D|m, θ*) - (d/2) log N
D: Data, N: sample size
m: model, θ*: MLE of parameters
d: number of free parameters
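
The BIC score above can be sketched in a few lines of Python. The function names are hypothetical, and the parameter count assumes every distribution is a full probability table, which is an assumption of this sketch rather than something stated on the slide.

```python
import math


def count_free_params(root_card, edges):
    """Free parameters of a tree model with tabular distributions.

    root_card: number of states of the root node.
    edges: list of (parent_card, child_card) pairs, one per edge.
    """
    d = root_card - 1                      # root marginal
    for parent_card, child_card in edges:
        d += parent_card * (child_card - 1)  # one conditional table per edge
    return d


def bic_score(log_likelihood, num_free_params, sample_size):
    """BIC(m|D) = log P(D|m, theta*) - (d/2) log N."""
    return log_likelihood - 0.5 * num_free_params * math.log(sample_size)


# Hypothetical example: binary root with two binary leaves, N = 463 records,
# and a made-up maximized log-likelihood of -100.0.
d = count_free_params(2, [(2, 2), (2, 2)])
score = bic_score(-100.0, d, 463)
```

A larger model raises the log-likelihood term but also raises d, so the penalty term keeps the search from simply adding latent states.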
Page 5
Algorithms: EAST

Search-based
Extension, Adjustment, Simplification until Termination

Can deal with ~100 observed variables
(Chen, Zhang et al. AIJ 2011)
Unidimensionality Test
(Liu, Zhang et al. MLJ 2013)
Chow-Liu tree (1968)
(Liu, Zhang et al. MLJ 2013)
Close to EAST in terms of model quality. Can deal with 1,000 observed variables
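
The slides name the Chow-Liu tree (1968) without showing how it is used inside the algorithm of Liu, Zhang et al. The sketch below therefore illustrates only the classical Chow-Liu construction: estimate pairwise mutual information from data and take a maximum-weight spanning tree. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())


def chow_liu_edges(columns):
    """Maximum spanning tree over pairwise mutual information (Kruskal).

    columns: dict mapping a variable name to its list of observed values.
    Returns the list of undirected tree edges in the order they were added.
    """
    names = list(columns)
    weighted = sorted(
        ((mutual_information(columns[a], columns[b]), a, b)
         for a, b in combinations(names, 2)),
        reverse=True)
    parent = {v: v for v in names}  # union-find forest

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edges = []
    for _, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:          # adding this edge creates no cycle
            parent[ra] = rb
            edges.append((a, b))
    return edges


# Toy binary data (illustrative only): B copies A, C nearly copies A,
# D is empirically independent of A.
example = {
    "A": [0, 0, 1, 1, 0, 0, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "C": [0, 0, 1, 1, 0, 0, 1, 0],
    "D": [0, 1, 0, 1, 0, 1, 0, 1],
}
edges = chow_liu_edges(example)
```

The pairwise step costs one pass per variable pair, which is part of why tree-based constructions of this kind scale to far more variables than exhaustive structure search.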
Page 11
Danish Beer Market Survey


463 consumers, 11 beer brands
Questionnaire: For each brand:

Never seen the brand before (s0);

Seen before, but never tasted (s1);

Tasted, but do not drink regularly (s2)

Drink regularly (s3).
(Mourad et al. JAIR 2013)
Page 12
Why are the variables grouped as such?

GronTuborg and Carlsberg: Main mass-market beers

TuborgClas and CarlSpec: Frequent beers, a bit darker than the above

CeresTop, CeresRoyal, Pokal, …: minor local beers

Grouped as such because responses on brands in each group are strongly correlated.

Intuitively, latent tree analysis:

Partitions observed variables into groups such that
 Variables in each group are strongly correlated, and
 The correlations within each group can be properly modeled using a single latent variable
Page 13
Multidimensional Clustering

Each latent variable gives a partition of consumers.

H1:
 Class 1: Likely to have tasted TuborgClas, CarlSpec and Heineken, but do not drink them regularly
 Class 2: Likely to have seen or tasted the beers, but did not drink them regularly
 Class 3: Likely to drink TuborgClas and CarlSpec regularly

Intuitively, latent tree analysis is a technique for multiple clustering.

K-Means, mixture models give only one partition.
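
To illustrate how each latent variable yields its own partition, the sketch below uses a hypothetical tree with two binary latent variables (H1 is the parent of H2, X1, X2; H2 is the parent of X3, X4) and assigns a record to one cluster per latent variable by its posterior. The structure and all probabilities are invented, and the posteriors are computed by brute-force enumeration, which is only practical for tiny models.

```python
from itertools import product

# Hypothetical parameters (illustrative only).
p_h1 = [0.5, 0.5]                          # P(H1)
p_h2_given_h1 = [[0.8, 0.2], [0.3, 0.7]]   # P(H2 | H1), one row per state of H1
# For each binary leaf: its latent parent and P(leaf = 1 | parent state).
p_leaf_on = {
    "X1": ("H1", [0.1, 0.9]),
    "X2": ("H1", [0.2, 0.8]),
    "X3": ("H2", [0.1, 0.9]),
    "X4": ("H2", [0.3, 0.7]),
}


def joint(h1, h2, x):
    """P(H1=h1, H2=h2, X=x) for a record x given as {leaf name: 0 or 1}."""
    p = p_h1[h1] * p_h2_given_h1[h1][h2]
    state = {"H1": h1, "H2": h2}
    for leaf, (parent, on) in p_leaf_on.items():
        q = on[state[parent]]
        p *= q if x[leaf] == 1 else 1.0 - q
    return p


def posteriors(x):
    """Return (P(H1 | x), P(H2 | x)) by enumerating both latent variables."""
    table = {(h1, h2): joint(h1, h2, x)
             for h1, h2 in product(range(2), repeat=2)}
    z = sum(table.values())
    post_h1 = [sum(table[h1, h2] for h2 in range(2)) / z for h1 in range(2)]
    post_h2 = [sum(table[h1, h2] for h1 in range(2)) / z for h2 in range(2)]
    return post_h1, post_h2


def assign(x):
    """Hard-assign one record to a cluster along each latent dimension."""
    post_h1, post_h2 = posteriors(x)
    return post_h1.index(max(post_h1)), post_h2.index(max(post_h2))


record = {"X1": 1, "X2": 1, "X3": 0, "X4": 0}
clusters = assign(record)
```

The same record receives one label from H1 and another from H2, which is what distinguishes this from a single-partition method such as K-Means.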
Page 14
Binary Text Data: WebKB
(Liu et al. PGM 2012, MLJ 2013)
1041 web pages collected from 4 CS departments in 1997
336 words
Page 15
Latent Tree Model for WebKB Data by BI Algorithm
89 latent variables
Latent Tree Models for WebKB Data
Page 17
Page 18
Page 19
Why are the variables grouped as such?

Grouped as such because words in each group tend to co-occur.

On binary data, latent tree analysis:

Partitions observed word variables into groups such that
 Words in each group tend to co-occur, and
 The correlations can be properly explained using a single latent variable
LTA is a method for identifying co-occurrence relationships.
Multidimensional Clustering
LTA is an approach to topic detection

Y66=4: Object Oriented Programming (oop)

Y66=2: Non-oop programming

Y66=1: programming language

Y66=3: Not on programming
Page 22
Background of Research


Common practice in China, increasingly in the Western world

Patients of a WM (Western medicine) disease are divided into several TCM (traditional Chinese medicine) classes

Different classes are treated differently using TCM treatments.
Example:

WM disease: Depression

TCM Classes:
 Liver-Qi Stagnation (肝气郁结). Treatment principle: soothe the liver and relieve stagnation (疏肝解郁). Prescription: Chaihu Shugan San (柴胡疏肝散)
 Deficiency of Liver Yin and Kidney Yin (肝肾阴虚). Treatment principle: nourish the kidney and the liver (滋肾养肝). Prescription: Xiaoyao San combined with Liuwei Dihuang Wan (逍遥散合六味地黄丸)
 Vacuity of both heart and spleen (心脾两虚). Treatment principle: replenish qi and strengthen the spleen (益气健脾). Prescription: Guipi Tang (归脾汤)
 ….
Page 23
Key Question


How should patients of a WM disease be divided into
subclasses from the TCM perspective?

What TCM classes?

What are the characteristics of each TCM class?

How to differentiate different TCM classes?
Important for

Clinical practice

Research
 Randomized controlled trials for efficacy
 Modern biomedical understanding of TCM concepts

No consensus. Different doctors/researchers use different
schemes. Key weakness of TCM.
Page 24
Key Idea

Our objective:



Provide an evidence-based method for TCM patient classification
Key Idea

Cluster analysis of symptom data => empirical partition of patients

Check to see whether it corresponds to TCM class concept
Key technology: Multidimensional clustering

Motivation for developing latent tree analysis
Page 25
Symptoms Data of Depressive Patients
(Zhao et al. JACM 2014)

Subjects:

604 depressive patients aged between 19 and 69 from 9 hospitals

Selected using the Chinese Classification of Mental Disorders clinical guideline (CCMD-3)

Exclusion:
 Subjects who took anti-depression drugs within two weeks prior to the survey; women in the gestational and suckling periods, etc.

Symptom variables

From the TCM literature on depression between 1994 and 2004.

Searched with the phrase “抑郁 and 证” (“depression and syndrome”) on the CNKI (China National Knowledge Infrastructure) database

Kept only those studies where patients were selected using the ICD-9, ICD-10, CCMD-2, or CCMD-3 guidelines.

143 symptoms reported in those studies altogether.
Page 26
The Depression Data

Data as a table

604 rows, each for a patient

143 columns, each for a symptom

Table cells: 0 – symptom not present, 1 – symptom present

Removed: Symptoms occurring <10 times

86 symptom variables entered latent tree analysis.

The structure of the latent tree model obtained is shown on the next two slides.
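
The preprocessing step described on this slide (dropping symptoms that occur fewer than 10 times in a 0/1 table) might look like the sketch below. The function name and the toy table are hypothetical; the real data set has 604 rows and 143 columns.

```python
def drop_rare_columns(rows, names, min_count=10):
    """Keep only the columns whose symptom is present at least min_count times.

    rows: list of records, each a list of 0/1 values (1 = symptom present).
    names: column names, in the same order as the values in each record.
    """
    counts = [sum(row[j] for row in rows) for j in range(len(names))]
    keep = [j for j, c in enumerate(counts) if c >= min_count]
    kept_rows = [[row[j] for j in keep] for row in rows]
    kept_names = [names[j] for j in keep]
    return kept_rows, kept_names


# Toy table with 12 patients (illustrative only): symptom "a" occurs 10 times,
# "b" occurs 9 times, "c" occurs 12 times.
toy_names = ["a", "b", "c"]
toy_rows = [[1 if i < 10 else 0, 1 if i < 9 else 0, 1] for i in range(12)]
filtered_rows, filtered_names = drop_rare_columns(toy_rows, toy_names)
```

Very rare symptoms contribute too few positive cases for their conditional distributions to be estimated reliably, which is a plausible reason for the threshold, although the slide does not state one.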
Page 27
Model Obtained for the Depression Data (Top)
Page 28
Model Obtained for the Depression Data (Bottom)
Page 29
The Empirical Partitions

The first cluster (Y29= s0) consists of 54% of the patients, while the second cluster (Y29= s1) consists of 46% of the patients.

The two symptoms ‘fear of cold’ and ‘cold limbs’ do not occur often in the first cluster,

while they both tend to occur with high probabilities (0.8 and 0.85) in the second cluster.
Page 30
Probabilistic Symptom co-occurrence pattern

Probabilistic symptom co-occurrence pattern:



The table indicates that the two symptoms ‘fear of cold’ and ‘cold limbs’ tend
to co-occur in the cluster Y29= s1
Pattern meaningful from the TCM perspective.

TCM asserts that YANG DEFICIENCY (阳虚) can lead to, among other
symptoms, ‘fear of cold’ and ‘cold limbs’

So, the co-occurrence pattern suggests the TCM syndrome type (证型) YANG DEFICIENCY (阳虚).
The partition Y29 suggests that

Among depressive patients, there is a subclass of patients with YANG DEFICIENCY.

In this subclass, ‘fear of cold’ and ‘cold limbs’
co-occur with high probabilities (0.8 and 0.85)
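
As a rough worked example of what these numbers imply, the sketch below combines the reported figures under one assumption that the slides do not confirm: that ‘fear of cold’ and ‘cold limbs’ are conditionally independent given Y29 (which would hold if both are direct children of Y29 in the tree). It also assumes 0.8 and 0.85 are listed in the same order as the two symptoms.

```python
# Figures reported on the slides.
p_cluster = 0.46        # P(Y29 = s1)
p_fear_of_cold = 0.80   # P('fear of cold' present | Y29 = s1)
p_cold_limbs = 0.85     # P('cold limbs' present | Y29 = s1)

# Assumption of this sketch: the two symptoms are conditionally independent
# given Y29, so their joint probability within the cluster is a product.
p_both_given_cluster = p_fear_of_cold * p_cold_limbs

# Share of all patients who are in cluster s1 and show both symptoms.
p_cluster_and_both = p_cluster * p_both_given_cluster
```

Under that assumption, roughly two thirds of the patients in the second cluster would show both symptoms at once.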
Page 31
Probabilistic Symptom co-occurrence pattern

Y28= s1 captures the probabilistic co-occurrence of ‘aching lumbus’, ‘lumbar
pain like pressure’ and ‘lumbar pain like warmth’.

This pattern is present in 27% of the patients.

It suggests that

Among depressive patients, there is a subclass that corresponds to the TCM concept of KIDNEY DEPRIVED OF NOURISHMENT (肾虚失养)

Characteristics of the subclass given by distributions for Y28= s1
Page 32
Probabilistic Symptom co-occurrence pattern

Y27= s1 captures the probabilistic co-occurrence of ‘weak lumbus and knees’
and ‘cumbersome limbs’.

This pattern is present in 44% of the patients

It suggests that,

Among depressive patients, there is a subclass that corresponds to the TCM concept of KIDNEY DEFICIENCY (肾虚)


Characteristics of the subclass given by distributions for Y27= s1
Y27, Y28, Y29 together provide evidence for defining KIDNEY YANG
DEFICIENCY
Page 33
Probabilistic Symptom co-occurrence pattern

Pattern Y21= s1: evidence for defining STAGNANT QI TURNING INTO FIRE
(气郁化火)

Y15= s1 : evidence for defining QI DEFICIENCY

Y17 = s1 : evidence for defining HEART QI DEFICIENCY

Y16= s1 : evidence for defining QI STAGNATION

Y19= s1: evidence for defining QI STAGNATION IN HEAD
Page 34
Probabilistic Symptom co-occurrence pattern

Y9= s1 :evidence for defining DEFICIENCY OF BOTH QI AND YIN (气阴两虚)

Y10= s1: evidence for defining YIN DEFICIENCY (阴虚)

Y11= s1: evidence for defining DEFICIENCY OF STOMACH/SPLEEN YIN (脾胃阴虚)
Page 35
Symptom Mutual-Exclusion Patterns

Some empirical partitions reveal
symptom exclusion patterns

Y1 reveals the mutual exclusion of
‘white tongue coating’, ‘yellow tongue
coating’ and ‘yellow-white tongue
coating’

Y2 reveals the mutual exclusion of ‘thin
tongue coating’, ‘thick tongue coating’
and ‘little tongue coating’.
Page 36
Summary of TCM Data Analysis

By analyzing 604 cases of depressive patient data using latent tree models we
have discovered a host of probabilistic symptom co-occurrence patterns and
symptom mutual-exclusion patterns.

Most of the co-occurrence patterns have clear TCM syndrome connotations,
while the mutual-exclusion patterns are also reasonable and meaningful.

The patterns can be used as evidence for the task of defining TCM classes in
the context of depressive patients and for differentiating between those
classes.
Page 37
Another Perspective: Statistical Validation of TCM Postulates
(Zhang et al. JACM 2008)
Y28 = s1: Kidney deprived of nourishment
Y29 = s1: Yang Deficiency
…..
TCM terms such as Yang Deficiency were introduced to explain symptom co-occurrence patterns observed in clinical practice.
Page 38
Value of Work in View of Others

D. Haughton and J. Haughton. Living Standards Analytics:
Development through the Lens of Household Survey Data. Springer.
2012

Zhang et al. provide a very interesting application of latent class
(tree) models to diagnoses in traditional Chinese medicine (TCM).

The results tend to confirm known theories in Chinese traditional
medicine.

This is a significant advance, since the scientific bases for
these theories are not known.

The model proposed by the authors provides at least a
statistical justification for them.
Page 39
Summary


Latent tree models:

Tree-structured probabilistic graphical models

Leaf nodes: observed variables

Internal nodes: latent variables
What can LTA be used for:

Discovery of co-occurrence patterns in binary data

Discovery of correlation patterns in general discrete data

Discovery of latent variable/structures

Multidimensional clustering

Topic detection in text data

Key role in TCM patient classification
Page 40
References:

N. L. Zhang (2004). Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5(6):697-723.

T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2011). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246-2269.

T. F. Liu, N. L. Zhang, A. H. Liu, L. K. M. Poon (2012). A Novel LTM-based Method for Multidimensional Clustering. European Workshop on Probabilistic Graphical Models (PGM-12), 203-210.

T. F. Liu, N. L. Zhang, P. X. Chen, A. H. Liu, L. K. M. Poon, Y. Wang (2013). Greedy learning of latent tree models for multidimensional clustering. Machine Learning, doi:10.1007/s10994-013-5393-0.

R. Mourad, C. Sinoquet, N. L. Zhang, T. F. Liu, P. Leray (2013). A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47, 157-203, 13 May 2013. doi:10.1613/jair.3879.

N. L. Zhang, S. H. Yuan, T. Chen, Y. Wang (2008). Statistical Validation of TCM Theories. Journal of Alternative and Complementary Medicine, 14(5):583-7.

N. L. Zhang, S. H. Yuan, T. Chen, Y. Wang (2008). Latent tree models and diagnosis in traditional Chinese medicine. Artificial Intelligence in Medicine, 42:229-245.

Z. X. Xu, N. L. Zhang, Y. Q. Wang, G. P. Liu, J. Xu, T. F. Liu, A. H. Liu (2013). Statistical Validation of Traditional Chinese Medicine Syndrome Postulates in the Context of Patients with Cardiovascular Disease. The Journal of Alternative and Complementary Medicine.

Y. Zhao, N. L. Zhang, T. F. Wang, Q. G. Wang (2014). Discovering Symptom Co-Occurrence Patterns from 604 Cases of Depressive Patient Data using Latent Tree Models. The Journal of Alternative and Complementary Medicine.
Thank You!