Predicting Market - Artificial Intelligence Laboratory

Predicting Market Movements:
From Breaking News to Emerging
Social Media
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab
University of Arizona
hchen@eller.arizona.edu
http://ai.arizona.edu
Acknowledgements: NSF CRI; NSF EXP-LA; DOD
DTRA, CTFP, NPS; (ARFL WMD, CIA, FBI)
PREDICTING
MARKET
MOVEMENTS
Predicting Markets






Markets: international markets, emerging markets, import/export markets,
financial market, stock market, commodity market, retail market
Economics (macro), international relations (trade, geopolitics), finance
(international/banking/stock), accounting (market return), marketing
(sales/retailing)
US (NSF SBE, social behavioral economics; governments, think tanks),
Europe/Asia → Business school research is not science (cannot be funded by
NSF in US)!
Economics, finance, accounting, political science, social science, marketing,
computer science (small, no funding in US!), MIS (business intelligence)
Geopolitical/econ/finance/accounting models/theories, market
metrics/parameters, analytical techniques, results interpretations, predicting
markets
EMH (efficient market hypothesis), RWT (random walk theory), CAPM (capital
asset pricing model), quant/algorithm trading
Research Opportunities

Sophisticated econ/finance/accounting/marketing
models/theories, established analytical techniques and metrics
(numeric), abundant structured databases (financial metrics,
economic indicators, stock quotes)

New, diverse unstructured (text) web-enabled business data
sources, e.g., 10K/10Q SEC reports, mass media news, local
news, Internet news, financial blogs, investor forums, tweets…
Topic extraction, named entity recognition, sentiment/affect
analysis, multilingual language models, social network analysis,
statistical machine learning, temporal data/text mining, time-series analysis…

Nerds on Wall Street
“Future technological stars…(1)
Advanced electronic market tools; (2)
Understanding both quantitative and
qualitative information…”
“The Text Frontier, Collective
Intelligence, Social Media, and
Market Monitors”
“Stocks are stories, bonds are
mathematics.”
David Leinweber, 2009
AZ BIZ INTEL:
BUSINESS MASS MEDIA, SOCIAL MEDIA,
TEXT ANALYTICS, SENTIMENT
ANALYSIS, SPIKE DETECTION,
FINANCE/ACCOUNTING/MARKETING
MODELING, PREDICTING MARKET
MOVEMENTS
Business Intelligence & Analytics
• $3B BI revenue in 2009 (Gartner, 2006)
• The Data Deluge (The Economist, March 2010); internet traffic 667 exabytes by 2013 (Cisco); total amount of information in 2010: 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB)
• $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester)
• IBM spent $14B in BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians → Acquired i2/COPLINK in 2011
Business Intelligence & Analytics
• BI: “skills, technologies, applications, and practices used to help an enterprise better understand its business and market.”
• Technologies: data warehousing; Extraction, Transformation, and Load (ETL); Business Performance Management (BPM); visual dashboards; and advanced knowledge discovery using data and text mining
• BI 2.0: web intelligence, web analytics, web 2.0, social media analytics, opinion mining; cloud computing and web services; real-time monitoring and mining; enterprise performances (marketing/accounting/finance/healthcare)
AZ BIZ INTEL
• Mass media, social media contents
• Text & social media analytics techniques
• Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara) → NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu)
• Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams)
• Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis
• Time series models, spike detection, decaying function, trading windows, targeted sentiment
• Econometrics/regression models (R-sqr, p-value), 10-fold validation (F, accuracy), simulated trading (cost, frequency, exit)
AZ ONLINE
WOM
AZ WOM: events, volume, sentiment
[Figure: AZ WOM research framework]
Data Collection: Yahoo! Movie; sales data; professional evaluation
Data Processing: parsing; OpinionFinder; SentiWordNet
Measures and Metrics
  – Online WOM measures (messages): number of messages, number of sentences, valence, subjectivity, number of valence words
  – New-product performance metrics: opening-week box office sales, total box office sales, opening strength, longevity, professional evaluation
Statistical Analysis
  – Online WOM evolution: correlation of WOM measures across the new-product lifecycle; correlation between different WOM measures
  – Correlation between online WOM and product performance: correlation between online WOM measures and new-product performance across the whole new-product lifecycle
Firms' Strategy
Results

Evolution of online WOM through new-product lifecycle



WOM communication starts early in preproduction, becomes highly
active before movie release, then diminishes gradually
Valence has a clear decreasing trend over time, indicating that WOM
becomes more negative after movie release
Subjectivity, number of sentences and number of valence words stay
stable over time
IT’S THE BUZZ!
AZ STOCK
TRACKER I & II
Literature Review:
Stock Performance Prediction

Theoretical perspectives on stock behavior
• Efficient market hypothesis (Fama 1964)
  – Price of a stock reflects all available information
  – Market reacts instantaneously; impossible to outperform
• Random walk theory (Malkiel 1973)
  – Price of a stock varies randomly over time
  – Future prediction, outperforming the market is impossible
Pessimistic assessments of the predictability of stock behavior refuted through empirical studies
• Lo and MacKinlay 1988; Jaffe et al 1989; Pesaran and Timmermann 1995
Literature Review:
Stock Performance Prediction

Predominant approaches to stock prediction
• Fundamentalists utilize fundamental and financial measures of economy, industry, and firm
  – Economy and sector indicators, financial ratios of the firm
  – Fama-French three factors model (Fama and French 1993): market return, market capitalization, book to market ratio
  – Currency exchange rates, interest rates, dividends
• Technicians utilize historical time-series information of the stock and market behavior
  – Historical price, volatility, trading volume
• Various machine learning models applied
  – Regression, ANN, ARIMA, support vector machines
Literature Review:
Stock Performance Prediction

In addition to financial and stock variables, researchers have incorporated firm-related news article measures
• Developed trend-based language models for news articles
  – Lavrenko et al. 2000
• Categorized press releases (good, bad, neutral)
  – Mittermayer 2004
• Examined various textual representations of news articles
  – Schumaker and Chen, 2009a; 2009b
But few have incorporated firm-related web forums
• Thomas and Sycara (2000) utilize text classifications of discussions on Raging Bull to inform stock trading strategies
Literature Review:
Firm-Related Web Forums and Stock

Studies relating web forums and stock behavior
• Examined firm-related web forums on major web portals
• Early studies focused on activity, without content analysis
  – Supported market efficiency; only concurrent relationships identified (Wysocki 1998; Tumarkin and Whitelaw 2001)
  – Subsequently challenged; forum activity predicted stock behavior (Antweiler and Frank 2002; 2004; Das and Chen 2007)
• Analysis advanced to measure opinions in discussions
  – ‘Bullishness’ classifiers to distinguish investment positions (Antweiler and Frank 2004; Das and Chen 2007)
  – Classified buy, hold, or sell positions with 60–70% accuracy
  – Identified predictive relationships between forum discussion sentiment and subsequent stock returns, volatility, trading volume
Shortcomings
• Retrospective analyses, shareholder perspective of major forums
AZ FinText: numbers + text
• Techniques: bag of words, named entities, proper nouns, past stock prices +
SVR
• Testbed: S&P 500 5 weeks, Oct-Nov 2005, 2,809 news, 10M stock quotes,
GICS industry classification
• Evaluation: Return, vs. Quant funds; 20-minute prediction
AZ FinText in the news
Thursday, June 10, 2010
AI That Picks Stocks Better Than the Pros
A computer science professor uses textual analysis
of articles to beat the market.
WSJ Technology News and Insights
June 21, 2010, 1:45 PM ET
Using Artificial Intelligence to Digest News,
Trade Stocks
AZ STOCK TRACKER I: mass, social
media, topic, volume, sentiment
[Figure: AZ Stock Tracker I system design]
Data collection: web forums; online news; spider/parser; database (author, message, topic)
Topic extraction: mutual information phrase extractor; conversation analysis; discussion topics
Sentiment identification: sentiment grader; message sentiments; sentiment aggregator
Analysis: traffic dynamics; topic correlation and evolution; sentiment correlation and evolution; active topics and sentiments; market prediction
User-Generated Contents (UGC):
Conversations of 30,000 Wal-Mart Constituents and 500,000 Responses
Data source                                                     Duration             # of Threads  # of Messages  # of Users
Wall Street Journal - WalMart-related News (WSJ)                Aug 1999 - Mar 2007  N/A           4,081          657
Yahoo! Finance - WalMart Message Board (YAHOO)                  Jan 1999 - Jun 2008  139,062       441,954        25,500
Walmart-blows Forum - Employee Department Board (EMP)           Dec 2003 - Oct 2008  7,440         102,240        2,930
Walmart-blows Forum - WalMart Sucks Board (WSB)                 Nov 2003 - Nov 2008  1,354         19,624         1,855
Wakeupwalmart Forum - General WalMart Discussion Board (GDB)    Aug 2005 - Nov 2008  2,136         23,940         967
[Figure: Post Dynamics. # of news and # of messages per year, 1999-2008, for WSJ, YAHOO, EMP, WSB, and GDB.]
[Figure: Sentiment Trend. Two panels, average sentiment and 3 months' moving average sentiment, by year (1999-2008) for WSJ, YAHOO, EMP, WSB, and GDB; vertical axis from -0.04 to 0.01.]
Market Modeling
Correlation
Columns: Return, Volatility, Trading Volume. Only coefficients with p<0.10 are shown (two-tailed test); each row lists its shown coefficients in column order.
Return: 1
Volatility: 0.0348; 1
Trading Volume: 1
Sentiment: 0.0338
Subjectivity: (none shown)
Disagreement: -0.0507; -0.03578
Message Volume: -0.3186; 0.3131
Message Length: 0.0473; -0.1840
Sentiment One Day Lag: (none shown)
Subjectivity One Day Lag: -0.0425
Disagreement One Day Lag: -0.0527; -0.0475
Message Volume One Day Lag: -0.3433; 0.3026
Message Length One Day Lag: 0.0859; -0.1795

Correlation
• Sentiment expressed in the forum contemporaneously correlates significantly with stock return
• Disagreement, volume, and length expressed in the forum also hold significant correlations with volatility and trading volume
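The same-day and one-day-lag rows above can be read as plain Pearson correlations between a forum series and a market series, with the forum series shifted by one day for the lagged rows. The sketch below illustrates that reading only; the function names and the toy numbers are hypothetical and are not taken from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def lagged_pearson(forum, market, lag=1):
    """Correlate forum[t - lag] with market[t]; lag=0 is the same-day case."""
    if lag == 0:
        return pearson(forum, market)
    return pearson(forum[:-lag], market[lag:])

# Hypothetical toy series (illustration only, not study data).
message_volume = [10, 30, 20, 50, 40, 60]
trading_volume = [5, 11, 31, 19, 52, 39]

same_day = lagged_pearson(message_volume, trading_volume, lag=0)
next_day = lagged_pearson(message_volume, trading_volume, lag=1)
```

In this toy example the market series trails the forum series by one day, so the lagged correlation is much higher than the same-day one.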
Market Predictive Results (cont’d)
Overall Forum         Return t             Volatility t          Trading Volume t
Market t              0.8723*** (31.33)    -0.0010 (-0.25)       0.7627*** (15.06)
Sentiment t-1         0.0025 (0.31)        0.0074 (0.47)         -0.4275** (-2.06)
Disagreement t-1      0.0000 (0.04)        -0.0023*** (-4.94)    0.0140** (2.29)
Message Volume t-1    -0.0007** (-2.29)    -0.0122*** (-19.09)   0.1957*** (23.18)
Message Length t-1    0.0002 (1.42)        0.0030*** (7.82)      -0.0668*** (-13.24)
Subjectivity t-1      0.0015 (1.46)        0.0149*** (7.27)      -0.3014*** (-11.11)
Note: *p<0.10; **p<0.05; ***p<0.01

Predictive regression (t-1)
• The significant measures of forum discussions identified in contemporaneous regressions maintain their significance in the predictive regression models
• Additionally, sentiment expressed in the web forum holds a significant relationship with the trading volume on the following day
  – Positive sentiment reduces trading volume; negative sentiment induces trading activity
AZ STOCK TRACKER II: stakeholder
analysis
Experimental Design:
Description of Prediction Models
Variable: Description
Dependent:
  RETURN t: Stock return on day t (log difference of share price)
Fundamental:
  FFSIZE: Fama-French firm size (prior year; market capitalization = share price * shares outstanding)
  FFBTM: Fama-French book-to-market ratio (prior year; book value / market value of shares)
  FFMARKET t-1: Fama-French market return on day t – 1 (log difference of S&P 500 index price)
  FFMARKET t-2: Fama-French market return on day t – 2 (log difference of S&P 500 index price)
Technical:
  RETURN t-1: Stock return on day t – 1 (log difference of share price)
  RETURN t-2: Stock return on day t – 2 (log difference of share price)
  VOLATILITY t-1: Stock price volatility on day t – 1 (volatility modeled using a GARCH(1,1))
  VOLATILITY t-2: Stock price volatility on day t – 2 (volatility modeled using a GARCH(1,1))
  VOLUME t-1: Stock trading volume on day t – 1 (in log)
  VOLUME t-2: Stock trading volume on day t – 2 (in log)
  DAY d t: Dummy variables for trading day of the week on day t
t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)
Experimental Design:
Description of Prediction Models
Variable: Description
Forum:
  MESSAGES t-1: Number of messages posted in the forum on day t – 1 (in log (1 + messages))
  LENGTH t-1: Average length of messages posted in the forum on day t – 1 (in number of sentences)
  SENTI t-1: Average sentiment of messages posted in the forum on day t – 1
  VARSENTI t-1: Variance in sentiment of messages posted in the forum on day t – 1
  SUBJ t-1: Average subjectivity of messages posted in the forum on day t – 1
  VARSUBJ t-1: Variance in subjectivity of messages posted in the forum on day t – 1
Stakeholder:
  MESSAGES s t-1: Number of messages posted by stakeholder cluster s on day t – 1 (in log (1 + messages))
  LENGTH s t-1: Average length of messages posted by stakeholder cluster s on day t – 1 (in number of sentences)
  SENTI s t-1: Average sentiment of messages posted by stakeholder cluster s on day t – 1
  VARSENTI s t-1: Variance in sentiment of messages posted by stakeholder cluster s on day t – 1
  SUBJ s t-1: Average subjectivity of messages posted by stakeholder cluster s on day t – 1
  VARSUBJ s t-1: Variance in subjectivity of messages posted by stakeholder cluster s on day t – 1
t = days (t = 1, 2, …, n); stakeholder clusters (s = 1, 2, …, c)
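Two transformations recur in the variable definitions: returns are log differences of price, and message counts enter as log(1 + messages). A minimal sketch of both follows; the function names and the price series are hypothetical, and the natural logarithm is an assumption because the slides do not state the base.

```python
import math

def log_returns(prices):
    """Daily log returns: ln(P_t) - ln(P_{t-1}) for consecutive prices."""
    return [math.log(p1) - math.log(p0) for p0, p1 in zip(prices, prices[1:])]

def log_message_count(messages):
    """log(1 + messages), which stays defined on days with zero posts."""
    return math.log(1 + messages)

# Hypothetical closing prices (illustration only).
prices = [100.0, 110.0, 99.0]
returns = log_returns(prices)
```

Log returns add up across days, so the sum of `returns` equals the log of the last price over the first.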
Experimental Design:
Description of Prediction Models

Baseline Model – Baseline-FF

Fundamental variables: Fama-French model
RETURN t =
β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2 + εt

Baseline Model – Baseline-Tech

Technical variables: Lagged stock returns, volatility, trading volume, day-of-week dummies
RETURN t =
β0 + β1 RETURN t-1 + β2 RETURN t-2 + β3 VOLATILITY t-1 + β4 VOLATILITY t-2
+ β5 VOLUME t-1 + β6 VOLUME t-2 + (β7 DAY1t + … + β10 DAY4t)+ εt

Baseline Model – Baseline-Comp

Comprehensive: all fundamental and technical variables
RETURN t =
β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2
+ β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2
+ β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t) + εt
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)
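Each baseline specification is a linear regression of RETURN t on lagged regressors, so the OLS variant can be sketched with the normal equations. The code below is an illustrative pure-Python fit, not the authors' implementation; the helper names and the two-regressor toy data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols_fit(rows, y):
    """Return [b0, b1, ...] minimising squared error of y ~ b0 + sum(b_j * x_j)."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear(xtx, xty)

def predict(beta, row):
    """Predicted value for one observation."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))

# Hypothetical toy data: two lagged regressors, noise-free target.
rows = [(0.01, 0.2), (-0.02, 0.1), (0.03, 0.4), (0.00, 0.3), (-0.01, 0.5)]
y = [0.002 + 0.5 * a - 0.1 * b for a, b in rows]
beta = ols_fit(rows, y)
```

Because the toy target is an exact linear function of the regressors, `beta` recovers the intercept 0.002 and the slopes 0.5 and -0.1 up to floating-point error.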
Experimental Design:
Description of Prediction Models

Forum models

Comprehensive baseline variables plus forum-level measures
RETURN t =
β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2
+ β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2
+ β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t)
+ β15 MESSAGES t-1 + β16 LENGTH t-1 + β17 SENTI t-1 + β18 VARSENTI t-1
+ β19 SUBJ t-1 + β20 VARSUBJ t-1 + εt
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c)
Experimental Design:
Description of Prediction Models

Stakeholder models

Comprehensive baseline variables plus stakeholder group-level forum measures
RETURN t =
β0 + β1 FFSIZE + β2 FFBTM + β3 FFMARKET t-1 + β4 FFMARKET t-2
+ β5 RETURN t-1 + β6 RETURN t-2 + β7 VOLATILITY t-1 + β8 VOLATILITY t-2
+ β9 VOLUME t-1 + β10 VOLUME t-2 + (β11 DAY1t + … + β14 DAY4t)
+ (β15 MESSAGES 1 t-1 + β16 LENGTH 1 t-1 + β17 SENTI 1 t-1 + β18 VARSENTI 1 t-1
+ β19 SUBJ 1 t-1 + β20 VARSUBJ 1 t-1 + … + βk MESSAGES c t-1 + βk+1 LENGTH c t-1
+ β k+2 SENTI c t-1 + β k+3 VARSENTI c t-1 + β k+4 SUBJ c t-1 + β k+5 VARSUBJ c t-1) + εt
Where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c); index k = (((c - 1) * 6) + 15)
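The coefficient numbering in the stakeholder model follows from 14 baseline regressors (β1 to β14) plus six forum measures per cluster. The sketch below enumerates the stakeholder terms and checks the index formula k = (((c - 1) * 6) + 15); the function and constant names are hypothetical.

```python
FORUM_MEASURES = ["MESSAGES", "LENGTH", "SENTI", "VARSENTI", "SUBJ", "VARSUBJ"]
N_BASELINE = 14  # β1..β14: 4 Fama-French, 6 technical, 4 day-of-week dummies

def stakeholder_terms(c):
    """Map coefficient index to regressor name for clusters s = 1..c."""
    terms = {}
    for s in range(1, c + 1):
        start = N_BASELINE + 1 + (s - 1) * len(FORUM_MEASURES)
        for offset, name in enumerate(FORUM_MEASURES):
            terms[start + offset] = f"{name} {s} t-1"
    return terms

def first_index_of_last_cluster(c):
    """Index k of the MESSAGES term for the last cluster, as given on the slide."""
    return ((c - 1) * 6) + 15

terms = stakeholder_terms(3)
```

With c = 3 clusters, `terms` covers β15 through β32, and β27 is the MESSAGES term of the third cluster.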
Experimental Design:
Social Media Data

• A 17-month period was utilized for analysis and experimentation
  – November 1, 2005 to March 31, 2007
• First five months were utilized to calibrate the initial stock return prediction models
  – November 1, 2005 – March 31, 2006
  – Calibrated models applied for prediction during each trading day in the next month
  – Each subsequent month, new models were calibrated using five previous months of time-series variables, for stock return prediction during the next month of trading
• In total, stock return prediction was performed daily for one year (250 trading days)
  – April 1, 2006 – March 31, 2007
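The calibration schedule described above is a rolling window: five months of history, then one month of daily prediction, repeated monthly. A minimal sketch of the window boundaries follows; the function names and the dictionary layout are hypothetical, and end dates are exclusive.

```python
from datetime import date

def add_months(d, months):
    """First day of the month that lies `months` after d's month."""
    idx = d.year * 12 + (d.month - 1) + months
    return date(idx // 12, idx % 12 + 1, 1)

def rolling_windows(first_prediction_month, n_months, calibration_months=5):
    """One calibration/prediction window per prediction month."""
    windows = []
    for i in range(n_months):
        predict_start = add_months(first_prediction_month, i)
        windows.append({
            "calibrate_from": add_months(predict_start, -calibration_months),
            "calibrate_until": predict_start,               # exclusive
            "predict_from": predict_start,
            "predict_until": add_months(predict_start, 1),  # exclusive
        })
    return windows

# Twelve prediction months starting April 2006, as on the slide.
windows = rolling_windows(date(2006, 4, 1), n_months=12)
```

The first window calibrates on November 2005 through March 2006, and the last one predicts March 2007, matching the 17-month span.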
Forum                                      Messages  Discussion Threads  Stakeholders  Messages per Thread  Messages per Stakeholder
Yahoo Finance – WMT (finance.yahoo.com)    134,201   40,633              5,533         3.30                 24.25
Wal-Mart Blows (www.walmartblows.com)      55,125    3,690               1,461         14.94                37.73
Wakeup Wal-Mart (www.wakeupwalmart.com)    10,797    1,306               915           8.27                 11.80
Results and Discussion

Hypothesis testing results
Hypothesis: Result
H1.1 Baseline-Comp model > Baseline-FF model: Partially supported
H1.2 Baseline-Comp model > Baseline-Tech model: Rejected
H2 Forum-level models > best baseline models: Rejected
H3.1 Stakeholder-level models > best baseline models: Supported
H3.2 Stakeholder-level models > forum-level models: Partially supported
H4.1 Social network > discussion content representation: Partially supported
H4.2 Writing style > discussion content representation: Rejected
H4.3 Social network > writing style representation: Partially supported
H5.1 ANN > OLS: Rejected
H5.2 SVR > OLS: Partially supported
H5.3 SVR > ANN: Partially supported
Results and Discussion

Wal-Mart stock return prediction model results
• Baseline models using fundamental and technical variables
  – Results across 250 trading days forecasted
• Baselines for simulated trading (initial investment of $10,000):
  – Holding Wal-Mart stock for the year results in $10,096
  – Holding S&P 500 for the year results in $11,012

Model          OLS $     OLS Accuracy  ANN $     ANN Accuracy  SVR $     SVR Accuracy
Baseline-FF    $9,787    55.20%        $9,998    44.40%        $9,408    51.20%
Baseline-Tech  $8,799    57.20%        $9,702    57.60%        $9,503    56.40%
Baseline-Comp  $10,763   54.40%        $10,418   56.80%        $10,645   56.80%
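The two reported metrics, directional accuracy and the dollar value of a simulated $10,000 investment, can be sketched as follows. The trading rule here (hold the stock on days with a positive predicted return, otherwise stay in cash, no transaction costs) is an assumption made for illustration; the slides do not state the rule that was used. All names and numbers are hypothetical.

```python
import math

def directional_accuracy(predicted, actual):
    """Share of days on which the predicted and actual return signs agree."""
    hits = sum((p > 0) == (a > 0) for p, a in zip(predicted, actual))
    return hits / len(actual)

def simulate(predicted, actual, initial=10_000.0):
    """Assumed rule: hold the stock when the predicted log return is positive."""
    value = initial
    for p, a in zip(predicted, actual):
        if p > 0:
            value *= math.exp(a)  # actual log return applied to the holding
    return value

# Hypothetical daily log returns (illustration only).
predicted = [0.01, -0.01, 0.02, -0.02]
actual = [0.02, -0.03, -0.01, 0.01]

accuracy = directional_accuracy(predicted, actual)
final_value = simulate(predicted, actual)
buy_and_hold = 10_000.0 * math.exp(sum(actual))
```

In this toy run the rule is right on two of four days and ends above buy-and-hold, which shows why accuracy and dollar outcome are reported separately: they need not rank models the same way.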
Results and Discussion

Wal-Mart stock return prediction model results
• Incorporating the Wakeup Wal-Mart web forum
  – Results across 250 trading days forecasted

Model                         OLS $     OLS Accuracy  ANN $     ANN Accuracy  SVR $     SVR Accuracy
Best Baseline                 $10,763   57.20%        $10,418   57.60%        $10,645   56.80%
Forum                         $10,367   57.60%        $10,397   59.20%        $10,303   59.20%
Stakeholder-SN                $9,873    55.20%        $10,930   57.20%        $10,669   59.20%
Stakeholder-Content           $10,689   60.40%        $11,595   60.40%        $11,976   61.20% *
Stakeholder-Style             $10,271   56.00%        $9,653    56.80%        $9,305    56.00%
Stakeholder-SN+Content        $10,384   61.60%        $13,066   60.80%        $11,866   62.80% **
Stakeholder-SN+Style          $10,744   60.00%        $10,792   60.40%        $11,249   57.60%
Stakeholder-Content+Style     $10,696   59.20%        $10,590   56.40%        $10,603   58.80%
Stakeholder-SN+Content+Style  $10,976   58.00%        $10,778   56.40%        $10,881   59.60%
Pair-wise t-test; improvement over best baseline model at * p < 0.10 ** p < 0.05
AZ STOCK
TRACKER III
Introduction


• Forward-looking statements (FLS) refer to
  – Projections, forecasts, or other predictive statements
  – Made by firm management
  – Section 21E of the Securities Exchange Act (1934)
• Extended forward-looking statements (EFLS)
  – Statements that may have implications for a firm's future development
  – Similar to FLS, but broader
  – Including information from information intermediaries (e.g., newspapers, newswires) and individuals (e.g., blogs)
Recognizing EFLS

EFLS: Extends FLS to include statements about a firm’s future performance from other sources such as financial press, analysts’ reports, and individuals

Goal / Recognition Task: Definition
EFLS Recognition
  Future Timing (FT): Primary content is about future events or states
  Explicit Uncertainty (EU): Explicit accounts of doubt or unreliability
  Overall Assessment (ALL): Affect decision maker’s belief about a firm’s future cash flow
EFLS Sentiment
  Positive (POS): Positive impact on the belief
  Negative (NEG): Negative impact on the belief
AZ STOCK TRACKER III:
EFLS
Summary of Annotation Results
Category  Agreement (95% CI)   Cohen’s Kappa (95% CI)
ALL       0.91 (0.88, 0.93)    0.81 (0.76, 0.86)
POS       0.90 (0.88, 0.93)    0.79 (0.73, 0.85)
NEG       0.89 (0.86, 0.91)    0.77 (0.71, 0.82)
Note: (95% CI) from 1,000 bootstrap resamples

• High kappa values (>0.7) on risks support the coding scheme being empirically valid
• Agreement upper bound
  – 89% to 91% (for ALL, POS, and NEG)

Category  Count  Percent
ALL       1157   46%
POS       836    33%
NEG       904    36%

• Reference Standard Dataset:
  – 2539 sentences in total
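Cohen's kappa corrects raw agreement for the agreement expected by chance, and the confidence intervals above come from bootstrapping. The sketch below shows one common way to compute both; the percentile interval, the fixed seed, and the toy labels are assumptions, since the slides only say that 1,000 bootstrap samples were used.

```python
import random
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

def bootstrap_ci(labels_a, labels_b, n_boot=1000, seed=0):
    """Percentile 95% interval from resampling items with replacement."""
    rng = random.Random(seed)
    n = len(labels_a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([labels_a[i] for i in idx],
                                  [labels_b[i] for i in idx]))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Hypothetical binary annotations (1 = sentence is EFLS), illustration only.
annotator_a = [1, 1, 0, 0] * 5
annotator_b = [1, 0, 0, 0] * 5

kappa = cohens_kappa(annotator_a, annotator_b)
low, high = bootstrap_ci(annotator_a, annotator_b)
```

Here raw agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5, which illustrates why the kappa column sits below the agreement column in the table.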
Experiment 1: Sentence-Level Evaluation
Model      Accuracy†  F-Measure‡  Recall‡  Precision‡
LASSO      67.1%      66.5%       83.8%    55.1%
ENET75     69.3%      68.0%       87.7%    55.6%
ENET50     68.9%      68.7%       90.5%    55.4%
ENET25     69.4%      68.9%       91.2%    55.4%
SVM        69.5%      70.2%       83.9%    60.3%
SVM w/IG   69.1%      68.9%       84.3%    58.3%
FKC        64.7%      50.9%       69.7%    40.1%
OF_PN      54.8%      27.9%       19.1%    51.4%
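The F-Measure column is consistent with the balanced F-measure, the harmonic mean of precision and recall. The sketch below recomputes it from the table's precision and recall values; that the slides use the balanced (F1) form is an assumption, supported by agreement with the reported figures up to rounding.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) copied from the table above.
reported = {
    "LASSO":    (0.551, 0.838, 0.665),
    "ENET75":   (0.556, 0.877, 0.680),
    "ENET50":   (0.554, 0.905, 0.687),
    "ENET25":   (0.554, 0.912, 0.689),
    "SVM":      (0.603, 0.839, 0.702),
    "SVM w/IG": (0.583, 0.843, 0.689),
    "FKC":      (0.401, 0.697, 0.509),
    "OF_PN":    (0.514, 0.191, 0.279),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in reported.items()}
```

The high-recall, lower-precision pattern of the regularized models shows up directly: recall near 90% cannot lift the F-measure much above 69% while precision stays near 55%.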
EFLS Impacts: Hypotheses
Development

Theoretical framework (Easley and O’Hara, 2004)
• There are $I_k$ signals for stock $k$: $(s_{k1}, s_{k2}, \ldots, s_{kI_k})$
• $s_{ki} \sim N\left(v_k, \frac{1}{\gamma_k}\right)$
• Private signals: $(s_{k1}, s_{k2}, s_{k3}, \ldots, s_{k(\alpha_k I_k)})$
• Public signals: $(s_{k(\alpha_k I_k + 1)}, \ldots, s_{k(I_k - 1)}, s_{kI_k})$
• $\alpha_k$: The relative amount of private-versus-public information
Hypotheses Development (Cont’d.)

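Hypothesis 1: Firms with lower EFLS intensity are associated with higher expected return.
$$\frac{\partial E[v_k - p_k]}{\partial \alpha_k} = \frac{\delta x_k (1 - \mu_k) I_k \gamma_k}{C_k^2 \left(1 + \alpha_k I_k \eta_k \mu_k^2 \gamma_k \sigma^{-2}\right)^2} > 0$$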
Hypotheses Development (Cont’d.)

Hypothesis 2: Firms with lower EFLS intensity
are associated with higher stock volatility.
𝜕𝑉𝑎𝑟(𝑣𝑘 − 𝑝𝑘 )
𝛿 4 𝛾𝑘 𝐼𝑘 1 − 𝜇𝑘 2𝛿 4 + 𝑉1,𝑘 + 𝑉2,𝑘
=
𝜕𝛼𝑘
𝜂𝑘 𝛿 2 𝜌𝑘 + 𝛾𝑘 𝐼𝑘 (1 + 𝛼𝑘 (𝜇𝑘 − 1)) + 𝛼𝑘 𝜂𝑘 𝛾𝑘 𝐼𝑘 𝜇𝑘2 (𝛾𝑘 𝐼𝑘 + 𝜌𝑘 )
𝑉1,𝑘 =
𝛾𝑘 𝐼𝑘 − 𝜌𝑘 + 𝜇𝑘 𝛾𝑘 𝐼𝑘 + 𝜌𝑘
3
𝛼𝑘 𝜂𝑘2 𝐼𝑘 𝛾𝑘 𝜇𝑘2 + 𝛿 2 𝜂𝑘
𝑉2,𝑘 = −1 + 2𝜇𝑘 + 𝜇𝑘2 𝛿 2 𝜂𝑘 𝛾𝑘 𝐼𝑘 𝛼𝑘
If $I_k \gamma_k > \rho_k$ and $\mu_k > \sqrt{2} - 1$, then $\frac{\partial Var(v_k - p_k)}{\partial \alpha_k} > 0$

Intuition: if there are enough signals and the fraction of informed investors is larger than 41% ($\sqrt{2} - 1 \approx 0.414$), then firms with lower amounts of EFLS → higher volatility
Control Variables
Variable: Definition
𝑁𝑒𝑤𝑠𝐹𝑟𝑒𝑞𝑖,𝑡: Number of news articles mentioning firm i in month t.
𝐿𝑜𝑔𝑆𝑖𝑧𝑒𝑖,𝑡: Logarithm of market value, computed using the closing market price of month t-1.
𝐿𝑜𝑔𝐵𝑀𝑖,𝑡: Logarithm of book-to-market ratio, computed following Fama and French (1993).
𝐿𝑜𝑔𝑉𝑜𝑙𝑢𝑚𝑒𝑖,𝑡: Log(Dollar trading volume of firm i in month t)
𝐿𝑜𝑔𝑉𝑖,𝑡: Log(variance); variance of firm i in month t is computed using daily stock returns.
𝐼𝑛𝑑𝑣𝑂𝑤𝑛𝑖,𝑡: Proportion of individual ownership of stock i, using the latest available data, computed by aggregating 13f filings (Fang and Peress 2009).
𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝐶𝑜𝑣𝑒𝑟𝑖,𝑡: Log(1+number of analysts covering firm i in month t).
𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝑆𝐷𝑖,𝑡: Log(1+standard deviation of analysts’ earnings predictions).
Firm-Level Performance Evaluation
(Cont’d.)

Empirical Model 1:
Hypothesis 1 Predicts Negative b1
𝑟𝑖,𝑡+1 = 𝑎0 + b1 𝐴𝐿𝐿_𝐼𝑁𝑖,𝑡 + 𝑐1 𝑁𝑒𝑤𝑠𝐹𝑟𝑒𝑞𝑖,𝑡 + 𝑐2 𝑁𝑒𝑤𝑠𝑆𝑒𝑛𝑡𝑖𝑖,𝑡 +
𝑑1 𝐿𝑜𝑔𝑆𝑖𝑧𝑒𝑖,𝑡 + 𝑑2 𝐿𝑜𝑔𝐵𝑀𝑖,𝑡 + 𝑑3 𝑟𝑖,𝑡 + 𝑑4 𝐿𝑜𝑔𝑉𝑖,𝑡 + 𝑒𝑖𝑡

Empirical Model 2:
Hypothesis 2 Predicts b1 ≠ 0
𝐿𝑜𝑔𝑉𝑖,𝑡+1
= 𝑎0 + b1 ALL_INi,t + 𝑐1 𝑁𝑒𝑤𝑠𝐹𝑟𝑒𝑞𝑖,𝑡 + 𝑐2 𝑁𝑒𝑤𝑠𝑆𝑒𝑛𝑡𝑖𝑖,𝑡 +
𝑑1 𝐿𝑜𝑔𝑉𝑜𝑙𝑢𝑚𝑒𝑖,𝑡 + 𝑑2 𝐿𝑜𝑔𝑉𝑖,𝑡 + 𝑑3 𝐿𝑜𝑔𝑆𝑖𝑧𝑒𝑖,𝑡 +
𝑑4 𝐿𝑜𝑔𝐵𝑀𝑖,𝑡 + 𝑑5 𝑟i,t + 𝑑6 𝐼𝑛𝑑𝑣𝑂𝑤𝑛𝑖,𝑡 +
𝑑7 𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝐶𝑜𝑣𝑒𝑟𝑖,𝑡 + 𝑑8 𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝑆𝐷𝑖,𝑡 + 𝑒𝑖,𝑡
Experiment Two: Firm-Level Evaluation

• Research Testbed: January 1986 to May 2008, 1,134,321 Wall Street Journal news articles
  – Merged with CRSP, Compustat, and IBES
  – Stocks priced below $5 at the end of a month were removed (Cohen and Frazzini 2008; Fang and Peress 2009)
  – 1,274,711 firm-months, spanning 269 months
Expected Return and EFLS Intensity
Variable          (1)          (2)          (3)
EFLS intensity    -0.0026*     -0.0052**    -0.0039
Control Variables
𝑁𝑒𝑤𝑠𝐹𝑟𝑒𝑞𝑖,𝑡        0.00069***   0.00068***   0.00067***
𝑁𝑒𝑤𝑠𝑆𝑒𝑛𝑡𝑖𝑖,𝑡       -0.00081     -0.0012      -0.0015
𝐿𝑜𝑔𝑆𝑖𝑧𝑒𝑖,𝑡         -0.0019**    -0.0019***   -0.0019***
𝐿𝑜𝑔𝐵𝑀𝑖,𝑡           0.0025***    0.0025***    0.0025***
𝑟𝑖,𝑡               -0.046***    -0.046***    -0.046***
𝐿𝑜𝑔𝑉𝑖,𝑡            0.00042      0.00042      0.00042
Intercept         0.039***     0.039***     0.039***
𝑅2                0.0031       0.0031       0.0031
***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.
Volatility and EFLS Intensity
Variable            Model 2A (𝐴𝐿𝐿_𝐼𝑁𝑖,𝑡)   Model 2B (𝐹𝑇_𝐼𝑁𝑖,𝑡)   Model 2C (𝐸𝑈_𝐼𝑁𝑖,𝑡)
EFLS intensity      -0.074***            -0.196***           -0.254***
Control Variables
𝑁𝑒𝑤𝑠𝐹𝑟𝑒𝑞𝑖,𝑡          0.012***             0.012***            0.012***
𝑁𝑒𝑤𝑠𝑆𝑒𝑛𝑡𝑖𝑖,𝑡         -0.105***            -0.103***           -0.110***
𝐿𝑜𝑔𝑉𝑜𝑙𝑢𝑚𝑒𝑖,𝑡         0.108***             0.108***            0.108***
𝐿𝑜𝑔𝑉𝑖,𝑡              0.565***             0.565***            0.565***
𝐿𝑜𝑔𝑆𝑖𝑧𝑒𝑖,𝑡           -0.222***            -0.222***           -0.222***
𝐿𝑜𝑔𝐵𝑀𝑖,𝑡             -0.066***            -0.066***           -0.066***
𝑟𝑖,𝑡                 -0.615***            -0.615***           -0.616***
𝐼𝑛𝑑𝑣𝑂𝑤𝑛𝑖,𝑡           0.071***             0.071***            0.071***
𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝐶𝑜𝑣𝑒𝑟𝑖,𝑡     0.016***             0.017***            0.017***
𝐿𝑜𝑔𝐴𝑛𝑎𝑙𝑦𝑆𝐷𝑖,𝑡        0.095***             0.095***            0.095***
Intercept           -1.568***            -1.566***           -1.566***
𝑅2                  0.57                 0.57                0.57
***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.
Take-Away and WIP (20%)






• Mass and social media texts provide additional signals for market prediction (in addition to numbers)
• Message volume is important; aggregate sentiment may not be (EMH)
• Business sentiment processing is difficult; may require additional content pre-processing (stakeholder; EFLS)
• Predicting return is hard; predicting volatility is easier (VIX Chicago Board)
• Large-scale stock news tracking and text analytics can be automated
• Trading windows; decay function; targeted sentiment; extensive trading periods (up/down); industry and news category (oil/banking); firm & index size (Russell/NYSE); emerging markets (China)
  → All the firms (10K), all the news (1M each), all the time ???
  → Trading strategy ???
[Figure: AZ BIZ INTEL System Design. Data Sources for US Public Companies]
Data Collection
  – Predefined data sources: SEC/Edgar, NYSE.com, NASDAQ.com, Finance.Yahoo.com
  – Company Information Database: ticker, CIK, CUSIP, company name, PERMNO; company keywords; basic information
  – Dynamic data sources: Yahoo Finance forums, company websites, Twitter, stock exchange, WSJ, search engines, 10K reports, blogs, news
Data Processing: transformation/integration
Data Analysis: performance indicators; topics & sentiments; time series / burst; risk model; SNA
Analytic Approaches: single media analysis; cross media analysis; predictive analysis; simulated trading
Interactive Applications: static figures/dashboards; visualization
Hsinchun Chen, Ph.D.
Artificial Intelligence Lab, University of
Arizona
hchen@eller.arizona.edu
http://ai.arizona.edu