Competitive Intelligence

Competitive Intelligence and the Web
Presented at
AMCIS 2003
Tampa, Florida
by
Dr. Robert J. Boncella
Washburn University
Competitive Intelligence
“the process of ethically collecting,
analyzing and disseminating accurate,
relevant, specific, timely, foresighted and
actionable intelligence regarding the
implications of the business environment,
competitors and the organization itself”
Competitive Intelligence Process
– Planning and direction
• working with decision makers to discover and hone their intelligence needs
– Collection activities
• conducted legally and ethically
– Analysis
• interpreting data and compiling recommended actions
– Dissemination
• presenting findings to decision makers
– Feedback
• taking into account the response of decision makers and their needs for
continued intelligence
CI and The Web
• A business Web site will contain a variety of useful information:
– company history, corporate overviews, business visions
– product overviews, financial data, sales figures
– annual reports, press releases, biographies of top executives,
locations of offices, and hiring ads
– An example of this information is
http://www.google.com/about.html
• This information is, for the most part, free.
• Access to open sources does not require proprietary software,
unlike a number of commercial databases.
The Web Structure and
Information Retrieval
• HTTP protocol and the use of Uniform Resource
Locators (URLs)
• Mathematical network of nodes and arcs
• Information Retrieval (IR)
– follows the links (arcs)
– from document to document (node to node)
• Retrieves documents so their content can be
evaluated and a new set of URLs becomes
available to follow
Issues Associated With
CI and The Web
• Information Gathering
• Information Analysis
• Information Verification
• Information Security
Information Gathering
General Web Search Engines
• Architecture
– Web Crawlers (Web Spiders) collect Web pages using
graph searching techniques
– An indexing method indexes collected Web pages
and stores the indices in a database
– Retrieval and ranking methods retrieve search results
from the database and present ranked results to users
– A user interface
• allows users to query the database and customize their searches
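The crawler step above can be sketched as a breadth-first graph search: visit a page, index its text, then follow its links. This is a minimal sketch in Python; the in-memory `PAGES` dictionary and its URLs are hypothetical stand-ins for real HTTP fetches.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://a.example/": ("acme corporate overview", ["http://a.example/products"]),
    "http://a.example/products": ("product overviews and sales figures", ["http://a.example/"]),
}

def crawl(seed, fetch=PAGES.get, max_pages=100):
    """Breadth-first graph search: index each page, then follow its links."""
    index = {}                       # URL -> page text (the "database")
    frontier = deque([seed])
    seen = {seed}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        index[url] = text            # indexing step (here: store raw text)
        for link in links:
            if link not in seen:     # avoid revisiting nodes in the graph
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("http://a.example/")
```

A real crawler would fetch pages over HTTP and use a proper inverted index rather than raw text, but the graph-search control flow is the same.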
Domain Specific Web Search Engines
• Northern Light is a search engine for commercial
publications in the domains of business and general
interest.
• EDGAR is the United States Securities and Exchange
Commission's clearinghouse of publicly available
company information and filings.
• Westlaw is a search engine for legal materials.
• OVID Technologies provides a user interface that unifies
searching across many subfields and databases of medical
information.
Meta-search engine
• Upon receipt of a query, connects to several
general search engines
• Returns integrated results of searches
• examples
– www.metacrawler.com
– www.dogpile.com
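The merge step of a meta-search engine can be sketched as a round-robin interleave of the underlying engines' ranked lists, with duplicates dropped. The two result lists below are hypothetical; real engines would return them over the network.

```python
# Two hypothetical engines' ranked result lists for the same query.
engine_a = ["http://x.example/1", "http://y.example/2", "http://z.example/3"]
engine_b = ["http://y.example/2", "http://w.example/4"]

def merge_results(*ranked_lists):
    """Round-robin interleave the engines' rankings, dropping duplicates."""
    merged, seen = [], set()
    for rank in range(max(len(r) for r in ranked_lists)):
        for results in ranked_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

merged = merge_results(engine_a, engine_b)
```

Production meta-search engines use more sophisticated score normalization, but interleaving illustrates the "integrated results" idea.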
Difficulties with Information Gathering
• Time to carry out search
• Number of pages returned
• Currency of information
• Accessible pages
– Web contains 552.5 billion pages
– Growth rate of 7.3 million per day
• "Surface Web" vs. "Deep Web"
– Surface Web pages freely available to public
– Deep Web
• dynamic pages, intranets & proprietary databases
– Surface Web contains about 2.5 billion pages
– Deep Web contains about 550 billion pages (200 times more)
• Charge for Web retrieval
Information Analysis
(Web Mining)
Web Page Content
• Focused Spiders (On Line)
– Return Appropriate Set of Pages
• Intelligent Agent
• User Interface
– CI Spider by Chau & Chen - University of Arizona
– Answers On-line by Answer Chase
Search Result Mining
• Text Mining (Off Line)
– Automate the task of organizing and summarizing
numerous pages
– Requires automated analysis of natural language
texts
– Commercially available text mining applications, e.g.,
TextAnalyst by Megaputer
– ANN solution SITEX by Fukuda et al.
Web Structure
– Page Rank
• Utilized in keyword searching of the Web
• Measure of the number of "back links" to a page
• Importance of a page determined by the number of links to the page
• Page's priority determined by this measure
• Implemented in the Google search engine
– Hyperlink-Induced Topic Search (HITS)
• Hub & Authority measures associated with a page
– Hub - a page that contains links to authoritative pages
– Authoritative - best pages (sources) for requested information
• Starts with a keyword search that returns a set of pages
– hubs and authoritative pages
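The PageRank idea above - a page is important if many important pages link to it - can be sketched with a few lines of power iteration. The three-node link graph is a toy example; the damping factor of 0.85 is the value commonly cited for the Google implementation.

```python
# Toy link graph: node -> list of nodes it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration: each page spreads its rank evenly over its out-links."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page keeps a base share, plus damped rank from its back links.
        new = {node: (1.0 - damping) / n for node in nodes}
        for node, outs in links.items():
            for out in outs:
                new[out] += damping * rank[node] / len(outs)
        rank = new
    return rank

rank = pagerank(links)
# C has the most back links (from both A and B), so it ranks highest.
```

HITS differs in that it computes two scores per page (hub and authority) over a query-specific subgraph rather than one global score.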
Web Usage
– Data mining on Web logs
– Web logs contain “clickstream” data
• Server side
– Information about pages provided
• Client side
– Information about pages requested
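Server-side clickstream mining at its simplest is tallying which pages were served from the Web log. A minimal sketch; the log lines and their timestamp/IP/path layout are hypothetical, as real server log formats vary.

```python
from collections import Counter

# Hypothetical server-side log lines: timestamp, client IP, requested page.
log = [
    "2003-04-18T10:00:01 10.0.0.5 /products.html",
    "2003-04-18T10:00:09 10.0.0.5 /pricing.html",
    "2003-04-18T10:01:30 10.0.0.7 /products.html",
]

def page_counts(log):
    """Tally how often each page was served - the simplest Web-usage measure."""
    return Counter(line.split()[2] for line in log)

counts = page_counts(log)
```

Grouping the same lines by client IP and ordering by timestamp would reconstruct per-visitor click paths, the next step in Web-usage mining.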
Information Verification
Techniques to Verify Accuracy of Information
• Deep web sources more reliable than surface web sources
• Confirm with non-web source
• Answer the following
– Who is the author?
– Who maintains the web site?
– How current is the web page?
• Observe the Top Level Domain (TLD) of the URL
– “~” within URL denotes a personal web page
Domain Names
• Original TLDs
– .com
– .edu
– .gov
– .net
– .org
• New TLDs
– .aero (for the air-transport industry)
– .biz (for businesses)
– .coop (for cooperatives)
– .info (for all uses)
– .museum (for museums)
– .name (for individuals)
– .pro (for professions)
Information Security
Information Security Issues
• Assuring the privacy and integrity of private information
– Managed with usual computer and network security methods
• Assuring the accuracy of a firm's public information
– Defend against:
• Web hijacking
• Web defacing
• Cognitive hacking (semantic attack)
• Negative information
– Reference - Cybenko, Giani, & Thompson
• Avoiding unintentionally revealing information that ought to
be private
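One way to monitor the integrity of a firm's public pages (an illustrative technique, not one named in the slides) is to store a cryptographic digest of each page's published content and periodically re-check it; any defacement changes the digest. A minimal sketch using SHA-256:

```python
import hashlib

def fingerprint(content: bytes) -> str:
    """Digest of a page's content; any change to the page changes the digest."""
    return hashlib.sha256(content).hexdigest()

# Baseline digest recorded when the page was legitimately published.
baseline = fingerprint(b"<html>Official press release</html>")

def is_tampered(current: bytes, baseline: str) -> bool:
    """Re-fetch the page later and compare against the stored baseline."""
    return fingerprint(current) != baseline
```

This catches any byte-level change, including legitimate edits, so in practice the baseline must be updated whenever the page is intentionally revised.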
Web Hijacking
Due to a bug in CNN's software, when people at the spoofed site
clicked on the "E-mail This" link, the real CNN system distributed a
real CNN e-mail to recipients with a link to the spoofed page.

With each click at the bogus site, the real site's tally of most
popular stories was incremented for the bogus story.

Allegedly this hoax was started by a researcher who sent the spoofed
story to three users of AOL's Instant Messenger chat software.
Within 12 hours more than 150,000 people had viewed the spoofed page.
Web Defacing
In February 2001 the New York Times web site was defaced by a
hacker identified as "splurge" from a group called "Sm0ked Crew",
which had a few days previously defaced sites belonging to
Hewlett-Packard, Compaq, and Intel.

THE-REV | SPLURGE
Sm0ked crew is back and better than ever!
"Well, admin I'm sorry to say by you have just got
sm0ked by splurge. Don't be scared though,
everything will be all right, first fire your current
security advisor . . ."
Cognitive Hacking
• Cognitive hacking is the manipulation of perception.
• Causes
– disgruntled customers/employees
– competition
– random act of vandalism
Two types of cognitive hacking
• Single-source cognitive hacking
– occurs when a reader reads information but does not
know who posted it and has no way of verifying
the information or contacting its author
• Multiple-source cognitive hacking
– occurs when there are several sources for a
topic; this becomes a concern when the
information is not accurate
Categories of Cognitive Attacks
• Overt
– No attempt is made to conceal overt cognitive attacks
• website defacements.
• Covert
– Provision of misinformation
• the intentional distribution or insertion of false or misleading
information intended to influence readers' decisions and/or
activities
Emulex & Mark Jakob
• On 8/25/2000 a press release distributed by financial news
services stated that Emulex revised its per share gain to a
per share loss
• Price per share of Emulex moved from $104.00 to $43.00
in 16 minutes
• The press release was false - fabricated by Mark Jakob,
who was on the wrong side of a stock short sale.
• Jakob launched the press release via Internet Wire, an LA-based
firm that distributes press releases.
The Jonathan Lebed Case
According to the US Securities and Exchange Commission,
15-year-old Jonathan Lebed earned between $12,000 and $74,000
daily over six months - for a total gain of $800,000. Lebed would
buy a block of FTEC stock and then, using only AOL accounts with
fictitious names, he would post a message like the one in the next
text box. Doing this a number of times, he increased the daily
trading volume of FTEC from 60,000 shares to more than one million.
DATE: 2/03/00 3:43pm Pacific Standard Time
FROM: LebedTG1
FTEC is starting to break out! Next week, this thing will EXPLODE . . .
Currently FTEC is trading for just $2 1/2. I am expecting to see FTEC at
$20 VERY SOON . . .
Let me explain why . . .
Revenues for the year should very conservatively be around $20 million.
The average company in the industry trades with a price/sales
ratio of 3.45. With 1.57 million shares outstanding, this will value FTEC
at . . . $44. It is very possible that FTEC will see $44, but since I would
like to remain very conservative . . . my short term price target on
FTEC is still $20!
The FTEC offices are extremely busy . . . I am hearing that a number of
HUGE deals are being worked on. Once we get some news from FTEC
and the word gets out about the company . . . it will take-off to MUCH
HIGHER LEVELS!
I see little risk when purchasing FTEC at these DIRT-CHEAP PRICES.
FTEC is making TREMENDOUS PROFITS and is trading UNDER
BOOK VALUE!!!
This is the #1 INDUSTRY you can POSSIBLY be in RIGHT NOW.
There are thousands of schools nationwide who need FTEC to install
security systems . . . You can’t find a better positioned company than
FTEC!
These prices are GROUND-FLOOR! My prediction is that this will be
the #1 performing stock on the NASDAQ in 2000. I am loading up with
all of the shares of FTEC I possibly can before it makes a run to $20.
Be sure to take the time to do your research on FTEC! You will probably
never come across an opportunity this HUGE ever again in your
entire life.
POSSIBLE COUNTERMEASURES
• Single source
– Authentication of source
– Information "trajectory" modeling
– Ulam games
• Multiple Sources
– Source Reliability via Collaborative Filtering and
Reliability reporting
– Detection of Collusion by Information Sources
– Byzantine Generals Models
Countermeasures: Single Source
• Authentication of Source
– Due diligence
– Implied verification - PKI (Digital Signature)
• Information Trajectory
– Variation on a theme
• e.g., the Lebed case was a variation of the "pump & dump"
scheme
• Ulam Games
– Model that assumes some information is false
– How fast can that be determined using
questions & answers from the source?
Countermeasures: Multiple Sources
• Collaborative filtering and reliability reporting
– when a site keeps records and uses those records to verify future
claims by those with access to publishing on the site.
• Detection of Collusion by Information Sources
– Linguistic analysis
– Determine if different sources are by the same author
• Byzantine generals model
– message communicating system has two types of processes:
reliable and unreliable.
– Given a number of processes from this system, determine the
type of each process.
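The reliability-reporting idea above - keep records and use them to score future claims - can be sketched as a simple track-record filter. The source names and their claim histories below are hypothetical.

```python
# Hypothetical record per source: whether each past claim was later verified.
history = {
    "source_a": [True, True, True, False],
    "source_b": [False, False, True],
}

def reliability(history):
    """Fraction of each source's past claims that checked out."""
    return {src: sum(claims) / len(claims) for src, claims in history.items()}

def trusted(history, threshold=0.5):
    """Sources whose track record clears the threshold."""
    return {src for src, score in reliability(history).items() if score > threshold}
```

A collaborative-filtering system would aggregate such scores from many raters rather than a single site's records, but the principle is the same.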
Countermeasures: Negative Information
• Monitor Web Sites
– 5360 URLs with the phrase "Microsoft sucks"
– Use an IA to monitor
– Text mining for type of negative information
– Respond accordingly
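The monitoring step can be sketched as a simple phrase screen over crawled pages - a crude stand-in for the text mining the slide describes. The page URLs, texts, and the phrase list are all hypothetical.

```python
# Hypothetical crawled pages to screen for negative phrases about a firm.
pages = {
    "http://forum.example/t1": "acme widgets are great",
    "http://forum.example/t2": "acme sucks, avoid their support",
}

NEGATIVE_PHRASES = ("sucks", "avoid", "scam")

def flag_negative(pages, phrases=NEGATIVE_PHRASES):
    """Return URLs whose text contains any of the negative phrases."""
    return [url for url, text in pages.items()
            if any(p in text.lower() for p in phrases)]

flagged = flag_negative(pages)
```

Real text mining would classify the *type* of negative information (complaint, rumor, defamation) rather than just matching keywords.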
Countermeasures:
Unintentional Disclosure
• Carry out a CI project against yourself
Conclusions
• Reconcile "deep web" vs. "surface web"
• Determine when all pages are needed vs.
the "right" set of pages
• Automate "authoritative page selection"
– "Consumer Reports" type process
– e.g., posting a Web page in the early '90s (Yahoo)
• Automate detection of
– false information
– inaccurate information
– negative information
Slides:
http://www.washburn.edu/cas/cis/boncella
E-mail:
bob.boncella@washburn.edu
References
Aaron, R. D. and Naylor, E. “Tools for Searching the ‘Deep Web’ ”, Competitive Intelligence Magazine, (4:4),
Online at http://www.scip.org/news/cimagazine_article.asp?id=156. (date of access April 18, 2003).
Calishain, T. and Dornfest, R. (2003) Google Hacks: 100 Industrial-Strength Tips & Tools, Sebastopol, CA:
O’Reilly & Associates.
Chakrabarti, S. (2003) Mining the Web: Discovering Knowledge from Hypertext Data, San Francisco, CA:
Morgan Kaufmann.
Chen, H., Chau, M., and Zeng, D. (2002) "CI Spider: A Tool for Competitive Intelligence on the Web",
Decision Support Systems, (34:1) pp. 1-17.
Cybenko, G., Giani, A., and Thompson, P. (2002) “Cognitive Hacking: A Battle for the Mind”, IEEE Computer
(35:8) August, pp. 50–56.
Dunham, M. H. (2003) Data Mining: Introductory and Advanced Topics, Upper Saddle River, NJ: Prentice
Hall.
Fleisher, C. S. and Bensoussan, B. E. (2003) Strategic and Competitive Analysis, Upper Saddle River, NJ:
Prentice Hall.
Fuld, L. (1995) The New Competitor Intelligence, New York: Wiley.
Herring, J. P. (1998) "What Is Intelligence Analysis?", Competitive Intelligence Magazine, (1:2), pp. 13-16.
http://www.scip.org/news/cimagazine_article.asp?id=196
References
Kleinberg, J. M. (1999), “Authoritative Sources in a Hyperlinked Environment”, Journal of the ACM (46:5),
pp. 604-632, September.
Krasnow, J. D. (2000), “The Competitive Intelligence and National Security Threat from Website Job Listings”
http://csrc.nist.gov/nissc/2000/proceedings/papers/600.pdf. (date of access April 18, 2003).
Lyman, P. and Varian, H.R. (2000) "Internet Summary", Berkeley, CA: How Much Information Project,
University of California, Berkeley, http://www.sims.berkeley.edu/research/projects/how-much-info/internet.html. (date of access April 18, 2003).
Murray, M. and Narayanaswamy, R. (2003) “The Development of a Taxonomy of Pricing Structures to
Support the Emerging E-business Model of ‘Some Free, Some Fee’”, Proceedings of SAIS 2003, pp. 51-54.
Brin, S. and Page, L. (1998) "The Anatomy of a Large-Scale Hypertextual Web Search Engine",
http://www-db.stanford.edu/~backrub/google.html. (date of access April 22, 2003).
Schneier, Bruce (2000) “Semantic Attacks: The Third Wave of Network Attacks”, Crypto-gram Newsletter,
October 15, 2000, http://www.counterpane.com/crypto-gram-0010.html. (Date of access April 18, 2003).
SCIP (Society of Competitive Intelligence Professionals) http://www.scip.org/. (date of access April 18, 2003).