searchnet15-RPL - Hobart and William Smith Colleges

Search and the ‘Net in 2015
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
For Rochester Public Library Staff
For today . . .
 The Searchscape
 Behind the Screen:
Current Web Search Developments
 New Services
 The Social Web and Research
 Bing, Yahoo and DuckDuckGo
 Google
 Free Digital Collections
 Linklist
http://people.hws.edu/hunter/searchnet15links.htm
E-Reading Rises as Device Ownership Jumps
By Kathryn Zickuhr and Lee Rainie
http://www.pewinternet.org/2014/01/16/e-reading-rises-as-device-ownership-jumps
[Chart: American adults 18+, % who read at least 1 book in that year]
[Chart: American adults 18+, % who own each device]
New Top Level Domains
 First made available 1/29/14
 Over 150 now live on donuts.co (2/15/15)
 Content-significant
    .bike, .energy, .delivery, .legal, .guru
 Brand-specific – “vanity domains”
    .android, .walmart, .nyc
 Allow for non-Roman scripts (Arabic, Chinese, etc.)
 Require proof of identity/relationship to TLD
 Applying for a unique TLD costs $185,000
Growth of Query Types over 1 year
http://searchenginewatch.com/sew/how-to/2383498/how-will-voice-search-impact-a-search-marketers-world
Web Access in 2015
Mobile has outpaced Desktop
http://www.comscore.com/Insights/Presentations_and_Whitepapers/2013/The_Digital_World_in_Focus
Web Search in 2015
Who’s crawling the Web?
 Google
 Bing (aka Yahoo!)
 Gigablast
 DuckDuckGo
 Baidu
 Yandex
Market Share Growth Oct. 2013– Oct. 2014
www.comscore.com
[Bar chart: search market share (%), 2013 vs. 2014, for Google, Bing, Yahoo!, Ask and AOL]
Behind the Screen:
WAY beyond matching keywords
Semantic Processing: internal to the search engine
Predictive Operations: from data about and from the user
ANSWERS, NOT JUST SEARCH RESULTS
Semantic Processes
NLP Parsing
Knowledgebase Entities
Term Frequency Data
Pattern Matching
Structured Data
NLP Parsing
 Machine-learned meaning derived from
human or natural language speech or text.
(Adapted from Wikipedia)
 Analysis of large sets of documents (corpora)
that have been human-annotated with parts
of speech and other semantic information
 Machine “learns” the relationships and
meaning through statistical inference
 Visualization at http://nlpviz.bpodgursky.com
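As a rough illustration of "learning" from a human-annotated corpus by statistical inference, the Python sketch below counts how often each word carries each part-of-speech tag in a tiny hand-tagged corpus, then tags new text with the most frequent tag. The corpus, the tag names and the function names are all invented for illustration; real parsers train on far larger corpora and use much richer models.

```python
from collections import Counter, defaultdict

# Tiny hand-annotated corpus (invented for illustration): (word, tag) pairs.
ANNOTATED = [
    [("the", "DET"), ("library", "NOUN"), ("opens", "VERB")],
    [("the", "DET"), ("search", "NOUN"), ("works", "VERB")],
    [("librarians", "NOUN"), ("search", "VERB"), ("the", "DET"), ("web", "NOUN")],
    [("patrons", "NOUN"), ("search", "VERB"), ("the", "DET"), ("catalog", "NOUN")],
]

def train(sentences):
    """Count how often each word appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, pos in sentence:
            counts[word.lower()][pos] += 1
    return counts

def tag_text(text, counts, default="NOUN"):
    """Tag each word with the tag it carried most often in training."""
    result = []
    for word in text.lower().split():
        if word in counts:
            result.append((word, counts[word].most_common(1)[0][0]))
        else:
            # Unseen word: fall back to a default guess.
            result.append((word, default))
    return result

model = train(ANNOTATED)
tagged = tag_text("patrons search the library", model)
print(tagged)
```

Here "search" is tagged as a verb because it appears as a verb twice and as a noun once in the training data. The choice comes from counts, not from a hand-written rule.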
Knowledgebase Entities
Google’s Knowledge Graph – Bing’s Satori
 Google’s Knowledge Graph – rooted in the
(human) community-created entities in
Freebase
 Crowdsourcing too slow; often ignores
specialized areas of knowledge, non-English
content
 Knowledge Vault – Automated extraction of
raw data and creation of entities derived from
that data
DOM trees: structures that help browsers represent and interact with documents in HTML and other formats (Wikipedia)
More Semantic Processing…
 Term Frequency Data
    Frequency, proximity, order
    Aids in discovery across subject areas, filetypes and entire domains
 Pattern Matching Algorithms
    Focus on recognition of patterns and regularities in text, data and images
 Structured Data
    Structured Web tables and data sets (.xls, .kml, .sdf)
    Human-created tags – Schema.org
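The three term-frequency signals named above (frequency, proximity, order) can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sample sentence are made up, and nothing here reflects how any particular search engine actually computes or weights these signals.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, term):
    """All token positions at which the term occurs."""
    return [i for i, t in enumerate(tokens) if t == term]

def min_distance(tokens, first, second):
    """Proximity: smallest gap in token positions between two terms (None if either is absent)."""
    pos_a, pos_b = positions(tokens, first), positions(tokens, second)
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def appears_in_order(tokens, first, second):
    """Order: True if some occurrence of `first` precedes some occurrence of `second`."""
    pos_a, pos_b = positions(tokens, first), positions(tokens, second)
    return bool(pos_a and pos_b) and min(pos_a) < max(pos_b)

# Sample text invented for illustration.
doc = "Web search engines rank pages. A search engine may weigh how often search terms appear."
tokens = tokenize(doc)
freq = Counter(tokens)  # Frequency: occurrences of each term

print(freq["search"])
print(min_distance(tokens, "web", "search"))
print(appears_in_order(tokens, "terms", "web"))
```

A document in which the query terms occur often, close together and in the order typed would score well on all three signals.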
Predictive Operations: Inferring the user’s intent
“The Holy Grail of Search”
 Location-based results – IP and GPS
    Weather, entertainment, restaurants...
 Anonymous past searches and user behavior
 Personal data volunteered by user
 Time of day
 Device used
Semantic Processing and Predictive Operations
--Correctly interpret the query, or a portion of the query
--Give a “best guess” answer based on highly trusted sources (knowledgebase) and similar searches
--Aggregate and grow the knowledgebase through iterative, real-time web crawls
Discovery Apps:
Personalized Search on Steroids
 Combines your
    Personal preferences
    Location
    Demographic characteristics
    Social network data
        People, Preferences, Interests, Events
 Suggests entertainment, restaurants and more
 Chat with your social network friends
 “Current events you may like within X miles”
 Gravy – Free on iTunes
Personal Assistant Apps
 Connects to your
    E-mail
    Calendar
    Facebook events
 Prompts for transportation times, quickest routes
 Includes some discovery and chat features
 Relies heavily on user-supplied personal data
 Sunrise, Tempo, et al.
Apps and the Deep Web
 Currently, crawler-based search engines cannot access content in apps unless the app allows it
    Posts
    Links
    Personal data
 User must have the app loaded in order to access content, even if it appears in the search engine's results
 Education apps continue to grow in content, quality and use
 Google is working on it...
New Services
Qwant
A fresh approach to search
 Aims to offer a European-based service that respects users' privacy
 No cookies or other tracking of users' search behavior
 No filtering of content unless user-initiated
 Launched in France in 2013
 Search verticals offered:
    Web, News, Social, Images, Videos, Shopping, Boards (online forums, mostly European)
 16 interface languages, which influence search results
Instya meta engine
www.instya.com
 Launched April 2015
 Results from each source appear in their own browser tab
 Sources include
    Web (7), Image (8), News (11), Video (7), Shopping (11), Dictionary (14), Answers (8), Social (11)
 Domain search offers website data, analysis
    7 backlink sources
    6 website stats
    10 domain information sites
CC Search
search.creativecommons.org/
 Searches for media that is Creative Commons-licensed or in the public domain
 Flickr, YouTube, Jamendo, Wikimedia Commons, SoundCloud and others
 Some sponsored results appear that are not in
the public domain
 Verify use conditions for each result
Search and the Dark Web
 Dark Web: networks with server addresses intentionally obscured
 Often house online criminal activities
 Includes the TOR network's Hidden Services
 Have .onion TLD
 Only accessible via TOR’s private browser
 Content not password-protected, but not accessible to crawler-based services due to lack of linkage
Memex
DOD’s Dark Web Search Engine
 Software to visualize and organize big data
 Searches text, handwritten text, images, geographic data embedded in photos...
 Identifies hidden relationships among websites, deep web sites and forums
 Can access Dark Web obscured networks
 Used in online criminal investigations
    Sex-trafficking ads
    ISIS funding and other money laundering
 Contact memex@darpa.mil
http://www.wsj.com/articles/sleuthing-search-engine-even-better-than-google-1423703464
The Social Web and Research
Why search the social web?
 Public responses, attitudes, opinions
    Breaking news, events
    Trending topics and people
    Latest product reviews
 First-hand accounts of events: text, image, audio, video (primary sources)
 Security, technology topics (latest virus, etc.)
 Locate individuals/experts and their networks
 People interested in a topic/hobby
 Social web research projects
BuzzSumo - meta for social networks
 Discovers the most shared content
 Crawls FB, TW, LinkedIn, Pinterest, Google+
 Backlink and sharer data for 20 or more instances
 Advanced search features
    Boolean
    Author search
    URL or domain search
    Twitter user search
 Filters
    Article, Giveaways, Infographic, Interviews, Guest Post, Videos, Date
 Requires (free) account; other fee-based options
Twitter Search -
search.twitter.com
 Now includes every public Tweet since 2006
 Searchable with all search features previously
available at twitter.com/search-advanced
 Indexes ca. ½ trillion tweets, and grows by several billion tweets a week
 Tweets deal with “everyday human experiences to major historical events”
    Entire TV, sports seasons
    Conferences
    Places
    Events
    Industry discussions
    Long-lived hashtags across countries, ideologies
    #ScotlandDecides #HongKong #Ferguson #Hamas
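For anyone who wants to script a hashtag search rather than fill in the advanced search form, the Python sketch below assembles a query from commonly used Twitter search operators (quoted phrase, #hashtag, from:, since:, until:) and URL-encodes it. The helper name, its parameters, the base URL https://twitter.com/search and the example dates are assumptions made for illustration, not details from the slides. Check Twitter's own search documentation before relying on them.

```python
from urllib.parse import urlencode

def build_twitter_search_url(hashtags=(), phrase=None, from_user=None,
                             since=None, until=None):
    """Assemble a search URL from common operators (hypothetical helper)."""
    parts = []
    if phrase:
        parts.append('"' + phrase + '"')               # exact phrase
    for tag in hashtags:
        parts.append("#" + tag.lstrip("#"))            # hashtag, with or without leading '#'
    if from_user:
        parts.append("from:" + from_user.lstrip("@"))  # tweets sent by one account
    if since:
        parts.append("since:" + since)                 # YYYY-MM-DD, earliest date
    if until:
        parts.append("until:" + until)                 # YYYY-MM-DD, latest date
    query = " ".join(parts)
    # Base URL is an assumption; urlencode escapes '#', ':' and spaces.
    return "https://twitter.com/search?" + urlencode({"q": query})

url = build_twitter_search_url(hashtags=["ScotlandDecides"],
                               since="2014-09-01", until="2014-09-30")
print(url)
```

Restricting a long-lived hashtag to a date window is one way to keep a results set small enough to read through.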
TW as social indicator and health predictor – UPenn study
 Linguistic and emoticon analysis of geo-tagged tweets combined with health data from over 1,300 US counties
 Tweets expressing negative emotions (stress, anger, fatigue) are associated with higher heart disease risk
 Tweets with positive emotions (optimism, enthusiasm) are associated with lower levels of risk
 http://www.upenn.edu/pennnews/news/twitter-can-predict-rates-coronary-heart-disease-according-penn-research
Education and the social searchscape
 Offers first-hand accounts of events and
conditions
 Informative of current world cultures and
trends on a wide range of subjects
 Gateway to blogs and other online
communication that can enhance scholarship
 Channel for updates to educational programs
 Embedded links and other information often
highly relevant and recent
 Requires careful evaluation of information
found there
Bing, Yahoo and DuckDuckGo
Looking for a niche
 Bing and Yahoo represent 29% of all US searches (http://comscore.com, 12/1/14)
 Yahoo
    Focus is on local and personalized search results
    Now partnered with Yelp, local business search engine
 Bing
    Focus is on lifestyle, travel, images, maps
    Social search results (FB, TW) in a sidebar
Bing Image Search
 High quality images
 Related search offered, based on descriptive
text associated with the image
 Clustering by topic
 Filters
    Size, Color, Type, Layout, People, Date, License, SafeSearch
 Image Match with a URL or image you upload
DuckDuckGo http://ddg.gg
 Offers anonymous search functionality
 Popularity spiked after NSA PRISM search
engine scandal
 Does not save search history of any type
 G. does, using it "to increase relevancy"
 Included as a search option in Apple's latest
version of Safari
 Has been blocked in China !!!
Google
Knowledge Vault
Beyond the Graph…..
 Knowledge Graph seeded from Freebase entities
and human additions
 Automated generation of entities increases
number and discovers hidden relationships
among entities and their attributes
 Entities now appear at top of results page with
related topics or other relevant information
 Type of additional information varies depending
on entity
Right to be Forgotten ruling
EU's European Court of Justice, May 2014
 G. and other search engines must remove results deemed
to be "inadequate, irrelevant or no longer relevant, or
excessive in relation to the purposes for which they were
processed and in the light of the time that has elapsed."
http://curia.europa.eu/jcms/upload/docs/application/pdf/2014-05/cp140070en.pdf
 Does not require them to be removed from the servers on
which they are located
 Makes the content more difficult to find
 Of the initial 12,000 removal requests
 33% - fraud accusations
 20% - related to violent/serious crimes
 12% - related to child pornography arrests
App indexing
 G. currently indexes content from apps that
open their content to G's crawlers
 Results from apps are combined with mobile
search results if the searcher has that app
installed on their mobile device.
 Agawi - streaming technology that breaks
apps up into small files, allowing users to
access content in the app while the full app is
loading. (Similar to YouTube's streaming
video technology)
 G. acquired Agawi in the fall of 2014
Google’s device-dependent results sets
 The intent and context of queries varies between
devices
 G.'s search results on mobile devices vary from
those on desktops or laptops by as much as 43%
 Mobile results
    Tend to focus more on local-based results
    Display pages with smaller file size, on average
 Based on analysis of first 30 results for 10,000
keyword searches
“US Google Ranking Factors 2014” http://www.searchmetrics.com/news-and-events/mobile-optimization/
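The slide does not say how Searchmetrics measured the difference between mobile and desktop results. As one plausible reading, the Python sketch below reports the share of the desktop top-N URLs that are missing from the mobile top-N for the same query. The function name and the placeholder URLs are invented, and a real study would average this over many keywords.

```python
def percent_different(desktop, mobile, top_n=30):
    """Percentage of the desktop top-N URLs that are absent from the mobile top-N."""
    desktop_top = desktop[:top_n]
    mobile_top = set(mobile[:top_n])
    if not desktop_top:
        return 0.0
    missing = [url for url in desktop_top if url not in mobile_top]
    return 100.0 * len(missing) / len(desktop_top)

# Placeholder result lists for a single query (not real data).
desktop_results = ["example.com/a", "example.com/b", "example.com/c", "example.com/d"]
mobile_results = ["example.com/a", "example.com/local-1", "example.com/c", "example.com/local-2"]

diff = percent_different(desktop_results, mobile_results)
print(f"{diff:.0f}% of desktop results are missing on mobile")
```

This measure ignores ranking position. Two lists with the same URLs in a different order would count as identical, so a position-aware measure would report larger differences.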
Maps Gallery, In-depth articles
 Interactive digital thematic map collections
    Historic city plans
    Climate trends
    Housing affordability
    Shipwrecks
    Up-to-date evacuation routes
 In-depth articles caveat
    "How to write the In Depth Articles that Google Loves" copyblogger.com
    Content farm orientation?
    Requires careful evaluation of each item; unvetted websites in particular
Google's tech projects
 Google for Kids - under 13; more parental controls
 Project Loon and Project Titan - provide Web access via high-altitude balloons and solar-powered drones
 Self-driving cars
 Google Glass 2
 Smart contact lenses
 Continuous health monitoring via disease-detecting nanoparticles
 Liftware - stabilized spoon for tremor sufferers
"Google Tracker 2015" http://arstechnica.com
Search in the Future
 Will continue to be more specialized
    Shopping - Amazon
    Travel - Kayak
    Movies - IMDb
    Real-time news - TW
 Discovery software will integrate more diverse types of data, crowdsourced to expert
 Semantic processing and predictive search will grow
 Social web will increase as a tool for social change
 Search engines will be challenged by governments worldwide in the areas of commercial monopoly and individual privacy

Thank You and
Enjoy Your Searching!
Michael Hunter
Reference Librarian
Hobart and William Smith Colleges
Geneva, NY 14456
(315) 781-3014
hunter@hws.edu