SNATZ PowerPoint presentation

SNATZ TECHNOLOGY as a news analysis tool
Main terms used in the presentation
Term – a phrase that the system uses to train its NLP algorithms.
Summary – a phrase that the system detects automatically while analyzing news content.
Trend – a unique chain of one or more Summaries. These chains are created as a result of cluster analysis.
Tag – a term created by a moderator to identify a user’s interest category.
User interests – a cloud of Tags that the system recognizes from the user’s social accounts and OPML files.
Segments – groups of ‘similar’ Trends whose search results intersect by more than 30%.
Semantic network – a network that represents semantic relations between keywords.
Data warehouse – a database used for reporting and data analysis.
The main goal of SNATZ
SNATZ is a data mining instrument.
• It recognizes the semantics of news content using an NLP algorithm
• On the basis of the acquired Summaries, SNATZ can derive new knowledge:
- detecting new Summary sets
- gathering Trend statistics
- building Segments
- using new Summaries as Terms for training the NLP algorithm
• Recommending news from different Segments
Our solutions make it possible to change the ‘Collaborative Filtering’ paradigm.
Snatz platform architecture
The SNATZ platform architecture consists of:
• SNATZ Recommender System – personal recommendations based on the users’ interests
• SNATZ Data Mining Tool – a semantic network of trends, created by sending recognized metadata for analysis
SNATZ Recommender
• CRAWLER. The crawler interacts with Web sites by receiving RSS feeds and tweets. The content of the RSS feeds and tweets is the main resource.
• Blogosphere. All resources that the web crawler detects are saved in the data warehouse, which forms the internal SNATZ “blogosphere”.
• Data Processing. Exports resources to the SNATZ DM Tool. Together with the news articles, it sends sets of labels, terms and summaries.
• USERS
• Tags/Posts:
- Posts. The system recognizes users’ posts from Fb, Tw and OPML files.
- Tags. Using the NLP algorithm, the system defines the user’s interests.
• Recommendations. This component contains the rules for forming news recommendations.
• News archive. News items that were recommended to users.
SNATZ Data Mining Tool
Documenter. Imports resources from the Data Processing component and sends them to the NLP.
NLP. Semantic analysis:
- POS tagging
- Defining article attributes: labels, terms and summary.
Meta-Docs. Data warehouse of articles with semantic analysis.
Analysis:
- Multilevel clustering
- Trend definition
Semantic Network of Trends:
- Segments
Reporter:
- Trend statistics
Data workflow
[Diagram: data flows between the Recommender System (Web Crawler, Blogs, Data Processing, Users, Tags/Posts, Recommendation, News Archive) and the Data Mining Tool (Docs, NLP, Meta-Docs, Analysis, Semantic Network of Trends, Reporter) through IMPORT and EXPORT.]
Building segments
[Diagram: Documents → Import → Meta-Docs (Labels, Terms, Summary) → Analysis → Tree of Trends → Segmentation → Segments.]
• The Documenter imports resources and sets of labels, terms and summaries from the Data Processing component and sends them to the NLP.
• The NLP recognizes attributes in the resources: labels, terms and summary. These resources become meta-docs and are saved in the data warehouse.
• Meta-Docs are sent to Analysis, and the system forms the current Trend Tree.
• Trends identify related Summaries, i.e. the main direction of their topics and subtopics. Through these relations between Trends, SNATZ finds similar/related topics and groups them into Segments:
- If Trends intersect by more than 30%, they create a new Segment.
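The 30% rule can be illustrated with a minimal sketch. The slides do not say how the intersection is measured, so this example assumes a Jaccard overlap of each Trend's search-result set; the names `overlap`, `build_segments` and the sample document ids are hypothetical. Trends that overlap with no other Trend are left in a group of their own, which is also an assumption.

```python
from itertools import combinations


def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two result sets (an assumed measure, not SNATZ's own)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_segments(trend_results: dict[str, set[str]],
                   threshold: float = 0.30) -> list[set[str]]:
    """Group Trends whose result sets overlap by more than `threshold`."""
    parent = {name: name for name in trend_results}

    def find(x: str) -> str:
        # Follow parent links to the group representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(trend_results, 2):
        if overlap(trend_results[a], trend_results[b]) > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: dict[str, set[str]] = {}
    for name in trend_results:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical search results for three Trends.
trend_results = {
    "Network+Tumblr": {"doc1", "doc2", "doc3"},
    "Network+Tumblr+Instagram": {"doc2", "doc3", "doc4"},
    "Election+Poll": {"doc9"},
}
segments = build_segments(trend_results)
```

Here the first two Trends share two of four distinct documents (50%), so they fall into one Segment, while the third stays separate.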
Recommendations
[Diagram labels: Users’ Posts, Update interests, Tags, Meta-Docs, Defining Related Trends, Related Tags, Interests, Personal Recommendations, Daily Review.]
• The system parses users’ posts from Fb and Tw and the uploaded OPML subscription file.
• Interests are updated every 4 hours.
• The NLP recognizes interests from the Posts, which results in a set of Tags.
• If the number of Tags is less than 12, the system tries to find related Tags.
• The system takes the Trends received from Meta-Docs and defines related Trends for the user’s interests. If a Trend contains a user’s interest, it becomes connected with the user.
• Summaries in a related Trend become the Related Tags.
• The system takes Trends from the User’s Trend tree and makes the Daily Review.
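The related-Tags step can be sketched as follows. This is an illustration only: it assumes a Trend is modelled as a plain set of Summary strings and that "contains a user's interest" means sharing at least one string with the user's Tags. The function name `related_tags` and the sample data are hypothetical.

```python
MIN_TAGS = 12  # threshold stated on the slide


def related_tags(tags: set[str], trends: list[set[str]],
                 min_tags: int = MIN_TAGS) -> set[str]:
    """Return Related Tags taken from Trends that contain a user Tag."""
    if len(tags) >= min_tags:
        return set()  # enough Tags already, no expansion needed
    related: set[str] = set()
    for trend in trends:
        if trend & tags:             # the Trend contains a user interest
            related |= trend - tags  # its other Summaries become Related Tags
    return related


# Hypothetical user Tags and Trends.
user_tags = {"tumblr", "photography"}
trends = [
    {"network", "tumblr", "instagram"},
    {"election", "poll"},
]
extra = related_tags(user_tags, trends)
```

Only the first Trend contains a user Tag, so its remaining Summaries ("network", "instagram") are returned as Related Tags.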
Personal Recommendations
[Diagram labels: User Interests, Get Trends, User’s Trends, Segments, Check Trends, User’s Tree of Trends, Interests, ‘Diversity’ Filtering, List of 12 News.]
• The system takes the ‘Last Trends’ that contain the user’s interests and forms the User’s Trends.
• The User’s Trends are checked against Segments to form the User’s Trends Tree.
‘Diversity’ filtering:
• The system takes no more than 2 interests from one category
• No more than one news article per Trend
• The system takes only news with new keywords (i.e. compared with previous sets of news)
• Only 1 news item from the same Segment
• Only 2 news items from one category
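A minimal sketch of the news-level ‘Diversity’ rules follows. It makes several assumptions not stated in the slides: candidates arrive already ranked, "new keywords" means at least one keyword not present in previous recommendation sets, and the list is capped at 12 items (the "List of 12 News" in the diagram). The `News` record and `diversity_filter` function are hypothetical, and the rule limiting interests per category is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class News:
    title: str
    trend: str
    segment: str
    category: str
    keywords: frozenset


def diversity_filter(candidates, seen_keywords, limit=12):
    """Pick news items while enforcing the per-Trend, per-Segment,
    per-category and new-keyword rules."""
    picked = []
    used_trends, used_segments = set(), set()
    per_category = {}
    for item in candidates:
        if len(picked) == limit:
            break
        if item.trend in used_trends or item.segment in used_segments:
            continue  # at most one item per Trend and per Segment
        if per_category.get(item.category, 0) >= 2:
            continue  # at most two items per category
        if not (item.keywords - seen_keywords):
            continue  # nothing new compared with previous sets of news
        picked.append(item)
        used_trends.add(item.trend)
        used_segments.add(item.segment)
        per_category[item.category] = per_category.get(item.category, 0) + 1
    return picked


# Hypothetical ranked candidates.
candidates = [
    News("A", "t1", "s1", "tech", frozenset({"tumblr"})),
    News("B", "t1", "s1", "tech", frozenset({"instagram"})),  # same Trend as A
    News("C", "t2", "s2", "tech", frozenset({"apple"})),
    News("D", "t3", "s3", "tech", frozenset({"google"})),     # third "tech" item
    News("E", "t4", "s4", "politics", frozenset({"poll"})),   # keywords already seen
    News("F", "t5", "s5", "sport", frozenset({"final"})),
]
picked = diversity_filter(candidates, seen_keywords={"poll"})
```

With this input the filter keeps items A, C and F.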
SNATZ server architecture
The high-availability cluster provides the following services:
1. a virtual IP for the cluster.
2. DRBD storage for the cluster.
3. an ext4 file system on top of DRBD.
4. OpenVZ containers on ext4 over DRBD.
• Each cluster is assembled from two nodes.
• Corosync is used for managing the cluster.
• Pacemaker is the resource manager.
• The system consists of five two-node clusters.
Redundant services run in OpenVZ containers and start together with the container.
Redundant services in different containers interact over the local network, which is connected through a separate switch to the second network interface of each node.
For each two-node cluster, a start sequence for the redundant services is defined:
1. switch the DRBD active/passive roles.
2. mount the ext4 file system at the mount point of the active node.
3. start the OpenVZ containers placed on DRBD.
SNATZ + Elasticsearch engine
Elasticsearch is a search server that provides a distributed, multitenant-capable full-text search engine with a RESTful web interface and schema-free JSON documents.
Advantages of Elasticsearch for SNATZ:
• Elasticsearch is a stable, working project
• AWS Cloud Plugin (allows use of the Amazon EC2 API)
• Real-time data search and analysis
• Index versioning support
• Search capabilities: fuzzy queries, etc.
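As an illustration of the fuzzy search capability, the sketch below builds the JSON body of an Elasticsearch `fuzzy` query. The field name `title` and the misspelled search text are assumptions made for this example; the slides do not describe the SNATZ index mapping. The body would be sent to the `_search` endpoint of an index, which is not shown here.

```python
import json


def fuzzy_query(field: str, text: str, fuzziness: str = "AUTO") -> dict:
    """Build the request body for an Elasticsearch fuzzy query on one field."""
    return {"query": {"fuzzy": {field: {"value": text, "fuzziness": fuzziness}}}}


# "title" is a hypothetical field name; "instagrm" is a deliberately misspelled term.
body = fuzzy_query("title", "instagrm")
print(json.dumps(body, indent=2))
```

A fuzzy query matches terms within a small edit distance of the given value, so a misspelled search term can still find articles that mention the intended word.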
Elasticsearch + Amazon EC2
Features:
• ability to maintain a high performance cluster designed for I/O intensive
operations
• new instances are started and stopped when required
• no need to pay for long-term servers and their administration
• pricing is per instance-hour consumed for each instance
• ability to create images from a working machine (configured & set up) and start
other instances from these images
SNATZ + NLP
Features:
• Part-of-speech tagging
• Summary extraction
• User-defined Terms and Labels
• Synonym handling
• Supervised text classification using user-defined datasets for training and for evaluating performance
Language support:
• English
• Japanese (using third-party tools like MeCab)
Challenges of SNATZ
• Filter Bubble (user’s interests)
• Diversity and ‘Long Tail’
• Data sparsity (‘the cold start problem’)
• Scalability
• Segmentation (‘related topics’)
How does SNATZ solve these problems?
Using TRENDs
What is the Filter Bubble?
The user sees only popular news selected by the TOP-Tags from his interest categories.
The user does not see related Tags outside the Filter Bubble.
What is a TREND?
All summaries and terms of articles are closely connected.
The task of SNATZ is to identify the significant connections.
How are Trends detected?
News with Terms → Clustering by Terms → Clustering by Labels → Clustering by Summary → Trends detected by the system
Abstraction levels of the algorithm
The multilevel clustering algorithm has 3 levels of abstraction:
• Labels
• Terms
• Summary
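The three levels can be pictured with a simple nested grouping. This is only a stand-in for the actual cluster analysis, which the slides do not specify: it assumes each article carries lists of labels, terms and summaries, and it nests articles by label, then term, then summary. The function name, the level order and the sample articles are hypothetical.

```python
from collections import defaultdict


def multilevel_groups(articles):
    """Nest article ids by label, then term, then summary."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for art in articles:
        for label in art["labels"]:
            for term in art["terms"]:
                for summary in art["summary"]:
                    tree[label][term][summary].append(art["id"])
    return tree


# Hypothetical articles with their recognized attributes.
articles = [
    {"id": 1, "labels": ["tech"], "terms": ["network"], "summary": ["tumblr"]},
    {"id": 2, "labels": ["tech"], "terms": ["network"], "summary": ["tumblr", "instagram"]},
    {"id": 3, "labels": ["politics"], "terms": ["election"], "summary": ["poll"]},
]
tree = multilevel_groups(articles)
```

In this sample, articles 1 and 2 end up under the same label, term and summary ("tech", "network", "tumblr"), which is the kind of co-occurrence a Trend chain would be built from.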
SNATZ outside the Filter Bubble
SNATZ tries to show news beyond the user’s filter bubble in order to cover more Trends.
Trends identify related Summaries, i.e. the main direction of their topics and sub-topics.
Long Tail problem
Users usually do not see most news because its Popularity Rank is too low.
SNATZ solves this problem:
• For user recommendations, SNATZ selects Trends only from different Segments
• In order to provide users with *new* content, SNATZ does NOT make recommendations based on Summaries that were already picked for previous recommendations. This way the user sees news based on the latest Trends
• SNATZ does NOT use TOP-Tags from the user’s interest categories.
Collaborative Filtering
The first and most common way to determine the significance of an article is
its social rating. This is determined through an advanced technique
called Collaborative Filtering, which collects taste preferences or personal
information (such as language, country, etc.) from many users and uses that data
to make automatic predictions.
SNATZ recommends news solely on the basis of user interests. Every step of
recommendations is unique and depends on the previous step.
Recommendations are made only on the basis of the individual user's
experience.
Effective Content Personalization
One approach to effective content personalization is called ‘the classification of trends’. It is based on identifying the most significant relationships between Summaries, which creates a unique chain of Summaries called a Trend.
A Trend contains one or more Summaries from Web content and determines specific subtopics.
The main characteristic of a Trend is the dynamics of its chain of Summaries, which may be positive (growing) or negative (fading) over a specific period of time.
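One simple way to picture Trend dynamics is to compare activity in the earlier and later halves of the period. The slides do not define how dynamics are computed, so this is an assumed measure: the input is a hypothetical list of daily article counts for one Trend, and the extra "stable" outcome for equal halves is added only for this sketch.

```python
def trend_dynamics(daily_counts: list[int]) -> str:
    """Classify a Trend as growing, fading or stable over the period."""
    half = len(daily_counts) // 2
    if half == 0:
        return "stable"  # too little data to compare
    earlier = sum(daily_counts[:half]) / half
    later = sum(daily_counts[-half:]) / half
    if later > earlier:
        return "growing"
    if later < earlier:
        return "fading"
    return "stable"


# Hypothetical daily article counts for two Trends.
rising = trend_dynamics([2, 3, 5, 8])
falling = trend_dynamics([9, 6, 4, 1])
```

Here `rising` is classified as growing and `falling` as fading.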
Automatic segmentation of blogosphere
Through these relations between chains, SNATZ finds similar/related topics and groups them into Segments.
For example: ‘Network+Tumblr’ intersects with ‘Network+Tumblr+Instagram’ by more than 30%. These chains create a new Segment.
Trends are determined by analyzing content in the current news state of the daily Blogosphere in its most basic form: relevant daily news topics. If a recommendation engine calculates the thematic proximity of Trends, it can auto-classify them into trend Segments, so that similar sub-topics are put in the same Segments. This auto-classification splits Web content into various major topics.
Automatic segmentation
A recommendation engine that applies this classification process to trends (and not tags) solves two major personalization problems:
• It removes the Long Tail problem, making news recommendations from different segments possible
• It solves the problem of thematic proximity, making sure that similar or duplicate news is filtered out
Data mining result. Infographics
The system can:
• detect more current Trends for any given topic.
• detect ‘Related’ Tags for any given topic.
• detect the dynamics of Trends
• detect the sentiment of news
Findings
As information becomes increasingly dense, consumers deserve to get the news that they want to read – not the news an algorithm thinks they want.
SNATZ offers a personalization algorithm that can solve the challenges of the filter bubble and the long tail.
Thanks for your attention!
SNATZ Team