Crime Scene Investigation: SMS Spam Data Analysis

Ilona Murynets
AT&T Security
Research Center
New York, NY
ilona@att.com
Roger Piqueras Jover
AT&T Security
Research Center
New York, NY
roger.jover@att.com
IMC’12, November 14–16, 2012, Boston, Massachusetts, USA.
Spam is the commonly adopted name for unwanted
messages sent in bulk to a large number of
recipients.
e-mail spam
• 90% of the daily e-mail sent over the Internet is spam
• multiple solutions detect and block it
• only a small amount of spam reaches inboxes
SMS spam
• spammers connect aircards and cell phones to a PC
• yearly growth larger than 500%
• effective anti-abuse messaging filters injected
• content-based algorithms (built for e-mail) work
less efficiently on SMS
Why
• SMS text uses acronyms, pruned spellings and emoticons
• spammers shut down or swap SIM cards
SMS spam
• consumes network resources that would otherwise
serve legitimate services
• the user pays on a per-received-message basis
• exposes smartphone users to viruses
• enables fraudulent messaging activities such as
phishing, identity theft and fraud
This paper:
• the analysis is used for an SMS spam detection engine
Outline
• three data sets for analysis
• Data analysis
– Account information
– Messaging Abuse
• Response ratio
• Message timing and time series
– The Scene of the Crime
• Location & targets
• Mobility
– Hardware choice
– Voice and IP traffic
Three data sets: SMS spammers, cell-phone users and M2M systems
• collected from a tier-1 cellular operator
• Call Detail Records (CDR) of 9000 SMS spammers
and 17000 legitimate accounts (cell phones and M2M)
• Mobile Originated (MO): the transmitting party
• Mobile Terminated (MT): the receiving party
• spammers were identified and disconnected from the
network
• SMS spammers: prepaid accounts; cell-phone users:
postpaid accounts
• M2M: TAC (Type Allocation Code)
notes
• In all the figures throughout the paper,
legitimate cellphone users, M2M systems and
spammers (SMS) are represented in green,
blue and red, respectively.
Account information
• almost all spammers (99.64%) use pre-paid
accounts with unlimited messaging plans
• SIM cards are constantly switched to
circumvent detection schemes
• once an account is canceled, spammers discard
the SIM card and work with a new one
• the average age of a spammer account is 7 to 11 days
(legitimate accounts are several months to a couple of years old)
Messaging Abuse
• Spammers generate a large load of messages
• Spammers not only send but also receive
more messages than legitimate customers do
– opt-out replies
– replies from recipients tricked by the message
Messaging Abuse
Actual spam messages often attempt to trick the recipient into replying to
the message.
Although only a small percentage of users reply, the large number of
accounts targeted in a spam campaign results in many responses.
Messaging Abuse
• legitimate accounts have a small set of
recipients (7 on average)
• spammers hit a couple of thousand victims
• legitimate users send multiple messages to a
small set of destinations
• spammers send one message to each victim
Response ratio
• For legitimate users, messages are sent
sequentially in response to a previous message,
so the response ratio is close to 1.
• For spammers, the number of MT SMSs is
very small in proportion to the number of
transmitted messages, so the response ratio is
close to 0.
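The response ratio described above can be sketched in a few lines. This is a minimal illustration, assuming the ratio is the number of received (MT) messages divided by the number of sent (MO) messages, which the slides imply but do not state. The `(sender, receiver)` record layout and all account names are hypothetical simplifications of a CDR.

```python
def response_ratio(records, account):
    """Return MT/MO for one account, or None if the account sent nothing.

    `records` is a list of (sender, receiver) pairs, a hypothetical
    simplification of a Call Detail Record.
    """
    sent = sum(1 for sender, _ in records if sender == account)          # MO
    received = sum(1 for _, receiver in records if receiver == account)  # MT
    if sent == 0:
        return None
    return received / sent


# Toy data: a two-way conversation versus a one-way blast with one reply.
chat = [("alice", "bob"), ("bob", "alice"), ("alice", "bob"), ("bob", "alice")]
blast = [("spam", f"victim{i}") for i in range(100)] + [("victim3", "spam")]

print(response_ratio(chat, "alice"))  # 1.0
print(response_ratio(blast, "spam"))  # 0.01
```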
Message timing and time series
• Inter-SMS intervals for spammers are short
and less random -- low entropy.
• Intervals for legitimate messages are longer
(less frequent) and more random -- higher entropy.
• Messaging activities of certain M2M devices
are prescheduled.
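One way to put a number on "less random" is the Shannon entropy of the gaps between consecutive messages. The sketch below is an assumption-laden illustration, not the estimator used in the paper: timestamps are taken to be in seconds, gaps are binned to whole seconds (`bin_seconds` is a hypothetical parameter), and the toy timestamps are invented.

```python
import math
from collections import Counter


def interval_entropy(timestamps, bin_seconds=1):
    """Shannon entropy (bits) of the binned gaps between consecutive messages."""
    ts = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    if not gaps:
        return 0.0
    bins = Counter(int(gap // bin_seconds) for gap in gaps)
    total = len(gaps)
    # Sum of p * log2(1/p) over the observed bins.
    return sum((n / total) * math.log2(total / n) for n in bins.values())


# Assumed toy timestamps in seconds.
machine_like = [0, 2, 4, 6, 8, 10]        # one message every 2 s
human_like = [0, 40, 55, 400, 410, 3000]  # irregular gaps

print(interval_entropy(machine_like))  # 0.0 (every gap falls in the same bin)
print(interval_entropy(human_like))    # about 2.32 (five distinct gaps)
```

The bin width matters: wider bins merge nearby gaps and lower the entropy, so any threshold on this feature only makes sense for a fixed binning.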
Location & targets
• California
• Sacramento and Orange
• Los Angeles
• New York/New Jersey/Long Island
• Miami Beach
• Illinois, Michigan
• North Carolina and Texas
Location & targets
• Recipients of legitimate messages are concentrated
in a local area (i.e. the area around the subscriber’s
home or areas where the subscriber works, used to live
or where friends and relatives reside).
• Spam recipients are distributed uniformly
over the US population.
Location & targets
• Spammers are characterized by messaging a
large number of area codes, always greater
than the number contacted by cell-phone users
and M2M systems.
Location & targets
• Low entropy (legitimate cell-phone users) -- they
repeatedly contact the same area codes.
• High entropy (SMS spammers) -- they send messages
to a more random set of area codes.
• Network-enabled appliances (M2M) -- they message a
predefined set of cell phones, so the entropy is
the lowest.
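The area-code entropy compared above can be illustrated as the Shannon entropy of the distribution of destination area codes. The sketch assumes 10-digit US numbers whose first three digits are the area code; the phone numbers are made up, and the paper's exact definition may differ.

```python
import math
from collections import Counter


def area_code_entropy(destinations):
    """Shannon entropy (bits) of the area codes in a list of 10-digit numbers."""
    codes = Counter(number[:3] for number in destinations)
    total = sum(codes.values())
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in codes.values())


# Assumed toy destination lists, one entry per message sent.
m2m = ["2125550100"] * 8                          # one fixed destination
cell = ["2125550101"] * 6 + ["9175550102"] * 2    # two familiar area codes
spammer = [code + "5550100" for code in
           ("201", "212", "305", "312", "404", "415", "512", "916")]

print(area_code_entropy(m2m))      # 0.0
print(area_code_entropy(cell))     # about 0.81
print(area_code_entropy(spammer))  # 3.0 (eight equally likely area codes)
```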
Location & targets
• linear relation -- SMS spammers
• Both M2M systems and cell-phone users
cluster around the bottom-left area of
the graph.
• M2M systems send up to 20000 messages to a single
destination?
Location & targets
• Cell-phone users have a low destinations-to-messages
ratio and contact a small set of area codes.
• A great majority of spammers exhibit the
opposite behavior.
• Spammers in the bottom-right corner (SMS) target very
specific geographical regions: their ratio is one
destination per message, but the number of targeted
area codes is limited.
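The two quantities compared on this slide can be computed per account as sketched below. The function name, the 10-digit number format and the sample data are assumptions for illustration only.

```python
def targeting_profile(destinations):
    """Return (destinations-to-messages ratio, number of distinct area codes).

    `destinations` holds one 10-digit number per message sent.
    """
    if not destinations:
        return 0.0, 0
    ratio = len(set(destinations)) / len(destinations)
    area_codes = {number[:3] for number in destinations}
    return ratio, len(area_codes)


# Assumed toy data.
cell_user = ["2125550101"] * 4 + ["2125550102"] * 2     # few contacts, reused
regional_spammer = [f"91655501{i:02d}" for i in range(5)]  # one region only
wide_spammer = [code + "5550100" for code in
                ("201", "305", "404", "512", "916")]

print(targeting_profile(cell_user))         # ratio about 0.33, 1 area code
print(targeting_profile(regional_spammer))  # (1.0, 1)
print(targeting_profile(wide_spammer))      # (1.0, 5)
```

A ratio of 1.0 means every message went to a new destination. The regional spammer matches the bottom-right case: ratio of one, yet few area codes.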
Mobility
Hardware choice
1. USB Modem/Aircard A1
2. Feature mobile-phone M1
3. Feature mobile-phone M2
4. USB Modem/Aircard A2
5. USB Modem/Aircard A3
Voice call
IP traffic
STOPPING THE CRIME
• An advanced SMS spam detection algorithm is
proposed based on an ensemble of decision
trees
• Over 40 specific features are extracted from
messaging patterns and processed through a
combination of decision trees
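To make the idea of combining decision trees concrete, here is a toy majority vote over one-level trees (decision stumps). It is not the detector from the paper: the five features are picked from the slides, while the thresholds, feature names and sample values are invented for illustration. A real ensemble would learn its trees from labeled data.

```python
def make_stump(feature, threshold, spam_if_above):
    """Build a one-level decision tree that votes True for 'spammer'."""
    def stump(features):
        above = features[feature] > threshold
        return above == spam_if_above
    return stump


# Hypothetical thresholds, chosen by hand for this example only.
STUMPS = [
    make_stump("response_ratio", 0.2, spam_if_above=False),
    make_stump("interval_entropy", 1.0, spam_if_above=False),
    make_stump("area_code_entropy", 2.0, spam_if_above=True),
    make_stump("dest_to_msg_ratio", 0.8, spam_if_above=True),
    make_stump("account_age_days", 30, spam_if_above=False),
]


def is_spammer(features):
    """Flag the account when a strict majority of stumps vote 'spammer'."""
    votes = sum(stump(features) for stump in STUMPS)
    return votes > len(STUMPS) / 2


# Assumed feature vectors for three account types.
spammer = {"response_ratio": 0.01, "interval_entropy": 0.3,
           "area_code_entropy": 5.0, "dest_to_msg_ratio": 1.0,
           "account_age_days": 9}
cell_user = {"response_ratio": 0.95, "interval_entropy": 3.1,
             "area_code_entropy": 0.8, "dest_to_msg_ratio": 0.2,
             "account_age_days": 400}
m2m = {"response_ratio": 0.0, "interval_entropy": 0.0,
       "area_code_entropy": 0.0, "dest_to_msg_ratio": 0.001,
       "account_age_days": 700}

print(is_spammer(spammer))    # True  (5 of 5 votes)
print(is_spammer(cell_user))  # False (0 of 5 votes)
print(is_spammer(m2m))        # False (2 of 5 votes)
```

The M2M example collects two spam votes because its response ratio and timing entropy look spammer-like, which is why a single feature is not enough to separate the groups.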
CONCLUSIONS
• spammers use pre-paid accounts with an average age of 7 to 11 days
• large number of messages sent to a wide set of
targets (spammers also receive a large number)
• five different models of hardware
• large number of phone calls of very short duration
• main geographical sources in the US: Sacramento, Los
Angeles-Orange County and Miami Beach
• certain networked appliances have messaging
behavior close to that of a spammer