Modeling for Crime Busting
Da-wei Sun1, Xia-yang Zheng1, Zi-jun Chen1, Hong-min Wang1
1 Department of Electric and Electronic, North China Electricity Power University, Beijing, China
(dddd129216713@hotmail.com)
Abstract - This paper builds models for identifying the people in an 83-worker company who are the most likely conspirators. The train of thought is: (1) obtain a priority list that values the suspicious degree of each worker; (2) obtain a line separating conspirators from non-conspirators; (3) find the leader of the conspiracy.
The paper first sets different values of suspicious degree for messages with various features in order to value the suspicious degree of everybody. Secondly, we optimize the primary figure by using a formula based on the weighted average method. Thirdly, we work through the individuals on the better priority list from both ends. Then the paper uses some methods of semantic analysis to better distinguish possible conspirators from the others and finally obtains the priority list. Next, the discriminate line is determined by using probability theory and cluster analysis. At last, we find the leaders by means of the priority list and the discriminate line.
Keywords - mathematical model, crime busting, social network, text analysis
I. RESTATEMENT OF THE PROBLEM
This paper investigates a conspiracy to commit a criminal act. We realize that a conspiracy is taking place to embezzle funds from the company and to use internet fraud to steal funds from the credit cards of people who do business with the company. All we know is that there are 83 people, 400 messages (sent by the 83 people), 15 topics (3 of which have been deemed suspicious), 7 known conspirators, 8 known non-conspirators, and that there are three managers in the company.
II. MODEL AND RESULTS

A. STEP 1

Our goal is to get a table that explains the suspicious degree of different messages, and then to get a preliminary priority list on the basis of the table.

We consider that every message connects two workers. According to the suspicious degree of the message's topic, we add a reasonable weight to each worker. The weight is related not only to the suspicious degree of the message's topic, but also to the suspicious degree of the speaker and the listener.

Table I: Values for Qm

                    Suspicious topics                          Normal topics
Spoken from \ to    Known c.   Unknown worker  Known non-c.    Known c.   Unknown worker  Known non-c.
Known c.            15 (A11)   10 (A1)         2 (A12)         4 (A15)    2 (A16)         1 (A16)
Unknown worker      10 (A2)    6 (A3)          4 (A4)          2 (A17)    1 (A8)          0 (A9)
Known non-c.        2 (A13)    4 (A15)         0 (A14)         1 (A17)    0 (A10)         0 (A18)

In the table, 'c.' means conspirator. This table gives the values of Qm that we use in formula (1). The corner mark 'm' means the number of a message. As mentioned in the section on sensitivity analysis, '(A#)' stands for the number of a cell in the table. This table is one of the foundations of our model. We can see that Qm is decided by three factors: the speaker, the listener and the topic of message No. m.

Y_i = \sum_{m=1}^{N} R_{im} Q_m    (1)

'Yi' is an intermediate variable for worker i and we will use it in formula (2). 'N' means the total number of messages; in our case N = 400. 'm' means the number of a message. 'Qm' means the weight that message No. m gives; its value is given by Table I. 'Rim' describes the relation between message No. m and worker i: if message No. m is sent from or sent to worker i, Rim = 1; if not, Rim = 0. At first we use 'Yi' to represent the suspicious degree of every worker.

The primary priority list is the result, and it is obtained by traversing all messages with Matlab.
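As a minimal Matlab sketch of this traversal (our own illustration, not the authors' code), assume the messages are stored in vectors sender and receiver holding 1-based worker indices, and that Q(m) already holds the Table I weight of message m:

N = 400;                                    % total number of messages
numWorkers = 83;
Y = zeros(numWorkers, 1);                   % Y(i) from formula (1)
for m = 1:N
    Y(sender(m))   = Y(sender(m))   + Q(m); % message m is sent from this worker
    Y(receiver(m)) = Y(receiver(m)) + Q(m); % and sent to this worker
end
[~, primaryPriority] = sort(Y, 'descend');  % primary priority list of worker indices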
Fig.1: primary priority figure
It is obvious that there are two special points, (20, 12) and (56, 58). Our goal is to distinguish conspirators from non-conspirators, so in theory there should be only one special point dividing the figure into two parts. So we are not satisfied with this figure; there must be some other factors that we did not take into consideration.
B. STEP 2

Our goal is to optimize the priority figure we obtained in Step 1 using the method of weighted average.

We think that 'Yi' (defined in formula (1)) is not a reliable measure of the suspicious degree on its own. Through a formula we create, we evaluate the suspicious degree of each worker by the method of weighted average.

W_i = \frac{\sum_{k=1}^{M} Sd(k) \cdot A_i(k)}{A_i \cdot Sd_{max}} \cdot Y_i    (2)
'Wi' means the crime suspicious degree of worker i. 'k' means the number of a topic. 'M' means the total number of topics; in our case M = 15. 'Yi' is given by formula (1). Sd(k) stands for the linguistic influence of topic No. k. Before Step 4, Sd(k) = 1 when topic k is not suspicious and Sd(k) = 2 when it is suspicious; we will optimize the value of Sd(k) in Step 4 and finally give different values of Sd(k) for different topic numbers. Sdmax is defined by

Sd_{max} = \max\{Sd(1), Sd(2), \ldots, Sd(M)\}    (3)
Ai is the total number of text topics worker i receives or sends. For example, if David receives only one message and sends none, and that message includes three text topics, then ADavid = 3. Ai(k) is the number of those topics that have number k. That is to say,

A_i = \sum_{k=1}^{M} A_i(k)    (4)

We can get Wi, the crime suspicious degree of worker i, by searching all messages. Then we get the priority list by ranking Wi.
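A minimal Matlab sketch of formulas (2)-(4) (our own illustration; topicCount is an assumed 83 x 15 matrix whose entry (i, k) is Ai(k), and Sd is a 1 x 15 vector of topic suspicion degrees):

Sdmax = max(Sd);                              % formula (3)
A = sum(topicCount, 2);                       % formula (4): A_i = sum over k of A_i(k)
W = (topicCount * Sd') ./ (A * Sdmax) .* Y;   % formula (2), evaluated for every worker at once
                                              % (workers with no messages would need special handling)
[~, priorityList] = sort(W, 'descend');       % rank workers by W to get the priority list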
Fig.2: better priority figure

Figure 2 is better than figure 1, because there is only one inflection point. But the slopes on the two sides of the inflection point do not differ greatly. We want to optimize the priority figure further in Step 3.
C. STEP 3

Our goal is to optimize the better priority figure we obtained in Step 2 using the method of iteration.

We find a disadvantage of Step 1 and Step 2: we did not consider the differences among the unknown workers. In fact, some of the unknown workers are criminals and the others are not, so they should be treated differently. In other words, different unknown workers have different influences on the other workers' 'Wi'. We consider these different influences in Step 3 to optimize our priority list.
In figure 1, the point (1, 3) worker and the point (68, 93) worker are treated differently in this step. We treat the point (68, 93) worker as a known criminal and the point (1, 3) worker as an innocent, and then use the method of Step 1 again; so we get another priority list. But we think it is unconvincing to simply mark one unknown worker as a criminal and one as an innocent in each round, because in figure 1 the point (68, 93) worker differs obviously from its neighbour in crime suspicious degree, while the point (1, 3) worker differs only slightly from its neighbour. So we fit the 67 points with a curve and use the slopes of the tangents at the two ends to decide how many unknown workers should be treated as criminals and as innocents in each round. We do this repeatedly, and finally all unknown workers are placed into two categories, criminals and innocents. We get a priority list at last.
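A rough Matlab sketch of this iteration, simplified to move one worker from each end per round (the tangent-slope rule is omitted; recomputeW is a hypothetical helper that reruns the Step 1 and Step 2 scoring with the current labels):

knownC  = [7 18 21 43 49 54 67];            % known conspirators (worker codes)
knownNC = [0 2 48 64 65 68 74 78];          % known non-conspirators
unknown = setdiff(0:82, [knownC knownNC]);  % the remaining unknown workers
while numel(unknown) >= 2
    W = recomputeW(knownC, knownNC);        % hypothetical helper: reruns formulas (1)-(2)
    [~, hi] = max(W(unknown + 1));          % +1 because worker codes start at 0
    [~, lo] = min(W(unknown + 1));
    knownC  = [knownC  unknown(hi)];        % most suspicious unknown -> criminal side
    knownNC = [knownNC unknown(lo)];        % least suspicious unknown -> innocent side
    unknown(unique([hi lo])) = [];          % remove the reclassified workers
end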
Fig.4: final priority figure

D. STEP 4

Our goal in this step is to consult the literature on text analysis methods and try to optimize our priority figure with such a method.

We set 'Sd(k)', which means the suspicious degree of topic No. k. We think that the topics talked about more frequently within the group of conspirators should be given a larger value of 'Sd(k)'. We use the following formulas (5) and (6) to describe 'Sd(k)':
Sd(k)_{i+1} = \frac{Sd_{max,i}}{\varphi_{max,i}} \cdot \frac{\varphi(k)_i - \varphi_{ave,i}}{10} + Sd(k)_i    (5)

\varphi(k) = \frac{p(k)_h / h}{p(k)_j / j}    (6)
'p(k)h' means the number of times that topic k is talked about by criminals, and 'h' means the number of criminals. Correspondingly, 'p(k)j' means the number of times that topic k is talked about by all people, and 'j' means the total number of people, so j = 83. φ(k) therefore describes how frequently topic k appears in the conversation of criminals relative to everybody. Furthermore, φmax,i = max{φ(k)i}, φave,i = average{φ(k)i} and Sdmax,i = max{Sd(k)i}. In Step 3 we repeatedly elect unknown workers to be criminals or innocents, so we set up a loop in Matlab, and the corner mark 'i' means the number of the iteration.

We use this text analysis idea to value Sd(k), and this finally changes 'Wi' (the crime suspicious degree of worker i). Taking the influence of the text analysis method on 'Sd(k)' into account, we get the final priority figure with linguistics (Fig.5), and we read our final priority list from that figure.
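A minimal Matlab sketch of one pass of formulas (5)-(6) (our own illustration; pC and pAll are assumed 1 x 15 count vectors of how often each topic is mentioned by the current criminal group and by everybody):

h = numel(knownC);  j = 83;                          % group size and total number of people
phi = (pC ./ h) ./ (pAll ./ j);                      % formula (6), one value per topic
phiMax = max(phi);  phiAve = mean(phi);  SdMax = max(Sd);
Sd = Sd + (SdMax / phiMax) * (phi - phiAve) / 10;    % formula (5), one iteration of the update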
Fig.5: final priority figure with linguistics

E. STEP 5

In Step 5, our goal is to locate a discriminate line that helps categorize the unknown workers distinctly, using the ideas of cluster analysis and the method of hypothesis testing.

We define a variable 'AW1' to describe the degree of conspiring for the group of conspirators. 'AW1' is defined by the following formula (7):
AW1_x = \frac{\sum_{i=x}^{67} W_i + W_{83-K} \cdot K}{83 - x}    (7)

'AW1x' means the average weight of the group of conspirators, and it stands for their degree of conspiring. 'x' is the abscissa of the point where the discriminate line is located. 'Wi' is defined in formula (2). 'K' is the number of known conspirators; we consider the suspicious degrees ('Wi') of all known conspirators to be the same and equal to the value of the most-right point in figure 5.

(c) We draw figure 6, which shows how 'AW1x' changes as 'x' grows.

Fig.6: AW1x changing with x

(d) For the same reason, we define 'AW2' to describe the degree of conspiring for the group of non-conspirators, using formula (8):

AW2_x = \frac{\sum_{i=1}^{x} W_i + W_L \cdot L}{L + x}    (8)

'L' is the number of known non-conspirators; we consider the suspicious degrees ('Wi') of all known non-conspirators to be the same and equal to the value of the most-left point in figure 5. We draw figure 7, which shows how 'AW2x' changes as 'x' grows.

Fig.7: AW2x changing with x
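A small Matlab sketch of formulas (7)-(8) under our reading of them (Wsorted is assumed to be the 67 plotted weights in ascending order; WR and WL are the values of the most-right and most-left points of figure 5):

K = 7;  L = 8;                                            % known conspirators / non-conspirators
AW1 = zeros(1, 67);  AW2 = zeros(1, 67);
for x = 1:67
    AW1(x) = (sum(Wsorted(x:67)) + WR * K) / (83 - x);    % formula (7), conspirator side
    AW2(x) = (sum(Wsorted(1:x))  + WL * L) / (L + x);     % formula (8), non-conspirator side
end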
We finally give different discriminate lines and evaluate the solution according to different probabilities of the first type error and the second type error, so as to fit the different requirements of the police.

The first type error in our model is letting conspirators get away with the crime. We describe the probability of the first type error by 'P1%' using formula (9):

P1\% = \left(1 - \frac{\int_{x}^{x_2} f(x)\,dx}{\int_{x_1}^{x_2} f(x)\,dx}\right) \times 100\%, \quad x_1 \le x \le x_2    (9)

Here f(x) is the function of the curve we fit in figure 6, and x1 and x2 are the abscissae of its left and right end points.

The second type error in our model is letting non-conspirators be treated as conspirators. We describe the probability of the second type error by 'P2%' using formula (10):

P2\% = \left(1 - \frac{\int_{x_1}^{x} g(x)\,dx}{\int_{x_1}^{x_2} g(x)\,dx}\right) \times 100\%, \quad x_1 \le x \le x_2    (10)

Here g(x) is the function of the curve we fit in figure 7. We find that P2%, the probability of the second type error, decreases as 'x' increases.

If we change the value of 'x', the probabilities of the first and second type errors change too. At last we find that when x = 55 the value of P1% + P2% is the smallest, so we recommend that the police locate the discriminate line at the point whose abscissa is 55 in figure 5 [2-3].
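A Matlab sketch of this search (our own illustration; f and g are assumed to be function handles for the curves fitted to figures 6 and 7, and x1, x2 the ends of the fitted range):

bestX = x1;  bestErr = Inf;
for x = x1:x2
    P1 = (1 - integral(f, x, x2) / integral(f, x1, x2)) * 100;   % formula (9)
    P2 = (1 - integral(g, x1, x) / integral(g, x1, x2)) * 100;   % formula (10)
    if P1 + P2 < bestErr
        bestErr = P1 + P2;  bestX = x;                           % keep the best line so far
    end
end
% with the paper's data the minimum of P1% + P2% is reached at bestX = 55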
F. STEP 6

We try to find the boss of the crime group using the concept of point centrality from the study of social network analysis [4].

We know that Jerome, Dolores and Gretchen are the senior managers of the company. If one, two or even all three of them are in the list of conspirators, we can credibly conclude that the leader or leaders come from this group of three managers [5-7].

If none of them is in the primary priority list, the task becomes more complex. Assuming the crime group is isolated from the other groups, that is to say, it has little connection to the outside, we can focus only on its members. From the previous work we can obtain the criminal topics and their Sd(k), so we can calculate everyone's weight of point centrality by the same formula as formula (2). If someone's weight is much higher than the others', we can be confident that he is the leader [8-11].
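A Matlab sketch of this point-centrality scoring restricted to the identified group (our own illustration; msgs is assumed to be a struct array with fields from, to and topics, group is the vector of worker codes in the crime group, and Sd the topic weights):

centrality = zeros(83, 1);
for m = 1:numel(msgs)
    if ismember(msgs(m).from, group) && ismember(msgs(m).to, group)
        w = sum(Sd(msgs(m).topics));                                       % weight of this message's topics
        centrality(msgs(m).from + 1) = centrality(msgs(m).from + 1) + w;   % +1: worker codes start at 0
        centrality(msgs(m).to + 1)   = centrality(msgs(m).to + 1)   + w;
    end
end
[~, order] = sort(centrality(group + 1), 'descend');
leader = group(order(1));   % a member with a clearly dominant score is the probable leader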
III. SENSITIVITY ANALYSIS AND MODEL EVALUATION
In our models, the values of the weights in Table I and of Sd(k) are set by ourselves through perceptual knowledge and some experiments on the example of Investigation EZ. That is to say, the weights and Sd(k) have no fixed standard, so it is necessary for us to check how they affect our results.

There are 18 weights in our models, named A1, A2, A3, A4, A5, ..., A18, which can be seen in Table I. We choose A3 and A16 at random. With the original values, we get the priority list as follows (expressed by the codes of the unknown workers):
3, 32, 15, 37, 17, 40, 10, 81, 34, 22, 31, 13 ……
First, changing A3 from 4 to 5, we got the result:
3, 32, 15, 37, 40, 17, 10, 81, 4, 34, 31 ……
Then, changing the value of A16 from 1 to 2, we got the result:
3, 32, 15, 37, 17, 40, 10, 81, 34, 22, 31, 13 ……
The basic value of Sd(k) is defined to be 1 when topic k has nothing to do with crime; otherwise the topic is suspicious and Sd(k) takes a larger initial value. The problem is that we did not fix this maximum initial value; its value will influence the results of formula (2). In our models we defined it as 2. Now we analyze whether a small change of this value will affect our results. After such a change, we got the result:
3, 32, 17, 15, 10, 37, 81, 40, 22, 16, 34, 4, 44 ……
Observing these results carefully, we can conclude that the results are not sensitive to changes of the weights or of the maximum initial value of Sd(k). We can say that our models behave well in the sensitivity analysis.
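A small Matlab sketch of how such a comparison can be automated (our own illustration; buildPriorityList is a hypothetical helper that reruns Steps 1-4 with a given weight table A, and the similarity measure here is simply the overlap of the tops of the two lists):

base = buildPriorityList(A);                 % priority list with the original weight table
A(3) = 5;                                    % perturb one weight, e.g. A3: 4 -> 5
perturbed = buildPriorityList(A);
topN = 12;                                   % compare the heads of the two lists
similarity = numel(intersect(base(1:topN), perturbed(1:topN))) / topN * 100;
fprintf('similarity of the top %d entries: %.1f%%\n', topN, similarity);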
Now we evaluate the model, because we do not yet know whether it is stable and accurate. We suspect that our model relies on the initial conditions; if it relies on them too tightly, the model may not be trustworthy, because no one can guarantee the initial conditions. So we show how our model's result changes when the initial conditions change, considering only the conspirators at the top of the priority list. These are some extreme conditions:
Condition 1. Set the initial conditions as normal; this result serves as the basic standard.
Condition 2. Assume we cannot identify the conspirators;
Condition 3. Assume we cannot identify the non-conspirators;
Condition 4. Assume we cannot identify any of them;
Condition 5. Assume we can only identify some of the conspirators, such as 7, 18, 21, 43;
Condition 6. Assume we can only identify some of the non-conspirators, such as 0, 2, 48, 64 [12-13].
The results are listed in Table II.
Table II: Results under the different initial conditions

No.  Known criminals        Known non-criminals      Criminals (priority list)     Similarity  Time to count
1    7,18,21,43,49,54,67    0,2,48,64,65,68,74,78    7,18,21,43,49,54,67,3,32,…    —           28
2    —                      0,2,48,64,65,68,74,78    67,21,54,7,3,43,81,49,10,…    89.5%       30
3    7,18,21,43,49,54,67    —                        7,18,21,43,49,54,67,3,32,…    89.5%       31
4    —                      —                        21,67,54,7,3,43,32,2,18,1…    84.2%       33
5    7,18,21,43             0,2,48,64,65,68,74,78    7,18,21,43,67,54,3,49,17,…    94.7%       29
6    7,18,21,43,49,54,67    0,2,48,64                7,18,21,43,49,54,67,3,17,…    94.7%       25

Through the analysis of the table, we conclude the following: (1) the initial conditions about known conspirators and known non-conspirators affect the results, but the effect is tolerable; (2) the more accurate the initial conditions are, the faster and the more accurate the results are; (3) more initial conditions mean more accuracy and less computing time (especially for a large database), but they also cost more effort to obtain; (4) our model has great stability, so it can be used widely and shows strong adaptability [14].
REFERENCES
[1] Zhang Defeng. “MATLAB probability and mathematical
statistics analysis”. Mechanical industry press 2010.1
[2] LI Liang, Zhu Qianghua. “DSNE: a new dynamic social
network analysis algorithm”. “Journal of Jilin University”
(Engineering and Technology Edition) Mar.2008, Vol.38
No.2
[3] Guo Liya, Zhu Yu. “Application of Social Network Analysis
on Structure and Interpersonal Character of Sports Team”
[J], “2005 China sports science and technology” Vol.41,
No.5, 10-13.
[4] Ma Qian, Guo Jingfeng. “A study of the pattern-based
clustering theories” [J]. Yanshan University March 2007
[5] Estevez P A, Vera P, Saito K. Selecting the Most Influential Nodes in Social Networks[C]//Proceedings of the International Joint Conference on Neural Networks. Orlando, Florida, USA: [s.n.], 2007: 12-17.
[6] Santos E E, Pan Long, Arendt D. An Effective Anytime Anywhere Parallel Approach for Centrality Measurements in Social Network Analysis[C]//Proceedings of the 2006 IEEE International Conference on Systems, Man, and Cybernetics. [S.l.]: IEEE Press, 2006.
[7] Kiss C, Scholz A, Bichler M. Evaluating Centrality
Measures in Large Call Graphs[C]//Proceedings of the 8th
IEEE International Conference On E-Commerce
Technology and the 3rd IEEE International Conference on
Enterprise Computing, E-Commerce, and E-Services.
Washington D.C., USA: IEEE Computer Society Press,
2006.
[8] Klovdahl AS, Potterat JJ, Woodhouse DE. Social networks
and infectious diseases: the Colorado Springs Study. Soc
Sci Med 1994;38:79–99.
[9] Klovdahl AS. Social networks and the spread of infectious
diseases: the AIDS example. Soc Sci Med
1985;21:1203–16.
[10] Peiris JS, Yuen KY, Osterhaus AD. The severe acute
respiratory syndrome. N Engl J Med 2003;349:2431–41.
[11] Svoboda T, Henry B, Shulman L. Public health measures to
control the spread of the severe acute respiratory syndrome
during the outbreak in Toronto. N Engl J Med 2004;350:
2352–61.
[12] Marsden PV. Egocentric and sociometric measures of
network centrality. Soc Networks 2002;24:407–22.
[13] Newman ME. Properties of highly clustered networks.
Phys Rev E Stat Nonlin Soft Matter Phys 2003;68:026121.
Epub: Aug 21, 2003. (DOI: 10.1103/PhysRevE.68.026121).
[14] Anderson RM, May RM. Infectious diseases of humans:
dynamics and control. Oxford, United Kingdom: Oxford
University Press, 1992.