Modeling for Crime Busting

Da-wei Sun¹, Xia-yang Zheng¹, Zi-jun Chen¹, Hong-min Wang¹
¹ Department of Electrical and Electronic Engineering, North China Electric Power University, Beijing, China (dddd129216713@hotmail.com)

Abstract - The paper builds models to identify the people in an 83-worker company who are most likely to be conspirators. The line of reasoning is: (1) obtain a priority list that values each worker's degree of suspicion; (2) obtain a line separating conspirators from non-conspirators; (3) identify the leader of the conspiracy. The paper first assigns different suspicion values to messages with various features in order to evaluate the suspicion degree of every worker. Secondly, we optimize the primary figure using a formula based on the weighted-average method. Thirdly, we work through the individuals on the improved priority list from both ends. Then the paper uses semantic-analysis methods to better distinguish possible conspirators from the others and finally obtains the priority list. Next, the discriminate line is determined using probability theory and cluster analysis. Finally, the leaders are identified from the priority list and the discriminate line.

Keywords - mathematical model, crime busting, social network, text analysis

I. RESTATEMENT OF THE PROBLEM

The present paper investigates a conspiracy to commit a criminal act. We know that a conspiracy is taking place to embezzle funds from the company and to use internet fraud to steal funds from the credit cards of people who do business with the company. All we know is that there are 83 people, 400 messages (sent among the 83 people), 15 topics (3 of which have been deemed suspicious), 7 known conspirators, 8 known non-conspirators, and three senior managers in the company.

II. MODEL AND RESULTS

A. STEP 1

Our goal is to build a table describing the suspicion degree of different messages, and then to obtain a preliminary priority list on the basis of that table. We consider that every message connects two workers. According to the suspicion degree of the message's topic, we add a reasonable weight to each of the two workers. The weight is related not only to the suspicion degree of the message's topic, but also to the suspicion degree of the speaker and the listener.

Y_i = \sum_{m=1}^{N} R_{im} Q_m        (1)

Y_i is an intermediate variable for worker i and will be used in formula (2). N is the total number of messages; in our case N = 400. The subscript m is the number of a message. Q_m is the weight contributed by message m; its value is given by Table I. R_{im} describes the relation between message m and worker i: if message m is sent from or sent to worker i, then R_{im} = 1; otherwise R_{im} = 0.

Table I: Values of Q_m ('c.' means conspirator; the label (A#) identifies the cell and is referred to in the sensitivity analysis)

                         Suspicious topics                       Normal topics
Spoken from \ Spoken to  Known c.   Unknown    Known non-c.      Known c.   Unknown    Known non-c.
Known c.                 15 (A11)   10 (A1)    2 (A12)           4 (A15)    2 (A16)    1 (A16)
Unknown worker           10 (A2)    6 (A3)     4 (A4)            2 (A17)    1 (A8)     0 (A9)
Known non-c.             2 (A13)    4 (A15)    0 (A14)           1 (A17)    0 (A10)    0 (A18)

This table gives the value of Q_m used in formula (1) and is one of the foundations of the model. We can see that Q_m is determined by three factors: the speaker, the listener and the topic of message m. We first use Y_i to represent the suspicion degree of every worker. The primary priority list is the result, obtained by traversing all messages in MATLAB.
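The authors carried out this traversal in MATLAB; the following Python sketch illustrates the same computation under assumed inputs. The message format (sender, receiver, topic), the helper names primary_scores, worker_role and load_messages, and the placeholder set of suspicious topic numbers are illustrative assumptions, not the authors' actual data structures; the known conspirator and non-conspirator codes are taken from Table II below.

# A minimal sketch of formula (1): Y_i = sum over messages of R_im * Q_m.
# Worker codes are assumed to run 0..82, as in the paper's worker lists.

N_WORKERS = 83
SUSPICIOUS_TOPICS = {7, 11, 13}   # placeholder for the 3 suspicious topic numbers (not given here)
KNOWN_CONSPIRATORS = {7, 18, 21, 43, 49, 54, 67}
KNOWN_NON_CONSPIRATORS = {0, 2, 48, 64, 65, 68, 74, 78}

# Q[(topic_kind, sender_role, receiver_role)] -> weight, following Table I.
# Roles: 'c' known conspirator, 'u' unknown worker, 'n' known non-conspirator.
Q = {
    ('susp', 'c', 'c'): 15, ('susp', 'c', 'u'): 10, ('susp', 'c', 'n'): 2,
    ('susp', 'u', 'c'): 10, ('susp', 'u', 'u'): 6,  ('susp', 'u', 'n'): 4,
    ('susp', 'n', 'c'): 2,  ('susp', 'n', 'u'): 4,  ('susp', 'n', 'n'): 0,
    ('norm', 'c', 'c'): 4,  ('norm', 'c', 'u'): 2,  ('norm', 'c', 'n'): 1,
    ('norm', 'u', 'c'): 2,  ('norm', 'u', 'u'): 1,  ('norm', 'u', 'n'): 0,
    ('norm', 'n', 'c'): 1,  ('norm', 'n', 'u'): 0,  ('norm', 'n', 'n'): 0,
}

def worker_role(w):
    if w in KNOWN_CONSPIRATORS:
        return 'c'
    if w in KNOWN_NON_CONSPIRATORS:
        return 'n'
    return 'u'

def primary_scores(messages):
    """messages: iterable of (sender, receiver, topic) tuples."""
    Y = [0.0] * N_WORKERS
    for sender, receiver, topic in messages:
        kind = 'susp' if topic in SUSPICIOUS_TOPICS else 'norm'
        q = Q[(kind, worker_role(sender), worker_role(receiver))]
        # R_im = 1 for both endpoints of the message, so both accumulate Q_m.
        Y[sender] += q
        Y[receiver] += q
    return Y

# Usage: rank workers by Y_i to obtain the primary priority list.
# messages = load_messages()   # hypothetical loader
# Y = primary_scores(messages)
# priority = sorted(range(N_WORKERS), key=lambda i: Y[i], reverse=True)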
Fig. 1: primary priority figure.

It is obvious that there are two special points, (20, 12) and (56, 58). Our goal is to distinguish conspirators from non-conspirators, so in theory there should be a single special point dividing the figure into two parts. We are therefore not satisfied with this figure; there must be other factors that we have not yet taken into consideration.

B. STEP 2

Our goal is to optimize the priority figure obtained in Step 1 using the weighted-average method. We think that Y_i (defined in formula (1)) is not by itself a reliable measure of the suspicion degree. Through a formula we create, we evaluate the suspicion degree of each worker by a weighted average:

W_i = \frac{\sum_{k=1}^{M} Sd(k)\,A_i(k)}{A_i \cdot Sd_{max}} \cdot Y_i        (2)

W_i is the crime-suspicion degree of worker i. k is the topic number and M is the total number of topics; in our case M = 15. Y_i is given by formula (1). Sd(k) stands for the linguistic influence of topic k. Before Step 4, Sd(k) = 1 when topic k is not suspicious and Sd(k) = 2 when it is suspicious; we will optimize Sd(k) in Step 4 and finally give different values for different topics. Sd_max is defined by

Sd_{max} = \max\{Sd(1), Sd(2), \ldots, Sd(M)\}        (3)

A_i is the total number of text-topic occurrences in the messages worker i is involved in, and A_i(k) is the number of those occurrences concerning topic k. For example, if David receives only one message, sends none, and that message covers three text topics, then A_David = 3. That is to say,

A_i = \sum_{k=1}^{M} A_i(k)        (4)

We obtain W_i, the crime-suspicion degree of worker i, by searching all messages, and then get the priority list by ranking W_i.

Fig. 2: better priority figure.

Figure 2 is better than Figure 1 because there is only one inflection point. However, the slopes on the two sides of the inflection point do not differ greatly, so we optimize the priority figure further in Step 3.

C. STEP 3

Our goal is to optimize the priority figure obtained in Step 2 using an iteration method. Steps 1 and 2 share a disadvantage: we did not consider the differences among the unknown workers. In fact, some of the unknown workers are criminals and the others are not, so they should be treated differently. In other words, different unknown workers have different influences on the other workers' W_i, and Step 3 takes these different influences into account to optimize the priority list.

In Figure 1, the worker at point (1, 3) and the worker at point (68, 93) are treated differently in this step: we treat the worker at point (68, 93) as a known criminal and the worker at point (1, 3) as an innocent, and then apply the method of Step 1 again, which yields another priority list. However, we think it is unconvincing to peel off exactly one unknown worker as a criminal and one as an innocent in each round, because in Figure 1 the worker at point (68, 93) differs markedly in crime-suspicion degree from its neighbour, whereas the worker at point (1, 3) differs only slightly from its neighbour. We therefore fit a curve to the 67 points and use the slopes of the tangents at the two end vertices to decide how many unknown workers should be treated as criminals and how many as innocents in each round. Repeating this procedure, we eventually place all unknown workers into two categories, criminals and innocents, and obtain a priority list.

Fig. 4: final priority figure.
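The scores used throughout Steps 2 and 3 come from formula (2); a minimal Python sketch of that computation is given below (Step 3 simply re-runs it after relabeling some unknown workers). The function name weighted_scores and the assumption that the per-worker topic counts A_i(k) have already been tallied from the messages are illustrative, not the authors' implementation.

# Sketch of formulas (2)-(4): re-weight Y_i by the topics worker i is involved in.
# topic_counts[i][k] plays the role of A_i(k); Sd[k] is the per-topic degree Sd(k).

def weighted_scores(Y, topic_counts, Sd):
    """Y: list of Y_i from formula (1); topic_counts: 83 x 15 counts A_i(k);
    Sd: list of 15 per-topic suspicion degrees Sd(k)."""
    Sd_max = max(Sd)                                  # formula (3)
    W = []
    for i, counts in enumerate(topic_counts):
        A_i = sum(counts)                             # formula (4)
        if A_i == 0:
            W.append(0.0)                             # worker involved in no messages
            continue
        weighted = sum(Sd[k] * counts[k] for k in range(len(Sd)))
        W.append(weighted / (A_i * Sd_max) * Y[i])    # formula (2)
    return W

# Before Step 4 the paper simply uses Sd(k) = 2 for the suspicious topics and
# Sd(k) = 1 otherwise; Step 4 later replaces these with text-analysis-based values.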
D. STEP 4

Our goal in this step is to consult the literature on text-analysis methods and use them to optimize the priority figure. We let Sd(k) denote the suspicion degree of topic k, and we consider that topics discussed more frequently within the group of conspirators should receive a larger Sd(k). We use formulas (5) and (6) to describe Sd(k):

Sd(k)_{i+1} = \frac{Sd_{max,i}\,\big(\varphi(k)_i - \varphi_{ave,i}\big)}{10\,\varphi_{max,i}} + Sd(k)_i        (5)

\varphi(k) = \frac{P(k)_h / h}{P(k)_j / j}        (6)

P(k)_h is the number of times topic k is talked about by criminals and P(k)_j the number of times it is talked about by all people; h is the number of criminals and j is the total number of people, so j = 83. φ(k) therefore describes how frequent topic k is in the conversations of criminals relative to the whole company. We further define φ_max,i = max{φ(k)_i}, φ_ave,i = average{φ(k)_i} and Sd_max,i = max{Sd(k)_i}. In Step 3 we repeatedly elect unknown workers to be criminals or innocents, so we set up a loop in MATLAB, and the subscript i denotes the iteration number.

Using this text-analysis idea to value Sd(k), and thereby to update W_i (the crime-suspicion degree of worker i), we obtain the final priority figure with linguistics, from which we read off our final priority list.

Fig. 5: final priority figure with linguistics.

E. STEP 5

In Step 5 our goal is to locate a discriminate line that distinctly categorizes the unknown workers, using the ideas of cluster analysis and hypothesis testing. We define a variable AW1 to describe the degree of conspiring of the conspirator group:

AW1_x = \frac{\sum_{i=x}^{67} W_i + W_{83}\cdot K}{(83 - x) + K}        (7)

AW1_x is the average weight of the conspirator group and stands for its degree of conspiring. x is the abscissa of the point where the discriminate line is located. W_i is defined in formula (2). K is the number of known conspirators; we consider the suspicion degrees W_i of all known conspirators to be equal, with the common value taken as that of the rightmost point in Figure 5. Figure 6 shows how AW1_x changes as x grows.

Fig. 6: AW1_x as a function of x.

For the same reason, we define AW2 to describe the degree of conspiring of the non-conspirator group:

AW2_x = \frac{\sum_{i=1}^{x} W_i + W_{1}\cdot L}{x + L}        (8)

L is the number of known non-conspirators; we consider the suspicion degrees W_i of all known non-conspirators to be equal, with the common value taken as that of the leftmost point in Figure 5. Figure 7 shows how AW2_x changes as x grows.

Fig. 7: AW2_x as a function of x.

We finally give different discriminate lines and evaluate them according to the probabilities of the first-type and second-type errors, to fit the different requirements of the police. The first-type error in our model is letting conspirators get away with the crime; we describe its probability P1% using formula (9):

P1\% = \Big(1 - \frac{\int_{x}^{x_2} f(t)\,dt}{\int_{x_1}^{x_2} f(t)\,dt}\Big) \times 100\%, \qquad x_1 \le x \le x_2        (9)

Here f is the curve fitted in Figure 6, and x_1 and x_2 are the abscissae of its leftmost and rightmost points. The second-type error in our model is treating non-conspirators as conspirators; we describe its probability P2% using formula (10):

P2\% = \Big(1 - \frac{\int_{x_1}^{x} g(t)\,dt}{\int_{x_1}^{x_2} g(t)\,dt}\Big) \times 100\%, \qquad x_1 \le x \le x_2        (10)

Here g is the curve fitted in Figure 7. We find that P2%, the probability of the second-type error, decreases as x increases. Changing the value of x changes both error probabilities, and we find that P1% + P2% is smallest when x = 55. We therefore recommend that the police locate the discriminate line at the point whose abscissa is 55 in Figure 5 [2-3].
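The search for the best discriminate line is essentially a one-dimensional scan: for each candidate x, integrate the fitted curves and add the two error probabilities. A small Python sketch of that scan, under the reconstructed formulas (9) and (10), is given below; the helper names best_discriminate_line and _integral and the use of a trapezoidal rule are assumptions of this illustration rather than the authors' procedure.

# Sketch of formulas (9)-(10) and the choice of the discriminate line:
# scan candidate abscissae x and keep the one minimizing P1% + P2%.

def _integral(func, a, b, steps=1000):
    # simple trapezoidal rule; any quadrature would do
    h = (b - a) / steps
    total = 0.5 * (func(a) + func(b))
    for n in range(1, steps):
        total += func(a + n * h)
    return total * h

def best_discriminate_line(f, g, x1, x2):
    """f, g: curves fitted to Figures 6 and 7; x1, x2: end abscissae."""
    f_total = _integral(f, x1, x2)
    g_total = _integral(g, x1, x2)
    best_x, best_err = None, float('inf')
    for x in range(int(x1), int(x2) + 1):
        p1 = (1 - _integral(f, x, x2) / f_total) * 100    # formula (9)
        p2 = (1 - _integral(g, x1, x) / g_total) * 100    # formula (10)
        if p1 + p2 < best_err:
            best_x, best_err = x, p1 + p2
    return best_x, best_err

# The paper reports that, with its fitted curves, the minimum occurs at x = 55.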
F. STEP 6

We try to find the boss of the crime group using the concept of point centrality from social network analysis [4]. We know that Jerome, Dolores and Gretchen are the senior managers of the company. If one, two, or even all three of them appear in the list of conspirators, we can credibly conclude that the leader or leaders come from this group of three managers [5-7]. If none of them appears in the priority list, the situation becomes more complex. Assuming the crime group is isolated from the other groups, that is, it has little connection to the outside, we can focus only on its members. From the previous work we can obtain the criminal topics and their Sd(k), so we can calculate everyone's point-centrality weight with the same formula as formula (2). If someone's weight is much higher than the others', we can be confident that he is the leader [8-11].
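Because the leader is identified by re-scoring workers within the suspected group, a compact sketch is possible. In the fragment below the centrality is approximated by an Sd(k)-weighted count of criminal-topic messages exchanged inside the group; this, and the function name likely_leader, are illustrative simplifications rather than the authors' exact computation.

# Illustrative sketch of Step 6: rank members of the suspected group by an
# Sd(k)-weighted count of the criminal-topic messages they exchange inside
# the group, and propose the top-scoring member as the likely leader.

def likely_leader(messages, group, Sd, criminal_topics):
    """messages: (sender, receiver, topic) tuples; group: set of suspected
    conspirators; Sd: dict topic -> Sd(k); criminal_topics: set of topic numbers."""
    weight = {w: 0.0 for w in group}
    for sender, receiver, topic in messages:
        if topic not in criminal_topics:
            continue
        if sender in group and receiver in group:    # internal communication only
            weight[sender] += Sd[topic]
            weight[receiver] += Sd[topic]
    return max(weight, key=weight.get), weight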
III. SENSITIVITY ANALYSIS AND MODEL EVALUATION

In our models, the weights in Table I and the values of Sd(k) are set by ourselves through intuition and some experiments on the Investigation EZ example. That is to say, the weights and Sd(k) have no fixed standard, so it is necessary to check how they affect our results. There are 18 weights in our models, named A1, A2, A3, ..., A18, shown in Table I. We choose A3 and A16 at random. With the original values we get the following priority list (expressed by the codes of the unknown workers):

3, 32, 15, 37, 17, 40, 10, 81, 34, 22, 31, 13 ……

First, changing A3 from 4 to 5, we got the result:

3, 32, 15, 37, 40, 17, 10, 81, 4, 34, 31 ……

Then, changing A16 from 1 to 2, we got the result:

3, 32, 15, 37, 17, 40, 10, 81, 34, 22, 31, 13 ……

The basic value of Sd(k) is defined to be 1 when topic k has nothing to do with crime, and a larger value when the topic is suspicious. The problem is that we did not fix this maximum initial value, which influences the results of formula (2); in our models we set it to 2. We now analyse whether a small change of this value affects our results. Changing it slightly, we got the result:

3, 32, 17, 15, 10, 37, 81, 40, 22, 16, 34, 4, 44 ……

Observing these results carefully, we conclude that the results are not sensitive to changes in the weights or in the maximum initial value of Sd(k). We can say that our models behave well in the sensitivity analysis.

We next evaluate the model, because we do not yet know whether it is stable and accurate. Our model seems to rely on the initial conditions; if it relied on them too tightly it would not be trustworthy, because no one can guarantee the initial conditions. We therefore show how the model's result changes when the initial conditions change, considering only the conspirators at the top of the priority list. The following extreme conditions are examined, and the results are collected in Table II.

Condition 1. Set the initial conditions as normal; this result serves as the baseline.
Condition 2. Assume we cannot identify the conspirators.
Condition 3. Assume we cannot identify the non-conspirators.
Condition 4. Assume we cannot identify any of them.
Condition 5. Assume we can identify only some of the conspirators, such as 7, 18, 21, 43.
Condition 6. Assume we can identify only some of the non-conspirators, such as 0, 2, 48, 64 [12-13].

Table II: The results.
No. | Known criminals       | Known non-criminals    | Criminals found               | Similarity | Time to count
1   | 7,18,21,43,49,54,67   | 0,2,48,64,65,68,74,78  | 7,18,21,43,49,54,67,3,32,…    | —          | 28
2   | —                     | 0,2,48,64,65,68,74,78  | 67,21,54,7,3,43,81,49,10,…    | 89.5%      | 30
3   | 7,18,21,43,49,54,67   | —                      | 7,18,21,43,49,54,67,3,32,…    | 89.5%      | 31
4   | —                     | —                      | 21,67,54,7,3,43,32,2,18,1…    | 84.2%      | 33
5   | 7,18,21,43            | 0,2,48,64,65,68,74,78  | 7,18,21,43,67,54,3,49,17,…    | 94.7%      | 29
6   | 7,18,21,43,49,54,67   | 0,2,48,64              | 7,18,21,43,49,54,67,3,17,…    | 94.7%      | 25

Through the analysis of the table, we draw several conclusions: (1) the initial conditions concerning known conspirators and known non-conspirators affect the results, but the effect is tolerable; (2) the more accurate the initial conditions are, the faster and more accurate the results are; (3) more initial conditions mean more accuracy and less time (especially for a large database), although they also cost more effort to obtain; (4) our model is quite stable, so it can be used widely and shows strong adaptability [14].

REFERENCES

[1] Zhang Defeng. MATLAB Probability and Mathematical Statistics Analysis. Mechanical Industry Press, 2010.
[2] Li Liang, Zhu Qianghua. "DSNE: a new dynamic social network analysis algorithm". Journal of Jilin University (Engineering and Technology Edition), Vol. 38, No. 2, Mar. 2008.
[3] Guo Liya, Zhu Yu. "Application of social network analysis on structure and interpersonal character of sports team". China Sport Science and Technology, Vol. 41, No. 5, 2005, pp. 10-13.
[4] Ma Qian, Guo Jingfeng. "A study of the pattern-based clustering theories". Yanshan University, March 2007.
[5] Estevez P A, Vera P, Saito K. "Selecting the most influential nodes in social networks". Proceedings of the International Joint Conference on Neural Networks. Orlando, Florida, USA, 2007, pp. 12-17.
[6] Santos E E, Pan Long, Arendt D. "An effective anytime anywhere parallel approach for centrality measurements in social network analysis". Proceedings of the 2006 IEEE International Conference on Systems, Man, and Cybernetics. IEEE Press, 2006.
[7] Kiss C, Scholz A, Bichler M. "Evaluating centrality measures in large call graphs". Proceedings of the 8th IEEE International Conference on E-Commerce Technology and the 3rd IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services. Washington D.C., USA: IEEE Computer Society Press, 2006.
[8] Klovdahl A S, Potterat J J, Woodhouse D E. "Social networks and infectious diseases: the Colorado Springs study". Soc Sci Med 1994;38:79-99.
[9] Klovdahl A S. "Social networks and the spread of infectious diseases: the AIDS example". Soc Sci Med 1985;21:1203-16.
[10] Peiris J S, Yuen K Y, Osterhaus A D. "The severe acute respiratory syndrome". N Engl J Med 2003;349:2431-41.
[11] Svoboda T, Henry B, Shulman L. "Public health measures to control the spread of the severe acute respiratory syndrome during the outbreak in Toronto". N Engl J Med 2004;350:2352-61.
[12] Marsden P V. "Egocentric and sociometric measures of network centrality". Soc Networks 2002;24:407-22.
[13] Newman M E. "Properties of highly clustered networks". Phys Rev E Stat Nonlin Soft Matter Phys 2003;68:026121. Epub 2003 Aug 21 (DOI: 10.1103/PhysRevE.68.026121).
[14] Anderson R M, May R M. Infectious Diseases of Humans: Dynamics and Control. Oxford, United Kingdom: Oxford University Press, 1992.