PSU Research Proposal

Title: A Toolbox for Arabic Text Mining
Department: Computer Science
PI Name: Ahmed Sameh
Duration: 1 Year
Budget Est.: SR 55,000
Date: 12/20/2010

I - PROPOSAL

I-1: PROPOSAL TITLE (Provide a short descriptive title, give prominence to keywords)
A Toolbox for Arabic Text Mining

I-2: COMMERCIAL POTENTIAL
Could this project have commercial potential? (Select one) Yes / No
If yes, briefly elaborate on the commercial potential:

I-3: CHECK-LIST
Have you checked to ensure all questions in the application form have been answered?
Have you checked to ensure you have included the correct costs in your budget?
The principal investigator and all co-principal investigators should sign.

I-4: PERSONNEL AND AUTHORIZATION

PRINCIPAL INVESTIGATOR [PI]
Full Name: Ahmed Sameh
Academic Rank: Professor
College: CIS
Department: Computer Science
Telephone: 494-8524 Ext: X8524
Mobile: 0544299846
E-Mail: asameh@cis.psu.edu.sa
Signature:
Date: 12/20/2010

CO-INVESTIGATOR(S) [CIs] (non-PSU CIs permitted)

1) Full Name: Mona Diab
Academic Rank: Assistant Professor
Department: Linguistics Department & Natural Language Processing Group, Stanford University
E-Mail:
Telephone:
Mobile:
Signature:
Date: / /

2) Full Name: NourelDean Soufian
Academic Rank: Assistant Professor
College: CIS
Department: Computer Science
E-Mail:
Telephone:
Mobile:
Signature:
Date: / /

3) Full Name: Mohamed Tounsi
Academic Rank: Associate Professor
College: CIS
Department: Computer Science
E-Mail:
Telephone:
Mobile:
Signature:
Date: / /

II - DESCRIPTION

II-1: ABSTRACT (Provide a statement of the project - maximum 200 words)
Text Mining refers to the process of deriving high-quality information from text.
High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features, the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). Natural language processing (NLP) for the Arabic language has struggled over the years. Very little has been done in terms of producing powerful tools for Arabic processing. In fact, it is feared that Arabic will come to be regarded as a language of the past, as very many new terms and names in the modern world have no equivalents in Arabic. This problem has developed over the years because Arabic linguistics researchers are far removed from modern technological tools and are not willing to collaborate with information technology researchers. This lack of communication and collaboration has led to the current state of affairs in NLP for Arabic text. Arabic text mining is far behind English text mining. English text mining has powerful algorithms and tools in the areas of text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
This research proposal will try to rectify this situation by developing an Arabic toolbox that covers basic algorithms comparable to their English counterparts. The toolbox will contain algorithms for categorization and classification, clustering and grouping of related documents, concept extraction, production of taxonomies, a Wordnet for verbs, nouns, and adjectives, a simple dictionary, sentiment analysis, and document summarization. The toolbox will be web based, with a background database of documents and related resources. An Arabic stemmer will be developed along with a tagging algorithm. Small sample implementations of some of these algorithms are demonstrated in this proposal. These initial results demonstrate the capabilities of the current team.

II-2: PROJECT GOALS AND OBJECTIVES
The specific goals of this project are to demonstrate the power of text mining within the Arabic language in:

-Concept Mining: Concept mining is an activity that results in the extraction of concepts from a set of documents. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because documents are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance, and similarity of documents. Traditionally, the conversion of words to concepts has been performed using a thesaurus, and computational techniques tend to do the same. The thesauri used are either specially created for the task or based on a pre-existing language model, usually related to Princeton's WordNet. The mappings of words to concepts are often ambiguous. Typically each word in a given language relates to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available.
Machine translation systems cannot easily infer context. For the purposes of concept mining, however, these ambiguities tend to be less important than they are in machine translation, because in large documents the ambiguities tend to even out, much as is the case with text mining. Many disambiguation techniques may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information inferred from large text corpora. Recently, techniques based on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.

-Arabic Wordnet: The Arabic Wordnet will be a lexical database for the Arabic language, modeled on Princeton's WordNet for English. It groups Arabic words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. The purpose is twofold: to produce a combination of dictionary and thesaurus that is more intuitively usable, and to support automatic text analysis and artificial intelligence applications. The Princeton WordNet database and software tools have been released under a BSD-style license and can be downloaded and used freely. The database can also be browsed online. WordNet was created and is being maintained at the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George A. Miller. Development began in 1985. Over the years, the project received funding from government agencies interested in machine translation. As of 2011, to our knowledge, WordNet does not have a complete Arabic version; Arabic may be one of the few languages without one. This project will build one for verbs only.

-Arabic Dictionary: This is an online interactive Arabic dictionary and thesaurus that helps users find the meanings of words and draw connections to associated words. The meaning of each word can be seen by simply placing the mouse cursor over it.
Based on the Arabic WordNet we will develop an Arabic dictionary. Our goals are for the dictionary to:
o Be an easy-to-use dictionary and thesaurus.
o Show how words associate in a visually interactive display.
o Give ideas to help write content for a blog, article, or thesis, or simply to play with words.
o Place no limit on the number of searches, so that users can look up as many words as they need at any time.
The user just types words in the search box and clicks Go or simply hits Enter. Once the words branch off the main query, the user can double-click a node to find other related words. To explore the features:
o Place the mouse cursor over a word to view its meaning.
o Double-click a node from the branch to view other related words.
o Scroll the mouse wheel over words to zoom in or out; this helps to see more associations or to view words and meanings more clearly.
o Click and drag a word or branch to move it around and explore other branches.
The Words interface queries the Arabic WordNet lexical database, which is modeled on the WordNet developed by Princeton University and made available for students and language researchers. This dictionary groups synonyms into synsets through lexical relations between terms. These meanings and semantic relationships are revealed graphically by the interactive web technology made available by Snappy Words.

-Arabic Documents Classification: Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also semi-supervised document classification, where part of the documents are labeled by the external mechanism.
-Arabic Document Summarization: Automatic summarization is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text. The phenomenon of information overload has meant that access to coherent and correctly developed summaries is vital. As access to data has increased, so has interest in automatic summarization. An example of the use of summarization technology is search engines such as Google. Technologies that can make a coherent summary of any kind of text need to take into account several variables, such as length, writing style, and syntax, to make a useful summary.

III - INTRODUCTION

III-1: REVIEW AND ANALYSIS OF RELATED WORK
Labor-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to advance during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%) is currently stored as text, text mining is believed to have high commercial potential. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning. Currently there is very little previous work in Arabic text mining. English text mining, by contrast, has many algorithms and techniques. One of the directions that we will explore in this research is to borrow ideas from these algorithms and try to develop similar Arabic versions.

III-2: SIGNIFICANCE OF WORK
The Arabic language needs more work from all of us to stand as a living language and to cope with current advances in technology. Such an Arabic toolbox is much needed in this era of globalization.
IV - APPROACH AND METHODOLOGY

IV-1: METHODOLOGY
Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Recently, text mining has received attention in many areas. Many text mining software packages are marketed towards security applications, particularly analysis of plain text sources such as Internet news. It is also involved in the study of text encryption. One of the directions in this research is to adapt and modify selected English text mining tools (from the above web site) in order to produce their equivalent Arabic versions. The cross-validation method requires a very accurate English/Arabic translator that will provide input data to the algorithm/program conversion. Areas of investigation in this project include Arabic natural language processing and text mining of the Quran. The second objective is to strive to improve the quantity and quality of Arabic content in the area of “Data and Text Mining” on the Web. All published material from the Hub’s activities will be translated and reviewed by its author(s) to be available in an Arabic Digital Library.
A systematic plan to translate many “data mining” articles and store them in a searchable Arabic Digital Library will be developed. Text and multimedia mining tools will be used to explore the contents of this Arabic digital library and expose related and correlated paragraphs and sections, for the purpose of developing new Arabic text mining algorithms and enhancing existing ones. This brings in the other area of focus of the Hub, which is unstructured text mining. As for unstructured text mining: parallel to the Arabic digital library, an English Data Mining digital library with the same contents will also be developed. Both libraries will have a traditional search engine besides more elaborate classification and categorization capabilities. Further to this, text and multimedia mining tools will be used to explore the contents of the two digital libraries and expose related and correlated paragraphs and sections. Text mining is used to find interesting regularities in large textual digital libraries, where interesting means non-trivial, hidden, previously unknown, and potentially useful. Both Arabic and English text mining tools handle digital library text at the word level, sentence level, document level, document-collection level, linked-document-collection level, and application level. Most text mining methods rely on the fact that documents usually contain highly redundant data. Most of the tools make use of document summarization techniques, single-document graph visualization algorithms, segmentation algorithms, feature selection algorithms, similarity algorithms, clustering, and information extraction techniques. They also make use of several visualization techniques such as WebSOM, ThemeScape, graph-based visualization techniques, and tiling-based visualization techniques.
Statistical tools for text mining include Yale/RapidMiner word vector mining, UIMA by IBM, GATE, the AeroText suite, Attensity, Endeca Technologies, Inxight, and LanguageWare. Similar to what we provide for “Data Mining”, we propose the same vertical stacking of text mining, statistical, and visualization algorithms for performing text mining on both the English and the Arabic data mining digital libraries. This will provide an interesting context for researchers in the “Text mining” and “Arabization” fields to investigate how to improve the Arabic text mining algorithms using a cross reference to the English ones. A very interesting research direction can be developed there. For example, the same mining questions can be posed to both the English and the Arabic digital libraries and the results compared. Where the results differ, the differences become learning opportunities, and modifications and enhancements to the algorithms can be investigated. The two libraries will provide several ways and means for verification, validation, and cross-checking.

[Figure: Text, Testing Client, Engine]

Deliverables in Phase I: Beta Version I + its Benchmark + its Tuning
Deliverables in Phase II: Beta Version II + its Benchmark + its Tuning
Deliverables in Phase III: Beta Version III + its Benchmark + its Tuning
Deliverables in Phase IV: Final Version + User Manual

The following is the project plan schedule. It represents the different tasks within the research and the estimated duration of each.

[Gantt chart: project plan schedule]

IV-2: AVAILABLE RESOURCES
Currently there are some open source text mining algorithms that can be used as tools in some of the above investigations.

IV-3: EXPECTED RESULTS/OUTPUTS
The expected output from this project is a web-based Arabic toolbox that will contain basic algorithms for Arabic natural language text mining.
Some of the algorithms that will be provided under this tool are: text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities). The following are some initial results that we have already implemented in the domains of Arabic text categorization/classification, construction of an Arabic Wordnet for Arabic verbs, and development of an Arabic stemmer. In the following paragraphs we provide short descriptions and some screen shots of the developed tools.

The first tool is an Arabic text mining tool that classifies documents according to word statistics. As the number of Arabic documents published every day on the web and in other media has grown rapidly, the need to analyze and classify these documents has become important. The tool takes as input any document from the web or from newspapers and classifies it into one of a number of categories provided by the tool. These categories are:
o Economic paper
o Political paper
o Medical paper
o Religion paper
The general idea behind the tool is that it takes a document provided by the user as input. The tool then stores every distinct word used in the document, without excluding any word. After that, the tool computes the frequency of each stored word, which is then used as the statistic to classify the document. The tool uses a number of databases as training sets to classify the document. The tool does the following processing:
o Take the word with the highest frequency.
o Search for that word in each of the given databases.
o If the word is found in any one of the databases, stop and classify the document as the type of the database in which the word was found.
o If the word is not found in any database, take the word with the next-highest frequency and repeat the search.
o If none of the stored words is found in any of the databases, state that the document cannot be classified.
The tool takes all the words in the input document without excluding any of the commonly used words of the Arabic language. There is no need to detect and remove these common words, because they are not included in any of the databases the tool searches; the tool simply skips them after failing to find them. All the databases are in text file format and can easily be updated by the user to increase the size of the training set. In addition to the classification, the tool writes another text file containing all the stored words and their frequencies.

Further work: More databases can be added to the tool in order to have bigger training sets, which should give better results. The output text file containing all the words along with their frequencies can be integrated with other text mining tools for the statistics it provides. The tool can be enhanced by using the two or three highest-frequency words, instead of only the single highest-frequency word, to classify the document; this should result in better classification. The tool can also be modified as follows: instead of assigning the document the type of the database that contains the highest-frequency word, the tool can report, for each database, the percentage of occurrences of the stored words that are found in that database, as another statistical approximation. The user would then analyze the resulting percentages to better assign the document to one of the categories.
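The processing steps above, together with the percentage variant proposed under further work, can be sketched as follows. This is a minimal illustration and not the implemented tool: the function names, the in-memory sets standing in for the text-file databases, and the sample words are all assumptions made for this example.

```python
from collections import Counter


def word_frequencies(text):
    """Count every whitespace-separated word; no word is excluded."""
    return Counter(text.split())


def classify_first_hit(text, category_words):
    """Walk the words from most to least frequent and return the first
    category whose word set contains one of them (None if no word matches).
    If a word occurs in several categories, dictionary order decides."""
    for word, _count in word_frequencies(text).most_common():
        for category, words in category_words.items():
            if word in words:
                return category
    return None  # the document cannot be classified


def category_percentages(text, category_words):
    """Further-work variant: share of matched word occurrences per category."""
    freqs = word_frequencies(text)
    hits = {
        category: sum(n for w, n in freqs.items() if w in words)
        for category, words in category_words.items()
    }
    total = sum(hits.values())
    return {c: (100.0 * h / total if total else 0.0) for c, h in hits.items()}


# Hypothetical miniature "databases"; the real ones are user-editable text files.
CATEGORY_WORDS = {
    "economic": {"بنك", "اقتصاد"},
    "medical": {"دواء", "طبيب"},
}

if __name__ == "__main__":
    sample = "بنك في بنك و دواء"
    print(classify_first_hit(sample, CATEGORY_WORDS))
    print(category_percentages(sample, CATEGORY_WORDS))
```

Common words such as "في" need no special handling here either: they are absent from every word set, so the loop simply moves on to the next word.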
The following are screen shots from the tool:

[Screen shots of the classification tool]

The second tool implemented is a sample Arabic Wordnet dictionary. It deals only with verbs. The tool provides a Wordnet for Arabic words only. It can be used to understand the meaning of the given words, which serves many purposes such as classification, clustering, and summarization of a text. The tool takes as input any document that contains Arabic words only. All the words in the input file are nouns, and all of them are of the form "فعل". The output is another file containing each word and all of its synonyms.

Method Used: The general idea behind the tool is to take a file containing only the words whose synonyms are to be looked up. The tool takes these words one by one. For each word, it searches another file containing groups of words, where each group contains words with the same meaning. When the tool finds the target word in one of the groups, it returns that group and stores the target word followed by all of its synonyms in an output file. If the target word is not found in any group, the tool still writes it to the output file, with a note that no related word was found.

Further work: We can expand the training set so that a meaning or a synonym can be found for any input word. We can also build other training sets for verbs and for other Arabic words that are neither nouns nor verbs. The output file is formatted so that it is easy to integrate with other text mining tools and to use for other purposes such as classification, clustering, and text summarization. The tool currently searches for a word using sequential search, because the training set is small. However, it would be better to enhance the tool with a faster search algorithm.
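As a rough sketch of the lookup described under Method Used, the following replaces the sequential search with a dictionary index, which is one possible faster search and not necessarily the one the tool will adopt. The function names, the output format, and the two sample synonym groups are assumptions for illustration only.

```python
def build_index(groups):
    """Map each word to the synonym groups that contain it."""
    index = {}
    for group in groups:
        for word in group:
            index.setdefault(word, []).append(group)
    return index


def lookup_synonyms(words, groups):
    """Return {word: [synonyms]}; a word found in no group gets an empty list."""
    index = build_index(groups)
    results = {}
    for word in words:
        synonyms = []
        for group in index.get(word, []):
            for other in group:
                if other != word and other not in synonyms:
                    synonyms.append(other)
        results[word] = synonyms
    return results


def format_output(results):
    """One line per input word, noting when no related word was found."""
    lines = []
    for word, synonyms in results.items():
        if synonyms:
            lines.append(word + ": " + ", ".join(synonyms))
        else:
            lines.append(word + ": (no related word found)")
    return "\n".join(lines)


# Hypothetical synonym groups; the real groups are read from a data file.
SYNONYM_GROUPS = [
    ["ذهب", "مضى", "رحل"],
    ["جلس", "قعد"],
]

if __name__ == "__main__":
    print(format_output(lookup_synonyms(["جلس", "كتب"], SYNONYM_GROUPS)))
```

Building the index costs one pass over the groups; each lookup afterwards does not depend on the number of groups, which matters once the data files grow.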
The need for a faster search will grow if the training set is enlarged or if further data files for verbs and other types of Arabic words are added. The following are screen shots from the tool:

[Screen shots of the Wordnet tool]

The third tool is an Arabic stemmer. The following is a description with screen shots. In data mining and other fields, stemming refers to the process of reducing inflected (or sometimes derived) words to their stem, base, or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. The first published stemmer was written by Julie Beth Lovins in 1968. Stemmers are commonly used in information retrieval and in commercial products, and they are common elements in query systems such as Web search engines.

Description of the Tool: The tool is an Arabic word stemmer adapted from the Arabic Stemmer by Shereen Khoja, in the version by Motaz K. Saad. The tool takes as input a file containing Arabic text and words, and reads and stores all the words in the input file. While reading the file, the tool removes items that usually cannot be stemmed because they are not important, such as numbers (written in letters), special characters, and symbols. For each word of interest, the tool performs the following checks:
o Check if the word consists of two letters.
o Check if the word consists of three letters.
o Check if the word consists of four letters.
o Check if the word is a pattern.
o Check for a definite article.
o Check for the prefix.
o Check for suffixes.
The tool uses a large database consisting of the stems of most commonly used Arabic words found in news articles, magazines, websites, etc. The tool is implemented in Java and is integrated with the Weka tool in order to stem.
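To make the affix checks listed above concrete, here is a deliberately simplified light-stemming sketch. It is not the Khoja algorithm and not the Java/Weka tool: it omits the pattern matching and the stem database entirely, and the affix lists, the minimum stem length, and all names are assumptions chosen for this illustration.

```python
import re

# Tokens made only of Arabic letters; numbers and symbols are skipped.
ARABIC_WORD = re.compile(r"^[\u0621-\u064A]+$")

# Illustrative subsets only; a real stemmer uses much fuller lists.
STOP_WORDS = {"في", "من", "إلى", "الى", "بعد", "ضد", "إن", "أن"}
DEFINITE_ARTICLES = ("وال", "بال", "كال", "فال", "ال")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM_LEN = 3  # assumed lower bound so short words are left intact


def strip_affixes(word):
    """Remove at most one definite-article prefix and one suffix."""
    for prefix in DEFINITE_ARTICLES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


def stem_text(text):
    """Return the stems of all Arabic, non-stop-word tokens in the text."""
    stems = []
    for token in text.split():
        if not ARABIC_WORD.match(token) or token in STOP_WORDS:
            continue
        stems.append(strip_affixes(token))
    return stems


if __name__ == "__main__":
    print(stem_text("في المكتبة 123 والكتاب معلمون"))
```

A list of stems in input order is one simple output form that other tools (search, clustering, classification) could consume; the actual tool's output format may differ.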
The output stems will then be stored in a way that can be easily integrated with other tools such as search engines, clustering tools, classification tools, etc.

أن ليت لعل لاسيما ولايزال الحالي ضمن اول وله ذات اي بدلا اليها انه الذين فانه إن بعد ضد يلي الى إلى في وفي من

V - REFERENCES
1- Arabic Text Mining Tutorial: http://textminingthequran.com/tutorial/bismillah.html

VI - ROLE(S) OF THE INVESTIGATOR(S)
(Attach a brief CV for each investigator following the format in Appendix A)
# | Name of Investigator | Area of contribution to the project
1 | Prof. Ahmed Sameh | System Design & Implementation
2 | Asst Prof. Mona Diab | Data Collection & Preparation
3 | Dr. Mohamed Tounsi | Data Mining Tools
4 | Dr. Noureldean Soufian | System Design & Implementation

VII - PROJECT SCHEDULE
PHASES OF PROJECT IMPLEMENTATION (SEE GANTT CHART ABOVE)
Task (for durations in months, see the Gantt chart within this proposal):
o System requirements specifications: Sameh, Tounsi
o System Architecture: Soufian
o System Design: Sameh
o Databases Designs: Soufian
o Prototyping of critical sub-systems: Tounsi, Sameh
o System Detailed Design: Sameh, Tounsi
o Beta Version Implementation: Sameh, Soufian
o Testing: Soufian
o Building Deployment Environment: Sameh
o Bench Marking and Collecting Results (First Round): Tounsi
o System Tuning (Based on First Round Results): Sameh
o Bench Marking and Collecting Results (Second Round): Soufian
o System Tuning (Based on Second Round Results): Sameh
o Bench Marking and Collecting Results (Third Round): Tounsi
o Version 1 Release: Tounsi
o Results Documentation and Analysis with the Performance requirements: Sameh
o Detailed Code Documentation: Sameh
o User and Installation Guide (Full How To): Soufian
Total duration for the proposed project: 12 Months

VIII - BUDGET OF THE PROPOSED RESEARCH (Budget in SAR)
Priority: 1 = Max; 2 = Mod; 3 = Low.
Item | Amount Requested (SAR) | Priority | Amount Approved (SAR) (For Official Use)
A. Personnel* (Research Assistant): 1- Student Ahmed Al-Jabreen, 2- Student Kamal Qarawi, 3- Student Omar Al-Moughnee, 4- Student Amro Al-Munajjed | 24,000 | 1 |
B. Equipment* (List): Development Server | 5,000 | 1 |
C. Testing and Analysis* (Location/Laboratory): Laptop Computer | 5,000 | 2 |
D. Consumables* (List): Desk Tools | 1,000 | 2 |
E. Travel* (Local/Internat): 1- Travel for Mona Diab (Stanford/Riyadh) | 10,000 | 1 |
F. Software* (List): SAS Data Mining Tools, Oracle 9i Data Mining, Clementine from SPSS, Ants Model Builder | 10,000 | 1 |
G. Other Items* (Itemize): --- | | |
Total Amount Requested (SAR) | 55,000 | |

IX - JUSTIFICATION OF BUDGET
(Justify each item listed in the budget in the previous section)
Item | Justification
A. Student Research Assistants | Salary of SR 500 for each student for 12 months, the duration of the project.
B. Development Server | For developing the proposed experiments.
C. Laptop Computer | For on-site data collection and on-site testing.
D. Desk Tools | For general use by team members.
E. Travel | For the two outside PSU team members.
F. Software | Data Mining Tools Software.
G. |

X - RELEASE TIME FOR RESEARCH TEAM MEMBERS
RELEASE TIME FROM TEACHING LOAD
# | Team Member | Time Commitment (hrs/weeks/terms) | Teaching Load Max (e.g. 1 course FA11)
PI | Ahmed Sameh | 4 h/w |
CI1 | Noureldean Soufian | 2 h/w |
CI2 | Mohamed Tounsi | 2 h/w |
CI3 | Mona Diab | 1 h/w |
CI4 | | 1 h/w |
CI5 | | |

XI - EXTERNAL FUNDING
# | Source of Funds | Amount (SAR) | Used for …… costs
1 | None | |
2 | | |
3 | | |

Appendix A: CV Format for Principal Investigator and Co-Investigators
(Two pages maximum, material should be related to submitted project)
Title and Name: Professor Ahmed Sameh
Specialty: Artificial Intelligence, Modeling and Information Systems
Department and College: Computer Science
Summary of Experience/Achievements Related to Research Proposal:
1- Ahmed Sameh, Ayman Kassem, “Lumbar Spine: Parameter Estimation for Realistic Modelling”, WSEAS Transactions on Applied and Theoretical Mechanics, ISSN: 1991-8747, Issue 5, Volume 2, May 2008
2- Ahmed Sameh, Ayman Kassem, “A General Framework for Lumbar Spine Modelling and Simulation”, International Journal of Human Factors in Modelling and Simulation, IJHFMS, The North American Spine Society, Volume 1, Issue 2, January 2008
3- Dalia El-Mansy, Ahmed Sameh, “A Collaborative Inter-Data Grid Strong Semantic Model with Hybrid Namespaces”, Journal of Software (JSW), Academic Publisher, Volume 3, Issue 1, January 2008
4- Ahmed Sameh, “Simulating Lumbar Spine Motion”, Research in Computing Science (RCS) Journal, National Polytechnic Institute of Mexico, ISSN 1665-9899, Volume 18, Issue 4, June 2007
5- Ahmed Sameh, and Ayman Kassem, “3D Modeling and Simulation of Lumbar Spine Dynamics”, in the International Journal of Human Factors Modelling and Simulation, Volume IJHFMS-942, 2007
6- Adhami Louai, Abdel-Malek Karim, McGowan Dennis, Mohamed A. Sameh, "A Partial Surface/Volume Match for High Accuracy Object Localization", International Journal of Machine Graphics and Vision, vol 10, no. 2, 2001
7- Mohamed A. Sameh, “Interactive Learning in Artificial Neural Networks Through Visualization”, The International Journal of Computers and Applications (IJCA), Vol. 20, #2, 1998
8- Mohamed A. Sameh and Attia E.
Emad, "Parallel 1D and 2D Vector Quantizers Using Kohonen SelfOrganizing Neural Network", in the International Journal of the Neural Computing and Applications, V. (4), no. 2, Springer Verlag, London, 1996 9- Ahmed Sameh, Amgad Madkour, “Intelligent open Spaces: Learning User History Using Neural Network for Future Prediction of Requested Resources”, Proceedings IEEE CSE'08, 11th IEEE International Conference on Computational Science and Engineering, 16-18 July 2008, São Paulo, SP, Brazil. IEEE Computer Society 2008, ISBN 978-0-7695-3193-9 10- Ahmed Sameh, Ayman Kaseem, “Modelling and Simulation of Human Lumbar Spine”, Proceedings of the 2008 International Conference on Modelling, Simulation, and Visualization, MSV 2008, Las Vegas, Nevada, July 14-17, 2008, CSREA Press 2008, ISBN 1-60132-081-7 11- Ahmed Sameh, Dalia El-Mansy, “A Collaborative Inter-Data Grids Model with Hybrid Namespace”, 14th IEEE International Conference on Availability, Reliability, and Security, (DAWAM – ARES 2007), Vienna, Austria, April 10-13, 2007 12- Ahmed Sameh, “Simulating Lumbar Spine Motion: Parameter Estimation for Realistic Modelling”, The 6th Mexican International Conference on Artificial Intelligence (MICAI07), Aguascalientes, Mexico, November 4-10, 2007 13- Sherif Akoush, Ahmed Sameh, “Bayesian Learning of Neural Networks for Mobile User Position Prediction”, The International Workshop on Performance Modelling and Evaluation in Computers and telecommunication Networks (PMECT07)- part of the IEEE 16th International Conference on Computer Communications and Networks, ICCCN 2007, Honolulu, Hawaii, August 13-16, 2007 14- Ahmed Sameh, “The Schlumberger High Performance Cluster at AUC”, Proceedings of the 13th International Conference on Artificial Intelligence Applications, Cairo, February 4-6, 2005 15-Mohamed A. 
Sameh, Rehab El-Kharboutly, "Modeling a Service Discovery Bridge Using Rapide Architecture Description Language", Proceedings of the 18th European Simulation Multiconference (ESM 2004), Magdeburg, Germany, June 13-16, 2004 16-Mohamed A. Sameh, Rehab El-Kharboutly, and Hazem Al-Ashmawy, "Modeling Wireless Discovery and Deployment of Hybrid Multimedia N/W-Web Services Using Rapide ADL", Proceedings of the 7th IEEE International Conference on High Speed N/Ws amd Multimedia Communications (HSNMC04), Toulouse, France, June 30- July 2nd, 2004 17-Mohamed A. Sameh, Rhab El-Kharboutly, "Modeling Jini-UpnP Using Rapide ADL", Proceedings of the 10th EUROMEDIA Conference (EUROMEDIA 2004), Hasselt, Belgium, April 19-21, 2004 18-Mohamed A. Sameh, "E-Access Custom Webber: A Multi-Protocol Stream Controller", Proceedings of the IADIS International Conference on Applied Computing, Lisbon, Portugal, March 23-26, 2004 19- Ayman Kassem, A. Sameh, and Tony Keller, “Modeling and Simulation of Lumbar Spine Dynamics”, Proceedings of the 15th IASTED International Conference on Modeling and Simulation and Optimization (MSO 2004), Marina Del Rey, California, March 2004 17 20-Mohamed A. Sameh, and Shenouda S., "Tera-Scale High Performance Distributed and Parallel SuperComputing at AUC", Proceedings of the 12th International Conference on Artificial Intelligence, Cairo, Feb. 18-20, 2004 21-Shenouda S., Mohamed L., and Mohamed A. Sameh, "AUC Cluster Participation in Global Grid Communities", Proceedings of the 12th International Conference on Artificial Intelligence, Cairo, Feb. 1820, 2004 22-El-Ashmawi Hazem, and Mohamed A. Sameh, “XML-Socket Language-Independent Distributed Object Computing Model”, Proceedings of the 15th International Conference on Parallel and Distributed Computing Systems, Louisville, Kentucky, September, 2002 23-Mohamed Karasha, Greenshields Ian, and Mohamed A. 
Sameh, “HUSKY: A Multi-Agent Architecture for Adaptive Scheduling of Grid Aware Applications”, Proceedings of the High Performance Computing Symposium with the 2002 Advanced Simulation Technologies Conference (ASTC 2002), San Diego, California, April 14-18, 2002
24- Atef Rania, Mohamed A. Sameh, and Abdel-Malek Karim, "Three Dimensional Deformable Modeling of the Spinal Lumbar Region", Proceedings of the 11th International Conference on Intelligent Systems on Emerging Technologies (ICIS-2002), Boston, July 18-20, 2002
25- Kassem Ayman, Mohamed A. Sameh, and Abdel-Malek Karim, "A Spring-Dashpot-String Element for Modeling Spinal Column Dynamics", Proceedings of the International Workshop on Growth and Motion in 3D Medical Images, Copenhagen, Denmark, May 28 - June 1, 2002
26- Kassem Ayman, and Mohamed A. Sameh, “A Fast Technique for Modeling and Control of Dynamic System”, Proceedings of the 11th International Conference on Intelligent Systems on Emerging Technologies (ICIS-2002), Boston, July 18-20, 2002
27- Mohamed A. Sameh, and Kaptan Noha, "Anytime Algorithms for Maximal Constraint Satisfaction", Proceedings of the ISCA 14th International Conference on Computer Applications in Industry and Engineering (CAINE' 2001), Nov. 27-29, at Las Vegas, Nevada, 2001
28- Mohamed A. Sameh, and Mansour Marwa, "Enhancing Partitionable Group Membership Service in Asynchronous Distributed Systems", Proceedings of the ISCA 14th International Conference on Computer Applications in Industry and Engineering (CAINE' 2001), Nov. 27-29, at Las Vegas, Nevada, 2001
29- Abdalla Mahmoud, Mohamed A. Sameh, Harras Khalid, Darwich Tarek, "Optimizing TCP in a Cluster of Low-End Linux Machines", Proceedings of the 3rd WSEAS Symposium on Mathematical Methods and Computational Techniques in Electrical Engineering, Athens, Greece, Dec. 29-31, 2001
30- Rania Abdel Hamid, and Mohamed A.
Sameh, “Visual Constraint Programming Environment for Configuration Problems”, Proceedings of the 15th International Conference on Computers and their Applications, New Orleans, Louisiana, March 2000
31- Essam A. Lotfy, and Mohamed A. Sameh, “Applying Neural Networks in Case-Based Reasoning Adaptation for Cost Assessment of Steel Buildings”, Proceedings of the 10th International Conference on Computing and Information, ICCI-2000, Kuwait, Nov. 18-21, 2000
32- Ghada A. Nasr, and Mohamed A. Sameh, “Evolution of Recurrent Cascade Correlation Networks with a Distributed Collaborative Species”, Proceedings of the IEEE Symposium on Computations of Evolutionary Computation and Neural Networks, San Antonio, TX, May 2000
33- El-Beltagy S., Rafea A., and Mohamed A. Sameh, “An Agent Based Approach to Expert System Explanation”, Proceedings of the 12th International FLAIRS Conference, Orlando, Florida, 1999
34- Mohamed A. Sameh, Botros A. Kamal, "2D and 3D Fractal Rendering and Animation", Proceedings of the Seventh Eurographics Workshop on Computer Animation and Simulation, Aug. 31 - Sept. 2, in Poitiers, France, 1996
35- Mohamed A. Sameh, "A Robust Vision System for three Dimensional Facial Shape Acquisition, Recognition, and Understanding", Proceedings of the 1st Golden West International Conference on Intelligent Systems, Reno, Nevada, 1991
36- Mohamed A. Sameh, "A Neural Trees Architecture for Fast Control of Motion", Proceedings of the FLAIRS Artificial Intelligence Conference, Cocoa Beach, Florida, 1991
37- Mohamed A. Sameh, Armstrong W.W., "Towards a Computational Theory for Motion Understanding: The Expert Animator Model", Proceedings of the 4th International Conference on Artificial Intelligence for Space Applications, NASA, Huntsville, Alabama, 1988

CV of Mona Diab:
I am a scholar at Stanford University in the linguistics department, working with Daniel Jurafsky, and I am also part of the Natural Language Processing lab.
I finished my PhD at the University of Maryland, College Park, where I was in the linguistics department and was part of the CLIP lab in the University of Maryland Institute for Advanced Computer Studies. I worked under the supervision of a great advisor, Philip Resnik. My thesis, defended in May 2003, is titled Word Sense Disambiguation within a Multilingual Framework. Earlier, from 1995 to 1997, I earned an MSc degree in Artificial Intelligence (Machine Learning) from the George Washington University under the supervision of Professor Peter Bock. After graduation I worked for five months as a research associate in the Center for Spoken Language Research (CSLR) at the University of Colorado at Boulder, then moved to Stanford, California in January 2004.

Research Interests
My main research area is statistical natural language processing. I am specifically involved in computational semantics, Arabic computational linguistics, semantic processing, and machine learning. I am interested in cross-linguistic similarities and divergences in language use and in how these relations can be exploited to solve language processing problems.

Publications
Diab, Mona. Relieving the data acquisition bottleneck for Word Sense Disambiguation. Proceedings of ACL 2004.
Diab, Mona, Kadri Hacioglu and Daniel Jurafsky. Automatic Tagging of Arabic Text: From raw text to Base Phrase Chunks. Proceedings of HLT-NAACL 2004.
Diab, Mona. An Unsupervised Approach for bootstrapping Arabic Sense Tagging. Proceedings of Arabic Script Based Languages Workshop, Coling 2004.
Diab, Mona and Philip Resnik, An Unsupervised Method for Word Sense Tagging using Parallel Corpora, Proceedings of ACL, 2002.
Diab, Mona. An Unsupervised Method for Word Sense Tagging using Parallel Corpora: A Preliminary Investigation.
Special Interest Group in Lexical Semantics (SIGLEX) Workshop, Association for Computational Linguistics, 2000.
Diab, Mona and Steven Finch. A Statistical Word-Level Translation Model for Comparable Corpora. Proc. of Conference on Content-based Multimedia Information Access (RIAO2000), 2000.
Resnik, Philip and Mona Diab, Measuring Verb Similarity, Cognitive Science Society (CogSci2000), 2000.
Dorr, Bonnie, Gina Levow, Douglas Oard, Philip Resnik, Amy Weinberg, Mona Diab, Maria Katsova. MADLIBS: An Event Translingual Lexical Conceptual Structure Based Information Retrieval System. North American Association for Computational Linguistics, NAACL 2000.
Resnik, Philip, Mari B. Olsen and Mona Diab, The Bible as a Parallel Corpus: Annotating the `Book of 2000 Tongues', Computers and the Humanities, 33(1-2), 1999.
Diab, Mona, John Schuster and Peter Bock. A Preliminary Statistical Investigation into the impact of an N-Gram Analysis Approach based on Word Syntactic Categories toward Text Author Classification, Proc. of 6th International Conference on Artificial Intelligence & Applications, Egypt 1998.
Riopka, Terry, Mona Diab and Peter Bock. Quantifying and Interpreting the Effect of Intelligent Information. Proc. of 6th International Conference on Artificial Intelligence & Applications, Egypt 1998.

Software
We have developed a set of Arabic processing tools in conjunction with our NAACL'04 paper. The tools use the Yamcha SVM tools to tokenize, POS tag, and Base Phrase Chunk Arabic text. They are distributed as a tarred and compressed package (55 MB) and are compiled for a Linux platform.

CV of Noureldean Soufian
Publication
· Book: S. Noureddine: Conceptual Development and Quantitative Analysis of an Availability Enhancing Middleware for Distributed Applications, Mensch & Buch Verlag, Berlin, 2002, ISBN 389820-347-6.
· S.
Noureddine: A Geometric Programming Approach for the Satisfiability Problem, submitted to Comp. Intel. Studies, August, 2009.
· S. Noureddine: Some Aspects of Islamic Logic, submitted to Applied Computing and Informatics, KSA, August, 2009.
· A Geometric Programming Approach for the Satisfiability Problem, submitted to Comp. Intel. Studies, April, 2009.
· M. Madi, S. Noureddine, A. Fellah: On Cryptology: Origin, Science, and Novel Techniques to Interactive Data Decryption, First International Conference on Arab's & Muslim's History of Sciences, UAE, 2008.
· Fellah, S. Noureddine: Deterministic Timed AFA: A New Class of Timed Automata, Journal of Computer Science, Science Publications, 2007.
· S. Noureddine: Analysis of a New Reduction Calculus for the Satisfiability Problem, Proceedings of the 9th ALC conference, 2006.
· Fellah, S. Noureddine: Some Succinctness Properties of O-DTAFA, WSEAS Transactions on Computers, 5(3), March, 2006.
· Y. Chali, S. Noureddine: Document Clustering with Grouping and Chaining Algorithms, In Proc. of the 2nd International Joint Conference on Natural Language Processing, South Korea, 2005.
· Y. Chali, S. Noureddine: Text Clustering for Natural Language Applications, Journal of Computer Science, Science Publications, 2005.
· S. Noureddine: A Simple Reduction Calculus for Propositional Logic Formulas, 9th Asian Logic Conference, Russia, August, 2005.

CV of Mohamed Tounsi
Dr. Mohamed Tounsi
Associate Professor in Computer Science
Specialization: Artificial Intelligence

Short Bio:
Mohamed Tounsi received his PhD in Computer Science, with a specialization in artificial intelligence, from the University of Nantes, France, in 2002. He was the chairman of the computer science department and Assistant Professor at the Department of Computer Science, Prince Sultan University, KSA. His current research interests include constraint programming, meta-heuristics, bioinformatics, intelligent agents, and optimization algorithms. Previously, Dr. Tounsi received his Master of Science from Paris 9, Dauphine University, Paris, France. Dr. Tounsi has published several papers in international journals (see the Publications section). He is currently an editorial member of various journals in the field of computing and a board member of the Saudi Computer Society.

Degrees
PhD in Computer Science, specialization in Artificial Intelligence, University of Nantes, France, 2002
M.S. in Computer Science, specialization in Operational Research, University of Paris Dauphine, France, 1998
Engineer in Operations Research, University of Science and Technology Houari Boumediene, Algiers, 1995

I. Research Interests
Constraint Programming and constraint satisfaction problems, Local Search Methods and Hybrid Methods, Data Mining, Combinatorial Optimization, Multi-Objective Optimization, Multi-Criteria Decision Making, Multi-agent modeling and parallel solving, Fuzzy Set Theory

II. Current Projects
Data Mining applications in social networks
Data Mining applications in healthcare
Mining Arabic text
Small world based algorithms for optimization problems
Swarm intelligence for solving unconstrained optimization problems

III. Publications
Recent Publications
1. Mohamed Tounsi (2010) “An intelligent bank assessment system: Preliminary Results” International Journal of Electronic Finance, Vol. 4 No. 3, Inderscience Publishing.
2. Mohamed Tounsi (2010) “TTGENERATOR: An Intelligent Solver for Timetabling System” Journal of Applied Soft Computing, Elsevier publishing (Accepted)
3. Mohamed Tounsi (2010) “New Swarm Intelligence Based Heuristics” Journal of Applied Soft Computing, Elsevier publishing (Accepted)
4. Mohamed Tounsi et al. (2010) “A multi-criteria approach for job preferences” International Journal of Data Analysis Techniques and Strategies (IJDATS)
5. Mohamed Tounsi (2010) “A Multi-Objective Heuristics Based for Optimization Problems” International Journal of Artificial Intelligence and Soft Computing, Inderscience Publishing (Accepted)
6. Mohamed Tounsi et al. (2009) “The Role of BPR in the implementation of ERP Systems”, International Journal of Business Process Management journal. Vol. 15 No. 5, pp. 653-668. Emerald Publishing.
7. Mohamed Tounsi (2008), “An explanation-based tools for debugging constraint satisfaction problems”.
Journal of Applied Soft Computing, Elsevier publishing (Accepted). 8(4): 1400-1406 (2008)
8. Mohamed Tounsi et al. (2008) “An Iterative local-search framework for solving constraint satisfaction problem”. Journal of Applied Soft Computing, Elsevier publishing (Accepted) 8(4): 1530-1535 (2008)
9. Mohamed Tounsi et al. (2008) “A Bluetooth intelligent e-healthcare system: analysis and design issues”. International Journal of Mobile Computing (IJMC) 6(6): 683-695 (2008). Inderscience Publishing.
10. Mohamed Tounsi (2008) “Toward a General Model for Local Search Technique” Journal of Applied Computer Informatics. Vol 7 No 1. 2008. Elsevier Publishing.
11. Mohamed Tounsi et al. (2008) “The development of an intelligent Agent Prototype for Mutual Fund Investment” International Journal of Electronic Finance. Vol 2 No. 3 pp. 300-313. 2008. Inderscience Publishing.
12. Mohamed Tounsi (2008) “An overview of ILOG Optimization Suite”, Journal of Applied Computer Informatics. Vol 6 No 2. 2008.
13. Mohamed Tounsi et al. (2008) “Greedy-Based Approach for Solving Data Allocation Problem in a Distributed Environment”. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2008, Las Vegas, USA, July 14-17, 2008, 975-980.
14. Mohamed Tounsi (2008) “Intelligent System for Bank Assessment: A Preliminary Results” The 19th Saudi National Computer Conference (NCC19). December 2008.

IV. Research Activities
Member of Editorial Board for International Journals:
International Journal of Electronic Healthcare (IJEH), Inderscience Eds.
Business Process Management (BPMJ), Emerald Eds.
Applied Computing and Informatics (ACI), SCS Eds.
Reviewer for International Journals:
Applied Artificial Intelligence Journal
Applied Soft Computing Journal
Supercomputing Journal
Business Process Management Journal
New Mathematics and Natural Computation Journal (NMCJ)
Applied Computing and Informatics

Reviewer for International and National Conferences:
International Conference on Artificial Intelligence
International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA)
IBAMA conference
IASTED Conferences (AIA, MSO)
ROADEF Conference (French conference of operational research)
JNPC Conference (French conference of solving NP-complete problems)

Member of Scientific Committee of different conferences:
IASTED Conference, Artificial Intelligence and Applications (AIA), 2009
International Conference on Artificial Intelligence ICAI’2009, Las Vegas, USA
International Conference on Artificial Intelligence ICAI’2008, Las Vegas, USA
ISCAL 2007 Conference.
IASTED Conference, Artificial Intelligence and Applications (AIA), Innsbruck, Austria.
WSEAS Conference, Distance Learning and Web Engineering (DIWEB'2006), Lisbon, Portugal.
NCC18 (18th National Conference on Computer), Riyadh, Saudi Arabia.
International Conference on Artificial Intelligence ICAI’2006, Las Vegas, USA
IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

Workshop Organized
Workshop at the 7th World Multi Conference on Systemics, Cybernetics and Informatics (SCI 2003), Florida, USA

Member of Scientific Associations:
Board member of Saudi Computer Society (SCS)
French Association of Operations Research

Appendix B: Evaluations and Approvals

COLLEGE REVIEW COMMITTEE
Evaluation and Recommendation

Item / Evaluation                        Excellent   Very Good   Good   Weak
Research methodology
Research objectives
Research originality
Research contribution
Research applicability and relevance
Overall evaluation

Recommendations of College Committee:   Approved   Disapproved
Amount of Budget Approved by College Committee:   (SAR)
Chair College Committee - Title and Full Name:
Signature:   Date:   /   /

Recommendations of the College Council:   Approved   Disapproved
Dean of the College Council - Title and Full Name:
Signature:   Date:   /   /

PSU INSTITUTIONAL RESEARCH COMMITTEE (IRC)
Recommendation
Recommendation of the PSU IRC:   Approved   Disapproved
Chair IRC Committee - Title and Full Name:
Signature:   Date:   /   /

PSU EXTERNAL REVIEW PANEL FOR RESEARCH PROPOSALS
Recommendation
Recommendation of the External Review Committee:
Approved:   Amount of grant approved:   (SAR)
Disapproved:
Postponed:
Directed to:
Chair of External Review Panel - Title and Full Name:
Signature:   Date:   /   /

Recommendation of University Council:   Approved   Disapproved
Signature:   Date:   /   /