Spring 2011
CS 544 Final Assignment
Given by Eduard Hovy

Write your own text summarizer.

Part 0: Preparation

Step 1: Get 10 texts of the same genre (I recommend news articles). Place each sentence on a new line. Identify 6 Development texts and 4 final Evaluation texts. Set aside the Evaluation texts for Part II.

Part I: Implement a simple text summarization engine (25 points)

The input should be:
o a query of M words (0 ≤ M < 4) indicating the reader's interests
o a document of N sentences (one sentence per line)
and the output should be a summary of 0.3N sentences.

Step 1: Build at least three sentence scoring modules (see below; an illustrative sketch follows Part II).

Step 2: Build the framework in which the modules fit: accept the input (text and query); activate each module and collect its scores for each sentence; at the end, combine the scores for each sentence across all modules; produce output (the top-scoring 1/3 of the sentences) with their scores; order the sentences as you think they should be ordered; save the result as the summary.

Step 3: Test the framework on the 6 Development texts, and decide how you want to combine the scores to get the best-sounding output summaries. Change your combination formula (and whatever else you want to change). EXTRA CREDIT if you use a machine learning method to determine optimal score weights.

The modules:

REQUIRED: EITHER
o Choose 2 simple (word/position-based) topic scoring methods (word frequency counting of various forms; position, title, and cue phrase methods; the query overlap method). Up to 5 points for each method.
o Choose 1 complex (cross-sentence) topic scoring method (cross-sentence overlaps of various kinds, lexical chains, discourse structure, etc.). Up to 10 points.
OR (hard): Implement the information extraction frame/template method, with at least 3 frames and at least 4 slots per frame. Up to 20 points.

OPTIONAL: Up to 5 points for redundancy checking, pronoun replacement, and sentence order manipulations that improve the coherence of the summary.

REQUIRED: Up to 5 points for the method of integrating the scores of the various modules (the more advanced the method, the more points; simply adding the modules' scores gives 1 point, while weighting them in some way is better but requires an analysis showing how and why the particular weighting factors were determined).

Part II: Implement a summary evaluator (15 points)

The input is an ideal summary and a system summary. The output is the Precision and Recall scores.

Step 1: Build the evaluator. The input is two summaries (the ideal summary, of length A sentences, and the system summary, of length B sentences). First count every sentence in the system summary that also occurs in the ideal summary (= C). Then compute:
Recall = (# correct sentences in system output) / (# sentences in ideal summary) = C/A
Precision = (# correct sentences in system output) / (total # sentences in system output) = C/B
(If you want, you can do fancy things with the scoring to take into account the actual system score value for each sentence.)

Step 2: Create the ideal summaries: for each of the 4 Evaluation texts, extract the N sentences (about 0.3 times the total number of sentences in that text) that you consider the most important. Rank them from 1 to N, and save them together with their rank scores. You should use the SAME query when your system generates the summary as when you generate the gold standard. You can pick just one query per text, although of course it's more fun to have more than one query (and hence more than one summary) per text!
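The assignment does not prescribe an implementation, but to make the module interface concrete, here is a minimal Python sketch of Part I: two simple scoring modules (word frequency and position), the query overlap method, and a plain weighted-sum integration. The tokenizer, the function names, and the default weights are illustrative assumptions only, not part of the assignment.

    # Illustrative Part I sketch (names and tokenizer are assumptions).
    # Assumes one sentence per line and whitespace-delimited tokens.
    from collections import Counter

    def tokenize(sentence):
        return [w.strip('.,;:!?"').lower() for w in sentence.split()]

    def word_frequency_scores(sentences):
        # Score each sentence by the average corpus frequency of its words.
        freq = Counter(w for s in sentences for w in tokenize(s))
        return [sum(freq[w] for w in tokenize(s)) / max(len(tokenize(s)), 1)
                for s in sentences]

    def position_scores(sentences):
        # Earlier sentences score higher (a common heuristic for news).
        n = len(sentences)
        return [(n - i) / n for i in range(n)]

    def query_overlap_scores(sentences, query_words):
        # Score each sentence by how many query words it contains.
        q = set(tokenize(' '.join(query_words)))
        return [len(q & set(tokenize(s))) for s in sentences]

    def combine(score_lists, weights):
        # Weighted sum across modules; weights of all 1 give the
        # 1-point "simply adding" baseline described above.
        return [sum(w * s for w, s in zip(weights, per_sentence))
                for per_sentence in zip(*score_lists)]

    def summarize(sentences, query_words, weights=(1.0, 1.0, 1.0)):
        scores = combine([word_frequency_scores(sentences),
                          position_scores(sentences),
                          query_overlap_scores(sentences, query_words)],
                         weights)
        k = max(1, round(0.3 * len(sentences)))  # summary length = 0.3N
        top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
        # Present the selected sentences in original document order.
        return [(i, scores[i], sentences[i]) for i in sorted(top)]

Normalizing each module's scores to [0, 1] before weighting would make the weights comparable across modules; that refinement, and any machine-learned weights, belong with the required analysis of the integration method.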
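Similarly, here is a minimal sketch of the Part II evaluator, under the assumption that each summary is stored one sentence per line; exact string match after whitespace normalization is an illustrative choice for counting overlapping sentences.

    # Illustrative Part II sketch: sentence-overlap Recall and Precision.
    def load_summary(path):
        # One sentence per line; skip blanks, normalize whitespace.
        with open(path, encoding='utf-8') as f:
            return [' '.join(line.split()) for line in f if line.strip()]

    def evaluate(ideal, system):
        ideal_set = set(ideal)
        c = sum(1 for s in system if s in ideal_set)    # C: correct sentences
        recall = c / len(ideal) if ideal else 0.0       # Recall = C/A
        precision = c / len(system) if system else 0.0  # Precision = C/B
        return recall, precision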
Part III: Apply it! (10 points)

Step 1: Run your summarizer: feed the 4 Evaluation texts into your summarizer and save its output.

Step 2: Run the evaluator: for each Evaluation text, input your ideal summary (with rank scores) and the system's summary (with sentence scores), and output the overlap score. (A hypothetical end-to-end driver is sketched at the end of this handout.)

Please email ALL of the following to hovy@isi.edu by Saturday, May 7:
o The code of each topic scoring module
o A trace-through of at least 2 texts for each topic scoring module (the same texts may be used), showing the output scores assigned to each unit
o The code of the score integration module, with a discussion/description of how the weightings (if any) were determined
o The input and final output of at least 2 texts, with their Precision and Recall scores, including the scores of the output sentences
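To show how the pieces fit together for Part III, here is a hypothetical driver that reuses the summarize, evaluate, and load_summary helpers sketched above; the file paths and the single shared query are placeholders, not requirements.

    # Hypothetical Part III driver (paths and query are placeholders).
    def run_evaluation(text_paths, ideal_paths, query_words):
        for text_path, ideal_path in zip(text_paths, ideal_paths):
            with open(text_path, encoding='utf-8') as f:
                sentences = [line.strip() for line in f if line.strip()]
            # Use the SAME query for the system summary and the gold
            # standard, as Part II requires.
            system = [s for _, _, s in summarize(sentences, query_words)]
            recall, precision = evaluate(load_summary(ideal_path), system)
            print(f'{text_path}: Recall = {recall:.2f}, '
                  f'Precision = {precision:.2f}')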