COMPUTER-AIDED PROCESSING OF THE NEWS

by
J. Francis Reintjes
Richard S. Marcus

Status Report ESL-SR-348
May, 1968

Electronic Systems Laboratory
Electrical Engineering Department
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139

ACKNOWLEDGEMENT

The research reported in this document was made possible through the support extended the Massachusetts Institute of Technology under a grant from the American Newspaper Publishers Association.

CONTENTS

I. SUMMARY
II. NEWS-RETRIEVAL SYSTEM
   A. GENERAL DESCRIPTION
   B. THE COMPUTING SYSTEM FACILITIES
   C. THE INTREX SYSTEM
   D. THE NEWS-RETRIEVAL SYSTEM: DETAILS AND STATUS
      1. Reader for TTS Paper Tapes
      2. Paper-Tape Acquisition
      3. Paper-Tape Input and Code Conversion
      4. Automatic Indexing
      5. Inverted-File Generation
      6. Retrieval Operations
      7. Manual Indexing
   E. FUTURE PLANS
STAFF
REFERENCES
APPENDIX

I. SUMMARY

Approximately one year ago, technical responsibility for the ANPA-MIT research project was transferred to the Electronic Systems Laboratory. A summary of the past year's activities is presented in this report.

The ANPA-MIT project has as its objective the application of modern data-processing techniques to the newspaper business. In particular, the project seeks to apply technologies being developed by MIT's Project Intrex to news processing. Since Project Intrex is exploring the use of multiaccess computers operating in an online interaction mode with a community of users for information-storage and transfer purposes, much of the Intrex work and experience should be applicable to news processing.

To date our newspaper group has concentrated its efforts on basic techniques for storing, indexing, and accessing digitally encoded news which has already appeared in print.
Arrangements have been made with a nearby publisher (the Worcester Telegram and Gazette) to receive selected paper tapes used to set type for their evening and Sunday papers, the Worcester Gazette and the Sunday Telegram. An experiment has been planned in which four categories of news will be indexed by computer means. The results will be compared with indexing performed manually by a professional indexer.

A major difficulty was encountered in reading the TTS tapes into our computer system at MIT. Our tape reader has now been modified to accommodate tapes having either a seven-bit code and a symmetrically aligned feed hole, or a six-bit code and an off-center feed hole (TTS tape). Since it is anticipated that other groups may encounter a similar difficulty with TTS tapes, we are including details of our modification in this report.

We have been fortunate during the current academic year to be able to excite the interest of a selected group of undergraduate students in the application of digital techniques to news processing. Much of our present format and design for an online news storage and retrieval system has evolved from a seminar offered to four freshmen on the subject of Electronic Aids to Information Transfer. It is expected that at least one or two of these students will join us next year on a part-time basis.

Although the news storage and retrieval system described in succeeding sections may appear to be, in essence, a computer-stored news library, we believe it has broader implications. Central to any computer-stored news system are the methods employed for putting away the information in the machine and the mechanisms or "handles" (the computer programs) provided for subsequent recovery of the stored information. It is these basic principles of identifying, storing, and retrieving the news that we are investigating.

Once news can be compactly stored and quickly retrieved, many opportunities become available for employing the information.
As up-to-the-minute news it can be accessed, assessed, and formatted for immediate printing. Specialized up-to-the-minute news may be sold and communicated over wires to persons who subscribe to the service. As published news, the information can be reused for background purposes, or re-edited, repackaged, and resold.

As stated previously, our main effort to date has been on methods of indexing, storing, and retrieving the news. Computer time-sharing systems open additional opportunities for news processing, however, and we intend to explore these as quickly as we can acquire additional qualified manpower. Online news editing, page makeup and formatting, and online classified-advertisement systems which integrate the customer relationship, ad makeup, page-formatting procedures, and the business operations including credit checking and customer billing, are examples of future applications of computer-stored news systems.

II. A NEWS-RETRIEVAL SYSTEM

A. GENERAL DESCRIPTION

The news-retrieval system being investigated at the Electronic Systems Laboratory, Massachusetts Institute of Technology, is an experimental system for testing the utility of storing and retrieving news articles in an online, time-sharing computer environment. The system has the following features:

   1. Inputting of the full text of news articles into the computer from the Teletypesetter (TTS) punched paper tapes regularly used in newspaper Linotype operations.

   2. Automatic indexing of the subject content of these articles.

   3. Rapid retrieval of these articles, or references to them, through online, interactive user-computer dialog at consoles remote from the computer.

   4. In-depth manual cataloging of these same articles for purposes of comparing the utility of automatic and human indexing.
The specific function served by this system will, on first thought, be associated with newspaper archives, a term which traditionally connotes a repository for rarely used information. The ease and rapidity of access possible with a computer-based system, and the resultant opportunities for broader usage, however, recommend our adoption of the preferred term "news-retrieval system."

B. THE COMPUTING SYSTEM FACILITIES

The experimental news-retrieval system is designed to work in the environment of the MIT-modified IBM 7094 Compatible Time-Sharing Computer System (CTSS). Approximately 200 typewriter consoles on and near the MIT campus are connected via phone lines to the 7094 machine. Approximately 30 of these consoles may carry on a dialog with this computer simultaneously.

A user may request action from the computer by typing at one of these consoles. The central processor unit (CPU) of the 7094 will schedule this action to be carried out in time "slices", or periods, of a few seconds each. During these periods the CPU is operating on the programs required to perform the requested action. In the intervening periods other programs are read into the CPU to perform the actions requested by other users.

When an action is requested by a user, his request is placed in a service queue. The typical waiting time for the system to begin servicing a request may range from three to five seconds.

The results of an action will generally require a message to be sent back to the user. CTSS can transmit these messages simultaneously to many users. Thus, if the system is not being too heavily used, each individual user gains the impression that the system is almost completely dedicated to servicing his requests. For a more complete description of CTSS, see References 1 and 2.

C. THE INTREX SYSTEM

As a preliminary to the detailed description of the experimental news-retrieval system, it is advisable to describe further the workings of the Intrex System to which the news system is so closely allied. Intrex embodies a set of programs for performing the storage and retrieval operations on catalog information pertaining to library documents. These programs operate in the context of CTSS. As diagrammed in Fig. 1, the operations in the storage or file-generation phase include: inputting punched paper tapes into the computer, code conversion, editing, index-term extraction, sorting, and formatting catalog and inverted (index) files. The retrieval phase includes program modules to interpret the user requests, to perform file searches, to transmit output to the user, and to monitor the user-system dialog. For further details on Project Intrex goals and status see References 3 and 4.

[Fig. 1: diagram not recoverable from the source]

D. THE NEWS-RETRIEVAL SYSTEM: DETAILS AND STATUS

A simplified diagram of the news-retrieval system is given in Fig. 2. The various aspects of the system are detailed below.

1. Reader for TTS Paper Tapes

The TTS punched paper tapes which are used by the newspapers are of great potential value in our computer experiments in that they contain the full text of news articles in digital form.
The encoding of natural-language text into digital form has been a costly bottleneck that has impeded the progress of other language-processing applications (e.g., automatic indexing of professional literature, text storage and retrieval, mechanical translation, and general linguistic analysis).

One difficulty, however, with the direct use of TTS paper tapes is that they have a different physical configuration (6-level, advanced feed hole) from the one that standard computer punched paper-tape readers accept. The ANPA Research Institute has recognized the undesirability of this incompatibility and has recommended gradual newspaper industry conversion to the computer standard. However, in order to begin our experiments as soon as possible, we made modifications to the presently used MIT-CTSS paper-tape reader (Digitronics Model 2500). Detailed information on these modifications is given in the Appendix. The modified paper-tape reader is now working smoothly and accurately in our operating environment.

2. Paper-Tape Acquisition

As a means of generating a data base of news articles of current interest for our retrieval experiments, we have made arrangements for obtaining TTS paper tapes from personnel at the Worcester Telegram and Gazette. The Worcester Telegram has a Digital Equipment Corporation PDP-8 computer which is used to justify and hyphenate automatically articles prepared on TTS paper tapes.
Some of these tapes and some tapes directly from the wire services, which are already justified and which have been selected for incorporation into the newspaper, are gathered by the Telegram for use in our experiments.

[Fig. 2: simplified diagram of the news-retrieval system; not recoverable from the source]

Tapes are selected for articles which fall into one or more of the following news categories:

   The 1968 Presidential Election
   The Vietnamese Conflict
   The Racial Crisis
   Worcester Urban Renewal

Not all articles that go into the paper are being selected because the number of these would overtax the facilities available to the experiments. These particular categories were chosen to provide a sizable number of articles on topics that should be of intense interest during the six-month period beginning April, 1968. The category "Worcester Urban Renewal" was chosen as a topic having both local interest and rather general implications.

It is expected that articles falling within these four categories in the Worcester Sunday Telegram and (evening) Gazette will number about 50 to 100 a week and that the total number of articles in the data base for a six-month collection period will be about 2,000. Approximately 100 tapes covering a total of about 70 articles (some articles are on more than one tape) have been collected to date (May 10, 1968).

3. Paper-Tape Input and Code Conversion

The paper tapes are read into the computer by the Digitronics reader described in Section D.1; the TTS codes are converted to ASCII codes (the Intrex-adopted code set for internal character representation), and the full text of the articles is stored on magnetic tape in files of 20 articles each, where this information is maintained for further processing. The articles are then printed out by means of a line printer for inspection by the manual cataloger (see Section D.7) and for general hard-copy reference purposes.

In inputting the tapes, the operator also enters the date that the article appeared in the paper. This information is written on the tape at the Worcester Telegram (as is the fact that several tapes make up one article). The operator need enter a given date only once, since the computer program keeps the last date until changed by the operator and the tapes are batched by date. The computer program automatically assigns each article the next available number as an identifier and stores this information along with the article appearance date, the online inputting date, and the size of the article in computer words as part of a "header block" for each article.

The programs for performing these operations have been written and are currently in operation.

4. Automatic Indexing

The full text of the articles, as prepared by the programs described in Section D.3, is used as input to a program which extracts information that can be used for indexing the subject content of the article. The automatic-indexing program takes advantage of the nature and style of newspaper writing and the likely type of retrieval desired to identify subject terms by simple rules.
In particular, because the first paragraph of a good newspaper story should contain a summary of the contents of that story, we use all the words in this paragraph as subject terms. Also, much of what a news article is about, and to what a user would presumably wish to refer, is designated by words in the class of proper nouns -- names of people, organizations, and places. We use a simple way to capture these nouns: extract all capitalized words. One further class of indexing information is derived from the punctuation and format of a typed news article. Using these clues we can obtain the dateline and byline, where these are present.

This kind of indexing -- by first paragraph, capitalized words, dateline and byline -- is evidently quite deep. The ratio of the number of words extracted to the total number of words in the article appears to be about 0.2. The quality of this indexing is a subject for our experiments, as discussed below in Section D.7.

The automatic indexing programs have been written and successfully run on several dozen articles. See Fig. 3 for a typical news article and the resulting index terms.

   [Joseph E. Carter,] president of [Wyman-Gordon Co.,] has been named chairman of the [Industry] and [Commerce Committee] for [Project Concern,] a campaign to raise funds for a hospital in [South Vietnam.]

   The project has raised $11,000 of its $50,000 goal so far for the hospital, which will be built as a memorial to [Worcester County] men killed in the [Vietnam] war.

   Carter will be host at a meeting of industrialists at 5 p.m. tomorrow in the [Worcester Club. The] meeting is designed as a first step in securing financial support from county industry for the project.

   A committee named to assist [Carter] includes [Douglas L. Liston,] president of [Thompson-Liston Associates Inc.;] [Warren C. Lane Jr.,] a partner in the law firm of [Bowditch,] [Gowetz] and [Lane,] and [Albert D. Farnum,] public relations manager for [Wyman-Gordon.]
   The project was begun by [Francis Carroll,] commander of [Bernon Hill Post,] [American Legion.]

   LEGEND: Phrases in brackets [ ] are selected by the capitalization algorithm. Underlined words are selected from the first paragraph. (Note that all first-paragraph words are selected.)

   Fig. 3  Sample News Article Showing Indexed Terms

5. Inverted-File Generation

The automatic indexing programs prepare index terms in a form in which they can be manipulated by standard Intrex programs. The first of these operations performs phrase decomposition. The subject terms are created originally in multi-word phrases. The first paragraph is considered an extended phrase. Each string of capitalized words is also considered a phrase. The phrase-decomposition operation separates these phrases into their individual words and tags these words according to their position within a given phrase, so that the subsequent retrieval operations can suitably account for nearness of pairs of words, if that be desired.

The next operation, stemming, deletes endings of words so that in later retrieval operations a user term will still match an index term even though there are minor morphological differences in these terms. For example, a user requesting information about either "bank" or "banking" will be assured of a match if "bank" is in the data base.

A third operation sorts these stemmed words alphabetically so that, later, the retrieval operations can take advantage of fast "directed" or "dictionary" searching.

In a fourth operation, the actual inverted or index files themselves are generated, together with appropriate directory files for rapid access. In these inverted files are lists of references for given stemmed words. Thus, under the "bank-" list are all references to newspaper articles from which the subject words "bank", "banks", "banking", etc. have been extracted in the indexing process.

A final operation is the printing of the inverted files for review and analysis purposes.
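The indexing rules of Section 4 and the file-generation steps above can be illustrated with a minimal Python sketch. It is not the Intrex code. The function names, the blank-line convention for the end of the first paragraph, and especially the toy suffix-stripping stemmer are assumptions made for illustration; the report does not say how the Intrex stemming program decides which endings to delete.

```python
import re
from collections import defaultdict


def capitalized_phrases(text):
    """Return each run of consecutive capitalized words as one phrase."""
    phrases, run = [], []
    for word in text.split():
        if word[0].isupper():
            run.append(word)
        else:
            if run:
                phrases.append(" ".join(run))
            run = []
    if run:
        phrases.append(" ".join(run))
    return phrases


def stem(word):
    """Toy stemmer (an assumption): lowercase, drop punctuation, strip one ending."""
    w = re.sub(r"[^a-z0-9]", "", word.lower())
    for suffix in ("ing", "es", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def build_inverted_file(articles):
    """Map each stem to a list of (article id, phrase number, word position).

    `articles` maps an article identifier to its full text; the first
    paragraph is assumed to end at the first blank line.
    """
    postings = []
    for article_id, text in articles.items():
        first_paragraph = text.split("\n\n", 1)[0]
        # Phrase 0 is the first paragraph; the rest are capitalized runs.
        phrases = [first_paragraph] + capitalized_phrases(text)
        for phrase_no, phrase in enumerate(phrases):
            # Phrase decomposition: tag each word with its position.
            for position, word in enumerate(phrase.split()):
                stemmed = stem(word)
                if stemmed:
                    postings.append((stemmed, article_id, phrase_no, position))
    postings.sort()  # alphabetical by stem, then by reference
    inverted = defaultdict(list)
    for stemmed, article_id, phrase_no, position in postings:
        inverted[stemmed].append((article_id, phrase_no, position))
    return dict(inverted)
```

Looking up `stem("banking")` in the resulting dictionary returns every reference filed under "bank", which is the behavior described for the "bank-" list. Note that the capitalization rule as sketched here does not stop at sentence boundaries, so it yields phrases such as "Worcester Club. The", as seen in Fig. 3; whether the original program behaved this way for the same reason is an assumption.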
All these operations have been carried out on a test basis for several dozen news articles using the appropriate Intrex programs.

6. Retrieval Operations

The retrieval of news articles, or references to them, will be accomplished online simultaneously from multiple remote (to the central computer) consoles by use of programs yet to be fully developed. The basic idea is that a user at any one of these consoles will type in a subject phrase of one or more words. These words will be stemmed and the references to articles with these stems as index terms will be taken from the inverted files. The computer will then report to the user how many articles matched his subject phrase to a given degree of relevancy. Relevance can be estimated, for example, by the number of matching words. The user may then request to see the identification numbers of the matching articles or additional information on these articles (for example, dateline or byline) or, finally, the full text of these articles (which may be stored on magnetic disc or magnetic tape). On the other hand, if the number of matching articles is too large, the user may wish to set additional restrictions (for example, a given byline or match only on first-paragraph index terms). This dialog can continue until the user obtains the desired information.

At present, Intrex retrieval programs that find matches for a simple subject phrase are being modified to list article identification numbers.

7. Manual Indexing

How good is the (automatic) indexing? Is it necessary that it be so extensive? Project Intrex has set up elaborate procedures to answer just such questions for a data base within the materials-science field. Intrex has specified an "augmented catalog" which provides a depth of index information to documents far in excess of that provided by ordinary card catalogs.
The experimental procedure, then, is to monitor the use of this augmented catalog by users in an actual retrieval situation and to analyze just how useful the various data elements in the catalog are.

This same method of attack has been adopted for analysis of the news-retrieval application. Augmented-catalog specifications have been adopted for news articles and codified in a cataloging manual (Reference 6). Figure 4 shows a listing of the data elements of the catalog. Cataloging of over 100 news articles has been carried out to test the reasonableness of the catalog specifications.

   CONTROL DATA FIELDS
       1. Record Number or Identification
       3. Input Control
       8. Cross Reference

   ARTICLE DESCRIPTION FIELDS
      21. Personal News Source (Byline)
      22. Personal News Source Title, Affiliation
      23. Corporate News Source
      24. Headline
      26. Edition Statement
      27. Newspaper Name
      31. Format
      32. Length
      33. Illustrations
      46. Dateline
      47. Newspaper Article Location

   SUBJECT CONTENT FIELDS
      65. News Category
      70. Synopsis
      73. Subject Terms

   Fig. 4  List of Data Fields for Manual Cataloging News Articles

E. FUTURE PLANS

The future plans for the experimental news-retrieval system fall into three main areas:

   1. Completion of the computer programs and establishment of the data base.
   2. Experimental operation and analysis of the system.
   3. Extension and elaboration of the system.

Some of the work contemplated in each of these areas is outlined below:

1. System Completion

   * Continued acquisition of TTS tapes.
   * Generation of data base from the TTS tapes.
   * Completion of the index-term searching and text-access retrieving programs.
   * Manual indexing of news articles.

2. Experimentation

   * Experimental retrieval by systems analysts.
   * Experimental retrieval by newspaper personnel, possible setup of consoles at newspaper locations or at ANPA Research Laboratory at Easton, Pennsylvania.
   * Experimental operation by the general MIT CTSS community.
   * Analysis of experimental results to determine system strengths, weaknesses, and costs.

3. Extensions

   * More complete automatic indexing.
   * Less redundant automatic indexing.
   * More efficient system and analysis of requirements for operational, commercial system.
   * More fully automated system (for example, direct tie-in to wire-service transmissions).
   * Retrieval system more fully integrated into general newspaper computer-oriented processing systems.

STAFF

   PROFESSOR J. FRANCIS REINTJES (part time)
   MR. RICHARD MARCUS (part time)
   MR. OMAR SANCHEZ (full time)
   MRS. SONDRA LAGE (part time)
   MR. JOHN WARD (part time)
   DANIEL GRIFFIN, student
   BRUCE CICHOWLAS, student
   HARRY KLEIN, student
   CRAIG RICHARDSON, student

REFERENCES

1. Fano, R. M., "The MAC System: the Computer-Utility Approach," IEEE Spectrum, January, 1965, pp. 56-64.

2. MIT Computation Center, The Compatible Time-Sharing System, Cambridge, Massachusetts, The MIT Press, 1965.

3. Overhage, C. F. J. (Ed.), Intrex, Report of a Planning Conference on Information Transfer Experiments, MIT Press, 1965.

4. Intrex Staff, "Project Intrex Semiannual Activity Report," Massachusetts Institute of Technology, Electronic Systems Laboratory, 15 March 1968.

5. Rinehart, William D., "Future Tape and Code Standards," ANPA Research Institute Research Bulletin 921, June 14, 1967.

6. Lage, S. B. and Marcus, R. S., "A Cataloging Manual for a News-Retrieval System," MIT Electronic Systems Laboratory Technical Memorandum ESL-TM-349, May 15, 1968.

APPENDIX

A PROCEDURE FOR MODIFYING A DIGITRONICS MODEL 2500 PAPER TAPE READER TO ACCEPT TELETYPESETTER (TTS) AND USASI STANDARD TAPES

1. BACKGROUND

In order that the MIT-ANPA research group may work with Teletypesetter (TTS) punched-paper tapes it is necessary to read them into the MIT-modified IBM 7094 Compatible Time-Sharing Computer System (CTSS).
The only current method of handling paper tapes in the CTSS machine is to read them first into the PDP-7 display buffer computer which has a 300 character/sec reader. Available block-transfer programs using the data-channel connection between the PDP-7 and CTSS then permit the data to be transferred to CTSS.

The above procedure has worked well with Project Intrex and other tapes prepared on Flexowriter equipment. However, the TTS-tape format used in the newspaper business differs from the USASI tape standards used in all other types of tape equipment in two important respects: the feed holes in TTS tapes are advanced 0.013" from a center line drawn across the tape through the code holes; and the margin between the reference edge and the first code hole is 0.098" instead of 0.056", an increase of 0.042". Otherwise, the hole sizes and spacings are identical.

Before TTS tapes can be read on the Digitronics Model 2500 reader used in the PDP-7, three changes are therefore necessary in the reader assembly and circuits:

   1. The tape guides must be set for a tape width of 0.875" for the 6-level TTS tapes, as opposed to 1.000" for the 8-level conventional tape.

   2. The spacing between the reference edge guide and the read head assembly must be increased by 0.042".

   3. The delay in the feed-hole sense circuitry must be increased to allow for the advanced position of the feed holes.

The PDP-7 reader at MIT has been modified to permit switching back and forth between these two formats at the flick of two levers. Details of the modification are presented below.

2. TECHNICAL DETAILS

Tape Width

A standard option on the Digitronics 2500 reader is an adjustable edge guide with stops for 5-, 6-, and 8-level tape. The reader associated with the PDP-7 had a fixed guide, set for 8-level tape, but an adjustable guide was obtained from the Digital Equipment Corporation (DEC), thus solving the first problem.
Edge Spacing

Conversations with DEC and Digitronics engineers indicated that there was no provision for altering the edge spacing on the Model 2500 reader. However, Digitronics suggested that it was possible to drill two holes through the read head (a small metal block containing the photocells) and to slide it back and forth on two studs substituted for the original mounting screws. The attached drawing 16126'BN001 illustrates how this was done, with a convenient lever-cam to control the movement and limit motion to the required 0.042". A microswitch was added, which is actuated by the movement of the head. This switch is used to alter the feed-hole delay circuitry, as discussed below.

Feed-Hole Delay

In the DEC tape-reader circuitry, a pulse is generated when the leading edge of the feed hole is optically sensed. This pulse is delayed by 650 µsec to account for the half-diameter of the feed hole, and then used to gate the reading of the code holes when they are approximately centered over the photodiodes. At the tape speed of 30" per second, the 0.013" advance of the feed hole in TTS tapes requires an additional delay of

   0.013" / (30"/sec) ≈ 430 µsec

The delay in the DEC circuitry is created by a delay multivibrator (DEC R302 module), which has an internal delay-control potentiometer and also provision for substitution of an external potentiometer. The modification consisted of the use of the SPDT microswitch to switch between the R302 internal potentiometer and an added external potentiometer mounted in the computer frame. Thus the delays can be separately adjusted for the two types of tape.

3. OPERATION

Step 1. Tape width is adjusted by in-out movement of the original tape-guide lever. Click stops are part of this assembly.

Step 2. Edge margin and delay are switched by the added (brass-colored) lever. Straight down is normal DEC position, and to the right (CCW rotation) is TTS position. Stops on the cover define the two correct positions.
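The two numerical adjustments given under Technical Details can be checked with a short calculation. The Python sketch below uses only figures stated in this Appendix. The variable and function names are illustrative, and the total TTS delay it derives (the 650 µsec base delay plus the additional delay) is an inference from the text, not a value the report states.

```python
# Figures taken from the Appendix text.
TAPE_SPEED_IN_PER_SEC = 30.0    # tape speed past the read head
FEED_HOLE_ADVANCE_IN = 0.013    # advance of the TTS feed hole
BASE_DELAY_USEC = 650.0         # delay in the unmodified DEC circuitry
TTS_MARGIN_IN = 0.098           # reference edge to first code hole, TTS
USASI_MARGIN_IN = 0.056         # same margin on standard tape


def travel_time_usec(distance_in, speed_in_per_sec=TAPE_SPEED_IN_PER_SEC):
    """Time in microseconds for the tape to move `distance_in` inches."""
    return distance_in / speed_in_per_sec * 1e6


# Additional delay needed because the TTS feed hole is advanced.
additional_delay_usec = travel_time_usec(FEED_HOLE_ADVANCE_IN)

# Inferred total delay for TTS tapes (assumption: the two delays simply add).
tts_delay_usec = BASE_DELAY_USEC + additional_delay_usec

# Distance the read head must slide between the two tape formats.
head_shift_in = TTS_MARGIN_IN - USASI_MARGIN_IN

print(f"additional delay: {additional_delay_usec:.0f} usec")
print(f"inferred TTS delay: {tts_delay_usec:.0f} usec")
print(f"read-head shift: {head_shift_in:.3f} in")
```

The additional delay comes out at about 433 µsec, which the report rounds to 430 µsec. Since the external potentiometer is adjusted separately for TTS tape, the value actually set on the reader may differ somewhat from this nominal figure.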
The above modifications in the Digitronics Model 2500 reader were devised and installed under the direction of Mr. John E. Ward.

[Drawing 16126'BN001 (read-head modification): not recoverable from the source]