Automated Content Authorship

While I have been working on this project for about 16 years now, and publishing for about 8 years, the news of the patent (September 2007) set off a chain reaction in the press. Ambiguous or terse writing (probably due to space limitations) has led to a bit of confusion. I will start with a “Myth” report:

Myth #1 - I am working on romance novels …

I have a method for doing so, but find that this area is not a top priority for automation (given the genre is covered by so many authors already). Of all the forms of fictional literature (beyond poetry), romance novels may be the most formulaic (or at least some of the sub-genres) as many titles follow established rules (by époque, level of explicit language, etc.) and therefore may be reverse engineered and automated. There are, of course, books dedicated to deciphering formulas within this genre (e.g. Complete Idiot's Guide to Writing Erotic Romance). One journalist correctly quotes me (Shellie Karabell):

But he’s not looking to replace authors of fiction – even best selling fiction. “I would never try to programme a computer that would write the Harry Potter series,” he says. “The amount of time needed to write a programme would be longer than the human writing the book.”

I have explained to journalists that such a project would be very time consuming, and, in my opinion, would not produce titles of much use to anyone (notwithstanding a potential glut of fiction titles, at least in the English language). My current projects are reference and educational materials (in numerous languages), web-based educational materials (my online dictionary) and educational video animations. Romance novels authored by computers are an R&D project for the future, and may never see the light of day (at least from me).

Myth #2 – the book is created after someone orders.

In a Guardian article, the author writes “Nothing but the title need actually exist until somebody orders a copy.
At that point, a computer assembles the book's content and prints up a single copy.” This statement may be misleading. All titles are created in advance, vetted, and supplied to distribution well before any order. No title exists on a distributor’s site that does not exist in its entirety first (i.e. no book is written on demand); financial gap analyses are updated if needed before being sent to the user, but an original pre-exists. Only the printing in paper format is done upon request. Some 95% or more of the titles are sold electronically (through channels not used by the general public, such as OCLC’s NetLibrary, MyiLibrary (Ingram), EBSCO, marketresearch.com, etc.). PDF versions of each title are sent to distributors months before they appear on sites. Amazon only recently requested that they carry my business titles as they expand to business segments. The automated business titles have been published since the year 2000 and are purchased by central governments, large multinationals, banks and businesses with imports/exports.

Myth #3 – I consider myself to be the most published author.

Actually, I was first introduced as such by a dean at INSEAD to a visitor. Later a spokesperson from Amazon.com was quoted in BusinessWeek: "He may be the most prolific author in history," says Amazon's Kurt Beidler. Later I was introduced as such in a public seminar by the moderator. To none of the persons above, nor to my friends or colleagues, have I positioned myself in these terms (in fact, most of my colleagues at INSEAD, except our librarian and others in my department, had no knowledge of or were surprised that I was working on this topic). On a few occasions I have jokingly mentioned the quotations. This claim, however, is a bit off the mark, given the fact that I am working on reference and educational titles, many of which are in report form.
Amazon lists these titles in their “book” section, but they have nothing in common with novels (considered by many to be the only legitimate form of “authored” work). The book industry, of course, has many types of books. The ones I work on are of very particular genres. One journalist (Asher Moses, from Sydney) uses wording I would prefer:

But Parker doesn't describe himself as an "author", and he's far from the creative type. Rather, the US-based professor of management science at INSEAD business school has developed and patented algorithms enabling computers to write books for him.

My patent uses the term “title material”. For this I am certainly not the most prolific (e.g. compared to the U.S. Government).

Myth #4 – I do not announce that the books are generated by computer.

We need to distinguish across genres. One article is correct in saying that nothing announces that the healthcare titles are computer generated. I do not announce this since the text is not written by a computer. All of the text is written by professionals. The computer does not scour the internet, automatically summarizing the results into a book. Some internet sources are cited – given that this is an internet guide series. Only the formatting, indexing, lists, front matter, etc. are done automatically (this copy editing saves a substantial amount of time). The health series happens to be about “how to use the internet” so the confusion is understandable. It might be misleading to mention that computers were used, when applied only to the formatting (e.g. “the table of contents of this book is computer generated by indexing it to Heading1 settings in a Word document …”). When a newspaper uses software to format columns, they do not announce this to the reader.
Interestingly, the one series I have worked on where humans wrote 99% of the text is the one where some readers gave some titles a low rating on Amazon, or were willing to believe that a computer wrote the text (in contrast, some patient associations recommend and/or send the guides to their members). Of course, it is normal that some titles have high and some have low ratings when one publishes hundreds of titles in a subject area. All the health titles and medical text were vetted by professionals (e.g. medical librarians, forums, etc.). The series was created when medical libraries started “internet training services” and were seeking guides that were disease specific. In that series I am listed as an editor, not author. All contributed text is cited, and all organizations were contacted to obtain permissions for quotations. All passages were hand cleaned and edited by human editors.

In the genres where 99+% is written by computers, there are few if any negative comments (i.e. computer authored titles score a higher average rating than human written and edited books in this simple case - not enough evidence for an academic test, however, to draw any strong conclusions); the business, crossword, and classics titles are thinly sold via Amazon - but sell through more traditional business and/or direct to library channels, and are greatly appreciated. In the genres where computer algorithms create original content or results (which is what my patent covers), this is described in the methodology of each book by using language like “an econometric model is applied, …” I do not say that a computer was used to do the econometrics as this would appear silly to the reader (rarely, if ever, do people calculate the sum of squared errors by hand). The non-econometric or formulaic sections are written by me (by hand). The reader expects a computer to be used for the formulaic aspects of these titles (e.g. in the trade reports where 90% of the titles are tables and charts).
Finally, in the crosswords and classics, I do not announce that computers are used because I never really thought about it – does it really matter? The computer is nested with a very specific editorial and/or linguistic logic that makes the titles useful to non-English speakers (graph theory is used in these genres – a potentially distracting topic). I find this issue to be an interesting question. We watch movies, and yet no one pre-announces to the viewer that productivity enhancing tools like spell checkers, or Adobe Premiere, Avid, or Maya are used.

Myth #5 – My authoring programs dynamically scour the Internet and compile the results in a book (as auto-bloggers do).

Neither my YouTube video nor the New York Times article (or others) mentions this explicitly (the Times article is suggestive). Bloggers extrapolate this conclusion from words like “data bases” and/or “public information sources.” Well before the Internet, people wrote crosswords, performed economic analyses, and wrote poetry from “public information sources” (e.g. time series, and word lists/dictionaries). My computers do so “off line” by mimicking, for example, economists or poets. There is simply no need to use the Internet. None of my applications dynamically grab things off the internet based on a Google search and throw them into a book. For example, for the econometric studies, INSEAD purchases large quantities of source data not available to the public over the internet, which have existed since the 1980s and were distributed via data tapes, or now, via DVD (I’ve mentioned trade organizations, the IMF, etc. to journalists, as economists do in practice). All applications are database driven (some store links to the internet, if the subject of the book is the Internet itself; these were not scraped from Google or other search engines).
I have also amassed over the last 25 years a very large multilingual lexicon, among others, some of which I have posted for public use (www.websters-online-dictionary.org – again, the pages, and editing, of which are generated via automation programs). Here are reviews of my dictionary, which was started in the early 1990s and launched off-line in 1999: http://www.websters-online-dictionary.org/credits/editor.html

I think the Internet scraping approach, however, is fruitful in terms of generating new knowledge and knowledge structures. In today’s age, people assume that if data are involved, there can only be an Internet approach. My automation project began well before browsers were invented, using data available before the Internet age. The Internet, however, does allow genres that could not have existed otherwise (e.g. guides to the Internet).

Myth #6 – the programs simply copy and paste pre-existing information

The programs can do this, and will do so for some genres for some limited sentences or sections (such as boilerplate), but the vast majority (i.e. the 200,000 titles) do not do so (the output is “calculated on the fly”) for the bulk of the pages; i.e. titles cannot violate a copyright as the output is wholly original, and mimics my thought process as an economist or linguist (i.e. I use a very fast pen). This results in titles whose contents do not pre-exist in any databases, nor can be found on the Internet. I am working on series that can combine pre-existing data with original content (as some genres are designed to be this way).

What are the “algorithms”?

They are generally mathematical approaches that I have found well suited to specific genres (or subgenres). Here is a summary of the methods used (please refer to Wikipedia.org for definitions of unfamiliar terms), posted with my YouTube video:

The "algorithms" depend on the genre.
The most advanced use parametric, non-parametric as well as Bayesian econometrics, graph theory, and meta analysis (mostly coupled with some specialized computational linguistics and editorial rules that are required within certain genres) -- each piece is rather straightforward; the combination allows complexity. In terms of IT or programming languages, there is no rigidity to this - again it depends on the genre. If animation is the goal, then code is written to write MEL scripts, etc., which can automate Maya, which can in turn automate rendering, lights, etc., via macros. This works well, but for only certain aspects of that genre.

Some titles are 98 to 100 percent computer automated (e.g. business titles, crosswords, etc.). For health titles, only the format editing and production side is automated. The text in the health books was written by medical professionals and edited by a professional editor; the computer expedited formatting using about 50-odd routines (the preface, chapter intros, glossaries, indexes, headings, margins, etc.); highlights are made to sources generally not known to internet-averse readers or medical practitioners (designed for medical libraries with internet training services). Currently, some 2 percent of the titles rely on government sources for text. None perform a Google search, spider the net, etc. Some 98 percent of the titles are wholly generated via automation programs; the applications create original information or content that cannot be found elsewhere (e.g. maximum likelihood trade estimates, latent demand forecasts via a decision calculus approach, Chinese and English crosswords, etc.) - offline applications with no interaction with the internet. In total, there are about 17 genres created this way (about 200,000 titles or so since 2000).
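The remark above about writing code that writes MEL scripts can be illustrated with a small sketch. It is not taken from the actual applications: the function name, the object name and the keyframe values are hypothetical, and the only MEL commands assumed are `polyCube`, `currentTime`, `move` and `setKeyframe`. The Python function simply builds the script as text, which could then be saved and run inside Maya.

```python
def build_mel_script(object_name, keyframes):
    """Return MEL source that creates a cube and keys its position.

    keyframes: list of (frame, x, y, z) tuples (a hypothetical input
    format chosen for this sketch).
    """
    lines = [f'polyCube -name "{object_name}";']
    for frame, x, y, z in keyframes:
        # Jump to the frame, place the object, then record a keyframe.
        lines.append(f"currentTime {frame};")
        lines.append(f'move -absolute {x} {y} {z} "{object_name}";')
        lines.append(f'setKeyframe "{object_name}";')
    return "\n".join(lines)


# Hypothetical example: slide a cube five units along X over 24 frames.
script = build_mel_script("wordCube", [(1, 0, 0, 0), (24, 5, 0, 0)])
print(script)
```

The same pattern (a program that emits a script for another program) would apply to any scriptable tool; only the emitted commands change.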
It can take several years to set up an application (including all human inputs, licensed sound effects, textures, models, mocap, data, or decision rules that go into any genre-specific application). Platforms (e.g. Maya) pre-exist. The incremental, or marginal creation time per title is mentioned in the video. The genres are blind or peer reviewed and/or vetted by users (e.g. librarians or end-users) before they are put into print. The games are played by kids to see what they like. For 3D games, a pre-existing rendering engine is like a blank word document. The rendering engine is not created from scratch, but licensed (like MS Word). I am mostly now working on education titles for Asian, African, and Native American languages that do not have educational materials (games, supplements, texts, videos, mobile phone books, etc.) written in or augmented by their languages. See my dictionary at: http://www.websters-online-dictionary.org A very small percent of the linguistic material used is posted. Watch for a major update and linguistic augmentation to the dictionary this summer when I will also be introducing EVE. She is an "economically viable entity". A step beyond a chat bot, using some of the algorithms mentioned above (with a bit of utility theory and optimal control theory thrown in). There is no "commercial" or "public" or "open source" software that can be used by the general public. Some applications are terabytes large. I am working on a relatively small poetry application for public use -- to be released when completed (probably in a year), which will do several forms of poetry, on any topic the user desires; and allow the user to request "another" if they do not like the first one written, or "change that line", etc. The following are samples of grammatical acrostics, practiced in elementary schools to introduce children to poetry (title is an acronym for words in the poem): NUDE Naked unclad, dear enactment. LOVE Lean of vile emotions. 
GOD Gentlemen of divinity! BOOK Bible ordered, obtained Koran.

The application for this genre uses graph theory (clique commonality) and over 40,000 grammatical structures, ranked by meta-analytic probabilities of being understood by English readers. There are many other areas I am working on, as there are multiple avenues to explore, especially in the areas of new media (mobile and fixed), but more so in high-end analytics and knowledge discovery (i.e. generating knowledge that could not be created otherwise) as applied to business, language and public services (e.g. criminology) - where unmanageable, sparse, disintegrated or larger data sets (off-line) result in new knowledge structures usable by decision makers (e.g. connecting the dots where humans have difficulty doing so, for lack of time or expertise).

Myth #7 – it costs 12 cents to create a book

This figure (or similar numbers) reflects the marginal cost. The full or average cost is much higher. The set-up cost for an application can be hundreds of thousands of dollars – costs that may not be recovered over the life of the genre. This is true for both electronic and non-electronic versions.

Concluding Remarks

The most interesting aspect of this project, to me, is what can be achieved by it. To date, journalists have not covered this angle. I think what I have done thus far is extremely modest, and many other applications can be developed, especially in genres that involve highly repetitive writing methodologies, or that lack the economies to be created otherwise (languages or topics that are obscure to most, but critical to others). Here are a few comments from around the Internet by people who see this potential:

In a way, humanity can be defined by what it is that humans can do that machines can’t do. That boundary is continually being pushed further, and in coming years we will need to move to increasingly complex and imaginative tasks of synthesis and creativity that computers cannot do.
Philip Parker, a professor at INSEAD, is probably doing more than anyone else to push this boundary. … In many cases the market is too small to justify a person writing the report. However there is no question that a significant part of an analyst’s work can be automated. The boundaries of human value are being pushed further, and this is just the beginning.
Ross Dawson

As [his] video demonstrates, many of his works are economic or market analyses and forecasts, but he also uses the technology to write about obscure medical topics – both genres that he’s able to succeed in because they are underserved by traditional authors.
Scott D. Anthony

It's a fascinating subject and it calls into question many of our assumptions about writing and research. This guy is part of a movement that is doing to office workers what the industrial revolution did to blacksmiths.
Daryl

To be fair, here is the other side of the coin:

Philip Parker has won today’s “Worst Person in Publishing” award. I wanted to give him the “Worst Person in the World” title, but, well, I’m fairly certain that’s been copyrighted. Hmm, maybe the New York Times will share the honors, if only due to its continued lack of critical thinking when it comes to covering books and publishing. … Likewise, I am not sure that the NYT, as close an industry-town publication as possible, is capable of writing about the publishing business with clear-eyed intelligence.
Kassia Krozser

Fire the monkeys! Return them to their happy habitats! Our genre of choice will be written by GLaDOS, and other AI computers, because there’s only “so many body parts” about which to write a romance.
SB Sarah

Actually Parker is providing a rather useful service for those who understand the limits of his “books.” I just hope that the “Make Money Fast” crowd doesn’t catch on too quickly to the possibilities here and come up with yet another product category to push through e-mail and blog comment spams.
As for Amazon, I wouldn’t mind a filter to separate Parker-style books from the purely human-done variety. Meanwhile perhaps Parker and his machine-aided crew can go on to write a coping guide to for victims of technology. David Rothman Mr. Parker is an “author’ only in the loosest sense. Jane Should authors be worried? Probably not, at least not yet. There's a wide gap between what a computer can compile and the nuanced hand of a skilled artist. Still, this news is a bit unsettling to those employed in the creative arts. And, taking the music industry as an example, it doesn't seem well advised to underestimate this sort of development. It's the kind of trend that could as easily become a dead end as an overnight sensation. Either way, it's worth consideration. Nathan Denny He also says, "'My goal isn’t to have the computer write sentences, but to do the repetitive tasks that are too costly to do otherwise.'" That has me really baffled. Aren't romances composed of many, many sentences? Fortunately I, having endured this sort of ignorant notion of romance novels for twenty years, have learned to calm down and carry on relatively quickly. Margaret Moore His ignorance [about romance novels] is almost embarrassing. Kimber Chin The London Times has pointed out one Philip M Parker who has created over 200, 000 titles (albeit mostly statistic books from what I can see) using print on demand technology. But the worst part is that, by his own admission, automation produced a large part of his works. And he’s planning to move into romance novels and poetry. that’s what freaks me out. No matter how formulaic either genre can be, in the most juvenile hands, it is still something human. The idea of automated poetry makes my skin crawl. bookology.wordpress.com … it’s now possible to foresee a literary future in which human intervention is no longer required. Michael Moran The best publishers are focusing on building large growing communities. 
Content is becoming a commodity, as content without subscribers is worthless. As failing mainstream publishers follow in Mr. Parker's footsteps, small publishers stand no chance to compete unless they have an army of brand fans.
Aaron Wall

I guess the automated content may look good enough to look real, but the talent is something more than that. I think such automated tools are a threat to everyone who publishes mediocre content though.
bobby_handzhiev

Won't the advent of programmes like this enable more small publishers to produce content? I think this will drive the premium on quality original content higher still. However, long term (maybe 20 years +) perhaps AI will have reached the point where it can start drawing its own conclusions. Then we really become redundant! And who will be leading the way with AI? Perhaps the company collecting huge amounts of data of every aspect of our lives? Google.
BenCo

"if you are ever stuck for an absorbing read don’t forget "The 2007-2012 World Outlook For Bridges, Crowns, Dentures and Other Orthodontic Appliances That Are Customised For Individual Application on a Prescription Basis"
Roland Dodds

… we hope someone sent from the future destroys these robot authors — partly because we don’t want to be destroyed by the machines, and partly because we are pretty well out of robot-war jokes. But we'll do what we have to if more news comes along — because, while we may run out of punch lines… [adopts growly, inspiring Bill Pullman voice]… we'll never run out of hope.
Ben Mathis-Lilley

What is my take?

I think that the most useful applications will be created for genres that are so complex or labor intensive that automation is almost the only viable approach. That being said, writing hundreds of original high-quality Ph.D. theses will be easier to accomplish using this approach than writing a single creative and high-quality children’s story (given the lack of formulaic sub-genres that can be reverse engineered).
“Human creativity” in this sense is the absence of formulaic authorship techniques that can be reverse engineered. Some Ph.D. theses, and forms of poetry for that matter, are not that “creative”. Creative authors, journalists, editors, report writers, manual writers, script writers, or bloggers, therefore, need not fear ever being replaced by this process. The same is true for creative doctoral students, moviemakers, television producers or PC game makers.

Then what does original mean?

From a pragmatic point of view, if one title borrows from another to a sufficiently large degree (especially without citation), it might be considered un-original, if not plagiaristic. If the two titles have so little in common that they do not seem to borrow from each other, one might say they are originals (e.g. a romance novel – not all - can have a formulaic plot, but use different sentences and paragraphs that do not overlap to any noticeable degree with an existing romance novel with exactly the same plot). This form of originality (or lack thereof) is often seen in television game shows. Each episode is original, but each episode uses the same segment sequences. Original and very entertaining, but not that creative from one episode to the next. In essence, viewers crave the formula and want to see it repeated in original episodes. The genre in its entirety, of course, can be a very creative result.

What is quality?

It lies in the eye of the “segment” (in publishing industry jargon). A trade study can be far more useful than a romance novel to someone wanting to prioritize world markets for the products they are selling. The opposite is true for someone who loves novels and is not involved in international trade. There are segments to content markets. Can a computer, therefore, write prose that is higher quality than Shakespeare?
Of course; especially if the person comparing passages side-by-side hates Shakespeare or does not understand Elizabethan English (probably a large enough segment). Will a computer generate work reaching the stature of Shakespeare in English courses? I doubt it (unless, of course, the formulas used by the Master can one day be reverse engineered; or a great author of that league, as yet unknown, is also a great programmer).

Will this make human authorship obsolete?

For some forms, potentially "yes", for at least the formulaic or mundane forms of human authorship, or for human authorship of genres that are uneconomical otherwise. Which genres of authorship (in video, text or other formats) are not formulaic enough to be automated? Time will tell.

I hope this clarifies & thanks for reading.

Phil

More Background

Overview

Some like calling it a “book writing machine” or “software” but in fact it is a computer-based automation process for authoring, irrespective of the format (book, video, PC games, etc.), language, or subject (fiction or non-fiction). For those interested in the technical aspects of the process, please refer to the actual patent, which presents flow diagrams, etc., and to a YouTube video that tersely describes the process and shows an example of an application and some output:

Patent: http://www.google.com/patents?id=bHeBAAAAEBAJ&dq=philip+m+parker
YouTube: http://youtube.com/watch?v=SkS5PkHQphY

It is strongly recommended that interested persons read the full patent. On the patent page, the reader will find detailed technical descriptions of the process and the prior art.

Professor Parker began working on this project in the early 1990s. The goal was to create original titles (book, videos, games, etc.) on topics that would not be economically viable if published using traditional methods, or covering topics that might be of interest to a limited audience that would nevertheless find the titles useful (what some call the “long tail”).
The process does not require “Internet scraping”, and most existing implementations of the process are Internet independent. The patent is written as a “pioneer patent” as it applies to all forms of original title materials (videos, books, PC games, etc.) created in this fashion.

Forms of Authorship

Much as authors publish various forms of fiction and non-fiction literature, it is convenient to see the method or process as allowing various forms of authorship automation (which can be used in combination).

Form 1: Involves compiling, sorting and formatting existing information, and drawing basic conclusions (e.g. if there is no pre-existing content, then this fact alone may lead to original logical conclusions drawn about the topic). This level is useful for consolidating and structuring knowledge in a domain where much of the text, video or sounds pre-exist. The programming for this approach typically involves hundreds of details, especially with respect to formatting and style. Although the output is typically a compilation, some of its components will be original, and can result in new knowledge.

Form 2: Involves replicating a formula within a genre. In this case, new knowledge is not necessarily generated, though the reader or viewer may end up acquiring new knowledge. The data (words) may be in the public domain on a stand-alone basis, but the output is as original as what a human author (or director, screenwriter or actor in the case of a movie) might create. The final result is typically wholly original.

Form 3: Involves the generation of new knowledge as the primary goal. This involves, for example, the computer mimicking a specialist who is asked to prepare a report, film or game that draws original conclusions, images or levels of entertainment. For example, if one asks an economist for an opinion, the economist will typically perform an analysis and make summary statements based on his or her findings.
The automation process, in this case, literally follows the behaviours of the economist, and reports the findings - findings that have never appeared before in any format, do not pre-exist in any database and are not currently available on the Internet. The computer, in this case, is pre-invested with knowledge or expertise (e.g. economic models and knowledge of economic geography). For this approach, the word “specialist” is domain independent. We can rewrite the example from above to be: “For example, when one asks a poet for a poem on a given subject, they will typically ponder on the subject and write prose based on their inspirations. The process, in this case, literally follows the behaviours of a poet, and creates a poetry book – consisting of poems that have never appeared before in any format.” The distinction between poetry and econometrics is the formulaic nature of the genres, but not the process to author them. The third level can create high-end econometrics to the same degree that it can write poetry. It turns out that the most useful applications at Level 3 are for genres that are so complex or labour intensive that automation is almost the only viable approach.

History

Research on this approach began in the 1980s and early 1990s. The first titles authored via full automation relied on Form 3 (described above), with the goal of generating new knowledge that would be difficult to produce otherwise. These came in the form of e-books distributed via high-end distributors dedicated to this market (Dialog, MarketResearch.com, etc.) and then print-on-demand titles (Ingram’s LSI and Amazon’s Booksurge). The “Trade Perspective” series was created due to the inconsistencies between import data from importers and export data from exporters. The model comes up with maximum-likelihood estimates of real trade flows (adjusting for currency fluctuations) – a rather boring process but of interest to people involved in international trade.
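As a rough illustration of the maximum-likelihood idea behind the “Trade Perspective” series, the sketch below reconciles an importer-reported and an exporter-reported figure for the same trade flow. It is not the model used in the series: it assumes, purely for illustration, that each report equals the true flow plus independent Gaussian noise with a known standard deviation, in which case the maximum-likelihood estimate is the inverse-variance weighted mean of the two reports. The function name, the figures and the noise levels are all invented for the example.

```python
def ml_trade_flow(import_report, export_report, import_sd, export_sd):
    """ML estimate of one true trade flow from two noisy reports.

    Assumes each report = true flow + independent Gaussian noise with a
    known standard deviation (an assumption made for this sketch only).
    Returns the estimate and its standard error.
    """
    w_imp = 1.0 / import_sd ** 2   # precision of the importer's report
    w_exp = 1.0 / export_sd ** 2   # precision of the exporter's report
    estimate = (w_imp * import_report + w_exp * export_report) / (w_imp + w_exp)
    std_error = (w_imp + w_exp) ** -0.5
    return estimate, std_error


# Hypothetical figures, in millions of US dollars.
estimate, std_error = ml_trade_flow(120.0, 100.0, import_sd=10.0, export_sd=20.0)
print(round(estimate, 1), round(std_error, 2))
```

The more reliable report (here the importer's, with the smaller standard deviation) pulls the estimate toward itself, which is the basic behaviour one would want from any reconciliation of inconsistent import and export data.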
This series is mostly used by government agencies and businesses. Similar series using Form 3 are the “World Outlook Reports”, which produce Bayesian econometric estimates for the worldwide latent demand for various products and services, and the “financial and labour benchmark” series, which mimics the process used by accounting firms and/or investment banks to compare real differences in economic performance across firms and/or economies with differing accounting rules. For each of these series, there is a very large upfront cost (many man-years of programming in most cases), but once this is accomplished, the incremental cost per title is very low (the costs mentioned by journalists are the incremental cost of about 50 cents, not the total or average cost per title, which is much higher when considering start-up costs). Samples of these books can be found at http://www.icongrouponline.com/browse/.

Later, series using a combination of Form 1 and Form 2 were created in the form of patient and physician sourcebooks. Around 2001, medical libraries launched efforts on “internet training” for their patrons (e.g. how to use the internet to research diseases). This series was created for this market and is mostly distributed via OCLC’s NetLibrary service in e-book format. Form 1 was also used to create a series of bi-lingual classic titles which provide a running thesaurus in the language of the reader.

More recently, multilingual crossword puzzle books and thesauri were created using Form 2. Some of the thesauri rely on a graph theoretic approach (combined with traditional computational linguistics) to derive what is probably the world’s largest multilingual thesaurus. A small percent of the databases required for some of these later genres is posted on Webster’s Online Dictionary (www.websters-online-dictionary.org), which was started in 1999 as a testing ground for the general approach (i.e.
the automatic authoring of original content on a web site):

Some Background & Reviews: http://www.websters-online-dictionary.org/credits/editor.html
Another Review: http://hurricanecountry.blogspot.com/2006/12/dictionary-heaven.html
The Objective: http://www.websters-online-dictionary.org/about.us/about.html
The Site: www.websters-online-dictionary.org

As only 10% of the data available are posted, future editions will be substantially larger and allow for high levels of interactivity.

Recent History

In terms of R&D, substantial time and effort is currently being invested to create (1) a series of interactive web sites that can automatically author titles, (2) educational game shows and (3) language learning programs. With respect to video, instead of automating “Word” to author a book, the same process is being used to automate Maya and video editing software (software for 3D animation/video used in movies like King Kong, the Matrix, and Shrek). The goal is to create video programming to teach any concept, in any local language. It turns out that for most of the world’s languages (e.g. Estonian, Maltese, etc.), the cost of video programming using traditional methods is prohibitive, so local stations end up dubbing foreign-based programs (or programs receiving government subsidies). This project started in 2004 with 3D games and software (a bit easier to begin with than video), which has resulted in hundreds of titles distributed by Digital River, Handango and Microsoft (for Pocket PC versions), among others.

The following is a YouTube link to cut scenes from a game show designed for language learning – a formulaic form of television (that is being coded for automation): http://www.youtube.com/watch?v=Fug4UGbsIxY
The following is a video “word of the day” that will be used across many languages: http://www.youtube.com/watch?v=slNTZ4vEqGQ
Here is a cut scene from the 3D game: http://www.youtube.com/watch?v=2QBC5zlXdDw

FAQ

This FAQ covers other common questions.
For each question, the generic answer is typically “it depends on the genre” and “it depends on the format (book, video, software, PC game, etc.).”

Q: Can I have a copy of the software?
A: No. The process is not a software package, but a complete system that requires that a computer or computer network be set up for this purpose – for a particular genre. Most genres are too large to be easily transferable via the internet. One video application is many terabytes, and other applications are many gigabytes.

Q: How long does it take to set up a genre?
A: This completely depends on the complexity of the genre and the quality one is willing to accept for the titles. The earliest genres took several man-years to create before they met industry standards (i.e. the quality of a human author). The later genres took a matter of months (e.g. crossword puzzle books). Sometimes the longest part is acquiring and coding domain knowledge (e.g. knowing how a Ph.D. thinks in a particular domain before they author a genre). Already published genres rely on advanced graph theory and econometrics; others rely on traditional content analysis.

Q: How much does it cost to produce a book or other title?
A: It depends on how you define cost. The marginal cost of creating a title in electronic format is the price of the electricity used to create the title, plus some small amount of hardware depreciation (maybe around 50 US cents). The average cost, which includes the printing of the book (in paperback), or a game in DVD or CD format (printed on demand), and the overhead to distribute the book, can range from around $10 to around $30. The total cost for an entire genre of books, videos, or software games can exceed hundreds of thousands of dollars or more in programming time, database acquisition or licensing, and other overheads. Once a large sum of sunk costs is expended, the marginal costs are minimal.
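The relationship between the sunk cost of a genre and the cost per title in the answer above can be made concrete with a small calculation. This is only a sketch: the setup cost of 300,000 USD is a hypothetical figure chosen to match "hundreds of thousands of dollars", the 50 cent marginal cost is taken from the text, and printing and distribution (which the text includes in its $10 to $30 average) are deliberately left out. The function name is an assumption made for illustration.

```python
def amortized_cost_per_title(setup_cost, marginal_cost, titles):
    """Cost per electronic title once the setup (sunk) cost is spread
    evenly over every title in the genre."""
    if titles <= 0:
        raise ValueError("titles must be positive")
    return setup_cost / titles + marginal_cost


SETUP_COST = 300_000.0  # hypothetical one-off cost of building a genre, in USD
MARGINAL_COST = 0.50    # roughly the 50 US cents mentioned in the text

# The per-title cost falls towards the marginal cost as the genre grows.
for n in (100, 10_000, 250_000):
    cost = amortized_cost_per_title(SETUP_COST, MARGINAL_COST, n)
    print(f"{n:>7} titles: {cost:,.2f} USD per title")
```

With only a hundred titles the setup cost dominates; with hundreds of thousands of titles the per-title figure approaches the marginal cost, which is the economic argument the answer is making.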
For video or high-end gaming, the costs can be very high; with the budget to create a single traditional 3D animated movie, however, one can use this approach to create thousands of titles within a given video genre.

Q: Is this really that complicated?
A: It depends on the genre and format. In the early genres, it was found that rather complicated issues were simple to implement (e.g. Bayesian econometrics), and logically simple things were nearly impossible to implement (e.g. getting Windows to behave well when indenting certain graphics, or rendering in DirectX). In general, Joseph Weizenbaum says it all: 'It is said that to explain is to explain away. This maxim is nowhere so well fulfilled as in the area of computer programming, especially in what is called heuristic programming and artificial intelligence. For in those realms machines are made to behave in wondrous ways, often sufficient to dazzle even the most experienced observer. But once a particular program is unmasked, once its inner workings are explained in language sufficiently plain to induce understanding, its magic crumbles away; it stands revealed as a mere collection of procedures, each quite comprehensible. The observer says to himself, "I could have written that." With that thought he moves the program in question from the shelf marked "intelligent" to that reserved for curios, fit to be discussed only with people less enlightened than he.'

CASE STUDIES

The following case studies illustrate a few examples of how the technology has been used to create large quantities of original title materials. These are presented for illustrative purposes only, and reflect a small part of potential applications.

Reference, Research & Educational Books

Output: Over 250,000 original titles, available in various paperback and ebook formats (www.icongrouponline.com).
Distributors: Barnes & Noble®; amazon.com; Lightning Source (Ingram Book Group); NetLibrary [OCLC - eContent]; Ingram Digital and MyiLibrary; ebooks.com; google.com, among others.

Beyond the tasks accomplished by acquisition editors and publishers, books are traditionally written by human authors, edited by humans, and formatted by human production editors. These are in turn marketed by humans. Using the most advanced approaches to electronic publishing, this approach reduced the time to create and publish reference and educational books. The approach is of interest to the publishing industry, which is becoming more fragmented and specialized as print-on-demand and ebook technologies show substantial growth. Coupled with electronic distribution via libraries, publishers and media companies can now access what may have previously been seen as saturated markets. Examples of genres produced for ICON Group include:

Patient Sourcebooks (500 titles by disease or condition)
Physician Dictionaries (2100 titles by disease or condition)
Genome Sourcebooks (190 titles by disease or condition)
Bilingual Crossword puzzles (1200 titles, 100 pages each)
Classics – enhanced via computer authoring for test preparation (150 titles)
Classics – enhanced for non-English mother tongue speakers (1000s of titles)

Scientific Discovery, Research, Custom Publishing and Proposal Writing

Output: Over 150,000 Industry and Business Intelligence Reports.
Distributors: marketresearch.com; www.bharatbook.com; manta.com (ECNext); MindBranch, and EBSCO, among others.

In terms of discovery, intelligence analysts, researchers, scientists, security specialists, or anyone who must "connect the dots" may not have the time or capacity to exploit their skills to a maximum potential. The databases and/or sources of information used to generate and quickly communicate new knowledge may be so vast or complex that traditional approaches simply fail to exploit the potential.
Similarly, in business, a substantial amount of valuable management time can be wasted writing proposals, or proposals are never written, resulting in opportunity losses. This approach has been used to create, for example, approximately 14,000 international trade studies that draw original conclusions with respect to the world's trade flows across numerous product categories. The metadata and related information required for distribution for each title were also authored via automation. Examples of these titles can be seen at marketresearch.com, one of the largest distributors of high-end market intelligence. Had this genre been approached using traditional methods, the cost of producing each title would have been prohibitive. This approach can also be used to localize educational content for specific markets, down to an individual instructor or student.

Networked Multiplayer Games/Simulations

Output: A virtually infinite number of business simulations for INSEAD (Singapore and Fontainebleau, France), INTERCOMP Simulation (www.insead.edu).

MBA programs and executive education programs around the world have, for years, relied on business simulations to teach strategy, operations, and marketing. These simulations, or games, are played by teams or individuals who compete against each other while learning and applying business frameworks. Traditionally, business simulations have been industry (e.g. consumer electronics), geography (e.g. a fictitious world) and/or language specific (e.g. English). INTERCOMP is not a simulation, but rather a simulation "writer." It was created using an approach that allows a virtually infinite number of simulations on any known industry (e.g.
from toothpaste to industrial power transformers), any realistic geography (within a specific country, like China and its various cities, or across a selection of countries and cities relying on real economic data), and any language (English, French, Chinese, Arabic, or any of 200 or more other languages). The simulations can be further tailored to specific business topics or emphasis (e.g. HR, finance, production, marketing, strategy, etc.). An example of one such simulation is dedicated to the mobile communications handset industry and pits Apple, Nokia, HP, Dell, Motorola, HPC, Samsung, LG, and Sony-Ericsson against each other in a global battle to conquer the world market across 57 countries. The setting is five years into the future, when a new generation of mobile communications standards has been adopted by operators and manufacturers. This simulation has been used in an award-winning MBA elective and executive education course; a version dedicated to telecommunications is available for download at: http://webfac.insead.edu/intercomp/downloads/program_latest_version.html

The advantage of this process is that simulations and/or multiplayer games can be created at minimal cost for a specific group, or “clique”, of executives or individuals in a specific industry, simulating real competition faced in that industry. Because the simulation can be calibrated using real data, the output is not a simulation, but a strategic planning tool that can be used to foresee competitive activities or simulate game theoretic outcomes. After setup, no clique is too small for a fully customized simulation or game, given that the marginal cost of producing a game for the clique is virtually zero.

PC Software and Video Games

Output: 400 Educational Game Titles and over 1200 Reference Software Titles.
Distributor(s): www.digitalriver.com

There are role-playing games, adventure games, first person shooters, strategy games, sports games, educational games and a variety of others.
Each of these follows a generally accepted set of rules which users have come to expect. Each title can be in 2D or 3D formats designed for a variety of platforms (PC, console, mobile devices); each format is further bounded by formulaic requirements. Traditionally, dedicated teams create a single title within a genre, each with a substantial cost. I approach game development by automating "game writing" programs which author original titles spanning the entire genre selected. A recent example of this was a series of some 2000 third-person shooter PC games that allow children, ages 4 to 6, to learn basic English as a second language (or other topics). A tomato called "Webster" defeats an enemy called IGNORANCE, who has armies of evil avatars (e.g. from dinosaurs to space ships). Within each topic covered by this sub-genre, there are 4 separate game titles featuring differing graphics, sound effects, challenges/puzzles and enemies. A video cut scene illustrating this game series can be seen here. Some game play can be seen here (towards minute 8). Each game title takes approximately 5 to 10 minutes to create, irrespective of the topic. Here is a low-resolution screen capture of an extended video of a game created this way. 2D multilingual games are listed here.

Mobile Phone Applications (Pocket PC & Smartphone)

Output: Thousands of Pocket PC dictionaries and games for Handango (www.handango.com), Microsoft.com, and others.

Recent research indicates that people in many low-income countries often first experience the Internet via a mobile communications device. In high-income countries, Smartphones, Pocket PCs (PDAs), multimedia phones and video players are gaining greater acceptance as users upgrade from traditional devices, and operators push higher-end handsets which increase network traffic.
Greater on-board memory and higher download speeds are also creating greater demand for mobile content tailored to a large number of localities with differing content needs. Traditionally, mobile content publishers create a game or application and, once it is successful, localize the title for large markets or create sequels for the one market where the title was successful. The technology allows original titles to cover the entire spectrum of topics/geographies within their respective genres, with each title authored in a matter of minutes. Automation also allows for cross-platform authoring, given the variety of operating systems (RIM, Symbian, Microsoft Mobile, etc.) and devices (iPod/iPhone, Nokia, Motorola, Sony-Ericsson, Samsung, HTC, Blackberry/RIM, LG, etc.). One application of the technology in this area is the creation of mobile phone software generation programs for educational games and reference software applications. Some 400 casino games, 200 bilingual dictionaries, and thousands of professional reference applications have been authored and are currently selling via various distribution channels (for PocketPC and Smartphones).

Web Site Creation

Output: World’s largest multilingual dictionary: Webster's Online Dictionary (www.websters-online-dictionary.org).

Listed, for the year 1999, as an important “invention” of the 20th century by The Great Idea Finder, Webster’s Online Dictionary – The Rosetta Edition is an open access dictionary that spans over 400 languages. The dictionary is now the world’s largest and is a mix of compiled content and original content generated using the technology. Despite the dictionary being so large (with over 20,000,000 entries, and growing), it is maintained by no editorial, marketing or other staff. Well over 40% of the content, statistics, and entries were authored by computer, in the same manner that a lexicographer or linguist would author them. The dictionary is constantly being improved and is a laboratory for innovation.
Currently the dictionary receives some 1,000,000 page views a month, and is ranked higher, in terms of traffic, than the Oxford English Dictionary. Over 1,000 sites link to the dictionary or its pages. Some 85 percent of the site’s traffic comes from outside of the United States, and the site is, for many languages, the primary site for language learning and reference. The dictionary is in its “first draft” form. Reviews and historical discussions of the current edition can be found here. Similarly, the approach can be adapted to create a high volume of content-oriented sites that span languages or topics, for use over traditional or mobile networks, that themselves become authors of original content, with or without end-user interaction.

Video (All Formats & Media)

Output: Various high volume programs.

The cost of professional video production involves a large quantity of human inputs, from producers, scriptwriters, actors, and directors, to set designers, photographers, camera crews, special effects specialists, and pre- and post-production editors. Human and material costs have often prevented the creation of niche programming or films on narrow topics, or for languages or cultures that might not have a large enough audience to profitably justify an investment. This has led to content shortages for many countries, languages, interest groups or cliques (micro-segments). The substantial costs of production have also led a number of media companies to rely on user-generated or contributed content that is of variable quality and/or fails to meet the needs of these unserved niches (e.g. there are not enough video producers interested in, say, Tarahumara to justify creating enough content to support a channel for that audience). Automated video authoring is similar in nature to that of books or software, though the formats have higher dimensionality and the "intelligences" modeled are different.
The goal is to drive the cost of high-quality video production to a minimal marginal cost (e.g. the cost of rendering alone). The technology is now being used for video production for a variety of the more formulaic genres (news, game shows, education, mobile phone snacks, classic story telling, comedy, etc.). Examples of test renders for mobile telephone snacks and television segments can be found here on YouTube:

Mobile/Traditional Snacks
Word of the Day “Snack” – Macroglossia (thousands of these across languages are in production).
Word of the Day “Snack” – Hindsight
Word of the Day “Snack” – Euphonious
Word of the Day “Snack” – Laconic
Word of the Day “Snack” – Excretion

Gameshow
A Multilingual Gameshow (cut scenes only, created for all written languages, for people wanting to learn English).

Advertising/Promotion
A Video Promotion Clip for a Hangman Game (also authored via computer).

Segues
A Classic Movie Review before its Airing
A DVD Introduction Segment

The Future

As the above cases illustrate, the application of the technology is format and context independent. Only a small percentage of ideas are represented here. Future applications, in the works, include fully interactive, real-time authoring systems and other activities that fully integrate human activities, allowing third parties, as well as end-users, to have their systems create original title materials.

Glossary of Important Terms and Concepts

The following glossary can prove useful to our partners in approaching automated content creation. We have sorted these definitions in a logical order of “conception” to “delivery”:

Method and apparatus for automated authoring and marketing: an approach for automatic authoring, marketing, and/or distributing of title material. A computer automatically authors material. The material is automatically formatted into a desired format, resulting in a title material. The title material may also be automatically distributed to a recipient.
Meta material, marketing material, and control material are automatically authored and, if desired, distributed to a recipient. Further, the title material may be authored on demand, such that it may be in any desired language and with the latest version and content.

Original work of authorship: Works of authorship include title materials, such as literary works; musical works, including the lyrics; dramatic works, including any accompanying music; pictorial or graphical works; motion pictures and other audiovisual works; sound recordings; and any compilations and/or derivative works or the work of authorship; and other materials.

Materials: any information and data capable of being used in a title material, for example text, audio, video, descriptive, tabular, artistic, and/or graphical information.

Title material: publishable and/or authored work, such as literary works, serial publications, theatrical plays, books, including fiction and nonfiction works (for example, but not limited to, reference books, market research reports, travel guides, company competitive analyses, industry reports, company reports, management consulting reports, technical documents, and the like), newsletters, magazines, computer instructions, software, software publications, Internet publications, computer-based content, Internet web sites, musical scores, screen plays, video productions, holographic or 3-D works, virtual reality works, and the like. Alternatively, title material includes any work that is capable of being associated with a unique identification alpha-numeric code, for example a unique alpha-numeric identifier that is used to identify the work or a catalog number.
Title material also includes any work that is capable of being associated with unique alpha-numeric codes, such as an ISBN (International Standard Book Number), ISSN (International Standard Serial Number), a UPC (Uniform Product Code), a library number (such as the Library of Congress identifier), a bar code, an item number, an SKU (Stock Keeping Unit), a number code, a case law number, a docket number, an abstract number, a year of publication, a chapter code, and the like. Title material can also include any authored or published work that is to be commercially available. Title materials can include any work with an alpha-numeric numbering system that is observable or intended to be observable within the public domain.

Marketing material: includes information used to market, disseminate knowledge of, or promote title material. Marketing materials publicize or announce title materials to various audiences, including remote servers that post electronic announcements. Marketing material includes public relations works, press releases, product announcements, brochures, flyers, billboards or outdoor copy, video, audio, magazine or print media copy, emails, banners, displays or similar materials, etc.

Meta material: includes materials used to describe title material. Meta materials may be used in the publishing and media industries to catalogue and/or promote title material. Meta materials describe title material to publishers, resellers, distributors, industry associations, industry organizations, government organizations, or end-users such as libraries or individuals. Further, meta materials may include text, graphics, numerical data, coverings (such as a book jacket, a CD jacket, videotape jacket, or the like) or other information that is used to describe the title material.
Additionally, meta material may include, but is not limited to, information regarding the price of the title material, the length in pages or time of the title material, the language of the title material, the physical or electronic format of the title material, the binding or packaging of the title material, an abstract of the title material's content, an alpha-numeric identification number of the title material, subject codes or text of the title material, comments from the author of the title material, comments from the publisher of the title material, credits related to the title material, endorsements of the title material, reviews of the title material, a table of contents of the title material, date of publication of the title material, place of publication of the title material, name of the publisher or producer of the title material, address of the publisher or producer of the title material, or the like. Further, meta material includes meta files and/or metadata.

Control materials: include any information used to control, track, index or account for title material. Control material includes items in meta, title or marketing materials, but may also include information used for inventory control, billing, financial accounting, stock keeping, information relating to the target audience, and cataloguing information used for internal control.

Database files: include modules, queries, macros, reports, tables, templates, graphics, automation programs, audio and video files, data files, material files, information in a database, document files, and the like.

Genre: A genre is a group or series of title materials having common characteristics or using similar procedures to be authored. Genres include, for example, a series of market research reports having similar formats, logical statements, calculations, graphics, or patterns with different content for each title material within the genre.
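The definition of a genre above (a shared format with different content for each title) can be sketched in a few lines. This is an illustration only, not the system described in this document: real genres also embed calculations, logic, and graphics, whereas the sketch only fills a shared text template from data records. The template wording, the function name, and every record are made up for the example.

```python
from string import Template

# Hypothetical genre: one shared template, different content per title.
GENRE_TEMPLATE = Template(
    "The $year Import Outlook for $product in $country\n"
    "Estimated imports: $value million USD."
)


def author_genre(template, records):
    """Author one title per data record using the genre's shared template."""
    return [template.substitute(record) for record in records]


# Made-up records; in practice these would come from a database.
records = [
    {"year": 2008, "product": "Toothpaste", "country": "France", "value": 12.5},
    {"year": 2008, "product": "Power Transformers", "country": "China", "value": 340.0},
]

titles = author_genre(GENRE_TEMPLATE, records)
for title in titles:
    print(title)
    print()
```

Adding a record adds a title at essentially no extra authoring effort, which is the property the definition relies on.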
A genre of materials may include multiple materials having similar characteristics.

Recipient: A recipient is any individual, entity, computer, or the like, that is capable of receiving title, meta, marketing, and/or control materials authored by the present invention. For example, a recipient may include a distributor or an end-user of the title material.

User: A user includes any individual, entity, computer, or the like, that is using the system of the present invention to automatically author, distribute, and/or market title materials.

End-user: An end-user includes any individual, entity, computer, or the like, that is to be the ultimate consumer of the title material.

System of networked computers: any system of multiple computers that are directly or indirectly interconnected by any type of electronic connection, including connections via hardwire, Ethernet, token ring, modem, digital subscriber line, cable modem, wireless, radio, satellite, and combinations thereof. Such connections may be implemented using copper wire, fiber optics, radio waves, coherent light, or other media. The system of networked computers may be the Internet, an intranet, a secure virtual private network (VPN), or any other system of computers that are interconnected by electronic connections. As used herein, the term "network" refers to any such system of networked computers, including the Internet. Likewise, as used herein, the expression "providing a system of networked computers" means creating a network specifically for the purpose of facilitating the present invention or simply connecting to an existing network for the purpose of facilitating the present invention.

Computer: any general-purpose machine that processes data according to a set of instructions that is stored internally either temporarily or permanently, including, but not limited to, a general purpose computer, workstation, laptop computer, personal computer, set top box, web access device (such as WEB TV™
(Microsoft Corporation)), cable television, satellite television, broadband network, an electronic viewing or listening device, any other type of computer, wireless devices, such as a personal digital assistant (PDA), cellular or mobile telephones, electronic handheld units for the wireless receipt and/or transmission of data, such as a BlackBerry® (Research In Motion Limited), or the like.

Learning More

If you would like to organize a seminar for your company on this topic, please contact INSEAD’s Executive Education department.