Myth #1 - I am working on Romance novels

Automated Content Authorship
While I have been working on this project for about 16 years now, and publishing for
about 8 years, the news of the patent (September 2007) set off a chain reaction in the
press. Ambiguous or terse writing (probably due to space limitations) has led to a bit
of confusion. I will start with a “Myth” report:
Myth #1 - I am working on romance novels …
I have a method for doing so, but find that this area is not a top priority for
automation (given the genre is covered by so many authors already). Of all the
forms of fictional literature (beyond poetry), romance novels may be the most
formulaic (or at least some of the sub-genres) as many titles follow established
rules (by époque, level of explicit language, etc.) and therefore may be reverse
engineered and automated. There are, of course, books dedicated to
deciphering formulas within this genre (e.g. Complete Idiot's Guide to Writing
Erotic Romance). One journalist correctly quotes me (Shellie Karabell):
But he’s not looking to replace authors of fiction – even best selling
fiction. “I would never try to programme a computer that would write
the Harry Potter series,” he says. “The amount of time needed to write
a programme would be longer than the human writing the book.”
I have explained to journalists that such a project would be very time
consuming, and, in my opinion, would not produce titles of much use to
anyone (notwithstanding a potential glut of fiction titles, at least in the
English language). My current projects are reference and educational
materials (in numerous languages), web-based educational materials (my
online dictionary) and educational video animations. Romance novels
authored by computers are an R&D project for the future, and may never see
the light of day (at least from me).
Myth #2 – the book is created after someone orders.
In a Guardian article, the author writes “Nothing but the title need actually
exist until somebody orders a copy. At that point, a computer assembles the
book's content and prints up a single copy.” This statement may be
misleading. All titles are created, vetted, and supplied to
distribution well in advance of any order. No title exists on a distributor’s site
that does not exist in its entirety first (i.e. no book is written on demand);
financial gap analyses are updated if needed before being sent to the user, but
an original pre-exists. Only the printing in paper format is done upon request.
Some 95% or more of the titles are sold electronically (through channels not
used by the general public, such as OCLC’s NetLibrary, Ingram’s
MyiLibrary, EBSCO, MarketResearch.com, etc.). PDF versions of
each title are sent to distributors months before they appear on
sites. Amazon only recently requested that they carry my business titles as
they expand to business segments. The automated business titles have been
published since the year 2000 and are purchased by central governments, large
multinationals, banks and businesses with imports/exports.
Myth #3 – I consider myself to be the most published author.
Actually, I was first introduced as such by a dean at INSEAD to a visitor. Later
a spokesperson from Amazon.com was quoted in BusinessWeek:
"He may be the most prolific author in history," says Amazon's Kurt Beidler.
Later I was introduced as such in a public seminar by the moderator. To none
of the persons above, nor to my friends or colleagues have I positioned myself
in these terms (in fact, most of my colleagues at INSEAD, except our librarian
and others in my department, had no knowledge of or were surprised that I
was working on this topic). On a few occasions I have jokingly mentioned the
quotations. This claim, however, is a bit off the mark, given the fact that I am
working on reference and educational titles, many of which are in report form.
Amazon lists these titles in their “book” section, but they have nothing in
common with novels (considered by many to be the only legitimate form of
“authored” work). The book industry, of course, has many types of books. The
ones I work on are of very particular genres. One journalist (Asher Moses,
from Sydney), uses wording I would prefer:
But Parker doesn't describe himself as an "author", and he's far from the creative
type. Rather, the US-based professor of management science at INSEAD business
school has developed and patented algorithms enabling computers to write books for
him.
My patent uses the word “title material”. For this I am certainly not the most
prolific (e.g. compared to the U.S. Government).
Myth #4 – I do not announce that the books are generated by computer.
We need to distinguish across genres. One article is correct in saying that
nothing announces that the healthcare titles are computer generated. I do not
announce this since the text is not written by a computer. All of the text is
written by professionals. The computer does not scour the internet
automatically summarizing the results into a book. Some internet sources are
cited – given that this is an internet guide series. Only the formatting and
indexing, lists, front matter, etc. are done automatically (this copy editing
saves a substantial amount of time). The health series happens to be about
“how to use the internet” so the confusion is understandable. It might be
misleading to mention that computers were used, when applied to the
formatting (e.g. “the table of contents of this book is computer generated by
indexing it to Heading1 settings in a Word document …”). When a newspaper
uses software to format columns, they do not announce this to the reader.
Interestingly, the one series I have worked on where humans wrote 99% of the
text is the one where some readers gave some titles a low rating in Amazon, or
were willing to believe that a computer wrote the text (in contrast, some
patient associations recommend and/or send the guides to their members). Of
course, it is normal that some titles have high and some have low ratings when
one publishes hundreds of titles in a subject area. All the health titles and
medical texts were vetted by professionals (e.g. medical librarians, forums, etc.).
The series was created when medical libraries started “internet training
services” and were seeking guides that were disease specific. In that series I
am listed as an editor, not author. All contributed text is cited, and all
organizations were contacted to obtain permissions for quotations. All
passages were hand cleaned and edited by human editors. In the genres where
99+% is written by computers, there are few if any negative comments (i.e.
computer-authored titles score a higher average rating than human-written and
edited books in this simple comparison, though there is not enough evidence for an
academic test to draw any strong conclusions); the business, crossword, and
classics titles are thinly sold via Amazon - but through more traditional
business and/or direct to library channels, and are greatly appreciated.
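As a rough illustration of the kind of formatting automation described above (not the actual routines used for the health series), a table of contents can be derived mechanically from paragraphs tagged with a heading style. The in-memory document format, the function name build_toc, and the sample text below are all assumptions made for this sketch; only the "Heading1" style name comes from the example quoted above.

```python
from typing import List, Tuple

# Each paragraph is a (style, text) pair. "Heading1" mirrors the Word style
# name mentioned in the text; the tuple format itself is purely illustrative.
Paragraph = Tuple[str, str]


def build_toc(paragraphs: List[Paragraph], heading_style: str = "Heading1") -> List[str]:
    """Return numbered table-of-contents entries, one per heading paragraph."""
    toc: List[str] = []
    for style, text in paragraphs:
        if style == heading_style:
            # Number entries in the order the headings appear.
            toc.append(f"{len(toc) + 1}. {text.strip()}")
    return toc


# Hypothetical document: the body text is human-written, only the TOC is derived.
doc = [
    ("Heading1", "Introduction"),
    ("Normal", "Human-written text goes here."),
    ("Heading1", "Finding Resources Online"),
    ("Normal", "More human-written text."),
]

print("\n".join(build_toc(doc)))
```

The point of the sketch is that nothing in the body text is generated; the program only reads the structure a human editor has already marked up.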
In the genres where computer algorithms create original content or results
(which is what my patent covers), this is described in the methodology of each
book by using language like “an econometric model is applied, …” I do not
say that a computer was used to do the econometrics as this would appear silly
to the reader (rarely, if ever, do people calculate the sum of squared errors by
hand). The non-econometric or formulaic sections are written by me (by
hand). The reader expects a computer to be used for the formulaic aspects of
these titles (e.g. in the trade reports where 90% of the titles are tables and
charts).
Finally, in the crosswords and classics, I do not announce that computers
are used because I never really thought about it – does it really matter? The
computer is nested with a very specific editorial and/or linguistic logic that
makes the titles useful to non-English speakers (graph theory is used in
these genres – a potentially distracting topic). I find this issue to be an
interesting question. We watch movies, and yet no-one pre-announces to the
viewer that productivity enhancing tools like spell checkers, or Adobe
Premiere, Avid, or Maya are used.
Myth #5 – My authoring programs dynamically scour the Internet and compile the
results in a book (as auto-bloggers do).
Neither my YouTube video nor the New York Times article (or others)
mentions this explicitly (the Times article is suggestive). Bloggers extrapolate
this conclusion from words like “data bases” and/or “public information
sources.” Well before the Internet, people wrote crosswords, performed
economic analyses, and wrote poetry from “public information sources” (e.g.
time series, and word lists/dictionaries). My computers do so “off line” by
mimicking, for example, economists or poets. There is simply no need to use
the Internet. None of my applications dynamically grab things off the internet
based on a Google search and throw them into a book. For example, for the
econometric studies, INSEAD purchases large quantities of source data not
available to the public over the internet, which have existed since the 1980s and
were distributed via data tapes, or now, via DVD (I’ve mentioned to journalists
trade organizations, the IMF, etc., as economists do in practice). All
applications are database driven (some store links to the internet, if the subject
of the book is the Internet itself; these were not scraped from Google or other
search engines). I have also amassed over the last 25 years a very large
multilingual lexicon, among other resources, some of which I have posted for public use
(www.websters-online-dictionary.org – again, the pages, and editing, of which
are generated via automation programs). Here are reviews of my dictionary,
which was started in the early 1990s and launched off-line in 1999:
http://www.websters-online-dictionary.org/credits/editor.html
I think the Internet scraping approach, however, is fruitful in terms of
generating new knowledge and knowledge structures. In today’s age, people
assume that if data are involved, there can only be an Internet approach. My
automation project began well before browsers were invented, using data
available before the Internet age. The Internet, however, does allow genres
that could not have existed otherwise (e.g. guides to the Internet).
Myth #6 – the programs simply copy and paste pre-existing information
The programs can do this, and will do so for some genres for some limited sentences
or sections (such as boilerplate), but the vast majority (i.e. the 200,000 titles) do not
do so (the output is “calculated on the fly”) for the bulk of the pages; i.e. titles cannot
violate a copyright as the output is wholly original, and mimics my thought process as
an economist or linguist (i.e. I use a very fast pen). This results in titles whose
contents do not pre-exist in any databases, nor can be found on the Internet. I am
working on series that can combine pre-existing data with original content (as some
genres are designed to be this way). What are the “algorithms”? They are generally
mathematical approaches that I have found well suited to specific genres (or
sub-genres). Here is a summary of the methods used (please refer to Wikipedia.org for
definitions of unfamiliar terms), posted with my YouTube video:
The "algorithms" depend on the genre. The most advanced use parametric,
non-parametric as well as Bayesian econometrics, graph theory, and meta
analysis (mostly coupled with some specialized computational linguistics and
editorial rules that are required within certain genres) -- each piece is rather
straightforward; the combination allows complexity. In terms of IT or
programming languages, there is no rigidity to this - again it depends on the
genre. If animation is the goal, then code is written to write MEL scripts, etc.,
which can automate Maya, which can in turn automate rendering, lights, etc.,
via macros. This works well, but for only certain aspects of that genre.
Some titles are 98 to 100 percent computer automated (e.g. business titles,
crosswords, etc.). For health titles, only the format editing and production side
is automated. The text in the health books was written by medical
professionals and edited by a professional editor; the computer expedited
formatting using some 50-odd routines (the preface, chapter intros, glossaries,
indexes, headings, margins, etc.); highlights are made to sources generally not
known to internet-averse readers or medical practitioners (designed for
medical libraries with internet training services).
Currently, some 2 percent of the titles rely on government sources for text.
None perform a Google search, spider the net, etc. Some 98 percent of the
titles are wholly generated via automation programs; the applications create
original information or content that cannot be found elsewhere (e.g. maximum
likelihood trade estimates, latent demand forecasts via a decision calculus
approach, Chinese and English crosswords, etc.) - offline applications with no
interaction with the internet. In total, there are about 17 genres created this way
(about 200,000 titles or so since 2000).
It can take several years to set up an application (including all human inputs,
licensed sound effects, textures, models, mocap, data, or decision rules that go
into any genre-specific application). Platforms (e.g. Maya) pre-exist. The
incremental, or marginal creation time per title is mentioned in the video.
The genres are blind or peer reviewed and/or vetted by users (e.g. librarians or
end-users) before they are put into print. The games are played by kids to see
what they like. For 3D games, a pre-existing rendering engine is like a blank
word document. The rendering engine is not created from scratch, but licensed
(like MS Word).
I am mostly now working on education titles for Asian, African, and Native
American languages that do not have educational materials (games,
supplements, texts, videos, mobile phone books, etc.) written in or augmented
by their languages. See my dictionary at:
http://www.websters-online-dictionary.org
A very small percent of the linguistic material used is posted. Watch for a
major update and linguistic augmentation to the dictionary this summer when I
will also be introducing EVE. She is an "economically viable entity". A step
beyond a chat bot, using some of the algorithms mentioned above (with a bit
of utility theory and optimal control theory thrown in).
There is no "commercial" or "public" or "open source" software that can be
used by the general public. Some applications are terabytes large. I am
working on a relatively small poetry application for public use -- to be
released when completed (probably in a year), which will do several forms of
poetry, on any topic the user desires; and allow the user to request "another" if
they do not like the first one written, or "change that line", etc. The following
are samples of grammatical acrostics, practiced in elementary schools to
introduce children to poetry (title is an acronym for words in the poem):
NUDE
Naked unclad, dear enactment.
LOVE
Lean of vile emotions.
GOD
Gentlemen of divinity!
BOOK
Bible ordered, obtained Koran.
The application for this genre uses graph theory (clique commonality) and
over 40,000 grammatical structures, ranked by meta-analytic probabilities of
being understood by English readers. There are many other areas I am
working on, as there are multiple avenues to explore, especially in the areas of
new media (mobile and fixed), but more so in high-end analytics and
knowledge discovery (i.e. generating knowledge that could not be created
otherwise) as applied to business, language and public services (e.g.
criminology) - where unmanageable, sparse, disintegrated or larger data sets
(off-line) result in new knowledge structures usable by decision makers (e.g.
connecting the dots where humans have difficulty doing so, for lack of time or
expertise).
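To make the acrostic samples above more concrete, here is a minimal sketch of how a title can be expanded into a line whose words start with its letters. It is not the clique-commonality graph approach described above: it swaps in a much simpler template lookup. The toy lexicon, the two part-of-speech templates, and the function name acrostic are invented for illustration.

```python
from typing import Dict, List, Optional

# Toy lexicon: part of speech -> candidate words (invented for this example).
LEXICON: Dict[str, List[str]] = {
    "ADJ": ["lean", "vile", "bold", "open"],
    "NOUN": ["emotions", "oaks", "kites"],
    "PREP": ["of", "over"],
    "VERB": ["evoke", "offer", "visit"],
}

# Hypothetical grammatical structures: one part-of-speech tag per title letter.
TEMPLATES: List[List[str]] = [
    ["ADJ", "PREP", "ADJ", "NOUN"],
    ["NOUN", "VERB", "ADJ", "NOUN"],
]


def acrostic(title: str) -> Optional[str]:
    """Return the first line whose words begin with the title's letters, or None."""
    letters = title.lower()
    for template in TEMPLATES:
        if len(template) != len(letters):
            continue  # this structure cannot cover the title
        words: List[str] = []
        for letter, tag in zip(letters, template):
            matches = [w for w in LEXICON[tag] if w.startswith(letter)]
            if not matches:
                break  # no word fits this slot; try the next structure
            words.append(matches[0])
        else:
            # Every slot was filled without hitting a break.
            return " ".join(words).capitalize() + "."
    return None


print(acrostic("LOVE"))
```

A fuller system would rank many candidate structures and word choices rather than take the first match, which is where the ranking of grammatical structures mentioned above would come in.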
Myth #7 – it costs 12 cents to create a book
This figure (or similar numbers) reflects the marginal cost. The full or average
cost is much higher. The set up cost for an application can be hundreds of
thousands of dollars – costs that may not be recovered over the life of the
genre. This is true for both electronic and non-electronic versions.
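The difference between marginal and average cost can be worked through with a few lines of arithmetic. The 12-cent marginal cost is the figure quoted in the myth; the 300,000-dollar set-up cost and the title counts are hypothetical values chosen only to show the shape of the calculation.

```python
def average_cost_per_title(setup_cost: float, marginal_cost: float, titles: int) -> float:
    """Average cost = (fixed set-up cost + marginal cost * titles) / titles."""
    if titles <= 0:
        raise ValueError("titles must be positive")
    return (setup_cost + marginal_cost * titles) / titles


# Hypothetical inputs: a $300,000 set-up cost and a $0.12 marginal cost.
for n in (1_000, 10_000, 100_000):
    print(n, round(average_cost_per_title(300_000, 0.12, n), 2))
```

Under these assumed numbers the average cost per title stays well above the marginal cost until the set-up cost is spread over a very large number of titles.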
Concluding Remarks
The most interesting aspect of this project, to me, is what can be achieved by it. To
date, journalists have not covered this angle. I think what I have done thus far is
extremely modest, and many other applications can be developed, especially in genres
that involve highly repetitive writing methodologies, or that lack the economies to be
created otherwise (languages or topics that are obscure to most, but critical to others).
Here are a few comments from around the Internet by people who see this potential:

In a way, humanity can be defined by what it is that humans can do that
machines can’t do. That boundary is continually being pushed further, and in
coming years we will need to move to increasingly complex and imaginative
tasks of synthesis and creativity that computers cannot do. Philip Parker, a
professor at INSEAD, is probably doing more than anyone else to push this
boundary. … In many cases the market is too small to justify a person writing
the report. However there is no question that a significant part of an analyst’s
work can be automated. The boundaries of human value are being pushed
further, and this is just the beginning. Ross Dawson

As [his] video demonstrates, many of his works are economic or market
analyses and forecasts, but he also uses the technology to write about obscure
medical topics – both genres that he’s able to succeed in because they are
underserved by traditional authors. Scott D. Anthony

It's a fascinating subject and it calls into question many of our assumptions
about writing and research. This guy is part of a movement that is doing to
office workers what the industrial revolution did to blacksmiths. Daryl
To be fair, here is the other side of the coin:

Philip Parker has won today’s “Worst Person in Publishing” award. I wanted
to give him the “Worst Person in the World” title, but, well, I’m fairly certain
that’s been copyrighted. Hmm, maybe the New York Times will share the
honors, if only due to its continued lack of critical thinking when it comes to
covering books and publishing. … Likewise, I am not sure that the NYT,
as close an industry-town publication as possible, is capable of writing about
the publishing business with clear-eyed intelligence. Kassia Krozser

Fire the monkeys! Return them to their happy habitats! Our genre of choice
will be written by GLaDOS, and other AI computers, because there’s only “so
many body parts” about which to write a romance. SB Sarah

Actually Parker is providing a rather useful service for those who understand
the limits of his “books.” I just hope that the “Make Money Fast” crowd
doesn’t catch on too quickly to the possibilities here and come up with yet
another product category to push through e-mail and blog comment spams. As
for Amazon, I wouldn’t mind a filter to separate Parker-style books from the
purely human-done variety. Meanwhile perhaps Parker and his machine-aided
crew can go on to write a coping guide for victims of technology. David
Rothman

Mr. Parker is an “author” only in the loosest sense. Jane

Should authors be worried? Probably not, at least not yet. There's a wide gap
between what a computer can compile and the nuanced hand of a skilled artist.
Still, this news is a bit unsettling to those employed in the creative arts. And,
taking the music industry as an example, it doesn't seem well advised to
underestimate this sort of development. It's the kind of trend that could as
easily become a dead end as an overnight sensation. Either way, it's worth
consideration. Nathan Denny

He also says, "'My goal isn’t to have the computer write sentences, but to do
the repetitive tasks that are too costly to do otherwise.'" That has me really
baffled. Aren't romances composed of many, many sentences? Fortunately I,
having endured this sort of ignorant notion of romance novels for twenty
years, have learned to calm down and carry on relatively quickly. Margaret
Moore
His ignorance [about romance novels] is almost embarrassing. Kimber Chin
The London Times has pointed out one Philip M Parker who has created over
200,000 titles (albeit mostly statistic books from what I can see) using print
on demand technology. But the worst part is that, by his own admission,
automation produced a large part of his works. And he’s planning to move into
romance novels and poetry. That’s what freaks me out. No matter how
formulaic either genre can be, in the most juvenile hands, it is still something
human. The idea of automated poetry makes my skin crawl.
bookology.wordpress.com
… it’s now possible to foresee a literary future in which human intervention is
no longer required. Michael Moran
The best publishers are focusing on building large growing communities.
Content is becoming a commodity, as content without subscribers is worthless.
As failing mainstream publishers follow in Mr. Parker's footsteps, small
publishers stand no chance to compete unless they have an army of brand fans.
Aaron Wall
I guess the automated content may look good enough to look real, but the
talent is something more than that. I think such automated tools are a threat to
everyone who publishes mediocre content though. bobby_handzhiev
Won't the advent of programmes like this enable more small publishers to
produce content? I think this will drive the premium on quality original
content higher still. However, long term (maybe 20 years +) perhaps AI will
have reached the point where it can start drawing its own conclusions. Then
we really become redundant! And who will be leading the way with AI?
Perhaps the company collecting huge amounts of data of every aspect of our
lives? Google. BenCo
"if you are ever stuck for an absorbing read don’t forget "The 2007-2012
World Outlook For Bridges, Crowns, Dentures and Other Orthodontic
Appliances That Are Customised For Individual Application on a Prescription
Basis" Roland Dodds
… we hope someone sent from the future destroys these robot authors —
partly because we don’t want to be destroyed by the machines, and partly
because we are pretty well out of robot-war jokes. But we'll do what we have
to if more news comes along — because, while we may run out of punch
lines… [adopts growly, inspiring Bill Pullman voice]… we'll never run out of
hope. —Ben Mathis-Lilley
What is my take? I think that the most useful applications will be created for genres
that are so complex or labor intensive, that automation is almost the only viable
approach. That being said, writing hundreds of original high-quality Ph.D. theses will
be easier to accomplish using this approach than writing a single creative and
high-quality children’s story (given the lack of formulaic sub-genres that can be reverse
engineered). “Human creativity” in this sense is the absence of formulaic authorship
techniques that can be reverse engineered. Some Ph.D. theses, and forms of poetry for
that matter, are not that “creative”. Creative authors, journalists, editors, report
writers, manual writers, script writers, or bloggers, therefore, need not fear ever being
replaced by this process. The same is true for creative doctoral students,
moviemakers, television producers or PC game makers.
Then what does original mean? From a pragmatic point of view, if one title borrows
from another to a sufficiently large degree (especially without citation), it might be
considered un-original, if not plagiaristic. If the two titles have so little in common
that they do not seem to borrow from each other, one might say they are originals
(e.g. a romance novel – not all - can have a formulaic plot, but use different sentences
and paragraphs that do not overlap to any noticeable degree with an existing romance
novel with exactly the same plot). This form of originality (or lack thereof) is often
seen in television game shows. Each episode is original, but each episode uses the
same segment sequences. Original and very entertaining, but not that creative from
one episode to the next. In essence, viewers crave the formula and want to see it
repeated in original episodes. The genre in its entirety, of course, can be a very
creative result.
What is quality? It lies in the eye of the “segment” (in publishing industry jargon). A
trade study can be far more useful than a romance novel to someone wanting to
prioritize world markets for the products they are selling. The opposite is true for
someone who loves novels and is not involved in international trade. There are
segments to content markets. Can a computer, therefore, write prose that is higher
quality than Shakespeare? Of course; especially if the person comparing passages
side-by-side hates Shakespeare or does not understand Elizabethan English (probably
a large enough segment). Will a computer generate work reaching the stature of
Shakespeare in English courses – I doubt it (unless, of course, the formulas used by
the Master can one day be reverse engineered; or a great author of that league, as yet
unknown, is also a great programmer).
Will this make human authorship obsolete? For some forms, potentially "yes", for at
least the formulaic or mundane forms of human authorship, or for human authorship
of genres that are uneconomical otherwise. Which genres of authorship (in video, text
or other formats) are not formulaic enough to be automated? Time will tell.
I hope this clarifies & thanks for reading.
Phil
More Background
Overview
Some like calling it a “book writing machine” or “software” but in fact it is a
computer-based automation process for authoring, irrespective of the format (book,
video, PC games, etc.), language, or subject (fiction or non-fiction). For those
interested in the technical aspects of the process, please refer to the actual patent
which presents flow diagrams, etc., and to a YouTube video that tersely describes the
process and shows an example of an application and some output:
Patent:
http://www.google.com/patents?id=bHeBAAAAEBAJ&dq=philip+m+parker
YouTube: http://youtube.com/watch?v=SkS5PkHQphY
It is strongly recommended that interested persons read the full patent. On the patent
page, the reader will find detailed technical descriptions of the process and the prior
art. Professor Parker began working on this project in the early 1990s. The goal was
to create original titles (book, videos, games, etc.) on topics that would not be
economically viable if published using traditional methods, or covering topics that
might be of interest to a limited audience that would nevertheless find the titles useful
(what some call the “long tail”). The process does not require “Internet scraping”, and
most existing implementations of the process are Internet independent. The patent is
written as a “pioneer patent” as it applies to all forms of original title materials
(videos, books, PC games, etc.) created in this fashion.
Forms of Authorship
Much as authors publish various forms of fiction and non-fiction literature, it is
convenient to see the method or process as allowing various forms of authorship
automation (which can be used in combination).
Form 1: Involves compiling existing information, sorts, formats, and draws
basic conclusions (e.g. if there is no pre-existing content, then this fact alone
may lead to original logical conclusions drawn about the topic). This level is
useful for consolidating and structuring knowledge in a domain where much
of the text, video or sounds pre-exist. The programming for this approach
typically involves hundreds of details, especially with respect to formatting
and style. Typically in the form of a compilation, some of the output
components will be original, and can result in new knowledge.
Form 2: Involves replicating a formula within a genre. In this case, new
knowledge is not necessarily generated, though the reader or viewer may end
up acquiring new knowledge. In this case, the data (words) may be in the
public domain on a stand alone basis, but the output is as original as what a
human author (or director, screenwriter or actor in the case of a movie) might
create. The final result is typically wholly original.
Form 3: Involves the generation of new knowledge as the primary goal. This
involves, for example, the computer mimicking a specialist that is asked to
prepare a report, film or game that draws original conclusions, images or
levels of entertainment. For example, if one asks an economist for an opinion,
the economist will typically perform an analysis and make summary
statements based on his or her findings. The automation process, in this case,
literally follows the behaviours of the economist, and reports the findings -
findings that have never appeared before in any format, do not pre-exist in
any database, and are not currently available on the Internet. The computer, in this
case, is pre-invested with knowledge or expertise (e.g. economic models and
knowledge of economic geography). For this approach, the word “specialist”
is domain independent. We can rewrite the example from above to be:
“For example, when one asks a poet for a poem on a given subject, they will
typically ponder on the subject and write prose based on their inspirations.
The process, in this case, literally follows the behaviours of a poet, and
creates a poetry book – consisting of poems that have never appeared before
in any format.”
The distinction between poetry and econometrics is the formulaic natures of
the genres, but not the process to author them. Form 3 can create high-end
econometrics to the same degree that it can write poetry. It turns out that
the most useful applications of Form 3 are for genres that are so complex or
labour intensive, that automation is almost the only viable approach.
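The Form 3 idea of "perform an analysis, then make summary statements based on the findings" can be sketched in a few lines. This is only a toy illustration, not the patented process: the compound-growth calculation, the sentence template, the function name summarize_trend, and the sample figures are all assumptions.

```python
from typing import Sequence


def summarize_trend(product: str, years: Sequence[int], values: Sequence[float]) -> str:
    """Compute average annual growth from a series and report it as a sentence."""
    if len(years) != len(values) or len(values) < 2:
        raise ValueError("need at least two matching observations")
    if values[0] <= 0 or values[-1] <= 0 or years[-1] <= years[0]:
        raise ValueError("values must be positive and years increasing")
    span = years[-1] - years[0]
    # Step 1 (analysis): compound annual growth rate between first and last year.
    growth = (values[-1] / values[0]) ** (1 / span) - 1
    # Step 2 (statement): choose wording based on the computed finding.
    if abs(growth) < 0.0005:
        trend = "was roughly flat"
    else:
        verb = "grew" if growth > 0 else "shrank"
        trend = f"{verb} by an average of {abs(growth) * 100:.1f}% per year"
    return f"Demand for {product} {trend} between {years[0]} and {years[-1]}."


# Hypothetical data, invented for the example.
print(summarize_trend("widgets", [2000, 2002], [100.0, 121.0]))
```

The sentence is not stored anywhere in advance; it is assembled from a number that is calculated at run time, which is the sense in which the output is original rather than copied.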
History
Research on this approach began in the 1980s and early 1990s. The
first titles authored via full automation relied on Form 3 (described above) – having
the goal of generating new knowledge that would be difficult to produce
otherwise. These came in the form of e-books distributed via high-end distributors
dedicated to this market (Dialog, MarketResearch.com, etc.) and then
print-on-demand titles (Ingram’s LSI and Amazon’s Booksurge). The “Trade Perspective”
series was created due to the inconsistencies of import data from importers, and
export data from exporters. The model comes up with maximum-likelihood estimates
of real trade flows (adjusting for currency fluctuations) – a rather boring process but
of interest to people involved in international trade. This series is mostly used by
government agencies and businesses. Similar series using Form 3 are “World Outlook
Reports” that produce Bayesian econometric estimates for the worldwide latent
demand for various products and services, and the “financial and labour benchmark”
series which mimic the process used by accounting firms and/or investment banks to
compare real differences in economic performance across firms and/or economies
with differing accounting rules. Each of these series carries a very large upfront
cost (many man-years of programming in most cases), but
once this is accomplished, the incremental cost per title is very low (the costs
mentioned by journalists are the incremental cost of about 50 cents, not the total or
average cost per title, which is much higher when considering start-up costs). Samples
of these books can be found at http://www.icongrouponline.com/browse/.
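The "Trade Perspective" idea of reconciling inconsistent importer and exporter figures can be illustrated with the simplest possible maximum-likelihood setup. This is a textbook sketch under stated assumptions, not the model used in the series: it assumes each report equals the true flow plus independent Gaussian noise with a known standard deviation, in which case the maximum-likelihood estimate is the inverse-variance weighted average of the two reports. The function name, parameters, and numbers are hypothetical, and currency adjustment is left out.

```python
def ml_trade_estimate(import_report: float, export_report: float,
                      import_sd: float, export_sd: float) -> float:
    """ML estimate of a true trade flow from two noisy reports.

    Assumes independent Gaussian errors with known standard deviations,
    which makes the estimate an inverse-variance weighted mean.
    """
    if import_sd <= 0 or export_sd <= 0:
        raise ValueError("standard deviations must be positive")
    w_imp = 1.0 / import_sd ** 2  # more precise reports get more weight
    w_exp = 1.0 / export_sd ** 2
    return (w_imp * import_report + w_exp * export_report) / (w_imp + w_exp)


# Hypothetical reports of the same flow (in millions of dollars).
print(ml_trade_estimate(100.0, 120.0, import_sd=5.0, export_sd=10.0))
```

When the importer's figure is assumed to be the more reliable of the two, the estimate lands closer to it; with equal reliabilities it is the plain average.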
Later, series using a combination of Form 1 and Form 2 were created in the form of
patient and physician sourcebooks. Around 2001, medical libraries launched efforts
on “internet training” for their patrons (e.g. how to use the internet to research
diseases). This series was created for this market and is mostly distributed via
OCLC’s NetLibrary service in e-book format. Form 1 was also used to create a series
of bi-lingual classic titles which provide a running thesaurus in the language of the
reader.
More recently, multilingual crossword puzzle books and thesauri were created using
Form 2. Some of the thesauri rely on a graph theoretic approach (combined with
traditional computational linguistics) to derive what is probably the world’s largest
multilingual thesaurus.
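To give a flavour of what a graph-theoretic approach to a multilingual thesaurus can look like, here is a minimal sketch. It assumes (this is an illustration, not the method behind the published thesauri) that words are nodes, known translations are edges, and two words of the same language are synonym candidates when they share a translation. The word pairs and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(translation_pairs):
    """Build an undirected graph whose nodes are (language, word) tuples."""
    graph = defaultdict(set)
    for a, b in translation_pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def synonym_candidates(graph, node):
    """Return same-language words that share at least one translation with node."""
    language = node[0]
    found = set()
    for neighbour in graph.get(node, ()):
        for second in graph.get(neighbour, ()):
            if second != node and second[0] == language:
                found.add(second)
    return sorted(found)


# Hypothetical translation pairs used only for this example.
pairs = [
    (("en", "big"), ("fr", "grand")),
    (("en", "large"), ("fr", "grand")),
    (("en", "tall"), ("fr", "grand")),
    (("en", "big"), ("es", "grande")),
    (("en", "small"), ("fr", "petit")),
]
graph = build_graph(pairs)
candidates = synonym_candidates(graph, ("en", "big"))
print(candidates)
```

A real system would also need to weigh or filter such candidates (here "tall" surfaces next to "large"), which is where traditional computational linguistics would come in.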
A small percentage of the databases required for some of these later genres is posted on
Webster’s Online Dictionary (www.websters-online-dictionary.org), which was started
in 1999 as a testing ground for the general approach (i.e. the automatic authoring of
original content on a web site):
Some Background & Reviews:
http://www.websters-online-dictionary.org/credits/editor.html
Another Review:
http://hurricanecountry.blogspot.com/2006/12/dictionary-heaven.html
The Objective:
http://www.websters-online-dictionary.org/about.us/about.html
The Site:
www.websters-online-dictionary.org
As only 10% of the data available are posted, future editions will be substantially
larger and allow for high levels of interactivity.
Recent History
In terms of R&D, substantial time and effort is currently being invested to create (1) a
series of interactive web sites that can automatically author titles, (2) educational
game shows and (3) language learning programs. With respect to video, instead of
automating “Word” to author a book, the same process is being used to automate
Maya and video editing software (software for 3d animation/video used in movies like
King Kong, The Matrix, and Shrek). The goal is to create video programming that can teach
any concept in any local language. It turns out that for most of the world’s
languages (e.g. Estonian, Maltese, etc.), the cost of video programming using
traditional methods is prohibitive, so local stations end up dubbing foreign-based
programs (or programs receiving government subsidies). This project started in 2004
with 3d games and software (a bit easier to begin with than video) which has resulted
in hundreds of titles distributed by Digital River, Handango and Microsoft (for Pocket
PC versions) among others. The following is a YouTube link to cut scenes from a
game show designed for language learning – a formulaic form of television (that is
being coded for automation):
http://www.youtube.com/watch?v=Fug4UGbsIxY
The following is a video “word of the day”, that will be used across many languages:
http://www.youtube.com/watch?v=slNTZ4vEqGQ
Here is a cut scene from the 3D game:
http://www.youtube.com/watch?v=2QBC5zlXdDw
FAQ
This FAQ covers other common questions. For each question, the generic answer is
typically “it depends on the genre” and “it depends on the format (book, video,
software, PC game, etc.).”
Q: Can I have a copy of the software?
A: No. The process is not a software package, but a complete system that requires that
a computer or computer network be set up for this purpose – for a particular genre.
Most genres are too large to be easily transferable via the internet. One video
application is many terabytes, and other applications are many gigabytes.
Q: How long does it take to set up a genre?
A: This completely depends on the complexity of the genre and the quality one is
willing to accept for the titles. The earliest genres took several man-years to create
before they met industry standards (i.e. to the quality of a human author). The later
genres took a matter of months (e.g. crossword puzzle books). Sometimes the longest
part is acquiring and coding domain knowledge (e.g. knowing how a Ph.D. thinks in a
particular domain before they author a genre). Already published genres rely on
advanced graph theory and econometrics; others rely on traditional content analysis.
Q: How much does it cost to produce a book or other title?
A: It depends on how you define cost. The marginal cost of creating a title in
electronic format is the price of the electricity used to create the title, and some small
amount of hardware depreciation (maybe around 50 US cents). The average cost,
which includes the printing of the book (in paperback), or a game in DVD or CD
format (printed on demand), and the overhead to distribute the book can range from
around $10 to around $30. The total cost for an entire genre of books, videos, or
software games can exceed hundreds of thousands of dollars or more in programming
time, database acquisition or licensing, and other overheads. Once these sunk
costs have been incurred, the marginal costs are minimal. For video or high-end gaming, the
costs can be very high; with the budget to create a single traditional 3D animated
movie, however, one can use this approach to create thousands of titles within a given
video genre.
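The difference between marginal and average cost can be made concrete with a small calculation. The sketch below amortises a one-time genre set-up cost over the number of titles produced and adds the marginal cost of about 50 US cents mentioned above. The set-up figure of $300,000 is a hypothetical value chosen for illustration, and printing and distribution costs are deliberately left out.

```python
def average_cost_per_title(sunk_cost, marginal_cost, titles):
    """Average cost per electronic title once set-up costs are spread out.

    sunk_cost: one-time cost of building the genre (programming, data).
    marginal_cost: cost of generating one more title (electricity, depreciation).
    titles: number of titles produced within the genre.
    """
    if titles <= 0:
        raise ValueError("titles must be a positive number")
    return sunk_cost / titles + marginal_cost


SUNK_COST = 300_000.0   # hypothetical set-up cost in US dollars
MARGINAL_COST = 0.50    # approximate marginal cost quoted in the text

for titles in (100, 10_000, 250_000):
    cost = average_cost_per_title(SUNK_COST, MARGINAL_COST, titles)
    print(f"{titles:>7} titles: ${cost:,.2f} per title")
```

With only a hundred titles the set-up cost dominates; across a very large genre the average approaches the marginal cost, which is why the approach only pays off at volume.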
Q: Is this really that complicated?
A: It depends on the genre and format. During early genres it was found that rather
complicated issues were simple to implement (e.g. Bayesian econometrics), and
logically simple things were nearly impossible to implement (e.g. getting Windows to
behave well when indenting certain graphics, or rendering in DirectX). In general,
Joseph Weizenbaum says it all:
'It is said that to explain is to explain away. This maxim is nowhere so well fulfilled
as in the area of computer programming, especially in what is called heuristic
programming and artificial intelligence. For in those realms machines are made to
behave in wondrous ways, often sufficient to dazzle even the most experienced
observer. But once a particular program is unmasked, once its inner workings are
explained in language sufficiently plain to induce understanding, its magic crumbles
away; it stands revealed as a mere collection of procedures, each quite
comprehensible. The observer says to himself, "I could have written that." With that
thought he moves the program in question from the shelf marked "intelligent" to that
reserved for curios, fit to be discussed only with people less enlightened than he.'
CASE STUDIES
The following case studies illustrate a few examples of how the technology has been
used to create large quantities of original title materials. These are presented for
illustrative purposes only, and reflect a small part of potential applications.
Reference, Research & Educational Books
Output: Over 250,000 original titles, available in various paperback and ebook formats
(www.icongrouponline.com).
Distributors: Barnes & Noble®; amazon.com; Lightning Source (Ingram Book Group);
NetLibrary [OCLC - eContent]; Ingram Digital and MyiLibrary; ebooks.com; google.com,
among others.
Beyond the tasks accomplished by acquisition editors and publishers, books are traditionally
written by human authors, edited by humans, and formatted by human production editors.
These are in turn marketed by humans. Using the most advanced approaches to electronic
publishing, this approach reduced the time to create and publish reference and educational
books. The approach is of interest to the publishing industry which is becoming more
fragmented and specialized as print-on-demand and ebook technologies are showing
substantial growth. Coupled with electronic distribution via libraries, publishers and media
companies can now access what may have previously been seen as saturated markets.
Examples of genres produced for ICON Group include:
Patient Sourcebooks (500 titles by disease or condition)
Physician Dictionaries (2100 titles by disease or condition)
Genome Sourcebooks (190 titles by disease or condition)
Bilingual Crossword puzzles (1200 titles, 100 pages each)
Classics – enhanced via computer authoring for test preparation (150 titles)
Classics – enhanced for non-English mother tongue speakers (1000s of titles)
Scientific Discovery, Research, Custom Publishing and Proposal Writing
Output: Over 150,000 Industry and Business Intelligence Reports.
Distributors: marketresearch.com; www.bharatbook.com; manta.com
(ECNext); MindBranch, and EBSCO, among others.
In terms of discovery, intelligence analysts, researchers, scientists, security specialists, or
anyone who must "connect the dots" may not have the time or capacity to exploit their skills
to a maximum potential. The databases and/or sources of information used to generate and
quickly communicate new knowledge may be so vast or complex that traditional approaches
simply fail to exploit the potential. Similarly, in business, a substantial amount of valuable
management time can be wasted writing proposals, or proposals are never written resulting in
opportunity losses.
This approach has been used to create, for example, approximately 14,000 international trade
studies that draw original conclusions with respect to the world's trade flows across numerous
product categories. The meta data and related information required for distribution for each
title were also authored via automation. Examples of these titles can be seen here, for
example, at marketresearch.com, one of the largest distributors of high-end market
intelligence. Had this genre been approached using traditional methods, the economics of each
title would make the cost of producing these prohibitive. This approach can also be used to
localize educational content for specific markets, down to an individual instructor or student.
Networked Multiplayer Games/Simulations
Output: A virtually infinite number of business simulations for INSEAD (Singapore and
Fontainebleau, France), INTERCOMP Simulation (www.insead.edu).
MBA programs and executive education programs around the world have, for years, relied on
business simulations to teach strategy, operations, and marketing. These simulations or games,
are played by teams or individuals who compete against each other while learning and
applying business frameworks.
Traditionally business simulations have been industry (e.g. consumer electronics), geography
(e.g. a fictitious world) and/or language specific (e.g. English). INTERCOMP is not a
simulation, but rather a simulation "writer." It was created using an approach that allows a
virtually infinite number of simulations on any known industry (e.g. from toothpaste to
industrial power transformers), any realistic geography (within a specific country, like China
and its various cities, or across a selection of countries and cities relying on real economic
data), and language (English, French, Chinese, Arabic, or any of 200 or more other
languages). The simulations can be further tailored to specific business topics or emphasis
(e.g. HR, finance, production, marketing, strategy, etc.). An example of one such simulation is
dedicated to the mobile communications handset industry that pits Apple, Nokia, HP, Dell,
Motorola, HPC, Samsung, LG, and Sony-Ericsson against each other in a global battle to
conquer the world market across 57 countries. The setting is five years into the future when a
new generation of mobile communications standards has been adopted by operators and
manufacturers. This simulation has been used in an award-winning MBA elective and
executive education course; a version dedicated to telecommunications is available for
download at:
http://webfac.insead.edu/intercomp/downloads/program_latest_version.html
The advantage of this process is that simulations and/or multiplayer games can be created at
minimal cost for a specific group, or “clique” of executives or individuals in a specific
industry, simulating real competition faced in that industry. Because the simulation can be
calibrated using real data, the output is not a simulation, but a strategic planning tool that can
be used to foresee competitive activities or simulate game theoretic outcomes. After setup, no
clique is too small for a fully customized simulation or game, given that the marginal cost of
producing a game for the clique is virtually zero.
PC Software and Video Games
Output: 400 Educational Game Titles and over 1200 Reference Software Titles.
Distributor(s): www.digitalriver.com
There are role-playing games, adventure games, first person shooters, strategy games, sports
games, educational games and a variety of others. Each of these follows a generally accepted
set of rules which users have come to expect. Each title can be in 2D or 3D formats designed
for a variety of platforms (PC, console, mobile devices); each format is further bounded by
formulaic requirements. Traditionally, dedicated teams create a single title within a genre,
each with a substantial cost.
I approach game development by automating "game writing" programs which author original
titles, surrounding the entire genre selected. A recent example of this was a series of some
2000 third-person shooter PC games that allow children, ages 4 to 6, to learn basic English as
a second language (or other topics). A tomato, called "Webster" defeats an enemy called
IGNORANCE, who has armies of evil avatars (e.g. from dinosaurs to space ships). Within
each topic covered by this sub-genre, there are 4 separate game titles featuring differing
graphics, sound effects, challenges/puzzles and enemies. A video cut scene illustrating this
game series can be seen here. Some game play can be seen here (towards minute 8). Each
game title takes approximately 5 to 10 minutes to create, irrespective of the topic. Here is a
low-resolution screen capture of an extended video of a game created this way. 2D
multilingual games are listed here.
Mobile Phone Applications (Pocket PC & Smartphone)
Output: Thousands of Pocket PC dictionaries and games for Handango
(www.handango.com), Microsoft.com, and others.
Recent research indicates that people in many low-income countries often first experience the
Internet via a mobile communications device. In high-income countries, Smartphones, Pocket
PC's (PDAs), multimedia phones and video players are gaining greater acceptance as users
upgrade from traditional devices, and operators push higher-end handsets which increase
network traffic. Greater on-board memory, and higher download speeds are also creating
greater demands for mobile content tailored to a large number of localities with differing
content needs.
Traditionally, mobile content publishers create a game or application, and once successful
localize these titles for large markets or create sequels to the one market where the title was
successful. The technology allows original titles to cover the entire spectrum of
topics/geographies within their respective genres, with each title authored in a matter of
minutes. Automation also allows for cross-platform authoring, given the variety of operating
systems (RIM, Symbian, Microsoft Mobile, etc.) and devices (iPod/iPhone, Nokia,
Motorola, Sony-Ericsson, Samsung, HTC, Blackberry/RIM, LG, etc).
One application of the technology in this area is the creation of mobile phone software
generation programs for educational games and reference software applications. Some 400
casino games, 200 bi-lingual dictionaries, and thousands of professional reference
applications have been authored and are currently selling via various distribution channels (for
PocketPC and Smartphones).
Web Site Creation
Output: World’s largest multilingual dictionary: Webster's Online Dictionary
(www.websters-online-dictionary.org).
Listed, for the year 1999, as an important “invention” of the 20th century by The Great Idea
Finder, Webster’s Online Dictionary – The Rosetta Edition is an open access dictionary that
spans over 400 languages. The dictionary is now the world’s largest and is a mix of compiled
and original content generation using the technology. Despite the dictionary being so large
(with over 20,000,000 entries, and growing), it is maintained by no editorial, marketing or
other staff. Well over 40% of the content, statistics, and entries were authored by computer, in
the same manner that a lexicographer or linguist would. The dictionary is constantly being
improved and is a laboratory for innovation. Currently the dictionary receives some 1,000,000
page views a month, and is ranked higher, in terms of traffic, than the Oxford English
Dictionary. Over 1,000 sites link to the dictionary or its pages. Some 85 percent of the site’s
traffic comes from outside of the United States, and is, for many languages, the primary site
for language learning and reference. The dictionary is in its “first draft” form. Reviews and
historical discussions of the current edition can be found here. Similarly, the approach can be
adapted to create a high volume of content-oriented sites that span languages or topics, for use
over traditional or mobile networks, that themselves become authors of original content, with
or without end-user interaction.
Video (All Formats & Media)
Output: Various high volume programs.
The cost of professional video production involves a large quantity of human inputs from
producers, scriptwriters, actors, and directors, to set designers, photographers, camera crews,
special effects specialists, and pre- and post-production editors. Human and material costs
have often prevented the creation of niche programming or films on narrow topics, or for
languages or cultures that might not have a large enough audience to profitably justify an
investment. This has led to content shortages for many countries, languages, interest groups
or cliques (micro-segments). The substantial costs of production have also led to a number
of media companies relying on user-generated or contributed content of variable quality
and/or that will fail to meet the needs of these unserved niches (e.g. there are not enough video
producers interested in, say, Tarahumara to justify creating enough content to support a
channel for that audience).
Automated video authoring is similar in nature to that of books or software, though the
formats have higher dimensionality and the "intelligences" modeled are different. The goal is
to drive the cost of high-quality video production to a minimal marginal cost (e.g. the cost of
rendering alone).
The technology is now being used for video production for a variety of the more formulaic
genres (news, game shows, education, mobile phone snacks, classic story telling, comedy,
etc.). Examples of test renders for mobile telephone snacks and television segments can be
found here on YouTube:
Mobile/Traditional Snacks
Word of the Day “Snack” – Macroglossia (thousands of these across languages are
in production).
Word of the Day “Snack” – Hindsight
Word of the Day “Snack” – Euphonious
Word of the Day “Snack” – Laconic
Word of the Day “Snack” – Excretion
Gameshow
A Multilingual Gameshow (cut scenes only, created for all written languages, for
people wanting to learn English).
Advertising/Promotion
A Video Promotion Clip for a Hangman Game (also authored via computer).
Segues
A Classic Movie Review before Its Airing
A DVD Introduction Segment
The Future
As the above cases illustrate, the application of the technology is format and context
independent. Only a small percent of ideas are represented here. Future applications, in the
works, include fully interactive, real-time authoring systems and other activities that fully
integrate human activities, allowing third parties as well as end-users to have their own
systems create original title materials.
Glossary of Important Terms and Concepts
The following glossary can prove useful to our partners in approaching automated content creation. We
have sorted these definitions in a logical order of “conception” to “delivery”:
Method and apparatus for automated authoring and marketing: an approach for automatic
authoring, marketing, and/or distributing of title material. A computer automatically authors material.
The material is automatically formatted into a desired format, resulting in a title material. The title
material may also be automatically distributed to a recipient. Meta material, marketing material, and
control material are automatically authored and if desired, distributed to a recipient. Further, the title
material may be authored on demand, such that it may be in any desired language and with the latest
version and content.
Original work of authorship: Works of authorship include title materials, such as literary works;
musical works, including the lyrics; dramatic works, including any accompanying music; pictorial or
graphical works; motion pictures and other audiovisual works; sound recordings; and any
compilations and/or derivative works or the work of authorship; and other materials.
Materials: any information and data capable of being used in a title material, for example text, audio,
video, descriptive, tabular, artistic, and/or graphical information.
Title material: publishable and/or authored work, such as literary works, serial publications, theatrical
plays, books, including fiction and nonfiction works (for example, but not limited to, reference
books, market research reports, travel guides, company competitive analyses, industry reports,
company reports, management consulting reports, technical documents, and the like), newsletters,
magazines, computer instructions, software, software publications, Internet publications,
computer-based content, Internet web sites, musical scores, screen plays, video productions, holographic or 3-d
works, virtual reality works, and the like. Alternatively, title material includes any work that is
capable of being associated with a unique identification alpha-numeric code, for example a unique
alpha-numeric identifier that is used to identify the work or a catalog number. Title material also
includes any work that is capable of being associated with unique alpha-numeric codes, such as an
ISBN (International Standard Book Number), ISSN (International Standard Serial Number), a UPC
(Uniform Product Code), a library number (such as the Library of Congress identifier), a bar code, an
item number, an SKU (Stock Keeping Unit), a number code, a case law number, a docket number, an
abstract number, a year of publication, a chapter code, and the like. Title material can also include
any authored or published work that is to be commercially available. Title materials can include any
work with an alpha-numeric numbering system that is observable or intended to be observable within
the public domain.
Marketing material: includes information used to market, disseminate knowledge of, or promote title
material. Marketing materials publicize or announce title materials to various audiences, including
remote servers that post electronic announcements. Marketing material includes public relations
works, press releases, product announcements, brochures, flyers, billboards or outdoor copy, video,
audio, magazine or print media copy, emails, banners, displays or similar materials, etc.
Meta material: includes materials used to describe title material. Meta materials may be used in the
publishing and media industries to catalogue and/or promote title material. Meta materials describe
title material to publishers, resellers, distributors, industry associations, industry organizations,
government organizations, or end-users such as libraries or individuals. Further, meta materials may
include text, graphics, numerical data, coverings (such as a book jacket, a CD jacket, videotape
jacket, or the like) or other information that is used to describe the title material. Additionally, meta
material may include, but is not limited to, information regarding the price of the title material, the
length in pages or time of the title material, the language of the title material, the physical or
electronic format of the title material, the binding or packaging of the title material, an abstract of the
title material's content, an alpha-numeric identification number of the title material, subject codes or
text of the title material, comments from the author of the title material, comments from the publisher
of the title material, credits related to the title material, endorsements of the title material, reviews of
the title material, a table of contents of the title material, date of publication of the title material,
place of publication of the title material, name of the publisher or producer of the title material,
address of the publisher or producer of the title material, or the like. Further, meta material includes
meta files and/or metadata.
Control materials: include any information used to control, track, index or account for title material.
Control material includes items in meta, title or marketing materials, but may also include information
used for inventory control, billing, financial accounting, stock keeping, information relating to the
target audience, and cataloguing information used for internal control.
Database files: include modules, queries, macros, reports, tables, templates, graphics, automation
programs, audio and video files, data files, material files, information in a database, document files,
and the like.
Genre: A genre is a group or series of title materials having common characteristics or using similar
procedures to be authored. Genres include, for example, a series of market research reports having
similar formats, logical statements, calculations, graphics, or patterns with different content for each
title material within the genre. A genre of materials may include multiple materials having similar
characteristics.
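The notion of a genre as a shared format, logic and calculation applied to different content can be sketched in a few lines. The example below is a toy illustration only, not the system described in this document: it fills one hypothetical report template from two made-up data rows and derives simple meta material for each title. All names, templates and figures are assumptions made for the example.

```python
from string import Template

# One hypothetical "genre": a fixed title pattern, a fixed logical statement,
# and one calculation (share of the world total) applied to every row.
TITLE = Template("$product Exports from $country: A Sample Report")
BODY = Template(
    "$country exported $product worth $$${value}M, "
    "or $share% of the $$${world}M world total."
)


def author_title(row):
    """Author one title (and its meta material) from a row of data."""
    share = round(100 * row["value"] / row["world"], 1)
    title = TITLE.substitute(product=row["product"], country=row["country"])
    body = BODY.substitute(
        product=row["product"],
        country=row["country"],
        value=row["value"],
        world=row["world"],
        share=share,
    )
    meta = {"title": title, "language": "en", "words": len(body.split())}
    return {"title": title, "body": body, "meta": meta}


# Made-up data rows: the content differs, the format and logic do not.
rows = [
    {"product": "Toothpaste", "country": "Estonia", "value": 12, "world": 480},
    {"product": "Power Transformers", "country": "Malta", "value": 3, "world": 600},
]
titles = [author_title(row) for row in rows]
for item in titles:
    print(item["title"])
    print(item["body"])
```

Adding a title to this toy genre means adding a row of data, not writing a new program, which mirrors the point that the effort lies in setting up the genre rather than in each title.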
Recipient: A recipient is any individual, entity, computer, or the like, that is capable of receiving title,
meta, marketing, and/or control materials authored by the present invention. For example, a recipient
may include a distributor or an end-user of the title material.
User: A user includes any individual, entity, computer, or the like, that is using the system of the
present invention to automatically author, distribute, and/or market title materials.
End-user: An end-user includes any individual, entity, computer, or the like, that is to be the ultimate
consumer of the title material.
System of networked computers: any system of multiple computers that are directly or indirectly
interconnected by any types of electronic connections, including connections via hardwire, Ethernet,
token ring, modem, digital subscriber line, cable modem, wireless, radio, satellite, and combinations
thereof. Such connections may be implemented using copper wire, fiber optics, radio waves,
coherent light, or other media. The system of networked computers may be the Internet, an intranet, a
secure virtual private network (VPN), or any other system of computers that are interconnected by
electronic connections. As used herein, the term "network" refers to any such system of networked
computers, including the Internet. Likewise, as used herein, the expression "providing a system of
networked computers" means creating a network specifically for the purpose of facilitating the
present invention or simply connecting to an existing network for the purpose of facilitating the
present invention.
Computer: any general-purpose machine that processes data according to a set of instructions that is
stored internally either temporarily or permanently, including, but not limited to, a general purpose
computer, workstation, laptop computer, personal computer, set top box, web access device (such as
WEB TV™ (Microsoft Corporation)), cable television, satellite television, broadband network, an
electronic viewing or listening device, any other type of computer, wireless devices, such as a
personal digital assistant (PDA), cellular or mobile telephones, electronic handheld units for the
wireless receipt and/or transmission of data, such as a BlackBerry® (Research In Motion Limited),
or the like.
Learning More
If you would like to organize a seminar for your company on this topic, please contact
INSEAD’s Executive Education department.