Journal of Advertising Research. Volume 25, Number 5, October/November, 1985

Using Test Scores to Guide the Content Analysis of TV Materials

Langbourne W. Rust

There is a class of questions of great importance to market and advertising research which has been remarkably resistant to solution. These are questions which require the discovery of the critical attributes of complex stimuli. What is there about different commercials, programs, and products that makes people respond to them as they do? Some recent methodological advances, originating in a qualitative wing of academic sociology, may give new life to our efforts to answer such questions. The new approach is often referred to as “the development of grounded theory,” after a seminal book of that name by Glaser and Strauss (1967). This paper reports on an analysis that sought to define the critical attributes of television pilot shows. It tested the generality of its findings by using them to predict audience response to new material.

I wish to express my deep appreciation to Arnold Becker, vice president of research, CBS, for sponsoring the original research and granting permission to publish it here. I also wish to thank Francis Ianni, Howard Becker, Arthur Kover, and Andrzej Osiatynski, who read early versions of the paper and gave me most valuable encouragement and advice.

The Positivist Tradition

Most American behavioral scientists have been trained to follow theory-testing strategies. Observations of the world (data) are used to test deductions from theories and thereby to test the theories themselves. The coding schemes used to describe behavior are established before the data are gathered. Research designs impose stringent limitations (controls) on the parameters of variation, reflecting the reasoning that the more you limit the phenomena being observed, the more confident you can be in drawing inferences about theory.
There have been debates about this hypothetico-deductive model of science for years, many of them echoing the philosophical differences between logical positivists and phenomenologists. In the United States, especially in the more quantitative schools of psychology and economics, the positivists have had the upper hand. American advertising researchers are likely to approach a question like, “What makes some commercials get high recall scores and others get low?” with some sort of experimental or quasi-experimental design (Campbell, 1963). They might set up a controlled experiment to compare people’s recall of specially constructed commercials (intended to vary on only one dimension). Or they might do a survey: correlating a grid of content ratings with recall scores, checking for those which are statistically significant. By either design, hypotheses are framed, variables defined, scales selected, and controls applied before the data are collected. The results are numbers reflecting the degree of association between the variables. Regarded as a system, this type of research takes qualitative inputs (the definitions) and uses data to produce quantitative outputs.

The Problem of Input: “Garbage in …”

There is at least one Achilles heel to this otherwise robust-looking strategy: its qualitative inputs. The pictures of the world produced by theory-validation research are limited by the theory the researcher started with. Stated in more general form: the relationships found between variables depend on how we define the variables in the first place. This is a critical consideration in the behavioral sciences where the subjects’ conceptualizations of their environments may be very different from the researchers’. Many grown-ups, for example, have been concerned about “violent” content in children’s TV. Many studies have attempted to establish whether or not “violent” programming leads to antisocial behavior in children.
Children may not perceive all the same things as violent as adults do, but I am aware neither of any comprehensive efforts to relate children’s behavior to programming which they see as violent nor of any attempts to develop grounded definitions of whatever it is in programs that might lead to antisocial behavior (without presuming it to be violence).

An Alternative Research Tradition: the Chicago School

Academic sociology has been dominated by the theory-validation approach during much of this century. But minority positions, reaching back to the prewar Chicago School (W. I. Thomas, R. Park et al.), have been arguing for a focus on how people describe and experience their own worlds. The voice of these qualitative researchers has grown stronger, and their influence has spread into other areas, including anthropology and educational research (Bogdan and Biklen, 1982; Patton, 1980).

The key to the new approach is the way it uses data. While traditional techniques use data to test preformed theories, the grounded approach uses data as a “ground” from which to induce new theories and concepts. A number of earlier writers considered ways to conduct “inductive analysis” (Znaniecki, 1934; Robinson, 1950; Turner, 1953). They laid the groundwork for the synthesis achieved by Glaser and Strauss (1967).

While most of the grounded theory literature has focused on the traditional concerns of sociology, education, and anthropology, some of this author’s own work has explored matters of more direct concern to marketing and advertising: the attributes of toys affecting children’s spontaneous choices (Rust, 1971a); attributes of Sesame Street and The Electric Company affecting children’s attention levels (Rust, 1971b,c); and the study reported here, on attributes of CBS sitcom pilot shows affecting Lazarsfeld-Stanton test scores. These grounded studies can be seen as using quantitative inputs to guide the production of qualitative outputs.

Lazarsfeld-Stanton Data
For many years, CBS has tested prospective TV programs by using its program-analyzer system, usually referred to as the Stanton-Lazarsfeld procedure at CBS and as the Lazarsfeld-Stanton procedure elsewhere (Frank Stanton worked for CBS). Each pilot show is tested with a sample of 80 out-of-town tourists who are recruited off the street following a quota-sampling procedure. Each is provided with two buttons to press as they watch the show. They are told to press the green button (on the right) if they like what they are watching and to press the red button (on the left) if they dislike it. They are free to press neither button, but are requested not to press both at once. The button pressings are recorded continuously. By aggregating viewer scores at regular intervals throughout the show, two graphs of audience response are prepared: one plotting green-button (positive) scores, the other plotting red-button (negative) scores. These are superimposed on a scene-by-scene outline of the test show to produce a viewer reaction profile.

Langbourne W. Rust is president of Langbourne Rust Research, Inc., which he founded 14 years ago. Most of his research has been directed toward understanding people’s perceptions of products and communications. Dr. Rust received a B.A. from Harvard and a Masters and an Ed.D. in psychology from Teachers College, Columbia University.

Data from the program analyzer are important inputs to CBS programming. A bad score or unhealthy profile (e.g., a steadily declining one) can kill a show. Fluctuations in response can help identify program elements that should be culled or emphasized in future development. Once a pilot show has been tested and the decisions regarding its fate have been made, the test data have little more utility to the network. They get filed away, along with a copy of the test show itself, and forgotten.
Until the time of the study reported here there had been few, if any, attempts to use this large data base to develop a formal set of programming generalities. This analysis has two resources to work from: the films of the pilot shows and the aggregated scene scores. Scene scores express the percent of the audience pressing a button for the average moment within a scene. Shows had between 27 and 43 scenes each (see Table 1).

Table 1
Pilot Shows Used in Rounds 1 and 2

Round 1: After the Honeymoon; Arnie; Barefoot in the Park; Bridget Loves Bernie; The Governor and JJ; Me and Benjy; Miss Stewart, Sir; My Sister, Hank; My Wives, Jane; Oh Nurse!; The Plumbum; Vernon’s Volunteers

Round 2: Anna and the King; The Blonde; Bobby-Jo & the Big Apple Goodtime Band; The Bob Newhart Show; Chicago Teddy Bears; The Don Rickles Show; Funny Face; Keep the Faith; The Living End; Mary Tyler Moore Show; Missy’s Men; The New Dick Van Dyke Show; Stanley versus the System; Tim Conway Show

Average score, Green button* (%) and Red button (%): 32 28 30 37 28 23 25 46 28 36 31 27 4 3 2 2 2 32 31 25 37 34 32 3 3 4 2 4 3 5 3 5 2 4 2 3 35 2 31 26 4 34 5 3 25 3 47 2 20 34 4 3

* Scenes ending in the first three minutes and the credits at the end were excluded in calculating program scores.

Procedure

The green- and red-button scores were analyzed separately. Each analysis was done in two phases or “rounds.” Round one took 12 shows and designed attributes to account for the scores of all their scenes. Round two applied the attributes to 14 new shows, changing, deleting, and adding to the definitions on the basis of the new data.

The analytical process involved three steps: generation, refinement, and integration. They are elaborated here:

(1) Generate the prototypes.
• Identify and review the maximum-contrast cases. Which were the very highest and lowest scoring scenes? These are likely to highlight whatever features audiences are affected by and to be least complicated by mixtures of positive and negative attributes.
• Induce prototype differentiators by studying contrasts. Review the maximum-contrast cases, make note of features that seem to differentiate the high- from the low-scoring ones.
• Quantify prototype performance within these cases. Measure how well each attribute discriminates high-scoring scenes from low.
• Preliminary cull. Scrap all weak performers. If a prototype does not help you differentiate these very highest and very lowest scenes, it is not likely to reflect a feature that was critical to the viewers.

(2) Refine the definitions. For each prototype surviving the initial cull:
• Test against the remaining cases and quantify its performance. This means reviewing the complete sample of scenes, not just those with the highest and lowest scores. If an attribute actually affects viewers, then there should be a measurable trend in the scores of scenes that have the attribute. If there is no trend, one is better off working with another prototype.
• Deviant case analysis. Identify all cases scoring against the trend. Contrast them with conforming cases to induce ways to improve the definition. The working assumption is that scenes which score against the trend probably do so because the attribute is improperly defined.
• Test the revisions against the full sample of cases.
• Iterate: Keep modifying the definition through deviant case analysis until improvements cease. Then select the next prototype for refinement.

(3) Integrate the attribute set. When all prototypes have been explored and refined, take the survivors and look for ways to:
• Generate superordinate attributes: Seek definitions which permit collapsing several into one higher-order attribute definition that accounts for the data as accurately as the independent ones do.
• Reduce redundancy: Eliminate attributes which fail to make a net contribution to the analysis, i.e., those which differentiate cases which are already accounted for by the others.

An Example: Design of a Green-Button Attribute

The search for prototypes began with the round 1 shows. The highest and lowest scene in each show were identified. These scenes (12 high, 12 low) were viewed over and over again. Long lists of differential attributes were prepared. Glaser and Strauss refer to this sort of induction as “comparative analysis” and to the selection of maximum-contrast cases as “theoretical sampling.”

One of the first attributes to be induced was seen in 5 of the 12 low-scoring scenes and in none of the high-scoring ones. It was “presence of new characters.” (Chi-square with 1 df = 6.3, p < .02.) Its performance was adequate to take it into the next phase, refinement, which began by testing it against the sample of remaining scenes. This second phase of analysis follows what Glaser and Strauss call “representative sampling” and uses a strategy of deviant case analysis to refine the definitions.

Defined as “scenes in which new characters enter the show for the first time,” 59 of the 357 scenes were judged to have it; 42 of them scored low. Chi-square = 9.7. Although the definition had improved, there were still 17 deviant cases. Examination of these led to a series of refinements. The term “characters” was modified to read, “characters with impact on the plot line, not just walk-ons.” This eliminated six deviant cases and one supporting case. Chi-square became 16.7. Then the definition was modified to specify “new characters” as those with no prior indication of their personality. This eliminated four deviant cases and three supporting ones. Chi-square = 20.9.
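The first chi-square figure quoted above can be checked from the counts given in the text. The sketch below assumes the statistic is the ordinary Pearson chi-square for a 2 x 2 table without a continuity correction; the paper does not say which variant was used, so this is an assumption, and the function name `chi_square_2x2` is purely illustrative. Under that assumption, 5 of 12 low-scoring scenes versus 0 of 12 high-scoring scenes gives a value that rounds to the reported 6.3.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
        [[a, b],
         [c, d]]
    Assumption: the article does not state which chi-square variant it used.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("every row and column needs at least one case")
    return n * (a * d - b * c) ** 2 / denom


# Maximum-contrast cases from round 1, as reported in the text.
# Rows: low-scoring scenes, high-scoring scenes.
# Columns: "new characters" present, absent.
low_with, low_without = 5, 7
high_with, high_without = 0, 12

stat = chi_square_2x2(low_with, low_without, high_with, high_without)
print(round(stat, 1))  # 6.3
```

The later figures (9.7, 16.7, 20.9) cannot be recomputed the same way, because the text reports only the scenes that have the attribute and not how the remaining scenes split between high and low scores.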
The final modification redefined “new” to “no prior introduction or characterization” and opened up the “impact on plot-line” restriction to include all “salient characters.” The net effect was to reduce deviant cases to 6 and increase supporting cases to 40. Chi-square = 24.9.

In round 2, the attribute was found in 55 of 455 scenes. Seventy-eight percent scored low. Chi-square = 13.8, p < .001. A study of the 12 deviant cases led to no modifications.

In the red-button analysis, the attribute appeared again. No alterations in the definition were indicated by the data. Presence of the attribute was associated with low levels of red-button pressing. This analysis showed that audiences hold back all judgment of a show (positive or negative) when confronted with a new character.

Attributes of Scenes with High Green-button Scores

Goodness, help, and kindness: where persons are, or try to be, good to others in an open, direct, face-to-face way. They must be moved by a genuine good-hearted feeling. It is not sufficient to be routinely or habitually polite, remorseful, sympathetic, or affectionate. In the Funny Face pilot, Sandy Duncan accepted a date with a shy bookstore clerk. Later, after having broken their date, she ditched a movie star to go out with the clerk after all. Both scenes got high levels of green-button pressing. People seem to like people who are good, kind, and helpful. They like scenes in which characters act this way in an open and well-meaning manner.

Harmless embarrassment: where characters appear or feel awkward, silly, embarrassed, or ridiculous. The embarrassing incident may actually occur or be vividly recalled or anticipated. This attribute is quite characteristic of sitcoms. Over 25 percent of the scenes had it. Many others attempted to portray embarrassment but failed to meet all the criteria.

Surprise: where the audience or one of the characters is surprised or startled. If a character is surprised, he/she must show it clearly.
The emphasis must be on something happening that the audience is in on: the kind of private surprise that is shown by characters when they get a sudden idea does not qualify. A suitor in Miss Stewart, Sir had his pass blocked with a stomp on his foot. A supposedly dying character in My Wives, Jane spoke unexpectedly from her hospital bed. A nurse in Oh Nurse! took a shower in a stall adjoining one in which two men were hiding. This was the most frequent high-appeal attribute, appearing in 34 percent of the scenes.

Justice realized: where a negative character (or a character who has acted with questionable motives) sees the light and reforms, relents under pressure, gets comeuppance, retribution, or a put-down, makes a fool of him/herself, or is resisted successfully. It is essential that the character be aware of what has happened. This attribute occurred most often as retribution. Audiences seem to love to see evil punished and the bad guy put in his/her place. Most situation comedies give a scene or two with it. Audiences also like to see someone reform or see the light. They hit the green button in My Wives, Jane when a TV producer finally let an actress have a baby, and they liked it when an immigration official in Missy’s Men finally decided to grant Missy a visa.

Catch and chase, rough and tumble. Catch and chase, rough and tumble scenes often have an air of childhood roughhousing or intrigue, though they need not be done playfully in order to get high ratings.

Physical discomfort: where non-threatening physical discomfort is received with good-natured grumbling, gag-making, slow burn, blustering, etc.

Some Attributes of Scenes with Low Green-button Scores

Sudden scene changes: where a show cuts suddenly to new surroundings not anticipated by the audience, scores drop. This excludes scenes which have been led into by the plot or dialogue or lyrics, but the lead-in must be very clear.
Nearly 25 percent of all scenes in rounds 1 and 2 began with a sudden change of surroundings. Eighty-two percent of them scored low, irrespective of their other attributes. Not all scene changes get low scores. When a chase moves smoothly from one scene to the next, or when the audience knows about the new setting in advance (where it will be, who will be there, etc.), scores may not be low.

Sad or troubled characters: where a positive or sympathetic character feels genuinely troubled, worried, or unhappy and expresses it openly.

New characters: where salient characters have their presence felt for the first time, with no prior introduction or characterization.

First five minutes: scenes that end within the first five minutes of a show. It seems to take some time for an audience to warm up to any show (at least in terms of button-pressing) no matter how strong the early scenes may be in terms of other attributes. Red-button pressing also tended to be below average for the first few minutes.

Story over: scenes which occur after the story is over and the end of the show is in sight.

Attributes of Scenes with High Red-button Scores

The attributes affecting red-button pressing were more difficult to discover than those affecting green buttons. Given the low frequency of red-button pressing, the opinion of just one or two persons could strongly affect the scores, making them less reliable and obscuring the underlying patterns. Despite this problem, a number of attributes were identified.

Apprehension: scenes which introduce, reinforce, or sustain audience apprehension that something bad, wrong, or anxiety-provoking may happen. The sense of apprehension must be strong, and the nature of the threat must be clear. Apprehension was the most frequent stimulus to red-button pressing. One-third of all scenes had it. Red-button pressing may not always reflect something undesirable about a scene.
Could it be that pilot shows which succeed in building apprehension (in order to resolve it later) tend to get screened out by network pretests, and get replaced by shows which are blander and less involving? Perhaps one should discount high red-button scores when they are inflated by scenes with high apprehension levels.

Manifest anxiety: scenes with seriously felt complaining, anger, expressed anxiety, unhappiness, or hostility. The definition does not refer to scenes where a concrete solution to a problem is signaled (hope, if concrete, seems to defuse audience anxiety), to scenes where characters express outrage at clear injustice, where the anxiety is expressed through gags, or where it is trivial (e.g., good-natured grumbling about some physical discomfort or inconvenience). Manifest anxiety was present when Katie told her father-in-law about her worries (with no clear call for solution) in After the Honeymoon. Expressing anxiety through gags tends to mute the negative response of audiences: a pattern demonstrated by the many scenes where Dick Van Dyke expressed worry over his screen test. Writers often try to build tension through the display of “manifest anxiety,” but it may be a dangerous strategy.

The rest of the high-negative attributes seem to reflect qualities that viewers judge as socially “bad”:

Tattles and lies: scenes where characters lie, tattle on someone, or violate a confidence. Excluded from this factor is putting on an assumed role for a legitimate purpose.

Meddlers: where characters mind other people’s business, try to thwart their legitimate aspirations, or meddle in their concerns. The acts must have a negative intent.

Illicit sex: scenes with unmarried couples making physical contact with strong and at least vaguely illicit erotic implications. Also included in this factor are scenes in which such activity is talked about and dirty jokes or allusions.
Ridicule: scenes where characters poke fun at someone, laugh at their embarrassment, or otherwise make fun of them to their face.

Conceit: when salient characters indulge in conceited behavior or self-adulation, either at the expense of others around them or as a flaw in an otherwise positive character. Normally arrogant characters are OK unless they exercise their conceit at the expense of innocent others.

Red-button Analysis: Low-Negative Attributes

The four attributes found to account for low red-button scores were almost identical to four of the low-green attributes. One required a slight change. It specified scenes ending in the “first three minutes” instead of the first five. The others remained unchanged: “new characters,” “story over,” and “credits.”

Predicting New Scene Scores

An assistant prepared the materials for the analysis. Scores were withheld from the researcher until all the scenes were classified, using the differentiating attributes. A prediction was made for each scene: high if there were more high attributes than low attributes, low if there were not. A 2 x 2 contingency table related the predictions to the scores (high or low, relative to the show mean). Table 2 lists the nine pilot shows. There were 305 scenes. The green-button scores were correct 83 percent of the time: 76 percent of the highs and 86 percent of the lows; chi-square = 118 with 1 df, p < .001. The red-button scores were correct for 69 percent of the highs, 65 percent of the lows; chi-square = 32, p < .001 (see Table 3).
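The prediction rule described above is simple enough to sketch in a few lines. The example below is only an illustration: the scenes, scores, and attribute codings are hypothetical and are not data from the study, the helper names `predict` and `observed` are invented for the sketch, and a scene scoring exactly at its show mean is treated as low, which the paper does not specify. The sketch also averages every scene of a show, whereas the study excluded some scenes when computing program scores.

```python
from collections import Counter, defaultdict


def predict(high_attrs, low_attrs):
    """Rule from the text: 'high' if a scene has more high attributes
    than low attributes, otherwise 'low' (ties count as low)."""
    return "high" if len(high_attrs) > len(low_attrs) else "low"


def observed(score, show_mean):
    # Assumption: a score exactly equal to the show mean counts as low.
    return "high" if score > show_mean else "low"


# Hypothetical scenes: (show, green-button score %, high attrs, low attrs).
scenes = [
    ("Show A", 41, ["surprise", "justice realized"], []),
    ("Show A", 22, [], ["new characters", "sudden scene changes"]),
    ("Show A", 35, ["harmless embarrassment"], ["sad or troubled characters"]),
    ("Show B", 30, ["goodness, help, and kindness"], []),
    ("Show B", 18, ["surprise"], ["first five minutes", "new characters"]),
    ("Show B", 27, [], []),
]

# Mean score per show (simplified: every scene is included).
by_show = defaultdict(list)
for show, score, _, _ in scenes:
    by_show[show].append(score)
show_mean = {show: sum(v) / len(v) for show, v in by_show.items()}

# 2 x 2 contingency table keyed by (predicted, observed).
table = Counter()
for show, score, highs, lows in scenes:
    table[(predict(highs, lows), observed(score, show_mean[show]))] += 1

hits = table[("high", "high")] + table[("low", "low")]
print(dict(table))
print(f"overall accuracy: {hits / len(scenes):.0%}")
```

Because ties are predicted low, a scene with one high and one low attribute, or with no coded attributes at all, is predicted to score below its show mean.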
Table 2
Pilot Shows Used for the Prediction Test

Shows: The Doris Day Show; Houseboat; M*A*S*H; Maureen; The Minnie Pearl Show; When in Rome (To Rome with Love); Scared Stiff; Shepherd’s Flock; The Singles

Green button: 29 35 31 23 29 26 30 30
Red button: 3.5 2.7 1.6 3.5 4.2 1.5 7.5 4.0 2.6

Table 3
Scene-Score Predictions

                              Number of scenes
                           High score   Low score
Green-button predictions
  High-score predicted         98           25
  Low-score predicted          31          151
  Accuracy %                   76           86
Red-button predictions
  High-score predicted         74           68
  Low-score predicted          34          129
  Accuracy %                   69           65

Conclusion

The relationship we discover between any two variables depends on how we define those variables. The problem of definition is especially critical in the social sciences where the subjects may process experience differently than the researchers. The tendency for researchers to build abstract models (descriptive or theoretical) around ad hoc, a priori variables is a primary reason for their difficulties in predicting how people will respond to previously untested complex stimuli.

The attributes discovered through grounded-analysis procedures proved to generalize well in this study. When applied to new program material, they differentiated high- from low-scoring scenes with an encouraging degree of accuracy. It is hoped that publication of this study will encourage extension of the methods of grounded analysis to new substantive areas and to the evolution of inductive methodologies in general.

There is a large class of problems that can be analyzed by the grounded approach. Appropriate data bases abound, particularly in the worlds of business and communications. Wherever standardized pretesting (of advertisements, products, programs, etc.) has been carried on long enough, the necessary raw materials are available. Grounded analysis makes it possible to draw valuable generalizations from data on the shelf.

The results take some getting used to. We are accustomed to research which reports that the relationship between two predefined variables is such-and-such a statistic. We are less accustomed to research which says that the critical stimuli to a certain type of response are best defined in such-and-such a way. But the latter is what these grounded analyses produce, and the findings have shown a high level of predictive generality. By providing a qualitative characterization of the structure of other people’s perceptions, we may better understand why they act as they do, and we may be able to produce better materials for them.

References

Bogdan, R., and S. K. Biklen. Qualitative Research for Education: An Introduction to Theory and Methods. Boston: Allyn and Bacon, Inc., 1982.

Campbell, D. T., and J. C. Stanley. “Experimental and Quasi-experimental Designs for Research on Teaching.” In Handbook of Research on Teaching, N. L. Gage, ed. Chicago: Rand McNally, 1963.

Glaser, B. G. “The Constant Comparative Method of Qualitative Analysis.” Social Problems (1965): 436-445.

------, and A. L. Strauss. The Discovery of Grounded Theory: Strategies for Qualitative Research. Chicago: Aldine Publishing Company, 1967.

Levy, M. R. “The Lazarsfeld-Stanton Program Analyzer: An Historical Note.” Journal of Communication 32, 4 (1982): 30-38.

Patton, M. Q. Qualitative Evaluation Methods. Beverly Hills, CA: Sage Publications, 1980.

Rust, L. W. Attributes That Differentiate Boys’ and Girls’ Preference for Materials in the Preschool Classroom: A Systems Design Approach. Ann Arbor, MI: University Microfilms, 1971.

------. Attributes of The Electric Company Pilot Shows That Produced High and Low Attention in 2nd and 3rd Graders. New York: Children’s Television Workshop, 1971, Eric text-fiche.

------. Attributes of Sesame Street That Influence Preschoolers’ Attention to the TV Screen. New York: Children’s Television Workshop, 1971.

Turner, R. H. “The Quest for Universals in Sociological Research.” American Sociological Review 18 (1953): 604-611.

Znaniecki, F. The Method of Sociology. New York: Farrar & Rinehart, 1934.