Case Study: 2004 Movies

1 Description

This data was extracted from www.imdb.com. These are the movies that appeared in 2004. There are 63 movies and 4 variables:

budget    How much the movie cost to make
length    Length in minutes
rating    Average user rating
votes     Number of users who logged into the web site to rate the movie

The primary question is “Can the movies be grouped into a small number of clusters according to their similarity?”

Other possible questions might be:

• Does a bigger budget suggest a better user rating?
• Which low-budget movies are rated unusually highly?
• Do shorter movies have lower budgets?

2 Plan for Analysis

Approach: Summary statistics (marginal and conditional)
Reason: Extract location/scale information.
Type of questions addressed: “What movie is rated highest by users?” “What is the longest movie?”

Approach: Plots
Reason: Explore data distributions.
Type of questions addressed: “Are most movies short or long?” “Is there any obvious clustering of the movies?”

Approach: Numerical clustering
Reason: Group the movies into clusters with similar attributes. Use hierarchical, k-means, model-based and self-organizing maps.
Type of questions addressed: “Which movies might be considered alike?”
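As a rough illustration of the numerical clustering step, the sketch below runs a plain k-means on the four variables using only the Python standard library. It is not the analysis from this case study: the rows in `toy_movies` are made-up placeholder values, not IMDb data, and the helper names `standardize` and `kmeans` are hypothetical. Standardizing each variable first is an assumption made here because budget, length, rating and votes are on very different scales.

```python
import math
import random
import statistics


def standardize(rows):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    # Fall back to 1.0 so a constant column does not cause division by zero.
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd-style k-means; returns (labels, centers)."""
    rng = random.Random(seed)
    # Start from k distinct observations chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [statistics.mean(c) for c in zip(*members)]
    return labels, centers


# Placeholder rows (budget, length, rating, votes); NOT taken from IMDb.
toy_movies = [
    (5.0, 90, 6.1, 1200),
    (8.0, 95, 6.5, 1500),
    (6.5, 88, 5.9, 900),
    (120.0, 140, 7.2, 40000),
    (150.0, 150, 7.0, 52000),
    (135.0, 145, 7.5, 47000),
]

labels, centers = kmeans(standardize(toy_movies), k=2)
for row, lab in zip(toy_movies, labels):
    print(lab, row)
```

With the real 63-movie data, the same idea would apply after reading the four variables into rows. The number of clusters `k` would need to be chosen by comparing several values, and the result checked against the hierarchical, model-based and self-organizing map solutions listed in the plan.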