Sports Data Sources and Data Extraction
Gavin Zhang
MIS580
University of Arizona
02-06-2008

Outline
• Sports Data Sources
  – Baseball
  – Basketball
  – Football
  – Olympics
  – Greyhound
• Data Extraction
  – Case Study: AZGreyhound System

Baseball Data Source
Download the database
• http://www.baseball1.com/

Data Download
• This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2007.
  – The data are provided in Microsoft Access, CSV, and other formats.
  – The newest version is Version 5.5.
• The database can be downloaded at: http://baseball1.com/content/view/57/82/

Database
• A detailed description of the database is available at: http://baseball1.com/content/view/57/82/
• The database has 21 tables; the main tables include:
  – MASTER table: player names, dates of birth, and biographical information
  – Batting table: batting statistics
  – Pitching table: pitching statistics
  – Fielding table: fielding statistics
• A detailed description of each data field in each table is available.

AwardPlayers.csv
playerID   awardID       yearID  lgID
bondto01   Triple Crown  1877    NL
hinespa01  Triple Crown  1878    NL
heckegu01  Triple Crown  1884    AA
radboch01  Triple Crown  1884    NL
keefeti01  Triple Crown  1888    NL
clarkjo01  Triple Crown  1889    NL
duffyhu01  Triple Crown  1894    NL
rusieam01  Triple Crown  1894    NL
lajoina01  Triple Crown  1901    AL
youngcy01  Triple Crown  1901    AL
wadderu01  Triple Crown  1905    AL
mathech01  Triple Crown  1905    NL
mathech01  Triple Crown  1908    NL
cobbty01   Triple Crown  1909    AL
cobbty01   MVP           1911    AL
schulfr01  MVP           1911    NL
speaktr01  MVP           1912    AL
doylela01  MVP           1912    NL
johnswa01  MVP           1913    AL
…………

Basketball Data Source
Download all of the player and team statistics
• http://databaseBasketball.com/

Data Download
• The website contains NBA data from 1947 to 2007 and ABA data from 1968 to 1976 on players, teams, leagues, all-star games, awards, and coaches.
• Download at: http://databasebasketball.com/stats_download.htm

Database
• This download contains nine delimited text files (.txt format), each of which represents a table in the database.
• If you open the files in Excel, you may need to select Data -> Text to Columns, then use the bar ("|") character as the delimiter.

Teams.txt
team|location|name|leag
ANA|Anaheim|Amigos|A
AND|Anderson|Duffey Packers|N
ATL|Atlanta|Hawks|N
BA1|Baltimore|Bullets|N
BAL|Baltimore|Bullets|N
BOS|Boston|Celtics|N
BUF|Buffalo|Braves|N
CAP|Capital|Bullets|N
CAR|Carolina|Cougars|A
CH1|Chicago|Stags|N
CH2|Chicago|Zephyrs|N
CHA|Charlotte|Hornets|N
CHI|Chicago|Bulls|N
…………

Football Data Source
• http://www.pro-football-reference.com/

Data Download
• A copy of the data set (in CSV format) can be downloaded from: http://ai.arizona.edu/hchen/chencourse/SportsData/Pro-footballrefernce_CSV.zip
• This version contains game data from 1995 to 2006. The data set covers 64,327 players and the games they played in.
• Tables include:
  – Master: information about players
  – Seasons: player statistics by season
  – Games: player statistics by game
• A detailed description of each data field in each table is available.
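
Reading the delimited files in code
Both the bar-delimited basketball files and the football CSV files can be split into fields in much the same way. The following is a minimal sketch only: the class name DelimitedRows and its methods are hypothetical and are not part of any of the data sets, and the sketch assumes the first line is a header and that fields contain no quoted delimiters (a full CSV parser would be needed otherwise).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the slides or the data sets):
// splits delimited text lines into rows of fields.
final class DelimitedRows {

    // Splits one line on a literal delimiter such as "|" or ",".
    // Pattern.quote is needed because "|" is a regex metacharacter.
    // The -1 limit keeps trailing empty fields. Quoted fields are NOT handled.
    static String[] splitLine(String line, String delimiter) {
        return line.split(Pattern.quote(delimiter), -1);
    }

    // Parses every non-blank line after the header (index 0) into a field array.
    static List<String[]> parse(List<String> lines, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (!line.isEmpty()) {
                rows.add(splitLine(line, delimiter));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample rows copied from the Teams.txt excerpt.
        List<String> teams = Arrays.asList(
            "team|location|name|leag",
            "ANA|Anaheim|Amigos|A",
            "AND|Anderson|Duffey Packers|N");
        for (String[] row : parse(teams, "|")) {
            System.out.println(row[0] + ": " + row[1] + " " + row[2] + " (" + row[3] + ")");
        }
    }
}
```

For the CSV files, the same methods can be called with "," as the delimiter, as long as no field contains a quoted comma.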

Master.csv
ID        last name     first name  position  birth year  debut year
AbduKa00  Abdul-Jabbar  Karim       rb        1974        1996
AbduRa00  Abdullah      Rabih       rb        1975        1999
AberWa00  Abercrombie   Walter      rb        1959        1982
AbraDa00  Abramowicz    Danny       wr        1945        1967
AdamBo00  Adams         Bob         te        1946        1969
AdamCh00  Adams         Charlie     wr        1979        2003
AdamCu00  Adams         Curtis      rb        1962        1985
AdamGe00  Adams         George      rb        1962        1985
AdamGr00  Adams         Grant       wr        2000        2005
AdamJo00  Adams         John        rb        1937        1959
AdamMi00  Adams         Michael     wr        1974        1997
AdamMi01  Adamle        Mike        rb        1949        1971
AdamTo00  Adams         Tony        qb        1950        1975
AdamTo01  Adams         Tom         wr        1940        1962
AdamTo02  Adamle        Tony        rb        1924        1950
AdamWi00  Adams         Willie      wr        1956        1979
AddaJo00  Addai         Joseph      rb        1983        2006
AdkiJa00  Adkisson      James       te        1980        2005
AdkiMa00  Adkins        Margene     wr        1947        1970
AdkiSa00  Adkins        Sam         qb        1955        1977
…………

Some Other Football Data Sources
• http://www.databasefootball.com/
  – The website contains National Football League (NFL) data from 1922 to 2005 and American Football League (AFL) data from 1960 to 1969 on players, teams, leagues, awards, and coaches.
  – The data set cannot be downloaded directly. The data must be extracted from the HTML Web pages with parsing programs.
• http://www.jt-sw.com/football/
  – The website contains NFL player and coach statistics from 1920 to the present and AFL statistics from 1960 to 1969.
  – The data set cannot be downloaded directly. The data must be extracted from the HTML Web pages with parsing programs.

Olympics Data Source
• http://www.databaseolympics.com/

Data Format
• DatabaseOlympics.com lists every Summer and Winter Olympics medal winner.
  – Summer Olympics from 1896 to 2004
  – Winter Olympics from 1924 to 2002
• Every medal winner for every country is listed, with links to each Olympics, sport, and athlete.

Greyhound
• http://66.236.122.233:8080/tracklink/

Data Format
• The data include daily race programs (videos) and odds charts (.txt file format) for all US greyhound tracks.
• Some tracks had both afternoon and evening programs.

Chart.txt Data Format
1st   Grade: B   Distance: 550   Condition: Fast

DOG               WT    P  O  1/8  Str  Fin    Time   Odds   Comment
PTL Jane          63.5  6  3  1    1    1 ns   32.00  11.60  Held At Wire Inside
Silver Speck      68.5  1  1  2    2    2 ns   32.01  2.80   Cutff 1st, Stayd Cls
Jain't It Doug    75    7  7  6    6    3 1.5  32.10  7.50   Closed For Show Outs
Flyer Whitesocks  75.5  8  8  7    3    4 1.5  32.11  2.30   In The Hunt
Flying Detroit    69    5  5  4    4    5 2    32.15  9.00   Not Far Behind Mdtrk
VP Twix Twizala   59.5  3  4  3    5    6 4.5  32.31  4.20   Losing Position Ins
Sergio            73    4  6  5    7    7 5    32.34  13.30  Blocked 1st Turn
Heartattack Jack  71.5  2  2  8    8    8 5.5  32.39  7.10   Bumped 1st Turn
…………

Case Study: AZGreyhound System
By Rob Schumaker

AZGreyhound System Design
[Diagram. Components: Greyhound Data (Race Data, Odds Data), AZGreyhound DB, Model Building (Training / Testing), Prediction, Betting Engine, Metrics.
 Betting Engine bet types: Traditional (Win, Place, Show); Straight Bets (Exacta, Trifecta, Superfecta); Box Bets (Quiniela, Trifecta, Superfecta).
 Metrics: Accuracy, Payout, Efficiency.]

Greyhound Data Extraction
• Greyhound data was gathered from www.trackinfo.com. The Web site links to:
  – GreyMatter: http://66.236.122.233:8080/tracklink/
  – TrackInfo: http://www.trackinfo.com/index2.html
• The race and odds data was parsed into a SQL Server database; the data was then sent to the AZGreyhound system for prediction.

Example code
    // This method picks up the overall race information
    // and puts it in the database.
    public void RacePrograms() throws Exception {
        ... ...
        String URL1 = "http://www.trackinfo.com/trakdocs/hound/";
        String URL2 = "/Rpages";
        ... ...
        OpenConnection2();
        try {
            ... ...
                // Data parsing URL
                TrackAbbrev = rSet.getString("TrackAbbrev");
                String URL = URL1 + TrackAbbrev + URL2;
                Feed = web.Scraper(URL, 1);
                ... ...
                NumItems = web.NumItems(Feed, "~icons/html.gif");
                for (int y = 1; y <= NumItems; y++) {
                    // Parsing out each data field
                    Feed = Feed.substring(Feed.indexOf("~icons/html.gif"));
                    FileName = web.ExtractText(Feed, "<A HREF=\"", "\">");
                    Feed = Feed.substring(Feed.indexOf("<A HREF="));
                    FileDate = web.ExtractText(Feed, "NOWRAP>", "</TD>");
                    FileContents = web.Scraper(URL + "/" + FileName, 1);
                    FileContents = FileContents.replaceAll("'", "-");
                    // Insert into DB
                    db.Insert2DBProgram(FileName, FileDate, FileContents);
                }
            }
            CloseConnection2();
        }
        catch (SQLException e) {
            System.out.println(e);
        }
    }

You can use the sports data sources introduced in this set of slides for your data mining project. You are strongly encouraged to identify other interesting public sports data sets for your project.

Thanks!
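
Supplementary note on the example code
The example code pulls each file name and date out of the page text by looking for a start marker and an end marker. The slides do not show how web.ExtractText is implemented, so the following is only a sketch of what such a marker-based helper might look like. The class MarkerExtractor, the method extractBetween, and the sample HTML row are all assumptions made for illustration.

```java
// Hypothetical stand-in for the marker-based extraction used in the example code.
// The real web.ExtractText implementation is not shown in the slides.
final class MarkerExtractor {

    // Returns the text between the first occurrence of start and the next
    // occurrence of end after it, or null when either marker is missing.
    static String extractBetween(String text, String start, String end) {
        int from = text.indexOf(start);
        if (from < 0) {
            return null;
        }
        from += start.length();
        int to = text.indexOf(end, from);
        if (to < 0) {
            return null;
        }
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        // Made-up directory-listing row, for illustration only.
        String row = "<IMG SRC=\"~icons/html.gif\"> <A HREF=\"sample.html\">sample.html</A>"
                   + "<TD NOWRAP>01-Jan-2008</TD>";
        System.out.println(extractBetween(row, "<A HREF=\"", "\">"));
        System.out.println(extractBetween(row, "NOWRAP>", "</TD>"));
    }
}
```

Returning null for a missing marker lets the caller skip a malformed row instead of failing on a substring index error.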