Sports Data Sources and Data Extraction
Gavin Zhang
MIS580
University of Arizona
02-06-2008
Outline
• Sports Data Sources
– Baseball
– Basketball
– Football
– Olympics
– Greyhound
• Data Extraction
– Case Study: AZGreyhound System
Baseball Data Source
Download the database
• http://www.baseball1.com/
Data Download
• This database contains pitching,
hitting, and fielding statistics for
Major League Baseball from 1871
through 2007.
– The data are provided in Microsoft
Access, CSV, and other formats.
– The newest version is Version 5.5.
• The database can be downloaded at:
http://baseball1.com/content/view/57/82/
Database
• Detailed description of the database is available at:
http://baseball1.com/content/view/57/82/
• The database has 21 tables; main tables include:
– MASTER Table: player names, DOB, and biographical info;
– Batting Table: batting statistics;
– Pitching Table: pitching statistics;
– Fielding Table: fielding statistics.
• A detailed description of each data field in each table is available.
AwardPlayers.csv
playerID    awardID        yearID   lgID
bondto01    Triple Crown   1877     NL
hinespa01   Triple Crown   1878     NL
heckegu01   Triple Crown   1884     AA
radboch01   Triple Crown   1884     NL
keefeti01   Triple Crown   1888     NL
clarkjo01   Triple Crown   1889     NL
duffyhu01   Triple Crown   1894     NL
rusieam01   Triple Crown   1894     NL
lajoina01   Triple Crown   1901     AL
youngcy01   Triple Crown   1901     AL
wadderu01   Triple Crown   1905     AL
mathech01   Triple Crown   1905     NL
mathech01   Triple Crown   1908     NL
cobbty01    Triple Crown   1909     AL
cobbty01    MVP            1911     AL
schulfr01   MVP            1911     NL
speaktr01   MVP            1912     AL
doylela01   MVP            1912     NL
johnswa01   MVP            1913     AL
…………
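Once a table such as AwardPlayers.csv is exported as CSV, a few lines of Java are enough for a first look at it. The sketch below is only an illustration: it assumes a header row followed by the four columns shown on the slide (playerID, awardID, yearID, lgID) and no quoted commas inside fields. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AwardCounts {
    /**
     * Counts rows per awardID.
     * Assumption: the first line is a header and awardID is the second column.
     */
    static Map<String, Integer> countByAward(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (lines.isEmpty()) {
            return counts;
        }
        for (String line : lines.subList(1, lines.size())) { // skip the header row
            String[] fields = line.split(",", -1);           // -1 keeps empty trailing fields
            if (fields.length < 4) {
                continue;                                    // ignore malformed rows
            }
            counts.merge(fields[1].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three rows copied from the sample table above.
        List<String> sample = Arrays.asList(
            "playerID,awardID,yearID,lgID",
            "cobbty01,Triple Crown,1909,AL",
            "cobbty01,MVP,1911,AL",
            "schulfr01,MVP,1911,NL");
        System.out.println(countByAward(sample));
    }
}
```

For files that contain quoted fields or embedded commas, a real CSV parser would be a safer choice than `split`.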
Basketball Data Source
Download all of the player and team statistics
• http://databaseBasketball.com/
Data Download
• The website contains the
NBA data from 1947 to 2007
and ABA data from 1968 to
1976 on players, teams,
leagues, all-star games,
awards, and coaches.
• Download at:
http://databasebasketball.com/stats_download.htm
Database
• This download contains nine delimited text files (.txt format), each of
which represents a table in the database.
• If you open the files in Excel, you may need to select Data -> Text to
Columns, then use the bar ("|") character as the delimiter.
Teams.txt
team|location|name|leag
ANA|Anaheim|Amigos|A
AND|Anderson|Duffey Packers|N
ATL|Atlanta|Hawks|N
BA1|Baltimore|Bullets|N
BAL|Baltimore|Bullets|N
BOS|Boston|Celtics|N
BUF|Buffalo|Braves|N
CAP|Capital|Bullets|N
CAR|Carolina|Cougars|A
CH1|Chicago|Stags|N
CH2|Chicago|Zephyrs|N
CHA|Charlotte|Hornets|N
CHI|Chicago|Bulls|N
…………
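The same files can be read programmatically instead of through Excel. A minimal sketch, assuming each line holds one record with fields separated by the bar character as in Teams.txt above; the class and method names are hypothetical. Note that "|" is a regular-expression metacharacter in Java, so it has to be quoted before splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PipeDelimited {
    // Pattern.quote turns "|" into a literal, not the regex alternation operator.
    private static final Pattern BAR = Pattern.compile(Pattern.quote("|"));

    /** Splits each non-blank line into its bar-separated fields. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            rows.add(BAR.split(line, -1)); // -1 keeps empty trailing fields
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "team|location|name|leag",
            "AND|Anderson|Duffey Packers|N",
            "CAR|Carolina|Cougars|A");
        for (String[] row : parse(sample)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```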
Football Data Source
• http://www.pro-football-reference.com/
Data Download
• A copy of the data set (in CSV format) can be downloaded from:
http://ai.arizona.edu/hchen/chencourse/SportsData/Pro-footballrefernce_CSV.zip
• This version contains the game data from 1995 to 2006. The
dataset contains 64,327 players and the games they played in.
• Tables include:
– Master: information about players
– Seasons: player statistics by season
– Games: player statistics by game
• A detailed description of each data field in each table is available.
Database
Master.csv
ID         last name      first name   position   birth year   debut year
AbduKa00   Abdul-Jabbar   Karim        rb         1974         1996
AbduRa00   Abdullah       Rabih        rb         1975         1999
AberWa00   Abercrombie    Walter       rb         1959         1982
AbraDa00   Abramowicz     Danny        wr         1945         1967
AdamBo00   Adams          Bob          te         1946         1969
AdamCh00   Adams          Charlie      wr         1979         2003
AdamCu00   Adams          Curtis       rb         1962         1985
AdamGe00   Adams          George       rb         1962         1985
AdamGr00   Adams          Grant        wr         2000         2005
AdamJo00   Adams          John         rb         1937         1959
AdamMi00   Adams          Michael      wr         1974         1997
AdamMi01   Adamle         Mike         rb         1949         1971
AdamTo00   Adams          Tony         qb         1950         1975
AdamTo01   Adams          Tom          wr         1940         1962
AdamTo02   Adamle         Tony         rb         1924         1950
AdamWi00   Adams          Willie       wr         1956         1979
AddaJo00   Addai          Joseph       rb         1983         2006
AdkiJa00   Adkisson       James        te         1980         2005
AdkiMa00   Adkins         Margene      wr         1947         1970
AdkiSa00   Adkins         Sam          qb         1955         1977
…………
Some Other Football Data Sources
• http://www.databasefootball.com/
– The website contains National Football League (NFL) data
from 1922 to 2005 and American Football League (AFL) data from
1960 to 1969 on players, teams, leagues, awards, and coaches.
– The data set cannot be downloaded directly. The data must be
extracted from the HTML Web pages using a parsing program.
• http://www.jt-sw.com/football/
– The website contains NFL player and coach statistics from 1920
to the present and AFL statistics from 1960 to 1969.
– The data set cannot be downloaded directly. The data must be
extracted from the HTML Web pages using a parsing program.
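As a rough idea of what such a parsing program does, the sketch below pulls the cell text out of the rows of an HTML table with regular expressions. It is an illustration only: the markup and the player shown in `main` are made up, the real pages on these sites may be laid out differently, and the class and method names are hypothetical. For messy real-world HTML a dedicated HTML parser is usually more robust than regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlTableSketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    // One table row, and one header or data cell inside a row.
    private static final Pattern ROW = Pattern.compile("<tr\\b[^>]*>(.*?)</tr>", FLAGS);
    private static final Pattern CELL = Pattern.compile("<t[dh]\\b[^>]*>(.*?)</t[dh]>", FLAGS);

    /** Returns the text of every cell, grouped by table row. */
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop nested tags such as <a ...> and collapse whitespace.
                String text = cell.group(1)
                        .replaceAll("<[^>]+>", "")
                        .replaceAll("\\s+", " ")
                        .trim();
                cells.add(text);
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical markup, not copied from any of the sites above.
        String html = "<table><tr><th>Player</th><th>Team</th></tr>"
                + "<tr><td><a href=\"p1.htm\">Example Player</a></td><td>XYZ</td></tr></table>";
        System.out.println(extractRows(html));
    }
}
```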
Olympics Data Source
• http://www.databaseolympics.com/
Data Format
• DatabaseOlympics.com lists every Summer and Winter Olympics medal winner.
– Summer Olympics from 1896 to 2004;
– Winter Olympics from 1924 to 2002.
• It covers every medal winner for every country, with links
to each Olympics, sport, and athlete.
Greyhound
• http://66.236.122.233:8080/tracklink/
Data Format
• Data includes daily race programs (videos) and odds charts (.txt file format) for
all US Greyhound tracks.
• Some tracks had both Afternoon and Evening programs.
Data Format
Chart.txt
1st Grade: B Distance: 550 Condition: Fast
DOG                WT     P   O   1/8   Str   Fin     Time    Odds    Comment
PTL Jane           63.5   6   3   1     1     1 ns    32.00   11.60   Held At Wire Inside
Silver Speck       68.5   1   1   2     2     2 ns    32.01   2.80    Cutff 1st, Stayd Cls
Jain't It Doug     75     7   7   6     6     3 1.5   32.10   7.50    Closed For Show Outs
Flyer Whitesocks   75.5   8   8   7     3     4 1.5   32.11   2.30    In The Hunt
Flying Detroit     69     5   5   4     4     5 2     32.15   9.00    Not Far Behind Mdtrk
VP Twix Twizala    59.5   3   4   3     5     6 4.5   32.31   4.20    Losing Position Ins
Sergio             73     4   6   5     7     7 5     32.34   13.30   Blocked 1st Turn
Heartattack Jack   71.5   2   2   8     8     8 5.5   32.39   7.10    Bumped 1st Turn
…………
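A chart row can be split into fields with a regular expression, because the dog name is the only leading field that may contain spaces. The sketch below assumes one row per line with ten columns in this order: dog, weight, post, off, 1/8, stretch, finish place plus margin, time, odds, and an optional comment. That layout is an assumption about the chart; the actual .txt files may be fixed-width or contain extra columns. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChartLineSketch {
    // name, weight, post, off, 1/8, stretch, finish place, finish margin, time, odds, comment
    private static final Pattern ROW = Pattern.compile(
        "^(.+?)\\s+(\\d+(?:\\.\\d+)?)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)"
        + "\\s+(\\d+)\\s+(\\S+)\\s+(\\d+\\.\\d+)\\s+(\\d+\\.\\d+)\\s*(.*)$");

    /** Returns the 11 captured fields, or null if the line does not look like a result row. */
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] f = parse("PTL Jane 63.5 6 3 1 1 1 ns 32.00 11.60 Held At Wire Inside");
        System.out.println("dog=" + f[0] + " time=" + f[8] + " odds=" + f[9]);
    }
}
```

The finish margin is captured as text because it can be a number of lengths or an abbreviation such as "ns" (nose).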
Case Study: AZGreyhound System
By Rob Schumaker
AZGreyhound System Design
[Figure: system design diagram with the following labeled components]
• Greyhound Data: Odds Data, Race Data
• DB
• AZGreyhound: Model Building, Training / Testing, Prediction
• Betting Engine
– Traditional: Win, Place, Show
– Straight Bets: Exacta, Trifecta, Superfecta
– Box Bets: Quiniela, Trifecta, Superfecta
• Metrics: Accuracy, Payout, Efficiency
Greyhound Data Extraction
• Greyhound data was gathered from
www.trackinfo.com. The Web site links to:
– GreyMatter http://66.236.122.233:8080/tracklink/
– TrackInfo http://www.trackinfo.com/index2.html
• The race and odds data were parsed into a
SQL Server database and then sent to the
AZGreyhound system for prediction.
Example code
// This method picks up the overall race information
// and puts it in the database.
public void RacePrograms() throws Exception {
    ... ...
    String URL1 = "http://www.trackinfo.com/trakdocs/hound/";
    String URL2 = "/Rpages";
    ... ...
    OpenConnection2();
    try {
        ... ...
        // Data parsing URL
        TrackAbbrev = rSet.getString("TrackAbbrev");
        String URL = URL1 + TrackAbbrev + URL2;
        Feed = web.Scraper(URL, 1);
        ... ...
        NumItems = web.NumItems(Feed, "~icons/html.gif");
        for (int y = 1; y <= NumItems; y++) {
            // Parsing out each data field
            Feed = Feed.substring(Feed.indexOf("~icons/html.gif"));
            FileName = web.ExtractText(Feed, "<A HREF=\"", "\">");
            Feed = Feed.substring(Feed.indexOf("<A HREF="));
            FileDate = web.ExtractText(Feed, "NOWRAP>", "</TD>");
            FileContents = web.Scraper(URL + "/" + FileName, 1);
            FileContents = FileContents.replaceAll("'", "-");
            // Insert into DB
            db.Insert2DBProgram(FileName, FileDate, FileContents);
        }
        }  // closes a block opened in the elided code above
        CloseConnection2();
    }
    catch (SQLException e) {
        System.out.println(e);
    }
}
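The slide does not show how the `web.ExtractText` and `web.NumItems` helpers are implemented. The sketch below is one plausible way to write them, inferred only from how they are called above; the real AZGreyhound helpers may behave differently. The sample feed in `main` is made up to resemble a directory listing.

```java
class ScraperHelpers {
    /**
     * Returns the text between the first occurrence of start and the
     * next occurrence of end after it, or "" if either is missing.
     */
    static String extractText(String feed, String start, String end) {
        int from = feed.indexOf(start);
        if (from < 0) {
            return "";
        }
        from += start.length();
        int to = feed.indexOf(end, from);
        if (to < 0) {
            return "";
        }
        return feed.substring(from, to);
    }

    /** Counts non-overlapping occurrences of marker in feed. */
    static int numItems(String feed, String marker) {
        if (marker.isEmpty()) {
            return 0;
        }
        int count = 0;
        int at = feed.indexOf(marker);
        while (at >= 0) {
            count++;
            at = feed.indexOf(marker, at + marker.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical listing with two entries; file names and dates are invented.
        String feed =
            "<IMG SRC=\"~icons/html.gif\"><A HREF=\"R01.html\">R01</A><TD NOWRAP>06-Feb-2008</TD>"
          + "<IMG SRC=\"~icons/html.gif\"><A HREF=\"R02.html\">R02</A><TD NOWRAP>07-Feb-2008</TD>";
        System.out.println(numItems(feed, "~icons/html.gif"));
        System.out.println(extractText(feed, "<A HREF=\"", "\">"));
        System.out.println(extractText(feed, "NOWRAP>", "</TD>"));
    }
}
```

When inserting scraped text into a database, a parameterized statement (JDBC `PreparedStatement`) avoids having to replace apostrophes in the content, as the example code above does.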
You can use the sports data sources
introduced in this set of slides for your data
mining project.
You are strongly encouraged to identify other
interesting public sports data sets for your
project.
Thanks!