Baseball
The Data
The source of data for this lab is the public Lahman database, which contains a number of data sets with different units of observation. Below are a few rows and some of the columns for two of these data sets: Teams and Batting. They contain data going back to 1871! Use these excerpts to help you answer the following questions.
Get the data
library(Lahman)
You can now get information on the data sets with the help functions ?Teams and ?Batting.
For Teams, some key variables are:
- G: number of Games played that season
- W: number of Wins that season
- L: number of Losses that season
- R: number of Runs (points, basically) the team scored that season, across all games
- RA: number of Runs Allowed (points scored by the opponents) across all games that season
For Batting, some key variables are:
- G: number of Games played that season by the player. Not every player plays in every game in a season, so this is the number of games that this person actually played in.
- AB: number of At Bats (number of times the player went up to the plate and tried to hit the ball)
- H: number of Hits (number of times the player successfully got on base by hitting the ball)
- SO: number of Strike Outs (number of times the player went up to the plate and missed the ball too many times)
- BB: number of Base on Balls (number of times the player went up to the plate and got a free pass to first base because the pitcher threw too many bad pitches)
- R: number of Runs (points) directly scored by the player that season. This means getting all the way around the bases.
Teams data frame:
| yearID | teamID | franchID | G | W | L | R | RA | name |
|---|---|---|---|---|---|---|---|---|
| 2000 | MON | WSN | 162 | 67 | 95 | 738 | 902 | Montreal Expos |
| 1981 | PIT | PIT | 103 | 46 | 56 | 407 | 425 | Pittsburgh Pirates |
| 1945 | WS1 | MIN | 156 | 87 | 67 | 622 | 562 | Washington Senators |
| 1922 | SLN | STL | 154 | 85 | 69 | 863 | 819 | St. Louis Cardinals |
| 1904 | CHA | CHW | 156 | 89 | 65 | 600 | 482 | Chicago White Sox |
| 1896 | BLN | BLO | 132 | 90 | 39 | 995 | 662 | Baltimore Orioles |
Batting data frame:
| playerID | yearID | teamID | G | AB | R | H | BB | SO |
|---|---|---|---|---|---|---|---|---|
| tekotbl01 | 2012 | SDN | 11 | 15 | 0 | 2 | 0 | 4 |
| liriafr01 | 2010 | MIN | 31 | 2 | 0 | 0 | 0 | 1 |
| streehu01 | 2007 | OAK | 48 | 0 | 0 | 0 | 0 | 0 |
| eldreca01 | 2003 | SLN | 63 | 2 | 0 | 1 | 0 | 1 |
| konerpa01 | 1998 | CIN | 26 | 73 | 7 | 16 | 6 | 10 |
| hollica01 | 1921 | DET | 35 | 48 | 4 | 13 | 3 | 4 |
Questions
Question 1
part a
What is the unit of observation for the Teams data set?
part b
What about for the Batting data set?
Question 2
part a
Write out a question about baseball that could be answered purely through the information in the Teams data set (numerical summaries, plots, etc.).
part b
Do the same thing, but for the Batting data set.
Question 3
For each subpart below, you will form a predictive question that you could answer using the data frames above. Identify a response variable for each.
part a - classification question, Batting data frame
part b - regression question, Batting data frame
part c - classification question, Teams data frame
part d - regression question, Teams data frame
For the remainder of the lab, we’ll be working with the Teams data frame. It can be accessed through the Lahman package.
Question 4
Subset the Teams data set further to include only years from 2000 to the present day; this is the data set that you’ll use for the remainder of this lab. There is also one year after 2000 that you might want to filter out: which one, and why?
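As a minimal sketch of filtering rows by year, the code below uses a small data frame typed in from the Teams excerpt above instead of the full data set. The name teams_excerpt is hypothetical and used only for illustration. With the Lahman package loaded, the same kind of call could be applied to Teams itself.

```r
# Hypothetical helper data: the six rows of the Teams excerpt shown earlier.
teams_excerpt <- data.frame(
  yearID = c(2000, 1981, 1945, 1922, 1904, 1896),
  teamID = c("MON", "PIT", "WS1", "SLN", "CHA", "BLN"),
  W      = c(67, 46, 87, 85, 89, 90),
  R      = c(738, 407, 622, 863, 600, 995),
  RA     = c(902, 425, 562, 819, 482, 662)
)

# Keep only the rows whose season is 2000 or later.
recent <- subset(teams_excerpt, yearID >= 2000)

# Equivalent base R indexing, shown for comparison.
recent_alt <- teams_excerpt[teams_excerpt$yearID >= 2000, ]

print(recent)
```

Only one row of the excerpt (the 2000 Montreal Expos) passes the filter; the full data set will keep many more.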
Question 5
part a
Plot the relationship between runs and wins using ggplot2 code. Place runs on the x-axis and wins on the y-axis.
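One possible shape for such a plot is sketched below, assuming the ggplot2 package is installed. It uses the six excerpt rows (stored in a hypothetical data frame, teams_excerpt) only to show the syntax; six points are far too few to judge the relationship, so the real plot should use the subsetted Teams data.

```r
library(ggplot2)

# Hypothetical helper data: runs and wins from the Teams excerpt shown earlier.
teams_excerpt <- data.frame(
  R = c(738, 407, 622, 863, 600, 995),
  W = c(67, 46, 87, 85, 89, 90)
)

# Scatterplot with runs on the x-axis and wins on the y-axis.
p <- ggplot(teams_excerpt, aes(x = R, y = W)) +
  geom_point() +
  labs(x = "Runs scored (R)", y = "Wins (W)")

print(p)
```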
part b
Describe the relationship between the two variables, specifically commenting on the form, direction, and the strength of association.
Question 6
part a
Fit a simple linear model to predict wins by runs and save it into m1.
part b
Write out the equation for the linear model (using the estimated coefficients) in mathematical form.
part c
Calculate the \(R^2\) value of the linear model and interpret it in the context of the problem in a sentence.
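The steps in this question can be sketched on the six excerpt rows, again only to show the mechanics. The names teams_excerpt and m_sketch are hypothetical, and the coefficients from such a tiny sample should not be mistaken for the answer on the real data.

```r
# Hypothetical helper data: runs and wins from the Teams excerpt shown earlier.
teams_excerpt <- data.frame(
  R = c(738, 407, 622, 863, 600, 995),
  W = c(67, 46, 87, 85, 89, 90)
)

# Simple linear regression of wins on runs.
m_sketch <- lm(W ~ R, data = teams_excerpt)

# Estimated intercept and slope, i.e. the numbers that go into
# an equation of the form: predicted W = b0 + b1 * R.
b <- coef(m_sketch)
print(b)

# R^2 as reported by summary().
r2 <- summary(m_sketch)$r.squared

# The same quantity computed by hand: 1 - (residual SS / total SS).
ss_res <- sum(resid(m_sketch)^2)
ss_tot <- sum((teams_excerpt$W - mean(teams_excerpt$W))^2)
r2_by_hand <- 1 - ss_res / ss_tot

print(c(r2 = r2, r2_by_hand = r2_by_hand))
```

Computing the value by hand alongside summary() is a quick way to check the interpretation: it is the share of the variation in wins that the fitted line accounts for.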
Question 7
part a
Fit a multiple linear regression model to predict wins using runs *and* runs allowed (RA) and save it as m2.
part b
Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.
part c
How does this model compare to the simple linear regression from the previous question in terms of predictive power?
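A hedged sketch of this kind of comparison, using the six excerpt rows and hypothetical names (teams_excerpt, fit_one, fit_two): it fits a one-predictor and a two-predictor model on the same data and puts their \(R^2\) values side by side. Because the first model is nested in the second, the \(R^2\) of the larger model cannot be lower on the data used for fitting, so adjusted \(R^2\) is printed as well.

```r
# Hypothetical helper data: runs, runs allowed, and wins from the Teams excerpt.
teams_excerpt <- data.frame(
  R  = c(738, 407, 622, 863, 600, 995),
  RA = c(902, 425, 562, 819, 482, 662),
  W  = c(67, 46, 87, 85, 89, 90)
)

# One predictor versus two predictors, fitted to the same rows.
fit_one <- lm(W ~ R, data = teams_excerpt)
fit_two <- lm(W ~ R + RA, data = teams_excerpt)

# Collect R^2 and adjusted R^2 for both models.
comparison <- data.frame(
  model  = c("W ~ R", "W ~ R + RA"),
  r2     = c(summary(fit_one)$r.squared, summary(fit_two)$r.squared),
  adj_r2 = c(summary(fit_one)$adj.r.squared, summary(fit_two)$adj.r.squared)
)

print(comparison)
```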
Question 8
part a
Fit a third, more complex model to predict wins and call it m3. This model should use:
- at least three predictor variables
- at least one non-linear transformation or polynomial term.
part b
Write out the equation for the linear model in mathematical form and calculate its \(R^2\) value.
Question 9
Revisit the definition of causation. If your predictive model has a positive coefficient for one of the predictors, is that evidence that increasing that predictor for a given observation will increase the response variable? That is, can you (or a sports management team) use this model to draw causal conclusions? Why or why not? Answer in at least three sentences. There’s more than one possible answer, so make sure to justify your reasoning.