This is a very preliminary introduction to a cool project based on Retrosheet to sort of reinvent basic baseball statistics using a somewhat more mathematical framework. What's different about it? In a word, consequentialism. All Arctex statistics are based solely on the umpires' decisions about what happened in the game. There are no earned runs or RBIs or at bats, or even hits or errors except as descriptions. It aims to be a complete and reasonably familiar description of baseball using these methods, although the terminology is mostly new.
The name is from "arctangent of expectation values", about which more later after some actual baseball stuff. Skip the rest of this paragraph if you don't want a brief history of the project. This project is a work in progress but close to something interesting (to me at least). It was going along great up through 2019 and then a number of things happened. One is they introduced some new rules that required a bit of a software rewrite - the runner on second base, the 7-inning game, the new postseason format. For the most part that's easy enough (see a complication below), but I delayed too long and accumulated too many irresistible ideas for new stuff, and so the update became too big to tackle at once. And then in 2020 I had a number of new ideas all at once that demanded following up, which eventually resulted in the other texts I wrote. Finally in late 2023 I got around to fixing all that and updating so that I can follow retrosheet again - at last! That release was particularly buggy, but now I'm happy to say I've done another round, smoothed over the bugs, and added some new features again.
Is Shohei Ohtani's 6-for-6 game the best offensive performance ever? An example step-by-step of how Arctex stats are calculated.
- Description of stats and algorithms
- What's next
- Description of each file in the distribution
There are several lines of text above and under the graph. Starting from the top, there is the "annotation" or description of each play in bold. See below in "Annotations and file format" for a description of the abbreviations. Below that is a description of which offensive players advanced, scored, or were out. See "Player Codes" and "Baserunner Codes" below for a description of the syntax. The final line above the graph shows the inning. Inside the top part of the graph, a large * marks the play of the game, defined as the play that makes the largest step change in the game probability estimator. In the first line below the graph, there are ERD codes for each play (see "ERD Codes"). At the start of each half-inning, the offensive team gets its expectation value raised to the value for 0 on, 0 out, which is depicted as a separate column on the graph. Under it, there appears not an ERD code (there is none) but the ballpark code in use for this game, or the PW table code if that's in use - see below "ERV", "Ballpark Codes". Below that is the ERD value (see below). Finally at the bottom there are substitution codes, which use the player letter codes and fielding position number codes. When you see 'n:6', that means player 'n' is now the shortstop. When you see 'k:1,j' that means 'k' is the new pitcher and 'j' is leaving the game at the same time. Perpendicular to the graph is a header which gives the team codes and colors, the series and game number (this doesn't really look right for regular season games), followed by the date and weekday, and in between the retrosheet doubleheader code in parentheses.
A brief word about X, which is a fun stat. It follows how much the game probability estimator gets racked around, more or less. It's high for walkoffs and late lead changes, and low for early inning blowouts. For example, the highest X game was BOS193808232 with a final score 12-14 and a walkoff grand slam, with an X of 85.1. The worst is CLE199010020 which was 0-9 after the first inning and finished 3-13, with an X of 3.2. The game with the most lead changes (7 - there's a tool for this) was MON200005140 which is #15 all time in X with 76.4. For series, the 1986 ALCS had an X of 92.6, whereas the 2013 NLWC game had only 5.7.
Next is the display of hitters - really all offensive players for this game. The first column '#' gives the player's letter code, as used on the graph. The next column 'st' contains several pieces of information variously coded. For the starting lineup, it contains the starting fielding code, or 10 for DH. For substitutes it contains the letter corresponding to the batting slot they enter. This is not necessarily the player they displaced. That can be determined by looking for the previous letter code alphabetically that went into the same batting slot letter, looking up the list (it could be a pitcher below). In general to figure out a substitution completely you have to look at all the substitution codes below the graph, but there are a number of features to help figure out common patterns from the roster list only, via dots appearing around this 'st' letter or number. A middle dot on the right signifies that this player had a substitute fielding code for positions 2-9, which may mean a substitute fielder, or changing to a fielding position later. A period on the right means a pinch hitter, and a comma means pinch runner, and these combine with the fielding dot to form colon and semicolon. This allows you to figure out simple multiple switches, e.g. if a pinch hitter goes into the center fielder's batting slot, and if fielding dots appear on the pinch hitter and the starting left fielder, then the pinch hitter probably went into left and the left fielder moved over.
The stats that appear next are all explained below. Briefly, ACR is the basic run value hitting stat and ACOR the pitching stat. ACW is about the game probability estimator, and R is a stat used only for deciding the winning and losing hitters and pitchers. There are more annotations on the PA column. The game's winning and losing hitter are marked + and -. High dots on the winning side, and periods on the losing side, both indicate that the player passed the 'R/pa' cut described below for the decision. This helps in understanding how the decision works. Also, '..' may appear by itself, indicating a double-pinch, i.e. a pinch hitter who never comes to bat, relieved immediately by another pinch hitter after a pitching change. Finally, the player's name has a hand designation appended. This has a middle dot after it if it indicates the throwing hand, meaning that the player pitched at some point in the season. Sometimes this is the case for position players, which is basically a bug, although a little hard to fix the way it currently works.
Below that is the pitching table for the game, which is much the same as the hitting table. Unlike ACR, ACOR is better when negative, but ACW is still best positive for pitchers. There are a few little differences from the hitting display. For DH games the 'st' column is just blank for relief pitchers who stay out of the batting order. The PA annotations are the same except you don't see double-pinch, and there is another annotation, the comma which means KO (left in the middle of an inning).
First comes the regular season hitting table, which is ordered by ACR. The season leader in ACRc (see below) has a '>' appended to the PA column. '/pa' after 'BR' means BR/PA for the season (lower case for spacing). BR/PA kind of needs a verbal name, and I prefer "no out percentage". 'w' and 'l' are the hitting decisions.
Next, the summary of hitting vs left and right hand pitchers is ordered by LACR - RACR which means that "normal" right handers sort at the top of the table, and "normal" left handers at the bottom. This allows you to see reverse splits and pigeonhole switch hitters at a glance. The stats are described below, but LACR is just batting-only ACR vs left hand pitchers. Beware when the number of PA is very low on either side. If you can't figure out why somebody would be disqualified for a regular season hitting MVP, look here. You have to have positive LACR and RACR to be considered for either MVP.
Next is hitting ERD probabilities, which is sort of Arctex's mixed-up version of singles, doubles, homers, and double plays. What you see is a percentage of all plays this player was involved in as a hitter or baserunner, by ERD value in the sense of greater or equal for + and less or equal for -. So +.2 is very roughly like singles (and everything better), +.5 like doubles, and +1 like homers, with -.5 being like double plays. But the numbers here are more focused on the game effect and categorize all sorts of plays. For example, an aggressive base stealer will have a high rate of +0. Unlike the other tables in this section, this table has a minimum requirement of 150 plays, the reason being that the numbers are pretty worthless with fewer than that. The final column is 'si' or slugging index, which is +.5 + 2 * +1, a number kind of like slugging percentage. It is in fact a good number for categorizing sluggers. Over 30 is very good, and over 40 is terrific (and rare nowadays). The table is sorted by the unobvious metric (+.2 - -.2), which is vaguely interesting.
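The thresholds and the slugging index described above can be sketched in a few lines. This is only an illustration in Python, not the project's own code; the function name `erd_probabilities`, the `min_plays` parameter, and the input format (a flat list of per-play ERD values for one player) are assumptions made for the example. The +0 column is left out because its exact comparison is not spelled out here.

```python
def erd_probabilities(erd_values, min_plays=150):
    """Percent of a player's plays at or beyond each ERD threshold, plus 'si'.

    erd_values: one ERD value per play the player was involved in
    (hypothetical input format). Returns None below the minimum play count.
    """
    n = len(erd_values)
    if n == 0 or n < min_plays:
        return None
    plus = {"+.2": 0.2, "+.5": 0.5, "+1": 1.0}   # greater or equal
    minus = {"-.2": -0.2, "-.5": -0.5}           # less or equal
    table = {}
    for label, t in plus.items():
        table[label] = 100.0 * sum(v >= t for v in erd_values) / n
    for label, t in minus.items():
        table[label] = 100.0 * sum(v <= t for v in erd_values) / n
    # Slugging index as defined in the text: +.5 + 2 * +1
    table["si"] = table["+.5"] + 2 * table["+1"]
    return table
```

Because the categories are cumulative, a play worth +1 also counts toward +.5 and +.2, which is why si weights the +1 column twice on top of its presence in +.5.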
The baserunning and fielding tables should be self explanatory. All -OR stats are best when negative (opponents' runs), and all are sorted best on top. Fielding stats are aggregated by the named categories. Beware numbers with few PA, especially in fielding. Under 1000 PA should be considered an unreliable FCOR number (much more is really best). The FCOR PA numbers are also a good place to see who was a starting player in the season. The BCR PA number is an interesting offensive metric in itself.
The pitching table is sorted by ACOR, best on top. 'pa/3' means PA per 3 outs, i.e. 3 * PA / O, which is sort of parallel to BR/PA for hitters. In the ERD probabilities, an si over 25 is a red flag in my opinion. That table is ordered by (-.5 - +.5), which again seems vaguely interesting.
The postseason series tables display the same information for these series. The PS numbers are an aggregate of the following series numbers. If you're looking at a regular season game, and a player's name appears in the PS summaries but without any numbers, it means the player was on another team. The only thing different here is the indication in the PA column turns into a primitive version of my series MVP designation, based only on ACRc and ACORc. These are now basically obsolete but I don't see any major reason to turn off the code - just know they're not the real award, which is indicated at the end in the MVP section. The character used to indicate this varies depending which series is being considered. In a file for a postseason series, the winning team's player will be indicated '*', although the new series MVP award doesn't care which team you're on. The losing team's best player gets a '>'. For series summaries for other series in the year, the designation changes to '^' because it only considers the listed players, and the real award winner may not appear (this one is fairly useless). For the entire PS summary they get '>' for both teams.
Next is a wildcard runner up display, even in the pre-wildcard era. Division leaders appear at the top above the line marked with a '*', and then runners up which are marked with a dot when wildcard slots exist. The display contains a brief summary of each team's most recent postseason appearance, world series appearance, and world series win.
Below is the postseason summary, with an entry for each postseason series. The series name is given in retrosheet format, which includes imponderables like ALD2 that are best looked up here to find out what that means. Following are the team codes of the series winner then loser, followed by a display of series games called the willow. Each game is represented by a letter decorated with "accents" or dots to indicate a number of interesting things. The letters w and l represent wins and losses by the series winning team, and are upper case whenever the home team wins the game, lower case when the visitors win. A high dot (or caret in text files - a postscript charset issue) before the letter indicates a complete game by the winning pitcher, and this is changed to a '!' for a perfect game. By the way, the Arctex definition of a perfect game is very slightly different, being defined as zero BR, which allows for e.g. a single thrown out at second (consequentialism). A middle dot after the letter indicates extra innings. A comma after the letter indicates a walkoff, and combines with extra innings to form a semicolon. There follows a display of basic stats for each series (plate appearances, outs, baserunners, runs), with each displaying two numbers - before the slash, the winner's total, and after the slash the series total (winner plus loser), so that it appears as a fraction giving the winner's share of the total. Next come the run value ACR and ACOR numbers for the series. They're printed for the winner, but you reverse the numbers for the losing team, i.e. the winning team's ACOR is the losing team's ACR (no sign reversal). Finally there is the series X, followed by the rank in X value (out of 381 through the 2023 postseason).
Finally at the bottom there are (up to) three text summaries. On the left is an accounting of games in the regular season and postseason which should be self explanatory. If there were regular season interleague games, there is a summary of which league won how many in the middle. At the right is a summary that needs a slight explanation, as I couldn't find the right wording to fit there. What it shows is the total length of all games in the year divided by one eighteenth of the total number of half-innings, more or less the length of the average game that year. But it's really calculated as an average of half-innings. Then there's a summary of the total season pace of game or 'P' stat, in plate appearances per minute.
The first column in each table is the old MVP/runner up designation, based solely on ACRc or ACORc for the league-aggregate for the regular season. The second column displays the new MVP winners as * and the runners up as numbers starting with 1. A blank here means the player did not qualify, i.e. had a key stat of the wrong sign. Next is the name and batting or throwing hand which is lowercase to indicate a player's rookie year (simplistically - first calendar year only), followed by the team that player had the most number of PA for in the season. Note, a player who was traded may appear on both sides with different numbers. The headings give the stats that follow - PA is plate appearances, w and l are game decisions, c is complete games, and +1 is +1c described in "ERD probabilities".
Here is the latest software download and release notes. Arctex will always be completely free and open source. I took down the sample PDF files from the website, as it's not clear I'm allowed to post them.
There is a script called run_everything which will do what it says, starting by downloading the Retrosheet big zip file, and finishing by generating the exact PDF that appears above, with all the analyses run in between. There are also some tools for looking at the stats in a number of ways, the most important of which is called bb-erd. This is an interactive command-line tool to query the ERV tables, calculate ERD code values for different ballparks, and to run the game and series probability estimators for live games. You have to run_everything first. bb-erd does at least have a help screen, which you can get by entering ? or h. To run the software, you need: bash perl wget unzip ps2pdf (from ghostscript), i.e. a normal Unix environment. Mac should probably work, but I haven't tried it. The download is around 100KB, but it expands to 5GB when you run_everything, which should take about an hour on a fast machine. Have fun! See below for release notes.
Here's a quick guide to making PDF files with the software. Say you want a single PDF for the 2000 world series. This is now pretty easy.
./series-pdf 2000 WS
And that will give you a file called 2000-WS.pdf. Other series than the world series can be got by substituting NLCS or whatever. If you want a regular season game, you need to have the game's retrosheet ID, e.g. BOS193808232 (see below). Then you can do:
./game-pdf BOS193808232
And that gives you BOS193808232.pdf, easy.
1. Basic stats: pa o r etc., calculated directly from the ERD codes.
2. Run-value stats: starting with the ERV tables, these all have values in runs (per plate appearance typically). These can be quite sophisticated and varied.
3. Log-probability-ratio stats: these are all based on the game probability estimator, which in turn is based on the run-value stats. These are somewhat involved to calculate correctly, and are limited to hitting and pitching, but are also very revealing compared to 1-2.
4. Combined MVP stats: these are all based on products of two terms overall, one a sum of run-value terms, the other a sum of log-probability-ratio terms. These are the best summary combination of what's revealed by all of the above.
Before describing the algorithms in detail, a word about my recommended interpretation of all of this. I certainly don't mean to suggest, by displaying an all-inclusive universe of statistics, that all that has gone before is worthless. I'm just trying to give these peculiar ideas a space to tell their peculiar story in its evident entirety. There is, I believe, a certain mathematical logic to it. Right now hardly anybody is listening to me, and that's as it should be. I'm not entirely convinced by all of these mechanical judgements myself, and I certainly don't suggest anyone else should be. My motivation in coding these algorithms is to give them what I think is the best chance to tell what they have to tell. When I make some fairly arbitrary adjustments, the purpose is to try to show the algorithm in its best light by my judgement, and not to make sure this or that person has a high or low ranking as I wish. After that, I sit back and see what I think. Overall, obviously, I think it's entertaining enough to continue, at least for a while.
lb = pa - ( o + r )    Runners left on base
xo = o + br - pa       Extra outs
ro = o - xo            Regular outs
nr = br - xo           Net runners
The automatic runner on second base in extra innings is counted as a stat called xr (Extra Runners). The full set of equations and inequalities:
lb = pa + xr - ( o + r )
xo = o + br - pa
ro = o - xo
nr = br - xo
o = ro + xo
pa + xr = o + r + lb
pa = ro + br
br + xr = r + lb + xo
nr + xr = r + lb
o >= ro, xo
br >= nr, (r-xr), (lb-xr), (xo-xr)
nr >= (r-xr), (lb-xr)
pa >= br, ro, nr, (o-xr), (r-xr), (xo-xr), (lb-xr)
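The definitions and the equalities above can be checked mechanically. The following is a small Python sketch (the project itself is written in Perl and bash); the function names `derived_stats` and `identities_hold` are made up for this example, and only the equalities are checked, not the inequalities.

```python
def derived_stats(pa, o, br, r, xr=0):
    """Derive lb, xo, ro, nr from the counted stats, per the definitions above."""
    xo = o + br - pa           # extra outs
    ro = o - xo                # regular outs
    nr = br - xo               # net runners
    lb = pa + xr - (o + r)     # runners left on base
    return {"lb": lb, "xo": xo, "ro": ro, "nr": nr}


def identities_hold(pa, o, br, r, xr=0):
    """Return True if all the listed equalities hold for these counts."""
    d = derived_stats(pa, o, br, r, xr)
    lb, xo, ro, nr = d["lb"], d["xo"], d["ro"], d["nr"]
    return all([
        o == ro + xo,
        pa + xr == o + r + lb,
        pa == ro + br,
        br + xr == r + lb + xo,
        nr + xr == r + lb,
    ])
```

For example, a half-inning with 5 plate appearances, 3 outs, 3 baserunners and 1 run gives xo = 1, ro = 2, nr = 2, lb = 1. The equalities follow algebraically from the definitions, so `identities_hold` is mainly useful as a guard against transcription errors in the definitions themselves.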
number of outs in the half-inning
runner on first
runner on second
runner on third
The last 3 can be conveniently coded as an octal digit (sum of 1 for first, 2 for second, and 4 for third).
The general algorithm to compute the ERV table is to consider the ERD code for every play for the time period and or ballpark under consideration, while keeping a double table of numerator and denominator counts for all 24 table entries. For each code, the denominator is increased by one for the corresponding table entry. The state code (outs, bases) is then added to a list accumulated over the half-inning. Finally, the list of accumulated state codes is visited to add the number of runs scored on the play to the numerator value for every table entry on the state code list, including duplicates (as the denominators were increased in duplicate). After all ERD codes have been processed, the ratios are formed. These are tabulated generally under the name E indexed twice as E[bases][outs].
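As a rough illustration of the numerator/denominator bookkeeping just described, here is a Python sketch. It is not the project's code: the input format (a list of half-innings, each a list of `(bases, outs, runs)` tuples giving the state before the play and the runs scored on it) and the names `base_code` and `erv_table` are assumptions for the example.

```python
def base_code(first, second, third):
    """Octal digit for the bases: 1 for first, 2 for second, 4 for third."""
    return (1 if first else 0) + (2 if second else 0) + (4 if third else 0)


def erv_table(half_innings):
    """Build E[bases][outs] from plays grouped by half-inning.

    Each play is (bases, outs, runs): the state before the play and the
    runs scored on it (hypothetical input format).
    """
    num = [[0.0] * 3 for _ in range(8)]
    den = [[0] * 3 for _ in range(8)]
    for plays in half_innings:
        visited = []                      # state codes seen so far this half-inning
        for bases, outs, runs in plays:
            den[bases][outs] += 1
            visited.append((bases, outs))
            # Credit the runs to every state visited, duplicates included,
            # since the denominators were also increased in duplicate.
            for b, o in visited:
                num[b][o] += runs
    # Unobserved states are left as None rather than given a made-up value.
    return [[num[b][o] / den[b][o] if den[b][o] else None for o in range(3)]
            for b in range(8)]
```

Each entry then comes out as the average number of runs scored from that state to the end of the half-inning, which is the usual meaning of a run expectancy table.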
Code letters
Usual normal half-innings
  p  Plate appearance
  n  Not a plate appearance
Usual potential walkoff half-innings
  q  Plate appearance
  r  Not a plate appearance
Usual walkoff play
  w  Plate appearance
  u  Not a plate appearance
Extra inning with automatic runner - top of inning
  P  Plate appearance
  N  Not a plate appearance
Extra inning with automatic runner - bottom of inning
  Q  Plate appearance
  R  Not a plate appearance
Extra inning with automatic runner - walkoff play
  W  Plate appearance
  U  Not a plate appearance
Game over early for arbitrary reason
  Normal half-innings
    x  No play
  Potential walkoff half-innings
    X  No play
The need for p and n in the determination of the BR stat has been explained. The q and r codes are needed because the bottom of the ninth and later innings are actually subject to different rules. For example in a tie game with more than one runner on base, only one run will be allowed to score unless the hit is out of the park. This fact seriously changes the expectation values and necessitates the use and computation of a different set of ERV tables, which are called PW for potential walkoff. Furthermore there must be a set of these, for different values of the half-inning-initial score difference. Experiment has shown that 5 tables are necessary for initial score differences of 0-3 and 4 or greater. Not surprisingly the effect wears off after 4 runs of difference. The handling of these extra tables will be explained below, under the discussion of ballpark codes.
Following the need for potential walkoffs, there are actual walkoffs, with codes w and u. The codes are necessary to signal the cancellation of the expectation value of the remaining baserunners. It's also convenient to know when the game ended at an unusual time. The code x was also added for games arbitrarily ended early. The x code comes after the last play of the game, and exists mainly to provide an ERD code of non-zero value which is not assigned to any player, as the early game ending is often prejudicial to the more sophisticated player statistics. Since the x code has a value, it must have a different value in PW half-innings, so a second code X is needed (it's never a plate appearance).
Finally, for various reasons to do with some of the more sophisticated stats, it is desirable to know whether the half-inning started with an automatic runner on second. No separate ERV table is needed here, and the runner itself can be handled within the codes, but the fractional R stat assigns different values to the codes in this situation.
These various elements are gathered together in this order:
(initial base octal code)(letter code)(final base octal code)(initial outs)(final outs)(runs)
In fact it is not necessary to include the number of runs scored in the play. This can be computed from the previous components. But it is more convenient to have it, and a more sophisticated reason is that it forms a check digit.
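To make the check-digit idea concrete, here is a Python sketch under some loudly stated assumptions: that each of the six components is a single character, that the runs can be recovered by conservation of runners (runners before, plus the batter on a plate appearance, equals runners after plus outs made plus runs), and that this rule is applied only to the ordinary p, n, q, r codes and their upper-case variants. None of this is taken from the project's source; the names `parse_erd_code`, `runs_from_state`, and `check_digit_ok` are invented for the example.

```python
from collections import namedtuple

ErdCode = namedtuple("ErdCode", "base0 letter base1 out0 out1 runs")

# Letters that denote a plate appearance, per the code letter table.
PA_LETTERS = set("pqwPQW")


def parse_erd_code(text):
    """Split a 6-character code into its parts (assumed one character each)."""
    if len(text) != 6:
        raise ValueError("expected 6 characters: " + repr(text))
    return ErdCode(int(text[0], 8), text[1], int(text[2], 8),
                   int(text[3]), int(text[4]), int(text[5]))


def runs_from_state(code):
    """Runs implied by the other components, by conservation of runners."""
    batter = 1 if code.letter in PA_LETTERS else 0
    before = bin(code.base0).count("1") + batter
    after = bin(code.base1).count("1") + (code.out1 - code.out0)
    return before - after


def check_digit_ok(code):
    """True if the stored runs digit agrees with the implied runs."""
    return runs_from_state(code) == code.runs
```

For instance, `1p0002` would read as a runner on first, a plate appearance, bases empty afterwards, 0 outs before and after, and 2 runs: a two-run homer, and the implied run count agrees with the final digit.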
ERD = runs + E[base1][out1] - E[base0][out0]
If the code letter is w, u, or x, then
ERD = runs - E[base0][out0]
If these values are summed over a half-inning, an integral number is produced by algebraic cancellation.
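The cancellation can be seen in a short Python sketch. Two assumptions are made here that are not spelled out above: the expectation value after the third out is taken to be zero, and the E[0][0] value credited at the start of the half-inning is counted in the sum. The table used in the usage note is made-up numbers, purely to exercise the arithmetic, and the function name `erd_value` is hypothetical.

```python
def erd_value(E, base0, out0, base1, out1, runs, letter="p"):
    """ERD for one play, following the two formulas above.

    E is indexed E[bases][outs] with outs 0-2; the value after three outs
    is assumed to be 0 (an assumption of this sketch).
    """
    start = E[base0][out0]
    if letter in ("w", "u", "x"):
        return runs - start
    end = E[base1][out1] if out1 < 3 else 0.0
    return runs + end - start


def half_inning_total(E, plays):
    """E[0][0] plus the ERD of every play; should equal the runs scored."""
    return E[0][0] + sum(erd_value(E, *p) for p in plays)
```

With any table at all, a half-inning of single, two-run homer, and three outs telescopes to exactly 2: every intermediate E value appears once with a plus sign and once with a minus sign.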
However these are only used for games when ballpark tables are not available. Ballpark tables are of course computed for all major parks, so why would they not be available? This brings up the important concept of statistical base. Since expectation values are to be computed, and more importantly to have their differences computed, their values must be computed to tolerable accuracy. It turns out that, given the uneven distribution of game states, it takes well over a year to accumulate a tolerable statistical base. I prefer about three years for a single ballpark. The game states in many of the PW tables are even rarer than in the main tables, and this means that they can only be calculated universally. Their names are therefore PW0, PW1, ... PW4.
For each ballpark, it is desirable to split up the tables by era somehow to capture the slow change of the entire game over time. On the other hand I don't really like using decade tables which cause a huge jolt in the stats rather artificially every 10 years. As a compromise I set ballpark-and-era tables with unique eras for each ballpark, all overlapping and changing at different times. The result is that I assign a 3-letter code to each ballpark, plus a sequential era code starting with 0. The result looks like FEN.4, which is Fenway from 1988 to 2005, or POL.0, which is the Polo Grounds from 1901-1911. The dates where they change are supposed to represent occasions where there was a change in configuration or something like that. The complete list is in rs2erd. If a ballpark table is unavailable, the league-decade table is used instead.
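The lookup with its fallback might be sketched as follows in Python. Only the two eras quoted above (FEN.4 and POL.0) are real; the `ERAS` structure, the function name `table_for`, and the `(league, decade)` form of the fallback key are hypothetical, since the real list lives in rs2erd and the naming of league-decade tables is not given here.

```python
# Deliberately incomplete sample: only the two eras quoted in the text.
# Each entry is (era number, first year, last year), years inclusive.
ERAS = {
    "FEN": [(4, 1988, 2005)],
    "POL": [(0, 1901, 1911)],
}


def table_for(park, year, league):
    """Return a ballpark-era code like 'FEN.4', else a league-decade key.

    The (league, decade) tuple is a hypothetical stand-in for whatever
    name the league-decade tables actually carry.
    """
    for era, first, last in ERAS.get(park, []):
        if first <= year <= last:
            return "%s.%d" % (park, era)
    return (league, year - year % 10)
```

So `table_for("FEN", 1995, "AL")` gives `FEN.4`, while a park code missing from the list falls through to the league-decade key.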
A suspended game was recently completed in which the same player played on both teams via trade. The software today is not ready for this, and I don't know exactly what the answer is. I may wind up preprocessing the input to fake a second player ID until I have a better answer. The software generally processes all players together, asking a particular data structure which team they're on, and this approach can't work now. I don't particularly want to change tens of thousands of lines of code to accommodate one game, or a very small number of games. Something hacky may result.
The annotation is based on the retrosheet coding of the play, but with many small differences. All of the really peculiar ones are supposed to be described here for reference. One of the most unobvious is that N can be used as a suffix to mean 'iNterference', normally followed by a code indicating who did the interfering. In this case fielders are indicated by number and others by special letters: 'B' batter, 'R' runner, 'F' fan, 'U' umpire. Obstruction is 'OB'. When one runner is out passing another, that's 'RPAS'. When a runner is out hit by the ball off the bat, that's 'RBAT'. Errors are signalled by 'E', but not all are present (e.g. foul errors are generally omitted because nothing happens). Double plays should contain 'DP' and triple plays 'TP', and these are only present if the outs are finally made.
When the ball is put in play there is a prefix code which indicates how, in a certain rather peculiar categorization. The main opposition is between 'G' for ground and 'F' for fly. 'F' has quasi-synonyms 'P' for pop-up and 'L' for line-drive, and any of these three may appear alone (before a fielding code) for an out. Bunts get the letter 'B' prefixed to 'G', 'P', or 'L'. The sac-fly is 'SF', however the sac-hit is not indicated in any clear way. I maintain retrosheet's distinction between 'FO' for force outs, and 'FC' for tag-outs, and these may have prefixes listed above if they're appropriate. The remaining letters mainly mean what they normally mean in baseball, if you can peel off the prefixes and suffixes. You may be excused for not knowing off the bat what is a 'LDGRNF' - a line-drive double ground rule interference by a fan - or a 'BPFCE' - bunt pop-up fielder's choice error, or even a 'PIF' pop-up infield fly rule. There's a new thing called 'FCNO' which is a fielder's choice no out with no error. A good pronunciation should auto-suggest.
The hits are 'S', 'D', 'T', 'HR'. If the first three are followed by fielding codes somebody is out unless there's an error. If 'HR' is followed by numbers it is inside the park, and nobody is out. All of these things are clearly indicated in the ERD code and the baserunning codes. Fielding codes only indicate handling of the ball, and never mere positioning. Baserunning plays have a fairly peculiar coding, but this comes straight from retrosheet, and from the way such plays are scored. In short, you have 'SB', 'CS', 'PO', 'POCS' (both on the same out), 'WP', 'PB', and the weird 'OA' which is really an out on what would otherwise have been a WP or PB. And 'BK' for balk.
None of these are used for any statistical purpose.
The remaining complication from line 3 of the header is when the home team bats first, indicated by the flag 'HTBF'. My coding has the home team batting second, and the home team is coded as a visitor. The ballpark code is for the real home team, and the game is from the same file as in retrosheet. In other words, the retrosheet codebase is apparently flexible about the team batting order but inflexible about the file location of the game, whereas my codebase is the other way around. Everything after this is normal, as far as the transformation goes. Note the game ID is formed with the visiting team code. When processing an ERD file, that's basically all you need to cue from the HTBF flag.
For precision, separate names will be given to three different interpretations of this stat by suffixing the letters 'c' for Count, 'r' for Ratio, and 'a' for Adjusted. First, the sum of ERD terms for a player is ACORc, which has units of runs. When ACORc is divided by plate appearances, this produces ACORr with units of runs per plate appearance. Finally ACORr is multiplied by 9 * PAPG to produce ACORa, with units of runs per game. The PDF products refer to ACORa exclusively, and call this ACOR. The data files on disk contain ACORr, which is called the pitching average.
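The three interpretations are simple arithmetic, sketched here in Python. The function name `acor_interpretations` is invented, and `papg` is passed in as a parameter because its value is not defined in this part of the text; the factor 9 * PAPG is taken exactly as stated above.

```python
def acor_interpretations(erd_terms, pa, papg):
    """Return (ACORc, ACORr, ACORa) from a pitcher's ERD terms.

    erd_terms: the ERD values attributed to the pitcher
    pa:        plate appearances (the denominator for the ratio)
    papg:      the PAPG constant, supplied by the caller (assumption)
    """
    acor_c = sum(erd_terms)            # Count: runs
    acor_r = acor_c / pa               # Ratio: runs per plate appearance
    acor_a = acor_r * 9 * papg         # Adjusted: as stated in the text
    return acor_c, acor_r, acor_a
```

A pitcher with ERD terms summing to -0.5 over 10 plate appearances has ACORc = -0.5 and ACORr = -0.05; the adjusted figure then depends only on the PAPG value supplied.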
This number and ACR are centered on zero by virtue of the fact that the E[0][0] value at the start of the half-inning is never attributed to any player, and so sets the level for what is to follow, defining an average to beat or fall short of, which becomes the sign of the player's statistic.
Another factor to consider in any ratio statistic is how long it takes to converge to a sensible value. Too few plate appearances in a per plate appearance average make for essentially random numbers. Not all statistics take the same number of plate appearances to converge. ACOR and ACR are fairly quick to converge, although the fractional R stat is specially designed to make sense of a single game. Overall I would rate the length of time required for the various main statistics to converge as fR < ACOR+ACR < BCR < ACW < FCOR.
1. If the ERD is negative and at least one baserunner is out:
Then divide the ERD value equally among all who were out
2. If the ERD is positive and at least one baserunner advanced:
Then divide the ERD value equally among all who advanced
3. Otherwise:
Divide the ERD value equally among all baserunners who either advanced or were out
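The three rules above translate almost directly into code. This Python sketch assumes the caller already knows which players advanced and which were out on the play; the function name `split_erd` and the use of plain lists of player codes are choices made for the example, not the project's data structures.

```python
def split_erd(erd, advanced, out):
    """Share one play's ERD among offensive players, per the three rules.

    advanced, out: lists of player codes who advanced / were out on the play.
    Returns a dict mapping player code to that player's share.
    """
    if erd < 0 and out:
        share = list(out)                          # rule 1
    elif erd > 0 and advanced:
        share = list(advanced)                     # rule 2
    else:
        share = list(advanced) + list(out)         # rule 3
    share = list(dict.fromkeys(share))             # drop repeats, keep order
    if not share:
        return {}                                  # nobody to credit
    return {player: erd / len(share) for player in share}
```

For example, a double play worth -0.6 with runners 'b' and 'c' out charges each of them -0.3, and a runner who merely advanced on the same play gets nothing.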
These values are split up by year:team:hitter:series. The accumulated average is divided by the number of PA and tabulated by year (file name) and hitter, team, and series. For display, ACR is multiplied by PAPG and then rounded to 3 decimal places. Positive values are better than negative. ACR means Average Contribution to Runs. Similar to ACOR, this is available in three interpretations, ACRc, ACRr, and ACRa, with the same definitions.
sir = +.5r + 2 * +1r
BCRa += ERD - ( ACORr + ACRr )
Plate appearances are counted as usual (assigned to the baserunners) and used as the denominator for the 'r' stat. At present, the 'a' stat is computed exactly as for ACR, although in principle a slightly different number should be used to produce a true runs per game average, i.e. average plate appearances on base per game. BCR is tabulated per year, not per team and per series.
( E[0][o] - E[0][o+1] ) * nPNO[p][o] / nPNO[o]
Where the E[0] becomes E[2] with the automatic runner on second.
Fractional plate appearances are generated in two cases: u-type plays, and n-type plays. For a walkoff u-type play, the pitcher is given a flat 0.1 PA if they don't have any PA already. When a relief pitcher is brought in, the outgoing pitcher is examined to see if they have any PA. If not, it is checked whether they were pitcher of record for at least one play. If so, a count called NPIT is made, and increased for every subsequent relief pitcher. NPIT is reset when a plate appearance is completed, and every pitcher counted as NPIT receives 1/NPIT plate appearances (NPIT is at least two when reset).
Each decision is completed from the list thus generated by making further cuts. All of the decisions have in common that all players remaining after the last cut share the decision. In some cases, all the players who pass the "would have won/lost" test share the decision, but in most cases the list is reduced to one.
hwp = 0.5 + ( 1 / pi ) * arctan ( Z + ( X - A[o] ) / B[o] )
Where o is the number of outs in the game, A[o] and B[o] are tables of fitted coefficients for the estimator, and Z is a value calculated to adjust the game-initial probability, which is constant for the length of a game. The values A[o] and B[o] are fitted in advance by simulated annealing, using the entire retrosheet event file database. The number of outs tabulated runs from 0 to 59, where extra innings are looped around from outs 54-59 as often as necessary.
When Z is left to zero, the estimator reports a game initial probability of 53% for the home team to win, equal to the all time record. It may be desirable to adjust this to particular values when more information is available, without re-fitting the estimator's parameters. This is done by adding in the Z term. For regular season games, a ballpark home win rate is calculated for each ballpark code, and the Z value is adjusted to make this value the initial estimate. An even better reason for making this adjustment comes from the postseason. On the basis of the above estimator, a probability estimator for 232-format series was developed.
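As a minimal sketch of the estimator and the Z adjustment described above, here is some Python. The tables A and B below are made-up placeholders, not the fitted coefficients, and the function names (out_index, hwp, solve_z) are hypothetical. X is simply the estimator's input as it appears in the formula, and x0 stands for whatever value X takes at the first pitch.

```python
import math

# Placeholder coefficient tables for outs 0..59 (assumption). The real A[o]
# and B[o] are fitted by simulated annealing; with these placeholders the
# Z = 0 estimate at X = 0 is 50%, not the 53% of the fitted estimator.
A = [0.0] * 60
B = [3.0] * 60

def out_index(outs):
    """Map the number of outs in the game to a table index 0..59.

    Extra innings loop around outs 54-59 as often as necessary.
    """
    if outs < 60:
        return outs
    return 54 + (outs - 54) % 6

def hwp(x, outs, z=0.0):
    """Home win probability from the arctangent estimator."""
    o = out_index(outs)
    return 0.5 + math.atan(z + (x - A[o]) / B[o]) / math.pi

def solve_z(target, x0=0.0):
    """Z that makes the game-initial estimate (0 outs, X = x0) equal target."""
    return math.tan(math.pi * (target - 0.5)) - (x0 - A[0]) / B[0]

# Example: force a game-initial home win probability of 55%.
z = solve_z(0.55)
print(hwp(0.0, 0, z))
```

Solving for Z this way just inverts the formula at the start of the game, so the fitted A[o] and B[o] stay untouched, which is the point of having a separate Z term.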
The estimator gives the probability of the home advantage team to win the series, with the series state taken as ( home advantage team wins, other team wins ). A table of observed win rates is calculated for each state, and then during a game the estimate is taken by using the home-team win probability calculated above to lever between the observed win rates for the states where the home team wins or loses. E.g. for game 1, 2, 6, or 7 that is
swp = hwp * s232[ hw + 1 ][ vw ] + ( 1 - hwp ) * s232[ hw ][ vw + 1 ]
where hw is the number of games won so far by the home advantage team, vw for the other team, and s232[ hw ][ vw ] is the table of win rates for 232-format series. The reason for the Z adjustment in the hwp estimator is to make the swp estimate continuous. In other words, the table of 232 series states implies game-initial probabilities via equations such as
swp0 = ( s232[0][0] - s232[0][1] ) / ( s232[1][0] - s232[0][1] )
which gives the series-initial estimate.
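The lever and the implied game-initial probability can be sketched in Python as follows. The numbers in the s232 table are invented for illustration and are not the observed win rates; implied_hwp is a hypothetical helper name. The sketch only covers games where the home advantage team is at home (games 1, 2, 6, and 7).

```python
# Hypothetical win rates indexed as s232[hw][vw], where hw is games won by
# the home advantage team and vw is games won by the other team.
s232 = [
    [0.53, 0.40],
    [0.65, 0.52],
]

def swp(hwp, hw, vw, table):
    """Series win probability, levering between the two possible next states."""
    return hwp * table[hw + 1][vw] + (1 - hwp) * table[hw][vw + 1]

def implied_hwp(hw, vw, table):
    """Game-initial hwp at which swp equals the tabulated value for (hw, vw)."""
    win = table[hw + 1][vw]
    lose = table[hw][vw + 1]
    return (table[hw][vw] - lose) / (win - lose)

p0 = implied_hwp(0, 0, s232)
print(p0, swp(p0, 0, 0, s232))
```

Feeding the implied value back into swp reproduces the table entry for the current state, which is what makes the series estimate continuous from one game to the next.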
HWD = log(hwp) - log(previous hwp)
or lost:
HWD = -log(1 - hwp) + log(1 - previous hwp)
Finally, the HWD is added to the averages of home team hitters and visiting team pitchers, and subtracted from visiting hitters and home pitchers. In other respects, this stat is calculated in the same way as ACR and ACOR. In particular, the rules for attributing the HWD to baserunners are the same as for ACR, but with the sign of the HWD in place of the sign of ERD. This produces ACWc and ACWr. For ACWa, the factor is 10 * PAPG for both hitting and pitching. The ACWc for a nominal win is equal to log(2). For ACWa, a single plate appearance with this value would display as 29.493. This is hardly ever relevant, which is why I didn't choose a more sensible value for this.
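Here is a small Python sketch of the two HWD formulas, with a made-up sequence of hwp values for a game the home team wins. The function name hwd and the sample path are assumptions for illustration only.

```python
import math

def hwd(prev_hwp, cur_hwp, home_won):
    """Log-probability step for one play, from the home team's point of view."""
    if home_won:
        return math.log(cur_hwp) - math.log(prev_hwp)
    return -math.log(1 - cur_hwp) + math.log(1 - prev_hwp)

# Invented path of the estimator: starts at 0.5, ends at 1.0 (home win).
win_path = [0.5, 0.6, 0.45, 0.8, 1.0]
win_total = sum(hwd(a, b, True) for a, b in zip(win_path, win_path[1:]))

# Invented path for a home loss: starts at 0.5, ends at 0.0.
loss_path = [0.5, 0.4, 0.7, 0.0]
loss_total = sum(hwd(a, b, False) for a, b in zip(loss_path, loss_path[1:]))

print(win_total, loss_total)
```

The steps telescope, so a game that starts at 0.5 sums to log(2) for a home win and -log(2) for a home loss regardless of the path in between, consistent with the log(2) figure quoted for a nominal win.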
ACW has never been calculated for a live game but it could be, although not knowing who wins there are two potential values. I provisionally call them hot ACW and cold ACW, based on the assumption of the player's team winning or losing respectively. I eventually plan to add support for calculating this to the program bb-erd, but it requires kind of a lot of information to be entered - ERD code, ballpark code, score difference, and half-inning number for each play for a start, and the baserunner rules are something else.
X = sqrt( sum( ACW^2 ) )
X only has Xc and Xa values, and the latter is multiplied by 20. Like ACW, X is undefined for a tie game. Why compute this? It's basically a measure of how "long a path" the probability estimator takes to the win or loss. Or in more technical language, a confusion metric, as it is high when the estimator can't make up its mind, i.e. has low information. In ordinary human terms it means an exciting game.
X is also defined for series as
series X = sqrt( sum( game X^2 ) )
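Both definitions are root-sum-of-squares, which a couple of lines of Python make concrete. This sketch assumes the game sum runs over the per-play log-probability steps of one game; the function names and sample numbers are hypothetical.

```python
import math

def game_x(steps):
    """X for one game: square root of the sum of squared per-play values."""
    return math.sqrt(sum(s * s for s in steps))

def series_x(game_xs):
    """X for a series: square root of the sum of squared game X values."""
    return math.sqrt(sum(x * x for x in game_xs))

# Two invented games with the same net movement (+0.6 in total).
steady = game_x([0.2, 0.2, 0.2])
swingy = game_x([0.9, -0.8, 0.5])
print(steady, swingy, series_x([steady, swingy]))
```

Because the steps are squared, back-and-forth swings add up instead of cancelling, so the swingy game scores higher even though both end in the same place.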
There is a complication, which is that the product of two or more signed terms does not produce a satisfactory scalar, due to sign ambiguity. One hitter with positive ACR and ACW would wind up the same as another hitter with negative ACR and ACW. There are different ways of dealing with this, but each ranking must choose one.
B vs LHP = ACRc * LACRc^(3/2) * ACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
B vs RHP = ACRc * RACRc^(3/2) * ACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
LHP = ACORr * ACWr * O^2 * sqrt( (W - L) / (W + L) )
RHP = ACORr * ACWr * O^2 * sqrt( (W - L) / (W + L) )
Hitters are eligible for either or both awards if the following are positive:
ACRr ACWr LACRr RACRr (W - L) (W + L)
The +1c term is made positive by adding 0.01 prior to the sqrt, in order to give unique rankings to players with zero +1c.
Pitchers are eligible based on their throwing hand and whether the following are positive:
-ACORr ACWr (W - L) (W + L)
The eligibility criteria are relatively strict, but they produce enough runners up every year. These awards are a combination of more factors than the others to make them a little more unpredictable.
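To show how the eligibility test sidesteps the sign ambiguity, here is a rough Python sketch of the hitter formula only. All names are hypothetical. It reads the eligibility list as applying LACRr to the award against LHP and RACRr to the award against RHP, and it assumes the 'c' and 'r' versions of a stat share a sign; both are my assumptions here, not something spelled out above.

```python
import math

def hitter_eligible(acr_r, acw_r, platoon_acr_r, wins, losses):
    """True if every eligibility term for one hitting award is positive."""
    terms = (acr_r, acw_r, platoon_acr_r, wins - losses, wins + losses)
    return all(t > 0 for t in terms)

def hitter_score(acr_c, platoon_acr_c, acw_r, plus1_c, wins, losses):
    """B vs LHP (or RHP) score; platoon_acr_c is LACRc (or RACRc).

    Only meaningful for eligible hitters, where every factor is positive.
    """
    return (acr_c
            * platoon_acr_c ** 1.5
            * acw_r
            * math.sqrt(plus1_c + 0.01)   # 0.01 keeps zero +1c rankings unique
            * math.sqrt((wins - losses) / (wins + losses)))

# Invented numbers for one hitter.
if hitter_eligible(0.02, 0.5, 0.03, 6, 3):
    print(hitter_score(2.0, 4.0, 0.5, 0.99, 6, 3))
```

Checking eligibility first means the product is only ever formed from positive factors, so a hitter with negative ACR and negative ACW can't sneak in with a positive score.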
smvpRV = ACRc - ACORc
And the log-probability-ratio term is equally simple (h and p refer to hitting and pitching):
smvpLPR = hACWc + pACWc
The full metric is just:
smvp = smvpRV * smvpLPR
Eligibility is based on the following being positive:
smvpRV smvpLPR (hPA + pPA)
This seems to be a reliable and sensitive ranking putting hitters and pitchers on an equal footing over the length of a series.
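The series MVP metric is short enough to write out directly. This is only a sketch with hypothetical argument names; it returns None for ineligible players so they drop out of the ranking.

```python
def smvp(acr_c, acor_c, h_acw_c, p_acw_c, h_pa, p_pa):
    """Series MVP score, or None if the player is not eligible."""
    rv = acr_c - acor_c          # run-value term (smvpRV)
    lpr = h_acw_c + p_acw_c      # log-probability-ratio term (smvpLPR)
    if rv > 0 and lpr > 0 and (h_pa + p_pa) > 0:
        return rv * lpr
    return None

# Invented examples: a good series, and a bad one whose two negative
# terms would otherwise multiply to a positive score.
print(smvp(3.0, 0.0, 0.4, 0.0, 25, 0))
print(smvp(-3.0, 0.0, -0.4, 0.0, 25, 0))
```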
hofRV = ACRc + BCRc - ( ACORc + (FCORc / 3) )
hofLPR = hACWr + pACWr
hofWL = HW + PW - (HL + PL)
Next, the bias terms (negated because they're always negative):
hofRVbias = - min (all players) hofRV + 1e-9
hofLPRbias = - min (all players) hofLPR + 1e-9
hofWLbias = - min (all players) hofWL + 1
Finally, the combined metric:
hof = ( hofRV + hofRVbias ) * ( hofLPR + hofLPRbias ) * sqrt(hofWL + hofWLbias) * ( total PA from ACR+ACOR+BCR+(FCOR/3) )^(1/4)
Note that FCOR is reduced in strength by a factor of 3 throughout, which turns out to be a surprisingly complicated judgement. The basic reason is that FCOR is the least reliably calculated of the main stats, and by the nature of baseball it was giving itself, in effect, too much weight in the sum, having typically 9 or so times as many plate appearances as hitting. Why reduce it by a factor of 3 and not something else? Because it also, as a side effect, changes the balance between hitters and pitchers in the ranking. It doesn't cause one to simply rise or fall, but reducing FCOR has the effect of making pitchers cluster nearer the center of the list. This effect is visible in the published rankings to some extent - hitters tend to be at the very top and bottom, although a few pitchers come very close at both ends. Hitting averages tend to be more concentrated than pitching averages, and the effect of FCOR in the RV term is to moderate the overall RV of hitters, as FCOR tends to be even smaller than ACOR. The factor of 3 seems to best balance these two competing tendencies.
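Putting the three terms, the bias terms, and the combined metric together, a sketch in Python might look like this. The dictionary keys (including the pa* keys for plate appearance counts) and the two sample players are invented for illustration; only the arithmetic follows the formulas above.

```python
import math

def hof_scores(players):
    """players: dict of name -> dict of stats. Returns name -> hof score."""
    rv = {n: p["ACRc"] + p["BCRc"] - (p["ACORc"] + p["FCORc"] / 3)
          for n, p in players.items()}
    lpr = {n: p["hACWr"] + p["pACWr"] for n, p in players.items()}
    wl = {n: p["HW"] + p["PW"] - (p["HL"] + p["PL"])
          for n, p in players.items()}
    # Total PA, with fielding PA reduced by the same factor of 3.
    pa = {n: p["paACR"] + p["paACOR"] + p["paBCR"] + p["paFCOR"] / 3
          for n, p in players.items()}

    # Bias terms shift every player's terms to be positive.
    rv_bias = -min(rv.values()) + 1e-9
    lpr_bias = -min(lpr.values()) + 1e-9
    wl_bias = -min(wl.values()) + 1

    return {n: (rv[n] + rv_bias) * (lpr[n] + lpr_bias)
               * math.sqrt(wl[n] + wl_bias) * pa[n] ** 0.25
            for n in players}

base = dict(ACRc=0, BCRc=0, ACORc=0, FCORc=0, hACWr=0, pACWr=0,
            HW=0, PW=0, HL=0, PL=0, paACR=16, paACOR=0, paBCR=0, paFCOR=0)
players = {
    "a": {**base, "ACRc": 10, "BCRc": 2, "ACORc": -3, "FCORc": -3,
          "hACWr": 0.2, "pACWr": 0.1, "HW": 5, "HL": 2},
    "b": {**base, "ACRc": -5, "hACWr": -0.1, "HL": 2},
}
scores = hof_scores(players)
print(scores)
```

Because the biases are taken from the minimum over all players, every factor is positive before multiplying, which is one way of handling the sign ambiguity; the player at the bottom of a term ends up with a tiny but nonzero factor.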
Wow, what a year. My team won the world series! I'll have to wait for the data files like everyone else, but I feel pretty confident that Freddie will get the arctex MVP. With baseball all done, it will soon be time for me to take another whack at arctex over the holidays. I have a pretty good list of things queued up to work on. First, some simple things. I intend to make the constant factor for BCRa reflect the average number of plate appearances on base. I also decided it makes more sense to have the constant factor for ACWa be the same as for Xa. The correctors for BCR and FCOR are calculated per plate appearance, but are applied per play, which seems wrong, so I'll fix that. I'm contemplating giving ACW a role in the decisions, possibly in such a way that the individual decisions may be suppressed, e.g. in a 1-0 game there may not be a losing pitcher. I'm thinking of doing a big linear regression on game time versus PA, pitching KOs, and half-inning changes, and then redefining the P stat based on those results, i.e. subtracting out the estimated time for KOs and inning changes per game before computing P, which should sharpen the stat. The series MVPs should be listed somewhere near the series summaries. The problem is that there isn't much room on the page for 2020, so the MVPs might go underneath the leaderboard on the following page. Oh, and I never got around to coding in the new end-of-season tiebreaker rules, so those should go in also.
The big thing I want to do for the next update is to sharpen the ACW and X stats by introducing two new probability estimators. There's already an experimental estimator for the 232-format series, but so far I haven't used it to calculate any stats, basically because of the limitation to the 232 format. So the first order of business is to introduce a new binomial-model estimator for any series format, and then to use it to calculate stats that will be called sACW and sX, where the existing stats will be renamed gACW and gX, g for game and s for series. The X values for the postseason series will be replaced by sX, which should give a better ranking. For example, the 2004 ALCS ranks only #21 all time in gX, but it will surely rank much higher in sX. Also sACW can have a role in the series MVP award, possibly in combination with gACW. Finally, there will be the big one - what I call the u for unified or universal estimator, essentially "win the world series from opening day". The way this will work is that during the regular season I will have a function with values tabulated from history, of the expected number of games to exceed a threshold, given games behind and games left. This same function will be used for all postseason slots, with each team's value being a sum for all the slots they're eligible for. With these values for all teams in hand, they will be converted to probabilities by normalizing the sum of all teams to one. Once the postseason starts, the u estimator will switch to pasting together the s estimators for all of the series (yet unplayed postseason series will be treated as a sequence of 50/50 coin flips). These numbers determine the probabilities at the beginning and end of each game, and the game estimator will be used to lever between these like the 232 estimator does now during games. This will produce new stats uACW and uX.
The season X numbers in the season summary table will be replaced with uX, which should be a far more interesting number, and wider ranging than the current ones that are always around 250-300 or so. In fact I expect a seasons-by-x file will be called for. uACW is a little stranger because it will give up on teams eliminated from the postseason while they're still playing. I see it having a role in the hall of fame ranking, but a limited one in combination with gACW. All of gACW, sACW, and uACW will be in the player career summaries. I should warn that the u estimator will be a lot of coding work, and so it will probably start out in a somewhat simplified form that will be expanded and corrected over time. Finally, I'm thinking about a new stat based on X, the swing inning for a game and the swing game for a series, being the one where cumulative X exceeds 50% of the total. It remains to be seen whether this will be interesting enough. If it is, the swing game will be marked on the willow, and the swing half-inning will be added to the "T X ! P" header. Whew, that should be enough work for my vacation!
@ indicates that inputs or outputs are made via the generate script and not directly
run_everything
  Master script that runs all the necessary generate targets in correct order

download
  Download alldata.zip from retrosheet.org
  writes: alldata.zip

generate
  Substitute for a proper makefile, also miscellaneous code snippets
  The reference for how to run each command properly (see run_everything for the proper order)
  reads: alldata.zip erd/ basic-stats guides/ bb-post hitting-stats/ pitching-stats/ baserunning-stats/ hw-hitting-stats/ hw-pitching-stats/ fielding-stats/ gacw/
  writes: rs/ basic-stats papg erv-lib never-seen bb-post.ps sample.ps sample.pdf best-RS-hitting best-RS-pitching best-RS-baserunning best-RS-hw-hitting best-RS-hw-pitching best-RS-catching best-RS-infielding best-RS-outfielding unfiltered-best-RS-hitting unfiltered-best-RS-pitching unfiltered-best-RS-baserunning unfiltered-best-RS-hw-hitting unfiltered-best-RS-hw-pitching unfiltered-best-RS-catching unfiltered-best-RS-infielding unfiltered-best-RS-outfielding games-by-x series-by-x

parks-seen
  Tool to generate a list of games per park
  reads: rs/postseason/ rs/events/
  writes: @games-per-park

bio-names
  Convert retrosheet biofile.csv to namedb used by all software to map RS ID codes to names
  reads: rs/biofile.csv
  writes: namedb

rs2erd
  Translator retrosheet -> erd
  reads: @rs/postseason/ @rs/events/
  writes: @erd/

erv-tab
  Calculates the main erv tables
  reads: erv-lib erd/
  writes: @tab-erv

bb-erd
  Tool to interactively explore erd codes and the probability estimators - h for help message
  reads: erv-lib s232 winp-lib basic-stats
  writes: @plays-erd

mplhi
  Generate a list of peculiar all-time stats about games
  reads: @erd/
  writes: @most_pl_hi

pl-st
  Summarize the careers of all players
  reads: @erd/ namedb
  writes: @all-players

hps
  Calculate all player run-value stats
  reads: @erd/ tab-erv rs/rosters/
  writes: pitching-stats/ fielding-stats/ hitting-stats/ baserunning-stats/

series-wins
  Generate table of who won every postseason series
  reads: erd/
  writes: win/

run-stats
  Determines hitting and pitching wins and losses, calculates fractional R stat
  reads: tab-erv erd/
  writes: runs/

cwl
  Tabulates career win-loss records
  reads: runs/
  writes: winloss

erd-parse
  Tabulates pa o br r ACR ACOR per year/team/series (also model erd parser)
  reads: tab-erv papg @erd/
  writes: season-stats/

homewins
  Calculate the parameters of the game probability estimator
  reads: @erd/ tab-erv papg winp-params
  writes: winp-params

s232-tab
  Calculates the 232-series estimator parameters
  reads: erd/
  writes: @s232

bwp
  Calculate ballpark home win probabilities
  reads: erd/
  writes: bwp-tab

hwprob
  Run the probability estimator for every play of every game
  reads: @erd/ tab-erv papg winp-lib bwp-tab winp-params
  writes: @hwp/

hwhps
  Calculate all player log-probability-ratio stats
  reads: tab-erv papg winp-lib winp-params bwp-tab erd/
  writes: hw-pitching-stats/ hw-hitting-stats/ gacw/

career
  Calculate and tabulate career rankings in all main stats categories
  reads: papg all-players hitting-stats/ pitching-stats/ baserunning-stats/ hw-hitting-stats/ hw-pitching-stats/ fielding-stats/
  writes: career-hitting career-baserunning career-hw-hitting career-hw-pitching career-pitching career-catching career-infielding career-outfielding

hpmvp
  Determine the hitting and pitching MVPs for each season
  reads: @erd/ hitting-stats/ pitching-stats/ hw-hitting-stats/ hw-pitching-stats/ runs/
  writes: @mvp runners-up

sm-hof
  Determine series MVPs and the all-time universal ranking
  reads: erd/ all-players pitching-stats/ hitting-stats/ fielding-stats/ winloss hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/
  writes: hof smvp

series-match
  Tabulate team and player leaderboards for each year, generate the willow
  reads: @erd/ papg runners-up series-by-x hw-hitting-stats/ hw-pitching-stats/ runs/ season-stats/ win/ unfiltered-best-RS-hitting unfiltered-best-RS-pitching all-players career-pitching career-hitting hitting-stats/ pitching-stats/ gacw/
  writes: @season/ @guides/*-season.ps @willow

guide-career
  Generate player career summaries for any postseason series
  reads: @erd/ broadcasters all-players
  writes: @guides/*-career.ps

guide-graph
  Generate the graph of any game
  reads: @erd/ tab-erv hwp/
  writes: guides/*-graph.ps

guide-roster
  Generate the roster of any game to accompany the graph
  reads: @erd/ tab-erv papg all-players mvp smvp gacw/ runs/ pitching-stats/ hitting-stats/ fielding-stats/ hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/ career-hitting career-pitching career-hw-hitting career-hw-pitching winloss career-baserunning career-catching career-infielding career-outfielding win/ hof
  writes: guides/*-roster.ps

winp-lib
  Library of probability estimator functions

30wins
  Tool to show all 30-game winners
  reads: runs/

acctruns
  Tool to check that the fractional R stat is calculated correctly
  reads: runs/

bio-check
  Tool to check retrosheet biofile.csv for parse failures - output should be empty
  reads: rs/biofile.csv

extract-gacw
  Tool to extract gacw entry for any game
  reads: gacw/

extract-game
  Tool to extract any game by retrosheet game ID from the erd database
  reads: erd/

extract-hwp
  Tool to extract hwp entry for any game
  reads: hwp/

extract-runs
  Tool to extract runs entry for any game
  reads: runs/

game-pdf
  Tool to make a complete pdf for any game
  runs: extract-game extract-hwp extract-gacw extract-runs guide-graph guide-roster guide-career series-match pscat

hwp-report
  Tool to crudely examine the probability estimator's a posteriori accuracy
  reads: hwp/

list-codes
  Tool to generate a list of all erd codes seen by order of frequency
  reads: erd/

max-pa-br
  Tool to sort all baserunners by PA on base per year
  reads: baserunning-stats/

pscat
  Tool to concatenate postscript files
  reads: common-ps-header

series-pdf
  Tool to make a complete pdf for any series
  runs: guide-graph guide-roster guide-career series-match pscat

text2ps
  Tool to make postscript from a text file, used for guide-career and bb-post

viewdec
  Tool to sort pitchers and hitters per year by W - L
  reads: runs/