The ArctEX baseball statistics system

This is a very preliminary introduction to a cool project based on Retrosheet that tries to reinvent basic baseball statistics using a somewhat more mathematical framework. What's different about it? In a word, consequentialism. All Arctex statistics are based solely on the umpires' decisions about what happened in the game. There are no earned runs, RBIs, or at bats, and hits and errors appear only as descriptions. It aims to be a complete and reasonably familiar description of baseball using these methods, although the terminology is mostly new.

The name is from "arctangent of expectation values", about which more later after some actual baseball stuff. Skip the rest of this paragraph if you don't want a brief history of the project. This project is a work in progress but close to something interesting (to me at least). It was going along great up through 2019 and then a number of things happened. One is that MLB introduced some new rules that required a bit of a software rewrite - the runner on second base, the 7-inning game, the new postseason format. For the most part that's easy enough (see a complication below), but I delayed too long and accumulated too many irresistible ideas for new stuff, and so the update became too big to tackle at once. And then in 2020 I had a number of new ideas all at once that demanded following up, which eventually resulted in the other texts I wrote. Finally in late 2023 I got around to fixing all that and updating so that I can follow Retrosheet again - at last! That release was particularly buggy, but now I'm happy to say I've done another round, smoothed over the bugs, and added some new features again.

Is Shohei Ohtani's 6-for-6 game the best offensive performance ever? An example step-by-step of how Arctex stats are calculated.

Contents

- Description of PDF files and website features

- Description of stats and algorithms

- What's next

- Description of each file in the distribution

Printouts

First a general note about the PDF files - they're made to be printed out on paper. I like to print them on thick letter size paper almost like what's used for business cards, and then have them looseleaf, with one stack of landscape pages and another of portrait pages, with the current game graph and roster file on top of each. Incidentally, although some software previewers mishandle the fonts, printers never seem to have this problem. If you're looking at the files on a screen, it may be best to open two copies with one on a graph page and another on the roster page if you want to follow a game. There are blank pages inserted to make it suitable for double-sided printing.

Graph

This was the raison d'etre for this project. Once I had seen the idea of expectation value tables, I wanted a graph like this. I had programmed Postscript in college, and had always wanted to try a curvy spline graph using the curveto operator (which is the basis of Postscript fonts). The thick light line is the home team score plus expectation value, and the thick black line is the same for the visiting team. The scale in runs is printed on the left and right. The thin lines represent the probability estimators, and are always on a scale of 0-1. The black line is the home team game win estimate, and the light line is the series home advantage team's series win estimate. This varies with the black line in games 1,2,6,7 and counter to it in games 3,4,5.

There are several lines of text above and under the graph. Starting from the top, there is the "annotation" or description of each play in bold. See below in "Annotations and file format" for a description of the abbreviations. Below that is a description of which offensive players advanced, scored, or were out. See "Player Codes" and "Baserunner Codes" below for a description of the syntax. The final line above the graph shows the inning. Inside the top part of the graph, a large * marks the play of the game, defined as the play that makes the largest step change in the game probability estimator. In the first line below the graph, there are ERD codes for each play (see "ERD Codes"). At the start of each half-inning, the offensive team gets its expectation value raised to the value for none on, none out, which is depicted as a separate column on the graph. Under it, there appears not an ERD code (there is none) but the ballpark code in use for this game, or PW table code if that's in use - see below "ERV", "Ballpark Codes". Below that is the ERD value (see below). Finally at the bottom there are substitution codes, which use the player letter codes and fielding position number codes. When you see 'n:6', that means player 'n' is now the shortstop. When you see 'k:1,j' that means 'k' is the new pitcher and 'j' is leaving the game at the same time. Perpendicular to the graph is a header which gives the team codes and colors, the series and game number (this doesn't really look right for regular season games), followed by the date and weekday, and in between the retrosheet doubleheader code in parentheses.
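The "play of the game" rule can be sketched in a few lines of Python. This is only an illustration: the function name and the input layout (a list of game probability estimates, one before the first play and one after each play) are assumptions, as is the use of the absolute value of the step and the tie-break in favor of the earliest play.

```python
def play_of_the_game(win_estimates):
    """Return the 0-based index of the play with the largest step change.

    win_estimates[0] is the estimate before the first play and
    win_estimates[k + 1] is the estimate after play k (assumed layout).
    """
    # Size of the jump caused by each play, regardless of direction.
    steps = [abs(after - before)
             for before, after in zip(win_estimates, win_estimates[1:])]
    # max() returns the first index on ties.
    return max(range(len(steps)), key=steps.__getitem__)


# Made-up estimates for a five-play stretch.
estimates = [0.50, 0.55, 0.48, 0.80, 0.78, 1.00]
print(play_of_the_game(estimates))
```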

Roster

This long section of tables provides information about the individual players referred to only by letter in the graph. The first section describes each game in the series (it always thinks there's a series, like the graph), and then another section describes the regular season for these players, and a final section gives their career numbers. Through all of the charts, the symbol '#' serves as infinity in a ratio.

Game Display

The header for each game describes it in basic stats terms (see section of that name below, as usual). The home team for game 1 of the series is on the right, and will continue to be on the right for the entire roster file. The home team's headers appear in lighter gray whether on the left or right. The center headers give information for the entire game, and appear in black. The stats are explained in "Basic Stats", except for the lower center headers. The +.5 etc. are explained in "ERD probabilities" and appear as counts. In the center, T is the game time in minutes, X is the stat called X or game excitement factor, and '!' (called "bang" by programmers) is 100 * X / T rounded to a decimal, i.e. a measure of "excitement per minute". Finally P is the pace, or minutes per plate appearance - higher is slower.
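As a small illustration of the two derived header numbers, here is a Python sketch. The function names are made up for this example, the inputs are invented rather than taken from a real game, and "rounded to a decimal" is assumed to mean one decimal place.

```python
def bang(x, minutes):
    """Excitement per minute: 100 * X / T, rounded to one decimal (assumed)."""
    return round(100.0 * x / minutes, 1)


def pace(minutes, plate_appearances):
    """Pace P in minutes per plate appearance - higher is slower."""
    return minutes / plate_appearances


# Hypothetical game: X = 30.0, T = 180 minutes, 75 plate appearances.
print(bang(30.0, 180), pace(180, 75))
```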

A brief word about X, which is a fun stat. It follows how much the game probability estimator gets racked around, more or less. It's high for walkoffs and late lead changes, and low for early inning blowouts. For example, the highest X game was BOS193808232 with a final score 12-14 and a walkoff grand slam, with an X of 85.1. The worst is CLE199010020 which was 0-9 after the first inning and finished 3-13, with an X of 3.2. The game with the most lead changes (7 - there's a tool for this) was MON200005140 which is #15 all time in X with 76.4. For series, the 1986 ALCS had an X of 92.6, whereas the 2013 NLWC game had only 5.7.

Next is the display of hitters - really all offensive players for this game. The first column '#' gives the player's letter code, as used on the graph. The next column 'st' contains several pieces of information variously coded. For the starting lineup, it contains the starting fielding code, or 10 for DH. For substitutes it contains the letter corresponding to the batting slot they enter. This is not necessarily the player they displaced. That can be determined by looking for the previous letter code alphabetically that went into the same batting slot letter, looking up the list (it could be a pitcher below). In general to figure out a substitution completely you have to look at all the substitution codes below the graph, but there are a number of features to help figure out common patterns from the roster list only, via dots appearing around this 'st' letter or number. A middle dot on the right signifies that this player had a substitute fielding code for positions 2-9, which may mean a substitute fielder, or changing to a fielding position later. A period on the right means a pinch hitter, and a comma means pinch runner, and these combine with the fielding dot to form colon and semicolon. This allows you to figure out simple multiple switches, e.g. if a pinch hitter goes into the center fielder's batting slot, and if fielding dots appear on the pinch hitter and the starting left fielder, then the pinch hitter probably went into left and the left fielder moved over.

The stats that appear next are all explained below. Briefly, ACR is the basic run value hitting stat and ACOR the pitching stat. ACW is about the game probability estimator, and R is a stat used only for deciding the winning and losing hitters and pitchers. There are more annotations on the PA column. The game's winning and losing hitter are marked + and -. High dots on the winning side and periods on the losing side both indicate that the player passed the 'R/pa' cut described below for the decision. This helps understand how the decision works. Also, '..' may appear by itself indicating a double-pinch, i.e. a pinch hitter who never comes to bat, relieved immediately by another pinch hitter after a pitching change. Finally, the player's name has a hand designation appended. This has a middle dot after it if it indicates the throwing hand, meaning that the player pitched at some point in the season. Sometimes this is the case for position players, which is basically a bug, although a little hard to fix the way it currently works.

Below that is the pitching table for the game, which is much the same as the hitting table. Unlike ACR, ACOR is better when negative, but ACW is still best positive for pitchers. There are a few little differences from the hitting display. For DH games the 'st' column is just blank for relief pitchers who stay out of the batting order. The PA annotations are the same except you don't see double-pinch, and there is another annotation, the comma which means KO (left in the middle of an inning).

Regular Season Tables

All of the stats in this section are for the regular season and for this team only.

First comes the regular season hitting table, which is ordered by ACR. The season leader in ACRc (see below) has a '>' appended to the PA column. '/pa' after 'BR' means BR/PA for the season (lower case for spacing). BR/PA kind of needs a verbal name, and I prefer "no out percentage". 'w' and 'l' are the hitting decisions.

Next, the summary of hitting vs left and right hand pitchers is ordered by LACR - RACR which means that "normal" right handers sort at the top of the table, and "normal" left handers at the bottom. This allows you to see reverse splits and pigeonhole switch hitters at a glance. The stats are described below, but LACR is just batting-only ACR vs left hand pitchers. Beware when the number of PA is very low on either side. If you can't figure out why somebody would be disqualified for a regular season hitting MVP, look here. You have to have positive LACR and RACR to be considered for either MVP.

Next is hitting ERD probabilities, which is sort of Arctex's mixed-up version of singles, doubles, homers, and double plays. What you see is a percentage of all plays this player was involved in as a hitter or baserunner, by ERD value in the sense of greater or equal for + and less or equal for -. So +.2 is very roughly like singles (and everything better), +.5 like doubles, and +1 like homers, with -.5 being like double plays. But the numbers here are more focused on the game effect and categorize all sorts of plays. For example, an aggressive base stealer will have a high rate of +0. Unlike the other tables in this section, this table has a minimum requirement of 150 plays, the reason being that the numbers are pretty worthless with fewer than that. The final column is 'si' or slugging index, which is +.5 + 2 * +1, a number kind of like slugging percentage. It is in fact a good number for categorizing sluggers. Over 30 is very good, and over 40 is terrific (and rare nowadays). The table is sorted by the unobvious metric (+.2 - -.2), which is vaguely interesting.
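The ERD probability columns and the slugging index can be sketched as follows. This is a minimal Python illustration, not the distribution's code: the function names and the sample ERD values are invented, and a threshold of 0 is simply treated as the '+' side.

```python
def erd_probability(erd_values, threshold):
    """Percentage of plays at or beyond a signed ERD threshold.

    A '+' column (threshold >= 0) counts plays with ERD >= threshold,
    a '-' column (threshold < 0) counts plays with ERD <= threshold.
    """
    if not erd_values:
        return 0.0
    if threshold >= 0:
        count = sum(1 for v in erd_values if v >= threshold)
    else:
        count = sum(1 for v in erd_values if v <= threshold)
    return 100.0 * count / len(erd_values)


def slugging_index(erd_values):
    """si = (+.5 column) + 2 * (+1 column)."""
    return erd_probability(erd_values, 0.5) + 2 * erd_probability(erd_values, 1.0)


# Ten invented ERD values for one player (far below the 150-play minimum).
sample = [1.2, 0.6, 0.3, -0.3, -0.6, 0.0, -0.1, 0.25, 0.55, -0.2]
print(erd_probability(sample, 0.5), erd_probability(sample, -0.5),
      slugging_index(sample))
```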

The baserunning and fielding tables should be self explanatory. All -OR stats are best when negative (opponents' runs), and all are sorted best on top. Fielding stats are aggregated by the named categories. Beware numbers with few PA, especially in fielding. Under 1000 PA should be considered an unreliable FCOR number (much more is really best). The FCOR PA numbers are also a good place to see who was a starting player in the season. The BCR PA number is an interesting offensive metric in itself.

The pitching table is sorted by ACOR, best on top. 'pa/3' means PA per 3 outs, i.e. 3 * PA / O, which is sort of parallel to BR/PA for hitters. In the ERD probabilities, an si over 25 is a red flag in my opinion. That table is ordered by (-.5 - +.5), which again seems vaguely interesting.

The postseason series tables display the same information for these series. The PS numbers are an aggregate of the following series numbers. If you're looking at a regular season game, and a player's name appears in the PS summaries but without any numbers, it means the player was on another team. The only thing different here is the indication in the PA column turns into a primitive version of my series MVP designation, based only on ACRc and ACORc. These are now basically obsolete but I don't see any major reason to turn off the code - just know they're not the real award, which is indicated at the end in the MVP section. The character used to indicate this varies depending which series is being considered. In a file for a postseason series, the winning team's player will be indicated '*', although the new series MVP award doesn't care which team you're on. The losing team's best player gets a '>'. For series summaries for other series in the year, the designation changes to '^' because it only considers the listed players, and the real award winner may not appear (this one is fairly useless). For the entire PS summary they get '>' for both teams.

Career Tables

Most of these are self explanatory. All are sorted with the highest ranked players on top. The stats are aggregated over all time, and there are minimum plate appearance requirements: 2000 PA for everything except fielding, which requires 12000 PA. The career decisions table has no minimum, and lists hitting wins and losses and pitching wins and losses in that order. It is sorted on total wins minus total losses. The minimum requirement for the hall of fame is one plate appearance as either hitter or pitcher - x is the combined all-around metric for the ranking. The MVP table lists both regular season and postseason MVP awards, also both based on custom combined metrics. For a postseason series, the real series MVP is listed here, and may be on either team. If an MVP is not listed for a series the team won, it went to a player on the losing team.

Career Summary

Every player's career is summarized in terms of teams, years, series, and positions. Teams are listed in order of the first year the player played there. If the player played for two or more teams in the same year they may not be listed in correct order. Fielding positions beyond the usual: 10 DH, 11 PH, 12 PR. Positions are listed regardless of how many or few PA.

Season Summary

At the top there is an imitation of a common display for division leaders and runners up in all divisions. Across each line there is the division position, the team code, games won minus games lost, total non-tie games, tie games, win percentage, team ACR, team ACOR, team TAR, and the X value for the entire season. ACR, described below, is the run value hitting stat, and ACOR is pitching and of opposite sign. TAR is ACR minus ACOR, which is a combined "team power" number in runs per game (although it's a per-plate appearance number).

Next is a wildcard runner up display, even in the pre-wildcard era. Division leaders appear at the top above the line, marked with a '*', followed by the runners up, which are marked with a dot when wildcard slots exist. The display contains a brief summary of each team's most recent postseason appearance, world series appearance, and world series win.

Below is the postseason summary, with an entry for each postseason series. The series name is given in retrosheet format, which includes imponderables like ALD2 that are best looked up here to find out what that means. Following are the team codes of the series winner then loser, followed by a display of series games called the willow. Each game is represented by a letter decorated with "accents" or dots to indicate a number of interesting things. The letters w and l represent wins and losses by the series winning team, and are upper case whenever the home team wins the game, lower case when the visitors win. A high dot (or caret in text files - a postscript charset issue) before the letter indicates a complete game by the winning pitcher, and this is changed to a '!' for a perfect game. By the way, the Arctex definition of a perfect game is very slightly different, being defined as zero BR, which allows for e.g. a single thrown out at second (consequentialism). A middle dot after the letter indicates extra innings. A comma after the letter indicates a walkoff, and combines with extra innings to form a semicolon. There follows a display of basic stats for each series (plate appearances, outs, baserunners, runs), with each displaying two numbers - before the slash, the winner's total, and after the slash the series total (winner plus loser), so that it appears as a fraction giving the winner's share of the total. Next come the run value ACR and ACOR numbers for the series. They're printed for the winner, but you reverse the numbers for the losing team, i.e. the winning team's ACOR is the losing team's ACR (no sign reversal). Finally there is the series X, followed by the rank in X value (out of 381 through the 2023 postseason).
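The willow decoration rules can be summarized with a short Python sketch of the text-file variant (caret instead of high dot). The function and its boolean parameters are invented for illustration and are not part of the distribution.

```python
def willow_symbol(series_winner_won, home_team_won, complete_game=False,
                  perfect_game=False, extra_innings=False, walkoff=False):
    """Build the willow entry for one game (text-file variant)."""
    # w/l is from the point of view of the team that won the series.
    letter = "w" if series_winner_won else "l"
    if home_team_won:
        letter = letter.upper()
    # A perfect game replaces the complete-game mark.
    if perfect_game:
        prefix = "!"
    elif complete_game:
        prefix = "^"
    else:
        prefix = ""
    # Walkoff comma and extra-innings middle dot combine into a semicolon.
    if extra_innings and walkoff:
        suffix = ";"
    elif extra_innings:
        suffix = "\u00b7"  # middle dot
    elif walkoff:
        suffix = ","
    else:
        suffix = ""
    return prefix + letter + suffix


# Series winner wins at home on a walkoff in extra innings.
print(willow_symbol(True, True, extra_innings=True, walkoff=True))
```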

Finally at the bottom there are (up to) three text summaries. On the left is an accounting of games in the regular season and postseason which should be self explanatory. If there were regular season interleague games, there is a summary of which league won how many in the middle. At the right is a summary that needs a slight explanation, as I couldn't find the right wording to fit there. What it shows is the total length of all games in the year divided by one eighteenth of the total number of half-innings, more or less the length of the average game that year. But it's really calculated as an average of half-innings. Then there's a summary of the total season pace of game or 'P' stat, in plate appearances per minute.

League Leaderboard

This is currently the most confusing display, owing to the fact that it's sort of in a transitional stage. Originally, before the software was ever released, the display was a fairly conventional AL/NL hitting/pitching leaderboard, with a quartet of simpleminded MVPs designated by ACRc and ACORc (see below for the exact meaning of these terms). Now there's a more sophisticated MVP algorithm, and it's not split by league but by pitcher hand, which I eventually considered more interesting. The new MVP is marked in the second column, but the overall display is still rooted in the past. In particular, the numbers displayed for each player are aggregated over multiple teams if necessary, but only within the given league. The new MVP does not consider the league in any way (see below), so the numbers displayed aren't actually relevant (except by accident). The new MVP award may seem inappropriate based on the numbers. The only thing holding up a redesign is that I don't yet have a clear idea of what the new one should look like. The selection of 25 players in each quadrant seems obvious enough, but for very old games it's actually a little too much, at least with the display based on ACRc/ACORc. Before the expansion era there were fewer players overall, and in those years you will sometimes see players with very few PA low down on the list. One reason I don't simply rearrange the display by hand to fit the new MVP is that there are fewer left-handed players, and the number of players on each side both ought and ought not to be the same in that case, more or less.

The first column in each table is the old MVP/runner up designation, based solely on ACRc or ACORc for the league-aggregate for the regular season. The second column displays the new MVP winners as * and the runners up as numbers starting with 1. A blank here means the player did not qualify, i.e. had a key stat of the wrong sign. Next is the name and batting or throwing hand, which is lowercase to indicate a player's rookie year (simplistically - first calendar year only), followed by the team for which that player had the most PA in the season. Note that a player who was traded may appear on both sides with different numbers. The headings give the stats that follow - PA is plate appearances, w and l are game decisions, c is complete games, and +1 is +1c described in "ERD probabilities".

Cheat Sheet

Finally, there is a document called 'bb-post' which is an all-around source of miscellaneous information about the game's history, ballparks, awards of various types, and the "willow" for all postseason games.

Software Download

Here is the latest software download and release notes. Arctex will always be completely free and open source. I took down the sample PDF files from the website as it's not clear I'm allowed to distribute them.

There is a script called run_everything which will do what it says, starting by downloading the Retrosheet big zip file, and finishing by generating the exact PDF that appears above, with all the analyses run in between. There are also some tools for looking at the stats in a number of ways, the most important of which is called bb-erd. This is an interactive command-line tool to query the ERV tables, calculate ERD code values for different ballparks, and to run the game and series probability estimators for live games. You have to run_everything first. bb-erd does at least have a help screen, which you can get by entering ? or h. To run the software, you need: bash perl wget unzip ps2pdf (from ghostscript), i.e. a normal Unix environment. Mac should probably work, but I haven't tried it. The download is around 100KB, but it expands to 5GB when you run_everything, which should take about an hour on a fast machine. Have fun! See below for release notes.

Here's a quick guide to making PDF files with the software. Say you want a single PDF for the 2000 world series. This is now pretty easy.

./series-pdf 2000 WS

And that will give you a file called 2000-WS.pdf. Other series than the world series can be got by substituting NLCS or whatever. If you want a regular season game, you need to have the game's retrosheet ID, e.g. BOS193808232 (see below). Then you can do:

./game-pdf BOS193808232

And that gives you BOS193808232.pdf, easy.

Algorithms:

There are broadly 4 categories of statistics generated:

1. Basic stats: pa o r etc., calculated directly from the ERD codes.

2. Run-value stats: starting with the ERV tables, these all have values in runs (per plate appearance typically). These can be quite sophisticated and varied.

3. Log-probability-ratio stats: these are all based on the game probability estimator, which in turn is based on the run-value stats. These are somewhat involved to calculate correctly, and are limited to hitting and pitching, but are also very revealing compared to 1-2.

4. Combined MVP stats: these are all based on products of two terms overall, one a sum of run-value terms, the other a sum of log-probability-ratio terms. These are the best summary combination of what's revealed by all of the above.

Before describing the algorithms in detail, a word about my recommended interpretation of all of this. I certainly don't mean to suggest, by displaying an all-inclusive universe of statistics, that all that has gone before is worthless. I'm just trying to give these peculiar ideas a space to tell their peculiar story in its evident entirety. There is, I believe, a certain mathematical logic to it. Right now hardly anybody is listening to me, and that's as it should be. I'm not entirely convinced by all of these mechanical judgements myself, and I certainly don't suggest anyone else should be. My motivation in coding these algorithms is to give them what I think is the best chance to tell what they have to tell. When I make some fairly arbitrary adjustments, the purpose is to try to show the algorithm in its best light by my judgement, and not to make sure this or that person has a high or low ranking as I wish. After that, I sit back and see what I think. Overall, obviously, I think it's entertaining enough to continue, at least for a while.

Basic Stats

From basic baseball statistics I take plate appearances, runs, and outs. Also the occupation of the bases. Doing without hits and errors and so forth, I make a new stat called BR which is defined as a plate appearance in which no out is made off the final pitch. This adds one to the total quantity of baserunners, and leads to further definitions:

	lb = pa - ( o + r )		Runners left on base
	xo = o + br - pa		Extra outs
	ro = o - xo			Regular outs
	nr = br - xo			Net runners

The automatic runner on second base in extra innings is counted as a stat called xr (Extra Runners). The full set of equations and inequalities:

        lb = pa + xr - ( o + r )
        xo = o + br - pa
        ro = o - xo
        nr = br - xo
        o = ro + xo
        pa + xr = o + r + lb
        pa = ro + br
        br + xr = r + lb + xo
        nr + xr = r + lb
        o >= ro, xo
        br >= nr, (r-xr), (lb-xr), (xo-xr)
        nr >= (r-xr), (lb-xr)
        pa >= br, ro, nr, (o-xr), (r-xr), (xo-xr), (lb-xr)
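The definitions and inequalities above translate directly into code. The following Python sketch uses invented function names and an invented team line (38 PA, 27 O, 13 BR, 4 R). Note that the equalities follow algebraically from the four definitions, so only the inequalities act as a real consistency check on the input counts.

```python
def derived_stats(pa, o, br, r, xr=0):
    """Derive lb, xo, ro, nr from the counted stats (xr = automatic runners)."""
    lb = pa + xr - (o + r)   # runners left on base
    xo = o + br - pa         # extra outs
    ro = o - xo              # regular outs
    nr = br - xo             # net runners
    return {"lb": lb, "xo": xo, "ro": ro, "nr": nr}


def inequalities_hold(pa, o, br, r, xr=0):
    """Check the listed inequalities for one set of counts."""
    d = derived_stats(pa, o, br, r, xr)
    lb, xo, ro, nr = d["lb"], d["xo"], d["ro"], d["nr"]
    return (
        o >= max(ro, xo)
        and br >= max(nr, r - xr, lb - xr, xo - xr)
        and nr >= max(r - xr, lb - xr)
        and pa >= max(br, ro, nr, o - xr, r - xr, xo - xr, lb - xr)
    )


# Invented nine-inning team line.
print(derived_stats(38, 27, 13, 4), inequalities_hold(38, 27, 13, 4))
```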

Baseball as state machine: states

To concentrate the statistics on the most important aspects of the game, the game is described as a finite state machine. All plays are translated into this state-based description, and all stats are generated by considering the games according to these states. The total number of PA, O, BR, and R accumulated by teams and players is obviously relevant, but instead of creating "box scores" consisting of these, the description instead focuses on state transitions. It is therefore necessary that following the sequence of these transitions should allow an accurate accumulation of these quantities. It turns out that all that is needed is a state consisting of four pieces of information, aside from the identification of the players:

number of outs in the half-inning

runner on first

runner on second

runner on third

The last 3 can be conveniently coded as an octal digit (sum of 1 for first, 2 for second, and 4 for third).
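For example, the octal base code could be packed and unpacked like this in Python (the function names are made up for illustration):

```python
def base_code(first, second, third):
    """Octal digit: 1 for first, 2 for second, 4 for third, summed."""
    return (1 if first else 0) + (2 if second else 0) + (4 if third else 0)


def decode_base_code(code):
    """Inverse of base_code: returns (first, second, third) as booleans."""
    return bool(code & 1), bool(code & 2), bool(code & 4)


# Runners on first and third -> 5; code 6 -> second and third occupied.
print(base_code(True, False, True), decode_base_code(6))
```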

Baseball as state machine: transitions

The state transition corresponding to a baseball play must specify the starting and ending state codes (duh). However it must also specify whether the batter completed the plate appearance, in order to determine whether or not a BR occurred. A letter/number code can be constructed to specify these data. It turns out that there is slightly more that can be asked of a state transition code - the ability to assign a unique numerical value to it. To do this in the baseball context, I define the ERD code, which stands for Expected Run value Difference. Most of the complexity in this is contained in a single letter which also serves to designate the (non-) plate appearance, explained below.

ERV

The "Expected Run Value" is an expectation value of runs to be scored until the end of the half-inning conditioned on the number of outs in the half-inning at the start of the play and the occupation of the three bases at the start of the play, in other words the ERD state code. This implies that the expectation values be collected into a table - the ERV table - which has 8*3=24 nonzero entries, the entries for out 3 being set to zero.

The general algorithm to compute the ERV table is to consider the ERD code for every play for the time period and/or ballpark under consideration, while keeping a double table of numerator and denominator counts for all 24 table entries. For each code, the denominator is increased by one for the corresponding table entry. The state code (outs, bases) is then added to a list accumulated over the half-inning. Finally, the list of accumulated state codes is visited to add the number of runs scored on the play to the numerator value for every table entry on the state code list, including duplicates (as the denominators were increased in duplicate). After all ERD codes have been processed, the ratios are formed. These are tabulated generally under the name E indexed twice as E[bases][outs].
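Here is a minimal Python sketch of that accumulation, as I read the description. The input layout (a list of half-innings, each a list of (bases, outs, runs) tuples for the starting state of every play) is an assumption made for the example, as is the extra all-zero column for 3 outs and the choice of 0.0 for states that never occur in the data.

```python
def compute_erv(half_innings):
    """Return E[bases][outs] with bases 0-7 and outs 0-3 (outs 3 is always 0)."""
    num = [[0.0] * 3 for _ in range(8)]   # runs scored from this state onward
    den = [[0] * 3 for _ in range(8)]     # number of times the state occurred
    for plays in half_innings:
        visited = []                      # state list, reset every half-inning
        for bases, outs, runs in plays:
            den[bases][outs] += 1
            visited.append((bases, outs))
            # Credit this play's runs to every state seen so far in the
            # half-inning, duplicates included.
            for b, o in visited:
                num[b][o] += runs
    return [
        [num[b][o] / den[b][o] if den[b][o] else 0.0 for o in range(3)] + [0.0]
        for b in range(8)
    ]


# Two invented half-innings: single, two-run homer, three outs; then 1-2-3.
data = [
    [(0, 0, 0), (1, 0, 2), (0, 0, 0), (0, 1, 0), (0, 2, 0)],
    [(0, 0, 0), (0, 1, 0), (0, 2, 0)],
]
E = compute_erv(data)
print(E[0][0], E[1][0])
```

In this toy data the state "none on, none out" occurs three times and is followed by two runs once, so E[0][0] comes out as 2/3.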

ERD Codes

The following bewildering array of codes are required to properly represent the possible run values (see ERD Values below) of a play. Originally it was just p and n, but over time it was discovered that all of these situations are numerically distinct and represent distinct game state changes as far as the algebraic model is concerned.

	Code letters
	Usual normal half-innings
		p	Plate appearance
		n	Not a plate appearance
	Usual potential walkoff half-innings
		q	Plate appearance
		r	Not a plate appearance
	Usual walkoff play
		w	Plate appearance
		u	Not a plate appearance
	Extra inning with automatic runner - top of inning
		P	Plate appearance
		N	Not a plate appearance
	Extra inning with automatic runner - bottom of inning
		Q	Plate appearance
		R	Not a plate appearance
	Extra inning with automatic runner - walkoff play
		W	Plate appearance
		U	Not a plate appearance
	Game over early for arbitrary reason - normal half-innings
		x	No play
	Game over early for arbitrary reason - potential walkoff half-innings
		X	No play

The need for p and n in the determination of the BR stat has been explained. The q and r codes are needed because the bottom of the ninth and later innings are actually subject to different rules. For example in a tie game with more than one runner on base, only one run will be allowed to score unless the hit is out of the park. This fact seriously changes the expectation values and necessitates the use and computation of a different set of ERV tables, which are called PW for potential walkoff. Furthermore there must be a set of these, for different values of the half-inning-initial score difference. Experiment has shown that 5 tables are necessary, one for each initial score difference of 0-3 and one for 4 or greater. Not surprisingly the effect wears off after 4 runs of difference. The handling of these extra tables will be explained below, under the discussion of ballpark codes.

Following the need for potential walkoffs, there are actual walkoffs, with codes w and u. The codes are necessary to signal the cancellation of the expectation value of the remaining baserunners. It's also convenient to know when the game ended at an unusual time. The code x was also added for games arbitrarily ended early. The x code comes after the last play of the game, and exists mainly to provide an ERD code of non-zero value which is not assigned to any player, as the early game ending is often prejudicial to the more sophisticated player statistics. Since the x code has a value, it must have a different value in PW half-innings, so a second code X is needed (it's never a plate appearance).

Finally, for various reasons to do with some of the more sophisticated stats, it is desirable to know whether the half-inning started with an automatic runner on second. No separate ERV table is needed here, and the runner itself can be handled within the codes, but the fractional R stat assigns different values to the codes in this situation.

These various elements are gathered together in this order:

(initial base octal code)(letter code)(final base octal code)(initial outs)(final outs)(runs)

In fact it is not necessary to include the number of runs scored in the play. This can be computed from the previous components. But it is more convenient to have it, and a more sophisticated reason is that it forms a check digit.
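As a rough illustration of the check-digit idea, the sketch below splits a code into its six fields and recomputes the runs by conservation of runners. The one-character field widths, the reading of the octal digit as one bit per base, and the rule that only p, q and w plays bring a batter into the count are assumptions made for this example, not taken from the arctex software. The x and X codes and the automatic runner are ignored.

```python
from typing import NamedTuple

# Assumption: letters that represent a completed plate appearance (adds a batter).
PA_LETTERS = set("pqw")


class ErdCode(NamedTuple):
    base0: int   # occupied bases before the play, as an octal digit
    letter: str  # play type letter
    base1: int   # occupied bases after the play
    out0: int    # outs before the play
    out1: int    # outs after the play
    runs: int    # runs scored on the play


def parse_erd(code: str) -> ErdCode:
    """Split a six-character ERD code into its fields (one character each, assumed)."""
    if len(code) != 6:
        raise ValueError(f"expected 6 characters, got {code!r}")
    return ErdCode(int(code[0], 8), code[1], int(code[2], 8),
                   int(code[3]), int(code[4]), int(code[5]))


def expected_runs(c: ErdCode) -> int:
    """Runs implied by the other fields: everyone who started on base or came
    to bat must end up on base, out, or home."""
    runners_before = bin(c.base0).count("1")
    runners_after = bin(c.base1).count("1")
    batter = 1 if c.letter in PA_LETTERS else 0
    outs_made = c.out1 - c.out0
    return runners_before + batter - runners_after - outs_made


def check(code: str) -> bool:
    """True if the runs field agrees with the runs implied by the other fields."""
    c = parse_erd(code)
    return expected_runs(c) == c.runs
```

Under these assumptions a bases-empty home run would be written 0p0001 and a grand slam 7p0004, and both pass the check.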

ERD Value

After the ERV tables are available, when an ERD code is processed a run value can be assigned to it called an ERD value. For normal plays with ERD code letter nprq, the code value is computed from code (base0)(code)(base1)(out0)(out1)(runs) as

ERD = runs + E[base1][out1] - E[base0][out0]

If the code has letter wux, then

ERD = runs - E[base0][out0]

If these values are summed over a half-inning, an integral number is produced by algebraic cancellation.
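The cancellation can be seen in a small example. The ERV numbers below are invented, the half-inning is made up, and treating the three-out state as worth zero is an assumption of this sketch.

```python
# Toy ERV table E[(base, outs)]; the numbers are invented for illustration only.
# Only the states used below are filled in.
E = {
    (0, 0): 0.50, (1, 0): 0.90, (1, 1): 0.55,
    (0, 1): 0.28, (0, 2): 0.11,
}


def erd_value(letter, base0, base1, out0, out1, runs, erv=E):
    """ERD value of one play, following the two formulas in the text."""
    start = erv[(base0, out0)]
    if letter in "wux":          # game-ending codes: no remaining expectation
        return runs - start
    if out1 >= 3:                # assumption: a finished half-inning is worth 0
        return runs - start
    return runs + erv[(base1, out1)] - start


# A made-up half-inning: single, strikeout, two-run homer, groundout, flyout.
# Fields: letter, base0, base1, out0, out1, runs.
half_inning = [
    ("p", 0, 1, 0, 0, 0),
    ("p", 1, 1, 0, 1, 0),
    ("p", 1, 0, 1, 1, 2),
    ("p", 0, 0, 1, 2, 0),
    ("p", 0, 0, 2, 3, 0),
]

values = [erd_value(*play) for play in half_inning]
total = sum(values)
runs = sum(play[5] for play in half_inning)
# The intermediate E terms cancel in pairs, leaving runs - E[(0, 0)].
```

In this sketch the player-attributed values sum to the runs scored minus the starting expectation E[0][0]; adding that unattributed starting value back recovers the integer run total.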

Ballpark Codes

It's easy enough to shove all plays ever from retrosheet into a single ERV table, but this turns out to be unwise. The need for separate PW half-innings means that all other half-innings should be rigorously separated in other tables. There is indeed an 'all' table which contains data from all non-PW half-innings. But there are other factors which influence the overall magnitude and detailed structure of the tables, although none as severely as PW0-3. The long experiment with both DH and non-DH ball is one of these. Another is that the balance between hitting and pitching has shifted several times now in the statistically detailed historical era. Calculating separate tables for AL and NL and for different decades was apparently standard practice when I first heard about the technique. A set of tables along these lines is also available, with names formed by chopping the last digit off the year, e.g. NL199. Of course only "NL" are available before 197.

However these are only used for games when ballpark tables are not available. Ballpark tables are of course computed for all major parks, so why would they not be available? This brings up the important concept of statistical base. Since expectation values are to be computed, and more importantly to have their differences computed, their values must be computed to tolerable accuracy. It turns out that, given the uneven distribution of game states, it takes well over a year to accumulate a tolerable statistical base. I prefer about three years for a single ballpark. The game states in many of the PW tables are even rarer than in the main tables, and this means that they can only be calculated universally. Their names are therefore PW0, PW1, ... PW4.

For each ballpark, it is desirable to split up the tables by era somehow to capture the slow change of the entire game over time. On the other hand I don't really like using decade tables which cause a huge jolt in the stats rather artificially every 10 years. As a compromise I set ballpark-and-era tables with unique eras for each ballpark, all overlapping and changing at different times. The result is that I assign a 3-letter code to each ballpark, plus a sequential era code starting with 0. The result looks like FEN.4, which is Fenway from 1988 to 2005, or POL.0, which is the Polo Grounds from 1901 to 1911. The dates where they change are supposed to represent occasions where there was a change in configuration or something like that. The complete list is in rs2erd. If a ballpark table is unavailable, the league-decade table is used instead.
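The table selection described in this section might be sketched as follows. The function names, the `available` set and the `pw_deficit` argument are hypothetical; the real list of ballpark eras lives in rs2erd and is not reproduced here.

```python
def league_decade_name(league: str, year: int) -> str:
    """League-decade table name: league code plus the year without its last digit."""
    return f"{league}{year // 10}"


def pw_table_name(initial_deficit: int) -> str:
    """PW table for a potential-walkoff half-inning; differences of 4+ share PW4.

    Assumption: the half-inning-initial score difference is passed in as a
    non-negative number of runs.
    """
    return f"PW{min(initial_deficit, 4)}"


def choose_table(park_era, league, year, available, pw_deficit=None):
    """Pick a table name.

    `available` is the set of ballpark-era tables with enough statistical base;
    `pw_deficit` is None outside PW half-innings.
    """
    if pw_deficit is not None:
        return pw_table_name(pw_deficit)      # PW tables are universal
    if park_era in available:
        return park_era                       # e.g. "FEN.4"
    return league_decade_name(league, year)   # fallback, e.g. "AL199"
```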

Player Codes

For each game, players are assigned letter codes for brevity. The home team gets all uppercase letters, and the visiting team lower case. The letters are assigned in the order in which players are announced in the record, starting with the starting hitting lineups. As a result, the letters a-i and A-I have a double meaning, also standing for the batting slots throughout the game (this is explained below in the Roster File PDF section). In rare cases the alphabet is overflowed, resulting in player "letters" like aa or AB, so software should actually treat them as words. Also, games ending in 'x' codes (e.g. rained out) have players inserted in the runs/ database called 'nobody' and 'NOBODY' in order to make the runs add up without prejudicing any players, explained in more detail below. It should go without saying that the letter codes are unique throughout the game and only for the length of one game.

A suspended game was recently completed in which the same player played on both teams via trade. The software today is not ready for this, and I don't know exactly what the answer is. I may wind up preprocessing the input to fake a second player ID until I have a better answer. The software generally processes all players together, asking a particular data structure which team they're on, and this approach can't work now. I don't particularly want to change tens of thousands of lines of code to accommodate one game, or a very small number of games. Something hacky may result.

Lineups and Substitutions

Starting players are introduced with a trio of (retrosheet id, letter code, fielding code). Shohei Ohtani is listed twice, under one letter code but fielding codes 1 and 10 (this is not like retrosheet). Substitutions are interspersed with play codes and contain a letter code and a fielding code. The fielding code 0 is used for players exiting the game. The retrosheet codes for substituted players are listed under the starting lineups, without fielding codes.

Baserunner Codes

Each play is supplemented by a baserunning code consisting of up to four elements, each of which has a letter code and a base destination code as follows: 1, 2, 3 for the numbered bases, 0 for out, and 4 for run scored. The elements are normally separated by slashes, and the normal ordering has the batter first and the lead runner last. If the play does not involve the batter, a '/' appears first.
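A minimal parser for this format could look like the sketch below. The example strings used with it are made up, and accepting multi-letter player codes follows the note above that letter codes should be treated as words.

```python
import re

# One element: a player letter code (possibly several letters) and a destination 0-4.
ELEMENT = re.compile(r"^([A-Za-z]+)([0-4])$")


def parse_baserunning(code: str):
    """Return (batter_involved, [(player_code, destination), ...])."""
    parts = code.split("/")
    batter_involved = True
    if parts and parts[0] == "":      # leading '/': the batter is not involved
        batter_involved = False
        parts = parts[1:]
    moves = []
    for part in parts:
        m = ELEMENT.match(part)
        if m is None:
            raise ValueError(f"bad baserunning element {part!r} in {code!r}")
        moves.append((m.group(1), int(m.group(2))))
    return batter_involved, moves
```

With this sketch a hypothetical code c1/a4 reads as batter c reaching first while runner a scores, and /b0 as runner b being put out on a play that does not involve the batter.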

Annotations and File Format

The file is something a little like CSV, although I have never handed it to a CSV parser. Only ERD codes are visible in the CSV format, and all other information appears as specially formatted comments, introduced by the '#' character at the start of the line. Incidentally, Unix line endings are normally used, although the parser is flexible.

The game begins with 5 header lines each starting with '## '. On line one, we have the visiting and home team codes, followed by a date and time code which is really the retrosheet game ID minus the home team code, so the game ID can be reassembled from the two. Line 2 gives the score. Line 3 contains all miscellaneous information about the game. The example chosen shows all possible pieces of information except one, which will be discussed specially below. The word FLAGS is always present. IN7 indicates a 7-inning doubleheader game (so, e.g., a walkoff is possible in the 7th inning, and an automatic baserunner may appear in the 8th (not in the same game)). XR2 indicates that there will be an automatic runner on second in extra innings. In principle other bases could be indicated. TIME gives the game length in minutes and is only present if a nonzero value was reported. WKD gives the weekday as a 3-letter ISO code. TAB gives the ballpark code to identify the ERV table for this game. Lines 4 and 5 give the complete roster of players for this game. In square brackets each player gets a roster entry consisting of full name:retrosheet id:letter code. Starting players additionally get a parenthesized code as is used in the game giving letter code:fielding code.

In the game itself, each half inning is represented by two lines, the first a comment and the second the pseudo-CSV of the ERD codes. The two are vertically aligned, play by play. The first line begins with '# ' followed by the half-inning number (starting from 1) and a ':', followed by a comma-separated list of two-element descriptors. The first element is called the annotation, and is a sort of half-traditional human readable (to the author) description of each play; the second is the baserunning code in angle brackets.

The annotation is based on the retrosheet coding of the play, but with many small differences. All of the really peculiar ones are supposed to be described here for reference. One of the most unobvious is that N can be used as a suffix to mean 'iNterference', normally followed by a code indicating who did the interfering. In this case fielders are indicated by number and others by special letters: 'B' batter, 'R' runner, 'F' fan, 'U' umpire. Obstruction is 'OB'. When one runner is out passing another, that's 'RPAS'. When a runner is out hit by the ball off the bat, that's 'RBAT'. Errors are signalled by 'E', but not all are present (e.g. foul errors are generally omitted because nothing happens). Double plays should contain 'DP' and triple plays 'TP', and these are only present if the outs are finally made.

When the ball is put in play there is a prefix code which indicates how, in a certain rather peculiar categorization. The main opposition is between 'G' for ground and 'F' for fly. 'F' has quasi-synonyms 'P' for pop-up and 'L' for line-drive, and any of these three may appear alone (before a fielding code) for an out. Bunts get the letter 'B' prefixed to 'G', 'P', or 'L'. The sac-fly is 'SF', however the sac-hit is not indicated in any clear way. I maintain retrosheet's distinction between 'FO' for force outs, and 'FC' for tag-outs, and these may have prefixes listed above if they're appropriate. The remaining letters mainly mean what they normally mean in baseball, if you can peel off the prefixes and suffixes. You may be excused for not knowing off the bat what is a 'LDGRNF' - a line-drive double ground rule interference by a fan - or a 'BPFCE' - bunt pop-up fielder's choice error, or even a 'PIF' pop-up infield fly rule. There's a new thing called 'FCNO' which is a fielder's choice no out with no error. A good pronunciation should auto-suggest.

The hits are 'S', 'D', 'T', 'HR'. If the first three are followed by fielding codes somebody is out unless there's an error. If 'HR' is followed by numbers it is inside the park, and nobody is out. All of these things are clearly indicated in the ERD code and the baserunning codes. Fielding codes only indicate handling of the ball, and never mere positioning. Baserunning plays have a fairly peculiar coding, but this comes straight from retrosheet, and from the way such plays are scored. In short, you have 'SB', 'CS', 'PO', 'POCS' (both on the same out), 'WP', 'PB', and the weird 'OA' which is really an out on what would otherwise have been a WP or PB. And 'BK' for balk.

None of these are used for any statistical purpose.

The remaining complication from line 3 of the header is when the home team bats first, indicated by the flag 'HTBF'. My coding has the home team batting second, and the home team is coded as a visitor. The ballpark code is for the real home team, and the game is from the same file as in retrosheet. In other words, the retrosheet codebase is apparently flexible about the team batting order but inflexible about the file location of the game, whereas my codebase is the other way around. Everything after this is normal, as far as the transformation goes. Note the game ID is formed with the visiting team code. When processing an ERD file, that's basically all you need to cue from the HTBF flag.

PAPG

Count all plate appearances and all games universally. PAPG is the ratio of plate appearances to games, divided by 18, the number of batting slots per game.

ACOR

For every play for which the pitcher is the pitcher of record, the ERD is added to the average, and the number of plate appearances is counted. These are split up by year:team:pitcher:series, where the series is one of RS PS WC LDS LCS WS (PS is the sum of everything except RS). The accumulated average is divided by the number of PA and tabulated by year (file name) and pitcher, team, and series. For display, ACOR is multiplied by 9 * PAPG and then rounded to 3 decimal places. Negative values are better than positive values, as the opposing team's runs are being counted. ACOR stands for Average Contribution to Opponents' Runs.

For precision, separate names will be given to three different interpretations of this stat by suffixing the letters 'c' for Count, 'r' for Ratio, and 'a' for Adjusted. First, the sum of ERD terms for a player is ACORc, which has units of runs. When ACORc is divided by plate appearances, this produces ACORr with units of runs per plate appearance. Finally ACORr is multiplied by 9 * PAPG to produce ACORa, with units of runs per game. The PDF products refer to ACORa exclusively, and call this ACOR. The data files on disk contain ACORr, which is called the pitching average.
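The three interpretations can be written out as a short sketch. The function names and inputs are hypothetical, and PAPG is read here as plate appearances per game per batting slot, which is what makes 9 * PAPG a per-game factor for a whole lineup.

```python
def papg(total_pa: int, total_games: int) -> float:
    """Plate appearances per game per batting slot (18 slots in a game)."""
    return total_pa / total_games / 18


def acor_forms(erd_values, plate_appearances, papg_value):
    """Return (ACORc, ACORr, ACORa) for one pitcher."""
    acor_c = sum(erd_values)                     # runs
    acor_r = acor_c / plate_appearances          # runs per plate appearance
    acor_a = round(acor_r * 9 * papg_value, 3)   # runs per game, as displayed
    return acor_c, acor_r, acor_a
```

For example, with a made-up PAPG of 4.0, a pitcher whose ERD values sum to -0.5 over 4 plate appearances has ACORr of -0.125 and a displayed ACOR of -4.5.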

This number and ACR are centered on zero by virtue of the fact that the E[0][0] value at the start of the half-inning is never attributed to any player, and so sets the level for what is to follow, defining an average to beat or fall short of, which becomes the sign of the player's statistic.

Another factor to consider in any ratio statistic is how long it takes to converge to a sensible value. Too few plate appearances in a per plate appearance average make for essentially random numbers. Not all statistics take the same number of plate appearances to converge. ACOR and ACR are fairly quick to converge, although the fractional R stat is specially designed to make sense of a single game. Overall I would rate the length of time required for the various main statistics to converge as fR < ACOR+ACR < BCR < ACW < FCOR.

ACR

For every plate appearance for which the batter is the batter of record, the ERD resulting from the play off the final pitch (a 'p'-type play) is added to the average, and the number of plate appearances is counted. For other ('n'-type) plays, one of three sub-algorithms is invoked depending on the circumstances:

1. If the ERD is negative and at least one baserunner is out:

Then divide the ERD value equally among all who were out

2. If the ERD is positive and at least one baserunner advanced:

Then divide the ERD value equally among all who advanced

3. Otherwise:

Divide the ERD value equally among all baserunners who either advanced or were out

These values are split up by year:team:hitter:series. The accumulated average is divided by the number of PA and tabulated by year (file name) and hitter, team, and series. For display, ACR is multiplied by PAPG and then rounded to 3 decimal places. Positive values are better than negative. ACR means Average Contribution to Runs. Similar to ACOR, this is available in three interpretations, ACRc, ACRr, and ACRa, with the same definitions.
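The three sub-algorithms for n-type plays can be sketched as a single function. The outcome labels are hypothetical, and the case where no baserunner either advanced or was out is not covered by the rules above, so the sketch simply credits nobody.

```python
def split_n_type(erd: float, outcomes: dict) -> dict:
    """Share the ERD of an n-type play among baserunners.

    `outcomes` maps each baserunner's letter code to "out", "advanced",
    or "stayed" (hypothetical labels for this sketch).
    """
    out = [p for p, o in outcomes.items() if o == "out"]
    advanced = [p for p, o in outcomes.items() if o == "advanced"]
    if erd < 0 and out:
        sharers = out                 # rule 1
    elif erd > 0 and advanced:
        sharers = advanced            # rule 2
    else:
        sharers = out + advanced      # rule 3
    if not sharers:                   # not covered by the text; nobody is credited
        return {}
    share = erd / len(sharers)
    return {p: share for p in sharers}
```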

L/RACR

Only the ERD values of completed plate appearances are used to accumulate these averages. The values of all n-type (and r- and u-type) plays are disregarded, as are plays where the pitcher's throwing hand is undefined. The averages are calculated as for ACR but without the baserunning rules, and separate averages are maintained for left- and right-handed pitchers. Interpretations 'c', 'r', and 'a' are the same as for ACR.

ERD Probabilities

These answer a set of questions very vaguely similar to the old single, double, triple type of stats, but with a sharper interpretation. For every play, the ERD value is classified according to whether it is at least 1, 0.5, 0.2, or 0, or, in the negative sense, at or below 0, -0.2, -0.5, or -1. For each player, separate counts are maintained for all of these categories (hitting and pitching). Values are assigned as follows. The hitter and pitcher of record are assigned all ERD values that occur during their time in those roles. In addition, non-plate appearance (n-, r- or u-type) plays are assigned to all baserunners. Finally, a count is maintained of the total number of plays elapsed (the denominator for the 'r' stat). This is counted for the hitter, pitcher, and baserunners on n-, r- or u-type plays. Only the 'c' stat is stored, along with the denominator. The 'a' stat is calculated from the ratio by multiplying by 100 and rounding to the unit. These categories are only tabulated per year, not per team and per series. The -1 category is not displayed, being too rare to be interesting. Another derived stat is displayed, called 'si' for slugging index, and calculated as

sir = +.5r + 2 * +1r
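As an illustration, the sketch below classifies ERD values and computes si. It reads the thresholds cumulatively (a value of 1.3 counts in every positive category), which is one interpretation of "at least", and the category labels other than +.5 and +1 are made up for the example.

```python
POSITIVE = [("+1", 1.0), ("+.5", 0.5), ("+.2", 0.2), ("+0", 0.0)]
NEGATIVE = [("-0", 0.0), ("-.2", -0.2), ("-.5", -0.5), ("-1", -1.0)]


def classify(erd: float):
    """Every category whose threshold the ERD value reaches (cumulative reading)."""
    cats = [name for name, t in POSITIVE if erd >= t]
    cats += [name for name, t in NEGATIVE if erd <= t]
    return cats


def tally(erd_values):
    """Return (category counts, number of plays) for a list of ERD values."""
    counts = {name: 0 for name, _ in POSITIVE + NEGATIVE}
    for erd in erd_values:
        for name in classify(erd):
            counts[name] += 1
    return counts, len(erd_values)


def slugging_index(counts, plays):
    """sir = +.5r + 2 * +1r, with each 'r' the count divided by plays."""
    return counts["+.5"] / plays + 2 * counts["+1"] / plays
```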

BCR

BCR, or Baserunning Contribution to Runs, is a derived stat based on the general concept of ACR and ACOR. The idea is to consider every play in which a player is on base at the start of the play. An average of ERD values is maintained, but with a difference. Most of the ERD value comes not from the baserunners directly, but from the actions of the hitter and pitcher. So on every play, the per-plate appearance average of the hitter and pitcher is subtracted to eliminate this contribution - on average, in the long run. Specifically, on each play, and for each baserunner:

BCRa += ERD - ( ACORr + ACRr )

Plate appearances are counted as usual (assigned to the baserunners) and used as the denominator for the 'r' stat. At present, the 'a' stat is computed exactly as for ACR, although in principle a slightly different number should be used to produce a true runs per game average, i.e. average plate appearances on base per game. BCR is tabulated per year, not per team and per series.

FCOR

FCOR is the counterpart of BCR for fielding, and is calculated in much the same way. The fielding positions are categorized as catcher, infielder, and outfielder. The reason for lumping these together is simply to accumulate more plays for the average, as FCOR takes an especially long time to converge. The reason is simply that, instead of attempting to categorize individual plays by who might have fielded it (or tried to or should have tried to), every play is assigned to all 8 fielders. Incidentally, the subtraction of the pitching average would produce an uninteresting pitcher fielding average. The 'a' form is calculated as for ACOR, and FCOR is tabulated per year, not per team and per series. The catching, infielding, and outfielding averages may be referred to as CFCOR, IFCOR, and OFCOR.

Fractional R

The calculation of this stat is unusually complicated. Some preliminary concepts need to be introduced. In ACR and ACOR, distinction is made between p-type and n-type plays, as distinguished in the ERD codes (and the corresponding q/r w/u codes which are equivalent). Some finer distinctions need to be made here. The main type of play concerned is called PNO for p-type or n-type with at least one out. u-type plays also get special handling. Mainly a count is made in every half-inning of the number of PNO plays which occur with a given number of initial outs, nPNO[o]. Player averages of ERD values are accumulated as in ACR and ACOR, and also a count of PNO plays attributed to the player with a given number of initial outs nPNO[p][o]. At the end of the half-inning corrections are added to each player average of the form

( E[0][o] - E[0][o+1] ) * nPNO[p][o] / nPNO[o]

Where the E[0] becomes E[2] with the automatic runner on second.
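The correction step can be sketched as below. The data layout is hypothetical, E0[3] is taken as zero, and out counts with no PNO play (for instance after a double play) are simply skipped, since the text does not say how they are handled.

```python
def pno_corrections(E0, npno_player, npno_total):
    """Per-player end-of-half-inning corrections.

    E0[o]             expectation with o outs in the starting base state
                      (bases empty, or runner on second in extra innings);
                      E0[3] is taken as 0 (assumption).
    npno_player[p][o] PNO plays attributed to player p with o initial outs
    npno_total[o]     all PNO plays in the half-inning with o initial outs
    """
    corrections = {}
    for player, counts in npno_player.items():
        total = 0.0
        for o in range(3):
            if npno_total[o] == 0:    # no PNO play started with o outs
                continue
            total += (E0[o] - E0[o + 1]) * counts[o] / npno_total[o]
        corrections[player] = total
    return corrections
```

When one PNO play occurs at each out count, the corrections add up to E0[0], the starting expectation that the ERD sums leave unattributed. That appears to be what lets the per-player values in a half-inning add up to the runs actually scored.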

Fractional plate appearances are generated in two cases: u-type plays, and n-type plays. For a walkoff u-type play, the pitcher is given a flat 0.1 PA if they don't have any PA already. When a relief pitcher is brought in, the outgoing pitcher is examined to see if they have any PA. If not, it is checked whether they were pitcher of record for at least one play. If so, a count called NPIT is made, and increased for every subsequent relief pitcher. NPIT is reset when a plate appearance is completed, and every pitcher counted as NPIT receives 1/NPIT plate appearances (NPIT is at least two when reset).

Decisions

The selection of the winning and losing hitter and pitcher proceeds by a series of cuts. First, all members of the appropriate team are included if they have more than zero plate appearances (including fractions described above). Next, the fractional R stat is used in a manner called "would have won/lost" based on runs per plate appearance as follows. For each player, the statistic R/pa for that player is compared to the same statistic for the entire opposing team. For a player to be the winning hitter, R/pa must be greater than R/pa for the opposing offense, and for the losing hitter it must be less. For the winning pitcher, R/pa must be less than that for the pitcher's team's offense, and for the losing pitcher greater.

Each decision is completed from the list thus generated by making further cuts. All of the decisions have in common that all players remaining after the last cut share the decision. In some cases, all the players who pass the "would have won/lost" test share the decision, but in most cases the list is reduced to one.

Winning Hitter

After the R/pa cut, the highest R among the remaining players is selected, and all players with this R share the decision.

Losing Pitcher

After the R/pa cut, the highest R among the remaining players is selected, and all players with this R share the decision.

Winning Pitcher

After the R/pa cut, the highest number of outs among the remaining players is selected, and all players with this number continue. Then the lowest R among the remaining players is selected, and all players with this R continue. Finally the lowest number of (possibly fractional) plate appearances is selected, and all remaining players share the decision. If no decision is produced, the procedure is repeated, with the pa > 0 and R/pa cuts omitted.
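The sequence of cuts for the winning pitcher might be sketched as follows. The record layout (dicts with 'name', 'R', 'pa' and 'outs') is hypothetical, and the fallback is read as rerunning the same three cuts on every pitcher on the team.

```python
def keep_extreme(players, key, pick):
    """Keep every player tied for the extreme (min or max) value of `key`."""
    if not players:
        return []
    best = pick(p[key] for p in players)
    return [p for p in players if p[key] == best]


def winning_pitchers(pitchers, own_offense_r_per_pa):
    """Apply the winning-pitcher cuts and return the names sharing the decision."""
    def cuts(candidates):
        candidates = keep_extreme(candidates, "outs", max)
        candidates = keep_extreme(candidates, "R", min)
        return keep_extreme(candidates, "pa", min)

    # "Would have won" cut: pa > 0 and R/pa below the pitcher's own offense.
    passed = [p for p in pitchers
              if p["pa"] > 0 and p["R"] / p["pa"] < own_offense_r_per_pa]
    result = cuts(passed)
    if not result:                   # fallback: drop the pa > 0 and R/pa cuts
        result = cuts(list(pitchers))
    return [p["name"] for p in result]
```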

Losing Hitter

After the R/pa cut, the number of plate appearances for the remaining players is examined and the median calculated. If the median is greater than 2, one is subtracted. All remaining players are eliminated if their number of PA is less than this. Then the lowest R among the remaining players is selected, and all players with this R continue. Finally the highest number of plate appearances is selected, and all remaining players share the decision.

Game Probability Estimator

First, at each point in the game the expectation value from the ERV table in use for the current half-inning is added to the offensive score for the current game state. Then the difference is formed between the home and visiting scores thus adjusted. This home minus visitor fractional score difference is called X for this purpose. This is calculated using the all-time ERV table for all non-PW half-innings instead of the ballpark tables. This is done for consistency. The estimated probability for the home team to win is calculated as

hwp = 0.5 + ( 1 / pi ) * arctan ( Z + ( X - A[o] ) / B[o] )

Where o is the number of outs in the game, A[o] and B[o] are tables of fitted coefficients for the estimator, and Z is a value calculated to adjust the game-initial probability, which is constant for the length of a game. The values A[o] and B[o] are fitted in advance by simulated annealing, using the entire retrosheet event file database. The number of outs tabulated runs from 0 to 59, where extra innings are looped around from outs 54-59 as often as necessary.
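In code, the estimator and the out-index wraparound might look like the sketch below. The coefficient tables used with it are placeholders, since the fitted A[o] and B[o] are not given here, and z_for_initial is a hypothetical helper that just inverts the formula at the first out index to find the Z producing a chosen game-initial probability.

```python
import math


def out_index(outs: int) -> int:
    """Map the game's out count onto 0..59, looping extra innings over 54..59."""
    if outs < 60:
        return outs
    return 54 + (outs - 54) % 6


def hwp(x, outs, A, B, z=0.0):
    """Estimated home win probability from the fractional score difference x."""
    o = out_index(outs)
    return 0.5 + math.atan(z + (x - A[o]) / B[o]) / math.pi


def z_for_initial(p0, x0, A, B):
    """Z that makes the game-initial estimate equal p0 (inverts the formula at o = 0).

    x0 is the fractional score difference at the start of the game.
    """
    return math.tan(math.pi * (p0 - 0.5)) - (x0 - A[0]) / B[0]
```

Because the arctangent only approaches its limits asymptotically, the estimate stays strictly between 0 and 1 for any finite score difference.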

When Z is left to zero, the estimator reports a game initial probability of 53% for the home team to win, equal to the all time record. It may be desirable to adjust this to particular values when more information is available, without re-fitting the estimator's parameters. This is done by adding in the Z term. For regular season games, a ballpark home win rate is calculated for each ballpark code, and the Z value is adjusted to make this value the initial estimate. An even better reason for making this adjustment comes from the postseason. On the basis of the above estimator, a probability estimator for 232-format series was developed.

The estimator gives the probability of the home advantage team to win the series, with the series state taken as ( home advantage team wins, other team wins ). A table of observed win rates is calculated for each state, and then during a game the estimate is taken by using the home-team win probability calculated above to lever between the observed win rates for the states where the home team wins or loses. E.g. for game 1, 2, 6, or 7 that is

swp = hwp * s232[ hw + 1 ][ vw ] + ( 1 - hwp ) * s232[ hw ][ vw + 1 ]

where hw is the number of games won so far by the home advantage team, vw for the other team, and s232[ hw ][ vw ] is the table of win rates for 232-format series. The reason for the Z adjustment in the hwp estimator is to make the swp estimate continuous. In other words, the table of 232 series states implies game-initial probabilities via equations such as

swp0 = ( s232[0][0] - s232[0][1] ) / ( s232[1][0] - s232[0][1] )

which gives the series-initial estimate.
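The lever and the implied game-initial probability can be sketched as below for the games where the home team is the home-advantage team. The table values in the usage note are invented; the real s232 table is built from observed series results.

```python
def swp(hwp, s, hw, vw):
    """Series win probability for the home-advantage team during a game it hosts.

    s[hw][vw] is the table of observed series win rates, indexed by wins of the
    home-advantage team and then wins of the other team.
    """
    return hwp * s[hw + 1][vw] + (1.0 - hwp) * s[hw][vw + 1]


def implied_game_initial(s, hw=0, vw=0):
    """Game win probability at which swp equals s[hw][vw], so that the estimate
    does not jump at the start of the game."""
    return (s[hw][vw] - s[hw][vw + 1]) / (s[hw + 1][vw] - s[hw][vw + 1])
```

With made-up rates s[0][0] = 0.55, s[1][0] = 0.70 and s[0][1] = 0.40, the implied game-initial probability is 0.5, and plugging it back into swp returns 0.55.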

ACW

ACW is a log-probability-ratio statistic for hitting and pitching. It answers the question: what effect do this player's actions have on their team's chance to win the game? This is in principle calculated like a run-value statistic, but there is a mathematical wrinkle: probabilities are not additive, as 1 is the maximum value a probability can have. In order to make such an additive statistic - which is necessary if it is to be averaged per plate appearance - you can take the logarithm of a ratio, which then behaves like a difference. The probability goes to 0 or 1 at the end of a non-tie game, which causes another problem, that the logarithm of 0 is minus infinity. This in turn can be avoided by always calculating from the winning team's point of view, and then negating each term for the losing team. In effect, the game is played time-reversed for the losing team. It is centered on zero by virtue of the fact that the probability estimator starts in the middle of the probability range. To explain in more detail, a single term in a player's ACW is calculated in two steps. First, the number for a play equivalent to an ERD value, called an HWD, is calculated in one of two ways depending on whether the home team won:

HWD = log(hwp) - log(previous hwp)

or lost:

HWD = -log(1 - hwp) + log(1 - previous hwp)

Finally, the HWD is added to the averages of home team hitters and visiting team pitchers, and subtracted from visiting hitters and home pitchers. In other ways, this stat is calculated in the same way as ACR and ACOR. In particular, the rules for attributing the HWD to baserunners are the same as for ACR, but with the sign of the HWD in place of the sign of ERD. This produces ACWc and ACWr. For ACWa, the factor is 10 * PAPG for both hitting and pitching. The ACWc for a nominal win is equal to log(2). For ACWa, a single plate appearance with this value would display as 29.493. This is hardly ever relevant, which is why I didn't choose a more sensible value for this.
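The two HWD formulas can be put side by side in a short sketch. The probability paths used with it are invented, and game_hwds is a hypothetical helper that pairs up consecutive estimates.

```python
import math


def hwd(prev_hwp, hwp, home_won):
    """Log-probability-ratio term for one play, from the winner's point of view."""
    if home_won:
        return math.log(hwp) - math.log(prev_hwp)
    return -math.log(1.0 - hwp) + math.log(1.0 - prev_hwp)


def game_hwds(hwp_path, home_won):
    """HWD terms for a sequence of estimates, one estimate per game state."""
    return [hwd(a, b, home_won) for a, b in zip(hwp_path, hwp_path[1:])]
```

For a made-up home win with estimates 0.5, 0.6, 0.45, 0.8, 1.0 the terms telescope to log(1) - log(0.5) = log(2), which matches the nominal-win value quoted above. For a home loss the terms sum to -log(2) before the sign flip for the visiting side.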

ACW has never been calculated for a live game but it could be, although not knowing who wins there are two potential values. I provisionally call them hot ACW and cold ACW, based on the assumption of the player's team winning or losing respectively. I eventually plan to add support for calculating this to the program bb-erd, but it requires kind of a lot of information to be entered - ERD code, ballpark code, score difference, and half-inning number for each play for a start, and the baserunner rules are something else.

The X-factor

X for a game is calculated as

X = sqrt( sum( ACW^2 ) )

X only has Xc and Xa values, and the latter is multiplied by 20. Like ACW, X is undefined for a tie game. Why compute this? It's basically a measure of how "long a path" the probability estimator takes to the win or loss. Or in more technical language, a confusion metric, as it is high when the estimator can't make up its mind, i.e. has low information. In ordinary human terms it means an exciting game.

X is also defined for series as

series X = sqrt( sum( game X^2 ) )
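Both forms of X are sums in quadrature, which a few lines can show. The sketch assumes the game sum runs over the per-play log-probability terms, and the function names are hypothetical.

```python
import math


def game_x(terms):
    """Xc for one game: root of the summed squares of the per-play terms."""
    return math.sqrt(sum(t * t for t in terms))


def x_display(xc):
    """Xa: the displayed value, Xc multiplied by 20."""
    return 20 * xc


def series_x(game_xs):
    """Series X: the game X values combined in quadrature."""
    return math.sqrt(sum(x * x for x in game_xs))
```

A game in which the estimate swings back and forth accumulates many squared terms and so a large X, while a game decided early accumulates few.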

Career Rankings

There are all-time rankings in each of ACR, ACOR, BCR, CFCOR, IFCOR, OFCOR, hitting ACW, and pitching ACW. For each ranking there is a minimum of plate appearances. This is equal to 200 for everything except FCOR, where it is 12000 due to its long time to convergence.

Combined Rankings

This is the first of the advanced combined metrics, which involve the product of (at least) two terms, the first a sum of run value terms, the second a sum of ACW terms. Other items may also be included multiplicatively. The reason for this is unit algebra. The run value terms sum because they're in units of runs or runs per plate appearance. The ACW terms also naturally add, thanks to the logarithm. But terms in different units shouldn't be added, but rather multiplied.

There is a complication, which is that the product of two or more signed terms does not produce a satisfactory scalar, due to sign ambiguity. One hitter with positive ACR and ACW would wind up the same as another hitter with negative ACR and ACW. There are different ways of dealing with this, but each ranking must choose one.

Regular Season MVP

The regular season MVP has more multiplicative terms than the other combined rankings, and uses exclusion to defeat sign ambiguity. There are awards every year for two pitchers and two hitters, but the categories are unusual: LHP, RHP, B vs LHP, B vs RHP. I think these are more natural and interesting categories for the sport as it has evolved to the present day. First, the metrics are defined as (all numbers for the regular season only; O is outs):

B vs LHP = ACRc * LACRc^(3/2) * ACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
B vs RHP = ACRc * RACRc^(3/2) * ACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
LHP = ACORr * ACWr * O^2 * sqrt( (W - L) / (W + L) )
RHP = ACORr * ACWr * O^2 * sqrt( (W - L) / (W + L) )

Hitters are eligible for either or both awards if the following are positive:

ACRr ACWr LACRr RACRr (W - L) (W + L)

The +1c term is made positive by adding 0.01 prior to the sqrt, in order to give unique rankings to players with zero +1c.

Pitchers are eligible based on their throwing hand and whether the following are positive:

-ACORr ACWr (W - L) (W + L)

The eligibility criteria are relatively strict, but they produce enough runners up every year. These awards are a combination of more factors than the others to make them a little more unpredictable.
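The hitter metrics and their eligibility test can be sketched as follows. The dict layout is hypothetical, and the 0.01 offset is applied to +1c inside the square root as described above.

```python
import math


def hitter_eligible(s):
    """All of ACRr, ACWr, LACRr, RACRr, W - L and W + L must be positive."""
    return all(v > 0 for v in (s["ACRr"], s["ACWr"], s["LACRr"], s["RACRr"],
                               s["W"] - s["L"], s["W"] + s["L"]))


def b_vs_hand(s, hand):
    """'B vs LHP' (hand='L') or 'B vs RHP' (hand='R') metric, or None if ineligible."""
    if not hitter_eligible(s):
        return None
    split_c = s[hand + "ACRc"]                 # LACRc or RACRc
    record = (s["W"] - s["L"]) / (s["W"] + s["L"])
    return (s["ACRc"] * split_c ** 1.5 * s["ACWr"]
            * math.sqrt(s["+1c"] + 0.01) * math.sqrt(record))
```

Because eligibility forces every factor to be positive, the product cannot pick up a misleading sign from two negative terms.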

Postseason Series MVP

This is the simple and classic combined ranking. The run value term is a simple sum:

smvpRV = ACRc - ACORc

And the log-probability-ratio term is equally simple (h and p refer to hitting and pitching):

smvpLPR = hACWc + pACWc

The full metric is just:

smvp = smvpRV * smvpLPR

Eligibility is based on the following being positive:

smvpRV smvpLPR (hPA + pPA)

This seems to be a reliable and sensitive ranking putting hitters and pitchers on an equal footing over the length of a series.
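Written out as code, the series MVP metric is only a few lines. The dict layout is hypothetical, with zeros for a role the player did not fill.

```python
def series_mvp(s):
    """smvp for one player in one series, or None if not eligible.

    `s` holds ACRc, ACORc, hACWc, pACWc, hPA and pPA (0 where a role was not played).
    """
    rv = s["ACRc"] - s["ACORc"]          # smvpRV
    lpr = s["hACWc"] + s["pACWc"]        # smvpLPR
    if rv > 0 and lpr > 0 and (s["hPA"] + s["pPA"]) > 0:
        return rv * lpr
    return None
```

A pitcher's negative ACORc enters with a minus sign, so runs prevented count the same way as runs produced.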

Hall of Fame

The purpose of the Hall of Fame (hof) is not just to put hitters and pitchers together, but to rank essentially everyone, i.e. to have minimal eligibility criteria, in this case having at least one plate appearance as either hitter or pitcher. This necessitates an alternative to the sign ambiguity question, and the answer is to include a global bias term for each of the three (RV, WL and LPR) multiplicative terms to make everyone just positive. This proceeds in a few steps. First, the universal run value and log-probability-ratio terms, which are based on all combined statistics, and then the win-loss record:

hofRV = ACRc + BCRc - ( ACORc + (FCORc / 3) )
hofLPR = hACWr + pACWr
hofWL = HW + PW - (HL + PL)

Next, the bias terms (negated because they're always negative):

hofRVbias = - min (all players) hofRV + 1e-9
hofLPRbias = - min (all players) hofLPR + 1e-9
hofWLbias = - min (all players) hofWL + 1

Finally, the combined metric:

hof = ( hofRV + hofRVbias ) * ( hofLPR + hofLPRbias ) * sqrt(hofWL + hofWLbias) * ( total PA from ACR+ACOR+BCR+(FCOR/3) )^(1/4)
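The steps above can be sketched for a whole population of players at once, since the bias terms depend on the minimum over everyone. The dict layout and the plate-appearance key names are hypothetical.

```python
import math


def hof_scores(players):
    """players: name -> dict with ACRc, BCRc, ACORc, FCORc, hACWr, pACWr,
    HW, PW, HL, PL and the plate-appearance counts ACR_PA, ACOR_PA, BCR_PA,
    FCOR_PA (a hypothetical layout)."""
    rv = {n: p["ACRc"] + p["BCRc"] - (p["ACORc"] + p["FCORc"] / 3)
          for n, p in players.items()}
    lpr = {n: p["hACWr"] + p["pACWr"] for n, p in players.items()}
    wl = {n: p["HW"] + p["PW"] - (p["HL"] + p["PL"]) for n, p in players.items()}

    # Global biases: shift every term so the lowest player is just positive.
    rv_bias = -min(rv.values()) + 1e-9
    lpr_bias = -min(lpr.values()) + 1e-9
    wl_bias = -min(wl.values()) + 1

    scores = {}
    for n, p in players.items():
        total_pa = p["ACR_PA"] + p["ACOR_PA"] + p["BCR_PA"] + p["FCOR_PA"] / 3
        scores[n] = ((rv[n] + rv_bias) * (lpr[n] + lpr_bias)
                     * math.sqrt(wl[n] + wl_bias) * total_pa ** 0.25)
    return scores
```

Every score comes out positive, so the ranking is a plain sort on the result, with the lowest player in any one term pinned close to zero.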

Finally, FCOR is reduced in strength by a factor of 3, which turns out to be a surprisingly complicated judgement. The basic reason is that FCOR is the least reliably calculated of the main stats, and by the nature of baseball it was giving itself, in effect, too much weight in the sum, having typically 9 or so times as many plate appearances as hitting. Why reduce it by a factor of 3 and not something else? Because it also, as a side effect, changes the balance between hitters and pitchers in the ranking. It doesn't cause one to simply rise or fall, but reducing FCOR has the effect of making pitchers cluster nearer the center of the list. This effect is visible in the published rankings to some extent - hitters tend to be at the very top and bottom, although a few pitchers come very close at both ends. Hitting averages tend to be more concentrated than pitching averages, and the effect of FCOR in the RV term is to moderate the overall RV of hitters, as FCOR tends to be even smaller than ACOR. The factor of 3 seems to best balance these two competing tendencies.

What's next

Wow, what a year. My team won the World Series! I'll have to wait for the data files like everyone else, but I feel pretty confident that Freddie will get the arctex MVP. With baseball all done, it will soon be time for me to take another whack at arctex over the holidays. I have a pretty good list of things queued up to work on. First, some simple things. I intend to make the constant factor for BCRa reflect the average number of plate appearances on base. I also decided it makes more sense to have the constant factor for ACWa be the same as for Xa. The correctors for BCR and FCOR are calculated per plate appearance, but are applied per play, which seems wrong, so I'll fix that. I'm contemplating giving ACW a role in the decisions, possibly in such a way that the individual decisions may be suppressed, e.g. in a 1-0 game there may not be a losing pitcher. I'm thinking of doing a big linear regression on game time versus PA, pitching KOs, and half-inning changes, and then redefining the P stat based on those results, i.e. subtracting out the estimated time for KOs and inning changes per game before computing P, which should sharpen the stat. The series MVPs should be listed somewhere near the series summaries. The problem is that there isn't much room on the page for 2020, so the MVPs might go underneath the leaderboard on the following page. Oh, and I never got around to coding in the new end-of-season tiebreaker rules, so those should go in also.

The big thing I want to do for the next update is to sharpen the ACW and X stats by introducing two new probability estimators. There's already an experimental estimator for the 232-format series, but so far I haven't used it to calculate any stats, basically because of the limitation to the 232 format. So the first order of business is to introduce a new binomial-model estimator for any series format, and then to use it to calculate stats that will be called sACW and sX, where the existing stats will be renamed gACW and gX, g for game and s for series. The X values for the postseason series will be replaced by sX, which should give a better ranking. For example, the 2004 ALCS ranks only #21 all time in gX, but it will surely rank much higher in sX. Also sACW can have a role in the series MVP award, possibly in combination with gACW. Finally, there will be the big one, what I call the u (for unified or universal) estimator, essentially "win the World Series from opening day". The way this will work is that during the regular season I will have a function with values tabulated from history, of the expected number of games to exceed a threshold, given games behind and games left. This same function will be used for all postseason slots, with each team's value being a sum for all the slots they're eligible for. With these values for all teams in hand, they will be converted to probabilities by normalizing the sum of all teams to one. Once the postseason starts, the u estimator will switch to pasting together the s estimators for all of the series (as-yet-unplayed postseason series will be treated as a sequence of 50/50 coin flips). These numbers determine the probabilities at the beginning and end of each game, and the game estimator will be used to lever between these like the 232 estimator does now during games. This will produce new stats uACW and uX.
The season X numbers in the season summary table will be replaced with uX, which should be a far more interesting number, and wider ranging than the current ones that are always around 250-300 or so. In fact I expect a seasons-by-x file will be called for. uACW is a little stranger because it will give up on teams eliminated from the postseason while they're still playing. I see it having a role in the hall of fame ranking, but a limited one in combination with gACW. All of gACW, sACW and uACW will be in the player career summaries. I should warn that the u estimator will be a lot of coding work, and so it will probably start out in a somewhat simplified form that will be expanded and corrected over time. Finally, I'm thinking about a new stat based on X, the swing inning for a game and the swing game for a series, being the one where cumulative X exceeds 50% of the total. It remains to be seen whether this will be interesting enough. If it is, the swing game will be marked on the willow, and the swing half-inning will be added to the "T X ! P" header. Whew, that should be enough work for my vacation!
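To give a feel for what a binomial-model series estimator could look like, here is a minimal Python sketch. It is one plausible reading of the plan above and not the planned arctex code: the function name and arguments are hypothetical, and it assumes that every remaining game is an independent trial with the same win probability p, ignoring home field and everything else the game estimator knows about.

```python
from math import comb


def series_win_prob(wins_a, wins_b, wins_needed, p=0.5):
    """Probability that team A wins a first-to-`wins_needed` series.

    wins_a, wins_b: games already won by each team.
    p: assumed probability that A wins any single remaining game.
    """
    need_a = wins_needed - wins_a  # wins A still needs
    need_b = wins_needed - wins_b  # wins B still needs
    if need_a <= 0:
        return 1.0
    if need_b <= 0:
        return 0.0
    # A takes the series if it collects need_a wins while losing
    # k = 0 .. need_b - 1 more games, with the last game played an A win.
    return sum(
        comb(need_a - 1 + k, k) * p ** need_a * (1 - p) ** k
        for k in range(need_b)
    )
```

With p = 0.5 this reproduces the coin-flip treatment of unplayed series: series_win_prob(0, 0, 4) is 0.5, and a team down 0-3 in a best-of-seven, as the Red Sox were in the 2004 ALCS, gets series_win_prob(0, 3, 4) = 0.0625.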

A description of each file in the distribution.

First the executables, in run_everything order:

@ indicates that inputs or outputs are made via the generate script and not directly

run_everything	Master script that runs all the necessary generate targets in correct order
download	Download alldata.zip from retrosheet.org
		writes: alldata.zip
generate	Substitute for a proper makefile, also miscellaneous code snippets
		The reference for how to run each command properly (see run_everything for
		the proper order)
		reads: alldata.zip erd/ basic-stats guides/ bb-post hitting-stats/ pitching-stats/
		baserunning-stats/ hw-hitting-stats/ hw-pitching-stats/ fielding-stats/ gacw/
		writes: rs/ basic-stats papg erv-lib never-seen bb-post.ps sample.ps sample.pdf
		best-RS-hitting best-RS-pitching best-RS-baserunning best-RS-hw-hitting
		best-RS-hw-pitching best-RS-catching best-RS-infielding best-RS-outfielding
		unfiltered-best-RS-hitting unfiltered-best-RS-pitching
		unfiltered-best-RS-baserunning unfiltered-best-RS-hw-hitting
		unfiltered-best-RS-hw-pitching unfiltered-best-RS-catching
		unfiltered-best-RS-infielding unfiltered-best-RS-outfielding games-by-x series-by-x
parks-seen	Tool to generate a list of games per park
		reads: rs/postseason/ rs/events/
		writes: @games-per-park
bio-names	Convert retrosheet biofile.csv to namedb used by all software to map RS ID codes to names
		reads: rs/biofile.csv
		writes: namedb
rs2erd		Translator retrosheet -> erd
		reads: @rs/postseason/ @rs/events/
		writes: @erd/
erv-tab		Calculates the main erv tables
		reads: erv-lib erd/
		writes: @tab-erv
bb-erd		Tool to interactively explore erd codes and the probability estimators - h for help message
		reads: erv-lib s232 winp-lib basic-stats 
		writes: @plays-erd
mplhi		Generate a list of peculiar all-time stats about games
		reads: @erd/
		writes: @most_pl_hi
pl-st		Summarize the careers of all players
		reads: @erd/ namedb
		writes: @all-players
hps		Calculate all player run-value stats
		reads: @erd/ tab-erv rs/rosters/
		writes: pitching-stats/ fielding-stats/ hitting-stats/ baserunning-stats/
series-wins	Generate table of who won every postseason series
		reads: erd/
		writes: win/
run-stats	Determines hitting and pitching wins and losses, calculates fractional R stat
		reads: tab-erv erd/ 
		writes: runs/
cwl		Tabulates career win-loss records
		reads: runs/
		writes: winloss
erd-parse	Tabulates pa o br r ACR ACOR per year/team/series (also model erd parser)
		reads: tab-erv papg @erd/
		writes: season-stats/
homewins	Calculate the parameters of the game probability estimator
		reads: @erd/ tab-erv papg winp-params
		writes: winp-params
s232-tab	Calculates the 232-series estimator parameters
		reads: erd/
		writes: @s232
bwp		Calculate ballpark home win probabilities
		reads: erd/
		writes: bwp-tab
hwprob		Run the probability estimator for every play of every game
		reads: @erd/ tab-erv papg winp-lib bwp-tab winp-params
		writes: @hwp/
hwhps		Calculate all player log-probability-ratio stats
		reads: tab-erv papg winp-lib winp-params bwp-tab erd/
		writes: hw-pitching-stats/ hw-hitting-stats/ gacw/
career		Calculate and tabulate career rankings in all main stats categories
		reads: papg all-players hitting-stats/ pitching-stats/ baserunning-stats/
		hw-hitting-stats/ hw-pitching-stats/ fielding-stats/
		writes: career-hitting career-baserunning career-hw-hitting career-hw-pitching
		career-pitching career-catching career-infielding career-outfielding
hpmvp		Determine the hitting and pitching MVPs for each season
		reads: @erd/ hitting-stats/ pitching-stats/ hw-hitting-stats/ hw-pitching-stats/ runs/
		writes: @mvp runners-up
sm-hof		Determine series MVPs and the all-time universal ranking
		reads: erd/ all-players pitching-stats/ hitting-stats/ fielding-stats/ winloss
		hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/
		writes: hof smvp
series-match	Tabulate team and player leaderboards for each year, generate the willow
		reads: @erd/ papg runners-up series-by-x hw-hitting-stats/ hw-pitching-stats/ runs/
		season-stats/ win/ unfiltered-best-RS-hitting unfiltered-best-RS-pitching
		all-players career-pitching career-hitting hitting-stats/ pitching-stats/ gacw/
		writes: @season/ @guides/*-season.ps @willow
guide-career	Generate player career summaries for any postseason series
		reads: @erd/ broadcasters all-players
		writes: @guides/*-career.ps
guide-graph	Generate the graph of any game
		reads: @erd/ tab-erv hwp/ 
		writes: guides/*-graph.ps
guide-roster	Generate the roster of any game to accompany the graph
		reads: @erd/ tab-erv papg all-players mvp smvp gacw/ runs/ pitching-stats/ hitting-stats/
		fielding-stats/ hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/
		career-hitting career-pitching career-hw-hitting career-hw-pitching winloss
		career-baserunning career-catching career-infielding career-outfielding win/ hof
		writes: guides/*-roster.ps
winp-lib	Library of probability estimator functions
30wins		Tool to show all 30-game winners
		reads: runs/
acctruns	Tool to check that the fractional R stat is calculated correctly
		reads: runs/
bio-check	Tool to check retrosheet biofile.csv for parse failures - output should be empty
		reads: rs/biofile.csv
extract-gacw	Tool to extract gacw entry for any game
		reads: gacw/
extract-game	Tool to extract any game by retrosheet game ID from the erd database
		reads: erd/
extract-hwp	Tool to extract hwp entry for any game
		reads: hwp/
extract-runs	Tool to extract runs entry for any game
		reads: runs/
game-pdf	Tool to make a complete pdf for any game
		runs: extract-game extract-hwp extract-gacw extract-runs guide-graph
		guide-roster guide-career series-match pscat
hwp-report	Tool to crudely examine the probability estimator's a posteriori accuracy
		reads: hwp/
list-codes	Tool to generate a list of all erd codes seen by order of frequency
		reads: erd/
max-pa-br	Tool to sort all baserunners by PA on base per year
		reads: baserunning-stats/
pscat		Tool to concatenate postscript files
		reads: common-ps-header
series-pdf	Tool to make a complete pdf for any series
		runs: guide-graph guide-roster guide-career series-match pscat
text2ps		Tool to make postscript from a text file, used for guide-career and bb-post
viewdec		Tool to sort pitchers and hitters per year by W - L
		reads: runs/

