The ArctEX baseball statistics system

This is a very preliminary introduction to a cool project based on Retrosheet to sort of reinvent basic baseball statistics using a somewhat more mathematical framework. What's different about it? In a word, consequentialism. All arctex statistics are based solely on umpires' decisions about what happened in the game. There are no earned runs or RBIs or at bats, or even hits or errors except as descriptions. It aims to be a complete and reasonably familiar description of baseball using these methods, although the terminology is mostly new.

The name is from "arctangent of expectation values", about which more later after some actual baseball stuff. Skip the rest of this paragraph if you don't want a brief history of the project. This project is a work in progress but close to something interesting (to me at least). It was going along great up through 2019 and then a number of things happened. One is they introduced some new rules that required a bit of a software rewrite - the runner on second base, the 7-inning game, the new postseason format. For the most part that's easy enough (see a complication below), but I delayed too long and accumulated too many irresistible ideas for new stuff, and so the update became too big to tackle at once. And then in 2020 I had a number of new ideas all at once that demanded following up, which eventually resulted in the other texts I wrote. Finally in late 2023 I got around to fixing all that and updating so that I can follow retrosheet again - at last! That release was particularly buggy, but now I'm happy to say I've done another round, smoothed over the bugs, and added some new features again.

Is Shohei Ohtani's 6-for-6 game the best offensive performance ever? An example step-by-step of how Arctex stats are calculated. Updated April 2025.

Early 2025 update: Here are the new RSPS MVPs for 2024, the players who did the most to get each team to the postseason. These will be properly documented soon.

ATL     Reynaldo Lopez       1.04051
BAL     Gunnar Henderson     0.35682
CLE     Jose Ramirez         0.42437
DET     Tarik Skubal         0.86095
HOU     Yordan Alvarez       0.58813
KCA     Bobby Witt Jr.       2.38106
LAN     Shohei Ohtani        0.60755
MIL     William Contreras    0.42978
NYA     Juan Soto            1.55118
NYN     Francisco Lindor     1.47705
PHI     Alec Bohm            0.26374
SDN     Jurickson Profar     1.09871

- Description of PDF files and website features

- Description of stats and algorithms

- Description of each file in the distribution

Printouts

First a general note about the PDF files - they're made to be printed out on paper. I like to print them on thick letter size paper almost like what's used for business cards, and then have them looseleaf, with one stack of landscape pages and another of portrait pages, with the current game graph and roster file on top of each. Incidentally, although some software previewers mishandle the fonts, printers never seem to have this problem. If you're looking at the files on a screen, it may be best to open two copies with one on a graph page and another on the roster page if you want to follow a game. There are blank pages inserted to make it suitable for double-sided printing.

Graph

This was the raison d'etre for this project. Once I had seen the idea of expectation value tables, I wanted a graph like this. I had programmed Postscript in college, and had always wanted to try a curvy spline graph using the curveto operator (which is the basis of Postscript fonts). The thick light line is the home team score plus expectation value, and the thick black line is the same for the visiting team. The scale in runs is printed on the left and right. The thin lines represent the probability estimators, and are always on a scale of 0-1. The black line is the home team game win estimate, and the light line is the series home advantage team's series win estimate. This varies with the black line in games 1,2,6,7 and counter to it in games 3,4,5.

There are several lines of text above and under the graph. Starting from the top, there is the "annotation" or description of each play in bold. See below in "Annotations and file format" for a description of the abbreviations. Below that is a description of which offensive players advanced, scored, or were out. See "Player Codes" and "Baserunner Codes" below for a description of the syntax. The final line above the graph shows the inning, and after it the so-called X value for the half-inning, and a '~' marker for the swing half-inning, where the cumulative X value exceeds a threshold (86%). Inside the top part of the graph, a large * marks the play of the game, defined as the play that makes the largest step change in the game probability estimator. In the first line below the graph, there are ERD codes for each play (see "ERD Codes"). At the start of each half-inning, the offensive team gets its expectation value raised to the value for 0 on 0 out, which is depicted as a separate column on the graph. Under it, there appears not an ERD code (there is none) but the ballpark code in use for this game, or PW table code if that's in use - see below "ERV", "Ballpark Codes". Below that is the ERD value (see below). Finally at the bottom there are substitution codes, which use the player letter codes and fielding position number codes. When you see 'n:6', that means player 'n' is now the shortstop. When you see 'k:1,j' that means 'k' is the new pitcher and 'j' is leaving the game at the same time. Perpendicular to the graph is a header which gives the team codes and colors, the series and game number (or day number in the regular season), followed by the date and weekday, and in between the retrosheet doubleheader code in parentheses.

Roster

This long section of tables provides information about the individual players referred to only by letter in the graph. The first section describes each game in the series (it always thinks there's a series, like the graph), and then another section describes the regular season for these players, and a final section gives their career numbers. Through all of the charts, the symbol '#' serves as infinity in a ratio.

Game Display

The header for each game describes it in basic stats terms (see section of that name below, as usual). The home team for game 1 of the series is on the right, and will continue to be on the right for the entire roster file. The home team's headers appear in lighter gray whether on the left or right. Below each team's code is the current "standing" or win minus loss number in the current series or regular season. Below the game or day number is the ballpark home team win rate. The center headers give information for the entire game, and appear in black. The stats are explained in "Basic Stats", except for the lower center headers. The +.5 etc. are explained in "ERD probabilities" and appear as counts. In the center, T is the game time in minutes, gX is the stat called gX or game excitement factor, and '!' (called "bang" by programmers) is 100 * gX / T rounded to a decimal, i.e. a measure of "excitement per minute". Finally P is the pace, or minutes per plate appearance - higher is slower.
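As a rough illustration of the two derived header numbers, here is a minimal Python sketch. It is not taken from the Arctex software; the function names and the sample figures are made up for the example, and the rounding used for display is left out.

```python
def bang(gx, minutes):
    """Excitement per minute, the '!' header value: 100 * gX / T."""
    return 100.0 * gx / minutes


def pace(minutes, plate_appearances):
    """Pace P: minutes per plate appearance (higher is slower)."""
    return minutes / plate_appearances


# Hypothetical game: gX of 30.0, 180 minutes long, 75 plate appearances.
example_bang = bang(30.0, 180)
example_pace = pace(180, 75)
print(example_bang, example_pace)
```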

A brief word about gX, which is a fun stat. It follows how much the game probability estimator gets racked around, more or less. It's high for walkoffs and late lead changes, and low for early inning blowouts. For example, the highest gX game was BOS193808232 with a final score 12-14 and a walkoff grand slam, with a gX of 85.1. The worst is CLE199010020 which was 0-9 after the first inning and finished 3-13, with a gX of 3.2. The game with the most lead changes (7 - there's a tool for this) was MON200005140 which is #15 all time in gX with 76.4. For series, there is a separate probability estimator and another stat called sX. The highest sX is the 2004 ALCS with 73, and the lowest is the 1989 WS with 3.7. Finally there is a universal or u estimator to win the world series from opening day, and therefore total season and regular season uX values. The highest of these was TEX and SLN 2011 with 70, and the highest regular season uX was BRO 1951 with 14.5.

Above the tables of hitters and pitchers for each game, there are a number of new stats describing the relation of the game to the season and the postseason, based on these probability estimators. gLo is the low point in the game for the winning team's estimate to win, which describes how much of a comeback was mounted, if any. Similarly sLo is the low point in this game for the series winner in the series estimate, and for each team uHi is the high point for that team's estimate to win the world series. uX and sX have been described above, but the displayed uX number, unlike gX and sX, is not for the game only but cumulative on the year. The LVR stats describe how the game or g probability estimator relates to the larger s and u estimators for the same game. The s and u estimators themselves only provide hypotheses about these larger events based on the winning and losing of this game in those frameworks. The LVR stat is simply the difference between those hypotheses, or the leverage that this game has in winning this series or the (eventual) world series. For example, game 1 of the world series has an sLVR of 0.26, which means that winning game 1 makes the series home advantage team 26% more likely to win the series than losing it. Game 7 has an sLVR of 1. In the regular season the same game may have a different uLVR for each team. For an easy example, one team may be eliminated and the other not. More generally the two teams have different standings and so different uLVR numbers. The various hitting and pitching tables also have labels with H: and P: numbers. These are ACW aggregates of hitting and pitching win credits for those games, series, and seasons. As you can see from the game display, the winning H is the negative of the losing P and vice versa.
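The leverage idea can be sketched with a toy model. The Python below is only an illustration and not the Arctex estimator: it assumes every game is an independent coin flip (the parameter p), and the function names are hypothetical. Under that assumption game 1 of a best-of-seven comes out at 0.3125 rather than the 0.26 quoted above, so the real estimator evidently takes more into account, but game 7 comes out at exactly 1 either way.

```python
from math import comb


def series_win_prob(wins, losses, p=0.5, need=4):
    """Chance of reaching `need` wins before `need` losses, assuming
    independent games each won with probability p (a toy assumption)."""
    w_left = need - wins
    l_left = need - losses
    if w_left <= 0:
        return 1.0
    if l_left <= 0:
        return 0.0
    # Pretend all remaining games are played out; the series is won
    # exactly when at least w_left of them are wins.
    n = w_left + l_left - 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(w_left, n + 1))


def s_lvr(wins, losses, p=0.5, need=4):
    """Leverage: series estimate if this game is won minus if it is lost."""
    return (series_win_prob(wins + 1, losses, p, need)
            - series_win_prob(wins, losses + 1, p, need))


print(s_lvr(0, 0))  # game 1 under the coin-flip assumption
print(s_lvr(3, 3))  # game 7
```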

Next is the display of hitters - really all offensive players for this game. The first column '#' gives the player's letter code, as used on the graph. The next column 'st' contains several pieces of information variously coded. For the starting lineup, it contains the starting fielding code, or 10 for DH. For substitutes it contains the letter corresponding to the batting slot they enter. This is not necessarily the player they displaced. That can be determined by looking for the previous letter code alphabetically that went into the same batting slot letter, looking up the list (it could be a pitcher below). In general to figure out a substitution completely you have to look at all the substitution codes below the graph, but there are a number of features to help figure out common patterns from the roster list only, via dots appearing around this 'st' letter or number. A middle dot on the right signifies that this player had a substitute fielding code for positions 2-9, which may mean a substitute fielder, or changing to a fielding position later. A period on the right means a pinch hitter, and a comma means pinch runner, and these combine with the fielding dot to form colon and semicolon. This allows you to figure out simple multiple switches, e.g. if a pinch hitter goes into the center fielder's batting slot, and if fielding dots appear on the pinch hitter and the starting left fielder, then the pinch hitter probably went into left and the left fielder moved over.

The stats that appear next are all explained below. Briefly, ACR is the basic run value hitting stat and ACOR the pitching stat. ACW is about the game probability estimator, and R is a stat used only for deciding the winning and losing hitters and pitchers. There are more annotations on the PA and O column. The game's winning and losing hitter are marked + and -. High dots on the winning side, and periods on the losing side both indicate that the player passed the 'R/pa' cut described below for the decision. This helps understand how the decision works. Also, '..' may appear by itself indicating a double-pinch, i.e. a pinch hitter who never comes to bat, relieved immediately by another pinch hitter after a pitching change. Finally, the player's name has a hand designation appended. A middle dot after it indicates that this is the throwing hand, meaning that the player pitched at some point in the season. This sometimes appears for position players too, and likewise just means that they pitched at some point in the season.

Below that is the pitching table for the game, which is much the same as the hitting table. Unlike ACR, ACOR is better when negative, but ACW is still best positive for pitchers. There are a few little differences from the hitting display. For DH games the 'st' column is just blank for relief pitchers who stay out of the batting order. The PA and O annotations are the same except you don't see double-pinch, and there is another annotation, the comma which means KO (left in the middle of an inning).

Regular Season Tables

All of the stats in this section are for the regular season and for this team only.

First comes the regular season hitting table, which is ordered by ACR. The season leader in ACRc (see below) has a '>' appended to the PA column. '/pa' after 'BR' means BR/PA for the season (lower case for spacing). BR/PA kind of needs a verbal name, and I prefer "no out percentage". 'w' and 'l' are the hitting decisions.

Next, the summary of hitting vs left and right hand pitchers is ordered by LACR - RACR which means that "normal" right handers sort at the top of the table, and "normal" left handers at the bottom. This allows you to see reverse splits and pigeonhole switch hitters at a glance. The stats are described below, but LACR is just batting-only ACR vs left hand pitchers. Beware when the number of PA is very low on either side. If you can't figure out why somebody would be disqualified for a regular season hitting MVP, look here. You have to have positive LACR and RACR to be considered for either MVP.

Next is hitting ERD probabilities, which is sort of arctex's mixed-up version of singles, doubles, homers, and double plays. What you see is a percentage of all plays this player was involved in as a hitter or baserunner, by ERD value in the sense of greater or equal for + and less or equal for -. So +.2 is very roughly like singles (and everything better), +.5 like doubles, and +1 like homers, with -.5 being like double plays. But the numbers here are more focused on the game effect and categorize all sorts of plays. For example, an aggressive base stealer will have a high rate of +0. Unlike the other tables in this section, this table has a minimum requirement of 150 plays, the reason being that the numbers are pretty worthless with fewer than that. The final column is 'si' or slugging index, which is +.5 + 2 * +1, a number kind of like slugging percentage. It is in fact a good number for categorizing sluggers. Over 30 is very good, and over 40 is terrific (and rare nowadays). The table is sorted by the unobvious metric (+.2 - -.2), which is vaguely interesting.
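To make the threshold percentages and the slugging index concrete, here is a small Python sketch. It is an illustration only and not the Arctex code: the function name, the dictionary keys, and the sample ERD values are all invented, and the +0 column is left out because its exact cutoff is not spelled out here.

```python
def erd_profile(values, min_plays=150):
    """Percentages of plays at or beyond each ERD threshold, plus si.

    `values` is a list of per-play ERD values (in runs) for one player.
    Returns None when the player has fewer than `min_plays` plays.
    """
    n = len(values)
    if n < min_plays:
        return None

    def pct_ge(t):  # '+' columns: ERD value greater than or equal to t
        return 100.0 * sum(v >= t for v in values) / n

    def pct_le(t):  # '-' columns: ERD value less than or equal to t
        return 100.0 * sum(v <= t for v in values) / n

    prof = {
        "+1": pct_ge(1.0),
        "+.5": pct_ge(0.5),
        "+.2": pct_ge(0.2),
        "-.2": pct_le(-0.2),
        "-.5": pct_le(-0.5),
    }
    prof["si"] = prof["+.5"] + 2 * prof["+1"]    # slugging index
    prof["sort"] = prof["+.2"] - prof["-.2"]     # the table's sort key
    return prof


# Invented sample of 200 plays.
sample = [1.2] * 10 + [0.6] * 20 + [0.25] * 40 + [-0.3] * 100 + [-0.6] * 30
profile = erd_profile(sample)
print(profile)
```

Note that a +1 play also counts toward +.5 and +.2, so in si a homer-like play is effectively counted three times as heavily as a double-like play is counted once.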

The baserunning and fielding tables should be self explanatory. All -OR stats are best when negative (opponents' runs), and all are sorted best on top. Fielding stats are aggregated by the named categories. Beware numbers with few PA, especially in fielding. Under 1000 PA should be considered an unreliable FCOR number (much more is really best). The FCOR PA numbers are also a good place to see who was a starting player in the season. The BCR PA number is an interesting offensive metric in itself.

There is a table of miscellaneous team statistics which contains run value ACR or ACOR and gACW numbers for hitting and pitching by inning (all extras count as 10), and hitting by batting slot number and fielding position (including 10 DH and 11 PH). All numbers are for the entire year, and the ACR and ACOR numbers share a common constant factor - see below for details.

The pitching table is sorted by ACOR, best on top. 'pa/3' means PA per 3 outs, i.e. 3 * PA / O, which is sort of parallel to BR/PA for hitters. In the ERD probabilities, an si over 25 is a red flag in my opinion. That table is ordered by (-.5 - +.5), which again seems vaguely interesting.

The postseason series tables display the same information for these series. The PS numbers are an aggregate of the following series numbers. If you're looking at a regular season game, and a player's name appears in the PS summaries but without any numbers, it means the player was on another team. The only thing different here is the indication in the PA column shows an approximate version of the various MVP awards, although with hitters and pitchers both ambiguously indicated as winners. The real award is indicated at the end in the MVP section. The character used to indicate this varies depending which series is being considered. In a file for a postseason series, the winning team's player will be indicated '*', although the new series MVP award doesn't care which team you're on. The losing team's best player gets a '>'. For series summaries for other series in the year, the designation changes to '^' because it only considers the listed players, and the real award winner may not appear (this one is fairly useless). For the entire PS summary they get '>' for both teams.

Career Tables

Most of these are self-explanatory. All are sorted with the highest ranked players on top. The stats are aggregated over all time, and there are minimum plate appearance requirements, which are 2000 PA for everything except fielding, which is 12000 PA. The postseason numbers are given for all those who qualify in total, and the number of PA is listed. The career decisions table has no minimum, and lists hitting wins and losses and pitching wins and losses in that order. It is sorted on total wins minus total losses. The minimum requirement for the hall of fame is one plate appearance as either hitter or pitcher - x is the combined all-around metric for the ranking. The MVP table lists both regular season and postseason MVP awards, also both based on custom combined metrics. For a postseason series, the real series MVP is listed here, and may be on either team. If an MVP is not listed for a series the team won, it went to a player on the losing team.

Career Summary

Every player's career is summarized in terms of teams, years, series, and positions. Teams are listed in order of the first year the player played there. If the player played for two or more teams in the same year they may not be listed in correct order. Fielding positions beyond the usual: 10 DH, 11 PH, 12 PR. Positions are listed regardless of how many or few PA. Coaching is now also indicated, after the playing career. The position identifiers for coaching look like Chit for hitting coach, Cpitch for pitching, C1B and C3B for base coaches, Cbench and Cbullp are bench and bullpen, and Ccoach is an unspecified coach. Coaches are not yet listed on the roster for a given game.

Season Summary

At the top there is an imitation of a common display for division leaders and runners up in all divisions. Across each line there is the division position, the team code, games won minus games lost, total non-tie games, tie games, win percentage, team ACR, team ACOR, team TAR, the uX value for the entire season, the uX for the regular season only, and uHi, the year's high point in the u (world series) probability estimator. ACR, described below, is the run value hitting stat, and ACOR is pitching and of opposite sign. TAR is ACR minus ACOR, which is a combined "team power" number in runs per game (although it's a per-plate appearance number). The ACR, ACOR, and TAR numbers displayed here differ slightly from the statistics for individual players of the same names, having a common constant factor when the individual stats do not - see below for a detailed explanation.

Next is a wildcard runner up display, even in the pre-wildcard era. Division leaders appear at the top above the line marked with a '*', and then runners up which are marked with a dot when wildcard slots exist. The display contains a brief summary of each team's most recent postseason appearance, world series appearance, and world series win. Finally there are two numbers that are averages for the team for each year so far (including the current year) in uHi and uX.

Below is the postseason summary, with an entry for each postseason series. The series name is given in retrosheet format, which includes imponderables like ALD2 that are best looked up here to find out what that means. Following are the team codes of the series winner then loser, followed by a display of series games called the willow. Each game is represented by a letter decorated with "accents" or dots to indicate a number of interesting things. The letters w and l represent wins and losses by the series winning team, and are upper case whenever the home team wins the game, lower case when the visitors win. A high dot (or caret in text files - a postscript charset issue) before the letter indicates a complete game by the winning pitcher, and this is changed to a '!' for a perfect game. By the way, the Arctex definition of a perfect game is very slightly different, being defined as zero BR, which allows for e.g. a single thrown out at second (consequentialism). A middle dot after the letter indicates extra innings. A comma after the letter indicates a walkoff, and combines with extra innings to form a semicolon. There follows a display of basic stats for each series (plate appearances, outs, baserunners, runs), with each displaying two numbers - before the slash, the winner's total, and after the slash the series total (winner plus loser), so that it appears as a fraction giving the winner's share of the total. Next come the run value ACR and ACOR numbers for the series. They're printed for the winner, but you reverse the numbers for the losing team. I.e. the winning team's ACOR is the losing team's ACR (no sign reversal). Finally there is the series uX and sX, followed by the rank in sX value (out of 381 through the 2023 postseason), and sLo, the low point in the probability estimate for the series winner.
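The willow notation is compact enough that a decoder helps to pin it down. The Python below is a sketch under assumptions, not part of Arctex: it guesses that in a text file the games appear one after another (with or without spaces), that the caret stands for the high dot, and that the middle dot is the character '·'. The function and key names are hypothetical.

```python
import re

# Optional prefix (^ complete game, ! perfect game), the game letter,
# optional suffix (· extra innings, comma walkoff, semicolon both).
TOKEN = re.compile(r"([\^!]?)([WwLl])([·,;]?)")


def decode_willow(willow):
    """Decode a willow string into one dictionary per game."""
    games = []
    for prefix, letter, suffix in TOKEN.findall(willow):
        games.append({
            "series_winner_won": letter in "Ww",
            "home_team_won": letter.isupper(),
            "complete_game": prefix in ("^", "!"),
            "perfect_game": prefix == "!",
            "extra_innings": suffix in ("·", ";"),
            "walkoff": suffix in (",", ";"),
        })
    return games


# Invented six-game series, not a real one.
decoded = decode_willow("W ^w l· L, W; !W")
for game in decoded:
    print(game)
```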

Finally at the bottom there are (up to) three text summaries. On the left is an accounting of games in the regular season and postseason which should be self-explanatory. If there were regular season interleague games, there is a summary of which league won how many in the middle. Below is the percentage of complete games for the year, calculated in such a way that it can go to 200%, and below that a quick summary of perfect games and their pitchers, if any. Note that arctex allows batters out on a hit in a perfect game, which is very slightly more permissive. At the right is a summary that needs a slight explanation, as I couldn't find the right wording to fit there. What it shows is the total length of all games in the year divided by one eighteenth of the total number of half-innings, more or less the length of the average game that year. But it's really calculated as an average of half-innings. Also there's the number of postseason series, to compare the sX rankings above. Then there's a summary of the total season pace of game or 'P' stat, in plate appearances per minute. Finally there is a summary of total season competitiveness in uX, in total, with the rank all time, and for each league, postseason and regular season only total uX (and for each league), and the average gX of each game (total and league).

League Leaderboard

This is currently the most confusing display, owing to the fact that it's sort of in a transitional stage. Originally, before the software was ever released, the display was a fairly conventional AL/NL hitting/pitching leaderboard, with a quartet of simpleminded MVPs designated by ACRc and ACORc (see below for the exact meaning of these terms). Now there's a more sophisticated MVP algorithm, and it's not split by league but by pitcher hand, which I eventually considered more interesting. The new MVP is marked in the second column. The entire display is now ranked similar to the new RSPS metric, more or less, but counting all regular season games as equal. In particular, the numbers displayed for each player are aggregated over multiple teams if necessary, but only within the given league. The new MVP does not consider the league in any way (see below), so the numbers displayed aren't actually relevant (except by accident). The new MVP award may seem inappropriate based on the numbers.

The first column in each table is the old MVP/runner up designation, based solely on ACRc or ACORc for the league-aggregate for the regular season. The second column displays the new MVP winners as * and the runners up as numbers starting with 1. A blank here means the player did not qualify, i.e. had a key stat of the wrong sign. Next is the name and batting or throwing hand, which is lowercase to indicate a player's rookie year (simplistically - first calendar year only), followed by the team that player had the most PA for in the season. Note, a player who was traded may appear on both sides with different numbers. The headings give the stats that follow - PA is plate appearances, w and l are game decisions, c is complete games, and +1 is +1c described in "ERD probabilities".

On the bottom there is a summary of the year's MVP awards with the team indicated. These would be more convenient on the previous page no doubt, but they don't fit there for the overlong 2020 postseason.

Cheat Sheet

Finally, there is a document called 'bb-post' which is an all-around source of miscellaneous information about the game's history, ballparks, awards of various types, and the "willow" for all postseason games.

Software Download

Here is the latest software download and release notes. Arctex will always be completely free and open source. I took down the sample PDF files from the website as it's not clear I'm allowed to post them.

There is a script called run_everything which will do what it says, starting by downloading the Retrosheet big zip file, and finishing by generating the exact PDF that appears above, with all the analyses run in between. There are also some tools for looking at the stats in a number of ways, the most important of which is called bb-erd. This is an interactive command-line tool to query the ERV tables, calculate ERD code values for different ballparks, and to run the game and series probability estimators for live games. You have to run_everything first. bb-erd does at least have a help screen, which you can get by entering ? or h. To run the software, you need: bash perl wget unzip ps2pdf (from ghostscript), i.e. a normal Unix environment. Mac should probably work, but I haven't tried it. The download is around 100KB, but it expands to 5GB when you run_everything, which should take about an hour on a fast machine. Have fun! See below for release notes.

Here's a quick guide to making PDF files with the software. Say you want a single PDF for the 2000 world series. This is now pretty easy.

./series-pdf 2000 WS

And that will give you a file called 2000-WS.pdf. Other series than the world series can be got by substituting NLCS or whatever. If you want a regular season game, you need to have the game's retrosheet ID, e.g. BOS193808232 (see below). Then you can do:

./game-pdf BOS193808232

And that gives you BOS193808232.pdf, easy.

There are also a pair of features (which used to be on the website) called the game of the day (gotd) and series of the week (sotw). The game of the day is chosen out of the games-by-x file. The game is selected from the top 2000 games all time, which is just over 1%. The series of the week is chosen out of series-by-x. All series are included in the choice, but they're weighted by their X value. To make a pdf of a randomly selected gotd or sotw, just do:

./generate gotd
./generate sotw
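
The two selection rules just described can be sketched in a few lines of Python. This is not how the generate script itself is written; the in-memory lists stand in for the games-by-x and series-by-x files (whose layout is not described here), the function names are hypothetical, and a uniform pick among the top games is an assumption.

```python
import random


def pick_gotd(games_by_x, rng, top_n=2000):
    """Pick one game among the top_n games by gX (uniformly, by assumption)."""
    ranked = sorted(games_by_x, key=lambda g: g[1], reverse=True)
    return rng.choice(ranked[:top_n])[0]


def pick_sotw(series_by_x, rng):
    """Pick one series from all of them, weighted by its sX value."""
    names = [name for name, _ in series_by_x]
    weights = [x for _, x in series_by_x]
    return rng.choices(names, weights=weights, k=1)[0]


# A handful of (id, X) pairs quoted elsewhere in this text.
games = [("BOS193808232", 85.1), ("CLE199010020", 3.2), ("MON200005140", 76.4)]
series = [("2004 ALCS", 73.0), ("1989 WS", 3.7)]

rng = random.Random()
print(pick_gotd(games, rng, top_n=2))
print(pick_sotw(series, rng))
```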

Algorithms:

There are broadly 4 categories of statistics generated:

1. Basic stats: pa o r etc., calculated directly from the ERD codes.

2. Run-value stats: starting with the ERV tables, these all have values in runs (per plate appearance typically). These can be quite sophisticated and varied.

3. Log-probability-ratio stats: these are all based on the game probability estimator, which in turn is based on the run-value stats. These are somewhat involved to calculate correctly, and are limited to hitting and pitching, but are also very revealing compared to 1-2.

4. Combined MVP stats: these are all based on products of two terms overall, one a sum of run-value terms, the other a sum of log-probability-ratio terms. These are the best summary combination of what's revealed by all of the above.

Before describing the algorithms in detail, a word about my recommended interpretation of all of this. I certainly don't mean to suggest, by displaying an all-inclusive universe of statistics, that all that has gone before is worthless. I'm just trying to give these peculiar ideas a space to tell their peculiar story in its evident entirety. There is, I believe, a certain mathematical logic to it. Right now hardly anybody is listening to me, and that's as it should be. I'm not entirely convinced by all of these mechanical judgements myself, and I certainly don't suggest anyone else should be. My motivation in coding these algorithms is to give them what I think is the best chance to tell what they have to tell. When I make some fairly arbitrary adjustments, the purpose is to try to show the algorithm in its best light by my judgement, and not to make sure this or that person has a high or low ranking as I wish. After that, I sit back and see what I think. Overall, obviously, I think it's entertaining enough to continue, at least for a while.

Basic Stats

From basic baseball statistics I take plate appearances, runs, and outs. Also the occupation of the bases. Doing without hits and errors and so forth, I make a new stat called BR which is defined as a plate appearance in which no out is made off the final pitch. This adds one to the total quantity of baserunners, and leads to further definitions:

	lb = pa - ( o + r )		Runners left on base
	xo = o + br - pa		Extra outs
	ro = o - xo			Regular outs
	nr = br - xo			Net runners

The automatic runner on second base in extra innings is counted as a stat called xr (Extra Runners). The full set of equations and inequalities:

        lb = pa + xr - ( o + r )
        xo = o + br - pa
        ro = o - xo
        nr = br - xo
        o = ro + xo
        pa + xr = o + r + lb
        pa = ro + br
        br + xr = r + lb + xo
        nr + xr = r + lb
        o >= ro, xo
        br >= nr, (r-xr), (lb-xr), (xo-xr)
        nr >= (r-xr), (lb-xr)
        pa >= br, ro, nr, (o-xr), (r-xr), (xo-xr), (lb-xr)
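
As a quick check on the equations above, here is a minimal Python sketch that derives lb, xo, ro, and nr from the counted quantities and confirms the listed equalities. The function names and the sample totals are invented for illustration and are not from the Arctex software.

```python
def derived_stats(pa, o, br, r, xr=0):
    """Derive the secondary stats from counted pa, o, br, r and xr."""
    lb = pa + xr - (o + r)   # runners left on base
    xo = o + br - pa         # extra outs
    ro = o - xo              # regular outs
    nr = br - xo             # net runners
    return {"lb": lb, "xo": xo, "ro": ro, "nr": nr}


def check_identities(pa, o, br, r, xr=0):
    """Assert the equalities listed above and return the derived stats."""
    d = derived_stats(pa, o, br, r, xr)
    lb, xo, ro, nr = d["lb"], d["xo"], d["ro"], d["nr"]
    assert o == ro + xo
    assert pa + xr == o + r + lb
    assert pa == ro + br
    assert br + xr == r + lb + xo
    assert nr + xr == r + lb
    return d


# Invented nine-inning team line: 38 PA, 27 outs, 12 BR, 4 runs.
nine = check_identities(pa=38, o=27, br=12, r=4)
# Invented extra half-inning with the automatic runner stranded.
extra = check_identities(pa=3, o=3, br=0, r=0, xr=1)
print(nine, extra)
```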

Baseball as state machine: states

To concentrate the statistics on the most important aspects of the game, the game is described as a finite state machine. All plays are translated into this state-based description, and all stats are generated by considering the games according to these states. The total number of PA, O, BR, and R accumulated by teams and players is obviously relevant, but instead of creating "box scores" consisting of these, the description instead focuses on state transitions. It is therefore necessary that following the sequence of these transitions should allow an accurate accumulation of these quantities. It turns out that all that is needed is a state consisting of four pieces of information, aside from the identification of the players:

number of outs in the half-inning

runner on first

runner on second

runner on third

The last 3 can be conveniently coded as an octal digit (sum of 1 for first, 2 for second, and 4 for third).
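
For example, the base occupation digit could be packed and unpacked as in this small Python sketch. The function names are hypothetical, and representing the state as an (outs, bases) pair is just one convenient choice.

```python
def encode_bases(first, second, third):
    """Pack base occupation into one octal digit: 1 first, 2 second, 4 third."""
    return (1 if first else 0) + (2 if second else 0) + (4 if third else 0)


def decode_bases(code):
    """Unpack the octal digit into (first, second, third) booleans."""
    return bool(code & 1), bool(code & 2), bool(code & 4)


# One out, runners on first and third.
state = (1, encode_bases(True, False, True))
print(state)
```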

Baseball as state machine: transitions

The state transition corresponding to a baseball play must specify the starting and ending state codes (duh). However it must also specify whether the batter completed the plate appearance, in order to determine whether or not a BR occurred. A letter/number code can be constructed to specify these data. It turns out that there is slightly more that can be asked of a state transition code - the ability to assign a unique numerical value to it. To do this in the baseball context, I define the ERD code, which stands for Expected Run value Difference. Most of the complexity in this is contained in a single letter which also serves to designate the (non-) plate appearance, explained below.

ERV

The "Expected Run Value" is an expectation value of runs to be scored until the end of the half-inning conditioned on the number of outs in the half-inning at the start of the play and the occupation of the three bases at the start of the play, in other words the ERD state code. This implies that the expectation values be collected into a table - the ERV table - which has 8*3=24 nonzero entries, the entries for out 3 being set to zero.

The general algorithm to compute the ERV table is to consider the ERD code for every play for the time period and/or ballpark under consideration, while keeping a double table of numerator and denominator counts for all 24 table entries. For each code, the denominator is increased by one for the corresponding table entry. The state code (outs, bases) is then added to a list accumulated over the half-inning. Finally, the list of accumulated state codes is visited to add the number of runs scored on the play to the numerator value for every table entry on the state code list, including duplicates (as the denominators were increased in duplicate). After all ERD codes have been processed, the ratios are formed. These are tabulated generally under the name E indexed twice as E[bases][outs].
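The accumulation can be sketched roughly as follows. This is only an illustration: the input layout (a list of half-innings, each a list of (bases, outs, runs) tuples giving the starting state and the runs scored on the play) is an assumption made for the example, not the real file format.

```python
def build_erv_table(half_innings):
    """Return E[bases][outs] from plays given as (bases, outs, runs) tuples.

    bases is the octal base code and outs the number of outs, both at the
    start of the play; runs is the number of runs scored on that play.
    """
    numerator = [[0.0] * 3 for _ in range(8)]
    denominator = [[0] * 3 for _ in range(8)]
    for plays in half_innings:
        visited = []  # state codes seen so far in this half-inning
        for bases, outs, runs in plays:
            denominator[bases][outs] += 1
            visited.append((bases, outs))
            # Credit the runs to every state visited so far, duplicates included.
            for b, o in visited:
                numerator[b][o] += runs
    return [
        [numerator[b][o] / denominator[b][o] if denominator[b][o] else 0.0
         for o in range(3)]
        for b in range(8)
    ]


# One toy half-inning: single, two-run homer, solo homer, then three outs.
toy = [[(0, 0, 0), (1, 0, 2), (0, 0, 1), (0, 0, 0), (0, 1, 0), (0, 2, 0)]]
E = build_erv_table(toy)
```

In the toy half-inning the empty-bases, no-out state is visited three times, followed by 3, 1, and 0 runs respectively, so its entry comes out as 4/3.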

ERD Codes

The following bewildering array of codes is required to properly represent the possible run values (see ERD Values below) of a play. Originally it was just p and n, but over time it was discovered that all of these situations are numerically distinct and represent distinct game state changes as far as the algebraic model is concerned.

	Code letters
	Usual normal half-innings
		p	Plate appearance
		n	Not a plate appearance
	Usual potential walkoff half-innings
		q	Plate appearance
		r	Not a plate appearance
	Usual walkoff play
		w	Plate appearance
		u	Not a plate appearance
	Extra inning with automatic runner - top of inning
		P	Plate appearance
		N	Not a plate appearance
	Extra inning with automatic runner - bottom of inning
		Q	Plate appearance
		R	Not a plate appearance
	Extra inning with automatic runner - walkoff play
		W	Plate appearance
		U	Not a plate appearance
	Game over early for arbitrary reason
	Normal half-innings
		x	No play
	Potential walkoff half-innings
		X	No play

The need for p and n in the determination of the BR stat has been explained. The q and r codes are needed because the bottom of the ninth and later innings are actually subject to different rules. For example in a tie game with more than one runner on base, only one run will be allowed to score unless the hit is out of the park. This fact seriously changes the expectation values and necessitates the use and computation of a different set of ERV tables, which are called PW for potential walkoff. Furthermore there must be a set of these, for different values of the half-inning-initial score difference. Experiment has shown that 5 tables are necessary for initial score differences of 0-3 and 4 or greater. Not surprisingly the effect wears off after 4 runs of difference. The handling of these extra tables will be explained below, under the discussion of ballpark codes.

Following the need for potential walkoffs, there are actual walkoffs, with codes w and u. The codes are necessary to signal the cancellation of the expectation value of the remaining baserunners. It's also convenient to know when the game ended at an unusual time. The code x was also added for games arbitrarily ended early. The x code comes after the last play of the game, and exists mainly to provide an ERD code of non-zero value which is not assigned to any player, as the early game ending is often prejudicial to the more sophisticated player statistics. Since the x code has a value, it must have a different value in PW half-innings, so a second code X is needed (it's never a plate appearance).

Finally, for various reasons to do with some of the more sophisticated stats, it is desirable to know whether the half-inning started with an automatic runner on second. No separate ERV table is needed here, and the runner itself can be handled within the codes, but the fractional R stat assigns different values to the codes in this situation.

These various elements are gathered together in this order:

(initial base octal code)(letter code)(final base octal code)(initial outs)(final outs)(runs)

In fact it is not necessary to include the number of runs scored in the play. This can be computed from the previous components. But it is more convenient to have it, and a more sophisticated reason is that it forms a check digit.
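A code laid out this way could be picked apart along the following lines. The sketch assumes that every component occupies exactly one character, so that a code such as 1p3000 means runner on first, plate appearance, runners on first and second afterwards, no outs before or after, no runs. The class and function names are invented for the example, and the check-digit test is not attempted.

```python
from typing import NamedTuple

PA_LETTERS = frozenset("pqwPQW")       # plate appearance codes
NON_PA_LETTERS = frozenset("nruNRU")   # non-plate-appearance codes
NO_PLAY_LETTERS = frozenset("xX")      # game ended early


class ErdCode(NamedTuple):
    base0: int   # initial base octal code
    letter: str  # code letter
    base1: int   # final base octal code
    out0: int    # initial outs
    out1: int    # final outs
    runs: int    # runs scored on the play

    @property
    def is_plate_appearance(self) -> bool:
        return self.letter in PA_LETTERS


def parse_erd_code(text: str) -> ErdCode:
    """Split a six-character ERD code into its components (assumed layout)."""
    if len(text) != 6:
        raise ValueError("expected six characters, got %r" % text)
    base0, letter, base1, out0, out1, runs = text
    if letter not in PA_LETTERS | NON_PA_LETTERS | NO_PLAY_LETTERS:
        raise ValueError("unknown code letter %r" % letter)
    # int(..., 8) rejects the digits 8 and 9 in the base positions.
    return ErdCode(int(base0, 8), letter, int(base1, 8),
                   int(out0), int(out1), int(runs))


walk = parse_erd_code("1p3000")
caught_stealing = parse_erd_code("5n4010")
```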

ERD Value

After the ERV tables are available, when an ERD code is processed a run value can be assigned to it called an ERD value. For normal plays with ERD code letter nprq, the code value is computed from code (base0)(code)(base1)(out0)(out1)(runs) as

ERD = runs + E[base1][out1] - E[base0][out0]

If the code has letter wux, then

ERD = runs - E[base0][out0]

If these values are summed over a half-inning, an integral number is produced by algebraic cancellation.
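Here is a rough sketch of the two formulas and of the cancellation over a half-inning. The ERV table is a made-up toy, codes are given as plain tuples, and treating the upper-case letters like their lower-case counterparts is an assumption of this example.

```python
def erd_value(code, E):
    """ERD value of one play; code is (base0, letter, base1, out0, out1, runs)."""
    base0, letter, base1, out0, out1, runs = code
    start = E[base0][out0]
    if letter.lower() in "wux":
        # Walkoff or early ending: the remaining expectation is cancelled.
        return runs - start
    # The entries for out 3 are defined to be zero.
    end = E[base1][out1] if out1 < 3 else 0.0
    return runs + end - start


# Toy table, NOT real data: E[bases][outs].
E = [[0.1 * (b + 1) * (3 - o) for o in range(3)] for b in range(8)]

# Single, two-run homer, then three outs with the bases empty.
half_inning = [
    (0, "p", 1, 0, 0, 0),
    (1, "p", 0, 0, 0, 2),
    (0, "p", 0, 0, 1, 0),
    (0, "p", 0, 1, 2, 0),
    (0, "p", 0, 2, 3, 0),
]
total = sum(erd_value(code, E) for code in half_inning)
```

Because each play ends in the state where the next one begins, the intermediate expectation values cancel in pairs. In this sketch the total comes out as the 2 runs scored minus E[0][0], the starting expectation.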

Ballpark Codes

It's easy enough to shove all plays ever from retrosheet into a single ERV table, but this turns out to be unwise. The need for separate PW half-innings means that all other half-innings should be rigorously separated in other tables. There is indeed an 'all' table which contains data from all non-PW half-innings. But there are other factors which influence the overall magnitude and detailed structure of the tables, although none as severely as PW0-3. The long experiment with both DH and non-DH ball is one of these. Another is that the balance between hitting and pitching has shifted several times now in the statistically detailed historical era. Calculating separate tables for AL and NL and for different decades was apparently standard practice when I first heard about the technique. A set of tables along these lines is also available, with names formed by chopping the last digit off the year e.g. NL199. Of course only "NL" are available before 197.

However these are only used for games when ballpark tables are not available. Ballpark tables are of course computed for all major parks, so why would they not be available? This brings up the important concept of statistical base. Since expectation values are to be computed, and more importantly to have their differences computed, their values must be computed to tolerable accuracy. It turns out that, given the uneven distribution of game states, it takes well over a year to accumulate a tolerable statistical base. I prefer about three years for a single ballpark. The game states in many of the PW tables are even rarer than in the main tables, and this means that they can only be calculated universally. Their names are therefore PW0, PW1, ... PW4.

For each ballpark, it is desirable to split up the tables by era somehow to capture the slow change of the entire game over time. On the other hand I don't really like using decade tables which cause a huge jolt in the stats rather artificially every 10 years. As a compromise I set ballpark-and-era tables with unique eras for each ballpark, all overlapping and changing at different times. The result is that I assign a 3-letter code to each ballpark, plus a sequential era code starting with 0. The result looks like FEN.4, which is Fenway from 1988 to 2005, or POL.0, which is the Polo Grounds from 1901-1911. The dates where they change are supposed to represent occasions where there was a change in configuration or something like that. The complete list is in rs2erd. If a ballpark table is unavailable, the league-decade table is used instead.

Player Codes

For each game, players are assigned letter codes for brevity. The home team gets all uppercase letters, and the visiting team lower case. The letters are assigned in the order in which players are announced in the record, starting with the starting hitting lineups. As a result, the letters a-i and A-I have a double meaning, also standing for the batting slots throughout the game (this is explained below in the Roster File PDF section). In rare cases the alphabet is overflowed, resulting in player "letters" like aa or AB, so software should actually treat them as words. Also, games ending in 'x' codes (e.g. rained out) have players inserted in the runs/ database called 'nobody' and 'NOBODY' in order to make the runs add up without prejudicing any players, explained in more detail below. It should go without saying that the letter codes are unique throughout the game and only for the length of one game.

A suspended game was recently completed in which the same player played on both teams via trade. The software today is not ready for this, and I don't know exactly what the answer is. I may wind up preprocessing the input to fake a second player ID until I have a better answer. The software generally processes all players together, asking a particular data structure which team they're on, and this approach can't work now. I don't particularly want to change tens of thousands of lines of code to accommodate one game, or a very small number of games. Something hacky may result.

Lineups and Substitutions

Starting players are introduced with a trio of (retrosheet id, letter code, fielding code). Shohei Ohtani is listed twice, under one letter code but fielding codes 1 and 10 (this is not like retrosheet). Substitutions are interspersed with play codes and contain a letter code and a fielding code. The fielding code 0 is used for players exiting the game. The retrosheet codes for substituted players are listed under the starting lineups, without fielding codes.

Baserunner Codes

Each play is supplemented by a baserunning code consisting of up to four elements, each of which has a letter code and a base destination code as follows: 1, 2, 3 for the numbered bases, 0 for out, and 4 for run scored. The elements are normally separated by slashes, with the batter first and the lead runner last. If the play does not involve the batter, a '/' appears first.

Annotations and File Format

The file is something a little like CSV, although I have never handed it to a CSV parser. Only erd codes are visible in the CSV format, and all other information appears as specially formatted comments, introduced by the '#' character at the start of the line. Incidentally, Unix line endings are normally used, although the parser is flexible.

The game begins with 5 header lines each starting with '## '. On line one, we have the visiting and home team codes, followed by a date and time code which is really the retrosheet game ID minus the home team code, and the game ID can be reassembled thus. Line 2 gives the score. Line 3 contains all miscellaneous information about the game. The example chosen shows all possible pieces of information except one, which will be discussed specially below. The word FLAGS is always present. IN7 indicates a 7-inning doubleheader game (so, e.g. a walkoff is possible in the 7th inning, and an automatic baserunner may appear in the 8th (not in the same game)). XR2 indicates that there will be an automatic runner on second in extra innings. In principle other bases could be indicated. TIME gives the game length in minutes and is only present if a nonzero value was reported. WKD gives the weekday as a 3-letter ISO code. TAB gives the ballpark code to identify the ERV table for this game. Lines 4 and 5 give the complete roster of players for this game. In square brackets each player gets a roster entry consisting of full name:retrosheet id:letter code. Starting players additionally get a parenthesized code as is used in the game giving letter code:fielding code.

In the game itself, each half inning is represented by two lines, the first a comment and the second the pseudo-CSV of the ERD codes. The two are vertically aligned, play by play. The first line begins with '# ' followed by the half-inning number (starting from 1) and a ':', followed by a comma-separated list of two-element descriptors. The first element is called the annotation, and is a sort of half-traditional human readable (to the author) description of each play. The second element is the baserunning code in angle brackets.

The annotation is based on the retrosheet coding of the play, but with many small differences. All of the really peculiar ones are supposed to be described here for reference. One of the most unobvious is that N can be used as a suffix to mean 'iNterference', normally followed by a code indicating who did the interfering. In this case fielders are indicated by number and others by special letters: 'B' batter, 'R' runner, 'F' fan, 'U' umpire. Obstruction is 'OB'. When one runner is out passing another, that's 'RPAS'. When a runner is out hit by the ball off the bat, that's 'RBAT'. Errors are signalled by 'E', but not all are present (e.g. foul errors are generally omitted because nothing happens). Double plays should contain 'DP' and triple plays 'TP', and these are only present if the outs are finally made.

When the ball is put in play there is a prefix code which indicates how, in a certain rather peculiar categorization. The main opposition is between 'G' for ground and 'F' for fly. 'F' has quasi-synonyms 'P' for pop-up and 'L' for line-drive, and any of these three may appear alone (before a fielding code) for an out. Bunts get the letter 'B' prefixed to 'G', 'P', or 'L'. The sac-fly is 'SF', however the sac-hit is not indicated in any clear way. I maintain retrosheet's distinction between 'FO' for force outs, and 'FC' for tag-outs, and these may have prefixes listed above if they're appropriate. The remaining letters mainly mean what they normally mean in baseball, if you can peel off the prefixes and suffixes. You may be excused for not knowing off the bat what is a 'LDGRNF' - a line-drive double ground rule interference by a fan - or a 'BPFCE' - bunt pop-up fielder's choice error, or even a 'PIF' pop-up infield fly rule. There's a new thing called 'FCNO' which is a fielder's choice no out with no error. A good pronunciation should auto-suggest.

The hits are 'S', 'D', 'T', 'HR'. If the first three are followed by fielding codes somebody is out unless there's an error. If 'HR' is followed by numbers it is inside the park, and nobody is out. All of these things are clearly indicated in the ERD code and the baserunning codes. Fielding codes only indicate handling of the ball, and never mere positioning. Baserunning plays have a fairly peculiar coding, but this comes straight from retrosheet, and from the way such plays are scored. In short, you have 'SB', 'CS', 'PO', 'POCS' (both on the same out), 'WP', 'PB', and the weird 'OA' which is really an out on what would otherwise have been a WP or PB. And 'BK' for balk.

None of these are used for any statistical purpose.

The remaining complication from line 3 of the header is when the home team bats first, indicated by the flag 'HTBF'. My coding always has the home team batting second, so in these games the real home team is coded as the visitor. The ballpark code is for the real home team, and the game is from the same file as in retrosheet. In other words, the retrosheet codebase is apparently flexible about the team batting order but inflexible about the file location of the game, whereas my codebase is the other way around. Everything after this is normal, as far as the transformation goes. Note the game ID is formed with the visiting team code. When processing an ERD file, that's basically all you need to cue from the HTBF flag.

PAPG

Count all plate appearances and all games universally. PAPG is the ratio of plate appearances to games, divided by 18, the number of batting slots per game.

PAOB

Count all baserunners per plate appearance, and divide by 18 times the number of total games, giving the number of plate appearances on base per batting slot per game.
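Both normalizing constants can be sketched in a few lines. The inputs here (a total plate appearance count, and a list holding the number of baserunners present at each plate appearance) are assumed shapes chosen for the example, and the figures are invented.

```python
BATTING_SLOTS_PER_GAME = 18  # nine slots for each of the two teams


def papg(total_pa: int, total_games: int) -> float:
    """Plate appearances per batting slot per game."""
    return total_pa / (BATTING_SLOTS_PER_GAME * total_games)


def paob(runners_at_each_pa, total_games: int) -> float:
    """Plate appearances spent on base per batting slot per game."""
    return sum(runners_at_each_pa) / (BATTING_SLOTS_PER_GAME * total_games)


# Invented totals: 7560 plate appearances over 100 games.
example_papg = papg(7560, 100)
# One invented game in which three plate appearances had runners aboard.
example_paob = paob([0, 1, 2, 0, 1], 1)
```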

ACOR

For every play for which the pitcher is the pitcher of record, the ERD is added to the average, and the number of plate appearances is counted. These are split up by year:team:pitcher:series, where the series is one of RS PS WC LDS LCS WS (PS is the sum of everything except RS). The accumulated average is divided by the number of PA and tabulated by year (file name) and pitcher, team, and series. For display, ACOR is multiplied by 9 * PAPG and then rounded to 3 decimal places. Negative values are better than positive values, as the opposing team's runs are being counted. ACOR stands for Average Contribution to Opponents' Runs.

For precision, separate names will be given to three different interpretations of this stat by suffixing the letters 'c' for Count, 'r' for Ratio, and 'a' for Adjusted. First, the sum of ERD terms for a player is ACORc, which has units of runs. When ACORc is divided by plate appearances, this produces ACORr with units of runs per plate appearance. Finally ACORr is multiplied by 9 * PAPG to produce ACORa, with units of runs per game. The PDF products refer to ACORa exclusively, and call this ACOR. The data files on disk contain ACORr, which is called the pitching average.
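The three interpretations follow from each other as in this sketch, where the ERD values and the PAPG figure are invented for the example.

```python
def acor_forms(erd_values, plate_appearances, papg):
    """Return (ACORc, ACORr, ACORa) for one pitcher.

    erd_values: ERD of every play with this pitcher of record
    plate_appearances: number of plate appearances faced (must be > 0)
    papg: the universal PAPG constant
    """
    if plate_appearances <= 0:
        raise ValueError("a ratio needs at least one plate appearance")
    acor_c = sum(erd_values)                # runs
    acor_r = acor_c / plate_appearances     # runs per plate appearance
    acor_a = round(acor_r * 9 * papg, 3)    # runs per game, as displayed
    return acor_c, acor_r, acor_a


# Four plays, four plate appearances, PAPG assumed to be 4.0.
c, r, a = acor_forms([-0.25, 0.5, -0.75, -0.5], 4, 4.0)
```

The negative result marks a pitcher who allowed fewer runs than expected.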

This number and ACR are centered on zero by virtue of the fact that the E[0][0] value at the start of the half-inning is never attributed to any player, and so sets the level for what is to follow, defining an average to beat or fall short of, which becomes the sign of the player's statistic.

Another factor to consider in any ratio statistic is how long does it take to converge to a sensible value? Too few plate appearances in a per plate appearance average make for essentially random numbers. Not all statistics take the same number of plate appearances to converge. ACOR and ACR are fairly quick to converge, although the fractional R stat is specially designed to make sense of a single game. Overall I would rate the length of time required for the various main statistics to converge as fR < ACOR+ACR < BCR < ACW < FCOR.

ACR

For every plate appearance for which the batter is the batter of record, the ERD resulting from the play off the final pitch (a 'p'-type play) is added to the average, and the number of plate appearances is counted. For other ('n'-type) plays, one of three sub-algorithms is invoked depending on the circumstances:

1. If the ERD is negative and at least one baserunner is out:

Then divide the ERD value equally among all who were out

2. If the ERD is positive and at least one baserunner advanced:

Then divide the ERD value equally among all who advanced

3. Otherwise:

Divide the ERD value equally among all baserunners who either advanced or were out
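The three rules above can be sketched as a single function. The argument layout is an assumption for the example, as is the choice to credit nobody when no baserunner either advanced or was out.

```python
def split_n_type_erd(erd, advanced, out):
    """Share the ERD of an n-type play among baserunners.

    advanced: letter codes of runners who advanced on the play
    out: letter codes of runners who were put out on the play
    Returns a dict mapping letter code to that runner's share.
    """
    if erd < 0 and out:
        sharers = list(out)            # rule 1
    elif erd > 0 and advanced:
        sharers = list(advanced)       # rule 2
    else:
        sharers = []                   # rule 3: everyone who moved or was out
        for runner in list(advanced) + list(out):
            if runner not in sharers:
                sharers.append(runner)
    if not sharers:
        return {}  # assumption: nothing to share, nobody is credited
    return {runner: erd / len(sharers) for runner in sharers}


double_caught_stealing = split_n_type_erd(-0.4, [], ["a", "b"])
steal_with_trailing_out = split_n_type_erd(0.3, ["c"], ["d"])
```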

These values are split up by year:team:hitter:series. The accumulated average is divided by the number of PA and tabulated by year (file name) and hitter, team, and series. For display, ACR is multiplied by PAPG and then rounded to 3 decimal places. Positive values are better than negative. ACR means Average Contribution to Runs. Similar to ACOR, this is available in three interpretations, ACRc, ACRr, and ACRa, with the same definitions.

There is also a closely related statistic also called ACR defined for an entire team instead of a player. There is also a team ACOR and a stat unique to teams called TAR, which is defined as ACR - ACOR, so that it has the same sign as ACR. All of the team numbers have a common constant display factor of 9 * PAPG, which is appropriate to the context.

L/RACR

Only the ERD values of completed plate appearances are used to accumulate these averages. The values of all n-type (and r- and u-type) plays are disregarded, as are plays where the pitcher's throwing hand is undefined. The averages are calculated as for ACR but without the baserunning rules, and separate averages are maintained for left- and right-handed pitchers. Interpretations 'c', 'r', and 'a' are the same as for ACR.

ERD Probabilities

These answer a set of questions very vaguely similar to the old single, double, triple type of stats, but with a sharper interpretation. For every play, the ERD value is classified according to whether it is at least 1, 0.5, 0.2, or 0, or in the negative sense at least at or below 0, -0.2, -0.5, or -1. For each player, separate counts are maintained for all of these categories (hitting and pitching). Values are assigned as follows. The hitter and pitcher of record are assigned all ERD values that occur during their time in those roles. In addition, non-plate appearance (n- r- or u-type) plays are assigned to all baserunners. Finally, a count is maintained of the total number of plays elapsed (the denominator for the 'r' stat). This is counted for the hitter, pitcher, and baserunners on n- r- or u-type plays. Only the 'c' stat is stored, along with the denominator. The 'a' stat is calculated from the ratio by multiplying by 100 and rounding to the unit. These categories are only tabulated per year, not per team and per series. The -1 category is not displayed, being too rare to be interesting. Another derived stat is displayed, called 'si' for slugging index, and calculated as

sir = +.5r + 2 * +1r

BCR

BCR, or Baserunning Contribution to Runs, is a derived stat based on the general concept of ACR and ACOR. The idea is to consider every play in which a player is on base at the start of the play. An average of ERD values is maintained, but with a difference. Most of the ERD value comes not from the baserunners directly, but from the actions of the hitter and pitcher. So on every play, the per-plate appearance average of the hitter and pitcher is subtracted to eliminate this contribution - on average, in the long run. Specifically, on each play, and for each baserunner:

BCRa += ERD - ( ACORr + ACRr )

Plate appearances are counted as usual (assigned to the baserunners) and used as the denominator for the 'r' stat. The 'a' stat is multiplied by PAOB. BCR is tabulated per year, not per team and per series.
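A rough sketch of the accumulation follows. The play records are plain dictionaries with invented field names, and the per-plate appearance averages of the hitter and pitcher are assumed to be already known.

```python
def accumulate_bcr(plays, acor_r, acr_r):
    """Accumulate baserunning sums and plate appearance counts per baserunner.

    plays: dicts with keys erd, hitter, pitcher, runners, is_pa (assumed layout)
    acor_r: pitcher letter code -> ACORr
    acr_r: hitter letter code -> ACRr
    """
    totals = {}
    pa_counts = {}
    for play in plays:
        # Expected contribution of the hitter and pitcher, removed on average.
        baseline = acor_r[play["pitcher"]] + acr_r[play["hitter"]]
        for runner in play["runners"]:
            totals[runner] = totals.get(runner, 0.0) + play["erd"] - baseline
            if play["is_pa"]:
                pa_counts[runner] = pa_counts.get(runner, 0) + 1
    return totals, pa_counts


plays = [
    {"erd": 0.5, "hitter": "h", "pitcher": "P", "runners": ["x"], "is_pa": True},
    {"erd": -0.25, "hitter": "i", "pitcher": "P", "runners": ["x", "h"], "is_pa": True},
]
totals, pa_counts = accumulate_bcr(plays, {"P": 0.0}, {"h": 0.25, "i": -0.25})
```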

FCOR

FCOR is the counterpart of BCR for fielding, and is calculated in much the same way. The fielding positions are categorized as catcher, infielder, and outfielder. The reason for lumping these together is simply to accumulate more plays for the average, as FCOR takes an especially long time to converge. The reason is simply that, instead of attempting to categorize individual plays by who might have fielded it (or tried to or should have tried to), every play is assigned to all 8 fielders. Incidentally, the subtraction of the pitching average would produce an uninteresting pitcher fielding average. The 'a' form is calculated as for ACOR, and FCOR is tabulated per year, not per team and per series. The catching, infielding, and outfielding averages may be referred to as CFCOR, IFCOR, and OFCOR.

Fractional R

The calculation of this stat is unusually complicated. Some preliminary concepts need to be introduced. In ACR and ACOR, distinction is made between p-type and n-type plays, as distinguished in the ERD codes (and the corresponding q/r w/u codes which are equivalent). Some finer distinctions need to be made here. The main type of play concerned is called PNO for p-type or n-type with at least one out. u-type plays also get special handling. Mainly a count is made in every half-inning of the number of PNO plays which occur with a given number of initial outs, nPNO[o]. Player averages of ERD values are accumulated as in ACR and ACOR, and also a count of PNO plays attributed to the player with a given number of initial outs nPNO[p][o]. At the end of the half-inning corrections are added to each player average of the form

( E[0][o] - E[0][o+1] ) * nPNO[p][o] / nPNO[o]

Where the E[0] becomes E[2] with the automatic runner on second.
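As a sketch of how the end-of-half-inning correction might be applied, assuming that the terms for each number of initial outs are simply added together for a player (the text gives only the form of a single term), and using a toy ERV row:

```python
def pno_corrections(E, n_pno_player, n_pno_total, auto_runner=False):
    """Correction per player from PNO counts in one half-inning.

    E: ERV table E[bases][outs]
    n_pno_player: letter code -> list of three counts, indexed by initial outs
    n_pno_total: list of three half-inning totals, indexed by initial outs
    auto_runner: use E[2] instead of E[0] when the inning began with a runner on second
    """
    row = list(E[2 if auto_runner else 0]) + [0.0]  # the entry for out 3 is zero
    corrections = {}
    for player, counts in n_pno_player.items():
        corrections[player] = sum(
            (row[o] - row[o + 1]) * counts[o] / n_pno_total[o]
            for o in range(3)
            if n_pno_total[o]
        )
    return corrections


# Toy table, NOT real data: only the empty-bases row is filled in.
E = [[0.5, 0.25, 0.125]] + [[0.0, 0.0, 0.0] for _ in range(7)]
# Three one-out plays: player a at 0 and 2 outs, player b at 1 out.
corr = pno_corrections(E, {"a": [1, 0, 1], "b": [0, 1, 0]}, [1, 1, 1])
```

In this toy half-inning the corrections add up to E[0][0], the starting expectation that the ERD values themselves never attribute to any player.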

Fractional plate appearances are generated in two cases: u-type plays, and n-type plays. For a walkoff u-type play, the pitcher is given a flat 0.1 PA if they don't have any PA already. When a relief pitcher is brought in, the outgoing pitcher is examined to see if they have any PA. If not, it is checked whether they were pitcher of record for at least one play. If so, a count called NPIT is made, and increased for every subsequent relief pitcher. NPIT is reset when a plate appearance is completed, and every pitcher counted as NPIT receives 1/NPIT plate appearances (NPIT is at least two when reset).

Decisions

The selection of the winning and losing hitter and pitcher proceeds by a series of cuts. First, all members of the appropriate team are included if they have more than zero plate appearances (including fractions described above). Next, the fractional R stat is used in a manner called "would have won/lost" based on runs per plate appearance as follows. For each player, the statistic R/pa for that player is compared to the same statistic for the entire opposing team. For a player to be the winning hitter, R/pa must be greater than R/pa for the opposing offense, and for the losing hitter it must be less. For the winning pitcher, R/pa must be less than that for the pitcher's team's offense, and for the losing pitcher greater.

Each decision is completed from the list thus generated by making further cuts. All of the decisions have in common that all players remaining after the last cut share the decision. In some cases, all the players who pass the "would have won/lost" test share the decision, but in most cases the list is reduced to one.

Winning Hitter

After the R/pa cut, the highest R among the remaining players is selected, and all players with this R share the decision.

Losing Pitcher

After the R/pa cut, the highest R among the remaining players is selected, and all players with this R share the decision.

Winning Pitcher

After the R/pa cut, the highest number of outs among the remaining players is selected, and all players with this number continue. Then the lowest R among the remaining players is selected, and all players with this R continue. Finally the lowest number of (possibly fractional) plate appearances is selected, and all remaining players share the decision. If no decision is produced, the procedure is repeated, with the pa > 0 and R/pa cuts omitted.
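The winning pitcher cuts can be sketched as successive filters. The candidate records are dictionaries with invented keys, and the list is assumed to have already passed the pa > 0 and R/pa cuts. Exact equality on fractional values is used here for simplicity, which real data might need to relax.

```python
def winning_pitcher(candidates):
    """Apply the outs, R, and plate appearance cuts; return the names that remain.

    candidates: dicts with keys name, outs, R, pa (assumed layout)
    """
    if not candidates:
        return []  # the caller would repeat with the earlier cuts omitted
    most_outs = max(c["outs"] for c in candidates)
    pool = [c for c in candidates if c["outs"] == most_outs]
    lowest_r = min(c["R"] for c in pool)
    pool = [c for c in pool if c["R"] == lowest_r]
    lowest_pa = min(c["pa"] for c in pool)
    return [c["name"] for c in pool if c["pa"] == lowest_pa]


staff = [
    {"name": "starter", "outs": 18, "R": 1.5, "pa": 24},
    {"name": "setup", "outs": 6, "R": 0.0, "pa": 7},
    {"name": "closer", "outs": 3, "R": 0.0, "pa": 3},
]
decision = winning_pitcher(staff)
```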

Losing Hitter

After the R/pa cut, the number of plate appearances for the remaining players is examined and the median calculated. If the median is greater than 2, one is subtracted. All remaining players are eliminated if their number of PA is less than this. Then the lowest R among the remaining players is selected, and all players with this R continue. Finally the highest number of plate appearances is selected, and all remaining players share the decision.

Game Probability Estimator

First, at each point in the game the expectation value from the ERV table in use for the current half-inning is added to the offensive score for the current game state. Then the difference is formed between the home and visiting scores thus adjusted. This home minus visitor fractional score difference is called X for this purpose. This is calculated using the all-time ERV table for all non-PW half-innings instead of the ballpark tables. This is done for consistency. The estimated probability for the home team to win is calculated as

hwp = 0.5 + ( 1 / pi ) * arctan ( Z + ( X - A[o] ) / B[o] )

Where o is the number of outs in the game, A[o] and B[o] are tables of fitted coefficients for the estimator, and Z is a value calculated to adjust the game-initial probability, which is constant for the length of a game. The values A[o] and B[o] are fitted in advance by simulated annealing, using the entire retrosheet event file database. The number of outs tabulated runs from 0 to 59, where extra innings are looped around from outs 54-59 as often as necessary.
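A minimal sketch of the estimator, with placeholder coefficient tables standing in for the fitted ones (which are not reproduced here). The index wrap-around is one reading of the looping rule for extra innings.

```python
import math


def home_win_probability(x, outs, A, B, z=0.0):
    """Estimated probability for the home team to win.

    x: home minus visitor fractional score difference
    outs: number of outs so far in the game
    A, B: coefficient tables of length 60
    z: game-initial adjustment, constant for the length of a game
    """
    # Outs 54-59 cover one extra inning and are reused for later ones.
    o = outs if outs < 54 else 54 + (outs - 54) % 6
    return 0.5 + math.atan(z + (x - A[o]) / B[o]) / math.pi


# Placeholder coefficients, NOT the fitted values.
A = [0.0] * 60
B = [1.0] * 60
even_game = home_win_probability(0.0, 0, A, B)
home_up_one_in_eleventh = home_win_probability(1.0, 60, A, B)
```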

When Z is left to zero, the estimator reports a game initial probability of 53% for the home team to win, equal to the all time record. It may be desirable to adjust this to particular values when more information is available, without re-fitting the estimator's parameters. This is done by adding in the Z term. For regular season games, a ballpark home win rate is calculated for each ballpark code, and the Z value is adjusted to make this value the initial estimate. An even better reason for making this adjustment comes from the postseason. On the basis of the above estimator, a probability estimator for 232-format series was developed, and extended to all other series formats.

First, the series is categorized by series format as one of 1, 3, 221, 23, 232, X7, or X9. The home team of game 1 of the series is designated as the series home advantage team (even for 23). The X7 and X9 formats are neutral to the order of home games except for game 1. The X7 estimator works practically the same way as the 232 estimator, except that its statistical pool of games includes all 7-game series including 232. The X9 estimator is not based on the statistics of actual 9-game series (there are only 3), but instead is a scaffolding around the X7 estimator. The extra terms for X9 were added by hand based on continuing plausible trends in the X7 data. Below, substitute the appropriate series format for 232.

The estimator gives the probability of the home advantage team to win the series, with the series state taken as ( home advantage team wins, other team wins ). A table of observed win rates is calculated for each state, and then during a game the estimate is taken by using the home-team win probability calculated above to lever between the observed win rates for the states where the home team wins or loses. E.g. for game 1, 2, 6, or 7 that is

swp = hwp * s232[ hw + 1 ][ vw ] + ( 1 - hwp ) * s232[ hw ][ vw + 1 ]

where hw is the number of games won so far by the home advantage team, vw for the other team, and s232[ hw ][ vw ] is the table of win rates for 232-format series. The reason for the Z adjustment in the hwp estimator is to make the swp estimate continuous. In other words, the table of 232 series states implies game-initial probabilities via equations such as

swp0 = ( s232[0][0] - s232[0][1] ) / ( s232[1][0] - s232[0][1] )

which gives the series-initial estimate.
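One way to read this: the ratio is the game-initial home win probability that makes the swp formula reproduce the table value at the start of the game, and Z is then chosen so that hwp starts there. The sketch below follows that reading. The table entries are invented, extending the ratio to states other than [0][0] is an assumption, and so is solving for Z by inverting the arctan directly.

```python
import math


def implied_initial_probability(s, hw=0, vw=0):
    """Home win probability implied by the series table at state (hw, vw)."""
    win, lose = s[hw + 1][vw], s[hw][vw + 1]
    return (s[hw][vw] - lose) / (win - lose)


def solve_z(target, x0, a0, b0):
    """Z for which 0.5 + atan(Z + (x0 - a0) / b0) / pi equals target."""
    return math.tan(math.pi * (target - 0.5)) - (x0 - a0) / b0


# Invented corner of a series table, NOT real win rates.
s = [[0.54, 0.30], [0.70, 0.50]]
p0 = implied_initial_probability(s)
z = solve_z(p0, 0.0, 0.0, 1.0)
check = 0.5 + math.atan(z) / math.pi
```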

The numbers s232[ hw + 1 ][ vw ] and s232[ hw ][ vw + 1 ] are hypothetical probabilities or hypotheses, viewed from the start of the game regarding its outcome. I have introduced a stat for this called LVR or leverage, which is

sLVR = s232[ hw + 1 ][ vw ] - s232[ hw ][ vw + 1 ]

This describes the importance of a game in the series framework. In a maximum-length series the last game has an sLVR of 1 which is the highest possible value. In a series sweep the first game will generally have the highest sLVR.
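The series estimate and the leverage can be sketched together. The table below is a placeholder for a best-of-seven format, not observed win rates. The flag for games hosted by the other team is an assumption, since the formula in the text is written for games hosted by the home advantage team.

```python
def series_win_probability(hwp, s, hw, vw, advantage_team_is_home=True):
    """Probability for the home advantage team to win the series."""
    # Probability that the home advantage team wins this particular game.
    p = hwp if advantage_team_is_home else 1.0 - hwp
    return p * s[hw + 1][vw] + (1.0 - p) * s[hw][vw + 1]


def series_leverage(s, hw, vw):
    """sLVR: spread between the two hypotheses for the current game."""
    return s[hw + 1][vw] - s[hw][vw + 1]


# Placeholder table: 0.5 everywhere except the finished states.
s = [[0.5] * 5 for _ in range(5)]
s[4] = [1.0] * 5           # home advantage team has four wins
for row in s[:4]:
    row[4] = 0.0           # other team has four wins

game7_leverage = series_leverage(s, 3, 3)
game7_swp = series_win_probability(0.6, s, 3, 3)
```

With the series tied 3-3 the two hypotheses are certainty of winning and certainty of losing, so the leverage is 1 and swp equals hwp.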

Finally there is the universal or u estimator, the probability to win the world series in any game of the season. In the postseason the u estimates are currently generated from the s estimates in a very simple manner. All unplayed series are treated as 50/50 coin flips. This is done by introducing a u reduction factor ureduc, which is the inverse of the number of teams still competing for the world series at the end of the current series, and so the u values for the series home and visiting teams are

u[ HT ] = ureduc * swp
u[ VT ] = ureduc * ( 1 - swp )
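A short sketch, taking ureduc as the inverse of the number of teams left after the current series (the reading under which the world series itself gives u equal to swp):

```python
def postseason_u(swp, teams_left_after_series):
    """Return (u for the series home team, u for the visiting team)."""
    ureduc = 1.0 / teams_left_after_series
    return ureduc * swp, ureduc * (1.0 - swp)


# A league championship series leaves two teams standing afterwards.
u_home, u_visitor = postseason_u(0.6, 2)
# The world series leaves one, so u is just swp.
ws_home, ws_visitor = postseason_u(0.6, 1)
```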

For the regular season, the u estimator generates hypotheses for each game for each team using fitted arctan functions based on the current win minus loss standings and the number of games left in the season, the latter playing the same mathematical role as the number of outs in the game estimator. This is a somewhat lengthy and involved process, and even now has shortcuts and approximations that will eventually need to include yet more complications. I'll try to describe it fairly accurately. First of all there is a framework for matching hypotheses to games without tossing around a flood of game IDs, and so the days of the season are numbered. The Julian Day value for each day is computed, and the value of the day of the first official game is subtracted from each day giving a day calendar that starts at 1 (there is a table of these offset values). For double headers half days are introduced for game 2. The win minus loss standings for all teams are then tabulated on all day values where a game was played. The next step is a kind of mathematically irrelevant aside now, but was originally intended to sort of substitute for what followed - there was some experimenting involved. First, the win minus loss numbers are tabulated for every team in the slot - division or wildcard - and then differences are formed to count the number of games behind or ahead of the slot target each team sits at. Note that the slot target is the lead team's record for most teams, but the second team's record for the leader. Also note that these numbers I call games behind - double differences ( w - l ) - ( w - l ), are twice the numbers that you usually see as games behind. The games behind standings for each day are projected to their known values at the end of the season to form a table of expectation values for the end of season games behind value. 
This table is somewhat laboriously smoothed and then, having been tabulated separately for each number of games left in the season, a line was fitted by least squares to each value of games left, and so this fitted line gives the expectation value based on the current value. The next step is to feed this expectation value into a fitted arctan function which is of the form arctan( a*x + b ), and so the previous line fit from the expectation value stage is simply absorbed into the arctan fit and in the end plays no distinct role. I will therefore focus on the arctan part.
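
The day numbering and the doubled games behind numbers can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names are invented, calendar date differences stand in for Julian Day differences (they are the same thing), and the sign convention (positive when ahead of the slot target) is my choice for the example, not something stated above.

```python
from datetime import date


def season_day(game_date, first_game_date, game_number=1):
    """Day of the season, starting at 1; game 2 of a doubleheader adds half a day."""
    day = (game_date - first_game_date).days + 1
    return day + 0.5 if game_number == 2 else day


def games_behind(win_loss):
    """Doubled games behind for one slot.

    win_loss maps team -> (wins, losses) and needs at least two teams.
    The slot target is the leader's (w - l) for most teams, and the
    second team's (w - l) for the leader. Positive means ahead of target.
    """
    diff = {team: w - l for team, (w, l) in win_loss.items()}
    ranked = sorted(diff.values(), reverse=True)
    lead, second = ranked[0], ranked[1]
    return {team: d - (second if d == lead else lead) for team, d in diff.items()}


gb = games_behind({"A": (50, 40), "B": (48, 42), "C": (40, 50)})
```

Here team B comes out at -4, which is twice the 2 games behind you would see in a newspaper.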

First an explanation is required about a parameter not yet mentioned, the number of teams competing for the postseason and the division format, wildcards, etc. These are handled at a later stage by normalizing and combining the values that come out of the arctan functions. This is relevant here because it affects how the parameter fits are made. All seasons are combined in the function fitting, and so the arctan fits see an average value of the number of competitors - for both division and wildcard slots. This is desirable to give it a broad statistical base, and is controlled for later by normalization. The estimated unnormalized probability to reach the postseason for a given slot is

upps = 0.5 + ( 1 / pi ) * arctan ( ( X - A[g] ) / B[g] )

where g is the number of games left at that point, and X is the expectation value for games behind at the end of the season, although it could basically just as well be the current day value. The A and B coefficients are fit to the history of making or not making individual slots by teams using the same simulated annealer as the g estimator. A note about playoffs, or end of season tiebreakers. These are currently handled by tabulating them and offsetting the number of games left based on the table until the end of the season. This seems fairly effective, but it would probably be better in the long run to handle them as unscheduled postseason series, especially as the National League had 3-game playoffs on a couple of occasions. An advantage of doing that would be doing MVPs for them like the wild card series and games.
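
The upps formula is a one-liner in code. A minimal Python sketch, with the A[g] and B[g] coefficients passed in as plain numbers; the values used in the example are invented and are not fitted parameters.

```python
import math


def upps(x, a_g, b_g):
    """Unnormalized probability to reach the postseason for one slot.

    x: expected end-of-season games behind value.
    a_g, b_g: the A[g] and B[g] coefficients for g games left
    (placeholder values here, not the real fitted ones).
    """
    return 0.5 + math.atan((x - a_g) / b_g) / math.pi


example = upps(3.0, 1.0, 2.0)   # arctan(1) = pi/4, so this is 0.75
```

The arctan keeps the result strictly between 0 and 1 for any input, A[g] sets the point where the estimate is exactly 50%, and B[g] sets how quickly it saturates.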

Separately from this, an elimination table is produced so that teams' probabilities can be zeroed out when this happens. There is currently a shortcut in the elimination by making it a little anachronistic. A team is currently eliminated when the number of games left is less than their games behind the known end of year slot target. This is somewhat fuzzy compared to the historic value, but the latter may be kind of difficult to compute. When a game is rained out it doesn't appear in the record. Actually I think retrosheet does have what you would need for that, but it will take time to integrate it. Currently the combination of elimination and normalization produces automatic clinching, although slightly approximate in the current version.
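
My reading of the elimination shortcut, as a small Python sketch. The function and argument names are hypothetical, and it assumes the games behind value is measured in win minus loss units against the slot target's final record.

```python
def is_eliminated(wins, losses, games_left, final_target_wl):
    """Anachronistic elimination test against the known end-of-year target.

    final_target_wl: the slot target's final (w - l) value, known only
    in hindsight. A team can raise its own (w - l) by at most one per
    remaining game, so it is out when the deficit exceeds games left.
    """
    deficit = final_target_wl - (wins - losses)
    return games_left < deficit


# A 70-80 team with 12 games left, chasing a target that finished at +6.
out = is_eliminated(70, 80, 12, 6)
```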

For a simple single division slot these upps values are then summed for all teams in the division, and then normalized:

pps = upps / sum ( all slot upps )

However, at various stages including the current day there are much more complicated arrangements. This is handled in two stages. First there may be subsidiary slots. In 1981 and 2020 there were second subsidiary division slots, and at present there are subsidiary wildcard slots where three teams are chosen. These are all treated as independent slots with overlapping membership. Within each slot, superseding slot leaders are subject to what I call normalizer blackouts in the subsidiary slots. The superseding slot leaders are not included in the normalizer sum, but are still subject to its division to generate pps values. Therefore the total probability for the slot may sum to more than one. Also it is possible for the normalized values to exceed one, so they are clipped. There is one further shortcut approximation in the normalizer. That is that the hypotheses for each slot for each day are done as a group. They can't be divided by their actual sums because it's not generally possible for all teams to win or lose simultaneously - the exact schedule of matchups isn't currently considered. The input values to each hypothesis are generated by adding one (for the win) or subtracting one (for the loss) from yesterday (or half-day)'s standings. And so a third group is generated for the current day's standings, and this value is used to compute the normalization denominator for both sets of hypotheses. This is a fairly good approximation overall but you can occasionally spot it if you look. Eventually a complete schedule of games will be used to produce accurate individually normalized per-game hypotheses by putting the game's win and loss values with flat previous-day values for the other teams.
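
The normalizer with blackouts and clipping might look something like the following Python sketch. The names are invented for the example, and it leaves out the shared denominator for the win and loss hypothesis groups described above.

```python
def normalize_slot(upps_by_team, blackout=()):
    """Turn upps values for one slot into pps values.

    blackout: teams (superseding slot leaders) left out of the
    normalizer sum but still divided by it. Results are clipped at 1.
    """
    denom = sum(v for team, v in upps_by_team.items() if team not in blackout)
    if denom <= 0.0:
        return {team: 0.0 for team in upps_by_team}
    return {team: min(1.0, v / denom) for team, v in upps_by_team.items()}


raw = {"A": 0.6, "B": 0.3, "C": 0.1}
plain = normalize_slot(raw)                    # sums to 1
subsidiary = normalize_slot(raw, blackout={"A"})  # A is clipped from 1.5 to 1
```

In the subsidiary case the values come out as 1, 0.75 and 0.25, which shows both effects mentioned above: the slot total exceeds one, and the blacked-out leader needs clipping.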

These independent slot pps values are finally combined using one-of-these logic, i.e. two events a and b combine as 1 - ( 1 - a ) * ( 1 - b ) or a + b - a * b. These combined pps values are final as far as they go. They do not currently take into account the first round byes that some slots lead to, but that will be added eventually.
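
The one-of-these combination generalizes to any number of slots by multiplying the miss probabilities. A short Python sketch (the function name is made up):

```python
def combine_slots(pps_values):
    """Probability of making the postseason through at least one slot,
    treating the slots as independent events."""
    miss = 1.0
    for p in pps_values:
        miss *= 1.0 - p
    return 1.0 - miss


both = combine_slots([0.5, 0.2])   # 0.5 + 0.2 - 0.5 * 0.2 = 0.6
```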

To finally generate the u hypothesis values they are currently multiplied by common regular season ureduc values, equal to one over the number of teams entering the postseason.

hyp[ w ] = ureduc * pps[ w ]
hyp[ l ] = ureduc * pps[ l ]

Finally, the u value is produced by using the hypotheses with the game estimator:

u[ HT ] = hwp * hyp[ HT ][ day ][ w ] + ( 1 - hwp ) * hyp[ HT ][ day ][ l ]
u[ VT ] = ( 1 - hwp ) * hyp[ VT ][ day ][ w ] + hwp * hyp[ VT ][ day ][ l ]
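
Putting the last two steps together, a minimal Python sketch of the regular season u value for one game. The function names and all the numbers are invented for illustration; hwp is the game estimator's home win probability.

```python
def game_hypotheses(pps_win, pps_loss, n_postseason_teams):
    """Win and loss hypotheses for one team on one day."""
    ureduc = 1.0 / n_postseason_teams
    return ureduc * pps_win, ureduc * pps_loss


def regular_season_u(hwp, hyp_home, hyp_visitor):
    """Blend each team's win and loss hypotheses by the game estimator."""
    home_w, home_l = hyp_home
    vis_w, vis_l = hyp_visitor
    u_ht = hwp * home_w + (1.0 - hwp) * home_l
    u_vt = (1.0 - hwp) * vis_w + hwp * vis_l
    return u_ht, u_vt


home = game_hypotheses(0.62, 0.58, 12)
visitor = game_hypotheses(0.30, 0.26, 12)
u_ht, u_vt = regular_season_u(0.55, home, visitor)
```

Each u value always lies between that team's loss and win hypotheses, and slides toward one or the other as the game estimator moves.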

Eventually separate ureduc values for each slot will be calculated and multiplied together for each slot for each team separately before being multiplied by the pps value after all combining operations (the combining function is obviously nonlinear). There are a couple of other things that need to be introduced eventually, and some that are less likely to actually happen. It doesn't currently know about tiebreakers, and kind of clinches the wrong teams once or twice. I will look into crowbarring those in the short-ish run, but eventually it will try to introduce those calculations live as they happen by adding 0.5 then 0.25 etc. to the record for each succeeding tiebreaker as it clinches. Then there is the clinching of postseason home field advantage at previous stages or during the regular season. I don't currently remember all the rules for this over the years and need to do a refresher, but the current rules are complicated enough to put it off in a first attempt. I will probably eventually do this but it is a fair amount of work for a small effect. And then there's 1981. Currently it just gets this totally wrong in the RS. But on the other hand the two part season frankly sucks from this perspective. Teams that clinched in the first half have no stake in the second half, and the first half is retrospectively irrelevant for teams that clinched in the second half. The RSPS MVPs for this year correspond poorly to the actual postseason format, although they're generated for the right teams. There is no problem with 2020.

The LVR values for the u estimator are just ureduc * sLVR for the postseason, but for the regular season they are

uLVR = ureduc * ( hyp[ w ] - hyp[ l ] )

If ACW is a sort of weighted run value average, LVR describes how the sACW or uACW value weights this game differently. gACW in effect always has a LVR of 1.

ACW

ACW is a log-probability-ratio statistic for hitting and pitching. It answers the question: what effect do this player's actions have on their team's chance to win the game? This is in principle calculated like a run-value statistic, but there is a mathematical wrinkle: probabilities are not additive, as 1 is the maximum value a probability can have. In order to make such an additive statistic - which is necessary if it is to be averaged per plate appearance - you can take the logarithm of a ratio, which then behaves like a difference. The probability goes to 0 or 1 at the end of a non-tie game, which causes another problem, that the logarithm of 0 is minus infinity. This in turn can be avoided by always calculating from the winning team's point of view, and then negating each term for the losing team. In effect, the game is played time-reversed for the losing team. It is centered on zero by virtue of the fact that the probability estimator starts in the middle of the probability range. To explain in more detail, a single term in a player's ACW is calculated in two steps. First, the number for a play equivalent to an ERD value, called an HWD, is calculated in one of two ways depending on whether the home team won:

HWD = log(hwp) - log(previous hwp)

or lost:

HWD = -log(1 - hwp) + log(1 - previous hwp)

Finally, the HWD is added to the averages of home team hitters and visiting team pitchers, and subtracted from visiting hitters and home pitchers. In other ways, this stat is calculated in the same way as ACR and ACOR. In particular, the rules for attributing the HWD to baserunners are the same as for ACR, but with the sign of the HWD in place of the sign of ERD. This produces ACWc and ACWr. For ACWa, the factor is 20 for both hitting and pitching, although career uACW uses a factor of 2000. The ACWc for a nominal win is equal to log(2). For ACWa, a single plate appearance with this value would display as 13.863. This is hardly ever relevant, which is why I didn't choose a more sensible value for this. There are also hitting and pitching ACW aggregates called H and P, which are designed to allocate win or loss credit to the hitting and pitching sides in a game. These are only count-type stats, and the display version is multiplied by 100 / log(2) and then rounded to an integer. The intent with this constant was that a nominal win from a starting probability of 50% would be worth 100 points combined for the winning H and P numbers. In fact the home team advantage is fairly large after taking the logarithm, so it turns out that you might see them instead add up to 80 or 120, for a home team win or loss respectively. Those two numbers reliably add up to 200, but you can't tell that from any single game. The split between those two numbers depends on the ballpark, and they may in fact be reversed. The sH and sP numbers are similar, but for series home advantage team wins or losses. In this case the split is determined by the series format instead of the ballpark. If you want to compare player ACW numbers to H or P, multiply by 7. So if a hitter is listed as having a +1.000 gACW average over 4 plate appearances, then he provides 28 gH credits - actually 29.
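
To make the HWD arithmetic concrete, here is a minimal Python sketch. The function names and the sequence of estimator values are invented, and it ignores how each term is attributed to hitters, pitchers and baserunners.

```python
import math

ACWA_FACTOR = 20.0                 # display factor for ACWa
HP_FACTOR = 100.0 / math.log(2.0)  # display factor for H and P credits


def hwd(prev_hwp, hwp, home_won):
    """One HWD term from the home win probability before and after a play."""
    if home_won:
        return math.log(hwp) - math.log(prev_hwp)
    return -math.log(1.0 - hwp) + math.log(1.0 - prev_hwp)


def game_hwds(hwp_path, home_won):
    """HWD terms for a whole game, given the estimator value after each play."""
    return [hwd(a, b, home_won) for a, b in zip(hwp_path, hwp_path[1:])]


path = [0.5, 0.7, 0.4, 1.0]         # made-up values; the home team wins
total = sum(game_hwds(path, True))  # telescopes to log(1) - log(0.5) = log(2)
```

The terms telescope, so a win from a 50% start always totals log(2) however the game gets there, which displays as 100 in H and P credits. The ratio HP_FACTOR / ACWA_FACTOR is about 7.2, which is where the multiply by 7 rule comes from, and why four plate appearances at +1.000 give 29 rather than 28.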

ACW has never been calculated for a live game but it could be, although not knowing who wins there are two potential values. I provisionally call them hot ACW and cold ACW, based on the assumption of the player's team winning or losing respectively. I eventually plan to add support for calculating this to the program bb-erd, but it requires kind of a lot of information to be entered - ERD code, ballpark code, score difference, and half-inning number for each play for a start, and the baserunner rules are something else. Actually during a game you would use the game estimator to produce a single ACW number by interpolating between hot and cold with the current estimate.
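
A hedged Python sketch of what hot and cold ACW terms could look like for a single play, seen from the player's own team. The names are hypothetical, and the straight linear interpolation is my assumption about what interpolating would mean here; nothing like this exists in bb-erd yet.

```python
import math


def hot_cold_terms(prev_p, p):
    """Return (hot, cold) terms for one play.

    prev_p, p: the player's team's win probability before and after
    the play, both strictly between 0 and 1.
    """
    hot = math.log(p) - math.log(prev_p)                # team goes on to win
    cold = -math.log(1.0 - p) + math.log(1.0 - prev_p)  # team goes on to lose
    return hot, cold


def live_term(prev_p, p):
    """Interpolate between hot and cold using the current estimate
    (assumed linear weighting)."""
    hot, cold = hot_cold_terms(prev_p, p)
    return p * hot + (1.0 - p) * cold


hot, cold = hot_cold_terms(0.5, 0.6)
live = live_term(0.5, 0.6)
```

A play that helps the team is positive under both assumptions, but the two values differ, which is why a finished game is needed to settle on one of them.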

With the s and u estimators come sACW and uACW. Two separate game estimators are maintained in series games, one with the Z value set to the ballpark table, used for gACW and gX, and the other using the s estimator Z value used for sACW and sX. In a series, u follows s, but in the regular season it follows g. The SWD term for sACW is calculated the same way, with the series winner replacing the game winner, and then these are applied to players with the appropriate sign for the series. For uACW this is a little different. The hot and cold ACW analysis is fine for two teams in opposition, but for many teams you use only the cold or losing team formula until only two remain, and then the winner is identified and given the hot term. A separate u estimator is maintained for each team, and uACW values are always calculated by adding the Udiff term for the player's team. There are games where a team is either eliminated or clinched in the regular season which give no contribution to ACW. At present these plate appearances are still counted along with the zeroes contributing to the average.

The X-factor

X for a game is calculated as

gX = sqrt( sum( ACW^2 ) )

gX only has gXc and gXa values, and the latter is multiplied by 20. Like ACW, gX is undefined for a tie game. Why compute this? It's basically a measure of how "long a path" the probability estimator takes to the win or loss. Or in more technical language, a confusion metric, as it is high when the estimator can't make up its mind, i.e. has low information. In ordinary human terms it means an exciting game. A low X value also has an interpretation as a highly meritocratic outcome.
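
A small Python sketch of the raw X calculation, to show the long path idea. The function name and both probability paths are made up, the terms are the per-play HWD values for a home team win, and no display factor is applied.

```python
import math


def raw_x(hwp_path):
    """sqrt of the sum of squared log-probability terms for a home win."""
    terms = [math.log(b) - math.log(a) for a, b in zip(hwp_path, hwp_path[1:])]
    return math.sqrt(sum(t * t for t in terms))


direct = raw_x([0.5, 0.75, 1.0])      # home team pulls steadily ahead
seesaw = raw_x([0.5, 0.9, 0.2, 1.0])  # lead changes hands before the win
```

Both games add up to the same log(2) in ACW terms, but the see-saw game has a much larger X because the squared terms don't cancel.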

X is also defined for series from the s estimator - sX, and uX from u. A note that the u estimator generally changes value between games, but these terms are not included in season uX. The u estimator especially reveals some characteristics of the X statistic that weren't obvious before. One is that X actually decreases as the event being measured gets longer. Extra inning games typically have higher gX, but that's because extra innings only happen in tie (i.e. high-X) games. There is a pure algebraic sense in which X decreases as more competitors are added to "the same" competition, but in the real world high X values come from contended races and there are more of those, and so a higher X value, with more teams involved. Regular season uX values are therefore quite small, and have been getting smaller in recent years. The addition of postseason series lowers the ureduc value for the regular season. For example, TOR 1987 is the only post-LCS era team with a regular season uX value over 5.

Another wrinkle alluded to above is that in postseason games, there are in fact two different game estimators operating with different Z values, one with the ballpark value producing the g statistics, and the other with the swp-based Z value producing the s numbers. The display of half-inning gX values on the graph unfortunately exposes this. You may, for example, see a 20X half-inning in a game said to have a gX of 13. X is of course nondecreasing, so this is simply due to the two different estimators at work. The graph shows the swp version (what might be called the gs estimator) for graphical continuity, and so also uses that for the X values. The tabulated gX numbers are all on the ballpark numbers for statistical continuity. The difference in X values is mainly due to the difference in gLo values between the two estimators, as X is a strong function of gLo.

Career Rankings

There are all-time rankings in each of ACR, ACOR, BCR, CFCOR, IFCOR, OFCOR, hitting ACW, and pitching ACW. For each ranking there is a minimum of plate appearances. This is equal to 2000 for everything except FCOR, where it is 12000 due to its long time to convergence. Career uACW is multiplied by 100 to make it readable, but postseason uACW is displayed normally.

Combined Rankings

This is the first of the advanced combined metrics, which involve the product of (at least) two terms, the first a sum of run value terms, the second a sum of ACW terms. Other items may also be included multiplicatively. The reason for this is unit algebra. The run value terms sum because they're in units of runs or runs per plate appearance. The ACW terms also naturally add, thanks to the logarithm. But terms in different units shouldn't be added, but rather multiplied.

There is a complication, which is that the product of two or more signed terms does not produce a satisfactory scalar, due to sign ambiguity. One hitter with positive ACR and ACW would wind up the same as another hitter with negative ACR and ACW. There are different ways of dealing with this, but each ranking must choose one.

Regular Season MVP

The regular season MVP has more multiplicative terms than the other combined rankings, and uses exclusion to defeat sign ambiguity. There are awards every year for two pitchers and two hitters, but the categories are unusual: LHP, RHP, B vs LHP, B vs RHP. I think these are more natural and interesting categories for the sport as it has evolved to the present day. First, the metrics are defined as (all numbers for the regular season only; O is outs):

B vs LHP = ACRc * LACRc^(3/2) * gACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
B vs RHP = ACRc * RACRc^(3/2) * gACWr * sqrt( +1c ) * sqrt( (W - L) / (W + L) )
LHP = ACORr * gACWr * O^2 * sqrt( (W - L) / (W + L) )
RHP = ACORr * gACWr * O^2 * sqrt( (W - L) / (W + L) )

Hitters are eligible for either or both awards if the following are positive:

ACRr gACWr LACRr RACRr (W - L) (W + L)

The +1c term is made positive by adding 0.01 prior to the sqrt, in order to give unique rankings to players with zero +1c.

Pitchers are eligible based on their throwing hand and whether the following are positive:

-ACORr gACWr (W - L) (W + L)

The eligibility criteria are relatively strict, but they produce enough runners up every year. These awards are a combination of more factors than the others to make them a little more unpredictable.
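
For the hitting awards, the exclusion rule and the metric can be sketched as follows in Python. This is an illustration only: the dictionary keys (including plus1c for +1c) and the sample numbers are invented, and it assumes that only the split matching the award (LACR for vs LHP, RACR for vs RHP) has to be positive, with the c values sharing the sign of the r values.

```python
import math


def hitter_mvp(s, vs="L"):
    """Regular season MVP metric for B vs LHP (vs="L") or B vs RHP (vs="R").

    Returns None when the hitter is not eligible for that award.
    """
    split = "LACR" if vs == "L" else "RACR"
    w, l = s["W"], s["L"]
    must_be_positive = [s["ACRr"], s["gACWr"], s[split + "r"], w - l, w + l]
    if any(v <= 0 for v in must_be_positive):
        return None
    return (s["ACRc"] * s[split + "c"] ** 1.5 * s["gACWr"]
            * math.sqrt(s["plus1c"] + 0.01)   # 0.01 keeps a zero +1c rankable
            * math.sqrt((w - l) / (w + l)))


hitter = {"ACRr": 0.05, "ACRc": 30.0, "LACRr": 0.06, "LACRc": 9.0,
          "RACRr": -0.01, "RACRc": -2.0, "gACWr": 0.5, "plus1c": 3.99,
          "W": 20, "L": 5}
vs_lhp = hitter_mvp(hitter, "L")
vs_rhp = hitter_mvp(hitter, "R")   # None: negative RACRr excludes him
```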

Postseason Series MVP

This is the simple and classic combined ranking. The run value term is a simple sum:

smvpRV = ACRc - ACORc

And the log-probability-ratio term is equally simple (h and p refer to hitting and pitching):

smvpLPR = hsACWc + psACWc

The full metric is just:

smvp = smvpRV * smvpLPR

Eligibility is based on the following being positive:

smvpRV smvpLPR (hPA + pPA)

This seems to be a reliable and sensitive ranking putting hitters and pitchers on an equal footing over the length of a series.
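
In code the series MVP is about as short as the formulas suggest. A minimal Python sketch with invented dictionary keys and sample numbers; it also shows why the eligibility test is needed to defeat the sign ambiguity.

```python
def series_mvp(p):
    """Series MVP metric, or None if the player is not eligible."""
    smvp_rv = p["ACRc"] - p["ACORc"]
    smvp_lpr = p["hsACWc"] + p["psACWc"]
    if smvp_rv <= 0 or smvp_lpr <= 0 or p["hPA"] + p["pPA"] <= 0:
        return None
    return smvp_rv * smvp_lpr


star = {"ACRc": 2.5, "ACORc": 0.0, "hsACWc": 0.8, "psACWc": 0.0,
        "hPA": 28, "pPA": 0}
goat = {"ACRc": -2.5, "ACORc": 0.0, "hsACWc": -0.8, "psACWc": 0.0,
        "hPA": 28, "pPA": 0}
star_score = series_mvp(star)   # 2.5 * 0.8 = 2.0
goat_score = series_mvp(goat)   # None; the bare product would also be 2.0
```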

Regular Season for Postseason MVP

This is for the player who does the most to get each team to the postseason, and to show off the regular season uACW stat, which knows the exact postseason leverage of each game. This is essentially exactly the same as the series MVP but with regular season RV and uACW values.

Hall of Fame

The purpose of the Hall of Fame (hof) is not just to put hitters and pitchers together, but to rank essentially everyone, i.e. to have minimal eligibility criteria, in this case having at least one plate appearance as either hitter or pitcher. This necessitates an alternative to the sign ambiguity question, and the answer is to include a global bias term for each of the three (RV, WL and LPR) multiplicative terms to make everyone just positive. This proceeds in a few steps. First, the universal run value and log-probability-ratio terms, which are based on all combined statistics, and then the win-loss record:

hofRV = ACRc + BCRc - ( ACORc + (FCORc / 3) )
hofLPR = hgACWr + pgACWr + huACWr + puACWr
hofWL = HW + PW - (HL + PL)

Next, the bias terms (negated because they're always negative):

hofRVbias = - min (all players) hofRV + 1e-9
hofLPRbias = - min (all players) hofLPR + 1e-9
hofWLbias = - min (all players) hofWL + 1

Finally, the combined metric:

hof = ( hofRV + hofRVbias ) * ( hofLPR + hofLPRbias ) * sqrt(hofWL + hofWLbias) * ( total PA from ACR+ACOR+BCR+(FCOR/3) )^(1/4)

Finally, FCOR is reduced in strength by a factor of 3, which turns out to be a surprisingly complicated judgement. The basic reason is that FCOR is the least reliably calculated of the main stats, and by the nature of baseball it was giving itself, in effect, too much weight in the sum, having typically 9 or so times as many plate appearances as hitting. Why reduce it by a factor of 3 and not something else? Because it also, as a side effect, changes the balance between hitters and pitchers in the ranking. It doesn't cause one to simply rise or fall, but reducing FCOR has the effect of making pitchers cluster nearer the center of the list. This effect is visible in the published rankings to some extent - hitters tend to be at the very top and bottom, although a few pitchers come very close at both ends. Hitting averages tend to be more concentrated than pitching averages, and the effect of FCOR in the RV term is to moderate the overall RV of hitters, as FCOR tends to be even smaller than ACOR. The factor of 3 seems to best balance these two competing tendencies.
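
The bias step and the combined metric can be sketched like this in Python. The function name, the dictionary layout and the two sample players are invented, and the hofRV, hofLPR, hofWL and PA inputs are assumed to be already combined as described above (including the FCOR / 3 reduction).

```python
import math


def hof_scores(players):
    """players maps name -> dict with hofRV, hofLPR, hofWL and PA."""
    rv_bias = -min(p["hofRV"] for p in players.values()) + 1e-9
    lpr_bias = -min(p["hofLPR"] for p in players.values()) + 1e-9
    wl_bias = -min(p["hofWL"] for p in players.values()) + 1
    return {
        name: (p["hofRV"] + rv_bias)
        * (p["hofLPR"] + lpr_bias)
        * math.sqrt(p["hofWL"] + wl_bias)
        * p["PA"] ** 0.25
        for name, p in players.items()
    }


sample = {
    "good": {"hofRV": 10.0, "hofLPR": 2.0, "hofWL": 5, "PA": 16},
    "bad": {"hofRV": -3.0, "hofLPR": -1.0, "hofWL": -2, "PA": 81},
}
scores = hof_scores(sample)
```

With the biases in place every factor is positive, so the worst player in each term lands just above zero instead of flipping the sign of the product.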

A description of each file in the distribution.

First the executables, in run_everything order:

@ indicates that inputs or outputs are made via the generate script and not directly

run_everything	Master script that runs all the necessary generate targets in correct order
download	Download alldata.zip from retrosheet.org
		writes: alldata.zip
generate	Substitute for a proper makefile, also miscellaneous code snippets
		The reference for how to run each command properly (see run_everything for
		the proper order)
		reads: alldata.zip erd/ basic-stats guides/ bb-post hitting-stats/ pitching-stats/
		baserunning-stats/ hw-hitting-stats/ hw-pitching-stats/ fielding-stats/ gacw/
		psteams-tab
		writes: rs/ basic-stats papg erv-lib never-seen bb-post.ps sample.ps sample.pdf
		best-RS-hitting best-RS-pitching best-RS-baserunning best-RS-hw-hitting
		best-RS-hw-pitching best-RS-catching best-RS-infielding best-RS-outfielding
		unfiltered-best-RS-hitting unfiltered-best-RS-pitching
		unfiltered-best-RS-baserunning unfiltered-best-RS-hw-hitting
		unfiltered-best-RS-hw-pitching unfiltered-best-RS-catching
		unfiltered-best-RS-infielding unfiltered-best-RS-outfielding games-by-x series-by-x
		seasons-by-x regular-seasons-by-x total-year-x
parks-seen	Tool to generate a list of games per park
		reads: rs/postseason/ rs/events/
		writes: @games-per-park
bio-names	Convert retrosheet biofile.csv to namedb used by all software to map RS ID codes to names
		reads: rs/biofile.csv
		writes: namedb
rs2erd		Translator retrosheet -> erd
		reads: @rs/postseason/ @rs/events/
		writes: @erd/
erv-tab		Calculates the main erv tables
		reads: erv-lib erd/
		writes: @tab-erv
bb-erd		Tool to interactively explore erd codes and the probability estimators - h for help message
		reads: erv-lib s232 winp-lib basic-stats 
		writes: @plays-erd
mplhi		Generate a list of peculiar all-time stats about games
		reads: @erd/
		writes: @most_pl_hi
gamelog		Summarize master log of games won/lost per team per day
		reads: @rs/gamelog/
		writes: @daytab/
rstb		Generate table of regular season tiebreakers
		reads: rs/events/
		writes: @playoffs
standings	Generate, in stages, the table of daily probabilities for each team to reach the postseason
		reads: daytab/ winp-lib urs-params
		writes: thresh-tab elim-tab day-offset urs-params pps/
pl-st		Summarize the careers of all players
		reads: @erd/ namedb
		writes: @all-players coachdb
get-formats	Summarize the formats of postseason series
		reads: erd/
		writes: @psformats
hps		Calculate all player run-value stats
		reads: @erd/ tab-erv rs/rosters/
		writes: pitching-stats/ fielding-stats/ hitting-stats/ baserunning-stats/ team-stats/
series-wins	Generate table of who won every postseason series
		reads: erd/
		writes: win/ champ-tab psteams-tab
run-stats	Determines hitting and pitching wins and losses, calculates fractional R stat
		reads: tab-erv erd/ day-offset psformats winp-lib daytab/
		writes: runs/ paob gamedb goodyears
cwl		Tabulates career win-loss records
		reads: runs/
		writes: winloss
erd-parse	Tabulates pa o br r ACR ACOR per year/team/series (also model erd parser)
		reads: tab-erv papg @erd/
		writes: season-stats/
homewins	Calculate the parameters of the game probability estimator
		reads: @erd/ tab-erv winp-params
		writes: winp-params
s232-tab	Calculates the various series estimator parameters (not just 232 anymore)
		reads: psformats erd/
		writes: @s232
bwp		Calculate ballpark home win probabilities
		reads: erd/
		writes: bwp-tab
hwprob		Run the probability estimator for every play of every game
		reads: @erd/ tab-erv winp-lib bwp-tab winp-params psformats s232 pps/
		writes: @hwp/
hwhps		Calculate all player log-probability-ratio stats
		reads: tab-erv winp-lib winp-params bwp-tab erd/ psformats s232 champ-tab pps/
		writes: hw-pitching-stats/ hw-hitting-stats/ gacw/ team-hw-stats/
career		Calculate and tabulate career rankings in all main stats categories
		reads: papg all-players hitting-stats/ pitching-stats/ baserunning-stats/ paob
		hw-hitting-stats/ hw-pitching-stats/ fielding-stats/
		writes: career-hitting career-baserunning career-hw-hitting career-hw-pitching
		career-pitching career-catching career-infielding career-outfielding
sumhp		Summarize hitting and pitching win credits for all series and seasons
		reads: @gacw/
		writes: yhp/
hpmvp		Determine the hitting and pitching MVPs for each season
		reads: @erd/ hitting-stats/ pitching-stats/ hw-hitting-stats/ hw-pitching-stats/ runs/
		writes: @mvp runners-up
sm-hof		Determine series MVPs and the all-time universal ranking
		reads: erd/ all-players pitching-stats/ hitting-stats/ fielding-stats/ winloss
		hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/ psteams-tab
		writes: hof smvp rsps
series-match	Tabulate team and player leaderboards for each year, generate the willow
		reads: @erd/ papg runners-up series-by-x hw-hitting-stats/ hw-pitching-stats/ runs/
		season-stats/ win/ unfiltered-best-RS-hitting unfiltered-best-RS-pitching mvp smvp rsps
		all-players career-pitching career-hitting hitting-stats/ pitching-stats/ gacw/ winp-lib
		seasons-by-x total-year-x goodyears
		writes: @season/ @guides/*-season.ps @willow
guide-career	Generate player career summaries for any postseason series
		reads: @erd/ broadcasters all-players coachdb
		writes: @guides/*-career.ps
guide-graph	Generate the graph of any game
		reads: @erd/ tab-erv hwp/ 
		writes: guides/*-graph.ps
guide-roster	Generate the roster of any game to accompany the graph
		reads: @erd/ tab-erv papg all-players mvp smvp gacw/ runs/ pitching-stats/ hitting-stats/
		fielding-stats/ hw-hitting-stats/ hw-pitching-stats/ baserunning-stats/ rsps team-stats/
		career-hitting career-pitching career-hw-hitting career-hw-pitching team-hw-stats/ winloss
		career-baserunning career-catching career-infielding career-outfielding win/ hof paob yhp/
		bwp-tab
		writes: guides/*-roster.ps
winp-lib	Library of probability estimator functions
		reads: day-offset
30wins		Tool to show all 30-game winners
		reads: runs/
acctruns	Tool to check that the fractional R stat is calculated correctly
		reads: runs/
bio-check	Tool to check retrosheet biofile.csv for parse failures - output should be empty
		reads: rs/biofile.csv
extract-gacw	Tool to extract gacw entry for any game
		reads: gacw/
extract-game	Tool to extract any game by retrosheet game ID from the erd database
		reads: erd/
extract-hwp	Tool to extract hwp entry for any game
		reads: hwp/
extract-runs	Tool to extract runs entry for any game
		reads: runs/
game-pdf	Tool to make a complete pdf for any game
		runs: extract-game extract-hwp extract-gacw extract-runs guide-graph
		guide-roster guide-career series-match pscat
hwp-report	Tool to crudely examine the probability estimator's a posteriori accuracy
		reads: hwp/
list-codes	Tool to generate a list of all erd codes seen by order of frequency
		reads: erd/
max-pa-br	Tool to sort all baserunners by PA on base per year
		reads: baserunning-stats/
pscat		Tool to concatenate postscript files
		reads: common-ps-header
series-pdf	Tool to make a complete pdf for any series
		runs: guide-graph guide-roster guide-career series-match pscat
text2ps		Tool to make postscript from a text file, used for guide-career and bb-post
viewdec		Tool to sort pitchers and hitters per year by W - L
		reads: runs/
