Biased Stats in the NBA

One of my favorite NBA-related articles is Tommy Craggs’ “The Confessions Of An NBA Scorekeeper”, which recounts the experiences of a scorekeeper named Alex in the 1990s. The article highlights the common occurrence of “stat-padding,” or the practice of inflating the stats (e.g., assists, steals, blocks, and rebounds) of players on the home team. As Craggs writes:

Alex quickly found that a scorekeeper is given broad discretion over two categories: assists and blocks (steals and rebounds are also open to some interpretation, though not a lot). “In the NBA, an assist is a pass leading directly to a basket,” he says. “That’s inherently subjective. What does that really mean in practice? The definition is massively variable according to who you talk to. The Jazz guys were pretty open about their liberalities. … John Stockton averaged 10 assists. Is that legit? It’s legit because they entered it. If he’s another guy, would he get 10? Probably not.”

“The Confessions Of An NBA Scorekeeper”

Alex’s comment on Stockton caught my attention. While I was pretty certain stat-padding existed 20 years ago and still does to this day, I was curious to what degree the NBA’s all-time career leaders benefited from this bias.

Methodology

I pulled the top 25 all-time career leaders for each of the following categories from Basketball Reference: points, assists, steals, blocks, and rebounds. This yielded a total of 78 unique players, as some players were ranked on the all-time list in multiple categories. I then pulled the stats for each player, split by home vs. road games.

Note that steals and blocks were not officially recorded in the NBA until the 1973–74 season. Furthermore, not all statistics were broken out by home vs. road splits until more recently, which means the analysis of bias could not be completed for many of the older stars, including Bill Russell, Kareem Abdul-Jabbar, Magic Johnson, Moses Malone, Oscar Robertson, and Wilt Chamberlain.

Setting the Benchmark (Points)

It’s fairly well-established that teams play better at home than on the road. To confirm this, I measured each player’s points per home game and compared it to his points per road game. On average, players scored 2.8% more points per game at home than on the road, with a standard deviation of 5.1%.

Blue = Average; Green = +/- 1 SD; Red = +/- 2 SD

Strong positive outliers included Tree Rollins, Shawn Bradley, and Mookie Blaylock (two of these players also belong on the all-time greatest names list; I’ll let you guess which), who were all more than two standard deviations higher than the mean. Jermaine O’Neal was the only negative outlier more than two SDs lower than the mean. Notably, the top six career point leaders were all below average.

I then compared each player’s home vs. road performance for assists, steals, blocks, and rebounds relative to his home vs. road scoring performance. For example, if a player scored, on average, 5% more points per game at home than on the road and grabbed 10% more rebounds per game at home than on the road, then the relative home bias of his rebounding performance would be \frac{1.10}{1.05}-1=4.76\%.
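For concreteness, here’s what that calculation looks like in a few lines of Python (a quick sketch of the metric as defined above, not my actual spreadsheet; the numbers mirror the hypothetical example):

```python
def relative_home_bias(stat_home, stat_road, pts_home, pts_road):
    """Home/road ratio of a stat, normalized by the player's home/road ratio of points."""
    stat_ratio = stat_home / stat_road
    pts_ratio = pts_home / pts_road
    return stat_ratio / pts_ratio - 1

# The example from the text: +5% scoring at home, +10% rebounding at home
print(f"{relative_home_bias(11.0, 10.0, 21.0, 20.0):.2%}")  # 4.76%
```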

The underlying assumption here is that in the absence of any stat-padding, there should not be significant relative home bias in any of the statistical categories. However, given Alex’s scorekeeping experiences, we would expect to see some degree of bias in all four categories, especially assists and blocks.

My analysis revealed the following results:

Assists

Blue = Average; Green = +/- 1 SD; Red = +/- 2 SD

Relative to the baseline (i.e. points), assists showed a relative home bias of 6.4%, with a standard deviation of 9.6%.

Almost everyone fell within two SDs of the mean, although Theo Ratliff was an extreme positive outlier, albeit on small volume. Note that John Stockton, the all-time assists leader by a long shot, had a relative home bias of only 3.6%, indicating a very low likelihood of stat-padding. On the other hand, Jason Kidd, the second all-time assists leader, had a relative home bias of 16.5%.

Of course, a high relative home bias doesn’t necessarily mean that there was stat-padding going on. Kidd also had an average home vs. road point performance of negative 4.5%. One explanation is that he played more as a facilitator at home while having to shoulder more of the scoring burden while on the road.

Steals

Blue = Average; Green = +/- 1 SD; Red = +/- 2 SD

Relative to the baseline (i.e. points), steals showed a relative home bias of 3.2% (half that of assists), with a standard deviation of 9.3% (roughly the same as that of assists).

Again, almost everyone fell within two SDs of the mean, although Manute Bol was an extreme positive outlier on small volume. Alvin Robertson was also more than two SDs higher than the mean, on much higher volume. Remarkably, John Stockton, also the all-time steals leader by a decent margin, had a relative home bias of only 2.0%, indicating once again that he was the real deal.

By now, you may have noticed that Dikembe Mutombo was more than two SDs below the mean for both assists and steals. It doesn’t really make sense for stat-padding to go the other way, so the likely explanation for negative bias is simply underperformance. The numbers look so extreme in this case because of small sample size. Mutombo was an all-time rebounding great who averaged 10.7 boards at home and 10.0 on the road. However, he also only scored 10.3 points at home and 9.4 on the road (benchmark of 9.8%). He had so few assists (1.0 home vs. 1.1 road) and steals (0.4 vs. 0.5) that very small absolute differences in home and road performance led to large percentage biases (-15.4% and -25.4%) relative to his baseline.
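Plugging Mutombo’s per-game splits into the same formula shows how tiny absolute gaps turn into big percentage swings. (This sketch uses the rounded averages quoted above, so the outputs land near, but not exactly on, the -15.4% and -25.4% figures, which presumably come from unrounded splits.)

```python
pts_ratio = 10.3 / 9.4                    # scoring benchmark: ~+9.6% at home
ast_bias = (1.0 / 1.1) / pts_ratio - 1    # ~-17%
stl_bias = (0.4 / 0.5) / pts_ratio - 1    # ~-27%
reb_bias = (10.7 / 10.0) / pts_ratio - 1  # ~-2%
print(f"{ast_bias:.1%} {stl_bias:.1%} {reb_bias:.1%}")
```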

Blocks

Blue = Average; Green = +/- 1 SD; Red = +/- 2 SD

Relative to the baseline (i.e. points), blocks showed a relative home bias of 12.3% (nearly double that of assists), with a standard deviation of 19.7% (also nearly double that of assists). Blocks were by far the most biased statistic, as well as the most variable.

There were a handful of players who fell more than two SDs below the mean, while Fat Lever (another all-time great name), Alvin Robertson, and John Stockton were all more than two SDs above the mean (on low volume). Both Robertson and Stockton had a relative home bias of nearly 80%, or almost 3.5 SDs above the average! So while the Utah Jazz scorekeepers may not have been padding Stockton’s assists and steals, they almost certainly were boosting his blocks… (Take that, Stockton! I finally got you 🙂)

Interestingly, David Robinson and Tim Duncan, who both played for the San Antonio Spurs for the entirety of their careers, were between one and two SDs above the mean on relatively high volumes! (Alvin Robertson also played five seasons for the Spurs at the beginning of his career.)

Rebounds

Blue = Average; Green = +/- 1 SD; Red = +/- 2 SD

Relative to the baseline (i.e. points), rebounds showed a relative home bias of 1.4% (roughly one-fifth that of assists), with a standard deviation of 4.9% (about half that of assists). In contrast to blocks, rebounds were by far the least biased statistic, as well as the least variable.

Given the lower variability, it’s not too surprising that almost all players fell within two SDs of the mean, with no positive outliers and only three negative outliers (on low volume).

Closing Thoughts

In short, the results confirmed our initial expectations. Blocks (12.3% average relative home bias, 19.7% SD) and assists (6.4% average, 9.6% SD) showed the most evidence of bias, whereas steals (3.2% average, 9.3% SD) and rebounds (1.4% average, 4.9% SD) showed the least.

At first, I was surprised that blocks showed significantly more bias than assists. Conceptually, assists felt like a much more subjective stat to record, but the data seemed to suggest the opposite. However, I soon realized this was because of the “Mutombo problem” of small sample size. Simply put, assists occur far more frequently than blocks in the NBA. While many great players average more than five assists a game over the course of their careers (the truly elite average over eight!), very few ever manage to block more than three shots a game.

It’s not uncommon for point guards like Stockton and Kidd to average fewer than 0.5 blocks per game, and in certain cases, significantly fewer than that (e.g., Steve Nash and Tony Parker averaged fewer than 0.1 blocks per game). Therefore, even if there were the same amount of absolute stat-padding for assists and blocks, the relative impact would be much greater for blocks. That is to say, a scorekeeper giving a player an “extra” assist or two every home game when the player is averaging eight or ten assists is going to have a much smaller impact than gifting an “extra” block every few home games if that player is averaging a measly 0.1 blocks a game.
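A quick back-of-the-envelope comparison (with made-up averages) makes the asymmetry obvious:

```python
# One "extra" assist per home game for a player who averages 9 assists
assist_inflation = (9 + 1) / 9 - 1         # ~11%
# One "extra" block every five home games for a player who averages 0.1 blocks
block_inflation = (0.1 + 1 / 5) / 0.1 - 1  # 200%
print(f"{assist_inflation:.0%} vs. {block_inflation:.0%}")
```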

As always, you can find my work here.

This is Post #21 of the “Fun with Excel” series. For more content like this, please click here.

Rigging Live Draws Part II: The CNN Democratic Debate Draw

A throwback to Part I and why live draws can absolutely be rigged.

When I heard the news that the first day of the Democratic debates on July 30 featured only white candidates and that all of the non-white candidates were scheduled for the second day, I knew something was off (I wasn’t the only one who had suspicions). Admittedly, I hadn’t been following the debates very closely, but my gut told me that even in such a large primary field, there were enough minority candidates that the likelihood of such an outcome happening by pure chance was quite slim.

I decided to get to the bottom of things, despite combinatorics being one of my weakest areas in math growing up. To start, I had to confirm the number of white vs. non-white candidates in the debate field. I quickly found out that there were only 5 non-white candidates: Kamala Harris (black), Cory Booker (black), Julián Castro (Latino), Andrew Yang (Asian), and Tulsi Gabbard (Pacific Islander).

A First Pass

If CNN randomly selected each candidate and their debate day, then we can calculate the total number of ways that 20 candidates can be divided into two groups of 10. Treating the two days as distinct (i.e. having only white candidates on the first day of debates is different from having only white candidates on the second day), there are a total of \binom{20}{10}=184,756 possible combinations. Out of those, there are \binom{15}{10} \times \binom{5}{0}=3,003 ways to choose only white candidates on the first day. Therefore, the probability of featuring only white candidates on the first day is \frac{3,003}{184,756}=1.63\%. Not very likely, eh?
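For anyone who wants to double-check the arithmetic, here’s the same calculation in Python (math.comb requires Python 3.8+):

```python
from math import comb

total = comb(20, 10)                   # ways to pick the 10 day-one debaters
all_white = comb(15, 10) * comb(5, 0)  # day-one slates drawn only from the 15 white candidates
print(total, all_white, f"{all_white / total:.2%}")  # 184756 3003 1.63%
```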

CNN, What Were You Thinking?

Interestingly enough, CNN did NOT use a purely random selection process, instead electing to use a somewhat convoluted three-part draw “to ensure support for the candidates [was] evenly spread across both nights.” The 20 candidates were first ordered based on their rankings in the latest public polling, and then divided into three groups: Top 4 (Biden, Harris, Sanders, Warren), Middle 6 (Booker, Buttigieg, Castro, Klobuchar, O’Rourke, Yang), and Bottom 10 (Bennet, Bullock, de Blasio, Delaney, Gabbard, Gillibrand, Hickenlooper, Inslee, Ryan, Williamson).

The 3 Initial Groups and Final Debate Lineups, in Alphabetical Order

“During each draw, cards with a candidate’s name [were] placed into a dedicated box, while a second box [held] cards printed with the date of each night. For each draw, the anchor [retrieved] a name card from the first box and then [matched] it with a date card from the second box.”

CNN

In other words, CNN performed a random selection within each of the three groups, and the three draws were independent events.

A New Methodology

To calculate our desired probability under the actual CNN methodology, we need to figure out the likelihood of having only white candidates on the first day for each of the three groups. We can then multiply these probabilities together since the events are independent. For the Top 4 (where Harris is the only non-white candidate), there are \binom{4}{2}=6 total combinations, and \binom{3}{2} \times \binom{1}{0}=3 ways to choose only white candidates on the first day. Therefore, the probability of featuring only white candidates on the first day is \frac{3}{6}=50\%.

For the Bottom 10 (where Gabbard is the only non-white candidate), there are \binom{10}{5}=252 total combinations, and \binom{9}{5} \times \binom{1}{0}=126 ways to choose only white candidates on the first day. Therefore, our desired probability is \frac{126}{252}=50\%.

It should make sense that the probability is 50% for both the Top 4 and Bottom 10, precisely because there is exactly one candidate of color in each group. Think about it for a second: in both scenarios, the non-white candidate either ends up debating on the first day or the second day, hence 50%.

The Middle 6 is where it gets interesting. There are exactly 3 white candidates and 3 non-white candidates. This yields \binom{6}{3}=20 total combinations, but only \binom{3}{3} \times \binom{3}{0}=1 way to choose only white candidates on the first day, or a probability of just \frac{1}{20}=5\%.

Since the three draws are independent events, we can simply multiply the probabilities to get to our desired answer: 50\% \times 50\% \times 5\% = 1.25\%. Even lower than the 1.63% from our first calculation!
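The same logic, group by group, in a short verification sketch:

```python
from math import comb

def p_all_white_day_one(white, nonwhite):
    """Probability that a random half of the group contains only white candidates."""
    half = (white + nonwhite) // 2
    return comb(white, half) / comb(white + nonwhite, half)

p_top4 = p_all_white_day_one(3, 1)   # 0.50
p_mid6 = p_all_white_day_one(3, 3)   # 0.05
p_bot10 = p_all_white_day_one(9, 1)  # 0.50
print(f"{p_top4 * p_mid6 * p_bot10:.2%}")  # 1.25%
```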

One More Twist

Even a casual observer may have noticed that although the first day of debates featured an all-white field, Democratic front-runner Joe Biden was drawn on the second day. This conveniently set up what many media outlets touted as a “rematch” with Senator Kamala Harris, with CNN going so far as to compare the match-up to the “Thrilla in Manila” (I wish I were joking).

The probability of having only white candidates on the first day AND Joe Biden on the second day is 16.67\% \times 50\% \times 5\% = 0.42\%. The only difference between this scenario and the previous one is that within the Top 4, there is only one way to draw both Biden and Harris on the second day out of a total of six possible combinations: \frac{1}{6}=16.67\%.

Validating with Monte Carlo

I wasn’t 100% certain about my mathematical calculations at this point, so I decided to verify them using Monte Carlo simulations. Plus, this wouldn’t be a “Fun with Excel” post if we didn’t let Excel do some of the heavy lifting 🙂

I set up a series of random number generators to simulate CNN’s drawing procedure, keeping track of whether Scenario 1 (only white candidates on the first day) or Scenario 2 (only white candidates on the first day AND Joe Biden on the second day) was fulfilled in each case. Excel’s row limit meant I could only run 45,000 draws at a time, so I repeated the exercise 100 times and graphed the results as box and whisker plots below:

Min: 1.14%, Max: 1.36%, Average: 1.26%
Min: 0.34%, Max: 0.52%, Average: 0.42%
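For anyone who would rather skip the spreadsheet gymnastics, here’s a rough Python equivalent of the draw simulation (my own sketch of CNN’s three-box procedure, not the actual Excel model):

```python
import random

TOP4 = ["Biden", "Harris", "Sanders", "Warren"]
MID6 = ["Booker", "Buttigieg", "Castro", "Klobuchar", "O'Rourke", "Yang"]
BOT10 = ["Bennet", "Bullock", "de Blasio", "Delaney", "Gabbard",
         "Gillibrand", "Hickenlooper", "Inslee", "Ryan", "Williamson"]
NON_WHITE = {"Harris", "Booker", "Castro", "Yang", "Gabbard"}

def draw_day_one():
    """Independently draw half of each group onto night one, mirroring the three-box draw."""
    return (set(random.sample(TOP4, 2))
            | set(random.sample(MID6, 3))
            | set(random.sample(BOT10, 5)))

trials, s1, s2 = 100_000, 0, 0
for _ in range(trials):
    day_one = draw_day_one()
    if not day_one & NON_WHITE:        # Scenario 1: all-white night one
        s1 += 1
        if "Biden" not in day_one:     # Scenario 2: ...and Biden on night two
            s2 += 1
print(f"Scenario 1: {s1 / trials:.2%}, Scenario 2: {s2 / trials:.2%}")
```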

The simulations yielded an average of 1.26% for Scenario 1 and 0.42% for Scenario 2, thus corroborating the previously calculated theoretical probabilities of 1.25% and 0.42%.

Accurate Portrayal of My Reaction Whenever One of My Crazy Excel Experiments Ends up Actually Working

Concluding Thoughts

Numbers don’t lie, and they lead me to conclude that the CNN Democratic Debate Draw was not truly random. The million-dollar question, of course, is why? What does CNN gain from having only white candidates on the first day and Joe Biden on the second day (along with all the minority candidates)? As I don’t intend for my blog to be an outlet for my personal political views, I’ll skip the “conspiracy” theories and leave them as an exercise for you, the reader.

As always, you can find my work here.

This is Post #20 of the “Fun with Excel” series. For more content like this, please click here.

Fun with Excel #19 – Defending the World Cup

The World Cup is undoubtedly one of the most prestigious tournaments in all of sports. Although the competition has been held 21 times since its debut in 1930, only eight national teams have won it: Brazil (5 times), Germany (4), Italy (4), Argentina (2), France (2), Uruguay (2), England (1), and Spain (1). Only twice has a World Cup champion successfully defended the title (Italy in 1938 and Brazil in 1962). This is not too surprising, given that the tournament is held once every four years, which can be a lifetime in professional sports.

Summary of World Cup Results, 1930–2018
Points Per Game for Defending Champions (Red Bars = Eliminated in the First Round / Group Stage)

The above charts show the performance of every defending champion since 1930, as well as their average points per game (Win = 3 points, Draw = 1 point, Loss = 0 points). Interestingly, since the World Cup expanded to 32 teams in 1998, the defending champion has lost in the group stage (i.e. failed to reach the knockout stage) in four out of the last six World Cups, including the last three tournaments.

One potential explanation for these early exits is the increase in competition over the last 20 years, both from the larger number of teams participating in the World Cup and from the rise in overall skill levels, which has led to more parity among nations. Even so, in the four most recent instances where the defending champions were eliminated in the group stage (France in 2002, Italy in 2010, Spain in 2014, and Germany in 2018), all four countries entered their respective World Cups ranked in the top 20% of all teams. On top of that, all of them had favorable groups from which they were expected to advance. So which defending champion suffered the worst group stage exit?

A Slight Detour on Methodology

To analyze each team’s performance, I not only examined their win/loss records, but also how they played relative to expectations. I accomplished the latter by comparing each defending champion’s Elo rating to the ratings of all the nations competing in the World Cup. I also compared each team’s Elo to the ratings of the other three nations in its group to determine how difficult it would be for each team to advance from the group stage.

Used widely across sports, board games, and video games, the Elo rating system calculates the relative skill of players (or teams) based on match outcomes.

After every game, the winning player takes points from the losing one. The difference between the ratings of the winner and loser determines the total number of points gained or lost after a game. In a series of games between a high-rated player and a low-rated player, the high-rated player is expected to score more wins. If the high-rated player wins, then only a few rating points will be taken from the low-rated player. However, if the lower-rated player scores an upset win, many rating points will be transferred.

Wikipedia

In soccer, the rating system is further modified to account for the goal difference of the match, such that a 7–1 victory will net more rating points than a 2-1 win. Thus, we expect nations with higher pre-World Cup Elo ratings to perform better than those with lower ratings, which the chart below illustrates.
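For concreteness, here’s a rough sketch of a goal-difference-adjusted Elo update along the lines of the World Football Elo formula. The K-factor of 60 (the usual World Cup weight) and the goal multiplier below are the commonly cited values, not necessarily the exact ones behind the ratings I used, and home advantage is ignored:

```python
def expected_score(rating, opp_rating):
    """Win expectancy of the first team under the standard Elo logistic curve."""
    return 1 / (10 ** ((opp_rating - rating) / 400) + 1)

def elo_change(rating, opp_rating, goals_for, goals_against, k=60):
    """Rating change for one match, scaled by margin of victory."""
    margin = abs(goals_for - goals_against)
    g = 1 if margin <= 1 else 1.5 if margin == 2 else (11 + margin) / 8
    result = 1 if goals_for > goals_against else 0.5 if goals_for == goals_against else 0
    return k * g * (result - expected_score(rating, opp_rating))

# A heavy favorite (made-up ratings) sheds far more Elo in a 0-1 loss than it gains in a 2-1 win
print(round(elo_change(2077, 1800, 0, 1)), round(elo_change(2077, 1800, 2, 1)))  # -50 10
```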

The relationship isn’t perfect, but we can see that teams with higher Pre-World Cup Elo ratings tend to perform better during the tournament

We’re more interested in the outliers on the right side of the chart, so without further ado, here is my ranking of the “worst of the worst” World Cup defenses.

The Hall of Shame

4. Italy (2010): 0W/2D/1L, -1GD

Italy entered the tournament with the sixth highest Elo (1938), 142 above the average of 1796
Italy had the fifth easiest group (out of eight) in the first stage of the tournament

In 2010, Italy (1938 Pre-WC Elo) drew Paraguay 1–1 (-14 Elo), drew New Zealand 1–1 (-24 Elo), and lost to Slovakia 2–3 (-50 Elo) in Group F, for a cumulative loss of 88 Elo. In doing so, it gained the dubious honor of being the only defending champion to be eliminated in the first round twice (1950 was the first time). That said, compared to other early exits, this one was slightly more forgivable. For one, Italy entered the World Cup ranked sixth by Elo, by far the weakest of the four most recent defending champions that failed to advance past the group stage. Italy also had the fifth easiest group (out of the initial eight), the only defending champion to start off in the bottom half of group difficulty.

3. Spain (2014): 1W/0D/2L, -3GD

Spain entered the tournament with the second highest Elo (2109), 267 above the average of 1842
Spain had the third easiest group (out of eight) in the first stage of the tournament

In 2014, Spain (2109 Pre-WC Elo) lost to the Netherlands 1-5 (-75 Elo) in a rematch of the 2010 final, lost to Chile 0-2 (-57 Elo), and beat Australia 3-0 (+16 Elo) in Group B, for a cumulative loss of 116 Elo. Spain entered the World Cup with the second highest Elo overall and played in the third easiest group, but still found themselves mathematically eliminated after only two games, the quickest exit for a defending champion since Italy in the 1950 tournament. Pretty embarrassing, but still not enough to make our Top 2…

2. Germany (2018): 1W/0D/2L, -2GD

Germany entered the tournament with the second highest Elo (2077), 249 above the average of 1828
Germany had the second easiest group (out of eight) in the first stage of the tournament

In 2018, Germany (2077 Pre-WC Elo) lost to Mexico 0-1 (-47 Elo), beat Sweden 2-1 (+14 Elo), and lost to South Korea 2-3 (-80 Elo) in Group F, for a cumulative loss of 113 Elo. For the first time since 1938, Germany did not advance past the first round. Although this remarkable streak was bound to end at some point, almost no one would have thought that 2018 would be the year. After all, Germany entered the World Cup with the second highest Elo and also played in the second easiest group.

Unlike the Spanish team in 2014, which appeared to be on its last legs after a remarkable run from 2008 to 2012 during which it won back-to-back European titles and a World Cup, the German team was seemingly still near the height of its powers. Indeed, their early “exit at group stage was greeted with shock in newspapers around the world,” according to The Guardian.

1. France (2002): 0W/1D/2L, -3GD

France entered the tournament with the highest Elo (2096), 274 above the average of 1822
France had the easiest group (out of eight) in the first stage of the tournament

In 2002, France (2096 Pre-WC Elo) lost to Senegal 0-1 (-54 Elo), drew Uruguay 0-0 (-19 Elo), and lost to Denmark 0-2 (-61 Elo) in Group A, for a cumulative loss of 134 Elo. Shockingly, the French failed to win a single match despite starting the World Cup with the highest Elo and playing in the easiest group. Perhaps more embarrassing, the team bowed out without scoring a single goal, good enough for the worst performance ever by a defending champion.

An Important Caveat

Of course, one should never draw conclusions solely from data, because knowing the context surrounding the data is just as crucial. As Gareth Bland rightly points out in his article detailing the story behind France’s failure at the 2002 World Cup, several factors contributed to the team’s early exit besides mere under-performance:

  1. France’s star player, Zinedine Zidane, regarded as one of the greatest players of all time, injured himself in a friendly less than a week before the team’s first match against Senegal. He returned for France’s third match against Denmark, but was clearly not 100%.
  2. Thierry Henry, considered one of the best strikers to ever play the game, committed a poor challenge in the second match against Uruguay and received a red card. Although France managed to scrape a tie while down one man, Henry was forced to miss the third match because of the red card.
  3. Many members of France’s old guard like Marcel Desailly, Frank Leboeuf, and Youri Djorkaeff were pushing their mid-thirties. Although not old by any stretch of the imagination, they were undoubtedly past their prime as players.
  4. On the other hand, the team’s younger players like Patrick Vieira, Sylvain Wiltord, and Henry found themselves mentally and physically exhausted after a successful but grueling campaign with their domestic club Arsenal.

While none of these reasons should pass as excuses (after all, other teams had to deal with injuries and fatigue as well), this perfect storm of events helps to explain why France so drastically under-performed relative to their Elo rating. As Bland writes, the team’s “return home was not met with disgrace…Rather, it was an acknowledgement that some legs had got tired, while some needed to be moved on, while those of the maestro must just be left to heal.”

Lessons Learned?

One last observation is that none of the four defending champions won their opening matches (Italy drew, and the other three lost). With every match being so critical to advancing, a poor start likely put a tremendous amount of pressure on the defending champions and affected their remaining two matches. Perhaps the defending champions failed because of their relatively easy groups, which led them to become complacent going into the first match. In that case, the biggest takeaway is to not be overconfident, advice that I hope team France will heed going into Qatar 2022.

As always, you can find my work here.

Fun with Excel #18 – The Birthday Problem

Meeting someone with the same birthday as you always seems like a happy coincidence. After all, with 365 (366 including February 29th) unique birthdays, the chances of any two people being born on the same day appear to be small. While this is indeed true for any two individuals picked at random, what happens when we add a third, a fourth, or a fifth person into the fray? At what point does it become inevitable that some pair of people will share a birthday?

Of course, the only way to guarantee a shared birthday among a group of people is to have at least 367 people in a room. Take a moment to think about that statement, and if you’re still stumped, read this. Now that we know 100% probability is reached with 367 people, how many people would it take to reach 50%, 75%, or 90% probability? If you think the answer is 184, 275, and 330, then you would be quite wrong. Here’s why:

Let’s assume that all birthdays are equally likely to occur in a given population and that leap years are ignored. To paint a more vivid picture in our minds, let’s further assume that we have a large room and that people enter the room one at a time while announcing their birthdays for everyone to hear. The first person enters the room and announces that his/her birthday is January 1st (we can choose any date we want without loss of generality). The second person has a 364/365 probability of having a different birthday from the first person, and therefore a 1 - 364/365 probability of sharing a birthday with him/her. The probability that the first three people all have different birthdays is (364/365) \times (363/365), so the probability that at least two of them share a birthday is 1 - (364/365) \times (363/365). Likewise, the probability that the first four people all have different birthdays is (364/365) \times (363/365) \times (362/365), and the probability that at least two of them share a birthday is 1 - (364/365) \times (363/365) \times (362/365). Generalizing, the probability that at least one pair among the first n people shares a birthday is:

P(n) = 1- \frac{364}{365} \times \frac{363}{365} \times \frac{362}{365} \times \cdots \times \frac{365-n+1}{365}

Note that the yellow series in the above graph climbs much faster than a straight line would suggest, with the probability reaching 50% at just 23 people. 75% and 90% probability are reached at 32 and 41 people, respectively. By the time 70 people are in the room, there is a greater than 99.9% chance that two individuals will have the same birthday!
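These thresholds are easy to verify directly from the formula (a quick sketch):

```python
def p_shared(n):
    """Probability that at least one pair among n people shares a birthday (365-day year)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

for target in (0.50, 0.75, 0.90, 0.999):
    n = next(n for n in range(1, 400) if p_shared(n) >= target)
    print(f"{target:.1%} reached at {n} people")  # 23, 32, 41, 70
```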

As the number of people increases, the incremental probability contributed by each additional person eventually starts to shrink, with each new arrival adding less than the previous one. Interestingly, the 20th person provides the greatest incremental probability, as seen in the above table.

In contrast, the probability that at least one of n people has a specific birthday (say, yours) is given by the much simpler equation:

P_1(n) = 1 - \left( \frac{364}{365} \right)^n

This relationship, which is highlighted by the green series in the graph, grows at a much slower rate than the yellow series. In comparison, it takes 253 people for P_1(n) to exceed 50%.
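And the matching check for the specific-birthday curve:

```python
def p_specific(n):
    """Probability that at least one of n people has a given fixed birthday."""
    return 1 - (364 / 365) ** n

print(next(n for n in range(1, 1000) if p_specific(n) > 0.5))  # 253
```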

Testing Our Assumptions

One key assumption we made in the above exercise was that all birthdays (aside from February 29th) occur with equal probability. But how correct is that assumption? Luckily, Roy Murphy has run the analysis based on birthdays retrieved from over 480,000 life insurance applications. I won’t repeat verbatim the contents of his short and excellent article, but I did re-create some charts showing the expected and actual distribution of birthdays. The bottom line is that the actual data show more variation (including very apparent seasonal variation by month) than what is expected through chance.

Implications for Birthday Matching

Now that we know that birthdays in reality are unevenly distributed, it follows that matches should occur more frequently than we expect. To test this hypothesis, I ran two Monte Carlo Simulations with 1,000 trials each to test the minimum number of people required to get to a matching birthday: the first based on expected probabilities (each birthday with equal likelihood of 1/365.25 and February 29th with likelihood of 1/1461) and the second based on actual probabilities (sourced from the Murphy data set).
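Here’s roughly what one trial of the expected-probability simulation looks like in Python (the actual-probability version would simply swap in the empirical weights from the Murphy data set, which I won’t reproduce here):

```python
import random

# 365 regular days at 1/365.25 each, plus February 29th at 1/1461
DAYS = list(range(366))
WEIGHTS = [1 / 365.25] * 365 + [1 / 1461]

def people_until_match():
    """Add people one at a time and return the head count when a birthday first repeats."""
    seen, count = set(), 0
    while True:
        count += 1
        bday = random.choices(DAYS, weights=WEIGHTS)[0]
        if bday in seen:
            return count
        seen.add(bday)

trials = [people_until_match() for _ in range(1000)]
print(sum(trials) / len(trials))  # ~25 people on average
```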

Note that the distributions of both simulations are skewed positively (i.e. to the right). The results appear to corroborate our hypothesis, as evidenced by the gray line growing at a faster rate than the yellow line in the above graph. Indeed, the average number of people required for a birthday match is 24.83 under the simulation using actual probabilities, slightly lower than the 25.06 using expected probabilities. However, the difference is not very significant; therefore, our assumption of uniformly distributed birthdays works just fine.

As always, you can find my work here.

Fun with Excel #17 (ft. Python!) – The Beauty of Convergence

Part I: Everything is Four

A friend recently told me that “everything is four.” No, he wasn’t talking about the Jason Derulo album.

Suppose you start with any number. As an example, I’ll pick 17, which yields the following sequence: 17 -> 9 -> 4. Here’s a slightly more convoluted one: 23 -> 11 -> 6 -> 3 -> 5 -> 4. Figured out the pattern yet?

The Answer: for any number you choose, count the number of characters in its written representation. In our first example, the word “seventeen” has 9 characters, and “nine” has 4 characters. In our second example, “twenty three” has 11 characters (not counting spaces), “eleven” has 6 characters, “six” has 3 characters, “three” has 5 characters, and “five” has 4 characters. Try this with any number (or any word, in fact) and you’ll always end up with the number 4 eventually. This is because “four” is the only number in English that has the same number of characters as the value of the number (4) itself.
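The iteration itself fits in a few lines of Python. This is a minimal sketch that leans on the third-party num2words package for the spelling (note that num2words inserts an “and” into some larger numbers, so its letter counts can differ slightly from the convention above); it’s not necessarily how my own script does it:

```python
from num2words import num2words  # pip install num2words

def letter_count(n):
    """Number of letters in the English spelling of n (spaces and hyphens ignored)."""
    return sum(ch.isalpha() for ch in num2words(n))

def stopping_time(n):
    """Number of iterations needed to reach 4."""
    steps = 0
    while n != 4:
        n = letter_count(n)
        steps += 1
    return steps

print(stopping_time(17), stopping_time(23))  # 2 5, matching the sequences above
```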

This occurrence is not unique to the English language: Dutch, Estonian, German, and Hebrew also “converge” to 4, while Croatian, Czech, and Italian converge to 3. Other languages may end up converging to more than one number, or end up in an infinite loop involving two or more numbers.

I decided to explore this phenomenon in more detail by examining how the series converges over a large set of natural numbers (1 through 10,000). Since Excel is unable to graph large data sets efficiently, I needed to relearn some basic Python to help me with this particular project. This was going to be fun…

Me Programming in a Nutshell

Anyway, the important takeaway from all of this is that I eventually succeeded. Look at this beautiful wedge:

In this chart, the x-axis represents the number of iterations or steps in a particular series (you can think of it as “time”), while the y-axis represents the value itself. Every point from (0, 1) to (0, 10000) is on the chart, and they are the starting points for the first 10,000 natural numbers. For example, going from the point (0, 17) to the point (1, 9) is one iteration. The points (0, 17), (1, 9), and (2, 4) represent a series (for the starting integer 17), and every series terminates once 4 is reached (in this case, 2 iterations/steps are required to reach 4).

With that explanation out of the way, there are a few observations we can make about the above chart:

  • Convergence occurs in a fairly uniform manner. On average, the convergence is relatively well-behaved and progresses in a decreasing fashion. Note that I say “on average,” because we know that 1, 2, and 3 are the only natural numbers whose first iterations lead to a larger number (1 -> 3,  2 -> 3, and 3 -> 5). However, every natural number greater than 4 will lead to a smaller number, and given how the English language works…
  • Convergence occurs very quickly. English is pretty efficient when it comes to representing numbers in written form. Larger numbers will in general require more characters, but not that many more. For example, among the first 10,000 natural numbers, the “longest” ones only require 37 characters (e.g. 8,878 being one of them). This leads us to our second chart…

This shows the sum of all 10,000 series over time. At Time 0, the sum is 50,005,000, but that drops to 292,407 after just one iteration. After Time 6, each of the first 10,000 series will have converged to 4 and terminated. If we define “stopping time” as the number of iterations/steps it takes for a series to reach 4, then the stopping times of the first 10,000 natural numbers are shown below (along with a histogram of how often each stopping time occurs):

The average stopping time is 4.4, with a standard deviation of 1.2. Furthermore, the vast majority of stopping times are 3, 5, or 6. Note the interesting pattern in the first chart formed by the series that have a stopping time of 4.

Part II: The Collatz Conjecture

Truth be told, “Everything is Four” feels a bit gimmicky for a mathematical rule (well, because it technically isn’t one), so here’s a rule that is better defined: start with any positive integer n. If n is even, divide it by 2 (n / 2). If n is odd, multiply it by 3 and add 1 (3n + 1). Repeat this indefinitely.

The Collatz Conjecture states that for any given n, the sequence you get from following the above rules will eventually converge to 1. Note that it’s called a conjecture, meaning that despite being proposed in 1937, it remains unsolved!
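The rule translates almost verbatim into code (a quick sketch):

```python
def collatz_stopping_time(n):
    """Number of steps for the 3n + 1 iteration to reach 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

print(collatz_stopping_time(27))    # 111 steps for a famously stubborn small number
print(collatz_stopping_time(6171))  # 261 steps, matching the example discussed below
```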

Now, let’s see how the Collatz sequence converges over the first 10,000 natural numbers. I’ve included 3 charts to show how things look at different “zoom” levels.

A few observations:

  • Convergence does not occur in a uniform manner. The chart almost looks like many different seismographs stacked on top of each other. While each series does eventually reach 1, how it gets there appears to be somewhat random and not well-defined at all, filled with both sudden spikes and collapses over time. In any event, it’s a heck of a lot more interesting than our first sequence. Moreover, relative to our first sequence…
  • Convergence takes a long time. Recall that in our first sequence, no series among the first 10,000 natural numbers lasted for more than 6 iterations before converging to 4. In the Collatz sequence, however, the number 6,171 takes 261 iterations to reach 1.

This chart shows the sum of all 10,000 series over time. At Time 0, the sum is 50,005,000, but that spikes to 87,507,496 at Time 1 before dropping to 59,381,244 at Time 3. The beginning of the chart looks a bit like a failing cardiogram, and things quickly get weird after that. Of course, the sum eventually reaches 0, but the way it decays appears random. The stopping times of the first 10,000 natural numbers are shown below:

Wow! The first chart is really something, isn’t it? It certainly doesn’t appear to be 100% random, and the fact that there seems to be some structure to the stopping times of the Collatz sequence could be a sign that a proof for the Collatz Conjecture can eventually be found. The average stopping time is 85.0, with a standard deviation of 46.6. Moreover, with a median of 73 and a mode of 52, the distribution of the stopping times appears to be right-skewed. An examination of the same chart featuring the first 100 million natural numbers confirms this.

Anyway, there wasn’t really a point to this post other than to show that there is often a lot of beauty hidden under the surface of mathematics.

Let me know if you would like to see more posts like this!

You can find my backups for both the data and the Python code here.