Tuesday, January 17, 2017

GitHub Growth Appears Scale Free

Update of Thursday, August 17, 2017: It looks like we can chalk another one up for the scale-free model (described below), as GitHub has apparently surpassed 20 million users. Outgoing CEO Wanstrath mentioned this number in an emailed statement to Business Insider:
"As GitHub approaches 700 employees, with more than $200M in ARR, accelerating growth, and more than 20 million registered users, I'm confident that this is the moment to find a new CEO to lead us into the next stage of growth. ....."

The Original Analysis

In 2013, a RedMonk blogger claimed that the growth of GitHub (GH) users follows a certain type of diffusion model called Bass diffusion. Here, growth refers to the number of unique user IDs as a function of time, not the number of project repositories, which can have a high degree of multiplicity.

In response, I tweeted a plot that suggested GH growth might be following a power law, aka scale-free growth. The tell-tale sign is the asymptotic linearity of the growth data on double-log axes, which the original blog post did not discuss. The periods on the x-axis correspond to years, with the first period representing calendar year 2008 and the fifth period representing calendar year 2012.
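
The double-log diagnostic can be sketched in a few lines of R. The data here are synthetic, generated from an assumed power law with made-up values for the prefactor `a` and the exponent `b`; they are not the GH user counts, and all variable names are illustrative only. The point is simply that a power law y = a * t^b becomes a straight line of slope b on double-log axes, so an ordinary linear fit recovers the exponent.

```r
# Synthetic illustration only: NOT the GitHub data.
# Assumed (made-up) power law parameters.
a <- 50000   # hypothetical prefactor
b <- 2.5     # hypothetical exponent

t <- 1:10            # periods (e.g., years)
y <- a * t^b         # exact power law growth

# On double-log axes a power law is linear:
#   log10(y) = log10(a) + b * log10(t)
fit <- lm(log10(y) ~ log10(t))
intercept <- unname(coef(fit)[1])   # estimates log10(a)
slope     <- unname(coef(fit)[2])   # estimates the exponent b

# Visual check: the points fall on a straight line
plot(t, y, log = "xy", xlab = "Period", ylab = "y (synthetic)")
curve(10^intercept * x^slope, from = 1, to = 10, add = TRUE, col = "blue")
```

An exponential, by contrast, curves upward on the same axes, which is what makes the double-log plot a useful discriminator.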

Scale-free networks can arise from preferential attachment to super-nodes that have a higher vertex degree and therefore more connections to other nodes, i.e., a kind of rich-get-richer effect. The same can be said of GH growth when GH is viewed as a particular kind of social network: the interaction between software developers using GH can be thought of as involving super-nodes that correspond to influential users who attract prospective GH users to open a new account and contribute to their projects.
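
To make the rich-get-richer mechanism concrete, here is a toy simulation of preferential attachment in R. It is a minimal sketch under simple assumptions (each newcomer makes exactly one link, and the network size `n` and the random seed are arbitrary); it is not a model fitted to GH data, and the variable names are illustrative.

```r
# Toy preferential attachment simulation (illustrative; not fitted to GH data).
set.seed(42)          # arbitrary seed, for reproducibility
n <- 2000             # hypothetical number of nodes (users)

# 'ends' lists both endpoints of every edge, so a node of degree k appears
# k times. Drawing uniformly from it therefore selects a node with
# probability proportional to its degree.
ends <- integer(2 * (n - 1))
ends[1:2] <- c(1L, 2L)          # start with one edge between nodes 1 and 2
len <- 2L

for (node in 3:n) {
  target <- ends[sample.int(len, 1)]        # rich-get-richer choice
  ends[(len + 1L):(len + 2L)] <- c(node, target)
  len <- len + 2L
}

deg <- tabulate(ends, nbins = n)            # vertex degree of each node

# A few super-nodes collect many links while the typical node has very few
print(summary(deg))
print(head(sort(deg, decreasing = TRUE)))
```

Sampling from the list of edge endpoints is a common shortcut for degree-proportional selection, since it avoids recomputing the attachment probabilities at every step.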

On this basis, I predicted GH would reach 4 million users during October 2013 and 5 million users during April 2014 (yellow points in the Linear axes plot below). In fact, GH reached those values slightly earlier than predicted by the power law model, and slightly later than the dates predicted by the diffusion model (modulo unreported errors in the data).

Since 2013, new data have been reported, so I have extended my previous analysis. Details of the respective models are contained in the R script at the end of this post. In the Linear axes plot below, the diffusion model and the power law model essentially form an envelope around the newer data: diffusive on the upper side (red curve) and power law on the lower side (blue curve). In this sense, it could be argued that the jury is still out on which model offers the more reliable predictions.

However, there is an aspect of the diffusion model that was overlooked in 2013. It predicts that GH growth will eventually plateau at 20 million users in 2020 (the 12th period, not shown) because it is a type of logistic function that has a characteristic sigmoidal or 'S' shape. The beginning of this leveling off (the top of the 'S') is apparent in the 10th period (i.e., 2017). By contrast, the power law model predicts that GH will reach 23.65 million users by the end of 2017 (yellow point). Whereas the two curves envelop the more recent data in periods 6–9, they start to diverge significantly in the 10th period.
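
The saturation can be checked numerically. The sketch below evaluates the Bass cumulative adoption curve using the parameter values that appear in the plotting script at the end of this post (ceiling 21e6, which is close to the 20 million quoted above, with coefficients 0.003 and 0.83). The function name `bass` and the use of whole periods for `t` are assumptions made here for illustration.

```r
# Bass diffusion: cumulative adopters at time t (in periods).
# Parameter values are those in the plotting script; 'bass' is an
# illustrative name, not part of the original script.
bass <- function(t, m = 21e6, p = 0.003, q = 0.83) {
  e <- exp(-(p + q) * t)
  m * (1 - e) / (1 + (q / p) * e)
}

periods <- 1:15
users   <- bass(periods)

# Fraction of the ceiling m reached in each period:
# it approaches 1 as the 'S' flattens out
print(round(users / 21e6, 3))

# Increments between successive periods (in millions):
# they shrink once the curve is past its inflection point
print(round(diff(users) / 1e6, 2))
```

Because the curve is bounded above by m, no amount of extra time takes the diffusion model past its ceiling, whereas a power law has no such bound.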

"GitHub is not the only player in the market. Other companies like GitLab are doing a good job but GitHub has a huge head start and the advantage of the network effect around public repositories. Although GitHub’s network effect is weaker compared to the likes of Facebook/Twitter or Lyft/Uber, they are the default choice right now."  —GitHub is Doing Much Better Than Bloomberg Thinks
Although there will inevitably be an equilibrium bound on the number of active GH users, it seems unlikely to be as small as 20 million, given the combination of GH's first-mover advantage and its current popularity. Presumably the private investors in GH also hope it will be a large number. This year will tell.

# Data source ... https://classic.scraperwiki.com/scrapers/github_users_each_year/

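# Linear axes plot
# Note: df.gh3 (the data frame) and the fitted objects gh.exp and gh.fit are
# used below, but their definitions are not included in this listing.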
plot(df.gh3$index, df.gh3$users, xlab="Period (years)", 
     ylab="Users (million)", col="gray", 
     ylim=c(0, 3e7), xaxt="n", yaxt="n")
axis(side=1, tck=1, at=c(0, seq(12,120,12)), labels=0:10, 
     col.ticks="lightgray", lty="dotted")
axis(side=2, tck=1, at=c(0, 10e6, 20e6, 30e6), labels=c(0,10,20,30), 
     col.ticks="lightgray", lty="dotted")

# Simple exp model
curve(coef(gh.exp)[2] * exp(coef(gh.exp)[1] * (x/13)), 
      from=1, to=108, add=TRUE, col="red2", lty="dotdash")

# Super-exp model 
curve(49100 * (x/13) * exp(0.54 * (x/13)), 
      from=1, to=120, add=TRUE, col="red", lty="dashed")

# Bass diffusion model
curve(21e6 * ( 1 - exp(-(0.003 + 0.83) * (x/13)) ) / ( 1 + (0.83 / 0.003) * exp(-(0.003 + 0.83) * (x/13)) ), 
      from=1, to=120, add=TRUE, col="red")

# Power law model
curve(10^coef(gh.fit)[2] * (x/13)^coef(gh.fit)[1], from=1, to=120, add=TRUE, 
      col="blue")

title(main="Linear axes: GitHub Growth 2008-2017")
# The legend position is missing from the original listing; "topleft" is an assumption
legend("topleft", 
       legend=c("Original data", "New data", "Predictions", "Exponential", "Super exp", "Bass diffusion", "Scale free"), 
       lty=c(NA,NA,NA,4,2,1,1), pch=c(1,19,21,NA,NA,NA,NA), 
       col=c("gray", "black", "yellow", "red", "red", "red", "blue"), 
       pt.bg=c(NA,NA,"yellow",NA,NA,NA,NA),
       cex=0.75, inset=0.05)
