Friday 12:35–13:05 in Track 3

Analyzing GitHub, how developers change programming languages over time

Waren Long

Audience level:
Novice

Description

After years with a favorite programming language, you might want, or be required, to switch to another one. From GitHub repositories, we build a transition matrix between languages by solving a flow optimization problem. The results reflect the history of programming language competition in the open source world.

Abstract

Have you ever struggled with yet another obscure project, thinking: "I could do the job with this language, but why not switch to another one that would be more enjoyable to work with?" In his awesome blog post, The eigenvector of "Why we moved from language X to language Y", Erik Bernhardsson generated an N×N contingency table of Google queries about changing languages. However, when I read it, I couldn't help wondering what proportion of people actually switched. So it seemed worthwhile to dig deeper into this idea and see how the popularity of languages changes among GitHub users.

Dataset

Thanks to our data retrieval pipeline, source{d} opened the dataset that contains the yearly numbers of bytes coded by each GitHub user in each programming language. In a few figures, it is:

To keep languages comparable, we chose to focus on the 25 main programming languages used on GitHub. However, we did not include JavaScript: 40% of the GitHub users we analyzed had JS in their profiles, which made the proposed transition model useless.

Finally, after filtering and quantizing the dataset, we can proceed with building our transition matrix.

Transition matrix

It is convenient to see every GitHub user as a sequence of annual vectors where the elements stand for the quantity of source code in each language. Then, building the transition matrix can be summarized as comparing vectors side by side.

An elegant approach to this problem, efficient in both coding effort and computation time, is offered by PyEMD, a NumPy-friendly Python wrapper for the Earth Mover's Distance. This distance measure, better suited than the Euclidean distance to histogram comparison, is particularly interesting because it is based on Linear Programming (LP). Indeed, it can be seen as the solution of the famous transportation problem.
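For reference, the transportation problem behind the Earth Mover's Distance can be written as the linear program below. The symbols are notation chosen here, not taken from the talk: p and q are two normalized yearly language histograms of the same user, d_ij is the ground distance between languages i and j, and f_ij is the amount of code "moved" from language i to language j.

```latex
\begin{aligned}
\min_{f}\quad & \sum_{i=1}^{N}\sum_{j=1}^{N} f_{ij}\, d_{ij} \\
\text{subject to}\quad
  & \sum_{j=1}^{N} f_{ij} = p_i \quad (i = 1,\dots,N), \\
  & \sum_{i=1}^{N} f_{ij} = q_j \quad (j = 1,\dots,N), \\
  & f_{ij} \ge 0 .
\end{aligned}
```

When p and q both sum to 1, the minimum cost is the distance itself, and the minimizing matrix f is the flow between the two years.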

After summing the flow matrices over users and over the last 16 years, we obtain one global transition matrix, which already contains much information about GitHub users' language profiles. However, to make the results clearer, we need to go one step further.
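The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the pipeline used for the talk: each yearly vector is normalized to sum to 1, and the ground distance is assumed to be 1 between any two distinct languages and 0 on the diagonal. Under that particular distance the transportation problem has a closed-form optimum, so no LP solver is needed here; a general distance matrix would require a solver such as the one PyEMD wraps. The function names and the toy data are hypothetical.

```python
def normalize(vec):
    """Scale a yearly byte-count vector so it sums to 1 (assumes a non-empty year)."""
    total = sum(vec)
    return [v / total for v in vec]


def unit_distance_flow(before, after):
    """Optimal transport flow between two normalized histograms.

    Assumption: moving code between two distinct languages costs 1,
    staying in the same language costs 0. The cheapest plan then keeps
    min(before[i], after[i]) in place and sends each language's surplus
    to the languages that gained share (here, proportionally to the gain).
    """
    n = len(before)
    flow = [[0.0] * n for _ in range(n)]
    surplus = [max(a - b, 0.0) for a, b in zip(before, after)]
    deficit = [max(b - a, 0.0) for a, b in zip(before, after)]
    moved = sum(surplus)  # equals sum(deficit) for normalized inputs
    for i in range(n):
        flow[i][i] = min(before[i], after[i])
        if moved > 0:
            for j in range(n):
                if i != j:
                    flow[i][j] = surplus[i] * deficit[j] / moved
    return flow


def global_transitions(users):
    """Sum the year-to-year flow matrices over all users and all years."""
    n = len(users[0][0])
    total = [[0.0] * n for _ in range(n)]
    for years in users:
        # Compare each annual vector with the following one.
        for prev, cur in zip(years, years[1:]):
            flow = unit_distance_flow(normalize(prev), normalize(cur))
            for i in range(n):
                for j in range(n):
                    total[i][j] += flow[i][j]
    return total


# Hypothetical toy data: two users, three languages, bytes written per year.
users = [
    [[100, 0, 0], [50, 50, 0]],
    [[0, 80, 20], [0, 40, 60], [0, 0, 100]],
]
transitions = global_transitions(users)
```

Entry `transitions[i][j]` accumulates how much of the users' normalized activity moved from language i to language j; the diagonal holds the share that stayed put.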

GitHub "LanguageRank"

Now that we have our flow matrix, we want to know which languages are the most and the least popular. It is possible to calculate centrality measures on the graph it represents, e.g. the eigenvector centrality. Indeed, these measures convey the relative popularity of languages in the sense of how likely people coding in one language are to switch to another.

To follow this approach, the transition matrix must first be made well conditioned, that is, stochastic, irreducible and aperiodic, for example by applying the famous trick used for the Google PageRank matrix.
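A minimal sketch of that PageRank-style transformation follows. It assumes a row convention (row i describes moves out of language i) and a damping factor of 0.85, the value conventionally quoted for PageRank; the talk does not state which value was used, and the function name and input matrix are hypothetical.

```python
def google_matrix(flow, damping=0.85):
    """Turn a non-negative flow matrix into a PageRank-style transition matrix.

    1. Normalize each row so it sums to 1 (stochastic).
    2. Replace all-zero rows by a uniform row (no dead ends).
    3. Mix with a uniform jump so every entry is strictly positive,
       which makes the chain irreducible and aperiodic.
    """
    n = len(flow)
    result = []
    for row in flow:
        total = sum(row)
        if total > 0:
            probs = [v / total for v in row]
        else:
            probs = [1.0 / n] * n
        result.append([damping * p + (1.0 - damping) / n for p in probs])
    return result


# Hypothetical 3-language flow matrix.
flow = [
    [0.5, 0.5, 0.0],
    [0.0, 0.4, 0.8],
    [0.0, 0.0, 0.8],
]
G = google_matrix(flow)
```

Because every entry of `G` is positive, the corresponding Markov chain has a unique stationary distribution, which is what the next step computes.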

Power Iteration

After these steps, our well-conditioned flow matrix contains an approximation of the probabilities of switching between languages, and we can proceed with the power iteration. This algorithm consists of repeatedly multiplying an initially random vector by the matrix until it converges to the dominant eigenvector.
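The power iteration itself fits in a short function. The sketch below assumes a row-stochastic matrix P, so the stationary distribution is the left eigenvector satisfying pi = pi P; the tolerance, iteration cap, seed and the 2-state example are illustrative choices, not values from the talk.

```python
import random


def power_iteration(P, tol=1e-12, max_iter=10_000, seed=0):
    """Approximate the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    rng = random.Random(seed)
    # Start from a random probability vector.
    v = [rng.random() for _ in range(n)]
    s = sum(v)
    v = [x / s for x in v]
    for _ in range(max_iter):
        # One step of the chain: next[j] = sum_i v[i] * P[i][j].
        nxt = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        s = sum(nxt)
        nxt = [x / s for x in nxt]  # renormalize to limit rounding drift
        if max(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Hypothetical 2-state chain whose stationary distribution is (5/6, 1/6).
P = [
    [0.9, 0.1],
    [0.5, 0.5],
]
pi = power_iteration(P)
```

Changing `seed` changes the starting vector but not the result, which illustrates that the stationary distribution does not depend on the initial one.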

Most popular languages on GitHub

At last! Here is the reward: the stationary distribution of our Markov chain. This probability distribution is independent of the initial distribution. It describes the stable state of the process of random switching between languages. Thus, no matter how popular the languages are at the present time, the hypothetical future stationary state stays the same. Here are the 10 most popular languages used on GitHub among our sample of 25 programming languages:

     Language       Popularity (%)   Source code (%)
 1.  Python         16               11.2
 2.  Java           15.3             16.6
 3.  C              9.1              17.1
 4.  C++            9                12.6
 5.  PHP            8.3              24.3
 6.  Ruby           8.1              2.6
 7.  C#             6                6.5
 8.  Objective-C    3.9              3.2
 9.  Go             3.1              0.7
10.  Swift          2.5              0.4

Python (16 %) appears to be the most attractive language, followed closely by Java (15.3 %). It's especially interesting since only 11.2 % of all source code on GitHub is written in Python (still among the 25 initial languages).

Then, when we sort this transition matrix by this dominant eigenvector, the results become even more interesting. Here are a few of them:

Finally, it is also possible to look at these dominant eigenvectors sequentially. This leads to striking conclusions about how the programming language competition has evolved since the early 2000s and what the current trends are.

Conclusion

It would be more appropriate to see Erik's contingency table as a kind of second derivative of the language distribution problem, while our flow transitions are like the first derivative. That is, first you google, then you try to write an OSS project, and finally the language distribution changes.

You can find all the matrices, the visuals, and further details about this approach in the dedicated blog post.
