In software engineering, “risk” can mean a lot of things. When it comes to code risk, we mean the likelihood that a given commit will add to your technical debt or increase the chance of defects. With code risk, we’re essentially asking: What’s the probability that a new bit of code will saddle the organization with more work in the future?

Note that code risk and code quality are different things. With code risk, we aren’t evaluating quality indicators such as complexity, portability, or testability. It’s true, however, that code quality can be a variable in code risk.

How we evaluate code risk

The first step in evaluating code risk is to determine which commits relate to fixes. To do this, we use natural language processing (NLP) combined with topic modeling to analyze commit messages and decide whether each commit is fix-related. A number of validation steps follow, verifying the model’s accuracy and confirming that commits flagged as “fixes” are indeed fixes.
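
Pinpoint hasn’t published the classifier itself, but a minimal sketch of the idea, here using scikit-learn’s LDA with an assumed seed vocabulary of fix-related terms, might look like this:

```python
# Minimal sketch of fix-commit identification via topic modeling.
# The actual model is not public; the thresholds and FIX_VOCAB seed
# terms below are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

FIX_VOCAB = {"fix", "bug", "patch", "resolve", "hotfix", "defect"}

def flag_fix_commits(messages, n_topics=10, top_n=10):
    """Label each commit message 1 (fix-related) or 0 (other)."""
    vec = CountVectorizer(stop_words="english", max_features=5000)
    counts = vec.fit_transform(messages)

    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-document topic weights

    # Call a topic "fix-related" if its top words overlap the seed vocabulary.
    vocab = vec.get_feature_names_out()
    fix_topics = set()
    for k, weights in enumerate(lda.components_):
        top_words = {vocab[i] for i in weights.argsort()[-top_n:]}
        if top_words & FIX_VOCAB:
            fix_topics.add(k)

    # Flag commits whose dominant topic is fix-related.
    return [1 if doc_topics[i].argmax() in fix_topics else 0
            for i in range(len(messages))]
```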

With a hardened fix-identification model, we then trace each fix back to its origin. For each line of code changed to fix a bug, we identify the prior commit(s) that last touched that same line. This information, combined with our fix data, allows the model to spot the commits that introduced a bug. We fold this subset of commits back into our full dataset, labeling them as bug (1) and all other commits as other (0).
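
This trace-back mirrors the SZZ approach from the defect-prediction literature. A minimal sketch using plain git commands (the helper names and the simplified hunk parsing are assumptions, not Pinpoint’s implementation):

```python
# SZZ-style sketch: for each line a fix commit deletes or rewrites,
# blame the fix's parent to find the commit that last touched that line.
import re
import subprocess

def changed_lines(repo, fix_sha, path):
    """Line numbers (in the parent version) removed or changed by the fix."""
    diff = subprocess.check_output(
        ["git", "-C", repo, "diff", "-U0", f"{fix_sha}^", fix_sha, "--", path],
        text=True)
    lines = []
    # Hunk headers look like "@@ -start,count +start,count @@";
    # ",count" is omitted when the count is 1.
    for m in re.finditer(r"^@@ -(\d+)(?:,(\d+))? \+", diff, re.M):
        start, count = int(m.group(1)), int(m.group(2) or 1)
        lines.extend(range(start, start + count))
    return lines

def bug_introducing_commits(repo, fix_sha, path):
    """Blame the parent revision to find which commits last wrote those lines."""
    culprits = set()
    for line in changed_lines(repo, fix_sha, path):
        blame = subprocess.check_output(
            ["git", "-C", repo, "blame", "-L", f"{line},{line}",
             "--porcelain", f"{fix_sha}^", "--", path],
            text=True)
        culprits.add(blame.split()[0])  # first token is the blamed SHA
    return culprits
```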

With our full commit dataset, we then layer in other indicators that are useful for predicting code risk. These include demographic signals, like the person or team making the commit, as well as a large number of more granular inputs, including programming language, file age, and linked issue data. In sum, there are more than 80 different inputs to the model. By analyzing hundreds of thousands of buggy commits and comparing them with bug-free commits, the model uncovers interesting patterns, finding seemingly unrelated variables that nonetheless correlate positively with code risk.
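
The full feature set isn’t public, but the shape of the training data might look something like this; the fields below are assumed examples, not Pinpoint’s actual schema:

```python
# One illustrative feature row per commit; the real model uses 80+ inputs.
import pandas as pd

rows = [{
    "author_id":     "dev_42",     # demographic signal: who committed
    "team_id":       "platform",   # ...and on which team
    "language":      "java",       # programming language of touched files
    "file_age_days": 412,          # age of the oldest file touched
    "linked_issue":  1,            # commit references an issue-tracker ID
    "lines_changed": 87,
    "files_touched": 5,
    "label":         0,            # 1 = bug-introducing, 0 = other
}]
df = pd.DataFrame(rows)
```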

For model training and building, we use H2O’s open source machine learning platform. Within this framework we identify the label we’d like to predict, choose models and define grid search parameters, review and assess model metrics and accuracy, and deploy our machine learning model as a MOJO (Model Object, Optimized) that lets us embed prediction within our Kafka framework. Because the model is embedded, we can score in real time, making risk predictions as commits are made.
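
A minimal sketch of that workflow with H2O’s Python API (the input file, hyperparameter values, and output path are assumptions):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

# Hypothetical export of the per-commit features sketched above.
frame = h2o.import_file("commit_features.csv")
frame["label"] = frame["label"].asfactor()  # binary target: bug (1) vs. other (0)

# Grid-search a gradient boosting model over a small hyperparameter space.
grid = H2OGridSearch(
    model=H2OGradientBoostingEstimator,
    hyper_params={"max_depth": [3, 5, 7], "ntrees": [100, 300]},
)
grid.train(y="label", training_frame=frame)

# Review metrics, then keep the best model by AUC.
best = grid.get_grid(sort_by="auc", decreasing=True).models[0]
print(best.model_performance())

# Export a MOJO for embedded scoring (e.g. in a Kafka consumer).
best.download_mojo(path="/tmp/code_risk_model")
```

A MOJO is a self-contained scoring artifact: a JVM service, such as a Kafka consumer, can load it with the h2o-genmodel library and make predictions without a running H2O cluster.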

The value of knowing your code risk

Technical debt is hard to quantify, hard to prevent, and even harder to prioritize for repayment. With code risk, Pinpoint provides a way to understand not only which repos carry a higher debt load, but also the characteristics of the commits that contribute to that debt. It’s a way of heading off, or at least curtailing, debt accumulation as it happens.
