Building scalable functions for large data analysis in R using C and C++
Welcome to the tutorial! In this short piece we are going to explore the motivation for writing functions in C and C++ for use in R and take a look at two quick benchmarking examples. Full reproducible R and C++ code will be presented throughout.
Why C and C++?
C and C++ are low-level languages that have been around for decades and underpin a staggeringly wide range of software we all know and use day-to-day. Much of R itself is written in C! One of the primary reasons for their success is their efficiency. C and C++ are compiled languages, and sit much closer to machine code than high-level languages such as R. This means they can run computations far, far faster than high-level languages, but writing programs in them demands more detailed coding knowledge and skill. While the barrier to entry is much higher, particularly if you come from high-level languages or have no computer science education, the payoffs for applied work can be immense, particularly with regards to efficiency.
The volume and size of datasets analysed by data scientists, consultants, and other analysts are growing rapidly, and computations on these datasets can take a long time. With this in mind, programmers need to be able to write functions that scale to meet the efficiency demands of modern datasets. C and C++ are two languages that can help meet this need.
We are going to take a look at two examples of rewriting R code in C++ and comparing their relative computation time.
Examples
Example 1: Rescaling function
Rescaling is a common technique applied prior to statistical modelling and machine learning because it puts all the variables on the same playing field. This means techniques that can be biased by variables with large variances, such as the gradient descent algorithm used in neural networks, benefit from this preprocessing step. Typically (but not always), people rescale into one of three ranges:
- 0 to 1 (unit interval)
- -1 to +1
- Normalised (a z-score that accounts for the mean and standard deviation)
We are going to take a look at the first today, though the code presented can extend to any of them as we will discuss.
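For intuition, here is a minimal base-R sketch of all three options applied to a toy vector (the values are arbitrary and purely illustrative):

x <- c(2, 5, 9)
# 0 to 1 (unit interval)
(x - min(x)) / (max(x) - min(x))
# -1 to +1: stretch the unit interval to a width of 2, then shift down by 1
2 * (x - min(x)) / (max(x) - min(x)) - 1
# Normalised (z-score): centre on the mean, divide by the standard deviation
(x - mean(x)) / sd(x)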
The simple, readily available R function
We can explore the comparative performance of C++ by writing a small function and benchmarking its computation speed over many automated calculations. While most introductory examples for R users learning C and C++ involve writing a simple program to compute a sum or a mean, we are going to look at something more practical: a function to rescale a vector of numbers into a user-defined range. The sum/mean/median functions and other similar ones are already highly optimised, so rewriting them in C++ would not afford much gain, aside from a handle on the basics. We can examine the existing R function in the code below, where we rescale a numeric vector into the unit interval [0,1]:
library(scales)
# Generate some synthetic data
x <- seq(from = 1, to = 1000, by = 1)
# Rescale it
scales::rescale(x, to = c(0,1))
But how does this function work? Essentially, the mathematics under the hood looks like this (where $x_i$ is the value we want to rescale and $i$ denotes the index in the vector):

$$x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

More generally, rescaling into an arbitrary target range $[new_{min}, new_{max}]$ looks like:

$$x_i' = \frac{(new_{max} - new_{min})(x_i - \min(x))}{\max(x) - \min(x)} + new_{min}$$
But what if we were required to rescale hundreds of thousands of values at once? What if we had to do this across hundreds of columns in a dataframe? Writing an equivalent function in C or C++ might afford us large performance gains as dataset size scales. This is because the overhead of operations such as for loops is much lower in these languages than in R. R is a vectorised language: vectorised operations are fast because the underlying looping happens in compiled C code, whereas an explicit R for loop pays interpreter overhead at every iteration.
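As a quick illustration of that overhead, here is a minimal sketch comparing an explicit loop with the equivalent vectorised operation (loop_add is just a throwaway name for this comparison):

library(microbenchmark)
# Adding 1 to each element with an explicit, interpreted for loop
loop_add <- function(y) {
  out <- numeric(length(y))
  for (i in seq_along(y)) {
    out[i] <- y[i] + 1
  }
  out
}
y <- rnorm(1e5)
# The vectorised version performs the same loop in compiled code
microbenchmark(loop_add(y), y + 1, times = 100)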
C++ version
We can now directly translate that mathematical equation into C++ using the code below (integration with R made simple by the fantastic Rcpp package). Note the key differences from most standard R functions:
- We need to specify the data type for every variable and argument. R does a lot of high-level things for us, such as converting integers to doubles when we want to do maths on both types at once. C++ does not do this, so we must declare every variable type explicitly. This makes the code more verbose, but much clearer to the interested reader. An example in the code below is declaring the input and output vectors as NumericVector.
- Operations are accessed with a period. Similar to Python, C++ uses periods to access operations on objects. This can be seen in the usage of x.size(), where we calculate the length or size of the input vector. This length is then used to drive the for loops so that we can apply the same operation to every entry in the vector.
- Statements are ended with a semicolon. This is extremely important and is the source of many error messages for R users learning C++. You must end lines of code that do things with a ;.
- For loop notation is different and all indexing starts at 0. This is very critical: indexing starts at 0 in C and C++, not at 1 like it does in R. Remember this! Further, the loop syntax is also different: we need to declare the type of the counter i, and then specify the operator ++i to increment to new values of i.
- Comments are written with double forward slashes. Comments in C and C++ are typically denoted by // instead of the # used in R.
NOTE: This rescaling function, which we called cpp_scaler, is one of the rescaling options (labelled MinMax) in the normalise_catch function I wrote for my R package catchEmAll.
library(Rcpp)
# Write the C++ function
cppFunction('
NumericVector cpp_scaler(NumericVector x) {
int n = x.size();
double old_min = x[0]; // Initialise from the data itself so vectors that never cross zero are handled correctly
double old_max = x[0];
double new_min = 0.0;
double new_max = 1.0;
NumericVector x_new(n);
// Calculate min
for (int i = 0; i < n; ++i){
if (x[i] < old_min){
old_min = x[i];
}
}
// Calculate max
for (int i = 0; i < n; ++i){
if (x[i] > old_max){
old_max = x[i];
}
}
// Rescale into the target range [new_min, new_max]
for (int i = 0; i < n; ++i){
x_new[i] = ((new_max-new_min)/(old_max-old_min))*(x[i]-old_max)+new_max;
}
return x_new;
}
')
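As a quick sanity check, we can call our new function directly from R on a small vector:

cpp_scaler(c(10, 20, 30))
## [1] 0.0 0.5 1.0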
If we wanted to further improve the flexibility of our C++ function, we could pass the new_min and new_max components as arguments to the overall function to allow the user to pick the range they want to rescale into. Further, in an actual project, we would store the C++ functions in separate .cpp files and source them into R using sourceCpp. This is consistent with modular programming.
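Here is a minimal sketch of that more flexible version (the name cpp_scaler_range is just for illustration):

library(Rcpp)
cppFunction('
NumericVector cpp_scaler_range(NumericVector x, double new_min, double new_max) {
  int n = x.size();
  double old_min = x[0];
  double old_max = x[0];
  NumericVector x_new(n);
  // Find the min and max of the input in one pass
  for (int i = 0; i < n; ++i){
    if (x[i] < old_min){
      old_min = x[i];
    }
    if (x[i] > old_max){
      old_max = x[i];
    }
  }
  // Rescale into the user-specified range
  for (int i = 0; i < n; ++i){
    x_new[i] = ((new_max-new_min)/(old_max-old_min))*(x[i]-old_min)+new_min;
  }
  return x_new;
}
')

# Rescale into -1 to +1
cpp_scaler_range(x, -1, 1)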
To test our function’s performance, we can benchmark it against the rescale function from the R package scales using the microbenchmark package.
# Generate some synthetic data
x <- seq(from = 1, to = 1000, by = 1)
# Benchmark
microbenchmark::microbenchmark(cpp_scaler(x), scales::rescale(x, to = c(0,1)), times = 1000)
## Unit: microseconds
## expr min lq mean
## cpp_scaler(x) 3.83 6.04 18.04
## scales::rescale(x, to = c(0, 1)) 32.84 44.09 51.22
Clearly, our C++ function is much faster at calculating the rescaled values over the 1000 calculations (most easily seen in the mean column). Now let’s make a larger vector sampled from a probability distribution and compare the performance:
# Generate some synthetic data
x1 <- rnorm(n = 1e5, mean = 5, sd = 2)
# Benchmark
microbenchmark::microbenchmark(cpp_scaler(x1), scales::rescale(x1, to = c(0,1)), times = 1000)
## Unit: microseconds
## expr min lq mean
## cpp_scaler(x1) 255.07 383.47 690.71
## scales::rescale(x1,to = c(0,1)) 1040.69 1901.03 2801.68
The results clearly speak for themselves in this example. Let’s take a look at one more.
Example 2: Calculating entropy of a signal
There is an important subfield of mathematics known as information theory which deals with calculations associated with transmitting signals over noisy channels. Information theory has shown immense utility in a variety of applications, with some notable examples including data compression (e.g. MP3, ZIP files), message encoding, and understanding how complex networks such as the brain and financial markets transmit information. In information theory, information is defined as a reduction in uncertainty/surprise. By extension, this means that high probability events communicate less information than low probability events, because we learn more about a system from the occurrence of a low probability event.
One of the core quantities in information theory is Shannon entropy, which is the average reduction in uncertainty of a given variable. Shannon entropy is measured in bits (when using log base 2) and is mathematically defined as:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$
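As a quick worked example, a fair coin has two outcomes, each with probability 0.5, so its entropy should be exactly 1 bit. We can verify this in a couple of lines of R:

p <- c(0.5, 0.5)
-sum(p * log2(p))
## [1] 1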
Enough theory, let’s code up a function from scratch in both R and C++ this time and compare their performance on a simulated vector of values. Since Shannon entropy works with values taken from a probability mass function, we are going to feed in a vector of probabilities we are calling X.
R function
r_entropy <- function(X){
  H_of_X <- 0
  for(i in seq_along(X)){
    H_of_X <- H_of_X + X[i] * log2(X[i])
  }
  H_of_X <- -H_of_X
  return(H_of_X)
}
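A quick sanity check against a known value (four equally likely outcomes should give exactly 2 bits):

r_entropy(rep(0.25, 4))
## [1] 2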
C++ function
This time I will show you the C++ code as if I wrote it in a dedicated .cpp file. Note the use of #include headers, which give us access to certain operators (think of these as being like R packages or libraries for C++).
#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;
// [[Rcpp::export]]
double cpp_entropy(NumericVector X) {
int n = X.size();
double H_of_X = 0.0;
for(int i = 0; i < n; ++i){
H_of_X += X[i]*log2(X[i]);
}
return -H_of_X;
}
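Assuming we saved this code in a file called cpp_entropy.cpp (a hypothetical filename for this example), we would compile it and load cpp_entropy into our R session like so:

# Compiles the .cpp file and exports the tagged function to R
Rcpp::sourceCpp("cpp_entropy.cpp")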
Performance comparison
library(dplyr) # For dataframe wrangling
library(magrittr) # For the '%>%' operator
# Generate some synthetic probabilities
x2 <- data.frame(x = rpois(1000, lambda = 3)) %>% # Simulate some integers
mutate(x = x / sum(x)) # Computes probabilities that sum to 1
# Benchmark
microbenchmark::microbenchmark(cpp_entropy(x2$x), r_entropy(x2$x), times = 1000)
## Unit: microseconds
## expr min lq mean
## cpp_entropy(x2$x) 5.95 7.16 9.26
## r_entropy(x2$x) 203.22 243.41 300.57
Another clear win for C++! The order of magnitude difference in speed between them is staggering.
Final thoughts
C and C++ are fantastic languages that have stood the test of time for a multitude of reasons. While they involve a relatively steep learning curve, I firmly believe the investment is worth it, particularly if you want a better grasp of the core ideas of general programming and/or want faster functions that scale well to large problems. Do be aware, however, that learning these languages is tricky, and coding more complex algorithms and functions can be very tedious. But if you are likely to reuse functions a lot or need to scale to meet the computational demands of large datasets, investing the time in coding things in C/C++ could be well worth it - especially if you turn your functions into an R package!
Some other resources that I recommend to get started writing code in these languages (especially for use in R) are:
- Rewriting R code in C++ - from the master Hadley Wickham
- Seamless R and C++ Integration with Rcpp - from the author of the Rcpp package, Dirk Eddelbuettel
- Effective Modern C++ - by Scott Meyers