R News

Performance: when algorithmics meets mathematics

by Florian Privé · April 19, 2018

This article is originally published at https://privefl.github.io/blog/

In this post, I talk about performance through an efficient algorithm I developed for finding closest points on a map. This algorithm uses both concepts from mathematics and algorithmics.

Problem to solve

This problem comes from a recent question on StackOverflow.

I have two matrices, one is 200K rows long, the other is 20K. For each row (which is a point) in the first matrix, I am trying to find which row (also a point) in the second matrix is closest to the point in the first matrix. This is the first method that I tried on a sample dataset:

# Test dataset: longitude and latitude
pixels.latlon <- cbind(runif(200000, min = -180, max = -120),
                       runif(200000, min = 50, max = 85))
grwl.latlon <- cbind(runif(20000, min = -180, max = -120),
                     runif(20000, min = 50, max = 85))
# Calculate the distance matrix
library(geosphere)
dist.matrix <- distm(pixels.latlon, grwl.latlon, fun = distHaversine)
# Pick out the indices of the minimum distance
rnum <- apply(dist.matrix, 1, which.min)

At first, this problem was a memory problem because dist.matrix would take 30GB.

A simple solution to overcome this memory problem has been proposed:

library(geosphere)
rnum <- apply(pixels.latlon, 1, function(x) {
  dm <- distm(x, grwl.latlon, fun = distHaversine)
  which.min(dm)
})

Yet, a second problem remains, this solution would take 30-40 min to run.

First idea of improvement

In the same spirit as with this case study in book Advanced R, let us see the source code of the distHaversine function and see if we can adapt it for our particular problem.

library(geosphere)
distHaversine

## function (p1, p2, r = 6378137) 
## {
##     toRad <- pi/180
##     p1 <- .pointsToMatrix(p1) * toRad
##     if (missing(p2)) {
##         p2 <- p1[-1, ]
##         p1 <- p1[-nrow(p1), ]
##     }
##     else {
##         p2 <- .pointsToMatrix(p2) * toRad
##     }
##     p = cbind(p1[, 1], p1[, 2], p2[, 1], p2[, 2], as.vector(r))
##     dLat <- p[, 4] - p[, 2]
##     dLon <- p[, 3] - p[, 1]
##     a <- sin(dLat/2) * sin(dLat/2) + cos(p[, 2]) * cos(p[, 4]) * 
##         sin(dLon/2) * sin(dLon/2)
##     a <- pmin(a, 1)
##     dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * p[, 5]
##     return(as.vector(dist))
## }
## <bytecode: 0x71c0490>
## <environment: namespace:geosphere>

So, what this code does:

.pointsToMatrix verifies the format of the points to make sure that it is a two-column matrix with the longitude and latitude. Our data is already in this format, we don’t need this here.
it converts from degrees to radians by multiplying by pi / 180.
it computes some intermediate value a.
it computes the great-circle distance based on a.

Knowing that latitude values are between -90° and 90°, you can show that the values of a are between 0 and 1. For these values, dist(a) is in an increasing function of a:

curve(atan2(sqrt(x), sqrt(1 - x)), from = 0, to = 1)

So, in fact, to find the minimum distance, you just need to find the minimum a.

# p1 is just one point and p2 is a two-column matrix of points
haversine2 <- function(p1, p2) {
  
  toRad <- pi / 180
  p1 <- p1 * toRad
  p2 <- p2 * toRad

  dLat <- p2[, 2] - p1[2]
  dLon <- p2[, 1] - p1[1]
  sin(dLat / 2)^2 + cos(p1[2]) * cos(p2[, 2]) * sin(dLon / 2)^2
}

# Test dataset (use smaller size for now)
N <- 200
pixels.latlon <- cbind(runif(N, min = -180, max = -120),
                       runif(N, min = 50, max = 85))
grwl.latlon <- cbind(runif(20000, min = -180, max = -120),
                     runif(20000, min = 50, max = 85))

system.time({
  rnum <- apply(pixels.latlon, 1, function(x) {
    dm <- distm(x, grwl.latlon, fun = distHaversine)
    which.min(dm)
  })
})

##    user  system elapsed 
##   1.852   0.559   2.408

system.time({
  rnum2 <- apply(pixels.latlon, 1, function(x) {
    a <- haversine2(x, grwl.latlon)
    which.min(a)
  })
})

##    user  system elapsed 
##   0.380   0.001   0.383

all.equal(rnum2, rnum)

## [1] TRUE

So, here we get a solution that is 4-5 times as fast because we restricted the source code to our special use case. Still, this is not fast enough in my opinion.

Second idea of improvement

Do you really have to compute distances between all points? For example, if two points are on very different latitudes, does it mean that they are very far from each other?

In a <- sin(dLat / 2)^2 + cos(p1[2]) * cos(p2[, 2]) * sin(dLon / 2)^2, you have a sum of two positive terms. You can deduce that a is always superior to sin(dLat / 2)^2, which is equivalent to 2 * asin(sqrt(a)) is always superior to dLat.

In other terms, for a given point in your matrix, if you have already computed one a0 corresponding to one point in the second matrix, a new point could have its a inferior to a0 only if dLat is inferior to 2 * asin(sqrt(a0)).

So, using a sorted list of all latitudes and with good starting values for a0, you can quickly discard lots of points as being the closest one, just by considering their latitudes. Implementing this idea in R(cpp):

#include <Rcpp.h>
using namespace Rcpp;

double compute_a(double lat1, double long1, double lat2, double long2) {

  double sin_dLat = ::sin((lat2 - lat1) / 2);
  double sin_dLon = ::sin((long2 - long1) / 2);

  return sin_dLat * sin_dLat + ::cos(lat1) * ::cos(lat2) * sin_dLon * sin_dLon;
}

int find_min(double lat1, double long1,
             const NumericVector& lat2,
             const NumericVector& long2,
             int current0) {

  int m = lat2.size();
  double lat_k, lat_min, lat_max, a, a0;
  int k, current = current0;

  a0 = compute_a(lat1, long1, lat2[current], long2[current]);
  // Search before current0
  lat_min = lat1 - 2 * ::asin(::sqrt(a0));
  for (k = current0 - 1; k >= 0; k--) {
    lat_k = lat2[k];
    if (lat_k > lat_min) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_min = lat1 - 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }
  // Search after current0
  lat_max = lat1 + 2 * ::asin(::sqrt(a0));
  for (k = current0 + 1; k < m; k++) {
    lat_k = lat2[k];
    if (lat_k < lat_max) {
      a = compute_a(lat1, long1, lat_k, long2[k]);
      if (a < a0) {
        a0 = a;
        current = k;
        lat_max = lat1 + 2 * ::asin(::sqrt(a0));
      }
    } else {
      // No need to search further
      break;
    }
  }

  return current;
} 

// [[Rcpp::export]]
IntegerVector find_closest_point(const NumericVector& lat1,
                                 const NumericVector& long1,
                                 const NumericVector& lat2,
                                 const NumericVector& long2) {

  int n = lat1.size();
  IntegerVector res(n);

  int current = 0;
  for (int i = 0; i < n; i++) {
    res[i] = current = find_min(lat1[i], long1[i], lat2, long2, current);
  }

  return res; // need +1
}

find_closest <- function(lat1, long1, lat2, long2) {

  toRad <- pi / 180
  lat1  <- lat1  * toRad
  long1 <- long1 * toRad
  lat2  <- lat2  * toRad
  long2 <- long2 * toRad

  ord1  <- order(lat1)
  rank1 <- match(seq_along(lat1), ord1)
  ord2  <- order(lat2)

  ind <- find_closest_point(lat1[ord1], long1[ord1], lat2[ord2], long2[ord2])

  ord2[ind + 1][rank1]
}

system.time(
  rnum3 <- find_closest(pixels.latlon[, 2], pixels.latlon[, 1], 
                        grwl.latlon[, 2], grwl.latlon[, 1])
)

##    user  system elapsed 
##   0.007   0.000   0.007

all.equal(rnum3, rnum)

## [1] TRUE

This is so much faster, because for one point in the first matrix, you just check only a small subset of the points in the second matrix. This solution takes 0.5 sec for N = 2e4 and 4.2 sec for N = 2e5.

4 seconds instead of 30-40 min!

Mission accomplished.

Conclusion

Knowing some maths and some algorithmics can be useful if you are interested in performance.

Thanks for visiting r-craft.org
This article is originally published at https://privefl.github.io/blog/
Please visit source website for post related comments.

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Performance: when algorithmics meets mathematics

You may also like...

Categories

Performance: when algorithmics meets mathematics

Problem to solve

First idea of improvement

Second idea of improvement

Conclusion

You may also like...

Equipping Your Data Science Team to Work from Home

Debug Java in Bio7

RcppMLPACK2 and the MLPACK Machine Learning Library

Categories