Write a function that computes Kendall rank coefficient *Tau* (see lecture notes for additional details on Kendall *Tau*).

Kendall Coefficient (as quoted from *Wikipedia*):

Let (x1, y1), (x2, y2), ..., (xn, yn) be a set of observations of the joint random variables X and Y, such that all the values of xi and yi are unique. Any pair of observations (xi, yi) and (xj, yj) is said to be concordant if the ranks for both elements (more precisely, the sort order by x and by y) agree: that is, if both xi > xj and yi > yj, or if both xi < xj and yi < yj. They are said to be discordant if xi > xj and yi < yj, or if xi < xj and yi > yj. If xi = xj or yi = yj, the pair is neither concordant nor discordant.

Equation 1

*tau* = (number of concordant pairs - number of discordant pairs) / (0.5 x *n* x (*n* - 1))
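Equation 1 can be implemented directly by comparing every pair (i, j) with i < j. A minimal sketch is shown below; the function name `tau_sketch` is only illustrative, not the instructor's hidden solution:

```
# Sketch of Equation 1: count concordant and discordant pairs by
# comparing every pair (i, j) with i < j. Ties are simply skipped.
tau_sketch <- function(x, y) {
  n <- length(x)
  concordant <- 0
  discordant <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) concordant <- concordant + 1
      if (s < 0) discordant <- discordant + 1
      # s == 0 means a tie in x or y: neither concordant nor discordant
    }
  }
  (concordant - discordant) / (0.5 * n * (n - 1))
}
tau_sketch(c(-30, 0, 14, 5, 6, 7), c(-5, 0, 12, 16, 16, 8))  # 0.4
```

For the example vectors there are 10 concordant pairs, 4 discordant pairs, and 1 tied pair (the two values of 16 in y), giving (10 - 4)/15 = 0.4.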

The above equation ignores ties (pairs where either *xi* = *xj* or *yi* = *yj*). One way of correcting for ties is to consider how many ties (or non-ties) are present in the data. This variant, known as Kendall's tau_b, is computed by modifying the denominator 0.5 x *n* x (*n* - 1) in Equation 1 to account for ties:

sqrt(*mx*) x sqrt(*my*)

where *mx* is the number of non-tied pairs for *x* and *my* is the number of non-tied pairs for *y*.
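The corrected denominator can be obtained by counting, for each vector, how many of the 0.5 x n x (n - 1) pairs are tied. A possible sketch (the name `taub_denominator` is illustrative) uses the frequency of each value:

```
# Sketch of the tau_b denominator: mx and my are the numbers of pairs
# NOT tied in x and in y, respectively. Tied pairs within a vector are
# counted from the frequency of each value via choose(count, 2).
taub_denominator <- function(x, y) {
  n <- length(x)
  pairs <- 0.5 * n * (n - 1)
  tied <- function(v) sum(choose(table(v), 2))
  mx <- pairs - tied(x)
  my <- pairs - tied(y)
  sqrt(mx) * sqrt(my)
}
taub_denominator(c(-30, 0, 14, 5, 6, 7), c(-5, 0, 12, 16, 16, 8))
```

For the example vectors x has no ties (mx = 15) and y has one tied pair (my = 14), so the denominator is sqrt(15) x sqrt(14) = sqrt(210), and tau_b = 6/sqrt(210), or about 0.414.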

Note: The R function {cor} returns tau_b.

**YOUR TASKS:** 1. Write a custom function that computes Kendall's coefficient both ignoring ties (Equation 1) and correcting for ties (as explained above).

Your function should return a single value (Kendall's coefficient).

Your function should allow the user to choose to ignore ties or not and thus produce either tau or tau_b.

Apply the function to two vectors provided below and check the results provided by your function against those obtained using the R function {cor}.

Check how fast your function is relative to the {cor} function using the microbenchmark function from the package 'microbenchmark'.

Finally, generate large vectors x and y (n = 1000) from a normal distribution using the rnorm function. Check again how fast your function is relative to the {cor} function using the microbenchmark function from the package 'microbenchmark'.

To carry out this exercise, please start by loading the library 'microbenchmark' and please use the **x** and **y** variables as defined below. This will ensure that the results are the same for all of us. It will also help you evaluate whether your outcomes agree with those provided in the solution functions or alternative *R* functions.

```
library(microbenchmark)
x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)
```

Once your function is ready, you can continue to the next step. The solution functions included in this tutorial are hidden here, but can be reviewed in the .rmd file or in the answer key file (.html). Please try to solve this on your own before checking the solution.

Now check if your function yields comparable results to those provided by standard R functions or custom functions developed by your instructor.

```
x <- c(-30, 0, 14, 5, 6, 7)
y <- c(-5, 0, 12, 16, 16, 8)
# compute kendall tau using generic function {cor}
cor(x, y, method='kendall')
```

`## [1] 0.4140393`

```
# check if your function is consistent with standard cor function
my.kend(x, y, ties=T)
```

`## [1] 0.4140393`

```
# if you developed a second solution check if that solution works as well.
my.kend2(x, y, ties=T) # check if your function works as well
```

`## [1] 0.4140393`

```
# now check if your function(s) also compute(s) uncorrected tau correctly
my.kend(x, y, ties=F)
```

`## [1] 0.4`

`my.kend2(x, y, ties=F) `

`## [1] 0.4`

Once you have established that your function works, you should investigate whether it is faster or slower than the cor function. Keep in mind, however, that performance may depend on sample size: your function may beat the cor function for small samples but perform poorly for large samples (or *vice versa*).

```
microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
my.kend2(x, y, ties=T)) # which is faster for a much larger dataset?
```

```
## Unit: microseconds
## expr min lq mean median uq
## cor(x, y, method = "kendall") 57.770 59.6590 62.28257 61.1690 62.8675
## my.kend(x, y, ties = T) 20.390 21.5225 23.72020 23.5995 24.9210
## my.kend2(x, y, ties = T) 9.062 9.4400 10.29720 10.1960 10.9500
## max neval
## 120.071 100
## 59.658 100
## 17.370 100
```

You can see that the custom functions my.kend and my.kend2 are much faster than the cor function when the analyzed data are tiny.

Let's now check how the functions perform for much larger datasets.

```
x <- rnorm(500,1,1)
y <- rnorm(500,1,3) + x
cor(x, y, method='kendall')
```

`## [1] 0.2152305`

`my.kend(x, y, ties=T) # check if your function agrees`

`## [1] 0.2152305`

`my.kend2(x, y, ties=T) # check if your function agrees`

`## [1] 0.2152305`

```
microbenchmark(cor(x, y, method='kendall'), my.kend(x, y, ties=T),
my.kend2(x, y, ties=T)) # which is faster for a much larger dataset?
```

```
## Unit: milliseconds
## expr min lq mean median
## cor(x, y, method = "kendall") 3.213214 3.283444 3.655006 3.419185
## my.kend(x, y, ties = T) 13.088842 16.598647 19.032110 17.722893
## my.kend2(x, y, ties = T) 49.585435 51.565470 55.769142 53.307439
## uq max neval
## 3.71804 10.94192 100
## 20.13243 43.07858 100
## 55.73528 134.17111 100
```

As is clear from the report, the custom-written solutions I wrote (you may come up with better algorithms) perform well at small sample sizes but underperform at larger sample sizes. That is, compiled, professionally written R functions are faster for large samples than custom solutions written by amateurs. This means that when we iterate over many small datasets, custom functions can be faster; when huge datasets are processed, built-in R functions are likely to be superior.
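A quick way to see why compiled built-ins win on large inputs is to compare an R-level loop with its compiled equivalent (this small illustration is not part of the tutorial's exercises; `slow_sum` is a made-up name):

```
# An interpreted R loop re-evaluates every iteration, while the
# built-in sum() runs compiled C code over the whole vector at once.
slow_sum <- function(v) {
  total <- 0
  for (el in v) total <- total + el  # interpreted, element by element
  total
}
v <- rnorm(1e6)
system.time(slow_sum(v))  # interpreted loop: noticeably slower
system.time(sum(v))       # compiled built-in: near-instant
```

The fixed per-iteration interpreter overhead is negligible for tiny vectors but dominates for large ones, which is the same pattern seen in the benchmarks above.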

Let us evaluate this issue here by using 'cor' and our custom functions to carry out the same analysis.

```
r1 <- as.data.frame(matrix(rnorm(20000), 10, 2000))
# NOTE r1 is a dataframe of many (s=2000) tiny samples (n=10 observations). Each column is a sample.
r2 <- as.data.frame(matrix(rnorm(20000), 2000, 10))
# NOTE r2 is a dataframe of s=10 huge samples (n=2000 observations). Each column is a sample.
system.time(sapply(r1, function(x) cor(x, r1[,1], method='kendall')))
```

```
## user system elapsed
## 0.19 0.00 0.19
```

`system.time(sapply(r1, function(x) my.kend(x, r1[,1])))`

```
## user system elapsed
## 0.11 0.00 0.11
```

`system.time(sapply(r1, function(x) my.kend2(x, r1[,1])))`

```
## user system elapsed
## 0.10 0.00 0.09
```

`system.time(sapply(r2, function(x) cor(x, r2[,1], method='kendall')))`

```
## user system elapsed
## 0.69 0.00 0.69
```

`system.time(sapply(r2, function(x) my.kend(x, r2[,1])))`

```
## user system elapsed
## 4.05 0.89 4.94
```

`system.time(sapply(r2, function(x) my.kend2(x, r2[,1])))`

```
## user system elapsed
## 9.53 0.02 9.54
```

Clearly, the custom functions worked faster when we processed many short vectors (data frame r1), but underperformed badly when a few very large vectors were evaluated (data frame r2).

Always cite packages.

```
##
## To cite package 'microbenchmark' in publications use:
##
## Olaf Mersmann (2018). microbenchmark: Accurate Timing Functions.
## R package version 1.4-4.
## https://CRAN.R-project.org/package=microbenchmark
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {microbenchmark: Accurate Timing Functions},
## author = {Olaf Mersmann},
## year = {2018},
## note = {R package version 1.4-4},
## url = {https://CRAN.R-project.org/package=microbenchmark},
## }
```

**Comments/Questions/Corrections:** Michal Kowalewski (kowalewski@ufl.edu)

**Peer-review:** This document has NOT been peer-reviewed.

**Our Sponsors:** *National Science Foundation (Sedimentary Geology and Paleobiology Program)*, *National Science Foundation (Earth Rates Initiative)*, *Paleontological Society*, *Society of Vertebrate Paleontology*

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.