R and Vertica

I’ve been spending the last few months working my way through the integration of R and Vertica, and will try to keep here things that I find handy. I’m quite sad to see there is not much about this Vertica feature on the web, that’s a little disappointing. But, it didn’t stop us from creating a scalable statistical model learning machine out of this feature (I will write about it in later posts…).

For those of you who don’t know HP Vertica, it is a powerful columnar DBMS.  I’ve worked with two installations of it and me and my colleagues are very impressed with it. For those of you who are familiar with PostgreSQL, you will find many things similar  as both products were invented by the same guy.

That’s all the intro I’m going to give since we pay them and not the other way around 🙂

I suggest people reading this post to read through the chapter called “Developing a User Defined Function in R” in “HP Vertica 6.1.x (or higher) Programmer’s Guide” I hope you are familiar in vsql and R.

Short example – normal inverse function in Vertica

Vertica and R communicate via User Defined Functions (UDFs) that are written in the R language. The columns selected in Vertica are passed to R as a data.frame, the functions must treat their main argument as a data.frame. Here is an example that creates an inverse normal CDF function, x is the data.frame passed from Vertica, in this case it is a 3-column data.frame with the percentile, mean and standard deviation.

# normal inverse
norm_inv <- function(x) { # 1 - percentile, 2 - mean, 3 - sd
apply(x,1,function(i) qnorm(p=i[1L],mean=i[2L],sd=i[3L]))
}

norm_invFactory <- function() { # this is the factory function that 'links' between R and Vertica
list(
name=norm_inv
,udxtype=c("scalar")
,intype=c("float","float","float")
,outtype=c("float")
)
}

You create the file above and place somewhere on your machine, then load this function to the Vertica database :


MyDB=> CREATE LIBRARY r_func AS '/home/dbadmin/norm_inverse.R' LANGUAGE 'R';

MyDB=> CREATE FUNCTION norm_inv AS NAME 'norm_invFactory' LIBRARY r_func;

MyDB=> select norm_inv(.25,0,1);
norm_inv
--------------------
-0.674489750196082
(1 row)

More about User Defined Functions

Imagine being able to implicitly parallelize an R function across an infinite amount of segments and data. This is basically the promise behind the R language package for Vertica.

What is parallelized exactly? Vertica allows you to partition the data sent into R, it implicitly works out how to divide the load between nodes and the nodes’ cores, so you don’t have to work out elaborate code. Each R instance is run independetly from other instances, so you cannot parallelize ,say, a single lm() function, but rather perform multiple ones at once.

The user defined function are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts, as you might have noticed above:

  1. A main function – the main function is what does the work
  2. A factory function – tells the Vertica engine about the input and output to expect.

source() example

Using R’s source() function is possible through an R UDFs in Vertica. Which is a very useful ability for large projects with a lot of code you can then change the sourced code without changing any part of the main functions. Here is how.
Create some R file called ‘foo.R’ with a variable called ‘bar’.

## Start of R code
bar <- "we've done it!"
## End of R code

Create the main and factory functions for in an R file

# Main function
sourceTest <- function(x)
{
source('foo.R')
return(bar)
}
# Factory function
sourceTestFactory <- function()
{
  list(
   name=sourceTest
   ,udxtype=c("scalar")
   ,intype=c("char")
   ,outtype=c("char")
 )
}

In vertica run :

MyDB=> CREATE LIBRARY RsourceTest AS 'sourceTest.r' LANGUAGE 'R';
MyDB=> CREATE FUNCTION sourceTest AS NAME 'sourceTestFactory' LIBRARY RsourceTest;
MyDB=> SELECT sourceTest ("");
sourceTest
--------------------
we've done it!
(1 row)

Pre-requisites for the R language pack

The vertica programmer guide instructs you to install a version of libgfortran. Aside from that you may notice that packages that rely on gfortran such as “xts” require installing other parts :

yum install gcc-gfortran
yum install gcc-c++

Installing the ‘forecast’ package for Vertica

The ‘forecast’ package, by Rob J Hyndman, is great for many time series analysis. Connecting it with Vertica is very powerful for creating forecasts out of your data. Trouble is, the R version used in Vertica is 3.0.0 so you will have to get the older version of forecast 4.8. In short :

wget https://stat.ethz.ch/CRAN/contrib/main/Archive/forecast/forecast_4.8.tar.gz

And then in R:

install.packages("tseries")
install.packages("fracdiff")
install.packages("RcppArmadillo")
install.packages("/root/forecast/forecast_4.8.tar.gz")

Voila. Next time I’ll show some more elaborate examples of transform functions that I use.

Advertisements

11 thoughts on “R and Vertica

  1. […] article was first published on My contRibution » R, and kindly contributed to […]

  2. Since Vertica is, according to some, Postgres with bells, can you run R through PL/R?

  3. Thank you for the nice post!

    I have a little question: is it normal that vertica executes the most simple transform function for ~1minute?

    And it doesnt matter if I pass to a function 3 rows or 10000 rows -> it will executes ~1min anyway

    • Hi Vadym, from my experience the execution time is usually affected by the link between R and Vertica (e.g. – firing up the R instance, passing the data in it.)

      The overall execution time is usually : Total = QueryTime + TransferTo_R_Time + R_execution_time + Transfer_to_vertica

      Sometime, the TransferTo_R_Time can be very big since you are passing big chuncks of data.

      To try and understand where the bottle neck is, you can try to do the following
      If your query with the transform function is : “SELECT partition-column(s), transfunc(col1, col2, …) OVER(…) FROM …”
      Try to run it without the function: “SELECT partition-column(s), col1, col2, … FROM …” and measure the time (\timing in the vsql app)

      create a mini R program that emulates your transform function and read the data.frame from the STDIN run the query from the console and pipe to your program. Measure the overall time and the R execution time.

      This should take care of the first 3 parts of the sum. You can infer the last by subtracting.

      Hope this helps.

  4. Vaughan Roberts

    Aviad, thanks for the post.
    I am using R with Vertica and I would like to use the ‘zoo’ package as well as the ‘forecast’ package. Like you I have forecast v4.8 installed following the advice from Rob J Hyndman.
    I got into a little bit of trouble with dependences when I tried to update some other R packages and had to remove all packages and reinstall Vertica-R in order to get over the problem. In particular I note that HP recommend that neither Rcpp or Rinside packages are updated (presumably they have modified these packages to work with Vertica).
    I presume that if I try to install the latest version of ‘zoo’ I could well have issues with dependencies again. So I need install a version of ‘zoo’ that is compatible with Vertica-R, however I am not sure how to work out which version to use. Have you gone down this path already?

    • Hi Vaughan,

      In short, its a headache. I had to adjust to Vertica’s recommendation of not updating Rinside/Rcpp. And had to write a lot of code to get around it.

      Hope this helps

      Aviad.

  5. Hi Aviad,

    Great blog post! How would I go about writing functions/libraries that use R libraries that don’t come with the default install (ie. dplyr)? Is that possible?

    Thanks,
    Ben

    • Hi Ben,

      Yes, of course that is possible. Write your code with the desired library dependencies.
      The only thing you need to take care of is the installation of the libraries on all Vertica nodes, the R instances are independent of each other and require separate installations.

      Cheers,

      Aviad

      • Hi ,

        I am getting the error as
        CREATE LIBRARY r_func AS ‘/home/dbadmin/R/norm_inverse.R’ LANGUAGE ‘R’;
        ROLLBACK 2175: An error occurred when loading library file on node v_airlinestat_node0001, message:
        Couldn’t create new UDx side process, the language R is not supported

        Could you please help me out in this.
        Thanks a lot in advance,
        Kalpesh

      • I guess you figured it out by now, you have an issue in your code…
        Cheers,

        Aviad

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s