Vertica doesn’t suit ML or ‘Why I stopped using the Vertica R package’

2 Years, are a long time. For a blogger wannabe for sure. 2 years since my last post, a lot has happened and my skills have broadened beyond R, so this R dedicated blog, might not speak about R so much, maybe some other things.

I’ve received some good comments from readers and friends on my contribution to the use of the Vertica R package, I even had an interviewee that shook my hand and said my posts were great help to his team. After using the package extensively to solve a few ML use cases, I must conclude that it is not the tool you would want to use. I’d dedicate this post to explain why in a few points I believe are important.

1 – Lack of community

Product communities are important, I never paid attention to it in the past, but today I suggest to anyone considering a new technology, look into the community of that product, who are the users, and how many are there. How well do they manage Braistorming and tickets. HP’s Vertica is not an open source product, so you would expect a smaller community compared to, say, MySQL, but still there is not much buzz around the search term ‘R and Vertica’.

As of writing this post, on incognito mode, my post from June 2014 is still ranked #1 on google
baby-eme.

As flattering as that may be, it is not a good sign for the product, you’d want to see stackoverflow questions and linkedin groups and a multitude of HP owned documents on the subject. Without a big community of developers, bugs don’t get attention, packages and solutions are not created and you find yourself stuck.

The number of tags in Stackoverflow for Vertica alone, without, filtering for R related issues is 439, vs. ~100K for Spark and ~400K for MySQL, get the picture?

2 -Lack of non-SQL API

Vertica’s great, Vertica’s cool, but, it only has a SQL API. Now, I can say that a LOT can be done with a SQL, and packages like sqlalchemy or frameworks like Apache Spark allow connections via JDBC and then a layer of their own sophistication and programmability, but then Vertica doesn’t really smoothly comply with sqlalchemy and there is no simple connection to Spark.

Also, SQL means that Vertica is stateless, you cannot cache your data, you cannot loop over it, and basically you cannot develop. So you’ll need some wrapping application to call Vertica for you with a JDBC connection – aha! brilliant! But, if you need to write an application that calls it, why not use open source framework like Apache Spark, and have your data in Hadoop instead of paying for Vertica’s license.

3 – Vertica is parallel, but, it doesn’t apply so much to R

Vertica segments or copies it’s data to many nodes and uses novel algorithms to read records quickly from zip, it can do some real magic with most complex queries I’ve ever seen. But when you call R through Vertica, the data gets unzipped then piped into an R process which is:

  1. Limited by local resources.
  2. Processes cannot communicate between or within nodes. You cannot broadcast any data. As you can do with spark.

So what is parallel here? Let’s see an example, assume you have a table with some grouping column ‘G’ and then features columns {x1, x2, ….xp} and an output column ‘y’. You’d like to run some, very general, R model that has the formula:

y~x1+x2+...+xp

for every group in ‘G’.

Vertica will split/shuffle the data by group to all nodes and then pipe it to R for analysis, the R instances (1 per node) will work in a queue until finished then each will return results back to Vertica. Why is this bad? You cannot run a huge regression that is truly parallel with something like BFGS or SGD this way, you are limited by the size of data that the R instance can handle, which will not be in the 100s of GB, sadly it would be much smaller.

To Conclude…

Vertica is still my first choice DBMS for analytics, used and maintained correctly it’s amazing. If your team has need for scalar functions that are dependent on R and are not readily available in Vertica then I recommend to use the R package, otherwise, don’t.

 

Advertisements

R and Vertica

I’ve been spending the last few months working my way through the integration of R and Vertica, and will try to keep here things that I find handy. I’m quite sad to see there is not much about this Vertica feature on the web, that’s a little disappointing. But, it didn’t stop us from creating a scalable statistical model learning machine out of this feature (I will write about it in later posts…).

For those of you who don’t know HP Vertica, it is a powerful columnar DBMS.  I’ve worked with two installations of it and me and my colleagues are very impressed with it. For those of you who are familiar with PostgreSQL, you will find many things similar  as both products were invented by the same guy.

That’s all the intro I’m going to give since we pay them and not the other way around 🙂

I suggest people reading this post to read through the chapter called “Developing a User Defined Function in R” in “HP Vertica 6.1.x (or higher) Programmer’s Guide” I hope you are familiar in vsql and R.

Short example – normal inverse function in Vertica

Vertica and R communicate via User Defined Functions (UDFs) that are written in the R language. The columns selected in Vertica are passed to R as a data.frame, the functions must treat their main argument as a data.frame. Here is an example that creates an inverse normal CDF function, x is the data.frame passed from Vertica, in this case it is a 3-column data.frame with the percentile, mean and standard deviation.

# normal inverse
norm_inv <- function(x) { # 1 - percentile, 2 - mean, 3 - sd
apply(x,1,function(i) qnorm(p=i[1L],mean=i[2L],sd=i[3L]))
}

norm_invFactory <- function() { # this is the factory function that 'links' between R and Vertica
list(
name=norm_inv
,udxtype=c("scalar")
,intype=c("float","float","float")
,outtype=c("float")
)
}

You create the file above and place somewhere on your machine, then load this function to the Vertica database :


MyDB=> CREATE LIBRARY r_func AS '/home/dbadmin/norm_inverse.R' LANGUAGE 'R';

MyDB=> CREATE FUNCTION norm_inv AS NAME 'norm_invFactory' LIBRARY r_func;

MyDB=> select norm_inv(.25,0,1);
norm_inv
--------------------
-0.674489750196082
(1 row)

More about User Defined Functions

Imagine being able to implicitly parallelize an R function across an infinite amount of segments and data. This is basically the promise behind the R language package for Vertica.

What is parallelized exactly? Vertica allows you to partition the data sent into R, it implicitly works out how to divide the load between nodes and the nodes’ cores, so you don’t have to work out elaborate code. Each R instance is run independetly from other instances, so you cannot parallelize ,say, a single lm() function, but rather perform multiple ones at once.

The user defined function are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts, as you might have noticed above:

  1. A main function – the main function is what does the work
  2. A factory function – tells the Vertica engine about the input and output to expect.

source() example

Using R’s source() function is possible through an R UDFs in Vertica. Which is a very useful ability for large projects with a lot of code you can then change the sourced code without changing any part of the main functions. Here is how.
Create some R file called ‘foo.R’ with a variable called ‘bar’.

## Start of R code
bar <- "we've done it!"
## End of R code

Create the main and factory functions for in an R file

# Main function
sourceTest <- function(x)
{
source('foo.R')
return(bar)
}
# Factory function
sourceTestFactory <- function()
{
  list(
   name=sourceTest
   ,udxtype=c("scalar")
   ,intype=c("char")
   ,outtype=c("char")
 )
}

In vertica run :

MyDB=> CREATE LIBRARY RsourceTest AS 'sourceTest.r' LANGUAGE 'R';
MyDB=> CREATE FUNCTION sourceTest AS NAME 'sourceTestFactory' LIBRARY RsourceTest;
MyDB=> SELECT sourceTest ("");
sourceTest
--------------------
we've done it!
(1 row)

Pre-requisites for the R language pack

The vertica programmer guide instructs you to install a version of libgfortran. Aside from that you may notice that packages that rely on gfortran such as “xts” require installing other parts :

yum install gcc-gfortran
yum install gcc-c++

Installing the ‘forecast’ package for Vertica

The ‘forecast’ package, by Rob J Hyndman, is great for many time series analysis. Connecting it with Vertica is very powerful for creating forecasts out of your data. Trouble is, the R version used in Vertica is 3.0.0 so you will have to get the older version of forecast 4.8. In short :

wget https://stat.ethz.ch/CRAN/contrib/main/Archive/forecast/forecast_4.8.tar.gz

And then in R:

install.packages("tseries")
install.packages("fracdiff")
install.packages("RcppArmadillo")
install.packages("/root/forecast/forecast_4.8.tar.gz")

Voila. Next time I’ll show some more elaborate examples of transform functions that I use.

Coalesing in R

The coalesce function is a recursive null filler very common in every database software, however R seems to be missing this simple function. Here is my suggestion :

coalesce <- function(x,...) {

  fillerList <- list(...)
    y <- try(y <- unlist(..1))
    if(class(y)=="try-error" | length(y)==0L) {
        x <- x 
    }
    else if(length(y)==1L) {
     x[is.na(x)] <- y
    }
    else {
     x[is.na(x)] <- y[is.na(x)]
    }
    # recursion
    if(length(fillerList)-1L<=0L) {return(x)}
    else {return(coalesce(x,fillerList[-1]))}
}

Color choosing in R made easy

I don’t know about you, but when I want to make a graph in R, I handpick the colors, line widths etc… to produce awesome output.

A lot of my time is spent on color choosing, I had to find a more convenient way of doing so. Earl F. Glynn’s “Chart of R colors”  posted on  http://research.stowers-institute.org/efg/R/Color/Chart/ gave me the idea to the following function.

R has an Internal called colors(), it’s output is a 657 long character vector of reserved names for colors.

The names can be used directly as in :

plot(iris,col="orange")

Iris attributes in orangeAlternatively, they can be referred to from colors()

plot(iris,col=colors()[503])

Iris attributes in orange redThe color() function has a two arguments (notice it is not plural… I chose this unreserved name because it’s easy to remember) :

  1. plot.it – FALSE by default
  2. locate – 0 by default

The call color() is the same as colors().

> color()[555]==colors()[555]
[1] TRUE

Calling color(plot.it=T) or color(T) gives the following output (very similar to Earl F. Glynn’s “Chart of R colors”) :

You can choose and remember the numbers, or print and stick above your working area… but the following makes color() more useful :

Specifying locate = k > 0 will plot the chart above and this time will use the locator() function in a loop to choose colors you want. After choosing k colors the output will be  a k-long character vector of the chosen colors.

> color(T,5)
[1] "firebrick4" "grey9"      "green"      "gray99"     "khaki1"

You can use it directly in a plot function, the palette of colors will plot first, choose your colors, the plot you called for will be followed by it:

plot(iris,col=color(T,5))

The function is given by :

color    <- function (plot.it=F,locate=0)
 {
 if(!plot.it)
   {
   return(.Internal(colors())) # so far, not different from colors()
   } # close on if
 else
   {
   ytop    <- rep(seq(1/26,1,by=1/26),each=26)[1:657]
   ybottom <- rep(seq(0,1-1/26,by=1/26),each=26)[1:657]
   xleft   <- rep(seq(0,1-1/26,by=1/26),times=26)[1:657]
   xright  <- rep(seq(1/26,1,by=1/26),times=26)[1:657]
   pall    <- round(col2rgb(colors())/256)
   pall    <- colSums(pall) ; pall2 <- character(0)
   pall2[pall>0]   <- "black"
   pall2[pall==0]  <- "white"

   par(mar=c(0,0,1,0))

   plot.new()
   title(main="Palette of colors()")
   rect(xleft,ybottom,xright,ytop,col=colors())
   text(x=xleft+((1/26)/2)
   ,y=ytop-((1/26)/2)
   ,labels = 1:657
   ,cex=0.55
   ,col=pall2)

   } # close on else
   if(locate==0) print("Palette of colors()")
   else
   {
   colmat    <- matrix(c(1:657,rep(NA,26^2-657)),byrow=T,ncol=26,nrow=26)
   cols        <- NA
   i        <- NA
   for(i in 1:locate)
   {
   h    <- locator(1)
   if(any(h$x<0,h$y<0,h$x>1,h$y>1)) stop("locator out of bounds!")
   else
   {
   cc        <- floor(h$x/(1/26))+1
   rr        <- floor(h$y/(1/26))+1            
   cols[i]    <- .Internal(colors())[colmat[rr,cc]]
   } # close on else
   } # close on i
   return(cols)
   } # close on else
   } # close on else+function

You can also write it to variable for further use:

> cols<- color(T,5)
> cols
[1] "magenta"        "orange3"        "palevioletred2" "seagreen4"     
[5] "seagreen2"

Of course it’s not perfect, I still have not solved issues of working with Devices such as pdf() jpeg() and such…

Your comments are welcome.