My solution:

# normal log-loss ll &amp;lt;- function(y) function(p) -(y * log(p) + (1-y)*log(1-p)) plot(ll(0),0,1,col=1,main="log loss",ylab="loss",xlab="p") plot(ll(1),0,1,col=2,add=TRUE) legend("topleft",legend = c("y=0","y=1"), lty=1, col=1:2, bty="n") # cost weighted log-loss cwll &amp;lt;- function(y,cost) function(p) -(cost * y * log(p) + (1-cost)*(1-y)*log(1-p)) plot(cwll(0,0.1),0,1,col=1,main="cost weighted log loss\n(cost=0.1)",ylab="loss",xlab="p") plot(cwll(1,0.1),0,1,col=2,add=TRUE) legend("topleft",legend = c("y=0","y=1"), lty=1, col=1:2, bty="n")

Here we can see the different loss behaviours

We will try to classify the Virginica species from the iris dataset

# let's take the iris data set with a species that is hard to differentiate by sepal length and petal length d <- transform(iris, y = Species == "virginica") plot(Sepal.Length~Petal.Length,data=d,col=0,pch=21,bg=rgb(0,y,0), main="is it Virginica?") legend("topleft",legend=c("virginica","others"),pch=21,col=0,pt.bg=c("green","black"),bty="n")

And now the loss functions:

# normal loss function Loss <- function(par, data) { xb <- cbind(1,d$Sepal.Length, d$Petal.Length) %*% par p <- 1 / (1 + exp(-xb)) sum(-(d$y * log(p) + (1-d$y)*log(1-p))) } # cost weighted version, cost refers to error on the virginica species cwLoss <- function(par, data, cost = 0.01) { xb <- cbind(1,d$Sepal.Length, d$Petal.Length) %*% par p <- 1 / (1 + exp(-xb)) sum(-(cost * d$y * log(p) + (1-cost)*(1-d$y)*log(1-p))) } beta <- optim(par=c(0,0,0),fn=Loss,data=d)$par cw.beta <- optim(par=c(0,0,0),fn=cwLoss,data=d)$par xRange <- range(d$Petal.Length) yRange <- range(d$Sepal.Length) X <- seq(xRange[1],xRange[2],by = 0.05) Y <- seq(yRange[1],yRange[2],by = 0.05) z <- outer(X,Y, FUN = function(x,y) -(beta[1] + x * beta[3] + y * beta[2])) cwz <- outer(X,Y, FUN = function(x,y) -(cw.beta[1] + x * cw.beta[3] + y * cw.beta[2])) contour(X,Y,z, add=TRUE,levels = 0,labels = "normal",labcex = 1) contour(X,Y,cwz, add=TRUE,levels = 0,col=4, labels = "cost-weighted",labcex = 1)]]>

In a first of its kind public statement, the American Statistical Association (ASA), has addressed the worrying misuse and misunderstanding of the famous p-value. Some niches in science mistreat it while others ignore it completely.

Here is the link to the full statement, and a link to an abbreviated message, it is separated in two parts. The first, explains the reasons and process that led the ASA to release the statement, including a reference to the ‘reproducibility crisis’ reported in a paper last year [1]. The second is the statement itself regarding the meaning of the p-value and recommended use.

It doesn’t address hard questions, like multiple comparisons or potential comparisons, but an interesting read altogether, especially for us who need to infer x>>0 times a day. They provide a long list of related papers if you are interested in reading them. I suggest to read Yoav Benjamini’s and Tal Gallili’s joint post on the matter.

In the 177 years since it’s foundation, the ASA has not publicly addressed such a fundamental scientific subject. This is tantamount to the Archmaesters of Hightower commenting to Westeros on how to ‘count coppers’.

[1] Peng, R. (2015), The reproducibility crisis in science: A statistical counterattack. Significance, 12: 30–32. doi: 10.1111/j.1740-9713.2015.00827.x

]]>

I’ve received some good comments from readers and friends on my contribution to the use of the Vertica R package, I even had an interviewee that shook my hand and said my posts were great help to his team. After using the package extensively to solve a few ML use cases, I must conclude that it is not the tool you would want to use. I’d dedicate this post to explain why in a few points I believe are important.

Product communities are important, I never paid attention to it in the past, but today I suggest to anyone considering a new technology, look into the community of that product, who are the users, and how many are there. How well do they manage Braistorming and tickets. HP’s Vertica is not an open source product, so you would expect a smaller community compared to, say, MySQL, but still there is not much buzz around the search term ‘R and Vertica’.

As of writing this post, on incognito mode, my post from June 2014 is still ranked #1 on google

.

As flattering as that may be, it is not a good sign for the product, you’d want to see stackoverflow questions and linkedin groups and a multitude of HP owned documents on the subject. Without a big community of developers, bugs don’t get attention, packages and solutions are not created and you find yourself stuck.

The number of tags in Stackoverflow for Vertica alone, without, filtering for R related issues is 439, vs. ~100K for Spark and ~400K for MySQL, get the picture?

Vertica’s great, Vertica’s cool, but, it only has a SQL API. Now, I can say that a LOT can be done with a SQL, and packages like sqlalchemy or frameworks like Apache Spark allow connections via JDBC and then a layer of their own sophistication and programmability, but then Vertica doesn’t really smoothly comply with sqlalchemy and there is no simple connection to Spark.

Also, SQL means that Vertica is stateless, you cannot cache your data, you cannot loop over it, and basically you cannot develop. So you’ll need some wrapping application to call Vertica for you with a JDBC connection – aha! brilliant! But, if you need to write an application that calls it, why not use open source framework like Apache Spark, and have your data in Hadoop instead of paying for Vertica’s license.

Vertica segments or copies it’s data to many nodes and uses novel algorithms to read records quickly from zip, it can do some real magic with most complex queries I’ve ever seen. But when you call R through Vertica, the data gets unzipped then piped into an R process which is:

- Limited by local resources.
- Processes cannot communicate between or within nodes. You cannot broadcast any data. As you can do with spark.

So what is parallel here? Let’s see an example, assume you have a table with some grouping column ‘G’ and then features columns {x1, x2, ….xp} and an output column ‘y’. You’d like to run some, very general, R model that has the formula:

y~x1+x2+...+xp

for every group in ‘G’.

Vertica will split/shuffle the data by group to all nodes and then pipe it to R for analysis, the R instances (1 per node) will work in a queue until finished then each will return results back to Vertica. Why is this bad? You cannot run a huge regression that is truly parallel with something like BFGS or SGD this way, you are limited by the size of data that the R instance can handle, which will not be in the 100s of GB, sadly it would be much smaller.

Vertica is still my first choice DBMS for analytics, used and maintained correctly it’s amazing. If your team has need for scalar functions that are dependent on R and are not readily available in Vertica then I recommend to use the R package, otherwise, don’t.

]]>

You’ll notice 3 functions that need to be created, the main function which will be called from Vertica, the params functions which is optional and holds the config of parameters optionally pased to the function. And finally, and mandatory, is the factory function which tells Vertica what to expect when running ‘group_concat’.

group_concat <- function(x,y) { options(useFancyQuotes=FALSE) # initialize parameters sep <- ifelse(is.null(y[['delimiter']]),',',as.character(y[['delimiter']])) if(is.null(y[['quote']])) y[['quote']] <- 0 doQuote <- ifelse(y[['quote']]==1 ,TRUE,FALSE) if(doQuote) ret <- paste0(sQuote(x[,1]),collapse=sep) else ret <- paste0(x[,1],collapse=sep) ret } group_concat_params <- function() { params <- data.frame(datatype=c("varchar", "int"), length=c(100,NA), scale=rep(NA,2),name=c("delimiter","quote")) params } group_concat_factory <- function() { list( name=group_concat ,udxtype=c("transform") ,intype=c("any") ,outtype=c("varchar(65000)") ,parametertypecallback=group_concat_params # ,volatility=c("stable") # ,strict=c("called_on_null_input") ) }

After distributiong the file into the same directory on all nodes.

In Vertica run :

create or replace library r_func as '.../group_concat.r' language 'R'; create or replace transform function group_concat as name 'group_concat_factory' library r_func; select group_concat(id using parameters delimiter='|', quote=1) over() from sometable;

]]>

For those of you who don’t know HP Vertica, it is a powerful columnar DBMS. I’ve worked with two installations of it and me and my colleagues are very impressed with it. For those of you who are familiar with PostgreSQL, you will find many things similar as both products were invented by the same guy.

That’s all the intro I’m going to give since we pay them and not the other way around

I suggest people reading this post to read through the chapter called “Developing a User Defined Function in R” in “HP Vertica 6.1.x (or higher) Programmer’s Guide” I hope you are familiar in vsql and R.

Vertica and R communicate via User Defined Functions (UDFs) that are written in the R language. The columns selected in Vertica are passed to R as a data.frame, the functions must treat their main argument as a data.frame. Here is an example that creates an inverse normal CDF function, x is the data.frame passed from Vertica, in this case it is a 3-column data.frame with the percentile, mean and standard deviation.

# normal inverse norm_inv <- function(x) { # 1 - percentile, 2 - mean, 3 - sd apply(x,1,function(i) qnorm(p=i[1L],mean=i[2L],sd=i[3L])) } norm_invFactory <- function() { # this is the factory function that 'links' between R and Vertica list( name=norm_inv ,udxtype=c("scalar") ,intype=c("float","float","float") ,outtype=c("float") ) }

You create the file above and place somewhere on your machine, then load this function to the Vertica database :

MyDB=> CREATE LIBRARY r_func AS '/home/dbadmin/norm_inverse.R' LANGUAGE 'R'; MyDB=> CREATE FUNCTION norm_inv AS NAME 'norm_invFactory' LIBRARY r_func; MyDB=> select norm_inv(.25,0,1); norm_inv -------------------- -0.674489750196082 (1 row)

Imagine being able to implicitly parallelize an R function across an infinite amount of segments and data. This is basically the promise behind the R language package for Vertica.

What is parallelized exactly? Vertica allows you to partition the data sent into R, it implicitly works out how to divide the load between nodes and the nodes’ cores, so you don’t have to work out elaborate code. Each R instance is run independetly from other instances, so you cannot parallelize ,say, a single lm() function, but rather perform multiple ones at once.

The user defined function are loaded into Vertica using CREATE LIBRARY syntax and by writing R code that has two main parts, as you might have noticed above:

- A main function – the main function is what does the work
- A factory function – tells the Vertica engine about the input and output to expect.

Using R’s source() function is possible through an R UDFs in Vertica. Which is a very useful ability for large projects with a lot of code you can then change the sourced code without changing any part of the main functions. Here is how.

Create some R file called ‘foo.R’ with a variable called ‘bar’.

## Start of R code bar <- "we've done it!" ## End of R code

Create the main and factory functions for in an R file

# Main function sourceTest <- function(x) { source('foo.R') return(bar) } # Factory function sourceTestFactory <- function() { list( name=sourceTest ,udxtype=c("scalar") ,intype=c("char") ,outtype=c("char") ) }

In vertica run :

MyDB=> CREATE LIBRARY RsourceTest AS 'sourceTest.r' LANGUAGE 'R'; MyDB=> CREATE FUNCTION sourceTest AS NAME 'sourceTestFactory' LIBRARY RsourceTest; MyDB=> SELECT sourceTest (""); sourceTest -------------------- we've done it! (1 row)

The vertica programmer guide instructs you to install a version of libgfortran. Aside from that you may notice that packages that rely on gfortran such as “xts” require installing other parts :

yum install gcc-gfortran yum install gcc-c++

The ‘forecast’ package, by Rob J Hyndman, is great for many time series analysis. Connecting it with Vertica is very powerful for creating forecasts out of your data. Trouble is, the R version used in Vertica is 3.0.0 so you will have to get the older version of forecast 4.8. In short :

wget https://stat.ethz.ch/CRAN/contrib/main/Archive/forecast/forecast_4.8.tar.gz

And then in R:

install.packages("tseries") install.packages("fracdiff") install.packages("RcppArmadillo") install.packages("/root/forecast/forecast_4.8.tar.gz")

Voila. Next time I’ll show some more elaborate examples of transform functions that I use.

]]>How many times do users visit your site? A very relevant question in web analysis.

One major issue that biases your estimation is the users who save no cookies, these users change their ID on every page visit so they look like 1 time visitors and thus inflate the proportion of 1 in the distribution. This has an enormous effect on the bounce rate you calculate to your site.

Another minor issue, which is a by product of the solution I’m about to offer, is that most distribution functions have a support of x>=0, while visits start at 1. Read on to see the suggested solution.

I’ve stumbled upon this issue a few times and would like to suggest the following solution to estimate the latent proportion of 1 time visitors via maximum likelihood estimation.

- There exists a constant proportion (P) of users who don’t save cookies.
- The number of visits to the site by users who do save cookies (X) follows some known distribution function (f) who’s parameters are unknown. The distribution is truncated at 0, as we will never see the users who visit 0 times. The CDF for f is denoted as F.

We define Y as the distribution of visits to your site :

Pr(y=1) = P + (1-P) * f(1) / (1-F(0))

For Y=y other than 1 :

Pr(Y=y) = (1-P) * f(y) / (1-F(0))

You might ask yourself why f(y) is weighted with 1/(1-F(0)), since we are looking at the truncated version of the distribution this weighting factor is essential to keep the integral of the truncated distribution at 1. I’ll leave you to figure out the math.

Here is an example when assuming the distribution f is Poisson.

first we construct the likelihood function:

Ind <- function(l) { if(is.logical(l)) as.numeric(l) else stop('"l" must be logical') } fy <- function(x, lambda, log = FALSE) { if(!log) { Ind(x>=1) * dpois(x, lambda) / ppois(0, lambda, lower.tail = FALSE)} else { Ind(x>=1) * ( dpois(x, lambda, log=TRUE) - ppois(0, lambda, lower.tail = FALSE, log=TRUE))} } Lik <- function(par,x) { p <- par[1] lambda <- par[2] n1 <- sum(x==1) n0 <- sum(x!=1) x0 <- x[x!=1] n1 * log(p + (1-p) * fy(1,lambda)) + n0 * log(1-p) + sum(fy(x0,lambda,log=TRUE)) }

Let’s create some data :

x <- c(rpois(5000,5),rep(1,600)) # a sample of 5000 values from a Poisson distribution with lambda = 5, artificially inflated with 600 1s. par <- c(mean(x==1),mean(x)) # starting values for the estimation process parOptim <- optim(par,fn=Lik,x=x,method="Nelder-Mead",control=list(fnscale=-1)) # using optim() to find the MLE parOptim$par[1] [1] 0.1092563

This means the estimated P is 10.9% of inflated one time visitors. Be sure to remove these guys from the distribution of visitors you get. But be sure to test a few distributions and find a way to judge between them prior to estimating the final numbers.

Good luck.

]]>```
coalesce <- function(x,...) {
fillerList <- list(...)
y <- try(y <- unlist(..1))
if(class(y)=="try-error" | length(y)==0L) {
x <- x
}
else if(length(y)==1L) {
x[is.na(x)] <- y
}
else {
x[is.na(x)] <- y[is.na(x)]
}
# recursion
if(length(fillerList)-1L<=0L) {return(x)}
else {return(coalesce(x,fillerList[-1]))}
}
```

]]>One way of separating 2 categories using linear sub spaces of the input space (e.g. planes for 3D inputs, lines for 2D inputs, etc.) is by dimensionality reduction:

**y=w’x**

if x belongs to a **P** dimensional input space of real numbers, and **w** is a **P** on **1** vector of weights then y is the projection of **X** to the **P-1** sub-plane. We assign **x** to **C1** if **y>0** and to **C2** otherwise. The group of **x**‘s that hold **y=0** are called the “Decision Boundary”.

Other features of this method are listed in the book, one which is important for the presentation is that each vector on the sub-plane is orthogonal to **w**. Proof : let **x1**, **x2** be points on the decision boundary separating **R1** and **R2**. Since **y=0** for all these points we conclude :

**w’x1=w’x2=0**

**w'(x1-x2)=0**

So **w** is orthogonal to the decision boundary by definition.

Assume with we have two groups following a normal distribution with different means and similar variance co-variance matrices **Σ** :

Given the mean points **M1,M2**. A good strategy might be to find **w** so the projection of the means **m1** and **m2** will reach maximum separation. By putting the constraint **w’w=1** and using lagrange multipliers we find that **w** is proportional to **(M1-M2)**. The problems begins when the groups have diagonal covariances and the projected values overlap. This can happen even if the two means are fully separated. Fisher (1926) wrote the criterion for this solution, with w proportional to **inverse(Σ)(M1-M2)**.

By looking at the histograms of the projected values we get two histograms of a linear transformation of normal bi-variate variables, hence also a normal distribution. The code is attached, follow it step by step and adjust the device size and sampling parameters to see how Fisher’s criterion for **w** is superior to the ordinary optimization.

# (1) sampling assuming mvnorm with same Sigma]]>

{

rm(list=ls())

mu.1 <- c(2,-1)

mu.2 <- c(2,5)

rho <- 0.8

sigma.1 <- 1

sigma.2 <- 3

Sigma <- matrix(c(sigma.1^2

,rho*sigma.1*sigma.2

,rho*sigma.1*sigma.2,sigma.2^2),byrow=T,nrow=2)

N <- 100

# multivariate normal sampling

X1 <- MASS::mvrnorm(N,mu=mu.1,Sigma=Sigma)

X2 <- MASS::mvrnorm(N,mu=mu.2,Sigma=Sigma)

# make a data frame

X <- data.frame(cbind(rep(c(4,2),each=N),rbind(X1,X2)))

names(X) <- c("group","X1","X2")

means <- matrix(c(tapply(X$X1,X$group,mean),tapply(X$X2,X$group,mean)),2,2)

means <- data.frame(X1=means[,1],X2=means[,2],row.names=c(2,4))

A <- matrix(NA,nrow=nrow(X),ncol=2)

A[,1] <- as.numeric(X$X1) ;A[,2] <- as.numeric(X$X2)}

# (2) plot the sample

{

PLOT <- function(main) {

layout(matrix(c(1,1,2,3),byrow=T,ncol=2),heights=c(0.2,0.8))

par(mar=rep(0,4))

plot.new()

text(0.5,0.5,main,cex=3)

par(mar=c(4,4,0,0))

plot(X2~X1,data=X,pch=21,bg=1

,type="n"

,xlab=expression(x[1]),ylab=expression(x[2]))

axx <- par("xaxp")

axx <- seq(axx[1],axx[2],length.out=axx[3]+1)

axy <- par("yaxp")

axy <- seq(axy[1],axy[2],length.out=axy[3]+1)

abline(h=axy,v=axx,lty=5,col=gray(0.8))

points(X2~X1,pch=21,bg=group,col=group,data=X)}

PLOT(main="Simple Linear Classification")}

# (3) show means and mean line

{

points(means,col=1,pch=3,cex=3,lwd=3)

lines(means,lwd=3,col=1)}

# (4) calculate midpoint and orthogonal line

{

mid.point <- apply(means,2,mean)

points(mid.point[2]~mid.point[1],col=1,pch=4,cex=2,lwd=4)

H <- c(-1,1)

m <- H %*% as.matrix(means)

m <- m[2]/m[1]

inv.m <- solve(m,-1)

arrows(

x0=mid.point[1],x1=mid.point[1]+2

,y0=mid.point[2],y1=mid.point[2]+2*inv.m

,col=1,lwd=3)}

# (5) Simple Linear Optimization

{

SLO <- function() {

lambda <- sqrt((as.matrix(means[1,]-means[2,])) %*% t(as.matrix(means[1,]-means[2,])))/2

w <- (means[2,]-means[1,])/(2*lambda)

w <- matrix(as.numeric(w))

wX <- A %*% w

wX <- data.frame(group=X$group,wX=wX)

h4 <- hist(wX$wX[wX$group==4],plot=F)

h2 <- hist(wX$wX[wX$group==2],plot=F)

breaks <- sort(union(h4$breaks,h2$breaks))

counts4 <- c(rep(0,length(breaks)-length(h4$counts)-1),h4$counts)

counts2 <- c(h2$counts,rep(0,length(breaks)-length(h2$counts)-1))

mids <- sort(union(h4$mids,h2$mids))

h2$breaks <- h4$breaks <- breaks

h2$mids <- h4$mids <- mids

h4$counts <- counts4

plot(h2,border=0,main=""

,xlab="Projection Values",ylim=c(0,max(counts4,counts2)))

rect( # blue group

xleft=breaks[-length(breaks)]

,ybottom=0

,xright=breaks[-1]

,ytop=counts4

,col=4,density=20,angle=45)

rect( # red group

xleft=breaks[-length(breaks)]

,ybottom=0

,xright=breaks[-1]

,ytop=counts2

,col=2,density=20,angle=-45)

} # close on SLO

SLO()}

# (6) Fisher's Criterion

{

means <- matrix(c(tapply(X$X1,X$group,mean),tapply(X$X2,X$group,mean)),2,2)

b <- as.numeric(solve(cov(A)) %*% (means[1,]-means[2,]))

b0 <- 0.5*((t(means[1,]-means[2,]) %*% solve(cov(A))) %*% (means[1,]-means[2,]))

Lx <- (A %*% b)+b0*rep(1,N)}

# (7) re-plot the sample

{

PLOT(main="Fisher's Criterion")}

# (8) plot Fisher's Criterion

{

x <- par("xaxp")

x <- seq(100*x[1],100*x[2],length.out=2)

b1 <- b[1]

b2 <- b[2]

y <- (-b0-b1*x)/b2

lines(y~x,lwd=3)}

# (9) plot Projection Histogram

{

FC <- function() {

L4 <- hist(Lx[X$group==4],plot=F)

L2 <- hist(Lx[X$group==2],plot=F)

Lbreaks <- sort(union(L4$breaks,L2$breaks))

Lcounts4 <- c(rep(0,length(Lbreaks)-length(L4$counts)-1),L4$counts)

Lcounts2 <- c(L2$counts,rep(0,length(Lbreaks)-length(L2$counts)-1))

Lmids <- sort(union(L4$mids,L2$mids))

L2$breaks <- L4$breaks <- Lbreaks

L2$mids <- L4$mids <- Lmids

L4$counts <- Lcounts4

plot(L2,border=0,main=""

,xlab="Projection Values",ylim=c(0,max(Lcounts4,Lcounts2)))

rect( # blue group

xleft=Lbreaks[-length(Lbreaks)]

,ybottom=0

,xright=Lbreaks[-1]

,ytop=Lcounts4

,col=4,density=20,angle=45)

rect( # red group

xleft=Lbreaks[-length(Lbreaks)]

,ybottom=0

,xright=Lbreaks[-1]

,ytop=Lcounts2

,col=2,density=20,angle=-45)

} # close on FC

FC()}

The problem, as I saw it, was that a lot (20%) of bank accounts had the same family name for two different card holders, hence, it is very reasonable to assume they are family. But what happens if GrinShtein is written as GreenShtein , an infinite amount of typos obscured the true amount of family members.

Given the fact that the bank accounts with more than 1 member were probably family related, I had to see which names were similar up to some metric.

I chose the Levenshtein Distance as the metric to measure the dissimilarity between two names. Levenshtein’s distance (L) counts the minimum number of edits (insertions, substitutions and deletions) needed to transform String1 to String2.

I’ve graded each pair of names with :

**Grade=L/max(c1,c2)**

where c1 and c2 are the number of characters in each String.

Next I had to use a macro to compare names of 3 and more members of a bank account. The macro’s flow is as follows :

- I used the
**%combo**macro to make a list of pairs indices to choose from iteratively, those were written into two macro “+” separated macro variables v_1 and v_2. - If the number of distinct ID holders in the bank account is K then the number of iterations is N=k*(k-1)/2, for each iteration I graded a different pair of names and held the result according to a Boolean rule in an array (
**Pair array**) of size N. - For each bank account I calculated the number of similar names by using the sum of the array, and the number K with the function :
**min(K,sum(array))**. Later, I substituted 1’s to 2’s because a pair is made of two people. So the values that the similar names receives is**S**=0,2,…,K. - I calculated the homogeneity :
**H=S/K.**

if **H=0** then no names are similar, if H=0 then all names are similar up to a threshold of the Grade.

I used a threshold of both **Grade** and **max(c1,c2)**, since smaller names tend to be more volatile. I guess each language has its different type of typos and errors in spelling names. Hebrew, in my case, is prone to dropping and adding vowels (which can result in the same name), substitution of consonants which appear similar to the ear and keyboard.

Results were excellent, Thousands of solid households were discovered and added to the database.

The Code for the program will be shown in a few days…

]]>I’ve searched for a quick way to find out all the classes of vectors inside a data.frame.

Since I’ve found no reference for such a function/process I made one up.

I’d like to hear what people have to say about the following use of the “class” function on data.frames

a simple call :

> library(rpart) # comes with R > data(kyphosis) # comes with rpart > class(kyphosis) "data.frame"

trying to use the “apply” function to know what classes are the columns in the data.frame yeilds the following unwanted result :

> apply(kyphosis,2,class) Kyphosis Age Number Start "character" "character" "character" "character"

For some reason the apply function returns “character” on all vectors regardless of their true content (any ideas why?).

Anyhow, after some thought I’ve come up with the following function :

> allClass <- function(x) {unlist(lapply(unclass(x),class))} > allClass(kyphosis) Kyphosis Age Number Start "factor" "integer" "integer" "integer"

Compact, fast and quite useful. Of course the control flow needs more work to fit other classes and recognize when x is not a data.frame.

Comments are welcome.

]]>