An example of linear discriminant analysis

The following example was shown in an advanced statistics seminar held in Tel Aviv. The material for the presentation comes from C. M. Bishop's book Pattern Recognition and Machine Learning (Springer, 2006).

One way of separating two categories using linear subspaces of the input space (e.g. planes for 3-D inputs, lines for 2-D inputs, etc.) is by dimensionality reduction:

y = w'x

If x belongs to a P-dimensional input space of real numbers, and w is a P × 1 vector of weights, then y is the projection of x onto the direction w. We assign x to C1 if y > 0 and to C2 otherwise. The set of x's for which y = 0 is called the "decision boundary", a (P−1)-dimensional subspace.
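As a minimal sketch of the rule above (the weight vector and point here are made up for illustration):

```r
# Toy 2-D illustration of the linear rule: project x onto w, then threshold.
w <- c(1, -1)                  # illustrative weight vector
x <- c(3, 1)                   # a point in the input space
y <- sum(w * x)                # y = w'x, here 3 - 1 = 2
assigned <- if (y > 0) "C1" else "C2"
assigned                       # "C1"
```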

Other features of this method are listed in the book; one which is important for the presentation is that every vector lying in the decision boundary is orthogonal to w. Proof: let x1, x2 be points on the decision boundary separating R1 and R2. Since y = 0 for all these points, we conclude:

w'x1 = w'x2 = 0, hence w'(x1 − x2) = 0

So w is orthogonal to the decision boundary by definition.
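A quick numeric check of the proof, with an arbitrary w and two hand-picked boundary points:

```r
# Two points on the decision boundary {x : w'x = 0} of an arbitrary w;
# their difference is orthogonal to w.
w  <- c(2, 1)
x1 <- c(1, -2)               # w'x1 = 2 - 2 = 0
x2 <- c(-1, 2)               # w'x2 = -2 + 2 = 0
sum(w * (x1 - x2))           # w'(x1 - x2) = 0
```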

Assume we have two groups following normal distributions with different means and the same variance-covariance matrix Σ, i.e. x | Ck ~ N(Mk, Σ) for k = 1, 2.

Given the mean points M1 and M2, a good strategy might be to find w so that the projected means m1 = w'M1 and m2 = w'M2 reach maximum separation. By adding the constraint w'w = 1 and using Lagrange multipliers, we find that w is proportional to (M1 − M2). The problem begins when the groups' covariance makes the projected values overlap, which can happen even if the two means are well separated. Fisher (1936) wrote the criterion for this situation, which gives w proportional to inverse(Σ)(M1 − M2).
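To see how the two directions differ, here is a small numeric sketch (the means and Σ below are invented for illustration):

```r
# Naive direction vs. Fisher's direction under a correlated Sigma.
M1 <- c(0, 0)
M2 <- c(2, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
w.naive  <- M1 - M2                      # proportional to (M1 - M2)
w.fisher <- solve(Sigma) %*% (M1 - M2)   # proportional to inverse(Sigma)(M1 - M2)
# w.naive points along (-1, 0); w.fisher is tilted by the correlation
```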

Looking at the histograms of the projected values, each is a linear transformation of a bivariate normal variable, hence also normally distributed. The code is attached; follow it step by step and adjust the device size and sampling parameters to see how Fisher's criterion for w is superior to the ordinary optimization.

# (1) sampling assuming mvnorm with same Sigma
mu.1 <- c(2,-1)
mu.2 <- c(2,5)
rho <- 0.8
sigma.1 <- 1
sigma.2 <- 3
Sigma <- matrix(c(sigma.1^2, rho*sigma.1*sigma.2,
                  rho*sigma.1*sigma.2, sigma.2^2), 2, 2)
N <- 100

# multivariate normal sampling
X1 <- MASS::mvrnorm(N,mu=mu.1,Sigma=Sigma)
X2 <- MASS::mvrnorm(N,mu=mu.2,Sigma=Sigma)

# make a data frame
X <- data.frame(cbind(rep(c(4,2),each=N),rbind(X1,X2)))
names(X) <- c("group","X1","X2")

means <- matrix(c(tapply(X$X1,X$group,mean),tapply(X$X2,X$group,mean)),2,2)
means <- data.frame(X1=means[,1],X2=means[,2],row.names=c(2,4))

A <- matrix(NA,nrow=nrow(X),ncol=2)
A[,1] <- as.numeric(X$X1) ; A[,2] <- as.numeric(X$X2)
# (2) plot the sample
PLOT <- function(main) {
plot(X$X2 ~ X$X1, col=ifelse(X$group==4,"blue","red"), pch=20,
     main=main, xlab="X1", ylab="X2")
axx <- par("xaxp")
axx <- seq(axx[1],axx[2],length.out=axx[3]+1)
axy <- par("yaxp")
axy <- seq(axy[1],axy[2],length.out=axy[3]+1)
abline(v=axx, h=axy, col="lightgrey", lty=3) # light grid on the tick positions
} # close on PLOT
PLOT(main="Simple Linear Classification")
# (3) show means and mean line
# (4) calculate midpoint and orthogonal line
mid.point <- apply(means,2,mean)  # midpoint between the two group means
H <- c(-1,1)
m <- H %*% as.matrix(means)       # difference between the means, per coordinate
m <- m[2]/m[1]                    # slope of the line joining the means
inv.m <- -1/m                     # slope of the orthogonal line
# (5) Simple Linear Optimization
SLO <- function() {
lambda <- sqrt(as.matrix(means[1,]-means[2,]) %*% t(as.matrix(means[1,]-means[2,])))/2
w <- (means[2,]-means[1,])/(2*lambda)   # unit vector along the mean difference
w <- matrix(as.numeric(w))

wX <- A %*% w                           # project the sample onto w
wX <- data.frame(group=X$group,wX=wX)

h4 <- hist(wX$wX[wX$group==4],plot=F)
h2 <- hist(wX$wX[wX$group==2],plot=F)

# merge the two sets of breaks so both histograms share one x-axis
breaks <- sort(union(h4$breaks,h2$breaks))
counts4 <- c(rep(0,length(breaks)-length(h4$counts)-1),h4$counts)
counts2 <- c(h2$counts,rep(0,length(breaks)-length(h2$counts)-1))
mids <- sort(union(h4$mids,h2$mids))

h2$breaks <- h4$breaks <- breaks
h2$mids <- h4$mids <- mids
h4$counts <- counts4
h2$counts <- counts2

plot(range(breaks),c(0,max(counts4,counts2)),type="n",
     main="Simple Linear Optimization",ylab="Counts"
     ,xlab="Projection Values",ylim=c(0,max(counts4,counts2)))
rect(breaks[-length(breaks)],0,breaks[-1],counts4,col="#0000FF80") # blue group
rect(breaks[-length(breaks)],0,breaks[-1],counts2,col="#FF000080") # red group
} # close on SLO
SLO()
# (6) Fisher's Criterion
b <- as.numeric(solve(cov(A)) %*% as.numeric(means[1,]-means[2,]))
b0 <- -0.5*sum((as.numeric(means[1,])+as.numeric(means[2,]))*b) # centers the boundary between the means
Lx <- as.numeric(A %*% b) + b0
# (7) re-plot the sample
PLOT(main="Fisher's Criterion")
# (8) plot Fisher's decision boundary
x <- par("xaxp")
x <- seq(100*x[1],100*x[2],length.out=2) # two points far outside the plot range
b1 <- b[1]
b2 <- b[2]
y <- (-b0-b1*x)/b2                       # solves b1*x + b2*y + b0 = 0
lines(x,y,lty=2)
# (9) plot Projection Histogram
FC <- function() {
L4 <- hist(Lx[X$group==4],plot=F)
L2 <- hist(Lx[X$group==2],plot=F)

# merge the two sets of breaks so both histograms share one x-axis
Lbreaks <- sort(union(L4$breaks,L2$breaks))
Lcounts4 <- c(rep(0,length(Lbreaks)-length(L4$counts)-1),L4$counts)
Lcounts2 <- c(L2$counts,rep(0,length(Lbreaks)-length(L2$counts)-1))
Lmids <- sort(union(L4$mids,L2$mids))

L2$breaks <- L4$breaks <- Lbreaks
L2$mids <- L4$mids <- Lmids
L4$counts <- Lcounts4
L2$counts <- Lcounts2

plot(range(Lbreaks),c(0,max(Lcounts4,Lcounts2)),type="n",
     main="Fisher's Criterion",ylab="Counts"
     ,xlab="Projection Values",ylim=c(0,max(Lcounts4,Lcounts2)))
rect(Lbreaks[-length(Lbreaks)],0,Lbreaks[-1],Lcounts4,col="#0000FF80") # blue group
rect(Lbreaks[-length(Lbreaks)],0,Lbreaks[-1],Lcounts2,col="#FF000080") # red group
} # close on FC
FC()

Matching family names including typos

In order to build a database of relationships between credit card owners in my workplace, I came up with the following SAS code on a lazy day's work.

The problem, as I saw it, was that a lot (20%) of the bank accounts had the same family name for two different card holders, hence it is very reasonable to assume they are family. But what happens when GrinShtein is written as GreenShtein? An endless variety of typos obscured the true number of family members.

Given the fact that the bank accounts with more than 1 member were probably family related, I had to see which names were similar up to some metric.

I chose the Levenshtein distance as the metric to measure the dissimilarity between two names. Levenshtein's distance (L) counts the minimum number of edits (insertions, substitutions and deletions) needed to transform String1 into String2.

I’ve graded each pair of names with :


where c1 and c2 are the number of characters in each string.
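The original grading was done in SAS; as an R sketch, base R's adist() computes the Levenshtein distance. The exact normalization isn't shown above, so Grade = L / max(c1, c2) is my assumption:

```r
# Levenshtein-based grade for a pair of names (assumed normalization:
# Grade = L / max(c1, c2), so 0 = identical and values near 1 = very different).
name.grade <- function(s1, s2) {
  L  <- as.numeric(adist(s1, s2))  # base R generalized Levenshtein distance
  c1 <- nchar(s1)
  c2 <- nchar(s2)
  L / max(c1, c2)
}
name.grade("GrinShtein", "GreenShtein")  # 2 edits over 11 characters
```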

Next I had to use a macro to compare names of 3 and more members of a bank account. The macro’s flow is as follows :

  1. I used the %combo macro to make a list of pair indices to iterate over; these were written into two "+"-separated macro variables, v_1 and v_2.
  2. If the number of distinct ID holders in the bank account is K, then the number of iterations is N=K*(K-1)/2. In each iteration I graded a different pair of names and stored the result, according to a Boolean rule, in an array (the Pair array) of size N.
  3. For each bank account I calculated the number of similar names from the sum of the array and the number K, using the function min(K, sum(array)). Then I substituted 2 for 1, because a pair is made of two people. So the values the number of similar names receives are S=0,2,…,K.
  4. I calculated the homogeneity : H=S/K.

If H=0 then no names are similar; if H=1 then all names are similar up to the Grade threshold.
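The macro's steps 1–4 can be sketched in R (the threshold value and the grade normalization are illustrative assumptions, not the production SAS settings):

```r
# Homogeneity H of one bank account's card-holder names.
# Steps: all pairs -> grade -> Boolean array -> S = min(K, sum), 1 -> 2 -> H = S/K.
account.homogeneity <- function(names, threshold = 0.25) {
  K     <- length(names)
  pairs <- combn(K, 2)                         # K*(K-1)/2 index pairs
  grade <- function(s1, s2) as.numeric(adist(s1, s2)) / max(nchar(s1), nchar(s2))
  similar <- apply(pairs, 2, function(ix) grade(names[ix[1]], names[ix[2]]) <= threshold)
  S <- min(K, sum(similar))
  if (S == 1) S <- 2                           # a pair is made of two people
  S / K
}
account.homogeneity(c("GrinShtein", "GreenShtein"))  # similar pair -> H = 1
account.homogeneity(c("Cohen", "Levi"))              # unrelated names -> H = 0
```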

I used a threshold on both the Grade and max(c1,c2), since shorter names tend to be more volatile. I guess each language has its own typical typos and spelling errors in names. Hebrew, in my case, is prone to dropped and added vowels (which can result in the same name) and to substitutions of consonants that sound alike or sit close on the keyboard.

The results were excellent: thousands of solid households were discovered and added to the database.

The code for the program will be shown in a few days…

Finding out (fast) the classes of data.frame vectors

Sometimes it's useful to write down the classes of the vectors inside your data.frame objects, for documentation and for other people to use.

I’ve searched for a quick way to find out all the classes of vectors inside a data.frame.

Since I’ve found no reference for such a function/process I made one up.

I’d like to hear what people have to say about the following use of the “class” function on data.frames

a simple call :

> library(rpart) # comes with R
> data(kyphosis) # comes with rpart
> class(kyphosis)


Trying to use the "apply" function to find out the classes of the columns in the data.frame yields the following unwanted result:

> apply(kyphosis,2,class)
Kyphosis         Age      Number       Start 
"character" "character" "character" "character" 

The apply function returns "character" for all vectors regardless of their true content, because apply() first coerces the data.frame to a matrix, and a matrix holding mixed column types can only be a character matrix.
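The coercion can be seen directly (df here is a throwaway example): apply() runs as.matrix() on its input, and a matrix holding both numbers and strings can only be of type character.

```r
# A mixed-type data.frame collapses to a character matrix under as.matrix(),
# which is what apply() sees.
df <- data.frame(a = 1:2, b = c("x", "y"))
typeof(as.matrix(df))       # "character"
apply(df, 2, class)         # both columns report "character"
```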

Anyhow, after some thought I’ve come up with the following function :

> allClass <- function(x) {unlist(lapply(unclass(x),class))}
> allClass(kyphosis)
Kyphosis       Age    Number     Start 
 "factor" "integer" "integer" "integer"

Compact, fast and quite useful. Of course the control flow needs more work to fit other classes and recognize when x is not a data.frame.
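For what it's worth, the same result can be had with sapply(), which is lapply() plus simplification, so the unclass() step isn't needed:

```r
library(rpart)              # for the kyphosis data, as above
data(kyphosis)
sapply(kyphosis, class)     # Kyphosis: "factor", the rest: "integer"
```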

Comments are welcome.

Color choosing in R made easy

I don’t know about you, but when I want to make a graph in R, I handpick the colors, line widths etc… to produce awesome output.

A lot of my time is spent choosing colors, so I had to find a more convenient way of doing so. Earl F. Glynn's "Chart of R colors", posted online, gave me the idea for the following function.

R has an internal function called colors(); its output is a character vector of 657 reserved color names.

The names can be used directly, as in:

[Figure: Iris attributes in orange]

Alternatively, they can be referred to by their index in colors():

[Figure: Iris attributes in orange red]

The color() function has two arguments (notice it is not plural… I chose this unreserved name because it's easy to remember):

  1. plot.it – FALSE by default
  2. locate – 0 by default

The call color() is the same as colors().

> color()[555]==colors()[555]
[1] TRUE

Calling color(T) gives the following output (very similar to Earl F. Glynn's "Chart of R colors"):

You can choose and remember the numbers, or print the chart and stick it above your working area… but the following makes color() more useful:

Specifying locate = k > 0 will plot the chart above, and this time a loop over the locator() function lets you pick the colors you want. After choosing k colors, the output is a k-long character vector of the chosen colors.

> color(T,5)
[1] "firebrick4" "grey9"      "green"      "gray99"     "khaki1"

You can use it directly in a plot function: the palette of colors will be plotted first, you choose your colors, and the plot you called for will follow:


The function is given by :

color    <- function (plot.it=FALSE, locate=0) { # first argument's name is reconstructed
   if(plot.it==FALSE & locate==0) {
   return(.Internal(colors())) # so far, not different from colors()
   } # close on if
   else {
   # a 26 x 26 grid of rectangles on the unit square, one cell per color
   ytop    <- rep(seq(1/26,1,by=1/26),each=26)[1:657]
   ybottom <- rep(seq(0,1-1/26,by=1/26),each=26)[1:657]
   xleft   <- rep(seq(0,1-1/26,by=1/26),times=26)[1:657]
   xright  <- rep(seq(1/26,1,by=1/26),times=26)[1:657]
   pall    <- round(col2rgb(colors())/256)
   pall    <- colSums(pall) ; pall2 <- character(0)
   pall2[pall>0]   <- "black" # black labels on light colors
   pall2[pall==0]  <- "white" # white labels on black cells

   plot.new()
   rect(xleft,ybottom,xright,ytop,col=colors())
   title(main="Palette of colors()")
   text(x=(xleft+xright)/2, y=(ybottom+ytop)/2, cex=0.4, col=pall2
   ,labels = 1:657)
   } # close on else

   if(locate==0) print("Palette of colors()")
   else {
   colmat    <- matrix(c(1:657,rep(NA,26^2-657)),byrow=T,ncol=26,nrow=26)
   cols      <- NA
   i         <- NA
   for(i in 1:locate) {
   h    <- locator(1)
   if(any(h$x<0,h$y<0,h$x>1,h$y>1)) stop("locator out of bounds!")
   cc        <- floor(h$x/(1/26))+1
   rr        <- floor(h$y/(1/26))+1
   cols[i]   <- .Internal(colors())[colmat[rr,cc]]
   } # close on i
   return(cols)
   } # close on else
   } # close on function

You can also write it to variable for further use:

> cols<- color(T,5)
> cols
[1] "magenta"        "orange3"        "palevioletred2" "seagreen4"     
[5] "seagreen2"

Of course it's not perfect; I still haven't solved issues of working with devices such as pdf(), jpeg() and the like…

Your comments are welcome.