A note about the estimating the number of visits to a site

Motivation

How many times do users visit your site? A very relevant question in web analysis.
One major issue that biases your estimation is the users who save no cookies, these users change their ID on every page visit so they look like 1 time visitors and thus inflate the proportion of 1 in the distribution. This has an enormous effect on the bounce rate you calculate to your site.
Another minor issue, which is a by product of the solution I’m about to offer, is that most distribution functions have a support of x>=0, while visits start at 1. Read on to see the suggested solution.

I’ve stumbled upon this issue a few times and would like to suggest the following solution to estimate the latent proportion of 1 time visitors via maximum likelihood estimation.

Assumptions

  1. There exists a constant proportion (P) of users who don’t save cookies.
  2. The number of visits to the site by users who do save cookies (X) follows some known distribution function (f) who’s parameters are unknown. The distribution is truncated at 0, as we will never see the users who visit 0 times. The CDF for f is denoted as F.

We define Y as the distribution of visits to your site :

Pr(y=1) = P + (1-P) * f(1) / (1-F(0))

For Y=y other than 1 :

Pr(Y=y) = (1-P) * f(y) / (1-F(0))

You might ask yourself why f(y) is ¬†weighted with 1/(1-F(0)), since we are looking at the truncated version of the distribution this weighting factor is essential to keep the integral of the truncated distribution at 1. I’ll leave you to figure out the math.

Example

Here is an example when assuming the distribution f is Poisson.

first we construct the likelihood function:

Ind <- function(l) {
 if(is.logical(l)) as.numeric(l)
 else stop('"l" must be logical')
 }
fy <- function(x, lambda, log = FALSE) {
 if(!log) { Ind(x>=1) * dpois(x, lambda) / ppois(0, lambda, lower.tail = FALSE)}
 else { Ind(x>=1) * ( dpois(x, lambda, log=TRUE) - ppois(0, lambda, lower.tail = FALSE, log=TRUE))}
 }
Lik <- function(par,x) {
 p <- par[1]
 lambda <- par[2]
 n1 <- sum(x==1)
 n0 <- sum(x!=1)
 x0 <- x[x!=1]
n1 * log(p + (1-p) * fy(1,lambda)) + n0 * log(1-p) + sum(fy(x0,lambda,log=TRUE))
 }

Let’s create some data :

x <- c(rpois(5000,5),rep(1,600)) # a sample of 5000 values from a Poisson distribution with lambda = 5, artificially inflated with 600 1s.
par <- c(mean(x==1),mean(x)) # starting values for the estimation process
parOptim <- optim(par,fn=Lik,x=x,method="Nelder-Mead",control=list(fnscale=-1)) # using optim() to find the MLE
parOptim$par[1]
[1] 0.1092563

This means the estimated P is 10.9% of inflated one time visitors. Be sure to remove these guys from the distribution of visitors you get. But be sure to test a few distributions and find a way to judge between them prior to estimating the final numbers.

Good luck.

About these ads

3 thoughts on “A note about the estimating the number of visits to a site

  1. I like this idea, but it would be much improved with the inclusion of real web data.

    • Ian, thanks for commenting. Yes, this does need to be tested on some real data, I have tested it for Poisson only, but goodness of fit was weak.
      Hopefully I’ll publish some results soon enough.

      Do you happen to have any insights on this matter?

  2. […] Aviad Klein Motivation How many times do users visit your site? A very relevant question in web analysis. One […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s