Posts Tagged ‘statistics’

Sorting rated content

February 14th, 2009

Anything with user-submitted ratings can be sorted by those ratings, but what is the best way? The naive approach is to sort by average rating. Unfortunately, this ranks an item with a single 5-star rating above an item that averages 4.5 stars across 150 ratings.

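This guy has an answer. In short, his recommendation is to sort by the lower bound of the 95% confidence interval for the true fraction of positive ratings (the Wilson score interval). The interval is given by the formula below, and the lower bound takes the minus sign: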

$$\frac{\hat{p} + \frac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p}) + \frac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \frac{z_{\alpha/2}^2}{n}}$$

where

  • $\hat{p}$ is the observed fraction of positive ratings
  • $z_{\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution
  • $n$ is the total number of ratings

Or, in Ruby:

require 'statistics2'

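def ci_lower_bound(pos, n, power)
    # With no ratings there is nothing to estimate, so return the lowest score
    return 0 if n == 0
    # z-score for the two-sided interval, about 1.96 when power = 0.05
    z = Statistics2.pnormaldist(1 - power / 2.0)
    # Observed fraction of positive ratings
    phat = pos.to_f / n
    # Lower bound of the confidence interval (the minus branch of the formula)
    (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat) + z*z/(4*n))/n)) / (1 + z*z/n)
end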

where

  • pos is the number of positive ratings
  • n is the total number of ratings
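  • power is the significance level α of the interval, not statistical power despite the parameter name (0.05 recommended, which gives a 95% confidence interval)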
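To see what this does to the example from the top of the post, here is a small self-contained sketch. It makes a few assumptions of its own: ratings are reduced to positive/negative, z is hard-coded to roughly 1.96 (the 95% case) so that the statistics2 gem is not needed, and the method name wilson_lower_bound and the two items are made up for illustration.

```ruby
# Lower bound of the Wilson score interval.
# z defaults to an approximate 0.975 normal quantile (assumption: 95% interval).
def wilson_lower_bound(pos, n, z = 1.959964)
  return 0.0 if n == 0
  phat = pos.to_f / n
  (phat + z * z / (2 * n) - z * Math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)
end

# Hypothetical items: [name, positive ratings, total ratings]
items = [
  ["one perfect rating", 1, 1],
  ["90% positive from 150 users", 135, 150]
]

# Sort descending by the lower bound instead of by the raw average
ranked = items.sort_by { |_name, pos, n| -wilson_lower_bound(pos, n) }
ranked.each do |name, pos, n|
  puts format("%-30s %.3f", name, wilson_lower_bound(pos, n))
end
```

The single perfect rating scores roughly 0.21 while the item with 150 ratings scores roughly 0.84, so the well-rated, well-sampled item now sorts first.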

Aidan Findlater Impersonal

Calculating the area under the normal curve in Ruby

July 24th, 2007

Summary: Attached is a pure Ruby implementation of the AS66 algorithm (Hill 1973), ported from the Fortran code available here. It estimates the integral of the normal distribution, defaulting to the area under the right tail.
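For a quick check of what such a function should return, the tail area can also be sketched with Math.erfc from Ruby's standard library. This is a different technique from the AS66 port, and the method name and the upper: keyword are my own assumptions, not the attached code's interface.

```ruby
# Area under the standard normal curve in one tail of x.
# Uses the complementary error function, not the AS66 algorithm.
def normal_tail_area(x, upper: true)
  tail = 0.5 * Math.erfc(x / Math.sqrt(2))
  upper ? tail : 1.0 - tail
end

puts normal_tail_area(1.96)                # right tail beyond 1.96
puts normal_tail_area(1.96, upper: false)  # area to the left of 1.96
```

As with the summary above, the sketch defaults to the right tail, so a z-score of 1.96 gives an area of about 0.025.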

Aidan Findlater Impersonal

Excel macro for p-values

July 4th, 2006

This macro calculates the upper tail of the cumulative hypergeometric distribution (the probability of drawing at least the given number of successes), using Excel’s built-in HypGeomDist function:

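Public Function CumHypGeom(sample_s As Long, number_sample As Long, _
            population_s As Long, number_pop As Long) As Double
'   Returns the upper-tail cumulative hypergeometric probability
'   P(X >= sample_s), i.e. a one-sided p-value
    Dim RetVal As Double
    Dim i As Long
    Dim max_s As Long
'   Successes in the sample cannot exceed the sample size or the
'   number of successes in the population
    max_s = WorksheetFunction.Min(number_sample, population_s)
    RetVal = 0
    For i = sample_s To max_s
        RetVal = RetVal + WorksheetFunction.HypGeomDist(i, number_sample, population_s, number_pop)
    Next i
    CumHypGeom = RetVal
End Function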

To use it, open Tools > Macro > Visual Basic Editor, choose Insert > Module, and paste the above code into the editor window. Save, return to Excel (using the “View Microsoft Excel” button), and then you can call it from a cell like any other function, e.g. =CumHypGeom(2, 5, 10, 50).

The function’s parameters are the same as the built-in HypGeomDist function. For a population of red and blue balls, where red balls are considered a success, they are:

  • sample_s: number of successes (i.e. red balls) in the sample
  • number_sample: sample size (i.e. number of balls drawn from the total population)
  • population_s: number of successes (i.e. red balls) in the population
  • number_pop: population size (i.e. total number of both red and blue balls)
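For anyone who wants to check the macro's output outside Excel, the same sum can be sketched in Ruby. This is only an illustration under my own assumptions: the method names choose and cum_hyp_geom are made up, and the probabilities are computed from exact integer binomial coefficients rather than by calling anything like HypGeomDist.

```ruby
# Binomial coefficient C(n, k) using exact integer arithmetic.
def choose(n, k)
  return 0 if k < 0 || k > n
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

# P(X >= sample_s) for a hypergeometric draw; parameter names follow the macro.
def cum_hyp_geom(sample_s, number_sample, population_s, number_pop)
  lowest  = [sample_s, 0].max
  highest = [number_sample, population_s].min
  # Count the ways to draw i successes and (number_sample - i) failures
  ways = (lowest..highest).sum do |i|
    choose(population_s, i) * choose(number_pop - population_s, number_sample - i)
  end
  ways.fdiv(choose(number_pop, number_sample))
end

# Hypothetical example: 10 red balls among 50, draw 5, at least 2 red.
puts cum_hyp_geom(2, 5, 10, 50)
```

Asking for at least zero successes should always give 1, which makes a handy sanity check for both versions.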

Edit: I’d just like to state for the record that this is probably the least efficient way to do the calculation, but it was the easiest to program. In the end, I sacrificed CPU cycles to conserve brain cycles.

Aidan Findlater Impersonal