Summary: Attached is a pure Ruby implementation of the AS66 algorithm (Hill 1973), ported from the Fortran code available here. It estimates the integral of the normal distribution, defaulting to the area under the right tail.
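For a quick sanity check of tail areas like the ones described above, here is a minimal sketch. It is not the AS66 port: it swaps in Ruby's built-in Math.erfc, and the method name normal_tail and its upper: keyword are made up for illustration.

```ruby
# Tail area of the standard normal distribution via the complementary
# error function. NOT the AS66 algorithm; the name and signature are
# hypothetical and chosen only for this illustration.
def normal_tail(z, upper: true)
  # Upper tail: P(Z > z) = 0.5 * erfc(z / sqrt(2)).
  # Lower tail: P(Z < z) = P(Z > -z) by symmetry, which avoids
  # the precision loss of computing 1 - P(Z > z).
  x = upper ? z : -z
  0.5 * Math.erfc(x / Math.sqrt(2.0))
end

puts normal_tail(1.96)                # right tail by default
puts normal_tail(1.96, upper: false)  # left tail
```

Defaulting to the right tail mirrors the behaviour described in the summary; pass upper: false for the left tail.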
Aidan Findlater Impersonal bioinformatics, research, ruby, statistics
Summary: Attached is a diff that allows Bio::FlatFileIndex to access BDB flatfile databases created by BioPerl’s Bio::DB::Flat. I have not changed the way BioRuby creates its databases, so this likely breaks access to BioRuby-created flatfile indices.
Aidan Findlater Impersonal bioinformatics, perl, research, ruby
Quickest way to set up a local GO DB with MySQL:
- Grab latest go-YYYYMM-seqdblite-tables.gz from here.
- tar -zxvf go-YYYYMM-seqdblite-tables.gz
- cd go-YYYYMM-seqdblite-tables
- echo "create database mygo" | mysql -uroot -p
- cat *.sql | mysql -uroot -p mygo
- mysqlimport -L -uroot -p mygo *.txt
This is listed right on that same download page, but somehow I always forget that.
Aidan Findlater Impersonal bioinformatics, mysql
This macro calculates the cumulative hypergeometric distribution for the given values, using Excel’s built-in HypGeomDist function:
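Public Function CumHypGeom(sample_s As Long, number_sample As Long, _
        population_s As Long, number_pop As Long) As Double
    ' Returns the upper-tail cumulative hypergeometric probability (i.e. p-value):
    ' the chance of drawing sample_s or more successes.
    Dim RetVal As Double
    Dim i As Long
    Dim upper As Long
    ' HypGeomDist raises an error if successes exceed population_s, so cap the loop there.
    upper = WorksheetFunction.Min(number_sample, population_s)
    RetVal = 0
    For i = sample_s To upper
        RetVal = RetVal + WorksheetFunction.HypGeomDist(i, number_sample, population_s, number_pop)
    Next i
    CumHypGeom = RetVal
End Function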
To use it, go to Tools > Macro > Visual Basic Editor, then Insert > Module, and paste the above code into the module window. Save, return to Excel (using the “View Microsoft Excel” button), and then you can call it from a cell like any other function.
The function’s parameters are the same as the built-in HypGeomDist function. For a population of red and blue balls, where red balls are considered a success, they are:
- sample_s: number of successes (i.e. red balls) in the sample
- number_sample: sample size (i.e. number of balls drawn from the total population)
- population_s: number of successes (i.e. red balls) in the population
- number_pop: population size (i.e. total number of both red and blue balls)
Edit: I’d just like to state for the record that this is probably the least efficient way to do the calculation, but it was the easiest to program. In the end, I sacrificed CPU cycles to conserve brain cycles.
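To cross-check the macro's output outside Excel, here is a rough Ruby sketch of the same upper-tail sum. It is an illustration rather than a port of the macro: the helper names choose and cum_hyp_geom are made up, and it counts outcomes with exact integer binomial coefficients instead of calling HypGeomDist.

```ruby
# Binomial coefficient C(n, k) in exact integer arithmetic.
# Hypothetical helper, introduced only for this sketch.
def choose(n, k)
  return 0 if k < 0 || k > n
  k = [k, n - k].min
  # Each intermediate value is itself a binomial coefficient,
  # so the integer division is always exact.
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

# Probability of drawing sample_s or more successes.
# Parameter names follow the macro; the method name is an assumption.
def cum_hyp_geom(sample_s, number_sample, population_s, number_pop)
  upper = [number_sample, population_s].min
  total = choose(number_pop, number_sample)
  hits = (sample_s..upper).sum do |i|
    choose(population_s, i) *
      choose(number_pop - population_s, number_sample - i)
  end
  Rational(hits, total).to_f
end

# 10 balls, 4 of them red, draw 3: chance of at least one red ball.
puts cum_hyp_geom(1, 3, 4, 10)
```

Summing integer counts and dividing once at the end keeps rounding error out of the loop, at the cost of building large integers for big populations.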
Aidan Findlater Impersonal bioinformatics, excel, statistics