Saturday, November 14, 2009

The "Gang of Four" in Population Genetics

The following are classic theories in the field of population genetics.

Which one is your favorite?

Friday, October 30, 2009

Decrease in TAIR funding

Although I am not working on Arabidopsis, I think this is still one of the best-managed plant databases out there.

Read more from the TAIR homepage and comments from the community.

Monday, August 24, 2009

High GC grass genes

Several studies have noted that there are two classes of genes in grass genomes, giving a bimodal distribution when the GC content of the genes is plotted. The following figure is taken from the Lescot 2008 Musa paper.

Please note that the eudicot (Arabidopsis) pattern is quite different from that of grasses (rice). In fact, some rice genes have third codon positions that are virtually all G or C. The third codon positions (largely synonymous positions) have more freedom to change because most substitutions there don’t alter the encoded amino acid.
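As a rough illustration of the quantity discussed here, the GC content at third codon positions (often called GC3) can be computed as in the sketch below. The function name `gc3` and the two toy sequences are made up for this example; they are not taken from rice or from the Lescot paper.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C among the third codon positions of a coding sequence.

    Assumes the sequence starts in frame; a trailing partial codon
    contributes no third position and is therefore ignored.
    """
    third = cds.upper()[2::3]  # every third base, starting at index 2
    if not third:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


# Hypothetical toy sequences, for illustration only.
high_gc3 = "ATGGCCGCGTTC"  # codons ATG GCC GCG TTC
low_gc3 = "ATGGCAGCTTTT"   # codons ATG GCA GCT TTT
print(gc3(high_gc3), gc3(low_gc3))
```

Plotting a histogram of this value over all genes in a genome is one way to look for the bimodal pattern described above.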

For many applications, we need to calculate the Ks distance (substitutions per synonymous site) as a rough estimate of how divergent two genes have become. Most calculations are based on an evolutionary model of nucleotide substitution. With the abnormal gene class noted above, many of the assumptions that simple models make are no longer valid. In particular, let us assume the JC69 model, which is what Nei and Gojobori’s Ks calculation method is based on. When G<->C changes are the only allowable changes, the assumption of equal rates of change from G to each of the other three nucleotides no longer holds.
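For reference, here is a minimal sketch of the standard JC69 distance correction. It shows how the corrected distance grows without bound as the observed proportion of differing sites approaches 75%. The function name `jc69_distance` is an assumption for this example and is not part of PAML.

```python
import math


def jc69_distance(p: float) -> float:
    """JC69 distance from p, the observed proportion of differing sites.

    The formula d = -3/4 * ln(1 - 4p/3) is only defined for p < 0.75,
    the level of difference expected between two random sequences.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is undefined for p outside [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The correction is small for similar sequences and explodes near 0.75.
for p in (0.1, 0.5, 0.7, 0.74):
    print(p, jc69_distance(p))
```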

Several published studies, and my own past observations, indicate that when calculating Ks for grass genes, Nei & Gojobori’s method underestimates the Ks values while PAML overestimates them. In practice, we noted that the yn00 program in the PAML package, when calculating Ks values between pairs of grass genes, often returned values larger than 2. Further investigation revealed that some results are simply correlated with the protein length!

I looked very briefly at the PAML code, did a bit of debugging, and found the following in yn00.c.

int DistanceF84(double n, double P, double Q, double pi[],
double*k_HKY, double*t, double*SEt)
{
/* This calculates kappa and d from P (proportion of transitions) & Q
(proportion of transversions) & pi under F84.
When F84 fails, we try to use K80. When K80 fails, we try
to use JC69. When JC69 fails, we set distance t to maxt.
Variance formula under F84 is from Tateno et al. (1994), and briefly
checked against simulated data sets.
*/


So the code tries to fit the data using several substitution models, and when all of them fail, it switches to a default. So what is the default?



if(failK80) {
   if((P+=Q)>=.75) { failJC69=1; P=.75*(n-1.)/n; }
   *t = -.75*log(1-P*4/3.);
   if(*t>maxt) *t=maxt;
   if(SEt) {
      *SEt = sqrt(9*P*(1-P)/n) / (3-4*P);
   }
}


Let us look at this code. At this point P has had Q added to it, so it holds the total proportion of differing sites (transitions plus transversions). If that total is 75% or more, the sequences look completely randomized: under JC69 the expected proportion of differences saturates at 75%, so the argument of the logarithm would be zero or negative. In this case the JC69 model fails, fair enough. The code then forces P to be slightly less than 75%.


$P = \frac{3}{4} \cdot \frac{n-1}{n}$, $t = -\frac{3}{4} \ln\left(1 - \frac{4}{3} P\right)$

With a bit of math you can get


$t = \frac{3}{4} \ln n$

Now you can see that this becomes a very simple function of the sequence length n. I could imagine that for the grass calculation, the three models it tries fail quite often, so this approximation gets used. With n equal to 333 (for a 1 kb gene), you get a Ks of roughly 4 (0.75 × ln 333 ≈ 4.36). That is why, for a significant portion of the whole dataset, we have a distribution of Ks values that simply depends on the sequence length.
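The length dependence can be checked numerically with a small Python sketch that mirrors only the fallback branch quoted from yn00.c. The function name `jc69_fallback_distance` is made up for this example, and the cap at maxt is left out because its value does not appear in the excerpt.

```python
import math


def jc69_fallback_distance(n: float) -> float:
    """Distance returned when P + Q >= 0.75 and P is reset to 0.75 * (n - 1) / n.

    Mirrors the quoted branch only; the maxt cap is deliberately omitted.
    """
    p = 0.75 * (n - 1.0) / n
    # 1 - 4p/3 equals 1/n, so the result is algebraically 0.75 * ln(n).
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Compare the fallback value with 0.75 * ln(n) for a few sequence lengths.
for n in (100, 333, 1000):
    print(n, jc69_fallback_distance(n), 0.75 * math.log(n))
```

The two printed columns should agree up to floating-point error, and neither depends on the actual substitutions observed, only on n.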

I will emphasize here that this post is not a criticism of PAML. I love PAML and use it heavily. But it is important to understand that when unexpected substitution patterns occur (such as the GC bias in grass genes), evolutionary models can fail miserably.

Wednesday, August 05, 2009

TKF91 model on a tree

This is a bioinformatics-related animation on YouTube. It says that the animation was made by "hacking" RASMOL. It is simply amazing and educational at the same time. Naturally, when I looked at the name of the YouTube author, it was Ian Holmes.

Thursday, July 23, 2009

Facebook Puzzle Master

I just discovered this yesterday and submitted one solution to the test problem. The judge-bot checks the submissions every four hours, so this is unlike some of the online judges (POJ, USACO) that I worked on before. The problems are more realistic, rather than contrived for competition purposes. One particular problem that I glanced through yesterday, the Facebull problem, seems NP-hard (in computer terms, bloody hard), so you cannot solve it in reasonable time for the general case. On top of that, unlike the problems on the online judges, it states neither the scale of the problem nor the time/memory limits. The judge-bot accepts popular programming languages. It is always inspiring to see the snake!

But the problem I submitted yesterday was judged correct, so I will look at some more problems this evening. I like the idea of online problem listings, and hopefully they will not be limited to programming. Similar ideas I’ve seen so far cover mathematics/computing and chemistry/biology. These are not simple problems, but incentives are offered.

Wednesday, July 22, 2009

Plants that I had when I was a young boy

There are many indigenous plants that are local varieties of common domesticated plants, but when I was young I didn’t know that. The two plants that I will show below are perhaps native only to the eastern part of China where I lived. The names are best pronounced with a local Shanghainese accent. Some of the pictures shown are taken from random places on the internet.

甜卢粟 (sweet sorghum or sugar sorghum; Sorghum bicolor): when I was young, I used to eat this a lot. It is a variety of sorghum with a high sugar content in the stalk, so it is a bit like sugarcane, except that sweet sorghum has a unique aroma (and flavor). The plants grow fast; my mom used to grow them next to the corn plants, and the sorghum never needed any fertilizer or other attention. The panicles (the seed-bearing part) can be cut and bundled to make a broom. Therefore the seeds must be non-shattering… well, I digress. There is some recent interest in developing this into an energy crop, since it has lower water requirements than sugarcane. See paper here.


癞葡萄 (bitter melon; Momordica charantia): we didn’t grow these. The skin is yellow while the kernels are red and sweet (but not too sugary). When we bought vegetables at the market, some farmers would give these to kids (aka me) for free. It turns out that it is in the same genus as the bitter melon, even the same species. I have always wondered whether the one on the left is at a different developmental stage from the one on the right (which is the bitter melon we eat as a vegetable). But the resemblance is obvious.

Sunday, July 05, 2009

Mozilla Firefox 3.5 is here

I tried it out… noticeably fast (a rather informal impression, considering that the network traffic is perhaps lower). In addition, <video> tags are now supported, which means you don’t need a Flash plugin to stream videos. The Firefox people keep adding new things; <canvas>, for example, is not yet supported by IE.