Several work in the past noted that there are two classes of genes in the grass genome, giving bimodal distribution when plotting the GC content of various genes. The following figure is taken from Lescot 2008 Musa paper.
Please note that the eudicot (Arabidopsis) pattern is quite different from grasses (rice). In some rice genes, in fact there are genes that have the third-codon position virtually all G or C. The third codon positions (also known as synonymous positions) have more freedom to change because the substitutions don’t alter the amino acid composition.
From many applications, we need to calculate the Ks distance (substitutions per synonymous positions) as a rough estimate of how divergent two proteins become. Most calculations are based on an evolutionary model of nucleotide bases. With the abnormal gene class noted above, much of the assumptions that simplistic models are no longer valid. In particular, let us assume the JC69 model, which is what Nei and Gojobori’s Ks calculation method is based on. When there are only G<->C changes as allowable transitions, the assumption of equal transition from G to other three nucleotides no longer hold.
Several published work and I noted in the past that when calculating the Ks for the grass genes, Nei & Gojobori’s method under-estimates the Ks values while PAML over-estimates the Ks values. In practice, we noted that yn00 program in the PAML software, when calculating Ks values between pairs of grass genes, often resulted in values larger than 2. Some further investigations reveal that some results are simply correlated with the protein length!
I looked very briefly at the PAML codes and did a bit debugging and found the following code in yn00.c.
int DistanceF84(double n, double P, double Q, double pi[],
double*k_HKY, double*t, double*SEt)
{
/* This calculates kappa and d from P (proportion of transitions) & Q
(proportion of transversions) & pi under F84.
When F84 fails, we try to use K80. When K80 fails, we try
to use JC69. When JC69 fails, we set distance t to maxt.
Variance formula under F84 is from Tateno et al. (1994), and briefly
checked against simulated data sets.
*/
So the code tries to fit the data using several substitution model, and when all fails, switch to the default, so what is the default?
if(failK80) {
if((P+=Q)>=.75) { failJC69=1; P=.75*(n-1.)/n; }
*t = -.75*log(1-P*4/3.);
if(*t>maxt) *t=maxt;
if(SEt) {
*SEt = sqrt(9*P*(1-P)/n) / (3-4*P);
}
} Let us look at this code. If transitions (P) plus transversions (Q) are larger than 75%, which means it really is getting completely randomized to the point that it is even more likely to get substituted than not. Under this case, JC69 model fails, fair enough. Then the code enforces the transitions to be slightly less than 75%.

,

With a bit of math you can get

Now you can see that this becomes a very simple function of the sequence length n. I could imagine that for the grass calculation, the three models it tried get thrown off quite often so that this approximation gets called. With n equals to 333 (for a 1kb gene), you have Ks roughly equal 4. That why for a significant portion of the whole dataset, we have a distribution of Ks values that is simply dependent on the sequence length.
I will emphasize here that this post is not critic to the
PAML. I love PAML and use it heavily. But it is important to understand when some unexpected substitutions (such as the grass GC rage) occur, evolutionary models will fail miserably.