The Plant Informatics

Rod Page’s VIZBI 2011 annotated links

2011-10-16T14:52:00.001-04:00

iPhylo, written by Rod Page, is hands-down my favorite blog. In the blog post “Some VIZBI 2011 links”, Rod compiled a list of visualization software that he noted. I stole the whole list and am posting it here with my own annotations. There are some good ones that I would like to use in the future, so this post is merely a reminder and click-saver.

Arena 3D - network in 3D
BioBlender - cool, artistic protein structure rendering
BLAST Atlas - COG results in 3D
Cerebral v.2.0 - cytoscape plugin
Chromograms - visualize text editing
CloVR - microbial / metagenomics seq analysis using cloud computing
Cytoscape - need-i-say-more
EMAGE - more of a database than standalone tool
embedded Python Molecular Viewer (ePMV) - protein structure
Fleshmap - desire and body
GenomeView - like Tablet, nextgen genome viewer
History Flow - track history of edits, see also Chromograms above
HivePlots - novel visualization method called hive plots, didn't like it though
ImageVis3D - image viewer
JProfileGrid - edit MSA
Many Eyes - text cloud, maps, etc.
Molecular Maya
Molecular Movies
PathBLAST - query PPI networks
Phyloviewer - from iPlant, biggish tree viewer
Powerwall - tumor slides
Quartz Composition of DNA - on Mac only
Scribl - lightweight, nice module for building genome browsers
Sequence Surveyor - show sequence compositions along phylogenetic tree
Shape of Song - not sure why it is in this list
Sybil - from TIGR/JCVI, comparative genomics tool
Topiary Explorer - useful in microbial ecology studies
Virtual Worm - as the name suggests
Web Seer - visualize google search results
Whole Brain Catalog

Why 10,000 heads from 20,000 tosses suggests fake data?

2010-11-09T23:23:00.001-05:00

I have had this statistical question in mind for some time now. The reason I am interested in it is that when we see some data in a biology paper, sometimes the data are just “too good to be true”. This by itself is not a sufficient reason to discredit a sound paper, but I have always wondered how we can quantify our confidence when seeing such data.

I formulated my question in terms of coin tosses and asked on CrossValidated. Let me just repost it here.

Let's say we are repeatedly tossing a fair coin, and we know number of heads and tails should be roughly equal. When we see a result like 10 heads and 10 tails for a total of 20 tosses, we believe the results and are inclined to believe the coin is fair.
Well when you see a result like 10000 heads and 10000 tails for a total of 20000 tosses, I actually would question the validity of the result (did the experimenter fake the data), as I know this is more unlikely than, say a result of 10093 heads and 9907 tails.
What is the statistical argument behind my intuition?

I got my answer back very quickly and someone suggested I had implicitly invoked Bayes theorem. It was nicely explained. After that answer, I did some thinking and calculation and was able to calculate the probability using an explicit model.

From prior experience, I know a data faker would forge data that are very close to the “expected” value, i.e. with small variance. I will simply model this as Normal(mean=10000, sd=10). Meanwhile, I know the actual coin tosses would follow a binomial distribution, Binom(size=20000, p=0.5).

Note that the fake data has a much sharper peak (low variance) around 10000 – well that’s what fakers do. Let’s say now I have a prior belief of P(fake)=0.1, i.e. before seeing the data, I think the data is forged 10% of the time.

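I now have everything I need for the Bayes calculation. The formula is:

P(fake | x) = P(x | fake) P(fake) / [ P(x | fake) P(fake) + P(x | real) (1 - P(fake)) ]

where x is the observed number of heads.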

I left all the calculations to R, in a calc_prob() function.

> calc_prob(10000)
[1] 0.4399905
> calc_prob(10093)
[1] 3.088808e-19

Seeing 10000 heads, the posterior probability of the data being rigged is 0.44 (recall that my prior belief was only 0.1). In contrast, seeing 10093 heads, the posterior probability is effectively zero. After all, who would rig data that’s not pretty?
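For readers without R, the same calculation can be sketched in Python. This is not the original calc_prob() source, only a re-implementation under the model stated above: faked counts follow Normal(mean=10000, sd=10), honest tosses follow Binom(20000, 0.5), and the prior is P(fake)=0.1. The helper names normal_pdf and binom_pmf are made up for this sketch.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of Normal(mean, sd) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials, computed in log space
    so that the huge binomial coefficient does not overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1.0 - p))
    return math.exp(log_pmf)


def calc_prob(heads, n=20000, prior_fake=0.1, fake_sd=10.0):
    """Posterior probability that the data were faked, given the head count."""
    like_fake = normal_pdf(heads, n / 2.0, fake_sd)   # assumed faker model
    like_real = binom_pmf(heads, n, 0.5)              # honest fair-coin model
    numerator = like_fake * prior_fake
    return numerator / (numerator + like_real * (1.0 - prior_fake))


if __name__ == "__main__":
    print(calc_prob(10000))
    print(calc_prob(10093))
```

The answer is sensitive to the assumed sd of the faker model and to the prior, so the numbers are best read as an illustration of the argument rather than a fraud detector.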

R source code is attached here.

Gambler’s ruin

2010-08-01T18:32:00.001-04:00

There is a common casino strategy when playing roulette:

Bet on either odd or even, and stick to your choice. Bet $20 at start, if you lose, double the bet; if you win, start with the $20 bet again. Don’t worry about losing, just keep going.

Let’s assume you have $10000. If you adopt this strategy, a given losing streak bankrupts you only ~1/500 of the time: that takes 9 losses in a row (2**9=512, treating each bet as a 50/50 chance), after which you have no money left. But if you win at any point, you are back where you were, plus $20 more. For example, say you lose 3 times and then win: that’s -20-40-80+160 = $20.

This simulation (python script) suggests that if you follow this strategy for a few rounds (say 50), more than 90% of the people will make money, with few people bankrupt.

$ python casino_strategy.py
Simulating 50 rolls per trial and generate 10000 trials
mean: 10006,  median: 10480
93% of the samples earned money
3% of the samples got bankrupt
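The original casino_strategy.py is not reproduced here, so the following is only a minimal sketch of such a simulation. It assumes each bet is a 50/50 coin flip (real roulette is slightly worse because of the zero pocket), and that a player who cannot cover the doubled bet stakes whatever is left. The function names are made up for illustration, and the percentages it prints will not match the run above exactly.

```python
import random


def play_sequence(outcomes, bankroll=10000, base_bet=20):
    """Apply the doubling strategy to a fixed sequence of win/loss outcomes."""
    bet = base_bet
    for won in outcomes:
        if bankroll <= 0:
            break                      # bankrupt: an absorbing state
        stake = min(bet, bankroll)     # assumption: go all-in when short
        if won:
            bankroll += stake
            bet = base_bet             # start over with the base bet
        else:
            bankroll -= stake
            bet *= 2                   # double after every loss
    return bankroll


def simulate(trials=10000, rolls=50, p_win=0.5, seed=0):
    """Final balances of many independent players."""
    rng = random.Random(seed)
    finals = []
    for _ in range(trials):
        outcomes = [rng.random() < p_win for _ in range(rolls)]
        finals.append(play_sequence(outcomes))
    return finals


if __name__ == "__main__":
    finals = simulate()
    earned = sum(f > 10000 for f in finals) / len(finals)
    broke = sum(f <= 0 for f in finals) / len(finals)
    print("mean: %.0f" % (sum(finals) / len(finals)))
    print("%.0f%% earned money, %.1f%% went bankrupt" % (100 * earned, 100 * broke))
```

With fair bets the mean final balance stays near the starting $10000: the many small winners are paid for by the few big losers.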

The following graph simulates 20 trials, and each line tracks one player's balance at any given round. In the end (round 50), only 1 guy lost money, while all the others earned money, and nobody went bankrupt. Notice the typical pattern: an exponential drop followed by a rebound in a single round. Most people earn a little, and a few guys lose a lot. Now the question is, do you want to be that poor guy who lost a lot?

In addition, if you keep using this strategy long enough, you will eventually go bankrupt. That’s called gambler's ruin: “bankrupt” is an absorbing state.

Promethease report: Hyungyong Kim

2010-07-30T23:45:00.001-04:00

What can you do with your genome deciphered by 23andme? For those who are not into tech startups, this is a personal genomics company that reads your genetic information and sells it back to you. There are a couple of others, like deCODEme, Navigenics, and BioResolve. Among these, I think 23andme is particularly promising. Oh, did I mention that its cofounder, Anne Wojcicki, is the wife of Google co-founder Sergey Brin?

Back to the genome data. This is a typical report you get from 23andme, which is essentially a spreadsheet of all the SNPs that they test on their 2nd generation chip. Each line is a single SNP, with its SNP ID (rsid), its location on the reference human genome, and the genotype call. The genotype shows two nucleotides, since we have two copies (one from each parent) of each SNP position.

# rsid  chromosome      position        genotype
rs3094315       1       742429  AA
rs12562034      1       758311  AG
rs3934834       1       995669  CC
rs9442372       1       1008567 GG
rs3737728       1       1011278 GG
rs11260588      1       1011521 GG
rs6687776       1       1020428 CC
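A file in this layout is easy to read programmatically. Here is a minimal Python sketch, assuming whitespace-separated columns and "#" comment lines exactly as in the excerpt above; the Snp tuple and the parse_23andme name are made up for illustration and are not part of any 23andme tool.

```python
from collections import namedtuple

Snp = namedtuple("Snp", "rsid chromosome position genotype")


def parse_23andme(lines):
    """Yield one Snp per data line, skipping blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rsid, chromosome, position, genotype = line.split()
        # chromosome stays a string because it can also be X, Y or MT
        yield Snp(rsid, chromosome, int(position), genotype)


sample = """\
# rsid  chromosome      position        genotype
rs3094315       1       742429  AA
rs12562034      1       758311  AG
rs3934834       1       995669  CC
"""

snps = list(parse_23andme(sample.splitlines()))
# heterozygous calls carry two different nucleotides
heterozygous = [s.rsid for s in snps if len(set(s.genotype)) == 2]
print(heterozygous)
```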

Hyungyong Kim, a bioinformatician working at a Korean biotech firm, had his genome sequenced by 23andme and is open enough to put it on his blog, so that everyone can download it, hack it, and learn all his weaknesses and strengths. The problem is that you’ll have to read through a ton of literature to understand what this spreadsheet tells you.

Enter Promethease, the software that does all the magic: it will annotate the SNPs for you. For Kim’s data, it reports interesting SNPs, drug metabolism, medical conditions, plus more complicated information.

Going down the list: Kim is male, Asian, a slow caffeine metabolizer, with maybe a few disease susceptibilities here and there. Is this good news or bad news for Kim?

My dear frond

2010-07-28T23:50:00.001-04:00

I have been into the chaos game for quite a while. The game proceeds by iteratively generating a sequence of points, using a series of simple linear rewriting rules, or in geek terms, an iterated function system. Unexpectedly, complex shapes arise from simple formulas, particularly shapes that are self-similar: trees, leaves, snowflakes, broccoli, coastlines, etc. With that in mind, the following is a fractal drawing of a fern leaf next to a real fern leaf. Can you tell the difference?

For those who are into details, the following is the python code to generate it, using the excellent matplotlib library. Note the re-writing functions and probabilities were taken shamelessly from the wiki page.
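The original script is not included here, so below is a minimal sketch of the chaos game for the Barnsley fern. The four affine maps and their probabilities are the standard ones listed on the Wikipedia page; the function names, the point count, and the plotting choices are my own assumptions. matplotlib is imported only inside plot_fern, so the point generation runs without it.

```python
import random

# Each row is (a, b, c, d, e, f, probability) for the affine map
#   x' = a*x + b*y + e
#   y' = c*x + d*y + f
BARNSLEY = [
    (0.00, 0.00, 0.00, 0.16, 0.0, 0.00, 0.01),   # stem
    (0.85, 0.04, -0.04, 0.85, 0.0, 1.60, 0.85),  # successively smaller leaflets
    (0.20, -0.26, 0.23, 0.22, 0.0, 1.60, 0.07),  # largest left-hand leaflet
    (-0.15, 0.28, 0.26, 0.24, 0.0, 0.44, 0.07),  # largest right-hand leaflet
]
WEIGHTS = [row[6] for row in BARNSLEY]


def fern_points(n=50000, seed=0):
    """Play the chaos game: apply one randomly chosen map per step."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for _ in range(n):
        a, b, c, d, e, f, _p = rng.choices(BARNSLEY, weights=WEIGHTS)[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points


def plot_fern(points):
    """Scatter-plot the points; requires matplotlib."""
    import matplotlib.pyplot as plt
    xs, ys = zip(*points)
    plt.scatter(xs, ys, s=0.1, color="green")
    plt.axis("equal")
    plt.axis("off")
    plt.show()
```

Calling plot_fern(fern_points()) should draw the familiar frond; more points give a denser picture.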

The world’s first Megatron prototype

2010-07-28T18:32:00.001-04:00

A recent report in the Proceedings of the National Academy of Sciences by Mit-Geeks et al. reveals a miniaturized “shape-shifter” machine that can take two distinct shapes: a boat or a plane. The actual robot is cunningly disguised as a sheet of interconnected triangles. The key innovation is the actuator, a device that bends itself when heated. The creases (the white stretches on the sheet) are made of silicone and are thus quite flexible.

The research is the product of a growing field called computational origami, which, in addition to fueling my own imagination, has claimed applications in machine folding, protein folding, and airbag design, among many others. Note that the entire operation from the flat sheet to a targeted geometry takes less than 20 seconds in the following video. Unsurprisingly, the research was funded by DARPA.

Javascript bio-sequence highlighter - I

2010-07-13T16:53:00.001-04:00

A few weeks back, during a lunch chat with brentp, we discussed a simple javascript highlighter for biological sequences, much like the syntax highlighters for programming languages (for example, Artistic Style for C-like languages or Pygments for python). The raw sequence files, usually in FASTA format, are just plain text, and very boring to stare at.

The biggest issue with biological code is that there are no punctuation marks. Luckily, computational biologists over the last few decades have come up with ways to identify various sequence elements, like genes and repeats. Such annotated sequence features extend over “genomic intervals”, and are typically stored in a GFF file or BED file.

The javascript code that I started to write today takes a FASTA file and a BED file as input, and outputs the HTML for the annotated sequence. As a first attempt, it is fairly straightforward, with some preset styles defined in a default CSS file. For example, genes are always drawn with a lavender background, etc.

This code is far from stable, especially the parts that involve nesting of sequence features (e.g. exons should be nested within genes). Right now nesting is not modeled at all, which results in nasty bugs.
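The core idea is small enough to sketch. The example below is in Python rather than javascript and is not the code behind the demo: it assumes features arrive as (start, end, css_class) tuples in BED coordinates (0-based, end-exclusive), and it refuses overlapping or nested features instead of trying to model them. The highlight name is hypothetical.

```python
from html import escape


def highlight(seq, features):
    """Wrap each feature interval of seq in a <span> with its CSS class.

    features: iterable of (start, end, css_class) in BED coordinates.
    Overlapping or nested features are rejected, since nesting is exactly
    the hard part that this sketch does not attempt.
    """
    out = []
    cursor = 0
    for start, end, css_class in sorted(features):
        if start < cursor:
            raise ValueError("overlapping or nested features are not handled")
        out.append(escape(seq[cursor:start]))          # unannotated stretch
        out.append('<span class="%s">%s</span>'
                   % (escape(css_class), escape(seq[start:end])))
        cursor = end
    out.append(escape(seq[cursor:]))                   # trailing sequence
    return "".join(out)


print(highlight("ACGTACGT", [(2, 5, "gene")]))
```

A stylesheet rule such as .gene { background: lavender; } would then color the wrapped stretch.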

See a working demo here, and screenshot below. Comments and suggestions are welcome, as always.

TimeTree App – evolution in your palm

2010-07-07T00:14:00.001-04:00

I have rarely looked at the iPhone App Store. The few times I did, I just looked at the Top 25 or 50 to see what’s hot at the moment. An unexpectedly useful app, TimeTree, turned up when I searched for the keyword genome (please don’t ask why I entered this). Here is the description:

TimeTree is a public knowledge-base for information on the evolutionary timescale of life. This application allows easy exploration of the thousands of divergence times among organisms in the scientific literature. A tree-based (hierarchical) system is used to identify all published molecular time estimates bearing on the divergence of two chosen organisms, such as species, compute summary statistics, and present the results. Names of two taxa to be compared are entered in the search window and the results are presented on a set of self-explanatory tabs.

My own query with arabidopsis and grape gave me a fast and informative result.

The divergence time estimates are mostly based on sequence data, and are categorized as nuclear, chloroplast, etc., depending on the source. Multiple estimates are weighted (with large variations among different estimates, as is often expected from this kind of study). References are nicely summarized with # of genes sampled, data type, table # in the publication… then there is also a reference to the book. I recognized one of the authors: Sudhir Kumar wrote the MEGA software, a user-friendly GUI tool for exploratory evolutionary analysis.

I also played with escherichia and rice (prokaryote vs. eukaryote), giving me an estimated divergence time of 2622.2Mya. Can you get a longer evolutionary distance than this?

Get plant gene coordinates from Phytozome

2010-04-29T20:30:00.001-04:00

For most of my genomics research, all I care about is the gene position on the chromosome, without worrying about the exons. As a result, the popular-powerful-intuitive gff3 format is often overkill for me. There are common python programming tools for parsing gff3, including genometools and Brad Chapman’s BCBio, and they work on well-formed gff3. However, it takes a substantial amount of time (>10s) to parse the Arabidopsis TAIR9 gff file (which, by the way, is not standard gff3).

The work-around for me and a colleague in the lab, over the last couple of months, has been to adopt a simpler file format for our calculations and graphics. The format is called “bed”, and was introduced by UCSC, mainly for graphics purposes. The bed format, like gff, is tab-delimited, but contains no hierarchy among different feature types. There are fewer required fields, too. The bed files I work with usually contain 4 columns.

Chr2    16848438        16850487        AT2G40340
Chr2    14239213        14242365        AT2G33640

As it stands, this is all I need for my synteny analysis. But is there an easy way to generate bed-formatted files? In the past, I usually generated the bed file by parsing the gff file, either through a library or a custom script. It occurred to me today that I can get all that information from Phytozome. As a brief introduction, Phytozome is a great site that stores many plant gene sequences and their annotations. Their biomart site provides an entry point for programmers to get the data out.
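For comparison, the "custom script" route from gff to bed can be sketched in a few lines of Python. This is an illustration under assumptions, not the script I used: it keeps only rows whose type column is "gene", takes the name from the ID attribute, and converts gff's 1-based inclusive start to bed's 0-based start. The function name and the sample rows are made up.

```python
def gff3_genes_to_bed(lines, feature_type="gene"):
    """Yield 4-column bed lines (chrom, start, end, name) for gene features."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                       # skip blanks, comments, directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue                       # ignore exons, mRNAs, odd rows
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("ID", ".")
        # gff is 1-based and inclusive; bed is 0-based and end-exclusive
        yield "\t".join([seqid, str(start - 1), str(end), name])


# made-up rows, for illustration only
example = [
    "##gff-version 3",
    "Chr9\texample\tgene\t1001\t2000\t.\t+\t.\tID=GENE0001;Note=demo",
    "Chr9\texample\texon\t1001\t1500\t.\t+\t.\tParent=GENE0001",
]
for bed_line in gff3_genes_to_bed(example):
    print(bed_line)
```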

Our script starts from an xml template. I got the template by going to the biomart site and clicking on the “XML” button. The two key parts of the template are the “filters” and the “attributes”. The rest is just sending the xml query to the website over http, which you can do with the “curl” command or with urllib in python.

Python script for this is attached below. You can specify the species filter to get the data from just a few species (like ‘Arabidopsis thaliana’ in my code). The tab-delimited bed file will be written to stdout.
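The attached script is not reproduced here, so the following is only a sketch of how such a query can be assembled. The overall XML layout (Query, Dataset, Filter, Attribute) follows the general BioMart query format; the service URL, dataset name, filter name, and attribute names below are placeholders that I made up, and the real ones should be copied from the template that the “XML” button produces. The sketch builds the request but deliberately does not send it.

```python
from urllib.parse import urlencode
from xml.sax.saxutils import quoteattr

# Placeholder endpoint: replace with the real martservice URL.
MART_URL = "http://example.org/biomart/martservice"


def build_query_xml(dataset, filters, attributes):
    """Assemble a BioMart-style XML query string."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE Query>",
        '<Query virtualSchemaName="default" formatter="TSV" header="0" '
        'uniqueRows="1" datasetConfigVersion="0.6">',
        '<Dataset name=%s interface="default">' % quoteattr(dataset),
    ]
    for name, value in filters.items():
        parts.append("<Filter name=%s value=%s/>"
                     % (quoteattr(name), quoteattr(value)))
    for name in attributes:
        parts.append("<Attribute name=%s/>" % quoteattr(name))
    parts += ["</Dataset>", "</Query>"]
    return "".join(parts)


# All names below are hypothetical placeholders.
xml = build_query_xml(
    dataset="phytozome",
    filters={"organism_name": "Arabidopsis thaliana"},
    attributes=["chr_name", "gene_start", "gene_end", "gene_name"],
)
url = MART_URL + "?" + urlencode({"query": xml})
print(url)
# To actually fetch the table: urllib.request.urlopen(url).read()
```

Ordering the attributes as chromosome, start, end, name means the tab-separated response is already laid out like the 4-column bed file.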

Review article on Nature Review Genetics retracted

2010-03-18T15:33:00.001-04:00

I have never seen a review article retracted. Now here is one,

Plant genetic engineering for biofuel production: towards affordable cellulosic ethanol : Article : Nature Reviews Genetics (link)

So what happened?

I am retracting this invited Nature Reviews Genetics article due to a paragraph being paraphrased without attribution. The paragraph in question was from an early version of an article to which I had access as a peer reviewer and which has since been published in Plant Science.

Apparently there is some plagiarism going on. Tracking down the Plant Science paper, I noticed an Editor’s comment attached at the end.

The third paragraph of Section 4 of an early version of this article was previously published, nearly verbatim, in Nature Reviews Genetics, Paragraph 2, Page 441, Volume 9, June 2008 by a reviewer of this manuscript, while this paper was still under review. When this was discovered, the Plant Science Review Editor immediately reported the apparent abuse of the reviewer’s privilege to the academic institution of the author of the paper published in Nature Reviews Genetics.

I simply cannot understand this. My impression is that invited reviews have a near-100% acceptance rate, so there is not much motivation for plagiarism. I tend to believe that this paragraph got into the NRG manuscript by some kind of mistake, although the incident definitely brings disgrace on the NRG author.

Getting the phylogeny from a list of organisms

2010-02-11T22:46:00.001-05:00

I started a postdoc job in this Berkeley lab, and I am quite excited about working with new people. I haven’t posted anything for this year yet, so today I made up something from scratch.

Let’s say we have a list of organisms, with their NCBI taxonomy ids. So how should we know how close they are on the tree of life? Well, you can look it up on the NCBI. For example, below is the lineage for Arabidopsis thaliana, the botanical model species.

cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids; malvids; Brassicales; Brassicaceae; Arabidopsis

This is very useful indeed. Nowadays, with everything done at the molecular level, people are less devoted to taxonomy than they used to be, but knowing the lineage can help in making phylogenetic inferences about when a certain morphological feature arose.

How about looking up a handful of species? Is there a better way than checking them one by one? You’d need some help from programming. I’ll use Python here, and we need a few library dependencies (ClientForm and BeautifulSoup). Make sure that you install BeautifulSoup version 3.0.8, because of the problem described here.

Okay, I admit it: I am lazy. I don’t really want to go to NCBI, grab the long lists of lineages, and do tons of matching in my code. I found an excellent website, “Interactive Tree of Life” (iToL), that does this job for me, so I’ll just deliver my query there. ClientForm helps me prepare the form and click the button for me, and I’ll use BeautifulSoup to retrieve the information I want. The returned data is Newick-formatted. We’ll use the python library ete to visualize the tree. The full source is below; try running the example first. If anything is unclear, please see the documentation for the above three python packages (also for installation, more functionality, etc.).

"""
Example:
>>> mylist = [3702, 3649, 3694, 3880]
>>> t = TaxIDTree(mylist)
>>> print t
(((Carica_papaya,Arabidopsis_thaliana)Brassicales,(Medicago_truncatula,Populus_trichocarpa)fabids)rosids);
>>> t.print_tree()
<BLANKLINE>
               /-Carica_papaya
          /---|
         |     \-Arabidopsis_thaliana
---- /---|
         |     /-Medicago_truncatula
          \---|
               \-Populus_trichocarpa
"""
from urllib2 import urlopen
from ClientForm import ParseResponse
from BeautifulSoup import BeautifulSoup
URL="http://itol.embl.de/other_trees.shtml"
class TaxIDTree(object):
    def __init__(self, list_of_taxids):
        # the data to send in
        form_data = "\n".join(str(x) for x in list_of_taxids)
        response = urlopen(URL)
        forms = ParseResponse(response, backwards_compat=False)
        form = forms[0]
        form["ncbiIDs"] = form_data
        page = urlopen(form.click()).read()
        soup = BeautifulSoup(page)
        for element in soup("textarea"):
            if element["id"]=="nameCol":
                self.nameCol = str(element.contents[0])
    def __str__(self):
        return self.nameCol
    def print_tree(self):
        from ete2 import Tree
        t = Tree(self.nameCol)
        print t

gzip.open

2009-12-07T07:58:00.001-05:00

I read a blog post on how you can save time and space in R by compressing a data file with gzip and then reading it in using the zlib library. I did a small test in python; as it is informal, I’ll just state the results. I started with a 730Mb sorghum assembly file (lots of ACGT); after compression, the .gz file was around 200Mb. A normal file read of the raw file took about 5sec on the test server, whereas the .gz file took about 22sec to read. Not impressed.
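A test of this kind can be sketched as follows. This is not the script I ran: it uses a small random ACGT string in a temporary directory instead of the sorghum assembly, the function names are made up, and the timings it reports depend heavily on the machine and on disk caching.

```python
import gzip
import os
import random
import tempfile
import time


def time_read(path, opener):
    """Read the whole file through opener and return (data, seconds)."""
    start = time.perf_counter()
    with opener(path, "rb") as handle:
        data = handle.read()
    return data, time.perf_counter() - start


def compare(n_bases=1000000, seed=0):
    """Write random DNA raw and gzipped, then time reading each back."""
    rng = random.Random(seed)
    payload = "".join(rng.choices("ACGT", k=n_bases)).encode("ascii")
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "seq.txt")
        gz_path = raw_path + ".gz"
        with open(raw_path, "wb") as fh:
            fh.write(payload)
        with gzip.open(gz_path, "wb") as fh:
            fh.write(payload)
        sizes = (os.path.getsize(raw_path), os.path.getsize(gz_path))
        raw, t_raw = time_read(raw_path, open)
        unzipped, t_gz = time_read(gz_path, gzip.open)
    return raw == unzipped, sizes, t_raw, t_gz


if __name__ == "__main__":
    same, sizes, t_raw, t_gz = compare()
    print("identical:", same, "sizes:", sizes)
    print("raw read: %.3fs, gzip read: %.3fs" % (t_raw, t_gz))
```

The space saving is real, since four-letter DNA needs only about two bits per base, but decompression costs CPU time, which is consistent with the slower .gz read above.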

The "Gang of Four" in Population Genetics

2009-11-14T10:04:00.001-05:00

The following are classic theories in the field of population genetics.

Fundamental theorem of natural selection, by R. A. Fisher
Shifting balance theory, by S. Wright
Genetic load theory, by J. B. S. Haldane
Neutral theory, by M. Kimura

Which one is your favorite?

Decrease in TAIR funding

2009-10-29T14:38:00.001-04:00

Although I am not working on Arabidopsis, I think this is still one of the best-managed plant databases out there.

Read more from the TAIR homepage and comments from the community.

High GC grass genes

2009-08-23T22:54:00.002-04:00

Several studies in the past noted that there are two classes of genes in grass genomes, giving a bimodal distribution when the GC content of genes is plotted. The following figure is taken from the Lescot 2008 Musa paper.

Please note that the eudicot (Arabidopsis) pattern is quite different from that of the grasses (rice). In fact, some rice genes have third-codon positions that are virtually all G or C. The third codon positions (also known as synonymous positions) have more freedom to change because most substitutions there don’t alter the amino acid composition. For many applications, we need to calculate the Ks distance (substitutions per synonymous site) as a rough estimate of how divergent two proteins have become. Most calculations are based on an evolutionary model of nucleotide substitution. With the abnormal gene class noted above, many of the assumptions of simplistic models are no longer valid. In particular, consider the JC69 model, which Nei and Gojobori’s Ks calculation method is based on. When G<->C changes are the only allowable changes, the assumption of equal rates from G to each of the other three nucleotides no longer holds.
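The quantity behind the bimodal plot, GC content at third codon positions (often called GC3), is simple to compute. A minimal sketch, assuming the input is an in-frame coding sequence that starts at the first base of a codon; the gc3 name is made up and any incomplete trailing codon is ignored:

```python
def gc3(cds):
    """Fraction of G or C among the third positions of complete codons."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3      # drop an incomplete last codon
    thirds = cds[2:usable:3]              # bases at index 2, 5, 8, ...
    if not thirds:
        return 0.0
    return sum(base in "GC" for base in thirds) / len(thirds)


# codons ATG, GCC, AAA have third bases G, C, A
print(gc3("ATGGCCAAA"))
```

Plotting a histogram of gc3 over all genes of a genome is what reveals the second, high-GC peak in grasses.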

Several published studies, and my own observations, have noted that when calculating Ks for grass genes, Nei & Gojobori’s method under-estimates the Ks values while PAML over-estimates them. In practice, we noted that the yn00 program in the PAML package, when calculating Ks values between pairs of grass genes, often returned values larger than 2. Further investigation revealed that some results are simply correlated with the protein length!

I looked very briefly at the PAML code, did a bit of debugging, and found the following in yn00.c.

int DistanceF84(double n, double P, double Q, double pi[],
   double*k_HKY, double*t, double*SEt)
{
/* This calculates kappa and d from P (proportion of transitions) & Q
  (proportion of transversions) & pi under F84.
  When F84 fails, we try to use K80.  When K80 fails, we try
  to use JC69.  When JC69 fails, we set distance t to maxt.
  Variance formula under F84 is from Tateno et al. (1994), and briefly
  checked against simulated data sets.
*/

So the code tries to fit the data using several substitution models and, when all of them fail, switches to a default. So what is the default?

if(failK80) {
  if((P+=Q)>=.75) { failJC69=1; P=.75*(n-1.)/n; }
  *t = -.75*log(1-P*4/3.);
  if(*t>maxt) *t=maxt;
  if(SEt) {
     *SEt = sqrt(9*P*(1-P)/n) / (3-4*P);
  }
}

Let us look at this code. If transitions (P) plus transversions (Q) add up to 75% or more, the sequences are completely randomized, to the point that a site is more likely to have been substituted than not. In this case the JC69 model fails, fair enough. The code then forces the proportion of differences (stored back in P) to be slightly less than 75%.

With a bit of math (substituting P = 0.75(n-1)/n into t = -0.75 log(1 - 4P/3)) you can get

t = 0.75 log(n)

Now you can see that this becomes a very simple function of the sequence length n. I could imagine that for the grass calculation, the three models it tried get thrown off quite often, so that this approximation gets called. With n equal to 333 (for a 1kb gene), you have Ks roughly equal to 4. That is why, for a significant portion of the whole dataset, we get a distribution of Ks values that simply depends on the sequence length.
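The arithmetic can be checked numerically. This sketch mirrors only the two lines of the quoted branch that set P and *t; it ignores the maxt cap (whose value is not shown in the excerpt) and is not a replacement for yn00. The function names are made up.

```python
import math


def jc69_distance(p):
    """JC69 distance for a proportion p of differing sites (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def fallback_distance(n):
    """Distance the quoted branch produces once JC69 has failed."""
    p = 0.75 * (n - 1.0) / n          # P = .75*(n-1.)/n in the C code
    return jc69_distance(p)           # *t = -.75*log(1-P*4/3.)


for n in (100, 333, 1000):
    # the two columns agree: the fallback is just 0.75 * ln(n)
    print(n, round(fallback_distance(n), 3), round(0.75 * math.log(n), 3))
```

Since 1 - 4P/3 reduces to 1/n, the distance depends on nothing but the number of sites, which is the length correlation seen in the data.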

I will emphasize here that this post is not a criticism of PAML. I love PAML and use it heavily. But it is important to understand that when some unexpected substitutions (such as the grass GC rage) occur, evolutionary models can fail miserably.

TKF91 model on a tree

2009-08-04T14:03:00.001-04:00

This is a bioinformatics-related animation on YouTube. It says that the animation was made by "hacking" RASMOL. This is simply amazing and educational at the same time. Naturally, when I looked at the name of the YouTube author, it was Ian Holmes.

Facebook Puzzle Master

2009-07-23T09:32:00.002-04:00

I just discovered this yesterday, and submitted one solution to the test problem. The judge-bot checks the submissions every four hours, so this is unlike some of the online judges (POJ, USACO) that I worked on before. The problems are more realistic, rather than contrived for competition purposes. One particular problem that I glanced through yesterday, the Facebull problem, seems NP-hard (in computer terms, bloody hard), and you cannot solve it in reasonable time for the general case. On top of that, unlike the problems for the online judges, it does not state the scale of the problem, nor the time/memory requirements. The judge-bot accepts popular computer languages. It is always inspiring to see the snake!

But, the problem I submitted yesterday was judged correct, so I will look at some more problems this evening. I like the online problem listings; and hopefully not limited to the programming. Similar ideas I’ve seen so far include mathematics/computing and chemistry/biology. These are not simple problems but incentives are prepared.

Plants that I had when I was a young boy

2009-07-21T16:05:00.001-04:00

There are many indigenous plants that are local varieties of common domesticated plants, but when I was young I didn’t know that. The two plants that I will show below are perhaps only native to the eastern part of China where I lived. The names are best pronounced in the local Shanghainese accent. Some of the pictures shown are taken from random places on the internet.

甜卢粟 (sweet sorghum or sugar sorghum; Sorghum bicolor): when I was young, I used to eat this a lot. It is a variety of sorghum with a high sugar content in the stalk, so a bit like sugarcane, except that sweet sorghum has a unique aroma (and flavor) to it. The plants grow fast; my mom used to grow them next to the corn plants, and the sorghum never needed any fertilizer or other attention. The panicles (the seed-bearing part) can be cut and bundled to make a broom. Therefore the seeds must be non-shattering… well, I digress. There is some recent interest in developing this into an energy crop, since it has lower water requirements than sugarcane. See paper here.

癞葡萄 (bitter melon; Momordica charantia): we didn’t grow these. The skin is yellow while the kernels are red and sweet (but not too sugary). When we bought vegetables at the market, some farmers would give these to kids (aka me) for free. It turned out to be in the same genus as the bitter melon, even the same species. I always wondered whether the one on the left is at a different developmental stage from the one on the right (which is the bitter melon we eat as a vegetable). But the resemblance is obvious.

Mozilla Firefox 3.5 is here

2009-07-04T17:38:00.001-04:00

I tried it out… noticeably fast (rather informal impression here, considering the network traffic is perhaps lower). In addition, <video> tags are now supported, which means you don’t need a flash plugin to stream videos. The firefox people are now inventing many things; <canvas> for example, has not been supported by IE yet.

Different millets

2009-06-22T16:16:00.001-04:00

It has been confusing to me that there are different millet varieties that I have encountered when reading the literature (finger millet, foxtail millet and pearl millet). The question occurred to me again when I picked a few weedy grasses downstairs and came back to ask Changsoo to ID them. It turns out to be goose grass (Eleusine indica). I looked at some close relatives and found that it is in fact a relative of finger millet (Eleusine coracana). Now I will try to explain the different millets, copying some images from wikipedia to illustrate the morphology. Three major millets (in terms of agricultural production) are listed below.

Pearl millet (Pennisetum glaucum, subfamily: Panicoideae), the most widely grown millet, mainly in Africa and India.

Foxtail millet (Setaria italica, subfamily: Panicoideae), very important in East Asia; in fact it has been grown in China for more than 8000 years (there is a recent PNAS paper on this). The Chinese name for this is 小米; it also looks like a local weedy grass, Setaria viridis, which we called 狗尾巴草.

Finger millet (Eleusine coracana, subfamily: Chloridoideae), annual plant grown as cereal in Africa and Asia.

STL usage

2008-08-18T22:06:00.001-04:00

Posted by: littlebee (小蜜蜂), Board: C
Subject: STL tips
Site: 日月光华 (Thursday, 2008-08-07 16:36:25)

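I found this online. These are all commonly used tricks, none of them hard, but some are really useful.
Reposting it here for everyone to have a look.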


toupper,tolower
Everyone on Earth knows that the C++ string has no toupper. Luckily this is not a big problem, because we have the STL algorithms:

string s("heLLo");
transform(s.begin(), s.end(), s.begin(), ::toupper);
cout << s << endl;
transform(s.begin(), s.end(), s.begin(), ::tolower);
cout << s << endl;

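Of course, I know many people would like to have s.to_upper(), but for something as generic as basic_string there is really no way to put such specialized methods in. If you use boost stringalgo, this is of course no problem, and you don't need to read this article.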

------------------------------------------------------------------------
trim
We also know that string has no trim, but rolling our own is not hard; it is even simpler than toupper:


    string s("   hello   ");
    s.erase(0, s.find_first_not_of(" \n"));
    cout << s << endl;
    s.erase(s.find_last_not_of(' ') + 1);
    cout << s << endl;

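Note that find_first_not_of and find_last_not_of both accept a string, in which case they look for the absence of every character in that string, so you can trim several kinds of characters in one go.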

-----------------------------------------------------------------------
erase
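string's own erase is decent, but it can only erase a contiguous run of characters. What if you want to remove every occurrence of some character from a string? Use STL erase + remove_if; note that remove_if alone is not enough.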

    string s("   hello, world. say bye   ");
    s.erase(remove_if(s.begin(),s.end(), 
        bind2nd(equal_to<char>(), ' ')), 
    s.end());

The snippet above removes all the spaces, giving hello,world.saybye.

-----------------------------------------------------------------------
replace
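string itself provides replace, but it is not substring-oriented. For example, the operation we use most often, replacing one substr with another substr, takes a little composition: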

    string s("hello, world");
    string sub("ello, ");
    s.replace(s.find(sub), sub.size(), "appy ");
    cout << s << endl;

The output is happy world. Note that the original substr and the replacement substr do not have to be the same length.


-----------------------------------------------------------------------
startwith, endwith
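These two are used all the time, but if you look carefully at the string interface, you will find there is no real need to provide them as separate methods; the existing interface does the job well: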

    string s("hello, world");
    string head("hello");
    string tail("ld");
    bool startwith = s.compare(0, head.size(), head) == 0;
    cout << boolalpha << startwith << endl;
    bool endwith = s.compare(s.size() - tail.size(), tail.size(), tail) == 0;

    cout << boolalpha << endwith << endl;

Of course, it is not as convenient as s.startwith("hello").

------------------------------------------------------------------------
toint, todouble, tobool...
This is a well-worn topic too. Both the C way and the C++ way work, each with its own character:

    string s("123");
    int i = atoi(s.c_str());
    cout << i << endl;
    
    int ii;
    stringstream(s) >> ii;
    cout << ii << endl;
    
    string sd("12.3");
    double d = atof(sd.c_str());
    cout << d << endl;
    
    double dd;
    stringstream(sd) >> dd;
    cout << dd << endl;
    
    string sb("true");
    bool b;
    stringstream(sb) >> boolalpha >> b;
    cout << boolalpha << b << endl;

The C way is concise, and assignment and conversion happen in a single statement, while the C++ way is more general.

------------------------------------------------------------------------
split
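This one is troublesome. What we would like most is an interface like s.split(vect, ','). Doing that with STL algorithms is somewhat difficult, so let's start with the simple case: if the delimiters are whitespace such as spaces, tabs and newlines, this is enough: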

    string s("hello world, bye.");
    stringstream ss(s);  // a named stream: istream_iterator cannot bind to a temporary
    vector<string> vect;
    vect.assign(
        istream_iterator<string>(ss),
        istream_iterator<string>()
    );

Note, however, that if s is large there is an efficiency concern, because stringstream makes its own copy of the string.

------------------------------------------------------------------------
concat
How do you concatenate all the strings in a container of strings? I hope you won't say a hand-coded loop. Isn't this better?

    vector<string> vect;
    vect.push_back("hello");
    vect.push_back(", ");
    vect.push_back("world");
    
    cout << accumulate(vect.begin(), vect.end(), string(""));

There is room for optimization on the efficiency side, though.
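
The cost comes from accumulate evaluating acc = acc + *it at every step,
which in pre-C++20 libraries builds a fresh temporary string each time. If
that matters, a sketch like the following trades the one-liner for a single
allocation; concat is a hypothetical helper name, and yes, it is the
hand-written loop I just told you to avoid:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join all strings in `parts` with one allocation.
std::string concat(const std::vector<std::string>& parts)
{
    typedef std::vector<std::string>::const_iterator Iter;

    std::string::size_type total = 0;
    for (Iter it = parts.begin(); it != parts.end(); ++it)
        total += it->size();             // first pass: measure

    std::string result;
    result.reserve(total);               // allocate once, up front
    for (Iter it = parts.begin(); it != parts.end(); ++it)
        result += *it;                   // second pass: append in place
    return result;
}
```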

-------------------------------------------------------------------------

reverse
I rather doubt anyone really needs to reverse a string, but doing it is
certainly easy:

  std::reverse(s.begin(), s.end());

That reverses in place. If you need the reversed copy in another string, it
is just as simple:

  s1.assign(s.rbegin(), s.rend());

The efficiency is quite good, too.

-------------------------------------------------------------------------

Parsing a file extension
The wordier version:

    std::string filename("hello.exe");

    std::string::size_type pos = filename.rfind('.');
    std::string ext = filename.substr(
        pos == std::string::npos ? filename.length() : pos + 1);

Only two lines. Merge them into one statement? That is possible too:

    std::string ext = filename.substr(
        filename.rfind('.') == std::string::npos
            ? filename.length() : filename.rfind('.') + 1);

I know, rfind runs twice. But first, you can hope the compiler optimizes it
away, and second, extensions are usually short, so even one extra call should
make only a tiny difference.

STL algorithms

distance
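Often we want to find the position of a value in a vector, a list, or some
other container. find by itself does not help here, so people fall back on a
hand-written loop, and a primitive one at that: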

for ( int i = 0; i < vect.size(); ++i)
    if ( vect[i] == value ) break;

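If your compiler treats i as scoped to the for statement, you also have to
move the declaration of i outside the loop. Is that really necessary? Look at
this: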

    int dist = 
        distance(col.begin(), 
            find(col.begin(), col.end(), 5));

Here col can be many kinds of container: list, vector, deque... Of course
this assumes you are sure that 5 is in col. If you are not sure, add a check:

    int dist = -1;  // stays -1 when 5 is not in col
    list<int>::iterator pos = find(col.begin(), col.end(), 5);
    if ( pos != col.end() )
        dist = distance(col.begin(), pos);

I think this is still nicer than a hand-written loop.

--------------------------------------------------------------------------
max, min
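There is direct algorithm support for this. The complexity is O(n), of
course, and it is meant for unsorted containers. If the container is
sorted... well, my friend, do you even need an algorithm?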

max_element(col.begin(), col.end());
min_element(col.begin(), col.end());

Note that an iterator is returned. If all you care about is the value, fine:

*max_element(col.begin(), col.end());
*min_element(col.begin(), col.end());

max_element and min_element both compare with less by default, and both also
accept a binary predicate. If you are bored enough, you can even use
max_element as min_element, or vice versa:


*max_element(col.begin(), col.end(), greater<int>()); // returns the minimum!
*min_element(col.begin(), col.end(), greater<int>()); // returns the maximum

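Of course that is not what the predicate is for. It is there so you can use
these algorithms in more special cases, for example when you want to compare
a member of each element, or the return value of a member function. For
example: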

#include <iostream>
#include <list>
#include <algorithm>
#include <string>
#include <boost/bind.hpp>

using namespace boost;
using namespace std;

struct Person
{
    Person(const string& _name, int _age)
        : name(_name), age(_age)
    {}
    int age;
    string name;
};

int main()
{
    list<Person> col;
    list<Person>::iterator pos;

    col.push_back(Person("Tom", 10));
    col.push_back(Person("Jerry", 12));
    col.push_back(Person("Mickey", 9));

    Person eldest = 
        *max_element(col.begin(), col.end(), 
            bind(&Person::age, _1) < bind(&Person::age, _2)); // needs Boost >= 1.33
    
    cout << eldest.name;
}

The output is Jerry. This uses boost.bind; forgive me for not knowing how to
write it with bind2nd and mem_fun, and I have no wish to find out...

-------------------------------------------------------------------------
copy_if
That's right, the STL (before C++11) has no copy_if at all, which is why we
need this:

template<typename InputIterator, typename OutputIterator, typename Predicate>
OutputIterator copy_if(
    InputIterator begin, InputIterator end, OutputIterator destBegin,
    Predicate p)
{
    while (begin != end) 
    {
        if (p(*begin))*destBegin++ = *begin;
        ++begin;
    }
    return destBegin;
}

Keeping it in your own toolbox is a wise choice.
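
Since C++11 the standard library ships std::copy_if in <algorithm>, so on a
newer compiler a home-grown copy_if called without qualification may even
become ambiguous with the standard one. A usage sketch with the standard
version follows; is_even and evens are hypothetical names for the example:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

bool is_even(int x) { return x % 2 == 0; }

// Collect the even values of `in` using std::copy_if (C++11 and later).
std::vector<int> evens(const std::vector<int>& in)
{
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), is_even);
    return out;
}
```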

------------------------------------------------------------------------
Idiom: erase(iter++)
If you want to remove certain elements from a list, be very careful (the code
below is WRONG!!!):


#include <iostream>
#include <algorithm>
#include <iterator>
#include <list>

int main()
{
    int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::list<int> lst(arr, arr + 10);

    for ( std::list<int>::iterator iter = lst.begin();
          iter != lst.end(); ++iter)
        if ( *iter % 2 == 0 )
            lst.erase(iter);
            
    std::copy(lst.begin(), lst.end(),
        std::ostream_iterator<int>(std::cout, " "));
}

Once iter has been erased it is invalid, yet the loop still goes on to do
++iter, and the behavior is undefined! If you don't want to bring in
remove_if, the idiom to use is this:

#include <iostream>
#include <algorithm>
#include <iterator>
#include <list>

int main()
{
    int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::list<int> lst(arr, arr + 10);

    for ( std::list<int>::iterator iter = lst.begin();
          iter != lst.end(); )
        if ( *iter % 2 == 0 )
            lst.erase(iter++);
        else
            ++iter;
           
    std::copy(lst.begin(), lst.end(),
        std::ostream_iterator<int>(std::cout, " "));
}

But the code above cannot be used with vector, string or deque, because for
those containers erase invalidates not only iter but also every iterator
after it!

-------------------------------------------------------------------------
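The erase(remove...) idiom
The loop above is so hard to write, so non-generic and so hard to follow that
an STL algorithm is the better choice. But note that remove_if alone is not
enough; you must use the erase(remove...) idiom: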

#include <iostream>
#include <algorithm>
#include <iterator>
#include <list>
#include <functional>
#include <boost/bind.hpp>

int main()
{
    int arr[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::list<int> lst(arr, arr + 10);

    lst.erase(std::remove_if(lst.begin(), lst.end(),
        boost::bind(std::modulus<int>(), _1, 2) == 0),
        lst.end()
    );
           
    std::copy(lst.begin(), lst.end(),
        std::ostream_iterator<int>(std::cout, " "));
}

Of course, boost.bind is used here so that we don't have to write a throwaway
functor.

Simple common sense: about streams

Reading one line from a file

Simple, this will do:

ifstream ifs("input.txt");
char buf[1000];


ifs.getline(buf, sizeof buf); 

string input(buf); 

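Sure, there is nothing wrong with this, but it involves needless fuss and
copying. Besides, if a line is longer than 1000 characters you need a loop
and even messier buffer management. Isn't the following simpler?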

string input;
input.reserve(1000);
ifstream ifs("input.txt");
getline(ifs, input); 

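It is not only simpler but also safer, because the free function getline
deals with troubles such as running out of buffer for you. If you don't want
allocations to happen too often, just reserve a bit more space.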

That is what "simple common sense" means: many things are already there, I
just never got around to using them.

---------------------------------------------------------------------------

Reading a whole file into a string at once


I hope your answer is not this:

string input;
while( !ifs.eof() )
{
    string line;
    getline(ifs, line);
    input.append(line).append(1, '\n');
} 

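Sure, it works (although, because eof() is tested before the read that
actually hits end of file, it can append a newline that is not in the file),
but isn't the approach below more in the spirit of C++?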

string input(
    istreambuf_iterator<char>(ifs.rdbuf()), 
    istreambuf_iterator<char>()
); 

Again, allocating space in advance may help performance:

string input;
input.reserve(10000);
input.assign(
    istreambuf_iterator<char>(ifs.rdbuf()), 
    istreambuf_iterator<char>()
);


Simple, isn't it? Yet these are facts we often overlook.
One addition: doing it this way is problematic:

    string input; 
    input.assign( 
        istream_iterator<char>(ifs), 
        istream_iterator<char>() 
    ); 


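because it skips all whitespace, so you end up with a string of nothing but
non-whitespace characters. Finally, if you just want to send the contents of
a file to another stream, it is hard to get faster than this: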


    fstream fs("temp.txt"); 
    cout << fs.rdbuf(); 

So if you need to copy a file by hand, this is the best way (short of using
an operating system API):

   ifstream ifs("in.txt"); 
   ofstream ofs("out.txt"); 
   ofs << ifs.rdbuf(); 


-------------------------------------------------------------------------

The options for opening a file

ios::in     Open file for reading 
ios::out    Open file for writing 
ios::ate    Initial position: end of file 
ios::app    Every output is appended at the end of file 
ios::trunc  If the file already existed it is erased 
ios::binary Binary mode 

-------------------------------------------------------------------------

And the ios flags

flag                  effect if set
ios_base::boolalpha   input/output bool objects as alphabetic names (true, false).
ios_base::dec         input/output integer in decimal base format.
ios_base::fixed       output floating-point values in fixed-point notation.
ios_base::hex         input/output integer in hexadecimal base format.
ios_base::internal    the output is filled at an internal point, enlarging the output up to the field width.
ios_base::left        the output is filled at the end, enlarging the output up to the field width.
ios_base::oct         input/output integer in octal base format.
ios_base::right       the output is filled at the beginning, enlarging the output up to the field width.
ios_base::scientific  output floating-point values in scientific notation.
ios_base::showbase    output integer values preceded by the numeric base.
ios_base::showpoint   output floating-point values always including the decimal point.
ios_base::showpos     output non-negative numeric values preceded by a plus sign (+).
ios_base::skipws      skip leading whitespace on certain input operations.
ios_base::unitbuf     flush output after each inserting operation.
ios_base::uppercase   output uppercase letters replacing certain lowercase letters.

Three other constants are also defined that can be used as masks:

constant                value
ios_base::adjustfield   left | right | internal
ios_base::basefield     dec | oct | hex
ios_base::floatfield    scientific | fixed

--------------------------------------------------------------------------


Parsing a string with the separator I want, and reading data from a stream


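This used to be a topic that took quite some trouble, made all the more
annoying by how often it comes up, but getline actually handles it rather
well: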


    getline(cin, s, ';');    
    while ( s != "quit" ) 
    { 
        cout << s << endl; 
        getline(cin, s, ';'); 
    } 


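Simple, right? Note, however, that since getline now treats only ; as the
separator, you have to type ;quit; to end the input; otherwise getline reads
the surrounding spaces and newlines into s. This can of course be handled in
code.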


Likewise, for simple string parsing we hardly need to bring in a Tokenizer or
the like:


#include <iostream> 
#include <sstream> 
#include <string> 

using namespace std; 

int main() 
{ 
    string s("hello,world, this is a sentence; and a word, end."); 
    stringstream ss(s); 
    
    for ( ; ; ) 
    { 
        string token; 
        getline(ss, token, ','); 
        if ( ss.fail() ) break; 
        
        cout << token << endl; 
    } 
} 


Output:


hello 
world 
 this is a sentence; and a word 
 end. 


Pretty, isn't it? The drawback is that only a single character can serve as
the separator.
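
When several separator characters are needed, string::find_first_of can do
the scanning instead of getline. A sketch under that assumption follows;
split_any is a hypothetical helper, and unlike the getline loop it keeps
empty tokens, including one after a trailing separator:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `s` at any character found in `delims`.
// Empty tokens between adjacent separators are kept.
std::vector<std::string> split_any(const std::string& s,
                                   const std::string& delims)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = s.find_first_of(delims, start);
        if (pos == std::string::npos) {
            tokens.push_back(s.substr(start));   // last token, may be empty
            break;
        }
        tokens.push_back(s.substr(start, pos - start));
        start = pos + 1;                         // step over the separator
    }
    return tokens;
}
```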


--------------------------------------------------------------------------


Sending what originally went to the screen into a file, without changing cout to fs everywhere


#include <iostream>
#include <fstream> 
using namespace std; 
int main()
{     
    ofstream outf("out.txt");  
    streambuf *strm_buf=cout.rdbuf();     
    cout.rdbuf(outf.rdbuf());  
    cout<<"write something to file"<<endl;  
    cout.rdbuf(strm_buf);   // restore cout's original buffer
    cout<<"display something on screen"<<endl; 
    return 0;
} 
 
What gets printed to the screen:


display something on screen 


What gets written to the file:


write something to file 


In other words, just change the rdbuf of an ostream and its output is
redirected. This trick does not work on fstream and stringstream, though.


--------------------------------------------------------------------------


On istream_iterator and ostream_iterator


The classic ostream_iterator example is output through copy:


#include <iostream> 
#include <fstream> 
#include <sstream> 
#include <algorithm> 
#include <vector> 
#include <iterator> 

using namespace std; 

int main() 
{   
    vector<int> vect; 
    for ( int i = 1; i <= 9; ++i ) 
        vect.push_back(i); 
        
    copy(vect.begin(), vect.end(), 
        ostream_iterator<int>(cout, " ") 
    ); 
    cout << endl; 
    
    ostream_iterator<double> os_iter(cout, " ~ "); 
    *os_iter = 1.0; 
    os_iter++; 
    *os_iter = 2.0; 
    *os_iter = 3.0; 
} 

Output:


1 2 3 4 5 6 7 8 9 
1 ~ 2 ~ 3 ~ 


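Clearly, the job of ostream_iterator is to allow iterator operations on a
stream, so that algorithms can be applied to streams, which is the essence of
the STL. Combined with the earlier "reading a file" tip, we get the most
convenient way to display a file: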


    copy(istreambuf_iterator<char>(ifs.rdbuf()), 
         istreambuf_iterator<char>(), 
         ostreambuf_iterator<char>(cout) 
    ); 


Likewise, if you use the statement below, you get output with all whitespace
stripped:


    copy(istream_iterator<char>(ifs), 
         istream_iterator<char>(), 
         ostream_iterator<char>(cout) 
    ); 


That is most likely not what you want. What if you insist on using
istream_iterator rather than istreambuf_iterator? There is still a way:


    copy(istream_iterator<char>(ifs >> noskipws), 
         istream_iterator<char>(), 
         ostream_iterator<char>(cout) 
    ); 


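But this is not the recommended approach; it is considerably less efficient
than the first one.
If a file temp.txt has the contents below, my way of reading the data from
the file into a vector should impress you.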


12345 234 567
89    10


Program:


#include <iostream> 
#include <fstream> 
#include <algorithm> 
#include <vector> 
#include <iterator> 

using namespace std; 

int main() 
{   
    ifstream ifs("temp.txt"); 
    
    vector<int> vect; 
    vect.assign(istream_iterator<int>(ifs),
        istream_iterator<int>()
    ); 

    copy(vect.begin(), vect.end(), ostream_iterator<int>(cout, " ")); 
} 

Output:


12345 234 567 89 10 


Cool, isn't it? The grunt work of detecting end of file, moving the file
pointer and so on is all handled by istream_iterator.



-----------------------------------------------------------------------


Other algorithms combined with iterators


Counting the lines in a file:


    int line_count = 
        count(istreambuf_iterator<char>(ifs.rdbuf()), 
              istreambuf_iterator<char>(), 
              '\n');        


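Strictly speaking, this counts the newline characters in the file. In the
same way, you can count occurrences of any character in a file, or of a given
token: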


    int token_count = 
        count(istream_iterator<string>(ifs), 
              istream_iterator<string>(), 
              "#include");        


Note that the code above counts "#include" as a standalone token; occurrences
joined to other characters do not count.


------------------------------------------------------------------------
Manipulator


What is a manipulator? Put simply, it is a function that takes a stream as
its argument and returns a stream. Take the noskipws used above; it is
defined like this:


  inline ios_base& 
  noskipws(ios_base& __base) 
  { 
    __base.unsetf(ios_base::skipws); 
    return __base; 
  } 


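Here it uses the more general ios_base. Knowing this, you probably won't be
afraid of writing a manipulator of your own. The silly manipulator below
ignores all input up to the first semicolon the stream encounters (including
that semicolon):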


template <class charT, class traits>
inline std::basic_istream<charT, traits>&
ignoreToSemicolon (std::basic_istream<charT, traits>& s)
{
    // needs <limits>; ignore() takes its count as a std::streamsize
    s.ignore(std::numeric_limits<std::streamsize>::max(), s.widen(';'));
    return s;
}

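Note, though, that it does not ignore later semicolons, because ignore runs
only once. More generally, a manipulator can also take an argument. The one
below is the general version of ignoreToSemicolon: it takes a parameter, and
the stream ignores all input up to the first occurrence of that parameter. It
is slightly more work to write: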


struct IgnoreTo {
    char ignoreTo;
    IgnoreTo(char c) : ignoreTo(c) 
    {}
};
    
std::istream& operator >> (std::istream& s, const IgnoreTo& manip)
{
    // needs <limits>; ignore() takes its count as a std::streamsize
    s.ignore(std::numeric_limits<std::streamsize>::max(), s.widen(manip.ignoreTo)); 
    return s;
}

But the usage is much the same:


    copy(istream_iterator<char>(ifs >> noskipws >> IgnoreTo(';')), 
         istream_iterator<char>(), 
         ostream_iterator<char>(cout) 
    ); 


The effect is the same as ignoreToSemicolon.

Stanford machine learning series

2008-08-10T17:05:00.001-04:00

This course provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
Complete Playlist for the Course:
http://www.youtube.com/view_play_list...
CS 229 Course Website:
http://www.stanford.edu/class/cs229/

Briefings in bioinformatics -- how impact factor should be calculated

2008-05-27T16:08:00.001-04:00

The review-oriented journal has received its first impact factor from the Journal Citation Reports for 2006: an astounding 24.37, which would rank first in the field of bioinformatics. I have looked at some articles in this journal, including a recent review on operon prediction, so this came as quite a surprise to me. A closer look at the announcement on the journal website and at an editorial reveals that one article, on the phylogenetic reconstruction software MEGA3, contributed significantly to the high score.

The editors of BIB acknowledge this outlier and explicitly point out that once the above article is removed, the impact factor drops to about 4, still considered a relatively high score for a new journal.

While appreciating the editors' honesty, one should look at how the IF is calculated.
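
For reference, the standard two-year impact factor, as I understand it, is the ratio below. The symbols are my own shorthand: C_Y(y) is the number of citations received in year Y by items the journal published in year y, and N_y is the number of citable items it published in year y.

```latex
\mathrm{IF}_{Y} = \frac{C_{Y}(Y-1) + C_{Y}(Y-2)}{N_{Y-1} + N_{Y-2}}
```

With a small denominator, a single heavily cited article in the numerator is enough to move the ratio a long way.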

For a journal that publishes relatively few articles, a large number of citations to one or two articles can affect the score a lot. I imagine the citation (degree) distribution follows a power law, in that a few papers attract most of the citations.

Lenovo T61 woes

2008-05-01T10:26:00.001-04:00

I had waited about half a month for the new computer, only to discover that the Intel 3945A/B/G is not compatible with my Linksys router. The Internet connection drops to about one-fifth of the speed I used to get. All the symptoms seem to go away as soon as I plug in the cable directly. I played around with it for more than three hours last night and tried a dozen suggested solutions, but still no luck.

Finally I gave up and plugged in my old Linksys smart card NIC, and everything seems to be OK now. I wish I had stuck with the Macintosh ...

Knuth shuffle (Fisher-Yates shuffle)

2008-03-27T10:16:00.005-04:00

The shuffling algorithm first occurred to me several years ago when I was learning to program, and it recently resurfaced as I tried to digest a piece of Perl code from a lab note for the PBIO class.

The following is the Perl code:

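#!/usr/bin/perl
use warnings;
use strict;

# Read the sequence from the first line of sequence.txt.
open my $ifile, '<', 'sequence.txt' or die "Cannot open sequence.txt: $!";
chomp (my $seq = <$ifile>);
close $ifile;

my @oldseq = split('', $seq);
srand;
my @newseq = ();
for ( @oldseq ) {
    # Pick a slot among those filled so far, plus the new one at the end.
    my $r = int(rand(@newseq + 1));
    # Move the element in slot $r to the end, then put the new one in its place.
    push(@newseq, $newseq[$r]);
    $newseq[$r] = $_;
}
print @oldseq, "\n";
print @newseq, "\n";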

It walks through the original array and, for each element, swaps it with a randomly chosen element among those already visited (including itself). The k-th step has k choices, so there are n! equally likely execution paths for n! possible outcomes, and it is easy to prove that no two execution paths produce the same permutation.

Note that this is a slight variant of the Knuth shuffle, in which each element is swapped with one that has NOT been visited yet (including itself); the number of execution paths is the same. Both are optimal, with O(n) time complexity.

A more intuitive way (to me) is to think of it in terms of Fisher and Yates' original method, where a random item is drawn out of a hat each time. Note that there are potential pitfalls when implementing this simple permutation algorithm.
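
For comparison, here is a sketch of the classic in-place version, written in C++ (C++11 or later, for <random>) rather than Perl. The function name fisher_yates is my own. The best-known pitfall is drawing j from the whole range at every step: that gives n^n equally likely paths, which cannot map evenly onto n! permutations, so the result is biased.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Sketch of the classic in-place Fisher-Yates (Knuth) shuffle.
template <typename T, typename Rng>
void fisher_yates(std::vector<T>& items, Rng& rng)
{
    for (std::size_t i = items.size(); i > 1; --i) {
        // Draw j uniformly from [0, i-1], the positions not yet fixed.
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(items[i - 1], items[pick(rng)]);
    }
}
```

A call would look like `std::mt19937 rng(seed); fisher_yates(v, rng);` with any seed you choose.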

Reference:

http://en.wikipedia.org/wiki/Fisher-Yates_shuffle

http://en.wikipedia.org/wiki/Shuffle