October 16, 2011

Rod Page’s VIZBI 2011 annotated links

 

iPhylo is hands-down my favorite blog, whose author is Rod Page. In the blog post “Some VIZBI 2011 links”, Rod grabbed a list of visualization software that he noted. I stole the whole list and post it here, but with my own annotations. There are some good ones there that I would like to use in the future, so this post is merely a reminder and click-saver.

November 09, 2010

Why 10,000 heads from 20,000 tosses suggests fake data?

I have this statistical question in mind for some time now. The reason I am interested in this question is that when we some data in a biology paper, sometimes the data are just “too good to be true”. Well, this by itself is not a sufficient reason to discredit a sound paper, but I was always wondering how we can quantify the confidence when seeing such data.

I formulated my question in terms of coin tosses and asked on CrossValidated. Let me just repost it here.

Let's say we are repeatedly tossing a fair coin, and we know number of heads and tails should be roughly equal. When we see a result like 10 heads and 10 tails for a total of 20 tosses, we believe the results and are inclined to believe the coin is fair.

Well when you see a result like 10000 heads and 10000 tails for a total of 20000 tosses, I actually would question the validity of the result (did the experimenter fake the data), as I know this is more unlikely than, say a result of 10093 heads and 9907 tails.

What is the statistical argument behind my intuition?

I got my answer back very quickly and someone suggested I had implicitly invoked Bayes theorem. It was nicely explained. After that answer, I did some thinking and calculation and was able to calculate the probability using an explicit model.

From prior experience, I know the data faker would forge a data that is very close to the “expected”, i.e. small variance. I will just model it as a Normal(mean=10000, sd=10). In the mean time, I know the actual coin tosses would follow a binomial distribution Binom(size=20000, p=0.5).

two_distributions

Note that the fake data has a much sharper peak (low variance) around 10000 – well that’s what fakers do. Let’s say now I have a prior belief of P(fake)=0.1, i.e. before seeing the data, I think the data is forged 10% of the time.

I now have everything I need for the Bayes calculation. The formula is:

bayes_formula

I left all the calculations to R, in calc_prob() function.

> calc_prob(10000)
[1] 0.4399905
> calc_prob(10093)
[1] 3.088808e-19

Seeing 10000 heads, the posterior probability of the data being rigged is 0.43 (recall my prior belief is only 0.1). In contrast, seeing 10093 heads, the posterior probability is effectively zero. After all, who would have rigged a data that’s not pretty!

R source code is attached here.

August 01, 2010

Gambler’s ruin

There is a common casino strategy when playing roulette:

Bet on either odd or even, and stick to your choice. Bet $20 at start, if you lose, double the bet; if you win, start with the $20 bet again. Don’t worry about losing, just keep going.

Let’s assume you have $10000. Well, if you adopt this strategy, you’ll only go bankrupt ~1/500 of the time – this happens when you lose 9 times in a row (2**9=512), then you’ll have no money at all. But if you win at any point, you go back to where you were, plus $20 more. For example, let’s say you lose 3 times followed by a win: that’s –20-40-80+160= $20.

This simulation (python script) suggests that if you follow this strategy for a few rounds (say 50), more than 90% of the people will make money, with few people bankrupt.

   1: $ python casino_strategy.py
   2: Simulating 50 rolls per trial and generate 10000 trials
   3: mean: 10006,  median: 10480
   4: 93% of the samples earned money
   5: 3% of the samples got bankrupt 

The following graph simulates 20 trials, and each line tracks their balance at any given round. In the end (round 50), only 1 guy lost money, while the others all earn money, and no people went bankrupt. Notice the typical pattern is exponential drop followed by rebound in a single round. Most people earn little, few guys lose a lot. Now the question is, do you want to be that poor guy who lost a lot?

casino_strategy

In addition, if you adopt this strategy long enough, eventually you'll go bankrupt, that’s called Gambler's ruin, since “bankrupt” is an absorption state.

July 30, 2010

Promethease report: Hyungyong Kim

What can you do with your genome deciphered by 23andme? For those who are not into tech startups, this is a personal genomics company that reads your genetic information and sells it back to you. There are a couple of others, like deCODEme, Navigenics, and BioResolve. Among these, I think 23andme is particularly promising – oh did I mention its cofounder Anne Wojcicki, is the wife of Google co-founder Sergey Brin?

Get back to the genome data topic. This is a typical report you get from company 23andme, which is essentially a spreadsheet for all the SNPs that they test on their 2nd generation chip. Each line is a single SNP, with the typical SNP ID (rsid), its location on the reference human genome, and the genotype call. The genotype shows two nucleotides, since we have two copies (from both parents) of each SNP position.

   1: # rsid  chromosome      position        genotype
   2: rs3094315       1       742429  AA
   3: rs12562034      1       758311  AG
   4: rs3934834       1       995669  CC
   5: rs9442372       1       1008567 GG
   6: rs3737728       1       1011278 GG
   7: rs11260588      1       1011521 GG
   8: rs6687776       1       1020428 CC

Hyungyong Kim, a bioinformatician working in a Korean biotech firm, had his genome sequenced by 23andme, is open enough to put the genome on his blog – so that everyone can download it, hack it and know all his weaknesses or strengths. The problem is that you’ll have to read through a ton of literatures to understand what this spreadsheet tells you.


Enter Promethease – the software that does all the magic – will annotate the SNPs for you. So for Kim’s data, it will report interesting SNPs, drug metabolism, medical conditions, plus more complicated information.


Promethease_Kim


Going down the list. Kim is a male, asian, slow caffeine metabolizer, maybe a few disease susceptibilities here and there. Is this good news or bad news for Kim?

July 28, 2010

My dear frond

I have been into Chaos game for quite a while. The game proceeds by iteratively generating a sequence of points, using a series of simple linear rewriting rules, or in geek terms – iterated function system. Unexpectedly, complex shapes arise with simple formulas, particularly for shapes that are self-similar – trees, leaves, snowflakes, broccoli, coastlines etc. So with that in mind, the following is actually a fractal drawing of fern leaf, versus a real fern leaf. Can you tell the difference?

ifs fern-leaf-isolated-largethumb1135377[1]

For those who are into details, the following is the python code to generate it, using the excellent matplotlib library. Note the re-writing functions and probabilities were taken shamelessly from the wiki page.

The world’s first Megatron prototype

A recent report on the Proceedings of the National Academy of Sciences by Mit-Geeks et al., reveals a miniaturized machine “shape-shifter” that can achieve two distinct shapes – a boat or a plane. The actual robot is cunningly disguised as a sheet of interconnected triangles. The key innovation is the actuator – a device that can bend itself when heated. The creases (white stretch on the sheet) are made of silicon thus quite flexible.

megatron[1] origami

The research is the product of a growing field called computational origami – which, in addition to fueling my own imagination, have proclaimed applications in machine folding, protein folding, airbag design, among many others. Note the entire operation from the sheet to a targeted geometry can be achieved within 20 seconds in the following video. Un-surprisingly, the research was funded by DARPA.

July 13, 2010

Javascript bio-sequence highlighter - I

A few weeks back during lunch chat with brentp, we discussed a simple javascript highlighter for biological sequences, much like the syntax highlighter for programming languages (for example, Artistic Style for C-like languages or Pygments for python). The raw sequence files, usually in FASTA format, is just plain text – and very boring to stare at.

The biggest issue with biological code is that there are no punctuation marks. Luckily, computational biologists over the last few decades have come up with ways to identify various sequence elements, like genes and repeats. Such annotated sequence features extend over “genomic intervals”, and are typically stored in a GFF file or BED file.

The javascript code that I started to write today takes as input a FASTA file and a BED file, and then outputs the HTML codes for the annotated sequence. As a first attempt, it is fairly straightforward, with some preset styles defined in a default CSS file. For example, genes are always plotted with a lavendar background, etc.

This code is far from stable, especially stuff that involves nesting of sequence features (e.g. exons should be nested within genes), but right now this is not modeled at all, and resulting in nasty bugs.

See a working demo here, and screenshot below. Comments and suggestions are welcome, as always.

js_sequence_highlighter_screenshot

July 07, 2010

TimeTree App – evolution in your palm

I have rarely looked at iPhone AppStore. The only time I did - I just looked at Top25 or 50 to see what’s hot at the moment. An unexpectedly useful app, TimeTree, turned up when I searched for the keyword genome (please don’t ask why I entered this). Here is the description:

TimeTree is a public knowledge-base for information on the evolutionary timescale of life. This application allows easy exploration of the thousands of divergence times among organisms in the scientific literature. A tree-based (hierarchical) system is used to identify all published molecular time estimates bearing on the divergence of two chosen organisms, such as species, compute summary statistics, and present the results. Names of two taxa to be compared are entered in the search window and the results are presented on a set of self-explanatory tabs.

My own query with arabidopsis and grape gave me a fast and informative result.

image_2image image_1 

The divergence time estimates are mostly based on sequence data, and categorized into nuclear, chloroplast etc, depending on the source. Multiple estimates are weighted (with large variations among different estimates as often expected from this kind of study). References are nicely summarized with  # of genes sampled, data type, table # in the publication… then there is also a reference to the book. I recognized one of the authors - Sudmir Kumar wrote the MEGA software, a user-friendly GUI tool for exploratory evolutionary analysis.

I also played with escherichia and rice (prokaryote vs. eukaryote), giving me an estimated divergence time of 2622.2Mya. Can you get a longer evolutionary distance than this?