Tilastotieteellistä ohjelmointia R-kielellä

Phylogenetic analyses using R

Phylogenetics (or molecular systematics) is a methodology that is used for inferring evolutionary relationships between individuals or taxa, such as species. Nowadays phylogenetics is often classified as a subfield of bioinformatics. Phylogenetic bioinformatics is distinctively a computational field, although phylogenetics as such just has deep roots in episdemology and theory of evolution.

Contemporary computational phylogenetics consists of two broad methodological fields that have somewhat incompatible views on the theory of phylogenetic inference. These methodological families are 1) parsimony-based (”cladistic parsimony”) and 2) model-based (”statistical phylogenetics”) methodologies. Due to the slightly troubled history of the model-based methods, these fields are still rather separated from each other.

Both parsimony and model-based fields have several prominent software packages for inferring phylogenies, but these do not necessarily interoperate nicely or even at all. Of course, the same can happen inside the field, too. For example, both MrBayes and RAxML are well-known programs for statistical phylogenetics. MrBayes uses NEXUS-formatted input files, whereas RAxML wants to get PHYLIP-formatted sequence alignments as an input. Therefore, we are stuck with a handful of different input file formats that we need to be able to convert to other formats, if we want to utilize several software packages during the analysis.

This is where R comes to help, since it allows easy scripting of several different software packages. Let’s see what R can do.

Generic analysis workflow

A generic phylogenetic analysis workflow with molecular data could flow as follows:

  • Gather sequence data
  • Align the sequences
  • Infer trees
  • Assess tree reliability (e.g., with bootstrapping)
  • Report the results

I have previously published a post about molecular systematics in Finnish. The code published in the previous post and on GitHub executes the mentioned steps with a couple of model-based methods and software. The code is entirely automatic, and it generates an output report in a PDF file. I just updated the script into AutomatedPhylogeneticAnalysis.R, which adds a couple of more steps to the analysis.

On top of the updated automatic script, I also produced an example analysis of Eutherian phylogeny. This example is more comprehensive than the automated script, since it examplifies several extra step that are not available in the automated script. Let’s take a closer look into the Eutheria example.

An example on Eutheria

Eutherians are mammals (e.g., horse and human) that have a placenta. Other extant mammalian clades, marsupialia (mammals with abdominal pouches, such as kangaruus) and monotremata (egg-laying mammals, such as duck-billed platypus) dot not have a placenta.

The analysis was originally published in PNAS, and was based on whole mitochondrial DNA sequences (mtDNA). Here, only three mitochondrial genes will be used: COX3, CYTB and ND5. The accession numbers for the mtDNA sequences were first screen scraped from PNAS, and the gene sequences were then extracted from the Genbank using R package AnnotationBustR.

The sequences were then aligned either with Muscle software or its native implementation in the R package msa. All three sequences were then concatenated into the same dataset. Here R becomes especially handy, because it allows concatenation of several DNA sequences in such a way that missing data is by default coded as missing data. Performing such concatenation by hand is rather painful. There are, of course, also other programs that allow one to do that, such as Mesquite.

Once the aligned sequences are concatenated into one long dataset, the alignment is converted into the formats that are needed: FastA (basic format), PHYLIP format and NEXUS format. Before analysis, the evolutionary model that fits the data best is searched for, and the model is used during the analyses. The alignment files are then analysed using 1) neighbor joining with logdet -distances, 2) parsimony in TNT, 3) maximum likelihood in RAxML and 4) bayesian method in MrBayes. Bootstrapping is performed to assess the reliability of the distance tree, but it can be extented to cover other methods, also.

The results are reported in native format (NEXUS- and NEWICK-trees) as well as a PDF file. PDF file also includes pairwise comparisons of the tree topologies, which makes it easier to spot differences between the results generated with different methods. Separate files contain results from Shimodaira’s tests and topological distances between the trees. These all help in figuring out the extent of the differences between the resulting tree.

The whole analysis is coordinated by a MASTER-script that can be copied and pasted into R to reproduce the analysis. This MASTER-script just calls other scripts in a concerted and correct fashion.

The codes and results from the analysis can be retrieved from the Github repository. I hope this serves as an example of how R can be used to coordinate phylogenetic analyses.

If you are interested in how the trees actually appear, see here: https://github.com/jtuimala/eutheria-phylogeny/blob/master/data/trees.pdf.

Tags: , , , , , , ,


Sähköpostiosoitettasi ei julkaista. Pakolliset kentät on merkitty *