Documentation of SHOT vs. 2.0 - Shared Ortholog and Gene Order Tree Reconstruction Tool

SHOT vs. 2.0 is an updated version of the SHOT web server for the reconstruction of genome phylogenies (originally described in Korbel et al., 2002). Compared with phylogenies derived from single genes, genome phylogenies are less affected by horizontal gene transfer, unrecognized paralogy, highly variable rates of gene evolution, or misalignment (Snel et al., 1999; Huynen et al., 1999; Fitz-Gibbon and House, 1999). Distance-based phylogenies can be generated with two independent strategies. Gene content phylogenies follow the approach introduced by Snel et al. (1999). Alternatively, SHOT can generate phylogenies of prokaryotes based on the conservation of gene order across the whole genome.

For SHOT gene content phylogenies, the similarity between two species is defined as the ratio of the number of shared orthologs to a normalization value reflecting genome size. For gene order phylogenies, distances are derived from the number of conserved orthologous gene pairs. Here, a conserved gene pair is defined as two orthologs that form an adjacent pair of genes with conserved transcriptional directions in both genomes.
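
The definition of a conserved gene pair can be sketched in a few lines of Python. This is a minimal illustration, not SHOT's actual code: it assumes each genome is a list of (orthologous group, strand) tuples in chromosomal order, ignores chromosome circularity, and treats a pair read from the opposite DNA strand (order reversed, both strands flipped) as the same arrangement. The function names and the COG-style identifiers are hypothetical.

```python
def canonical_pair(gene1, gene2):
    """Return a strand-independent key for two adjacent genes.

    Each gene is (group_id, strand) with strand +1 or -1. Reading the same
    two genes from the opposite DNA strand reverses their order and flips
    both strands, so both readings are mapped to the same key.
    """
    forward = (gene1, gene2)
    reverse = ((gene2[0], -gene2[1]), (gene1[0], -gene1[1]))
    return min(forward, reverse)


def adjacent_pairs(genome):
    """Collect the canonical keys of all neighbouring gene pairs."""
    return {canonical_pair(a, b) for a, b in zip(genome, genome[1:])}


def count_conserved_pairs(genome_a, genome_b):
    """Count adjacent pairs present, with the same relative directions, in both genomes."""
    return len(adjacent_pairs(genome_a) & adjacent_pairs(genome_b))


# Hypothetical toy genomes (assumption: identifiers are made up).
genome_a = [("COG1", 1), ("COG2", 1), ("COG3", -1), ("COG4", 1)]
genome_b = [("COG4", 1), ("COG3", 1), ("COG2", -1), ("COG1", -1), ("COG5", 1)]

print(count_conserved_pairs(genome_a, genome_b))  # 2
```

In this toy case the pairs COG1/COG2 and COG2/COG3 are conserved (genome_b carries them on the opposite strand), whereas COG3/COG4 is adjacent in both genomes but with a changed relative direction and is therefore not counted.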

The current version of SHOT uses orthology assignments that are based on the COG database (Tatusov et al., 2001) and have been refined by the maintainers of the STRING database, as described in von Mering et al. (2003). In contrast to the original version of SHOT, in which pairs of bi-directional best hits were defined as orthologs (see Korbel et al., 2002), the common presence of an orthologous group in two genomes is now counted as a single instance of a "shared ortholog". Phylogenetic trees are constructed using tools from the PHYLIP package (Felsenstein, 1989).
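
Counting each orthologous group once, regardless of how many paralogs a genome contains, amounts to a set intersection. The following sketch assumes that each genome is given as a list with one orthologous group identifier per gene; the function name and the identifiers are illustrative and not part of SHOT.

```python
def shared_orthologs(groups_a, groups_b):
    """Count the orthologous groups that occur in both genomes.

    Duplicated groups (paralogs) within one genome are counted only once.
    """
    return len(set(groups_a) & set(groups_b))


# Hypothetical group assignments, one entry per gene.
genome_a = ["COG0001", "COG0001", "COG0002", "COG0003"]
genome_b = ["COG0001", "COG0003", "COG0004"]

print(shared_orthologs(genome_a, genome_b))  # 2
```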

How to get started with SHOT

Essential inputs of SHOT are (1) the desired reconstruction method (gene content phylogeny or gene order phylogeny), and (2) a selection of species to be included in the phylogeny. If 'apply selection below' is chosen, the user can select individual species (by checking the boxes) at the bottom of the page.

By default, the output format is a PNG-image of a graphical unrooted tree, with the option to download the tree as a PDF file or in Newick format. Furthermore, the respective distance matrix can be downloaded (species are therein represented by their taxonomic IDs, which can be retrieved from the NCBI taxonomy homepage).

Advanced input parameters for gene content phylogenies

Normalization value:
1. Division by the size of the smaller of two genomes compared (the theoretical maximum of shared orthologs).
2. Division by the weighted average genome size [applied by default]. This value is obtained from an empirical formula, sqrt(2) * a * b / sqrt(a^2 + b^2), where a and b are the sizes of the two genomes compared. The formula represents a fit to the number of orthologs shared between archaeal and eubacterial genomes as a function of the eubacterial genome size (see Fig. 1 in Snel et al., 1999). This option represents a more reasonable evolutionary model than normalizing by the size of the smaller genome, since the number of genes shared with the Archaea continues to increase with bacterial genome size even for very large genomes, albeit more slowly.
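
The two normalization options can be compared with a short Python sketch. The weighted average follows the formula quoted above; the function names and the example numbers are assumptions made for illustration only.

```python
import math


def smaller_genome(a, b):
    """Option 1: size of the smaller of the two genomes."""
    return min(a, b)


def weighted_average(a, b):
    """Option 2: sqrt(2) * a * b / sqrt(a^2 + b^2)."""
    return math.sqrt(2) * a * b / math.sqrt(a * a + b * b)


def similarity(shared, a, b, normalization=weighted_average):
    """Shared orthologs divided by the chosen normalization value."""
    return shared / normalization(a, b)


# Hypothetical genome sizes and number of shared orthologs.
a, b, shared = 3000, 4000, 1500
print(similarity(shared, a, b, smaller_genome))
print(similarity(shared, a, b, weighted_average))
```

For two genomes of equal size the weighted average equals that size. As the larger genome grows, the value rises above the size of the smaller genome and approaches sqrt(2) times that size, which mirrors the slow increase described above.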

Genome size definition:

1. Genome size is defined as the number of open reading frames (ORFs) annotated as protein coding genes.
2. Genome size is the number of ORFs with at least one ortholog in other completed genomes. By disregarding orphan ORFs, this option reduces the effect of the considerable variation in genome annotation quality. Therefore, it may be a better estimate of the theoretical maximum number of orthologs.
3. Genome size is the number of orthologous groups present in a species. This option reduces the size of genomes that have experienced a large number of recent duplications. Given the improved definition of orthology (see above), we regard this option as the most reasonable measure.
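
The three genome size definitions can be illustrated as follows. The sketch assumes that a genome is a list with one entry per annotated ORF, holding the orthologous group identifier or None for an orphan ORF, and that membership in an orthologous group is equivalent to having an ortholog in another completed genome. Both the data layout and the function name are hypothetical.

```python
def genome_sizes(orfs):
    """Return the genome size under the three definitions.

    orfs: one entry per annotated ORF; a group ID, or None for an orphan ORF.
    """
    n_orfs = len(orfs)                                     # definition 1
    n_with_ortholog = sum(g is not None for g in orfs)     # definition 2
    n_groups = len({g for g in orfs if g is not None})     # definition 3
    return n_orfs, n_with_ortholog, n_groups


# Hypothetical genome: COG1 is duplicated, two ORFs are orphans.
orfs = ["COG1", "COG1", "COG2", None, None, "COG3"]
print(genome_sizes(orfs))  # (6, 4, 3)
```

The duplicated group lowers the third value relative to the second, which is the effect described for genomes with many recent duplications.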

Distance measure:
Evolutionary distances d are directly determined from estimated similarities s (see Swofford & Olsen 1990).
1. d=-ln(s) is applied by default.
2. Alternatively, d=1-s may be used. This function is less well supported by models of evolution (Swofford & Olsen 1990) and therefore provides a poorer estimate of evolutionary distances for weak similarities. We thus tend to apply d=1-s mainly for testing the robustness of clusters.
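
The difference between the two distance measures is easiest to see numerically. The sketch below assumes similarities in the range 0 < s <= 1; the function names are illustrative.

```python
import math


def log_distance(s):
    """Default measure: d = -ln(s)."""
    if not 0 < s <= 1:
        raise ValueError("similarity must satisfy 0 < s <= 1")
    return -math.log(s)


def linear_distance(s):
    """Alternative measure: d = 1 - s."""
    return 1.0 - s


for s in (0.9, 0.5, 0.1):
    print(f"s={s}: -ln(s)={log_distance(s):.3f}, 1-s={linear_distance(s):.3f}")
```

For a strong similarity such as s = 0.9 the two measures nearly coincide (about 0.105 versus 0.1), whereas for a weak similarity such as s = 0.1 they diverge (about 2.30 versus 0.9), because d = 1-s cannot exceed 1.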

Clustering algorithm:
1. Neighbor-joining (Saitou & Nei 1987) is the default clustering algorithm.
2. The slower Fitch-Margoliash algorithm (Fitch and Margoliash 1967) may be applied instead.
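
As a rough illustration of how neighbor-joining works on a distance matrix, the sketch below performs only the first joining step: it selects the pair of taxa that minimizes the criterion Q(i,j) = (n-2)*d(i,j) - sum_k d(i,k) - sum_k d(j,k) and computes the two branch lengths to the new internal node. It is not the PHYLIP implementation used by SHOT; the function name and the example matrix are assumptions made for illustration.

```python
def nj_first_join(dist):
    """Return (i, j, branch_i, branch_j) for the first neighbor-joining step.

    dist: symmetric distance matrix (list of lists) with at least 3 taxa.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    totals = [sum(row) for row in dist]

    # Find the pair with the smallest Q value.
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best

    # Branch lengths from i and j to the newly created internal node.
    branch_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return i, j, branch_i, branch_j


# Illustrative distances between five taxa (not SHOT output).
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_first_join(matrix))  # (0, 1, 2.0, 3.0)
```

Note that the first pair joined (taxa 0 and 1, distance 5) is not the pair with the smallest raw distance (taxa 3 and 4, distance 3); the Q criterion also accounts for how far each taxon is from all others. A complete run would replace the joined pair by the new node, recompute the distances, and repeat until the tree is resolved.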

Advanced input parameters for gene order phylogenies

Genes considered:
1. All ORFs annotated as genes are analysed for the presence of conserved gene pairs [default selection].
2. Only genes that have an ortholog in the genome being compared are analysed for the presence of conserved gene pairs; genes without an ortholog in the other genome are treated as if they were not annotated. (Events that affect only the gene content of a genome are thereby ignored.)
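
The effect of the second option can be illustrated by filtering a gene list before looking for adjacent pairs. The sketch assumes genomes are lists of (orthologous group, strand) tuples in chromosomal order; the function name and identifiers are hypothetical.

```python
def restrict_to_shared(genome, other):
    """Drop genes whose orthologous group does not occur in `other`.

    The remaining genes keep their original order, so genes separated only
    by removed genes become neighbours.
    """
    other_groups = {group for group, _strand in other}
    return [gene for gene in genome if gene[0] in other_groups]


# Hypothetical genomes: COG9 is present only in genome_a.
genome_a = [("COG1", 1), ("COG9", 1), ("COG2", 1)]
genome_b = [("COG1", 1), ("COG2", 1)]

print(restrict_to_shared(genome_a, genome_b))  # [('COG1', 1), ('COG2', 1)]
```

With all ORFs considered, COG1 and COG2 are not adjacent in genome_a and the pair is not conserved. After filtering, the inserted gene COG9 is ignored and the pair counts as conserved, so a pure gene content change no longer influences the gene order distance.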

Normalization:
The number of conserved gene pairs is normalized by the size of the smaller genome, which correlates with the maximum number of conserved gene pairs theoretically possible. Genome size may be defined as follows:

1. The genome size is the number of open reading frames (ORFs) annotated as genes.
2. The genome size is the number of ORFs with at least one ortholog in other genomes completed so far.
3. The genome size is the number of orthologous groups present in a species.
4. In addition, the number of orthologs shared between the genomes compared may be used as the genome size. This option should be applied if only genes having an ortholog in the genome being compared are considered for the gene order analysis; in that case, the value represents the theoretical maximum number of conserved gene pairs.
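
The normalization for gene order phylogenies can be written as a one-line ratio. The sketch below is an illustration under the assumption that the similarity is the number of conserved gene pairs divided by the size of the smaller genome, as described above; the function name and the numbers are made up.

```python
def gene_order_similarity(conserved_pairs, size_a, size_b):
    """Conserved gene pairs divided by the size of the smaller genome."""
    smaller = min(size_a, size_b)
    if smaller <= 0:
        raise ValueError("genome sizes must be positive")
    return conserved_pairs / smaller


# Hypothetical counts: 300 conserved pairs, genomes with 1500 and 4000 ORFs.
print(gene_order_similarity(300, 1500, 4000))

# Option 4: both "genome sizes" are set to the number of shared orthologs.
shared = 900
print(gene_order_similarity(300, shared, shared))
```

Under option 4 the denominator is the same for both genomes, so taking the minimum has no effect and the similarity depends only on the genes the two genomes have in common.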

Distance measure and clustering algorithm:
- Selectable parameters are identical for gene order and gene content phylogenies.

References

Felsenstein, J. (1989). PHYLIP-phylogeny inference package (Version 3.2). Cladistics, 5, 164-166.

Fitch, W.M. & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279-284.

Fitz-Gibbon, S.T. & House, C.H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res., 27, 4218-4222.

Huynen, M., Snel, B. & Bork, P. (1999). Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science, 286, 1443a.

Korbel, J.O., Snel, B., Huynen, M.A. & Bork, P. (2002). SHOT: a web server for the construction of genome phylogenies. Trends Genet., 18, 158-162.

Pearson, W. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71-84.

von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. & Snel, B. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31, 258-261.

Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406-425.

Snel, B., Bork, P. & Huynen, M. (1999). Genome phylogeny based on gene content. Nat. Genet., 21, 108-110.

Swofford, D.L. & Olsen, G.J. (1990). Phylogeny construction. In Molecular Systematics (Hillis, D.M. & Moritz, C., eds), pp. 411-501. Sinauer Associates.

Tatusov, R.L. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22-28.