Documentation of SHOT vs. 2.0 - Shared Ortholog and Gene Order Tree Reconstruction Tool
SHOT vs. 2.0 is an updated version of the SHOT web
server for the reconstruction of genome phylogenies (originally described in Korbel et al., 2002).
Genome phylogenies are less affected by horizontal gene transfer,
unrecognized paralogy, highly variable rates of gene evolution, or
misalignment than phylogenies based on single genes (Snel et al.,
1999; Huynen et al., 1999; Fitz-Gibbon
and House, 1999). SHOT generates distance-based phylogenies
using one of two independent strategies. Gene content phylogenies
follow the approach introduced by Snel et al. (1999).
Alternatively, SHOT can generate phylogenies
of prokaryotes based on the conservation of gene order across the
whole genome.
For SHOT gene content phylogenies, similarity between two species is defined as the
ratio of the number of shared orthologs and a normalization value
reflecting genome size. For gene order phylogenies,
distances are derived from the number of conserved orthologous gene
pairs. Here, a conserved gene pair is defined as two genes that are
adjacent in one genome and whose orthologs are also adjacent in the
other genome, with conserved transcriptional directions.
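The definition of a conserved gene pair can be illustrated with a minimal sketch. It rests on assumptions that are not part of the SHOT documentation: each genome is a linear list of hypothetical `(orthologous_group, strand)` tuples, and a pair read from the opposite strand (order reversed, both strands flipped) is treated as the same pair.

```python
def canonical_pair(left, right):
    """Return an orientation-independent key for two adjacent genes.

    Each gene is (orthologous_group, strand) with strand +1 or -1. Reading the
    same locus from the opposite strand reverses the gene order and flips both
    strands, so both readings are mapped to the same key.
    """
    forward = (left, right)
    flipped = ((right[0], -right[1]), (left[0], -left[1]))
    return min(forward, flipped)


def adjacent_pairs(genome):
    """Set of canonical keys for all neighbouring genes in a linear gene list."""
    return {canonical_pair(a, b) for a, b in zip(genome, genome[1:])}


def count_conserved_pairs(genome_a, genome_b):
    """Number of adjacent pairs found in both genomes with matching directions."""
    return len(adjacent_pairs(genome_a) & adjacent_pairs(genome_b))


# Hypothetical toy genomes: genome_b carries the COG1-COG2-COG3 block inverted.
genome_a = [("COG1", 1), ("COG2", 1), ("COG3", -1), ("COG4", 1)]
genome_b = [("COG3", 1), ("COG2", -1), ("COG1", -1), ("COG4", 1)]

print(count_conserved_pairs(genome_a, genome_b))
```

Because the pairs are collected in a set, repeated occurrences of the same pair of orthologous groups are counted once; whether SHOT treats such cases the same way is not stated here.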
The current version of SHOT uses orthology assignments that are based on
the COG database (Tatusov et al., 2001) and have been refined by the maintainers of the STRING
database as described in von Mering et al. (2003). The original version of SHOT defined pairs of bi-directional best hits as orthologs (see Korbel et al., 2002). In contrast, the current version counts the common presence of an orthologous group in two genomes as a single instance of a "shared ortholog".
Phylogenetic trees are constructed using tools from the PHYLIP package (Felsenstein, 1989).
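For readers who want to run PHYLIP's distance-based programs on their own matrices, the following sketch writes a square distance matrix in the classic PHYLIP layout: the number of taxa on the first line, then one row per taxon with the name padded to 10 characters. The function name, taxon labels and distances are illustrative placeholders and are not taken from SHOT.

```python
def format_phylip_matrix(names, matrix):
    """Render a square distance matrix in PHYLIP's distance-matrix layout."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        if len(name) > 10:
            raise ValueError(f"name {name!r} exceeds 10 characters")
        # Names occupy exactly 10 columns; distances follow on the same line.
        lines.append(name.ljust(10) + " ".join(f"{value:.4f}" for value in row))
    return "\n".join(lines) + "\n"


# Placeholder labels and distances, for illustration only.
taxon_ids = ["1001", "1002", "1003"]
distances = [
    [0.00, 0.35, 0.80],
    [0.35, 0.00, 0.75],
    [0.80, 0.75, 0.00],
]

text = format_phylip_matrix(taxon_ids, distances)
print(text)
```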
How to get started with SHOT
Essential inputs of SHOT are (1) the
desired reconstruction method (gene content phylogeny or gene order
phylogeny), and (2) a selection of species to be
included in the phylogeny. If 'apply selection below' is
chosen, the user can select individual species (by checking the boxes) at the bottom of the
page.
By default, the output format is a PNG-image of a graphical unrooted tree, with the option to download the tree as a PDF file or in Newick format. Furthermore, the respective distance matrix can be downloaded (species are therein represented by their taxonomic IDs, which can be retrieved from the NCBI taxonomy homepage).
Advanced input parameters for gene content phylogenies
Normalization value:
1. Division by the size of the smaller of two genomes compared (the
theoretical maximum of shared orthologs).
2. Division by the
weighted average genome size [applied by default]. This value is
obtained from an empirical formula, sqrt(2) * a * b / sqrt(a^2 + b^2),
where a and b are the sizes of the two genomes compared. The formula
represents a fit to the number of orthologs shared between archaeal
and eubacterial genomes as a function of the eubacterial genome size
(see Fig. 1 in Snel et al. (1999)). This
option represents a more reasonable evolutionary model than normalizing by
the size of the smaller genome, since the number of genes
shared with the Archaea continues to increase with bacterial genome size
even for very large genomes, albeit more slowly.
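The two normalization values can be compared with a short sketch. The function names and the example numbers are hypothetical; only the formula itself comes from the text above.

```python
import math


def smaller_genome_size(a, b):
    """Option 1: the size of the smaller of the two genomes."""
    return min(a, b)


def weighted_average_size(a, b):
    """Option 2: sqrt(2) * a * b / sqrt(a^2 + b^2)."""
    return math.sqrt(2) * a * b / math.sqrt(a ** 2 + b ** 2)


def similarity(shared_orthologs, a, b, normalization=weighted_average_size):
    """Ratio of shared orthologs to the chosen normalization value."""
    return shared_orthologs / normalization(a, b)


# Hypothetical genome sizes and shared-ortholog count.
a, b, shared = 1000, 4000, 600
print(similarity(shared, a, b, smaller_genome_size))
print(similarity(shared, a, b, weighted_average_size))
```

For equal genome sizes the weighted average equals that size. As one genome grows much larger than the other, the value approaches sqrt(2) times the smaller size, so it always lies between the smaller genome size and sqrt(2) times that size.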
Genome size definition:
1. Genome size is defined as the
number of open reading frames (ORFs) annotated as protein coding
genes.
2. Genome size is the number of ORFs with at least one
ortholog in other completed genomes. By disregarding orphan ORFs, this option reduces
the influence of variations in genome annotation quality. It
may therefore be a better estimate of the theoretical maximum number of orthologs.
3. Genome size is the number of orthologous groups present in a species.
This option reduces the size of genomes that have experienced a large number
of recent duplications. Given the improved definition of orthology (see above),
we regard this option as the most reasonable measure.
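The three genome size definitions can be contrasted on a toy example. This is a sketch under assumptions: each genome is a hypothetical mapping from gene identifier to orthologous group (or `None` for genes without an assignment), and "having an ortholog in another genome" is approximated as "belonging to a group that occurs in at least one other genome".

```python
def size_all_orfs(genome):
    """Definition 1: every ORF annotated as a protein-coding gene."""
    return len(genome)


def size_orfs_with_orthologs(genome, other_genomes):
    """Definition 2: ORFs whose orthologous group occurs in another genome."""
    groups_elsewhere = set()
    for other in other_genomes:
        groups_elsewhere.update(g for g in other.values() if g is not None)
    return sum(1 for g in genome.values() if g in groups_elsewhere)


def size_orthologous_groups(genome):
    """Definition 3: distinct orthologous groups present in the genome."""
    return len({g for g in genome.values() if g is not None})


# Hypothetical genomes; x4 is an orphan ORF, COG1 and COG2 are duplicated in x.
genome_x = {"x1": "COG1", "x2": "COG1", "x3": "COG2",
            "x4": None, "x5": "COG9", "x6": "COG2"}
genome_y = {"y1": "COG1", "y2": "COG2", "y3": "COG3"}

print(size_all_orfs(genome_x))
print(size_orfs_with_orthologs(genome_x, [genome_y]))
print(size_orthologous_groups(genome_x))
```

In this example the recent duplicates (x2, x6) inflate definitions 1 and 2 but not definition 3, which is the effect described for the third option.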
Distance measure:
Evolutionary distances d are determined directly from
estimated similarities s (see Swofford & Olsen 1990).
1. d=-ln(s) is applied by default.
2. Alternatively, d=1-s
may be used. This function is less well supported by models of evolution
(Swofford & Olsen 1990) and thus provides a poorer estimate of
evolutionary distances for weak similarities. We therefore tend to use
d=1-s mainly for testing the robustness of clusters.
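The behaviour of the two distance functions can be compared in a few lines. The function names are hypothetical, and the sketch assumes similarities in the range 0 < s <= 1.

```python
import math


def log_distance(s):
    """Default measure: d = -ln(s), defined for 0 < s <= 1."""
    if not 0 < s <= 1:
        raise ValueError("similarity must satisfy 0 < s <= 1")
    return -math.log(s)


def linear_distance(s):
    """Alternative measure: d = 1 - s."""
    return 1.0 - s


# Compare both measures for strong and weak similarities.
for s in (0.9, 0.5, 0.1):
    print(f"s={s:.1f}  -ln(s)={log_distance(s):.3f}  1-s={linear_distance(s):.3f}")
```

For high similarities the two measures nearly agree, whereas for weak similarities d=1-s saturates towards 1 while d=-ln(s) keeps growing.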
Clustering algorithm:
1. Neighbor-joining (Saitou & Nei 1987) is the default clustering
algorithm.
2. The slower Fitch-Margoliash
algorithm (Fitch and Margoliash 1967) may be applied instead.
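To make the default clustering step more concrete, here is a compact textbook-style sketch of neighbor-joining on a distance matrix. It is not the PHYLIP implementation that SHOT relies on; the function name, the dictionary-based bookkeeping and the toy matrix are all assumptions made for illustration, and at least three taxa are required.

```python
def neighbor_joining(names, dist):
    """Return an unrooted tree in Newick format from a symmetric distance matrix."""
    # Distance table keyed by node label, so clusters can be added by name.
    table = {a: {b: dist[i][j] for j, b in enumerate(names)}
             for i, a in enumerate(names)}
    nodes = list(names)
    while len(nodes) > 3:
        n = len(nodes)
        total = {a: sum(table[a][b] for b in nodes) for a in nodes}
        # Join the pair minimising Q(a, b) = (n - 2) * d(a, b) - total(a) - total(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * table[p[0]][p[1]] - total[p[0]] - total[p[1]],
        )
        # Branch lengths from the new internal node to a and to b.
        len_a = 0.5 * table[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = table[a][b] - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new internal node to all remaining nodes.
        table[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                value = 0.5 * (table[a][c] + table[b][c] - table[a][b])
                table[new][c] = table[c][new] = value
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    # Resolve the last three nodes around a single central node.
    x, y, z = nodes
    len_x = 0.5 * (table[x][y] + table[x][z] - table[y][z])
    len_y = 0.5 * (table[x][y] + table[y][z] - table[x][z])
    len_z = 0.5 * (table[x][z] + table[y][z] - table[x][y])
    return f"({x}:{len_x:.3f},{y}:{len_y:.3f},{z}:{len_z:.3f});"


# Toy additive distance matrix for five hypothetical taxa.
taxa = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(taxa, matrix))
```

The sketch recomputes row totals in every round, which keeps it short; it is meant to show the joining criterion and branch-length formulas rather than to be efficient.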
Advanced input parameters for gene order phylogenies
Genes considered:
1. All ORFs annotated as genes are
analysed for the presence of conserved gene pairs [default
selection].
2. Only genes that have an ortholog in
the other genome are analysed for the presence of
conserved gene pairs; genes without an ortholog in the
other genome are treated as if they were not annotated. (Events that affect only the genomic gene content are thus ignored.)
Normalization:
Numbers of conserved gene pairs are
normalized according to the genome size of the smaller genome, which
correlates with the maximum number of conserved gene pairs
theoretically possible. Genome size may be defined as follows:
1. The genome size is the number of open reading frames (ORFs) annotated as
genes.
2. The genome size is the number of ORFs with at least
one ortholog in other genomes completed so far.
3. The genome size is the number of orthologous groups present in a species.
4. In addition, the number of
orthologs shared between the genomes compared may be used as the
genome size. This option should be applied if only genes that have an
ortholog in the other genome are considered for the gene
order analysis. (In that case, this value represents the theoretical
maximum of conserved gene pairs.)
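The interplay between the "genes considered" option and the normalization can be sketched as follows. The data layout (linear lists of hypothetical `(orthologous_group, strand)` tuples), the function names and the treatment of inverted pairs are assumptions made for illustration, not a description of SHOT's internals.

```python
def conserved_pairs(genome_a, genome_b):
    """Count adjacent gene pairs present in both genomes with matching directions."""
    def keys(genome):
        found = set()
        for left, right in zip(genome, genome[1:]):
            # The same pair read from the opposite strand: reversed and flipped.
            flipped = ((right[0], -right[1]), (left[0], -left[1]))
            found.add(min((left, right), flipped))
        return found
    return len(keys(genome_a) & keys(genome_b))


def gene_order_similarity(genome_a, genome_b, shared_only=False):
    """Conserved pairs divided by a genome size.

    shared_only=False: all genes are analysed, size = ORFs in the smaller genome.
    shared_only=True:  only genes with an ortholog in the other genome are
                       analysed, size = number of shared orthologous groups.
    """
    if shared_only:
        shared = {g for g, _ in genome_a} & {g for g, _ in genome_b}
        genome_a = [gene for gene in genome_a if gene[0] in shared]
        genome_b = [gene for gene in genome_b if gene[0] in shared]
        size = len(shared)
    else:
        size = min(len(genome_a), len(genome_b))
    if size == 0:
        raise ValueError("no genes available for normalization")
    return conserved_pairs(genome_a, genome_b) / size


# Hypothetical genomes: genome_a has gained COG7 between COG1 and COG2.
genome_a = [("COG1", 1), ("COG7", 1), ("COG2", 1), ("COG3", 1)]
genome_b = [("COG1", 1), ("COG2", 1), ("COG3", 1)]

print(gene_order_similarity(genome_a, genome_b))
print(gene_order_similarity(genome_a, genome_b, shared_only=True))
```

With all genes considered, the inserted COG7 breaks the COG1-COG2 adjacency. When only shared genes are considered, that gene content change is ignored and the pair counts as conserved.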
Distance measure and clustering algorithm:
The selectable parameters are identical
for gene order and gene content phylogenies.
References
Felsenstein, J. (1989). PHYLIP-phylogeny inference package (Version 3.2). Cladistics, 5, 164-166.
Fitch, W.M. & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155, 279-284.
Fitz-Gibbon, S.T. & House, C.H. (1999). Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res., 27, 4218-4222.
Huynen, M., Snel, B. & Bork, P. (1999). Lateral gene transfer, genome surveys, and the phylogeny of prokaryotes. Science, 286, 1443a.
Korbel, J.O., Snel, B., Huynen, M.A. & Bork, P. (2002). SHOT: a web server for the construction of genome phylogenies. Trends Genet., 18, 158-162.
Pearson, W. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol., 276, 71-84.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. & Snel, B. (2003). STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31, 258-261.
Saitou, N. & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406-425.
Snel, B., Bork, P. & Huynen, M. (1999). Genome phylogeny based on gene content. Nat. Genet., 21, 108-110.
Swofford, D.L. & Olsen, G.J. (1990). Phylogeny construction. In Molecular Systematics (Hillis, D.M. and Moritz, C., eds), pp. 411-501, Sinauer Associates.
Tatusov, R.L. et al. (2001). The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res., 29, 22-28.