OPENSEQ.org - GREMLIN - Submit Sequences for Coevolution Analysis

Because it is hard to separate orthologs from paralogs (multiple copies of close homologs in the same genome), generating a paired alignment of two genes is a difficult task: we do not know which copy of gene A to pair with which copy of gene B. To avoid paralogs, we can do one or both of the following:

Note: E-value is in Scientific notation: [LOW] 1E-40 1E-20 1E-10 1E-06 1E-04 [HIGH]

Decrease the e-value until only one copy is detected per genome.
For prokaryotic genomes, use operon information (Δgene) to decide which genes to pair. Adjust the Δgene until no neighboring paralogs are detected.
- Δgene of (1,∞) = "pair genes that are from the same genome"
- Δgene of (1,1) = "only pair genes that are immediate neighbors in the same genome"
- Δgene of (1,20) = "only pair genes that are within 20 annotated genes of each other in the same genome"
- Δgene of (0,0) = "pair domains that are from the same gene"
- Besides domain pairing, why might the min=0 option be useful in a gene-pairing context? Sometimes genes that are immediate neighbors (such as fusion genes) are annotated as a single gene in the genome; setting the Δgene min to 0 allows these to be included in the alignment. On the other hand, if the two genes you are joining are paralogous to each other, joining at min=0 will result in a corrupted alignment!
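
As a rough illustration of the two strategies above, here is a minimal Python sketch. It is not GREMLIN's actual implementation: the function names, the `(genome_id, evalue)` and `(genome_id, gene_index)` hit tuples, and the rule of flagging genomes with more than one candidate pair as ambiguous are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# E-value cutoffs from the note above, listed from HIGH (permissive) to LOW (strict).
EVALUE_CUTOFFS = [1e-4, 1e-6, 1e-10, 1e-20, 1e-40]


def loosest_single_copy_cutoff(hits, cutoffs=EVALUE_CUTOFFS):
    """Return the most permissive cutoff at which every genome keeps at most one hit.

    hits: list of (genome_id, evalue) tuples (hypothetical input format).
    Returns None if no cutoff leaves exactly one copy per detected genome.
    """
    for cutoff in sorted(cutoffs, reverse=True):  # try permissive cutoffs first
        copies = Counter(genome for genome, evalue in hits if evalue <= cutoff)
        if copies and max(copies.values()) == 1:
            return cutoff
    return None


def pair_by_delta_gene(hits_a, hits_b, d_min=1, d_max=math.inf):
    """Pair hits of gene A and gene B that lie in the same genome within a Δgene window.

    hits_a, hits_b: lists of (genome_id, gene_index) tuples, where gene_index is
    the position of the gene in the genome annotation (hypothetical input format).
    Returns (pairs, ambiguous): genomes with exactly one candidate pair, and a
    sorted list of genomes where several candidates remain (possible paralogs).
    """
    b_by_genome = defaultdict(list)
    for genome, idx_b in hits_b:
        b_by_genome[genome].append(idx_b)

    candidates = defaultdict(list)
    for genome, idx_a in hits_a:
        for idx_b in b_by_genome.get(genome, []):
            if d_min <= abs(idx_a - idx_b) <= d_max:
                candidates[genome].append((idx_a, idx_b))

    pairs = {g: c[0] for g, c in candidates.items() if len(c) == 1}
    ambiguous = sorted(g for g, c in candidates.items() if len(c) > 1)
    return pairs, ambiguous
```

In this sketch, a non-empty `ambiguous` list is the signal to tighten the Δgene window (or lower the e-value) and try again; with `d_min=0, d_max=0` only hits that share a gene index are paired, which corresponds to the domain-pairing case.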

On the flip side, decreasing the e-value and setting a Δgene cutoff will result in fewer sequences for coevolution analysis, so a balance must be found! To increase the number of sequences per length, we can do one or both of the following:

Increase the e-value and number of iterations until there are enough sequences per length.

For paired alignments, we find that the best results are achieved when there is at least 1 sequence per length, that is, per position of the combined query. For example: if gene A is length 50 and gene B is length 50, we would want at least 100 non-redundant sequences.

Trim the query sequence to the region where there are likely to be more homologous sequences.
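
The sequences-per-length rule of thumb above comes down to simple arithmetic. The Python sketch below shows one way to check it; the function names and the `threshold` parameter are hypothetical and not part of the GREMLIN interface.

```python
def sequences_per_length(n_nonredundant, len_a, len_b=0):
    """Number of non-redundant sequences divided by the combined query length."""
    total_length = len_a + len_b
    if total_length <= 0:
        raise ValueError("combined query length must be positive")
    return n_nonredundant / total_length


def has_enough_sequences(n_nonredundant, len_a, len_b=0, threshold=1.0):
    """True if the alignment reaches at least `threshold` sequences per length."""
    return sequences_per_length(n_nonredundant, len_a, len_b) >= threshold


# The example from the text: two genes of length 50 need at least 100 sequences.
print(has_enough_sequences(100, 50, 50))  # True
print(has_enough_sequences(99, 50, 50))   # False
```

If the check fails, the options listed above apply: raise the e-value or the number of iterations, or trim the query so that the combined length shrinks.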