May 4, 2021 - We are working on upgrading the webserver, some pages may not work.
OPENSEQ.org

Complexes - FAQ
FAQ page under construction (please send questions!)
How is a paired-alignment generated?
Let's say we want to build a paired alignment between the BLUE gene and the RED gene. The task is complicated by the fact that each genome may contain multiple copies of BLUE and RED (these are called paralogs). How do we know which copy of BLUE to pair with which copy of RED?

It turns out that, in bacterial genomes, interacting genes are often found near each other, in units called operons. Using the operon information, or a measure of how far apart two genes are in the genome (Δgene), we can avoid the paralog issue altogether!
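As a rough illustration of the Δgene idea, the sketch below pairs the copies of BLUE and RED that sit closest together in one genome. The function name, the input layout (gene-order indices per copy) and the `max_delta` threshold of 20 are all assumptions made for this example, not the web server's actual implementation; it also ignores genome circularity.

```python
from itertools import product

def pair_by_gene_distance(blue_positions, red_positions, max_delta=20):
    """Pick the (BLUE, RED) copies that are closest in gene order.

    blue_positions / red_positions: gene-order indices of every copy
    found in a single genome (hypothetical input format).
    Returns (blue_index, red_index, delta), or None when no pair lies
    within max_delta genes of each other.
    """
    best = None
    for b, r in product(blue_positions, red_positions):
        delta = abs(b - r)  # the Δgene value for this candidate pair
        if delta <= max_delta and (best is None or delta < best[2]):
            best = (b, r, delta)
    return best

# One genome with three BLUE paralogs and two RED paralogs.
print(pair_by_gene_distance([12, 840, 2301], [841, 1500]))  # (840, 841, 1)
```

Genomes for which the function returns `None` would simply contribute no row to the paired alignment.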

Click here for technical details

How does e-value play a role in separating the paralogs from orthologs?
Alternatively, one could adjust the e-value threshold until only one copy (of each gene) is detected per genome. The paralog ratio tells you how many copies per genome are detected for the given settings. Sometimes, though, it's the combination of both approaches that produces the best pairing! Both Δgene and the e-value threshold can be adjusted on our web server.
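To make the effect of the e-value threshold concrete, here is a small sketch that computes a paralog ratio at several cutoffs. It assumes the ratio is the mean number of hits per genome, taken over genomes with at least one hit; the exact definition used by the web server may differ, and the hit list is made-up illustration data.

```python
from collections import Counter

def paralog_ratio(hits, evalue_cutoff):
    """Mean number of detected copies per genome at a given e-value cutoff.

    hits: list of (genome_id, evalue) tuples (hypothetical input format).
    Only genomes with at least one passing hit are counted.
    """
    counts = Counter(genome for genome, evalue in hits if evalue <= evalue_cutoff)
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)

# Made-up hits: g1 and g3 each have one strong hit and one weak (paralog-like) hit.
hits = [("g1", 1e-40), ("g1", 1e-8), ("g2", 1e-35), ("g3", 1e-30), ("g3", 1e-5)]

for cutoff in (1e-4, 1e-10):
    print(cutoff, paralog_ratio(hits, cutoff))
```

In this toy data, tightening the cutoff from 1e-4 to 1e-10 drops the weak hits and brings the ratio down to one copy per genome.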
Is it possible to run GREMLIN on a complex that involves more than 2 genes?
  • Yes! You can do an all-vs-all paired analysis. For example, if you have genes A, B and C, you can run the protocol on A+B, A+C and B+C (and put all the results together). This is what we did for the ribosome and the NADH dehydrogenase complex!
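For larger complexes, the list of pairs to run can be enumerated programmatically. This is only a sketch of the bookkeeping; the gene names are placeholders.

```python
from itertools import combinations

genes = ["A", "B", "C"]  # placeholder gene names

# Every unordered pair of genes gets its own paired-alignment run.
pairs = ["+".join(pair) for pair in combinations(genes, 2)]
print(pairs)  # ['A+B', 'A+C', 'B+C']
```

A complex with n genes needs n(n-1)/2 runs, so the number of jobs grows quickly with complex size.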
What is "S_sco" or "I_Prob"?
  • Scaled Score = raw_score/average(raw_scores)
    • Referred to in the paper as "normalized coupling strength"; see Figure 1 of our complexes paper for how this score behaves as the number of sequences varies.
    • A coupling strength larger than one indicates higher than average coupling between two residues.

  • I_Prob ≈ P(contact | scaled_score, seq/len, top_inter_score)
    • Referred to in the paper as "GREMLIN score"
    • If you do NOT know whether the proteins interact, this score should help you decide if there is enough information to infer an interaction. Otherwise you should use "Prob". WARNING: The "I_Prob" score was trained on "obligate" interactions of the ribosome. We sometimes find that "transient" interactions receive a low probability (due to a low relative coupling score) even though the predictions are still very accurate. Examples include CH10 - CH60 (chaperonin) and DHSA - DHSB (succinate dehydrogenase). We present all of our predictions to allow data-mining for more cases!

  • Prob ≈ P(contact | scaled_score, seq/len)
    • See FAQ for details about this score.

  • The scores above should not be used as "cutoffs". For docking simulations, we find that using ALL inter contacts within the top 1.5L intra/inter contacts as sigmoidal restraints works best, where "L" is the length of the protein pair.
  • Technical details of how to get these scores can be found in the question below!
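The scaled score and the "top 1.5L" selection described above can be sketched as follows. This is a minimal illustration, not the GREMLIN code: it assumes raw scores are stored as a dictionary keyed by 1-based residue pairs (i, j) with i < j, that the average is taken over all pairs, and that a pair is "inter" when i falls in gene A (i ≤ div) and j falls in gene B (j > div). The function names and `div` argument are hypothetical.

```python
def scale_scores(raw):
    """Scaled score = raw_score / average(raw_scores).

    raw: dict mapping (i, j) residue pairs to raw coupling scores.
    """
    mean = sum(raw.values()) / len(raw)
    return {pair: score / mean for pair, score in raw.items()}

def top_inter_contacts(scaled, div, total_len, factor=1.5):
    """Inter-protein pairs found within the top factor*L pairs overall.

    div: length of gene A; total_len: L, the length of the protein pair.
    """
    n_top = int(factor * total_len)
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:n_top]
    # Keep only pairs that straddle the A/B boundary.
    return [(pair, s) for pair, s in ranked if pair[0] <= div < pair[1]]

# Toy example: gene A covers positions 1-2, gene B covers positions 3-4.
raw = {(1, 2): 0.5, (1, 4): 2.0, (2, 3): 1.0, (3, 4): 0.5}
scaled = scale_scores(raw)
print(top_inter_contacts(scaled, div=2, total_len=4))
# [((1, 4), 2.0), ((2, 3), 1.0)]
```

Ranking intra and inter pairs together before filtering means that inter contacts only survive if they are competitive with the intra-protein couplings.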
How do I run GREMLIN locally?
  • First you need an alignment
    • (see technical details in the first question on how to generate this).
  • Remove positions from the alignment that have > 75% gaps
    • seq_len.pl -i AB_id90cov75.fas -percent 25
      • AB_id90cov75.cut.fas - fasta file for your records
      • AB_id90cov75.cut.msa - alignment to be used as input to GREMLIN
      • AB_id90cov75.cut - mapping between the full length sequence and cut/trimmed sequence
      • Please take note of the seq_len value reported at the end of the script run; you'll need this value later!
  • Run GREMLIN and get scores
    • run_gremlin.sh MCR_location AB_id90cov75.cut.msa AB.mtx MaxIter 30 verbose 1 apc 0
      • AB.mtx - raw matrix from GREMLIN (apc 0 = no Average Product Correction)
    • mtx2sco.pl -mtx AB.mtx -cut AB_id90cov75.cut -div 100 -seq_len 5 -apcd AB.apcd
      • AB.apcd - specialized Average Product Correction to account for potentially differing rates in each gene.
      • -div 100 is the length of gene A (full length of gene A, not the cut length), aka the point of division in the matrix!
      • -seq_len 5 is the number of sequences per length (this is the value you get at the end of running seq_len.pl)

  • The scripts used in this demo: scripts_21Sep2018.zip. Use the download form to get a copy of GREMLIN.
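For readers who want to see what the gap-trimming step amounts to, here is a minimal Python sketch. It is not a reimplementation of seq_len.pl: the function name is hypothetical, the input is assumed to be a list of equal-length aligned sequences with "-" as the gap character, and the sequences-per-length value is assumed to be the number of sequences divided by the number of retained columns.

```python
def trim_gappy_columns(seqs, max_gap_fraction=0.75):
    """Drop alignment columns with more than max_gap_fraction gaps.

    Returns (trimmed sequences, 0-based indices of the kept columns).
    The kept indices play the role of a full-length-to-trimmed mapping.
    """
    n = len(seqs)
    keep = [
        col for col in range(len(seqs[0]))
        if sum(s[col] == "-" for s in seqs) / n <= max_gap_fraction
    ]
    trimmed = ["".join(s[col] for col in keep) for s in seqs]
    return trimmed, keep

# Toy alignment: the third column is 80% gaps and gets removed.
msa = ["MK-LV", "MR-LI", "MK-L-", "M--LV", "MKALV"]
trimmed, keep = trim_gappy_columns(msa)
print(trimmed)  # ['MKLV', 'MRLI', 'MKL-', 'M-LV', 'MKLV']
print(keep)     # [0, 1, 3, 4]

# Assumed definition of "sequences per length".
seqs_per_len = len(trimmed) / len(keep)
print(seqs_per_len)  # 1.25
```

Keeping the list of retained column indices is what lets scores computed on the trimmed alignment be mapped back to full-length residue numbers.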