The distinctive signatures of promoter regions and operon junctions across Prokaryotes

Sarath Chandra Janga1, Warren F. Lamboy2, Araceli M. Huerta3 & Gabriel Moreno-Hagelsieb4

1Program of Computational Genomics, CCG-UNAM, Apdo Postal 565-A, Cuernavaca, Morelos, 62100 Mexico. 2USDA-ARS Plant Genetic Resources Unit, Cornell University, Geneva, NY 14456. 3Joint Genome Institute, DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598. 4Department of Biology, Wilfrid Laurier University, 75 University Avenue West, Waterloo, ON, Canada, N2L 3C5.

 


Abstract

Here we show that regions upstream of first transcribed genes have tri-nucleotide signatures that distinguish them from regions upstream of genes in the middle of operons. Databases of experimentally confirmed transcription units do not exist for most available genomes. Thus, in each genome we used as training data same-strand adjacent pairs of genes conserved in evolutionarily distant genomes, as representatives of genes inside operons, and divergently transcribed genes, as representatives of first transcribed genes. The tri-nucleotide signatures of the regions upstream of these representative genes allow for operon predictions with accuracies surprisingly close to those obtained with known operon data. Operons predicted from signatures contain genes with more similar phylogenetic profiles, and higher proportions of genes in the same pathways, than gene pairs at predicted transcription unit boundaries, demonstrating that we are separating genes with related functions, as expected for operons, from genes not necessarily related, as expected for genes in different transcription units. We also test the quality of the predictions using microarray data in six representative genomes and show that the signature-predicted operons tend to have high correlations of expression, comparable to those of distance-based predictions. This approach should pave the way to identifying operons based purely on the sequence information of a genome, and hence expand the ability to identify operon structures even in poorly characterized or partially sequenced genomes.


Sections

1) Data sets of OJs and PRs, along with their promoter densities, in the genomes of both E. coli K12 and B. subtilis.

2) Table containing the maximum accuracies achieved in different genomes in separating the training data of conserved pairs and divergent transcription unit boundaries.

3) Signature-based operon predictions in all genomes.

4) Glossary of the terms/abbreviations used in this work.

 


1. Data sets of OJs and TUBs used in the entire analysis in Escherichia coli and Bacillus subtilis. The tables contain the gene identifiers forming the OJs or TUBs, the strand of the pair, the intergenic distance between the pair, and the operon or the genes forming the pair. The genomes of E. coli and B. subtilis correspond to the GenBank files NC_000913.2 and NC_000964.2. (If the automatic download window does not pop up, or if you have problems viewing the files in the browser, please use wget or another command-line tool to get these flat files.)

a) Download OJ set of E. coli / TUB set of E. coli

b) Download OJ set of B. subtilis / TUB set of B. subtilis

Promoter densities upstream of each gene in the genomes of E. coli K12 and B. subtilis. The data have the following columns: 1) Upstream region of the gene of interest 2) Number of matches 3) Promoter density (matches/250 bp).
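As a small worked illustration of the third column, the density is simply the number of promoter matches divided by the 250 bp upstream window. The sketch below assumes exactly that arithmetic; the function name `promoter_density` and the example match count are hypothetical and not taken from the data files.

```python
def promoter_density(num_matches: int, window_bp: int = 250) -> float:
    """Return promoter matches per base pair over an upstream window.

    Assumption: the density column is matches / 250, as the column
    description "matches/250 bp" suggests.
    """
    if window_bp <= 0:
        raise ValueError("window_bp must be positive")
    if num_matches < 0:
        raise ValueError("num_matches cannot be negative")
    return num_matches / window_bp


if __name__ == "__main__":
    # Hypothetical example: 5 matches in the 250 bp upstream window.
    print(promoter_density(5))
```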

 


2. Click here to see the table of the maximum accuracy attained in each genome and the signature log-likelihood (LLH) at which it was attained.

The table has the following columns: 1) Genome name 2) Signature LLH 3) Accuracy attained. The key to the genome names used there can be seen in this table.
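One way to read the "Signature LLH" and "Accuracy attained" columns is as the result of a threshold sweep over the training pairs. The sketch below is not the authors' procedure. It assumes that pairs scoring at or above a threshold are called WO and the rest TUB (the direction of the score is an assumption), and it reports the threshold with the highest accuracy. The name `best_threshold` and the example scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_threshold(scores: Sequence[float],
                   is_wo: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, accuracy) with the highest accuracy.

    Assumption: a pair is called WO when score >= threshold,
    and TUB otherwise.
    """
    if len(scores) != len(is_wo) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and aligned")
    # Candidate thresholds: every observed score, plus +inf (call nothing WO).
    candidates: List[float] = sorted(set(scores)) + [float("inf")]
    best_t, best_acc = candidates[0], -1.0
    for t in candidates:
        correct = sum((s >= t) == label for s, label in zip(scores, is_wo))
        acc = correct / len(scores)
        if acc > best_acc:  # keep the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


if __name__ == "__main__":
    # Hypothetical training pairs: two TUB proxies, two WO proxies.
    llh = [-2.0, -1.0, 0.5, 1.5]
    labels = [False, False, True, True]
    print(best_threshold(llh, labels))
```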

 


 

3. Signature-based (tri-nucleotide signature) predictions for the complete set of 330 genomes analysed can be downloaded as a tar-zipped file from here. They can also be downloaded in zipped format from here. The flat files are tab-delimited and contain the following columns:

a) GenBank gids of co-directional adjacent gene pairs.

b) Intergenic distance.

c) Chi-square value with respect to the WO signature in the respective genome.

d) Chi-square value with respect to the TUB signature in the respective genome.

e) Signature LLH, calculated as described in the Methods section of the manuscript.

f) Class, either WO or TUB, based on a signature LLH threshold of 0.00.
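The column layout above can be read with a few lines of Python. This is a minimal sketch rather than an official parser. It assumes the fields appear in the order a) to f), that the last five tab-separated fields are b) to f), and that whatever precedes them holds the gene identifiers (so it works whether the pair occupies one field or two). It also assumes that blank lines and lines starting with `#` can be skipped. The class name `PairPrediction` and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class PairPrediction:
    gids: Tuple[str, ...]   # a) identifiers of the adjacent gene pair
    distance: int           # b) intergenic distance (may be negative)
    chi2_wo: float          # c) chi-square vs. the WO signature
    chi2_tub: float         # d) chi-square vs. the TUB signature
    sig_llh: float          # e) signature LLH
    label: str              # f) "WO" or "TUB"


def parse_predictions(lines: Iterable[str]) -> Iterator[PairPrediction]:
    """Yield one PairPrediction per tab-delimited data line."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # assumed: blank and comment lines carry no data
        # The last five fields are columns b-f; the rest are gene ids.
        *gids, dist, chi_wo, chi_tub, llh, label = line.split("\t")
        yield PairPrediction(tuple(gids), int(dist), float(chi_wo),
                             float(chi_tub), float(llh), label.strip())


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    sample = [
        "geneA\tgeneB\t-4\t310.2\t402.7\t1.25\tWO",
        "geneB\tgeneC\t250\t400.0\t300.0\t-0.80\tTUB",
    ]
    for rec in parse_predictions(sample):
        print(rec.gids, rec.distance, rec.sig_llh, rec.label)
```

To read a real file, pass an open file handle, for example `parse_predictions(open(path))`, where `path` points to one of the downloaded flat files.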

 

The complete set of genomes, along with the GenBank accession numbers for all the replicons (per genome) used in the work, can be seen in this table.

 

4. Glossary of the terms/abbreviations used in this work.

Promoter Region (PR): Because our interest is in operon prediction, we work with adjacent co-directional pairs of genes (pairs of adjacent genes on the same strand). A transcription unit boundary pair (TUB pair) is such a pair in which the first gene is the last gene of one transcription unit and the second is the first gene of the next transcription unit (in the direction of transcription); the second gene is therefore a first transcribed gene by definition. A promoter region (PR) is the region upstream of a first transcribed gene.

Operon junction (OJ): Given two adjacent genes in the same operon, that is, genes transcribed into a single messenger RNA, the region upstream of the second gene (counting in the direction of transcription) is an operon junction (OJ). A within-operon (WO) pair is two adjacent genes in the same operon. In this work we show that a WO pair can be distinguished from a TUB pair according to whether the region upstream of the second gene resembles the overall OJ signature or the overall PR signature, respectively.

Overall signature: The average count of each oligonucleotide over all regions of a given data set (either all OJs or all PRs).
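The definition of an overall signature can be sketched in code for the tri-nucleotide case. This is an illustration under stated assumptions, not the implementation used in the work: tri-nucleotides are counted in overlapping windows on the given strand only, windows containing characters other than A, C, G, or T are ignored, and the average is taken per region. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable

# All 64 tri-nucleotides over the DNA alphabet.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def count_trinucleotides(seq: str) -> Dict[str, int]:
    """Count overlapping tri-nucleotides in one upstream region."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    # Keep only the 64 canonical words; ambiguous windows are dropped.
    return {k: counts.get(k, 0) for k in TRINUCLEOTIDES}


def overall_signature(regions: Iterable[str]) -> Dict[str, float]:
    """Average count of each tri-nucleotide over all regions of a set."""
    regions = list(regions)
    if not regions:
        raise ValueError("at least one region is required")
    totals = dict.fromkeys(TRINUCLEOTIDES, 0)
    for seq in regions:
        for word, n in count_trinucleotides(seq).items():
            totals[word] += n
    return {word: totals[word] / len(regions) for word in TRINUCLEOTIDES}


if __name__ == "__main__":
    # Toy regions standing in for a set of OJs or PRs.
    sig = overall_signature(["AAAA", "AAAC"])
    print(sig["AAA"], sig["AAC"])
```

Computing one such signature from all OJs and another from all PRs gives the two reference profiles against which an individual upstream region is compared.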

 


 

For questions, comments, or additional data requests, please e-mail: sarath AT cifn.unam.mx or gmoreno AT wlu.ca