POMPOUS - Predicted Simple Sequence Repeat Polymorphisms
This includes likely polymorphic repeats in coding regions and UTRs, identified by inspecting the UniGene database, as well as repeats in introns and exons, identified by analyzing GenBank.
Access our repeat polymorphism prediction database and summary statistics using our Repeat Polymorphism Search Tool.
Repeat Polymorphisms Within Gene Regions - Phenotypic and Evolutionary Implications
Wren JD, Forgacs E, Fondon JW 3rd, Pertsemlidis A, Cheng S, Gallardo T, Williams RS, Shohet RV, Minna JD, and Garner HR. "Repeat Polymorphisms Within Gene Regions: Phenotypic and Evolutionary Implications". American Journal of Human Genetics (August 2000, Vol 67, p. 345-56).
Abstract
We have developed an algorithm that predicted 11,265 potentially polymorphic tandem repeats within transcribed sequences. We estimate that 22% (2,207 out of 9,717) of the annotated clusters within UniGene contain at least one potentially polymorphic locus. Our predictions were tested by allelotyping a panel of ~30 individuals for 5% of these regions, confirming polymorphism for more than half of the loci tested. Our study indicates that tandem repeat polymorphisms in genes are more common than generally believed. Roughly 8% of these loci are within coding sequence and, if polymorphic, would result in frame shifts. Our catalog of putative polymorphic repeats within transcribed sequences comprises a large set of potentially phenotypic or disease-causing loci. In addition, from the anomalous character of the repetitive sequences within unannotated clusters, we also conclude that the UniGene cluster count substantially overestimates the number of genes in the human genome.
We hypothesize that polymorphisms in repeated sequences occur with some baseline distribution based upon repeat homogeneity, size and sequence composition, and that deviations from that distribution are indicative of the nature of selection pressure at that locus. We find evidence of selective maintenance of the ability of some genes to respond very rapidly, perhaps even on intra-generational time scales, to fluctuating selective pressures.
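The frame-shift statement in the abstract follows from simple arithmetic: gaining or losing copies of a repeat unit inside coding sequence preserves the reading frame only when the net length change is a multiple of three bases. A minimal Python sketch of that check is shown here; the function name and its arguments are illustrative assumptions and are not part of the published algorithm.

```python
def shifts_reading_frame(unit_length: int, copy_change: int) -> bool:
    """Return True if gaining or losing `copy_change` copies of a repeat
    unit of `unit_length` bases would shift a coding reading frame.

    Hypothetical helper for illustration; not taken from POMPOUS.
    """
    if unit_length <= 0:
        raise ValueError("unit_length must be positive")
    # Net insertion or deletion size in bases; the sign does not affect the frame.
    net_change = abs(unit_length * copy_change)
    return net_change % 3 != 0


# A dinucleotide repeat gaining one copy adds 2 bases.
print(shifts_reading_frame(2, 1))
# A trinucleotide repeat gaining one copy adds 3 bases.
print(shifts_reading_frame(3, 1))
```

Under this rule, repeats whose unit length is a multiple of three never shift the frame, while other unit lengths shift it unless the number of copies gained or lost happens to be a multiple of three.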
UniGene download date: March 13, 2001
- Human
- 92149 records read
Total genes with repeats: 33200
Where the repeats were found:
5'----787----[1117]-----2382-----3' and 29378 unknown
How many repeats were found:
5'----948----[1437]-----3319-----3' and 39025 unknown
Genes with hairpins found: 6177
Where the hairpins were found:
5'----236----[734]-----173-----3' and 5062 unknown
How many hairpins were found:
5'----330----[926]-----256-----3' and 9950 unknown
Entries with coding regions given: 14966
Average 5'UTR length: 165 (from 2483735 bp)
Average CDS length: 1484 (from 22214384 bp)
Average 3'UTR length: 826 (from 12367305 bp)
Average size of unknown entry: 566 (from 43729132 bp)
Smallest entry=50
Largest entry=17734
Self-Similarity average : 4.684
Self-Complementarity average: 4
3'UTR obs/exp A/T ratio avg.: 0.108
-
- Mouse
- 79916 records read
Total genes with repeats: 17857
Where the repeats were found:
5'----320----[508]-----1079-----3' and 16144 unknown
How many repeats were found:
5'----377----[692]-----1607-----3' and 19440 unknown
Genes with hairpins found: 1867
Where the hairpins were found:
5'----87----[302]-----36-----3' and 1455 unknown
How many hairpins were found:
5'----131----[356]-----46-----3' and 2049 unknown
Entries with coding regions given: 7170
Average 5'UTR length: 135 (from 971257 bp)
Average CDS length: 1530 (from 10976156 bp)
Average 3'UTR length: 559 (from 4014279 bp)
Average size of unknown entry: 418 (from 30452300 bp)
Smallest entry=53
Largest entry=17333
Self-Similarity average : 3.94
Self-Complementarity average: 3.639
3'UTR obs/exp A/T ratio avg.: 0.055
-
- Rat
- 46258 records read
Total genes with repeats: 28289
Where the repeats were found:
5'----147----[263]-----569-----3' and 27401 unknown
How many repeats were found:
5'----176----[374]-----838-----3' and 30794 unknown
Genes with hairpins found: 1354
Where the hairpins were found:
5'----48----[161]-----11-----3' and 1138 unknown
How many hairpins were found:
5'----81----[187]-----23-----3' and 1412 unknown
Entries with coding regions given: 4268
Average 5'UTR length: 126 (from 539060 bp)
Average CDS length: 1516 (from 6473174 bp)
Average 3'UTR length: 525 (from 2244223 bp)
Average size of unknown entry: 479 (from 20131942 bp)
Smallest entry=51
Largest entry=16453
Self-Similarity average : 4.069
Self-Complementarity average: 3.772
3'UTR obs/exp A/T ratio avg.: 0.055
-
- Cow
- 6789 records read
Total genes with repeats: 2184
Where the repeats were found:
5'----45----[80]-----120-----3' and 1963 unknown
How many repeats were found:
5'----49----[102]-----148-----3' and 2321 unknown
Genes with hairpins found: 236
Where the hairpins were found:
5'----7----[44]-----4-----3' and 181 unknown
How many hairpins were found:
5'----7----[51]-----4-----3' and 338 unknown
Entries with coding regions given: 1321
Average 5'UTR length: 86 (from 114799 bp)
Average CDS length: 1280 (from 1691438 bp)
Average 3'UTR length: 424 (from 561266 bp)
Average size of unknown entry: 487 (from 2666490 bp)
Smallest entry=56
Largest entry=12706
Self-Similarity average : 4.427
Self-Complementarity average: 3.906
3'UTR obs/exp A/T ratio avg.: 0.12
-
- Zebrafish
- 10341 records read
Total genes with repeats: 3889
Where the repeats were found:
5'----32----[28]-----135-----3' and 3709 unknown
How many repeats were found:
5'----35----[34]-----192-----3' and 4480 unknown
Genes with hairpins found: 445
Where the hairpins were found:
5'----3----[24]-----5-----3' and 413 unknown
How many hairpins were found:
5'----8----[26]-----7-----3' and 590 unknown
Entries with coding regions given: 823
Average 5'UTR length: 115 (from 95298 bp)
Average CDS length: 1289 (from 1061284 bp)
Average 3'UTR length: 481 (from 396011 bp)
Average size of unknown entry: 517 (from 4929823 bp)
Smallest entry=101
Largest entry=10620
Self-Similarity average : 4.271
Self-Complementarity average: 3.911
3'UTR obs/exp A/T ratio avg.: 0.061
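The positional summary lines in the listings above share one layout, so they can be read mechanically. The Python sketch below parses one such line into four integers. It assumes that the three values between 5' and 3' are counts for the 5' UTR, the coding region (the bracketed value) and the 3' UTR, which is an interpretation of the layout and is not stated on this page; the function and field names are likewise made up for illustration.

```python
import re

# Matches lines such as:
#   5'----787----[1117]-----2382-----3' and 29378 unknown
# Assumption: the bracketed number is the coding-region count.
LINE_RE = re.compile(
    r"5'-+(?P<five_prime_utr>\d+)-+\[(?P<coding>\d+)\]-+(?P<three_prime_utr>\d+)-+3'"
    r"\s+and\s+(?P<unknown>\d+)\s+unknown"
)


def parse_region_counts(line: str) -> dict:
    """Parse one positional summary line into a dict of integer counts."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


# The human "Where the repeats were found" line from the listing above.
counts = parse_region_counts("5'----787----[1117]-----2382-----3' and 29378 unknown")
print(counts)
```

Parsing the lines this way makes it straightforward to compare the same statistic across species, for example the share of repeats that fall in entries with no annotated coding region.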
-
Computationally Assisted Polymorphic Marker Identification: Identification and Verification of Multiple New 3p21.3 Polymorphic Markers
J. W. Fondon III, G. M. Mele, D. Cummings, A. Pande, J. Wren, K. M. O’Brien, K. C. Kupfer, M. Lerman, J. D. Minna and H. R. Garner, “Computationally Assisted Polymorphic Marker Identification: Identification and Verification of Multiple New 3p21.3 Polymorphic Markers”, Proc. Natl. Acad. Sci., 95:7514-7519, June 23, 1998.
Abstract
A computational system for the prediction of polymorphic loci directly and efficiently from human genomic sequence was developed and verified. A suite of programs, collectively called POMPOUS (POlymorphic Marker Prediction Of Ubiquitous Simple sequences), detects tandem repeats ranging from dinucleotides up to 250-mers, scores them according to predicted level of polymorphism, and designs appropriate flanking primers for PCR amplification. This approach was validated on an approximately 750 kb region of human chromosome 3p21.3, involved in lung and breast carcinoma homozygous deletions. Target DNA from 36 paired B lymphoblastoid and lung cancer lines was amplified and allelotyped for 33 loci predicted by POMPOUS to be variable in repeat size. We found that among these 36 predominately Caucasian individuals, 22 of the 33 (67%) predicted loci were polymorphic, with an average heterozygosity of 0.42. Allele loss in this region was found in 27/36 (75%) of the tumor lines using these markers. POMPOUS provides the genetic researcher with a new tool for the rapid and efficient identification of polymorphic markers, and through the creation of a World Wide Web server site, investigators can use POMPOUS to identify new polymorphic markers for their research. A catalog of 13,261 potential polymorphic markers and associated primer sets has been created from the analysis of 141,779,504 base pairs of human genomic sequence in GenBank. This data is available on our WWW site and will be periodically updated as GenBank is expanded and algorithm accuracy is improved.
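The abstract reports an average heterozygosity of 0.42 but this page does not say how it was computed. One common definition is the expected heterozygosity of a locus, H = 1 - sum(p_i^2), where p_i is the frequency of allele i. The Python sketch below assumes that definition; the function name and the allele calls are invented for illustration and are not data from the study.

```python
from collections import Counter


def expected_heterozygosity(alleles):
    """Expected heterozygosity H = 1 - sum(p_i^2) for one locus.

    `alleles` is a flat sequence of observed allele labels, two per
    diploid individual. Assumed definition; the page does not specify one.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alleles supplied")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical repeat-copy-number alleles from four individuals (eight alleles).
example_alleles = [12, 12, 12, 14, 14, 15, 12, 14]
print(round(expected_heterozygosity(example_alleles), 3))
```

A locus with a single allele scores 0, and the value rises as more alleles occur at more even frequencies, which is why it is a convenient measure of how informative a marker is.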
We have also catalogued the simple sequence repeats likely to be found in the entire genome. This includes intronic regions, where the repeats could be used as markers or could be involved in gene regulation.
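As a rough illustration of what cataloguing simple sequence repeats involves, the Python sketch below scans a DNA string for perfect tandem repeats with a regular expression. It is not the POMPOUS algorithm, which also handles units up to 250-mers, scores predicted polymorphism and designs primers. The function name, the unit-length range and the minimum copy number are all assumptions chosen for the example.

```python
import re


def find_simple_repeats(sequence, min_unit=2, max_unit=6, min_copies=4):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Illustrative only: exact matches, short units, arbitrary thresholds.
    """
    sequence = sequence.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(sequence):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif
            # (e.g. "AA" or "ACAC"), so each repeat is reported once.
            if any(
                unit == unit[:d] * (unit_len // d)
                for d in range(1, unit_len)
                if unit_len % d == 0
            ):
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, copies))
    return sorted(hits)


# A made-up sequence containing five copies of the dinucleotide CA.
print(find_simple_repeats("GGT" + "CA" * 5 + "CTTGA"))
```

Because the scan runs left to right, the reported unit is whichever rotation of the motif matches first, so a run of GAT preceded by a T may be reported with unit TGA. A production tool would also need to tolerate imperfect repeats.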