Why are protein-coding regions rich in GC?


I have been searching for an answer for this question and have some possible solutions, but I am not sure. GC regions are more stable as there are 3 hydrogen bonds instead of 2 with AT, however I am not sure if this would influence the number of protein-coding genes in GC regions? Protein-coding regions would have to be less condensed than noncoding regions so transcription factors could access them. Would a high GC content influence this as well?


Transcription factors generally bind to promoters, enhancers, silencers, and other regulatory regions that lie outside coding regions, though there are "duons", which code for amino acids and also bind TFs to regulatory effect. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3967546/

One paper suggests GC content helps balance recombination events with genetic stability, and so there may be an evolutionary benefit from organisms that undergo sexual reproduction having GC-rich coding regions. https://www.frontiersin.org/articles/10.3389/fpls.2016.01433/full

In eukaryotes, meiotic exchange of genetic information, or recombination, between homologous chromosomes is a critical step in generating genetic diversity required for adaptation. Recombination is also a crucial tool in plant improvement efforts. Local genome architecture is sculpted by the recombination process, and genome architecture, in turn, drives recombination. This interplay helps to create variability in genomic space, defining relatively stable and plastic genomic regions. This fluctuation in genomic stability is critical for balancing adaptation and stability on the phenotypic level.

Recombination has direct implications for GC patterns and vice versa. GC content refers to the percentage of guanine and cytosine bases in a DNA sequence, as opposed to adenine and thymine bases. There have been many studies substantiating the positive correlation between recombination and GC content (Ikemura and Wada, 1991; Eyre-Walker, 1993; Fullerton et al., 2001; Galtier et al., 2001; Marais et al., 2001; Duret and Arndt, 2008; Haudry et al., 2008; Escobar et al., 2010; Muyle et al., 2011). Crossovers have been found to be correlated with high GC content in rat, mouse, human, zebrafish, bee, and maize at a broad scale (Jensen-Seaman et al., 2004; Beye et al., 2006; Gore et al., 2009; Backstrom et al., 2010; Giraut et al., 2011), while other studies detected strong correlation only at a fine scale (∼5 kb for yeast, ∼15-128 kb for human) and rather weak correlation at a broad scale (∼30 kb for yeast, ∼1 Mb for human; Gerton et al., 2000; Myers et al., 2006; Marsolier-Kergoat and Yeramian, 2009).
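The definition of GC content given above can be made concrete with a short calculation. This is a minimal sketch, not taken from any of the cited studies: the function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the unambiguous bases of seq.

    Assumption: characters other than A, C, G and T (for example N) are
    ignored rather than counted in the denominator.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Six of the twelve bases are G or C.
print(gc_content("ATGCGCGGTTAA"))  # 50.0
```

Applied in sliding windows along a chromosome, the same function would give the broad-scale or fine-scale GC profiles that the studies above compare with recombination rates; the window size is then the parameter that sets the scale.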

GC-richness may offer organisms the ability to produce offspring that adapt to changes in the environment and fend off parasites, while also being viable enough to procreate themselves.


The features of eukaryotic mRNA synthesis are markedly more complex than those of prokaryotes. Instead of a single polymerase comprising five subunits, the eukaryotes have three polymerases that are each made up of 10 subunits or more. Each eukaryotic polymerase also requires a distinct set of transcription factors to bring it to the DNA template.

RNA polymerase I is located in the nucleolus, a specialized nuclear substructure in which ribosomal RNA (rRNA) is transcribed, processed, and assembled into ribosomes (Table). The rRNA molecules are considered structural RNAs because they have a cellular role but are not translated into protein. The rRNAs are components of the ribosome and are essential to the process of translation. RNA polymerase I synthesizes all of the rRNAs except for the 5S rRNA molecule. The “S” designation applies to “Svedberg” units, a nonadditive value that characterizes the speed at which a particle sediments during centrifugation.

Locations, Products, and Sensitivities of the Three Eukaryotic RNA Polymerases

RNA Polymerase | Cellular Compartment | Product of Transcription | α-Amanitin Sensitivity
I | Nucleolus | All rRNAs except 5S rRNA | Insensitive
II | Nucleus | All protein-coding nuclear pre-mRNAs | Extremely sensitive
III | Nucleus | 5S rRNA, tRNAs, and small nuclear RNAs | Moderately sensitive

RNA polymerase II is located in the nucleus and synthesizes all protein-coding nuclear pre-mRNAs. Eukaryotic pre-mRNAs undergo extensive processing after transcription but before translation. For clarity, this module’s discussion of transcription and translation in eukaryotes will use the term “mRNAs” to describe only the mature, processed molecules that are ready to be translated. RNA polymerase II is responsible for transcribing the overwhelming majority of eukaryotic genes.

RNA polymerase III is also located in the nucleus. This polymerase transcribes a variety of structural RNAs that includes the 5S pre-rRNA, transfer pre-RNAs (pre-tRNAs), and small nuclear pre-RNAs. The tRNAs have a critical role in translation: they serve as the adaptor molecules between the mRNA template and the growing polypeptide chain. Small nuclear RNAs have a variety of functions, including “splicing” pre-mRNAs and regulating transcription factors.

A scientist characterizing a new gene can determine which polymerase transcribes it by testing whether the gene is expressed in the presence of a particular mushroom poison, α-amanitin (Table). Interestingly, α-amanitin produced by Amanita phalloides, the Death Cap mushroom, affects the three polymerases very differently. RNA polymerase I is completely insensitive to α-amanitin, meaning that the polymerase can transcribe DNA in vitro in the presence of this poison. In contrast, RNA polymerase II is extremely sensitive to α-amanitin, and RNA polymerase III is moderately sensitive. Knowing the transcribing polymerase can clue a researcher into the general function of the gene being studied. Because RNA polymerase II transcribes the vast majority of genes, we will focus on this polymerase in our subsequent discussions about eukaryotic transcription factors and promoters.


Gene-rich and gene-poor chromosomal regions have different locations in the interphase nuclei of cold-blooded vertebrates

In situ hybridizations of single-copy GC-rich, gene-rich and GC-poor, gene-poor chicken DNA allowed us to localize the gene-rich and the gene-poor chromosomal regions in interphase nuclei of cold-blooded vertebrates. Our results showed that the gene-rich regions from amphibians (Rana esculenta) and reptiles (Podarcis sicula) occupy the more internal part of the nuclei, whereas the gene-poor regions occupy the periphery. This finding is similar to that previously reported in warm-blooded vertebrates, in spite of the lower GC levels of the gene-rich regions of cold-blooded vertebrates. This suggests that this similarity extends to chromatin structure, which is more open in the gene-rich regions of both mammals and birds and more compact in the gene-poor regions. In turn, this may explain why the compositional transition undergone by the genome at the emergence of homeothermy did not involve the entire ancestral genome but only a small part of it, and why it involved both coding and noncoding sequences. Indeed, the GC level increased only in that part of the genome that needed a thermodynamic stabilization, namely in the more open gene-rich chromatin of the nuclear interior, whereas the gene-poor chromatin of the periphery was stabilized by its own compact structure.



Results

We obtained a large collection of codon usage data across the three domains of life (46 archaea, 686 bacteria, and 826 eukaryotes). Both models (Model 1 and Model 2; see Figure 1) use GC (S) and purine (R) contents to predict expected compositions theoretically. Unlike Model 2, Model 1 requires prior knowledge of the empirical relationships between S and S_i and between R and R_i, where i represents codon position (i = 1, 2, 3) (see Models). We inferred these empirical relationships (Additional file 1) from all the collected sequences in individual domains of life for Model 1.

Illustrations for quantifying theoretical compositions. Quantification of theoretical compositions of nucleotide, codon, and amino acid is based on GC (S) and purine (R) contents, which are readily observed from input coding sequences. Model 1 (red) and Model 2 (blue) differ only in how position-dependent GC (S_i) and purine (R_i) contents are calculated, where i represents codon position (i = 1, 2, 3).
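The observed position-dependent GC and purine contents described above can be read directly off a coding sequence. The sketch below only computes these observed quantities; it does not implement Model 1 or Model 2, whose formulas are not given in this excerpt. The function name `positional_contents` and the handling of ambiguous bases and trailing partial codons are assumptions made for illustration.

```python
def positional_contents(cds: str) -> dict:
    """Return {i: (gc_fraction, purine_fraction)} for codon positions i = 1, 2, 3.

    Assumptions: the sequence is in frame from its first base, a trailing
    partial codon is dropped, and bases other than A, C, G, T are skipped.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete final codon
    result = {}
    for i in range(3):
        bases = [b for b in cds[i:usable:3] if b in "ACGT"]
        if not bases:
            raise ValueError(f"no usable bases at codon position {i + 1}")
        gc = sum(b in "GC" for b in bases) / len(bases)       # observed GC fraction at position i
        purine = sum(b in "AG" for b in bases) / len(bases)   # observed purine fraction at position i
        result[i + 1] = (gc, purine)
    return result


# Codons ATG, GCC, AAA: position 3 holds G, C, A.
for position, (gc, purine) in positional_contents("ATGGCCAAA").items():
    print(position, round(gc, 3), round(purine, 3))
```

Averaging such values over all coding sequences of a species would give one point per species, which is the granularity at which the figures in this section are described.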

Nucleotide composition

We plotted the expected and observed frequencies of the four nucleotides and their individual frequencies at the three codon positions against GC content for all data in our collection (Additional file 2). An example for guanine is shown in Figure 2. Both models performed well across a wide range of GC contents and yielded very close predictions for the nucleotide G (Figure 2A to 2C) and for the three codon positions (Figure 2D to 2L). The expected compositions follow trends similar to the observed ones as GC content changes, although deviations at the second codon position appeared more pronounced than at the first and third codon positions (Figure 2G to 2I; discussed below). Furthermore, the expected compositions predicted by Model 1 correlated with GC content linearly, whereas those predicted by Model 2 appeared scattered around those by Model 1, indicating greater deviations. Taken together, the two models produced close predictions for the expected nucleotide compositions, exhibiting comparable trends with the observed (Figure 2 and Additional file 2).

Expected and observed guanine compositions. The nucleotide composition was examined in four scenarios: total frequencies (A to C), frequencies at first (D to F), second (G to I), and third (J to L) codon positions. The expected and observed guanine compositions were quantitated in Archaea (A, D, G and J), Bacteria (B, E, H and K), and Eukarya (C, F, I and L). Each point in all plots represents a sum of the composition from the species coding sequences and similar plots for all other nucleotides were summarized in Additional file 2.

Codon composition

We further used the models to predict the codon compositions (see Models). The expected and observed codon frequencies were plotted against GC content over all collected sequences (Additional file 3). We took four randomly selected codons (AAT, TGC, GCC, and CTT) as examples (Figure 3). When GC content varies from low to high, both models show consistent predictions for expected codon compositions that are very similar to the distributions of the observed (Figure 3A to 3L). Specifically, the expected compositions of codons AAT and CTT yield negative correlations with the increasing GC content, agreeing well with the observed (Figure 3A to 3C and 3J to 3L). In contrast, the expected compositions of TGC and GCC codons correlate positively with the increasing GC content, again consistent with the observed (Figure 3D to 3F and 3G to 3I). Moreover, in comparison with Model 2, the predicted trends by Model 1 are smoother when the GC content varies (Figure 3). Although there are deviations between the expected and observed in general, the two models predict rather consistent codon compositions (Figure 3 and Additional file 3).

Expected and observed codon compositions. We chose four codons randomly as examples: AAT (A to C), TGC (D to F), GCC (G to I), and CTT (J to L). The expected and observed codon compositions were quantitated in Archaea (A, D, G and J), Bacteria (B, E, H and K), and Eukarya (C, F, I and L). Each point in all plots represents a sum of the composition from the species coding sequences and similar plots for all other codons were summarized in Additional file 3.

Amino acid composition

Based on the expected codon compositions, we compared the expected and observed amino acid compositions across the three domains of life (Additional file 4). We chose to show here the plots for four amino acids, Asn (asparagine), Cys (cysteine), Ala (alanine), and Leu (leucine), which correspond to the codons AAT, TGC, GCC and CTT, respectively (Figure 4). Although predicting amino acid compositions may be complicated by the fact that most amino acids are encoded by multiple codons, and thus may involve greater deviations, both models still performed moderately well in quantifying the expected amino acid compositions. The expected compositions of Asn (encoded by AAT and AAC) decreased with increasing GC content, providing comparable trends with the observed (Figure 4A to 4C). In contrast, the expected compositions of Cys (encoded by TGT and TGC) appeared constant (extremely low) with changing GC content, displaying similar trends with the observed (Figure 4D to 4F), albeit slightly larger than the observed. As Ala is encoded by codons GCN (where N = A, T, G, C), the expected compositions of Ala dramatically increased with increasing GC content, but appeared smaller than the observed (especially in bacteria; discussed below). Nevertheless, the expected compositions of Ala still presented similar trends with the observed (Figure 4G to 4I). With regard to Leu (encoded by six different codons, CTN and TTR), its observed compositions appear much more scattered than those of Asn, Cys or Ala. Even so, both models are still capable of predicting consistent compositions for Leu. Although the expected compositions of Leu are smaller than the observed in archaea and bacteria, they appear closer to the observed in eukaryotes (Figure 4J to 4L). Additionally, comparing the expected amino acid compositions between the two models, we found that Model 1 again exhibits smoother trends (except for Leu) with increasing GC content.
Collectively, the two models also offered a consistent quantification for amino acid compositions across the three domains of life (Figure 4 and Additional file 4).

Expected and observed amino acid compositions. We took four representative amino acids as examples: Asn (asparagine; A to C), Cys (cysteine; D to F), Ala (alanine; G to I), and Leu (leucine; J to L). The expected and observed amino acid compositions for the four amino acids were quantitated in Archaea (A, D, G and J), Bacteria (B, E, H and K), and Eukarya (C, F, I and L). Each point in all plots represents a sum of the composition from the species coding sequences, and similar plots for all other amino acids were summarized in Additional file 4.


Discussion

We show that yeast isochores share characteristics with those found in higher eukaryotes in addition to those described before. Our results indicate that GC-rich and AT-rich domains are both structurally and functionally distinct. First, interaction frequencies within GC-rich chromatin tend to be lower than those in AT-rich chromatin, which is in agreement with a more extended chromatin conformation, as observed in higher eukaryotes [12, 13]. Second, similar to mammalian isochores, genes located in the most GC-rich regions of the yeast genome are, on average, more highly expressed (for example, [4]). Importantly, we found that GC-rich genes display higher levels of H3 and H4 acetylation compared to more AT-rich genes. Finally, we identify Rpd3p as a molecular component involved in base composition-dependent control of chromatin structure and function. This role of Rpd3p may be conserved in higher eukaryotes as it is also associated with less condensed interbands in Drosophila [43]. This activity appears to be specific for Rpd3p as we did not detect a base composition-dependent activity of another globally acting histone deacetylase, Hda1p.

Rpd3p has been shown to have two distinct modes of action. First, Rpd3p is recruited to specific target genes to modulate their expression. Second, Rpd3p acts in a global and non-targeted fashion to deacetylate bulk chromatin. We propose that the base composition-dependent effects of Rpd3p are related to its global activities. First, the magnitudes of these GC-content dependent effects are subtle, similar to the previously described effects of Rpd3p on global histone acetylation [40]. Second, deletion of Ume6p, a protein involved in recruitment of Rpd3p to many of its specific target genes [38], does not result in up-regulation of GC-rich genes, indicating that Rpd3p interacts with GC-rich genes in a Ume6p-independent manner. Third, the GC content-dependent effects are not correlated with the steady-state expression of genes, and thus seem unrelated to local promoter controls.

We favor the model that the global and untargeted activity of Rpd3p acts predominantly and/or has the largest effect on GC-rich chromatin. First, Rpd3p binds GC-rich genes more prominently than AT-rich genes. Second, deletion of RPD3 results in increased H4 acetylation, particularly of GC-rich genes. Finally, treatment of wild-type cells with the histone deacetylase inhibitor TSA activates GC-rich genes more strongly than AT-rich genes. However, we did observe that TSA induced activation of GC-rich genes requires more time than induction of many direct target Rpd3p genes. This relatively slow effect could be interpreted to mean that the base composition-dependent effects of deletion of RPD3 are indirect and are due to altered expression of a specific Rpd3p target gene that, in turn, encodes a protein that acts in a GC content-dependent fashion. Alternatively, and consistent with the Rpd3p localization and acetylation data, Rpd3p does directly affect expression of GC-rich genes, but this more global and non-targeted process occurs at a longer time scale or requires passage through a specific phase of the cell cycle.

An alternative or additional molecular explanation of the observed phenomena is related to potential base composition-dependent differences in wrapping of DNA around histones. AT-rich DNA may be more flexibly and more easily wrapped around nucleosomes than GC-rich DNA [44]. This physical model implies intrinsic differences in nucleosome organization dependent on base composition and does not require that histone modifying enzymes act in a base composition-dependent fashion per se. In this model, histone modifying enzymes recognize differences in intrinsic conformation of GC- and AT-rich chromatin. Rpd3p may preferentially act on the nucleosome organization of GC-rich chromatin. Similarly, acetyl transferases may preferentially modify GC-rich domains in wild type, resulting in higher levels of histone H3 and H4 acetylation, as we observed here. Based on these considerations, we predict the presence of histone acetyl transferases that act most preferentially on GC-rich chromatin.

In light of these observations, we can interpret our 3C analysis more precisely. The 3C results show that deletion of RPD3 differentially affected the conformation of GC- and AT-rich isochore domains along chromosome III, but did not allow determination of which of the two types of domains, or both, displayed an altered conformation. When Rpd3p activity affects GC-rich genes most prominently, the most parsimonious explanation of our 3C data is that deletion of RPD3 most strongly affects the conformation of the GC-rich domain, resulting in a more extended and transcriptionally active chromatin conformation, consistent with predicted relationships between transcription, histone acetylation and chromatin conformation.

GC-rich chromatin displays lower interaction frequencies, as detected by 3C, than AT-rich chromatin. Analysis of cross-linking efficiency suggests that both types of domains are cross-linked with similar frequencies (Additional data file 2) and, therefore, have similar protein densities. Histones are the most abundant chromatin proteins, and thus our results suggest that GC-rich and AT-rich regions have similar levels of histone binding. Consistent with this hypothesis, Nagy et al. [45] found no correlation between base composition and regions depleted in nucleosomes. Previously, we found a decrease in interaction frequencies upon activation of the FMR1 gene in human cells [19], similar to the observed changes in rpd3Δ cells described here, suggesting that reduced 3C interaction frequencies may be a general characteristic of active chromatin.

The base composition-dependent effect of Rpd3p activity affects expression of genes independent of their steady state level of expression. Genes with the same steady state expression level in wild type are more strongly repressed by Rpd3p when they are GC-rich than when they are AT-rich. This implies that GC-rich genes are intrinsically more active, consistent with higher steady state levels of H3 and H4 acetylation, as we observed here, and that Rpd3p acts as an attenuator of these genes. Based on these considerations, we propose that chromatin status is regulated through a homeostatic and highly dynamic mechanism involving counteracting activating and repressing activities. A similar model of dynamic global acetylation and deacetylation has been proposed by Katan-Khaykovich and Struhl [46] and by Clayton et al. [47].


Conclusions

Here we have presented two models that theoretically quantify expected compositions of nucleotides, codons, and amino acids, based merely on GC and purine contents (which are easily computed from input sequences). We evaluated the two models on a large collection of protein-coding sequences across the three domains of life. Our results show that the two models are capable of yielding consistent expected compositions. In addition, our results indicate that deviations of the observed from the expected compositions are signatures resulting from complex interplay between mutation and selection. Therefore, our models represent a promising theoretical framework for compositional studies.


Methods

Biological samples

All biological samples used for DNA extracts were from females. Blood samples from the chicken lines araucana, alsacian and white leghorn, and from Japanese quail, turkey, Pekin duck and Barbary duck, were obtained from breeds maintained at the INRA UE1295 PEAT experimental facilities (Pôle d’Expérimentation Animale de Tours, Agreement N° C37–175-1). Those from the red jungle fowl (RJF) Gallus gallus and the grey jungle fowl Gallus sonneratii were supplied by Christophe Bec, Parc des oiseaux (Villars les Dombes 01330 France) and Christophe Auzou (Grand Champs 89350 France). Those from the ostrich and African fisher eagle were supplied respectively by la Réserve de Beaumarchais (Autrèche 37110 France) and Zooparc de Beauval (Saint Aignan 41198 France). Biopsies of pectoral skeletal muscles from black-chinned hummingbird were supplied by the Department of Biology & Museum of Southwestern Biology of University of New Mexico (USA). For RNA extracts, tissue biopsies were supplied by Hendrix Genetics (Saint Laurent de la Plaine, France) and were from broiler breeder female chicks (Cobb 500), 35 weeks of age.

Nucleic acid purifications

gDNA samples were prepared from 100 μL of fresh red blood cells using the Nucleospin Tissue kit (Macherey-Nagel). Total RNA was extracted from abdominal adipose tissue, subcutaneous adipose tissue, liver and pectoral skeletal muscle from 35-week-old hens by homogenization in TRIzol reagent using an Ultra-Turrax, according to the manufacturer’s recommendations (Invitrogen by Life Technologies, Villebon sur Yvette, France). The quality and concentration of nucleic acid samples were evaluated using a NanoDrop™ 2000 spectrophotometer.

Illumina sequencing of RJF genome

Library construction and sequencing were performed at the Plateforme de Séquençage Haut Débit I2BC (Gif-sur-Yvette 91198 France). Illumina libraries of 600 bp genomic fragments were prepared without PCR amplification, as recommended, to optimize for the presence of GC- and AT-rich sequences (Aird et al. 2011, Oyola et al. 2012). Paired-end sequencing with reads of 75 and 250 nucleotides in length was performed with at least 10X coverage. All raw and processed data are available through the European Nucleotide Archive under accession numbers PRJEB22479 and PRJEB25675 for the RJF, PRJEB24169 for the ostrich, PRJEB27669 for the African fisher eagle and PRJEB27670 for the black-chinned hummingbird.

Searching databases

Files containing the 2323 cDNA models were downloaded from https://doi.org/10.6084/m9.figshare.5202853 [5, 20]; 2132 of these were validated [5, 20]. Their functional annotation must be carefully reviewed (Additional file 3a.1), because a few errors were found among them. For example, the cDNA models in file aves_ENSPSIG00000014155_Hmm5_gapClean40.fasta were annotated as coding piwiL4 when they should have been annotated as piwiL1. Data corresponding to orthologues of the ENSPSIG00000011968 Pelodiscus sequence and describing cDNA coding the SLC2A4/GLUT4 protein were not released by the authors. The galGal5 and galGal6 genome sequences and coding sequences were downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/Gallus_gallus. Releases 94 and 96 of the Ensembl chicken CDS were downloaded from ftp://ftp.ensembl.org/pub/release-94/fasta/gallus_gallus/ and ftp://ftp.ensembl.org/pub/release-96/fasta/gallus_gallus/. Searches with cDNA models as queries against the galGal5 and galGal6 CDSs available at NCBI and in Ensembl releases 94 and 96 were done using blastn within the script blastfasta.pl as described [51]. Searches for polyexonic genes using cDNA models as queries in the galGal5 and galGal6 genome models were done using blastn, and HSPs were fused using agregfilter.pl as described [51]. Searches for polyexonic genes using cDNA models as queries in Gallus WGS and TSA datasets were done using the “Remote BLAST” plugin at NCBI as recommended (https://www.ncbi.nlm.nih.gov/books/NBK279668/). Hits were defined as positive when more than 90% of the query was aligned with the subject sequence with an identity above 95%. Such a rate of identity was used because numerous cDNA models contained long stretches of Ns.
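The criterion for a positive hit stated above (more than 90% of the query aligned, identity above 95%) can be written as a small filter. This is a sketch under stated assumptions rather than the logic of blastfasta.pl or agregfilter.pl, which are not shown here: the `Hit` record, its field names, and the idea that the aligned query length has already been summed over fused HSPs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one query/subject pair after HSP fusion."""
    query_length: int          # length of the cDNA model used as query
    aligned_query_bases: int   # query bases covered by the alignment
    percent_identity: float    # identity over the aligned region, in percent


def is_positive(hit: Hit, min_coverage: float = 0.90, min_identity: float = 95.0) -> bool:
    """Apply the thresholds from the text: coverage > 90% and identity > 95%."""
    coverage = hit.aligned_query_bases / hit.query_length
    return coverage > min_coverage and hit.percent_identity > min_identity


print(is_positive(Hit(1000, 950, 97.2)))  # True
print(is_positive(Hit(1000, 900, 99.0)))  # False: coverage is not above 90%
```

Both comparisons are strict, following the wording "more than" and "above"; whether the original scripts treat boundary values the same way is not stated in the text.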

To reconstruct leptin cDNAs in five palaeognathae species, we used public Illumina datasets for the white-throated tinamou (dataset ID: SRR952232 to SRR952238), the brown kiwi (dataset ID: ERR519283 to ERR519288 and ERR522063 to ERR5220668), the mallard duck (dataset ID: SRR7194749 to SRR7194798), the muscovy duck (dataset ID: SRR6300650 to SRR6300675 and SRR6305144) and the Japanese quail (24 Illumina datasets of PRJNA292031 plus about 20 Gbp of PacBio reads kindly supplied by Dr. J. Gros (Pasteur Institute, Paris, France)).

Construction of gene models from Pacbio reads

Files containing Illumina and PacBio reads from RJF available from public databases were downloaded using the sra toolkit (downloaded at https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/) to produce fasta formatted files. Presence of contamination in Illumina reads was tested for each file by aligning it to the genomes of the most commonly studied genetic models (Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Danio rerio, Mus musculus and Homo sapiens) using HISAT2 or bowtie2, depending on the mRNA or the genomic origin of the nucleic acids. Below 2.5% of aligned reads per file or alignment, we considered that there was no significant contamination by these species. Illumina reads were thereafter searched using blastn. PacBio reads were searched using blastn and blasr [52], which produced similar results once parameters were optimized (blastn command line: blastn -db [name of the database for blast] -query [query.fa] -out [name of the file containing results] -evalue 100 -task blastn -word_size 5 -dust no -num_threads 1; blasr command line: blasr [PacBio reads file.fa] [query.fa] --header --minReadLength 400 --minAlnLength 400 --bestn 100 --out Resoutput -m 0 --nproc 1). Reads were extracted and oriented before alignment; Illumina reads were aligned with MUSCLE. PacBio reads were aligned one by one using the --add option of mafft, with the alignment seed being either one cDNA model previously aligned with Illumina reads (as for the leptin gene) or one cDNA model manually aligned with a PacBio read, generally the best hit in the blasr and blastn results. The alignment of each PacBio read aligned with mafft was verified and adjusted manually in order to properly align poorly sequenced regions as well as to identify the polypurine or polypyrimidine tract insertions that are sporadically found in sequenced GC-rich reads.
The five alignments produced here can be visualized using Aliview (https://github.com/AliView/AliView) or Seaview (http://www.seaviewfishing.com/DownloadSoftware.html) freeware packages. Additional files 4 and 6, 7, 8, 9 are in fasta format and contain the alignment of all Illumina and/or PacBio reads used to calculate the genomic models of the RJF leptin, TNFα, MRPL52, PCP2 and PET100 genes. In Additional file 4, the 10 sequences at the bottom of the alignment correspond to the Seroussi et al. (2016) partial cDNA model and the 9 reads used to calculate it. Sequence names of each read are indicated. The label “strand (+)” or “strand (-)” at the end of the name indicates whether the read was aligned in the forward or reverse-complement orientation. Sequence sections that were impossible to align were represented as N tracts. Features of orthologous genes in vertebrates were investigated using genomicus 94.01 at http://www.genomicus.biologie.ens.fr/genomicus-94.01/cgi-bin/search.pl.

Detection of non-B DNA G4 motifs

To detect candidate motifs putatively able to assemble G4 and other non-B DNA motifs, we used the facilities available at https://nonb-abcc.ncifcrf.gov/apps/site/default [35]. G4 candidates are detected by text mining for the strict motif G3+N1-12G3+N1-12G3+N1-12G3. Because this motif does not tolerate any sequence degeneracy, the detection probability of a detected motif is 100%.
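The strict G4 motif described above maps naturally onto a regular expression. The sketch below is an illustration of that text-mining idea, not the code behind the nonb-abcc service. It assumes that "G3+" means a run of three or more guanines, that "N1-12" means a loop of 1 to 12 arbitrary bases, and that the fourth G run follows the same rule as the first three; the names `G4_MOTIF` and `find_g4_candidates` are hypothetical.

```python
import re

# Assumed reading of the motif: four runs of >= 3 G separated by loops of 1-12 bases.
G4_MOTIF = re.compile(r"G{3,}[ACGT]{1,12}G{3,}[ACGT]{1,12}G{3,}[ACGT]{1,12}G{3,}")


def find_g4_candidates(sequence: str) -> list:
    """Return (start, end, motif) for non-overlapping matches on the given strand.

    Coordinates are 0-based and end-exclusive. Only the strand passed in is
    scanned; C-rich motifs on the complementary strand are not reported.
    """
    sequence = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_MOTIF.finditer(sequence)]


print(find_g4_candidates("TTGGGAGGGTGGGCGGGTT"))  # [(2, 17, 'GGGAGGGTGGGCGGG')]
print(find_g4_candidates("TTGGGAGGGTGGGCGGTT"))   # []: the last run has only two G
```

Because the pattern allows no mismatches, a sequence either matches or does not, which is consistent with the statement that a detected motif carries no uncertainty about its own presence. To cover both strands, the same function could be run on the reverse complement.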

PCR on gDNA

Our optimal procedure for PCR amplification was performed on 60 ng of avian gDNA in 10 mM Tris-HCl, pH 9, 4 mM MgCl2, 50 mM KCl, 0.1% TritonX100, 150 μM of each dNTP, and 0.1 mM of each oligonucleotide in a 50 μl reaction volume with 1 unit GoTaq DNA Polymerase (Promega). Each PCR was carried out in a programmable temperature controller (Eppendorf) for 30 cycles. After initial denaturation (5 min at 98 °C), the cycle was as follows: denaturing at 98 °C for 20″, annealing at 60 °C for 15″, and extension at 72 °C for 1′. At the end of the 30th cycle, the heat denaturing step was omitted, and extension was allowed to proceed at 72 °C for 5′. Each amplified sample could then be purified using a QIAquick PCR purification kit (Qiagen) and sent to Eurofins Genomics for Sanger sequencing in order to verify its leptin origin.

RT-PCR assay

The classic way we used to generate cDNA was by reverse transcription (RT) of total RNA (1 μg) in a mixture comprising 0.5 mM of each deoxyribonucleotide triphosphate (dATP, dGTP, dCTP and dTTP), 2 M of RT buffer, 15 μg/μL of oligodT, 0.125 U of ribonuclease inhibitor, and 0.05 U M-MLV RT for one hour at 37 °C. Our optimal procedure to synthesize cDNA was performed from 0.1 μg total RNA using a mix of oligodT and hexanucleotides as primers and the Opti M-MLV RT under conditions recommended by the supplier (Eurobio). Real-time PCR was performed using the MyiQ Cycle device (Bio-Rad, Marnes-la-Coquette, France), in a mixture containing SYBR Green Supermix 1X reagent (Bio-Rad, Marnes la Coquette, France), 250 nM specific primers (Invitrogen by Life Technologies, Villebon sur Yvette, France) and 5 μL of cDNA (diluted five-fold) for a total volume of 20 μL. Samples were duplicated on the same plate and the following PCR procedure was used: after an incubation of 2 min at 50 °C and a denaturation step of 10 min at 95 °C, samples were subjected to 40 cycles (30 s at 95 °C, 30 s at 60 °C and 30 s at 72 °C). The levels of expression of messenger RNA were standardized to the GAPDH reference gene. For the leptin gene, the relative abundance of transcription was determined by the calculation of e^(-Ct). Relative expression of the gene of interest was then related to the relative expression of the geometric mean of the GAPDH reference gene.
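The normalization described above can be sketched numerically. This is an illustrative reading, not the authors' script: it takes the abundance of each well as base^(-Ct) with the base defaulting to Euler's number, as the text is written, and it combines duplicate wells by geometric mean for both the target and the reference, which is an assumption for the target gene. If "e" was meant as an amplification efficiency (for example 2), the `base` parameter can be changed. All function names are hypothetical.

```python
import math


def relative_abundance(ct: float, base: float = math.e) -> float:
    """Abundance of one well, computed as base ** (-Ct)."""
    return base ** (-ct)


def geometric_mean(values: list) -> float:
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_expression(target_cts: list, reference_cts: list, base: float = math.e) -> float:
    """Target abundance divided by the geometric mean abundance of the reference gene."""
    target = geometric_mean([relative_abundance(ct, base) for ct in target_cts])
    reference = geometric_mean([relative_abundance(ct, base) for ct in reference_cts])
    return target / reference


# Duplicate wells: leptin at Ct 25, GAPDH at Ct 20, so the ratio is e ** -5.
print(normalized_expression([25.0, 25.0], [20.0, 20.0]))
```

With a shared base, the ratio depends only on the difference between the mean Ct values of the target and the reference, so a gene detected five cycles later than GAPDH is reported at base^(-5) of the GAPDH level.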


Materials and Methods

Plasmids.

pcDNA3.1(+) and pcDNA3.1(−) were from Invitrogen (Invitrogen, Carlsbad, California, United States). These plasmids contain a strong constitutive CMV promoter, a BGH polyadenylation signal, and a neomycin resistance gene for G418-based selection of mammalian cells.

pcDNA3-Hsp70-HA: the HSPA1A (gi: 188487) coding region was amplified by PCR using the Hsp70-HA-U and Hsp70-HA-L primers (see Table S2). This appended an HA tag to the Hsp70 ORF. The PCR product was digested with EcoRV and XhoI and inserted into the EcoRV and XhoI sites of the pcDNA3.1 vector (Invitrogen), under the control of the CMV promoter.

pcDNA3-Hsc70-HA: the HSPA8 (gi: 32466) coding region was amplified by PCR from HeLa cDNA using the Hsc70-HA-U and Hsc70-HA-L primers (Table S2). This replaced the first three Hsc70 amino acids with Hsp70 amino acids, improved the Hsc70 Kozak sequence, and appended an HA tag to the Hsc70 ORF. The PCR product was digested with EcoRV and XhoI and inserted into the EcoRV and XhoI sites of the pcDNA3.1 vector (Invitrogen).

pEGFP-N2 was from Clontech (Clontech, Palo Alto, California, United States).

pGFP-N2 was constructed by first introducing the R80Q mutation into the GFP sequence in the pS65T-C1 vector (Clontech) using the GFP-R80Q-U and GFP-R80Q-L primers, then by introducing the F64L mutation, using the GFP-F64L-U and GFP-F64L-L primers, and then by amplifying the coding region of the modified GFP using the BamHI-5′-GFP and GFP-3′-NotI primers (Table S2). The PCR product was digested with BamHI and NotI and then inserted into the BamHI and NotI sites of the pEGFP-N2 vector. The resulting pGFP-N2 vector encoded a GFP with the same amino acid sequence and Kozak sequence as pEGFP-N2 (see Dataset S1).

pcDNA3-IL2 was constructed by extracting the IL2 cDNA from pWPXL-IL2 (kind gift from D. Kowalczyk) using BamHI and EcoRI and insertion into the BamHI, EcoRI sites of pcDNA3.1 (+).

pcDNA3-eIL2 and pcDNA3-wIL2 were constructed by introducing the synthetic eIL2 or wIL2 genes (ordered from Geneart, Regensburg, Germany) into the HindIII, EcoRI sites of pcDNA3.1 (+). The sequences of eIL2 and wIL2 can be found online in the Dataset S1.

pcDNA3-IL2-eIL2 was constructed from pcDNA-IL2 by replacing a fragment of the IL2 gene by a fragment of the eIL2 gene, PCR-amplified from pcDNA-eIL2 using the eIL2-1152-U and eIL2-1480-L primers (Table S2) and digested with XbaI.

pcDNA5/FRT-IL2, pcDNA5/FRT-wIL2, and pcDNA5/FRT-eIL2 were constructed by extracting the IL2, wIL2, and eIL2 coding regions, respectively, from pcDNA3-IL2, pcDNA3-wIL2, and pcDNA3-eIL2 using HindIII and NotI and inserting them into the HindIII, NotI sites of pcDNA5/FRT/TO (Invitrogen).

pcDNA5/FRT/CAT was from Invitrogen.

pcDNA5/FRT-GFP and pcDNA5/FRT-EGFP were generated by subcloning the BamHI, NotI fragments from pEGFP-N2 or pGFP-N2 into the pcDNA5/FRT/TO vector digested with BamHI and NotI.

Following cloning, the coding regions of all plasmids were sequenced.

Cell culture.

Adherent HeLa cells and 293T cells were cultured at 37 °C in a humidified atmosphere containing 5% CO2, in Dulbecco's Modified Eagle's Medium (DMEM, Sigma D5523, Sigma, St. Louis, Missouri, United States) with 10% heat-inactivated Fetal Bovine Serum (FBS, Sigma F7524) and the antibiotic/antimycotic mixture (Sigma). Saos-2 cells (ATCC) were cultured in McCoy's medium with 15% non-inactivated FBS and the antibiotic/antimycotic mixture. MCF-7 cells (ATCC) were grown in RPMI-1640 (Sigma) containing 10% FBS. For stable transfection of MCF-7 cells, linearized plasmids were transfected using Lipofectamine 2000 (Invitrogen), and clones were selected using 750 μg/mL neomycin (G418 Sigma). The Flp-In T-Rex-293 cell line (Invitrogen) and Flp-In TM3 cells (mouse Leydig cells, L. Lipinski, unpublished data) were cultured in DMEM with 4.5 g/L glucose, 10% FBS, 100 μg/mL zeocin. 15 μg/mL blasticidin was additionally used for the Flp-In T-Rex-293 cell line. Following the transfection of Flp-In cells, stable transfectants were selected using 100 μg/mL hygromycin B instead of zeocin. Generation of clonal cell lines was performed according to manufacturer's instructions (Flp-In T-Rex Core Kit, Instruction manual, Invitrogen). Total cellular DNA of individual Flp-In T-Rex-293-derived and Flp-In TM3-derived clones was analyzed by qPCR to confirm the presence of a single transgene copy in each clone. GFP and IL2 expression in Flp-In T-Rex-293 cells was induced by adding 1 μg/mL tetracycline for 24 h before harvesting.

For transient transfection of HeLa cells, 5.5 × 10⁴ cells per well were seeded in a 24-well plate (Corning, New York, United States). For each well, 0.3 μg plasmid DNA and 1 μL Lipofectamine 2000 (Invitrogen) were used according to the manufacturer's instructions. Following 24 h of incubation, 50%–80% transfection efficiency and > 95% cell viability were routinely achieved, as detected by fluorescence microscopy, immunofluorescence microscopy, and flow cytometry. For transfection of 293T cells, 8 × 10⁴ cells per well were used in a 24-well plate. For each well, 0.4 μg pure plasmid DNA was mixed with 25 μL DMEM without FBS, and 0.8 μL 1 mg/mL polyethyleneimine (PEI, Polysciences Incorporated, Warrington, Pennsylvania, United States) in H2O was added to this mixture. The mixture was incubated for 10 min at room temperature and then added onto the cells. The transfection efficiency and cell viability were similar to those for HeLa cells. For transfection of Saos-2 cells, 1.6 × 10⁵ cells per well were seeded in a 12-well plate. For each well, 0.8 μg DNA was mixed with 50 μL DMEM, 1.6 μL of 1 mg/mL PEI was added, and the mixture was incubated for 10 min and spread on the cells. Transfection efficiency was 20%. For mRNA quantification, all transfections were scaled up to 6-well plates.

SDS-PAGE and Western blotting.

Cells were washed once with ice-cold PBS and lysed directly in the wells in 70 μL 1 × SDS sample buffer. Lysates were boiled for 5 min, and amounts corresponding to about 5 μg total protein per lane were loaded on 10% polyacrylamide gels. A prestained protein ladder (PAGE-Ruler, Fermentas, Burlington, Ontario, Canada) was routinely used. Following electrophoresis, proteins were transferred onto a nitrocellulose membrane (Pall) using a Bio-Rad blotting system (Bio-Rad, Hercules, California, United States). The following antibodies were used for detection: rabbit anti-HA, sc-805 (Santa Cruz Biotechnology), 1:2000; rabbit anti-GAPDH, sc-25778 (Santa Cruz Biotechnology), 1:6000; goat anti-rabbit IgG-HRP conjugated, 401393 (Calbiochem, San Diego, California, United States), 1:6000. The membranes were soaked in the chemiluminescence reagent immediately before exposure to a Kodak BioMax film.

Flow cytometry.

Cells were trypsinized, washed with medium containing 10% FBS, resuspended in PBS with 5% DMSO, and stored at −70 °C. The flow cytometry analysis was performed using BD FACS Calibur. Forward scatter and side scatter measurements were used to define a homogenous population of living cells, and the FL1 channel was used to detect the GFP or EGFP fluorescence. For fluorescence quantification, the arithmetic mean of all events corresponding to living cells was used.

IL2 ELISA.

24 h following transfection, cell culture media were gathered and centrifuged 1 min at 14,000 rpm. Supernatants were diluted to the appropriate concentration with PBS + 10% heat-inactivated FBS, and IL2 concentrations were measured using the OptEIA human IL2 ELISA set (BD Biosciences, Palo Alto, California, United States) according to the manufacturer's instructions.

In vitro transcription and translation.

Capped Hsp70 and Hsc70 mRNA was produced in vitro using the T7 Cap Scribe kit (Roche, Basel, Switzerland) according to the manufacturer's instructions. The mRNA was analyzed by 1% agarose gel electrophoresis to confirm the absence of degradation. The in vitro translations were performed at 28 °C using the Reticulocyte Translation Kit Type II (Roche) and ³⁵S-labeled methionine (Amersham Biosciences, Little Chalfont, United Kingdom). The reactions contained 1–2 μg Hsp70 or Hsc70 mRNA, 2 μL translation reaction mix without methionine, 50 mM potassium acetate, 1.25 mM magnesium acetate, 2 μL ³⁵S-Met (10 mCi/mL), and 10 μL rabbit reticulocyte lysate, in a total reaction volume of 25 μL. The reactions were started by the addition of rabbit reticulocyte lysate, and stopped after the desired time by addition of SDS sample buffer, followed by SDS-PAGE and autoradiography.

mRNA quantification.

Total cellular RNA was purified using the NucleoSpin kit (Macherey Nagel, Germany) according to the manufacturer's instructions. The NucleoSpin purification procedure comprises on-column DNA digestion using DNase I. On several occasions, we verified the absence of contaminating plasmid DNA in our RNA preparations by omitting the reverse transcriptase in the RT reactions and then performing the real-time PCR. We never observed any significant contamination with this purification method. RNA concentration was measured spectrophotometrically, and approximately 1.5 μg of total RNA was used in each cDNA synthesis reaction. cDNA synthesis was performed using the RevertAid kit (Fermentas) with (dT)18 primers. Real-time PCR cDNA quantification was performed using Light-Cycler (Roche) with Sybr Green II (Sigma). The primer sequences are shown in Table S2. Equal transfection efficiency in transient transfection experiments was controlled using the neomycin resistance gene (neo), present in all our experimental constructs. The neo gene cDNA from the pEGFP-N2 and pGFP-N2 plasmids was amplified using the neo(GFP) primers, and the neo gene cDNA from the pcDNA3-Hsp70-HA, pcDNA3-Hsc70-HA, and all the pcDNA3-IL2 plasmids was amplified using the neo(pcDNA) primers. The IL2 and GFP variants expressed in the Flp-In cells were quantified using the pcDNA5-UTR-U and pcDNA5-UTR-L primers. For RNA stability assays, cells were treated with 10 μg/mL actinomycin D (Sigma) for 0–7 h before RNA isolation. mRNA half-lives were determined by fitting exponential decay curves to experimental data points.
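To illustrate the half-life estimation mentioned above, the following sketch fits a single-exponential decay, N(t) = N0 · exp(−k·t), by ordinary least squares on log-transformed mRNA levels and reports t½ = ln 2 / k. The source does not state which fitting procedure was used, so the log-linear regression here is an assumed stand-in; the function name and the data points are synthetic and purely illustrative.

```python
import math

def fit_half_life(times_h, levels):
    """Fit ln(level) = ln(N0) - k * t by least squares.

    times_h: time points in hours after actinomycin D addition.
    levels:  relative mRNA levels (must be positive).
    Returns (k, half_life) with k in 1/h and half_life in hours.
    """
    logs = [math.log(v) for v in levels]
    n = len(times_h)
    t_mean = sum(times_h) / n
    y_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times_h)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times_h, logs))
    k = -sxy / sxx  # the decay constant is the negative of the slope
    return k, math.log(2) / k

# Synthetic, noise-free data generated with a 2 h half-life (not measured values).
times = [0, 1, 2, 4, 7]
levels = [100 * 0.5 ** (t / 2.0) for t in times]
k, t_half = fit_half_life(times, levels)
print(round(t_half, 3))  # 2.0
```

Log-linear fitting weights low-abundance late time points more heavily than a direct nonlinear fit would, so with noisy data the two approaches can give somewhat different half-lives.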


What is PCR used for?

Once amplified, the DNA produced by PCR can be used in many different laboratory procedures. For example, most mapping techniques in the Human Genome Project (HGP) relied on PCR.

PCR is also valuable in a number of laboratory and clinical techniques, including DNA fingerprinting, detection of bacteria or viruses (particularly HIV, the virus that causes AIDS), and diagnosis of genetic disorders.


15.3 Eukaryotic Transcription

By the end of this section, you will be able to do the following:

  • List the steps in eukaryotic transcription
  • Discuss the role of RNA polymerases in transcription
  • Compare and contrast the three RNA polymerases
  • Explain the significance of transcription factors

Prokaryotes and eukaryotes perform fundamentally the same process of transcription, with a few key differences. The most important difference between prokaryote and eukaryote transcription is due to the latter’s membrane-bound nucleus and organelles. With the genes bound in a nucleus, the eukaryotic cell must be able to transport its mRNA to the cytoplasm and must protect its mRNA from degrading before it is translated. Eukaryotes also employ three different polymerases that each transcribe a different subset of genes. Eukaryotic mRNAs are usually monogenic, meaning that they specify a single protein.

Initiation of Transcription in Eukaryotes

Unlike the prokaryotic polymerase that can bind to a DNA template on its own, eukaryotes require several other proteins, called transcription factors, to first bind to the promoter region and then to help recruit the appropriate polymerase.

The Three Eukaryotic RNA Polymerases

The features of eukaryotic mRNA synthesis are markedly more complex than those of prokaryotes. Instead of a single polymerase comprising five subunits, the eukaryotes have three polymerases that are each made up of 10 subunits or more. Each eukaryotic polymerase also requires a distinct set of transcription factors to bring it to the DNA template.

RNA polymerase I is located in the nucleolus, a specialized nuclear substructure in which ribosomal RNA (rRNA) is transcribed, processed, and assembled into ribosomes (Table 15.1). The rRNA molecules are considered structural RNAs because they have a cellular role but are not translated into protein. The rRNAs are components of the ribosome and are essential to the process of translation. RNA polymerase I synthesizes all of the rRNAs from the tandemly duplicated set of 18S, 5.8S, and 28S ribosomal genes. (Note that the “S” designation applies to “Svedberg” units, a nonadditive value that characterizes the speed at which a particle sediments during centrifugation.)

RNA Polymerase | Cellular Compartment | Product of Transcription | α-Amanitin Sensitivity
I | Nucleolus | All rRNAs except 5S rRNA | Insensitive
II | Nucleus | All protein-coding nuclear pre-mRNAs | Extremely sensitive
III | Nucleus | 5S rRNA, tRNAs, and small nuclear RNAs | Moderately sensitive

RNA polymerase II is located in the nucleus and synthesizes all protein-coding nuclear pre-mRNAs. Eukaryotic pre-mRNAs undergo extensive processing after transcription but before translation. For clarity, this module’s discussion of transcription and translation in eukaryotes will use the term “mRNAs” to describe only the mature, processed molecules that are ready to be translated. RNA polymerase II is responsible for transcribing the overwhelming majority of eukaryotic genes.

RNA polymerase III is also located in the nucleus. This polymerase transcribes a variety of structural RNAs that includes the 5S pre-rRNA, transfer pre-RNAs (pre-tRNAs), and small nuclear pre-RNAs. The tRNAs have a critical role in translation: they serve as the “adaptor molecules” between the mRNA template and the growing polypeptide chain. Small nuclear RNAs have a variety of functions, including “splicing” pre-mRNAs and regulating transcription factors.

A scientist characterizing a new gene can determine which polymerase transcribes it by testing whether the gene is expressed in the presence of α-amanitin, an oligopeptide toxin produced by the fly agaric toadstool mushroom and other species of Amanita. Interestingly, the α-amanitin affects the three polymerases very differently (Table 15.1). RNA polymerase I is completely insensitive to α-amanitin, meaning that the polymerase can transcribe DNA in vitro in the presence of this poison. RNA polymerase III is moderately sensitive to the toxin. In contrast, RNA polymerase II is extremely sensitive to α-amanitin. The toxin prevents the enzyme from progressing down the DNA, and thus inhibits transcription. Knowing the transcribing polymerase can provide clues as to the general function of the gene being studied. Because RNA polymerase II transcribes the vast majority of genes, we will focus on this polymerase in our subsequent discussions about eukaryotic transcription factors and promoters.

RNA Polymerase II Promoters and Transcription Factors

Eukaryotic promoters are much larger and more intricate than prokaryotic promoters. However, both have a sequence similar to the -10 sequence of prokaryotes. In eukaryotes, this sequence is called the TATA box, and has the consensus sequence TATAAA on the coding strand. It is located at -25 to -35 bases relative to the initiation (+1) site (Figure 15.10). This sequence is not identical to the E. coli -10 box, but it conserves the A–T rich element. The thermostability of A–T bonds is low and this helps the DNA template to locally unwind in preparation for transcription.
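As a rough illustration of the positional rule above, the sketch below scans a coding-strand upstream sequence for the TATAAA consensus and keeps only matches that start between −35 and −25 relative to the initiation site. The function name, the convention that the last base of the input is position −1, and the toy sequence are all assumptions made for this example. Real promoters often deviate from the exact consensus, which an exact string match will miss.

```python
def find_tata_boxes(upstream, motif="TATAAA", window=(-35, -25)):
    """Return start positions (relative to +1) of motif hits inside the window.

    upstream: coding-strand sequence whose last base is position -1,
              so the base at index i sits at position i - len(upstream).
    """
    seq = upstream.upper()
    n = len(seq)
    hits = []
    start = seq.find(motif)
    while start != -1:
        position = start - n  # convert 0-based index to promoter coordinate
        if window[0] <= position <= window[1]:
            hits.append(position)
        start = seq.find(motif, start + 1)  # continue after this hit
    return hits

# Toy 40-base upstream region with a TATA box starting at -30.
toy_upstream = "G" * 10 + "TATAAA" + "C" * 24
print(find_tata_boxes(toy_upstream))  # [-30]
```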

Instead of the simple σ factor that helps bind the prokaryotic RNA polymerase to its promoter, eukaryotes assemble a complex of transcription factors required to recruit RNA polymerase II to a protein coding gene. Transcription factors that bind to the promoter are called basal transcription factors. These basal factors are all called TFII (for Transcription Factor/polymerase II) plus an additional letter (A-J). The core complex is TFIID, which includes a TATA-binding protein (TBP). The other transcription factors systematically fall into place on the DNA template, with each one further stabilizing the pre-initiation complex and contributing to the recruitment of RNA polymerase II.

Visual Connection

A scientist splices a eukaryotic promoter in front of a bacterial gene and inserts the gene in a bacterial chromosome. Would you expect the bacteria to transcribe the gene?

Some eukaryotic promoters also have a conserved CAAT box (GGCCAATCT) at approximately -80. Further upstream of the TATA box, eukaryotic promoters may also contain one or more GC-rich boxes (GGCG) or octamer boxes (ATTTGCAT). These elements bind cellular factors that increase the efficiency of transcription initiation and are often identified in more “active” genes that are constantly being expressed by the cell.

Basal transcription factors are crucial in the formation of a preinitiation complex on the DNA template that subsequently recruits RNA polymerase II for transcription initiation. The complexity of eukaryotic transcription does not end with the polymerases and promoters. An army of other transcription factors, which bind to upstream enhancers and silencers, also help to regulate the frequency with which pre-mRNA is synthesized from a gene. Enhancers and silencers affect the efficiency of transcription but are not necessary for transcription to proceed.

Promoter Structures for RNA Polymerases I and III

The processes of bringing RNA polymerases I and III to the DNA template involve slightly less complex collections of transcription factors, but the general theme is the same.

The conserved promoter elements for genes transcribed by polymerases I and III differ from those transcribed by RNA polymerase II. RNA polymerase I transcribes genes that have two GC-rich promoter sequences in the -45 to +20 region. These sequences alone are sufficient for transcription initiation to occur, but promoters with additional sequences in the region from -180 to -105 upstream of the initiation site will further enhance initiation. Genes that are transcribed by RNA polymerase III have upstream promoters or promoters that occur within the genes themselves.

Eukaryotic transcription is a tightly regulated process that requires a variety of proteins to interact with each other and with the DNA strand. Although the process of transcription in eukaryotes involves a greater metabolic investment than in prokaryotes, it ensures that the cell transcribes precisely the pre-mRNAs that it needs for protein synthesis.

Evolution Connection

The Evolution of Promoters

The evolution of genes may be a familiar concept. Mutations can occur in genes during DNA replication, and the result may or may not be beneficial to the cell. By altering an enzyme, structural protein, or some other factor, the process of mutation can transform functions or physical features. However, eukaryotic promoters and other gene regulatory sequences may evolve as well. For instance, consider a gene that, over many generations, becomes more valuable to the cell. Maybe the gene encodes a structural protein that the cell needs to synthesize in abundance for a certain function. If this is the case, it would be beneficial to the cell for that gene’s promoter to recruit transcription factors more efficiently and increase gene expression.

Scientists examining the evolution of promoter sequences have reported varying results. In part, this is because it is difficult to infer exactly where a eukaryotic promoter begins and ends. Some promoters occur within genes; others are located very far upstream, or even downstream, of the genes they are regulating. However, when researchers limited their examination to human core promoter sequences that were defined experimentally as sequences that bind the preinitiation complex, they found that promoters evolve even faster than protein-coding genes.

It is still unclear how promoter evolution might correspond to the evolution of humans or other complex organisms. However, the evolution of a promoter to effectively make more or less of a given gene product is an intriguing alternative to the evolution of the genes themselves.

Eukaryotic Elongation and Termination

Following the formation of the preinitiation complex, the polymerase is released from the other transcription factors, and elongation is allowed to proceed as it does in prokaryotes with the polymerase synthesizing pre-mRNA in the 5' to 3' direction. As discussed previously, RNA polymerase II transcribes the major share of eukaryotic genes, so in this section we will focus on how this polymerase accomplishes elongation and termination.

Although the enzymatic process of elongation is essentially the same in eukaryotes and prokaryotes, the DNA template is considerably more complex. When eukaryotic cells are not dividing, their genes exist as a diffuse mass of DNA and proteins called chromatin. The DNA is tightly packaged around charged histone proteins at repeated intervals. These DNA–histone complexes, collectively called nucleosomes, are regularly spaced and include 146 nucleotides of DNA wound around eight histones like thread around a spool.

For polynucleotide synthesis to occur, the transcription machinery needs to move histones out of the way every time it encounters a nucleosome. This is accomplished by a special protein complex called FACT, which stands for “facilitates chromatin transcription.” This complex pulls histones away from the DNA template as the polymerase moves along it. Once the pre-mRNA is synthesized, the FACT complex replaces the histones to recreate the nucleosomes.

The termination of transcription is different for the different polymerases. Unlike in prokaryotes, elongation by RNA polymerase II in eukaryotes takes place 1,000 to 2,000 nucleotides beyond the end of the gene being transcribed. This pre-mRNA tail is subsequently removed by cleavage during mRNA processing. On the other hand, RNA polymerases I and III require termination signals. Genes transcribed by RNA polymerase I contain a specific 18-nucleotide sequence that is recognized by a termination protein. The process of termination in RNA polymerase III involves an mRNA hairpin similar to rho-independent termination of transcription in prokaryotes.

