How do you convert mtDNA sequences in FASTA to FSTAT format?

I've got control region sequence data from a population of sharks and I'm looking to convert this from FASTA to FSTAT in order to calculate the effective population size of females. The software I want to use only accepts FSTAT or Genepop files.

Is it possible to convert FASTA to FSTAT or even Genepop?

Check out PGDSpider. The inputs and outputs table indicates that it supports conversion between FASTA and FSTAT formats, among many others.
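If no converter is at hand, one workaround sometimes used for mtDNA is to treat the whole control region as a single haploid locus and code each distinct haplotype as an allele. The sketch below makes several assumptions: the sequences are already aligned, each FASTA header starts with a population label followed by an underscore (a naming convention invented here for illustration), and the downstream program accepts Genepop's three-digit haploid coding. Check the program's manual before relying on the output.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_genepop(lines, title="mtDNA control region", locus="CR"):
    """Code each distinct haplotype as a 3-digit allele at one haploid locus."""
    allele_of = {}               # haplotype sequence -> allele number
    pops = defaultdict(list)     # population label -> [(individual, allele)]
    for header, seq in read_fasta(lines):
        name = header.split()[0]
        pop = name.split("_")[0]  # assumed convention: POP_individual
        allele = allele_of.setdefault(seq.upper(), len(allele_of) + 1)
        pops[pop].append((name, allele))
    out = [title, locus]
    for pop in pops:
        out.append("Pop")
        out.extend(f"{name} , {allele:03d}" for name, allele in pops[pop])
    return "\n".join(out) + "\n"
```

Identical sequences receive the same allele number, so the output only captures haplotype identity, not the number of differences between haplotypes. The three-digit coding also limits the sketch to 999 distinct haplotypes.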

If you want something quick and dirty you could rapidly index the FASTA with samtools faidx and then put the lengths column through R (other languages are available) on the command line.

This outputs a statistical summary, and creates a PDF in the current directory called Rplots.pdf, containing a histogram.
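A comparable summary can be sketched in Python. The .fai index written by samtools faidx is tab-separated, with the sequence name in the first column and its length in the second; the function name and the choice of statistics below are illustrative only.

```python
import statistics


def fai_length_summary(fai_lines):
    """Summarise sequence lengths from the lines of a samtools .fai index."""
    lengths = sorted(
        int(line.split("\t")[1]) for line in fai_lines if line.strip()
    )
    total = sum(lengths)
    # N50: walking from the longest sequence down, the length at which
    # the running total first reaches half of all bases.
    running, n50 = 0, 0
    for length in reversed(lengths):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(lengths),
        "min": lengths[0],
        "median": statistics.median(lengths),
        "mean": total / len(lengths),
        "max": lengths[-1],
        "n50": n50,
    }
```

The function accepts any iterable of lines, so an open file handle on the index (for example a hypothetical reads.fasta.fai) can be passed directly.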

Statistics for nanopore reads are tricky because of the huge range of read lengths that can be present in a single run. I have found that the best way to display lengths is by using a log scale on both the x axis (length) and the y axis (sequenced bases, or counts, depending on preference).
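One way to prepare data for such a plot is to sum the sequenced bases into logarithmically spaced length bins before plotting. This is a minimal sketch, not the author's script; the function name and the default of ten bins per decade are arbitrary choices.

```python
import math
from collections import defaultdict


def log_binned_bases(lengths, bins_per_decade=10):
    """Sum sequenced bases into logarithmically spaced read-length bins."""
    bases = defaultdict(int)
    for length in lengths:
        if length <= 0:
            continue  # log10 is undefined for empty reads
        index = math.floor(math.log10(length) * bins_per_decade)
        bases[index] += length
    # Report each bin by its lower edge, rounded for readability.
    return {
        round(10 ** (i / bins_per_decade), 1): n
        for i, n in sorted(bases.items())
    }
```

Replacing `bases[index] += length` with `bases[index] += 1` gives read counts per bin instead of sequenced bases.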

I have written my own scripts for doing this: one for generating the read lengths, and another for plotting the length distribution in various ways. The script that generates read lengths also spits out basic length summary statistics to standard error:

Here are a couple of the produced graphs:

The scripts to generate these can be found here:

Using Biopython and matplotlib would seem like the way to go, indeed. It really just boils down to three lines of code to get that graph:

Of course you might want to make a longer script that's callable from the command line, with a couple options. You are welcome to use mine:

There are several potential approaches. For example:

  • an approach described in the Biopython tutorial
  • a tool from the Ruby-based biopieces framework
  • various solutions to get sequence length, including bioawk and EMBOSS infoseq

As to which of these are "quick and efficient" on a 10 GB file, it's hard to say in advance. You may have to try and benchmark a few of them.

bioawk could be reasonably efficient for this kind of task.

The -c fastx tells the program to parse the data as fastq or fasta. This gives access to the different parts of the records as $name , $seq (and $qual in the case of fastq format) in the awk code (bioawk is based on awk, so you can use whatever language features you want from awk).

Between the single quotes comes a series of <condition> {<action>} blocks.

The first one has no <condition> part, which means it is executed for each record. Here, it updates the length counts in a table which I named "histo". length is a predefined function in awk.

In the second block, the END condition means we want it to be executed after all the input has been processed. The action part consists of looping over the recorded length values and printing them together with the associated count.

The output is piped to sort -n in order to sort the results numerically.

On my workstation, the above code took 20 seconds to execute for a 1.2G fasta file.
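For comparison, the same tally can be sketched in plain Python: count each record's length in a table, then print the table sorted by length. This version only handles FASTA input (not FASTQ), and the function name is made up for the example.

```python
from collections import Counter


def length_histogram(lines):
    """Count how many FASTA records have each sequence length."""
    histo = Counter()
    current = None  # running length of the record being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                histo[current] += 1
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        histo[current] += 1  # close the last record
    return sorted(histo.items())
```

Each returned pair is (length, number of records), already sorted numerically, so no separate sort step is needed.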

In molecular biology we often work with sequences

  • DNA sequences use 4 letters to represent the nucleotides in one of the two strands
  • Protein sequences use 20 letters to represent the amino-acids, from amino to carboxyl terminal
  • Other sequences are sometimes used:
    • RNA,
    • DNA with ambiguous nucleotides,
    • amino-acid sequences with stop codons

    View, edit, and convert chromatograms. Trim low quality ends automatically.

    DNA Chromatogram Explorer Lite is a Windows Explorer clone dedicated to DNA sequence analysis and manipulation. You can view the chromatograms while browsing through folders using its integrated file explorer. With a single click you can trim the low quality bases at the end of your samples.

    The Lite version of Chromatogram Explorer is freeware.

    Start DNA Chromatogram Explorer and navigate to your DNA sample files (chromatograms).

    All chromatograms in that folder will be displayed in the right panel (see picture below). SCF and ABI (ABI, AB, AB1, AB!) chromatogram files are supported. Low quality ends are shown in dark gray color. To view non-chromatogram files (FASTA, SEQ, TXT) just double click them.

    Press 'Convert' or 'Convert all' and your file will be saved as SCF or FASTA (as you choose).

    With DNA Chromatogram Explorer you can automatically trim low quality ends of all chromatograms in a folder. Please see this short tutorial.

    DNA Chromatogram Explorer is delivered in a small package together with other free molecular biology tools.

    You don't need administrator rights in order to 'install' this package.

  • Download the package
  • Double click it to unpack it
  • Specify the destination folder (where to unpack it)
  • Go to the destination folder and double click the program you want to use
  • DNA Chromatogram Explorer installs zero files in your system. Therefore, you don't need to uninstall it. To uninstall the DNA Chromatogram Explorer, just delete it.

    This software tool is really small so you can easily copy it on a floppy disk or USB flash stick and take it with you or send it to your colleagues via email.

    DNA Chromatogram Explorer can run on any version of Windows from Windows 98 to Windows 7 and also on Mac via Parallels or Bootcamp. It does not install additional libraries, updates, DLL, Java or registry keys into your system.

    Display sample's content as you browse through your folders

    Highlight low quality regions

    Manually trim low quality end

    Automatically trim low quality ends (batch)

    Convert between miscellaneous formats

    View FASTA, SEQ, TXT samples

    View SCF, ABI, AB, AB!, AB1 samples

    View sample's properties & statistics

    Extract bases from chromatograms (copy to clipboard)

    Perform file operations (copy/delete/move samples)

    Show all chromatogram files in a folder

    Convert all samples in a folder

    Double click a file to open it

    Similar bioinformatics tools included in this package

    DNA Chromatogram Explorer is a Windows Explorer clone dedicated to DNA sequence analysis and manipulation. You can view the chromatograms while browsing through folders using its integrated file explorer. With a single click you can trim the low quality bases at the end of your samples.

    Everything to Fasta Converter converts the specified samples (SCF, ABI, FASTA, multiFasta, GBK, multiGBK, SEQ, TXT) to FASTA format. Starting with version 3.0 protein FASTA files are also supported.


    HaploGrep 2 is a web application that communicates through a REST API with the web server. Thus, all computation-intensive tasks are executed directly on the server. The haplogroup classification itself is based on pre-calculated phylogenetic weights that correspond to the occurrence per position in Phylotree and reflect the mutational stability of a variant. In the updated classification algorithm, the weights are now scaled from 1 to 10 in a non-linear way (see Supplementary Table S1). Thus, rare occurrences of variants in Phylotree no longer influence the classification toward those haplogroups as much as in the previous version. Once the data is imported, the haplogroup classification is started automatically. Optimizations within the code led to a 20-fold speed-up compared to HaploGrep 1. By storing only the 50 highest-ranked haplogroups per sample, the memory consumption could be reduced significantly.

    Furthermore, new dissimilarity metrics for the mtDNA haplogroup classification were introduced. In addition to the already implemented Kulczynski distance (1), the Jaccard index, the Hamming distance and the Kimura 2-parameter distance were included (24) (see Supplementary Tables S2 and S3 for performance comparison). Further major improvements included a check for artificial recombination (25) and a check for systematic artefacts and for rare or potential phantom mutations (26). For detecting artificial recombination, we apply two different strategies: the first strategy, proposed by Kong et al. (27), counts the remaining variants that were not assigned to the resulting best haplogroup, and tests whether these variants could be assigned to another haplogroup. For this step, mutational hotspots are excluded (e.g. 315.1C or 16519). The second recombination strategy assumes prior knowledge about the specific placement of the fragments of the polymerase chain reaction products (amplicons). With this information in hand, a check comparing the profiles relative to the fragment ranges can be executed. The user-defined fragments are generated, and the profiles split accordingly. If the distance of both haplogroup fragments exceeds five phylogenetic nodes, the sample is listed as potentially contaminated.
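To give a feel for two of the metrics named above, here is a minimal unweighted sketch that compares the set of variants found in a sample with the set expected for a haplogroup. HaploGrep itself uses phylogenetic weights, so this is a simplification for illustration and not its implementation; the function names and the example variant sets are made up.

```python
def jaccard_index(sample, expected):
    """Shared variants divided by all variants seen in either set."""
    sample, expected = set(sample), set(expected)
    union = sample | expected
    return len(sample & expected) / len(union) if union else 1.0


def kulczynski_measure(sample, expected):
    """Mean of the shared fraction of each set (second Kulczynski measure)."""
    sample, expected = set(sample), set(expected)
    if not sample or not expected:
        return 0.0
    shared = len(sample & expected)
    return 0.5 * (shared / len(sample) + shared / len(expected))


# Illustrative variant sets, not taken from Phylotree
sample_variants = {"73G", "263G", "16519C"}
haplogroup_variants = {"73G", "263G"}
print(jaccard_index(sample_variants, haplogroup_variants))
print(kulczynski_measure(sample_variants, haplogroup_variants))
```

Both functions return a similarity between 0 and 1; a dissimilarity can be obtained as one minus the value.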

    Some Genomatix tools, e.g. Gene2Promoter or GPD allow the extraction of sequences. Genomatix uses the following syntax to annotate sequence information: each information item is denoted by a keyword, followed by a "=" and the value. These information items are separated by a pipe symbol "|".
    The keywords are the following:

    loc The Genomatix Locus Id, consisting of the string "GXL_" followed by a number.
    sym The gene symbol. This can be a (comma-separated) list.
    geneid The NCBI Gene Id. This can be a (comma-separated) list.
    acc A unique identifier for the sequence. E.g. for Genomatix promoter regions, the Genomatix Promoter Id is listed in this field.
    taxid The organism's Taxon Id
    spec The organism name
    chr The chromosome within the organism.
    ctg The NCBI contig within the chromosome.
    str Strand, (+) for sense, (-) for antisense strand.
    start Start position of the sequence (relative to the contig).
    end End position of the sequence (relative to the contig).
    len Length of the sequence in base pairs.
    tss A (comma-separated list of) UTR-start/TSS position(s). If there are several TSS/UTR-starts, this means that several transcripts share the same promoter (e.g. when they are splice variants). The positions are relative to the promoter region.
    probe A (comma-separated list of) Affymetrix Probe Id(s).
    unigene A (comma-separated list of) UniGene Cluster Id(s).
    homgroup An identifier (a number) for the homology group (available for promoter sequences only). Orthologously related sequences have the same value in this field.
    promset If the sequence is a promoter region, the promoter set is denoted here.
    eldorado The ElDorado version from which the sequence has been extracted.
    descr The gene description. If several genes (i.e. NCBI gene ids) are associated with the sequence, the descriptions for all of the genes are listed, separated by ""
    comm A comment field, used for additional annotation. For promoter sequences, this field contains information about the transcripts associated with the promoter. For each transcript the Genomatix Transcript Id, accession number, TSS position and quality is listed, separated by "/". For Genomatix CompGen promoters no transcripts are assigned, in this case the string "CompGen promoter" is denoted.

    This syntax is currently used only for sequences in the FASTA and GenBank formats.
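A header written in this syntax can be split back into its items with a few lines of Python. The sketch below assumes that values never contain a pipe symbol themselves, which the description above does not guarantee, and the example header in the usage note is invented.

```python
def parse_annotation(line):
    """Split 'key=value|key=value' items into a dict.

    Keys documented as comma-separated lists are returned as Python lists.
    """
    list_keys = {"sym", "geneid", "tss", "probe", "unigene"}
    fields = {}
    for item in line.lstrip(">").strip().split("|"):
        if "=" not in item:
            continue  # skip anything that is not a key=value item
        key, value = item.split("=", 1)
        key, value = key.strip(), value.strip()
        if key in list_keys:
            fields[key] = [v.strip() for v in value.split(",")]
        else:
            fields[key] = value
    return fields
```

For instance, `parse_annotation(">loc=GXL_123|sym=TP53,P53|len=601")` gives a dict with "loc" and "len" as strings and "sym" as a two-element list. All values stay strings, so numeric fields such as len need an explicit int() conversion.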

    Example (a promoter sequence in GenBank format):


    Similarity calculation is one of the key steps of DNA sequence analysis in computational biology and bioinformatics. Nearly all research that explores evolutionary relationships, gene function analysis, protein structure prediction or sequence retrieval requires similarity calculations. One major task in alignment-free DNA sequence similarity calculations is to develop novel mathematical descriptors for DNA sequences. In this paper, we present a novel approach to DNA sequence similarity analysis using similarity calculations on texture images. Texture analysis methods, which are a subset of digital image processing methods, are used here with the assumption that these calculations can be adapted to alignment-free DNA sequence similarity analysis. Gray-level textures were created from the values assigned to the nucleotides in the DNA sequences. Similarity calculations were made between these textures using histogram-based texture analyses based on first-order statistics. We obtained texture features for 3 different DNA data sets of different lengths, and calculated the similarity matrices. The phylogenetic trees produced by our method are similar to those of the MEGA software, which is based on sequence alignment. Our findings show that texture analysis metrics can be used to characterize DNA sequences.

    RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). On average, almost 50% of a human genomic DNA sequence currently will be masked by the program. Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green.

    Input format:

    Sequences can be pasted in or uploaded as files, both in fasta format. Multiple fasta format sequences may be pasted in at once or may be contained within a file. Fasta format looks like this:

    The submission form contains a text field for the full pathname of the file containing the sequence data on the local system (i.e. where the Netscape browser is running). By pressing the "Browse..." button, you can use a file selection box to select the file without having to type the path. When running the browser on a Macintosh the browse button works but the file name cannot be typed in. On both the PC and Mac the sequence file needs to be saved as 'text only'.

    Size limitations

    Output / return format

    The program returns three or four output files for each query. One contains the submitted sequence(s) in which all recognized interspersed or simple repeats have been masked. In the masked areas, each base is replaced with an N, so that the returned sequence is of the same length as the original. A table annotating the masked sequences as well as a table summarizing the repeat content of the query sequence will be returned to your screen. Optionally a file with alignments of the query with the matching repeats will be returned as well.

    In the "html" return format (default when the browser runs on a Mac or PC) all output is returned to your screen in one file. In the "tar file" return format the masked sequence(s) and alignments can be saved as compressed files. The "links" return format returns links to these output files in a text format (they look bad on the browser, but are fine when saved to your computer).


    Show alignments
    When checked, alignments are returned in a file (ending in .aln) or to the screen. Alignments are shown in order of appearance in the query sequence.

    Do not mask simple... / Only mask simple...
    Regions of low complexity, like simple tandem repeats, polypurine and AT-rich regions can lead to spurious matches in database searches. By default they are masked along with the interspersed repeats.
    With the option "Do not mask simple..." only interspersed repeats are masked. This may, for example, be preferred when the masked sequence will be fed to a gene prediction program.
    Alternatively, with the option "Only mask simple...", one can mask only these low complexity regions, e.g. when you are only interested in quickly locating polymorphic simple repeats in a sequence.

    Only mask Alus
    By checking this option, you limit the masking and annotation to (primate) Alu repeats. 7SL RNA (the ancestral sequence of Alus), SVA (which contains several Alu sequences and a fragment of LTR5) and LTR5 are masked as well. This option only works for primate DNA.

    Mask with Xs.
    When checked, the repeat sequences are replaced by Xs instead of Ns. This allows one to distinguish the masked areas from possibly existing ambiguous bases or other stretches of Ns in the original sequence. However, when running BLAST searches (and maybe other programs) Xs are deleted out of the query and the returned BLAST matches will have position numbers not necessarily corresponding to that of the original sequence.

    Fixed-width columns
    Since April 1999 the column widths in the annotation table are adjusted to the maximum length of any string occurring in a column; this allows long sequence names to be spelled out completely. Previously a fixed column width table was returned, which can still be obtained by checking this option button.

    Other options

    You can type in less frequently used options in UNIX command line style, like: which will cause the program to only annotate and mask repeats less than 20% diverged, return the alignments in the orientation of the repeat consensus sequences, and use matrices optimal for a 45% GC background nucleotide distribution.

    With the option -div you can limit the masking and annotation to a subset of less diverged (younger) repeats by choosing a maximum divergence level of the repeat copy to its consensus sequence. This option may be used to limit the masking to those repeats that are either specific to primates or another mammalian order for use in subsequent comparison of orthologous mammalian loci. On average, interspersed repeats have diverged 18% in human (35% in mouse) from their consensus since the mammalian orders separated, so typing '-div 18' in the advanced options box limits masking to most primate-specific repeats. Note that this method is rather crude, mostly since the range of deterioration of repeats of the same age is wide; many shared repeats may go unmasked and vice versa.

    Neutral mutation patterns differ significantly depending on the GC richness of a locus and we have calculated optimal scoring matrices for the alignment to consensus sequences in a range of background GC levels. Usually, RepeatMasker calculates the percentage of the sequence consisting of Gs and Cs and uses the appropriate matrices. However, the program defaults to using 'average' 43% GC matrices when the query is shorter than 2000 bp or a batch file is analyzed. Short sequences are less likely to share the GC level of the locus. For example, CpG islands and exons are more GC rich than the surrounding DNA, whereas a LINE1 element usually is more AT rich than the background. In a batch file, RepeatMasker analyses all sequences together with the same matrices. The percentage GC in all the sequences combined may be inappropriate for some sequence entries; using high GC level matrices in AT rich sequences (and vice versa) may result in false masking.
    One can override this behavior in two ways:
    With the option -gc you can set the GC level to a certain percentage e.g. '-gc 37' lets the program use matrices appropriate for 37% GC background. This could be useful, for example, when you have a batch file of ESTs from a single locus with a known GC level.
    Alternatively, the -gccalc option forces RepeatMasker to use the actual GC level of a short sequence or the average GC level of a batch of sequences. The latter sequences, for example, may be contigs or reads in a sequencing project.
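The decision described above can be summarised in a short sketch. It is not RepeatMasker's code: the function names are invented, and how the program treats ambiguous bases when counting is an assumption (here they are simply ignored).

```python
def gc_percent(sequences):
    """Percentage of G and C among the unambiguous bases of the sequences."""
    gc = at = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        at += seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


def background_gc(sequences, gc=None, gccalc=False):
    """Pick the GC level used to choose scoring matrices.

    sequences: list of sequence strings (more than one means a batch).
    gc:        explicit level, as with '-gc 37'.
    gccalc:    force use of the measured level, as with '-gccalc'.
    """
    if gc is not None:
        return float(gc)
    total_length = sum(len(s) for s in sequences)
    if not gccalc and (len(sequences) > 1 or total_length < 2000):
        return 43.0  # the 'average' default for short or batch queries
    return gc_percent(sequences)
```

With these definitions a single short query falls back to 43.0 unless gccalc is set, while a batch with gccalc uses the average GC level over all its sequences.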

    RepeatMasker transparently splits large sequences into fragments of 60 kb with 2 kb overlaps. The -frag option allows one to change the size of these fragments. Fragmentation was implemented to allow the size of sequences and sequence batches to be unlimited. It also can improve repeat detection when a genomic sequence contains regions of DNA with significantly different GC levels (isochores); sets of scoring matrices are chosen based on the GC level of a fragment. The only visible effect of the fragmentation is in the alignment files, where alignments at the edges of the fragments can be duplicated and/or truncated.
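A windowing scheme of this kind can be sketched as follows. The exact way RepeatMasker places fragment boundaries is not described here, so the function below is only an illustration of 60 kb windows that each share 2 kb with the previous one; the name is made up.

```python
def fragment_ranges(length, size=60000, overlap=2000):
    """Yield half-open (start, end) ranges covering a sequence of `length` bases."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    start = 0
    while True:
        end = min(start + size, length)
        yield start, end
        if end >= length:
            break
        start = end - overlap  # step back so neighbours share `overlap` bases
```

A 130 kb sequence, for example, is covered by three ranges: 0 to 60000, 58000 to 118000 and 116000 to 130000.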

    Alignments are shown in the orientation of the query sequence. The option -inv will return alignments in the orientation of the repeats.

    In the process of finding all repeats, RepeatMasker temporarily cuts out most full-length elements, young LINE1 3' ends, and close-to-perfect simple repeats (in both human and rodent settings) to unearth any underlying older repeat into which these elements have inserted or expanded. The option -nocut skips this deletion step. RepeatMasker is generally more sensitive when the deletion step is included.

    When the option -xsmall is used a sequence is returned in the .masked file in which repeat regions are in lower case and non-repetitive regions are in capitals.

    The option -small causes the whole masked sequence to be returned in lower case, with repeats replaced by 'n's (or 'x's if combined with -x).
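The difference between replacing repeats with a masking character and marking them by case can be illustrated with a small sketch. It works on already known repeat intervals and is not part of RepeatMasker; the function name and the half-open (start, end) interval convention are choices made for the example.

```python
def mask(seq, repeats, style="N"):
    """Mask repeat intervals in a sequence.

    repeats: iterable of half-open (start, end) positions.
    style:   "N" or "X" replaces repeat bases with that character;
             "xsmall" puts repeats in lower case and the rest in capitals.
    """
    soft = style == "xsmall"
    chars = list(seq.upper()) if soft else list(seq)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else style
    return "".join(chars)


print(mask("ACGTACGT", [(2, 5)]))            # ACNNNCGT
print(mask("ACGTACGT", [(2, 5)], "X"))       # ACXXXCGT
print(mask("acgtACGT", [(2, 5)], "xsmall"))  # ACgtaCGT
```

In every style the output has the same length as the input, so positions in the masked sequence still correspond to positions in the original.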

    DNA source

    Interspersed repeats are specific to a (group of) species, dependent on the time of activity of the source transposable element. About half of the repeats identified in human DNA are specific to primates, i.e. they amplified after the eutherian radiation some 100 million years ago. Most repeats that can be identified in mouse DNA are specific to rodents, due to higher activity and faster mutation rates in the rodent lineage. RepeatMasker has separate protocols optimized for analysis of rodent and primate genomes. Interspersed repeats in other mammals have not been so well catalogued as yet. Among these, artiodactyl queries are treated best by RepeatMasker, but repeats specific to other orders are also present.

    The numbers of different repeat consensus sequences against which queries of different species are compared give an impression of how far the different libraries are developed: Note that the majority of sequences against which rodent and especially other mammalian queries are compared are repeats identified in the human genome and thought to predate the mammalian radiation.

    Whereas the mammalian libraries represent heavily manipulated and expanded versions of Repbase libraries, the non-mammalian libraries were extracted with very limited curation. The vertebrate (chicken, Xenopus, etc.) and grasses (maize, rice) libraries are especially rudimentary. No summary tables are returned for these two.

    Speed and sensitivity

    On average, with default settings, a 10 kb human cosmid will be analyzed in about 30-40 seconds if no one else is using the server at the time.
    For longer sequences the required time increases pretty much linearly with the sequence length. Sequences shorter than 10 kb are analyzed disproportionately faster. This is partially due to the program, e.g. a batch file of 200 human sequences of 400 bp (total 80 kb) is analyzed within 2 minutes, but we also have implemented a queuing system for sequences longer than 10 kb, making the request of lower priority the longer the query sequence. The speed is further somewhat dependent on the repeat content of the sequence; repeat-dense regions, especially Alu-rich regions, are analyzed faster.

    The program can be run at three levels of speed or sensitivity. The only difference between these settings is the minimum match or word length in the initial (not quite) hashing step of the cross_match program (see the cross_match/phrap documentation). The "slow" setting will take about 3 times longer and will find and mask 0-5% more repetitive DNA sequences than the default setting. The "quick" settings miss 5-10% of the sequences masked by default, but will be 3 to 6 times faster. The alignments may extend more or be somewhat more accurate in the more sensitive settings as well.

    At the sensitive settings RepeatMasker currently finds, on average, 47% of human genomic DNA to be derived from interspersed repeats. RepeatMasker is very sensitive in comparison with other programs, although comparison to some is skewed because of the use of much smaller databases.

    Selectivity and matches to coding sequences

    The cutoff Smith-Waterman scores for masking interspersed repeats are conservative, since masking of one short potentially interesting region generally is more harmful than not masking a number of hard to find matches. If there are any false matches, they tend to have scores close to the cutoff, which is 225 for most repeats, 300 for the low-complexity LINE1 search, and 180 for the very old MIR, LINE2 and MER5 sequences.
    We tested for the occurrence of false matches in randomized and in inverted (but not complemented) DNA. To check a variety of conditions, four 150 to 400 kb DNA fragments were analyzed ranging in GC level from 36% to 54%. To retain seeds for Smith Waterman alignments, randomization was done at the 10 bp word level. Note that the inverted sequences retain the low complexity and simple repeat patterns of the original sequences. Even at sensitive settings, for which false matches are most likely, this version of RepeatMasker reported no (false) matches at all to interspersed repeats in the randomized or inverted sequences. No simple repeats were reported in the randomized queries.
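The two control sequences used in this test can be generated along the following lines. The sketch assumes that "randomization at the 10 bp word level" means shuffling non-overlapping 10 bp blocks, which is one plausible reading; the function names are invented.

```python
import random


def shuffle_words(seq, word=10, seed=None):
    """Shuffle a sequence in non-overlapping blocks of `word` bases."""
    words = [seq[i:i + word] for i in range(0, len(seq), word)]
    random.Random(seed).shuffle(words)
    return "".join(words)


def invert(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]
```

Shuffling whole words keeps every 10-mer that starts on a block boundary intact, while inversion keeps simple-repeat patterns such as poly-A runs unchanged.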

    RepeatMasker returned only a single probably false match (71 bp) when analyzing a batch of 4440 coding regions in human mRNAs (7,200,000 bp) at sensitive settings. The coding regions were collected from GenBank, based on annotations, filtered for the presence of complete ORFs and initiator methionines, and made more or less non-redundant. When each coding region was analyzed individually using the -gccalc option, 5 matches (414 bp, 0.006%) were falsely masked (156 bp at default speed, 76 bp at quick settings). In this analysis each sequence was analyzed with matrices chosen based on the actual GC level, even for very short sequences, while in the batch analysis of the coding regions the 'average' 43% GC matrices were used.

    RepeatMasker is most commonly used to avoid spurious matches in database searches. Generally this step is strongly recommended before doing BLASTN or BLASTX equivalent searches with mammalian DNA sequence.

    The most common concern is of course if RepeatMasker ever masks coding regions.
    We found that false matches in coding regions are extremely rare, but did identify 38 genuine fragments of interspersed repeats (4214 bp) in the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb) analyzed (excluding annotated coding sequences of LINE1 elements and endogenous retroviruses). We verified matches with lower scores by comparing the translation products to close homologous or redundant entries in the database (the repeat matching regions always were exactly missing). In the majority of these cases, the sequences appear to be improperly annotated or to represent either artificially or naturally defective mRNAs (e.g. alternatively spliced exons comprised of a small fragment of a repeat). Genuine overlaps of interspersed repeats with coding sequences usually involve terminal regions of the ORFs. Since the transposable element derived region is unique to the protein in that (group of) species, the masking does not interfere with database searches.

    However, some cautionary comments are necessary. First, a few active cellular genes are derived from transposable elements. For example, I have identified 7 examples of human genes derived from (DNA transposon) transposases. These genes will be partially masked by a (related) DNA transposon in the repeat database. EST and cDNA matches beyond the masked region should alert you.

    Also be aware that RepeatMasker screens for small RNA pseudogenes and will therefore mask the active small RNA genes as well (I think the tRNA list is complete; I stopped adding snRNAs unless I found an indication that they have created many pseudogenes). The number of matches to small RNAs is listed in the overview table; (close to) exact matches are possibly active genes, although related active genes not in the database may show diverged matches.

    A final caution relates to the fact that 3' UTRs of transcripts are about as dense in interspersed repeats as intergenic regions are. Thus, many ESTs are completely masked as repetitive DNA. I recommend that, when you compare a genomic sequence against the EST database or use ESTs as a query in nucleotide searches, you search with the unmasked sequence as well; use a long minimum match (word length/word size) like 40 bp to identify exact matches and avoid most background. Unfortunately the maximum word length that can be used in the NCBI BLASTN program is 18 (apparently due to memory limitations).

    Use in association with gene prediction programs

    Predicting genes from a masked sequence faces several problems. First, one should not mask low complexity regions, e.g. to avoid masking trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As mentioned above, sometimes tail ends of coding regions may have originated from transposable elements. Even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that is part of the acceptor splice site may be contained within a repeat.

    Thus, I generally recommend running a gene prediction program on unmasked DNA (as well) and comparing the predicted genes and exons with the RepeatMasker output. Some gene prediction programs allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which cases matches to repeats are weighted in along with the other parameters used.

    Other uses

    Many people mask repeats before designing primers or oligo probes from sequence data. I've been told often that primers/probes designed from regions unmasked by RepeatMasker have a much better success rate. A cautionary note here is that unmasked regions are not necessarily unique in the genome (e.g. many lower copy repeats are not in the database yet) and experiments should be performed as if no filtering against repeats has been done.
    The alignments can help in designing primers from sequences that are completely masked. Regions that diverge much from the consensus are less likely to misbehave than others.

    RepeatMasker is sometimes used during assembly of large genomic sequences. This procedure probably is most useful in very Alu-rich regions; in that situation I recommend masking only the Alus, and maybe limiting the masking to those Alus less than 15% diverged (-div 15).

    How to read the results

    The annotation file contains the cross_match output lines. It lists all best matches (above a set minimum score) between the query sequence and any of the sequences in the repeat database or with low complexity DNA. The term "best matches" reflects that a match is not shown if its domain is over 80% contained within the domain of a higher scoring match, where the "domain" of a match is the region in the query sequence that is defined by the alignment start and stop. These domains have been masked in the returned masked sequence file. In the output, matches are ordered by query name, and for each query by position of the start of the alignment.
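The "best matches" rule can be written down as a small filter. This sketch follows the wording above literally (a match is dropped when more than 80% of its domain lies inside the domain of any single higher-scoring match) and is not taken from the RepeatMasker source; matches are represented as made-up (score, start, end) tuples with half-open query positions.

```python
def best_matches(matches, max_contained=0.8):
    """Drop matches that are over 80% contained in a higher-scoring match."""
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    shown = []
    for i, (score, start, end) in enumerate(ranked):
        length = end - start
        # Compare against every match that scores higher than this one.
        hidden = any(
            min(end, h_end) - max(start, h_start) > max_contained * length
            for _, h_start, h_end in ranked[:i]
        )
        if not hidden:
            shown.append((score, start, end))
    # Report in order of alignment start, as in the annotation table.
    return sorted(shown, key=lambda m: m[1])
```

Given the matches (500, 0, 100), (300, 10, 60) and (250, 90, 200), the second is dropped because it lies entirely inside the first, while the third is kept because only 10 of its 110 bases overlap.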

    This is a sequence in which a Tigger1 DNA transposon has integrated into a MER7 DNA transposon copy. Subsequently two Alus integrated in the Tigger1 sequence. The simple repeat is derived from the poly A of the Alu element. The first line is interpreted like this:

    An asterisk (*) in the final column (no example shown) indicates that there is a higher-scoring match whose domain partly (<80%) includes the domain of this match.

    Note that the SW score and divergence numbers for the three Tigger1 lines are identical. This is because the information is derived from a single alignment (the Alus were deleted from the query before the alignment with the Tigger element was performed). The program makes educated guesses about many fragments if they are derived from the same element (e.g. it knows that the MER7A fragments represent one insert). In a next version I can identify each element with a unique ID, if interest exists (this could help to represent repeats cleaner in graphic displays).


    Alignments are shown in order of appearance in the query sequence. These alignments may be most generally useful for designing PCR primers in a region full of repeats. It is possible to get primers that work in a whole genome when the 3' end of a primer lies in a region of a repeat (even a common one) that is very different from the consensus. Alignments are shown in the orientation of the query sequence unless the option -inv is entered in the option box.

    Here is an example of an alignment of a MIR spanning an Alu element deleted in an earlier step:

    In cross_match alignments the mismatches are indicated, where "-" indicates an insertion/deletion, "i" a transition (G<->A, C<->T) and "v" a transversion (all other substitutions). The position of the deleted Alu in the query is indicated with an "X".
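As a rough illustration of the mismatch symbols described above, the Python sketch below labels each column of a pairwise alignment with "-", "i" or "v". It is not cross_match code: the function names are hypothetical, and the "?" returned for ambiguity codes is a placeholder chosen here, not documented cross_match behaviour.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mismatch_symbol(query_base: str, consensus_base: str) -> str:
    """Return '-', 'i', 'v', ' ' (match) or '?' for one alignment column."""
    q, c = query_base.upper(), consensus_base.upper()
    if q == "-" or c == "-":
        return "-"                      # insertion/deletion
    if q not in "ACGT" or c not in "ACGT":
        return "?"                      # placeholder for ambiguity codes (assumption)
    if q == c:
        return " "                      # identical bases
    pair = {q, c}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "i"                      # transition: G<->A or C<->T
    return "v"                          # transversion: all other substitutions


def annotate(query: str, consensus: str) -> str:
    """Build the mismatch line for two aligned, equal-length sequences."""
    if len(query) != len(consensus):
        raise ValueError("aligned sequences must have the same length")
    return "".join(mismatch_symbol(q, c) for q, c in zip(query, consensus))
```

For the invented pair `annotate("ACGT-A", "GCTTAA")`, the first column (A against G) is a transition, the third (G against T) a transversion, and the fifth a gap.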
    The lines in the annotation table describing this match appear as:

    Discrepancies between alignments and annotation

    Most discrepancies between alignments and annotation result from adjustments made to produce more legible annotation. This annotation also tends to be closer to the biological reality than the raw cross_match output. For example, adjustments often are necessary when a repeat is fragmented through deletions, insertions, or an inversion. Many subfamilies of repeats closely resemble each other, and when a repeat is fragmented these fragments can be assigned different subfamily names in the raw output. The program often can decide if fragments are derived from the same integrated transposable element and which subfamily name is appropriate (subsequently given to all fragments). This can result in discrepancies in the repeat name and matching positions in the consensus sequence (subfamily consensus sequences differ in length).

    Some other discrepancies are specific to LINE elements. These repeats do not appear as complete elements in the consensus database. This is mostly a result of the contrast in conservation over the length of the sequence during its evolution in the mammalian genome: the roughly 3 kb ORF2 region of LINE1 has been very conserved, whereas the untranslated regions and, to a lesser degree, ORF1 have evolved very fast. Thus the 3' end or 5' end of an ancient LINE1 does not even remotely resemble that of the currently active LINE1, whereas the coding region for reverse transcriptase is closely related. As a consequence, many subfamilies have been defined for both the 5' and 3' UTRs (25 and 50, resp.) of LINE1 elements in human DNA, whereas only three ORF2 entries are present in the database. It is not only hard to extend all subfamilies from the beginning to the end, but it also appears that different 5' ends may have been associated with the same 3' ends, and vice versa. On top of that, including 50 full-length (6.2-8 kb) LINE1 elements in the database would make the program very slow. LINE1 elements are therefore represented in the database in 3 (or more) pieces, and the program tries to put these pieces together as well as possible. As a result, both the names of the repeats and the position numbering in the consensus sequence are generally different in the alignments than in the output file. The LINE2 elements are likewise broken up in the databases, into 3' UTRs for different subfamilies and one ORF2 region.

    The 3' UTR of LINE1 subfamilies ranges from 500 bp to over 2000 bp (in L1MC/D3), and the length of the 5' UTR is even more variable, even between subfamilies that show strong similarity in the 3' UTR. To allow the LINE1 fragments to be put together, all position numbers in older LINE1 subfamilies are adjusted to the position of ORF2 (the conserved part of LINE1) in a complete L1PA2 element. Since some older elements have much longer 5' UTRs or ORF1-ORF2 linker regions than L1PA2, this sometimes results in the assignment of negative position numbers for the 5' end of LINEs.

    Finally, you may find large discrepancies in position numbering if an element includes tandem repeat units. For example, MER109 contains multiple repeat units of roughly 300 bp; this can lead to overlapping matches. In the output such matches are fused.

    The summary (.tbl) file

    The four main classes mentioned in this table are well defined (see my 1996 review in COGD) and form a good basis for a summary or visual presentation of the repeats in a locus. Among the subclasses, some uncertainty of classification remains; it is especially hard to predict whether an LTR is derived from an endogenous retrovirus or a non-autonomous LTR element. Also, not all subclasses are listed, and the total for a class is often higher than the sum of its subclasses. Note that the "MER" subclasses and the different MER interspersed repeats are not necessarily related to each other. The term MER (MEdium Reiterated repeats) was introduced for purely administrative purposes to give the beast a name. I named the MER1, MER2, and MER4 groups after the first member of each group that was identified as an interspersed repeat.

    The program tries very hard to find out which repeat fragments were derived from the same insertion event of a transposable element. The estimated number of events still tends to be an overestimate.

    The 'bases masked' number is calculated from the total number of Xs in the masked sequences (before these are changed to Ns or lower case letters). The other numbers are derived from the annotation (.out) file. Discrepancies between the 'bases masked' number and the sum of 'total interspersed repeats', small RNA, satellites and low complexity are generally very small. They are mostly accounted for by unmasked regions between flanking identical simple repeats, annotated as one stretch if fewer than 10 bases separate them, and fragments of repeats shorter than 10 bp which are not annotated but are masked.

    Low-complexity DNA and simple repeats

    Finding polymorphic simple repeats

    Although RepeatMasker does a good job of masking simple repeats to avoid spurious matches in database searches, it is not written to find and indicate all possibly polymorphic simple repeat sequences. Only di- to pentameric and some hexameric repeats are scanned for, and simple repeats shorter than 20 bp are ignored. Combining the "Only mask simple.." button option with a "div" option (e.g. -div 10) will produce a list of simple repeats that are 90% or more perfect. However, this list may not be complete; e.g. two perfect 40 bp long (CA)n repeats interrupted by 10 Ts are aligned in one piece and may be reported as having > 10% divergence from the consensus. Of course, most repeats with hexameric or longer units won't be reported either. A site dedicated to identifying polymorphic tandem repeats can be found at UTSW.
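The (CA)n example above can be checked with a few lines of Python. This is only an ungapped, position-by-position comparison against a perfect repeat, not the gapped cross_match alignment RepeatMasker actually performs, and the function name is hypothetical.

```python
def perfect_repeat_divergence(seq: str, unit: str) -> float:
    """Percent of positions differing from a perfect tandem repeat of `unit`.

    Ungapped comparison only; a real aligner may place gaps and
    report a different divergence.
    """
    if not seq or not unit:
        raise ValueError("sequence and repeat unit must be non-empty")
    copies = len(seq) // len(unit) + 1
    consensus = (unit * copies)[:len(seq)]
    mismatches = sum(a != b for a, b in zip(seq.upper(), consensus.upper()))
    return 100.0 * mismatches / len(seq)


# Two perfect 40 bp (CA)n stretches interrupted by 10 Ts, as in the text.
interrupted = "CA" * 20 + "T" * 10 + "CA" * 20
divergence = perfect_repeat_divergence(interrupted, "CA")
```

Here the 10 Ts are the only mismatches in a 90 bp stretch, so `divergence` is 10/90, about 11.1%, which is why such a repeat can fall outside a -div 10 list even though both flanking stretches are perfect.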

    Reference repeat databases

    The interspersed repeat databases screened by RepeatMasker are based on the repeat databases (Repbase Update) copyrighted by the Genetic Information Research Institute (G.I.R.I.). The Repbase Update database contains annotation of most repeats with respect to divergence level, affiliation, etc. The nomenclature of the interspersed repeats in the output of RepeatMasker is nearly identical to that of the reference database which in most cases corresponds to that in the literature.

    Scoring matrices

    We have calculated statistically optimal scoring matrices for the alignment of neutrally diverging (non-selected) sequences in human DNA to their original sequence. These matrices have been in use since the May 1998 release. The matrices were derived from alignments of DNA transposon fossils to their consensus sequences (Arian Smit, Arnie Kas & Phil Green, in preparation). A series of different matrices is used, depending on the divergence level (14-25%) of the repeats and the background GC level (35-53%; neutral mutation patterns differ significantly in different isochores).

    These matrices are (close to) optimal for human genomic sequences longer than 10 kb, for which length the GC level usually is representative of the isochore in which the sequence lives. However, the GC level of small fragments can diverge a lot from the surrounding (e.g. a fragment spanning a CpG island, a GC rich exon or an AT-rich LINE1 element) and RepeatMasker defaults to using matrices derived for a 43% GC background when a sequence is shorter than 2000 bp or when a batch file is submitted. When the appropriate background GC level is known, this can be entered with the -gc option.
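The default described above can be summarised in a short Python sketch. It mirrors only the rules stated in the text (43% for sequences shorter than 2000 bp or for batch files, a user-supplied value taking precedence); the function names and parameters are hypothetical, and the selection of an actual matrix from the resulting GC level is not modelled.

```python
from typing import Optional


def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous (A, C, G, T) bases."""
    s = seq.upper()
    acgt = sum(s.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * (s.count("G") + s.count("C")) / acgt


def background_gc(seq: str,
                  is_batch: bool = False,
                  user_gc: Optional[float] = None,
                  default: float = 43.0,
                  min_length: int = 2000) -> float:
    """GC level used to pick matrices, following the rules in the text."""
    if user_gc is not None:
        return float(user_gc)           # value given by the user (the -gc option)
    if is_batch or len(seq) < min_length:
        return default                  # short sequences and batch files use 43%
    return gc_percent(seq)              # long single sequence: measure it
```

A 500 bp CpG-island fragment would therefore be scored against the 43% background unless the caller passes its known isochore GC level as `user_gc`.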


    We haven't published a paper on RepeatMasker yet, unless you call this expanding help file a publication. We'd appreciate it if you could refer to the web site in your publications (A.F.A. Smit, R. Hubley & P. Green RepeatMasker at


    Smit, A.F.A. (1996) Origin of interspersed repeats in the human genome. Curr. Opin. Genet. Devel. 6 (6), 743-749.
    Smit, A.F.A. (1996) Structure and evolution of mammalian interspersed repeats. PhD dissertation, USC. (lots of otherwise unpublished information here, available under order number 9636751 at the UMI web site)

    Schmid, C. W. (1996). Alu: structure, origin, evolution, significance, and function of one-tenth of human DNA. Prog Nucleic Acids Res Mol Biol 53, 283-319.
    Jurka, J. (1996) Origin and evolution of Alu repetitive elements. In "The impact of short interspersed elements (SINEs) on the host genome", Maraia, R.J., editor. Springer Verlag.
    Batzer, M. A., Deininger, P. L., Hellmann Blumberg, U., Jurka, J., Labuda, D., Rubin, C. M., Schmid, C. W., Zietkiewicz, E., and Zuckerkandl, E. (1996). Standardized nomenclature for Alu repeats. J Mol Evol 42, 3-6.

    Smit, A. F. A., and Riggs, A. D. (1995). MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res 23, 98-102.

    Smit, A. F. A., Toth, G., Riggs, A. D., and Jurka, J. (1995). Ancestral mammalian-wide subfamilies of LINE-1 repetitive sequences. J Mol Biol 246, 401-417.

    Smit, A. F. A. (1993). Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic Acids Res 21, 1863-72.

    Wilkinson, D. A., Mager, D. L., and Leong, J. C. (1994). Endogenous Human Retroviruses. In The Retroviridae, J. A. Levy, ed. (New York: Plenum Press), pp. 465-535.

    DNA/all types
    Smit, A. F. A., and Riggs, A. D. (1996). Tiggers and other DNA transposon fossils in the human genome. Proc Natl Acad Sci USA 93, 1443-8.

    Improvements and new features

    June 1997

    The database of human/mammalian-wide repeats was expanded 2.5 fold. Among the new additions are the (long) internal sequences of endogenous retroviruses.

    Databases of repeats from other species than primates, rodents or artiodactyls can now be screened, although the program is not optimized to do so and the quality of the databases is not at the same level.

    Through optimization of the cross_match searches, the program is more sensitive and selective, especially with regard to detection of low complexity sequences and old LINE1 elements.

    The RepeatMasker output is now processed by a second script to create annotation ready for database submission. Some of the more obvious improvements in the output are (i) overlapping matches are generally resolved, (ii) LINE1 fragments are annotated with position numbers as in a full L1 element, and (iii) when an Alu or LINE1 is fragmented information from both or all fragments is used to assign a subfamily name.

    Alignments are shown without interruption by other cross_match output and in the order of appearance in the query sequence.

    A summary table has been added which shows, among other things, the repeat composition of the query sequence.

    September 1997

    - major expansion of the rodent libraries and significant update of the human libraries as well, especially in LINE1 elements.
    - scripts modified to accommodate new entries in databases
    - simple repeats masking optimized by including pentamers and using a more stringent matrix
    - several bugs fixed (e.g. sequences without repeats are now counted)
    - table now displays the parameters used

    June 1998

    - the program is more robust and accepts most 'almost but not quite fasta' format files
    - large sequences are analyzed in fragments of 100 kb to reduce the memory requirements of the program. Similarly files with very many sequence entries are divided up. You shouldn't notice any of this in the output files.
    - matrices are used that are optimal for the divergence level of the repeats to which the query is compared and the background nucleotide composition.
    - another big update of the human repeat databases.
    - the small RNA sequences have been corrected and expanded (all tRNAs should be there now)
    - the summary table now lists the amount of small RNA (pseudo)genes, simple repeats and low complexity DNA identified
    - close to perfect simple repeats, full-length shorter interspersed repeats and young LINE1 3' ends are temporarily excised from the sequence (in both human and rodent analysis) to allow better detection of any underlying repeats.
    - the "Skip simple, low complexity region masking" really skips all simple repeats now
    - alignments are shown in the orientation of the query sequence
    - among many bugs fixed is one involving sequence names including a number between parentheses

    December 1998

    This version uses the 1998 cross_match release. The difference for RepeatMasker is mainly in the complexity adjusted length of the matches that function as kernels for Smith Waterman alignments and the matrix dependent adjustment of the score for complexity of the alignment.

    The full description ('>') lines are now retained in the masked file.

    The .out file table is returned with flexible length columns allowing the full length of long query sequence names to be displayed. Optionally, the old fixed width table can still be obtained.

    Simple repeat and satellite masking has been improved again. Their annotation has changed a bit; most notably, they are now all listed in the orientation of the query sequence.

    Several new options are available:
    - A mRNA/EST option prevents false masking due to inappropriate matrix choice and low complexity matches to LINE1 elements in short GC rich regions like coding regions.
    - You can limit the masking to Alus when masking primate DNA
    - You can limit the masking to younger repeats by setting a maximum allowed divergence to the consensus sequence
    - The sequences identified as repeats can be returned in lower case (rest in capitals) rather than masked out by Ns or Xs.
    - You can set the background GC level (determining which matrices are used) overriding the program's calculations.

    Among bugs fixed since May 1998 are those responsible for distorted output for sequences with names ending in .seq and for sequences without a header line. Also, sequence files from PCs and Mac with hidden carriage returns are handled appropriately.

    April 1999

    All the command line options are now available on the web site.

    The default return format of the annotation file is changed, hopefully in a way that does not interfere with any type of parsing: the width of the columns is now adjusted to the longest entry in that column, allowing query names to be spelled out in full, and usually leading to narrower tables.

    Arabidopsis, Drosophila, and grass repeat libraries were added; other repeat libraries were updated.

    Three measures were taken to eliminate the (few) false positives:
    - Use of the actual average GC level of sequences in a batch file may sometimes lead to false masking (or failure to mask) in sequences that diverge largely from the average. Thus, by default, all batch files are now analyzed with the innocuous 43% matrices.
    - one entry, responsible for 90% of false masking in GC rich regions, is deleted from the 'tough L1' library.
    - the matrix used for identification of the most diverged sequences in very GC rich regions, based on too little data and too much extrapolation, was 'too easy' on the mismatches and has been adjusted.
    Thanks to these measures the 'mrna' option is not necessary and has been removed.

    A bug is fixed that led to (wildly) improper annotation for some sequences fully consisting of repeats (all bases masked). A series of lesser bugs were taken care of. New bugs were introduced, probably.

    For further information and to obtain a local copy go to the RepeatMasker Download Page.

    Institute for Systems Biology
    This server is made possible by funding from the National Human Genome Research Institute (NHGRI grant # RO1 HG002939).


    To use TopHat, you will need the following programs in your PATH:

    • bowtie2 and bowtie2-align (or bowtie)
    • bowtie2-inspect (or bowtie-inspect)
    • bowtie2-build (or bowtie-build)
    • samtools

    Because TopHat outputs and handles alignments in BAM format, you will need to download and install the SAM tools. You may want to take a look at the Getting started guide for more detailed installation instructions, including installation of SAM tools and Boost.

    You will also need Python version 2.6 or higher.

    For the latest version, navigate to:

    Exploratory phylodynamics of early EBOV epidemic in Sierra Leone

    In this practical, we will re-analyse whole-genome EBOV sequences collected over the course of the 2013-2015 Ebola virus epidemic in Western Africa. The data and analysis were first described here:

    Details of the original analysis of these data can be found here

    In the course of this practical you will learn how to

    • load and view EBOV sequence data
    • estimate a phylogeny using neighbour-joining and maximum likelihood algorithms
    • root a phylogeny using root-to-tip regression and estimate a molecular clock
    • estimate time-scale phylogenies
    • conduct non-parametric phylodynamic analyses and estimate the effective population size over the course of the epidemic
    • extract and analyze 'meta-data' associated with each sequence such as the time of sampling and country of origin
    • carry out ancestral state estimation to infer the likely location of lineages over the history of the epidemic.

    You will carry out this analysis on a random subset of the available sequences and your results will be unique. Make a note of the main results of your analysis:

    1. Estimate the reproduction number in Sierra Leone in mid-2014
    2. Estimate when the epidemic peaked
    3. Estimate when the epidemic originated in humans
    4. Estimate the country of origin of the epidemic

    For these analyses, we'll use the ape package for manipulating sequence and tree data, the phangorn package for estimating phylogenies and doing ancestral state estimation, the treedater package for estimating a molecular clock, and the skygrowth package for phylodynamic analysis.

    All of these packages are on CRAN and can be installed using install.packages(...), except for skygrowth, which must be installed from GitHub.

    If necessary, install the packages using

    Now we load the package as follows:

    Install and load skygrowth with the following:

    Loading and exploring the data

    The original analysis by Dudas et al. was based on 1610 whole EBOV genomes. We will do a fast exploratory analysis of a random subsample of these sequences.

    Let's load the multiple sequence alignment and inspect it:

    Now we will create a unique sub-sample of these sequences. Since your results will be based on a different sample of sequences, your results will likely differ from what is presented here. You can try re-running your analysis with different subsamples and options.

    Choose a 'seed' for random number generation distinct from the 2014 value used here (for example, your CID number). Make a note of this number. Your results will be reproducible with this seed.

    It's always a good idea to visually check your alignment, which is easily done using an external tool like seaview. If you like, you can also do this from within R using packages such as msaR. Note that installation and visualization will take some time, so you may skip this step.

    This should open a browser window where you will see something like the following:

    Let's compute genetic and evolutionary distances between sequences. This computes the raw number of character differences between each pair of sequences:

    Note the option pairwise.deletion=TRUE , which causes missing data to be handled on a pairwise basis as opposed to masking sites across the entire alignment. Let's make a histogram:

    There is a lot of variation in distances, with some pairs differing by less than two characters. This is due to the short time frame over which the epidemic spread and over which samples were collected.

    Evolutionary distances and a neighbour-joining tree

    First, we will compute an evolutionary distance matrix for phylogenetic analysis. We will use the F84 nucleotide substitution model, which is similar to the HKY model that several published studies have found to work well for EBOV. This is different from computing the raw number of differences between sequences that we looked at in the last section. The evolutionary model accounts for differential rates of substitution between different characters and also accounts for reverse mutations and saturation.

    Using the pairwise.deletion option tells the distance calculation to ignore sites that are missing in one or both sequences when comparing two sequences, but sites which may be missing in other sequences are still used.
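To make the difference between pairwise and complete deletion concrete, here is a small Python sketch on a toy alignment. It is not the ape implementation used in the practical: it computes a plain proportion of differing sites rather than an F84 distance, the function names are hypothetical, and treating "N", "?" and "-" as missing is an assumption made for the example.

```python
from typing import List

MISSING = set("N?-")  # characters treated as missing data (assumption)


def p_distance_pairwise(a: str, b: str) -> float:
    """Proportion of differing sites, skipping sites missing in a or b only."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper())
                if x not in MISSING and y not in MISSING]
    if not compared:
        raise ValueError("no sites present in both sequences")
    return sum(x != y for x, y in compared) / len(compared)


def complete_columns(alignment: List[str]) -> List[int]:
    """Column indices with no missing data in any sequence."""
    return [i for i, column in enumerate(zip(*alignment))
            if not any(ch.upper() in MISSING for ch in column)]


def p_distance_complete(alignment: List[str], i: int, j: int) -> float:
    """Proportion of differing sites after masking columns across the alignment."""
    columns = complete_columns(alignment)
    if not columns:
        raise ValueError("every column has missing data")
    a, b = alignment[i].upper(), alignment[j].upper()
    return sum(a[k] != b[k] for k in columns) / len(columns)


toy = ["ACGTAC",
       "ACNTAT",
       "A-GTTC"]
```

For the first two toy sequences, pairwise deletion compares five sites (only the column with the N is skipped), whereas complete deletion also discards the column where the third sequence has a gap and compares four.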

    Now computing a neighbor-joining tree is simple with the following command:

    Note that there is no significance to the location of the root of this tree, and branch lengths show distances in units of substitutions per site. We can plot an unrooted version with a scale bar:

    Maximum likelihood phylogeny

    First we convert the sequence data into a format recognized by phangorn :

    Then set the initial conditions for optimization:

    This tells the package to start from the neighbour-joining tree and estimate 4 categories of rate variation with an HKY substitution model and to estimate the proportion of sites in the alignment which are invariant.

    Now we can optimize the tree topology and substitution model parameters. These options specify which parameters should be optimized

    • optNni specifies that the tree topology will be optimized using nearest-neighbor interchange search
    • optBf specifies that the base frequencies (A,C,T or G) will be estimated
    • optQ specifies that the substitution rate parameters will be estimated
    • optGamma specifies that Gamma parameters for rate variation between sites will be estimated
    • optInv specifies that the proportion of sites which are invariant will be estimated

    Note: This optimization can take a couple of minutes.

    Let's see to what extent the optimized tree has higher likelihood than the initial neighbor-joining tree:

    In the original analysis by Dudas et al., a more complex substitution model was used which accounted for differences in codon positions as well as in the non-coding regions.

    To fit a molecular clock, we must use information about the time of each sample. Let's load the date of sampling for each sequence. Note that the label for each sequence includes metadata regarding the province and country of origin and the time of sampling.

    We load the sample times in numeric format using the following command:

    Note the distribution of samples through time:

    Most samples were collected in the latter half of 2014 when peak incidence occurred.

    Now we can construct a time-scaled phylogenetic tree so that branches are in units of years and nodes correspond to TMRCAs. Let's start by placing the root of the tree on a branch that is likely to have the MRCA of the sample. One way to do this is to use the rtt command, which uses root-to-tip regression; this selects the root position to maximise the variance in evolutionary distance explained by the sampling times.

    Let's do our own root-to-tip regression using the rerooted tree. You should find an almost linear trend between evolutionary divergence and the time that the sample was taken. This will also give us a rough estimate of the molecular clock rate.

    Does this look approximately linear? The slope of a linear regression line will have units of substitutions per site per unit time and can serve as a fast estimator for the molecular clock rate.
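The regression itself is ordinary least squares, sketched below in Python rather than with the R commands of the practical. The sample times and divergences are invented so that they lie exactly on a line with a rate of 0.001 substitutions per site per year; the function name is hypothetical. The slope estimates the clock rate, and the time at which the fitted line crosses zero divergence gives a rough date for the root.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept of y regressed on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("all sample times are identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented data: sample times (decimal years) and root-to-tip divergence
# (substitutions per site), constructed to lie on a perfect clock.
sample_times = [2014.2, 2014.5, 2014.8, 2015.1]
divergence = [0.001 * (t - 2013.9) for t in sample_times]

rate, intercept = fit_line(sample_times, divergence)
root_time = -intercept / rate   # where the fitted line reaches zero divergence
```

Real data will scatter around the line, so both the rate and the implied root time should be treated as quick estimates only.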

    The molecular clock rate is the slope:

    Estimates based on state-of-the-art Bayesian methods place the rate at around 0.00124 substitutions per site per year.

    Estimating times of common ancestry

    To estimate a tree with branch lengths in units of time (and TMRCAs), we will use the recently-developed treedater R package which is based on

    The treedater algorithm requires as input a tree with branches in units of substitutions, the sample times for each tree tip, and the length of the sequences used to estimate the tree. This package can estimate the root position if given an unrooted phylogeny, or we can re-use the estimated root position found with rtt . We use treedater like this:

    Note that this provides an estimate of the clock rate, the variation in clock rates, and the time of common ancestry. Does your estimated TMRCA correspond to when this epidemic originated in humans? The first documented case in humans from this epidemic was in early December 2013.

    We can do an improved root-to-tip regression which also shows estimated dates at the interior of the tree using this command:

    By default treedater does not provide confidence intervals for estimated dates and rates, but we can do this quickly using a parametric bootstrap procedure. Note: This will take a couple minutes to run.

    Does this confidence interval overlap with the earliest cases of EBOV in humans? This would be around 2013.95 in decimal format.
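Comparing calendar dates with the decimal years used on the time axis requires a conversion. A minimal Python sketch is given below; the function name is hypothetical, and the convention used (fraction of the year elapsed at the start of the day) is one common choice, not necessarily the one used by the practical's own code.

```python
from datetime import date


def decimal_year(d: date) -> float:
    """Convert a calendar date to a decimal year, e.g. 2014-01-01 -> 2014.0."""
    start = date(d.year, 1, 1)
    end = date(d.year + 1, 1, 1)
    # Dividing by the actual year length handles leap years.
    return d.year + (d - start).days / (end - start).days
```

For example, `decimal_year(date(2014, 10, 31))` is roughly 2014.83, and any date in December 2013 falls between about 2013.92 and 2014.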

    Nonparametric phylodynamic estimation

    We will reconstruct the historical dynamics of effective population size, Ne(t), using the nonparametric skygrowth technique. For details, see

    This 'effective' size may correspond approximately to the number of infected hosts (although this assumption must be checked carefully), and the growth rate of effective size can be used to estimate reproduction numbers.

    Because geographic structure can confound the relationship between Ne(t) and epidemic size, we will work with a subtree drawing only on lineages sampled from the best sampled country, Sierra Leone. The set of lineages with geocode 'SLE' can be found using

    Now we want to make a new tree where all lineages but these are 'pruned':

    Now we can estimate Ne(t) using Bayesian MCMC. NOTE This will take a couple minutes. While you wait, have a look at this figure and these data which show how many cases were reported to the WHO over time and in each country.

    Let's plot on the calendar time axis. According to WHO records, the peak number of cases in Sierra Leone (maximum number of cases per week) occurred on October 31, 2014. We include a vertical red line showing this time point.

    Does your estimated time of peak Ne match that date?

    We can also use these methods to get a rough idea of how the reproduction number changed through time, because the epidemic growth rate will sometimes be similar to the growth rate of Ne. If we assume that EBOV infections last 21 days on average (including incubation and clinical phases), then we can say that hosts are removed at the annualized rate of approximately 365/21. Then we can visualize R(t) using this command:

    How does R(t) change through time? What was R(t) around the time the epidemic was growing rapidly in Sierra Leone (around 2014.5)? Note that estimates may be very noisy and have large confidence intervals early on, before rapid growth in Sierra Leone set in. How does this estimate of R(t) compare to other published values based on the early epidemic?
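The link between growth rate and reproduction number used above can be written down directly. The Python sketch below assumes the simplest relationship, in which the per-capita growth rate r equals the transmission rate minus the removal rate, so that R = 1 + r / removal rate; this is an assumption made here for illustration (not a statement about how the plotting command is implemented), and the function name is hypothetical.

```python
def reproduction_number(growth_rate_per_year: float,
                        infectious_days: float = 21.0) -> float:
    """R = 1 + r / gamma, with gamma the annualized removal rate.

    Assumes a simple SIR-like model in which r = beta - gamma and
    R = beta / gamma, and that the growth rate of Ne approximates
    the epidemic growth rate.
    """
    if infectious_days <= 0:
        raise ValueError("infectious period must be positive")
    gamma = 365.0 / infectious_days   # removals per host per year
    return 1.0 + growth_rate_per_year / gamma
```

Under these assumptions a growth rate of zero gives R = 1, and a growth rate equal to the removal rate (365/21, about 17.4 per year) gives R = 2.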

    Ancestral state estimation

    Here we will use parsimony to reconstruct the likely location of lineages using the rooted time-scaled phylogeny. The country of origin (Liberia, Guinea, and Sierra Leone) can be found in the 4th position of each taxon label:

    We can extract these geocodes using the strsplit command:

    We can tabulate how many sequences come from each country (Guinea, Liberia, and Sierra Leone):
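As an illustration of the splitting and tabulating steps, here is a Python sketch (the practical itself uses R's strsplit). The taxon labels below are invented, and the "|" separator is an assumption; only the idea that the geocode sits in the 4th field of each label is taken from the text.

```python
from collections import Counter
from typing import Iterable


def geocode(label: str, sep: str = "|", position: int = 4) -> str:
    """Return the field at the given 1-based position of a taxon label."""
    fields = label.split(sep)
    if len(fields) < position:
        raise ValueError(f"label has fewer than {position} fields: {label!r}")
    return fields[position - 1]


def tabulate_geocodes(labels: Iterable[str]) -> Counter:
    """Count how many labels carry each geocode."""
    return Counter(geocode(label) for label in labels)


# Invented labels in an assumed 'virus|id|extra|country|district|date' layout.
example_labels = [
    "EBOV|sample01|x|SLE|Kailahun|2014-06-10",
    "EBOV|sample02|x|GIN|Gueckedou|2014-03-21",
    "EBOV|sample03|x|SLE|Kenema|2014-07-02",
    "EBOV|sample04|x|LBR|Montserrado|2014-08-15",
]
counts = tabulate_geocodes(example_labels)
```

With these four invented labels, `counts` holds two sequences for SLE and one each for GIN and LBR.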

    Now we must put the geocodes in a phyDat format used in the phangorn package:

    Compute the ancestral states using

    And we can plot the states using the following:

    What country do you find at the root of the tree? The West African epidemic is thought to have originated near Gueckedou, a town in southern Guinea which is quite close to the borders of both Sierra Leone and Liberia. The proximity of the original outbreak to three international borders is thought to have compounded the epidemic. By the summer of 2014 Ebola was circulating in all three countries.
