Here is the list of available command line software.
abyss: ABySS (Assembly By Short Sequences) is a de novo, parallel, paired-end sequence assembler that is designed for short reads.
Admixtools: ADMIXTOOLS (Patterson et al. 2012) is a software package that supports formal tests of whether admixture occurred, and makes it possible to infer admixture proportions and dates.
Admixture: ADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.
alignAce: AlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences.
AllPaths-LG: ALLPATHS-LG is a whole genome shotgun assembler that can generate high quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers. The significant difference between ALLPATHS and traditional assemblers such as Arachne is that ALLPATHS assemblies are not necessarily linear, but instead are presented in the form of a graph. This graph representation retains ambiguities, such as those arising from polymorphism, uncorrected read errors, and unresolved repeats, thereby providing information that has been absent from previous genome assemblies.
amos: A Modular, Open-Source whole genome assembler.
Ampliconnoise: AmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
apollo: Apollo allows researchers to explore genomic annotations at many levels of detail, and to perform expert annotation curation, all in a graphical environment.
asm2ace: Convert the "Celera ASM" assembly format to the "Consed ACE" or the "CAF" file format.
ATLAS: ATLAS (Automatically Tuned Linear Algebra Software) provides highly optimized linear algebra kernels for arbitrary cache-based architectures. ATLAS provides ANSI C and Fortran77 interfaces for the entire BLAS API, and a small portion of the LAPACK API.
Atlas2: Atlas2 is a next-generation sequencing suite of variant analysis tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in Whole Exome Capture Sequencing (WECS) data.
Augustus: AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.
bam2fastq: Extract sequences from a BAM file in fastq format.
bambus: Scaffolding tool: ordering and orienting contigs by incorporating additional information about their relative placement along the genome.
BAMTOOLS: BamTools provides both a programmer's API and an end-user's toolkit for handling BAM files.
bamUTIL: bamUtil is a repository that contains several programs that perform operations on SAM/BAM files. All of these programs are built into a single executable, bam.
banyan: This package provides sorted drop-in versions of Python's set and dict (with optional augmentation).
BAYESCAN: Detecting natural selection from population-based genetic data using differences in allele frequencies between populations.
bcl2fastq: The Bcl2FastQ conversion software is a tool to handle bcl conversion and demultiplexing of both unzipped and zipped bcl files; the zipped files have a reduced footprint and were introduced as an optional output of the HCS Software version 2.0.
bedtools: The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage.
BIOM-FORMAT: The Biological Observation Matrix format. There are two components to the BIOM project: the first is the definition of the BIOM format, and the second is the development of support objects in multiple programming languages to support the use of BIOM in diverse bioinformatics applications. The version of the BIOM file format is independent of the version of the biom-format software.
BIOPYTHON: Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
Bis-SNP: BisSNP is a package based on the Genome Analysis Toolkit (GATK) map-reduce framework for genotyping and accurate DNA methylation calling in bisulfite treated massively parallel sequencing (Bisulfite-seq, NOMe-seq, RRBS and any other bisulfite treated sequencing) with Illumina directional library protocol.
Bismark: A tool to map bisulfite converted sequence reads and determine cytosine methylation states.
Blastr: BlastR is a method for searching non-coding RNAs in databases.
blat: The BLAST-Like Alignment Tool: similarity search in databanks. BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more.
bowtie: Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
Bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
BRAT: BRAT is an accurate and efficient tool for mapping short bisulfite-treated reads obtained from the Solexa-Illumina Genome Analyzer. BRAT supports mapping of single-end and paired-end short reads and allows alignment of reads/mates of different lengths. BRAT-BW is a fast, accurate and memory-efficient tool that maps bisulfite-treated short reads (BS-seq) to a reference genome using the FM-index (Burrows-Wheeler transform). The package includes tools to trim low-quality read ends and to report A, C, G, T counts at each base for the forward and reverse strands of references.
breakdancer: breakdancer_max and bam2cfg.pl are available. BreakDancerMax predicts five types of structural variants: insertions, deletions, inversions, inter- and intra-chromosomal translocations from next-generation short paired-end sequencing reads using read pairs that are mapped with unexpected separation distances or orientation.
bwa: Burrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.
cap3: CAP3 is a sequence assembly program for small-scale assembly of EST sequences with or without quality values.
carnac: Carnac is a software tool for analysing the hypothetical secondary structure of a family of homologous RNA.
carthagen: CarthaGène is genetic/radiation hybrid mapping software. CarthaGène looks for multiple-population maximum likelihood consensus maps using a fast EM algorithm for maximum likelihood estimation and powerful ordering algorithms. CarthaGène can handle data made up of several distinct populations which may each be either F2 backcross, recombinant inbred lines, F2 intercross, phase known outbreds and/or radiation hybrids (haploid and diploid data).
CASAVA: Illumina's Consensus Assessment of Sequence and Variation (CASAVA) software captures summary information for resequencing and counting studies and places the data in a compact structure for visualization within GenomeStudio Software or publicly available analysis tools. CASAVA can create genomic builds, call SNPs, detect indels, and count reads from data generated from one or more runs of the Genome Analyzer across a broad range of sequencing applications.
CD-HIT: CD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a fasta format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the sequence 'groupies' for each nr sequence representative.
CD-HIT-OTU: CD-HIT-OTU identifies OTUs in pyrosequencing-based 16S ribosomal RNA metagenomic surveys and reduces spurious OTUs.
CEGMA: CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for building a highly reliable set of gene annotations in virtually any eukaryotic genome.
Cheetah: Cheetah is an open source template engine and code generation tool, written in Python.
chimerascan: chimerascan is a software package that detects gene fusions in paired-end RNA sequencing (RNA-Seq) datasets.
ChIPMunk: ChIPMunk is a fast heuristic DNA motif digger based on greedy approach accompanied by bootstrapping. ChIPMunk identifies the strong motif with the maximum Discrete Information Content in a set of DNA sequences. ChIPMunk uses (extended) multifasta as the input format and supports IUPAC DNA letters in the input sequence.
clearcut: Clearcut is the reference implementation for the Relaxed Neighbor Joining (RNJ) algorithm by J. Evans, L. Sheneman, and J. Foster from the Initiative for Bioinformatics and Evolutionary Studies (IBEST) at the University of
Clustal Omega: Clustal Omega is the latest addition to the Clustal family. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. It will also make use of multiple processors, where present. In addition, the quality of alignments is superior to previous versions, as measured by a range of popular benchmarks.
clustalw: Multiple sequence alignment program for DNA or proteins.
clview: This is a graphical, interactive tool for inspecting the ACE format assembly files generated by CAP3 or phrap.
consed: Consed lets you visualise, edit and finish sequences assembled with phrap. Consed is compatible with Newbler, Cross_match, Phrap and PCAP output.
Control-FREEC: Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data developed by the Bioinformatics Laboratory of Institut Curie (Paris).
CRAC: An integrated RNA-Seq read analysis.
Cufflinks: Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
cutadapt: Cutadapt removes adapter sequences from DNA high-throughput sequencing data. This is usually necessary when the read length of the machine is longer than the molecule that is sequenced, such as in
Delly: DELLY is an integrated structural variant prediction method that can detect deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends and split-reads to sensitively and accurately delineate genomic rearrangements throughout the genome.
dialign: DIALIGN is a software program for multiple sequence alignment. DIALIGN constructs pairwise and multiple alignments by comparing entire segments of the sequences. No gap penalty is used. This approach can be used for both global and local alignment, but it is particularly successful in situations where sequences share only local homologies.
DisEMBL: DisEMBL is a computational tool for prediction of disordered/unstructured regions within a protein sequence. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects.
e-PCR: e-PCR identifies sequence tagged sites (STSs) within DNA sequences. Using e-PCR, you can search for sub-sequences that closely match the PCR primers and have the correct order, orientation, and spacing.
ea-utils: Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc.
ecoPCR: ecoPCR is an electronic PCR software developed by LECA and Helix-Project. It helps you estimate the quality of barcode primers. In conjunction with OBItools, you can postprocess ecoPCR output to compute barcode coverage and barcode specificity.
edena: De novo short reads assembler.
emboss: The suite includes programs, tools and sequence databases that can cover all the basic needs in the field of analysis and exploitation of biological sequences.
Ensembl-api: Ensembl uses MySQL relational databases to store its information. A comprehensive set of Application Programme Interfaces (APIs) serve as a middle-layer between underlying database schemes and more specific application programmes. The APIs aim to encapsulate the database layout by providing efficient high-level access to data tables and isolate applications from data layout changes. Ensembl's API is written in Perl.
Esprit: An algorithm for estimating species richness using large collections of 16S rRNA pyrosequences.
ESTScan: ESTScan is a program that can detect coding regions in DNA/RNA sequences, even if they are of low quality (e.g. EST sequences). ESTScan will also detect and correct sequencing errors that lead to frameshifts. ESTScan is not a gene prediction program, nor is it an open reading frame detector. In fact, its strength lies in the fact that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but will detect coding regions with high selectivity and sensitivity.
euler: Euler is an approach to fragment assembly that abandons the classical "overlap - layout - consensus" paradigm used by most other assembly tools.
euler-SR: EULER-SR is a program for de novo assembly of reads (from Roche 454 Life Sciences or Illumina/Solexa).
Eval: Eval is a flexible tool for analyzing the performance of gene-structure prediction programs. It provides summaries and graphical distributions for many statistics describing any set of annotations, regardless of their source. It also compares sets of predictions to standard annotations and to one another.
exonerate: A generic tool for sequence alignment.
fasta: FASTA is a sequence similarity search tool which uses heuristics for fast local alignment searching.
FastQC: A quality control application for FastQ files. FastQC is an application which takes a FastQ file and runs a series of tests on it to generate a comprehensive QC report.
FASTX-Toolkit: The FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
FigTree: FigTree is designed as a graphical viewer of phylogenetic trees and as a program for producing publication-ready figures.
FindPeaks: FindPeaks performs two functions: 1) analysis of short-read sequencing (Solexa/Illumina) experiments to identify areas of enrichment; 2) generating wig files for use with the UCSC browser.
FLASH: FLASH (Fast Length Adjustment of SHort reads) is a fast and accurate tool to merge paired-end reads from fragments that are shorter than twice the read length. The resulting longer reads can significantly improve genome assemblies.
Flux Simulator: The Flux Simulator aims at modeling RNA-Seq experiments in silico: sequencing reads are produced from a reference genome according to annotated transcripts.
FrameDP: Sensitive peptide detection on noisy matured sequences. Available with command line interface on the cluster.
FreeBayes: FreeBayes is a population-based short polymorphism detector which features support for the simultaneous detection of SNPs, INDELs, and multi-base mismatches, poly-allelic sites, polyploidy, and sample and region-specific copy number modeling. FreeBayes works with standard file formats (BAM and VCF) and can easily be integrated into existing next-generation sequencing pipelines.
GAAS: GAAS (Genome relative Abundance and Average Size) is a bioinformatic tool to calculate accurate community composition and average genome size in metagenomes by using BLAST, advanced parsing of hits and correction of genome length bias.
GapCloser: GapCloser is designed to close the gaps that emerge during the scaffolding process of SOAPdenovo, using the abundant pair relationships of short reads.
gassst: GASSST (Global Alignment Short Sequence Search Tool) finds global alignments of short DNA sequences against large DNA banks. GASSST's strong point is its ability to perform fast gapped alignments. It works well for both short and longer reads, and has so far been tested on reads up to 500bp.
GAST: Global Alignment for Sequence Taxonomy. Uses a reference database of SSU rRNA sequences to determine the taxonomy of hypervariable region tags.
GATK: The GATK is, first, a structured software library that makes it easy to write efficient analysis tools for next-generation sequencing data and, second, a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth-of-coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner.
Gblocks: Gblocks is a computer program written in ANSI C language that eliminates poorly aligned positions and divergent regions of an alignment of DNA or protein sequences. These positions may not be homologous or may have been saturated by multiple substitutions and it is convenient to eliminate them prior to phylogenetic analysis. Gblocks selects blocks in a similar way as it is usually done by hand but following a reproducible set of conditions. The selected blocks must fulfill certain requirements with respect to the lack of large segments of contiguous nonconserved positions, lack or low density of gap positions and high conservation of flanking positions, making the final alignment more suitable for phylogenetic analysis. Gblocks outputs several files to visualize the selected blocks. The use of a program such as Gblocks reduces the necessity of manually editing multiple alignments, makes the automation of phylogenetic analysis of large data sets feasible and, finally, facilitates the reproduction of the alignments and subsequent phylogenetic analysis by other researchers.
GeneMarkE.hmm: GeneMark.hmm eukaryotic, gene prediction in eukaryotes. Executable name: gmhmme3.
GeneScissors: GeneScissors exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods.
genomeTools: Collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named "gt".
Geocoder: Species locality data + polygons -> nexus file. geocoder.py is a program written in Python that takes one file containing polygons and one file with species locality data as input. The program then tests whether a species has been recorded inside any of the polygons. The result is presented as a nexus file with "0" indicating absence and "1" indicating presence in a polygon.
GMAP / GSNAP: GMAP is a Genomic Mapping and Alignment Program for mRNA and EST sequences; GSNAP is a Genomic Short-read Nucleotide Alignment Program.
gtf_to_genes: gtf_to_genes is a Python parser that reads all the genes and transcripts from a GTF file and caches the data in Python classes for high-speed access.
gth: GenomeThreader is a software tool to compute gene structure predictions. The gene structure predictions are calculated using a similarity-based approach where additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments. GenomeThreader is available free of charge only for non-commercial research institutions.
hmmer: HMMER is a package used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
HPG-Variant: The HPG Variant suite is a project aimed to provide a complete suite of tools to work with genomic variation data, from VCF tools to variant profiling or genomic statistics.
HTSeq: HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
IGV: The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
Infernal: Infernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).
IntaRNA: IntaRNA predicts interactions between two RNA molecules, e.g. a non-coding RNA (ncRNA) and a mRNA. The scoring is based on hybridization free energy and accessibility of the interaction sites in both molecules.
iPSORT: iPSORT is a subcellular localization site predictor for N-terminal sorting signals. Given a protein sequence, it will predict whether it contains a Signal Peptide (SP), Mitochondrial Targeting Peptide (mTP), or Chloroplast Transit Peptide (cTP).
Jellyfish: JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers an order of magnitude faster and with an order of magnitude less memory than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism. JELLYFISH is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences. It outputs its k-mer counts in a binary format, which can be translated into a human-readable text format using the "jellyfish dump" command.
Krona: Krona allows hierarchical data to be explored with zoomable pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats.
Lamarc: LAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates.
Last: LAST finds similar regions between sequences.
LDhat: LDhat is a package written in the C and C++ languages for the analysis of recombination rates from population genetic data.
LocusZoom: LocusZoom is a tool to plot regional association results from genome-wide association scans or candidate gene studies.
Lucy: LUCY, a sequence cleanup program. The quality trimming portion of lucy makes use of phred quality scores, such as those produced by many automated sequencers based on the Sanger sequencing method. As such, lucy's quality trimming may not be appropriate for sequence data produced by some of the new "next-generation" sequencers.
lxml: lxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API.
MAFFT: MAFFT is a multiple sequence alignment program for Unix-like operating systems.
MACS: Model-based Analysis of ChIP-Seq (MACS) analyzes data from short-read sequencers such as the Genome Analyzer (Illumina/Solexa). MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, is publicly available as open source, and can be used for ChIP-Seq with or without control samples.
MAFFT: MAFFT is a multiple sequence alignment program for Unix-like operating systems. It offers a range of multiple alignment methods, such as L-INS-i (accurate; for alignment of <~200 sequences) and FFT-NS-2 (fast; for alignment of <~10,000 sequences).
mapSplice: Accurate mapping of RNA-seq reads for splice junction discovery.
maq: Maq is software that builds mapping assemblies from short reads generated by next-generation sequencing machines. It is designed particularly for the Illumina/Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data.
MaSuRCA: MaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454).
MATS: MATS is a computational tool to detect differential alternative splicing events from RNA-Seq data.
Mauve: Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.
Megan: MEtaGenome ANalyzer. Metagenomic data analysis: taxonomic and functional (SEED and KEGG classification) analysis.
meme: The MEME Suite allows you to: (1) discover motifs using MEME or GLAM2 on groups of related DNA or protein sequences, (2) search sequence databases using motifs, (3) compare a motif to all motifs in a database of motifs, and (4) associate motifs with Gene Ontology terms via their putative target genes.
MetaPhlAn: MetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes.
MetaSim: A Sequencing Simulator for Genomics and Metagenomics. The resulting data sets can be used as standardized test scenarios for planning sequencing projects or for benchmarking assembler and metagenomic software.
MIGRATE-N: Estimation of population sizes and gene flow using the coalescent. Migrate estimates effective population sizes and past migration rates between n populations, assuming a migration matrix model with asymmetric migration rates and different subpopulation sizes. Migrate uses maximum likelihood or Bayesian inference to jointly estimate all parameters.
mira: Whole genome shotgun and EST sequence assembler for Sanger, 454, and Solexa / Illumina.
miRanda: miRanda is an algorithm for finding genomic targets for microRNAs. This algorithm has been written in C and is available as an open-source method under the GPL. MiRanda was developed at the Computational Biology Center of Memorial Sloan-Kettering Cancer Center.
mirDeep-P: miRDeep-P, or miRDP for short, is a computational tool for analyzing the microRNA (miRNA) transcriptome in plants.
mirDeep2: miRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples. Last, a new module for preprocessing of raw Illumina sequencing data produces files for downstream analysis with the miRDeep2 or quantifier module.
miropeats: Miropeats discovers regions of sequence similarity amongst any set of DNA sequences and then presents this similarity information graphically.
miRPara: miRPara is a SVM-based miRNA prediction tool.
Miso: MISO (Mixture-of-Isoforms) is a probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data, and identifies differentially regulated isoforms or exons across samples. By modeling the generative process by which reads are produced from isoforms in RNA-Seq, the MISO model uses Bayesian inference to compute the probability that a read originated from a particular isoform.
Mitoprot II: MitoProt calculates the N-terminal protein region that can support a Mitochondrial Targeting Sequence and the cleavage site.
Mocat: MOCAT is a package for analyzing metagenomics datasets. Currently MOCAT supports Illumina single- and paired-end reads in raw FastQ format. Using MOCAT you can, for example, generate taxonomic profiles of, and assemble, metagenomes.
Mothur: The one-stop source for your computational microbial ecology needs. mothur offers the ability to go from raw sequences to the generation of visualization tools to describe alpha and beta diversity.
ms: A program for generating samples under neutral models.
MSR-CA: The MSR-CA assembler combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. The strength of the deBruijn graph approach is in its ability to quickly create a graph representation of the genome assembly from the deep coverage short read data. However in most cases the graph is extremely complex and it is hard to find a way to recover the original genome sequence from simply traversing it. On the other hand, overlap-layout-consensus is better suited for longer reads with high coverage, and since it usually relies on overlaps of 40 bases or longer, it is better for resolving short repetitive structures.
multalin: Multiple sequence alignment with hierarchical clustering.
MultiRNAFold: The MultiRNAFold package contains software for secondary structure prediction of one, two, or many interacting RNA or DNA molecules. It is composed of three pieces of software: SimFold, PairFold and MultiFold.
mummer: MUMmer is a package for rapidly aligning entire genomes, whether in complete or draft form.
muscle: Multiple sequence alignment (nucleic or proteic).
ncbi-blast: Similarity search against databanks.
ncPRO-seq: ncPRO-seq (Non-Coding RNA PROfiling in sRNA-seq) is a tool for annotation and profiling of ncRNAs using deep-sequencing data, developed by the Bioinformatics Laboratory of the Institut Curie. This comprehensive and flexible ncRNA analysis pipeline aims to interrogate and analyze in detail small RNAs derived from annotated non-coding regions in miRBase, Rfam and repeatMasker, and from regions defined by users. The ncPRO-seq pipeline also has a module to identify regions significantly enriched with short reads that cannot be classified as known ncRNA families.
NetPhos: NetPhos is a neural network-based method for predicting potential phosphorylation sites at serine, threonine or tyrosine residues in protein sequences.
New Fugue: New Fugue is a program for estimation of haplotype frequencies and linkage disequilibrium coefficients in family data.
newbler: Newbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company.
NextGenMap: NextGenMap is a flexible and fast read mapping program that is more than twice as fast as BWA while achieving a mapping sensitivity similar to Stampy.
NIKS: NIKS (Needle in a K-stack) - detection of mutations in NGS data.
numpy: NumPy is a package needed for scientific computing with Python.
oasesOases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly. It was developed by Marcel Schulz (MPI for Molecular Genomics) and Daniel Zerbino (previously at the European Bioinformatics Institute (EMBL-EBI), now at UC Santa Cruz).
Oases uploads a preliminary assembly produced by Velvet, and clusters the contigs into small groups, called loci. It then exploits the paired-end read and long read information, when available, to construct transcript isoforms.
OSLayOSLay is a tool that uses synteny between matching sequences in a target assembly and a reference assembly to lay out the contigs (or scaffolds) in the target assembly. The tool provides an interactive visualization of the computed layout, and the result can be imported into the assembly editing tool Consed to support the design of primer pairs for gap closure.
PASAPASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.
patscanPatScan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA etc.) sequence archives for instances of a pattern which you input.
PBJellyPBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies.
PennCNVCopy Number Variation (CNV) detection from SNP genotyping arrays. PennCNV implements a hidden Markov model (HMM) that integrates multiple sources of information to infer CNV calls for individual genotyped samples. It differs from segmentation-based algorithms in that it considers the SNP allelic ratio distribution as well as other factors, in addition to signal intensity alone. In addition, PennCNV can optionally utilize family information to generate family-based CNV calls by several different algorithms. Furthermore, PennCNV can generate CNV calls given a specific set of candidate CNV regions, through a validation-calling algorithm.
PerlPrimerPerlPrimer is a free, open-source GUI application written in Perl that designs primers for standard PCR, bisulphite PCR, real-time PCR (QPCR) and sequencing. It aims to automate and simplify the process of primer design.
pftoolsThe pftools package contains all the software necessary to build protein and DNA generalized profiles and use them to scan and align sequences, and search databases
phred / phrapThe phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base. Phrap is a program for assembling shotgun DNA sequence data.
PhyMLPhyML is a phylogeny software based on the maximum-likelihood principle.
picard-toolsPicard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
PindelPindel can detect breakpoints of large deletions, medium sized insertions, inversions, tandem duplications and other structural variants at single-based resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
PLAST is a parallel alignment search tool for comparing large protein banks.
PLAST runs 3 to 5 times faster than the NCBI-BLAST software when processing large amount of data.
PLINKPLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
polyphredAnalysis of data from the sequencer. PolyPhred is a program that compares fluorescence-based sequences across traces obtained from different individuals to identify heterozygous sites for single nucleotide substitutions.
popABCPopABC is a computer package to estimate historical demographic parameters of closely related species/populations (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an isolation-with-migration model. The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC, Beaumont et al., 2002). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios.
PoPoolationA Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals
prot4ESTprot4EST is a perl script that takes expressed sequence tags (ESTs) and translates them optimally to produce putative peptides.
pycogentSequence-oriented Python library for biology.
PyicoteoPyicoteo is a suite of tools for the analysis of high-throughput sequencing data.
pysamPysam is a Python module for reading and manipulating SAM files. It is a lightweight wrapper of the samtools C API.
QIIMEQIIME (pronounced "chime") stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms, but also supporting analysis of other types of data (such as shotgun metagenomic data). QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. QIIME has been applied to single studies based on billions of sequences from thousands of samples.
QRNAA prototype noncoding RNA genefinder, based on comparative genome sequence analysis.
quickdistCalculates a matrix of pairwise distances between sequences in a multiple sequence alignment.
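As a rough illustration of what a pairwise distance matrix over a multiple sequence alignment looks like, the following Python sketch computes the proportion of differing sites (p-distance) for every pair of aligned sequences. The distance measure actually used by quickdist is not stated here, so the p-distance, the gap handling, and the names `p_distance` and `distance_matrix` are all assumptions.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing sites among columns where neither sequence has a gap."""
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0

def distance_matrix(alignment):
    """Return a square matrix of p-distances, in the insertion order of the dict."""
    names = list(alignment)
    return [[p_distance(alignment[r], alignment[c]) for c in names] for r in names]

# Hypothetical three-sequence alignment (all rows have the same length).
alignment = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "AGGT-C"}
matrix = distance_matrix(alignment)
for name, row in zip(alignment, matrix):
    print(name, ["%.2f" % d for d in row])
```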
RR is "GNU S", a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.
R'MESRecherche de Mots Exceptionnels dans une Séquence (search for exceptional words in a sequence).
RADmapperSet of scripts to create a de novo consensus and map the reads back.
RADtoolsRADtools is our software for processing RAD Sequencing data. Version 1.0 is a pipeline for transforming Illumina reads into candidate genetic markers.
RayParallel genome assemblies for parallel DNA sequencing. Ray is a parallel software that computes de novo genome assemblies with next-generation sequencing data. Ray is written in C++ and can run in parallel on numerous interconnected computers using the message-passing interface (MPI) standard.
RDP_FrameBotRDP FrameBot is a tool for correcting frameshift errors caused by insertions and deletions in DNA sequences.
readSeqFormat conversion sequence. ReadSeq is a program and library for conversion of biosequence data from one format to another, useful in various bioinformatics programs and services.
realignerReAligner is used to realign multi-alignments of DNA fragments. converter is a utility for reformatting multi-alignments.
RECONProper identification of repetitive sequences is an essential step in genome analysis. The RECON package performs de novo identification and classification of repeat sequence families from genomic sequences. The underlying algorithm is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Specifically, our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. RECON should be useful for first-pass automatic classification of repeats in newly sequenced genomes.
RepeatMaskerRepeatMasker is a program that screens DNA sequences for interspersed repeats (using specially formatted RepBase repeat databanks) and low-complexity DNA sequences.
RepeatModelerRepeatModeler is a de-novo repeat family identification and modeling package.
RepeatScoutRepeatScout is a tool to discover repetitive substrings in DNA.
RIsearchRIsearch is a program for fast RNA-RNA interaction search. It employs a modified Smith-Waterman-Gotoh algorithm based on di-nucleotides to approximate nearest-neighbor energy parameters
RMBlastRMBlast is a RepeatMasker compatible version of the standard NCBI BLAST suite. The primary difference between this distribution and the NCBI distribution is the addition of a new program "rmblastn" for use with RepeatMasker and RepeatModeler. RMBlast supports RepeatMasker searches by adding a few necessary features to the stock NCBI blastn program. These include:
- Support for custom matrices ( without KA-Statistics ).
- Support for cross_match-like complexity adjusted scoring. Cross_match is Phil Green's seeded smith-waterman search algorithm.
- Support for cross_match-like masklevel filtering.
RNA-SeQCRNA-SeQC is a java program which computes a series of quality control metrics for RNA-seq data
RNAhybridRNAhybrid is a tool for finding the minimum free energy hybridisation of a long and a short RNA. The hybridisation is performed in a kind of domain mode, i.e. the short sequence is hybridised to the best fitting part of the long one. The tool is primarily meant as a means for microRNA target prediction.
rnammerRnammer predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project.
Rpy2rpy2 is a redesign and rewrite of rpy. It provides a low-level interface to R and a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.
RSEGThe RSEG software package is designed to analyze ChIP-Seq data, especially for identifying genomic regions and their boundaries marked by diffusive histone modification markers, such as H3K36me3 and H3K27me3. It can work with or without a control sample. It can be used to find regions with differential histone modification patterns, either in a comparison between two cell types or between two kinds of histone modifications.
RSeQCThe RSeQC package provides a number of useful modules that can comprehensively evaluate high-throughput sequence data, especially RNA-seq data.
RUMRUM is an alignment, junction calling, and feature quantification pipeline specifically designed for Illumina RNA-Seq data. RUM can also be used effectively for DNA sequencing (e.g. ChIP-Seq) and microarray probe mapping. RUM also has a strand specific mode.
S-MartS-MART manages your RNA-Seq and ChIP-Seq data. It also produces many different plots to visualize your data.
samstatSAMStat is an efficient C program to quickly display statistics in html format of large sequence files from next generation sequencing projects.
samtoolsSAM (Sequence Alignment/Map). SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Scan For Matchesscan_for_matches is a utility written in C for locating patterns in DNA or protein FASTA files.
scipySciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. The SciPy library depends on Numpy, which provides convenient and fast N-dimensional array manipulation.
SeaViewSeaView is a multiplatform, graphical user interface for multiple sequence alignment and molecular phylogeny.
SEGEMEHLsegemehl is software for mapping short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA). segemehl now supports the SAM format, reads gzipped queries to save both disk and memory space, and allows bisulfite sequencing mapping and split read mapping.
seqcleanA script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low-complexity sequences.
SeqMonkA tool to visualise and analyse high throughput mapped sequence data
seqtools (dotter belvu blixem blixemh)A suite of tools for visualising sequence alignments. Blixem is an interactive browser of pairwise alignments that have been stacked up in a "master-slave" multiple alignment; it is not a 'true' multiple alignment but a 'one-to-many' alignment. It displays an overview section showing the positions of genes and alignments around the alignment window, and a detail section showing the actual alignment of protein or nucleotide sequences to the genomic DNA sequence. Dotter is a graphical dot-matrix program for detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other, with one sequence plotted on the x-axis and the other on the y-axis. Noise is filtered out so that alignments appear as diagonal lines. Belvu is a multiple sequence alignment viewer and phylogenetic tool. It has an extensive set of user-configurable modes to color residues by conservation or by residue type, and some basic alignment editing capabilities. It can generate distance matrices between sequences and construct distance-based trees, either graphically or as part of a phylogenetic software pipeline.
SeqtrimNEXTSeqtrimNEXT is a customizable and distributed pre-processing software for NGS (Next Generation Sequencing) biological data. It makes use of the scbi_mapreduce gem to be able to run in parallel and distributed environments. It is specially suited for Roche 454 (normal and paired-end) and Illumina datasets, although it could be easily adapted to any other situation.
sff_extract454 sequence reads are usually stored in sff files. These files store the information about the reads: sequence, quality values, and quality and adapter clips. sff_extract extracts the reads from the sff files and stores them in fasta and xml or caf text files.
ShoRAHShoRAH is a software package that allows for inference about the structure of a population from a set of short sequence reads as obtained from ultra-deep sequencing of a mixed sample.
sibsim4The SIBsim4 project is based on sim4, which is a program designed to align an expressed DNA sequence with a genomic sequence, allowing for introns.
sim4sim4 is a program designed to align an expressed DNA sequence with a genomic sequence, allowing for introns.
SLIDESLIDE (Sliding-window method for Locally Inter-correlated markers with asymptotic Distribution Errors corrected) is a multivariate normal distribution (MVN)-based multiple hypothesis testing correction method. SLIDE shows a near identical accuracy to the gold standard, the permutation test, and is much faster.
SLIPSLIP (Sliding-window method for Locally Inter-correlated markers for Power estimation) is a multivariate normal distribution (MVN)-based power estimation method. SLIP shows a near identical accuracy to the standard simulation procedure for power, and is much faster.
smaltSMALT efficiently aligns DNA sequencing reads with genomic reference sequences. Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454, PacBio or ABI-Sanger, can be processed including paired-end reads
SNAPPredicts effect of mutations on protein function
SnoReportComputational identification of snoRNAs with unknown targets.
Detecting novel or orphan snoRNAs in RNA sequence data using sequence and structure information only without relying on target information
SOAPdenovo-transSOAPdenovo-Trans is a de novo transcriptome assembler based on the SOAPdenovo framework, adapted to alternative splicing and different expression levels among transcripts. The assembler provides a more accurate, complete and faster way to construct full-length transcript sets.
SortMeRNASortMeRNA is software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data produced by next-generation sequencers. It is capable of handling large RNA databases and sorting out all fragments matching the database with high accuracy and specificity.
SRA toolkitToolkit to query the Sequence Read Archive (SRA) at NCBI.
ssahaSSAHA2 (Sequence Search and Alignment by Hashing Algorithm) is a pairwise sequence alignment program designed for the efficient mapping of sequencing reads onto genomic reference sequences.
StacksStacks is a software suite for analysing RAD Sequencing data by Julian Catchen at the University of Oregon. It will process raw Illumina RAD data or RAD data aligned to a reference genome, and produce genotypes that can be viewed and filtered via a web interface.
StructureThe program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.
SumatraSumatra was developed by the LECA and aims to compute a great deal of sequence similarities in a fast and exact way, based on the length of the Longest Common Subsequence (LCS) between two sequences. Sequence clustering based on similarities is also available through Sumaclust.
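The Longest Common Subsequence (LCS) measure named in the Sumatra entry can be sketched in a few lines of Python using standard dynamic programming. This is not Sumatra's implementation. In particular, normalising the LCS length by the length of the longer sequence is an assumption made for this example, as are the function names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one character
        prev = curr
    return prev[-1]

def lcs_similarity(a, b):
    """LCS length divided by the longer sequence length (assumed normalisation)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

print(lcs_length("ACGT", "ACT"))      # common subsequence "ACT"
print(lcs_similarity("ACGT", "ACT"))
```

The DP keeps only two rows in memory, but its running time still grows with the product of the two sequence lengths.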
T-CoffeeT-Coffee is a multiple sequence alignment package. You can use T-Coffee to align sequences or to combine the output of your favorite alignment methods (Clustal, Mafft, Probcons, Muscle...) into one unique alignment.
tabixTAB-delimited file IndeXer. Useful for VCFtools.
TagCleanerThe TagCleaner tool (standalone version) can be used to automatically detect and efficiently remove tag sequences (e.g. WTA tags) from genomic and metagenomic datasets. It is easily configurable.
TagDustTagDust is a program to eliminate artifactual reads from next-generation sequencing data sets.
Tandem Repeats FinderTandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides.
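To make the definition of a tandem repeat concrete, here is a small Python sketch that finds exact tandem repeats with a regular-expression backreference. Tandem Repeats Finder itself also detects approximate copies, which this sketch does not attempt. The function name, the unit-length range, and the test sequence are assumptions chosen for illustration.

```python
import re

def exact_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=2):
    """Return (start, unit, copies) for exact, adjacent repeats of a unit.

    Only perfect copies are reported; approximate repeats are ignored.
    """
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # A unit of unit_len bases followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, copies))
    return hits

# Hypothetical sequence containing three adjacent copies of "ACG".
print(exact_tandem_repeats("TTACGACGACGTT"))
```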
TASSELTrait Analysis by aSSociation, Evolution and Linkage. TASSEL has multiple functions, including association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation and data visualization for large sets of data.
TFM_PvalueTFM-Pvalue is a software suite providing tools for computing the score threshold associated to a given P-value and the P-value associated to a given score threshold. It uses Position Weight Matrices, such as those available in the Transfac or Jaspar databases.
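The relationship between a score threshold and a P-value for a Position Weight Matrix can be sketched as follows. This is a simple dynamic programme over integer scores with independent background frequencies. It is not the algorithm used by TFM-Pvalue, and the toy matrix, the uniform background, and the function names are assumptions.

```python
from collections import defaultdict

def score_distribution(pwm, background):
    """Probability of each total score for a random sequence.

    pwm is a list of columns, each mapping a base to an integer score.
    background maps each base to its probability.
    """
    dist = {0: 1.0}
    for column in pwm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for base, s in column.items():
                nxt[score + s] += prob * background[base]
        dist = dict(nxt)
    return dist

def p_value(pwm, background, threshold):
    """Probability that a random sequence scores at least threshold."""
    dist = score_distribution(pwm, background)
    return sum(p for s, p in dist.items() if s >= threshold)

def score_threshold(pwm, background, target_p):
    """Lowest attainable score whose P-value does not exceed target_p, or None."""
    dist = score_distribution(pwm, background)
    tail = 0.0
    best = None
    for s in sorted(dist, reverse=True):
        tail += dist[s]
        if tail > target_p:
            break
        best = s
    return best

# Hypothetical two-column matrix with integer scores and a uniform background.
pwm = [{"A": 2, "C": 0, "G": 0, "T": -1}, {"A": 1, "C": 1, "G": -1, "T": -1}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(p_value(pwm, background, 1))
print(score_threshold(pwm, background, 0.125))
```

Real matrices usually have real-valued scores, so a sketch like this would first need them rounded to a chosen precision.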
TGICLThis package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a slightly modified version of NCBI's megablast , and the resulting clusters are then assembled using CAP3 assembly program. TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as produced by CAP3.
TopHatTopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
trans-ABYSSTrans-ABySS is a software pipeline for analyzing ABySS-assembled contigs from shotgun transcriptome data. The pipeline accepts assemblies that were generated across a wide range of k values in order to address variable transcript expression levels. It first filters and merges the multi-k assemblies, generating a much smaller set of nonredundant contigs. It contains scripts that map assembled contigs to known transcripts, currently supporting the Blat contig-to-genome aligner. It identifies novel splicing events like exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. Its scripts can also identify candidate gene-fusions, single-nucleotide variants, insertions, deletions, and inversions.
treemixTreeMix is a method for inferring the patterns of population splits and mixtures in the history of a set of populations. In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph.
TRFTandem Repeats Finder 4.04 for 64 bit Linux
Trim GaloreA wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries.
tRNAscan-SESearch for tRNA genes in genomic sequence.
UNAFoldSoftware for nucleic acid folding and hybridization. The UNAFold software package is an integrated collection of programs that simulate folding, hybridization, and melting pathways for one or two single-stranded nucleic acid sequences.
USEARCHHigh-throughput biological sequence analysis. It is distributed as a single binary program that implements a suite of algorithms comparable to BLASTN, BLASTP, BLASTX, BLASTCLUST, CD-HIT, CD-HIT-EST, CD-HIT-2D, CD-HIT-EST-2D, CD-HIT-OTU, CD-HIT-454, ChimeraSlayer, Perseus, RAPsearch and more. It supports a rich set of sequence matching options, including E-values, identity, coverage (fraction of query or target sequence covered by the alignment) and maximum gap length, and a range of output file formats including FASTA, BLAST-like, user-defined tabbed text and a native format designed for clustering applications. Supported alignment styles include local (gapped and ungapped), like BLAST, and global, which is most often used in clustering applications. User-settable parameters allow tuning of substitution scores, gap penalties and Karlin-Altschul statistics.
VCAKEVCAKE is a genetic sequence assembler capable of assembling millions of small nucleotide reads even in the presence of sequencing error. This software is currently geared towards de novo assembly of Illumina's Solexa Sequencing data.
vcftoolsVCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide methods for working with VCF files: validating, merging, comparing and calculating some basic population genetic statistics.
velvetVelvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.
Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
viennaRNAVienna RNA package allows RNA Secondary Structure Prediction and Comparison.
wgswgs (Celera Assembler) is a de novo whole-genome shotgun (WGS) DNA sequence assembler.
Wise2Wise2 is a package focused on comparisons of biopolymers, commonly DNA sequence and protein sequence. These are the programs which you might use for this:
genewise: a single protein vs a single genomic dna sequence
genewisedb: a database of proteins vs a database of genomic dna sequences.
estwise: a single protein vs a single EST/cDNA sequence.
estwisedb: a database of proteins vs a database of EST/cDNA sequences.
wu-blastSimilarity search against databanks, Washington University Blast.
YassYASS is a genomic similarity search tool, for nucleic (DNA/RNA) sequences in fasta or plain text format (it produces local pairwise alignments).
YinOYangyinOyang produces neural network predictions for O-(beta)-GlcNAc attachment sites in eukaryotic protein sequences. It can also use netphos, to mark possible phosphorylated sites and hence identify the "Yin-Yang" sites i.e. the sites that may be modified reversibly and dynamically by O-GlcNAc or phosphate groups at different times in the cell.