Here is the list of available command-line software. Most of it is available on GenoToul under this path: /usr/local/bioinfo/src.
abyssABySS (Assembly By Short Sequences) is a de novo, parallel, paired-end sequence assembler that is designed for short reads.
AdapterRemovalThis program was developed to remove residual adapter sequences from next generation sequencing reads. The program handles both single end and paired end data.
AdmixtoolsADMIXTOOLS (Patterson et al. 2012) is a software package that supports formal tests of whether admixture occurred, and makes it possible to infer admixture proportions and dates.
AdmixtureADMIXTURE is a software tool for maximum likelihood estimation of individual ancestries from multilocus SNP genotype datasets. It uses the same statistical model as STRUCTURE but calculates estimates much more rapidly using a fast numerical optimization algorithm.
ALDERThe ALDER software computes the weighted linkage disequilibrium (LD) statistic for making inference about population admixture
alignAceAlignACE (Aligns Nucleic Acid Conserved Elements) is a program which finds sequence elements conserved in a set of DNA sequences.
AllPaths-LGALLPATHS-LG is a whole genome shotgun assembler that can generate high quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers. The significant difference between ALLPATHS and traditional assemblers such as Arachne is that ALLPATHS assemblies are not necessarily linear, but instead are presented in the form of a graph. This graph representation retains ambiguities, such as those arising from polymorphism, uncorrected read errors, and unresolved repeats, thereby providing information that has been absent from previous genome assemblies.
amosA Modular, Open-Source whole genome assembler.
AmpliconnoiseAmpliconNoise is a collection of programs for the removal of noise from 454 sequenced PCR amplicons. It involves two steps: the removal of noise from the sequencing itself and the removal of PCR point errors. This project also includes the Perseus algorithm for chimera removal.
ANGSDANGSD is a software for analyzing next generation sequencing data. The software can handle a number of different input types from mapped reads to imputed genotype probabilities.
apolloApollo allows researchers to explore genomic annotations at many levels of detail, and to perform expert annotation curation, all in a graphical environment.
AragornARAGORN is a program to detect tRNA genes and tmRNA genes in nucleotide sequence
ArtemisArtemis is a free genome browser and annotation tool that allows visualisation of sequence features, next generation data and the results of analyses within the context of the sequence, and also its six-frame translation.
asm2aceConvert the "Celera ASM" assembly format to the "Consed ACE" or the "CAF" file format.
ATLASATLAS (Automatically Tuned Linear Algebra Software) provides highly optimized Linear Algebra kernels for arbitrary cache-based architectures. ATLAS provides ANSI C and Fortran77 interfaces for the entire BLAS API, and a small portion of the LAPACK AP
Atlas2Atlas2 is a next-generation sequencing suite of variant analysis tools specializing in the separation of true SNPs and insertions and deletions (indels) from sequencing and mapping errors in Whole Exome Capture Sequencing (WECS) data.
AugustusAUGUSTUS is a program that predicts genes in eukaryotic genomic sequences
bam2fastqExtract sequences from a BAM file in fastq format.
bambusScaffolding tool: ordering and orienting contigs by incorporating additional information about their relative placement along the genome.
BAMTOOLSBamTools provides both a programmer's API and an end-user's toolkit for handling BAM files.
bamUTILbamUtil is a repository that contains several programs that perform operations on SAM/BAM files. All of these programs are built into a single executable, bam.
banyanThis package provides sorted drop-in versions of Python's set and dict (with optional augmentation).
BarrnapBarrnap predicts the location of ribosomal RNA genes in genomes. It supports bacteria (5S,23S,16S), archaea (5S,5.8S,23S,16S), mitochondria (12S,16S) and eukaryotes (5S,5.8S,28S,18S).
BAYESCANDetecting natural selection from population-based genetic data using differences in allele frequencies between populations.
bcl2fastqThe Bcl2FastQ conversion software is a new tool to handle bcl conversion and demultiplexing of both unzipped and zipped bcl files, which have reduced footprint and were introduced as an optional output of the HCS Software version 2.0
BeagleBEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples.
bedtoolsThe BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage.
Bio++Bio++ is a set of C++ libraries for Bioinformatics, including sequence analysis, phylogenetics, molecular evolution and population genetics. Bio++ is fully Object Oriented and is designed to be both easy to use and computer efficient.
BIOM-FORMATThe Biological Observation Matrix format; There are two components to the BIOM project: first is definition of the BIOM format, and second is development of support objects in multiple programming languages to support the use of BIOM in diverse bioinformatics applications. The version of the BIOM file format is independent of the version of the biom-format software.
BIOPYTHONBiopython is a set of freely available tools for biological computation written in Python by an international team of developers.
biotoolboxThis tool box is a collection of various library modules and programs for processing, converting, analyzing, and manipulating genomic data and/or features. They are written in Perl, and rely on BioPerl and GMOD related modules for working with a wide variety of modern file formats and databases.
Bis-SNPBisSNP is a package based on the Genome Analysis Toolkit (GATK) map-reduce framework for genotyping and accurate DNA methylation calling in bisulfite treated massively parallel sequencing (Bisulfite-seq, NOMe-seq, RRBS and any other bisulfite treated sequencing) with Illumina directional library protocol.
BismarkA tool to map bisulfite converted sequence reads and determine cytosine methylation states
BlastrBlastR is a new method for searching Non-Coding RNAs in databases
blatThe BLAST-Like Alignment Tool: similarity search in databanks. BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 25 bases or more. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more.
BlueBlue is a fast, accurate short-read error-correction tool based on k-mer consensus and context. Blue will correct both Illumina and 454-like data, and accepts sequence data files in both FASTQ and FASTA formats.
bowtieBowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).
Bowtie2Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
BRATBRAT is an accurate and efficient tool for mapping short bisulfite-treated reads obtained from the Solexa-Illumina Genome Analyzer. BRAT supports single-end and paired-end short read mapping and allows alignment of reads/mates of different lengths. BRAT-BW is a fast, accurate and memory-efficient tool that maps bisulfite-treated short reads (BS-seq) to a reference genome using the FM-index (Burrows-Wheeler transform). The package includes tools to trim low-quality read ends and to report A, C, G, T counts at each base for forward and reverse strands of references.
breakdancerbreakdancer_max and bam2cfg.pl are available. BreakDancerMax predicts five types of structural variants: insertions, deletions, inversions, inter- and intra-chromosomal translocations from next-generation short paired-end sequencing reads using read pairs that are mapped with unexpected separation distances or orientation.
BridgerBridger is an efficient de novo transcriptome assembler for RNA-Seq data. It expects as input RNA-Seq reads (single or paired) in fasta or fastq format, outputs all transcripts in fasta format, without using a reference genome.
BSMAPBSMAP is short-read mapping software for bisulfite sequencing reads. Bisulfite treatment converts unmethylated cytosines into uracils (sequenced as thymine) and leaves methylated cytosines unchanged, and hence provides a way to study DNA cytosine methylation at single-nucleotide resolution. BSMAP aligns the Ts in the reads to both Cs and Ts in the reference.
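The bisulfite logic in the BSMAP entry can be illustrated with a minimal Python sketch. This is not BSMAP's implementation: the function names `bisulfite_convert` and `read_matches_reference` are hypothetical, only the forward strand is handled, and no mismatches or gaps are allowed.

```python
def bisulfite_convert(sequence, methylated_positions=frozenset()):
    """Simulate bisulfite treatment as seen after sequencing.

    Unmethylated C is read as T; C at a methylated position stays C.
    (Hypothetical helper for illustration only.)
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )


def read_matches_reference(read, reference_window):
    """Return True if the read is compatible with the reference window.

    A T in the read may pair with either C or T in the reference;
    every other base must match exactly.
    """
    if len(read) != len(reference_window):
        return False
    for read_base, ref_base in zip(read, reference_window):
        if read_base == ref_base:
            continue
        if read_base == "T" and ref_base == "C":
            continue
        return False
    return True


reference = "ACGTC"
# Position 1 (the first C) is methylated, so only the last C is converted.
read = bisulfite_convert(reference, methylated_positions={1})
compatible = read_matches_reference(read, reference)
```

The asymmetry is the point of the sketch: a T in the read is accepted against a C in the reference, but a C in the read is not accepted against a T in the reference.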
bwaBurrows-Wheeler Aligner (BWA) is an efficient program that aligns relatively short nucleotide sequences against a long reference sequence such as the human genome. It implements two algorithms, bwa-short and BWA-SW. The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp. Both algorithms do gapped alignment. They are usually more accurate and faster on queries with low error rates.
cap3CAP3 is a sequence assembly program for small-scale assembly of EST sequences with or without quality values.
carnacCarnac is a software tool for analysing the hypothetical secondary structure of a family of homologous RNA.
carthagenCarthaGène is a genetic/radiated hybrid mapping software. CarthaGène looks for multiple populations maximum likelihood consensus maps using a fast EM algorithm for maximum likelihood estimation and powerful ordering algorithms. CarthaGène can handle data made up of several distinct populations which may each be either F2 backcross, recombinant inbred lines, F2 intercross, phase known outbreds and/or radiated hybrids (haploid and diploid data).
CASAVAIllumina's Consensus Assessment of Sequence and Variation (CASAVA) software captures summary information for resequencing and counting studies and places the data in a compact structure for visualization within GenomeStudio Software or publicly available analysis tools. CASAVA can create genomic builds, call SNPs, detect indels, and count reads from data generated from one or more runs of the Genome Analyzer across a broad range of sequencing applications.
CD-HITCD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a fasta format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition cd-hit outputs a cluster file, documenting the sequence 'groupies' for each nr sequence representative.
CD-HIT-OTUCD-HIT-OTU identifies OTUs in pyrosequencing-based 16S ribosomal RNA metagenomic surveys and reduces spurious OTUs.
CEGMACEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for building a highly reliable set of gene annotations in virtually any eukaryotic genome.
CheetahCheetah is an open source template engine and code generation tool, written in Python.
chimerascanchimerascan is a software package that detects gene fusions in paired-end RNA sequencing (RNA-Seq) datasets.
ChIPMunkChIPMunk is a fast heuristic DNA motif digger based on greedy approach accompanied by bootstrapping. ChIPMunk identifies the strong motif with the maximum Discrete Information Content in a set of DNA sequences. ChIPMunk uses (extended) multifasta as the input format and supports IUPAC DNA letters in the input sequence
CISAIntegrates the assemblies into a hybrid set of contigs, resulting in assemblies of superior contiguity and accuracy, compared with the assemblies generated by the state-of-the-art assemblers and the hybrid assemblies merged by existing tools.
clearcutClearcut is the reference implementation for the Relaxed Neighbor Joining (RNJ) algorithm by J. Evans, L. Sheneman, and J. Foster from the Initiative for Bioinformatics and Evolutionary Studies (IBEST) at the University of
Clustal OmegaClustal Omega is the latest addition to the Clustal family. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. It will also make use of multiple processors, where present. In addition, the quality of alignments is superior to previous versions, as measured by a range of popular benchmarks
clustalwMultiple sequence alignment program for DNA or proteins.
clviewThis is a graphical, interactive tool for inspecting the ACE format assembly files generated by CAP3 or phrap.
CNCICNCI (Coding-Non-Coding Index) is a signature tool that profiles adjoining nucleotide triplets to effectively distinguish protein-coding and non-coding sequences independent of known annotations.
CNV-seqCNV-seq, a new method to detect copy number variation using high-throughput sequencing
CNVnatorA tool for CNV discovery and genotyping from depth of read mapping.
COMMETCOMMET ("COmpare Multiple METagenomes") provides a global similarity overview between all datasets of a large metagenomic project.
consedConsed allows users to visualise, edit and finish sequences assembled with phrap. Consed is compatible with Newbler, Cross_match, Phrap and PCAP output.
Control-FREECControl-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data developed by the Bioinformatics Laboratory of Institut Curie (Paris).
CPATA program to discriminate coding and noncoding genes using a logistic regression model based on 4 selected features.
CPCCPC (Coding Potential Calculator) distinguishes protein-coding from non-coding RNAs based on the sequence features of the input transcripts.
CRACAn integrated tool for RNA-Seq read analysis.
CufflinksCufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.
CUSHAW2CUSHAW2 (the second distribution of CUSHAW software package for next-generation sequencing read alignment) is a fast and parallel gapped read aligner for large genomes, such as the human genome.
CUSHAW3CUSHAW3 (the third distribution of CUSHAW software package for next-generation sequencing read alignment) is an open-source parallelized, sensitive and accurate short-read aligner.
cutadaptCutadapt removes adapter sequences from DNA high-throughput sequencing data. This is usually necessary when the read length of the machine is longer than the molecule that is sequenced, such as in
dadidadi implements a method for demographic inference from genetic data, based on a diffusion approximation to the allele frequency spectrum.
DALIGNERThe commands below permit one to find all significant local alignments between reads encoded in Dazzler database. The assumption is that the reads are from a PACBIO RS II long read sequencer.
DAZZ_DBTo facilitate the multiple phases of the dazzler assembler, we organize all the read data into what is effectively a "database" of the reads and their meta-information.
DeFuseDeFuse is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion.
DellyDELLY is an integrated structural variant prediction method that can detect deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read massively parallel sequencing data. It uses paired-ends and split-reads to sensitively and accurately delineate genomic rearrangements throughout the genome.
DisEMBLDisEMBL is a computational tool for prediction of disordered/unstructured regions within a protein sequence. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects.
e-PCRe-PCR identifies sequence tagged sites (STSs) within DNA sequences. Using e-PCR, you can search for sub-sequences that closely match the PCR primers and have the correct order, orientation, and spacing.
ea-utilsCommand-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc.
ecoPCRecoPCR is electronic PCR software developed by LECA and Helix-Project. It helps you estimate the quality of barcode primers. In conjunction with OBItools, you can postprocess ecoPCR output to compute barcode coverage and barcode specificity.
ecoPrimersecoPrimers is barcoding software written in C. It finds universal primers from a set of input DNA sequences by finding conserved regions without a priori assumptions about candidate sequences. It also evaluates the quality of the primers and barcode regions by measuring the "barcode specificity" and "barcode coverage" indices.
edenaDe novo short reads assembler.
EggLibEggLib is a C++/Python library and program package for evolutionary genetics and genomics.
EigensoftThe EIGENSOFT package combines functionality from our population genetics methods (Patterson et al. 2006) and our EIGENSTRAT stratification correction method (Price et al. 2006).
embossThe suite includes programs, tools and sequence databases that can cover all the basic needs in the field of analysis and exploitation of biological sequences.
Ensembl-apiEnsembl uses MySQL relational databases to store its information. A comprehensive set of Application Programme Interfaces (APIs) serves as a middle layer between underlying database schemes and more specific application programmes. The APIs aim to encapsulate the database layout by providing efficient high-level access to data tables and isolate applications from data layout changes. Ensembl's API is written in Perl.
EspritAn Algorithm for estimating species richness using large collections of 16S rRNA pyrosequences.
ESTScanESTScan is a program that can detect coding regions in DNA/RNA sequences, even if they are of low quality (e.g. EST sequences). ESTScan will also detect and correct sequencing errors that lead to frameshifts. ESTScan is not a gene prediction program , nor is it an open reading frame detector. In fact, its strength lies in the fact that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but will detect coding regions with high selectivity and sensitivity.
eulerEuler is a new approach to fragment assembly that abandons the classical "overlap - layout - consensus" paradigm that is used in all currently available assembly tools.
euler-SREULER-SR is a program for de novo assembly of reads (from Roche 454 Life Sciences or Illumina/Solexa).
EvalEval is a flexible tool for analyzing the performance of gene-structure prediction programs. It provides summaries and graphical distributions for many statistics describing any set of annotations, regardless of their source. It also compares sets of predictions to standard annotations and to one another
ExaMLExascale Maximum Likelihood (ExaML) code for phylogenetic inference using MPI.
exonerateA generic tool for sequence alignment.
FALCONFalcon: a set of tools for fast aligning long reads for consensus and assembly
fastaFASTA is a sequence similarity search tool which uses heuristics for fast local alignment searching.
fastq-toolsThis package provides a number of small and efficient programs to perform common tasks with high throughput sequencing data in the FASTQ format. All of the programs work with typical FASTQ files as well as gzipped FASTQ files.
FastQCA Quality Control application for FastQ files. FastQC is an application which takes a FastQ file and runs a series of tests on it to generate a comprehensive QC report.
fastsimcoal2Fast sequential Markov coalescent simulation of genomic data under complex evolutionary models
fastStructurefastStructure is an algorithm for inferring population structure from large SNP genotype data. It is based on a variational Bayesian framework for posterior inference and is written in Python 2.x.
FastUniqAn ultrafast de novo duplicates removal tool for paired short DNA sequences
FASTX-ToolkitThe FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing.
FCPThe fragment classification package (FCP) provides some classifiers for assigning a taxonomic attribution to metagenomic fragments or assembled scaffolds.
FigTreeFigTree is designed as a graphical viewer of phylogenetic trees and as a program for producing publication-ready figures.
FindPeaksFindPeaks performs two functions: 1) analysis of short-read sequencing (Solexa/Illumina) experiments to identify areas of enrichment 2) generating wig files for use with the UCSC browser.
fineSTRUCTUREfineSTRUCTURE is a fast and powerful algorithm for identifying population structure using dense sequencing data.
FLASHFLASH, Fast Length Adjustment of SHort reads, is a very accurate fast tool to merge paired-end reads from fragments that are shorter than twice the length of reads. The extended length of reads has a significant positive impact on improvement of genome assemblies.
FlexbarFlexbar preprocesses high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases read mapping rates and improves genome and transcriptome assemblies. It supports next-generation sequencing data in fasta/q and csfasta/q format from Illumina, Roche 454, and the SOLiD platform.
Flux SimulatorThe Flux Simulator aims at modeling RNA-Seq experiments in silico: sequencing reads are produced from a reference genome according to annotated transcripts.
FrameDPSensitive peptide detection on noisy matured sequences. Available with command line interface on the cluster.
FreeBayesFreeBayes is a population-based short polymorphism detector which features support for the simultaneous detection of SNPs, INDELs, and multi-base mismatches, poly-allelic sites, polyploidy, and sample and region-specific copy number modeling. FreeBayes works with standard file formats (BAM and VCF) and can easily be integrated into existing next-generation sequencing pipelines.
GAASGAAS (Genome relative Abundance and Average Size) is a bioinformatic tool to calculate accurate community composition and average genome size in metagenomes by using BLAST, advanced parsing of hits and correction of genome length bias.
gam-ngsGenomic Assemblies Merger for NGS
The GapCloser is designed to close the gaps emerging during the scaffolding process by SOAPdenovo, using the abundant pair relationships of short reads.
gassstGASSST (Global Alignment Short Sequence Search Tool) finds global alignments of short DNA sequences against large DNA banks. GASSST's strong point is its ability to perform fast gapped alignments. It works well for both short and longer reads. It has currently been tested for reads up to 500bp.
GASTGlobal Alignment for Sequence Taxonomy. Uses a reference database of SSU rRNA sequences to determine the taxonomy of hypervariable region tags.
GATKThe GATK is, first, a structured software library that makes writing efficient analysis tools using next-generation sequencing data very easy, and second, a suite of tools for working with human medical resequencing projects such as 1000 Genomes and The Cancer Genome Atlas. These tools include depth of coverage analyzers, a quality score recalibrator, a SNP/indel caller and a local realigner.
GblocksGblocks is a computer program written in ANSI C language that eliminates poorly aligned positions and divergent regions of an alignment of DNA or protein sequences. These positions may not be homologous or may have been saturated by multiple substitutions and it is convenient to eliminate them prior to phylogenetic analysis. Gblocks selects blocks in a similar way as it is usually done by hand but following a reproducible set of conditions. The selected blocks must fulfill certain requirements with respect to the lack of large segments of contiguous nonconserved positions, lack or low density of gap positions and high conservation of flanking positions, making the final alignment more suitable for phylogenetic analysis. Gblocks outputs several files to visualize the selected blocks. The use of a program such as Gblocks reduces the necessity of manually editing multiple alignments, makes the automation of phylogenetic analysis of large data sets feasible and, finally, facilitates the reproduction of the alignments and subsequent phylogenetic analysis by other researchers.
GEMThe GEM library strives to be a true "next-generation" tool for handling any kind of sequence data, offering state-of-the-art algorithms and data structures specifically tailored to this demanding task.
GEMMAGEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS).
GeneMarkE.hmmGeneMark.hmm eukaryotic: gene prediction in eukaryotes. Executable name: gmhmme3.
GeneScissorsGeneScissors exploits machine learning techniques guided by biological knowledge to detect and correct spurious transcriptome inference by existing RNA-seq analysis methods.
genomeToolsCollection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named "gt".
GeocoderShort description: Species locality data + polygons -> nexus file. Longer description: geocoder.py is a program written in Python that takes one file containing polygons, and one file with species locality data as input. The program then tests whether a species has been recorded inside any of the polygons. The result is presented as a nexus file with "0" indicating absence, and "1" indicating presence in a polygon.
germlineGERMLINE is an algorithm for discovering long shared segments of Identity by Descent (IBD) between pairs of individuals in a large population. It takes as input genotype or haplotype marker data for individuals (as well as an optional known pedigree) and generates a list of all pairwise segmental sharing.
glintComplete genome alignment tool
GMAP / GSNAPGMAP: A Genomic Mapping and Alignment Program for mRNA and EST Sequences, and GSNAP: Genomic Short-read Nucleotide Alignment Program.
GRASSA generic algorithm for scaffolding next-generation sequencing assemblies.
GrinderGrinder is a versatile open-source bioinformatic tool to create simulated omic shotgun and amplicon sequence libraries for all main sequencing platforms.
GRITGRIT is a genome-guided transcript assembly tool designed and implemented by Nathan Boley, in collaboration with the Bickel and Celniker groups.
gtf_to_genesgtf_to_genes is a Python parser which reads all the genes and transcripts from a GTF file and caches the data in Python classes for high-speed access.
gthGenomeThreader is a software tool to compute gene structure predictions. The gene structure predictions are calculated using a similarity-based approach where additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments. GenomeThreader is available free of charge only for non-commercial research institutions.
HaploviewHaploview is designed to simplify and expedite the process of haplotype analysis by providing a common interface to several tasks relating to such analyses.
Hiplex-primerA tool for generating primers for the Hi-Plex targeted, multiplexed DNA sequencing strategy.
hmmerHMMER is a package used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
HomerHOMER (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis. It is a collection of command line programs for unix-style operating systems written in Perl and C++.
HPG-VariantThe HPG Variant suite is a project aimed to provide a complete suite of tools to work with genomic variation data, from VCF tools to variant profiling or genomic statistics.
HTSeqHTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
HyPhyHyPhy is an open-source software package for the analysis of genetic sequences using techniques in phylogenetics, molecular evolution, and machine learning. It features a complete graphical user interface (GUI) and a rich scripting language for limitless customization of analyses.
idbaIDBA-UD is an iterative De Bruijn Graph De Novo Assembler for Short Reads Sequencing data with Highly Uneven Sequencing Depth. It is an extension of the IDBA algorithm.
IGVThe Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
IMR/DENOMPackage for assembling genomes from Illumina short read sequence data.
InfernalInfernal ("INFERence of RNA ALignment") is for searching DNA sequence databases for RNA structure and sequence similarities. It is an implementation of a special case of profile stochastic context-free grammars called covariance models (CMs).
IntaRNAIntaRNA predicts interactions between two RNA molecules, e.g. a non-coding RNA (ncRNA) and an mRNA. The scoring is based on hybridization free energy and accessibility of the interaction sites in both molecules.
iPSORTiPSORT is a subcellular localization site predictor for N-terminal sorting signals. Given a protein sequence, it will predict whether it contains a Signal Peptide (SP), Mitochondrial Targeting Peptide (mTP), or Chloroplast Transit Peptide (cTP).
ITSxImproved software detection and extraction of ITS1 and ITS2 from ribosomal ITS sequences of fungi and other eukaryotes for use in environmental sequencing
JellyfishJELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism. JELLYFISH is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences. It outputs its k-mer counts in a binary format, which can be translated into a human-readable text format using the "jellyfish dump" command.
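To make the notion of k-mer counting concrete, here is a minimal Python sketch. It only illustrates the definition given in the entry above (every substring of length k is counted); it is not how JELLYFISH works internally, it uses an ordinary in-memory hash table rather than a compact, lock-free encoding, and the function name `count_kmers` is a hypothetical helper.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every substring of length k in a DNA sequence.

    Naive single-threaded illustration; memory use grows with the
    number of distinct k-mers.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    counts = Counter()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts


kmer_counts = count_kmers("ACGTACGT", 3)
for kmer, count in sorted(kmer_counts.items()):
    print(kmer, count)
```

For the 8-base example sequence and k = 3 there are six overlapping 3-mers in total, two of which (ACG and CGT) occur twice.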
jModeltestjModelTest is a tool to carry out statistical selection of best-fit models of nucleotide substitution.
KATThe K-mer Analysis Toolkit (KAT) contains a number of tools that analyse and compare K-mer spectra.
KisSpliceKisSplice is software for analysing RNA-seq data with or without a reference genome. It is an exact local transcriptome assembler that identifies SNPs, indels and alternative splicing events. It can deal with an arbitrary number of biological conditions, and will quantify each variant in each condition.
KlastKLAST is a fast, accurate and NGS scalable bank-to-bank sequence similarity search tool providing significant accelerations of seeds-based heuristic comparison methods, such as the Blast suite of algorithms.
KmerGenieKmerGenie estimates the best k-mer length for genome de novo assembly.
KronaKrona allows hierarchical data to be explored with zoomable pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats.
LAGANThe Lagan Toolkit is a set of alignment programs for comparative genomics. The three main components are a pairwise aligner (LAGAN), a multiple aligner (M-LAGAN), and a glocal aligner (Shuffle-LAGAN). All three are based on the CHAOS local alignment tool and combine speed (regions up to several megabases can be aligned in minutes) with high accuracy.
LamarcLAMARC is a program which estimates population-genetic parameters such as population size, population growth rate, recombination rate, and migration rates.
LastLAST finds similar regions between sequences.
LDhatLDhat is a package written in the C and C++ languages for the analysis of recombination rates from population genetic data.
LDhelmetLDhelmet performs statistical inference for fine-scale variable recombination rate estimation.
LocusZoomLocusZoom is a tool to plot regional association results from genome-wide association scans or candidate gene studies.
LoRDECLoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.
LucyLUCY: a sequence cleanup program. The quality trimming portion of lucy makes use of phred quality scores, such as those produced by many automated sequencers based on the Sanger sequencing method. As such, lucy's quality trimming may not be appropriate for sequence data produced by some of the new "next-generation" sequencers.
lxmllxml is a Pythonic, mature binding for the libxml2 and libxslt libraries. It provides safe and convenient access to these libraries using the ElementTree API.
MACSWe present Model-based Analysis of ChIP-Seq (MACS) on short reads sequencers such as Genome Analyzer (Illumina / Solexa). MACS empirically models the length of the sequenced ChIP fragments, which tends to be shorter than sonication or library construction size estimates, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction. MACS compares favorably to existing ChIP-Seq peak-finding algorithms, is publicly available open source, and can be used for ChIP-Seq with or without control samples.
MAFFTMAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods, L-INS-i (accurate; for alignment of <~200 sequences), FFT-NS-2 (fast; for alignment of <~10,000 sequences), etc.
mapSpliceAccurate mapping of RNA-seq reads for splice junction discovery.
maqMaq is software that builds mapping assemblies from short reads generated by next-generation sequencing machines. It is particularly designed for the Illumina/Solexa 1G Genetic Analyzer, and has preliminary functions to handle ABI SOLiD data.
MaSuRCAMaSuRCA is whole genome assembly software. It combines the efficiency of the de Bruijn graph and Overlap-Layout-Consensus (OLC) approaches. MaSuRCA can assemble data sets containing only short reads from Illumina sequencing or a mixture of short reads and long reads (Sanger, 454)
MATSMATS is a computational tool to detect differential alternative splicing events from RNA-Seq data.
MauveMauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion. Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics. Aligning whole genomes is a fundamentally different problem than aligning short sequences.
MaximaMaxima is a system for the manipulation of symbolic and numerical expressions, including differentiation, integration, Taylor series, Laplace transforms, ordinary differential equations, systems of linear equations, polynomials, sets, lists, vectors, matrices and tensors.
MegadockA fft-based protein-protein docking system for all-to-all protein-protein interaction predictions.
MeganMEtaGenome ANalyzer: metagenomic data analysis, including taxonomic and functional (SEED and KEGG classification) analysis.
memeThe MEME Suite allows you to: (1) discover motifs using MEME or GLAM2 on groups of related DNA or protein sequences, (2) search sequence databases using motifs, (3) compare a motif to all motifs in a database of motifs, and (4) associate motifs with Gene Ontology terms via their putative target genes.
MetaPhlAnMetaPhlAn is a computational tool for profiling the composition of microbial communities from metagenomic shotgun sequencing data. MetaPhlAn relies on unique clade-specific marker genes identified from 3,000 reference genomes
MetaSimA Sequencing Simulator for Genomics and Metagenomics. The resulting data sets can be used as standardized test scenarios for planning sequencing projects or for benchmarking assembler and metagenomic software.
MetaVelvetAn extension of Velvet assembler to de novo metagenomic assembly
methylKitmethylKit is an R package for DNA methylation analysis and annotation from high-throughput bisulfite sequencing. The package is designed to deal with sequencing data from RRBS and its variants, but also target-capture methods such as Agilent SureSelect methyl-seq. In addition, methylKit can deal with base-pair resolution data for 5hmC obtained from Tab-seq or oxBS-seq. It can also handle whole-genome bisulfite sequencing data if proper input format is provided.
MIGRATE-NEstimation of population sizes and gene flow using the coalescent. Migrate estimates effective population sizes and past migration rates between n populations assuming a migration matrix model with asymmetric migration rates and different subpopulation sizes. Migrate uses maximum likelihood or Bayesian inference to jointly estimate all parameters.
MinCED is a program to find Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) in full genomes or environmental datasets such as metagenomes, in which sequence size can be anywhere from 100 to 800 bp. MinCED runs from the command-line and was derived from CRT
miraWhole genome shotgun and EST sequence assembler for Sanger, 454, and Solexa / Illumina.
miRandamiRanda is an algorithm for finding genomic targets for microRNAs. This algorithm has been written in C and is available as an open-source method under the GPL. MiRanda was developed at the Computational Biology Center of Memorial Sloan-Kettering Cancer Center
mirDeep-PmiRDeep-P, miRDP for short, is a computational tool for analyzing the microRNA (miRNA) transcriptome in plants.
mirDeep2miRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples. Last, a new module for preprocessing of raw Illumina sequencing data produces files for downstream analysis with the miRDeep2 or quantifier module.
miropeatsMiropeats discovers regions of sequence similarity amongst any set of DNA sequences and then presents this similarity information graphically.
miRParamiRPara is an SVM-based miRNA prediction tool.
MisoMISO (Mixture-of-Isoforms) is a probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data, and identifies differentially regulated isoforms or exons across samples. By modeling the generative process by which reads are produced from isoforms in RNA-Seq, the MISO model uses Bayesian inference to compute the probability that a read originated from a particular isoform.
Mitoprot IIMitoProt calculates the N-terminal protein region that can support a Mitochondrial Targeting Sequence and the cleavage site.
MobsterMobster is used to detect novel (non-reference) Mobile Element Insertion (MEI) events in BAM files and uses both a discordant read pair method and a split-read method.
MocatMOCAT is a package for analyzing metagenomics datasets. Currently MOCAT supports Illumina single- and paired-end reads in raw FastQ format. Using MOCAT you can, for example, generate taxonomic profiles of, and assemble, metagenomes.
ModelGeneratorModelGenerator is a free easy-to-use software program for carrying out model selection in phylogenetics, designed and written by Thomas Keane.
ModelGenerator is a model selection program that selects optimal amino acid and nucleotide substitution models from Fasta or Phylip alignments. ModelGenerator supports 56 nucleotide and 96 amino acid substitution models.
MOSAIKMOSAIK is a reference-guided assembler comprising two main modular programs.
MothurThe one-stop source for your computational microbial ecology needs. mothur offers the ability to go from raw sequences to the generation of visualization tools to describe alpha and beta diversity.
MrBayesMrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.
msA program for generating samples under neutral models.
msmcThis software implements MSMC, a method to infer population size and gene flow from multiple genome sequences
MSR-CAThe MSR-CA assembler combines the benefits of deBruijn graph and Overlap-Layout-Consensus assembly approaches. The strength of the deBruijn graph approach is in its ability to quickly create a graph representation of the genome assembly from the deep coverage short read data. However in most cases the graph is extremely complex and it is hard to find a way to recover the original genome sequence from simply traversing it. On the other hand, overlap-layout-consensus is better suited for longer reads with high coverage, and since it usually relies on overlaps of 40 bases or longer, it is better for resolving short repetitive structures.
MugsyMugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.
multalinMultiple sequence alignment with hierarchical clustering.
MultiRNAFoldThe MultiRNAFold package contains software for secondary structure prediction of one, two, or many interacting RNA or DNA molecules. It is composed of three pieces of software: SimFold, PairFold and MultiFold.
mummerMUMmer is a package for rapidly aligning entire genomes, whether in complete or draft form.
muscleMultiple sequence alignment (nucleic or proteic).
MuTectMuTect is a method developed at the Broad Institute for the reliable and accurate identification of somatic point mutations in next generation sequencing data of cancer genomes.
NAMDNAMD, recipient of a 2002 Gordon Bell Award and a 2012 Sidney Fernbach Award, is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems.
ncbi-blastSimilarity search against databanks.
ncPRO-seqncPRO-seq (Non-Coding RNA PROfiling in sRNA-seq) is a tool for annotation and profiling of ncRNAs using deep-sequencing data, developed by the Bioinformatics Laboratory of the Institut Curie. This comprehensive and flexible ncRNA analysis pipeline aims at interrogating and performing detailed analysis on small RNAs derived from annotated non-coding regions in miRBase, Rfam and RepeatMasker, and regions defined by users. The ncPRO-seq pipeline also has a module to identify regions significantly enriched with short reads that cannot be classified as known ncRNA families.
NetPhosNetPhos is a neural network-based method for predicting potential phosphorylation sites at serine, threonine or tyrosine residues in protein sequences.
New FugueNew Fugue is a program for estimation of haplotype frequencies and linkage disequilibrium coefficients in family data
newblerNewbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company.
NextGenMapNextGenMap is a flexible and fast read mapping program that is more than twice as fast as BWA while achieving a mapping sensitivity similar to Stampy.
ngsShoRTngsShoRT (Next Generation Sequencing Short Read Trimmer) is a comprehensive and flexible open-source software package written in Perl that implements novel algorithms developed by its authors and many other commonly used pre-processing algorithms from the literature. ngsShoRT algorithms are designed to pre-process Single Read (SR) or Paired-end (PE)/Mate-pair (MP) reads in FastQ format or Illumina's native QSEQ format (with compressed file support). It provides parallel processing by multi-threading to deal with large volumes of data and reduce running time.
ngsToolsngsTools is a collection of programs for population genetics analyses from NGS data, taking into account its statistical uncertainty. The methods implemented in these programs do not rely on SNP or genotype calling, and are particularly suitable for low sequencing depth data.
NIKSNIKS (Needle in a K-stack) - detection of mutations in NGS data.
nSLnSL is a program for efficiently computing the nSL statistic described in Ferrer-Admetlla et al. 2014.
numpyNumPy is a package needed for scientific computing with Python.
oasesOases is a de novo transcriptome assembler designed to produce transcripts from short read sequencing technologies, such as Illumina, SOLiD, or 454 in the absence of any genomic assembly. It was developed by Marcel Schulz (MPI for Molecular Genomics) and Daniel Zerbino (previously at the European Bioinformatics Institute (EMBL-EBI), now at UC Santa Cruz).
Oases loads a preliminary assembly produced by Velvet, and clusters the contigs into small groups, called loci. It then exploits the paired-end read and long read information, when available, to construct transcript isoforms.
OSLayA new tool OSLay that uses synteny between matching sequences in a target assembly and a reference assembly to layout the contigs (or scaffolds) in the target assembly. The tool provides an interactive visualization of the computed layout and the result can be imported into the assembly editing tool Consed to support the design of primer pairs for gap-closure.
PAL2NALPAL2NAL is a program that converts a multiple sequence alignment of proteins and the corresponding DNA (or mRNA) sequences into a codon alignment. The program automatically assigns the corresponding codon sequence even if the input DNA sequence has mismatches with the input protein sequence, or contains UTRs, polyA tails. It can also deal with frame shifts in the input alignment, which is suitable for the analysis of pseudogenes. The resulting codon alignment can further be subjected to the calculation of synonymous (dS) and non-synonymous (dN) substitution rates.
paleomixThe PALEOMIX pipeline is a set of free and open-source pipelines and tools designed to enable the rapid processing of Next Generation Sequencing (NGS) data, starting from de-multiplexed reads from one or more samples, through sequence processing and alignment, and ending with genotyping, phylogenetic inference on the samples, as well as metagenomic analysis of the samples.
PAMLPAML is a package of programs for phylogenetic analyses of DNA or protein sequences using maximum likelihood.
ParsnpParsnp was designed to align the core genome of hundreds to thousands of bacterial genomes within a few minutes to few hours. Input can be both draft assemblies and finished genomes, and output includes variant (SNP) calls, core genome phylogeny and multi-alignments.
PartitionFinderPartitionFinder is free open source software to select best-fit partitioning schemes and models of molecular evolution for phylogenetic analyses.
PASAPASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.
patscanPatScan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA etc.) sequence archives for instances of a pattern which you input.
pbcoreThe pbcore package provides Python APIs for interacting with PacBio data files and writing bioinformatics applications.
PBJellyPBJelly is a highly automated pipeline that aligns long sequencing reads (such as PacBio RS reads or long 454 reads in fasta format) to high-confidence draft assemblies.
PCAdmixPCAdmix is a method that estimates local ancestry via principal components analysis (PCA) using phased haplotypes. The method considers data chromosome by chromosome.
PennCNVCopy Number Variation (CNV) detection from SNP genotyping arrays. PennCNV implements a hidden Markov model (HMM) that integrates multiple sources of information to infer CNV calls for individual genotyped samples. It differs from segmentation-based algorithms in that it considers SNP allelic ratio distribution as well as other factors, in addition to signal intensity alone. In addition, PennCNV can optionally utilize family information to generate family-based CNV calls by several different algorithms. Furthermore, PennCNV can generate CNV calls given a specific set of candidate CNV regions, through a validation-calling algorithm.
PerlPrimerPerlPrimer is a free, open-source GUI application written in Perl that designs primers for standard PCR, bisulphite PCR, real-time PCR (QPCR) and sequencing. It aims to automate and simplify the process of primer design.
pftoolsThe pftools package contains all the software necessary to build protein and DNA generalized profiles and use them to scan and align sequences, and search databases
PGDSpiderPGDSpider is a powerful automated data conversion tool for population genetic and genomics programs. It facilitates the data exchange possibilities between programs for a vast range of data types (e.g. DNA, RNA, NGS, microsatellite, SNP, RFLP, AFLP, multi-allelic data, allele frequency or genetic distances)
phred / phrapThe phred software reads DNA sequencing trace files, calls bases, and assigns a quality value to each called base. Phrap is a program for assembling shotgun DNA sequence data.
PhyloBayesPhyloBayes is a Bayesian Markov chain Monte Carlo (MCMC) sampler for phylogenetic reconstruction and molecular dating using protein and nucleic acid alignments.
PhyloCSFPhyloCSF is a method to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region.
PhyMLPhyML is a phylogeny software based on the maximum-likelihood principle.
picard-toolsPicard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
PICRUStPICRUSt (pronounced "pie crust") is a bioinformatics software package designed to predict metagenome functional content from marker gene (e.g., 16S rRNA) surveys and full genomes.
PindelPindel can detect breakpoints of large deletions, medium sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data. It uses a pattern growth approach to identify the breakpoints of these variants from paired-end short reads.
PLAST is a parallel alignment search tool for comparing large protein banks.
PLAST runs 3 to 5 times faster than the NCBI-BLAST software when processing large amounts of data.
PlatypusPlatypus is a tool designed for efficient and accurate variant-detection in high-throughput sequencing data.
PLINKPLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner.
polyphredAnalysis of data from the sequencer. PolyPhred is a program that compares fluorescence-based sequences across traces obtained from different individuals to identify heterozygous sites for single nucleotide substitutions.
popABCPopABC is a computer package to estimate historical demographic parameters of closely related species/populations (e.g. population size, migration rate, mutation rate, recombination rate, splitting events) within an isolation-with-migration model. The software performs coalescent simulation in the framework of approximate Bayesian computation (ABC, Beaumont et al, 2002). PopABC can also be used to perform Bayesian model choice to discriminate between different demographic scenarios.
PoPoolationA Toolbox for Population Genetic Analysis of Next Generation Sequencing Data from Pooled Individuals
poretoolsA toolkit for working with nanopore sequencing data from Oxford Nanopore.
Primer3Primer3 is a widely used program for designing PCR primers (PCR = "Polymerase Chain Reaction").
ProdigalProdigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee.
PROKKAProkka is a software tool for the rapid annotation of prokaryotic genomes. A typical 4 Mbp genome can be fully annotated in less than 10 minutes on a quad-core computer, and scales well to 32 core SMP systems. It produces GFF3, GBK and SQN files that are ready for editing in Sequin and ultimately submitted to GenBank/DDBJ/ENA.
proovreadPacBio hybrid error correction through iterative short read consensus
prot4ESTprot4EST is a perl script that takes expressed sequence tags (ESTs) and translates them optimally to produce putative peptides.
PROTTESTPROTTEST (ModelTest's relative) is a program for selecting the model of protein evolution that best fits a given set of sequences (alignment).
pycogentSequence-oriented Python library for biology.
PyicoteoPyicoteo is a suite of tools for the analysis of high-throughput sequencing data.
pypeFLOWpypeFLOW is a lightweight and reusable make / flow data processing library written in Python.
pysamPysam is a Python module for reading and manipulating SAM files. It's a lightweight wrapper of the samtools C-API.
QIIMEQIIME (pronounced "chime") stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms, but also supporting analysis of other types of data (such as shotgun metagenomic data). QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. QIIME has been applied to single studies based on billions of sequences from thousands of samples.
QRNAA prototype noncoding RNA genefinder, based on comparative genome sequence analysis.
QualiMapQualimap 2 is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data and its derivatives like feature counts.
QUASTQUAST evaluates genome assemblies by computing various metrics
quickdistCalculates a matrix of pairwise distances between sequences in a multiple sequence alignment.
QuREQuRe is a program for viral quasispecies reconstruction, specifically developed to analyze long read (>100 bp) NGS data
RR is "GNU S", a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.
R'MESRecherche de Mots Exceptionnels dans une Séquence (search for exceptional words in a sequence).
RADmapperSet of scripts to create a de novo consensus and map the reads back to it.
RADtoolsRADtools is our software for processing RAD Sequencing data. Version 1.0 is a pipeline for transforming Illumina reads into candidate genetic markers.
RAxMLRAxML (Randomized Axelerated Maximum Likelihood) is a program for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees. It can also be used for postanalyses of sets of phylogenetic trees, analyses of alignments, and evolutionary placement of short reads.
RayParallel genome assemblies for parallel DNA sequencing. Ray is a parallel software that computes de novo genome assemblies with next-generation sequencing data. Ray is written in C++ and can run in parallel on numerous interconnected computers using the message-passing interface (MPI) standard.
RDPToolsCollection of commonly used RDP Tools for easy building
RDP_FrameBotRDP FrameBot is a tool for correcting frameshift errors caused by insertions and deletions in DNA sequences.
readSeqSequence format conversion. ReadSeq is a program and library for conversion of biosequence data from one format to another, useful in various bioinformatics programs and services.
realignerReAligner is used to realign multi-alignments of DNA fragments. converter is a utility for reformatting multi-alignments.
REAPRREAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. It can be used in any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new broken assembly based on the error calls.
RECONProper identification of repetitive sequences is an essential step in genome analysis. The RECON package performs de novo identification and classification of repeat sequence families from genomic sequences. The underlying algorithm is based on extensions to the usual approach of single linkage clustering of local pairwise alignments between genomic sequences. Specifically, our extensions use multiple alignment information to define the boundaries of individual copies of the repeats and to distinguish homologous but distinct repeat element families. RECON should be useful for first-pass automatic classification of repeats in newly sequenced genomes.
RepeatMaskerRepeatMasker is a program that screens DNA sequences for interspersed repeats (using specially formatted RepBase repeat databanks) and low complexity DNA sequences.
RepeatModelerRepeatModeler is a de-novo repeat family identification and modeling package.
RepeatScoutRepeatScout is a tool to discover repetitive substrings in DNA.
REPETThe REPET package ( Flutre et al, 2011 ) integrates bioinformatics programs in order to tackle biological issues at the genomic scale.
RIsearchRIsearch is a program for fast RNA-RNA interaction search. It employs a modified Smith-Waterman-Gotoh algorithm based on di-nucleotides to approximate nearest-neighbor energy parameters
RMBlastRMBlast is a RepeatMasker compatible version of the standard NCBI BLAST suite. The primary difference between this distribution and the NCBI distribution is the addition of a new program "rmblastn" for use with RepeatMasker and RepeatModeler. RMBlast supports RepeatMasker searches by adding a few necessary features to the stock NCBI blastn program. These include:
- Support for custom matrices ( without KA-Statistics ).
- Support for cross_match-like complexity adjusted scoring. Cross_match is Phil Green's seeded Smith-Waterman search algorithm.
- Support for cross_match-like masklevel filtering.
RNA-SeQCRNA-SeQC is a Java program which computes a series of quality control metrics for RNA-seq data.
RNAhybridRNAhybrid is a tool for finding the minimum free energy hybridisation of a long and a short RNA. The hybridisation is performed in a kind of domain mode, i.e. the short sequence is hybridised to the best fitting part of the long one. The tool is primarily meant as a means for microRNA target prediction.
rnammerRNAmmer predicts 5S/8S, 16S/18S, and 23S/28S ribosomal RNA in full genome sequences. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project.
Rpy2rpy2 is a redesign and rewrite of rpy. It provides a low-level interface to R, a proposed high-level interface, including wrappers to graphical libraries, as well as R-like structures and functions.
RSEGThe RSEG software package is designed to analyze ChIP-Seq data, especially for identifying genomic regions and their boundaries marked by diffusive histone modification markers, such as H3K36me3 and H3K27me3. It can work with or without a control sample. It can be used to find regions with differential histone modification patterns, either in a comparison between two cell types or between two kinds of histone modifications.
RSeQCThe RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data, especially RNA-seq data.
RstanR interface to Stan, which is a C++ package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.
RUMRUM is an alignment, junction calling, and feature quantification pipeline specifically designed for Illumina RNA-Seq data. RUM can also be used effectively for DNA sequencing (e.g. ChIP-Seq) and microarray probe mapping. RUM also has a strand specific mode.
S-MartS-MART manages your RNA-Seq and ChIP-Seq data. It also produces many different plots to visualize your data.
samstatSAMStat is an efficient C program to quickly display statistics in html format of large sequence files from next generation sequencing projects.
samtoolsSAM (Sequence Alignment/Map). SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SARToolsSARTools is an R package dedicated to the differential analysis of RNA-seq data. It provides tools to generate descriptive and diagnostic graphs, to run the differential analysis with one of the well known DESeq2 or edgeR packages and to export the results into easily readable tab-delimited files.
Scan For Matchesscan_for_matches is a utility written in C for locating patterns in DNA or protein FASTA files.
scipySciPy (pronounced "Sigh Pie") is open-source software for mathematics, science, and engineering. The SciPy library depends on Numpy, which provides convenient and fast N-dimensional array manipulation.
SConsSCons is a software construction tool, that is, a superior alternative to the classic "Make" build tool that we all know and love.
SeaViewSeaView is a multiplatform, graphical user interface for multiple sequence alignment and molecular phylogeny.
SEGEMEHLsegemehl is software for mapping short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA). segemehl now supports the SAM format, reads gzipped queries to save both disk and memory space, and allows bisulfite sequencing mapping and split read mapping.
SelEstimThe software package SelEstim is aimed at distinguishing neutral from selected polymorphisms and estimating the intensity of selection at the latter. The SelEstim model accounts explicitly for positive selection, and it is assumed that all marker loci in the dataset are responding to selection, to some extent.
selscanA program to calculate EHH-based scans for positive selection in genomes.
seqcleanA script for automated trimming and validation of ESTs or other DNA sequences by screening for various contaminants, low quality and low-complexity sequences.
SeqMonkA tool to visualise and analyse high throughput mapped sequence data
seqtools (dotter belvu blixem blixemh)A suite of tools for visualising sequence alignments. Blixem is an interactive browser of pairwise alignments that have been stacked up in a "master-slave" multiple alignment; it is not a 'true' multiple alignment but a 'one-to-many' alignment. It displays an overview section showing the positions of genes and alignments around the alignment window, and a detail section showing the actual alignment of protein or nucleotide sequences to the genomic DNA sequence. Dotter is a graphical dot-matrix program for detailed comparison of two sequences. Every residue in one sequence is compared to every residue in the other, with one sequence plotted on the x-axis and the other on the y-axis. Noise is filtered out so that alignments appear as diagonal lines. Belvu is a multiple sequence alignment viewer and phylogenetic tool. It has an extensive set of user-configurable modes to color residues by conservation or by residue type, and some basic alignment editing capabilities. It can generate distance matrices between sequences and construct distance-based trees, either graphically or as part of a phylogenetic software pipeline.
SeqtrimNEXTSeqtrimNEXT is a customizable and distributed pre-processing software for NGS (Next Generation Sequencing) biological data. It makes use of the scbi_mapreduce gem to be able to run in parallel and distributed environments. It is specially suited for Roche 454 (normal and paired-end) and Illumina datasets, although it could easily be adapted to any other situation.
sff_extract454 sequence reads are usually stored in sff files. These files store the information about the reads: sequence, quality, and the quality and adapter clips. sff_extract extracts the reads from the sff files and stores them into fasta and xml or caf text files.
SGASGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.
ShoRAHShoRAH is a software package that allows for inference about the structure of a population from a set of short sequence reads as obtained from ultra-deep sequencing of a mixed sample.
ShortStackShortStack is a tool developed to process and analyze small RNA-seq data with respect to a reference genome, and output a comprehensive and informative annotation of all discovered small RNA genes.
sickleSickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and when quality is sufficiently high to trim the 5'-end of reads.
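The sliding-window idea can be illustrated with a short Python sketch. This is not Sickle's actual implementation: the function name `trim_3prime`, the default window size and the threshold are assumptions chosen for illustration, and only 3'-end trimming is shown.

```python
def trim_3prime(qualities, window=5, threshold=20):
    """Return how many bases to keep from the 5' end of a read.

    Hypothetical sketch: slide a fixed-size window along the per-base
    quality scores; at the first window whose mean quality falls below
    the threshold, cut at the first low-quality base inside that window.
    """
    n = len(qualities)
    if n == 0:
        return 0
    w = min(window, n)  # shrink the window for very short reads
    for start in range(n - w + 1):
        mean_q = sum(qualities[start:start + w]) / w
        if mean_q < threshold:
            # A below-threshold mean guarantees at least one
            # below-threshold base in the window.
            for i in range(start, start + w):
                if qualities[i] < threshold:
                    return i
    return n  # no low-quality window found: keep the whole read


if __name__ == "__main__":
    quals = [30] * 6 + [10] * 4
    keep = trim_3prime(quals)
    print("keep", keep, "of", len(quals), "bases")
```

A length threshold, as mentioned in the description, would then simply discard reads for which the returned length is too short.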
sim4sim4 is a program designed to align an expressed DNA sequence with a genomic sequence, allowing for introns.
SimuPOPsimuPOP is a general-purpose individual-based forward-time population genetics simulation environment.
SLIDESLIDE (Sliding-window method for Locally Inter-correlated markers with asymptotic Distribution Errors corrected) is a multivariate normal distribution (MVN)-based multiple hypothesis testing correction method. SLIDE shows a near identical accuracy to the gold standard, the permutation test, and is much faster.
SLIPSLIP (Sliding-window method for Locally Inter-correlated markers for Power estimation) is a multivariate normal distribution (MVN)-based power estimation method. SLIP shows a near identical accuracy to the standard simulation procedure for power, and is much faster.
smaltSMALT efficiently aligns DNA sequencing reads with genomic reference sequences. Reads from a range of sequencing platforms, for example Illumina-Solexa, Roche-454, PacBio or ABI-Sanger, can be processed including paired-end reads
SNAPPredicts effect of mutations on protein function
SnoReportComputational identification of snoRNAs with unknown targets. Detecting novel or orphan snoRNAs in RNA sequence data using sequence and structure information only, without relying on target information.
SnpEffSnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of variants on genes (such as amino acid changes)
SOAPdenovo-transSOAPdenovo-Trans is a de novo transcriptome assembler based on the SOAPdenovo framework, adapted to alternative splicing and to different expression levels among transcripts. The assembler provides a more accurate, complete and faster way to construct full-length transcript sets.
SomaticSniperThe purpose of this program is to identify single nucleotide positions that are different between tumor and normal (or in theory, any two bam files). It takes a tumor bam and a normal bam and compares the two to determine the differences.
SortMeRNASortMeRNA is software designed to rapidly filter ribosomal RNA fragments from metatranscriptomic data produced by next-generation sequencers. It is capable of handling large RNA databases and sorting out all fragments matching the database with high accuracy and specificity.
SPAdesSPAdes (St. Petersburg genome assembler) is intended for both standard isolates and single-cell MDA bacteria assemblies.
SRA toolkitToolkit to query the Sequence Read Archive (SRA) at NCBI.
ssahaSSAHA2 (Sequence Search and Alignment by Hashing Algorithm) is a pairwise sequence alignment program designed for the efficient mapping of sequencing reads onto genomic reference sequences.
SSPACESSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data.
SSPACE-LongReadSSPACE-LongRead is a stand-alone program for scaffolding pre-assembled contigs using long reads (e.g. PacBio RS reads).
StacksStacks is a software suite for analysing RAD Sequencing data by Julian Catchen at the University of Oregon. It will process raw Illumina RAD data or RAD data aligned to a reference genome, and produce genotypes that can be viewed and filtered via a web interface.
StampyStampy is a package for the mapping of short reads from Illumina sequencing machines onto a reference genome.
STAR-FusionSTAR-Fusion further processes the output generated by the STAR aligner to map junction reads and spanning reads to a reference annotation set (using a GTF file, ideally the same annotation file used to build the STAR genome index during the initial STAR setup).
StringTieStringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus.
StructureThe program structure is a free software package for using multi-locus genotype data to investigate population structure. Its uses include inferring the presence of distinct populations, assigning individuals to populations, studying hybrid zones, identifying migrants and admixed individuals, and estimating population allele frequencies in situations where many individuals are migrants or admixed. It can be applied to most of the commonly used genetic markers, including SNPs, microsatellites, RFLPs and AFLPs.
SubreadA tool kit for processing next-gen sequencing data
SumatraSumatra was developed by the LECA and aims to compute a great deal of sequence similarities in a fast and exact way, based on the length of the Longest Common Subsequence (LCS) between two sequences. Sequence clustering based on similarities is also available through Sumaclust.
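As a rough illustration of an LCS-based similarity, the Python sketch below computes the length of the Longest Common Subsequence by dynamic programming. It is not Sumatra's optimized algorithm, and normalizing by the length of the longer sequence is an assumption made here for the example; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the Longest Common Subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length divided by the longer sequence length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty sequences are treated as identical
    return lcs_length(a, b) / longest


if __name__ == "__main__":
    print(lcs_similarity("ACGT", "AGT"))
```

The plain dynamic program needs time proportional to the product of the two sequence lengths, which is why a dedicated tool is useful when many pairs must be compared.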
T-CoffeeT-Coffee is a multiple sequence alignment package. You can use T-Coffee to align sequences or to combine the output of your favorite alignment methods (Clustal, Mafft, Probcons, Muscle...) into one unique alignment.
tabixTAB-delimited file IndeXer. Useful for VCFtools.
TagCleanerThe TagCleaner tool (standalone version) can be used to automatically detect and efficiently remove tag sequences (e.g. WTA tags) from genomic and metagenomic datasets. It is easily configurable.
TagDustTagDust is a program to eliminate artifactual reads from next-generation sequencing data sets.
Tandem Repeats FinderTandem Repeats Finder is a program to locate and display tandem repeats in DNA sequences. A tandem repeat in DNA is two or more adjacent, approximate copies of a pattern of nucleotides.
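To make the definition of a tandem repeat concrete, here is a small Python sketch that finds exact tandem repeats with a regular-expression backreference. It is a simplification, not the Tandem Repeats Finder algorithm, which also detects approximate copies; the function name and parameter defaults are assumptions.

```python
import re


def find_exact_tandem_repeats(seq, min_period=2, max_period=6, min_copies=2):
    """Return (start, unit, copies) for exact tandem repeats in a DNA string.

    Hypothetical helper: for each period length, match a unit of that
    length followed by at least (min_copies - 1) identical copies.
    """
    hits = []
    for period in range(min_period, max_period + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (period, min_copies - 1))
        for m in pattern.finditer(seq):
            copies = len(m.group(0)) // period
            hits.append((m.start(), m.group(1), copies))
    return hits


if __name__ == "__main__":
    for start, unit, copies in find_exact_tandem_repeats("TTACGACGACGTT"):
        print(start, unit, copies)
```

Note that a long repeat can be reported more than once in this sketch, for example as a unit of length 3 and again as a unit of length 6.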
TangramTangram is a C/C++ command line toolbox for structural variation (SV) detection.
TASSELTrait Analysis by aSSociation, Evolution and Linkage. TASSEL has multiple functions, including association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation and data visualization for large sets of data.
TE-locateTE-locate is a tool to locate all copies of sequences in a reference sequence (TE = Transposable Element). Input is NGS data.
TFM_PvalueTFM-Pvalue is a software suite providing tools for computing the score threshold associated to a given P-value and the P-value associated to a given score threshold. It uses Position Weight Matrices, such as those available in the Transfac or Jaspar databases.
TGICLThis package automates clustering and assembly of a large EST/mRNA dataset. The clustering is performed by a slightly modified version of NCBI's megablast, and the resulting clusters are then assembled using the CAP3 assembly program. TGICL starts with a large multi-FASTA file (and an optional peer quality values file) and outputs the assembly files as produced by CAP3.
TopHatTopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
trans-ABYSSTrans-ABySS is a software pipeline for analyzing ABySS-assembled contigs from shotgun transcriptome data. The pipeline accepts assemblies that were generated across a wide range of k values in order to address variable transcript expression levels. It first filters and merges the multi-k assemblies, generating a much smaller set of nonredundant contigs. It contains scripts that map assembled contigs to known transcripts, currently supporting the Blat contig-to-genome aligner. It identifies novel splicing events like exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. Its scripts can also identify candidate gene-fusions, single-nucleotide variants, insertions, deletions, and inversions.
TransDecoderTransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.
treemixTreeMix is a method for inferring the patterns of population splits and mixtures in the history of a set of populations. In the underlying model, the modern-day populations in a species are related to a common ancestor via a graph of ancestral populations. We use the allele frequencies in the modern populations to infer the structure of this graph.
TRFTandem Repeats Finder 4.04 for 64 bit Linux
Trim GaloreA wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries.
trimAltrimAl: a tool for automated alignment trimming.
TrimmomaticTrimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters are supplied on the command line.
TrinotateTrinotate is a comprehensive annotation suite designed for automatic functional annotation of transcriptomes, particularly de novo assembled transcriptomes, from model or non-model organisms.
tRNAscan-SESearch for tRNA genes in genomic sequence.
UNAFoldSoftware for nucleic acid folding and hybridization. The UNAFold software package is an integrated collection of programs that simulate folding, hybridization, and melting pathways for one or two single-stranded nucleic acid sequences.
USEARCHHigh-throughput biological sequence analysis. It is distributed as a single binary program that implements a suite of algorithms comparable to BLASTN, BLASTP, BLASTX, BLASTCLUST, CD-HIT, CD-HIT-EST, CD-HIT-2D, CD-HIT-EST-2D, CD-HIT-OTU, CD-HIT-454, ChimeraSlayer, Perseus, RAPsearch and more. It supports a rich set of sequence matching options, including E-values, identity, coverage (fraction of query or target sequence covered by the alignment) and maximum gap length, and a range of output file formats including FASTA, BLAST-like, user-defined tabbed text and a native format designed for clustering applications. Supported alignment styles include local (gapped and ungapped), like BLAST, and global, which is most often used in clustering applications. User-settable parameters allow tuning of substitution scores, gap penalties and Karlin-Altschul statistics.
VARNAVARNA is a lightweight Java applet dedicated to drawing the secondary structure of RNA. It is also a Swing component that can very easily be included in existing Java code working with RNA secondary structure to provide a fast and interactive visualization.
VarScanVarScan is a platform-independent software tool developed at the Genome Institute at Washington University to detect variants in NGS data.
VCAKEVCAKE is a genetic sequence assembler capable of assembling millions of small nucleotide reads even in the presence of sequencing error. This software is currently geared towards de novo assembly of Illumina's Solexa Sequencing data.
vcflibvcflib provides methods to manipulate and interpret sequence variation as it can be described by VCF
vcftoolsVCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide methods for working with VCF files: validating, merging, comparing and calculating some basic population genetic statistics.
velvetVelvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.
Velvet currently takes in short read sequences, removes errors then produces high quality unique contigs. It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
viennaRNAVienna RNA package allows RNA Secondary Structure Prediction and Comparison.
wgswgs (Celera Assembler) is a de novo whole-genome shotgun (WGS) DNA sequence assembler.
Wise2Wise2 is a package focused on comparisons of biopolymers, commonly DNA sequence and protein sequence. These are the programs which you might use for this:
genewise: a single protein vs a single genomic dna sequence
genewisedb: a database of proteins vs a database of genomic dna sequences.
estwise: a single protein vs a single EST/cDNA sequence.
estwisedb: a database of proteins vs a database of EST/cDNA sequences.
wu-blastSimilarity search against databanks, Washington University Blast.
wxMaximawxMaxima is a document based interface for the computer algebra system Maxima.
YassYASS is a genomic similarity search tool, for nucleic (DNA/RNA) sequences in fasta or plain text format (it produces local pairwise alignments).
YinOYangYinOYang produces neural network predictions for O-(beta)-GlcNAc attachment sites in eukaryotic protein sequences. It can also use NetPhos to mark possible phosphorylated sites and hence identify the "Yin-Yang" sites, i.e. the sites that may be modified reversibly and dynamically by O-GlcNAc or phosphate groups at different times in the cell.