Apache Taverna - Genome and gene expression

Taverna is being used by various projects and researchers for the analysis of genomes and gene expression.
These include:

Next Generation Sequencing using Taverna 2 Server on Amazon cloud
TavernaPBS - next generation sequencing analysis using a computational cluster that uses a PBS queuing system and Taverna 2 Workbench
Coordination and Sustainability of International Mouse Informatics Resources (CASIMIR) - workflows to associate mouse genome and phenome
Developmental Gene Expression Map (DGEMap) - analysis of human gene expression during development
Graves' disease - identification of genes responsible
MicroArray analysis using R - statistical analysis of gene expression
SIGENAE - development of workflows to analyse breeding animal data
Trypanosomiasis - identification of genes responsible for sleeping sickness
Williams-Beuren syndrome - automation and confirmation of gene characterization
Integration of plant genome resources (PLANET)
Annotation of genomes
Shared Genomics
e-Fungi - functional genomics in fungal species

Next generation sequencing

Next generation sequencing presents new challenges in large-scale data processing. In collaboration with the University of Liverpool’s Animal Sciences & Physiology Research group, in particular Dr Harry Noyes, we combined Taverna scientific workflows with computing power from the Amazon cloud to create a next generation sequencing application for whole-genome Single Nucleotide Polymorphism (SNP) analysis.

Through a Web portal, the application allows scientists to upload their input data, fire off a number of parallel cloud instances for the analysis, monitor progress and collect results (see figure below).

Preliminary work on the genetic variation of African cattle showed that we can analyse a whole genome of ~22 million SNPs in a matter of hours. This work focuses on the response to trypanosomiasis infection (sleeping sickness) in different cattle species.
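The page does not say how the analysis is divided among the parallel cloud instances. As a rough illustration only, one simple option is to split the SNP list into near-equal contiguous index ranges, one per instance. The function name `partition` and the instance count of 8 are assumptions made for this sketch, not details of the actual application.

```python
def partition(total, workers):
    """Split `total` items into `workers` contiguous (start, end) ranges.

    Hypothetical helper: ranges are half-open and differ in size by at
    most one item, so no instance gets noticeably more work than another.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    base, extra = divmod(total, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` ranges absorb the remainder, one item each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # ~22 million SNPs spread over an assumed 8 instances.
    for start, end in partition(22_000_000, 8):
        print(f"instance handles SNPs [{start}, {end})")
```

Each range could then be passed to one instance as its share of the input, with the per-range results merged once all instances have finished.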

The application was demonstrated at the European Conference on Computational Biology (ECCB) 2010, Ghent, Belgium, under the title “Software for the Data-Driven Researcher of the Future” – see the slides and video (no audio or subtitles yet!) from the talk.

This cloud application was based on the next generation sequencing work presented at the Bioinformatics Open Source Conference (BOSC) 2010, Boston, USA, under the title “Analysing African and European cattle with Taverna 2.2”. See the slides from the talk.

TavernaPBS

TavernaPBS is a plugin for Taverna developed by the Mackey Lab, Center for Public Health Genomics, University of Virginia, US. It allows a user to define workflows that can then be run using a computational cluster that uses a PBS queuing system. The workflows represent next generation sequencing analysis pipelines.

To make efficient use of their many UNIX command-line tools, the project has developed a custom Beanshell-based library that enables workflows composed of UNIX command-line invocations. Furthermore, they have abstracted command execution so that each step can be executed as an independent “job” on a remote high-performance computing or grid environment. By doing so, they have essentially turned Taverna into a distributed workflow “compiler”.
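The idea of treating each command-line step as an independent job can be sketched as follows. This is not TavernaPBS code or its API (the real library is Beanshell-based); it is a minimal Python illustration that renders a command as a PBS job script. The helper name `make_pbs_script`, the script names `./align.sh` and `./call_snps.sh`, and the job ID `101` are all hypothetical. The `#PBS -N`, `#PBS -l walltime=` and `#PBS -W depend=afterok:` directives and the `PBS_O_WORKDIR` variable are standard PBS features.

```python
def make_pbs_script(name, command, walltime="01:00:00", depends_on=()):
    """Render one command-line step as the text of a PBS job script.

    Hypothetical helper for illustration. `depends_on` holds the IDs of
    jobs that must finish successfully before this one may start.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {name}",
        f"#PBS -l walltime={walltime}",
    ]
    if depends_on:
        # afterok: start only after the listed jobs exit without error.
        lines.append("#PBS -W depend=afterok:" + ":".join(depends_on))
    # PBS starts jobs in the home directory; return to the submit directory.
    lines.append("cd $PBS_O_WORKDIR")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two pipeline steps; the second waits for the first.
    # "101" stands in for the job ID the scheduler would return on submission.
    print(make_pbs_script("align", "./align.sh reads.fq"))
    print(make_pbs_script("call_snps", "./call_snps.sh aln.bam", depends_on=["101"]))
```

In a real pipeline the dependency IDs would come from the scheduler at submission time, which is how the order of steps in a workflow can be carried over to the cluster queue.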

You can download the code from the project's pages at SourceForge.

CASIMIR

The Coordination and Sustainability of International Mouse Informatics Resources (CASIMIR) project carried out a pilot study that demonstrated how existing databases, or parts of them, can be made accessible in a standardized way and their data processed using the MOLGENIS toolkit, BioMart and Taverna.

The MOLGENIS toolkit was used to generate WSDL services to access the databases. These were then included in a Taverna workflow along with BioMart processors.

An example workflow can be found on myExperiment.

DGEMap

The Developmental Gene Expression Map (DGEMap) project is an EU-funded study to design a pan-European infrastructure to support gene expression studies in early human development.

The DGEMap team plan to use Taverna workflows to link human spatio-temporal gene expression data with other bioinformatics resources.

SIGENAE

The Système d’Information des GENomes des Animaux d’Elevage (SIGENAE) is the information system of the Analysis of Breeding Animals’ Genome (AGENAE) program. The SIGENAE team is a group of French bioinformaticians providing services to biologists working on four mammal species and one fish species. They are funded by the French National Institute for Agricultural Research (INRA).
They not only use Taverna Workbench to support e-Science but have been very active in promoting its adoption throughout the French bioinformatics community.

Some of the workflows developed by the SIGENAE community have been made available for sharing.
