About Project

The Diptex database comprises sequencing data from early embryonic transcriptomes of two non-drosophilid dipteran species: the moth midge Clogmia albipunctata and the scuttle fly Megaselia abdita. The resource also includes a third, previously published transcriptome, for the hoverfly Episyrphus balteatus.


The Megaselia and Clogmia transcriptomes were sequenced with both Roche-454 and Illumina-HiSeq2000 technologies. The Episyrphus transcriptome was sequenced with Roche-454 only. All transcript sequences reported in this database were assembled with the Trinity software.


Assembled sequences were annotated as follows. Identified transcripts were translated in all six possible reading frames to detect open reading frames (ORFs). For each detected ORF, a custom-made processing pipeline identified protein signatures, assigned best orthologs, and used orthology-derived information to annotate metabolic pathways, multi-enzymatic complexes, and reactions. First, ORFs were inspected for the presence of different protein signatures (such as families, regions, domains, repeats, and sites) using InterProScan and the InterPro database. These signatures were used to classify and automatically annotate protein sequences by assigning biological functions and Gene Ontology (GO) terms. Second, each ORF was mapped to the UniRef50 protein database (http://www.ebi.ac.uk/uniref) using the BLASTp algorithm to assess its similarity to known protein sequences from other species. Finally, best-hit protein identifiers were used to retrieve the metabolic pathway, multi-enzymatic complex, and reaction information available in the Reactome database (http://www.reactome.org). Annotations obtained in this way were stored in a MySQL relational database (http://www.mysql.com). Raw sequence reads and assembled contigs/transcripts for M. abdita and C. albipunctata have been submitted to the European Nucleotide Archive (ENA); their accession numbers will be available soon.
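
For readers unfamiliar with six-frame translation, the first annotation step can be illustrated with a minimal Python sketch. This is not the project's pipeline code. The function names (six_frame_translations, find_orfs), the use of the standard genetic code, and the minimum ORF length are all assumptions made for illustration, and the example transcript is made up.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG base ordering.
# ASSUMPTION: the source does not state which translation table was used.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate codon by codon; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(transcript):
    """Hypothetical helper: translate a transcript in frames +1..+3 and -1..-3."""
    seq = transcript.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def find_orfs(protein, min_length=3):
    """Hypothetical helper: peptides running from the first M to a stop codon.

    ASSUMPTION: min_length is an arbitrary illustrative cutoff.
    """
    orfs = []
    # The last piece after splitting has no terminating stop, so skip it.
    for segment in protein.split("*")[:-1]:
        start = segment.find("M")
        if start != -1 and len(segment) - start >= min_length:
            orfs.append(segment[start:])
    return orfs


# Made-up transcript containing ATG GCT AAA TGA in the third forward frame.
frames = six_frame_translations("CCATGGCTAAATGACC")
print(frames["+3"])             # MAK*
print(find_orfs(frames["+3"]))  # ['MAK']
```

In a real workflow, each peptide returned by a step like this would then be passed to InterProScan and BLASTp as described above.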