The DANTE Database (Database of ANalysed Texts of English)

Authors: Cathal Convery, Pádraig Ó Mianáin, Muiris Ó Raghallaigh, Sue Atkins, Adam Kilgarriff, Michael Rundell
Abstract: This database was designed and created for Foras na Gaeilge by the Lexicography MasterClass and its 15-strong team, led by Valerie Grundy (Managing Editor); textflow is managed by Diana Rawlinson (Project Administrator). The corpus of 1.7 billion words of current English, custom-built in 2007, was queried using the Sketch Engine, and the database was compiled in IDM’s Dictionary Production System (DPS). The present volume contains a fuller description of this project (Atkins, Kilgarriff and Rundell, "Database of ANalysed Texts of English (DANTE): the NEID database project") and of its use in a bilingual dictionary (Convery, Ó Mianáin and Ó Raghallaigh, "Covering all bases: Regional Marking of material in the New English-Irish Dictionary").
The 95,000 or so DANTE entries cover approximately 50,000 headwords and 45,000 compounds, idioms and phrasal verbs, using over 40 datatypes. The lexical entry is subdivided into lexical units, each a sense of a single- or multi-word lemma. Almost every linguistic fact recorded is accompanied by full corpus sentences illustrating its use in text. Apart from the definitions and the corpus-derived example sentences, all the significant information is machine-retrievable. Functionality demonstrated here includes simple and complex searches over various combinations of datatypes and the automatic insertion of empty translation fields for use in dictionary building.
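The entry structure and the two functions mentioned above (searching over combinations of datatypes, and inserting empty translation fields) can be pictured with a minimal Python sketch. It is an illustration only and does not reflect the actual DANTE schema or the DPS software: the class names, the datatype labels ("pos", "domain") and the sample sentences are all hypothetical and were invented for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Example:
    text: str                           # a full example sentence
    translation: Optional[str] = None   # None means "no translation field yet"


@dataclass
class LexicalUnit:
    lemma: str                          # single- or multi-word lemma
    definition: str
    # Hypothetical datatype labels mapped to their recorded values.
    datatypes: Dict[str, List[str]] = field(default_factory=dict)
    examples: List[Example] = field(default_factory=list)
    translation: Optional[str] = None


@dataclass
class Entry:
    headword: str
    units: List[LexicalUnit] = field(default_factory=list)


def search(entries, **criteria):
    """Return the lexical units that match every datatype=value criterion."""
    hits = []
    for entry in entries:
        for lu in entry.units:
            if all(value in lu.datatypes.get(dt, [])
                   for dt, value in criteria.items()):
                hits.append(lu)
    return hits


def add_empty_translation_fields(entries):
    """Add an empty translation field wherever none exists; return how many were added."""
    added = 0
    for entry in entries:
        for lu in entry.units:
            if lu.translation is None:
                lu.translation = ""
                added += 1
            for ex in lu.examples:
                if ex.translation is None:
                    ex.translation = ""
                    added += 1
    return added


# Invented sample data, not taken from DANTE or its corpus.
sample_entries = [
    Entry("bank", [
        LexicalUnit("bank", "financial institution",
                    {"pos": ["noun"], "domain": ["finance"]},
                    [Example("She paid the cheque into the bank.")]),
        LexicalUnit("bank", "tilt while turning",
                    {"pos": ["verb"], "domain": ["aviation"]},
                    [Example("The plane banked sharply to the left.")]),
    ]),
]
```

In this sketch, `search(sample_entries, pos="verb", domain="aviation")` combines two datatype criteria, and `add_empty_translation_fields(sample_entries)` prepares every lexical unit and example for a bilingual editor to fill in.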
DANTE was created as the initial stage of compilation of the New English-Irish Dictionary, but its long-term potential is much more far-reaching. It offers publishers world-wide a comprehensive launchpad for bilingual dictionaries with English as the source language, the draft stage of a learners’ dictionary of English, a source of updating material for an existing dictionary, etc. It offers software developers, universities and other research institutions a resource for improved word sense disambiguation, the creation or enhancement of online lexicons, and other uses in software applications such as machine-assisted translation, information retrieval systems, etc. More details from
