Extracting Phraseology for Content Analysis and Document Retrieval

Author: Thierry Fontenelle
Abstract: This paper describes a program that identifies the topic of a text by extracting its most relevant key words and phraseological sequences. It sets out the factors taken into account to generate this list of terms and expressions: frequency of occurrence, classification by number of elements, processing of abbreviations, and use of customisable stop lists, among others. The output can then be used by search engines to retrieve topic-related texts that are believed to display a high degree of repetitiveness, an essential criterion for building translation memory databases.
Session: PART 8 - Extraction of terminologically relevant multiword expressions
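
The kind of extraction the abstract outlines can be illustrated with a minimal sketch. This is not the author's program: the function name `extract_phrases`, the parameters `max_len` and `min_freq`, and the small stop list are all hypothetical, chosen only to show how frequency of occurrence, grouping by number of elements, and a customisable stop list might fit together. Abbreviation processing is not covered, and candidate sequences are allowed to cross sentence boundaries for simplicity.

```python
import re
from collections import Counter

# Hypothetical default stop list; the abstract only states that
# stop lists are customisable, not what they contain.
DEFAULT_STOP_WORDS = frozenset(
    {"the", "a", "an", "of", "in", "to", "and", "is", "for", "by", "be"}
)


def extract_phrases(text, max_len=3, min_freq=2, stop_words=DEFAULT_STOP_WORDS):
    """Return {n: [(phrase, count), ...]} for n = 1..max_len.

    Phrases are word sequences of n elements, listed most frequent first.
    """
    # Naive tokenisation: lowercase alphabetic runs only (an assumption).
    tokens = re.findall(r"[a-z]+", text.lower())

    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Discard candidates that begin or end with a stop word;
            # stop words inside a longer sequence are tolerated.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[" ".join(gram)] += 1

    # Classify the surviving candidates by their number of elements.
    grouped = {n: [] for n in range(1, max_len + 1)}
    for phrase, count in counts.most_common():
        if count >= min_freq:
            grouped[len(phrase.split())].append((phrase, count))
    return grouped
```

Calling `extract_phrases(document_text, stop_words=my_stop_words)` with a user-supplied set is one way the stop list could be customised; the resulting multiword entries could then serve as query terms for a search engine.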
