- Agglutinative language
- language in which most of the grammatical information is expressed through suffixes added to words. Agglutinative languages are morphologically complex and require an efficient morphological analyzer in order to be processed accurately.
- Ambiguity
- property of a word (or any other linguistic unit) that has several different meanings. For example, “bank” can be a money-lending institution or the edge of a river. Ambiguity is pervasive in languages and is one of the main problems natural language processing has to face.
- Cognate
- related words with a similar form and meaning across languages. Proper nouns are often valid cognates (“Paris” designates the same city in English and in French, while “Londres” and “London” do not have exactly the same form in the two languages but can nevertheless be considered as valid cognates). In contrast, French “achèvement” is not a valid cognate of English “achievement” since the two words, although etymologically related, do not have the same meaning today (French “achèvement” means “completion”). This is known as a “deceptive cognate.”
- Compound
- a word composed of several morphemes that generally does not fully preserve the semantics of its components. For example, “round table” generally designates an event, most of the time with no “round” table. When a compound is made of several words, it is called a “multiword expression” (as opposed to “solid compounds” like “football” or “blackboard,” in which the concatenated morphemes form a single word unit).
- Conjugation
- the different forms of a verb obtained by inflection.
- Entry, dictionary entry
- short description of one meaning of a given word in a dictionary. Generally, if a word has different meanings (i.e., if the word is ambiguous), it has different entries (one entry per meaning).
- FAHQT
- Fully automated high-quality translation; also found as FAHQMT, for fully automated high-quality machine translation.
- Frozen expression
- see idiom.
- Grammatical function
- role of the word in the sentence (e.g., subject, object, etc.).
- Idiom
- complex expression whose meaning has little to do with the semantics of its parts (e.g., “kick the bucket,” which has nothing to do with “kick” or with “bucket”).
- Inflection
- variation of a given word depending on its grammatical function in a sentence. The term inflection is used for nouns and adjectives. For verbs, people generally use the term conjugation, but both terms refer to fundamentally the same process. The more variation there is, the more the language in question is considered “morphologically complex.” English is known to be simpler than many other languages from a morphological point of view.
- Interlingua
- representation of the semantic content of a sentence in a language-independent formalism.
- Lemma
- normalized form of a word as found in a dictionary (e.g., “walk” as opposed to “walking”).
- Lemmatizer
- automatic tool intended to calculate the lemma of each word in a text. The task is not trivial when a surface form corresponds to several possible lemmas and must therefore be disambiguated according to the local context.
- Light verb
- a verb used in a context where it has little semantic content, especially in complex verbal expressions such as “to take a shower” (where nobody literally takes anything).
- Morpheme
- word part. See morphology.
- Morphological analyzer
- automatic tool calculating the structure of a word (see morphology).
- Morphology
- analysis of the structure of words. Words are generally made of a stem, with (optionally) some prefixes and some suffixes. For example, in the noun “deconstruction,” “de-” is a prefix and “-tion” is a suffix. The word stem is “construct” or even “-struct” (since “con-” can also be considered a prefix). Word parts (stems, prefixes, and suffixes) are called morphemes.
- Morphosyntax
- see part-of-speech tagger.
- Occurrence
- the presence of a word in a corpus: the number of occurrences of a word in a text is the number of times the word is used in the text.
- Parser
- see syntactic analyzer.
- Part-of-speech (or morphosyntax) tagger
- automatic tool assigning part-of-speech tags to words in context (e.g., a specific sentence). The task is difficult since most words are ambiguous (e.g., “fly” can be a noun or a verb).
- Part-of-speech tags
- word categories (noun, verb, adjective, etc.). In English, researchers generally consider around a dozen categories, but the inventory varies greatly across languages.
- Phrase
- semiautonomous group of words in a sentence, such as a noun phrase (“a cat”) or a verb phrase (“to go shopping”). A phrase is said to be semiautonomous since it does not form a full sentence in itself, but it can be associated with some autonomous meaning (as opposed to a sequence like “cat goes to”).
- Precision
- fraction of the retrieved “information nuggets” (words, sequences of words, documents) that are relevant to a query or a task.
- Prefix
- see morphology.
- Recall
- fraction of the “information nuggets” (words, sequences of words, documents) relevant to a query or a task that are successfully retrieved.
- Semantic analyzer
- automatic tool intended to provide a semantic representation (see semantics).
- Semantics
- analysis of the meaning of any linguistic unit (word, phrase, sentence, or any higher-level unit, such as a paragraph or text).
- Suffix
- see morphology.
- Syntactic analyzer or parser
- automatic tool intended to provide the syntactic representation of a linguistic unit (see syntax).
- Surface form or word form
- word as it occurs in a text. The proper analysis of surface forms (recognizing the different morphemes and linking a word form to the corresponding lemma) has a lot to do with morphology. The task is relatively simple for English, which has a relatively low level of morphological complexity (e.g., “dancing,” “dances,” and “danced” are easily recognizable forms of “to dance”). Linguists consider that French is morphologically more complex than English (since there are more surface forms per lemma in French than in English) and that Finnish is even more complex (in theory, there could even be a near-infinite number of word forms for one lemma in Finnish, since Finnish is an agglutinative language).
- Syntactic structure
- structure of a group of words reflecting their relative grammatical function.
- Syntax
- structure of a group of words, generally a sentence. The result of a syntactic analysis is generally a tree, in which everything depends on the main verb.
- Transfer rule
- in a rule-based machine translation system, a transfer rule formalizes the way a linguistic structure in the source language must be rendered in the target language. Transfer rules have to do with syntax.
- Vague, vagueness
- refers to the fact that a language is never completely precise or could always be more precise, especially in relation to the external world. Vagueness is pervasive in language and involves many differing notions (e.g., vague concepts such as “to be bald”; philosophical and abstract concepts such as “to be good”; concepts that vary across languages like colors; etc.).
- Word sense
- one of the different meanings of a word. The number of word senses corresponds to the number of entries for one word in a dictionary.
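
The precision and recall entries above can be illustrated with a minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical, chosen only for this example, and the sketch assumes each "information nugget" can be treated as a distinct item in a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant items that were actually retrieved
    # Precision: share of the retrieved items that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant items that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a system returns four documents, three of which are
# among the five documents that are actually relevant.
retrieved = ["d1", "d2", "d3", "d7"]
relevant = ["d1", "d2", "d3", "d4", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # prints: 0.75 0.6
```

Here three of the four retrieved documents are relevant (precision 0.75), while only three of the five relevant documents were found (recall 0.6). Returning 0.0 when a set is empty is a convention adopted for this sketch, not a standard rule.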