Mathematical and Computational Linguistics. Relationship between linguistics and computer science


Linguistic informatics is a part of the theory of information services. This theory arose in connection with the computerization of verbal communication, that is, with the use of computers as a means of recording and storing linguistic information. Thanks to this technology, it became possible to combine the functions of a library, an archive, and an office.

Large classes of texts are processed by automatic abstracting. The constantly growing volume of scientific and technical information, the search through which becomes ever more laborious, gave rise to the idea of searching through so-called secondary texts, which present the condensed content of a primary document: the bibliographic description, annotation, abstract, or scientific translation.

The primary text is condensed by compression. Special methods of condensing the primary text have been developed:

a) statistical-distributional methods, in which the most informative sentences are singled out: those in which the linguistic features most significant for the given text are concentrated;

b) methods using semantic indicators, in which the most meaningful "points" of the text are marked: the subject of study, purpose, methods, relevance, scope, conclusions, and results;

c) the method of textual links, which takes inter-phrase links into account in order to make the abstract coherent.

3. Practical terminology.
Practical terminology includes sections:

a) terminological lexicography, which deals with the theory and practice of creating specialized dictionaries, unifying term systems, translating terms, creating terminological data banks, and automating their storage and processing;

b) lexicography itself, which became a subject of applied linguistics as one of the most labor-intensive kinds of practical linguistic work. Dictionaries used to take decades to create, so the desire of scholars to automate lexicographic activity is quite understandable. Automatic dictionaries now exist; their purpose is to increase productivity in working with texts and in collecting, storing, and processing various units of the language. Dictionaries of this type are used in automatic text processing systems.

Automatic translation.

Automatic, or machine, translation is based on the assumption that typologically different language structures (vocabulary, word order, inflection, syntactic constructions) can be brought into correspondence. The linguistic principle of translation is to match linguistic units of two or more languages that are equivalent in meaning.

Two stages can be distinguished in the development of automatic translation systems. At the first stage, such fundamental problems of machine translation were addressed as the creation of automatic dictionaries, the development of an intermediary language, the formalization of grammar, the resolution of homonymy, and the processing of idiomatic expressions. At the second stage, set-theoretic models of grammar, dependency grammars, immediate-constituent models, and generative grammar continued to develop fruitfully and to be put into practice. During this period, semantics in the spirit of the "Meaning-Text" model became increasingly involved in applied linguistics. Centers of applied linguistics that emerged in domestic and foreign universities developed strategies for machine translation. These include the Laboratory of Mathematical Linguistics at St. Petersburg University and at the Institute of Applied Mathematics of the Russian Academy of Sciences; the All-Union Translation Center; the Speech Statistics group at the Leningrad Pedagogical Institute under the direction of Raymond Genrikhovich Piotrovsky; and the group studying syntactic modeling within the "Meaning-Text" framework under the leadership of Igor Alexandrovich Melchuk.

A new stage in the improvement of machine translation is associated with the use of an intermediary language: a knowledge representation language. Translation is based on an analysis of the meaning of the input sentence, supplemented and marked up with information from a knowledge base and expressed in its terms. The translation process transforms an input sentence of language X into an output structure of language Y. In other words, the result of machine translation is not a translation proper but rather a retelling of the source text (X). The quality of translation depends on the expressive power of the knowledge representation language. High-quality machine translation can only be ensured by reliable linguistic foundations and software tools for building powerful semantic networks based on automated lexicons.

IV. Ethnolinguistics.

Ethnolinguistics (ethnosemantics, anthropolinguistics) is a field of linguistics that studies language in its relationship to the culture of a particular ethnic group. The foundations of ethnolinguistics were laid in the works of Franz Boas and Edward Sapir in the first quarter of the 20th century. In the second half of the 20th century, ethnolinguistics took shape as an independent branch of linguistics. Ethnolinguistic studies of that period are characterized by such features as: the use of methods of experimental psychology; comparison of semantic models of different languages; study of problems of folk taxonomy; paralinguistic research; reconstruction of spiritual ethnic culture on the basis of language data; and a revival of attention to folklore.

Central to ethnolinguistics are two closely related problems that can be called "cognitive" and "communicative":

1. How, by what means, and in what form does a language reflect the cultural (everyday, religious, social, etc.) ideas of the people who speak it about the world around them and about the place of man in that world?

2. What forms and means of communication - primarily linguistic communication - are specific to a given ethnic or social group?

In accordance with these problems, two directions have emerged in ethnolinguistics: cognitively oriented ethnolinguistics and communicatively oriented ethnolinguistics.

a) Cognitively oriented ethnolinguistics.

Cognitively oriented ethnolinguistics is characteristic of American linguistics, where it is called anthropological linguistics. Initially, anthropological linguistics focused on the study of the cultures of peoples that differed sharply from European ones, above all the American Indians. Establishing family ties between their languages and describing their current state were subordinated to the task of comprehensively describing the culture of these peoples and reconstructing their history, including migration routes. The recording and interpretation of everyday and folklore texts was an integral component of the anthropological description.

Following Franz Boas, anthropological linguistics holds that the more fine-grained fragments of the classification of reality in a language correspond to the culturally more important domains. As the American linguist and anthropologist Harry Hoijer notes, "peoples who live by hunting and gathering, such as the Apache tribes of the American Southwest, have an extensive vocabulary for the names of animals and plants, as well as the phenomena of the surrounding world. Peoples whose main source of livelihood is fishing (in particular, the Indians of the northern Pacific coast) have in their vocabulary a detailed set of names for fish, as well as for fishing tools and techniques."

Ethnolinguists have paid the greatest attention to such taxonomic systems as the designations of body parts, kinship terms, the so-called ethnobiological classifications, that is, the names of plants and animals (B. Berlin, Anna Wierzbicka), and especially color terms (B. Berlin and P. Kay, A. Wierzbicka).

In modern anthropological ethnolinguistics, one can conditionally distinguish between "relativistic" and "universalist" directions: for the first, the priority is the study of cultural and linguistic specifics in the speaker's picture of the world; for the second, the search for universal properties of the vocabulary and grammar of natural languages.

Examples of research in the relativistic direction in ethnolinguistics are the works of Yuri Derenikovich Apresyan, Nina Davidovna Arutyunova, Anna Wierzbicka, Tatyana Vyacheslavovna Bulygina, Alexei Dmitrievich Shmelev, and E.S. Yakovleva devoted to the peculiarities of the Russian linguistic picture of the world. These authors analyze the meaning and use of words that either denote unique concepts not characteristic of the conceptualization of the world in other languages (toska 'longing', udal' 'daring', avos' 'perhaps' and nebos' 'probably'), or correspond to concepts that exist in other cultures but are especially significant for Russian culture or receive a special interpretation in it (istina and pravda 'truth', svoboda and volya 'freedom', sud'ba and dolya 'fate'). For example, here is a fragment of the description of the word avos' from the book by T.V. Bulygina and A.D. Shmelev, "Linguistic Conceptualization of the World":

«<...> avos' does not mean at all the same as simply 'possibly' or 'maybe' <...> most often, avos' is used as a kind of excuse for carelessness, when the hope is not so much that some favorable event will happen as that some extremely undesirable consequences will be avoided. Of a person who buys a lottery ticket, one would not say that he acts na avos' (on the off-chance). It could rather be said of a person who <...> saves money by not buying health insurance and hopes that nothing bad will happen <...> Therefore, hope for the avos' is not just a hope for good luck. If the symbol of fortune is roulette, then hope for the avos' can be symbolized by "Russian roulette".»

An example of research in the universalist direction in ethnolinguistics is the work of the Polish scholar Anna Wierzbicka devoted to the principles of describing linguistic meanings. The goal of many years of research by Wierzbicka and her followers is to establish a set of so-called "semantic primitives": universal elementary concepts by combining which each language can create an infinite number of configurations specific to that language and culture. Semantic primitives are lexical universals; in other words, they are elementary concepts for which any language has a word denoting them. These concepts are intuitively clear to a native speaker of any language, and on their basis one can build interpretations of arbitrarily complex linguistic units. Studying material from genetically and culturally diverse languages of the world, including the languages of Papua New Guinea, Austronesian languages, African languages, and the languages of the Australian Aborigines, Wierzbicka has constantly refined the list of semantic primitives. Her work "Interpretation of Emotional Concepts" lists them as follows:

"substantives" - I, you, someone, something, people;
"determinators and quantifiers" - this, the same, the same, another, one, two, many, all / all;
"mental predicates" - think (about), speak, know, feel, want;
"actions and events" - to do, occur / happen;
"ratings" - good, bad;
"descriptors" - large, small;
"time and place" - when, where, after / before, under / over;
"metapredicates" - not / no / negation, because / because of, if, to be able;
"intensifier" - very;
"taxonomy and partonomy" - species / variety, part;
“non-strictness / prototype” - similar / like.

From semantic primitives, as from "bricks", Wierzbicka assembles interpretations even of such subtle concepts as emotions. For example, she manages to demonstrate the subtle difference between the concept of American culture denoted by the word happy and the concept denoted by the Russian word schastlivyj (and by the similar Polish, French, and German adjectives). The word schastlivyj, as Wierzbicka writes, although usually considered the dictionary equivalent of the English happy, has a narrower meaning in Russian culture: "it is usually used to denote rare states of complete bliss or perfect satisfaction derived from such serious things as love, family, the meaning of life, etc." Here is how this difference is formulated in the language of semantic primitives (the components of interpretation B that are absent in interpretation A are set in capital letters).

Interpretation A: X feels happy
X feels something
sometimes people think like this:
something good happened to me
I wanted this
I don't want anything else
because of this, this person feels something good
X feels something like this

Interpretation B: X is schastliv ('happy')
X feels something
sometimes people think like this:
something VERY good happened to me
I wanted this
EVERYTHING IS FINE
I CANNOT WANT anything else
because of this, this person feels something good
X feels something like this

For Wierzbicka's research program, it is important that the search for universal semantic primitives is carried out empirically, using the methods of field linguistics, that is, work with informants: first, in each individual language the role played by a given concept in the interpretation of other concepts is determined; second, for each concept the set of languages is established in which this concept is lexicalized, that is, in which there is a special word expressing it.

b) Communicatively oriented ethnolinguistics.

The most significant results in communicatively oriented ethnolinguistics are associated with the direction called the "ethnography of speech" or "ethnography of communication". The ethnography of speech, as a theory and method for analyzing language use in a sociocultural context, was proposed in the early 1960s in the works of Dell Hymes and John J. Gumperz and developed in the works of the American scholars Aaron Cicourel, J. Bauman, and A.U. Corsaro. An utterance is investigated only in connection with the speech or communicative event within which it is generated. The cultural conditioning of any speech event (a sermon, a court session, a telephone conversation, etc.) is emphasized. The rules of language use are established through participant observation (taking part in speech events), analysis of spontaneous data, and interviews with native speakers of the given language.

Within this direction, the models of speech behavior adopted in a particular culture or in a particular ethnic or social group are studied. For example, in the culture of the "Central European standard", an informal conversation among several people presupposes, according to the etiquette accepted in this community, that the participants will not interrupt one another, that everyone is given the opportunity to speak in turn, and that a person who wants to speak usually signals this with words such as "let me see" or "let me ask". A person who wants to leave the conversation announces the intention with words such as "unfortunately, I have to go" or "I have to leave for a while", and so on. Quite different norms of public speech behavior are accepted, for example, in a number of Australian Aboriginal cultures. Respect for the individual rights of a participant in a conversation is not an obligatory rule in these communities: several interlocutors may speak at the same time; it is not necessary to react to another's statement; a speaker speaks without addressing anyone in particular; interlocutors need not look at each other; and so on. This model of speech behavior rests on the premise that all utterances somehow accumulate in the surrounding world, so that the "reception" of a message need not immediately follow its "transmission".

A topical subject of the ethnography of communication is also the study of the linguistic expression of the relative social status of interlocutors: the rules for addressing an interlocutor, including the use of titles, address by name, surname, or first name and patronymic, professional forms of address (for example, "doctor", "comrade major", "professor"), and the appropriateness of the familiar and polite pronouns of address (Russian ty and vy), etc. Languages in which the relative social position of the speaker and the listener is fixed not only in the vocabulary but also in the grammar are studied especially closely. An example is Japanese, where the choice of the grammatical form of the verb depends on whether the listener stands higher or lower than the speaker in the social hierarchy, and also on whether the speaker and the listener belong to the same social group. The relationship between the speaker and the person spoken about is also taken into account. As a result of the combined action of these restrictions, the same person uses different verb forms when addressing a subordinate and when addressing a superior, when addressing a colleague and when addressing a stranger, when addressing his wife and when addressing his neighbor's wife.

The grammar of Japanese also reflects such a feature of its speech etiquette as the desire to avoid intruding into the sphere of the interlocutor's thoughts and feelings. Japanese has a special grammatical form of the verb, the so-called desiderative mood. With the desiderative suffix -tai, the speaker expresses the desire to perform the action denoted by the original verb: "read" + tai = "I want to read", "leave" + tai = "I want to leave". However, forms of the desiderative mood are possible only when the speaker describes his or her own desire. The desire of the interlocutor or of a third person is expressed by a special construction meaning approximately "judging by outward signs, one can conclude that person X wants to perform action Y". Thus, obeying the requirements of the grammar, a speaker of Japanese can make direct judgments only about his or her own intentions. The language simply does not allow direct statements about the inner state of another person, for example, about that person's desires. One can say "I want...", but not "you want..." or "he wants...", only "it seems to me (I have the impression) that you want..." or "it seems to me (I have the impression) that he wants...".

In addition to the norms of speech etiquette, the ethnography of communication also studies speech situations that are ritualized in particular cultures, such as a court session, a dissertation defense, or a trade deal; the rules for choosing a language in interlingual communication; and the language conventions and clichés that signal that a text belongs to a certain genre ("once upon a time" in fairy tales, "heard and resolved" in the minutes of a meeting).

Modern ethnolinguistics is closely connected with sociology, psychology, and semiotics. In Russian ethnolinguistics, a special place is occupied by research at the intersection of ethnolinguistics, folklore studies, and comparative-historical linguistics. First of all, this is the research program devoted to the ethnolinguistic and ethnocultural history of the Slavic peoples (Nikita Ilyich Tolstoy, Svetlana Mikhailovna Tolstaya, Vladimir Nikolaevich Toporov). Within this program, ethnolinguistic atlases are compiled; rituals, beliefs, and folklore are mapped; and the structure of codified Slavic texts of particular genres, including incantations, riddles, and funerary and building rituals, is studied in relation to data from comparative-historical and archaeological research.


    History of the development of computational linguistics

    The formation of modern linguistics as a science of natural language was a long historical development of linguistic knowledge. Linguistic knowledge rests on elements that took shape in the course of human activity: the development of the structure of oral speech; the emergence, development, and improvement of writing; the teaching of writing; and the interpretation and deciphering of texts.

    Natural language, as the object of linguistics, occupies a central place in this science. Ideas about language changed as the science developed. If earlier no special importance was attached to the internal organization of language, which was considered first of all in the context of its relationship with the outside world, then from the end of the 19th and the beginning of the 20th century a special role came to be assigned to the internal, formal structure of language. It was during this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of semiology and structural linguistics, set out in detail in his posthumously published Course in General Linguistics (1916).

    Saussure proposed the idea of considering language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. He was the first to propose a structural approach to language, that is, a description of language through the study of the relationships between its units. By units, or "signs", he understood the word, which unites meaning and sound. The concept proposed by the Swiss scholar rests on a theory of language as a system of signs with three parts: language (French langue), speech (French parole), and speech activity (French langage).

    Saussure himself defined the science he created, semiology, as "a science that studies the life of signs within society". Since language is a system of signs, in answering the question of what place linguistics occupies among the other sciences, Saussure argued that linguistics is part of semiology. It is generally accepted that it was the Swiss philologist who laid the theoretical foundation of a new direction in linguistics, becoming the founder, the "father", of modern linguistics.

    The concept put forward by F. de Saussure was further developed in the works of many outstanding scholars: in Denmark by L. Hjelmslev, in Czechoslovakia by N. Trubetzkoy, in the USA by L. Bloomfield, Z. Harris, and N. Chomsky. As for our country, structural linguistics began its development here at about the same time as in the West, at the turn of the 19th-20th centuries, in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods the Swiss scholar proposed. It was he who defined linguistics as a science that uses statistical methods and functional dependencies, and separated it from philology. The first area in which mathematical methods were applied in linguistics was phonology, the study of the sound structure of language.

    It should be noted that the postulates put forward by F. de Saussure were reflected in the problems of linguistics that became topical in the middle of the 20th century. It was in this period that a clear trend towards the mathematization of the science of language emerged. In practically all large countries, the rapid development of science and computer technology began, which in turn demanded ever newer linguistic foundations. The result was a rapid convergence of the exact sciences and the humanities, and an active interaction between mathematics and linguistics that found practical use in solving topical scientific problems.

    In the 1950s, at the intersection of mathematics, linguistics, computer science, and artificial intelligence, a new field of science arose: computational linguistics (also known as machine linguistics or automatic natural language processing). The main stages in the development of this field took place against the backdrop of the evolution of artificial intelligence methods. A powerful impetus to the development of computational linguistics was the creation of the first computers. However, with the advent of a new generation of computers and programming languages in the 1960s, a fundamentally new stage in the development of this science began. It should also be noted that the origins of computational linguistics go back to the work of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis of the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. Strictly speaking, this theory is an outright mathematical discipline. It can be considered one of the first results in that branch of applied linguistics known as mathematical linguistics.

    The first experiments and developments in computational linguistics concerned the creation of machine translation systems, as well as systems that model human language abilities. In the late 1980s, with the advent and active development of the Internet, there was rapid growth in the volume of text information available in electronic form. This moved information retrieval technologies to a qualitatively new stage of development. The need arose for automatic processing of natural language texts, and completely new tasks and technologies appeared. Scientists faced the problem of quickly processing a huge stream of unstructured data. In search of a solution, great importance was given to the development and application of statistical methods in automatic text processing. It was with their help that it became possible to divide texts into clusters united by a common theme, to highlight certain fragments in a text, and so on. In addition, the use of mathematical statistics and machine learning made it possible to tackle speech recognition and the creation of search engines.

    Scientists did not stop at these results: they continued to set new goals and objectives and to develop new techniques and methods of research. All this led linguistics to act as an applied science, combining a number of other sciences, the leading role among which belongs to mathematics with its variety of quantitative methods and its ability to use them for a deeper understanding of the phenomena under study. Thus mathematical linguistics began its formation and development. At the moment it is a rather "young" science (it has existed for about fifty years); nevertheless, despite its "young age", it is an already established area of scientific knowledge with many successful achievements.

    Modern computational linguistics is largely oriented towards the use of mathematical models. There is even a popular belief that linguists are not particularly needed for automatic natural language modeling. Frederick Jelinek, head of the speech recognition center at Johns Hopkins University, is well known for the remark: "Anytime a linguist leaves the group, the recognition rate goes up."

    However, the more complex and multi-level the tasks of linguistic modeling set before the developers of automatic systems become, the more obvious it is that solving them is impossible without taking linguistic theory into account, without an understanding of how language functions, and without expert linguistic competence. At the same time, it has become obvious that automatic methods of analyzing and modeling linguistic data can significantly enrich theoretical linguistic research, serving both as a means of collecting linguistic data and as a tool for testing the validity of a particular linguistic hypothesis.

    Text Processing Evaluation Forum

    S.Yu.Toldova, O.N. Lyashevskaya, A.A. Bonch-Osmolovskaya

    How can lexical meaning be formalized and made "machine-readable"? An answer is provided by distributional models of language, in which the meaning of a word is the sum of its contexts in a sufficiently large corpus. Artificial neural networks make it possible to train such models quickly and with high quality.
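    As a minimal illustration of the distributional idea itself (not of the neural models mentioned above), the sketch below builds word vectors from co-occurrence counts in a toy corpus and compares them by cosine similarity; the corpus and window size are invented for the example.

```python
# Distributional sketch: represent each word by the counts of words that
# co-occur with it in a small window, then compare words by cosine similarity.
from collections import Counter, defaultdict
from math import sqrt

corpus = "the cat sat on the mat the dog sat on the rug".split()
WINDOW = 2  # context words considered on each side

vectors = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
        if j != i:
            vectors[word][corpus[j]] += 1

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Words appearing in similar contexts get similar vectors:
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["on"]))
```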

    Denis Kiryanov, Tanya Panova (supervisor B.V. Orekhov)

    This program has two functions: a) normalization of Yiddish text; b) transliteration from the Hebrew square script into Latin letters. These problems are very relevant: until now there has been no normalizer at all, apart from spell checkers. Meanwhile, almost every publishing house that issued books in Yiddish followed its own spelling practice. A normalizer is needed for work on a corpus of the Yiddish language: to reduce all texts to a single orthography recognized by the parser. Transliteration will also allow typologists to work with Yiddish material.
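    A real normalizer and transliterator is far beyond a short example, but the sketch below shows the general mechanics of script-to-Latin transliteration by table lookup. The mapping covers only a few letters in the spirit of YIVO romanization and is purely illustrative; an actual tool must handle digraphs, final letter forms, diacritics, and the non-phonetic spelling of Hebrew-origin words.

```python
# Toy square-script-to-Latin transliteration via a (tiny, illustrative) table.
YIVO = {
    "אַ": "a", "אָ": "o", "ב": "b", "ג": "g", "ד": "d", "ה": "h",
    "ו": "u", "ז": "z", "ט": "t", "י": "i", "ל": "l", "מ": "m", "ם": "m",
    "נ": "n", "ן": "n", "ע": "e", "ר": "r", "ש": "sh", "ת": "s",
}

def transliterate(word: str) -> str:
    # Greedy longest-match: letters with a diacritic occupy two code points.
    out, i = [], 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in YIVO:
                out.append(YIVO[chunk])
                i += length
                break
        else:
            out.append(word[i])  # unknown character passes through unchanged
            i += 1
    return "".join(out)

print(transliterate("מאַמע"))  # -> "mame" ('mother'), a phonetically spelled word
```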


    Computational Linguistics: Methods, Resources, Applications

    Introduction

    The term computational linguistics (CL) has become increasingly common in recent years in connection with the development of various applied software systems, including commercial software products. This is due to the rapid growth of text information in society, including on the Internet, and the need for automatic processing of natural language (NL) texts. This circumstance stimulates the development of computational linguistics as a field of science and the development of new information and linguistic technologies.

    Within computational linguistics, which has existed for more than 50 years (and is also known under the names machine linguistics and automatic natural language processing), many promising methods and ideas have been proposed, but not all of them have yet found expression in software products used in practice. Our goal is to characterize the specifics of this area of research, formulate its main tasks, indicate its connections with other sciences, give a brief overview of the main approaches and resources used, and briefly characterize the existing applications of CL. For a more detailed acquaintance with these issues, specialized books can be recommended.

    1. Tasks of computational linguistics

    Computational linguistics arose at the intersection of linguistics, mathematics, computer science, and artificial intelligence. The origins of CL go back to the research of the famous American scientist N. Chomsky on formalizing the structure of natural language; its development rests on results in general linguistics. Linguistics studies the general laws of natural language, its structure and functioning, and includes the following areas:

    Ø Phonology studies the sounds of speech and the rules of their combination in speech;

    Ø Morphology deals with the internal structure and external forms of words, including parts of speech and their categories;

    Ø Syntax studies the structure of sentences, the rules of word compatibility and word order within a sentence, as well as its general properties as a unit of language;

    Ø Semantics and pragmatics are closely related areas: semantics deals with the meaning of words, sentences, and other units of speech, while pragmatics deals with the ways this meaning is expressed in relation to the specific goals of communication;

    Ø Lexicography describes the lexicon of a particular NL: its individual words and their grammatical properties, as well as the methods of creating dictionaries.

    The results of N. Chomsky, obtained at the intersection of linguistics and mathematics, laid the foundation of the theory of formal languages and grammars (often called generative grammars). This theory now belongs to mathematical linguistics and is used to process not so much NL as artificial languages, primarily programming languages. By its nature it is an outright mathematical discipline.
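    The following sketch illustrates the core idea of a generative grammar: a small set of context-free rewriting rules defines (and here randomly generates) the sentences of a toy language. The rules and lexicon are invented for the example.

```python
# A toy context-free grammar: nonterminals are rewritten until only
# terminal words remain, yielding a sentence of the defined language.
import random

GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["linguist"], ["grammar"]],
    "V":   [["writes"], ["sleeps"]],
}

def generate(symbol: str) -> list[str]:
    if symbol not in GRAMMAR:          # terminal symbol: a word of the language
        return [symbol]
    expansion = random.choice(GRAMMAR[symbol])
    return [word for part in expansion for word in generate(part)]

print(" ".join(generate("S")))  # e.g. "the linguist writes a grammar"
```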

    Mathematical linguistics also includes quantitative linguistics, which studies the frequency characteristics of language (words, their combinations, syntactic constructions, etc.) using the mathematical methods of statistics; this branch of science may therefore also be called statistical linguistics.
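    A minimal example of such a quantitative calculation: counting word frequencies in a text and printing the top of the rank-frequency list, the raw material of Zipf-style statistics. The sample text is arbitrary and can be replaced by any file's contents.

```python
# Word-frequency counting: the simplest quantitative-linguistic measurement.
from collections import Counter
import re

text = """To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer..."""  # any text can go here

words = re.findall(r"[a-z']+", text.lower())
for rank, (word, freq) in enumerate(Counter(words).most_common(5), start=1):
    print(rank, word, freq)
```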

    CL is also closely related to the interdisciplinary scientific field of artificial intelligence (AI), within which computer models of individual intellectual functions are developed. One of the first working programs in the field of AI and CL was T. Winograd's well-known SHRDLU system, which understood simple human commands for changing a world of blocks, formulated in a limited subset of NL. It should be noted that despite the obvious overlap of research in CL and AI (since language proficiency involves intellectual functions), AI does not absorb all of CL, since CL has its own theoretical basis and methodology. Common to these sciences is computer modeling as the main method and final goal of research.

    Thus, the task of CL can be formulated as the development of computer programs for the automatic processing of NL texts. And although processing is understood quite broadly, far from all types of processing can be called linguistic, and far from all processors can be called linguistic ones. A linguistic processor must use one or another formal model of the language (even a very simple one), which means it is language-dependent in one way or another (that is, it depends on a specific NL). For example, the text editor Microsoft Word can be called linguistic (if only because it uses dictionaries), but the Notepad editor cannot.

    The complexity of CL's tasks stems from the fact that NL is a complex, multi-level system of signs that arose for the exchange of information between people, developed in the course of human practical activity, and constantly changes in connection with that activity. Another difficulty in developing CL methods (and in studying NL within linguistics) lies in the diversity of natural languages: the significant differences in their vocabulary, morphology, and syntax, and the different means by which different languages express the same meaning.

    2. Features of the NL system: levels and connections

    The objects of linguistic processors are NL texts. Texts are understood as any samples of speech, oral or written, of any genre, but CL deals mainly with written texts. A text has a one-dimensional, linear structure and carries a certain meaning, while language serves as a means of converting the meaning to be conveyed into texts (speech synthesis) and back (speech analysis). A text is composed of smaller units, and there are several ways of dividing it into units belonging to different levels.

    The existence of at least the following levels is generally recognized:

    · the level of word forms: the lexico-morphological level;

    · the level of sentences (utterances): the syntactic level.

    The interaction of these levels gives rise to the homonymy (ambiguity) of linguistic units:

    · Lexico-morphological homonymy (the most common type) occurs when the word forms of two different lexemes coincide; for example, the Russian stikh is both a verb in the masculine singular past tense ('subsided') and a noun in the nominative singular ('verse');

    · Syntactic homonymy means ambiguity of the syntactic structure, leading to several interpretations: Students from Lvov went to Kyiv; Flying planes can be dangerous (a famous example of Chomsky's); etc.

    3. Modeling in computational linguistics

    The development of a linguistic processor (LP) involves describing the linguistic properties of the NL text being processed, and this description is organized as a model of the language. As in modeling in mathematics and programming, a model is understood as a system that reflects a number of essential properties of the phenomenon being modeled (in this case NL) and therefore bears a structural or functional similarity to it.

    The language models used in CL are usually built on the basis of theories created by linguists through the study of various texts and on the basis of their linguistic intuition (introspection). What is the specificity of CL models? The following features can be distinguished:

    · formality and, ultimately, algorithmizability;

    · functionality (the purpose of modeling is to reproduce the functions of language as a "black box", without building an exact model of human speech synthesis and analysis);

    · generality of the model, i.e., it accounts for a rather large set of texts;

    · experimental validity, which involves testing the model on different texts;

    · reliance on dictionaries as a mandatory component of the model.

    The complexity of NL, of its description, and of its processing leads to dividing the process into separate stages corresponding to the levels of the language. Most modern LPs are of a modular type, in which each level of linguistic analysis or synthesis corresponds to a separate processor module. In particular, in the case of text analysis, individual LP modules perform the following (a schematic sketch follows the list below):

    Ø graphematic analysis, i.e., identifying word forms in the text (the transition from characters to words);

    Ø morphological analysis: the transition from word forms to their lemmas (the dictionary forms of lexemes) or stems (the core parts of words minus inflectional morphemes);

    Ø syntactic analysis, i.e., identifying the grammatical structure of the text's sentences;

    Ø semantic and pragmatic analysis, which determines the meaning of phrases and the corresponding reaction of the system within which the LP works.
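    The sketch below shows this modular organization schematically: each stage consumes the output of the previous one. All module bodies are trivial stubs invented for the example; real modules rely on dictionaries and grammars.

```python
# Schematic modular linguistic processor: a chain of stages, each of which
# transforms the representation produced by the previous stage.
def graphematic(text: str) -> list[str]:
    return text.split()                                   # characters -> word forms

def morphological(tokens: list[str]) -> list[tuple[str, str]]:
    return [(t, t.lower().rstrip("s")) for t in tokens]   # crude form -> "lemma"

def syntactic(lemmas: list[tuple[str, str]]):
    return {"root": lemmas[0], "rest": lemmas[1:]}        # placeholder structure

def semantic(tree) -> str:
    return f"predication about '{tree['root'][1]}'"       # placeholder meaning

stages = [graphematic, morphological, syntactic, semantic]
result = "Cats sleep"
for stage in stages:
    result = stage(result)
print(result)
```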

    Different schemes of interaction between these modules are possible (sequential operation or parallel, interleaved analysis), but the individual levels (morphology, syntax, and semantics) are nevertheless processed by distinct mechanisms.

    Thus, an LP can be regarded as a multi-stage converter that, in the case of text analysis, translates each sentence into an internal representation of its meaning, and acts in the reverse direction in the case of synthesis. The corresponding language model can be called structural.

    Although complete CL models require taking all the main levels of the language into account and the availability of the corresponding modules, when solving certain applied problems one can do without representing individual levels in the LP. For example, in early experimental CL programs the processed texts belonged to very narrow problem domains (with a limited set of words and strict word order), so that words could be recognized by their initial letters, omitting the stages of morphological and syntactic analysis.

    Another example of a reduced model, now used quite often, is the language model of the frequencies of symbols and their combinations (bigrams, trigrams, etc.) in the texts of a specific NL. Such a statistical model represents linguistic information at the level of the characters (letters) of a text, and it is sufficient, for example, for detecting typos or identifying the language of a text. A similar model based on the statistics of individual words and their co-occurrence in texts (bigrams and trigrams of words) is used, for example, to resolve lexical ambiguity or to determine the part of speech of a word (in languages like English).
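    A minimal sketch of such a character-bigram model: bigram frequencies are estimated from a sample string, and new strings are scored by the average frequency of their bigrams, so that unfamiliar letter combinations (candidate typos or foreign text) score low. The training sample and the smoothing scheme are illustrative.

```python
# Character-bigram model: estimate bigram frequencies, then score new strings.
from collections import Counter

def bigrams(s: str):
    return [s[i:i + 2] for i in range(len(s) - 1)]

sample = "the quick brown fox jumps over the lazy dog"  # stand-in training text
counts = Counter(bigrams(sample))
total = sum(counts.values())

def score(s: str) -> float:
    # Average relative frequency of the string's bigrams (add-one smoothing).
    probs = [(counts[b] + 1) / (total + 1) for b in bigrams(s)]
    return sum(probs) / len(probs)

print(score("the dog"), score("xqzjwv"))  # familiar text scores higher
```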

    Note that structural-statistical models are also possible, in which statistics of one kind or another are taken into account when representing individual levels of NL: words, syntactic constructions, and so on.

    In a modular LP, an appropriate model (of morphology, syntax, etc.) is used at each stage of text analysis or synthesis.

    The morphological models of word-form analysis existing in CL differ mainly in the following parameters:

    · the result of the analysis: a lemma or a stem with a set of morphological characteristics (gender, number, case, aspect, person, etc.) of the given word form;

    · the method of analysis: based on a dictionary of the language's word forms, on a stem dictionary, or a dictionary-free method;

    · the ability to process word forms of lexemes not included in the dictionary.

    In morphological synthesis, the input data are a lexeme and the specific morphological characteristics of the requested word form of that lexeme; a request to synthesize all forms of a given lexeme is also possible. The result of both morphological analysis and synthesis is, in general, ambiguous.
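    A toy sketch of dictionary-based morphological analysis, with invented entries; note that the result can be ambiguous, as stated above.

```python
# Dictionary-based morphological analysis: look up a word form and return
# all (lemma, characteristics) analyses; the entries are illustrative.
FORMS = {
    "books": [("book", "NOUN, plural"), ("book", "VERB, 3sg present")],
    "book":  [("book", "NOUN, singular"), ("book", "VERB, infinitive")],
    "ran":   [("run", "VERB, past")],
}

def analyze(form: str):
    analyses = FORMS.get(form.lower())
    if analyses is None:
        return [(form, "UNKNOWN")]   # a dictionary-free fallback would guess here
    return analyses

print(analyze("books"))  # two analyses: the result is ambiguous
```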

    For modeling syntax within CL, a large number of different ideas and methods have been proposed, differing in the way the syntax of a language is described, in the way this information is used in the analysis or synthesis of NL sentences, and in the way the syntactic structure of a sentence is represented. Rather conventionally, three main approaches can be singled out: the generative approach going back to the ideas of Chomsky; the approach going back to the ideas of I. Melchuk and represented by the Meaning-Text model; and approaches that attempt to overcome the limitations of the first two, in particular the theory of syntactic groups.

    Within the generative approach, syntactic analysis is usually performed on the basis of a formal context-free grammar describing the phrase structure of a sentence, or on the basis of some extension of context-free grammar. These grammars proceed from a sequential linear division of a sentence into phrases (syntactic constructions, for example noun phrases) and therefore simultaneously reflect both its syntactic and its linear structure. The hierarchical syntactic structure of an NL sentence obtained as a result of such analysis is described by a constituency tree, whose leaves contain the words of the sentence, whose subtrees correspond to the syntactic constructions (phrases) of the sentence, and whose arcs express the nesting relations between constructions.

    Network grammars can also be included in this approach; they serve both as a device for describing the language system and for specifying a sentence-analysis procedure based on the concept of a finite automaton, for example the augmented transition network (ATN).

    In the second approach, a more transparent and widespread way of representing the syntactic structure of a sentence is used: dependency trees. The nodes of the tree contain the words of the sentence (usually with the verb-predicate at the root), and each arc connecting a pair of nodes is interpreted as a syntactic subordination link between them, the direction of the link corresponding to the direction of the arc. Since in this case the syntactic links of the words and the word order of the sentence are kept separate, dependency trees can describe the broken and non-projective constructions that occur quite often in languages with free word order.
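    Below is a minimal encoding of a dependency tree (each word stores the index of its head) together with a check for crossing arcs, using the textbook example of a non-projective English sentence; the analysis given in `heads` is illustrative.

```python
# Dependency tree as a head-index array; None marks the root.
# Tokens: 0 a, 1 hearing, 2 is, 3 scheduled, 4 on, 5 the, 6 issue, 7 today.
sentence = ["a", "hearing", "is", "scheduled", "on", "the", "issue", "today"]
heads = [1, 3, 3, None, 1, 6, 4, 3]  # "on the issue" attaches to "hearing"

def is_projective(heads) -> bool:
    # A tree is non-projective if any two arcs cross when drawn above the words.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads) if h is not None]
    return not any(a[0] < b[0] < a[1] < b[1] for a in arcs for b in arcs)

print(is_projective(heads))  # False: a non-projective construction
```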

    Constituency trees are better suited to describing languages with rigid word order; representing broken and non-projective constructions with them requires extending the grammatical formalism used. On the other hand, this approach more naturally describes constructions with non-subordination relations. A common difficulty for both approaches is the representation of homogeneous (coordinated) members of a sentence.

    Syntactic models in all approaches try to take into account the restrictions imposed on the combination of language units in speech, and in one way or another they use the concept of valence. Valence is the ability of a word or other language unit to attach other units in a certain syntactic way; an actant is a word or syntactic construction that fills this valence. For example, the Russian verb peredat' 'to hand over' has three main valences, which can be expressed by the interrogative words who? to whom? what? Within the generative approach, the valences of words (first of all verbs) are described mainly in the form of special frames (subcategorization frames), and within the dependency-tree approach, in the form of government patterns.

    Models of the semantics of language are the least developed within CL. For the semantic analysis of sentences, so-called case grammars and semantic cases (valences) are used, on the basis of which the semantics of a sentence is described through the links of its main word (the verb) with its semantic actants, that is, through semantic cases. For example, the verb to hand over is described by the semantic cases of the giver (agent), the addressee, and the object of transfer.

    To represent the semantics of a whole text, two logically equivalent formalisms are usually used (both are described in detail within AI; a small sketch of the second is given after the list):

    · Predicate calculus formulas expressing properties, states, processes, actions and relationships;

    · Semantic networks, i.e., labeled graphs in which vertices correspond to concepts and edges correspond to the relations between them.
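    A minimal sketch of a semantic network as a labeled graph, encoding the "hand over" example above as relation triples; the participants are invented.

```python
# Semantic network as (concept, relation, concept) triples.
edges = [
    ("hand_over", "agent",     "teacher"),
    ("hand_over", "addressee", "student"),
    ("hand_over", "object",    "book"),
]

def neighbors(node, relation=None):
    # All outgoing relations of a concept, optionally filtered by label.
    return [(r, dst) for src, r, dst in edges
            if src == node and (relation is None or r == relation)]

print(neighbors("hand_over"))            # all semantic actants of the predicate
print(neighbors("hand_over", "agent"))   # just the agent
```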

    As for models of pragmatics and discourse, which make it possible to process not only individual sentences but the text as a whole, the ideas of van Dijk are mainly used to build them. One of the rare successful models is the model of discourse synthesis of coherent texts. Such models must take into account anaphoric references and other discourse-level phenomena.

    Concluding this characterization of language models within CL, let us dwell a little longer on the theory of linguistic models "Meaning-Text", within which many fruitful ideas appeared that were ahead of their time and are still relevant.

    According to this theory, NL is regarded as a special kind of converter that transforms given meanings into the corresponding texts and given texts into the corresponding meanings. Meaning is understood as the invariant of all synonymous transformations of a text. The content of a coherent fragment of speech, without division into phrases and word forms, is represented as a special semantic representation consisting of two components: a semantic graph and information about the communicative organization of the meaning.

    The distinctive features of the theory include:

    o orientation towards the synthesis of texts (the ability to generate correct texts is considered the main criterion of language competence);

    o the multi-level, modular character of the model, with the main levels of language divided into surface and deep levels: for example, deep (semanticized) and surface ("pure") syntax are distinguished, as are the surface-morphological and deep-morphological levels;

    o the integral character of the language model: the information represented at each level is preserved by the module that performs the transition from that level to the next;

    o special means of describing syntactics (the rules for combining units) at each level; to describe lexical compatibility, a set of lexical functions was proposed, with whose help the rules of syntactic paraphrasing are formulated;

    o emphasis on the lexicon rather than the grammar: the dictionary stores information belonging to different levels of the language; in particular, for syntactic analysis, government patterns of words are used, describing their syntactic and semantic valences.

    This theory and language model found its embodiment in the ETAP machine translation system.

    4. Linguistic resources

    The development of linguistic processors requires an appropriate representation of linguistic information about the NL being processed. This information is contained in various computer dictionaries and grammars.

    Dictionaries are the most traditional form of representing lexical information; they differ in their units (usually words or phrases), in structure, and in the vocabulary covered (dictionaries of terms of a specific problem area, dictionaries of general vocabulary, etc.). The unit of a dictionary is called a dictionary entry; it presents information about a lexeme. Lexical homonyms are usually represented in different dictionary entries.

    The most common dictionaries in CL are morphological dictionaries, used for morphological analysis; their dictionary entries contain morphological information about the corresponding word: part of speech, inflectional class (for inflectional languages), a list of word meanings, and so on. Depending on the organization of the linguistic processor, grammatical information, such as government patterns of the word, may also be added to the dictionary entry.

    There are dictionaries that provide still more information about words. For example, the linguistic model "Meaning-Text" essentially relies on an explanatory-combinatorial dictionary, whose entries present, in addition to morphological, syntactic, and semantic information (syntactic and semantic valences), information about the lexical compatibility of the word.

    A number of linguistic processors use dictionaries of synonyms. A relatively new type is the dictionary of paronyms, that is, of outwardly similar words that differ in meaning, such as stranger and alien, or editing and referencing.

    Another type of lexical resource is phrase banks, into which the most typical phrases of a particular language are collected. Such a bank of phrases of the Russian language (about a million units) forms the core of the CrossLexic system.

    More complex kinds of lexical resources are thesauri and ontologies. A thesaurus is a semantic dictionary, that is, a dictionary that represents the semantic links between words: synonymy, genus-species relations (sometimes called the above-below relation), part-whole relations, and associations. The spread of thesauri is connected with the solution of information retrieval problems.

    The concept of an ontology is closely related to that of a thesaurus. An ontology is a set of concepts and entities of a certain field of knowledge, oriented towards repeated use in various tasks. Ontologies can be created on the basis of the vocabulary existing in a language, in which case they are called linguistic ontologies.

    The WordNet system is considered such a linguistic ontology: a large lexical resource collecting the words of the English language (nouns, adjectives, verbs, and adverbs) and representing semantic links of several types between them. For each of these parts of speech, words are grouped into sets of synonyms (synsets), between which the relations of antonymy, hyponymy (the genus-species relation), and meronymy (the part-whole relation) are established. The resource contains about 25 thousand words; the number of hierarchy levels for the genus-species relation averages 6-7, sometimes reaching 15. The upper level of the hierarchy forms a general ontology: a system of basic concepts about the world.
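    WordNet can be queried, for example, through the NLTK library (a real interface; the snippet assumes `nltk` is installed and downloads the WordNet data on first run):

```python
# Querying WordNet's genus-species (hypernym) hierarchy via NLTK.
import nltk
nltk.download("wordnet", quiet=True)  # one-time download of the WordNet data
from nltk.corpus import wordnet as wn

dog = wn.synsets("dog")[0]           # the first synset for "dog"
print(dog.definition())
print(dog.hypernyms())               # the genus: more general synsets
print([lemma.name() for lemma in dog.lemmas()])  # the synset's synonyms
```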

    Following the scheme of the English WordNet, similar lexical resources were built for other European languages, united under the common name EuroWordNet.

    A completely different kind of linguistic resource is grammars, whose type depends on the syntax model used in the processor. To a first approximation, a grammar is a set of rules expressing the general syntactic properties of words and groups of words. The total number of grammar rules also depends on the syntax model, varying from several dozen to several hundred. In essence, the problem of the relationship between grammar and lexicon in a language model manifests itself here: the more information is presented in the dictionary, the shorter the grammar can be, and vice versa.

    It should be noted that the construction of computer dictionaries, thesauri, and grammars is voluminous and labor-intensive work, sometimes even more labor-intensive than developing the linguistic model and the corresponding processor. Therefore one of the subsidiary tasks of CL is the automation of the construction of linguistic resources.

    Computer dictionaries are often formed by converting ordinary text dictionaries, but building them frequently requires much more complex and painstaking work. This usually happens when building dictionaries and thesauri for rapidly developing scientific fields such as molecular biology or computer science. The source material for extracting the necessary linguistic information can be collections and corpora of texts.

    A corpus of texts is a collection of texts gathered according to a certain principle of representativeness (by genre, authorship, etc.), in which all the texts are marked up, that is, provided with linguistic markup (annotations): morphological, accentual, syntactic, and so on. At present there exist at least a hundred different corpora, for different NLs and with different markup; in Russia the best known is the National Corpus of the Russian Language.

    Annotated corpora are created by linguists and are used both for linguistic research and for tuning (training) the models and processors used in CL with well-known mathematical methods of machine learning. Thus, machine learning is used to tune methods for resolving lexical ambiguity, recognizing parts of speech, and resolving anaphoric references.
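    The simplest corpus-trained model of this kind picks, for each word form, the part-of-speech tag it most often carries in the annotated corpus. The toy "corpus" below is invented for the sketch.

```python
# Majority-tag baseline for part-of-speech disambiguation, trained on a
# (tiny, invented) annotated corpus of (word, tag) pairs.
from collections import Counter, defaultdict

tagged_corpus = [
    ("the", "DET"), ("duck", "NOUN"), ("swims", "VERB"),
    ("they", "PRON"), ("duck", "VERB"), ("the", "DET"), ("blow", "NOUN"),
    ("the", "DET"), ("duck", "NOUN"), ("sleeps", "VERB"),
]

tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1

def predict(word: str) -> str:
    counts = tag_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "NOUN"  # fallback guess

print(predict("duck"))  # "NOUN": the majority tag in the training data
```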

    Since corpora and text collections are always limited in the range of linguistic phenomena represented in them (and corpora, moreover, take rather a long time to create), Internet texts have recently come to be regarded as a fuller linguistic resource. The Internet is undoubtedly the most representative source of samples of modern speech, but using it as a corpus requires the development of special technologies.

    5. Computational linguistics applications

    The field of application of computational linguistics is constantly expanding, so here we shall characterize the best-known applied problems solved by its tools.

    Machine translation is the earliest application of CL, together with which the field itself arose and developed. The first translation programs were built over 50 years ago and were based on the simplest word-by-word translation strategy. However, it was quickly realized that machine translation requires a complete linguistic model taking into account all levels of language, up to semantics and pragmatics, and this has repeatedly hampered the development of the field. A fairly complete model is used in the domestic ETAP system, which translates scientific texts from French into Russian.

    Note, however, that in the case of translation into a related language, for example from Spanish into Portuguese or from Russian into Ukrainian (which have much in common in syntax and morphology), the processor can be implemented on the basis of a simplified model, for instance the same word-by-word translation strategy.

    At present there is a whole spectrum of computer translation systems (of varying quality), from large international research projects to commercial automatic translators. Of significant interest are multilingual translation projects that use an intermediary language in which the meaning of the translated phrases is encoded. Another modern direction is statistical translation, based on statistics of the translation of words and phrases (these ideas are implemented, for example, in the translator of the Google search engine).

    Yet despite many decades of development of the whole field, the task of machine translation is in general still very far from being completely solved.

    Another fairly old application of computational linguistics is information retrieval and the related tasks of indexing, summarizing, classifying, and categorizing documents.

    Full-text search of documents in large databases (primarily of scientific, technical, and business documents) is usually carried out on the basis of their search images, by which is meant a set of keywords: words reflecting the main topic of the document. At first, only individual words of NL were treated as keywords, and the search was carried out without taking their inflection into account, which is uncritical for weakly inflected languages such as English. For inflected languages, such as Russian, it was necessary to use a morphological model that takes inflection into account.

    The search query was likewise represented as a set of words, and suitable (relevant) documents were determined by the similarity between the query and the search image of the document. Creating the search image of a document involves indexing its text, i.e., extracting its keywords. Since the topic and content of a document are very often reflected much more accurately by phrases than by individual words, phrases too came to be treated as keywords. This significantly complicated document indexing, since various combinations of statistical and linguistic criteria had to be used to select meaningful phrases in a text.

    In fact, information retrieval mainly uses the vector model of text (sometimes called the bag of words model), in which a document is represented as a vector (set) of its keywords. Modern Internet search engines also use this model, indexing texts by the words occurring in them (and applying very sophisticated ranking procedures to return relevant documents).
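
    A minimal sketch of the bag-of-words vector model and of query-document similarity, using plain term counts and cosine similarity; real engines add TF-IDF weighting and far more elaborate ranking.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a bag of words: word -> frequency."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

docs = ["machine translation of scientific texts",
        "statistical machine learning methods",
        "retrieval of documents by keywords"]
query = vectorize("machine translation")

# Rank documents by the similarity of their bag-of-words image to the query.
ranked = sorted(docs, key=lambda d: cosine(vectorize(d), query), reverse=True)
print(ranked[0])  # -> machine translation of scientific texts
```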

    The described text model (with some refinements) is also used in the related information retrieval problems considered below.

    Text summarization is the reduction of a text's volume to obtain its summary, or abstract (condensed content), which makes searching through document collections faster. A common abstract can also be compiled for several thematically related documents.

    The main method of automatic summarization is still the selection of the most significant sentences of the text being summarized: usually the keywords of the text are computed first, and then a significance coefficient is calculated for each of its sentences. The selection of significant sentences is complicated by anaphoric links between sentences, which it is undesirable to break; certain sentence-selection strategies are being developed to address this problem.
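
    A minimal sketch of this extractive approach, assuming word frequency as the keyword statistic: each sentence receives a significance score, and the top sentences are returned (anaphora handling is omitted).

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Score each sentence by the average frequency of its words in the
    whole text and return the n most significant sentences in order."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(freq[w] for w in words) / len(words)
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return ". ".join(s for s in sentences if s in top) + "."

text = ("Machine translation is an application of computational linguistics. "
        "Computational linguistics also studies corpora. "
        "Weather was fine yesterday.")
print(summarize(text))  # -> Computational linguistics also studies corpora.
```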

    A task close to summarization is text annotation, i.e., compiling the annotation of a document. In its simplest form, an annotation is a list of the main topics of the text, and indexing procedures can be used to extract them.

    When large document collections are created, the tasks of classification and clustering of texts become relevant, the aim being to form classes of thematically related documents. Classification means assigning each document to a class whose parameters are known in advance, while clustering means partitioning a set of documents into clusters, i.e., subsets of thematically related documents. Machine learning methods are used to solve these problems, so these applied tasks are called Text Mining and belong to the scientific direction known as Data Mining (intelligent data analysis).
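
    A minimal sketch of thematic clustering, assuming the scikit-learn library is available; real Text Mining systems work with far larger collections and richer features.

```python
# A sketch using scikit-learn (pip install scikit-learn), assumed available.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "machine translation of texts",
    "statistical translation models",
    "stock market prices fell",
    "market analytics and prices",
]

# Represent documents as TF-IDF keyword vectors (the vector text model).
vectors = TfidfVectorizer().fit_transform(docs)

# Partition the collection into two thematic clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(list(labels))  # e.g. [0, 0, 1, 1]: translation docs vs. market docs
```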

    Very close to the classification problem is text rubrication: assigning a text to one of the previously known thematic rubrics (usually the rubrics form a hierarchical tree of topics).

    The classification problem is finding ever wider application: it is solved, for example, in spam recognition, and a relatively new application is the classification of SMS messages on mobile devices. A new and topical line of research within the general task of information retrieval is multilingual document search.

    Another relatively new task related to information retrieval is question answering (Question Answering). It is solved by determining the type of the question, searching for texts potentially containing the answer to it, and extracting the answer from these texts.
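
    A minimal sketch of the three steps just named - question typing, retrieval of candidate texts, answer extraction - with invented patterns and a two-document mini-collection.

```python
import re

texts = [
    "The ETAP system translates scientific texts from French into Russian.",
    "The COLING conference is held every two years.",
]

def question_type(question):
    """Very crude question typing by interrogative expression."""
    q = question.lower()
    if q.startswith("who"):
        return "PERSON"
    if q.startswith("when") or "how often" in q:
        return "TIME"
    return "FACT"

def answer(question):
    # 1. Determine the question type.
    qtype = question_type(question)
    # 2. Retrieve the text sharing the most words with the question.
    q_words = set(re.findall(r"\w+", question.lower()))
    best = max(texts,
               key=lambda t: len(q_words & set(re.findall(r"\w+", t.lower()))))
    # 3. "Extract" the answer - here, simply return the best sentence.
    return qtype, best

print(answer("How often is the COLING conference held?"))
# -> ('TIME', 'The COLING conference is held every two years.')
```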

    A completely different applied direction, developing slowly but steadily, is the automation of the preparation and editing of texts in natural language. Among the first applications in this direction were programs for automatic word hyphenation and programs for spell checking (spellers, or auto-correctors). Despite the apparent simplicity of the hyphenation problem, its correct solution for many natural languages (English, for example) requires knowledge of the morphemic structure of the words of the language in question, and hence a corresponding dictionary.

    Spell checking has long been implemented in commercial systems and relies on an appropriate dictionary and morphology model. An incomplete syntax model is also used, on the basis of which fairly frequent syntactic errors (for example, errors of word agreement) are detected. At the same time, auto-correctors do not yet detect more complex errors, such as the incorrect use of prepositions. Many lexical errors also go undetected, in particular errors resulting from typos or from confusing similar words (for example, weight instead of weighty). Modern CL research proposes methods for the automated detection and correction of such errors, as well as of certain other kinds of stylistic errors, using statistics on the occurrence of words and phrases.
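
    A minimal sketch of statistical spelling correction in the spirit described above: candidate corrections within one edit of the typo are ranked by word-frequency statistics (the frequency table is invented).

```python
# Hypothetical word-frequency statistics gathered from a large corpus.
frequencies = {"weight": 120, "weighty": 15, "eight": 300, "wright": 2}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (deletion, transposition, substitution,
    insertion) away from the given word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces   = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts    = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, if any."""
    if word in frequencies:
        return word
    candidates = edits1(word) & frequencies.keys()
    return max(candidates, key=frequencies.get) if candidates else word

print(correct("weigth"))  # -> weight
```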

    An applied task close to supporting the preparation of texts is computer-assisted language teaching; within this direction, computer systems for teaching languages - English, Russian, etc. - are being developed (such systems can be found on the Internet). Typically these systems support the study of particular aspects of a language (morphology, vocabulary, syntax) and are based on the corresponding models, for example a morphology model.

    As for the study of vocabulary, electronic analogues of paper dictionaries are used for this purpose as well (these contain essentially no language models). However, multifunctional computer dictionaries are also being developed that have no paper analogues and are aimed at a wide range of users - for example, the CrossLexica dictionary of Russian collocations. This system covers a wide range of vocabulary - words and their admissible combinations - and also provides information on government models, synonyms, antonyms, and other semantic correlates of words, which is clearly useful not only for learners of Russian but also for native speakers.

    The next application area worth mentioning is the automatic generation of texts in natural language. In principle, this task can be considered a subtask of the machine translation task already discussed above; however, the direction has a number of specific tasks of its own. One such task is multilingual generation, i.e., the automatic construction, in several languages, of specialized documents - patent claims, operating instructions for technical products or software systems - on the basis of their specification in a formal language. Quite detailed language models are used to solve this problem.
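
    A minimal sketch of multilingual generation from a formal specification, using per-language templates; the specification format, templates, and lexicons are all invented for illustration.

```python
# A formal specification of one instruction from an operating manual.
spec = {"action": "press", "object": "reset button", "duration_sec": 5}

# Per-language surface templates, keyed by the action in the specification.
templates = {
    "en": {"press": "Press the {object} for {duration_sec} seconds."},
    "de": {"press": "Halten Sie die {object} {duration_sec} Sekunden lang gedrückt."},
}

# Tiny per-language lexicons for the objects mentioned in specifications.
lexicon = {"en": {"reset button": "reset button"},
           "de": {"reset button": "Reset-Taste"}}

def generate(spec, lang):
    """Render the specified instruction in the given language."""
    template = templates[lang][spec["action"]]
    return template.format(object=lexicon[lang][spec["object"]],
                           duration_sec=spec["duration_sec"])

for lang in ("en", "de"):
    print(generate(spec, lang))
```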

    An increasingly relevant applied task, often also classed as Text Mining, is the extraction of information from texts, or Information Extraction, which is required in problems of economic and industrial analytics. To do this, certain objects are identified in a natural language text: named entities (names, persons, geographical names), their relations, and the events connected with them. As a rule, this is implemented on the basis of partial parsing of the text, which makes it possible to process news feeds from news agencies. Since the task is quite complex not only theoretically but also technologically, the creation of serious systems for extracting information from texts is feasible mainly within commercial companies.
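
    A minimal sketch of pattern-based extraction of named entities of the kind run over news feeds; the patterns and the gazetteer are deliberately crude, invented stand-ins for partial parsing.

```python
import re

# A tiny gazetteer of known geographical names (an invented sample).
GEO = {"Moscow", "St. Petersburg", "Paris"}

def extract_entities(text):
    """Extract person-like names (two capitalized words) and known
    geographical names - a crude stand-in for partial parsing."""
    entities = []
    for m in re.finditer(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b", text):
        entities.append(("PERSON?", m.group(1)))
    for place in GEO:
        if place in text:
            entities.append(("LOCATION", place))
    return entities

news = "Ivan Petrov opened a new plant near Moscow on Monday."
print(extract_entities(news))
# -> [('PERSON?', 'Ivan Petrov'), ('LOCATION', 'Moscow')]
```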

    The direction of Text Mining also includes two other related tasks, opinion mining (Opinion Mining) and sentiment analysis (Sentiment Analysis), which attract the attention of a growing number of researchers. The first task searches blogs, forums, online stores, etc. for user opinions about products and other objects, and analyzes these opinions. The second task is close to the classical task of content analysis of mass communication texts: it evaluates the general tonality of statements.
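
    A minimal sketch of lexicon-based tonality evaluation for user opinions; the sentiment lexicon is an invented toy.

```python
# Toy sentiment lexicon: word -> tonality score.
lexicon = {"excellent": 2, "good": 1, "slow": -1, "terrible": -2, "broken": -2}

def tonality(review):
    """Sum the lexicon scores of the words in a review:
    > 0 positive, < 0 negative, 0 neutral/unknown."""
    score = sum(lexicon.get(w.strip(".,!").lower(), 0) for w in review.split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(tonality("Excellent phone, but the delivery was slow."))  # -> positive
```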

    Another application worth mentioning is dialogue support with a user in natural language within some information software system. Most often this problem has been solved for specialized databases; in that case the query language is quite limited (lexically and grammatically), which permits the use of simplified language models. Queries to the database, formulated in natural language, are translated into a formal language, after which the required information is retrieved and a corresponding response phrase is constructed.
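
    A minimal sketch of translating queries in restricted natural language into a formal query language, here SQL strings over a hypothetical employees table; the supported patterns are invented.

```python
import re

# Patterns of the restricted query language and their SQL skeletons
# (table and column names are hypothetical).
PATTERNS = [
    (re.compile(r"who works in the (\w+) department", re.I),
     "SELECT name FROM employees WHERE department = '{0}'"),
    (re.compile(r"what is the salary of (\w+)", re.I),
     "SELECT salary FROM employees WHERE name = '{0}'"),
]

def to_sql(question):
    """Translate a natural language question into SQL, if it matches
    one of the supported patterns of the restricted language."""
    for pattern, sql in PATTERNS:
        m = pattern.search(question)
        if m:
            return sql.format(m.group(1))
    raise ValueError("query outside the restricted language")

print(to_sql("Who works in the sales department?"))
# -> SELECT name FROM employees WHERE department = 'sales'
```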

    Last on our list of CL applications (but not least in importance) are speech recognition and synthesis. Recognition errors, which inevitably arise in these tasks, are corrected by automatic methods on the basis of dictionaries and linguistic knowledge of morphology; machine learning is also applied in this area.

    Conclusion

    Computational linguistics demonstrates quite tangible results in various applications for the automatic processing of texts in natural language. Its further development depends both on the emergence of new applications and on the independent development of the various language models, in which many problems remain unsolved. The most developed are models of morphological analysis and synthesis. Syntax models have not yet been brought to the level of stable and efficiently working modules, despite the large number of proposed formalisms and methods. Models of the semantic and pragmatic levels are still less studied and formalized, although automatic discourse processing is already required in a number of applications. Note that the existing tools of computational linguistics itself, together with machine learning and text corpora, can significantly advance the solution of these problems.



    COMPUTATIONAL LINGUISTICS, a direction in applied linguistics focused on the use of computer tools - programs, computer technologies for organizing and processing data - for modeling the functioning of language in particular conditions, situations, and problem areas, as well as the whole sphere of application of computer language models in linguistics and related disciplines. Strictly speaking, only in the latter case are we dealing with applied linguistics in the narrow sense, since computer modeling of language can also be regarded as a field in which computer science and programming theory are applied to problems of the science of language. In practice, however, almost everything connected with the use of computers in linguistics is referred to as computational linguistics.

    As a special scientific direction, computational linguistics took shape in the 1960s. The Russian term "kompyuternaya lingvistika" is a calque of the English computational linguistics. Since the adjective computational can also be rendered in Russian as "vychislitelnaya", the term "vychislitelnaya lingvistika" is also found in the literature, but in Russian science it has acquired a narrower meaning, approaching the notion of "quantitative linguistics". The flow of publications in this field is very large. Besides thematic collections, the journal Computational Linguistics is published quarterly in the United States. Much organizational and scientific work is done by the Association for Computational Linguistics, which has regional structures (in particular, a European chapter). International conferences on computational linguistics, COLING, are held every two years. Relevant topics are usually also well represented at various conferences on artificial intelligence.

    Toolkit of Computational Linguistics.

    Computational linguistics as a special applied discipline is distinguished above all by its instrument, i.e., by the use of computer tools for processing language data. Since computer programs modeling particular aspects of the functioning of a language can employ the most varied means of programming, it might seem that there is no need to speak of a common conceptual apparatus of computational linguistics. However, this is not so. There are general principles of computer modeling of thinking that are implemented in one way or another in any computer model. They rest on the theory of knowledge, which originally developed within artificial intelligence and later became one of the branches of cognitive science. The most important conceptual categories of computational linguistics are knowledge structures such as "frames" (conceptual structures for the declarative representation of knowledge about a typified, thematically unified situation), "scenarios" (conceptual structures for the procedural representation of knowledge about a stereotyped situation or stereotyped behavior), and "plans" (knowledge structures capturing ideas about possible actions leading to a particular goal). Closely related to the category of the frame is the concept of the "scene", used in the computational linguistics literature chiefly to designate a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.).

    A certain organized set of knowledge structures forms the "world model" of a cognitive system and of its computer model. In artificial intelligence systems, the world model forms a special block which, depending on the chosen architecture, may include general knowledge about the world (in the form of simple propositions such as "it is cold in winter" or production rules such as "if it is raining outside, you should put on a raincoat or take an umbrella"), some specific facts ("the highest peak in the world is Everest"), and also values and their hierarchies, sometimes singled out into a special "axiological block".
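
    A minimal sketch of such a world-model block, combining facts and production rules and using the examples above; the rule format and the forward-chaining loop are an assumed, simplified design.

```python
# General knowledge as facts and production rules (conditions -> new fact).
facts = {"it is raining outside"}
rules = [
    ({"it is raining outside"}, "take an umbrella or wear a raincoat"),
    ({"it is winter"}, "it is cold"),
]

def infer(facts, rules):
    """Forward chaining: apply rules until no new facts are derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

print(infer(facts, rules))
# -> {'it is raining outside', 'take an umbrella or wear a raincoat'}
```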

    Most elements of the conceptual apparatus of computational linguistics are homonymous: they simultaneously designate certain real entities of the human cognitive system and the ways of representing those entities used in their theoretical description and modeling. In other words, the elements of the conceptual apparatus of computational linguistics have ontological and instrumental aspects. For example, in the ontological aspect the division into declarative and procedural knowledge corresponds to the different types of knowledge a person has: knowledge THAT (declarative; for example, knowledge of the postal address of some person NN), on the one hand, and knowledge HOW (procedural; for example, the knowledge that allows one to find that person's apartment even without knowing the formal address), on the other. In the instrumental aspect, knowledge can be embodied in a set of descriptions, in a data set, on the one hand, and in an algorithm or instruction executed by a computer or some other model of a cognitive system, on the other.

    Directions of Computational Linguistics.

    The sphere of CL is very diverse and includes such areas as computer modeling of communication, modeling of plot structure, hypertext technologies of text presentation, machine translation, and computer lexicography. In the narrow sense, the problems of CL are often associated with an interdisciplinary applied area bearing the somewhat unfortunate name "natural language processing" (a translation of the English term Natural Language Processing). It arose in the late 1960s and developed within the scientific and technological discipline of "artificial intelligence". By its internal form, the phrase "natural language processing" covers all areas in which computers are used to process language data. In practice, however, a narrower understanding of the term has become established: the development of methods, technologies, and specific systems ensuring communication between a person and a computer in natural or restricted natural language.

    The rapid development of "natural language processing" came in the 1970s, when the number of end users of computers began to grow unexpectedly and exponentially. Since it is impossible to teach all users programming languages and technologies, the problem arose of organizing their interaction with computer programs. The solution of this communication problem followed two main paths. In the first case, attempts were made to adapt programming languages and operating systems to the end user. As a result, high-level languages of the Visual Basic type appeared, along with convenient operating systems built in the conceptual space of metaphors familiar to humans - the DESKTOP, the LIBRARY. The second path was the development of systems that allow interaction with a computer in a specific problem area in natural language or in some restricted version of it.

    The architecture of natural language processing systems generally includes a block for analyzing the user's speech message, a block for interpreting the message, a block for generating the meaning of the answer, and a block for synthesizing the surface structure of the utterance. A special part of the system is the dialogue component, which records dialogue strategies, the conditions for applying those strategies, and ways of overcoming possible failures of the communication process.

    Among computer systems for natural language processing, question-answering systems, dialogue problem-solving systems, and connected-text processing systems are usually distinguished. Question-answering systems initially began to be developed as a reaction to the poor quality of query encoding in information retrieval. Since the problem area of such systems was highly restricted, this somewhat simplified the algorithms for translating queries into a formal-language representation, as well as the reverse procedure of transforming a formal representation into natural language statements. Among domestic developments, the POET system, created by a team of researchers led by E. V. Popov, belongs to this type of program. The system processes queries in Russian (with minor restrictions) and synthesizes an answer. The block diagram of the program provides for all stages of analysis (morphological, syntactic, and semantic) and the corresponding stages of synthesis.

    Dialogue problem-solving systems, unlike systems of the previous type, play an active role in communication, since their task is to obtain a solution to a problem on the basis of the knowledge represented in the system itself and of the information that can be obtained from the user. The system contains knowledge structures recording typical sequences of actions for solving problems in the given problem area, together with information about the required resources. When the user asks a question or sets a task, the corresponding scenario is activated. If some components of the scenario are missing, or some resources are lacking, the system initiates communication. This is how, for example, the SNUKA system works, which solves problems of planning military operations.

    Connected-text processing systems are quite diverse in structure. Their common feature can be considered the wide use of knowledge representation technologies. The function of systems of this kind is to understand a text and to answer questions about its content. Understanding is treated not as a universal category but as a process of extracting information from a text, determined by a specific communicative intention. In other words, the text is "read" only on the assumption of what exactly the potential user wants to learn from it. Thus connected-text processing systems turn out to be not universal at all, but problem-oriented. Typical examples of systems of this type are the RESEARCHER and TAILOR systems, which form a single software package allowing the user to obtain information from patent abstracts describing complex physical objects.

    The most important area of computational linguistics is the development of information retrieval systems (IRS). These arose in the late 1950s and early 1960s as a response to the sharp growth in the volume of scientific and technical information. By the type of information stored and processed, and by the features of search, IRS are divided into two large groups: documentary and factographic. Documentary IRS store the texts of documents or their descriptions (abstracts, bibliographic cards, etc.). Factographic IRS deal with descriptions of specific facts, not necessarily in textual form: these may be tables, formulas, and other kinds of data presentation. There are also mixed IRS, including both documents and factual information. At present, factographic IRS are built on the basis of database (DB) technologies.

    To provide information retrieval in an IRS, special information retrieval languages are created, based on information retrieval thesauri. An information retrieval language is a formal language designed to describe particular aspects of the content plan of the documents stored in the IRS and of the query. The procedure of describing a document in an information retrieval language is called indexing. As a result of indexing, each document is assigned its formal description in the information retrieval language - the search image of the document. The query is indexed similarly, being assigned a search image and a search prescription. Information retrieval algorithms are based on comparing the search prescription with the search images of the documents. The criterion for returning a document in response to a query may be a full or partial match between the search image of the document and the search prescription. In some cases the user can formulate the output criterion himself, as determined by his information need.

    Descriptor information retrieval languages are more often used in automated IRS. The subject matter of a document is described by a set of descriptors: words and terms denoting simple, fairly elementary categories and concepts of the problem area. As many descriptors are entered into the search image of a document as there are distinct topics touched upon in it. The number of descriptors is not limited, which makes it possible to describe a document in a multidimensional feature matrix. Often restrictions are imposed on the combinability of descriptors; in that case one can say that the information retrieval language has a syntax.
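
    A minimal sketch of descriptor-based retrieval: the search image of each document is its set of descriptors, and the output criterion is either full inclusion of the search prescription in the image or partial overlap (the descriptors are invented).

```python
# Search images of documents: document id -> set of descriptors.
index = {
    "doc1": {"translation", "syntax", "french"},
    "doc2": {"retrieval", "indexing", "thesaurus"},
    "doc3": {"translation", "statistics"},
}

def search(prescription, partial=False):
    """Return documents whose search image matches the search
    prescription - fully (subset) or partially (any overlap)."""
    hits = []
    for doc, image in index.items():
        if prescription <= image or (partial and prescription & image):
            hits.append(doc)
    return hits

print(search({"translation", "statistics"}))                 # -> ['doc3']
print(search({"translation", "statistics"}, partial=True))   # -> ['doc1', 'doc3']
```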

    One of the first systems working with a descriptor language was the American UNITERM system created by M. Taube. In this system the keywords of a document, the "uniterms", functioned as descriptors. A peculiarity of this IRS is that the dictionary of the information language was not fixed in advance but arose in the process of indexing documents and queries. The development of modern information retrieval systems is associated with non-thesaurus IRS. Such systems work with the user in restricted natural language, and the search is carried out over the texts of document abstracts, over their bibliographic descriptions, and often over the documents themselves. For indexing in non-thesaurus IRS, words and phrases of natural language are used.

    To a certain extent, work on the creation of hypertext systems can be assigned to the field of computational linguistics. Hypertext is regarded as a special way of organizing text, and even as a fundamentally new type of text, opposed in many of its properties to ordinary text shaped by the Gutenberg tradition of printing. The idea of hypertext is associated with the name of Vannevar Bush, science adviser to President F. Roosevelt. Bush theoretically substantiated the project of a technical system, "Memex", which allowed the user to link texts and their fragments by various types of links, chiefly associative relations. The absence of computer technology made the project difficult to realize: the mechanical system proved too complex for practical implementation.

    Bush's idea was reborn in the 1960s in T. Nelson's "Xanadu" system, which already presupposed the use of computer technology. "Xanadu" allowed the user to read the set of texts entered into the system in different ways and in different sequences; the software made it possible both to record the sequence of texts viewed and to select almost any of them at an arbitrary moment. A set of texts together with the relations connecting them (a system of transitions) was called hypertext by Nelson. Many researchers regard the creation of hypertext as the beginning of a new information era, opposed to the era of printing. The linearity of writing, outwardly reflecting the linearity of speech, turns out to be a fundamental category limiting human thinking and the understanding of texts. The world of meaning is nonlinear, so the compression of semantic information into a linear speech segment requires special "communicative packagings": division into theme and rheme, division of the content plan of the utterance into explicit (assertion, proposition, focus) and implicit (presupposition, consequence, discourse implicature) layers. Rejecting the linearity of text, both in presenting it to the reader (i.e., in reading and understanding) and in synthesis, would, according to theorists, promote the "liberation" of thinking and even the emergence of new forms of it.

    In a computer system, hypertext is represented as a graph whose nodes contain traditional texts or their fragments, images, tables, video clips, etc. The nodes are connected by various relations whose types are defined by the developers of the hypertext software or by the reader himself. The relations define the potential possibilities of movement, or navigation, through the hypertext. Relations may be unidirectional or bidirectional; accordingly, bidirectional arrows allow the user to move in both directions, unidirectional ones in only one. The chain of nodes through which the reader passes when viewing the components of the text forms a path, or route.
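
    A minimal sketch of a hypertext as a graph with typed, unidirectional links and a recorded route; the node contents and link types are invented.

```python
# Hypertext nodes and typed, unidirectional links between them.
nodes = {"intro": "What is hypertext...",
         "bush": "Vannevar Bush and Memex...",
         "xanadu": "T. Nelson's Xanadu..."}
links = {
    "intro": [("history", "bush")],
    "bush": [("successor", "xanadu"), ("back", "intro")],
    "xanadu": [("back", "intro")],
}

def navigate(start, choices):
    """Follow links of the chosen types from the start node,
    returning the route (the chain of visited nodes)."""
    route, current = [start], start
    for wanted in choices:
        targets = dict(links.get(current, []))
        if wanted not in targets:
            break  # no such link from the current node
        current = targets[wanted]
        route.append(current)
    return route

print(navigate("intro", ["history", "successor"]))
# -> ['intro', 'bush', 'xanadu']
```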

    Computer implementations of hypertext may be hierarchical or networked. A hierarchical, tree-like, hypertext structure substantially limits the possibilities of transition between its components; in such a hypertext, the relations between components resemble the structure of a thesaurus based on genus-species relations. A networked hypertext allows various types of relations between components, not restricted to genus-species relations. By mode of existence, static and dynamic hypertexts are distinguished. A static hypertext does not change during operation; the user may record comments in it, but these do not change its substance. For a dynamic hypertext, change is the normal form of existence; such hypertexts typically function where a flow of information must be continuously analyzed, i.e., in information services of various kinds. Hypertextual, for example, is the Arizona Information System (AAIS), which is replenished with 300-500 abstracts monthly.

    Relations between the elements of a hypertext may be fixed at the outset by its creators, or they may be generated each time the user accesses the hypertext. In the first case we speak of a hypertext with a rigid structure, in the second of a hypertext with a soft structure. Rigid structure is technologically quite clear. The technology of organizing a soft structure should rest on a semantic analysis of the proximity of documents (or other information sources) to one another - a nontrivial problem of computational linguistics. At present, soft-structure technologies based on keywords are widely used: the transition from one node to another in the hypertext network is performed as the result of a keyword search. Since the set of keywords may differ each time, the structure of the hypertext changes each time as well.

    The technology of building hypertext systems makes no distinction between textual and non-textual information. Yet the inclusion of visual and audio information (video clips, pictures, photographs, sound recordings, etc.) requires a substantial change of the user interface and more powerful software and hardware support. Such systems are called hypermedia, or multimedia. The vividness of multimedia systems predetermined their wide use in education and in the creation of computer versions of encyclopedias. There exist, for example, beautifully produced CD-ROMs with multimedia systems based on the children's encyclopedias published by Dorling Kindersley.

    Within computer lexicography, computer technologies for compiling and operating dictionaries are developed. Special programs - databases, computer card files, text processing programs - make it possible to generate dictionary entries automatically and to store and process dictionary information. The many different computer lexicographic programs fall into two large groups: programs supporting lexicographic work, and automatic dictionaries of various types, including lexicographic databases. An automatic dictionary is a dictionary in a special machine format intended for use on a computer by a human user or by a computer word-processing program. In other words, automatic dictionaries for the human end user and automatic dictionaries for word-processing programs are different things. Automatic dictionaries intended for the end user differ substantially, in interface and in the structure of the dictionary entry, from the automatic dictionaries built into machine translation systems, automatic summarization systems, information retrieval systems, etc. Most often they are computer versions of well-known conventional dictionaries. The software market offers computer analogues of explanatory dictionaries of English (the automatic Webster, the automatic Collins Dictionary of English, the automatic version of the New Large English-Russian Dictionary edited by Yu. D. Apresyan and E. M. Mednikova); there is also a computer version of Ozhegov's dictionary. Automatic dictionaries for word-processing programs may be called automatic dictionaries in the strict sense. They are generally not intended for the ordinary user; the features of their structure and the scope of their vocabulary material are determined by the programs that interact with them.

    Computer modeling of plot structure is another promising direction of computational linguistics. The study of plot structure belongs to the problems of structural literary studies (in the broad sense), semiotics, and cultural studies. Available computer programs for plot modeling rest on three basic formalisms of plot representation: the morphological and the syntactic approaches to plot representation, and the cognitive approach. Ideas about the morphological organization of plot go back to the well-known works of V. Ya. Propp on the Russian fairy tale. Propp noticed that, for all the abundance of characters and events in fairy tales, the number of character functions is limited, and he proposed an apparatus for describing these functions. Propp's ideas formed the basis of the computer program TALE, which simulates the generation of fairy-tale plots. The algorithm of the TALE program is based on the sequence of character functions in the tale. In essence, the Propp functions defined a set of typified situations, ordered on the basis of an analysis of empirical material. The possibilities of linking the various situations in the generation rules were determined by the typical sequence of functions as it can be established from the texts of fairy tales. In the program, typical sequences of functions were described as typical scenarios of character encounters.
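
    A minimal sketch of plot generation from a typical sequence of Propp-style character functions, in the spirit of the TALE program; the abridged function inventory and the text templates are invented.

```python
import random

# A typical sequence of character functions (after Propp, much abridged).
FUNCTION_SEQUENCE = ["absentation", "villainy", "departure",
                     "struggle", "victory"]

# Invented text templates realizing each function.
TEMPLATES = {
    "absentation": "{hero} leaves home.",
    "villainy":    "{villain} kidnaps {victim}.",
    "departure":   "{hero} sets out on a quest.",
    "struggle":    "{hero} fights {villain}.",
    "victory":     "{hero} defeats {villain}.",
}

def generate_tale(seed=None):
    """Generate a plot by realizing the typical function sequence
    with randomly chosen characters."""
    rng = random.Random(seed)
    cast = {"hero": rng.choice(["Ivan", "Vasilisa"]),
            "villain": rng.choice(["Koschei", "Baba Yaga"]),
            "victim": "the tsarevna"}
    return " ".join(TEMPLATES[f].format(**cast) for f in FUNCTION_SEQUENCE)

print(generate_tale(seed=1))
```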

    The theoretical basis of the syntactic approach to the plot of a text was provided by "plot grammars", or "story grammars". They appeared in the mid-1970s as a result of transferring the ideas of N. Chomsky's generative grammar to the description of the macrostructure of a text. Whereas in generative grammar the most important components of syntactic structure were verb phrases and noun phrases, in most plot grammars the basic components were the setting, the event, and the episode. The theory of plot grammars widely discussed minimality conditions, i.e., the restrictions that determine the status of a sequence of plot elements as a normal plot. It turned out, however, that this cannot be established by purely linguistic methods: many of the restrictions are sociocultural in nature. Plot grammars, while differing substantially in the set of categories of the generation tree, allowed only a very limited set of rules for modifying narrative structure.

    In the early 1980s one of R. Schank's students, W. Lehnert, working on a computer plot generator, proposed an original formalism of affective plot units (Plot Units), which proved a powerful means of representing plot structure. Although it was originally developed for an artificial intelligence system, this formalism has also been used in purely theoretical studies. The essence of Lehnert's approach is that the plot is described as a successive change in the cognitive-emotional states of the characters. Thus the focus of Lehnert's formalism is not on the external components of plot - setting, event, episode, moral - but on its content characteristics. In this respect Lehnert's formalism is in part a return to Propp's ideas.

    Computational linguistics also includes machine translation, which is currently experiencing a rebirth.


    
