What does a computational linguist do? Mathematical and Computational Linguistics.

25.09.2019

Modern computational linguistics relies heavily on mathematical models. There is even a popular belief that linguists are not particularly needed for the automatic modeling of natural language. A well-known remark is attributed to Frederick Jelinek, head of the speech recognition center at Johns Hopkins University: "Anytime a linguist leaves the group, the recognition rate goes up."

However, the more complex and multi-level the linguistic modeling tasks set before the developers of automatic systems, the more obvious it becomes that they cannot be solved without linguistic theory, an understanding of how language functions, and expert linguistic competence. At the same time, it has become clear that automatic methods for analyzing and modeling linguistic data can significantly enrich theoretical linguistic research, serving both as a means of collecting linguistic data and as a tool for testing the validity of a particular linguistic hypothesis.

Text Processing Evaluation Forum

S.Yu.Toldova, O.N. Lyashevskaya, A.A. Bonch-Osmolovskaya

How can lexical meaning be formalized and made "machine-readable"? One answer is provided by distributional models of language, in which the meaning of a word is the sum of its contexts in a sufficiently large corpus. Artificial neural networks make it possible to train such models quickly and with high quality.
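The idea that a word's meaning is the sum of its contexts can be illustrated with a minimal count-based sketch. It uses plain co-occurrence counts rather than a neural network, and the toy corpus, the window size, and the function names are assumptions made only for illustration.

```python
from collections import Counter
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus (an assumption); a real model needs a sufficiently large corpus.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the cat",
    "linguists study language",
]
vecs = context_vectors(corpus)
sim_cat_dog = cosine(vecs["cat"], vecs["dog"])
sim_cat_language = cosine(vecs["cat"], vecs["language"])
```

In this toy corpus "cat" and "dog" share contexts, so their similarity is higher than that of "cat" and "language", which share none.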

Denis Kiryanov, Tanya Panova (supervisor B.V. Orekhov)

This program has two functions: a) normalization of Yiddish text and b) transliteration from the Hebrew square script into the Latin alphabet. Both problems are highly relevant: until now there has been no normalizer apart from spell checkers, while almost every publishing house that printed books in Yiddish followed its own spelling practice. The normalizer is needed for work on the Yiddish corpus: it reduces all texts to a single spelling recognized by the parser. Transliteration will also allow typologists to work with Yiddish material.


The term "computational linguistics" usually refers to a broad area in which computer tools (programs and computer technologies for organizing and processing data) are used to model the functioning of language in particular conditions, situations, and problem domains, as well as to the scope of application of computer language models, not only in linguistics but also in related disciplines. Only in the latter case are we dealing with applied linguistics in the strict sense, since computer modeling of language can also be regarded as an application of programming theory (computer science) to linguistics. Nevertheless, in general practice the field of computational linguistics covers almost everything related to the use of computers in linguistics: "The term 'computational linguistics' sets a general orientation towards the use of computers to solve a variety of scientific and practical problems related to language, without limiting in any way the means of solving these problems."

Institutional aspect of computational linguistics. Computational linguistics took shape as a separate scientific field in the 1960s. The volume of publications in this area is very large. In addition to thematic collections, the journal Computational Linguistics is published quarterly in the USA. Extensive organizational and scientific work is carried out by the Association for Computational Linguistics, which has regional structures around the world (in particular, a European chapter). The international conference on computational linguistics, COLING, is held every two years. Relevant topics are also widely represented at international artificial intelligence conferences at various levels.

Cognitive toolkit of computational linguistics

Computational linguistics as a special applied discipline is distinguished primarily by its tool, that is, by the use of computer tools for processing language data. Since computer programs that model particular aspects of language functioning can use a wide variety of programming tools, there would seem to be no reason to speak of a common metalanguage. However, this is not the case. There are general principles of computer modeling of thinking that are implemented, one way or another, in any computer model. This metalanguage is based on the theory of knowledge developed in artificial intelligence, which forms an important branch of cognitive science.

The main thesis of the theory of knowledge states that thinking is a process of processing and generating knowledge. "Knowledge" is treated as an undefined category. The human cognitive system acts as a "processor" that processes knowledge. In epistemology and cognitive science, two main types of knowledge are distinguished: declarative ("knowing what") and procedural ("knowing how"). Declarative knowledge is usually presented as a set of propositions, statements about something. A typical example of declarative knowledge is the definition of a word in an ordinary explanatory dictionary. For example, cup: "a small rounded drinking vessel, usually with a handle, made of porcelain, faience, etc.". Declarative knowledge can be verified in terms of "true" or "false". Procedural knowledge is presented as a sequence (list) of operations, actions to be performed. It is a general instruction about how to act in a certain situation. A typical example of procedural knowledge is the instructions for using a household appliance.

Unlike declarative knowledge, procedural knowledge cannot be verified as true or false. It can be evaluated only by the success or failure of the algorithm.
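The contrast between the two kinds of knowledge can be sketched in code. In this minimal illustration, declarative knowledge is a set of propositions that can be checked as true or false, and procedural knowledge is an ordered list of operations judged only by whether it runs to completion. The relation names and the tea-making steps are hypothetical.

```python
# Declarative knowledge: propositions about a cup (relation names are assumptions).
FACTS = {
    ("cup", "is_a", "drinking vessel"),
    ("cup", "has_part", "handle"),
    ("cup", "made_of", "porcelain"),
}


def holds(subject, relation, value):
    """Verify a statement against the stored propositions: True or False."""
    return (subject, relation, value) in FACTS


# Procedural knowledge: a sequence of operations (hypothetical instruction).
TEA_STEPS = ["boil water", "put tea in the cup", "pour water", "wait three minutes"]


def make_tea():
    """Perform the steps in order and return the list of performed operations."""
    performed = []
    for step in TEA_STEPS:
        performed.append(step)
    return performed


cup_has_handle = holds("cup", "has_part", "handle")    # a true statement
cup_is_metal = holds("cup", "made_of", "steel")        # a false statement
succeeded = make_tea() == TEA_STEPS                    # success, not truth
```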

Most concepts in the cognitive toolkit of computational linguistics are homonymous: they simultaneously designate real entities of the human cognitive system and ways of representing those entities in some metalanguage. In other words, the elements of the metalanguage have an ontological and an instrumental aspect. Ontologically, the division into declarative and procedural knowledge corresponds to different types of knowledge in the human cognitive system. Thus, knowledge about specific objects of reality is mainly declarative, whereas a person's functional abilities to walk, run, or drive a car are realized in the cognitive system as procedural knowledge. Instrumentally, knowledge (whether ontologically procedural or declarative) can be represented either as a set of descriptions or as an algorithm, an instruction. In other words, ontologically declarative knowledge about the real-world object "table" can be represented procedurally as a set of instructions or algorithms for creating and assembling it (the creative aspect of procedural knowledge) or as an algorithm for its typical use (the functional aspect of procedural knowledge). In the first case this might be a manual for a novice carpenter, and in the second a description of what an office desk can be used for. The converse is also true: ontologically procedural knowledge can be represented declaratively.

Whether any ontologically declarative knowledge can be represented as procedural, and any ontologically procedural knowledge as declarative, requires separate discussion. Researchers agree that any declarative knowledge can, in principle, be represented procedurally, although this may prove very uneconomical for a cognitive system. The reverse is hardly true. The point is that declarative knowledge is much more explicit and easier for a person to grasp than procedural knowledge. In contrast to declarative knowledge, procedural knowledge is predominantly implicit. Thus, linguistic ability, being procedural knowledge, is hidden from a person and is not consciously recognized. An attempt to make the mechanisms of language functioning explicit leads to dysfunction. Specialists in lexical semantics know, for example, that the prolonged semantic introspection needed to study the content plane of a word causes the researcher to partially lose the ability to distinguish between correct and incorrect uses of the word under analysis. Other examples can be given. It is known that, from the point of view of mechanics, the human body is a complex system of two interacting pendulums.

In knowledge theory, various knowledge structures are used to study and represent knowledge - frames, scenarios, plans. According to M. Minsky, "a frame is a data structure designed to represent a stereotyped situation" [Minsky 1978, p.254]. In more detail, we can say that the frame is a conceptual structure for the declarative representation of knowledge about a typified thematically unified situation containing slots interconnected by certain semantic relationships. For purposes of illustration, a frame is often represented as a table, the rows of which form slots. Each slot has its own name and content (see Table 1).

Table 1

Fragment of the "table" frame in a table view

Depending on the specific task, the structure of a frame can be considerably more complex; a frame can include nested subframes and references to other frames.

Instead of a table, a predicate form of representation is often used. In this case, the frame takes the form of a predicate or a function with arguments. There are other ways to represent a frame. For example, it can be represented as a tuple of the following form: ((frame name) (slot name 1) (slot value 1), ..., (slot name n) (slot value n)).

Frames in knowledge representation languages typically have this form.
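The slot structure described above can be sketched as a small data structure. This is only an illustration, not the format of any particular knowledge representation language; the slot names and values for the "table" frame are assumptions and do not reproduce Table 1.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A named frame whose slots map slot names to values or to nested subframes."""
    name: str
    slots: dict = field(default_factory=dict)

    def lookup(self, *path):
        """Follow a chain of slot names through nested subframes."""
        value = self
        for slot in path:
            value = value.slots[slot]
        return value

    def as_tuple(self):
        """Return the tuple form: (frame name, (slot name, slot value), ...)."""
        pairs = []
        for slot, value in self.slots.items():
            if isinstance(value, Frame):
                value = value.as_tuple()
            pairs.append((slot, value))
        return (self.name, *pairs)


# Hypothetical slot contents, chosen only for illustration.
table = Frame("table", {
    "function": "surface for work or meals",
    "material": "wood",
    "legs": Frame("legs", {"count": 4}),   # nested subframe
})

leg_count = table.lookup("legs", "count")
tuple_form = table.as_tuple()
```

Here the same frame can be read as a table of slots, queried through nested subframes, or flattened into the tuple form.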

Like other cognitive categories of computational linguistics, the concept of a frame is homonymous. Ontologically, it is a part of the human cognitive system, and in this sense the frame can be compared with such concepts as gestalt, prototype, stereotype, and schema. In cognitive psychology, these categories are considered precisely from the ontological point of view. Thus, D. Norman distinguishes two main ways in which knowledge exists and is organized in the human cognitive system: semantic networks and schemas. "Schemas," he writes, "are organized packets of knowledge assembled to represent distinct, self-contained units of knowledge. My schema for Sam may contain information describing his physical features, his activities, and personality traits. This schema correlates with other schemas that describe his other aspects" [Norman 1998, p. 359]. On the instrumental side, the frame is a structure for the declarative representation of knowledge. In current AI systems, frames can form complex knowledge structures; frame systems allow for hierarchy, so that one frame can be part of another frame.

In terms of content, the concept of a frame is very close to the category of interpretation. Indeed, a slot is an analogue of a valency, and the filler of a slot is an analogue of an actant. The main difference between them is that an interpretation contains only linguistically relevant information about the content plane of a word, whereas a frame, first, is not necessarily tied to a word and, second, includes all the information relevant to a given problem situation, including extralinguistic information (knowledge of the world).

A scenario is a conceptual structure for the procedural representation of knowledge about a stereotyped situation or stereotyped behavior. The elements of a scenario are the steps of an algorithm or instruction. One usually speaks of a "restaurant scenario", a "buying scenario", and so on.

The frame was also originally used for procedural representation (cf. the term "procedural frame"), but the term "scenario" is now more commonly used in this sense. A scenario can be represented not only as an algorithm but also as a network whose vertices correspond to situations and whose arcs correspond to connections between situations. Along with the concept of a scenario, some researchers use the category of a script for computer modeling of intelligence. According to R. Schank, a script is a generally accepted, well-known sequence of causal relationships. For example, understanding the dialogue

- It is pouring buckets outside.

- You still have to go to the store: there is nothing in the house, the guests ate everything yesterday.

is based on non-explicit semantic connections such as "if it rains, it is undesirable to go outside, because you can get sick." These connections form a script, which is used by native speakers to understand each other's verbal and non-verbal behavior.

As a result of applying the scenario to a specific problem situation, a plan is formed. A plan is used to represent procedurally the knowledge about possible actions leading to a specific goal. A plan relates a goal to a sequence of actions.

In the general case, a plan includes a sequence of procedures that transfer the system from its initial state to a final state and lead to the achievement of a certain subgoal and goal. In AI systems, a plan arises as a result of the planning activity of the corresponding module, the planning module. The planning process may be based on adapting data from one or more scenarios, activated by testing procedures, to the resolution of the problem situation. The plan is carried out by an executive module that controls the cognitive procedures and physical actions of the system. In the elementary case, a plan in an intelligent system is a simple sequence of operations; in more complex versions, the plan is tied to a specific subject, its resources, capabilities, and goals, detailed information about the problem situation, and so on. A plan emerges through communication between the model of the world (part of which is formed by scenarios), the planning module, and the executive module.

Unlike a scenario, a plan is tied to a specific situation and a specific performer, and pursues a specific goal. The choice of plan is governed by the performer's resources. Feasibility is a necessary condition for a plan to be generated in the cognitive system, whereas the notion of feasibility does not apply to a scenario.
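The difference between a scenario and a plan can be sketched as follows. A scenario is a generic sequence of steps, while a plan binds it to a performer, a goal, and the performer's resources, and is generated only if it is feasible. The step costs, the budget parameter, and all names here are hypothetical; real planning modules are far more elaborate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str
    cost: int = 0   # hypothetical resource the step consumes


# A scenario: generic, not tied to any performer or situation.
BUYING_SCENARIO = (
    Step("go to the store"),
    Step("choose the goods"),
    Step("pay", cost=10),
    Step("carry the goods home"),
)


def make_plan(scenario, performer, budget, goal):
    """Bind a scenario to a performer and a goal; return None if infeasible."""
    total_cost = sum(step.cost for step in scenario)
    if total_cost > budget:
        return None   # an infeasible plan is not generated at all
    return {
        "performer": performer,
        "goal": goal,
        "actions": [step.action for step in scenario],
    }


feasible = make_plan(BUYING_SCENARIO, "Anna", budget=20, goal="have food at home")
infeasible = make_plan(BUYING_SCENARIO, "Anna", budget=5, goal="have food at home")
```

The same scenario yields a plan for a performer with sufficient resources and no plan otherwise, while the scenario itself is neither feasible nor infeasible.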

Another important concept is the model of the world. The model of the world is usually understood as a body of knowledge about the world, organized in a certain way, that is inherent in a cognitive system or its computer model. In a somewhat more general sense, the model of the world is described as the part of the cognitive system that stores knowledge about the structure of the world, its regularities, and so on. In another sense, the model of the world is associated with the results of understanding a text or, more broadly, a discourse. In the process of understanding a discourse, a mental model of it is built, which results from the interaction between the content plane of the text and the knowledge about the world possessed by the given subject [Johnson-Laird 1988, p. 237 et seq.]. The first and second understandings are often combined. This is typical of linguists working within cognitive linguistics and cognitive science.

Closely related to the category of frame is the concept of a scene. In the literature, the category of scene is mainly used to designate a conceptual structure for the declarative representation of situations and their parts that are actualized in a speech act and highlighted by linguistic means (lexemes, syntactic constructions, grammatical categories, etc.). Being associated with linguistic forms, a scene is often activated by a particular word or expression. In story grammars (see below), a scene appears as part of an episode or narrative. Characteristic examples of scenes are a set of blocks that an AI system works with, or the setting of a story and the participants in its action. In artificial intelligence, scenes are used in image recognition systems, as well as in programs focused on the study (analysis, description) of problem situations. The concept of a scene has become widespread in theoretical linguistics, as well as in logic, in particular in situation semantics, in which the meaning of a lexical unit is directly associated with a scene.

Computational linguists develop algorithms for recognizing text and spoken speech, synthesize artificial speech, build semantic translation systems, and work on artificial intelligence itself (in the classical sense of the term, as a replacement for human intelligence, it is unlikely ever to appear, but various expert systems based on data analysis will emerge).

Speech recognition algorithms will be increasingly used in everyday life - in "smart homes" and electronic appliances there will be no remotes and buttons, and a voice interface will be used instead. This technology is being refined, but there are still many challenges: it is difficult for a computer to recognize human speech, because different people speak very differently. Therefore, as a rule, recognition systems work well either when they are trained for one speaker and already adjusted to his pronunciation features, or when the number of phrases that the system can recognize is limited (as, for example, in voice commands for TV).

Specialists who build semantic translation programs still have a lot of work ahead of them: at the moment, good algorithms exist only for translation into and from English. There are many problems here. Different languages are organized differently in semantic terms, the differences show even at the level of phrase construction, and not all meanings of one language can be conveyed with the semantic apparatus of another. In addition, the program must distinguish between homonyms, correctly recognize parts of speech, and choose the meaning of a polysemous word that fits the context.
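One simple way to choose the meaning of a polysemous word that fits the context is to compare the context with dictionary glosses, in the spirit of a simplified Lesk-style overlap. The sketch below is only an illustration of that idea, not how production translation systems work; the sense inventory and glosses are invented for the example.

```python
# Hypothetical sense inventory: sense label -> short gloss.
SENSES = {
    "bank": {
        "financial institution": "an institution that keeps money and gives loans",
        "river side": "the sloping land along the edge of a river",
    }
}


def disambiguate(word, sentence):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = set(sentence.lower().split())
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


sense_river = disambiguate("bank", "we sat on the bank of the river")
sense_money = disambiguate("bank", "the bank gives loans to farmers")
```

A practical system would at least filter out function words and use a real dictionary; this version counts every shared word equally.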

Synthesizing artificial speech (for example, for home robots) is also painstaking work. It is difficult to make artificially created speech sound natural to the human ear, because there are millions of nuances that we do not notice but without which the result no longer sounds right: false starts, pauses, hesitations, and so on. The speech stream is continuous and at the same time discrete: we speak without pausing between words, yet we have no difficulty telling where one word ends and another begins, whereas for a machine this is a major problem.

The largest area of computational linguistics is connected with Big Data. There are huge corpora of texts, such as news feeds, from which specific information has to be extracted, for example to highlight newsworthy events or to tailor an RSS feed to the tastes of a particular user. Such technologies already exist and will continue to develop, because computing power is growing rapidly. Linguistic analysis of texts is also used to ensure security on the Internet and to search for information needed by intelligence services.

Where to study as a computational linguist? Unfortunately, in our country there is a rather sharp division between specialties related to classical linguistics and those related to programming, statistics, and data analysis. To become a computational linguist, you need to understand both. Foreign universities have degree programs in computational linguistics, but for us the best option is still to get a basic linguistic education and then master the basics of IT. It is good that there are now many different online courses; unfortunately, this was not the case in my student days. I studied at the Faculty of Applied Linguistics at Moscow State Linguistic University, where we had courses in artificial intelligence and speech recognition, but it was still not enough. Now IT companies are actively trying to cooperate with universities. My colleagues from Kaspersky Lab and I also try to take part in the educational process: we give lectures, hold student conferences, and award grants to graduate students. But for now, the initiative comes more from employers than from universities.

Linguistic informatics is a part of the theory of information services. This theory arose in connection with the computerization of speech, that is, with the use of computers as a means of recording and storing linguistic information. Thanks to technology, it became possible to combine the functions of a library, an archive, and an office.

Large classes of texts are processed by automatic abstracting. The constantly growing volume of scientific and technical information, which is becoming ever more laborious to search, gave rise to the idea of searching through so-called secondary texts, which present the information of the primary document in condensed form: a bibliographic description, an annotation, an abstract, a scientific translation.

The primary text is condensed by compressing it. Special methods for condensing the primary text have been developed:

a) statistical-distributional methods, in which the most informative sentences are selected, namely those in which the linguistic features most significant for the given text are concentrated;

b) methods using semantic indicators, in which the most meaningful "points" of the text are marked (the subject of study, purpose, methods, relevance, scope, conclusions, results);

c) the method of textual links, in which taking interphrase links into account makes the abstract complete.
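The first of these methods, selecting the most informative sentences, can be sketched with a simple frequency-based scorer. This is a minimal illustration under stated assumptions: "significant features" are approximated by word frequency within the text, and the stopword list and function names are invented for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "are", "to", "it", "on", "for"}


def split_sentences(text):
    """Split text on whitespace that follows sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n=1):
    """Return the n sentences with the highest average word frequency, in text order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


text = (
    "Machine translation compares units of two languages. "
    "Machine translation needs dictionaries. "
    "The weather was nice yesterday."
)
summary = summarize(text, n=1)
```

Sentences that concentrate the words repeated throughout the text score highest, while the off-topic sentence about the weather is ranked last.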

3. Practical terminology.
Practical terminology includes sections:

a) lexicographic terminology, which deals with the theory and practice of creating special dictionaries, unifying term systems, translating terms, creating terminological data banks, automating their storage and processing.

b) lexicography itself became the subject of applied linguistics as one of the most time-consuming types of practical linguistics. Dictionaries have been created for decades. Therefore, the desire of scientists to automate lexicographic activity is quite understandable. There are automatic dictionaries. Their purpose is to increase labor productivity when working with texts, collecting, storing and processing various units of the language. Dictionaries of this type are used in automatic text processing systems.

Automatic translation.

Automatic, or machine, translation is based on the assumption that the structures of typologically different languages (vocabulary, word order, inflection, syntactic structures) can be brought into correspondence. The linguistic principle of translation is to match linguistic units of two or more languages that are equivalent in meaning.

There are two stages in the development of automatic translation systems. At the first stage, such fundamental problems of machine translation were addressed as the creation of automatic dictionaries, the development of an intermediary language, the formalization of grammar, the resolution of homonymy, and the processing of idiomatic expressions. At the second stage, set-theoretic models of grammar, dependency grammar models, immediate constituent models, and generative grammar models continue to be developed fruitfully and put into practice. During this period, semantics based on the "Meaning - Text" model becomes more and more actively involved in applied linguistics. The centers of applied linguistics that have emerged in domestic and foreign universities are developing strategies for machine translation. These include the Laboratory of Mathematical Linguistics at St. Petersburg University and at the Institute of Applied Mathematics of the Russian Academy of Sciences; the All-Union Translation Center; the Speech Statistics group at the Leningrad Pedagogical Institute under the direction of Raymond Genrikhovich Piotrovsky; and the group for the study of syntactic modeling within the "Meaning - Text" framework under the leadership of Igor Alexandrovich Melchuk.

A new stage in the improvement of machine translation is associated with the use of an intermediary language - a knowledge representation language. It is based on the analysis of the meaning of the sentence obtained when comprehending the input sentence, supplemented and marked up with the help of information from the knowledge base and in its terms. The process of translation is the transformation of an input sentence of language X into an output structure of language Y. In other words, the result of machine translation is not a translation itself, but rather a retelling of the source text (X). The quality of translation depends on the efficiency of the knowledge representation language. The high quality of machine translation can only be ensured by the creation of reliable linguistic foundations and software tools for building powerful semantic networks based on automated lexicons.

IV. Ethnolinguistics.

Ethnolinguistics (ethnosemantics, anthropolinguistics) is a field of linguistics that studies language in its relationship with the culture of a particular ethnic group. The foundations of ethnolinguistics were laid in the works of Franz Boas and Edward Sapir in the first quarter of the 20th century. In the second half of the 20th century, ethnolinguistics took shape as an independent branch of linguistics. Ethnolinguistic studies of this period are characterized by such features as the use of methods from experimental psychology; comparison of the semantic models of different languages; the study of folk taxonomy; paralinguistic research; the reconstruction of spiritual ethnic culture on the basis of language data; and a revival of interest in folklore.

Central to ethnolinguistics are two closely related problems that can be called "cognitive" and "communicative":

1. How, by what means and in what form are the cultural (domestic, religious, social, etc.) ideas of the people who speak this language about the world around them and about the place of man in this world reflected in the language?

2. What forms and means of communication - primarily linguistic communication - are specific to a given ethnic or social group?

In accordance with these problems, two directions have emerged in ethnolinguistics: cognitively oriented ethnolinguistics and communicatively oriented ethnolinguistics.

a) Cognitively oriented ethnolinguistics.

Cognitively oriented ethnolinguistics is characteristic of American linguistics, where it is known as anthropological linguistics. Initially, anthropological linguistics focused on the study of the cultures of peoples that differed sharply from European ones, primarily the American Indians. Establishing genetic relationships between these languages and describing their current state were subordinated to the task of comprehensively describing the culture of these peoples and reconstructing their history, including migration routes. The recording and interpretation of everyday and folklore texts was an integral component of the anthropological description.

Following Franz Boas, anthropological linguistics holds that the more finely a language classifies some fragment of reality, the more important the corresponding aspect of the given culture. As the American linguist and anthropologist Harry Hoijer notes, "peoples who live by hunting and gathering, such as the Apache tribes in the American Southwest, have an extensive vocabulary of names for animals and plants, as well as for phenomena of the surrounding world. Peoples whose main source of livelihood is fishing (in particular, the Indians of the northern Pacific coast) have in their vocabulary a detailed set of names for fish, as well as for fishing tools and techniques."

Ethnolinguists have paid the greatest attention to such taxonomic systems as designations of body parts, kinship terms, the so-called ethnobiological classifications, that is, the names of plants and animals (the American scholar B. Berlin, Anna Wierzbicka), and especially color terms (B. Berlin and P. Kay, A. Wierzbicka).

In modern anthropological ethnolinguistics, one can conditionally distinguish between “relativistic” and “universalist” directions: for the first, the priority is the study of cultural and linguistic specifics in the picture of the speaker’s world, for the second, the search for universal properties of vocabulary and grammar of natural languages.

An example of research in the relativistic direction of ethnolinguistics is the work of Yuri Derenikovich Apresyan, Nina Davidovna Arutyunova, Anna Wierzbicka, Tatyana Vyacheslavovna Bulygina, Alexei Dmitrievich Shmelev, and E.S. Yakovleva on the distinctive features of the Russian linguistic picture of the world. These authors analyze the meaning and use of words that either denote unique concepts not typical of the conceptualization of the world in other languages (toska "longing" and udal' "daring", avos' "perhaps" and nebos' "probably"), or correspond to concepts that exist in other cultures but are especially significant for Russian culture or receive a special interpretation in it (pravda and istina, both "truth"; svoboda "freedom" and volya "will"; sud'ba "fate" and dolya "lot"). As an example, here is a fragment of the description of the word avos' from the book by T.V. Bulygina and A.D. Shmelev, "Linguistic Conceptualization of the World":

«<...> avos' ("perhaps") does not at all mean the same as simply "possibly" or "maybe". <...> most often, avos' is used as a kind of excuse for carelessness, when what is at stake is the hope not so much that some favorable event will happen as that some extremely undesirable consequence will be avoided. Of a person who buys a lottery ticket, one would not say that he is acting "na avos'" (trusting to avos'). Rather, this can be said of a person who <...> saves money by not buying health insurance and hopes that nothing bad will happen <...> Thus, hoping for avos' is not simply hoping for good luck. If the symbol of fortune is roulette, then hoping for avos' can be symbolized by "Russian roulette".»

An example of research in the universalist direction of ethnolinguistics is the work of the Polish scholar Anna Wierzbicka, devoted to the principles of describing linguistic meanings. The goal of many years of research by A. Wierzbicka and her followers is to establish a set of so-called "semantic primitives", universal elementary concepts which each language can combine into an infinite number of configurations specific to that language and culture. Semantic primitives are lexical universals; in other words, they are elementary concepts for which every language has a word. These concepts are intuitively clear to a native speaker of any language, and on their basis one can build interpretations of arbitrarily complex language units. Studying material from genetically and culturally diverse languages of the world, including the languages of Papua New Guinea, Austronesian languages, African languages, and the languages of Australian Aboriginal peoples, A. Wierzbicka constantly refines the list of semantic primitives. Her work Interpretation of Emotional Concepts lists them as follows:

"substantives" - I, you, someone, something, people;
"determiners and quantifiers" - this, the same, other, one, two, many, all;
"mental predicates" - think (about), say, know, feel, want;
"actions and events" - do, happen;
"evaluators" - good, bad;
"descriptors" - big, small;
"time and place" - when, where, after / before, under / above;
"metapredicates" - not / no / negation, because / because of, if, can;
"intensifier" - very;
"taxonomy and partonomy" - kind / variety, part;
"hedge / prototype" - like.

From semantic primitives, as from "bricks", A. Wierzbicka assembles interpretations of even such subtle concepts as emotions. For example, she manages to demonstrate the subtle difference between the concept of American culture denoted by the word "happy" and the concept denoted by the Russian word "schastliv" (and by similar Polish, French, and German adjectives). The word "schastliv", as A. Wierzbicka writes, although usually considered the dictionary equivalent of the English word "happy", has a narrower meaning in Russian culture: it is "usually used to refer to rare states of complete bliss or complete satisfaction derived from such serious things as love, family, the meaning of life, etc." Here is how this difference is formulated in the language of semantic primitives (components of interpretation B that are absent from interpretation A are given in capital letters).

Interpretation A: X feels happy
X feels something
something good happened to me
I wanted it
I don't want anything else
X feels something like this

Interpretation B: X is happy (Russian "schastlivyj")
X feels something
sometimes people think like this:
something VERY good happened to me
I wanted it
EVERYTHING IS FINE
I CAN'T WANT anything else
so this person feels something good
X feels something like this
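
The highlighting convention used in the two interpretations (capitalizing what the second formulation adds to the first) can be reproduced mechanically for a pair of lines. The following is only a minimal sketch in Python based on the standard difflib module; the function name highlight_additions and the word-by-word comparison are illustrative assumptions and are not part of Wierzbicka's own method.

```python
from difflib import SequenceMatcher


def highlight_additions(base: str, extended: str) -> str:
    """Upper-case the words of `extended` that have no counterpart in `base`.

    Hypothetical helper: compares the two lines word by word and marks
    inserted or replaced words, mimicking the capital-letter convention.
    """
    a, b = base.split(), extended.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        words = b[j1:j2]
        if tag in ("insert", "replace"):
            # Words present only in the extended line are highlighted.
            words = [w.upper() for w in words]
        out.extend(words)
    return " ".join(out)


# One line from interpretation A and its counterpart in interpretation B.
print(highlight_additions("something good happened to me",
                          "something very good happened to me"))
# prints: something VERY good happened to me
```

A word-level comparison of this kind only marks surface differences between two lines; deciding which components of meaning actually differ remains a matter of linguistic analysis.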

For Wierzbicka's research program, it is important that the search for universal semantic primitives is carried out empirically, using the methods of field linguistics, that is, work with informants. First, for each individual language, the role that a given concept plays in the interpretation of other concepts is determined; second, for each concept, the set of languages is identified in which the concept is lexicalized, that is, in which there is a special word expressing it.

B) Communicatively oriented ethnolinguistics.

The most significant results in communicatively oriented ethnolinguistics are associated with the direction called the "ethnography of speaking" or the "ethnography of communication". The ethnography of speaking, as a theory and method for analyzing language use in its sociocultural context, was proposed in the early 1960s in the works of Dell Hymes and John J. Gumperz and developed in the works of the American scholar Aaron Cicourel, J. Bauman and W. A. Corsaro. An utterance is investigated only in connection with the speech or communicative event within which it is produced. The cultural conditioning of every speech event (a sermon, a court hearing, a telephone conversation, etc.) is emphasized. The rules of language use are established through participant observation (taking part in the speech event), analysis of spontaneous data, and interviews with native speakers of the language.

Within the framework of this direction, researchers study the models of speech behavior adopted in a particular culture or in a particular ethnic or social group. For example, in the culture of the "Central European standard", the etiquette of an informal conversation among several people assumes that the participants will not interrupt each other and that everyone is given the opportunity to speak in turn; someone who wants to speak usually signals this with words such as "allow me to say" or "may I ask". Those who want to leave the conversation announce their intention with words such as "unfortunately, I have to go" or "I have to leave for a while". Quite different norms of public speech behavior are accepted, for example, in a number of Australian Aboriginal cultures. Respect for the individual rights of each participant in a conversation is not a mandatory rule in these communities: several interlocutors can speak at the same time, it is not necessary to react to what another person has said, the speaker speaks without addressing anyone in particular, the interlocutors may not look at each other, and so on. This model of speech behavior rests on the premise that all utterances are somehow accumulated in the surrounding world, and therefore the "reception" of a message does not have to follow its "transmission" immediately.

Another important topic in the ethnography of communication is the linguistic expression of the relative social status of the interlocutors: the rules for addressing an interlocutor, including the use of titles, address by first name, surname, or first name and patronymic, professional forms of address (for example, "doctor", "comrade major", "professor"), the appropriateness of the informal "ty" and the formal "vy" (the two Russian forms of "you"), and so on. Languages in which the relative social position of the speaker and the listener is fixed not only in vocabulary but also in grammar are studied especially closely. An example is Japanese, where the choice of the grammatical form of a verb depends on whether the listener is higher or lower than the speaker in the social hierarchy, and also on whether the speaker and the listener belong to the same social group. In addition, the relationship between the speaker and the person being spoken about is taken into account. As a result of the combined action of these restrictions, the same person uses different verb forms when addressing a subordinate and a boss, a colleague and a stranger, his own wife and a neighbor's wife.

The grammar also reflects another feature of Japanese speech etiquette: the desire to avoid intruding into the sphere of the interlocutor's thoughts and feelings. In Japanese there is a special grammatical form of the verb, the so-called desiderative mood. Using the desiderative suffix -tai, the speaker expresses the desire to perform the action denoted by the original verb: "read" + tai = "I want to read", "leave" + tai = "I want to leave". However, desiderative forms are possible only when speakers describe their own wish. The desire of an interlocutor or a third person is expressed by a special construction that means approximately "judging by outward signs, one can conclude that person X wants to perform action Y". Thus, obeying the requirements of the grammar, a Japanese speaker can make direct judgments only about his or her own intentions. The language simply does not allow direct statements about the internal state of another person, for example about that person's desires. One can say "I want ...", but one cannot say "You want ..." or "He wants ..."; one can only say "It seems to me (I have the impression) that you want ..." or "It seems to me (I have the impression) that he wants ...".

In addition to the norms of speech etiquette, the ethnography of communication also studies speech situations that are ritualized in particular cultures, such as a court session, a dissertation defense or a trade deal; the rules for choosing a language in interlingual communication; and the linguistic conventions and clichés which signal that a text belongs to a certain genre ("once upon a time" in fairy tales, "heard and resolved" in the minutes of a meeting).

Modern ethnolinguistics is closely connected with sociology, psychology, and semiotics. In Russian ethnolinguistics, a special place is occupied by research at the intersection of ethnolinguistics, folklore, and comparative historical linguistics. First of all, this is a research program dedicated to the ethno-linguistic and ethno-cultural history of the Slavic peoples (Nikita Ilyich Tolstoy, Svetlana Mikhailovna Tolstaya, Vladimir Nikolaevich Toporov). Within the framework of this program, ethnolinguistic atlases are compiled, rituals, beliefs, and folklore are mapped; the structure of codified Slavic texts of certain genres, including incantatory texts, riddles, funerary and building rituals, etc., is studied in relation to the data of comparative historical and archaeological research.

    History of the development of computational linguistics

    The formation of modern linguistics as a science of natural language is the result of a long historical development of linguistic knowledge. Linguistic knowledge rests on elements that took shape in the course of activity inextricably linked with the development of the structure of oral speech, the emergence, further development and improvement of writing, the teaching of writing, and the interpretation and decoding of texts.

    Natural language, as the object of linguistics, occupies a central place in this science. As language developed, ideas about it changed as well. Earlier, no special importance was attached to the internal organization of language, which was considered primarily in the context of its relationship with the outside world; from the late 19th and early 20th centuries, however, a special role has been given to the internal formal structure of language. It was during this period that the famous Swiss linguist Ferdinand de Saussure developed the foundations of semiology and structural linguistics, which were set out in detail in his book Course in General Linguistics (1916).

    Saussure is credited with the idea of treating language as a single mechanism, an integral system of signs, which in turn makes it possible to describe language mathematically. He was the first to propose a structural approach to language, namely describing a language by studying the relationships between its units. By a unit, or "sign", he understood a word, which combines both meaning and sound. His concept rests on the theory of language as a system of signs and distinguishes three notions: language (French langue), speech (French parole) and speech activity (French langage).

    Saussure himself defined the science he created, semiology, as "a science that studies the life of signs within the framework of the life of society." Since language is a sign system, Saussure, in answering the question of what place linguistics occupies among other sciences, argued that linguistics is a part of semiology. It is generally accepted that it was the Swiss philologist who laid the theoretical foundation of a new direction in linguistics, becoming the founder, the "father", of modern linguistics.

    The concept put forward by F. de Saussure was further developed in the works of many outstanding scholars: L. Hjelmslev in Denmark, N. Trubetzkoy in Czechoslovakia, and L. Bloomfield, Z. Harris and N. Chomsky in the USA. In Russia, structural linguistics began to develop at about the same time as in the West, at the turn of the 19th and 20th centuries, in the works of F. Fortunatov and I. Baudouin de Courtenay. It should be noted that I. Baudouin de Courtenay worked closely with F. de Saussure. If Saussure laid the theoretical foundation of structural linguistics, then Baudouin de Courtenay can be considered the person who laid the foundations for the practical application of the methods proposed by the Swiss scholar. It was he who defined linguistics as a science that uses statistical methods and functional dependencies, and who separated it from philology. The first field of linguistics in which mathematical methods were applied was phonology, the science of the sound structure of a language.

    It should be noted that the postulates put forward by F. de Saussure proved relevant to the problems that linguistics faced in the middle of the 20th century. It was during this period that a clear trend towards the mathematization of the science of language emerged. In practically all large countries, science and computer technology began to develop rapidly, which in turn called for ever new linguistic foundations. The result was a rapid convergence of the exact sciences and the humanities, and the active interaction of mathematics and linguistics found practical application in solving urgent scientific problems.

    In the 1950s, at the intersection of mathematics, linguistics, computer science and artificial intelligence, a new branch of science arose: computational linguistics (also known as machine linguistics or automatic natural language text processing). The main stages in its development took place against the backdrop of the evolution of artificial intelligence methods. The creation of the first computers gave a powerful impetus to the development of computational linguistics, and the advent of a new generation of computers and programming languages in the 1960s opened a fundamentally new stage in the development of this science. It should also be noted that the origins of computational linguistics go back to the works of the famous American linguist N. Chomsky on formalizing the structure of language. The results of his research, obtained at the intersection of linguistics and mathematics, formed the basis of the theory of formal languages and grammars (generative grammars), which is widely used to describe both natural and artificial languages, in particular programming languages. To be more precise, this theory is a thoroughly mathematical discipline. It can be considered one of the first in the area of applied linguistics known as mathematical linguistics.

    The first experiments and developments in computational linguistics concerned the creation of machine translation systems, as well as systems that simulate human language abilities. In the late 1980s, with the advent and active development of the Internet, the volume of text available in electronic form grew rapidly. As a result, information retrieval technologies moved to a qualitatively new stage of development. A need arose for automatic processing of natural language texts, and completely new tasks and technologies appeared. Researchers were faced with the problem of rapidly processing a huge stream of unstructured data. To solve it, great importance was attached to the development and application of statistical methods in automatic text processing. With their help it became possible to solve such problems as dividing texts into clusters united by a common theme, identifying particular fragments in a text, and so on. In addition, the methods of mathematical statistics and machine learning made it possible to tackle speech recognition and the creation of search engines.

    Researchers did not stop at the results achieved: they continued to set new goals and objectives and to develop new research techniques and methods. As a result, linguistics began to act as an applied science combining a number of other sciences, the leading role among which belonged to mathematics, with its variety of quantitative methods and the possibility of using them for a deeper understanding of the phenomena under study. Thus began the formation and development of mathematical linguistics. It is still a rather "young" science (it has existed for about fifty years); nevertheless, it is already an established field of scientific knowledge with many successful achievements.
