Everyday natural language processing tools, such as those mentioned in the previous section, provide new educational opportunities. The goal of our courses is to show students the capabilities of these tools, and especially to encourage them to take a reflective and analytic approach to their use.
The aim of this book is to provide insight into how computers support language-related tasks and a framework for thinking about these tasks. There are two major running themes. The first is an emphasis on the relation between the aspects of language that must be made explicit to obtain working language technology and the ways in which they can be represented. We introduce and motivate the need for explicit representations, which arises from the fact that computers cannot work with language directly but require a commitment to linguistic modeling and to data structures that can be expressed in bits and bytes. We emphasize the representational choices involved in modeling linguistic abstractions in a concrete form that computers can use.
The second running theme is the means needed to obtain the knowledge about language that the computer requires. There are two main options: either we arrange for the computer to learn from examples, or we arrange for experts to create rules that encode what we know about language. This is an important distinction, both for students whose primary training is in formal linguistics, to whom the idea that such knowledge can be learned may be unfamiliar, and for the increasing number of students exposed to the “new AI” tradition of machine learning, to whom the idea of creating and using hand-coded knowledge representations may be surprising. Our view is that the typical real-world system uses a synthesis of both approaches, so we want students to understand and respect the strengths and weaknesses of both the data-driven and the theory-driven traditions.
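As a toy illustration of the contrast, consider deciding whether an English noun is plural. The sketch below, with a task and data invented by us for illustration, pits a hand-written rule against a lookup table learned from labeled examples; neither approach is adequate on its own.

```python
# Hypothetical toy task: is an English noun plural?

def is_plural_rule(noun):
    """Expert-written rule: most English plurals end in -s."""
    return noun.endswith("s")

def learn_plural(examples):
    """Learning from examples: memorize labeled data."""
    return dict(examples)

learned = learn_plural([("cats", True), ("oxen", True), ("bus", False)])

print(is_plural_rule("bus"))    # True -- the rule overgeneralizes
print(is_plural_rule("oxen"))   # False -- the rule misses irregulars
print(learned.get("oxen"))      # True -- the data covers this case
print(learned.get("horses"))    # None -- but it cannot generalize
```

The rule generalizes to unseen words but makes systematic errors; the memorized data is exact but silent on anything new, which is why real systems typically combine the two.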
This chapter lays the groundwork for understanding how natural language is used on a computer by outlining how language can be represented. The chapter has two halves, focusing on the two ways in which language is transmitted: text and speech. The text half outlines the range of writing systems in use and then turns to how information is encoded on the computer, specifically how all writing systems can be encoded effectively. The speech half offers an overview of both the articulatory and the acoustic properties of speech. This provides a platform for discussing automatic speech recognition and text-to-speech synthesis. The chapter closes with a discussion of language modeling in the context of speech recognition.
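For a taste of what such encoding involves, the sketch below (in Python, with illustrative characters chosen by us) shows how Unicode assigns characters from different writing systems numeric code points, which the UTF-8 encoding then stores as one or more bytes.

```python
# Unicode assigns each character a numeric code point;
# UTF-8 then stores that code point as one or more bytes.
for ch in "aéあ":
    print(ch, hex(ord(ch)), ch.encode("utf-8"))
# a  0x61    b'a'                -- basic Latin: one byte
# é  0xe9    b'\xc3\xa9'         -- accented Latin: two bytes
# あ 0x3042  b'\xe3\x81\x82'     -- Japanese hiragana: three bytes
```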
This chapter sets out to (i) explain what is currently known about the causes of and reasons for spelling errors; (ii) introduce the main techniques for the separate but related tasks of nonword error detection, isolated-word spelling correction, and real-word spelling correction; and (iii) introduce the linguistic representations that are currently used to perform grammar correction – including a lengthy discussion of syntax – and explain the techniques employed. The chapter describes classical computational techniques, such as dynamic programming for calculating edit distance between words. It concludes with a discussion of advances in technology for applying spelling correction to newer contexts, such as web queries.
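To preview the dynamic-programming technique, here is a minimal sketch of Levenshtein edit distance, which counts the fewest single-character insertions, deletions, and substitutions needed to turn one word into another; the example words are our own.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (Levenshtein distance)."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

print(edit_distance("graffe", "giraffe"))  # 1 (insert the missing i)
```

Dynamic programming makes this efficient by filling in the table of prefix distances once, rather than re-exploring the same subproblems repeatedly.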
In this chapter, we seek to (i) introduce some fundamentals of first and second language acquisition and the relevance of language awareness for the latter; (ii) explain how computer-assisted language learning (CALL) tools can offer feedback on exercises without encoding anything about language in general; (iii) motivate and exemplify the point that the space of well-formed and ill-formed variation arising in language use often lies well beyond what a CALL tool can capture; (iv) show how tokenization and part-of-speech tagging, two fundamental NLP processing steps, address the need for linguistic abstraction and generalization, how even such basic steps can be surprisingly difficult, and how part-of-speech classes are informed by different types of empirical evidence; (v) motivate the need for analysis beyond the word level and for syntactic generalizations; and (vi) showcase what a real-life intelligent language tutoring system looks like and how it uses linguistic analysis both to analyze learner input and to provide feedback. The chapter ends with a section discussing how, in addition to the context of language use and linguistic analysis, learner modeling can play an important role in tutoring systems.
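As a hint of why even tokenization is surprisingly difficult, consider the naive tokenizer sketched below, a simple regular-expression split of our own devising, which stumbles over contractions, abbreviations, and hyphenated words.

```python
import re

def naive_tokenize(text):
    # Split into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Don't trust Dr. Smith's e-mail."))
# ['Don', "'", 't', 'trust', 'Dr', '.', 'Smith', "'", 's',
#  'e', '-', 'mail', '.']
# "Don't" is broken apart, the period of "Dr." looks like a
# sentence boundary, and "e-mail" is split into three tokens.
```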
To cover the task of searching, the goals of this chapter are (i) to outline the contexts and types of data in which people search for information (structured, unstructured, and semi-structured data), emphasizing the concept of one’s information need; (ii) to provide ways to think about the evaluation and improvement of one’s search results; (iii) to cover the important concept of regular expressions and the corresponding machinery of finite-state automata; and (iv) to delve into linguistic corpora, illustrating a search for linguistic forms rather than for content. The middle of the chapter provides more in-depth discussion of web searching, including how webpages are indexed and how the PageRank algorithm is used to rank the pages that match a query.
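As a preview, here is a minimal sketch of the PageRank intuition that a page is important if other important pages link to it; the three-page link graph and the damping factor of 0.85 are illustrative choices of ours, not the chapter's exact example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute rank along the links until it settles."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Each page q passes its rank evenly to the pages it links to.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(links))  # C, linked to by both A and B, ranks highest
```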
This chapter aims to (i) explain the idea of classifiers and machine learning; (ii) introduce the Naive Bayes and Perceptron classifiers; (iii) give basic information about how to evaluate the success of a machine learning system; and (iv) explain the applications of machine learning to junk-mail filtering and to sentiment analysis. The chapter concludes with advice on how to select a machine learning algorithm and a discussion of how this plays out for a consulting company employing sentiment analysis as part of an opinion-tracking application designed to be used by corporate customers.
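To give a flavor of one of these classifiers, below is a minimal Naive Bayes sketch for junk-mail filtering; the four-message training set is invented, and real systems would add many refinements.

```python
import math
from collections import Counter

# Invented toy training data: (message, label) pairs.
train = [("buy cheap pills now", "spam"),
         ("cheap rolex buy now", "spam"),
         ("meeting agenda for monday", "ham"),
         ("lunch on monday then meeting", "ham")]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    scores = {}
    n_docs = sum(class_counts.values())
    for label in class_counts:
        score = math.log(class_counts[label] / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for w in text.split():
            # Add-one smoothing gives unseen words nonzero probability.
            score += math.log((word_counts[label][w] + 1)
                              / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("buy pills now"))  # -> spam
```

The "naive" part is the assumption that words occur independently of each other given the class, which is linguistically false but works remarkably well in practice.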
The goals of this chapter are (i) to introduce the idea of dialog systems; (ii) to describe some of the ways in which researchers have conceptualized dialog, including dialog moves, speech acts, and conversational maxims; (iii) to show some of the ways of classifying dialog systems according to their purpose and design; and (iv) to illustrate how to measure the performance of dialog systems. We spend some time discussing the difficulties in automating dialog and illustrate this with the example of the early dialog system Eliza.
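In the spirit of Eliza, the sketch below shows how far simple pattern-matching rules can go: the program reflects the user's words back without any understanding. The patterns are simplified inventions of ours; the real Eliza also swapped pronouns such as my/your.

```python
import re

# Eliza-style rules: a regular expression and a response template.
rules = [
    (r"I am (.*)", "Why do you say you are {}?"),
    (r"I feel (.*)", "How long have you felt {}?"),
    (r".*\bmother\b.*", "Tell me more about your family."),
]

def respond(utterance):
    for pattern, template in rules:
        match = re.match(pattern, utterance, re.IGNORECASE)
        if match:
            return template.format(*match.groups())
    return "Please go on."  # fallback when no rule matches

print(respond("I am worried about my exam"))
# Why do you say you are worried about my exam?
```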
Starting from the general idea of what it means to translate, in this chapter we (i) introduce the idea of machine translation (MT) and explain its capabilities and limits; (ii) indicate the differences between direct MT systems, transfer systems, and modern statistical methods; and (iii) set machine translation in its business context, emphasizing the idea of a translation need. The chapter includes extended discussion of IBM’s Model 1 translation model and of the Noisy Channel Model as applied to translation. It also discusses the translation needs of the European Union and those of the Canadian Meteorological Service, and contrasts them with the very difficult requirements for a satisfactory translation of a Shakespeare sonnet. The chapter concludes with a discussion of the likely consequences of automated translation for the career prospects and training choices of human translators.
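To preview Model 1, here is a minimal sketch of its expectation-maximization training loop, which learns word-translation probabilities from sentence pairs; our three toy German-English pairs are invented, and refinements such as the NULL word are omitted.

```python
from collections import defaultdict

# Invented toy parallel corpus: (foreign, English) token lists.
pairs = [(["das", "haus"], ["the", "house"]),
         (["das", "buch"], ["the", "book"]),
         (["ein", "buch"], ["a", "book"])]

f_vocab = {f for fs, _ in pairs for f in fs}
uniform = 1.0 / len(f_vocab)
t = defaultdict(lambda: uniform)  # t[(f, e)] = P(f | e), uniform start

for _ in range(10):  # EM iterations
    count = defaultdict(float)  # expected counts of (f, e) links
    total = defaultdict(float)
    for fs, es in pairs:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                c = t[(f, e)] / norm  # E-step: fractional alignment
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():  # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 2))  # rises toward 1.0
```

Even on this tiny corpus, the co-occurrence of "das" with "the" in two sentences pushes the model to credit "house" with "haus", illustrating how alignments emerge from data alone; the Noisy Channel Model then combines such a translation model with a language model to pick the most fluent translation.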
The final chapter takes a look at the impact of language technology on society and human self-perception, as well as some of the ethical issues involved. We raise questions about how computers and language technology change the way information can be accessed, and what this means for a democratic society in terms of control of information and privacy; how they change learning, teaching, and our jobs through upskilling and deskilling; and how they affect human self-perception when we are faced with machines capable of communicating using language. The goal of the chapter is to raise awareness of the issues that arise when computers and language technology are used in real life.
A typical way to use the material in this book is in a quarter-length course assuming no mathematical or linguistic background beyond normal high-school experience. For this kind of course, instructors may choose to cover only a subset of the chapters. Each chapter is a stand-alone package, with no strict dependencies between the topics. We have found this material to be accessible to the general student population at the Ohio State University (OSU), where we originally developed the course.
To support the use of the book for longer or more advanced courses, the book also includes Under the Hood sections providing more detail on selected advanced topics, along with development of related analytical and practical skills. This kind of use is more appropriate as part of a Linguistics, Computer Science, or Communications major, or as an overview course at the nonspecialist graduate level. The Under the Hood topics have been useful in semester-length courses at Georgetown University and Indiana University, as well as in honors versions of the course at OSU.
The book is accompanied by a website at http://purl.org/lang-and-comp/teaching containing course materials, such as presentation slides.