Text comes in many shapes, sizes, and forms, but for text to be useful as data, it is almost always converted to electronic form prior to analysis. Electronic text has the advantage of being easily stored and easily manipulated (Louwerse & Graesser 2005). Fortunately for modern users of textual data, new texts are almost invariably recorded in electronic form that can be easily converted for analysis. Websites, electronic documents, text stored on optical media, electronic mail, word processing files, news feeds—all provide text that can be easily captured and manipulated for analysis (Lindkvist 1981).
Textual data refer to systematically collected material consisting of written, printed, or electronically published words, typically either purposefully written or transcribed from speech (Palmer 2001). Text collected for use as data typically reflects a conscious research purpose, motivated by a design aimed at yielding insight on some feature of the social or political world. This entry outlines the purpose, issues, and challenges involved in selecting, preparing, and analyzing textual data. It also presents elements of close reading, such as figurative language, detail, and tone, that help assess a text's meaning and interpret its author's attitude.
Practical Issues in Dealing with Textual Data
One potentially tricky practical issue when dealing with textual data concerns the manner in which text is electronically encoded by computers. Broadly speaking, “text” means the representation of language as a set of recorded characters, following language-specific rules for syntax, grammar, and style to be meaningful. Pre-electronic textual data were typically series of characters drawn or printed on paper (Carley 1993). For text to be stored electronically, however, each character must be encoded in a way that corresponds to a digital format used by computers. Morse code, for instance, encodes characters as a series of dots and dashes. Unfortunately, but hardly surprisingly, different computer systems encode characters in different ways, and this creates challenges when texts stored in different encodings are prepared for processing together (Louwerse & Graesser 2005). For instance, two instances of démocratie may be treated as two different words by software designed to compile a word frequency table when the term occurs in two documents stored with different text encodings. Beyond encoding, issues also arise in giving appropriate weight to each of the following elements of textual analysis:
Diction - the specific words the writer uses and their connotations
Imagery - the way the writer uses the senses to create specific experiences (visual, auditory, olfactory, gustatory, tactile)
Language - the register (formal or informal) and the specific style of language the writer selects (scientific, jargon, colloquial, slang, professional)
Irony - a use of language which involves an incongruity between what one would expect and what actually occurs
Metaphor - a comparison the author makes between two unlike things or situations
Organization - the way the writer sets up the piece (a letter, a speech, enumeration of salient points)
Syntax - the sentence structure the writer chooses (includes punctuation, use of italics, spacing, complex and/or compound sentences, sentence length)
Allusion - when a writer refers to another situation (historical, mythical, biblical)
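The encoding problem described earlier, in which two instances of démocratie are counted as different words, can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that one document is stored as UTF-8 and the other as Latin-1 (ISO 8859-1); the variable names are hypothetical and the one-word "documents" stand in for real files.

```python
from collections import Counter

word = "démocratie"

# The same word as it would be stored on disk under two assumed encodings.
doc_utf8 = word.encode("utf-8")      # "é" takes two bytes in UTF-8
doc_latin1 = word.encode("latin-1")  # "é" takes one byte in Latin-1

# Naive approach: read every document as if it were Latin-1.
# The UTF-8 document is silently misread, so the two tokens differ.
naive_tokens = [doc_utf8.decode("latin-1"), doc_latin1.decode("latin-1")]
naive_counts = Counter(naive_tokens)

# Careful approach: decode each document with its own encoding,
# so both become the same Unicode string before counting.
documents = [(doc_utf8, "utf-8"), (doc_latin1, "latin-1")]
clean_tokens = [raw.decode(encoding) for raw, encoding in documents]
clean_counts = Counter(clean_tokens)

print(len(naive_counts))    # number of distinct "words" under the naive reading
print(clean_counts[word])   # frequency of the word once encodings are respected
```

In practice the encoding of each file is not always declared, so a preprocessing step that identifies or records each document's encoding and converts everything to a single one (commonly UTF-8) before tokenizing is a reasonable safeguard.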