CORPUS LINGUISTICS a short introduction Tatiana I. Shutova, LUNN What is a corpus? A corpus (pl. corpora) is a large and structured set of texts (nowadays usually electronically stored and processed). What can a corpus of texts be possibly used for? Types of corpora Specialized General Written Spoken Demographic Context-governed Multilingual Parallel Learner Historical Diachronic etc => used for hypothesis testing checking occurrences and combinations of words validating linguistic rules providing examples of actual language use for teaching languages dictionaries, tests, exercises compilation tracing (sociolinguistic) changes in the language etc. Software for Corpus Studies: AntConc works only with plain-text files with the .txt appendix (e.g. Hamlet.txt) AntConc will not read .doc, .docx, .pdf files.You will need to convert these into .txt files. To save the file as a .txt file, open Notepad (for Windows) or TextEdit (on Mac) and insert it there. AntConc: Launch AntConc: intro Concordance: Keyword in Context view (KWIC). Concordance Plot: A very simple visualization of your KWIC search, where each instance is represented as a little black line from beginning to end of each file containing the search term. File View: a full file view for larger context of a result. Clusters: words which very frequently appear together. Collocates: words which _definitely _appear together in a corpus; collocates show words which are statistically likely to appear together. Word list: All the words in your corpus. Keyword List: comparisons between two corpora. Loading corpus into AntConc AntConc Basic Functions: Keywords-in-Context (KWIC) Write down definitions for the word “shot” (2 min.) Type in “shot” into search term bar. What do you see? What meanings of this word can you identify? Did you guess all of them? AntConc Basic Functions: Keywords-in-Context (KWIC) – contd. AntConc Basic Functions: Search Operators The * operator (wildcard) The * operator can help, for instance, find both the singular and the plural forms of nouns. Task: Search for qualit*, then sort this search. What tends to precede and follow quality & qualities? (Hint: they’re different words, and have different contexts. Look for patterns in usage using the KWIC!) AntConc Basic Functions: Search Operators – contd. To find out the difference between * and ?, search for th*n and th?n. The ? operator is more specific than the * operator: wom?n – both women and woman m?n – man and men, but also min contrast to m*n: not helpful, because you’ll get mean, melon, etc. AntConc Basic Functions: Search Operators – contd. Task: Compare these two searches: wom?n and m?n sort each search in a meaningful way (e.g., by search term, then 1L then 2L) File > Save output to text file (& append with .txt.) open the plain text file in your text editor and study the information do this for each of the two searches and then look at the two text files side by side. What do you notice? AntConc: Basic Functions Collocates and wordlists - Collocation – a tendency / requirement for words to co-occur / make sense together - Task: Generate collocates for m?n and wom?n. Now sort them by frequency to 1L. - This tells us about what makes a man or woman ‘movieworthy’: – women have to be ‘beautiful’ or ‘pregnant’ or ‘sophisticated’ – men have to be somehow outside the norm – ‘holy’ or ‘black’ or ‘old’ AntConc: Basic Functions Comparing Corpora Corpus compared against Reference Corpus (bigger) Keyness is the frequency of a word in the text when compared with its frequency in a reference corpus. Keyness reflects the most important topics of the text / corpus. AntConc: Basic Functions Comparing Corpora / Keywords Settings > Tool preferences > Keyword List Under ‘Reference Corpus’ make sure “Use raw files” is checked Add Directory > open the folder containing the files that make up the reference corpus Ensure you have a whole list of files! What are our keywords? ? What sorts of research questions can you come up with? Comparing what and what? Identifying what and what? Limitations of Corpus Linguistics Won’t tell us if something is impossible in the language Generalizations based on the corpora are deductions, not facts Corpora give us evidence but no explanation / information about the culture, societal information, etc. Corpora give us language out of the context (no pictures, body language, behavior, etc.) Questions? Thank you!