
Lexical Bundles and Idiolect

N-grams are effective for identifying authorship. How and why they work is vital information, but it is not yet understood. Specifically, in the paper that proposed the n-gram tracing method, long character n-grams (n = 7-9) were found to achieve the best results (Grieve et al., 2019). This length is consistent with the predictions of cognitive linguistic frameworks, as explained by Nini (forthcoming), because these sequences are neither too uncommon to be found elsewhere in the corpus nor so common that they could be found anywhere. Cognitive linguistics suggests that collocations a person uses and hears more frequently become entrenched to the extent that the person starts to treat them as a single unit (Langacker, 1987).

However, character n-grams are inherently influenced by topic. For example, two essays about law will likely both include the same words and phrases (and thus the same sequences of characters) associated with legalese. The aim of the study is therefore to answer the following question: do different individuals consistently use different sets of n-grams to encode a particular meaning, as predicted by cognitive linguistic theories, or are n-grams successful because they capture information that is only correlated with authorship (e.g., topic)?

One way to test why long character n-grams work is to keep the context fixed and ask different people to produce language in that context. Each participant wrote two summaries of the same text. The other variables held constant were the participants’ age (19-22), education level (currently studying at undergraduate level) and first language (English). All 7-, 8- and 9-grams were then extracted, and the n-grams of each text were compared using a presence-absence approach. The text to which a given text was most similar (disregarding the stimulus) is taken to have been written by the same author. R scripts were run to analyse the texts produced.
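The comparison step can be illustrated with a minimal sketch. It is written in Python for illustration and is not the R scripts used in the study. The function names, the lower-casing and whitespace normalisation, and the use of the raw count of shared n-gram types as the similarity score are all assumptions. The handling of the stimulus text is not modelled here.

```python
def char_ngrams(text, sizes=(7, 8, 9)):
    """Return the set of character n-gram types of the given sizes.

    Using a set records presence or absence only, not frequency.
    """
    # Assumption: lower-case the text and collapse runs of whitespace.
    normalised = " ".join(text.lower().split())
    return {
        normalised[i:i + n]
        for n in sizes
        for i in range(len(normalised) - n + 1)
    }


def shared_ngrams(text_a, text_b, sizes=(7, 8, 9)):
    """Count the n-gram types that occur in both texts."""
    return len(char_ngrams(text_a, sizes) & char_ngrams(text_b, sizes))


def nearest_text(texts, sizes=(7, 8, 9)):
    """Map each text label to the label of the most similar other text.

    `texts` is a dict of label -> text (hypothetical labels). Ties are
    resolved in favour of the first candidate encountered.
    """
    profiles = {label: char_ngrams(t, sizes) for label, t in texts.items()}
    nearest = {}
    for label, profile in profiles.items():
        candidates = [
            (len(profile & other_profile), other)
            for other, other_profile in profiles.items()
            if other != label
        ]
        nearest[label] = max(candidates, key=lambda pair: pair[0])[1]
    return nearest


# Toy texts invented for illustration only; they are not study data.
toy_texts = {
    "A1": "the committee decided to postpone the meeting until further notice",
    "A2": "the committee decided to postpone the vote until further notice",
    "B1": "a totally different sentence about cats sleeping on the sofa",
}
attribution = nearest_text(toy_texts)
```

In this toy example, `attribution["A1"]` is `"A2"` and vice versa, because those two texts share many long character sequences. A normalised coefficient could replace the raw count if texts differ greatly in length; the source does not say which measure was used.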

The analysis enables us to verify whether two texts written by the same author are more similar to each other, in terms of the number of long character n-grams they have in common, than two texts written by different authors. This research is still in progress, so the findings cannot yet be reported. The final goal is a greater understanding of how lexical sequences contribute to a person’s idiolect. This has real-world implications for forensic linguistics: for n-gram tracing to be used in real-life settings, we must understand why it works, and this research begins to contribute to that understanding.

References:

Grieve, Jack, Isobelle Clarke, Emily Chiang, Hannah Gideon, Annina Heini, Andrea Nini & Emily Waibel. 2019. Attributing the Bixby Letter using n-gram tracing. Digital Scholarship in the Humanities 34, 493-512.

Langacker, Ronald W. 1987. Foundations of cognitive grammar. Stanford: Stanford University Press.

Nini, Andrea. Forthcoming. A Theory of Linguistic Individuality. Elements in Forensic Linguistics. Cambridge: Cambridge University Press.