On Utilizing Nonstandard Abbreviations and Lexicon to Infer Demographic Attributes of Twitter Users

Nathaniel Moseley; Cecilia Ovesdotter Alm; Manjeet Rege

doi:10.1007/978-3-319-16577-6_11

Conference proceeding

FORMALISMS FOR REUSE AND SYSTEMS INTEGRATION, Vol.346, pp.257-278

Advances in Intelligent Systems and Computing

01/01/2015

Computer Science, Artificial Intelligence

Computer Science, Information Systems

Computer Science, Theory & Methods

Science & Technology

Computer Science

Technology

Abstract

Automatically determining demographic attributes of writers with high accuracy, based on their texts, can be useful for a range of application domains, including smart ad placement, security, the discovery of predator behaviors, and automatic enhancement of participants' profiles for extended analysis. It is also of interest to linguists who may wish to build on such inference for further sociolinguistic analysis. Previous work indicates that attributes such as author gender can be determined with some success from authors' written texts, using methods such as analysis of shallow linguistic patterns or topic. Author age appears more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g., over or under 30), ternary, or even continuous variable. In this work, we show that word and phrase abbreviation patterns can be used toward determining user age with novel binning, as well as binary user gender and ternary user education level. Notable results include age classification accuracy of up to 83% (67% above relative majority class baseline) using a support vector machine classifier and PCA-extracted features, including n-grams. User ages were classified into 10 equally sized age bins, and classification achieved 51% accuracy (34% above baseline) when using only abbreviation features. Gender classification achieved 75% accuracy (13% above baseline) using only PCA-extracted abbreviation features, and education classification achieved 62% accuracy (19% above baseline) with PCA-extracted abbreviation features. Also presented is an analysis of the evident change in author abbreviation use over time on Twitter.

Details

Title: On Utilizing Nonstandard Abbreviations and Lexicon to Infer Demographic Attributes of Twitter Users
Author/Creator: Nathaniel Moseley - Rochester Institute of Technology
Cecilia Ovesdotter Alm - Rochester Institute of Technology
Manjeet Rege - University of St. Thomas - Minnesota
Contributors: T Bouabana-Tebibel
S H Rubin
Publication Details: FORMALISMS FOR REUSE AND SYSTEMS INTEGRATION, Vol.346, pp.257-278
Series: Advances in Intelligent Systems and Computing
Publisher: Springer Nature
Number of pages: 22
Academic Unit: Software Engineering and Data Science
Language: English
Resource Type: Conference proceeding
Record Identifier: 991015165551203691
