Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls

Social Media Big Data

Abstract:

Large-scale databases of human activity in social media have captured scientific and policy attention, producing a flood of research and discussion. This paper considers methodological and conceptual challenges for this emergent field, with special attention to the validity and representativeness of social media big data analyses. Persistent issues include the over-emphasis of a single platform, Twitter, sampling biases arising from selection by hashtags, and vague and unrepresentative sampling frames. The socio-cultural complexity of user behavior aimed at algorithmic invisibility (such as subtweeting, mock-retweeting, use of “screen-captures” for text, etc.) further complicates the interpretation of big data social media. Other challenges include accounting for field effects, i.e. broadly consequential events that do not diffuse only through the network under study but affect the whole society. The application of network methods from other fields to the study of human social activity may not always be appropriate. The paper concludes with a call to action on practical steps to improve our analytic capacity in this promising, rapidly-growing field.

Introduction

Very large datasets, commonly referred to as big data, have become common in the study of everything from genomes to galaxies, including, importantly, human behavior. Thanks to digital technologies, more and more human activities leave imprints whose collection, storage, and aggregation can be readily automated. In particular, the use of social media results in the creation of datasets that may be obtained from platform providers or collected independently with relatively little effort as compared with traditional sociological methods.

Social media big data has been hailed as the key to crucial insights into human behavior and extensively analyzed by scholars, corporations, politicians, journalists, and governments (Boyd and Crawford, 2012; Lazer et al., 2009). Big data reveal fascinating insights into a variety of questions, and allow us to observe social phenomena at a previously unthinkable level, such as the mood oscillations of millions of people in 84 countries (Golder et al., 2011), or in cases where there is arguably no other feasible method of data collection, as with the study of ideological polarization on Syrian Twitter (Lynch, Freelon and Aday, 2014). The emergence of big data from social media has had impacts in the study of human behavior similar to the introduction of the microscope or the telescope in the fields of biology and astronomy: it has produced a qualitative shift in the scale, scope, and depth of possible analysis. Such a dramatic leap requires a careful and systematic examination of its methodological implications, including trade-offs, biases, strengths, and weaknesses.

Copyright © 2014, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

This paper examines methodological issues and questions of inference from social media big data. Methodological issues include the following:

1. The model organism problem, in which a few platforms are frequently used to generate datasets without adequate consideration of their structural biases.

2. Selecting on dependent variables without requisite precautions; many hashtag analyses, for example, fall in this category.

3. The denominator problem created by vague, unclear, or unrepresentative sampling.

4. The prevalence of single platform studies which overlook the wider social ecology of interaction and diffusion.
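The denominator problem (item 3 above) can be made concrete with a minimal sketch. It assumes a hypothetical exposure count is available for each message, which platforms often do not publish; the `Message` class, the `retweet_rate` helper, and all figures are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Message:
    label: str
    retweets: int   # numerator: observed actions
    exposures: int  # denominator: hypothetical count of users who saw the message


def retweet_rate(msg: Message) -> float:
    """Return retweets per exposure; fail loudly if the denominator is missing."""
    if msg.exposures <= 0:
        raise ValueError("exposure count (denominator) is required")
    return msg.retweets / msg.exposures


# Hypothetical figures, invented for illustration only.
messages = [
    Message("A", retweets=500, exposures=1_000_000),
    Message("B", retweets=50, exposures=10_000),
]

# Ranking by raw counts and ranking by rate can disagree.
by_raw_count = max(messages, key=lambda m: m.retweets).label
by_rate = max(messages, key=retweet_rate).label

print(by_raw_count, by_rate)
```

With these made-up numbers, message A leads on raw retweets while message B leads once exposure is taken into account, which is why a raw count without a denominator can support opposite conclusions.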

There are also important questions regarding what we can legitimately infer from online imprints, which are but one aspect of human behavior. Issues include the following:

1. Online actions such as clicks, links, and retweets are complex social interactions with varying meanings, logics, and implications, yet they may be aggregated together.

2. Users engage in practices that may be unintelligible to algorithms, such as subtweets (tweets referencing an unnamed but implicitly identifiable individual), quoting text via screen captures, and “hate-linking”—linking to denounce rather than endorse.

3. Network methods from other fields are often used to study human behavior without evaluating their appropriateness.

4. Social media data almost solely capture “node-to-node” interactions, while “field” effects—events that affect society or a group in a wholesale fashion either through shared experience or through broadcast media—may often account for observed phenomena.

5. Human self-awareness needs to be taken into account; humans will alter behavior because they know they are being observed, and this change in behavior may correlate with big data metrics.
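The first two issues above (aggregating actions with different meanings, and practices such as hate-linking) can be illustrated with a small sketch. It assumes a hypothetical set of links to a single article whose stance has been hand-coded by a human reader; the account names, stance labels, and counts are invented and do not come from the paper.

```python
from collections import Counter

# Hypothetical hand-coded links to one article: (source_account, stance).
links = [
    ("user1", "endorse"),
    ("user2", "denounce"),
    ("user3", "denounce"),
    ("user4", "endorse"),
    ("user5", "denounce"),
]

# Naive aggregation: every inbound link is counted as support.
naive_support = len(links)

# Stance-aware aggregation: separate endorsing links from hate-links.
stance_counts = Counter(stance for _, stance in links)
net_support = stance_counts["endorse"] - stance_counts["denounce"]

print(naive_support, dict(stance_counts), net_support)
```

In this toy data the naive count suggests five supporters, while the hand-coded stances show more denouncing links than endorsing ones. The sketch presumes that stance labels exist; obtaining them is exactly the kind of interpretive work that automated aggregation tends to skip.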

Partial Conclusion:

  1. Industry Outreach. The field should solicit cooperation from the industry for data such as “denominators”, similar to Facebook’s recent release of what percent of a Facebook network sees status updates. Industry scientists who participate in the research community can be conduits.
  2. Convergent answers and complementary methods. Multi-method, multi-platform analyses should be sought and rewarded. As things stand, these exist (Adar et al., 2007 or Kairam, 2013) but are rare. Whenever possible, social media big data studies should be paired with surveys, interviews, ethnographies, and other methods so that biases and shortcomings of each method can be used to balance each other to arrive at richer answers.
  3. Multi-disciplinary teams. Scholars from fields where network methods are shared should cooperate to study the scope, differences, and utility of common methods.
  4. Methodological awareness in the review. These issues should be incorporated into the review process and go beyond soliciting “limitations” sections.

A future study that recruited a panel of ordinary users, from multiple countries, and examined their behavior online and offline, and across multiple platforms to detect the frequency of behaviors outlined here, and those not detected yet, would be a path-breaking next step for understanding and grounding our social media big data.

About KSRA

The Kavian Scientific Research Association (KSRA) is a non-profit research organization that has provided research and educational services since December 2013. Its founding members first came together as a virtual group on the Viber social network and went on to form the core of the association. Led by Professor Siavosh Kaviani, they decided to launch a scientific and research association with an emphasis on education.

KSRA, as a non-profit research association, is committed to providing research services in the field of knowledge. Its main beneficiaries are public and private knowledge-based companies, students, researchers, professors, universities, and industrial and semi-industrial centers around the world.

Our main services are based on education for people across the whole spectrum, worldwide. We want to integrate research and education. We believe education is a basic human right, so our services concentrate on inclusive education.

The KSRA team partners with local under-served communities around the world to improve the access to and quality of knowledge based on education, amplify and augment learning programs where they exist, and create new opportunities for e-learning where traditional education systems are lacking or non-existent.

FULL Paper PDF file:

Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls

Authors:

Zeynep Tufekci
University of North Carolina, Chapel Hill
zeynep@unc.edu

Comments: Tufekci, Zeynep. (2014). Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls. In ICWSM ’14: Proceedings of the 8th International AAAI Conference on Weblogs and Social Media, 2014. [forthcoming]
Subjects: Social and Information Networks (cs.SI); Physics and Society (physics.soc-ph)
Cite as: arXiv:1403.7400 [cs.SI]
(or arXiv:1403.7400v1 [cs.SI] for this version)

Submission history

From: Zeynep Tufekci
[v1] Fri, 28 Mar 2014 14:48:28 UTC (2,169 KB)
[v2] Tue, 15 Apr 2014 21:03:37 UTC (283 KB)

Citation:

TY – JOUR
AU – Tufekci, Zeynep
PY – 2014/03/28
SP –
T1 – Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls
JO – Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014
ER –

Citation

Tufekci, Zeynep. (2014). Big Questions for Social Media Big Data: Representativeness, Validity, and Other Methodological Pitfalls. Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014.

Tufekci, Zeynep. (2014). Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls. In ICWSM ’14: Proceedings of the 8th International AAAI Conference on Weblogs and Social Media, 2014.

Professor Siavosh Kaviani was born in 1961 in Tehran and has held a professorship. He holds a Ph.D. in Software Engineering from the QL University of Software Development Methodology and an honorary Ph.D. from the University of Chelsea.

Maryam Kakaei was born in 1984 in Arak. She holds a Master's degree in Software Engineering from Azad University of Arak.

Nasim Gazerani was born in 1983 in Arak. She holds a Master's degree in Software Engineering from UM University of Malaysia.