My name is danah boyd and I'm a Principal Researcher at Microsoft Research, a Research Assistant Professor in Media, Culture, and Communication at New York University, and a Fellow at Harvard's Berkman Center for Internet and Society. Buzzwords in my world include: privacy, context, youth culture, social media, big data. I use this blog to express random thoughts about whatever I'm thinking.

Big Data: Opportunities for Computational and Social Sciences

Scott Golder recently wrote a blog post at Cloudera entitled “Scaling Social Science with Hadoop” where he accounts for “how social scientists are using large scale computation.” He begins with a delightful quote from George Homans: “The methods of social science are dear in time and money and getting dearer every day.” He then turns to talk about the trajectory of social science:

When Homans — one of my favorite 20th century social scientists — wrote the above, one of the reasons the data needed to do social science was expensive was because collecting it didn’t scale very well. If conducting an interview or lab experiment takes an hour, two interviews or experiments takes two hours. The amount of data you can collect this way grows linearly with the number of graduate students you can send into the field (or with the number of hours you can make them work!). But as our collective body of knowledge has accumulated, and the “low-hanging fruit” questions have been answered, the complexity of our questions is growing faster than our practical capacity to answer them. Things are about to change.

This is his jumping-off point for thinking about how “computational social science” provides new opportunities because of the “large archives of naturalistically-created behavioral data.” And then he makes a very compelling claim for why looking at behavioral data is critical:

Though social scientists care what people think it’s also important to observe what people do, especially if what they think they do turns out to be different from what they actually do.

By and large, I agree with him. Big Data presents new opportunities for understanding social practice. Of course the next statement must begin with a “but.” And that “but” is simple: Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And just because you have a big N doesn’t mean that it’s representative or generalizable. Scott knows this, but too many people obsessed with Big Data don’t.

Increasingly, computational scientists are having a field day with Big Data. This is exemplified by the “web science” community and highly visible in conferences like CHI and WWW and ICWSM and many other communities in which I am a peripheral member. In these communities, I’ve noticed something that I find increasingly worrisome… Many computational scientists believe that because they have large N data, they know more about people’s practices than any other social scientist. Time and time again, I see computational scientists mistake behavioral traces for cultural logic. And this both saddens me and worries me, especially when we think about the politics of scholarship and funding. I’m getting ahead of myself.

Let me start with a concrete example. Just as social network sites were beginning to gain visibility, I reviewed a computational science piece (that was never published) where the authors had crawled Friendster, calculated numbers of friends, and used this to explain how social network sites were increasing friendship size. My anger in reading this article resulted in a rant that turned into a First Monday article. As is now common knowledge, there’s a big difference between why people connect on social network sites and why they declare relationships when being interviewed by a sociologist. This is the difference between articulated networks and personal networks.

On one hand, we can laugh at this and say, oh folks didn’t know how these sites would play out, isn’t that funny. But this beast hasn’t yet died. These days, the obsession is with behavioral networks. Obviously, the people who spend the most time together are the REAL “strong” ties, right? Wrong. By such a measure, I’m far closer to nearly everyone that I work with than my brother or mother who mean the world to me. Even if we can calculate time spent interacting, there’s a difference in the quality of time spent with different people.

Big Data is going to be extremely important but we can never lose track of the context in which this data is produced and the cultural logic behind its production. We must continue to ask “why” questions that cannot be answered through traces alone, that cannot be elicited purely through experiments. And we cannot automatically assume that some theoretical body of work on one data set can easily transfer to another data set if the underlying conditions are different.

As we start to address Big Data, we must begin by laying the groundwork, understanding the theoretical foundations that make sense and knowing when they don’t apply. Cherry picking from different fields without understanding where those ideas are rooted will lead us astray.

Each methodology has its strengths and weaknesses. Each approach to data has its strengths and weaknesses. Each theoretical apparatus has its place in scholarship. And one of the biggest challenges in doing “interdisciplinary” work is being able to account for these differences, to know what approach works best for what question, to know what theories speak to what data and can be used in which ways.

Unfortunately, our disciplinary nature makes a mess out of this. Scholars aren’t trained to read in other fields, let alone make sense of the conditions in which that work was produced. Thus, it’s all-too-common to pick and choose from different fields and take everything out of context. This is one of the things that scares me about students trained in interdisciplinary programs.

Now, of course, you might ask: But didn’t you come from an interdisciplinary program? Yes, I did. But there’s a reason that I was in grad school for 8.5 years. The first two were brutal as I received a rude awakening that I knew nothing about social science. And then I did a massive retraining as an ethnographer drawing on sociological and anthropological literatures. At this point, that’s my strength as a scholar. I know how to ask qualitative questions and I know how to employ ethnographic methods and theories to work out cultural practices. I had to specialize to have enough depth.

Of course, there’s one big advantage to an interdisciplinary program: it’s easy to gain an appreciation for diverse methodological and analytical approaches. In my path, I’ve learned to value experimental, computational, and quantitative research, but I’m by no means well trained in any of those approaches. That said, I am confident in my ability to assess which questions can be answered by which approaches. This also means that I can account for the questions I can’t answer.

Now back to Big Data… Big Data creates tremendous opportunities for those who know how to assess the context of the data and ask the right questions of it. But mucking with Big Data alone is not research. And seeing patterns in Big Data is not the same as hypothesis testing. Patterns invite more questions than they answer.

I agree with Scott that there’s the potential for social science to be transformed by Big Data. So many questions that we’ve wanted to ask but haven’t been able to. But I’m also worried that more computationally minded researchers will think that they’re answering social science questions simply by finding patterns in Big Data. It’s the same worry that I have when graph theorists think that they understand people because they can model a narrow kind of information flow given the perfect conditions.

If we’re going to actually attack Big Data, the best solution would be to combine forces between social scientists and computational scientists. In some places, this is happening. But there are also huge issues at play that need to be accounted for and addressed. First, every discipline has its arrogance and far too many scholars think that they know everything. We desperately need a little humility here. Second, we need to think about the differences in publication, collaboration, and validation across fields. Social scientists aren’t going to get tenure on ACM or IEEE publications. Hell, they’re often dismissed for anything that’s not single author. Computational scientists often see no point in the extended review cycles that go into journal publications to help produce solid articles. And don’t get me started on the messy reviewing process involved on both sides.

We need to find a way for people to start working together and continue to get validated in their work. I actually think that the funding agencies are going to play a huge role in this, not just in demanding cross-disciplinary collaboration, but in setting the stage for how research will be published. Given departmental obsessions with funding these days, they have a lot of sway over shaping the future here.

There’s also another path that needs to be used: cross-bred students. Scott Golder, our fearless critic, is a good example of this. He was trained in computational ways before going to Cornell to pursue a PhD in sociology. This is one way of doing it. Another is to start cross-breeding students early on. Computer scientists: teach courses for social scientists on how to think about Big Data from a computational perspective. Social scientists: allow computer scientists into your core courses or teach core courses for them to understand the fundamentals of social science methodology and social theory. And universities: provide incentives for your faculty to teach students outside of their departments and for departments to encourage their students to take classes in other departments.

It’s great that we have Big Data but we need to develop the intellectual apparatus to actually analyze it. Each of us has a piece to the puzzle, but stitching it together is going to take a lot of reworking of old habits. It can be done and it is important. The key is to let go of our grudges and territoriality without letting go of our analytic rigor and depth.

15 comments to Big Data: Opportunities for Computational and Social Sciences

  • Thanks for this danah. I think you make some excellent points here. I particularly liked your comment “If we’re going to actually attack Big Data, the best solution would be to combine forces between social scientists and computational scientists.” I think that bringing together different disciplines is really key here. I help run a grant program called the Digging into Data Challenge. This program brings together interdisciplinary teams from the humanities, social sciences, and computer/information sciences to tackle questions using Big Data approaches. My hope is to inspire projects that are genuinely collaborative across the disciplines — that is, that raise really interesting research questions in multiple domains. As you suggested, “we need to develop the intellectual apparatus to actually analyze” Big Data. I’m hoping that some of the projects we’re funding will help with that development. (see: Brett (@brettbobley)

  • anonymous

    As a CS grad student, what do you think I should do to not make such mistakes? Take courses in the dept of Sociology? Anything other than that?

  • Kevin K

    “But mucking with Big Data alone is not research. And seeing patterns in Big Data is not the same as hypothesis testing. Patterns invite more questions than they answer.”

    Shall we call this the Freakonomics Fallacy?

  • I feel like actual social scientists doing statistical work often don’t like observational studies — which is what you usually get in Big Data world — but rather, are all into randomized or natural experiments as the best methodology, if you can get it. Matthew Salganik, for example, thinks the point of the web for social science research is to allow really good experiments. As fun as it is to run some Big Data observational studies, I think there’s some wisdom there. I always feel I learn more from reading an ethnography or survey, where at least the researcher gets to formulate the queries.

    As for interdisciplinary people, eventually the best people doing this will be social science phd’s in social science departments who have strong computational skills. The fact that computer scientists are involved at all is an accident of history during this transitional period. Give it 10-20 years.

    What all this big-data-for-social-science stuff is, is quantitative methodology for social science. Lots of social science disciplines have methodology subfields for this sort of thing — political science has “political methodology”, economics has “econometrics”, and so on. It’s a potentially very useful methodology, potentially ground-breaking for many problems, but at the end of the day just a methodology in service of answering questions, that requires cognizance of the right questions to use properly.

    As for interdisciplinary institutions, it is indeed very hard. Compounding matters, it can be hard enough just to find both good social science and computation at the same university; you were lucky to be at Berkeley, which is one of the exceptions. But that still doesn’t solve the problems with institutional incentives and the like. I always imagined industrial research labs might be one of the few places to effectively support this sort of research…

  • Barry Brown

    At one knuckle point here you have Sociology vs. Economics, which I’ve always thought would be an almighty battleground but is surprisingly quiet.

    Some of us are interested in socio-logics, the why of whatever we do so you can go about changing and influencing those behaviours. But a lot of people aren’t since they see that as ultimately an impossible task (why did you buy a stupid iPad vs. How many ipads sold this month) so they focus on the aggregate and higher level patterns.

    I’m pessimistic about the chances for bringing these approaches together not just because they have different answers to that question, but because they have a different view of what scale is. Everyone talks about mixed methods but nobody does it – and if they do it’s a right old muddle.

    I remember an argument between two friends: “ultimately everything is quant”. “how can you say that, ultimately everything is qual!”…

  • This is a really great article to complement your post, danah – looks at 3 different interdisciplinary areas of research and one of those areas is ethnography in the IT industry!

    Barry, A., Born, G., and Weszkalnys, G. Logics of interdisciplinarity. Economy and Society 37, 1 (2008), 20-49.
    “This paper interrogates influential contemporary accounts of interdisciplinarity, in which it is portrayed as offering new ways of rendering science accountable to society and/or of forging closer relations between scientific research and innovation. The basis of the paper is an eighteen-month empirical study of three interdisciplinary fields that cross the boundaries between the natural sciences or engineering, on the one hand, and the social sciences or arts, on the other….”

  • nice article. thanks. has me pondering if quantitative and qualitative will ever co-exist. sort of like debating the existence of god in society. everyone sees the same things, yet each has a very different perspective of what they are seeing. hope you are well.

  • e. pyatt

    Another good article. I’ve seen similar problems in linguistic research by non-linguists (from biologists in particular) and…linguists trying to answer archaeological problems (usually badly), so I sympathize.

    There may be promising approaches in combining information from Big Data and ethnography, but the researcher has to understand the methodologies of BOTH disciplines so as to not create a too simplistic model (e.g. more daily contact = closer relationship).

    I agree with commenters who point out that you tend to get different types of information from different methodologies, and the connection points are not always obvious. I think Big Data could be worth pursuing IF everyone understands what they’re doing (not saying I do by the way). I agree that assuming this will give us quick and simple answers can be dangerous.

  • CS grad student: Learn social science methodologies from social scientists. Develop a taste for the different methodological approaches, what questions can be addressed through what means, etc. Find a social scientist advisor who can help acculturate you.

  • Hey Danah,
    This is a brilliant post – thank you! I started writing up a comment basically talking about how Big Data is affecting social media analysis and tried to put a social media spin on things and it ended up being horrifically long so I published it as a post instead (here ->

    Thanks again!

  • Thanks for a very insightful and timely post, danah. I think many of the issues you raise about the suggestiveness of data and its representation will stay with us for many years to come. I also think they will “spill out” beyond compsci/socsci and beyond scholarship in general. I’m referring to the applications of social/human data modeling in things like profiling, context-sensitive ads and semantic web technologies that “infer” without understanding context. People will mistrust these technologies even if they become very reliable over time, and there will be a significant need for critical debate and mediation between those who build them and the public at large, which is something social scientists and humanities scholars who are tech literate can hopefully help with.

    One thought came up when reading your piece and thinking about numerous other pieces on Big Data I’ve read over time (e.g. Chris Anderson’s article on “The end of theory”): Big Data ultimately strikes me as an incredibly male idea. Quantitative data is incredibly suggestive, esp. when visualized. There is the idea that it “shows” or “proves” something that precludes other interpretations and it conveniently provides rhetorical ammunition to get almost anything across (when misused). Being data literate is essential.

  • Danah, thanks for this blog posting, it dovetails in an uncanny way with Tim O’Reilly’s recent keynote at the MySQL CE 2010 conference about The Cloud – and the corporate use of data in order to facilitate consumer convenience (while accumulating an unprecedented amount of data that can be analyzed and monetized). It seems that many of the problems you note by scholars without a social science background who make erroneous assumptions using Big Data boil down to nothing more than logical flaws (intentional fallacies, proof by assertion, etc.) Perhaps a couple semesters of formal logical training would assist, in addition to more social science background!

  • Go danah! Bravo for this thoughtful post. Combining quantitative & qualitative methods to form credible case studies is the challenge as Science 2.0 researchers press forward to understand the science of the made world. Big data is a great opportunity, but ethnographic methods are needed to make the results meaningful.

  • I think that the future of science is really in inter-disciplinary research that can be used to create collaborations between folks that produce data and those that analyze it. For example, check out the latest in collaborative grants and RFAs posted by organizations here:
