Reading Reflection: Crowdsourcing

I’ve been investigating how to leverage the power of crowdsourcing to gain insight into an emerging academic field. I put academic papers from Engineering Education Research on Amazon Mechanical Turk and asked turkers to read and tag them. Judging from the results of a pilot test, I feel the outcome probably will not be very good. This week’s readings gave me some points to think about: why this happened, and where to go in my future work.

Part 1 of Surowiecki’s book “The Wisdom of Crowds” identifies four conditions that characterize a wise crowd: diversity, independence, decentralization, and aggregation. Based on this and the other readings, I would like to add more conditions under which crowdsourcing can produce good results. Only when most (if not all) of the conditions are satisfied will the collective effort of the crowd excel. I also add thoughts that are particularly useful for my project and may be relevant to readers of this blog post who are interested in crowdsourcing.

  • Diversity. The crowd has to be diverse enough to provide all the information needed to solve the particular problem. It doesn’t need to be diverse in all regards (ethnicity, gender, economic status, skills, expertise, religion, etc.), and that would be impossible anyway. In the maze experiment mentioned on page 6 of “The Wisdom of Crowds”, the group has gone through the maze once, so collectively they have a nearly complete picture of the maze; that is, the group holds enough information to accomplish the task optimally the second time. So as long as the crowd is diverse enough to give complete information for the particular task, it doesn’t matter whether its members differ in gender, race, or worldview, though diverse worldviews do matter in many realistic problems. So when using crowdsourcing, I have to consider whether the crowd likely to take on my problem has the complete information to offer.
  • Balance (I added this one; it is derived from Diversity). The crowd also has to be diverse or balanced enough for the errors to cancel out when the information is summed. In the submarine Scorpion example, the officer recruited mathematicians, submarine specialists, and salvage men. These people were all experts to some extent. Each had some piece of information valuable to the problem, even if they did not know it themselves. If the team had contained too many irrelevant people with absolutely no information to offer, they would have contributed many errors that could not be canceled out in aggregate. For my project, when recruiting people from outside academia, I might include too many people who have absolutely nothing to offer besides errors. Compared with the general crowd, the Engineering Education Research community is tiny. I might end up with a very small number of experts and too large a body of non-experts, which makes the crowd imbalanced.
  • Independence. This might be the point on which “The Wisdom of Crowds” receives the most criticism. Opponents argue that it is impossible for people in the real world to be totally independent and uninfluenced by their social environment, so in this regard the wisdom of crowds has no practical value. Setting aside all the successful real-world cases, Amazon Mechanical Turk is a good place where turkers finish tasks without influencing one another. However, independence is not always a good thing, especially for complex tasks. The CrowdForge paper (Kittur, Smus, & Kraut, 2011) says that “workers generally complete tasks independently with no knowledge of what others have done, making it difficult to enforce standards and consistency”. So micro-task markets can usually only help with simple, low-complexity tasks that require little cognitive effort.
  • Decentralization (no thoughts yet).
  • Aggregation (no thoughts yet).
  • Carefully designed tasks. The problems or questions posed to the crowd have to be carefully designed, especially for complex tasks. In the submarine Scorpion example, the officer didn’t ask the team where they thought the Scorpion was; rather, he concocted a series of scenarios about what might have happened to the Scorpion. However, carelessly breaking a task into small pieces will cause high coordination costs later. Drawing on the organizational behavior and distributed computing literature, the CrowdForge paper (Kittur, Smus, & Kraut, 2011) provides good arguments and guidelines on why and how to split complex tasks into small pieces suitable for micro-task markets, where workers’ attention spans are very short and they do not commit to long, complex tasks. For my project, the academic papers I put on Amazon Mechanical Turk are usually over 10 pages long, so I have to find ways to carefully break the task down. The CrowdForge paper provides good guidelines, but I have to adapt them to my own situation.
  • Motivation matters. Various types of motivation are discussed in the readings. What I am wondering is why the engineering education task I put on Amazon Mechanical Turk would matter to general workers. According to the Crowd and Community paper (Haythornthwaite, 2009), what I am doing is blurring the line between two organizing models: the heavyweight community and the lightweight crowd. I am trying to make the crowd contribute content that is useful to the community. But why would the crowd want to do that if it doesn’t matter to them? It would benefit the community if I could make this work, since I could then leverage the power of both the community and the crowd, but I have to think carefully about how to do it.
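To make the Balance condition above more concrete, here is a minimal sketch of how errors cancel (or fail to cancel) when guesses are averaged. All the numbers are made up for illustration, and the simple mean is only an assumed aggregation rule; it is not the method used in the Scorpion search or in my pilot test.

```python
from statistics import mean

TRUE_VALUE = 100.0  # hypothetical quantity the crowd is trying to estimate


def crowd_error(guesses):
    """Absolute error of the crowd's average guess."""
    return abs(mean(guesses) - TRUE_VALUE)


# Informed guessers: each one is off, but the errors point in both directions.
informed = [90.0, 110.0, 95.0, 105.0]

# Uninformed guessers (hypothetical): large errors that all lean the same way,
# so they do not cancel each other out.
uninformed = [300.0, 250.0, 400.0, 350.0, 200.0, 500.0]

balanced_error = crowd_error(informed)
imbalanced_error = crowd_error(informed + uninformed)

print("balanced crowd error:", balanced_error)
print("imbalanced crowd error:", imbalanced_error)
```

In this toy case the four informed guesses average out exactly, while adding six uninformed guesses pulls the average far from the true value. A real task would of course be messier, but the sketch shows why the ratio of experts to non-experts worries me.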

Overall summary: Crowdsourcing is not just an internet buzzword, and it is not an easy, cheap solution to every problem you have. Good results happen only under certain conditions, and bad results like crowdslapping can happen here and there. The process has to be considered carefully, and the crowdsourcing method has to be used wisely.

Question to the readers: do you think there are ways I can make the crowd work for the community (tagging academic papers), and work well? Or is there simply no way?


TECH621 Assignment: Social Media Sites Classification

Instead of coming up with a classification system from scratch, I searched for existing ones. I found this article very interesting, and I agree with the classification system it proposes, at least for the time being. It seems to be the most widely accepted classification system so far, unless we come up with something else from our brilliant minds in our brilliant Tech621 class 🙂

Kaplan, A. M., & Haenlein, M. (2010). Users of the world, unite! The challenges and opportunities of Social Media. Business horizons, 53(1), 59–68.
This is the classification in this paper:

I think Content Communities and Social Networking Sites make up a large share of social media sites, at least among the 20 social media sites each of us in Tech621 came up with.

This article also discusses the definition of social media in comparison with Web 2.0, which is the topic we discussed last week, and it brings in another concept, User Generated Content. In the article’s view, Web 2.0 is the platform for the evolution of Social Media, and Social Media is a group of internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of User Generated Content. This makes sense, and I think there may be a way to merge it with the result of our discussion in the last class.

There is a section in this article titled “2. What is Social Media – And what is it not?” I was greatly looking forward to hearing what Social Media is NOT, but as far as I can tell, the section only covers what social media is and then goes on to the classification of social media; it doesn’t really say what it is not…

Examples and Thoughts:

High Self-representation, Low Media Richness: Blog (LiveJournal), Micro-blogs (Twitter, Sina Weibo)

(I should say that blogs and micro-blogs are initially for sharing thoughts using text, but people can also post rich media like photos and video clips there. As technologies advance, some things or features tend to fuse, but I still think this is a valid classification based on the initial purposes.)

High Self-representation, Medium Media Richness: Social Networking Sites (Facebook, RenRen, , Couchsurfing, Foursquare, Friendfeed, Posterous, Orkut), Friendster (Social Gaming), Dating websites (Fubar)

High Self-representation, High Media Richness: Virtual Social (Second Life)

Low Self-representation, Low Media Richness: collaborative projects (Wiki, Google Docs), Q & A forum (Quora), old discussion groups (Usenet, telnet, BBS), information searching and sharing (Yelp, Eventful, Sourceforge), Zotero

Low Self-representation, Medium Media Richness: YouTube, Flickr, SlateBox (collaborative visualization tool), Spotify, Xiami, Last.fm

(There are people on YouTube recording video diaries, which support even higher self-representation than Facebook. I can only say that there are unlimited possibilities in how people use social media, as long as the technologies allow it. Even when the technologies do not support something, people will make it possible by improving them. Social media evolves according to people’s needs and sometimes their random ideas, not according to how the sites were initially classified; but when we still need to classify them, one standard is their initial or main purpose.)

Low Self-representation, High Media Richness: World of Warcraft

TECH621 Discussion: Web 2.0 Ontology–A Soil and Plants Metaphor

I always thought of ontology as a very abstract term from philosophy, but it has been showing up more and more in my recent reading, so now I realize that in information science, an ontology formally represents knowledge as a set of concepts within a domain and the relationships between those concepts. It can be used to reason about the entities within that domain, and may be used to describe the domain.

In the last TECH621 class, we discussed the concepts of Web 2.0, Social Media, and Social Networking Sites (SNS) and the relationships between them, so basically we were creating an ontology for this set of concepts. The purpose is to come up with a better way to describe this domain, to establish a common base for communication, and to provide a framework for conducting research.

Here is the result of our discussion guided by Dr. V. I added my own thoughts about CSCW (Computer-Supported Cooperative Work) while trying to find the position of email, but then I came to think it’s not necessary to include CSCW here and muddle the other concepts. I also added the Internet to enclose the whole thing, just because I am not very comfortable with half of the CSCW circle being left outside. This is based on our class discussion, but I don’t particularly like using circles for this illustration, for the reason given after the picture.

My Unfinished Thoughts:

(1) Most of what we are talking about is based on the modern Internet and recent, mainstream thinking about social media. There may be particular cases that are not of interest to research in this field, which we chose not to consider. I feel we are defining things to provide a base for academic communication, and our result is heavily influenced by what we refer to in our daily communication in academia. One thing I have learned about being a researcher (I don’t know whether this is right or wrong) is to live with confusion and to put things that are not crucial to your research out of mind.

(2) Here we have two concepts of platforms and two concepts of channels, which I think results from comparing things of different types. The relationship between Social Media and Web 2.0 is more one of “based upon”, so Web 2.0 is a platform in the sense of an operating system. Web 2.0 and Social Media are things of different types, like soil and plants. The relationship between Social Media and SNS is more one of “includes”. They are things of the same type, like plants and a certain family of plants. Illustrating this whole thing with circles (I mean a Venn diagram) is fine, as long as we point out that the two platforms are platforms in different senses, as Dr. V did in class, because drawing circles is a very intuitive way of showing relationships. Personally, however, I think it’s better to use circles (I mean a Venn diagram) only when categorizing things of the same type, so here I prefer the soil and plants metaphor to illustrate the relationships between Web 2.0, Social Media, and SNS:

Web 2.0 is like a particular type of soil with a number of characteristics, such as user participation and contribution. That is to say, this soil needs human beings to interact with it to become meaningful soil; otherwise it’s just a piece of empty ground. Because of this interactive nature, Web 2.0 can also open up other possibilities, and that’s where Social Media comes into play. Social Media is a family of plants growing in this Web 2.0 soil; besides user participation and contribution, it also addresses user communication. SNS is then a sub-family of the Social Media plants.

These are unfinished thoughts; comments and help articulating them are welcome.



McAfee, 2006

TECH621 Discussion: The “Long Tail” in Social Bookmarking

In response to what we discussed in class, here are some references related to the long tail in social bookmarking.

Terms in a tagging system are usually considered to follow a power-law distribution. They tend to converge on a small subset of prevalent keywords, while obscure or problematic tags fall into the “long tail” and thus get filtered out of the central area. This is regarded as an illustration of Zipf’s Law (Zipf, 1935): “in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table”. This is useful for filtering out the problematic tags and reaching a point of convergence for the collective intelligence. However, some researchers point out that there is useful information hidden in the long tail, since it contains informal metadata, and that search methods should be improved to search across the long tail rather than use only a small subset of the tags (Tonkin, 2006).
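As a rough illustration of the head and the long tail, here is a small sketch that ranks tags by frequency and splits them at a cutoff. The tag list, the function names, and the `min_count` cutoff are all hypothetical and chosen only for this example; real tagging data would be far larger and noisier.

```python
from collections import Counter

# Hypothetical tag assignments collected from a social bookmarking site.
tags = (
    ["python"] * 8
    + ["web"] * 4
    + ["tutorial"] * 2
    + ["pyhton", "toread", "misc", "obscure"]  # rare or misspelled tags
)


def rank_tags(tag_list):
    """Return (tag, count) pairs sorted from most to least frequent."""
    return Counter(tag_list).most_common()


def split_head_and_tail(ranked, min_count=2):
    """Split ranked tags into a frequent head and a long tail of rare tags."""
    head = [tag for tag, count in ranked if count >= min_count]
    tail = [tag for tag, count in ranked if count < min_count]
    return head, tail


ranked = rank_tags(tags)
head, tail = split_head_and_tail(ranked)

for rank, (tag, count) in enumerate(ranked, start=1):
    print(rank, tag, count)
print("head:", head)
print("long tail:", tail)
```

Filtering on the head alone would drop the misspelled "pyhton" along with the other rare tags, which is the convenience and also the loss that Tonkin (2006) points to: some of those rare tags may still carry useful informal metadata.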

I also found an online forum discussing how you can use long-tail keywords to get traffic to your website. The basic idea is that you bookmark your website with low-competition keywords; each of them gets a low volume of searches, but cumulatively you get enough traffic. I haven’t looked into whether there are scholarly publications about this idea. I am sure there are tons of other papers that mention the long tail; I list just a few here:
1. Golder, S., & Huberman, B. A. (2005). The structure of collaborative tagging systems. arXiv preprint cs/0508082.
2. Tonkin, E. (2006). Searching the long tail: Hidden structure in social tagging. Proceedings of the 17th ASIS&T SIG/CR Classification Research Workshop. Austin, TX.
3. Zipf, G. K. (1935). The psycho-biology of language: An introduction to dynamic philology. Boston, MA.

TECH621 Assignment 1: 20 Social Media Sites

20 Social Media Sites that I didn’t know before (Some of the descriptions are quoted from Wikipedia):
1.—Collaborative visualization tool
2.—Question & answer forum created, edited and organized by its community of users
3.—A location-based social networking website based on GPS-enabled mobile devices
4.—Real-time feed aggregator/sharing
5.—Blogging and sharing
6.—Video sharing
7.—Find, create, and publish Open Source software for free
8.—previously social networking, and redesigned to social gaming
9.—Social Networking operated by Google, popular in Brazil
10.—Search, track, and share information about events
11.—Korean social network service
12.– Swedish commercial advertisement-financed social networking website for teenagers
13.—Social networking, blogging, profile, messaging
15.—Business networking
16.—social networking
17.—Global social networking
18.—A volunteer-based worldwide network connecting travelers with members of local communities
19.—African-American community social network site
20. and—social network for dog and cat owners

A bit problematic ones:
1. Google offers (beta) —Offers about places to eat, play, shop and stay. (not sure whether users can contribute to the information yet)

2.—Started as a P2P music sharing service, then turned into a music store because of copyright issues. (not sure whether it’s still social media or not)

3. Amazon Mechanical Turk—This one stimulated lots of discussion in class, so I’d like to list it here. I now think Amazon Mechanical Turk may not fully qualify as a social media site, but it does emphasize and utilize one very important aspect of Web 2.0 platforms: collective intelligence and crowdsourcing, that is, the importance of utilizing collective human effort to maintain and improve data quality.

Some social media sites from China:
I used to think that China’s social media sites are largely controlled by government censorship, so there’s not that much to say about them. However, this turned out to be a naive thought of mine. Precisely because of the censorship and people’s desire to express themselves freely, the social media territory is very controversial and is fostering many opportunities too. (reference: China’s Social Network Problem) Here are some examples from among the most popular social media sites in China:
1.–Social Networking, a Chinese version of Facebook

2.–Video sharing, a Chinese version of YouTube

3.–Internet Forum(news, gossip, etc)

4.–Book, movie, music reviews and a lot more

5. Sina Weibo–Microblog, a Chinese version of Twitter, but better in my opinion