ResearchPad - online-encyclopedias Default RSS Feed en-us © 2020 Newgen KnowledgeWorks

<![CDATA[Ten simple rules for designing learning experiences that involve enhancing computational biology Wikipedia articles]]>

<![CDATA[A season for all things: Phenological imprints in Wikipedia usage and their relevance to conservation]]>

Phenology plays an important role in many human–nature interactions, but these seasonal patterns are often overlooked in conservation. Here, we provide the first broad exploration of seasonal patterns of interest in nature across many species and cultures. Using data from Wikipedia, a large online encyclopedia, we analyzed 2.33 billion pageviews to articles for 31,751 species across 245 languages. We show that seasonality plays an important role in how and when people interact with plants and animals online. In total, over 25% of species in our data set exhibited a seasonal pattern in at least one of their language-edition pages, and seasonality is significantly more prevalent in pages for plants and animals than it is in a random selection of Wikipedia articles. Pageview seasonality varies across taxonomic clades in ways that reflect observable patterns in phenology, with groups such as insects and flowering plants having higher seasonality than mammals. Differences between Wikipedia language editions are significant; pages in languages spoken at higher latitudes exhibit greater seasonality overall, and species seldom show the same pattern across multiple language editions. These results have relevance to conservation policy formulation and to improving our understanding of what drives human interest in biodiversity.
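The idea of a "seasonal pattern" in pageviews can be illustrated with a minimal sketch: flag a monthly pageview series as seasonal when its autocorrelation at a 12-month lag exceeds a threshold. This is an illustration only, not the statistical test used in the study; the threshold and toy data are invented.

```python
import math

def autocorr(series, lag):
    """Plain autocorrelation of a list at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return cov / var if var else 0.0

def looks_seasonal(monthly_views, threshold=0.5):
    """Heuristic: repeats-every-12-months correlation above a chosen threshold."""
    return autocorr(monthly_views, 12) > threshold

# Toy data: three years of monthly views peaking every summer
# (e.g. an insect species), versus a flat series.
views = [100 + 80 * math.sin(2 * math.pi * (m - 3) / 12) for m in range(36)]
flat = [100.0] * 36
```

On these toy series, the periodic one is flagged as seasonal and the flat one is not; a real analysis would, of course, need to handle noise, trends, and multiple years of irregular data.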

<![CDATA[Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited]]>

The ability to produce timely and accurate flu forecasts in the United States can significantly impact public health. Augmenting forecasts with internet data has shown promise for improving forecast accuracy and timeliness in controlled settings, but results in practice are less convincing, as models augmented with internet data have not consistently outperformed models without internet data. In this paper, we perform a controlled experiment, taking into account data backfill, to improve clarity on the benefits and limitations of augmenting an already good flu forecasting model with internet-based nowcasts. Our results show that a good flu forecasting model can benefit from the augmentation of internet-based nowcasts in practice for all considered public health-relevant forecasting targets. The degree of forecast improvement due to nowcasting, however, is uneven across forecasting targets, with short-term forecasting targets seeing the largest improvements and seasonal targets such as the peak timing and intensity seeing relatively marginal improvements. The uneven forecasting improvements across targets hold even when “perfect” nowcasts are used. These findings suggest that further improvements to flu forecasting, particularly seasonal targets, will need to derive from other, non-nowcasting approaches.
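The core augmentation idea can be sketched in a few lines: a forecaster driven by surveillance data swaps its still-backfilling latest observation for an internet-based nowcast before extrapolating. The linear extrapolator below is a placeholder for illustration, not the forecasting model evaluated in the study, and the numbers are invented.

```python
def linear_forecast(series, horizon):
    """Extrapolate the last two observations 'horizon' weeks ahead."""
    slope = series[-1] - series[-2]
    return series[-1] + slope * horizon

def forecast_with_nowcast(series, nowcast, horizon):
    """Replace the provisional latest value with a nowcast, then forecast."""
    return linear_forecast(series[:-1] + [nowcast], horizon)

# Toy ILI-like series: the latest value (1.8) is provisional and will
# later backfill upward; an internet nowcast estimates it at ~3.3.
ili = [2.0, 2.4, 2.9, 1.8]
plain = linear_forecast(ili, 1)                  # misled by the backfilled value
augmented = forecast_with_nowcast(ili, 3.3, 1)   # corrected by the nowcast
```

This illustrates why nowcasts help short-term targets most: they repair the most recent, least reliable data point, which matters less for seasonal targets such as peak timing.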

<![CDATA[Emergence of online communities: Empirical evidence and theory]]>

Online communities, which have become an integral part of the day-to-day life of people and organizations, exhibit much diversity in both size and activity level; some communities grow to a massive scale and thrive, whereas others remain small and may even wither. In spite of the important role of these proliferating communities, there is limited empirical evidence that identifies the dominant factors underlying their dynamics. Using data collected from seven large online platforms, we observe a relationship between online community size and its activity which generally repeats itself across platforms: First, in most platforms, three distinct activity regimes exist—one of low activity and two of high activity. Further, we find a sharp activity phase transition at a critical community size that marks the shift between the first and the second regime in six out of the seven online platforms. Essentially, we argue that it is around this critical size that sustainable interactive communities emerge. The third activity regime occurs above a higher characteristic size at which community activity reaches and remains at a constant, higher level. We find that there is variance in the steepness of the slope of the second regime, which leads to the third regime of saturation, but that the third regime is exhibited in six of the seven online platforms. We propose that the sharp activity phase transition and the regime structure stem from the branching property of online interactions.

<![CDATA[Predicting altcoin returns using social media]]>

Cryptocurrencies have recently attracted considerable media interest, driven in particular by their large price fluctuations. Behavioral sciences and related scientific literature provide evidence that there is a close relationship between social media and the price fluctuations of cryptocurrencies. This particularly applies to smaller currencies, which can be substantially influenced by references on Twitter. Although these so-called “altcoins” often have smaller trading volumes, they sometimes attract considerable attention on social media. Here, we show that fluctuations in altcoin prices can be predicted from social media. To do this, we collected a dataset containing the prices and social media activity of 181 altcoins, in the form of 426,520 tweets over a timeframe of 71 days. The public mood contained in these tweets was then estimated using sentiment analysis. To predict altcoin returns, we carried out linear regression analyses based on 45 days of data. We showed that short-term returns can be predicted from activity and sentiment on Twitter.
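A toy version of such a regression can be sketched as ordinary least squares of next-day returns on a daily sentiment score. This is an illustration of the technique, not the study's exact specification; the sentiment scores and returns below are invented.

```python
def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Toy daily data: aggregate tweet sentiment in [-1, 1] on day t,
# and the altcoin's return (%) on day t+1.
sentiment = [0.2, -0.4, 0.8, 0.1, -0.6, 0.5]
next_day_return = [1.1, -2.3, 4.2, 0.4, -3.0, 2.6]

a, b = ols(sentiment, next_day_return)
predicted = a + b * 0.3   # expected next-day return after a mildly positive day
```

A positive fitted slope `b` corresponds to the paper's qualitative finding that more positive Twitter sentiment precedes higher short-term returns.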

<![CDATA[Social media popularity and election results: A study of the 2016 Taiwanese general election]]>

This paper investigates the relationship between candidates’ online popularity and election results, as a step towards creating a model to forecast the results of Taiwanese elections even in the absence of reliable opinion polls on a district-by-district level. 253 of 354 legislative candidates of single-member districts in Taiwan’s 2016 general election had active public Facebook pages during the election period. Hypothesizing that the relative popularity of candidates’ Facebook posts would be positively related to their election results, I calculated each candidate’s Like Ratio (i.e. the proportion of all likes on Facebook posts obtained by candidates in their district). In order to have a measure of online interest without the influence of subjective positivity, I similarly calculated the proportion of daily average page views for each candidate’s Wikipedia page. I ran a regression analysis, incorporating data on results of previous elections and available opinion poll data. I found that the models described the election results well, allowing the null hypothesis to be rejected. My models successfully predicted 80% of winners in single-member districts and were effective in districts without local opinion polls, with a predictive power approaching that of traditional opinion polls. The models also showed good accuracy when run on data for the 2014 Taiwanese municipal mayors election.
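The Like Ratio described above is a simple within-district normalization; a minimal sketch (with invented candidate names and like counts) looks like this:

```python
def like_ratios(district_likes):
    """Map each candidate to their share of all likes in the district."""
    total = sum(district_likes.values())
    return {name: likes / total for name, likes in district_likes.items()}

# Toy district: three candidates and their total Facebook post likes.
district = {"Candidate A": 12000, "Candidate B": 6000, "Candidate C": 2000}
ratios = like_ratios(district)
```

By construction the ratios in each district sum to one, which makes them comparable across districts of very different sizes — the same normalization the paper applies to Wikipedia pageviews.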

<![CDATA[A productive clash of perspectives? The interplay between articles’ and authors’ perspectives and their impact on Wikipedia edits in a controversial domain]]>

This study examined predictors of the development of Wikipedia articles that deal with controversial issues. We chose a corpus of articles in the German-language version of Wikipedia about alternative medicine as a representative controversial issue. We extracted edits made until March 2013 and categorized them using a supervised machine learning setup as either being pro conventional medicine, pro alternative medicine, or neutral. Based on these categories, we established relevant variables, such as the perspectives of articles and of authors at certain points in time, the (im)balance of an article’s perspective, the number of non-neutral edits per article, the number of authors per article, authors’ heterogeneity per article, and incongruity between authors’ and articles’ perspectives. The underlying objective was to predict the development of articles’ perspectives with regard to the controversial topic. The empirical part of the study is embedded in theoretical considerations about editorial biases and the effectiveness of norms and rules in Wikipedia, such as the neutral point of view policy. Our findings revealed a selection bias where authors edited mainly articles with perspectives similar to their own viewpoint. Regression analyses showed that an author’s perspective as well as the article’s previous perspectives predicted the perspective of the resulting edits, although both predictors interact with each other. Further analyses indicated that articles with more non-neutral edits were altogether more balanced. We also found a positive effect of the number of authors and of the authors’ heterogeneity on articles’ balance. However, while the effect of the number of authors was restricted to pro-conventional medicine articles, the authors’ heterogeneity effect was restricted to pro-alternative medicine articles.
Finally, we found a negative effect of incongruity between authors’ and articles’ perspectives that was pronounced for the pro-alternative medicine articles.

<![CDATA[The Role of Temporal Trends in Growing Networks]]>

The rich-get-richer principle, manifested by the preferential attachment (PA) mechanism, is widely considered one of the major factors in the growth of real-world networks. PA stipulates that popular nodes are bound to be more attractive than less popular nodes; for example, highly cited papers are more likely to garner further citations. However, it overlooks the transient nature of popularity, which is often governed by trends. Here, we show that in a wide range of real-world networks the recent popularity of a node, i.e., the extent to which it accumulated links recently, significantly influences its attractiveness and ability to accumulate further links. We proceed to model this observation with a natural extension to PA, named Trending Preferential Attachment (TPA), in which edges become less influential as they age. TPA quantitatively parametrizes a fundamental network property, namely the network’s tendency to trends. Through TPA, we find that real-world networks tend to be moderately to highly trendy. Networks are characterized by different susceptibilities to trends, which determine their structure to a large extent. Trendy networks display complex structural traits, such as modular community structure and degree assortativity, occurring regularly in real-world networks. In summary, this work addresses an inherent trait of complex networks, which greatly affects their growth and structure, and develops a unified model to address its interaction with preferential attachment.
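The trend-sensitive growth rule can be sketched with a toy simulation: a new node attaches with probability proportional to how many links each existing node gained recently. The sliding window below is a simplification of the edge-aging mechanism in the paper, and all parameters are illustrative.

```python
import random

def grow_tpa(n_nodes, window=5, seed=42):
    """Grow a network one node at a time; return the list of (child, parent) edges.

    Attachment probability is proportional to each node's share of the
    most recent 'window' edges, so old popularity fades — a crude stand-in
    for TPA's decaying edge influence.
    """
    random.seed(seed)
    edges = [(1, 0)]                      # start from a single link
    for new in range(2, n_nodes):
        recent = edges[-window:]          # only recent links confer attractiveness
        weights = {}
        for a, b in recent:
            weights[a] = weights.get(a, 0) + 1
            weights[b] = weights.get(b, 0) + 1
        nodes = list(weights)
        target = random.choices(nodes, [weights[v] for v in nodes])[0]
        edges.append((new, target))
    return edges

edges = grow_tpa(50)
```

With a small window, attractiveness is dominated by very recent activity (highly "trendy" growth); letting the window cover the whole edge list recovers ordinary preferential attachment.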

<![CDATA[Can Simple Transmission Chains Foster Collective Intelligence in Binary-Choice Tasks?]]>

In many social systems, groups of individuals can find remarkably efficient solutions to complex cognitive problems, sometimes even outperforming a single expert. The success of the group, however, crucially depends on how the judgments of the group members are aggregated to produce the collective answer. A large variety of such aggregation methods have been described in the literature, such as averaging the independent judgments, relying on the majority or setting up a group discussion. In the present work, we introduce a novel approach for aggregating judgments—the transmission chain—which has not yet been consistently evaluated in the context of collective intelligence. In a transmission chain, all group members have access to a unique collective solution and can improve it sequentially. Over repeated improvements, the collective solution that emerges reflects the judgments of every group member. We address the question of whether such a transmission chain can foster collective intelligence for binary-choice problems. In a series of numerical simulations, we explore the impact of various factors on the performance of the transmission chain, such as the group size, the model parameters, and the structure of the population. The performance of this method is compared to those of the majority rule and the confidence-weighted majority. Finally, we rely on two existing datasets of individuals performing a series of binary decisions to evaluate the expected performances of the three methods empirically. We find that the parameter space where the transmission chain has the best performance rarely appears in real datasets. We conclude that the transmission chain is best suited for other types of problems, such as those that have cumulative properties.
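The contrast between the two aggregation schemes can be seen in a much-simplified simulation: majority voting over independent judgments versus a chain in which each member overwrites the running answer only when confident. This is a hedged sketch, not the paper's model; all parameter values are illustrative.

```python
import random

def simulate(n_agents=25, p_correct=0.6, p_confident=0.3, trials=2000, seed=1):
    """Return (majority accuracy, chain accuracy) on a binary-choice task."""
    random.seed(seed)
    majority_wins = chain_wins = 0
    for _ in range(trials):
        # Each agent's private judgment is correct with probability p_correct.
        judgments = [random.random() < p_correct for _ in range(n_agents)]
        if sum(judgments) * 2 > n_agents:          # majority rule
            majority_wins += 1
        answer = judgments[0]                      # chain starts from member 1
        for j in judgments[1:]:
            if random.random() < p_confident:      # only confident members intervene
                answer = j
        chain_wins += answer
    return majority_wins / trials, chain_wins / trials

maj_acc, chain_acc = simulate()
```

In this naive variant the chain's final answer is just one member's judgment, so it cannot beat the majority — consistent with the paper's conclusion that the chain outperforms the alternatives only in a narrow parameter region.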

<![CDATA[Identifying Topics in Microblogs Using Wikipedia]]>

Twitter is an extremely high-volume platform for user-generated contributions on any topic. The wealth of content created in real time and in massive quantities calls for automated approaches to identify the topics of the contributions. Such topics can be utilized in numerous ways, such as public opinion mining, marketing, entertainment, and disaster management. Towards this end, approaches to relate single or partial posts to knowledge base items have been proposed. However, in microblogging systems like Twitter, topics emerge from the culmination of a large number of contributions. Therefore, it is necessary to identify topics based on collections of posts, where individual posts contribute to some aspect of the greater topic. Models such as Latent Dirichlet Allocation (LDA) provide algorithms for relating collections of posts to sets of keywords that represent underlying topics. In these approaches, figuring out what specific topic(s) the keyword sets represent remains a separate task. Another issue in topic detection is scope, which is often limited to a specific domain, such as health. This work proposes an approach for identifying domain-independent specific topics related to sets of posts. In this approach, individual posts are processed and then aggregated to identify key tokens, which are then mapped to specific topics. Wikipedia article titles are selected to represent topics, since they are up-to-date, user-generated, sophisticated articles that span topics of human interest. This paper describes the proposed approach, a prototype implementation, and a case study based on data gathered during the periods of heavy contribution corresponding to the four US election debates in 2012. The manually evaluated results (0.96 precision) and other observations from the study are discussed in detail.
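The aggregate-then-map step described above can be sketched as pooling tokens from many posts, keeping the most frequent ones, and looking each up in an index of Wikipedia article titles. The two-entry index and the posts below are invented for illustration; the real system matches against the full, live title space.

```python
from collections import Counter

# Hypothetical token -> Wikipedia article title index (illustrative only).
TITLE_INDEX = {"debate": "United States presidential debates",
               "economy": "Economy of the United States"}

def identify_topics(posts, top_k=2):
    """Pool tokens across posts, rank by frequency, map to known titles."""
    tokens = Counter(word.strip("#").lower()
                     for post in posts for word in post.split())
    frequent = [tok for tok, _ in tokens.most_common() if tok in TITLE_INDEX]
    return [TITLE_INDEX[tok] for tok in frequent[:top_k]]

posts = ["Watching the #debate tonight",
         "debate was all about the economy",
         "economy questions dominated the debate"]
topics = identify_topics(posts)
```

Note that no single post mentions both themes prominently; the topics only emerge from the collection, which is the motivation the abstract gives for aggregating posts before mapping.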

<![CDATA[Leveraging Big Data for Exploring Occupational Diseases-Related Interest at the Level of Scientific Community, Media Coverage and Novel Data Streams: The Example of Silicosis as a Pilot Study]]>

Silicosis is an untreatable but preventable occupational disease, caused by exposure to silica. It can progressively evolve to lung impairment, respiratory failure and death, even after exposure has ceased. However, little is known about occupational disease-related interest at the level of the scientific community, media coverage and web behavior. This article aims to fill this knowledge gap, taking silicosis as a case study.

We investigated silicosis-related web activities using Google Trends (GT) to capture Internet behavior worldwide in the years 2004–2015. GT-generated data were then compared with the silicosis-related scientific production (i.e., PubMed and Google Scholar), the media coverage (i.e., Google News), the Wikipedia traffic (i.e., Wikitrends) and the usage of new media (i.e., YouTube and Twitter).

A peak in silicosis-related web searches was noticed in 2010–2011: interestingly, both scientific article production and media coverage increased markedly after these years in a statistically significant way. Public interest and engagement were reflected in an increase in likes, comments, hashtags, and re-tweets. However, only a small fraction of the posted/uploaded material contained accurate scientific information.

GT could be useful for assessing the public's reaction and level of engagement with novel risk factors associated with occupational diseases, with possibly related changes in disease natural history, and with the effectiveness of preventive workplace practices and legislative measures adopted to improve occupational health. Further, occupational clinicians should become aware of the topics most frequently searched by patients and proactively address these concerns during the medical examination. Institutional bodies and organisms should be more present and active in digital tools and media to disseminate and communicate scientifically accurate information. This manuscript is intended as a preliminary, exploratory communication, paving the way for further studies.

<![CDATA[Even good bots fight: The case of Wikipedia]]>

In recent years, there has been a huge increase in the number of bots online, varying from Web crawlers for search engines, to chatbots for online customer service, spambots on social media, and content-editing bots in online collaboration communities. The online world has turned into an ecosystem of bots. However, our knowledge of how these automated agents are interacting with each other is rather poor. Bots are predictable automatons that do not have the capacity for emotions, meaning-making, creativity, and sociality, and it is hence natural to expect interactions between bots to be relatively predictable and uneventful. In this article, we analyze the interactions between bots that edit articles on Wikipedia. We track the extent to which bots undid each other’s edits over the period 2001–2010, model how pairs of bots interact over time, and identify different types of interaction trajectories. We find that, although Wikipedia bots are intended to support the encyclopedia, they often undo each other’s edits and these sterile “fights” may sometimes continue for years. Unlike humans on Wikipedia, bots’ interactions tend to occur over longer periods of time and to be more reciprocated. Yet, just like humans, bots in different cultural environments may behave differently. Our research suggests that even relatively “dumb” bots may give rise to complex interactions, and this carries important implications for Artificial Intelligence research. Understanding what affects bot-bot interactions is crucial for managing social media well, providing adequate cyber-security, and designing well-functioning autonomous vehicles.
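The reciprocated-revert analysis can be sketched as follows: given a chronological log of (reverting bot, reverted bot) events, a pair "fights" when each has undone the other at least once. The log below is invented for illustration; the real analysis extracts reverts from article revision histories.

```python
def reciprocated_pairs(revert_log):
    """Return the unordered bot pairs that reverted each other at least once."""
    directed = set(revert_log)                     # distinct (reverter, reverted) pairs
    return {frozenset((a, b)) for (a, b) in directed if (b, a) in directed}

# Toy revert log in chronological order (bot names are illustrative).
log = [("Xqbot", "Darknessbot"),
       ("Darknessbot", "Xqbot"),
       ("Xqbot", "Darknessbot"),
       ("Tachikoma", "Xqbot")]

pairs = reciprocated_pairs(log)
```

Here only one pair is mutually reverting; the third bot reverted another without being reverted back, so it is not flagged. Counting events per pair over time would give the interaction trajectories the paper models.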

<![CDATA[Crowdsourcing a Collective Sense of Place]]>

Place can be generally defined as a location that has been assigned meaning through human experience, and as such it is of multidisciplinary scientific interest. Up to this point place has been studied primarily within the context of social sciences as a theoretical construct. The availability of large amounts of user-generated content, e.g. in the form of social media feeds or Wikipedia contributions, allows us for the first time to computationally analyze and quantify the shared meaning of place. By aggregating references to human activities within urban spaces we can observe the emergence of unique themes that characterize different locations, thus identifying places through their discernible sociocultural signatures. In this paper we present results from a novel quantitative approach to derive such sociocultural signatures from Twitter contributions and also from corresponding Wikipedia entries. By contrasting the two we show how particular thematic characteristics of places (referred to herein as platial themes) are emerging from such crowd-contributed content, allowing us to observe the meaning that the general public, either individually or collectively, is assigning to specific locations. Our approach leverages probabilistic topic modelling, semantic association, and spatial clustering to find locations that convey a collective sense of place. Deriving and quantifying such meaning allows us to observe how people transform a location to a place and shape its characteristics.

<![CDATA[Understanding and coping with extremism in an online collaborative environment: A data-driven modeling]]>

The Internet has provided us with great opportunities for large scale collaborative public good projects. Wikipedia is a predominant example of such projects where conflicts emerge and get resolved through bottom-up mechanisms leading to the emergence of the largest encyclopedia in human history. Disaccord arises whenever editors with different opinions try to produce an article reflecting a consensual view. The debates are mainly heated by editors with extreme views. Using a model of common value production, we show that the consensus can only be reached if groups with extreme views can actively take part in the discussion and if their views are also represented in the common outcome, at least temporarily. We show that banning problematic editors mostly hinders the consensus as it delays discussion and thus the whole consensus building process. To validate the model, relevant quantities are measured both in simulations and Wikipedia, which show satisfactory agreement. We also consider the role of direct communication between editors both in the model and in Wikipedia data (by analyzing the Wikipedia talk pages). While the model suggests that in certain conditions there is an optimal rate of “talking” vs “editing”, it correctly predicts that in the current settings of Wikipedia, more activity in talk pages is associated with more controversy.

<![CDATA[Predicting Virtual World User Population Fluctuations with Deep Learning]]>

This paper proposes a system for predicting fluctuations in virtual world user populations. The user population is a very important aspect of these worlds; however, methods for predicting fluctuations in these populations have not been well documented. Therefore, we attempt to predict changes in virtual world user populations with deep learning, using easily accessible online data, including formal datasets from Google Trends, Wikipedia, and online communities, as well as informal datasets collected from online forums. We use the proposed system to analyze the user population of EVE Online, one of the largest virtual worlds.

<![CDATA[An Algorithm to Automatically Generate the Combinatorial Orbit Counting Equations]]>

Graphlets are small subgraphs, usually containing up to five vertices, that can be found in a larger graph. Identifying the graphlets that a vertex in an explored graph touches can provide useful information about the local structure of the graph around that vertex. Actually finding all graphlets in a large graph can be time-consuming, however. As the graphlets grow in size, more distinct graphlets emerge and the time needed to find each graphlet also scales up. If it is not necessary to find each instance of each graphlet, and knowing the number of graphlets touching each node of the graph suffices, the problem is less hard. Previous research shows a way to simplify counting the graphlets: instead of looking for the graphlets needed, smaller graphlets are searched for, as well as the number of common neighbors of vertices. Solving a system of equations then gives the number of times a vertex is part of each graphlet of the desired size. However, until now, equations existed only for counting graphlets with 4 or 5 nodes. In this paper, two new techniques are presented. The first allows the needed equations to be generated automatically. This eliminates the tedious work of deriving them manually each time an extra node is added to the graphlets. The technique is independent of the number of nodes in the graphlets and can thus be used to count larger graphlets than previously possible. The second technique gives all graphlets a unique ordering which is easily extended to name graphlets of any size. Both techniques were used to generate equations to count graphlets with 4, 5 and 6 vertices, which extends all previous results. Code for both techniques is available online.
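The counting-via-equations idea has a tiny worked instance for 3-node graphlets: the wedges centred on a vertex v satisfy C(deg(v), 2) = (#triangles at v) + (#open 2-paths centred at v), so the path orbit follows from degrees and common-neighbour information alone. The sketch below illustrates that one equation, not the paper's general algorithm.

```python
from itertools import combinations

def orbit_counts_3(adj, v):
    """Return (#triangles containing v, #open 2-paths centred on v).

    adj maps each vertex to its set of neighbours (undirected graph).
    """
    nbrs = adj[v]
    # Triangles at v: neighbour pairs that are themselves adjacent.
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Every neighbour pair forms a wedge; subtracting the closed ones
    # (triangles) leaves the open 2-paths — the "solve the equation" step.
    wedges = len(nbrs) * (len(nbrs) - 1) // 2
    return triangles, wedges - triangles

# Toy graph: a triangle 0-1-2 plus a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
tri, paths = orbit_counts_3(adj, 2)
```

Vertex 2 sits in one triangle and is the centre of two open 2-paths (0–2–3 and 1–2–3). The paper's contribution is generating the analogous, much larger equation systems for 4-, 5- and 6-node graphlets automatically.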

<![CDATA[Tracking Protests Using Geotagged Flickr Photographs]]>

Recent years have witnessed waves of protests sweeping across countries and continents, in some cases resulting in political and governmental change. Much media attention has been focused on the increasing usage of social media to coordinate and provide instantly available reports on these protests. Here, we investigate whether it is possible to identify protest outbreaks through quantitative analysis of activity on the photo sharing site Flickr. We analyse 25 million photos uploaded to Flickr in 2013 across 244 countries and regions, and determine for each week in each country and region what proportion of the photographs are tagged with the word “protest” in 34 different languages. We find that higher proportions of “protest”-tagged photographs in a given country and region in a given week correspond to greater numbers of reports of protests in that country and region and week in the newspaper The Guardian. Our findings underline the potential value of photographs uploaded to the Internet as a source of global, cheap and rapidly available measurements of human behaviour in the real world.
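The underlying signal is a simple proportion: for each country-week, the share of uploaded photos carrying a "protest" tag. A minimal sketch (with invented photo tag lists) is:

```python
def protest_proportion(photos):
    """photos: list of tag lists for one country-week; returns tagged share."""
    tagged = sum(1 for tags in photos if "protest" in tags)
    return tagged / len(photos) if photos else 0.0

# Toy country-weeks: one quiet, one with unrest.
week_quiet = [["sunset"], ["food"], ["travel", "beach"]]
week_unrest = [["protest", "square"], ["protest"], ["city"], ["protest", "march"]]

quiet = protest_proportion(week_quiet)
unrest = protest_proportion(week_unrest)
```

Using the proportion rather than the raw count normalizes away overall Flickr activity in a country, which is why weeks of unrest stand out even in countries with few uploads; the study extends this to "protest" in 34 languages and correlates the series with newspaper reports.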

<![CDATA[Accuracy and Completeness of Drug Information in Wikipedia: A Comparison with Standard Textbooks of Pharmacology]]>

The online resource Wikipedia is increasingly used by students for knowledge acquisition and learning. However, the lack of a formal editorial review and the heterogeneous expertise of contributors often result in skepticism among educators about whether Wikipedia should be recommended to students as an information source. In this study we systematically analyzed the accuracy and completeness of drug information in the German and English language versions of Wikipedia in comparison to standard textbooks of pharmacology. In addition, references, revision history and readability were evaluated. Analysis of readability was performed using the Amstad readability index and the Erste Wiener Sachtextformel. The data on indication, mechanism of action, pharmacokinetics, adverse effects and contraindications for 100 curricular drugs were retrieved from standard German textbooks of general pharmacology and compared with the corresponding articles in the German language version of Wikipedia. Quantitative analysis revealed that the accuracy of drug information in Wikipedia was 99.7±0.2% when compared to the textbook data. The overall completeness of drug information in Wikipedia was 83.8±1.5% (p<0.001). Completeness varied between categories, and was lowest in the category “pharmacokinetics” (68.0±4.2%; p<0.001) and highest in the category “indication” (91.3±2.0%) when compared to the textbook data overlap. Similar results were obtained for the English language version of Wikipedia. Of the drug information missing in Wikipedia, 62.5% was rated as didactically non-relevant in a qualitative re-evaluation study. Drug articles in Wikipedia had an average of 14.6±1.6 references and 262.8±37.4 edits performed by 142.7±17.6 editors. Both the Wikipedia and textbook samples had comparably low readability. Our study suggests that Wikipedia is an accurate and comprehensive source of drug-related information for undergraduate medical education.
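For readers unfamiliar with the readability measure mentioned above: the Amstad index, a German adaptation of the Flesch Reading Ease, is commonly given as 180 − ASL − 58.5 × ASW, where ASL is the average sentence length (words per sentence) and ASW the average number of syllables per word. The sketch below takes counts as given, since automatic German syllabification is a separate problem; the example counts are invented.

```python
def amstad(n_words, n_sentences, n_syllables):
    """Amstad readability index (higher = easier to read) from raw counts."""
    asl = n_words / n_sentences        # average sentence length
    asw = n_syllables / n_words        # average syllables per word
    return 180 - asl - 58.5 * asw

# A hypothetical 100-word passage in 8 sentences with 190 syllables.
score = amstad(100, 8, 190)
```

Longer sentences and longer words both push the score down, which is why dense pharmacology prose — in Wikipedia and textbooks alike — scores as hard to read.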

<![CDATA[CuboCube: Student creation of a cancer genetics e-textbook using open-access software for social learning]]>

Student creation of educational materials has the capacity both to enhance learning and to decrease costs. Three successive honors-style classes of undergraduate students in a cancer genetics class worked with a new software system, CuboCube, to create an e-textbook. CuboCube is an open-source learning materials creation system designed to facilitate e-textbook development, with an ultimate goal of improving the social learning experience for students. Equipped with crowdsourcing capabilities, CuboCube provides intuitive tools for nontechnical and technical authors alike to create content together in a structured manner. The process of e-textbook development revealed both strengths and challenges of the approach, which can inform future efforts. Both the CuboCube platform and the Cancer Genetics E-textbook are freely available to the community.

<![CDATA[Search strategies of Wikipedia readers]]>

The quest for information is one of the most common activities of human beings. Despite the impressive progress of search engines, finding the needed piece of information can still be very hard, as can acquiring specific competences and knowledge by shaping and following the proper learning paths. Indeed, the need to find sensible paths in information networks is one of the biggest challenges of our societies and, to address it effectively, it is important to investigate the strategies adopted by human users to cope with the cognitive bottleneck of finding their way in a growing sea of information. Here we focus on the case of Wikipedia and investigate a recently released dataset of users’ clicks on the English Wikipedia, namely the English Wikipedia Clickstream. We perform a semantically charged analysis to uncover the general patterns followed by information seekers in the multi-dimensional space of Wikipedia topics/categories. We discover the existence of well-defined strategies in which users tend to start from very general, i.e., semantically broad, pages and progressively narrow down the scope of their navigation while maintaining a growing semantic coherence. This is unlike the strategies associated with tasks that have predefined search goals, namely the case of the Wikispeedia game, in which users first move from the ‘particular’ to the ‘universal’ before focusing down again on the required target. The clear picture offered here represents an important stepping stone towards a better design of information networks and recommendation strategies, as well as the construction of radically new learning paths.