ResearchPad - natural-language https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[The Language of Innovation]]> https://www.researchpad.co/article/elastic_article_10245 Predicting innovation is a peculiar problem in data science. By definition, an innovation is a never-seen-before event, leaving no room for traditional supervised learning approaches. Here we propose a strategy to address the problem in the context of innovative patents, by defining innovations as never-seen-before associations of technologies and exploiting self-supervised learning techniques. We think of the technological codes present in patents as a vocabulary, and of the whole technological corpus as written in a specific, evolving language. We leverage this structure with techniques borrowed from Natural Language Processing by embedding technologies in a high-dimensional Euclidean space where relative positions represent learned semantics. Proximity in this space is an effective predictor of specific innovation events, outperforming a wide range of standard link-prediction metrics. The success of patented innovations follows complex dynamics characterized by different patterns, which we analyze in detail with specific examples. The methods proposed in this paper provide a new way of understanding and forecasting innovation, tackling it from a revealing perspective and opening interesting scenarios for a number of applications and further analytic approaches.
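The core idea, that proximity in an embedding space ranks never-seen-before technology pairs as candidate innovations, can be illustrated with a toy sketch. The vectors and technology codes below are hypothetical placeholders, not the paper's trained model:

```python
import math

# Toy illustration: technology codes embedded as vectors; proximity in the
# embedding space ranks unseen code pairs as candidate innovations.
embeddings = {
    "H04L": [0.9, 0.1, 0.2],   # hypothetical digital-transmission code
    "G06F": [0.8, 0.2, 0.3],   # hypothetical computing code
    "A61K": [0.1, 0.9, 0.7],   # hypothetical pharmaceutical code
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidate_pairs(emb):
    """Rank code pairs by embedding proximity (higher = more likely association)."""
    codes = sorted(emb)
    pairs = [(a, b) for i, a in enumerate(codes) for b in codes[i + 1:]]
    return sorted(pairs, key=lambda p: cosine(emb[p[0]], emb[p[1]]), reverse=True)

ranking = rank_candidate_pairs(embeddings)
```

In a real setting the vectors would be learned from patent co-occurrence data rather than hand-set, and the ranking would be restricted to pairs that have never co-occurred before.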

]]>
<![CDATA[Using case-level context to classify cancer pathology reports]]> https://www.researchpad.co/article/elastic_article_7869 Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence—for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks—site, subsite, laterality, histology, behavior, and grade. We expect that, with minimal modifications, our add-on can be applied to a wide range of other clinical text-based tasks.
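One simple way to picture a modular case-level add-on is to augment each report's feature vector with an aggregate of the vectors of earlier reports in the same case. This is a minimal sketch of that general idea, not the authors' actual architecture:

```python
# Sketch: give a per-report classifier case-level context by concatenating
# each report vector with the mean of the vectors of all prior reports
# in the same cancer case.
def with_case_context(report_vectors):
    """Concatenate each report vector with the mean of its predecessors."""
    augmented = []
    running_sum = [0.0] * len(report_vectors[0])
    for i, vec in enumerate(report_vectors):
        if i == 0:
            context = [0.0] * len(vec)          # no prior reports yet
        else:
            context = [s / i for s in running_sum]
        augmented.append(vec + context)          # list concatenation
        running_sum = [s + x for s, x in zip(running_sum, vec)]
    return augmented

# Hypothetical 2-dimensional report representations for one case.
case = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aug = with_case_context(case)
```

Any downstream classifier over individual reports can consume the augmented vectors unchanged, which is what makes such an add-on modular.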

]]>
<![CDATA[Switching between reading tasks leads to phase-transitions in reading times in L1 and L2 readers]]> https://www.researchpad.co/article/5c63394bd5eed0c484ae6445

Reading research uses different tasks to investigate different levels of the reading process, such as word recognition, syntactic parsing, or semantic integration. It seems to be tacitly assumed that the underlying cognitive processes that constitute reading are stable across those tasks. However, nothing is known about what happens when readers switch from one reading task to another. The stability assumptions of the reading process suggest that the cognitive system resolves this switching between two tasks quickly. Here, we present an alternative language-game hypothesis (LGH) of reading that begins by treating reading as a softly-assembled process and that assumes, instead of stability, context-sensitive flexibility of the reading process. LGH predicts that switching between two reading tasks leads to longer-lasting, phase-transition-like patterns in the reading process. Using the nonlinear-dynamical tool of recurrence quantification analysis, we test these predictions by examining series of individual word reading times in self-paced reading tasks where native (L1) and second-language (L2) readers transition between random-word and ordered-text reading tasks. We find consistent evidence for phase-transitions in the reading times when readers switch from ordered text to random-word reading, but we find mixed evidence when readers transition from random-word to ordered-text reading. In the latter case, L2 readers show moderately stronger signs of phase-transitions compared to L1 readers, suggesting that familiarity with a language influences whether and how such transitions occur. The results provide evidence for LGH and suggest that the cognitive processes underlying reading are not fully stable across tasks but exhibit soft-assembly in the interaction between task and reader characteristics.
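A basic building block of recurrence quantification analysis is the recurrence rate: the fraction of time-point pairs whose values fall within a chosen radius of each other. The sketch below uses made-up reading times and an arbitrary radius, not the paper's data or parameter settings:

```python
# Minimal recurrence-rate sketch: two reading times "recur" if they differ
# by at most a radius; the recurrence rate is the fraction of recurring
# ordered pairs. A task switch that shifts the dynamics shows up as a
# change in such recurrence measures.
def recurrence_rate(series, radius):
    n = len(series)
    recurrent = sum(
        1
        for i in range(n)
        for j in range(n)
        if i != j and abs(series[i] - series[j]) <= radius
    )
    return recurrent / (n * (n - 1))

# Hypothetical word reading times (ms): stable, then a jump after a task switch.
times = [300, 310, 305, 302, 510, 520, 515]
rr = recurrence_rate(times, radius=15)
```

Full RQA also derives measures such as determinism and laminarity from diagonal and vertical line structures in the recurrence plot; the recurrence rate shown here is only the simplest of these.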

]]>
<![CDATA[No evidence for effects of Turkish immigrant children’s bilingualism on executive functions]]> https://www.researchpad.co/article/5c61b7ccd5eed0c484937fd6

Recent research has increasingly questioned the bilingual advantage for executive functions (EF). We used structural equation modeling in a large sample of Turkish immigrant and German monolingual children (N = 337; aged 5–15 years) to test associations between bilingualism and EF. Our data showed no significant group differences between Turkish immigrant and German children’s EF skills while taking into account maternal education, child gender, age, and working memory (i.e., digit span backwards). Moreover, neither Turkish immigrant children’s proficiency in either language nor their home language environment predicted EF. Our findings offer important new evidence in light of the ongoing debate about the existence of a bilingual advantage for EF.

]]>
<![CDATA[Beyond opinion classification: Extracting facts, opinions and experiences from health forums]]> https://www.researchpad.co/article/5c3fa56ad5eed0c484ca4115

Introduction

Surveys indicate that patients, particularly those suffering from chronic conditions, strongly benefit from the information found in social networks and online forums. One challenge in accessing online health information is to differentiate between factual and more subjective information. In this work, we evaluate the feasibility of exploiting lexical, syntactic, semantic, network-based and emotional properties of texts to automatically classify patient-generated contents into three types: “experiences”, “facts” and “opinions”, using machine learning algorithms. In this context, our goal is to develop automatic methods that will make online health information more easily accessible and useful for patients, professionals and researchers.

Material and methods

We work with a set of 3000 posts to online health forums on breast cancer, Crohn’s disease and different allergies. Each sentence in a post is manually labeled as “experience”, “fact” or “opinion”. Using this data, we train a support vector machine algorithm to perform the classification. The results are evaluated in a 10-fold cross-validation procedure.
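The simplest representation fed to such a classifier is a bag-of-words vector over a fixed vocabulary. The sentences and vocabulary below are illustrative, not drawn from the paper's corpus:

```python
# Sketch of the bag-of-words representation an SVM would consume:
# each sentence becomes a vector of word counts over the training vocabulary.
def build_vocab(sentences):
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(vocab)}

def bag_of_words(sentence, vocab):
    vec = [0] * len(vocab)
    for w in sentence.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1   # out-of-vocabulary words are dropped
    return vec

train = ["I felt tired after chemo", "chemo targets dividing cells"]
vocab = build_vocab(train)
x = bag_of_words("chemo made me tired", vocab)
```

In practice one would add tokenisation beyond whitespace splitting, lowercase normalisation of punctuation, and the richer lexical, syntactic and network-based features the paper evaluates.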

Results

Overall, we find that it is possible to predict the type of information contained in a forum post with a very high accuracy (over 80 percent) using simple text representations such as word embeddings and bags of words. We also analyze more complex features such as those based on the network properties, the polarity of words and the verbal tense of the sentences and show that, when combined with the previous ones, they can boost the results.

]]>
<![CDATA[Semantic algorithms can detect how media language shapes survey responses in organizational behaviour]]> https://www.researchpad.co/article/5c117b80d5eed0c48469977f

Research on sensemaking in organisations and on linguistic relativity suggests that speakers of the same language may use this language in different ways to construct social realities at work. We apply a semantic theory of survey response (STSR) to explore such differences in quantitative survey research. Using text analysis algorithms, we have studied how language from three media domains–the business press, PR Newswire and general newspapers–has differential explanatory value for analysing survey responses in leadership research. Projecting well-known surveys measuring leadership, motivation and outcomes into large text samples from these three media domains revealed significantly different impacts on survey responses. Business press language was best at explaining leadership-related items, PR language was best at explaining organizational results, and “ordinary” newspaper language seemed best to explain the relationships among motivation items. These findings shed light on how different public arenas construct organizational realities in different ways, and on the consequences these differences have for methodology in leadership research.

]]>
<![CDATA[Feature engineering for sentiment analysis in e-health forums]]> https://www.researchpad.co/article/5c099452d5eed0c4842aea35

Introduction

Exploiting information in health-related social media services is of great interest for patients, researchers and medical companies. The challenge is, however, to provide easy, quick and relevant access to the vast amount of information that is available. One step towards facilitating information access to online health data is opinion mining. Even though the classification of patient opinions into positive and negative has been previously tackled, most works make use of machine learning methods and bags of words. Our first contribution is an extensive evaluation of different features, including lexical, syntactic, semantic, network-based, sentiment-based and word embeddings features to represent patient-authored texts for polarity classification. The second contribution of this work is the study of polar facts (i.e. objective information with polar connotations). Traditionally, the presence of polar facts has been neglected and research in polarity classification has been bounded to opinionated texts. We demonstrate the existence and importance of polar facts for the polarity classification of health information.

Material and methods

We annotate a set of more than 3500 posts to online health forums on breast cancer, Crohn’s disease and different allergies. Each sentence in a post is manually labeled as “experience”, “fact” or “opinion”, and as “positive”, “negative” or “neutral”. Using this data, we train different machine learning algorithms and compare traditional bags-of-words representations with word embeddings, in combination with lexical, syntactic, semantic, network-based and emotional properties of texts, to automatically classify patient-authored contents into positive, negative and neutral. In addition, we experiment with a combination of textual and semantic representations by generating concept embeddings using the UMLS Metathesaurus.

Results

We reach two main results: first, we find that it is possible to predict polarity of patient-authored contents with a very high accuracy (≈ 70 percent) using word embeddings, and that this considerably outperforms more traditional representations like bags of words; and second, when dealing with medical information, negative and positive facts (i.e. objective information) are nearly as frequent as negative and positive opinions and experiences (i.e. subjective information), and their importance for polarity classification is crucial.

]]>
<![CDATA[The Biological Origin of Linguistic Diversity]]> https://www.researchpad.co/article/5989d9f9ab0ee8fa60b71262

In contrast with animal communication systems, diversity is characteristic of almost every aspect of human language. Languages variously employ tones, clicks, or manual signs to signal differences in meaning; some languages lack the noun-verb distinction (e.g., Straits Salish), whereas others have a proliferation of fine-grained syntactic categories (e.g., Tzeltal); and some languages do without morphology (e.g., Mandarin), while others pack a whole sentence into a single word (e.g., Cayuga). A challenge for evolutionary biology is to reconcile the diversity of languages with the high degree of biological uniformity of their speakers. Here, we model processes of language change and geographical dispersion and find a consistent pressure for flexible learning, irrespective of the language being spoken. This pressure arises because flexible learners can best cope with the observed high rates of linguistic change associated with divergent cultural evolution following human migration. Thus, rather than genetic adaptations for specific aspects of language, such as recursion, the coevolution of genes and fast-changing linguistic structure provides the biological basis for linguistic diversity. Only biological adaptations for flexible learning combined with cultural evolution can explain how each child has the potential to learn any human language.

]]>
<![CDATA[Turning Text into Research Networks: Information Retrieval and Computational Ontologies in the Creation of Scientific Databases]]> https://www.researchpad.co/article/5989d9f8ab0ee8fa60b70df2

Background

Web-based, free-text documents on science and technology have been growing rapidly on the web. However, most of these documents are not immediately processable by computers, which slows down the acquisition of useful information. Computational ontologies might represent a possible solution by enabling semantically machine-readable data sets, but the process of ontology creation, instantiation and maintenance is still based on manual methodologies and is thus time- and cost-intensive.

Method

We focused on a large corpus containing information on researchers, research fields, and institutions. We based our strategy on traditional entity recognition, social computing and correlation. We devised a semi-automatic approach for the recognition, correlation and extraction of named entities and relations from textual documents, which are then used to create, instantiate, and maintain an ontology.

Results

We present a prototype demonstrating the applicability of the proposed strategy, along with a case study describing how direct and indirect relations can be extracted from academic and professional activities registered in a database of curriculum vitae in free-text format. We present evidence that this system can identify entities to assist in the process of knowledge extraction and representation to support ontology maintenance. We also demonstrate the extraction of relationships among ontology classes and their instances.

Conclusion

We have demonstrated that our system can convert research information in free-text format into a database with a semantic structure. Future studies should test this system on the growing amount of free-text information available at the institutional and national levels.

]]>
<![CDATA[pubmed2ensembl: A Resource for Mining the Biological Literature on Genes]]> https://www.researchpad.co/article/5989d9daab0ee8fa60b675c1

Background

The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and the publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions.

Methodology/Principal Findings

To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.

Conclusion/Significance

By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

]]>
<![CDATA[Beyond Captions: Linking Figures with Abstract Sentences in Biomedical Articles]]> https://www.researchpad.co/article/5989da24ab0ee8fa60b8013e

Although figures in scientific articles have high information content and concisely communicate many key research findings, they are currently underutilized by literature search and retrieval systems. Many systems ignore figures, and those that do not typically consider only caption text. This study describes and evaluates a fully automated approach for associating figures in the body of a biomedical article with sentences in its abstract. We use supervised methods to learn probabilistic language models, hidden Markov models, and conditional random fields for predicting associations between abstract sentences and figures. Three kinds of evidence are used: text in abstract sentences and figures, relative positions of sentences and figures, and the patterns of sentence/figure associations across an article. Each information source is shown to have predictive value, and models that use all kinds of evidence are more accurate than models that do not. Our most accurate method has an F1-score of 69% on a cross-validation experiment, is competitive with the accuracy of human experts, has significantly better predictive accuracy than state-of-the-art methods, and enables users to access figures associated with an abstract sentence with an average of 1.82 fewer mouse clicks. A user evaluation shows that human users find our system beneficial. The system is available at http://FigureItOut.askHERMES.org.
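The textual-evidence component alone can be illustrated with a word-overlap score between an abstract sentence and candidate figure captions. The captions and sentence below are invented, and the paper combines this kind of signal with positional and article-level evidence in probabilistic models rather than using raw overlap:

```python
# Illustrative sketch of the shared-text evidence only: score each figure
# caption against an abstract sentence by Jaccard word overlap and link
# the sentence to the best-scoring figure.
def overlap_score(sentence, caption):
    s = set(sentence.lower().split())
    c = set(caption.lower().split())
    return len(s & c) / len(s | c)   # Jaccard similarity

def best_figure(sentence, captions):
    return max(captions, key=lambda name: overlap_score(sentence, captions[name]))

captions = {
    "Figure 1": "kinase expression levels across cell lines",
    "Figure 2": "survival curves for treated and control mice",
}
link = best_figure("treated mice showed longer survival", captions)
```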

]]>
<![CDATA[PepeSearch: Semantic Data for the Masses]]> https://www.researchpad.co/article/5989dac3ab0ee8fa60bb1880

With the emergence of the Web of Data, there is a need of tools for searching and exploring the growing amount of semantic data. Unfortunately, such tools are scarce and typically require knowledge of SPARQL/RDF. We propose here PepeSearch, a portable tool for searching semantic datasets devised for mainstream users. PepeSearch offers a multi-class search form automatically constructed from a SPARQL endpoint. We have tested PepeSearch with 15 participants searching a Linked Open Data version of the Norwegian Register of Business Enterprises for non-trivial challenges. Retrieval performance was encouragingly high and usability ratings were also very positive, thus suggesting that PepeSearch is effective for searching semantic datasets by mainstream users. We also assessed its portability by configuring PepeSearch to query other SPARQL endpoints.

]]>
<![CDATA[Interactive Language Learning by Robots: The Transition from Babbling to Word Forms]]> https://www.researchpad.co/article/5989db09ab0ee8fa60bc973c

The advent of humanoid robots has enabled a new approach to investigating the acquisition of language, and we report on the development of robots able to acquire rudimentary linguistic skills. Our work focuses on early stages analogous to some characteristics of a human child of about 6 to 14 months, the transition from babbling to first word forms. We investigate one mechanism among many that may contribute to this process, a key factor being the sensitivity of learners to the statistical distribution of linguistic elements. As well as being necessary for learning word meanings, the acquisition of anchor word forms facilitates the segmentation of an acoustic stream through other mechanisms. In our experiments some salient one-syllable word forms are learnt by a humanoid robot in real-time interactions with naive participants. Words emerge from random syllabic babble through a learning process based on a dialogue between the robot and the human participant, whose speech is perceived by the robot as a stream of phonemes. Numerous ways of representing the speech as syllabic segments are possible. Furthermore, the pronunciation of many words in spontaneous speech is variable. However, in line with research elsewhere, we observe that salient content words are more likely than function words to have consistent canonical representations; thus their relative frequency increases, as does their influence on the learner. Variable pronunciation may contribute to early word form acquisition. The importance of contingent interaction in real-time between teacher and learner is reflected by a reinforcement process, with variable success. The examination of individual cases may be more informative than group results. Nevertheless, word forms are usually produced by the robot after a few minutes of dialogue, employing a simple, real-time, frequency dependent mechanism. This work shows the potential of human-robot interaction systems in studies of the dynamics of early language acquisition.
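A frequency-dependent mechanism of the kind described can be sketched as counting the syllable forms heard across interactions and keeping those that recur often enough. The forms, counts and threshold here are hypothetical, not the robot's actual learning rule:

```python
# Sketch: content words recur in a consistent canonical form, so their
# relative frequency rises above variable babble and function forms.
from collections import Counter

def salient_forms(heard_syllables, threshold):
    """Return forms whose relative frequency meets the threshold."""
    counts = Counter(heard_syllables)
    total = sum(counts.values())
    return [form for form, n in counts.most_common() if n / total >= threshold]

# Hypothetical stream heard across a few minutes of dialogue.
heard = ["ball", "ba", "ball", "duh", "ball", "teh", "ball", "ba"]
learned = salient_forms(heard, threshold=0.4)
```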

]]>
<![CDATA[The Global Burden of Journal Peer Review in the Biomedical Literature: Strong Imbalance in the Collective Enterprise]]> https://www.researchpad.co/article/5989dafdab0ee8fa60bc53ae

The growth in scientific production may threaten the capacity of the scientific community to handle the ever-increasing demand for peer review of scientific publications. There is little evidence regarding the sustainability of the peer-review system and how the scientific community copes with the burden it poses. We used mathematical modeling to estimate the overall quantitative annual demand for peer review and the supply in biomedical research. The modeling was informed by empirical data from various sources in the biomedical domain, including all articles indexed in MEDLINE. We found that for 2015, across a range of scenarios, the supply exceeded the demand for reviewers and reviews by 15% to 249%. However, 20% of the researchers performed 69% to 94% of the reviews. Among researchers actually contributing to peer review, 70% dedicated 1% or less of their research work-time to peer review while 5% dedicated 13% or more of it. An estimated 63.4 million hours were devoted to peer review in 2015, among which 18.9 million hours were provided by the top 5% contributing reviewers. Our results suggest that the system is sustainable in terms of volume but reveal a considerable imbalance in the distribution of the peer-review effort across the scientific community. Finally, various individual interactions between authors, editors and reviewers may reduce to some extent the number of reviewers who are available to editors at any point.
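The supply/demand comparison at the heart of such a model reduces to simple accounting. All numbers below are made up for illustration and are not the paper's estimates or its full model, which works from empirical MEDLINE-derived distributions:

```python
# Back-of-the-envelope sketch of peer-review supply vs. demand accounting.
def review_demand(articles_per_year, reviews_per_article):
    return articles_per_year * reviews_per_article

def review_supply(researchers, reviews_per_researcher):
    return researchers * reviews_per_researcher

# Hypothetical inputs for one scenario.
demand = review_demand(articles_per_year=1_200_000, reviews_per_article=3)
supply = review_supply(researchers=2_000_000, reviews_per_researcher=2.5)
surplus_pct = 100 * (supply - demand) / demand
```

The paper's finding is that aggregate surplus of this kind can coexist with a heavy skew, since a small fraction of reviewers supplies most of the reviews.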

]]>
<![CDATA[Active Semi-Supervised Learning Method with Hybrid Deep Belief Networks]]> https://www.researchpad.co/article/5989da09ab0ee8fa60b76d33

In this paper, we develop a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD) to address the semi-supervised sentiment classification problem with deep learning. First, we construct the first several hidden layers using restricted Boltzmann machines (RBM), which can quickly reduce the dimension of the reviews and abstract their information. Second, we construct the subsequent hidden layers using convolutional restricted Boltzmann machines (CRBM), which can abstract the information of reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent-based supervised learning with an exponential loss function. Finally, an active learning method is incorporated into the proposed deep architecture. We ran experiments on five sentiment classification datasets and show that AHD is competitive with previous semi-supervised learning algorithms. Experiments are also conducted to verify the effectiveness of the proposed method with different numbers of labeled and unlabeled reviews.

]]>
<![CDATA[Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry]]> https://www.researchpad.co/article/5989da00ab0ee8fa60b73d70

Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser in the chemistry domain, OSCAR, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-&-drop mechanism of the graphical user interface of U-Compare. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, eliminating the noise generated by tokenisation leads to slightly better named entity recognition (NER) accuracy. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, compared with 84.23% for OSCAR.

]]>
<![CDATA[Extraction of Temporal Networks from Term Co-Occurrences in Online Textual Sources]]> https://www.researchpad.co/article/5989db08ab0ee8fa60bc94f8

A stream of unstructured news can be a valuable source of hidden relations between different entities, such as financial institutions, countries, or persons. We present an approach to continuously collect online news, recognize relevant entities in them, and extract time-varying networks. The nodes of the network are the entities, and the links are their co-occurrences. We present a method to estimate the significance of co-occurrences, and a benchmark model against which their robustness is evaluated. The approach is applied to a large set of financial news collected over a period of two years. The entities we consider are 50 countries which issue sovereign bonds and which are, in turn, insured by Credit Default Swaps (CDS). We compare the country co-occurrence networks to the CDS networks constructed from the correlations between the CDS. The results show a relatively small but significant overlap between the networks extracted from the news and those from the CDS correlations.
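The network-extraction step, before any significance testing, amounts to counting how often pairs of entities appear in the same document. The toy documents below stand in for the recognized-entity output of a real news pipeline:

```python
# Minimal co-occurrence network sketch: nodes are entities, and an edge's
# weight counts the documents in which both entities appear together.
from itertools import combinations
from collections import Counter

def cooccurrence_network(docs_entities):
    edges = Counter()
    for entities in docs_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1   # canonical (sorted) pair as edge key
    return edges

# Hypothetical entity lists recognized in three news items.
news = [
    ["Greece", "Germany", "ECB"],
    ["Greece", "Germany"],
    ["Spain", "ECB"],
]
net = cooccurrence_network(news)
```

The paper's contribution lies in what comes next: estimating which of these raw counts are significant relative to a benchmark model, so that frequent-but-uninformative pairings are filtered out.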

]]>
<![CDATA[Word Boundaries Affect Visual Attention in Chinese Reading]]> https://www.researchpad.co/article/5989da4aab0ee8fa60b8c91f

In two experiments, we explored attention deployment during the reading of Chinese words using a probe detection task. In both experiments, Chinese readers briefly saw four simplified Chinese characters, and then a probe was presented at one of the character positions. The four characters constituted either one word or two words of two characters each. Reaction time was shorter when the probe was at the character 2 position than at the character 3 position in the two-word condition, but not in the one-word condition. In Experiment 2, there were more trials and the materials were more carefully controlled, and the results replicated those of Experiment 1. These results suggest that word boundary information affects attentional deployment in Chinese reading.

]]>
<![CDATA[Determinants of Smoking and Quitting in HIV-Infected Individuals]]> https://www.researchpad.co/article/5989da2bab0ee8fa60b825ae

Background

Cigarette smoking is widespread among HIV-infected patients, who confront increased risk of smoking-related co-morbidities. The effects of HIV infection and HIV-related variables on smoking and smoking cessation are incompletely understood. We investigated the correlates of smoking and quitting in an HIV-infected cohort using a validated natural language processor to determine smoking status.

Method

We developed and validated an algorithm using natural language processing (NLP) to ascertain smoking status from electronic health record data. The algorithm was applied to records for a cohort of 3487 HIV-infected patients from a large health care system in Boston, USA, and 9446 uninfected control patients matched 3:1 on age, gender, race and clinical encounters. NLP was used to identify and classify smoking-related portions of free-text notes. These classifications were combined into patient-year smoking status and used to classify patients as ever versus never smokers and current smokers versus non-smokers. Generalized linear models were used to assess associations of HIV with three outcomes: ever smoking, current smoking, and current smoking in analyses limited to ever smokers (persistent smoking), while adjusting for demographics, cardiovascular risk factors, and psychiatric illness. Analyses were repeated within the HIV cohort, with the addition of CD4 cell count and HIV viral load to assess associations of these HIV-related factors with the smoking outcomes.
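The roll-up from snippet-level classifications to a patient-year status can be sketched with a deliberately crude keyword classifier. The validated NLP system described here is far more sophisticated; the keywords and precedence rule below are purely illustrative assumptions:

```python
# Highly simplified stand-in for the validated NLP pipeline: classify
# smoking-related note snippets, then roll snippet labels up to a
# patient-year status with a simple precedence rule.
def classify_snippet(text):
    t = text.lower()
    if "never smoked" in t or "denies smoking" in t:
        return "never"
    if "quit smoking" in t or "former smoker" in t:
        return "former"
    if "smok" in t:
        return "current"
    return "unknown"

def patient_year_status(snippets):
    labels = [classify_snippet(s) for s in snippets]
    for status in ("current", "former", "never"):   # precedence rule
        if status in labels:
            return status
    return "unknown"

status = patient_year_status(
    ["Patient reports smoking 1 pack/day.", "Counseled on cessation."]
)
```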

Results

Using the natural language processing algorithm to assign annual smoking status yielded a sensitivity of 92.4%, a specificity of 86.2%, and an AUC of 0.89 (95% confidence interval [CI] 0.88–0.91). Ever and current smoking were more common in HIV-infected patients than controls (54% vs. 44% and 42% vs. 30%, respectively, both P<0.001). In multivariate models, HIV was independently associated with ever smoking (adjusted rate ratio [ARR] 1.18, 95% CI 1.13–1.24, P<0.001), current smoking (ARR 1.33, 95% CI 1.25–1.40, P<0.001), and persistent smoking (ARR 1.11, 95% CI 1.07–1.15, P<0.001). Within the HIV cohort, having a detectable HIV RNA was significantly associated with all three smoking outcomes.

Conclusions

HIV was independently associated with both smoking and not quitting smoking, using a novel algorithm to ascertain smoking status from electronic health record data and accounting for multiple confounding clinical factors. Further research is needed to identify HIV-related barriers to smoking cessation and develop aggressive interventions specific to HIV-infected patients.

]]>
<![CDATA[PubMedPortable: A Framework for Supporting the Development of Text Mining Applications]]> https://www.researchpad.co/article/5989da30ab0ee8fa60b841fa

Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full-text index on PubMed citations. It can be applied either to the complete PubMed data set or to an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.
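The combination of a relational store with a full-text index over citations can be sketched with Python's stdlib `sqlite3` and SQLite's FTS5 extension. PubMedPortable's actual schema and indexing stack differ; this is only a stand-in for the general pattern, and assumes an SQLite build with FTS5 compiled in:

```python
# Sketch: store citations relationally and query them through a
# full-text index, mirroring the database-plus-index pattern described.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE citations USING fts5(pmid, title, abstract)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?, ?)",
    [
        ("100001", "Aspirin and stroke", "Aspirin reduces stroke risk ..."),
        ("100002", "Gene expression in yeast", "Microarray analysis of ..."),
    ],
)
# Full-text query over title and abstract.
rows = conn.execute(
    "SELECT pmid FROM citations WHERE citations MATCH 'stroke'"
).fetchall()
```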

]]>