ResearchPad - natural-language-processing https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[The Language of Innovation]]> https://www.researchpad.co/article/elastic_article_10245 Predicting innovation is a peculiar problem in data science. By definition, an innovation is a never-seen-before event, leaving no room for traditional supervised learning approaches. Here we propose a strategy to address the problem in the context of innovative patents, by defining innovations as never-seen-before associations of technologies and exploiting self-supervised learning techniques. We think of the technological codes present in patents as a vocabulary and of the whole technological corpus as written in a specific, evolving language. We leverage this structure with techniques borrowed from Natural Language Processing by embedding technologies in a high-dimensional Euclidean space in which relative positions are representative of the learned semantics. Proximity in this space is an effective predictor of specific innovation events and outperforms a wide range of standard link-prediction metrics. The success of patented innovations follows complex dynamics characterized by different patterns, which we analyze in detail with specific examples. The methods proposed in this paper provide a completely new way of understanding and forecasting innovation, tackling it from a revealing perspective and opening interesting scenarios for a number of applications and further analytic approaches.
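
As a rough illustration of the embedding step (not the authors' implementation), the sketch below treats the technology codes listed on each patent as a "sentence", learns skip-gram embeddings with gensim, and uses cosine proximity to score a never-seen-before pair of codes; the toy patent corpus and all parameter values are assumptions.

```python
# Minimal sketch (not the authors' implementation): each patent's technology codes
# form a "sentence"; skip-gram embeddings are learned, and cosine proximity scores
# a candidate pair of codes that never co-occurred in the corpus.
from gensim.models import Word2Vec

# Hypothetical toy corpus: each inner list is the set of technology codes on one patent.
patents = [
    ["H01L", "G06F", "H04L"],
    ["G06F", "G06N", "H04L"],
    ["A61K", "C07D"],
    ["G06N", "H01L"],
]

model = Word2Vec(
    sentences=patents,
    vector_size=50,   # embedding dimension (gensim >= 4.0 API)
    window=10,        # large window: the order of codes on a patent is irrelevant
    min_count=1,
    sg=1,             # skip-gram
    epochs=50,
)

# Proximity of a pair of codes that never co-occurred, used as an innovation score.
print(model.wv.similarity("A61K", "G06N"))
```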

]]>
<![CDATA[Using case-level context to classify cancer pathology reports]]> https://www.researchpad.co/article/elastic_article_7869 Individual electronic health records (EHRs) and clinical reports are often part of a larger sequence—for example, a single patient may generate multiple reports over the trajectory of a disease. In applications such as cancer pathology reports, it is necessary not only to extract information from individual reports, but also to capture aggregate information regarding the entire cancer case based on case-level context from all reports in the sequence. In this paper, we introduce a simple modular add-on for capturing case-level context that is designed to be compatible with most existing deep learning architectures for text classification on individual reports. We test our approach on a corpus of 431,433 cancer pathology reports, and we show that incorporating case-level context significantly boosts classification accuracy across six classification tasks—site, subsite, laterality, histology, behavior, and grade. We expect that with minimal modifications, our add-on can be applied to a wide range of other clinical text-based tasks.
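
A minimal sketch of what such a case-level add-on could look like, assuming PyTorch and a base classifier that already produces one fixed-size vector per report; the module name, dimensions and the GRU-based aggregator are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch (assumed architecture): a case-level module that consumes one vector
# per report from any base text classifier and adds sequence context across all
# reports belonging to the same cancer case.
import torch
import torch.nn as nn

class CaseContextAddOn(nn.Module):
    def __init__(self, report_dim: int, context_dim: int, n_classes: int):
        super().__init__()
        self.gru = nn.GRU(report_dim, context_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * context_dim, n_classes)

    def forward(self, report_vectors):             # (batch, n_reports, report_dim)
        context, _ = self.gru(report_vectors)      # contextualised report representations
        return self.classifier(context)            # one prediction per report, with case context

# Toy usage: 2 cases, 5 reports each, 128-dim report vectors, 6-way task (e.g. histology).
addon = CaseContextAddOn(report_dim=128, context_dim=64, n_classes=6)
logits = addon(torch.randn(2, 5, 128))
print(logits.shape)  # torch.Size([2, 5, 6])
```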

]]>
<![CDATA[Beyond opinion classification: Extracting facts, opinions and experiences from health forums]]> https://www.researchpad.co/article/5c3fa56ad5eed0c484ca4115

Introduction

Surveys indicate that patients, particularly those suffering from chronic conditions, strongly benefit from the information found in social networks and online forums. One challenge in accessing online health information is to differentiate between factual and more subjective information. In this work, we evaluate the feasibility of exploiting lexical, syntactic, semantic, network-based and emotional properties of texts to automatically classify patient-generated content into three types: “experiences”, “facts” and “opinions”, using machine learning algorithms. In this context, our goal is to develop automatic methods that will make online health information more easily accessible and useful for patients, professionals and researchers.

Material and methods

We work with a set of 3000 posts to online health forums on breast cancer, Crohn's disease and various allergies. Each sentence in a post is manually labeled as “experience”, “fact” or “opinion”. Using these data, we train a support vector machine algorithm to perform the classification. The results are evaluated with a 10-fold cross-validation procedure.
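
A minimal sketch of this setup, assuming scikit-learn, a TF-IDF bag-of-words representation and a linear SVM; the example sentences and labels are invented stand-ins for the annotated corpus.

```python
# Minimal sketch (assumed feature set): a bag-of-words SVM for labelling forum sentences
# as "experience", "fact" or "opinion", evaluated with 10-fold cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the manually annotated sentences.
sentences = [
    "I was diagnosed with breast cancer two years ago.",
    "Tamoxifen is a selective estrogen receptor modulator.",
    "I think the side effects are worth it.",
] * 10
labels = ["experience", "fact", "opinion"] * 10

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
scores = cross_val_score(clf, sentences, labels, cv=10)
print(scores.mean())
```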

Results

Overall, we find that it is possible to predict the type of information contained in a forum post with very high accuracy (over 80 percent) using simple text representations such as word embeddings and bags of words. We also analyze more complex features, such as those based on network properties, word polarity and the verb tense of the sentences, and show that, when combined with the previous ones, they can further boost the results.

]]>
<![CDATA[Feature engineering for sentiment analysis in e-health forums]]> https://www.researchpad.co/article/5c099452d5eed0c4842aea35

Introduction

Exploiting information in health-related social media services is of great interest for patients, researchers and medical companies. The challenge, however, is to provide easy, quick and relevant access to the vast amount of information that is available. One step towards facilitating information access to online health data is opinion mining. Even though the classification of patient opinions into positive and negative has been tackled before, most works make use of machine learning methods and bags of words. Our first contribution is an extensive evaluation of different features, including lexical, syntactic, semantic, network-based, sentiment-based and word embedding features, to represent patient-authored texts for polarity classification. The second contribution of this work is the study of polar facts (i.e. objective information with polar connotations). Traditionally, the presence of polar facts has been neglected, and research in polarity classification has been restricted to opinionated texts. We demonstrate the existence and importance of polar facts for the polarity classification of health information.

Material and methods

We annotate a set of more than 3500 posts to online health forums on breast cancer, Crohn's disease and various allergies. Each sentence in a post is manually labeled as “experience”, “fact” or “opinion”, and as “positive”, “negative” or “neutral”. Using this data, we train different machine learning algorithms and compare traditional bag-of-words representations with word embeddings, in combination with lexical, syntactic, semantic, network-based and emotional properties of texts, to automatically classify patient-authored content as positive, negative or neutral. In addition, we experiment with a combination of textual and semantic representations by generating concept embeddings using the UMLS Metathesaurus.
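
A minimal sketch of the embedding-averaging idea, assuming a pre-trained word (or UMLS concept) vector table; the random embedding table, sentences and labels below are illustrative stand-ins only.

```python
# Minimal sketch (illustrative only): represent each sentence as the average of its
# word embeddings and train a classifier for positive / negative / neutral polarity.
# The embedding table is a random stand-in for pre-trained word or concept vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=100) for w in
              ["pain", "worse", "remission", "great", "biopsy", "scheduled"]}

def sentence_vector(text, dim=100):
    vecs = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

train_texts = ["pain worse", "remission great", "biopsy scheduled"]
train_labels = ["negative", "positive", "neutral"]

X = np.vstack([sentence_vector(t) for t in train_texts])
clf = LogisticRegression(max_iter=1000).fit(X, train_labels)
print(clf.predict([sentence_vector("pain worse")]))
```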

Results

We reach two main results: first, we find that it is possible to predict the polarity of patient-authored content with very high accuracy (≈ 70 percent) using word embeddings, and that this considerably outperforms more traditional representations like bags of words; and second, when dealing with medical information, negative and positive facts (i.e. objective information) are nearly as frequent as negative and positive opinions and experiences (i.e. subjective information), and they are crucial for polarity classification.

]]>
<![CDATA[Turning Text into Research Networks: Information Retrieval and Computational Ontologies in the Creation of Scientific Databases]]> https://www.researchpad.co/article/5989d9f8ab0ee8fa60b70df2

Background

Web-based, free-text documents on science and technology have been growing rapidly on the web. However, most of these documents are not immediately processable by computers, which slows down the acquisition of useful information. Computational ontologies might represent a possible solution by enabling semantically machine-readable data sets. However, the process of ontology creation, instantiation and maintenance is still based on manual methodologies and is thus time- and cost-intensive.

Method

We focused on a large corpus containing information on researchers, research fields, and institutions. We based our strategy on traditional entity recognition, social computing and correlation. We devised a semi-automatic approach for the recognition, correlation and extraction of named entities and relations from textual documents, which are then used to create, instantiate, and maintain an ontology.
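
A minimal sketch of the entity-recognition and relation-candidate step, assuming spaCy and its small English model as a stand-in for the recognisers actually used; the CV fragment is invented.

```python
# Minimal sketch (assumes the spaCy "en_core_web_sm" model is installed): recognise
# researcher and institution mentions in free-text CV fragments and record their
# co-occurrence as candidate relations for ontology instantiation.
import spacy
from itertools import product

nlp = spacy.load("en_core_web_sm")
text = "Maria Silva obtained her PhD at the University of Campinas and later joined Fiocruz."

doc = nlp(text)
people = [ent.text for ent in doc.ents if ent.label_ == "PERSON"]
orgs = [ent.text for ent in doc.ents if ent.label_ == "ORG"]

# Candidate (researcher, institution) relations from sentence-level co-occurrence.
candidate_relations = list(product(people, orgs))
print(candidate_relations)
```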

Results

We present a prototype demonstrating the applicability of the proposed strategy, along with a case study describing how direct and indirect relations can be extracted from academic and professional activities registered in a database of curriculum vitae in free-text format. We present evidence that this system can identify entities and thus assist in the process of knowledge extraction and representation in support of ontology maintenance. We also demonstrate the extraction of relationships among ontology classes and their instances.

Conclusion

We have demonstrated that our system can be used to convert research information in free-text format into a database with a semantic structure. Future studies should test this system on the growing amount of free-text information available at the institutional and national levels.

]]>
<![CDATA[pubmed2ensembl: A Resource for Mining the Biological Literature on Genes]]> https://www.researchpad.co/article/5989d9daab0ee8fa60b675c1

Background

The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and the publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions.

Methodology/Principal Findings

To overcome the lack of integration between genomic data and the biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We combine several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text mining of MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.

Conclusion/Significance

By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.

]]>
<![CDATA[Beyond Captions: Linking Figures with Abstract Sentences in Biomedical Articles]]> https://www.researchpad.co/article/5989da24ab0ee8fa60b8013e

Although figures in scientific articles have high information content and concisely communicate many key research findings, they are currently underutilized by literature search and retrieval systems. Many systems ignore figures, and those that do not typically consider only caption text. This study describes and evaluates a fully automated approach for associating figures in the body of a biomedical article with sentences in its abstract. We use supervised methods to learn probabilistic language models, hidden Markov models, and conditional random fields for predicting associations between abstract sentences and figures. Three kinds of evidence are used: text in abstract sentences and figures, relative positions of sentences and figures, and the patterns of sentence/figure associations across an article. Each information source is shown to have predictive value, and models that use all kinds of evidence are more accurate than models that do not. Our most accurate method has an F-score of 69% in a cross-validation experiment, is competitive with the accuracy of human experts, has significantly better predictive accuracy than state-of-the-art methods and enables users to access figures associated with an abstract sentence with an average of 1.82 fewer mouse clicks. A user evaluation shows that human users find our system beneficial. The system is available at http://FigureItOut.askHERMES.org.
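
For illustration, the sketch below implements a much simpler baseline than the probabilistic models in the paper: it links each figure caption to the most similar abstract sentence by TF-IDF cosine similarity. The example sentences and captions are invented.

```python
# Minimal sketch (a simpler baseline than the paper's HMM/CRF models): score each
# abstract sentence against each figure caption by TF-IDF cosine similarity and
# link every figure to its best-matching sentence.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

abstract_sentences = [
    "We identified 42 differentially expressed genes.",
    "Protein levels were measured by western blot.",
]
figure_captions = [
    "Figure 1. Western blot analysis of protein expression.",
    "Figure 2. Heatmap of differentially expressed genes.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(abstract_sentences + figure_captions)
sims = cosine_similarity(tfidf[len(abstract_sentences):], tfidf[:len(abstract_sentences)])

for i, caption in enumerate(figure_captions):
    best = int(np.argmax(sims[i]))
    print(caption, "->", abstract_sentences[best])
```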

]]>
<![CDATA[Interactive Language Learning by Robots: The Transition from Babbling to Word Forms]]> https://www.researchpad.co/article/5989db09ab0ee8fa60bc973c

The advent of humanoid robots has enabled a new approach to investigating the acquisition of language, and we report on the development of robots able to acquire rudimentary linguistic skills. Our work focuses on early stages analogous to some characteristics of a human child of about 6 to 14 months, the transition from babbling to first word forms. We investigate one mechanism among many that may contribute to this process, a key factor being the sensitivity of learners to the statistical distribution of linguistic elements. As well as being necessary for learning word meanings, the acquisition of anchor word forms facilitates the segmentation of an acoustic stream through other mechanisms. In our experiments some salient one-syllable word forms are learnt by a humanoid robot in real-time interactions with naive participants. Words emerge from random syllabic babble through a learning process based on a dialogue between the robot and the human participant, whose speech is perceived by the robot as a stream of phonemes. Numerous ways of representing the speech as syllabic segments are possible. Furthermore, the pronunciation of many words in spontaneous speech is variable. However, in line with research elsewhere, we observe that salient content words are more likely than function words to have consistent canonical representations; thus their relative frequency increases, as does their influence on the learner. Variable pronunciation may contribute to early word form acquisition. The importance of contingent interaction in real-time between teacher and learner is reflected by a reinforcement process, with variable success. The examination of individual cases may be more informative than group results. Nevertheless, word forms are usually produced by the robot after a few minutes of dialogue, employing a simple, real-time, frequency dependent mechanism. This work shows the potential of human-robot interaction systems in studies of the dynamics of early language acquisition.
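
A minimal sketch of a frequency-dependent selection mechanism of the kind described, not the robot's actual architecture; the syllable inventory, weights and threshold are arbitrary assumptions.

```python
# Minimal sketch (not the robot's learning architecture): word forms heard repeatedly
# and consistently in the teacher's speech cross a frequency threshold and become
# the learner's preferred productions, while inconsistent babble stays rare.
from collections import Counter
import random

random.seed(0)
babble_inventory = ["ba", "du", "ki", "mo"]

heard = Counter()
for _ in range(200):
    # The naive participant keeps naming salient objects ("red", "box"),
    # interleaved with other, less consistent speech.
    heard.update(random.choices(["red", "box"] + babble_inventory,
                                weights=[5, 5, 1, 1, 1, 1]))

# The learner produces the forms whose heard frequency crosses a relative threshold.
total = sum(heard.values())
word_forms = [syll for syll, count in heard.items() if count > 0.15 * total]
print(word_forms)
```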

]]>
<![CDATA[The Global Burden of Journal Peer Review in the Biomedical Literature: Strong Imbalance in the Collective Enterprise]]> https://www.researchpad.co/article/5989dafdab0ee8fa60bc53ae

The growth in scientific production may threaten the capacity of the scientific community to handle the ever-increasing demand for peer review of scientific publications. There is little evidence regarding the sustainability of the peer-review system and how the scientific community copes with the burden it poses. We used mathematical modeling to estimate the overall annual demand for peer review and the corresponding supply in biomedical research. The modeling was informed by empirical data from various sources in the biomedical domain, including all articles indexed in MEDLINE. We found that for 2015, across a range of scenarios, the supply exceeded the demand for reviewers and reviews by 15% to 249%. However, 20% of the researchers performed 69% to 94% of the reviews. Among researchers actually contributing to peer review, 70% dedicated 1% or less of their research work-time to peer review, while 5% dedicated 13% or more of it. An estimated 63.4 million hours were devoted to peer review in 2015, of which 18.9 million hours were provided by the top 5% of contributing reviewers. Our results suggest that the system is sustainable in terms of volume but reveal a considerable imbalance in the distribution of the peer-review effort across the scientific community. Finally, various individual interactions between authors, editors and reviewers may reduce to some extent the number of reviewers who are available to editors at any point.

]]>
<![CDATA[Active Semi-Supervised Learning Method with Hybrid Deep Belief Networks]]> https://www.researchpad.co/article/5989da09ab0ee8fa60b76d33

In this paper, we develop a novel semi-supervised learning algorithm called active hybrid deep belief networks (AHD) to address the semi-supervised sentiment classification problem with deep learning. First, we construct the first several hidden layers using restricted Boltzmann machines (RBM), which can quickly reduce the dimension of the reviews and abstract their information. Second, we construct the subsequent hidden layers using convolutional restricted Boltzmann machines (CRBM), which can abstract the information in the reviews effectively. Third, the constructed deep architecture is fine-tuned by gradient-descent-based supervised learning with an exponential loss function. Finally, an active learning method is combined with the proposed deep architecture. We conducted several experiments on five sentiment classification datasets and show that AHD is competitive with previous semi-supervised learning algorithms. Experiments are also conducted to verify the effectiveness of the proposed method with different numbers of labeled and unlabeled reviews.
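
A minimal, shallow analogue of the pre-training-plus-fine-tuning idea, assuming scikit-learn; it uses a single BernoulliRBM feeding a logistic-regression output layer and omits the CRBM layers, exponential loss and active learning of the full AHD architecture. The data are random stand-ins.

```python
# Minimal sketch (a shallow analogue, not the full AHD architecture): unsupervised RBM
# feature learning on binary review vectors followed by a supervised output layer.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = (rng.random((200, 500)) > 0.95).astype(float)   # toy binary bag-of-words review vectors
y = rng.integers(0, 2, size=200)                     # toy sentiment labels

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print(model.score(X, y))
```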

]]>
<![CDATA[Using Workflows to Explore and Optimise Named Entity Recognition for Chemistry]]> https://www.researchpad.co/article/5989da00ab0ee8fa60b73d70

Chemistry text mining tools should be interoperable and adaptable regardless of system-level implementation, installation or even programming issues. We aim to abstract the functionality of these tools from the underlying implementation via reconfigurable workflows for automatically identifying chemical names. To achieve this, we refactored an established named entity recogniser in the chemistry domain, OSCAR, and studied the impact of each component on the net performance. We developed two reconfigurable workflows from OSCAR using an interoperable text mining framework, U-Compare. These workflows can be altered using the drag-and-drop mechanism of the graphical user interface of U-Compare. They also provide a platform to study the relationship between text mining components such as tokenisation and named entity recognition (using maximum entropy Markov model (MEMM) and pattern recognition based classifiers). Results indicate that, for chemistry in particular, tokenisation techniques that eliminate noise lead to slightly better named entity recognition (NER) accuracy than others. Poor tokenisation translates into poorer input to the classifier components, which in turn leads to an increase in Type I or Type II errors, thus lowering the overall performance. On the Sciborg corpus, the workflow-based system, which uses a new tokeniser whilst retaining the same MEMM component, increases the F-score from 82.35% to 84.44%. On the PubMed corpus, it recorded an F-score of 84.84%, compared with 84.23% for OSCAR.
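
A minimal sketch of the kind of tokenisation difference the workflows expose: a whitespace tokeniser keeps a chemical name intact, whereas an aggressive punctuation splitter fragments it, changing the input seen by the downstream NER classifier. Both tokenisers are illustrative, not OSCAR's.

```python
# Minimal sketch: two tokenisers producing different classifier inputs for the same text.
import re

text = "Treatment with 2-(4-chlorophenyl)acetic acid was repeated."

# Tokeniser A: split on whitespace only, keeping the chemical name as one token.
tokens_a = text.split()

# Tokeniser B: split punctuation off aggressively, fragmenting the chemical name.
tokens_b = re.findall(r"\w+|[^\w\s]", text)

print(tokens_a)
print(tokens_b)
```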

]]>
<![CDATA[Extraction of Temporal Networks from Term Co-Occurrences in Online Textual Sources]]> https://www.researchpad.co/article/5989db08ab0ee8fa60bc94f8

A stream of unstructured news can be a valuable source of hidden relations between different entities, such as financial institutions, countries, or persons. We present an approach to continuously collect online news, recognize relevant entities in them, and extract time-varying networks. The nodes of the network are the entities, and the links are their co-occurrences. We present a method to estimate the significance of co-occurrences, and a benchmark model against which their robustness is evaluated. The approach is applied to a large set of financial news collected over a period of two years. The entities we consider are 50 countries that issue sovereign bonds, which are in turn insured by Credit Default Swaps (CDS). We compare the country co-occurrence networks to the CDS networks constructed from the correlations between the CDS. The results show a relatively small, but significant, overlap between the networks extracted from the news and those from the CDS correlations.
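
A minimal sketch of the co-occurrence network construction, assuming networkx; it omits the significance estimation and benchmark model, and the per-document entity sets are invented.

```python
# Minimal sketch (illustrative, not the paper's significance test): build a weighted
# co-occurrence network of countries mentioned together in news documents.
import networkx as nx
from itertools import combinations

# Hypothetical output of the entity recogniser: countries mentioned in each news item.
news_entities = [
    {"Greece", "Germany", "Portugal"},
    {"Greece", "Portugal"},
    {"Spain", "Italy"},
    {"Greece", "Germany"},
]

G = nx.Graph()
for entities in news_entities:
    for u, v in combinations(sorted(entities), 2):
        w = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

print(sorted(G.edges(data="weight"), key=lambda e: -e[2]))
```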

]]>
<![CDATA[Word Boundaries Affect Visual Attention in Chinese Reading]]> https://www.researchpad.co/article/5989da4aab0ee8fa60b8c91f

In two experiments, we explored attention deployment during the reading of Chinese words using a probe detection task. In both experiments, Chinese readers briefly saw four simplified Chinese characters, and then a probe was presented at one of the character positions. The four characters constituted either one word or two words of two characters each. Reaction time was shorter when the probe was at the character 2 position than at the character 3 position in the two-word condition, but not in the one-word condition. In Experiment 2, there were more trials and the materials were more carefully controlled, and the results replicated those of Experiment 1. These results suggest that word boundary information affects attentional deployment in Chinese reading.

]]>
<![CDATA[Determinants of Smoking and Quitting in HIV-Infected Individuals]]> https://www.researchpad.co/article/5989da2bab0ee8fa60b825ae

Background

Cigarette smoking is widespread among HIV-infected patients, who face an increased risk of smoking-related co-morbidities. The effects of HIV infection and HIV-related variables on smoking and smoking cessation are incompletely understood. We investigated the correlates of smoking and quitting in an HIV-infected cohort, using a validated natural language processor to determine smoking status.

Method

We developed and validated an algorithm using natural language processing (NLP) to ascertain smoking status from electronic health record data. The algorithm was applied to records for a cohort of 3487 HIV-infected patients from a large health care system in Boston, USA, and 9446 uninfected control patients matched 3:1 on age, gender, race and clinical encounters. NLP was used to identify and classify smoking-related portions of free-text notes. These classifications were combined into a patient-year smoking status and used to classify patients as ever versus never smokers and as current smokers versus non-smokers. Generalized linear models were used to assess associations of HIV with three outcomes: ever smoking, current smoking, and current smoking in analyses limited to ever smokers (persistent smoking), while adjusting for demographics, cardiovascular risk factors, and psychiatric illness. Analyses were repeated within the HIV cohort, with the addition of CD4 cell count and HIV viral load, to assess associations of these HIV-related factors with the smoking outcomes.
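
For illustration only, a crude keyword rule of the kind that NLP systems refine, not the validated algorithm from this study; the patterns and example notes are assumptions.

```python
# Minimal sketch (a crude keyword rule, not the study's validated NLP algorithm):
# classify smoking-related snippets from free-text notes into current / former / never.
import re

def smoking_status(snippet: str) -> str:
    s = snippet.lower()
    if re.search(r"\b(never smoked|non-?smoker|denies smoking)\b", s):
        return "never"
    if re.search(r"\b(quit|former smoker|ex-?smoker|stopped smoking)\b", s):
        return "former"
    if re.search(r"\b(smokes|current smoker|cigarettes per day|pack[- ]?years?)\b", s):
        return "current"
    return "unknown"

for note in ["Patient denies smoking.", "Quit smoking in 2009.", "Smokes one pack per day."]:
    print(note, "->", smoking_status(note))
```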

Results

Using the natural language processing algorithm to assign annual smoking status yielded sensitivity of 92.4, specificity of 86.2, and AUC of 0.89 (95% confidence interval [CI] 0.88–0.91). Ever and current smoking were more common in HIV-infected patients than controls (54% vs. 44% and 42% vs. 30%, respectively, both P<0.001). In multivariate models HIV was independently associated with ever smoking (adjusted rate ratio [ARR] 1.18, 95% CI 1.13–1.24, P <0.001), current smoking (ARR 1.33, 95% CI 1.25–1.40, P<0.001), and persistent smoking (ARR 1.11, 95% CI 1.07–1.15, P<0.001). Within the HIV cohort, having a detectable HIV RNA was significantly associated with all three smoking outcomes.

Conclusions

HIV was independently associated with both smoking and not quitting smoking, using a novel algorithm to ascertain smoking status from electronic health record data and accounting for multiple confounding clinical factors. Further research is needed to identify HIV-related barriers to smoking cessation and develop aggressive interventions specific to HIV-infected patients.

]]>
<![CDATA[PubMedPortable: A Framework for Supporting the Development of Text Mining Applications]]> https://www.researchpad.co/article/5989da30ab0ee8fa60b841fa

Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.
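
A minimal standard-library sketch of the general idea (a relational store plus a full-text index over PubMed citations), not PubMedPortable's actual schema or API; the input file name and the query term are assumptions.

```python
# Minimal sketch (not PubMedPortable's schema or API): parse a downloaded PubMed XML
# file and index titles and abstracts in an SQLite FTS5 full-text table.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("pubmed_demo.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS citations USING fts5(pmid, title, abstract)")

tree = ET.parse("pubmed_sample.xml")  # hypothetical downloaded PubMed XML file
for article in tree.getroot().iter("MedlineCitation"):
    pmid = article.findtext("PMID", default="")
    title = article.findtext(".//ArticleTitle", default="")
    abstract = " ".join(t.text or "" for t in article.iter("AbstractText"))
    conn.execute("INSERT INTO citations VALUES (?, ?, ?)", (pmid, title, abstract))
conn.commit()

# Full-text query for a disease-specific subset of citations.
for row in conn.execute("SELECT pmid, title FROM citations WHERE citations MATCH ?", ("melanoma",)):
    print(row)
```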

]]>
<![CDATA[Big Words, Halved Brains and Small Worlds: Complex Brain Networks of Figurative Language Comprehension]]> https://www.researchpad.co/article/5989dadaab0ee8fa60bb9668

Language comprehension is a complex task that involves a wide network of brain regions. We used topological measures to qualify and quantify the functional connectivity of the networks used under various comprehension conditions. To that end, we developed a technique to represent functional networks based on EEG recordings, taking advantage of their excellent time resolution in order to capture the fast processes that occur during language comprehension. Networks were created by searching for a specific causal relation between areas, the negative feedback loop, which is ubiquitous in many systems. This method is a simple way to construct directed graphs from event-related activity, which can then be analyzed topologically. Brain activity was recorded while subjects read expressions of various types and indicated whether they found them meaningful. Slightly different functional networks were obtained for the event-related activity evoked by each expression type. The differences reflect the special contribution of specific regions in each condition and the balance of hemispheric activity involved in comprehending different types of expressions, and are consistent with the literature in the field. Our results indicate that representing event-related brain activity as a network, using a simple temporal relation such as the negative feedback loop to indicate directional connectivity, is a viable approach that also yields new information about aspects not reflected in classical methods for investigating brain activity.

]]>
<![CDATA[On the Time Course of Vocal Emotion Recognition]]> https://www.researchpad.co/article/5989d9ecab0ee8fa60b6cd20

How quickly do listeners recognize emotions from a speaker's voice, and does the time course for recognition vary by emotion type? To address these questions, we adapted the auditory gating paradigm to estimate how much vocal information is needed for listeners to categorize five basic emotions (anger, disgust, fear, sadness, happiness) and neutral utterances produced by male and female speakers of English. Semantically-anomalous pseudo-utterances (e.g., The rivix jolled the silling) conveying each emotion were divided into seven gate intervals according to the number of syllables that listeners heard from sentence onset. Participants (n = 48) judged the emotional meaning of stimuli presented at each gate duration interval, in a successive, blocked presentation format. Analyses looked at how recognition of each emotion evolves as an utterance unfolds and estimated the “identification point” for each emotion. Results showed that anger, sadness, fear, and neutral expressions are recognized more accurately at short gate intervals than happiness, and particularly disgust; however, as speech unfolds, recognition of happiness improves significantly towards the end of the utterance (and fear is recognized more accurately than other emotions). When the gate associated with the emotion identification point of each stimulus was calculated, data indicated that fear (M = 517 ms), sadness (M = 576 ms), and neutral (M = 510 ms) expressions were identified from shorter acoustic events than the other emotions. These data reveal differences in the underlying time course for conscious recognition of basic emotions from vocal expressions, which should be accounted for in studies of emotional speech processing.

]]>
<![CDATA[Phenotyping for patient safety: algorithm development for electronic health record based automated adverse event and medical error detection in neonatal intensive care]]> https://www.researchpad.co/article/5bc15eee40307c11974ebd33

Background

Although electronic health records (EHRs) have the potential to provide a foundation for quality and safety algorithms, few studies have measured their impact on automated adverse event (AE) and medical error (ME) detection within the neonatal intensive care unit (NICU) environment.

Objective

This paper presents two phenotyping AE and ME detection algorithms (ie, IV infiltrations, narcotic medication oversedation and dosing errors) and describes manual annotation of airway management and medication/fluid AEs from NICU EHRs.

Methods

From 753 NICU patient EHRs from 2011, we developed two automatic AE/ME detection algorithms, and manually annotated 11 classes of AEs in 3263 clinical notes. Performance of the automatic AE/ME detection algorithms was compared to trigger tool and voluntary incident reporting results. AEs in clinical notes were double annotated and consensus achieved under neonatologist supervision. Sensitivity, positive predictive value (PPV), and specificity are reported.

Results

Twelve severe IV infiltrates were detected. The algorithm identified one more infiltrate than the trigger tool and eight more than incident reporting. One narcotic oversedation was detected demonstrating 100% agreement with the trigger tool. Additionally, 17 narcotic medication MEs were detected, an increase of 16 cases over voluntary incident reporting.

Conclusions

Automated AE/ME detection algorithms provide higher sensitivity and PPV than currently used trigger tools or voluntary incident-reporting systems, including identification of potential dosing and frequency errors that current methods are unequipped to detect.

]]>
<![CDATA[Twitter in the Cross Fire—The Use of Social Media in the Westgate Mall Terror Attack in Kenya]]> https://www.researchpad.co/article/5989d9e6ab0ee8fa60b6b2eb

In September 2013, an attack on the Westgate mall in Kenya led to a four-day siege, resulting in 67 fatalities and 175 wounded. During the crisis, Twitter became a crucial channel of communication between the government, emergency responders and the public, facilitating the emergency management of the event. The objectives of this paper are to present the main activities, use patterns and lessons learned from the use of social media in the crisis. Using TwitterMate, a system developed to collect, store and analyze tweets, the main hashtags generated by the crowd and specific Twitter accounts of individuals, emergency responders and NGOs were followed throughout the four-day siege. A total of 67,849 tweets were collected and analyzed. Four main categories of hashtags were identified: geographical locations, terror attack, social support and organizations. The abundance of Twitter accounts providing official information made it difficult to synchronize and follow the flow of information. Many organizations posted simultaneously, both from their managers' accounts and from the organizations' own accounts. Creating situational awareness was facilitated by information tweeted by the public. Threat assessment was updated through the information posted on social media. Security breaches led to the relay of sensitive data. At times, misinformation was only corrected after two days. Social media offer an accessible, widely available means for a bi-directional flow of information between the public and the authorities. In the crisis, all emergency responders used and leveraged social media networks for communicating both with the public and among themselves. A standard operating procedure should be developed to enable multiple responders to monitor, synchronize and integrate their social media feeds during emergencies. This would lead to better utilization and optimization of social media resources during crises, providing clear guidelines for communications and a hierarchy for dispersing information to the public and among responding organizations.
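
A minimal sketch of the hashtag extraction and category counting, not TwitterMate itself; the tweets and the category-to-hashtag mapping are illustrative assumptions.

```python
# Minimal sketch (illustrative, not TwitterMate): extract hashtags from collected tweet
# texts and count them by the four hashtag categories identified in the study.
import re
from collections import Counter

tweets = [
    "Security forces at #WestgateMall #Nairobi",
    "Blood donations needed #WeAreOne #KenyaRedCross",
    "#WestgateAttack updates from #Nairobi",
]

hashtags = Counter(tag.lower() for t in tweets for tag in re.findall(r"#\w+", t))

categories = {
    "geographical locations": {"#nairobi"},
    "terror attack": {"#westgatemall", "#westgateattack"},
    "social support": {"#weareone"},
    "organizations": {"#kenyaredcross"},
}
for name, tags in categories.items():
    print(name, sum(hashtags[tag] for tag in tags))
```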

]]>
<![CDATA[Integrating Various Resources for Gene Name Normalization]]> https://www.researchpad.co/article/5989d9ddab0ee8fa60b6822d

The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which utilizes the co-occurrence information from a large amount of unlabeled MEDLINE abstracts to enhance the feature representation of gene named entities. In the entity mapping stage, we combine exact-match and approximate-match strategies to establish links between gene names in the context and the EntrezGene database. For gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system is able to adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state of the art.
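
A minimal, standard-library sketch of the exact-plus-approximate dictionary matching step, not the full system; the small symbol-to-identifier dictionary and confidence values are illustrative assumptions.

```python
# Minimal sketch (standard library only, not the full system): combine exact and
# approximate dictionary matching to map gene mentions to Entrez Gene identifiers.
from difflib import get_close_matches

# Hypothetical dictionary of gene symbols / synonyms -> Entrez Gene IDs.
gene_dict = {"tp53": "7157", "brca1": "672", "egfr": "1956", "esr1": "2099"}

def normalize(mention: str):
    key = mention.lower()
    if key in gene_dict:                                          # exact match
        return gene_dict[key], 1.0
    close = get_close_matches(key, gene_dict, n=1, cutoff=0.8)    # approximate match
    return (gene_dict[close[0]], 0.8) if close else (None, 0.0)

for mention in ["TP53", "BRCA-1", "unknown gene"]:
    print(mention, "->", normalize(mention))
```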

]]>