ResearchPad - computer-science https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks <![CDATA[Crisis social media data labeled for storm-related information and toponym usage]]> https://www.researchpad.co/article/Ne0d39b21-7227-447c-8243-0b8f7d4f6b7e Social media provides citizens and officials with important sources of information during times of crisis. This data article makes available labeled, storm-related social media data collected over a six-hour period during a severe storm and F1 tornado that struck Central Pennsylvania on May 1st, 2017. Three datasets were collected from Twitter using location, keyword, and network filtering techniques, respectively. Only 2% of the 22,706 total tweets overlap among the datasets, providing researchers with a broader scope of information than normally available when collecting tweets using location (i.e., geotag-based) and keyword filtering alone or in combination during a crisis. Each data collection technique is described in detail, including network filtering which collects data from networks of social media users associated with a geographic area.

The datasets are manually labeled for information content and toponym usage. The 22,706 tweet IDs, dehydrated for privacy, are labeled for relevance (storm-related or off-topic) and for 19 types of storm-related information organized into six categories: infrastructure damage, service disruption, personal experience, weather updates, weather forecasts, and weather warnings. Data are also labeled for toponym usage (with or without toponyms), location (local, remote, and generic toponyms), and granularity (hyperlocal, municipal, and regional toponyms). The comprehensively labeled datasets give researchers opportunities to analyze crisis-related information behaviors and volunteered location information behaviors during a hyperlocal crisis event, as well as to develop and evaluate automated filtering, geolocation, and event detection techniques that can aid citizens and crisis responders.
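The reported 2% overlap can be recomputed from the tweet-ID lists once they are rehydrated. A minimal sketch, using made-up IDs in place of the real location-, keyword-, and network-filtered collections:

```python
from collections import Counter

# Hypothetical tweet-ID sets standing in for the three collections.
location_ids = {101, 102, 103, 104}
keyword_ids = {103, 105, 106, 107}
network_ids = {104, 108, 109, 110}

def overlap_percentage(*id_sets):
    """Percentage of distinct tweet IDs appearing in more than one collection."""
    counts = Counter(i for s in id_sets for i in s)
    total = len(set().union(*id_sets))
    shared = sum(1 for c in counts.values() if c > 1)
    return 100.0 * shared / total

print(overlap_percentage(location_ids, keyword_ids, network_ids))  # → 20.0
```

With the real datasets, the same function applied to the three ID lists should reproduce the 2% figure.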

]]>
<![CDATA[Llaima volcano dataset: In-depth comparison of deep artificial neural network architectures on seismic events classification]]> https://www.researchpad.co/article/N024e6094-9e49-4d07-a0a6-23982523d88c This data manuscript presents a set of signals collected from the Llaima volcano, located at the western edge of the Andes in the Araucania Region, Chile. The signals were recorded at the LAV station between 2010 and 2016. After individually processing and analyzing every signal, specialists from the Observatorio Vulcanológico de los Andes Sur (OVDAS) classified them into four classes according to their event source: i) Volcano-Tectonic (VT); ii) Long Period (LP); iii) Tremor (TR); and iv) Tectonic (TC). The dataset comprises 3592 signals, separated by class and filtered to select the segment containing the most representative part of each seismic event. The dataset supports researchers interested in studying seismic signals from active volcanoes and in developing new methods for modelling time-dependent data. We have published the manuscript “In-Depth Comparison of Deep Artificial Neural Network Architectures on Seismic Events Classification” [1], which analyzes these signals with different deep neural networks (DNNs). Its main contribution is a new DNN architecture, called SeismicNet, which achieved classification results among the best in the literature without requiring explicit signal pre-processing steps. The reader is therefore referred to that manuscript for the interpretation of the data.

]]>
<![CDATA[Clustering benchmark datasets exploiting the fundamental clustering problems]]> https://www.researchpad.co/article/N21ececc1-cabd-40c9-845b-8d186997f9a2 The Fundamental Clustering Problems Suite (FCPS) offers a variety of clustering challenges that any algorithm should be able to handle given real-world data. The FCPS consists of datasets with known a priori classifications that are to be reproduced by the algorithm. The datasets are intentionally designed to be visualized in two or three dimensions, under the hypothesis that objects can be grouped unambiguously by the human eye. Each dataset represents a certain problem that known clustering algorithms solve with varying success. In the R package “Fundamental Clustering Problems Suite” on CRAN, user-defined sample sizes can be drawn for the FCPS. Additionally, the distances for two high-dimensional datasets, called Leukemia and Tetragonula, are provided here. This collection is useful for investigating the shortcomings of clustering algorithms and the limitations of dimensionality reduction methods on datasets of three or more dimensions. This article is a simultaneous co-submission with “Swarm Intelligence for Self-Organized Clustering” [1].
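The FCPS premise — a known a priori classification that should be unambiguous to the human eye — can be illustrated in a few lines. A hedged sketch (not taken from the R package) that generates two well-separated 2-D blobs and checks that a minimal 2-means recovers the ground truth:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 2-D Gaussian blobs, in the spirit of the easy FCPS problems.
a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2))
X = np.vstack([a, b])
labels_true = np.array([0] * 50 + [1] * 50)

def two_means(X, iters=10):
    """Minimal 2-means: returns a 0/1 label per row of X."""
    centers = X[[0, -1]].copy()  # seed one center from each end of the array
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            centers[k] = X[labels == k].mean(axis=0)
    return labels

labels = two_means(X)
# Agreement with the a priori classification (up to label permutation).
agree = max((labels == labels_true).mean(), (labels != labels_true).mean())
print(agree)  # → 1.0 for this easy, well-separated problem
```

Harder FCPS problems (chained, entangled, or density-varying clusters) are precisely the ones where such a simple algorithm breaks down.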

]]>
<![CDATA[Performance data of multiple-precision scalar and vector BLAS operations on CPU and GPU]]> https://www.researchpad.co/article/Neaaa012e-76cd-4378-8793-2b41a8734aca Many optimized linear algebra packages support the single- and double-precision floating-point data types. However, there are a number of important applications that require a higher level of precision, up to hundreds or even thousands of digits. This article presents performance data of four dense basic linear algebra subprograms – ASUM, DOT, SCAL, and AXPY – implemented using existing extended-/multiple-precision software for conventional central processing units and CUDA compatible graphics processing units. The following open source packages are considered: MPFR, MPDECIMAL, ARPREC, MPACK, XBLAS, GARPREC, CAMPARY, CUMP, and MPRES-BLAS. The execution time of CPU and GPU implementations is measured at a fixed problem size and various levels of numeric precision. The data in this article are related to the research article entitled “Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units” [1].
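The four Level-1 routines measured here have simple reference definitions. A multiple-precision DOT and AXPY can be sketched with Python's stdlib decimal module — purely illustrative, since the benchmarked packages are C/CUDA libraries:

```python
from decimal import Decimal, getcontext

def mp_dot(xs, ys, digits=50):
    """DOT (sum of elementwise products) at a user-chosen decimal precision.
    Illustrative stand-in for the multiple-precision BLAS routines measured
    in the article."""
    getcontext().prec = digits
    return sum(Decimal(x) * Decimal(y) for x, y in zip(xs, ys))

def mp_axpy(alpha, xs, ys, digits=50):
    """AXPY: y <- alpha * x + y, elementwise, at the chosen precision."""
    getcontext().prec = digits
    a = Decimal(alpha)
    return [a * Decimal(x) + Decimal(y) for x, y in zip(xs, ys)]

# 0.1 is inexact in binary floating point but exact in decimal arithmetic.
print(mp_dot(["0.1"] * 10, ["1"] * 10))  # → 1.0
```

The benchmarked packages follow the same pattern at far higher performance, with precision fixed per run rather than per call.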

]]>
<![CDATA[FaceLift: a transparent deep learning framework to beautify urban scenes]]> https://www.researchpad.co/article/N5fd42e94-d295-4df2-a0e6-a8f34580eb5a

In the area of computer vision, deep learning techniques have recently been used to predict whether urban scenes are likely to be considered beautiful, and it turns out that these techniques can make accurate predictions. Yet they fall short when it comes to generating actionable insights for urban design. To support urban interventions, one needs to go beyond predicting beauty and tackle the challenge of recreating beauty. Unfortunately, deep learning techniques have not been designed with that challenge in mind: given their ‘black-box’ nature, these models cannot be directly used to explain why a particular urban scene is deemed beautiful. To partly fix that, we propose a deep learning framework (which we name FaceLift) that is able both to beautify existing urban scenes (Google Street Views) and to explain which urban elements make those transformed scenes beautiful. To quantitatively evaluate our framework, we cannot resort to any existing metric (as the research problem at hand has never been tackled before) and need to formulate new ones. These new metrics should ideally capture the presence (or absence) of elements that make urban spaces great. Upon a review of the urban planning literature, we identify five main metrics: walkability, green spaces, openness, landmarks, and visual complexity. We find that, across all five metrics, the beautified scenes meet the expectations set by the literature on what great spaces tend to be made of. This result is further confirmed by a 20-participant expert survey in which FaceLift was found to be effective in promoting citizen participation. All this suggests that, as our framework's components are further researched and refined, technologies will emerge that can accurately and efficiently support architects and planners in designing the spaces we intuitively love.

]]>
<![CDATA[Reconciling periodic rhythms of large-scale biological networks by optimal control]]> https://www.researchpad.co/article/Nd3fd2fe7-1722-490f-9f77-9cf9436cd0cd

Periodic rhythms are ubiquitous phenomena that illuminate the underlying mechanisms of cyclic activities in biological systems, and they can be represented by cyclic attractors of the related biological network. Disorders of periodic rhythms are detrimental to the natural behaviours of living organisms. Previous studies have shown that the state transition from one attractor to another can be accomplished by regulating external signals; however, most of these studies have focused on point attractors while ignoring cyclic ones. The aim of this study is to investigate an approach for reconciling abnormal periodic rhythms, such as diminished circadian amplitude and phase delay, with the regular rhythms of complex biological networks. For this purpose, we formulate and solve a mixed-integer nonlinear dynamic optimization problem to simultaneously identify regulation variables and determine optimal control strategies for state transition and adjustment of periodic rhythms. Numerical experiments are carried out on three examples: a chaotic system, a mammalian circadian rhythm system, and a gastric cancer gene regulatory network. The results show that regulating a small number of biochemical molecules in the network is sufficient to successfully drive the system to the target cyclic attractor by implementing an optimal control strategy.

]]>
<![CDATA[Intermediacy of publications]]> https://www.researchpad.co/article/Nff3da153-262d-4273-9b64-46b5cf2760ab

Citation networks of scientific publications offer fundamental insights into the structure and development of scientific knowledge. We propose a new measure, called intermediacy, for tracing the historical development of scientific knowledge. Given two publications, an older and a more recent one, intermediacy identifies publications that appear to play a major role in the historical development from the older to the more recent publication. The identified publications are important in connecting the older and the more recent publication in the citation network. After providing a formal definition of intermediacy, we study its mathematical properties. We then present two empirical case studies: one tracing historical developments at the interface between the community detection literature and the scientometric literature, and one examining the development of the literature on peer review. We show both conceptually and empirically how intermediacy differs from main path analysis, the most popular approach for tracing historical developments in citation networks. Main path analysis tends to favour longer paths over shorter ones, whereas intermediacy has the opposite tendency. We conclude that, compared to main path analysis, intermediacy offers a more principled approach for tracing the historical development of scientific knowledge.
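The first step of such an analysis — finding the publications that connect an older and a more recent publication — can be sketched on a toy citation graph. This only enumerates path members; the intermediacy score itself is defined in the article and is not reproduced here:

```python
# Hypothetical toy citation DAG: edges point from citing (newer) to cited (older).
edges = {
    "E": ["C", "D"],   # E is the most recent publication
    "C": ["B"],
    "D": ["B"],
    "B": ["A"],        # A is the oldest publication
    "A": [],
}

def on_a_path(edges, source, target):
    """Publications lying on at least one citation path from `source`
    (recent) to `target` (old) — the candidates a connectivity measure
    such as intermediacy would score."""
    def reach(start):
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(edges.get(n, []))
        return seen
    forward = reach(source)                                   # reachable from source
    backward = {n for n in edges if target in reach(n)}       # nodes that reach target
    return forward & backward

print(sorted(on_a_path(edges, "E", "A")))  # → ['A', 'B', 'C', 'D', 'E']
```

Intermediacy then ranks these candidates by how important each is for keeping source and target connected, which is where it departs from main path analysis.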

]]>
<![CDATA[Dealing with uncertainty in agent-based models for short-term predictions]]> https://www.researchpad.co/article/Nb7fb9af3-6a06-4655-b0d5-864323a6b15d

Agent-based models (ABMs) are gaining traction as one of the most powerful modelling tools within the social sciences, and they are particularly suited to simulating complex systems. Despite many methodological advances in ABM, a major drawback remains their inability to incorporate real-time data to make accurate short-term predictions. This paper presents an approach that allows ABMs to be dynamically optimized: through a combination of parameter calibration and data assimilation (DA), the accuracy of real-time, model-based predictions is increased. We use the exemplar of a bus route system to explore these methods. The bus route ABMs developed in this research demonstrate how an ABM can be dynamically optimized by combining parameter calibration and DA. The proposed model and framework constitute a novel and transferable approach that can be used in any passenger information system, or in intelligent transport systems, to provide forecasts of bus locations and arrival times.
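The combination described here — a simulation stepped forward and repeatedly corrected by real-time observations — can be sketched with a bootstrap particle filter over a one-dimensional bus position. This is a generic illustration of data assimilation, not the specific model or DA scheme used in the paper:

```python
import math
import random

random.seed(42)

def step(pos, speed=1.0):
    """One ABM step: the simulated bus advances with stochastic variation."""
    return pos + speed + random.gauss(0.0, 0.2)

def assimilate(particles, observation, obs_noise=0.5):
    """Weight each particle by its closeness to the observation, then resample."""
    weights = [math.exp(-((p - observation) ** 2) / (2 * obs_noise ** 2))
               for p in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    return random.choices(particles, weights=weights, k=len(particles))

particles = [0.0] * 100
for t in range(1, 11):
    particles = [step(p) for p in particles]        # predict: run the ABM
    observation = float(t)                          # pretend real-time GPS fix
    particles = assimilate(particles, observation)  # correct with the data

estimate = sum(particles) / len(particles)
print(abs(estimate - 10.0) < 2.0)  # the ensemble estimate tracks the observations
```

Parameter calibration would additionally adjust quantities like `speed` so the model stays accurate between assimilation steps.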

]]>
<![CDATA[Reconstructed data of landings for the artisanal beach seine fishery in the marine-coastal area of Taganga, Colombian Caribbean Sea]]> https://www.researchpad.co/article/N0f6887ef-2947-440d-baf8-3abfe6ca35b7

This paper presents a dataset on the abiotic (oceanographic, atmospheric and global climatic indices) and fishery variables of the marine-coastal area of the Magdalena Province, in the area between Taganga and Bahía Concha, located north of Santa Marta in the Colombian Caribbean. The abiotic variables were obtained from satellite data of the National Aeronautics and Space Administration (NASA) and from the meteorological stations of the Institute of Hydrology, Meteorology and Environmental Studies (IDEAM). The fishery variables were obtained through field trips in the study area. A dynamic artificial neural network was implemented to reconstruct the missing data in the fishery variables from the known abiotic variables (precipitation, the North Atlantic Oscillation index and the Multivariate ENSO Index). The resulting dataset is valuable for determining the historical changes of fishery resources in the study area and for making catch forecasts that incorporate the variability of the environmental (atmospheric and oceanographic) conditions.

]]>
<![CDATA[The green view dataset for the capital of Finland, Helsinki]]> https://www.researchpad.co/article/N2cdaf76a-6e58-49d3-9ccd-4fc637e8aa7c

Recent studies have complemented more traditional ways of mapping city greenery with human-perspective methods, such as using street view images and measuring green view [1]. Green view describes the relative amount of green vegetation visible at street level and is often measured with the green view index (GVI), which gives the percentage of green vegetation in one or more street view images of a certain location [2]. The green view dataset of Helsinki was created as part of the master's thesis of Akseli Toikka at the University of Helsinki [3].

We calculated the GVI values for a set of locations on the streets of Helsinki using Google Street View (GSV) 360° panorama images from the summer months (May through September) between 2009 and 2017. From the available images, a total of 94,454 matched the selection criteria; these were downloaded using the Google application programming interface (API). We calculated the GVI values from the panoramas based on the spectral characteristics of green vegetation in RGB images. The result was a set of points along the street network with GVI values.
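The per-panorama computation can be sketched as a spectral rule over RGB pixels. The rule below (green channel strictly dominating red and blue) is a simplified assumption — the classification rule actually used in the thesis may differ:

```python
import numpy as np

def green_view_index(img):
    """Percentage of 'green' pixels in an RGB image array of shape (H, W, 3).
    A pixel counts as vegetation when its green channel dominates both
    red and blue (illustrative rule only)."""
    r = img[..., 0].astype(int)
    g = img[..., 1].astype(int)
    b = img[..., 2].astype(int)
    green = (g > r) & (g > b)
    return 100.0 * green.mean()

# Tiny synthetic "panorama": left half grey sky/road, right half vegetation.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (120, 120, 120)   # grey: not counted
img[:, 2:] = (60, 160, 70)     # green-dominant: counted
print(green_view_index(img))   # → 50.0
```

Applied to each downloaded panorama, this yields one GVI value per street-network point.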

By combining the point data with the street network data of the area, we generated a dataset of GVI values along the street centre lines. Streets with GVI points within a threshold distance of 30 meters were assigned the average of those points' GVI values. For the streets with no points in the vicinity (∼67%), land cover data from the area were used to estimate the GVI, as suggested in the thesis [3]. The point and street-wise data are stored in georeferenced tables that can be used for further analyses in geographical information systems.

]]>
<![CDATA[A dataset of microscopic peripheral blood cell images for development of automatic recognition systems]]> https://www.researchpad.co/article/Na760f550-1ace-4c53-baa2-9249ca09ec6f

This article makes available a dataset that was used for the development of an automatic recognition system for peripheral blood cell images based on convolutional neural networks [1]. The dataset contains a total of 17,092 images of individual normal cells, acquired using the CellaVision DM96 analyzer in the Core Laboratory at the Hospital Clinic of Barcelona. The dataset is organized into the following eight groups: neutrophils, eosinophils, basophils, lymphocytes, monocytes, immature granulocytes (promyelocytes, myelocytes, and metamyelocytes), erythroblasts, and platelets (thrombocytes). The images are 360 × 363 pixels, in JPG format, and were annotated by expert clinical pathologists. They were captured from individuals without infection, hematologic or oncologic disease, who were free of any pharmacologic treatment at the moment of blood collection.

This high-quality labelled dataset may be used to train and test machine learning and deep learning models to recognize different types of normal peripheral blood cells. To our knowledge, this is the first publicly available dataset with large numbers of normal peripheral blood cells, and it is therefore expected to serve as a canonical dataset for model benchmarking.

]]>
<![CDATA[Dataset of academic performance evolution for engineering students]]> https://www.researchpad.co/article/N94548e27-6e77-467f-b691-f6ecadcd5fe8

This data article presents data on the results of national assessments of secondary and university education for engineering students. The data contain academic, social, and economic information for 12,411 students, obtained by systematically cross-referencing databases of the Colombian Institute for the Evaluation of Education (ICFES). The structure of the data makes it possible to observe the influence of social variables and the evolution of students' learning skills, and the data can serve as input for analyses of academic efficiency, student recommendation systems, and educational data mining. The data are presented in comma-separated value format and can be easily accessed through the Mendeley Data Repository (https://data.mendeley.com/datasets/83tcx8psxv/1).

]]>
<![CDATA[Self-reported data for mental workload modelling in human-computer interaction and third-level education]]> https://www.researchpad.co/article/N12208356-5d24-45e7-a276-5ee1e6036429

Mental workload (MWL) is an imprecise construct, with distinct definitions and no predominant measurement technique. It can be intuitively understood as the amount of mental activity devoted to a certain task over time. Several approaches have been proposed in the literature for the modelling and assessment of MWL. In this paper, data related to two sets of tasks performed by participants under different conditions are reported. The data were gathered from several sets of questionnaires answered by these participants, aimed at assessing the features believed by domain experts to influence overall mental workload. In total, 872 records are reported, each representing the answers given by a user after performing a task. The collected data might support machine learning researchers interested in using predictive analytics for the assessment of mental workload. The data, if exploited by a set of rules/arguments (as in [3]), may also serve as knowledge bases for researchers in the fields of knowledge-based systems and automated reasoning. Lastly, the data might serve as a source of information for mental workload designers interested in investigating the features reported here for mental workload modelling. This article was co-submitted with the research article “An empirical evaluation of the inferential capacity of defeasible argumentation, non-monotonic fuzzy reasoning and expert systems” [3]; the reader is referred to it for the interpretation of the data.

]]>
<![CDATA[Real and synthetic data sets for benchmarking key-value stores focusing on various data types and sizes]]> https://www.researchpad.co/article/N080afd19-9ae1-40ed-90c6-7088dcd679e0

In this article, we present real and synthetic data sets for benchmarking key-value stores, focusing on various data types and sizes. Key-value pairs consist of a key and a value, and any kind of data can be turned into a key-value data set by assigning an arbitrary type of data as the value and a unique ID as the key. This makes key-value pairs particularly valuable for big data, where data types are increasingly varied and sometimes not known or determined in advance. In this article, we crawl four kinds of real data sets that vary in the type of data (i.e., variety) and generate four kinds of synthetic data sets that vary in size (i.e., volume). For the real data sets, we crawl data with various data types from Twitter: tweets in text, lists of hashtags, tweet geo-locations, and follower counts. We also present algorithms for crawling real data sets based on REST APIs and streaming APIs, and for generating synthetic data sets. Using these algorithms, one can crawl key-value pairs of any data type supported by Twitter and generate synthetic data sets of any size through simple extensions. Last, we show that the crawled and generated data sets can be used with well-known key-value stores such as LevelDB from Google, RocksDB from Facebook, and Berkeley DB from Oracle; the presented data sets have in fact been used to compare the performance of these stores. As an example, we present an algorithm of the basic operations for LevelDB.
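The "volume" axis of the synthetic data sets and the basic store operations can be sketched as follows. `DictStore` is a hypothetical in-memory stand-in exposing the put/get/delete operations shared by such stores, not the real LevelDB/RocksDB/Berkeley DB APIs:

```python
import random
import string

def generate_synthetic_pairs(n, value_size, seed=0):
    """Synthetic key-value data set: unique integer-like keys and random
    string values of a fixed size, mirroring the 'volume' dimension."""
    rng = random.Random(seed)
    return {
        str(i): "".join(rng.choices(string.ascii_lowercase, k=value_size))
        for i in range(n)
    }

class DictStore:
    """Hypothetical in-memory stand-in for a key-value store."""
    def __init__(self):
        self._d = {}
    def put(self, key, value):
        self._d[key] = value
    def get(self, key):
        return self._d.get(key)
    def delete(self, key):
        self._d.pop(key, None)

store = DictStore()
for k, v in generate_synthetic_pairs(1000, value_size=32).items():
    store.put(k, v)
print(len(store.get("0")))  # → 32
```

Scaling `n` and `value_size` reproduces the volume dimension; swapping the value generator for crawled tweets, hashtag lists, geo-locations, or follower counts reproduces the variety dimension.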

]]>
<![CDATA[Early warning of some notifiable infectious diseases in China by the artificial neural network]]> https://www.researchpad.co/article/Nf6c3af52-397d-45c0-99c9-48bcd25792d4

To accurately determine the timing of disease prevention and control measures, we established an artificial neural network model to issue early warning signals. The real-time recurrent learning (RTRL) and extended Kalman filter (EKF) methods were used to analyse four types of respiratory infectious diseases and four types of digestive tract infectious diseases in China, in order to comprehensively determine epidemic intensities and whether to issue early warning signals. The numbers of new confirmed cases per month between January 2004 and December 2017 were used as the training set; the data from 2018 were used as the test set. The RTRL results showed that the number of new confirmed cases of respiratory infectious diseases increased abnormally in September 2018, while the EKF results showed abnormal increases in January and February 2018. Neither algorithm detected any abnormal increase in new confirmed cases of digestive tract infectious diseases in the test set. Neural networks and machine learning can further enrich and develop early warning theory.
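The notion of an "abnormal increase" can be illustrated with a far simpler baseline than the RTRL/EKF models used in the study: flag a month whose new-case count exceeds the historical mean by more than k standard deviations. The counts below are made up:

```python
def early_warning(history, current, k=2.0):
    """Flag an abnormal increase when the current monthly case count exceeds
    the historical mean by more than k standard deviations. A simple baseline
    for illustration, not the study's RTRL/EKF approach."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / n
    threshold = mean + k * var ** 0.5
    return current > threshold

history = [100, 110, 95, 105, 98, 102, 107, 99, 103, 101]  # hypothetical counts
print(early_warning(history, 160))  # clear spike → True
print(early_warning(history, 104))  # within normal variation → False
```

The study's recurrent models play the same role but learn the expected trajectory from the full 2004–2017 series rather than from a static mean.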

]]>
<![CDATA[Fluid–structure interaction simulations outperform computational fluid dynamics in the description of thoracic aorta haemodynamics and in the differentiation of progressive dilation in Marfan syndrome patients]]> https://www.researchpad.co/article/N35b6edf0-5fe1-4fe2-8f14-835eea74ba8a

Abnormal fluid dynamics at the ascending aorta may be at the origin of aortic aneurysms. This study aimed to compare the performance of computational fluid dynamics (CFD) and fluid–structure interaction (FSI) simulations against four-dimensional (4D) flow magnetic resonance imaging (MRI) data, and to assess the capacity of advanced fluid dynamics markers to stratify aneurysm progression risk. Eight Marfan syndrome (MFS) patients, four with stable and four with dilating aneurysms of the proximal aorta, and four healthy controls were studied. FSI and CFD simulations were performed with MRI-derived geometry, inlet velocity field and Young's modulus. Flow displacement, jet angle and maximum velocity evaluated from FSI and CFD simulations were compared to 4D flow MRI data. A dimensionless parameter, the shear stress ratio (SSR), was evaluated from FSI and CFD simulations and assessed as a potential correlate of aneurysm progression. FSI simulations successfully matched MRI data regarding descending-to-ascending aorta flow rates (R2 = 0.92) and pulse wave velocity (R2 = 0.99). Compared to CFD, FSI simulations showed significantly lower percentage errors with respect to 4D flow MRI in flow displacement (−46% ascending, −41% descending), jet angle (−28% ascending, −50% descending) and maximum velocity (−37% ascending, −34% descending). FSI-derived, but not CFD-derived, SSR differentiated between stable and dilating MFS patients. Fluid dynamic simulations of the thoracic aorta require fluid–structure interaction to properly reproduce complex haemodynamics, and FSI-derived SSR could help stratify MFS patients.

]]>
<![CDATA[Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality]]> https://www.researchpad.co/article/Nbf30117c-7bf3-4987-ad3a-597177b037e8

The application of machine learning to inference problems in biology is dominated by the supervised learning problems of regression and classification, and the unsupervised learning problems of clustering and variants of low-dimensional projection for visualization. A class of problems that has not gained much attention is the detection of outliers in datasets, which arise from causes such as gross experimental, reporting or labelling errors, or which may be small parts of a dataset that are functionally distinct from the majority of a population. Outlier data are often identified by modelling the probability density of normal data and comparing data likelihoods against some threshold. This classical approach suffers from the curse of dimensionality, a serious problem with omics data, which often lie in very high dimensions. We develop an outlier detection method based on structured low-rank approximation methods. The objective function includes a regularizer based on neighbourhood information captured in the graph Laplacian. Results on publicly available genomic data show that our method robustly detects outliers, whereas a density-based method fails even at moderate dimensions. Moreover, we show that our method has better clustering and visualization performance on the recovered low-dimensional projection than popular dimensionality reduction techniques.

]]>
<![CDATA[Data on optimization of the non-linear Muskingum flood routing in Kardeh River using Goa algorithm]]> https://www.researchpad.co/article/Nc1688a88-0a00-4e3e-892d-4ed7124852b2

This article describes time-series data for optimizing the non-linear Muskingum flood routing of the Kardeh River, located in northeastern Iran, for a period of 2 days (from 27 April 1992 to 28 April 1992). The time-series data comprise river inflow, storage volume, and river outflow. In this data article, a model based on the Grasshopper Optimization Algorithm (GOA) was developed to optimize the non-linear Muskingum flood routing model. The GOA was compared with other metaheuristic algorithms, namely the Genetic Algorithm (GA) and Harmony Search (HS). The analysis showed that the best solutions achieved by the GOA, GA, and HS were 3.53, 5.29, and 5.69, respectively, revealing that the GOA was superior to the GA and HS algorithms for the optimal river flood routing problem.
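For context, the quantity minimized in such studies is the routing error of the non-linear Muskingum model, whose storage relation is S = K·[x·I + (1−x)·O]^m. A hedged sketch of the routing step and a sum-of-squared-errors objective, using the textbook formulation — the article's exact scheme, units, and reported objective values come from its own data:

```python
def route(inflow, K, x, m, dt=1.0):
    """Step the non-linear Muskingum model S = K*(x*I + (1-x)*O)**m forward,
    assuming the initial outflow equals the initial inflow (steady state)."""
    S = K * inflow[0] ** m
    out = []
    for I in inflow:
        O = ((S / K) ** (1.0 / m) - x * I) / (1.0 - x)  # invert the storage relation
        out.append(O)
        S += dt * (I - O)                               # continuity: dS/dt = I - O
    return out

def sse(observed, routed):
    """Sum of squared errors: the objective minimized over (K, x, m)
    by metaheuristics such as GOA, GA, and HS."""
    return sum((a - b) ** 2 for a, b in zip(observed, routed))

# Sanity check: steady inflow should be routed through unchanged.
out = route([5.0, 5.0, 5.0], K=1.0, x=0.2, m=1.0)
print(sse([5.0, 5.0, 5.0], out) < 1e-9)  # → True
```

A metaheuristic then searches the (K, x, m) space for the parameter set that minimizes this objective against the observed hydrographs.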

]]>
<![CDATA[UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection]]> https://www.researchpad.co/article/N2f0ca662-47b6-47a7-a4d6-8beaab74b860

In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for an improved and more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually labeled, algorithmically generated domain names, decorated with a ready-to-use feature set for machine learning (ML) analysis. The dataset has been co-submitted with the research article “UMUDGA: a dataset for profiling DGA-based botnet” [1]; it aims to let researchers move past the data collection, organization, and pre-processing phases and focus on the analysis and production of ML-powered solutions for network intrusion detection. To be as exhaustive as possible, we selected 50 of the most notorious malware families. Each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics).
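The kind of per-domain features mentioned above can be sketched as follows; this is an illustrative subset (length, character entropy, digit and vowel ratios), not the actual UMUDGA feature set:

```python
import math
from collections import Counter

def domain_features(domain):
    """A few statistical/linguistic features commonly extracted from
    domain names for DGA detection (illustrative subset only)."""
    name = domain.split(".")[0]          # drop the top-level domain
    counts = Counter(name)
    n = len(name)
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    digit_ratio = sum(ch.isdigit() for ch in name) / n
    vowel_ratio = sum(ch in "aeiou" for ch in name) / n
    return {"length": n,
            "entropy": round(entropy, 3),
            "digit_ratio": round(digit_ratio, 3),
            "vowel_ratio": round(vowel_ratio, 3)}

print(domain_features("google.com"))
print(domain_features("xj4k9q2vprl0.net"))  # DGA-like: higher entropy, more digits
```

Legitimate names tend toward pronounceable, low-entropy strings, while algorithmically generated ones skew toward high entropy and digit-heavy alphabets — the separation such feature sets are designed to expose to an ML classifier.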

]]>
<![CDATA[Data on optimal operation of Safarud Reservoir using symbiotic organisms search (SOS) algorithm]]> https://www.researchpad.co/article/Nb40e17f2-3d1c-4941-beb6-093cee8cf789

This data article presents time-series data for the optimal operation of the Safarud Reservoir, located in the Halilrood basin in the south of Iran, over a period of 223 months, from October 2000 to April 2019. The data comprise reservoir release, reservoir inflow, reservoir storage, evaporation, and precipitation. A model based on the Symbiotic Organisms Search (SOS) algorithm was also developed for the optimal operation of the Safarud Reservoir. The analysis showed that the best objective function value achieved by the SOS algorithm was 10.89, and that the SOS algorithm was efficient for the reservoir operation problem.

]]>