ResearchPad - decision-trees https://www.researchpad.co Default RSS Feed en-us © 2020 Newgen KnowledgeWorks
<![CDATA[iterb-PPse: Identification of transcriptional terminators in bacteria by incorporating nucleotide properties into PseKNC]]> https://www.researchpad.co/article/elastic_article_14750 A terminator is a DNA sequence that signals RNA polymerase to terminate transcription. Correctly identifying terminators can improve genome annotation and, more importantly, has considerable application value in disease diagnosis and therapy. However, accurate prediction methods are scarce and urgently needed. We therefore propose “iterb-PPse”, a terminator prediction method that incorporates 47 nucleotide properties into PseKNC-Ⅰ and PseKNC-Ⅱ and uses Extreme Gradient Boosting to predict terminators in Escherichia coli and Bacillus subtilis. Combining these with the preceding methods, we employed three new feature extraction methods, K-pwm, Base-content and Nucleotidepro, to formulate raw samples. A two-step method was applied to select features. When identifying terminators based on the optimized features, we compared five single models as well as 16 ensemble models. As a result, the accuracy of our method on the benchmark dataset reached 99.88%, higher than that of the existing state-of-the-art predictor iTerm-PseKNC in a 100-run five-fold cross-validation test. Its prediction accuracy on two independent datasets reached 94.24% and 99.45%, respectively. For the convenience of users, we developed a software tool of the same name based on “iterb-PPse”. The software and source code of “iterb-PPse” are available at https://github.com/Sarahyouzi/iterb-PPse.
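
As a hedged illustration of the evaluation protocol described above (100 repetitions of five-fold cross-validation with Extreme Gradient Boosting), the Python sketch below runs the same protocol on synthetic placeholder features; the data, hyperparameters and variable names are assumptions, not the authors' pipeline.

```python
# Sketch only: repeated five-fold CV of an XGBoost terminator classifier.
# Real features would come from PseKNC, K-pwm, Base-content, Nucleotidepro.
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 47))          # placeholder for sequence-derived features
y = rng.integers(0, 2, 500)        # 1 = terminator, 0 = non-terminator

accs = []
for seed in range(100):            # "100 times five-fold cross-validation"
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    accs.extend(cross_val_score(XGBClassifier(n_estimators=200),
                                X, y, cv=cv, scoring="accuracy"))
print(f"mean accuracy over 100x5 folds: {np.mean(accs):.4f}")
```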

]]>
<![CDATA[ECG-based prediction algorithm for imminent malignant ventricular arrhythmias using decision tree]]> https://www.researchpad.co/article/elastic_article_14548 Prediction of imminent malignant ventricular arrhythmia (MVA) is useful for avoiding delays in rescue operations. Recently, researchers have developed several algorithms to predict MVA using various features derived from the electrocardiogram (ECG). However, several issues regarding MVA prediction remain unresolved: the effect of the number of ECG features on prediction is unclear, an alert for an occurring MVA may arrive very late, and the performance of algorithms predicting MVA minutes before onset is uncertain. To overcome these problems, this research conducts an in-depth study of the number and types of ECG features implemented in a decision tree classifier. In addition, this research investigates the algorithm's execution time before the occurrence of MVA to minimize delays in MVA warnings. Lastly, this research studies both the sensitivity and specificity of the algorithm to reveal how the performance of MVA prediction algorithms changes over time. To strengthen the analysis, several other classifiers, such as support vector machine and naive Bayes, are also examined for comparison. Three phases were required to achieve these objectives. The first phase was a review of existing relevant studies. The second phase dealt with the design and development of four modules for predicting MVA; rigorous experiments were performed in the feature selection and classification modules. The results show that eight ECG features with a decision tree classifier achieved good prediction performance in terms of execution time and sensitivity. In addition, the highest sensitivity and specificity were 95% and 90% respectively, in the fourth 5-minute interval (15.1 minutes–20 minutes) preceding the onset of an arrhythmia event. These results imply that the fourth 5-minute interval would be the best time to perform prediction.
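
A minimal sketch of the core classification step, assuming synthetic stand-ins for the eight selected ECG features (the abstract does not list them), showing how sensitivity and specificity fall out of the confusion matrix:

```python
# Hypothetical sketch: decision tree on 8 ECG-derived features,
# scored by sensitivity and specificity. Data are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
X = rng.random((400, 8))            # 8 ECG features per pre-onset segment
y = rng.integers(0, 2, 400)         # 1 = MVA onset follows, 0 = control

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
print(f"sensitivity = {tp / (tp + fn):.2f}, specificity = {tn / (tn + fp):.2f}")
```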

]]>
<![CDATA[Improvement of electrocardiographic diagnostic accuracy of left ventricular hypertrophy using a Machine Learning approach]]> https://www.researchpad.co/article/elastic_article_14491 The electrocardiogram (ECG) is the most common tool used to predict left ventricular hypertrophy (LVH). However, it is limited by its low accuracy (<60%) and sensitivity (30%). We set forth the hypothesis that the Machine Learning (ML) C5.0 algorithm could optimize the ECG in the prediction of LVH by echocardiography (Echo) while also establishing ECG-LVH phenotypes. We used Echo as the standard diagnostic tool to detect LVH and measured the ECG abnormalities found in Echo-LVH. We included 432 patients (power = 99%). Of these, 202 patients (46.7%) had Echo-LVH and 240 (55.6%) were males. We included a wide range of ventricular masses and Echo-LVH severities, classified as mild (n = 77, 38.1%), moderate (n = 50, 24.7%) and severe (n = 75, 37.1%). Data were divided into a training/testing set (80%/20%) and we applied logistic regression analysis to the ECG measurements. The logistic regression model with the best ability to identify Echo-LVH was introduced into the C5.0 ML algorithm. We created multiple decision trees and selected the tree with the highest performance. The resultant five-level binary decision tree used only six predictive variables and had an accuracy of 71.4% (95%CI, 65.5–80.2), a sensitivity of 79.6%, specificity of 53%, positive predictive value of 66.6% and a negative predictive value of 69.3%. Internal validation reached a mean accuracy of 71.4% (64.4–78.5). Our results were reproduced in a second validation group with similar diagnostic accuracy: 73.3% (95%CI, 65.5–80.2), sensitivity 81.6%, specificity 69.3%, positive predictive value 56.3% and negative predictive value 88.6%. We also calculated the Romhilt-Estes multilevel score and compared it to our model: it had an accuracy of 61.3% (95%CI, 56.5–65.9), a sensitivity of 23.2% and a specificity of 94.8%, with similar results in the external validation group. In conclusion, the C5.0 ML algorithm surpassed the accuracy of current ECG criteria in the detection of Echo-LVH. Our new criteria hinge on ECG abnormalities that identify high-risk patients and provide some insight into electrogenesis in Echo-LVH.

]]>
<![CDATA[Oxycodone versus morphine for cancer pain titration: A systematic review and pharmacoeconomic evaluation]]> https://www.researchpad.co/article/N5c0f7a4c-4090-42ec-ba95-57e120b0c99c

Objective

To evaluate the efficacy, safety and cost-effectiveness of Oxycodone Hydrochloride Controlled-release Tablets (CR oxycodone) and Morphine Sulfate Sustained-release Tablets (SR morphine) for moderate to severe cancer pain titration.

Methods

Randomized controlled trials meeting the inclusion criteria were retrieved from the Medline, Cochrane Library, PubMed, EMbase, CNKI, VIP and WanFang databases, from their dates of establishment to June 2019. Efficacy and safety data were extracted from the included literature. The pain control rate was calculated to estimate efficacy. Meta-analysis was conducted with RevMan 5.1.4. A decision tree model was built to simulate the cancer pain titration process. The initial doses in the CR oxycodone and SR morphine groups were 20 mg and 30 mg, respectively. Oral immediate-release morphine was administered to treat breakthrough pain. The incremental cost-effectiveness ratio was calculated with TreeAge Pro 2019.

Results

19 studies (1680 patients) were included in this study. Meta-analysis showed that the pain control rates of CR oxycodone and SR morphine were 86.00% and 82.98%, respectively. The costs of CR oxycodone and SR morphine were $23.27 and $13.31, and the incremental cost-effectiveness ratio was approximately $329.76 per unit of effectiveness. At a willingness-to-pay threshold of $8836, CR oxycodone was cost-effective, while the corresponding probability of being cost-effective at a willingness-to-pay threshold of $300 was 31.6%. One-way sensitivity analysis confirmed the robustness of the results.
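
The reported ICER is consistent with the standard definition, incremental cost divided by incremental effectiveness:

\[
\mathrm{ICER} = \frac{\Delta C}{\Delta E} = \frac{\$23.27 - \$13.31}{0.8600 - 0.8298} = \frac{\$9.96}{0.0302} \approx \$330 \ \text{per unit of effectiveness,}
\]

which agrees with the reported $329.76 once unrounded inputs are used.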

Conclusions

CR oxycodone could be a cost-effective option compared with SR morphine for moderate to severe cancer pain titration in China, according to the threshold defined by the WHO.

]]>
<![CDATA[Automated localization and quality control of the aorta in cine CMR can significantly accelerate processing of the UK Biobank population data]]> https://www.researchpad.co/article/5c6f151bd5eed0c48467adda

Introduction

Aortic distensibility can be calculated using semi-automated methods to segment the aortic lumen on cine CMR (Cardiovascular Magnetic Resonance) images. However, these methods require visual quality control and manual localization of the region of interest (ROI) of ascending (AA) and proximal descending (PDA) aorta, which limit the analysis in large-scale population-based studies. Using 5100 scans from UK Biobank, this study sought to develop and validate a fully automated method to 1) detect and locate the ROIs of AA and PDA, and 2) provide a quality control mechanism.

Methods

The automated AA and PDA detection-localization algorithm followed these steps: 1) foreground segmentation; 2) detection of candidate ROIs by Circular Hough Transform (CHT); 3) spatial, histogram and shape feature extraction for candidate ROIs; 4) AA and PDA detection using Random Forest (RF); 5) quality control based on RF detection probability. To provide the ground truth, overall image quality (IQ = 0–3 from poor to good) and aortic locations were visually assessed by 13 observers. The automated algorithm was trained on 1200 scans and Dice Similarity Coefficient (DSC) was used to calculate the agreement between ground truth and automatically detected ROIs.
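
For reference, the DSC between a detected ROI mask and its ground-truth mask can be computed as in this generic sketch (not the authors' code):

```python
# Dice Similarity Coefficient between two boolean masks:
# DSC = 2|A ∩ B| / (|A| + |B|); 1.0 means perfect overlap.
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0
```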

Results

The automated algorithm was tested on 3900 scans. Detection accuracy was 99.4% for AA and 99.8% for PDA. Aorta localization showed excellent agreement with the ground truth, with DSC ≥ 0.9 in 94.8% of AA (DSC = 0.97 ± 0.04) and 99.5% of PDA cases (DSC = 0.98 ± 0.03). AA×PDA detection probabilities could discriminate scans with IQ ≥ 1 from those severely corrupted by artefacts (AUC = 90.6%). If scans with detection probability < 0.75 were excluded (350 scans), the algorithm was able to correctly detect and localize AA and PDA in all the remaining 3550 scans (100% accuracy).

Conclusion

The proposed method for automated AA and PDA localization was extremely accurate and the automatically derived detection probabilities provided a robust mechanism to detect low quality scans for further human review. Applying the proposed localization and quality control techniques promises at least a ten-fold reduction in human involvement without sacrificing any accuracy.

]]>
<![CDATA[Selection of the optimal trading model for stock investment in different industries]]> https://www.researchpad.co/article/5c6dc9d9d5eed0c48452a2e0

In general, stock prices within the same industry follow similar trends, while those of different industries do not. When investing in stocks across different industries, one should select the optimal model from many candidate trading models for each industry, because no single model is likely to capture the stock trends of all industries. However, such a study has not yet been carried out. In this paper, we first select 424 S&P 500 index component stocks (SPICS) and 185 CSI 300 index component stocks (CSICS) as the research objects from 2010 to 2017 and divide each set into 9 industries, such as finance and energy. Second, we apply 12 widely used machine learning algorithms to generate stock trading signals in different industries and execute back-testing based on the trading signals. Third, we use a non-parametric statistical test to evaluate whether there are significant differences among the trading performance evaluation indicators (PEI) of different models in the same industry. Finally, we propose a series of rules to select the optimal models for stock investment in every industry. The analytical results on SPICS and CSICS show that the optimal trading models for each industry can be found based on the statistical tests and the rules. Most importantly, the PEI of the best algorithms can be significantly better than that of the benchmark index and the “Buy and Hold” strategy. Therefore, the algorithms can be used to profit from industry stock trading.
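
The abstract does not name the specific non-parametric test, so the sketch below uses the Kruskal-Wallis H-test as one common choice for comparing a performance indicator across models within an industry; the PEI values are synthetic.

```python
# Sketch: non-parametric comparison of a back-tested PEI
# (e.g. annualized return) across three models in one industry.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(2)
pei_model_a = rng.normal(0.08, 0.02, 30)
pei_model_b = rng.normal(0.10, 0.02, 30)
pei_model_c = rng.normal(0.07, 0.02, 30)

stat, p = kruskal(pei_model_a, pei_model_b, pei_model_c)
print(f"H = {stat:.2f}, p = {p:.4f}")   # small p => PEI differs across models
```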

]]>
<![CDATA[Development and validation of clinical prediction models to distinguish influenza from other viruses causing acute respiratory infections in children and adults]]> https://www.researchpad.co/article/5c6b26add5eed0c484289e58

Predictive models have been developed for influenza but have seldom been validated. Typically, they have focused on patients meeting a definition of infection that includes fever; less is known about how models perform when more symptoms are considered. We therefore aimed to create and internally validate predictive scores based on acute respiratory infection (ARI) symptoms to diagnose influenza virus infection as confirmed by polymerase chain reaction (PCR) from respiratory specimens. Data from a completed trial studying the indirect effect of influenza immunization in Hutterite communities were randomly split into two independent groups for model derivation and validation. We applied different multivariable modelling techniques and constructed Receiver Operating Characteristic (ROC) curves to determine predictive indexes at different cut-points. From 2008–2011, 3288 first seasonal ARI episodes and 321 (9.8%) influenza-positive events occurred in 2202 individuals. In children up to 17 years, the significant predictors of influenza virus infection were fever, chills, and cough, along with being aged 6 years or older. In adults, the presence of chills and cough, but not fever, was highly specific for influenza virus infection (sensitivity 30%, specificity 96%). Performance of the models in the validation set was not significantly different. The predictors were consistently found to be significant irrespective of the multivariable technique. Symptomatic predictors of influenza virus infection vary between children and adults. The scores could assist clinicians in their test-and-treat decisions, but the results need to be externally validated prior to application in clinical practice.
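
A hedged sketch of the general workflow (not the authors' data or model): fitting a symptom-based logistic score and reading sensitivity and specificity off the ROC curve at the Youden-optimal cut-point.

```python
# Illustrative only: symptom indicators and labels are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(3)
X = rng.integers(0, 2, (1000, 3))     # columns: fever, chills, cough
y = rng.integers(0, 2, 1000)          # 1 = PCR-confirmed influenza

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]
fpr, tpr, thr = roc_curve(y, scores)
best = np.argmax(tpr - fpr)           # maximize Youden's J = sens + spec - 1
print(f"AUC = {roc_auc_score(y, scores):.2f}; cut-point {thr[best]:.2f}: "
      f"sens = {tpr[best]:.2f}, spec = {1 - fpr[best]:.2f}")
```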

]]>
<![CDATA[Multi-sensor movement analysis for transport safety and health applications]]> https://www.researchpad.co/article/5c5ca2c0d5eed0c48441ea09

Recent increases in the use of, and applications for, wearable technology have opened up many new avenues of research. In this paper, we consider the use of lifelogging and GPS data to extend fine-grained movement analysis for improving applications in health and safety. We first design a framework to solve the problem of indoor and outdoor movement detection from sensor readings associated with images captured by a lifelogging wearable device. Second, we propose a set of measures related to hazards on the road network, derived from the combination of GPS movement data, road network data and the sensor readings from a wearable device. Third, we identify the relationship between different socio-demographic groups and patterns of indoor physical activity and sedentary behaviour routines, as well as disturbance levels in different road settings.

]]>
<![CDATA[In silico identification of critical proteins associated with learning process and immune system for Down syndrome]]> https://www.researchpad.co/article/5c58d652d5eed0c484031b96

Understanding the expression levels of proteins and their interactions is key to diagnosing and explaining Down syndrome, which can be considered the most prevalent cause of intellectual disability in human beings. In previous studies, the expression levels of 77 proteins obtained from normal genotype control mice and from trisomic Ts65Dn mice were analyzed after training in contextual fear conditioning, with and without injection of the drug memantine, using statistical methods and machine learning techniques. Recent studies have also pointed out that there may be a link between Down syndrome and the immune system. Thus, the research presented in this paper aims at in silico identification of proteins that are significant to the learning process and the immune system, and at deriving the most accurate model for classification of the mice. Features are selected by a forward feature selection method after a preprocessing step on the dataset. Deep neural network, gradient boosting tree, support vector machine and random forest classification methods are then applied to assess classification accuracy. It is observed that the selected feature subsets not only yield higher classification accuracy but are also composed of protein responses that are important for the learning and memory process and the immune system.

]]>
<![CDATA[Using computer-vision and machine learning to automate facial coding of positive and negative affect intensity]]> https://www.researchpad.co/article/5c633970d5eed0c484ae6711

Facial expressions are fundamental to interpersonal communication, including social interaction, and allow people of different ages, cultures, and languages to quickly and reliably convey emotional information. Historically, facial expression research has followed from discrete emotion theories, which posit a limited number of distinct affective states that are represented with specific patterns of facial action. Much less work has focused on dimensional features of emotion, particularly positive and negative affect intensity. This is likely, in part, because achieving inter-rater reliability for facial action and affect intensity ratings is painstaking and labor-intensive. We use computer-vision and machine learning (CVML) to identify patterns of facial actions in 4,648 video recordings of 125 human participants, which show strong correspondences to positive and negative affect intensity ratings obtained from highly trained coders. Our results show that CVML can both (1) determine the importance of different facial actions that human coders use to derive positive and negative affective ratings when combined with interpretable machine learning methods, and (2) efficiently automate positive and negative affect intensity coding on large facial expression databases. Further, we show that CVML can be applied to individual human judges to infer which facial actions they use to generate perceptual emotion ratings from facial expressions.

]]>
<![CDATA[Computational prediction of diagnosis and feature selection on mesothelioma patient health records]]> https://www.researchpad.co/article/5c40f7e0d5eed0c484386b51

Background

Mesothelioma is a cancer of the lung lining that kills thousands of people worldwide annually, especially those with exposure to asbestos. Diagnosis of mesothelioma in patients often requires time-consuming imaging techniques and biopsies. Machine learning can provide more effective, cheaper, and faster patient diagnosis and feature selection from clinical data in patient records.

Methods and findings

We analyzed a dataset of health records of 324 patients from Turkey with mesothelioma symptoms. The patients had prior asbestos exposure and displayed symptoms consistent with mesothelioma. We compared probabilistic neural network, perceptron-based neural network, random forest, one rule, and decision tree classifiers in predicting diagnosis from the patient records. We measured classifier performance through standard confusion matrix scores such as the Matthews correlation coefficient (MCC). Random forest outperformed all models tested, obtaining MCC = +0.37 on the complete imbalanced dataset and MCC = +0.64 on the under-sampled balanced dataset. We then employed random forest feature selection to identify the two dataset traits most relevant to mesothelioma: lung side and platelet count. These two risk factors proved so predictive that a decision tree focusing on them alone achieved the second-best accuracy for diagnosis prediction on the complete dataset (MCC = +0.28), outperforming all other methods and even a decision tree applied to all features.
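
A rough sketch of the two reported steps on placeholder data: scoring a random forest with MCC, then ranking features by impurity-based importance, which is how traits such as lung side and platelet count would surface. Feature names and data are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
feature_names = ["lung_side", "platelet_count", "age", "asbestos_exposure_years"]
X = rng.random((324, len(feature_names)))
y = rng.integers(0, 2, 324)            # 1 = mesothelioma diagnosis

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)
rf = RandomForestClassifier(n_estimators=500, random_state=4).fit(X_tr, y_tr)
print("MCC:", round(matthews_corrcoef(y_te, rf.predict(X_te)), 2))
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")        # importance ranking drives selection
```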

Conclusions

Our results show that machine learning can predict the diagnoses of patients with mesothelioma symptoms with high accuracy, sensitivity, and specificity in a few minutes. Additionally, random forest can efficiently select the most important features of this clinical dataset (lung side and platelet count) in a few seconds. The importance of pleural plaques in lung sides and of blood platelets in mesothelioma diagnosis indicates that physicians should focus on these two features when reading the records of patients with mesothelioma symptoms. Moreover, doctors can use our software to predict a patient's diagnosis when only lung side and platelet data are available.

]]>
<![CDATA[Using machine learning and an ensemble of methods to predict kidney transplant survival]]> https://www.researchpad.co/article/5c3fa572d5eed0c484ca46fc

We used an ensemble of statistical methods to build a model that predicts kidney transplant survival and identifies important predictive variables. The proposed model achieved better performance, measured by Harrell’s concordance index, than the Estimated Post Transplant Survival model used in the kidney allocation system in the U.S., and other models published recently in the literature. The model has a five-year concordance index of 0.724 (in comparison, the concordance index is 0.697 for the Estimated Post Transplant Survival model, the state of the art currently in use). It combines predictions from random survival forests with a Cox proportional hazards model. The rankings of importance for the model’s variables differ by transplant recipient age. Better survival predictions could eventually lead to more efficient allocation of kidneys and improve patient outcomes.
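
A minimal sketch of scoring a survival model with Harrell's concordance index via the lifelines package; a plain Cox model stands in here, and the authors' random-survival-forest ensemble is not reproduced. Data and column names are synthetic.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index

rng = np.random.default_rng(5)
df = pd.DataFrame({
    "recipient_age": rng.normal(50, 12, 300),
    "donor_age": rng.normal(40, 14, 300),
    "graft_years": rng.exponential(5, 300),   # observed survival time
    "failed": rng.integers(0, 2, 300),        # 1 = graft failure observed
})

cph = CoxPHFitter().fit(df, duration_col="graft_years", event_col="failed")
risk = cph.predict_partial_hazard(df)         # higher = higher risk
# concordance_index expects higher scores for longer survival, hence -risk
print("C-index:", round(concordance_index(df["graft_years"], -risk, df["failed"]), 3))
```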

]]>
<![CDATA[Poincaré plot analysis of cerebral blood flow signals: Feature extraction and classification methods for apnea detection]]> https://www.researchpad.co/article/5c141ee1d5eed0c484d28a77

Objective

Rheoencephalography is a simple and inexpensive technique for cerebral blood flow assessment; however, it is not used in clinical practice because its correlation with clinical conditions has not yet been extensively demonstrated. The present study investigates the ability of Poincaré Plot descriptors extracted from rheoencephalography signals to detect apneas in volunteers.

Methods

A group of 16 subjects participated in the study. Rheoencephalography data from baseline and apnea periods were recorded and Poincaré Plot descriptors were extracted from the reconstructed attractors with different time lags (τ). Among the set of extracted features, those presenting significant differences between baseline and apnea recordings were used as inputs to four different classifiers to optimize the apnea detection.

Results

Three features showed significant differences between apnea and baseline signals: the Poincaré Plot ratio (SDratio), its correlation (R) and the Complex Correlation Measure (CCM). Those differences were optimized for time lags smaller than those recommended in previous works for other biomedical signals, all of them lower than the threshold established by the position of the inflection point in the CCM curves. The classifier showing the best performance was the classification tree, with 81% accuracy and an area under the receiver operating characteristic curve of 0.927. This performance was obtained using a single input parameter, either SDratio or R.
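
For concreteness, here is a generic sketch of lagged Poincaré plot descriptors (SD1, SD2, their ratio, and the lag correlation R) on a synthetic signal; the exact SDratio orientation and the CCM computation follow the paper and are not reproduced here.

```python
import numpy as np

def poincare_descriptors(x: np.ndarray, lag: int = 1):
    x1, x2 = x[:-lag], x[lag:]
    # rotate 45 degrees: spread across / along the line of identity
    sd1 = np.std((x2 - x1) / np.sqrt(2), ddof=1)
    sd2 = np.std((x2 + x1) / np.sqrt(2), ddof=1)
    r = np.corrcoef(x1, x2)[0, 1]
    return sd1, sd2, sd1 / sd2, r      # (SD1, SD2, SDratio, R); ratio
                                       # orientation is an assumption

rheo = np.sin(np.linspace(0, 40, 2000)) \
       + 0.1 * np.random.default_rng(6).normal(size=2000)
print(poincare_descriptors(rheo, lag=5))
```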

Conclusions

Poincaré Plot features extracted from the attractors of rheoencephalographic signals were able to track cerebral blood flow changes provoked by breath holding. Even though further validation with independent datasets is needed, those results suggest that nonlinear analysis of rheoencephalography might be a useful approach to assess the correlation of cerebral impedance with clinical changes.

]]>
<![CDATA[Biomarkers of erosive arthritis in systemic lupus erythematosus: Application of machine learning models]]> https://www.researchpad.co/article/5c1028cdd5eed0c4842481eb

Objective

Limited evidence is available on biomarkers for recognizing Systemic Lupus Erythematosus (SLE) patients at risk of developing erosive arthritis. Anti-citrullinated peptide antibodies (ACPA) have been widely investigated and identified in up to 50% of cases of X-ray-detected erosive arthritis; conversely, few studies have evaluated anti-carbamylated protein antibodies (anti-CarP). Here, we considered the application of machine learning models to identify factors relevant to the development of ultrasonography (US)-detected erosive damage in a large cohort of SLE patients with joint involvement.

Methods

We enrolled consecutive SLE patients with arthritis/arthralgia. All patients underwent joint (DAS28, STR) and laboratory assessment (detection of ACPA, anti-CarP, Rheumatoid Factor, and SLE-related antibodies). The bone surfaces of the metacarpophalangeal and proximal interphalangeal joints were assessed by US: the presence of erosions was registered as a dichotomous value (0/1), yielding a total score (0–20). For the machine learning analysis, we applied and compared Logistic Regression and Decision Trees in conjunction with the Forward Wrapper feature selection method.
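
A hedged sketch of such a forward wrapper around logistic regression, using scikit-learn's SequentialFeatureSelector on placeholder data; the biomarker names are taken from the text for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
features = ["ACPA", "anti_CarP", "RF", "DAS28", "disease_duration"]
X = rng.random((120, len(features)))
y = rng.integers(0, 2, 120)            # 1 = US-detected erosive damage

# greedily add the feature that most improves cross-validated AUC
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=2, direction="forward",
                                scoring="roc_auc", cv=5).fit(X, y)
print([f for f, kept in zip(features, sfs.get_support()) if kept])
```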

Results

We enrolled 120 SLE patients [M/F 8/112, median age 47.0 years (IQR 15.0); median disease duration 120.0 months (IQR 156.0)], 73.3% of whom reported at least one episode of arthritis. Erosive damage was identified in 25.8% of patients (mean±SD 0.7±1.6), all of them with clinically evident arthritis. We applied Logistic Regression in conjunction with the Forward Wrapper method, obtaining an AUC value of 0.806±0.02. As a result of the learning procedure, we evaluated the relevance of the different factors: it was higher than 35% for ACPA and anti-CarP.

Conclusion

The application of machine learning models allowed us to identify factors associated with US-detected erosive bone damage in a large SLE cohort, along with their relevance in determining this phenotype. Although the scope of this study is limited by the small sample size and its cross-sectional nature, the results suggest the relevance of ACPA and anti-CarP antibodies in the development of erosive damage, as also pointed out in other studies.

]]>
<![CDATA[Estimation of paddy rice leaf area index using machine learning methods based on hyperspectral data from multi-year experiments]]> https://www.researchpad.co/article/5c117b6fd5eed0c484699303

The performance of three machine learning methods (support vector regression, random forests and artificial neural networks) for estimating the leaf area index (LAI) of paddy rice was evaluated in this study. Traditional univariate regression models involving narrowband NDVI with optimized band combinations, as well as linear multivariate calibration partial least squares regression models, were also evaluated for comparison. A four-year field-collected dataset was used to test the robustness of the LAI estimation models against temporal variation. The partial least squares regression and the three machine learning methods were built on the raw hyperspectral reflectance and on the first derivative separately. Two different rules were used to determine the models’ key parameters. The results showed that the combination of red edge and NIR bands (766 nm and 830 nm) as well as the combination of SWIR bands (1114 nm and 1190 nm) were optimal for producing the narrowband NDVI. The models built on the first derivative spectra yielded more accurate results than the corresponding models built on the raw spectra. Properly selected model parameters resulted in accuracy and robustness comparable to the empirically optimal parameters while significantly reducing model complexity. The machine learning methods were more accurate and robust than the VI methods and partial least squares regression. When validating the calibrated models against the standalone validation dataset, the VI method yielded a validation RMSE of 1.17 for NDVI(766,830) and 1.01 for NDVI(1114,1190), while the best models for the partial least squares, support vector machine, random forest and artificial neural network methods yielded validation RMSE values of 0.84, 0.82, 0.67 and 0.84, respectively. The RF models built on the first derivative spectra with mtry = 10 showed the highest potential for estimating the LAI of paddy rice.
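
For clarity, a narrowband NDVI for a band pair takes the usual normalized-difference form, (R_b2 − R_b1)/(R_b2 + R_b1); the sketch below computes it from a hyperspectral reflectance matrix along with first-derivative spectra. The band-order convention and data are assumptions.

```python
import numpy as np

def narrowband_ndvi(refl: np.ndarray, wavelengths: np.ndarray,
                    b1: float, b2: float) -> np.ndarray:
    """NDVI(b1, b2) per sample from a (n_samples, n_bands) reflectance matrix."""
    i1 = np.argmin(np.abs(wavelengths - b1))
    i2 = np.argmin(np.abs(wavelengths - b2))
    return (refl[:, i2] - refl[:, i1]) / (refl[:, i2] + refl[:, i1])

wl = np.arange(350, 2501)                       # 1-nm hyperspectral grid
refl = np.random.default_rng(8).random((50, wl.size))
ndvi = narrowband_ndvi(refl, wl, 766, 830)      # reported red edge/NIR pair
first_deriv = np.gradient(refl, wl, axis=1)     # input for derivative models
```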

]]>
<![CDATA[Mobile detection of autism through machine learning on home video: A development and prospective validation study]]> https://www.researchpad.co/article/5c06f047d5eed0c484c6d5be

Background

The standard approaches to diagnosing autism spectrum disorder (ASD) evaluate between 20 and 100 behaviors and take several hours to complete. This has in part contributed to long wait times for a diagnosis and subsequent delays in access to therapy. We hypothesize that the use of machine learning analysis on home video can speed the diagnosis without compromising accuracy. We have analyzed item-level records from 2 standard diagnostic instruments to construct machine learning classifiers optimized for sparsity, interpretability, and accuracy. In the present study, we prospectively test whether the features from these optimized models can be extracted by blinded nonexpert raters from 3-minute home videos of children with and without ASD to arrive at a rapid and accurate machine learning autism classification.

Methods and findings

We created a mobile web portal for video raters to assess 30 behavioral features (e.g., eye contact, social smile) that are used by 8 independent machine learning models for identifying ASD, each with >94% accuracy in cross-validation testing and subsequent independent validation from previous work. We then collected 116 short home videos of children with autism (mean age = 4 years 10 months, SD = 2 years 3 months) and 46 videos of typically developing children (mean age = 2 years 11 months, SD = 1 year 2 months). Three raters blind to the diagnosis independently measured each of the 30 features from the 8 models, with a median time to completion of 4 minutes. Although several models (alternating decision trees, radial-kernel and linear support vector machines [SVMs], and logistic regression [LR]) performed well, a sparse 5-feature LR classifier (LR5) yielded the highest accuracy (area under the curve [AUC]: 92% [95% CI 88%–97%]) across all ages tested. We used a prospectively collected independent validation set of 66 videos (33 ASD and 33 non-ASD) and 3 independent rater measurements to validate the outcome, achieving lower but comparable accuracy (AUC: 89% [95% CI 81%–95%]). Finally, we applied LR to the 162-video feature matrix to construct an 8-feature model, which achieved 0.93 AUC (95% CI 0.90–0.97) on the held-out test set and 0.86 on the validation set of 66 videos. Validation on children with an existing diagnosis limited the ability to generalize the performance to undiagnosed populations.
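
A hedged sketch of how a sparse few-feature logistic classifier such as LR5 can be obtained with an L1 penalty; the feature matrix and regularization strength are placeholders, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
X = rng.random((162, 30))              # 30 rated behavioral features per video
y = rng.integers(0, 2, 162)            # 1 = ASD

# L1 penalty drives most coefficients to exactly zero; lower C = sparser model
lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(lr.coef_[0])     # features surviving the penalty
print(f"{kept.size} features retained:", kept)
```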

Conclusions

These results support the hypothesis that feature tagging of home videos for machine learning classification of autism can yield accurate outcomes in short time frames, using mobile devices. Further work will be needed to confirm that this approach can accelerate autism diagnosis at scale.

]]>
<![CDATA[Development and validation of machine learning models to identify high-risk surgical patients using automatically curated electronic health record data (Pythia): A retrospective, single-site study]]> https://www.researchpad.co/article/5c06f02dd5eed0c484c6d2c6

Background

Pythia is an automated, clinically curated surgical data pipeline and repository housing all surgical patient electronic health record (EHR) data from a large, quaternary, multisite health institute for data science initiatives. In an effort to better identify high-risk surgical patients from complex data, a machine learning project trained on Pythia was built to predict postoperative complication risk.

Methods and findings

A curated data repository of surgical outcomes was created using automated SQL and R code that extracted and processed patient clinical and surgical data across 37 million clinical encounters from the EHRs. A total of 194 clinical features, including patient demographics (e.g., age, sex, race), smoking status, medications, comorbidities, procedure information, and proxies for surgical complexity, were constructed and aggregated. A cohort of 66,370 patients who had undergone 99,755 invasive procedural encounters between January 1, 2014, and January 31, 2017, was studied further for the purpose of predicting postoperative complications. The average complication and 30-day postoperative mortality rates of this cohort were 16.0% and 0.51%, respectively. Least absolute shrinkage and selection operator (lasso) penalized logistic regression, random forest models, and extreme gradient boosted decision trees were trained on this surgical cohort with cross-validation on 14 specific postoperative outcome groupings. The resulting models had area under the receiver operating characteristic curve (AUC) values ranging between 0.747 and 0.924, calculated on an out-of-sample test set from the last 5 months of data. Lasso penalized regression was identified as a high-performing model, providing clinically interpretable, actionable insights. The highest- and lowest-performing lasso models predicted postoperative shock and genitourinary outcomes with AUCs of 0.924 (95% CI: 0.901, 0.946) and 0.780 (95% CI: 0.752, 0.810), respectively. A calculator requiring input of 9 data fields was created to produce a risk assessment for the 14 groupings of postoperative outcomes. A high-risk threshold (15% risk of any complication) was determined to identify high-risk surgical patients. The model sensitivity was 76%, with a specificity of 76%. Compared to expert-developed heuristics for identifying high-risk patients and to the ACS NSQIP calculator, this tool performed better, providing an improved approach for clinicians to estimate postoperative risk for patients. Limitations of this study include records with missing data that were removed from the analysis.
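
The method behind the reported AUC confidence intervals is not stated in the abstract; one common approach is a nonparametric bootstrap over the out-of-sample test set, sketched here on synthetic predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(10)
y_true = rng.integers(0, 2, 2000)
y_score = np.clip(y_true * 0.3 + rng.random(2000) * 0.7, 0, 1)

boot = []
for _ in range(1000):                             # resample test set with replacement
    idx = rng.integers(0, len(y_true), len(y_true))
    if y_true[idx].min() != y_true[idx].max():    # AUC needs both classes present
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC = {roc_auc_score(y_true, y_score):.3f} (95% CI: {lo:.3f}, {hi:.3f})")
```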

Conclusions

Extracting and curating a large, local institution’s EHR data for machine learning purposes resulted in models with strong predictive performance. These models can be used in clinical settings as decision support tools for identification of high-risk patients as well as patient evaluation and care management. Further work is necessary to evaluate the impact of the Pythia risk calculator within the clinical workflow on postoperative outcomes and to optimize this data flow for future machine learning efforts.

]]>
<![CDATA[Development and validation of a modified quick SOFA scale for risk assessment in sepsis syndrome]]> https://www.researchpad.co/article/5bb530dd40307c24312bb0b5

Sepsis is a severe clinical syndrome with high mortality. The quick Sequential Organ Failure Assessment (qSOFA) score has been proposed for the prediction of fatal outcomes in sepsis syndrome in emergency departments. Given the low predictive performance of the qSOFA score, we propose a modification to the score by adding age. We conducted a multicenter, retrospective cohort study among regional referral centers from various regions of the country. Participating centers contributed data on patients admitted to emergency departments who received a diagnosis of sepsis syndrome. Crude in-hospital mortality was the primary endpoint. A generalized mixed-effects model with random intercepts produced estimates for adverse outcomes. Model-based recursive partitioning demonstrated the effects and thresholds of significant covariates. Scores were internally validated, and the H measure was used to compare their performance. A total of 580 patients from 22 centers were included in the analysis. Stage of sepsis, age, time to antibiotics, and administration of carbapenem for empirical treatment entered the final model. Among these, severe sepsis (OR, 4.40; 95% CI, 2.35–8.21), septic shock (OR, 8.78; 95% CI, 4.37–17.66), age (OR, 1.03; 95% CI, 1.02–1.05) and time to antibiotics (OR, 1.05; 95% CI, 1.01–1.10) were significantly associated with fatal outcomes. A decision tree demonstrated the threshold for age. We modified the quick Sequential Organ Failure Assessment score (mod-qSOFA) by adding age (> 50 years old = one point) and compared this to the conventional score. H-measures for qSOFA and mod-qSOFA were 0.11 and 0.14, respectively, whereas the AUCs of both scores were 0.64. We propose the use of the modified qSOFA score for early risk assessment among sepsis patients for improved triage and management of this fatal syndrome.
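
A minimal sketch of the proposed score, assuming the three standard qSOFA criteria (respiratory rate ≥ 22/min, systolic blood pressure ≤ 100 mmHg, altered mentation), which are general knowledge not restated in the abstract, plus the study's added age criterion:

```python
def mod_qsofa(resp_rate: float, sys_bp: float,
              altered_mentation: bool, age: float) -> int:
    """mod-qSOFA = standard qSOFA (0-3) + one point for age > 50 years."""
    score = 0
    score += resp_rate >= 22          # tachypnea
    score += sys_bp <= 100            # hypotension
    score += altered_mentation        # e.g. GCS < 15
    score += age > 50                 # modification proposed in this study
    return int(score)

print(mod_qsofa(resp_rate=24, sys_bp=95, altered_mentation=False, age=67))  # 3
```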

]]>
<![CDATA[Restricting the nonlinearity parameter in soil greenhouse gas flux calculation for more reliable flux estimates]]> https://www.researchpad.co/article/5b69465f463d7e3867f4ad06

The static chamber approach is often used for greenhouse gas (GHG) flux measurements, whereby the flux is deduced from the increase in species concentration after closing the chamber. Since this increase changes the diffusion gradients between chamber air and soil air, a nonlinear increase is expected. Lateral gas flow and leakages also contribute to nonlinearity. Several models have been suggested to account for this nonlinearity, the most recent being the Hutchinson–Mosier regression model (hmr). However, the practical application of these models is challenging because the researcher needs to decide for each flux whether a nonlinear fit is appropriate or exaggerates the flux estimate due to measurement artifacts. In the latter case, a flux estimate from the linear model is a more robust solution and introduces less arbitrary uncertainty to the data. We present a new, dynamic and reproducible flux calculation scheme, kappa.max, for an improved trade-off between bias and uncertainty (i.e. accuracy and precision). We develop a tool to simulate, visualise and optimise the flux calculation scheme for any specific static N2O chamber measurement system. The decision procedure and visualisation tools are implemented in a package for the R software. Finally, we demonstrate the performance of the applied flux calculation scheme on a measured flux dataset, estimating the actual bias and uncertainty. The kappa.max method effectively improved the decision between linear and nonlinear flux estimates, reducing bias at a minimal cost in uncertainty.
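
As a hedged illustration (the exact hmr parameterization may differ), a common exponential form of the nonlinear chamber model is

\[
C(t) = \varphi + (C_0 - \varphi)\, e^{-\kappa t},
\qquad
f_0 = \left.\frac{dC}{dt}\right|_{t=0} \cdot \frac{V}{A} = \kappa\,(\varphi - C_0)\,\frac{V}{A},
\]

where \(C_0\) is the concentration at chamber closure, \(\varphi\) the asymptotic concentration, \(V/A\) the chamber volume-to-area ratio (with the appropriate unit conversion for an areal flux), and \(\kappa > 0\) the curvature parameter that kappa.max restricts; as \(\kappa \to 0\) the fit approaches the linear model.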

]]>
<![CDATA[Analysis of factors associated with extended recovery time after colonoscopy]]> https://www.researchpad.co/article/5b49cad3463d7e33e4eac063

Background & aims

A common limiting factor in the throughput of gastrointestinal endoscopy units is the availability of space for patients to recover post-procedure. This study sought to identify predictors of abnormally long recovery time after colonoscopy performed with procedural sedation. In clinical research, this type of study would typically be performed using a single regression modeling approach. A goal of this study was to apply various “machine learning” techniques to see whether better prediction could be achieved.

Methods

Procedural data for 31,442 colonoscopies performed on 29,905 adult patients at Massachusetts General Hospital from 2011 to 2015 were analyzed to identify potential predictors of long recovery times. These data included the identities of hospital personnel, and the initial statistical analysis focused on the impact of these personnel on recovery time via multivariate logistic regression. Secondary analyses included more information on patient vitals both to identify secondary predictors and to predict long recoveries using more complex techniques.
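
A hedged sketch of the initial analysis described above: a multivariate logistic regression of a long-recovery indicator on personnel identity, with staff entering as dummy-coded categorical predictors. The data and column names are synthetic, not the hospital's records.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 5000
df = pd.DataFrame({
    "long_recovery": rng.integers(0, 2, n),               # 1 = abnormally long
    "endoscopist": rng.choice([f"E{i}" for i in range(20)], n),
    "recovery_nurse": rng.choice([f"N{i}" for i in range(30)], n),
})

# C() expands each staff member into a dummy variable against a reference level
fit = smf.logit("long_recovery ~ C(endoscopist) + C(recovery_nurse)", df).fit(disp=0)
print(fit.summary().tables[0])
```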

Results

In univariate analysis, the endoscopist, procedure room nurse, recovery room nurse, and surgical technician all showed a statistically significant relationship to long recovery times, with p-values below 0.0001 in all cases. In the multivariate logistic regression, the most significant predictor of a long recovery time was the identity of the recovery room nurse, with the endoscopist also showing a statistically significant but weaker effect. Complex techniques led to a negligible improvement over simple techniques in predicting long recovery periods.

Conclusion

The hospital personnel involved in performing a colonoscopy show a strong association with the likelihood of a patient spending an abnormally long time recovering from the procedure, with the most pronounced effect for the nurse in the recovery room. The application of more advanced approaches to improve prediction in this clinical data set only yielded modest improvements.

]]>