Machine Learned Combination of Ventilatory, Hypoxic, and Arousal Burden Predicts Short- and Long-Term Consequences of OSA better than the AHI across Multiple Large Cohorts
Authors List
Sajila Wickramaratne, Korey Kam, Thomas Tolbert, Andrew W. Varga, Indu Ayappa, David M. Rapoport, Ankit Parekh
Introduction
The current metric of obstructive sleep apnea (OSA) severity, the apnea-hypopnea index (AHI), is only weakly correlated with short-term (daytime sleepiness) and long-term (all-cause mortality) outcomes. Here, we assessed whether a machine-learned combination of possibly independent metrics across the ventilatory, hypoxic, and arousal domains would be better associated with daytime sleepiness and all-cause mortality than the AHI, using data from 3 large cohorts.
Methods
Polysomnography (PSG) data were obtained from the Sleep Heart Health Study (SHHS), the Multi-Ethnic Study of Atherosclerosis (MESA), and the Osteoporotic Fractures in Men (MrOS) Study. A total of N=6618 subjects (60.1% male, 39.9% female; age 68.7±6.6 years) with or without OSA had valid data (Epworth Sleepiness Scale [ESS] and airflow/EEG/SpO2). Ventilatory burden was evaluated from a derived flow signal (the sum of the thoracic and abdominal effort signals) for SHHS, and from the nasal cannula/pressure transducer signal for MrOS and MESA. Hypoxic burden was calculated as the area between the baseline and the SpO2 trace for any episode with ≥3% desaturation. Arousal burden was defined as the manually scored arousal index for daytime sleepiness prediction; for all-cause mortality prediction, it was calculated from the slow-wave activity surrounding K-complexes (∆SWAK) detected during non-REM stage 2 sleep. Sleepiness was coded as present when ESS ≥ 10 and absent otherwise. Death from any cause (including CVD death) was confirmed using multiple follow-up methods and made available through the National Sleep Research Resource. For sleepiness, data were analyzed in two ways: (1) pooling all 3 cohorts and splitting them into training (70%) and test (30%) sets, and (2) using permutations and combinations of the 3 cohorts (70/30 split, e.g., SHHS for training, MESA and MrOS for testing). Only the SHHS cohort was used to train and test the all-cause mortality prediction model. A stacked machine-learned model based on the individual classifiers was also trained on 70% of the data and tested on the remaining 30%. For comparison, a logistic regression model was fit using AHI3A (AHI based on 3% desaturation and/or EEG arousal).
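The hypoxic burden definition above (area between the baseline and the SpO2 trace for episodes with at least 3% desaturation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a fixed baseline value, a uniformly sampled SpO2 trace, and episode boundaries that have already been detected; the function name `hypoxic_burden` and its parameters are hypothetical, and the abstract does not state whether the area is normalized by sleep time.

```python
from typing import Sequence, Tuple


def hypoxic_burden(
    spo2: Sequence[float],
    episodes: Sequence[Tuple[int, int]],
    baseline: float,
    fs_hz: float = 1.0,
    min_desat: float = 3.0,
) -> float:
    """Sum the area between `baseline` and the SpO2 trace, in %*min.

    Assumptions (not from the abstract): `episodes` holds (start, end)
    sample indices with `end` exclusive, the baseline is a single fixed
    value, and the area is approximated with a rectangle rule.
    """
    dt_min = 1.0 / fs_hz / 60.0  # duration of one sample in minutes
    total = 0.0
    for start, end in episodes:
        segment = spo2[start:end]
        if not segment:
            continue
        # Keep only episodes whose nadir is at least `min_desat` below baseline.
        if baseline - min(segment) < min_desat:
            continue
        # Samples above the baseline contribute no area.
        total += sum(max(baseline - s, 0.0) for s in segment) * dt_min
    return total


# Toy 1 Hz trace: the first episode drops 4% and is counted,
# the second drops only 1% and is skipped.
trace = [96, 96, 95, 93, 92, 93, 95, 96, 96, 95, 95, 96]
burden = hypoxic_burden(trace, episodes=[(2, 7), (9, 11)], baseline=96.0)
```

In the toy trace, the first episode contributes 12 %·s (0.2 %·min) and the second contributes nothing because its desaturation is below the 3% threshold.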
Results
Model performance metrics were the area under the receiver operating characteristic curve (AUROC) and accuracy. The logistic regression model with AHI3A classified sleepiness with an AUROC of only 0.51±0.07. The random forest model trained on 70% of all 3 cohorts achieved the highest AUROC, 0.88±0.07 (mean accuracy 85.1±2.13%). In contrast, the permutations and combinations of the 3 datasets yielded an average AUROC of 0.63±0.12 (mean accuracy 76.4±6.57%) for sleepiness prediction. The stacking ensemble model outperformed the individual classifiers. The stacked model classified all-cause mortality with an AUROC of 0.94±0.02 and an accuracy of 87.1±0.01%. In contrast, the logistic regression model with AHI3A (3% desaturation and/or EEG arousal) achieved an AUROC of 0.6±0.08.
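For readers less familiar with the two performance metrics, the following sketch shows how AUROC and accuracy can be computed for a binary sleepiness label (ESS ≥ 10). It is an illustration only, not the authors' evaluation code: the ESS values and model scores are made-up toy numbers, the function names are hypothetical, and the 0.5 decision threshold is an assumption. AUROC is computed here as the fraction of (sleepy, non-sleepy) pairs in which the sleepy subject receives the higher score, with ties counted as one half.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of subjects classified correctly at `threshold` (assumed 0.5)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Toy data (hypothetical): ESS scores and model-predicted probabilities.
ess = [4, 12, 9, 15, 10, 7]
sleepy = [1 if e >= 10 else 0 for e in ess]  # cut-off from the Methods
model_scores = [0.2, 0.8, 0.6, 0.9, 0.4, 0.3]

toy_auroc = auroc(sleepy, model_scores)
toy_accuracy = accuracy(sleepy, model_scores)
chance_auroc = auroc(sleepy, [0.5] * len(sleepy))  # uninformative scores
```

An AUROC of 0.5 corresponds to chance-level ranking, which is the reference point for interpreting the values reported for AHI3A.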
Conclusions
In large cohorts of community-dwelling adults, a machine-learned combination of ventilatory, hypoxic, and arousal burdens classifies daytime sleepiness and all-cause mortality in OSA better than AHI3A.