California Teachers Study (CTS) has been collaborating with Office of Statewide Health Planning and Development (OSHPD) to obtain years of patient information from the admission questionaires to the outcome of the patients. In this data analytic report, the aims are 1. to develop a prediction model for the probability of death with stratified time windows (30, 60, 90, 150, and longer than 150 days) 2. To analyze whether the specific time windows after hospitalization is responsible to the decease outcome of patients 3. to analyze whether certain phenotypes of comorbities are more responsible to the decease outcome of patients than the others 4. to find temporal or spatial trends among the deceased patients, by utilizing regressional models with machine learning (ML) skills.
The analyzed data sources are provided by the California Teachers Study (CTS). Specifically the OSHPD hospitalization records data set with 132538 patient input were used. In this analysis only the records of CTS participants from 2000 to 2015 were used in analysis.
To develop a prediction model for the probability of death with stratified time windows (30, 60, 90, 150, and longer than 150 days), tree-based recursive partitioning classification models were built by ML. Each data set were divided into training set and testing set, which were used to develop the model and to evaluate the performances of model, respectively. Each model’s variable importance were plotted, then mis-classification error or accuracy were assessed with the interpretations of the outcomes.
As for the comorbidities, the publicly known Charlson Comorbidity Index (CCI) was reflected and calculated on dataset provided, filtered for the observations with full diagnoses primary to quinary. Once each observation’s CCI was obtained, logistic calculations with the deceased outcome is completed.
Due to the significance of the spatial information in relation to the mortality rates, Provided zip code information of patients’ residential area and publicly available socioeconomic status were paired to assess the spatial relations with the patients mortality rates.
The available data were stratified based on the length of the stay in hospital, as finding potential patterns and relations between the duration of stay and the decease outcome is the main question that is addressed in this report.
(stay_tier_table_kable)
Staytier Var | Number of Patients | |
---|---|---|
<30 days | 1 | 131393 |
30-59 | 2 | 940 |
60-89 | 3 | 119 |
90-149 | 4 | 51 |
>150 days | 5 | 35 |
The main outcome variable of interest, the deceased status, was visualized by stratification of days stayed at the hospital. The rate of decease increases as the days of stay prolonged.
p = ggplot(datm, aes(x = ind, y = proportion, fill = deceased)) +
geom_col ( position = "fill", colour = "black") +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Dark2")
p + ggtitle("Proportion of Death Outcome Depending on Duration of Stay in Hospital")+
xlab ("Stay Tier") + ylab("Proportion (%)")
Feature selections were manually done to exclude any redundantly available features. For example, if the diagnosis codes are available the description of each code were omitted from the data set as it may raise concerns for confounding. The machine learning feature selection was attempted, due to the allowed memory and server capacity it was failed. After manual selection, total 89 variables were left to proceed.
To develop the best fitting model, either through regression or machine learning, that predicts the probability of death within a certain time window(ex.30 days, 180days, etc) based on patient-specific baseline and hospitalization characteristics.
barplot(t1_varimp[1:10], col = "darkblue", horiz = TRUE, las=2, cex.names=0.5,main = "Var. Importance for Death Among Patients with Stay Less than 30 days")
For the patients who stayed less than 30 days at the hospital, it is apparent that the total amount of hospital charges (total_charges_amt) is directly related to the decease outcome. In addition, the ranked important variables are shown in the plot above.
confusionMatrix(t1_test$deceased,t1_pred) # Accuracy 0.6448 (0.6401, 0.6496)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 9647 9528
## 1 4472 15771
##
## Accuracy : 0.6448
## 95% CI : (0.6401, 0.6496)
## No Information Rate : 0.6418
## P-Value [Acc > NIR] : 0.1065
##
## Kappa : 0.2842
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.6833
## Specificity : 0.6234
## Pos Pred Value : 0.5031
## Neg Pred Value : 0.7791
## Prevalence : 0.3582
## Detection Rate : 0.2447
## Detection Prevalence : 0.4865
## Balanced Accuracy : 0.6533
##
## 'Positive' Class : 0
##
n = nrow(tier2_selected)
ntrain = floor(0.7*n)
ntest = floor(0.3*n)
set.seed(1234)
split = sample(rep(1:2, times=c(ntrain, ntest)))
#split the data
t2_train = tier2_selected[split==1,]
t2_test = tier2_selected[split==2,]
dim(t2_train);dim(t2_test) #658 89 ; 282 89
## [1] 658 89
## [1] 282 89
library(rpart)
t2_rpart = rpart(t2_train$deceased ~ ., data = t2_train, method = "class", control = list(minsplit=20, minbucket = 7, cp=0))
printcp(t2_rpart)
##
## Classification tree:
## rpart(formula = t2_train$deceased ~ ., data = t2_train, method = "class",
## control = list(minsplit = 20, minbucket = 7, cp = 0))
##
## Variables actually used in tree construction:
## [1] age_at_baseline allex_life_hrs bmi_q1 total_charges_amt
##
## Root node error: 136/658 = 0.20669
##
## n= 658
##
## CP nsplit rel error xerror xstd
## 1 0.8235294 0 1.000000 1.0000 0.076375
## 2 0.1029412 1 0.176471 1.0956 0.078941
## 3 0.0073529 2 0.073529 1.1029 0.079127
## 4 0.0000000 4 0.058824 1.1250 0.079678
t2_rpart$cptable[which.min(t2_rpart$cptable[,"xerror"])] #0.8235294
## [1] 0.8235294
t2_treepruned = prune(t2_rpart, cp=0.8235294)
t2_varimp = round(summary(t2_treepruned)$variable.importance,2)
## Call:
## rpart(formula = t2_train$deceased ~ ., data = t2_train, method = "class",
## control = list(minsplit = 20, minbucket = 7, cp = 0))
## n= 658
##
## CP nsplit rel error xerror xstd
## 1 0.8235294 0 1.0000000 1.000000 0.07637529
## 2 0.8235294 1 0.1764706 1.095588 0.07894055
##
## Variable importance
## total_charges_amt allex_life_hrs facility_zip5_cde smoke_lifeexpo
## 44 24 13 8
## study_start_date_q3 diag_ccs_code1
## 6 5
##
## Node number 1: 658 observations, complexity param=0.8235294
## predicted class=1 expected loss=0.2066869 P(node) =1
## class counts: 136 522
## probabilities: 0.207 0.793
## left son=2 (112 obs) right son=3 (546 obs)
## Primary splits:
## total_charges_amt splits as RRRR-RRR-RRRRLRR-RRRRRRLRLRL-RRR-RRRL-RRR-RRL---RRLRRRL-RRLRRRL--R-RLR---RR---R-R-RLR-R-LRR-RRLL-RRRR-RRLRR--RRRRR-L--RLRL--LR-RL--R-LL-LRR--LR-RRRR-L-RRRRL-RLLR-LR-RRR--RRLRRRR--LR---RRRLRL-RRR---RRRRLRRL-RRL--RLRR-R-RRR-R-----RR-RRR-R-RRRRRLR-LRRRL-LRRR-RLRRLL-RR-RRR-RRR-LRRRR--RRRRR-RRRLLRRR-LR-R-L-RRRRRRLRRR----RR--R-R-LR-RL-RRRRR---R-LR--R---RRR--LR-RLRRRRL--RRRLRRRRRRLRR--R-RRRLRRR--RR--R-RRR-RRRRRRRRR-RLRR--RL-RRRRR-RRRRR-RL--R-R-LRR-R-RLR---R--RRRL----R-RRRR---L--RR-RLRRR-RRRRRR-RRLRR-R--RR----RLRL-RRR-RR-RR-R--L-RRRR-RRRR-RR-RRL--RRRLR--R-R--LL-RLL-RR-RRR-R--R-RRLRR-R-RRRRRRRLLRR----R---RRR-R-RRRL--R---RRR-RR-RLLRR-RLRRR-RR--RRLRLRRRR-RRRRLRRRR-RRR-L-RR-R-RRLRRLRLRR-RR-RRRL-RRRRLRRRRL--RR-RLR-R-R-LLRRRRR---LRR----RRRLR-RR--LRR-RR-R-L-R-L-R-RRRR--RRR-LRRLRLLR-RRR-LRRRRRLR--RRRLRRR--LLRR, improve=169.89100, (0 missing)
## allex_life_hrs splits as RLLRRRLR-RRLRLRRRR-RRRRR-RL--RRRRR-RRRRR-RLRRLRRR-RRRLR-LRRLRRR---RRRRRLRRR-LRRRRRRRRRR--RRR-R--RLRRLR-RLRRLRRRRRRRLRR-L---R-LRRRRRLRL-LR-LRRRRLR-RRRRR--LRRRRRR-RLRRRRRRRLRR-RLRR--R-RRLRRRLL-LRR-RRRRLRRLRRRRRRR-LRRRRRRRRRR-RLRLR-RRRRL-R-RRL-R-RRRRLLLL-R-LR-RRR-LLLRLRL-LR-R--RRLR--RRRRLL-RL-L-RLRRRLLL-LLRR-L-RR--RRR--RRRRRLLRRRR-R-R--RRRR-RRRR-LR---RRR-RRR-RLRRRLRRRRRRR-RLLRRRRRLRL-LLR-RLRRRRLRRLRRR-LLL-RR-RRRL-R-RRRRRLR-RRLR--RR-RRLLRRRRRLLRRLRR-L--RLRRR-RRLR--RRRRRLRLR--LRLRRRLRRR--RRRR--RL-RRR, improve=130.70760, (11 missing)
## smoke_lifeexpo splits as RR-RLL-RRRRR-L-R-LR-RRRRRR--RRRRRR-RRRRL-RR-RRLRRRRRRRRLLR-R-RRRRRRR-LRRRRRLLRRRRRLR--LLR-RRLLLRRLRRRRLLL-RLRLLRRLRRLRRRRR-RLRRR--RRLLLL-RRRLR-RRRLLR-RLR-RRRLLRRRRR-LLL-RL-RR-RRL-RRRRRR-R-RRLRRRRLR--RRLRLR-R---LRRRLRRRRRRLRRLRL--RRRRRR-RRRLRL-RRLRR--RRR-RRRRR-RLRLRR----R-RRR-RRLRRRRRRRRRR-L-LRRR-RR-RR--RRRRR--R-RRRR-R-RRLRRRRRRR--RLRRR-RR--RRRRRRRRRRRR-R-RRRRRRLRRRRR-RRLR-R-R-R-R---R, improve= 82.13081, (32 missing)
## facility_zip5_cde splits as --R--RRRLRRRRRRRR-R-RR-RLLRRRLRRRLR-R-RRRRRRRLRLRLRRRRRRRRRLRRRRRRRLRRRRLRR-RRRRRRRRRRRRR--LLLLRRLL-RRRRRLLRRRLRRR-RRRRRLRRRR-LRR-RRRR-LRLLLLR-RRRRL-RLRRRLRRLRRLLR-RLLRRRRRLRRRR-LRLLRRRRRRRR-RRRR-RRRRL-R-RL-R--LLRRRLRRLRRRRLRLRRRR, improve= 57.40662, (0 missing)
## bmi_q1 splits as RRR-L-RRRRRLLRLRRRRLRLRRRRRLRRRRLLRRLLRLLRRRRRLRLR-RRRRRRRRRRRRRRRRR-RLRLRLRRRRRLRRLRRRRRLRLRRL-RRRLRRRL-RRRRLRRRRRRLRRRRLRRRLLLRLRRRRRRRRLRRRRRRLRLRLRRLRLLLLRLRLLLRRRRRRRRR-R-RRRRR---R---R-RR-RRRRRLRRRLRR--RLR, improve= 44.91337, (41 missing)
## Surrogate splits:
## allex_life_hrs splits as RLLRRRRR-RRLRRRRRR-RRRRR-RL--RRRRR-RRRRR-RLRRRRRR-RRRLR-RRRLRRR---RRRRRRRRR-LRRRRRRRRRR--RRR-R--RLRRRR-RLRRRRRRRRRRLRR-L---R-RRRRRRRRR-RR-LRRRRRR-RRRRR--LRRRRRR-RRRRRRRRRLRR-RLRR--R-RRLRRRRL-RRR-RRRRRRRLRRRRRRR-LRRRRRRRRRR-RRRRR-RRRRL-R-RRL-R-RRRRLLRL-R-RR-RRR-LLRRRRL-LR-R--RRRR--RRRRLR-RL-R-RLRRRRLL-RLRR-L-RR--RRR--RRRRRLLRRRR-R-R--RRRR-RRRR-LR---RRR-RRR-RLRRRLRRRRRRR-RRLRRRRRLRL-RLR-RLRRRRLRRLRRR-LLL-RR-RRRL-R-RRRRRLR-RRRR--RR-RRLLRRRRRLRRRLRR-L--RLRRR-RRLR--RRRRRLRLR--LRLRRRLRRR--RRRR--RR-RRR, agree=0.924, adj=0.554, (0 split)
## facility_zip5_cde splits as --R--RRRRRRRRRRRR-R-RR-RLLRRRRRRRRR-R-RRRRRRRRRLRRRRRRRRRRRRRRRRRRRLRRRRLRR-RRRRRRRRRRRRR--RRRLRRRR-RRRRRLLRRRRRRR-RRRRRLRRRR-LRR-RRRR-LRLLLLR-RRRRR-RLRRRLRRLRRRRR-RLLRRRRRRRRRR-RRLLRRRRRRRR-RRRR-RRRRR-R-RR-R--RLRRRLRRRRRRRLRLRRRR, agree=0.881, adj=0.304, (0 split)
## smoke_lifeexpo splits as RR-RLL-RRRRR-L-R-RR-RRRRRR--RRRRRR-RRRRL-RR-RRLRRRRRRRRRLR-R-RRRRRRR-LRRRRRLRRRRRRRR--RLR-RRRRRRRLRRRRLLL-RLRLRRRRRRLRRRRR-RLRRR--RRLRLL-RRRRR-RRRLLR-RLR-RRRRLRRRRR-LLL-RL-RR-RRL-RRRRRR-R-RRRRRRRRR--RRRRLR-R---LRRRRRRRRRRLRRLRR--RRRRRR-RRRRRR-RRLRR--RRR-RRRRR-RLRRRR----R-RRR-RRLRRRRRRRRRR-L-LRRR-RR-RR--RRRRR--R-RRRR-R-RRLRRRRRRR--RRRRR-RR--RRRRRRRRRRRR-R-RRRRRRLRRRRR-RRLR-R-R-R-R---R, agree=0.862, adj=0.187, (0 split)
## study_start_date_q3 splits as LRRRRRRRRRRRRRRRRRRRRRRRRR-RR-RRRRRR---RRRRRRRRRRLRR-RRRLL-LRLRRRRRRRL--RRRRRRRRRRRRLRRRRRR-LRLRRR-RRR--RRRR-RRRRRRRRRRRLRRRRRRRR-LRRLLRRR, agree=0.854, adj=0.143, (0 split)
## diag_ccs_code1 splits as RRRRRRRRRRR-RRRRRRRRRRRRLRRRRRRR-RRRRRRLRRL-RRRRRRRRRLRRRRRRRRRRRRRRLRLRRRRRRRRR-RRLRL--RRRRLRRRRLRRLRRRRRRRRRRRR-RRRRRR, agree=0.848, adj=0.107, (0 split)
##
## Node number 2: 112 observations
## predicted class=0 expected loss=0 P(node) =0.1702128
## class counts: 112 0
## probabilities: 1.000 0.000
##
## Node number 3: 546 observations
## predicted class=1 expected loss=0.04395604 P(node) =0.8297872
## class counts: 24 522
## probabilities: 0.044 0.956
For the outcome of decease for patients who stayed more than 30 days but less than 60 days again shows the most relation to the total amount of charges by hospital. It is followed by the regular hours of work-out, then zipcode of the hospital.
## [1] 84 89
## [1] 35 89
##
## Classification tree:
## rpart(formula = t3_train$deceased ~ ., data = t3_train, method = "class",
## control = list(minsplit = 20, minbucket = 7, cp = 0))
##
## Variables actually used in tree construction:
## [1] allex_life_hrs
##
## Root node error: 19/84 = 0.22619
##
## n= 84
##
## CP nsplit rel error xerror xstd
## 1 0.94737 0 1.000000 1.0000 0.20181
## 2 0.00000 1 0.052632 1.2632 0.21792
## [1] 0.9473684
## Call:
## rpart(formula = t3_train$deceased ~ ., data = t3_train, method = "class",
## control = list(minsplit = 20, minbucket = 7, cp = 0))
## n= 84
##
## CP nsplit rel error xerror xstd
## 1 0.9473684 0 1.00000000 1.000000 0.2018089
## 2 0.0000000 1 0.05263158 1.263158 0.2179154
##
## Variable importance
## allex_life_hrs facility_zip5_cde age_at_baseline study_start_date_q3
## 28 17 15 14
## weight_q1 bmi_q1
## 14 12
##
## Node number 1: 84 observations, complexity param=0.9473684
## predicted class=1 expected loss=0.2261905 P(node) =1
## class counts: 19 65
## probabilities: 0.226 0.774
## left son=2 (18 obs) right son=3 (66 obs)
## Primary splits:
## allex_life_hrs splits as RR-RR-RRRRRRRRRRR-L-R-LLRLRRLLRRRRLLRLR--LRRR-R--R-L-RR---R--RRRLR--LRR--R-RRLR-RRL--RLRRLR-RRRRL-R-, improve=27.43506, (0 missing)
## total_charges_amt splits as RL--RRRLRRRRL--R---RR-RR-R---R-R-RLRRLLRRRRL-RLL-LRRR-L-RRRR-RLR-R-R-R-RRL-RLRRRRRR---RLRRRRL--RRLRRRRRRL-R, improve=27.43506, (0 missing)
## facility_zip5_cde splits as RRLRRRRR-RRLRR-R-R-RLLLR-R-RL-LLR-RRRRRRLLLRRRRR-RL--R-LR-RR-RRRRRR-RRR-RRR-LRR, improve=16.88961, (0 missing)
## bmi_q1 splits as -L-R-RLRRLL-RLRRRRRRRR-LRRRRRRRL-RRRLRLR--RRRRRRRLL-RRRRRRRRRRRR-RRLL--RL-RL, improve=16.59436, (3 missing)
## age_at_baseline splits as LLL-LLRLRLRLL-R-RLLRLRLRRRLRRRLLRRRRLRRRRR-RR-R-, improve=16.36232, (0 missing)
## Surrogate splits:
## facility_zip5_cde splits as RRLRRRRR-RRLRR-R-R-RLLLR-R-RL-LRR-RRRRRRLLLRRRRR-RR--R-RR-RR-RRRRRR-RRR-RRR-RRR, agree=0.917, adj=0.611, (0 split)
## age_at_baseline splits as LLL-RLRLRLRLL-R-RRLRRRRRRRRRRRRLRRRRRRRRRR-RR-R-, agree=0.905, adj=0.556, (0 split)
## weight_q1 splits as LLRLRLRR-RRRRR--RRRRRRRRLRLRRRR-RRRRRRRRRRRRRRRRRRRLL---RRL, agree=0.893, adj=0.500, (0 split)
## study_start_date_q3 splits as LRLRLLRRRRRRRRRRLRRRR-LRRRRLRRRR-RRL-RRRR---LR-, agree=0.893, adj=0.500, (0 split)
## bmi_q1 splits as -L-R-RLRRLL-RRRRRRRRRR-LRRRRRRRL-RRRLRRR--RRRRRRRRR-RRRRRRRRRRRR-RRLL--RL-RL, agree=0.881, adj=0.444, (0 split)
##
## Node number 2: 18 observations
## predicted class=0 expected loss=0 P(node) =0.2142857
## class counts: 18 0
## probabilities: 1.000 0.000
##
## Node number 3: 66 observations
## predicted class=1 expected loss=0.01515152 P(node) =0.7857143
## class counts: 1 65
## probabilities: 0.015 0.985
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 7
## 1 4 23
##
## Accuracy : 0.6857
## 95% CI : (0.5071, 0.8315)
## No Information Rate : 0.8571
## P-Value [Acc > NIR] : 0.9976
##
## Kappa : -0.0267
##
## Mcnemar's Test P-Value : 0.5465
##
## Sensitivity : 0.20000
## Specificity : 0.76667
## Pos Pred Value : 0.12500
## Neg Pred Value : 0.85185
## Prevalence : 0.14286
## Detection Rate : 0.02857
## Detection Prevalence : 0.22857
## Balanced Accuracy : 0.48333
##
## 'Positive' Class : 0
##
From the length of stay longer than 60 days, the total hospital charges disappear completely. The amount of workout in hours dictated the decease outcome, which was followed by the facility zip code again.
## [1] 36 89
## [1] 15 89
##
## Classification tree:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class",
## control = list(minsplit = 5, minbucket = 2, cp = 0))
##
## Variables actually used in tree construction:
## [1] facility_zip5_cde
##
## Root node error: 3/36 = 0.083333
##
## n= 36
##
## CP nsplit rel error xerror xstd
## 1 1 0 1 1 0.55277
## 2 0 1 0 1 0.55277
## [1] 1
## Call:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class",
## control = list(minsplit = 5, minbucket = 2, cp = 0))
## n= 36
##
## CP nsplit rel error xerror xstd
## 1 1 0 1 1 0.5527708
## 2 0 1 0 1 0.5527708
##
## Variable importance
## facility_zip5_cde age_at_baseline age_dad_atbirth height_q1
## 38 12 12 12
## participant_race weight_q1
## 12 12
##
## Node number 1: 36 observations, complexity param=1
## predicted class=1 expected loss=0.08333333 P(node) =1
## class counts: 3 33
## probabilities: 0.083 0.917
## left son=2 (3 obs) right son=3 (33 obs)
## Primary splits:
## facility_zip5_cde splits as R-RR-RR-R-LRRRRR--R-RRR-RRR-RRLRRRRR-LRRRR, improve=5.500000, (0 missing)
## bmi_q1 splits as R-RR-RRRR-R-R-RR-RLRRL-RRR-RRRRRRRRRLR-, improve=4.000000, (0 missing)
## diag_ccs_code1 splits as RR-RRRRR-LRRRRL-R-RRRLRRR-, improve=4.000000, (0 missing)
## allex_hrs_q1 splits as R-RRRLRRRRRRRR-RRRRLR-R, improve=3.558824, (0 missing)
## allex_life_hrs splits as RRR--RRR-RRRRR-RRRRRRRRRR-R-RRR--R-RRRR-LL, improve=3.558824, (0 missing)
## Surrogate splits:
## age_at_baseline splits as R--RLRRRRRR--RRR-RRRRR-RRRRRRRR-, agree=0.944, adj=0.333, (0 split)
## participant_race splits as RRRRLR, agree=0.944, adj=0.333, (0 split)
## age_dad_atbirth splits as RRRR-L, agree=0.944, adj=0.333, (0 split)
## height_q1 splits as RRRRRRRRRRRL, agree=0.944, adj=0.333, (0 split)
## weight_q1 splits as -RR-RR-RRRRRRR-RRRLRRRRRRRRRRRR-, agree=0.944, adj=0.333, (0 split)
##
## Node number 2: 3 observations
## predicted class=0 expected loss=0 P(node) =0.08333333
## class counts: 3 0
## probabilities: 1.000 0.000
##
## Node number 3: 33 observations
## predicted class=1 expected loss=0 P(node) =0.9166667
## class counts: 0 33
## probabilities: 0.000 1.000
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 0 3
## 1 0 12
##
## Accuracy : 0.8
## 95% CI : (0.5191, 0.9567)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : NA
## Specificity : 0.8
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.0
## Detection Rate : 0.0
## Detection Prevalence : 0.2
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
## Call:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class",
## control = list(minsplit = 5, minbucket = 2, cp = 0))
## n= 36
##
## CP nsplit rel error xerror xstd
## 1 1 0 1 1 0.5527708
## 2 0 1 0 1 0.5527708
##
## Variable importance
## facility_zip5_cde age_at_baseline age_dad_atbirth height_q1
## 38 12 12 12
## participant_race weight_q1
## 12 12
##
## Node number 1: 36 observations, complexity param=1
## predicted class=1 expected loss=0.08333333 P(node) =1
## class counts: 3 33
## probabilities: 0.083 0.917
## left son=2 (3 obs) right son=3 (33 obs)
## Primary splits:
## facility_zip5_cde splits as R-RR-RR-R-LRRRRR--R-RRR-RRR-RRLRRRRR-LRRRR, improve=5.500000, (0 missing)
## bmi_q1 splits as R-RR-RRRR-R-R-RR-RLRRL-RRR-RRRRRRRRRLR-, improve=4.000000, (0 missing)
## diag_ccs_code1 splits as RR-RRRRR-LRRRRL-R-RRRLRRR-, improve=4.000000, (0 missing)
## allex_hrs_q1 splits as R-RRRLRRRRRRRR-RRRRLR-R, improve=3.558824, (0 missing)
## allex_life_hrs splits as RRR--RRR-RRRRR-RRRRRRRRRR-R-RRR--R-RRRR-LL, improve=3.558824, (0 missing)
## Surrogate splits:
## age_at_baseline splits as R--RLRRRRRR--RRR-RRRRR-RRRRRRRR-, agree=0.944, adj=0.333, (0 split)
## participant_race splits as RRRRLR, agree=0.944, adj=0.333, (0 split)
## age_dad_atbirth splits as RRRR-L, agree=0.944, adj=0.333, (0 split)
## height_q1 splits as RRRRRRRRRRRL, agree=0.944, adj=0.333, (0 split)
## weight_q1 splits as -RR-RR-RRRRRRR-RRRLRRRRRRRRRRRR-, agree=0.944, adj=0.333, (0 split)
##
## Node number 2: 3 observations
## predicted class=0 expected loss=0 P(node) =0.08333333
## class counts: 3 0
## probabilities: 1.000 0.000
##
## Node number 3: 33 observations
## predicted class=1 expected loss=0 P(node) =0.9166667
## class counts: 0 33
## probabilities: 0.000 1.000
It is interesting to observe that the facility zip code is found to be the most important factor for predicting the death of patients who stayed in the hospital more than 90 days.
n = nrow(tier5_select4ed)
ntrain = floor(0.7*n)
ntest = floor(0.3*n)
set.seed(1234)
split = sample(rep(1:2, times=c(ntrain, ntest)))
#split the data
t5_train = tier5_selected[split==1,]
t5_test = tier5_selected[split==2,]
dim(t5_train);dim(t5_test) #24 89 ; 11 89
length(t5_train$deceased == 1) #24 : so all data are deceased. cannot divide!
length(tier5_selected$deceased == 1) #35
length(tier5_selected$deceased) #35
#t5_rpart = rpart(t5_train$deceased ~ ., data = t5_train, method = "class", control = list(minsplit=1, minbucket = 1, cp=0.001))
#printcp(t5_rpart)
The data for tier 5 patients who stayed longer than 150 days in hospital were not able to be analyzed as all outcomes were identified as deceased.
From the models developed and the performances from each tier are summarized below:
It was attempted to develop another prediction model to compare the model performances. However, the imputation of missing data were too extensive for the allocated memories and even with the complete cases for patients, the codes could not be completed running. At this point, the analysis proceeded with the recursive partition regression model developed above.
Elastic net regression iteration was successfully completed with the tier 1, with cv AUC = 0.6736, and misclassification error with 0.4975.
Elastic net regression iteration was succesfully completed with tier 2, with cv AUC = 0.8953. ROC curves were plotted to visualize the AUC, but it was omitted to evaluate in this report due to the extended iteration time(>3 hrs).
Elastic net regression was successfully completed with tier 3, with misclassification error or 0.2368.
In order to understand the effect of comorbidities on deceased rate, a new variable with presence of secondary, tertiary, quarternary and quinary ccs codes was created.
Spearman’s rank correlation coefficient as known as Spearman’s rho statistics is used to estimate a rank-based measure association, for the data that does not have bivariate normal distribution.
new feature creation with comorbidities, entire dataset
## diag_ccs1
## 1 Acute cerebrovascular disease
## 2 Cardiac dysrhythmias
## 3 Congestive heart failure; nonhypertensive
## 4 Fracture of neck of femur (hip)
## 5 Osteoarthritis
## 6 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 7 Rehabilitation care; fitting of prostheses; and adjustment of devices
## 8 Septicemia (except in labor)
## 9 Undefined
## 10 Urinary tract infections
## counts
## 1 2389
## 2 2501
## 3 3071
## 4 2264
## 5 2751
## 6 2907
## 7 2547
## 8 3286
## 9 1875
## 10 1975
then, make new var for comorbidities for best 3 CCS code (primary diagnosis during that hospital visits, which collapses ICD codes into the broader group).
## diag_ccs2
## 1 Acute and unspecified renal failure
## 2 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 3 Respiratory failure; insufficiency; arrest (adult)
## 4 Shock
## 5 Urinary tract infections
## counts
## 1 257
## 2 508
## 3 271
## 4 301
## 5 406
## Selecting by counts
## diag_ccs2
## 1 Cardiac dysrhythmias
## 2 Heart valve disorders
## 3 Peri-; endo-; and myocarditis; cardiomyopathy (except that caused by tuberculosis or sexually transm
## 4 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 5 Respiratory failure; insufficiency; arrest (adult)
## counts
## 1 408
## 2 195
## 3 186
## 4 247
## 5 215
## Selecting by counts
## diag_ccs2 counts
## 1 Cardiac dysrhythmias 170
## 2 Chronic obstructive pulmonary disease and bronchiectasis 298
## 3 Congestive heart failure; nonhypertensive 299
## 4 Fluid and electrolyte disorders 187
## 5 Respiratory failure; insufficiency; arrest (adult) 222
The Charlson Comorbidity Index estimates the survival in patients with multiple comorbidities which consists of different items. Following the publicly available Charlson Comorbidity Index categories, each diagnosis was scored.
id | deceased | diag_ccs_code1 | s1 | diag_ccs_code2 | s2 | diag_ccs_code3 | s3 | diag_ccs_code4 | s4 | diag_ccs_code5 | s5 | com.score |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 109 | 1 | 82 | 0 | 108 | 1 | 95 | 0 | 211 | 0 | 2 |
2 | 0 | 114 | 1 | 155 | 0 | 144 | 0 | 47 | 0 | 146 | 0 | 1 |
3 | 1 | 108 | 1 | 127 | 1 | 131 | 0 | 98 | 0 | 128 | 0 | 2 |
4 | 0 | 109 | 1 | 95 | 0 | 108 | 1 | 159 | 0 | 2 | 0 | 2 |
5 | 1 | 109 | 1 | 108 | 1 | 106 | 0 | 653 | 1 | 98 | 0 | 3 |
6 | 1 | 50 | 2 | 0 | 0 | 62 | 0 | 0 | 0 | 42 | 2 | 4 |
##
## Pearson's product-moment correlation
##
## data: com.o$deceased and com.o$com.score
## t = 12.56, df = 13706, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.09009436 0.12319444
## sample estimates:
## cor
## 0.106674
##
## Call:
## glm(formula = com.o$deceased ~ com.o$com.score, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6738 0.5063 0.6875 0.7581 0.7581
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.87737 0.04116 21.32 <2e-16 ***
## com.o$com.score 0.22241 0.01799 12.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13998 on 13707 degrees of freedom
## Residual deviance: 13820 on 13706 degrees of freedom
## AIC: 13824
##
## Number of Fisher Scoring iterations: 4
## (Intercept) com.o$com.score
## 2.404571 1.249087
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.4046 2.2179 2.6062
## com.o$com.score 1.2491 1.2063 1.2945
From the univariate analysis with deceased outcome with the COH Charlson comorbidities, per 1 unit of increase in comorbidity index is associated with deceased outcome at the odds of 1.2490.
logit(1) = (0.87737 + 0.22241 * comorbidity score of 1) = OR of 3.00 logit(5) = e^(.87737+0.22241 * comorbidity score of 5) = OR of 7.311
Due to the significance of the spatial information in relation to the mortality rates, Provided zip code information of patients’ residential area and publicly available socioeconomic status were paired to assess the spatial relations with the patients mortality rates.
## Warning: Removed 35 rows containing missing values (geom_point).
## Warning: Removed 35 rows containing missing values (geom_point).
## Warning: Removed 14 rows containing missing values (geom_point).
## Warning: Removed 5 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_point).
Patients information by COH from 2000 to 2015 were analyzed to predict and to define any patterns for the deceased outcome. Based on the stratified length of stay since admission to the hospital to the outcome, recursive partitioning regression methods were used to develop prediction models per each tier with training set (70%), then each model’s performances were evaluated with the testing set(30%). Although it was not feasible in this particular setting, it is highly desired for different types of machine learning models to be developed then compared in terms of the performances, to pick out the best models.
The tiers for length of hospital stay were divided as 30, 60, 90, 150, and 150+ days. For the first tier and second tier, the total amount of the hospital charges were dominating to predict the death of the patients. It may be related to the urgency of the medical care needs which followed by the multiple costly testings, or may be related to the government supports period for patients care. The insurance tier was also included in the variables, but it was not singled out for a significant variable to be considered to predict the death of patients.
From tier 1 through tier 4, the hospital zip code was prevalently appeared as one of the most important variables to predict the death outcome of patients. The longer the length of the stay, the higher it ranked as an important factor. To define further about this, the spatial map with the death rate of the patients relative to the socioeconomic status for neighborhood were built for each tier. The increasing death rate as the tier increases was apparent, and death outcome seem to be more prevalent in large cities such as Los Angeles and San Francisco. The relation to the socioeconomic status was not too apparent based on the spatial map created, and further analysis will be required to define more details.
The comorbidities were assessed from the publicly known Charlson Comorbidity Index (CCI), with fltered data sets for the observations with full diagnoses primary to quinary. Once each observation’s CCI was obtained, logistic calculations with the deceased outcome is completed. From the univariate analysis with deceased outcome with the COH Charlson comorbidities, per 1 unit of increase in comorbidity index is associated with deceased outcome at the odds of 1.2490.
The future direction of the report would be analyzing not just spatial but also temporal assessments in predicting death outcome of patients. Also, finding out the effect of co-morbities towards the patients’ death will be a great source to further understandings.