Introduction / Research Question / Problem Statement

California Teachers Study (CTS) has been collaborating with Office of Statewide Health Planning and Development (OSHPD) to obtain years of patient information from the admission questionaires to the outcome of the patients. In this data analytic report, the aims are 1. to develop a prediction model for the probability of death with stratified time windows (30, 60, 90, 150, and longer than 150 days) 2. To analyze whether the specific time windows after hospitalization is responsible to the decease outcome of patients 3. to analyze whether certain phenotypes of comorbities are more responsible to the decease outcome of patients than the others 4. to find temporal or spatial trends among the deceased patients, by utilizing regressional models with machine learning (ML) skills.


Data collection / measurements

The analyzed data sources are provided by the California Teachers Study (CTS). Specifically the OSHPD hospitalization records data set with 132538 patient input were used. In this analysis only the records of CTS participants from 2000 to 2015 were used in analysis.


Methods

To develop a prediction model for the probability of death with stratified time windows (30, 60, 90, 150, and longer than 150 days), tree-based recursive partitioning classification models were built by ML. Each data set were divided into training set and testing set, which were used to develop the model and to evaluate the performances of model, respectively. Each model’s variable importance were plotted, then mis-classification error or accuracy were assessed with the interpretations of the outcomes.

As for the comorbidities, the publicly known Charlson Comorbidity Index (CCI) was reflected and calculated on dataset provided, filtered for the observations with full diagnoses primary to quinary. Once each observation’s CCI was obtained, logistic calculations with the deceased outcome is completed.

Due to the significance of the spatial information in relation to the mortality rates, Provided zip code information of patients’ residential area and publicly available socioeconomic status were paired to assess the spatial relations with the patients mortality rates.


Loading the data / EDA / Data Clean-up and Optimization

The available data were stratified based on the length of the stay in hospital, as finding potential patterns and relations between the duration of stay and the decease outcome is the main question that is addressed in this report.

(stay_tier_table_kable)
Cases per Days Stayed
Staytier Var Number of Patients
<30 days 1 131393
30-59 2 940
60-89 3 119
90-149 4 51
>150 days 5 35

The main outcome variable of interest, the deceased status, was visualized by stratification of days stayed at the hospital. The rate of decease increases as the days of stay prolonged.

p = ggplot(datm, aes(x = ind, y = proportion, fill = deceased)) + 
  geom_col ( position = "fill", colour = "black") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Dark2") 

p + ggtitle("Proportion of Death Outcome Depending on Duration of Stay in Hospital")+
  xlab ("Stay Tier") + ylab("Proportion (%)") 

Feature selections were manually done to exclude any redundantly available features. For example, if the diagnosis codes are available the description of each code were omitted from the data set as it may raise concerns for confounding. The machine learning feature selection was attempted, due to the allowed memory and server capacity it was failed. After manual selection, total 89 variables were left to proceed.


Model Development and Performance Metrics

To develop the best fitting model, either through regression or machine learning, that predicts the probability of death within a certain time window(ex.30 days, 180days, etc) based on patient-specific baseline and hospitalization characteristics.


Results | Variable Importances

1 : < 30 days

barplot(t1_varimp[1:10], col = "darkblue", horiz = TRUE, las=2, cex.names=0.5,main = "Var. Importance for Death Among Patients with Stay Less than 30 days")

For the patients who stayed less than 30 days at the hospital, it is apparent that the total amount of hospital charges (total_charges_amt) is directly related to the decease outcome. In addition, the ranked important variables are shown in the plot above.

confusionMatrix(t1_test$deceased,t1_pred) # Accuracy 0.6448 (0.6401, 0.6496)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0  9647  9528
##          1  4472 15771
##                                           
##                Accuracy : 0.6448          
##                  95% CI : (0.6401, 0.6496)
##     No Information Rate : 0.6418          
##     P-Value [Acc > NIR] : 0.1065          
##                                           
##                   Kappa : 0.2842          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6833          
##             Specificity : 0.6234          
##          Pos Pred Value : 0.5031          
##          Neg Pred Value : 0.7791          
##              Prevalence : 0.3582          
##          Detection Rate : 0.2447          
##    Detection Prevalence : 0.4865          
##       Balanced Accuracy : 0.6533          
##                                           
##        'Positive' Class : 0               
## 

2 : 30 - 59 days

n = nrow(tier2_selected)
ntrain = floor(0.7*n)
ntest = floor(0.3*n)

set.seed(1234)
split = sample(rep(1:2, times=c(ntrain, ntest)))

#split the data 
t2_train = tier2_selected[split==1,] 
t2_test = tier2_selected[split==2,] 

dim(t2_train);dim(t2_test) #658 89 ; 282 89
## [1] 658  89
## [1] 282  89
library(rpart)
t2_rpart = rpart(t2_train$deceased ~ ., data = t2_train, method = "class", control = list(minsplit=20, minbucket = 7, cp=0))
printcp(t2_rpart)
## 
## Classification tree:
## rpart(formula = t2_train$deceased ~ ., data = t2_train, method = "class", 
##     control = list(minsplit = 20, minbucket = 7, cp = 0))
## 
## Variables actually used in tree construction:
## [1] age_at_baseline   allex_life_hrs    bmi_q1            total_charges_amt
## 
## Root node error: 136/658 = 0.20669
## 
## n= 658 
## 
##          CP nsplit rel error xerror     xstd
## 1 0.8235294      0  1.000000 1.0000 0.076375
## 2 0.1029412      1  0.176471 1.0956 0.078941
## 3 0.0073529      2  0.073529 1.1029 0.079127
## 4 0.0000000      4  0.058824 1.1250 0.079678
t2_rpart$cptable[which.min(t2_rpart$cptable[,"xerror"])] #0.8235294
## [1] 0.8235294
t2_treepruned = prune(t2_rpart, cp=0.8235294)
t2_varimp = round(summary(t2_treepruned)$variable.importance,2)
## Call:
## rpart(formula = t2_train$deceased ~ ., data = t2_train, method = "class", 
##     control = list(minsplit = 20, minbucket = 7, cp = 0))
##   n= 658 
## 
##          CP nsplit rel error   xerror       xstd
## 1 0.8235294      0 1.0000000 1.000000 0.07637529
## 2 0.8235294      1 0.1764706 1.095588 0.07894055
## 
## Variable importance
##   total_charges_amt      allex_life_hrs   facility_zip5_cde      smoke_lifeexpo 
##                  44                  24                  13                   8 
## study_start_date_q3      diag_ccs_code1 
##                   6                   5 
## 
## Node number 1: 658 observations,    complexity param=0.8235294
##   predicted class=1  expected loss=0.2066869  P(node) =1
##     class counts:   136   522
##    probabilities: 0.207 0.793 
##   left son=2 (112 obs) right son=3 (546 obs)
##   Primary splits:
##       total_charges_amt splits as  RRRR-RRR-RRRRLRR-RRRRRRLRLRL-RRR-RRRL-RRR-RRL---RRLRRRL-RRLRRRL--R-RLR---RR---R-R-RLR-R-LRR-RRLL-RRRR-RRLRR--RRRRR-L--RLRL--LR-RL--R-LL-LRR--LR-RRRR-L-RRRRL-RLLR-LR-RRR--RRLRRRR--LR---RRRLRL-RRR---RRRRLRRL-RRL--RLRR-R-RRR-R-----RR-RRR-R-RRRRRLR-LRRRL-LRRR-RLRRLL-RR-RRR-RRR-LRRRR--RRRRR-RRRLLRRR-LR-R-L-RRRRRRLRRR----RR--R-R-LR-RL-RRRRR---R-LR--R---RRR--LR-RLRRRRL--RRRLRRRRRRLRR--R-RRRLRRR--RR--R-RRR-RRRRRRRRR-RLRR--RL-RRRRR-RRRRR-RL--R-R-LRR-R-RLR---R--RRRL----R-RRRR---L--RR-RLRRR-RRRRRR-RRLRR-R--RR----RLRL-RRR-RR-RR-R--L-RRRR-RRRR-RR-RRL--RRRLR--R-R--LL-RLL-RR-RRR-R--R-RRLRR-R-RRRRRRRLLRR----R---RRR-R-RRRL--R---RRR-RR-RLLRR-RLRRR-RR--RRLRLRRRR-RRRRLRRRR-RRR-L-RR-R-RRLRRLRLRR-RR-RRRL-RRRRLRRRRL--RR-RLR-R-R-LLRRRRR---LRR----RRRLR-RR--LRR-RR-R-L-R-L-R-RRRR--RRR-LRRLRLLR-RRR-LRRRRRLR--RRRLRRR--LLRR, improve=169.89100, (0 missing)
##       allex_life_hrs    splits as  RLLRRRLR-RRLRLRRRR-RRRRR-RL--RRRRR-RRRRR-RLRRLRRR-RRRLR-LRRLRRR---RRRRRLRRR-LRRRRRRRRRR--RRR-R--RLRRLR-RLRRLRRRRRRRLRR-L---R-LRRRRRLRL-LR-LRRRRLR-RRRRR--LRRRRRR-RLRRRRRRRLRR-RLRR--R-RRLRRRLL-LRR-RRRRLRRLRRRRRRR-LRRRRRRRRRR-RLRLR-RRRRL-R-RRL-R-RRRRLLLL-R-LR-RRR-LLLRLRL-LR-R--RRLR--RRRRLL-RL-L-RLRRRLLL-LLRR-L-RR--RRR--RRRRRLLRRRR-R-R--RRRR-RRRR-LR---RRR-RRR-RLRRRLRRRRRRR-RLLRRRRRLRL-LLR-RLRRRRLRRLRRR-LLL-RR-RRRL-R-RRRRRLR-RRLR--RR-RRLLRRRRRLLRRLRR-L--RLRRR-RRLR--RRRRRLRLR--LRLRRRLRRR--RRRR--RL-RRR, improve=130.70760, (11 missing)
##       smoke_lifeexpo    splits as  RR-RLL-RRRRR-L-R-LR-RRRRRR--RRRRRR-RRRRL-RR-RRLRRRRRRRRLLR-R-RRRRRRR-LRRRRRLLRRRRRLR--LLR-RRLLLRRLRRRRLLL-RLRLLRRLRRLRRRRR-RLRRR--RRLLLL-RRRLR-RRRLLR-RLR-RRRLLRRRRR-LLL-RL-RR-RRL-RRRRRR-R-RRLRRRRLR--RRLRLR-R---LRRRLRRRRRRLRRLRL--RRRRRR-RRRLRL-RRLRR--RRR-RRRRR-RLRLRR----R-RRR-RRLRRRRRRRRRR-L-LRRR-RR-RR--RRRRR--R-RRRR-R-RRLRRRRRRR--RLRRR-RR--RRRRRRRRRRRR-R-RRRRRRLRRRRR-RRLR-R-R-R-R---R, improve= 82.13081, (32 missing)
##       facility_zip5_cde splits as  --R--RRRLRRRRRRRR-R-RR-RLLRRRLRRRLR-R-RRRRRRRLRLRLRRRRRRRRRLRRRRRRRLRRRRLRR-RRRRRRRRRRRRR--LLLLRRLL-RRRRRLLRRRLRRR-RRRRRLRRRR-LRR-RRRR-LRLLLLR-RRRRL-RLRRRLRRLRRLLR-RLLRRRRRLRRRR-LRLLRRRRRRRR-RRRR-RRRRL-R-RL-R--LLRRRLRRLRRRRLRLRRRR, improve= 57.40662, (0 missing)
##       bmi_q1            splits as  RRR-L-RRRRRLLRLRRRRLRLRRRRRLRRRRLLRRLLRLLRRRRRLRLR-RRRRRRRRRRRRRRRRR-RLRLRLRRRRRLRRLRRRRRLRLRRL-RRRLRRRL-RRRRLRRRRRRLRRRRLRRRLLLRLRRRRRRRRLRRRRRRLRLRLRRLRLLLLRLRLLLRRRRRRRRR-R-RRRRR---R---R-RR-RRRRRLRRRLRR--RLR, improve= 44.91337, (41 missing)
##   Surrogate splits:
##       allex_life_hrs      splits as  RLLRRRRR-RRLRRRRRR-RRRRR-RL--RRRRR-RRRRR-RLRRRRRR-RRRLR-RRRLRRR---RRRRRRRRR-LRRRRRRRRRR--RRR-R--RLRRRR-RLRRRRRRRRRRLRR-L---R-RRRRRRRRR-RR-LRRRRRR-RRRRR--LRRRRRR-RRRRRRRRRLRR-RLRR--R-RRLRRRRL-RRR-RRRRRRRLRRRRRRR-LRRRRRRRRRR-RRRRR-RRRRL-R-RRL-R-RRRRLLRL-R-RR-RRR-LLRRRRL-LR-R--RRRR--RRRRLR-RL-R-RLRRRRLL-RLRR-L-RR--RRR--RRRRRLLRRRR-R-R--RRRR-RRRR-LR---RRR-RRR-RLRRRLRRRRRRR-RRLRRRRRLRL-RLR-RLRRRRLRRLRRR-LLL-RR-RRRL-R-RRRRRLR-RRRR--RR-RRLLRRRRRLRRRLRR-L--RLRRR-RRLR--RRRRRLRLR--LRLRRRLRRR--RRRR--RR-RRR, agree=0.924, adj=0.554, (0 split)
##       facility_zip5_cde   splits as  --R--RRRRRRRRRRRR-R-RR-RLLRRRRRRRRR-R-RRRRRRRRRLRRRRRRRRRRRRRRRRRRRLRRRRLRR-RRRRRRRRRRRRR--RRRLRRRR-RRRRRLLRRRRRRR-RRRRRLRRRR-LRR-RRRR-LRLLLLR-RRRRR-RLRRRLRRLRRRRR-RLLRRRRRRRRRR-RRLLRRRRRRRR-RRRR-RRRRR-R-RR-R--RLRRRLRRRRRRRLRLRRRR, agree=0.881, adj=0.304, (0 split)
##       smoke_lifeexpo      splits as  RR-RLL-RRRRR-L-R-RR-RRRRRR--RRRRRR-RRRRL-RR-RRLRRRRRRRRRLR-R-RRRRRRR-LRRRRRLRRRRRRRR--RLR-RRRRRRRLRRRRLLL-RLRLRRRRRRLRRRRR-RLRRR--RRLRLL-RRRRR-RRRLLR-RLR-RRRRLRRRRR-LLL-RL-RR-RRL-RRRRRR-R-RRRRRRRRR--RRRRLR-R---LRRRRRRRRRRLRRLRR--RRRRRR-RRRRRR-RRLRR--RRR-RRRRR-RLRRRR----R-RRR-RRLRRRRRRRRRR-L-LRRR-RR-RR--RRRRR--R-RRRR-R-RRLRRRRRRR--RRRRR-RR--RRRRRRRRRRRR-R-RRRRRRLRRRRR-RRLR-R-R-R-R---R, agree=0.862, adj=0.187, (0 split)
##       study_start_date_q3 splits as  LRRRRRRRRRRRRRRRRRRRRRRRRR-RR-RRRRRR---RRRRRRRRRRLRR-RRRLL-LRLRRRRRRRL--RRRRRRRRRRRRLRRRRRR-LRLRRR-RRR--RRRR-RRRRRRRRRRRLRRRRRRRR-LRRLLRRR, agree=0.854, adj=0.143, (0 split)
##       diag_ccs_code1      splits as  RRRRRRRRRRR-RRRRRRRRRRRRLRRRRRRR-RRRRRRLRRL-RRRRRRRRRLRRRRRRRRRRRRRRLRLRRRRRRRRR-RRLRL--RRRRLRRRRLRRLRRRRRRRRRRRR-RRRRRR, agree=0.848, adj=0.107, (0 split)
## 
## Node number 2: 112 observations
##   predicted class=0  expected loss=0  P(node) =0.1702128
##     class counts:   112     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 546 observations
##   predicted class=1  expected loss=0.04395604  P(node) =0.8297872
##     class counts:    24   522
##    probabilities: 0.044 0.956

For the outcome of decease for patients who stayed more than 30 days but less than 60 days again shows the most relation to the total amount of charges by hospital. It is followed by the regular hours of work-out, then zipcode of the hospital.

3 : 60 - 89 days

## [1] 84 89
## [1] 35 89
## 
## Classification tree:
## rpart(formula = t3_train$deceased ~ ., data = t3_train, method = "class", 
##     control = list(minsplit = 20, minbucket = 7, cp = 0))
## 
## Variables actually used in tree construction:
## [1] allex_life_hrs
## 
## Root node error: 19/84 = 0.22619
## 
## n= 84 
## 
##        CP nsplit rel error xerror    xstd
## 1 0.94737      0  1.000000 1.0000 0.20181
## 2 0.00000      1  0.052632 1.2632 0.21792
## [1] 0.9473684
## Call:
## rpart(formula = t3_train$deceased ~ ., data = t3_train, method = "class", 
##     control = list(minsplit = 20, minbucket = 7, cp = 0))
##   n= 84 
## 
##          CP nsplit  rel error   xerror      xstd
## 1 0.9473684      0 1.00000000 1.000000 0.2018089
## 2 0.0000000      1 0.05263158 1.263158 0.2179154
## 
## Variable importance
##      allex_life_hrs   facility_zip5_cde     age_at_baseline study_start_date_q3 
##                  28                  17                  15                  14 
##           weight_q1              bmi_q1 
##                  14                  12 
## 
## Node number 1: 84 observations,    complexity param=0.9473684
##   predicted class=1  expected loss=0.2261905  P(node) =1
##     class counts:    19    65
##    probabilities: 0.226 0.774 
##   left son=2 (18 obs) right son=3 (66 obs)
##   Primary splits:
##       allex_life_hrs    splits as  RR-RR-RRRRRRRRRRR-L-R-LLRLRRLLRRRRLLRLR--LRRR-R--R-L-RR---R--RRRLR--LRR--R-RRLR-RRL--RLRRLR-RRRRL-R-, improve=27.43506, (0 missing)
##       total_charges_amt splits as  RL--RRRLRRRRL--R---RR-RR-R---R-R-RLRRLLRRRRL-RLL-LRRR-L-RRRR-RLR-R-R-R-RRL-RLRRRRRR---RLRRRRL--RRLRRRRRRL-R, improve=27.43506, (0 missing)
##       facility_zip5_cde splits as  RRLRRRRR-RRLRR-R-R-RLLLR-R-RL-LLR-RRRRRRLLLRRRRR-RL--R-LR-RR-RRRRRR-RRR-RRR-LRR, improve=16.88961, (0 missing)
##       bmi_q1            splits as  -L-R-RLRRLL-RLRRRRRRRR-LRRRRRRRL-RRRLRLR--RRRRRRRLL-RRRRRRRRRRRR-RRLL--RL-RL, improve=16.59436, (3 missing)
##       age_at_baseline   splits as  LLL-LLRLRLRLL-R-RLLRLRLRRRLRRRLLRRRRLRRRRR-RR-R-, improve=16.36232, (0 missing)
##   Surrogate splits:
##       facility_zip5_cde   splits as  RRLRRRRR-RRLRR-R-R-RLLLR-R-RL-LRR-RRRRRRLLLRRRRR-RR--R-RR-RR-RRRRRR-RRR-RRR-RRR, agree=0.917, adj=0.611, (0 split)
##       age_at_baseline     splits as  LLL-RLRLRLRLL-R-RRLRRRRRRRRRRRRLRRRRRRRRRR-RR-R-, agree=0.905, adj=0.556, (0 split)
##       weight_q1           splits as  LLRLRLRR-RRRRR--RRRRRRRRLRLRRRR-RRRRRRRRRRRRRRRRRRRLL---RRL, agree=0.893, adj=0.500, (0 split)
##       study_start_date_q3 splits as  LRLRLLRRRRRRRRRRLRRRR-LRRRRLRRRR-RRL-RRRR---LR-, agree=0.893, adj=0.500, (0 split)
##       bmi_q1              splits as  -L-R-RLRRLL-RRRRRRRRRR-LRRRRRRRL-RRRLRRR--RRRRRRRRR-RRRRRRRRRRRR-RRLL--RL-RL, agree=0.881, adj=0.444, (0 split)
## 
## Node number 2: 18 observations
##   predicted class=0  expected loss=0  P(node) =0.2142857
##     class counts:    18     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 66 observations
##   predicted class=1  expected loss=0.01515152  P(node) =0.7857143
##     class counts:     1    65
##    probabilities: 0.015 0.985

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  1  7
##          1  4 23
##                                           
##                Accuracy : 0.6857          
##                  95% CI : (0.5071, 0.8315)
##     No Information Rate : 0.8571          
##     P-Value [Acc > NIR] : 0.9976          
##                                           
##                   Kappa : -0.0267         
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.20000         
##             Specificity : 0.76667         
##          Pos Pred Value : 0.12500         
##          Neg Pred Value : 0.85185         
##              Prevalence : 0.14286         
##          Detection Rate : 0.02857         
##    Detection Prevalence : 0.22857         
##       Balanced Accuracy : 0.48333         
##                                           
##        'Positive' Class : 0               
## 

From the length of stay longer than 60 days, the total hospital charges disappear completely. The amount of workout in hours dictated the decease outcome, which was followed by the facility zip code again.

4 : 90 - 149 days

## [1] 36 89
## [1] 15 89
## 
## Classification tree:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class", 
##     control = list(minsplit = 5, minbucket = 2, cp = 0))
## 
## Variables actually used in tree construction:
## [1] facility_zip5_cde
## 
## Root node error: 3/36 = 0.083333
## 
## n= 36 
## 
##   CP nsplit rel error xerror    xstd
## 1  1      0         1      1 0.55277
## 2  0      1         0      1 0.55277
## [1] 1
## Call:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class", 
##     control = list(minsplit = 5, minbucket = 2, cp = 0))
##   n= 36 
## 
##   CP nsplit rel error xerror      xstd
## 1  1      0         1      1 0.5527708
## 2  0      1         0      1 0.5527708
## 
## Variable importance
## facility_zip5_cde   age_at_baseline   age_dad_atbirth         height_q1 
##                38                12                12                12 
##  participant_race         weight_q1 
##                12                12 
## 
## Node number 1: 36 observations,    complexity param=1
##   predicted class=1  expected loss=0.08333333  P(node) =1
##     class counts:     3    33
##    probabilities: 0.083 0.917 
##   left son=2 (3 obs) right son=3 (33 obs)
##   Primary splits:
##       facility_zip5_cde splits as  R-RR-RR-R-LRRRRR--R-RRR-RRR-RRLRRRRR-LRRRR, improve=5.500000, (0 missing)
##       bmi_q1            splits as  R-RR-RRRR-R-R-RR-RLRRL-RRR-RRRRRRRRRLR-, improve=4.000000, (0 missing)
##       diag_ccs_code1    splits as  RR-RRRRR-LRRRRL-R-RRRLRRR-, improve=4.000000, (0 missing)
##       allex_hrs_q1      splits as  R-RRRLRRRRRRRR-RRRRLR-R, improve=3.558824, (0 missing)
##       allex_life_hrs    splits as  RRR--RRR-RRRRR-RRRRRRRRRR-R-RRR--R-RRRR-LL, improve=3.558824, (0 missing)
##   Surrogate splits:
##       age_at_baseline  splits as  R--RLRRRRRR--RRR-RRRRR-RRRRRRRR-, agree=0.944, adj=0.333, (0 split)
##       participant_race splits as  RRRRLR, agree=0.944, adj=0.333, (0 split)
##       age_dad_atbirth  splits as  RRRR-L, agree=0.944, adj=0.333, (0 split)
##       height_q1        splits as  RRRRRRRRRRRL, agree=0.944, adj=0.333, (0 split)
##       weight_q1        splits as  -RR-RR-RRRRRRR-RRRLRRRRRRRRRRRR-, agree=0.944, adj=0.333, (0 split)
## 
## Node number 2: 3 observations
##   predicted class=0  expected loss=0  P(node) =0.08333333
##     class counts:     3     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 33 observations
##   predicted class=1  expected loss=0  P(node) =0.9166667
##     class counts:     0    33
##    probabilities: 0.000 1.000

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  0  3
##          1  0 12
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.5191, 0.9567)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity :  NA             
##             Specificity : 0.8             
##          Pos Pred Value :  NA             
##          Neg Pred Value :  NA             
##              Prevalence : 0.0             
##          Detection Rate : 0.0             
##    Detection Prevalence : 0.2             
##       Balanced Accuracy :  NA             
##                                           
##        'Positive' Class : 0               
## 
## Call:
## rpart(formula = t4_train$deceased ~ ., data = t4_train, method = "class", 
##     control = list(minsplit = 5, minbucket = 2, cp = 0))
##   n= 36 
## 
##   CP nsplit rel error xerror      xstd
## 1  1      0         1      1 0.5527708
## 2  0      1         0      1 0.5527708
## 
## Variable importance
## facility_zip5_cde   age_at_baseline   age_dad_atbirth         height_q1 
##                38                12                12                12 
##  participant_race         weight_q1 
##                12                12 
## 
## Node number 1: 36 observations,    complexity param=1
##   predicted class=1  expected loss=0.08333333  P(node) =1
##     class counts:     3    33
##    probabilities: 0.083 0.917 
##   left son=2 (3 obs) right son=3 (33 obs)
##   Primary splits:
##       facility_zip5_cde splits as  R-RR-RR-R-LRRRRR--R-RRR-RRR-RRLRRRRR-LRRRR, improve=5.500000, (0 missing)
##       bmi_q1            splits as  R-RR-RRRR-R-R-RR-RLRRL-RRR-RRRRRRRRRLR-, improve=4.000000, (0 missing)
##       diag_ccs_code1    splits as  RR-RRRRR-LRRRRL-R-RRRLRRR-, improve=4.000000, (0 missing)
##       allex_hrs_q1      splits as  R-RRRLRRRRRRRR-RRRRLR-R, improve=3.558824, (0 missing)
##       allex_life_hrs    splits as  RRR--RRR-RRRRR-RRRRRRRRRR-R-RRR--R-RRRR-LL, improve=3.558824, (0 missing)
##   Surrogate splits:
##       age_at_baseline  splits as  R--RLRRRRRR--RRR-RRRRR-RRRRRRRR-, agree=0.944, adj=0.333, (0 split)
##       participant_race splits as  RRRRLR, agree=0.944, adj=0.333, (0 split)
##       age_dad_atbirth  splits as  RRRR-L, agree=0.944, adj=0.333, (0 split)
##       height_q1        splits as  RRRRRRRRRRRL, agree=0.944, adj=0.333, (0 split)
##       weight_q1        splits as  -RR-RR-RRRRRRR-RRRLRRRRRRRRRRRR-, agree=0.944, adj=0.333, (0 split)
## 
## Node number 2: 3 observations
##   predicted class=0  expected loss=0  P(node) =0.08333333
##     class counts:     3     0
##    probabilities: 1.000 0.000 
## 
## Node number 3: 33 observations
##   predicted class=1  expected loss=0  P(node) =0.9166667
##     class counts:     0    33
##    probabilities: 0.000 1.000

It is interesting to observe that the facility zip code is found to be the most important factor for predicting the death of patients who stayed in the hospital more than 90 days.

5 : No tree nodes were generated

n = nrow(tier5_select4ed)
ntrain = floor(0.7*n)
ntest = floor(0.3*n)

set.seed(1234)
split = sample(rep(1:2, times=c(ntrain, ntest)))

#split the data 
t5_train = tier5_selected[split==1,] 
t5_test = tier5_selected[split==2,] 

dim(t5_train);dim(t5_test) #24 89 ;  11 89

length(t5_train$deceased == 1) #24 : so all data are deceased. cannot divide!
length(tier5_selected$deceased == 1) #35
length(tier5_selected$deceased) #35

#t5_rpart = rpart(t5_train$deceased ~ ., data = t5_train, method = "class", control = list(minsplit=1, minbucket = 1, cp=0.001))
#printcp(t5_rpart)

The data for tier 5 patients who stayed longer than 150 days in hospital were not able to be analyzed as all outcomes were identified as deceased.

From the models developed and the performances from each tier are summarized below:

It was attempted to develop another prediction model to compare the model performances. However, the imputation of missing data were too extensive for the allocated memories and even with the complete cases for patients, the codes could not be completed running. At this point, the analysis proceeded with the recursive partition regression model developed above.


Variables selected by recursive partitioning then Elastic-Net Regression

Elastic net regression iteration was successfully completed with the tier 1, with cv AUC = 0.6736, and misclassification error with 0.4975.

Elastic net regression iteration was succesfully completed with tier 2, with cv AUC = 0.8953. ROC curves were plotted to visualize the AUC, but it was omitted to evaluate in this report due to the extended iteration time(>3 hrs).

Elastic net regression was successfully completed with tier 3, with misclassification error or 0.2368.


Comorbidities Analysis

In order to understand the effect of comorbidities on deceased rate, a new variable with presence of secondary, tertiary, quarternary and quinary ccs codes was created.

Spearman’s rank correlation coefficient as known as Spearman’s rho statistics is used to estimate a rank-based measure association, for the data that does not have bivariate normal distribution.

new feature creation with comorbidities, entire dataset

EDA Tables for Comorbidity Analysis

EDA for Diagnoses Distributions for Deceased

##                                                                         diag_ccs1
## 1                                                   Acute cerebrovascular disease
## 2                                                            Cardiac dysrhythmias
## 3                                       Congestive heart failure; nonhypertensive
## 4                                                 Fracture of neck of femur (hip)
## 5                                                                  Osteoarthritis
## 6  Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 7           Rehabilitation care; fitting of prostheses; and adjustment of devices
## 8                                                    Septicemia (except in labor)
## 9                                                                       Undefined
## 10                                                       Urinary tract infections
##    counts
## 1    2389
## 2    2501
## 3    3071
## 4    2264
## 5    2751
## 6    2907
## 7    2547
## 8    3286
## 9    1875
## 10   1975

then, make new var for comorbidities for best 3 CCS code (primary diagnosis during that hospital visits, which collapses ICD codes into the broader group).

Secondary diagnosis among deceased septicemia

##                                                                        diag_ccs2
## 1                                            Acute and unspecified renal failure
## 2 Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 3                             Respiratory failure; insufficiency; arrest (adult)
## 4                                                                          Shock
## 5                                                       Urinary tract infections
##   counts
## 1    257
## 2    508
## 3    271
## 4    301
## 5    406

Secondary diagnosis among deceased CHF

## Selecting by counts
##                                                                                              diag_ccs2
## 1                                                                                 Cardiac dysrhythmias
## 2                                                                                Heart valve disorders
## 3 Peri-; endo-; and myocarditis; cardiomyopathy (except that caused by tuberculosis or sexually transm
## 4                       Pneumonia (except that caused by tuberculosis or sexually transmitted disease)
## 5                                                   Respiratory failure; insufficiency; arrest (adult)
##   counts
## 1    408
## 2    195
## 3    186
## 4    247
## 5    215

Secondary diagnosis among deceased pneumonia

## Selecting by counts
##                                                  diag_ccs2 counts
## 1                                     Cardiac dysrhythmias    170
## 2 Chronic obstructive pulmonary disease and bronchiectasis    298
## 3                Congestive heart failure; nonhypertensive    299
## 4                          Fluid and electrolyte disorders    187
## 5       Respiratory failure; insufficiency; arrest (adult)    222

The Charlson Comorbidity Index estimates the survival in patients with multiple comorbidities which consists of different items. Following the publicly available Charlson Comorbidity Index categories, each diagnosis was scored.

Example : COH Charlson Comorbidity Index with Scores
id deceased diag_ccs_code1 s1 diag_ccs_code2 s2 diag_ccs_code3 s3 diag_ccs_code4 s4 diag_ccs_code5 s5 com.score
1 0 109 1 82 0 108 1 95 0 211 0 2
2 0 114 1 155 0 144 0 47 0 146 0 1
3 1 108 1 127 1 131 0 98 0 128 0 2
4 0 109 1 95 0 108 1 159 0 2 0 2
5 1 109 1 108 1 106 0 653 1 98 0 3
6 1 50 2 0 0 62 0 0 0 42 2 4
## 
##  Pearson's product-moment correlation
## 
## data:  com.o$deceased and com.o$com.score
## t = 12.56, df = 13706, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09009436 0.12319444
## sample estimates:
##      cor 
## 0.106674
## 
## Call:
## glm(formula = com.o$deceased ~ com.o$com.score, family = binomial(link = "logit"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6738   0.5063   0.6875   0.7581   0.7581  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      0.87737    0.04116   21.32   <2e-16 ***
## com.o$com.score  0.22241    0.01799   12.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13998  on 13707  degrees of freedom
## Residual deviance: 13820  on 13706  degrees of freedom
## AIC: 13824
## 
## Number of Fisher Scoring iterations: 4
##     (Intercept) com.o$com.score 
##        2.404571        1.249087
## Waiting for profiling to be done...
##                     OR  2.5 % 97.5 %
## (Intercept)     2.4046 2.2179 2.6062
## com.o$com.score 1.2491 1.2063 1.2945

From the univariate analysis with deceased outcome with the COH Charlson comorbidities, per 1 unit of increase in comorbidity index is associated with deceased outcome at the odds of 1.2490.

logit(1) = (0.87737 + 0.22241 * comorbidity score of 1) = OR of 3.00 logit(5) = e^(.87737+0.22241 * comorbidity score of 5) = OR of 7.311


Spatial Pattern Visualization

Due to the significance of the spatial information in relation to the mortality rates, Provided zip code information of patients’ residential area and publicly available socioeconomic status were paired to assess the spatial relations with the patients mortality rates.

## Warning: Removed 35 rows containing missing values (geom_point).

Spatial Deceased Rates Visualization

Tier 1

## Warning: Removed 35 rows containing missing values (geom_point).

Tier 2

## Warning: Removed 14 rows containing missing values (geom_point).

Tier 3

## Warning: Removed 5 rows containing missing values (geom_point).

Tier 4

## Warning: Removed 2 rows containing missing values (geom_point).

Tier 5


Conclusion

Patients information by COH from 2000 to 2015 were analyzed to predict and to define any patterns for the deceased outcome. Based on the stratified length of stay since admission to the hospital to the outcome, recursive partitioning regression methods were used to develop prediction models per each tier with training set (70%), then each model’s performances were evaluated with the testing set(30%). Although it was not feasible in this particular setting, it is highly desired for different types of machine learning models to be developed then compared in terms of the performances, to pick out the best models.

The tiers for length of hospital stay were divided as 30, 60, 90, 150, and 150+ days. For the first tier and second tier, the total amount of the hospital charges were dominating to predict the death of the patients. It may be related to the urgency of the medical care needs which followed by the multiple costly testings, or may be related to the government supports period for patients care. The insurance tier was also included in the variables, but it was not singled out for a significant variable to be considered to predict the death of patients.

From tier 1 through tier 4, the hospital zip code was prevalently appeared as one of the most important variables to predict the death outcome of patients. The longer the length of the stay, the higher it ranked as an important factor. To define further about this, the spatial map with the death rate of the patients relative to the socioeconomic status for neighborhood were built for each tier. The increasing death rate as the tier increases was apparent, and death outcome seem to be more prevalent in large cities such as Los Angeles and San Francisco. The relation to the socioeconomic status was not too apparent based on the spatial map created, and further analysis will be required to define more details.

The comorbidities were assessed from the publicly known Charlson Comorbidity Index (CCI), with fltered data sets for the observations with full diagnoses primary to quinary. Once each observation’s CCI was obtained, logistic calculations with the deceased outcome is completed. From the univariate analysis with deceased outcome with the COH Charlson comorbidities, per 1 unit of increase in comorbidity index is associated with deceased outcome at the odds of 1.2490.

The future direction of the report would be analyzing not just spatial but also temporal assessments in predicting death outcome of patients. Also, finding out the effect of co-morbities towards the patients’ death will be a great source to further understandings.