Introduction to 'drglm'
The ‘drglm’ package lets users fit GLMs to big data sets that can be loaded into memory. It uses the popular “Divide and Recombine” method: the data are split into subsets, a GLM is fitted to each subset, and the subset results are combined into a single fit. Let's generate a data set that is not that big but serves our purpose.
Generating a Data Set
set.seed(123)
# Number of rows to be generated
n <- 1000000
# Creating the data set
dataset <- data.frame(
  Var_1 = round(rnorm(n, mean = 50, sd = 10)),
  Var_2 = round(rnorm(n, mean = 7.5, sd = 2.1)),
  Var_3 = as.factor(sample(c("0", "1"), n, replace = TRUE)),
  Var_4 = as.factor(sample(c("0", "1", "2"), n, replace = TRUE)),
  Var_5 = sample(0:6, n, replace = TRUE),
  Var_6 = round(rnorm(n, mean = 60, sd = 5))
)
This data set contains six variables: three (Var_1, Var_2 and Var_6) are continuous, generated from normal distributions and rounded; two (Var_3 and Var_4) are categorical; and one (Var_5) is a count variable. We now fit different GLMs to this data set.
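The composition described above can be checked by inspecting the column classes. The following is a minimal sketch that rebuilds the same columns with a much smaller size so that it runs quickly on its own; the names `n_small` and `small` and the value 1000 are illustrative assumptions, not part of the original example.

```r
set.seed(123)
n_small <- 1000  # hypothetical small size, for illustration only

small <- data.frame(
  Var_1 = round(rnorm(n_small, mean = 50, sd = 10)),
  Var_2 = round(rnorm(n_small, mean = 7.5, sd = 2.1)),
  Var_3 = as.factor(sample(c("0", "1"), n_small, replace = TRUE)),
  Var_4 = as.factor(sample(c("0", "1", "2"), n_small, replace = TRUE)),
  Var_5 = sample(0:6, n_small, replace = TRUE),
  Var_6 = round(rnorm(n_small, mean = 60, sd = 5))
)

# Class of each column
col_classes <- sapply(small, class)
print(col_classes)

# Count the columns of each kind
n_double  <- sum(sapply(small, is.double))   # rounded normal draws
n_factor  <- sum(sapply(small, is.factor))   # categorical variables
n_integer <- sum(sapply(small, is.integer))  # count variable (factors are not counted here)
```

Running `str(dataset)` on the full data set gives the same information in a more compact form.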
Fitting Multiple Linear Regression Model
Now we fit a multiple linear regression model to the data set, with Var_1 as the response variable and all other variables as independent variables.
nmodel <- drglm::drglm(Var_1 ~ Var_2 + Var_3 + Var_4 + Var_5 + Var_6,
                       data = dataset, family = "gaussian",
                       fitfunction = "speedglm", k = 10)
# Output
print(nmodel)
##                  Estimate standard error     t value  Pr(>|t|)
## (Intercept) 49.9317067180    0.127368889 392.0243567 0.0000000
## Var_2       -0.0045654674    0.004721297  -0.9669943 0.3335469
## Var_31       0.0141079507    0.020006611   0.7051644 0.4807079
## Var_41      -0.0071647241    0.024493999  -0.2925094 0.7698972
## Var_42       0.0029739494    0.024507130   0.1213504 0.9034135
## Var_5       -0.0005907645    0.005001232  -0.1181238 0.9059696
## Var_6        0.0015528677    0.001996831   0.7776662 0.4367658
##                        95% CI
## (Intercept) [ 49.68 , 50.18 ]
## Var_2           [ -0.01 , 0 ]
## Var_31       [ -0.03 , 0.05 ]
## Var_41       [ -0.06 , 0.04 ]
## Var_42       [ -0.05 , 0.05 ]
## Var_5        [ -0.01 , 0.01 ]
## Var_6            [ 0 , 0.01 ]
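To make the divide-and-recombine idea concrete, here is a minimal base-R sketch on a small simulated data set. It splits the rows into k chunks, fits the same Gaussian GLM to each chunk, and pools the chunk estimates with inverse-variance weights. The data, the variable names (`toy`, `x`, `z`, `y`) and the pooling rule are assumptions made for illustration; the source does not state which combination rule ‘drglm’ applies internally.

```r
set.seed(1)
n_toy <- 10000  # hypothetical sample size
k <- 10         # number of chunks

toy <- data.frame(x = rnorm(n_toy), z = rnorm(n_toy))
toy$y <- 2 + 0.5 * toy$x - 0.3 * toy$z + rnorm(n_toy)

# Divide: assign every row to one of k chunks
chunk_id <- rep(seq_len(k), length.out = n_toy)
chunks <- split(toy, chunk_id)

# Fit the same model separately on each chunk
fits <- lapply(chunks, function(d) glm(y ~ x + z, data = d, family = gaussian()))

# Recombine: weight each chunk estimate by its precision (inverse covariance)
precisions <- lapply(fits, function(f) solve(vcov(f)))
total_precision <- Reduce(`+`, precisions)
weighted_sum <- Reduce(`+`, Map(function(P, f) P %*% coef(f), precisions, fits))

combined_coef <- drop(solve(total_precision, weighted_sum))
combined_se <- sqrt(diag(solve(total_precision)))

# Compare with a single fit on all rows
full_fit <- glm(y ~ x + z, data = toy, family = gaussian())
print(cbind(combined = combined_coef, full = coef(full_fit)))
```

In this sketch the pooled coefficients should land very close to the full-data fit, which is the property that makes chunk-wise fitting attractive for large data sets.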
Fitting Binomial Regression (Logistic Regression) Model
Now we fit a logistic regression model to the data set, with Var_3 as the response variable and all other variables as independent variables.
bmodel <- drglm::drglm(Var_3 ~ Var_1 + Var_2 + Var_4 + Var_5 + Var_6,
                       data = dataset, family = "binomial",
                       fitfunction = "speedglm", k = 10)
# Output
print(bmodel)
##                  Estimate Odds Ratio standard error    z value   Pr(>|z|)
## (Intercept) -0.0509893145  0.9502888   0.0272855408 -1.8687302 0.06166036
## Var_1        0.0001409687  1.0001410   0.0001999610  0.7049813 0.48082188
## Var_2       -0.0010358477  0.9989647   0.0009440378 -1.0972524 0.27253109
## Var_41      -0.0008665869  0.9991338   0.0048975644 -0.1769424 0.85955361
## Var_42       0.0008942254  1.0008946   0.0049002168  0.1824869 0.85520063
## Var_5       -0.0006510342  0.9993492   0.0010000010 -0.6510335 0.51502484
## Var_6        0.0008820860  1.0008825   0.0003992730  2.2092302 0.02715863
##                       95% CI
## (Intercept)     [ -0.1 , 0 ]
## Var_1              [ 0 , 0 ]
## Var_2              [ 0 , 0 ]
## Var_41      [ -0.01 , 0.01 ]
## Var_42      [ -0.01 , 0.01 ]
## Var_5              [ 0 , 0 ]
## Var_6              [ 0 , 0 ]
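The derived columns in this table can be reproduced from the estimate and its standard error. The sketch below assumes the usual Wald conventions (odds ratio as the exponentiated estimate, z as estimate divided by standard error, a two-sided normal p-value, and a normal-based 95% interval); the source does not state how ‘drglm’ computes them, although the printed values are consistent with these formulas. It also suggests why several intervals print as [ 0 , 0 ]: the limits appear to be rounded to two decimals.

```r
# Values copied from the printed output for Var_6 in the logistic model
estimate  <- 0.0008820860
std_error <- 0.0003992730

odds_ratio <- exp(estimate)              # exponentiated coefficient
z_value    <- estimate / std_error       # Wald statistic
p_value    <- 2 * pnorm(-abs(z_value))   # two-sided normal p-value

# Normal-based 95% interval on the coefficient scale
ci <- estimate + c(-1, 1) * qnorm(0.975) * std_error

print(c(odds_ratio = odds_ratio, z = z_value, p = p_value))
print(ci)            # unrounded limits
print(round(ci, 2))  # limits after rounding to two decimals
```

When coefficients are this small, the unrounded limits are more informative than the two-decimal display.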
Fitting Poisson Regression Model
Now we fit a Poisson regression model to the data set, with Var_5 as the response variable and all other variables as independent variables.
pmodel <- drglm::drglm(Var_5 ~ Var_1 + Var_2 + Var_3 + Var_4 + Var_6,
                       data = dataset, family = "poisson",
                       fitfunction = "speedglm", k = 10)
# Output
print(pmodel)
##                  Estimate Odds Ratio standard error     z value  Pr(>|z|)
## (Intercept)  1.111764e+00  3.0397171   7.844631e-03 141.7229717 0.0000000
## Var_1       -8.530443e-06  0.9999915   5.770457e-05  -0.1478296 0.8824773
## Var_2       -3.972801e-04  0.9996028   2.724303e-04  -1.4582817 0.1447629
## Var_31      -8.719392e-04  0.9991284   1.154426e-03  -0.7553012 0.4500683
## Var_41       1.501374e-04  1.0001501   1.413838e-03   0.1061914 0.9154305
## Var_42       2.088608e-03  1.0020908   1.413911e-03   1.4771853 0.1396260
## Var_6       -1.584737e-04  0.9998415   1.152213e-04  -1.3753856 0.1690119
##                     95% CI
## (Intercept) [ 1.1 , 1.13 ]
## Var_1            [ 0 , 0 ]
## Var_2            [ 0 , 0 ]
## Var_31           [ 0 , 0 ]
## Var_41           [ 0 , 0 ]
## Var_42           [ 0 , 0 ]
## Var_6            [ 0 , 0 ]
Fitting Multinomial Logistic Regression Model
Now we fit a multinomial logistic regression model to the data set, with Var_4 as the response variable and all other variables as independent variables.
mmodel <- drglm::drglm(Var_4 ~ Var_1 + Var_2 + Var_3 + Var_5 + Var_6,
                       data = dataset, family = "multinomial",
                       fitfunction = "multinom", k = 10)
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109859.924516
## final value 109859.701793
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109858.025066
## final value 109856.247559
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109857.168027
## final value 109855.028006
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109857.411318
## final value 109856.006772
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109857.575481
## final value 109854.463544
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109856.817080
## final value 109853.551812
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109858.042179
## final value 109856.223538
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109856.773011
## final value 109853.685139
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109858.213223
## final value 109857.373232
## converged
## # weights: 21 (12 variable)
## initial value 109861.228867
## iter 10 value 109855.898011
## final value 109854.318130
## converged
# Output
print(mmodel)
##                Estimate.1    Estimate.2 Odds Ratio.1 Odds Ratio.2
## (Intercept)  2.830473e-02  1.158303e-02    1.0287091    1.0116504
## Var_1       -7.176471e-05  2.793888e-05    0.9999282    1.0000279
## Var_2        1.360669e-03  2.468253e-04    1.0013616    1.0002469
## Var_31      -8.820202e-04  9.447668e-04    0.9991184    1.0009452
## Var_5        1.095235e-04  1.564296e-03    1.0001095    1.0015655
## Var_6       -5.798311e-04 -3.696137e-04    0.9994203    0.9996305
##             standard error.1 standard error.2   z value.1  z value.2 Pr(>|z|).1
## (Intercept)     0.0333081745     0.0333282475  0.84978341  0.3475438  0.3954455
## Var_1           0.0002448119     0.0002449394 -0.29314222  0.1140645  0.7694134
## Var_2           0.0011557794     0.0011563869  1.17727369  0.2134452  0.2390863
## Var_31          0.0048975648     0.0049002165 -0.18009363  0.1928010  0.8570791
## Var_5           0.0012242948     0.0012249480  0.08945842  1.2770303  0.9287176
## Var_6           0.0004888216     0.0004890920 -1.18618141 -0.7557140  0.2355507
##             Pr(>|z|).2 95% lower CI.1 95% lower CI.2 95% upper CI.1
## (Intercept)  0.7281828  -0.0369780884  -0.0537391374   0.0935875566
## Var_1        0.9091867  -0.0005515873  -0.0004521335   0.0004080578
## Var_2        0.8309797  -0.0009046173  -0.0020196513   0.0036259547
## Var_31       0.8471148  -0.0104810708  -0.0086594811   0.0087170304
## Var_5        0.2015915  -0.0022900503  -0.0008365582   0.0025090972
## Var_6        0.4498207  -0.0015379039  -0.0013282165   0.0003782417
##             95% upper CI.2
## (Intercept)   0.0769051920
## Var_1         0.0005080113
## Var_2         0.0025133019
## Var_31        0.0105490147
## Var_5         0.0039651499
## Var_6         0.0005889891
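The two estimate columns can be turned into predicted class probabilities. The sketch below assumes that Estimate.1 and Estimate.2 are log-odds of Var_4 levels "1" and "2" relative to the reference level "0", which is the convention of nnet::multinom; the source does not confirm this labelling. The covariate profile `x` is a hypothetical example.

```r
# Coefficients copied from the printed output, in the printed row order:
# (Intercept), Var_1, Var_2, Var_31, Var_5, Var_6
coef_1 <- c(2.830473e-02, -7.176471e-05, 1.360669e-03,
            -8.820202e-04, 1.095235e-04, -5.798311e-04)
coef_2 <- c(1.158303e-02, 2.793888e-05, 2.468253e-04,
            9.447668e-04, 1.564296e-03, -3.696137e-04)

# Hypothetical profile: intercept, Var_1 = 50, Var_2 = 8,
# Var_3 = "0" (dummy Var_31 = 0), Var_5 = 3, Var_6 = 60
x <- c(1, 50, 8, 0, 3, 60)

# Linear predictors; the reference level has linear predictor 0 (assumed)
eta <- c(0, sum(coef_1 * x), sum(coef_2 * x))

# Softmax over the three levels
probs <- exp(eta) / sum(exp(eta))
names(probs) <- c("0", "1", "2")
print(probs)
```

Because Var_4 was sampled independently of the other variables, the three probabilities come out close to one third each, in line with the near-zero coefficients in the table.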
For the first three models we used fitfunction = “speedglm” as the fitting function to reduce computation time; the multinomial model was fitted with fitfunction = “multinom”. fitfunction = “glm” can also be used for the first three models and will give the same results as fitfunction = “speedglm”.
Note that the function ‘drglm’ is designed for fitting GLMs to data sets that fit into memory. To fit a data set that is larger than the available memory, the function ‘big.drglm’ can be used; see the respective vignette for details.