
The function drglm fits GLMs to data sets too large to be held in memory at once, using the popular divide-and-recombine technique to handle such data efficiently. Performance improves further when R is linked with an optimized BLAS library such as ATLAS. Besides the usual model specification, drglm requires the number of chunks k and the fitfunction to be used; the remaining arguments are almost identical to those of the speedglm or biglm packages.
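The divide-and-recombine idea can be illustrated with a minimal sketch (a hypothetical helper, not the package's internal code): split the rows into k chunks, fit glm() to each chunk, then combine the per-chunk estimates by inverse-variance weighting.

```r
# Minimal divide-and-recombine sketch (hypothetical; drglm's internals may differ)
dr_sketch <- function(formula, family, data, k) {
  # Assign rows to k chunks and fit a glm() to each chunk separately
  idx <- split(seq_len(nrow(data)), rep_len(seq_len(k), nrow(data)))
  fits <- lapply(idx, function(i) {
    s <- summary(glm(formula, family = family, data = data[i, ]))
    s$coefficients[, 1:2]  # per-chunk estimates and standard errors
  })
  est <- sapply(fits, function(m) m[, 1])  # p x k matrix of estimates
  se  <- sapply(fits, function(m) m[, 2])  # p x k matrix of standard errors
  # Recombine: weight each chunk's estimate by its inverse variance
  w <- 1 / se^2
  comb_est <- rowSums(w * est) / rowSums(w)
  comb_se  <- sqrt(1 / rowSums(w))
  cbind(Estimate = comb_est, `Std. Error` = comb_se)
}

set.seed(1)
d <- data.frame(y = rnorm(1000), x = rnorm(1000))
dr_sketch(y ~ x, family = "gaussian", data = d, k = 4)
```

Each chunk must be small enough to fit in memory; only the p x 2 coefficient summaries, not the raw rows, are carried into the recombination step.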

Usage

drglm(formula, family, data, k, fitfunction)

Arguments

formula

An object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. Details of model specification are given under 'Details'.

family

A description of the error distribution to be used in the model.

data

An optional data frame, list, or environment containing the variables in the model.

k

Number of subsets to be used.

fitfunction

The function to be used for model fitting: glm or speedglm. For multinomial models, the multinom function (from the nnet package) is preferred.

Value

A generalized linear model fitted with the "Divide & Recombine" approach using k chunks of the data set. The function returns the combined coefficient estimates, obtained by the divide-and-recombine method, together with their standard errors.
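The recombination of per-chunk results can be shown with a small worked example (a sketch of the inverse-variance weighting scheme discussed in the references below, not necessarily drglm's exact internals). Given two hypothetical chunk estimates of the same coefficient:

```r
# Two hypothetical per-chunk estimates of one coefficient
b <- c(1.10, 0.90)   # chunk estimates
s <- c(0.20, 0.10)   # chunk standard errors
w <- 1 / s^2         # inverse-variance weights: 25 and 100
b_comb <- sum(w * b) / sum(w)  # combined estimate: 0.94
s_comb <- sqrt(1 / sum(w))     # combined SE: about 0.089
round(c(estimate = b_comb, se = s_comb), 3)
```

The more precise chunk (smaller standard error) dominates the combined estimate, and the combined standard error is smaller than either chunk's alone.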

References

  • Xi, R., Lin, N., & Chen, Y. (2009). Compression and aggregation for logistic regression analysis in data cubes. IEEE Transactions on Knowledge and Data Engineering, 21(4).

  • Chen, Y., Dong, G., Han, J., Pei, J., Wah, B. W., & Wang, J. (2006). Regression cubes with lossless compression and aggregation. IEEE Transactions on Knowledge and Data Engineering, 18(12).

  • Zuo, W., & Li, Y. (2018). A New Stochastic Restricted Liu Estimator for the Logistic Regression Model. Open Journal of Statistics, 8(1).

  • Karim, M. R., & Islam, M. A. (2019). Reliability and Survival Analysis.

  • Enea, M. (2009) Fitting Linear Models and Generalized Linear Models with large data sets in R.

  • Bates, D. (2009) Technical Report on Least Square Calculations.

  • Lumley, T. (2009) biglm package documentation.

Author

MH Nayem

Examples

set.seed(123)
# Number of rows to be generated
n <- 10000
# Creating the dataset
dataset <- data.frame(
  pred_1 = round(rnorm(n, mean = 50, sd = 10)),
  pred_2 = round(rnorm(n, mean = 7.5, sd = 2.1)),
  pred_3 = as.factor(sample(c("0", "1"), n, replace = TRUE)),
  pred_4 = as.factor(sample(c("0", "1", "2"), n, replace = TRUE)),
  pred_5 = as.factor(sample(0:15, n, replace = TRUE)),
  pred_6 = round(rnorm(n, mean = 60, sd = 5))
)
# Fitting a multiple linear regression model
nmodel <- drglm::drglm(pred_1 ~ pred_2 + pred_3 + pred_4 + pred_5 + pred_6,
                       data = dataset, family = "gaussian",
                       fitfunction = "speedglm", k = 10)
# Output
nmodel
#>                Estimate standard error    t value   Pr(>|t|)            95% CI
#> (Intercept) 51.72130615     1.32114969 39.1487102 0.00000000 [ 49.13 , 54.31 ]
#> pred_2       0.02094802     0.04748735  0.4411285 0.65911997  [ -0.07 , 0.11 ]
#> pred_31     -0.13949603     0.20171843 -0.6915384 0.48922728  [ -0.53 , 0.26 ]
#> pred_41      0.38350656     0.24833980  1.5442815 0.12252015   [ -0.1 , 0.87 ]
#> pred_42      0.23785108     0.24752089  0.9609334 0.33658568  [ -0.25 , 0.72 ]
#> pred_51     -1.06696639     0.56657382 -1.8831904 0.05967457  [ -2.18 , 0.04 ]
#> pred_52     -0.80267657     0.56004238 -1.4332426 0.15178853   [ -1.9 , 0.29 ]
#> pred_53     -0.64240893     0.56243644 -1.1421894 0.25337531  [ -1.74 , 0.46 ]
#> pred_54     -0.87049071     0.56948141 -1.5285674 0.12637173  [ -1.99 , 0.25 ]
#> pred_55     -0.51662926     0.56337343 -0.9170281 0.35912793  [ -1.62 , 0.59 ]
#> pred_56     -0.51405571     0.56179393 -0.9150254 0.36017830  [ -1.62 , 0.59 ]
#> pred_57     -0.68371489     0.56680847 -1.2062538 0.22771963  [ -1.79 , 0.43 ]
#> pred_58     -0.83233284     0.56987357 -1.4605570 0.14413705  [ -1.95 , 0.28 ]
#> pred_59     -0.76583552     0.56309505 -1.3600466 0.17381517  [ -1.87 , 0.34 ]
#> pred_510    -0.69443427     0.56813346 -1.2223083 0.22159105  [ -1.81 , 0.42 ]
#> pred_511    -0.75598173     0.55912331 -1.3520841 0.17634842  [ -1.85 , 0.34 ]
#> pred_512    -1.32332553     0.56884076 -2.3263550 0.01999962 [ -2.44 , -0.21 ]
#> pred_513    -0.76349854     0.56265917 -1.3569468 0.17479812  [ -1.87 , 0.34 ]
#> pred_514    -0.60991931     0.57137187 -1.0674647 0.28576204  [ -1.73 , 0.51 ]
#> pred_515     0.14287426     0.57115597  0.2501493 0.80247190  [ -0.98 , 1.26 ]
#> pred_6      -0.02291395     0.02004498 -1.1431264 0.25298613  [ -0.06 , 0.02 ]
# Fitting a simple logistic regression model
bmodel <- drglm::drglm(pred_3 ~ pred_1 + pred_2 + pred_4 + pred_5 + pred_6,
                       data = dataset, family = "binomial",
                       fitfunction = "speedglm", k = 10)
# Output
bmodel
#>                  Estimate Odds Ratio standard error     z value   Pr(>|z|)
#> (Intercept)  0.0952195429  1.0999003    0.286765015  0.33204728 0.73985356
#> pred_1      -0.0013836918  0.9986173    0.002045315 -0.67651761 0.49871207
#> pred_2      -0.0004142688  0.9995858    0.009605457 -0.04312848 0.96559912
#> pred_41      0.0184132664  1.0185838    0.050213834  0.36669708 0.71384499
#> pred_42      0.0863909757  1.0902325    0.050044304  1.72628990 0.08429527
#> pred_51     -0.1066254801  0.8988623    0.114997732 -0.92719637 0.35382459
#> pred_52     -0.0591881914  0.9425294    0.113290817 -0.52244474 0.60136071
#> pred_53     -0.0807833291  0.9223935    0.113929295 -0.70906547 0.47828385
#> pred_54      0.0229787647  1.0232448    0.115269581  0.19934804 0.84199051
#> pred_55     -0.0057667632  0.9942498    0.113913596 -0.05062401 0.95962513
#> pred_56      0.0254936407  1.0258214    0.113816898  0.22398819 0.82276649
#> pred_57      0.0233435801  1.0236182    0.114746435  0.20343621 0.83879410
#> pred_58     -0.0092262931  0.9908161    0.115149070 -0.08012477 0.93613802
#> pred_59     -0.1390418914  0.8701916    0.114051937 -1.21911030 0.22280233
#> pred_510     0.0532619633  1.0547059    0.114808460  0.46392020 0.64270492
#> pred_511    -0.0815427288  0.9216933    0.113599237 -0.71781053 0.47287412
#> pred_512     0.0934829685  1.0979919    0.114770538  0.81452061 0.41534677
#> pred_513     0.0508340238  1.0521482    0.114111051  0.44547853 0.65597397
#> pred_514    -0.0722004220  0.9303444    0.115422244 -0.62553300 0.53162130
#> pred_515     0.1086928534  1.1148199    0.115575343  0.94045019 0.34698669
#> pred_6      -0.0008249102  0.9991754    0.004057493 -0.20330540 0.83889633
#>                       95% CI
#> (Intercept) [ -0.47 , 0.66 ]
#> pred_1         [ -0.01 , 0 ]
#> pred_2      [ -0.02 , 0.02 ]
#> pred_41     [ -0.08 , 0.12 ]
#> pred_42     [ -0.01 , 0.18 ]
#> pred_51     [ -0.33 , 0.12 ]
#> pred_52     [ -0.28 , 0.16 ]
#> pred_53      [ -0.3 , 0.14 ]
#> pred_54      [ -0.2 , 0.25 ]
#> pred_55     [ -0.23 , 0.22 ]
#> pred_56      [ -0.2 , 0.25 ]
#> pred_57      [ -0.2 , 0.25 ]
#> pred_58     [ -0.23 , 0.22 ]
#> pred_59     [ -0.36 , 0.08 ]
#> pred_510    [ -0.17 , 0.28 ]
#> pred_511     [ -0.3 , 0.14 ]
#> pred_512    [ -0.13 , 0.32 ]
#> pred_513    [ -0.17 , 0.27 ]
#> pred_514     [ -0.3 , 0.15 ]
#> pred_515    [ -0.12 , 0.34 ]
#> pred_6      [ -0.01 , 0.01 ]
# Fitting a binomial model to the factor response pred_5
# (treated as first level vs. the rest)
pmodel <- drglm::drglm(pred_5 ~ pred_1 + pred_2 + pred_3 + pred_4 + pred_6,
                       data = dataset, family = "binomial",
                       fitfunction = "speedglm", k = 10)
# Output
pmodel
#>                 Estimate Odds Ratio standard error     z value     Pr(>|z|)
#> (Intercept)  2.405511769 11.0841013    0.559554526  4.29897652 1.715886e-05
#> pred_1      -0.007027986  0.9929967    0.004122233 -1.70489760 8.821352e-02
#> pred_2       0.028680323  1.0290956    0.019505478  1.47037271 1.414608e-01
#> pred_31     -0.009920401  0.9901286    0.082764726 -0.11986267 9.045919e-01
#> pred_41      0.078623643  1.0817971    0.102399679  0.76781142 4.425992e-01
#> pred_42      0.010046680  1.0100973    0.100601142  0.09986646 9.204503e-01
#> pred_6       0.005829844  1.0058469    0.008198830  0.71105795 4.770483e-01
#>                       95% CI
#> (Intercept)   [ 1.31 , 3.5 ]
#> pred_1         [ -0.02 , 0 ]
#> pred_2      [ -0.01 , 0.07 ]
#> pred_31     [ -0.17 , 0.15 ]
#> pred_41     [ -0.12 , 0.28 ]
#> pred_42     [ -0.19 , 0.21 ]
#> pred_6      [ -0.01 , 0.02 ]
# Fitting a multinomial logistic regression model
mmodel <- drglm::drglm(pred_4 ~ pred_1 + pred_2 + pred_3 + pred_5 + pred_6,
                       data = dataset, family = "multinomial",
                       fitfunction = "multinom", k = 10)
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1081.842250
#> iter  20 value 1079.724677
#> iter  30 value 1079.709082
#> final  value 1079.708962 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1084.321143
#> iter  20 value 1075.145510
#> iter  30 value 1075.060898
#> final  value 1075.059122 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1086.065933
#> iter  20 value 1080.654708
#> iter  30 value 1080.617651
#> final  value 1080.617432 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1085.721140
#> iter  20 value 1082.392343
#> iter  30 value 1082.378604
#> final  value 1082.378515 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1083.460043
#> iter  20 value 1076.975921
#> iter  30 value 1076.930816
#> final  value 1076.930268 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1086.343504
#> iter  20 value 1083.483907
#> iter  30 value 1083.429147
#> final  value 1083.428788 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1086.383304
#> iter  20 value 1079.248191
#> iter  30 value 1079.131777
#> final  value 1079.129705 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1087.658835
#> iter  20 value 1079.056688
#> iter  30 value 1078.843593
#> final  value 1078.841904 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1080.330228
#> iter  20 value 1073.171327
#> iter  30 value 1072.982368
#> final  value 1072.981502 
#> converged
#> # weights:  63 (40 variable)
#> initial  value 1098.612289 
#> iter  10 value 1082.581872
#> iter  20 value 1079.435676
#> iter  30 value 1079.288356
#> final  value 1079.287908 
#> converged
# Output
mmodel
#>                Estimate.1   Estimate.2 Odds Ratio.1 Odds Ratio.2
#> (Intercept) -0.2853539188 -0.533750309    0.7517481    0.5864017
#> pred_1       0.0038416692  0.002793053    1.0038491    1.0027970
#> pred_2       0.0082891106 -0.009496117    1.0083236    0.9905488
#> pred_31      0.0218560016  0.098556010    1.0220966    1.1035762
#> pred_51      0.1041197912  0.107943130    1.1097334    1.1139844
#> pred_52      0.0185882466  0.034426012    1.0187621    1.0350254
#> pred_53      0.2397209895  0.080935362    1.2708945    1.0843008
#> pred_54      0.0938156731 -0.031423995    1.0983573    0.9690646
#> pred_55      0.1328703482  0.033111908    1.1421019    1.0336662
#> pred_56      0.2595421380  0.084977571    1.2963364    1.0886926
#> pred_57     -0.0258373204  0.006160637    0.9744936    1.0061797
#> pred_58      0.1254919206  0.064560739    1.1337060    1.0666904
#> pred_59      0.1020629948  0.024347754    1.1074532    1.0246466
#> pred_510     0.1031342956 -0.001880707    1.1086403    0.9981211
#> pred_511     0.1036721356  0.098304574    1.1092367    1.1032988
#> pred_512     0.1629845156 -0.009769957    1.1770185    0.9902776
#> pred_513     0.0560079157  0.033235440    1.0576061    1.0337939
#> pred_514    -0.0114576021  0.056310921    0.9886078    1.0579266
#> pred_515    -0.0323858251 -0.020269661    0.9681330    0.9799344
#> pred_6      -0.0004666329  0.007235091    0.9995335    1.0072613
#>             standard error.1 standard error.2   z value.1   z value.2
#> (Intercept)      0.352355572      0.351293926 -0.80984648 -1.51938383
#> pred_1           0.002519101      0.002511803  1.52501623  1.11197121
#> pred_2           0.011828815      0.011781775  0.70075579 -0.80600053
#> pred_31          0.050219507      0.050048272  0.43520940  1.96921904
#> pred_51          0.143050124      0.140461598  0.72785530  0.76848855
#> pred_52          0.141289751      0.138335472  0.13156118  0.24885889
#> pred_53          0.141463492      0.141642918  1.69457848  0.57140423
#> pred_54          0.141581660      0.141524963  0.66262589 -0.22203853
#> pred_55          0.141310183      0.140318969  0.94027440  0.23597599
#> pred_56          0.140523643      0.140900389  1.84696420  0.60310388
#> pred_57          0.142934906      0.139626085 -0.18076285  0.04412239
#> pred_58          0.143771676      0.141982106  0.87285566  0.45471039
#> pred_59          0.141325495      0.139770477  0.72218388  0.17419812
#> pred_510         0.142115285      0.142460092  0.72570868 -0.01320164
#> pred_511         0.141672817      0.138927665  0.73177154  0.70759538
#> pred_512         0.141233081      0.142472377  1.15401090 -0.06857440
#> pred_513         0.140878949      0.138964031  0.39756057  0.23916578
#> pred_514         0.144333825      0.140790079 -0.07938265  0.39996370
#> pred_515         0.143521586      0.140359894 -0.22565125 -0.14441206
#> pred_6           0.004998638      0.004979892 -0.09335200  1.45286115
#>             Pr(>|z|).1 Pr(>|z|).2 95% lower CI.1 95% lower CI.2 95% upper CI.1
#> (Intercept) 0.41802842 0.12866591   -0.975958149  -1.2222737520    0.405250311
#> pred_1      0.12725505 0.26615053   -0.001095677  -0.0021299910    0.008779015
#> pred_2      0.48345543 0.42024255   -0.014894941  -0.0325879707    0.031473162
#> pred_31     0.66341044 0.04892794   -0.076572423   0.0004631993    0.120284427
#> pred_51     0.46670217 0.44219699   -0.176253300  -0.1673565430    0.384492883
#> pred_52     0.89533139 0.80346994   -0.258334577  -0.2367065310    0.295511070
#> pred_53     0.09015541 0.56772567   -0.037542360  -0.1966796557    0.516984339
#> pred_54     0.50757019 0.82428389   -0.183679281  -0.3088078254    0.371310627
#> pred_55     0.34707683 0.81345130   -0.144092521  -0.2419082168    0.409833218
#> pred_56     0.06475233 0.54643959   -0.015879141  -0.1911821164    0.534963417
#> pred_57     0.85655373 0.96480684   -0.305984588  -0.2675014606    0.254309947
#> pred_58     0.38274176 0.64931761   -0.156295386  -0.2137190764    0.407279227
#> pred_59     0.47018143 0.86170976   -0.174929886  -0.2495973465    0.379055875
#> pred_510    0.46801738 0.98946692   -0.175406544  -0.2810973573    0.381675135
#> pred_511    0.46430802 0.47919656   -0.174001483  -0.1739886449    0.381345755
#> pred_512    0.24849570 0.94532840   -0.113827237  -0.2890106853    0.439796268
#> pred_513    0.69095413 0.81097704   -0.220109750  -0.2391290552    0.332125581
#> pred_514    0.93672827 0.68918326   -0.294346701  -0.2196325644    0.271431497
#> pred_515    0.82147268 0.88517510   -0.313682965  -0.2953699988    0.248911315
#> pred_6      0.92562392 0.14626231   -0.010263784  -0.0025253173    0.009330518
#>             95% upper CI.2
#> (Intercept)    0.154773134
#> pred_1         0.007716097
#> pred_2         0.013595738
#> pred_31        0.196648820
#> pred_51        0.383242804
#> pred_52        0.305558555
#> pred_53        0.358550379
#> pred_54        0.245959836
#> pred_55        0.308132032
#> pred_56        0.361137259
#> pred_57        0.279822734
#> pred_58        0.342840554
#> pred_59        0.298292855
#> pred_510       0.277335943
#> pred_511       0.370597794
#> pred_512       0.269470771
#> pred_513       0.305599936
#> pred_514       0.332254406
#> pred_515       0.254830677
#> pred_6         0.016995500