Unordered Multinomial Logistic Regression Model (text + data set)

First, teaching content

If the dependent variable is an unordered (nominal) multi-category variable, or if it is an ordered multi-category variable that does not satisfy the proportional odds assumption (P<0.05 in the test of parallel lines), unordered multinomial logistic regression can be used for the analysis. When the outcome variable is unordered and there is only one independent variable, which is categorical, the chi-square test can be used directly. When the outcome variable is ordered and there is only one independent variable, which is categorical, a non-parametric test can be used directly.

The unordered multinomial logistic regression model differs from the ordered one. Ordered logistic regression uses the cumulative logit model, which applies the logit transformation to the cumulative probabilities of the ordered levels of the dependent variable. Unordered multinomial logistic regression uses the generalized logit model, which takes the natural logarithm of the ratio of the probability of each level of the dependent variable (other than the reference level) to the probability of the reference level as the left-hand side of the model equation. When the number of levels is 2, the model is equivalent to binary logistic regression, so it can be regarded as an extension of the binary logistic regression model.

Suppose the dependent variable Y is an unordered categorical variable with n levels; unordered multinomial logistic regression then yields n-1 generalized logit equations. The probability of the reference level R is denoted πR, and the probability of level k (k = 1, 2, …, n) is πk, with π1 + π2 + … + πn = 1. There are m independent variables x, and the coefficient of the i-th independent variable xi (i = 1, 2, …, m) in the equation for level k is βki.
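The model equation described in words above can be written out as follows. This is the standard generalized logit form, reconstructed from the definitions in the text (reference level R, level k, m independent variables, coefficients βki); the intercept symbol β_{k0} and the shorthand η_k are notational assumptions, not taken from the original.

```latex
\[
\ln\frac{\pi_k}{\pi_R} = \beta_{k0} + \beta_{k1}x_1 + \beta_{k2}x_2 + \cdots + \beta_{km}x_m ,
\qquad k \neq R
\]
\[
\pi_R = \frac{1}{1 + \sum_{j \neq R} e^{\eta_j}}, \qquad
\pi_k = \frac{e^{\eta_k}}{1 + \sum_{j \neq R} e^{\eta_j}}, \qquad
\eta_k = \beta_{k0} + \sum_{i=1}^{m}\beta_{ki}x_i
\]
```

The second line follows from the first together with the constraint that the probabilities of all levels sum to 1.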

With four levels, for example, π1 + π2 + π3 + π4 = 1, and the equation comparing two non-reference levels can be obtained by subtracting the corresponding equations, for instance the equation for level 1 minus that for level 2, or the equation for level 2 minus that for level 3. Alternatively, the reference level can simply be changed.

Example: A researcher wanted to determine whether access to health knowledge among adult residents differs by community and gender. A total of 314 adults in 2 communities were surveyed. The results are shown in the table below. Variables were coded as follows: community (community A = 0, community B = 1), gender (male = 0, female = 1), access to health knowledge (traditional mass media = 1, Internet = 2, community publicity = 3). Fit a multinomial logistic regression model of residents' access to health knowledge on community and gender.
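Before turning to the SPSS steps, the two properties mentioned above (the level probabilities sum to 1, and subtracting two equations gives the logit between two non-reference levels) can be checked numerically. The sketch below uses three levels, with level 3 as the reference; the logit values are made up purely for illustration and are not estimates from the example.

```python
import math

# Hypothetical generalized logits ln(pi_k / pi_R) for one respondent.
# Level 3 is the reference level, so its logit is fixed at 0.
# The values 0.8 and 0.3 are invented for illustration only.
logits = {1: 0.8, 2: 0.3, 3: 0.0}

def level_probabilities(logits):
    """Convert generalized logits into the probability of each level."""
    denom = sum(math.exp(v) for v in logits.values())
    return {level: math.exp(v) / denom for level, v in logits.items()}

probs = level_probabilities(logits)

# The probabilities of all levels sum to 1.
total = sum(probs.values())

# Equation 1 minus equation 2 gives the logit of level 1 versus level 2.
logit_1_vs_2 = logits[1] - logits[2]
direct_1_vs_2 = math.log(probs[1] / probs[2])

print(probs)
print(total, logit_1_vs_2, direct_1_vs_2)
```

The subtracted logit and the logit computed directly from the two probabilities agree, which is why changing the reference level does not change the fitted model, only its parameterization.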

1. Data entry

2. Data weighting: Data >> Weight Cases…, then weight cases by [frequency].
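As a rough sketch of what the weighted data layout looks like, the snippet below enumerates one row per cell of the community × gender × access cross-tabulation, using the coding given in the example. The frequency of each cell is left as a placeholder (None) because the counts come from the results table, which is not reproduced here; the function and column names are illustrative assumptions.

```python
from itertools import product

# Coding taken from the variable assignment in the example.
COMMUNITY = {0: "community A", 1: "community B"}
GENDER = {0: "male", 1: "female"}
ACCESS = {1: "traditional mass media", 2: "Internet", 3: "community publicity"}

def empty_frequency_table():
    """Return one row per cell; 'frequency' is a placeholder to be filled in."""
    return [
        {"community": c, "gender": g, "access": a, "frequency": None}
        for c, g, a in product(COMMUNITY, GENDER, ACCESS)
    ]

rows = empty_frequency_table()
for row in rows:
    print(row)
```

Entering the data in this aggregated form is what makes the weighting step necessary: each row stands for as many respondents as its frequency.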

3. Analysis: Analyze >> Regression >> Multinomial Logistic…

[Dependent]: access to health knowledge. [Factor(s)]: community and gender; factors must be categorical variables. [Covariate(s)]: covariates are explanatory variables that are not of primary interest to the researcher in the study design but still affect the results; they can be categorical or continuous variables.

[Reference Category…]: By default, the reference category is the last category and the category order is ascending. In ascending order, the smallest value of the dependent variable is the first category; in descending order, the smallest value is the last category.

[Model]: Specifies the model to be fitted. The default is a main-effects model. A full factorial model (main effects plus interactions) or a custom model can also be fitted. If Custom/Stepwise is selected, the model terms can be customized and variables can be selected stepwise, similar to Block and Method in binary logistic regression. This example uses the default main-effects model.

[Statistics]: In addition to the defaults, select the information criteria (outputs AIC and BIC), cell probabilities, classification table and goodness-of-fit test. By default, the subpopulations used for the cell probabilities and the goodness-of-fit tests are defined by all factors and covariates.

[Criteria]: Mainly sets the iteration and convergence criteria. [Options]: Sets the entry and removal criteria and the test methods used for stepwise selection. [Save]: Saves new variables: [Estimated response probabilities], [Predicted category], [Predicted category probability] and [Actual category probability].

4. Results

[Case Processing Summary]: Gives the basic distribution of the cases analyzed.

[Model Fitting Information]: Compared with the initial model containing only the intercept, the AIC (Akaike information criterion), BIC (Bayesian information criterion) and -2 log-likelihood (-2LL) of the final model all decreased. The -2LL decreased from 80.877 to 36.821, a reduction of 44.056 (the likelihood ratio chi-square value), and the likelihood ratio chi-square test was statistically significant (P<0.001), indicating that at least one partial regression coefficient of the gender and community variables in the model is not 0.
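The likelihood ratio test above can be reproduced by hand from the two -2LL values. The sketch below is only an illustration: the degrees of freedom (4) are an assumption, derived as 2 binary factors × 2 non-reference outcome levels, and the P value uses the closed-form chi-square tail probability that holds for even degrees of freedom, so no statistics library is needed.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable, valid only for even df.

    For df = 2m the tail probability is exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

neg2ll_null = 80.877   # intercept-only model, from the output
neg2ll_final = 36.821  # final model, from the output
lr_chi2 = neg2ll_null - neg2ll_final

# Assumed df: 2 binary factors x 2 non-reference outcome levels = 4 parameters.
df = 4
p_value = chi2_sf_even_df(lr_chi2, df)

print(round(lr_chi2, 3), p_value)
```

Under that assumption the P value is far below 0.001, in line with the conclusion stated above.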

[Goodness-of-Fit]: Shows the results of the Pearson and deviance goodness-of-fit tests. Both tests compare the values predicted by the current model with the observed values in the sample. Here both P values are greater than 0.05, indicating a good fit. Note, however, that both tests require an adequate sample size within each combination of independent variable values. When there are many independent variables, or continuous independent variables, the results of these two tests are generally not used.

[Pseudo R-Square]: Outputs three pseudo coefficients of determination (Cox and Snell, Nagelkerke and McFadden). In the statistical analysis of categorical data, there is no need to pay much attention to these three statistics.

[Likelihood Ratio Tests]: The table shows the AIC, BIC and -2LL of the final model (identical to those in the [Model Fitting Information] table) and the AIC, BIC and -2LL of each reduced model (the final model with the effect of one independent variable removed). The chi-square statistic is the difference in -2LL between the reduced model and the final model. The results show that the contributions of both community and gender to the model are statistically significant.

[Parameter Estimates]: By default, SPSS takes the highest-valued level of the dependent variable as the reference level (here, community publicity). To use another level as the reference, either recode the levels of the dependent variable in the data or specify it through [Reference Category…]. For independent variables entered as factors, the highest-valued level is likewise the default reference level, and it can be changed by recoding the levels of the variable. If a variable is entered as a covariate instead, the lower value serves as the reference level by default. In this example, therefore, community B (community = 1) and female (gender = 1) are the reference levels. Their parameters are set to 0; these are the redundant parameters, which are generally of no interest to the researcher.

In the comparison of traditional mass media with community publicity, the regression coefficient of community A (community = 0) is negative, with P=0.001<0.05 and OR=0.370, so the coefficient of community A differs significantly from 0 (the coefficient of community B is fixed at 0). The negative coefficient indicates that, relative to community B, community A is less likely to obtain health knowledge through traditional mass media than through community publicity; in other words, community A is more inclined toward community publicity. With OR=0.370, the odds of traditional mass media versus community publicity in community A are 0.37 times those in community B. Put more naturally, community A is 2.70 times as likely as community B to obtain health knowledge through community publicity (1/0.370), or community B is 2.70 times as likely as community A to obtain it through traditional mass media. Strictly speaking, the OR should be stated as follows: the odds of choosing traditional mass media rather than community publicity in community B are 2.70 times those in community A. Similarly, relative to community publicity, men are more inclined than women to obtain health knowledge through traditional mass media (OR=3.410).

In the comparison of the Internet with community publicity, there was no statistically significant difference between community A and community B (Wald χ2=1.7, P=0.192>0.05), but men were more inclined than women to obtain health knowledge through the Internet (Wald χ2=8.126, P=0.004<0.05, OR=2.213).
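The arithmetic behind this reading of the odds ratio is short enough to check directly. In the sketch below, only OR=0.370 comes from the output; the coefficient is recovered from that rounded OR, so it may differ slightly from the value SPSS prints.

```python
import math

# Reported odds ratio: community A versus community B,
# traditional mass media versus community publicity.
or_comm_a = 0.370

# An odds ratio is exp(B), so an OR below 1 corresponds to a negative coefficient.
b_comm_a = math.log(or_comm_a)

# Swapping the comparison group (B versus A) inverts the odds ratio.
or_comm_b = 1.0 / or_comm_a

print(round(b_comm_a, 3), round(or_comm_b, 2))
```

The reciprocal gives the 2.70 quoted in the text, and the negative sign of the recovered coefficient matches the direction of the effect described above.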

If you want to compare traditional mass media and the Internet, you can directly subtract the corresponding model equations.
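For gender, this subtraction can be sketched from the two reported odds ratios, assuming that OR=3.410 is the male-versus-female odds ratio for traditional mass media versus community publicity and OR=2.213 is the one for the Internet versus community publicity. The community effect is not reproduced here because its Internet coefficient is not quoted in the text.

```python
import math

or_male_trad_vs_pub = 3.410  # traditional mass media vs community publicity
or_male_net_vs_pub = 2.213   # Internet vs community publicity (assumed reading)

# Coefficients are the natural logs of the odds ratios.
b_trad = math.log(or_male_trad_vs_pub)
b_net = math.log(or_male_net_vs_pub)

# Subtracting the two equations gives traditional mass media vs the Internet.
b_trad_vs_net = b_trad - b_net
or_trad_vs_net = math.exp(b_trad_vs_net)

print(round(b_trad_vs_net, 3), round(or_trad_vs_net, 3))
```

A positive difference means the point estimate favors traditional mass media over the Internet for men; the subtraction alone gives no standard error, so it says nothing about significance.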

From this it can be roughly judged that, compared with the Internet, community A is less inclined toward traditional mass media (that is, more inclined toward the Internet), while men are more inclined toward traditional mass media. Whether these differences are statistically significant, however, requires further testing. In the Multinomial Logistic Regression dialog box, use [Reference Category…] to set the reference category to Internet (Custom, Value = 2). This yields the following results, which are consistent with the calculation above. Their interpretation is omitted.

In addition, when a categorical independent variable has more than two levels, its dummy variables should follow the principle of entering and leaving the model together.

[Classification]: Cross-tabulates the observed and predicted categories. The diagonal cells are the numbers of correct classifications and the off-diagonal cells are the numbers of misclassifications. The prediction accuracy is only moderate and leaves room for improvement.

[Observed and Predicted Frequencies]: The observed and predicted frequencies are fairly close, indicating a good fit.
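The overall accuracy read off a classification table is simply the diagonal total divided by the grand total. The sketch below shows that calculation; the 3 × 3 counts are invented for illustration and are NOT the results of this example.

```python
def classification_accuracy(table):
    """Share of cases on the diagonal of a square observed-by-predicted table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return correct / total

# Made-up counts for illustration only; rows are observed categories,
# columns are predicted categories.
hypothetical_table = [
    [6, 2, 2],
    [3, 5, 2],
    [1, 3, 6],
]

accuracy = classification_accuracy(hypothetical_table)
print(round(accuracy, 3))
```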

Second, remarks

The relevant data have been uploaded to my resources. Download link: https://blog.csdn.net/TIQCmatlab?spm=1011.2124.3001.5343
