SEARCH -- SEARCHING FOR STRUCTURE
SEARCH is a binary segmentation procedure used to develop a predictive model for a dependent variable. It searches among a set of predictor variables for those that most increase the researcher’s ability to account for the variance or distribution of the dependent variable. The question “What dichotomous split on which single predictor variable will give the maximum improvement in our ability to predict values of the dependent variable?”, embedded in an iterative scheme, is the basis for the algorithm used in this command.
SEARCH divides the sample, through a series of binary splits, into a set of mutually exclusive subgroups. The subgroups are chosen so that, at each step in the procedure, the split into the two new subgroups accounts for more of the variance or distribution (reduces the predictive error more) than a split into any other pair of subgroups. The predictor variables may be ordinally or nominally scaled; the dependent variable may be continuous or categorical.
Research questions are often of the type “What is the effect of X on Y?” But the answer requires answering a larger question: “What set of variables, and what combinations of them, seem to affect Y?” With SEARCH, a variable X that seems to have an overall effect may have its apparent influence disappear after a few splits: the final groups, while varying greatly in their levels of Y, may show no effect of X. The implication is that, once other things are taken into account, X does not really affect Y.
Conversely, while X may seem to have no overall effect on Y, after splitting the sample into groups that take account of other powerful factors, there may be some groups in which X has a substantial effect. Think of the economists’ notion of the actor at the margin: a motivating factor might affect those not constrained or compelled by other forces. Those who, other things considered, have a 40-60 percent probability of acting might show a substantial response to some motivator. Or a group with a very high or very low likelihood of acting might be discouraged or encouraged by some motivator. But if X has no effect in any of the subgroups generated by SEARCH, one has pretty good evidence that it does not matter, even in an interactive way.
SEARCH makes a sequence of binary divisions of a dataset in such a way that each split maximally reduces the error variance or increases the information (chi-square or rank correlation). It finds the best split on each predictor and takes the best of the best. The process stops when additional splits are unlikely to improve predictions for a fresh sample or for the population, i.e., when the null probability of the split rises above some selected level (e.g., .05, .025, .01, or .005). Of course, since several possibilities have been tried for each of several predictors, the nominal null probability is clearly understated. Alternative stopping rules can be used in any combination: minimum group size, maximum number of splits, minimum reduction in explained variance relative to the original total, or maximum null probability.
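That iterative scheme can be sketched in a few lines of code. The following Python fragment is a minimal sketch of the procedure under the means criterion described below, not the SEARCH implementation itself: it treats every predictor as ordinal (splitting at a cut point rather than freely grouping nominal codes), applies only the minimum-group-size and maximum-number-of-splits stopping rules, and its function names and default values are invented for the illustration.

    import numpy as np

    def best_split(y, x, min_size):
        # Best dichotomous split on one ordinal predictor: the cut point
        # that maximizes the reduction in error variance (the between-group
        # sum of squares).
        best_bss, best_cut = 0.0, None
        for cut in np.unique(x)[:-1]:
            mask = x <= cut
            n1, n2 = int(mask.sum()), int((~mask).sum())
            if n1 < min_size or n2 < min_size:
                continue
            m, m1, m2 = y.mean(), y[mask].mean(), y[~mask].mean()
            bss = n1 * (m1 - m) ** 2 + n2 * (m2 - m) ** 2
            if bss > best_bss:
                best_bss, best_cut = bss, cut
        return best_bss, best_cut

    def search_means(y, X, min_size=25, max_splits=8):
        # Find the best split on each predictor within each current group,
        # take the best of the best, and repeat until no admissible split
        # remains or the maximum number of splits is reached.
        groups = [np.arange(len(y))]     # a group is an array of case indices
        for _ in range(max_splits):
            cands = []
            for gi, g in enumerate(groups):
                for j in range(X.shape[1]):
                    bss, cut = best_split(y[g], X[g, j], min_size)
                    if cut is not None:
                        cands.append((bss, gi, j, cut))
            if not cands:
                break                    # stopping rule: nothing left to split
            bss, gi, j, cut = max(cands)
            g = groups.pop(gi)
            groups += [g[X[g, j] <= cut], g[X[g, j] > cut]]
        return groups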
Splitting criteria:
There are four splitting criteria, based on the type of the dependent variable:
Means
Regressions
Classifications
Ranks
The splitting criterion in each case is the reduction in ignorance (error variance, etc.) or the increase in information. Terms like “classification and regression trees” might better be replaced by “binary segmentation,” “unrestricted analysis of variance components,” or “searching for structure.” With rich bodies of data, many possible non-linearities and non-additivities, and many competing theories, the usual restrictions and assumptions that one is testing a single model are not appropriate. What does remain, however, is a systematic, pre-stated search strategy that is reproducible, not a free ransacking.
Means. For means the splitting criterion is the reduction in error variance, that is, the sum of squares around the mean, using two subgroup means instead of one parent group mean.
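A tiny worked illustration with invented values: a parent group {4, 6, 8, 10} has a sum of squares of 20 around its mean of 7; split into {4, 6} and {8, 10}, the subgroup sums of squares are 2 + 2 = 4, so the split explains 16.

    import numpy as np

    y = np.array([4., 6., 8., 10.])     # parent group (invented values)
    left, right = y[:2], y[2:]          # one candidate dichotomous split

    def ss(v):                          # sum of squares around the mean
        return ((v - v.mean()) ** 2).sum()

    explained = ss(y) - (ss(left) + ss(right))   # 20.0 - (2.0 + 2.0) = 16.0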
Regressions. For regressions (y=a+bx) the splitting criterion is the reduction in error variance from using two regressions rather than one.
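A minimal sketch of this criterion, assuming numpy arrays and a 0/1 boolean mask marking one candidate subgroup; the function names are illustrative. Comparing one pooled regression with two separate ones is the idea behind the test in Chow (1960), cited below.

    import numpy as np

    def rss(y, x):
        # Residual sum of squares around the least-squares line y = a + b*x.
        b, a = np.polyfit(x, y, 1)
        return ((y - (a + b * x)) ** 2).sum()

    def regression_split_gain(y, x, mask):
        # Reduction in error variance from fitting separate regressions in
        # the two candidate subgroups instead of one in the parent group.
        return rss(y, x) - (rss(y[mask], x[mask]) + rss(y[~mask], x[~mask]))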
Classifications (Chi option). For classifications (categorical dependent variable), the splitting criterion is the likelihood-ratio chi-square for dividing the parent group into two subgroups.
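A minimal illustration with invented counts, using SciPy’s power-divergence option to obtain the likelihood-ratio statistic (one way to compute the criterion, not necessarily how SEARCH computes it):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: the two candidate subgroups; columns: categories of the
    # dependent variable (counts invented for illustration).
    table = np.array([[30, 10, 5],
                      [12, 25, 18]])
    g2, p, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
    # g2 is the likelihood-ratio chi-square; p is its null probability.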
Ranks (Tau option). For rankings (ordered dependent variable), the splitting criterion is Kendall’s tau-b, a rank correlation measure.
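A minimal illustration with invented data; SciPy’s kendalltau computes the tau-b variant by default:

    import numpy as np
    from scipy.stats import kendalltau

    # 0/1 indicator of one candidate subgroup, against the ordered
    # dependent variable (invented data).
    side = np.array([0, 0, 0, 1, 1, 1, 1, 0])
    rank = np.array([1, 2, 2, 3, 4, 4, 5, 1])
    tau, p = kendalltau(side, rank)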
The major components of output:
The analysis of variance or distribution on final groups (except for “analysis=tau”)
The split summary
The final group summary
Summary table of best splits for each predictor for each group (except for “analysis=tau”)
The predictor summary table. You may request the first group (PRINT=FIRST), the final groups (PRINT=FINAL), or all groups (PRINT=TABLE). The tables are printed in reverse group order, i.e., last group first and first group last.
Group Tree Structure
A structure table with entries for each group, numbered in order and indented, so that one can easily see the pedigree of each final group and its detail.
References:
Agresti, Alan (1996), An Introduction to Categorical Data Analysis, New York: John Wiley & Sons, Inc.
Chow, G. (1960), “Tests of Equality between Sets of Coefficients in Two Linear Regressions,” Econometrica, 28:591-605.
Dunn, Olive Jean, and Virginia A. Clark (1974), Applied Statistics: Analysis of Variance and Regression, New York: Holt, Rinehart and Winston.
Gibbons, Jean Dickinson (1997), Nonparametric Methods for Quantitative Analysis, 3rd edition, Syracuse: American Sciences Press.
Hays, William (1988), Statistics, 4th edition, New York: Holt, Rinehart, & Winston.
Klem, Laura (1974), “Formulas and Statistical References,” in Osiris III, Volume 5, Ann Arbor: Institute for Social Research.
Sonquist, J. A., E. L. Baker and J. N. Morgan (1974), Searching for Structure, revised edition, Ann Arbor: Institute for Social Research, The University of Michigan.
Example: Investigating income (V268)
ANALYSIS TYPE: MEANS
Dependent variable: 268 Income
Predictor variables: 32 37 251 30
The number of cases is 326
The partitioning ends with 9 final groups
The variation explained is 38.2 percent
One-way Analysis of Final Groups
Source       Variation      DF
Explained    .701177E+10      8
Error        .113438E+11    317
Total        .183555E+11    325
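(The 38.2 percent of variation explained is the ratio of the Explained to the Total sum of squares: .701177E+10 / .183555E+11 = .382.)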
Split Summary Table
Group 1, N=326
Mean(Y)=10451.0, Var(Y)=.564786E+08, Variation=.183555E+11
Split on V37: RACE, Var expl=.216040E+08, Significance=.544344
Into Group 2, Codes 1
And Group 3, Codes 0,2-9
Group 2, N=299
Mean(Y)=10528.3, Var(Y)=.570540E+08, Variation=.170021E+11
Split on V30: MARITAL STATUS, Var expl=.312812E+10, Significance=0.000100
Into Group 4, Codes 1
And Group 5, Codes 2-5
Group 4, N=221
Mean(Y)=12449.9, Var(Y)=.571999E+08, Variation=.125840E+11
Split on V32: EDUC OF HEAD, Var expl=.173944E+10, Significance=0.000100
Into Group 6, Codes 1-5
And Group 7, Codes 6-8
Group 6, N=171
Mean(Y)=10932.9, Var(Y)=.430128E+08, Variation=.731217E+10
Split on V251: OCCUPATION B, Var expl=.140900E+10, Significance=0.000100
Into Group 8, Codes 0
And Group 9, Codes 1-9
Group 9, N=142
Mean(Y)=12230.1, Var(Y)=.402303E+08, Variation=.567247E+10
Split on V251: OCCUPATION B, Var expl=.423362E+09, Significance=0.001380
Into Group 10, Codes 1-3
And Group 11, Codes 4-9
Group 11, N=115
Mean(Y)=11393.4, Var(Y)=.249652E+08, Variation=.284603E+10
Split on V251: OCCUPATION B, Var expl=.495146E+08, Significance=.156284
Into Group 12, Codes 4-6
And Group 13, Codes 7-9
Group 12, N=69
Mean(Y)=11929.2, Var(Y)=.212965E+08, Variation=.144816E+10
Split on V32: EDUC OF HEAD, Var expl=.571610E+08, Significance=0.097853
Into Group 14, Codes 1-3
And Group 15, Codes 4,5
Group 5, N=78
Mean(Y)=5083.86, Var(Y)=.167531E+08, Variation=.128999E+10
Split on V251: OCCUPATION B, Var expl=.183562E+09, Significance=0.000992
Into Group 16, Codes 0
And Group 17, Codes 1,2,4-9
Final Group Summary Table
Group 3, N=27
Mean(Y)=9594.30, Var(Y)=.512249E+08, Variation=.133185E+10
Group 7, N=50
Mean(Y)=17638.2, Var(Y)=.720890E+08, Variation=.353236E+10
Group 8, N=29
Mean(Y)=4580.97, Var(Y)=.823915E+07, Variation=.230696E+09
Group 10, N=27
Mean(Y)=15793.6, Var(Y)=.924261E+08, Variation=.240308E+10
Group 13, N=46
Mean(Y)=10589.8, Var(Y)=.299634E+08, Variation=.134835E+10
Group 14, N=28
Mean(Y)=13030.6, Var(Y)=.309307E+08, Variation=.835128E+09
Group 15, N=41
Mean(Y)=11177.0, Var(Y)=.138968E+08, Variation=.555873E+09
Group 16, N=35
Mean(Y)=3383.49, Var(Y)=.515942E+07, Variation=.175420E+09
Group 17, N=43
Mean(Y)=6467.88, Var(Y)=.221668E+08, Variation=.931006E+09
Percent Total Variation Explained by Best Split for Each Group (*=Final Groups)
1 2 3* 4 5 6 7* 8* 9 10*
V32 12.00 11.90 0.00 9.48 0.86 3.62 0.00 0.00 0.68 0.00
V37 0.12 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
V251 18.12 16.90 0.00 9.14 1.00 7.68 0.00 0.00 2.31 0.00
V30 17.92 17.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Percent Total Variation Explained by Best Split for Each Group (*=Final Groups) – continued
11 12 13* 14* 15* 16* 17*
V32 0.16 0.31 0.00 0.00 0.00 0.00 0.00
V37 0.00 0.00 0.00 0.00 0.00 0.00 0.00
V251 0.27 0.01 0.00 0.00 0.00 0.00 0.00
V30 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Group Tree Structure
Group 1: All Cases
N=326, Mean(Y)=10451.0
  Group 2 V37: RACE, Codes 1
  N=299, Mean(Y)=10528.3
    Group 4 V30: MARITAL STATUS, Codes 1
    N=221, Mean(Y)=12449.9
      Group 6 V32: EDUC OF HEAD, Codes 1-5
      N=171, Mean(Y)=10932.9
        Group 8 V251: OCCUPATION B, Codes 0
        N=29, Mean(Y)=4580.97
        Group 9 V251: OCCUPATION B, Codes 1-9
        N=142, Mean(Y)=12230.1
          Group 10 V251: OCCUPATION B, Codes 1-3
          N=27, Mean(Y)=15793.6
          Group 11 V251: OCCUPATION B, Codes 4-9
          N=115, Mean(Y)=11393.4
            Group 12 V251: OCCUPATION B, Codes 4-6
            N=69, Mean(Y)=11929.2
              Group 14 V32: EDUC OF HEAD, Codes 1-3
              N=28, Mean(Y)=13030.6
              Group 15 V32: EDUC OF HEAD, Codes 4,5
              N=41, Mean(Y)=11177.0
            Group 13 V251: OCCUPATION B, Codes 7-9
            N=46, Mean(Y)=10589.8
      Group 7 V32: EDUC OF HEAD, Codes 6-8
      N=50, Mean(Y)=17638.2
    Group 5 V30: MARITAL STATUS, Codes 2-5
    N=78, Mean(Y)=5083.86
      Group 16 V251: OCCUPATION B, Codes 0
      N=35, Mean(Y)=3383.49
      Group 17 V251: OCCUPATION B, Codes 1,2,4-9
      N=43, Mean(Y)=6467.88
  Group 3 V37: RACE, Codes 0,2-9
  N=27, Mean(Y)=9594.30