Friday, March 23, 2012

Mining Content Viewer for Linear Regression: Node Distribution output

With the number of threads it is difficult to know if this has been posted. If I use the Mining Content Viewer for Linear Regression, under Node Distribution, there are values given for Attribute Name, Attribute Value, Support, Probability, Variance, and Value Type. The output is similar to what Joris supplied in his thread about Predict Probability in Decision Trees. My questions:

1. How should these fields be interpreted?

2. With Linear Regression, is it possible to get the coefficient values and tests of significance (t-tests?), if they are not part of the output I have pointed to?

Thanks for your help with this.

Sam

The interpretation of the NODE_DISTRIBUTION rows depends mainly on the VALUE TYPE column.

To exemplify the values, here is the distribution of one node from applying regression to the Iris data set. The target is PetalWidth, with SepalLength, SepalWidth and PetalLength as regressors:

- Two rows of the distribution describe the continuous target attribute. They can be recognized by their value type. The row with value type 1 (Missing) holds the statistics for the Missing state of the target attribute in the current node, while the row with value type 3 (Continuous) holds the statistics for the Existing state of the target attribute. If you do not have gaps in your data, then you can ignore the row with value type 1. For the row with value type 3, ATTRIBUTE_NAME is the name of the target attribute (PetalWidth in my example) and ATTRIBUTE_VALUE is the mean of PetalWidth. You also get the support and variance. Support is the number of training cases in this node; the mean and variance are computed only over the training cases that ended up in this node.

- For each regressor, there are 3 distribution rows, with value types 7 (Coefficient), 8 (Score gain) and 9 (Statistics) respectively. For all 3 rows, ATTRIBUTE_NAME is the name of the regressor. Then:

For the row with value type 7 (Coefficient), ATTRIBUTE_VALUE is the regression coefficient associated with the regressor ('a' in y = ax + b).
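
To make the row layout described above concrete, here is a minimal Python sketch that groups the distribution rows of one node by value type. It assumes the rows have already been exported from the viewer into plain dictionaries; the key names (`ATTRIBUTE_NAME`, `ATTRIBUTE_VALUE`, `SUPPORT`, `VARIANCE`, `VALUETYPE`), the helper `summarize_node`, and every number in `sample_rows` are illustrative placeholders, not actual output from the Iris model.

```python
from collections import defaultdict

# Value-type codes as named in the answer above.
VALUE_TYPES = {
    1: "Missing",
    3: "Continuous",
    7: "Coefficient",
    8: "Score gain",
    9: "Statistics",
}


def summarize_node(rows):
    """Split one node's distribution rows into target stats and per-regressor entries.

    `rows` is a list of dicts; the key names are an assumption about how
    the rows were exported, not a documented API.
    """
    target = {}
    regressors = defaultdict(dict)
    for row in rows:
        kind = VALUE_TYPES.get(row["VALUETYPE"])
        if kind == "Continuous":
            # Target row: ATTRIBUTE_VALUE is the mean of the target in this node.
            target = {
                "name": row["ATTRIBUTE_NAME"],
                "mean": row["ATTRIBUTE_VALUE"],
                "support": row["SUPPORT"],
                "variance": row["VARIANCE"],
            }
        elif kind in ("Coefficient", "Score gain", "Statistics"):
            # Regressor rows: keyed by regressor name, then by row kind.
            regressors[row["ATTRIBUTE_NAME"]][kind] = row["ATTRIBUTE_VALUE"]
        # The Missing row (value type 1) is skipped, as the answer suggests
        # for data without gaps.
    return target, dict(regressors)


# Placeholder rows: the numbers are made up purely to exercise the function.
sample_rows = [
    {"ATTRIBUTE_NAME": "PetalWidth", "ATTRIBUTE_VALUE": 0.0, "SUPPORT": 0, "VARIANCE": 0.0, "VALUETYPE": 1},
    {"ATTRIBUTE_NAME": "PetalWidth", "ATTRIBUTE_VALUE": 1.2, "SUPPORT": 150, "VARIANCE": 0.5, "VALUETYPE": 3},
    {"ATTRIBUTE_NAME": "PetalLength", "ATTRIBUTE_VALUE": 0.4, "SUPPORT": 0, "VARIANCE": 0.0, "VALUETYPE": 7},
    {"ATTRIBUTE_NAME": "PetalLength", "ATTRIBUTE_VALUE": 10.0, "SUPPORT": 0, "VARIANCE": 0.0, "VALUETYPE": 8},
    {"ATTRIBUTE_NAME": "PetalLength", "ATTRIBUTE_VALUE": 3.7, "SUPPORT": 0, "VARIANCE": 0.0, "VALUETYPE": 9},
]

target, regressors = summarize_node(sample_rows)
print(target)
print(regressors)
```

Grouping by regressor name keeps the three rows that belong to one regressor together, which makes it easier to read off the coefficient for each input.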
