How many indicator variables
Here are a few common examples of binary predictor variables that you are likely to encounter in your own research: gender (male, female), smoking status (smoker, nonsmoker), treatment (yes, no), health status (diseased, healthy), and company status (private, public). Example: On average, do smoking mothers have babies with lower birth weight?

The researchers collected the following data: the response y is birth weight (Weight, in grams of the baby); potential predictor x1 is length of gestation (Gest, in weeks); and potential predictor x2 is the smoking status of the mother (smoker or non-smoker). In order to include a qualitative variable in a regression model, we have to "code" the variable, that is, assign a unique number to each of the possible categories.
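As a sketch of this coding step, here is how the 0/1 indicator could be created in Python with pandas. The data values below are made up for illustration, not the study's actual measurements:

```python
import pandas as pd

# Hypothetical slice of the birth-weight data (values are illustrative).
df = pd.DataFrame({
    "Weight": [3250, 2980, 3410, 2870],   # birth weight in grams
    "Gest":   [39, 38, 40, 37],           # length of gestation in weeks
    "Smoke":  ["smoker", "nonsmoker", "nonsmoker", "smoker"],
})

# Code the qualitative predictor: 1 if the mother smokes, 0 otherwise.
df["Smoking"] = (df["Smoke"] == "smoker").astype(int)
print(df["Smoking"].tolist())  # [1, 0, 0, 1]
```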

Upon fitting our formulated regression model to our data, statistical software reports the estimated regression equation. Unfortunately, this output doesn't precede the phrase "regression equation" with the adjective "estimated" to emphasize that we've only obtained an estimate of the actual unknown population regression function.

But anyway, if we set Smoking once equal to 0 and once equal to 1, we obtain, as hoped, two distinct estimated lines. Now, let's use our model and analysis to answer the following research question: is there a significant difference in mean birth weights for the two groups, after taking into account length of gestation?
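To see why a single 0/1 predictor produces two parallel lines, consider an estimated equation of the form Weight = b0 + b1·Gest + b2·Smoking. The coefficients below are made-up placeholders, not the study's actual estimates:

```python
# Placeholder coefficients for illustration only (not the fitted values).
b0, b1, b2 = -2390.0, 143.0, -245.0

def predicted_weight(gest, smoking):
    """Estimated mean birth weight for a given gestation and smoking status."""
    return b0 + b1 * gest + b2 * smoking

# Non-smokers: Weight = b0 + b1*Gest
# Smokers:     Weight = (b0 + b2) + b1*Gest  (same slope, shifted intercept)
gap = predicted_weight(38, 1) - predicted_weight(38, 0)
print(gap)  # -245.0, the constant vertical gap between the two lines
```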

Well, that's easy enough! But what about when a predictor is a categorical variable taking only two values (e.g., "yes" and "no")? Such a variable might arise, for example, when forecasting daily sales and you want to take account of whether the day is a public holiday or not. A dummy variable can also be used to account for an outlier in the data. Rather than omit the outlier, a dummy variable removes its effect. In this case, the dummy variable takes value 1 for that observation and 0 everywhere else.
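A minimal sketch of such an outlier dummy, assuming for illustration that the third observation is the one to be absorbed:

```python
import numpy as np

# Dummy that is 1 only for one known outlier (or special-event observation).
n_obs = 5
outlier_index = 2          # assumed position of the outlier
dummy = np.zeros(n_obs, dtype=int)
dummy[outlier_index] = 1
print(dummy.tolist())  # [0, 0, 1, 0, 0]
```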

An example is the case where a special event has occurred. For example, when forecasting tourist arrivals to Brazil, we will need to account for the effect of the Rio de Janeiro summer Olympics in 2016. If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories). There is usually no need to create the corresponding dummy variables manually. Suppose that we are forecasting daily data and we want to account for the day of the week as a predictor.

Then the following dummy variables can be created. Notice that only six dummy variables are needed to code seven categories. That is because the seventh category (in this case, Sunday) is captured by the intercept, and is specified when the dummy variables are all set to zero.
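With pandas, `get_dummies(..., drop_first=True)` produces one fewer column than there are categories. Note that it drops whichever category sorts first alphabetically, not necessarily Sunday, so the baseline may differ from the scheme described above:

```python
import pandas as pd

# One full week of daily data.
days = pd.date_range("2024-01-01", periods=7, freq="D")
dow = pd.Series(days.day_name(), name="day")

# Six dummy columns for seven categories; the dropped one is the baseline.
dummies = pd.get_dummies(dow, drop_first=True)
print(dummies.shape)  # (7, 6)
```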

Many beginners will try to add a seventh dummy variable for the seventh category, but then there will be one too many parameters to estimate when an intercept is also included; this is known as the dummy variable trap. The general rule is to use one fewer dummy variable than categories.
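The problem can be seen directly in the design matrix: with an intercept plus one dummy per category, the dummy columns sum to the intercept column, so the matrix is rank-deficient and least squares has no unique solution. A small numpy check on simulated day-of-week codes:

```python
import numpy as np

n_days, n_cats = 14, 7
categories = np.arange(n_days) % n_cats        # repeating day-of-week codes
full_dummies = np.eye(n_cats)[categories]      # one dummy column per category

X_bad = np.column_stack([np.ones(n_days), full_dummies])        # 8 columns
X_ok = np.column_stack([np.ones(n_days), full_dummies[:, 1:]])  # drop one

print(np.linalg.matrix_rank(X_bad))  # 7: rank-deficient (8 columns, rank 7)
print(np.linalg.matrix_rank(X_ok))   # 7: full column rank
```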

So for quarterly data, use three dummy variables; for monthly data, use 11 dummy variables; for daily data, use six dummy variables; and so on. The coefficient associated with each dummy variable is interpreted as a measure of the effect of that category relative to the omitted category. One rule of thumb is to have at least 15 subjects per parameter in the model. How many observations you have at each level could also be a consideration.
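The relative-to-baseline interpretation can be verified on simulated data: with only an intercept and one dummy, the intercept estimates the baseline category's mean and the dummy coefficient estimates the difference in means. Group means of 10 and 13 are assumed here purely for the check:

```python
import numpy as np

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)                 # 0 = omitted baseline category
y = np.where(group == 0, 10.0, 13.0) + rng.normal(0, 0.1, 100)

X = np.column_stack([np.ones(100), group])    # intercept + one dummy
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 1))  # approximately [10. 3.]: baseline mean and the gap
```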

So, if you do not have enough observations, you could consider regularization. For categorical variables, the fused lasso is one idea; see the question "Principled way of collapsing categorical variables with many levels?". You can include as many dummy variables as you want, but doing so makes interpretation of the model coefficients a bit more complex. You can also check whether all the levels of the variable really need to be included in the model.
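One simple form of pooling is to lump all rare levels into a single "Other" bucket before creating dummies. The helper below and its threshold are illustrative choices, not a method prescribed in the text:

```python
import pandas as pd

def pool_rare_levels(s: pd.Series, min_count: int = 3) -> pd.Series:
    """Replace levels with fewer than min_count observations by 'Other'."""
    counts = s.value_counts()
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), "Other")

s = pd.Series(["a", "a", "a", "b", "b", "b", "c", "d"])
pooled = pool_rare_levels(s)
print(sorted(pooled.unique()))  # ['Other', 'a', 'b']
```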

This depends on the amount of data you have. In general, you might want to consider pooling some similar levels.

Values for IQ and X1 are known inputs from the data table. The only unknowns on the right side of the equation are the regression coefficients, which we will estimate through least-squares regression. The remaining material assumes familiarity with topics covered in previous lessons.

The first task in our analysis is to estimate the coefficients in our regression equation. Excel does all the hard work behind the scenes and displays the result in a regression coefficients table. For now, the key outputs of interest are the least-squares estimates of the regression coefficients, which allow us to fully specify our regression equation.

This is the only linear equation that satisfies a least-squares criterion. That means this equation fits the data from which it was created better than any other linear equation. The fact that our equation fits the data better than any other linear equation does not guarantee that it fits the data well. We still need to ask: How well does our equation fit the data?
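This minimizing property can be checked numerically on simulated data: perturbing the least-squares coefficients increases the sum of squared errors. The dataset below is made up solely for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, 30)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def sse(b):
    """Sum of squared errors for coefficient vector b."""
    return np.sum((y - X @ b) ** 2)

print(sse(beta) < sse(beta + 0.1))  # True: the perturbed coefficients fit worse
```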

To answer this question, researchers look at the coefficient of multiple determination, R². When the regression equation fits the data well, R² will be large (i.e., close to 1). Luckily, the coefficient of multiple determination is a standard output of Excel and most other analysis packages. Here is what Excel reports for R² for our equation.
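R² can also be computed by hand as 1 − SSE/SST, the share of the variation in y explained by the fitted equation. On the simulated low-noise data below (illustrative, not the lesson's data) it comes out close to 1:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.2, 50)   # low-noise linear relationship

X = np.column_stack([np.ones(50), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# R^2 = 1 - SSE/SST
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2 > 0.9)  # True: nearly all the variation in y is explained
```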

The coefficient of multiple determination is 0. Translation: our equation fits the data pretty well. At this point, we'd like to assess the relative importance of our independent variables. We do this by testing the statistical significance of the regression coefficients. Before we conduct those tests, however, we need to assess multicollinearity between the independent variables. If multicollinearity is high, significance tests on the regression coefficients can be misleading.

But if multicollinearity is low, the same tests can be informative. To measure multicollinearity for this problem, we can try to predict IQ based on Gender.
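One way to quantify this check is the variance inflation factor: regress IQ on Gender, take the R² of that auxiliary regression, and compute VIF = 1/(1 − R²). The data below are simulated with IQ unrelated to Gender, so the VIF lands near 1, indicating low multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(3)
gender = rng.integers(0, 2, 200)          # 0/1 dummy predictor
iq = 100 + rng.normal(0, 15, 200)         # simulated IQ, independent of gender

# Auxiliary regression of IQ on Gender.
X = np.column_stack([np.ones(200), gender])
beta, *_ = np.linalg.lstsq(X, iq, rcond=None)
resid = iq - X @ beta
r2 = 1 - np.sum(resid ** 2) / np.sum((iq - iq.mean()) ** 2)

vif = 1 / (1 - r2)                        # VIF = 1 / (1 - R^2)
print(vif < 2)  # True: little multicollinearity between IQ and Gender
```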


