How do I create dummy variables?
A dummy variable is a variable that takes on the values 1 or 0 where 1 means some condition is true (such as age<30, gender is female, type of government is a dictatorship, ethnicity is Hispanic, etc.) and 0 means the condition is false.
Dummy variables for many categories of one variable
In this case, you want to create a dummy variable for each category of a variable, such as regions or races, to implement a fixed effects regression.
/* creates new variables, one for each category of the region variable */
tab region, gen(regiondum)
/* creates new variables, one for all but one category of the region variable. To avoid collinearity in your model, a categorical variable with k levels must produce k-1 dummies */
xi i.region
/* This will run a regression and create dummy variables in one step */
xi: regress Y X1 X2 i.region
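In Stata 11 and later, factor-variable notation offers another route to the same k-1 dummies without the xi prefix. The following is a minimal sketch, assuming Stata's bundled auto dataset; rep78, price and mpg are stand-ins for region, Y and X1, not variables from your own data.

```stata
* Load Stata's bundled example data (an assumption; use your own file instead)
sysuse auto, clear

* One 0/1 dummy per category of rep78: repdum1, repdum2, ...
tabulate rep78, generate(repdum)

* i.rep78 enters all categories but one as dummies; the base category
* is left out automatically, which avoids collinearity
regress price mpg i.rep78
```

The dummies created by i.rep78 are virtual, so no extra variables are left in the dataset after the regression.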
Dummy variables based on 2 or more variables or conditions
Suppose we need to create a dummy variable to indicate whether an observation is a high school dropout. We would need to code this based on two conditions, years of completed schooling and age:
gen dropouts = educ < 14 & age > 18
replace dropouts = . if age == . | educ == .
/* always tab your newly created variable to see if the expected number of 0s and 1s were assigned */
tab dropouts
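The replace step matters because Stata treats a missing numeric value as larger than any number, so age > 18 is true when age is missing. The same result can be had in one line with an if qualifier. The sketch below uses five invented observations to show the idea; the values are made up for illustration.

```stata
* Invented test data: rows 4 and 5 have a missing value
clear
input educ age
10 25
16 30
12 17
.  40
11 .
end

* The if qualifier leaves dropouts missing whenever educ or age is missing
generate dropouts = (educ < 14 & age > 18) if !missing(educ, age)

* The missing option also shows how many observations were left uncoded
tabulate dropouts, missing
```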
These are just a few recommended ways of creating dummy variables. For other methods, see Stata's help resource.
How do I combine datasets?
You'll first need to identify what kind of combination you want to accomplish:
Vertical Combination, which uses the append command.
Use this when you are adding observations from one file to another file with the same variables. For instance, if you have one dataset containing country and GDP per capita for 2004 and need to add observations for 2005, that is a candidate for the append command.
Syntax:
append using filename
Example:
append using economy2005.dta
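A self-contained sketch of the same idea follows. It builds two tiny datasets with invented values and a temporary file (the name economy2005 and all numbers are assumptions for illustration), so it should be run as a whole from a do-file.

```stata
* Invented 2005 data, saved to a temporary file
clear
input str10 country year gdppc
"Chile"  2005 110
"Norway" 2005 220
end
tempfile economy2005
save `economy2005'

* Invented 2004 data become the dataset in memory (the master)
clear
input str10 country year gdppc
"Chile"  2004 100
"Norway" 2004 200
end

* Stack the 2005 observations under the 2004 observations
append using `economy2005'
list, sepby(year)
```

After the append, the dataset holds four observations: two countries in each of two years.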
Horizontal Combination
We're going to look at two cases of horizontal combination: one-to-one (merge 1:1) and many-to-one (merge m:1). In horizontal combination you want to add variables, not observations. The observations appear in both files, but each file holds different information about them. When you combine files in this way you need an identifying variable (social security #, country name, student ID, etc.) in each dataset so that Stata knows which rows to match.
For example, suppose we have a file with country name and land area; suppose we want to combine this with another file that has country name and GDP per Capita. We want a single dataset that has country name, land area and GDP per capita.
If the identifying variable is unique in both files, then it's a one-to-one match. Unique means that each value of the variable appears in only one observation. In the figure below, country is the identifying variable. In both datasets, each country has only one observation.
Syntax:
merge 1:1 identifying variable(s) using filename
Example:
merge 1:1 country using economy.dta
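Before a one-to-one merge it can help to confirm that the identifying variable really is unique; the isid command stops with an error if it is not. The sketch below uses two invented datasets and a temporary file (all names and values are assumptions for illustration) and should be run as a whole from a do-file.

```stata
* Invented using data: one row per country
clear
input str10 country gdppc
"Chile"  100
"Norway" 200
end
isid country
tempfile economy
save `economy'

* Invented master data: one row per country
clear
input str10 country area
"Chile"  10
"Norway" 20
end
isid country

* Add gdppc to each country's row
merge 1:1 country using `economy'
list
```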
If the identifying variable is unique in one file but not in the other, then it's a many-to-one match. This is very common when you have groups of observations in one file (the file in which the identifying variable is not unique) and information about each group in the other file. With merge m:1, the file with the repeated values is the Master Data and the file with the unique values is the Using Data.
Syntax:
merge m:1 identifying variable(s) using filename
Example:
merge m:1 fam_ID using households.dta
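As a rough illustration of a many-to-one match, the sketch below attaches household income to every member of a family. The datasets, the variables person_ID and hh_income, and all values are invented for the example; run it as a whole from a do-file so the temporary file stays available.

```stata
* Invented household-level file: fam_ID is unique here
clear
input fam_ID hh_income
1 500
2 800
end
tempfile households
save `households'

* Invented person-level master file: fam_ID repeats across family members
clear
input person_ID fam_ID
1 1
2 1
3 2
end

* Each person picks up the income of his or her household
merge m:1 fam_ID using `households'
list, sepby(fam_ID)
```

Both members of family 1 end up with the same hh_income value, which is the expected behavior in a many-to-one match.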
Sorting your identifying variable before merging
In Stata 10 and earlier, the merge command requires that both the Master and Using Data be sorted by the identifying variables. From Stata 11 on, merge 1:1 and merge m:1 sort the data themselves when necessary, so the steps here are harmless but optional. If the Master Data isn't sorted, run sort on the identifying variable(s) before the merge command. If the Using Data isn't sorted, open it first (use with the clear option), run the sort command, save it (save with the replace option), then open the Master Data and run the merge command. Here's an example:
use economy.dta, clear
sort country
save economy.dta, replace
use geography.dta, clear
sort country
merge 1:1 country using economy.dta
1) Since you saved economy.dta in the third line, you will not need to open economy.dta and sort it again in future runs.
2) If you are doing a one-to-one match (i.e., if the identifying variable(s) are unique in both datasets), the older merge syntax of Stata 10 and earlier accepts a sort option, which sorts both datasets within the merge command. That option does not work if the identifying variables are not unique. With merge 1:1 in Stata 11 and later, no such option is needed because the data are sorted automatically.
The _merge variable
The merge command automatically creates a variable named _merge, which contains information about the observation's existence in each of the two datasets:
1 -> the observation (the identifying variable(s) values) appeared only in the Master Data
2 -> the observation (the identifying variable(s) values) appeared only in the Using Data
3 -> the observation (the identifying variable(s) values) appeared in both datasets
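A common follow-up is to tabulate _merge and decide which observations to keep. The sketch below is one way to do that, using invented data in which one country appears only in the master file and one only in the using file; all names and values are assumptions for illustration.

```stata
* Invented using data
clear
input str10 country gdppc
"Chile"  100
"Norway" 200
end
tempfile economy
save `economy'

* Invented master data
clear
input str10 country area
"Chile" 10
"Peru"  30
end

merge 1:1 country using `economy'

* Peru is master only (1), Norway is using only (2), Chile matched (3)
tabulate _merge

* Keep matched rows only, then drop _merge so a later merge can create it again
keep if _merge == 3
drop _merge
```

Dropping _merge matters because a second merge stops with an error if a variable named _merge already exists.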
For more on merging, see Stata's documentation.
How do I identify duplicate observations in my data?