LibGuides: Social Science Data and Statistics Resources: Stata FAQ

Basics

From the command box, type help followed by the name of a command to open Stata's help documentation for that command.

For example, for information on how to use the regress command, type: help regress

You can also refer to Tisch Library's Stata guide for links and books on more in-depth Stata usage: Stata Research Guide

The Data Lab is also an excellent resource for drop-in help and appointments.

How can I get my data into Stata?

.dta (Stata) files can be opened simply with the use command followed by either the name of the data file or the full filepath.

Example: use sp500.dta

.xlsx and .xls (Excel) files can be opened with the import excel command followed by either the name of the data file or the full filepath. By default, Stata will import the first worksheet in an Excel workbook.

Example: import excel wdi.xlsx

.csv (comma delimited) files can be opened with the import delimited command followed by either the name of the data file or the full filepath.

Example: import delimited prices.csv

You can also go to 'File > Import' and select the data type that matches your dataset.

For detailed information on importing data, see the Stata documentation on entering and importing data.
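The following is a minimal round-trip sketch of the import commands above, not a required workflow. It assumes Stata's bundled auto dataset and writes two throwaway files (auto_demo.xlsx and auto_demo.csv, both hypothetical names) to your working directory so the example runs without any data of your own. The firstrow and varnames(1) options tell Stata that the first row of the file holds variable names.

```stata
* Create small demo files from a dataset that ships with Stata
sysuse auto, clear
export excel using "auto_demo.xlsx", firstrow(variables) replace
export delimited using "auto_demo.csv", replace

* Import the Excel file, reading variable names from the first row
import excel using "auto_demo.xlsx", firstrow clear
describe

* Import the comma-delimited file, reading variable names from row 1
import delimited using "auto_demo.csv", varnames(1) clear
describe
```

The clear option discards whatever dataset is currently in memory, so save your work before importing.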

When I try to open a dataset I get the error message "file _ not found r(601);". What does this mean? How do I set my working directory?

Double check and make sure you haven't mistyped the filename and that you've specified your working directory. Your working directory is simply the folder on your computer containing the file(s) you will be working with.

To set your working directory, go to 'File > Change Working Directory...' and select the folder containing the file you want to open.

You can also set your working directory in a .do file or from the command line by using the cd command followed by the path of your folder.

Example: cd "C:\Users\jquan01\Documents\Ec-15"
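Before changing directories it can help to check where Stata is currently looking. A short sketch, assuming you are trying to open the sp500.dta file used in the earlier example:

```stata
* Show the current working directory
pwd

* List the files in the current working directory
dir

* Check whether the file exists here before trying to open it
capture confirm file "sp500.dta"
if _rc != 0 display as error "sp500.dta was not found in the working directory"
```

If the message appears, either cd to the folder that holds the file or give use the full filepath.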

What is a Do File? Why should I use one?

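A Do file is a plain-text file of Stata commands (saved with the .do extension) that you write in Stata's Do-file Editor. It lets you organize and run all your commands easily and save your work for future analyses. There is no "Undo" in Stata; working with a Do file is the next best thing. To open a new Do file go to 'Window > Do-file Editor > New Do-file Editor' or type doedit in the command box. ('File > Do...' runs an existing Do file.)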

More on Do Files
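As an illustration, a Do file might look like the sketch below. The file names analysis.do and analysis.log are hypothetical, and the bundled auto dataset stands in for your own data.

```stata
* analysis.do (hypothetical file name)
* Start from a clean slate and record all output in a log file
clear all
capture log close
log using "analysis.log", replace text

* Load the data
sysuse auto, clear

* Describe and analyze
summarize price mpg weight
regress price mpg weight

* Close the log so the output is saved
log close
```

Because every step is written down, you can rerun the whole analysis from the original data after fixing a mistake instead of trying to undo it.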

How do I convert SPSS/SAS files to Stata?

If you are working with another software program, typically there are special commands within the respective package to convert to Stata format or at least a text format that can be brought into Stata. For more on this see this UCLA resource on converting between SPSS/SAS file formats.

There is also a software tool called Stat/Transfer that can do the conversion for you. There are copies on the collaborative workstation machines in Tisch Library.
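If you have Stata 16 or later, Stata can also read these formats directly with import spss and import sas. The sketch below assumes files named survey.sav and survey.sas7bdat in your working directory; both names are hypothetical.

```stata
* Requires Stata 16 or later; file names are hypothetical
* Read an SPSS file and save it in Stata format
import spss using "survey.sav", clear
save "survey.dta", replace

* Read a SAS file and save it in Stata format
import sas using "survey.sas7bdat", clear
save "survey_from_sas.dta", replace
```

After importing, check variable and value labels, since not every label carries over identically between packages.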

Describing Data

How can I get descriptive statistics of my data?

The four most widely used commands for descriptive statistics are:
describe, summarize, codebook, tabulate

To demonstrate each of the descriptive commands, we'll use the sysuse command to open a pre-installed dataset: The Life Expectancy, 1998 dataset. This cross-sectional dataset contains variables for region, country, average % population growth, life expectancy at birth, GNP per capita and % of population with access to safe water.
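The dataset described above ships with Stata under the name lifeexp. A minimal sketch of loading it and glancing at the first few rows:

```stata
* Load the pre-installed Life Expectancy, 1998 dataset
sysuse lifeexp, clear

* Look at the first five observations
list country region lexp in 1/5
```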

The describe command provides the number of observations and variables, storage type, display format and any variable labels assigned to the variables.
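For instance, with the lifeexp dataset:

```stata
sysuse lifeexp, clear

* Overview of every variable in the dataset
describe

* Overview of selected variables only
describe lexp gnppc
```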

The summarize command (abbreviated sum) is a bit more useful. It provides the number of observations, mean, standard deviation, minimum and maximum for each numeric variable.

summarize can be used with specified variables:

Adding the , detail option additionally provides the percentiles (where you can identify the median), variance, skewness and kurtosis.
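The three forms of summarize described above, sketched with the lifeexp dataset (lexp is life expectancy and gnppc is GNP per capita):

```stata
sysuse lifeexp, clear

* All numeric variables
summarize

* Selected variables only
summarize lexp gnppc

* Percentiles, variance, skewness and kurtosis for one variable
summarize lexp, detail
```

In the detailed output, the row labeled 50% is the median.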

The codebook command provides the data type, percentiles, mean, min/max and standard deviation. It additionally lets you know how many missing and unique values there are for the specified variable.
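A short sketch with the lifeexp dataset. The safewater variable is a useful one to inspect because it is not recorded for every country:

```stata
sysuse lifeexp, clear

* Codebook entry for a continuous variable
codebook lexp

* Codebook entry for a variable with missing values
codebook safewater
```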

The tabulate command is useful for creating frequency tables. It is most useful for categorical/factor variables.
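For example, a frequency table of the region variable in the lifeexp dataset:

```stata
sysuse lifeexp, clear

* One-way frequency table of a categorical variable
tabulate region
```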

Another useful application of the tabulate command involves adding the , summarize() option. This creates a table that breaks down the mean, standard deviation and frequency of a continuous variable "by" some categorical variable.
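A sketch of this option with the lifeexp dataset, summarizing life expectancy (lexp) within each region:

```stata
sysuse lifeexp, clear

* Mean, standard deviation and frequency of lexp for each region
tabulate region, summarize(lexp)
```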

There are many other applications of the tabulate command, such as creating two-way crosstabs and, when combined with the by prefix, three-way and higher-order crosstabs. For more methods of generating descriptive statistics see this Stata Resource.
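The lifeexp dataset has only one categorical variable, so this crosstab sketch switches to Stata's bundled auto dataset, which is an assumption made for illustration. It cross-tabulates repair record (rep78) against car origin (foreign).

```stata
sysuse auto, clear

* Two-way crosstab with row percentages
tabulate rep78 foreign, row
```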

Changing the Data

How do I create dummy variables?

A dummy variable is a variable that takes on the values 1 or 0 where 1 means some condition is true (such as age<30, gender is female, type of government is a dictatorship, ethnicity is Hispanic, etc.) and 0 means the condition is false.

Dummy variables for many categories of one variable

In this case, you want to create a dummy variable for each category of a variable, such as regions or races, to implement a fixed effects regression.

/* creates new variables, 1 for each category of the region variable*/
tab region, gen(regiondum)

/* creates new variables, 1 variable for all but 1 category of the region variable. To avoid collinearity in your model, a categorical variable with k levels must produce k-1 dummies */
xi i.region

/* This will run a regression and create dummy variables in one step */
xi: regress Y X1 X2 i.region

Dummy variables based on 2 or more variables or conditions

Suppose we need to create a dummy variable to indicate whether an observation is a high school dropout. We would need to code this based on two conditions: years of completed schooling and age.

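/* Stata treats missing values as larger than any number, so age > 18 is true when age is missing; the replace line resets those cases to missing */
gen dropouts = educ < 14 & age > 18
replace dropouts = . if missing(age, educ)
/* always tab your newly created variable to see if the expected number of 0s and 1s were assigned */
tab dropouts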

These are just a few recommended ways of creating dummy variables. For other methods, see Stata's help resource.
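One such alternative, available in Stata 11 and later, is factor-variable notation: prefixing a categorical variable with i. inside an estimation command creates the k-1 dummies on the fly without the xi prefix. The sketch below uses the lifeexp dataset; the regression itself is only an illustration, not a recommended model.

```stata
sysuse lifeexp, clear

* Explicit dummies: creates regiondum1, regiondum2 and regiondum3
tabulate region, generate(regiondum)

* Factor-variable notation: region dummies are created within the command
regress lexp gnppc i.region
```

Stata omits the first category of region as the base level, which avoids the collinearity problem mentioned above.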

How do I combine datasets?

You'll first need to identify what kind of combination you want to accomplish:

Vertical Combination (the append command)

Use this when you are adding observations from one file to another file with the same variables. For instance, if you have one dataset containing variables for country and GDP per capita for the year 2004 but need to add observations for 2005, then this is a candidate for the append command.

Syntax: append using filename
Example: append using economy2005.dta
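A self-contained sketch of append. The two small files are built inside the example, and the country values are made-up numbers for illustration only, not real GDP figures.

```stata
* Build a hypothetical 2004 file (values are invented)
clear
input str10 country gdppc
"Chile" 6200
"Ghana" 420
end
generate year = 2004
save economy2004, replace

* Build a hypothetical 2005 file with the same variables
clear
input str10 country gdppc
"Chile" 7600
"Ghana" 500
end
generate year = 2005
save economy2005, replace

* Stack the 2005 observations underneath the 2004 observations
use economy2004, clear
append using economy2005
list
```

Including a year variable in each file before appending lets you tell the two sets of observations apart afterwards.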

Horizontal Combination

We're going to look at two cases of horizontal combination: one-to-one (merge 1:1) and many-to-one (merge m:1). In horizontal combination you want to add variables, and not observations. The observations appear in both files, but in each file there is different information about them. When you combine files in this way you need some identifying variable (social security #, country name, student ID, etc.) in each dataset so that Stata knows which rows to match.

For example, suppose we have a file with country name and land area; suppose we want to combine this with another file that has country name and GDP per Capita. We want a single dataset that has country name, land area and GDP per capita.

One-to-one merge

If the identifying variable that appears in both files is unique, then it's a one-to-one match. Unique means that each value of this variable appears in only one observation. In the country example above, country is the identifying variable, and in both datasets each country has only one observation.

Syntax: merge 1:1 identifying variable(s) using filename
Example: merge 1:1 country using economy.dta

Many-to-one merge

If the identifying variable is unique in one file but not unique in the other, then it's a many-to-one (or one-to-many) match. This is very common when you have groups of observations in one file (where the identifying variable is not unique) and information about each group in another file (where it is unique). Use merge m:1 when the file with repeated values is the Master Data and the file with unique values is the Using Data; use merge 1:m for the reverse.

Syntax: merge m:1 identifying variable(s) using filename
Example: merge m:1 fam_ID using households.dta
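A self-contained sketch of a many-to-one merge. The file name households_demo and all values are hypothetical. The Master Data has one row per person, and the Using Data has one row per family.

```stata
* Using Data: one row per family (invented values)
clear
input fam_ID income
1 52000
2 31000
end
save households_demo, replace

* Master Data: one row per person, so fam_ID repeats
clear
input person_ID fam_ID age
1 1 34
2 1 36
3 2 51
end

* Attach each family's income to every person in that family
merge m:1 fam_ID using households_demo
list
```

Both members of family 1 receive the same income value after the merge.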

Sorting your identifying variable before merging

The merge 1:1 and merge m:1 syntax shown above (Stata 11 and later) sorts the Master and Using Data by the identifying variables automatically, so you do not have to sort them yourself. Older versions of the merge command did require both datasets to be sorted first, and you will still see this pattern in older .do files. Sorting by hand does no harm: if the Master Data isn't sorted, run sort before the merge command. If the Using Data isn't sorted, open it first (use filename, clear), then run the sort command, then save it (save filename, replace), open the Master Data and run the merge command. Here's an example:

use economy.dta, clear
sort country
save economy.dta, replace
use geography.dta, clear
sort country
merge 1:1 country using economy.dta

1) Since you saved economy.dta in the third line, you will not need to open economy.dta and sort it again in future runs.

2) The sort option belongs to the older merge syntax (merge without 1:1 or m:1) and only works when the identifying variable(s) are unique in both datasets. With merge 1:1 and merge m:1 it is not needed.

The _merge variable

The merge command automatically creates a variable named _merge, which contains information about the observation's existence in each of the two datasets:

1 -> the observation (the identifying variable(s) values) appeared only in the Master Data
2 -> the observation (the identifying variable(s) values) appeared only in the Using Data
3 -> the observation (the identifying variable(s) values) appeared in both datasets
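A self-contained sketch of inspecting _merge after a one-to-one merge. The file name economy_demo and all values are made up for illustration; each file deliberately contains one country the other lacks.

```stata
* Using Data (invented values): Nepal appears only here
clear
input str10 country gdppc
"Chile" 7600
"Ghana" 500
"Nepal" 300
end
save economy_demo, replace

* Master Data (invented values): Peru appears only here
clear
input str10 country area
"Chile" 756
"Ghana" 239
"Peru" 1285
end

merge 1:1 country using economy_demo

* See how many observations fall into each _merge category
tabulate _merge

* Keep only the countries found in both files, then tidy up
keep if _merge == 3
drop _merge
```

Always tabulate _merge before dropping anything, since unexpected 1s or 2s often point to misspelled or inconsistently coded identifiers.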

For more on merging see Stata's documentation

How do I identify duplicate observations in my data?

See this Stata help documentation for working with duplicate observations
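As a quick sketch of the duplicates commands, the example below uses Stata's bundled auto dataset and deliberately creates three duplicate rows so there is something to find. The variable name dup is hypothetical.

```stata
sysuse auto, clear

* Deliberately duplicate the first three observations for the demo
expand 2 in 1/3

* Count how many observations are duplicated across all variables
duplicates report

* Flag duplicated observations in a new variable and list them
duplicates tag, generate(dup)
list make dup if dup > 0

* Drop the extra copies, keeping one of each
duplicates drop
```

To check for duplicates on an identifier only, add a variable list, for example duplicates report make.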

Data Analysis

What kind of analysis is appropriate?

Important! The answer depends on your discipline, the nature of your variables and the outcomes, predictions, comparisons or relationships you want to examine with your data. If you are uncertain, refer to notes and texts from your research methods/statistics course and consult with your professor or advisor before running a significance test.

As important as picking the "right" significance test is understanding the rationale and assumptions underlying the method.

A good starting point, with Stata examples for several different types of tests, is this page from UCLA's statistical computing center: What Statistical Test Should I Use?

Exporting Data Tables

How do I export the results of an OLS regression?

There are several approaches to exporting tables in Stata. See this University of North Carolina help document to see an overview of the different methods: Exporting Results

The command outreg2 allows for presentation of regression tables (and more) like you would see in an academic journal article. It is a user-written command and can be installed with the following syntax:

ssc install outreg2

A simple example using the outreg2 command is as follows:

regress write read
outreg2 using model.doc
regress write read science
outreg2 using model.doc, append

The first outreg2 command captures the first regression model (the effect of read on write) and formats it into a Word document (titled model.doc).

Using the , append option with outreg2 after running a model with additional explanatory variables adds that model's coefficients and standard errors to the table as a new column.

Organizing Your Data

Organizing Folders

Create a new folder for your class. Name it without spaces (e.g., ECON203-SP14).
Create 2 new folders in your class folder: one for assignments (Assignments) and one for your final project (Project).
Save your data files to the appropriate folders within your class folder.

Naming Files

-Adopt consistent file naming conventions!

-Name your files something that alludes to their content or purpose.

Example 1: Data for a class's second lab. Poor choice: mydata.dta. Better name: lab2census.dta.

Example 2: You download US census data for your final project for the year 2000; name it census00.dta. Then you create a subset containing only women ages 18-30; name it census00fem.dta.

-If you have multiple versions of the same file, add increasing numbers to the end of the file (e.g., census00v2, census00v3, etc). If you go back and make changes to an earlier version of the file, save a copy with the next highest number (e.g., census00v4).
