National Survey of Family Growth Weights Adjusted for Abortion Under-Reporting
These weights can be used for analysis of unintended pregnancies in the United States, such as of contraceptive efficacy and failure.
Table of Contents
- Abortion Under-Reporting on the National Survey of Family Growth
- Weights to Compensate for Abortion Under-Reporting
- Citations
- Footnotes
Abortion Under-Reporting on the National Survey of Family Growth
The National Survey of Family Growth (NSFG) is the premier fertility survey of the United States. It began in 1973, and since its inception has been designed and administered by the National Center for Health Statistics, which is now an agency within the U.S. Department of Health and Human Services’ Centers for Disease Control and Prevention.
However, there is a known issue with the NSFG data that affects any analyses involving counts of pregnancies generally and counts of unintended pregnancies especially. Less than half of pregnancies ending in induced abortion have been reported by respondents to the NSFG. This is depicted graphically in a previous article.
The Appendix 2 “Topic-Specific Notes” to the User’s Guide of the 2017-2019 NSFG states:
Abortions have always been under-reported in the NSFG and virtually all other demographic surveys. This has been determined by comparing NSFG weighted estimates of abortions with external data from abortion providers….
For the 2017-2019 NSFG, comparing pregnancies ending in abortion reported in 2012-2016 with the number of abortions in the Guttmacher Abortion Provider Census shows that approximately 38% of pregnancies ending in abortion were reported in the NSFG….
Therefore, as in previous surveys, the NSFG staff advises NSFG data users that, generally speaking, NSFG data on abortion should not be used for substantive research focused on the determinants or consequences of abortion. The NSFG abortion data can be used for:
- methodological studies of factors affecting abortion reporting.
- studies of contraceptive efficacy, but only after the data are adjusted for the under-reporting of abortion. (“Appendix 2: Topic-Specific Notes for 2017-2019, NSFG User’s Guide” 2021)
This issue has been known for some time,1 and it appears to be due to some number of respondents simply not reporting their pregnancies that ended in induced abortion. The Public Use Data Tape Documentation for Cycle 3 (1982) of the NSFG states:
As in the past in surveys of this kind, abortions were underreported by NSFG respondents. The number of induced abortions in the 5 years before the survey represents about 47 percent of the number reported during 1977-1981 by the Alan Guttmacher Institute…
With due allowance for sampling variability, estimates of numbers of live births and of miscarriages and stillbirths correspond fairly well to information from other data sources. It appears that, rather than misreporting pregnancy outcomes, the majority of the underreporting is the result of women not reporting at all the pregnancies that ended in abortions. (“Public Use Data Tape Documentation, National Survey of Family Growth, Cycle III 1982” 1986)
As a result, any analysis involving counts of pregnancies, and especially counts of unintended pregnancies, would be flawed if based on unadjusted NSFG data. For instance, an analysis of contraceptive efficacy based on unadjusted NSFG data would underestimate contraceptive failure, because pregnancies that resulted from contraceptive failure and ended in induced abortion are under-counted.
Such analyses have been done in the past, and they have attempted to compensate for the abortion under-reporting in the NSFG data by using back-of-the-envelope arithmetic adjustments. (Sundaram et al. 2017; Kost et al. 2008; Fu et al. 1999; Jones and Forrest 1992; Grady, Hayward, and Yagi 1986) While such approaches are plausible, they have the disadvantage of limiting what statistical methodologies can be used in the analysis to just those that are amenable to such adjustments.
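As a rough illustration of what such an arithmetic adjustment can look like, the sketch below inflates a reported abortion count by the inverse of a reporting rate. The 38% rate is the figure quoted above from the 2017-2019 User's Guide. The reported count and the helper name `inflate_reported` are made up for illustration and are not taken from any of the cited papers, whose adjustments are more detailed.

```r
# Hypothetical helper: scale a reported count up by the inverse of the
# share of events that respondents actually reported.
inflate_reported <- function(reported, reporting_rate) {
  stopifnot(reporting_rate > 0, reporting_rate <= 1)
  reported / reporting_rate
}

reporting_rate <- 0.38       # share reported, as quoted from the User's Guide
reported_abortions <- 1000   # made-up weighted count, for illustration only

inflation_factor <- 1 / reporting_rate
adjusted_abortions <- inflate_reported(reported_abortions, reporting_rate)

print(round(inflation_factor, 2))
print(round(adjusted_abortions))
```

An adjustment of this kind has to be repeated by hand for every estimate, which is the limitation that adjusted survey weights are meant to remove.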
Weights to Compensate for Abortion Under-Reporting
Herein are provided survey weights for NSFG public use data files that have been adjusted to compensate for abortion under-reporting. Incorporating the adjustments into the survey weights has the advantage that, once the adjusted weights are in use, the full range of statistical methodologies available for the analysis of surveys can be applied without any back-of-the-envelope arithmetic adjustments.
For instance, using these adjusted weights, regression analysis can be done with unintended pregnancy as one of the model variables simply by using the svyglm function of the survey library for the R statistical programming language or by using the SURVEYREG or SURVEYLOGISTIC procedures of the SAS statistical programming language.
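A minimal sketch of such a regression in R follows. It assumes the survey package is installed and uses a small synthetic table in place of real NSFG data. The variables `unintended` and `age_under_25` are hypothetical, and the replicate weights are randomly generated stand-ins, so the output has no substantive meaning. Only the column naming (FULL_SAMPLE_WGT, REP_WGT_ plus a number) follows the distributed weight files.

```r
library(survey)

set.seed(1)
n <- 200
n_rep <- 8

# Synthetic stand-in for a pregnancy-level table; all values are made up.
toy <- data.frame(
  CASEID = seq_len(n),
  unintended = rbinom(n, 1, 0.4),
  age_under_25 = rbinom(n, 1, 0.5),
  FULL_SAMPLE_WGT = runif(n, 500, 1500)
)

# Fake BRR-style replicate weights: each replicate doubles or zeroes a weight.
# Real replicate weights would come from the distributed CSV files.
for (r in seq_len(n_rep)) {
  toy[[paste0("REP_WGT_", r)]] <-
    toy$FULL_SAMPLE_WGT * sample(c(0, 2), n, replace = TRUE)
}

# Replicate-weight design; the regular expression selects the replicate columns.
design <- svrepdesign(
  data = toy,
  weights = ~FULL_SAMPLE_WGT,
  repweights = "REP_WGT_[0-9]+",
  type = "BRR"
)

# Logistic regression with unintended pregnancy as the outcome.
fit <- svyglm(unintended ~ age_under_25, design = design,
              family = quasibinomial())
print(coef(fit))
```

With real data, `toy` would be replaced by an NSFG table merged with the adjusted weights, and the design type would follow the replicate weight scheme of the survey in question.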
The weight adjustments were made using a modified form of post-stratification, which is a technique in which survey weights are adjusted such that estimates match totals retrieved from an external source. (Lohr 2021, 329) The weights included with the public-use data files of the NSFG have already been post-stratified in order to make the estimates of population counts based on the NSFG data match external population counts from the U.S. Census Bureau and others. (National Center for Health Statistics 2020)
The adjusted survey weights were produced by performing additional post-stratifications using the technique of raking, in which multiple post-stratifications can be done simultaneously. (Lohr 2021, 331)
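The idea behind raking can be sketched in a few lines of base R: weights are rescaled to match one set of external totals, then another, and the cycle repeats until all margins agree with their targets. This is a generic illustration of iterative proportional fitting, not the procedure used to produce the distributed weights. The data, targets, and the function name `rake_weights` are made up.

```r
# Toy respondent table; categories and weights are made up.
resp <- data.frame(
  age_group = c("15-24", "15-24", "25-39", "25-39", "15-24", "25-39"),
  parity    = c("0", "1+", "0", "1+", "0", "1+"),
  weight    = c(100, 120, 90, 110, 80, 100),
  stringsAsFactors = FALSE
)

# Hypothetical external totals for each dimension (both sum to 700).
targets <- list(
  age_group = c("15-24" = 320, "25-39" = 380),
  parity    = c("0" = 300, "1+" = 400)
)

rake_weights <- function(data, targets, max_iter = 100, tol = 1e-10) {
  w <- data$weight
  for (iter in seq_len(max_iter)) {
    max_gap <- 0
    for (dim_name in names(targets)) {
      groups  <- as.character(data[[dim_name]])
      # Current weighted total in each category of this dimension.
      current <- vapply(split(w, groups), sum, numeric(1))
      # Post-stratification factor: external target over current estimate.
      factors <- targets[[dim_name]][names(current)] / current
      max_gap <- max(max_gap, abs(factors - 1))
      w <- w * unname(factors[groups])
    }
    # Stop once no dimension needed a meaningful adjustment.
    if (max_gap < tol) break
  }
  w
}

resp$raked_weight <- rake_weights(resp, targets)
print(tapply(resp$raked_weight, resp$age_group, sum))
print(tapply(resp$raked_weight, resp$parity, sum))
```

A single pass over one dimension is ordinary post-stratification. Cycling over several dimensions is what lets the margins be matched simultaneously.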
Post-Stratification Dimensions
The post-stratifications performed on the weights for each survey given this treatment were:
- a repeat of the post-stratification already done in preparing the public-use data files, which was based on the number of women in the United States of reproductive age in demographic categories such as age, race and/or ethnicity, sex, marital status, and parity,2
- for a given analysis period preceding the survey, the number of live births in each year by respondent’s age at end of pregnancy, and
- for the same analysis period preceding the survey, the number of induced abortions in the entirety of the analysis period by respondent’s age at end of pregnancy.
The births by age and abortions by age dimensions only included post-strata for women under 40 years of age for most of the NSFG surveys. Because the target population of the NSFG before 2015 was women 15 to 44 years of age, too few respondents reported pregnancies ending at age 40 or older to support post-stratification. In 2015, the target population was extended to women aged 15 to 49 years, so a post-stratum for women aged 40 years or more at the end of pregnancy was included in the weight adjustments for the 2015-2019 NSFG.
Post-Stratification Targets
The targets for post-stratifications based on population counts of demographic categories were replicated from the NSFG data sets themselves, based on documentation of each NSFG survey’s weighting procedures.
Targets for post-stratification of live births by age per year were taken from the National Vital Statistics System’s natality data. For 1995 and after, these came by way of the CDC WONDER system. (“Natality on CDC WONDER Online Database,” n.d.) For years before 1995, these came from data files prepared by the National Bureau of Economic Research.3 (“Vital Statistics Natality Birth Data” 2023)
Targets for post-stratification of induced abortions by age per analysis period were taken from a data set created by Guttmacher Institute personnel based on total counts from the Guttmacher Institute’s Abortion Provider Census with age distributions inferred from Abortion Surveillance Reports of the Centers for Disease Control and Prevention. (Maddow-Zimet, Kost, and Finn 2022) Unlike live births, induced abortions were post-stratified over the entirety of each survey’s analysis period rather than by year, because too few respondents reported abortions to divide them up by year.
Analysis Periods
Adjusted survey weights are provided for Cycle 3 (1982) of the NSFG through the 2015-2019 NSFG. No attempt was made to produce adjusted weights for Cycle 1 or Cycle 2 of the NSFG, as these cycles only interviewed women who had ever been married, and so are not useful for estimating the counts of all pregnancies in the United States. The 2011-2019 versions of the NSFG public-use data files, which were released in two-year chunks, were combined into two four-year data sets in order to make the sample sizes similar to earlier surveys.4
The analysis periods for each survey were hand-selected in order to give the most accurate estimates that would give complete coverage for all years from 1974 to 2014, and this is summarized in Table 1.
Update 11/21/2024: The public use data file for Cycle 4 of the NSFG had a post-stratification scheme for its weights that was more complicated than the post-stratification done for other releases of the NSFG. As a result, the weight adjustment reduced estimate error for all other releases of the NSFG, but actually increased estimate error for Cycle 4. Thus, confidence intervals for estimates of counts of pregnancies based on Cycle 4 of the NSFG are noticeably larger than other years.
Therefore, the analysis periods were adjusted for a “version 2” of the weights to increase years covered by Cycle 3 and decrease the years covered by Cycle 4. This led to an eight-year analysis period for Cycle 3, which was divided into two four-year periods. The “version 2” weight files for Cycle 3 and for Cycle 4 are now included in the “Data Files Containing Adjusted Weights” section below.
Analysis Period(s) | Survey |
---|---|
1974-1977, 1978-1981 | Cycle 3 (1982) |
1982-1985 | Cycle 4 (1988) |
1986-1990, 1991-1994 | Cycle 5 (1995) |
1995-2001 | Cycle 6 (2002) |
2002-2005 | 2006-2010 |
2006-2010 | 2011-2015 |
2011-2014 | 2015-2019 |
The analysis periods of Cycle 3 and of Cycle 5 of the NSFG are particularly long, and so were divided into two separate analysis periods.
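Table 1 can be treated as a lookup from calendar year to the survey whose adjusted weights cover that year. The sketch below encodes the table as a data frame. The object and function names (`analysis_periods`, `survey_for_year`) are illustrative and not part of the distributed files.

```r
# Table 1 encoded as one row per analysis period.
analysis_periods <- data.frame(
  survey = c("Cycle 3", "Cycle 3", "Cycle 4", "Cycle 5", "Cycle 5",
             "Cycle 6", "2006-2010", "2011-2015", "2015-2019"),
  start  = c(1974, 1978, 1982, 1986, 1991, 1995, 2002, 2006, 2011),
  end    = c(1977, 1981, 1985, 1990, 1994, 2001, 2005, 2010, 2014),
  stringsAsFactors = FALSE
)

# Return the survey whose analysis period contains the given year,
# or NA if the year falls outside 1974-2014.
survey_for_year <- function(year) {
  hit <- analysis_periods$start <= year & year <= analysis_periods$end
  if (!any(hit)) return(NA_character_)
  analysis_periods$survey[hit]
}

print(survey_for_year(1999))
print(survey_for_year(2016))
```

Because the periods do not overlap, each year from 1974 to 2014 maps to exactly one survey.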
When using the adjusted weights, estimates of the number of births in each year of a survey’s analysis period and estimates of the number of induced abortions in the entirety of each survey’s analysis period should match external counts for women less than 40 years of age.
In order to accomplish this, estimates for calculation of post-stratification adjustments were based on the number of abortions a respondent reported as occurring in the analysis period, since there are many respondents to the NSFG who report multiple abortions. This was a departure from how post-stratification is typically done, in which each respondent is counted once.
In addition to applying adjustments to the weight of each respondent reporting an abortion in a survey’s analysis period, weight adjustments were applied to any respondent reporting an abortion after the analysis period, based on the adjustment done for respondents reporting abortions during the analysis period. This was done so that the contraceptive use calendar, which includes many respondent-months that occur after a survey’s analysis period, could be used for analysis. This was another departure from how post-stratification is typically done.
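The two departures can be illustrated with a deliberately simplified sketch: a single post-stratum, a single pass, and made-up numbers. The actual adjustment was done by raking across several dimensions and age groups, so this is only meant to show how counting abortions rather than respondents changes the arithmetic. All column names here are hypothetical.

```r
# Toy respondents: abortions reported during and after a hypothetical
# analysis period. All numbers are made up.
resp <- data.frame(
  CASEID = 1:5,
  weight = c(1000, 1500, 1200, 800, 900),
  abortions_in_period = c(0, 1, 2, 0, 0),
  abortions_after_period = c(0, 0, 0, 1, 0)
)

abortion_target <- 12000  # hypothetical external count for the period

# The weighted estimate counts abortions, not respondents: a respondent
# reporting two abortions contributes twice her weight.
estimated <- sum(resp$weight * resp$abortions_in_period)
adjustment <- abortion_target / estimated

# Apply the factor to respondents reporting an abortion in the period, and
# carry the same factor over to those reporting an abortion only afterwards.
reports_abortion <- resp$abortions_in_period > 0 |
  resp$abortions_after_period > 0
resp$adj_weight <- ifelse(reports_abortion,
                          resp$weight * adjustment,
                          resp$weight)

print(sum(resp$adj_weight * resp$abortions_in_period))
```

After the adjustment, the weighted number of abortions in the period equals the target, while respondents reporting no abortions keep their original weights.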
Replicate Weight Schemes
These departures from typical post-stratification necessitated the use of replicate weights. Any adjustments made to the main survey weights were also performed upon all replicate weights, which ensures that the standard errors of estimates can be calculated by statistical software.
The public use data file for Cycle 4 (1988) of the NSFG is distributed with replicate weights. Replicate weights for the other surveys were calculated before the raking post-stratification was done. The replicate weight scheme used for each survey depended on the structure of strata and clusters in its public use data file, as summarized in Table 2.
Survey | Replicate Weight Scheme |
---|---|
Cycle 3 (1982) | Balanced Repeated Replication (BRR) |
Cycle 4 (1988) | Balanced Repeated Replication (BRR) |
Cycle 5 (1995) | Jackknife |
Cycle 6 (2002) | Balanced Repeated Replication (BRR) |
2006-2010 | Jackknife |
2011-2015 | Jackknife |
2015-2019 | Jackknife |
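For readers unfamiliar with replicate weights, the sketch below shows the standard variance formulas that statistical software applies under each scheme in Table 2: an estimate is recomputed once per replicate weight, and the spread of those replicate estimates around the full-sample estimate gives the variance. The function names and numbers are made up for illustration.

```r
# theta_full: estimate computed with the full-sample weight.
# theta_reps: the same estimate recomputed with each replicate weight.

# BRR: average squared deviation of the replicate estimates.
brr_variance <- function(theta_full, theta_reps) {
  mean((theta_reps - theta_full)^2)
}

# Jackknife: squared deviations weighted by per-replicate coefficients.
jackknife_variance <- function(theta_full, theta_reps, jk_coefs) {
  stopifnot(length(theta_reps) == length(jk_coefs))
  sum(jk_coefs * (theta_reps - theta_full)^2)
}

# Made-up numbers for illustration.
theta_full <- 100
theta_reps <- c(98, 103, 101, 97)
jk_coefs   <- c(0.5, 0.5, 0.5, 0.5)

brr_var <- brr_variance(theta_full, theta_reps)
jk_var  <- jackknife_variance(theta_full, theta_reps, jk_coefs)

print(sqrt(brr_var))  # standard error under BRR
print(sqrt(jk_var))   # standard error under the jackknife
```

The jackknife formula is the reason the jackknife surveys need an extra file of coefficients, whereas BRR needs only the replicate weights themselves.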
Data Files Containing Adjusted Weights
The NSFG web site includes example programs for reading NSFG public use data files in the SAS, SPSS, and Stata programming languages. Because of the diversity of potential environments for their use, the adjusted NSFG weights are distributed below in comma separated values (CSV) files, which should be readable in any programming environment.
Update 11/21/2024: The files for Cycle 3 and for Cycle 4 are “version 2.” See note under “Analysis Periods” above.
Each file contains a column for the CASEID of each female respondent in the data set,5 a column FULL_SAMPLE_WGT that contains the main survey weight for the respondent, and multiple columns beginning with REP_WGT_ and ending with a number that contain the replicate weights for each respondent.
Additionally, for those replicate weights that use the jackknife scheme, statistical software requires the jackknife coefficients for each stratum. Coefficients for survey weights that use the jackknife scheme are included in the CSV files linked below.
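Before merging a weights file onto NSFG data, it can be worth confirming that it has the layout described above. The sketch below reads a two-row, made-up CSV from a string so that it runs on its own. With a real file, `text = csv_text` would be replaced by the file path.

```r
# Two made-up rows standing in for a distributed weights file.
csv_text <- "CASEID,FULL_SAMPLE_WGT,REP_WGT_1,REP_WGT_2
1,1200.5,2401.0,0
2,950.0,0,1900.0"

weights <- read.csv(text = csv_text)

# Replicate weight columns: REP_WGT_ followed by a number.
rep_cols <- grep("^REP_WGT_[0-9]+$", names(weights), value = TRUE)

# Basic layout checks: required columns present, one row per CASEID.
stopifnot(
  all(c("CASEID", "FULL_SAMPLE_WGT") %in% names(weights)),
  length(rep_cols) > 0,
  !anyDuplicated(weights$CASEID)
)

print(rep_cols)
```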
Example SAS Programs
Female Respondents File – Balanced Repeated Replication
The below SAS program will print out estimates of the population of women in the United States by age, based on Cycle 6 (2002) of the NSFG, which uses the balanced repeated replication weight scheme.
Program prerequisites:
- nsfgdata is a SAS library that contains the table d2002_2003femresp.
- d2002_2003femresp is the female respondents data table of Cycle 6 (2002) of the NSFG, as set up by the example programs on the NSFG web site.
- The &weights_path. macro variable contains the path to the weights file.
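/* Convert CASEID from character to numeric so it matches the weights file. */
data cycle6_fem_resp;
    set nsfgdata.d2002_2003femresp(rename=(CASEID=CASEID_char));
    CASEID = input(CASEID_char, 8.);
    drop CASEID_char;
run;
/* Read the adjusted weights. */
proc import
    file="&weights_path./NSFG_2002_2003_cycle_6_weights.csv"
    dbms=csv
    out=cycle6_weights;
run;
/* Sort both tables by CASEID before merging. */
proc sort data=cycle6_fem_resp;
    by CASEID;
run;
proc sort data=cycle6_weights;
    by CASEID;
run;
/* Replace the original weight (FINALWGT) with the adjusted weights. */
data cycle6_fem_resp_adj;
    merge cycle6_fem_resp(drop=FINALWGT) cycle6_weights;
    by CASEID;
run;
/* Weighted counts of women by age, with BRR standard errors. */
proc surveyfreq data=cycle6_fem_resp_adj varmethod=BRR;
    tables AGER;
    weight FULL_SAMPLE_WGT;
    repweights REP_WGT_:;
run;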
Pregnancies File – Balanced Repeated Replication
The below SAS program will print out estimates of the number of pregnancies by pregnancy outcome in the analysis period of Cycle 6 (2002) of the NSFG. When merging the weights table to the pregnancies table, CASEIDs that do not correspond to any pregnancies should be discarded; otherwise spurious rows will be introduced.
Program prerequisites:
- nsfgdata is a SAS library that contains the table d2002_2003fempreg.
- d2002_2003fempreg is the pregnancies data table of Cycle 6 (2002) of the NSFG, as set up by the example programs on the NSFG web site.
- The &weights_path. macro variable contains the path to the weights file.
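/* Convert CASEID from character to numeric so it matches the weights file. */
data cycle6_pregs;
    set nsfgdata.d2002_2003fempreg(rename=(CASEID=CASEID_char));
    CASEID = input(CASEID_char, 8.);
    drop CASEID_char;
run;
/* Read the adjusted weights. */
proc import
    file="&weights_path./NSFG_2002_2003_cycle_6_weights.csv"
    dbms=csv
    out=cycle6_weights;
run;
/* Sort both tables by CASEID before merging. */
proc sort data=cycle6_pregs;
    by CASEID;
run;
proc sort data=cycle6_weights;
    by CASEID;
run;
/* Keep only rows that exist in the pregnancies table. */
data cycle6_pregs_adj;
    merge cycle6_pregs(drop=FINALWGT in=in_pregs) cycle6_weights;
    by CASEID;
    if in_pregs;
run;
/* Weighted counts of pregnancies by outcome within the analysis period. */
proc surveyfreq data=cycle6_pregs_adj varmethod=BRR;
    tables OUTCOME;
    weight FULL_SAMPLE_WGT;
    repweights REP_WGT_:;
    where 1141 <= DATEND <= 1224;
run;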
Female Respondents File – Jackknife
The below SAS program also prints out estimates of the population of women in the United States by age, but based on the 2006-2010 version of the NSFG, which uses the jackknife replicate weight scheme instead of balanced repeated replication. Because of this, the jackknife coefficients must also be loaded.
Program prerequisites:
- nsfgdata is a SAS library that contains the table d2006_2010femresp.
- d2006_2010femresp is the female respondents data table of the 2006-2010 version of the NSFG, as set up by the example programs on the NSFG web site.
- The &weights_path. macro variable contains the path to the weights file and jackknife coefficients file.
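/* Declare CASEID as an 8-byte numeric with an 8. format, then read the table. */
data fem_resp_2006_2010;
    format CASEID 8.;
    length CASEID 8;
    set nsfgdata.d2006_2010femresp;
run;
/* Read the adjusted weights. */
proc import
    file="&weights_path./NSFG_2006_2010_weights.csv"
    dbms=csv
    out=weights_2006_2010;
run;
/* Sort both tables by CASEID before merging. */
proc sort data=fem_resp_2006_2010;
    by CASEID;
run;
proc sort data=weights_2006_2010;
    by CASEID;
run;
/* Replace the original weight (FINALWGT30) with the adjusted weights. */
data fem_resp_2006_2010_adj;
    merge fem_resp_2006_2010(drop=FINALWGT30) weights_2006_2010;
    by CASEID;
run;
/* Read the jackknife coefficients needed for variance estimation. */
proc import
    file="&weights_path./NSFG_2006_2010_jkcoefs.csv"
    dbms=csv
    out=jkcoefs_2006_2010;
run;
/* Weighted counts of women by age, with jackknife standard errors. */
proc surveyfreq data=fem_resp_2006_2010_adj varmethod=jackknife;
    tables AGER;
    weight FULL_SAMPLE_WGT;
    repweights REP_WGT_: / jkcoefs=jkcoefs_2006_2010;
run;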
Pregnancies File – Jackknife
The below SAS program will print out estimates of the number of pregnancies by pregnancy outcome in the analysis period of the 2006-2010 version of the NSFG, which uses the jackknife replicate weight scheme instead of balanced repeated replication. Thus, the jackknife coefficients must be loaded to use the weights.
Program prerequisites:
- nsfgdata is a SAS library that contains the table d2006_2010fempreg.
- d2006_2010fempreg is the pregnancies data table of the 2006-2010 version of the NSFG, as set up by the example programs on the NSFG web site.
- The &weights_path. macro variable contains the path to the weights file and jackknife coefficients file.
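/* Declare CASEID as an 8-byte numeric with an 8. format, then read the table. */
data pregs_2006_2010;
    format CASEID 8.;
    length CASEID 8;
    set nsfgdata.d2006_2010fempreg;
run;
/* Read the adjusted weights. */
proc import
    file="&weights_path./NSFG_2006_2010_weights.csv"
    dbms=csv
    out=weights_2006_2010;
run;
/* Sort both tables by CASEID before merging. */
proc sort data=pregs_2006_2010;
    by CASEID;
run;
proc sort data=weights_2006_2010;
    by CASEID;
run;
/* Keep only rows that exist in the pregnancies table. */
data pregs_2006_2010_adj;
    merge pregs_2006_2010(drop=FINALWGT30 in=in_pregs) weights_2006_2010;
    by CASEID;
    if in_pregs;
run;
/* Read the jackknife coefficients needed for variance estimation. */
proc import
    file="&weights_path./NSFG_2006_2010_jkcoefs.csv"
    dbms=csv
    out=jkcoefs_2006_2010;
run;
/* Weighted counts of pregnancies by outcome within the analysis period. */
proc surveyfreq data=pregs_2006_2010_adj varmethod=jackknife;
    tables OUTCOME;
    weight FULL_SAMPLE_WGT;
    repweights REP_WGT_: / jkcoefs=jkcoefs_2006_2010;
    where 1225 <= DATEND <= 1272;
run;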
Example R Programs
The example R programs use the srvyr library and two tidyverse libraries to process input and output from the survey library.
Female Respondents File – Balanced Repeated Replication
The below R program will print out estimates of the population of women in the United States by age, based on Cycle 6 (2002) of the NSFG, which uses the balanced repeated replication weight scheme.
Program prerequisites:
- The cycle_6_femresp variable contains the female respondents data table of Cycle 6 (2002) of the NSFG.
- The path to the weights file is stored in the weights_path variable.
library(dplyr)
library(readr)
library(srvyr)
# Read the adjusted weights; file.path() supplies the path separator.
cycle_6_weights <- read_csv(file.path(weights_path, 'NSFG_2002_2003_cycle_6_weights.csv'))
# Attach the weights by numeric CASEID, declare the BRR replicate design,
# then estimate the population of women by age.
cycle_6_femresp |>
  mutate(
    CASEID = as.numeric(CASEID)
  ) |>
  left_join(
    cycle_6_weights,
    by = join_by(CASEID)
  ) |>
  as_survey_rep(
    weights = FULL_SAMPLE_WGT,
    repweights = starts_with('REP_WGT_'),
    type = 'BRR'
  ) |>
  group_by(
    AGER
  ) |>
  summarize(
    est_pop = survey_total(vartype = 'ci')
  )
Pregnancies File – Balanced Repeated Replication
The below R program will print out estimates of the number of pregnancies by pregnancy outcome in the analysis period of Cycle 6 (2002) of the NSFG.
Program prerequisites:
- The cycle_6_pregs variable contains the pregnancies data table of Cycle 6 (2002) of the NSFG.
- The path to the weights file is stored in the weights_path variable.
library(dplyr)
library(readr)
library(srvyr)
# Read the adjusted weights; file.path() supplies the path separator.
cycle_6_weights <- read_csv(file.path(weights_path, 'NSFG_2002_2003_cycle_6_weights.csv'))
# Attach the weights by numeric CASEID, declare the BRR replicate design,
# restrict to the analysis period, then count pregnancies by outcome.
cycle_6_pregs |>
  mutate(
    CASEID = as.numeric(CASEID)
  ) |>
  left_join(
    cycle_6_weights,
    by = join_by(CASEID)
  ) |>
  as_survey_rep(
    weights = FULL_SAMPLE_WGT,
    repweights = starts_with('REP_WGT_'),
    type = 'BRR'
  ) |>
  filter(
    1141 <= DATEND & DATEND <= 1224
  ) |>
  group_by(
    OUTCOME
  ) |>
  summarize(
    est_num = survey_total(vartype = 'ci')
  )
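The bounds 1141 and 1224 in the filter above are consistent with DATEND being coded as a century month, that is, the number of months elapsed since December 1899, and with the 1995-2001 analysis period of Cycle 6. The sketch below shows the conversion under that assumption. The helper names are made up for illustration.

```r
# Century month: months elapsed since December 1899, so January 1900 is 1.
to_century_month <- function(year, month) {
  (year - 1900) * 12 + month
}

# Inverse conversion back to a calendar year and month.
from_century_month <- function(cm) {
  list(year = 1900 + (cm - 1) %/% 12, month = (cm - 1) %% 12 + 1)
}

period_start <- to_century_month(1995, 1)   # January 1995
period_end   <- to_century_month(2001, 12)  # December 2001

print(c(period_start, period_end))
print(from_century_month(1272))
```

The same arithmetic gives 1225 to 1272 for January 2002 through December 2005, the analysis period used with the 2006-2010 NSFG.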
Female Respondents File – Jackknife
The below R program prints out estimates of the population of women in the United States by age, based on the 2006-2010 version of the NSFG, which uses the jackknife replicate weight scheme instead of balanced repeated replication. Because of this, the jackknife coefficients must also be loaded.
Program prerequisites:
- The femresp_2006_2010 variable contains the female respondents data table of the 2006-2010 version of the NSFG.
- The path to both the weights file and the jackknife coefficients file is stored in the weights_path variable.
library(dplyr)
library(readr)
library(srvyr)
# Read the adjusted weights and the jackknife coefficients.
weights_2006_2010 <- read_csv(file.path(weights_path, 'NSFG_2006_2010_weights.csv'))
jkcoefs_2006_2010 <- read_csv(file.path(weights_path, 'NSFG_2006_2010_jkcoefs.csv'))
# Attach the weights, declare the jackknife replicate design with its
# coefficients, then estimate the population of women by age.
femresp_2006_2010 |>
  left_join(
    weights_2006_2010,
    by = join_by(CASEID)
  ) |>
  as_survey_rep(
    weights = FULL_SAMPLE_WGT,
    repweights = starts_with('REP_WGT_'),
    type = 'JKn',
    rscales = jkcoefs_2006_2010$JKCoefficient
  ) |>
  group_by(
    AGER
  ) |>
  summarize(
    est_pop = survey_total(vartype = 'ci')
  )
Pregnancies File – Jackknife
The below R program will print out estimates of the number of pregnancies by pregnancy outcome in the analysis period of the 2006-2010 version of the NSFG, which uses the jackknife replicate weight scheme instead of balanced repeated replication.
Program prerequisites:
- The fempreg_2006_2010 variable contains the pregnancies data table of the 2006-2010 version of the NSFG.
- The path to both the weights file and the jackknife coefficients file is stored in the weights_path variable.
library(dplyr)
library(readr)
library(srvyr)
# Read the adjusted weights and the jackknife coefficients.
weights_2006_2010 <- read_csv(file.path(weights_path, 'NSFG_2006_2010_weights.csv'))
jkcoefs_2006_2010 <- read_csv(file.path(weights_path, 'NSFG_2006_2010_jkcoefs.csv'))
# Attach the weights, declare the jackknife replicate design with its
# coefficients, restrict to the analysis period, then count pregnancies
# by outcome.
fempreg_2006_2010 |>
  left_join(
    weights_2006_2010,
    by = join_by(CASEID)
  ) |>
  as_survey_rep(
    weights = FULL_SAMPLE_WGT,
    repweights = starts_with('REP_WGT_'),
    type = 'JKn',
    rscales = jkcoefs_2006_2010$JKCoefficient
  ) |>
  filter(
    1225 <= DATEND & DATEND <= 1272
  ) |>
  group_by(
    OUTCOME
  ) |>
  summarize(
    est_num = survey_total(vartype = 'ci')
  )
Citations
Footnotes
1. An attempt has been made to address this issue by introducing a self-administered portion of the survey, but this did not ameliorate the issue. The User’s Guide for Cycle 5 (1995) of the NSFG states:
In an effort to diminish the extent of under-reporting, abortion data in the 1995 NSFG were obtained through interviewer-administered questions (CAPI) as well as self-administered questions (Audio-CASI)…. This … yielded an estimate of 3,294,000 abortions for 1991-94, the most recent period for which comparable data are available…
This figure represents about 54 percent of the number reported for that period by the Alan Guttmacher Institute (AGI)… The number of abortions reported in the 5 years prior to the 1988 NSFG was only 41 percent of the number reported by AGI. Thus the reporting of abortions appears to be more complete in Cycle 5 than in Cycle 4 of the NSFG. This increase in reporting may be attributable to the use of incentives and Audio-CASI in 1995, but despite these methodological innovations, there still remains a large degree of under-reporting of abortion. (“Public Use Data File Documentation, National Survey of Family Growth, Cycle 5: 1995, User’s Guide” 1997)
2. “Parity” in this context refers to the number of live births that a given respondent has had.
3. The NSFG already estimates live births well. The inclusion of post-stratifications to live births has two advantages.
- It maintains and improves the accuracy of the estimates of live births per year.
- It reduces the variance of estimates of pregnancies per year.
Both of these compensate for any loss of accuracy or increase in estimator variance due to weight inflation to adjust for abortion under-reporting.
4. Additionally, the 2015-2019 NSFG expanded its target population from women ages 15 to 44 to women ages 15 to 49, so combining the 2015-2019 data with data from 2011-2015 would result in losing the over-44 respondents from the 2015-2019 data.
5. Starting with Cycle 6 (2002) of the NSFG, male respondents were surveyed. However, male respondents are given a separate survey questionnaire that does not contain a contraceptive use calendar. No attempt was made to derive weights adjusted for abortion under-reporting for male respondents.