Title: | Tools and Tests for Experiments with Partially Synthetic Data Sets |
---|---|
Description: | A set of functions to support experimentation in the utility of partially synthetic data sets. All functions compare an observed data set to one or a set of partially synthetic data sets derived from the observed data to (1) check that data sets have identical attributes, (2) calculate overall and specific variable perturbation rates, (3) check for potential logical inconsistencies, and (4) calculate confidence intervals and standard errors of desired variables in multiple imputed data sets. Confidence interval and standard error formulas have options for either synthetic data sets or multiple imputed data sets. For more information on the formulas and methods used, see Reiter & Raghunathan (2007) <doi:10.1198/016214507000000932>. |
Authors: | Charlotte Looby [aut, cre] |
Maintainer: | Charlotte Looby <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0.1 |
Built: | 2025-02-28 04:48:05 UTC |
Source: | https://github.com/cran/SynthTools |
This function calculates confidence intervals and standard errors for a specified continuous variable from multiply imputed datasets, and also gives a YES/NO indicator for whether or not the observed value is within the confidence interval.
The confidence intervals and standard errors are calculated by first taking the mean of the variable in each partially synthesized dataset, then using t.test() on those means to get the confidence interval.
ContCI(obs_data, imp_data_list, var, sig = 6, alpha = 0.05)
obs_data |
The original (observed) dataset, of the type "data.frame", to which the datasets in the next argument will be compared. |
imp_data_list |
A list composed of |
var |
The continuous variable being checked. |
sig |
The number of significant digits in the output data frame. Defaults to 6. |
alpha |
Test size, defaults to 0.05. |
This function was developed with the intention of making the job of researching partially synthetic data utility a bit easier by providing another way of measuring utility.
This function returns a data frame with the variable's observed mean, lower and upper limits of the confidence interval, standard error, and a YES/NO indicating whether or not the observed value is within the confidence interval.
# "PPA" is the observed data set.
# "PPAm5" is a list of 5 partially synthetic data sets derived from PPA.
# "age" is a continuous variable present in the synthesized data sets.
# 3 significant digits are desired in the output data frame.
ContCI(PPA, PPAm5, "age", sig = 3)
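The calculation described above (one mean per synthetic dataset, then t.test() on those means) can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's source: the toy data, the names obs, syn_list and result, and the output column names are all hypothetical.

```r
# Hypothetical stand-ins for an observed data set and m = 5 partially
# synthetic copies; none of these objects come from the package.
set.seed(1)
n <- 200
m <- 5
obs <- data.frame(age = round(runif(n, 18, 90)))
syn_list <- lapply(seq_len(m), function(i) {
  d <- obs
  idx <- sample(n, 40)                      # records whose age is re-drawn
  d$age[idx] <- round(runif(40, 18, 90))
  d
})

alpha <- 0.05
# One mean per synthetic data set.
syn_means <- vapply(syn_list, function(d) mean(d$age), numeric(1))
# t-based confidence interval computed from those m means.
tt <- t.test(syn_means, conf.level = 1 - alpha)
se <- sd(syn_means) / sqrt(m)               # standard error of the mean of means
obs_mean <- mean(obs$age)

# Column names below are assumptions made for this sketch.
result <- data.frame(
  Observed = signif(obs_mean, 3),
  Lower    = signif(tt$conf.int[1], 3),
  Upper    = signif(tt$conf.int[2], 3),
  SE       = signif(se, 3),
  In_CI    = ifelse(obs_mean >= tt$conf.int[1] & obs_mean <= tt$conf.int[2],
                    "YES", "NO")
)
print(result)
```

Because the interval is built from only m means, it can be quite wide or narrow depending on how much the synthetic data sets differ from one another.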
This function will check for comparability between two data sets, including dimensions, order of variables, variable classifications, and levels of factors. When a data set is fully or partially synthesized from an observed data set, these are the features that should be equal between the data sets so the utility of the synthetic data can be measured.
dataComp(obs_data, new_data)
obs_data |
The original (observed) data set, of the type "data.frame", to which the data set in the next argument will be compared. |
new_data |
The fully or partially synthetic data set to be compared to the observed data, of the type "data.frame". |
This function was developed with the intention of making the job of researching synthetic data utility a bit easier by making preliminary data set comparisons quickly.
A list containing the following components:
same.dim |
A logical value indicating whether or not |
same.order |
A logical value indicating whether or not the variables in |
class.identical |
A logical value indicating whether or not the variable classifications are identical. |
class.table |
A table of types of variable classifications. |
fac.num.same |
A logical value indicating whether or not the factors in the data sets have the same number of levels. |
fac.lev.same |
A logical value indicating whether or not the factors in the data sets have the same levels. |
# PPA is the observed data set; PPAps1 is a partially synthetic data set derived from the observed data.
dataComp(PPA, PPAps1)
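For readers who want to see what such comparability checks look like, here is a rough base R sketch of the same kinds of comparisons (dimensions, variable order, classes, factor levels). It assumes two small hypothetical data frames and does not reproduce the package's code; the result names only loosely mirror the components listed above.

```r
# Hypothetical observed and synthetic data frames (not package data).
obs <- data.frame(sex = factor(c("F", "M", "F")), age = c(30, 41, 25))
new <- data.frame(sex = factor(c("M", "M", "F")), age = c(30, 41, 25))

# Same number of rows and columns?
same_dim <- identical(dim(obs), dim(new))
# Same variables in the same order?
same_order <- identical(names(obs), names(new))

# Same class for every variable (first class only, for simplicity)?
obs_class <- vapply(obs, function(x) class(x)[1], character(1))
new_class <- vapply(new, function(x) class(x)[1], character(1))
class_identical <- identical(obs_class, new_class)
class_table <- table(obs_class)

# For the factors in obs, compare the number of levels and the levels.
is_fac <- vapply(obs, is.factor, logical(1))
fac_num_same <- all(mapply(function(a, b) nlevels(a) == nlevels(b),
                           obs[is_fac], new[is_fac]))
fac_lev_same <- all(mapply(function(a, b) identical(levels(a), levels(b)),
                           obs[is_fac], new[is_fac]))

print(list(same.dim = same_dim, same.order = same_order,
           class.identical = class_identical,
           fac.num.same = fac_num_same, fac.lev.same = fac_lev_same))
```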
This function will check for logical consistency between two categorical variables in a fully or partially synthesized data set.
logicCheck(obs_data, new_data, vars, NAopt = T)
obs_data |
The original (observed) data set, of the type "data.frame", to which the data set in the next argument will be compared. |
new_data |
The fully or partially synthetic data set to be compared to the observed data, of the type "data.frame". |
vars |
A vector of two categorical variables in the data sets to check for logical consistency. |
NAopt |
Defaults to TRUE to use NAs in tables. If you do not wish to check for NAs, put FALSE. |
When a data set is fully or partially synthesized from an observed data set, logical constraints that hold in the observed data set must also hold in the synthesized data set, but they may be violated during the course of the synthesis.
For example, if a data set contains an age variable and a variable that represents whether or not a person has a driver's license in the state of Pennsylvania, the age variable should indicate that the person is at least 16 years old whenever the license indicator shows that the person has a driver's license.
It is recommended that you check for data comparability with dataComp() prior to using this function.
This function creates cross-tabulations of the specified variables for both the observed data set and the synthesized data set, then checks that each cell of the synthesized table is zero or positive in agreement with the corresponding cell of the observed table. It was developed with the intention of making the job of researching synthetic data utility a bit easier by quickly checking for logical consistency.
This function returns a message stating whether or not there were any potential logical inconsistencies found in the data sets for the variables specified. Then the cross-tabulations will be printed (in either case) for the analyst to review.
This function will also return a list of the following components:
consistent |
A logical value indicating whether the variable cross-tabulation is logically consistent. |
obs.table |
The original data set cross-tabulation. |
new.table |
The new data set cross-tabulation. |
which |
A matrix indicating if values are logically consistent. 0=consistent, otherwise=inconsistent. |
# PPA is the observed data set; PPAps2 is a partially synthetic data set derived from the observed data.
# age17plus and marriage are two categorical variables within these data sets.
logicCheck(PPA, PPAps2, c("age17plus", "marriage"))
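The cross-tabulation check can be illustrated with a small base R sketch. It assumes one plausible reading of the rule: a combination that never occurs in the observed data (a zero cell) should not occur in the synthetic data. The toy data and object names are hypothetical, and the package's exact rule may differ.

```r
# Hypothetical observed data: nobody under 17 is married.
obs <- data.frame(
  age17plus = factor(c("no", "no", "yes", "yes", "yes"), levels = c("no", "yes")),
  marriage  = factor(c("never", "never", "married", "never", "married"),
                     levels = c("never", "married"))
)
# Hypothetical synthetic copy in which record 1 becomes an impossible combination.
new <- obs
new$marriage[1] <- "married"

# useNA = "always" keeps an NA row and column in both tables,
# so the two tables always have the same shape.
obs_tab <- table(obs$age17plus, obs$marriage, useNA = "always")
new_tab <- table(new$age17plus, new$marriage, useNA = "always")

# Assumed rule: flag cells that are empty in the observed table
# but populated in the synthetic table.
flag <- (obs_tab == 0) & (new_tab > 0)
consistent <- !any(flag)

print(obs_tab)
print(new_tab)
print(consistent)
```

In this toy case the single flagged cell is the "no"/"married" combination, which is the kind of cell an analyst would then inspect by hand.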
This function will calculate confidence intervals and standard errors from the proportional responses of multiply imputed datasets for a specified categorical variable, and also give a YES/NO indicator for whether or not the observed value is within the confidence interval. The confidence intervals and standard errors are calculated from variance formulas that are specific to whether the multiply imputed datasets are fully or partially synthetic. See reference for more information.
oneCatCI(obs_data, imp_data_list, type, var, sig = 6, alpha = 0.05)
obs_data |
The original (observed) dataset, of the type "data.frame", to which the datasets in the next argument will be compared. |
imp_data_list |
A list of datasets that are either synthetic or contain imputed values. |
type |
Specifies which type of datasets are in |
var |
The categorical variable being checked. Should be of type "factor". |
sig |
The number of significant digits in the output dataframe. Defaults to 6. |
alpha |
Test size, defaults to 0.05. |
This function was developed with the intention of making the job of researching synthetic data utility a bit easier by providing another way of measuring utility.
This function returns a dataframe with the variable's responses, observed values, lower and upper limits of the confidence interval, standard error, and "YES"/"NO" indicating whether or not the observed value is within the confidence interval.
Reiter JP, Raghunathan TE (2007). “The Multiple Adaptations of Multiple Imputation.” Journal of the American Statistical Association.
# PPA is the observed data set; PPAm5 is a list of 5 partially synthetic data sets derived from PPA.
# sex is a categorical variable within these data sets. 3 significant digits are desired.
oneCatCI(obs_data = PPA, imp_data_list = PPAm5, type = "partially", var = "sex", sig = 3)
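To make the variance calculation more concrete, the sketch below applies what are commonly given as the combining rules for partially synthetic data (total variance = average within-dataset variance + between-dataset variance / m, with a t reference distribution) to a single proportion. This is an assumption about the general method, not a copy of the package's implementation; the toy data and all object names are hypothetical, and the fully synthetic case uses a different variance formula described in the reference.

```r
# Hypothetical observed factor and m = 5 partially synthetic copies.
set.seed(42)
n <- 500
m <- 5
alpha <- 0.05
obs_sex <- factor(sample(c("F", "M"), n, replace = TRUE))
syn_list <- lapply(seq_len(m), function(i) {
  s <- obs_sex
  idx <- sample(n, 100)                               # values that get replaced
  s[idx] <- sample(levels(s), 100, replace = TRUE)
  s
})

# Point estimate and within-dataset variance of the proportion of "F".
q <- vapply(syn_list, function(s) mean(s == "F"), numeric(1))
u <- q * (1 - q) / n

q_bar <- mean(q)      # combined point estimate
b     <- var(q)       # between-dataset variance
u_bar <- mean(u)      # average within-dataset variance

# Assumed partially synthetic combining rule and degrees of freedom.
T_p <- u_bar + b / m
df  <- (m - 1) * (1 + u_bar / (b / m))^2
half  <- qt(1 - alpha / 2, df) * sqrt(T_p)
lower <- q_bar - half
upper <- q_bar + half

obs_prop <- mean(obs_sex == "F")
in_ci <- ifelse(obs_prop >= lower & obs_prop <= upper, "YES", "NO")
print(signif(c(Observed = obs_prop, Lower = lower, Upper = upper,
               SE = sqrt(T_p)), 3))
print(in_ci)
```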
This function will calculate the overall perturbation rate of an imputed data set and the perturbation rates of the specific variables requested.
pertRates(obs_data, new_data, imp_vars, desc = FALSE, sig = 4)
obs_data |
The original (observed) dataset, of the type "data.frame", to which the data set in the next argument will be compared. |
new_data |
The fully or partially synthetic data set to be compared to the observed data, of the type "data.frame". |
imp_vars |
The variable or a vector of variables which were imputed and are to be used in the overall perturbation rate calculation. |
desc |
Whether or not the variable perturbation rates should be output in descending rate order. Defaults to FALSE. |
sig |
The number of significant digits desired for the overall perturbation rate. Defaults to 4. |
A record in a data set is considered "perturbed" when at least one value in the record is different from the observed data. The overall perturbation rate is therefore the number of records that are found to be perturbed over the number of records in a data set.
The variable perturbation rate is simply the rate at which the values for a given variable are different from those in the observed data set.
This function was developed with the intention of making the job of researching synthetic data utility a bit easier by quickly calculating perturbation rates.
Returns the overall perturbation rate of the synthetic data set and the specific variable perturbation rates as percentages, rounded to one decimal place. The function also returns a list with the following components:
overall |
The overall perturbation rate. |
variable |
A vector of variable perturbation rates. |
# PPA is the observed data set; PPAps2 is a partially synthetic data set derived from the observed data.
# age17plus, marriage, and vet are three categorical variables within these data sets.
pertRates(PPA, PPAps2, c("age17plus", "marriage", "vet"))
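The two definitions given above (a record is perturbed when at least one of its values differs; a variable's rate is the share of its values that differ) translate directly into a few lines of base R. The sketch below uses hypothetical toy data and object names, ignores missing values for simplicity, and is not the package's code.

```r
# Hypothetical observed data and a synthetic copy with a few changed values.
obs <- data.frame(a = c(1, 2, 3, 4),
                  b = factor(c("x", "y", "x", "y")))
new <- obs
new$a[1] <- 9
new$b[c(1, 3)] <- "y"

imp_vars <- c("a", "b")

# Logical matrix: TRUE where a value differs from the observed value.
# Values are compared as character so factors and numerics are treated alike.
changed <- mapply(function(o, s) as.character(o) != as.character(s),
                  obs[imp_vars], new[imp_vars])

# Share of changed values per variable.
variable_rates <- colMeans(changed)
# Share of records with at least one changed value.
overall_rate <- mean(rowSums(changed) > 0)

print(round(100 * variable_rates, 1))
print(round(100 * overall_rate, 1))
```

Here records 1 and 3 are perturbed, so the overall rate is 50%, while the variable rates are 25% for a and 50% for b.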
A dataset containing some variables about 1000 people in Pennsylvania. This is a subset of the 2017 ACS PUMS data with one indicator variable added.
PPA
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a list that has the 5 partially synthetic versions of PPA (PPAps1 - PPAps5).
PPAm5
5 data frames with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a version of the PPA data set that is partially synthetic. Some of the values of "sex", "marriage", and "age17plus" were imputed.
PPAps1
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a version of the PPA data set that is partially synthetic. Some of the values of "sex", "marriage", and "age17plus" were imputed.
PPAps2
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a version of the PPA data set that is partially synthetic. Some of the values of "sex", "marriage", and "age17plus" were imputed.
PPAps3
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a version of the PPA data set that is partially synthetic. Some of the values of "sex", "marriage", and "age17plus" were imputed.
PPAps4
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This is a version of the PPA data set that is partially synthetic. Some of the values of "sex", "marriage", and "age17plus" were imputed.
PPAps5
A data frame with 1000 rows and 7 variables:
age of respondent, in years, ("AGEP")
sex of respondent, ("SEX")
recoded detailed race code, ("RAC1P")
married/spouse present/spouse absent, ("MSP")
employment status recode, ("ESR")
veteran period of service, ("VPS")
age >= 17 indicator
This function will calculate confidence intervals and standard errors from the proportional tabular responses of multiply imputed datasets for the cross-tabulation of two categorical variables, and also give a YES/NO indicator for whether or not the observed value is within the confidence interval. The confidence intervals and standard errors are calculated from formulas that are adapted for fully and partially synthetic data sets. See reference for more information.
twoCatCI(obs_data, imp_data_list, type, vars, sig = 4, alpha = 0.05)
obs_data |
The original (observed) dataset, of the type "data.frame", to which the datasets in the next argument will be compared. |
imp_data_list |
A list composed of |
type |
Specifies which type of datasets are in |
vars |
A vector of the two categorical variables being checked. Both should be of type "factor". |
sig |
The number of significant digits in the output dataframes. Defaults to 4. |
alpha |
Test size, defaults to 0.05. |
This function was developed with the intention of making the job of researching synthetic data utility a bit easier by providing another way of measuring utility.
This function returns a list of five data frames:
Observed |
A cross-tabular proportion of observed values |
Lower |
Lower limit of the confidence interval |
Upper |
Upper limit of the confidence interval |
SEs |
Standard Errors |
CI_Indicator |
"YES"/"NO" indicating whether or not the observed value is within the confidence interval |
Reiter JP, Raghunathan TE (2007). “The Multiple Adaptations of Multiple Imputation.” Journal of the American Statistical Association.
# PPA is the observed data set. PPAm5 is a list of 5 partially synthetic data sets derived from PPA.
# "sex" and "race" are categorical variables present in the synthesized data sets.
# 3 significant digits are desired in the output data frames.
twoCatCI(PPA, PPAm5, "partially", c("sex", "race"), sig = 3)
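As a rough illustration of how cell-wise intervals for a two-way table might be assembled, the sketch below stacks the cell proportions from each synthetic data set into an array and applies the partially synthetic combining rule (average within variance + between variance / m) to every cell. It is a sketch under assumptions, not the package's implementation: the toy data, the object names, and the use of the binomial variance for each cell are all hypothetical choices.

```r
# Hypothetical observed data and m = 5 partially synthetic copies.
set.seed(7)
n <- 400
m <- 5
alpha <- 0.05
obs <- data.frame(
  sex  = factor(sample(c("F", "M"), n, replace = TRUE)),
  race = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
syn_list <- lapply(seq_len(m), function(i) {
  d <- obs
  idx <- sample(n, 80)
  d$sex[idx] <- sample(levels(d$sex), 80, replace = TRUE)
  d
})

# Cell proportions per synthetic data set, stacked into a 2 x 3 x m array.
tabs <- lapply(syn_list, function(d) prop.table(table(d$sex, d$race)))
arr  <- simplify2array(tabs)

q_bar <- apply(arr, c(1, 2), mean)                   # combined cell estimates
b     <- apply(arr, c(1, 2), var)                    # between-dataset variance
u_bar <- apply(arr * (1 - arr) / n, c(1, 2), mean)   # average within variance

# Assumed partially synthetic combining rule, applied cell by cell.
T_p <- u_bar + b / m
df  <- (m - 1) * (1 + u_bar / (b / m))^2
half  <- qt(1 - alpha / 2, df) * sqrt(T_p)
lower <- q_bar - half
upper <- q_bar + half

obs_tab <- prop.table(table(obs$sex, obs$race))
ci_ind  <- ifelse(obs_tab >= lower & obs_tab <= upper, "YES", "NO")

print(signif(lower, 3))
print(signif(upper, 3))
print(ci_ind)
```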