Datasets generally require a data dictionary that incorporates
information about the dataset’s variables and their descriptions,
including a description of any variable options. This information is
crucial, particularly when a dataset will be shared for further
analyses. However, constructing it can be time-consuming. The dataMeta
package has a collection of functions designed to construct a data
dictionary and append it to the original dataset as an attribute, along
with other information generally provided in other software as metadata.
This information includes the time and date when the dataset was last
edited, the user name, and a main description of the dataset. Finally,
the dataset is saved as an R dataset (.rds).
There are three basic steps to building a data dictionary with this package, which are outlined in the figure below. First, a “linker” data frame is created, where the user adds each variable description and also provides a variable “type” or key. These keys/variable types are explained below. The variable type depends on whether the user wants to list all available variable options or whether a range of variable values would suffice. Second, the main data dictionary is created using the original dataset and the linker data frame. Here, the user can add descriptions of the variable options as needed or just build a dictionary with variable names, their descriptions and options. Finally, the user can append the dictionary to the original dataset as an R attribute, along with the date on which the dictionary was created, the author name, and the general R attributes included for data frames. The new dataset with its attributes is then saved as an R dataset (.rds).
The data used for this vignette is obtained from the Centers for
Disease Control and Prevention (CDC) GitHub repository for publicly
available Zika data: https://github.com/cdcepi/zika. This repo contains Zika
data that has been scraped from the published records of the health
departments of various countries. The dataset used contains information
related to Zika infection in the United States Virgin Islands (USVI)
as published by their Health Department in their January 3, 2017
report and can be obtained using the code below. Please note that the
data is read with stringsAsFactors = FALSE.
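As a minimal sketch of this kind of read, the example below writes two rows of the USVI report to a temporary file and reads them back with stringsAsFactors = FALSE. The temporary file and the name my.sample are stand-ins introduced here for illustration; the actual file location is not assumed.

```r
# Stand-in for the downloaded report: a temporary CSV holding two rows
# of the USVI data (the real file location is not assumed here).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "report_date,location,location_type,data_field,data_field_code,time_period,time_period_type,value,unit",
  "2017-01-03,United_States_Virgin_Islands,territory,zika_reported,VI0001,NA,NA,1930,cases",
  "2017-01-03,United_States_Virgin_Islands-Saint_Thomas,county,zika_reported,VI0001,NA,NA,1163,cases"
), tmp)

# stringsAsFactors = FALSE keeps text columns as character, not factor.
my.sample <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

# Inspect the class of each column.
sapply(my.sample, class)
```

Keeping text columns as character matters later on, because the dictionary reports the class of each variable.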
Below is a portion of this dataset:
report_date | location | location_type | data_field | data_field_code | time_period | time_period_type | value | unit |
---|---|---|---|---|---|---|---|---|
2017-01-03 | United_States_Virgin_Islands | territory | zika_reported | VI0001 | NA | NA | 1930 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_Thomas | county | zika_reported | VI0001 | NA | NA | 1163 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_Croix | county | zika_reported | VI0001 | NA | NA | 646 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_John | county | zika_reported | VI0001 | NA | NA | 121 | cases |
2017-01-03 | United_States_Virgin_Islands | territory | zika_lab_positive | VI0002 | NA | NA | 916 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_Thomas | county | zika_lab_positive | VI0002 | NA | NA | 645 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_Croix | county | zika_lab_positive | VI0002 | NA | NA | 198 | cases |
2017-01-03 | United_States_Virgin_Islands-Saint_John | county | zika_lab_positive | VI0002 | NA | NA | 73 | cases |
2017-01-03 | United_States_Virgin_Islands | territory | zika_not | VI0003 | NA | NA | 932 | cases |
2017-01-03 | United_States_Virgin_Islands | territory | zika_pending | VI0004 | NA | NA | 81 | cases |
Another way to load the data is to use its raw link from the GitHub repo where it is stored:
To build a data dictionary for the previous dataset, a linker data
frame is constructed. This linker will serve as an intermediary to build
the data dictionary. It will contain the names of the variables, a
description of each variable provided by the user and a “variable type.”
The variable type will give each variable a value of 1 or 0, as follows:
- 0: for those variables that have options that can be portrayed as a range of values. For example, age or dates or any categorical factors that the user does not want to list out (NA values will be maintained).
- 1: for those variables that have options that need to be listed and/or described later on.
The linker is built using one of the following two functions:
- prompt_linker will prompt the user to add the description of each variable and the variable type in the console.
- build_linker will require that the user create two vectors that will fill out the variable descriptions and variable types. This is shown below:
var_desc <- c("Date when report was published", "Regional location",
"Description of regional location", "Type of case",
"A specific code for each data field", "The time period of each week",
"The type of time period", "The number of cases per data field type",
"The unit in which cases are reported")
var_type <- c(0, 1, 0, 1, 0, 0, 0, 0, 1)
linker <- build_linker(my.data, variable_description = var_desc, variable_type = var_type)
var_name | var_desc | var_type |
---|---|---|
report_date | Date when report was published | 0 |
location | Regional location | 1 |
location_type | Description of regional location | 0 |
data_field | Type of case | 1 |
data_field_code | A specific code for each data field | 0 |
time_period | The time period of each week | 0 |
time_period_type | The type of time period | 0 |
value | The number of cases per data field type | 0 |
unit | The unit in which cases are reported | 1 |
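The structure of the linker can be sketched in base R. The example below is not the implementation of build_linker; it hand-builds an equivalent three-column data frame for a toy dataset (toy and toy_linker are names introduced here) and runs a sanity check that seems worth doing before calling build_linker: one description and one type per column, with types restricted to 0 or 1.

```r
# Toy data frame standing in for the dataset (three of its columns).
toy <- data.frame(
  report_date = c("2017-01-03", "2017-01-03"),
  location    = c("United_States_Virgin_Islands",
                  "United_States_Virgin_Islands-Saint_Thomas"),
  value       = c(1930L, 1163L),
  stringsAsFactors = FALSE
)

var_desc <- c("Date when report was published", "Regional location",
              "The number of cases per data field type")
var_type <- c(0, 1, 0)

# Both vectors need one entry per column, and types must be 0 or 1.
stopifnot(length(var_desc) == ncol(toy),
          length(var_type) == ncol(toy),
          all(var_type %in% c(0, 1)))

# Hand-built equivalent of the linker table (assumed layout, base R only).
toy_linker <- data.frame(var_name = names(toy),
                         var_desc = var_desc,
                         var_type = var_type,
                         stringsAsFactors = FALSE)
toy_linker
```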
Next, the dictionary is built from the linker created above and the
original dataset using the build_dict function. Here, option_description
is set to NULL because the options of each variable are
self-explanatory, so the dictionary will not contain a description for
each variable option. If option descriptions are to be added, then a
vector with one description per option is constructed. The length of
this vector needs to be the same as the number of rows in the
dictionary, which depends on the variable_type used in the linker.
Another way to add option descriptions is to use the prompt_varopts
function, which prompts the user for a description of each variable
option, without the need to write the descriptions vector beforehand. If
using this option, then the prompt_varopts argument of build_dict must
be set to TRUE and option_description must be NULL. The following code
builds a data dictionary for the USVI report:
dict <- build_dict(my.data = my.data, linker = linker, option_description = NULL,
prompt_varopts = FALSE)
The dictionary will look as follows:
| variable_name | variable_class | variable_description | variable_options |
|---|---|---|---|
| report_date | character | Date when report was published | 2017-01-03 to 2017-01-03 |
| location | | Regional location | United_States_Virgin_Islands |
| | | | United_States_Virgin_Islands-Saint_Thomas |
| | | | United_States_Virgin_Islands-Saint_Croix |
| | | | United_States_Virgin_Islands-Saint_John |
| location_type | | Description of regional location | county to territory |
| data_field | | Type of case | zika_reported |
| | | | zika_lab_positive |
| | | | zika_not |
| | | | zika_pending |
| | | | confirmed_age_under20 |
| | | | confirmed_age_20to39 |
| | | | confirmed_age_40to59 |
| | | | confirmed_age_over59 |
| | | | confirmed_age_unk |
| | | | confirmed_male |
| | | | confirmed_female |
| | | | confirmed_fever |
| | | | confirmed_acute_fever |
| | | | confirmed_arthralgia |
| | | | confirmed_arthritis |
| | | | confirmed_rash |
| | | | confirmed_conjunctivitis |
| | | | confirmed_eyepain |
| | | | confirmed_headache |
| | | | confirmed_malaise |
| | | | zika_no_specimen |
| | | | zika_tested_pregnant |
| | | | zika_positive_pregnant |
| | | | zika_negative_pregnant |
| | | | zika_pending_pregnant |
| | | | zika_no_specimen_pregnant |
| data_field_code | | A specific code for each data field | VI0001 to VI0026 |
| time_period | logical | The time period of each week | NA to NA |
| time_period_type | | The type of time period | NA to NA |
| value | integer | The number of cases per data field type | 0 to 1930 |
| unit | | The unit in which cases are reported | cases |
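How the variable type shapes the options column, and therefore the number of dictionary rows, can be sketched in base R. This is an illustration of the idea only, not the code inside build_dict: describe_options is a hypothetical helper, and the toy data frame and its values are made up for the example.

```r
# Toy data and a variable type per column (1 = list options, 0 = range).
toy <- data.frame(
  location = c("Saint_Thomas", "Saint_Croix", "Saint_Thomas"),
  value    = c(1163L, 646L, 645L),
  stringsAsFactors = FALSE
)
var_type <- c(location = 1, value = 0)

# describe_options() is a hypothetical helper, not part of dataMeta.
describe_options <- function(x, type) {
  if (type == 1) {
    as.character(unique(x))        # list every distinct option
  } else {
    r <- range(x, na.rm = TRUE)    # summarise as "min to max"
    paste(r[1], "to", r[2])
  }
}

opts <- Map(describe_options, toy, var_type[names(toy)])
opts$location
opts$value

# Total number of option rows; an option-description vector
# would need this many entries.
sum(lengths(opts))
```

Applied to the USVI dictionary, the same count gives 37 rows: 4 locations, 26 data fields, and one row for each of the remaining seven variables.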
To add the dictionary as an attribute to the original dataset, the
function incorporate_attr is used. This function requires that the user
add a main description of the dataset, as shown below. There is also a
prompt_attr function that will prompt the user to add this description
through the console. Once the attributes are added to the dataset, the
function save_it will save the file as an R data file (.rds), which
preserves all attributes of the dataset. The attributes of the data are
accessed with the command attributes(my.new.data), and single
attributes, such as the main description of the dataset, can be obtained
by typing attributes(my.new.data)$main, as shown below:
data_desc = "This data set portrays Zika infection related cases as reported by USVI."
my.new.data <- incorporate_attr(my.data = my.data, data.dictionary = dict, main_string = data_desc)
attributes(my.new.data)
$names
[1] "report_date" "location" "location_type" "data_field" "data_field_code"
[6] "time_period" "time_period_type" "value" "unit"
$class
[1] "data.frame"
$row.names
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
$main
[1] "This data set portrays Zika infection related cases as reported by USVI."
$dictionary
variable_name variable_class variable_description
1 report_date character Date when report was published
2 location Regional location
3
4
5
6 location_type Description of regional location
7 data_field Type of case
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33 data_field_code A specific code for each data field
34 time_period logical The time period of each week
35 time_period_type The type of time period
36 value integer The number of cases per data field type
37 unit The unit in which cases are reported
variable_options
1 2017-01-03 to 2017-01-03
2 United_States_Virgin_Islands
3 United_States_Virgin_Islands-Saint_Thomas
4 United_States_Virgin_Islands-Saint_Croix
5 United_States_Virgin_Islands-Saint_John
6 county to territory
7 zika_reported
8 zika_lab_positive
9 zika_not
10 zika_pending
11 confirmed_age_under20
12 confirmed_age_20to39
13 confirmed_age_40to59
14 confirmed_age_over59
15 confirmed_age_unk
16 confirmed_male
17 confirmed_female
18 confirmed_fever
19 confirmed_acute_fever
20 confirmed_arthralgia
21 confirmed_arthritis
22 confirmed_rash
23 confirmed_conjunctivitis
24 confirmed_eyepain
25 confirmed_headache
26 confirmed_malaise
27 zika_no_specimen
28 zika_tested_pregnant
29 zika_positive_pregnant
30 zika_negative_pregnant
31 zika_pending_pregnant
32 zika_no_specimen_pregnant
33 VI0001 to VI0026
34 NA to NA
35 NA to NA
36 0 to 1930
37 cases
$last_edit_date
[1] "2025-02-27 03:00:29 UTC"
$author
[1] "root"
The user can also export the data dictionary by itself as .csv or .xlsx and also save the data with all of its attributes, as shown below:
# Exporting dictionary only:
dict_only <- attributes(my.new.data)$dictionary
write.csv(dict_only, "dict_only.csv")
# Saving as .rds (dataset with appended dictionary)
save_it(complete_dict = my.new.data, name_of_file = "My Complete Dataset")
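The reason an .rds file is a suitable container can be checked with a small round trip. The sketch below uses base R's saveRDS and readRDS on a toy data frame rather than save_it, whose internals are not shown here; the attribute names main and dictionary mirror the output above, while the toy values and the temporary file path are made up for illustration.

```r
# Toy data frame with two custom attributes attached by hand.
toy <- data.frame(value = c(1930L, 1163L))
attr(toy, "main") <- "Toy description of the dataset."
attr(toy, "dictionary") <- data.frame(variable_name = "value",
                                      variable_options = "1163 to 1930",
                                      stringsAsFactors = FALSE)

# Round trip through an .rds file in a temporary location.
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
restored <- readRDS(rds_path)

# The custom attributes are still present after reading the file back.
attributes(restored)$main
names(attributes(restored))
```

A .csv export, by contrast, keeps only the table itself, which is why the dictionary is written to its own file in that case.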