Exploring Dengue Data with BigDataPE

Introduction

The BigDataPE package provides a simple and secure interface for accessing datasets published by the Big Data PE platform from the Government of the State of Pernambuco, Brazil.

In this vignette we walk through the full package workflow using the Dengue, Zika and Chikungunya dataset — a record of notified cases in health units (public and private) in Recife, shared by the State Health Department of Pernambuco (Secretaria Estadual de Saude de Pernambuco).

Field Value
Classification Broad (public)
Source institution State Health Department of Pernambuco
Ingestion type SQL
Records 1 009
LGPD level None
Last updated 2024-05-24

Access note: Although this dataset is public, you must submit an access request on the Big Data PE platform and be connected to the PE Conectado network or a VPN to query the data through the API. Without this connection, requests will time out or return a service-unavailable error.

Setup

Load the package and the helper libraries used in the analysis:

library(BigDataPE)
library(dplyr)
library(tibble)

1. Storing the access token

The first step is to store the authentication token associated with the dataset. This token is provided by the Big Data PE platform after your access request is approved.

bdpe_store_token("dengue", "your-token-here")
#> ✔ Token stored in environment variable: `BigDataPE_dengue`

You can verify that the token was stored correctly:

bdpe_list_tokens()
#> [1] "dengue"

2. Fetching data

Basic query

bdpe_fetch_data() is the most direct way to query the API. Let’s fetch the first 100 records:

data <- bdpe_fetch_data("dengue", limit = 100, offset = 0)
glimpse(data)
#> Rows: 100
#> Columns: 126
#> $ nu_notificacao           <chr> "3517726", "3613049", "3507055", ...
#> $ tp_notificacao           <chr> "2", "2", "2", ...
#> $ co_cid                   <chr> "A90", "A90", "A90", ...
#> $ dt_notificacao           <chr> "2020-07-07", "2020-06-27", "2020-01-03", ...
#> $ ds_semana_notificacao    <chr> "202028", "202026", "202001", ...
#> $ notificacao_ano          <chr> "2020", "2020", "2020", ...
#> $ co_municipio_notificacao <chr> "261160", "260790", "261160", ...
#> $ tp_sexo                  <chr> "M", "M", "M", ...
#> $ febre                    <chr> "1", "1", "1", ...
#> $ mialgia                  <chr> "1", "2", "1", ...
#> $ cefaleia                 <chr> "2", "1", "1", ...
#> $ ...

The dataset contains 126 columns with detailed information on each notification, including demographics, symptoms, laboratory results, warning signs, severity indicators and case outcome.

Using query parameters

The API supports additional filters through the query parameter. For example, to fetch only notifications from the year 2020:

dengue_2020 <- bdpe_fetch_data(
  "dengue",
  limit = 50,
  offset = 0,
  query = list(notificacao_ano = "2020")
)
nrow(dengue_2020)
#> [1] 50

You can also filter by municipality of residence. The IBGE code for Recife is 261160:

dengue_recife <- bdpe_fetch_data(
  "dengue",
  limit = 100,
  offset = 0,
  query = list(co_municipio_residencia = "261160")
)
nrow(dengue_recife)
#> [1] 100

Filters can be combined. To fetch female cases in Recife:

dengue_female_recife <- bdpe_fetch_data(
  "dengue",
  limit = 100,
  offset = 0,
  query = list(
    co_municipio_residencia = "261160",
    tp_sexo = "F"
  )
)
nrow(dengue_female_recife)

3. Fetching data in chunks

When the data volume is large, use bdpe_fetch_chunks() which paginates automatically. To fetch all 1,009 records in blocks of 500:

all_data <- bdpe_fetch_chunks(
  "dengue",
  total_limit = Inf,
  chunk_size = 500,
  verbosity = 1
)
#> ℹ Fetched 500 records (total: 500).
#> ℹ Fetched 500 records (total: 1000).
#> ℹ Fetched 9 records (total: 1009).
#> ✔ Fetching complete: 1009 records retrieved.

dim(all_data)
#> [1] 1009  126

4. Exploring the data

With the data in hand we can run quick exploratory summaries.

Distribution by sex

all_data |>
  count(tp_sexo, sort = TRUE)
#> # A tibble: 3 x 2
#>   tp_sexo     n
#>   <chr>   <int>
#> 1 F         561
#> 2 M         443
#> 3 I           5

Notifications by year

all_data |>
  count(notificacao_ano, sort = TRUE)
#> # A tibble: 1 x 2
#>   notificacao_ano     n
#>   <chr>           <int>
#> 1 2020             1009

Most frequent symptoms

Symptom fields use "1" for yes and "2" for no. We can compute the proportion of each symptom:

symptoms <- c("febre", "mialgia", "cefaleia", "exantema", "vomito",
              "nausea", "dor_costas", "conjutivite", "artrite",
              "artralgia", "dor_retro")

all_data |>
  summarise(across(all_of(symptoms), ~ mean(.x == "1", na.rm = TRUE))) |>
  tidyr::pivot_longer(everything(),
                      names_to  = "symptom",
                      values_to = "proportion") |>
  arrange(desc(proportion))
#> # A tibble: 11 x 2
#>    symptom      proportion
#>    <chr>             <dbl>
#>  1 febre             0.914
#>  2 cefaleia          0.531
#>  3 mialgia           0.512
#>  4 artralgia         0.194
#>  5 exantema          0.181
#>  6 dor_costas        0.155
#>  7 vomito            0.150
#>  8 nausea            0.139
#>  9 dor_retro         0.125
#> 10 conjutivite       0.053
#> 11 artrite           0.046

Distribution by neighbourhood (top 10)

all_data |>
  filter(no_bairro_residencia != "") |>
  count(no_bairro_residencia, sort = TRUE) |>
  head(10)
#> # A tibble: 10 x 2
#>    no_bairro_residencia     n
#>    <chr>                <int>
#>  1 IBURA                  46
#>  2 VARZEA                 37
#>  3 COHAB                  33
#>  4 BOA VIAGEM             30
#>  5 AGUA FRIA              27
#>  ...

Hospitalisations

all_data |>
  count(st_ocorreu_hospitalizacao) |>
  mutate(description = case_match(
    st_ocorreu_hospitalizacao,
    "1" ~ "Yes",
    "2" ~ "No",
    "9" ~ "Unknown",
    ""  ~ "Not reported"
  ))
#> # A tibble: 4 x 3
#>   st_ocorreu_hospitalizacao     n description
#>   <chr>                     <int> <chr>
#> 1                             330 Not reported
#> 2 1                            51 Yes
#> 3 2                           565 No
#> 4 9                            63 Unknown

5. Managing tokens

Retrieve a stored token

my_token <- bdpe_get_token("dengue")

Remove a token

At the end of your session, or when the token is no longer needed:

bdpe_remove_token("dengue")
#> ✔ Token successfully removed for dataset: "dengue"

Data dictionary (key variables)

Variable Description
nu_notificacao Notification number
co_cid ICD-10 code (A90 = Dengue)
dt_notificacao Notification date
notificacao_ano Notification year
co_municipio_notificacao IBGE code of the notifying municipality
dt_diagnostico_sintoma Date of first symptoms
tp_sexo Sex (M, F, I = Indeterminate)
tp_raca_cor Race/colour (1–5, 9 = Unknown)
no_bairro_residencia Neighbourhood of residence
febredor_retro Symptoms (1 = Yes, 2 = No)
diabetesauto_imune Pre-existing conditions (1 = Yes, 2 = No)
st_ocorreu_hospitalizacao Hospitalisation occurred (1/2/9)
tp_classificacao_final Final case classification
tp_evolucao_caso Outcome (1 = Recovery, 2 = Death, …)

References