Data Visualization

Instructor: Will Beasley

Steps

In your personal folder on Enclave, create workbook called "graphs-1" in "session-4/"
Import data from Jerrod's lesson
- From "Short Cource/ Student Worksapces/anzalone_j/session_3/workbook-output/Covid-19 patient summary fact table De-Id".
- I prefer you use Jerrod's (instead of your own), so we're consistent.
Create a SQL transform called pt_thinned
- I like when the transform name clearly indicates the grain of the resulting table.
- Toggle "Save as dataset".
- Temporarily trim to 5% of the 100k patients so our graphs are quicker during development. Later on, remove the thinning and use everyone. This strategy is common in stats (e.g. Andrew Gelman's advise in multilevel modeling).
- Select the variables you need. It reduces teh thigns that can go wrong. Be stingy. You can add more later.
- code:
```
SELECT 
    data_partner_id
    ,year(COVID_first_PCR_or_AG_lab_positive) as covid_year
    ,cast(age_at_covid as int)                as age_at_covid
    ,Severity_Type                            as severity_type
FROM Covid_19_patient_summary_fact_table_De_Id
WHERE 
    cast(right(person_id, 2) as int) between 0 and 4
```

Create an R transform called year_age_boxplot_1

Toggle "Save as dataset".
Iterate in the console. What can we improve?

Starting code

year_age_boxplot_1 <- function (pt_thinned) {
    # Execute the `library()` lines separately from the others --when debugging in the console (ctrl+shift+enter).
    library(magrittr)
    library(ggplot2)

    if (class(pt_thinned) != "data.frame") {
      stop("The incoming dataset is NOT an R data.frame.")
    }

    ds <-
        pt_thinned %>%
        tibble::as_tibble() %>%
        dplyr::select(
            age   = age_at_covid,
            year  = covid_year,
        ) %>%
        dplyr::mutate(
            # age   = as.integer(age), # Cast from a 64-bit to 32-bit integer
            year_f  = factor(year)
        )

    g <-
        ggplot(ds, aes(x = year_f, y = age)) +
        geom_boxplot()
    
    print(g)

    # Return something, preferrably the dataset underneath the graph.
    #    Don't end with `print()`.
    return (ds)
}

Create an R transform called year_severity_1

Toggle "Save as dataset".
Iterate in the console. What can we improve?

Starting code

year_severity_1 <- function (pt_thinned) {
    
    # Execute the `library()` lines separately from the others --when debugging in the console (ctrl+shift+enter).
    library(magrittr)
    library(ggplot2)

    if (class(pt_thinned) != "data.frame") {
        stop("The incoming dataset is NOT an R data.frame.")
    }

    ds_year_severity <-
        pt_thinned %>%
        tibble::as_tibble() %>%
        dplyr::select(
            data_partner_id,
            year              = covid_year,
            severity_type,
        ) %>%
        dplyr::mutate(
            year_f  = factor(year)
        ) %>%
        dplyr::group_by(year_f, severity_type) %>%
        dplyr::summarize(
            pt_count       = dplyr::n(),
            partner_count  = dplyr::n_distinct(data_partner_id),
        ) %>%
        dplyr::ungroup()

    g <-
        ds_year_severity %>%
        ggplot(aes(x = severity_type, y = pt_count)) +
        geom_bar(stat = "identity") +
        facet_wrap("year_f")

    print(g)

    # Return something, preferrably the dataset underneath the graph.
    #    Don't end with `print()`.
    return (ds_year_severity)
}

Once things are stable, remove the thinning code in pt_thinned, and rename it to something like pt. It's a little tedious, but reduces your chance of forgetting.

Once you're comfortable with the basics, here are some tools & techniques to help you manage the complexity & volume

Tweaking the Spark Environment, including adding R packages.
Saving an R object that was the result of an expensive calculation. Calculate it once, and let multiple downstream R transforms focus on a smaller role.

"Global Code", such as

load_packages <- function () {
  library(magrittr)
  library(ggplot2)
  library(survey)
}

predictors_1  <- "asthma * tx + smoking_ever + bmi_cut6 + gender_male + race_v2"

tidy_model <- function (m, model_title) {
  m %>%
    broom::tidy() %>%
    dplyr::mutate(
      model = model_title
    ) %>%
    tibble::as_tibble() %>%
    dplyr::mutate_if(is.numeric, round, 5)
}

palette_dark <- c( # http://colrd.com/image-dna/29746/
"(Intercept)"     = "gray50",   # Same color as reference group below
"none documented" = "gray50",
"saba"            = "#bfa9c4",  # lavender
"nasal"           = "#45a79e",  # green
"inhaled"         = "#bfc269",  # yellow green
"systemic"        = "#f46b4f",  # orange
"biologic"        = "#4287f5",  # blue
"both"            = "#646596",  # purple
)

palette_light <- scales::alpha(palette_dark, alpha = .5)

Resources

ggplot2 website
ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham
R Graphics Cookbook, 2nd edition by Winston Chang
See N3C resources we've listed, especially
- The Languages Section in the Palantir Foundary/Enclave documentation
- N3C Training

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Data Visualization

Steps

Resources

Files

README.md

Latest commit

History

README.md

File metadata and controls

Data Visualization

Steps

Resources