ooooo ooooo .oooooo. oooooo oooooo oooo ooooooooooooo .oooooo.
`888' `888' d8P' `Y8b `888. `888. .8' 8' 888 `8 d8P' `Y8b
888 888 888 888 `888. .8888. .8' 888 888 888
888ooooo888 888 888 `888 .8'`888. .8' 888 888 888
888 888 888 888 `888.8' `888.8' 8888888 888 888 888
888 888 `88b d88' `888' `888' 888 `88b d88'
o888o o888o `Y8bood8P' `8' `8' o888o `Y8bood8P'
A cheat sheet for common data journalism stuff. For details on installing these tools, see how I work. Use CMD + F to search the page, or the jump menu below if you know what you're looking for.
Command line tools grep | head/tail | ffmpeg | pdftk | esridump | wget | file | sed | wc | imagemagick | libpst | gunzip | sysctl
R packages shortcut keys | base | readr | scales | ggpmisc | dplyr | stringr | plyr | clipr | googlesheets4 | janitor
For more tips and tricks for working with common data formatting issues in R, check out the code snippet collection here.
Math for journalists Standard deviation | Rate comparisons | Odds ratios | Poisson distributions | Making sense of symbols
GIS tips Latitude and longitude | NC bounding box
Troubleshooting Location errors
Convenience files U.S. states | N.C. places
A collection of tips and tricks for tools you run from a bash terminal.
Search text files for specific character sequences.
grep "DUKES" absentee.csv
Return lines containing a string from a specified file and print to the command line.
grep "\"DUKES\",\"MICHAEL\",\"TYLER\"" absentee.csv > dukes.csv
Search for a string with quotes and output all lines to a file.
head -1 ./*/*|grep -B1 'Hospital overall rating' > variable.txt
Combining head and grep with a pipe allows you to chain commands, and the -B1 flag prints the line before each match, which here is the filename header that head adds.
Get a preview of a file.
head absentee.csv
Print the first 10 lines of a file to the command line.
head -1 absentee.csv > absentee_header.csv
Get the first line of a file and save it to a CSV file.
A power tool for processing and converting video and audio files.
ffmpeg -ss 5 -t 1.7 -i video.MOV -vf "fps=24,scale=640:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" -loop 0 video.gif
Create a high-quality GIF 640 pixels wide at 24 frames per second from the specified video file, skipping the first 5 seconds and lasting 1.7 seconds, then save the output.
ffmpeg -ss 278.8 -t 3.3 -i wings.mp4 -filter_complex "fps=24,scale=640:-1:flags=lanczos,drawtext=box=1:boxcolor=black@0.5:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='CONTRARY TO MY APPEARANCE,':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th,drawtext=box=1:boxcolor=black@0.5:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='I AM ENJOYING THIS.':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th+50,split[x1][x2];[x1]palettegen[p];[x2][p]paletteuse" wings.gif
Tweak the parameters (or delete the second drawtext filter) to adjust the font, text, etc. of the caption.
ffmpeg -i apple.mov -vcodec h264 -acodec aac apple.mp4
Use a codec flag to convert a video file from QuickTime to the more universal mp4 format.
ffmpeg -i full_video.avi full_video.mp4
Convert an AVI file to the more universal mp4 format.
ffmpeg -f concat -safe 0 -i vidlist.txt -c copy full_video.avi
Combine all of the files recorded in a text file called vidlist.txt, which looks like this:
file '/Users/username/directory/vid_seq001.avi'
file '/Users/username/directory/vid_seq002.avi'
ffmpeg -r 1/5 -i img%03d.jpg -c:v libx264 -vf "fps=25,format=yuv420p" out.mp4
Read in a sequence of images from a folder and write to an mp4 file. The -r flag is the framerate, where the duration of each image is the inverse of the provided value (e.g. 1/5 is 5 seconds, 60 is 1/60 of a second). The -i flag specifies the filename structure, with 0 padding specified (e.g. img%03d.jpg will iterate through img001.jpg, img002.jpg, img003.jpg etc.). More details here.
A power tool for processing and converting PDF files.
pdftk *.pdf cat output all_documents.pdf
pdftk doc01.pdf doc02.pdf cat output all_documents.pdf
Combine all the PDF files in the present directory into a single file, or specify individual files. Note: check to make sure the capitalization of the file extension matches.
pdftk blue_docs.pdf cat 1-700 output blue_docs01.pdf
pdftk blue_docs.pdf cat 701-end output blue_docs02.pdf
Specify the page number or use the end keyword to slice up a document.
pdftk doj_emails_portfolio.pdf unpack_files output doj_emails
PDF portfolios contain a bunch of individual files bound up in a filetype that needs a native PDF reader. Get around this by unpacking each file into a specific directory.
pdftk doj_emails_portfolio.pdf unpack_files output doj_emails;
IFS=$'\n'; set -f
for f in $(find ./doj_emails/ -name '*.pdf'); do pdftk "$f" unpack_files output ./doj_emails/; done
unset IFS; set +f
If your PDF portfolio has attachments within the individual PDFs, you can use your terminal to unpack the portfolio into a directory, then set up a loop to unpack all of the PDFs in that directory one by one. (Thanks to this Stackoverflow thread for tips on bash recursion and dealing with spaces in filenames.)
A targeted utility to pull down geographic data from ESRI maps.
esri2geojson https://services.arcgis.com/iFBq2AW9XO0jYYF7/arcgis/rest/services/Covid19byZIPnew/FeatureServer/0 nc_zipDATE.geojson
Download data from the ESRI REST endpoint that powers the N.C. DHHS COVID map of cases by zip code and save it as a geojson file.
A power tool for recursively downloading files, for example from the Web.
wget http://mtdukes.com/work.html
Saves the file from the specified URL.
wget -i file_list.txt
Saves individual files from URLs specified in a TXT file, one URL on each line.
wget --recursive --no-parent http://mtdukes.com
Recursively download an entire site's contents.
A utility for identifying file types and encodings.
file -I input.csv
Detect encoding of a file.
Tool to make substitutions in a text file (submitted by Chris Alcantara).
sed "s/dook/duke/g" ./old.csv > new.csv
Syntax uses / as a delimiter to separate patterns you want to substitute.
sed -i "" "s/dook/duke/g" ./data.csv
Substitute directly in the file by passing an empty string after the -i flag.
sed -i ".bak" "s/dook/duke/g" ./data.csv
Pass a suffix to the -i flag instead to keep a backup of the original file (here, data.csv.bak) before substituting in place.
A basic line, word and character counting utility for the command line.
wc -l < data_file.txt
Count the lines in a file. The < redirection operator (not a flag) excludes the file name from the results.
curl mtdukes.com --silent | wc -l
Pipe the output of curl to count the lines in a page's HTML.
A power tool for quickly editing images.
mogrify -crop 800x450+0+40 -path ./cropped *.jpg
In a folder of images, crop every jpg image at size 800x450, with a 0px offset from the left (x) and a 40px offset from the top (y).
A collection of tools to work with and convert PST files (Outlook folders containing email). On Macs, install with homebrew.
readpst public_records.pst
Outputs a single file in mbox format, which is a more open format you can import into a number of email clients. That includes Mozilla's Thunderbird, which you can import using the ImportExportNG add-on.
readpst -e -D public_records.pst
Separates the PST into individual eml files. Attachments are embedded in the file. The -D flag preserves deleted items. Can be read by services like Google's Pinpoint.
readpst -S -D public_records.pst
Separates the PST into individual files: each email's rich text body and each individual attachment. The numbered files are in eml format, with no extension. The -D flag preserves deleted items.
readpst -m -D public_records.pst
Produces both msg and eml files for each message. The -D flag preserves deleted items.
Quickly and efficiently unzip files (or a folder full of files). Can also unzip some files where a normal unzipping GUI throws errors. Useful for .gz file extensions.
gunzip map_file.gz
Unzips a specific file and replaces it with the unzipped version.
gunzip -c map_file.gz > map_file.shp
Uses the -c (stdout) flag to write to the console, then redirects to a new file so the original is kept.
gunzip -r /map_files
Uses the recursive flag to iterate through every zipped file in a folder and replace it with the unzipped version.
A suite of tools to analyze your system (for Mac).
sysctl hw.physicalcpu hw.logicalcpu
Outputs the number of physical and logical cores your CPU has.
A few common shortcuts to save you typing in RStudio.
Command + Shift + R
Prompt for a new label used in the document outline for an R script.
Command + Enter
Run a section of code in your R script.
Command + Shift + M
Input a %>% at your cursor to pipe output to the next line with the dplyr library.
Option + -
Input a <- at your cursor to assign output to a variable.
The stripped down version of R has lots of built-in stuff worth using.
rm(list = ls())
Start with a clean slate using the rm command.
setwd('~/projects/newsobserver/big_investigation/data')
Save time otherwise spent typing out long path names.
property_sales %>%
unique()
Generates a dataframe of unique rows across all fields.
wake_sales %>%
.[duplicated(.), ]
Generates a dataframe of duplicated rows, comparing all fields.
county_data <- list.files(path = './data/counties', full.names = TRUE)
The full.names flag prepends the directory path if TRUE, and returns the file name only if FALSE.
detach("package:plyr", unload=TRUE)
Removing a package can help when you have conflicts between functions with the same name.
as.Date('01/01/2001', format = '%m/%d/%Y' )
Specify the format explicitly using the syntax from strptime.
gsub(',', '', 'womp,womp')
gsub('\\(', ',for real', 'Replace the literal parenthesis (' )
Enter a pattern, replacement and data value to search.
gsub('[^ -~]', '', '日本人GALATIA')
This pattern translates to "not any printable ASCII character". Useful when cleaning a malformed file. You can also use [^ -~] in other contexts.
vax_data <- read.delim('cnty20210731.csv', fileEncoding = 'UTF-16LE', sep = '\t',
col.names = c('index', 'county', 'week_of', 'age12_17',
'age18_24', 'age25_49', 'age50_64')
)
The readr package is my preferred method for importing data, but it sometimes chokes on strange file encodings, even with the locale parameter set. So this base R method is worth trying.
options(scipen = 999)
Disables scientific notation so R prints out the full numeral in your current workspace.
#Black maternal mortality in 2020 and 2021 compared to the number of live births
prop.test(x = c(323, 506), n = c(397827, 517889), correct = FALSE)
Use a 2-sample test for equality of proportions to generate a p-value, which tells you how likely it is that the observed difference is due to random chance. Typically, p-values of less than 0.05 are considered statistically significant (a 95% confidence level). The prop.test function accepts a vector of values for the numerator (x) and the denominator (n).
A "fast and friendly way" to read in data. Part of the tidyverse suite of packages.
my_data <- read_csv('../my_data.csv', col_types = cols(.default = 'c', date = 'D'))
Tell read_csv to import all columns as a character by default, except for the date field, which should be a date (you can remove the date part if you want to read everything in as a character).
guess_encoding(file = 'annoying_file.csv')
Useful for errors reading in the file (like embedded nulls).
county_fips <- read_tsv( url(
'https://gist.githubusercontent.com/mtdukes/e0c6563927fb4f3e48f4e092b84b7023/raw/56e5abc1daf2277b5d901cbc25b9f9e64ab8c073/nc_fips_tab.tsv'
))
Use the url function from base R to pull formatted data posted in public gists, like these convenience files. Swap out the delimiter function (read_csv, read_delim, etc.) as needed. Get the URL by clicking "Raw" on the Gist page.
my_dataframe <- read_tsv('annoying_file.csv', locale = locale(encoding = "UTF-16LE") )
Specify non-UTF encodings you get from the guess_encoding output.
my_spreadsheet %>%
write_csv(paste0('my_data/my_spreadsheet',
format(Sys.time(),'%Y%m%d%H%M'),
'.csv'))
Using Sys.time(), you can quickly save a dataset with an automatic timestamp for easy organization. No more my_spreadsheet_final_Final_FINAL_FINALFORREAL.csv!
A library to make scaling and labeling easier.
vax_income %>%
ggplot(aes(x = median_income, y=PctTotal)) +
geom_point() +
scale_x_continuous(labels = scales::dollar_format())
The dollar_format function shortcuts the annoying parsing issues.
Miscellaneous extensions to the ggplot package.
vax_income %>%
ggplot(aes(x = median_income, y = pct_total)) +
geom_point() +
geom_smooth(method = "lm", formula = y ~ x, show.legend = FALSE) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
label.x.npc = "right", label.y.npc = 0.15,
formula = y ~ x, parse = TRUE, size = 3)
The stat_poly_eq function lets you annotate the graph with a regression formula. BONUS: What's a good value for R-squared?
A "grammar of data manipulation" and part of the tidyverse package.
nc_voters %>%
sample_n(10)
Specify the number of rows from the dataframe to return.
clean_table <- dirty_table %>%
mutate(across(where(is.character), toupper))
Transforms all columns containing characters to uppercase all at once. Incredibly useful for cleaning data!
df_clean <- df %>% mutate(across(everything(), ~ iconv(.x, sub = '') ))
Works across an entire dataframe, removing all malformed characters, multibyte strings or bad, non-UTF8 encodings that can't be converted.
A suite of convenient functions for working with strings, and part of the tidyverse package.
clean_table <- dirty_table %>%
rename_all(~str_replace(., 'column_prefix', ''))
Uses str_replace and rename_all to replace all matching patterns. Uses regex.
Tools to solve common problems, like performing the same task over and over. NOTE: This package conflicts with some dplyr commands, so if you're getting weird errors, this might be why.
precinct_sort <- ldply(county_files, read_tsv, na='', col_types = cols(
county_id = col_double(),
election_dt = col_date(format = '%m/%d/%Y'),
contest_id = col_double(),
contest_title = col_character(),
contest_vote_for = col_double(),
precinct_code = col_character(),
candidate_id = col_double(),
candidate_name = col_character(),
candidate_party_lbl = col_character(),
voting_method_lbl = col_character(),
voting_method_rslt_desc = col_character(),
vote_ct = col_double()
))
The first parameter is the list and the second is the function you want to repeat. Everything that follows are parameters specific to your function. You can use your own functions too.
A set of simple commands for writing to and reading from the clipboard.
age_group_populations %>%
mutate(lookup_helper = paste0(fips,age_group), .after = 'age_group') %>%
write_clip()
Copies dataframes in a tab-delimited format for easy pasting into spreadsheets.
Read and write data from Google Sheets. Part of the tidyverse.
test <- read_sheet('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>')
Authentication may be required depending on permissions. Accepts the URL of the sheet and writes to a dataframe.
write_sheet(census_data,
as_sheets_id('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>'),
sheet = "census_data"
)
Accepts a dataframe and writes to a sheet specified with the as_sheets_id function and a URL. If a sheet name isn't specified, it will create a new sheet with the dataframe name.
Excellent library with convenience functions for cleaning and exploring data.
zoo_animals %>%
count(animal_type) %>%
adorn_totals()
Use adorn_totals to tally up a number column (or multiple columns) in a dataframe. Particularly useful with count.
zoo_animals %>%
count(animal_type) %>%
adorn_totals() %>%
adorn_percentages('col') %>%
adorn_pct_formatting() %>%
adorn_ns(position = "front")
A few extra lines of code place both the raw number and percentage together.
Formulas and concepts I always have to look up. For more, read Numbers in the Newsroom by Sarah Cohen.
Also check out Ben Welsh's Observable collection of calculators based on Cohen's book.
A measure of how tightly clustered, or varied, data is around a set of values. Often described using the Greek letter sigma (σ).
For normally distributed data (often described as a "bell curve"), about two thirds of the observed values fall within one standard deviation of the average.
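You can check this in R with the built-in sd function. A minimal sketch using simulated data (the rnorm values here are hypothetical):
set.seed(2024)
#simulate 1,000 normally distributed observations
values <- rnorm(1000, mean = 66, sd = 4)
#the standard deviation of the sample, about 4
sd(values)
#share of observations within one standard deviation of the average
mean(abs(values - mean(values)) <= sd(values))
#returns roughly 0.68, or about two thirds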
Calculating the rate of something happening in a subgroup and comparing it to another can help suss out disproportionate impact, especially when the groups are different sizes.
A school has 300 white and 120 Black students. Last year, 30 white students were suspended, and 20 Black students were suspended.
It may be enough to say that Black students made up 40% of suspensions while only making up 29% of the school. But you may want to put a finer point on the disparity. First, calculate the suspension rate for Black students.
20 / 120 = 0.167
Next, do the same for white students.
30 / 300 = 0.1
Then you compare the two rates of suspension.
0.167 / 0.1 = 1.7
So Black students are suspended at about 1.7 times the rate of white students.
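The same arithmetic in R, using the numbers from the example above:
#suspension rate for Black students
black_rate <- 20 / 120
#suspension rate for white students
white_rate <- 30 / 300
#the ratio of the two rates, about 1.7
black_rate / white_rate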
Compare the likelihood of something happening for one group compared to another (h/t to Arianna Giorgi). Often used in medical contexts. For more, see this worksheet from the University of Delaware.
Instead of calculating and comparing the rate of something happening in a subgroup, calculating an odds ratio means you have to look at how much more likely something is to happen than not happen within that subgroup. So:
A class has 21 boys and 16 girls. On a recent test, 11 boys and 14 girls passed.
First, calculate the odds that boys passed the test vs. not passing it.
11 / 10 = 1.1
Do the same for the girls.
14 / 2 = 7
Now you can compare the ratios.
7 / 1.1 = 6.4
So the odds of girls passing the test are about 6.4 times the odds for boys.
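And here's that calculation in R, using the numbers from the example above:
#odds that a boy passed vs. did not pass
boys_odds <- 11 / 10
#odds that a girl passed vs. did not pass
girls_odds <- 14 / 2
#the odds ratio, about 6.4
girls_odds / boys_odds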
Tips and tricks for understanding mathematical symbols outside the scope of the normal add, subtract, etc.
As Freya Holmér points out, these two "scary math symbols," sigma (Σ) for sums and pi (Π) for products, are just for loops (Image courtesy of Freya Holmér).
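In R, for example, sigma (a running sum) and pi (a running product) translate directly into loops. A minimal sketch for the integers 1 through 5:
#sigma: the sum of i^2 for i from 1 to 5
total <- 0
for (i in 1:5) {
  total <- total + i^2
}
total
#returns 55, same as the built-in sum((1:5)^2)
#pi: the product of i for i from 1 to 5
product <- 1
for (i in 1:5) {
  product <- product * i
}
product
#returns 120, same as the built-in prod(1:5)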
In contexts like public health, rare events such as cancer or maternal mortality are assumed to follow a Poisson distribution, which we can use to calculate confidence intervals for comparing rates. For more, see this handout from the Utah Department of Health and Human Services:
This distribution is not symmetric about its mean and so the associated confidence intervals will not be symmetric (the upper limit is farther from the estimate than is the lower limit).
You can use a Poisson confidence interval calculator (examples here and here) to calculate the upper and lower bounds of the confidence intervals for rates like these.
#Pregnancy-related deaths (observed events)
1345
#95% confidence interval
0.95
#Calculated confidence interval
[1274.07160, 1418.84932]
#Live births
4636541
#Pregnancy-related death rate per 100,000 live births
1345 / 4636541 * 100000 = 29.0
#Lower bound
1274.07160 / 4636541 * 100000 = 27.5
#Upper bound
1418.84932 / 4636541 * 100000 = 30.6
If you don't want to use a calculator, you can also compute the upper and lower bounds in R with the chi-square critical value, using the formulas here.
#for a 95% (0.95) confidence interval
ci <- 0.95
#where x is the number of observed events
x <- 1345
lower_bound <- qchisq((1 - ci) / 2, 2 * x) / 2
upper_bound <- qchisq(1 - (1 - ci) / 2, 2 * (x + 1)) / 2
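Those formulas return the same bounds as the calculator above, which you can then convert to rates per 100,000 live births:
#live births from the worked example
births <- 4636541
#rate and confidence interval per 100,000 live births
x / births * 100000 #about 29.0
lower_bound / births * 100000 #about 27.5
upper_bound / births * 100000 #about 30.6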
One of the most common confidence levels is 95%, whose critical value is the 97.5th percentile point of the standard normal distribution.
Confidence level | Alpha (α) | α/2 | Z(α/2) | Z(1 - α/2) | Z score |
---|---|---|---|---|---|
80% | 0.20 | 0.100 | Z(0.10) | Z(0.90) | <-1.282, >1.282 |
90% | 0.10 | 0.050 | Z(0.05) | Z(0.95) | <-1.645, >1.645 |
94% | 0.06 | 0.030 | Z(0.03) | Z(0.97) | <-1.88, >1.88 |
95% | 0.05 | 0.025 | Z(0.025) | Z(0.975) | <-1.96, >1.96 |
96% | 0.04 | 0.020 | Z(0.02) | Z(0.98) | <-2.055, >2.055 |
98% | 0.02 | 0.010 | Z(0.01) | Z(0.99) | <-2.33, >2.33 |
99% | 0.01 | 0.005 | Z(0.005) | Z(0.995) | <-2.575, >2.575 |
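You can reproduce the Z scores in this table with the qnorm function built into R, which returns the critical value for a given percentile:
#two-tailed critical values for common confidence levels
qnorm(1 - 0.10 / 2) #90%: about 1.645
qnorm(1 - 0.05 / 2) #95%: about 1.96
qnorm(1 - 0.01 / 2) #99%: about 2.576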
The U.S. Census Bureau has good resources for testing for significance with their data, which uses a 90% confidence level.
General guidance for working with mapping files and geographic information systems.
(Image credit Illinois State University)
Latitude is the Y axis, with the equator at 0. Longitude is the X axis, with the prime meridian at 0.
In North America, located in the north-west quadrant, latitude values will be positive (N) and longitude values will be negative (W).
For subsetting coordinates or geometries that require a "bounding box," use these coordinates for North Carolina (h/t to Anthony Louis D'Agostino).
reference | value |
---|---|
xmin | -84.321869 |
ymin | 33.842316 |
xmax | -75.460621 |
ymax | 36.588117 |
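One way to apply these values, sketched out with the sf package (the shapefile name here is hypothetical, and the layer is assumed to use latitude and longitude):
library(sf)
#read in a hypothetical shapefile of Southeast region counties
counties <- st_read('southeast_counties.shp')
#build a bounding box from the values in the table above
nc_bbox <- st_bbox(c(xmin = -84.321869, ymin = 33.842316,
                     xmax = -75.460621, ymax = 36.588117),
                   crs = st_crs(counties))
#crop the layer to the N.C. bounding box
nc_counties <- st_crop(counties, nc_bbox)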
Plugins, URL parameters and other neat stuff.
Tips and tricks for handling PDFs in a Web browser (like Chrome)
https://assets.avigilon.com/file_library/pdf/acc7/avigilon-player7-en.pdf#page=21
Pass the page number as a URL parameter to link directly to a page.
Shortcuts and other cool things that help navigate various online video players.
https://youtu.be/UwVNkfCov1k?t=30
Add a t parameter to specify the jump point in seconds.
A few common things to check when things get mucked up.
Mapping or working with counties, cities, etc. in North Carolina.
If you're missing one of North Carolina's 100 counties, or your map is inexplicably blank, check New Hanover County, McDowell County and Winston-Salem first. Then think of other location names that might not be a literal, string-to-string match.
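One way to hunt down mismatches like those, sketched out with dplyr (the dataframes here are hypothetical):
library(dplyr)
#a hypothetical reference list of county names
map_counties <- tibble(county = c('NEW HANOVER', 'MCDOWELL', 'WAKE'))
#hypothetical data with inconsistently formatted county names
county_data <- tibble(county = c('New Hanover', 'McDowell County', 'Wake'),
                      cases = c(10, 20, 30))
#return rows in your data with no literal match in the reference list
county_data %>%
  mutate(county = toupper(county)) %>%
  anti_join(map_counties, by = 'county')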
Collections of commonly used lists and references in various data structures.
- State names and postal code abbreviations, comma- and line-separated and text-qualified by single quotes
- US counties and FIPS codes by state, tab-delimited
- NC counties, comma- and line-separated and text-qualified by single quotes
- NC counties and FIPS codes, tab-delimited
- NC counties and election/voter data codes, tab-delimited
- NC counties and court codes, as tuples
- NC counties and court codes, tab-delimited
- NC municipalities and counties, tab-delimited
- NC places and census designation, tab-delimited
- Not where you expect - A list of NC cities not in the counties you think