Skip to content
/ how-to Public

A collection of cheat sheets for remembering common commands and tips for data journalism work.

Notifications You must be signed in to change notification settings

mtdukes/how-to

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

84 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ooooo   ooooo   .oooooo.   oooooo   oooooo     oooo         ooooooooooooo   .oooooo.
`888'   `888'  d8P'  `Y8b   `888.    `888.     .8'          8'   888   `8  d8P'  `Y8b
 888     888  888      888   `888.   .8888.   .8'                888      888      888
 888ooooo888  888      888    `888  .8'`888. .8'                 888      888      888
 888     888  888      888     `888.8'  `888.8'     8888888      888      888      888
 888     888  `88b    d88'      `888'    `888'                   888      `88b    d88'
o888o   o888o  `Y8bood8P'        `8'      `8'                   o888o      `Y8bood8P'

A cheat sheet for common data journalism stuff. For details on installing these tools, see how I work. Use CMD + F to search the page, or the jump menu below if you know what you're looking for.

Jump to:

Command line tools grep | head/tail | ffmpeg | pdftk | esridump | wget | file | sed | wc | imagemagick | libpst | gunzip | sysctl

R packages shortcut keys | base | readr | scales | ggpmisc | dplyr | stringr | plyr | clipr | googlesheets4 | janitor For more on tips and tricks for working with common data formatting issues in R, check out the code snippet collection here

Math for journalists Standard deviation | Rate comparisons | Odds ratios | Poisson distributions | Making sense of symbols

GIS tips Latitude and longitude | NC bounding box

Browser tricks PDFs | Video

Troubleshooting Location errors

Convenience files U.S. states | N.C. places

Command line tools

A collection of tips and tricks for working with tools executed using bash terminals.

grep

Search text files for specific character sequences.

Basic search

grep "DUKES" absentee.csv

Return lines containing a string from a specified file and print to the command line.

Print to a file

grep "\"DUKES\",\"MICHAEL\",\"TYLER\"" absentee.csv > dukes.csv

Search for a string with quotes and output all lines to a file.

Recursively search a directory of files, first line only

head -1 ./*/*|grep -B1 'Hospital overall rating' > variable.txt

Combining head and grep with a pipe allows you to chain commands, and the -B1 flag allows you to output the file name.

▲ BACK TO NAV

head/tail

Get a preview of a file.

See the top

head absentee.csv

Print the first 10 lines of a file to the command line.

Specify the number of lines

head -1 absentee.csv > absentee_header.csv

Get the first line of a file and save it to a CSV file.

▲ BACK TO NAV

ffmpeg

A power tool for processing and converting video and audio files.

Make a GIF

ffmpeg -ss 5 -t 1.7 -i video.MOV -vf "fps=24,scale=640:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" -loop 0 video.gif

Create a high quality GIF 640 pixels wide at 24 frames per second using the specified video file, skipping 5 seconds and lasting 1.7 seconds and save the output.

Caption the GIF

ffmpeg -ss 278.8 -t 3.3 -i wings.mp4 -filter_complex "fps=24,scale=640:-1:flags=lanczos,drawtext=box=1:[email protected]:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='CONTRARY TO MY APPEARANCE,':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th,drawtext=box=1:[email protected]:boxborderw=5:fontfile=/System/Library/Fonts/Supplemental/Impact.ttf:text='I AM ENJOYING THIS.':fontsize=48:fontcolor=white:x=(w-tw)/2:y=(h/PHI)+th+50,split[x1][x2];[x1]palettegen[p];[x2][p]paletteuse" wings.gif

Tweak the parameters (or delete the second line) to adjust the font, text etc. of the caption.

MOV to mp4

ffmpeg -i apple.mov -vcodec h264 -acodec aac apple.mp4

Use a codec flag to convert a video file from QuickTime to the more universal mp4 format.

AVI to mp4

ffmpeg -i full_video.avi full_video.mp4

Convert an AVI file to the more universal mp4 format.

Combine video clips

ffmpeg -f concat -safe 0 -i vidlist.txt -c copy full_video.avi

Combine all of the files recorded in a text file called vidlist.txt, which looks like this:

file '/Users/username/directory/vid_seq001.avi'
file '/Users/username/directory/vid_seq002.avi'

Create a video from a sequence of images

ffmpeg -r 1/5 -i img%03d.jpg -c:v libx264 -vf "fps=25,format=yuv420p" out.mp4

Read in a sequence of images from a folder and write to an mp4 file. The -r flag is the framerate, where the duration of each image is the inverse of the provided value (e.g. 1/5 is 5 seconds, 60 is 1/60 of a second). The -i flag specifies the filename structure, with 0 padding specified (e.g. img%03d.jpg will iterate through img001.jpg, img002.jpg, img003.jpg etc.). More details here.

▲ BACK TO NAV

pdftk

A power tool for processing and converting PDF files.

Combine files

pdftk *.pdf cat output all_documents.pdf
pdftk doc01.pdf doc02.pdf cat output all_documents.pdf

Combine all the PDF files in the present directory into a single file. Or specify individual files Note: check to make sure the capitalization of the filetype matches.

Split files by page number

pdftk blue_docs.pdf cat 1-700 output blue_docs01.pdf
pdftk blue_docs.pdf cat 701-end output blue_docs02.pdf

Specify the page number or use the end keyword to slice up a document.

Split a PDF portfolio

pdftk doj_emails_portfolio.pdf unpack_files output doj_emails

PDF portfolios contain a bunch of individual files bound up in a filetype that needs a native PDF reader. Get around this by unpacking each file into a specific directory.

Split a PDF portfolio with attachments

pdftk doj_emails_portfolio.pdf unpack_files output doj_emails;
IFS=$'\n'; set -f
for f in $(find ./doj_emails/ -name '*.pdf'); do pdftk "$f" unpack_files output ./doj_emails/; done
unset IFS; set +f

If your PDF portfolio has attachements within the individual PDF, you can use your terminal to unpack the portfolio into a directory, then set up a loop to unpack all of the PDFs in that directory on by one (Thanks to this Stackoverflow thread for tips on bash recursion and dealing with spaces in filenames.)

▲ BACK TO NAV

esridump

A targeted utility to pull down geographic data from ESRI maps.

Download a geojson

esri2geojson https://services.arcgis.com/iFBq2AW9XO0jYYF7/arcgis/rest/services/Covid19byZIPnew/FeatureServer/0 nc_zipDATE.geojson

Download data from the ESRI REST endpoint that powers the N.C. DHHS COVID map of cases by zip code and save it as a geojson file.

▲ BACK TO NAV

wget

A power tool for recursively downloading files, for example from the Web.

Download a single file

wget http://mtdukes.com/work.html

Saves the file from the specified URL.

Download a list of files

wget -i file_list.txt

Saves individual files from URLs specified in a TXT file, one URL on each line.

Download a directory

wget --recursive --no-parent http://mtdukes.com

Recursively download an entire site's contents.

▲ BACK TO NAV

file

Tool to do various file formatting things I think.

Detect encoding of a file

file -I input.csv

Detect encoding of a file.

▲ BACK TO NAV

sed

Tool to make substitutions in a text file (submitted by Chris Alcantara).

Replace all instances of a word and output result to new file

sed "s/dook/duke/g" ./old.csv > new.csv

Syntax uses / as a delimiter to separate patterns you want to substitute.

Replace all instance of a word within the file

sed -i "" "s/dook/duke/g" ./data.csv

Substitute directly in the file by passing an empty string after the -i flag.

Pass a file name to make a backup.

sed -i "./data-backup.csv" "s/dook/duke/g" ./data.csv

▲ BACK TO NAV

wc

A basic character counting utility for the command line.

Count the number of lines in a file

wc -l < data_file.txt

The < flag excludes the file name from the results.

Pipe the results of some data and count the lines

curl mtdukes.com --silent | wc -l

imagemagick

A power tool for quickly editing images.

Batch crop a folder of images

mogrify -crop 800x450+0+40 -path ./cropped *.jpg

In a folder of images, crop every jpg image at size 800x450, with a 0px offset from the left (x) and a 40px offset from the top (y).

▲ BACK TO NAV

libpst

A collection of tools to work with and convert PST files (Outlook folders containing email). On Macs, install with homebrew.

Convert PST to MBOX

readpst public_records.pst

Outputs a single file in mbox format, which is a more open format you can import into a number of email clients. That includes Mozilla's Thunderbird, which you can import using the ImportExportNG add-on.

Convert PST to individual email files

readpst -e -D public_records.pst

Separates the PST into individual eml files. Attachments are embedded in the file. The -D flag preserves deleted items. Can be read by services like Google's Pinpoint.

Convert PST to rich text files and export attachments

readpst -S -D public_records.pst

Separates the PST into individual eml files, each emails rich text body and individual attachment. The numbered files are in eml format, with no extension. The -D flag preserves deleted items.

Convert PST to eml and msg files

readpst -m -D public_records.pst

Produces both msg and eml files for each message. The -D flag preserves deleted items.

▲ BACK TO NAV

gunzip

Quickly and efficiently unzip files (or a folder full of files). Can also unzip some files where a normal unzipping GUI throws errors. Useful for .gz file extensions.

Unzip a single file

gunzip map_file.gz

Unzips a specific file and replaces it with the unzipped version.

Keep the original zipped file

gunzip -c map_file.gz > map_file.shp

Uses the stdout flag to read to the console, but pipe to a new file to keep the original.

Unzip a folder full of files

gunzip -r /map_files

Uses the recursive flag to iterate through every zipped file in a folder and replace it with the unzipped version.

sysctl

A suite of tools to anaylze your system (for Mac).

Examine CPU threads

sysctl hw.physicalcpu hw.logicalcpu

Provides an output of physical and logical cores your CPU has.

▲ BACK TO NAV

R packages

shortcut keys

A few common shortcuts save you from typing in RStudio.

Start a new section

Command + Shift + R

Prompt for a new label used in the document outline for an R script

Execute a command

Command + Enter

Run a section of code in your R script.

Use a pipe

Command + Shift + M

Input a %>% at your cursor to pipe output to the next line with the dplyr library.

Use an assignment

Option + -

Input a <- at your cursor to assign output to a variable.

▲ BACK TO NAV

base

The stripped down version of R has lots of built-in stuff worth using.

Clear all environment variables

rm(list = ls())

Start with a clean slate using the rm command.

Set your working directory

setwd('~/projects/newsobserver/big_investigation/data')

Save time otherwise spent typing out long path names.

Get unique values

property_sales %>% 
  unique()

Generates a dataframe of unique rows across all fields.

Get duplicate values

wake_sales %>% 
  .[duplicated(.), ]

Generates a dataframe of duplicated rows, comparing all fields.

Get a list of files in a folder

county_data <- list.files(path = './data/counties', full.names = TRUE)

The full.names flag prepends the directory path if TRUE, and file name only if false.

Remove a package

detach("package:plyr", unload=TRUE)

Removing a package can help when you have conflicts between functions with the same name.

Format a date

as.Date('01/01/2001', format = '%m/%d/%Y' )

Specify the format explicitly using the syntax from strptime.

Find and replace characters in a string

gsub(',', '', 'womp,womp')
gsub('\\(', ',for real', 'Replace the literal parenthesis (' )

Enter a pattern, replacement and data value to search.

Get rid of non-ASCII characters

gsub('[^ -~]', '', '日本人GALATIA')

This pattern, translates to "not any ASCII character". Useful when cleaning a malformed file. Can also use the [^ -~] in other contexts.

Load a file with an annoying encoding

vax_data <- read.delim('cnty20210731.csv', fileEncoding = 'UTF-16LE', sep = '\t',
                              col.names = c('index', 'county', 'week_of', 'age12_17',
                              'age18_24', 'age25_49', 'age50_64')
                              )

The readr package is my preferred method for importing data, sometimes it chokes on strange file encodings even with the locale parameter set. So this method is worth trying.

Turn off scientific notation

options(scipen = 999)

Prints out the full numeral in your current workspace.

Test for significance when preparing proportions

#Black maternal mortality in 2020 and 2021 compared to the number of live births
prop.test(x = c(323, 506), n = c(397827, 517889), correct = FALSE)

Use a 2-sample test for equality of proportions to generate a p-value, which tells you how likely it is that the observed difference is due to random chance. Typically, p-values of less than 0.05 are considered statistically signficant (95% confidence interval).

The prop.test function accepts a vector of values for the numerator (x) and the denominator (n).

▲ BACK TO NAV

readr

A "fast and friendly way" to read in data. Part of the tidyverse suite of packages.

Specify a default column type

my_data <- read_csv('../my_data.csv', col_types = cols(.default = 'c', date = 'D'))

Tell read_csv to import all columns as a character by default, except for the date field, which should be a date (you can remove the date part if you want to read everything in as a character).

Get the file encoding

guess_encoding(file = 'annoying_file.csv')

Useful for errors reading in the file (like embedded nulls).

Pull data from a GitHub gist

county_fips <- read_tsv( url(
    'https://gist.githubusercontent.com/mtdukes/e0c6563927fb4f3e48f4e092b84b7023/raw/56e5abc1daf2277b5d901cbc25b9f9e64ab8c073/nc_fips_tab.tsv'
    ))

Use the url function from base R to pull formatted data posted in public gists, like these convenience files. Swap out the delimiter function (read_csv, read_delim, etc) as needed. Get the URL by clicking "Raw" on the Gist page.

Specify a file encoding when loading data

my_dataframe <- read_tsv('annoying_file.csv', locale = locale(encoding = "UTF-16LE") )

Specify non-UTF encodings you get from the guess_encoding output.

Write and timestamp a csv

my_spreadsheet %>% 
  write_csv(paste0('my_data/my_spreadsheet',
                   format(Sys.time(),'%Y%m%d%H%M'),
                   '.csv'))

Using Sys.time(), you can quickly save a dataset with an automatic timestamp for easy organization. No more my_spreadsheet_final_Final_FINAL_FINALFORREAL.csv!

▲ BACK TO NAV

scales

A library to make scaling and labeling easier.

Show figures as dollars

vax_income %>%
  ggplot(aes(x = median_income, y=PctTotal)) +
  geom_point() +
  scale_x_continuous(labels = scales::dollar_format())

The dollar_format function shortcuts the annoying parsing issues.

▲ BACK TO NAV

ggpmisc

Miscellaneous extensions to the ggplot package.

Include a regression equation on your scatterplot

vax_income %>%
  ggplot(aes(x = median_income, y = pct_total)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, show.legend = FALSE) +
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
               label.x.npc = "right", label.y.npc = 0.15,
               formula = y ~ x, parse = TRUE, size = 3)

The stat_poly_eq function lets you annotate the graph with a regression formula. BONUS: What's a good value for R-squared?

▲ BACK TO NAV

dplyr

A "grammar of data manipulation" and part of the tidyverse package.

Get a random sample of rows

nc_voters %>% 
  sample_n(10)

Specify the number of rows from the dataframe to return.

Convert table to uppercase

clean_table <- dirty_table %>% 
  mutate(across(where(is.character), toupper))

Tranforms all columns containing characters to uppercase all at once. Incredibly useful for cleaning data!

Fix multibyte strings and bad character encodings

df_clean <- df %>% mutate(across(everything(), ~ iconv(.x, sub = '') ))

Works across an entire dataframe, removing all malformed characters, multibyte strings or bad, non-UTF8 encodings that can't be converted.

▲ BACK TO NAV

stringr

A suite of convenient functions for working with strings, and part of the tidyverse package.

Remove prefix characters from column names

clean_table <- dirty_table %>%
  rename_all(~str_replace(., 'column_prefix', ''))

Uses str_replace and rename_all to replace all matching patterns. Uses regex.

▲ BACK TO NAV

plyr

Tools to solve common problems, like performing the same task over and over. NOTE: This package conflicts with some dplyr commands, so if you're getting weird errors, this might be why.

Repeat a function using a list as input

precinct_sort <- ldply(county_files, read_tsv, na='', col_types = cols(
  county_id = col_double(),
  election_dt = col_date(format = '%m/%d/%Y'),
  contest_id = col_double(),
  contest_title = col_character(),
  contest_vote_for = col_double(),
  precinct_code = col_character(),
  candidate_id = col_double(),
  candidate_name = col_character(),
  candidate_party_lbl = col_character(),
  voting_method_lbl = col_character(),
  voting_method_rslt_desc = col_character(),
  vote_ct = col_double()
))

The first parameter is the list and the second is the function you want to repeat. Everything that follows are parameters specific to your function. You can use your own functions too.

▲ BACK TO NAV

clipr

A set of simple commands for writing to and reading from the clipboard.

Write to the clipboard

age_group_populations %>%
  mutate(lookup_helper = paste0(fips,age_group), .after = 'age_group') %>% 
  write_clip()

Copies dataframes in a tab-delimited format for easy pasting into spreadsheets.

▲ BACK TO NAV

googlesheets4

Read and write data from Google Sheets. Part of the tidyverse.

Read in data from a Google Sheet

test <- read_sheet('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>')

Authentication may be required depending on permissions. Accepts the URL of the sheet and writes to a dataframe.

Overwrite data in an existing sheet

write_sheet(census_data,
            as_sheets_id('https://docs.google.com/spreadsheets/d/<<SPREADSHEET_ID>>/edit#gid=<<SHEET_ID>>'),
            sheet = "census_data"
            )

Accepts a dataframe and writes to a sheet specified with the as_sheets_id function and a URL. If a sheet name isn't specified, it will create a new sheet with the dataframe name.

▲ BACK TO NAV

janitor

Excellent library with convenience functions for cleaning and exploring data.

Get the totals of a dataframe

zoo_animals %>%
  count(animal_type) %>%
  adorn_totals()

Use adorn_totals to tally up a number column (or multiple columns) in a dataframe. Particularly useful with count.

Calculate percentages across categories

zoo_animals %>%
  count(animal_type) %>%
  adorn_totals() %>% 
  adorn_percentages('col') %>% 
  adorn_pct_formatting() %>% 
  adorn_ns(position = "front")

A few extra lines of code place both the raw number and percentage together.

▲ BACK TO NAV

Math for journalism

Formulas and concepts I always have to look up. For more, read Numbers in the Newsroom by Sarah Cohen.

Also check out Ben Welsh's Observable collection of calculators based on Cohen's book.

Standard deviation

A measure of how tightly clustered, or varied, data is around a set of values. Often described using the Greek letter sigma (σ).

Normal distribution

Distribution of values based on standard deviation.

(Image by M. W. Toews)

For normally distributed data (often described as a "bell curve"), about two thirds of the observed values fall within one standard deviation of the average.

▲ BACK TO NAV

Rate comparisons

Calculating the rate of something happening in a subgroup and comparing it to another can help suss out disproportionate impact, especially when the groups are different sizes.

Comparing two groups

A school has 300 white and 120 Black students. Last year, 30 white students were suspended, and 20 Black students were suspended.

It may be enough to say that Black students made up 60% of suspensions while only making up 29% of the school. But you may want to put a finer point on the disparity. First, calculate the suspension rate for Black students

20 / 120 = 0.17

Next, do the same for white students.

30 / 300 = 0.1

Then you compare the two rates of suspension.

0.167 / 0.1 = 1.7

So black students are suspended at about 1.7 times the rate of white students.

▲ BACK TO NAV

Odds ratios

Compare the likelihood of something happening for one group compared to another (h/t to Arianna Giorgi). Often used in medical contexts. For more, see this worksheet from the University of Delaware.

Comparing two groups

Instead of calculating and comparing the rate of something happening in a subgroup, calculating an odds ratio means you have to look at how much more likely something is to happen than not happen within that subgroup. So:

A class has 21 boys and 16 girls. On a recent test, 11 boys and 14 girls passed.

First calculate the likelihood that boys passed the test vs. not passing the tests

11 / 10 = 1.1

Do the same for the girls.

14 / 2 = 7

Now you can compare the ratios.

7 / 1.1 = 6.4

So girls are 6.4 times as likely to pass the test than boys.

▲ BACK TO NAV

Making sense of symbols

Tips and tricks for understanding mathematical symbols outside the scope of the normal add, subtract, etc.

Summation/Product

Not scary! As Freya Holmér points out, these two "scary math symbols" are just for loops (Image courtesy of Freya Holmér).

▲ BACK TO NAV

Poisson distributions

In contexts like public health, rare events such as cancer or maternal mortality assume a Poisson distribution, which we can use to calculate confidence intervals used to compare rates. For more, see this handout from the Utah Department of Health and Human Services:

This distribution is not symmetric about its mean and so the associated confidence intervals will not be symmetric (the upper limit is farther from the estimate than is the lower limit).

You can use a Poisson confidence interval calculator (examples here and here) to calculate the upper and lower bounds of the confidence intervals for rates like these.

#Pregnancy-related deaths (observed events)
1345

#95% confidence Interval
0.95

#Calcuated confidence interval
[1274.07160, 1418.84932]

#Live births
4636541

#Pregnancy-related death rate per 100,000 live births
1345 / 4636541 * 100000 = 29.0

#Lower bound
1274.07160 / 4636541 * 100000 = 27.5

#Upper bound
1418.84932 / 4636541 * 100000 = 30.6

If you don't want to use a calculator, you can also calculate the upper and lower bounds using the chi square critical value in R using the formulas here.

#for 95% (0.95) confidence interval
#where x is the number of observed events
lower_bound <- qchisq((1-ci)/2, 2*x)/2
upper_bound < qchisq(1-(1-ci)/2, 2*(x+1))/2

More significance testing

Confidence interval, illustrated

One of the most common confidence levels is 95% (also called the 97.5th percentile point).

Confidence level Alpha (α) α/2 Zα/2 Z1 - α/2 Z score (Zα/2)
80% 0.20 0.100 Z0.1 Z0.90 <-1.285, >1.285
90% 0.10 0.050 Z0.05 Z0.95 <-1.645, >1.645
94% 0.06 0.030 Z0.03 Z0.96 <-1.88, >1.88
95% 0.05 0.025 Z0.025 Z0.975 <-1.96, >1.96
96% 0.04 0.020 Z0.02 Z0.98 <-2.055, >2.055
98% 0.02 0.010 Z0.01 Z0.99 <-2.33, >2.33
99% 0.01 0.005 Z0.005 Z0.995 <-2.575, >2.575

The U.S. Census Bureau has good resources for testing for signficance with their data, which used a 90% confidence level.

▲ BACK TO NAV

GIS tips

General guidance for working with mapping files and geographic information systems.

Latitude and Longitude

Latitude and longitude

(Image credit Illinois State University)

Latitude is the Y axis, with the equator at 0. Longitude is the X axis, with the prime meridian at 0.

North America, located in the north-west quadrant, latitude values will be positive or N. Longitude values will be negative or W.

NC bounding box

For subsetting coordinates or geometries that requires a "bounding box," use these coordinates for North Carolina (h/t to Anthony Louis D'Agostino).

reference value
xmin -84.321869
ymin 33.842316
xmax -75.460621
ymax 36.588117

▲ BACK TO NAV

Browser tricks

Plugins, URL parameters and other neat stuff.

PDFs

Tips and tricks for handling PDFs in a Web browser (like Chrome)

Jump to a page

https://assets.avigilon.com/file_library/pdf/acc7/avigilon-player7-en.pdf#page=21

Pass the page number as a URL parameter to link directly to a page.

▲ BACK TO NAV

Video

Shortcuts and other cool things that help navigate various online video players.

Jump to a time in YouTube

https://youtu.be/UwVNkfCov1k?t=30

Add a t parameter to specify the jump point in seconds.

▲ BACK TO NAV

Troubleshooting

A few common things to check when things get mucked up.

Location errors

Mapping or working with counties, cities, etc. in North Carolina.

You three again.

Check New Hanover County, McDowell County and Winston-Salem

If you're missing one of North Carolina's 100 counties, or your map is inexplicably blank, check New Hanover County, McDowell County and Winston-Salem first. Then think of other location names that might not be a literal, string-to-string match.

▲ BACK TO NAV

Convenience files

Collections of commonly used lists and references in various data structures.

U.S. states

▲ BACK TO NAV

N.C. places

▲ BACK TO NAV

About

A collection of cheat sheets for remembering common commands and tips for data journalism work.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published