This reference will show R and Python code to read, view, manipulate, transform, and save out comma-separated value (CSV) files. Though actions are largely similar, details and caveats regarding more advanced tabular file formats such as Excel Workbooks will not be covered here.
- 0. Requirements
- 1. Source Data
- 2. Read and Write File
- 3. Examine Contents
- 4. Subsetting
- 5. Transforming
- 6. Merging
- 7. Special
Only R and Python will be used in this reference. To quickly deploy either (or both, if you're adventurous!) of these programming languages and their development environments, I would suggest leveraging Docker.
Instructions to install Docker for Windows 10/11 can be found here: https://github.com/atet/wsl
After Docker is installed on your computer, click to expand the steps below to easily download and create the development environments:
R and RStudio
In your WSL terminal, download the rocker/rstudio
1 Docker image (~700 MB) and start the container:
$ docker pull rocker/rstudio:4.4.2
$ docker run -dit -p 8787:8787 --name rstudio -e PASSWORD=rstudio rocker/rstudio:4.4.2
In a web browser, visit http://localhost:8787 and log into RStudio using the username rstudio
and password rstudio
.
Python and Jupyter Notebook
In your WSL terminal, download the jupyter/minimal-notebook
2 Docker image (~450 MB) and start the container:
$ docker pull jupyter/minimal-notebook:x86_64-ubuntu-22.04
$ docker run -dit -p 8888:8888 --name jupyter -e JUPYTER_TOKEN=jupyter jupyter/minimal-notebook:x86_64-ubuntu-22.04
In a web browser, visit http://localhost:8888 and log into Jupyter using the token jupyter
.
We will be working with a simple comma-separated (CSV) file that could be downloaded directly from this GitHub repository into your programming environment (here) or simply copy/pasted into a new file called office.csv
.
This file contains three rows (a.k.a. records) and three columns (a.k.a. fields) for a total of nine cells (not including column headers):
ID,FIRST NAME,QUOTE
2,Pam,"Just go after what you want and act fast, because life just isn't that long."
1,Pan,"There's a lot of beauty in ordinary things. Isn't that kind of the point?"
3,Jan,"There are always a million reasons not to do something."
Troubleshooting:
- Ensure that there are no spaces between delimiting commas as any "initial spaces" will cause issues with Microsoft Excel and Python Pandas reading the comma-separated value correctly.
R | Python |
---|---|
# Read office.csv from GitHub as R data.frame object
df = read.csv(
"https://raw.githubusercontent.com/atet/csv/refs/heads/main/dat/office.csv",
check.names = FALSE,
stringsAsFactors = FALSE
)
# View structure of the object
str(df)
# Save out object as office2.csv file to working directory
write.csv(df, "office2.csv", row.names = FALSE)
# View working directory where file was saved
getwd()
|
# Install Pandas package and import packages
import os
os.system("pip install pandas")
import pandas as pd
# Read office.csv from GitHub as Pandas DataFrame object
df = pd.read_csv(
"https://raw.githubusercontent.com/atet/csv/refs/heads/main/dat/office.csv"
)
# View structure of the object
df.dtypes
# Save out object as office2.csv file to working directory
df.to_csv("office2.csv", index=False)
# View working directory where file was saved
os.getcwd() |
R | Python |
---|---|
# See the first two rows of the data
head(df, n = 2)
# See the last two rows of the data
tail(df, n = 2)
# Prints truncated output up to max.print = 1,000 default
print(df)
# Increase max.print number to display more
options(max.print = 9999)
print(df)
|
# See the first two rows of the data
df.head(2)
# See the last two rows of the data
df.tail(2)
# Prints truncated output if data exceeds display.min_rows = 10 default
# and up to display.max_rows = 60 default
print(df)
# Increase display.min/max_rows numbers to display more
pd.set_option('display.max_rows', 9999)
pd.set_option('display.max_rows', 100)
print(df) |
- Indices
- By row(s)
- By column(s)
- By row(s) and Column(s)
- Matching (R infix)
R | Python |
---|---|
|
|
- Summarize/pivot
- Split-apply-combine
- NOTE: R vectorization only applies to R numeric matrices
R | Python |
---|---|
|
|
- Joins (left, inner, right, outer)
R | Python |
---|---|
|
|
- Dates
R | Python |
---|---|
|
|
Description | URL Link |
---|---|
null | null |
Issue | Solution |
---|---|
"It's not working!" | This concise tutorial has distilled hours of sweat, tears, and troubleshooting; it can't not work |
Citation | Description | Docker Hub Link |
---|---|---|
1 | R and RStudio Server Docker image from Rocker Project | https://hub.docker.com/r/rocker/rstudio |
2 | Python and Jupyter Labs Docker image from Jupyter Project | https://hub.docker.com/r/jupyter/minimal-notebook |
Copyright © 2025-∞ Athit Kao, Terms and Conditions