Skip to content

atet/csv

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 

Repository files navigation

.img/logo_csv.png

This reference will show R and Python code to read, view, manipulate, transform, and save out comma-separated value (CSV) files. Though actions are largely similar, details and caveats regarding more advanced tabular file formats such as Excel Workbooks will not be covered here.


Table of Contents

Supplemental


0. Requirements

Only R and Python will be used in this reference. To quickly deploy either (or both, if you're adventurous!) of these programming languages and their development environments, I would suggest leveraging Docker.

Instructions to install Docker for Windows 10/11 can be found here: https://github.com/atet/wsl

After Docker is installed on your computer, click to expand the steps below to easily download and create the development environments:

R and RStudio

In your WSL terminal, download the rocker/rstudio1 Docker image (~700 MB) and start the container:

$ docker pull rocker/rstudio:4.4.2
$ docker run -dit -p 8787:8787 --name rstudio -e PASSWORD=rstudio rocker/rstudio:4.4.2

In a web browser, visit http://localhost:8787 and log into RStudio using the username rstudio and password rstudio.

Python and Jupyter Notebook

In your WSL terminal, download the jupyter/minimal-notebook2 Docker image (~450 MB) and start the container:

$ docker pull jupyter/minimal-notebook:x86_64-ubuntu-22.04
$ docker run -dit -p 8888:8888 --name jupyter -e JUPYTER_TOKEN=jupyter jupyter/minimal-notebook:x86_64-ubuntu-22.04

In a web browser, visit http://localhost:8888 and log into Jupyter using the token jupyter.


Back to Top


1. Source Data

We will be working with a simple comma-separated (CSV) file that could be downloaded directly from this GitHub repository into your programming environment (here) or simply copy/pasted into a new file called office.csv.

This file contains three rows (a.k.a. records) and three columns (a.k.a. fields) for a total of nine cells (not including column headers):

ID,FIRST NAME,QUOTE
2,Pam,"Just go after what you want and act fast, because life just isn't that long."
1,Pan,"There's a lot of beauty in ordinary things. Isn't that kind of the point?"
3,Jan,"There are always a million reasons not to do something."

Troubleshooting:

  • Ensure that there are no spaces between delimiting commas as any "initial spaces" will cause issues with Microsoft Excel and Python Pandas reading the comma-separated value correctly.

Back to Top


2. Read and Write File

R Python
# Read office.csv from GitHub as R data.frame object
df = read.csv(
    "https://raw.githubusercontent.com/atet/csv/refs/heads/main/dat/office.csv",
    check.names = FALSE,
    stringsAsFactors = FALSE
)

# View structure of the object
str(df)

# Save out object as office2.csv file to working directory
write.csv(df, "office2.csv", row.names = FALSE)

# View working directory where file was saved
getwd()


# Install Pandas package and import packages
import os
os.system("pip install pandas")
import pandas as pd

# Read office.csv from GitHub as Pandas DataFrame object
df = pd.read_csv(
    "https://raw.githubusercontent.com/atet/csv/refs/heads/main/dat/office.csv"
)

# View structure of the object
df.dtypes

# Save out object as office2.csv file to working directory
df.to_csv("office2.csv", index=False)

# View working directory where file was saved
os.getcwd()

Back to Top


3. Examine Contents

R Python
# See the first two rows of the data
head(df, n = 2)

# See the last two rows of the data
tail(df, n = 2)

# Prints truncated output up to max.print = 1,000 default
print(df)

# Increase max.print number to display more
options(max.print = 9999)
print(df)

# See the first two rows of the data
df.head(2)

# See the last two rows of the data
df.tail(2)

# Prints truncated output if data exceeds display.min_rows = 10 default
#  and up to display.max_rows = 60 default
print(df)

# Increase display.min/max_rows numbers to display more
pd.set_option('display.max_rows', 9999)
pd.set_option('display.max_rows', 100)
print(df)

Back to Top


4. Subsetting

  • Indices
    • By row(s)
    • By column(s)
    • By row(s) and Column(s)
  • Matching (R infix)
R Python

Back to Top


5. Transforming

  • Summarize/pivot
  • Split-apply-combine
  • NOTE: R vectorization only applies to R numeric matrices
R Python

Back to Top


6. Merging

  • Joins (left, inner, right, outer)
R Python

Back to Top


7. Special

  • Dates
R Python

Back to Top


Other Resources

Description URL Link
null null

Back to Top


Troubleshooting

Issue Solution
"It's not working!" This concise tutorial has distilled hours of sweat, tears, and troubleshooting; it can't not work

Back to Top


References

Citation Description Docker Hub Link
1 R and RStudio Server Docker image from Rocker Project https://hub.docker.com/r/rocker/rstudio
2 Python and Jupyter Labs Docker image from Jupyter Project https://hub.docker.com/r/jupyter/minimal-notebook

Back to Top


Copyright © 2025-∞ Athit Kao, Terms and Conditions

About

Handling CSV Files Through Code

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published