gadsdc/13-databases at master · ajschumacher/gadsdc

Read the Tidy Data paper on structuring data. Optionally also check out the corresponding slides and presentation video. [paper] [github] [slides] [video]
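
As a rough illustration of what the paper means by restructuring, here is a minimal sketch that reshapes a "wide" table (one column per treatment) into a "tidy" one (one row per observation, one column per variable). The data, the column names, and the `melt` helper are all made up for this example; in practice you would more likely use a library function for this.

```python
# Hypothetical "wide" table: one row per person, one column per treatment.
wide = [
    {"name": "Ann", "treatment_a": 5, "treatment_b": 2},
    {"name": "Bob", "treatment_a": 7, "treatment_b": 4},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    tidy_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_key:
                tidy_rows.append(
                    {id_key: row[id_key], var_name: key, value_name: value}
                )
    return tidy_rows


# Each output row is now a single observation: who, which treatment, what result.
tidy = melt(wide, "name", "treatment", "result")
for row in tidy:
    print(row)
```

The wide form has two rows and the tidy form has four, because each person/treatment pair becomes its own row.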

Optional:

Go through Zed Shaw's Learn SQL the Hard Way. It is still a work in progress, but it will take you through more SQL with SQLite than we'll do in class.

Consider thinking of multinomial Naive Bayes likelihood probabilities as coefficients on word dummy features. How are they similar to, or different from, logistic regression coefficients?
How can binary classifiers be used for multiclass problems? That is, if a technique only gives a probability of "yes" vs. "no" (for some question), how can you use the technique for questions with more than two possible answers?
How do K Nearest Neighbors, Naive Bayes, and linear models compare in terms of model interpretability? How/when could this inform model choices?
What are the negatives of "tidy data"? When would it not be a good idea to have data in a "tidy" format?
What other thoughts, comments, concerns, and questions do you have? What's on your mind?

Application presentation.

Question review.

Slides on databases.

SQL lab, with data pre-populated.

SQL lab on using SQLite with your own data.
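
For orientation, getting your own data into SQLite from Python might look roughly like the sketch below, which uses only the standard library's `sqlite3` and `csv` modules. The table name, column names, and values are invented for the example, and the in-memory CSV stands in for whatever file you actually have.

```python
import csv
import io
import sqlite3

# Made-up CSV contents standing in for "your own data".
raw = io.StringIO(
    "city,year,population\n"
    "Alphaville,2010,100\n"
    "Alphaville,2012,120\n"
    "Betatown,2012,50\n"
)

# ":memory:" keeps the database in RAM; pass a file path to persist it instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population (city TEXT, year INTEGER, population INTEGER)"
)

# Convert the numeric fields explicitly, since csv reads everything as text.
records = [
    (r["city"], int(r["year"]), int(r["population"]))
    for r in csv.DictReader(raw)
]
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", records)
conn.commit()

# Largest recorded population per city.
rows = conn.execute(
    "SELECT city, MAX(population) FROM population GROUP BY city ORDER BY city"
).fetchall()
print(rows)
conn.close()
```

The `?` placeholders let `sqlite3` handle quoting, which is safer than building the SQL string by hand.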

On the structured side of the spectrum, this table summarizes much of the map between data structures, formats, and software:

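| Structure | Format    | Software            | Servers                 |
|-----------|-----------|---------------------|-------------------------|
| Tabular   | CSV etc.  | most; SQLite        | MySQL, PostgreSQL, etc. |
| Nested    | JSON, XML | rjson, lxml, etc.   | web etc.                |
| Graph     | various   | networkx, Gephi, etc. | Neo4j, etc.           |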
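
To make the three structures concrete, here is a small sketch of what each can look like once loaded into Python, using only the standard library. All names and values are made up, and the plain dictionary of neighbor sets is just a stand-in for what a graph library such as networkx would manage for you.

```python
import csv
import io
import json

# Tabular: every record has the same flat set of fields.
table_text = "name,city\nAnn,Oslo\nBob,Bergen\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Nested: a record can contain lists and further records.
doc_text = '{"name": "Ann", "pets": [{"kind": "cat", "name": "Tom"}]}'
doc = json.loads(doc_text)

# Graph: the interesting information is in the connections between things.
edges = [("Ann", "Bob"), ("Bob", "Cy")]
graph = {}
for a, b in edges:
    # Undirected graph: record the link in both directions.
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

print(rows[1]["city"])         # a cell, addressed by row and column
print(doc["pets"][0]["name"])  # a value, addressed by a path into the nesting
print(sorted(graph["Bob"]))    # a node's neighbors
```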

Optional:

For more introductory SQL, check out the Mode Analytics "SQL School".
Install PostgreSQL on your Ubuntu machine and play with it.
Look into two ways of using SQL from R.
Check out RPostgreSQL for combining R with PostgreSQL.
Check out the Python module dataset, which tries to combine the ease of JSON with relational databases.
Read this chapter from the Bad Data Handbook: "When Databases Attack: A Guide for When to Stick to Files"
Explore Kristof Kovacs' Comparison of NoSQL Databases.