- Read the Tidy Data paper on structuring data. Optionally also check out the corresponding slides and presentation video. [paper] [github] [slides] [video]
Optional:
- Go through Zed Shaw's work-in-progress Learn SQL the Hard Way, which, though unfinished, will take you through more SQL with SQLite than we'll do in class.
- Think of multinomial Naive Bayes likelihood probabilities as coefficients on word dummy features. How are they similar to or different from logistic regression coefficients?
- How can binary classifiers be used for multiclass problems? That is, if a technique only gives a probability of "yes" vs. "no" (for some question), how can you use it for questions with more than two possible answers?
- How do K Nearest Neighbors, Naive Bayes, and linear models compare in terms of model interpretability? How/when could this inform model choices?
- What are the negatives of "tidy data"? When would it not be a good idea to have data in a "tidy" format?
- What other thoughts, comments, concerns, and questions do you have? What's on your mind?
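
As a concrete reference point for the "tidy data" question above, here is a minimal sketch of reshaping a "wide" table into tidy form in plain Python. The column names, the values, and the `to_tidy` helper are all hypothetical and made up for illustration; they are not taken from the paper.

```python
# Hypothetical "wide" table: one row per person, one column per treatment.
wide_rows = [
    {"name": "Alice", "treatment_a": 12, "treatment_b": 7},
    {"name": "Bob", "treatment_a": 9, "treatment_b": 11},
]


def to_tidy(rows, id_column, variable_name="treatment", value_name="result"):
    """Melt wide rows into one row per (id, variable) observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every tidy row, not melted
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, id_column="name")
for tidy_row in tidy_rows:
    print(tidy_row)
```

Note that the tidy form repeats the `name` value on every row and is harder to scan by eye, which is one angle on the question of when tidy is not the best choice.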
Application presentation.
Question review.
Slides on databases.
SQL lab with pre-populated data.
SQL lab on using SQLite with your own data.
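
For the second lab, loading your own data into SQLite might look roughly like the sketch below, which uses Python's standard-library `sqlite3` and `csv` modules. The table name, columns, and CSV contents are hypothetical stand-ins for whatever data you bring, and an in-memory database is used so nothing is written to disk.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for your own CSV file.
csv_text = """city,year,population
Springfield,2010,30000
Springfield,2020,32000
Shelbyville,2010,21000
"""

# ":memory:" keeps the database in RAM; pass a filename to persist it instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (city TEXT, year INTEGER, population INTEGER)")

# Parse the CSV and convert numeric fields before inserting.
reader = csv.DictReader(io.StringIO(csv_text))
records = [(r["city"], int(r["year"]), int(r["population"])) for r in reader]
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", records)
conn.commit()

# Query the data back with ordinary SQL.
rows = conn.execute(
    "SELECT city, MAX(population) AS peak FROM population GROUP BY city ORDER BY city"
).fetchall()
for city, peak in rows:
    print(city, peak)
```

The `?` placeholders let SQLite handle quoting, which is safer than building SQL strings by hand.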
On the structured side of the spectrum, this table summarizes much of the map of data structures and software:
Structure | Format | Software | Servers |
---|---|---|---|
Tabular | CSV etc. | most; SQLite | MySQL, PostgreSQL, etc. |
Nested | JSON, XML | rjson, lxml, etc. | web etc. |
Graph | various | networkx, Gephi, etc. | Neo4j, etc. |
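
To make the three structures in the table concrete, here is a small sketch that holds the same hypothetical relationships as tabular rows, as a nested JSON document, and as a graph. It uses only the Python standard library; a plain dictionary of sets stands in for a graph library such as networkx, and the names in the data are invented.

```python
import csv
import io
import json

# Tabular: one row per relationship, as it might appear in a CSV file.
edges_csv = "source,target\nalice,bob\nalice,carol\nbob,carol\n"
edge_rows = list(csv.DictReader(io.StringIO(edges_csv)))

# Nested: group the targets under each source, then serialize as JSON.
nested = {}
for row in edge_rows:
    nested.setdefault(row["source"], []).append(row["target"])
nested_json = json.dumps(nested, sort_keys=True)

# Graph: an undirected adjacency structure (node -> set of neighbors).
graph = {}
for row in edge_rows:
    graph.setdefault(row["source"], set()).add(row["target"])
    graph.setdefault(row["target"], set()).add(row["source"])

print(edge_rows)
print(nested_json)
print(sorted(graph["carol"]))
```

Each form makes a different question easy: the rows suit SQL-style filtering, the nested form suits passing a whole record around, and the adjacency structure suits neighbor lookups.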
Optional:
- For more introductory SQL, check out the Mode Analytics "SQL School".
- Install PostgreSQL on your Ubuntu machine and play with it.
- Look into two ways of using SQL from `R`.
- Check out RPostgreSQL for combining `R` with PostgreSQL.
- Check out the `python` module dataset, which tries to combine the ease of JSON with relational databases.
- Read this chapter from the Bad Data Handbook: "When Databases Attack: A Guide for When to Stick to Files"
- Explore Kristof Kovacs' Comparison of NoSQL Databases.