-
Notifications
You must be signed in to change notification settings - Fork 1
/
airbnb_nyc_analysis.Rmd
862 lines (535 loc) Β· 28.8 KB
/
airbnb_nyc_analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
---
title: "Beyond Spreadsheets: R for MBAs"
subtitle: "Exploring New York City's Airbnb Listings"
author: "Asif Mehedi | Batten Institute"
date: Oct 3, 2019
output:
rmdformats::readthedown:
self_contained: false
lightbox: true
gallery: true
highlight: tango
code_folding: show
css: "style.css"
---
# Goals
- Understand how a coding-based analysis workflow is different from one based on spreadsheets
- Get familiar with R (a programming language) and RStudio (a desktop program where you write R code)
- Learn some powerful analytic techniques, especially useful for exploring a large-ish dataset
- Learn a few fundamental programming concepts (variable, function, data type)
We'll accomplish these objectives by taking a close look at an interesting dataset, one that contains detailed data on over 35,000 recent (August 2019) Airbnb listings, all in New York City.
It's okay if you can't start using these techniques right away after the workshop. To be comfortable with a tool like R takes some time and practice. Your exposure today to these tools and techniques will likely show you R's possibilities and motivate you to keep learning. I suggest some learning resources at the end of this document.
# RStudio
RStudio is a free program that makes writing R code much more enjoyable and efficient. Strictly speaking, you don't need it to write R code, but I highly recommend you use RStudio.
It has four main panes. This one β assuming you're reading this on RStudio β is the "code editor" (also known as "script editor" and "source editor"). This is where we'll spend most of our time. I'll describe the other panes later.
## R Notebook
Before starting our analysis, let's understand how this document works. (I'm assuming you're reading this on RStudio.)
This document is an *R Notebook*. An R Notebook contains commentary interspersed with code chunks. The code chunks can be run (executed) independently and interactively. The output will appear below the code chunk. R Notebooks are easy to convert to well-formatted final documents as a pdf file, a webpage, or an MS Word file. [More here](http://rmarkdown.rstudio.com/r_notebooks.html)
What you're reading is *commentary* and what you see below is the container for a *code* chunk.
```{r}
```
To run a code chunk, click on the little triangle at the right edge of the code chunk. Or, use the keyboard shortcut **CTRL + SHIFT + ENTER** (Windows) or **CMD + SHIFT + ENTER** (Mac).
π Run the chunk below and see what happens.
```{r}
a <- 5 + 7
print(a)
```
Do you see some output below the code chunk? This way you can immediately see the result of your code as you work on your analysis.
(Do you know what's going on in the code above? We'll talk about variables and assignments later.)
Feel free to add your commentary and code anywhere in this document. You can always download the [unmodified version](https://github.com/asifm/workshops/archive/R-intro-2019.zip).
To write your own code chunk, look for the **insert** button above and then select **R**. Or, use the keyboard shortcut **CTRL + ALT + I** (Windows) or **CMD + OPTION + I** (Mac).
π Place the cursor below and create a container for writing code. Now write some code and run it. For example, find the result of `3543 / 562`.
# The Data
The dataset contains almost all the Airbnb listings in NYC as of August 2019. The data came from [Inside Airbnb](http://insideairbnb.com/get-the-data.html).
It's always a good idea to approach a new dataset with a few basic questions. Some examples:
- What kind of data is here? What are the variables? How many observations?
- Who collected these data? How? Why?
- How accurate are these data?
- How complete are these data? Are there too many missing values?
- Do I have the legal rights to use these data? Under what conditions?
[Here's some background information](http://insideairbnb.com/behind.html) that may answer some of these questions.
Let's take a look at the data. Open the Excel file in the `data` folder. Browse around a bit.
π€ Can you explain what each column is about?
π€ What's the first thing you'd like to find out from these data?
π€ What do you think about the quality of the data?
# The Context
Data analysis happens within a broader context. Usually, there's an overarching business, policy, or scientific question that one would like to answer. Often, though, the question is not very clear. Whatever the case, you need to approach the data with curiosity about the larger context.
The more you understand the context of the data, the better will be your questions and hypotheses guiding your analysis.
For our dataset, we should have a good understanding of the business model of Airbnb as well as the economy and geography of NYC.
π€ Is there anything else we should know about?
## Airbnb
π€ How much do I need to know about Airbnb and its business? Is my personal experience with Airbnb enough? What if I don't have any personal experience?
- [How Airbnb works](https://www.wikiwand.com/en/Airbnb#/How_it_works)
- [Example of an Airbnb listing](https://www.airbnb.com/rooms/2168594?s=frffIXWS)
## New York City
We can start with a map of the city.
![](https://i.imgur.com/se7fTYT.png)
The map gives us an idea about the relative size and location of the five boroughs of NYC.
π€ What else do we know about these boroughs? What about the neighborhoods within the boroughs?
# Prepare to Analyze
## Install and load packages
We'll first install a few packages that we'll need at different stages of the analysis. This may take a minute or two.
Take a look at the code chunk below and note the `#`s at the start some lines. The `#` at the start of a line means the line is a comment. These lines are not executed when you run a code chunk. You can use comments either to explain your code within a code chunk or to temporarily prevent a line of code from executing. The latter is known as "commenting out" a line of code.
If the lines containing `install.packages` are "commented out" below, please remove the `#`s before running the code chunk.
```{r, message=FALSE}
# For loading data
# install.packages("readr")
# # For data manipulation (we'll spend most time with this one)
# install.packages("dplyr")
# # For visualization
# install.packages("ggplot2")
# # To work with date and time
# install.packages("lubridate")
```
And then we load the packages for our current R session. After these packages are successfully loaded, all the functions in these packages will be available for our use.
```{r}
library(readr)
library(dplyr)
library(ggplot2)
library(lubridate)
```
## Load and prepare data
We'll use the function `read_csv` β which comes from the package `readr` β to load data from a file and convert the data into a dataframe. A dataframe is a tabular data structure, with columns as variables and rows as observations. In this case, the dataframe looks exactly as it does in the CSV file.
π€ What's a CSV file?
π€ What are some other formats used for storing data?
π€ What can I do if my data is in a format I'm not familiar with? [hint: Google, package, function]
```{r}
df <- read_csv("data/airbnb_nyc_data.csv")
```
The first thing to check is whether the data have been successfully loaded and assigned to the variable (here `df`). We don't see any error β which is a good sign. We can also see the variable `df` in the environment pane (usually to the right of this code editor pane β click on "Environment" if hidden).
Next you should check whether `read_csv` was able to correctly infer the data type of each column.
For a better glimpse at the data, use the function `glimpse`.
π€ How am I supposed to know what functions are out there?
```{r}
glimpse(df)
```
What you see within the angular brackets `<...>` next to each column name is the data type of that column.
But what are data types? And why are they important?
What exactly is a variable in programming?
What is a function?
Let's take a detour.
# Detour: Programming Concepts
## Variable
In programming, a variable stores some value. Our code can then reference the name of the variable to do something with it.
We use a left-facing arrow to assign a value to a variable. The arrow looks like this: `<-` (the less-than sign followed by a hyphen).
For example, in the code below, `foo` and `bar` are variable names. We assigned the values `43` and `"kitten"` to `foo` and `bar` respectively. You can read them as `foo` gets `43` and `bar` gets `"kitten"`.
```{r}
foo <- 43
bar <- "kitten"
```
Let's do something with these variables.
```{r}
# Multiply foo by 10
print(foo * 10)
# Just see the value of bar
print(bar)
```
Important: The word `variable` can be ambiguous. You may be familiar with the term as used in statistics. For example, if you have some data about people, some variables there could be age, sex, income and address. The names of the columns of a dataframe are variables in this sense. Usually, the context would clarify which meaning is intended.
Tip: Give short descriptive names to your variables. This will make the code easy to follow.
π A room is 11 yards long and 7.5 yards wide. Assign these values to `width` and `length` variables and then calculate the `area` of the room. You can, of course, give different names to these variables if you wish.
```{r}
```
## Data types
We also need to know a little about data types. These are the main ones:
- Integer: such as 2, 543, 90.
- Double: numbers other than integers, such as 4.56, 1/33. Doubles are always approximations.
- Character: a sequence of letters, numbers, punctuation, etc., such as "rainbow", "34265". Character data are always written within quotes, either single or double (we'll always use double for consistency).
- Logical: data that can take on one of only two values β TRUE or FALSE.
π€ Why is "34265" above of the type character? What's a real-world example of this kind of data?
There are some other important data types:
- `factor`, for categorical variables (here "variable" in statistical sense). A categorical variable is one that can take on a limited, fixed number of values.
- `date-time` and `date`
## Data structures
`vector`, `list`, `matrix`, and `dataframe` are the main data-structures for storing data in R.
Think of a vector as simply a collection of data elements, where each element is of the same data type.
This is how you create a vector: `c(element1, element2, element3, ...)`
```{r}
vec <- c(3, 5, 10, 20)
print(vec)
```
## Functions
You're likely familiar with Excel functions, such as `sum()` or `average()`. The concept of functions in a programming language is no different. You give some values (called parameters) to a function. The function does something with the values and returns a new value. So `sum(3, 6, 1)` returns 10. Here `sum` is the name of the function, 3, 6, 1 are parameters, and 10 is the returned value.
Most of the times, you'll use functions written by other people. But it's not too difficult to write your own functions. Here's a simple one. All it does is add 10 to a given number and then return the total.
```{r}
function(num){
new_value <- num + 10
new_value
}
```
But we can't really use the function yet. We need to give it a name. Let's go with `add10`.
```{r}
add10 <- function(num){
new_value <- num + 10
new_value
}
```
Now we can use this function whenever we need to add 10 to a number. (This is, of course, a toy example. In practice, a function would do much more work.)
```{r}
# Add 10 to 35
add10(35)
```
We won't write any function today. Rather, we'll use functions that R β and the packages we loaded β make available to us. Here's an example: the `mean` function, which comes preloaded in R.
```{r}
# First let's create a vector
vec2 <- c(23, 53, 11, 34, 87, 100, 5, 12, 66, 9, 87, 110, 20, 33, 54, 43, 76)
# Let's calculate the mean of the vector
calculated_mean <- mean(vec2)
# Then output it with some text description
message("The mean of the vector vec2 is ", calculated_mean)
```
π Calculate the sum, median, and standard deviation (hint: `sd`) of the vector `vec2`.
```{r}
```
π Let's load some new data. You'll see there are two other data files in the `data` folder: `babynames_2018.csv` and `babynames_1880.csv`. They contain the top 1000 names of babies born in 2018 and 1880, respectively. Load the 2018 data to a dataframe (hint: `read_csv`). Assign the dataframe to a variable named `df_babynames` (You can give a variable any name you like β but it's always a good idea to give variables meaningful, consistent names).
```{r}
```
π Now take a look at the dataframe's data types. (hint: `glimpse`)
```{r}
```
[Quiz Time]
# Back to Analysis
Let's take another look at the data types of our dataframe.
```{r}
glimpse(df)
```
Some of the data types don't look right. Let's correct them.
π€ why is it important to have the correct data type?
But, first, let's learn how to reference a column in a dataframe: first the name of the dataframe, then `$` and then the column name. For example: `df$cancellation_policy` or `df_babynames$name`.
To change data type, we apply the relevant function, such `as.character` or `as.factor`, to a column and then assign the output of the function (values with changed data type) to the same column.
```{r}
df$host_id <- as.character(df$host_id)
df$host_response_rate <- as.double(df$host_response_rate)
df$property_type <- as.factor(df$property_type)
df$host_since <- mdy(df$host_since)
```
π€ What is the rationale for each of these changes?
Side note: The last function `mdy` is different from the others. Normally, working with date-time in R (and in programming in general) is not simple. The `lubridate` package, which we loaded earlier, makes the job much easier. We'll not go into the details of date-times. What we did here with `mdy` is convert character data into date data. We used `mdy` because the character data were in the form month-date-year. If they were in the form, for example, year-month-date, we would've used the function `ymd`.
π Change the data type of any other variables you think necessary.
```{r}
```
## Missing values
Let's inspect our data for missing values using the function `summary`, which will give us plenty of other relevant information as well.
```{r}
summary(df)
```
π€ Should we worry about missing values?
There are a few options for dealing with missing values.
- We can drop all rows that have one or more missing values.
- Drop rows that have missing values in particular columns.
- Replace the missing values with some other value.
- Do nothing, but do remember to take care of them when running arithmetic operations. (explained later)
In our case, we'll go with the last option.
π€ Is the last option the best for our dataset? What would be a better alternative?
## A Set of Verbs
Finally, we're ready to dive into the data. We'll use six functions β think of them as six verbs β to slice and dice the data in all kinds of ways. These functions come from the `dplyr` package. They are:
- filter
- select
- mutate
- arrange
- summarize
- group_by
We'll first learn these functions one by one and later we'll learn how to combine them for rich analysis.
For each of these functions, we provide the name of the dataframe as the first parameter. The subsequent parameters inform what the function is to do with the dataframe.
### Verb: filter
You use `filter`, when you want a subset of the rows based on one or more conditions. The returned rows will be those that meet the condition(s).
The syntax is: `filter(name_of_dataframe, condition1, condition2, ...)`
Use these *logical operators* to create the conditions:
< less than
<= less than or equal to
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
!x Not x
x | y x OR y
x & y x AND y
Example: Return a dataframe with only those rows where the neighborhood is Chelsea.
```{r}
df_chelsea <- filter(df, neighborhood == "Chelsea")
```
Example: Return a dataframe with only those rows where the price is more than 8000.
```{r}
filter(df, price > 8000)
```
You can combine multiple conditions.
Example: Return a dataframe with rows where the borough is Manhattan and the price is less than 50.
```{r}
filter(df, borough == "Manhattan", price < 50)
```
Example: Return a dataframe with rows where cancellation_policy is flexible *or* accommodates more than 2.
```{r}
filter(df, cancellation_policy == "flexible" | accommodates > 2 )
```
Example: Return a dataframe with rows where number_of_reviews is *not* 0
```{r}
filter(df, number_of_reviews != 0)
```
Note: when multiple conditions are separated by commas, the conditions are assumed to have *and* logical operator between them. Using `&` instead would be the same thing. For *or* and other logical operators, we have to explicitly use the appropriate logical operator (e.g. `|` for *or*).
π Return a dataframe with rows where number of bedrooms is not 1 and property_type is House
```{r}
```
π Return a dataframe with rows where host_response_time is within an hour or host_response_rate is more than 90% or calendar was updated today
```{r}
```
### Verb: select
Our second verb is `select`, which is used to return a subset of the columns.
The syntax is: `select(name_of_dataframe, name_of_column1, name_of_column2, ...)`
There are some variations to this syntax, which come in handy in certain situations.
- If you want all columns except one: `select(name_of_dataframe, -column_to_exclude)`
- If you want all columns except two: `select(name_of_dataframe, -c(column_to_exlude1, column_to_exclue2))`
- If you want the third through eighth columns: `select(name_of_dataframe, 3:8)`
Example: Return a dataframe with only the listing_url column.
```{r}
select(df, listing_url)
```
Example: Return a dataframe with columns host_response_time, cancellation_policy, and neighborhood.
```{r}
select(df, host_response_time, cancellation_policy, neighborhood)
```
Example: Return a dataframe with second through fourth columns
```{r}
select(df, 2:4)
```
π Return a dataframe with property_type, room_type, and zip_code columns.
```{r}
```
π Return a dataframe with the last 23 columns (sixth through 28th)
```{r}
```
π What would be another way of getting the same dataframe?
```{r}
```
### Piping
Before I introduce the third verb, let's take another detour.
You can't accomplish a lot with `filter` and `select` in isolation. Let's combine them.
Here, we're creating a new dataframe `df_expensive` by using `filter` and then we're selecting some columns of interest from that dataframe.
```{r}
df_expensive <- filter(df, price > 8000)
select(df_expensive, price, neighborhood, listing_url)
```
π€ More than $8000 per night? What's going on here? How can we find out more? What should we do with this kind of outliers?
The above code could be written more succinctly by using pipe `%>%`.
```{r}
df %>%
filter(price > 8000) %>%
select(price, neighborhood, listing_url)
```
Let's try to understand how the pipe works.
`filter` and `select` both receive a dataframe as their first parameter (this is true for the other four main verbs as well.) Also, you may have noticed that `filter` and `select` return a dataframe (again, it's true for the others as well). Thanks to this, we can use the pipe to string together several verbs to form a complex query.
```{r}
filter(df, price > 8000)
```
is the same as
```{r}
df %>% filter(price > 8000)
```
Similarly,
```{r}
select(df, price, review_scores_rating, listing_url)
```
is the same as
```{r}
df %>% select(price, review_scores_rating, listing_url)
```
The left side of the pipe is used as the first parameter of the function on the right side of the pipe.
This is how `filter`, `select`, and such functions receive a dataframe, do something to it, return the modified dataframe and passes it on to the next function.
Here's again the piped expression. Make sure you understand what's going on.
```{r}
df %>%
filter(price > 8000) %>%
select(price, review_scores_rating, listing_url)
```
### Verb: arrange
`arrange` is our third verb. (Note that I'm using verb and function interchangeably.) `arrange` sorts a dataframe based on the values of one or more columns.
Example: Get the same dataframe as the last code chunk. Sort it by price.
```{r}
df %>%
filter(price > 8000) %>%
select(price, review_scores_rating, listing_url) %>%
arrange(desc(price))
```
Example: The same dataframe, but now in descending order of price.
```{r}
df %>%
filter(price > 8000) %>%
select(price, review_scores_rating, listing_url) %>%
arrange(desc(price))
```
Example: You can add more columns to `arrange`. Now when price is the same, the order will be based on neighborhood.
```{r}
df %>%
filter(price > 8000) %>%
select(price, neighborhood, listing_url) %>%
arrange(desc(price), neighborhood)
```
π Return the whole dataframe `df`, in descending order of number of bedrooms and where number of bedrooms are equal, descending order of review_scores_rating
```{r}
```
### Verb: mutate
`mutate` creates a new column by modifying one or more existing columns. It's better explained by examples.
Example:
```{r}
mutate(df, price_per_person = price / accommodates)
```
π Create a new review_rating column, which is an average of review_scores_accuracy and review_scores_value.
```{r}
```
### Verb: summarize (or summarise)
`summarize`, as you'd expect, gives some kind of summary (such as mean, median, min, max) of the values of a given column.
Example: What is the total number of people that can stay in NYC Airbnbs?
```{r}
summarize(df, total_accommodates = sum(accommodates))
```
Here, total_accommodates is a name I've given to the summary number.
The common summary functions are:
- sum
- n (count)
- mean
- median
- sd (standard deviation)
- var (variance)
- range
- min
- max
Note: The function `n` doesn't need a parameter because it just counts the number of rows in the dataframe. This would become more clear in the examples.
We can also write our own function and use here. But that is a more advanced topic.
Example: What is the mean rating for accuracy
```{r}
summarize(df, mean_accuracy = mean(review_scores_accuracy))
```
Hmm, not exactly what we expected. This is because any arithmetic operation involving `NA` (that is, *Not Available* or missing values) results in `NA`.
```{r}
5 + NA
```
Or,
```{r}
NA / 33
```
The review_scores_accuracy had several `NA`s. The correct way of doing this would be:
```{r}
summarize(df, mean_accuracy = mean(review_scores_accuracy, na.rm = TRUE))
```
`na.rm = TRUE` means "remove all NAs before performing arithmetic operations."
When we summed the accommodates column earlier, we didn't add `na.rm = TRUE` because that column didn't have any NAs.
Example: We can have multiple summaries at the same time.
```{r}
summarize(df, min_score_value = min(review_scores_value, na.rm = TRUE),
max_score_value = max(review_scores_value, na.rm = TRUE))
```
Example: What's the total number of listings in this dataset?
```{r}
df %>%
summarize(count = n())
```
π The host who's been with Airbnb NYC for the longest time started on which date?
```{r}
```
π Find the standard deviation and mean of the price variable.
```{r}
```
π Count the number of unique zip_codes in the data. Hint `distinct` and `n`.
```{r}
```
π How many listings have perfect 100 for review_scores_rating (which is the overall rating shown as stars on the listing's webpage)?
```{r}
```
π How many listings have a perfect 100 of review_scores_rating and cost less than 50?
```{r}
```
### Verb: group_by
`group_by` is one of the most useful functions. It divides the data into different groups and then `summarize` summarizes each of the groups. This way, we can compare the different groups on various attributes.
Example: What's the mean and median price in each of the five boroughs?
```{r}
df %>%
group_by(borough) %>%
summarize(mean(price), median(price))
```
π Return the same dataframe but sorted by mean price.
```{r}
```
Example: Which are the top 10 neighborhoods by total number of listings?
```{r}
df %>%
group_by(neighborhood) %>%
summarize(count = n()) %>%
top_n(10, wt = count) %>%
arrange(desc(count))
```
We introduce a new function `top_n` here, which returns top n rows (or bottom n rows, when n is negative) ordered by some variable. In the code chunk above, if we had `top_n(-10)`, we would've got the bottom 10 rows.
Example: Which are the top two neighborhoods in each of the boroughs by number of listings?
```{r}
df %>%
group_by(borough, neighborhood) %>%
summarize(count = n()) %>%
top_n(2, wt = count) %>%
arrange(desc(count))
```
π Do the same as above but only for listings where price is more than 500 and property_type is Apartment.
```{r}
```
Example: In Manhattan, which neighborhood has the highest mean price?*
```{r}
df %>%
filter(borough == "Manhattan") %>%
group_by(neighborhood) %>%
summarize(mean_price = mean(price)) %>%
arrange(desc(mean_price))
```
π In Bronx and Queens, what's the mean price for different property_type?
```{r}
```
π For each borough, which neighborhood has the highest variance in price?
```{r}
```
π Which combination of property_type and room_type has the highest median price?
```{r}
```
π Considering only the Brooklyn listings that have review_scores_value 9 or 10, which neighborhood, out of those neighborhoods with at least 100 listings, has the lowest median price?
```{r}
```
# Explore Visually
We're going to use the ggplot2 package, which we loaded earlier, for some visual exploration of the data. Actually, this is how we should have begun our data exploration. We couldn't do this because we needed to learn the six verbs first to get the full benefit of ggplot2.
Unfortunately, we won't have time today to learn ggplot2. It may be possible to organize a separate session later on visualization in R.
Example: What's the distribution of price?
```{r}
ggplot(df) + geom_histogram(aes(price))
```
π What's the problem with this histogram? How can we solve this?
```{r}
```
Example: Compare price distribution of different boroughs.
```{r}
df %>%
filter(price < 500) %>%
ggplot() + geom_histogram(aes(x = price, fill = borough))
```
Example: Do the same, but with boxplots
```{r}
df %>%
filter(price < 500) %>%
ggplot() + geom_boxplot(aes(x = borough, y = price))
```
Example: Which neighborhoods have the highest number Of listings that have review_scores_value less than 5.
```{r}
df %>%
filter(review_scores_value < 5) %>%
group_by(borough, neighborhood) %>%
summarize(count = n()) %>%
ungroup() %>%
top_n(5) %>%
ggplot() + geom_col(aes(x = reorder(neighborhood, count), y = count)) +
coord_flip()
```
# Keep Learning
- If you like `dplyr` and the other tidyverse packages we used in this workshop, [R for Data Science](https://r4ds.had.co.nz) is an excellent resource. Hadley Wickham, the author of the tidyverse packages, wrote that book.
- [Swirl](https://swirlstats.com) teaches R and data science interactively right on the console. Follow the instructions [here](https://swirlstats.com/students.html).
- To learn how to do statistics using R, you can watch [this YouTube video](https://www.youtube.com/watch?v=_V8eKsto3Ug)
- To learn R's programming concepts (which we didn't cover in detail today), you can watch [this YouTube video](https://www.youtube.com/watch?v=fDRa82lxzaU)
- [An Introduction to R](https://cran.r-project.org/doc/manuals/r-release/R-intro.html) is perhaps the most comprehensive introduction. *But it's not the best place to start if you're a complete beginner.*
## Download Cheatsheets
- [RStudio](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf)
- [Data transformation with dplyr](https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf)
- [Work with strings](https://github.com/rstudio/cheatsheets/raw/master/strings.pdf)
- [Data import](https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf)
- [R Markdown](https://github.com/rstudio/cheatsheets/raw/master/rmarkdown-2.0.pdf)
- [Data visualization with ggplot2](https://github.com/rstudio/cheatsheets/raw/master/data-visualization-2.1.pdf)