Commit e6e21af

Pedro Bernardo committed
Added sparkSql/HousePriceProblem.py and sparkSql/HousePriceSolution.py
1 parent 25927ae commit e6e21af

2 files changed: +55 -0 lines changed

sparkSql/HousePriceProblem.py

Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
if __name__ == "__main__":

    '''
    Create a Spark program to read the house data from in/RealEstate.csv,
    group by location, aggregate the average price per SQ Ft and sort by average price per SQ Ft.

    The houses dataset contains a collection of recent real estate listings in San Luis Obispo county and
    around it.

    The dataset contains the following fields:
    1. MLS: Multiple listing service number for the house (unique ID).
    2. Location: city/town where the house is located. Most locations are in San Luis Obispo county and
    northern Santa Barbara county (Santa Maria-Orcutt, Lompoc, Guadalupe, Los Alamos), but there
    are some out-of-area locations as well.
    3. Price: the most recent listing price of the house (in dollars).
    4. Bedrooms: number of bedrooms.
    5. Bathrooms: number of bathrooms.
    6. Size: size of the house in square feet.
    7. Price/SQ.ft: price of the house per square foot.
    8. Status: type of sale. Three types are represented in the dataset: Short Sale, Foreclosure and Regular.

    Each field is comma separated.

    Sample output:

    +----------------+-----------------+
    |        Location| avg(Price SQ Ft)|
    +----------------+-----------------+
    |          Oceano|             95.0|
    |         Bradley|            206.0|
    | San Luis Obispo|            359.0|
    |      Santa Ynez|            491.4|
    |         Cayucos|            887.0|
    |................|.................|
    |................|.................|
    |................|.................|
    '''
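
The eight fields listed in the docstring above can be mapped onto a typed record before any aggregation. The following is a minimal plain-Python sketch, not part of this commit. The header names `Location` and `Price SQ Ft` come from HousePriceSolution.py; the other header names, the field types, the `Listing` class, the `parse_listing` helper and the sample rows are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Listing:
    """One real estate listing; field types are assumed, not confirmed by the source."""
    mls: str
    location: str
    price: float
    bedrooms: int
    bathrooms: int
    size_sq_ft: int
    price_sq_ft: float
    status: str


def parse_listing(row: dict) -> Listing:
    """Convert one CSV row (all string values) into a typed Listing."""
    return Listing(
        mls=row["MLS"],
        location=row["Location"].strip(),
        price=float(row["Price"]),
        bedrooms=int(row["Bedrooms"]),
        bathrooms=int(row["Bathrooms"]),
        size_sq_ft=int(row["Size"]),
        price_sq_ft=float(row["Price SQ Ft"]),
        status=row["Status"].strip(),
    )


# Invented sample rows for illustration only; not taken from in/RealEstate.csv.
SAMPLE_CSV = """MLS,Location,Price,Bedrooms,Bathrooms,Size,Price SQ Ft,Status
100001,Town A,200000,3,2,1000,200.0,Regular
100002,Town B,150000,2,1,1500,100.0,Short Sale
"""

listings = [parse_listing(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for item in listings:
    print(item.location, item.price_sq_ft, item.status)
```

If the real file uses different header names or non-integer room counts, the lookups and conversions in `parse_listing` would need adjusting.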

sparkSql/HousePriceSolution.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
from pyspark.sql import SparkSession

PRICE_SQ_FT = "Price SQ Ft"

if __name__ == "__main__":

    session = SparkSession.builder.appName("HousePriceSolution").master("local").getOrCreate()
    session.sparkContext.setLogLevel("ERROR")
    # Read the CSV with a header row and let Spark infer the column types.
    realEstate = session.read \
        .option("header", "true") \
        .option("inferSchema", value=True) \
        .csv("in/RealEstate.csv")

    # groupBy(...).avg(col) names its result column "avg(<col>)"; build that name
    # from the constant so the sort key matches the aggregated column exactly.
    realEstate.groupBy("Location") \
        .avg(PRICE_SQ_FT) \
        .orderBy("avg({})".format(PRICE_SQ_FT)) \
        .show()
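
The group, average and sort steps in the solution above can be cross-checked without Spark. The following is a rough plain-Python sketch of the same logic, not part of this commit. The function name `average_price_per_sq_ft` and the sample rows are hypothetical and chosen only so the result is easy to verify by hand.

```python
from collections import defaultdict


def average_price_per_sq_ft(rows):
    """Group (location, price_per_sq_ft) pairs by location, average them,
    and return (location, average) pairs sorted by ascending average."""
    totals = defaultdict(lambda: [0.0, 0])  # location -> [running sum, count]
    for location, price_sq_ft in rows:
        totals[location][0] += price_sq_ft
        totals[location][1] += 1
    averages = [(loc, total / count) for loc, (total, count) in totals.items()]
    return sorted(averages, key=lambda pair: pair[1])


# Invented values for illustration only; not taken from in/RealEstate.csv.
sample_rows = [
    ("Town A", 200.0),
    ("Town B", 100.0),
    ("Town A", 300.0),
    ("Town C", 150.0),
]

result = average_price_per_sq_ft(sample_rows)
for location, avg in result:
    print(f"{location}: {avg}")
```

Like `orderBy` with its default arguments in the Spark version, the sort here is ascending, so the cheapest location per square foot comes first.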