Skip to content

Commit a2068e7

Browse files
author
Pedro Bernardo
committed
Added pairRdd/filter/*.py
1 parent d6b58da commit a2068e7

File tree

2 files changed

+36
-0
lines changed

2 files changed

+36
-0
lines changed
Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,20 @@
1+
from pyspark import SparkContext
2+
3+
if __name__ == "__main__":
4+
5+
'''
6+
Create a Spark program to read the airport data from in/airports.text;
7+
generate a pair RDD with airport name being the key and country name being the value.
8+
Then remove all the airports which are located in United States and output the pair RDD to out/airports_not_in_usa_pair_rdd.text
9+
10+
Each row of the input file contains the following columns:
11+
Airport ID, Name of airport, Main city served by airport, Country where airport is located,
12+
IATA/FAA code, ICAO Code, Latitude, Longitude, Altitude, Timezone, DST, Timezone in Olson format
13+
14+
Sample output:
15+
16+
("Kamloops", "Canada")
17+
("Wewak Intl", "Papua New Guinea")
18+
...
19+
20+
'''
Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,16 @@
1+
from pyspark import SparkContext
2+
from commons.Utils import Utils
3+
4+
if __name__ == "__main__":
5+
6+
sc = SparkContext("local", "airports")
7+
sc.setLogLevel("ERROR")
8+
9+
airportsRDD = sc.textFile("in/airports.text")
10+
11+
airportPairRDD = airportsRDD.map(lambda line: \
12+
(Utils.COMMA_DELIMITER.split(line)[1],
13+
Utils.COMMA_DELIMITER.split(line)[3]))
14+
airportsNotInUSA = airportPairRDD.filter(lambda keyValue: keyValue[1] != "\"United States\"")
15+
16+
airportsNotInUSA.saveAsTextFile("out/airports_not_in_usa_pair_rdd.text")

0 commit comments

Comments
 (0)