What_do_cars_sell_for

By scraping Craigslist and following up on the URLs, it is possible to estimate the likely price at which cars sell or don't sell.

The system behavior is going to look something like this:

Check Craigslist for new listings; save all of them with a unique ID (URL)
Check all saved listings to see if they've changed status (options are [removed by user, expired, flagged])
Repeat
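
The three steps above can be sketched roughly as follows. This is only an outline under stated assumptions: `fetch_new_urls` and `check_status` are hypothetical callables standing in for the real scraper, the `"active"` label is added here for listings that are still live, and re-checking only active listings is a shortcut on the assumption that an ended listing won't change again.

```python
import time
from typing import Callable, Dict, Iterable

# Status labels from the list above, plus a hypothetical "active" for live listings.
STATUSES = {"active", "removed by user", "expired", "flagged"}


def run_cycle(
    saved: Dict[str, str],
    fetch_new_urls: Callable[[], Iterable[str]],
    check_status: Callable[[str], str],
) -> Dict[str, str]:
    """Run one pass: record new listings, then re-check the saved ones."""
    # Step 1: save every new listing, keyed by its URL.
    for url in fetch_new_urls():
        saved.setdefault(url, "active")
    # Step 2: re-check listings that have not ended yet (assumption, see above).
    for url, status in list(saved.items()):
        if status == "active":
            new_status = check_status(url)
            if new_status not in STATUSES:
                raise ValueError(f"unknown status: {new_status!r}")
            saved[url] = new_status
    return saved


def run_forever(saved, fetch_new_urls, check_status, interval_seconds=3600):
    """Step 3: repeat, sleeping between passes (the interval is arbitrary)."""
    while True:
        run_cycle(saved, fetch_new_urls, check_status)
        time.sleep(interval_seconds)
```

Passing the fetch and status functions in as arguments keeps the loop itself independent of how Craigslist is actually accessed.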

After that process has run for a while, I'll have a dataset that I can use to compare each listing's time on the platform, end status, and any other details about the listing.
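
As a rough idea of what one row of that dataset could look like, here is a minimal sketch. The field names (`first_seen`, `last_seen`, `end_status`) are assumptions for illustration, not an existing schema, and "time on the platform" is approximated by the span between the first and last time the scraper saw the listing.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ListingRecord:
    url: str                          # unique ID for the listing
    first_seen: datetime              # first time the scraper saw the listing
    last_seen: datetime               # last time the listing was still live
    end_status: Optional[str] = None  # None while the listing is still live

    def days_on_platform(self) -> float:
        """Observed time on the platform, in days."""
        return (self.last_seen - self.first_seen).total_seconds() / 86400
```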

That information can probably be used to learn things about how cars sell, or maybe even to train a neural network to predict what price a given listing would sell at and how long that would take.

I need two main pieces of tech infrastructure to build this:

a database that can store HTML files
a web scraper that can access Craigslist

In the past I've spent a ton of time and energy building data cleaning functions into the web scraper, taking pieces of information like model, price, etc. out of the HTML before writing it to the database.

This practice made the pipeline brittle, and the frequent breaks limited the size and usefulness of the dataset.

Craigslist's own T&C cause listings to expire after 45 days no matter what, so I can label a URL as 'retired' after that timeframe.
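
A minimal sketch of that retirement rule, assuming the clock starts when the scraper first saw the URL (the real posting date may be earlier, which only makes this check conservative). The names `MAX_LISTING_AGE` and `is_retired` are made up for illustration.

```python
from datetime import datetime, timedelta

# 45-day expiry window, as described above.
MAX_LISTING_AGE = timedelta(days=45)


def is_retired(first_seen: datetime, now: datetime) -> bool:
    """True once a URL has been tracked for at least 45 days."""
    return now - first_seen >= MAX_LISTING_AGE
```

Retired URLs can then be dropped from the re-check list so the scraper stops requesting them.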

I can also just compare the saved HTML to the newly scraped HTML to see if it's changed, and save a new version if it has. No need to decide conclusively what happened to a listing out of the gate.
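
One way to do that comparison is to keep a hash next to each saved version and only append when the hash differs. This is a sketch using an in-memory dict as a stand-in for the database; `save_if_changed` and the `(hash, html)` layout are assumptions for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

# url -> list of (sha256 hex digest, html) versions, oldest first
Versions = Dict[str, List[Tuple[str, str]]]


def content_hash(html: str) -> str:
    """SHA-256 digest of the page, used as a cheap equality check."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def save_if_changed(versions: Versions, url: str, html: str) -> bool:
    """Append a new version only if it differs from the latest saved one."""
    history = versions.setdefault(url, [])
    digest = content_hash(html)
    if history and history[-1][0] == digest:
        return False  # unchanged, nothing to save
    history.append((digest, html))
    return True
```

An exact comparison like this treats any byte-level difference as a change, so pages with content that varies between requests would produce extra versions.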

I need to run the scraper on EC2 because Craigslist uses client-side rendering, so I need to host a web browser client to get all the information out of the HTML.

DynamoDB is a key-value store that'll be adequate for saving HTML pages with a unique key.
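
A minimal sketch of writing one page to DynamoDB. The attribute names (`url`, `scraped_at`, `html_gz`) and the table layout are assumptions, not an existing schema. `table` is expected to behave like a boto3 `Table` resource, whose `put_item(Item=...)` call is the only API used here. The HTML is gzipped because a DynamoDB item is limited to 400 KB.

```python
import gzip
import time
from typing import Any, Dict, Optional


def build_item(url: str, html: str, scraped_at: Optional[int] = None) -> Dict[str, Any]:
    """Build one item: URL as the key, a timestamp, and the gzipped page."""
    return {
        "url": url,  # assumed partition key
        "scraped_at": int(time.time()) if scraped_at is None else scraped_at,
        "html_gz": gzip.compress(html.encode("utf-8")),  # stored as a binary attribute
    }


def save_page(table: Any, url: str, html: str) -> None:
    """Write the page through a boto3-style Table object."""
    # e.g. table = boto3.resource("dynamodb").Table("listings")  # hypothetical table name
    table.put_item(Item=build_item(url, html))
```

Using `scraped_at` as a sort key alongside `url` would let several versions of the same listing coexist, but that is a design choice, not a requirement.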

In the future I can write an HTML parser to analyze this data. The mistake I've made in the past is writing a parser inline with the pipeline.

Note: don't put the parser in the pipeline again.
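
To illustrate what a separate, offline parser could look like, here is a small sketch that runs over already-saved HTML using only the standard library. It pulls out the page `<title>` purely as an example field and makes no assumption about Craigslist's actual markup. `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Parse a saved page and return its title, stripped of whitespace."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Because this runs against stored pages, a parser bug only means re-running the analysis, not losing scraped data.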

I'm following this guide for the EC2-DynamoDB setup: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/vpc-endpoints-dynamodb.html

EC2 instance name: Jenkins

Public DNS: ec2-54-149-50-230.us-west-2.compute.amazonaws.com

ssh command: ssh -i my-keypair.pem ec2-user@public-dns-name

formatted: ssh -i Jenkins.pem ec2-user@ec2-54-149-50-230.us-west-2.compute.amazonaws.com


mckinlde/What_do_cars_sell_for
