SunnyShikhar/alumnize

Alumnize: Alumni Intelligence System - User Guide

This guide explains how to use Alumnize to gather, transform, and analyze alumni data for the University of Waterloo Capstone Project.

Website: https://alumnize-capstone.herokuapp.com/

Prerequisites

Please ensure the following programs are downloaded and installed to use Alumnize:

Gather Data

  1. Go to the LinkedIn profile of an alumnus.

  2. Click the download button that appears in the header when the Alumni Data Buddy Chrome extension is enabled.

Figure1

  3. Repeat for up to 50 profiles (a limit set by the Chrome extension).

  4. Click the Chrome extension icon beside the URL bar.

Figure2

  5. Enter an e-mail address and download.

Figure3

Repeat this process until all alumni profiles have been downloaded.

Transform Data

  1. Go to Alumnize.

Figure4

  2. Upload the Almabase output file(s) exported using the extension.

     2.1 If this is the first time using the website, there is no need to upload a master file.

     2.2 Otherwise, also upload the latest file output by Alumnize ('MasterFile.csv').

  3. Click Transform.

  4. Save the output file.

Analyze Data

Connecting the Original CSV File to the Tableau Data Source

  • Open the Tableau file named “alumni_analysis.twb”.

  • Tableau must locate the folder path of the CSV file it is connected to. It is currently connected to “updated_masterfile.csv”. Upon opening the Tableau file, one of two things happens:

  1. Tableau may ask the user to find the folder path of “updated_masterfile.csv”. If so, click “Connect” and browse to the folder containing “updated_masterfile.csv”. (To avoid any inconvenience, always use the same CSV file name, “updated_masterfile.csv”.)

  2. Tableau automatically finds the folder path of the file it is connected to. In this case, the user does not need to take any additional steps to establish the connection.

  • Click on the “Main” or “Full-Time Duration” tab to view the alumni dashboards.

Connecting Tableau to a New “updated_masterfile”

  • Go to the “Data Source” tab (bottom left).

  • Go to the “Connections” bar located on the top left section of the page.

Figure5

  • Click “Edit connection…” and select the new CSV file output by the web application.

Figure6

Additional Links for Help

The following are links to free training videos and articles provided by Tableau and other sources:

Additional Information

Transformation Logic

The following is a brief explanation of how the fields within the transformed dataset are determined.

ID: Identifier that is unique for each person in the dataset

Work ID/Coop ID: The Work ID is a unique identifier, per person, for each full-time position; the Coop ID is the equivalent for co-op positions. These fields are determined using the graduating year: if the graduating year is before the start month-year of a position, the Work ID column is populated; otherwise, the Coop ID column is populated.
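The classification rule above can be sketched as a small helper. This is an illustrative, hypothetical function (`position_type` is not necessarily the name used in transform_script.py), and it compares years only; the source does not specify how a tie on the graduating year is handled, so this sketch treats it as co-op.

```python
def position_type(grad_year, start_year):
    """Classify a job as full-time ("work") or co-op ("coop").

    Illustrative sketch of the rule described above: a position that
    starts after the graduating year is full-time, otherwise co-op.
    """
    return "work" if grad_year < start_year else "coop"

# Example: an alumnus who graduated in 2015
print(position_type(2015, 2017))  # a job starting 2017 -> "work"
print(position_type(2015, 2013))  # a job starting 2013 -> "coop"
```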

Name: Obtained from the “Name” column in the data extracted from Almabase

Year: Determined from the education duration of any education completed at Waterloo. The script looks for “waterloo” as the education location in the two most recent educations, starting with the 2nd most recent. If the 2nd most recent education is at Waterloo, its duration end is used as the education year, and the most recent education is assumed to be post-graduate. Otherwise, if the most recent education location is “waterloo”, the same steps are applied to it.
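The lookup order above can be sketched as follows. This is a simplified, hypothetical helper: each education is reduced to a `(location, end_year)` pair standing in for the Almabase education columns, and the list is assumed to be ordered most recent first.

```python
def waterloo_grad_year(educations):
    """Return the Waterloo graduation year, or None if not found.

    `educations` is a list of (location, end_year) tuples ordered
    most recent first -- a simplified stand-in for the Almabase data.
    """
    # Check the 2nd most recent education first: if it is at Waterloo,
    # its end year is the graduation year and the most recent entry is
    # assumed to be post-graduate study.
    if len(educations) > 1 and "waterloo" in educations[1][0].lower():
        return educations[1][1]
    # Otherwise fall back to the most recent education.
    if educations and "waterloo" in educations[0][0].lower():
        return educations[0][1]
    return None

# A Waterloo grad who went on to a post-graduate degree elsewhere:
print(waterloo_grad_year([("MIT", 2019), ("University of Waterloo", 2017)]))
```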

Company: Determined from the “Employment: Employer 1” column in the data extracted from Almabase. The same is done for “Employer 2” through “Employer 5” if the person has multiple employments on their LinkedIn profile.

Position: Determined from the “Employment: Title 1” column in the data extracted from Almabase. The same is done for “Title 2” through “Title 5” if the person has multiple employments on their LinkedIn profile.

URL: Determined from the “LinkedIn URL” column in the data extracted from Almabase. The same URL is used for multiple rows in which multiple employments of the same person are present.

All Date fields: Users can enter job position durations in various formats. Rule-based logic, implemented in the “date_format” function of the transform script, accounts for all of these cases; the function takes the raw duration from the Almabase file as input. The supported formats are:

  • Month Year – Month Year
  • Year – Year
  • Month Year – Year
  • Year – Month Year
  • Month Year – Present (current job)
  • Year – Present (current job)

The following columns are created within the transformed dataset:

  • Duration – Full Dates of job duration including mm/yyyy-mm/yyyy OR yyyy-yyyy
  • Start.Date – Start date including mm/yyyy or yyyy
  • Start.Month
  • Start.Year
  • End.Date.pres - End date including mm/yyyy OR yyyy OR pres for current positions
  • End.Month
  • End.Year
  • Job.Duration – Total job duration in years (Ex. 0.2 years)

Full.Location: Raw location obtained from data extracted from Almabase

City & Country: The content of the location attribute from LinkedIn is inconsistent. Toronto could be entered as "Toronto, ON", "Toronto, Ontario, Canada", "GTA", or "Toronto". This proved problematic when conducting location analysis of alumni jobs. Initially, string-based rules were considered to standardize the city and country, but it is impractical to cover every variant for thousands of cities around the world. Instead, the issue was handled with the external Python libraries Geopy and Geotext: Geopy leverages the Google Maps API to obtain the full address of the location attribute, and Geotext then extracts the city and country from that full address.

Despite efforts to extract and standardize the location attribute, Geopy has certain limitations due to the Google Maps API. The API limits the free tier to 2500 requests per day, calculated as the sum of client-side and server-side queries, so only 1,250 locations can be geocoded daily. At five job positions per alumnus, a maximum of 250 alumni profiles can be transformed per day. If the user exceeds the allowed requests, Geotext alone is used to extract the city and country directly from the raw location attribute; this is less accurate than the combination of Geopy and Geotext.
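The capacity figures quoted above follow from simple arithmetic, sketched here with the stated assumptions made explicit (2 queries per location and 5 jobs per profile are taken from the text, not measured):

```python
# Back-of-the-envelope capacity check for the free Google Maps tier,
# using the figures quoted above.
DAILY_QUOTA = 2500           # free requests/day (client-side + server-side combined)
QUERIES_PER_LOCATION = 2     # each location costs one client- and one server-side query
JOBS_PER_ALUM = 5            # assumed upper bound on positions per profile

locations_per_day = DAILY_QUOTA // QUERIES_PER_LOCATION   # 1250 locations/day
profiles_per_day = locations_per_day // JOBS_PER_ALUM     # 250 profiles/day

print(locations_per_day, profiles_per_day)  # -> 1250 250
```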

Append/Replace Functionality within Transformation Script

The user may have a Masterfile of all previously extracted and transformed alumni. When transforming newly extracted data, the user can append to and replace rows in the current Masterfile, provided a path to the Masterfile is supplied; otherwise, the newly extracted data is transformed into a separate CSV without any append/replace. If the user inputs both a Masterfile path and an Almabase output, transform_script.py appends new rows to, and replaces old rows in, the Masterfile. A row's unique identifier is the combination of “Name”, “Year”, “Company”, and “Position”. The append_replace() method in the script implements the following steps:

  • Inserts all primary keys (“Name” + “Year” + “Company” + “Position”) of the Masterfile into a dictionary
  • Inserts all primary keys (“Name” + “Year” + “Company” + “Position”) of the newly transformed almabase output into a dictionary
  • Using the two dictionaries, it finds any new primary keys present in the newly transformed almabase output and not in Masterfile
  • The script appends all new rows to the Masterfile and replaces rows whose primary key is present in both the Masterfile and the newly transformed Almabase output.
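The steps above can be sketched with plain dictionaries keyed on the primary key. This is an illustrative simplification: the real append_replace() operates on CSV files and assigns IDs, whereas this sketch treats rows as dicts and ignores the ID bookkeeping.

```python
def append_replace(master_rows, new_rows):
    """Merge newly transformed rows into the Masterfile rows.

    Sketch of the steps above: rows are dicts, and the primary key is
    the ("Name", "Year", "Company", "Position") combination. New keys
    are appended; matching keys are replaced by the newer row.
    """
    def key(row):
        return (row["Name"], row["Year"], row["Company"], row["Position"])

    merged = {key(r): r for r in master_rows}   # existing rows by primary key
    for r in new_rows:
        merged[key(r)] = r                      # replace old row or append new one
    return list(merged.values())

master = [{"Name": "Jane Doe", "Year": 2015, "Company": "Acme", "Position": "Analyst"}]
new = [
    {"Name": "Jane Doe", "Year": 2015, "Company": "Acme", "Position": "Analyst", "City": "Toronto"},
    {"Name": "John Roe", "Year": 2016, "Company": "Initech", "Position": "PM"},
]
result = append_replace(master, new)
print(len(result))  # -> 2 (one replaced, one appended)
```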

There are a few limitations with the append_replace() method. One is that it cannot handle duplicate “Name” and “Year” combinations. For example, if two alumni named “John Smith” from the class of 2012 are present in the Masterfile under “ID”s 1 and 2, new rows may be appended to the Masterfile under the incorrect “ID”. However, in the current Management Engineering dataset, no two alumni share the same “Name” and “Year” combination. Furthermore, if an alumnus changes their “Name”, “Year”, “Company”, and/or “Position” on LinkedIn for an old data point in the Masterfile, a duplicate row will be appended to the Masterfile (since the primary key has changed).

Limitations

At the moment, the website can only handle small datasets that take less than 30 seconds to process. This limitation will be resolved soon.
