Hello, I’m Michael Reigelman | LinkedIn: @mtreigelman | Twitter: @MtReigelman
I am a Data Scientist (M.S.) and an Aerospace Engineer (B.S.) who has been in the workforce for 7+ years. This GitHub page describes various projects I have completed relating to data science, machine learning, data pipeline building, and data visualization. Please feel free to contact me for more information on any of these projects.
In addition to the work I've done professionally and in school, I am also interested in data visualizations, maps, ML and prediction modeling, projects related to coding and music, sports analytics, and autonomous control systems.
Tools used include Python, PySpark, scikit-learn, PyTorch, Plotly, Unix, SQL, and NoSQL (MongoDB).
- Predicted implicit ratings (whether a user is likely to like or buy a specific item) from four features, using data from this Booking.com Kaggle competition. Placed in the top 10 out of 55 teams.
- Created Python libraries that perform machine learning tasks on structured datasets. The libraries implement linear regression, logistic regression, decision tree, and random forest models as object-oriented Python modules (.py files) and allow for hyperparameter tuning.
- In this project I set up a data pipeline moving data from AWS S3 to a MongoDB distributed database. Machine learning analysis was coded in Databricks to explore how depression was related to COVID-19 cases and tweet sentiments on various topics. (This pipeline has been archived.)
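As an illustration of the object-oriented model libraries mentioned above, here is a minimal sketch of what a from-scratch linear regression class with tunable hyperparameters might look like. It is not the code from the project: the class name `LinearRegressionGD` and the `learning_rate` and `n_iters` parameters are assumptions made for this example, and it uses plain batch gradient descent on mean squared error.

```python
class LinearRegressionGD:
    """Hypothetical linear regression fitted with batch gradient descent."""

    def __init__(self, learning_rate=0.01, n_iters=1000):
        # Hyperparameters (assumed names) that a user could tune.
        self.learning_rate = learning_rate
        self.n_iters = n_iters
        self.weights = []
        self.bias = 0.0

    def _predict_row(self, row):
        return sum(w * x for w, x in zip(self.weights, row)) + self.bias

    def fit(self, X, y):
        n_samples, n_features = len(X), len(X[0])
        self.weights = [0.0] * n_features
        self.bias = 0.0
        for _ in range(self.n_iters):
            # Residuals are computed once per iteration so that all
            # parameters are updated from the same predictions.
            errors = [self._predict_row(row) - t for row, t in zip(X, y)]
            for j in range(n_features):
                grad = 2 * sum(e * row[j] for e, row in zip(errors, X)) / n_samples
                self.weights[j] -= self.learning_rate * grad
            self.bias -= self.learning_rate * 2 * sum(errors) / n_samples
        return self

    def predict(self, X):
        return [self._predict_row(row) for row in X]


# Toy data generated from y = 2x + 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
model = LinearRegressionGD(learning_rate=0.05, n_iters=2000).fit(X, y)
print(model.weights, model.bias)
```

Keeping the hyperparameters as constructor arguments is one simple way to support tuning: a search loop can build several instances with different settings and compare their errors.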
A detailed notebook discussing topics related to K-Means clustering including:
- What K-Means is, and how it works
- Practical Applications and Examples
- Inadequacies of K-Means
- Spectral Clustering
- Real-World Applications of K-Means
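For readers who want a quick picture of how K-Means works before opening the notebook, the following is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. It is an illustration only, not code taken from the notebook; the function names, the `seed` parameter, and the toy points are assumptions for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iters=100, seed=0):
    """Cluster `points` into `k` groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Two well-separated toy blobs.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))
print(labels)
```

Because the result depends on the random starting centroids, practical implementations usually run several initializations and keep the best one; this sketch uses a single seeded start for simplicity.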
A detailed notebook discussing topics related to Feature Importance using data from the National Football League. Topics include:
- Analysis techniques
- Algorithm explanation
- Strategies for using feature importance to improve machine learning models
- Examples and visualizations
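The notebook's specific techniques are not listed here, so as a generic illustration, this is a minimal sketch of one common approach, permutation importance: shuffle one feature column at a time and measure how much the model's error grows. The `predict` function, the toy data, and all names below are hypothetical and are not taken from the notebook or the NFL dataset.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, predict(X_perm)) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def predict(X):
    """Hypothetical fitted model that depends only on feature 0."""
    return [3.0 * row[0] for row in X]


# Toy data: feature 0 drives the target, feature 1 is unrelated.
X = [[float(i), float((i * 7) % 5)] for i in range(8)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(predict, X, y)
print(importances)
```

A feature whose shuffling leaves the error unchanged contributes nothing to the model's predictions, which makes it a candidate for removal when simplifying a model.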
- Created a series of Python scripts to parse .txt files and output a .html, .csv, .json, or .xml file, depending on the request. A second set of scripts was created to reverse these conversions when required. The program can be executed from the command line.
- From a collection of around 17GB of .txt files, Python scripts were created to perform linear searches, index/dictionary searches, and hash table searches for words or phrases contained in the documents. The results were then written to a local .html file that the user could view and navigate in their browser of choice. This allowed the user to see the results and open the desired file to read its contents, all within the browser. Jinja2 was used for the HTML formatting.
- From a large Reuters article database, a TF-IDF tool was created in Python to parse and score articles stored in .xml format. Evaluation code provided by the grader measured the tool’s accuracy; speed requirements were also enforced and met.
- Using an AWS server, Python scripts ran a temporary website that linked to articles (.txt files saved on the AWS machine) and recommended similar articles to the user. The user could navigate to an article, read its contents, and be shown 5 articles similar to their selection. Jinja2 was used for HTML formatting. The Stanford GloVe word vectors were used to compute article similarity scores for the recommendations.
- Set up an AWS server to run a temporary website where the user could view the last 100 tweets from any Twitter profile with a color-coded sentiment score: red for more negative tweets, green for more positive ones. The user could also view the top 100 followers of any profile, ranked by their own follower counts. These Python scripts used the vaderSentiment library to compute sentiment scores from the words in each individual tweet. The Twitter API was used to pull the tweet content, profile information, followers, and followers’ follower counts. Jinja2 was used for HTML formatting.
- Using movie review data, a Python script trained a scikit-learn Naive Bayes classifier to predict whether a review was positive or negative. K-fold cross-validation was used when training this model.
- Created various Python functions to edit .png and .jpg files. Edits include blurring and sharpening photos, mirroring/flipping images, adding and removing noise, and highlighting edges. These functions are housed in a Jupyter Notebook and use the PIL Image library.
- Using the GloVe word vector dataset, a command-line tool was created in Python. The tool returns the five words closest, in the 300-dimensional vector space, to a word input by the user. The .txt file was converted to NumPy format for speed improvements.
- From a collection of 1.5GB of emails, converted (to Feather format), cleaned, and analyzed the messages to see if any suspicious trends existed in executive employee communications prior to the Enron bankruptcy filing in 2001. The data was cleaned, formatted, and condensed in a Python script, while visualizations were generated in a Jupyter Notebook using the Matplotlib and Skilling libraries. Visualizations include heat maps, bar charts, spring layouts, and Kamada-Kawai layouts.
- Created a Python class to perform hash table management (implementing `__iter__()`, `__setitem__()`, `__getitem__()`, `__contains__()`, and other methods). This class reused code from a word search project to create an object-oriented tool. The previous code also used hash tables but did not take an object-oriented approach.
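To show what a hash table class built on those special methods can look like, here is a minimal sketch using separate chaining (a list of key/value pairs per bucket). It is not the project's code: the class name `HashTable`, the `n_buckets` parameter, and the fixed bucket count are assumptions made for this example.

```python
class HashTable:
    """Hypothetical hash table using separate chaining, with no resizing."""

    def __init__(self, n_buckets=8):
        self._buckets = [[] for _ in range(n_buckets)]
        self._size = 0

    def _bucket(self, key):
        # Map the key's hash onto one of the fixed buckets.
        return self._buckets[hash(key) % len(self._buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self._size += 1

    def __getitem__(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def __contains__(self, key):
        return any(k == key for k, _ in self._bucket(key))

    def __iter__(self):
        # Iterate over keys, bucket by bucket.
        for bucket in self._buckets:
            for k, _ in bucket:
                yield k

    def __len__(self):
        return self._size


table = HashTable()
table["alpha"] = 1
table["beta"] = 2
table["alpha"] = 3           # replaces the earlier value
print(table["alpha"])        # prints 3
print("beta" in table)       # prints True
print(len(table))            # prints 2
```

Defining these methods lets the object be used with the same syntax as a built-in `dict` (`table[key]`, `key in table`, `for key in table`), which is the main convenience of wrapping the hashing logic in a class.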