- Overview
- Web Scraping from Simply Hired
- Data Normalization and Preprocessing
- Model Training
- Resume Enhancement
- Data Analysis and Visualization
- Word Cloud and Bar Chart Analysis
- Integrating Hugging Face Models
- Results and Insights
- How to Run the Project
- Future Work
- Conclusion
- Collaborators
This project aims to develop a machine learning model and a comprehensive analysis system for enhancing resumes based on job titles and job market trends. The project workflow includes web scraping job data, data normalization, clustering analysis, resume enhancement using advanced machine learning models, and visualization of results. The ultimate goal is to assist candidates by tailoring their resumes to align with market demands.
- Web Scraping and Data Collection
- Data Normalization and Preprocessing
- Model Training and Resume Enhancement
- Data Analysis and Visualization
- Integration with Hugging Face Models
- Cluster Analysis and Insights
The first step of the project involved scraping job postings from Simply Hired using Selenium for dynamic page interactions. Selenium was chosen due to its ability to handle JavaScript-rendered content and interact with page elements. We targeted job postings specifically filtered for positions located in California and included six job roles:
- Data Engineer jobs in California.
- Software Engineer jobs in California.
- Data Scientist jobs in California.
- Data Analyst jobs in California.
- Business Systems Analyst jobs in California.
- Software Developer jobs in California.
- Selenium for browser automation and extraction of the rendered HTML.
- pandas for organizing the data into a structured format.
The scraped data was structured into a CSV file with the following headers:
- Job Title: The title of the job position (e.g., Data Scientist, Software Engineer).
- Company Name: The name of the company offering the job.
- Job Location: The location of the job.
- Estimated Salary: The estimated salary range for the position.
- Job Details: The detailed job description, including responsibilities and required qualifications.
- Job Href: The hyperlink to the job posting on Simply Hired.
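Because the listings are rendered with JavaScript, the scraper drives a real browser. A minimal Selenium + pandas sketch of this collection step is shown below; the search URL pattern and CSS selectors are illustrative assumptions, not the exact values used in the project:

```python
# Minimal Selenium + pandas scraping sketch (URL pattern and selectors are assumptions).
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
rows = []
for title in ["Data Engineer", "Software Engineer", "Data Scientist"]:
    # Hypothetical search URL; the real query parameters may differ.
    driver.get(f"https://www.simplyhired.com/search?q={title.replace(' ', '+')}&l=California")
    # Wait for the JavaScript-rendered job cards to appear before reading them.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-testid='searchSerpJob']"))
    )
    for card in driver.find_elements(By.CSS_SELECTOR, "[data-testid='searchSerpJob']"):
        link = card.find_element(By.TAG_NAME, "a")
        rows.append({
            "Job Title": link.text,
            "Job Href": link.get_attribute("href"),
            # Company, location, salary, and details are captured the same way
            # from their respective elements on the card / detail page.
        })

driver.quit()
pd.DataFrame(rows).to_csv("simplyhired_jobs.csv", index=False)
```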
For our machine learning model, we decided to use only the Job Title and Job Details columns. The other fields, such as Company Name, Job Location, Estimated Salary, and Job Href, were excluded because they were not essential for the purpose of resume enhancement. The goal was to use the job title and detailed description to enhance resumes to meet current job needs, making them more relevant for specific roles in the job market.
- Ensuring compliance with the site's terms of service and maintaining ethical data-collection practices.
- Handling dynamic content loading with Selenium and avoiding anti-bot detection mechanisms.
- A CSV file containing the focused job data, with the following columns: Job_Title and Job_Details.
After gathering the raw job data, the next step was to clean and normalize it for further analysis. This included removing HTML tags, special characters, and redundant information.
- Tokenization and Text Cleaning:
  - Used nltk and spaCy for initial text tokenization and entity extraction (see the sketch after this list).
  - Removed stop words, punctuation, and common phrases.
- Skill Extraction:
  - Extracted skill keywords from the job descriptions.
  - Mapped each job title to a set of core and secondary skills.
- Text Embedding and Vectorization:
  - Implemented BERT-based embeddings using Hugging Face's Transformers library.
  - Preprocessed text using the meta-llama/Meta-Llama-3.1-8B-Instruct model for further NLP tasks.
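A short sketch of the cleaning and embedding steps described above is shown below; the skill vocabulary, the en_core_web_sm spaCy model, and the bert-base-uncased checkpoint are illustrative assumptions:

```python
# Illustrative cleaning + embedding sketch (skill list and BERT checkpoint are assumptions).
import re
import spacy
import torch
from transformers import AutoTokenizer, AutoModel

nlp = spacy.load("en_core_web_sm")
SKILLS = {"python", "sql", "machine learning", "aws", "spark"}  # hypothetical skill vocabulary

def clean_text(raw_html: str) -> str:
    """Strip HTML tags and special characters, lowercase, and drop stop words."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text).lower()
    doc = nlp(text)
    return " ".join(tok.text for tok in doc if not tok.is_stop and not tok.is_punct)

def extract_skills(text: str) -> set:
    """Keep only the skill keywords that appear in the cleaned description."""
    return {skill for skill in SKILLS if skill in text}

# BERT-based sentence embedding via mean pooling over the last hidden state.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return bert(**inputs).last_hidden_state.mean(dim=1).squeeze(0)
```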
- NLTK and spaCy Integration Issues:
  - Initial attempts to combine nltk and spaCy with Hugging Face models faced incompatibility issues due to different tokenization schemes.
  - Resolved by switching entirely to Hugging Face's tokenizers for consistent embeddings.
- Custom TensorFlow Model:
  - We created a custom deep learning model using Keras and trained it on the cleaned job descriptions and skills.
  - The model (resume_generator_model.h5) was trained over 100 epochs and achieved an accuracy of 99.79% on the training data and a validation accuracy of 97.15%.
- Hugging Face's Meta-Llama-3.1-8B-Instruct Model:
  - Fine-tuned this transformer model to handle contextual queries for enhancing resumes.
  - The model was accessed using a Pro API key and integrated into our pipeline using the InferenceClient from the huggingface_hub library.
- Data Input: Combined data from Job_Title and Job_Details.
- Model Architecture: A multi-layer perceptron (MLP) model with three dense layers (see the sketch after this list).
- Loss Function: Categorical Cross-Entropy.
- Optimizer: Adam.
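A minimal sketch of this training setup is shown below; it assumes the job title is the classification target and the description text is the TF-IDF input, and the layer widths, feature count, and file names are illustrative:

```python
# Minimal Keras MLP training sketch (layer widths, feature count, and file names are assumptions).
import pandas as pd
import tensorflow as tf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("jobs_cleaned.csv")  # hypothetical cleaned CSV with Job_Title and Job_Details

# TF-IDF features from the job descriptions; job titles as one-hot labels.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["Job_Details"]).toarray()
label_encoder = LabelEncoder()
y = tf.keras.utils.to_categorical(label_encoder.fit_transform(df["Job_Title"]))

# Three dense layers, categorical cross-entropy, and the Adam optimizer, as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", input_shape=(X.shape[1],)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(y.shape[1], activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=100, validation_split=0.2)
model.save("resume_generator_model.h5")
```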
The core feature of this project was to create a system that takes an existing resume and optimizes it based on a specified job title. The enhancement process is as follows:
- Input: Upload a resume (PDF or text) and specify the desired job title.
- Text Extraction:
  - Used PyMuPDF (fitz) to extract text from PDF files (see the extraction sketch after this list).
  - Applied BERT-based embeddings to parse the text.
- Keyword Mapping:
  - Used the resume_generator_model.h5 model to identify gaps in the resume based on the job title's requirements.
  - Applied additional enhancements using Meta-Llama-3.1-8B-Instruct to generate text aligned with the target job title.
- Output:
  - A dynamically enhanced resume, formatted using a pre-defined HTML template located in the template folder.
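To make the extraction step concrete, here is a minimal PyMuPDF sketch; the helper name extract_resume_text is illustrative rather than the project's exact function:

```python
# PDF text extraction with PyMuPDF; extract_resume_text is an illustrative helper name.
import fitz  # PyMuPDF

def extract_resume_text(pdf_path: str) -> str:
    """Concatenate the plain text of every page in the uploaded resume."""
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

The Gradio demo below wires the trained job-title classifier into a simple web interface: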
```python
import gradio as gr
import tensorflow as tf

# Load the trained classifier; vectorizer and label_encoder are assumed to be the
# fitted objects from the training step above.
model = tf.keras.models.load_model('resume_generator_model.h5')

def enhance_resume(resume_text, job_title):
    # Vectorize the resume text with the same TF-IDF vectorizer used in training.
    resume_vector = vectorizer.transform([resume_text]).toarray()
    # Predict the best-matching job title from the resume content.
    predicted_class = model.predict(resume_vector)
    predicted_title = label_encoder.inverse_transform([predicted_class.argmax()])
    return predicted_title[0]

# Create a Gradio interface for the resume enhancement tool.
iface = gr.Interface(fn=enhance_resume, inputs=["text", "text"], outputs="text",
                     title="Resume Enhancement Tool")
iface.launch()
```
- K-Means Clustering: Grouped job postings into 5 clusters based on skills and job descriptions.
- PCA (Principal Component Analysis): Reduced dimensionality for visualization.
- Elbow Curve Analysis: Determined the optimal number of clusters.
- Silhouette Scores: Evaluated cluster compactness and separation.
```python
# Perform K-Means clustering on the TF-IDF vectorized data.
# X is the TF-IDF matrix produced earlier in the pipeline.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

k_values = range(2, 10)
silhouette_scores = []

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot the silhouette scores for each value of k.
plt.plot(k_values, silhouette_scores)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores for K-Means Clustering')
plt.show()
```
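PCA is mentioned above for visualization; a minimal sketch that projects the same TF-IDF matrix X (assumed dense here) to two dimensions and colors the points by their K-Means cluster:

```python
# Project the TF-IDF vectors to 2-D with PCA and color points by K-Means cluster.
# X is assumed to be a dense array (e.g. the result of .toarray() on the TF-IDF matrix).
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

labels = KMeans(n_clusters=5, random_state=42).fit_predict(X)
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='tab10', s=10)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Job Posting Clusters (PCA Projection)')
plt.show()
```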
To visualize the distribution of job titles and the most common skills, we generated a word cloud and bar charts.
The word cloud represents the frequency of skills extracted from job descriptions, highlighting the most sought-after skills in the job market.
The bar chart provides a comparative view of the most common job titles, showcasing the top job roles and their frequencies in the scraped dataset.
```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# skills_list is the flat list of skill keywords extracted from the job descriptions.
# Generate the word cloud from job skills.
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(skills_list))

# Plot the word cloud.
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Job Skills")
plt.show()
```
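The bar chart of the most common job titles can be produced directly from the scraped postings; a minimal sketch, assuming the data is in a DataFrame df with a Job_Title column:

```python
# Bar chart of the most frequent job titles in the scraped dataset.
import matplotlib.pyplot as plt

title_counts = df['Job_Title'].value_counts().head(10)
title_counts.plot(kind='bar', figsize=(10, 5))
plt.xlabel('Job Title')
plt.ylabel('Number of Postings')
plt.title('Most Common Job Titles in the Scraped Dataset')
plt.tight_layout()
plt.show()
```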
We successfully integrated Hugging Face models into our project pipeline to enhance the NLP capabilities of the resume enhancement tool. By combining the resume_generator_model.h5 classifier with the Meta-Llama-3.1-8B-Instruct model, we ensured a robust approach to understanding context and generating relevant resume content.
- Model Loading: Utilized the Hugging Face InferenceClient to load the transformer model.
- Text Generation: Developed an API endpoint to handle resume enhancement requests, combining outputs from both models.
```python
from transformers import pipeline

# Load the Hugging Face model locally as a text-generation pipeline.
text_generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")

def generate_text(prompt):
    return text_generator(prompt, max_length=100)[0]['generated_text']

# Example usage: resume_text and job_title come from the upload step above.
resume_enhanced = generate_text(f"Enhance the following resume for the job title: {job_title}\n\n{resume_text}")
```
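Since the model was also accessed with a Pro API key, the hosted route via the huggingface_hub InferenceClient is sketched below as well; the token placeholder and generation parameters are illustrative:

```python
# Hosted access via the Hugging Face Inference API (token and parameters are illustrative).
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3.1-8B-Instruct", token="hf_your_pro_api_key")

def generate_text_remote(prompt: str) -> str:
    # text_generation sends the prompt to the hosted model instead of loading it locally.
    return client.text_generation(prompt, max_new_tokens=300)

# resume_text and job_title come from the upload step above.
enhanced = generate_text_remote(
    f"Enhance the following resume for the job title: {job_title}\n\n{resume_text}"
)
```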
The model successfully enhances resumes based on the specified job title, generating relevant content that aligns with job market demands. The analysis revealed key insights into the skill requirements for various roles, and the visualizations provided a clear overview of the job landscape in California.
- The demand for data-related roles is on the rise, with data science and engineering positions being the most frequently posted.
- Key skills for these roles include Python, SQL, and machine learning.
- Clone the repository:

```bash
git clone https://github.com/omidk414/ResumeEnhancer.git
cd ResumeEnhancer
```

- Install the necessary dependencies from requirements.txt:

```bash
pip install -r requirements.txt
```
Future improvements include incorporating real-time labor market trends and expanding the enhancement system to cover a broader range of industries and job titles.
Our objective was to develop a model that could accurately predict job salaries, with a target R² score close to 0.80. We experimented with various feature engineering techniques, including text vectorization, polynomial features, and hyperparameter tuning on models such as Random Forest and Gradient Boosting Regressors.
Despite our extensive efforts, we were unable to achieve the desired R² score. The highest R² we were able to attain was around 0.38. We have thoroughly explored the data and implemented multiple model enhancements but encountered limitations in predictive accuracy. At this point, further improvements are proving challenging given our current resources and time constraints.
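For reference, a representative sketch of the kind of salary-regression experiment described above, assuming the estimated salary strings have already been parsed into a numeric column (file and column names are hypothetical):

```python
# Representative salary-regression experiment (file, column names, and hyperparameters are illustrative).
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("jobs_with_salary.csv")  # hypothetical file with a parsed numeric 'salary' column

X = TfidfVectorizer(max_features=2000).fit_transform(df["Job_Details"]).toarray()
y = df["salary"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

reg = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, random_state=42)
reg.fit(X_train, y_train)

print("R^2 on held-out data:", r2_score(y_test, reg.predict(X_test)))
```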
We are prepared to submit the project with these findings and accept the final results. We appreciate the learning opportunity this project has provided and welcome any feedback on potential further improvements.
This project successfully demonstrates the capabilities of machine learning and NLP in enhancing resumes. By utilizing modern scraping techniques, advanced model training, and integration with powerful transformer models, we have developed a tool that can help job seekers better align their qualifications with market needs.
- Evan Wall
- Thay Chansy
- Omid Khan