This repository contains programs to extract relevant fields from resumes and optionally build a knowledge graph (KG) for effective querying using a chatbot.
Detailed specifications are here and here
- Extract important fields like name, email, phone, etc. from resumes based on patterns specified in a
config.xml
file.
parser.py
takes the config file and a directory containing text resumes as arguments.- The config file specifies the fields to extract and their respective patterns.
- The parser logic is independent of the domain (resumes in this case), and changes or additions are made in the config file.
- Field extraction methods and patterns are defined in the config file.
- Prepare your own
config.xml
file similar to the provided one. - For command-line execution, run
python parser_by_regex.py
. - For a GUI, run
python main.py
.
- Uses an open-source model from Hugging Face as a Large Language Model (LLM) and a prompt for extraction.
- Run
python parser_by_llm.py
.
- File to be created:
parser_by_spacy.py
- Use spaCy-based Named Entity Recognition (NER) models for extraction.
- Build custom NER models if necessary. Data from
rijaraju repo
is atdata\rijaraju_repo_resume_ner_training_data.json
- Add spaCy Matcher logic if needed.
- Output is json of key-value pairs, where Key is NER type and value is specific to the resume-person.
- Also extract relationships values, it
Education
as key and value as sayCoEP
, its date range etc.
- File to be created:
build_kg.py
- Build a Knowledge Graph (KG) based on the extractions.
- Nodes can represent entities like Person, Organizations, Skills, etc., and edges can represent relationships like "educated_in," "programs_in," etc.
- Central person-node can have person specific attributes, but other nodes like
Autodesk
orCoEP
should not have, as other resume-person may also refer them. Resume-person specific attributes should be on edge fromYogesh
toCoEP
like date range, branch etc. - Nodes like
Python
,NLP
will be common and can come from different company nodes, likeIcertis
,Intuit
etc. - Schema design is critical as it decides which extractions can be NODES, EDGES and attributes on them.
- Follow standard schema like schema.org or DBpedia for resume extraction.
- Represent the KG initially in networkx format and later in Neo4j.
- Build a Streamlit app to upload resumes and visualize the KG or use Neo4j.
- File to be created:
resume_chatbot.py
- Use query languages like SPARQL or Cypher, depending on the KG's residence.
- Leverage LLMs to convert natural language English queries into SPARQL or Cypher.
- Build a Streamlit chatbot for querying the KG. See if you can visualize the built KG.
- Deploy the chatbot on Streamlit-Shares for limited (e.g., 5 resumes) public access.
- Build an end-to-end system with payment integration as a pay-per-use MicroSaaS.
- Consider deploying on cloud platforms like VertexAI or Azure.
- The author ([email protected]) provides no guarantee for the program's results. It is a fun script with room for improvement. Do not depend on it entirely.
Copyright (C) 2017 Yogesh H Kulkarni