This repository contains the files for the Portfolio task assigned in COMP2200/6200 S1 2023, the Macquarie University individual assessment.
The dataset (Yelp_Portfolio1_Input.csv) can be downloaded from https://github.com/COMP2200-S1-2023/portfolio-part-1-dataset/releases/download/portfolio-dataset-p1/Yelp_Portfolio1_Input.csv
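Once downloaded, the file can be read into a pandas DataFrame. The sketch below uses a tiny in-memory sample with made-up column names (the real file's columns will differ); the actual dataset loads the same way by passing its URL or local path to `pd.read_csv`:

```python
import io

import pandas as pd

# Hypothetical sample standing in for Yelp_Portfolio1_Input.csv;
# the real dataset loads the same way: pd.read_csv("<url or path>")
sample = io.StringIO(
    "review_id,stars,text\n"
    "r1,5,Great food\n"
    "r2,2,Slow service\n"
)
df = pd.read_csv(sample)
print(df.shape)  # (2, 3)
```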
In this unit, I learnt to use Jupyter Notebook for Python code execution. The first step was installing the latest version of Python (3.9.10) and the Anaconda packages. From the Anaconda prompt, Jupyter Notebook can be opened using the "jupyter notebook" command. Then I created a GitHub account and installed GitHub Desktop to access my GitHub repositories from my local machine.
Working through these parts, and the different processes used to answer questions on various datasets, taught me that problem solving in data science means applying a systematic, analytical approach to complex problems using data-driven methods. It involves several stages and techniques to extract insights and make informed decisions. Here is a brief description of the problem-solving process in data science:
Problem Definition: A real-world problem is identified and framed as a question that available data can answer.
Data Collection: Relevant data can be gathered from various sources, such as databases, APIs, files, or surveys. Data quality must be ensured by addressing missing values, outliers, and inconsistencies.
Data Exploration and Preprocessing: The data is explored to gain insights and understand its characteristics. Missing values are handled, features are selected or extracted, and necessary transformations are applied. This step prepares the data for analysis.
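The exploration step can be sketched with pandas; the frame below is toy data, not the actual dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data.
df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                   "hours": [38, 45, 50, 40]})

print(df.describe())    # per-column summary statistics
print(df.isna().sum())  # missing-value count per column
```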
Model Selection and Training: An appropriate model or algorithm is chosen based on the problem type, the available data, and the desired outcome. The data is then split into training and testing sets to evaluate the model's performance. The model is trained on the training data and its parameters are fine-tuned.
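The split-and-train step can be sketched with scikit-learn on synthetic data (a minimal illustration, not the portfolio's actual model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: y depends linearly on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out set
```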
Model Evaluation: The model's performance is assessed using evaluation metrics suited to the problem, such as accuracy, precision, recall, or F1-score. The model is evaluated on the testing set to estimate its generalization capability.
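For a classification problem, these metrics are computed as follows; the labels and predictions here are hypothetical:

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical true labels and model predictions for a binary task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # 0.75
print(precision_score(y_true, y_pred))  # 0.75
print(recall_score(y_true, y_pred))     # 0.75
print(f1_score(y_true, y_pred))         # 0.75
```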
Model Optimization: The model's performance is improved by adjusting hyperparameters, employing regularization techniques, or using more advanced algorithms. The model can be optimized based on the evaluation results.
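Hyperparameter adjustment is commonly done with a cross-validated grid search; a minimal sketch on synthetic data, tuning the regularization strength of a ridge model:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data with a clear linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=60)

# Cross-validated search over the regularization strength alpha.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)
```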
Interpretation and Communication: The model's results are interpreted to extract meaningful insights. The findings are communicated using visualizations, reports, or presentations, and recommendations or decisions are made based on the analysis.
Throughout the problem-solving journey, data scientists rely on their expertise in statistical analysis, machine learning, programming, and domain knowledge to derive meaningful insights and provide data-driven solutions.
In this learning journey, I chose the Mental Health in Tech Survey dataset from Kaggle for the fourth part of the portfolio. This dataset contains survey responses from individuals in the tech industry about their mental health status and the support they receive from their employers. It includes demographic information as well as responses to questions about mental health conditions, treatment, and attitudes towards seeking help. There are various reasons for choosing this dataset:
Richness of data: The dataset provides a comprehensive set of survey responses related to mental health in the tech industry. It includes information on demographic factors, mental health conditions, treatment options, and attitudes towards mental health. This richness of data allows for in-depth analysis and exploration of various factors related to mental health.
Large sample size: The dataset consists of responses from a large number of individuals working in the tech industry. A larger sample size increases the statistical power of the analysis and allows for more robust findings and insights. It also enables the exploration of patterns across different subgroups within the tech industry.
Varied variables: The dataset includes a wide range of variables, such as employment details, company size, benefits, and various aspects of mental health. This variety enables the investigation of correlations between different factors and the identification of potential predictors or drivers of mental health outcomes.
Real-world relevance: Mental health in the tech industry is a significant and increasingly important topic. By working with this dataset, you can gain insights into the mental health challenges faced by tech professionals, understand the support systems in place, and potentially contribute to addressing mental health concerns in the workplace.
Potential for actionable insights: Analyzing this dataset can help identify patterns and correlations that can be used to inform policies, interventions, and support systems within the tech industry. By uncovering the factors that impact mental health, organizations can make informed decisions to create healthier work environments and support their employees' well-being.
Overall, the Mental Health in Tech Survey dataset provides a robust and comprehensive foundation for conducting data science projects related to mental health in the tech industry, offering valuable insights and potential for actionable outcomes.
Several processes are involved in this piece of work. A few are mentioned below:
Exploring data statistics: The distributions of the data are explored. The interquartile range (IQR) is calculated and outlier analysis is performed as part of preprocessing. A boxplot is used to visualize the data without outliers.
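The IQR-based outlier step could look like this on a toy series (illustrative values, not from the survey):

```python
import pandas as pd

# Toy series with one obvious outlier (95).
s = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = s[(s >= lower) & (s <= upper)]
print(clean.tolist())  # the outlier 95 is removed
# clean.plot.box() would then draw the boxplot without the outlier
```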
Data Pre-processing: Handling missing values and converting non-numerical values into numerical values are crucial parts of preprocessing the data.
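A minimal sketch of both preprocessing steps with pandas; the column names here are hypothetical, not taken from the survey itself:

```python
import numpy as np
import pandas as pd

# Toy frame; column names are hypothetical, not from the actual survey.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "treatment": ["Yes", "No", "Yes", "No"],
})

df["age"] = df["age"].fillna(df["age"].median())            # impute missing age
df["treatment"] = df["treatment"].map({"No": 0, "Yes": 1})  # encode Yes/No as 1/0
print(df["age"].tolist())  # [25.0, 31.0, 40.0, 31.0]
```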
Choosing a model: Considering the behaviour of several factors, a linear regression model was chosen for predicting the target variable. Because of significant correlation among the variables, a linear relationship holds and the linear regression model performed accurately.
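The reasoning above, checking correlation before fitting a linear model, can be sketched on synthetic data (not the survey variables themselves):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, strongly correlated pair of variables.
rng = np.random.default_rng(1)
x = rng.normal(size=80)
y = 2.0 * x + rng.normal(scale=0.2, size=80)
df = pd.DataFrame({"x": x, "y": y})

print(df.corr().loc["x", "y"])  # Pearson correlation, close to 1

# A high correlation supports fitting a linear model.
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_[0])  # fitted slope, close to 2
```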
Accuracy measures: Recall, precision, F1-score, and accuracy metrics are used to measure the performance of the model. A bar chart is used to visualize the results.
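The bar-chart visualization of the metrics can be sketched with matplotlib; the metric values below are hypothetical placeholders, not results from the actual model:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical metric values, not results from the actual model.
scores = {"accuracy": 0.80, "precision": 0.75, "recall": 0.70, "F1": 0.72}

fig, ax = plt.subplots()
ax.bar(scores.keys(), scores.values())
ax.set_ylabel("score")
ax.set_title("Model performance metrics")
fig.savefig("metrics.png")
```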
Future Prospects
In future, I would like to analyze the model's performance on unseen data and build a dashboard. Using the identified patterns, practical machine learning solutions could be implemented in the tech industry.