Data science is not about:
- Using the latest tools
- Plotting the best graphs
- Building the best ML model
- Having the fancy title "Data Scientist"
Data science is about:
- Understanding the business problem
- Being curious to understand the data
- Getting insights from data to solve the problem
- Convincing stakeholders to take action
Remember: you're a problem solver. Get the [foundation](https://towardsdatascience.com/problem-solving-as-data-scientist-a-case-study-49296d8cd7b7) right. 💪💪
Data splitting is the process of dividing the original dataset into separate subsets, typically training, validation, and test sets.
Why data splitting? It's important to split the data before training, as it ensures the model can generalize well to new, unseen data.
Each of the three datasets serves a distinct purpose:
👉 Training set: The training set is used to train the model. The model learns patterns and relationships between input features and target labels.
👉 Validation set: The validation set is used for model selection and hyperparameter tuning. By evaluating the model's performance on the validation set, you can detect overfitting and adjust the model's hyperparameters.
👉 Test set: The test set assesses the model's performance on completely unseen data. This gives an unbiased estimate of the model's generalization ability.
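As a minimal sketch of the three-way split above (using scikit-learn's `train_test_split` twice; the 60/20/20 ratio and the toy data are my own illustrative choices, not prescribed by the post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 100 samples, 4 features (illustrative numbers only).
X = np.arange(400).reshape(100, 4)
y = np.arange(100)

# First split off the test set (20% of the data), then carve a
# validation set (25% of the remainder = 20% of the original)
# out of what is left, giving a 60/20/20 split.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The model is then fit on `X_train`, tuned against `X_val`, and evaluated exactly once on `X_test`.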
However, there are still a few common misconceptions related to data splitting, including:
🚨Using a single dataset for both training and evaluation will provide accurate performance estimates.
Truth: This approach leads to overfitting, as the model learns to perform well on the specific dataset but may fail to generalize to new, unseen data.
🚨 The larger the training set, the better the model.
Truth: While having more training data generally improves model performance, it's essential to maintain a balance between training, validation, and test sets. Allocating significant data to the validation and test sets ensures robust performance evaluation and prevents overfitting.
🚨 Data splitting is not necessary for small datasets.
Truth: Even with small datasets, splitting data is crucial to prevent overfitting and assess model performance accurately. In such cases, cross-validation can be employed to make the most of the limited data.
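A quick sketch of cross-validation on a small dataset (the iris dataset and logistic regression are my own example choices): `cross_val_score` with 5 folds lets every sample serve in both training and evaluation, just never in the same fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # only 150 samples

# 5-fold cross-validation: the data is split into 5 folds, and the
# model is trained on 4 folds and scored on the held-out fold,
# rotating until every fold has been the evaluation set once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores.mean())  # average accuracy across the 5 folds
```

Averaging across folds gives a more stable performance estimate than any single split of such a small dataset could.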
🚨 Randomly splitting data always guarantees a good model evaluation.
Truth: Random splitting is a common approach, but it may not be suitable for all cases. For time-series data, maintaining the temporal order is essential. In cases with class imbalance, stratified sampling should be used to ensure the proportion of classes is maintained in each subset.
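Both points above can be sketched in a few lines of scikit-learn (the 90/10 class imbalance is a made-up toy example): `stratify=` preserves class proportions, and `TimeSeriesSplit` keeps training data strictly earlier than test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit

# Imbalanced toy labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(100, 1)

# stratify=y keeps the 90/10 class ratio in both subsets, so the
# 20-sample test set contains exactly 2 positives (10%).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_te.sum())  # 2

# For time-series data, TimeSeriesSplit yields folds where every
# training index comes strictly before every test index.
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()
```

A plain random split would scramble the temporal order and could, by chance, leave the minority class out of a subset entirely.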
Remember that data splitting is essential for building a reliable and robust machine learning model. Choose the splitting technique, whether a simple random split, stratified sampling, time-ordered splits, or cross-validation, that fits your data and your problem.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.