A Short Course in Chemometrics

Objectives

🎯 The goal of this short course is to introduce and explain elementary chemometric analysis methods. We will also touch on more advanced machine learning (ML) approaches. The course covers Python-based tools that can accelerate your workflow and improve reproducibility. We assume no prior familiarity with these methods or tools, and no particular mathematical background. We will review only as much mathematics as is needed to ground an understanding of the methods discussed; a deep understanding is not necessary for application, which is the focus of this course.

🚀 What we hope to achieve:

- Give you a new set of tools to help you do your job better
- Create a coherent and more consistent approach to chemometric analysis by introducing you to a standard library for these tasks
- Improve reproducibility and transparency
- Create a community where ideas, needs, and methodologies can be exchanged

📚 In the end you will be able to go to a library of standardized example notebooks, select the one you need, enter your data, then run it from start to finish. This course will also teach you to modify and extend these notebooks as needed.

Outline

Introduction
- 📓 The Jupyter Notebook
  - The Basics
  - Google Colab
  - Managing Your Session
  - Installing Python Packages
  - Saving Code
- 🐍 The Python Language
  - Why Learn Python?
  - Before We Get Started
  - Variables
    - Built-in Data Types
    - Variable Assignment and Operators
    - Sequences: Lists, Dictionaries, and Tuples
    - Referencing
  - Logic
    - Comparison Operators
    - Logical Operators
    - If Else Statements
  - Loops
    - For Loops
    - While Loops
  - Numpy, Scipy, and Pandas
    - Numpy
    - Scipy
    - Pandas
  - Plotting with Matplotlib
  - Defining Functions
    - Documentation and Type Hints
    - Scope
    - Number and Order of Arguments
    - Default Values
  - Object Orientation and Classes
- 🔬 Chemometrics
  - The Authentication Problem
    - Some Motivating Examples
    - Class Models
    - A Machine Learning Perspective
  - $N \ll p$
  - Regression, Classification, and Clustering
  - scikit-learn
  - PyChemAuth
- 🔮 Statistics Background
  - $\chi^2$ statistics
  - Performance Metrics
  - Rashomon sets
  - Bias-Variance Tradeoff
✨ Techniques
- Exploratory Data Analysis (EDA)
  - Basic Suggestions
  - Jensen-Shannon Divergence
    - What is it?
    - Developing an Intuition
    - JSD Reveals Plausible Tree Stumps
    - Identifying Clusters
    - Binary vs OvA
    - Common Pitfalls
  - See also:
    - Interactive Trace Element Correlations
- Pipelines
- Evaluation Metrics
- Cross-Validation
🚦 Pre-processing
- Scaling and Centering
- Filtering
  - MSC
  - SNV and RNV
  - Savitzky-Golay
- Missing Values and Imputation
  - Limits of Detection (LOD)
  - Basic Imputation
  - Predictive Imputers
- Class Balancing
  - SMOTE
  - Edited Nearest Neighbors (ENN)
  - SMOTEENN
  - ScaledSMOTEENN
  - Imblearn pipelines
- Feature Selection
🔳 Conventional Chemometric Models
- 📈 Regression Models
  - Ordinary Least Squares (OLS)
    - Learn | sklearn API | Interactive Tool
  - Principal Components Analysis (PCA) and Regression (PCR)
    - Learn | API | Interactive PCA Tool, Interactive PCR Tool
  - Partial Least-Squares (PLS) or Projection to Latent Structures
    - Learn | API | Interactive Tool
- ✅ Classification and Authentication Models
  - Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)
    - Learn | sklearn API | Interactive Tool
  - Partial Least-Squares-Discriminant Analysis (PLS-DA)
    - Learn | API | Interactive Tool
  - Soft Independent Modeling of Class Analogies (SIMCA)
    - Learn | API | Interactive Tool
💻 Machine Learning Models
- 📈 Regression Models
  - Artificial Neural Networks
  - Explainable Boosting Machine
- ✅ Classification Models
  - 🌳 Decision Trees
    - Visualizing Decision Trees
    - Visualizing Decision Boundaries
    - Pros and Cons
  - 🎼 Ensemble Methods
    - Bagging
    - Boosting
  - 🌳🌳🌳 Random Forests
  - Logistic Regression
- Authentication Models
  - EllipticManifold
  - Out-of-Distribution / Novelty Detection
    - 🌳🙉🌳 Isolation Forest
    - Other Resources
  - Open Set Recognition
- AutoML
  - What is it?
  - Caveats
🔍 Comparison and Inspection
- Comparing Relative Performance of Pipelines
- 👀 Model-agnostic Inspection Methods
  - Permutation Feature Importance (PFI)
  - SHapley Additive exPlanations (SHAP)
    - Shapley Values (Theory)
    - Computing SHAP Values (Practice)
    - Margin Space Explanation
    - Best Practices
- Do I Need More Data?
💾 Saving and Sharing Models
📁 Case Studies

Next Steps:

❓ You can ask questions, provide feedback, and find community support on the GitHub Discussions page for this course.
✖️ If you find a mistake please submit a Bug Report.
🔭 If you would like us to cover new area(s) or have an idea to improve this course, please submit a Feature Request!
💡 If you have requests or ideas specific to PyChemAuth you can find similar options on its Issues page.
🤝 Please consider contributing to PyChemAuth examples!

Instructor(s):

Nate Mahynski, [email protected]

Thanks to 👏

Tom Allison, [email protected]
Bill Krekelberg, [email protected]
Dave Sheen, [email protected]

The logo was designed using Google Gemini (Imagen 3) with the prompt "Design a logo for determining geographic origin using chemistry and statistical models" on Nov. 8, 2024.

mahynski/chemometric-carpentry
