A machine learning library in which every algorithm is implemented from scratch, using only NumPy.

🧠 MLfromScratch

MLfromScratch is a library designed to help you learn and understand machine learning algorithms by building them from scratch using only NumPy! No black-box libraries, no hidden magic, just pure Python and math. It's perfect for beginners who want to see what's happening behind the scenes of popular machine learning models.

🔗 Explore the Documentation


📦 Package Structure

Our package structure is designed to look like scikit-learn, so if you're familiar with that, you'll feel right at home!

🔧 Modules and Algorithms (Explained for Beginners)

📈 1. Linear Models (linear_model)

  • LinearRegression (Linear Regression): Imagine drawing a straight line through a set of points and using it to predict new values. Linear Regression helps in predicting something like a house price based on its size.

  • SGDRegressor (SGD): Fits the same kind of linear model with Stochastic Gradient Descent, updating the weights from one sample (or a small batch) at a time instead of solving for them all at once. This scales well to large datasets.

  • SGDClassifier (Classifier): A linear classifier trained with Stochastic Gradient Descent that predicts categories like "spam" or "not spam."
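
To make the difference between the closed-form and the SGD approach concrete, here is a minimal NumPy sketch. It is not taken from this library: the function names `fit_least_squares` and `fit_sgd`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def fit_least_squares(X, y):
    """Closed-form fit: returns (weights, intercept)."""
    # A column of ones lets the intercept be learned as one more weight.
    X_aug = np.column_stack([X, np.ones(len(X))])
    # lstsq minimizes ||X_aug @ coef - y||^2 without an explicit matrix inverse.
    coef, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return coef[:-1], coef[-1]

def fit_sgd(X, y, lr=0.05, epochs=2000, seed=0):
    """Same model, fitted one sample at a time with stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # visit samples in random order
            error = X[i] @ w + b - y[i]     # prediction error on one sample
            w -= lr * error * X[i]          # gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error                 # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Toy data (made up for illustration): price = 3 * size + 2, no noise.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0] + 2.0

w_ls, b_ls = fit_least_squares(X, y)
w_sgd, b_sgd = fit_sgd(X, y)
```

On this noise-free data both fits should recover a slope close to 3 and an intercept close to 2. The SGD version never needs the whole dataset in one matrix operation, which is why it is attractive when the data is large.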

🌳 2. Decision Trees (tree)

  • DecisionTreeClassifier (Tree): Think of this as playing 20 questions to guess something. A decision tree asks a sequence of yes/no questions about the features to classify data.

  • DecisionTreeRegressor (Regressor): Uses the same question-asking structure to predict a continuous number (like tomorrow's temperature) based on input features.
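
How does a tree pick its next yes/no question? One common criterion is Gini impurity. The sketch below is a hedged illustration, not this library's code: `gini` and `best_split` are hypothetical helper names, and the library may use a different criterion or search strategy.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Return (threshold, weighted_impurity) for the best 'is x <= t?' question."""
    best_t, best_score = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        # Impurity of the two children, weighted by how many samples each gets.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data (made up): small values belong to class 0, large values to class 1.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)   # threshold 6.5 separates the classes
```

A full tree repeats this search over every feature, splits the data on the winner, and recurses on each side until the nodes are pure or a depth limit is reached.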

👥 3. K-Nearest Neighbors (neighbors)

  • KNeighborsClassifier (KNN): Classifies a new point by majority vote among the labels of its 'k' nearest training points.

  • KNeighborsRegressor (KNN): Instead of classifying, it predicts a number from the target values of the 'k' nearest training points, typically their average.
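
The neighbor vote fits in a few lines of NumPy. This is a sketch under assumptions (Euclidean distance, non-negative integer labels, a hypothetical `knn_predict` helper), not the library's own implementation.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    # Pairwise Euclidean distances, shape (n_new, n_train), via broadcasting.
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each new point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote among the neighbors' labels (non-negative ints assumed).
    return np.array([np.bincount(y_train[row]).argmax() for row in nearest])

# Toy data (made up): one group near the origin, one near (5, 5).
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5, 0.5], [5.5, 5.5]])
labels = knn_predict(X_train, y_train, X_new, k=3)   # [0, 1]
```

For regression, the last line of `knn_predict` would average the neighbors' target values instead of voting, for example `y_train[nearest].mean(axis=1)`.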

🧮 4. Naive Bayes (naive_bayes)

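  • GaussianNB (Gaussian): Works well for continuous features that roughly follow a normal distribution (bell-shaped curve) within each class.

  • MultinomialNB (Multinomial): Designed for count features such as word frequencies, which makes it ideal for text classification tasks like spam detection.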
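
As a rough illustration of the Gaussian variant, the sketch below estimates a mean and variance per class and feature, then picks the class with the highest log-probability. The names `fit_gaussian_nb` and `predict_gaussian_nb` and the small variance floor of `1e-9` are assumptions made for this example, not the library's API.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    classes = np.unique(y)
    # Per-class feature means and variances, each of shape (n_classes, n_features).
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    # Class priors: the fraction of training samples in each class.
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors

def predict_gaussian_nb(model, X):
    classes, means, variances, priors = model
    diff = X[:, None, :] - means[None, :, :]          # (n_samples, n_classes, n_features)
    # Log of the normal density, summed over features (the "naive" independence step).
    log_lik = -0.5 * np.sum(
        np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None], axis=2
    )
    # Add the log prior and choose the most probable class per sample.
    return classes[np.argmax(log_lik + np.log(priors), axis=1)]

# Toy data (made up): two well-separated classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [5.0, 8.0], [5.2, 7.9], [4.8, 8.1]])
y = np.array([0, 0, 0, 1, 1, 1])
model = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(model, np.array([[1.1, 2.0], [5.1, 8.0]]))
```

Working with log-probabilities instead of raw probabilities avoids numerical underflow when many features are multiplied together.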

📊 5. Clustering (cluster)

  • KMeans: Groups data into 'k' clusters by assigning each point to its nearest cluster center and then recomputing the centers.

  • AgglomerativeClustering (Agglomerative): Starts with every point as its own cluster and repeatedly merges the two most similar clusters until the requested number of clusters (or a single large cluster) remains.

  • DBSCAN: Groups points that lie close to each other in dense regions and marks isolated points as noise. No need to specify the number of clusters!

  • MeanShift: Shifts data points toward areas of high density to find clusters.
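
The assign-then-update loop behind k-means can be sketched as follows. This is an illustrative version with assumed details (random initialization from the data points, a hypothetical `kmeans` function, a fixed seed), not the library's implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # stop once nothing moves
            break
        centers = new_centers
    return centers, labels

# Toy data (made up): two tight groups far apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans(X, k=2)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping is meaningful, not the label numbers themselves.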

🌲 6. Ensemble Methods (ensemble)

  • RandomForestClassifier (RandomForest): Combines the votes of multiple decision trees, each trained on a random sample of the data, to make stronger decisions than any single tree.

  • RandomForestRegressor (RandomForest): Predicts continuous values by averaging the outputs of an ensemble of decision trees.

  • GradientBoostingClassifier (GradientBoosting): Builds trees sequentially, each correcting errors made by the trees before it.

  • VotingClassifier (Voting): Combines the predictions of multiple models to make a final prediction.
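
Two building blocks recur across these ensembles: drawing a bootstrap sample to train each model on slightly different data, and combining the models' predictions by majority vote. The helpers below (`bootstrap_sample`, `majority_vote`) are hypothetical names for a minimal sketch; the library's classes may organize this differently.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement, as bagging-style ensembles typically do."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def majority_vote(predictions):
    """predictions: int array of shape (n_models, n_samples).
    Returns the most common label for each sample."""
    predictions = np.asarray(predictions)
    return np.array([np.bincount(col).argmax() for col in predictions.T])

# Made-up predictions from three models on four samples.
preds = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])
final = majority_vote(preds)   # [0, 1, 1, 0]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
X_boot, y_boot = bootstrap_sample(X, y, rng)
```

A single model's mistake (such as the second model on the third sample) is outvoted by the others, which is the intuition behind combining models.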

📏 7. Metrics (metrics)

Measure your model's performance:

  • accuracy_score (Accuracy): Measures the fraction of predictions your model got right.

  • f1_score (F1 Score): Balances precision and recall into a single score by taking their harmonic mean.

  • roc_curve (ROC): Shows the trade-off between the true positive rate and the false positive rate as the decision threshold varies.
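
For a binary problem, accuracy and F1 reduce to a few counts. The functions below are standalone sketches with assumed names (`accuracy`, `f1_binary`) and an assumed `positive` parameter; they are not the signatures of the library's `accuracy_score` or `f1_score`.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and truth agree.
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def f1_binary(y_true, y_pred, positive=1):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    if tp == 0:
        return 0.0   # no correct positive predictions: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
acc = accuracy(y_true, y_pred)    # 4 of 6 correct
f1 = f1_binary(y_true, y_pred)    # precision 0.75, recall 0.75
```

Here precision and recall are both 3/4, so their harmonic mean, the F1 score, is also 0.75.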

⚙️ 8. Model Selection (model_selection)

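  • train_test_split (TrainTestSplit): Splits your data into training and test sets.

  • KFold: Splits the data into 'k' folds and trains the model 'k' times, each time holding out a different fold for validation, which gives a more reliable estimate than a single split.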
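
The fold rotation can be sketched with index arrays alone. `kfold_indices` is a hypothetical generator written for this example (shuffling with a fixed seed is an assumption); it does not reproduce the library's `KFold` interface.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]                                   # held-out fold
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 10 samples, 5 folds: each round trains on 8 samples and validates on 2.
splits = list(kfold_indices(10, k=5))
```

Every sample lands in the validation set exactly once, so averaging the k validation scores uses all of the data for evaluation without ever scoring a model on points it was trained on.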

🔍 9. Preprocessing (preprocessing)

  • StandardScaler: Standardizes your data so that each feature has a mean of 0 and a standard deviation of 1.

  • LabelEncoder: Converts text labels into numerical labels (e.g., "cat" becomes 0 and "dog" becomes 1).
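
Both transformations are short in plain NumPy. The snippet below is a sketch of the underlying arithmetic with made-up data, not the library's classes; in particular, assigning label numbers in sorted order is simply what `np.unique` does and is an assumption about how an encoder might behave.

```python
import numpy as np

# Standardization: subtract each column's mean, divide by its standard deviation.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std          # each column now has mean 0 and std 1

# Label encoding: map each distinct string to an integer.
labels = np.array(["dog", "cat", "dog", "bird"])
classes, encoded = np.unique(labels, return_inverse=True)
# classes is sorted: ['bird', 'cat', 'dog'], so encoded is [2, 1, 2, 0]
decoded = classes[encoded]           # indexing with the codes recovers the strings
```

Keeping `mean`, `std`, and `classes` around matters in practice: the same values fitted on the training data must be reused to transform the test data.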

🧩 10. Dimensionality Reduction (decomposition)

Dimensionality Reduction helps in simplifying data while retaining most of its valuable information. By reducing the number of features (dimensions) in a dataset, it makes data easier to visualize and speeds up machine learning algorithms.

  • PCA (Principal Component Analysis): PCA reduces the number of dimensions by finding new uncorrelated variables called principal components. It projects your data onto a lower-dimensional space while retaining as much variance as possible.

    • How It Works: PCA finds mutually orthogonal axes (principal components) along which your data varies the most. The first principal component captures the most variance, and each subsequent component captures progressively less.
    • Use Case: Use PCA when you have many features, and you want to simplify your dataset for better visualization or faster computation. It is particularly useful when features are highly correlated.
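
One standard way to compute principal components is the singular value decomposition of the centered data. The `pca` function below is a minimal sketch of that route with made-up data; whether this library uses SVD or an eigendecomposition of the covariance matrix is not stated here, so treat the approach as an assumption.

```python
import numpy as np

def pca(X, n_components):
    # Center the data so the components describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured by each kept component.
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    # Project the centered data onto the kept axes.
    return X_centered @ components.T, components, explained_variance

# Made-up data: the second feature is almost exactly twice the first,
# so nearly all the variance lies along the direction (1, 2).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
Z, components, variance = pca(X, n_components=1)
```

Because the two features are highly correlated, a single component keeps almost all of the information, which is the situation described in the use case above. The sign of a component is arbitrary, so it may point along (1, 2) or (-1, -2).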

🎯 Why Use This Library?

  • Learning-First Approach: If you're a beginner and want to understand machine learning, this is the library for you. No hidden complexity, just code.
  • No Hidden Magic: Everything is written from scratch, so you can see exactly how each algorithm works.
  • Lightweight: Uses only NumPy, which keeps dependencies minimal and makes the library easy to install and run.

🚀 Getting Started

# Clone the repository
git clone https://github.com/adityajn105/MLfromScratch.git

# Navigate to the project directory
cd MLfromScratch

# Install the required dependencies
pip install -r requirements.txt



👨‍💻 Author

This project is maintained by Aditya Jain

🧑‍💻 Contributors

Contributor: Subrahmanya Gaonkar

We welcome contributions from everyone, especially beginners! If you're new to open-source, don't worry: feel free to ask questions, open issues, or submit a pull request.

🤝 How to Contribute

  1. Fork the repository.
  2. Create a new branch (git checkout -b feature-branch).
  3. Make your changes and commit (git commit -m "Added new feature").
  4. Push the changes (git push origin feature-branch).
  5. Submit a pull request and explain your changes.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.