The Iris dataset is used in .csv format.
Downloaded from Iris Dataset
It includes three iris species with 50 samples each, along with four measurements for each flower.
One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.
The columns in this dataset are:
- Id
- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species
The code is contained in knn-classifier.py
It contains 6 functions -
- loadDataset - Loads the dataset from a .csv file. The first row is assumed to contain feature labels, so it is skipped.
- getDistance - Returns the Euclidean distance between two vectors.
- getNeighbours - Returns the k nearest neighbours of a test instance.
- predictClass - Returns the most likely class from all the neighbours.
- getAccuracy - Tests the predictions against the entire test dataset. Accuracy is printed as a percentage.
- myknnclassify - Required function mentioned in the problem statement.
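The distance, neighbour search, and voting steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in knn-classifier.py: the snake_case names, the row layout (features followed by the label), and the sample rows are assumptions made for the example.

```python
import math
from collections import Counter


def get_distance(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def get_neighbours(training_set, test_instance, k):
    """Return the k training rows closest to test_instance.

    Assumption: each row is (feature_1, ..., feature_n, label),
    so the label is excluded from the distance computation.
    """
    ranked = sorted(
        training_set,
        key=lambda row: get_distance(row[:-1], test_instance),
    )
    return ranked[:k]


def predict_class(neighbours):
    """Majority vote over the labels (last element) of the neighbours."""
    votes = Counter(row[-1] for row in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical rows in the Iris column order (without the Id column).
training = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (4.9, 3.0, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.4, 3.2, 4.5, 1.5, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]
query = (5.0, 3.4, 1.5, 0.2)
prediction = predict_class(get_neighbours(training, query, k=3))
print(prediction)
```

With k=3 the two setosa rows and one versicolor row are the nearest neighbours, so the vote goes to Iris-setosa.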
Value of k: 12
Training and test data are created by randomly splitting data in 66:34 ratio.
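One way to produce such a random split is to shuffle the rows and cut the list at 66%. The sketch below assumes an exact-size split; the actual script may instead assign each row to a set at random, which gives only approximately 66:34. The function name `split_dataset` and the `seed` parameter are hypothetical.

```python
import random


def split_dataset(rows, train_fraction=0.66, seed=None):
    """Shuffle rows and split them into (train, test) by train_fraction."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 150 placeholder rows, matching the size of the Iris dataset.
train, test = split_dataset(range(150), train_fraction=0.66, seed=0)
print(len(train), len(test))
```

Because the split is random, the measured accuracy varies from run to run unless a seed is fixed.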
Classifier accuracy is generally >95%.
The Fertility dataset is used in .csv format.
Downloaded from Fertility Dataset
100 volunteers provided a semen sample, which was analyzed according to the WHO 2010 criteria.
Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.
The attributes in this dataset, with their encoded values in parentheses, are:
- Season in which the analysis was performed: 1) winter, 2) spring, 3) summer, 4) fall. (-1, -0.33, 0.33, 1)
- Age at the time of analysis: 18-36. (0, 1)
- Childhood diseases (i.e., chicken pox, measles, mumps, polio): 1) yes, 2) no. (0, 1)
- Accident or serious trauma: 1) yes, 2) no. (0, 1)
- Surgical intervention: 1) yes, 2) no. (0, 1)
- High fevers in the last year: 1) less than three months ago, 2) more than three months ago, 3) no. (-1, 0, 1)
- Frequency of alcohol consumption: 1) several times a day, 2) every day, 3) several times a week, 4) once a week, 5) hardly ever or never. (0, 1)
- Smoking habit: 1) never, 2) occasional, 3) daily. (-1, 0, 1)
- Number of hours spent sitting per day: 1-16. (0, 1)
- Output: Diagnosis normal (N), altered (O)
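The values in parentheses suggest that the raw attributes were rescaled before distribution. The sketch below shows how such an encoding could be reproduced, assuming linear min-max scaling for the age range and a fixed lookup table for the season; the source does not state the exact procedure, and the names `SEASON_CODES` and `min_max_scale` are hypothetical.

```python
# Season codes as listed in the attribute description.
SEASON_CODES = {"winter": -1.0, "spring": -0.33, "summer": 0.33, "fall": 1.0}


def min_max_scale(value, low, high):
    """Map value from the range [low, high] linearly onto [0, 1]."""
    return (value - low) / (high - low)


# Assumption: age 18-36 is mapped linearly onto (0, 1).
age_scaled = min_max_scale(27, 18, 36)
season_code = SEASON_CODES["spring"]
print(age_scaled, season_code)
```

Keeping every attribute in a similar range matters for kNN, because the Euclidean distance would otherwise be dominated by the attribute with the largest numeric spread.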
The code is contained in knn-regressor.py
It contains 6 functions -
- loadDataset - Loads the dataset from a .csv file. The first row is assumed to contain feature labels, so it is skipped.
- getDistance - Returns the Euclidean distance between two vectors.
- getNeighbours - Returns the k nearest neighbours of a test instance.
- calculateValue - Returns the predicted value computed from all the neighbours.
- getAccuracy - Tests the predictions against the entire test dataset. Accuracy is printed as a percentage.
- myknnregress - Required function mentioned in the problem statement.
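A kNN regressor commonly combines its neighbours by averaging their target values. The sketch below illustrates that idea together with one possible accuracy measure; it is not taken from knn-regressor.py. The encoding of the diagnosis as 1.0 (normal) and 0.0 (altered), the 0.5 threshold, and the sample rows are all assumptions.

```python
def calculate_value(neighbours):
    """Mean of the neighbours' target values (last element of each row)."""
    targets = [row[-1] for row in neighbours]
    return sum(targets) / len(targets)


def get_accuracy(actual, predicted, threshold=0.5):
    """Percentage of predictions on the same side of threshold as the actual value."""
    hits = sum(
        (a >= threshold) == (p >= threshold)
        for a, p in zip(actual, predicted)
    )
    return 100.0 * hits / len(actual)


# Hypothetical neighbours: (feature, feature, target), target 1.0 = normal.
neighbours = [(0.5, 1, 1.0), (0.6, 0, 1.0), (0.7, 1, 0.0), (0.4, 0, 1.0)]
value = calculate_value(neighbours)
accuracy = get_accuracy([1.0, 0.0, 1.0, 1.0], [0.75, 0.25, 0.4, 0.9])
print(value, accuracy)
```

Here three of the four neighbours have target 1.0, so the predicted value is 0.75, and three of the four sample predictions land on the correct side of the threshold.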
Value of k: 12
Training and test data are created by randomly splitting data in 66:34 ratio.
Regressor accuracy is generally >85%
The Mushroom dataset is used in .csv format.
Downloaded from Mushroom Dataset
Mushroom Dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled
mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely
poisonous, or of unknown edibility and not recommended.
The columns in this dataset are -
- cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
- cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
- cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y
- bruises?: bruises=t,no=f
- odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s
- gill-attachment: attached=a,descending=d,free=f,notched=n
- gill-spacing: close=c,crowded=w,distant=d
- gill-size: broad=b,narrow=n
- gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
- stalk-shape: enlarging=e,tapering=t
- stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=?
- stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
- stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
- stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y
- veil-type: partial=p,universal=u
- veil-color: brown=n,orange=o,white=w,yellow=y
- ring-number: none=n,one=o,two=t
- ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z
- spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y
- population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y
- habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d
The code is contained in naive-bayes.py
It contains 6 functions -
- loadDataset - Loads the dataset from a .csv file.
- split data by classes - Splits the training data according to class labels
- calculate probabilities - Calculates the conditional probability of each feature value given a particular class label
- calculate z - Calculates the scaling factor Z used to normalize the class scores
- predict class - Returns the most likely class by taking the argmax over the class labels
- main - driver function
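The steps listed above can be sketched for categorical features as follows. This is an illustration under stated assumptions rather than the code in naive-bayes.py: the class label is assumed to be the first column, probabilities are plain relative frequencies without smoothing, and all function names and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def split_by_class(rows):
    """Group feature tuples by class label (assumed to be the first column)."""
    groups = defaultdict(list)
    for label, *features in rows:
        groups[label].append(tuple(features))
    return groups


def calculate_probabilities(groups):
    """Return priors P(c) and per-feature value counts used for P(x_i | c)."""
    total = sum(len(rows) for rows in groups.values())
    priors = {label: len(rows) / total for label, rows in groups.items()}
    counts = {
        label: [Counter(column) for column in zip(*rows)]
        for label, rows in groups.items()
    }
    return priors, counts


def class_scores(instance, groups, priors, counts):
    """Unnormalized score P(c) * prod_i P(x_i | c) for every class."""
    scores = {}
    for label, rows in groups.items():
        score = priors[label]
        for i, value in enumerate(instance):
            score *= counts[label][i][value] / len(rows)
        scores[label] = score
    return scores


def predict_class(instance, groups, priors, counts):
    """Return (argmax class, posteriors), where posteriors are scores divided by Z."""
    scores = class_scores(instance, groups, priors, counts)
    z = sum(scores.values())  # scaling factor Z
    posteriors = {c: s / z for c, s in scores.items()} if z else scores
    return max(posteriors, key=posteriors.get), posteriors


# Hypothetical rows: (class, cap-shape, odor) with e = edible, p = poisonous.
rows = [
    ("e", "x", "n"),
    ("e", "f", "n"),
    ("e", "x", "a"),
    ("p", "x", "f"),
    ("p", "f", "f"),
]
groups = split_by_class(rows)
priors, counts = calculate_probabilities(groups)
label, posteriors = predict_class(("x", "n"), groups, priors, counts)
print(label, posteriors)
```

Dividing by Z does not change the argmax; it only turns the scores into probabilities that sum to 1. Without smoothing, a feature value never seen with a class sets that class's score to zero, as happens for "p" in this example.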
Instances with missing attributes are skipped.
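Skipping incomplete instances can be done while loading the file. The sketch below assumes that a missing value is marked with "?", as in the stalk-root column listed above; the function name, the `missing_marker` parameter, and the sample rows are hypothetical.

```python
import csv
import io


def load_dataset(lines, missing_marker="?"):
    """Read CSV rows and drop any row that contains the missing marker."""
    rows = []
    for row in csv.reader(lines):
        if not row or missing_marker in row:
            continue  # skip blank lines and incomplete instances
        rows.append(row)
    return rows


# In-memory stand-in for a .csv file; the second row has a missing value.
sample = io.StringIO("p,x,s,e\ne,x,s,?\ne,b,s,c\n")
rows = load_dataset(sample)
print(rows)
```

With a real file, the same function can be called as `load_dataset(open(path, newline=""))`, where `path` points to the downloaded .csv file.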
Training - first 4000 instances
Test - remaining 1644 instances
Classifier accuracy is 84.97%