# Framework-for-Data-Classification

**Repository Path**: lanicon/Framework-for-Data-Classification

## Basic Information

- **Project Name**: Framework-for-Data-Classification
- **Description**: Mini Framework for Data Classification
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-21
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Framework for Data Classification

This framework is a baseline for solving standard data classification problems. It applies machine learning algorithms, such as logistic regression and a neural network, to classify the [MNIST dataset](http://yann.lecun.com/exdb/mnist/).

The MNIST dataset is first divided into:

* Training set (60%)
* Cross validation set (20%)
* Test set (20%)

Using the ***training set***, we learn the weights by running an optimizer over the ***cost*** and ***gradient*** functions of the chosen classifier.

```python
from scipy.optimize import fmin_cg

# anonymous functions wrapping the cost and gradient so that the
# optimizer only has to vary the network parameters (nnParams)
shortCostFunction = lambda nnParams: self.computeCost(inputLayerSize, hiddenLayerSize, numLabels, X, y, lambdaVal, nnParams)
shortGradFunction = lambda nnParams: self.computeGradient(inputLayerSize, hiddenLayerSize, numLabels, X, y, lambdaVal, nnParams)

# minimize the cost with the conjugate gradient method
retVal = fmin_cg(shortCostFunction, x0=nnParams, fprime=shortGradFunction, maxiter=maxIter, full_output=True)
```

To guard against over- and underfitting, we vary ***lambdaVal*** and keep the weights that yield the lowest cost on the ***cross validation set***.

```python
# iterate through the given lambda values
for i in range(len(lambdaVals)):
    print("lambdaVal: ", lambdaVals[i])
    # train on the training set with the current lambda value
    retVal = self.train(update=False, X=X_train, y=y_train, lambdaVal=lambdaVals[i], maxIter=maxIter, numLabels=numLabels, inputLayerSize=inputLayerSize, hiddenLayerSize=hiddenLayerSize)
    lambdaValCost[i, 0] = lambdaVals[i]
    theta_train = retVal[0]
    # evaluate the trained weights on the cross validation set
    lambdaValCost[i, 1] = self.computeCost(inputLayerSize, hiddenLayerSize, numLabels, X_cv, y_cv, lambdaVals[i], theta_train)
    print("currCost: ", lambdaValCost[i, 1])
    # compare and store the lowest cost, together with the respective weights and lambda value
    if lambdaValCost[i, 1] < minCost:
        minCost = lambdaValCost[i, 1]
        minCostTheta = theta_train
        minCostLambdaVal = lambdaVals[i]
        print("minCostLambdaVal: ", minCostLambdaVal)
```

Using the optimized weights, we then compute the model's accuracy on the ***test set***.

```python
# predictions for every example in the test set
predictions = classifier.predict(DataStore.test_set_X)

# compare predictions with the labels (conjugate().T is a
# MATLAB-style transpose of the label vector)
accuracy = np.mean(predictions == DataStore.test_set_y.conjugate().T) * 100
```

In terms of software design, the framework is divided into several packages:

* Data Store
* Classifiers - classification methods developed using OOP
* Visualization
* Test (unit tests)

The following software design patterns were implemented (a minimal sketch follows the lists below):

* Dependency injection - swaps classification methods without modifying the source code
* Inversion of control - swaps classification methods through inheritance and a common interface
* Mediator pattern

The classification methods used are:

* Logistic regression (with the one-vs-all method)
* Neural network
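As an illustration of the dependency injection and common-interface ideas above, here is a minimal, hypothetical sketch; the names (`Classifier`, `OneVsAllLogisticRegression`, `ModelRunner`) are assumptions for illustration, not the framework's actual classes. The one-vs-all predictor scores each class with its own weight vector and picks the highest score.

```python
import numpy as np
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Hypothetical common interface shared by all classification methods."""

    @abstractmethod
    def train(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


class OneVsAllLogisticRegression(Classifier):
    """One-vs-all: one binary weight vector per class, predict by argmax."""

    def train(self, X, y):
        # one weight vector per label; the actual fitting is omitted in this sketch
        self.thetas = np.zeros((int(y.max()) + 1, X.shape[1]))

    def predict(self, X):
        # score every class with its own binary classifier, pick the best
        return (X @ self.thetas.T).argmax(axis=1)


class ModelRunner:
    """Dependency injection: the runner receives any Classifier instance."""

    def __init__(self, classifier: Classifier):
        self.classifier = classifier

    def evaluate(self, X_train, y_train, X_test, y_test):
        self.classifier.train(X_train, y_train)
        predictions = self.classifier.predict(X_test)
        return np.mean(predictions == y_test) * 100
```

A neural network class implementing the same `Classifier` interface could be injected into `ModelRunner` in its place, without modifying the runner's source code.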
### Program Details

The MNIST dataset consists of handwritten digits and their respective labels.

*Program inputs*

* Features - each handwritten digit is an image in a 20x20 pixel box, which gives rise to 400 independent variables (represented by a record/row). There are 5000 images in our dataset (data.mat)
* Labels - the values corresponding to the handwritten digits (data.mat)
* Pre-trained weights of the neural network model (nnWeights.mat)
* Pre-trained weights of the logistic regression model (lrWeights.mat)

*Program outputs*

* Accuracy of the neural network/logistic regression model on the test set
* Recognition of individual handwritten digits from the training dataset using either of the models

![picture1](pictures/picture1.png)
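As a rough sketch of how these inputs could be loaded and split 60/20/20, assuming the matrices inside data.mat are stored under the names 'X' and 'y' and each row flattens a 20x20 image (the actual files may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import loadmat

# load the 5000x400 feature matrix and the 5000 labels
# ('X' and 'y' are assumed variable names inside data.mat)
data = loadmat("data.mat")
X, y = data["X"], data["y"].ravel()

# shuffle, then split 60% / 20% / 20%
order = np.random.default_rng(0).permutation(X.shape[0])
X, y = X[order], y[order]
train_end, cv_end = int(0.6 * len(y)), int(0.8 * len(y))
X_train, y_train = X[:train_end], y[:train_end]
X_cv, y_cv = X[train_end:cv_end], y[train_end:cv_end]
X_test, y_test = X[cv_end:], y[cv_end:]

# display one digit; each row reshapes back to a 20x20 pixel box
# (a transpose may be needed depending on how the pixels were flattened)
plt.imshow(X_train[0].reshape(20, 20).T, cmap="gray")
plt.title(f"label: {y_train[0]}")
plt.show()
```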
### To run the program

1. Dependencies
   * Python 3 or later
   * NumPy
   * SciPy
   * Matplotlib
   * PyDev

2. After the dependencies are installed, place the project folder into the workspace:

![picture2](pictures/picture2.png)

3. In PyDev, go to 'File -> Import -> General -> Existing Projects into Workspace', then select the file system:
![picture3](pictures/picture3.png)
Click 'Finish'.

4. In Package Explorer, double-click on 'MNIST Classification -> main -> Main.py':
![picture4](pictures/picture4.png)

5. Go to 'Run -> Run As -> Run Configurations'. 'Main Module' should be set as follows:
![picture5](pictures/picture5.png)

6. Go to the Arguments tab and enter ***--NN*** under 'Program arguments' to use the neural network for this run (enter ***--LR*** to use logistic regression instead).
By default, the program runs with the pre-trained weights; include ***--T*** to train the model using the training and cross validation sets (a sketch of how these flags might be parsed follows step 7):
![picture6](pictures/picture6.png)
Click 'Run' to execute the program.

7. The neural network classifier achieves a test set accuracy of about 92.4%; this may differ slightly on every program run due to the randomness involved. To recognize the next random image, close the 'Figure 1' window:

![picture7](pictures/picture7.png)
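For reference, here is a minimal, self-contained sketch of how the ***--NN***, ***--LR*** and ***--T*** flags might be parsed; the real Main.py may handle its arguments differently.

```python
import argparse

# hypothetical flag handling; Main.py's actual parsing may differ
parser = argparse.ArgumentParser(description="MNIST classification")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--NN", action="store_true",
                   help="classify with the neural network")
group.add_argument("--LR", action="store_true",
                   help="classify with logistic regression")
parser.add_argument("--T", action="store_true",
                    help="train the model instead of loading pre-trained weights")
args = parser.parse_args()

print("classifier:", "neural network" if args.NN else "logistic regression")
print("mode:", "training" if args.T else "pre-trained weights")
```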
***Future inclusions***

* More data source types
* More classification models (e.g. SVM, TensorFlow)

***References***

* Coursera - Stanford University - Machine Learning Course
* [MNIST Dataset](http://yann.lecun.com/exdb/mnist/)