Cyber Crime Cases and the Role of the Confusion Matrix: A Case Study

Jun 7, 2021

Hello Techies!

Let’s start with the basics: what is a confusion matrix?

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

The confusion matrix shows the ways in which your classification model is confused when it makes predictions.

It gives you insight not only into the errors being made by your classifier but more importantly the types of errors that are being made.

It is this breakdown that overcomes the limitation of using classification accuracy alone.

Let’s start with an example confusion matrix for a binary classifier (though it can easily be extended to the case of more than two classes):

What can we learn from this matrix?

There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean they have the disease, and “no” would mean they don’t have the disease.
The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
In reality, 105 patients in the sample have the disease, and 60 patients do not.

Let’s now define the most basic terms, which are whole numbers (not rates):

true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
true negatives (TN): We predicted no, and they don’t have the disease.
false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)

I’ve added these terms to the confusion matrix, and also added the row and column totals:
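As a rough sketch, the matrix with its row and column totals can be rebuilt from the four counts used in the worked example (TP = 100, TN = 50, FP = 10, FN = 5). The layout, with actual classes as rows and predicted classes as columns, and the helper name `render` are assumptions made for this illustration.

```python
# Counts taken from the worked example in the text (165 patients).
TP, FN = 100, 5   # actual yes: predicted yes / predicted no
FP, TN = 10, 50   # actual no:  predicted yes / predicted no

# Rows are actual classes, columns are predicted classes (assumed layout).
matrix = {
    "actual no":  {"predicted no": TN, "predicted yes": FP},
    "actual yes": {"predicted no": FN, "predicted yes": TP},
}

row_totals = {row: sum(cells.values()) for row, cells in matrix.items()}
col_totals = {
    col: sum(matrix[row][col] for row in matrix)
    for col in ("predicted no", "predicted yes")
}
total = sum(row_totals.values())


def render(matrix, row_totals, col_totals, total):
    """Return the matrix as aligned plain text with row and column totals."""
    cols = list(col_totals)
    corner = "n=" + str(total)
    lines = [f"{corner:<12}" + "".join(f"{c:>15}" for c in cols) + f"{'total':>8}"]
    for row, cells in matrix.items():
        body = "".join(f"{cells[c]:>15}" for c in cols)
        lines.append(f"{row:<12}" + body + f"{row_totals[row]:>8}")
    footer = "".join(f"{col_totals[c]:>15}" for c in cols)
    lines.append(f"{'total':<12}" + footer + f"{total:>8}")
    return "\n".join(lines)


print(render(matrix, row_totals, col_totals, total))
```

The totals should agree with the figures quoted earlier: 110 predicted "yes", 55 predicted "no", 105 actual "yes" and 60 actual "no".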

This is a list of rates that are often computed from a confusion matrix for a binary classifier:

Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91
Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy
also known as “Error Rate”
True Positive Rate: When it’s actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
also known as “Sensitivity” or “Recall”
False Positive Rate: When it’s actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17
True Negative Rate: When it’s actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate
also known as “Specificity”
Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91
Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
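The rates above can be checked with a few lines of Python. This is only a sketch that reuses the counts from the example; the dictionary keys are names chosen for this illustration.

```python
# Counts from the worked example (165 patients).
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN

rates = {
    "accuracy": (TP + TN) / total,
    "misclassification_rate": (FP + FN) / total,  # 1 - accuracy
    "true_positive_rate": TP / (TP + FN),         # sensitivity / recall
    "false_positive_rate": FP / (FP + TN),
    "true_negative_rate": TN / (FP + TN),         # specificity
    "precision": TP / (TP + FP),
    "prevalence": (TP + FN) / total,
}

for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, the printed values match the ones listed above.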

Cyber Crimes

Internet usage has grown rapidly, particularly in the last decade. However, as the Internet becomes part of day-to-day activities, cybercrime is also on the rise. According to a 2020 Cybersecurity Ventures report, cybercrime will cost nearly $6 trillion per annum by 2021. Cybercriminals use networked computing devices as their primary means of reaching victims’ devices, and they profit financially, gain publicity, or obtain other benefits by exploiting vulnerabilities in those systems.

Cybercrimes are increasing steadily. Manual methods of evaluating cybercrime attacks and providing protective measures, whether through existing technical approaches or through investigations, have often failed to control them. Existing literature on cybercrime offenses also lacks computational methods for predicting cybercrime, especially from unstructured data.

The article Computational System to Classify Cyber Crime Offenses using Machine Learning addresses this gap. Based on their literature review, the authors conclude that machine learning is an efficient tool for detecting and classifying cybercrimes, although there is still scope for improvement. They therefore propose a machine learning based tool that finds attacks which take advantage of security weaknesses and classifies these cyber-attacks.

Proposed Methodology

At present, no generalized framework is available to categorize cybercrime offenses by extracting features from the cases. In the authors’ work, data analysis and machine learning are combined to build a cybercrime detection and analytics system. The design and implementation of the proposed system use supervised classification together with clustering. Figure 1 depicts the proposed methodology. Naïve Bayes is used for classification and k-means is used for clustering. Features are extracted with the TF-IDF (tf–idf) vectorization process. The methodology applies four phases to the data: reconnaissance, preprocessing, data clustering and classification, and prediction analysis.
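To give a feel for the classification step, here is a minimal naïve Bayes text classifier in plain Python. It is not the authors’ implementation: it works on raw word counts with Laplace smoothing instead of tf–idf features, and the function names, labels and toy incident descriptions are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """records: list of (text, label). Returns (priors, word_counts, vocab)."""
    class_counts = Counter(label for _, label in records)
    word_counts = defaultdict(Counter)
    for text, label in records:
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    priors = {label: n / len(records) for label, n in class_counts.items()}
    return priors, word_counts, vocab


def predict(text, priors, word_counts, vocab):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    scores = {}
    for label, prior in priors.items():
        total_words = sum(word_counts[label].values())
        score = math.log(prior)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up toy incidents, for illustration only.
training = [
    ("fake bank email asked for password", "phishing"),
    ("email link stole login password", "phishing"),
    ("files encrypted and ransom demanded", "ransomware"),
    ("ransom note after files locked", "ransomware"),
]
model = train_naive_bayes(training)
print(predict("suspicious email asking for password", *model))
print(predict("ransom demanded to unlock files", *model))
```

Working in log space avoids multiplying many small probabilities together, and the add-one smoothing keeps a word that is missing from one class from driving that class’s probability to zero.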

Figure 1. Proposed approach to analyze cybercrime incidents

Evaluate Term Frequency: The term frequency (tf) of each term in a record is computed from how often the term occurs in that record. The tf values of all terms are then stored in a matrix, called the tf matrix.

Evaluate Inverse Document Frequency: The inverse document frequency (idf) of each term in a record is computed here. These values are then stored in a matrix, referred to as the idf matrix.

Scoring the record sentences: Sentence scoring varies between approaches and algorithms. Here, tf–idf is used to assign a score to each term of a sentence that belongs to a record. The average tf–idf value of all the terms in a sentence becomes the score of that sentence.

Find the Threshold: A number of approaches exist for choosing the threshold value. Here, the average score of all the sentences in the record is used as the threshold. Generally, this value helps to detect the correlated terms in the data.
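The four steps above can be sketched in plain Python. Several details are assumptions, since the text does not spell them out: tf is taken as a term’s count divided by the sentence length, idf as log(N / df) over the N sentences of the record, and the sentence score as the mean over the distinct terms of the sentence. The function name and the sample record are hypothetical.

```python
import math
from collections import Counter


def score_sentences(sentences):
    """Return one score per sentence: the mean tf-idf of its distinct terms."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)  # an empty sentence gets a zero score
            continue
        counts = Counter(tokens)
        # tf = count / sentence length, idf = log(n / df)  (assumed variants)
        tfidf = [(counts[t] / len(tokens)) * math.log(n / df[t]) for t in counts]
        scores.append(sum(tfidf) / len(tfidf))
    return scores


# Made-up record, split into sentences.
record = [
    "the attacker sent a phishing email",
    "the victim opened the email",
    "the bank account was emptied",
]
scores = score_sentences(record)
threshold = sum(scores) / len(scores)  # average sentence score
above = [s for s, sc in zip(record, scores) if sc >= threshold]

print(scores, threshold)
print(above)
```

A term that occurs in every sentence (here "the") gets an idf of zero under this definition, so it contributes nothing to any sentence score.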

The chi-squared (χ²) measure is used to find the correlation between two categorical attributes of an incident. It checks whether a relationship between the two variables is reflected in the cybercrime dataset.
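As an illustration, the Pearson chi-squared statistic for two categorical attributes can be computed from a contingency table of observed and expected counts. The sketch below uses made-up (harm, cybercrime) pairs; the function name and the data are hypothetical and do not come from the article’s dataset.

```python
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared statistic and degrees of freedom for (a, b) pairs."""
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    statistic = 0.0
    for a in row_totals:
        for b in col_totals:
            # Expected count if the two attributes were independent.
            expected = row_totals[a] * col_totals[b] / n
            statistic += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up incidents: (harm, cybercrime) pairs, for illustration only.
incidents = (
    [("financial", "phishing")] * 30
    + [("financial", "hacking")] * 10
    + [("reputational", "phishing")] * 10
    + [("reputational", "hacking")] * 30
)
statistic, dof = chi_squared(incidents)
print(statistic, dof)
```

With one degree of freedom, the usual 5% critical value is about 3.84, so a statistic well above that suggests the two attributes are related in this toy sample.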

Results and Analysis

The proposed system was designed and developed using data from sources such as Kaggle and CERT-In. The dataset covers incidents that occurred in India during 2012–2017 and consists of more than 2000 records with eight attributes: incident, offender, victim, harm, year, location, age of the offender and cybercrime. These records are used to construct and test the proposed computational system.

Precision: It is the ratio of correctly predicted positive samples to the total number of samples predicted as positive. A higher precision score indicates that the model is classifying positive samples well.

Precision = TP / (TP + FP)

Recall: It is the ratio of correctly predicted positive samples to all the samples whose actual class is yes. It is also called the sensitivity of the model.

Recall = TP / (TP + FN)

F1 score: It is calculated as the harmonic mean of precision and recall. It therefore takes true positives, false positives and false negatives into account.

F1 Score = 2 × (precision × recall) / (precision + recall)

Accuracy is the performance measure used to check the model. It is most useful when the numbers of false positives and false negatives are similar. When the false positive and false negative rates differ, accuracy is not a good way to check the performance of the classifier. In that situation, it is better to use the F1 score rather than accuracy.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
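The four measures can be written as small Python functions. This is a sketch: the function names are chosen for this illustration, the first set of counts is reused from the disease example earlier in this post, and the second set is made up to show how accuracy and F1 can disagree when false positives and false negatives are unbalanced.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# Counts from the disease example earlier in this post.
print(precision(100, 10), recall(100, 5), f1_score(100, 10, 5), accuracy(100, 50, 10, 5))

# Made-up unbalanced case: no false positives, but half the positives are missed.
print(accuracy(5, 90, 0, 5), f1_score(5, 0, 5))
```

In the second case accuracy stays high because the many true negatives dominate, while the F1 score drops because recall is only 0.5.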

Precision, recall and F1 score for the proposed model

The figure below depicts the confusion matrix for the model when the training size was 0.8 and the test size was 0.2. From it, the authors can see how many cases are classified correctly and how many are classified incorrectly, that is, the true negatives, true positives, false negatives and false positives produced by the model.
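The evaluation setup described here, an 80/20 split followed by counting the four outcomes, can be sketched as below. The helper names, the fixed random seed, the 2000 placeholder record IDs and the short label lists are all assumptions for illustration; they are not the article’s data or code.

```python
import random
from collections import Counter


def train_test_split(records, test_size=0.2, seed=0):
    """Shuffle a copy of the records and split off a test fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN and TN for one class treated as positive."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# 2000 placeholder record IDs standing in for the dataset.
train, test = train_test_split(range(2000))
print(len(train), len(test))

# Made-up labels and predictions for six test cases.
actual = ["yes", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no"]
counts = confusion_counts(actual, predicted, positive="yes")
print(dict(counts))
```

Fixing the seed keeps the split reproducible, so the same records end up in the test set on every run.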

Conclusion

Cybercrime offenses are happening at an alarming rate. As use of the Internet increases, many offenders use it as a means of communication in order to commit crimes. The framework developed in the authors’ work supports analytics for the identification, detection and classification of cybercrime offenses from both structured and unstructured data. The main focus of their work is to find attacks that take advantage of security vulnerabilities and to analyze these attacks using machine learning techniques. The aim is for the framework to give society broad knowledge of cybercrime offenses, so that people can understand the threat landscape of such attacks and avoid the occurrence of cybercrime offenses. According to the authors’ results, the framework reduces the time taken and the manual effort of the reporting process. It also helps to identify the number of cases filed, by incident and by area.

Reference

Computational System to Classify Cyber Crime Offenses using Machine Learning

By Rupa Ch, Thippa Reddy Gadekallu, Mustufa Haider Abidi and Abdulrahman Al-Ahmari

https://www.mdpi.com/2071-1050/12/10/4087/pdf

https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/

Hope you liked this article!

Thanks for reading :)