Background
Given recent attacks on the healthcare sector and several notable data breaches, FauxCura Health have engaged Quantum.LogiGuardian (Q.LG), a cyber security consultancy and analytics firm.
FauxCura believe they may have had undetected cyber security breaches within their systems. As a caring and respectable healthcare provider, they want to examine their historical network data to determine whether undetected breaches have occurred.
The security operations centre at FauxCura runs a Security Information and Event Management (SIEM) platform called Splunk™. This platform collects vast quantities of log data from servers, desktop computers, routers and other network equipment. It aggregates the data into reports and alerts that security personnel can review to identify incidents that require investigation.
During an initial investigation of FauxCura’s data, Q.LG were able to trawl through historical data to produce an initial report that provides some top-level metrics on incidents based on certain triggers. Incident-specific data is also retained, but generally consists of extremely large data sets.
It is hoped that the report data contains sufficient information to be able to construct an ML model that can more accurately identify events of interest.
Data Overview
The data you are working with are records extracted from FauxCura’s SIEM. The records have already been processed and reduced to a summary of individual event detections that were triggered by the SIEM.
The data have also been aggregated from multiple other sources and reports in the SIEM. This means some values may be inconsistent across systems or there may be errors in the data that need to be identified and cleaned.
Descriptions of Features:
Below is a brief explanation of the features in the data set. It is not necessary to understand these features. It is also important to note that feature naming conventions are very subjective. Reliance on the meaning of a name may miss important data or detail.
Alert Category (Categorical):
This feature describes what type of alert was created by Splunk. It is largely subjective as the alert creators can identify their own alert levels for different types of events. The levels present in the data can be approximately summarised as:
Informational:
An event that is logged to the system for information purposes only. These events could relate to malicious activity, but this is the lowest level of alert.
Warning:
This is a higher level of alert and typically used to identify a situation that may not be typical.
Alert:
These are typically used for specific events that represent a security concern that requires action.
NetworkEventType (Categorical):
This is the type of event that the SIEM report believes has occurred. It can be used to differentiate apparently normal network traffic from policy violations, threat detections and data exfiltration.
NormalOperation:
No specific anomalies occur in this logged event – there are many reasons this data may be logged.
PolicyViolation:
A security or business policy has been violated. This can range from attempts to run unauthorised software on the network, to using the wrong type of web browser to access a database.
ThreatDetected:
A specific condition has been detected that has previously been identified as a security threat. These could be normal operations that have been mis-tagged, or they may include malicious software or techniques in use.
NetworkInteractionType (Categorical):
This is another computed metric, generated by an unknown third-party plugin, that identifies network interactions that are not typical.
Regular:
These appear to be normal network traffic requests.
Elevated:
Requests that are attempting to access resources that require specific permissions. For example, a computer trying to log in to an administrative console or a restricted device.
Suspicious:
Generally, these are elevated network events that are unexpected, have come from an unexpected source, or have unexpected patterns of usage.
Anomalous:
Network interactions that aren’t typical but may not have any relation to security events.
Critical:
A network condition that should never occur. This could be an interaction that indicates an attack condition, or a severe equipment outage or malfunction.
Unknown:
The interaction status is unknown.
DataTransferVolume (out and in) (Numeric):
Quantifies the amount of data transferred over the network. Values are given separately for data transferred into the network and data transferred out of the network.
TransactionsPerSession (Integer):
The number of transactions exchanged between devices and the service they are communicating with.
NetworkAccessFrequency (Integer):
Measures how frequently network ports are accessed, with abnormal frequencies potentially signalling unauthorized access attempts or scans.
UserActivityLevel (Numeric):
A generated metric indicating how active a user is on the system they are connected to. Higher scores generally mean more activity.
SystemAccessRate (Integer):
A generated metric that indicates how frequently the company’s core systems are being accessed.
SessionIntegrityCheck (Logical):
A flag that indicates whether the session has been correctly opened, communicated and closed, with all underlying network protocols and signals correctly used.
ResourceUtilizationFlag (Logical):
A flag that is raised when the resource utilisation of servers or network devices is unusually high. This could include excessive memory consumption on some devices, slow response times, or large network transfers.
SecurityRiskLevel (Numeric):
A calculated metric created by a 3rd party “AI” plugin that can identify security risks based on unknown parameters and conditions.
ResponseTime (milliseconds) (Numeric):
Measures the time taken to respond to network requests or events. This is the time between a network request or event occurring and the corresponding reply packet being returned.
Classification (Categorical):
The final classification of the event. Where a value is given, “Normal” and “Malicious” can be assumed to have been identified with reasonable accuracy.
The raw data for the above variables are contained in the HealthCareData_2024.csv file.
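As a first look at inconsistencies of the kind described above, the sketch below audits column types, category labels, missing values and impossible values. It is only an illustration: the column names are guesses based on the feature descriptions, and the four rows are made up and written to a temporary file so that the example runs without the real HealthCareData_2024.csv.

```r
# Toy stand-in for HealthCareData_2024.csv.
# ASSUMPTION: column names and all values are invented for illustration only.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "AlertCategory,NetworkInteractionType,ResponseTime,Classification",
  "Informational,Regular,120.5,Normal",
  "Warning,Unknown,-1,Normal",
  "Alert,Suspicious,,Malicious",
  "Info,Regular,98.2,Normal"
), toy_csv)

# Read text columns as factors so that their levels can be audited;
# treat empty fields as missing.
dat <- read.csv(toy_csv, stringsAsFactors = TRUE, na.strings = c("", "NA"))

# Column types at a glance
str(dat)

# Frequency of each label: stray spellings such as "Info" show up here
level_counts <- table(dat$AlertCategory, useNA = "ifany")
print(level_counts)

# Number of missing values in each column
na_per_column <- colSums(is.na(dat))
print(na_per_column)

# Values that cannot be valid, e.g. a negative duration
negative_times <- sum(dat$ResponseTime < 0, na.rm = TRUE)
print(negative_times)
```

With the real file, the same checks would be applied to every categorical and numeric feature before deciding how to recode or mask invalid entries.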
The needle in the haystack
The data were gathered over a period of time and processed by several systems in order to associate specific events with confirmed malicious activities. However, the number of confirmed malicious events was very low, with these events accounting for approximately 4% of all logged network events.
Although malicious events are quite uncommon, identifying them is extremely important.
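One consequence of this imbalance is that overall accuracy alone can be misleading. The sketch below uses made-up labels (10,000 events, 4% malicious, matching the proportion stated above) and a deliberately useless "model" that labels every event as Normal. It is an illustration of the problem, not a recommended evaluation procedure.

```r
# Illustrative labels only: 9600 Normal and 400 Malicious events (4% malicious)
actual <- factor(rep(c("Normal", "Malicious"), times = c(9600, 400)),
                 levels = c("Normal", "Malicious"))

# A useless "model" that predicts Normal for every event
predicted <- factor(rep("Normal", length(actual)), levels = levels(actual))

# Confusion matrix: rows are predictions, columns are the true classes
conf_mat <- table(Predicted = predicted, Actual = actual)
print(conf_mat)

# Proportion of all events classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Proportion of truly malicious events that were detected (recall / sensitivity)
recall <- conf_mat["Malicious", "Malicious"] / sum(conf_mat[, "Malicious"])

print(c(accuracy = accuracy, recall = recall))
```

The model scores 96% accuracy while detecting none of the malicious events, which is why class-specific measures such as recall and precision for the Malicious class are worth reporting alongside accuracy.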
Objectives
You are the data scientist who has been hired by Q.LG to examine the data and provide insights. Your goals are to:
• Clean the data file and prepare it for Machine Learning (ML).
• Recommend an ML algorithm that will provide the most accurate detection of malicious events.
• Create a brief report on your findings.
Your job
Your job is to develop the detection algorithms that will provide the most accurate incident detection. You do not need to concern yourself with the specifics of the SIEM plugin or software integration; your task is to focus on accurate classification of malicious events using R.
You are to test and evaluate two machine learning algorithms (each in two scenarios) to determine which supervised learning model is best for the task as described.
Task
You are to import and clean the same HealthCareData_2024.csv, then run, tune and evaluate two supervised ML algorithms (each with two types of training data) to identify the most accurate way of classifying malicious events.
Part 1 – General data preparation and cleaning
a) Import the HealthCareData_2024.csv into RStudio. This version is the same as Assignment 1.
b) Write the appropriate code in RStudio to prepare and clean the HealthCareData_2024 dataset as follows:
i. Clean the whole dataset based on the feedback received for Assignment 1.
ii. For the feature NetworkInteractionType, merge the ‘Regular’ and ‘Unknown’ categories together to form the category ‘Others’. Hint: use the forcats::fct_collapse(.) function.
iii. Select only the complete cases using the na.omit(.) function, and name the dataset dat.cleaned.
Briefly outline the preparation and cleaning process in your report and why you believe the above steps were necessary.
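Steps ii and iii can be sketched as follows. This is a minimal illustration on a made-up five-row data frame rather than the real dataset; the object name dat and all values are assumptions, and the Assignment 1 cleaning in step i is not shown.

```r
library(forcats)

# Toy data frame; ASSUMPTION: values are invented for illustration only
dat <- data.frame(
  NetworkInteractionType = factor(c("Regular", "Unknown", "Suspicious",
                                    "Elevated", "Regular")),
  ResponseTime = c(120.5, NA, 310.0, 95.1, 88.7)
)

# Step ii: merge the 'Regular' and 'Unknown' levels into a single level 'Others'
dat$NetworkInteractionType <- fct_collapse(
  dat$NetworkInteractionType,
  Others = c("Regular", "Unknown")
)

# Step iii: keep only rows with no missing values in any column
dat.cleaned <- na.omit(dat)

print(table(dat.cleaned$NetworkInteractionType))
print(nrow(dat.cleaned))
```

Collapsing before calling na.omit(.) means the merged level is defined on the full data; the row with the missing ResponseTime is then dropped regardless of its category.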
c) Use the code below to generate two training datasets (one unbalanced, mydata.ub.train, and one balanced, mydata.b.train) along with the testing set (mydata.test). Make sure you enter your ID into the command set.seed(.).
# Load dplyr for the pipe operator (%>%) and filter()
library(dplyr)
# Separate samples of normal and malicious events
dat.class0 <- dat.cleaned %>% filter(Classification == "Normal") # normal
dat.class1 <- dat.cleaned %>% filter(Classification == "Malicious") # malicious
# Randomly select 9600 non-malicious and 400 malicious samples using your ID, then combine them to form a working data set
set.seed(Enter your ID)
rows.train0 <- sample(1:nrow(dat.class0), size = 9600, replace = FALSE)
rows.train1 <- sample(1:nrow(dat.class1), size = 400, replace = FALSE)
# Your 10000 ‘unbalanced’ training samples