Artificial Intelligence SIG
Artificial Intelligence 101
Artificial intelligence (AI), as a discipline, is not new. However, it has gained renewed popularity in recent times due to the astounding accomplishments of large language models such as ChatGPT and Gemini. These complex models are trained on very large datasets and rely on intensive resources, such as GPU clusters, as well as human feedback to learn from this data. However, not all forms of AI have such hefty requirements. In fact, early forms of AI were much smaller in size and complexity than the models we see today. The term AI was introduced at a conference in 1956, and the field has a long and storied history, having gone through multiple summers and winters.
Research in AI produced some successes as early as the 1960s, for example the natural language processing program ELIZA, which explored communication between humans and machines. ELIZA adopted a pattern matching and substitution method. Nevertheless, it was sophisticated enough to produce seemingly intelligent responses capable of deceiving early users of the program. Another important approach adopted by early AI researchers was to attempt to replicate the decision-making process of a human expert. Introduced by Edward Feigenbaum, these expert systems would collect information from a human expert and use this information to advise non-experts. Expert systems had two main components – (1) a knowledge base representing information about the world, and (2) an inference engine utilising logic rules and formal statements to create new knowledge. These systems were used in industry and enjoyed some success, perhaps best exemplified by the victory of Deep Blue over Garry Kasparov, the reigning world chess champion, in 1997. Deep Blue combined knowledge provided by chess experts with an efficient search algorithm and powerful processors, enabling it to evaluate 200 million chess positions per second.
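The pattern matching and substitution approach can be illustrated in a few lines of code. The sketch below is a hypothetical, minimal responder whose rules and replies are invented purely for illustration; it is not the original ELIZA script.

```python
import re

# Hypothetical ELIZA-style rules (not the original script): a regex pattern
# paired with a reply template that reuses the captured text.
RULES = [
    (re.compile(r"\bI am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bI feel (.*)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bbecause (.*)", re.IGNORECASE), "Is that the real reason?"),
]

def respond(utterance: str) -> str:
    """Return the reply of the first matching rule, substituting captured text."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return "Please tell me more."  # fallback when no rule matches

print(respond("I am worried about my exams"))
# Why do you say you are worried about my exams?
```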
These early AI programs attempted to replicate human intelligence using a collection of rules, implicitly assuming that the problem can be formalised in terms of such rules. While this may be true of certain tasks, others, such as image recognition or language translation, are less amenable to this approach. For such tasks, a better approach may be for the system to learn from data and to adjust and adapt itself to better achieve its objectives without being explicitly programmed. These algorithms come under the subset of AI known as machine learning. Due to their automated nature, machine learning algorithms can quickly learn from large amounts of data and can identify patterns that are not obvious even to experienced analysts. Machine learning models vary widely in complexity and size.
A popular subset of machine learning today is deep learning, which builds on a particular machine learning model called the neural network. The inspiration for this model comes from nature, specifically the workings of the human brain, although the abstract representations found in today's AI models do not exactly mirror their biological counterparts. Deep learning models, by virtue of their large and complex architectures, can learn from large and complex datasets without requiring advanced preprocessing of the data. Classical machine learning models, on the other hand, usually require domain-specific knowledge to extract relevant features from the data to improve model performance. Deep learning models have been tremendously successful in the fields of computer vision and natural language processing, outperforming their classical counterparts; they are also the foundation of impressive generative models such as ChatGPT and DALL-E.
Broadly speaking, most machine learning tasks can be divided into supervised and unsupervised learning. The crucial distinction between these two tasks is the availability of labels. A label is an actual classification for each sample, for example whether an email is spam or ham, or whether a file is malware or not. When such labels are available, the task is one of supervised learning, and the AI tries to learn the relationship between the inputs (such as the presence or absence of certain words in an email) and the labels (such as whether the email is spam or ham). For such tasks, the machine learning process normally undergoes two phases: (1) a training phase where the model is presented with the correctly labelled training data and ‘learns’ from this data; and (2) a testing phase where the fully trained model is fed with ‘unseen’ data i.e. data not in the training data but for which the labels are known, and the model’s performance is evaluated. Some examples of supervised machine learning models include distance-based methods such as K-Nearest Neighbours, probabilistic methods such as Naïve Bayes, deep learning neural network architectures such as convolutional neural networks and transformers, and ensemble methods which use a collection of simpler models to improve performance.
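As a concrete, if simplified, illustration of the training and testing phases, the sketch below trains a K-Nearest Neighbours classifier on a synthetic labelled dataset using scikit-learn and evaluates it on held-out data. The dataset and parameters are assumptions made purely for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a labelled dataset, e.g. numeric features extracted
# from emails with a spam (1) / ham (0) label per sample.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Hold out 30% of the samples as 'unseen' test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Training phase: the model 'learns' from the correctly labelled training data.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Testing phase: evaluate the trained model on the held-out data.
print("Test accuracy:", model.score(X_test, y_test))
```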
For the evaluation of supervised learning models, the most intuitive metric is accuracy, which measures the overall percentage of correctly predicted samples. However, for problems where the samples are not evenly distributed across the output classes, such a measure may not reflect the actual performance of the model. For example, in the context of cybersecurity-related problems, malicious events are usually much rarer than benign ones. Thus, a high accuracy based on the overall percentage would be misleading; the model could achieve good results simply by always predicting the majority class, as the small number of incorrectly predicted malicious events will barely affect the accuracy score. In such cases, other performance metrics may be more informative, such as precision (the percentage of correct predictions out of all samples predicted as malicious) and recall (the percentage of malicious samples that are correctly identified).
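The effect is easy to demonstrate. In the hypothetical sketch below, a model that always predicts the benign class achieves 99% accuracy on an imbalanced dataset, while its precision and recall for the malicious class are both zero; the class counts are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical imbalanced dataset: 990 benign events (0) and 10 malicious (1).
y_true = np.array([0] * 990 + [1] * 10)

# A 'lazy' model that always predicts the majority (benign) class.
y_pred = np.zeros_like(y_true)

print("Accuracy :", accuracy_score(y_true, y_pred))                    # 0.99
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
```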
For unlabelled data, unsupervised learning methods can be used. These methods attempt to discover patterns or find some underlying structure within the unlabelled dataset. Clustering algorithms such as k-means group the data based on similarity; these groups can then be studied by a domain expert to identify anomalies that do not belong to any group, or used to classify new data into one of the groups. Alternatively, deep learning architectures such as autoencoders can be trained to recreate the original input from a lower-dimensional representation of the input. Once trained, autoencoders can reconstruct the input accurately for non-anomalous data but will fail for outliers, and can therefore be used for anomaly detection. Other unsupervised anomaly detection methods include tree-based methods such as isolation forests, which use the number of random splits needed to isolate a sample (typically far fewer for anomalies) to identify outliers.
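As an illustration of unsupervised anomaly detection, the sketch below fits a scikit-learn isolation forest to a synthetic, unlabelled dataset; the data and the contamination rate are assumptions made purely for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Unlabelled data: mostly 'normal' points plus a few scattered outliers,
# standing in for feature vectors of, say, network events.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
X = np.vstack([normal, outliers])

# Anomalies tend to be isolated with fewer random splits, so they score lower;
# 'contamination' sets the expected fraction of anomalies in the data.
model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(X)  # +1 = inlier, -1 = anomaly
print("Flagged anomalies:", int(np.sum(labels == -1)))
```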
In some situations, a combination of both supervised and unsupervised learning can be used in what is called semi-supervised learning. This normally applies to cases where there is limited labelled data but a large amount of unlabelled data. The goal of semi-supervised learning is the same as that of supervised learning, but here the small set of labelled data is used to create machine-generated labels for the unlabelled data, which can then be used to enhance the performance of the overall model. Another subset of machine learning that differs from supervised and unsupervised learning is reinforcement learning. There are three key components in reinforcement learning – the environment, actions and rewards. An agent repeatedly interacts with a simulated environment and, in the process, receives rewards or penalties depending on the actions taken. By exploring the simulated environment in this fashion, the agent learns the 'best' actions to take in a given state and optimises for long-term rather than short-term rewards. An important requirement for the success of reinforcement learning is the creation of realistic simulation environments. One such platform, provided by Microsoft, is CyberBattleSim, which simulates attack and defence cyber-agents (https://www.microsoft.com/en-us/research/project/cyberbattlesim/). The attacker moves through the network to exploit existing vulnerabilities, while the defender attempts to contain the attacker and evict it from the network. Reinforcement learning can be used to learn the appropriate strategies for an attacker to infiltrate all PCs in a network environment, or to develop effective defence strategies.
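The mechanics of reinforcement learning can be sketched with a toy example. The tabular Q-learning snippet below is purely illustrative and unrelated to CyberBattleSim's actual API: an agent on a five-state corridor learns, from rewards alone, that repeatedly moving right reaches the goal.

```python
import numpy as np

# Toy tabular Q-learning sketch: the agent is rewarded only on reaching the
# rightmost state of a five-state corridor.
n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))    # estimated long-term reward per (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate

rng = np.random.default_rng(0)
for episode in range(300):
    state = 0
    while state != n_states - 1:
        # Explore occasionally, otherwise exploit the current best action.
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate towards the observed reward
        # plus the discounted value of the best action in the next state.
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)  # moving right should end up with the higher value in every state
```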
With digital transformation changing the way we live and work, addressing cybersecurity threats has become increasingly important. Attackers are constantly finding new ways to attack systems, creating a dynamic and changing threat landscape that may not be amenable to rule-based methods. The strengths of machine learning, as a fast and automated method of learning from data, appear to make it well suited to tackling problems in the cybersecurity domain. However, challenges remain, such as the need for large and diverse datasets, data drift, the interpretability of AI models, and difficulties in handling both zero-day attacks and AI-based adversarial attacks crafted specifically to escape detection. Furthermore, the trend from narrow intelligence (AI developed for specific tasks) towards more general intelligence, such as multimodal language models that can perform multiple tasks, will open up new opportunities and threats in the cybersecurity domain. While there is still a gap between the research potential and the practical deployment of AI-based cybersecurity solutions today, it is likely that this gap will narrow in the near future.
Author Contact Information