
Analysis of the Importance and Development of AI in SIEM

2026-04-06 07:28:25

Summary

SIEM (Security Information and Event Management) is the core hub of enterprise security, responsible for collecting and aggregating data from across the organization and combining it with threat intelligence to accurately identify and warn of threats. However, traditional SIEMs rely heavily on manually customized security policies, which not only increases labor costs but also significantly reduces the overall accuracy and effectiveness of the SIEM. Current SIEM systems with AI capabilities merely integrate AI as an algorithmic plugin and cannot function independently and intelligently without intervention by security personnel.

This article will begin by introducing the components of traditional SIEMs, then discuss the applicability and importance of AI for next-generation SIEMs, and focus on the differences between current mainstream SIEM & AI platforms and the next-generation SIEM@AI platform. It will then delve into the two core technical principles of SIEM@AI using real-world examples: data analysis and data correlation. Finally, the article will explore the development and research directions of SIEM@AI.

I. A Brief History of SIEM

SIEM stands for Security Information and Event Management. As the security brain of an enterprise, a SIEM platform provides security data collection, integration, analysis, correlation, processing, and presentation, and is the core and foundation of enterprise business security operations.

The concept of SIEM was first proposed 10 years ago. As an internal security log management platform for enterprises, SIEM originally provided log collection, storage, analysis, and querying. After more than a decade of development, SIEM's product form has been enriched and expanded to include support for multi-dimensional data source input, threat intelligence centers, and policy script libraries (playbooks), while the sharing and acquisition of external threat data has enabled SIEM systems to improve continuously.

Figure 1: SIEM Market Size Forecast (from Gartner 2017 Report)

SIEM has maintained rapid growth in the United States. According to a Gartner market report, SIEM globally (mainly in the US) has maintained an annual growth rate of 10% recently, and the market size is expected to reach 20 billion RMB in 2020. However, in China, SIEM is still in a relatively early stage, and many companies do not have systematic management of their security issues. In 2017, the entire Chinese market was only 317 million RMB, a figure that is disproportionate to China's share of the global economy. Encouragingly, the Chinese SIEM market has maintained an annual growth rate of nearly 20% recently, indicating that more and more Chinese companies have realized the importance of SIEM.

However, not all enterprises need SIEM. Enterprises in the early stages of development have simple data flows and business volumes, face fewer security threats, and have relatively small needs for security equipment and software. They can meet their basic needs with independent security products. As enterprises grow to medium to large scale, their business lines increase, their internal and external network security environments become more complex, and they have accumulated a certain number of security products used in the early stages. At this point, it becomes necessary to integrate SIEM to achieve unified security operations management.

II. Deconstructing SIEM

Figure 2: Overall SIEM Architecture Diagram

The main architecture of the SIEM platform consists of 5 layers:

Acquisition layer

System data entry points. SIEMs typically support multiple data inputs, which can be categorized by source (end-user devices, network devices, servers, storage devices, etc.), OSI model (network traffic at the data link layer, network layer, transport layer, and application layer), and system role (different business systems, middleware systems, load balancing systems, etc.). This data is delivered to the SIEM platform either through push or pull methods for subsequent analysis and computation.

The technologies used in the acquisition layer fall into two categories: "intrusive" and "non-intrusive." Intrusive methods typically deploy agent programs or add probes to the user's own code to collect data. Non-intrusive methods generally collect data through out-of-band traffic mirroring or log ingestion. Each mode has its trade-offs. Intrusive collection makes it easy for enterprises to add customized functions and integrate deeply with the multi-dimensional features of the SIEM platform to fit their business needs; the drawback is that an unstable embedded agent can affect the user's own business and even cause system crashes. I have personally encountered several clients who complained that their services became unstable because of a vendor's embedded SDK. Non-intrusive collection, by contrast, avoids impacting business systems entirely, improving system stability and protecting system data. With the technology now mature, non-intrusive collection is clearly the more user-friendly choice.

Storage layer

The collected data must be stored both before and after processing. The storage layer therefore serves two purposes: holding the raw collected data for subsequent computation and analysis, and holding the results of completed computations.

Storage technology stack options generally include data pipelines (for intermediate data transmission), hot storage (for frequently used data that is queried and updated), and cold storage (for infrequently used data). Strictly speaking, data pipelines are not storage, but in practice, to prevent downstream data loss or backlog, data passing through pipelines is usually persisted temporarily. Kafka, the queue most widely used by internet companies, for example, stores intermediate data on disk.

The purpose of tiered cold and hot storage is to reduce enterprise storage costs to some extent while ensuring the speed of hot data operations. For cold storage, the greater technical challenge than performance is reliability and availability, making large-scale distributed storage systems that support multiple data centers or even multiple zones the preferred choice for enterprises. For hot storage, the focus is more on read and write speeds and how the data is used by computing units, so distributed storage with sharding capabilities is generally chosen.

Computation layer

The core of a SIEM platform. The accuracy, comprehensiveness, and speed of analysis all depend on the computing units at this layer. Currently, the mainstream computing models include real-time computing platforms and offline computing platforms.

Offline computing platforms for massive data have a longer history, appearing more than a decade ago in Google's MapReduce system. MapReduce first uses GFS to partition and store massive amounts of data, solving the IO throughput bottleneck of single-point devices. Each computing node then relies on a scheduler to execute either Map or Reduce tasks, continuously decomposing and merging massive computing tasks to ultimately output the desired computational results. Real-time computing platforms are a relative newcomer to massive data computing, encompassing two technical implementations: real-time stream processing, represented by Storm, and micro-batch processing, represented by Spark Streaming.

In terms of real-time performance, real-time stream processing is faster, but practical experience shows that it also requires greater technical and operational expertise. Both real-time and offline computing platforms need to support task partitioning to ensure successful computation even if some hosts fail.
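The offline map/shuffle/reduce flow described above can be sketched in a few lines of pure Python. This is a toy illustration only: the log lines and the status-code extraction rule are invented for the example, and a real MapReduce job runs the same three phases distributed across many nodes.

```python
from collections import defaultdict

def map_phase(log_line):
    # Emit (key, 1) pairs -- here, one pair per HTTP status code in a log line
    for token in log_line.split():
        if token.isdigit():
            yield (token, 1)

def shuffle(mapped_pairs):
    # Group all values by key, as the framework does between Map and Reduce
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Merge the partial results for one key into a final count
    return key, sum(values)

logs = [
    "GET /login 200",
    "GET /admin 403",
    "POST /login 200",
]
mapped = [pair for line in logs for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# result now holds status-code counts aggregated across all input partitions
```

The same decompose-and-merge shape scales because each Map call and each Reduce call is independent, so a scheduler can farm them out to different hosts.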

The core of a computing platform is not its computing framework, but rather the computational logic within its algorithmic components. This logic performs calculations on different types of data, such as traffic, user requests, and system interaction information. Currently, most SIEM platforms are implemented on top of rule engines such as Drools. This requires users to define a large number of rules, and errors or omissions in those rules lead to false positives or missed detections.
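To make the rule-engine idea concrete, here is a minimal sketch of the pattern: rules as named predicates evaluated against event dictionaries. The rule names, fields, and thresholds are illustrative inventions, not taken from Drools or any real policy set.

```python
# Each rule is a (name, predicate) pair evaluated against an event dict.
# Names, fields, and thresholds below are illustrative, not a real policy.
RULES = [
    ("brute_force", lambda e: e.get("failed_logins", 0) > 5),
    ("off_hours_admin", lambda e: e.get("role") == "admin" and e.get("hour", 12) < 6),
]

def evaluate(event):
    """Return the names of all rules the event triggers."""
    return [name for name, predicate in RULES if predicate(event)]

alerts = evaluate({"failed_logins": 9, "role": "admin", "hour": 3})
```

The fragility the article points out is visible even here: an omitted rule or a wrong threshold silently becomes a missed detection, which is exactly the gap AI-driven analysis aims to close.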

Output layer

The results of the computational layer analysis are ultimately transmitted to the output layer. Traditional SIEM output methods are numerous, including presentation, reporting, alarm notification, and real-time blocking, allowing enterprises to choose the appropriate method based on the specific needs of different business departments. The output of SIEM is not only relevant to security or business departments but may also involve other business units, such as asset management and organizational management.

From the perspective of the incident-handling lifecycle, handling can be automatic or manual. Automatic handling covers security threat events identified by the computation layer, including notification, warning, reporting, and even blocking. Events that cannot be handled automatically require manual handling; in this case, a ticket system can be used for follow-up processing and tracking to ensure the threat is ultimately resolved.

Intelligence Center

The intelligence center provides additional data support to the SIEM computation layer, improving the accuracy of threat and anomalous-behavior identification. Its data sources generally fall into three categories: first, publicly available threat intelligence, such as X-Force Exchange, ThreatBook, and Shodan; second, threat intelligence collected internally, such as valuable intelligence obtained through honeypots and API access, or acquired through purchase and exchange; and third, auxiliary data about the business itself, such as user registration information, corporate asset information, and organizational information. While this last category may seem unrelated to security threats, analyzing multiple data sources together can provide valuable reference for the final output.

The data in the intelligence center comes in various forms, commonly including IP databases, device fingerprint databases, black card databases, and vulnerability databases. When using or relying on an intelligence center, it's crucial to ensure the intelligence is up-to-date. Due to the widespread adoption of cloud computing and the sharing economy, many resources are not exclusive but are reclaimed and used for other purposes after a certain period. Therefore, untimely intelligence updates can be counterproductive.

III. SIEM, Situational Awareness and SOC Security Operations Center

SIEM, situational awareness, and SOC (Security Operations Center) are closely related. Situational awareness has a broad scope, primarily focusing on three levels: perceiving the past, understanding the present, and predicting the future. This aligns closely with SIEM's data collection, computation, analysis, and prediction. Some enterprise-released situational awareness systems are essentially simplified SIEMs or supersets of SIEMs. SOC, building upon SIEM, emphasizes the role of humans, stressing the collaboration between people, platforms, and software. Through a task tracking mechanism similar to a ticket system, combined with data analysis results provided by SIEM, it enables comprehensive security management of business operations and assets by human personnel.

In summary, SIEM is crucial for the overall security analysis of an enterprise. By connecting information from multiple data streams, SIEM enables responses before, during, and after security threats, ultimately ensuring the security of the enterprise's overall assets and business operations.

IV. AI Meets SIEM

Among today's IT trends, AI is undoubtedly the most cutting-edge and practical. The overall development of AI can be divided into three stages:

1. The identification phase addresses the "What" question, which is the most fundamental AI problem. Current AI uses extensive supervised learning to extract the superficial or intrinsic features of labeled samples, forming one or more classifiers. These classifiers learn and train on the features of the sample data, ultimately accurately identifying new inputs, thus solving the "what is what" problem. For example, what is a puppy, what is a pornographic image, etc.

Typical applications include CAPTCHA recognition, speech recognition, and spam detection. The well-known AlphaGo also fundamentally addresses a recognition problem. Its deep neural network, trained on a large corpus of pre-annotated game positions, uses its early layers to move beyond superficial features and uncover deeper features that are difficult for humans to interpret. This creates a "sensory" ability to read the game, allowing it to judge whether a position favors Black or White. Combined with algorithms like alpha-beta search or MCTS, it then selects the optimal next move. Recognition is arguably the most mature area of AI application.

2. The understanding stage addresses the "Why" question, which is a further AI question building upon recognition. For example, what emotion does a piece of text convey? What story does a movie tell? What is the question in a spoken message? The most typical application scenario is human-computer dialogue, which is based on understanding what a person is saying and what they want to express.

The most basic solution to understanding problems is to construct various semantic templates for sentiment annotation, effectively transforming the understanding problem into a recognition problem. However, with the popularization of deep learning, many new technologies have emerged to break through the limitations of template definitions and attempt to truly understand the inner meaning. But as the example of Apple's Siri shows, current AI's ability to understand problems is far from mature.

3. The feedback phase addresses the "How" question. "How" essentially involves understanding the information provided by the other party after recognition and then providing appropriate feedback. Feedback represents the highest level of AI and is key to achieving true human-computer interaction. With the ability to interact and provide feedback, AI can partially or even completely replace humans in certain areas, much like a human. However, it's clear that the current stage of AI development is still far from this goal.

Looking at these three stages, AI today is mainly in the early "recognition" and "understanding" stages, and there is still a long way to go before it can truly "replace humans." The technologies that are currently mature and in production use focus primarily on recognition. Meanwhile, the problems in the security field are precisely typical recognition problems: when analyzing the various data inputs of a SIEM, it is only necessary to identify whether an event or user poses a threat; the entire process involves neither understanding nor feedback.

Figure 3: Examples of AI tools in mainstream SIEM systems

It should be noted that current new SIEMs have integrated AI capabilities. For example, some SIEM platforms have integrated commonly used AI algorithms, such as anomaly detection and linear prediction. These algorithms are integrated into the platform as plug-ins, and users can analyze their own data based on these algorithms.

V. From SIEM & AI to SIEM@AI

The biggest drawback of mainstream SIEM platforms is that they are merely SIEM & AI (AI as a tool): AI is treated simply as an add-on to the SIEM platform rather than the foundation on which the whole platform is built. As a result, enterprises must spend significant time, effort, and manpower learning, configuring, and using these AI tools. Furthermore, SIEM & AI requires enterprises to have some feature-engineering experience, which is unrealistic for many. I've met many enterprise clients who, when asked about their experience with the AI component of SIEM & AI products, are completely bewildered, as if they had spent a fortune on a high-end toy they never managed to use effectively.

What businesses truly need is SIEM@AI (using AI as a platform), which allows them to use AI technology to discover threat events from massive amounts of input data streams without much cost or even any learning cost. It also enables AI technology to intelligently correlate data from different business areas and dimensions, establish internal connections, and ultimately automatically handle threat events.

VI. AI-enabled data analysis

Data labeling challenges

As mentioned earlier, in the security field, most problems are "identification" problems, which, from a data analysis perspective, can ultimately be categorized as classification problems. By building algorithmic models, we predict whether ongoing events or even yet-to-arrive events pose a threat, essentially classifying them into threatening and non-threatening categories. However, a significant challenge in using AI in the security field is the difficulty of sample labeling. For classic image recognition problems, companies can use relatively low manpower to create labeled samples in batches and then feed them into deep neural networks for training. But security problems are different. Identifying the existence and nature of threats from massive amounts of messy data requires specialized security personnel and even cross-departmental collaboration.

Unsupervised learning solves the labeling problem

Is the labeling problem solvable? Yes: through unsupervised learning. Unsupervised learning groups normal events into some clusters and abnormal events into others, making it easier for algorithms to single out anomalous threats. Moreover, the entire threat identification process requires no labeled samples, significantly reducing the need for human intervention.

Unsupervised learning is a crucial branch of machine learning. Unlike supervised learning, which relies on a large number of labeled samples for the classifier to learn, unsupervised learning allows the classifier to learn autonomously without any labeled samples. However, most products on the market currently focus on supervised learning, leading to the long-term neglect of unsupervised learning.

Figure 4: Clustering diagram

The Baishan ATD (Advanced Threat Detection) product, a next-generation SIEM@AI system, makes extensive use of unsupervised learning for threat event identification. The essence of unsupervised learning is data clustering, which by clustering process falls into three main categories:

1. Distance clustering

2. Kernel density clustering

3. Hierarchical clustering

Distance clustering

Distance clustering is the most common clustering approach; the classic k-means algorithm, essentially a form of the EM algorithm, iteratively refines each event's distance to the cluster centroids until all events are classified. Threatening events naturally fall into one or more clusters, while normal events, being more similar to each other, group into clusters of their own. This, however, is the ideal scenario; applying the algorithm to real-world data requires significant additional processing. The biggest challenges of distance clustering are choosing the distance measure and selecting the number of clusters.

Distance calculation selection mainly includes two aspects:

- How to define event boundaries: In the face of massive amounts of complex data input, where does an event begin and end, and what data does it include? This requires different processing depending on the application scenario. Common methods include defining boundaries by time period or by event segmentation points.

- How to determine the distance between events: Events have many descriptive dimensions. For the most common dimensions, time and location, the recorded time might be a UNIX timestamp and the recorded location a GeoIP or MAC address. The question becomes how to compare distances between UNIX timestamps and IP addresses within a vector space model. Here, ATD uses Z-score standardization for distance mapping, rescaling each dimension to zero mean and unit variance so that heterogeneous dimensions become directly comparable.
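A minimal sketch of the Z-score mapping for one feature dimension, using Python's standard library (the timestamps are invented for the example):

```python
import statistics

def z_scores(values):
    # Standardize one feature dimension: z = (x - mean) / stdev.
    # This puts dimensions with very different units (UNIX timestamps,
    # request counts, ...) on a comparable scale for distance calculations.
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

timestamps = [1700000000, 1700000060, 1700000120]
z = z_scores(timestamps)   # symmetric values centered on zero
```

Applying the same mapping to every dimension lets a single Euclidean-style distance combine features that would otherwise be on wildly different scales.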

The choice of the number of clusters is crucial to the effectiveness of unsupervised learning algorithms. If the initial number of clusters is not chosen appropriately, the clustering results may be completely wrong.

Figure 5: Clustering diagram

As shown in the figure above, the red outliers are the points we need to identify. Clearly, a cluster count of 2 is more effective than 3, because with 3 clusters the normal event points are split into two categories. ATD uses a series of algorithms to predict the number of clusters before clustering, which in the best case improves clustering performance by up to 200%.
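ATD's exact cluster-count prediction algorithms are not described in the article; one common heuristic they resemble is the "elbow" method: run k-means for several values of k and watch where the within-cluster inertia stops improving. A deterministic one-dimensional sketch with invented toy data:

```python
def kmeans(points, k, iters=20):
    # Deterministic init: spread starting centroids across the sorted data
    pts = sorted(points)
    step = max(k - 1, 1)
    centroids = [pts[i * (len(pts) - 1) // step] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def inertia(centroids, clusters):
    # Sum of squared distances to the assigned centroid (lower = tighter)
    return sum((p - centroids[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

points = [9, 10, 11, 10, 9, 100, 101]   # normal events plus two outliers
scores = {k: inertia(*kmeans(points, k)) for k in (1, 2, 3)}
# inertia drops sharply from k=1 to k=2, then barely improves at k=3,
# suggesting two natural clusters: normal events vs. outliers
```

Picking k where the curve flattens avoids the failure mode in the figure, where an extra cluster needlessly splits the normal events.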

Kernel density clustering

Kernel density clustering does not require pre-specifying the number of clusters. Instead, it selects clusters based on the initial density value. All events that are too far from the kernel are marked as outliers, which may be threat events from a security perspective.

Density clustering relies on selecting appropriate initial density values. Inappropriate selection can lead to incorrect outlier identification and ultimately misjudgment of threat events. Furthermore, controlling the number and purity of outliers is crucial for the final identification effect, as a large number of discrete points are likely to occur in real-world production environments. Therefore, sometimes it's necessary to adjust the feature selection algorithm after the initial clustering to perform secondary clustering for outliers.
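The density idea can be sketched on one-dimensional toy data: flag as outliers the events with too few neighbors within a given radius, which is how DBSCAN-style algorithms mark noise points outside any dense region. The eps and neighbor-count values here are illustrative choices, not ATD's.

```python
def density_outliers(points, eps, min_neighbors):
    # A point is an outlier when fewer than `min_neighbors` other points
    # lie within distance `eps` of it -- the notion of "too far from the
    # kernel" that density clustering uses to mark noise.
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and abs(p - q) <= eps)
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers

events = [9, 10, 11, 10, 9, 100]   # one isolated event among normal traffic
flagged = density_outliers(events, eps=3, min_neighbors=2)
```

The sensitivity the article describes is visible in the two parameters: too large an eps absorbs outliers into the dense region, too small an eps floods the result with discrete points.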

Hierarchical clustering

The principle of hierarchical clustering is to first treat all events as leaf nodes of a tree, with each leaf node forming its own class. Then, based on their distance from each other, the nodes are merged layer by layer from bottom to top, eventually forming a root.

Hierarchical clustering can merge clusters layer by layer as needed, based on the final number of clusters. The resulting small clusters can be considered outliers, which may represent threat events. It's clear that the core of hierarchical clustering remains the choice of distance calculation model.
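A bottom-up single-linkage sketch on invented one-dimensional data, merging the two closest clusters until the desired count remains:

```python
def agglomerate(points, target_clusters):
    # Start with every event as its own leaf cluster, then repeatedly merge
    # the two closest clusters (single linkage: minimum pairwise distance)
    # until only `target_clusters` remain.
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters

groups = agglomerate([1, 2, 3, 50, 51, 90], target_clusters=3)
# small surviving clusters (here the singleton 90) are candidate outliers
```

As the article notes, the distance model is the crux: replacing `abs(a - b)` with a different event-distance function changes which events merge first.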

Intelligent risk analysis

Unsupervised learning can discover many anomalous threats and risks without labeled samples or human intervention. The image below shows a real-world example identified by the ATD system:

Figure 6: Example results of ATD unsupervised learning

This is a real-world case study of ATD's unsupervised learning of enterprise e-commerce operations. The case shows that most users' access paths are concentrated in...

By analyzing the access trends from the login page to the authorization page to the order page, unsupervised learning can cluster the behaviors of normal users. Conversely, malicious behavior such as fraudulent orders bypasses the authorization page and directly accesses the order page. This naturally creates outliers during the unsupervised learning process, allowing us to help businesses identify the threat and risk of fraudulent orders.

VII. AI-enabled data correlation

Horizontal association

AI threat data analysis is divided into vertical data analysis and horizontal data correlation:

Figure 7: Vertical analysis and horizontal correlation of data

Vertical analysis refers to learning patterns from an event group's own timeline to identify existing threats and assess future trends. Horizontal correlation refers to using algorithms to uncover deeper relationships between spatially unrelated event groups, ultimately enabling more accurate threat identification or a more complete retrospective analysis of threat events.

Most SIEM products, especially those with AI tools, can perform tasks such as anomaly detection and trend prediction (although the vast majority rely on supervised learning, meaning customers must provide large numbers of labeled threat and normal event samples). These tasks, however, are vertical analysis, not horizontal correlation. For next-generation SIEM@AI systems, the task even more challenging than unsupervised vertical analysis is establishing latent correlations across massive amounts of seemingly unrelated data, thereby achieving truly deep threat identification.

Event-related calculations

Common event association scenarios can be broadly categorized into two types:

A. For a set of events within a specific scope (such as a specific time period), explore the relationships between these events. For example:

The figure above shows two events reported by completely different systems; we need algorithms to determine whether they are correlated. This process can be transformed into row-based correlation analysis.

B. For events of the same type, investigate whether there are correlations among their constituent factors. For example:

As shown in the diagram above, the process of determining whether there are correlations among the various factors in all "ERP system inaccessible" events can actually be transformed into: column-based correlation analysis.

Therefore, whether we are correlating different events or the intrinsic factors of events of the same type, the problem essentially reduces to the row correlation or column correlation of a matrix. Column correlation can in turn be converted into row correlation by transposing the matrix, that is:

We only need to analyze whether K1 and K2 are correlated to some extent.

For this type of association analysis, the most common approach is to measure the angle between two event vectors (the cosine-similarity measure also used in nearest-neighbor methods) to determine their correlation:

θ = arccos( K1 ⋅ K2 / (|K1| |K2|) )

The smaller the angle between the two event vectors, the more related they are; when the vectors are orthogonal (a 90° angle), the two events are completely unrelated.

Of course, we can also use other methods to calculate correlation, such as Jaccard distance:

J(K1, K2) = |K1 ⋂ K2| / |K1 ⋃ K2|

The larger the J value, the more correlated the two events are; conversely, the smaller the J value, the less correlated they are.

The angle-based measure suits numeric event vectors, while the Jaccard distance suits enumerated, string-valued event vectors. In practice, we can convert any string-valued event into a numeric vector using algorithms such as word2vec or simhash and then apply the angle calculation.

Beer and Diapers

When discussing data correlation, the classic "beer and diapers" story is a must-mention. During correlation analysis of its data, Walmart discovered a correlation between beer and diapers on shopping lists. How did this happen? It turned out that wives often reminded their husbands to buy diapers for the baby on the way home from work, and after buying diapers, the husbands would conveniently pick up their favorite beer, creating a correlation between beer and diaper sales.

From the perspective of algorithmic complexity, the association between beer and diapers is relatively simple and direct. The Apriori algorithm is a simple, practical way to solve this kind of problem: it finds the most strongly associated event elements by repeatedly filtering frequent itemsets and generating new association rules.

Figure 8: Schematic diagram of the Apriori algorithm

A deeper dive into the Apriori algorithm reveals that its entire calculation process is quite similar to calculating the Jaccard distance between events; both essentially compare the similarities between two events and then filter them. However, the Apriori algorithm is more efficient than pairwise comparisons because it incorporates a pruning process to narrow down the possibilities.
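A compact Apriori sketch over toy shopping baskets (the basket contents and support threshold are illustrative):

```python
def apriori(transactions, min_support):
    # Start from single items, then repeatedly extend only itemsets that
    # are already frequent -- the pruning step that makes Apriori cheaper
    # than comparing every pair of events directly.
    frequent = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        counts = {s: sum(1 for t in transactions if s <= t) for s in candidates}
        survivors = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(survivors)
        # Build (k+1)-item candidates only from surviving k-item sets:
        # a superset of an infrequent set can never be frequent.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1}
    return frequent

baskets = [{"beer", "diapers"}, {"beer", "diapers", "milk"},
           {"milk", "bread"}, {"beer", "diapers"}]
result = apriori(baskets, min_support=3)
# {"beer"}, {"diapers"}, and {"beer", "diapers"} each reach the threshold
```

The pruning in the candidate-generation step is exactly the narrowing of possibilities the paragraph above describes.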

More subtle connections between events

In fact, in ATD's actual application scenarios for customer service, the "beer, diapers" mentioned above are relatively simple event association models. What's more complex is how to discover relationships that aren't so directly related from a human perception perspective. For example, the relationship between air pollution levels and urban electricity consumption isn't particularly direct from a human perception standpoint. However, when we introduce a bridge between the two events—the percentage of people indoors—we discover the following probabilistic relationship:

P(electricity consumption | smog) => P(increase in indoor population | smog) × P(increase in electricity consumption | increase in indoor population), where P(A | B) denotes the probability of event A occurring given event B.

If we could enumerate all the core events caused by smog, we could use the law of total probability to derive the relationship between smog and electricity consumption (which is why I used => rather than the equals sign = above).

From a threat identification perspective, this bridging event can similarly be used to construct a relationship between two seemingly unrelated events. For example, during the deployment of ATD for a home appliance company, we discovered a suspected CC attack that was actually related to a database change operation in a backend business line:

P(suspected CC attack | business-line database change) => P(suspected CC attack | sudden surge in API access) × P(sudden surge in API access | 504 percentage) × P(504 percentage | request blocking) × P(request blocking | database blocking) × P(database blocking | database change)
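A toy sketch of chaining such bridge probabilities; all numbers below are invented placeholders, not values measured in the deployment described above:

```python
# Illustrative probabilities only -- each entry is one bridge event in a
# chain linking "database change" to "suspected CC attack".
chain = [
    ("P(API surge | DB change)",        0.8),
    ("P(504 spike | API surge)",        0.7),
    ("P(request blocking | 504 spike)", 0.9),
    ("P(suspected CC | blocking)",      0.6),
]

linkage = 1.0
for _, p in chain:
    linkage *= p   # multiply along the bridge events; result is ~0.30 here
```

A chain whose product stays large suggests the two end events really are connected through the intermediate bridges, while any weak link drives the product toward zero.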

The prerequisite for uncovering such complex, subtle event correlations is to first collect all information, whether or not it seems relevant (this is exactly what the SIEM acquisition layer described at the beginning of this article must address): collect as much data as possible, because correlations can only be established over data that has been collected. But once massive amounts of data are collected, a further problem arises in correlation analysis: the sheer volume of data drives analysis performance down. If threat event analysis is not timely, subsequent handling suffers significantly, so low latency throughout the entire analysis process is crucial.

Data dimensionality reduction

How can we ensure processing speed? We need to reduce the dimensionality of the data to decrease computational space. There are two approaches to this:

1. Supervised dimensionality reduction

If a company has a large amount of labeled data, it can use supervised dimensionality reduction. The classic supervised method is Linear Discriminant Analysis, which selects an optimal projection from a high-dimensional space to a low-dimensional one while ensuring that the labeled classes remain well separated after projection. (PCA, often mentioned in the same breath, is strictly speaking unsupervised: it maximizes variance without using labels.)

2. Unsupervised dimensionality reduction

Unsupervised dimensionality reduction can be used even without a large amount of labeled data, which is precisely the data dimensionality reduction method used by ATD. There are many algorithms that can perform unsupervised dimensionality reduction. ATD initially used the LDA (Latent Dirichlet Allocation) topic discovery model for dimensionality reduction. LDA first clusters the data according to topic relevance, reducing the number and dimensionality of data in each class, thereby reducing the complexity of subsequent calculations.

Here, I would like to introduce another method of data dimensionality reduction, which is what we are currently trying—SVD (Singular Value Decomposition).

Figure 9: SVD decomposition of threat events

As shown in Figure 9, we first perform SVD on the massive event matrix. The decomposition yields the product of three matrices; by filtering the elements of the middle Σ matrix (the singular values), we reduce the complexity of the entire event cluster and surface related events and factors under the same latent topic. The number of latent topics retained is essentially the (truncated) rank of the event matrix.
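The Σ-filtering step can be sketched with NumPy on a toy event matrix (rows as events, columns as factors; all values invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy event matrix: 100 events described by 20 factors, but secretly
# generated from only 3 latent topics, so its effective rank is 3.
topics  = rng.normal(size=(3, 20))
weights = rng.normal(size=(100, 3))
A = weights @ topics

# Full SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Filter the middle Σ matrix: keep only singular values carrying real energy.
k = int((s > 1e-8 * s[0]).sum())   # numerical rank = number of latent topics
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(k, np.allclose(A, A_k))
```

Here the filter recovers exactly the 3 latent topics that generated the data; on real event matrices one would keep the top-k singular values that capture most of the energy, trading a small reconstruction error for a large drop in computational cost.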

From a deeper perspective, both LDA and SVD essentially aim to find the rank of an event matrix. By using the rank, they identify the core factors constituting an event. For example, in an intrusion event, the core factors might be user attributes (internal/external user, authorization status, job level, etc.), the time of intrusion, and the type of business being intruded upon. Other factors, such as employee age and server load at the time, might be automatically identified as non-critical and ignored by the algorithm. This approach allows for the discovery of key factors within a vast amount of information, significantly reducing the computational burden for subsequent event correlation.

In summary, horizontal data correlation is an extremely challenging task. The first prerequisite is collecting sufficient data through the SIEM acquisition layer; next, appropriate algorithms must be chosen to process that data; finally, AI algorithms perform the correlation analysis itself. In actual use by ATD customers, we successfully discovered the relationship between external network interface attacks and internal network database changes, as well as the relationship between Exchange log events and internal network SSH events in a certain email system. Such correlation analysis is not only helpful for backtracking known threats but also of great significance for future security posture awareness.

VIII. Exploration of Future Directions

Moving from the SIEM&AI model to the SIEM@AI model, we no longer view AI as a plugin or tool; instead, the entire platform runs on AI. On this platform we eliminate the need for labeled data, extensive manual intervention, and customized rules: abnormal threat events are identified automatically by machine learning algorithms based primarily on unsupervised learning, which also establish the inherent connections between complex events. This improves both the precision and recall of detection, frees up security engineers' time and increases their efficiency, and ultimately delivers three-tiered intelligent defense across the enterprise's external network, business operations, and internal network.

Baishan ATD's product is a completely new SIEM@AI system. We have invested significant time and effort in developing AI algorithms based on unsupervised learning to replace current traditional enterprise security products, and the effectiveness of this model has been validated in enterprise practice. In the future, ATD will conduct further research and exploration in two directions:

1. Introducing human participation through active learning

The purpose of introducing unsupervised learning is to avoid relying on labeled samples, because obtaining labeled samples in the security field is extremely costly. However, this does not mean that human intervention can be completely eliminated. For the foreseeable future, experienced security experts remain crucial for identifying threats, refining algorithms, and maintaining the robustness of the entire AI system. But security experts have limited time and energy, so reducing their time cost while still identifying security threats accurately and comprehensively is critical.

To address this, we introduce an active learning algorithm, a special type of semi-supervised learning. It relies on security experts to manually verify a small number of AI-generated results, continuously fine-tuning the original algorithm until it converges. Two factors are crucial in active learning: how to select which identification results go to manual verification, and how to feed the corrections back into the model. Through active learning, we can build a continuously learning and evolving SIEM system that becomes increasingly intelligent and accurate as it incorporates human feedback.
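The first of those two factors, selecting which results to verify, is commonly handled by uncertainty sampling. Here is a minimal sketch using a toy 1-D anomaly-score threshold and a simulated security expert as the labeling oracle; everything in it is invented for illustration:

```python
# Toy active learning: learn a 1-D anomaly threshold by asking a
# (simulated) security expert to label only the most uncertain samples.

TRUE_THRESHOLD = 0.62          # ground truth known only to the "expert"

def expert_label(x):
    """Simulated manual verification by a security expert."""
    return 1 if x >= TRUE_THRESHOLD else 0

pool = [i / 100 for i in range(101)]   # unlabeled anomaly scores 0.00..1.00
labeled = {0.0: 0, 1.0: 1}             # two seed labels
queries = 0

for _ in range(10):
    lo = max(x for x, y in labeled.items() if y == 0)
    hi = min(x for x, y in labeled.items() if y == 1)
    boundary = (lo + hi) / 2           # current model: midpoint threshold
    # Uncertainty sampling: query the unlabeled point nearest the boundary,
    # i.e. the one the current model is least sure about.
    x = min((p for p in pool if p not in labeled),
            key=lambda p: abs(p - boundary))
    labeled[x] = expert_label(x)
    queries += 1

lo = max(x for x, y in labeled.items() if y == 0)
hi = min(x for x, y in labeled.items() if y == 1)
learned = (lo + hi) / 2
print(queries, round(learned, 3))
```

With about ten expert queries the learned threshold lands very close to the true one, whereas labeling the whole pool would cost a hundred; this asymmetry is precisely why active learning suits the high labeling costs of the security field.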

2. Identifying non-intuitive threats through deep learning

Some threats or anomalies lack any intuitive description and are difficult to vectorize or discretize by hand. The most direct example is encrypted traffic: to a human it is just an opaque binary stream. Other security incidents touch so many business processes that it is hard to explain in words why they were initially flagged as anomalies. Deep learning algorithms can address these problems, but they require a large number of labeled samples to be effective. This means enterprises must continuously accumulate threat event verdicts during the day-to-day operation of their SIEM systems; once enough data has accumulated, deep learning algorithms can be applied.
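As a small illustration of how an opaque encrypted stream can still be turned into a numeric input for a deep model, here is a byte-histogram-plus-entropy featurizer. The feature choice is our own illustrative one, not ATD's actual feature set:

```python
import math
from collections import Counter

def byte_features(stream: bytes):
    """Turn an opaque byte stream into a fixed-length numeric vector:
    a normalized 256-bin byte histogram plus its Shannon entropy (bits).
    A deep model can consume this even when the bytes themselves are
    meaningless to a human reader."""
    counts = Counter(stream)
    n = len(stream)
    hist = [counts.get(b, 0) / n for b in range(256)]
    entropy = -sum(p * math.log2(p) for p in hist if p > 0)
    return hist + [entropy]     # 257-dimensional feature vector

# Plaintext-like data is low-entropy; ciphertext-like data approaches 8 bits.
low  = byte_features(b"A" * 1024)
high = byte_features(bytes(range(256)) * 4)
print(round(low[-1], 2), round(high[-1], 2))
```

High per-flow entropy is one of the few signals that survives encryption, which is why such coarse statistics are a common starting point before handing the stream to a learned model.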

As a disruptive technology in the security field, AI, when combined with SIEM, will build a new generation of SIEM@AI platform that is fully AI-based, intelligent, and requires little or no human intervention. This will change the current security product model that relies on policy settings and become the next generation of enterprise security brain.
