A talk by Leman Akoglu
Last week, on October 18, I attended a talk held in the Machine Learning Department of Carnegie Mellon University, and I now have a chance to write about it on my blog. The reason I chose to attend was that I saw the buzzword "security" in its abstract. As I already knew before attending, one of the key domains applying anomaly mining is security. Since I thought I might need the techniques discussed in this talk in my future studies, I decided to go. Below is its summary…
At the beginning of the talk, Prof. Akoglu mentioned the application areas of anomaly detection, which include advanced persistent threats, sensor monitoring, identity theft, and social security. Not being able to catch the connection while taking notes in a hurry, I noted down the 3V's of data as she mentioned them. We know that big data is defined by its 3V's by most authors, so she most probably mentioned them because the anomaly mining problems her group works on involve big data. Those 3V's are Volume, Velocity, and Variety, corresponding to the large scale of the data, its variation in time, and the diversity of data sources, respectively.
In the part she titled "Anomaly mining vs. detection", she introduced the 3D's of anomaly mining, a concept she defined. Those are Definition, Detection, and Description, corresponding to "What is an anomaly?", "How can we find it?", and "Why is it anomalous? Who did it?". Again, she mentioned real-world applications, but this time for anomaly mining; example questions in those applications are "Is traffic normal?" and "Is there political unrest?". She mentioned two basic methods for solving anomaly mining problems: outlier detection and filtering. In addition, she talked about the challenges people face in these studies. First, the data samples are not labeled, so there is no ground truth. Second, anomalies are rare, so the data are quite imbalanced. Last, there is an asymmetric cost for the analysis results; more precisely, the cost of a false positive differs from that of a false negative. Moreover, she touched on the importance of assisting end-users in anomaly mining applications. She elaborated on the point that end-users need justification, so we should help them prioritize, resolve, and explore the results of the analysis.
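To make the asymmetric-cost challenge concrete for myself, here is a small sketch of my own (not from the talk); the cost values are hypothetical, chosen only to show that a missed anomaly can be far more expensive than a false alarm:

```python
# Illustrative sketch: asymmetric misclassification cost in anomaly detection.
# Both cost constants below are made-up numbers for illustration.

COST_FALSE_POSITIVE = 1.0    # analyst wastes some time on a false alarm
COST_FALSE_NEGATIVE = 100.0  # a real anomaly (e.g., an intrusion) is missed

def total_cost(y_true, y_pred):
    """Sum asymmetric costs over predictions (1 = anomaly, 0 = normal)."""
    cost = 0.0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 0:
            cost += COST_FALSE_POSITIVE   # false positive
        elif pred == 0 and truth == 1:
            cost += COST_FALSE_NEGATIVE   # false negative
    return cost

# Two detectors, each making exactly two errors, with very different costs:
y_true = [0, 0, 0, 1, 0, 1]
noisy  = [1, 1, 0, 1, 0, 1]   # 2 false positives, no misses
blind  = [0, 0, 0, 0, 0, 0]   # 2 false negatives
print(total_cost(y_true, noisy))  # 2.0
print(total_cost(y_true, blind))  # 200.0
```

This is why plain accuracy is misleading here: both detectors make two mistakes, yet one is a hundred times more costly than the other.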
After this introductory, background-style part, she moved on to her research roadmap. She mainly explained two sample application scenarios to illustrate anomaly mining techniques, referring to her previous papers. Before presenting the examples, she mentioned the questions to ask when trying to solve an anomaly mining problem. These are the following 4 questions:
1) What is the problem?
2) What data is available?
3) What do we know about the anomalies or normal behavior?
4) What are the system requirements?
The first application scenario is anomaly detection in social networks. Here the task is to find anomalous subgraphs (social circles). Hence, the answer to the first question is: given a set of attributed subgraphs, find the poorly defined ones. The data available are the social network of users and their profile attributes. The system requirement is scalability in the number of attributes (which may be in the millions). She elaborated on the answer to the first question in the following part, where the problem is restated as: given an attributed subgraph, how do we quantify its quality? They proposed two types of measures to quantify the quality of the graph: internal (internal consistency) and external (external separability) measures. In other words, the normality of a graph is tested using those measures. The ultimate goal was to optimize the normality by maximizing over the weights of the attributes in the graph, and they showed that their optimization algorithm is linear in the number of attributes. She illustrated this with a research network from DBLP of researchers working on telescopic op-amps, where keywords from the articles were used as the attributes to construct the network.
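As I understood it, a subgraph's normality rewards attribute agreement on internal edges and penalizes attribute agreement across the boundary. The following is my own simplified sketch of that idea; the function names, the binary-attribute encoding, and the exact formula are my assumptions, not the measure from her papers:

```python
# Simplified sketch (not the exact measure from the talk) of scoring an
# attributed subgraph by internal consistency and external separability.
# Node attributes are binary vectors; `w` weights each attribute's importance.

def agreement(a, b, w):
    """Weighted overlap between two nodes' binary attribute vectors."""
    return sum(wi for ai, bi, wi in zip(a, b, w) if ai == 1 and bi == 1)

def normality(edges, boundary_edges, attrs, w):
    """Internal consistency minus external similarity (higher = more normal).

    edges: (u, v) pairs inside the subgraph
    boundary_edges: (u, v) pairs with u inside and v outside
    attrs: dict mapping node id -> binary attribute vector
    w: attribute weight vector
    """
    internal = sum(agreement(attrs[u], attrs[v], w) for u, v in edges)
    external = sum(agreement(attrs[u], attrs[v], w) for u, v in boundary_edges)
    return internal - external

# Hypothetical toy data: nodes 1-3 form the subgraph, node 4 is outside.
attrs = {1: [1, 0], 2: [1, 0], 3: [1, 1], 4: [1, 0]}
w = [1.0, 0.5]  # hypothetical attribute weights
score = normality([(1, 2), (2, 3)], [(3, 4)], attrs, w)
print(score)  # 1.0 (internal 2.0 - external 1.0)
```

In the actual work, it is the attribute weight vector that gets optimized to maximize this kind of score, which is where the linear-in-the-number-of-attributes complexity she mentioned comes in.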
The second application scenario is host-level intrusion detection via anomaly mining techniques. The answers to the above 4 questions are as follows: 1) Given a stream of system events, find suspicious system activity (taint tracking is performed to create meaningful logical information flows), 2) The data available are the individual events in the logs, 3) We know that malicious flows differ from benign flows, 4) The system must be real-time and fast. They constructed a graph from the event log such that every node represents a process. They tagged the events based on their time stamps, and events tagged with the same label were grouped into the same graph. Finally, they constructed clusters of benign events and flagged the events deviating from the normal.
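Based on my notes, the pipeline groups events by tag into per-tag process graphs and then looks for graphs deviating from the benign baseline. Below is a rough, self-invented sketch of that flow; the event format, the tag and process names, and the simple z-score deviation test are all my assumptions, not their actual method:

```python
# Rough sketch: group tagged system events into per-tag process graphs and
# flag graphs whose size deviates strongly from the rest (z-score outlier).
from collections import defaultdict
from statistics import mean, stdev

def build_graphs(events):
    """events: iterable of (tag, src_process, dst_process) tuples."""
    graphs = defaultdict(set)            # tag -> set of (src, dst) edges
    for tag, src, dst in events:
        graphs[tag].add((src, dst))
    return graphs

def flag_anomalies(graphs, z_threshold=2.0):
    """Flag tags whose graph size is an outlier relative to the others."""
    sizes = {tag: len(edges) for tag, edges in graphs.items()}
    mu, sigma = mean(sizes.values()), stdev(sizes.values())
    return [tag for tag, s in sizes.items()
            if sigma > 0 and abs(s - mu) / sigma > z_threshold]

# Hypothetical event log: seven small benign flows and one sprawling one.
events = []
for i in range(7):
    events += [(f"t{i}", "init", f"p{i}"), (f"t{i}", f"p{i}", "logd")]
events += [("evil", "init", f"q{j}") for j in range(20)]
print(flag_anomalies(build_graphs(events)))  # ['evil']
```

Of course, this toy version only compares graph sizes; the talk described far richer structure (clusters of benign flow graphs, with deviations from those clusters flagged as suspicious).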
The talk was quite beneficial for me, as I saw a real, concrete example of graph theory and anomaly detection techniques applied in the intrusion detection domain. I had a rough idea about anomaly detection in the security domain before attending the talk, but it helped me visualize it.