Who’s on First?

I wanna share with you a little problem that we solve. While it’s not the most important issue we’re facing, it nicely highlights the sort of cyber-security big-data challenges we deal with almost every day. Or, if you wish, the devil that lives in our details.

What is the problem? Occasionally, an organizational entity has multiple names. For example, the user John Doe might have the following names: [email protected], [email protected], [email protected]. The first two are Active Directory names: the first is his misc username, and the second belongs to his administrator account. The third is his Google Docs account name. (Notice that his Google Docs username hints that our organization has more than one John Doe.) Now, we want to track all the actions done by John Doe. However, our logs identify actions by username, not by person. In other words, we wanna know (let’s say) who is [email protected]? (And this is why I named this blog post after Abbott and Costello’s well-known sketch.)

How do we use it? Fortscale’s main business is all about user intelligence. (Check out Ben’s great post about this subject.) A core element of the intelligence process is collecting and packaging user information into a “profile” that describes the user from a security angle. Thus, unifying all traces of the same entity is a core capability in our system. For example, we connect the “person” with all of his usernames, and through them with all of his activities.
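To make the idea of unifying traces concrete, here is a minimal sketch of the simplest possible version: a lookup table from usernames to a person, used to group log events. It is only an illustration and not Fortscale’s method. The alias table, the person IDs, the event format, and every username are invented for the example (they are not the addresses mentioned above), and the sketch assumes the mapping is already known, which is exactly the hard part.

```python
from collections import defaultdict

# Hypothetical alias table: every username and person ID here is invented.
ALIASES = {
    "jdoe@corp.example": "person-001",
    "jdoe-admin@corp.example": "person-001",
    "john.doe2@docs.example": "person-001",
    "asmith@corp.example": "person-002",
}


def group_events_by_person(events, aliases):
    """Group (username, action) log events under the person who owns the username."""
    by_person = defaultdict(list)
    for username, action in events:
        key = username.lower()  # assumption: usernames are case-insensitive
        # Usernames we cannot resolve stay under their own key instead of being dropped.
        person = aliases.get(key, "unresolved:" + key)
        by_person[person].append(action)
    return dict(by_person)


# Hypothetical log events, in the order they were recorded.
events = [
    ("jdoe@corp.example", "vpn_login"),
    ("JDOE-ADMIN@corp.example", "password_reset"),
    ("john.doe2@docs.example", "file_share"),
    ("guest@corp.example", "vpn_login"),
]

profiles = group_events_by_person(events, ALIASES)
```

Keeping unresolved usernames visible, rather than silently discarding them, is a deliberate choice in this sketch: an account nobody can attribute to a person is itself interesting from a security angle.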


Is it a hard problem? KDD Cup 2013, the data-mining competition held alongside the recent KDD conference, posed a public challenge that resembles this problem. In the KDD Cup’13 challenge, the participants had to determine which papers in a Microsoft Academic Search author profile were truly written by that author. The only data they had to work with was lists of author names and corresponding information about the papers (paper name, publication venue, and so on). The challenges they faced are close to our “who’s on first?” problem: variations of an author’s name (e.g., Bryan Smith and Bryan J. Smith), and multiple authors sharing the same name. The data-mining community and Microsoft think this is a challenging problem. That in itself is a good indication of the problem’s “hardness.”

How does Fortscale solve it? Unfortunately, I’m not allowed to describe our solution. Instead, let me share our guidelines for evaluating a solution. We demand that our solution be simple, so that it can be effectively tested and implemented. Mind that the previous sentence was carefully phrased. It hints that our solution must be testable, in the sense that we can state how well it delivers results (we are not talking about QA tests here). This allows us to tune and optimize it further. Note that we also take implementation considerations into account when examining possible solutions. Remember, this solution is merely a utility within our overall big-data system. We must deliver solutions that are feasible not only in terms of runtime and memory overheads, but also in terms of engineering complexity.
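One way to picture “testable” is a score computed against a small hand-labelled sample. The sketch below uses pairwise precision and recall, a common way to evaluate entity-resolution output. It is a generic technique swapped in for illustration, not Fortscale’s metric or solution; the function names, usernames, and labels are all hypothetical.

```python
from itertools import combinations


def same_person_pairs(assignment):
    """Return the set of username pairs that an assignment puts under one person."""
    by_person = {}
    for username, person in assignment.items():
        by_person.setdefault(person, []).append(username)
    pairs = set()
    for usernames in by_person.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(usernames), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted, truth):
    """Compare predicted groupings with hand-labelled ones, pair by pair."""
    pred_pairs = same_person_pairs(predicted)
    true_pairs = same_person_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # By convention here, an empty set of pairs scores 1.0 rather than dividing by zero.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical hand-labelled truth: u1, u2, u3 are one person; u4 is another.
truth = {"u1": "A", "u2": "A", "u3": "A", "u4": "B"}
# Hypothetical output of some candidate solution (its group labels are arbitrary).
predicted = {"u1": "x", "u2": "x", "u3": "y", "u4": "y"}

precision, recall = pairwise_precision_recall(predicted, truth)
```

In this toy case the candidate links two pairs, of which one is right, and it finds one of the three true pairs. Low precision means different people are merged into one profile; low recall means one person’s activity is split across several profiles. Which error hurts more depends on how the profile is used.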

So, who’s on first?