Process Discovery: The Basics

This lesson is a part of an audio course Process Mining as the Bridge between Process Science and Data Science by Wil van der Aalst

In the previous lesson, we learned that there are four types of process mining: process discovery, conformance checking, process enhancement, and operational support. In this lesson, we focus on the first type: process discovery. Process discovery techniques transform event data into process models. People who see process discovery results for their own processes for the first time are often astonished that this is possible. As mentioned before, the effect is similar to seeing snow for the first time in your life. People's jaws often drop when they see process models showing the real processes in an organization. Suddenly they realize that reality is very different from what they expected.

The input for process discovery is an event log where, for each event, we only use the case identifier, the activity name, and the timestamp. This allows us to create a trace for each case. Remember that a case may represent an order, an application, or a patient. Cases are also called process instances, and the trace of a case describes the sequence of activities executed for this case. Let us assume that we have activities A, B, C, D, and E. For example, activity A may correspond to placing an order, and activity D may correspond to making a payment. I'm using abstract names for activities just to be clear. For one case, we may see the trace ABCE, and for another case, we may see ADE. Now assume we have an event log with 100 cases where 50 cases follow trace ABCE, 30 cases follow trace ACBE, and 20 cases follow trace ADE. This event log contains 50×4 + 30×4 + 20×3 = 380 events.
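The step from events to traces, and the event count for the example log, can be sketched in a few lines of Python. This is only an illustration: the function name `build_traces`, the tuple layout (case identifier, activity, timestamp), and the tiny two-case log are assumptions made here, not something prescribed by the lesson.

```python
from collections import defaultdict

def build_traces(events):
    """Group (case_id, activity, timestamp) tuples into one trace per case."""
    by_case = defaultdict(list)
    for case_id, activity, timestamp in events:
        by_case[case_id].append((timestamp, activity))
    # Order each case's events by timestamp and keep only the activity names.
    return {case: [a for _, a in sorted(evs)] for case, evs in by_case.items()}

# Made-up events for two cases, deliberately listed out of order.
events = [
    ("case-2", "D", 2), ("case-1", "A", 1), ("case-1", "C", 3),
    ("case-2", "A", 1), ("case-1", "B", 2), ("case-2", "E", 3),
    ("case-1", "E", 4),
]
traces = build_traces(events)
print(traces["case-1"], traces["case-2"])

# The 100-case example log, stored compactly as trace -> number of cases.
log = {"ABCE": 50, "ACBE": 30, "ADE": 20}
total_cases = sum(log.values())
total_events = sum(len(trace) * count for trace, count in log.items())
print(total_cases, total_events)
```

Storing each distinct trace once with its frequency keeps the example small; a real event log would normally keep the individual events.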

A very simple process discovery technique is to create a so-called directly-follows graph. In such a graphical model, the nodes represent activities, and the edges between activities represent the directly follows relations. In our example, we have a node for each of the activities A, B, C, D, and E. We can also attach frequencies to such nodes. For example, in our abstract example with 100 cases, the activities A and E are executed 100 times, B and C 80 times, and D 20 times. The edges between these activity nodes are created based on the so-called directly follows relation. We count how often one activity is followed by another activity within the same case. In our small running example, A is followed 50 times by B, A is followed 30 times by C, A is followed 20 times by D, and A is never followed by E. We can think of these frequencies as weights showing the importance of a directly follows relationship between two activities. We can remove the edges that have a low frequency. We can also remove nodes that have a low frequency. These nodes represent activities that are rare in the log.
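The counting described above can be sketched as follows. This is a minimal illustration in Python, assuming the example log is stored as a mapping from trace to number of cases; the names `directly_follows` and `filter_edges` and the threshold of 30 are hypothetical choices for this sketch.

```python
from collections import Counter

# Example log from the lesson: trace -> number of cases following it.
log = {"ABCE": 50, "ACBE": 30, "ADE": 20}

def directly_follows(log):
    """Return activity frequencies and directly-follows frequencies."""
    nodes = Counter()
    edges = Counter()
    for trace, count in log.items():
        for activity in trace:
            nodes[activity] += count
        # Pair every activity with its immediate successor in the same case.
        for a, b in zip(trace, trace[1:]):
            edges[(a, b)] += count
    return nodes, edges

def filter_edges(edges, min_frequency):
    """Keep only the edges seen at least min_frequency times."""
    return {edge: n for edge, n in edges.items() if n >= min_frequency}

nodes, edges = directly_follows(log)
frequent = filter_edges(edges, min_frequency=30)

for (a, b), n in sorted(edges.items()):
    print(f"{a} -> {b}: {n}")
```

With a threshold of 30, the two edges through D (each seen 20 times) are dropped, which corresponds to hiding the rare path in the model.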

One can think of the resulting directly-follows graph as a geographic map. The nodes represent activities rather than cities, and the edges represent causal dependencies rather than roads. Like in Google Maps, we can decide how much detail we want to see in the process. We can look at just the big cities and highways in the process, or we can also consider smaller towns and dirt roads.

While building the directly-follows graph, we used the timestamps only to order the events within each case and ignored the other event attributes. However, these can be used to annotate the resulting graph with additional information. For example, for each edge representing a directly-follows relationship between two activities, we can report the total time, the average time, the minimal time, and the maximal time between the two corresponding activities. We can use this to highlight the bottlenecks in the process. For example, we may find that the average time between activity D and E is more than 15 days, resulting in delays. It is important to realize that such simple process models can be learned from the data in today's information systems. Moreover, such models are based on facts rather than subjective impressions.
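The time annotations can be sketched in the same style. The Python below is a minimal illustration, assuming each case is a time-ordered list of (activity, time) pairs with time measured in days; the function name `edge_durations` and all timestamps are made up for this sketch.

```python
from collections import defaultdict
from statistics import mean

def edge_durations(cases):
    """Summarize the elapsed time on every directly-follows edge.

    cases maps a case id to a list of (activity, time) pairs sorted by time.
    """
    waits = defaultdict(list)
    for events in cases.values():
        for (a, t_a), (b, t_b) in zip(events, events[1:]):
            waits[(a, b)].append(t_b - t_a)
    return {
        edge: {"total": sum(w), "mean": mean(w), "min": min(w), "max": max(w)}
        for edge, w in waits.items()
    }

# Hypothetical cases; times are days since the start of the case.
cases = {
    "c1": [("A", 0), ("D", 1), ("E", 15)],
    "c2": [("A", 0), ("D", 2), ("E", 20)],
    "c3": [("A", 0), ("B", 1), ("C", 2), ("E", 3)],
}
stats = edge_durations(cases)

# The edge with the largest average waiting time is a candidate bottleneck.
bottleneck = max(stats, key=lambda edge: stats[edge]["mean"])
print(bottleneck, stats[bottleneck])
```

In this made-up data the edge from D to E has the highest average waiting time, so it is the one that would be highlighted in the graph.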

Directly-follows graphs are simple, and it is easy to discover them. All of the over 35 commercial process mining tools support the discovery of such models, and this is extremely scalable. It is no problem to handle event logs with millions of events. However, directly-follows graphs are just the starting point. For example, they cannot capture concurrency and tend to produce spaghetti-like models. To explain this, consider the directly-follows graph based on the event log with 100 cases where 50 cases follow trace ABCE, 30 cases follow trace ACBE, and 20 cases follow trace ADE. The model also allows for the trace ABE because there are connections from A to B and from B to E. However, this never happened. The directly-follows graph also allows for traces like ABCBCBCBE, illustrating the problem. Therefore, we need more advanced process discovery techniques that are able to discover the process that starts with A, is then followed by either B and C (in any order) or just D, and then completes with activity E. All of the more advanced process discovery techniques can do this. Process mining can be used to discover Petri nets, BPMN diagrams, UML activity diagrams, statecharts, etc. These notations can model sequences, choices, concurrency, and loops.
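This limitation can be checked with a small sketch. The Python below treats the directly-follows graph of the example log as a set of edges and tests whether a candidate trace is a path from a start activity to an end activity. The helper name `allowed_by_dfg` is an assumption made here, and the check only handles non-empty traces.

```python
# Example log from the lesson: trace -> number of cases following it.
log = {"ABCE": 50, "ACBE": 30, "ADE": 20}

# Edges, start activities, and end activities of the directly-follows graph.
edges = {(a, b) for trace in log for a, b in zip(trace, trace[1:])}
starts = {trace[0] for trace in log}
ends = {trace[-1] for trace in log}

def allowed_by_dfg(trace):
    """True if the (non-empty) trace is a start-to-end path in the graph."""
    return (
        trace[0] in starts
        and trace[-1] in ends
        and all(step in edges for step in zip(trace, trace[1:]))
    )

for candidate in ["ABCE", "ABE", "ACE", "ABCBCBCBE", "AE"]:
    print(candidate, "allowed:", allowed_by_dfg(candidate), "observed:", candidate in log)
```

ABE, ACE, and ABCBCBCBE are all accepted by the graph although none of them occurs in the log, because the edges between B and C in both directions form a loop and each of B and C also connects directly to E.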

In the next lesson, you will learn more about process discovery techniques that go beyond discovering a simple directly-follows graph.
