
Getting the Right Event Data

In the earlier lessons, we have seen that process mining is able to do amazing things. We can automatically learn process models and reveal performance and conformance problems in an objective manner. People seeing process mining results for the first time are often flabbergasted that this is all possible. It is even possible to find root causes for performance and conformance problems, predict such problems, and recommend countermeasures. The algorithms and software tools are available to do this. However, all of this is only possible if the right event data are available. Getting the right event data is the topic of this lesson.

Organizations that conduct their first process-mining project typically spend 80% of the time on finding and extracting the data, and only 20% on analysis. The effort needed to get reliable event data is sometimes a showstopper. However, if this is done properly, then it is a one-time effort, and new event data are loaded automatically afterwards. Of course, there are organizations that have massive data management problems, with inconsistent data formats, different uncorrelated identifiers, and incomplete recordings. Process mining will reveal such problems, and organizations need to solve these anyway, independent of whether they use process mining or not. Don't shoot the messenger.

Ideally, event data are stored in the XES format. XES is the IEEE standard for storing and exchanging event data. The leading process mining tools support XES and can immediately perform process mining without any further actions. However, we first need to extract such data.
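To make the structure of XES concrete, the following sketch builds a minimal log in Python using only the standard library. The keys `concept:name` and `time:timestamp` are the standard XES attribute keys for the case/activity name and the event time; the case id "order-1" and the activities are invented for illustration.

```python
import xml.etree.ElementTree as ET

# One trace (case) with two events, using the standard XES attribute keys.
log = ET.Element("log", {"xes.version": "1.0"})
trace = ET.SubElement(log, "trace")
ET.SubElement(trace, "string", {"key": "concept:name", "value": "order-1"})
for activity, ts in [("create order", "2024-01-05T10:00:00"),
                     ("ship order", "2024-01-06T14:30:00")]:
    event = ET.SubElement(trace, "event")
    ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
    ET.SubElement(event, "date", {"key": "time:timestamp", "value": ts})

xes = ET.tostring(log, encoding="unicode")
print(xes)
```

A real XES file adds extension declarations and global attributes, but the log/trace/event nesting shown here is the core of the format.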

Event data can be found in almost all information systems. The term information systems should be interpreted broadly; this includes SAP, Oracle, Microsoft Dynamics, Infor, Salesforce, but also hospital information systems, medical equipment like X-ray machines, materials-handling systems, websites, social media platforms, and ticketing systems. Independent of the application, data are often stored in tabular format, for example, in a relational database. Most tables have columns that refer to timestamps or dates. Therefore, such systems are, in principle, loaded with event data. However, the challenge is to locate the right data and to scope it. Moreover, one may need to handle data quality problems when information is entered manually.

As explained in one of the first lessons, each event needs to refer to a case and an activity, and has a timestamp. An event may have additional attributes such as costs, value, weight, customer, location, etc. However, as explained before, the three mandatory attributes of an event are case, activity, and timestamp. When we are analyzing a specific process, we need to scope the collection of cases and activities. We only consider cases of a specific type, e.g., purchase orders. We look for all events directly or indirectly related to such cases. Thereby, we may focus on a subset of activities. This sounds easy, but in many applications, it is not. Events may be scattered across hundreds of tables. Note that larger SAP installations may have tens of thousands of tables. Therefore, one needs domain experience to select the tables relevant for a process.
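The scoping step above can be sketched in a few lines of Python. The rows, column names, and the case type "purchase order" are hypothetical stand-ins for whatever the source tables actually contain; the point is the shape of the result: per case, an activity sequence ordered by timestamp.

```python
# Hypothetical rows extracted from a database table.
rows = [
    {"case_type": "purchase order", "case_id": "PO-1", "activity": "approve", "ts": "2024-03-02"},
    {"case_type": "invoice",        "case_id": "IN-7", "activity": "pay",     "ts": "2024-03-03"},
    {"case_type": "purchase order", "case_id": "PO-1", "activity": "create",  "ts": "2024-03-01"},
]

# Scope: keep only cases of one type, then group events per case,
# ordered by timestamp -- the minimal (case, activity, timestamp) log.
scoped = [r for r in rows if r["case_type"] == "purchase order"]
log = {}
for r in sorted(scoped, key=lambda r: r["ts"]):
    log.setdefault(r["case_id"], []).append(r["activity"])

print(log)  # {'PO-1': ['create', 'approve']}
```

In practice the filtering and joining happen in SQL over many tables, but the target structure is the same.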

Another complication is determining the case notion. Consider, for example, the handling of orders by Amazon. One order may contain multiple items, and packages shipped by Amazon may contain items from multiple orders. Also, items of one order may end up in different packages. Hence, there is a one-to-many relationship between orders and items, and between packages and items. There is a many-to-many relationship between orders and packages. This triggers the question of what a good case notion is. Should instances of the discovered process model describe orders, items, or packages? This leads to the so-called convergence and divergence problems. One has a convergence problem when one event relates to multiple process instances, leading to unintentional duplication. One has a divergence problem when there are multiple concurrent instances within a case. Within one order, the different items may follow a well-defined process, but when items are handled concurrently, this leads to spaghetti-like models not showing the underlying structure.
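The convergence problem can be made concrete with a toy example. Here one "ship package" event covers items from two different orders (all identifiers are invented). Flattening the log onto the order case notion copies that single event into both order traces, inflating event counts.

```python
# Toy data: items belong to orders; one shipping event covers two items.
item_to_order = {"I1": "O1", "I2": "O2"}
events = [
    {"activity": "create item",  "items": ["I1"]},
    {"activity": "create item",  "items": ["I2"]},
    {"activity": "ship package", "items": ["I1", "I2"]},  # one real event
]

# Flatten to an order-level log: each event is copied to every order it touches.
order_log = {}
for e in events:
    for order in {item_to_order[i] for i in e["items"]}:
        order_log.setdefault(order, []).append(e["activity"])

# The single "ship package" event now appears once per order:
# unintentional duplication (convergence).
total_ship = sum(trace.count("ship package") for trace in order_log.values())
print(total_ship)  # 2 copies of what was one event
```

Choosing items as the case notion instead avoids this duplication, but then events belonging to the same order are scattered over multiple traces, which is the flip side of the same modeling dilemma.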

Object-centric process mining is able to address all of these problems. In an object-centric event log, an event does not need to refer to a case, but may refer to any number of objects. A shipping activity may refer to one package, multiple items, one customer, and multiple payments. Using object-centric process mining, it is possible to create holistic process models that show the different object flows. This is something we are working on now in research, and soon such capabilities will also be available in commercial process mining tools.
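A minimal sketch of what an object-centric event record looks like: instead of a single case id, each event lists the objects, of several types, that it refers to. The field names below are illustrative, not the exact schema of any standard; the projection function shows how one flattened view per object type can be derived from the same events.

```python
# One object-centric event referring to objects of four types
# (identifiers invented for illustration).
ocel_events = [
    {
        "activity": "ship",
        "timestamp": "2024-04-01T09:00:00",
        "objects": {
            "package": ["P1"],
            "item": ["I1", "I2"],
            "customer": ["C1"],
            "payment": ["PAY1", "PAY2"],
        },
    }
]

def project(events, obj_type):
    """Project the object-centric log onto one object type."""
    traces = {}
    for e in events:
        for oid in e["objects"].get(obj_type, []):
            traces.setdefault(oid, []).append(e["activity"])
    return traces

print(project(ocel_events, "item"))  # {'I1': ['ship'], 'I2': ['ship']}
```

Because the object references are kept with the event, nothing is duplicated at storage time; duplication only appears when one deliberately projects onto a single object type.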

After locating the data and extracting event logs, one may be confronted with data quality problems. For example, different systems may use different identifiers or different formats. Also, manually entered information may be problematic; for example, the month and the day of the month may be swapped. Problems particularly relevant for process mining are issues related to timestamps. Some events may have very precise timestamps, whereas other events have only a date. As a result, the ordering of events is not always clear. There are process mining techniques to handle such problems. However, it is good to be aware of such complications.
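The mixed-precision timestamp issue can be detected mechanically. The sketch below, with assumed formats and invented event data, parses each timestamp and flags events that carry only a date, since their order relative to other same-day events is ambiguous.

```python
from datetime import datetime

def parse(ts):
    """Return (datetime, precise) where precise is False for date-only values.
    The two accepted formats are assumptions for this sketch."""
    for fmt, precise in [("%Y-%m-%dT%H:%M:%S", True), ("%Y-%m-%d", False)]:
        try:
            return datetime.strptime(ts, fmt), precise
        except ValueError:
            pass
    raise ValueError(f"unrecognized timestamp: {ts}")

events = [("approve", "2024-05-02"), ("create", "2024-05-02T08:15:00")]
coarse = [activity for activity, ts in events if not parse(ts)[1]]
print(coarse)  # ['approve']: date-only, so its order on 2024-05-02 is unclear
```

Flagging such events before discovery is a simple safeguard; otherwise an arbitrary tie-break may silently impose an ordering the data does not support.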

To summarize: The good news is that most information systems are loaded with event data. The bad news is that one needs to invest time and effort to extract the right data. Therefore, it is important not to see process mining as a one-time activity. The return on investment is much higher when data extraction is repeatable, and results are used every day.

In the next lesson, we will focus on process mining tools that help to turn the extracted event data into value.

Written by

Wil van der Aalst