# Derived Descriptive Statistics and Probability Distributions

Our log analytics calculate descriptive statistics and probability distributions.  These help organizations understand processes beyond BPMN diagrams, communicate results to stakeholders, perform more advanced analyses, and kick off turnkey simulations.  For example, the descriptive statistics may be used to develop precise, data-driven Key Performance Indicators (KPIs), which help drive improvement.  In combination, the descriptive statistics and probability distributions may be used as inputs for sophisticated simulations.

## Descriptive Statistics

Event Counts:  These values represent the total number of events which flowed across the trace from State A (rows) to State B (columns).

Transition Probability with Re-Work: These values represent the proportion of events leaving State A (row) that entered State B (column).  For each row, the sum of transition probabilities equals 1.0. “Re-work” is represented as values along the diagonal: an activity state is repeated rather than transitioning to a different activity state.

Transition Probability sans (without) Re-Work: These values represent the proportion of events leaving State A (row) that entered State B (column) after any re-work activities have been removed.  As with the Transition Probabilities with Re-Work, the sum of transition probabilities for each row equals 1.0. Because all “re-work” activities have been removed, there are no values along the diagonal. This view of the process shows the transitions between activities that remain once repetitive states are eliminated.
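
The three matrices above can be illustrated with a minimal sketch. It assumes each trace is simply an ordered list of state names; the function names and the toy traces are hypothetical, and the product's own implementation may differ.

```python
from collections import Counter

def transition_counts(traces):
    """Count A -> B transitions over traces (each trace is a list of state names)."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

def transition_probabilities(counts, drop_rework=False):
    """Row-normalize the counts; optionally drop self-transitions (re-work) first."""
    if drop_rework:
        counts = {(a, b): n for (a, b), n in counts.items() if a != b}
    row_totals = Counter()
    for (a, _), n in counts.items():
        row_totals[a] += n
    return {(a, b): n / row_totals[a] for (a, b), n in counts.items()}

# Hypothetical toy log: two traces over states A, B, C.
traces = [["A", "A", "B", "C"], ["A", "B", "B", "C"]]
counts = transition_counts(traces)
with_rework = transition_probabilities(counts)
sans_rework = transition_probabilities(counts, drop_rework=True)
```

In this toy log, row A splits one third to A (re-work) and two thirds to B when re-work is kept; once the diagonal is dropped, the whole of row A goes to B.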

Mean, Median, and Mode of Event Time: These are basic descriptive statistics based upon the transition time from a preceding State A (rows) to the follow-on State B (columns).  These statistics are calculated both in linear space and log space.

Skew, Kurtosis, Standard Deviation, Variance, and Root Mean Square (RMS): These are used in the probability distribution classification rules.  For example, a Kurtosis of -3 suggests a uniform distribution.  These statistics are calculated both in linear space and log space.

Log Space Statistics: These are used in the classification rules for log-normal distributions.  Some business processes operate in log space, so it is useful to compare measures such as the variance in log space with the variance in linear space to get a sense of whether a process follows the diminishing-returns curve associated with log processes.  A result of -9 is effectively negative infinity.
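
As a rough illustration of computing the same statistics in linear and log space, the sketch below uses population moments and the natural logarithm. Both choices are assumptions, as are the function names; the -9 sentinel mentioned above is not modeled, and the skew and kurtosis are left undefined (`None`) when the variance is zero.

```python
import math

def describe(values):
    """Population mean, variance, std, RMS, skew, and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    stats = {
        "mean": mean,
        "variance": m2,
        "std": math.sqrt(m2),
        "rms": math.sqrt(sum(v * v for v in values) / n),
        "skew": None,
        "excess_kurtosis": None,
    }
    if m2 > 0:  # shape statistics are undefined for constant data
        stats["skew"] = m3 / m2 ** 1.5
        stats["excess_kurtosis"] = m4 / m2 ** 2 - 3.0
    return stats

def describe_log(values):
    """Same statistics on the natural log of the values (values must be positive)."""
    return describe([math.log(v) for v in values])

# Hypothetical transition times in seconds.
times = [1.0, 2.0, 3.0, 4.0, 5.0]
linear = describe(times)
logged = describe_log(times)
```

Comparing `linear["variance"]` with `logged["variance"]` is one way to do the linear-versus-log comparison described above.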

## Probability Distribution Measures

Our AI determines the distributions associated with each process trace.

Normal and Platykurtic:  These distributions probably indicate manual processes that are likely not yet at full capacity.

Uniform or Discrete: These are likely automated processes. Times stay the same because the steps are automated and well below capacity.

Poisson or Normal:  A Poisson distribution may be a good approximation when the mean and variance are within 10% of each other. A trace is classified as “Poisson or normal” when the absolute values of skew and excess kurtosis are both less than 0.5 and the mean and variance are within 10% of each other. Normally this would indicate a Poisson distribution, but our data shouldn’t produce Poisson distributions, hence the label “Poisson or normal.”

Extreme value distributions:  A distribution is classified as Extreme when the absolute value of its skew or excess kurtosis is greater than 2 and it is not uniform or discrete. These include log normal and leptokurtic distributions.  We observe this behavior when process steps are probably near capacity.  For example, employees integral to the process are doing triage and some action items in the queue never get processed.  Note: Extreme value distributions are generally challenging to classify without lots of data or long time periods.
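
The two threshold rules stated above can be sketched as a small function. This is only an illustration of those two rules: the function name, the `is_uniform_or_discrete` flag, the way "within 10%" is measured, and the "unclassified" fallback are all assumptions, and the rules for the normal/platykurtic and uniform/discrete classes are not reproduced because their thresholds are not given here.

```python
def classify_distribution(mean, variance, skew, excess_kurtosis,
                          is_uniform_or_discrete=False):
    """Apply the two threshold rules described in the text (illustrative only)."""
    if is_uniform_or_discrete:
        # Assumed to be decided elsewhere; those rules are not specified here.
        return "uniform or discrete"
    if abs(skew) > 2 or abs(excess_kurtosis) > 2:
        return "extreme value"
    # Assumption: "within 10%" is measured relative to the larger of the two values.
    within_10_percent = abs(mean - variance) <= 0.10 * max(abs(mean), abs(variance))
    if abs(skew) < 0.5 and abs(excess_kurtosis) < 0.5 and within_10_percent:
        return "poisson or normal"
    return "unclassified"

# Hypothetical inputs.
label_a = classify_distribution(10.0, 10.5, 0.1, -0.2)
label_b = classify_distribution(10.0, 50.0, 3.0, 12.0)
label_c = classify_distribution(10.0, 50.0, 0.1, 0.1)
```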

## Queuing Theory 1D Statistics

These statistics are inferred using heuristics or “guesstimates” based on queueing theory.  Simple queueing theory assumes an M/M/1 queue: a model in which a single server (or one tightly coordinated team) serves jobs that arrive according to a Poisson process and have exponentially distributed service requirements.  The capacity guesstimates combine simple queueing theory with a comparison of the total time tasks were processing against the total time on record.  Two caveats apply:

• If the data are incomplete and some tasks are missing from the records, the capacity estimate will be incorrect.
• Simple queueing theory assumes that there is a single “worker” or “process” behind each task. If there is a hidden team of, say, junior lawyers on standby to support a group of senior lawyers, and the events of tasks moving between junior and senior lawyers are not captured, then simple queueing theory breaks down because this is no longer a simple queue.

The file shows what percentage of its total elapsed time each task was in progress. (It’s not as simple as turning that into capacity, because you have to apply statistics.) That task “time to transition” includes both the time waiting in the queue and the time the job was being processed by the worker.

So two of the 1D statistics are measured, and the rest are inferred from simple queueing theory. Because these are 1D statistics, they are shown on the diagonal to keep the same display. (They could also be put into a table row; for the CSV and XLSX exports they should probably be a row rather than a diagonal.)  More exciting are the 1D statistics, including capacity, that are “guesstimated” from simple queueing theory.

Capacity: The capacity variable really causes the bottlenecks to pop out of the data.  If you double the number of customer orders relative to normal (normal being the XES trace) but keep the workers the same (or assume current times reflect worker capacity), what happens to the delays in each process? With the known distributions, it should be possible to answer this, perhaps with some hand-waving assumptions. Where are the bottlenecks? How do the bottlenecks change? How does the average-time heatmap change if everything is at 2x capacity? It may be possible simply to show the bottlenecks in the current process: do transitions take a long time because the processes themselves take a long time, or because there is a bottleneck and orders build up at that station? A simulation might be able to answer this, and the result could be converted into a different heatmap of over/under capacity.  (It’s pretty clear from Erlang’s questions that log normal, extreme value, and similar distributions come from over-capacity situations, as suspected all along.)
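
To make the "double the orders" question concrete, here is a minimal sketch using the textbook M/M/1 result that the mean time in the system is 1 / (service rate - arrival rate). The rates are hypothetical, and a real analysis would use the fitted distributions or a simulation rather than this closed form.

```python
def mm1_time_in_system(arrival_rate, service_rate):
    """Mean time in an M/M/1 system (wait plus service); infinite at or above capacity."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

service_rate = 0.25  # hypothetical: jobs per minute one worker can complete
baseline = mm1_time_in_system(0.10, service_rate)    # normal load
doubled = mm1_time_in_system(0.20, service_rate)     # twice the orders, same worker
overloaded = mm1_time_in_system(0.30, service_rate)  # demand exceeds capacity
```

With these made-up rates, doubling the arrival rate roughly triples the mean delay (from about 6.7 to about 20 minutes), which is the kind of nonlinear build-up that makes bottlenecks stand out.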

Avg Arrival Time: The interval between tasks arriving in the queue. This is an important variable to know for the initial inputs to a business process simulation. (The intervals for deeper layers can be deduced from the simulation.) If this is a supermarket, this is the average interval between customers arriving in the line.

Avg Wait Time: The estimated wait time in seconds before a worker starts processing the job (or the customer gives up). If this is a checkout line, this is the average amount of time spent waiting in line.

Average Processing Time: The average time a worker spends on a job once work on it starts. If this is a supermarket, this is the average amount of time for the checkout person to ring up your order once s/he starts scanning.

Average Transition Time: The amount of time from this task to the start of the next stage, inclusive of waiting in a queue. It should equal average wait time plus average processing time. We know this value precisely from the data. (The interface may say ‘guestimate.’ It isn’t.) We would expect average processing time to remain constant in a business simulator, but wait time might change if many more or many fewer tasks line up at a worker’s simulated door.

Average Jobs Waiting: The average number of people or tasks waiting at any given time. This would be the average length of that clerk’s line at a supermarket.

Average Tasks Per Time: The average number of jobs “in the system” for that state at any given time, which we can also measure precisely from the data (total time of tasks in that state divided by total elapsed time).
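
One way the inferred values could follow from the two measured ones is sketched below, using Little's law and the standard M/M/1 relations. This is an assumption about the approach rather than a description of the actual implementation; the function and key names are hypothetical, and "utilization" here stands in for the capacity figure.

```python
def infer_mm1(avg_jobs_in_system, avg_transition_time):
    """Infer M/M/1 quantities from measured jobs-in-system (L) and transition time (W)."""
    L = avg_jobs_in_system
    W = avg_transition_time
    arrival_rate = L / W                          # Little's law: L = lambda * W
    utilization = L / (1.0 + L)                   # M/M/1: L = rho / (1 - rho)
    processing_time = utilization / arrival_rate  # 1/mu = rho / lambda
    wait_time = W - processing_time               # transition = wait + processing
    jobs_waiting = arrival_rate * wait_time       # Little's law applied to the queue
    return {
        "arrival_interval": 1.0 / arrival_rate,
        "utilization": utilization,
        "processing_time": processing_time,
        "wait_time": wait_time,
        "jobs_waiting": jobs_waiting,
    }

# Hypothetical measurements: one job in the system on average, 10 s per transition.
stats = infer_mm1(avg_jobs_in_system=1.0, avg_transition_time=10.0)
```

For these inputs the sketch gives 50% utilization, with the 10-second transition time splitting evenly into 5 seconds of waiting and 5 seconds of processing.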

## BPMN Social Network and Distance Metrics

Various network distance and centrality metrics are presented which help describe the process ecosystem further. Currently, these include:

Average connector degree: A connector is a gateway or an activity with multiple incoming or outgoing sequence flows in BPMN notation. We define a connector activity as an activity which has at least two incoming or two outgoing sequence flows. The average connector degree indicates how many incoming and outgoing sequence flows the connectors have on average, which aggregates all information about incoming and outgoing sequence flows into one metric. The higher the average connector degree, the less understandable the model.

Connector Heterogeneity: The connector heterogeneity gives the entropy over the gateway types and takes into account the possibility that some gateways may not be present in the model. This metric is inspired by information entropy. To calculate it, the relative frequency of the gateways of each gateway type is computed, and the entropy uses a logarithm whose base equals the number of gateway types. When there is exactly one type of gateway present in the model, or if there are no gateways, the connector heterogeneity is set to 0. In addition, activities with multiple outgoing sequence flows are treated as parallel split gateways, and activities with multiple incoming sequence flows are treated as exclusive join gateways.  The higher the connector heterogeneity, the less understandable the process model.
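
A minimal sketch of this entropy calculation follows. It assumes the logarithm base is the number of gateway types being considered (the keys of the input dictionary), including types with a zero count; the function name and the example counts are hypothetical.

```python
import math

def connector_heterogeneity(gateway_counts):
    """Entropy of gateway-type frequencies, with log base = number of gateway types."""
    n_types = len(gateway_counts)
    total = sum(gateway_counts.values())
    present = [c for c in gateway_counts.values() if c > 0]
    if total == 0 or len(present) <= 1:
        return 0.0  # no gateways, or only one type present
    return -sum((c / total) * math.log(c / total, n_types) for c in present)

# Hypothetical model: two exclusive and two parallel gateways, no inclusive ones.
mixed = connector_heterogeneity({"exclusive": 2, "parallel": 2, "inclusive": 0})
single = connector_heterogeneity({"exclusive": 4, "parallel": 0, "inclusive": 0})
```

With this base, a model that uses every gateway type equally often scores close to 1, and a model with a single gateway type scores 0.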

Connector Mismatch: Connector mismatch is defined as the sum of the absolute value of the difference between the number of split gateways and the number of join gateways for each gateway type. An activity with multiple outgoing sequence flows is considered a parallel split in this definition. In addition, an activity with multiple incoming sequence flows is considered an exclusive join. The bigger the connector mismatch, the less understandable the process model.
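
The definition above translates directly into a short sketch. The function name and the dictionary inputs (gateway type mapped to a count) are hypothetical; activities with multiple flows are assumed to have already been counted as parallel splits or exclusive joins.

```python
def connector_mismatch(split_counts, join_counts):
    """Sum over gateway types of |number of splits - number of joins|."""
    gateway_types = set(split_counts) | set(join_counts)
    return sum(
        abs(split_counts.get(t, 0) - join_counts.get(t, 0)) for t in gateway_types
    )

# Hypothetical model.
splits = {"parallel": 2, "exclusive": 1}
joins = {"parallel": 1, "exclusive": 3, "inclusive": 1}
mismatch = connector_mismatch(splits, joins)
```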

Control Flow Complexity: The control flow complexity takes into account that an inclusive gateway is more difficult to understand than a parallel gateway or exclusive gateway by assigning a higher weight to this element. The higher the control flow complexity, the less understandable the process model.
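
The weights are not spelled out above. As an illustration only, the sketch below uses the weighting from Cardoso's published control-flow complexity metric: an exclusive split counts its number of outgoing flows, an inclusive split counts 2^n - 1 for n outgoing flows, and a parallel split counts 1. Whether the product uses exactly these weights is an assumption, and the function name and input format are hypothetical.

```python
def control_flow_complexity(split_gateways):
    """Sum per-gateway weights; split_gateways is a list of (type, outgoing_flows)."""
    total = 0
    for gateway_type, fan_out in split_gateways:
        if gateway_type == "exclusive":
            total += fan_out              # one reachable state per branch
        elif gateway_type == "inclusive":
            total += 2 ** fan_out - 1     # any non-empty subset of branches
        elif gateway_type == "parallel":
            total += 1                    # all branches always taken
        else:
            raise ValueError(f"unknown gateway type: {gateway_type}")
    return total

# Hypothetical model with one split of each type, three outgoing flows each.
cfc = control_flow_complexity(
    [("exclusive", 3), ("inclusive", 3), ("parallel", 3)]
)
```

Under these weights the inclusive gateway contributes 7 of the 11 points, which matches the idea that inclusive gateways are the hardest to understand.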

Maximum Connector Degree: The definitions of connector and connector activity given under Average connector degree are used. The maximum connector degree is the number of incoming and outgoing sequence flows of the connector that has the most of them. The higher the maximum connector degree, the less understandable the model.
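
Both connector degree metrics can be sketched from a simple node table. The input format (node name mapped to kind, incoming count, outgoing count) and the function names are hypothetical; returning 0 for a model without connectors is also an assumption.

```python
def connector_degrees(nodes):
    """Degrees (incoming + outgoing flows) of all connectors in the model."""
    degrees = []
    for kind, n_in, n_out in nodes.values():
        # A connector is a gateway, or an activity with 2+ incoming or 2+ outgoing flows.
        is_connector = kind == "gateway" or n_in >= 2 or n_out >= 2
        if is_connector:
            degrees.append(n_in + n_out)
    return degrees

def average_connector_degree(nodes):
    degrees = connector_degrees(nodes)
    return sum(degrees) / len(degrees) if degrees else 0.0

def maximum_connector_degree(nodes):
    return max(connector_degrees(nodes), default=0)

# Hypothetical model: one gateway, one plain activity, one connector activity.
nodes = {
    "g1": ("gateway", 1, 3),
    "a1": ("activity", 1, 1),
    "a2": ("activity", 2, 1),
}
avg_degree = average_connector_degree(nodes)
max_degree = maximum_connector_degree(nodes)
```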

Sequentiality: The sequentiality metric expresses how sequentially the process is modeled. A process is completely sequential when no connectors are used. Consequently, there is no room for concurrency, choice or loops. To calculate sequentiality, non-connector activities are used. We define non-connector activities as activities with exactly one incoming and one outgoing sequence flow. We define a sequential sequence flow as a sequence flow connecting non-connector activities or a sequence flow connecting an event with a non-connector activity. This event should have no more than one incoming and one outgoing sequence flow. The higher the sequentiality, the more understandable the model.

Token Split: The token split indicates how many activities are executed in parallel and therefore measures concurrency. Because tokens are only split in inclusive and parallel gateways and in activities with multiple outgoing sequence flows, the token split is calculated by summing the number of outgoing flows from these objects and subtracting one for each object.  The higher the token split, the less understandable the model.
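
The token split calculation can be sketched as follows. The input format (a list of node kind and outgoing flow count) and the function name are hypothetical; nodes with a single outgoing flow, such as join gateways, are assumed to contribute nothing.

```python
def token_split(nodes):
    """Sum (outgoing flows - 1) over objects that split tokens."""
    total = 0
    for kind, n_out in nodes:
        splits_tokens = kind in ("inclusive", "parallel") or (
            kind == "activity" and n_out > 1
        )
        if splits_tokens:
            total += max(n_out - 1, 0)
    return total

# Hypothetical model: exclusive gateways never split tokens, so they add nothing.
model = [
    ("parallel", 3),
    ("inclusive", 2),
    ("exclusive", 4),
    ("activity", 2),
    ("activity", 1),
]
split = token_split(model)
```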

There are many more BPMN Social Network and Distance Metrics which may be added.

## Capabilities in Development

Here are some of the items on our to-do list.  We welcome feedback and suggestions for improvement.  You may submit feedback any time here.

Process map animation: This feature enables visual analytics in order to see how work flows across your process.  It will be a great tool for communicating results and visualizing bottlenecks.

Simulations: Our process simulations help take the guesswork out of process planning and quantify likely outcomes.  What if you added resources to your process, like another employee to assist with the workload?  Or, what if the amount of work doubled?  How might these changes affect your process?  Using queuing theory and Monte Carlo simulation, we will answer these questions and help organizations understand quantitatively the likely outcome when process changes are considered.

Process Deviation Alerts and Anomaly Detection:  Proactive compliance and audit management is a necessity in today’s world.  Data contained in event logs may augment audit and compliance efforts.  Consider a process in which it is of critical importance that certain activities are never skipped.  Or, consider another process where certain rules must be followed; for example, the same person may not be allowed to perform more than two activities within the same case (segregation-of-duty constraints).  Such rules may be critical for managing compliance and audit risk.  We are developing the capability to monitor processes in real time and send alerts when execution rules are broken. Detecting anomalies relative to the status quo is another helpful capability, especially when monitoring process ecosystems in a national security context.

API Data Delivery:  We are planning to allow log data to be delivered securely, transformed, and analyzed seamlessly via API.

Secure Government Certifications: See A Word About Data Privacy and Security.