Sunday 10:45 AM–11:30 AM in Room 1

Finding Driving-Style Patterns in Caterpillar Machine Data

Benjamin Hodel

Identifying predominant driving-style patterns in logged time series data of Caterpillar machines is daunting due to the nature and size of the data. However, insight gained from field data can deliver optimized powertrain control software and better machine performance. A solution for finding patterns was built using engineered features, dimensionality reduction, and unsupervised learning.


Caterpillar earth-moving machines are large electro-hydraulic-mechanical systems that use embedded control units to regulate the performance of the engine, upper drivetrain, lower drivetrain, hydraulics, and other systems. These systems work seamlessly with the operator's inputs to move the machine and perform work such as digging, loading, dozing, and hauling. The function of the machine is managed by complex embedded software which allows for a range of operator environments and styles while continuously attempting to optimize machine productivity, fuel usage, and operator comfort.

The design and testing of the integrated powertrain system (IPS) control software is therefore critical to proper performance of the engine and transmission across a wide variety of operations and driving styles. Caterpillar IPS control software has historically been validated by manually selecting cycles from limited test data. Consequently, the risk that the software may not be optimal is higher than desired, since the sample size is small.

The advent of relatively cheap, integrated data-loggers that can record many parameters at high sampling frequencies has greatly increased the amount and variety of recorded operator styles and customer applications. There is now ample data from which to sample. However, in order to avoid sample bias, we desire that extracted time histories only be drawn from predominant patterns.

The data is so large that manually looking for patterns would be prohibitively time-consuming. A programmatic approach is needed to reduce and categorize the data. However, simply reducing the data to summary statistics does not preserve the time-dependence of patterns (gear shifting, engine acceleration or deceleration). Nor does attempting to match waveforms exactly, since the patterns can be time-dilated or time-reordered.
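A contrived illustration (not Caterpillar data) of why summary statistics are insufficient: an upshift sequence and a downshift sequence have identical means and standard deviations, while a simple time-aware feature separates them.

```python
import numpy as np

# Two hypothetical gear-position traces: an upshift sequence and the
# same sequence reversed (a downshift). The driving behavior they
# represent is opposite.
upshift = np.array([1, 1, 2, 2, 3, 3, 4, 4])
downshift = upshift[::-1]

# Mean and standard deviation cannot tell them apart.
assert upshift.mean() == downshift.mean()
assert upshift.std() == downshift.std()

# A time-aware feature -- the net signed gear change -- distinguishes
# them immediately.
print(np.diff(upshift).sum())    # +3: net upshift
print(np.diff(downshift).sum())  # -3: net downshift
```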

Therefore, we present a method to divide the data into time-period sessions and pull out specially-engineered features that can be subsequently mined for patterns. This involves domain expertise as well as programming knowledge. Python is used to create the features (numpy, scipy, pandas), and the feature space is grouped into patterns through dimensionality reduction and unsupervised learning (scikit-learn). The scalability problem is addressed by building a functional programming pipeline on a distributed task queue (pytoolz, celery).
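A minimal sketch of this session/feature/cluster approach, using synthetic data and hypothetical signal names (the actual logged channels and engineered features are not specified here), with scikit-learn's PCA and KMeans standing in for the dimensionality reduction and unsupervised learning steps:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for a logged time series; column names are
# illustrative, not actual Caterpillar channels.
rng = np.random.default_rng(0)
n = 6000
log = pd.DataFrame({
    "engine_speed": rng.normal(1500, 200, n),
    "ground_speed": rng.normal(8, 2, n),
    "gear": rng.integers(1, 5, n),
})

# 1. Divide the log into fixed-length time-period sessions
#    (here, 300 samples per session).
session_len = 300
log["session"] = np.arange(n) // session_len

# 2. Engineer per-session features that preserve time-dependence
#    (shift counts, acceleration variability) alongside level features.
def session_features(g):
    return pd.Series({
        "mean_engine_speed": g["engine_speed"].mean(),
        "mean_ground_speed": g["ground_speed"].mean(),
        "shift_count": g["gear"].diff().abs().gt(0).sum(),
        "accel_std": g["ground_speed"].diff().std(),
    })

cols = ["engine_speed", "ground_speed", "gear"]
features = log.groupby("session")[cols].apply(session_features)

# 3. Scale, reduce dimensionality, and cluster sessions into patterns.
X = StandardScaler().fit_transform(features)
X2 = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

# Cluster sizes indicate which patterns predominate.
print(pd.Series(labels).value_counts())
```

Sessions falling in the largest clusters would then be candidate "predominant patterns" from which validation cycles can be drawn without sample bias.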

We will discuss this method and (if time permits) some challenges we face related to data storage, data access, distributed computing, function serialization, workflow design, and analytical modeling. We will conclude with a reflection on the value to controls development that this method realizes, what works well, and where the process can be improved.
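The functional pipeline mentioned above can be sketched with toolz; the stage names here are hypothetical, and in the actual system each stage would be serialized and dispatched as a celery task over the distributed queue (omitted for brevity).

```python
from toolz import compose_left

# Hypothetical pipeline stages; in production each would run as a
# celery task on a distributed worker.
def load_session(path):
    # Stand-in for reading one logged session from storage.
    return {"path": path, "signal": [1, 2, 2, 3]}

def extract_features(session):
    sig = session["signal"]
    return {"path": session["path"],
            "shift_count": sum(a != b for a, b in zip(sig, sig[1:]))}

def label_pattern(feats):
    feats["pattern"] = "shifting" if feats["shift_count"] > 1 else "steady"
    return feats

# compose_left chains the stages into one callable pipeline, which is
# what makes the per-session work easy to farm out to a task queue.
pipeline = compose_left(load_session, extract_features, label_pattern)
print(pipeline("log_0001.bin"))
```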