Is your data ready for data science? — Motivating this topic from a sail-race-analysis example

Gerhard Svolba
Jul 28, 2021

Besides data science and analytics, my other passion is sailing. Sometimes I enjoy a relaxing trip with friends or alone; on other days I like the challenge and take part in a few local sail races on our lake.

When trying to optimize the boat speed and my race performance, I can’t fully leave my analytical background on land. In fact, many tactical decisions during a race can be made better when you analyze sail performance in different situations. Examples include choosing the optimal angle of the boat to the true wind direction and deciding whether to tack in a sharp and effective way or in a round and fluid way.

Recently I posted a webinar that shows examples of how you can analyze your race data.

Do we have the data to perform the analysis?

Is sufficient data available to analyze a sail race? In general “YES”! There are two main sources of data that we can use.

The first group is data collected with our GPS tracking device.

We have a GPS tracking device on board. Besides displaying the boat speed and the heading in real time during the race, the device also collects and records the GPS track points (longitude and latitude), the speed, and the heading in two-second intervals. This data can then be uploaded to a computer and analyzed to show the true course during the race, as shown in the figure.

Another data source is our manual recordings. Here we document base data for each race: Who was part of the crew? Which sails did we use? What was the average wind direction and wind speed?

Based on this data, the above-mentioned questions and many others can be answered to improve our race performance and to compete better against others.

However, learning from data to improve is not the only parallel to the business world. In the analysis we experience the same data quality challenges that occur in business analyses.

Data completeness — do we see all details?

In some cases the GPS device fails for a few minutes because of low temperatures or weak batteries. The true values have to be considered lost, as they cannot be recovered from another source later on. The only way to present a complete data picture for the analysis is to define appropriate imputation rules that specify how missing data are filled with values.
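One possible imputation rule is linear interpolation between the last point before a gap and the first point after it. The sketch below is only an illustration: the function name `fill_gaps`, the tuple layout `(seconds, latitude, longitude)`, the sample coordinates, and the choice of linear interpolation are all assumptions, not the rules used in the original analysis. Only the two-second interval comes from the device description above.

```python
from typing import List, Tuple

# Assumed layout: (seconds since start, latitude, longitude)
TrackPoint = Tuple[int, float, float]


def fill_gaps(points: List[TrackPoint], step: int = 2) -> List[Tuple[int, float, float, bool]]:
    """Return the track on a regular time grid.

    The last field of each row is True when the point was imputed.
    """
    filled = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        filled.append((t0, lat0, lon0, False))
        # Insert interpolated points wherever recordings are missing.
        for t in range(t0 + step, t1, step):
            w = (t - t0) / (t1 - t0)
            filled.append((t, lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0), True))
    if points:
        t, lat, lon = points[-1]
        filled.append((t, lat, lon, False))
    return filled


# Invented sample: the device dropped the recordings at t=4 and t=6.
track = [(0, 47.0, 16.0), (2, 47.001, 16.001), (8, 47.004, 16.004)]
for row in fill_gaps(track):
    print(row)
```

Keeping the imputed flag on every row makes it possible to exclude filled-in points later, for example when computing speed statistics, so the imputation does not silently pass as measured data.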

Our manual records are not fully complete either, because some of the data was not documented by the crew immediately after the race. Sometimes this happens weeks later or at the end of the sailing season.

In business situations the quality of customer or product base data also often declines over time if no data maintenance procedures are in place and responsibilities for the quality of this data are not well defined.

Data correctness — what is really the truth?

However, our base data for the race is not the only source of weak data completeness and correctness. We can also have problems with the correctness of the data provided by the GPS tracking device. If the device has a bad connection to its satellites, the position can be misplaced or delayed. Then not only are the positions for a few points in time incorrect, but the derived variables “compass heading” and “speed in knots” are wrong as well. We need to define rules to identify and correct these data points.
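One simple rule for identifying misplaced positions is to compute the speed implied by the jump from the last trusted point and to flag any point that would require an implausible speed. The following sketch assumes a threshold of 15 knots, a tuple layout of `(seconds, latitude, longitude)`, and invented sample coordinates. The function names are hypothetical, and this is not necessarily the rule used in the original analysis.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_M = 6_371_000.0
MS_TO_KNOTS = 3600.0 / 1852.0  # 1 knot = 1852 metres per hour


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two positions given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flag_suspect_points(points: List[Tuple[int, float, float]],
                        max_knots: float = 15.0) -> List[bool]:
    """Flag points whose implied speed from the last trusted point is too high."""
    flags = [False] * len(points)
    last_good = 0  # assumption: the first point is trustworthy
    for i in range(1, len(points)):
        t0, lat0, lon0 = points[last_good]
        t1, lat1, lon1 = points[i]
        dt = t1 - t0
        if dt <= 0:
            flags[i] = True  # duplicated or out-of-order timestamp
            continue
        knots = haversine_m(lat0, lon0, lat1, lon1) / dt * MS_TO_KNOTS
        if knots > max_knots:
            flags[i] = True
        else:
            last_good = i
    return flags


# Invented sample: the third point jumps roughly 100 m within two seconds.
track = [(0, 47.0, 16.0), (2, 47.00005, 16.0), (4, 47.001, 16.0), (6, 47.00015, 16.0)]
print(flag_suspect_points(track))
```

Comparing each point against the last trusted point, rather than its direct predecessor, avoids also flagging the good point that follows an outlier. The weakness of this design is that it relies on the first point being correct.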

Another data correctness problem is the transfer of data between systems. In our case we physically transfer the GPS data from the device via a USB connection to a PC. We receive an XML file that we use to generate a dataset. The more systems and interfaces your data passes through in the preparation and analysis process, the more likely it is that errors occur.
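A basic safeguard at such an interface is to validate every record during import and to report what was rejected, instead of silently loading everything. The sketch below uses Python's standard `xml.etree.ElementTree` parser. The XML layout (`<track>` with `<point>` elements carrying `time`, `lat`, `lon`, and `speed` attributes) is purely hypothetical, because the article does not describe the file format the device produces, and the sample values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout and invented values, for illustration only.
SAMPLE = """<track>
  <point time="2021-05-01T10:00:00" lat="47.8000" lon="16.7000" speed="5.2"/>
  <point time="2021-05-01T10:00:02" lat="47.8001" lon="16.7001" speed="5.3"/>
  <point time="2021-05-01T10:00:04" lat="478.002" lon="16.7002" speed="5.1"/>
</track>"""


def load_points(xml_text):
    """Parse track points and collect records that fail basic checks."""
    root = ET.fromstring(xml_text)
    rows, problems = [], []
    for n, node in enumerate(root.iter("point"), start=1):
        try:
            lat = float(node.attrib["lat"])
            lon = float(node.attrib["lon"])
            speed = float(node.attrib["speed"])
        except (KeyError, ValueError) as exc:
            problems.append((n, f"unreadable field: {exc}"))
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append((n, "coordinates out of range"))
            continue
        if speed < 0:
            problems.append((n, "negative speed"))
            continue
        rows.append((node.attrib.get("time"), lat, lon, speed))
    return rows, problems


rows, problems = load_points(SAMPLE)
print(len(rows), "points loaded;", len(problems), "rejected:", problems)
```

Range checks like these catch only gross errors, such as the shifted decimal point in the third sample record. Subtler import problems still need the kind of interactive inspection described in the text.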

An example of how data can be imported into the analysis system in a wrong way, and might remain undetected unless you perform interactive data analysis, is discussed in a webinar that I recently published.

Data quantity — do we have enough?

In the first sailing season we only had 97 well-documented tacks. For statistical purposes this may not be enough to answer the desired questions on the best tacking strategy for different wind strengths or sail types. So even if “quantity” and “quality” are often treated as opposites, “data quantity” is an important factor in data quality for analytics.
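A quick back-of-the-envelope calculation shows why 97 tacks can become thin once the data is split by condition. The numbers of wind-strength classes and sail types below are assumed purely for illustration; only the total of 97 tacks comes from the text.

```python
total_tacks = 97   # from the first sailing season
wind_classes = 3   # assumption: e.g. light, medium, strong wind
sail_types = 3     # assumption: number of sail configurations

cells = wind_classes * sail_types
avg_per_cell = total_tacks / cells
print(f"{cells} combinations, about {avg_per_cell:.1f} tacks per combination on average")
```

Under these assumptions each combination holds only about eleven tacks on average. Splitting further by tacking style (sharp versus round) halves that figure again, and real observations are rarely spread evenly across combinations.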

Data usability — can we start the analysis immediately?

In some cases the desired data is available, complete, and correct, but a lot of data pre-processing still needs to be applied. In our example this is the case with the data measured by the GPS tracking device. The device collects the data as a stream from the moment it is turned on until it is turned off. When it is turned on in the harbor on a racing day with three races, it collects data while we sail from the harbor to the race area, wait for the start, sail race #1, wait for the start of race #2, sail race #2, and so on. To run the analysis we need to isolate the individual races in the data, and potentially also separate upwind from downwind courses, to be able to analyze and compare them.
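If the start and finish times of each race are known, isolating the races amounts to assigning every track point to a time window and discarding the rest. The sketch below assumes that such times are available, for example from the manual race records. The article does not say how the races were actually isolated, and the window times, sample values, and function names here are invented for illustration.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

RaceWindow = Tuple[str, datetime, datetime]  # (name, start, finish)

# Hypothetical race windows for one racing day.
RACES: List[RaceWindow] = [
    ("race 1", datetime(2021, 5, 1, 11, 0), datetime(2021, 5, 1, 11, 45)),
    ("race 2", datetime(2021, 5, 1, 12, 15), datetime(2021, 5, 1, 13, 0)),
]


def label_point(ts: datetime, races: List[RaceWindow]) -> Optional[str]:
    """Return the race a timestamp belongs to, or None outside all races."""
    for name, start, finish in races:
        if start <= ts < finish:
            return name
    return None  # harbor transfer or waiting for a start


def split_by_race(points: List[tuple], races: List[RaceWindow]) -> Dict[str, List[tuple]]:
    """Group track points by race; each point's first field is its timestamp."""
    segments: Dict[str, List[tuple]] = {name: [] for name, _, _ in races}
    for point in points:
        name = label_point(point[0], races)
        if name is not None:
            segments[name].append(point)
    return segments


# Invented stream: (timestamp, speed in knots)
stream = [
    (datetime(2021, 5, 1, 10, 30), 4.1),  # sailing to the race area
    (datetime(2021, 5, 1, 11, 10), 5.6),  # race 1
    (datetime(2021, 5, 1, 12, 0), 1.2),   # waiting for the next start
    (datetime(2021, 5, 1, 12, 30), 5.9),  # race 2
]
segments = split_by_race(stream, RACES)
print({name: len(points) for name, points in segments.items()})
```

Separating upwind from downwind legs would need an additional rule, for instance one based on the heading relative to the wind direction, which is not covered here.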

This is very similar to many business analyses I have been involved in. To answer the business question, you might get access to the relevant data. However, you often have to filter out customers who are no longer active. Or you need to select regions or time windows and apply these rules across different tables to obtain a consistent, time-aligned database. When you analyze transactional data, many additional features and events might be recorded there, which you have to separate from the data you need.

Data availability — is external data always the cure?

On our boat we don’t have a wind measuring device, so we can’t perform certain analyses that relate wind speed and wind direction to boat performance.

A quick piece of advice that you might also get in business situations is to use “external data”. In our case this surrogate can be data from the weather station in the harbor. However, this is not necessarily helpful for a detailed analysis of the boat angle to the true wind. The conditions measured in the harbor at a certain point in time will differ from those in the race area. Also, the data is only collected in 5-minute intervals, which might be too coarse to analyze short-term wind shifts and wind gusts.
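The mismatch in time resolution becomes concrete when each GPS point is matched to the most recent weather reading. The sketch below uses invented wind values and a hypothetical helper `latest_reading`; only the two-second and five-minute intervals come from the text.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

GPS_INTERVAL_S = 2        # from the GPS device description
WEATHER_INTERVAL_S = 300  # 5-minute weather station readings

# Invented readings: (seconds since start, wind speed in knots)
weather = [(0, 8.0), (300, 12.0), (600, 9.0)]


def latest_reading(readings: List[Tuple[int, float]], t: int) -> Optional[Tuple[float, int]]:
    """Return (value, age in seconds) of the most recent reading at or before t."""
    times = [ts for ts, _ in readings]  # readings must be sorted by time
    i = bisect_right(times, t)
    if i == 0:
        return None  # no reading available yet
    ts, value = readings[i - 1]
    return value, t - ts


print(WEATHER_INTERVAL_S // GPS_INTERVAL_S, "GPS points share one weather reading")
print(latest_reading(weather, 298))  # reading is almost five minutes old
print(latest_reading(weather, 300))  # fresh reading
```

Returning the age of the reading alongside its value lets an analysis decide how stale a wind measurement may be before it is no longer used, which matters for gusts and shifts that last only seconds.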

In marketing analytics, external data such as social media data or demographic data is sometimes helpful. However, it usually does not compensate for missing attributes of individual customer behavior.

Conclusion

Data quality for analytics is an important topic across business domains and functional questions. There are specific requirements for analytics and data science that need to be fulfilled. These requirements usually go beyond basic data quality tasks like data standardization or data de-duplication. Just having data in a technically adequate format does not necessarily mean a “green light” for any data science analysis.

Links

An earlier version of this article was published in IT-Briefcase.

Links to two webinars on related analyses are shown in the text above. My data preparation for data science webinar contains more contributions around this topic.

Presentation #102 in my slide collection contains more visuals on this topic.

Chapters 1–9 in my SAS Press book “Data Quality for Analytics Using SAS” discuss these topics in more detail. Chapter 1 also gives a deeper introduction to the sail race example.
