Data Wrangling and the Business Analyst


The quality of any data analysis created to inform business decisions will ultimately be constrained by the quality of the underlying data. If the data is faulty, then the analysis will be faulty too.

This is why data wrangling, the transformation of raw data into a format that is appropriate for use, has become such a ubiquitous task in organizations.

Unfortunately, the significance of data wrangling is still often overlooked. Impatient executives interested in quick results may pressure the team to skip data preparation steps in order to jump more quickly into the tasks required to produce analyses, models, and reports.

And this is where data-savvy business analysts can help save the day. First, they can help communicate the risks associated with ignoring or rushing through the data wrangling phase. Second, they can recommend to data engineers the best strategies to ensure the data is in a reliable state before it is used to produce KPIs, generate forecasts, create recommendations, and support various business decisions.

What are some typical data wrangling activities?

The most common data wrangling tasks required to get data ready for analysis fit one of these two buckets:

  • Data cleaning involves sanitizing a data set by fixing structural errors and typos, standardizing units of measure, ensuring that values are reasonable, removing unwanted observations, handling outliers and missing values, etc.
  • Data structuring is about organizing, transforming, and mapping data from its raw form into a more usable format. For example, patients’ dates of birth can be used to assign them to age groups (<6, 6-17, 18-25, etc.), or the content of free-text customer reviews can be transformed into a customer satisfaction score on a scale from 0 to 5.
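As a minimal sketch of the age-group example, the Python below maps a date of birth to a bucket. The function names, the reference date argument, and the catch-all "26+" label are assumptions made for illustration; the article only names the first three groups.

```python
from bisect import bisect_right
from datetime import date

# Lower bound (in years) of every group after the first one.
# The "26+" catch-all label is a hypothetical addition for illustration.
AGE_BREAKS = [6, 18, 26]
AGE_LABELS = ["<6", "6-17", "18-25", "26+"]


def age_in_years(date_of_birth: date, as_of: date) -> int:
    """Return the number of completed years between the two dates."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def age_group(date_of_birth: date, as_of: date) -> str:
    """Map a date of birth to one of AGE_LABELS as of a reference date."""
    age = age_in_years(date_of_birth, as_of)
    if age < 0:
        raise ValueError("date_of_birth is after as_of")
    # bisect_right counts how many break points are <= age,
    # which is exactly the index of the matching label.
    return AGE_LABELS[bisect_right(AGE_BREAKS, age)]


if __name__ == "__main__":
    print(age_group(date(2006, 3, 15), date(2024, 3, 14)))  # day before 18th birthday
    print(age_group(date(2006, 3, 15), date(2024, 3, 15)))  # 18th birthday
```

Where the boundaries fall (is a patient who turns 18 today an adult for this analysis?) is a business decision, which is the kind of question a business analyst is well placed to answer.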

Who is responsible for data wrangling tasks in an organization?

Data wrangling can be time-consuming and taxing on resources, so data engineers typically automate this task. However, the decisions surrounding the process of getting the data in shape for analysis will greatly benefit from having someone with domain knowledge to pave the way for the right automation to take place.

In reality, much of the work that turns raw data into usable data requires specific domain knowledge that business analysts are more likely to possess than data engineers. And while computer-based systems are better than humans at finding hidden patterns in high volumes of data, when it comes to deciding what kinds of transformations will benefit analyses, those systems will underperform compared to a person who can think abstractly, generalize from one domain to another, and tap into their thoughts and memories to develop creative ideas.

Computers are bad at contextualizing data

A retired product from IBM, Watson Analytics was a modeling application based on technology that combined machine learning, reasoning, and natural language processing to investigate data and find its hidden patterns. It was commercialized between 2015 and 2018, with pundits writing about the possibility that the tool might even "replace data scientists."

In a test published by IEEE in 2018, a researcher uploaded to Watson Analytics a dataset containing event data with variables that included start time, finish time, and total time. Unlike human beings, a machine learning model can't immediately tell how these variables relate to each other. Because it only knows statistics and has no semantic understanding of what the variables truly mean, Watson Analytics generated findings that a regular person would be able to reach without any need for computation.


Let’s look at a practical example that illustrates the benefits of involving a business analyst in data wrangling decisions. Imagine that an agricultural business is trying to predict whether it will be a good or bad year for fruit harvest. Weather data is an important factor, since fruit can be killed or damaged by freezing weather.

To train a machine learning model to predict how good the harvest will be, the business uses weather data that includes the minimum temperature (TMIN) for each day at a given location. But the data collection process is faulty, and for some days and locations, only the maximum temperature (TMAX) is available. Most predictive models can’t handle missing data, so as part of data preparation, a strategy needs to be chosen to replace missing values with some sort of approximation.

In such cases, I've seen a data engineer take the easy option of replacing missing values with the variable’s most frequent or average value. However, in this particular problem, using either technique to fill the missing TMIN values would be a bad choice, since it could hide freezing events that happened during winter.
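The risk can be illustrated with a small sketch. All temperatures below are made up for illustration, and the example assumes the missing readings happen to fall on the frost nights, which is the scenario the paragraph above warns about.

```python
from statistics import mean

FREEZING_C = 0.0  # freezing point in degrees Celsius


def fill_with_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def freezing_days(values):
    """Return the indexes of the days at or below freezing."""
    return [i for i, v in enumerate(values) if v is not None and v <= FREEZING_C]


# Hypothetical week of true daily minimums (degrees C) with two frost nights.
actual_tmin = [4.0, 2.5, -3.0, 3.0, 5.5, -1.5, 4.5]
# The same week as recorded, with the two frost nights missing (assumption).
recorded_tmin = [4.0, 2.5, None, 3.0, 5.5, None, 4.5]

filled_tmin = fill_with_mean(recorded_tmin)

print(freezing_days(actual_tmin))  # [2, 5]
print(freezing_days(filled_tmin))  # []
```

Because the observed days average 3.9 degrees, every gap is filled with a value above freezing, and both frost events disappear from the data the model is trained on.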

A business analyst would be able to investigate and suggest a better approach. For example, use the latitude and longitude values available in the data to find the nearest sensor and "borrow" the missing TMIN value from it. Or estimate TMIN by subtracting from TMAX the average TMAX-TMIN difference observed on adjacent days when both values were available. Suggestions like that require a semantic understanding of weather and location data that computers don't have, and business subject matter expertise that data engineers may lack.
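Both suggestions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (dictionaries with "lat", "lon", "tmax", "tmin" keys), the function names, and the window parameter are hypothetical, and great-circle (haversine) distance is just one reasonable way to define "nearest".

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def borrow_from_nearest(lat, lon, sensors):
    """Return the TMIN of the closest sensor that has a reading, or None."""
    candidates = [s for s in sensors if s["tmin"] is not None]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))
    return nearest["tmin"]


def estimate_from_spread(days, index, window=1):
    """Estimate TMIN as TMAX minus the average TMAX-TMIN of adjacent days.

    Only days within `window` positions of `index` that have both values
    contribute. Returns None when no estimate can be made.
    """
    if days[index]["tmax"] is None:
        return None
    lo, hi = max(0, index - window), min(len(days), index + window + 1)
    spreads = [
        days[i]["tmax"] - days[i]["tmin"]
        for i in range(lo, hi)
        if i != index and days[i]["tmax"] is not None and days[i]["tmin"] is not None
    ]
    if not spreads:
        return None
    return days[index]["tmax"] - mean(spreads)


if __name__ == "__main__":
    # Hypothetical readings in degrees Celsius.
    sensors = [
        {"lat": 45.10, "lon": -120.0, "tmin": -1.5},
        {"lat": 46.00, "lon": -121.0, "tmin": 2.0},
        {"lat": 45.01, "lon": -120.0, "tmin": None},  # closest, but no reading
    ]
    days = [
        {"tmax": 8.0, "tmin": 0.0},
        {"tmax": 6.0, "tmin": None},
        {"tmax": 9.0, "tmin": 1.0},
    ]
    print(borrow_from_nearest(45.0, -120.0, sensors))  # -1.5
    print(estimate_from_spread(days, 1))               # -2.0
```

Unlike filling with an overall average, both estimates can come out below freezing, so frost events have a chance of surviving the imputation. Choices such as how far away a sensor may be before its reading stops being representative are left open here and would need domain input.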

Why should companies encourage the involvement of business analysts in data wrangling activities?

There are plenty of good reasons for organizations to engage their business analysts in data wrangling activities:

1. Skipping or rushing this step may result in poor data models that impact an organization’s decision-making and reputation.

2. Leaving the task exclusively in the hands of data engineers may lead to negative outcomes resulting from bad choices during transformation steps due to a lack of business domain knowledge.

3. Involving a business analyst in the data wrangling process will often lead to insights that might even change the future course of a project to increase its odds of success.

Where can business analysts learn more about data wrangling to improve the quality of their organizations' analyses, models, and reports?

The following resources are good starting points for business analysts interested in combining their business domain knowledge with data wrangling expertise, to help ensure that the data their organizations use to make consequential decisions is in a reliable state:

What is Data Cleansing and Transformation/ Wrangling?

Data Wrangling in Python with Examples

13 Tips for Quick, Accurate Data Wrangling


Author: Adriana Beal

Adriana Beal has been working as a data scientist since 2016. Her educational background includes graduate degrees in Electrical Engineering and Strategic Management of Information obtained from top schools in her native country, Brazil, and a certificate in Big Data and Data Analytics from the University of Texas. Over the past five years, she has developed predictive models to improve outcomes in healthcare, mobility, IoT, customer science, human services, and agriculture. Prior to that, she worked for more than a decade in business analysis and product management, helping U.S. Fortune 500 companies and high-tech startups make better software decisions. Adriana has two IT strategy books published in Brazil and work internationally published by IEEE and IGI Global. You can find more of her useful advice for business analysts at bealprojects.com.

 




Copyright 2006-2024 by Modern Analyst Media LLC