Importance of Data Normalization Prior to Analytics

datasciencedojo · February 27, 2019, 2:10am

Organizations are investing heavily in technology as artificial intelligence techniques, such as machine learning, continue to gain traction across several industries.

A Price Water Cooper Survey pointed out that 40% of business executives in 2018 make major decisions at least once every 30 days using data and this is constantly increasing
A Gartner study states the 40% of enterprise data is either incomplete, inaccurate or unavailable

As the speed of data coming into the business increases with the Internet of Things starting to become more mature, the risk of disconnected and siloed data grows if it is poorly managed within the organization. Gartner has suggested that a lack of data quality control costs average businesses up to $14 million per year.

The adage of “garbage in, garbage out” still plagues analytics and decision making and it is fundamental that businesses realize the importance of clean and normalized data before embarking on any such data driven projects.

When most people talk about clean data, they think it means getting rid of duplicates from their system which, although important, is only the first step in quality control and there are more advanced methods to truly optimize and streamline your data.

There are two key schools of thought on good practice for data management; normalization and standardization. Both have their place in a data governance and/or preparation strategy.

A data normalization strategy takes a database and organizes it into specific tables and columns with the purpose of reducing duplication, avoiding data modification issues and simplifying queries. All information is stored logically in one central location which reduces the propensity for inconsistent data (sometimes known as a “single source of truth”). In simple terms, it ensures your data looks and reads the same across all records.

In the context of machine learning and data science, normalization takes the values from the database and where they are numeric columns, changes them into a common scale. For example, imagine you have a table with two columns and one contains values between 0 and 1 and the other contains values between 10,000 and 100,000. The huge differences in scale might cause problems if you attempt to do any analytics or modelling. Data normalization will take these two columns by creating a matching scale across all columns whilst maintaining the distribution e.g. 10,000 might become 0 and 100,000 becomes 1 with values in-between being weighted proportionality.

In real world terms, consider a dataset of credit card information which has two variables, one for the number of credit cards and the second for income. Using these attributes, you might want to create a cluster and find similar applicants. Both of these variables will be on completely different types of scale (income being much higher) and would therefore likely have a far greater influence on any results or analytics. Normalization removes the risk of this kind of bias.

The main benefits of normalization in analytical terms are that it allows faster searching and sorting as it is better at creating indexes via smaller, logical tables. Also, in having more tables, there is a better use of segments to control the tangible placement of data. There will be fewer nulls and redundant data after modelling any necessary columns and bias/issues with anomalies are greatly reduced by removing the differences in scale.

Normalization should not be confused with data standardization and it is important that both are considered within any strategy.

Data standardization takes disparate datasets and puts them on the same scale to allow easy comparison between different types of variable. It uses the average (mean) and the standard deviation of a dataset to achieve a standardized value of a column.

For example, let’s say a store sells $520 worth of chocolate in a day. We know that on average, the store sells $420 per day and has a standard deviation of $50. To standardize the $520 we would do a calculation as follows:

520-420/50 = 100/50 = 2 - our standardized value for this day is 2. If the sales were $600, we’d scale in a similar way as 600-420/50 = 180/50 = 3.6.

If all columns are done on a similar basis we quickly have a great base for analytics that is consistent and allows us to quickly spot correlations.

In summary, data normalization processes ensure that our data is structured logically and scaled proportionally where required, generally on a scale of 0 to 1. It tends to be used where you have predefined assumptions of your model. Data standardization can be used where you are dealing with multiple variables together and need to find correlations and trends via a weighted ratio.

By ensuring you have normalized data, the likelihood of success in your machine learning and data science projects vastly improves. It is vital that organizations invest as much in ensuring the quality of their data as they do in the analytical and scientific models that are created by it. Preparation is everything in a successful data strategy.