📏 Normalizing Data for K-means clustering
Given the structure of an `AnimeEntry` model, a variety of data types are present: some properties are categorical, some ordinal, and some numerical.
To handle these various data types and ensure that K-means clustering was possible, there was a need to normalize the existing data. Various normalization techniques were implemented and after trial and error, the following were used in the current model:
- Ordinal Encoding: For anime properties such as `type`, `rating`, and `demographics`, I originally used multi-hot encoding. However, I noticed that the presence or absence of a value alone was not sufficient, because the distance between anime was key. For example, I wanted anime of type `TV` and `ONA` to be closer in distance than `TV` and `Movie`. To accomplish this, I assigned a distinct number to each unique value within each of these properties, ordered according to my understanding of how closely the values relate to each other.
- Multi-Hot Encoding: For anime properties such as `genres`, `demographics`, `source`, etc., each property was normalized into a binary vector of 1s and 0s indicating the presence or absence of each unique value.
- Min-Max Scaling: For anime properties such as `score` and `durationMinutes`, the data was normalized by determining the minimum and maximum values of each property and applying the formula `(value - min) / (max - min)`.
- Robust Scaling: This method could be applied to numerical attributes like `score` and `episodes`, which would be beneficial given the presence of outliers in the dataset. These outliers include anime with scores or episode counts of 0 or "Unknown" due to insufficient information.
- TF-IDF: This was primarily used for the synopsis. Using `natural`, `remove-stopwords`, `word-list`, and `lemmatizer` in combination, the goal was to extract the top keywords from each anime synopsis and determine their frequency. In the current state, the frequency value itself is not used; only the presence or absence of a keyword, represented by a 1 or 0, is used. Originally, I planned to take the top 10 keywords from each anime and compare them against `word-list`; however, this produced words I felt were irrelevant to the synopsis. To combat this, I created my own keyword list, starting with the genres, themes, and demographics as a base, then looking at popular anime together with the top 1000 keywords returned by the `word-list` approach and adding what I felt was relevant. There are definitely many more keywords that could be added, and further improvements are possible.
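The first three techniques above can be sketched in a few lines of TypeScript. This is a minimal illustration rather than the project's actual code: the function names, the `TYPE_ORDER` mapping and its values, and the fallback of `-1` for unmapped values are all assumptions made for the example.

```typescript
// Hypothetical ordinal mapping: related types get nearby numbers.
// The real values used by the project are not shown in this document.
const TYPE_ORDER: Record<string, number> = { TV: 0, ONA: 1, OVA: 2, Special: 3, Movie: 4 };

// Ordinal encoding: look up the number assigned to a value.
// Values missing from the mapping return -1 (an assumption for this sketch).
function ordinalEncode(value: string, order: Record<string, number>): number {
  return Object.prototype.hasOwnProperty.call(order, value) ? order[value] : -1;
}

// Multi-hot encoding: one slot per vocabulary entry, 1 if present, 0 otherwise.
function multiHotEncode(values: string[], vocabulary: string[]): number[] {
  return vocabulary.map((term) => (values.indexOf(term) !== -1 ? 1 : 0));
}

// Min-max scaling: (value - min) / (max - min), applied across one property.
function minMaxScale(values: number[]): number[] {
  const min = values.reduce((a, b) => Math.min(a, b), Infinity);
  const max = values.reduce((a, b) => Math.max(a, b), -Infinity);
  const range = max - min;
  // Guard against division by zero when every value is identical.
  return values.map((v) => (range === 0 ? 0 : (v - min) / range));
}

// With the hypothetical mapping, TV is closer to ONA than to Movie.
const tvToOna = Math.abs(ordinalEncode("TV", TYPE_ORDER) - ordinalEncode("ONA", TYPE_ORDER)); // 1
const tvToMovie = Math.abs(ordinalEncode("TV", TYPE_ORDER) - ordinalEncode("Movie", TYPE_ORDER)); // 4

const genreVector = multiHotEncode(["action", "romance"], ["action", "comedy", "romance"]); // [1, 0, 1]
const scaledScores = minMaxScale([6, 7, 8, 10]); // [0, 0.25, 0.5, 1]

console.log(tvToOna, tvToMovie, genreVector, scaledScores);
```

The keyword presence/absence vector described under TF-IDF has the same shape as the multi-hot vector here, with the curated keyword list playing the role of the vocabulary.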
Other normalization techniques were explored as well during the experimentation phase of the project, such as:
- Combination Encoding: This technique aims to represent each unique value with a distinct integer. For each anime, the values of a property are then combined into a single value. For instance, consider the `genres` property of a given anime with values `['action', 'comedy', 'romance']`. Each genre is assigned a unique integer, starting from 0 up to the total count of unique values. These integers are then summed to produce a single value for the `genres` property of that anime.
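As a rough sketch of the combination encoding described above, the snippet below assigns each genre its index in a vocabulary array and sums the indices. The `GENRES` list, the function name, and the choice to skip values that are not in the vocabulary are assumptions for illustration only.

```typescript
// Hypothetical genre vocabulary; a genre's array index is its assigned integer.
const GENRES: string[] = ["action", "comedy", "drama", "romance"];

// Combination encoding: sum the integers assigned to each value of a property.
function combinationEncode(values: string[], vocabulary: string[]): number {
  let sum = 0;
  for (const value of values) {
    const index = vocabulary.indexOf(value);
    // Values that are not in the vocabulary are skipped (assumption).
    if (index !== -1) {
      sum += index;
    }
  }
  return sum;
}

const encodedGenres = combinationEncode(["action", "comedy", "romance"], GENRES); // 0 + 1 + 3 = 4
const otherGenres = combinationEncode(["comedy", "romance"], GENRES); // 1 + 3 = 4

console.log(encodedGenres, otherGenres);
```

One property of a plain sum is that different sets of values can produce the same number, as the two calls above show with this particular vocabulary.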
For more information about the normalization process, see the source code, which contains all of the normalization functions and shows how they are combined to create feature tensors using TensorFlow.
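As a loose illustration of that last step, the sketch below flattens already-normalized properties into one numeric row per anime. The `NormalizedAnime` shape and the `toFeatureRow` helper are hypothetical and are not taken from the project's source. The resulting nested array is the kind of input that TensorFlow.js's `tf.tensor2d` accepts.

```typescript
// Hypothetical shape of an anime after normalization.
interface NormalizedAnime {
  type: number; // ordinal-encoded
  genres: number[]; // multi-hot vector
  score: number; // min-max scaled
}

// Concatenate every normalized property into a single flat feature row.
function toFeatureRow(anime: NormalizedAnime): number[] {
  return [anime.type].concat(anime.genres, [anime.score]);
}

const normalized: NormalizedAnime[] = [
  { type: 0, genres: [1, 0, 1], score: 0.25 },
  { type: 4, genres: [0, 1, 0], score: 1 },
];

// One row per anime, all rows the same length.
const featureRows: number[][] = normalized.map(toFeatureRow);

console.log(featureRows);
```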