🔧 Development Overview
To develop a project like this, several different steps were necessary, alongside many hours of trial and error.
Finding a Suitable Dataset
To achieve this, I browsed through Kaggle to find datasets with the information I would need for this situation; ideally, the anime information alone would have been enough. I originally used this dataset; however, I ended up transitioning to the current dataset because it provided more detailed information and, additionally, included license information.
Determining How to Use the Dataset
My first thought was to use a machine learning framework such as TensorFlow or PyTorch and build a model from the dataset. This would have involved a process similar to what I achieved in the end, normalizing the data and creating feature tensors; however, I would have needed labels to test the model's ability to predict. I could have used the score property of an anime, but there are thousands of outliers with values of 0 or 'Unknown' for various properties, including score, and this approach did not suit my use case anyway. I later learned that what I was originally trying to achieve is known as supervised learning, and that there is also unsupervised learning, which needs no labels of any kind. This seemed like exactly what I needed.
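To illustrate the outlier problem mentioned above, here is a minimal sketch of filtering out entries whose score is 0 or 'Unknown' before they could be used as labels. The field names and values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical rows resembling the kind of outliers described above.
rows = [
    {"name": "Anime A", "score": 8.2},
    {"name": "Anime B", "score": 0},          # outlier: score of 0
    {"name": "Anime C", "score": "Unknown"},  # outlier: missing score
    {"name": "Anime D", "score": 7.1},
]

def has_valid_score(row):
    # A score is usable as a label only if it is numeric and positive.
    score = row["score"]
    return isinstance(score, (int, float)) and score > 0

clean = [r for r in rows if has_valid_score(r)]
print([r["name"] for r in clean])  # ['Anime A', 'Anime D']
```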
From here, I had to determine the type of filtering to use within the recommendation system. I learned there are two main types, content-based and collaborative filtering, plus a hybrid approach that combines both. Given the dataset, I did have the capability to perform both types of filtering; however, the user interaction data was 1 GB in size. I did attempt to use it, but with over 200,000 users in that data and over 15,000 anime in the dataset, building a user-interaction matrix would have been computationally expensive. That is when I decided to use content-based filtering alone.
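A quick back-of-the-envelope calculation makes the cost concrete: a dense user-item matrix at those sizes would need billions of cells before any computation even starts.

```python
# Rough memory cost of a dense user-item interaction matrix for
# ~200,000 users x ~15,000 anime (the figures mentioned above),
# assuming one float32 per cell.
users = 200_000
anime = 15_000
bytes_per_float32 = 4

cells = users * anime
gib = cells * bytes_per_float32 / 2**30
print(f"{cells:,} cells ≈ {gib:.1f} GiB as dense float32")
```

In practice a matrix like this would usually be stored sparsely, but even then the pairwise computations over 200,000 users remain expensive compared to content-based filtering over 15,000 anime.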
Normalizing the Data
Once I determined that I would use content-based filtering, I had to research normalization techniques for the various types of data within my dataset and use them to create feature tensors with TensorFlow, further explained here.
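The two most common normalization patterns for this kind of data are min-max scaling for numeric fields and multi-hot encoding for categorical fields like genres. The project built TensorFlow tensors; the sketch below uses plain NumPy for brevity, and the feature names and values are illustrative assumptions.

```python
import numpy as np

# Min-max scaling: map a numeric feature (e.g. episode count) into [0, 1].
episodes = np.array([12.0, 24.0, 1000.0, 64.0])
scaled = (episodes - episodes.min()) / (episodes.max() - episodes.min())
print(scaled.round(3))

# Multi-hot encoding: one binary slot per known genre.
all_genres = ["Action", "Comedy", "Drama"]
anime_genres = ["Action", "Drama"]
multi_hot = np.array([1.0 if g in anime_genres else 0.0 for g in all_genres])
print(multi_hot)
```

Concatenating the scaled numeric features and the multi-hot vectors per anime yields the feature tensors that the clustering step consumes.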
Clustering the Data
Since the objective of my project was to recommend anime similar to a user-selected anime, I had to figure out a way to group anime based on their properties. This led me to research various existing clustering algorithms, and I settled on k-means clustering. To cluster anime effectively, I also had to choose a distance function to measure how similar two distinct anime are based on their properties; in the end, a combination of the Manhattan and Dice distances was used. This process is further explained here.
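A combined distance of the kind described above can be sketched as Manhattan distance over the numeric features plus Dice distance over the binary (multi-hot) features. The equal weighting and the feature split below are illustrative assumptions, not the project's exact formula.

```python
import numpy as np

def combined_distance(num_a, num_b, bin_a, bin_b, w_num=1.0, w_bin=1.0):
    # Manhattan distance: sum of absolute differences of numeric features.
    manhattan = np.abs(num_a - num_b).sum()
    # Dice distance: 1 - 2|A∩B| / (|A| + |B|) over binary features.
    overlap = np.logical_and(bin_a, bin_b).sum()
    total = bin_a.sum() + bin_b.sum()
    dice = 1.0 - (2.0 * overlap / total) if total else 0.0
    return w_num * manhattan + w_bin * dice

num_a = np.array([0.2, 0.5])   # e.g. scaled episode count, scaled year
num_b = np.array([0.3, 0.4])
bin_a = np.array([1, 0, 1])    # e.g. genres Action + Drama
bin_b = np.array([1, 1, 0])    # e.g. genres Action + Comedy
result = combined_distance(num_a, num_b, bin_a, bin_b)
print(result)  # ≈ 0.7 (Manhattan 0.2 + Dice 0.5)
```

Standard k-means assumes Euclidean distance, so using a custom metric like this typically means either a k-medoids-style variant or a hand-rolled assignment step.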
Designing the UI
Once all the previous steps were completed, the final step was to design a simple, easy-to-use UI that was as visually appealing as possible. To achieve this goal, I utilized React and React Router in combination with TailwindCSS and DaisyUI, alongside various npm packages such as MiniSearch, lodash, tailwind-scrollbar, React Infinite Scroller, react-slick, and slick-carousel.