How I Determined The Success Of A Manga Using Data Science

Keegan Fernandes
7 min read · Sep 19, 2021
Photo by Gracia Dharma on Unsplash

Introduction

Every year, hundreds of manga are abandoned by publishers. As someone who reads voraciously, I find it heartbreaking to see a piece of literature abandoned. Often this comes down to a lack of readers, poor translation, or the way the manga is publicized. Using data science and analytics, I will try to find a way to determine the success of a manga and, using this analysis, maybe change the way manga are presented to the public.

The Data

This data was scraped from MyAnimeList (myanimelist.net). For more information on the web scraping, check out my previous article. The data consists of the following columns: MALID, Name, Score, Genres, and Synopsis (the description of the manga). Note that the code below refers to the last column as sypnopsis. In the Score column, a value of “Unknown” means that the manga has not been rated by any members. The files are structured in the following format on my local machine.

The EDA (Exploratory Data Analysis)

If you are following along, here is the code for the EDA. I explain my reasoning for each step below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from gensim.parsing.preprocessing import remove_stopwords
from wordcloud import WordCloud
df = pd.read_csv("../Data/anime_with_synopsis.csv")
display(df)
display(df.info())
display(df.describe())
numerical_features_columns = [""]
description_label = ""
df["Score"] = df["Score"].apply(lambda x : .0 if x == "Unknown" else x)
df.dropna(inplace = True)
categories_list = []
for row in df["Genres"]:
    # Genres are stored as one comma-separated string per title
    categories_list.extend(x.strip() for x in row.split(","))
categories = list(set(categories_list))
categories_count = [categories_list.count(x) for x in categories]
fig = plt.figure(figsize=(15,25))
# barh puts the categories on the y-axis and the counts on the x-axis
plt.barh(categories, categories_count)
plt.xlabel("Count")
plt.ylabel("Category")
plt.title("Genres")
plt.show()
df["Score"] = pd.to_numeric(df["Score"])
fig = plt.figure(figsize=(15,25))
plt.hist(list(df["Score"]) , bins = [0,1,2,3,4,5,6,7,8,9,10])
plt.show()
# We'll be removing all the manga with rating 0 to fit the gaussian curve
df["Score"] = pd.to_numeric(df["Score"])
fig = plt.figure(figsize=(15,25))
plt.hist(list(df[df["Score"] != 0]["Score"]) , bins = [0,1,2,3,4,5,6,7,8,9,10])
plt.show()
df = df[df["Score"] != 0]
plt.figure(figsize=(30,30))
text = " ".join(synopsis for synopsis in df["sypnopsis"])  # avoid shadowing the built-in str
text_without_stopwords = remove_stopwords(text)
word_cloud = WordCloud(background_color = "black" , collocations = True).generate(text_without_stopwords)
plt.imshow(word_cloud , interpolation= "bilinear")
plt.axis("off")
plt.show()
high_df = df[df["Score"] > 5]
low_df = df[df["Score"] <= 5]
plt.figure(figsize=(30,30))
text = " ".join(synopsis for synopsis in high_df["sypnopsis"])
text_without_stopwords = remove_stopwords(text)
word_cloud = WordCloud(background_color = "black" , collocations = True).generate(text_without_stopwords)
plt.imshow(word_cloud , interpolation= "bilinear")
plt.axis("off")
plt.show()
plt.figure(figsize=(30,30))
text = " ".join(synopsis for synopsis in low_df["sypnopsis"])
text_without_stopwords = remove_stopwords(text)
word_cloud = WordCloud(background_color = "black" , collocations = True).generate(text_without_stopwords)
plt.imshow(word_cloud , interpolation= "bilinear")
plt.axis("off")
plt.show()

Since the manga that hadn’t been rated had a Score of “Unknown”, I first gave them a rating of zero.

I then made a histogram of the scores. The Score column seemed to follow a normal distribution, except that the bar at zero was far too high. The normal distribution, or bell curve, is the shape that many measured quantities roughly follow, for example exam scores. This is what a bell curve should look like.

This is the result I got from the Score histogram.

As you can see, my decision to label all the manga with an “Unknown” score as zero was wrong. To fix this, I dropped all the rows with an “Unknown” score. Although I lost a lot of data, this ensures more accurate readings.
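
A more direct way to handle unrated titles is to treat “Unknown” as a missing value from the start instead of converting it to zero first. Here is a minimal sketch, assuming a DataFrame with a Score column like the one above. The helper name drop_unrated and the sample rows are made up for illustration and are not part of my original code.

```python
import pandas as pd


def drop_unrated(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without rows whose Score is not a number."""
    out = df.copy()
    # errors="coerce" turns any non-numeric value such as "Unknown" into NaN
    out["Score"] = pd.to_numeric(out["Score"], errors="coerce")
    return out.dropna(subset=["Score"])


# Made-up rows, for illustration only
sample = pd.DataFrame({"Name": ["A", "B", "C"], "Score": ["7.5", "Unknown", "6.0"]})
rated = drop_unrated(sample)
print(rated)
```

This keeps the zero-score bar out of the histogram without relying on zero as a placeholder, which would matter if a title could ever have a genuine score of zero.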

After removing the “Unknown” scores, here's what the histogram looked like. This is far better and fits the bell curve well. Looking at the histogram, most scores seem to fall between 4 and 8.
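
Judging the fit by eye is subjective, so a rough numeric check can help. For a normal distribution, about 68% of values lie within one standard deviation of the mean. The sketch below compares that figure with the share observed in a list of scores. It is only a sanity check, not a formal normality test, and the function name and toy scores are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev


def share_within_one_std(scores):
    """Fraction of scores that lie within one standard deviation of the mean."""
    mu = mean(scores)
    sigma = pstdev(scores)
    inside = [s for s in scores if mu - sigma <= s <= mu + sigma]
    return len(inside) / len(scores)


# Share a perfect normal distribution would put within one standard deviation
expected_share = NormalDist().cdf(1) - NormalDist().cdf(-1)

# Made-up scores, for illustration only
toy_scores = [1, 2, 3, 4, 5]
print(share_within_one_std(toy_scores), round(expected_share, 3))
```

On the real data you could call share_within_one_std(list(df["Score"])) before and after removing the zeros and see which result is closer to expected_share.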

Next, I counted the number of manga in each genre. Comedy and Action have the most manga. The most frequent genres were Comedy, Action, Fantasy, Drama, Kids, Adventure, Music and Sci-Fi.

I then made a wordcloud from the Synopsis column to see which words appear most often.

Next, I split the data into manga with scores greater than 5 and manga with scores of 5 or less, and made a wordcloud for each group to compare the words used at the different rating levels.
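
The genre count can also be done in a single pass with collections.Counter instead of calling list.count once per genre. This is a small sketch of that idea. The function name count_genres and the sample rows are hypothetical.

```python
from collections import Counter


def count_genres(genre_strings):
    """Count how many titles carry each genre."""
    counts = Counter()
    for row in genre_strings:
        # Each row is one comma-separated string, e.g. "Action, Comedy"
        counts.update(g.strip() for g in row.split(","))
    return counts


# Made-up rows, for illustration only
toy_genres = ["Action, Comedy", "Comedy, Drama", "Comedy"]
genre_counts = count_genres(toy_genres)
print(genre_counts.most_common(2))
```

With the real data, count_genres(df["Genres"]) should give the same counts as the bar chart, and most_common() returns them already sorted.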

WordCloud for a rating greater than 5
WordCloud for a rating of 5 or less
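
A wordcloud shows frequency visually, but the exact counts can be easier to compare between the two groups. Here is a minimal sketch of a word counter, assuming a very small hand-written stopword list. My EDA code uses gensim's remove_stopwords instead, so the numbers would not match exactly. The name top_words and the sample sentences are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list, much shorter than gensim's
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def top_words(texts, n=10):
    """Return the n most frequent non-stopword tokens across texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


# Made-up synopses, for illustration only
toy = ["The hero fights the demon", "A hero saves the world"]
print(top_words(toy, n=1))
```

Calling top_words(high_df["sypnopsis"]) and top_words(low_df["sypnopsis"]) side by side would show whether the two groups really use different vocabulary.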

The Analysis

The goal of this project was to test whether there is a statistically significant relationship between a manga's Genre or Synopsis and its score. So I will make the following two hypotheses.

  1. There exists a linear relationship between the Synopsis and the score on MyAnimeList.
  2. There is no linear relationship between the Genre and the score on MyAnimeList, i.e. all Genres have an equal chance of succeeding.

To test these hypotheses, I built a model for each one and used its accuracy to judge how strong the relationship is. This is why I used a classification model instead of a regression model. I set the following conditions for accepting that there is a linear relationship.

  1. An accuracy above 0.85 will be taken as proof of the relationship between Synopsis and score.
  2. An accuracy above 0.91 will be taken as proof of the relationship between Genre and score.

The reason I chose a higher threshold for Genre is that certain genres, like Comedy, occur much more often than others, and the model might learn this frequency instead.
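
Accuracy figures are easier to interpret next to a baseline. The sketch below turns continuous scores into integer classes the same way the modelling code does, with round(), and then computes the accuracy you would get by always predicting the most common class. The function names and toy scores are assumptions for illustration. Note that Python's round() sends exact halves to the nearest even integer, so 6.5 becomes 6 while 7.5 becomes 8.

```python
from collections import Counter


def to_classes(scores):
    """Round each score to the nearest integer class."""
    # Python's round() rounds exact halves to the nearest even integer
    return [round(s) for s in scores]


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up scores, for illustration only
toy_scores = [6.5, 7.5, 7.2, 6.8]
toy_labels = to_classes(toy_scores)
print(toy_labels, majority_baseline_accuracy(toy_labels))
```

A model only shows that it has learned something from the text if its accuracy is clearly above this baseline on the same labels.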

Here is the code if you want to try it for yourselves (shoutout to Gunjit Bedi)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm,naive_bayes
from gensim.utils import simple_preprocess
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
df = pd.read_csv("../Data/anime_with_synopsis.csv")
display(df)
numerical_features_columns = [""]
description_label = ""
df["Score"] = df["Score"].apply(lambda x : .0 if x == "Unknown" else x)
df["Score"] = pd.to_numeric(df["Score"])
df = df[df["Score"] != 0.0]
df.dropna(inplace = True)
display(df)
display(df["sypnopsis"].values[0])
display(simple_preprocess(df["sypnopsis"].values[0]))
df["final_text"] = df["sypnopsis"].apply(lambda x:" ".join(simple_preprocess(x)))
Tfidf_vect = TfidfVectorizer(max_features = 1500)
Tfidf_vect.fit(df["final_text"])
X = Tfidf_vect.transform(df["final_text"])
y = df["Score"]
y = [round(x) for x in y]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33 , random_state = 42 , stratify= y)
model = naive_bayes.MultinomialNB()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, prediction))
df["final_Genres"] = df["Genres"].apply(lambda x:" ".join(simple_preprocess(x)))
Tfidf_vect = TfidfVectorizer(max_features = 25)
Tfidf_vect.fit(df["final_Genres"])
X = Tfidf_vect.transform(df["final_Genres"])
y = df["Score"]
y = [round(x) for x in y]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.33 , random_state = 42 , stratify= y)
model = naive_bayes.MultinomialNB()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print(accuracy_score(y_test, prediction))
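
The listing above imports svm but only fits the Naive Bayes model. The exact SVM configuration is not shown in this article, so the following is only a sketch of how a linear SVM could be fitted on TF-IDF features, assuming scikit-learn's LinearSVC with default settings. The helper name fit_linear_svm and the toy synopses are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def fit_linear_svm(texts, labels, max_features=1500):
    """Fit a TF-IDF + linear SVM text classifier and return the fitted pipeline."""
    model = make_pipeline(TfidfVectorizer(max_features=max_features), LinearSVC())
    model.fit(texts, labels)
    return model


# Made-up synopses and rounded scores, for illustration only
toy_texts = [
    "ninja battle fight",
    "ninja fight tournament",
    "school romance love",
    "love romance comedy",
]
toy_labels = [8, 8, 6, 6]
svm_model = fit_linear_svm(toy_texts, toy_labels)
print(svm_model.predict(["ninja tournament"]))
```

With the variables from the listing above, the analogous step would be to fit svm.LinearSVC() on X_train and y_train and score it with accuracy_score in the same way as the Naive Bayes model.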

Results

These were the results from fitting a linear model to the Synopsis:

.-------------.----------.
| Model       | Accuracy |
:-------------+----------:
| SVM         |     0.55 |
:-------------+----------:
| Naïve Bayes |     0.49 |
'-------------'----------'

And these were the results I obtained from trying to fit a linear model to Genres:

.-------------.----------.
| Model       | Accuracy |
:-------------+----------:
| SVM         |     0.46 |
:-------------+----------:
| Naïve Bayes |     0.43 |
'-------------'----------'

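From the results, we can see that my first hypothesis was wrong and the second one was right. The Synopsis accuracies (0.55 and 0.49) are well below the 0.85 threshold, and the Genre accuracies (0.46 and 0.43) are well below the 0.91 threshold. There is no linear relationship between the Synopsis and the Score, or between the Genre and the Score.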

Conclusion

As you can see, my first hypothesis turned out to be wrong and the second one turned out to be right. This doesn’t mean that there is no way to predict the score of a manga. As I said, there is no linear relationship between these values, and there is a reason why I chose the word linear. Non-linear models can make predictions that wouldn’t be possible with a linear model. Examples of non-linear models are neural networks, specifically BERT. These models have been designed to classify text with high accuracy. You can also use pretrained versions, trained on a huge volume of text, through Hugging Face Transformers.

There are also things like the artwork and the number of chapters that contribute to the success of a manga and that you could use to classify the score. Data science doesn’t just depend on your data but on your creativity as well.

Keegan Fernandes

First-year student in MSc Data Science. Writes data science and machine learning tutorials and about the impact they have on the world.