Generating the Best Article Names using Neural Networks and Data Analytics

Keegan Fernandes
5 min read · Aug 30, 2021


Photo by Nick Fewings on Unsplash

Article names are hard to decide: a reader may or may not click on your article based on the title alone. Making good titles is hard, so as a data scientist I decided to automate the task using GPT2 and NLP.

Data

I made a CSV file containing the best data science articles from various tags, scraped from Medium.com using ParseHub. The file has information on each article's title, tags used, publication, claps received, number of responses, and so on. The dataset is available on Kaggle and is called Medium-Search-Dataset.

Task

My task is to build a text generator that produces coherent article titles. I will use the transformers library for pre-processing and model building, and then fine-tune the model using PyTorch Lightning.

Installing Transformers

To install Transformers, run the following command in your environment.

pip install transformers

This installs the library in your environment. If you want to skip this step, run your notebook in a Kaggle kernel, which comes with the transformers library preinstalled.

Notebook

This notebook was run on Kaggle using a GPU. You can view the full notebook on Kaggle and GitHub. I recommend running it in a Kaggle kernel rather than on a local machine or Colab, since Kaggle already has most of the dependencies installed. PyTorch Lightning is used as a wrapper class to speed up model building.

Dependencies

Run the cell below to make sure you have all the necessary packages installed in your environment; if any package is missing, it will throw an error.

from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split

Loading the Data

df = pd.read_csv("../input/mediumsearchdataset/Train.csv")
df
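The notebook reads titles from the post_name column. Before tokenizing, it is worth taking a quick look at the columns and checking for missing titles; a minimal sketch (the dropna step is a precaution of mine, not part of the original notebook):

# inspect the available columns and a few sample titles
print(df.columns.tolist())
print(df["post_name"].head())

# precaution: drop rows with missing titles so the tokenizer never sees NaN
df = df.dropna(subset = ["post_name"]).reset_index(drop = True)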

Downloading GPT2

I will be downloading GPT2-large, which is available for public use. It has a size of about 3 GB, which is why I recommend using a remote notebook such as Kaggle.

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2-large")

Testing the GPT2 model (before finetuning)

tokenizer.pad_token = tokenizer.eos_token
prompt = tokenizer.encode("machine learning", max_length = 30, padding = "max_length", truncation = True, return_tensors = "pt")
output = gpt2.generate(prompt, do_sample = True, max_length = 100, top_k = 10, temperature = 0.8)
tokenizer.decode(output[0], skip_special_tokens = True)

As we can see, while the model does generate text from the prompt “machine learning”, it is nowhere near title material. In the following sections, we will fine-tune the model to generate better titles.

Dataset

The Dataset class tokenizes each title and returns the token IDs to the DataLoader.

class TitleDataset(Dataset):
    def __init__(self, titles):
        self.tokenizer = tokenizer
        self.titles = titles

    def __len__(self):
        return len(self.titles)

    def __getitem__(self, index):
        title = self.titles[index]
        title_token = tokenizer.encode(title, max_length = 30, padding = "max_length", truncation = True, return_tensors = "pt").reshape(-1)
        return title_token

# sanity check
dset = TitleDataset(df["post_name"].values)
title = next(iter(DataLoader(dset, batch_size = 1, shuffle = True)))
display(title)
Tokenized title

DataModule
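The DataModule wraps the train, validation, and test splits in DataLoaders. It expects x_train and x_test, which come from the train_test_split imported earlier; a minimal sketch, assuming an 80/20 split (the exact ratio isn't shown here):

# assumption: an 80/20 train/test split on the dataframe of titles
x_train, x_test = train_test_split(df, test_size = 0.2, random_state = 42)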

class TitleDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        self.train = TitleDataset(x_train["post_name"].values)
        self.test = TitleDataset(x_test["post_name"].values)
        self.val = TitleDataset(x_test["post_name"].values)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size = 1, shuffle = True)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size = 1, shuffle = False)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size = 1, shuffle = False)

The Model

GPT2 returns the output logits and the loss when the tokenized text is passed to it along with labels.
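A quick way to see this interface before wiring it into Lightning; this check is purely illustrative and reuses the tokenizer and gpt2 objects loaded earlier:

# passing labels equal to the input ids makes the model return a language-modelling loss
sample = tokenizer("A sample title", return_tensors = "pt")
out = gpt2(sample["input_ids"], labels = sample["input_ids"])
print(out.loss, out.logits.shape)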

class TitleGenerator(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.neural_net = gpt2

    def forward(self, x):
        return self.neural_net(x, labels = x)

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr = 1e-4)

    def training_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss

    def test_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss

    def validation_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss

Training

Fine-tuning a GPT2 model takes a long time, so I recommend using a GPU if available. Lightning lets us declare the GPUs in the Trainer while it handles the rest. Six epochs should take around 30 minutes.

from pytorch_lightning import Trainer
model = TitleGenerator()
module = TitleDataModule()
trainer = Trainer(max_epochs = 6, gpus = 1)
trainer.fit(model,module)
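If you want to reuse the fine-tuned weights later without retraining, Lightning can write a checkpoint after training; the file name here is just an example:

# save the Lightning checkpoint so the fine-tuned weights can be reloaded later
trainer.save_checkpoint("title_generator.ckpt")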

Testing and Predictions

I wouldn’t recommend doing this if you are planning to deploy the code to production, as it may cause errors down the line. The following code is a quick and dirty way to copy the fine-tuned weights back into the original model.

gpt2.load_state_dict(model.neural_net.state_dict())

Generating Titles

raw_text = ["The", "machine Learning", "A", "Data science", "AI", "A", "The", "Why", "how"]
for x in raw_text:
    prompts = tokenizer.encode(x, return_tensors = "pt")
    outputs = gpt2.generate(prompts, do_sample = True, max_length = 32, top_k = 10, temperature = 0.8)
    display(tokenizer.decode(outputs[0], skip_special_tokens = True))

Endnote

I would deploy the model as an API, but the model is over 3 GB and it wouldn’t make sense cost-wise to host it on a website. You could also try uploading the model to the Hugging Face Hub after fine-tuning it. Check out the Hugging Face website for more models and tutorials.
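If you do want to share the fine-tuned model, the transformers save_pretrained and push_to_hub methods are the usual route; a sketch, assuming you are logged in to the Hugging Face Hub and the repository name is your own choice:

# save the fine-tuned GPT2 weights and tokenizer locally
gpt2.save_pretrained("medium-title-generator")
tokenizer.save_pretrained("medium-title-generator")

# optional: upload to the Hugging Face Hub (requires huggingface-cli login)
# gpt2.push_to_hub("medium-title-generator")
# tokenizer.push_to_hub("medium-title-generator")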


Keegan Fernandes

First-year MSc Data Science student. Writes about data science and machine learning tutorials and their impact on the world.