Generating the Best Article Names using Neural Networks and Data Analytics
Article names are hard to decide: a reader may or may not click on your article based on the title alone. Coming up with good titles is hard, so as a data scientist I decided to automate the task using GPT2 and NLP.
Data
I made a CSV file containing the best data science articles from various tags, scraped from Medium.com using Parsehub. The CSV has information on the title of each article, the tags used, the publication, the number of claps, the number of responses, etc. The dataset is available on Kaggle as Medium-Search-Dataset.
Task
My task is to build a text generator that produces coherent article titles. I will use the transformers library for pre-processing and model building, and then finetune the model using PyTorch Lightning.
Installing Transformers
To install Transformers, activate your environment and run the following command.
pip install transformers
This installs the library in your environment. If you want to skip this step, run your notebook in a Kaggle kernel, which comes with the transformers library preinstalled.
Notebook
This notebook was run on Kaggle using a GPU. You can view the full notebook on Kaggle and GitHub. I recommend running it in a Kaggle kernel rather than on a local machine or Colab, since Kaggle already has most of the dependencies installed. PyTorch Lightning will be used as a wrapper to speed up model building.
Dependencies
Run the cell below to make sure all the necessary packages are installed in your environment. If any package is missing, the imports will throw an error.
from transformers import GPT2LMHeadModel, GPT2Tokenizer, AdamW
import pandas as pd
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from sklearn.model_selection import train_test_split
Data
df = pd.read_csv("../input/mediumsearchdataset/Train.csv")
df
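Only the article title column (post_name in this dataset) is used in the rest of the notebook, so a quick peek at it confirms the data loaded correctly:
# look at a few raw titles from the post_name column
df["post_name"].head()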
Downloading GPT2
I will be downloading GPT2-large, which is available for public use. It is roughly 3 GB in size, which is why I recommend using a remote notebook such as Kaggle.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2-large")
Testing the GPT2 model (before finetuning)
tokenizer.pad_token = tokenizer.eos_token
prompt = tokenizer.encode("machine learning", max_length=30, padding="max_length", truncation=True, return_tensors="pt")
output = gpt2.generate(prompt, do_sample=True, max_length=100, top_k=10, temperature=0.8)
tokenizer.decode(output[0], skip_special_tokens=True)
As we can see, the model does generate text for the prompt “machine learning”, but it is nowhere near title material. In the following sections, we will finetune the model to generate better text.
Dataset
The Dataset class tokenizes the titles and serves the tokenized tensors to the DataLoader.
class TitleDataset(Dataset):
    def __init__(self, titles):
        self.tokenizer = tokenizer
        self.titles = titles

    def __len__(self):
        return len(self.titles)

    def __getitem__(self, index):
        title = self.titles[index]
        title_token = tokenizer.encode(title, max_length=30, padding="max_length", truncation=True, return_tensors="pt").reshape(-1)
        return title_token

# sanity check
dset = TitleDataset(df["post_name"].values)
title = next(iter(DataLoader(dset, batch_size=1, shuffle=True)))
display(title)
DataModule
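The DataModule below pulls its titles from x_train and x_test, so the dataframe has to be split first. A minimal sketch using the train_test_split imported earlier (the 10% test size and random seed are my assumptions, not values from the original notebook):
# split the dataframe into train and test sets (split ratio and seed are assumptions)
x_train, x_test = train_test_split(df, test_size=0.1, random_state=42)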
class TitleDataModule(pl.LightningDataModule):
    def __init__(self):
        super().__init__()
        self.train = TitleDataset(x_train["post_name"].values)
        self.test = TitleDataset(x_test["post_name"].values)
        self.val = TitleDataset(x_test["post_name"].values)

    def train_dataloader(self):
        return DataLoader(self.train, batch_size=1, shuffle=True)

    def test_dataloader(self):
        return DataLoader(self.test, batch_size=1, shuffle=False)

    def val_dataloader(self):
        return DataLoader(self.val, batch_size=1, shuffle=False)
The Model
GPT2 returns the output logits and the loss when the tokenized text is passed to it along with the same tokens as labels.
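To make that concrete, here is a minimal sketch of a single forward pass (the prompt text is arbitrary); passing the input ids as labels is what makes GPT2 compute the language-modelling loss alongside the logits:
# encode a sample title and pass it as both input and labels (example text is arbitrary)
ids = tokenizer.encode("why data science is fun", return_tensors="pt")
out = gpt2(ids, labels=ids)
print(out.loss)          # language-modelling loss
print(out.logits.shape)  # (batch_size, sequence_length, vocab_size)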
class TitleGenerator(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.neural_net = gpt2  # the GPT2LMHeadModel loaded earlier

    def forward(self, x):
        return self.neural_net(x, labels=x)

    def configure_optimizers(self):
        return AdamW(self.parameters(), 1e-4)

    def training_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss

    def test_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss

    def validation_step(self, batch, batch_idx):
        x = batch
        output = self(x)
        return output.loss
Training
Fine-tuning a GPT2 model takes a long time, so I recommend using a GPU if available. Lightning lets us declare the GPUs in the trainer and handles the rest. Six epochs should take around 30 minutes.
from pytorch_lightning import Trainer

model = TitleGenerator()
module = TitleDataModule()
trainer = Trainer(max_epochs=6, gpus=1)
trainer.fit(model, module)
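Kaggle sessions are temporary, so if you want to keep the fine-tuned weights around, Lightning can write a checkpoint; a minimal sketch (the file name is my assumption):
# persist the fine-tuned weights so they survive the notebook session (file name is an assumption)
trainer.save_checkpoint("title_generator.ckpt")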
Testing and Predictions
I wouldn’t recommend doing this if you are planning on deploying the code to production, as it may cause errors down the line. The following code is a quick and dirty way to copy the fine-tuned weights back into the original gpt2 model.
gpt2.load_state_dict(model.neural_net.state_dict())  # copy the fine-tuned weights into gpt2
Generating Titles
raw_text = ["The", "machine Learning", "A", "Data science", "AI", "A", "The", "Why", "how"]
for x in raw_text:
    prompts = tokenizer.encode(x, return_tensors="pt")
    outputs = gpt2.generate(prompts, do_sample=True, max_length=32, top_k=10, temperature=0.8)
    display(tokenizer.decode(outputs[0], skip_special_tokens=True))
Endnote
I would deploy the model as an API; however, the model is over 3 GB and it really wouldn’t make sense cost-wise to host it on a website. You could instead upload the model to the Hugging Face Hub after finetuning it. Check out the Hugging Face website for more models and tutorials.
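If you do want to share the fine-tuned model on the Hugging Face Hub, a minimal sketch could look like this (the directory and repository names are my assumptions, and pushing requires being logged in to the Hub):
# save the fine-tuned model and tokenizer locally (directory name is an assumption)
gpt2.save_pretrained("finetuned-title-gpt2")
tokenizer.save_pretrained("finetuned-title-gpt2")
# optionally push them to the Hugging Face Hub (repo name is an assumption; requires huggingface-cli login)
gpt2.push_to_hub("finetuned-title-gpt2")
tokenizer.push_to_hub("finetuned-title-gpt2")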