Project Description

This project fine-tunes a DistilBERT model to predict news categories from news headlines alone. For a model demo and to download the model, please see my HuggingFace Repository 🤗.

HuggingFace Demo

Data Description

The data is from Kaggle.
There are 200,000 rows, and the column we predict has 42 categories.
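
Since a classifier needs integer targets, the category names in the predicted column have to be mapped to ids before training. The snippet below is a minimal sketch of that mapping, not the project's actual code: the category values are hypothetical samples, and the names `label2id` and `id2label` are assumptions chosen for illustration.

```python
# Hypothetical sample of the target column; the real column has 42 distinct categories.
categories = ["POLITICS", "SPORTS", "TRAVEL", "POLITICS", "SPORTS"]

# Sort the unique names so the mapping is stable across runs.
label_names = sorted(set(categories))
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Integer targets, in the same order as the rows.
labels = [label2id[name] for name in categories]
```

Keeping both directions of the mapping makes it easy to turn a predicted id back into a readable category name.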

Model Training

Input preprocessing

To transform the text data into vectors, I first applied TfidfVectorizer to preprocess it.

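from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be downloaded
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=True,
                        preprocessor=None,  # applied preprocessor in Data Cleaning
                        tokenizer=word_tokenize,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True,
                        stop_words='english',
                        max_df=0.5,  # ignore terms found in more than 50% of headlines
                        sublinear_tf=True)  # use 1 + ln(tf) instead of the raw count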
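
To make the effect of these settings concrete, here is a minimal pure-Python sketch of the weighting they select: sublinear term frequency, smoothed idf, L2 normalisation, and the `max_df` cut-off. It is an illustration under stated assumptions, not the scikit-learn implementation. The function name `tfidf_vectors`, the whitespace tokenizer, and the sample headlines are all hypothetical, and stop-word removal is left out.

```python
import math
from collections import Counter


def tfidf_vectors(docs, max_df=0.5):
    """Return (vocabulary, one L2-normalised {term: weight} dict per document)."""
    # Simple lowercase + whitespace tokenizer (an assumption, not word_tokenize).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # max_df as a proportion: drop terms found in more than max_df * n_docs documents.
    vocab = sorted(t for t, d in df.items() if d <= max_df * n_docs)

    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in idf)
        # Sublinear tf: 1 + ln(tf), multiplied by idf.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2 normalisation so every non-empty vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, vectors


# Hypothetical headlines, for illustration only.
headlines = [
    "stocks rally as markets rebound",
    "local team wins the championship",
    "the new study links sleep and health",
    "markets slide as the rally fades",
]
vocab, vectors = tfidf_vectors(headlines)
```

Here "the" occurs in three of the four headlines, which is above the 50% threshold, so it is dropped from the vocabulary, while "markets" (two of four) is kept.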

Loss History
Accuracy History

Project Detail

Please refer to the Project PDF for the project details.

For more details about the project's code, please refer to my GitHub Repository.