How does BertTokenizer work in transformers?

This recipe explains how does BertTokenizer work in transformers.

Recipe Objective - How does BertTokenizer work in transformers?

Subword tokenization methods work on the idea that common words should not be broken down into smaller subwords, but rare words should be broken down into meaningful subwords.

Access Avocado Machine Learning Project for Price Prediction

For more related projects -

/projects/data-science-projects/neural-network-projects
/projects/data-science-projects/tensorflow-projects

Example of BertTokenizer:

# Importing BertTokenizer
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Passing input
bert_tokenizer.tokenize("Welcome to Transformers tutorials!!!")

Output - 
['welcome', 'to', 'transformers', 'tutor', '##ials', '!', '!', '!']

The sentence was lowercased first because we're using the uncased model. We can see that the words ["welcome", "to", "transformers"] are present in the tokenizer’s vocabulary, but the word "tutorials" is not. Consequently, the tokenizer splits "tutorials" into known subwords: ["tutor" and "##ials"]. The symbol "##" indicates that the remainder of the token should be connected to the previous one without leaving any gap (for decoding or reversal of the tokenization).

In this way, we can perform BertTokenizer in transformers.

What Users are saying..

profile image

Jingwei Li

Graduate Research assistance at Stony Brook University
linkedin profile url

ProjectPro is an awesome platform that helps me learn much hands-on industrial experience with a step-by-step walkthrough of projects. There are two primary paths to learn: Data Science and Big Data.... Read More

Relevant Projects

Azure Deep Learning-Deploy RNN CNN models for TimeSeries
In this Azure MLOps Project, you will learn to perform docker-based deployment of RNN and CNN Models for Time Series Forecasting on Azure Cloud.

Image Classification Model using Transfer Learning in PyTorch
In this PyTorch Project, you will build an image classification model in PyTorch using the ResNet pre-trained model.

Linear Regression Model Project in Python for Beginners Part 1
Machine Learning Linear Regression Project in Python to build a simple linear regression model and master the fundamentals of regression for beginners.

Build CNN for Image Colorization using Deep Transfer Learning
Image Processing Project -Train a model for colorization to make grayscale images colorful using convolutional autoencoders.

GCP MLOps Project to Deploy ARIMA Model using uWSGI Flask
Build an end-to-end MLOps Pipeline to deploy a Time Series ARIMA Model on GCP using uWSGI and Flask

Natural language processing Chatbot application using NLTK for text classification
In this NLP AI application, we build the core conversational engine for a chatbot. We use the popular NLTK text classification library to achieve this.

Build a Multi ClassText Classification Model using Naive Bayes
Implement the Naive Bayes Algorithm to build a multi class text classification model in Python.

Predictive Analytics Project for Working Capital Optimization
In this Predictive Analytics Project, you will build a model to accurately forecast the timing of customer and supplier payments for optimizing working capital.

Build an AI Quiz Generator from Video with OpenAI API
In this LLM project, you will build a model to automate the transcription of video content and generate interactive quizzes using OpenAI’s Whisper and GPT-4o.

Langchain Project for Customer Support App in Python
In this LLM Project, you will learn how to enhance customer support interactions through Large Language Models (LLMs), enabling intelligent, context-aware responses. This Langchain project aims to seamlessly integrate LLM technology with databases, PDF knowledge bases, and audio processing agents to create a comprehensive customer support application.

OSZAR »