Transformers for Natural Language Processing and Computer Vision

Third Edition

Explore Generative AI and Large Language Models with Hugging Face, ChatGPT, GPT-4V, and DALL-E 3

Denis Rothman

BIRMINGHAM—MUMBAI

Transformers for Natural Language Processing and Computer Vision

Third Edition

Copyright © 2024 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Bhavesh Amin

Acquisition Editor – Peer Reviews: Tejas Mhasvekar

Project Editor: Janice Gonsalves

Content Development Editor: Bhavesh Amin

Copy Editor: Safis Editing

Technical Editor: Karan Sonawane

Proofreader: Safis Editing

Indexer: Rekha Nair

Presentation Designer: Ajay Patule

First published: January 2021

Second edition: February 2022

Third edition: February 2024

Production reference: 1280224

Published by Packt Publishing Ltd.

Grosvenor House

11 St Paul’s Square

Birmingham

B3 1RB, UK.

ISBN 978-1-80512-872-4

www.packt.com

Contributors

About the author

Denis Rothman graduated from Sorbonne University and Paris Diderot University, designing one of the first patented encoding and embedding systems. He authored one of the first patented AI cognitive robots and bots. He began his career delivering Natural Language Processing (NLP) chatbots for Moët et Chandon and an AI tactical defense optimizer for Airbus (formerly Aerospatiale).

Denis then authored an AI resource optimizer for IBM and luxury brands, leading to an Advanced Planning and Scheduling (APS) solution used worldwide.

I want to thank the corporations that trusted me from the start to deliver artificial intelligence solutions and shared the risks of continuous innovation. I also want to thank my family, who always believed I would make it.

About the reviewer

George Mihaila has 7 years of research experience with transformer models, having worked with them since they came out in 2017. He is a final-year PhD student in computer science, researching transformer models in Natural Language Processing (NLP). His research covers both generative and predictive NLP modeling.

He has over 6 years of industry experience working with transformer models and machine learning at top companies, covering areas from NLP and computer vision to explainability and causality. George has worked in both science and engineering roles. He is an end-to-end machine learning expert who leads research and development, as well as MLOps, optimization, and deployment.

He was a technical reviewer for the first and second editions of Transformers for Natural Language Processing by Denis Rothman.

Join our community on Discord

Join our community’s Discord space for discussions with the authors and other readers: https://www.packt.link/Transformers

Contents

Preface

Who this book is for

What this book covers

To get the most out of this book

Get in touch

1. What Are Transformers?

How constant time complexity O(1) changed our lives forever

O(1) attention conquers O(n) recurrent methods

Attention layer

Recurrent layer

The magic of the computational time complexity of an attention layer

Computational time complexity with a CPU

Computational time complexity with a GPU

Computational time complexity with a TPU

TPU-LLM

A brief journey from recurrent to attention

A brief history

From one token to an AI revolution

From one token to everything

Foundation Models

From general purpose to specific tasks

The role of AI professionals

The future of AI professionals

What resources should we use?

Decision-making guidelines

The rise of transformer seamless APIs and assistants

Choosing ready-to-use API-driven libraries

Choosing a cloud platform and transformer model

Summary

Questions

References

Further reading

2. Getting Started with the Architecture of the Transformer Model

The rise of the Transformer: Attention Is All You Need

The encoder stack

Input embedding

Positional encoding

Sublayer 1: Multi-head attention

Sublayer 2: Feedforward network

The decoder stack

Output embedding and position encoding

The attention layers

The FFN sublayer, the post-LN, and the linear layer

Training and performance

Hugging Face transformer models

Summary

Questions

References

Further reading

3. Emergent vs Downstream Tasks: The Unseen Depths of Transformers

The paradigm shift: What is an NLP task?

Inside the head of the attention sublayer of a transformer

Exploring emergence with ChatGPT

Investigating the potential of downstream tasks

Evaluating models with metrics

Accuracy score

F1-score

MCC

Human evaluation

Benchmark tasks and datasets

Defining the SuperGLUE benchmark tasks

Running downstream tasks

The Corpus of Linguistic Acceptability (CoLA)

Stanford Sentiment TreeBank (SST-2)

Microsoft Research Paraphrase Corpus (MRPC)

Winograd schemas

Summary

Questions

References

Further reading

4. Advancements in Translations with Google Trax, Google Translate, and Gemini

Defining machine translation

Human transductions and translations

Machine transductions and translations

Evaluating machine translations

Preprocessing a WMT dataset

Preprocessing the raw data

Finalizing the preprocessing of the datasets

Evaluating machine translations with BLEU

Geometric evaluations

Applying a smoothing technique

Translations with Google Trax

Installing Trax

Creating the Original Transformer model

Initializing the model using pretrained weights

Tokenizing a sentence

Decoding from the Transformer

De-tokenizing and displaying the translation

Translation with Google Translate

Translation with a Google Translate AJAX API Wrapper

Implementing googletrans

Translation with Gemini

Gemini’s potential

Summary

Questions

References

Further reading

5. Diving into Fine-Tuning through BERT

The architecture of BERT

The encoder stack

Preparing the pretraining input environment

Pretraining and fine-tuning a BERT model

Fine-tuning BERT

Defining a goal

Hardware constraints

Installing Hugging Face Transformers

Importing the modules

Specifying CUDA as the device for torch

Loading the CoLA dataset

Creating sentences, label lists, and adding BERT tokens

Activating the BERT tokenizer

Processing the data

Creating attention masks

Splitting the data into training and validation sets

Converting all the data into torch tensors

Selecting a batch size and creating an iterator

BERT model configuration

Loading the Hugging Face BERT uncased base model

Optimizer grouped parameters

The hyperparameters for the training loop

The training loop

Training evaluation

Predicting and evaluating using the holdout dataset

Exploring the prediction process

Evaluating using the Matthews correlation coefficient

Matthews correlation coefficient evaluation for the whole dataset

Building a Python interface to interact with the model

Saving the model

Creating an interface for the trained model

Interacting with the model

Summary

Questions

References

Further reading

6. Pretraining a Transformer from Scratch through RoBERTa

Training a tokenizer and pretraining a transformer

Building KantaiBERT from scratch

Step 1: Loading the dataset

Step 2: Installing Hugging Face transformers

Step 3: Training a tokenizer

Step 4: Saving the files to disk

Step 5: Loading the trained tokenizer files

Step 6: Checking resource constraints: GPU and CUDA

Step 7: Defining the configuration of the model

Step 8: Reloading the tokenizer in transformers

Step 9: Initializing a model from scratch

Exploring the parameters

Step 10: Building the dataset

Step 11: Defining a data collator

Step 12: Initializing the trainer

Step 13: Pretraining the model

Step 14: Saving the final model (+tokenizer + config) to disk

Step 15: Language modeling with FillMaskPipeline

Pretraining a Generative AI customer support model on X data

Step 1: Downloading the dataset

Step 2: Installing Hugging Face transformers

Step 3: Loading and filtering the data

Step 4: Checking resource constraints: GPU and CUDA

Step 5: Defining the configuration of the model

Step 6: Creating and processing the dataset

Step 7: Initializing the trainer

Step 8: Pretraining the model

Step 9: Saving the model

Step 10: User interface to chat with the Generative AI agent

Further pretraining

Limitations

Next steps

Summary

Questions

References

Further reading

7. The Generative AI Revolution with ChatGPT

GPTs as GPTs

Improvement

Diffusion

New application sectors

Self-service assistants

Development assistants

Pervasiveness

The architecture of OpenAI GPT transformer models

The rise of billion-parameter transformer models

The increasing size of transformer models

Context size and maximum path length

From fine-tuning to zero-shot models

Stacking decoder layers

GPT models

OpenAI models as assistants

ChatGPT provides source code

GitHub Copilot code assistant

General-purpose prompt examples

Getting started with ChatGPT – GPT-4 as an assistant

1. GPT-4 helps to explain how to write source code

2. GPT-4 creates a function to show the YouTube presentation of GPT-4 by Greg Brockman on March 14, 2023

3. GPT-4 creates an application for WikiArt to display images

4. GPT-4 creates an application to display IMDb reviews

5. GPT-4 creates an application to display a newsfeed

6. GPT-4 creates a k-means clustering (KMC) algorithm

Getting started with the GPT-4 API

Running our first NLP task with GPT-4

Step 1: Installing OpenAI and Step 2: Entering the API key

Step 3: Running an NLP task with GPT-4

Key hyperparameters

Running multiple NLP tasks

Retrieval Augmented Generation (RAG) with GPT-4

Installation

Document retrieval

Augmented retrieval generation

Summary

Questions

References

Further reading

8. Fine-Tuning OpenAI GPT Models

Risk management

Fine-tuning a GPT model for completion (generative)

1. Preparing the dataset

1.1. Preparing the data in JSON

1.2. Converting the data to JSONL

2. Fine-tuning an original model

3. Running the fine-tuned GPT model

4. Managing fine-tuned jobs and models

Before leaving

Summary

Questions

References

Further reading

9. Shattering the Black Box with Interpretable Tools

Transformer visualization with BertViz

Running BertViz

Step 1: Installing BertViz and importing the modules

Step 2: Load the models and retrieve attention

Step 3: Head view

Step 4: Processing and displaying attention heads

Step 5: Model view

Step 6: Displaying the output probabilities of attention heads

Streaming the output of the attention heads

Visualizing word relationships using attention scores with pandas

exBERT

Interpreting Hugging Face transformers with SHAP

Introducing SHAP

Explaining Hugging Face outputs with SHAP

Transformer visualization via dictionary learning

Transformer factors

Introducing LIME

The visualization interface

Other interpretable AI tools

LIT

Running LIT

OpenAI LLMs explain neurons in transformers

Limitations and human control

Summary

Questions

References

Further reading

10. Investigating the Role of Tokenizers in Shaping Transformer Models

Matching datasets and tokenizers

Best practices

Step 1: Preprocessing

Step 2: Quality control

Step 3: Continuous human quality control

Word2Vec tokenization

Case 0: Words in the dataset and the dictionary

Case 1: Words not in the dataset or the dictionary

Case 2: Noisy relationships

Case 3: Words in a text but not in the dictionary

Case 4: Rare words

Case 5: Replacing rare words

Exploring sentence and WordPiece tokenizers to understand the efficiency of subword tokenizers for transformers

Word and sentence tokenizers

Sentence tokenization

Word tokenization

Regular expression tokenization

Treebank tokenization

White space tokenization

Punkt tokenization

Word punctuation tokenization

Multi-word tokenization

Subword tokenizers

Unigram language model tokenization

SentencePiece

Byte-Pair Encoding (BPE)

WordPiece

Exploring in code

Detecting the type of tokenizer

Displaying token-ID mappings

Analyzing and controlling the quality of token-ID mappings

Summary

Questions

References

Further reading

11. Leveraging LLM Embeddings as an Alternative to Fine-Tuning

LLM embeddings as an alternative to fine-tuning

From prompt design to prompt engineering

Fundamentals of text embedding with NLTK and Gensim

Installing libraries

1. Reading the text file

2. Tokenizing the text with Punkt

Preprocessing the tokens

3. Embedding with Gensim and Word2Vec

4. Model description

5. Accessing a word and vector

6. Exploring Gensim’s vector space

7. TensorFlow Projector

Implementing question-answering systems with embedding-based search techniques

1. Installing the libraries and selecting the models

2. Implementing the embedding model and the GPT model

2.1 Evaluating the model with a knowledge base: GPT can answer questions

2.2 Add a knowledge base

2.3 Evaluating the model without a knowledge base: GPT cannot answer questions

3. Prepare search data

4. Search

5. Ask

5.1. Example question

5.2. Troubleshooting wrong answers

Transfer learning with Ada embeddings

1. The Amazon Fine Food Reviews dataset

1.2. Data preparation

2. Running Ada embeddings and saving them for future reuse

3. Clustering

3.1. Find the clusters using k-means clustering

3.2. Display clusters with t-SNE

4. Text samples in the clusters and naming the clusters

Summary

Questions

References

Further reading

12. Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4

Getting started with cutting-edge SRL

Entering the syntax-free world of AI

Defining SRL

Visualizing SRL

SRL experiments with ChatGPT with GPT-4

Basic sample

Difficult sample

Questioning the scope of SRL

The challenges of predicate analysis

Redefining SRL

From task-specific SRL to emergence with ChatGPT

1. Installing OpenAI

2. GPT-4 dialog function

3. SRL

Sample 1 (basic)

Sample 2 (basic)

Sample 3 (basic)

Sample 4 (difficult)

Sample 5 (difficult)

Sample 6 (difficult)

Summary

Questions

References

Further reading

13. Summarization with T5 and ChatGPT

Designing a universal text-to-text model

The rise of text-to-text transformer models

A prefix instead of task-specific formats

The T5 model

Text summarization with T5

Hugging Face

Selecting a Hugging Face transformer model

Initializing the T5-large transformer model

Getting started with T5

Exploring the architecture of the T5 model

Summarizing documents with T5-large

Creating a summarization function

A general topic sample

The Bill of Rights sample

A corporate law sample

From text-to-text to new word predictions with OpenAI ChatGPT

Comparing T5 and ChatGPT’s summarization methods

Pretraining

Specific versus non-specific tasks

Summarization with ChatGPT

Summary

Questions

References

Further reading

14. Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2

Architecture

Pathways

Client

Resource manager

Intermediate representation

Compiler

Scheduler

Executor

PaLM

Parallel layer processing that increases training speed

Shared input-output embeddings, which saves memory

No biases, which improves training stability

Rotary Positional Embedding (RoPE) improves model quality

SwiGLU activations improve model quality

PaLM 2

Improved performance, faster, and more efficient

Scaling laws, optimal model size, and the number of parameters

State-of-the-art (SOA) performance and a new training methodology

Assistants

Gemini

Google Workspace

Google Colab Copilot

Vertex AI PaLM 2 interface

Vertex AI PaLM 2 assistant

Vertex AI PaLM 2 API

Question answering

Question-answer task

Summarization of a conversation

Sentiment analysis

Multi-choice problems

Code

Fine-tuning

Creating a bucket

Fine-tuning the model

Summary

Questions

References

Further reading

15. Guarding the Giants: Mitigating Risks in Large Language Models

The emergence of functional AGI

Cutting-edge platform installation limitations

Auto-BIG-bench

WandB

When will AI agents replicate?

Function: `create_vocab`

Process:

Function: `scrape_wikipedia`

Process:

Function: `create_dataset`

Process:

Classes: `TextDataset`, `Encoder`, and `Decoder`

Function: `count_parameters`

Function: `main`

Process:

Saving and Executing the Model

Risk management

Hallucinations and memorization

Memorization

Risky emergent behaviors

Disinformation

Influence operations

Harmful content

Privacy

Cybersecurity

Risk mitigation tools with RLHF and RAG

1. Input and output moderation with transformers and a rule base

2. Building a knowledge base for ChatGPT and GPT-4

Adding keywords

3. Parsing the user requests and accessing the KB

4. Generating ChatGPT content with a dialog function

Token control

Moderation

Summary

Questions

References

Further reading

16. Beyond Text: Vision Transformers in the Dawn of Revolutionary AI

From task-agnostic models to multimodal vision transformers

ViT – Vision Transformer

The basic architecture of ViT

Step 1: Splitting the image into patches

Step 2: Building a vocabulary of image patches

Step 3: The transformer

Vision transformers in code

A feature extractor simulator

The transformer

Configuration and shapes

CLIP

The basic architecture of CLIP

CLIP in code

DALL-E 2 and DALL-E 3

The basic architecture of DALL-E

Getting started with the DALL-E 2 and DALL-E 3 API

Creating a new image

Creating a variation of an image

From research to mainstream AI with DALL-E

GPT-4V, DALL-E 3, and divergent semantic association

Defining divergent semantic association

Creating an image with ChatGPT Plus with DALL-E

Implementing the GPT-4V API and experimenting with DAT

Example 1: A standard image and text

Example 2: Divergent semantic association, moderate divergence

Example 3: Divergent semantic association, high divergence

Summary

Questions

References

Further Reading

17. Transcending the Image-Text Boundary with Stable Diffusion

Transcending image generation boundaries

Part I: Defining text-to-image with Stable Diffusion

1. Text embedding using a transformer encoder

2. Random image creation with noise

3. Stable Diffusion model downsampling

4. Decoder upsampling

5. Output image

Running the Keras Stable Diffusion implementation

Part II: Running text-to-image with Stable Diffusion

Generative AI Stable Diffusion for a Divergent Association Task (DAT)

Part III: Video

Text-to-video with Stability AI animation

Text-to-video, with a variation of OpenAI CLIP

A video-to-text model with TimeSformer

Preparing the video frames

Putting the TimeSformer to work to make predictions on the video frames

Summary

Questions

References

Further reading

18. Hugging Face AutoTrain: Training Vision Models without Coding

Goal and scope of this chapter

Getting started

Uploading the dataset

No coding?

Training models with AutoTrain

Deploying a model

Running our models for inference

Retrieving validation images

Inference: image classification

Validation experimentation on the trained models

ViTForImageClassification

SwinForImageClassification 1

BeitForImageClassification

SwinForImageClassification 2

ConvNextForImageClassification

ResNetForImageClassification

Trying the top ViT model with a corpus

Summary

Questions

References

Further reading

19. On the Road to Functional AGI with HuggingGPT and its Peers

Defining F-AGI

Installing and importing

Validation set

Level 1 image: easy

Level 2 image: difficult

Level 3 image: very difficult

HuggingGPT

Level 1: Easy

Level 2: Difficult

Level 3: Very difficult

CustomGPT

Google Cloud Vision

Level 1: Easy

Level 2: Difficult

Level 3: Very difficult

Model chaining: Chaining Google Cloud Vision to ChatGPT

Model Chaining with Runway Gen-2

Midjourney: Imagine a ship in the galaxy

Gen-2: Make this ship sail the sea

Summary

Questions

References

Further Reading

20. Beyond Human-Designed Prompts with Generative Ideation

Part I: Defining generative ideation

Automated ideation architecture

Scope and limitations

Part II: Automating prompt design for generative image design

ChatGPT/GPT-4 HTML presentation

ChatGPT with GPT-4 provides the text for the presentation

ChatGPT with GPT-4 provides a graph in HTML to illustrate the presentation

Llama 2

A brief introduction to Llama 2

Implementing Llama 2 with Hugging Face

Midjourney

Discord API for Midjourney

Microsoft Designer

Part III: Automated generative ideation with Stable Diffusion

1. No prompt: Automated instruction for GPT-4

2. Generative AI (prompt generation) using ChatGPT with GPT-4

3. and 4. Generative AI with Stable Diffusion and displaying images

The future is yours!

The future of development through VR-AI

The groundbreaking shift: Parallelization of development through the fusion of VR and AI

Opportunities and risks

Summary

Questions

References

Further reading

Appendix: Answers to the Questions

Other Books You May Enjoy

Index

Preface

Transformer-driven Generative AI models are a game-changer for Natural Language Processing (NLP) and computer vision. Large Language Generative AI transformer models have achieved superhuman performance through services such as ChatGPT with GPT-4V for text, image, data science, and hundreds of domains. We have gone from primitive Generative AI to superhuman AI performance in just a few years!

Language understanding has become the pillar of language modeling, chatbots, personal assistants, question answering, text summarizing, speech-to-text, sentiment analysis, machine translation, and more. The expansion from the early Large Language Models (LLMs) to multimodal (text, image, sound) algorithms has taken AI into a new era.

For the past few years, we have been witnessing the expansion of social networks versus physical encounters, e-commerce versus physical shopping, digital newspapers, streaming versus physical theaters, remote doctor consultations versus physical visits, remote work instead of on-site tasks, and similar trends in hundreds more domains. This digital activity is now increasingly driven by transformer copilots in hundreds of applications.

The transformer architecture began just a few years ago as revolutionary and disruptive. It broke with the past, leaving the dominance of RNNs and CNNs behind. BERT and GPT models abandoned recurrent network layers and replaced them with self-attention. But in 2023, OpenAI GPT-4 propelled AI into new realms with GPT-4V (vision transformer), which is paving the path for functional (everyday tasks) AGI. Google Vertex AI offered similar technology. 2024 is not a new year in AI; it’s a new decade! Meta (formerly Facebook) has released Llama 2, which we can deploy seamlessly on Hugging Face.

Transformer encoders and decoders contain attention heads that train separately, parallelizing on cutting-edge hardware. Attention heads can run on separate GPUs, opening the door to billion-parameter models and soon-to-come trillion-parameter models.

The increasing amount of data requires training AI models at scale. As such, transformers pave the way to a new era of parameter-driven AI.

Learning to understand how hundreds of millions of words and images fit together requires a tremendous amount of parameters. Transformer models such as Google Vertex AI PaLM 2 and OpenAI GPT-4V have taken emergence to another level. Transformers can perform hundreds of NLP tasks they were not trained for.

Transformers can also learn image classification and reconstruction by embedding images as sequences of words. This book will introduce you to cutting-edge computer vision transformers such as Vision Transformers (ViTs), CLIP, GPT-4V, DALL-E 3, and Stable Diffusion.
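
A minimal sketch (not code from this book) of that idea, assuming a hypothetical 224×224 RGB input and ViT-style 16×16 patches: the image is cut into patches, each patch is flattened, and a learned linear projection turns it into an embedding that a transformer encoder can treat like a word embedding.

```python
import torch

image = torch.randn(1, 3, 224, 224)   # one RGB image: (batch, channels, height, width)
patch_size = 16                        # ViT-style 16x16 patches

# Cut the image into non-overlapping patches and flatten each patch into a vector.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)                   # torch.Size([1, 196, 768]): a sequence of 196 patch "tokens"

# A learned linear projection maps each flattened patch to an embedding vector,
# exactly where word embeddings would sit in a text transformer.
to_embedding = torch.nn.Linear(3 * patch_size * patch_size, 768)
tokens = to_embedding(patches)         # shape (1, 196, 768), ready for a transformer encoder
```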

Think of how many humans it would take to control the content of the billions of messages posted on social networks per day to decide if they are legal and ethical before extracting the information they contain.

Think of how many humans would be required to translate the millions of pages published each day on the web. Or imagine how many people it would take to manually control the millions of messages and images made per minute!
