Add multimodal, updated style transfer demo + text
shwars committed May 18, 2022
1 parent fb111a1 commit 9b4ffba
Showing 15 changed files with 1,685 additions and 819 deletions.
25 changes: 11 additions & 14 deletions README.md
@@ -43,12 +43,12 @@ For a gentle introduction to *AI in the Cloud* topics you may consider taking th
<table>
<tr><th>No</th><th>Lesson</th><th>Intro</th><th>PyTorch</th><th>Keras/TensorFlow</th><th>Lab</th></tr>

<tr><td>I</td><td colspan="4"><b>Introduction to AI</b></td><td>PAT</td></tr>
<tr><td>I</td><td colspan="4"><b>Introduction to AI</b></td><td></td></tr>
<tr><td>1</td><td>Introduction and History of AI</td><td><a href="lessons/1-Intro/README.md">Text</a></td><td></td><td></td><td></td></tr>

<tr><td>II</td><td colspan="4"><b>Symbolic AI</b></td><td>PAT</td></tr>
<tr><td>II</td><td colspan="4"><b>Symbolic AI</b></td><td></td></tr>
<tr><td>2 </td><td>Knowledge Representation and Expert Systems</td><td><a href="lessons/2-Symbolic/README.md">Text</a></td><td colspan="2"><a href="lessons/2-Symbolic/Animals.ipynb">Expert System</a>, <a href="lessons/2-Symbolic/FamilyOntology.ipynb">Ontology</a>, <a href="lessons/2-Symbolic/MSConceptGraph.ipynb">Concept Graph</a></td><td></td></tr>
<tr><td>III</td><td colspan="4"><b><a href="lessons/3-NeuralNetworks/README.md">Introduction to Neural Networks</a></b></td><td>PAT</td></tr>
<tr><td>III</td><td colspan="4"><b><a href="lessons/3-NeuralNetworks/README.md">Introduction to Neural Networks</a></b></td><td></td></tr>
<tr><td>3</td><td>Perceptron</td>
<td><a href="lessons/3-NeuralNetworks/03-Perceptron/README.md">Text</a>
<td colspan="2"><a href="lessons/3-NeuralNetworks/03-Perceptron/Perceptron.ipynb">Notebook</a></td><td><a href="lessons/3-NeuralNetworks/03-Perceptron/lab/README.md">Lab</a></td></tr>
@@ -62,18 +62,18 @@ For a gentle introduction to *AI in the Cloud* topics you may consider taking th
<tr><td>IV</td><td colspan="2"><b><a href="lessons/4-ComputerVision/README.md">Computer Vision</a></b></td>
<td><a href="https://docs.microsoft.com/learn/modules/intro-computer-vision-pytorch/?WT.mc_id=academic-57639-dmitryso">MS Learn</a></td>
<td><a href="https://docs.microsoft.com/learn/modules/intro-computer-vision-TensorFlow/?WT.mc_id=academic-57639-dmitryso">MS Learn</a></td>
<td>PAT</td></tr>
<td></td></tr>
<tr><td>6</td><td>Intro to Computer Vision. OpenCV</td><td><a href="lessons/4-ComputerVision/06-IntroCV/README.md">Text</a><td colspan="2"><a href="lessons/4-ComputerVision/06-IntroCV/OpenCV.ipynb">Notebook</a></td><td><a href="lessons/4-ComputerVision/06-IntroCV/lab/README.md">Lab</a></td></tr>
<tr><td>7</td><td>Convolutional Neural Networks<br/>CNN Architectures</td><td><a href="lessons/4-ComputerVision/07-ConvNets/README.md">Text</a><br/><a href="lessons/4-ComputerVision/07-ConvNets/CNN_Architectures.md">Text</a></td><td><a href="lessons/4-ComputerVision/07-ConvNets/ConvNetsPyTorch.ipynb">PyTorch</a></td><td><a href="lessons/4-ComputerVision/07-ConvNets/ConvNetsTF.ipynb">TensorFlow</a></td><td><a href="lessons/4-ComputerVision/07-ConvNets/lab/README.md">Lab</a></td></tr>
<tr><td>8</td><td>Pre-trained Networks and Transfer Learning<br/>Training Tricks</td><td><a href="lessons/4-ComputerVision/08-TransferLearning/README.md">Text</a><br/><a href="lessons/4-ComputerVision/08-TransferLearning/TrainingTricks.md">Text</a></td><td><a href="lessons/4-ComputerVision/08-TransferLearning/TransferLearningPyTorch.ipynb">PyTorch</a></td><td><a href="lessons/4-ComputerVision/08-TransferLearning/TransferLearningTF.ipynb">TensorFlow</a><br/><a href="lessons/4-ComputerVision/08-TransferLearning/Dropout.ipynb">Dropout sample</a></td><td><a href="lessons/4-ComputerVision/08-TransferLearning/lab/README.md">Lab</a></td></tr>
<tr><td>9</td><td>Autoencoders and VAEs</td><td><a href="lessons/4-ComputerVision/09-Autoencoders/README.md">Text</a></td><td><a href="lessons/4-ComputerVision/09-Autoencoders/AutoEncodersPytorch.ipynb">PyTorch</td><td><a href="lessons/4-ComputerVision/09-Autoencoders/AutoencodersTF.ipynb">TensorFlow</a></td><td></td></tr>
<tr><td>10</td><td>Generative Adversarial Networks</td><td><a href="lessons/4-ComputerVision/10-GANs/README.md">Text</a></td><td><a href="lessons/4-ComputerVision/10-GANs/GANPyTorch.ipynb">PyTorch</td><td><a href="lessons/4-ComputerVision/10-GANs/GANTF.ipynb">TensorFlow</a></td><td></td></tr>
<tr><td>10</td><td>Generative Adversarial Networks<br/>Artistic Style Transfer</td><td><a href="lessons/4-ComputerVision/10-GANs/README.md">Text</a></td><td><a href="lessons/4-ComputerVision/10-GANs/GANPyTorch.ipynb">PyTorch</td><td><a href="lessons/4-ComputerVision/10-GANs/GANTF.ipynb">TensorFlow GAN</a><br/><a href="lessons/4-ComputerVision/10-GANs/StyleTransfer.ipynb">Style Transfer</a></td><td></td></tr>
<tr><td>11</td><td>Object Detection</td><td>Text</td><td>PyTorch</td><td>TensorFlow</td><td></td></tr>
<tr><td>12</td><td>Semantic Segmentation. U-Net</td><td><a href="lessons/4-ComputerVision/12-Segmentation/README.md">Text</a></td><td><a href="lessons/4-ComputerVision/12-Segmentation/SemanticSegmentationPytorch.ipynb">PyTorch</td><td><a href="lessons/4-ComputerVision/12-Segmentation/SemanticSegmentationTF.ipynb">TensorFlow</td><td></td></tr>
<tr><td>V</td><td colspan="2"><b><a href="lessons/5-NLP/README.md">Natural Language Processing</a></b></td>
<td><a href="https://docs.microsoft.com/learn/modules/intro-natural-language-processing-pytorch/?WT.mc_id=academic-57639-dmitryso">MS Learn</a></td>
<td><a href="https://docs.microsoft.com/learn/modules/intro-natural-language-processing-TensorFlow/?WT.mc_id=academic-57639-dmitryso">MS Learn</a></td>
<td>PAT</td></tr>
<td></td></tr>
<tr><td>13</td><td>Text Representation. Bow/TF-IDF</td><td><a href="lessons/5-NLP/13-TextRep/README.md">Text</a></td><td><a href="lessons/5-NLP/13-TextRep/TextRepresentationPyTorch.ipynb">PyTorch</a></td><td><a href="lessons/5-NLP/13-TextRep/TextRepresentationTF.ipynb">TensorFlow</td><td></td></tr>
<tr><td>14</td><td>Semantic word embeddings. Word2Vec and GloVe</td><td><a href="lessons/5-NLP/14-Embeddings/README.md">Text</td><td><a href="lessons/5-NLP/14-Embeddings/EmbeddingsPyTorch.ipynb">PyTorch</a></td><td><a href="lessons/5-NLP/14-Embeddings/EmbeddingsTF.ipynb">TensorFlow</a></td><td></td></tr>
<tr><td>15</td><td>Language Modeling. Training your own embeddings</td><td><a href="lessons/5-NLP/15-LanguageModeling">Text</a></td><td>PyTorch</td><td>TensorFlow</td><td></td></tr>
@@ -82,14 +82,14 @@ For a gentle introduction to *AI in the Cloud* topics you may consider taking th
<tr><td>18</td><td>Transformers. BERT.</td><td><a href="lessons/5-NLP/18-Transformers/README.md">Text</a></td><td><a href="lessons/5-NLP/18-Transformers/TransformersPyTorch.md">PyTorch</a></td><td><a href="lessons/5-NLP/18-Transformers/TransformersTF.md">TensorFlow</a></td><td></td></tr>
<tr><td>19</td><td>Named Entity Recognition</td><td>Text</td><td>PyTorch</td><td>TensorFlow</td><td></td></tr>
<tr><td>20</td><td>Large Language Models, Prompt Programming and Few-Shot Tasks</td><td>Text</td><td>PyTorch</td><td>TensorFlow</td><td></td></tr>
<tr><td>VI</td><td colspan="4"><b>Other AI Techniques</b></td><td>PAT</td></tr>
<tr><td>VI</td><td colspan="4"><b>Other AI Techniques</b></td><td></td></tr>
<tr><td>21</td><td>Genetic Algorithms</td><td><a href="lessons/6-Other/21-GeneticAlgorithms/README.md">Text</a><td colspan="2"><a href="lessons/6-Other/21-GeneticAlgorithms/Genetic.ipynb">Notebook</a></td><td></td></tr>
<tr><td>22</td><td>Deep Reinforcement Learning</td><td><a href="lessons/6-Other/22-DeepRL/README.md">Text</a></td><td>PyTorch</td><td>TensorFlow</td><td></td></tr>
<tr><td>23</td><td>Multi-Agent Systems</td><td><a href="lessons/6-Other/23-MultiagentSystems/README.md">Text</a></td><td></td><td></td><td></td></tr>
<tr><td>VII</td><td colspan="4"><b>AI Ethics</b></td><td>PAT</td></tr>
<tr><td>VII</td><td colspan="4"><b>AI Ethics</b></td><td></td></tr>
<tr><td>24</td><td>AI Ethics and Responsible AI</td><td><a href="lessons/7-Ethics/README.md">Text</a></td><td></td><td></td><td></td></tr>
<tr><td></td><td colspan="4"><b>Extras</b></td><td></td></tr>
<tr><td>X1</td><td>Multi-Modal Networks, CLIP and VQGAN</td><td><a href="lessons/X-Extras/X1-MultiModal/README.md">Text</a></td><td></td><td></td><td></td></tr>
<tr><td>X1</td><td>Multi-Modal Networks, CLIP and VQGAN</td><td><a href="lessons/X-Extras/X1-MultiModal/README.md">Text</a></td><td colspan="2"><a href="lessons/X-Extras/X1-MultiModal/Clip.ipynb">Notebook</a></td><td></td></tr>
</table>

**[Mindmap of the Course](http://soshnikov.com/courses/ai-for-beginners/mindmap.html)**
@@ -98,9 +98,6 @@ Each lesson contains some pre-reading material (linked as **Text** above), and s

Some sections also contain links to **MS Learn** modules that cover related topics. Microsoft Learn provides a convenient GPU-enabled learning environment, although in terms of content you can expect this curriculum to go a bit deeper.

Course sections also include the links to **PAT**s - Progress Assessment Tool, a list of items that you are likely to get to know after completing the module. You can review it and assess your progress on the course yourself.


# Getting Started

**Students**, there are a couple of ways to use the curriculum. First of all, you can just read the text and look through the code directly on GitHub. If you want to run the code in any of the notebooks - you can find the advice on how to do it [in this blog post](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
@@ -113,7 +110,7 @@ However, if you would like to take the course as a self-study project, we sugges
- Notebooks often contain some of the challenges that require you to tweak the code a little bit to experiment
- Take the post-lecture quiz
- If there is a lab attached to the module - complete the assignment
- Visit the [Discussion board](https://github.com/microsoft/AI-For-Beginners/discussions) to and "learn out loud" by filling out the appropriate PAT rubric. A 'PAT' is a Progress Assessment Tool that is a rubric you fill out to further your learning. You can also react to other PATs so we can learn together
- Visit the [Discussion board](https://github.com/microsoft/AI-For-Beginners/discussions) to "learn out loud".

> For further study, we recommend following these [Microsoft Learn](https://docs.microsoft.com/users/jenlooper-2911/collections/k7o7tg1gp306q4?WT.mc_id=academic-15963-cxa) modules and learning paths.
Expand All @@ -123,7 +120,7 @@ However, if you would like to take the course as a self-study project, we sugges

## Credits

**✍️ Hearty thanks to our authors** [Dmitry Soshnikov](http://soshnikov.com), [Evgenii Pishchik](https://github.com/Pe4enIks), with editors [Jen Looper](https://twitter.com/jenlooper) and [Lateefah Bello](https://twitter.com/CinnamonXI)
**✍️ Hearty thanks to our authors** [Dmitry Soshnikov](http://soshnikov.com), [Evgenii Pishchik](https://github.com/Pe4enIks), with editors [Jen Looper](https://twitter.com/jenlooper) and [Lateefah Bello](https://github.com/CinnamonXI)

**🎨 Thanks as well to our sketchnote illustrator:** [Tomomi Imura](https://twitter.com/girlie_mac)

8 changes: 4 additions & 4 deletions lessons/4-ComputerVision/06-IntroCV/OpenCV.ipynb
@@ -17,7 +17,7 @@
},
{
"cell_type": "code",
"execution_count": 162,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
@@ -51,7 +51,7 @@
},
{
"cell_type": "code",
"execution_count": 97,
"execution_count": 2,
"metadata": {},
"outputs": [
{
@@ -64,10 +64,10 @@
{
"data": {
"text/plain": [
"<matplotlib.image.AxesImage at 0x1de7642e7c0>"
"<matplotlib.image.AxesImage at 0x19a755cc640>"
]
},
"execution_count": 97,
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
},

Large diffs are not rendered by default.

@@ -44,8 +44,8 @@
},
"outputs": [],
"source": [
"if not os.path.exists('data/kagglecatsanddogs_3367a.zip'):\n",
" !wget -P data -q https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip"
"if not os.path.exists('data/kagglecatsanddogs_5340.zip'):\n",
" !wget -P data https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_5340.zip"
]
},
{
@@ -56,7 +56,7 @@
"source": [
"import zipfile\n",
"if not os.path.exists('data/PetImages'):\n",
" with zipfile.ZipFile('data/kagglecatsanddogs_3367a.zip', 'r') as zip_ref:\n",
" with zipfile.ZipFile('data/kagglecatsanddogs_5340.zip', 'r') as zip_ref:\n",
" zip_ref.extractall('data')"
]
},
14 changes: 14 additions & 0 deletions lessons/4-ComputerVision/10-GANs/README.md
@@ -61,6 +61,20 @@ GANs are known to be especially difficult to train. Here are a few problems:
* Keeping a **balance** between the generator and the discriminator. In many cases discriminator loss can drop to zero relatively quickly, which results in the generator being unable to train further. To overcome this, we can try setting different learning rates for the generator and discriminator, or skip discriminator training if the loss is already too low.
* Training for **high resolution**. As with autoencoders, this problem arises because reconstructing too many layers of a convolutional network leads to artifacts. This problem is typically solved with so-called **progressive growing**, where first a few layers are trained on low-res images, and then layers are "unblocked" or added. Another solution would be adding extra connections between layers and training several resolutions at once - see this [Multi-Scale Gradient GANs paper](https://arxiv.org/abs/1903.06048) for details.

## Style Transfer

GANs are a great way to generate artistic images. Another interesting technique is so-called **style transfer**, which takes one **content image** and re-draws it in a different style, applying the artistic style of a **style image**.

The way it works is the following:
* We start with a random noise image (or with a content image, but for the sake of understanding it is easier to start from random noise)
* Our goal is to create an image that is close to both the content image and the style image. Closeness is determined by two loss functions:
   - **Content loss** is computed based on the features extracted by a CNN at certain layers from the current image and the content image
   - **Style loss** is computed between the current image and the style image in a clever way using Gram matrices (more details in the [example notebook](StyleTransfer.ipynb))
* To make the image smoother and remove noise, we also introduce **variation loss**, which computes the average distance between neighboring pixels
* The main optimization loop adjusts the current image using gradient descent (or some other optimization algorithm) to minimize the total loss, which is a weighted sum of all three losses (a minimal code sketch follows below)
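
Below is a minimal sketch of these losses in PyTorch, assuming a pre-trained VGG-19 as the feature extractor. The layer indices and loss weights are illustrative choices, not necessarily the values used in the example notebook, and input images are assumed to be ImageNet-normalized tensors of shape `[1, 3, H, W]`.

```python
import torch
import torch.nn.functional as F
import torchvision

# Pre-trained VGG-19 as a fixed feature extractor (weights frozen)
vgg = torchvision.models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

content_layers = {21}               # illustrative choice: conv4_2
style_layers = {0, 5, 10, 19, 28}   # illustrative choices: conv1_1 ... conv5_1

def extract_features(img):
    """Run img through VGG and collect activations at the chosen layers."""
    feats, x = {}, img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in content_layers or i in style_layers:
            feats[i] = x
    return feats

def gram_matrix(f):
    # f: [batch, channels, height, width] -> per-channel correlation matrix
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(img, content_feats, style_grams,
               w_content=1.0, w_style=1e4, w_tv=1e-4):
    feats = extract_features(img)
    content = sum(F.mse_loss(feats[i], content_feats[i]) for i in content_layers)
    style = sum(F.mse_loss(gram_matrix(feats[i]), style_grams[i]) for i in style_layers)
    # Variation loss: average difference between neighboring pixels
    tv = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean() + \
         (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()
    return w_content * content + w_style * style + w_tv * tv

def run_style_transfer(content_img, style_img, steps=300):
    content_feats = extract_features(content_img)
    style_grams = {i: gram_matrix(f)
                   for i, f in extract_features(style_img).items() if i in style_layers}
    img = torch.rand_like(content_img, requires_grad=True)   # start from random noise
    opt = torch.optim.Adam([img], lr=0.02)
    for _ in range(steps):
        opt.zero_grad()
        total_loss(img, content_feats, style_grams).backward()
        opt.step()
    return img.detach()
```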

## ✍️ Example: [Style Transfer](StyleTransfer.ipynb)

## [Post-lecture quiz](https://black-ground-0cc93280f.1.azurestaticapps.net/quiz/210)

## Conclusion
1,375 changes: 583 additions & 792 deletions lessons/4-ComputerVision/10-GANs/StyleTransfer.ipynb

Large diffs are not rendered by default.

798 changes: 798 additions & 0 deletions lessons/4-ComputerVision/10-GANs/StyleTransfer_Keras.ipynb

Large diffs are not rendered by default.

209 changes: 209 additions & 0 deletions lessons/X-Extras/X1-MultiModal/Clip.ipynb

Large diffs are not rendered by default.

57 changes: 57 additions & 0 deletions lessons/X-Extras/X1-MultiModal/README.md
@@ -3,3 +3,60 @@
After the success of transformer models for solving NLP tasks, there were many attempts to apply the same or similar architectures to computer vision tasks. Also, there is a growing interest in building models that *combine* vision and natural language capabilities. One such attempt, made by OpenAI, is called CLIP.

## Contrastive Language-Image Pre-Training (CLIP)

The main idea of CLIP is to compare a text prompt with an image and estimate how well the image corresponds to the prompt.

![CLIP Architecture](images/clip-arch.png)

> *Picture from [this blog post](https://openai.com/blog/clip/)*

The model is trained on images obtained from the Internet and their captions. In each batch, we take N (image, text) pairs and convert them into vector representations I<sub>1</sub>, ..., I<sub>N</sub> and T<sub>1</sub>, ..., T<sub>N</sub>, which are then matched against each other. The loss function is defined to maximize the cosine similarity between the vectors corresponding to the same pair (e.g. I<sub>i</sub> and T<sub>i</sub>), and to minimize the cosine similarity between all other pairs. That is the reason this approach is called **contrastive**.
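
As a rough sketch, this objective can be written as a symmetric cross-entropy over the matrix of pairwise cosine similarities. The snippet below is in PyTorch; the temperature value is an illustrative assumption, and `image_emb` / `text_emb` stand for the outputs of any image and text encoders of matching dimension.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: [N, d] embeddings for the N (image, text) pairs in a batch
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Cosine similarity between every image and every text in the batch
    logits = image_emb @ text_emb.t() / temperature             # [N, N]
    targets = torch.arange(len(logits), device=logits.device)   # matching pairs lie on the diagonal
    # Symmetric cross-entropy: pull matching pairs together, push all other pairs apart
    loss_i = F.cross_entropy(logits, targets)       # each image should pick its own caption
    loss_t = F.cross_entropy(logits.t(), targets)   # each caption should pick its own image
    return (loss_i + loss_t) / 2
```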

The CLIP model and library are available from the [OpenAI GitHub](https://github.com/openai/CLIP) repository. The approach is described in [this blog post](https://openai.com/blog/clip/), and in more detail in [this paper](https://arxiv.org/pdf/2103.00020.pdf).

Once this model is pre-trained, we can give it a batch of images and a batch of text prompts, and it will return a tensor of probabilities. CLIP can be used for several tasks:

**Image Classification**

Suppose we need to classify images between, say, cats, dogs and humans. In this case, we can give the model an image and a series of text prompts: "*a picture of a cat*", "*a picture of a dog*", "*a picture of a human*". In the resulting vector of 3 probabilities, we just need to select the index with the highest value.

![CLIP for Image Classification](images/clip-class.png)

> *Picture from [this blog post](https://openai.com/blog/clip/)*

**Text-Based Image Search**

We can also do the opposite: given a collection of images and a text prompt, we can pass both to the model and find the image that is most similar to the given prompt.
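
Below is a short sketch of both uses with the `clip` package from the OpenAI repository mentioned above (installable with `pip install git+https://github.com/openai/CLIP.git`); the image path and prompt list are placeholder assumptions.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Zero-shot classification: score one image against a list of textual classes
image = preprocess(Image.open("my_picture.jpg")).unsqueeze(0).to(device)
prompts = ["a picture of a cat", "a picture of a dog", "a picture of a human"]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1)     # [1, 3] class probabilities

print(prompts[probs.argmax().item()])

# Text-based image search is the same computation transposed: encode a batch of
# images and a single prompt, then pick the image with the highest probability.
```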

## ✍️ Example: [Using CLIP for Image Classification and Image Search](Clip.ipynb)

Open the [Clip.ipynb](Clip.ipynb) notebook to see CLIP in action.

## Image Generation with VQGAN + CLIP

CLIP can also be used for **image generation** from a text prompt. To do this, we need a **generator model** that can generate images based on some vector input. One such model is called [VQGAN](https://compvis.github.io/taming-transformers/) (Vector-Quantized GAN).

The main ideas of VQGAN that differentiate it from an ordinary [GAN](../../4-ComputerVision/10-GANs/README.md) are the following:
* Using an autoregressive transformer architecture to generate a sequence of context-rich visual parts that compose the image. Those visual parts are in turn learned by a [CNN](../../4-ComputerVision/07-ConvNets/README.md)
* Using a sub-image discriminator that detects whether parts of the image are "real" or "fake" (unlike the "all-or-nothing" approach of a traditional GAN).

Learn more about VQGAN on the [Taming Transformers](https://compvis.github.io/taming-transformers/) web site.

One important difference between VQGAN and a traditional GAN is that the latter can produce a decent image from any input vector, while VQGAN, given an arbitrary input, is likely to produce an image that is not coherent. Thus, we need to further guide the image creation process, which can be done using CLIP.

![VQGAN+CLIP Architecture](images/vqgan.png)

To generate an image corresponding to a text prompt, we start with some random encoding vector that is passed through VQGAN to produce an image. Then CLIP is used to build a loss function that measures how well the image corresponds to the text prompt. The goal is then to minimize this loss, using backpropagation to adjust the parameters of the input vector.
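
The loop below is a toy sketch of that process: the real VQGAN decoder is replaced by a small stand-in network (so the generated image is meaningless), but the structure (decode a latent, score the image with CLIP, backpropagate into the latent) mirrors what VQGAN+CLIP implementations do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()    # keep everything in float32 for this toy loop

prompt = clip.tokenize(["a watercolor portrait of a teacher"]).to(device)
with torch.no_grad():
    text_emb = F.normalize(clip_model.encode_text(prompt), dim=-1)

# Stand-in for the VQGAN decoder: maps a latent tensor to a 224x224 RGB image
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
).to(device)

z = torch.randn(1, 64, 56, 56, device=device, requires_grad=True)   # latent being optimized
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    image = decoder(z)                                         # [1, 3, 224, 224] in [0, 1]
    image_emb = F.normalize(clip_model.encode_image(image), dim=-1)
    loss = -(image_emb * text_emb).sum()                       # negative cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()
```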

A great library that implements VQGAN+CLIP is [Pixray](http://github.com/pixray/pixray).

![Picture produced by Pixray](a_closeup_watercolor_portrait_of_young_male_teacher_of_literature_with_a_book.png) | ![Picture produced by pixray](a_closeup_oil_portrait_of_young_female_teacher_of_computer_science_with_a_computer.png)
----|----
Picture generated from prompt *a closeup watercolor portrait of young male teacher of literature with a book* | Picture generated from prompt *a closeup oil portrait of young female teacher of computer science with a computer*

> Pictures from the **Artificial Teachers** collection by [Dmitry Soshnikov](http://soshnikov.com)

## References

* VQGAN Paper: [Taming Transformers for High-Resolution Image Synthesis](https://compvis.github.io/taming-transformers/paper/paper.pdf)
* CLIP Paper: [Learning Transferable Visual Models From Natural Language Supervision](https://arxiv.org/pdf/2103.00020.pdf)
Binary file added lessons/X-Extras/X1-MultiModal/images/vqgan.png

0 comments on commit 9b4ffba
