Deep Learning Image Auto-Captioner with PyTorch
See our report!
- Our team included Takuro Kitasawa, Rye Gleason, Jeremy Nurding, and me.
- In this project, we developed and trained a generative AI model that takes an input image and auto-generates a grammatically correct English caption describing the image's content.
- After iterating to find a suitable custom PyTorch architecture, we eventually developed several versions of this model:
- The model architecture connects either a custom convolutional neural network or a pre-trained ResNet to an LSTM recurrent neural network (see the first sketch below).
- The custom convolutional neural network followed the AlexNet architecture (not shown in the flowchart above). I was the teammate responsible for implementing its architecture in PyTorch; since it was not pre-trained, the model variant that used it trained much more slowly. An AlexNet-style encoder is sketched below.
- Regardless of which encoder is used, its image features pass through an additional dense layer that projects them to a shape compatible with the LSTM's input.
- The convolutional neural network serves as the encoder, and the LSTM serves as the decoder.
- During training, the LSTM uses teacher forcing; at inference time, each auto-generated caption is capped at 20 words (see the final sketch below).
- Using UCSD's Datahub server, we trained the model for up to 6 hours at a time, repeating runs across several different configurations of parameters.
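
To make the encoder-decoder pairing concrete, here is a minimal sketch of how such a model might be wired up in PyTorch. The class names, the ResNet-50 backbone choice, and dimensions like `embed_size` are illustrative assumptions, not our exact project code:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Pre-trained ResNet encoder whose features feed the caption decoder."""
    def __init__(self, embed_size):
        super().__init__()
        # Older torchvision versions use models.resnet50(pretrained=True) instead.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop ResNet's classification head; keep the convolutional feature extractor.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        # Dense layer projecting the 2048-d ResNet features to the LSTM's input size.
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (batch, 2048)
        return self.fc(feats)                     # (batch, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder that consumes the image embedding as its first input step."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Teacher forcing: ground-truth caption tokens (minus the last) are the inputs.
        emb = self.embed(captions[:, :-1])                       # (batch, T-1, embed)
        inputs = torch.cat([features.unsqueeze(1), emb], dim=1)  # image first, then words
        out, _ = self.lstm(inputs)                               # (batch, T, hidden)
        return self.fc(out)                                      # (batch, T, vocab)
```

Feeding the projected image features in as the LSTM's first time step is one common way to condition the decoder; the dense layer in `EncoderCNN` is what brings the encoder output to a shape the LSTM can accept.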
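
The from-scratch encoder variant followed AlexNet's well-known layer layout. Below is a sketch of such an encoder trained without pre-trained weights; the class name and the final projection to `embed_size` are assumptions for illustration:

```python
import torch.nn as nn

class AlexNetEncoder(nn.Module):
    """AlexNet-style CNN encoder trained from scratch (no pre-trained weights)."""
    def __init__(self, embed_size):
        super().__init__()
        # The classic five-convolution AlexNet feature stack.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        # Dense layer projecting the flattened features to the caption embedding size.
        self.fc = nn.Linear(256 * 6 * 6, embed_size)

    def forward(self, images):
        x = self.avgpool(self.features(images)).flatten(1)  # (batch, 256*6*6)
        return self.fc(x)                                   # (batch, embed_size)
```

Because every weight here starts from random initialization, this variant has far more to learn than the frozen ResNet encoder, which is consistent with its much slower training.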
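
Finally, a sketch of how teacher-forced training and length-capped greedy decoding might look, building on the hypothetical classes above. The `criterion`, `optimizer`, and `end_token` handling are assumed details:

```python
import torch

MAX_LEN = 20  # generated captions are capped at 20 words

def train_step(encoder, decoder, images, captions, criterion, optimizer):
    """One teacher-forced step: the LSTM sees ground-truth tokens, not its own outputs."""
    optimizer.zero_grad()
    features = encoder(images)
    outputs = decoder(features, captions)  # (batch, T, vocab)
    loss = criterion(outputs.reshape(-1, outputs.size(-1)), captions.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate(encoder, decoder, image, end_token, max_len=MAX_LEN):
    """Greedy decoding at inference time, stopping at <end> or the word cap."""
    inputs = encoder(image.unsqueeze(0)).unsqueeze(1)  # (1, 1, embed_size)
    states, tokens = None, []
    for _ in range(max_len):
        out, states = decoder.lstm(inputs, states)
        word = decoder.fc(out.squeeze(1)).argmax(dim=1)  # most likely next word
        if word.item() == end_token:
            break
        tokens.append(word.item())
        inputs = decoder.embed(word).unsqueeze(1)  # feed the prediction back in
    return tokens
```

With teacher forcing the decoder always conditions on ground-truth words during training, which stabilizes learning; at inference it must feed back its own predictions, which is why the 20-word cap guards against run-on generations.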