Bangla language textual image description by hybrid neural network model

Md. Asifuzzaman Jishan, Khan Raqib Mahmud, Abul Kalam Al Azad, Mohammad Rifat Ahmmad Rashid, Bijan Paul, Md. Shahabub Alam


Automatic image captioning in different languages is a challenging task that has not been well investigated, owing to the lack of datasets and effective models. It also requires a good understanding of the scene and a contextual embedding for robust semantic interpretation of images when producing natural language image descriptors. To generate image descriptors in Bangla, we created a new dataset of images paired with target-language labels, named the Bangla Natural Language Image to Text (BNLIT) dataset. For image understanding, we propose a hybrid model based on the encoder-decoder architecture and evaluate it on our newly created dataset. The proposed approach achieves a significant performance improvement on the task of semantic retrieval of images. Our hybrid model uses a Convolutional Neural Network as the encoder, whereas a Bidirectional Long Short-Term Memory network is used for the sentence representation, which decreases computational complexity without trading off the exactness of the descriptor. The model yielded benchmark accuracy in recovering Bangla natural language descriptors, and we also conducted a thorough numerical analysis of its performance on the BNLIT dataset.
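The two components named in the abstract can be illustrated with a minimal pure-Python sketch. It is a toy under stated assumptions, not the authors' implementation. The single convolution layer, the layer sizes, the random untrained weights, the omission of bias terms, the example tokens, and the concatenation of image and sentence vectors are all hypothetical choices made for illustration. The abstract does not specify how the two representations are combined.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible


def rand_matrix(rows, cols):
    """Small random weights; a trained model would learn these."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv_encode(image, kernels):
    """Toy CNN encoder: one valid 2D convolution per kernel, ReLU, global average pool."""
    k = len(kernels[0])
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    features = []
    for kernel in kernels:
        total = 0.0
        for i in range(out_h):
            for j in range(out_w):
                act = sum(kernel[a][b] * image[i + a][j + b]
                          for a in range(k) for b in range(k))
                total += max(0.0, act)  # ReLU
        features.append(total / (out_h * out_w))
    return features


class LSTMCell:
    """LSTM cell with input, forget, and output gates plus a candidate state (no biases)."""

    def __init__(self, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One weight matrix per gate, applied to the concatenation [x, h].
        self.w = {g: rand_matrix(hidden_size, input_size + hidden_size) for g in "ifog"}

    def step(self, x, h, c):
        xh = x + h  # list concatenation
        i = [sigmoid(z) for z in matvec(self.w["i"], xh)]
        f = [sigmoid(z) for z in matvec(self.w["f"], xh)]
        o = [sigmoid(z) for z in matvec(self.w["o"], xh)]
        g = [math.tanh(z) for z in matvec(self.w["g"], xh)]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h, c

    def run(self, sequence):
        """Process the whole sequence and return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            h, c = self.step(x, h, c)
        return h


def bilstm_encode(embeddings, fwd, bwd):
    """Read the sentence in both directions and concatenate the two final states."""
    return fwd.run(embeddings) + bwd.run(list(reversed(embeddings)))


# Hypothetical inputs: an 8x8 grayscale "image" and a three-token Bangla caption.
image = [[random.random() for _ in range(8)] for _ in range(8)]
kernels = [rand_matrix(3, 3) for _ in range(4)]
tokens = ["একটি", "শিশু", "খেলছে"]
embedding = {t: [random.uniform(-1, 1) for _ in range(5)] for t in tokens}

fwd, bwd = LSTMCell(5, 6), LSTMCell(5, 6)
image_vec = conv_encode(image, kernels)
sentence_vec = bilstm_encode([embedding[t] for t in tokens], fwd, bwd)
joint_vec = image_vec + sentence_vec  # assumed merge step: plain concatenation

print(len(image_vec), len(sentence_vec), len(joint_vec))  # prints 4 12 16
```

The sentence vector is twice the hidden size because the forward and backward passes each contribute one final state. In a trained captioning model, the joint vector would feed a prediction layer over the Bangla vocabulary. That layer is left out here.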


Convolutional Neural Network; Hybrid Recurrent Neural Network; Long Short-Term Memory; Bi-directional Recurrent Neural Network; Bangla Natural Language Descriptors



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
