Position, Padding and Predictions: A Deeper Look at Position Information in CNNs

International Journal of Computer Vision (2024)

Abstract

In contrast to fully connected networks, Convolutional Neural Networks (CNNs) achieve efficiency by learning weights associated with local filters with a finite spatial extent. Theoretically, an implication of this fact is that a filter may know what it is looking at, but not where it is positioned in the image. In this paper, we first test this hypothesis and reveal that a surprising degree of absolute position information is encoded in commonly used CNNs. We show that zero padding drives CNNs to encode position information in their internal representations, while a lack of padding precludes position encoding. This observation gives rise to deeper questions about the role of position information in CNNs: (i) What boundary heuristics enable optimal position encoding for downstream tasks? (ii) Does position encoding affect the learning of semantic representations? (iii) Does position encoding always improve performance? To provide answers, we perform the largest case study to date on the role that padding and border heuristics play in CNNs. We design novel tasks that allow us to quantify boundary effects as a function of the distance to the border. Numerous semantic objectives reveal the effect of the border on semantic representations. Finally, we demonstrate the implications of these findings on multiple real-world tasks to show that position information can either help or hurt performance.

Notes

  1. We use the term gradient to denote pixel intensities, not the gradient used in backpropagation.

References

  • Alrasheedi, F., Zhong, X., & Huang, P.-C. (2023). Padding module: Learning the padding in deep neural networks. IEEE Access, 11, 7348–7357.

  • Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the pad-CNNs can develop blind spots. In International conference on learning representations.

  • Bai, Q., Xu, Y., Zhu, J., Xia, W., Yang, Y., & Shen, Y. (2022). High-fidelity GAN inversion with padding space. In European Conference on Computer Vision (pp. 36–53).

  • Brendel, W., & Bethge, M. (2019). Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. arXiv:1904.00760.

  • Brock, A., Donahue, J., & Simonyan, K. (2018). Large scale GAN training for high fidelity natural image synthesis. In International conference on learning representations.

  • Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking ATROUS convolution for semantic image segmentation. arXiv:1706.05587.

  • Choi, J., Gao, C., Messou, J. C., & Huang, J.-B. (2019). Why can’t I dance in the mall? Learning to mitigate scene bias in action recognition. In Advances in neural information processing systems, vol. 32.

  • Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3606–3613).

  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., & Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3213–3223).

  • Demir, U., & Unal, G. (2018). Patch-based image inpainting with generative adversarial networks. arXiv:1803.07422.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition (vol. 2009, pp. 248–255).

  • Denton, E. L., et al. (2017). Unsupervised learning of disentangled representations from video. In Advances in neural information processing systems, vol. 30.

  • DeVries, T., & Taylor, G.W. (2017). Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552.

  • Esser, P., Rombach, R., & Ommer, B. (2020). A disentangling invertible interpretation network for explaining latent representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9223–9232).

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes challenge 2010 (VOC2010) results. http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html.

  • Garcia-Gasulla, D., Gimenez-Abalos, V., & Martin-Torres, P. (2023). Padding aware neurons. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 99–108).

  • Gavrikov, P., & Keuper, J. (2023). On the interplay of convolutional padding and adversarial robustness. In Proceedings of the IEEE/CVF international conference on computer vision (vol. 2023, pp. 3981–3990).

  • Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., & Brendel, W. (2018). Imagenet-trained CNNs are biased towards texture: Increasing shape bias improves accuracy and robustness. In International conference on learning representations.

  • Ghorbani, A., Wexler, J., Zou, J. Y., & Kim, B. (2019). Towards automatic concept-based explanations. In Advances in neural information processing systems, vol. 32.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).

  • Gregor, K., Danihelka, I., Graves, A., Rezende, D., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. In International conference on machine learning. PMLR (pp. 1462–1471).

  • Han, Z., Liu, B., Lin, S.-B., & Zhou, D.-X. (2023). Deep convolutional neural networks with zero-padding: Feature extraction and learning. arXiv:2307.16203.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

  • Huang, Z., Heng, W., & Zhou, S. (2019). Learning to paint with model-based deep reinforcement learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 8709–8718).

  • Innamorati, C., Ritschel, T., Weyrich, T., & Mitra, N. J. (2020). Learning on the edge: Investigating boundary filters in CNNs. International Journal of Computer Vision, 128(4), 773–782.

  • Islam, M. A., Jia, S., & Bruce, N. D. (2020). How much position information do convolutional neural networks encode? In International conference on learning representations.

  • Islam, M.A., Kowal, M., Jia, S., Derpanis, K.G., & Bruce, N.D. (2021). Global pooling, more than meets the eye: Position information is encoded channel-wise in CNNs. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 793–801).

  • Jia, S., & Bruce, N. D. (2018). Eml-net: An expandable multi-layer network for saliency prediction. arXiv:1805.01047.

  • Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1725–1732).

  • Kayhan, O. S., & van Gemert, J. C. (2020). On translation invariance in CNNs: Convolutional layers can exploit absolute spatial location. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14274–14285).

  • Krizhevsky, A., Nair, V., & Hinton, G. (2014). The CIFAR-10 dataset. http://www.cs.toronto.edu/kriz/cifar.html.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, vol. 25.

  • Le, Y., & Yang, X. (2015). Tiny ImageNet visual recognition challenge. In CS 231N.

  • Li, Y., Hou, X., Koch, C., Rehg, J. M., & Yuille, A. L. (2014). The secrets of salient object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 280–287).

  • Liu, N., & Han, J. (2016). DHSNet: Deep hierarchical saliency network for salient object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 678–686).

  • Liu, N., Han, J., & Yang, M.-H. (2018). Picanet: Learning pixel-wise contextual attention for saliency detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3089–3098).

  • Liu, R., Lehman, J., Molino, P., Petroski Such, F., Frank, E., Sergeev, A., & Yosinski, J. (2018). An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in neural information processing systems, vol. 31.

  • Liu, G., Shih, K. J., Wang, T.-C., Reda, F. A., Sapra, K., Yu, Z., Tao, A., & Catanzaro, B. (2018). Partial convolution based padding. arXiv:1811.11718.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

  • Lorenz, D., Bereska, L., Milbich, T., & Ommer, B. (2019). Unsupervised part-based disentangling of object shape and appearance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10955–10964).

  • Mopuri, K. R., Ganeshan, A., & Babu, R. V. (2018). Generalizable data-free objective for crafting universal adversarial perturbations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(10), 2452–2465.

  • Mukai, K., & Yamanaka, T. (2023). Improving translation invariance in convolutional neural networks with peripheral prediction padding. In IEEE International Conference on Image Processing (ICIP) (vol. 2023, pp. 945–949).

  • Murase, R., Suganuma, M., & Okatani, T. (2020). How can CNNs use image position for segmentation? arXiv:2005.03463.

  • Noh, H., Hong, S., & Han, B. (2015). Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision (pp. 1520–1528).

  • Pérez, J., Marinković, J., & Barceló, P. (2019). On the turing completeness of modern neural network architectures. In International conference on learning representations.

  • Petsiuk, V., Das, A., & Saenko, K. (2018). Rise: Randomized input sampling for explanation of black-box models. arXiv:1806.07421.

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, vol. 28.

  • Rösch, P. J., & Libovickỳ, J. (2023). Probing the role of positional information in vision-language models. arXiv:2305.10046.

  • Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules. In Advances in neural information processing systems, vol. 30.

  • Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision (pp. 618–626).

  • Shi, J., Yan, Q., Xu, L., & Jia, J. (2015). Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4), 717–729.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations.

  • Sirovich, L., Brodie, S. E., & Knight, B. (1979). Effect of boundaries on the response of a neural network. Biophysical Journal, 28, 424.

  • Tang, M., Zheng, L., Yu, B., & Wang, J. (2018). High speed kernelized correlation filters without boundary effect. arXiv:1806.06406.

  • Tsotsos, J. K., Culhane, S. M., Wai, W. Y. K., Lai, Y., Davis, N., & Nuflo, F. (1995). Modeling visual attention via selective tuning. Artificial Intelligence, 78(1–2), 507–545.

  • Visin, F., Kastner, K., Cho, K., Matteucci, M., Courville, A., & Bengio, Y. (2015). A recurrent neural network based alternative to convolutional networks. arXiv:1505.00393.

  • Wang, Z., & Veksler, O. (2018). Location augmentation for CNN. arXiv:1807.07044.

  • Wang, L., Lu, H., Wang, Y., Feng, M., Wang, D., Yin, B., & Ruan, X. (2017). Learning to detect salient objects with image-level supervision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 136–145).

  • Wohlberg, B., & Rodriguez, P. (2017). Convolutional sparse coding: Boundary handling revisited. arXiv:1707.06718.

  • Xu, R., Wang, X., Chen, K., Zhou, B., & Loy, C.C. (2020). Positional encoding as spatial inductive bias in GANs. arXiv:2012.05217.

  • Xue, J., Zhang, H., & Dana, K. (2018). Deep texture manifold for ground terrain recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 558–567).

  • Yang, C., Zhang, L., Lu, H., Ruan, X., & Yang, M.-H. (2013) Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3166–3173).

  • Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., & Huang, T. S. (2018). Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5505–5514).

  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 818–833).

  • Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In 2011 international conference on computer vision (pp. 2018–2025).

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv:1611.03530.

  • Zhang, P., Wang, D., Lu, H., Wang, H., & Ruan, X. (2017). Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE international conference on computer vision (pp. 202–211).

  • Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2921–2929).

Author information

Correspondence to Md Amirul Islam.

Additional information

Communicated by Takayuki Okatani.

Appendix

1.1 Experimental Details of Absolute Position Encoding Experiments

Datasets: We use the DUT-S dataset (Wang et al., 2017) as our training set, which contains 10,533 training images. Following the common training protocol used in Zhang et al. (2017) and Liu et al. (2018), we train the model on the training set of DUT-S and evaluate the existence of position information on the natural images of the PASCAL-S (Li et al., 2014) dataset. The synthetic images (white, black and Gaussian noise) are also used as described in Sect. 3.4 of the main manuscript. Note that we follow the common setting used in saliency detection only to ensure that there is no overlap between the training and test sets. However, any images can be used in our experiments, given that the position information is relatively content-independent.

Evaluation Metrics: As position encoding measurement is a new direction, there is no universal metric. We use two natural choices, Spearman Correlation (SPC) and Mean Absolute Error (MAE), to measure position encoding performance. The SPC is defined as Spearman's correlation between the ground-truth and the predicted position map. For ease of interpretation, we keep the SPC score within the range \([-1, 1]\). MAE is the average pixel-wise difference between the predicted position map and the ground-truth gradient position map.
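For concreteness, the sketch below shows how the two metrics can be computed with NumPy/SciPy. It is only an illustrative sketch: the function name and the toy gradient-map example are ours and not taken from the released code.

```python
import numpy as np
from scipy.stats import spearmanr


def position_metrics(pred, gt):
    """Return (SPC, MAE) for two H x W position maps with values in [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    spc, _ = spearmanr(pred, gt)    # Spearman's rank correlation, bounded in [-1, 1]
    mae = np.abs(pred - gt).mean()  # average pixel-wise absolute difference
    return spc, mae


# Toy example: a horizontal ground-truth gradient map and a noisy prediction.
h, w = 28, 28
gt_map = np.tile(np.linspace(0.0, 1.0, w), (h, 1))
pred_map = np.clip(gt_map + 0.05 * np.random.randn(h, w), 0.0, 1.0)
print(position_metrics(pred_map, gt_map))
```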

Implementation Details: We initialize the architecture with a network pretrained for the ImageNet classification task. The new layers in the position encoding branch are initialized with Xavier initialization (Glorot and Bengio, 2010). We train the networks using stochastic gradient descent for 15 epochs with a momentum of 0.9 and a weight decay of \(1e-4\). We resize each image to a fixed size of \(224 \times 224\) during training and inference. Since the spatial extents of the multi-level features differ, we align all feature maps to a size of \(28 \times 28\).
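The following PyTorch sketch illustrates this setup. Only the SGD hyperparameters, the Xavier initialization of the new layers, and the \(28 \times 28\) feature alignment come from the description above; the PositionReadout module, the VGG-16 backbone choice, the fused channel count, and the learning rate are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class PositionReadout(nn.Module):
    """Illustrative position-encoding branch: fuse backbone features, predict a position map."""

    def __init__(self, in_channels):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)
        nn.init.xavier_uniform_(self.head.weight)  # new layers use Xavier initialization
        nn.init.zeros_(self.head.bias)

    def forward(self, feats):
        # Align the multi-level feature maps to a common 28 x 28 resolution, then fuse.
        feats = [F.interpolate(f, size=(28, 28), mode='bilinear', align_corners=False)
                 for f in feats]
        return torch.sigmoid(self.head(torch.cat(feats, dim=1)))


# Backbone choice (VGG-16), fused channel count, and learning rate are assumptions.
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features  # ImageNet-pretrained
readout = PositionReadout(in_channels=512)
optimizer = torch.optim.SGD(readout.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)  # trained for 15 epochs
```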

1.2 Implementation Details of VGG-5 Network for Position Information

We use a simplified VGG network (VGG-5) for the position encoding experiments in Sect. 3.4 of the main manuscript and the texture recognition experiments in Sect. 7 of the main manuscript. The details of the VGG-5 architecture are shown in Table 11 (the table shows the VGG-5 network trained on the tiny ImageNet dataset; the VGG-5 network trained for texture recognition has a different input size of \(224 \times 224\)). Note that the network is trained from scratch. The tiny ImageNet dataset contains 200 classes, each with 500 images for training and 50 for validation. The input images are \(64 \times 64\); a random crop of \(56 \times 56\) is used for training and a center crop is applied for validation. The total number of training epochs is set to 100 with an initial learning rate of 0.01. The learning rate is decayed at the 60th and 80th epochs by a factor of 0.1. A momentum of 0.9 and a weight decay of \(1e-4\) are applied with the stochastic gradient descent optimizer. After this pre-training process, a simple read-out module is applied to the pre-trained frozen backbone for position evaluation, following the training protocol used in Islam et al. (2020). Note that the type of padding strategy is kept consistent between the pre-training and position evaluation procedures.
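A hedged PyTorch sketch of this pre-training recipe is given below. The small Sequential model is only a stand-in so that the snippet runs; the actual VGG-5 configuration is the one listed in Table 11.

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

train_tf = T.Compose([T.RandomCrop(56), T.ToTensor()])  # 56 x 56 random crop for training
val_tf = T.Compose([T.CenterCrop(56), T.ToTensor()])    # center crop for validation

model = nn.Sequential(  # stand-in only; the real VGG-5 follows Table 11
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 200),  # 200 tiny ImageNet classes
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by a factor of 0.1 at epochs 60 and 80; 100 epochs in total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)

for epoch in range(100):
    # ... one pass over the tiny ImageNet training set goes here ...
    scheduler.step()
```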

Table 11 Configuration of VGG-5 architecture trained on tiny ImageNet
Fig. 14 Classification accuracy on CIFAR-10 for different constant padding values

Fig. 15 Classification accuracy on CIFAR-100 (Krizhevsky et al., 2014) for different constant padding values

1.3 Is Zero the Optimal Constant Padding Value?

Since zero padding delivers maximal position information (see the experiments in Sect. 3.6 of the main manuscript), it is natural to ask whether zero padding also outperforms other padding types for a semantic objective. To this end, we train separate ResNet-18s (He et al., 2016) for image classification on the CIFAR-10 (Krizhevsky et al., 2014) dataset and vary the padding value in the range \([-10, 10]\). We run this experiment under two settings: with data normalization (i.e., subtract the dataset mean and divide by the standard deviation) and without (i.e., pixel values are scaled to \([-1, 1]\)). We illustrate the results in Fig. 14. The pattern is immediate and clear: among all constant padding values (excluding any dynamic padding strategies, such as reflection padding), zero gives the best performance, with what appears to be an exponential increase in error as the absolute padding value grows (e.g., \(-10\) and 10). We also train a ResNet-18 on CIFAR-10 with reflection, replicate, and circular padding, which achieve accuracies of 89.6%, 89.7%, and 87.1%, respectively, all underperforming zero padding.

We further experiment on the CIFAR-100 dataset (Krizhevsky et al., 2014) for both normalized (i.e., subtract the mean and divide by the standard deviation) and unnormalized (i.e., rescaled to [0, 1]) image values. The results (see Fig. 15) are consistent with those for CIFAR-10 (see Fig. 3 in the main paper) and suggest that zero is the optimal constant padding value for image classification, irrespective of the dataset used or whether the inputs are normalized. For consistency, in addition to zero padding (76.38%), we also used reflection, replicate, and circular padding, which achieved accuracies of 76.38%, 76.19%, and 75.59%, respectively (using normalized data).
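One way to realize a non-zero constant padding value in PyTorch, shown below as a sketch rather than the implementation used for these experiments, is to zero out each convolution's built-in padding and prepend an explicit ConstantPad2d layer with the desired value.

```python
import copy

import torch.nn as nn
import torchvision


def use_constant_padding(model, value):
    """Recursively replace each padded Conv2d with ConstantPad2d(value) + an unpadded Conv2d."""
    for name, child in model.named_children():
        if isinstance(child, nn.Conv2d) and any(p > 0 for p in child.padding):
            ph, pw = child.padding
            unpadded = copy.deepcopy(child)
            unpadded.padding = (0, 0)  # keep the learned weights, drop the built-in zero pad
            setattr(model, name,
                    nn.Sequential(nn.ConstantPad2d((pw, pw, ph, ph), value), unpadded))
        else:
            use_constant_padding(child, value)
    return model


# Example: a ResNet-18 whose feature maps are padded with -3 instead of 0.
net = use_constant_padding(torchvision.models.resnet18(num_classes=10), value=-3.0)
```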

Table 12 IoU comparison of DeepLabv3 for the semantic segmentation task under three different padding settings (zero, reflect, and no padding)

1.4 No Padding Implementation Discussion

We include no-padding comparisons for completeness and to contrast the border effects between networks trained with and without padding. For networks without residual connections (e.g., VGG), one can implement a no-padding version by simply discarding the padding. However, controlling for a consistent spatial resolution is crucial when comparing padding types, since a mismatch in spatial resolution between the padded and unpadded settings would result in a significant performance drop due to the reduced dimensionality of the feature representations. Another solution is to remove all the padding from a VGG network and then pad the input image by a sufficient amount to preserve the spatial resolution. However, this is not applicable to the ResNet backbone, as the residual connections would introduce spatial misalignment between the features of different layers. Alternatively, one can interpolate the output feature map to the same size as the input, which is also the method used in a recent study (Xu et al., 2020). In the end, we choose the interpolation implementation because we believe the visual information near the border is better retained, and the approach works for networks both with and without residual connections.

One concern when using interpolation is how to align the feature maps. If the feature maps are aligned at the center, interpolating the feature map will move its contents slightly towards the edges. The features from the two branches will thus no longer line up perfectly in the composite. This shifting effect is largest near the edges and smallest near the center, which matches the observed performance characteristics. The subsequent convolution layers may be able to undo some of this shifting, but only at the cost of location-dependent kernels tailored to the offset at different parts of the image. The other option is to align the feature maps at the corners, with the interpolation error mainly occurring at the center. In this scenario, the shifting effect is reversed: the corners are in alignment but the center of the feature map is slightly misaligned.
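These two alignment choices correspond roughly to the align_corners flag of bilinear interpolation in PyTorch. The toy example below simply makes the difference concrete; the feature map and target size are arbitrary.

```python
import torch
import torch.nn.functional as F

feat = torch.arange(16.0).reshape(1, 1, 4, 4)  # a small feature map from the unpadded branch

# Center-style alignment: pixel areas are matched, so content shifts slightly outward.
center_aligned = F.interpolate(feat, size=(8, 8), mode='bilinear', align_corners=False)
# Corner-style alignment: the four corner samples coincide, so the residual
# misalignment accumulates towards the center instead.
corner_aligned = F.interpolate(feat, size=(8, 8), mode='bilinear', align_corners=True)

print(center_aligned[0, 0, 0])  # the first row differs between the two variants
print(corner_aligned[0, 0, 0])
```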

1.5 Relationship Between Position Information and Network Depth

We further investigate the relationship between position information and the depth of the network. Interestingly, we find that the amount of encoded position information correlates with network depth. Since a deeper network has a larger receptive field, it is likely to encode more position information. We conduct position encoding experiments with ResNet backbones of different depths and report the results in Table 13. It is clear that deeper networks encode more position information.

Table 13 Positional encoding comparison of networks with different depths in terms of SPC and MAE across the DUT-OMRON (Yang et al., 2013) and ECSSD (Shi et al., 2015) datasets

1.6 Extended Boundary Effect Analysis on Cityscapes Dataset

We continue to investigate the impact that zero padding has on the ability of a strong and deep CNN to segment objects near the image boundary. Results shown use the same network and training settings as in Sect. 7 of the main manuscript, on the Cityscapes (Cordts et al., 2016) dataset (Fig. 16).

We present additional results (see Table 12 and Fig. 17) of the analysis presented in Sect. 6 (semantic segmentation) of the main paper. Figure 16 shows sample evaluation regions used for this analysis. The no-padding case has a steeper drop-off in performance as the evaluation regions get closer to the image boundary. Note how, in all cases, the performance increases from the border to the inner \(25\%\), at which point it remains somewhat stagnant until the innermost \(80\%\).

Fig. 16 An illustration of the evaluation regions used for the analysis in Table 12 and Fig. 17

Fig. 17 Performance comparison of DeepLabv3 network with respect to various image regions and padding settings used in Table 12

Surprisingly, we also observe a steeper drop-off in the middle of the image for the no-padding case, supporting our hypothesis that, without padding, boundary effects play a role in all regions of the image. We believe the drop in performance in the center regions is due to Cityscapes being an automotive-centric dataset, where pixels at the center of the image are often at large distances from the camera, unless the vehicle collecting the data has an object directly in front of it.
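The sketch below illustrates one way such a region-restricted evaluation can be implemented. The region construction, the mask helper, and the random label maps are assumptions for illustration, not the exact protocol behind Table 12.

```python
import numpy as np


def central_region_mask(h, w, keep_fraction):
    """Boolean mask keeping the central `keep_fraction` of the image extent per side."""
    mh = int(round(h * (1 - keep_fraction) / 2))
    mw = int(round(w * (1 - keep_fraction) / 2))
    mask = np.zeros((h, w), dtype=bool)
    mask[mh:h - mh, mw:w - mw] = True
    return mask


def masked_iou(pred, gt, mask, num_classes):
    """Mean IoU over classes, computed only on pixels where `mask` is True."""
    ious = []
    for c in range(num_classes):
        p = (pred == c) & mask
        g = (gt == c) & mask
        union = (p | g).sum()
        if union > 0:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else float('nan')


# Toy example with random label maps at Cityscapes resolution (19 classes).
rng = np.random.default_rng(0)
pred = rng.integers(0, 19, size=(1024, 2048))
gt = rng.integers(0, 19, size=(1024, 2048))
for frac in (0.25, 0.5, 0.8, 1.0):
    region = central_region_mask(1024, 2048, frac)
    print(frac, masked_iou(pred, gt, region, num_classes=19))
```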

1.7 Canvas Analysis: Cutout and Adversarial Robustness

Figure 18 shows two training examples generated with the Cutout strategy. Following Cutout, we simply place a rectangular mask (with either a zero or a maximum-intensity canvas) over a random region during training. Note that we evaluate on the standard PASCAL VOC 2012 validation images.
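A minimal sketch of this masking procedure is shown below; the square mask, its size, and the function name are illustrative assumptions rather than the exact Cutout configuration used here.

```python
import torch


def cutout_canvas(img, mask_size, fill='zero'):
    """img: a CxHxW tensor in [0, 1]; fill selects the 'zero' or 'max' canvas value."""
    _, h, w = img.shape
    y = torch.randint(0, h - mask_size + 1, (1,)).item()
    x = torch.randint(0, w - mask_size + 1, (1,)).item()
    value = 0.0 if fill == 'zero' else img.max().item()
    out = img.clone()
    out[:, y:y + mask_size, x:x + mask_size] = value  # paste the constant canvas patch
    return out


# Example: mask a random 64 x 64 region with the maximum-intensity canvas.
image = torch.rand(3, 224, 224)
masked = cutout_canvas(image, mask_size=64, fill='max')
```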

Fig. 18 Sample training images generated using Cutout (DeVries and Taylor, 2017) under two different canvases

Cite this article

Islam, M.A., Kowal, M., Jia, S. et al. Position, Padding and Predictions: A Deeper Look at Position Information in CNNs. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02069-9
