Multi-Processor System-on-Chip 1. Liliana Andrade
We introduced the third-generation MPPA processor, which implements a many-core architecture that targets intelligent systems, defined as cyber-physical systems enhanced with high-performance machine learning capabilities and strong cyber-security support. As with the GPGPU architecture, the MPPA3 architecture is composed of a number of multi-core compute units that share the processor external memory and I/O through on-chip global interconnects. Unlike the GPGPU architecture, however, the MPPA3 architecture is able to host standard software, offers excellent time predictability and provides strong partitioning capabilities. This enables us to consolidate, on a single- or dual-processor platform, the high-performance machine learning and computer vision functions implied by vehicle perception, the high-integrity functions developed through model-based design, and the cyber-security functions required by secured communications.