Projects
⬗ Developing a prototype of a collocation-aware resource manager for deep learning training
– In this ongoing project, I am developing a scheduler/resource manager that integrates insights from the projects below. The system is designed to use available GPU resources efficiently, with workload collocation as a core scheduling decision (see the sketch below).
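A minimal, illustrative sketch of the placement idea in Python follows; the fractional capacity model, the place function, and all names are assumptions made for illustration, not the actual prototype.

```python
from dataclasses import dataclass, field

@dataclass
class GPU:
    """One device, with remaining capacity modeled as fractions of compute/memory."""
    gpu_id: int
    free_compute: float = 1.0
    free_memory: float = 1.0
    jobs: list = field(default_factory=list)

def place(job_id, compute, memory, gpus):
    """Greedy best-fit: pick the most-loaded GPU that still fits the job,
    so work gets collocated instead of spread one job per device."""
    feasible = [g for g in gpus if g.free_compute >= compute and g.free_memory >= memory]
    if not feasible:
        return None  # no capacity anywhere -> the caller queues the job
    target = min(feasible, key=lambda g: g.free_compute)  # most-loaded feasible GPU
    target.free_compute -= compute
    target.free_memory -= memory
    target.jobs.append(job_id)
    return target

cluster = [GPU(0), GPU(1)]
for jid in ("job-a", "job-b", "job-c"):
    g = place(jid, compute=0.3, memory=0.25, gpus=cluster)
    print(jid, "->", g.gpu_id if g else "queued")
# All three jobs land on GPU 0; GPU 1 stays free for larger jobs.
```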
⬗ Studying the collocation of deep learning training tasks on GPUs: compute and memory utilization, energy consumption, etc.
– This project, which was part of my PhD thesis, focused on improving the efficiency of deep learning training by exploring GPU collocation methods on NVIDIA hardware. Given the significant computational cost of deep learning, the work analyzed the performance of several strategies, including multi-stream and multi-process submission, NVIDIA's Multi-Process Service (MPS), and Multi-Instance GPU (MIG) technology. The findings showed that collocating multiple training runs can increase training throughput by up to 3x, but these gains depend on careful management of GPU memory and compute resources. The project also demonstrated that while MIG offers interference-free partitioning, it can lead to suboptimal GPU utilization under dynamic or mixed workloads. Overall, the research highlights MPS as the most effective and flexible method for single-user training job submissions and offers guidelines for maximizing GPU performance in deep learning tasks (a stream-based sketch follows). This work resulted in a workshop paper publication at EuroMLSys 2024, accessible here.
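Below is a hedged sketch of the stream-based collocation variant in PyTorch: two small training loops share one GPU through separate CUDA streams in a single process. The models, sizes, and step count are placeholders, not the study's workloads; the multi-process, MPS, and MIG setups are configured outside any single script (e.g., MPS via the nvidia-cuda-mps-control daemon).

```python
import torch
import torch.nn as nn

device = torch.device("cuda")
models = [nn.Linear(1024, 1024).to(device) for _ in range(2)]
opts = [torch.optim.SGD(m.parameters(), lr=0.01) for m in models]
streams = [torch.cuda.Stream() for _ in range(2)]
loss_fn = nn.MSELoss()

for step in range(100):
    for model, opt, stream in zip(models, opts, streams):
        with torch.cuda.stream(stream):  # kernels from the two "jobs" may overlap
            x = torch.randn(512, 1024, device=device)  # synthetic batch
            y = torch.randn(512, 1024, device=device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
torch.cuda.synchronize()  # wait for both streams before exiting
```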
⬗ Compiling and analyzing available profiling and monitoring tools for CPUs and GPUs in the context of deep learning training
– During this project, which was part of my PhD thesis, I gathered and evaluated various profiling and monitoring tools for deep learning training tasks on GPUs. I assessed their advantages, disadvantages, overheads, and intrusiveness, and provided guidelines on when, how, and where to use them effectively. Additionally, I conducted an in-depth analysis of the GPU utilization metric, exploring what it actually measures and its implications (see the monitoring sketch below). This work provided valuable insights into these systems and resulted in a workshop paper publication at EuroMLSys 2023, accessible here.
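As a taste of the monitoring side, here is a minimal sketch that polls GPU utilization through NVML (the library behind nvidia-smi) using the pynvml bindings. It also illustrates the caveat the project examines: the reported "GPU utilization" is only the fraction of the sample window during which at least one kernel was resident; it says nothing about how many SMs were actually busy.

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU
try:
    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent over last sample window
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"gpu={util.gpu}% mem_ctrl={util.memory}% "
              f"mem_used={mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```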
⬗ OSM: Off-chip Shared Memory for GPUs
– Contributed to the design and implementation of a novel memory design that handles both shared memory and L1 data cache accesses. First, shared memory accesses generated within the GPGPU-Sim simulator were logged, and their locality, liveness, and read-after-write frequency characteristics were studied (a sketch of this analysis step follows). Then, based on these observations, the proposed mechanism was implemented by modifying the GPGPU-Sim source code. The project resulted in a publication that can be accessed here.
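A small Python sketch of the trace-analysis step is shown below. The (cycle, op, address) log format is a hypothetical stand-in for the actual simulator output; the code computes write-to-read gaps (read-after-write behavior) and last-use times (a simple liveness proxy).

```python
def analyze(trace):
    """trace: iterable of (cycle, op, address) tuples; op is 'R' or 'W'."""
    last_write = {}   # address -> cycle of the most recent write
    raw_gaps = []     # write-to-read distances in cycles
    last_use = {}     # address -> last access cycle (liveness proxy)
    for cycle, op, addr in trace:
        if op == "R" and addr in last_write:
            raw_gaps.append(cycle - last_write[addr])
        elif op == "W":
            last_write[addr] = cycle
        last_use[addr] = cycle
    return raw_gaps, last_use

trace = [(0, "W", 0x10), (3, "R", 0x10), (4, "R", 0x10), (9, "W", 0x20)]
gaps, last_use = analyze(trace)
print("read-after-write gaps:", gaps)    # [3, 4]
print("last use per address:", last_use)
```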
⬗ Caches with different configurations
– Implementation of direct-mapped and set-associative caches. The goal was to experiment with different replacement policies and their effect on the hit/miss rate. It can be accessed here; an illustrative sketch of the experiment follows.
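Below is an illustrative re-sketch of the experiment in Python (the actual implementation may differ): a set-associative cache with an LRU replacement policy, with direct-mapped as the one-way special case, measuring the hit rate over an address trace.

```python
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets, self.ways, self.block = num_sets, ways, block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> None, in LRU order
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        block = addr // self.block
        s = self.sets[block % self.num_sets]   # set index from block address
        tag = block // self.num_sets
        if tag in s:
            self.hits += 1
            s.move_to_end(tag)                 # LRU update; omit this line for FIFO
        else:
            if len(s) >= self.ways:
                s.popitem(last=False)          # evict the least-recently-used tag
            s[tag] = None

    @property
    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0

# Direct-mapped is simply ways=1.
cache = SetAssociativeCache(num_sets=64, ways=4)
for addr in [0x0, 0x40, 0x0, 0x1000, 0x0]:
    cache.access(addr)
print(f"hit rate: {cache.hit_rate:.2f}")  # 2 hits out of 5 accesses -> 0.40
```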
⬗ The basic computer from Morris Mano's textbook
– Implementation of Mano's basic computer in Verilog HDL. It can be accessed here.
⬗ Knowledge dissemination projects
– As one of my hobbies, I design and develop high-quality tutorials for those who want to learn quickly and easily. You can check them out on my GitHub, Medium, and YouTube pages.
⬗ Web Development Projects
– Worked as a back-end developer on two web development teams. My tasks mainly involved writing queries to feed the UI forms with correct data, and I also gained experience developing APIs that serve data in JSON format. Separately, I built a URL shortener API in the Go programming language.
⬗ Automating an archiving system and multiple other processes in Microsoft Office with VBA
– Design and implementation of an efficient archiving system in Excel with VBA, handling documents according to their type and transactions.