Partha Maji's Homepage

Partha Maji is a Technical Director of AI Hardware Acceleration at Tenstorrent. Prior to joining Tenstorrent, he was a Principal Research Engineer at Arm based in Cambridge, where he used to lead the UK ML research team. His current research interests lie in multiple disciplines that bridge the topics of machine learning and optimisations, computer architecture and hardware implementation. In particular, his research focuses on Hardware/ML co-design of deep learning models for efficiency with research threads investigating Graph Neural Networks, Transformers, Generative Models, Convolutional Neural Networks, 3D Vision, Neural rendering for AR/VR and how to train, optimise, compress, and accelerate them for energy-efficient hardware both for data centre and edge AI.

Prior to this, he worked as a Staff SoC Design Engineer at Broadcom. where he led various front-end and back-end chip design and implementation activities. Partha started his career in the semiconductor industry as a Micro-architecture design engineer in the Processor Division group at Arm. He also spent in the software industry for a while before embarking on chip design. He has extensive experience with the end-to-end chip design process starting from microarchitecture-level design to physical design through multiple tape-outs of low-power chips at 65/40/28/22nm deep-submicron CMOS process technology.

Partha has received several excellence awards from the industry including a Mentor Graphics prize for outstanding achievement in the master’s degree. Partha also received multiple accolades for his research in on-chip interconnect including an award from Epson Europe and the IET, UK. He was also recognized by the European Neural Network Society for his high-quality contribution to machine learning research. Partha was a recipient of the prestigious UK Chevening scholarship. He received a PhD in Computer Science from the University of Cambridge and an MSc in System-on-Chip design from the University of Edinburgh.

You can find more up-to-date information about Partha on Linkedin.

Talks

Invited Panel/Speaker at ICML 2021 - Link
Invited Speaker at RCML at FPL 2022 - Link
Distinguished Speaker at IEEE CASS Season School - Link
Industrial Technical Insights Talk at Embedded Vision Summit - Link to Recording
Invited Speaker at Oxford ML Summer School - Link
Invited Speaker at Royal Statistical Society's Sustainable Computing workshop - Link
Invited Speaker at AI for Circuit and Systems (AICAS conference) - AICAS

Publications & Patents (Selected)

Efficient Winograd or Cook-Toom Convolution Kernel Implementation on Widely Used Arm's A-Class Mobile CPUs

P. Maji (ARM Research), A. Mundy (ARM Research), G. Dasika (ARM Research), J. Beu (ARM Research), M. Mattina (ARM Research), and R. Mullins (Cambridge).

HPCA 2019 (Paper: pdf | Presentation: pdf)

The Winograd or Cook-Toom class of algorithms help to reduce the overall compute complexity of many modern deep convolutional neural networks (CNNs). Although there has been a lot of research done on model and algorithmic optimization of CNN, little attention has been paid to the efficient implementation of these algorithms on embedded CPUs, which usually have frugal memory and low power budget. This research work aims to fill this gap and focuses on the efficient implementation of Winograd or Cook-Toom based convolution on modern Arm Cortex-A CPUs, widely used in mobile devices today. Specifically, we demonstrate a reduction in inference latency by using a set of optimization strategies that improve the utilization of computational resources, and by effectively leveraging the ARMv8-A NEON SIMD instruction set. We evaluated our proposed region-wise multi-channel implementations on Arm Cortex-A73 platform using several representative CNNs. The results show significant performance improvements in full network, up to 60%, over existing im2row/im2col based optimization techniques.

Enabling ImageNet-Scale Deep Learning on Arm's M-Class MCUs for Accurate and Efficient Inference

Sulaiman Sadiq, Jonathon Hare, Simon Craske (Arm), Partha Maji (Arm), Geoff Merrett

IEEE Internet of Things Journal (A collaborative work between Arm and University of Southampton)

Conventional approaches to TinyML achieve high accuracy by deploying the largest deep learning model with highest input resolutions that fit within the size constraints imposed by the microcontroller’s (MCUs) fast internal storage and memory. In this paper, we perform an in-depth analysis of prior works to show that models derived within these constraints suffer from low accuracy and, surprisingly, high latency. We propose an alternative approach that enables the deployment of efficient models with low inference latency, but free from the constraints of internal memory. We take a holistic view of typical MCU architectures, and utilise plentiful but slower external memories to relax internal storage and memory constraints. To avoid the lower speed of external memory impacting inference latency, we build on the TinyOps inference framework, which performs operation partitioning and uses overlays via DMA, to accelerate the latency. Using insights from our study, we deploy efficient models from the TinyOps design space onto a range of embedded MCUs achieving record performance on TinyML ImageNet classification with up to 6.7% higher accuracy and 1.4x faster latency compared to state-of-the-art internal memory approaches.

Cook-Toom AI HW Accelerator: A scalable custom HW architecture to enable fast inference via co-designing convolution algorithm and hardware

Partha Maji, Robert Mullins

University of Cambridge

The Cook-Toom custom AI accelerator illustrates that co-designing algorithm and underlying hardware architecture helps to achieve significant performance improvement. It requires more collaborative effort between machine-learning researchers and computer architects. In this work, I illustrated how this could be achieved by first establishing design guidelines. Any domain-specific architecture, including the Cook-Toom accelerator, must be flexible, configurable and scalable. Through a variety of benchmarking exercises, I showed that the novel Cook-Toom accelerator is easily configurable and scalable. Additionally, the Cook-Toom accelerator can be configured to support any of three main data-reuse patterns. For efficient implementation of the MCMR algorithm, the data-reuse pattern must be altered while switching regions within the input feature maps as explained earlier. Additionally, I illustrated a few mapping examples of spatial convolutional layers onto the core Hadamard engine.

Enhanced Block Floating Point Number Multiplier

Neil Burgess, Sangwon HA, Partha Prasun MAJI

US Patent - US20240036822A1

A data processing apparatus is configured to determine a product of two operands stored in an Extended Block Floating-Point format. The operands are decoded, based on their tags and payloads, to generate exponent differences and at least the fractional parts of significands. The significands are multiplied to generate an output significand and shared exponents and exponent differences of the operands are combined to generate an output exponent. Signs of the operands may also be combined to provide an output sign. The apparatus may be combined with an accumulator having one or more lanes to provide an apparatus for determining dot products.

Method and Apparatus for Converting to Enhanced Block Floating Point Format

Neil Burgess, Sangwon HA, Partha Prasun MAJI

US Patent - US20240045653A1

An apparatus and method of converting data into an Enhanced Block Floating Point (EBFP) format with a shared exponent is provided. The EBFP format enables data within a wide range of values to be stored using a reduced number of bits compared with conventional floating-point or fixed-point formats. The data to be converted may be in any other format, such as fixed-point, floating-point, block floating-point or EBFP.

Methods and hardware for inter-layer data format conversion in neural networks

Partha Maji, Sangwon HA

US Patent - US20240013051A1

The present disclosure relates to a method of inter-layer format conversion for a neural network, the neural network comprising at least two computation layers including a first layer to process first data in a first data format and a second layer to process second data in a second data format, the method comprising: extracting data statistics from data output by the first layer, said data statistics being representative of the data output by the first layer; determining one or more conversion parameters based on the extracted data statistics and the second data format; and generating the second data for the second layer by modifying said data output by the first layer using the one or more conversion parameters.

Apparatus and method for performing accumulation operations

Sangwon HA, Neil Burgess, Partha Maji

US Patent - US20230409286A1

An apparatus has processing circuitry to perform an accumulation operation in which a first addend is added to a second addend. The apparatus has storage circuitry to store the second addend in a plurality of lanes, each lane having a significance different to that of each other lane. Each lane within at least a subset of the lanes comprises at least one overlap bit having the same bit significance as a bit in an adjacent more significant lane in the plurality of lanes. The accumulation operation includes selecting an accumulating lane out of the plurality of lanes and performing an addition operation between bits of the accumulating lane and the first addend. The at least one overlap bit of the accumulating lane enables the addition operation to be performed without a possibility of overflowing the accumulating lane.

Availability Attacks on Graph Neural Networks

Shyam A. Tailor, Miguel Tairum Cruz, Tiago Azevedo, Nicholas D. Lane, Partha Maji

ICML 2022

Graph neural networks (GNNs) have become a popular approach for processing non-uniformly structured data in recent years. These models implement permutation-equivariant functions: their output does not depend on the order of the graph. Although reordering the graph does not affect model output, it is widely recognised that it may reduce inference latency. Less widely noted, however, is the observation that it is also possible to reorder the input graph to \textit{increase} latency, representing a possible security (availability) vulnerability. Reordering attacks are difficult to mitigate, as finding an efficient processing order for an arbitrary graph is challenging, yet discovering an inefficient order is practically trivial in many cases: random shuffling is often sufficient. We focus on point cloud GNNs, which we find are especially susceptible to reordering attacks, and which may be deployed in real-time, safety-critical applications such as autonomous vehicles. We propose a lightweight reordering mechanism for spatial data, which can be used to mitigate reordering attacks in this special case. This mechanism is effective in defending against the slowdowns from shuffling, which we find for point cloud models can increase message propagation latency by 7.1$\times$, with 81\% increases to end-to-end latency with PosPool models at 1M points.

TinyOps: ImageNet Scale Deep Learning on Microcontrollers

S. Sadiq, J. Hare, S. Craske, G. Merrett, P. Maji

CVPR 2022

Deep Learning on microcontroller (MCU) based IoT devices is extremely challenging due to memory constraints. Prior approaches focus on using internal memory or external memories exclusively which limit either accuracy or latency. We find that a hybrid method using internal and external MCU memories outperforms both approaches in accuracy and latency. We develop TinyOps, an inference engine which accelerates inference latency of models in slow external memory, using a partitioning and overlaying scheme via the available Direct Memory Access (DMA) peripheral to combine the advantages of external memory (size) and internal memory (speed). Experimental results show that architectures deployed with TinyOps significantly outperform models designed for internal memory with up to 6% higher accuracy and importantly, 1.3-2.2x faster inference latency to set the state-of-the-art in TinyML ImageNet classification. Our work shows that the TinyOps space is more efficient compared to the internal or external memory design spaces and should be explored further for TinyML applications.

Towards Efficient Point Cloud Graph Neural Networks Through Architectural Simplification (Awarded Best Paper)

Shyam A. Tailor, René de Jong, Tiago Azevedo, Matthew Mattina, Partha Maji

ICCV 2021

In recent years graph neural network (GNN)-based approaches have become a popular strategy for processing point cloud data, regularly achieving state-of-the-art performance on a variety of tasks. To date, the research community has primarily focused on improving model expressiveness, with secondary thought given to how to design models that can run efficiently on resource-constrained mobile devices including smartphones or mixed reality headsets. In this work we make a step towards improving the efficiency of these models by making the observation that these GNN models are heavily limited by the representational power of their first, feature extracting, layer. We find that it is possible to radically simplify these models so long as the feature extraction layer is retained with minimal degradation to model performance; further, we discover that it is possible to improve performance overall on ModelNet40 and S3DIS by improving the design of the feature extractor. Our approach reduces memory consumption by 20× and latency by up to 9.9× for graph layers in models such as DGCNN; overall, we achieve speed-ups of up to 4.5× and peak memory reductions of 72.5%.

An Underexplored Dilemma between Confidence and Calibration in Quantized Neural Networks

Guoxuan Xia, Sangwon Ha, Tiago Azevedo, Partha Maji

NeurIPS 2021, ICBINB

Modern convolutional neural networks (CNNs) are known to be overconfident in terms of their calibration on unseen input data. That is to say, they are more confident than they are accurate. This is undesirable if the probabilities predicted are to be used for downstream decision making. When considering accuracy, CNNs are also surprisingly robust to compression techniques, such as quantization, which aim to reduce computational and memory costs. We show that this robustness can be partially explained by the calibration behaviour of modern CNNs, and may be improved with overconfidence. This is due to an intuitive result: low confidence predictions are more likely to change post-quantization, whilst being less accurate. High confidence predictions will be more accurate, but more difficult to change. Thus, a minimal drop in post-quantization accuracy is incurred. This presents a potential conflict in neural network design: worse calibration from overconfidence may lead to better robustness to quantization. We perform experiments applying post-training quantization to a variety of CNNs, on the CIFAR-100 and ImageNet datasets, and make our code publicly available.\footnote{Code to be released on Github after review.

On Efficient Uncertainty Estimation for Resource-Constrained Mobile Applications

Johanna Rock, Rene De Jong, Tiago Azevedo, Matt Mattina, Partha Maji

NeurIPS 2021, Bayesian Deep Learning

Deep neural networks have shown great success in prediction quality while reliable and robust uncertainty estimation remains a challenge. Predictive uncertainty supplements model predictions and enables improved functionality of downstream tasks including embedded and mobile applications, such as virtual reality, augmented reality, sensor fusion, and perception. These applications often require a compromise in complexity to obtain uncertainty estimates due to very limited memory and compute resources. We tackle this problem by building upon Monte Carlo Dropout (MCDO) models using the Axolotl framework; specifically, we diversify sampled subnetworks, leverage dropout patterns, and use a branching technique to improve predictive performance while maintaining fast computations. We conduct experiments on (1) a multi-class classification task using the CIFAR10 dataset, and (2) a more complex human body segmentation task. Our results show the effectiveness of our approach by reaching close to Deep Ensemble prediction quality and uncertainty estimation, while still achieving faster inference on resource-limited mobile platforms.

On the Effects of Quantisation on Model Uncertainty in Bayesian Neural Networks

Martin Ferianc, Partha Maji, Matthew Mattina, Miguel Rodrigues

UAI 2021 (Uncertainty in AI)

Bayesian neural networks (BNNs) are making significant progress in many research areas where decision-making needs to be accompanied by uncertainty estimation. Being able to quantify uncertainty while making decisions is essential for understanding when the model is over-/under-confident, and hence BNNs are attracting interest in safety-critical applications, such as autonomous driving, healthcare, and robotics. Nevertheless, BNNs have not been as widely used in industrial practice, mainly because of their increased memory and compute costs. In this work, we investigate quantisation of BNNs by compressing 32-bit floating-point weights and activations to their integer counterparts, that has already been successful in reducing the compute demand in standard pointwise neural networks. We study three types of quantised BNNs, we evaluate them under a wide range of different settings, and we empirically demonstrate that a uniform quantisation scheme applied to BNNs does not substantially decrease their quality of uncertainty estimation.

Stochastic-Shield: A Probabilistic Approach Towards Training-Free Adversarial Defense in Quantized CNNs

Lorena Qendro, Sangwon Ha, René de Jong, Partha Maji

MobiSys 2021

Quantized neural networks (NN) are the common standard to efficiently deploy deep learning models on tiny hardware platforms. However, we notice that quantized NNs are as vulnerable to adversarial attacks as the full-precision models. With the proliferation of neural networks on small devices that we carry or surround us, there is a need for efficient models without sacrificing trust in the prediction in presence of malign perturbations. Current mitigation approaches often need adversarial training or are bypassed when the strength of adversarial examples is increased.

In this work, we investigate how a probabilistic framework would assist in overcoming the aforementioned limitations for quantized deep learning models. We explore Stochastic-Shield: a flexible defense mechanism that leverages input filtering and a probabilistic deep learning approach materialized via Monte Carlo Dropout. We show that it is possible to jointly achieve efficiency and robustness by accurately enabling each module without the burden of re-retraining or ad hoc fine-tuning.

Stochastic-YOLO: Efficient Probabilistic Object Detection under Dataset Shifts, Blog

T. Azevedo, R. Jong, M. Mattina, P. Maji

NeurIPS 2020 (Workshop on ML for Autonomous Driving)

In image classification tasks, the evaluation of models' robustness to increased dataset shifts with a probabilistic framework is very well studied. However, object detection (OD) tasks pose other challenges for uncertainty estimation and evaluation. For example, one needs to evaluate both the quality of the label uncertainty (i.e., what?) and spatial uncertainty (i.e., where?) for a given bounding box, but that evaluation cannot be performed with more traditional average precision metrics (e.g., mAP). In this paper, we adapt the well-established YOLOv3 architecture to generate uncertainty estimations by introducing stochasticity in the form of Monte Carlo Dropout (MC-Drop), and evaluate it across different levels of dataset shift. We call this novel architecture Stochastic-YOLO, and provide an efficient implementation to effectively reduce the burden of the MC-Drop sampling mechanism at inference time. Finally, we provide some sensitivity analyses, while arguing that Stochastic-YOLO is a sound approach that improves different components of uncertainty estimations, in particular spatial uncertainties.

DEff-ARTS: Differentiable Efficient Architecture Search (Neural Architecture Search - NAS)

S. Sadiq, P. Maji, J. Hare, and G. Merrett

NeurIPS 2020 (Workshop on ML for Systems)

Manual design of efficient Deep Neural Networks (DNNs) for mobile and edge devices is an involved process which requires expert human knowledge to improve efficiency in different dimensions. In this paper, we present DEff-ARTS, a differentiable efficient architecture search method for automatically deriving CNN architectures for resource constrained devices. We frame the search as a multi-objective optimisation problem where we minimise the classification loss and the computational complexity of performing inference on the target hardware. Our formulation allows for easy trading-off between the sub-objectives depending on user requirements. Experimental results on CIFAR-10 classification showed that our approach achieved a highly competitive test error rate of 3:24% with 30% fewer parameters and multiply and accumulate (MAC) operations compared to Differentiable ARchiTecture Search (DARTS).

Object Detection Network with Spatial Uncertainty

Partha Maji, Tiago Azevedo

US Patent - US20220277159A1

A hardware accelerator for an object detection network and a method for detecting an object are provided. The present disclosure provides robust object detection that advantageously augments traditional deterministic bounding box predictions with spatial uncertainties for various computer vision applications, such as, for example, autonomous driving, robotic surgery, etc.

On the Reduction of Computational Complexity of Deep Convolutional Neural Networks (Published)

P. Maji, R. Mullins.

Journal Entropy 2018 (Special Issue with Selected Papers)

Deep convolutional neural networks (ConvNets), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks. Unfortunately, achieving accuracy often implies significant computational costs, limiting deployability. In modern ConvNets it is typical for the convolution layers to consume the vast majority of computational resources during inference. This has made the acceleration of these layers an important research area in academia and industry. In this paper, we examine the effects of co-optimizing the internal structures of the convolutional layers and underlying implementation of fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speedup of a ConvNet, achieving a ten-fold increase over baseline. We also introduce a new class of fast one-dimensional (1D) convolutions for ConvNets using the Toom–Cook algorithm. We show that our proposed scheme is mathematically well-grounded, robust, and does not require any time-consuming retraining, while still achieving speedups solely from convolutional layers with no loss in baseline accuracy.

1D-FALCON: Accelerating Deep Convolutional Neural Network Inference by Co-optimization of Models and Underlying Arithmetic Implementation (Published)

P. Maji, R. Mullins. The 26th International Conference on Artificial Neural Networks

ICANN 2017 (Best Paper Candidate)

Deep convolutional neural networks (CNNs), which are at the heart of many new emerging applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity, limiting their deployability. In modern CNNs, convolutional layers mostly consume 90% of the processing time during a forward inference and acceleration of these layers are of great research and commercial interest. In this paper, we examine the effects of co-optimizing internal structures of convolutional layers and underlying implementation of fundamental convolution operation. We demonstrate that a combination of these methods can have a big impact on the overall speed-up of a CNN, achieving a tenfold increase over baseline. We also introduce a new class of fast 1-D convolutions for CNNs using the Toom-Cook algorithm. We show that our proposed scheme is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speedups solely from convolutional layers with no loss in baseline accuracy.

ADaPT: Optimizing CNN inference on IoT and mobile devices using approximately separable 1-D kernels (Published)

P. Maji, D. Bates, A. Chadwick, and R. Mullins. In the ACM International Conference on Internet of Things and Machine Learning (ACM - IML), 2017 (Full, ORAL)

Breakthroughs from the field of deep learning are radically changing how sensor data are interpreted to extract important information to help advance healthcare, make our cities smarter, and innovate in smart home technology. Deep convolutional neural networks, which are at the heart of many emerging Internet-of-Things (IoT) applications, achieve remarkable performance in audio and visual recognition tasks, at the expense of high computational complexity in convolutional layers, limiting their deployability. In this paper, we present an easy-to-implement acceleration scheme, named ADaPT, which can be applied to already available pre-trained networks. Our proposed technique exploits redundancy present in the convolutional layers to reduce computation and storage requirements. Additionally, we also decompose each convolution layer into two consecutive one-dimensional stages to make full use of the approximate model. This technique can easily be applied to existing low power processors, GPUs or new accelerators. We evaluated this technique using four diverse and widely used benchmarks, on hardware ranging from embedded CPUs to server GPUs. Our experiments show an average 3-5x speed-up in all deep models and a maximum 8-9x speed-up on many individual convolutional layers. We demonstrate that unlike iterative pruning based methodology, our approximation technique is mathematically well grounded, robust, does not require any time-consuming retraining, and still achieves speed-ups solely from convolutional layers with no loss in baseline accuracy.

Apparatus and method for providing a bidirectional communications link between a master device and a slave device

P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8924612B2, Granted on 2014-12-30.

A bidirectional communications link between a master device and a slave device includes first endpoint circuitry coupled to the master device generating forward data packets, second endpoint circuitry coupled to the slave device for receiving reverse data packets, and bidirectional communication circuitry for transferring forward data packets from the first endpoint circuitry to the second endpoint circuitry and reverse data packets from the second endpoint circuitry to the first endpoint circuitry. In response to a power down condition requiring a power down of at least one of the first endpoint circuitry and the second endpoint circuitry, performance of said power down is deferred until both said outstanding forward credit signal and said outstanding reverse credit signal have been deasserted.

ARM Big Data Poster - Partha Maji.pdf

Enabling Deep Learning on Embedded Systems

Deep Convolutional Networks (ConvNets) have demonstrated state-of-the-art performance in many machine learning problems involving image classification and speech recognition. Over the last few years several advances in the design of ConvNets have not only led to a further boost in achieved accuracy on image recognition tasks but also played a crucial role as a feature generator for other machine learning tasks such as object detection, localization, semantic segmentation and image retrieval. However, the complexity and size of ConvNets have limited their use in mobile applications and embedded system. The aim of my research is to find ways to optimize these deep neural networks using model-architecture co-design and enable mass deployment of deep-learning based applications in consumer products.

P. Maji, R. Mullins. In the ARM-Cambridge Research Showcase,

Poster Session, Cambridge Big Data,

Maxwell Centre, University of Cambridge, Dec 2016.

Data packet flow control across an asynchronous clock domain boundary

P. Maji, S. R. Mellor, ARM Ltd. US Patent #US8630358B2, Granted on 2014-01-14.

A system-on-chip integrated circuit includes a packet transmitter for generating data packets to be sent via a communication circuit to a packet receiver containing a buffer circuit. A transmitter counter stores a transmitter count value counting data packets sent. A receiver counter stores a receiver count value tracking data packets emptied from the buffer circuit. A comparison circuitry is used to compare the transmitter count value and the receiver count value to determine whether or not there is storage space available within the buffer circuit to receive transmission of further data packets. The packet transmitter operates in a transmitter clock domain that is asynchronous from a receiver clock domain in which the packet receiver operates. One of the count values is passed across this asynchronous clock boundary in order that the comparison may be performed and flow control exercised.

My Hobby Set Top Box Project

Inspired by industrial scale streaming Set Top Boxes, I built a very elementary networked Set Top Box on my spare time using just discrete chips. It uses 8 symmetric 32-bit processors (cores) with 32KB of RAM running at 80MHz. The complete STB solution works with any NTSC or PAL TV that has composite (RCA) input, with any network router that supports DHCP. It does not have any support for wifi. I also built few basic widgets to go with the hardware. The final PCB is in the middle between a Costa club card and a Raspberry pi.

Design and Implementation of the Quarc Network-on-Chip

P. Maji, M. Moadeli, and W. Vanderbauwhede.

IPDPS 2009 (International Parallel & Distributed Processing Symposium)

Networks-on-Chip (NoC) have emerged as alternative to buses to provide a packet-switched communication medium for modular development of large Systems-on-Chip. However, to successfully replace its predecessor, the NoC has to be able to efficiently exchange all types of traffic including collective communications. The latter is especially important for e.g. cache updates in multicore systems. The Quarc NoC architecture [9] has been introduced as a Networks-on-Chip which is highly efficient in exchanging all types of traffic including broadcast and multicast. In this paper we present the hardware implementation of the switch architecture and the network adapter (transceiver) of the Quarc NoC. Moreover, the paper presents an analysis and comparison of the cost and performance between the Quarc and the Spidergon NoCs implemented in Verilog targeting the Xilinx Virtex FPGA family. We demonstrate a dramatic improvement in performance over the Spidergon especially for broadcast traffic, at no additional hardware cost.

Energy efficient Convolutional Neural Network for Embedded Systems

P. Maji, R Mullins. In the Microsoft PhD Summer School Poster Session, Microsoft Research (MSR), Cambridge, July 2016.

Quarc: a High-Efficiency Network on-Chip Architecture

M. Moadeli, P. Maji, and W. Vanderbauwhede. In the IEEE 23rd International Conference on Advanced Information Networking & Applications (AINA), Bradford, UK, May 2009.

An analytical performance model for the Spidergon NoC with virtual channels

M. Moadeli, A. Shahrabi, W. Vanderbauwhede and P. Maji. In proceedings of the Journal of Systems Architecture – Embedded Systems Design (JSA), 2010.

Design and FPGA Implementation of a packet switch for the Quarc Network-on-Chip

P. Maji, W. Vanderbauwhede, and F. Rodriguez. In iSLI Annual day student poster presentation, Alba Centre, Livingston, Scotland (Awarded Best Poster).

A novel NoC in Multi-processor paradigm

P. Maji. In the IET Scotland Present Around the World competition, 2009, at the Old School, the University of Edinburgh (Awarded Best Presentation).

Design and FPGA Implementation of a packet switch for the Quarc Network-on-Chip (Available on request)

P. Maji, MSc Thesis, September 2008.

An Idea vs Real Product - A Pragmatic Point of View

'''It’s the disease of thinking that a really great idea is 90% of the work. And if you just tell all these other people “here’s this great idea,” then of course they can go off and make it happen. And the problem with that is that there’s just a tremendous amount of craftsmanship in between a great idea and a great product. And as you evolve that great idea, it changes and grows. It never comes out like it starts because you learn a lot more as you get into the subtleties of it. And you also find there are tremendous trade-offs that you have to make. There are just certain things you can’t make electrons do. There are certain things you can’t make plastic do. Or glass do. Or factories do. Or robots do. Designing a product is keeping five thousand things in your brain and fitting them all together in new and different ways to get what you want.''' (Steve Jobs 1995)