Course pages 2013–14
Here I'll post a list of questions that have been asked, whose answers may be useful to everyone.
What are the most promising post-CMOS technologies?
Currently graphene-based transistors and integrated circuits are seen as the most likely technology to replace CMOS. Recently researchers at the University of Manchester created a new type of graphene transistor. You can read the paper online or a summary article. Other candidates are spintronics and molecular electronics. A very recent development is phosphorene transistors. There is a good overview paper from 2010 that explains the motivation for post-CMOS technologies and describes a number of different technologies.
Why does performance only improve by S from adding S² transistors in the S² era? (Lecture 1, slide 32.)
As transistors shrink we can clock them faster, which gives us a factor S improvement each generation. We can also put the extra transistors to use to extract parallelism from our applications, implement prediction and make the common case faster. In the S³ era we used them to widen data paths and implement pipelining, which gave an S² improvement in performance. However, in the S² era, we used them to implement large on-chip caches and superscalar issue, which only gave us a factor S improvement in performance. You can see more information in Norm Jouppi's keynote from MICRO 38 (one of the top conferences in computer architecture).
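To make the arithmetic concrete, here is a small sketch (my own illustration, not taken from the lecture slides) of how the per-generation gains multiply. It assumes the classic scaling factor S ≈ √2 per process generation, so each shrink gives S² more transistors in the same area and roughly an S higher clock.

```python
# Classic scaling factor per process generation: feature size shrinks
# by S, transistor count grows by S^2, clock frequency grows by S.
S = 2 ** 0.5

# S^3 era: the S^2 extra transistors (wider data paths, pipelining)
# bought ~S^2 more performance, multiplying the S from the faster
# clock: S * S^2 = S^3 per generation.
perf_s3_era = S * S**2

# S^2 era: the same S^2 extra transistors (large caches, superscalar
# issue) only bought ~S more performance: S * S = S^2 per generation.
perf_s2_era = S * S

print(f"S^3 era: ~{perf_s3_era:.2f}x per generation")  # ~2.83x
print(f"S^2 era: ~{perf_s2_era:.2f}x per generation")  # ~2.00x
```

So in both eras the transistor budget grows by S², but the performance bought with it drops from S² to S, which is exactly the gap the slide is pointing at.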
How come the POWER5 was out-of-order, the POWER6 in-order, and then the POWER7 out-of-order again?
This is a classic case of designers optimising the POWER6 design for high frequency and low power. With the POWER6, the clock frequency was doubled by reducing the per-stage pipeline delay from 23 FO4 to 13 FO4. However, to achieve this they also needed to reduce the power consumption, which they did by removing the dynamic scheduling logic and minimising speculation. There is more information in this paper.
With the POWER7, although designers doubled the number of threads supported by each core (to 4-way SMT), they also wanted to increase single-threaded performance. They achieved this by adding out-of-order execution back in, and could afford to do this since they had shrunk the design to 45nm (instead of 65nm for the POWER6), thereby gaining the benefits of smaller and more power-efficient transistors, and reducing the clock frequency. More about the POWER7 can be found in this paper.
With memory reference speculation in VLIW processors (lecture 7, slide 39), why does the store remove the entry from the table and not just store the correct value there?
If we have speculatively moved a load above a store, we need to make sure that later on we can easily identify whether the speculation was correct or not. We may also have moved other instructions that depend on the load above the store, so we would not only have to patch up the load, but also alter any dependent instructions. In the example given, this would be any instruction that uses R1.
The simplest option, therefore, is for the store to remove any entry that it conflicts with from the hardware table. The check instruction then just has to verify that the entry is still there, branching to fix-up code if it isn't. Were the store to write the correct address into the table instead, the hardware would have to keep the actual address loaded around to do a comparison when executing the check. This would be more costly in terms of storage and implementation (checking for the existence of an entry is much easier than doing a lookup followed by a comparison). You'd also still have to fix up any dependent instructions.
If the check fails, the processor branches to some fix-up code. Since the compiler produced the original schedule, it knows what steps need to be taken to make things correct again. Typically, this would involve executing the load once more, and then all instructions that were executed before the check that were directly or indirectly dependent on that load. This fix-up code doesn't have to be optimal, since it should rarely be executed.