Retro-uC first time right - yes, we can
Posted on Sat 15 September 2018 in Retro-uC
This post again addresses comments posted after the launch of the Retro-uC campaign claiming that it defies 'common wisdom' and is thus unrealistic. This time the topic is the project's assumption that the chip will work the first time, known in the industry as first-time-right. To explain the possible problems and the strategy to avoid them, this post needs to be more technical than the previous ones.
As explained in the previous post, the startup costs are a big contributor to the final cost of the product for a low-volume custom chip. Doing an iteration on the production would double the startup costs of the repeated production steps. The Retro-uC project was therefore defined from the start with the purpose of having a working chip on the first iteration and avoiding having to raise the pledge levels. The main principle applied is KISS: keep it simple, stupid.
Before going into the gory details of what can go wrong, let's first give some background on how a microelectronics chip works. On the chip, the zeros and ones are represented by electrical signals. Computations are done by transistors; for digital applications they act as switches, with the voltage on one of the input nodes determining whether the switch is open or closed. Switches can be used to make digital circuits. Take for example a light in a room that can be controlled from multiple switches in the classic way, i.e. not centralized in the electrical cabinet. The state of the individual switches determines whether the light is on or not, and thus whether you have a zero or a one on the output. Electrically controllable switches can also be used to make memory elements. Going back to our light example, you can also have configurations where pushing a button switches the light on or off. In technical terms a memory element on a chip is often called a register. Another use case for transistors is to operate them with the switch neither fully open nor fully closed. This is called an analog application of the transistor. Examples are the use in a transistor radio or an audio amplifier.
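The light example can be sketched in a few lines of Python. This is only an illustration: it assumes the classic multi-switch wiring where flipping any one switch toggles the light, and the names `light_is_on` and `ToggleButton` are made up for this sketch.

```python
from functools import reduce
from operator import xor


def light_is_on(switches):
    """Combinational logic: the light is on when an odd number of
    switches are flipped, so flipping any single switch toggles it."""
    return reduce(xor, switches, False)


class ToggleButton:
    """Memory element: the light remembers its state between presses."""

    def __init__(self):
        self.state = False  # light starts off

    def press(self):
        self.state = not self.state  # each press flips the stored bit
        return self.state


if __name__ == "__main__":
    print(light_is_on([True, False]))  # one switch flipped
    button = ToggleButton()
    print(button.press(), button.press())
```

The first function has no memory: its output depends only on the current switch positions. The second one does, which is the role a register plays on a chip.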
The fact that zeros and ones are represented by electrical signals also means that they are bound by the physical laws for electrical signals. The wires have a certain resistance, wires close to each other form a capacitor, and there are other similar effects. This means that these signals take time to propagate through the chip and may also interact with each other. There are several possible reasons why a chip does not work the first time:
- a bug is present in the source code of the chip
- an analog circuit on the chip does not work or does not meet the specification
- timing problems
- power distribution problems
- signal integrity problems; these are unwanted interactions between different signals that make the chip malfunction
With the scaling of the technology more things are put on a chip, along with features like dynamic frequency and voltage scaling, multiple clocks, multiple voltage domains, etc. The complexity of designing a chip increases with each of these, and almost exponentially with each technology node. The next paragraphs detail the decisions that have been made to mitigate these problems for the Retro-uC chip.
To minimize the risk of having a killer bug in the RTL source code of the Retro-uC, well-tested code was used for the three cores. The MOS 6502 and the Z80 cores were used in emulators for the Commodore 64 and the ZX Spectrum. The Motorola 68000 core has been used to boot Linux. The JTAG interface has been newly written but was first unit tested using the cocotb tool and afterwards tested on FPGA. The rest of the glue logic has been kept relatively simple and is also tested by simulation and on FPGA. Implementing new features has only minimal impact on the final cost, as that is mainly determined by startup costs, so it is tempting to let feature creep come in. But every new feature needs development time and increases the risk of introducing a killer bug. Therefore it was decided to develop the Retro-uC in an MVP (minimum viable product) fashion, although this may disappoint people who really want their own favorite feature implemented.
For analog blocks it is common to have at least one iteration on the design. Each type of circuit can have different architectures implementing the function. Each architecture has its own peculiarities, and unlike for digital applications there are no (fully) automated tools to check whether a design fulfills its specification. The designer needs to do the design of the circuit and also write the test benches with which to check its functionality. So to have a circuit performing according to specification when it is delivered from the fab, both the simulation models provided by the silicon fab and the test benches developed by the designer have to cover the whole functionality as well as the peculiar requirements of the chosen architecture. As a consequence one or more iterations are often needed to get the circuit right. For the Retro-uC it was decided not to include analog functionality such as an ADC (analog to digital converter). The design itself would take considerable time and delay the project by several months, and the startup costs would also need to be increased to allow for test chips of the analog blocks. In addition, the information needed to design open source analog blocks without an NDA is not in place at the moment, which is another reason not to include analog blocks in the pilot Retro-uC project. In the mid-term it is planned to also include the possibility of open source analog design in the Chips4Makers project, but this will be developed in parallel with the digital part and will take some time.
Most digital chips are clocked; the registers on the chip typically renew their stored value at the start of each clock cycle. Timing problems in the digital part of a chip arise because signals take a certain time to propagate through the chip. Transistors can only provide a certain current and this limits the speed with which signals can travel between the transistors. For a clocked register to work well it has requirements on the timing of the signals on its input. The signal has to arrive a certain time before the next cycle starts, and after the start of the cycle it has to keep its value for a certain time before it may change to the next state. The former is called a setup constraint and the latter a hold constraint. A setup violation is not a killer problem if you are allowed to reduce the clock frequency to give the signal enough time to propagate. For the Retro-uC the maximum frequency will be determined when the chip comes back from the foundry, as a refinement of the maximum frequency range determined during the design phase. A hold violation is a killer problem though; if one is present you have to fix your design and order a new chip. But for the chosen 0.35 micron technology this will not pose a problem as long as you don't do crazy things on the chip. The delay of the register is big enough; even if you connect the output of one register directly to the input of another register you won't violate the hold requirements.
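To make the difference between the two constraints concrete, here is a minimal sketch of the slack calculation for a single register-to-register path. It ignores clock skew, and all names and delay values are made up for illustration; they are not Retro-uC or foundry numbers.

```python
from dataclasses import dataclass


@dataclass
class RegisterPath:
    clk_to_q_ns: float   # delay from clock edge to output of the launching register
    logic_min_ns: float  # fastest route through the logic between the registers
    logic_max_ns: float  # slowest route through that logic
    setup_ns: float      # setup requirement of the capturing register
    hold_ns: float       # hold requirement of the capturing register


def setup_slack(path, period_ns):
    """Positive when the slowest signal arrives early enough before the next edge."""
    return period_ns - (path.clk_to_q_ns + path.logic_max_ns + path.setup_ns)


def hold_slack(path):
    """Positive when the fastest signal changes late enough after the edge.
    Note that the clock period does not appear here at all."""
    return path.clk_to_q_ns + path.logic_min_ns - path.hold_ns


def min_period(path):
    """Shortest clock period that still meets the setup constraint."""
    return path.clk_to_q_ns + path.logic_max_ns + path.setup_ns


if __name__ == "__main__":
    # Hypothetical path; logic_min_ns=0 models a direct register-to-register wire.
    path = RegisterPath(clk_to_q_ns=1.0, logic_min_ns=0.0,
                        logic_max_ns=30.0, setup_ns=0.5, hold_ns=0.25)
    for period in (25.0, 40.0):
        print(period, setup_slack(path, period), hold_slack(path))
    print("minimum period:", min_period(path))
```

Slowing the clock down only changes the setup slack. The hold slack stays the same whatever the period, which is why a hold violation cannot be fixed after production, and why a register delay larger than the hold requirement is enough to be safe even with no logic in between.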
On a chip there can be more complicated timing constraints. For certain protocols there may be a requirement that different signals arrive with only a limited difference in delay; sometimes a circuit may also depend on signals arriving in a certain order in time and thus depend on the propagation delay of each of the elements in the signal paths. These kinds of timing problems can also be the reason that designs that work on an FPGA don't work when implemented in an ASIC. Even when the special timing requirements were not specified, it can be that by luck and due to the architecture of the FPGA the timing requirements are never violated during FPGA testing. When the design is then implemented in an ASIC with a more fine-grained architecture, this unspecified timing problem may be triggered and cause a malfunction of the chip. These problems are avoided on the Retro-uC by not using RTL code that depends on such requirements; all timing is defined on signals arriving at registers relative to the clock, and the tools I selected to design and verify the chip handle these requirements well.
Another possible source of timing problems is having more than one clock on a chip, with signals that need to go from one clock domain to another. The problem arises because the clocks are independent of each other; the start of a clock cycle in one domain can happen at any time relative to the clock cycles in the other, and the frequencies of the clocks may even be different. This can lead to the so-called meta-stability problem. When a signal is changing value it has a voltage in between the two digital values for a certain time. During that time, the signal can be seen as different logic values at different places, leading to wrong logic computation. Similarly, when a bus of signals goes between the clock domains, some of the signals may be seen in the old state and some in the new state by the other clock domain.
In the Retro-uC this problem is avoided as much as possible by having only one main clock. The only place where a second clock could not be avoided is the JTAG interface, which is defined to carry its own clock signal. The risk has been minimized here by using a handshake protocol to transfer data between the JTAG interface and the rest of the design. A single signal is used to indicate that data is ready, and the actual data is only read in the next clock cycle so that it is in a stable state. Then an acknowledge signal is given from the receiving side to the sender so the latter knows the data is no longer needed and the cycle can be ended. Next to the normal digital verification, this mechanism will also be verified before tape-out with transistor-level simulations, known as SPICE simulations by people in the field.
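The handshake described above can be sketched as a small behavioral model in Python. This is not the Retro-uC RTL; the class names, signal names and clock periods are hypothetical, and a software model like this cannot show meta-stability itself. It only illustrates the ordering of ready, data read and acknowledge between two independently ticking sides.

```python
class Link:
    """Wires shared between the two clock domains."""

    def __init__(self):
        self.data = None
        self.ready = False  # driven by the sender
        self.ack = False    # driven by the receiver


class Sender:
    def __init__(self, link, words):
        self.link = link
        self.pending = list(words)

    def tick(self):
        link = self.link
        if link.ready and link.ack:
            link.ready = False  # receiver has the data, end the cycle
        elif not link.ready and not link.ack and self.pending:
            link.data = self.pending.pop(0)  # data stays stable while ready is high
            link.ready = True


class Receiver:
    def __init__(self, link):
        self.link = link
        self.seen_ready = False
        self.received = []

    def tick(self):
        link = self.link
        if link.ready and not link.ack:
            if self.seen_ready:
                # ready was already seen one cycle earlier, so data is stable now
                self.received.append(link.data)
                link.ack = True
            else:
                self.seen_ready = True
        elif not link.ready:
            self.seen_ready = False
            link.ack = False  # sender ended the cycle, release the acknowledge


def transfer(words, sender_period=3, receiver_period=2, max_time=10_000):
    """Run both sides on unrelated clock periods and return what arrived."""
    link = Link()
    sender, receiver = Sender(link, words), Receiver(link)
    for t in range(max_time):
        if t % sender_period == 0:
            sender.tick()
        if t % receiver_period == 0:
            receiver.tick()
        done = len(receiver.received) == len(words)
        if done and not link.ready and not link.ack:
            break
    return receiver.received


if __name__ == "__main__":
    print(transfer([0x12, 0x34, 0x56]))
```

Only the single-bit `ready` and `ack` signals cross the domain boundary at an unpredictable moment; the multi-bit data is read one receiver cycle after `ready` was first seen, when it is no longer changing.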
For both remaining problems, power distribution and signal integrity, the complexity increases almost exponentially with scaling. The main problem for power distribution is the voltage drop one gets over a wire with a certain resistance and a current going through it. Typically the voltage is scaled down with technology scaling while more functionality is put on a chip. When lowering the voltage the current has to increase in order to deliver the same power. With the same resistance of the power distribution network this results in a higher voltage drop. As the supply voltage was already lower, this is a double effect in terms relative to the supply voltage. With technology scaling the devices also become smaller and so the power density on the chip increases. This is an additional factor making the design of the power distribution network more difficult. With the increasing frequency at which chips are run, even the inductance of the power distribution network can become important. This is the phenomenon that high current peaks take a certain amount of time to build up, and the distribution network may need to be designed to avoid problems with it.
Next to the power distribution, the integrity of the signals has to be guaranteed. The integrity may be broken because signals next to each other can influence each other. With scaling the signals are put closer together, many more things are happening at the same time and the signals change faster; all of this makes the problem (much) more complex. For the 0.35um technology chosen for the Retro-uC the power density on the chip is relatively low and the chip runs at a moderate frequency, so both power distribution and signal integrity can be checked with a simple extraction of the power consumption, the resistance of the distribution network and the capacitance on some critical (asynchronous) signals.
If one were to go for 0.18um, where the same functionality is put on a quarter of the area and runs at a higher frequency, more automated tools would be needed to minimize the risk, mainly for power distribution; when going even smaller, signal integrity also has to be verified carefully.
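The 'double effect' of a lower supply voltage on the voltage drop can be put into numbers with a back-of-the-envelope sketch. It treats the whole power distribution network as one lumped resistance, and the power, voltage and resistance values are made up for illustration rather than taken from the Retro-uC design.

```python
def relative_ir_drop(power_w, supply_v, grid_resistance_ohm):
    """Voltage drop over the power network as a fraction of the supply voltage."""
    current_a = power_w / supply_v              # same power at lower voltage needs more current
    drop_v = current_a * grid_resistance_ohm    # Ohm's law over the distribution network
    return drop_v / supply_v                    # relative to an already lower supply


if __name__ == "__main__":
    # Hypothetical chip: 0.1 W through a 1 ohm network, at full and at half supply voltage.
    for supply in (3.3, 1.65):
        fraction = relative_ir_drop(power_w=0.1, supply_v=supply, grid_resistance_ohm=1.0)
        print(f"{supply} V supply: drop is {fraction:.2%} of the supply")
```

For the same power and resistance the relative drop scales with one over the supply voltage squared, so halving the voltage makes it four times worse, before even counting the higher power density.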
In this blog post I have explained how the first-time-right aspect of chip development has been considered from the beginning of this project and how everything has been done to minimize the risk of having a non-functional product. Although this risk has been minimized it is not 0%, and this was therefore also mentioned in the 'Risks & Challenges' chapter of the campaign. I consider the risk certainly not higher than that of the average crowdfunding project.