New and updated IP cores for Caches. Related updates in other IP cores. Minor fix in toolchain.
mzabeltud committed Nov 22, 2016
1 parent c90631f commit 33e80e6
Showing 32 changed files with 3,401 additions and 234 deletions.
12 changes: 12 additions & 0 deletions docs/ChangeLog/2016/v1.x.rst
@@ -28,10 +28,22 @@ Already documented changes are available on the ``release`` branch at GitHub.
* New Entities

* :ref:`IP:ocram_sdp_wf`
* :ref:`IP:cache_par2`
* :ref:`IP:cache_cpu`
* :ref:`IP:cache_mem`

* Updated Entities

* Interface of :ref:`IP:cache_tagunit_par` changed slightly.
* New port "write-mask" in :ref:`IP:ddr3_mem2mig_adapter_Series7`.
* New port "write-mask" in :ref:`IP:ddr2_mem2mig_adapter_Spartan6`.

* New Testbenches

* Testbench for :ref:`IP:ocram_sdp_wf`
* Testbench for :ref:`IP:cache_par2`
* Testbench for :ref:`IP:cache_cpu`
* Testbench for :ref:`IP:cache_mem`

* New Constraints
* Shipped Tool and Helper Scripts
177 changes: 177 additions & 0 deletions docs/IPCores/cache/cache_cpu.rst
@@ -0,0 +1,177 @@
.. only:: html

.. |gh-src| image:: /_static/logos/GitHub-Mark-32px.png
:scale: 40
:target: https://github.com/VLSI-EDA/PoC/blob/master/src/cache/cache_cpu.vhdl
:alt: Source Code on GitHub
.. |gh-tb| image:: /_static/logos/GitHub-Mark-32px.png
:scale: 40
:target: https://github.com/VLSI-EDA/PoC/blob/master/tb/cache/cache_cpu_tb.vhdl
:alt: Source Code on GitHub

.. sidebar:: GitHub Links

* |gh-src| :pocsrc:`Sourcecode <cache/cache_cpu.vhdl>`
* |gh-tb| :poctb:`Testbench <cache/cache_cpu_tb.vhdl>`


.. _IP:cache_cpu:

cache_cpu
#########

This unit provides a cache (:ref:`IP:cache_par2`) together
with a cache controller which reads / writes cache lines from / to memory.
The memory is accessed via the :ref:`INT:PoC.Mem` interface; the related
ports and parameters are prefixed with ``mem_``.

The CPU side (prefix ``cpu_``) has a modified PoC.Mem interface, so that
this unit can be easily integrated into processor pipelines. For example,
consider a pipeline where a load/store instruction is executed in 3
stages (after fetching, decoding, ...):

1. Execute (EX) for address calculation,
2. Load/Store 1 (LS1) for the cache access,
3. Load/Store 2 (LS2) where the cache returns the read data.

The read data is always returned one cycle after the cache access completes,
so there is conceptually a pipeline register within this unit. The stage LS2
can be merged with a write-back stage if the clock period allows it.

Stage LS1, and thus EX and LS2, must stall until the cache access has
completed, i.e., the EX/LS1 pipeline register must hold the cache request
until it is acknowledged by the cache. The acknowledge is signaled by
``cpu_got`` as described in the Operation section below. The pipeline moves
forward (is enabled) when::

pipeline_enable <= (not cpu_req) or cpu_got;

If the pipeline can stall for other reasons as well, care must be taken not
to execute the cache access twice or to miss the read data.
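
For illustration, a minimal sketch of such an EX/LS1 pipeline register under
this enable (the clock name ``clk`` and the ``ex_*`` stage signals are
hypothetical)::

   -- EX/LS1 pipeline register: loads a new request from the EX stage when
   -- the pipeline is enabled; otherwise it holds the current request until
   -- the cache acknowledges it via cpu_got.
   process(clk)
   begin
     if rising_edge(clk) then
       if pipeline_enable = '1' then
         cpu_req   <= ex_req;     -- ex_* are hypothetical EX-stage outputs
         cpu_write <= ex_write;
         cpu_addr  <= ex_addr;
         cpu_wdata <= ex_wdata;
         cpu_wmask <= ex_wmask;
       end if;
     end if;
   end process;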

Of course, the EX/LS1 pipeline register can be omitted and the CPU side fed
directly by the address calculator. But be aware of the high setup time of
this unit and the high propagation delay of ``cpu_got``.

This unit supports only one outstanding CPU request. Multiple outstanding
requests are supported by :ref:`IP:cache_mem`.


Configuration
*************

+--------------------+-----------------------------------------------------+
| Parameter | Description |
+====================+=====================================================+
| REPLACEMENT_POLICY | Replacement policy of embedded cache. For supported |
| | values see PoC.cache_replacement_policy. |
+--------------------+-----------------------------------------------------+
| CACHE_LINES | Number of cache lines. |
+--------------------+-----------------------------------------------------+
| ASSOCIATIVITY | Associativity of embedded cache. |
+--------------------+-----------------------------------------------------+
| CPU_ADDR_BITS | Number of address bits on the CPU side. Each address|
| | identifies one memory word as seen from the CPU. |
| | Calculated from other parameters as described below.|
+--------------------+-----------------------------------------------------+
| CPU_DATA_BITS | Width of the data bus (in bits) on the CPU side. |
| | CPU_DATA_BITS must be divisible by 8. |
+--------------------+-----------------------------------------------------+
| MEM_ADDR_BITS | Number of address bits on the memory side. Each |
| | address identifies one word in the memory. |
+--------------------+-----------------------------------------------------+
| MEM_DATA_BITS | Width of a memory word and of a cache line in bits. |
| | MEM_DATA_BITS must be divisible by CPU_DATA_BITS. |
+--------------------+-----------------------------------------------------+

If the CPU data-bus width is smaller than the memory data-bus width, then
the CPU needs additional address bits to identify one CPU data word inside a
memory word. Thus, the CPU address-bus width is calculated as::

   CPU_ADDR_BITS = log2ceil(MEM_DATA_BITS / CPU_DATA_BITS) + MEM_ADDR_BITS

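For example, with ``MEM_DATA_BITS = 128``, ``CPU_DATA_BITS = 32`` and
``MEM_ADDR_BITS = 24`` (values chosen for illustration only)::

   CPU_ADDR_BITS = log2ceil(128/32) + 24 = 2 + 24 = 26
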
The write policy is: write-through, no-write-allocate.


Operation
*********

Alignment of Cache / Memory Accesses
++++++++++++++++++++++++++++++++++++

Memory accesses are always aligned to a word boundary. Each memory word
(and each cache line) consists of MEM_DATA_BITS bits.
For example, if MEM_DATA_BITS=128:

* memory address 0 selects the bits 0..127 in memory,
* memory address 1 selects the bits 128..255 in memory, and so on.

Cache accesses are always aligned to a CPU word boundary. Each CPU word
consists of CPU_DATA_BITS bits. For example if CPU_DATA_BITS=32:

* CPU address 0 selects the bits 0.. 31 in memory word 0,
* CPU address 1 selects the bits 32.. 63 in memory word 0,
* CPU address 2 selects the bits 64.. 95 in memory word 0,
* CPU address 3 selects the bits 96..127 in memory word 0,
* CPU address 4 selects the bits 0.. 31 in memory word 1,
* CPU address 5 selects the bits 32.. 63 in memory word 1, and so on.
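
In other words, the CPU address is the memory-word address with the word
offset appended. A sketch of this decomposition (signal names are
hypothetical; ``log2ceil`` is assumed from PoC.utils)::

   constant OFFSET_BITS : natural := log2ceil(MEM_DATA_BITS / CPU_DATA_BITS);

   -- upper address bits select the memory word (and thus the cache line)
   mem_word_addr <= cpu_addr(CPU_ADDR_BITS - 1 downto OFFSET_BITS);
   -- lower address bits select the CPU word within that memory word
   word_offset   <= cpu_addr(OFFSET_BITS - 1 downto 0);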


Shared and Memory Side Interface
++++++++++++++++++++++++++++++++

A synchronous reset must be applied even on an FPGA.

The memory side interface is documented in detail :ref:`here <INT:PoC.Mem>`.


CPU Side Interface
++++++++++++++++++

The CPU (pipeline stage LS1, see above) issues a request by setting
``cpu_req``, ``cpu_write``, ``cpu_addr``, ``cpu_wdata`` and ``cpu_wmask`` as
in the :ref:`INT:PoC.Mem` interface. The cache acknowledges the request by
setting ``cpu_got`` to '1'. If the request is not acknowledged (``cpu_got =
'0'``) in the current clock cycle, then the request must be repeated in the
following clock cycle(s) until it is acknowledged, i.e., the pipeline must
stall.

A cache access is completed when it is acknowledged. A new request can be
issued in the following clock cycle.

Of course, ``cpu_got`` may be asserted in the same clock cycle where the
request was issued if a read hit occurs. This allows a throughput of one
(read) request per clock cycle, but the drawback is that ``cpu_got`` has a
high propagation delay. Thus, this output should only control a simple
pipeline enable logic.

When ``cpu_got`` is asserted for a read access, then the read data will be
available in the following clock cycle.
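
For example, a minimal sketch of capturing the read data into the LS2 stage
register (``rd_ack_r``, ``ls2_rdata`` and the read-data port name
``cpu_rdata`` are hypothetical)::

   -- rd_ack_r flags that a read access was acknowledged in the previous
   -- cycle, i.e., the read data is valid in the current cycle.
   process(clk)
   begin
     if rising_edge(clk) then
       rd_ack_r <= cpu_req and (not cpu_write) and cpu_got;
       if rd_ack_r = '1' then
         ls2_rdata <= cpu_rdata;   -- capture into the LS2 pipeline register
       end if;
     end if;
   end process;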

Due to the write-through policy, a write always takes several clock
cycles and is acknowledged only when the data has been issued to the memory.

.. WARNING::

If the design is synthesized with Xilinx ISE / XST, then the synthesis
option "Keep Hierarchy" must be set to SOFT or TRUE.



.. rubric:: Entity Declaration:

.. literalinclude:: ../../../src/cache/cache_cpu.vhdl
:language: vhdl
:tab-width: 2
:linenos:
:lines: 175-207

.. seealso::

:ref:`IP:cache_mem`



.. only:: latex

Source file: :pocsrc:`cache/cache_cpu.vhdl <cache/cache_cpu.vhdl>`
137 changes: 137 additions & 0 deletions docs/IPCores/cache/cache_mem.rst
@@ -0,0 +1,137 @@
.. only:: html

.. |gh-src| image:: /_static/logos/GitHub-Mark-32px.png
:scale: 40
:target: https://github.com/VLSI-EDA/PoC/blob/master/src/cache/cache_mem.vhdl
:alt: Source Code on GitHub
.. |gh-tb| image:: /_static/logos/GitHub-Mark-32px.png
:scale: 40
:target: https://github.com/VLSI-EDA/PoC/blob/master/tb/cache/cache_mem_tb.vhdl
:alt: Source Code on GitHub

.. sidebar:: GitHub Links

* |gh-src| :pocsrc:`Sourcecode <cache/cache_mem.vhdl>`
* |gh-tb| :poctb:`Testbench <cache/cache_mem_tb.vhdl>`


.. _IP:cache_mem:

cache_mem
#########

This unit provides a cache (:ref:`IP:cache_par2`) together
with a cache controller which reads / writes cache lines from / to memory.
It has two :ref:`INT:PoC.Mem` interfaces:

* one for the "CPU" side (ports with prefix ``cpu_``), and
* one for the memory side (ports with prefix ``mem_``).

Thus, this unit can be placed into an already existing memory path between
the CPU and the memory (controller). If you want to plug a cache into a
CPU pipeline, see :ref:`IP:cache_cpu`.
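
A sketch of such a placement (the generic values are illustrative; the
instantiation path ``PoC.cache_mem`` and the elided port names are
assumptions, see the entity declaration below)::

   cache : entity PoC.cache_mem        -- library path assumed
     generic map (
       REPLACEMENT_POLICY => "LRU",    -- see PoC.cache_replacement_policy
       CACHE_LINES        => 64,
       ASSOCIATIVITY      => 2,
       CPU_DATA_BITS      => 32,
       MEM_ADDR_BITS      => 24,
       MEM_DATA_BITS      => 128,      -- CPU_ADDR_BITS is derived from these
       OUTSTANDING_REQ    => 2         -- buffered by fifo_glue
     )
     port map (
       -- cpu_* ports: the signals that formerly drove the memory
       -- controller; mem_* ports: the memory controller itself.
       -- Port names are omitted here, see the entity declaration below.
       ...
     );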


Configuration
*************

+--------------------+-----------------------------------------------------+
| Parameter | Description |
+====================+=====================================================+
| REPLACEMENT_POLICY | Replacement policy of embedded cache. For supported |
| | values see PoC.cache_replacement_policy. |
+--------------------+-----------------------------------------------------+
| CACHE_LINES | Number of cache lines. |
+--------------------+-----------------------------------------------------+
| ASSOCIATIVITY | Associativity of embedded cache. |
+--------------------+-----------------------------------------------------+
| CPU_ADDR_BITS | Number of address bits on the CPU side. Each address|
| | identifies one memory word as seen from the CPU. |
| | Calculated from other parameters as described below.|
+--------------------+-----------------------------------------------------+
| CPU_DATA_BITS | Width of the data bus (in bits) on the CPU side. |
| | CPU_DATA_BITS must be divisible by 8. |
+--------------------+-----------------------------------------------------+
| MEM_ADDR_BITS | Number of address bits on the memory side. Each |
| | address identifies one word in the memory. |
+--------------------+-----------------------------------------------------+
| MEM_DATA_BITS | Width of a memory word and of a cache line in bits. |
| | MEM_DATA_BITS must be divisible by CPU_DATA_BITS. |
+--------------------+-----------------------------------------------------+
| OUTSTANDING_REQ    | Number of outstanding requests, see notes below.    |
+--------------------+-----------------------------------------------------+

If the CPU data-bus width is smaller than the memory data-bus width, then
the CPU needs additional address bits to identify one CPU data word inside a
memory word. Thus, the CPU address-bus width is calculated as::

   CPU_ADDR_BITS = log2ceil(MEM_DATA_BITS / CPU_DATA_BITS) + MEM_ADDR_BITS

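For example, with ``MEM_DATA_BITS = 256``, ``CPU_DATA_BITS = 32`` and
``MEM_ADDR_BITS = 20`` (illustrative values)::

   CPU_ADDR_BITS = log2ceil(256/32) + 20 = 3 + 20 = 23
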
The write policy is: write-through, no-write-allocate.

The maximum throughput is one request per clock cycle, except for
``OUTSTANDING_REQ = 1``.

If ``OUTSTANDING_REQ`` is:

* 1: then 1 request is buffered by a single register. To keep the critical
  path (clock-to-output delay) of ``cpu_rdy`` short, the throughput is
  limited to at most one request every 2 clock cycles.

* 2: then 2 requests are buffered by :ref:`IP:fifo_glue`. This setting has
the lowest area requirements without degrading the performance.

* >2: then the requests are buffered by :ref:`IP:fifo_cc_got`. The number of
outstanding requests is rounded up to the next suitable value. This setting
is useful in applications with out-of-order execution (of other
operations). The CPU requests to the cache are always processed in-order.


Operation
*********

Memory accesses are always aligned to a word boundary. Each memory word
(and each cache line) consists of MEM_DATA_BITS bits.
For example, if MEM_DATA_BITS=128:

* memory address 0 selects the bits 0..127 in memory,
* memory address 1 selects the bits 128..255 in memory, and so on.

Cache accesses are always aligned to a CPU word boundary. Each CPU word
consists of CPU_DATA_BITS bits. For example if CPU_DATA_BITS=32:

* CPU address 0 selects the bits 0.. 31 in memory word 0,
* CPU address 1 selects the bits 32.. 63 in memory word 0,
* CPU address 2 selects the bits 64.. 95 in memory word 0,
* CPU address 3 selects the bits 96..127 in memory word 0,
* CPU address 4 selects the bits 0.. 31 in memory word 1,
* CPU address 5 selects the bits 32.. 63 in memory word 1, and so on.

A synchronous reset must be applied even on an FPGA.

The interface is documented in detail :ref:`here <INT:PoC.Mem>`.

.. WARNING::

If the design is synthesized with Xilinx ISE / XST, then the synthesis
option "Keep Hierarchy" must be set to SOFT or TRUE.



.. rubric:: Entity Declaration:

.. literalinclude:: ../../../src/cache/cache_mem.vhdl
:language: vhdl
:tab-width: 2
:linenos:
:lines: 135-169

.. seealso::

:ref:`IP:cache_cpu`



.. only:: latex

Source file: :pocsrc:`cache/cache_mem.vhdl <cache/cache_mem.vhdl>`
17 changes: 16 additions & 1 deletion docs/IPCores/cache/cache_par.rst
@@ -20,6 +20,16 @@
cache_par
#########

Implements a cache with parallel tag-unit and data memory.

.. NOTE::
   This component infers a single-port memory with read-first behavior, that
   is, upon a write, the old data is returned on the read output. Such memory
   (e.g., LUT-RAM) is not available on all devices. Thus, synthesis may infer
   a large number of flip-flops plus multiplexers instead, which is very
   inefficient. It is recommended to use :doc:`PoC.cache.par2 <cache_par2>`
   instead, which has a slightly different interface.

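For illustration, read-first behavior corresponds to the following generic
single-port RAM template (a sketch with hypothetical signal names, not this
component's actual code)::

   -- Generic read-first single-port RAM: on a write, the read output
   -- returns the OLD content of the addressed word.
   process(clock)
   begin
     if rising_edge(clock) then
       if we = '1' then
         ram(to_integer(unsigned(addr))) <= din;
       end if;
       dout <= ram(to_integer(unsigned(addr)));  -- old data during a write
     end if;
   end process;
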
All inputs are synchronous to the rising-edge of the clock `clock`.

**Command truth table:**
@@ -57,6 +67,11 @@ Upon replacing a cache line, the new content is given by ``CacheLineIn``. The
old content is output on ``CacheLineOut`` and the old tag on ``OldAddress``,
both with a latency of one clock cycle.

.. WARNING::

If the design is synthesized with Xilinx ISE / XST, then the synthesis
option "Keep Hierarchy" must be set to SOFT or TRUE.



.. rubric:: Entity Declaration:
@@ -65,7 +80,7 @@ both with a latency of one clock cycle.
:language: vhdl
:tab-width: 2
:linenos:
:lines: 76-100
:lines: 91-115


