Conclusions

by Rachel Rosen last modified Nov 16, 2016

Table of Contents
_________________

1 Conclusions
.. 1.1 Software improvements for more efficient sub-arrays
.. 1.2 Ideas for possible improvements
..... 1.2.1 Timing Event warning and DLL out of lock monitoring
..... 1.2.2 Importance and frequency of Single Event Upsets
..... 1.2.3 Revive ICT-232 towards a reliable protocol
..... 1.2.4 Detect as early as possible that a correlator configuration is bad
..... 1.2.5 Inter sub-scan duration limited by partial dump data spreading
..... 1.2.6 Avoid excessive CAN-bus payload whenever possible
.. 1.3 Ideas for possible longer term improvements
..... 1.3.1 CCC finer commanding granularity
..... 1.3.2 CCC Maintenance component should be locked out during observations
..... 1.3.3 Increase data rate peak a few factors above its current limit
..... 1.3.4 Reuse of configuration and state slots at run-time
..... 1.3.5 A 5th quadrant for HIL simulation testing
..... 1.3.6 Join efforts for a 5th quadrant
..... 1.3.7 Compute coarse delays in SC code
..... 1.3.8 DGCK encode coarse delay value changes
..... 1.3.9 Correlator component state machine not used
..... 1.3.10 printf usage in CCC
.. 1.4 Documentation improvements
..... 1.4.1 CAN-bus command execution timing summary
..... 1.4.2 3x3 TDM mode is a single polarization mode
.. 1.5 Pending clarifications
..... 1.5.1 CDP nodes missing workload profiling
..... 1.5.2 How long does a stop-sequence command take to complete?
..... 1.5.3 Return-to-phase reset time?
..... 1.5.4 There is evidence that the CCC is sometimes applying delay too late


1 Conclusions
=============

1.1 Software improvements for more efficient sub-arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  A face-to-face discussion on ICT-5111 identified a series of feasible
  improvements. These discussions are useful precisely because they
  bring different points of view together in the same room at the same
  time.

  It is expected that the aggregation of all those improvements would
  produce a much more 'responsive' system when different observation
  types execute in different sub-arrays.

  A prioritized list of items (high to low) is presented below. Each of
  these items will become an individual Jira ticket, and ICT-5111 will
  be closed as a consequence.

  1. CCC to parallelize all instances where the same CAN-bus command is
     issued through independent CAN-bus channels. That is to say, extend
     the CCC feature implemented with ICT-2539 to all possible commands
     targeted to many base-bands at the same time.
  2. Firmware change to send X and Y polarization delays with just one
     command, saving 50% of the current delay-related CAN-bus
     traffic. Protocol options in firmware to be discussed before
     deciding how to proceed in software.
  3. Stop re-uploading scaling factors when the same values are already
     in hardware. Even if a TDM sub-scan is interleaved between
     sub-scans of the same FDM mode, the current scaling factor values
     are expected to take effect automatically the next time the FDM
     sub-scan executes. This firmware feature has never been tested, so
     some verification is required.
  4. CCC sub-scan coalescence. If a sequence repeats the same sub-scan
     many times then the CCC should treat it as a single sub-scan,
     leaving the execution of the individual sub-scans to the CDP. In
     this case there would be no CAN-bus workload, apart from delay
     updates, during the whole sequence execution. Sub-scan started and
     ended callback invocations are still expected to happen, so some
     specific CCC changes are needed beyond suppressing CAN-bus traffic
     during the sequence. Note that the same FDM sub-scan also means the
     same frequency, which is why return-to-phase activity is avoided
     as well.
  5. Compute bulk delay in CCC instead of SCC microprocessor. It should
     be possible to avoid CAN-bus traffic due to return-to-phase actions
     by letting the CCC choose and set the bulk delay applied in Station
     Cards. Before starting a sequence the CCC could make an educated
     guess for a single bulk-delay value per antenna and leave it
     constant throughout that sequence. This implies that in FDM mode
     there should be no need to return-to-phase between sub-scans if
     the digital LO frequency does not change, similar to the previous
     item.
  6. Reusing previously generated states might be useful. The maximum
     number of state slots is 16, so the applicability of this
     improvement is restricted to some very specific conditions. For
     example, if just two sub-arrays are executing and each one is aware
     of only two different correlator configurations, then both
     sub-arrays could execute by just applying a previously generated
     state permutation (no new generate needed). Allowing for more than
     16 slots is possible in firmware, but it is not trivial to
     implement. Alejandro pointed out that LTA::Apply cannot receive a
     start/continue flag per sub-array, which is why 16 slots is
     somewhat few for this specific use-case.
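
  The coalescence idea in item 4 can be illustrated with a minimal
  sketch. The `SubScan' tuple, its fields and the example values are
  hypothetical stand-ins, not actual CCC types; the sketch only shows
  that runs of identical consecutive sub-scans collapse into one
  configured unit plus a repeat count.

```python
from itertools import groupby
from typing import NamedTuple


class SubScan(NamedTuple):
    """Hypothetical sub-scan description; field names are assumptions."""
    mode: str            # e.g. "FDM" or "TDM"
    lo_freq_hz: float    # digital LO frequency (illustrative values below)
    duration_s: float


def coalesce(sequence):
    """Collapse runs of identical consecutive sub-scans.

    Returns a list of (sub_scan, repeat_count) pairs. The hardware would be
    configured once per pair, while started/ended callbacks would still be
    issued repeat_count times.
    """
    return [(sub_scan, len(list(run))) for sub_scan, run in groupby(sequence)]


fdm = SubScan("FDM", 2.0e9, 5.0)
tdm = SubScan("TDM", 0.0, 1.0)
for sub_scan, count in coalesce([fdm, fdm, fdm, tdm, fdm]):
    print(f"configure {sub_scan.mode} once, run it {count} time(s)")
```

  Only consecutive repeats are merged, so an FDM sub-scan that follows
  a TDM one starts a new group, in line with the interleaving caveat of
  item 3.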


1.2 Ideas for possible improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.2.1 Timing Event warning and DLL out of lock monitoring
---------------------------------------------------------

  CCC to reenable monitoring of both items.


1.2.2 Importance and frequency of Single Event Upsets
-----------------------------------------------------

  CCC should monitor FPGAs that provide such information and report the
  value to the monitoring database. FPGA personalities should be
  reloaded whenever an event has occurred, probably at the next
  sub-array creation. Investigate whether all involved FPGAs could be
  addressed while other sub-arrays are executing; it is assumed that
  there is no need for a global FSR to reload one FPGA in the system.


1.2.3 Revive ICT-232 towards a reliable protocol
------------------------------------------------

  It is currently not possible to determine whether a CAN-bus broadcast
  command has completed successfully on all involved CAN-bus
  nodes. Delay updates are an interesting example because of their
  relevance throughout the data path.
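
  One way a reliable protocol could make completion observable is a
  per-node acknowledgement. The sketch below is an assumption for
  discussion only, not the ICT-232 design; node ids and the function
  name are invented for illustration.

```python
def unacknowledged(expected_nodes, acks):
    """Return the sorted node ids that did not acknowledge a broadcast.

    expected_nodes: iterable of node ids targeted by the broadcast.
    acks: iterable of node ids from which an acknowledgement arrived
          (duplicates and unexpected nodes are ignored).
    """
    return sorted(set(expected_nodes) - set(acks))


# Illustrative values only: a delay update broadcast to four nodes,
# one of which never answers.
targets = [0x10, 0x11, 0x12, 0x13]
received = [0x10, 0x12, 0x13, 0x12]
missing = unacknowledged(targets, received)
if missing:
    print("broadcast incomplete, missing:", [hex(n) for n in missing])
```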


1.2.4 Detect as early as possible that a correlator configuration is bad
------------------------------------------------------------------------

  Neil: if control could guarantee calling loadConfiguration as early
  as possible, then some time could be saved when one of those
  configurations is invalid (e.g. too high data rates).
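
  A minimal sketch of the idea follows, assuming that an estimated data
  rate is available for each configuration before the observation
  starts. The function and configuration names are hypothetical; the 60
  MB/s figure is the peak limit quoted elsewhere in these notes.

```python
MAX_DATA_RATE_MB_S = 60.0  # peak limit quoted elsewhere in these notes


def find_invalid(configurations, max_rate_mb_s=MAX_DATA_RATE_MB_S):
    """Return (name, reason) pairs for every configuration that would fail.

    configurations: mapping of name -> estimated data rate in MB/s.
    How the estimate is obtained is outside the scope of this sketch.
    """
    return [
        (name, f"data rate {rate:.1f} MB/s exceeds {max_rate_mb_s:.1f} MB/s")
        for name, rate in configurations.items()
        if rate > max_rate_mb_s
    ]


# All configurations are checked before the first sub-scan starts, rather
# than discovering the bad one midway through the observation.
problems = find_invalid({"cfg_a": 12.5, "cfg_b": 75.0, "cfg_c": 60.0})
for name, reason in problems:
    print(f"{name}: {reason}")
```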


1.2.5 Inter sub-scan duration limited by partial dump data spreading
--------------------------------------------------------------------

  The actual 'spreading' of partial dumps through a dump duration
  interval imposes a limit on the shortest time between sub-scans in
  the same sub-array. Doing anything different in software would
  introduce other interdependencies between sub-arrays (exceeding the
  aggregated data rate) that are difficult to predict in advance. For
  the moment, we stick to the fact that it takes at least one dump
  duration to complete a sub-scan before the next one can run in the
  same sub-array.


1.2.6 Avoid excessive CAN-bus payload whenever possible
-------------------------------------------------------

  Pete could investigate those CAN-bus protocols that transmit more data
  than needed and discuss with Alejandro and Rich ways to simplify
  them. The typical case is sending scaling factors for all antennas
  when those for the antennas in the sub-array would suffice.


1.3 Ideas for possible longer term improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.3.1 CCC finer commanding granularity
--------------------------------------

  Geoff: would it be possible for the CCC to command the hardware again,
  for some different purpose, while waiting for the completion of a
  previous command that is known to take many milliseconds to finish?
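
  The pattern being asked about can be sketched with cooperative tasks.
  This is only an illustration of overlapping a quick command with a
  slow pending one; `send_command' is a stand-in that sleeps, the
  durations are arbitrary, and whether the hardware and CAN-bus
  protocol allow such overlap is exactly the open question.

```python
import asyncio


async def send_command(name, duration_s, log):
    """Stand-in for a hardware command that takes duration_s to complete."""
    log.append(f"{name} issued")
    await asyncio.sleep(duration_s)   # waiting for completion, not blocking
    log.append(f"{name} done")


async def main():
    log = []
    # The slow command is issued first; the quick one is issued while the
    # slow one is still pending instead of after it completes.
    await asyncio.gather(
        send_command("slow", 0.05, log),
        send_command("quick", 0.01, log),
    )
    return log


print(asyncio.run(main()))
```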


1.3.2 CCC Maintenance component should be locked out during observations
------------------------------------------------------------------------

  One typical activity through that component is the collection of
  sample statistics from TFBs. That activity is in general incompatible
  with an ongoing observation. There should be a mechanism that gives
  priority to observations by locking out any possibility to use the
  Maintenance component.
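
  A minimal sketch of such a mechanism, assuming the CCC can signal
  when observations start and end. The class and method names are
  hypothetical and do not correspond to existing components.

```python
import threading


class MaintenanceGate:
    """Hypothetical gate that refuses maintenance while observations run."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active_observations = 0

    def observation_started(self):
        with self._lock:
            self._active_observations += 1

    def observation_ended(self):
        with self._lock:
            if self._active_observations > 0:
                self._active_observations -= 1

    def maintenance_allowed(self):
        """True only when no observation is in progress."""
        with self._lock:
            return self._active_observations == 0


gate = MaintenanceGate()
gate.observation_started()
print(gate.maintenance_allowed())  # an observation is running
gate.observation_ended()
print(gate.maintenance_allowed())  # idle again
```

  A counter rather than a boolean is used so that several sub-arrays
  observing at the same time all have to finish before maintenance is
  allowed again.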


1.3.3 Increase data rate peak a few factors above its current limit
-------------------------------------------------------------------

  There is some motivation to increase the peak value to roughly 4
  times its current 60 MB/sec limit. A greater number of antennas, and
  storing both APC corrected and uncorrected data, would benefit from
  such a change. Some analysis would be needed to identify how much
  impact such a change would really have on data storage in the archive
  infrastructure. A modification like this would require enabling a
  10 GbE environment in the correlator room and a switch to connect to
  the OSF. Bulk-data reconfiguration for 10 GbE also needs to be
  coordinated with proper expert support.
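
  The need for 10 GbE follows from simple arithmetic, sketched below
  under two assumptions: 1 MB is taken as 10^6 bytes, and protocol
  overhead is ignored (so real headroom is smaller than shown).

```python
CURRENT_PEAK_MB_S = 60      # current limit quoted above
INCREASE_FACTOR = 4         # "~4 times"

target_mb_s = CURRENT_PEAK_MB_S * INCREASE_FACTOR   # 240 MB/s
target_gbit_s = target_mb_s * 8 / 1000              # 1.92 Gbit/s


def fits(link_gbit_s, rate_gbit_s=target_gbit_s):
    """True if the raw line rate exceeds the required payload rate."""
    return link_gbit_s > rate_gbit_s


print(f"target: {target_mb_s} MB/s = {target_gbit_s} Gbit/s")
print("fits 1 GbE: ", fits(1))
print("fits 10 GbE:", fits(10))
```

  The current 60 MB/s peak (0.48 Gbit/s) fits within a 1 GbE link, but
  the target of about 1.92 Gbit/s does not.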


1.3.4 Reuse of configuration and state slots at run-time
--------------------------------------------------------

  At this moment it is not well understood when it is valid for the
  CCC to generate a new LTA state after one has been applied. This
  needs to be clarified with Alejandro. If there is such a limitation,
  the CCC would need to wait for a dump duration before scheduling a
  new state.
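
  Independently of that timing question, the bookkeeping side of slot
  reuse can be sketched as a small table keyed by configuration. The
  class below is hypothetical; the 16-slot limit comes from section
  1.1, while recycling the least recently used slot is an assumed
  policy. The dump-duration wait discussed above is not modelled.

```python
from collections import OrderedDict

MAX_STATE_SLOTS = 16  # slot count quoted in section 1.1


class StateSlotTable:
    """Hypothetical bookkeeping of which configuration occupies which slot."""

    def __init__(self, n_slots=MAX_STATE_SLOTS):
        self._n_slots = n_slots
        self._slots = OrderedDict()   # config key -> slot index, LRU first

    def acquire(self, config_key):
        """Return (slot, needs_generate) for the given configuration.

        A configuration seen before reuses its slot and needs no new
        generate. Otherwise a free slot is used or, when none is left, the
        least recently used one is recycled (an assumed policy).
        """
        if config_key in self._slots:
            self._slots.move_to_end(config_key)
            return self._slots[config_key], False
        if len(self._slots) < self._n_slots:
            slot = len(self._slots)
        else:
            _, slot = self._slots.popitem(last=False)
        self._slots[config_key] = slot
        return slot, True


table = StateSlotTable()
for key in ["cfg_a", "cfg_b", "cfg_a", "cfg_b"]:
    slot, needs_generate = table.acquire(key)
    print(key, "-> slot", slot, "generate" if needs_generate else "reuse")
```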


1.3.5 A 5th quadrant for HIL simulation testing
-----------------------------------------------

  We have identified that the most promising alternative is based on a
  'partially populated' correlator quadrant, capable of one base-band
  and up to 16 physical antennas connected to it (is 16 correct?). Such
  an option implies no changes in firmware and medium sized
  modifications in software. It also provides enough support for a
  number of use-cases taking data from real sources in the sky.


1.3.6 Join efforts for a 5th quadrant
-------------------------------------

  We have also identified that providing 5th quadrant hardware should
  be shared between two otherwise independent projects: a hardware
  based simulator environment (Tzu) and a potential enhancement to the
  current correlator hardware (Rich). Both projects should join
  efforts for an optimal assignment of resources.


1.3.7 Compute coarse delays in SC code
--------------------------------------

  Rich: based on a polynomial fit whose parameters are updated by the
  CCC through CAN-bus no more often than once every several seconds.
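
  A minimal sketch of what such an evaluation could look like, written
  in Python for readability rather than as SC code. The function name,
  the coefficient layout and the example numbers (including the 250 ps
  sample period) are illustrative assumptions.

```python
def coarse_delay_samples(coeffs, t_s, sample_period_s):
    """Evaluate a delay polynomial and quantize it to whole samples.

    coeffs: polynomial coefficients [c0, c1, c2, ...] such that
            delay(t) = c0 + c1*t + c2*t**2 + ... in seconds.
    t_s: time since the polynomial reference epoch, in seconds.
    sample_period_s: duration of one sample (an input, not a known value).
    """
    delay_s = 0.0
    for c in reversed(coeffs):      # Horner's rule
        delay_s = delay_s * t_s + c
    return round(delay_s / sample_period_s)


# Illustrative numbers only: 1 us offset drifting at 2 ns/s.
coeffs = [1.0e-6, 2.0e-9]
for t in (0.0, 10.0):
    print(t, coarse_delay_samples(coeffs, t, 250e-12))
```

  With this approach the CCC would only send new coefficients when the
  fit expires, instead of sending every coarse delay value.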


1.3.8 DGCK encode coarse delay value changes
--------------------------------------------

  Ray: label the data itself with the individual coarse delay changes
  detected by the DGCK every time its fine delay setting wraps around.


1.3.9 Correlator component state machine not used
-------------------------------------------------

  Every C++ correlator component inherits from a characteristic
  component class that implements a state machine to allow for ONLINE
  and STANDBY like states. This functionality was a good idea in the
  past but it was never put to real use. We could simply remove the
  state class, simplifying our code base without requiring application
  changes.


1.3.10 printf usage in CCC
--------------------------

  Without looking into specific logging needs, it seems preferable in
  general to use the logging system instead of printf/cout
  output. Analyze the specific CCC logging requirements and plan the
  move to the logging system.


1.4 Documentation improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.4.1 CAN-bus command execution timing summary
----------------------------------------------

  It would be convenient to provide in the Correlator's ICD a single
  table listing the expected execution time of each CAN-bus command,
  instead of many tables spread across the entire document. ICT should
  provide typical CAN-bus command sequences as examples.


1.4.2 3x3 TDM mode is a single polarization mode
------------------------------------------------

  Make this detail much more visible to science stakeholders.


1.5 Pending clarifications
~~~~~~~~~~~~~~~~~~~~~~~~~~

1.5.1 CDP nodes missing workload profiling
------------------------------------------

  There are a few threads deployed in each node. Threads are arranged in
  groups dedicated to different activities. Understanding the workload
  balance between them and real levels of concurrency between groups
  would help measure minimum hardware requirements (number of cores and
  memory) for the cluster.


1.5.2 How long does a stop-sequence command take to complete?
-------------------------------------------------------------


1.5.3 Return-to-phase reset time?
---------------------------------


1.5.4 There is evidence that the CCC is sometimes applying delay too late
-------------------------------------------------------------------------

  Kris to report details in the already existing ticket.