Conclusions
Table of Contents
_________________
1 Conclusions
.. 1.1 Software improvements for more efficient sub-arrays
.. 1.2 Ideas for possible improvements
..... 1.2.1 Timing Event warning and DLL out of lock monitoring
..... 1.2.2 Importance and frequency of Single Event Upsets
..... 1.2.3 Revive ICT-232 towards a reliable protocol
..... 1.2.4 Detect as early as possible that a correlator configuration is bad
..... 1.2.5 Inter sub-scan duration limited by partial dump data spreading
..... 1.2.6 Avoid excessive CAN-bus payload whenever possible
.. 1.3 Ideas for possible longer term improvements
..... 1.3.1 CCC finer commanding granularity
..... 1.3.2 CCC Maintenance component should be locked out during observations
..... 1.3.3 Increase data rate peak a few factors above its current limit
..... 1.3.4 Reuse of configuration and state slots at run-time
..... 1.3.5 A 5th quadrant for HIL simulation testing
..... 1.3.6 Join efforts for a 5th quadrant
..... 1.3.7 Compute coarse delays in SC code
..... 1.3.8 DGCK encode coarse delay value changes
..... 1.3.9 Correlator component state machine not used
..... 1.3.10 printf usage in CCC
.. 1.4 Documentation improvements
..... 1.4.1 CAN-bus command execution timing summary
..... 1.4.2 3x3 TDM mode is a single polarization mode
.. 1.5 Pending clarifications
..... 1.5.1 CDP nodes missing workload profiling
..... 1.5.2 How long does a stop-sequence command take to complete?
..... 1.5.3 Return-to-phase reset time?
..... 1.5.4 There is evidence that the CCC is sometimes applying delays too late
1 Conclusions
=============
1.1 Software improvements for more efficient sub-arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A face-to-face discussion on ICT-5111 identified a series of feasible
improvements. Such discussions are useful because they bring different
points of view together at the same time in the same room.
Taken together, these improvements are expected to produce a much more
'responsive' system when different observation types execute in
different sub-arrays.
The items are listed below in priority order (high to low). Each item
will become an individual Jira ticket, and ICT-5111 will then be
closed.
1. CCC to parallelize all instances where the same CAN-bus command is
issued through independent CAN-bus channels. That is to say, extend
the CCC feature implemented with ICT-2539 to all possible commands
targeted to many base-bands at the same time.
2. Firmware change to send X and Y polarization delays with just one
command, halving the current delay-related CAN-bus traffic. Protocol
options in firmware are to be discussed before deciding how to
proceed in software.
3. Stop uploading scaling factors when the same values are already in
hardware. Even if a TDM sub-scan is interleaved between two sub-scans
of the same FDM mode, the current scaling factor values are expected
to take effect automatically the next time the FDM sub-scan
executes. This firmware feature has never been tested, so some
verification is required.
4. CCC sub-scan coalescence. If a sequence repeats the same sub-scan
many times then the CCC should assume that it is executing just one
sub-scan, leaving to the CDP the actual execution of individual
sub-scans. In this case there would be no CAN-bus workload, apart
from delay updates, during the whole sequence execution. Sub-scan
started and ended callback invocations are still expected to happen,
so some specific CCC changes are expected beyond suppressing CAN-bus
traffic during the sequence. Note that the same FDM sub-scan also
means the same frequency, which is why return-to-phase activity is
avoided as well.
5. Compute bulk delay in CCC instead of SCC microprocessor. It should
be possible to avoid CAN-bus traffic due to return-to-phase actions
by letting the CCC choose and set the bulk delay applied in Station
Cards. Before starting a sequence the CCC could make an educated
guess for a single bulk-delay value per antenna and leave it
constant throughout that sequence. This implies that in FDM mode
there should be no need to return-to-phase between sub-scans if the
digital LO frequency does not change, as in the previous item.
6. Reusing previously generated states might be useful. The maximum
number of state slots is 16, so the applicability of this
improvement is restricted to some very specific conditions. For
example, if just two sub-arrays are executing and each one is aware
of only two different correlator configurations, then both
sub-arrays could execute by just applying a previously generated
state permutation (no new generate needed). Allowing for more than
16 slots is possible in firmware, but it is not trivial to
implement. Alejandro pointed out that LTA::Apply cannot receive a
start/continue flag per sub-array, which is why 16 slots is a
somewhat small number for this specific use-case.
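Item 1 above can be illustrated with a minimal sketch. It is written
in Python for brevity and is not CCC code; send_command is a
hypothetical stand-in for one blocking CAN-bus transaction, and the
latency value is made up. It only shows the idea of issuing the same
command on independent channels concurrently instead of one after
another.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_command(channel, command, latency_s=0.01):
    """Hypothetical stand-in for one blocking CAN-bus transaction."""
    time.sleep(latency_s)  # simulated bus and firmware latency
    return (channel, command, "ok")


def send_serial(channels, command):
    """Issue the command on one channel after another."""
    return [send_command(ch, command) for ch in channels]


def send_parallel(channels, command):
    """Issue the same command on all channels concurrently."""
    channels = list(channels)
    # max(1, ...) keeps the pool valid when the channel list is empty.
    with ThreadPoolExecutor(max_workers=max(1, len(channels))) as pool:
        # map() returns the results in the order of `channels`.
        return list(pool.map(lambda ch: send_command(ch, command), channels))
```

With N independent channels, the serial version waits roughly N times
the per-command latency, while the parallel version waits roughly one
latency, provided the channels really are independent.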
1.2 Ideas for possible improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.2.1 Timing Event warning and DLL out of lock monitoring
---------------------------------------------------------
CCC to reenable monitoring of both items.
1.2.2 Importance and frequency of Single Event Upsets
-----------------------------------------------------
CCC should monitor FPGAs that provide such information and report the
value to the monitoring database. FPGA personalities should be
reloaded whenever an event has occurred, probably at the next
sub-array creation. Investigate whether all involved FPGAs could be
addressed while other sub-arrays are executing; it is assumed that
there is no need for a global FSR to reload one FPGA in the system.
1.2.3 Revive ICT-232 towards a reliable protocol
------------------------------------------------
It is not possible to determine whether a CAN-bus broadcast command
has successfully completed on all involved CAN-bus nodes. Delay
updates are an interesting example because of their relevance
throughout the data path.
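As an illustration only, one possible shape of such a check is
sketched below in Python. It assumes that every node can report a
per-command acknowledgement, which is an assumption of this sketch
and not necessarily what ICT-232 proposes; the function name and the
(node, ok) tuple layout are hypothetical.

```python
def unconfirmed_nodes(expected_nodes, acknowledgements):
    """Return the nodes that did not confirm a broadcast command.

    acknowledgements is an iterable of (node, ok) pairs; a node that
    never answered, or that answered with ok=False, is unconfirmed.
    """
    confirmed = {node for node, ok in acknowledgements if ok}
    return sorted(set(expected_nodes) - confirmed)
```

An empty result would mean that every involved node confirmed the
command; any other result names the nodes to retry or report.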
1.2.4 Detect as early as possible that a correlator configuration is bad
------------------------------------------------------------------------
Neil: if control could guarantee calling loadConfiguration as soon as
possible, then some time could be saved when one of those
configurations is invalid (e.g. too high data rates).
1.2.5 Inter sub-scan duration limited by partial dump data spreading
--------------------------------------------------------------------
The actual 'spreading' of partial dumps through a dump duration
interval imposes a limit on the shortest time between sub-scans in
the same sub-array. Doing anything different in software would
introduce another interdependency between sub-arrays (exceeding the
aggregated data rate), which is difficult to predict in advance. For
the moment, we stick to the fact that it takes at least one dump
duration to complete a sub-scan before the next could run in the same
sub-array.
1.2.6 Avoid excessive CAN-bus payload whenever possible
-------------------------------------------------------
Pete could investigate those CAN-bus protocols that transmit more data
than needed and discuss with Alejandro and Rich ways to simplify
them. The typical case is scaling factors sent for all antennas when
those in the sub-array would suffice.
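The scaling factor case can be sketched as follows. This is a Python
illustration under stated assumptions, not the actual protocol: the
dictionary layout, the function names and the antenna counts used in
the usage note are all hypothetical.

```python
def select_for_subarray(factors_by_antenna, subarray_antennas):
    """Keep only the scaling factors of antennas in the sub-array."""
    wanted = set(subarray_antennas)
    return {ant: f for ant, f in factors_by_antenna.items() if ant in wanted}


def payload_fraction(n_subarray_antennas, n_total_antennas):
    """Fraction of the full payload that is actually needed."""
    if n_total_antennas <= 0:
        raise ValueError("n_total_antennas must be positive")
    return n_subarray_antennas / n_total_antennas
```

For example, with a hypothetical total of 64 antennas and a sub-array
of 16, payload_fraction(16, 64) gives 0.25, that is, three quarters of
the payload could be dropped for that sub-array.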
1.3 Ideas for possible longer term improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.3.1 CCC finer commanding granularity
--------------------------------------
Geoff: would it be possible for the CCC to command the hardware again,
for some different purpose, while waiting for the completion of a
previous command that is known to take many milliseconds to finish?
1.3.2 CCC Maintenance component should be locked out during observations
------------------------------------------------------------------------
One typical activity through that component is the collection of
sample statistics from TFBs. That activity is in general incompatible
with an ongoing observation. There should be a mechanism that gives
priority to observations by locking out any possibility to use the
Maintenance component.
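A minimal sketch of such a mechanism, in Python and with hypothetical
names, could count active observations and refuse maintenance
requests while the count is non-zero. It does not cover what to do
with a maintenance activity that is already running when an
observation starts.

```python
import threading


class MaintenanceGate:
    """Hypothetical gate that gives observations priority."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active_observations = 0

    def observation_started(self):
        with self._lock:
            self._active_observations += 1

    def observation_ended(self):
        with self._lock:
            # Never go below zero on an unmatched "ended" call.
            if self._active_observations > 0:
                self._active_observations -= 1

    def maintenance_allowed(self):
        """True only when no observation is in progress."""
        with self._lock:
            return self._active_observations == 0
```

The Maintenance component would call maintenance_allowed() before
each request and reject the request when it returns False.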
1.3.3 Increase data rate peak a few factors above its current limit
-------------------------------------------------------------------
There is some motivation to increase the peak value to ~4 times its
current 60 MB/sec limit. A greater number of antennas and the storage
of both APC corrected and uncorrected data would benefit from such a
change. Some analysis would be needed to identify how much impact
such a change would really have on data storage in the archive
infrastructure. A modification like this would require enabling a
10 GbE environment in the correlator room and a switch to connect to
the OSF. Bulk-data reconfiguration for 10 GbE also needs to be
coordinated with proper expert support.
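The arithmetic behind the 10 GbE requirement can be checked with a
short Python sketch. It reads '~4 times' as a factor of exactly 4,
takes 1 MB as 10^6 bytes, and ignores protocol overhead; all three
are assumptions of this sketch.

```python
CURRENT_PEAK_MB_S = 60   # current limit quoted in the text
INCREASE_FACTOR = 4      # "~4 times", taken as exactly 4 here


def mb_s_to_gbit_s(rate_mb_s):
    """Convert MB/s (10^6 bytes) to Gbit/s (10^9 bits)."""
    return rate_mb_s * 8 / 1000


target_mb_s = CURRENT_PEAK_MB_S * INCREASE_FACTOR   # 240 MB/s
target_gbit_s = mb_s_to_gbit_s(target_mb_s)         # 1.92 Gbit/s

fits_1gbe = target_gbit_s <= 1.0     # nominal 1 GbE line rate
fits_10gbe = target_gbit_s <= 10.0   # nominal 10 GbE line rate
```

At 240 MB/sec the peak is about 1.92 Gbit/s, which exceeds the
nominal rate of a single 1 GbE link and fits comfortably within
10 GbE.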
1.3.4 Reuse of configuration and state slots at run-time
--------------------------------------------------------
At this moment there is a misunderstanding about when it is valid for
the CCC to generate a new LTA state after one has been applied. This
needs to be clarified with Alejandro. If there is a limitation, then
the CCC needs to wait for a dump duration before scheduling a new
state.
1.3.5 A 5th quadrant for HIL simulation testing
-----------------------------------------------
We have identified that the most promising alternative is based on a
'partially populated' correlator quadrant, capable of one base-band
and up to 16 physical antennas connected to it (is 16 correct?). Such
an option implies no changes in firmware and medium sized
modifications in software. It also provides enough support for a
number of use-cases taking data from real sources in the sky.
1.3.6 Join efforts for a 5th quadrant
-------------------------------------
We have also identified that the effort of providing 5th quadrant
hardware should be shared between two otherwise independent projects:
a hardware based simulator environment (Tzu) and a potential
enhancement to the current correlator hardware (Rich). Both projects
should join efforts for an optimal assignment of resources.
1.3.7 Compute coarse delays in SC code
--------------------------------------
Rich: based on a polynomial fit whose parameters are updated from the
CCC through CAN-bus, no more often than once every many seconds.
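A minimal Python sketch of the idea follows. The coefficient layout,
the reference epoch t0 and the coarse step size are assumptions made
for illustration; the real SC code and its units may differ.

```python
def eval_delay(coeffs, t0, t):
    """Evaluate a delay polynomial with Horner's rule.

    coeffs[k] multiplies (t - t0)**k, so coeffs[0] is the delay at t0.
    """
    dt = t - t0
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * dt + c
    return acc


def coarse_steps(delay, step):
    """Quantize a delay to a whole number of coarse delay steps."""
    return round(delay / step)
```

Between parameter updates the SC would only evaluate the polynomial
locally, so the CAN-bus would carry a handful of coefficients every
many seconds rather than individual coarse delay values.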
1.3.8 DGCK encode coarse delay value changes
--------------------------------------------
Ray: label the data itself with individual coarse delay changes
detected by the DGCK every time its fine delay setting wraps around.
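The wrap-around detection can be sketched in Python as below. It
assumes that the fine delay setting is an integer in the range
0..modulus-1 that moves by small steps, and the sign convention for
the resulting coarse delay change is arbitrary; both are assumptions
of this sketch, not a description of the DGCK firmware.

```python
def coarse_changes(fine_settings, modulus):
    """Flag coarse delay changes implied by fine delay wrap-arounds.

    Returns one value per consecutive pair of settings:
    +1 when the fine delay wrapped from the top back to the bottom,
    -1 when it wrapped from the bottom back to the top, 0 otherwise.
    """
    half = modulus // 2
    changes = []
    for prev, cur in zip(fine_settings, fine_settings[1:]):
        diff = cur - prev
        if diff < -half:
            changes.append(+1)   # e.g. 15 -> 0 with modulus 16
        elif diff > half:
            changes.append(-1)   # e.g. 0 -> 15 with modulus 16
        else:
            changes.append(0)
    return changes
```

For instance, coarse_changes([13, 14, 15, 0, 1], 16) flags a single
change at the 15 to 0 transition.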
1.3.9 Correlator component state machine not used
-------------------------------------------------
Every C++ correlator component inherits from a characteristic
component class that implements a state machine to allow for ONLINE
and STANDBY like states. This functionality was a good idea in the
past but it never saw real use. We could simply remove the state
class, simplifying our code base without application changes.
1.3.10 printf usage in CCC
--------------------------
Without looking into specific logging needs, it seems convenient in
general to use the logging system instead of printf/cout output.
Analyze the specific CCC requirements for logging and plan the move
to the logging system.
1.4 Documentation improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.4.1 CAN-bus command execution timing summary
----------------------------------------------
It would be convenient to provide in the Correlator's ICD a single
table listing the expected execution times of the different CAN-bus
commands, instead of many tables spread across the entire
document. ICT should provide typical CAN-bus command sequences as
examples.
1.4.2 3x3 TDM mode is a single polarization mode
------------------------------------------------
Make this detail much more visible to science stakeholders.
1.5 Pending clarifications
~~~~~~~~~~~~~~~~~~~~~~~~~~
1.5.1 CDP nodes missing workload profiling
------------------------------------------
There are a few threads deployed in each node. Threads are arranged in
groups dedicated to different activities. Understanding the workload
balance between them and real levels of concurrency between groups
would help measure minimum hardware requirements (number of cores and
memory) for the cluster.
1.5.2 How long does a stop-sequence command take to complete?
-------------------------------------------------------------
1.5.3 Return-to-phase reset time?
---------------------------------
1.5.4 There is evidence that the CCC is sometimes applying delays too late
--------------------------------------------------------------------------
Kris to report details in the already existing ticket.