Inter Node Communication Problems

These problems are typically encountered when starting to use the SDK to call the TSM nodes, or when making changes. They typically result in the SDK returning the error:

tsm operation failed ; node 0 returned 500: Internal Server Error\n sessionID=<SessionID>

To get more information on what the underlying problem is, the logs should be extracted from the TSM nodes.
The communication errors covered on this page will typically give one or more of the following log messages:

endpoint error: an error occurred during key generation: timed out while creating channels sessionID=<SessionID>
endpoint error: an error occurred during key generation: EOF sessionID=<SessionID>
closing unclaimed channel for session id <SessionID>

The first means that a node was waiting for another node to connect to it, but no one ever did.
The second means that the connection was closed when trying to read from it. This typically happens if a node has timed out and is closing down, but the connection has not finished closing down completely, or if a firewall cuts the connection for some reason.
The third indicates that the connection was made, but the channel was never used. This happens if, for example, the session IDs are not the same, so the operations are not linked together correctly, or if the TSM nodes are called at different times (e.g. at the other end of an EOF).
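
When going through large log files, the three cases above can be picked out mechanically. The following is a minimal sketch in Go that reads log lines from standard input and prints the likely cause next to each matching line. The function name classify and the wording of the causes are illustrative only, and the matched substrings are taken from the key generation messages shown above; other operations may word the message differently.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// classify maps a TSM node log line to the likely cause described above.
// It returns an empty string for lines that match none of the three messages.
// The substrings are copied from the key generation examples on this page.
func classify(line string) string {
	switch {
	case strings.Contains(line, "timed out while creating channels"):
		return "no peer connected: another node never opened a connection"
	case strings.Contains(line, "an error occurred during key generation: EOF"):
		return "connection closed while reading: peer timed out or a firewall cut the connection"
	case strings.Contains(line, "closing unclaimed channel"):
		return "channel never used: session IDs differ or nodes were called at different times"
	}
	return ""
}

func main() {
	// Read log lines from standard input, e.g. `go run . < node1.log`.
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		if cause := classify(line); cause != "" {
			fmt.Printf("%s\n  -> %s\n", line, cause)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading logs:", err)
	}
}
```

Running this against the logs of every node and comparing the session IDs in the matching lines is a quick way to see which node was waiting and which one never connected.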

If running in a multi-tenant setting, i.e. where node 0 is handled by multiple mobile nodes, the following log entry should also be present in the non-mobile nodes:

tenant public key for sessionID <SessionID> registered

This means that the public key for the mobile phone was registered successfully. If this entry is missing in a multi-tenant setup, check that RegisterTenantPublicKey is called with the correct session ID and public key for each operation that is invoked.

Suggested actions

The following actions can be performed to try and fix the issues.

Configuration

The first action should be to check that the connections are configured correctly. Example (partial) configuration:

[Player]
Index = 1
PrivateKey = "<Private Key.1>"

[Players.0]
Address = "<Node.0 Address>"
PublicKey = "<Public Key.0>"

[Players.1]
Address = "<Node.1 Address>"
PublicKey = "<Public Key.1>"

[Players.2]
Address = "<Node.2 Address>"
PublicKey = "<Public Key.2>"

Note that [Players.0] is not present when using a multi-tenant setup (node 0 running on a mobile phone), and that
[Players.1] is optional on node 1.

Things to check:

  • The Index in the [Player] section is the correct index for the node.
  • The Address in each [Players.<id>] section contains the correct host and port. The format is host:port, e.g. tsm-node0:9000.
  • The public key matches the private key. This means that the PrivateKey in [Player] must match the PublicKey in [Players.<id>] on the other nodes. Here <id> should match the Index in the [Player] section that holds the private key.
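
The address check can be automated before a node is started. The following is a minimal sketch in Go that validates the host:port form using only the standard library. The function name checkAddress and the sample addresses are illustrative and not part of the TSM SDK; the sketch only checks the format, not that the node is reachable.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// checkAddress verifies that addr has the host:port form, e.g. "tsm-node0:9000".
// It does not contact the node; it only validates the syntax.
func checkAddress(addr string) error {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("address %q is not host:port: %w", addr, err)
	}
	if host == "" {
		return fmt.Errorf("address %q has an empty host", addr)
	}
	n, err := strconv.Atoi(port)
	if err != nil || n < 1 || n > 65535 {
		return fmt.Errorf("address %q has an invalid port %q", addr, port)
	}
	return nil
}

func main() {
	// Hypothetical addresses; replace with the values from [Players.<id>].
	for _, addr := range []string{"tsm-node0:9000", "tsm-node1", "tsm-node2:notaport"} {
		if err := checkAddress(addr); err != nil {
			fmt.Println("BAD:", err)
			continue
		}
		fmt.Println("OK: ", addr)
	}
}
```

If the format is correct but the nodes still cannot connect, a plain TCP connection attempt from the host of one node to the address of another (for example with net.DialTimeout) can help rule out firewall or DNS problems.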

Code

There are several things that need to be done to get things to work.

If running in a setup where a single SDK controls all nodes, then calling an operation such as Keygen (it does not need to be the
WithSessionID variant) on the SDK will call all nodes, and this should work.

If running with multiple SDKs each controlling one (or a few) node(s), then there are a lot of pitfalls. The general
process that needs to be followed:

  • Generate a session ID.
  • Distribute the session ID.
  • If running node 0 on a mobile: Call RegisterTenantPublicKey with the session ID and the public key of the mobile phone on all non-mobile nodes.
  • Call the WithSessionID operation (e.g. KeygenWithSessionID). This needs to be called on all nodes at roughly the same time.

If this fails, then there are multiple things to check:

  • Try and find the logs mentioned. Check that the session ID is consistent across nodes.
  • The default connection timeout is 10 seconds. If this is too low, it can be increased by setting ConnectionTimeout in the [MPC] section.
  • Make sure that the calls are actually made in parallel. Often the main thread blocks on the mobile SDK call, so the subsequent request that triggers the operation on the server SDK is not sent until the mobile SDK call has timed out.
  • If running mobile nodes, then make sure the tenant public key for sessionID <SessionID> registered log is present on the server nodes.
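
The point about calling in parallel can be illustrated with a minimal sketch in Go. The operation type, runInParallel, fakePeer and demoTimeout are hypothetical names introduced for illustration and are not part of the TSM SDK; each operation stands in for one blocking call, such as the local KeygenWithSessionID call or the request that asks the server to start its side. The fake peers only succeed if the other side starts before the timeout, which mimics two nodes waiting for each other.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoTimeout stands in for the connection timeout in this sketch only.
const demoTimeout = time.Second

// operation is a hypothetical stand-in for one blocking call that takes part
// in a session; it is not a type from the TSM SDK.
type operation func(sessionID string) error

// runInParallel starts every operation in its own goroutine with the same
// session ID and waits for all of them, returning one error per operation.
func runInParallel(sessionID string, ops ...operation) []error {
	errs := make([]error, len(ops))
	var wg sync.WaitGroup
	for i, op := range ops {
		wg.Add(1)
		go func(i int, op operation) {
			defer wg.Done()
			errs[i] = op(sessionID)
		}(i, op)
	}
	wg.Wait()
	return errs
}

// fakePeer simulates a node: it announces that it has started and then
// succeeds only if its peer also starts before the timeout expires.
// Each returned operation must be called at most once.
func fakePeer(started chan<- struct{}, peerStarted <-chan struct{}, timeout time.Duration) operation {
	return func(sessionID string) error {
		close(started)
		select {
		case <-peerStarted:
			return nil
		case <-time.After(timeout):
			return errors.New("timed out while creating channels sessionID=" + sessionID)
		}
	}
}

func main() {
	a, b := make(chan struct{}), make(chan struct{})
	mobile := fakePeer(a, b, demoTimeout)
	server := fakePeer(b, a, demoTimeout)

	// Both calls are in flight at the same time, so each peer sees the other.
	for i, err := range runInParallel("example-session", mobile, server) {
		fmt.Printf("operation %d: error = %v\n", i, err)
	}
}
```

If the two fake peers were instead called one after the other on the same thread, the first call would block until its timeout expired before the second was even started, which is the same failure mode as described in the list above.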