SmartSim API¶
Experiment¶
| Experiment | Initialize an Experiment instance |
| Experiment.start | Start passed instances using Experiment launcher |
| Experiment.stop | Stop specific instances launched by this Experiment |
| Experiment.create_ensemble | Create an Ensemble of Model instances |
| Experiment.create_model | Create a general purpose Model |
| Experiment.create_database | Initialize an Orchestrator database |
| Experiment.create_run_settings | Create a RunSettings instance |
| Experiment.create_batch_settings | Create a BatchSettings instance |
| Experiment.generate | Generate the file structure for an Experiment |
| Experiment.poll | Monitor jobs through logging to stdout. |
| Experiment.finished | Query if a job has completed. |
| Experiment.get_status | Query the status of launched instances |
| Experiment.reconnect_orchestrator | Reconnect to a running Orchestrator |
| Experiment.summary | Return a summary of the Experiment |
- class Experiment(name, exp_path=None, launcher='local')[source]¶
Bases:
object
Experiments are the Python user interface for SmartSim.
Experiment is a factory class that creates stages of a workflow and manages their execution.
The instances created by an Experiment represent executable code that is either user-specified, like the Model instances created by Experiment.create_model, or pre-configured, like the Orchestrator instance created by Experiment.create_database.
Experiment methods that accept a variable list of arguments, such as Experiment.start or Experiment.stop, accept any number of the instances created by the Experiment.
In general, the Experiment class is designed to be initialized once and utilized throughout runtime.
Initialize an Experiment instance
With the default settings, the Experiment will use the local launcher, which will start all Experiment created instances on the localhost.
Example of initializing an Experiment with the local launcher
exp = Experiment(name="my_exp", launcher="local")
SmartSim supports multiple launchers, which can be specified based on the type of system you are running on.
exp = Experiment(name="my_exp", launcher="slurm")
If you wish your driver script and Experiment to be run across multiple systems with different schedulers (workload managers), you can also use the auto argument to have the Experiment guess which launcher to use based on system installed binaries and libraries.
exp = Experiment(name="my_exp", launcher="auto")
The Experiment path will default to the current working directory, and if the Experiment.generate method is called, a directory with the Experiment name will be created to house the output from the Experiment.
- Parameters
name (str) – name for the Experiment
exp_path (str, optional) – path to location of Experiment directory if generated
launcher (str, optional) – type of launcher being used, options are “slurm”, “pbs”, “cobalt”, “lsf”, or “local”. If set to “auto”, an attempt will be made to find an available launcher on the system. Defaults to “local”
- create_batch_settings(nodes=1, time='', queue='', account='', batch_args=None, **kwargs)[source]¶
Create a BatchSettings instance
Batch settings parameterize batch workloads. The result of this function can be passed to the Ensemble initialization.
The batch_args parameter can be used to pass in a dictionary of additional batch command arguments that aren’t supported through the SmartSim interface.
# i.e. for Slurm
batch_args = {
    "distribution": "block",
    "exclusive": None
}
bs = exp.create_batch_settings(nodes=3,
                               time="10:00:00",
                               batch_args=batch_args)
bs.set_account("default")
- Parameters
nodes (int, optional) – number of nodes for batch job, defaults to 1
time (str, optional) – length of batch job, defaults to “”
queue (str, optional) – queue or partition (if slurm), defaults to “”
account (str, optional) – user account name for batch system, defaults to “”
batch_args (dict[str, str], optional) – additional batch arguments, defaults to None
- Returns
a newly created BatchSettings instance
- Return type
BatchSettings
- Raises
SmartSimError – if batch creation fails
- create_database(port=6379, db_nodes=1, batch=False, hosts=None, run_command='auto', interface='ipogif0', account=None, time=None, queue=None, single_cmd=True, **kwargs)[source]¶
Initialize an Orchestrator database
The Orchestrator database is a key-value store based on Redis that can be launched together with other Experiment created instances for online data storage.
When launched, the Orchestrator can be used to communicate data between Fortran, Python, C, and C++ applications.
Machine learning models in PyTorch, TensorFlow, and ONNX (i.e. scikit-learn) can also be stored within the Orchestrator database, where they can be called remotely and executed on CPU or GPU where the database is hosted.
To enable a SmartSim Model to communicate with the database, the workload must utilize the SmartRedis clients. For more information on the database and SmartRedis clients, see the documentation at www.craylabs.org
- Parameters
port (int, optional) – TCP/IP port, defaults to 6379
db_nodes (int, optional) – number of database shards, defaults to 1
batch (bool, optional) – run as a batch workload, defaults to False
hosts (list[str], optional) – specify hosts to launch on, defaults to None
run_command (str, optional) – specify launch binary or detect automatically, defaults to “auto”
interface (str, optional) – Network interface, defaults to “ipogif0”
account (str, optional) – account to run batch on, defaults to None
time (str, optional) – walltime for batch ‘HH:MM:SS’ format, defaults to None
queue (str, optional) – queue to run the batch on, defaults to None
single_cmd (bool, optional) – run all shards with one (MPMD) command, defaults to True
- Raises
SmartSimError – if detection of launcher or of run command fails
SmartSimError – if user indicated an incompatible run command for the launcher
- Returns
Orchestrator
- Return type
Orchestrator or derived class
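A minimal sketch of launching and tearing down a sharded database on a Slurm system; the node count and network interface are illustrative and machine-specific.
exp = Experiment("db-example", launcher="slurm")

# three shards; "ipogif0" is a Cray Aries interface and
# will differ on other machines
db = exp.create_database(port=6379, db_nodes=3, interface="ipogif0")

exp.start(db)
# ... launch workloads that communicate with the database ...
exp.stop(db)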
- create_ensemble(name, params=None, batch_settings=None, run_settings=None, replicas=None, perm_strategy='all_perm', **kwargs)[source]¶
Create an Ensemble of Model instances
Ensembles can be launched sequentially or as a batch if using a non-local launcher, e.g. Slurm.
Ensembles require one of the following combinations of arguments:
run_settings and params
run_settings and replicas
batch_settings
batch_settings, run_settings, and params
batch_settings, run_settings, and replicas
If given solely batch settings, an empty ensemble will be created that models can be added to manually through Ensemble.add_model(). The entire ensemble will launch as one batch.
Provided batch and run settings, either params or replicas must be passed, and the entire ensemble will launch as a single batch.
Provided solely run settings, either params or replicas must be passed, and the ensemble members will each launch sequentially.
The kwargs argument can be used to pass custom input parameters to the permutation strategy.
- Parameters
name (str) – name of the ensemble
params (dict[str, Any]) – parameters to expand into Model members
batch_settings (BatchSettings) – describes settings for Ensemble as batch workload
run_settings (RunSettings) – describes how each Model should be executed
replicas (int) – number of replicas to create
perm_strategy (str, optional) – strategy for expanding params into Model instances; options are “all_perm”, “stepped”, “random”, or a callable function. Defaults to “all_perm”
- Raises
SmartSimError – if initialization fails
- Returns
Ensemble instance
- Return type
Ensemble
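A sketch of expanding a parameter space into an ensemble; the script name and parameter values are illustrative.
rs = exp.create_run_settings("python", "train.py")

# "all_perm" expands the cartesian product: 2 x 2 = 4 Model members
space = {"lr": [0.01, 0.001], "batch": [32, 64]}
ensemble = exp.create_ensemble("training",
                               params=space,
                               run_settings=rs,
                               perm_strategy="all_perm")
exp.generate(ensemble)
exp.start(ensemble)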
- create_model(name, run_settings, params=None, path=None, enable_key_prefixing=False)[source]¶
Create a general purpose
Model
The Model class is the most general encapsulation of executable code in SmartSim. Model instances are named references to pieces of a workflow that can be parameterized and executed. Model instances can be launched sequentially or as a batch by adding them into an Ensemble.
Parameters supplied in the params argument can be written into configuration files supplied at runtime to the model through Model.attach_generator_files. params can also be turned into executable arguments by calling Model.params_to_args
By default, Model instances will be executed in the current working directory if no path argument is supplied. If a Model instance is passed to Experiment.generate, a directory within the Experiment directory will be created to house the input and output files from the model.
Example initialization of a Model instance
from smartsim import Experiment

run_settings = exp.create_run_settings("python", "run_pytorch_model.py")
model = exp.create_model("pytorch_model", run_settings)

# adding parameters to a model
run_settings = exp.create_run_settings("python", "run_pytorch_model.py")
train_params = {
    "batch": 32,
    "epoch": 10,
    "lr": 0.001
}
model = exp.create_model("pytorch_model", run_settings, params=train_params)
model.attach_generator_files(to_configure="./train.cfg")
exp.generate(model)
New in 0.4.0, Model instances can be co-located with an Orchestrator database shard through Model.colocate_db. This will launch a single Orchestrator instance on each compute host used by the (possibly distributed) application. This is useful for performant online inference or processing at runtime.
- Parameters
name (str) – name of the model
run_settings (RunSettings) – defines how Model should be run
params (dict, optional) – model parameters for writing into configuration files
path (str, optional) – path to where the model should be executed at runtime
enable_key_prefixing (bool, optional) – If True, data sent to the Orchestrator using SmartRedis from this Model will be prefixed with the Model name. Defaults to False
- Raises
SmartSimError – if initialization fails
- Returns
the created Model
- Return type
Model
- create_run_settings(exe, exe_args=None, run_command='auto', run_args=None, env_vars=None, container=None, **kwargs)[source]¶
Create a RunSettings instance.
run_command=”auto” will attempt to automatically match a run command on the system with a RunSettings class in SmartSim. If found, the class corresponding to that run_command will be created and returned.
If the local launcher is being used, auto detection will be turned off.
If a recognized run command is passed, the RunSettings instance will be a child class such as SrunSettings.
If the run command is not supported by SmartSim, the base RunSettings class will be created and returned, and the specified run_command and run_args will be evaluated literally.
- Run Commands with implemented helper classes:
aprun (ALPS)
srun (SLURM)
mpirun (OpenMPI)
jsrun (LSF)
- Parameters
run_command (str) – command to run the executable
exe (str) – executable to run
exe_args (list[str], optional) – arguments to pass to the executable
run_args (list[str], optional) – arguments to pass to the run_command
env_vars (dict[str, str], optional) – environment variables to pass to the executable
- Returns
the created RunSettings
- Return type
RunSettings
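A sketch of the two common paths, assuming that passing run_command=None yields the base RunSettings class; binary names are illustrative.
# auto-detection: on a Slurm system this returns SrunSettings
settings = exp.create_run_settings("./my_app", run_command="auto")

# no run command: the executable is launched directly
local = exp.create_run_settings("echo", exe_args=["hello"], run_command=None)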
- finished(entity)[source]¶
Query if a job has completed.
An instance of Model or Ensemble can be passed as an argument.
Passing Orchestrator will return an error, as a database deployment is never finished until stopped by the user.
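A sketch of polling a non-blocking launch; the sleep interval is illustrative.
import time

exp.start(model, block=False)
while not exp.finished(model):
    time.sleep(5)  # wait between queries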
- generate(*args, tag=None, overwrite=False)[source]¶
Generate the file structure for an Experiment
Experiment.generate creates directories for each instance passed to organize Experiments that launch many instances.
If files or directories are attached to Model objects using Model.attach_generator_files(), those files or directories will be symlinked, copied, or configured and written into the created directory for that instance.
Instances of Model, Ensemble and Orchestrator can all be passed as arguments to the generate method.
- Parameters
tag (str, optional) – tag used in to_configure generator files
overwrite (bool, optional) – overwrite existing folders and contents, defaults to False
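A short sketch:
# create a directory per instance under the Experiment directory,
# replacing any previous output
exp.generate(model, ensemble, overwrite=True)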
- get_status(*args)[source]¶
Query the status of launched instances
Return a smartsim.status string representing the status of the launched instance.
exp.get_status(model)
As with other Experiment methods, multiple instances of varying types can be passed, and all statuses will be returned at once.
statuses = exp.get_status(model, ensemble, orchestrator)
assert all([status == smartsim.status.STATUS_COMPLETED
            for status in statuses])
- Returns
status of the instances passed as arguments
- Return type
list[str]
- Raises
SmartSimError – if status retrieval fails
- poll(interval=10, verbose=True, kill_on_interrupt=True)[source]¶
Monitor jobs through logging to stdout.
This method should only be used if jobs were launched with
Experiment.start(block=False)
The interval specified controls how often the logging is performed, not how often the polling occurs. By default, internal polling is set to every second for local launcher jobs and every 10 seconds for all other launchers.
If internal polling needs to be slower or faster based on system or site standards, set the SMARTSIM_JM_INTERVAL environment variable to control the internal polling interval for SmartSim.
For more verbose logging output, the SMARTSIM_LOG_LEVEL environment variable can be set to debug.
If kill_on_interrupt=True, then all jobs launched by this experiment are guaranteed to be killed when ^C (SIGINT) signal is received. If kill_on_interrupt=False, then it is not guaranteed that all jobs launched by this experiment will be killed, and the zombie processes will need to be manually killed.
- Parameters
interval (int, optional) – frequency (in seconds) of logging to stdout, defaults to 10 seconds
verbose (bool, optional) – set verbosity, defaults to True
kill_on_interrupt (bool, optional) – flag for killing jobs when SIGINT is received
- Raises
SmartSimError –
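A sketch of a non-blocking launch followed by polling:
exp.start(model, db, block=False)

# log statuses every 5 seconds until all non-database jobs finish
exp.poll(interval=5, verbose=True)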
- reconnect_orchestrator(checkpoint)[source]¶
Reconnect to a running Orchestrator
This method can be used to connect to an Orchestrator deployment that was launched by a previous Experiment. This can be helpful in the case where separate runs of an Experiment wish to use the same Orchestrator instance currently running on a system.
- Parameters
instance currently running on a system.- Parameters
checkpoint (str) – the smartsim_db.dat file created when an Orchestrator is launched
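A sketch of reconnecting from a second driver script; the checkpoint path is illustrative.
exp = Experiment("second-run", launcher="slurm")

# path to the file written when the previous Experiment launched the database
db = exp.reconnect_orchestrator("./first-run/database/smartsim_db.dat")
exp.stop(db)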
- start(*args, block=True, summary=False, kill_on_interrupt=True)[source]¶
Start passed instances using Experiment launcher
Any Model, Ensemble, or Orchestrator instance created by the Experiment can be passed as an argument to the start method.
exp = Experiment(name="my_exp", launcher="slurm")
settings = exp.create_run_settings(exe="./path/to/binary")
model = exp.create_model("my_model", settings)
exp.start(model)
Multiple instances can also be passed to the start method at once, no matter their type. These will all be launched together.
exp.start(model_1, model_2, db, ensemble, block=True)

# alternatively
stage_1 = [model_1, model_2, db, ensemble]
exp.start(*stage_1, block=True)
If block==True, the Experiment will poll the launched instances at runtime until all non-database jobs have completed. Database jobs must be killed by the user by passing them to Experiment.stop. This allows multiple stages of a workflow to produce to and consume from the same Orchestrator database.
If kill_on_interrupt=True, then all jobs launched by this experiment are guaranteed to be killed when ^C (SIGINT) signal is received. If kill_on_interrupt=False, then it is not guaranteed that all jobs launched by this experiment will be killed, and the zombie processes will need to be manually killed.
- Parameters
block (bool, optional) – block execution until all non-database jobs are finished, defaults to True
summary (bool, optional) – print a launch summary prior to launch, defaults to False
kill_on_interrupt (bool, optional) – flag for killing jobs when ^C (SIGINT) signal is received.
- stop(*args)[source]¶
Stop specific instances launched by this Experiment
Instances of Model, Ensemble and Orchestrator can all be passed as arguments to the stop method.
Whichever launcher was specified at Experiment initialization will be used to stop the instance. For example, when using the Slurm launcher, this equates to running scancel on the instance.
Example
exp.stop(model)

# multiple
exp.stop(model_1, model_2, db, ensemble)
- Raises
TypeError – if wrong type
SmartSimError – if stop request fails
- summary(format='github')[source]¶
Return a summary of the Experiment
The summary will show each instance that has been launched and completed in this Experiment.
- Parameters
format (str, optional) – the style in which the summary table is formatted, for a full list of styles see: https://github.com/astanin/python-tabulate#table-format, defaults to “github”
- Returns
tabulate string of Experiment history
- Return type
str
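A short usage sketch:
exp.start(model)
print(exp.summary(format="github"))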
Settings¶
Settings are provided to Model and Ensemble objects
to provide parameters for how a job should be executed. Some
are specifically meant for certain launchers: SbatchSettings,
for example, is solely meant for systems using Slurm as a
workload manager. MpirunSettings for OpenMPI based jobs is
supported by Slurm, PBSPro, and Cobalt.
Types of Settings:
| RunSettings | Run parameters for a Model |
| SrunSettings | Initialize run parameters for a slurm job with srun |
| AprunSettings | Settings to run job with aprun command |
| JsrunSettings | Settings to run job with jsrun command |
| MpirunSettings | Settings to run job with mpirun command (OpenMPI) |
| MpiexecSettings | Settings to run job with mpiexec command (OpenMPI) |
| OrterunSettings | Settings to run job with orterun command (OpenMPI) |
| SbatchSettings | Specify run parameters for a Slurm batch job |
| QsubBatchSettings | Specify qsub batch parameters for a job |
| CobaltBatchSettings | Specify settings for a Cobalt qsub batch launch |
| BsubBatchSettings | Specify bsub batch parameters for a job |
Settings objects can accept a container object that defines a container runtime, image, and arguments to use for the workload. Below is a list of supported container runtimes.
Types of Containers:
| Singularity | Singularity (apptainer) container type. |
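A sketch of attaching a container to run settings, assuming Singularity is importable from smartsim.settings; the image URI and arguments are illustrative.
from smartsim.settings import Singularity

container = Singularity("docker://my-org/my-image:latest", args="--nv")
settings = exp.create_run_settings("python", "app.py", container=container)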
RunSettings¶
When running SmartSim on laptops and single node workstations,
the base RunSettings
object is used to parameterize jobs.
RunSettings
include a run_command
parameter for local
launches that utilize a parallel launch binary like
mpirun
, mpiexec
, and others.
| RunSettings.add_exe_args | Add executable arguments to executable |
| RunSettings.update_env | Update the job environment variables |
- class RunSettings(exe, exe_args=None, run_command='', run_args=None, env_vars=None, container=None, **kwargs)[source]¶
Run parameters for a Model
The base RunSettings class should only be used with the local launcher on single nodes, workstations, or laptops.
If no run_command is specified, the executable will be launched locally.
run_args passed as a dict will be interpreted literally for local RunSettings and added directly to the run_command, e.g. run_args = {"-np": 2} will be "-np 2"
Example initialization
rs = RunSettings("echo", "hello", "mpirun", run_args={"-np": "2"})
- Parameters
exe (str) – executable to run
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_command (str, optional) – launch binary (e.g. “srun”), defaults to empty str
run_args (dict[str, str], optional) – arguments for run command (e.g. -np for mpiexec), defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
container (Container, optional) – container type for workload (e.g. “singularity”), defaults to None
- add_exe_args(args)[source]¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()[source]¶
Build environment variable string
- Returns
formatted list of strings to export variables
- Return type
list[str]
- format_run_args()[source]¶
Return formatted run arguments
For RunSettings, the run arguments are passed literally with no formatting.
- Returns
list run arguments for these settings
- Return type
list[str]
- reserved_run_args: set[str] = {}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)[source]¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_broadcast(dest_path=None)[source]¶
Copy executable file to allocated compute nodes
- Parameters
dest_path (str | None) – Path to copy an executable file
- set_cpu_bindings(bindings)[source]¶
Set the cores to which MPI processes are bound
- Parameters
bindings (list[int] | int) – List specifying the cores to which MPI processes are bound
- set_cpus_per_task(cpus_per_task)[source]¶
Set the number of cpus per task
- Parameters
cpus_per_task (int) – number of cpus per task
- set_excluded_hosts(host_list)[source]¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- set_hostlist_from_file(file_path)[source]¶
Use the contents of a file to specify the hostlist for this job
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)[source]¶
Set the amount of memory required per node in megabytes
- Parameters
memory_per_node (int) – Number of megabytes per node
- set_mpmd_preamble(preamble_lines)[source]¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)[source]¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)[source]¶
Set the job to run in quiet mode
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_tasks(tasks)[source]¶
Set the number of tasks to launch
- Parameters
tasks (int) – number of tasks to launch
- set_tasks_per_node(tasks_per_node)[source]¶
Set the number of tasks per node
- Parameters
tasks_per_node (int) – number of tasks to launch per node
- set_time(hours=0, minutes=0, seconds=0)[source]¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)[source]¶
Set the job to run in verbose mode
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)[source]¶
Set the formatted walltime
- Parameters
walltime (str) – Time in format required by launcher
- update_env(env_vars)[source]¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
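A short sketch; values are coerced to strings.
rs.update_env({"OMP_NUM_THREADS": 20, "LOGGING": "verbose"})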
SrunSettings¶
SrunSettings
can be used for running on existing allocations,
running jobs in interactive allocations, and for adding srun
steps to a batch.
| SrunSettings.set_nodes | Set the number of nodes |
| SrunSettings.set_tasks | Set the number of tasks for this job |
| SrunSettings.set_tasks_per_node | Set the number of tasks for this job |
| SrunSettings.set_walltime | Set the walltime of the job |
| SrunSettings.set_hostlist | Specify the hostlist for this job |
| SrunSettings.set_excluded_hosts | Specify a list of hosts to exclude for launching this job |
| SrunSettings.set_cpus_per_task | Set the number of cpus to use per task |
| SrunSettings.add_exe_args | Add executable arguments to executable |
| SrunSettings.format_run_args | Return a list of slurm formatted run arguments |
| SrunSettings.format_env_vars | Build bash compatible environment variable string for Slurm |
| SrunSettings.update_env | Update the job environment variables |
- class SrunSettings(exe, exe_args=None, run_args=None, env_vars=None, alloc=None, **kwargs)[source]¶
Initialize run parameters for a slurm job with srun
SrunSettings should only be used on Slurm based systems.
If an allocation is specified, the instance receiving these run parameters will launch on that allocation.
- Parameters
exe (str) – executable to run
exe_args (list[str] | str, optional) – executable arguments, defaults to None
run_args (dict[str, str | None], optional) – srun arguments without dashes, defaults to None
env_vars (dict[str, str], optional) – environment variables for job, defaults to None
alloc (str, optional) – allocation ID if running on existing alloc, defaults to None
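A sketch of configuring srun settings on an existing allocation; the allocation ID is illustrative.
from smartsim.settings import SrunSettings

srun = SrunSettings("./my_app", exe_args="--verbose", alloc="12345")
srun.set_nodes(2)
srun.set_tasks_per_node(16)
srun.set_walltime("01:00:00")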
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_comma_sep_env_vars()[source]¶
Build environment variable string for Slurm
Slurm takes exports in comma separated lists; the list starts with all so as to not disturb the rest of the environment. For more information on this, see the Slurm documentation for srun.
- Returns
the formatted string of environment variables
- Return type
tuple[str, list[str]]
- format_env_vars()[source]¶
Build bash compatible environment variable string for Slurm
- Returns
the formatted string of environment variables
- Return type
list[str]
- format_run_args()[source]¶
Return a list of slurm formatted run arguments
- Returns
list of slurm arguments for these settings
- Return type
list[str]
- make_mpmd(srun_settings)[source]¶
Make an MPMD workload by combining two srun commands
This connects the two settings to be executed with a single Model instance
- Parameters
srun_settings (SrunSettings) – SrunSettings instance
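A sketch of combining two settings into one MPMD launch; binary names are illustrative.
app1 = SrunSettings("./app1")
app1.set_tasks(8)

app2 = SrunSettings("./app2")
app2.set_tasks(8)

# both executables run under a single srun command
app1.make_mpmd(app2)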
- reserved_run_args: set[str] = {'D', 'chdir'}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)¶
Set binding
- Parameters
binding (str) – Binding
- set_broadcast(dest_path=None)[source]¶
Copy executable file to allocated compute nodes
This sets
--bcast
- Parameters
dest_path (str | None) – Path to copy an executable file
- set_cpu_bindings(bindings)[source]¶
Bind by setting CPU masks on tasks
This sets --cpu-bind using the map_cpu:<list> option
- Parameters
bindings (list[int] | int) – List specifying the cores to which MPI processes are bound
- set_cpus_per_task(cpus_per_task)[source]¶
Set the number of cpus to use per task
This sets
--cpus-per-task
- Parameters
cpus_per_task (int) – number of cpus to use per task
- set_excluded_hosts(host_list)[source]¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (list[str]) – hosts to exclude
- Raises
TypeError –
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
This sets
--nodelist
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_hostlist_from_file(file_path)[source]¶
Use the contents of a file to set the node list
This sets
--nodefile
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)[source]¶
Specify the real memory required per node
This sets --mem in megabytes
- Parameters
memory_per_node (int) – Amount of memory per node in megabytes
- set_mpmd_preamble(preamble_lines)¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)[source]¶
Set the number of nodes
Effectively this is setting:
srun --nodes <num_nodes>
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)[source]¶
Set the job to run in quiet mode
This sets
--quiet
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_task_map(task_mapping)¶
Set a task mapping
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)[source]¶
Set the number of tasks for this job
This sets
--ntasks
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)[source]¶
Set the number of tasks for this job
This sets
--ntasks-per-node
- Parameters
tasks_per_node (int) – number of tasks per node
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)[source]¶
Set the job to run in verbose mode
This sets
--verbose
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)[source]¶
Set the walltime of the job
format = “HH:MM:SS”
- Parameters
walltime (str) – wall time
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
AprunSettings¶
AprunSettings can be used on any system that supports the
Cray ALPS layer. SmartSim supports using AprunSettings
on PBSPro and Cobalt WLM systems.
AprunSettings can be used in interactive sessions (on allocation)
and within batch launches (e.g., QsubBatchSettings).
| AprunSettings.set_cpus_per_task | Set the number of cpus to use per task |
| AprunSettings.set_hostlist | Specify the hostlist for this job |
| AprunSettings.set_tasks | Set the number of tasks for this job |
| AprunSettings.set_tasks_per_node | Set the number of tasks for this job |
| AprunSettings.make_mpmd | Make job an MPMD job |
| AprunSettings.add_exe_args | Add executable arguments to executable |
| AprunSettings.format_run_args | Return a list of ALPS formatted run arguments |
| AprunSettings.format_env_vars | Format the environment variables for aprun |
| AprunSettings.update_env | Update the job environment variables |
- class AprunSettings(exe, exe_args=None, run_args=None, env_vars=None, **kwargs)[source]¶
Settings to run job with aprun command
AprunSettings can be used for both the pbs and cobalt launchers.
- Parameters
can be used for both the pbs and cobalt launchers.- Parameters
exe (str) – executable
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_args (dict[str, str], optional) – arguments for run command, defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
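A sketch of configuring an ALPS launch; the binary name and counts are illustrative.
from smartsim.settings import AprunSettings

aprun = AprunSettings("./my_app")
aprun.set_tasks(32)
aprun.set_tasks_per_node(16)
aprun.set_cpus_per_task(2)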
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()[source]¶
Format the environment variables for aprun
- Returns
list of env vars
- Return type
list[str]
- format_run_args()[source]¶
Return a list of ALPS formatted run arguments
- Returns
list of ALPS arguments for these settings
- Return type
list[str]
- make_mpmd(aprun_settings)[source]¶
Make job an MPMD job
This method combines two AprunSettings into a single MPMD command joined with ‘:’
- Parameters
aprun_settings (AprunSettings) – AprunSettings instance
- reserved_run_args: set[str] = {}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)¶
Set binding
- Parameters
binding (str) – Binding
- set_broadcast(dest_path=None)¶
Copy executable file to allocated compute nodes
- Parameters
dest_path (str | None) – Path to copy an executable file
- set_cpu_bindings(bindings)[source]¶
Specifies the cores to which MPI processes are bound
This sets
--cpu-binding
- Parameters
bindings (list[int] | int) – List of cpu numbers
- set_cpus_per_task(cpus_per_task)[source]¶
Set the number of cpus to use per task
This sets
--cpus-per-pe
- Parameters
cpus_per_task (int) – number of cpus to use per task
- set_excluded_hosts(host_list)[source]¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- Raises
TypeError – if not str or list of str
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_hostlist_from_file(file_path)[source]¶
Use the contents of a file to set the node list
This sets
--node-list-file
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)[source]¶
Specify the real memory required per node
This sets --memory-per-pe in megabytes
- Parameters
memory_per_node (int) – Per PE memory limit in megabytes
- set_mpmd_preamble(preamble_lines)¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)[source]¶
Set the job to run in quiet mode
This sets
--quiet
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_task_map(task_mapping)¶
Set a task mapping
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)[source]¶
Set the number of tasks for this job
This sets
--pes
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)[source]¶
Set the number of tasks for this job
This sets
--pes-per-node
- Parameters
tasks_per_node (int) – number of tasks per node
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)[source]¶
Set the job to run in verbose mode
This sets the --debug arg to the highest level
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)[source]¶
Set the walltime of the job
Walltime is given in total number of seconds
- Parameters
walltime (str) – wall time
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
JsrunSettings¶
JsrunSettings
can be used on any system that supports the
IBM LSF launcher.
JsrunSettings
can be used in interactive sessions (on allocation)
and within batch launches (i.e. BsubBatchSettings).
| JsrunSettings.set_num_rs | Set the number of resource sets to use |
| JsrunSettings.set_cpus_per_rs | Set the number of cpus to use per resource set |
| JsrunSettings.set_gpus_per_rs | Set the number of gpus to use per resource set |
| JsrunSettings.set_rs_per_host | Set the number of resource sets to use per host |
| JsrunSettings.set_tasks | Set the number of tasks for this job |
| JsrunSettings.set_tasks_per_rs | Set the number of tasks per resource set |
| JsrunSettings.set_binding | Set binding |
| JsrunSettings.make_mpmd | Make step an MPMD (or SPMD) job. |
| JsrunSettings.set_mpmd_preamble | Set preamble used in ERF file. |
| JsrunSettings.update_env | Update the job environment variables |
| JsrunSettings.set_erf_sets | Set resource sets used for ERF (SPMD or MPMD) steps. |
| JsrunSettings.format_env_vars | Format environment variables. |
| JsrunSettings.format_run_args | Return a list of LSF formatted run arguments |
- class JsrunSettings(exe, exe_args=None, run_args=None, env_vars=None, **kwargs)[source]¶
Settings to run job with jsrun command
JsrunSettings should only be used on LSF-based systems.
- Parameters
should only be used on LSF-based systems.- Parameters
exe (str) – executable
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_args (dict[str, str], optional) – arguments for run command, defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
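A sketch of a resource-set configuration; the counts are illustrative of a Summit-style node.
from smartsim.settings import JsrunSettings

js = JsrunSettings("./my_app")
js.set_num_rs(6)        # six resource sets
js.set_cpus_per_rs(7)   # seven cpus per resource set
js.set_gpus_per_rs(1)   # one gpu per resource set
js.set_tasks_per_rs(1)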
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()[source]¶
Format environment variables. Each variable needs to be passed with --env. If a variable is set to None, its value is propagated from the current environment.
- Returns
formatted list of strings to export variables
- Return type
list[str]
- format_run_args()[source]¶
Return a list of LSF formatted run arguments
- Returns
list of LSF arguments for these settings
- Return type
list[str]
- make_mpmd(jsrun_settings=None)[source]¶
Make step an MPMD (or SPMD) job.
This method will activate job execution through an ERF file.
Optionally, this method adds an instance of JsrunSettings to the list of settings to be launched in the same ERF file.
- Parameters
jsrun_settings (JsrunSettings, optional) – JsrunSettings instance, defaults to None
- reserved_run_args: set[str] = {'chdir', 'h'}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)[source]¶
Set binding
This sets
--bind
- Parameters
binding (str) – Binding, e.g. packed:21
- set_broadcast(dest_path=None)¶
Copy executable file to allocated compute nodes
- Parameters
dest_path (str | None) – Path to copy an executable file
- set_cpu_bindings(bindings)¶
Set the cores to which MPI processes are bound
- Parameters
bindings (list[int] | int) – List specifying the cores to which MPI processes are bound
- set_cpus_per_rs(cpus_per_rs)[source]¶
Set the number of cpus to use per resource set
This sets
--cpu_per_rs
- Parameters
cpus_per_rs (int or str) – number of cpus to use per resource set or ALL_CPUS
- set_cpus_per_task(cpus_per_task)[source]¶
Set the number of cpus per task.
This function is an alias for set_cpus_per_rs.
- Parameters
cpus_per_task (int) – number of cpus per resource set
- set_erf_sets(erf_sets)[source]¶
Set resource sets used for ERF (SPMD or MPMD) steps.
erf_sets is a dictionary used to fill the ERF line representing these settings, e.g. {“host”: “1”, “cpu”: “{0:21}, {21:21}”, “gpu”: “*”} can be used to specify rank (or rank_count), hosts, cpus, gpus, and memory. The key rank is used to give specific ranks, as in {“rank”: “1, 2, 5”}, while the key rank_count is used to specify the count only, as in {“rank_count”: “3”}. If both are specified, only rank is used.
- Parameters
erf_sets (dict[str,str]) – dictionary of resources
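A sketch using the dictionary form described above:
js.set_erf_sets({
    "rank_count": "4",          # four ranks
    "host": "1",                # on host 1
    "cpu": "{0:21}, {21:21}",   # two cpu ranges
    "gpu": "*",                 # all gpus
})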
- set_excluded_hosts(host_list)¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- set_gpus_per_rs(gpus_per_rs)[source]¶
Set the number of gpus to use per resource set
This sets
--gpu_per_rs
- Parameters
gpus_per_rs (int or str) – number of gpus to use per resource set or ALL_GPUS
- set_hostlist(host_list)¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- set_hostlist_from_file(file_path)¶
Use the contents of a file to specify the hostlist for this job
- Parameters
file_path (str) – Path to the hostlist file
- set_individual_output(suffix=None)[source]¶
Set individual std output.
This sets --stdio_mode individual and inserts the suffix into the output name. The resulting output name will be self.name + suffix + .out.
- Parameters
suffix (str, optional) – Optional suffix to add to output file names, it can contain %j, %h, %p, or %t, as specified by jsrun options.
- set_memory_per_node(memory_per_node)[source]¶
Specify the number of megabytes of memory to assign to a resource set
Alias for set_memory_per_rs.
- Parameters
memory_per_node (int) – Number of megabytes per rs
- set_memory_per_rs(memory_per_rs)[source]¶
Specify the number of megabytes of memory to assign to a resource set
This sets
--memory_per_rs
- Parameters
memory_per_rs (int) – Number of megabytes per rs
- set_mpmd_preamble(preamble_lines)[source]¶
Set preamble used in ERF file. Typical lines include oversubscribe-cpu : allow or overlapping-rs : allow. Can be used to set launch_distribution. If it is not present, it will be inferred from the settings, or set to packed by default.
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of the ERF file.
- set_nodes(nodes)¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_num_rs(num_rs)[source]¶
Set the number of resource sets to use
This sets --nrs.
- Parameters
num_rs (int or str) – Number of resource sets or ALL_HOSTS
- set_quiet_launch(quiet)¶
Set the job to run in quiet mode
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_rs_per_host(rs_per_host)[source]¶
Set the number of resource sets to use per host
This sets
--rs_per_host
- Parameters
rs_per_host (int) – number of resource sets to use per host
- set_task_map(task_mapping)¶
Set a task mapping
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)[source]¶
Set the number of tasks for this job
This sets
--np
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)[source]¶
Set the number of tasks per resource set.
This function is an alias for set_tasks_per_rs.
- Parameters
tasks_per_node (int) – number of tasks per resource set
- set_tasks_per_rs(tasks_per_rs)[source]¶
Set the number of tasks per resource set
This sets
--tasks_per_rs
- Parameters
tasks_per_rs (int) – number of tasks per resource set
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)¶
Set the job to run in verbose mode
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)¶
Set the formatted walltime
- Parameters
walltime (str) – Time in format required by launcher
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
MpirunSettings¶
MpirunSettings
are for launching with OpenMPI. MpirunSettings
are
supported on Slurm, PBSpro, and Cobalt.
| MpirunSettings.set_tasks | Set the number of tasks for this job |
| MpirunSettings.set_hostlist | Set the hostlist for the mpirun command |
| MpirunSettings.set_cpus_per_task | Set the number of cpus per task for this job |
| MpirunSettings.set_task_map | Set mpirun task mapping |
| MpirunSettings.make_mpmd | Make a mpmd workload by combining two mpirun commands |
| MpirunSettings.add_exe_args | Add executable arguments to executable |
| MpirunSettings.format_run_args | Return a list of OpenMPI formatted run arguments |
| MpirunSettings.format_env_vars | Format the environment variables for mpirun |
| MpirunSettings.update_env | Update the job environment variables |
- class MpirunSettings(exe, exe_args=None, run_args=None, env_vars=None, **kwargs)[source]¶
Settings to run job with mpirun command (OpenMPI)
Note that environment variables can be passed with a None value to signify that they should be exported from the current environment
Any arguments passed in the run_args dict will be converted into mpirun arguments and prefixed with --. Values of None can be provided for arguments that do not have values.
- Parameters
exe (str) – executable
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_args (dict[str, str], optional) – arguments for run command, defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
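A sketch of how run_args map to mpirun flags; the binary and values are illustrative.
from smartsim.settings import MpirunSettings

# becomes: mpirun --npernode 4 --oversubscribe ./my_app
mpi = MpirunSettings("./my_app",
                     run_args={"npernode": "4", "oversubscribe": None})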
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()¶
Format the environment variables for mpirun
- Returns
list of env vars
- Return type
list[str]
- format_run_args()¶
Return a list of OpenMPI formatted run arguments
- Returns
list of OpenMPI arguments for these settings
- Return type
list[str]
- make_mpmd(mpirun_settings)¶
Make an MPMD workload by combining two mpirun commands
This connects the two settings to be executed with a single Model instance
- Parameters
mpirun_settings (MpirunSettings) – MpirunSettings instance
- reserved_run_args: set[str] = {'wd', 'wdir'}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)¶
Set binding
- Parameters
binding (str) – Binding
- set_broadcast(dest_path=None)¶
Copy the specified executable(s) to remote machines
This sets
--preload-binary
- Parameters
dest_path (str | None) – Destination path (Ignored)
- set_cpu_bindings(bindings)¶
Set the cores to which MPI processes are bound
- Parameters
bindings (list[int] | int) – List specifying the cores to which MPI processes are bound
- set_cpus_per_task(cpus_per_task)¶
Set the number of cpus per task for this job
This sets --cpus-per-proc
note: this option has been deprecated in OpenMPI 4.0+ and will soon be replaced.
- Parameters
cpus_per_task (int) – number of cpus per task
- set_excluded_hosts(host_list)¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- set_hostlist(host_list)¶
Set the hostlist for the mpirun command
This sets --host
- Parameters
host_list (str | list[str]) – list of host names
- Raises
TypeError – if not str or list of str
- set_hostlist_from_file(file_path)¶
Use the contents of a file to set the hostlist
This sets
--hostfile
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)¶
Set the amount of memory required per node in megabytes
- Parameters
memory_per_node (int) – Number of megabytes per node
- set_mpmd_preamble(preamble_lines)¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)¶
Set the job to run in quiet mode
This sets
--quiet
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_task_map(task_mapping)¶
Set mpirun task mapping
This sets --map-by <mapping>
For examples, see the man page for mpirun
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)¶
Set the number of tasks for this job
This sets
--n
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)¶
Set the number of tasks per node
- Parameters
tasks_per_node (int) – number of tasks to launch per node
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)¶
Set the job to run in verbose mode
This sets
--verbose
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)¶
Set the maximum number of seconds that a job will run
This sets
--timeout
- Parameters
walltime (str) – number like string of seconds that a job will run in secs
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
MpiexecSettings¶
MpiexecSettings are for launching with OpenMPI’s mpiexec. MpiexecSettings are
supported on Slurm, PBSpro, and Cobalt.
| MpiexecSettings.set_tasks | Set the number of tasks for this job |
| MpiexecSettings.set_hostlist | Set the hostlist for the mpirun command |
| MpiexecSettings.set_cpus_per_task | Set the number of cpus per task for this job |
| MpiexecSettings.set_task_map | Set mpirun task mapping |
| MpiexecSettings.make_mpmd | Make a mpmd workload by combining two mpirun commands |
| MpiexecSettings.add_exe_args | Add executable arguments to executable |
| MpiexecSettings.format_run_args | Return a list of OpenMPI formatted run arguments |
| MpiexecSettings.format_env_vars | Format the environment variables for mpirun |
| MpiexecSettings.update_env | Update the job environment variables |
- class MpiexecSettings(exe, exe_args=None, run_args=None, env_vars=None, **kwargs)[source]¶
Settings to run job with mpiexec command (OpenMPI)
Note that environment variables can be passed with a None value to signify that they should be exported from the current environment
Any arguments passed in the run_args dict will be converted into mpiexec arguments and prefixed with --. Values of None can be provided for arguments that do not have values.
- Parameters
exe (str) – executable
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_args (dict[str, str], optional) – arguments for run command, defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()¶
Format the environment variables for mpirun
- Returns
list of env vars
- Return type
list[str]
- format_run_args()¶
Return a list of OpenMPI formatted run arguments
- Returns
list of OpenMPI arguments for these settings
- Return type
list[str]
- make_mpmd(mpirun_settings)¶
Make an MPMD workload by combining two mpirun commands
This connects the two settings to be executed with a single Model instance
- Parameters
mpirun_settings (MpirunSettings) – MpirunSettings instance
- reserved_run_args: set[str] = {'wd', 'wdir'}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
# iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)¶
Set binding
- Parameters
binding (str) – Binding
- set_broadcast(dest_path=None)¶
Copy the specified executable(s) to remote machines
This sets
--preload-binary
- Parameters
dest_path (str | None) – Destination path (Ignored)
- set_cpu_bindings(bindings)¶
Set the cores to which MPI processes are bound
- Parameters
bindings (list[int] | int) – List specifying the cores to which MPI processes are bound
- set_cpus_per_task(cpus_per_task)¶
Set the number of cpus per task for this job
This sets --cpus-per-proc
note: this option has been deprecated in OpenMPI 4.0+ and will soon be replaced.
- Parameters
cpus_per_task (int) – number of cpus per task
- set_excluded_hosts(host_list)¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- set_hostlist(host_list)¶
Set the hostlist for the mpirun command
This sets --host
- Parameters
host_list (str | list[str]) – list of host names
- Raises
TypeError – if not str or list of str
- set_hostlist_from_file(file_path)¶
Use the contents of a file to set the hostlist
This sets
--hostfile
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)¶
Set the amount of memory required per node in megabytes
- Parameters
memory_per_node (int) – Number of megabytes per node
- set_mpmd_preamble(preamble_lines)¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)¶
Set the job to run in quiet mode
This sets
--quiet
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_task_map(task_mapping)¶
Set mpirun task mapping
This sets --map-by <mapping>
For examples, see the man page for
mpirun
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)¶
Set the number of tasks for this job
This sets
--n
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)¶
Set the number of tasks per node
- Parameters
tasks_per_node (int) – number of tasks to launch per node
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)¶
Set the job to run in verbose mode
This sets
--verbose
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)¶
Set the maximum number of seconds that a job will run
This sets
--timeout
- Parameters
walltime (str) – number of seconds the job should run, formatted as a string
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
OrterunSettings¶
OrterunSettings are for launching with OpenMPI's orterun. OrterunSettings are supported on Slurm, PBSPro, and Cobalt.
|
Set the number of tasks for this job |
|
Set the hostlist for the |
|
Set the number of tasks for this job |
|
Set |
|
Make a mpmd workload by combining two |
Add executable arguments to executable |
|
Return a list of OpenMPI formatted run arguments |
|
Format the environment variables for mpirun |
|
|
Update the job environment variables |
- class OrterunSettings(exe, exe_args=None, run_args=None, env_vars=None, **kwargs)[source]¶
Settings to run job with orterun command (OpenMPI)
Note that environment variables can be passed with a None value to signify that they should be exported from the current environment
Any arguments passed in the run_args dict will be converted into orterun arguments and prefixed with --. Values of None can be provided for arguments that do not have values.
- Parameters
exe (str) – executable
exe_args (str | list[str], optional) – executable arguments, defaults to None
run_args (dict[str, str], optional) – arguments for run command, defaults to None
env_vars (dict[str, str], optional) – environment vars to launch job with, defaults to None
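As a quick illustration, direct construction might look like the following sketch (host names are placeholders; the import path assumes the smartsim.settings module layout):
from smartsim.settings import OrterunSettings

# launch "echo hello" with orterun using 8 tasks across two nodes
settings = OrterunSettings("echo", exe_args="hello")
settings.set_tasks(8)
settings.set_hostlist(["node1", "node2"])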
- add_exe_args(args)¶
Add executable arguments to executable
- Parameters
args (str | list[str]) – executable arguments
- Raises
TypeError – if exe args are not strings
- format_env_vars()¶
Format the environment variables for mpirun
- Returns
list of env vars
- Return type
list[str]
- format_run_args()¶
Return a list of OpenMPI formatted run arguments
- Returns
list of OpenMPI arguments for these settings
- Return type
list[str]
- make_mpmd(mpirun_settings)¶
Make an MPMD workload by combining two mpirun commands
This connects the two settings to be executed with a single Model instance
- Parameters
mpirun_settings (MpirunSettings) – MpirunSettings instance
- reserved_run_args: set[str] = {'wd', 'wdir'}¶
- property run_command¶
Return the launch binary used to launch the executable
Attempt to expand the path to the executable if possible
- Returns
launch binary e.g. mpiexec
- Type
str | None
- set(arg, value=None, condition=True)¶
Allows users to set individual run arguments.
A method that allows users to set run arguments after object instantiation. Does basic formatting such as stripping leading dashes. If the argument has been set previously, this method will log a warning but ultimately comply.
Conditional expressions may be passed to the condition parameter. If the expression evaluates to True, the argument will be set. If not, an info message is logged and no further operation is performed.
Basic Usage
rs = RunSettings("python")
rs.set("an-arg", "a-val")
rs.set("a-flag")
rs.format_run_args()  # returns ["an-arg", "a-val", "a-flag", "None"]
Slurm Example with Conditional Setting
import socket

rs = SrunSettings("echo", "hello")
rs.set_tasks(1)
rs.set("exclusive")

# Only set this argument if condition param evals True
# Otherwise log and NOP
rs.set("partition", "debug",
       condition=socket.gethostname()=="testing-system")

rs.format_run_args()
# returns ["exclusive", "None", "partition", "debug"]
#   iff socket.gethostname()=="testing-system"
# otherwise returns ["exclusive", "None"]
- Parameters
arg (str) – name of the argument
value (str | None) – value of the argument
condition (bool) – set the argument if condition evaluates to True
- set_binding(binding)¶
Set binding
- Parameters
binding (str) – Binding
- set_broadcast(dest_path=None)¶
Copy the specified executable(s) to remote machines
This sets
--preload-binary
- Parameters
dest_path (str | None) – Destination path (Ignored)
- set_cpu_bindings(bindings)¶
Set the cores to which MPI processes are bound
- Parameters
bindings (list[int] | int) – list specifying the cores to which MPI processes are bound
- set_cpus_per_task(cpus_per_task)¶
Set the number of cpus per task for this job
This sets --cpus-per-proc
Note: this option has been deprecated in OpenMPI 4.0+ and will soon be replaced.
- Parameters
cpus_per_task (int) – number of cpus per task
- set_excluded_hosts(host_list)¶
Specify a list of hosts to exclude for launching this job
- Parameters
host_list (str | list[str]) – hosts to exclude
- set_hostlist(host_list)¶
Set the hostlist for the mpirun command
This sets --host
- Parameters
host_list (str | list[str]) – list of host names
- Raises
TypeError – if not str or list of str
- set_hostlist_from_file(file_path)¶
Use the contents of a file to set the hostlist
This sets
--hostfile
- Parameters
file_path (str) – Path to the hostlist file
- set_memory_per_node(memory_per_node)¶
Set the amount of memory required per node in megabytes
- Parameters
memory_per_node (int) – Number of megabytes per node
- set_mpmd_preamble(preamble_lines)¶
Set preamble to a file to make a job MPMD
- Parameters
preamble_lines (list[str]) – lines to put at the beginning of a file.
- set_nodes(nodes)¶
Set the number of nodes
- Parameters
nodes (int) – number of nodes to run with
- set_quiet_launch(quiet)¶
Set the job to run in quiet mode
This sets
--quiet
- Parameters
quiet (bool) – Whether the job should be run quietly
- set_task_map(task_mapping)¶
Set mpirun task mapping
This sets --map-by <mapping>
For examples, see the man page for
mpirun
- Parameters
task_mapping (str) – task mapping
- set_tasks(tasks)¶
Set the number of tasks for this job
This sets
--n
- Parameters
tasks (int) – number of tasks
- set_tasks_per_node(tasks_per_node)¶
Set the number of tasks per node
- Parameters
tasks_per_node (int) – number of tasks to launch per node
- set_time(hours=0, minutes=0, seconds=0)¶
Automatically format and set wall time
- Parameters
hours (int) – number of hours to run job
minutes (int) – number of minutes to run job
seconds (int) – number of seconds to run job
- set_verbose_launch(verbose)¶
Set the job to run in verbose mode
This sets
--verbose
- Parameters
verbose (bool) – Whether the job should be run verbosely
- set_walltime(walltime)¶
Set the maximum number of seconds that a job will run
This sets
--timeout
- Parameters
walltime (str) – number of seconds the job should run, formatted as a string
- update_env(env_vars)¶
Update the job environment variables
To fully inherit the current user environment, add the workload-manager-specific flag to the launch command through the add_exe_args() method. For example, --export=ALL for Slurm, or -V for PBS/aprun.
- Parameters
env_vars (dict[str, Union[str, int, float, bool]]) – environment variables to update or add
- Raises
TypeError – if env_vars values cannot be coerced to strings
SbatchSettings¶
SbatchSettings are used for launching batches onto Slurm WLM systems.
|
Set the account for this batch job |
|
Set the command used to launch the batch e.g. |
|
Set the number of nodes for this batch job |
|
Specify the hostlist for this job |
|
Set the partition for the batch job |
|
alias for set_partition |
|
Set the walltime of the job |
Get the formatted batch arguments for a preview |
- class SbatchSettings(nodes=None, time='', account=None, batch_args=None, **kwargs)[source]¶
Specify run parameters for a Slurm batch job
Slurm sbatch arguments can be written into batch_args as a dictionary, e.g. {'ntasks': 1}
If the argument doesn't have a parameter, put None as the value, e.g. {'exclusive': None}
Initialization values provided (nodes, time, account) will overwrite the same arguments in batch_args if present
- Parameters
nodes (int, optional) – number of nodes, defaults to None
time (str, optional) – walltime for job, e.g. “10:00:00” for 10 hours
account (str, optional) – account for job, defaults to None
batch_args (dict[str, str], optional) – extra batch arguments, defaults to None
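For illustration, a minimal sketch of a Slurm batch configuration (account, partition, and any hosts are placeholders):
from smartsim.settings import SbatchSettings

sbatch = SbatchSettings(nodes=4, time="10:00:00", account="A123",
                        batch_args={"exclusive": None})
sbatch.set_partition("debug")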
- add_preamble(lines)¶
Add lines to the batch file preamble. The lines are just written (unmodified) at the beginning of the batch file (after the WLM directives) and can be used to e.g. start virtual environments before running the executables.
- Parameters
lines (str | list[str]) – lines to add to preamble
- property batch_cmd¶
Return the batch command
Tests to see if we can expand the batch command path. If we can, then returns the expanded batch command. If we cannot, returns the batch command as is.
- Returns
batch command
- Type
str
- format_batch_args()[source]¶
Get the formatted batch arguments for a preview
- Returns
batch arguments for Sbatch
- Return type
list[str]
- set_account(account)[source]¶
Set the account for this batch job
- Parameters
account (str) – account id
- set_batch_command(command)¶
Set the command used to launch the batch e.g.
sbatch
- Parameters
command (str) – batch command
- set_cpus_per_task(cpus_per_task)[source]¶
Set the number of cpus to use per task
This sets
--cpus-per-task
- Parameters
cpus_per_task (int) – number of cpus to use per task
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_nodes(num_nodes)[source]¶
Set the number of nodes for this batch job
- Parameters
num_nodes (int) – number of nodes
- set_partition(partition)[source]¶
Set the partition for the batch job
- Parameters
partition (str) – partition name
QsubBatchSettings¶
QsubBatchSettings are used to configure jobs that should be launched as a batch on PBSPro systems.
|
Set the account for this batch job |
|
Set the command used to launch the batch e.g. |
|
Set the number of nodes for this batch job |
|
Set the number of cpus obtained in each node. |
|
Set the queue for the batch job |
Set a resource value for the Qsub batch |
|
|
Set the walltime of the job |
Get the formatted batch arguments for a preview |
- class QsubBatchSettings(nodes=None, ncpus=None, time=None, queue=None, account=None, resources=None, batch_args=None, **kwargs)[source]¶
Specify qsub batch parameters for a job
nodes and ncpus are used to create the select statement for PBS if a select statement is not included in resources. If both are supplied, the value for the select statement supplied in resources will override.
- Parameters
nodes (int, optional) – number of nodes for batch, defaults to None
ncpus (int, optional) – number of cpus per node, defaults to None
time (str, optional) – walltime for batch job, defaults to None
queue (str, optional) – queue to run batch in, defaults to None
account (str, optional) – account for batch launch, defaults to None
resources (dict[str, str], optional) – overrides for resource arguments, defaults to None
batch_args (dict[str, str], optional) – overrides for PBS batch arguments, defaults to None
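A minimal sketch (the queue name is a placeholder); note that a select statement supplied through resources takes precedence over nodes and ncpus:
from smartsim.settings import QsubBatchSettings

qsub = QsubBatchSettings(nodes=2, ncpus=36, time="01:00:00", queue="workq")
# equivalently, supply the select statement directly and it will override:
# qsub = QsubBatchSettings(time="01:00:00", queue="workq",
#                          resources={"select": "2:ncpus=36"})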
- add_preamble(lines)¶
Add lines to the batch file preamble. The lines are just written (unmodified) at the beginning of the batch file (after the WLM directives) and can be used to e.g. start virtual environments before running the executables.
- Parameters
lines (str | list[str]) – lines to add to preamble
- property batch_cmd¶
Return the batch command
Tests to see if we can expand the batch command path. If we can, then returns the expanded batch command. If we cannot, returns the batch command as is.
- Returns
batch command
- Type
str
- format_batch_args()[source]¶
Get the formatted batch arguments for a preview
- Returns
batch arguments for Qsub
- Return type
list[str]
- Raises
ValueError – if options are supplied without values
- set_batch_command(command)¶
Set the command used to launch the batch e.g.
sbatch
- Parameters
command (str) – batch command
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_ncpus(num_cpus)[source]¶
Set the number of cpus obtained in each node.
If a select argument is provided in QsubBatchSettings.resources, then this value will be overridden
- Parameters
num_cpus (int) – number of cpus per node in select
- set_nodes(num_nodes)[source]¶
Set the number of nodes for this batch job
If a select argument is provided in QsubBatchSettings.resources, this value will be overridden
- Parameters
num_nodes (int) – number of nodes
CobaltBatchSettings¶
CobaltBatchSettings are used to configure jobs that should be launched as a batch on Cobalt systems. They closely mimic the QsubBatchSettings for PBSPro.
|
Set the account for this batch job |
Set the command used to launch the batch e.g. |
|
|
Set the number of nodes for this batch job |
Set the queue for the batch job |
|
|
Set the walltime of the job |
Get the formatted batch arguments for a preview |
- class CobaltBatchSettings(nodes=None, time='', queue=None, account=None, batch_args=None, **kwargs)[source]¶
Specify settings for a Cobalt qsub batch launch
If the argument doesn't have a parameter, put None as the value, e.g. {'exclusive': None}
Initialization values provided (nodes, time, account) will overwrite the same arguments in batch_args if present
- Parameters
nodes (int, optional) – number of nodes, defaults to None
time (str, optional) – walltime for job, e.g. “10:00:00” for 10 hours, defaults to empty str
queue (str, optional) – queue to launch job in, defaults to None
account (str, optional) – account for job, defaults to None
batch_args (dict[str, str], optional) – extra batch arguments, defaults to None
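A minimal sketch (the queue name and the extra batch argument are placeholders):
from smartsim.settings import CobaltBatchSettings

cqsub = CobaltBatchSettings(nodes=8, time="02:00:00", queue="default",
                            batch_args={"debug": None})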
- add_preamble(lines)¶
Add lines to the batch file preamble. The lines are just written (unmodified) at the beginning of the batch file (after the WLM directives) and can be used to e.g. start virtual environments before running the executables.
- Parameters
lines (str | list[str]) – lines to add to preamble
- property batch_cmd¶
Return the batch command
Tests to see if we can expand the batch command path. If we can, then returns the expanded batch command. If we cannot, returns the batch command as is.
- Returns
batch command
- Type
str
- format_batch_args()[source]¶
Get the formatted batch arguments for a preview
- Returns
list of batch arguments for Cobalt qsub
- Return type
list[str]
- set_batch_command(command)¶
Set the command used to launch the batch e.g.
sbatch
- Parameters
command (str) – batch command
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_nodes(num_nodes)[source]¶
Set the number of nodes for this batch job
- Parameters
num_nodes (int) – number of nodes
BsubBatchSettings¶
BsubBatchSettings are used to configure jobs that should be launched as a batch on LSF systems.
|
Set the walltime |
Set SMTs |
|
|
Set the project |
|
Set the number of nodes for this batch job |
Set allocation for expert mode. |
|
|
Specify the hostlist for this job |
|
Set the number of tasks for this job |
Get the formatted batch arguments for a preview |
- class BsubBatchSettings(nodes=None, time=None, project=None, batch_args=None, smts=None, **kwargs)[source]¶
Specify bsub batch parameters for a job
- Parameters
nodes (int, optional) – number of nodes for batch, defaults to None
time (str, optional) – walltime for batch job in format hh:mm, defaults to None
project (str, optional) – project for batch launch, defaults to None
batch_args (dict[str, str], optional) – overrides for LSF batch arguments, defaults to None
smts (int, optional) – SMTs, defaults to None
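A minimal sketch (project and queue names are placeholders):
from smartsim.settings import BsubBatchSettings

bsub = BsubBatchSettings(nodes=2, time="01:00", project="P123", smts=4)
bsub.set_queue("batch")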
- add_preamble(lines)¶
Add lines to the batch file preamble. The lines are just written (unmodified) at the beginning of the batch file (after the WLM directives) and can be used to e.g. start virtual environments before running the executables.
- Parameters
lines (str | list[str]) – lines to add to preamble
- property batch_cmd¶
Return the batch command
Tests to see if we can expand the batch command path. If we can, then returns the expanded batch command. If we cannot, returns the batch command as is.
- Returns
batch command
- Type
str
- format_batch_args()[source]¶
Get the formatted batch arguments for a preview
- Returns
list of batch arguments for Bsub
- Return type
list[str]
- set_account(account)[source]¶
Set the project
This function is an alias for set_project.
- Parameters
account (str) – project name
- set_batch_command(command)¶
Set the command used to launch the batch e.g.
sbatch
- Parameters
command (str) – batch command
- set_expert_mode_req(res_req, slots)[source]¶
Set allocation for expert mode. This will activate expert mode (-csm) and disregard all other allocation options.
This sets -csm -n slots -R res_req
- set_hostlist(host_list)[source]¶
Specify the hostlist for this job
- Parameters
host_list (str | list[str]) – hosts to launch on
- Raises
TypeError – if not str or list of str
- set_nodes(num_nodes)[source]¶
Set the number of nodes for this batch job
This sets -nnodes.
- Parameters
num_nodes (int) – number of nodes
- set_queue(queue)[source]¶
Set the queue for this job
- Parameters
queue (str) – The queue to submit the job on
- set_smts(smts)[source]¶
Set SMTs
This sets -alloc_flags. If the user sets SMT explicitly through -alloc_flags, then that takes precedence.
- Parameters
smts (int) – SMT (e.g on Summit: 1, 2, or 4)
Singularity¶
Singularity is a type of Container that can be passed to a RunSettings class or child class to enable running the workload in a container.
- class Singularity(*args, **kwargs)[source]¶
Singularity (Apptainer) container type. To be passed into a RunSettings class initializer or Experiment.create_run_settings.
Note
Singularity integration is currently tested with Apptainer 1.0 with Slurm and PBS workload managers only.
Also, note that user-defined bind paths (mount argument) may be disabled by a system administrator.
- Parameters
image (str) – local or remote path to container image, e.g. docker://sylabsio/lolcow
args (str | list[str], optional) – arguments to ‘singularity exec’ command
mount (str | list[str] | dict[str, str], optional) – paths to mount (bind) from host machine into image.
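A sketch of running a workload inside a container; the import path, the container keyword of Experiment.create_run_settings, and script.py are assumptions to verify against the installed SmartSim version:
from smartsim import Experiment
from smartsim.settings.containers import Singularity  # assumed import path

exp = Experiment("container-example", launcher="slurm")
container = Singularity("docker://sylabsio/lolcow", mount="/lus/scratch")
# hand the container to the run settings factory (assumed keyword name)
rs = exp.create_run_settings("python", exe_args="script.py", container=container)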
Orchestrator¶
Orchestrator¶
- class Orchestrator(port=6379, interface='lo', launcher='local', run_command='auto', db_nodes=1, batch=False, hosts=None, account=None, time=None, alloc=None, single_cmd=False, **kwargs)[source]¶
The Orchestrator is an in-memory database that can be launched alongside entities in SmartSim. Data can be transferred between entities by using one of the Python, C, C++ or Fortran clients within an entity.
Initialize an Orchestrator reference for local launch
- Parameters
port (int, optional) – TCP/IP port, defaults to 6379
interface (str, optional) – network interface, defaults to “lo”
Extra configurations for RedisAI
See https://oss.redislabs.com/redisai/configuration/
- Parameters
threads_per_queue (int, optional) – threads per GPU device
inter_op_threads (int, optional) – threads across CPU operations
intra_op_threads (int, optional) – threads per CPU operation
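As a sketch, a local Orchestrator can be launched and queried through an Experiment:
from smartsim import Experiment
from smartsim.database import Orchestrator

exp = Experiment("db-example", launcher="local")
db = Orchestrator(port=6379, interface="lo")

exp.start(db)
print(db.get_address())  # e.g. ["127.0.0.1:6379"]
exp.stop(db)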
- property batch¶
- enable_checkpoints(frequency)[source]¶
Sets the database’s save configuration to save the DB every ‘frequency’ seconds given that at least one write operation against the DB occurred in that time. For example, if frequency is 900, then the database will save to disk after 900 seconds if there is at least 1 change to the dataset.
- Parameters
frequency (int) – the given number of seconds before the DB saves
- get_address()[source]¶
Return database addresses
- Returns
addresses
- Return type
list[str]
- Raises
SmartSimError – If database address cannot be found or is not active
- property hosts¶
Return the hostnames of orchestrator instance hosts
Note that this will only be populated after the orchestrator has been launched by SmartSim.
- Returns
hostnames
- Return type
list[str]
- is_active()[source]¶
Check if the database is active
- Returns
True if database is active, False otherwise
- Return type
bool
- property num_shards¶
Return the number of DB shards contained in the orchestrator. This might differ from the number of DBNode objects, as each DBNode may start more than one shard (e.g. with MPMD).
- Returns
num_shards
- Return type
int
- set_batch_arg(arg, value)[source]¶
Set a batch argument the orchestrator should launch with
Some commonly used arguments such as --job-name are used by SmartSim and will not be allowed to be set.
- Parameters
arg (str) – batch argument to set e.g. “exclusive”
value (str | None) – batch param - set to None if no param value
- Raises
SmartSimError – if orchestrator not launching as batch
- set_cpus(num_cpus)[source]¶
Set the number of CPUs available to each database shard
This effectively will determine how many cpus can be used for compute threads, background threads, and network I/O.
- Parameters
num_cpus (int) – number of cpus to set
- set_db_conf(key, value)[source]¶
Set any valid configuration at runtime without the need to restart the database. All configuration parameters that are set are immediately loaded by the database and will take effect starting with the next command executed.
- Parameters
key (str) – the configuration parameter
value (str) – the database configuration parameter’s new value
- set_eviction_strategy(strategy)[source]¶
Sets how the database will select what to remove when ‘maxmemory’ is reached. The default is noeviction.
- Parameters
strategy (str) – The max memory policy to use e.g. “volatile-lru”, “allkeys-lru”, etc.
- Raises
SmartSimError – If ‘strategy’ is an invalid maxmemory policy
SmartSimError – If database is not active
- set_hosts(host_list)[source]¶
Specify the hosts for the Orchestrator to launch on
- Parameters
host_list (str, list[str]) – list of host (compute node names)
- Raises
TypeError – if wrong type
- set_max_clients(clients=50000)[source]¶
Sets the max number of connected clients at the same time. When the number of DB shards contained in the orchestrator is more than two, then every node will use two connections, one incoming and another outgoing.
- Parameters
clients (int, optional) – the maximum number of connected clients
- set_max_memory(mem)[source]¶
Sets the max memory configuration. By default there is no memory limit. Setting max memory to zero also results in no memory limit. Once a limit is surpassed, keys will be removed according to the eviction strategy. The specified memory size is case insensitive and supports the typical forms:
1k => 1000 bytes
1kb => 1024 bytes
1m => 1000000 bytes
1mb => 1024*1024 bytes
1g => 1000000000 bytes
1gb => 1024*1024*1024 bytes
- Parameters
mem (str) – the desired max memory size e.g. 3gb
- Raises
SmartSimError – If ‘mem’ is an invalid memory value
SmartSimError – If database is not active
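For instance, with db referring to an active Orchestrator (a sketch):
db.set_max_memory("3gb")
db.set_eviction_strategy("allkeys-lru")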
- set_max_message_size(size=1073741824)[source]¶
Sets the database’s memory size limit for bulk requests, which are elements representing single strings. The default is 1 gigabyte. Message size must be greater than or equal to 1mb. The specified memory size should be an integer that represents the number of bytes. For example, to set the max message size to 1gb, use 1024*1024*1024.
- Parameters
size (int, optional) – maximum message size in bytes
- set_path(new_path)¶
- set_run_arg(arg, value)[source]¶
Set a run argument the orchestrator should launch each node with (it will be passed to jrun)
Some commonly used arguments are used by SmartSim and will not be allowed to be set. For example, “n”, “N”, etc.
- Parameters
arg (str) – run argument to set
value (str | None) – run parameter - set to None if no parameter value
- set_walltime(walltime)[source]¶
Set the batch walltime of the orchestrator
Note: This will only affect orchestrators launched as a batch
- Parameters
walltime (str) – amount of time e.g. 10 hours is 10:00:00
- Raises
SmartSimError – if orchestrator isn’t launching as batch
- property type¶
Return the name of the class
Model¶
|
Initialize a |
|
Attach files to an entity for generation |
|
Colocate an Orchestrator instance with this Model at runtime. |
Convert parameters to command line arguments and update run settings. |
|
|
Register future communication between entities. |
If called, the entity will prefix its keys with its own model name |
|
If called, the entity will not prefix its keys with its own model name |
|
Inquire as to whether this entity will prefix its keys with its name |
- class Model(name, params, path, run_settings, params_as_args=None)[source]¶
Bases: smartsim.entity.entity.SmartSimEntity
Initialize a Model
- Parameters
name (str) – name of the model
params (dict) – model parameters for writing into configuration files or to be passed as command line arguments to executable.
path (str) – path to output, error, and configuration files
run_settings (RunSettings) – launcher settings specified in the experiment
params_as_args (list[str]) – list of parameters which have to be interpreted as command line arguments to be added to run_settings
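Models are usually obtained from Experiment.create_model rather than initialized directly; a minimal sketch (simulation.py is a placeholder):
from smartsim import Experiment

exp = Experiment("model-example", launcher="local")
rs = exp.create_run_settings("python", exe_args="simulation.py")
model = exp.create_model("sim", rs, params={"STEPS": 100})
exp.start(model, block=True)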
- add_function(name, function=None, device='CPU', devices_per_node=1)[source]¶
TorchScript function to launch with this Model instance
Each function added will be loaded into a non-converged orchestrator prior to the execution of this Model instance.
For converged orchestrators, the add_script() method should be used.
Device selection is either “GPU” or “CPU”. If many devices are present, a number can be passed for specification, e.g. “GPU:1”.
Setting devices_per_node=N, with N greater than one, will result in the function being stored in the first N devices of type device.
- Parameters
name (str) – key to store function under
function (str, optional) – TorchScript code of the function
device (str, optional) – device for script execution, defaults to “CPU”
devices_per_node (int) – number of devices on each host
- add_ml_model(name, backend, model=None, model_path=None, device='CPU', devices_per_node=1, batch_size=0, min_batch_size=0, tag='', inputs=None, outputs=None)[source]¶
A TF, TF-lite, PT, or ONNX model to load into the DB at runtime
Each ML Model added will be loaded into an orchestrator (converged or not) prior to the execution of this Model instance
One of either model (in memory representation) or model_path (file) must be provided
- Parameters
name (str) – key to store model under
model (byte string, optional) – model in memory
model_path (str, optional) – file path to serialized model
backend (str) – name of the backend (TORCH, TF, TFLITE, ONNX)
device (str, optional) – name of device for execution, defaults to “CPU”
batch_size (int, optional) – batch size for execution, defaults to 0
min_batch_size (int, optional) – minimum batch size for model execution, defaults to 0
tag (str, optional) – additional tag for model information, defaults to “”
inputs (list[str], optional) – model inputs (TF only), defaults to None
outputs (list[str], optional) – model outputs (TF only), defaults to None
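For example, a serialized model saved on disk might be attached as follows (a sketch; cnn.pt is a placeholder file, and model refers to a Model created as in the sketch above):
model.add_ml_model("cnn", backend="TORCH", model_path="./cnn.pt", device="GPU")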
- add_script(name, script=None, script_path=None, device='CPU', devices_per_node=1)[source]¶
TorchScript to launch with this Model instance
Each script added to the model will be loaded into an orchestrator (converged or not) prior to the execution of this Model instance
Device selection is either “GPU” or “CPU”. If many devices are present, a number can be passed for specification e.g. “GPU:1”.
Setting devices_per_node=N, with N greater than one, will result in the script being stored in the first N devices of type device.
One of either script (in memory string representation) or script_path (file) must be provided
- Parameters
name (str) – key to store script under
script (str, optional) – TorchScript code
script_path (str, optional) – path to TorchScript code
device (str, optional) – device for script execution, defaults to “CPU”
devices_per_node (int) – number of devices on each host
- attach_generator_files(to_copy=None, to_symlink=None, to_configure=None)[source]¶
Attach files to an entity for generation
Attach files needed for the entity that, upon generation, will be located in the path of the entity.
During generation, files “to_copy” are copied into the path of the entity, and files “to_symlink” are symlinked into the path of the entity.
Files “to_configure” are text-based model input files where parameters for the model are set. Note that only models support the “to_configure” field. These files must have fields tagged that correspond to the values the user would like to change. The tag is settable but defaults to a semicolon, e.g. THERMO = ;10;
- Parameters
to_copy (list, optional) – files to copy, defaults to []
to_symlink (list, optional) – files to symlink, defaults to []
to_configure (list, optional) – input files with tagged parameters, defaults to []
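A sketch of configuring a tagged input file (in.config is a placeholder file containing a tagged line such as THERMO = ;10;, and exp and rs are assumed from the sketch above):
model = exp.create_model("sim", rs, params={"THERMO": 20})
model.attach_generator_files(to_configure=["in.config"])
exp.generate(model)  # writes in.config into the model path with THERMO = 20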
- colocate_db(port=6379, db_cpus=1, limit_app_cpus=True, ifname='lo', debug=False, **kwargs)[source]¶
Colocate an Orchestrator instance with this Model at runtime.
This method will initialize settings which add an unsharded (not connected) database to this Model instance. Only this Model will be able to communicate with this colocated database by using the loopback TCP interface or Unix Domain sockets (UDS coming soon).
Extra parameters for the db can be passed through kwargs. This includes many performance, caching and inference settings.
ex. kwargs = {
    maxclients: 100000,
    threads_per_queue: 1,
    inter_op_threads: 1,
    intra_op_threads: 1,
    server_threads: 2  # keydb only
}
Generally these don’t need to be changed.
- Parameters
port (int, optional) – port to use for orchestrator database, defaults to 6379
db_cpus (int, optional) – number of cpus to use for orchestrator, defaults to 1
limit_app_cpus (bool, optional) – whether to limit the number of cpus used by the app, defaults to True
ifname (str, optional) – interface to use for orchestrator, defaults to “lo”
debug (bool, optional) – launch Model with extra debug information about the co-located db
kwargs (dict, optional) – additional keyword arguments to pass to the orchestrator database
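A sketch of colocating a database with a model (inference.py is a placeholder):
from smartsim import Experiment

exp = Experiment("colo-example", launcher="slurm")
rs = exp.create_run_settings("python", exe_args="inference.py")
model = exp.create_model("colocated_sim", rs)
model.colocate_db(port=6780, db_cpus=2, ifname="lo")
exp.start(model)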
- property colocated¶
Return True if this Model will run with a colocated Orchestrator
- disable_key_prefixing()[source]¶
If called, the entity will not prefix its keys with its own model name
- register_incoming_entity(incoming_entity)[source]¶
Register future communication between entities.
Registers the named data sources that this entity has access to by storing the key_prefix associated with that entity
- Parameters
incoming_entity (SmartSimEntity) – The entity that data will be received from
- Raises
SmartSimError – if incoming entity has already been registered
- property type¶
Return the name of the class
Ensemble¶
|
Initialize an Ensemble of Model instances. |
|
|
|
Add a model to this ensemble |
|
Attach files to each model within the ensemble for generation |
Register future communication between entities. |
|
If called, all models within this ensemble will prefix their keys with its own model name. |
|
Inquire as to whether each model within the ensemble will prefix its keys |
- class Ensemble(name, params, params_as_args=None, batch_settings=None, run_settings=None, perm_strat='all_perm', **kwargs)[source]¶
Bases: smartsim.entity.entityList.EntityList
Ensemble is a group of Model instances that can be treated as a reference to a single instance.
Initialize an Ensemble of Model instances.
The kwargs argument can be used to pass custom input parameters to the permutation strategy.
- Parameters
name (str) – name of the ensemble
params (dict[str, Any]) – parameters to expand into Model members
params_as_args (list[str]) – list of params which should be used as command line arguments to the Model member executables and not written to generator files
batch_settings (BatchSettings, optional) – describes settings for Ensemble as batch workload
run_settings (RunSettings, optional) – describes how each Model should be executed
replicas (int, optional) – number of Model replicas to create, a keyword argument of kwargs
perm_strat (str) – strategy for expanding params into Model instances; options are “all_perm”, “stepped”, “random”, or a callable function. Defaults to “all_perm”.
- Returns
an Ensemble instance
- Return type
Ensemble
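Ensembles are usually obtained from Experiment.create_ensemble; a sketch expanding two parameters with the “all_perm” strategy into four Model members (train.py is a placeholder):
from smartsim import Experiment

exp = Experiment("ensemble-example", launcher="local")
rs = exp.create_run_settings("python", exe_args="train.py")
params = {"learning_rate": [0.01, 0.001], "batch": [32, 64]}
ensemble = exp.create_ensemble("training", params=params, run_settings=rs,
                               perm_strategy="all_perm")
exp.start(ensemble)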
- add_function(name, function=None, device='CPU', devices_per_node=1)[source]¶
TorchScript function to launch with every entity belonging to this ensemble
Each function added will be loaded into a non-converged orchestrator prior to the execution of every entity belonging to this ensemble.
For converged orchestrators, the add_script() method should be used.
Device selection is either “GPU” or “CPU”. If many devices are present, a number can be passed for specification, e.g. “GPU:1”.
Setting devices_per_node=N, with N greater than one, will result in the function being stored in the first N devices of type device.
- Parameters
name (str) – key to store function under
function (str, optional) – TorchScript code of the function
device (str, optional) – device for script execution, defaults to “CPU”
devices_per_node (int) – number of devices on each host
- add_ml_model(name, backend, model=None, model_path=None, device='CPU', devices_per_node=1, batch_size=0, min_batch_size=0, tag='', inputs=None, outputs=None)[source]¶
A TF, TF-lite, PT, or ONNX model to load into the DB at runtime
Each ML Model added will be loaded into an orchestrator (converged or not) prior to the execution of every entity belonging to this ensemble
One of either model (in memory representation) or model_path (file) must be provided
- Parameters
name (str) – key to store model under
model (str | bytes | None) – model in memory
model_path (str, optional) – file path to serialized model
backend (str) – name of the backend (TORCH, TF, TFLITE, ONNX)
device (str, optional) – name of device for execution, defaults to “CPU”
batch_size (int, optional) – batch size for execution, defaults to 0
min_batch_size (int, optional) – minimum batch size for model execution, defaults to 0
tag (str, optional) – additional tag for model information, defaults to “”
inputs (list[str], optional) – model inputs (TF only), defaults to None
outputs (list[str], optional) – model outputs (TF only), defaults to None
- add_model(model)[source]¶
Add a model to this ensemble
- Parameters
model (Model) – model instance to be added
- Raises
TypeError – if model is not an instance of Model
EntityExistsError – if model already exists in this ensemble
- add_script(name, script=None, script_path=None, device='CPU', devices_per_node=1)[source]¶
TorchScript to launch with every entity belonging to this ensemble
Each script added to the model will be loaded into an orchestrator (converged or not) prior to the execution of every entity belonging to this ensemble
Device selection is either “GPU” or “CPU”. If many devices are present, a number can be passed for specification e.g. “GPU:1”.
Setting devices_per_node=N, with N greater than one, will result in the script being stored in the first N devices of type device.
One of either script (in memory string representation) or script_path (file) must be provided
- Parameters
name (str) – key to store script under
script (str, optional) – TorchScript code
script_path (str, optional) – path to TorchScript code
device (str, optional) – device for script execution, defaults to “CPU”
devices_per_node (int) – number of devices on each host
- attach_generator_files(to_copy=None, to_symlink=None, to_configure=None)[source]¶
Attach files to each model within the ensemble for generation
Attach files needed for the entity that, upon generation, will be located in the path of the entity.
During generation, files “to_copy” are copied into the path of the entity, and files “to_symlink” are symlinked into the path of the entity.
Files “to_configure” are text-based model input files where parameters for the model are set. Note that only models support the “to_configure” field. These files must have fields tagged that correspond to the values the user would like to change. The tag is settable but defaults to a semicolon, e.g. THERMO = ;10;
- Parameters
to_copy (list, optional) – files to copy, defaults to []
to_symlink (list, optional) – files to symlink, defaults to []
to_configure (list, optional) – input files with tagged parameters, defaults to []
- enable_key_prefixing()[source]¶
If called, all models within this ensemble will prefix their keys with their own model name.
- query_key_prefixing()[source]¶
Inquire as to whether each model within the ensemble will prefix its keys
- Returns
True if all models have key prefixing enabled, False otherwise
- Return type
bool
- register_incoming_entity(incoming_entity)[source]¶
Register future communication between entities.
Registers the named data sources that this entity has access to by storing the key_prefix associated with that entity
Only python clients can have multiple incoming connections
- Parameters
incoming_entity (SmartSimEntity) – The entity that data will be received from
- property type¶
Return the name of the class
Machine Learning¶
SmartSim includes built-in utilities for supporting TensorFlow, Keras, and PyTorch.
TensorFlow¶
SmartSim includes built-in utilities for supporting TensorFlow and Keras in training and inference.
|
Freeze a Keras or TensorFlow Graph |
- freeze_model(model, output_dir, file_name)[source]¶
Freeze a Keras or TensorFlow Graph
To use a Keras or TensorFlow model in SmartSim, the model must be frozen and the inputs and outputs provided to the smartredis.client.set_model_from_file() method.
This utility function provides everything users need to take a trained model and put it inside an orchestrator instance
- Parameters
model (tf.Module) – TensorFlow or Keras model
output_dir (str) – output dir to save model file to
file_name (str) – name of model file to create
- Returns
path to model file, model input layer names, model output layer names
- Return type
str, list[str], list[str]
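A sketch of freezing a small Keras model and retrieving the layer names needed by smartredis:
from tensorflow import keras
from smartsim.ml.tf import freeze_model

model = keras.Sequential([keras.layers.Dense(4, input_shape=(2,))])
model_file, inputs, outputs = freeze_model(model, output_dir=".", file_name="model.pb")
# model_file, inputs, and outputs can now be handed to
# smartredis.client.set_model_from_file()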
- serialize_model(model)[source]¶
Serialize a Keras or TensorFlow Graph
To use a Keras or TensorFlow model in SmartSim, the model must be frozen and the inputs and outputs provided to the smartredis.client.set_model() method.
This utility function provides everything users need to take a trained model and put it inside an orchestrator instance.
- Parameters
model (tf.Module) – TensorFlow or Keras model
- Returns
serialized model, model input layer names, model output layer names
- Return type
str, list[str], list[str]
- class StaticDataGenerator(**kwargs)[source]¶
Bases: smartsim.ml.data.StaticDataDownloader, tensorflow.python.keras.utils.data_utils.Sequence
A class to download a dataset from the DB.
Details about parameters and features of this class can be found in the documentation of StaticDataDownloader, of which it is just a TensorFlow-specialized sub-class.
- init_samples(sources=None)¶
Initialize samples (and targets, if needed).
This function will not return until samples have been downloaded from all sources.
- Parameters
sources (list[tuple], optional) – List of sources as defined in init_sources, defaults to None, in which case sources will be initialized, unless self.sources is already set
- init_sources()¶
Initialize the list of data sources based on incoming entities and self.sub_indices.
Each source is represented as a tuple (producer_name, sub_index). Before init_samples() is called, the user can modify the list. Once init_samples() is called, all data is downloaded and batches can be obtained with iter(). The list of all sources is stored as self.sources.
- Raises
ValueError – If self.uploader_info is set to auto but no uploader_name is specified.
ValueError – If self.uploader_info is not set to auto or manual.
- property need_targets¶
Compute if targets have to be downloaded.
- Returns
Whether targets (or labels) should be downloaded
- Return type
bool
- class DynamicDataGenerator(**kwargs)[source]¶
Bases: smartsim.ml.data.DynamicDataDownloader, smartsim.ml.tf.data.StaticDataGenerator
A class to download batches from the DB.
Details about parameters and features of this class can be found in the documentation of DynamicDataDownloader, of which it is just a TensorFlow-specialized sub-class.
- init_samples(sources=None)¶
Initialize samples (and targets, if needed).
This function will not return until at least one batch worth of data has been downloaded.
- Parameters
sources (list[tuple], optional) – List of sources as defined in init_sources, defaults to None, in which case sources will be initialized, unless self.sources is already set
- init_sources()¶
Initialize the list of data sources based on incoming entities and self.sub_indices.
Each source is represented as a tuple (producer_name, sub_index). Before init_samples() is called, the user can modify the list. Once init_samples() is called, all data is downloaded and batches can be obtained with iter(). The list of all sources is stored as self.sources.
- Raises
ValueError – If self.uploader_info is set to auto but no uploader_name is specified.
ValueError – If self.uploader_info is not set to auto or manual.
- property need_targets¶
Compute if targets have to be downloaded.
- Returns
Whether targets (or labels) should be downloaded
- Return type
bool
- on_epoch_end()[source]¶
Callback called at the end of each training epoch
Update data (the DB is queried for new batches) and if self.shuffle is set to True, data is also shuffled.
- update_data()¶
Update data.
Fetch new batches (if available) from the DB. Also shuffle list of samples if self.shuffle is set to True.
PyTorch¶
SmartSim includes built-in utilities for supporting PyTorch in training and inference.
- class StaticDataGenerator(**kwargs)[source]¶
Bases: smartsim.ml.data.StaticDataDownloader, torch.utils.data.dataset.IterableDataset
A class to download a dataset from the DB.
Details about parameters and features of this class can be found in the documentation of StaticDataDownloader, of which it is just a PyTorch-specialized sub-class.
Note that if the StaticDataGenerator has to be used through a DataLoader, init_samples must be set to False, as sources and samples will be initialized by the DataLoader workers.
- init_samples(sources=None)¶
Initialize samples (and targets, if needed).
This function will not return until samples have been downloaded from all sources.
- Parameters
sources (list[tuple], optional) – List of sources as defined in init_sources, defaults to None, in which case sources will be initialized, unless self.sources is already set
- init_sources()¶
Initialize the list of data sources based on incoming entities and self.sub_indices.
Each source is represented as a tuple (producer_name, sub_index). Before init_samples() is called, the user can modify the list. Once init_samples() is called, all data is downloaded and batches can be obtained with iter(). The list of all sources is stored as self.sources.
- Raises
ValueError – If self.uploader_info is set to auto but no uploader_name is specified.
ValueError – If self.uploader_info is not set to auto or manual.
- property need_targets¶
Compute if targets have to be downloaded.
- Returns
Whether targets (or labels) should be downloaded
- Return type
bool
- class DynamicDataGenerator(**kwargs)[source]¶
Bases: smartsim.ml.data.DynamicDataDownloader, smartsim.ml.torch.data.StaticDataGenerator
A class to download batches from the DB.
Details about parameters and features of this class can be found in the documentation of DynamicDataDownloader, of which it is just a PyTorch-specialized sub-class.
Note that if the DynamicDataGenerator has to be used through a DataLoader, init_samples must be set to False, as sources and samples will be initialized by the DataLoader workers.
- init_samples(sources=None)¶
Initialize samples (and targets, if needed).
This function will not return until at least one batch worth of data has been downloaded.
- Parameters
sources (list[tuple], optional) – List of sources as defined in init_sources, defaults to None, in which case sources will be initialized, unless self.sources is already set
- init_sources()¶
Initialize the list of data sources based on incoming entities and self.sub_indices.
Each source is represented as a tuple (producer_name, sub_index). Before init_samples() is called, the user can modify the list. Once init_samples() is called, all data is downloaded and batches can be obtained with iter(). The list of all sources is stored as self.sources.
- Raises
ValueError – If self.uploader_info is set to auto but no uploader_name is specified.
ValueError – If self.uploader_info is not set to auto or manual.
- property need_targets¶
Compute if targets have to be downloaded.
- Returns
Whether targets (or labels) should be downloaded
- Return type
bool
- update_data()¶
Update data.
Fetch new batches (if available) from the DB. Also shuffle list of samples if self.shuffle is set to True.
- class DataLoader(dataset: smartsim.ml.torch.data.StaticDataGenerator, **kwargs)[source]¶
Bases: torch.utils.data.dataloader.DataLoader
DataLoader to be used as a wrapper of StaticDataGenerator or DynamicDataGenerator
This is just a sub-class of torch.utils.data.DataLoader which sets up sources of a data generator correctly. DataLoader parameters such as num_workers can be passed at initialization. batch_size should always be set to None.
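A sketch of wrapping a generator; per the note above, init_samples=False lets the DataLoader workers initialize sources, and batch_size must be None:
from smartsim.ml.torch import DynamicDataGenerator, DataLoader

generator = DynamicDataGenerator(init_samples=False)
loader = DataLoader(generator, batch_size=None, num_workers=2)

for samples, targets in loader:
    pass  # training step goes here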
Slurm¶
|
Request an allocation |
|
Free an allocation's resources |
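A sketch of the allocation workflow (get_allocation is referenced below in the RayCluster parameters; release_allocation is an assumed counterpart name):
from smartsim import slurm

alloc = slurm.get_allocation(nodes=4, time="01:00:00")
# ... start experiment entities on the allocation ...
slurm.release_allocation(alloc)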
Ray¶
RayCluster is used to launch a Ray cluster and can be launched as a batch or in an interactive allocation.
- class RayCluster(name, path='/usr/local/src/SmartSim/doc', ray_port=6789, ray_args=None, num_nodes=1, run_args=None, batch_args=None, launcher='local', batch=False, time='01:00:00', interface='ipogif0', alloc=None, run_command=None, host_list=None, password='auto', **kwargs)[source]¶
Bases:
smartsim.entity.entityList.EntityList
Entity used to run a Ray cluster on a given number of hosts. One Ray node is launched on each host, and the first host is used to launch the head node.
- Parameters
name (str) – The name of the entity.
path (str) – Path to output, error, and configuration files
ray_port (int) – Port at which the head node will be running.
ray_args (dict[str,str]) – Arguments to be passed to Ray executable.
num_nodes (int) – Number of hosts, includes 1 head node and all worker nodes.
run_args (dict[str,str]) – Arguments to pass to launcher to specify details such as partition or time.
batch_args (dict[str,str]) – Additional batch arguments passed to launcher when running batch jobs.
launcher (str) – Name of launcher to use for starting the cluster.
interface (str) – Name of network interface the cluster nodes should bind to.
alloc (int) – ID of allocation to run on, if obtained with smartsim.slurm.get_allocation
batch (bool) – Whether cluster should be launched as batch file, ignored when launcher is local
time (str) – The walltime the cluster will be running for
run_command (str) – Specify launch binary, defaults to automatic selection.
host_list (str | list[str]) – Specify hosts to launch on, defaults to None. Optional if not launching with OpenMPI.
password (str) – Password to use for Redis server, which is passed as --redis_password to ray start. Can be set to:
- "auto": a strong password will be generated internally
- a string: it will be used as password
- None: the default Ray password will be used.
Defaults to "auto".
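A sketch of standing up a three-node cluster as a Slurm batch job (the import path is an assumption to verify against the installed version):
from smartsim import Experiment
from smartsim.exp.ray import RayCluster  # assumed import path

exp = Experiment("ray-example", launcher="slurm")
cluster = RayCluster(name="ray-cluster", launcher="slurm",
                     num_nodes=3, batch=True, time="01:00:00")
exp.generate(cluster)
exp.start(cluster)
print(cluster.get_dashboard_address())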
- property batch¶
- get_dashboard_address()[source]¶
Returns dashboard address
The format is <head_ip>:<dashboard_port>
- Returns
Dashboard address
- Return type
str
- get_head_address()[source]¶
Return address of head node
If address has not been initialized, returns None
- Returns
Address of head node
- Return type
str
- set_hosts(host_list, launcher)[source]¶
Specify the hosts for the RayCluster to launch on. This is optional, unless run_command is mpirun.
- Parameters
host_list (str | list[str]) – list of hosts (compute node names)
- Raises
TypeError – if wrong type
- set_path(new_path)¶
- property type¶
Return the name of the class