4.12. Workflow

ext_Experiment is the container for the whole project's data. sample_registry is the dict that tracks sample information and status; it is kept outside the class to facilitate multiprocessing. The heavy lifting is done in constructors.CompositeMap, which contains MassGrid for correspondence and FeatureList from feature/peak detection.

Annotation is facilitated by jms-metabolite-services and mass2chem.

asari.workflow.batch_EIC_from_samples_(sample_registry, parameters)[source]

Batch extraction of mass tracks from samples via multiprocessing.

Parameters:
  • sample_registry (dict) – dictionary like {‘sample_id’: ii, ‘input_file’: file}, generated by register_samples.

  • parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.

Returns:

shared_dict – dictionary object used to pass data between multiple processes.

Return type:

dict
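The pattern of passing results back through a shared dictionary can be sketched as follows. This is a minimal illustration, assuming a `multiprocessing.Manager` dict and `Pool.starmap`; `toy_worker` and `run_batch` are hypothetical names and stand in for the real per-sample extraction, not asari's implementation.

```python
import multiprocessing as mp

def toy_worker(sample_id, infile, shared_dict):
    """Hypothetical stand-in for per-sample extraction: report under sample_id."""
    shared_dict[sample_id] = ("passed", infile)

def run_batch(iter_parameters, n_processes=2):
    """Run toy_worker over (sample_id, infile) pairs; return a plain dict of results."""
    with mp.Manager() as manager:
        shared_dict = manager.dict()  # proxy object visible to all worker processes
        args = [(sid, infile, shared_dict) for sid, infile in iter_parameters]
        with mp.Pool(n_processes) as pool:
            pool.starmap(toy_worker, args)
        # Copy into an ordinary dict before the manager shuts down.
        return dict(shared_dict)
```

On platforms that spawn new interpreters for worker processes, `run_batch` should be called from under an `if __name__ == "__main__":` guard.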

asari.workflow.create_export_folders(parameters, time_stamp)[source]

Creates a local directory for storing temporary files and output results. A time stamp is added to the directory name to avoid overwriting existing projects.

Parameters:
  • parameters (dict) – passed from main.py to get outdir and project_name

  • time_stamp (str) – a time_stamp string to prevent overwriting existing projects
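A minimal sketch of this kind of folder setup is given below. The function name `make_project_folders` is hypothetical, and the `<project_name>_<time_stamp>` naming scheme is an assumption made for illustration; the `pickle` and `export` subfolders follow the output layout shown for process_project.

```python
import os

def make_project_folders(outdir, project_name, time_stamp):
    """Create <outdir>/<project_name>_<time_stamp>/ with pickle/ and export/ inside."""
    # Assumed naming scheme; the time stamp keeps each run in its own folder.
    project_dir = os.path.join(outdir, f"{project_name}_{time_stamp}")
    # exist_ok=False raises if the folder exists, so a project is never overwritten.
    os.makedirs(project_dir, exist_ok=False)
    for subfolder in ("pickle", "export"):
        os.mkdir(os.path.join(project_dir, subfolder))
    return project_dir
```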

asari.workflow.get_mz_list(infile)[source]

Get a list of m/z values from infile, to be used as targets to extract from LC-MS features. Currently, extract is a shortcut function: asari still processes full feature tables, then searches for the given targets.

Parameters:

infile (str) – filepath to input table, which is tab or comma delimited and has m/z in the first column, header as first row.

Return type:

A list of m/z values.
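Reading such a table can be sketched as below, assuming the layout described above (header in the first row, m/z in the first column, tab or comma as delimiter). `read_mz_column` is a hypothetical name used for illustration, not asari's own code.

```python
def read_mz_column(infile):
    """Return first-column m/z values from a tab- or comma-delimited table."""
    mz_values = []
    with open(infile) as handle:
        next(handle, None)  # skip the header row
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Treat the line as tab-delimited if it contains a tab, else comma-delimited.
            delimiter = "\t" if "\t" in line else ","
            mz_values.append(float(line.split(delimiter)[0]))
    return mz_values
```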

asari.workflow.make_iter_parameters(sample_registry, parameters, shared_dict)[source]

Generate iterables for multiprocess.starmap for getting sample mass tracks.

Parameters:
  • sample_registry (dict) – dictionary like {‘sample_id’: ii, ‘input_file’: file}, generated by register_samples.

  • parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.

  • shared_dict (dict) – dictionary object used to pass data between multiple processes.

Returns:

A list of iterative parameters, e.g. [(‘sample_id’, input_file, mode, mz_tolerance_ppm, min_intensity, min_timepoints, min_peak_height, output_file, shared_dict), …]
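Building such a list for `starmap` can be sketched as follows. This is an illustration under stated assumptions: the registry is taken to map each sample id to a record like {'sample_id': ii, 'input_file': file}, the parameter keys other than `tmp_pickle_dir` are hypothetical names mirroring the tuple fields, and `build_iter_parameters` is not asari's function.

```python
import os

def build_iter_parameters(sample_registry, parameters, shared_dict):
    """Build one argument tuple per sample, in the tuple order documented above."""
    iters = []
    for sample in sample_registry.values():
        # Assumed output naming: <tmp_pickle_dir>/<input basename>.pickle
        base = os.path.splitext(os.path.basename(sample["input_file"]))[0]
        output_file = os.path.join(parameters["tmp_pickle_dir"], base + ".pickle")
        iters.append((
            sample["sample_id"], sample["input_file"],
            parameters["mode"], parameters["mz_tolerance_ppm"],
            parameters["min_intensity"], parameters["min_timepoints"],
            parameters["min_peak_height"], output_file, shared_dict,
        ))
    return iters
```

Every tuple carries the same shared_dict object, so each worker can write its result under its own sample id.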

asari.workflow.process_project(list_input_files, parameters)[source]

This defines the main workflow for processing a list of LC-MS files: it creates the output folder with a time stamp and uses the sample registry to coordinate parallel processing. The whole project's data are tracked in the experiment.ext_Experiment class.

Parameters:
  • list_input_files (list[str]) – list of centroided mzML filepaths from LC-MS metabolomics. Usually found in a folder.

  • parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.

Outputs

A local folder with asari processing results, e.g.:

rsvstudy_asari_project_427105156
├── Annotated_empricalCompounds.json
├── Feature_annotation.tsv
├── export
│   ├── _mass_grid_mapping.csv
│   ├── cmap.pickle
│   ├── full_Feature_table.tsv
│   └── unique_compound__Feature_table.tsv
├── pickle
│   ├── Blank_20210803_003.pickle
│   ├── ...
├── preferred_Feature_table.tsv
└── project.json

The pickle folder is removed after processing by default.

asari.workflow.process_xics(list_input_files, parameters)[source]

Get XICs (aka EICs or mass tracks) from a folder of centroid mzML files and store in local pickle files.

Parameters:
  • list_input_files (list[str]) – list of centroided mzML filepaths from LC-MS metabolomics. Usually found in a folder.

  • parameters (dict) – parameter dictionary passed from main.py, which imports from default_parameters and updates the dict by user arguments.

Outputs

A local folder with asari extracted XICs (aka EICs or mass tracks), without full processing, in pickle files.

asari.workflow.read_project_dir(directory, file_pattern='.mzML')[source]

This reads centroided LC-MS files from directory. Returns a list of files that match file_pattern.

Parameters:
  • directory (str) – path to a directory containing mzML files

  • file_pattern (str, optional, default: '.mzML') – files with this substring will be ingested

Return type:

list of paths to files containing the file_pattern
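The substring matching described above can be sketched as below. `list_matching_files` is a hypothetical name, and sorting the result is an assumption added here for reproducible ordering.

```python
import os

def list_matching_files(directory, file_pattern=".mzML"):
    """Return sorted paths of files in directory whose names contain file_pattern."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if file_pattern in name  # substring match, not a glob or regular expression
    )
```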

asari.workflow.register_samples(list_input_files)[source]

Establish sample_id here, return sample_registry as a dictionary.

Parameters:

list_input_files (list[str]) – list of input filepaths, each representing a sample

Return type:

sample_registry, a dictionary mapping integer sample ids to filepaths
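One way to build such a registry is sketched below. The exact record layout is an assumption: each integer id is mapped to a record like {'sample_id': ii, 'input_file': file}, matching the dictionary shape mentioned for batch_EIC_from_samples_. `build_sample_registry` is a hypothetical name.

```python
def build_sample_registry(list_input_files):
    """Assign consecutive integer ids to input files, in the order given."""
    registry = {}
    for ii, path in enumerate(list_input_files):
        # Assumed record layout; only the id and the file path are stored here.
        registry[ii] = {"sample_id": ii, "input_file": path}
    return registry
```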

asari.workflow.remove_intermediate_pickles(parameters)[source]

Remove all temporary files under pickle/ to free up disk space.

Parameters:

parameters (dict) – passed from main.py to get tmp_pickle_dir

asari.workflow.single_sample_EICs_(sample_id, infile, ion_mode, database_mode, mz_tolerance_ppm, min_intensity, min_timepoints, min_peak_height, outfile, shared_dict)[source]

Extracts mass tracks from a single sample. Used by multiprocessing in batch_EIC_from_samples_. shared_dict is critical because it is how information is passed back. Each entry is keyed by sample_id, with a value designed as:

(‘status:mzml_parsing’, ‘status:eic’, outfile, max_scan_number, list_scan_numbers, list_retention_time, track_mzs, number_anchor_mz_pairs, anchor_mz_pairs, dict({mass tracks}) )

track_mzs or anchor_mz_pairs are used later for aligning m/z tracks.

The list of scans starts from 0.

anchor_mz_pairs are defined as m/z pairs that match the 13C/12C pattern. More anchors mean better coverage of features, which helps in selecting the reference sample.

Parameters:
  • sample_id (int) – sample id passed from sample_registry by make_iter_parameters.

  • infile (str) – input mzML filepath, passed from sample_registry by make_iter_parameters.

  • ion_mode (str) – from parameter dictionary, marks if acquisition was in positive or negative mode

  • database_mode (str) – from parameter dictionary, marks if intermediates are kept on disk or in memory

  • mz_tolerance_ppm (float) – from parameter dictionary, the assumed mz resolution of the instrument

  • min_intensity (float) – peaks below this value are ignored

  • min_timepoints (int) – the number of time points a peak must span to be considered a peak

  • min_peak_height (float) – peaks below this height are ignored

  • outfile (str) – where the output will be written, passed from the parameter dictionary by make_iter_parameters.

  • shared_dict (dict) – dictionary object used to pass data between multiple processes.

Updates:

shared_dict (dict) – dictionary object used to pass data between multiple processes.
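The anchor_mz_pairs idea described above, m/z pairs matching the 13C/12C pattern, can be illustrated with a small search over a list of track m/z values. This is a sketch only: it assumes singly charged ions, uses the 13C/12C mass difference of about 1.003355 Da, and `find_anchor_pairs` is a hypothetical name rather than asari's implementation.

```python
C13_C12_DELTA = 1.003355  # mass difference between 13C and 12C, in Da

def find_anchor_pairs(track_mzs, mz_tolerance_ppm=5.0):
    """Return (light, heavy) m/z pairs separated by one 13C/12C mass difference."""
    mzs = sorted(track_mzs)
    pairs = []
    for i, light in enumerate(mzs):
        target = light + C13_C12_DELTA
        tolerance = target * mz_tolerance_ppm * 1e-6  # ppm converted to Da
        for heavy in mzs[i + 1:]:
            if heavy > target + tolerance:
                break  # input is sorted, so no later value can match
            if abs(heavy - target) <= tolerance:
                pairs.append((light, heavy))
    return pairs
```

Under this sketch, a sample whose tracks yield more pairs would be a stronger candidate for the reference sample.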