Rearrangement Schema#

A Rearrangement is a sequence that describes a rearranged adaptive immune receptor chain (e.g., antibody heavy chain or TCR beta chain) along with a host of annotations. These annotations are defined by the AIRR Rearrangement schema and comprise eight categories.

| Category | Description |
| --- | --- |
| Input | The input sequence to the V(D)J assignment process. |
| Identifiers | Primary and foreign key identifiers for linking AIRR data across files and databases. |
| Primary Annotations | The primary outputs of the V(D)J assignment process, which include the gene locus, V, D, J, and C gene calls, various flags, V(D)J junction sequence, copy number (duplicate_count), and the number of reads contributing to a consensus input sequence (consensus_count). |
| Alignment Annotations | Detailed alignment annotations including the input and germline sequences used in the alignment; score, identity, statistical support (E-value, likelihood, etc.); and the alignment itself through CIGAR strings for each aligned gene. |
| Alignment Positions | The start/end positions for genes in both the input and germline sequences. |
| Region Sequence | Sequence annotations for the framework regions (FWRs) and complementarity-determining regions (CDRs). |
| Region Positions | Positional annotations for the framework regions (FWRs) and complementarity-determining regions (CDRs). |
| Junction Lengths | Lengths for junction sub-regions associated with aspects of the V(D)J recombination process. |

File Format Specification#

Data for Rearrangement or Alignment objects are stored as rows in a tab-delimited file and should be compatible with any TSV reader.

Encoding#

  • The file should be encoded as ASCII or UTF-8.

  • Everything is case-sensitive.

Dialect#

  • The record separator is a newline \n and the field separator is a tab \t.

  • Fields or data should not be quoted.

  • A header line with the AIRR-specified column names is always required.

  • Values must not contain tab or newline characters.

  • Values should avoid @, #, and quote (" or ') characters, as the result may be implementation dependent.

  • Nested delimiters are not supported by the schema explicitly and should be avoided. However, if multiple values must be reported in a single column for an application specific reason, then the use of a comma as the delimiter is recommended.
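
The dialect rules above can be approximated with Python's standard csv module. This is a minimal sketch, not the AIRR reference library: the helper names `write_airr_tsv` and `read_airr_tsv` and the sample values are hypothetical and introduced only for illustration.

```python
import csv
import io

# Illustrative records; the column names come from the schema, the values are made up.
rows = [
    {"sequence_id": "seq1", "v_call": "IGHV4-59*01", "productive": "T"},
    {"sequence_id": "seq2", "v_call": "", "productive": "F"},
]


def write_airr_tsv(handle, records, fieldnames):
    """Write records as tab-delimited text with a header line and no quoting."""
    writer = csv.DictWriter(
        handle,
        fieldnames=fieldnames,
        delimiter="\t",          # field separator is a tab
        lineterminator="\n",     # record separator is a newline
        quoting=csv.QUOTE_NONE,  # fields are never quoted
    )
    writer.writeheader()
    writer.writerows(records)


def read_airr_tsv(handle):
    """Read tab-delimited records without interpreting quote characters."""
    return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


buffer = io.StringIO()
write_airr_tsv(buffer, rows, ["sequence_id", "v_call", "productive"])
buffer.seek(0)
parsed = read_airr_tsv(buffer)
```

Because no escape character is configured, the writer in this sketch raises csv.Error when a value contains a tab, newline, or double quote, which is one way to surface values that the dialect rules prohibit or discourage.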

File names#

AIRR formatted TSV files should end with .tsv.

File Structure#

The data file has two sections in this order:

  1. Header. A single line with column names.

  2. Data values. One record per line.

A comment section preceding the header (e.g., # or @ blocks) is not part of the specification, but such a section is reserved for potential inclusion in a future release. As such, a comment section should not be included in the file as it may be incompatible with a future specification.

Required columns#

Some of the fields are defined as required and therefore must always be present in the header. Note, however, that all columns allow for null values. Therefore, required columns exist to define a core set of fields that are always present in the table structure, but do not mandate that a value be reported.
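
As an illustration, a header check for required columns might look like the sketch below. The list is transcribed from the fields marked "required" in the table in the Fields section; the function name `missing_required_columns` is hypothetical and not part of any AIRR library.

```python
# Fields marked "required" in the field table of this document.
REQUIRED_COLUMNS = [
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
]


def missing_required_columns(header_line):
    """Return the required column names that are absent from a TSV header line."""
    columns = header_line.rstrip("\n").split("\t")
    return [name for name in REQUIRED_COLUMNS if name not in columns]


# A header that omits "junction" but includes a custom column.
header = "\t".join(
    [name for name in REQUIRED_COLUMNS if name != "junction"] + ["my_custom_score"]
)
missing = missing_required_columns(header)
```

Only the header is inspected here. Since every column is nullable, a file with all required columns present and every value empty would still pass this check.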

Custom columns#

There are no restrictions on inclusion of additional custom columns in the Rearrangements file, provided such columns do not use the same name as an existing required or optional field. It is recommended that custom fields follow the same naming scheme as existing fields: snake_case, with scope narrowing when read from left to right. For example, sequence_id is the “identifier of the query sequence”.

Consider submitting a pull request for a field name reservation to the airr-standards repository if the field may be broadly useful.
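
A rough way to screen a proposed custom column name is sketched below. The regular expression is one possible reading of "snake_case" and is an assumption, as is the helper name `check_custom_column`; the set of reserved names would in practice be the full list of schema fields, of which only a few are shown.

```python
import re

# Assumed interpretation of snake_case: lowercase letters and digits joined by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_custom_column(name, reserved_names):
    """Return a list of problems with a proposed custom column name (empty if none)."""
    problems = []
    if name in reserved_names:
        problems.append("collides with a reserved field name")
    if not SNAKE_CASE.match(name):
        problems.append("is not snake_case")
    return problems


# A small subset of reserved names, for illustration only.
reserved = {"sequence_id", "v_call", "junction_length"}

ok = check_custom_column("v_mutation_count", reserved)
collision = check_custom_column("sequence_id", reserved)
bad_style = check_custom_column("MyScore", reserved)
```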

Ordering#

There are no requirements that fields or records be sorted or ordered in any specific way. However, the field ordering provided by the schema is a recommended default, with top-to-bottom equating to left-to-right.

Data Values#

The possible data types are string, boolean, number (floating point), integer, and null (empty string).

Boolean values#

Boolean values must be encoded as T for true and F for false.

Null values#

All fields may contain null values. This includes columns that are described as required. A null value should be encoded as an empty string.
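
The value encodings described above could be handled along the following lines. This is a sketch under the stated rules (T/F for booleans, empty string for null); the function names `parse_value` and `format_value` are hypothetical.

```python
def parse_value(text, field_type):
    """Convert a raw TSV cell to a Python value according to its schema type."""
    if text == "":
        return None  # null is encoded as an empty string for every type
    if field_type == "boolean":
        if text == "T":
            return True
        if text == "F":
            return False
        raise ValueError(f"invalid boolean encoding: {text!r}")
    if field_type == "integer":
        return int(text)
    if field_type == "number":
        return float(text)
    return text  # strings are returned unchanged


def format_value(value):
    """Convert a Python value back to its TSV cell encoding."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "T" if value else "F"
    return str(value)
```

For example, `parse_value("T", "boolean")` yields `True`, `parse_value("", "integer")` yields `None`, and `format_value(False)` yields `"F"`.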

Coordinate numbering#

All alignment sequence coordinates use the same scheme as IMGT and INSDC (DDBJ, ENA, GenBank), with the exception that partial coordinate information should not be used; instead, simply assign the start/end of the alignment. That is, coordinates should be provided as 1-based values with closed intervals, without the use of > or < annotations that denote a partial region.
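
Because most programming languages index strings from zero with half-open ranges, a conversion step is usually needed. The sketch below shows the arithmetic in Python; the helper names `extract_region` and `to_airr_interval` are hypothetical.

```python
def extract_region(sequence, start, end):
    """Slice a sequence using 1-based closed coordinates, as used by the schema."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError(f"invalid 1-based closed interval: {start}..{end}")
    # 1-based closed [start, end] corresponds to the 0-based half-open [start - 1, end).
    return sequence[start - 1:end]


def to_airr_interval(py_start, py_stop):
    """Convert a 0-based half-open Python range to a 1-based closed interval."""
    return py_start + 1, py_stop


region = extract_region("ACGTACGT", 3, 5)   # positions 3, 4, 5
interval = to_airr_interval(2, 5)           # the same span expressed as a Python slice
```

Here `region` is `"GTA"` and `interval` is `(3, 5)`.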

CIGAR specification#

Alignment details are specified using the CIGAR format as defined in the SAM specifications, with some vocabulary restrictions on the use of clipping, skipping, and padding operators.

The CIGAR string defines the reference sequence as the germline sequence of the given gene or region; e.g., for v_cigar the reference is the V gene germline sequence. The query sequence is what was input into the alignment tool, which must correspond to what is contained in the sequence field of the Rearrangement data. For the majority of use cases, this will necessarily exclude alignment spacers from the CIGAR string, such as IMGT numbering gaps. However, any gaps appearing in the query sequence should be accounted for in the CIGAR string so that the alignment between the query and reference is correctly represented.

The valid operator sets and definitions are as follows:

| Operator | Description |
| --- | --- |
| = | An identical non-gap character. |
| X | A differing non-gap character. |
| M | A positional match in the alignment. This can be either an identical (=) or differing (X) non-gap character. |
| D | Deletion in the query (gap in the query). |
| I | Insertion in the query (gap in the reference). |
| S | Positions that appear in the query, but not the reference. Used exclusively to denote the start position of the alignment in the query. Should precede any N operators. |
| N | A space in the alignment. Used exclusively to denote the start position of the alignment in the reference. Should follow any S operators. |

Note, the use of either the =/X or M syntax is valid, but should be used consistently. While leading S and N operators are required, trailing S and N operators are optional.

For example, a D gene alignment that starts at position 419 in the query sequence (leading 418S), that is 16 nucleotides long with no indels (middle 16M), has a 10 nucleotide 5’ deletion (leading 10N), a 5 nucleotide 3’ deletion (trailing 5N), and ends 72 nucleotides from the end of the query sequence (trailing 71S) would have the following D gene CIGAR string (d_cigar) and positional information:

| Field | Value |
| --- | --- |
| d_cigar | 418S10N16M71S5N |
| d_sequence_start | 419 |
| d_sequence_end | 434 |
| d_germline_start | 11 |
| d_germline_end | 26 |
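
The relationship between the CIGAR string and the positional fields in the example above can be sketched in Python as follows. This is an illustrative parser for the restricted operator set only, not the reference implementation; the function name `cigar_positions` and the returned key names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([=XMDISN])")
QUERY_OPS = set("=XMI")      # operators that consume query positions
REFERENCE_OPS = set("=XMD")  # operators that consume reference positions


def cigar_positions(cigar):
    """Derive 1-based closed query and germline coordinates from an AIRR-style CIGAR."""
    ops = [(int(count), op) for count, op in CIGAR_OP.findall(cigar)]
    if "".join(f"{count}{op}" for count, op in ops) != cigar:
        raise ValueError(f"unsupported CIGAR string: {cigar!r}")

    # Leading S and N operators give the offsets into the query and reference.
    query_offset = reference_offset = 0
    index = 0
    while index < len(ops) and ops[index][1] in "SN":
        count, op = ops[index]
        if op == "S":
            query_offset += count
        else:
            reference_offset += count
        index += 1

    # Sum the aligned block; stop at the optional trailing S and N operators.
    query_length = reference_length = 0
    while index < len(ops) and ops[index][1] not in "SN":
        count, op = ops[index]
        if op in QUERY_OPS:
            query_length += count
        if op in REFERENCE_OPS:
            reference_length += count
        index += 1

    return {
        "sequence_start": query_offset + 1,
        "sequence_end": query_offset + query_length,
        "germline_start": reference_offset + 1,
        "germline_end": reference_offset + reference_length,
    }


positions = cigar_positions("418S10N16M71S5N")
```

For the d_cigar value above, `positions` holds 419 and 434 for the query and 11 and 26 for the germline, matching the table.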

Definition Clarifications#

Junction versus CDR3#

We work with the IMGT definitions of the junction and CDR3 regions. Specifically, the IMGT JUNCTION includes the conserved cysteine and tryptophan/phenylalanine residues, while CDR3 excludes those two residues. Therefore, our junction and junction_aa fields which represent the extracted sequence include the two conserved residues, while the coordinate fields (cdr3_start and cdr3_end) exclude them.
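
Under this definition, the CDR3 sequence can be derived from the junction by dropping the two flanking conserved codons. The sketch below assumes an in-frame junction; the function names and the example sequence are made up for illustration.

```python
def cdr3_from_junction(junction):
    """Strip the first and last codon of a junction nucleotide sequence."""
    if len(junction) < 6:
        raise ValueError("junction must contain at least the two conserved codons")
    return junction[3:-3]


def cdr3_aa_from_junction_aa(junction_aa):
    """Strip the first and last residue of a junction amino acid sequence."""
    if len(junction_aa) < 2:
        raise ValueError("junction_aa must contain at least the two conserved residues")
    return junction_aa[1:-1]


# Hypothetical junction: TGT (C) GCG (A) AGA (R) GAT (D) TAC (Y) TGG (W)
cdr3 = cdr3_from_junction("TGTGCGAGAGATTACTGG")
cdr3_aa = cdr3_aa_from_junction_aa("CARDYW")
```

Here `cdr3` is `"GCGAGAGATTAC"` and `cdr3_aa` is `"ARDY"`.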

Productive#

The schema does not define a strict definition of a productive rearrangement. However, the IMGT definition is recommended:

  1. Coding region has an open reading frame.

  2. No defect in the start codon, splicing sites or regulatory elements.

  3. No internal stop codons.

  4. An in-frame junction region.

Locus names#

A naming convention for locus names is not strictly enforced, but the IMGT locus names are recommended. For example, in the case of human data, this would be the set: IGH, IGK, IGL, TRA, TRB, TRD, or TRG.

Gene and allele names#

Gene call examples use the IMGT nomenclature, but no specific gene or allele nomenclature is strictly mandated. Species denotations may or may not be included in the gene name, as appropriate. For example, “Homo sapiens IGHV4-59*01”, “IGHV4-59*01” and “AB019438” are all valid entries for the same allele.

However, when using an established reference database to assign gene calls, adherence to the exact nomenclature used by the reference database is strongly recommended, as this will facilitate mapping to the database entries, cross-study comparison, and upload to public repositories.

Alignments#

There is no required alignment scheme for the nucleotide and amino acid alignment fields. These fields may, or may not, include numbering spacers (e.g., IMGT-numbering gaps), variations in case to denote mismatches, deletions, or other features appropriate to the tool that performed the alignment. The only strict requirement is that the query (sequence) and reference (germline) must be properly aligned.

Frameshifts#

For purposes of annotating alignments, a frameshift is defined as a shift in reading frame that is maintained until the end of the aligned gene, where frames are designated numerically as 1 (in-frame), 2, or 3. For example, a V gene alignment that starts in frame 1 and ends in frame 2, disrupting the conserved cysteine, would be defined as a frameshift. By contrast, a V gene alignment with an internal frameshift that is corrected by a second frameshift, returning to the original frame 1 prior to the conserved cysteine, would not need to be annotated as a frameshift.

Fields#

The specification includes two classes of fields: those that are required and those that are optional. Required is defined as a column that must be present in the header of the TSV. Optional is defined as a column that may, or may not, appear in the TSV. All fields, including required fields, are nullable by assigning an empty string as the value. There are no requirements for column ordering in the schema, although the Python and R reference APIs enforce ordering for the sake of generating predictable output. The set of optional fields that provide alignment and region coordinates (“_start” and “_end” fields) are defined as 1-based closed intervals, similar to the SAM, VCF, GFF, IMGT, and INSDC formats (GenBank, ENA, and DDBJ; http://www.insdc.org).

Most fields have strict definitions for the values that they contain. However, some commonly provided information cannot be standardized across diverse toolchains, so a small selection of fields have context-dependent definitions. In particular, these context-dependent fields include the optional “_score,” “_identity,” and “_support” fields used for assessing the quality of alignments, which vary considerably in definition based on the methodology used. Similarly, the “_alignment” fields require strict alignment between the corresponding observed and germline sequences, but the manner in which that alignment is conveyed is somewhat flexible in that it allows for any numbering scheme (e.g., IMGT or Kabat) or lack thereof.

By default, data elements representing sequences in the schema contain nucleotide sequences except for data elements ending in “_aa,” which are amino acid translations of the associated nucleotide sequence.

While the format contains an extensive list of reserved field names, there are no restrictions on inclusion of custom fields in the TSV file, provided such custom fields have a unique name. Furthermore, suggestions for extending the format with additional reserved names are welcomed through the issue tracker on the GitHub repository (airr-community/airr-standards).

Name

Type

Attributes

Definition

sequence_id

string

required, identifier, nullable

Unique query sequence identifier for the Rearrangement. Most often this will be the input sequence header or a substring thereof, but may also be a custom identifier defined by the tool in cases where query sequences have been combined in some fashion prior to alignment. When downloaded from an AIRR Data Commons repository, this will usually be a universally unique record locator for linking with other objects in the AIRR Data Model.

sequence

string

required, nullable

The query nucleotide sequence. Usually, this is the unmodified input sequence, which may be reverse complemented if necessary. In some cases, this field may contain consensus sequences or other types of collapsed input sequences if these steps are performed prior to alignment.

quality

string

optional, nullable

The Sanger/Phred quality scores for assessment of sequence quality. Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.)

sequence_aa

string

optional, nullable

Amino acid translation of the query nucleotide sequence.

rev_comp

boolean

required, nullable

True if the alignment is on the opposite strand (reverse complemented) with respect to the query sequence. If True then all output data, such as alignment coordinates and sequences, are based on the reverse complement of ‘sequence’.

productive

boolean

required, nullable

True if the V(D)J sequence is predicted to be productive.

vj_in_frame

boolean

optional, nullable

True if the V and J gene alignments are in-frame.

stop_codon

boolean

optional, nullable

True if the aligned sequence contains a stop codon.

complete_vdj

boolean

optional, nullable

True if the sequence alignment spans the entire V(D)J region. Meaning, sequence_alignment includes both the first V gene codon that encodes the mature polypeptide chain (i.e., after the leader sequence) and the last complete codon of the J gene (i.e., before the J-C splice site). This does not require an absence of deletions within the internal FWR and CDR regions of the alignment.

locus

string

optional, nullable

Gene locus (chain type). Note that this field uses a controlled vocabulary that is meant to provide a generic classification of the locus, not necessarily the correct designation according to a specific nomenclature.

locus_species

Ontology

optional, nullable

Binomial designation of the species from which the locus originates. Typically, this value should be identical to organism, in which case it SHOULD NOT be set explicitly. However, there are valid experimental setups in which the two might differ, e.g. transgenic animal models. If set, this key will overwrite the organism information for all lower layers of the schema.

v_call

string

required, nullable

V gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHV4-59*01 if using IMGT/GENE-DB).

d_call

string

required, nullable

First or only D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB).

d2_call

string

optional, nullable

Second D gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHD3-10*01 if using IMGT/GENE-DB).

j_call

string

required, nullable

J gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHJ4*02 if using IMGT/GENE-DB).

c_call

string

optional, nullable

Constant region gene with allele. If referring to a known reference sequence in a database the relevant gene/allele nomenclature should be followed (e.g., IGHG1*01 if using IMGT/GENE-DB).

sequence_alignment

string

required, nullable

Aligned portion of query sequence, including any indel corrections or numbering spacers, such as IMGT-gaps. Typically, this will include only the V(D)J region, but that is not a requirement.

quality_alignment

string

optional, nullable

Sanger/Phred quality scores for assessment of sequence_alignment quality. Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126 (Used by Illumina from v1.8.)

sequence_alignment_aa

string

optional, nullable

Amino acid translation of the aligned query sequence.

germline_alignment

string

required, nullable

Assembled, aligned, full-length inferred germline sequence spanning the same region as the sequence_alignment field (typically the V(D)J region) and including the same set of corrections and spacers (if any).

germline_alignment_aa

string

optional, nullable

Amino acid translation of the assembled germline sequence.

junction

string

required, nullable

Junction region nucleotide sequence, where the junction is defined as the CDR3 plus the two flanking conserved codons.

junction_aa

string

required, nullable

Amino acid translation of the junction.

np1

string

optional, nullable

Nucleotide sequence of the combined N/P region between the V gene and first D gene alignment or between the V gene and J gene alignments.

np1_aa

string

optional, nullable

Amino acid translation of the np1 field.

np2

string

optional, nullable

Nucleotide sequence of the combined N/P region between either the first D gene and J gene alignments or the first D gene and second D gene alignments.

np2_aa

string

optional, nullable

Amino acid translation of the np2 field.

np3

string

optional, nullable

Nucleotide sequence of the combined N/P region between the second D gene and J gene alignments.

np3_aa

string

optional, nullable

Amino acid translation of the np3 field.

cdr1

string

optional, nullable

Nucleotide sequence of the aligned CDR1 region.

cdr1_aa

string

optional, nullable

Amino acid translation of the cdr1 field.

cdr2

string

optional, nullable

Nucleotide sequence of the aligned CDR2 region.

cdr2_aa

string

optional, nullable

Amino acid translation of the cdr2 field.

cdr3

string

optional, nullable

Nucleotide sequence of the aligned CDR3 region.

cdr3_aa

string

optional, nullable

Amino acid translation of the cdr3 field.

fwr1

string

optional, nullable

Nucleotide sequence of the aligned FWR1 region.

fwr1_aa

string

optional, nullable

Amino acid translation of the fwr1 field.

fwr2

string

optional, nullable

Nucleotide sequence of the aligned FWR2 region.

fwr2_aa

string

optional, nullable

Amino acid translation of the fwr2 field.

fwr3

string

optional, nullable

Nucleotide sequence of the aligned FWR3 region.

fwr3_aa

string

optional, nullable

Amino acid translation of the fwr3 field.

fwr4

string

optional, nullable

Nucleotide sequence of the aligned FWR4 region.

fwr4_aa

string

optional, nullable

Amino acid translation of the fwr4 field.

v_score

number

optional, nullable

Alignment score for the V gene.

v_identity

number

optional, nullable

Fractional identity for the V gene alignment.

v_support

number

optional, nullable

V gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the V gene assignment as defined by the alignment tool.

v_cigar

string

required, nullable

CIGAR string for the V gene alignment.

d_score

number

optional, nullable

Alignment score for the first or only D gene alignment.

d_identity

number

optional, nullable

Fractional identity for the first or only D gene alignment.

d_support

number

optional, nullable

D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the first or only D gene as defined by the alignment tool.

d_cigar

string

required, nullable

CIGAR string for the first or only D gene alignment.

d2_score

number

optional, nullable

Alignment score for the second D gene alignment.

d2_identity

number

optional, nullable

Fractional identity for the second D gene alignment.

d2_support

number

optional, nullable

D gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the second D gene as defined by the alignment tool.

d2_cigar

string

optional, nullable

CIGAR string for the second D gene alignment.

j_score

number

optional, nullable

Alignment score for the J gene alignment.

j_identity

number

optional, nullable

Fractional identity for the J gene alignment.

j_support

number

optional, nullable

J gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the J gene assignment as defined by the alignment tool.

j_cigar

string

required, nullable

CIGAR string for the J gene alignment.

c_score

number

optional, nullable

Alignment score for the C gene alignment.

c_identity

number

optional, nullable

Fractional identity for the C gene alignment.

c_support

number

optional, nullable

C gene alignment E-value, p-value, likelihood, probability or other similar measure of support for the C gene assignment as defined by the alignment tool.

c_cigar

string

optional, nullable

CIGAR string for the C gene alignment.

v_sequence_start

integer

optional, nullable

Start position of the V gene in the query sequence (1-based closed interval).

v_sequence_end

integer

optional, nullable

End position of the V gene in the query sequence (1-based closed interval).

v_germline_start

integer

optional, nullable

Alignment start position in the V gene reference sequence (1-based closed interval).

v_germline_end

integer

optional, nullable

Alignment end position in the V gene reference sequence (1-based closed interval).

v_alignment_start

integer

optional, nullable

Start position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

v_alignment_end

integer

optional, nullable

End position of the V gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

d_sequence_start

integer

optional, nullable

Start position of the first or only D gene in the query sequence. (1-based closed interval).

d_sequence_end

integer

optional, nullable

End position of the first or only D gene in the query sequence. (1-based closed interval).

d_germline_start

integer

optional, nullable

Alignment start position in the D gene reference sequence for the first or only D gene (1-based closed interval).

d_germline_end

integer

optional, nullable

Alignment end position in the D gene reference sequence for the first or only D gene (1-based closed interval).

d_alignment_start

integer

optional, nullable

Start position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval).

d_alignment_end

integer

optional, nullable

End position of the first or only D gene in both the sequence_alignment and germline_alignment fields (1-based closed interval).

d2_sequence_start

integer

optional, nullable

Start position of the second D gene in the query sequence (1-based closed interval).

d2_sequence_end

integer

optional, nullable

End position of the second D gene in the query sequence (1-based closed interval).

d2_germline_start

integer

optional, nullable

Alignment start position in the second D gene reference sequence (1-based closed interval).

d2_germline_end

integer

optional, nullable

Alignment end position in the second D gene reference sequence (1-based closed interval).

d2_alignment_start

integer

optional, nullable

Start position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

d2_alignment_end

integer

optional, nullable

End position of the second D gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

j_sequence_start

integer

optional, nullable

Start position of the J gene in the query sequence (1-based closed interval).

j_sequence_end

integer

optional, nullable

End position of the J gene in the query sequence (1-based closed interval).

j_germline_start

integer

optional, nullable

Alignment start position in the J gene reference sequence (1-based closed interval).

j_germline_end

integer

optional, nullable

Alignment end position in the J gene reference sequence (1-based closed interval).

j_alignment_start

integer

optional, nullable

Start position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

j_alignment_end

integer

optional, nullable

End position of the J gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

c_sequence_start

integer

optional, nullable

Start position of the C gene in the query sequence (1-based closed interval).

c_sequence_end

integer

optional, nullable

End position of the C gene in the query sequence (1-based closed interval).

c_germline_start

integer

optional, nullable

Alignment start position in the C gene reference sequence (1-based closed interval).

c_germline_end

integer

optional, nullable

Alignment end position in the C gene reference sequence (1-based closed interval).

c_alignment_start

integer

optional, nullable

Start position of the C gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

c_alignment_end

integer

optional, nullable

End position of the C gene alignment in both the sequence_alignment and germline_alignment fields (1-based closed interval).

cdr1_start

integer

optional, nullable

CDR1 start position in the query sequence (1-based closed interval).

cdr1_end

integer

optional, nullable

CDR1 end position in the query sequence (1-based closed interval).

cdr2_start

integer

optional, nullable

CDR2 start position in the query sequence (1-based closed interval).

cdr2_end

integer

optional, nullable

CDR2 end position in the query sequence (1-based closed interval).

cdr3_start

integer

optional, nullable

CDR3 start position in the query sequence (1-based closed interval).

cdr3_end

integer

optional, nullable

CDR3 end position in the query sequence (1-based closed interval).

fwr1_start

integer

optional, nullable

FWR1 start position in the query sequence (1-based closed interval).

fwr1_end

integer

optional, nullable

FWR1 end position in the query sequence (1-based closed interval).

fwr2_start

integer

optional, nullable

FWR2 start position in the query sequence (1-based closed interval).

fwr2_end

integer

optional, nullable

FWR2 end position in the query sequence (1-based closed interval).

fwr3_start

integer

optional, nullable

FWR3 start position in the query sequence (1-based closed interval).

fwr3_end

integer

optional, nullable

FWR3 end position in the query sequence (1-based closed interval).

fwr4_start

integer

optional, nullable

FWR4 start position in the query sequence (1-based closed interval).

fwr4_end

integer

optional, nullable

FWR4 end position in the query sequence (1-based closed interval).

v_sequence_alignment

string

optional, nullable

Aligned portion of query sequence assigned to the V gene, including any indel corrections or numbering spacers.

v_sequence_alignment_aa

string

optional, nullable

Amino acid translation of the v_sequence_alignment field.

d_sequence_alignment

string

optional, nullable

Aligned portion of query sequence assigned to the first or only D gene, including any indel corrections or numbering spacers.

d_sequence_alignment_aa

string

optional, nullable

Amino acid translation of the d_sequence_alignment field.

d2_sequence_alignment

string

optional, nullable

Aligned portion of query sequence assigned to the second D gene, including any indel corrections or numbering spacers.

d2_sequence_alignment_aa

string

optional, nullable

Amino acid translation of the d2_sequence_alignment field.

j_sequence_alignment

string

optional, nullable

Aligned portion of query sequence assigned to the J gene, including any indel corrections or numbering spacers.

j_sequence_alignment_aa

string

optional, nullable

Amino acid translation of the j_sequence_alignment field.

c_sequence_alignment

string

optional, nullable

Aligned portion of query sequence assigned to the constant region, including any indel corrections or numbering spacers.

c_sequence_alignment_aa

string

optional, nullable

Amino acid translation of the c_sequence_alignment field.

v_germline_alignment

string

optional, nullable

Aligned V gene germline sequence spanning the same region as the v_sequence_alignment field and including the same set of corrections and spacers (if any).

v_germline_alignment_aa

string

optional, nullable

Amino acid translation of the v_germline_alignment field.

d_germline_alignment

string

optional, nullable

Aligned D gene germline sequence spanning the same region as the d_sequence_alignment field and including the same set of corrections and spacers (if any).

d_germline_alignment_aa

string

optional, nullable

Amino acid translation of the d_germline_alignment field.

d2_germline_alignment

string

optional, nullable

Aligned D gene germline sequence spanning the same region as the d2_sequence_alignment field and including the same set of corrections and spacers (if any).

d2_germline_alignment_aa

string

optional, nullable

Amino acid translation of the d2_germline_alignment field.

j_germline_alignment

string

optional, nullable

Aligned J gene germline sequence spanning the same region as the j_sequence_alignment field and including the same set of corrections and spacers (if any).

j_germline_alignment_aa

string

optional, nullable

Amino acid translation of the j_germline_alignment field.

c_germline_alignment

string

optional, nullable

Aligned constant region germline sequence spanning the same region as the c_sequence_alignment field and including the same set of corrections and spacers (if any).

c_germline_alignment_aa

string

optional, nullable

Amino acid translation of the c_germline_alignment field.

junction_length

integer

optional, nullable

Number of nucleotides in the junction sequence.

junction_aa_length

integer

optional, nullable

Number of amino acids in the junction sequence.

np1_length

integer

optional, nullable

Number of nucleotides between the V gene and first D gene alignments or between the V gene and J gene alignments.

np2_length

integer

optional, nullable

Number of nucleotides between either the first D gene and J gene alignments or the first D gene and second D gene alignments.

np3_length

integer

optional, nullable

Number of nucleotides between the second D gene and J gene alignments.

n1_length

integer

optional, nullable

Number of untemplated nucleotides 5’ of the first or only D gene alignment.

n2_length

integer

optional, nullable

Number of untemplated nucleotides 3’ of the first or only D gene alignment.

n3_length

integer

optional, nullable

Number of untemplated nucleotides 3’ of the second D gene alignment.

p3v_length

integer

optional, nullable

Number of palindromic nucleotides 3’ of the V gene alignment.

p5d_length

integer

optional, nullable

Number of palindromic nucleotides 5’ of the first or only D gene alignment.

p3d_length

integer

optional, nullable

Number of palindromic nucleotides 3’ of the first or only D gene alignment.

p5d2_length

integer

optional, nullable

Number of palindromic nucleotides 5’ of the second D gene alignment.

p3d2_length

integer

optional, nullable

Number of palindromic nucleotides 3’ of the second D gene alignment.

p5j_length

integer

optional, nullable

Number of palindromic nucleotides 5’ of the J gene alignment.

v_frameshift

boolean

optional, nullable

True if the V gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the V gene reference sequence.

j_frameshift

boolean

optional, nullable

True if the J gene in the query nucleotide sequence contains a translational frameshift relative to the frame of the J gene reference sequence.

d_frame

integer

optional, nullable

Numerical reading frame (1, 2, 3) of the first or only D gene in the query nucleotide sequence, where frame 1 is relative to the first codon of the D gene reference sequence.

d2_frame

integer

optional, nullable

Numerical reading frame (1, 2, 3) of the second D gene in the query nucleotide sequence, where frame 1 is relative to the first codon of the D gene reference sequence.

consensus_count

integer

optional, nullable

Number of reads contributing to the UMI consensus or contig assembly for this sequence. For example, the sum of the number of reads for all UMIs that contribute to the query sequence.

duplicate_count

integer

optional, nullable

Copy number or number of duplicate observations for the query sequence. For example, the number of identical reads observed for this sequence.

umi_count

integer

optional, nullable

Number of distinct UMIs represented by this sequence. For example, the total number of UMIs that contribute to the contig assembly for the query sequence.

cell_id

string

optional, identifier, nullable

Identifier defining the cell of origin for the query sequence.

clone_id

string

optional, identifier, nullable

Clonal cluster assignment for the query sequence.

repertoire_id

string

optional, identifier, nullable

Identifier to the associated repertoire in study metadata.

sample_processing_id

string

optional, identifier, nullable

Identifier to the sample processing object in the repertoire metadata for this rearrangement. If the repertoire has a single sample then this field may be empty or missing. If the repertoire has multiple samples then this field may be empty or missing if the sample cannot be differentiated or the relationship is not maintained by the data processing.

data_processing_id

string

optional, identifier, nullable

Identifier to the data processing object in the repertoire metadata for this rearrangement. If this field is empty then the primary data processing object is assumed.

rearrangement_id

string

DEPRECATED

Identifier for the Rearrangement object. May be identical to sequence_id, but will usually be a universally unique record locator for database applications.

rearrangement_set_id

string

DEPRECATED

Identifier for grouping Rearrangement objects.

germline_database

string

DEPRECATED

Source of germline V(D)J genes with version number or date accessed.