Julia: So What's the Difference Between ! and : In Dataframes Indexing

Hongtao Hao / 2021-07-05

Unless you are able to constantly moniter and update changes in this post, please DO NOT repost it anywere. Feel free to share the link, though.

The following codes were tested under Julia v1.6.1 and DataFrames v1.2.0.

Similar discussions are found in this thread on stackoverflow and Bogumił Kamiński’s post of On the bang row selector in DataFrames.jl .

I’ll talk about the differences between df[!, col(s)] and df[:, col(s)] from three aspects: accessing, updating, and adding column(s).

Accessing column(s) #

When df[!, col(s)] or df[:, col(s)] are on the right hand side of the equal sign =, you are accessing the column(s). For example, col1 = df[:, 1], in which you are accessing the first column of df.

When “accessing” column(s), df[:, col(s)] will make a copy of the column(s) and assign it to the variable you are declaring; in this case, col1. By contrast, df[!, col(s)] won’t make a copy . As fredrikekre rightly puts it :

(It is) a reference to the underlying vector storing the data, rather than a copy of it.

Since it’s not a copy but a reference to the column(s) of df itself, if we modify col1 (if declared with [df[!, col]), df will be mutated as well.

For example:

julia> df1 = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2 
     │ Int64  Char 
─────┼─────────────
   11  a
   22  b
   33  c

julia> a = df1[!, :col1] # using ! here
3-element Vector{Int64}:
 1
 2
 3

julia> a[2] = 99
99

julia> b = df1[:, :col2] # using : here
3-element Vector{Char}:
 'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)
 'b': ASCII/Unicode U+0062 (category Ll: Letter, lowercase)
 'c': ASCII/Unicode U+0063 (category Ll: Letter, lowercase)

julia> b[2] = 'd'
'd': ASCII/Unicode U+0064 (category Ll: Letter, lowercase)

julia> df1
3×2 DataFrame
 Row │ col1   col2 
     │ Int64  Char 
─────┼─────────────
   11  a
   299  b
   33  c

As you can see, col1 is mutated but col2 is not.

Think about it: if you run a = [4, 5, 6] instead of a[2] = 99, will df1 be modified as well? Why or why not?

The answer is no. df1 won’t change. This is because with a = [4, 5, 6], you are reassigning a. This re-assignment has no effect on df1 itself.

Also, if you run a = df1[!, :col1][[2,2,3]] and then a[2]=99, will df1 be mutated?

The answer is also no. This is because, with df1[!, :col1][[2,2,3], you are creating a new array based on df1[!, col1]. Then, with a[22]=99, you are mutating the new array, not df1[!, col1].

You can check out abarnert’s answer . It is about Python, not Julia, but the logic is similar.

Updating column(s) #

When df[!, col(s)] or df[:, col(s)] are on the left side of the equal sign =, you are updating columns.

When you update column(s), whether you use ! or :, df will be mutated. However, how df is mutated is different. ! will create new column(s) to replace the old one(s). Only the updated column will be shown but the old one will be stored in memory. By contrast, : will update the values in-place.

a = ["Tom", "Mike", "John", "Jason", "Bob"]
b = [missing, 2, 3, missing, 8]
c = [1, 3, missing, 6, missing]
df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
julia> df

5×3 DataFrame
 Row │ Name    Var1     Var2    
     │ String  Int64?   Int64?  
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike          2        3
   3 │ John          3  missing 
   4 │ Jason   missing        6
   5 │ Bob           8  missing 

See how ! will mutate the source:

julia>df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[!, :Var1] = collect(1:5)
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> df
5×3 DataFrame
 Row │ Name    Var1   Var2    
     │ String  Int64  Int64?  
─────┼────────────────────────
   1 │ Tom         1        1
   2 │ Mike        2        3
   3 │ John        3  missing 
   4 │ Jason       4        6
   5 │ Bob         5  missing 

julia> typeof(df.Var1)
Vector{Int64} (alias for Array{Int64, 1})

You’ll see a slightly different result when using :.

julia>df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[:, :Var1] = collect(1:5) 
5-element Vector{Int64}:
 1
 2
 3
 4
 5

julia> df
5×3 DataFrame
 Row │ Name    Var1    Var2    
     │ String  Int64?  Int64?  
─────┼─────────────────────────
   1 │ Tom          1        1
   2 │ Mike         2        3
   3 │ John         3  missing 
   4 │ Jason        4        6
   5 │ Bob          5  missing 

julia> typeof(df.Var1)
Vector{Union{Missing, Int64}} (alias for Array{Union{Missing, Int64}, 1})

Why do we have this difference in typeof(df.Var1)?

The explanation can be found in Bogumił Kamiński’s post :

[U]sing ! puts a new column passed on the right hand side to the data frame without copying it (no matter if the column exists or not in the data frame), while : assigns to an existing column in-place.

Updating multiple columns is the same as updating a single column. The difference is that, to update multiple columns, you need a matrix or a data frame (I don’t know how you can update multiple columns with a data frame, though).

Using !:

df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
matrixA = [1 2; 3 4; 5 6; 7 8; 9 10 ] # This creates a 5×2 array
df[!, [:Var1, :Var2]] = matrixA
julia> df

5×3 DataFrame
 Row │ Name    Var1   Var2  
     │ String  Int64  Int64 
─────┼──────────────────────
   1 │ Tom         1      2
   2 │ Mike        3      4
   3 │ John        5      6
   4 │ Jason       7      8
   5 │ Bob         9     10

Bogumił Kamiński wrote in his summary that when updating a single column, ! will replace the column without copying. By contrast, when updating multiple columns, ! will replace the columns with copying.

I don’t know what the difference between “with copying” and “without copying” is.

Using ::

df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
matrixA = [1 2; 3 4; 5 6; 7 8; 9 10 ] 
df[:, [:Var1, :Var2]] = matrixA
julia> df

5×3 DataFrame
 Row │ Name    Var1    Var2   
     │ String  Int64?  Int64? 
─────┼────────────────────────
   1 │ Tom          1       2
   2 │ Mike         3       4
   3 │ John         5       6
   4 │ Jason        7       8
   5 │ Bob          9      10

So, how in-place changes are different from creating and replacing? #

The following example is taken from Bogumił Kamiński’s post .

julia> df1 = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2 
     │ Int64  Char 
─────┼─────────────
   11  a
   22  b
   33  c

julia> col1 = df1.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> v = [11, 13, 13]
3-element Array{Int64,1}:
 11
 13
 13

julia> df1[!, :col1] = v # Using ! here
3-element Array{Int64,1}:
 11
 13
 13

Take a break here. Guess what the result of col1 and df2.col1 will be.

julia> col1
3-element Array{Int64,1}:
 1
 2
 3

julia> df1.col1
3-element Array{Int64,1}:
 11
 13
 13

Okay. Let’s see what the results will be if we use :.

julia> df1 = DataFrame(col1=1:3, col2='a':'c')
3×2 DataFrame
 Row │ col1   col2 
     │ Int64  Char 
─────┼─────────────
   11  a
   22  b
   33  c

julia> col1 = df1.col1
3-element Array{Int64,1}:
 1
 2
 3

julia> v = [11, 13, 13]
3-element Array{Int64,1}:
 11
 13
 13

julia> df1[:, :col1] = v # Using : here
3-element Array{Int64,1}:
 11
 13
 13

See the results:

julia> col1
3-element Array{Int64,1}:
 11
 12
 13

julia> df1.col1
3-element Array{Int64,1}:
 11
 13
 13

Why is it that when we use df1[!, :col1] = v, col1 = [1,2,3] but when use df1[:, :col1] = v, col1 = [11,12,13]?

This is because df1[!, :col1] = v will first generate a new column and then use that to replace the old one. By contrast, df1[:, :col1] = v directly modifies the existing column. Therefore, when using !, you will have two col1s: the old one and the updated one. Only the updated one will be shown but the old one is stored in memory. That is why col1 will get the old column of col1 in df1 whereas the df1.col1 will get the updated one. When you use :, there is only one col1, i.e., the updated one. The old one is erased from memory. Therefore, col1 will give you only the updated data.

Read @Chris Rackauckas’s answer to the question of What is the difference between “==” and “===” comparison operators in Julia? on Stack Overflow to get a deeper understanding of this difference.

Homework #

Change col1 = df1.col1 to col1 = df1[:, col1], and see whether ! and : still generate different results.

The answer is that when col1 = df1[:, col1], whether you run df1[!, :col1] = v or df1[:, :col1] = v, col1 will always return [1,2,3].

This is because if you use col1 = df1[:, col], col1 will be a copy of df1.col1. Therefore, whatever changes you make to df1, col1 stays unchanged.

Broadcasting #

A special way to update column(s) is through broadcasting .

You’ll need Dot Syntax to get broadcasting to work. What Dot Syntax does is to apply a function to each element in an array and then to return a new array.

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[!, :Name] .= "Johnson"

julia> df[!, [:Var1, :Var2]] .= 99

julia> df

5×3 DataFrame
 Row │ Name     Var1   Var2  
     │ String   Int64  Int64 
─────┼───────────────────────
   1 │ Johnson     99     99
   2 │ Johnson     99     99
   3 │ Johnson     99     99
   4 │ Johnson     99     99
   5 │ Johnson     99     99

Note that in broadcasting, there cannot exist space between . and =; otherwise, you will see an error.

Both ! and : work in broadcasting. As with non-broadcasting methods to update column(s), ! first creates new column(s) to replace the old one(s) whereas : updates the column(s) in-place.

Please note that there is a restriction when updating column(s) with : through broadcasting: the type of object has to be the same as the one you are updating. Otherwise, you will see an error:

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[:, :Var1] .= "A"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

The same error occurs if you use df.Var1 .= "A". However, df[!, :Var1] .= "A" will work because it creates a new column to replace the old one.

One thing I don’t understand is that if I replace "A" with 'a', there won’t be an error, and all the values are updated to be 65.

Another thing I don’t understand is that df[!, [:Var1, :Var2]] .= 99 will return the whole data frame whereas df[:, [:Var1, :Var2]] .= 99 returns only the selected two columns:

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[!, [:Var1, :Var2]] .= 99

5×3 DataFrame
 Row │ Name    Var1   Var2  
     │ String  Int64  Int64 
─────┼──────────────────────
   1 │ Tom        99     99
   2 │ Mike       99     99
   3 │ John       99     99
   4 │ Jason      99     99
   5 │ Bob        99     99

julia> df[:, [:Var1, :Var2]] .= 99

5×2 SubDataFrame
 Row │ Var1   Var2  
     │ Int64  Int64 
─────┼──────────────
   199     99
   299     99
   399     99
   499     99
   599     99

Updating the value of a single cell #

Updating a single cell is different from updating columns: When updating a single cell, ! will change the source whereas : won’t.

This is essentially the same as # accessing columns .

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[!, :Var1][2] = 99
99

julia> df[:, :Var2][2] = 99
99

julia> df
5×3 DataFrame
 Row │ Name    Var1     Var2    
     │ String  Int64?   Int64?  
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike         99        3
   3 │ John          3  missing 
   4 │ Jason   missing        6
   5 │ Bob           8  missing 

Note that no matter whether you use : or !, you are not allowed to change the cell’s value to a different type . For example, if you run df[!, :Var1][2] = "Good", you’ll see an error:

MethodError: Cannot `convert` an object of type String to an object of type Int64

To force this conversion, you have to convert column data type first:

typeof(df[!, :Var1]) 
# Returns Vector{Union{Missing, Int64}} (alias for Array{Union{Missing, Int64}, 1}
df[!, :Var1] = string.(df[!, :Var1])
typeof(df[!, :Var1]) # Returns Vector{String} (alias for Array{String, 1})
# Now you can change the value of the cell
df[:, :Var1][2] = "Good"

Reference: DataFrames: convert column data type

Or, you can use Core.ifelse as @bkamins recommended :

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)
5×3 DataFrame
 Row │ Name    Var1     Var2    
     │ String  Int64?   Int64?  
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike          2        3
   3 │ John          3  missing 
   4 │ Jason   missing        6
   5 │ Bob           8  missing 

julia> df.Var1 = ifelse.(axes(df.Var1, 1) .== 2, "Good", df.Var1) # 2 is index, not value
5-element Vector{Any}:
  missing
  "Good"
 3
  missing
 8

julia> df
5×3 DataFrame
 Row │ Name    Var1     Var2    
     │ String  Any      Int64?  
─────┼──────────────────────────
   1 │ Tom     missing        1
   2 │ Mike    Good           3
   3 │ John    3        missing 
   4 │ Jason   missing        6
   5 │ Bob     8        missing 

Adding a column #

In df[!, col] = or df[:, col] = , if col is not present in df, then a new column will be added.

julia> df = DataFrame(:Name => a, :Var1 => b, :Var2 => c)

julia> df[!, :Var3] = collect(11:15)

julia> df[:, :Var4] = collect('e':'i')

julia> df[!, :Var5] .= 99 # Broadcasting

julia> df[:, :Var6] .= 't' # Broadcasting

julia> df

5×7 DataFrame
 Row │ Name    Var1     Var2     Var3   Var4  Var5   Var6 
     │ String  Int64?   Int64?   Int64  Char  Int64  Char 
─────┼────────────────────────────────────────────────────
   1 │ Tom     missing        1     11  e        99  t
   2 │ Mike          2        3     12  f        99  t
   3 │ John          3  missing     13  g        99  t
   4 │ Jason   missing        6     14  h        99  t
   5 │ Bob           8  missing     15  i        99  t

Note that it is IMPOSSIBLE to add multiple columns. For example, df[!, [:Var7, :Var8]] .= 99 will throw an error.

getproperty, or df.col #

I haven’t talked about how df.col works. It pretty much does the same thing as df[!, col], except:

Conclusion #

A table summarizing the key points #

df[!, col(s)] df[:, col(s)] df.col
Access 1 col direct reference to df a copy of the col direct reference
Access cols direct reference a copy of the cols
Update 1 col with a vector create a new column to replace the old one modify values in-place create and replace
Update cols with a matrix or df create and replace modify in-place
Update 1 col with broadcasting create and replace modify in-place modify in-place
Update cols with broadcasting create and replace modify in-place
Add 1 col with a vector create a new col create a new col create a new col
Add cols
Add 1 col with broadcasting create a new col create a new col ❌ (available later )
Add cols with broadcasting

Best practices #

Last modified on 2021-10-07