Julia: How to Conditionally Select Row(s) in A DataFrame

Hongtao Hao / 2021-07-14


This post is tested under Julia v1.6.1 and DataFrames v1.2.0.

Base.filter is suitable for this task.

df = DataFrame(
    x = [3, 1, 2, 1, 5], 
    y = ["b", "c", "a", "b", "b"],
    z = [true, true, false, false, false]
)
julia> df
5×3 DataFrame
 Row │ x      y       z     
     │ Int64  String  Bool  
─────┼──────────────────────
   13  b        true
   21  c        true
   32  a       false
   41  b       false
   55  b       false

If it’s just one condition, then selection is very simple:

julia> filter(r -> r.x > 1, df)
3×3 DataFrame
 Row │ x      y       z     
     │ Int64  String  Bool  
─────┼──────────────────────
   13  b        true
   22  a       false
   35  b       false

r here stands for row, but you can replace it with anything you like: t, row, anything, etc.

If you have more than two conditions:

julia> filter([:x, :y] => (x, y) -> x > 1 && y == "b", df)
2×3 DataFrame
 Row │ x      y       z     
     │ Int64  String  Bool  
─────┼──────────────────────
   13  b        true
   25  b       false
julia> filter([:x, :y, :z] => (x, y, z) -> x > 1 && y == "b" && z == true, df)
1×3 DataFrame
 Row │ x      y       z    
     │ Int64  String  Bool 
─────┼─────────────────────
   13  b       true

You can do the same thing without using the filter function as well:

julia> df[df.x .> 1, :]
3×3 DataFrame
 Row │ x      y       z     
     │ Int64  String  Bool  
─────┼──────────────────────
   13  b        true
   22  a       false
   35  b       false
julia> df[(df.x .> 1) .& (df.y .== "b") .& (df.z .== true), :]
1×3 DataFrame
 Row │ x      y       z    
     │ Int64  String  Bool 
─────┼─────────────────────
   13  b       true

Note that when you use the Base.filter function, you need to use &&, a Boolean operator , whereas when you select rows using data frame indexing, you are supposed to use & (note the Dot before it, though), a bitwise operator .

When there are missing values #

One thing you need to keep in mind is that if there are missing values in the data frame, you might not be able to use the second option, and you might not use equality and comparison operators in filter.

For example,

julia> df = DataFrame(
    x = [3, 1, 2, 1, missing], 
    y = ["b", "c", "a", "b", "b"], 
    z = [true, true, false, false, false]
)
5×3 DataFrame
 Row │ x        y       z     
     │ Int64?   String  Bool  
─────┼────────────────────────
   13  b        true
   21  c        true
   32  a       false
   41  b       false
   5 │ missing  b       false

julia> filter(r -> r.x == 1, df)
ERROR: TypeError: non-boolean (Missing) used in boolean context

julia> df[df.x .> 1, :]
ERROR: ArgumentError: unable to check bounds for indices of type Missing

You need to use Base.isequal and Base.isless instead:

julia> filter(r -> isequal(r.x, 1), df)
2×3 DataFrame
 Row │ x       y       z     
     │ Int64?  String  Bool  
─────┼───────────────────────
   11  c        true
   21  b       false

julia> df[isless.(1, df.x), :]
3×3 DataFrame
 Row │ x        y       z     
     │ Int64?   String  Bool  
─────┼────────────────────────
   13  b        true
   22  a       false
   3 │ missing  b       false

Note that Julia Documentation mentions that in isless operator,

missing is considered as greater than any other value.

This is the reason why you see the row containing missing in the result of df[isless.(1, df.x), :].

References #

Last modified on 2021-10-05