Dataframe column order

1 September 2022

Sometimes—in an ETL process for instance—the order of columns does matter, in which case it can be easily set.

cols_order = ["a", "b", "c"]
df = df.loc[:, cols_order]
print(df.columns)

['a', 'b', 'c']

But what if you want to specify columns that may or may not be there, and have the order be the same as is specified, to the extent that such columns are indeed in the dataframe?

First we need a subset, and then we can sort based on the order of our specified columns.

import pandas as pd

x = 1
df = pd.DataFrame([{"a": x, "b": x, "c": x} for i in range(5)])

print(df.to_markdown())

|    |   a |   b |   c |
|---:|----:|----:|----:|
|  0 |   1 |   1 |   1 |
|  1 |   1 |   1 |   1 |
|  2 |   1 |   1 |   1 |
|  3 |   1 |   1 |   1 |
|  4 |   1 |   1 |   1 |

def df_order_cols(df: pd.DataFrame, desired_cols_order: list) -> pd.DataFrame:
    # Subset
    got = [have for have in df.columns if have in desired_cols_order]
    # Sort the list of columns we do have per the order in desired_cols_order.index
    new_cols = sorted(got, key=lambda x: desired_cols_order.index(x))
    # Return the dataframe with the sorted columns
    return df.loc[:, new_cols]


desired_cols_order = ["c", "d", "a"]
print(df_order_cols(df, desired_cols_order).to_markdown())

|    |   c |   a |
|---:|----:|----:|
|  0 |   1 |   1 |
|  1 |   1 |   1 |
|  2 |   1 |   1 |
|  3 |   1 |   1 |
|  4 |   1 |   1 |

In the example above, we specify that we want the order ['c', 'd', 'e'] to be followed. The dataframe in question does not contain a column 'd', and so that cannot be included. The order is respected, however, for the two columns that are contained; ['a', 'c'].