Strings

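This section covers operations on String data, a DataType that is used frequently when working with DataFrames. Processing strings is often inefficient because their memory size is unpredictable, which forces the CPU to access many random memory locations. To address this, Polars uses Arrow as its backend, which stores all strings in a contiguous block of memory. As a result, string traversal is cache-optimal and predictable for the CPU.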
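To make the idea of contiguous storage concrete, the sketch below packs a few strings into a single byte buffer plus a list of offsets, in the spirit of Arrow's variable-size binary layout. The helper names `pack_strings` and `get_string` are hypothetical, null handling is left out, and the layout Polars actually uses internally may differ between versions.

```python
# Simplified illustration of an Arrow-style string column:
# one contiguous byte buffer plus an offsets list.
# This is NOT Polars' real internal representation, only a sketch.


def pack_strings(values):
    """Pack a list of str into (data, offsets)."""
    data = bytearray()
    offsets = [0]
    for value in values:
        data.extend(value.encode("utf-8"))
        offsets.append(len(data))  # end of this string, start of the next
    return bytes(data), offsets


def get_string(data, offsets, i):
    """Read string i back by slicing the shared buffer."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


data, offsets = pack_strings(["Crab", "cat and dog", "rab$bit"])
print(data)     # b'Crabcat and dograb$bit'
print(offsets)  # [0, 4, 15, 22]
print(get_string(data, offsets, 1))  # cat and dog
```

Because every value lives in the same buffer, scanning the column walks memory front to back instead of chasing one pointer per string.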

String processing functions are available in the str namespace.

Accessing the string namespace

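The str namespace can be accessed through the .str attribute of a column with the String data type. In the following example, we create a column named animal and compute the length of each element in the column in terms of the number of bytes and the number of characters. If you are working with ASCII text, the results of these two computations are the same, and len_bytes is recommended since it is faster.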

Python: str.len_bytes · str.len_chars

import polars as pl

df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rab$bit", None]})

out = df.select(
    pl.col("animal").str.len_bytes().alias("byte_count"),
    pl.col("animal").str.len_chars().alias("letter_count"),
)
print(out)

Rust: str.len_bytes · str.len_chars

use polars::prelude::*;

let df = df!(
        "animal" => &[Some("Crab"), Some("cat and dog"), Some("rab$bit"), None],
)?;

let out = df
    .clone()
    .lazy()
    .select([
        col("animal").str().len_bytes().alias("byte_count"),
        col("animal").str().len_chars().alias("letter_count"),
    ])
    .collect()?;

println!("{}", &out);

shape: (4, 2)
┌────────────┬──────────────┐
│ byte_count ┆ letter_count │
│ ---        ┆ ---          │
│ u32        ┆ u32          │
╞════════════╪══════════════╡
│ 4          ┆ 4            │
│ 11         ┆ 11           │
│ 7          ┆ 7            │
│ null       ┆ null         │
└────────────┴──────────────┘

String parsing

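Polars offers multiple methods for checking and parsing the elements of a string. First, we can use the contains method to check whether a given pattern exists within a string. Later examples show how to extract these patterns and how to replace them.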

Check for the existence of a pattern

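To check for the presence of a pattern within a string, we can use the contains method. The contains method accepts either a plain substring or a regex pattern, depending on the value of the literal parameter. If the pattern we are searching for is a simple substring located at the beginning or the end of the string, we can use the starts_with and ends_with functions instead.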

Python: str.contains · str.starts_with · str.ends_with

out = df.select(
    pl.col("animal"),
    pl.col("animal").str.contains("cat|bit").alias("regex"),
    pl.col("animal").str.contains("rab$", literal=True).alias("literal"),
    pl.col("animal").str.starts_with("rab").alias("starts_with"),
    pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out)

Rust: str.contains · str.starts_with · str.ends_with (available on feature regex)

let out = df
    .clone()
    .lazy()
    .select([
        col("animal"),
        col("animal")
            .str()
            .contains(lit("cat|bit"), false)
            .alias("regex"),
        col("animal")
            .str()
            .contains_literal(lit("rab$"))
            .alias("literal"),
        col("animal")
            .str()
            .starts_with(lit("rab"))
            .alias("starts_with"),
        col("animal").str().ends_with(lit("dog")).alias("ends_with"),
    ])
    .collect()?;
println!("{}", &out);

shape: (4, 5)
┌─────────────┬───────┬─────────┬─────────────┬───────────┐
│ animal      ┆ regex ┆ literal ┆ starts_with ┆ ends_with │
│ ---         ┆ ---   ┆ ---     ┆ ---         ┆ ---       │
│ str         ┆ bool  ┆ bool    ┆ bool        ┆ bool      │
╞═════════════╪═══════╪═════════╪═════════════╪═══════════╡
│ Crab        ┆ false ┆ false   ┆ false       ┆ false     │
│ cat and dog ┆ true  ┆ false   ┆ false       ┆ true      │
│ rab$bit     ┆ true  ┆ true    ┆ true        ┆ false     │
│ null        ┆ null  ┆ null    ┆ null        ┆ null      │
└─────────────┴───────┴─────────┴─────────────┴───────────┘

Extract a pattern

The extract method allows us to extract a pattern from a given string. It takes a regex pattern containing one or more capture groups, which are defined by parentheses () in the pattern. The group index indicates which capture group to output.

Python: str.extract

df = pl.DataFrame(
    {
        "a": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
out = df.select(
    pl.col("a").str.extract(r"candidate=(\w+)", group_index=1),
)
print(out)

Rust: str.extract

let df = df!(
        "a" =>  &[
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
)?;
let out = df
    .clone()
    .lazy()
    .select([col("a").str().extract(lit(r"candidate=(\w+)"), 1)])
    .collect()?;
println!("{}", &out);

shape: (3, 1)
┌─────────┐
│ a       │
│ ---     │
│ str     │
╞═════════╡
│ messi   │
│ null    │
│ ronaldo │
└─────────┘

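To extract all occurrences of a pattern within a string, we can use the extract_all method. In the example below, we extract all numbers from a string using the regex pattern (\d+), which matches one or more digits. The result of the extract_all method is a list containing every instance of the matched pattern within the string.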

Python: str.extract_all

df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
out = df.select(
    pl.col("foo").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)
print(out)

Rust: str.extract_all

let df = df!("foo"=> &["123 bla 45 asd", "xyz 678 910t"])?;
let out = df
    .clone()
    .lazy()
    .select([col("foo")
        .str()
        .extract_all(lit(r"(\d+)"))
        .alias("extracted_nrs")])
    .collect()?;
println!("{}", &out);

shape: (2, 1)
┌────────────────┐
│ extracted_nrs  │
│ ---            │
│ list[str]      │
╞════════════════╡
│ ["123", "45"]  │
│ ["678", "910"] │
└────────────────┘

Replace a pattern

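So far we have discussed two methods for pattern matching and extraction; now we look at how to replace a pattern within a string. Similar to extract and extract_all, Polars provides the replace and replace_all methods for this purpose. In the example below, we replace one match of abc at the end of a word (\b) with ABC, and we replace all occurrences of a with -.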

Python: str.replace · str.replace_all

df = pl.DataFrame({"id": [1, 2], "text": ["123abc", "abc456"]})
out = df.with_columns(
    pl.col("text").str.replace(r"abc\b", "ABC"),
    pl.col("text").str.replace_all("a", "-", literal=True).alias("text_replace_all"),
)
print(out)

Rust: str.replace · str.replace_all (available on feature regex)

let df = df!("id"=> &[1, 2], "text"=> &["123abc", "abc456"])?;
let out = df
    .clone()
    .lazy()
    .with_columns([
        col("text").str().replace(lit(r"abc\b"), lit("ABC"), false),
        col("text")
            .str()
            .replace_all(lit("a"), lit("-"), true)
            .alias("text_replace_all"),
    ])
    .collect()?;
println!("{}", &out);

shape: (2, 3)
┌─────┬────────┬──────────────────┐
│ id  ┆ text   ┆ text_replace_all │
│ --- ┆ ---    ┆ ---              │
│ i64 ┆ str    ┆ str              │
╞═════╪════════╪══════════════════╡
│ 1   ┆ 123ABC ┆ 123-bc           │
│ 2   ┆ abc456 ┆ -bc456           │
└─────┴────────┴──────────────────┘

API documentation

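In addition to the examples covered above, Polars offers various other string manipulation methods for tasks such as formatting, stripping, and splitting. To explore these additional methods, see the Polars API documentation for your chosen programming language.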