Skip to content

Allow passing empty string as match argument to String.replace/4#6559

Merged
josevalim merged 5 commits intoelixir-lang:masterfrom
dkarter:allow-calling-replace-with-empty-string
Sep 12, 2017
Merged

Allow passing empty string as match argument to String.replace/4#6559
josevalim merged 5 commits intoelixir-lang:masterfrom
dkarter:allow-calling-replace-with-empty-string

Conversation

@dkarter
Copy link
Copy Markdown
Contributor

@dkarter dkarter commented Sep 10, 2017

Summary

This patch allows calling String.replace with an empty string as the match argument.

Reason for patch

When I found out that passing an empty string as a match argument to replace throws an ArgumentError I was surprised by this unexpected behavior.

Many languages support this feature which allows supplying match arguments coming from dynamic sources.

For example assume we are processing file names and trying to remove the
extension (assuming Path.basename/2 function did not exist):

Before patch:

inputs = ["README.md", "package.json", "LICENSE"]

inputs
|> Enum.map(&(String.replace(&1, Path.extname(&1), "")))
# ** (ArgumentError) argument error
#    (stdlib) binary.erl:275: :binary.replace/4
#    (elixir) lib/enum.ex:1255: Enum."-map/2-lists^map/1-0-"/2
#    (elixir) lib/enum.ex:1255: Enum."-map/2-lists^map/1-0-"/2

To fix it we would have to transform the code to something like this:

inputs = ["README.md", "package.json", "LICENSE"]
inputs
|> Enum.map(fn (file) ->
  ext = Path.extname(file)
  if ext == "" do
    file
  else
    String.replace(file, ext, ""))
  end
end)

# => ["README", "package", "LICENSE"]

After patch:

inputs = ["README.md", "package.json", "LICENSE"]

inputs
|> Enum.map(&(String.replace(&1, Path.extname(&1), "")))

# => ["README", "package", "LICENSE"]

Examples from other languages

Ruby

'LICENSE'.gsub('','')
# => 'LICENSE'

JavaScript

'LICENSE'.replace('','')
// => 'LICENSE'

Go

fmt.Println(strings.Replace("LICENSE", "", "", -1))
// => "LICENSE"

Rust

fn main() {
    let result = str::replace("LICENSE", "", "");
    println!("{}", result); // => "LICENSE"
}

Java

System.out.println("LICENSE".replace("",""));
// => "LICENSE"

Lua

str = "LICENSE"
res = str:gsub("", "")

io.write(res, "\n")
-- "LICENSE"

SQL

select replace('LICENSE', '', '');
-- LICENSE

Swift

"LICENSE".replacingOccurrences(of: "", with: "") // "LICENSE"

Objective C

NSString *str = [@"LICENSE" stringByReplacingOccurrencesOfString:@"" withString:@""];
NSLog (str);

Using empty string match for interspersing

This patch also adds support for interspersing every character using the empty string as match - as in many other languages (Ruby, Rust, Java, Python, Lua...):

Ruby

'LICENSE'.gsub('','x')
# => 'xLxIxCxExNxSxEx'

After this patch:

String.replace("LICENSE","","x") 
# => "xLxIxCxExNxSxEx"

Thank you for your consideration and time.


Edit: added more languages as examples

@josevalim
Copy link
Copy Markdown
Member

Thank you @dkarter! Does any of the other languages do the interspersing behaviour? The reason I am asking you is because those results are seemingly inconsistent.

For example:

String.replace("", "", "x") #=> "x"
String.replace_leading("", "", "x") #=> ""

String.replace("a", "a", "x") #=> "x"
String.replace_leading("a", "a", "x") #=> "x"

Or if I were to define a property, it would be written like this:

check all string <- string(:ascii) do
  assert String.replace(string, string, "x") == "x"
  assert String.replace_leading(string, string, "x") == "x"
  assert String.replace_trailing(string, string, "x") == "x"
end

And that seems to be true on all cases, unless the string is empty.

If we think in properties, than we would conclude that if an empty string is given to replace_leading and replace_trailing, then it should always prepend or append the replacement. This is what replace_suffix and replace_prefix does and it would also reflect this property:

check all left <- string(:ascii),
          right <- string(:ascii),
          left != right do
  assert String.replace_leading(left <> right, left, "x") == "x" <> right
  assert String.replace_trailing(left <> right, right, "x") == left <> "x"
end

And again, this is always true except for the empty string in the chosen implementation.

@whatyouhide
Copy link
Copy Markdown
Member

I am inclined to say this is not a good idea. If we look at this from a properties point of view, one property of String.replace/3 could be that

check all str1 <- string(),
          str2 <- string(),
          substr1 <- substring(str1) do
  assert String.contains?(String.replace(str1, substr1, str2), str2)
end

This property should hold, but it doesn't if substr1 == "" because we don't replace it with str2. After all, since String.contains?(any_string, "") == true, the "correct" behaviour for String.replace(any_string, "", replacement) would mean getting "infinite" replacements inside any_string. Does this make sense?

This is only one way of looking at the problem so I might be looking at it from the wrong angle here.

@dkarter dkarter force-pushed the allow-calling-replace-with-empty-string branch from b9105d9 to 96d3964 Compare September 11, 2017 00:15
@dkarter
Copy link
Copy Markdown
Contributor Author

dkarter commented Sep 11, 2017

@josevalim Thank you for your input. Very valid points.

To answer your first question: yes - In my tests I have found Ruby, Rust, Python and Java to support the interspersing functionality, SQL, Swift and PHP support replacement with empty string but do not intresperse.

I've adjusted the code for replace_leading and replace_trailing to always prepend/append the replacement to the subject. I believe this should now pass the property tests you described.

Although you mentioned property based testing becoming part of Elixir 1.6 in ElixirConf I am not sure how to run those on master. I am assuming this is not ready yet, but if it is, are there any examples in this repo I can lean on to test my code?


@whatyouhide Thank you for reviewing the code. I am not entirely sure what you mean here:

This property should hold, but it doesn't if substr1 == "" because we don't replace it with str2. After all, since String.contains?(any_string, "") == true, the "correct" behaviour for String.replace(any_string, "", replacement) would mean getting "infinite" replacements inside any_string. Does this make sense?

Particularly I do not understand the "infinite" replacement statement.

The code as it is in this patch will return a string containing str2 if substr1 == "":

iex(8)> String.replace("", "", "foo")
"foo"
iex(9)> String.replace(".", "", "foo")
"foo.foo"
iex(10)> String.contains?(String.replace("", "", "foo"), "foo")
true
iex(11)> String.contains?(String.replace(".", "", "foo"), "foo")
true

Can you please provide an example of an input and what you expect the output would be vs what it is using this patch? That would be very helpful for me.

@lexmag
Copy link
Copy Markdown
Member

lexmag commented Sep 11, 2017

@dkarter the PR that introduced raising behaviour has a discussion on what infinite empty string is: #5022.

@josevalim
Copy link
Copy Markdown
Member

I believe there are two discussions here. One about replace and another about replace_leading/replace_trailing. Shall we make two separate issues and start again?

Currently we have an inconsistency when comparing replace with the suffix and prefix ones. In particular, should this property hold?

check all string <- string(:ascii) do
  assert String.replace(string, string, "x") == "x"
  assert String.replace_prefix(string, string, "x") == "x"
  assert String.replace_suffix(string, string, "x") == "x"
end

In my opinion we should be consistent, so we need to either stop supporting supporting empty strings on prefix and suffix, or support them altogether.

The replace_suffix and replace_prefix is a bit more complicated because of the reasons stated.

@michalmuskala
Copy link
Copy Markdown
Member

michalmuskala commented Sep 11, 2017

There's also another property to consider: if I replace some value in the string, it shouldn't be present in it anymore. Or in code:

check all string <- string(:ascii), replace <- string(:ascii) do
  # use ą as the replacement since it can't be generated by the ascii generators
  replaced = String.replace(string, replace, "ą")
  refute String.contains?(replaced, replace)
end

This does not hold with empty "replace" argument, since String.contains?/2 considers an empty string to be always contained in other strings.

@josevalim
Copy link
Copy Markdown
Member

@michalmuskala I would say that property only holds if the replace string is not in the replacement and "" is always in the replacement.

dkarter pushed a commit to dkarter/elixir that referenced this pull request Sep 11, 2017
This is a follow up to elixir-lang#6559.

Add support for using empty string as a match to
`String.replace_leading` and `String.replace_trailing` while matching
behavior to `String.replace_prefix` and `String.replace_suffix`.
@dkarter dkarter force-pushed the allow-calling-replace-with-empty-string branch from 96d3964 to 14b786a Compare September 11, 2017 16:58
@dkarter dkarter changed the title Allow passing empty string to replace functions Allow passing empty string as match argument String.replace Sep 11, 2017
@dkarter
Copy link
Copy Markdown
Contributor Author

dkarter commented Sep 11, 2017

@josevalim As per your suggestion, I have split the String.replace_trailing and String.replace_leading out into it's own PR (#6563), and this PR is now solely focused on String.replace. I hope this makes it easier.

The main point I hope to get across is that using an exception for String.replace leads to more verbose code and potentially unexpected behavior during runtime if a dynamic value is used for match. That argument is strengthened by the fact that there are plenty of precedents in most popular programming languages to support empty string as a match argument to replace.

Comment thread lib/elixir/lib/string.ex Outdated
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather implement this ourselves instead of dispatching to regex. Something like:

defp intersperse(subject, replacement) do
  case next_grapheme(subject) do
    {current, rest} -> [current, replacement | intersperse(rest, replacement)]
    nil -> []
  end
end

And then you call IO.iodata_to_binary [replacement | intersperse(subject, replacement)] on this clause.

PS: I haven't tested this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your solution, next_grapheme is cool.

The tests did not pass with your solution when the global: false option is provided, so I had to add another pattern match:

assert String.replace("LICENSE","","x") == "xLxIxCxExNxSxEx"
assert String.replace("LICENSE","","x", global: true) == "xLxIxCxExNxSxEx"
assert String.replace("LICENSE","","x", global: false) == "xLICENSE"
# ^^^ did not pass

This is what I ended up with if we are not going to call out to Regex.replace via my original solution:

  def replace(subject, pattern, replacement, options \\ [])
  def replace(subject, "", "", _), do: subject
  def replace(subject, "", replacement, global: false), do: replacement <> subject
  def replace(subject, "", replacement, _options) do
    IO.iodata_to_binary [replacement | intersperse(subject, replacement)]
  end
  def replace(subject, pattern, replacement, options) when is_binary(replacement) do
    if Regex.regex?(pattern) do
      Regex.replace(pattern, subject, replacement, global: options[:global])
    else
      opts = translate_replace_options(options)
      :binary.replace(subject, pattern, replacement, opts)
    end
  end

  defp intersperse(subject, replacement) do
    case next_grapheme(subject) do
      {current, rest} -> [current, replacement | intersperse(rest, replacement)]
      nil -> []
    end
  end

This passes all the tests, and although it is longer it is definitely more explicit. If that looks good to you I'll revise my commit. @josevalim

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful. We should not pattern match on keyword lists though, so it is probably better for you to do:

  def replace(subject, "", replacement, options) do
    if Keyword.get(option, :global, true) do
      IO.iodata_to_binary [replacement | intersperse(subject, replacement)]
    else
      replacement <> subject
    end
  end

or something like that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim I've committed the changes you requested and added some documentation. Please let me know if you have any other concerns or anything else you would like me to change. Thanks!

@dkarter dkarter force-pushed the allow-calling-replace-with-empty-string branch from fb25625 to 4ad148c Compare September 11, 2017 22:14
Dorian Karter added 3 commits September 11, 2017 17:17
This patch allows calling `String.replace` with an empty string as the
match argument.

This is particularly useful when the match expression is being passed
down as variable value that could be an empty string.

For example assume we are processing file names and trying to remove the
extension (get the basename - assuming Path.basename/2 function did not
exist):

Before patch:

```elixir
inputs = ["README.md", "package.json", "LICENSE"]

inputs
|> Enum.map(fn (file) ->
  ext = Path.extname(file)
  if ext == "" do
    file
  else
    String.replace(file, ext, ""))
  end
end)
```

After patch:

```elixir
inputs = ["README.md", "package.json", "LICENSE"]

inputs
|> Enum.map(&(String.replace(&1, Path.extname(&1), "")))
```
When empty string is provided for `match` and non-empty string is
provided for `replacement`.
@dkarter dkarter force-pushed the allow-calling-replace-with-empty-string branch from 4ad148c to d8683d6 Compare September 11, 2017 22:18
Comment thread lib/elixir/lib/string.ex
interspersed. If an empty string is provided as `replacement` the `subject`
will be returned:

iex> String.replace("ELIXIR", "", ".")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it is not consistent with how split/2 sees empty strings inside:

String.split("ELIXIR", "")
#=> ["E", "L", "I", "X", "I", "R", ""]

Should not String.replace("ELIXIR", "", ".") return "E.L.I.X.I.R." then? 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexmag that however does not mirror how replace_suffix and replace_prefix works today. In Rust we would get an empty string before and after, so I would rather say it is a bug on how split works.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim great.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change how split works because it is deeply grained into how split with regex works.

The other point to consider is that changing split may not be the way to go depending on how we want to write the properties. For example, this holds today:

check all string <- string(:unicode) do
  assert length(String.split(string, "")) == String.length + 1
end

But if we rewrite split to have an empty string before and after, then it no longer holds. The main point here still holds though, replace_suffix, replace_prefix and replace are inconsistent.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we rewrite split to have an empty string before and after, then it no longer holds.

If todays property holds, it does not mean that it is correct though; with fixed splice we would have + 2 instead of + 1, thus we still have a property.
Maybe we should change Regex.split/3 as well, considering that Regex.replace(~r//, "qwe", ".") sees an empty string before and returns ".q.w.e."?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexmag We can generalize the property above:

check all string <- string(:unicode), count <- integer() do
  duplicate = String.duplicate(string, count)
  assert String.split(duplicate, string)) == count + 1
end

The only way the property above holds is if the previous property also holds. We can generally say that the number of results in split is the number of occurrences of the pattern + 1. Adding an empty string before and after makes no sense, I am afraid.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, great catch regarding Regex.replace. That puts me clearly in the camp of merging this PR. Split and replace will have the inconsistency we have been debating but this inconsistency exists today and I would prefer to make all replacements consistent between them and all splits consistent between them.

@lexmag lexmag changed the title Allow passing empty string as match argument String.replace Allow passing empty string as match argument String.replace/4 Sep 12, 2017
@whatyouhide
Copy link
Copy Markdown
Member

It would really help me review this if we could decide on a set of properties that replace should hold, write them down, and run them quickly to see if it holds them before or after this PR. What do you folks think?

@whatyouhide whatyouhide changed the title Allow passing empty string as match argument String.replace/4 Allow passing empty string as match argument to String.replace/4 Sep 12, 2017
@josevalim
Copy link
Copy Markdown
Member

@whatyouhide I would like to see this being consistent:

check all string <- string(:ascii) do
  assert String.replace(string, string, "x") == "x"
  assert String.replace_suffix(string, string, "x") == "x"
  assert String.replace_prefix(string, string, "x") == "x"
end

@whatyouhide
Copy link
Copy Markdown
Member

@josevalim can we generalize it to say

check all string <- string(:ascii),
          replacement <- string(:ascii) do
  assert String.replace(string, string, replacement) == replacement
  # same for _prefix and _suffix
end

?

@josevalim
Copy link
Copy Markdown
Member

@whatyouhide it doesn't matter in my opinion... so whatever you prefer.

Comment thread lib/elixir/lib/string.ex Outdated
"a[,,]b[,,]c"

When an empty string is provided as a `pattern`, the function will treat it as
an implicit empty string between each character and the string will be
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we say codepoint instead of character here?

Copy link
Copy Markdown
Member

@lexmag lexmag Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO "character" is alright, but if we want to be more specific it should be "grapheme", not "codepoint".

Copy link
Copy Markdown
Member

@lexmag lexmag Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see we use "character" in other places.

Copy link
Copy Markdown
Member

@lexmag lexmag Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to go over String docs since our use of "character" is inconsistent: usually it stands for grapheme, sometimes—codepoint. 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use grapheme.

@josevalim josevalim merged commit 63f1d91 into elixir-lang:master Sep 12, 2017
@josevalim
Copy link
Copy Markdown
Member

❤️ 💚 💙 💛 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants