
[First published here] Burned: the most complained-about standard library function to date (Original)


Hello, I'm Yule (雨乐)!

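I have mostly used regular expressions for log analysis, with tools such as awk and grep. C++11 brought regular expressions into the standard, but our projects rarely needed them, so all I knew was that C++11 supported them. In a group chat last year someone brought up std::regex, and quite a few people complained about it.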

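I stayed out of the discussion at the time; without having looked into it, I had no standing to comment. Only after a bug a while ago did I learn that the complaints about std::regex were well founded.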

Background

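For a high-traffic service, launching a model or feature requires an experiment to verify its effect. Typically, traffic enters the experiment platform to be tagged, and the experiment tags the platform returns are concatenated in some structure and passed on downstream. At first there were few experiment tags, so all of them were returned to the client for reporting and the experiment owners analyzed the data; this worked fine for a long time. As business pressure grew, both the algorithm and the product colleagues needed to run more experiments. That created a problem: over time the experiments piled up and the tag string grew to several thousand or even more than ten thousand bytes, so removing useless experiment tags became urgent.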

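I won't go into the details of the experiments and will only explain the kind of tags that are returned. When tags are returned to the client they are concatenated into a string such as expa;expb;layerid_def;. One thing to note: for certain special reasons, when a request hits no experiment in an experiment layer, that layer is represented as layerid_def (for example 345_def). Analysis showed that layerid_def entries made up more than half of the whole tag string, so after consulting the algorithm and product colleagues we decided to strip these useless tags.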

Solution

This is really a tiny requirement, a matter of a few lines of code, so the first thing that came to mind was a regular expression:

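// `res` is the tag string returned by the experiment platform.
// The pattern matches zero or more digits followed by "_def;".
static const std::regex rex("[0-9]*_def;");
std::string result;
// Copy `res` into `result`, replacing every match with the empty string.
std::regex_replace(std::back_inserter(result),
                   res.begin(), res.end(), rex, "");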

The code is simple and needs little explanation. The result:

Input:

123;345_def;456_def;789

Output:

123;789

After self-testing and a canary release in production everything looked fine, so we rolled it out to all traffic.

Problem

Then one day an alert fired: the service had restarted.

I logged in to the server, found a coredump file, and inspected the stack with gdb.

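The core was in regex. There had been no deployments between the previous release and this coredump, so std::regex was almost certainly the cause. I turned to Google and searched for the relevant keywords.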

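Sure enough, the first few results alone showed that std::regex crashing is a known problem. I opened the second result, a bug filed against gcc, which includes a short code sample: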

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string s (100000, '*');   // 100,000 '*' characters
    std::smatch m;
    std::regex r ("^(.*?)$");

    std::regex_search (s, m, r);   // segfaults here in the run shown below

    std::cout << s.substr (0, 10) << std::endl;
    std::cout << m.str(1).substr (0, 10) << std::endl;
}

Compiling and running this code locally gives:

Program terminated with signal 11, Segmentation fault.
#0  0x000000000040a0ae in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long)
    () at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:200
200     void _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>::
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.192.el6.x86_64 libgcc-4.4.7-23.el6.x86_64

Inspecting the stack shows the same symptom as the production service.

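Since someone had already filed a bug against GCC, I didn't bother reading the source to analyze the cause myself and scrolled straight to the bottom of the page, where there is this reply (Nadav Har'El 2023-04-09 16:02:58 UTC):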

More than 5 years later, more and more projects are discovering this bug the hard way, and moving from std::regex to boost::regex which doesn't have this bug - boost::regex defaults to BOOST_REGEX_NON_RECURSIVE mode, which uses a stack on the heap instead of recursion (but I don't know if the specific examples shown the various duplicates all need this stack in practice, for example it's unfortunate if matching " *" needs to copy the entire input string in a stack). The latest example of this exodus is https://github.com/scylladb/scylladb/pull/13452. 
So I think it's about time this issue is solved. Maybe even the Boost implementation can studied for inspiration and implementation ideas?

The reply makes the cause of this coredump fairly clear: the matcher recurses, and with too many levels of recursion the stack overflows.

Now let's look at the call stack in gdb:

 (gdb) bt
#0  std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:273
#1  0x000000000040a24a in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_dfs(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, long)
    () at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:257
#2  0x0000000000407d99 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_main_dispatch(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode, std::integral_constant<bool, true>) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:87
#3  0x0000000000406892 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_main(std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_Match_mode) ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.h:116
#4  0x00000000004068c5 in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_search_from_first() ()
    at /usr/local/include/c++/5.4.0/bits/regex_executor.h:101
#5  0x0000000000405a9e in std::__detail::_Executor<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, std::__cxx11::regex_traits<char>, true>::_M_search() () at /usr/local/include/c++/5.4.0/bits/regex_executor.tcc:42
#6  0x0000000000404d98 in bool std::__detail::__regex_algo_impl<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char>, (std::__detail::_RegexExecutorPolicy)0, false>(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.tcc:95
#7  0x0000000000404718 in bool std::regex_search<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, __gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__cxx11::match_results<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.h:2148
#8  0x00000000004043ff in bool std::regex_search<std::char_traits<char>, std::allocator<char>, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >, char, std::__cxx11::regex_traits<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::match_results<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::const_iterator, std::allocator<std::__cxx11::sub_match<__gnu_cxx::__normal_iterator<char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >&, std::__cxx11::basic_regex<char, std::__cxx11::regex_traits<char> > const&, std::regex_constants::match_flag_type) ()
    at /usr/local/include/c++/5.4.0/bits/regex.h:2254

From the call chain:

regex_search
-> regex_search
--> __detail::__regex_algo_impl
---> _M_search
----> _M_search_from_first
-----> _M_main
------> _M_main_dispatch
-------> _M_dfs

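Seeing dfs is enough to tell why the stack blew up: _M_dfs calls itself (frames #0 and #1 in the backtrace are both _M_dfs), so a long enough input exhausts the stack.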

As for fixes, there are a few options:

  • Increase the stack size, from the default 1 MB to 4 MB; this is not recommended, though

  • Split the string and check each piece

  • Use boost::regex (which defaults to BOOST_REGEX_NON_RECURSIVE mode)

In the end I went with the third option, boost::regex. Long-string tests, canary release, full rollout: everything OK.

That's it for today's article. See you next time!

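Hello, I'm Yule. I have been in the industry for a little over twelve years, working on network development in traditional industries and on recommendation engines at internet companies, and I have spent the last 8 years in advertising. I am currently a senior technical expert at an internet company, responsible for the architecture and development of the ad engine.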

 


WeChat official account: 高性能架构探索
