fail over
1 failover介绍
failover转换,主要考虑的是如何减少数据量的丢失,如果可以允许数据丢失,可以直接切换备库为主库,不必在进行下面的讨论!
减少数据的丢失,实际就是弥补日志数据没有传递的问题
在11g中,推出了日志手工flush的功能,来弥补日志数据没有传递的问题。前提是主库能启动到mount 状态。Flush 可以把没有发送的归档和current online redo 发送到备库。
SQL> startup mount
SQL> alter system flush redo to 'orclps1';
注意:这个过程并不会经常成功执行,而且在10g这样的版本下也没有办法自动flush redo。解决的方法也是有的,就是从Primary目录中,将日志拷贝到Standby端,手工去加载。
另外failover之后,原主库如何重建为备库:
1.利用flashback database将原主库恢复到failover之前的时间点,前提是开启flashback on。
2.rman备份恢复,如果有failover之前的备份,可以先恢复到之前时间点,然后在将主库转为备库。
3.环境重建,最简单的方法,删除原主库,再搭建新的备库。
备库转为主库的时间戳
select to_char(standby_became_primary_scn) from v$database;
常用于数据库开启了flashback on后原主库闪回到备库切换为主库的时间点,以保证原主库切换为新备库后还能继续和新主库同步并应用日志。
下面介绍一下后面的实验
实验一:利用11g新功能flush redo同步数据,减少数据丢失。
实验二:当主库宕机后,无法启动到mount时,通过拷贝主库归档和在线日志到备库注册加载,减少数据量的丢失。(当然,如果运气好,注册后就能应用)
2 实验一:failover之flush redo同步数据
现数据库环境如下
节点 |
网络ip地址 |
数据库名 |
unique name |
数据库实例名 |
数据文件位置 |
zyx.test.com(备库) |
192.168.11.111 |
orcl |
orcl |
test |
/u01/app/oracle/oradata/orcl/ |
orcl.test.com(主库) |
192.168.11.22 |
orcl |
orclps |
orclps |
/u01/app/oracle/oradata/orcl/ |
dg2.orcl.com(级联库) |
192.168.11.23 |
orcl |
orclstd |
orclstd |
/u01/app/oracle/oradata/orcl/ |
----实验中,将主库failover为备库,级联库不变(为了后续实验,主库备库级联库都开启flashback)
2.1 环境准备
----首先,orclps主库开启flahback
sys@ORCL>shut immediate
sys@ORCL>startup mount
sys@ORCL> alter database flashback on;
sys@ORCL>alter database open;
sys@ORCL>select name,flashback_on from v$database;
----备库orcl开启flashback
sys@ORCL> alter database recover managed standby database cancel;
sys@ORCL>shut immediate
sys@ORCL>startup mount
sys@ORCL> alter database flashback on;
sys@ORCL>alter database open;
sys@ORCL>alter database recover managed standby database using current logfile disconnect;
----级联库orclstd开启flashback
SQL> alter database recover managed standby database cancel;
SQL> shut immediate
SQL> startup mount
SQL> alter database flashback on;
SQL> alter database recover managed standby database using current logfile disconnect;
2.2 模拟故障
----备库orcl关闭网络,然后在主库上创建表插入数据并提交
service network stop
----此时主库状态
sys@ORCL>select database_role,switchover_status,db_unique_name from v$database;
DATABASE_ROLE SWITCHOVER_STATUS DB_UNIQUE_NAME
---------------- -------------------- ------------------------------
PRIMARY RESOLVABLE GAP orclps
----主库上模拟故障
sys@ORCL>create table shall(shall int);
begin
for i in 1..100000 loop
insert into shall values(i);
end loop;
commit;
end;
/
PL/SQL procedure successfully completed.
sys@ORCL>alter system switch logfile;
sys@ORCL>insert into shall values (111);
sys@ORCL>commit;
sys@ORCL>shut abort
2.3 flush redo同步数据
----备库开启网络
service network start
----此时备库是无法查询到主库故障之前未传输的数据
sys@ORCL>select count(*) from shall;
select count(*) from shall
*
ERROR at line 1:
ORA-00942: table or view does not exist
----此时如果主库能启动到mount,可以利用flush redo同步
------先查看未同步日志
sys@ORCL>startup mount
sys@ORCL>select sequence#, applied from v$archived_log;
------开始同步
sys@ORCL> alter system flush redo to 'orcl';
------等待传输完毕后,在备库查看之前未传输的数据
sys@ORCL>select count(*) from shall;
COUNT(*)
----------
100001
2.4 flush同步后主备库切换
--------现备切换为新主库
sys@ORCL>alter database recover managed standby database cancel;
sys@ORCL>alter database recover managed standby database finish;
sys@ORCL>select switchover_status,db_unique_name from v$database;
sys@ORCL>alter database commit to switchover to primary with session shutdown;
sys@ORCL>alter database open;
-------原主库切换为新备库(没有开启flashback on和原rman备份,就重建备库)
----------先查询新主库切换时间戳
sys@ORCL>select to_char(standby_became_primary_scn) from v$database;
TO_CHAR(STANDBY_BECAME_PRIMARY_SCN)
----------------------------------------
1045602
----------闪回数据库
sys@ORCL>flashback database to scn 1045602;
----------原主库切换备库
sys@ORCL>alter database convert to physical standby;
sys@ORCL>shut immediate
sys@ORCL>startup
sys@ORCL>alter database recover managed standby database using current logfile disconnect;
----查看主备日志应用情况
sys@ORCL>select sequence#, applied from v$archived_log;
2.5 数据同步测试
------主库orcl删除shall表
sys@ORCL>drop table shall purge;
------备库orclps备库已经无法查看该表咯
sys@ORCL>select count(*) from shall;
select count(*) from shall
*
ERROR at line 1:
ORA-00942: table or view does not exist
------此时级联库orclstd还能查看到数据
SQL> select count(*) from shall;
COUNT(*)
----------
100001
------主库orcl切换一下日志
sys@ORCL>alter system switch logfile;
------此时级联库orclstd也无法查看数据咯
SQL> select count(*) from shall;
select count(*) from shall
*
ERROR at line 1:
ORA-00942: table or view does not exist
---------------为什么这次failover级联库上没有任何更改,还能找到新的备库去同步数据?因为之前做switover时参数都已经配置好了,主库和备库任何一个库只要是备库就会传归档到级联库。
3 实验二:failover之备库手动注册主库日志后无法应用
现数据库环境如下
节点 |
网络ip地址 |
数据库名 |
unique name |
数据库实例名 |
数据文件位置 |
zyx.test.com(主库) |
192.168.11.111 |
orcl |
orcl |
test |
/u01/app/oracle/oradata/orcl/ |
orcl.test.com(备库) |
192.168.11.22 |
orcl |
orclps |
orclps |
/u01/app/oracle/oradata/orcl/ |
dg2.orcl.com(级联库) |
192.168.11.23 |
orcl |
orclstd |
orclstd |
/u01/app/oracle/oradata/orcl/ |
----实验中,将备库failover为主库
3.1模拟故障
----备库orclps关闭网络,然后在主库上创建表插入数据并提交
service network stop
----此时主库orcl状态
sys@ORCL>select database_role,switchover_status,db_unique_name from v$database;
DATABASE_ROLE SWITCHOVER_STATUS DB_UNIQUE_NAME
---------------- -------------------- ------------------------------
PRIMARY RESOLVABLE GAP orcl
----主库orcl上模拟故障
sys@ORCL>create table shall(shall int);
begin
for i in 1..100000 loop
insert into shall values(i);
end loop;
commit;
end;
/
PL/SQL procedure successfully completed.
sys@ORCL>alter system switch logfile;
begin
for i in 1..1234 loop
insert into shall values(i);
end loop;
commit;
end;
/
PL/SQL procedure successfully completed.
sys@ORCL>shut abort
3.2手动注册主库归档
----如果flush redo不能完成日志传递,或者说主库已经无法进入到mount模式,那么就需要手动去完成日志同步
----备库orclps开启网络
service network start
----此时备库orclps是无法查询到主库故障之前未传输的数据
sys@ORCL>select count(*) from shall;
select count(*) from shall
*
ERROR at line 1:
ORA-00942: table or view does not exist
----假如主库orcl已经无法启动,这时候先查看备库orclps应用日志序号,然后把主库orcl未传送归档拷贝过去
sys@ORCL>select sequence#, applied from v$archived_log;
SEQUENCE# APPLIED
38 YES
38 YES
39 YES
39 YES
40 YES
40 YES
41 IN-MEMORY
41 YES
70 rows selected.
------从上面可以看到,备库orclps日志正应用到序号41,且还为发送已经同步完毕信号给主库,此时需要去找主库orcl把对应的41号序号之后的归档,传输到备库进行应用。(41 yes表示级联库已经应用完毕)---->此时建议将级联库orclstd转为主库,就不用再在当前备库orclps中考虑如何让注册的归档进行应用,为了实验环境的需要,这里还是将当前备库orclps转为主库。
------>看下面,呵呵,当前备库orclps无法应用日志,但是级联库orclstd可以应用
------手工归档应用 -------复制主库orcl未应用归档到备库
[oracle@zyx ~]$ cd /u01/app/oracle/fast_recovery_area/ORCL/archivelog/2016_04_23/
[oracle@zyx 2016_04_23]$ ll -rth
-rw-r----- 1 oracle oinstall 120K Apr 23 04:02 o1_mf_1_40_cko0vbn5_.arc
-rw-r----- 1 oracle oinstall 228K Apr 23 04:10 o1_mf_1_41_cko195d2_.arc
-rw-r----- 1 oracle oinstall 31K Apr 23 04:11 o1_mf_1_42_cko1cm23_.arc
-rw-r----- 1 oracle oinstall 25M Apr 23 04:12 o1_mf_1_43_cko1dq1c_.arc
[oracle@zyx 2016_04_23]$ scp o1_mf_1_42_cko1cm23_.arc o1_mf_1_43_cko1dq1c_.arc o1_mf_1_42_cko1cm23_.arc 192.168.11.22:/home/oracle/
oracle@192.168.11.22's password:
o1_mf_1_42_cko1cm23_.arc 100% 31KB 31.0KB/s 00:00
o1_mf_1_43_cko1dq1c_.arc 100% 24MB 24.5MB/s 00:01
-------备库orclps上注册复制过来的归档日志
sys@ORCL>select sequence#, applied from v$archived_log;
sys@ORCL>alter database register physical logfile '/home/oracle/o1_mf_1_42_cko1cm23_.arc';
sys@ORCL>alter database register physical logfile '/home/oracle/o1_mf_1_43_cko1dq1c_.arc';
sys@ORCL>select sequence#, applied from v$archived_log;
39 YES
40 YES
40 YES
41 IN-MEMORY
41 YES
42 NO
43 NO
72 rows selected.
------------如果日志未应用
-------应用归档日志
----如果日志未应用,需要更改日志应用方式,必须更改才能应用日志(备用数据库断开)
sys@ORCL>alter database recover managed standby database cancel;
sys@ORCL> alter database recover managed standby database disconnect from session;
sys@ORCL>select sequence#, applied from v$archived_log;
40 YES
41 YES
41 YES
42 NO
43 NO
42 YES
43 YES
74 rows selected.
----发现级联库orclstd日志已经应用完毕,但是当前备库orclps还是没有应用
----级联库orclstd可以查看表shall,但是备库orclps是无法查看shall表的
SQL> select count(*) from shall;
COUNT(*)
----------
100000
3.3 处理当前备库orclps日志无法应用问题
------首先,应该查看alert日志,作为数据库管理员,任何时候都应该时刻关注日志内容
----查看告警日志之前,我再试一试看能否应用日志
sys@ORCL> alter database recover managed standby database cancel;
sys@ORCL>alter database recover managed standby database using current logfile disconnect;
sys@ORCL>select sequence#, applied from v$archived_log;
----发现还是无法应用日志
Mon Apr 25 21:18:16 2016
Completed: alter database recover managed standby database disconnect from session
Mon Apr 25 21:56:02 2016
alter database recover managed standby database cancel
Mon Apr 25 21:56:03 2016
MRP0: Background Media Recovery cancelled with status 16037
Errors in file /u01/app/oracle/diag/rdbms/orclps/orclps/trace/orclps_pr00_45894.trc:
ORA-16037: user requested cancel of managed recovery operation
Recovery interrupted!
Mon Apr 25 21:56:03 2016
------看一下这个trc的内容
[oracle@orcl ~]$ vim /u01/app/oracle/diag/rdbms/orclps/orclps/trace/orclps_pr00_45894.trc
Started Parallel Media Recovery
*** 2016-04-25 21:18:15.833 4329 krsh.c
Managed Standby Recovery not using Real Time Apply
Dumping database incarnation table:
Resetlogs 0 scn and time: 0x0000.000e2006 04/22/2016 22:09:31
Recovery target incarnation = 2, activation ID = 1437653945
Influx buffer limit = 8330 min(50% x 16660, 100000)
Start recovery at thread 1 ckpt scn 1046709 logseq 42 block 58
Initial buffer sizes: read 1024K, overflow 832K, change 805K
--------再看看当前的数据库scn时间戳及ckpt检查点记录的scn时间戳。
--------发现当前数据库的时间戳和恢复点ckpt时间戳不一致,怎么办??
sys@ORCL>select current_scn,checkpoint_change# from v$database;
CURRENT_SCN CHECKPOINT_CHANGE#
----------- ------------------
1046708 1024546
-------利用ORA debug 修改数据库KCSGSCN(bbed 也可以直接修改)
---------修改备库当前检查点时间戳,把时间戳和恢复点时间戳改为一致
------------当前的数据库scn
sys@ORCL>oradebug setmypid
Statement processed.
sys@ORCL>oradebug dumpvar sga kcsgscn_
kcslf kcsgscn_ [06001AE70, 06001AEA0) = 000FF8B4 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6001AB50 00000000
-------------计算一下000FF8B4该值为16进制转10进制得1046708
----先计算一下上面日志中ckpt scn 1046709的16进制值其实就是000FF8B5
sys@ORCL>select to_char(1046709,'xxxxxxxx') from dual;
TO_CHAR(1
---------
ff8b5
----下面修改数据库scn
sys@ORCL>oradebug poke 0x06001AE70 8 0x00000000000ff8b5
BEFORE: [06001AE70, 06001AE78) = 000FF8B4 00000000
AFTER: [06001AE70, 06001AE78) = 000FF8B5 00000000
sys@ORCL>oradebug dumpvar sga kcsgscn_
kcslf kcsgscn_ [06001AE70, 06001AEA0) = 000F2A26 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 6001AB50 00000000
------现在查看一下日志是否应用
sys@ORCL>select sequence#, applied from v$archived_log;
SEQUENCE# APPLIED
---------- ---------
41 YES
41 YES
42 NO
43 NO
42 YES
43 YES
74 rows selected.
------重新开启日志应用在查看
sys@ORCL>alter database recover managed standby database using current logfile disconnect;
--------发现数据库hang住,强制退出,然后shut baort重启在应用日志
sys@ORCL>shut abort
sys@ORCL>startup
--------查看下当前scn
sys@ORCL>select current_scn,checkpoint_change# from v$database;
CURRENT_SCN CHECKPOINT_CHANGE#
----------- ------------------
1046708 1024546
------进行日志应用
sys@ORCL>alter database recover managed standby database using current logfile disconnect;
Database altered.
--------再次查看当前scn
sys@ORCL>select current_scn,checkpoint_change# from v$database;
CURRENT_SCN CHECKPOINT_CHANGE#
----------- ------------------
1047091 1024546
--------查看日志应用情况
sys@ORCL>select sequence#, applied from v$archived_log;
SEQUENCE# APPLIED
---------- ---------
41 YES
41 YES
42 YES
43 YES
42 YES
43 IN-MEMORY
74 rows selected.
-------------哈哈!现在应用日志咯!!!
------查询下之前未应用的数据
sys@ORCL>select count(*) from shall;
COUNT(*)
----------
100000
3.4 手工注册主库redo日志
--------上面可以看到,数据归档部分数据已经完成恢复,但是主库shutdown之前的在线redo数据丢失,下面就找回主库在线redo数据
------复制主库redo到备库
------如果主库能看到current redo,可以只传current redo到备库
sys@ORCL>select group#,sequence#,status from v$log;
[oracle@zyx ~]$ scp /u01/app/oracle/oradata/orcl/redo0*.log 192.168.11.22:/home/oracle
oracle@192.168.11.22's password:
redo01.log 100% 50MB 16.7MB/s 00:03
redo02.log 100% 50MB 50.0MB/s 00:01
redo03.log 100% 50MB 4.6MB/s 00:11
------recover standby
------如果不知道主库current log,就每一个redo都应用一次,如果不对会报错,对了就会出现下面恢复完成的。当然你也可以dump redo查看里面scn,再确定恢复哪一个redo
sys@ORCL>alter database recover managed standby database cancel;
sys@ORCL>recover standby database until cancel;
ORA-00279: change 1047092 generated at 04/23/2016 04:11:57 needed for thread 1
ORA-00289: suggestion :
/u01/app/oracle/fast_recovery_area/ORCLPS/archivelog/2016_04_25/o1_mf_1_44_%u_.arc
ORA-00280: change 1047092 for thread 1 is in sequence #44
Specify log: {<RET>=suggested | filename | AUTO | CANCEL}
/home/oracle/redo02.log
Log applied.
Media recovery complete.
----再次打开数据库,然后查询数据是否完成
sys@ORCL>alter database open;
Database altered.
sys@ORCL>select count(*) from shall;
COUNT(*)
----------
101234
------现在数据已经全部找回!!!下面把该备库强制转为主库
3.5 强制切换备库orclps为主库
------备库orclps切换为主库
sys@ORCL>select name,open_mode,database_role,db_unique_name,primary_db_unique_name from v$database;
NAME OPEN_MODE DATABASE_ROLE DB_UNIQUE_NAME PRIMARY_DB_UNIQUE_NAME
----- ---------- ---------------- --------------- ------------------------------
ORCL READ ONLY PHYSICAL STANDBY orclps orcl
sys@ORCL>alter database activate physical standby database;
sys@ORCL>select name,open_mode,database_role,db_unique_name,primary_db_unique_name from v$database;
NAME OPEN_MODE DATABASE_ROLE DB_UNIQUE_NAME PRIMARY_DB_UNIQUE_NAME
----- ---------- ---------------- --------------- ------------------------------
ORCL MOUNTED PRIMARY orclps orcl
sys@ORCL>alter database open;
---------备库修改scn后转为主库后,原主库切换为备库,无法应用日志,不知道如何处理,估计只能重建备库吧,告警日志如下,望高手解答
Started Parallel Media Recovery
*** 2016-04-25 19:48:59.677 4320 krsh.c
Managed Standby Recovery not using Real Time Apply
Datafile 1 belongs to incarnation with resetlogs SCN : 995548, timestamp: 363f3c82
Datafile 2 belongs to incarnation with resetlogs SCN : 995548, timestamp: 363f3c82
Datafile 3 belongs to incarnation with resetlogs SCN : 995548, timestamp: 363f3c82
Datafile 4 belongs to incarnation with resetlogs SCN : 995548, timestamp: 363f3c82
Datafile 5 belongs to incarnation with resetlogs SCN : 995548, timestamp: 363f3c82
Dumping database incarnation table:
Resetlogs 0 scn and time: 0x0000.0011d146 04/30/2016 05:23:05
Resetlogs 1 scn and time: 0x0000.000f30dc 04/25/2016 17:25:22
Recovery target incarnation = 3, activation ID = 0
Influx buffer limit = 18412 min(50% x 36825, 100000)
Start recovery at thread 1 ckpt scn 1167684 logseq 40 block 2
Initial buffer sizes: read 1024K, overflow 832K, change 805K
*** 2016-04-25 19:48:59.739
Media Recovery add redo thread 1
*** 2016-04-25 19:48:59.812 4320 krsh.c
Media Recovery Waiting for thread 1 sequence 40 branch(resetlogs_id) 910113922
*** 2016-04-25 19:49:21.838
*** 2016-04-25 19:49:21.838 4320 krsh.c
MRP0: Background Media Recovery cancelled with status 16037
ORA-16037: user requested cancel of managed recovery operation
*** 2016-04-25 19:49:21.846
Media Recovery drop redo thread 1
*** 2016-04-25 19:49:21.894
Completed Media Recovery
Managed Recovery: Not Active posted.
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/30130773/viewspace-2119324/,如需转载,请注明出处,否则将追究法律责任。